This article introduces principal components regression (PCR) and demonstrates how to implement it through a worked example.
Given p predictors and a response variable, multiple linear regression uses a method such as least squares to find the coefficients that minimize the residual sum of squares (RSS):
RSS = Σ(y_i − ŷ_i)²

where:

- Σ: the summation symbol
- y_i: the actual response value of the ith observation
- ŷ_i: the value of the ith observation predicted by the multiple linear regression model
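To make the definition concrete, RSS can be computed directly from a fitted linear model. The sketch below uses R's built-in lm() on the mtcars data that appears later in this article (the names fit and rss are ours, chosen for illustration):

```r
# fit an ordinary multiple linear regression (illustration only)
fit <- lm(hp ~ mpg + disp + drat + wt + qsec, data = mtcars)

# RSS: sum of squared differences between observed and fitted values
rss <- sum((mtcars$hp - fitted(fit))^2)

# equivalently, the sum of the squared residuals
all.equal(rss, sum(resid(fit)^2))
# [1] TRUE
```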
However, when the predictors are highly correlated, multicollinearity arises, and the estimated model coefficients become unreliable, with high variance. One way to avoid this problem is principal components regression: construct M linear combinations (principal components) of the p predictors, and then fit a linear regression model by least squares using those principal components as the predictors. The following example walks through a principal components regression analysis.
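A quick way to see the multicollinearity problem in practice is to inspect the pairwise correlations among the candidate predictors. The sketch below uses the same mtcars predictors as the example that follows:

```r
# pairwise correlations among the predictors used later in this article;
# several pairs (e.g. disp and wt) are strongly correlated
round(cor(mtcars[, c("mpg", "disp", "drat", "wt", "qsec")]), 2)
```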
Load the necessary package
The easiest way to perform a principal components regression analysis is with the pls package:
# install the package
install.packages("pls")

# load the package
library(pls)
Fit the model
For simplicity and convenience, we use the built-in mtcars dataset, which contains data on cars of different makes:
head(mtcars)
#                    mpg cyl disp  hp drat    wt  qsec vs am gear carb
# Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
# Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
# Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
# Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
# Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
# Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1
The principal component regression analysis in this example uses hp as the response variable and the following variables as the predictors:
- mpg (Miles/(US) gallon)
- disp (Displacement)
- drat (Rear axle ratio)
- wt (Weight)
- qsec (1/4 mile time)
The following code fits a model using the above data, with two parameters explained as follows:

scale = TRUE: each predictor is standardized to have a mean of 0 and a standard deviation of 1, which removes the effect of the predictors being measured in different units.

validation = "CV": evaluate model performance with k-fold cross-validation (the default k is 10). Alternatively, validation = "LOO" performs leave-one-out cross-validation (LOOCV).
# make this example reproducible
set.seed(1)

# fit the PCR model
model <- pcr(hp~mpg+disp+drat+wt+qsec, data=mtcars, scale=TRUE, validation="CV")
Choose the number of principal components
Based on the fitted model, we now need to decide how many principal components to keep. The following output shows the test root mean squared error (RMSE) computed by k-fold cross-validation:
summary(model)
# Data:    X dimension: 25 5
#          Y dimension: 25 1
# Fit method: svdpc
# Number of components considered: 5
#
# VALIDATION: RMSEP
# Cross-validated using 10 random segments.
#        (Intercept)  1 comps  2 comps  3 comps  4 comps  5 comps
# CV           59.98    26.51    26.53    25.91    27.35    30.49
# adjCV        59.98    26.44    26.25    25.59    26.91    29.82
#
# TRAINING: % variance explained
#     1 comps  2 comps  3 comps  4 comps  5 comps
# X     73.45    89.53    95.68    99.04   100.00
# hp    81.11    85.56    87.53    87.65    88.25
We interpret the two tables in the output:
- VALIDATION: RMSEP
This table gives the test RMSE from k-fold cross-validation, read as follows:

- Using only the intercept term in the model, the test RMSE is 59.98.
- Adding the first principal component drops the test RMSE to 26.51.
- Adding the second principal component leaves it nearly unchanged (26.53); the minimum, 25.91, occurs with three components.

Beyond three principal components the test RMSE rises again. Since the second and third components improve the cross-validated RMSE only marginally over the first, a model with just two principal components is a reasonable, parsimonious choice.
- TRAINING: % variance explained
This table reports the percentage of variance explained, both for the predictors (row X) and for the response variable hp:

- Using only the first principal component explains 73.45% of the variance in the predictors and 81.11% of the variance in the response.
- Adding the second principal component raises these to 89.53% and 85.56%, respectively.
Using more principal components always explains more variance, but we can see that going beyond two components adds relatively little. The validationplot() function plots the cross-validation results, making the choice of the number of principal components more intuitive.
# plots 2 and 3 share the first row; plot 1 spans both columns of the
# second row (a 2 x 2 layout in total)
layout(matrix(c(2,3,1,1), 2, 2, byrow = TRUE))
validationplot(model, val.type="R2")
validationplot(model, val.type="MSEP")
validationplot(model)
In the figure above, the fit improves sharply with the first principal component, changes little from two to three components, and worsens beyond that. We therefore keep only the first two principal components in the final model.
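The choice can also be made programmatically rather than read off a plot. Below is a sketch that uses the pls package's RMSEP() helper to extract the cross-validated errors (the names cv, rmsep_vals, and best_ncomp are ours):

```r
library(pls)

# refit the article's model (same call as above)
set.seed(1)
model <- pcr(hp ~ mpg + disp + drat + wt + qsec, data = mtcars,
             scale = TRUE, validation = "CV")

# cross-validated RMSEP for 0 to 5 components; the first entry is the
# intercept-only model, so drop it before searching for the minimum
cv <- RMSEP(model, estimate = "CV")
rmsep_vals <- drop(cv$val)
best_ncomp <- which.min(rmsep_vals[-1])
best_ncomp
```

Because cross-validation segments are random, the component count with the lowest RMSEP can vary slightly from run to run unless the seed is fixed.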
Use the final model to make predictions
Finally, we use the two-component model to predict new observations. The following code splits the original data into a training set and a test set, then uses a PCR model fitted on the training set to predict hp on the test set:
# define the training and test sets
train <- mtcars[1:25, c("hp", "mpg", "disp", "drat", "wt", "qsec")]
y_test <- mtcars[26:nrow(mtcars), c("hp")]
test <- mtcars[26:nrow(mtcars), c("mpg", "disp", "drat", "wt", "qsec")]

# fit the model on the training set and predict on the test set
model <- pcr(hp~mpg+disp+drat+wt+qsec, data=train, scale=TRUE, validation="CV")
pcr_pred <- predict(model, test, ncomp=2)

# calculate the test RMSE
sqrt(mean((pcr_pred - y_test)^2))
# [1] 56.86549
We see that the test RMSE is 56.86549, which is the average deviation between the predicted and observed values of hp on the test set.
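As a sanity check, it can be instructive to compare this with an ordinary least-squares fit on the same split. The sketch below (the ols and ols_rmse names are ours) computes the OLS test RMSE in exactly the same way:

```r
# same training/test split as above
train <- mtcars[1:25, c("hp", "mpg", "disp", "drat", "wt", "qsec")]
test <- mtcars[26:nrow(mtcars), ]

# ordinary least-squares fit on the training set
ols <- lm(hp ~ mpg + disp + drat + wt + qsec, data = train)
ols_pred <- predict(ols, newdata = test)

# OLS test RMSE, computed exactly as for the PCR model
ols_rmse <- sqrt(mean((ols_pred - test$hp)^2))
ols_rmse
```

If PCR's test RMSE is lower than OLS's here, that suggests reducing the correlated predictors to two components genuinely helped on this split; if not, the multicollinearity may be less harmful for prediction on this particular dataset.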