Principal Component Regression Analysis Practical Tutorial

This article introduces Principal Components Regression and demonstrates its implementation process through examples.

Given p predictors and a response variable, multiple linear regression typically uses least squares to find the coefficients that minimize the residual sum of squares (RSS):

RSS = Σ(y_i − ŷ_i)²

  • Σ: summation over all observations
  • y_i: the actual response value of the i-th observation
  • ŷ_i: the value predicted by the multiple linear regression model
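As a quick illustration of the formula, the RSS of a fitted linear model can be computed directly from its fitted values or its residuals. The sketch below uses the built-in cars dataset, chosen purely for demonstration:

```r
# fit a simple linear model on a built-in dataset (illustrative only)
fit <- lm(dist ~ speed, data = cars)

# RSS: sum of squared differences between observed and fitted values
rss <- sum((cars$dist - fitted(fit))^2)

# the same quantity via the residuals
rss_alt <- sum(residuals(fit)^2)
```

For lm fits, deviance(fit) returns this same quantity.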

However, when the predictors are highly correlated, multicollinearity arises and the coefficient estimates become unreliable, with high variance. One way around this problem is principal component regression (PCR): construct M linear combinations (principal components) of the p predictors, then fit a linear regression model by least squares using those components as the predictors. The following example walks through a principal component regression analysis.
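Conceptually, PCR can be carried out by hand in two steps: compute the principal components of the standardized predictors with prcomp(), then regress the response on the first M score columns with lm(). The sketch below uses the mtcars data introduced later in this article, with M = 2 chosen arbitrarily; the pls package automates all of this:

```r
X <- mtcars[, c("mpg", "disp", "drat", "wt", "qsec")]
y <- mtcars$hp

# step 1: principal components of the standardized predictors
pc <- prcomp(X, center = TRUE, scale. = TRUE)

# step 2: ordinary least squares on the first M component scores
M <- 2                                  # chosen arbitrarily for this sketch
scores <- as.data.frame(pc$x[, 1:M])
pcr_by_hand <- lm(y ~ ., data = scores)
summary(pcr_by_hand)$r.squared
```

Because the components are orthogonal by construction, this two-step fit sidesteps the multicollinearity among the original predictors.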

Load the required package

The easiest way to perform a principal component regression analysis is to use the pls package:

# install the pls package (only needed once)
install.packages("pls")

# load the package
library(pls)

Fit the model

For simplicity, we use the built-in mtcars dataset, which contains data on a range of car models:

head(mtcars)

#                    mpg cyl disp  hp drat    wt  qsec vs am gear carb
# Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
# Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
# Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
# Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
# Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
# Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1

The principal component regression analysis in this example uses hp as the response variable and the following variables as the predictors:

  • mpg (Miles/(US) gallon)
  • disp (Displacement)
  • drat (Rear axle ratio)
  • wt (Weight)
  • qsec (1/4 mile time)
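Before reaching for PCR, it is worth confirming that these predictors really are correlated. A quick look at the pairwise correlation matrix (base R, no extra packages) shows several strong correlations, e.g. between disp and wt:

```r
# pairwise correlations among the five predictors
predictors <- mtcars[, c("mpg", "disp", "drat", "wt", "qsec")]
round(cor(predictors), 2)
```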

The following code fits a model to these data; two of its arguments deserve explanation:

  • scale = TRUE: each predictor is standardized to have a mean of 0 and a standard deviation of 1, so that predictors measured in different units contribute on a comparable scale.

  • validation = "CV": model performance is evaluated with K-fold cross-validation (K is 10 by default). validation = "LOO" can be used instead for leave-one-out cross-validation.
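To make the effect of scale = TRUE concrete: the standardization it applies is the same centering and rescaling performed by base R's scale(). A quick check on one column:

```r
# standardize one predictor by hand: subtract the mean, divide by the sd
wt_std <- as.numeric(scale(mtcars$wt))

# after standardization the column has mean 0 and standard deviation 1
round(mean(wt_std), 10)   # 0
round(sd(wt_std), 10)     # 1
```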

# make this example reproducible (the seed value here is arbitrary)
set.seed(1)

#fit PCR model
model <- pcr(hp~mpg+disp+drat+wt+qsec, data=mtcars, scale=TRUE, validation="CV")

Choose the number of principal components

With the fitted model in hand, we now need to decide how many principal components to keep. The following code prints the test root mean squared error (RMSE) estimated by k-fold cross-validation:

summary(model)

# Data: 	X dimension: 25 5 
# 	Y dimension: 25 1
# Fit method: svdpc
# Number of components considered: 5
# 
# VALIDATION: RMSEP
# Cross-validated using 10 random segments.
#        (Intercept)  1 comps  2 comps  3 comps  4 comps  5 comps
# CV           59.98    26.51    26.53    25.91    27.35    30.49
# adjCV        59.98    26.44    26.25    25.59    26.91    29.82
# TRAINING: % variance explained
#     1 comps  2 comps  3 comps  4 comps  5 comps
# X     73.45    89.53    95.68    99.04   100.00
# hp    81.11    85.56    87.53    87.65    88.25

We interpret the two tables in the output:

  1. VALIDATION: RMSEP

This table gives the test RMSE estimated by k-fold cross-validation, read as follows:

  • Using only the intercept term in the model, the test RMSE is 59.98.
  • Adding the first principal component reduces the test RMSE to 26.51.
  • Adding the second principal component leaves it essentially unchanged (26.53), and the third yields only a marginal improvement (25.91).

Beyond that, adding further principal components makes the RMSE climb again, which suggests that a model with only the first two principal components captures nearly all of the achievable accuracy.

  2. TRAINING: % variance explained

This table shows the percentage of variance explained by the principal components, both in the predictors (row X) and in the response variable hp:

  • Using only the first principal component explains 81.11% of the variance in the response variable.
  • Adding the second principal component raises this to 85.56%.

Using more principal components always explains more variance, but beyond two components the gain is modest. The validationplot() function visualizes the cross-validation error, which makes choosing the number of principal components more intuitive.

# layout: plots 2 and 3 share the first row; plot 1 spans both columns
# of the second row (2 rows and 2 columns in total)
layout(matrix(c(2,3,1,1), 2, 2, byrow = TRUE))

validationplot(model)                  # cross-validated RMSEP
validationplot(model, val.type="R2")
validationplot(model, val.type="MSEP")
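The plots can be complemented by reading the cross-validated errors numerically: RMSEP() returns the same values that summary() prints, and which.min() locates the component count with the lowest CV error. A self-contained sketch (the exact minimum depends on the random fold assignment, so the seed below is arbitrary):

```r
library(pls)
set.seed(1)  # fold assignment is random, so results vary with the seed
model <- pcr(hp ~ mpg + disp + drat + wt + qsec, data = mtcars,
             scale = TRUE, validation = "CV")

# cross-validated RMSEP for 0..5 components (first entry = intercept only)
cv_err <- RMSEP(model, estimate = "CV")$val[1, 1, ]

# number of components with the smallest cross-validated error
# (subtract 1 because the first entry is the intercept-only model)
best_ncomp <- which.min(cv_err) - 1
```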

In the plots above we can see that the model's fit improves sharply when the first two principal components are included and deteriorates as more are added. The optimal model therefore keeps only the first two principal components.

Use the final model to make predictions

We now use the two-component model to predict new observations. The following code splits the original data into a training set and a test set, refits the PCR model on the training set, and predicts on the test set:

# Define training and test sets
train <- mtcars[1:25, c("hp", "mpg", "disp", "drat", "wt", "qsec")]
y_test <- mtcars[26:nrow(mtcars), c("hp")]
test <- mtcars[26:nrow(mtcars), c("mpg", "disp", "drat", "wt", "qsec")]
# fit the PCR model on the training set
model <- pcr(hp~mpg+disp+drat+wt+qsec, data=train, scale=TRUE, validation="CV")
# predict on the test set using the first two principal components
pcr_pred <- predict(model, test, ncomp=2)

# Calculate RMSE
sqrt(mean((pcr_pred - y_test)^2))

# [1] 56.86549

We see that the test RMSE is 56.86549; that is, predictions of hp on the test set deviate from the observed values by about 56.87 on average.
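For context, it can be instructive to compare this against an ordinary least-squares model fit on the same training split. Whether PCR wins on this particular split depends on the data, so treat the comparison as exploratory rather than a verdict:

```r
# same training/test split as above
train  <- mtcars[1:25, c("hp", "mpg", "disp", "drat", "wt", "qsec")]
test   <- mtcars[26:nrow(mtcars), c("mpg", "disp", "drat", "wt", "qsec")]
y_test <- mtcars[26:nrow(mtcars), "hp"]

# ordinary least squares on all five (correlated) predictors
ols <- lm(hp ~ mpg + disp + drat + wt + qsec, data = train)
ols_rmse <- sqrt(mean((predict(ols, test) - y_test)^2))
ols_rmse
```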

Tags: R Language Machine Learning

Posted by quercus on Sun, 15 May 2022 00:24:16 +0300