1, Experimental purpose
1. Understand the basic principle of linear regression and master the derivation of its closed-form solution.
2. Be able to implement the fit and predict functions of a linear regression model by hand, using the formula.
3. Be able to predict Boston house prices with both the self-implemented model and sklearn's LinearRegression, and compare the results of the two models.
2, Experimental content
2.1 Implementation of LinearRegression
According to the following formula (the closed-form least-squares solution), the weight vector w can be obtained from the training set:

$$w = (x^T x)^{-1} x^T y$$

In the implementation below, a ridge penalty term $\lambda I$ is added inside the inverse for numerical stability, so the weights are actually computed as $w = (x^T x + \lambda I)^{-1} x^T y$.
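For completeness, this solution follows from setting the gradient of the squared error to zero (a standard derivation, stated here for reference):

$$L(w) = \lVert xw - y \rVert^2, \qquad \nabla_w L = 2x^T(xw - y) = 0 \;\Rightarrow\; w = (x^T x)^{-1} x^T y$$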
Code:

```python
import numpy as np

class LinearRegression_writing():
    def __init__(self, l):
        self.w = None
        self.l = l  # ridge (regularization) penalty factor

    def fit(self, x, y):
        t = x.shape[1]
        # Append a column of ones so the bias is learned as an extra weight
        x = np.insert(x, len(x[0]), 1, axis=1)
        # x^T x + lambda * I
        t_1 = np.dot(np.transpose(x), x) + self.l * np.eye(t + 1)
        # Invert via SVD for numerical stability
        u, s, v = np.linalg.svd(t_1, full_matrices=False)
        inv = np.matmul(v.T * 1 / s, u.T)
        # w = (x^T x + lambda I)^{-1} x^T y
        t_2 = np.dot(inv, np.transpose(x))
        self.w = np.dot(t_2, y)

    def predict(self, test_x):
        # Append the same bias column before applying the learned weights
        test_x = np.insert(test_x, len(test_x[0]), 1, axis=1)
        pred_y = np.dot(test_x, self.w)
        return pred_y
```
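As a quick sanity check (a minimal sketch on synthetic data, not part of the experiment itself), the class should recover known weights:

```python
import numpy as np

# Hypothetical smoke test: y = 2*x0 - 3*x1 + 1 plus small noise
rng = np.random.default_rng(0)
x = rng.normal(size=(100, 2))
y = 2 * x[:, 0] - 3 * x[:, 1] + 1 + rng.normal(scale=0.01, size=100)

model = LinearRegression_writing(0.1)
model.fit(x, y)
print(model.w)  # expected to be close to [2, -3, 1]
```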
2.2 Boston house price forecast
2.2.1 Importing modules
First, we import the necessary libraries:
```python
import numpy as np
import pandas as pd
from sklearn.model_selection import KFold
from sklearn.linear_model import LinearRegression
import matplotlib.pyplot as plt
from sklearn import preprocessing
from sklearn.metrics import mean_squared_error, r2_score
```
2.2.2 Loading the dataset
We read the CSV file directly from GitHub:
```python
dataset = pd.read_csv("https://raw.githubusercontent.com/selva86/datasets/master/BostonHousing.csv", thousands=',')
```
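To confirm the download succeeded (an optional check, not in the original script), the shape and first rows can be inspected:

```python
print(dataset.shape)  # the Boston housing data should have 506 rows and 14 columns
print(dataset.head())
```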
2.2.3 Data preprocessing
1. Plot each feature whose absolute correlation coefficient with the house price (medv) exceeds 0.5

```python
plt.figure(facecolor='gray')
corr = dataset.corr()['medv']
corr[abs(corr) > 0.5].sort_values().plot.bar()
```
2. Keep only the features whose correlation coefficient exceeds 0.5 and drop the rest

```python
dataset = dataset[['lstat', 'ptratio', 'rm', 'medv']]
y = dataset['medv']
x = dataset.drop('medv', axis=1)
```
3. Use KFold cross-validation to split the dataset into training and test sets

Here we set the number of folds to 5 and disable shuffling, preserving the original order of the dataset.

```python
kfolds = KFold(n_splits=5, shuffle=False)
for i, j in kfolds.split(x, y):
    X_train, X_test = x.iloc[i], x.iloc[j]
    y_train, y_test = y.iloc[i], y.iloc[j]
    X_train = np.array(X_train)
    X_test = np.array(X_test)
    y_train = np.array(y_train)
    y_test = np.array(y_test)
# Each iteration overwrites the previous split, so the steps below
# effectively use only the last fold's train/test split.
```
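If one instead wanted to use all five folds rather than only the last split (a sketch of an alternative, not what the script above does), per-fold metrics could be averaged:

```python
mse_per_fold = []
for i, j in kfolds.split(x, y):
    lr = LinearRegression()
    lr.fit(x.iloc[i], y.iloc[i])
    mse_per_fold.append(mean_squared_error(y.iloc[j], lr.predict(x.iloc[j])))
print(np.mean(mse_per_fold))  # average test MSE over the five folds
```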
4. Data normalization

Because the value ranges of the attributes differ greatly, we need to normalize the data. Here we use a very common method, min-max normalization (written out as a formula after the list below): subtract the minimum value and divide by the original value range.
There are at least three reasons for normalization:
- Too large or too small a range of values can cause floating-point overflow or underflow during computation.
- Different value ranges implicitly make some attributes more important to the model than others, an assumption that is often unjustified; this also makes optimization harder and can greatly prolong training.
- Many machine learning techniques and models (such as L1 and L2 regularization terms and the Vector Space Model) assume that all attribute values are roughly zero-mean with similar value ranges.
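Concretely, min-max normalization (what MinMaxScaler computes) maps each feature to the range [0, 1]:

$$x' = \frac{x - x_{\min}}{x_{\max} - x_{\min}}$$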
```python
# Initialize normalizers (separate scalers for the features and the target)
x_scaler = preprocessing.MinMaxScaler()
y_scaler = preprocessing.MinMaxScaler()
# Fit the scalers on the training data only, then apply the same
# transformation to the test data to avoid information leakage
X_train = x_scaler.fit_transform(X_train)
# reshape(-1, 1) converts y into one column; the number of rows is inferred
y_train = y_scaler.fit_transform(y_train.reshape(-1, 1))
X_test = x_scaler.transform(X_test)
y_test = y_scaler.transform(y_test.reshape(-1, 1))
```
2.2.4 Training the models and prediction
Self-implemented algorithm:

```python
model = LinearRegression_writing(0.1)
model.fit(X_train, y_train)
y_test_pred_writing = model.predict(X_test)
```
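Because the targets were min-max scaled, these predictions lie in [0, 1]. To report prices in the original units, the scaling can be inverted (a usage note; y_scaler is the target scaler fitted above):

```python
# Map scaled predictions back to the original price scale
y_test_pred_orig = y_scaler.inverse_transform(y_test_pred_writing.reshape(-1, 1))
```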
linear_model:

```python
lr = LinearRegression()
lr.fit(X_train, y_train)
y_test_pred = lr.predict(X_test)
MSETest = mean_squared_error(y_test, y_test_pred)
```
2.2.5 Evaluating the models
We evaluate both the self-implemented algorithm (me_model) and linear_model using MSE, MAE, and R^2.
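A sketch of how these metrics can be computed (mean_absolute_error was not among the imports above, so it is imported here):

```python
from sklearn.metrics import mean_absolute_error

for name, pred in [("me_model", y_test_pred_writing), ("linear_model", y_test_pred)]:
    print(name,
          mean_squared_error(y_test, pred),
          mean_absolute_error(y_test, pred),
          r2_score(y_test, pred))
```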
| | MSE | MAE | R^2 |
|---|---|---|---|
| me_model | 0.03320945721623632 | 0.14872295006769706 | 0.23180001594027266 |
| linear_model | 0.0333506675540496 | 0.14902746967046934 | 0.2285335434246003 |
3, Experimental summary
During the experiment, we encountered some difficulties in implementing the fit and predict functions, mainly because we are not yet proficient with the many uses of numpy; this is something to strengthen in the future. A regularization penalty factor was added to the implementation. Judging by the MSE, MAE, and R^2 metrics, the self-implemented model performs slightly better than linear_model. Through this experiment we strengthened our understanding of linear regression and its internal principles, and we hope to learn more methods in later experiments.