Linear regression prediction of Boston house prices

1, Experimental purpose

1. Understand the basic principles of linear regression and master the derivation of the basic formula.

2. Be able to manually implement the fit and predict functions of LinearRegression using the formula.

3. Be able to use the self-implemented LinearRegression and sklearn's LinearRegression to predict Boston house prices, and compare the results of the two models.

2, Experimental content

2.1 implementation of LinearRegression

According to the following formula, the weight vector w can be obtained from the training set:
$$w = (x^Tx)^{-1}x^Ty$$
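Note that the fit implementation below also adds a ridge penalty term (the constructor argument l), so the weights it actually computes are $w=(x^Tx+\lambda I)^{-1}x^Ty$; setting l to 0 recovers the plain least-squares formula above.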
code:

import numpy as np

class LinearRegression_writing():
    def __init__(self, l):
        self.w = None
        self.l = l          # ridge regularization factor (lambda)

    def fit(self, x, y):
        t = x.shape[1]
        # Append a column of ones so the last weight acts as the bias term
        x = np.insert(x, x.shape[1], 1, axis=1)
        # x^T x plus the ridge penalty l*I
        t_1 = np.dot(np.transpose(x), x) + self.l * np.eye(t + 1)
        # Invert via SVD for numerical stability: inv = V diag(1/s) U^T
        u, s, v = np.linalg.svd(t_1, full_matrices=False)
        inv = np.matmul(v.T * 1 / s, u.T)
        # w = (x^T x + l*I)^{-1} x^T y
        t_2 = np.dot(inv, np.transpose(x))
        self.w = np.dot(t_2, y)

    def predict(self, test_x):
        # Append the same bias column before applying the learned weights
        test_x = np.insert(test_x, test_x.shape[1], 1, axis=1)
        pred_y = np.dot(test_x, self.w)
        return pred_y
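
As a quick sanity check (a hypothetical example, not part of the original experiment), the class can be fitted on a small synthetic dataset whose true weights are known:

# Hypothetical sanity check: recover y = 2*x0 - 3*x1 + 5 from noisy samples
rng = np.random.RandomState(0)
X_demo = rng.rand(100, 2)
y_demo = 2 * X_demo[:, 0] - 3 * X_demo[:, 1] + 5 + 0.01 * rng.randn(100)
demo = LinearRegression_writing(0.0)   # no ridge penalty for this check
demo.fit(X_demo, y_demo)
print(demo.w)                          # expected to be close to [2, -3, 5]; the last entry is the bias
print(demo.predict(X_demo[:3]))        # predictions for the first three samples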

2.2 Boston house price forecast

2.2.1 import module

First, we import the necessary libraries:

import numpy as np
import pandas as pd
from sklearn.model_selection import KFold
from sklearn.linear_model import LinearRegression
import matplotlib.pyplot as plt
from sklearn import preprocessing
from sklearn.metrics import mean_squared_error, r2_score

2.2.2 loading the dataset

We read the CSV file directly from GitHub:

dataset = pd.read_csv("https://raw.githubusercontent.com/selva86/datasets/master/BostonHousing.csv", thousands=',')
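
A quick look at the loaded data can be helpful (an optional addition, not part of the original write-up):

# Optional: inspect the loaded dataset
print(dataset.shape)    # number of rows and columns
print(dataset.head())   # first few rows, including the target column medv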

2.2.3 data preprocessing

1. Plot each feature whose absolute correlation coefficient with the house price (medv) is greater than 0.5

# Correlation of every feature with the target medv
corr = dataset.corr()['medv']
plt.figure(facecolor='gray')
# Bar plot of the features whose |correlation| with medv exceeds 0.5
corr[abs(corr) > 0.5].sort_values().plot.bar()
plt.show()
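
For reference, the correlations that pass the threshold can also be printed directly (a small optional addition):

# Print the features whose |correlation| with medv exceeds 0.5
print(corr[abs(corr) > 0.5].sort_values())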

2. Keep only the features whose absolute correlation with medv exceeds 0.5 and drop the rest

dataset = dataset[['lstat','ptratio','rm','medv']]
y = dataset['medv']
x = dataset.drop('medv', axis=1)

3. Use KFold to split the data into training and test sets

Here we use 5 folds with shuffle=False, so the original ordering of the dataset is preserved. Note that the loop below overwrites the split variables on every iteration, so only the last fold's train/test split is kept (see the illustration after the code).

kfolds = KFold(n_splits=5, shuffle=False)
# Each iteration yields one (train_index, test_index) pair;
# after the loop only the last fold's split remains.
for train_index, test_index in kfolds.split(x, y):
    X_train, X_test = x.iloc[train_index], x.iloc[test_index]
    y_train, y_test = y.iloc[train_index], y.iloc[test_index]
X_train = np.array(X_train)
X_test = np.array(X_test)
y_train = np.array(y_train)
y_test = np.array(y_test)
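
To make the fold structure concrete, the following optional sketch just prints the size of each of the 5 splits:

# Illustration only: print how many rows land in each of the 5 folds
for fold, (train_index, test_index) in enumerate(kfolds.split(x, y)):
    print(f"fold {fold}: {len(train_index)} training rows, {len(test_index)} test rows")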

4. Data normalization

Because the value ranges of the different attributes differ greatly, we need to normalize the data. Here we use a very common method, min-max normalization: subtract the minimum value and divide by the value range, i.e. x' = (x - min) / (max - min), which is what sklearn's MinMaxScaler does.

There are at least three reasons for normalization:

  • A value range that is too large or too small can cause floating-point overflow or underflow during computation.
  • Different value ranges give different attributes different implicit importance to the model, and this implicit assumption is often unreasonable; it also makes optimization harder and can greatly prolong training.
  • Many machine learning techniques and models (such as L1 and L2 regularization and the Vector Space Model) assume that all attributes have roughly zero mean and similar value ranges.
# Initialize the normalizer
min_max_scaler = preprocessing.MinMaxScaler()
# Min-max normalize the features and target values of the training and test data
# (note: a separate scaler is fitted on each array here; see the sketch below for
# the more conventional fit-on-train / transform-on-test approach)
X_train = min_max_scaler.fit_transform(X_train)
y_train = min_max_scaler.fit_transform(y_train.reshape(-1, 1))  # reshape(-1,1) converts it into a single column, rows inferred automatically
X_test = min_max_scaler.fit_transform(X_test)
y_test = min_max_scaler.fit_transform(y_test.reshape(-1, 1))
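
For reference, a more conventional variant fits the scaler on the training data only and reuses the same statistics on the test data. A minimal sketch, assuming X_train_raw/X_test_raw and y_train_raw/y_test_raw hold the un-normalized arrays from step 3 (hypothetical names, since the code above overwrites the originals):

# Sketch: fit on the training split only, then reuse the fitted scalers on the test split
scaler_x = preprocessing.MinMaxScaler().fit(X_train_raw)
scaler_y = preprocessing.MinMaxScaler().fit(y_train_raw.reshape(-1, 1))
X_test_scaled = scaler_x.transform(X_test_raw)               # test features reuse training statistics
y_test_scaled = scaler_y.transform(y_test_raw.reshape(-1, 1))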

2.2.4 model training and prediction

Self-implemented algorithm:

model = LinearRegression_writing(0.1)   # 0.1 is the ridge regularization factor
model.fit(X_train, y_train)
y_test_pred_writing = model.predict(X_test)

linear_model:

lr = LinearRegression()
lr.fit(X_train, y_train)
y_test_pred = lr.predict(X_test)
MSETest = mean_squared_error(y_test, y_test_pred)

2.2.5 model evaluation

The MSE, MAE and R^2 of the self-implemented algorithm (me_model) and sklearn's linear_model on the test set are as follows:

| Model | MSE | MAE | R^2 |
| --- | --- | --- | --- |
| me_model | 0.03320945721623632 | 0.14872295006769706 | 0.23180001594027266 |
| linear_model | 0.0333506675540496 | 0.14902746967046934 | 0.2285335434246003 |
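
For reference, a minimal sketch of how these metrics could be computed with sklearn.metrics (mean_absolute_error is an extra import not listed in 2.2.1; y_test_pred_writing and y_test_pred are the two models' test-set predictions from 2.2.4):

from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score
for name, pred in [("me_model", y_test_pred_writing), ("linear_model", y_test_pred)]:
    print(name, mean_squared_error(y_test, pred), mean_absolute_error(y_test, pred), r2_score(y_test, pred))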

3, Experimental summary

During the experiment, some difficulties were encountered while implementing the fit and predict functions, mainly because we are not yet proficient with the various uses of numpy; this is something to strengthen in the future. A regularization penalty factor was added in the implementation. Judging from the MSE, MAE and R^2 evaluation metrics, the self-implemented model performs slightly better than linear_model. Through this experiment we have deepened our understanding of linear regression and its internal principles, and we hope to learn more methods in later experiments.
