Machine learning with Python -- univariate linear regression (Andrew Ng's course exercise)

I Import the required libraries

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib
matplotlib.rcParams['font.sans-serif']=['SimHei']  # Use the SimHei font so that Chinese characters in axis labels and chart titles display correctly

II Import the data file

df=pd.read_csv('C:/Users/Administrator/AppData/Local/Temp/Temp2_machine-learning-ex1.zip/machine-learning-ex1/ex1/ex1data1.txt',header=None)  # With header=None, the columns are named 0 and 1
df.rename(columns={0:'city A population size',1:'city A Snack bar profit'},inplace=True)  # Rename the columns
df.head()  # Show the first five rows

(Output: the first five rows of the data.)
Next, compute descriptive (summary) statistics to get a preliminary feel for the data:

df.describe()

(Output: the summary-statistics table.)

III Visualize the data with a scatter plot

plt.scatter(df['city A population size'],df['city A Snack bar profit'])

# Set the chart title and label the axis
plt.title('According to the urban population, the profit of snack bars in the city is predicted')
plt.xlabel('Urban population')
plt.ylabel('Profit of urban snack bar')

plt.show()

(Output: a scatter plot of population vs. profit.)
Now we use the gradient descent algorithm to fit a linear regression by minimizing the cost function.

IV Define the cost function

Formula of the cost function:

J(theta) = (1 / (2m)) * sum_{i=1..m} (h_theta(x^(i)) - y^(i))^2

where h_theta(x) = theta_0 + theta_1 * x is the hypothesis and m is the number of training samples.

Defined in code:

def computeCost(x,y,theta):  # theta is the parameter vector
    inner = np.sum(np.power((x*theta-y),2))
    return inner/(2*len(x))

1. np.power(a,2) raises each element of array a to the given power
2. np.sum(a) sums the elements of array a
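A minimal demonstration of these two functions on a made-up array:

```python
import numpy as np

a = np.array([1, 2, 3])

# np.power squares each element independently
squared = np.power(a, 2)   # array([1, 4, 9])

# np.sum adds all elements of the array
total = np.sum(squared)    # 1 + 4 + 9 = 14
print(squared, total)
```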

Because theta is a 2x1 vector while x originally has only one feature column (a len(x) x 1 vector), the matrix product is not defined. To fix this, add a column of all 1s in front of column 0, so that x becomes a len(x) x 2 matrix. Then x * theta is exactly the hypothesis function.
The code is as follows:

df.insert(0,'ones',1)  # Add a column named 'ones', with every value 1, before column 0 of df
df.head()
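To see exactly what `insert` does, here is a sketch on a small hypothetical frame (two made-up rows standing in for the real data):

```python
import pandas as pd

# Hypothetical two-row frame with the same column names as the real data
toy = pd.DataFrame({'city A population size': [6.1, 5.5],
                    'city A Snack bar profit': [17.6, 9.1]})

# insert(0, ...) places the new column first; the scalar 1 is broadcast to every row
toy.insert(0, 'ones', 1)
print(toy.columns.tolist())  # 'ones' is now the first column
```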

V Define input and output variables and parameters

x=df.loc[:,['ones','city A population size']]
y=df.loc[:,'city A Snack bar profit']
X=np.matrix(x.values)
y=np.matrix(y.values) 
theta=np.matrix([0,0])



Why use np.matrix?
Because for np.matrix, matrix multiplication is simply X*Y, whereas for a plain array the matrix product is np.dot(X,Y). When we defined the cost function above, we wrote x*theta rather than np.dot(x,theta).

Note that x.values returns a plain array, which is why it is wrapped in np.matrix.
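A small sketch (with made-up values) of the difference between the two multiplication styles:

```python
import numpy as np

A = np.array([[1, 2], [3, 4]])
B = np.array([[5], [6]])

# For plain arrays, * is element-wise; the matrix product needs np.dot (or @)
dot_result = np.dot(A, B)             # [[1*5+2*6], [3*5+4*6]] = [[17], [39]]

# For np.matrix, * already means matrix multiplication
Am, Bm = np.matrix(A), np.matrix(B)
mat_result = Am * Bm                  # same (2, 1) product
print(dot_result.ravel(), np.asarray(mat_result).ravel())
```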

VI Check the dimensions of the input, output, and parameter vectors

X.shape      # (m, 2): m samples, 2 columns
y.shape      # (1, m): a row vector -- the wrong orientation for x*theta-y
theta.shape  # (1, 2): also the wrong orientation


So y and theta need to be adjusted:

y=y.T  # Transpose into an m x 1 column vector
theta=theta.T  # Transpose into a 2 x 1 column vector

VII Define the gradient descent algorithm

First, compute the cost with the initial parameters:

computeCost(X,y,theta)  # 32.072733877455676
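A quick hand check of the definition: with theta = 0 the hypothesis is identically zero, so the cost reduces to sum(y^2)/(2m). A minimal sketch on made-up data (not the course dataset):

```python
import numpy as np

def computeCost(x, y, theta):  # same definition as above
    inner = np.sum(np.power(x * theta - y, 2))
    return inner / (2 * len(x))

# Two hypothetical samples; x already includes the ones column
X = np.matrix([[1.0, 2.0], [1.0, 4.0]])
y = np.matrix([[3.0], [5.0]])
theta = np.matrix([[0.0], [0.0]])

# With theta = 0: J = (3^2 + 5^2) / (2*2) = 34 / 4 = 8.5
print(computeCost(X, y, theta))
```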

Define gradient algorithm:

def gradientDescent(x,y,theta,alpha,iters):  # alpha is the learning rate, iters the number of iterations
    temp = np.matrix(np.zeros(theta.shape))  # initialization
    cost = np.zeros(iters)                   # initialization
    for i in range(iters):
        temp = theta - ((alpha/len(x))*(x*theta-y).T *x).T  # vectorized update
        theta = temp  # update all parameters simultaneously
        cost[i] = computeCost(x,y,theta)
    return theta,cost

Note: np.zeros() generates an array with all values 0.
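As a sanity check, a minimal sketch with made-up noiseless data (not the course dataset): on points lying exactly on y = 1 + 2x, gradient descent should drive theta toward [1, 2].

```python
import numpy as np

def computeCost(x, y, theta):
    return np.sum(np.power(x * theta - y, 2)) / (2 * len(x))

def gradientDescent(x, y, theta, alpha, iters):
    cost = np.zeros(iters)
    for i in range(iters):
        theta = theta - ((alpha / len(x)) * (x * theta - y).T * x).T
        cost[i] = computeCost(x, y, theta)
    return theta, cost

# Hypothetical noiseless line y = 1 + 2x, 10 samples
xs = np.arange(0, 5, 0.5)
X = np.matrix(np.column_stack([np.ones_like(xs), xs]))  # ones column + feature
y = np.matrix(1 + 2 * xs).T                             # 10 x 1 column vector
theta0 = np.matrix([[0.0], [0.0]])

theta_fit, cost = gradientDescent(X, y, theta0, alpha=0.1, iters=2000)
print(theta_fit.ravel())  # should approach [1, 2]
```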

VIII Initialize the learning rate and number of iterations

alpha=0.01
iters=1000  # I took 1000 here

IX Call the gradient descent algorithm

finally_theta,cost=gradientDescent(X,y,theta,alpha,iters)


The recorded cost values keep decreasing as the number of iterations grows.
Now evaluate the cost function of the trained model with the fitted parameters:

computeCost(X,y,finally_theta) #4.515955503078912

X Visualize the fitted linear model

x=np.linspace(df['city A population size'].min(),df['city A population size'].max(),100)  # 100 evenly spaced numbers between the minimum and maximum of the population column
f=finally_theta[0,0]+finally_theta[1,0]*x  # the fitted line: intercept + slope * x
fig,ax=plt.subplots(figsize=(8,4))  # subplots() returns a figure and an axes object
ax.plot(x,f,'r',label='forecast')
ax.scatter(df['city A population size'],df['city A Snack bar profit'],label='Training set')
ax.legend(loc=2)   # Show a legend in the upper-left corner (loc=2) labeling each series
ax.set_xlabel('city A population size')
ax.set_ylabel('city A Snack bar profit')
ax.set_title('Predicting city A snack bar profit from city A population')
plt.show()

(Output: the training-set scatter with the fitted regression line in red.)

Supplement:
1. np.linspace() returns evenly spaced numbers over a specified interval
2. Note the difference between the two indexing styles:
3. plt.subplots() creates a figure with subplots
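A minimal sketch of point 1 above, with made-up endpoints:

```python
import numpy as np

# 5 evenly spaced numbers from 0 to 1, endpoints included
pts = np.linspace(0, 1, 5)
print(pts)  # [0.   0.25 0.5  0.75 1.  ]
```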

XI Plot the cost-function curve

x=np.arange(iters)  # one point per iteration
fig,ax=plt.subplots(figsize=(8,4))
ax.plot(x,cost,'r')
ax.set_xlabel('Number of iterations')
ax.set_ylabel('Cost function value')
ax.set_title('Gradient descent')
plt.show()

(Output: a monotonically decreasing cost curve.)
The squared-error cost function is quadratic, and therefore convex, in the parameters, so a steadily decreasing cost is exactly what we expect from gradient descent.

Tags: Python Machine Learning

Posted by TreColl on Mon, 09 May 2022 23:22:18 +0300