Application case of machine learning logistic regression algorithm


Time: September 12, 2020
source: https://www.kesci.com/home/project/5bfe39b3954d6e0010681cd1
Note: as a beginner in logistic regression, I followed the original blogger's article and kept this record for later reading; it is also shared here for everyone to learn from.

1. Data

This data is Kaggle's famous public dataset on the Titanic disaster.
Data source address: https://www.kaggle.com/c/titanic
If you are visiting Kaggle for the first time, you need to register an account to download the dataset. After opening the link, click as shown in the figure below to find the data.


You can download the data only when logged in. If you don't have a Kaggle account, it is worth creating one; a site this good is too valuable to pass up. However, it seems a VPN is needed to register, otherwise the human verification cannot be completed (at least that was my experience).
If you really don't have an account and don't want to register, you can go to the original author's GitHub: https://github.com/HanXiaoyang/Kaggle_Titanic

2. Data analysis and processing

Now let's see what the data looks like

import pandas as pd #Data analysis
import numpy as np #Scientific calculation
from pandas import Series,DataFrame
data_train = pd.read_csv("D:/material/titanic/train.csv")
data_train

Out:

This is the typical DataFrame format. We can see that there are 12 columns in total; the Survived field indicates whether the passenger was rescued, and the rest are the passenger's personal information, including:

PassengerId => passenger ID
Pclass => passenger class (1st / 2nd / 3rd)
Name => passenger name
Sex => gender
Age => age
SibSp => number of siblings / spouses aboard
Parch => number of parents / children aboard
Ticket => ticket number
Fare => fare
Cabin => cabin
Embarked => port of embarkation

Next, let the DataFrame tell us some information about itself, as follows:

data_train.info()

Out:

From the results we can see that the training data contains 891 passengers in total, but unfortunately some attributes are incomplete, for example:
The Age attribute is recorded for only 714 passengers
Cabin is known for only 204 passengers
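
A quick way to count the missing values in every column (a small check using the same data_train DataFrame):

#Count missing values per column; Age and Cabin stand out
data_train.isnull().sum()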
We use the following method to obtain some distributions of the numerical data (attributes such as Name are text and others such as Embarked are categorical, so they do not appear in this output):

data_train.describe()

Out:

What further information can we read from the above? The mean row tells us that about 38.38% of passengers were eventually rescued, the mean Pclass of roughly 2.3 suggests there are more 2nd- and 3rd-class passengers than 1st-class ones, the average passenger age is about 29.7 (missing Age records are excluded from this calculation), and so on.
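
These figures can also be checked directly (a minimal sketch using the same data_train):

#Sanity checks on the numbers read from describe()
print(data_train['Survived'].mean())        # fraction of passengers rescued, about 0.3838
print(data_train['Pclass'].value_counts())  # number of passengers per class
print(data_train['Age'].mean())             # mean age over the 714 non-missing records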

After getting a feel for the basic data information, we now dig deeper and analyze the data more clearly through visualization.
The code is as follows:

#Data visualization analysis
%matplotlib inline 
import matplotlib.pyplot as plt

#Specifies the default font
plt.rcParams['font.sans-serif'] = ['SimHei'] 
plt.rcParams['font.family']='sans-serif'
#Solve the problem that the minus sign '-' is displayed as a square
plt.rcParams['axes.unicode_minus'] = False 


fig = plt.figure()
fig.set(alpha=0.2)  # Set chart color alpha parameter

plt.subplot2grid((2,3),(0,0))             # lay out several subplots inside one figure
data_train.Survived.value_counts().plot(kind='bar') # bar chart
plt.title(u"Rescue situation (1 = rescued)") # title
plt.ylabel(u"Number of people")  

plt.subplot2grid((2,3),(0,1))
data_train.Pclass.value_counts().plot(kind="bar")
plt.ylabel(u"Number of people")
plt.title(u"Passenger class distribution")

plt.subplot2grid((2,3),(0,2))
plt.scatter(data_train.Survived, data_train.Age)
plt.ylabel(u"Age")                         # Set ordinate name
plt.grid(True, which='major', axis='y') 
plt.title(u"Age distribution by rescue outcome (1 = rescued)")


plt.subplot2grid((2,3),(1,0), colspan=2)
data_train.Age[data_train.Pclass == 1].plot(kind='kde')   
data_train.Age[data_train.Pclass == 2].plot(kind='kde')
data_train.Age[data_train.Pclass == 3].plot(kind='kde')
plt.xlabel(u"Age")# plots an axis lable
plt.ylabel(u"density") 
plt.title(u"Age distribution of passengers at all levels")
plt.legend((u'First class', u'2 First class',u'3 First class'),loc='best') # sets our legend for our graph.


plt.subplot2grid((2,3),(1,2))
data_train.Embarked.value_counts().plot(kind='bar')
plt.title(u"Number of people on board at each boarding port")
plt.ylabel(u"Number of people")  
plt.show()

Out:

From the figures we can see that a bit more than 300 people were rescued, less than half of the total; 3rd class has by far the most passengers; the ages of those lost and those rescued both span a wide range; the age distributions of the three classes follow a broadly similar trend, with 2nd-class passengers peaking just over 20 years old and 1st-class around 40; and the boarding counts decrease in the order S, C, Q, with S far ahead of the other two ports.
These data already reveal some patterns, and a few hypotheses come to mind. Here are the original author's thoughts:
1) Different cabin/passenger classes may correlate with wealth and status, so the probability of being rescued may differ
2) Age probably also affects the chance of being rescued; after all, as mentioned earlier, the vice-captain ordered "women and children first"
3) Is the port of embarkation related? Perhaps passengers from different ports differ in origin and status?
Empty talk is useless; let's look at the statistical distributions of these attribute values.
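
Before plotting, here is a quick numeric look at these hypotheses (a small sketch, again using data_train; the grouping columns are the original CSV fields):

#Survival rate grouped by class, sex and embarkation port - a rough numeric check of the hypotheses above
print(data_train.groupby('Pclass')['Survived'].mean())
print(data_train.groupby('Sex')['Survived'].mean())
print(data_train.groupby('Embarked')['Survived'].mean())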

Next, let's look at the relationship between survival and the passenger class Pclass
The code is as follows:

#Look at the rescue of each passenger level
fig = plt.figure()
fig.set(alpha=0.2)  # Set chart color alpha parameter

Survived_0 = data_train.Pclass[data_train.Survived == 0].value_counts()
Survived_1 = data_train.Pclass[data_train.Survived == 1].value_counts()
df=pd.DataFrame({u'Not rescued':Survived_0, u'Rescued':Survived_1})
df.plot(kind='bar', stacked=True)
plt.title(u"Rescue situation by passenger class")
plt.xlabel(u"Passenger class") 
plt.ylabel(u"Number of people") 
plt.show()

Out:

Next, let's look at the relationship between survival and Sex
The code is as follows:

#Look at the rescue of each sex
fig = plt.figure()
fig.set(alpha=0.2)  # Set chart color alpha parameter

Survived_m = data_train.Survived[data_train.Sex == 'male'].value_counts()
Survived_f = data_train.Survived[data_train.Sex == 'female'].value_counts()
print(Survived_m)#Here I output these data for my understanding
print(Survived_f)
df=pd.DataFrame({u'Male':Survived_m, u'Female':Survived_f})
df.plot(kind='bar', stacked=True)
plt.title(u"Rescue situation by sex")
plt.xlabel(u"Survived (0 = no, 1 = yes)") 
plt.ylabel(u"Number of people")
plt.show()

Out:

Then let's look at the rescue situation for each gender within each cabin class
The code is as follows:

#Then let's take a look at the rescue of each gender under various cabin levels
fig=plt.figure()
fig.set(alpha=0.65) # Set image transparency, it doesn't matter
plt.title(u"Rescue according to cabin class and gender")

ax1=fig.add_subplot(141)
data_train.Survived[data_train.Sex == 'female'][data_train.Pclass != 3].value_counts().sort_index().plot(kind='bar', label="female highclass", color='#FA2479')
ax1.set_xticks([0,1])
print(ax1)
ax1.set_xticklabels([u"Not rescued", u"be rescued"], rotation=0)
ax1.legend([u"female sex/Advanced cabin"], loc='best')

ax2=fig.add_subplot(142, sharey=ax1)
data_train.Survived[data_train.Sex == 'female'][data_train.Pclass == 3].value_counts().sort_index().plot(kind='bar', label='female, low class', color='pink')
ax2.set_xticklabels([u"Not rescued", u"Rescued"], rotation=0)
plt.legend([u"Female / low class"], loc='best')

ax3=fig.add_subplot(143, sharey=ax1)
data_train.Survived[data_train.Sex == 'male'][data_train.Pclass != 3].value_counts().sort_index().plot(kind='bar', label='male, high class',color='lightblue')
ax3.set_xticklabels([u"Not rescued", u"Rescued"], rotation=0)
plt.legend([u"Male / high class"], loc='best')

ax4=fig.add_subplot(144, sharey=ax1)
data_train.Survived[data_train.Sex == 'male'][data_train.Pclass == 3].value_counts().sort_index().plot(kind='bar', label='male low class', color='steelblue')
ax4.set_xticklabels([u"Not rescued", u"Rescued"], rotation=0)
plt.legend([u"Male / low class"], loc='best')

plt.show()

Out:

Let's take a look at the rescue of each boarding port
The code is as follows:

fig = plt.figure()
fig.set(alpha=0.2)  # Set chart color alpha parameter

Survived_0 = data_train.Embarked[data_train.Survived == 0].value_counts()
Survived_1 = data_train.Embarked[data_train.Survived == 1].value_counts()
df=pd.DataFrame({u'Not rescued':Survived_0, u'Rescued':Survived_1})
df.plot(kind='bar', stacked=True)
plt.title(u"Rescue of passengers at each boarding port")
plt.xlabel(u"Boarding port") 
plt.ylabel(u"Number of people") 

plt.show()

Out:

Let's look at the influence of siblings / spouses and parents / children on whether passengers were rescued
The code is as follows:

#Look at the influence of siblings / spouses (SibSp) and parents / children (Parch) on survival
gg = data_train.groupby(['SibSp','Survived'])
df = pd.DataFrame(gg.count()['PassengerId'])
print(df)

gp = data_train.groupby(['Parch','Survived'])
df = pd.DataFrame(gp.count()['PassengerId'])
print(df)

Out:

Analysis of Ticket and Cabin
Ticket is the ticket number and is essentially unique, so it is unlikely to relate to the outcome; we will leave it out of the features for now
Cabin is recorded for only 204 passengers, so let's look at its distribution first

data_train.Cabin.value_counts()

Out:

The tricky part is that Cabin really ought to be a categorical attribute, yet it has a lot of missing values and the recorded values are very scattered, which is bound to be awkward to handle... My first reaction is that treating it directly as a categorical feature would be far too sparse, and each factorized level would probably end up with almost no weight. On top of that, with so many missing values, why not first use the presence or absence of Cabin as a feature (although this information may not actually be "unregistered"; it may simply have been lost, so this is not necessarily appropriate)? Let's first look at Survived at the coarse granularity of whether Cabin information exists.
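
Before the bar chart, a quick check (same data_train) of how scattered the recorded Cabin values actually are:

#How many Cabin values are recorded, and how many distinct cabin codes they span
print(data_train['Cabin'].notnull().sum())
print(data_train['Cabin'].nunique())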
The code is as follows:

fig = plt.figure()
fig.set(alpha=0.2)  # Set chart color alpha parameter

Survived_cabin = data_train.Survived[pd.notnull(data_train.Cabin)].value_counts()
Survived_nocabin = data_train.Survived[pd.isnull(data_train.Cabin)].value_counts()
df=pd.DataFrame({u'Has Cabin':Survived_cabin, u'No Cabin':Survived_nocabin}).transpose()
df.plot(kind='bar', stacked=True)
plt.title(u"Survival by presence of Cabin information")
plt.xlabel(u"Cabin recorded or not") 
plt.ylabel(u"Number of people")
plt.show()

Out:

Having done visual correlation analysis on this much of the data, let's start the data preprocessing to prepare for logistic regression modeling.

3. Data preprocessing

It's best to quote the original author's own words here, since they put it well:
Let's start with the most conspicuous data attributes. Yes, the missing data in Cabin and Age will have a big impact on the next steps.

Let's talk about Cabin first. For the time being, we process this attribute into just two values, Yes and No, according to whether Cabin has data.

As for Age, there are several common ways to handle missing values:

1) If the samples with missing values account for a very high proportion of the total, we may simply drop the attribute; adding it as a feature might only bring in noise and hurt the final result.
2) If the proportion of missing values is moderate and the attribute is a discrete (e.g. categorical) feature, we can add NaN as a new category of that feature.
3) If the proportion of missing values is moderate and the attribute is a continuous feature, we sometimes define a step size (for Age, say, a bin every 2-3 years), discretize it, and then add NaN as one of the categories, as sketched below.
4) In some cases, we can also try to fit (impute) the missing values from the rest of the data.
Here the latter two approaches both look feasible. We will try fitting and filling in the values first (although we do not have many background attributes to fit from, so it is not necessarily the best choice).
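
As an illustration of option 3) above, here is a hypothetical sketch of discretizing Age and keeping NaN as its own category (the 3-year bin width is only an example; the article itself imputes Age with a random forest below):

#Sketch only: bin Age every 3 years and treat missing values as a separate category
age_binned = pd.cut(data_train['Age'], bins=range(0, 84, 3))
age_binned = age_binned.cat.add_categories(['Missing']).fillna('Missing')
print(age_binned.value_counts().head())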

Here we use the RandomForestRegressor from scikit-learn to fit the missing Age data (note: Random Forest is a machine learning algorithm that draws different samples from the original data, builds multiple decision trees, and averages their outputs to reduce overfitting and improve the result).
The code is as follows:

from sklearn.ensemble import RandomForestRegressor

### Use RandomForestRegressor to fill in the missing Age attribute
def set_missing_ages(df):

    # Take out the existing numerical features and throw them into Random Forest Regressor
    age_df = df[['Age','Fare', 'Parch', 'SibSp', 'Pclass']]

    # Passengers are divided into known age and unknown age
    known_age = age_df[age_df.Age.notnull()].values
    unknown_age = age_df[age_df.Age.isnull()].values

    # y is the target age
    y = known_age[:, 0]

    # X is the characteristic attribute value
    X = known_age[:, 1:]

    # fit into RandomForestRegressor
    rfr = RandomForestRegressor(random_state=0, n_estimators=2000, n_jobs=-1)
    rfr.fit(X, y)

    # The obtained model is used to predict the unknown age results
    predictedAges = rfr.predict(unknown_age[:, 1::])

    # Fill in the original missing data with the obtained prediction results
    df.loc[ (df.Age.isnull()), 'Age' ] = predictedAges 

    return df, rfr

def set_Cabin_type(df):
    df.loc[ (df.Cabin.notnull()), 'Cabin' ] = "Yes"
    df.loc[ (df.Cabin.isnull()), 'Cabin' ] = "No"
    return df

data_train, rfr = set_missing_ages(data_train)
data_train = set_Cabin_type(data_train)
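
A quick optional check that the imputation and the Cabin conversion did what we expect:

#Age should have no missing values left, and Cabin should now only contain "Yes"/"No"
print(data_train['Age'].isnull().sum())
print(data_train['Cabin'].value_counts())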

Then print a few rows of the data to take a look:

data_train.head(10)

Out:

Because logistic regression modeling requires numerical input features, we usually factorize the categorical features first. What is factorization? For instance:

Take Cabin as an example. It is originally one attribute dimension whose value can be ['Yes', 'No']; flattening it out gives two attributes, 'Cabin_Yes' and 'Cabin_No'.

If the original Cabin value is Yes, then 'Cabin_Yes' is 1 and 'Cabin_No' is 0; if the original value is No, then 'Cabin_Yes' is 0 and 'Cabin_No' is 1. We use pandas' get_dummies to do this and concatenate the result onto the original data_train. A toy illustration comes first, followed by the actual transformation.
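
A toy illustration (not part of the pipeline) of what get_dummies produces for a small hand-made Series:

#Toy example of factorization with get_dummies (not part of the pipeline)
toy = pd.Series(['Yes', 'No', 'Yes'], name='Cabin')
print(pd.get_dummies(toy, prefix='Cabin'))
#produces Cabin_No and Cabin_Yes indicator columns (0/1, or True/False in newer pandas)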

#Factorize the categorical features with pandas get_dummies
dummies_Cabin = pd.get_dummies(data_train['Cabin'], prefix= 'Cabin')

dummies_Embarked = pd.get_dummies(data_train['Embarked'], prefix= 'Embarked')

dummies_Sex = pd.get_dummies(data_train['Sex'], prefix= 'Sex')

dummies_Pclass = pd.get_dummies(data_train['Pclass'], prefix= 'Pclass')

df = pd.concat([data_train, dummies_Cabin, dummies_Embarked, dummies_Sex, dummies_Pclass], axis=1)
df.drop(['Pclass', 'Name', 'Sex', 'Ticket', 'Cabin', 'Embarked'], axis=1, inplace=True)
df.head()

Out:

We still have some processing to do. Look closely at the Age and Fare attributes: their numerical ranges differ greatly from the other features! If you understand logistic regression and gradient descent, you will know that when attribute values are on very different scales, convergence slows down dramatically, and the optimization may not even converge!

#Use scikit-learn's preprocessing module to scale these two features; StandardScaler standardizes each feature to zero mean and unit variance.
import sklearn.preprocessing as preprocessing
age_scale_param = preprocessing.StandardScaler().fit(df['Age'].values.reshape(-1, 1))
df['Age_scaled'] = age_scale_param.transform(df['Age'].values.reshape(-1, 1))
fare_scale_param = preprocessing.StandardScaler().fit(df['Fare'].values.reshape(-1, 1))
df['Fare_scaled'] = fare_scale_param.transform(df['Fare'].values.reshape(-1, 1))
df.head()

Out:

Now we have basically processed the training data and can start modeling.

4. Logistic regression modeling

We take out the required feature fields, convert them into numpy format, and use the logistic regression from scikit-learn for modeling.
The code is as follows:

from sklearn import linear_model

# Use a regex filter to select the columns we want
train_df = df.filter(regex='Survived|Age_.*|SibSp|Parch|Fare_.*|Cabin_.*|Embarked_.*|Sex_.*|Pclass_.*')
train_np = train_df.values

# y is column 0: Survival results
y = train_np[:, 0]

# X is column 1 and beyond: characteristic attribute value
X = train_np[:, 1:]

# fit into LogisticRegression
clf = linear_model.LogisticRegression(solver='liblinear',C=1.0, penalty='l1', tol=1e-6)
clf.fit(X, y)

clf

Out:

With that, a logistic regression model has been trained. It seems the key to machine learning really lies in analyzing and processing the data (this is my first time learning machine learning, so just a personal view). We can't predict on the test data yet, though: the test set has to go through exactly the same transformations. That part is simple; just repeat the steps above. A quick look at the learned coefficients comes first (sketch below), followed by the code for processing the test data.
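
Before moving on, here is a quick look at what the model learned (a sketch using the clf and train_df defined above; pairing coefficients with column names this way relies on Survived being the first column of train_df, which matches how y was taken):

#Pair each feature name with its learned coefficient as a quick sanity check on the model
coef_df = pd.DataFrame({
    "feature": list(train_df.columns)[1:],  # skip the Survived column
    "coef": clf.coef_[0],
})
print(coef_df)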

#Now make the same change to the test file
data_test = pd.read_csv("D:/material/titanic/test.csv")
data_test.loc[ (data_test.Fare.isnull()), 'Fare' ] = 0
# Apply the same feature transformations to data_test as we did to data_train
# First, fill in the missing Age values with the same RandomForestRegressor model fitted above
tmp_df = data_test[['Age','Fare', 'Parch', 'SibSp', 'Pclass']]
null_age = tmp_df[data_test.Age.isnull()].values
# Predict Age from the feature attributes X and fill it back in
X = null_age[:, 1:]
predictedAges = rfr.predict(X)
data_test.loc[ (data_test.Age.isnull()), 'Age' ] = predictedAges

data_test = set_Cabin_type(data_test)
dummies_Cabin = pd.get_dummies(data_test['Cabin'], prefix= 'Cabin')
dummies_Embarked = pd.get_dummies(data_test['Embarked'], prefix= 'Embarked')
dummies_Sex = pd.get_dummies(data_test['Sex'], prefix= 'Sex')
dummies_Pclass = pd.get_dummies(data_test['Pclass'], prefix= 'Pclass')


df_test = pd.concat([data_test, dummies_Cabin, dummies_Embarked, dummies_Sex, dummies_Pclass], axis=1)
df_test.drop(['Pclass', 'Name', 'Sex', 'Ticket', 'Cabin', 'Embarked'], axis=1, inplace=True)
# Scale with the scalers fitted on the training data (do not refit on the test set)
df_test['Age_scaled'] = age_scale_param.transform(df_test['Age'].values.reshape(-1,1))
df_test['Fare_scaled'] = fare_scale_param.transform(df_test['Fare'].values.reshape(-1,1))
df_test.head()

Out:

Now we can make predictions on the test data
The code is as follows:

test = df_test.filter(regex='Age_.*|SibSp|Parch|Fare_.*|Cabin_.*|Embarked_.*|Sex_.*|Pclass_.*')
predictions = clf.predict(test)
result = pd.DataFrame({'PassengerId':data_test['PassengerId'].values, 'Survived':predictions.astype(np.int32)})
result.head()

Out:
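
To submit to Kaggle, the predictions would typically be written to a CSV file (a sketch; the output path here is just an example):

#Write the predictions to a CSV in the format Kaggle expects (PassengerId, Survived)
result.to_csv("D:/material/titanic/logistic_regression_predictions.csv", index=False)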

5. Summary

With that, the application of the most basic logistic regression algorithm is complete. As the original author says, this is only the first step of a long march.
The original author also describes some optimizations of this logistic regression system to improve its accuracy. I won't record them here; readers who need them can go directly to the original blog: https://blog.csdn.net/han_xiaoyang/article/details/49797143
What comes later is really good and is recommended reading.
This is my first blog post. Although most of it is copied and pasted, I have learned how to write one. Later I will try to write more articles, both as notes for myself and to share knowledge with everyone.

Tags: Python Machine Learning Data Analysis
