Kaggle Titanic Survival Prediction Challenge - Model Building, Model Tuning, Fusion

Kaggle Titanic Survival Prediction Challenge

This is Kaggle's Getting Started prediction competition, a relatively simple, entry-level contest for beginners. My best score was, as far as I remember, in the top 8%. Taking advantage of this Datawhale learning activity, I applied its data analysis material to this competition, and I am reviewing and consolidating the competition here. I split the write-up into three parts; this one covers model building, parameter tuning, and fusion.

Prerequisites

  • numpy
  • pandas
  • matplotlib
  • seaborn
  • sklearn

Competition address: Titanic: Machine Learning from Disaster

Sinking of the Titanic

On April 15, 1912, during her maiden voyage, the RMS Titanic, widely considered "unsinkable," sank after colliding with an iceberg.

Unfortunately, there were not enough lifeboats on board for everyone, resulting in the deaths of 1,502 of the 2,224 passengers and crew. While there was some element of luck in surviving, it seems that some groups of people were more likely to survive than others.

In this challenge, we ask you to build a predictive model that answers the question: "What kinds of people were more likely to survive?" using passenger data (i.e., name, age, gender, socioeconomic class, etc.).

Task analysis: this is a binary classification task. The goal is to build a model that predicts whether each passenger survived.

Model establishment, model parameter adjustment, fusion

Split the combined data back into training and test sets

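import os
import numpy as np
import pandas as pd

train_data = data_all[data_all['train'] == 1].copy()
test_x = data_all[data_all['train'] == 0].copy()
# Labels (named y because every later call refers to y)
y = train_data['Survived'].values
train_data.drop(['train', 'Survived'], axis=1, inplace=True)
test_x.drop(['train', 'Survived'], axis=1, inplace=True)
os.makedirs('./result', exist_ok=True)  # np.save does not create the directory
np.save('./result/label', y)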

Split the training data into a training part and a hold-out part at a ratio of 8:2

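from sklearn.model_selection import train_test_split

seed = 2022  # assumed value; the original does not show where seed is defined
features = train_data.values  # feature matrix used by all models below
test_data = test_x.values     # Kaggle test features passed to Titanicmodel below

X_train, X_test, Y_train, Y_test = train_test_split(features, y, test_size=0.2, random_state=seed)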
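
To make the 8:2 split concrete, here is a minimal sketch on synthetic data. The array sizes, the `seed` value, and the `stratify=y` argument are my assumptions for the demo and are not in the code above; `stratify` simply keeps the class ratio similar in both parts.

```python
import numpy as np
from sklearn.model_selection import train_test_split

seed = 2022  # assumed value, only for reproducibility of the demo
rng = np.random.default_rng(seed)

# Synthetic stand-ins for the real feature matrix and labels (assumption)
features = rng.normal(size=(891, 10))
y = rng.integers(0, 2, size=891)

# 80% for training, 20% held out; stratify keeps the label ratio comparable
X_train, X_test, Y_train, Y_test = train_test_split(
    features, y, test_size=0.2, random_state=seed, stratify=y
)

print(X_train.shape, X_test.shape)
print(Y_train.mean(), Y_test.mean())  # positive-class ratios of both parts
```

Note that the hold-out part is only useful for a quick sanity check; the models below are scored with 10-fold cross-validation on the full training set.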

Model building

Wrap cross-validated training and prediction in a custom function
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import accuracy_score
from tqdm import tqdm

def Titanicmodel(clf, features, test_data, y, model_name):
    if model_name == 'LinearSVC':
        num_classes = 1  # decision_function returns a single score per sample
    else:
        num_classes = 2  # predict_proba returns one column per class
    num_fold = 10  # 10-fold cross-validation
    fold_len = features.shape[0] // num_fold  # number of samples in each fold
    skf_indices = []
    skf = StratifiedKFold(n_splits=num_fold, shuffle=True, random_state=seed)  # split the training set into 10 stratified folds
    for i, (train_idx, valid_idx) in enumerate(skf.split(np.ones(features.shape[0]), y)):
        skf_indices.extend(valid_idx.tolist())

    train_pred = np.zeros((features.shape[0], num_classes))  # out-of-fold predictions on the training set, shape (train_samples, classes)
    test_pred = np.zeros((test_data.shape[0], num_classes))  # averaged predictions on the test set, shape (test_samples, classes)


    for fold in tqdm(range(num_fold)):
        fold_start = fold * fold_len
        fold_end = (fold + 1) * fold_len
        if fold == num_fold - 1:
            fold_end = features.shape[0]  # the last fold also takes the remainder
        # Indices of the training part of this fold
        train_indices = skf_indices[:fold_start] + skf_indices[fold_end:]
        # Indices of the validation part of this fold
        test_indices = skf_indices[fold_start:fold_end]

        # Training data for this fold
        train_x = features[train_indices]
        train_y = y[train_indices]
        # Validation data for this fold
        cv_test_x = features[test_indices]

        clf.fit(train_x, train_y)  # train on the remaining folds

        if model_name == 'LinearSVC':
            # LinearSVC has no predict_proba, so raw decision scores are stored (no softmax is applied)
            pred = clf.decision_function(cv_test_x)  # score the validation part
            train_pred[test_indices] = pred.reshape(len(pred), 1)  # fill the validation rows; after the loop every training row has an out-of-fold score
            pred = clf.decision_function(test_data)  # score the test set with the model trained on this fold
            test_pred += pred.reshape(len(pred), 1) / num_fold  # average the test scores of the 10 fold models
        else:
            pred = clf.predict_proba(cv_test_x)  # class probabilities on the validation part
            train_pred[test_indices] = pred  # fill the validation rows; after the loop every training row has an out-of-fold prediction
            pred = clf.predict_proba(test_data)  # class probabilities on the test set from the model trained on this fold
            test_pred += pred / num_fold  # average the test probabilities of the 10 fold models

    if model_name == 'LinearSVC':
        # A positive decision score means class 1
        y_pred = (train_pred > 0).astype(np.int32).reshape(len(train_pred))
        pre = (test_pred > 0).astype(np.int32).reshape(len(test_pred))
    else:
        # Take the most probable class in each row
        y_pred = np.argmax(train_pred, axis=1)
        pre = np.argmax(test_pred, axis=1)
    score = accuracy_score(y, y_pred)  # out-of-fold accuracy against the true training labels
    print('accuracy_score:', score)
    print('accuracy_score:',score)
    # Save this model's predictions on the training set and test set (used later for fusion)
    np.save('./result/{0}'.format(model_name)+'train',train_pred)
    np.save('./result/{0}'.format(model_name)+'test',test_pred)
    
    submit = pd.DataFrame({'PassengerId':np.array(range(892,1310)),'Survived':pre.astype(np.int32)})
    submit.to_csv('{0}_submit.csv'.format(model_name),index=False)
    return clf,score
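
The core idea of the function above is the out-of-fold prediction: every training row is predicted by a model that never saw it. As a minimal sketch of the same idea, scikit-learn's `cross_val_predict` does this in one call. The synthetic data, the `seed` value, and the plain `LogisticRegression` are my assumptions for the demo; unlike `Titanicmodel`, this call does not average predictions on a separate test set.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import StratifiedKFold, cross_val_predict

seed = 2022  # assumed value for the demo
# Synthetic binary classification data standing in for the Titanic features
X, y = make_classification(n_samples=300, n_features=8, random_state=seed)

skf = StratifiedKFold(n_splits=10, shuffle=True, random_state=seed)
clf = LogisticRegression(max_iter=1000)

# Each row's probabilities come from the fold model that did not train on that row
oof_proba = cross_val_predict(clf, X, y, cv=skf, method='predict_proba')
oof_label = np.argmax(oof_proba, axis=1)  # most probable class per row

print('out-of-fold accuracy:', accuracy_score(y, oof_label))
```

The `oof_proba` array plays the same role as `train_pred` in `Titanicmodel`: it can be saved and used as input features for a second-level model.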

Logistic Regression (LR)

from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline, make_pipeline

pipe = Pipeline([('select', PCA(n_components=0.95)),
                 ('classify', LogisticRegression(random_state=seed, solver='liblinear'))])
param = {
    'classify__penalty': ['l1', 'l2'],
    'classify__C': [0.001, 0.01, 0.1, 1, 5, 7, 8, 9, 10]}
LR_grid = GridSearchCV(estimator=pipe, param_grid=param, scoring='roc_auc', cv=5)
LR_grid.fit(features, y)
print(LR_grid.best_params_, LR_grid.best_score_)
# Rebuild the pipeline with the best parameters and run the 10-fold routine
C = LR_grid.best_params_['classify__C']
penalty = LR_grid.best_params_['classify__penalty']
LR_classify = LogisticRegression(C=C, penalty=penalty, random_state=seed, solver='liblinear')
LR_select = PCA(n_components=0.95)
LR_pipeline = make_pipeline(LR_select, LR_classify)
lr_model, lr_score = Titanicmodel(LR_pipeline, features, test_data, y, 'LR')
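
The parameter names in the grid follow the pattern `<step name>__<parameter>`, which is how `GridSearchCV` reaches inside a `Pipeline`. Here is a minimal sketch on synthetic data (the data, the shortened grid, and the `seed` value are assumptions for the demo). It also shows that `best_estimator_` is already a pipeline refit with the best parameters, which is an alternative to rebuilding it by hand as I did above.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

seed = 2022  # assumed value for the demo
X, y = make_classification(n_samples=200, n_features=10, random_state=seed)

pipe = Pipeline([
    ('select', PCA(n_components=0.95)),
    ('classify', LogisticRegression(solver='liblinear', random_state=seed)),
])
# 'classify__C' means: parameter C of the step named 'classify'
param = {'classify__C': [0.1, 1, 10], 'classify__penalty': ['l1', 'l2']}

grid = GridSearchCV(pipe, param_grid=param, scoring='roc_auc', cv=5)
grid.fit(X, y)

best_pipe = grid.best_estimator_  # refit on all of X, y with the best parameters
print(grid.best_params_)
print(best_pipe.named_steps['classify'].C)
```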

Support Vector Machine (SVM)

from sklearn.feature_selection import SelectKBest
from sklearn.svm import LinearSVC

# dual=False is needed because LinearSVC only supports penalty='l1' in the primal form
pipe = Pipeline([('select', SelectKBest(k=20)),
                 ('classify', LinearSVC(random_state=seed, dual=False))])
param = {
    'select__k': list(range(20, 40, 2)),
    'classify__penalty': ['l1', 'l2'],
    'classify__C': [0.001, 0.01, 0.1, 1, 5, 7, 8, 9, 10, 50, 100]}
SVC_grid = GridSearchCV(estimator=pipe, param_grid=param, cv=5, scoring='roc_auc')
SVC_grid.fit(features, y)
print(SVC_grid.best_params_, SVC_grid.best_score_)
C = SVC_grid.best_params_['classify__C']
k = SVC_grid.best_params_['select__k']
penalty = SVC_grid.best_params_['classify__penalty']
SVC_classify = LinearSVC(C=C, penalty=penalty, dual=False, random_state=seed)
# Note: the search tunes SelectKBest, but the final pipeline uses PCA, so k is not used here
SVC_select = PCA(n_components=0.95)
SVC_pipeline = make_pipeline(SVC_select, SVC_classify)
SVC_model, LinearSVC_score = Titanicmodel(SVC_pipeline, features, test_data, y, 'LinearSVC')
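
`LinearSVC` has no `predict_proba`, which is why `Titanicmodel` treats it as a special case and thresholds the decision score at 0. The sketch below, on synthetic data with an assumed `seed`, checks that this thresholding gives the same labels as `predict` for a binary problem. It also uses `dual=False`, which scikit-learn requires when `penalty='l1'`.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.svm import LinearSVC

seed = 2022  # assumed value for the demo
X, y = make_classification(n_samples=200, n_features=10, random_state=seed)

# penalty='l1' is only supported with dual=False
svc = LinearSVC(C=1.0, penalty='l1', dual=False, random_state=seed)
svc.fit(X, y)

scores = svc.decision_function(X)        # signed distance to the separating hyperplane
labels = (scores > 0).astype(np.int32)   # positive score -> class 1

print(np.array_equal(labels, svc.predict(X)))
```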

RandomForestClassifier

from sklearn.ensemble import RandomForestClassifier

pipe = Pipeline([('select', SelectKBest(k=34)),
                 ('classify', RandomForestClassifier(criterion='gini',
                                                     random_state=seed,
                                                     min_samples_split=4,
                                                     min_samples_leaf=5,
                                                     max_features='sqrt',
                                                     n_jobs=-1,
                                                     ))])

param = {
    'classify__n_estimators': list(range(40, 50, 2)),
    'classify__max_depth': list(range(10, 25, 2))}
rfc_grid = GridSearchCV(estimator=pipe, param_grid=param, scoring='roc_auc', cv=10)
rfc_grid.fit(features, y)
print(rfc_grid.best_params_, rfc_grid.best_score_)
n_estimators = rfc_grid.best_params_['classify__n_estimators']
max_depth = rfc_grid.best_params_['classify__max_depth']
rfc_classify = RandomForestClassifier(criterion='gini',
                                      n_estimators=n_estimators,
                                      max_depth=max_depth,
                                      random_state=seed,
                                      min_samples_split=4,
                                      min_samples_leaf=5,
                                      max_features='sqrt')
# Note: the search uses SelectKBest, but the final pipeline uses PCA
rfc_select = PCA(n_components=0.95)
rfc_pipeline = make_pipeline(rfc_select, rfc_classify)
rfc_model, rfc_score = Titanicmodel(rfc_pipeline, features, test_data, y, 'rfc')
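
A note on `PCA(n_components=0.95)`, which I use as the final selector in several pipelines: when `n_components` is a float between 0 and 1, scikit-learn keeps the smallest number of components whose cumulative explained variance exceeds that fraction. A minimal sketch on synthetic, deliberately correlated data (the data construction is an assumption for the demo):

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# 10 correlated columns built from 3 latent factors plus a little noise
latent = rng.normal(size=(300, 3))
X = latent @ rng.normal(size=(3, 10)) + 0.05 * rng.normal(size=(300, 10))

pca = PCA(n_components=0.95)  # keep enough components for 95% of the variance
X_reduced = pca.fit_transform(X)

print(X_reduced.shape[1], pca.explained_variance_ratio_.sum())
```

The number of columns after the transform therefore depends on the data, not on a fixed `k` as with `SelectKBest`.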

LightGBM

import lightgbm as lgb

# The stray classify__colsample_bytree=0.8 argument is dropped: it is a grid-search
# style name, not a constructor parameter, so colsample_bytree=0.4 was the value in effect
pipe = Pipeline([('select', SelectKBest(k=34)),
                 ('classify', lgb.LGBMClassifier(random_state=seed,
                                                 learning_rate=0.12,
                                                 n_estimators=88,
                                                 max_depth=16,
                                                 min_child_samples=28,
                                                 min_child_weight=0.0,
                                                 colsample_bytree=0.4,
                                                 objective='binary'))])

param = {'select__k': [i for i in range(20, 40)]
#            'classify__learning_rate':[i/100 for i in range(20)]
}
lgb_grid = GridSearchCV(estimator=pipe, param_grid=param, scoring='roc_auc', cv=10)
lgb_grid.fit(features, y)
print(lgb_grid.best_params_, lgb_grid.best_score_)
lgb_classify = lgb.LGBMClassifier(random_state=seed,
                                  learning_rate=0.12,
                                  n_estimators=88,
                                  max_depth=16,
                                  min_child_samples=28,
                                  min_child_weight=0.0,
                                  colsample_bytree=0.4,
                                  objective='binary')
lgb_select = PCA(n_components=0.96)
lgb_pipeline = make_pipeline(lgb_select, lgb_classify)
lgb_model, lgb_score = Titanicmodel(lgb_pipeline, features, test_data, y, 'lgb')

Xgboost

import xgboost as xgb

pipe = Pipeline([('select', SelectKBest(k=34)),
                 ('classify', xgb.XGBClassifier(random_state=seed,
                                                learning_rate=0.12,
                                                n_estimators=80,
                                                max_depth=8,
                                                min_child_weight=3,
                                                subsample=0.8,
                                                colsample_bytree=0.8,
                                                gamma=0.2,
                                                reg_alpha=0.2,
                                                reg_lambda=0.1,
                                                )
                  )])
param = {'select__k': [i for i in range(20, 40)],
         'classify__learning_rate': [i / 100 for i in range(10, 20)],
}
xgb_grid = GridSearchCV(estimator=pipe, param_grid=param, scoring='roc_auc', cv=10)
xgb_grid.fit(features, y)
print(xgb_grid.best_params_, xgb_grid.best_score_)
xgb_classify = xgb.XGBClassifier(random_state=seed,
                                 learning_rate=0.12,
                                 n_estimators=80,
                                 max_depth=8,
                                 min_child_weight=3,
                                 subsample=0.8,
                                 colsample_bytree=0.8,
                                 gamma=0.2,
                                 reg_alpha=0.2,
                                 reg_lambda=0.1,
                                 )
xgb_select = SelectKBest(k=34)
xgb_pipeline = make_pipeline(xgb_select, xgb_classify)
xgb_model, xgb_score = Titanicmodel(xgb_pipeline, features, test_data, y, 'xgb')

Model fusion

LR_train = np.load('./result/LRtrain.npy')
LR_test = np.load('./result/LRtest.npy')
LinearSVC_train = np.load('./result/LinearSVCtrain.npy')
LinearSVC_test = np.load('./result/LinearSVCtest.npy')
rfc_train = np.load('./result/rfctrain.npy')
rfc_test = np.load('./result/rfctest.npy')
xgb_train = np.load('./result/xgbtrain.npy')
xgb_test = np.load('./result/xgbtest.npy')
lgb_train = np.load('./result/lgbtrain.npy')
lgb_test= np.load('./result/lgbtest.npy')
label = np.load('./result/label.npy')
# Stack the first-level predictions column-wise: they become the features of the second-level model
train_data = (LR_train, rfc_train, LinearSVC_train, xgb_train, lgb_train)
test_x = (LR_test, rfc_test, LinearSVC_test, xgb_test, lgb_test)
train_data = np.hstack(train_data)
test_x = np.hstack(test_x)
# Second-level (meta) model: logistic regression trained on the stacked predictions
model = LogisticRegression(random_state=seed)
stacking_model, stacking_score = Titanicmodel(model, features=train_data, test_data=test_x, y=label, model_name='lr_stacking')
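
What I do by hand here is usually called stacking: out-of-fold predictions from the first-level models become the input of a second-level logistic regression. scikit-learn ships a `StackingClassifier` that follows the same idea. The sketch below is only an illustration on synthetic data; the choice of base models, their settings, and the `seed` value are my assumptions, and it is not the exact procedure used above.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.svm import LinearSVC

seed = 2022  # assumed value for the demo
X, y = make_classification(n_samples=400, n_features=10, random_state=seed)
X_tr, X_va, y_tr, y_va = train_test_split(
    X, y, test_size=0.2, random_state=seed, stratify=y
)

stack = StackingClassifier(
    estimators=[
        ('lr', LogisticRegression(max_iter=1000)),
        ('svc', LinearSVC(dual=False, random_state=seed)),
        ('rfc', RandomForestClassifier(n_estimators=50, random_state=seed)),
    ],
    final_estimator=LogisticRegression(max_iter=1000),  # second-level model
    cv=5,  # base-model predictions for the meta model are produced out-of-fold
)
stack.fit(X_tr, y_tr)

print('hold-out accuracy:', stack.score(X_va, y_va))
```

The manual version keeps more control, for example saving each model's predictions to disk and reusing them, while `StackingClassifier` keeps everything in one estimator.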

Tags: Machine Learning

Posted by ioop on Fri, 20 May 2022 12:18:57 +0300