# Kaggle Titanic Survival Prediction Challenge - Model Building, Model Tuning, Fusion

## Kaggle Titanic Survival Prediction Challenge

This is the Getting Started prediction competition on Kaggle, and a relatively entry-level, beginner-friendly one. My best score was around the top 8%. Taking advantage of this Datawhale learning activity, I used what I learned from the data analysis course to revisit the competition; I will review and consolidate it in three parts.

### Prior knowledge

- numpy
- pandas
- matplotlib
- seaborn
- sklearn

Competition link: Titanic: Machine Learning from Disaster

## Sinking of the Titanic

On April 15, 1912, during her maiden voyage, the RMS Titanic, widely believed to be "unsinkable," sank after colliding with an iceberg.

Unfortunately, there were not enough lifeboats on board for everyone, resulting in the deaths of 1,502 of the 2,224 passengers and crew. While survival involved some element of luck, it seems that some groups of people were more likely to survive than others.

In this challenge, we are asked to build a predictive model that answers the question "What kind of person is more likely to survive?" using passenger data (i.e. name, age, gender, socioeconomic class, etc.).

Task analysis: this is a binary classification task; we need to build a model that predicts whether each passenger survived.
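
As a concrete starting point, the data can be loaded with pandas. A minimal sketch, assuming the competition files have been downloaded to a local `./data/` directory (the paths are my assumption, not part of the original code):

```python
import pandas as pd

# assumed local paths to the downloaded competition files
train = pd.read_csv('./data/train.csv')
test = pd.read_csv('./data/test.csv')
print(train.shape, test.shape)           # (891, 12) and (418, 11); the test set has no 'Survived' column
print(train['Survived'].value_counts())  # 1 = survived, 0 = died
```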

# Model Building, Model Tuning, Fusion

#### Divide the data into training and test sets

```python
import numpy as np
import pandas as pd

# data_all is the combined train+test dataframe from the feature-engineering step,
# with a 'train' flag column (1 = training rows, 0 = test rows)
train_data = data_all[data_all['train'] == 1].copy()
test_data = data_all[data_all['train'] == 0].copy()
# Label
y = train_data['Survived'].values
train_data.drop(['train', 'Survived'], axis=1, inplace=True)
test_data.drop(['train', 'Survived'], axis=1, inplace=True)
np.save('./result/label', y)
```
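
The code above assumes `data_all` is the feature-engineered dataframe from the previous part, with the train and test rows concatenated and flagged. A minimal sketch of that convention, assuming `train` and `test` are the raw dataframes (this reconstruction is mine, not the original feature-engineering code):

```python
import numpy as np
import pandas as pd

# hypothetical reconstruction: flag each row's origin, align columns, then concatenate
train['train'] = 1
test['train'] = 0
test['Survived'] = np.nan  # placeholder so both frames share the same columns
data_all = pd.concat([train, test], ignore_index=True)
# ... feature engineering on data_all goes here ...
```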

#### Split the training data into a training set and a validation set at a ratio of 8:2

```python
from sklearn.model_selection import train_test_split

seed = 2020  # assumed global random seed; any fixed integer works
features = train_data.values

X_train, X_test, Y_train, Y_test = train_test_split(features, y, test_size=0.2, random_state=seed)
```

## Model building

##### Encapsulate a custom training function
```python
import numpy as np
import pandas as pd
from tqdm import tqdm
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import accuracy_score


def Titanicmodel(clf, features, test_data, y, model_name):
    """Train clf with stratified 10-fold CV, collect out-of-fold predictions
    on the training set and fold-averaged predictions on the test set, and
    save both to ./result/ for later model fusion."""
    if model_name == 'LinearSVC':
        num_classes = 1  # LinearSVC has no predict_proba; keep one decision value per sample
    else:
        num_classes = 2  # number of classes
    num_fold = 10        # number of folds
    fold_len = features.shape[0] // num_fold  # samples per fold
    skf_indices = []
    skf = StratifiedKFold(n_splits=num_fold, shuffle=True, random_state=seed)  # split the training set into 10 folds
    for i, (train_idx, valid_idx) in enumerate(skf.split(np.ones(features.shape[0]), y)):
        skf_indices.extend(valid_idx.tolist())

    train_pred = np.zeros((features.shape[0], num_classes))  # out-of-fold predictions on the training set (train_samples, classes)
    test_pred = np.zeros((test_data.shape[0], num_classes))  # averaged predictions on the test set (test_samples, classes)

    for fold in tqdm(range(num_fold)):
        fold_start = fold * fold_len
        fold_end = (fold + 1) * fold_len
        if fold == num_fold - 1:
            fold_end = features.shape[0]
        # indices of the training part for this fold
        train_indices = skf_indices[:fold_start] + skf_indices[fold_end:]
        # indices of the validation part for this fold
        test_indices = skf_indices[fold_start:fold_end]

        # training data for this fold
        train_x = features[train_indices]
        train_y = y[train_indices]
        # validation data for this fold
        cv_test_x = features[test_indices]

        clf.fit(train_x, train_y)  # train

        if model_name == 'LinearSVC':
            pred = clf.decision_function(cv_test_x)  # predict raw decision margins on the validation part
            train_pred[test_indices] = pred.reshape(len(pred), 1)  # store at the validation positions; after the loop this covers the whole training set
            pred = clf.decision_function(test_data)  # predict on the test set with the model trained on this fold
            test_pred += pred.reshape(len(pred), 1) / num_fold  # average the 10 folds' test-set predictions
        else:
            pred = clf.predict_proba(cv_test_x)  # predict class probabilities on the validation part
            train_pred[test_indices] = pred      # store at the validation positions; after the loop this covers the whole training set
            pred = clf.predict_proba(test_data)  # predict on the test set with the model trained on this fold
            test_pred += pred / num_fold         # average the 10 folds' test-set predictions

    y_pred = np.argmax(train_pred, axis=1)  # row-wise argmax of the out-of-fold predictions gives the predicted labels

    if model_name == 'LinearSVC':
        # for LinearSVC a positive decision margin means the positive class
        y_pred = (train_pred > 0).astype(np.int32).reshape(len(train_pred))
        pre = (test_pred > 0).astype(np.int32).reshape(len(test_pred))
    else:
        pre = np.argmax(test_pred, axis=1)
    score = accuracy_score(y, y_pred)  # accuracy against the true training labels
    print('accuracy_score:', score)
    # save this model's predictions on the training set and test set for fusion
    np.save('./result/{0}train'.format(model_name), train_pred)
    np.save('./result/{0}test'.format(model_name), test_pred)

    submit = pd.DataFrame({'PassengerId': np.array(range(892, 1310)),
                           'Survived': pre.astype(np.int32)})
    submit.to_csv('{0}_submit.csv'.format(model_name), index=False)
    return clf, score
```
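
Note that the LinearSVC branch stores raw decision margins, not probabilities, so its saved predictions live on a different scale from the `predict_proba` outputs of the other models. If you want everything on a common probability scale before fusion, one option (not in the original code; `margin_to_proba` is a hypothetical helper of mine) is to squash the margins with a sigmoid:

```python
import numpy as np

def margin_to_proba(margins):
    """Squash raw LinearSVC decision margins into pseudo-probabilities.

    A plain sigmoid is not a calibrated probability; for proper calibration
    see sklearn.calibration.CalibratedClassifierCV.
    """
    p1 = 1.0 / (1.0 + np.exp(-margins))  # P(class 1), shape (n_samples, 1)
    return np.hstack([1.0 - p1, p1])     # shape (n_samples, 2), like predict_proba

# e.g. convert the saved LinearSVC out-of-fold predictions before fusing:
# svc_train = margin_to_proba(np.load('./result/LinearSVCtrain.npy'))
```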

### Logistic Regression (LR)

```python
from sklearn.pipeline import Pipeline, make_pipeline
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

pipe = Pipeline([('select', PCA(n_components=0.95)),
                 ('classify', LogisticRegression(random_state=seed, solver='liblinear'))])
param = {
    'classify__penalty': ['l1', 'l2'],
    'classify__C': [0.001, 0.01, 0.1, 1, 5, 7, 8, 9, 10]}
LR_grid = GridSearchCV(estimator=pipe, param_grid=param, scoring='roc_auc', cv=5)
LR_grid.fit(features, y)
print(LR_grid.best_params_, LR_grid.best_score_)
C = LR_grid.best_params_['classify__C']
penalty = LR_grid.best_params_['classify__penalty']
LR_classify = LogisticRegression(C=C, penalty=penalty, random_state=seed, solver='liblinear')
LR_select = PCA(n_components=0.95)
LR_pipeline = make_pipeline(LR_select, LR_classify)
lr_model, lr_score = Titanicmodel(LR_pipeline, features, test_data, y, 'LR')
```
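
Passing a float to PCA's `n_components` keeps the smallest number of components whose cumulative explained variance reaches that fraction, so the pipeline above reduces dimensionality while keeping roughly 95% of the variance. A quick way to inspect what it kept:

```python
from sklearn.decomposition import PCA

pca = PCA(n_components=0.95).fit(features)
print(pca.n_components_)                    # number of components actually kept
print(pca.explained_variance_ratio_.sum())  # cumulative explained variance, >= 0.95
```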

### Support Vector Machine (SVM)

```python
from sklearn.feature_selection import SelectKBest
from sklearn.svm import LinearSVC

pipe = Pipeline([('select', SelectKBest(k=20)),
                 ('classify', LinearSVC(random_state=seed, dual=False))])  # dual=False is required for the l1 penalty
param = {
    'select__k': list(range(20, 40, 2)),
    'classify__penalty': ['l1', 'l2'],
    'classify__C': [0.001, 0.01, 0.1, 1, 5, 7, 8, 9, 10, 50, 100]}
SVC_grid = GridSearchCV(estimator=pipe, param_grid=param, cv=5, scoring='roc_auc')
SVC_grid.fit(features, y)
print(SVC_grid.best_params_, SVC_grid.best_score_)
C = SVC_grid.best_params_['classify__C']
k = SVC_grid.best_params_['select__k']
penalty = SVC_grid.best_params_['classify__penalty']
SVC_classify = LinearSVC(C=C, penalty=penalty, random_state=seed, dual=False)
SVC_select = SelectKBest(k=k)  # use the tuned k with the same selector as in the tuning pipeline
SVC_pipeline = make_pipeline(SVC_select, SVC_classify)
SVC_model, LinearSVC_score = Titanicmodel(SVC_pipeline, features, test_data, y, 'LinearSVC')
```
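
LinearSVC has no `predict_proba`, which is why `Titanicmodel` falls back to `decision_function` for this model: it returns the signed distance to the separating hyperplane, positive for the predicted positive class. A quick standalone illustration of that decision rule, using the earlier 80:20 split:

```python
from sklearn.svm import LinearSVC

svc = LinearSVC(dual=False).fit(X_train, Y_train)
margins = svc.decision_function(X_test)  # signed distance to the separating hyperplane
preds = (margins > 0).astype(int)        # positive margin => predicted class 1
```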

### RandomForestClassifier

```python
from sklearn.ensemble import RandomForestClassifier

pipe = Pipeline([('select', SelectKBest(k=34)),
                 ('classify', RandomForestClassifier(criterion='gini',
                                                     random_state=seed,
                                                     min_samples_split=4,
                                                     min_samples_leaf=5,
                                                     max_features='sqrt',
                                                     n_jobs=-1))])

param = {
    'classify__n_estimators': list(range(40, 50, 2)),
    'classify__max_depth': list(range(10, 25, 2))}
rfc_grid = GridSearchCV(estimator=pipe, param_grid=param, scoring='roc_auc', cv=10)
rfc_grid.fit(features, y)
print(rfc_grid.best_params_, rfc_grid.best_score_)
n_estimators = rfc_grid.best_params_['classify__n_estimators']
max_depth = rfc_grid.best_params_['classify__max_depth']
rfc_classify = RandomForestClassifier(criterion='gini',
                                      n_estimators=n_estimators,
                                      max_depth=max_depth,
                                      random_state=seed,
                                      min_samples_split=4,
                                      min_samples_leaf=5,
                                      max_features='sqrt',
                                      n_jobs=-1)
rfc_select = SelectKBest(k=34)  # keep the same selector as in the tuning pipeline
rfc_pipeline = make_pipeline(rfc_select, rfc_classify)
rfc_model, rfc_score = Titanicmodel(rfc_pipeline, features, test_data, y, 'rfc')
```
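
Random forests also expose `feature_importances_`, which is handy for sanity-checking the engineered features. A small sketch inspecting the pipeline returned by `Titanicmodel` (it holds the fit from the last CV fold; the importance indices refer to the columns kept by SelectKBest):

```python
import numpy as np

# make_pipeline names steps after the lowercased class name
rf = rfc_pipeline.named_steps['randomforestclassifier']
top = np.argsort(rf.feature_importances_)[::-1][:10]
print(top, rf.feature_importances_[top])  # ten most important selected features
```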

### LightGBM

```python
import lightgbm as lgb

pipe = Pipeline([('select', SelectKBest(k=34)),
                 ('classify', lgb.LGBMClassifier(random_state=seed,
                                                 learning_rate=0.12,
                                                 n_estimators=88,
                                                 max_depth=16,
                                                 min_child_samples=28,
                                                 min_child_weight=0.0,
                                                 colsample_bytree=0.4,
                                                 objective='binary'))])

param = {'select__k': [i for i in range(20, 40)]
         # 'classify__learning_rate': [i / 100 for i in range(20)]
         }
lgb_grid = GridSearchCV(estimator=pipe, param_grid=param, scoring='roc_auc', cv=10)
lgb_grid.fit(features, y)
print(lgb_grid.best_params_, lgb_grid.best_score_)
k = lgb_grid.best_params_['select__k']
lgb_classify = lgb.LGBMClassifier(random_state=seed,
                                  learning_rate=0.12,
                                  n_estimators=88,
                                  max_depth=16,
                                  min_child_samples=28,
                                  min_child_weight=0.0,
                                  colsample_bytree=0.4,
                                  objective='binary')
lgb_select = SelectKBest(k=k)  # use the tuned k with the same selector as in tuning
lgb_pipeline = make_pipeline(lgb_select, lgb_classify)
lgb_model, lgb_score = Titanicmodel(lgb_pipeline, features, test_data, y, 'lgb')
```

### XGBoost

```python
import xgboost as xgb

pipe = Pipeline([('select', SelectKBest(k=34)),
                 ('classify', xgb.XGBClassifier(random_state=seed,
                                                learning_rate=0.12,
                                                n_estimators=80,
                                                max_depth=8,
                                                min_child_weight=3,
                                                subsample=0.8,
                                                colsample_bytree=0.8,
                                                gamma=0.2,
                                                reg_alpha=0.2,
                                                reg_lambda=0.1))])
param = {'select__k': [i for i in range(20, 40)],
         'classify__learning_rate': [i / 100 for i in range(10, 20)],
         }
xgb_grid = GridSearchCV(estimator=pipe, param_grid=param, scoring='roc_auc', cv=10)
xgb_grid.fit(features, y)
print(xgb_grid.best_params_, xgb_grid.best_score_)
k = xgb_grid.best_params_['select__k']
learning_rate = xgb_grid.best_params_['classify__learning_rate']
xgb_classify = xgb.XGBClassifier(random_state=seed,
                                 learning_rate=learning_rate,
                                 n_estimators=80,
                                 max_depth=8,
                                 min_child_weight=3,
                                 subsample=0.8,
                                 colsample_bytree=0.8,
                                 gamma=0.2,
                                 reg_alpha=0.2,
                                 reg_lambda=0.1)
xgb_select = SelectKBest(k=k)  # use the tuned k
xgb_pipeline = make_pipeline(xgb_select, xgb_classify)
xgb_model, xgb_score = Titanicmodel(xgb_pipeline, features, test_data, y, 'xgb')
```

### Model fusion

```python
# load the out-of-fold predictions that Titanicmodel saved for each base model
LR_train = np.load('./result/LRtrain.npy')
model = LogisticRegression(random_state=seed)
```
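
My notes are truncated here, but the intent is stacking: the saved out-of-fold training predictions of the base models become the input features of a second-level LogisticRegression, and its prediction on the stacked test predictions is the fused submission. A self-contained sketch under that assumption (the file names follow what `Titanicmodel` saved; the rest is my reconstruction, not the original code):

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

model_names = ['LR', 'LinearSVC', 'rfc', 'lgb', 'xgb']

# stack each base model's out-of-fold train predictions and averaged test predictions
train_stack = np.hstack([np.load('./result/{0}train.npy'.format(m)) for m in model_names])
test_stack = np.hstack([np.load('./result/{0}test.npy'.format(m)) for m in model_names])
y = np.load('./result/label.npy')

# second-level model: logistic regression on the base models' predictions
model = LogisticRegression(random_state=seed)
model.fit(train_stack, y)
pre = model.predict(test_stack)

submit = pd.DataFrame({'PassengerId': np.array(range(892, 1310)),
                       'Survived': pre.astype(np.int32)})
submit.to_csv('fusion_submit.csv', index=False)
```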