# Kaggle Titanic Survival Prediction Challenge - Model Building, Model Tuning, Fusion

## Kaggle Titanic Survival Prediction Challenge

This is the Getting Started prediction competition on Kaggle, and a relatively entry-level, beginner-friendly one. My best score was around the top 8%. Taking advantage of this Datawhale learning activity, I used what I learned from the data analysis course to revisit the competition; I will review and consolidate it in three parts.

### Prior knowledge

- numpy
- pandas
- matplotlib
- seaborn
- sklearn

Competition link: Titanic: Machine Learning from Disaster

## Sinking of the Titanic

On April 15, 1912, during her maiden voyage, the RMS Titanic, widely believed to be "unsinkable," sank after colliding with an iceberg.

Unfortunately, there were not enough lifeboats on board for everyone, resulting in the deaths of 1,502 of the 2,224 passengers and crew. While survival involved some element of luck, it seems that some groups of people were more likely to survive than others.

In this challenge, we are asked to build a predictive model that answers the question "What kind of person is more likely to survive?" using passenger data (i.e. name, age, gender, socioeconomic class, etc.).

Task analysis: this is a binary classification task; we need to build a model that predicts whether each passenger survived.
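
As a concrete starting point, the data can be loaded with pandas. A minimal sketch, assuming the competition files have been downloaded to a local `./data/` directory (the paths are my assumption, not part of the original code):

```python
import pandas as pd

# assumed local paths to the downloaded competition files
train = pd.read_csv('./data/train.csv')
test = pd.read_csv('./data/test.csv')
print(train.shape, test.shape)           # (891, 12) and (418, 11); the test set has no 'Survived' column
print(train['Survived'].value_counts())  # 1 = survived, 0 = died
```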

# Model Building, Model Tuning, Fusion

#### Divide the data into training and test sets

```python
import numpy as np
import pandas as pd

# data_all is the combined train+test dataframe from the feature-engineering step,
# with a 'train' flag column (1 = training rows, 0 = test rows)
train_data = data_all[data_all['train'] == 1].copy()
test_data = data_all[data_all['train'] == 0].copy()
# Label
y = train_data['Survived'].values
train_data.drop(['train', 'Survived'], axis=1, inplace=True)
test_data.drop(['train', 'Survived'], axis=1, inplace=True)
np.save('./result/label', y)
```
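
The code above assumes `data_all` is the feature-engineered dataframe from the previous part, with the train and test rows concatenated and flagged. A minimal sketch of that convention, assuming `train` and `test` are the raw dataframes (this reconstruction is mine, not the original feature-engineering code):

```python
import numpy as np
import pandas as pd

# hypothetical reconstruction: flag each row's origin, align columns, then concatenate
train['train'] = 1
test['train'] = 0
test['Survived'] = np.nan  # placeholder so both frames share the same columns
data_all = pd.concat([train, test], ignore_index=True)
# ... feature engineering on data_all goes here ...
```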

#### Split the training data into a training set and a validation set at a ratio of 8:2

```python
from sklearn.model_selection import train_test_split

seed = 2020  # assumed global random seed; any fixed integer works
features = train_data.values

X_train, X_test, Y_train, Y_test = train_test_split(features, y, test_size=0.2, random_state=seed)
```

## Model building

##### Encapsulate a custom training function
```python
import numpy as np
import pandas as pd
from tqdm import tqdm
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import accuracy_score


def Titanicmodel(clf, features, test_data, y, model_name):
    """Train clf with stratified 10-fold CV, collect out-of-fold predictions
    on the training set and fold-averaged predictions on the test set, and
    save both to ./result/ for later model fusion."""
    if model_name == 'LinearSVC':
        num_classes = 1  # LinearSVC has no predict_proba; keep one decision value per sample
    else:
        num_classes = 2  # number of classes
    num_fold = 10        # number of folds
    fold_len = features.shape[0] // num_fold  # samples per fold
    skf_indices = []
    skf = StratifiedKFold(n_splits=num_fold, shuffle=True, random_state=seed)  # split the training set into 10 folds
    for i, (train_idx, valid_idx) in enumerate(skf.split(np.ones(features.shape[0]), y)):
        skf_indices.extend(valid_idx.tolist())

    train_pred = np.zeros((features.shape[0], num_classes))  # out-of-fold predictions on the training set (train_samples, classes)
    test_pred = np.zeros((test_data.shape[0], num_classes))  # averaged predictions on the test set (test_samples, classes)

    for fold in tqdm(range(num_fold)):
        fold_start = fold * fold_len
        fold_end = (fold + 1) * fold_len
        if fold == num_fold - 1:
            fold_end = features.shape[0]
        # indices of the training part for this fold
        train_indices = skf_indices[:fold_start] + skf_indices[fold_end:]
        # indices of the validation part for this fold
        test_indices = skf_indices[fold_start:fold_end]

        # training data for this fold
        train_x = features[train_indices]
        train_y = y[train_indices]
        # validation data for this fold
        cv_test_x = features[test_indices]

        clf.fit(train_x, train_y)  # train

        if model_name == 'LinearSVC':
            pred = clf.decision_function(cv_test_x)  # predict raw decision margins on the validation part
            train_pred[test_indices] = pred.reshape(len(pred), 1)  # store at the validation positions; after the loop this covers the whole training set
            pred = clf.decision_function(test_data)  # predict on the test set with the model trained on this fold
            test_pred += pred.reshape(len(pred), 1) / num_fold  # average the 10 folds' test-set predictions
        else:
            pred = clf.predict_proba(cv_test_x)  # predict class probabilities on the validation part
            train_pred[test_indices] = pred      # store at the validation positions; after the loop this covers the whole training set
            pred = clf.predict_proba(test_data)  # predict on the test set with the model trained on this fold
            test_pred += pred / num_fold         # average the 10 folds' test-set predictions

    y_pred = np.argmax(train_pred, axis=1)  # row-wise argmax of the out-of-fold predictions gives the predicted labels

    if model_name == 'LinearSVC':
        # for LinearSVC a positive decision margin means the positive class
        y_pred = (train_pred > 0).astype(np.int32).reshape(len(train_pred))
        pre = (test_pred > 0).astype(np.int32).reshape(len(test_pred))
    else:
        pre = np.argmax(test_pred, axis=1)
    score = accuracy_score(y, y_pred)  # accuracy against the true training labels
    print('accuracy_score:', score)
    # save this model's predictions on the training set and test set for fusion
    np.save('./result/{0}train'.format(model_name), train_pred)
    np.save('./result/{0}test'.format(model_name), test_pred)

    submit = pd.DataFrame({'PassengerId': np.array(range(892, 1310)),
                           'Survived': pre.astype(np.int32)})
    submit.to_csv('{0}_submit.csv'.format(model_name), index=False)
    return clf, score
```
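
Note that the LinearSVC branch stores raw decision margins, not probabilities, so its saved predictions live on a different scale from the `predict_proba` outputs of the other models. If you want everything on a common probability scale before fusion, one option (not in the original code; `margin_to_proba` is a hypothetical helper of mine) is to squash the margins with a sigmoid:

```python
import numpy as np

def margin_to_proba(margins):
    """Squash raw LinearSVC decision margins into pseudo-probabilities.

    A plain sigmoid is not a calibrated probability; for proper calibration
    see sklearn.calibration.CalibratedClassifierCV.
    """
    p1 = 1.0 / (1.0 + np.exp(-margins))  # P(class 1), shape (n_samples, 1)
    return np.hstack([1.0 - p1, p1])     # shape (n_samples, 2), like predict_proba

# e.g. convert the saved LinearSVC out-of-fold predictions before fusing:
# svc_train = margin_to_proba(np.load('./result/LinearSVCtrain.npy'))
```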

### Logistic Regression (LR)

```python
from sklearn.pipeline import Pipeline, make_pipeline
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

pipe = Pipeline([('select', PCA(n_components=0.95)),
                 ('classify', LogisticRegression(random_state=seed, solver='liblinear'))])
param = {
    'classify__penalty': ['l1', 'l2'],
    'classify__C': [0.001, 0.01, 0.1, 1, 5, 7, 8, 9, 10]}
LR_grid = GridSearchCV(estimator=pipe, param_grid=param, scoring='roc_auc', cv=5)
LR_grid.fit(features, y)
print(LR_grid.best_params_, LR_grid.best_score_)
C = LR_grid.best_params_['classify__C']
penalty = LR_grid.best_params_['classify__penalty']
LR_classify = LogisticRegression(C=C, penalty=penalty, random_state=seed, solver='liblinear')
LR_select = PCA(n_components=0.95)
LR_pipeline = make_pipeline(LR_select, LR_classify)
lr_model, lr_score = Titanicmodel(LR_pipeline, features, test_data, y, 'LR')
```
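
Passing a float to PCA's `n_components` keeps the smallest number of components whose cumulative explained variance reaches that fraction, so the pipeline above reduces dimensionality while keeping roughly 95% of the variance. A quick way to inspect what it kept:

```python
from sklearn.decomposition import PCA

pca = PCA(n_components=0.95).fit(features)
print(pca.n_components_)                    # number of components actually kept
print(pca.explained_variance_ratio_.sum())  # cumulative explained variance, >= 0.95
```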

### Support Vector Machine (SVM)

```python
from sklearn.feature_selection import SelectKBest
from sklearn.svm import LinearSVC

pipe = Pipeline([('select', SelectKBest(k=20)),
                 ('classify', LinearSVC(random_state=seed, dual=False))])  # dual=False is required for the l1 penalty
param = {
    'select__k': list(range(20, 40, 2)),
    'classify__penalty': ['l1', 'l2'],
    'classify__C': [0.001, 0.01, 0.1, 1, 5, 7, 8, 9, 10, 50, 100]}
SVC_grid = GridSearchCV(estimator=pipe, param_grid=param, cv=5, scoring='roc_auc')
SVC_grid.fit(features, y)
print(SVC_grid.best_params_, SVC_grid.best_score_)
C = SVC_grid.best_params_['classify__C']
k = SVC_grid.best_params_['select__k']
penalty = SVC_grid.best_params_['classify__penalty']
SVC_classify = LinearSVC(C=C, penalty=penalty, random_state=seed, dual=False)
SVC_select = SelectKBest(k=k)  # use the tuned k with the same selector as in the tuning pipeline
SVC_pipeline = make_pipeline(SVC_select, SVC_classify)
SVC_model, LinearSVC_score = Titanicmodel(SVC_pipeline, features, test_data, y, 'LinearSVC')
```
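
LinearSVC has no `predict_proba`, which is why `Titanicmodel` falls back to `decision_function` for this model: it returns the signed distance to the separating hyperplane, positive for the predicted positive class. A quick standalone illustration of that decision rule, using the earlier 80:20 split:

```python
from sklearn.svm import LinearSVC

svc = LinearSVC(dual=False).fit(X_train, Y_train)
margins = svc.decision_function(X_test)  # signed distance to the separating hyperplane
preds = (margins > 0).astype(int)        # positive margin => predicted class 1
```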

### RandomForestClassifier

```python
from sklearn.ensemble import RandomForestClassifier

pipe = Pipeline([('select', SelectKBest(k=34)),
                 ('classify', RandomForestClassifier(criterion='gini',
                                                     random_state=seed,
                                                     min_samples_split=4,
                                                     min_samples_leaf=5,
                                                     max_features='sqrt',
                                                     n_jobs=-1))])

param = {
    'classify__n_estimators': list(range(40, 50, 2)),
    'classify__max_depth': list(range(10, 25, 2))}
rfc_grid = GridSearchCV(estimator=pipe, param_grid=param, scoring='roc_auc', cv=10)
rfc_grid.fit(features, y)
print(rfc_grid.best_params_, rfc_grid.best_score_)
n_estimators = rfc_grid.best_params_['classify__n_estimators']
max_depth = rfc_grid.best_params_['classify__max_depth']
rfc_classify = RandomForestClassifier(criterion='gini',
                                      n_estimators=n_estimators,
                                      max_depth=max_depth,
                                      random_state=seed,
                                      min_samples_split=4,
                                      min_samples_leaf=5,
                                      max_features='sqrt',
                                      n_jobs=-1)
rfc_select = SelectKBest(k=34)  # keep the same selector as in the tuning pipeline
rfc_pipeline = make_pipeline(rfc_select, rfc_classify)
rfc_model, rfc_score = Titanicmodel(rfc_pipeline, features, test_data, y, 'rfc')
```
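
Random forests also expose `feature_importances_`, which is handy for sanity-checking the engineered features. A small sketch inspecting the pipeline returned by `Titanicmodel` (it holds the fit from the last CV fold; the importance indices refer to the columns kept by SelectKBest):

```python
import numpy as np

# make_pipeline names steps after the lowercased class name
rf = rfc_pipeline.named_steps['randomforestclassifier']
top = np.argsort(rf.feature_importances_)[::-1][:10]
print(top, rf.feature_importances_[top])  # ten most important selected features
```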

### LightGBM

```python
import lightgbm as lgb

pipe = Pipeline([('select', SelectKBest(k=34)),
                 ('classify', lgb.LGBMClassifier(random_state=seed,
                                                 learning_rate=0.12,
                                                 n_estimators=88,
                                                 max_depth=16,
                                                 min_child_samples=28,
                                                 min_child_weight=0.0,
                                                 colsample_bytree=0.4,
                                                 objective='binary'))])

param = {'select__k': [i for i in range(20, 40)]
         # 'classify__learning_rate': [i / 100 for i in range(20)]
         }
lgb_grid = GridSearchCV(estimator=pipe, param_grid=param, scoring='roc_auc', cv=10)
lgb_grid.fit(features, y)
print(lgb_grid.best_params_, lgb_grid.best_score_)
k = lgb_grid.best_params_['select__k']
lgb_classify = lgb.LGBMClassifier(random_state=seed,
                                  learning_rate=0.12,
                                  n_estimators=88,
                                  max_depth=16,
                                  min_child_samples=28,
                                  min_child_weight=0.0,
                                  colsample_bytree=0.4,
                                  objective='binary')
lgb_select = SelectKBest(k=k)  # use the tuned k with the same selector as in tuning
lgb_pipeline = make_pipeline(lgb_select, lgb_classify)
lgb_model, lgb_score = Titanicmodel(lgb_pipeline, features, test_data, y, 'lgb')
```

### XGBoost

```python
import xgboost as xgb

pipe = Pipeline([('select', SelectKBest(k=34)),
                 ('classify', xgb.XGBClassifier(random_state=seed,
                                                learning_rate=0.12,
                                                n_estimators=80,
                                                max_depth=8,
                                                min_child_weight=3,
                                                subsample=0.8,
                                                colsample_bytree=0.8,
                                                gamma=0.2,
                                                reg_alpha=0.2,
                                                reg_lambda=0.1))])
param = {'select__k': [i for i in range(20, 40)],
         'classify__learning_rate': [i / 100 for i in range(10, 20)],
         }
xgb_grid = GridSearchCV(estimator=pipe, param_grid=param, scoring='roc_auc', cv=10)
xgb_grid.fit(features, y)
print(xgb_grid.best_params_, xgb_grid.best_score_)
k = xgb_grid.best_params_['select__k']
learning_rate = xgb_grid.best_params_['classify__learning_rate']
xgb_classify = xgb.XGBClassifier(random_state=seed,
                                 learning_rate=learning_rate,
                                 n_estimators=80,
                                 max_depth=8,
                                 min_child_weight=3,
                                 subsample=0.8,
                                 colsample_bytree=0.8,
                                 gamma=0.2,
                                 reg_alpha=0.2,
                                 reg_lambda=0.1)
xgb_select = SelectKBest(k=k)  # use the tuned k
xgb_pipeline = make_pipeline(xgb_select, xgb_classify)
xgb_model, xgb_score = Titanicmodel(xgb_pipeline, features, test_data, y, 'xgb')
```

### Model fusion

```python
# load the out-of-fold predictions that Titanicmodel saved for each base model
LR_train = np.load('./result/LRtrain.npy')
model = LogisticRegression(random_state=seed)
```
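
My notes are truncated here, but the intent is stacking: the saved out-of-fold training predictions of the base models become the input features of a second-level LogisticRegression, and its prediction on the stacked test predictions is the fused submission. A self-contained sketch under that assumption (the file names follow what `Titanicmodel` saved; the rest is my reconstruction, not the original code):

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

model_names = ['LR', 'LinearSVC', 'rfc', 'lgb', 'xgb']

# stack each base model's out-of-fold train predictions and averaged test predictions
train_stack = np.hstack([np.load('./result/{0}train.npy'.format(m)) for m in model_names])
test_stack = np.hstack([np.load('./result/{0}test.npy'.format(m)) for m in model_names])
y = np.load('./result/label.npy')

# second-level model: logistic regression on the base models' predictions
model = LogisticRegression(random_state=seed)
model.fit(train_stack, y)
pre = model.predict(test_stack)

submit = pd.DataFrame({'PassengerId': np.array(range(892, 1310)),
                       'Survived': pre.astype(np.int32)})
submit.to_csv('fusion_submit.csv', index=False)
```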