Theoretical learning
Understanding ROC
This week's competition review mentioned the concept of ROC, which is actually very interesting. Evaluating a model by accuracy alone can be misleading: suppose there are 90 negatives and 10 positives. A first model that predicts everyone negative reaches 90% accuracy; a second model that correctly identifies 5 positives and calls the rest negative reaches 95%. The accuracies differ by only 5%, but the actual behavior is very different. In addition, in many problems the cost of missing a positive and the cost of missing a negative are different (cancer detection, for example), so the concept of ROC was introduced from the medical field.
True positives (TP) are correctly predicted positive; false positives (FP) are incorrectly predicted positive (actually negative); false negatives (FN) are incorrectly predicted negative (actually positive); true negatives (TN) are correctly predicted negative.
ROC has two coordinate axes. The vertical axis is the true positive rate, TPR = TP / (TP + FN), the ratio of true positives to all actual positives; the horizontal axis is the false positive rate, FPR = FP / (FP + TN), the ratio of false positives to all actual negatives.
The ROC curve always passes through (0, 0) and (1, 1), and the more area it encloses above the line y = x, the better the model. Ideally the curve approaches an isosceles right triangle hugging the point (0, 1), that is, a true positive rate of 1 at a low false positive rate (the problem needs to find every positive). Varying the model's decision threshold traces out the points on the curve: as the threshold is loosened, the false positive rate rises and the true positive rate rises with it. The area enclosed between the curve and y = x (and likewise the total area under the curve, the AUC) measures the model's quality: the larger, the better.
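To make this concrete, here is a minimal sketch (with made-up labels and scores, not competition output) that counts the four confusion-matrix cells at one threshold and then lets scikit-learn sweep the threshold to trace the full curve and compute the AUC:

import numpy as np
from sklearn.metrics import roc_curve, roc_auc_score

y_true = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 1])     # 6 negatives, 4 positives
y_score = np.array([0.1, 0.2, 0.3, 0.35, 0.4, 0.8,    # scores for the negatives
                    0.45, 0.6, 0.7, 0.9])              # scores for the positives

threshold = 0.5
y_pred = (y_score >= threshold).astype(int)
TP = np.sum((y_pred == 1) & (y_true == 1))
FP = np.sum((y_pred == 1) & (y_true == 0))
FN = np.sum((y_pred == 0) & (y_true == 1))
TN = np.sum((y_pred == 0) & (y_true == 0))

TPR = TP / (TP + FN)    # true positives / all actual positives
FPR = FP / (FP + TN)    # false positives / all actual negatives
print('TPR:', TPR, 'FPR:', FPR)

# Sweeping the threshold yields every (FPR, TPR) point on the ROC curve
fpr, tpr, thresholds = roc_curve(y_true, y_score)
print('AUC:', roc_auc_score(y_true, y_score))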
Decision tree and random forest
Since I haven't taken a data mining course before, I'd better supplement the basic knowledge.
First, consider a random variable X with n independent outcomes. The information (self-information) of an outcome $x_i$ with probability $p(x_i)$ is

$I(x_i) = -\log_2 p(x_i)$

$\log_2$ is used here because information in computers is expressed in powers of 2 (bits), which is easier to work with. The information entropy H(X) is then the expected information over all outcomes, where p is the probability of occurrence:

$H(X) = -\sum_{i=1}^{n} p(x_i) \log_2 p(x_i)$
Why can this formula express information entropy? Entropy refers to the degree of disorder; here, it refers to the degree of uncertainty of an event. If an event is certain to happen or certain not to happen, its entropy should be small, while only events whose occurrence is uncertain, that is, events with probability away from the two ends (0 or 1), should have greater entropy.
Since $\log_2 p$ is negative for p in (0, 1) (p is a number greater than 0 and less than 1), $-\log_2 p$ maps (0, 1) onto (0, +∞), and weighting it by p gives a term that is 0 when the probability of occurrence is 0 or 1 and large for events in the middle, exactly the behavior entropy should have.
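As a quick numeric check of this intuition, the sketch below evaluates the binary entropy $-p \log_2 p - (1-p) \log_2 (1-p)$ at several probabilities; it is 0 at both ends and peaks at p = 0.5:

import numpy as np

def binary_entropy(p):
    # Treat 0 * log2(0) as 0 so the endpoints are well defined
    ent = 0.0
    for q in (p, 1 - p):
        if q > 0:
            ent -= q * np.log2(q)
    return ent

for p in [0.0, 0.1, 0.5, 0.9, 1.0]:
    print(p, round(binary_entropy(p), 3))
# prints 0.0, 0.469, 1.0, 0.469, 0.0: certain events carry no entropy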
A decision tree reflects a way of thinking we commonly use in daily life, as shown in the figure below.
So a decision tree contains many decisions, and after these decisions we classify the data. Choosing which decision should serve as the root node is where entropy comes in: the usual entropy-based criterion picks the split with the largest information gain, as in the sketch below.
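Here is a hedged sketch of that root-selection step, assuming the standard information-gain criterion (parent entropy minus the weighted entropy of the children); the labels and feature masks below are invented for illustration:

import numpy as np

def entropy(labels):
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def information_gain(labels, mask):
    # mask splits the samples into two child nodes
    left, right = labels[mask], labels[~mask]
    weighted = (len(left) * entropy(left) + len(right) * entropy(right)) / len(labels)
    return entropy(labels) - weighted

y = np.array([1, 1, 1, 0, 0, 0, 1, 0])    # class labels (invented)
outlook_sunny = np.array([True, True, False, False, True, False, True, False])
windy = np.array([True, False, True, False, True, False, False, True])

print('gain(outlook):', information_gain(y, outlook_sunny))
print('gain(windy):', information_gain(y, windy))
# the candidate with the higher gain would be chosen as the root decision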
A random forest can be vividly understood as many decision trees voting democratically. To keep their outputs from being identical, each tree must be trained on a bootstrap sample, i.e., sampling with replacement, as in the sketch below.
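A minimal sketch of this voting idea, hand-rolling on toy data what sklearn's RandomForestClassifier does internally: each tree fits a bootstrap sample drawn with replacement, and the forest takes the majority vote:

import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=200, n_features=8, random_state=0)
rng = np.random.default_rng(0)

trees = []
for _ in range(25):
    idx = rng.integers(0, len(X), size=len(X))    # sampling with replacement
    tree = DecisionTreeClassifier(max_features='sqrt', random_state=0)
    trees.append(tree.fit(X[idx], y[idx]))

# Democratic vote: average the 25 predictions and take the majority
votes = np.stack([t.predict(X) for t in trees])
forest_pred = (votes.mean(axis=0) >= 0.5).astype(int)
print('training accuracy of the vote:', (forest_pred == y).mean())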
Practice
In the process of officially taking part in the introductory competition this week, I realized the importance of EDA. It is fair to say that if it is done well, fitting the model afterwards is not so difficult; it is the key that determines success or failure. The main points are filling in missing values, discretizing continuous data, correcting outliers, and normalizing. I also learned that binning is a very effective method. A special point is the mapping of time information into usable features. A small sketch of these cleaning steps follows.
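The sketch below walks the cleaning steps on a single made-up column (the column name is invented, not taken from the competition data): fill missing values, clip outliers, and bin the continuous feature:

import numpy as np
import pandas as pd

df = pd.DataFrame({'income': [3000, 4500, np.nan, 8000, 120000, 5200, np.nan, 6100]})

# Fill missing values with the median (one simple, common choice)
df['income'] = df['income'].fillna(df['income'].median())

# Correct outliers by clipping to the 1st/99th percentiles
low, high = df['income'].quantile([0.01, 0.99])
df['income'] = df['income'].clip(low, high)

# Binning: equal-frequency bins with qcut (equal-width bins would use pd.cut)
df['income_bin'] = pd.qcut(df['income'], q=4, labels=False, duplicates='drop')
print(df)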
We have made a first pass at xgboost and catboost, which will be studied further next week.
Submission results
import pandas as pd
import os
import gc
import lightgbm as lgb
import xgboost as xgb
from catboost import CatBoostRegressor
from sklearn.linear_model import SGDRegressor, LinearRegression, Ridge
from sklearn.preprocessing import MinMaxScaler
import math
import numpy as np
from tqdm import tqdm
from sklearn.model_selection import StratifiedKFold, KFold
from sklearn.metrics import accuracy_score, f1_score, roc_auc_score, log_loss
import matplotlib.pyplot as plt
import time
import warnings
warnings.filterwarnings('ignore')

train = pd.read_csv('train.csv')
testA = pd.read_csv('testA.csv')
data = pd.concat([train, testA], axis=0, ignore_index=True)

# Normalize the employmentLength strings, then convert them to integers
data['employmentLength'].replace(to_replace='10+ years', value='10 years', inplace=True)
data['employmentLength'].replace('< 1 year', '0 years', inplace=True)

def employmentLength_to_int(s):
    if pd.isnull(s):
        return s
    else:
        return np.int8(s.split()[0])

data['employmentLength'] = data['employmentLength'].apply(employmentLength_to_int)
# print(data['employmentLength'].value_counts(dropna=False).sort_index())

# Keep only the year of the earliest credit line
data['earliesCreditLine'] = data['earliesCreditLine'].apply(lambda s: int(s[-4:]))
# print(data['earliesCreditLine'].describe())

cate_features = ['grade', 'subGrade', 'employmentTitle', 'homeOwnership', 'verificationStatus',
                 'purpose', 'postCode', 'regionCode', 'applicationType', 'initialListStatus',
                 'title', 'policyCode']
# for f in cate_features:
#     print(f, 'number of types:', data[f].nunique())

# One-hot encode the low-cardinality categorical features
data = pd.get_dummies(data, columns=['grade', 'subGrade', 'homeOwnership', 'verificationStatus',
                                     'purpose', 'regionCode'], drop_first=True)

# High-cardinality features: replace each with a count and a rank encoding
for f in ['employmentTitle', 'postCode', 'title']:
    data[f + '_cnts'] = data.groupby([f])['id'].transform('count')
    data[f + '_rank'] = data.groupby([f])['id'].rank(ascending=False).astype(int)
    del data[f]

features = [f for f in data.columns if f not in ['id', 'issueDate', 'isDefault']]
train = data[data.isDefault.notnull()].reset_index(drop=True)
test = data[data.isDefault.isnull()].reset_index(drop=True)
x_train = train[features]
x_test = test[features]
y_train = train['isDefault']

def cv_model(clf, train_x, train_y, test_x, clf_name):
    folds = 5
    seed = 2020
    kf = KFold(n_splits=folds, shuffle=True, random_state=seed)
    train = np.zeros(train_x.shape[0])   # out-of-fold predictions
    test = np.zeros(test_x.shape[0])     # test predictions averaged over folds
    cv_scores = []
    for i, (train_index, valid_index) in enumerate(kf.split(train_x, train_y)):
        print('************************************ {} ************************************'.format(str(i + 1)))
        trn_x, trn_y = train_x.iloc[train_index], train_y[train_index]
        val_x, val_y = train_x.iloc[valid_index], train_y[valid_index]
        if clf_name == "lgb":
            train_matrix = clf.Dataset(trn_x, label=trn_y)
            valid_matrix = clf.Dataset(val_x, label=val_y)
            params = {
                'boosting_type': 'gbdt',
                'objective': 'binary',
                'metric': 'auc',
                'min_child_weight': 5,
                'num_leaves': 2 ** 5,
                'lambda_l2': 10,
                'feature_fraction': 0.8,
                'bagging_fraction': 0.8,
                'bagging_freq': 4,
                'learning_rate': 0.1,
                'seed': 2020,
                'nthread': 28,
                'n_jobs': 24,
                'silent': True,
                'verbose': -1,
            }
            model = clf.train(params, train_matrix, 50000, valid_sets=[train_matrix, valid_matrix],
                              verbose_eval=200, early_stopping_rounds=200)
            val_pred = model.predict(val_x, num_iteration=model.best_iteration)
            test_pred = model.predict(test_x, num_iteration=model.best_iteration)
            # print(list(sorted(zip(features, model.feature_importance("gain")),
            #                   key=lambda x: x[1], reverse=True))[:20])
        if clf_name == "xgb":
            train_matrix = clf.DMatrix(trn_x, label=trn_y)
            valid_matrix = clf.DMatrix(val_x, label=val_y)
            test_matrix = clf.DMatrix(test_x)
            params = {
                'booster': 'gbtree',
                'objective': 'binary:logistic',
                'eval_metric': 'auc',
                'gamma': 1,
                'min_child_weight': 1.5,
                'max_depth': 5,
                'lambda': 10,
                'subsample': 0.7,
                'colsample_bytree': 0.7,
                'colsample_bylevel': 0.7,
                'eta': 0.04,
                'tree_method': 'exact',
                'seed': 2020,
                'nthread': 36,
                'silent': True,
            }
            watchlist = [(train_matrix, 'train'), (valid_matrix, 'eval')]
            model = clf.train(params, train_matrix, num_boost_round=50000, evals=watchlist,
                              verbose_eval=200, early_stopping_rounds=200)
            val_pred = model.predict(valid_matrix, ntree_limit=model.best_ntree_limit)
            test_pred = model.predict(test_matrix, ntree_limit=model.best_ntree_limit)
        if clf_name == "cat":
            params = {'learning_rate': 0.05, 'depth': 5, 'l2_leaf_reg': 10,
                      'bootstrap_type': 'Bernoulli', 'od_type': 'Iter', 'od_wait': 50,
                      'random_seed': 11, 'allow_writing_files': False}
            model = clf(iterations=20000, **params)
            model.fit(trn_x, trn_y, eval_set=(val_x, val_y), cat_features=[],
                      use_best_model=True, verbose=500)
            val_pred = model.predict(val_x)
            test_pred = model.predict(test_x)
        train[valid_index] = val_pred
        test += test_pred / kf.n_splits   # accumulate fold predictions instead of overwriting
        cv_scores.append(roc_auc_score(val_y, val_pred))
        print(cv_scores)
    print("%s_score_list:" % clf_name, cv_scores)
    print("%s_score_mean:" % clf_name, np.mean(cv_scores))
    print("%s_score_std:" % clf_name, np.std(cv_scores))
    return train, test

def lgb_model(x_train, y_train, x_test):
    lgb_train, lgb_test = cv_model(lgb, x_train, y_train, x_test, "lgb")
    return lgb_train, lgb_test

def xgb_model(x_train, y_train, x_test):
    xgb_train, xgb_test = cv_model(xgb, x_train, y_train, x_test, "xgb")
    return xgb_train, xgb_test

def cat_model(x_train, y_train, x_test):
    cat_train, cat_test = cv_model(CatBoostRegressor, x_train, y_train, x_test, "cat")
    return cat_train, cat_test

lgb_train, lgb_test = lgb_model(x_train, y_train, x_test)
xgb_train, xgb_test = xgb_model(x_train, y_train, x_test)
cat_train, cat_test = cat_model(x_train, y_train, x_test)

# Blend the LightGBM and XGBoost test predictions and write the submission file
rh_test = lgb_test * 0.5 + xgb_test * 0.5
testA['isDefault'] = rh_test
testA[['id', 'isDefault']].to_csv('test_sub.csv', index=False)