The second weekly report of data mining

Theoretical learning

Understanding ROC

This week's competition evaluation mentioned the concept of ROC, which is actually very interesting. Suppose a model is evaluated by accuracy alone, and there are 90 negatives and 10 positives. The first model predicts that everyone is negative; the second model correctly identifies 5 of the positives and predicts everyone else as negative. Their accuracies are 90% and 95%, a difference of only 5 percentage points, but the practical effect is very different: the first model never finds a single positive. In addition, in many problems the cost of missing a positive is different from the cost of missing a negative (cancer detection, for example). This is why the ROC concept, which is widely used in the medical field, is introduced.
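The accuracy comparison can be checked with a few lines of Python. This is only a sketch of the 90/10 example: the label layout and the assumption that the second model produces no false positives are mine, not stated in the original.

```python
# 90 negatives (0) followed by 10 positives (1)
y_true = [0] * 90 + [1] * 10

# Model 1: predicts everyone as negative
pred_all_negative = [0] * 100
# Model 2: finds 5 of the 10 positives (assumed: no false positives)
pred_five_found = [0] * 90 + [1] * 5 + [0] * 5


def accuracy(y_true, y_pred):
    """Fraction of samples whose prediction matches the label."""
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)


acc_1 = accuracy(y_true, pred_all_negative)
acc_2 = accuracy(y_true, pred_five_found)
print(acc_1, acc_2)
```

Both scores look high even though the first model is useless for finding positives, which is the motivation for looking at rates per class instead.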

A true positive (TP) is a sample correctly predicted as positive, a false positive (FP) is a sample incorrectly predicted as positive (actually negative), a false negative (FN) is a sample incorrectly predicted as negative (actually positive), and a true negative (TN) is a sample correctly predicted as negative.
The ROC curve has two coordinate axes. The vertical axis is the true positive rate, TPR = TP / (TP + FN), the ratio of true positives to all actual positives. The horizontal axis is the false positive rate, FPR = FP / (FP + TN), the ratio of false positives to all actual negatives.
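As a minimal sketch, the four counts and the two rates can be computed directly from labels and predictions. The toy labels below are hypothetical and chosen only to make the counts easy to verify by hand.

```python
def confusion_counts(y_true, y_pred):
    """Return (TP, FP, FN, TN) for binary labels where 1 means positive."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    return tp, fp, fn, tn


def tpr_fpr(y_true, y_pred):
    """True positive rate and false positive rate."""
    tp, fp, fn, tn = confusion_counts(y_true, y_pred)
    return tp / (tp + fn), fp / (fp + tn)


# Hypothetical toy data: 4 positives, 6 negatives
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]

counts = confusion_counts(y_true, y_pred)  # TP=3, FP=1, FN=1, TN=5
tpr, fpr = tpr_fpr(y_true, y_pred)         # 3/4 and 1/6
```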

The curve passes through (0,0) and (1,1). The larger the area between the curve and the line y=x, the better the model. In the ideal case the curve and y=x enclose an isosceles right triangle with its corner at (0,1), which means a false positive rate of 0 and a true positive rate of 1 (that is, every positive is found and no negative is flagged by mistake). Each point on the curve corresponds to one setting of the model's decision threshold: as the threshold is loosened, the false positive rate rises and the true positive rate rises with it. The area enclosed by the curve and y=x therefore indicates how well the model performs, and larger is better.
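The threshold sweep can be sketched in plain Python. The helper names `roc_points` and `area_under` are hypothetical, and the four scores are a toy example, not competition data. The area is computed with the trapezoid rule; subtracting 0.5 gives the area between the curve and y=x.

```python
def roc_points(y_true, scores):
    """Sweep the threshold from high to low and collect (FPR, TPR) points."""
    n_pos = sum(y_true)
    n_neg = len(y_true) - n_pos
    points = [(0.0, 0.0)]  # threshold above every score: nothing predicted positive
    for thr in sorted(set(scores), reverse=True):
        tp = sum(1 for t, s in zip(y_true, scores) if s >= thr and t == 1)
        fp = sum(1 for t, s in zip(y_true, scores) if s >= thr and t == 0)
        points.append((fp / n_neg, tp / n_pos))
    return points


def area_under(points):
    """Trapezoid-rule area under a curve given as (x, y) points sorted by x."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2
    return area


# Hypothetical toy example
y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]

points = roc_points(y_true, scores)
auc = area_under(points)       # area under the ROC curve
area_vs_diagonal = auc - 0.5   # area between the curve and y=x
```

In practice `sklearn.metrics.roc_auc_score`, which the script later in this report already imports, computes the area under the curve directly.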

Decision tree and random forest

Because I haven't studied data mining before, I'd better fill in the basic knowledge first.
Suppose X has n possible events x_1, ..., x_n, and event x_i occurs with probability p(x_i). The information of event x_i is

$$I(x_i) = -\log_2 p(x_i)$$

log2 is used here because computers represent information in binary, so the result is measured in bits and is easier to work with.
The information entropy H(X) is the expected information over all events, where p(x_i) is the probability that x_i occurs:

$$H(X) = -\sum_{i=1}^{n} p(x_i) \log_2 p(x_i)$$

Why can this formula express information entropy? Entropy refers to the degree of disorder; here it measures how uncertain an event is. If an event is certain to happen or certain not to happen, its entropy should be small, while events whose probability is not at either end (0 or 1) should have greater entropy.
Because p is a number between 0 and 1, log2 p lies in (-∞, 0], and the minus sign makes every term non-negative. The term -p log2 p is 0 at p = 1 and tends to 0 as p approaches 0, so events with probability 0 or 1 contribute no entropy, while events with probabilities in between contribute more.
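A small sketch of the entropy calculation, assuming the standard Shannon entropy with base-2 logarithm and the usual convention that a term with p = 0 contributes nothing:

```python
import math


def entropy(probs):
    """Shannon entropy in bits; terms with p == 0 are skipped (they contribute 0)."""
    return sum(-p * math.log2(p) for p in probs if p > 0)


h_certain = entropy([1.0, 0.0])  # a certain event: no uncertainty
h_fair = entropy([0.5, 0.5])     # a fair coin: maximum uncertainty for two outcomes
h_skewed = entropy([0.9, 0.1])   # somewhere in between
```

The certain event has entropy 0, the fair coin has entropy 1 bit, and the skewed case lies between the two.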
A decision tree is a way of thinking that we commonly use in daily life, as shown in the figure below.

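So a decision tree consists of many decisions, and after passing through these decisions the data is classified. Entropy is what determines which decision is placed at the root node: typically the decision that reduces the uncertainty of the labels the most is chosen first.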
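One common way to use entropy for choosing the root is information gain, as in ID3-style trees. The text above does not name a specific criterion, so treat this as an assumption; other implementations use the Gini index instead. The feature names and labels below are hypothetical.

```python
import math
from collections import Counter


def label_entropy(labels):
    """Entropy (in bits) of the class distribution in a list of labels."""
    total = len(labels)
    return sum(-(c / total) * math.log2(c / total) for c in Counter(labels).values())


def information_gain(labels, feature_values):
    """Entropy before the split minus the weighted entropy after splitting on the feature."""
    total = len(labels)
    groups = {}
    for value, label in zip(feature_values, labels):
        groups.setdefault(value, []).append(label)
    remainder = sum(len(g) / total * label_entropy(g) for g in groups.values())
    return label_entropy(labels) - remainder


# Hypothetical toy data: feature_a separates the classes perfectly, feature_b not at all
labels = [1, 1, 0, 0]
feature_a = ["x", "x", "y", "y"]
feature_b = ["x", "y", "x", "y"]

gain_a = information_gain(labels, feature_a)
gain_b = information_gain(labels, feature_b)
```

Under this criterion `feature_a` would be placed at the root, because splitting on it removes all uncertainty about the label.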

A random forest can be vividly understood as a democratic vote among many decision trees. To make the trees differ from one another, each tree is trained on a sample drawn with replacement (bootstrap sampling).
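The two ingredients mentioned here, sampling with replacement and voting, can be sketched as follows. This is not a full random forest (real implementations also pick a random subset of features at each split); the helper names are hypothetical.

```python
import random
from collections import Counter


def bootstrap_sample(rows, rng):
    """Draw len(rows) rows with replacement, so some rows repeat and others are left out."""
    return [rng.choice(rows) for _ in rows]


def majority_vote(predictions):
    """Return the class predicted by the largest number of trees."""
    return Counter(predictions).most_common(1)[0][0]


rng = random.Random(0)   # fixed seed so the sketch is reproducible
rows = list(range(10))   # stand-in for 10 training rows

# Each tree would be trained on its own bootstrap sample
samples = [bootstrap_sample(rows, rng) for _ in range(3)]

# Suppose three trees predict these classes for one test row
final_prediction = majority_vote([1, 0, 1])
```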


While formally taking part in the beginner competition this week, I realized the importance of EDA. If it is done well, building the model afterwards is not so difficult, so it is the key that determines success or failure. The main points are to fill in missing values, discretize continuous data, correct outliers, and normalize the features. I also learned that binning is a very effective method. One special point is the mapping of time information into numeric features.
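A minimal pandas sketch of three of these steps: filling missing values, binning, and mapping dates to numbers. The column names, bin edges, and reference date are all hypothetical and only illustrate the idea.

```python
import pandas as pd

# Hypothetical toy data
df = pd.DataFrame({
    "income": [30.0, None, 55.0, 120.0, 80.0],
    "issue_date": ["2014-07-01", "2015-01-01", "2016-03-01", "2017-10-01", "2018-12-01"],
})

# Fill the missing value with the median of the observed values
df["income"] = df["income"].fillna(df["income"].median())

# Binning: turn the continuous column into integer bin codes (0, 1, 2)
df["income_bin"] = pd.cut(df["income"], bins=[0, 50, 100, float("inf")], labels=False)

# Time mapping: number of days since an arbitrary reference date
start = pd.Timestamp("2007-06-01")  # assumed reference date
df["issue_days"] = (pd.to_datetime(df["issue_date"]) - start).dt.days

print(df)
```

Fixed bin edges are used here for readability; `pd.qcut` is an alternative when bins with roughly equal counts are preferred.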
I have also tried xgboost and catboost in a preliminary way and will study them further next week.

Submit results

import pandas as pd
import os
import gc
import lightgbm as lgb
import xgboost as xgb
from catboost import CatBoostRegressor
from sklearn.linear_model import SGDRegressor, LinearRegression, Ridge
from sklearn.preprocessing import MinMaxScaler
import math
import numpy as np
from tqdm import tqdm
from sklearn.model_selection import StratifiedKFold, KFold
from sklearn.metrics import accuracy_score, f1_score, roc_auc_score, log_loss
import matplotlib.pyplot as plt
import time
import warnings

train = pd.read_csv('train.csv')
testA = pd.read_csv('testA.csv')

data = pd.concat([train, testA], axis=0, ignore_index=True)

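# Assign the result back instead of using inplace=True on a selected column,
# which relies on chained assignment and is deprecated in recent pandas versions
data['employmentLength'] = data['employmentLength'].replace('10+ years', '10 years')
data['employmentLength'] = data['employmentLength'].replace('< 1 year', '0 years')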

def employmentLength_to_int(s):
    # Keep missing values as they are; otherwise take the leading number, e.g. '10 years' -> 10
    if pd.isnull(s):
        return s
    else:
        return np.int8(s.split()[0])

data['employmentLength'] = data['employmentLength'].apply(employmentLength_to_int)

data['earliesCreditLine'] = data['earliesCreditLine'].apply(lambda s: int(s[-4:]))

cate_features = ['grade', 'subGrade', 'employmentTitle', 'homeOwnership', 'verificationStatus', 'purpose', 'postCode', 'regionCode', \
                 'applicationType', 'initialListStatus', 'title', 'policyCode']
# for f in cate_features:
#     print(f, 'number of types:', data[f].nunique())

data = pd.get_dummies(data, columns=['grade', 'subGrade', 'homeOwnership', 'verificationStatus', 'purpose', 'regionCode'], drop_first=True)

for f in ['employmentTitle', 'postCode', 'title']:
    data[f+'_cnts'] = data.groupby([f])['id'].transform('count')
    data[f+'_rank'] = data.groupby([f])['id'].rank(ascending=False).astype(int)
    del data[f]

features = [f for f in data.columns if f not in ['id','issueDate','isDefault']]

train = data[data.isDefault.notnull()].reset_index(drop=True)
test = data[data.isDefault.isnull()].reset_index(drop=True)

x_train = train[features]
x_test = test[features]

y_train = train['isDefault']

def cv_model(clf, train_x, train_y, test_x, clf_name):
    folds = 5
    seed = 2020
    kf = KFold(n_splits=folds, shuffle=True, random_state=seed)

    train = np.zeros(train_x.shape[0])
    test = np.zeros(test_x.shape[0])

    cv_scores = []

    for i, (train_index, valid_index) in enumerate(kf.split(train_x, train_y)):
        print('************************************ {} ************************************'.format(str(i + 1)))
        trn_x, trn_y, val_x, val_y = train_x.iloc[train_index], train_y[train_index], train_x.iloc[valid_index], \
                                     train_y[valid_index]

        if clf_name == "lgb":
            train_matrix = clf.Dataset(trn_x, label=trn_y)
            valid_matrix = clf.Dataset(val_x, label=val_y)

            params = {
                'boosting_type': 'gbdt',
                'objective': 'binary',
                'metric': 'auc',
                'min_child_weight': 5,
                'num_leaves': 2 ** 5,
                'lambda_l2': 10,
                'feature_fraction': 0.8,
                'bagging_fraction': 0.8,
                'bagging_freq': 4,
                'learning_rate': 0.1,
                'seed': 2020,
                'nthread': 28,
                'n_jobs': 24,
                'silent': True,
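                'verbose': -1,
            }

            # The original call was cut off after verbose_eval. early_stopping_rounds=200 is an assumed value;
            # early stopping is implied by the use of model.best_iteration below.
            # In LightGBM 4.x these two keyword arguments were removed from train(); use
            # callbacks=[clf.early_stopping(200), clf.log_evaluation(200)] there instead.
            model = clf.train(params, train_matrix, 50000, valid_sets=[train_matrix, valid_matrix], verbose_eval=200,
                              early_stopping_rounds=200)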
            val_pred = model.predict(val_x, num_iteration=model.best_iteration)
            test_pred = model.predict(test_x, num_iteration=model.best_iteration)

            # print(list(sorted(zip(features, model.feature_importance("gain")), key=lambda x: x[1], reverse=True))[:20])

        if clf_name == "xgb":
            train_matrix = clf.DMatrix(trn_x, label=trn_y)
            valid_matrix = clf.DMatrix(val_x, label=val_y)
            test_matrix = clf.DMatrix(test_x)

            params = {'booster': 'gbtree',
                      'objective': 'binary:logistic',
                      'eval_metric': 'auc',
                      'gamma': 1,
                      'min_child_weight': 1.5,
                      'max_depth': 5,
                      'lambda': 10,
                      'subsample': 0.7,
                      'colsample_bytree': 0.7,
                      'colsample_bylevel': 0.7,
                      'eta': 0.04,
                      'tree_method': 'exact',
                      'seed': 2020,
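                      'nthread': 36,
                      "silent": True,
                      }

            watchlist = [(train_matrix, 'train'), (valid_matrix, 'eval')]

            # The original call was cut off after verbose_eval. early_stopping_rounds=200 is an assumed value;
            # early stopping is implied by the use of model.best_ntree_limit below.
            model = clf.train(params, train_matrix, num_boost_round=50000, evals=watchlist, verbose_eval=200,
                              early_stopping_rounds=200)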
            val_pred = model.predict(valid_matrix, ntree_limit=model.best_ntree_limit)
            test_pred = model.predict(test_matrix, ntree_limit=model.best_ntree_limit)

        if clf_name == "cat":
            params = {'learning_rate': 0.05, 'depth': 5, 'l2_leaf_reg': 10, 'bootstrap_type': 'Bernoulli',
                      'od_type': 'Iter', 'od_wait': 50, 'random_seed': 11, 'allow_writing_files': False}

            model = clf(iterations=20000, **params)
            model.fit(trn_x, trn_y, eval_set=(val_x, val_y),
                      cat_features=[], use_best_model=True, verbose=500)

            val_pred = model.predict(val_x)
            test_pred = model.predict(test_x)

        train[valid_index] = val_pred
        # Accumulate the average over folds; plain assignment would keep only the last fold
        test += test_pred / kf.n_splits
        cv_scores.append(roc_auc_score(val_y, val_pred))


    print("%s_score_list:" % clf_name, cv_scores)
    print("%s_score_mean:" % clf_name, np.mean(cv_scores))
    print("%s_score_std:" % clf_name, np.std(cv_scores))
    return train, test

def lgb_model(x_train, y_train, x_test):
    lgb_train, lgb_test = cv_model(lgb, x_train, y_train, x_test, "lgb")
    return lgb_train, lgb_test

def xgb_model(x_train, y_train, x_test):
    xgb_train, xgb_test = cv_model(xgb, x_train, y_train, x_test, "xgb")
    return xgb_train, xgb_test

def cat_model(x_train, y_train, x_test):
    cat_train, cat_test = cv_model(CatBoostRegressor, x_train, y_train, x_test, "cat")
    return cat_train, cat_test

lgb_train, lgb_test = lgb_model(x_train, y_train, x_test)
xgb_train, xgb_test = xgb_model(x_train, y_train, x_test)
cat_train, cat_test = cat_model(x_train, y_train, x_test)

rh_test = lgb_test*0.5 + xgb_test*0.5

testA['isDefault'] = rh_test
testA[['id','isDefault']].to_csv('test_sub.csv', index=False)

Tags: Data Mining

Posted by gdhanasekar on Fri, 13 May 2022 17:30:59 +0300