Alibaba Cloud financial risk control: loan default prediction

1, Competition data

The data can be downloaded from the official website (you need to sign up first): https://tianchi.aliyun.com/competition/entrance/531830/information

The task of the competition is to predict whether a user will default on a loan. The data set can be viewed and downloaded after registration. The data comes from the loan records of a credit platform and contains more than 1.2 million records with 47 columns of variables, 15 of which are anonymous. To keep the competition fair, 800,000 records are selected as the training set, 200,000 as test set A, and 200,000 as test set B. In addition, fields such as employmentTitle, purpose, postCode, and title are desensitized.

Field table

Field               Description
id                  Unique identifier assigned to the loan listing
loanAmnt            Loan amount
term                Loan term (years)
interestRate        Loan interest rate
installment         Installment amount
grade               Loan grade
subGrade            Loan sub-grade
employmentTitle     Employment title
employmentLength    Length of employment (years)
homeOwnership       Home ownership status provided by the borrower at registration
annualIncome        Annual income
verificationStatus  Verification status
issueDate           Month in which the loan was issued
purpose             Loan purpose category given by the borrower in the loan application
postCode            First three digits of the postal code provided by the borrower in the loan application
regionCode          Region code
dti                 Debt-to-income ratio
delinquency_2years  Number of delinquencies of more than 30 days past due in the borrower's credit file over the past two years
ficoRangeLow        Lower bound of the borrower's FICO range at the time of loan issuance
ficoRangeHigh       Upper bound of the borrower's FICO range at the time of loan issuance
openAcc             Number of open credit lines in the borrower's credit file
pubRec              Number of derogatory public records
pubRecBankruptcies  Number of public records cleared
revolBal            Total revolving credit balance
revolUtil           Revolving line utilization rate, i.e. the amount of credit the borrower is using relative to all available revolving credit
totalAcc            Total number of credit lines currently in the borrower's credit file
initialListStatus   Initial listing status of the loan
applicationType     Whether the loan is an individual application or a joint application with two co-borrowers
earliesCreditLine   Month in which the borrower's earliest reported credit line was opened
title               Loan title provided by the borrower
policyCode          policy_code = 1: publicly available policy; policy_code = 2: new product, not publicly available
n0-n14              Anonymous n-series features, derived by processing count features of some lender behaviors

2, Data analysis

2.1 Main contents

  • Overall understanding of the data:
    • Read the data set and check its size and the original feature dimensions;
    • Get familiar with the data types using info();
    • Take a rough look at the basic statistics of each feature;
  • Missing values and unique values:
    • Check for missing values
    • Check for features with a single unique value
  • Drill down into the data by type
    • Categorical data
    • Numerical data
      • Discrete numerical data
      • Continuous numerical data
  • Correlations in the data
    • Relationships between features
    • Relationships between features and the target variable
  • Generate a data report with pandas_profiling
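
The correlation step in the outline can be sketched on a small synthetic frame. The column names follow the field table, but every value below is invented for illustration, and using Pearson correlation through `DataFrame.corr` is an assumption about the method, not something the outline specifies.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for a few training columns (all values are invented).
rng = np.random.default_rng(0)
n = 500
loan_amnt = rng.uniform(1_000, 40_000, size=n)
toy = pd.DataFrame({
    "loanAmnt": loan_amnt,
    # installment is constructed to track loanAmnt closely
    "installment": loan_amnt / 36 + rng.normal(0, 20, size=n),
    "dti": rng.uniform(0, 40, size=n),
})
# Make default more likely at high dti so there is a relationship to find.
toy["isDefault"] = (toy["dti"] + rng.normal(0, 10, size=n) > 30).astype(int)

# Feature-to-feature correlation matrix (Pearson is the pandas default).
feature_corr = toy.drop(columns="isDefault").corr()

# Feature-to-target correlation, sorted by absolute strength.
target_corr = (
    toy.corr()["isDefault"]
    .drop("isDefault")
    .sort_values(key=np.abs, ascending=False)
)

print(feature_corr.round(2))
print(target_corr.round(2))
```

On the real data the same two calls would run on the numerical columns of `data_train`. Strongly correlated feature pairs are candidates for removing one of the two.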

2.2 Code example

2.2.1 Import the libraries required for data analysis and visualization

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import datetime
import warnings
warnings.filterwarnings('ignore')

2.2.2 Read the files

data_train = pd.read_csv('F:/python/Alibaba cloud financial risk control-Loan default forecast/train.csv')
data_test_a = pd.read_csv('F:/python/Alibaba cloud financial risk control-Loan default forecast/testA.csv')

2.2.3 General overview

data_test_a.shape  #(200000, 48)
data_train.shape  #(800000, 47)
data_train.columns
data_train.info()
data_train.describe()
# DataFrame.append was removed in pandas 2.0; pd.concat works in old and new versions
pd.concat([data_train.head(3), data_train.tail(3)])
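
Because the two frames differ in column count, it can help to check which columns are not shared. A minimal sketch on toy frames follows. The helper `column_diff` and the column `extra_col` are hypothetical and only stand in for whatever actually differs in the real files.

```python
import pandas as pd

def column_diff(train: pd.DataFrame, test: pd.DataFrame):
    """Return (columns only in train, columns only in test), each sorted."""
    train_cols, test_cols = set(train.columns), set(test.columns)
    return sorted(train_cols - test_cols), sorted(test_cols - train_cols)

# Toy frames; extra_col is a made-up name for illustration.
toy_train = pd.DataFrame({"id": [1, 2], "loanAmnt": [1000.0, 2000.0], "isDefault": [0, 1]})
toy_test = pd.DataFrame({"id": [3], "loanAmnt": [1500.0], "extra_col": [0.0]})

only_train, only_test = column_diff(toy_train, toy_test)
print(only_train, only_test)
```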

2.2.4 View missing values and unique values of features in the data set

print(f'There are {data_train.isnull().any().sum()} columns in train dataset with missing values.')
#There are 22 columns in train dataset with missing values.
# Visualize the share of missing values per column
missing = data_train.isnull().sum()/len(data_train)
missing = missing[missing > 0]
missing.sort_values(inplace=True)
missing.plot.bar()
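
The same missing-ratio calculation can be shown on a toy frame so that the result is easy to check by hand. The helper name `missing_summary` and the toy values are made up for illustration.

```python
import numpy as np
import pandas as pd

def missing_summary(df: pd.DataFrame) -> pd.DataFrame:
    """Count and ratio of missing values, for columns that have any."""
    counts = df.isnull().sum()
    summary = pd.DataFrame({"n_missing": counts, "ratio": counts / len(df)})
    return summary[summary["n_missing"] > 0].sort_values("ratio")

toy = pd.DataFrame({
    "loanAmnt": [1000.0, 2000.0, 3000.0, 4000.0],
    "dti": [10.0, np.nan, 12.5, 30.0],
    "revolUtil": [np.nan, np.nan, 55.0, np.nan],
})

summary = missing_summary(toy)
print(summary)
```

Keeping the raw count next to the ratio makes it easier to decide between filling a column and dropping it.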

Find the features that take only a single value in the training set and in the test set

one_value_fea = [col for col in data_train.columns if data_train[col].nunique() <= 1]
one_value_fea_test = [col for col in data_test_a.columns if data_test_a[col].nunique() <= 1]
print(one_value_fea, one_value_fea_test)
# ['policyCode'] ['policyCode']
data_train['policyCode'].value_counts()
# 1.0    800000
# Name: policyCode, dtype: int64
# policyCode takes a single value, so it carries no information and can be dropped
data_train = data_train.drop(['policyCode'], axis=1)
data_test_a = data_test_a.drop(['policyCode'], axis=1)
print(data_train.shape, data_test_a.shape)
data_train.columns, data_test_a.columns
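
The same idea can be wrapped in a reusable helper that finds the constant columns in the training frame and drops them from both frames. This is a sketch on toy data, and `drop_constant_columns` is a hypothetical helper, not part of pandas.

```python
import pandas as pd

def drop_constant_columns(train: pd.DataFrame, test: pd.DataFrame):
    """Drop columns that have at most one distinct non-null value in train."""
    # nunique() ignores NaN by default, so an all-NaN column counts as 0.
    constant = [col for col in train.columns if train[col].nunique() <= 1]
    # errors="ignore" skips names that the test frame does not contain.
    return (
        train.drop(columns=constant),
        test.drop(columns=constant, errors="ignore"),
        constant,
    )

toy_train = pd.DataFrame({
    "loanAmnt": [1000.0, 2000.0, 3000.0],
    "policyCode": [1.0, 1.0, 1.0],
    "isDefault": [0, 1, 0],
})
toy_test = pd.DataFrame({"loanAmnt": [1500.0, 2500.0], "policyCode": [1.0, 1.0]})

new_train, new_test, dropped = drop_constant_columns(toy_train, toy_test)
print(dropped, new_train.shape, new_test.shape)
```

Deciding from the training frame only keeps the two frames aligned, at the cost of ignoring columns that happen to be constant only in the test set.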

2.2.5 Check the numerical and object types of the features

  • Features generally consist of categorical features and numerical features.
  • Categorical features are sometimes purely nominal and sometimes ordered. For example, whether the grades A, B, and C in 'grade' are just labels or whether A ranks above the others has to be judged from the business context.
  • Numerical features can be fed into a model directly, but risk control practitioners often bin them first, convert the bins to WOE (weight of evidence) encodings, and then build standard scorecards. From a modeling perspective, binning mainly reduces the complexity of a variable, dampens the effect of noise, and strengthens the relationship between the independent variables and the dependent variable, which makes the model more stable.
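
As a rough illustration of the binning and WOE step mentioned in the last bullet, the sketch below bins one numerical feature into equal-frequency bins and computes a WOE value per bin. Several things here are assumptions rather than the author's method: the helper name `woe_table`, the choice of `pd.qcut` with four bins, the sign convention (share of bads over share of goods), and the 0.5 smoothing term. The data is synthetic.

```python
import numpy as np
import pandas as pd

def woe_table(feature: pd.Series, target: pd.Series, n_bins: int = 4) -> pd.DataFrame:
    """Equal-frequency binning followed by WOE per bin.

    Convention assumed here: WOE = ln(share of bads / share of goods),
    where "bad" means target == 1. Some references use the inverse ratio.
    """
    bins = pd.qcut(feature, q=n_bins, duplicates="drop")
    stats = target.groupby(bins, observed=True).agg(["sum", "count"])
    bad = stats["sum"]
    good = stats["count"] - stats["sum"]
    # Add 0.5 to every cell so that empty cells do not produce log(0).
    bad_share = (bad + 0.5) / (bad + 0.5).sum()
    good_share = (good + 0.5) / (good + 0.5).sum()
    return pd.DataFrame({"bad": bad, "good": good, "woe": np.log(bad_share / good_share)})

# Synthetic data: defaults concentrate at high dti, with two exceptions.
dti = pd.Series(np.arange(40, dtype=float), name="dti")
is_default = (dti >= 28).astype(int).rename("isDefault")
is_default.iloc[5] = 1
is_default.iloc[35] = 0

table = woe_table(dti, is_default)
print(table)
```

Under this convention a positive WOE marks a bin with more than its share of defaults. In a scorecard workflow the WOE value would then replace the raw feature value for every row that falls in the bin.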
data_train.info()
numerical_fea = list(data_train.select_dtypes(exclude=['object']).columns)
category_fea = [col for col in data_train.columns if col not in numerical_fea]

['id',
 'loanAmnt',
 'term',
 'interestRate',
 'installment',
 'employmentTitle',
 'homeOwnership',
 'annualIncome',
 'verificationStatus',
 'isDefault',
 'purpose',
 'postCode',
 'regionCode',
 'dti',
 'delinquency_2years',
 'ficoRangeLow',
 'ficoRangeHigh',
 'openAcc',
 'pubRec',
 'pubRecBankruptcies',
 'revolBal',
 'revolUtil',
 'totalAcc',
 'initialListStatus',
 'applicationType',
 'title',
 'n0',
 'n1',
 'n2',
 'n2.1',
 'n4',
 'n5',
 'n6',
 'n7',
 'n8',
 'n9',
 'n10',
 'n11',
 'n12',
 'n13',
 'n14']
['grade', 'subGrade', 'employmentLength', 'issueDate', 'earliesCreditLine']
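
Since 'grade' is among the object-typed columns, it has to be encoded before modeling. If the business reading is that grade A ranks above B, and so on, one option is an ordinal encoding. The sketch below assumes that alphabetical order runs from best to worst, which should be confirmed against the data. The toy values are invented.

```python
import pandas as pd

toy = pd.DataFrame({"grade": ["B", "A", "C", "A"]})

# Assumption: alphabetical order runs from best (A) to worst.
grade_order = sorted(toy["grade"].dropna().unique())
grade_to_rank = {grade: rank for rank, grade in enumerate(grade_order, start=1)}

# Values missing from the mapping (including NaN) would become NaN here.
toy["grade_rank"] = toy["grade"].map(grade_to_rank)
print(toy)
```

If the grades turn out to be plain labels, a one-hot encoding would be the safer choice.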

Numerical variable analysis: numerical variables include both continuous and discrete variables. The function below separates the two.

# Split numerical features into continuous and discrete ones
def get_numerical_serial_fea(data,feas):
    '''
    Purpose: split numerical variables into continuous and discrete variables.
    A feature with at most 10 distinct values is treated as discrete.
    data: data set to split
    feas: names of the features to classify
    Returns: two lists, the continuous features and the discrete features
    '''
    numerical_serial_fea = []
    numerical_noserial_fea = []
    for fea in feas:
        temp = data[fea].nunique()
        if temp <= 10:
            numerical_noserial_fea.append(fea)
            continue
        numerical_serial_fea.append(fea)
    return numerical_serial_fea,numerical_noserial_fea
numerical_serial_fea,numerical_noserial_fea = get_numerical_serial_fea(data_train,numerical_fea)
numerical_serial_fea,numerical_noserial_fea

(['id',
  'loanAmnt',
  'interestRate',
  'installment',
  'employmentTitle',
  'annualIncome',
  'purpose',
  'postCode',
  'regionCode',
  'dti',
  'delinquency_2years',
  'ficoRangeLow',
  'ficoRangeHigh',
  'openAcc',
  'pubRec',
  'pubRecBankruptcies',
  'revolBal',
  'revolUtil',
  'totalAcc',
  'title',
  'n0',
  'n1',
  'n2',
  'n2.1',
  'n4',
  'n5',
  'n6',
  'n7',
  'n8',
  'n9',
  'n10',
  'n13',
  'n14'],
 ['term',
  'homeOwnership',
  'verificationStatus',
  'isDefault',
  'initialListStatus',
  'applicationType',
  'n11',
  'n12'])

Next, take a closer look at each of the discrete numerical variables
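
One way to do this is to print the value counts of each low-cardinality column. The sketch below uses a toy frame with invented values. In the walkthrough, the loop would run over `numerical_noserial_fea` on `data_train` instead.

```python
import pandas as pd

# Toy stand-in for a few discrete numerical columns (values are invented).
toy = pd.DataFrame({
    "term": [3, 3, 5, 3, 5, 3],
    "homeOwnership": [0, 1, 0, 0, 2, 1],
    "isDefault": [0, 0, 1, 0, 1, 0],
})
discrete_cols = ["term", "homeOwnership", "isDefault"]

counts = {}
for col in discrete_cols:
    # normalize=True returns shares instead of raw counts.
    counts[col] = toy[col].value_counts(normalize=True).sort_index()
    print(f"--- {col} ---")
    print(counts[col])
```

Columns where a single value holds almost all of the share carry little signal and are worth a second look before modeling.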

Posted by cneale on Wed, 18 May 2022 10:42:03 +0300