1, Competition data
Data can be downloaded from the official website (you need to sign up first): https://tianchi.aliyun.com/competition/entrance/531830/information
The task of the competition is to predict whether users will default on their loans. The data comes from the loan records of a credit platform. There are more than 1.2 million records in total, with 47 columns of variable information, of which 15 are anonymous variables. To keep the competition fair, 800,000 records are used as the training set, 200,000 as test set A and 200,000 as test set B. In addition, fields such as employmentTitle, purpose, postCode and title are desensitized.
Field table
Field | Description |
---|---|
id | Unique credit identifier assigned to the loan listing |
loanAmnt | Loan amount |
term | Loan term (years) |
interestRate | Loan interest rate |
installment | Installment amount |
grade | Loan grade |
subGrade | Sub level of loan grade |
employmentTitle | Employment title |
employmentLength | Length of employment (years) |
homeOwnership | Home ownership status provided by the borrower at registration |
annualIncome | Annual income |
verificationStatus | Verification status |
issueDate | Month of loan issuance |
purpose | Loan purpose category of the borrower at the time of loan application |
postCode | The first three digits of the postal code provided by the borrower in the loan application |
regionCode | Area code |
dti | Debt to income ratio |
delinquency_2years | Number of delinquencies of more than 30 days past due in the borrower's credit file over the past two years |
ficoRangeLow | Lower bound of the borrower's FICO range at the time of loan issuance |
ficoRangeHigh | Upper bound of the borrower's FICO range at the time of loan issuance |
openAcc | The number of open credit lines in the borrower's credit file |
pubRec | Number of derogatory public records |
pubRecBankruptcies | Number of public records cleared |
revolBal | Total revolving credit balance |
revolUtil | Revolving credit line utilization, or the amount of credit used by the borrower relative to all available revolving credit facilities |
totalAcc | Total number of credit lines currently in the borrower's credit file |
initialListStatus | Initial list status of loans |
applicationType | Indicates whether the loan is an individual application or a joint application with two co-borrowers |
earliesCreditLine | The month in which the borrower's earliest reported credit line was opened |
title | Name of loan provided by the borrower |
policyCode | Publicly available: policy_code = 1; new product not publicly available: policy_code = 2 |
n-series anonymous features | Anonymous features n0-n14, derived by processing count features of certain lender behaviors |
2, Data analysis
2.1 main contents
- Overall understanding of the data:
  - Read the data set and check its size and the original feature dimension;
  - Get familiar with the data types through info;
  - Roughly view the basic statistics of each feature in the data set;
- Missing and unique values:
  - Check the missing values
  - Check features with a single unique value
- Drill down into the data by data type
  - Categorical data
  - Numerical data
    - Discrete numerical data
    - Continuous numerical data
- Correlation within the data
  - Relationships between features
  - Relationships between features and the target variable
- Generating a data report with pandas_profiling
2.2 code example
2.2.1 import the libraries required for data analysis and visualization

    import pandas as pd
    import numpy as np
    import matplotlib.pyplot as plt
    import seaborn as sns
    import datetime
    import warnings

    warnings.filterwarnings('ignore')

2.2.2 reading files

    data_train = pd.read_csv('F:/python/Alibaba cloud financial risk control-Loan default forecast/train.csv')
    data_test_a = pd.read_csv('F:/python/Alibaba cloud financial risk control-Loan default forecast/testA.csv')

2.2.3 general understanding
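
    data_test_a.shape  # (200000, 48)
    data_train.shape   # (800000, 47)
    data_train.columns
    data_train.info()
    data_train.describe()
    # Show the first and last three rows together.
    # DataFrame.append was removed in pandas 2.0, so pd.concat is used instead.
    pd.concat([data_train.head(3), data_train.tail(3)])
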
2.2.4 view the missing value and unique value of features in the data set

    print(f'There are {data_train.isnull().any().sum()} columns in train dataset with missing values.')
    # There are 22 columns in train dataset with missing values.

    # Visualize the share of missing values for each feature that has any
    missing = data_train.isnull().sum() / len(data_train)
    missing = missing[missing > 0]
    missing.sort_values(inplace=True)
    missing.plot.bar()

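Alongside the bar chart, a small table of missing counts and ratios can be easier to scan. The following is a minimal sketch on a made-up toy frame standing in for `data_train`; the helper name `missing_summary` and the toy values are illustrative assumptions, not part of the competition code.

```python
import numpy as np
import pandas as pd


def missing_summary(df: pd.DataFrame) -> pd.DataFrame:
    """Return the count and ratio of missing values for columns that have any."""
    counts = df.isnull().sum()
    summary = pd.DataFrame({
        "n_missing": counts,
        "missing_ratio": counts / len(df),
    })
    # Keep only columns with at least one missing value, worst first
    summary = summary[summary["n_missing"] > 0]
    return summary.sort_values("missing_ratio", ascending=False)


# Toy data (hypothetical values); the column names are borrowed from the field table
toy = pd.DataFrame({
    "loanAmnt": [1000.0, 2000.0, 3000.0, 4000.0],
    "employmentLength": ["2 years", None, None, "10+ years"],
    "n0": [0.0, np.nan, 1.0, 2.0],
})
summary = missing_summary(toy)
print(summary)
```

On the real data the same helper could be called as `missing_summary(data_train)`.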
Find the features that take only a single value in the training set and in the test set

    one_value_fea = [col for col in data_train.columns if data_train[col].nunique() <= 1]
    one_value_fea_test = [col for col in data_test_a.columns if data_test_a[col].nunique() <= 1]
    print(one_value_fea, one_value_fea_test)
    # ['policyCode'] ['policyCode']

    data_train['policyCode'].value_counts()
    # 1.0    800000
    # Name: policyCode, dtype: int64

    # policyCode is constant, so it can be dropped
    data_train = data_train.drop(['policyCode'], axis=1)
    data_test_a = data_test_a.drop(['policyCode'], axis=1)
    print(data_train.shape, data_test_a.shape)
    data_train.columns, data_test_a.columns

2.2.5 check the numerical types and object types of features
- Features generally consist of categorical features and numerical features.
- Categorical features are sometimes purely nominal and sometimes ordered. For example, whether the grades A, B and C in 'grade' are just labels or whether A ranks above the others has to be judged together with the business context.
- Numerical features can be fed into the model directly, but risk control practitioners often bin them first, convert the bins into WOE encodings, and then build a standard scorecard on top. From the perspective of model performance, feature binning mainly reduces the complexity of a variable, dampens the effect of noise in the variable on the model, and strengthens the relationship between the independent variables and the dependent variable, which makes the model more stable.
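To make the binning and WOE step concrete, here is a minimal sketch on synthetic data. It assumes equal-frequency binning with `pd.qcut` and the convention WOE = ln(share of good / share of bad) per bin; some references use the opposite sign. The function name `woe_by_bin`, the 0.5 smoothing constant and the toy data are illustrative assumptions rather than the method used in the competition.

```python
import numpy as np
import pandas as pd


def woe_by_bin(values: pd.Series, target: pd.Series, n_bins: int = 4) -> pd.DataFrame:
    """Equal-frequency binning followed by WOE = ln(%good / %bad) per bin."""
    bins = pd.qcut(values, q=n_bins, duplicates="drop")
    grouped = target.groupby(bins, observed=True)
    bad = grouped.sum()        # target == 1 means default ("bad")
    total = grouped.count()
    good = total - bad
    k = len(total)
    # Add 0.5 to every cell so that empty cells do not produce log(0)
    pct_bad = (bad + 0.5) / (bad.sum() + 0.5 * k)
    pct_good = (good + 0.5) / (good.sum() + 0.5 * k)
    return pd.DataFrame({
        "n": total,
        "bad_rate": bad / total,
        "woe": np.log(pct_good / pct_bad),
    })


# Synthetic example: every loan with a rate above 15 defaults (made-up rule)
rate = pd.Series(np.linspace(5, 25, 200), name="interestRate")
is_default = (rate > 15).astype(int)
woe_table = woe_by_bin(rate, is_default, n_bins=4)
print(woe_table)
```

With this sign convention, bins with a low default rate get a positive WOE and bins with a high default rate get a negative one.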

    data_train.info()
    # Numerical features are all non-object columns; the remaining columns are categorical
    numerical_fea = list(data_train.select_dtypes(exclude=['object']).columns)
    category_fea = list(filter(lambda x: x not in numerical_fea, list(data_train.columns)))

['id', 'loanAmnt', 'term', 'interestRate', 'installment', 'employmentTitle', 'homeOwnership', 'annualIncome', 'verificationStatus', 'isDefault', 'purpose', 'postCode', 'regionCode', 'dti', 'delinquency_2years', 'ficoRangeLow', 'ficoRangeHigh', 'openAcc', 'pubRec', 'pubRecBankruptcies', 'revolBal', 'revolUtil', 'totalAcc', 'initialListStatus', 'applicationType', 'title', 'n0', 'n1', 'n2', 'n2.1', 'n4', 'n5', 'n6', 'n7', 'n8', 'n9', 'n10', 'n11', 'n12', 'n13', 'n14']
['grade', 'subGrade', 'employmentLength', 'issueDate', 'earliesCreditLine']
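Whether an object-type feature such as 'grade' is treated as ordered or as purely nominal changes how it is encoded. The sketch below shows both options on a toy column; the order A, B, C is an assumption made for illustration, and whether it reflects a real ranking has to be confirmed against the business meaning.

```python
import pandas as pd

# Toy column (made-up values) standing in for data_train['grade']
toy = pd.DataFrame({"grade": ["B", "A", "C", "A"]})

# Option 1: treat grade as ordered and map it to integer ranks.
# The order below is an assumption, not something stated by the data provider.
grade_order = ["A", "B", "C"]
ordered = pd.Categorical(toy["grade"], categories=grade_order, ordered=True)
toy["grade_rank"] = ordered.codes  # values outside grade_order would become -1

# Option 2: treat grade as nominal and one-hot encode it
one_hot = pd.get_dummies(toy["grade"], prefix="grade")

print(toy)
print(one_hot)
```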
Numerical variable analysis. Numerical variables include both continuous and discrete variables, so the next step is to separate them.

    # Split the numerical features into continuous and discrete ones
    def get_numerical_serial_fea(data, feas):
        '''
        Purpose: split numerical variables into continuous and discrete variables
        data: data set to be divided
        feas: names of the features to be distinguished
        Returns: a list of continuous variables and a list of discrete variables
        '''
        numerical_serial_fea = []
        numerical_noserial_fea = []
        for fea in feas:
            temp = data[fea].nunique()
            # Features with at most 10 distinct values are treated as discrete
            if temp <= 10:
                numerical_noserial_fea.append(fea)
                continue
            numerical_serial_fea.append(fea)
        return numerical_serial_fea, numerical_noserial_fea

    numerical_serial_fea, numerical_noserial_fea = get_numerical_serial_fea(data_train, numerical_fea)
    numerical_serial_fea, numerical_noserial_fea

(['id', 'loanAmnt', 'interestRate', 'installment', 'employmentTitle', 'annualIncome', 'purpose', 'postCode', 'regionCode', 'dti', 'delinquency_2years', 'ficoRangeLow', 'ficoRangeHigh', 'openAcc', 'pubRec', 'pubRecBankruptcies', 'revolBal', 'revolUtil', 'totalAcc', 'title', 'n0', 'n1', 'n2', 'n2.1', 'n4', 'n5', 'n6', 'n7', 'n8', 'n9', 'n10', 'n13', 'n14'], ['term', 'homeOwnership', 'verificationStatus', 'isDefault', 'initialListStatus', 'applicationType', 'n11', 'n12'])
Next, look closely at each of the discrete numerical variables
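One way to do this is to print the value distribution of each discrete feature together with its default rate. The sketch below uses a small made-up frame; the column names come from the lists above, while the values and the choice of `isDefault` mean as a summary are illustrative assumptions.

```python
import pandas as pd

# Toy data (hypothetical values) standing in for data_train
toy = pd.DataFrame({
    "term": [3, 3, 5, 3, 5, 3],
    "homeOwnership": [0, 1, 0, 0, 2, 1],
    "isDefault": [0, 1, 1, 0, 0, 0],
})
discrete_fea = ["term", "homeOwnership"]

counts = {}
default_rate = {}
for fea in discrete_fea:
    # How often each value occurs, including missing values
    counts[fea] = toy[fea].value_counts(dropna=False).sort_index()
    # Share of defaults within each value of the feature
    default_rate[fea] = toy.groupby(fea)["isDefault"].mean()
    print(counts[fea])
    print(default_rate[fea])
```

On the real data, `discrete_fea` could be replaced by `numerical_noserial_fea` with the target column removed.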