Alibaba Tianchi Zero-Basis Introduction to Financial Risk Control: Text Processing for Loan Default Prediction

Text processing

Date processing

earliesCreditLine:

You can see that earliesCreditLine stores the date as an English month abbreviation followed by the year (e.g. 'Aug-2001').
Here we split the month and year on '-' with a regular expression, use judgment statements to convert the English month into a numeric month, and finally unify the storage format of the two columns.
The code segment below is not ideal. It is better to use

df['earliesCreditLine']

to obtain the earliesCreditLine column by name, because the column positions may differ when the test data is processed later, so positional indexing reduces portability.
Splitting code segment

import re
import pandas as pd

# df is assumed to have been loaded earlier, e.g. df = pd.read_csv('./train.csv')
issueDate_originaldata = df.iloc[:, -18]  # Get the earliesCreditLine column
compiled_re = re.compile(r"-")  # Regular expression to split on '-'
i = 0
issueDate_afterdata = []  # Storage for the split data
for date in issueDate_originaldata:
    issueDate_afterdata.append(compiled_re.split(date))
    i = i + 1  # Count of rows processed
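For example, a value in the month-year form described above (sample value hypothetical) splits into a two-element list:

compiled_re.split('Aug-2001')  # -> ['Aug', '2001']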

Month conversion code snippet

# Convert the storage format to year-month ('YYYY-MM') format
month_map = {'Jan': '01', 'Feb': '02', 'Mar': '03', 'Apr': '04',
             'May': '05', 'Jun': '06', 'Jul': '07', 'Aug': '08',
             'Sep': '09', 'Oct': '10', 'Nov': '11', 'Dec': '12'}
issueDate_lastdata = []
for j in range(i):
    month, year = issueDate_afterdata[j]
    issueDate_lastdata.append(year + '-' + month_map[month])
# Store in DataFrame format
df_issueDate_lastdata = pd.DataFrame(
    issueDate_lastdata,
    columns=['issueDate_lastdata'])
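As a side note, pandas can also do this conversion directly; a minimal sketch, assuming the 'Mon-YYYY' format described above:

df_issueDate_lastdata = pd.DataFrame({
    'issueDate_lastdata': pd.to_datetime(df['earliesCreditLine'], format='%b-%Y').dt.strftime('%Y-%m')
})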

issueDate


First, you can see that the loan issue date stored in issueDate is in year-month-day format.
Here we use the same regular expression as above to separate the year, month and day.
Then we can calculate the difference between the two dates.
Note that df.iloc[j, 12] returns a single value, while df.loc[j, ['column name']] returns a Series rather than a scalar, so it cannot be operated on directly.

print(df.iloc[1, 12])
print(df.loc[1, ['issueDate']])

Code segment for calculating the difference between dates

countday = []
for j in range(i):
    last_date = df.iloc[j, 12]  # issueDate column
    begin_date = df_issueDate_lastdata.iloc[j, 0]  # Processed earliesCreditLine column
    # Note: df_issueDate_lastdata.iloc[j, :] would not work here; even though it
    # selects a single row, it returns a Series, not a scalar value.
    last_date_number = compiled_re.split(last_date)
    begin_date_number = compiled_re.split(begin_date)
    mon = int(last_date_number[1]) - int(begin_date_number[1])
    year = int(last_date_number[0]) - int(begin_date_number[0])
    day = int(last_date_number[2])
    total = year * 365 + mon * 31 + day  # Rough approximation: 365 days/year, 31 days/month
    countday.append(total)
print(countday)
df['countday'] = countday
df.drop(['issueDate'], axis=1, inplace=True)
df.drop(['earliesCreditLine'], axis=1, inplace=True)  # Pitfall: add inplace=True to drop in place
print(df)
df.to_csv('./Result.csv', index=False)  # Drop the index when saving, otherwise it is written as an extra first column
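For reference, the day count can also be computed exactly with pandas datetime arithmetic rather than the year*365 + month*31 approximation; a minimal sketch, assuming df still holds the raw issueDate and earliesCreditLine columns:

issue = pd.to_datetime(df['issueDate'])  # already in year-month-day form
credit = pd.to_datetime(df['earliesCreditLine'], format='%b-%Y')
df['countday'] = (issue - credit).dt.days  # exact difference in days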

Grade processing



You can see that subGrade already contains the grade information, so only the subGrade column needs to be used to build a normalized score.
Designing the mapping requires knowing the range of grades, so we first find the maximum value of the grade column:

print("max=")
print(df['grade'].max())


When converting subGrade, the letter and the digit have to be extracted separately, so two regular expressions are used.
Regular expressions for extracting letters and digits

compiled_re2 = re.compile(r"[A-Za-z]")
compiled_re3 = re.compile(r"\d")
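For example, on a hypothetical subGrade value 'C4' (the letter-plus-digit form described above):

compiled_re2.findall('C4')  # -> ['C']
compiled_re3.findall('C4')  # -> ['4']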

Convert subGrade snippet

gradescore = []
for subGrade in df['subGrade']:
    a = compiled_re2.findall(subGrade)
    b = compiled_re3.findall(subGrade)

    letter = a[0].upper()
    if 'A' <= letter <= 'G':
        score = (ord(letter) - ord('A') + 1) * 10  # A->10, B->20, ..., G->70
    else:
        score = 0
    score += int(b[0])  # b is a list, so take element [0]
    gradescore.append(score)
df['gradescore']=gradescore#Add gradescore column
df.drop(['grade'],axis=1,inplace=True)
df.drop(['subGrade'],axis=1,inplace=True)
df.to_csv('./Result.csv',index=False)
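The same mapping can also be written in vectorized form; a minimal sketch, assuming the raw subGrade column is still present and every value is one letter followed by one digit (e.g. 'C4') with no missing values:

letter_score = {c: (k + 1) * 10 for k, c in enumerate('ABCDEFG')}
df['gradescore'] = (df['subGrade'].str[0].str.upper().map(letter_score).fillna(0)
                    + df['subGrade'].str[1].astype(int))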

Processing years of employment


The processing of this data is roughly the same as the grade processing. The difference is that this column contains NaN values; if they are not handled before the regular-expression matching, the program raises an error.
Code snippet for processing years of employment:

compiled_number = re.compile(r"\d")
compiled_sign=re.compile(r"\W")
number=0
employmentLength_array=[]
for employmentLength in df['employmentLength']:
    #print(employmentLength)
    if employmentLength != employmentLength:  # NaN != NaN, so this detects missing (NaN) float values
        employmentLength=0
        number=0
    else:
        a = compiled_number.findall(employmentLength)
        b = compiled_sign.findall(employmentLength)
        digit = len(a)
        number = 0
        for k in range(digit):  # Combine the matched digits into one number
            number = number * 10 + float(a[k])
        if b[0]=='<':
            number=number-0.5
        elif b[0]=='>'or b[0]=='+':
            number=number+0.5
    employmentLength_array.append(number)
print(employmentLength_array)
df.drop(['employmentLength'],axis=1,inplace=True)  # Pitfall: add inplace=True to drop in place
df['employmentLength']=employmentLength_array
df.to_csv('./Result.csv',index=False)
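A compact alternative sketch using pandas string methods, mirroring the scheme above ('< 1 year' -> 0.5, '10+ years' -> 10.5, NaN -> 0); the exact value strings are assumptions, and df must still hold the raw employmentLength column:

emp = df['employmentLength'].replace({'< 1 year': '0.5', '10+ years': '10.5'})
df['employmentLength'] = emp.str.extract(r'([\d.]+)')[0].astype(float).fillna(0)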

Delete rows with null values

Here, use nullcolumn = df[df.isnull().T.any()] to get the rows containing null values, then use nullcolumn.index.values to get the corresponding index values, which guarantees a one-to-one correspondence when deleting.

######################################## Delete null values
df = pd.read_csv("./Result.csv")
nullcolumn=df[df.isnull().T.any()]  #Get rows with null values
print(nullcolumn)
nullcolumn_number=nullcolumn.index.values#Get the index value corresponding to null value
df.drop(index=nullcolumn_number,axis=0,inplace=True)
df.to_csv('./Result2.csv',index=False)
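The built-in dropna API gives the same result in one line:

df.dropna(axis=0, inplace=True)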

Data normalization

################## Data normalization
df = pd.read_csv("./TESTResult2.csv")
#print(df.info())  # Make sure all columns are numeric types
df.drop(['id'],axis=1,inplace=True)  # Drop the id column; it is not a decision attribute
df.drop(['policyCode'],axis=1,inplace=True)  # Drop the policyCode column; it is meaningless (a constant column would also make min-max scaling produce NaN via 0/0)
df = (df-df.min())/(df.max()-df.min())  # Min-max normalization
df.to_csv('./TESTResult3.csv',index=False)
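Equivalently with scikit-learn's MinMaxScaler (a sketch, assuming scikit-learn is installed and all remaining columns are numeric with no nulls):

from sklearn.preprocessing import MinMaxScaler

df = pd.read_csv("./TESTResult2.csv")
df.drop(['id', 'policyCode'], axis=1, inplace=True)
df[df.columns] = MinMaxScaler().fit_transform(df)
df.to_csv('./TESTResult3.csv', index=False)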

Pitfalls

1.

issueDate_originaldata = df.iloc[:, -18]  # Get the earliesCreditLine column

This code segment is not ideal. It is better to use

df['earliesCreditLine']

to obtain the earliesCreditLine column by name, because the column positions may differ when the test data is processed later, so positional indexing reduces portability.
2.
df.iloc[j, 12] returns a single value, while df.loc[j, ['column name']] returns a Series, which cannot be operated on directly as a value.

print(df.iloc[1, 12])
print(df.loc[1, ['issueDate']])
print(df.loc[1, ['issueDate']].values)
print(df.loc[1, ['issueDate']].values[0])


DataFrame.values returns the row's values as a NumPy array, so .values[0] retrieves the single value.
3.
When deleting columns from a DataFrame, if you want to delete them from the original table directly, add inplace=True:

df.drop(['employmentLength'],axis=1,inplace=True)  # Pitfall: add inplace=True to drop in place

Otherwise, assign the result to a new variable, say df2, and save df2 to csv; the new csv then has the column removed while the original table stays unchanged.

DataFrame.drop(labels=None, axis=0, index=None, columns=None, inplace=False)

Parameter description:
labels: the row or column names to delete, given as a list
axis: defaults to 0, which deletes rows; to delete columns, specify axis=1
index: directly specifies the rows to delete
columns: directly specifies the columns to delete
inplace=False: the default; the deletion does not change the original data but returns a new DataFrame after the operation
inplace=True: deletes from the original data directly; nothing is returned

Therefore, there are two ways to delete rows and columns:
1) labels=<row or column names to delete> combined with axis=0/1
2) index or columns directly specifying the rows or columns to delete
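For example, these two calls are equivalent (either one deletes the column):

df.drop(labels=['employmentLength'], axis=1, inplace=True)  # way 1: labels + axis
df.drop(columns=['employmentLength'], inplace=True)         # way 2: columns directly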

4.
When saving a DataFrame to a csv file, if index=False is not added, the index is written as an extra first column:

df.to_csv('./Result.csv',index=False)
