Python data analysis practice: distribution analysis


The following article comes from Statistics and Data Analysis in Practice, by Yan Xiaoxiang

 

Preface

Distribution analysis groups data according to the purpose of the analysis and then studies how the data is distributed across the groups. Data can be grouped in two ways: with equal-width intervals (equidistant grouping) or with unequal-width intervals.
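As a minimal sketch of the two grouping styles, the snippet below bins a small made-up list of ages with pandas `pd.cut`, first with equally spaced edges and then with hand-picked, unequally spaced edges. The sample ages and the bin edges are illustrative assumptions, not values from the article's data.

```python
import pandas as pd

# Hypothetical sample of ages, for illustration only
ages = pd.Series([18, 19, 22, 25, 31, 38, 44])

# Equidistant grouping: every interval is 10 years wide
equal_groups = pd.cut(ages, bins=[10, 20, 30, 40, 50])

# Unequal-distance grouping: interval widths are chosen to suit the analysis
unequal_groups = pd.cut(ages, bins=[0, 18, 25, 40, 100])

# Count how many ages fall into each interval, in interval order
equal_counts = equal_groups.value_counts().sort_index()
unequal_counts = unequal_groups.value_counts().sort_index()

print(equal_counts)
print(unequal_counts)
```

Both calls use right-closed intervals by default, so an age of 18 falls into `(0, 18]` rather than `(18, 25]`.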

Distribution analysis is widely used in practice. Common examples include user gender distribution, user age distribution and user consumption distribution.

This article will explain the following knowledge points:

1. Modification of data type

2. New field generation method

3. Data validity verification

4. Gender and age distribution

Distribution analysis

1. Import related library packages

import pandas as pd
import matplotlib.pyplot as plt
import math

 

2. Data processing

>>> df = pd.read_csv('UserInfo.csv')
>>> df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000000 entries, 0 to 999999
Data columns (total 4 columns):
UserId        1000000 non-null int64
CardId        1000000 non-null int64
LoginTime     1000000 non-null object
DeviceType    1000000 non-null object
dtypes: int64(2), object(2)
memory usage: 30.5+ MB

 

Next, we want to analyze the age distribution, but the info() output shows that the source data has no age field. We need to generate it ourselves from the date of birth embedded in the ID number (CardId).

# To extract the date of birth, you need to convert the ID number into a string
>>> df['CardId'] = df['CardId'].astype('str')

# Extract the date of birth and generate a new field
>>> df['DateofBirth'] = df.CardId.apply(lambda x : x[6:10]+"-"+x[10:12]+"-"+x[12:14])

# Extract the gender: the second-to-last digit of the ID number is odd for male and even for female
>>> df['Gender'] = df['CardId'].map(lambda x : 'Male' if int(x[-2]) % 2 else 'Female')

>>> df.head()
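The same extraction can also be written with pandas' vectorized `.str` accessor instead of `apply` with a lambda. The sketch below is one possible variant, not the author's method; the two ID strings are made up and only follow the 18-character layout assumed by the slicing above.

```python
import pandas as pd

# Made-up 18-character ID strings (not real IDs). Assumed layout:
# characters 7-14 hold the birth date as YYYYMMDD,
# character 17 (second to last) encodes gender.
ids = pd.Series(['110101199003071234', '110101198512250021'])

# Slice year, month and day and join them as 'YYYY-MM-DD'
birth = ids.str[6:10] + '-' + ids.str[10:12] + '-' + ids.str[12:14]

# Odd second-to-last digit -> 'Male', even -> 'Female'
gender_digit = ids.str[-2].astype(int)
gender = gender_digit.map(lambda d: 'Male' if d % 2 else 'Female')

print(birth.tolist())
print(gender.tolist())
```

If the ID column might contain non-digit characters, reading it as text from the start with `pd.read_csv('UserInfo.csv', dtype={'CardId': str})` avoids the integer round trip.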

 

 

3. Calculate age

Because the data was collected offline, its validity has not been verified. Before calculating age, we identify and remove records with invalid dates.

# Split the date of birth into month and day
>>> df[['month','day']] = df['DateofBirth'].str.split('-',expand=True).loc[:,1:2]

# Select the months that have fewer than 31 days and check whether any row has day 31
>>> df_small_month = df[df['month'].isin(['02','04','06','09','11'])]

# These rows are invalid data
>>> df_small_month[df_small_month['day']=='31']

# Delete all of the invalid rows
>>> df.drop(df_small_month[df_small_month['day']=='31'].index,inplace=True)

# Similarly, check February
>>> df_2 = df[df['month']=='02']

# February could be verified more carefully by first testing for a leap year; here days 29, 30 and 31 are all treated as invalid
>>> df_2[df_2['day'].isin(['29','30','31'])]

# Delete all of them
>>> df.drop(df_2[df_2['day'].isin(['29','30','31'])].index,inplace=True)
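A leap-year-aware check is possible without listing months by hand. The sketch below is an alternative to the steps above, under the assumption that the dates are strings in 'YYYY-MM-DD' form: `pd.to_datetime` with `errors='coerce'` turns any impossible calendar date into `NaT`, so a genuine 29 February in a leap year is kept while the same day in an ordinary year is rejected. The sample dates are made up.

```python
import pandas as pd

# Hypothetical birth-date strings in the same 'YYYY-MM-DD' layout
dates = pd.Series(['1990-02-29', '1992-02-29', '1988-04-31', '1995-07-15'])

# Impossible calendar dates become NaT instead of raising an error
parsed = pd.to_datetime(dates, format='%Y-%m-%d', errors='coerce')

# True where the string is a real calendar date
valid_mask = parsed.notna()
valid_dates = dates[valid_mask]

print(valid_mask.tolist())
print(valid_dates.tolist())
```

Here 1992 is a leap year, so '1992-02-29' survives, while '1990-02-29' and '1988-04-31' are flagged as invalid.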

 

 

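# Calculate age
# pd.Timestamp.now() is used because pd.datetime is deprecated and unavailable in recent pandas versions
# Method 1: approximate, treats every year as 365 days
>>> df['Age'] = df['DateofBirth'].apply(lambda x : math.floor((pd.Timestamp.now() - pd.to_datetime(x)).days/365))

# Method 2: difference in calendar years only, ignores whether this year's birthday has passed
>>> df['DateofBirth'].apply(lambda x : pd.Timestamp.now().year - pd.to_datetime(x).year)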
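Both methods are approximations: dividing by 365 drifts slightly because of leap days, and subtracting calendar years ignores whether the birthday has already happened this year. The following sketch shows one way to compute completed years exactly. The helper name `age_on`, the fixed reference date and the sample birth dates are assumptions made for this illustration.

```python
import pandas as pd

def age_on(birth_dates, today):
    """Return completed years between each birth date and `today`."""
    born = pd.to_datetime(birth_dates)
    # True where this year's birthday has not happened yet on `today`
    before_birthday = (born.dt.month > today.month) | (
        (born.dt.month == today.month) & (born.dt.day > today.day)
    )
    # Subtract one year for people whose birthday is still to come
    return today.year - born.dt.year - before_birthday.astype(int)

# Fixed reference date so the result is reproducible (an assumption for this sketch)
today = pd.Timestamp('2022-05-07')
births = pd.Series(['1990-05-07', '1990-05-08', '2000-01-01'])
ages = age_on(births, today)

print(ages.tolist())
```

In real use, `today` would typically be `pd.Timestamp.now()`; a fixed date is used here only so the output does not change from day to day.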

 

4. Age distribution

# Check the age range and divide it
>>> df['Age'].max(),df['Age'].min()
# (45, 18)

>>> bins = [0,18,25,30,35,40,100]
>>> labels = ['18 Years old and under','19 Years old to 25 years old','26 Years old to 30 years old','31 Years old to 35 years old','36 Years old to 40 years old','41 Years old and over']

>>> df['Age stratification'] = pd.cut(df['Age'],bins, labels = labels)
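To see how the bin edges map onto the labels, it helps to test a few boundary ages. The sketch below reuses the same `bins` but with shortened labels (an assumption, to keep the example compact) and made-up ages. By default `pd.cut` builds right-closed intervals, so 18 belongs to the first group and 19 to the second.

```python
import pandas as pd

bins = [0, 18, 25, 30, 35, 40, 100]
# Shortened labels, used here only to keep the example compact
labels = ['<=18', '19-25', '26-30', '31-35', '36-40', '>=41']

# Ages chosen to sit on or next to the bin edges
ages = pd.Series([18, 19, 25, 26, 40, 41])
groups = pd.cut(ages, bins, labels=labels)

print(groups.astype(str).tolist())
```

Because the left edge of the first interval is open, an age of exactly 0 would not be assigned to any group unless `include_lowest=True` is passed to `pd.cut`.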

 

Since this data records user logins, the same user can appear more than once. pandas provides the nunique() method, which counts distinct values and can therefore produce de-duplicated statistics.

# Check for duplicate values
>>> df.duplicated('UserId').sum()    #47681

# Total number of rows
>>> len(df)    #980954

 

 

The count() method can also compute the distribution after grouping, but the result is only correct when there are no duplicate records. When duplicates exist, use the nunique() method from pandas, which counts each user only once.

>>> df.groupby('Age stratification')['UserId'].count()
Age stratification
18 Years old and under           25262
19 Years old to 25 years old    254502
26 Years old to 30 years old    181751
31 Years old to 35 years old    181417
36 Years old to 40 years old    181589
41 Years old and over           156433
Name: UserId, dtype: int64

# By summing, we can see that duplicate data is also calculated
>>> df.groupby('Age stratification')['UserId'].count().sum()
# 980954

>>> df.groupby('Age stratification')['UserId'].nunique()
Age stratification
18 Years old and under           24014
19 Years old to 25 years old    242199
26 Years old to 30 years old    172832
31 Years old to 35 years old    172608
36 Years old to 40 years old    172804
41 Years old and over           148816
Name: UserId, dtype: int64


>>> df.groupby('Age stratification')['UserId'].nunique().sum()
# 933273 = 980954 (total rows) - 47681 (duplicate rows)
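The difference between the two methods is easier to see on a tiny table. The sketch below uses a made-up login log with hypothetical column names `UserId` and `Group`: `count()` counts rows, `nunique()` counts distinct users, and the number of duplicated rows accounts for the gap.

```python
import pandas as pd

# Made-up login log: user 1 logs in twice, user 3 three times
logins = pd.DataFrame({
    'UserId': [1, 1, 2, 3, 3, 3],
    'Group':  ['A', 'A', 'A', 'B', 'B', 'B'],
})

# Rows per group (repeat logins are counted again)
rows_per_group = logins.groupby('Group')['UserId'].count()

# Distinct users per group (each user is counted once)
users_per_group = logins.groupby('Group')['UserId'].nunique()

# Rows whose UserId has already appeared earlier in the table
duplicate_rows = int(logins.duplicated('UserId').sum())

print(rows_per_group.tolist(), users_per_group.tolist(), duplicate_rows)
```

The identity "total rows minus duplicates equals the sum of distinct users" holds here because every user belongs to exactly one group. A user who appeared in two groups would be counted once in each.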

# Calculate age distribution
>>> result = df.groupby('Age stratification')['UserId'].nunique()/df.groupby('Age stratification')['UserId'].nunique().sum()
>>> result

Age stratification
18 Years old and under          0.025731
19 Years old to 25 years old    0.259516
26 Years old to 30 years old    0.185189
31 Years old to 35 years old    0.184949
36 Years old to 40 years old    0.185159
41 Years old and over           0.159456
Name: UserId, dtype: float64


# Format it
>>> result = round(result,4)*100
>>> result.map("{:.2f}%".format)

Age stratification
18 Years old and under           2.57%
19 Years old to 25 years old    25.95%
26 Years old to 30 years old    18.52%
31 Years old to 35 years old    18.49%
36 Years old to 40 years old    18.52%
41 Years old and over           15.95%
Name: UserId, dtype: object
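The rounding and formatting can also be done in one step with Python's percent format specifier. The sketch below rebuilds the distinct-user counts from the nunique() output shown earlier as a plain Series, with shortened labels as an assumption for compactness, and formats the shares directly.

```python
import pandas as pd

# Distinct-user counts per age group, copied from the nunique() output above
counts = pd.Series(
    [24014, 242199, 172832, 172608, 172804, 148816],
    index=['<=18', '19-25', '26-30', '31-35', '36-40', '>=41'],  # shortened labels
)

# Share of each group in the total number of distinct users
share = counts / counts.sum()

# '{:.2%}' multiplies by 100, rounds to two decimals and appends '%'
formatted = share.map('{:.2%}'.format)

print(formatted)
```

The result is a Series of strings, so it is suitable for display but not for further arithmetic. Keep `share` around if the numeric values are still needed.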

 

 

According to the results above, users aged 19 to 25 account for the highest proportion, about 26%.

Posted by arctushar on Sat, 07 May 2022 22:56:38 +0300