The following article comes from "Statistics and Data Analysis in Practice", by Yan Xiaoxiang.

# Preface

Distribution analysis groups data according to the purpose of the analysis and then studies how the records are distributed across the groups. Data can be grouped in two ways: with equal-width intervals or with unequal-width intervals.
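As a minimal sketch of the two grouping methods, the example below bins a small list of values with `pd.cut`, first into equal-width intervals and then into hand-picked, unequal-width intervals. The values and bin edges are made up for illustration and are not data from this article.

```python
import pandas as pd

# Made-up sample values, for illustration only
values = pd.Series([3, 12, 18, 27, 45, 60, 88])

# Equal-width grouping: an integer asks pandas for that many
# intervals of the same width spanning the observed range
equal_width = pd.cut(values, bins=4)

# Unequal-width grouping: explicit edges let each interval
# have its own width
edges = [0, 10, 20, 50, 100]
unequal_width = pd.cut(values, bins=edges)

print(equal_width.value_counts().sort_index())
print(unequal_width.value_counts().sort_index())
```

Equal-width bins are quick to set up, while explicit edges are the usual choice when the groups need to match business definitions such as age brackets.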

Distribution analysis is widely used in day-to-day data analysis. Common examples include the distribution of users by gender, by age and by spending.

This article will explain the following knowledge points:

1. Modification of data type

2. New field generation method

3. Data validity verification

4. Gender and age distribution

# Distribution analysis

1. Import related library packages

    import pandas as pd
    import matplotlib.pyplot as plt
    import math

2. Data processing

    >>> df = pd.read_csv('UserInfo.csv')
    >>> df.info()
    <class 'pandas.core.frame.DataFrame'>
    RangeIndex: 1000000 entries, 0 to 999999
    Data columns (total 4 columns):
    UserId        1000000 non-null int64
    CardId        1000000 non-null int64
    LoginTime     1000000 non-null object
    DeviceType    1000000 non-null object
    dtypes: int64(2), object(2)
    memory usage: 30.5+ MB
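The `info()` output shows that `CardId` was parsed as `int64` and `LoginTime` as `object`. As an alternative sketch, the types can be set while the file is being read. The example uses a small in-memory CSV with the same column names; the two rows and the timestamp format are made up, since the real contents of `UserInfo.csv` are not shown here.

```python
import io

import pandas as pd

# Two made-up rows using the same column names as UserInfo.csv
csv_text = (
    "UserId,CardId,LoginTime,DeviceType\n"
    "1,110101199003071234,2019-01-01 08:00:00,Android\n"
    "2,110101198512152345,2019-01-02 09:30:00,iOS\n"
)

# dtype keeps CardId as text so it can be sliced by position;
# parse_dates converts LoginTime to datetime64 instead of object
sample = pd.read_csv(
    io.StringIO(csv_text),
    dtype={"CardId": str},
    parse_dates=["LoginTime"],
)

print(sample.dtypes)
```

Reading the ID as text from the start avoids a later `astype` call and also copes with IDs that contain non-digit characters.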

Next we want to analyze the age distribution, but the info() output shows that the source data has no age field, so we need to derive it ourselves.

    # To extract the date of birth, first convert the ID number to a string
    >>> df['CardId'] = df['CardId'].astype('str')
    # Extract the date of birth and generate a new field
    >>> df['DateofBirth'] = df.CardId.apply(lambda x : x[6:10]+"-"+x[10:12]+"-"+x[12:14])
    # Extract the gender so the gender distribution can be observed
    >>> df['Gender'] = df['CardId'].map(lambda x : 'Male' if int(x[-2]) % 2 else 'Female')
    >>> df.head()
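To make the slicing easier to follow, here is a small standalone sketch of the same extraction applied to a single ID string. The helper name `parse_card_id` and the sample IDs are hypothetical; the positions used are the ones in the code above.

```python
def parse_card_id(card_id: str) -> tuple[str, str]:
    """Return (date_of_birth, gender) from an 18-character ID string."""
    # Characters 7 to 14 (indexes 6 to 13) hold the birth date as YYYYMMDD
    date_of_birth = f"{card_id[6:10]}-{card_id[10:12]}-{card_id[12:14]}"
    # The second-to-last character is odd for male, even for female
    gender = "Male" if int(card_id[-2]) % 2 else "Female"
    return date_of_birth, gender


# Made-up IDs used purely as examples
print(parse_card_id("110101199003071234"))  # ('1990-03-07', 'Male')
print(parse_card_id("110101198512152345"))  # ('1985-12-15', 'Female')
```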

3. Calculate age

Because the data was collected offline, its validity has not been verified. Before calculating ages, we identify and remove records whose date of birth is impossible.

    # Split the date of birth into month and day
    >>> df[['month','day']] = df['DateofBirth'].str.split('-',expand=True).loc[:,1:2]
    # Select February and the 30-day months, then check whether any row has day 31
    >>> df_small_month = df[df['month'].isin(['02','04','06','09','11'])]
    # Display the invalid rows
    >>> df_small_month[df_small_month['day']=='31']
    # Delete all of the invalid rows
    >>> df.drop(df_small_month[df_small_month['day']=='31'].index,inplace=True)
    # Similarly, check February
    >>> df_2 = df[df['month']=='02']
    # February could be checked more carefully by first testing whether the
    # year is a leap year; here every 29th, 30th and 31st is treated as invalid
    >>> df_2[df_2['day'].isin(['29','30','31'])]
    # Delete all of them
    >>> df.drop(df_2[df_2['day'].isin(['29','30','31'])].index,inplace=True)
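The checks above drop every 29 February, including birthdays that fall in a leap year. One possible alternative, sketched here on made-up dates, is to let `pd.to_datetime` decide: with `errors="coerce"` any string that is not a real calendar date becomes `NaT`, and leap years are handled automatically. This is an illustration of a different technique, not the method used in this article.

```python
import pandas as pd

# Made-up birth dates: two impossible dates and two valid ones
dates = pd.Series(["1990-02-29", "1992-02-29", "1991-04-31", "1990-03-07"])

# errors="coerce" turns any string that is not a real calendar date into NaT
parsed = pd.to_datetime(dates, format="%Y-%m-%d", errors="coerce")

# True where the date exists; 1992-02-29 passes because 1992 is a leap year
is_valid = parsed.notna()
print(is_valid.tolist())  # [False, True, False, True]
```

Rows could then be kept with a boolean mask such as `df[is_valid]`, which also catches other impossible values, for example a month of 13.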

    # Calculate age
    # Method 1: days since birth divided by 365, rounded down
    >>> df['Age'] = df['DateofBirth'].apply(lambda x : math.floor((pd.Timestamp.now() - pd.to_datetime(x)).days/365))
    # Method 2: current year minus birth year
    >>> df['DateofBirth'].apply(lambda x : pd.Timestamp.now().year - pd.to_datetime(x).year)
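The two methods can disagree by a year: Method 1 treats every year as 365 days, and Method 2 ignores whether this year's birthday has already passed. The sketch below shows one way to compute completed years exactly. The helper `age_on`, the birth dates and the fixed reference date are all assumptions chosen so the output is reproducible.

```python
import pandas as pd


def age_on(birth_dates: pd.Series, today: pd.Timestamp) -> pd.Series:
    """Whole years completed between each birth date and `today`."""
    born = pd.to_datetime(birth_dates)
    # Subtract one year when this year's birthday has not happened yet
    not_yet = (born.dt.month > today.month) | (
        (born.dt.month == today.month) & (born.dt.day > today.day)
    )
    return today.year - born.dt.year - not_yet.astype(int)


# Hypothetical birth dates and a fixed reference date
births = pd.Series(["1990-03-07", "1990-12-31"])
ages = age_on(births, pd.Timestamp("2020-06-01"))
print(ages.tolist())  # [30, 29]
```

Because the comparison is vectorized, this avoids calling `pd.to_datetime` once per row inside `apply`, which matters on a table with about a million rows.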

4. Age distribution

    # Check the age range and divide it
    >>> df['Age'].max(),df['Age'].min()
    # (45, 18)
    >>> bins = [0,18,25,30,35,40,100]
    >>> labels = ['18 Years old and under','19 Years old to 25 years old','26 Years old to 30 years old','31 Years old to 35 years old','36 Years old to 40 years old','41 Years old and over']
    >>> df['Age stratification'] = pd.cut(df['Age'],bins, labels = labels)
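It helps to see which side of each edge a boundary age falls on. The sketch below reuses the same `bins` with shortened labels (the short labels and the sample ages are our own) and relies on the default behaviour of `pd.cut`, where every interval is closed on the right.

```python
import pandas as pd

bins = [0, 18, 25, 30, 35, 40, 100]
# Shortened, hypothetical labels standing in for the longer ones above
labels = ["<=18", "19-25", "26-30", "31-35", "36-40", ">=41"]

# Ages chosen to sit on either side of the bin edges
ages = pd.Series([18, 19, 25, 26, 40, 41])

# By default pd.cut uses right-closed intervals: (0, 18], (18, 25], ...
groups = pd.cut(ages, bins, labels=labels)
print(groups.tolist())  # ['<=18', '19-25', '19-25', '26-30', '36-40', '>=41']
```

An age of exactly 0 would fall outside `(0, 18]` and become `NaN` unless `include_lowest=True` is passed.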

Because this data records user logins, the same user can appear in more than one row. pandas provides the nunique() method, which counts distinct values and so gives de-duplicated statistics.

    # Check for duplicate values
    >>> df.duplicated('UserId').sum()
    # 47681
    # Total number of records
    >>> df.count()
    # 980954

The count() method can also produce per-group totals, but those totals equal the number of users only when there are no duplicate rows. When duplicates are present, nunique() counts each distinct value once per group.
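The difference is easiest to see on a tiny table. In this sketch the column name `Group` and the four rows are made up; user 1 logs in twice, so `count()` and `nunique()` disagree for group A.

```python
import pandas as pd

# Made-up login records: user 1 logged in twice
logins = pd.DataFrame({
    "UserId": [1, 1, 2, 3],
    "Group": ["A", "A", "A", "B"],
})

rows_per_group = logins.groupby("Group")["UserId"].count()
users_per_group = logins.groupby("Group")["UserId"].nunique()

print(rows_per_group.to_dict())   # {'A': 3, 'B': 1}
print(users_per_group.to_dict())  # {'A': 2, 'B': 1}
```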

    >>> df.groupby('Age stratification')['UserId'].count()
    Age stratification
    18 Years old and under           25262
    19 Years old to 25 years old    254502
    26 Years old to 30 years old    181751
    31 Years old to 35 years old    181417
    36 Years old to 40 years old    181589
    41 Years old and over           156433
    Name: UserId, dtype: int64

    # Summing the groups shows that duplicate rows are included
    >>> df.groupby('Age stratification')['UserId'].count().sum()
    # 980954

    >>> df.groupby('Age stratification')['UserId'].nunique()
    Age stratification
    18 Years old and under           24014
    19 Years old to 25 years old    242199
    26 Years old to 30 years old    172832
    31 Years old to 35 years old    172608
    36 Years old to 40 years old    172804
    41 Years old and over           148816
    Name: UserId, dtype: int64

    >>> df.groupby('Age stratification')['UserId'].nunique().sum()
    # 933273 = 980954 (total) - 47681 (duplicates)

    # Calculate the age distribution
    >>> result = df.groupby('Age stratification')['UserId'].nunique()/df.groupby('Age stratification')['UserId'].nunique().sum()
    >>> result
    Age stratification
    18 Years old and under          0.025731
    19 Years old to 25 years old    0.259516
    26 Years old to 30 years old    0.185189
    31 Years old to 35 years old    0.184949
    36 Years old to 40 years old    0.185159
    41 Years old and over           0.159456
    Name: UserId, dtype: float64

    # Format the result as percentages
    >>> result = round(result,4)*100
    >>> result.map("{:.2f}%".format)
    Age stratification
    18 Years old and under           2.57%
    19 Years old to 25 years old    25.95%
    26 Years old to 30 years old    18.52%
    31 Years old to 35 years old    18.49%
    36 Years old to 40 years old    18.52%
    41 Years old and over           15.95%
    Name: UserId, dtype: object
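The same pattern should carry over to the `Gender` field created earlier. The sketch below applies it to a few made-up rows rather than the real data, so the percentages are illustrative only.

```python
import pandas as pd

# Made-up records; user 1 appears twice
users = pd.DataFrame({
    "UserId": [1, 1, 2, 3, 4],
    "Gender": ["Male", "Male", "Female", "Male", "Male"],
})

# Count each user once per gender, then convert to shares of all users
per_gender = users.groupby("Gender")["UserId"].nunique()
share = per_gender / per_gender.sum()

print(share.map("{:.2%}".format).to_dict())  # {'Female': '25.00%', 'Male': '75.00%'}
```

The `"{:.2%}"` format multiplies by 100 and appends the percent sign, so it combines the rounding and formatting steps into one.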

According to the results and distribution chart above, users aged 19 to 25 make up the largest share, about 26% (25.95%).
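A bar chart of the distribution can be drawn with matplotlib, which was imported at the start. This is a minimal sketch, not the original chart: the percentages are copied from the formatted result above, while the short labels, axis titles and output file name are assumptions.

```python
import matplotlib

matplotlib.use("Agg")  # non-interactive backend so the script runs without a display

import matplotlib.pyplot as plt
import pandas as pd

# Percentages copied from the formatted result; short labels are our own
result = pd.Series(
    [2.57, 25.95, 18.52, 18.49, 18.52, 15.95],
    index=["<=18", "19-25", "26-30", "31-35", "36-40", ">=41"],
)

fig, ax = plt.subplots(figsize=(8, 4))
bars = ax.bar(result.index, result.values)
ax.set_xlabel("Age group")
ax.set_ylabel("Share of users (%)")
ax.set_title("User age distribution")
fig.savefig("age_distribution.png")  # hypothetical output file name
```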