Python - data analysis - pandas data preprocessing Standardized data

Different features often have different units and scales, so their numerical values can differ by orders of magnitude. If such data is used unprocessed in methods that rely on spatial distance calculation or gradient descent, the accuracy of the analysis results suffers. To remove the influence of differing units and value ranges between features, the data needs to be standardized, which is also called normalization.
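To make the effect of scale on distance calculations concrete, here is a minimal sketch on made-up data. The column names mirror the `counts` and `amounts` columns used later in this post, but the values and the helper `squared_diff_share` are hypothetical and serve only as an illustration.

```python
import pandas as pd

# Hypothetical toy data: two features on very different scales.
df = pd.DataFrame({
    'counts': [1.0, 2.0, 3.0],
    'amounts': [100.0, 105.0, 30.0],
})

def squared_diff_share(a, b):
    """Fraction of the squared Euclidean distance contributed by each feature."""
    sq = (a - b) ** 2
    return sq / sq.sum()

# Raw data: the distance between rows 0 and 2 is driven almost entirely by 'amounts'.
raw_share = squared_diff_share(df.iloc[0], df.iloc[2])

# After rescaling every column to [0, 1], both features contribute comparably.
scaled = (df - df.min()) / (df.max() - df.min())
scaled_share = squared_diff_share(scaled.iloc[0], scaled.iloc[2])

print('Share of each feature before scaling:\n', raw_share)
print('Share of each feature after scaling:\n', scaled_share)
```

On this toy data, `counts` accounts for well under 1% of the squared distance before scaling and for roughly half of it afterwards.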

Contents:

1, Deviation standardized data
2, Standard deviation standardized data
3, Decimal calibration standardized data

1, Deviation standardized data

Deviation standardization is a linear transformation of the original data. The result is to map the value of the original data to the [0,1] interval.
$$X^* = \frac{X - \min}{\max - \min}$$
Here max is the maximum value of the sample data, min is the minimum value of the sample data, and max - min is the range.

import pandas as pd
import numpy as np

# Read data
detail = pd.read_csv('data/detail.csv',index_col=0,encoding='gbk')

# Custom deviation (min-max) standardization function
def minmaxscale(data):
    # Rescale linearly so that the minimum maps to 0 and the maximum to 1
    data = (data - data.min()) / (data.max() - data.min())
    return data

# Apply deviation standardization to the sales volume and selling price in the dish order table
data1 = minmaxscale(detail['counts'])
data2 = minmaxscale(detail['amounts'])
data3 = pd.concat([data1, data2], axis=1)

print('First five rows of sales volume and selling price before deviation standardization:\n', detail[['counts', 'amounts']].head())
print('First five rows of sales volume and selling price after deviation standardization:\n', data3.head())


Comparing the data before and after deviation standardization shows how each original value corresponds to its mapped value. The sales volume column becomes 0 in these rows because every dish that appears in the order table was ordered at least once, so the minimum count is 1, and any value equal to the minimum maps to 0. The differences between the selling price values after deviation standardization are very small because the range of the data is very large.

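Disadvantages:
When the range of the data is very large, for example because of a single extreme value, most of the standardized values are close to 0 and differ little from one another;
When new data falls outside the current [min, max] range, its standardized values fall outside [0,1], which can cause errors downstream, so min and max must be determined again.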
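Both disadvantages can be reproduced with a small sketch. The prices below are invented for illustration (they are not taken from `detail.csv`), and `minmaxscale` is repeated so that the snippet runs on its own.

```python
import pandas as pd

def minmaxscale(data):
    """Deviation (min-max) standardization, as defined above."""
    return (data - data.min()) / (data.max() - data.min())

# Hypothetical prices containing one extreme value.
prices = pd.Series([10.0, 12.0, 15.0, 20.0, 1000.0])
scaled = minmaxscale(prices)

# Disadvantage 1: every value except the extreme one is squeezed close to 0.
print(scaled)

# Disadvantage 2: a new value outside the old [min, max] range
# maps outside [0, 1] if the old min and max are reused.
old_min, old_max = prices.min(), prices.max()
new_value = 1200.0
new_scaled = (new_value - old_min) / (old_max - old_min)
print(new_scaled)  # greater than 1
```

In practice this means min and max have to be recomputed, and all previously scaled values rescaled, whenever the observed range changes.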

2, Standard deviation standardized data

Standard deviation standardization is also called zero mean standardization or z-score standardization. The mean value of the data processed by this method is 0 and the standard deviation is 1.
$$X^* = \frac{X - \overline{X}}{\delta}$$
Here $\overline{X}$ is the mean of the original data and $\delta$ is the standard deviation of the original data.

# Custom standard deviation (z-score) standardization function
def StandardScaler(data):
    # Series.std() uses the sample standard deviation (ddof=1) by default
    data = (data - data.mean()) / data.std()
    return data

# Apply standard deviation standardization to the sales volume and selling price in the dish order table
data4 = StandardScaler(detail['counts'])
data5 = StandardScaler(detail['amounts'])
data6 = pd.concat([data4, data5], axis=1)
print('Sales volume and selling price before standard deviation standardization:\n',
      detail[['counts', 'amounts']].head())
print('Sales volume and selling price after standard deviation standardization:\n', data6.head())


The comparison shows that the values after standard deviation standardization are not limited to [0,1] and can be negative, while the shape of the data's distribution is preserved.
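The stated properties (mean 0, standard deviation 1, negative values allowed) can be checked on a small series. This is only a sketch: the values are made up, and the function name `standard_scale` and its `ddof` parameter are introduced here for illustration and are not part of the original code.

```python
import pandas as pd

def standard_scale(data, ddof=1):
    """z-score standardization; ddof=1 matches the pandas default for std()."""
    return (data - data.mean()) / data.std(ddof=ddof)

# Hypothetical values with one larger observation.
values = pd.Series([1.0, 2.0, 3.0, 4.0, 10.0])
z = standard_scale(values)

print('Mean after standardization:', z.mean())       # approximately 0
print('Std after standardization:', z.std())         # approximately 1
print('Contains negative values:', bool((z < 0).any()))
```

Because pandas computes the sample standard deviation by default, passing `ddof=0` gives slightly different values; whichever convention is used should be applied consistently.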

3, Decimal calibration standardized data

Decimal scaling (decimal calibration) standardization maps the data to the [-1,1] interval by moving the decimal point of the data. The number of places moved depends on the maximum absolute value in the data.
$$X^* = \frac{X}{10^k}$$
Here k is determined by the maximum absolute value of the data: it is the number of places the decimal point must move so that this maximum no longer exceeds 1.

# Custom decimal scaling standardization function
def DecimalScaler(data):
    # k = ceil(log10(max |x|)) is the number of places to move the decimal point
    data = data / 10 ** np.ceil(np.log10(data.abs().max()))
    return data

# Apply decimal scaling standardization to the sales volume and selling price in the dish order table
data7 = DecimalScaler(detail['counts'])
data8 = DecimalScaler(detail['amounts'])
data9 = pd.concat([data7, data8], axis=1)
print('Sales volume and selling price before decimal scaling standardization:\n',
      detail[['counts', 'amounts']].head())
print('Sales volume and selling price after decimal scaling standardization:\n', data9.head())


Checking back after decimal scaling standardization, the maximum absolute value in the data is 178, a three-digit number, so k = 3 and the decimal point is moved three places to the left.
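The choice of k can be traced on a small example. The series below is hypothetical and was chosen so that its maximum absolute value is 178, as in the result described above; the helper `decimal_scale`, which also returns k, is an illustrative variant of the function in this post.

```python
import numpy as np
import pandas as pd

def decimal_scale(data):
    """Decimal scaling; returns the scaled data together with k."""
    k = int(np.ceil(np.log10(data.abs().max())))
    return data / 10 ** k, k

# Hypothetical values whose largest absolute value is 178.
values = pd.Series([45.0, -9.0, 178.0, 26.0])
scaled, k = decimal_scale(values)

print('k =', k)  # 3, because 178 has three digits
print(scaled.tolist())
```

Every value is divided by 10^3 = 1000, so 178 becomes 0.178 and the whole series lies inside [-1,1].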

Tags: Python Data Analysis

Posted by jchemie on Tue, 24 May 2022 16:55:00 +0300