Pandas: group aggregation

Click jump
Pandas series catalog

1. Group by method splits data

  • This method provides the splitting function in the grouping aggregation step, which can group the data according to the index or field
DataFrame.groupby(by=None, axis=0, level=None, as_index=True, sort=True, group_keys=True, squeeze=False, **kwargs)

  • Special description of by parameter

    • If a function is passed in, the index is calculated and grouped
    • If a dictionary or Series is passed in, the value of the dictionary or Series is used as the grouping basis
    • If a Numpy array is passed in, the elements of the data are used as the basis for grouping
    • If the string or string list is passed in, the fields represented by these strings are used as the grouping basis
  • Descriptive statistical methods commonly used by group by objects

    • The group by method can't be used to directly view the results after being grouped in memory

    • Descriptive statistical methods commonly used by group by objects

import pandas as pd
import numpy as np

detail = pd.read_excel('...')

detailGroup = detail[['order_id', 'counts', 'amounts']].groupby(by='order_id')
print('The order details after grouping are as follows:', detailGroup)   # Print address

print(detailGroup.mean().head())
print(detailGroup.std().head())
print(detailGroup.size().head())

2. agg method aggregates data

  • agg and aggregate function parameters and their descriptions

    • Both agg and aggregate methods support the application of their functions to each group, including Python built-in functions or custom functions. At the same time, these two methods can also directly apply functions to DataFrame

    • During normal use, the functions of agg function and aggregate function are almost the same when operating on DataFrame object, so you only need to master one of them

      • DataFrame.agg(func, axis=0, *args, **kwargs)
        
      • DataFrame.aggregate(func, axis=0, *args, **kwargs)
        

  • Finding statistics by agg method

    • The agg method can be used to calculate the sum and average of the sales and selling prices of all dishes in the current data at one time

      • detail[['counts', 'amounts']].agg([np.sum, np.mean])
        
    • For one field, you want to do only the mean operation, while for another field, you want to do only the sum operation. You can use the dictionary to take the names of the two fields as key s, and then take the sum and mean functions of numpy library as value s respectively

      • detail.agg({'counts': np.sum, 'amounts': np.mean})
        
    • In some cases, you also want to find multiple statistics for a field, while for some fields, you only need to find one statistics. At this time, you only need to change the value of the key corresponding to the dictionary into a list, and the list element is the statistics of multiple targets

      • detail.agg({'counts': np.sum, 'amounts': [np.mean, np.sum]})
        
  • agg method and custom function

    • You can pass in custom functions in the agg method
    # Double the sum of custom functions
    def DoubleSum(data):
        s = data.sum() * 2
        return s
    
    print('The sum of twice the sales volume of dishes in the dish order details table is:', '\n',
          detail.agg({'counts': DoubleSum}, axis=0))
    
    • When using custom functions, you should pay attention to the function NP in the numpy library mean,np.median,np.prod,np.sum,np.std,np.var can be directly used in agg, but using these functions in numpy Library in user-defined functions will not get the desired results if it is a single sequence during calculation. If multiple columns of data are calculated at the same time, this problem will not occur. Therefore, try not to use NP functions in custom functions
    # Double the sum of custom functions
    def DoubleSum1(data):
        s = np.sum(data) * 2
        return s
    
    print('The sum of double sales of dishes in the order details table is:\n',
          detail.agg({'counts': DoubleSum1}, axis=0).head())    # Abnormal results
    
    print('Twice the sum of the sales volume and selling price of dishes in the order details table is:\n',
          detail[['counts', 'amounts']].agg(DoubleSum1))        # Normal results
    
    • Using the agg method, you can use the same function for each field and group
    • The agg method can operate on the groupby object
      • If you need to apply different functions to different fields, you can use the same agg method as in DataFrame
    print(detailGroup.agg(np.mean).head(3))
    print(detailGroup.agg(np.std).head(3))
    print(detailGroup.agg({'counts': np.sum, 'amounts': np.mean}).head(3))
    

3. Aggregate data with the apply method

  • The apply method is similar to the agg method and can apply functions to each column. The difference is that the function passed in by the apply method can only act on the whole DataFrame or Series, and cannot apply different functions to different fields and obtain different results like agg
  • Using the apply method to aggregate groupby objects is the same as the agg method, except that the agg method can apply different functions to different fields, but apply cannot
DataFrame.apply(func, axis=0, broadcast=False, raw=False, reduce=None, args=(), **kwds)

print(detail[['counts', 'amounts']].apply(np.mean))
print(detailGroup.apply(np.mean).head(3))
print(detailGroup.apply(np.std).head(3))

4. The transform method aggregates data

  • The transform method can operate on all elements of the entire DataFrame. And the transform method has only one parameter func, which represents the function of DataFrame operation

    print(detail[['counts', "amounts"]].transform(lambda x: x * 2).head(4))
    
  • At the same time, the transform method can also operate the object groupby after DataFrame grouping, and can realize the standardization of deviation within the group

    print(detailGroup.transform(lambda x: (x.mean() - x.min()) / (x.max() - x.min())).head())
    
  • If NaN is included in the result when calculating the dispersion standardization, this is because according to the dispersion standardization formula, when the maximum value and the minimum value are the same, the denominator is 0, and the number with denominator 0 is expressed as NaN in Python

Tags: Python pandas

Posted by 3s2ng on Fri, 13 May 2022 07:12:55 +0300