Transferred from AI Studio, original link: Model Quantization (2): Static and Dynamic Quantization of Paddle Models (PaddlePaddle AI Studio)
1. Introduction

The previous article briefly introduced the basic principles of model quantization along with some simple examples

This time we continue from that content, using PaddleSlim, the deep learning model compression toolkit

This article introduces how to use PaddleSlim to perform dynamic and static quantization of a Paddle model
2. References

GitHub: PaddlePaddle/PaddleSlim
3. PaddleSlim
3.1 Introduction

PaddleSlim is a toolkit officially developed by Paddle that focuses on deep learning model compression

It provides model compression strategies such as pruning, quantization, distillation and neural architecture search, helping users quickly shrink their models
3.2 Features

PaddleSlim supports the following features, as well as user-defined quantization, pruning and other functions
Quantization     | Pruning              | NAS                               | Distilling
QAT              | SensitivityPruner    | *Simulated Annealing based NAS    | *FSP
PACT             | FPGMFilterPruner     | *Reinforcement Learning based NAS | *DML
PTQ Static       | L1NormFilterPruner   | **DARTS                           | *DK for YOLOv3
PTQ Dynamic      | L2NormFilterPruner   | **PC-DARTS                        |
Embedding Quant  | *SlimFilterPruner    | **Once-for-All                    |
                 | *OptSlimFilterPruner | *Hardware-aware Search            |

 *Indicates that only static graphs are supported
 **Indicates that only dynamic graphs are supported
3.3 Results

PaddleSlim has been used for model compression on typical computer vision and natural language processing tasks

Acceleration has been tested on NVIDIA GPUs, ARM and other devices. The compression results of some models are shown below

For details, please refer to the CV and NLP model compression solutions in the official documentation

YOLOv3: 3.55x speedup on a Snapdragon 855 mobile device

PP-OCR: model size reduced from 8.9 MB to 2.9 MB, with a 1.27x speedup on Snapdragon 855

BERT: model parameters reduced from 110M to 80M with improved accuracy, and FP16 computation on a Tesla T4 GPU accelerated by 1.47x

3.4 Installation
 Use the pip command to quickly install PaddleSlim
In [ ]
!pip install paddleslim
4. Model quantization
4.1 Quantization methods

PaddleSlim mainly includes three quantization methods:

Quantization-aware training (QAT): quantization-aware training lets the model perceive the impact of quantization operations on its accuracy during training, and reduces the quantization error through fine-tuning.

Post-training quantization, dynamic (PTQ Dynamic): dynamic offline quantization only maps the weights of specific operators in the model from FP32 to INT8/INT16.

Post-training quantization, static (PTQ Static): static offline quantization uses a small amount of unlabeled calibration data and computes the quantization scale factors using KL divergence and other methods.

4.2 Method selection

The following figure shows how to choose a quantization method according to your needs
4.3 Method comparison

The following table compares the applicable conditions, ease of use, accuracy loss and expected benefits of each quantization method

The following table compares the API, function and typical application scenarios of each quantization method
Quantization method                        | API                                                                       | Function                                                              | Typical scenarios
Online quantization (QAT)                  | Dynamic graph: paddleslim.QAT; static graph: paddleslim.quant.quant_aware | Minimizes the quantization error of the model through fine-tuning    | Quantization-sensitive scenarios and models, such as object detection, segmentation, OCR
Static offline quantization (PTQ Static)   | paddleslim.quant.quant_post_static                                        | Obtains the quantized model from a small amount of calibration data  | Scenarios insensitive to quantization, such as image classification
Dynamic offline quantization (PTQ Dynamic) | paddleslim.quant.quant_post_dynamic                                       | Quantizes only the learnable weights of the model                     | Large models with heavy memory access overhead, such as BERT
Embedding quantization (Quant Embedding)   | paddleslim.quant.quant_embedding                                          | Quantizes only the Embedding parameters of the model                  | Any model containing an Embedding layer
4.4 More details
 For more detailed information, please refer to PaddleSlim's model quantization documentation
5. Quantization in practice
5.1 Case introduction

Model: the ImageNet-pretrained MobileNetV1 image classification model provided by PaddleClas

Quantization methods: this time we introduce the two post-training quantization methods, namely dynamic quantization and static quantization
5.2 Download the model
 Download and extract the MobileNetV1 pretrained model using the download link provided by PaddleClas
In [ ]
# Download the model file
!wget -P models https://paddle-imagenet-models-name.bj.bcebos.com/dygraph/inference/MobileNetV1_infer.tar
# Extract the model file
!cd models && tar -xf MobileNetV1_infer.tar
5.3 Dynamic quantization

Introduction:

Dynamic offline quantization maps the weights of specific OPs in the model from FP32 to INT8/INT16

A trained inference model is required before quantization, and the weights can be converted to INT8 or INT16 as needed

At present only dequantization-based prediction is supported; it mainly reduces the model size, and can speed up models whose weight loading is particularly time-consuming

When the weights are quantized to INT16, model accuracy is not affected and the model size is 1/2 of the original

When the weights are quantized to INT8, model accuracy is affected to some extent and the model size is 1/4 of the original (a minimal sketch of this weight quantization follows below)
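For concreteness, here is a minimal NumPy sketch of this idea (illustrative only, not PaddleSlim's internal implementation; the weight shape is made up): FP32 weights are stored as INT8 with a per-tensor abs_max scale, then dequantized back before computation, which is the dequantization-based prediction mentioned above.
In [ ]
import numpy as np

# A made-up FP32 weight tensor standing in for one OP's weights
w_fp32 = np.random.randn(256, 512).astype(np.float32)

# Per-tensor scale: map the largest absolute value to 127 (the INT8 range)
scale = np.abs(w_fp32).max() / 127.0

# Quantize: the stored INT8 weights take 1/4 of the FP32 bytes
w_int8 = np.clip(np.round(w_fp32 / scale), -127, 127).astype(np.int8)
print(w_fp32.nbytes, '->', w_int8.nbytes)   # 524288 -> 131072 bytes

# Dequantize at load/inference time and compute in FP32 as usual
w_dequant = w_int8.astype(np.float32) * scale
print(np.abs(w_fp32 - w_dequant).max())     # small quantization error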


Usage:

Prepare the inference model: first save the FP32 inference model to be compressed

Output the quantized model: call PaddleSlim's dynamic offline quantization interface to produce the quantized model


Practice:

Dynamic quantization is a relatively simple quantization method. It needs no additional data; in general you only need to call the corresponding API of the quantization tool

In PaddleSlim, calling the paddleslim.quant.quant_post_dynamic interface is all that is needed to dynamically quantize a model

Note that this interface must be used in Paddle's static graph mode, otherwise it will not work properly

The specific dynamic quantization example code is as follows:

In [ ]
import paddle
import paddleslim

# Enable static graph mode
paddle.enable_static()

# Model paths and file names
model_dir = "models/MobileNetV1_infer"
model_filename = 'inference.pdmodel'
params_filename = 'inference.pdiparams'
model_dir_quant_dynamic = "models/MobileNetV1_infer_quant_dynamic"

# Dynamic quantization
paddleslim.quant.quant_post_dynamic(
    model_dir=model_dir,                      # Input model directory
    model_filename=model_filename,            # Input model graph file name
    params_filename=params_filename,          # Input model parameter file name
    save_model_dir=model_dir_quant_dynamic,   # Output model directory
    save_model_filename=model_filename,       # Output model graph file name
    save_params_filename=params_filename,     # Output model parameter file name
    weight_bits=8,                            # Quantization bits, 8/16 for INT8/INT16
)
5.4 Static quantization

Introduction:

Static offline quantization computes quantization scale factors from sampled calibration data, using methods such as KL divergence

Compared with quantization-aware training, static offline quantization requires no retraining, so a quantized model can be obtained quickly

The goal of static offline quantization is to obtain the quantization scale factors. There are two main methods:

Non-saturation: the non-saturated quantization method computes the maximum absolute value abs_max of the FP32 Tensor and maps it to 127, so the quantization scale factor equals abs_max / 127

Saturation: the saturated quantization method uses KL divergence to find an appropriate threshold T (0 < T < abs_max) and maps it to 127, so the quantization scale factor equals T / 127


In general, the non-saturated method is used for the weight Tensors of the OPs to be quantized, while the saturated method is used for their activation Tensors (inputs and outputs); both scale computations are sketched below
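As an illustration of the two scale-factor formulas, here is a minimal sketch with made-up data; the threshold T is simply assumed for the example rather than searched via KL divergence as PaddleSlim actually does.
In [ ]
import numpy as np

# Made-up FP32 tensors standing in for a weight and a sampled activation
weight = np.random.randn(64, 128).astype(np.float32)
activation = np.random.randn(1, 64, 56, 56).astype(np.float32)

# Non-saturated method (used for weights): map abs_max to 127
weight_scale = np.abs(weight).max() / 127.0

# Saturated method (used for activations): map a threshold T (0 < T < abs_max) to 127.
# PaddleSlim chooses T by minimizing the KL divergence between the original and the
# clipped distributions; here a T is simply assumed for illustration.
T = 0.8 * np.abs(activation).max()   # hypothetical threshold, not a KL search
act_scale = T / 127.0

# Values beyond T saturate (are clipped) into the INT8 range
act_int8 = np.clip(np.round(activation / act_scale), -127, 127).astype(np.int8)
print(weight_scale, act_scale, act_int8.min(), act_int8.max())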


Usage:

Load the pretrained FP32 model and configure the data Reader

Read the calibration samples, run forward inference of the model, and record the values of the activation Tensors of the OPs to be quantized

Compute the quantization scale factors of the activation Tensors from the sampled data using the saturated method

Keep the weight Tensor data unchanged, and use the non-saturated method to compute each channel's maximum absolute value as that channel's quantization scale factor

Convert the FP32 model to an INT8 model and save it


Practice:

Compared with dynamic quantization, static quantization is more complicated and requires some additional unlabeled calibration data

Here, a small number of images from the ImageNet validation set are used as calibration data

For static quantization, a calibration data Reader needs to be constructed first

Then PaddleSlim's paddleslim.quant.quant_post_static interface is called to perform static quantization

Of course, static quantization also needs to run in Paddle's static graph mode

Note also that the statically quantized model produced by PaddleSlim is not the final model that can be deployed on CPU

You need to use the official Paddle conversion script save_quant_model.py to transform and optimize the output quantized model

The specific static quantization example code is as follows:

5.4.1 Extract the dataset
 Extract the ImageNet validation set data
In [ ]
# Extract the dataset
!mkdir ~/data/ILSVRC2012
!tar -xf ~/data/data68594/ILSVRC2012_img_val.tar -C ~/data/ILSVRC2012
5.4.2 Static quantization of the model
 Read calibration data -> build the Reader -> static quantization
In [ ]
import os
import paddle
import paddleslim
import numpy as np
import paddle.vision.transforms as T
from PIL import Image

# Enable static graph mode
paddle.enable_static()

# Model paths and file names
model_dir = "models/MobileNetV1_infer"
model_filename = 'inference.pdmodel'
params_filename = 'inference.pdiparams'
model_dir_quant_static = "models/MobileNetV1_infer_quant_static"

# Data preprocessing
'''
Resize -> center crop -> type conversion -> transpose -> normalize -> add batch dimension
'''
mean = [0.485, 0.456, 0.406]
std = [0.229, 0.224, 0.225]
val_transforms = T.Compose(
    [
        T.Resize(256, interpolation="bilinear"),
        T.CenterCrop(224),
        lambda x: np.asarray(x, dtype='float32').transpose(2, 0, 1) / 255.0,
        T.Normalize(mean, std),
        lambda x: x[None, ...]
    ]
)

# Calibration data
'''
Read images -> preprocess -> build an iterator
'''
img_dir = 'data/ILSVRC2012'
img_num = 32
datas = iter([
    val_transforms(
        Image.open(os.path.join(img_dir, img)).convert('RGB')
    ) for img in os.listdir(img_dir)[:img_num]
])

# Static quantization
paddleslim.quant.quant_post_static(
    executor=paddle.static.Executor(),            # Paddle static graph executor
    model_dir=model_dir,                          # Input model directory
    model_filename=model_filename,                # Input model graph file name
    params_filename=params_filename,              # Input model parameter file name
    quantize_model_path=model_dir_quant_static,   # Output model directory
    save_model_filename=model_filename,           # Output model graph file name
    save_params_filename=params_filename,         # Output model parameter file name
    batch_generator=None,                         # Batch generator: a callable object returning a generator
    sample_generator=lambda: datas,               # Sample generator: a callable object returning a generator
    data_loader=None,                             # Paddle DataLoader
    batch_size=32,                                # Batch size
    batch_nums=1,                                 # Number of batches; by default all data is used
    weight_bits=8,                                # Weight quantization bits, 8/16 for INT8/INT16
    activation_bits=8,                            # Activation quantization bits, 8/16 for INT8/INT16
    weight_quantize_type='channel_wise_abs_max',  # Weight quantization type
    activation_quantize_type='range_abs_max',     # Activation quantization type: 'range_abs_max', 'moving_average_abs_max' or 'abs_max'
    algo='KL',                                    # Calibration algorithm: 'KL', 'hist', 'mse', 'avg' or 'abs_max'
)
5.4.3 Model conversion
 Execute the conversion script to convert and export the final deployment model
In [ ]
# Run the conversion script
!python save_quant_model.py \
    --quant_model_path "models/MobileNetV1_infer_quant_static" \
    --quant_model_filename "inference.pdmodel" \
    --quant_params_filename "inference.pdiparams" \
    --int8_model_save_path "models/MobileNetV1_infer_quant_static/quantized_model" \
    --save_model_filename "inference.pdmodel" \
    --save_params_filename "inference.pdiparams"
6. Model deployment
6.1 File sizes

The size comparison of model files generated by the two quantization methods is shown in the following table:
Model                | Graph file | Parameter file
Original model       | 414.7 KB   | 16.2 MB
Dynamic quantization | 454.0 KB   | 7.2 MB
Static quantization  | 221.4 KB   | 16.1 MB
Due to the Paddle framework and its model storage format, the parameters of the statically quantized model are still stored in Float32 format, so the parameter file size is not reduced

The parameters produced by dynamic quantization are stored in INT8 format, so the parameter file size drops to about half that of the original model (a quick way to check these sizes is sketched below)
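As a quick check of numbers like these, the parameter file sizes can be listed with a few lines of Python (the directories below are the output paths used in the cells above, and the quantized directories are assumed to keep the inference.pdiparams file name as configured earlier):
In [ ]
import os

# Parameter file sizes of the original and quantized models
for d in ["models/MobileNetV1_infer",
          "models/MobileNetV1_infer_quant_dynamic",
          "models/MobileNetV1_infer_quant_static"]:
    p = os.path.join(d, "inference.pdiparams")
    if os.path.exists(p):
        print(p, round(os.path.getsize(p) / 1024 / 1024, 1), "MB")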
6.2 Deployment frameworks

Dynamic quantization: at present only PaddleLite supports the dequantization-based prediction; server-side prediction (PaddleInference) does not support loading this quantized model

Static quantization: the quantized model can be loaded with PaddleLite or PaddleInference for prediction and inference
6.3 Deployment references

Edge deployment: PaddleLite quantization deployment

Server deployment: Intel CPU quantization deployment
6.4 Deployment practice

The quantized models generated by PaddleSlim are not easy to deploy inside AI Studio, so deployment is not demonstrated here

I ran into a bunch of strange pitfalls: either the results were wrong or the model could not be loaded, and I could not figure out why. Perhaps my skills are just lacking, so I gave up on it for now
7. Closing remarks

Of course, that was just the deployment side; PaddleSlim has many more features, and this time we only introduced the use of two relatively simple model quantization methods

Introductions to quantization-aware training and other PaddleSlim features will be shared later