Model quantization: static quantization and dynamic quantization of a Paddle model

Transferred from AI Studio, original link: Model quantization (2): static and dynamic quantization of a Paddle model - PaddlePaddle AI Studio

1. Introduction

  • The previous article briefly introduced the basic principles of model quantization with simple examples

  • This article continues from there, combining it with PaddleSlim, a tool library for deep learning model compression

  • It introduces how to use PaddleSlim to perform dynamic and static quantization of a Paddle model

2. References

3. PaddleSlim

3.1 introduction

  • PaddleSlim is a tool library officially developed by the Paddle team that focuses on deep learning model compression

  • It provides model compression strategies such as pruning, quantization, distillation, and neural architecture search (NAS) to help users quickly shrink their models

3.2 functions

  • PaddleSlim supports the following built-in strategies, and also allows user-defined quantization, pruning, and other extensions

    | Quantization | Pruning | NAS | Distilling |
    | --- | --- | --- | --- |
    | QAT | SensitivityPruner | *Simulate Anneal based NAS | *FSP |
    | PACT | FPGMFilterPruner | *Reinforcement Learning based NAS | *DML |
    | PTQ-Static | L1NormFilterPruner | **DARTS | *DK for YOLOv3 |
    | PTQ-Dynamic | L2NormFilterPruner | **PC-DARTS | |
    | Embedding Quant | *SlimFilterPruner | **Once-for-All | |
    | | *OptSlimFilterPruner | *Hardware-aware Search | |
    • * indicates that the method is supported only in static graph mode
    • ** indicates that the method is supported only in dynamic graph mode

3.3 compression results

  • PaddleSlim has been applied to model compression on typical computer vision and natural language processing tasks

  • The speedups have been tested on NVIDIA GPUs, ARM chips, and other devices; a few representative results are shown below

  • For details, please refer to the CV and NLP model compression schemes in the official documentation

    • YOLOv3: 3.55x speedup on a Snapdragon 855 (SD855) mobile chip

    • PP-OCR: model size reduced from 8.9 MB to 2.9 MB, with a 1.27x speedup on SD855

    • BERT: parameters reduced from 110M to 80M with improved accuracy, and a 1.47x speedup for FP16 inference on a Tesla T4 GPU

3.4 installation

  • Use the pip command to quickly install PaddleSlim

In [ ]

!pip install paddleslim

4. Model quantization

4.1 quantization methods

  • PaddleSlim mainly provides three quantization methods:

    • Quantization-aware training (QAT): lets the model perceive the impact of quantization on its accuracy during training and reduces the quantization error through fine-tuning.

    • Post-training quantization, dynamic (PTQ Dynamic): dynamic offline quantization only maps the weights of specific operators in the model from FP32 to INT8/16.

    • Post-training quantization, static (PTQ Static): static offline quantization uses a small amount of unlabeled calibration data and methods such as KL divergence to compute the quantization scale factors.

4.2 method selection

  • The following figure shows how to select the model quantization method as needed

4.3 method comparison

  • The quantization methods differ in their usage conditions, ease of use, accuracy loss, and expected benefits

  • The table below compares the API interface, function, and typical application scenarios of each quantization method

    | Quantization method | API | Function | Typical scenarios |
    | --- | --- | --- | --- |
    | Quantization-aware training (QAT) | Dynamic graph: paddleslim.QAT; static graph: paddleslim.quant.quant_aware | Minimizes the quantization error of the model through fine-tuning | Scenarios and models sensitive to quantization, such as object detection, segmentation, and OCR |
    | Static offline quantization (PTQ Static) | paddleslim.quant.quant_post_static | Obtains a quantized model from a small amount of calibration data | Scenarios insensitive to quantization, such as image classification |
    | Dynamic offline quantization (PTQ Dynamic) | paddleslim.quant.quant_post_dynamic | Quantizes only the learnable weights of the model | Large models with heavy memory-access overhead, such as BERT |
    | Embedding quantization (Quant Embedding) | paddleslim.quant.quant_embedding | Quantizes only the Embedding parameters of the model | Any model containing an Embedding layer |

4.4 more details

5. Quantization practice

5.1 case introduction

  • Model: the MobileNetV1 image classification model pre-trained on ImageNet, provided by PaddleClas

  • Quantization methods: the two post-training quantization methods, dynamic quantization and static quantization

5.2 download model

  • Download and extract the MobileNetV1 pre-trained inference model using the download address provided by PaddleClas

In [ ]

# Download model file
!wget -P models https://paddle-imagenet-models-name.bj.bcebos.com/dygraph/inference/MobileNetV1_infer.tar

# Unzip the model file
!cd models && tar -xf MobileNetV1_infer.tar
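
  • A quick optional check: list the extracted files and their sizes (the directory and file names here are the ones used by the later code cells)

In [ ]

# Optional: confirm the files extracted from MobileNetV1_infer.tar
import os

model_dir = "models/MobileNetV1_infer"
for name in sorted(os.listdir(model_dir)):
    size_kb = os.path.getsize(os.path.join(model_dir, name)) / 1024
    print(f"{name}: {size_kb:.1f} KB")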

5.3 dynamic quantization

  • Introduction:

    • Dynamic offline quantization converts the weights of specific OPs in the model from FP32 to INT8/16

    • It requires an already-trained inference model, whose weights can be converted to INT8 or INT16 as needed

    • Currently only the dequantize-and-infer prediction mode is supported, so the method mainly reduces model size and speeds up models whose runtime is dominated by loading weights

    • When the weights are quantized to INT16, model accuracy is unaffected and the model size shrinks to about 1/2 of the original

    • When the weights are quantized to INT8, model accuracy may be affected, and the model size shrinks to about 1/4 of the original

  • Usage:

    • Prepare the inference model: first save an FP32 inference model to be compressed

    • Produce the quantized model: call PaddleSlim's dynamic offline quantization interface and save the quantized model

  • Practice:

    • Dynamic quantization is a relatively simple quantization method: it needs no extra data, and usually only requires calling the corresponding API of the quantization tool

    • In PaddleSlim, calling the paddleslim.quant.quant_post_dynamic interface is enough to dynamically quantize a model

    • Note that this interface must be used in Paddle's static graph mode, otherwise it will not run properly

    • The dynamic quantization example code is as follows:

In [ ]

import paddle
import paddleslim

# Turn on static graph mode
paddle.enable_static()

# Path and file name of the model
model_dir = "models/MobileNetV1_infer"
model_filename = 'inference.pdmodel'
params_filename = 'inference.pdiparams'
model_dir_quant_dynamic = "models/MobileNetV1_infer_quant_dynamic"

# Dynamic quantization
paddleslim.quant.quant_post_dynamic(
    model_dir=model_dir, # Input model directory
    model_filename=model_filename, # Input model graph file name
    params_filename=params_filename, # Input model parameter file name
    save_model_dir=model_dir_quant_dynamic, # Output model directory
    save_model_filename=model_filename, # Output model graph file name
    save_params_filename=params_filename, # Output model parameter file name
    weight_bits=8, # Weight quantization bit width: 8 or 16, producing INT8/INT16 weights
)
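
  • Since dynamic quantization mainly shrinks the weights, a simple sanity check is to compare the parameter file sizes before and after quantization (a minimal sketch, assuming the input and output paths used in the cell above):

In [ ]

# Compare parameter file sizes before and after dynamic quantization
import os

params_filename = 'inference.pdiparams'
for tag, d in [("original", "models/MobileNetV1_infer"),
               ("dynamic quant", "models/MobileNetV1_infer_quant_dynamic")]:
    path = os.path.join(d, params_filename)
    if os.path.exists(path):
        print(f"{tag}: {os.path.getsize(path) / 1024 / 1024:.1f} MB")
    else:
        print(f"{tag}: {path} not found")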

5.4 static quantization

  • Introduction:

    • Static offline quantization computes the quantization scale factors from sampled calibration data, using methods such as KL divergence

    • Compared with quantization-aware training, static offline quantization needs no retraining, so a quantized model can be obtained quickly

    • The goal of static offline quantization is to obtain the quantization scale factors, and there are two main approaches (a small numerical sketch of both follows the code in section 5.4.2):

      • Non-saturated quantization: compute the maximum absolute value abs_max of the FP32 tensor and map it to 127; the quantization scale factor is abs_max/127

      • Saturated quantization: use KL divergence to find an appropriate threshold T (0 < T < abs_max), and map it to 127; the quantization scale factor is T/127

    • In general, the non-saturated method is used for the weight tensors of the OPs to be quantized, and the saturated method is used for their activation tensors (both inputs and outputs)

  • Usage:

    • Load the pre-trained FP32 model and configure the data Reader

    • Read the sample data, run forward inference, and record the values of the activation tensors of the OPs to be quantized

    • Compute the quantization scale factors of the activation tensors from the sampled data, using the saturated method

    • Leave the weight tensor data unchanged, and use the non-saturated method to compute each channel's maximum absolute value as that channel's quantization scale factor

    • Convert the FP32 model into an INT8 model and save it

  • Practice:

    • Compared with dynamic quantization, static quantization is more involved and requires some additional unlabeled calibration data

    • Here, a small number of images from the ImageNet validation set are used as calibration data

    • For static quantization, a calibration data Reader is constructed first

    • Then PaddleSlim's paddleslim.quant.quant_post_static interface is called to perform static quantization

    • As with dynamic quantization, static quantization must run in Paddle's static graph mode

    • Also note that the statically quantized model produced by PaddleSlim is not yet the model that is finally deployed on the CPU

    • The quantized output model still has to be transformed and optimized with Paddle's official conversion script save_quant_model.py

    • The static quantization example code is as follows:

5.4.1 decompressing the dataset

  • Extract the ImageNet validation set data

In [ ]

# Decompress data set
!mkdir ~/data/ILSVRC2012
!tar -xf ~/data/data68594/ILSVRC2012_img_val.tar -C ~/data/ILSVRC2012

5.4.2 static quantization of the model

  • Read calibration data -> build the data Reader -> static quantization

In [ ]

import os
import paddle
import paddleslim
import numpy as np
import paddle.vision.transforms as T

from PIL import Image


# Turn on static graph mode
paddle.enable_static()

# Path and file name of the model
model_dir = "models/MobileNetV1_infer"
model_filename = 'inference.pdmodel'
params_filename = 'inference.pdiparams'
model_dir_quant_static = "models/MobileNetV1_infer_quant_static"

# Data preprocessing
'''
    Resize -> Center crop -> Type conversion -> Transpose -> Normalize -> Add batch dimension
'''
mean = [0.485, 0.456, 0.406]
std = [0.229, 0.224, 0.225]
val_transforms = T.Compose(
    [
        T.Resize(256, interpolation="bilinear"),
        T.CenterCrop(224),
        lambda x: np.asarray(x, dtype='float32').transpose(2, 0, 1) / 255.0,
        T.Normalize(mean, std),
        lambda x: x[None, ...]
    ]
)

# Calibration data reading
'''
    Read image -> Preprocess -> Assemble into a data iterator
'''
img_dir = 'data/ILSVRC2012'
img_num = 32
datas = iter([
    val_transforms(
        Image.open(os.path.join(img_dir, img)).convert('RGB')
    ) for img in os.listdir(img_dir)[:img_num]
])

# Static quantization
paddleslim.quant.quant_post_static(
    executor=paddle.static.Executor(), # Paddle static graph executor
    model_dir=model_dir, # Input model directory
    model_filename=model_filename, # Input model graph file name
    params_filename=params_filename, # Input model parameter file name
    quantize_model_path=model_dir_quant_static, # Output model directory
    save_model_filename=model_filename, # Output model graph file name
    save_params_filename=params_filename, # Output model parameter file name
    batch_generator=None, # Batch generator: a callable object that returns a generator of batches
    sample_generator=lambda: datas, # Sample generator: a callable object that returns a generator of samples
    data_loader=None, # Paddle DataLoader
    batch_size=32, # Batch size of the calibration data
    batch_nums=1, # Number of calibration batches; by default all data is used
    weight_bits=8, # Weight quantization bit width: 8 or 16, for INT8/INT16
    activation_bits=8, # Activation quantization bit width: 8 or 16, for INT8/INT16
    weight_quantize_type='channel_wise_abs_max', # Weight quantization type, e.g. 'channel_wise_abs_max' or 'abs_max'
    activation_quantize_type='range_abs_max', # Activation quantization type: 'range_abs_max', 'moving_average_abs_max' or 'abs_max'
    algo='KL', # Calibration algorithm: 'KL', 'hist', 'mse', 'avg' or 'abs_max'
)
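
  • As a side note on the two scale-factor strategies described in the introduction to 5.4, the cell below is a small self-contained numerical illustration (plain NumPy, with a simple percentile threshold standing in for the KL-divergence search); it is only a toy sketch, not PaddleSlim's actual calibration code:

In [ ]

# Toy illustration of non-saturated vs. saturated scale factors (not PaddleSlim internals)
import numpy as np

np.random.seed(0)
x = np.random.randn(10000).astype('float32') * 0.1
x[:10] = 3.0  # a few large outliers, as activation tensors often have

# Non-saturated: map the absolute maximum abs_max to 127
abs_max = np.abs(x).max()
scale_no_sat = abs_max / 127.0

# Saturated: map a threshold T (0 < T < abs_max) to 127; here a high percentile
# stands in for the threshold that a KL-divergence search would select
T = np.percentile(np.abs(x), 99.9)
scale_sat = T / 127.0

def fake_quant(values, scale):
    # symmetric INT8 quantize followed by dequantize
    q = np.clip(np.round(values / scale), -127, 127)
    return q * scale

for name, scale in [('non-saturated', scale_no_sat), ('saturated', scale_sat)]:
    err = np.abs(fake_quant(x, scale) - x).mean()
    print(f"{name}: scale = {scale:.6f}, mean abs error = {err:.6f}")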

5.4.3 model conversion

  • Run the conversion script to convert the quantized model and export the final deployable model

In [ ]

# Execute conversion script
!python save_quant_model.py \
    --quant_model_path "models/MobileNetV1_infer_quant_static" \
    --quant_model_filename "inference.pdmodel" \
    --quant_params_filename "inference.pdiparams" \
    --int8_model_save_path "models/MobileNetV1_infer_quant_static/quantized_model" \
    --save_model_filename "inference.pdmodel" \
    --save_params_filename "inference.pdiparams"

6. Model deployment

6.1 file size

  • The sizes of the model files produced by the two quantization methods are compared in the table below:

    | Model | Model graph file | Parameter file |
    | --- | --- | --- |
    | Original model | 414.7 KB | 16.2 MB |
    | Dynamic quantization | 454.0 KB | 7.2 MB |
    | Static quantization | 221.4 KB | 16.1 MB |

  • Because of the Paddle framework's model storage format, the parameters of the statically quantized model are still stored as float32, so the parameter file does not shrink

  • The parameters produced by dynamic quantization are stored as INT8, so the parameter file shrinks to roughly half the size of the original model's

6.2 deployment framework

  • Dynamic quantization: currently only Paddle Lite supports the dequantize-and-infer mode for this model; server-side inference (Paddle Inference) cannot load the quantized model

  • Static quantization: the quantized model can be loaded for inference with either Paddle Lite or Paddle Inference
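
  • For reference only, loading the converted INT8 model with Paddle Inference would look roughly like the sketch below; this is an untested outline (deployment is not demonstrated in this post, see 6.4), and it assumes the converted model output from 5.4.3 at the path shown:

In [ ]

# Rough, untested sketch: running the converted model with Paddle Inference
import numpy as np
from paddle.inference import Config, create_predictor

model_dir = "models/MobileNetV1_infer_quant_static/quantized_model"
config = Config(f"{model_dir}/inference.pdmodel", f"{model_dir}/inference.pdiparams")
config.disable_gpu()
config.enable_mkldnn()  # INT8 kernels on x86 CPUs generally rely on MKL-DNN (oneDNN)

predictor = create_predictor(config)

# Feed one preprocessed image (1 x 3 x 224 x 224, as produced in section 5.4.2)
fake_input = np.random.rand(1, 3, 224, 224).astype('float32')
input_handle = predictor.get_input_handle(predictor.get_input_names()[0])
input_handle.reshape([1, 3, 224, 224])
input_handle.copy_from_cpu(fake_input)

predictor.run()

output_handle = predictor.get_output_handle(predictor.get_output_names()[0])
print(output_handle.copy_to_cpu().shape)  # e.g. (1, 1000) class scores for ImageNet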

6.3 deployment references

6.4 deployment practice

  • The quantized models produced by PaddleSlim did not deploy smoothly in AI Studio, so deployment is not demonstrated here

  • I hit a bunch of strange pitfalls: either the results came out wrong or the model could not be loaded at all, and I could not figure out why. Maybe my skills just aren't there yet, so let's leave it for now

7. Closing

  • That's it for this post. PaddleSlim has many more features; only two relatively simple model quantization methods were introduced here

  • The usage of PaddleSlim's quantization-aware training and other features will be shared gradually in later posts
