TVM is an open-source deep learning compiler that targets all kinds of CPUs, GPUs, and other specialized accelerators. Its goal is to let us optimize and run our own models on any hardware. Unlike deep learning frameworks, which focus on model productivity, TVM focuses on the performance and efficiency of models on hardware.
This article briefly introduces TVM's compilation flow and how to auto-tune your own model. For a deeper understanding, see the official TVM resources:
- Documentation: https://tvm.apache.org/docs/
- Source code: https://github.com/apache/tvm
Compilation process
The TVM document Design and Architecture describes the example compilation flow, the logical architecture components, device/target implementations, and so on. The flow is shown in the figure below:
At a high level, it includes the following steps (a minimal code sketch follows the list):
- Import: the frontend component ingests a model into an IRModule, a collection of functions in the compiler's internal representation (IR) of the model.
- Transformation: the compiler transforms an IRModule into another IRModule that is functionally equivalent or approximately equivalent (e.g., in the case of quantization). Most transformations are target (backend) independent, though TVM also allows the target to affect the configuration of the transformation passes.
- Target Translation: the compiler translates (code-generates) the IRModule into an executable format for the target. The result is encapsulated as a runtime Module that can be exported, loaded, and executed in the target runtime environment.
- Runtime Execution: the user loads a runtime Module back and runs the compiled functions in the supported runtime environment.
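As a minimal illustration (not from the original document), the four stages map onto the Relay Python API roughly as follows, assuming a toy one-operator model built by hand:

```python
import numpy as np
import tvm
from tvm import relay
from tvm.contrib import graph_executor

# Import: construct an IRModule directly (a frontend such as
# relay.frontend.from_onnx would normally produce this from a model file).
x = relay.var("x", shape=(1, 4), dtype="float32")
func = relay.Function([x], relay.nn.softmax(x))
mod = tvm.IRModule.from_expr(func)

# Transformation + Target Translation: optimization passes rewrite the
# IRModule, then code is generated for the target and wrapped in a
# runtime Module.
with tvm.transform.PassContext(opt_level=3):
    lib = relay.build(mod, target="llvm")

# Runtime Execution: load the compiled functions on a device and run them.
dev = tvm.device("llvm", 0)
gmod = graph_executor.GraphModule(lib["default"](dev))
gmod.set_input("x", np.random.rand(1, 4).astype("float32"))
gmod.run()
print(gmod.get_output(0).numpy())
```

The rest of this article walks through the same flow for a real ONNX model.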
Optimizing the model
The TVM User Tutorial starts with how to compile and optimize a model, then goes gradually deeper into lower-level logical components such as TE, TensorIR, and Relay.
Here we only cover how to use AutoTVM to automatically tune a model, to get a hands-on feel for how TVM compiles, tunes, and runs a model. See the original tutorial: Compiling and Optimizing a Model with the Python Interface (AutoTVM).
Prepare TVM
First, install TVM. See the document Installing TVM, or the note "TVM installation".
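Once installed, a quick sanity check is to print the version from Python:

```python
import tvm

# A successful import plus a version string means the Python package works.
print(tvm.__version__)
```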
After that, you can tune a model through the TVM Python API. First, import the following dependencies:
```python
import onnx
from tvm.contrib.download import download_testdata
from PIL import Image
import numpy as np
import tvm.relay as relay
import tvm
from tvm.contrib import graph_executor
```
Prepare and load the model
Download the pre-trained ResNet-50 v2 ONNX model and load it:
```python
model_url = "".join(
    [
        "https://github.com/onnx/models/raw/",
        "main/vision/classification/resnet/model/",
        "resnet50-v2-7.onnx",
    ]
)
model_path = download_testdata(model_url, "resnet50-v2-7.onnx", module="onnx")
onnx_model = onnx.load(model_path)
```
Prepare and preprocess the test image
Download a test image and preprocess it into the 224x224 NCHW format:
```python
img_url = "https://s3.amazonaws.com/model-server/inputs/kitten.jpg"
img_path = download_testdata(img_url, "imagenet_cat.png", module="data")

# Resize it to 224x224
resized_image = Image.open(img_path).resize((224, 224))
img_data = np.asarray(resized_image).astype("float32")

# Our input image is in HWC layout while ONNX expects CHW input, so convert the array
img_data = np.transpose(img_data, (2, 0, 1))

# Normalize according to the ImageNet input specification
imagenet_mean = np.array([0.485, 0.456, 0.406]).reshape((3, 1, 1))
imagenet_stddev = np.array([0.229, 0.224, 0.225]).reshape((3, 1, 1))
norm_img_data = (img_data / 255 - imagenet_mean) / imagenet_stddev

# Add the batch dimension, as we are expecting 4-dimensional input: NCHW.
img_data = np.expand_dims(norm_img_data, axis=0)
```
Compile the model with TVM Relay
Import the ONNX model into TVM Relay and compile it:
```python
target = input("target [llvm]: ")
if not target:
    target = "llvm"
# target = "llvm -mcpu=core-avx2"
# target = "llvm -mcpu=skylake-avx512"

# The input name may vary across model types. You can use a tool
# like Netron to check input names
input_name = "data"
shape_dict = {input_name: img_data.shape}

mod, params = relay.frontend.from_onnx(onnx_model, shape_dict)

with tvm.transform.PassContext(opt_level=3):
    lib = relay.build(mod, target=target, params=params)

dev = tvm.device(str(target), 0)
module = graph_executor.GraphModule(lib["default"](dev))
```
Here `target` is the target hardware platform; `llvm` means the CPU. It is recommended to specify the CPU's architecture and instruction set so the compiler can optimize for them. You can inspect your CPU with the following commands:
```bash
$ llc --version | grep CPU
  Host CPU: skylake
$ lscpu
```
Or check the product specifications directly on the manufacturer's website (e.g. Intel® Products).
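As a small aside (not in the original tutorial), you can parse a target string into a `tvm.target.Target` object to confirm what you specified; the `-mcpu` value below is just an example:

```python
import tvm

# `kind` is the backend ("llvm" means CPU codegen via LLVM); `attrs`
# carries options such as -mcpu.
tgt = tvm.target.Target("llvm -mcpu=skylake-avx512")
print(tgt.kind.name)      # llvm
print(tgt.attrs["mcpu"])  # skylake-avx512
```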
Run the model with TVM Runtime
Make a prediction with the TVM runtime:
```python
dtype = "float32"
module.set_input(input_name, img_data)
module.run()
output_shape = (1, 1000)
tvm_output = module.get_output(0, tvm.nd.empty(output_shape)).numpy()
```
Collect performance data before optimization
Collect baseline performance data for the unoptimized model:
```python
import timeit

timing_number = 10
timing_repeat = 10
unoptimized = (
    np.array(timeit.Timer(lambda: module.run()).repeat(repeat=timing_repeat, number=timing_number))
    * 1000
    / timing_number
)
unoptimized = {
    "mean": np.mean(unoptimized),
    "median": np.median(unoptimized),
    "std": np.std(unoptimized),
}

print(unoptimized)
```
Later, we will compare this against the optimized performance.
Postprocess the output
Post-process the output tensor into human-readable classification results:
```python
from scipy.special import softmax

# Download a list of labels
labels_url = "https://s3.amazonaws.com/onnx-model-zoo/synset.txt"
labels_path = download_testdata(labels_url, "synset.txt", module="data")

with open(labels_path, "r") as f:
    labels = [l.rstrip() for l in f]

# Open the output and read the output tensor
scores = softmax(tvm_output)
scores = np.squeeze(scores)
ranks = np.argsort(scores)[::-1]
for rank in ranks[0:5]:
    print("class='%s' with probability=%f" % (labels[rank], scores[rank]))
```
Tune the model and collect tuning data
On the target hardware platform, use AutoTVM to automatically tune the model and collect the tuning data:
```python
import tvm.auto_scheduler as auto_scheduler
from tvm.autotvm.tuner import XGBTuner
from tvm import autotvm

number = 10
repeat = 1
min_repeat_ms = 0  # since we're tuning on a CPU, can be set to 0
timeout = 10  # in seconds

# create a TVM runner
runner = autotvm.LocalRunner(
    number=number,
    repeat=repeat,
    timeout=timeout,
    min_repeat_ms=min_repeat_ms,
    enable_cpu_cache_flush=True,
)

tuning_option = {
    "tuner": "xgb",
    "trials": 10,
    "early_stopping": 100,
    "measure_option": autotvm.measure_option(
        builder=autotvm.LocalBuilder(build_func="default"), runner=runner
    ),
    "tuning_records": "resnet-50-v2-autotuning.json",
}

# begin by extracting the tasks from the onnx model
tasks = autotvm.task.extract_from_program(mod["main"], target=target, params=params)

# Tune the extracted tasks sequentially.
for i, task in enumerate(tasks):
    prefix = "[Task %2d/%2d] " % (i + 1, len(tasks))
    tuner_obj = XGBTuner(task, loss_type="rank")
    tuner_obj.tune(
        n_trial=min(tuning_option["trials"], len(task.config_space)),
        early_stopping=tuning_option["early_stopping"],
        measure_option=tuning_option["measure_option"],
        callbacks=[
            autotvm.callback.progress_bar(tuning_option["trials"], prefix=prefix),
            autotvm.callback.log_to_file(tuning_option["tuning_records"]),
        ],
    )
```
The `tuning_option` above selects the XGBoost tuner (`"xgb"`) to guide the search, and records the tuning data into the `tuning_records` file.
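`"xgb"` is not the only choice: `tvm.autotvm.tuner` also provides GATuner, RandomTuner, and GridSearchTuner. Below is a small sketch of a hypothetical helper (not part of the tutorial) that maps the `tuner` option name to a tuner instance:

```python
from tvm.autotvm.tuner import GATuner, GridSearchTuner, RandomTuner, XGBTuner

# Hypothetical helper: choose a tuner class by the name in tuning_option["tuner"].
def create_tuner(name, task):
    tuners = {
        "xgb": lambda t: XGBTuner(t, loss_type="rank"),
        "ga": lambda t: GATuner(t, pop_size=50),
        "random": RandomTuner,
        "gridsearch": GridSearchTuner,
    }
    return tuners[name](task)
```

With this, the loop above could use `tuner_obj = create_tuner(tuning_option["tuner"], task)` instead of hard-coding XGBTuner.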
Recompile the model with the tuning data
Recompile the model into an optimized version using the collected tuning records:
```python
with autotvm.apply_history_best(tuning_option["tuning_records"]):
    with tvm.transform.PassContext(opt_level=3, config={}):
        lib = relay.build(mod, target=target, params=params)

dev = tvm.device(str(target), 0)
module = graph_executor.GraphModule(lib["default"](dev))

# Verify that the optimized model runs and produces the same results
dtype = "float32"
module.set_input(input_name, img_data)
module.run()
output_shape = (1, 1000)
tvm_output = module.get_output(0, tvm.nd.empty(output_shape)).numpy()

scores = softmax(tvm_output)
scores = np.squeeze(scores)
ranks = np.argsort(scores)[::-1]
for rank in ranks[0:5]:
    print("class='%s' with probability=%f" % (labels[rank], scores[rank]))
```
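Since the build result is a runtime module (as described in the compilation flow above), it can also be exported to disk and reloaded later, so the compile/tune work is not repeated every run. A minimal sketch (the file name is just an example):

```python
# Export the compiled module as a shared library, then reload it.
lib.export_library("resnet50-v2-7-tuned.so")

loaded_lib = tvm.runtime.load_module("resnet50-v2-7-tuned.so")
module = graph_executor.GraphModule(loaded_lib["default"](dev))
```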
Compare the tuned and untuned models
Collect the post-optimization performance data and compare it with the baseline collected earlier:
```python
import timeit

timing_number = 10
timing_repeat = 10
optimized = (
    np.array(timeit.Timer(lambda: module.run()).repeat(repeat=timing_repeat, number=timing_number))
    * 1000
    / timing_number
)
optimized = {"mean": np.mean(optimized), "median": np.median(optimized), "std": np.std(optimized)}

print("optimized: %s" % (optimized))
print("unoptimized: %s" % (unoptimized))
```
The output of running the whole process is as follows:
```
$ time python autotvm_tune.py
# TVM compilation and operation model
## Downloading and Loading the ONNX Model
## Downloading, Preprocessing, and Loading the Test Image
## Compile the Model With Relay
target [llvm]: llvm -mcpu=core-avx2
One or more operators have not been tuned. Please tune your model for better performance. Use DEBUG logging level to see more details.
## Execute on the TVM Runtime
## Collect Basic Performance Data
{'mean': 44.97057118016528, 'median': 42.52320024970686, 'std': 6.870915251002107}
## Postprocess the output
class='n02123045 tabby, tabby cat' with probability=0.621104
class='n02123159 tiger cat' with probability=0.356378
class='n02124075 Egyptian cat' with probability=0.019712
class='n02129604 tiger, Panthera tigris' with probability=0.001215
class='n04040759 radiator' with probability=0.000262
# AutoTVM tuning model [Y/n]
## Tune the model
[Task  1/25]  Current/Best:  156.96/ 353.76 GFLOPS | Progress: (10/10) | 4.78 s Done.
[Task  2/25]  Current/Best:   54.66/ 241.25 GFLOPS | Progress: (10/10) | 2.88 s Done.
[Task  3/25]  Current/Best:  116.71/ 241.30 GFLOPS | Progress: (10/10) | 3.48 s Done.
[Task  4/25]  Current/Best:  119.92/ 184.18 GFLOPS | Progress: (10/10) | 3.48 s Done.
[Task  5/25]  Current/Best:   48.92/ 158.38 GFLOPS | Progress: (10/10) | 3.13 s Done.
[Task  6/25]  Current/Best:  156.89/ 230.95 GFLOPS | Progress: (10/10) | 2.82 s Done.
[Task  7/25]  Current/Best:   92.33/ 241.99 GFLOPS | Progress: (10/10) | 2.40 s Done.
[Task  8/25]  Current/Best:   50.04/ 331.82 GFLOPS | Progress: (10/10) | 2.64 s Done.
[Task  9/25]  Current/Best:  188.47/ 409.93 GFLOPS | Progress: (10/10) | 4.44 s Done.
[Task 10/25]  Current/Best:   44.81/ 181.67 GFLOPS | Progress: (10/10) | 2.32 s Done.
[Task 11/25]  Current/Best:   83.74/ 312.66 GFLOPS | Progress: (10/10) | 2.74 s Done.
[Task 12/25]  Current/Best:   96.48/ 294.40 GFLOPS | Progress: (10/10) | 2.82 s Done.
[Task 13/25]  Current/Best:  123.74/ 354.34 GFLOPS | Progress: (10/10) | 2.62 s Done.
[Task 14/25]  Current/Best:   23.76/ 178.71 GFLOPS | Progress: (10/10) | 2.90 s Done.
[Task 15/25]  Current/Best:  119.18/ 534.63 GFLOPS | Progress: (10/10) | 2.49 s Done.
[Task 16/25]  Current/Best:  101.24/ 172.92 GFLOPS | Progress: (10/10) | 2.49 s Done.
[Task 17/25]  Current/Best:  309.85/ 309.85 GFLOPS | Progress: (10/10) | 2.69 s Done.
[Task 18/25]  Current/Best:   54.45/ 368.31 GFLOPS | Progress: (10/10) | 2.46 s Done.
[Task 19/25]  Current/Best:   78.69/ 162.43 GFLOPS | Progress: (10/10) | 3.29 s Done.
[Task 20/25]  Current/Best:   40.78/ 317.50 GFLOPS | Progress: (10/10) | 4.52 s Done.
[Task 21/25]  Current/Best:  169.03/ 296.36 GFLOPS | Progress: (10/10) | 3.95 s Done.
[Task 22/25]  Current/Best:   90.96/ 210.43 GFLOPS | Progress: (10/10) | 2.28 s Done.
[Task 23/25]  Current/Best:   48.93/ 217.36 GFLOPS | Progress: (10/10) | 2.87 s Done.
[Task 25/25]  Current/Best:    0.00/   0.00 GFLOPS | Progress: (0/10) | 0.00 s Done.
[Task 25/25]  Current/Best:   25.50/  33.86 GFLOPS | Progress: (10/10) | 9.28 s Done.
## Compiling an Optimized Model with Tuning Data
class='n02123045 tabby, tabby cat' with probability=0.621104
class='n02123159 tiger cat' with probability=0.356378
class='n02124075 Egyptian cat' with probability=0.019712
class='n02129604 tiger, Panthera tigris' with probability=0.001215
class='n04040759 radiator' with probability=0.000262
## Comparing the Tuned and Untuned Models
optimized: {'mean': 34.736288779822644, 'median': 34.547542000655085, 'std': 0.5144378649382363}
unoptimized: {'mean': 44.97057118016528, 'median': 42.52320024970686, 'std': 6.870915251002107}

real	3m23.904s
user	5m2.900s
sys	5m37.099s
```
Comparing the performance data, the optimized model runs faster (lower mean latency) and more consistently (lower standard deviation).
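For example, a rough speedup figure can be derived from the two timing dicts collected above:

```python
# Mean speedup from the timing dicts above; ~1.29x for the run shown.
speedup = unoptimized["mean"] / optimized["mean"]
print("mean speedup: %.2fx" % speedup)
```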
References
- Notes: start-ai-compiler
- Papers:
  - 2020 / The Deep Learning Compiler: A Comprehensive Survey
    - [[Translation] Overview of deep learning compilers](https://www.jianshu.com/p/ed3...)
  - 2018 / TVM: An Automated End-to-End Optimizing Compiler for Deep Learning
    - [[Translation] TVM: an automated end-to-end optimizing compiler for deep learning](https://zhuanlan.zhihu.com/p/...)