TVM, acceleration model, optimization, inference

TVM is an open source deep learning compiler, which can be applied to all kinds of CPUs, GPUs and other special accelerators. Its goal is to enable us to optimize and run our own model on any hardware. Unlike the deep learning framework, which focuses on model productivity, TVM pays more attention to the performance and efficiency of the model in hardware.

This article only briefly introduces the compilation process of TVM and how to automatically tune your own model. For more in-depth understanding, you can see the official content of TVM:

Compilation process

TVM document Design and Architecture It describes the example compilation process, logical structure components, equipment target realization and so on. The process is shown in the figure below:

At a high level, it includes the following steps:

  • Import: the front-end component extracts the model into the IRModule, which is the function collection of the internal representation (IR) of the model.
  • Transformation: the compiler converts an IRModule into another IRModule that is functionally equivalent or approximately equivalent (such as in the case of quantization). Most transformations are independent of the target (back end). TVM also allows the target to affect the configuration of the conversion channel.
  • Target Translation: the compiler translates (code generation) IRModule to the executable format on the target. The Target Translation result is encapsulated as runtime Module, which can be exported, loaded and executed in the target runtime environment.
  • Runtime Execution: the user loads a runtime Module and run the compiled functions in the supported runtime environment.

Optimization model

TVM document User Tutorial Start with how to compile the optimization model and gradually go deep into the lower logical structure components such as te, tensorir and relay.

Here we only talk about how to use AutoTVM to automatically tune the model, and actually understand the process of TVM compiling, tuning and running the model. See original text Compiling and Optimizing a Model with the Python Interface (AutoTVM).

Prepare TVM

First, install TVM. Visible document Installing TVM , or notes "TVM installation".

After that, you can tune the model through the TVM Python API. Let's import the following dependencies first:

import onnx
from tvm.contrib.download import download_testdata
from PIL import Image
import numpy as np
import tvm.relay as relay
import tvm
from tvm.contrib import graph_executor

Prepare the model and load it

Obtain the pre trained ResNet-50 v2 ONNX model and load:

model_url = "".join(
    [
        "https://github.com/onnx/models/raw/",
        "main/vision/classification/resnet/model/",
        "resnet50-v2-7.onnx",
    ]
)

model_path = download_testdata(model_url, "resnet50-v2-7.onnx", module="onnx")
onnx_model = onnx.load(model_path)

Prepare pictures and pre process them

Obtain a test picture and pre process it into 224x224 NCHW format:

img_url = "https://s3.amazonaws.com/model-server/inputs/kitten.jpg"
img_path = download_testdata(img_url, "imagenet_cat.png", module="data")

# Resize it to 224x224
resized_image = Image.open(img_path).resize((224, 224))
img_data = np.asarray(resized_image).astype("float32")

# Our input image is in HWC layout while ONNX expects CHW input, so convert the array
img_data = np.transpose(img_data, (2, 0, 1))

# Normalize according to the ImageNet input specification
imagenet_mean = np.array([0.485, 0.456, 0.406]).reshape((3, 1, 1))
imagenet_stddev = np.array([0.229, 0.224, 0.225]).reshape((3, 1, 1))
norm_img_data = (img_data / 255 - imagenet_mean) / imagenet_stddev

# Add the batch dimension, as we are expecting 4-dimensional input: NCHW.
img_data = np.expand_dims(norm_img_data, axis=0)

Compile the model with TVM Relay

Create tvnx Relay model and import it into TVM Relay:

target = input("target [llvm]: ")
if not target:
    target = "llvm"
    # target = "llvm -mcpu=core-avx2"
    # target = "llvm -mcpu=skylake-avx512"

# The input name may vary across model types. You can use a tool
# like Netron to check input names
input_name = "data"
shape_dict = {input_name: img_data.shape}

mod, params = relay.frontend.from_onnx(onnx_model, shape_dict)

with tvm.transform.PassContext(opt_level=3):
    lib = relay.build(mod, target=target, params=params)

dev = tvm.device(str(target), 0)
module = graph_executor.GraphModule(lib["default"](dev))

Where target is the target hardware platform. llvm refers to the CPU. It is recommended to specify the architecture instruction set to optimize performance. The CPU can be viewed with the following command:

$ llc --version | grep CPU
  Host CPU: skylake
$ lscpu

Or go directly to the manufacturer's website (e.g IntelĀ® Products )View product parameters.

Run the model with TVM Runtime

Use TVM Runtime model to predict:

dtype = "float32"
module.set_input(input_name, img_data)
module.run()
output_shape = (1, 1000)
tvm_output = module.get_output(0, tvm.nd.empty(output_shape)).numpy()

Collect performance data before optimization

Collect performance data before optimization:

import timeit

timing_number = 10
timing_repeat = 10
unoptimized = (
    np.array(timeit.Timer(lambda: module.run()).repeat(repeat=timing_repeat, number=timing_number))
    * 1000
    / timing_number
)
unoptimized = {
    "mean": np.mean(unoptimized),
    "median": np.median(unoptimized),
    "std": np.std(unoptimized),
}

print(unoptimized)

Then, it is used to compare the optimized performance.

After processing the output, we can know the prediction result

The output prediction results are post processed into readable classification results:

from scipy.special import softmax

# Download a list of labels
labels_url = "https://s3.amazonaws.com/onnx-model-zoo/synset.txt"
labels_path = download_testdata(labels_url, "synset.txt", module="data")

with open(labels_path, "r") as f:
    labels = [l.rstrip() for l in f]

# Open the output and read the output tensor
scores = softmax(tvm_output)
scores = np.squeeze(scores)
ranks = np.argsort(scores)[::-1]
for rank in ranks[0:5]:
    print("class='%s' with probability=%f" % (labels[rank], scores[rank]))

Tuning model and obtaining tuning data

On the target hardware platform, use AutoTVM to automatically tune and obtain tuning data:

import tvm.auto_scheduler as auto_scheduler
from tvm.autotvm.tuner import XGBTuner
from tvm import autotvm

number = 10
repeat = 1
min_repeat_ms = 0  # since we're tuning on a CPU, can be set to 0
timeout = 10  # in seconds

# create a TVM runner
runner = autotvm.LocalRunner(
    number=number,
    repeat=repeat,
    timeout=timeout,
    min_repeat_ms=min_repeat_ms,
    enable_cpu_cache_flush=True,
)

tuning_option = {
    "tuner": "xgb",
    "trials": 10,
    "early_stopping": 100,
    "measure_option": autotvm.measure_option(
        builder=autotvm.LocalBuilder(build_func="default"), runner=runner
    ),
    "tuning_records": "resnet-50-v2-autotuning.json",
}

# begin by extracting the tasks from the onnx model
tasks = autotvm.task.extract_from_program(mod["main"], target=target, params=params)

# Tune the extracted tasks sequentially.
for i, task in enumerate(tasks):
    prefix = "[Task %2d/%2d] " % (i + 1, len(tasks))
    tuner_obj = XGBTuner(task, loss_type="rank")
    tuner_obj.tune(
        n_trial=min(tuning_option["trials"], len(task.config_space)),
        early_stopping=tuning_option["early_stopping"],
        measure_option=tuning_option["measure_option"],
        callbacks=[
            autotvm.callback.progress_bar(tuning_option["trials"], prefix=prefix),
            autotvm.callback.log_to_file(tuning_option["tuning_records"]),
        ],
    )

Above tuning_option selects the XGBoost Grid algorithm for optimization search, and records the data into tuning_records.

Recompile the model and use the tuning data

Recompile an optimization model according to the optimization data:

with autotvm.apply_history_best(tuning_option["tuning_records"]):
    with tvm.transform.PassContext(opt_level=3, config={}):
        lib = relay.build(mod, target=target, params=params)

dev = tvm.device(str(target), 0)
module = graph_executor.GraphModule(lib["default"](dev))


# Verify that the optimized model runs and produces the same results

dtype = "float32"
module.set_input(input_name, img_data)
module.run()
output_shape = (1, 1000)
tvm_output = module.get_output(0, tvm.nd.empty(output_shape)).numpy()

scores = softmax(tvm_output)
scores = np.squeeze(scores)
ranks = np.argsort(scores)[::-1]
for rank in ranks[0:5]:
    print("class='%s' with probability=%f" % (labels[rank], scores[rank]))

Compare the optimized and non optimized models

Collect the performance data after optimization and compare it with that before optimization:

import timeit

timing_number = 10
timing_repeat = 10
optimized = (
    np.array(timeit.Timer(lambda: module.run()).repeat(repeat=timing_repeat, number=timing_number))
    * 1000
    / timing_number
)
optimized = {"mean": np.mean(optimized), "median": np.median(optimized), "std": np.std(optimized)}

print("optimized: %s" % (optimized))
print("unoptimized: %s" % (unoptimized))

The running results of the whole process are as follows:

$ time python autotvm_tune.py
# TVM compilation and operation model
## Downloading and Loading the ONNX Model
## Downloading, Preprocessing, and Loading the Test Image
## Compile the Model With Relay
target [llvm]: llvm -mcpu=core-avx2
One or more operators have not been tuned. Please tune your model for better performance. Use DEBUG logging level to see more details.
## Execute on the TVM Runtime
## Collect Basic Performance Data
{'mean': 44.97057118016528, 'median': 42.52320024970686, 'std': 6.870915251002107}
## Postprocess the output
class='n02123045 tabby, tabby cat' with probability=0.621104
class='n02123159 tiger cat' with probability=0.356378
class='n02124075 Egyptian cat' with probability=0.019712
class='n02129604 tiger, Panthera tigris' with probability=0.001215
class='n04040759 radiator' with probability=0.000262
# AutoTVM tuning model [Y/n]
## Tune the model
[Task  1/25]  Current/Best:  156.96/ 353.76 GFLOPS | Progress: (10/10) | 4.78 s Done.
[Task  2/25]  Current/Best:   54.66/ 241.25 GFLOPS | Progress: (10/10) | 2.88 s Done.
[Task  3/25]  Current/Best:  116.71/ 241.30 GFLOPS | Progress: (10/10) | 3.48 s Done.
[Task  4/25]  Current/Best:  119.92/ 184.18 GFLOPS | Progress: (10/10) | 3.48 s Done.
[Task  5/25]  Current/Best:   48.92/ 158.38 GFLOPS | Progress: (10/10) | 3.13 s Done.
[Task  6/25]  Current/Best:  156.89/ 230.95 GFLOPS | Progress: (10/10) | 2.82 s Done.
[Task  7/25]  Current/Best:   92.33/ 241.99 GFLOPS | Progress: (10/10) | 2.40 s Done.
[Task  8/25]  Current/Best:   50.04/ 331.82 GFLOPS | Progress: (10/10) | 2.64 s Done.
[Task  9/25]  Current/Best:  188.47/ 409.93 GFLOPS | Progress: (10/10) | 4.44 s Done.
[Task 10/25]  Current/Best:   44.81/ 181.67 GFLOPS | Progress: (10/10) | 2.32 s Done.
[Task 11/25]  Current/Best:   83.74/ 312.66 GFLOPS | Progress: (10/10) | 2.74 s Done.
[Task 12/25]  Current/Best:   96.48/ 294.40 GFLOPS | Progress: (10/10) | 2.82 s Done.
[Task 13/25]  Current/Best:  123.74/ 354.34 GFLOPS | Progress: (10/10) | 2.62 s Done.
[Task 14/25]  Current/Best:   23.76/ 178.71 GFLOPS | Progress: (10/10) | 2.90 s Done.
[Task 15/25]  Current/Best:  119.18/ 534.63 GFLOPS | Progress: (10/10) | 2.49 s Done.
[Task 16/25]  Current/Best:  101.24/ 172.92 GFLOPS | Progress: (10/10) | 2.49 s Done.
[Task 17/25]  Current/Best:  309.85/ 309.85 GFLOPS | Progress: (10/10) | 2.69 s Done.
[Task 18/25]  Current/Best:   54.45/ 368.31 GFLOPS | Progress: (10/10) | 2.46 s Done.
[Task 19/25]  Current/Best:   78.69/ 162.43 GFLOPS | Progress: (10/10) | 3.29 s Done.
[Task 20/25]  Current/Best:   40.78/ 317.50 GFLOPS | Progress: (10/10) | 4.52 s Done.
[Task 21/25]  Current/Best:  169.03/ 296.36 GFLOPS | Progress: (10/10) | 3.95 s Done.
[Task 22/25]  Current/Best:   90.96/ 210.43 GFLOPS | Progress: (10/10) | 2.28 s Done.
[Task 23/25]  Current/Best:   48.93/ 217.36 GFLOPS | Progress: (10/10) | 2.87 s Done.
[Task 25/25]  Current/Best:    0.00/   0.00 GFLOPS | Progress: (0/10) | 0.00 s Done.
[Task 25/25]  Current/Best:   25.50/  33.86 GFLOPS | Progress: (10/10) | 9.28 s Done.
## Compiling an Optimized Model with Tuning Data
class='n02123045 tabby, tabby cat' with probability=0.621104
class='n02123159 tiger cat' with probability=0.356378
class='n02124075 Egyptian cat' with probability=0.019712
class='n02129604 tiger, Panthera tigris' with probability=0.001215
class='n04040759 radiator' with probability=0.000262
## Comparing the Tuned and Untuned Models
optimized: {'mean': 34.736288779822644, 'median': 34.547542000655085, 'std': 0.5144378649382363}
unoptimized: {'mean': 44.97057118016528, 'median': 42.52320024970686, 'std': 6.870915251002107}

real    3m23.904s
user    5m2.900s
sys     5m37.099s

Comparing the performance data, we can find that the running speed of the optimized model is faster and more stable.

reference resources

Experience sharing of GoCoding personal practice, you can pay attention to official account!

Tags: Deep Learning compiler

Posted by dkruythoff on Sun, 22 May 2022 09:15:51 +0300