Retrace the long march - PaddleClas training ImageNet 1K dataset practice - background task version

Retrace the long march - PaddleClas training ImageNet 1K dataset practice

Let's reproduce the work of the PaddleClas engineers
Friendly reminder: with the default configuration, the background task of this project runs for about 100 hours on a single card and 25 hours on four cards! If you are not rich in compute, please reduce the Epoch value. The rich and the training enthusiasts may do as they please!


Although I have been using PaddleClas for a long time, I have never done an ImageNet 1K training run myself. I know the accuracy figures of the models in the PaddleClas model library well, but I have never practiced how they are actually trained.
This project is about "following the long march road again": training a model from scratch on the AIStudio platform to see whether the accuracy stated in the papers and the PaddleClas documents can be reproduced.

For a novice like me, there are two main obstacles:

  1. The dataset, that is, the ImageNet 1K dataset, is too large.
    This dataset cannot be unpacked in AIStudio's notebook environment because it exceeds the 100G hard disk space limit.
  2. Training requires a lot of computing power.
    The official team already provides pretrained models, so most people have no reason to retrain them.

However, every novice eventually becomes an old hand, and training a mainstream model from scratch is a stage every AI practitioner has to go through. So don't be put off by the obstacles. Knowing there are tigers on the mountain, I head for the tiger mountain anyway!


  1. For the dataset being too large and over the limit: use background training or script training. Both of these training modes allow large datasets.

  2. For the computing power: training in background or script mode does cost a lot of compute, but in many scenarios, such as paper reproduction, that cost is simply unavoidable.

Breakdown of the practice steps

(1) Processing the ImageNet dataset
As the saying goes, only a thin layer of window paper separates understanding this dataset from not understanding it.
Once you have stumbled through a first training run, you will be able to use this dataset with confidence from then on.
If you haven't yet pierced this layer of window paper and want to know more about the ImageNet 1K dataset, head over to: The first step of the long march – play with ImageNet 1K data set

(2) Model training
Use the PaddleClas suite to train. The suite already contains all the code: just put the dataset in the specified directory and execute the command.

However, background and script tasks need a few small changes, mainly because they restrict the size and number of the files they save. For example, if a background task puts the dataset in the default PaddleClas/dataset directory, then unless the dataset is deleted at the end, the task reports the error "the task failed to run successfully and the processing result failed. Please ensure that the output result is less than 10000 files and less than 20GB", and the results of a training run that cost so much compute are lost.
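The limits quoted in that error message can be checked before the task finishes. The following is only a minimal sketch, not part of PaddleClas or AIStudio: the function names `summarize_output` and `within_limits` and the demo directory are hypothetical, and the thresholds are simply the ones quoted in the error message (with 20GB taken to mean 20 GiB, which is an assumption).

```python
import os
import tempfile

MAX_FILES = 10000          # file-count limit quoted in the error message
MAX_BYTES = 20 * 1024**3   # size limit quoted in the error message (assumed GiB)


def summarize_output(root):
    """Return (file_count, total_bytes) for every regular file under root."""
    file_count = 0
    total_bytes = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            if os.path.isfile(path):
                file_count += 1
                total_bytes += os.path.getsize(path)
    return file_count, total_bytes


def within_limits(file_count, total_bytes):
    """True when the output stays under both quoted limits."""
    return file_count < MAX_FILES and total_bytes < MAX_BYTES


# Demo on a throwaway directory (a stand-in for the real task output).
with tempfile.TemporaryDirectory() as demo_dir:
    for i in range(3):
        with open(os.path.join(demo_dir, f"ckpt_{i}.bin"), "wb") as f:
            f.write(b"\0" * 1024)
    demo_count, demo_bytes = summarize_output(demo_dir)

print(demo_count, demo_bytes, within_limits(demo_count, demo_bytes))
```

Running such a check at the end of the task, after the unpacked dataset has been moved or deleted, gives an early warning before the platform rejects the output.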

Next, please start our practice!

1, Play with ImageNet datasets

The ImageNet dataset includes 14197122 images and 21841 categories, that is, the legendary 15 million images and 20000 categories! About 1TB of data. Download website:

Also refer to this project: The first step of the long march – play with ImageNet 1K data set

ImageNet 1K dataset: the most widely used subset of ImageNet is the ImageNet 1K dataset, the image classification and localization dataset of the ImageNet Large Scale Visual Recognition Challenge (ILSVRC) 2012-2017, so it is also called the ImageNet ILSVRC dataset. It has 1000 classes, with 1281167 training images, 50000 validation images and 100000 test images. When we talk about the accuracy of a model, we usually mean the accuracy obtained by training on this dataset and evaluating on its validation set. Of course, the more authoritative and credible figure is the test-set accuracy, which is what the online leaderboards rank.

For our purposes, the ImageNet 1K dataset is sufficient. It has already been mounted in this project as two archives (/home/aistudio/data/data114241/Light_ILSVRC2012_part_0.tar and /home/aistudio/data/data114746/Light_ILSVRC2012_part_1.tar). There are two because a single archive would be too large to upload.

Because it is too large, it cannot be unpacked in the Notebook: the project would be terminated for exceeding the disk limit. Therefore, training has to run as a background task or a script task. This project uses background task training.

2, Picking one model out of many

Initial selection: Swin Transformer

At first, the Swin Transformer model was selected. The Swin Transformer series was added to PaddleClas on June 29, 2021. Its highest Top1 acc on the ImageNet1k dataset reaches 87.2%, and it supports training, prediction, evaluation and whl package deployment. The pretrained models can be downloaded here.

However, practical testing showed that training is very slow. With fewer than 100 GPU point cards, even in 4-card mode, we cannot train Swin's highest-accuracy model: SwinTransformer_large_patch4_window12_384.
According to the test, one Epoch of this model takes 6 hours (16G V100 version), so a 32G V100, taken to be about twice as fast, needs about 6 / 2 * 300 = 900 hours to run 300 epochs. Even the tiny model, SwinTransformer_tiny_patch4_window7_224, takes more than 100 hours, that is, more than 100 GPU training point cards.

The measured training speeds were:

The ips of SwinTransformer_large_patch4_window12_384 was 13

The ips of SwinTransformer_tiny_patch4_window7_224 was 97
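The estimate above is simple enough to write down as a small calculation. This is only an illustrative sketch: the function name `estimate_total_hours` is hypothetical, and the speedup factor of 2 for the 32G V100 is the figure implied by the formula in the text, not something measured here.

```python
def estimate_total_hours(hours_per_epoch, epochs, speedup=1.0):
    """Total training time in hours, given the time for one epoch on the
    reference card and how many times faster the target card is."""
    return hours_per_epoch / speedup * epochs


# Figures from the text: 6 hours per epoch on a 16G V100, 300 epochs,
# and a 32G V100 assumed to be about twice as fast.
swin_large_hours = estimate_total_hours(6, 300, speedup=2)
print(swin_large_hours)
```

At roughly one point card per GPU hour, which is how the text converts hours to cost, this puts the large Swin model far beyond a 100 point card budget.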

So we started looking for other models.

Final selection: PP-LCNet model

PP-LCNet is a star lightweight model from PaddlePaddle. Its biggest feature is fast inference; on CPU in particular, its inference speed is far ahead of competing models!

Its training speed is about 500 ips, which is much faster than Swin!
So I finally chose it!

First, download the PaddleClas library.
Second, prepare the dataset.
Then get ready for training!

# Download PaddleClas
!git clone
# Install related libraries. In the AIStudio project, this step can be omitted
# !pip install -qr ~/PaddleClas/requirements.txt -i 
# Put the configuration files to be trained into the ~/work directory for modification
# !cp PaddleClas/ppcls/configs/ImageNet/SwinTransformer/SwinTransformer_large_patch4_window12_384.yaml ~/work

Unzip the dataset in the ~/data directory; the unpacked directory is ~/data/Light_ILSVRC2012/

This must be done inside the background task, otherwise the project will be shut down for exceeding the disk space limit.

In addition, because the dataset is unpacked to the ~/data directory rather than to the default ~/PaddleClas/dataset directory that PaddleClas expects by convention, the model configuration file has to be modified, for example the dataset settings section of work/PPLCNet_x1_0.yaml:

    # Original configuration
          image_root: ./dataset/ILSVRC2012/
          cls_label_path: ./dataset/ILSVRC2012/train_list.txt
    # Modified configuration
          image_root: /home/aistudio/data/Light_ILSVRC2012/
          cls_label_path: /home/aistudio/data/Light_ILSVRC2012/train_list.txt
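Since every dataset entry in the configuration shares the same path prefix, the change can also be applied with a plain text substitution. The sketch below works on a string rather than on the real file, and the helper name `relocate_dataset` is hypothetical; it assumes that all dataset paths in the file start with ./dataset/ILSVRC2012/, as in the fragment above.

```python
OLD_ROOT = "./dataset/ILSVRC2012/"
NEW_ROOT = "/home/aistudio/data/Light_ILSVRC2012/"


def relocate_dataset(config_text, old_root=OLD_ROOT, new_root=NEW_ROOT):
    """Replace every occurrence of the default dataset prefix with the new one."""
    return config_text.replace(old_root, new_root)


# Fragment copied from the original configuration shown above.
sample = (
    "      image_root: ./dataset/ILSVRC2012/\n"
    "      cls_label_path: ./dataset/ILSVRC2012/train_list.txt\n"
)
updated = relocate_dataset(sample)
print(updated)
```

To apply it to the real file, read the YAML file into a string, pass it through the function, and write the result back. Keeping the edit as a pure text substitution preserves the comments and ordering of the original file.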

Decompression takes about 20 minutes.

!cd ~/data && tar -xf /home/aistudio/data/data114241/Light_ILSVRC2012_part_0.tar
!cd ~/data && tar -xf /home/aistudio/data/data114746/Light_ILSVRC2012_part_1.tar
!cd ~/data && ls

3, Long march: Training

Because the full ImageNet 1K dataset cannot be used in the Notebook, we train in a background task (a script task version will be published later).
PaddleClas training tips

Background training settings

The number of training epochs in the official configuration file is 360; here we set it to 160, so the total cost is about 100 GPU point cards!
Believe me, these 100 point cards are well spent! The Red Army that came through the Long March was stronger in combat and higher in ideological awareness than before. Likewise, friends who have spent 100 point cards on training will find that they know PaddleClas training better, train more capably, and understand AI more deeply!

No one knows artificial intelligence better than me! (Donald Trump encourages you)

If you are short on point cards, you can train for 12 epochs instead, which runs for about 8 hours: change the epochs parameter in ~/work/PPLCNet_train_ImageNet.yaml to epochs: 12.

Start background training

  • First, fork the project
  • After starting the project, click "version" on the left to generate a new version
  • Click task to create a new background task

After that comes a long wait. The total running time is about 100 hours on a single card and about 25 hours on four cards.
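These totals can be sanity-checked against the per-epoch times reported for this project (about 8.5 minutes per epoch on four cards, about 34 minutes on a single card). The sketch below is a rough estimate only: the function name `total_hours` is hypothetical, and the figures ignore evaluation, checkpoint saving and startup time, which is why they come out somewhat below the rounded numbers in the text.

```python
FOUR_CARD_MIN_PER_EPOCH = 8.5    # reported average on four cards
SINGLE_CARD_MIN_PER_EPOCH = 34   # reported equivalent on a single card


def total_hours(minutes_per_epoch, epochs):
    """Pure training time in hours for the given number of epochs."""
    return minutes_per_epoch * epochs / 60


for epochs in (12, 160):
    single = total_hours(SINGLE_CARD_MIN_PER_EPOCH, epochs)
    four = total_hours(FOUR_CARD_MIN_PER_EPOCH, epochs)
    print(f"{epochs} epochs: {single:.1f} h on one card, {four:.1f} h on four cards")
```

The same function can be used to pick an epoch count that fits a given point card budget.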

You can view the training log output through "view log". It will look like this:

[2022/08/05 08:56:08] ppcls INFO: [Train][Epoch 1/360][Iter: 160/2503]lr(LinearWarmup): 0.01029165, top1: 0.00121, top5: 0.00543, CELoss: 6.92706, loss: 6.92706, batch_cost: 0.96780s, reader_cost: 0.68862, ips: 529.03591 samples/s, eta: 10 days, 2:11:48

!echo "PPLCNet Training 160 Epochs"
!cd ~/PaddleClas/ && python3 -m paddle.distributed.launch \
    tools/ \
        -c ~/work/PPLCNet_train_ImageNet.yaml

In a 4-card background task, each epoch takes 8.5 minutes on average, which corresponds to about 34 minutes on a single card.

[2022/08/11 17:35:13] ppcls INFO: [Train][Epoch 2/12][Iter: 0/626]lr(LinearWarmup): 0.16025559, top1: 0.02148, top5: 0.07422, CELoss: 6.28872, loss: 6.28872, batch_cost: 0.89291s, reader_cost: 0.43947, ips: 573.40513 samples/s, eta: 1:42:28
[2022/08/11 17:41:39] ppcls INFO: [Train][Epoch 2/12][Iter: 500/626]lr(LinearWarmup): 0.28805112, top1: 0.04258, top5: 0.12995, CELoss: 5.90013, loss: 5.90013, batch_cost: 0.77378s, reader_cost: 0.05757, ips: 661.68411 samples/s, eta: 1:22:21
[2022/08/11 17:43:11] ppcls INFO: [Train][Epoch 2/12][Avg]top1: 0.05146, top5: 0.14980, CELoss: 5.79814, loss: 5.79814
[2022/08/11 17:43:13] ppcls INFO: [Eval][Epoch 2][Iter: 0/196]CELoss: 4.68889, loss: 4.68889, top1: 0.09766, top5: 0.29688, batch_cost: 1.56802s, reader_cost: 1.28539, ips: 40.81592 images/sec
[2022/08/11 17:43:36] ppcls INFO: [Eval][Epoch 2][Avg]CELoss: 4.85905, loss: 4.85905, top1: 0.10064, top5: 0.26262
[2022/08/11 17:43:36] ppcls INFO: Already save model in ./output/PPLCNet_x1_0/best_model
[2022/08/11 17:43:36] ppcls INFO: [Eval][Epoch 2][best metric: 0.10063999891281128]
[2022/08/11 17:43:36] ppcls INFO: Already save model in ./output/PPLCNet_x1_0/epoch_2
[2022/08/11 17:43:36] ppcls INFO: Already save model in ./output/PPLCNet_x1_0/latest
[2022/08/11 17:43:43] ppcls INFO: [Train][Epoch 3/12][Iter: 0/626]lr(LinearWarmup): 0.32025559, top1: 0.08398, top5: 0.24609, CELoss: 5.41669, loss: 5.41669, batch_cost: 0.77636s, reader_cost: 0.07374, ips: 659.48866 samples/s, eta: 1:21:00
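When following a long run, it can help to pull the key numbers out of these log lines automatically. The following is a minimal sketch based only on the log format shown above; the pattern and the function name `parse_train_line` are hypothetical and may need adjusting if the log format differs between PaddleClas versions.

```python
import re

# Matches per-iteration [Train] lines in the format shown above.
TRAIN_LINE = re.compile(
    r"\[Train\]\[Epoch (?P<epoch>\d+)/(?P<epochs>\d+)\]"
    r"\[Iter: (?P<iter>\d+)/(?P<iters>\d+)\]"
    r".*?top1: (?P<top1>[\d.]+), top5: (?P<top5>[\d.]+)"
    r".*?ips: (?P<ips>[\d.]+) samples/s"
)


def parse_train_line(line):
    """Return a dict of progress fields, or None for non-matching lines."""
    m = TRAIN_LINE.search(line)
    if m is None:
        return None
    return {
        "epoch": int(m["epoch"]),
        "epochs": int(m["epochs"]),
        "iter": int(m["iter"]),
        "iters": int(m["iters"]),
        "top1": float(m["top1"]),
        "top5": float(m["top5"]),
        "ips": float(m["ips"]),
    }


# One line copied from the log above.
sample = (
    "[2022/08/11 17:41:39] ppcls INFO: [Train][Epoch 2/12][Iter: 500/626]"
    "lr(LinearWarmup): 0.28805112, top1: 0.04258, top5: 0.12995, "
    "CELoss: 5.90013, loss: 5.90013, batch_cost: 0.77378s, "
    "reader_cost: 0.05757, ips: 661.68411 samples/s, eta: 1:22:21"
)
parsed = parse_train_line(sample)
print(parsed)
```

Lines without an iteration counter, such as the [Avg] and [Eval] lines, do not match and return None, so the function can be applied to a whole log file line by line.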

4, Victory meeting: evaluation, prediction and inference

With training complete, the following shows how to use the trained model.

Model evaluation

After training the model, you can use the following command to evaluate the model metrics.

python3 tools/ \
    -c ppcls/configs/ImageNet/ResNet/ResNet50.yaml \
    -o Global.pretrained_model=output/ResNet50/best_model

Here, -o Global.pretrained_model="output/ResNet50/best_model" specifies the path of the current best weights. To evaluate other weights, just replace the path.

!echo "assessment"
!cd ~/PaddleClas/ && python tools/ \
    -c ~/work/PPLCNet_train_ImageNet.yaml \
    -o Global.pretrained_model=output/PPLCNet_x1_0/best_model

Evaluation result: [2022/08/11 19:07:46] ppcls INFO: [Eval][Epoch 0][Avg]CELoss: 1.91064, loss: 1.91064, top1: 0.57338, top5: 0.80684
It can be seen that 12 epochs already reach 57% top1 accuracy and 81% top5 accuracy, which is quite good.
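For readers new to these metrics: top1 counts a sample as correct when the highest-scoring class is the true label, and top5 when the true label is among the five highest-scoring classes. The sketch below illustrates the definition in plain Python on made-up scores; it is not how PaddleClas computes the metric internally, and the function name `topk_accuracy` is hypothetical.

```python
def topk_accuracy(scores, labels, k=1):
    """Fraction of samples whose true label is among the k highest scores.

    scores: list of per-class score lists, one per sample
    labels: list of true class indices, one per sample
    """
    correct = 0
    for row, label in zip(scores, labels):
        ranked = sorted(range(len(row)), key=lambda i: row[i], reverse=True)
        if label in ranked[:k]:
            correct += 1
    return correct / len(labels)


# Made-up scores for four samples over three classes.
scores = [
    [0.10, 0.70, 0.20],
    [0.50, 0.30, 0.20],
    [0.20, 0.30, 0.50],
    [0.40, 0.35, 0.25],
]
labels = [1, 1, 0, 2]

print(topk_accuracy(scores, labels, k=1))
print(topk_accuracy(scores, labels, k=2))
```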

Model prediction

After training is complete, the weights obtained from training can be loaded for model prediction. tools/infer.py provides a complete example. You can run model prediction by executing the following command:

python3 tools/ \
    -c ppcls/configs/ImageNet/ResNet/ResNet50.yaml \
    -o Global.pretrained_model=output/ResNet50/best_model 

The picture to predict is ~/PaddleClas/docs/images/inference_deployment/whl_demo.jpg. Let's take a look at this picture first:

from PIL import Image
img = Image.open("/home/aistudio/PaddleClas/docs/images/inference_deployment/whl_demo.jpg")

!echo "model prediction "
!cd PaddleClas && python tools/ \
    -c ~/work/PPLCNet_train_ImageNet.yaml \
    -o Global.pretrained_model=output/PPLCNet_x1_0/best_model 

Prediction results: [{'class_ids': [8, 7, 12, 88, 86], 'scores': [0.71487, 0.22119, 0.00517, 0.00416, 0.00282], 'file_name': 'docs/images/inference_deployment/whl_demo.jpg', 'label_names': ['hen', 'cock', 'house finch, linnet, Carpodacus mexicanus', 'macaw', 'partridge']}]
You can check the results against the ImageNet dataset label mapping table:

7 cock
8 hen
9 ostrich, Struthio camelus
10 brambling, Fringilla montifringilla
11 goldfinch, Carduelis carduelis
12 house finch, linnet, Carpodacus mexicanus

It can be seen that the recognition works well: the probability of hen is 71%, and the probability of cock is 22%.
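The three lists in the prediction result line up by position, so they can be zipped into a more readable summary. This is only a small illustration using the values printed above; the function name `summarize` is hypothetical.

```python
# Values copied from the prediction result above.
result = {
    "class_ids": [8, 7, 12, 88, 86],
    "scores": [0.71487, 0.22119, 0.00517, 0.00416, 0.00282],
    "label_names": [
        "hen",
        "cock",
        "house finch, linnet, Carpodacus mexicanus",
        "macaw",
        "partridge",
    ],
}


def summarize(result):
    """Return one formatted line per predicted class: id, name, probability."""
    rows = zip(result["class_ids"], result["label_names"], result["scores"])
    return [f"{cid:>3}  {name}: {score:.1%}" for cid, name, score in rows]


lines = summarize(result)
print("\n".join(lines))
```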

Exporting the inference model from the trained weights

Here, we provide a script to convert the weights and model. Execute the script to get the corresponding inference model:

python3 tools/ \
    -c ppcls/configs/ImageNet/ResNet/ResNet50.yaml \
    -o Global.pretrained_model=output/ResNet50/best_model \
    -o Global.save_inference_dir=deploy/models/ResNet50_infer

After executing this script, the ResNet50_infer folder is generated under deploy/models/, and the models folder should have the following file structure:

├── ResNet50_infer
│ ├── inference.pdiparams
│ ├──
│ └── inference.pdmodel

!echo "export inference Model"
!cd PaddleClas && python tools/ \
    -c ~/work/PPLCNet_train_ImageNet.yaml \
    -o Global.pretrained_model=output/PPLCNet_x1_0/best_model \
    -o Global.save_inference_dir=deploy/models/PPLCNet_x1_0_infer

Inference based on the Python prediction engine

Predicting a single image

Enter the deploy directory.
Run the following command to classify the image ./images/ImageNet/ILSVRC2012_val_00000010.jpeg. The image to predict is set in the configs/inference_cls.yaml file: infer_imgs: "./images/ImageNet/ILSVRC2012_val_00000010.jpeg"

To predict on the GPU, use the following command:
python3 python/ -c configs/inference_cls.yaml -o Global.inference_model_dir=models/ResNet50_infer

To predict on the CPU, use the following command:
python3 python/ -c configs/inference_cls.yaml -o Global.inference_model_dir=models/ResNet50_infer -o Global.use_gpu=False

Let's take a look at the picture first:

from PIL import Image
img = Image.open("/home/aistudio/PaddleClas/deploy/images/ImageNet/ILSVRC2012_val_00000010.jpeg")

!echo "Inference based on the Python prediction engine"
!cd ~/PaddleClas/deploy && python python/ \
    -c configs/inference_cls.yaml \
    -o Global.inference_model_dir=models/PPLCNet_x1_0_infer

Output results

ILSVRC2012_val_00000010.jpeg:	class id(s): [153, 333, 204, 332, 283], score(s): [0.27, 0.12, 0.10, 0.08, 0.03], label_name(s): ['Maltese dog, Maltese terrier, Maltese', 'hamster', 'Lhasa, Lhasa apso', 'Angora, Angora rabbit', 'Persian cat']

The corresponding labels are:

153 Maltese dog, Maltese terrier, Maltese
333 hamster
204 Lhasa, Lhasa apso
332 Angora, Angora rabbit

To be honest, I don't know what it is.

Looking it up in the val_list.txt file, its label entry is: ILSVRC2012_val_00000010.JPEG 332
That is, this picture is an Angora rabbit.
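Looking up the ground-truth label by hand gets tedious for more than a few images. The sketch below builds a lookup table from lines in the "file name, space, class id" form shown above. It is an illustration only: the function name `load_labels` is hypothetical, and the sample contains just the one entry quoted in the text, not the real file.

```python
def load_labels(lines):
    """Map each image file name to its integer class id."""
    labels = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue
        # Split on the last run of whitespace: "<file name> <class id>".
        name, class_id = line.rsplit(maxsplit=1)
        labels[name] = int(class_id)
    return labels


# The single entry quoted in the text; a real run would read val_list.txt.
sample_lines = ["ILSVRC2012_val_00000010.JPEG 332"]
labels = load_labels(sample_lines)
print(labels["ILSVRC2012_val_00000010.JPEG"])
```

Note that the label file spells the extension in upper case (.JPEG) while the inference output above uses lower case, so a case-insensitive comparison may be needed when matching the two.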

After more training, for example 160 epochs, the model recognizes the image better, and the prediction result becomes:

ILSVRC2012_val_00000010.jpeg: class id(s): [153, 332, 229, 204, 265], score(s): [0.41, 0.39, 0.05, 0.04, 0.04], label_name(s): ['Maltese dog, Maltese terrier, Maltese', 'Angora, Angora rabbit', 'Old English sheepdog, bobtail', 'Lhasa, Lhasa apso', 'toy poodle']

It can be seen that the predicted probability is now 39% for Angora rabbit, just behind 41% for Maltese dog, which is a clear improvement. Of course, this picture is hard to tell apart even for people.

Folder-based batch prediction

If you want to predict all the images in a folder, you can directly modify the Global.infer_imgs field in the configuration file, for example infer_imgs: "./images/ImageNet/", or override the corresponding setting with the -o parameter.

!echo "Inference based on the Python prediction engine: folder batch prediction"
!cd ~/PaddleClas/deploy && python python/ \
    -c configs/inference_cls.yaml \
    -o Global.inference_model_dir=models/PPLCNet_x1_0_infer \
    -o Global.infer_imgs=./images/ImageNet/

The output result is:

ILSVRC2012_val_00000010.jpeg:	class id(s): [153, 333, 204, 332, 283], score(s): [0.27, 0.12, 0.10, 0.08, 0.03], label_name(s): ['Maltese dog, Maltese terrier, Maltese', 'hamster', 'Lhasa, Lhasa apso', 'Angora, Angora rabbit', 'Persian cat']
ILSVRC2012_val_00010010.jpeg:	class id(s): [732, 662, 622, 633, 710], score(s): [0.11, 0.10, 0.07, 0.06, 0.04], label_name(s): ['Polaroid camera, Polaroid Land camera', 'modem', 'lens cap, lens cover', "loupe, jeweler's loupe", 'pencil sharpener']
ILSVRC2012_val_00020010.jpeg:	class id(s): [178, 209, 211, 208, 180], score(s): [0.99, 0.00, 0.00, 0.00, 0.00], label_name(s): ['Weimaraner', 'Chesapeake Bay retriever', 'vizsla, Hungarian pointer', 'Labrador retriever', 'American Staffordshire terrier, Staffordshire terrier, American pit bull terrier, pit bull terrier']
ILSVRC2012_val_00030010.jpeg:	class id(s): [80, 134, 7, 23, 8], score(s): [0.95, 0.01, 0.01, 0.01, 0.00], label_name(s): ['black grouse', 'crane', 'cock', 'vulture', 'hen']

It can be seen that the last two images are identified with much higher confidence.
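To compare many images at once, the output lines can be parsed back into Python lists. The following is a minimal sketch that assumes exactly the line format shown above; the pattern and the function name `parse_prediction` are hypothetical. It uses `ast.literal_eval` so that the bracketed lists are read as data rather than executed.

```python
import ast
import re

# Matches one output line in the format shown above.
LINE = re.compile(
    r"^(?P<file>\S+):\s*class id\(s\): (?P<ids>\[.*?\]), "
    r"score\(s\): (?P<scores>\[.*?\]), label_name\(s\): (?P<names>\[.*\])$"
)


def parse_prediction(line):
    """Return a dict with file, class_ids, scores and label_names, or None."""
    m = LINE.match(line.strip())
    if m is None:
        return None
    return {
        "file": m["file"],
        "class_ids": ast.literal_eval(m["ids"]),
        "scores": ast.literal_eval(m["scores"]),
        "label_names": ast.literal_eval(m["names"]),
    }


# One line copied from the output above.
sample = (
    "ILSVRC2012_val_00030010.jpeg:\tclass id(s): [80, 134, 7, 23, 8], "
    "score(s): [0.95, 0.01, 0.01, 0.01, 0.00], "
    "label_name(s): ['black grouse', 'crane', 'cock', 'vulture', 'hen']"
)
pred = parse_prediction(sample)
print(pred["file"], pred["label_names"][0], pred["scores"][0])
```

Combined with a label lookup built from val_list.txt, the first entry of class_ids can then be compared with the ground truth for each image.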

5, Summary

Nothing is difficult if you put your heart into it!
After a long training process, we finally completed 100 hours of training. That is something to be a little excited about!

What comes next is putting it to flexible use. For example, if you write your own model or reproduce one from a paper, you can train and test it with this kind of background task. In fact, this project was set up in the first place to take part in paper reproduction and to practice training with PaddleClas.

6, Take the long march again, and retrain in the Notebook environment

Following the PP-LCNet training document, which uses the person / no-person scene dataset, we change the model to Swin Transformer and go through the steps again in Notebook mode to deepen the impression!

Uncomment the following parts (select all, then press the shortcut key ctrl + / to remove all the comment markers), and then execute them directly. Of course, this is mainly for getting familiar with the operations. For real training, you still need to use background tasks with the ImageNet 1K data.

Note: the following code can be executed in the notebook

Preparation before training

First download the PaddleClas library.
Then enter the PaddleClas/dataset/ directory, and download and unzip the person / no-person scene data.

# !git clone
# !cd ~/PaddleClas/dataset && wget && tar -xf person_exists.tar


# ! echo "test swin tiny 300Epochs single card for 3 days and 4 hours, and 4Epochs for about 1 hour."
# !cd ~/PaddleClas/ && python3 tools/ \
#         -c ~/work/swin_tiny.yaml


# !cd ~/PaddleClas/ && python tools/ \
#     -c ~/work/swin_tiny.yaml \
#     -o Global.pretrained_model=output/SwinTransformer_tiny_patch4_window7_224/best_model


# !cd PaddleClas && python tools/ \
#     -c ~/work/swin_tiny.yaml \
#     -o Global.pretrained_model=output/SwinTransformer_tiny_patch4_window7_224/best_model 

Exporting the inference model from the trained weights

# !cd PaddleClas && python tools/ \
#     -c ~/work/swin_tiny.yaml \
#     -o Global.pretrained_model=output/SwinTransformer_tiny_patch4_window7_224/best_model \
#     -o Global.save_inference_dir=deploy/models/SwinTransformer_tiny_patch4_window7_224_infer

Inference using Python

The image used for inference is "./images/ImageNet/ILSVRC2012_val_00000010.jpeg". Let's see what it looks like:

from PIL import Image
img = Image.open("/home/aistudio/PaddleClas/deploy/images/ImageNet/ILSVRC2012_val_00000010.jpeg")

# !cd ~/PaddleClas/deploy && python python/ \
#     -c configs/inference_cls.yaml \
#     -o Global.inference_model_dir=models/SwinTransformer_tiny_patch4_window7_224_infer


Error report: '_DataLoaderIterMultiProcess' object has no attribute '_shutdown'

The following error is reported from time to time during training:

    for iter_id, batch in enumerate(engine.train_dataloader):
  File "/opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages/paddle/fluid/", line 566, in __iter__
    return _DataLoaderIterMultiProcess(self)
  File "/opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages/paddle/fluid/dataloader/", line 379, in __init__
  File "/opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages/paddle/fluid/dataloader/", line 391, in _init_workers
    self._workers_idx_cycle = itertools.cycle(range(self._num_workers))
TypeError: 'float' object cannot be interpreted as an integer
Exception ignored in: <function _DataLoaderIterMultiProcess.__del__ at 0x7fd7a7a15680>
Traceback (most recent call last):
  File "/opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages/paddle/fluid/dataloader/", line 712, in __del__
  File "/opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages/paddle/fluid/dataloader/", line 503, in _try_shutdown_all
    if not self._shutdown:
AttributeError: '_DataLoaderIterMultiProcess' object has no attribute '_shutdown'
LAUNCH INFO 2022-08-04 23:27:41,651 Exit code 1

This error occurs relatively often in the 16G video memory environment, and rarely in background and script tasks.

According to a Baidu search, the cause is that the Ubuntu swap space is too small to create new threads. If this problem occurs, you can try to increase the swap space.

Running the free command shows that the system's swap is 0:

              total        used        free      shared  buff/cache   available
Mem:      528380564    13541812   173683904     1090716   341154848   509119516
Swap:             0           0           0
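The same check can be scripted by reading the Swap row of the `free` output. The sketch below parses the text shown above rather than running the command, so it is only an illustration; the function name `swap_total_kib` is hypothetical, and the values are interpreted as KiB, the default unit of `free`.

```python
# Output of `free` copied from above.
FREE_OUTPUT = """\
              total        used        free      shared  buff/cache   available
Mem:      528380564    13541812   173683904     1090716   341154848   509119516
Swap:             0           0           0
"""


def swap_total_kib(free_text):
    """Return the total swap size from `free` output, or None if not found."""
    for line in free_text.splitlines():
        if line.startswith("Swap:"):
            return int(line.split()[1])
    return None


swap_kib = swap_total_kib(FREE_OUTPUT)
print("swap is disabled" if swap_kib == 0 else f"swap: {swap_kib} KiB")
```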

How to add a swap area

1. Create the swap file
mkdir swap
cd swap
sudo dd if=/dev/zero of=sfile bs=1024 count=1000000

2. Convert it to a swap file
sudo mkswap sfile

3. Activate the swap file
sudo swapon sfile

4. View the effect
Enter again: free -m

Solution 1

However, sudo cannot be used in an AIStudio project, so in the end I increased num_workers instead, from 4 and 2 to 8.

Solution 2

Another solution is to set use_shared_memory to False: 'use_shared_memory: False'
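Both workarounds are one-line changes to the configuration file, so they can also be applied with a small text substitution. The sketch below is an illustration under stated assumptions: the helper name `set_yaml_scalar` is hypothetical, the sample fragment is made up for the demo rather than copied from a real PaddleClas configuration, and the function rewrites every occurrence of the key it finds (for example in both the training and evaluation loader sections).

```python
import re


def set_yaml_scalar(config_text, key, value):
    """Set `key: value` on every line where the key appears, keeping indentation."""
    pattern = re.compile(
        rf"^([ \t]*{re.escape(key)}:[ \t]*).*$", flags=re.MULTILINE
    )
    return pattern.sub(lambda m: m.group(1) + str(value), config_text)


# Made-up fragment for the demo; only the two key names come from the text.
sample = (
    "    loader:\n"
    "      num_workers: 4\n"
    "      use_shared_memory: True\n"
)
patched = set_yaml_scalar(sample, "use_shared_memory", False)
patched = set_yaml_scalar(patched, "num_workers", 8)
print(patched)
```

Working on the text directly keeps the rest of the file, including comments, exactly as it was.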

Concluding remarks

Use PaddlePaddle to make an epoch! Let's paddle in the sea of AI!

PaddlePaddle official website:

My level is limited, so there are inevitably shortcomings; corrections are welcome.

Author: Duan Chunhua, online name skywalk or Tianma XingKong, AI architect of Jining jikuai Software Technology Co., Ltd., Baidu PaddlePaddle PPDE.

I have reached the highest level on AI Studio and lit up 11 badges. Come and follow me~


This article is a repost.
Original project link

Tags: Machine Learning AI Deep Learning

Posted by phui_99 on Sun, 14 Aug 2022 21:38:15 +0300