Deep Learning with Python, Chapter 5: "Deep Learning for Computer Vision"

Preface

Earlier installment: Deep Learning with Python, Chapter 4, "Fundamentals of Machine Learning"
That article covers the most common machine learning tasks on vector data.

OK, let's get to the point.

This chapter covers the following topics:

  • Understanding convolutional neural networks (convnets)
  • Using data augmentation to mitigate overfitting
  • Using a pretrained convolutional neural network for feature extraction
  • Fine-tuning a pretrained convolutional neural network
  • Visualizing what convolutional neural networks learn and how they make classification decisions

This chapter introduces convolutional neural networks, also known as convnets, a type of deep learning model used in almost every computer vision application. You will learn to apply convolutional neural networks to image classification problems, in particular those involving small training datasets. If you don't work at a large technology company, this will also be your most common use case.

5.1 Introduction to convolutional neural networks

We will explain in depth what convolutional neural networks are and why they have been so successful at computer vision tasks. But first, let's look at a simple example of a convolutional neural network: using one to classify MNIST digits, a task we performed in Chapter 2 with a densely connected network (our test accuracy then was 97.8%). Even though the convolutional neural network in this example is very basic, its accuracy will certainly exceed that of the densely connected network from Chapter 2.

The following code shows what a simple convolutional neural network looks like. It is a stack of Conv2D and MaxPooling2D layers. You will see in a minute exactly what these layers do.

Listing 5-1 instantiates a small convolutional neural network

from keras import layers
from keras import models
model = models.Sequential()
model.add(layers.Conv2D(32, (3, 3), activation='relu', input_shape=(28, 28, 1)))
model.add(layers.MaxPooling2D((2, 2)))
model.add(layers.Conv2D(64, (3, 3), activation='relu'))
model.add(layers.MaxPooling2D((2, 2)))
model.add(layers.Conv2D(64, (3, 3), activation='relu'))

Importantly, a convolutional neural network takes as input tensors of shape (image_height, image_width, image_channels), not including the batch dimension. In this case, we configure the convolutional neural network to process inputs of size (28, 28, 1), which is the format of MNIST images. We do this by passing the argument input_shape=(28, 28, 1) to the first layer.

Let's take a look at the current architecture of convolutional neural networks.

>>> model.summary()
_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
conv2d_1 (Conv2D) (None, 26, 26, 32) 320
_________________________________________________________________
max_pooling2d_1 (MaxPooling2D) (None, 13, 13, 32) 0
_________________________________________________________________
conv2d_2 (Conv2D) (None, 11, 11, 64) 18496
_________________________________________________________________
max_pooling2d_2 (MaxPooling2D) (None, 5, 5, 64) 0
_________________________________________________________________
conv2d_3 (Conv2D) (None, 3, 3, 64) 36928
=================================================================
Total params: 55,744
Trainable params: 55,744
Non-trainable params: 0

You can see that the output of every Conv2D and MaxPooling2D layer is a 3D tensor of shape (height, width, channels). The width and height dimensions tend to shrink as you go deeper in the network. The number of channels is controlled by the first argument passed to the Conv2D layers (32 or 64).

The next step is to feed the last output tensor (of shape (3, 3, 64)) into a densely connected classifier network like those you are already familiar with: a stack of Dense layers. These classifiers process vectors, which are 1D, whereas the current output is a 3D tensor. First we have to flatten the 3D output to 1D, and then add a few Dense layers on top.

Listing 5-2 adds a classifier to the convolutional neural network

model.add(layers.Flatten())
model.add(layers.Dense(64, activation='relu'))
model.add(layers.Dense(10, activation='softmax'))

We will do 10-way classification, so the last layer has 10 outputs and a softmax activation. The network architecture now looks like this.

>>> model.summary()
_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
conv2d_1 (Conv2D) (None, 26, 26, 32) 320
_________________________________________________________________
max_pooling2d_1 (MaxPooling2D) (None, 13, 13, 32) 0
_________________________________________________________________
conv2d_2 (Conv2D) (None, 11, 11, 64) 18496
_________________________________________________________________
max_pooling2d_2 (MaxPooling2D) (None, 5, 5, 64) 0
_________________________________________________________________
conv2d_3 (Conv2D) (None, 3, 3, 64) 36928 
_________________________________________________________________
flatten_1 (Flatten) (None, 576) 0
_________________________________________________________________
dense_1 (Dense) (None, 64) 36928
_________________________________________________________________
dense_2 (Dense) (None, 10) 650
=================================================================
Total params: 93,322
Trainable params: 93,322
Non-trainable params: 0

As you can see, the (3, 3, 64) outputs are flattened into vectors of shape (576,) before going through the two Dense layers.

Now, let's train this convolutional neural network on the MNIST digit images. We will reuse much of the code from the MNIST example in Chapter 2.

Listing 5-3 trains the convolutional neural network on MNIST images

from keras.datasets import mnist
from keras.utils import to_categorical
(train_images, train_labels), (test_images, test_labels) = mnist.load_data()
train_images = train_images.reshape((60000, 28, 28, 1))
train_images = train_images.astype('float32') / 255
test_images = test_images.reshape((10000, 28, 28, 1))
test_images = test_images.astype('float32') / 255
train_labels = to_categorical(train_labels)
test_labels = to_categorical(test_labels)
model.compile(optimizer='rmsprop',
 loss='categorical_crossentropy',
 metrics=['accuracy'])
model.fit(train_images, train_labels, epochs=5, batch_size=64)

We evaluate the model on the test data.

>>> test_loss, test_acc = model.evaluate(test_images, test_labels)
>>> test_acc
0.99080000000000001

Whereas the densely connected network from Chapter 2 had a test accuracy of 97.8%, this basic convolutional neural network has a test accuracy of 99.3%: we decreased the error rate by 68% (relative). Not bad!

Why does this simple convolutional neural network work so well compared to the densely connected model? To answer this, let's dive into what the Conv2D and MaxPooling2D layers do.

5.1.1 The convolution operation

The fundamental difference between a Dense layer and a convolution layer is this: Dense layers learn global patterns in their input feature space (for example, for an MNIST digit, patterns involving all pixels), whereas convolution layers learn local patterns (see Figure 5-1): in the case of images, patterns found in small 2D windows of the inputs. In the previous example, these windows were all 3 × 3.

Figure 5-1 Images can be broken into local patterns such as edges and textures

This key characteristic gives convolutional neural networks two interesting properties.

  • The patterns they learn are translation invariant. After learning a certain pattern in the lower-right corner of a picture, a convolutional neural network can recognize it anywhere, for example in the upper-left corner. A densely connected network would have to learn the pattern anew if it appeared at a new location. This makes convolutional neural networks data efficient when processing images (because the visual world is fundamentally translation invariant): they need fewer training samples to learn representations that generalize.
  • They can learn spatial hierarchies of patterns, as shown in Figure 5-2. A first convolution layer will learn small local patterns such as edges, a second convolution layer will learn larger patterns made of the features of the first layer, and so on. This allows convolutional neural networks to efficiently learn increasingly complex and abstract visual concepts (because the visual world is fundamentally spatially hierarchical).

Convolutions operate over 3D tensors, called feature maps, with two spatial axes (height and width) as well as a depth axis (also called the channels axis). For an RGB image, the dimension of the depth axis is 3, because the image has three color channels: red, green, and blue. For a black-and-white picture, like the MNIST digits, the depth is 1 (levels of gray). The convolution operation extracts patches from its input feature map and applies the same transformation to all of these patches, producing an output feature map. This output feature map is still a 3D tensor: it has a width and a height. Its depth can be arbitrary, because the output depth is a parameter of the layer, and the different channels in that depth axis no longer stand for specific colors as in RGB input; rather, they stand for filters. Filters encode specific aspects of the input data: at a high level, a single filter could encode the concept "the input contains a face," for instance.

Figure 5-2 The visual world forms a spatial hierarchy of visual modules: hyperlocal edges combine into local objects such as eyes or ears, and these local objects combine into high-level concepts such as "cat"

In the MNIST example, the first convolution layer takes a feature map of size (28, 28, 1) and outputs a feature map of size (26, 26, 32): it computes 32 filters over its input. Each of these 32 output channels contains a 26 × 26 grid of values, which is a response map of the filter over the input, indicating the response of that filter pattern at different locations in the input (see Figure 5-3). That is what the term feature map means: every dimension in the depth axis is a feature (or filter), and the 2D tensor output[:, :, n] is the 2D spatial map of the response of this filter over the input.

Figure 5-3 The concept of a response map: a 2D map of the presence of a pattern at different locations in an input

Convolutions are defined by two key parameters.

  • Size of the patches extracted from the inputs: these are typically 3 × 3 or 5 × 5. In this example they were 3 × 3, which is a common choice.
  • Depth of the output feature map: the number of filters computed by the convolution. In this example, the first layer had a depth of 32 and the last layer a depth of 64.

In Keras Conv2D layers, these parameters are the first arguments passed to the layer: Conv2D(output_depth, (window_height, window_width)).

A convolution works by sliding these 3 × 3 or 5 × 5 windows over the 3D input feature map, stopping at every possible location and extracting the 3D patch of surrounding features (of shape (window_height, window_width, input_depth)). Each such 3D patch is then transformed (via a tensor product with the same learned weight matrix, called the convolution kernel) into a 1D vector of shape (output_depth,). All of these vectors are then spatially reassembled into a 3D output feature map of shape (height, width, output_depth). Every spatial location in the output feature map corresponds to the same location in the input feature map (for example, the lower-right corner of the output contains information about the lower-right corner of the input). For instance, with 3 × 3 windows, the vector output[i, j, :] comes from the 3D patch input[i-1:i+1, j-1:j+1, :]. The full process is detailed in Figure 5-4.

Figure 5-4 How convolution works
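To make the sliding-window computation concrete, here is a minimal NumPy sketch of a "valid" convolution with stride 1 (an illustrative toy, not Keras's actual implementation; the bias term is omitted and the kernel values are random stand-ins for learned weights).

import numpy as np

def naive_conv2d(feature_map, kernels):
    # feature_map: (height, width, input_depth)
    # kernels: (window_height, window_width, input_depth, output_depth)
    h, w, _ = feature_map.shape
    wh, ww, _, output_depth = kernels.shape
    output = np.zeros((h - wh + 1, w - ww + 1, output_depth))
    for i in range(output.shape[0]):
        for j in range(output.shape[1]):
            patch = feature_map[i:i + wh, j:j + ww, :]  # 3D patch of surrounding features
            # Tensor product of the patch with every kernel -> vector of shape (output_depth,)
            output[i, j, :] = np.tensordot(patch, kernels, axes=([0, 1, 2], [0, 1, 2]))
    return output

x = np.random.random((28, 28, 1))     # a single MNIST-sized input
k = np.random.random((3, 3, 1, 32))   # 32 filters with 3 x 3 windows
print(naive_conv2d(x, k).shape)       # (26, 26, 32)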

Note that the output width and height may differ from the input width and height, for two reasons.

  • Border effects, which can be countered by padding the input feature map.
  • The use of strides, which will be defined in a moment.

Let's take a closer look at these concepts.

  1. Understanding border effects and padding

Consider a 5 × 5 feature map (25 tiles in total). There are only 9 tiles around which you can center a 3 × 3 window, and these 9 tiles form a 3 × 3 grid (see Figure 5-5). Hence, the output feature map is 3 × 3: it shrinks a little compared to the input, by exactly two tiles along each dimension in this case. You can see this border effect in action in the earlier example: you start with 28 × 28 inputs, which become 26 × 26 after the first convolution layer.

Figure 5-5 Valid locations of 3 × 3 patches in a 5 × 5 input feature map

If you want the output feature map to have the same spatial dimensions as the input, you can use padding. Padding consists of adding an appropriate number of rows and columns on each side of the input feature map so as to make it possible to fit a convolution window centered on every input tile. For a 3 × 3 window, you add one column on the left and right and one row on the top and bottom. For a 5 × 5 window, you add two rows and two columns on each side (see Figure 5-6).

Figure 5-6 Padding a 5 × 5 input in order to be able to extract 25 patches of 3 × 3

In Conv2D layers, padding is configurable via the padding argument, which takes two values: "valid", which means no padding (only valid window positions are used), and "same", which means "pad in such a way that the output has the same width and height as the input." The padding argument defaults to "valid".
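As a quick illustration of the padding argument, the following sketch (using hypothetical single-layer models just to inspect shapes) shows how "valid" and "same" affect the output size.

from keras import layers, models

model_valid = models.Sequential()
model_valid.add(layers.Conv2D(32, (3, 3), activation='relu',
                              padding='valid',           # the default: no padding
                              input_shape=(28, 28, 1)))

model_same = models.Sequential()
model_same.add(layers.Conv2D(32, (3, 3), activation='relu',
                             padding='same',             # pad so width and height are preserved
                             input_shape=(28, 28, 1)))

print(model_valid.output_shape)   # (None, 26, 26, 32): shrinks by 2 along each spatial dimension
print(model_same.output_shape)    # (None, 28, 28, 32): same spatial size as the input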

  2. Understanding convolution strides

The other factor that can influence the output size is the notion of strides. The description of convolution so far has assumed that the centers of the convolution windows are all contiguous. But the distance between two successive windows is a parameter of the convolution, called its stride, which defaults to 1. It's possible to have strided convolutions: convolutions with a stride higher than 1. In Figure 5-7, you can see the patches extracted by a 3 × 3 convolution with stride 2 over a 5 × 5 input (without padding).

Figure 5-7 3 × 3 convolution patches with 2 × 2 strides

Using stride 2 means the width and height of the feature map are downsampled by a factor of 2 (in addition to any changes induced by border effects). Although strided convolutions may come in handy for some types of models, they are rarely used in practice. Still, it's good to be familiar with the concept.
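For reference, here is what a strided convolution looks like in Keras (a small sketch with an arbitrary input shape, only to show the downsampling effect).

from keras import layers, models

strided_model = models.Sequential()
strided_model.add(layers.Conv2D(32, (3, 3), activation='relu',
                                strides=2,               # move the window 2 pixels at a time
                                input_shape=(28, 28, 1)))
print(strided_model.output_shape)   # (None, 13, 13, 32): width and height roughly halved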

To downsample feature maps, instead of strides we tend to use the max-pooling operation, which you saw in action in the first convolutional neural network example. Let's look at it in more depth.

5.1.2 The max-pooling operation

In the convolutional neural network example, you may have noticed that the size of the feature maps is halved after every MaxPooling2D layer. For instance, before the first MaxPooling2D layer, the feature map is 26 × 26, but the max-pooling operation halves it to 13 × 13. That is the role of max pooling: to aggressively downsample feature maps, much like strided convolutions.

Max pooling consists of extracting windows from the input feature map and outputting the max value of each channel. It is conceptually similar to convolution, except that instead of transforming local patches via a learned linear transformation (the convolution kernel), they are transformed via a hardcoded max tensor operation. A big difference from convolution is that max pooling is usually done with 2 × 2 windows and stride 2, in order to downsample the feature maps by a factor of 2, whereas convolution is typically done with 3 × 3 windows and stride 1.
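Here is a toy NumPy sketch of that hardcoded max transformation with 2 × 2 windows and stride 2 (illustrative only; the input values are random stand-ins for a real feature map).

import numpy as np

feature_map = np.random.random((26, 26, 32))       # e.g. the output of the first Conv2D layer
h, w, depth = feature_map.shape
pooled = np.zeros((h // 2, w // 2, depth))
for i in range(h // 2):
    for j in range(w // 2):
        window = feature_map[2 * i:2 * i + 2, 2 * j:2 * j + 2, :]   # 2 x 2 window, stride 2
        pooled[i, j, :] = window.max(axis=(0, 1))                   # max value of each channel
print(pooled.shape)   # (13, 13, 32)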

Why downsample feature maps this way? Why not remove the max-pooling layers and keep fairly large feature maps all the way up? Let's try it. The convolutional base of the model would then look like this.

model_no_max_pool = models.Sequential()
model_no_max_pool.add(layers.Conv2D(32, (3, 3), activation='relu',
 input_shape=(28, 28, 1)))
model_no_max_pool.add(layers.Conv2D(64, (3, 3), activation='relu'))
model_no_max_pool.add(layers.Conv2D(64, (3, 3), activation='relu'))

The architecture of the model is as follows.

>>> model_no_max_pool.summary()
_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
conv2d_4 (Conv2D) (None, 26, 26, 32) 320
_________________________________________________________________
conv2d_5 (Conv2D) (None, 24, 24, 64) 18496
_________________________________________________________________
conv2d_6 (Conv2D) (None, 22, 22, 64) 36928
=================================================================
Total params: 55,744
Trainable params: 55,744
Non-trainable params: 0

What's wrong with this setup? Two things.

  • It isn't conducive to learning a spatial hierarchy of features. The 3 × 3 windows in the third layer will only contain information coming from 7 × 7 windows in the initial input. The high-level patterns learned by the convolutional neural network will still be very small with regard to the initial input, which may not be enough to learn to classify digits (try recognizing a digit by looking at it only through windows that are 7 × 7 pixels!). We need the features from the last convolution layer to contain information about the totality of the input.
  • The final feature map has 22 × 22 × 64 = 30,976 total coefficients per sample. That's huge. If you were to flatten it and stick a Dense layer of size 512 on top, that layer would have about 15.8 million parameters (see the quick check below). That's far too large for such a small model and would result in intense overfitting.
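A quick check of that parameter count (counting the weights and biases of a hypothetical Dense layer of size 512 placed on top of the flattened feature map):

flattened_size = 22 * 22 * 64              # 30,976 coefficients per sample
dense_params = flattened_size * 512 + 512  # one weight per input per unit, plus one bias per unit
print(dense_params)                        # 15,860,224: about 15.8 million parameters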

In short, the reason to use downsampling is to reduce the number of feature map coefficients to process, as well as to induce spatial filter hierarchies by making successive convolution layers look at increasingly large windows (in terms of the fraction of the original input they cover).

Note that max pooling isn't the only way to achieve such downsampling. As you already know, you can also use strides in the prior convolution layer. And you can use average pooling instead of max pooling, where each local input patch is transformed by taking the average value of each channel over the patch rather than the max. But max pooling tends to work better than these alternatives. In a nutshell, the reason is that features tend to encode the spatial presence of some pattern or concept over the different tiles of the feature map (hence the term feature map), and it's more informative to look at the maximal presence of different features than at their average presence. So the most reasonable subsampling strategy is to first produce dense maps of features (via unstrided convolutions) and then look at the maximal activation of the features over small patches, rather than looking at sparser windows of the inputs (via strided convolutions) or averaging input patches, which could cause you to miss or dilute feature-presence information.
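To see why the max tends to carry more information about feature presence than the average, consider a toy channel in which a filter fired strongly at exactly one location (a made-up example):

import numpy as np

channel = np.zeros((4, 4))   # one channel of a feature map
channel[1, 2] = 9.0          # the pattern was detected strongly at a single location

window = channel[0:2, 2:4]   # the 2 x 2 pooling window containing that activation
print(window.max())          # 9.0  -- max pooling preserves the "pattern is here" signal
print(window.mean())         # 2.25 -- average pooling dilutes it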

At this point, you should understand the basic concepts of convolutional neural networks: feature maps, convolution, and max pooling. You also know how to build a small convolutional neural network to solve a simple problem such as MNIST digit classification. Now let's move on to more practical applications.

5.2 Training a convolutional neural network from scratch on a small dataset

Having to train an image classification model using very little data is a common situation, which you will likely encounter in practice if you work in computer vision. "Few" samples can mean anywhere from a few hundred to a few tens of thousands of images. As a practical example, we will focus on classifying images as cats or dogs, using a dataset containing 4,000 pictures of cats and dogs (2,000 cats, 2,000 dogs). We will use 2,000 pictures for training, 1,000 for validation, and 1,000 for testing.

This section covers one basic strategy to tackle this problem: training a new model from scratch using the little data you have. We will start by naively training a small convolutional neural network on the 2,000 training samples, without any regularization, to set a baseline for what can be achieved. This will get us to a classification accuracy of 71%. At that point, the main issue will be overfitting. Then we will introduce data augmentation, a powerful technique for mitigating overfitting in computer vision. With data augmentation, the network's accuracy will improve to 82%.

Section 5.3 will introduce two more essential techniques for applying deep learning to small datasets: feature extraction with a pretrained network (which will get us to an accuracy of 90% to 96%) and fine-tuning a pretrained network (with a final accuracy of 97%). Together, these three strategies, training a small model from scratch, doing feature extraction using a pretrained network, and fine-tuning a pretrained network, will constitute your toolbox for tackling the problem of image classification with small datasets.

5.2.1 The relevance of deep learning for small-data problems

You will sometimes hear that deep learning only works when lots of data is available. This is valid in part: one fundamental characteristic of deep learning is that it can find interesting features in the training data on its own, without any need for manual feature engineering, and this can only be achieved when lots of training examples are available. This is especially true for problems where the input samples are very high-dimensional, like images.

But what constitutes "lots" of samples is relative: relative to the size and depth of the network you are trying to train, for starters. It isn't possible to train a convolutional neural network to solve a complex problem with just a few dozen samples, but a few hundred can potentially suffice if the model is small and well regularized and the task is simple. Because convolutional neural networks learn local, translation-invariant features, they are highly data efficient on perceptual problems. Training a convolutional neural network from scratch on a very small image dataset will still yield reasonable results despite a relative lack of data, without the need for any custom feature engineering. You will see this in action in this section.

What's more, deep learning models are by nature highly repurposable: you can take, say, an image classification or speech-to-text model trained on a large-scale dataset and reuse it on a significantly different problem with only minor changes. Specifically, in the case of computer vision, many pretrained models (usually trained on the ImageNet dataset) are now publicly available for download and can be used to build powerful vision models out of very little data. That is the topic of Section 5.3. For now, let's get our hands on the data.

5.2.2 Downloading the data

The Dogs vs. Cats dataset used in this section isn't packaged with Keras. It was made available by Kaggle at the end of 2013 as part of a computer vision competition, back when convolutional neural networks weren't mainstream. You can download the original dataset from https://www.kaggle.com/c/dogs-vs-cats/data (you'll need to create a Kaggle account if you don't already have one; don't worry, the process is painless).

The pictures are medium-resolution color JPEGs. Figure 5-8 shows some examples.

Figure 5-8 Samples from the Dogs vs. Cats dataset. Sizes weren't modified: the samples vary in size, appearance, and so on

Unsurprisingly, the 2013 Dogs vs. Cats Kaggle competition was won by entrants who used convolutional neural networks. The best entries achieved up to 95% accuracy. In this example, we will get fairly close to this accuracy (see the next section), even though we will train our models on less than 10% of the data that was available to the competitors.

This dataset contains 25,000 images of cats and dogs (12,500 from each class) and is 543 MB (compressed). After downloading and uncompressing the data, we will create a new dataset containing three subsets: a training set with 1,000 samples of each class, a validation set with 500 samples of each class, and a test set with 500 samples of each class.

The code to create a new dataset is shown below.

Listing 5-4 copies images to the training, validation, and test directories

import os, shutil
original_dataset_dir = '/Users/fchollet/Downloads/kaggle_original_data'
base_dir = '/Users/fchollet/Downloads/cats_and_dogs_small'
os.mkdir(base_dir)
train_dir = os.path.join(base_dir, 'train')
os.mkdir(train_dir)
validation_dir = os.path.join(base_dir, 'validation')
os.mkdir(validation_dir)
test_dir = os.path.join(base_dir, 'test')
os.mkdir(test_dir)
train_cats_dir = os.path.join(train_dir, 'cats')
os.mkdir(train_cats_dir)
train_dogs_dir = os.path.join(train_dir, 'dogs')
os.mkdir(train_dogs_dir)
validation_cats_dir = os.path.join(validation_dir, 'cats')
os.mkdir(validation_cats_dir)
validation_dogs_dir = os.path.join(validation_dir, 'dogs')
os.mkdir(validation_dogs_dir)
test_cats_dir = os.path.join(test_dir, 'cats')
os.mkdir(test_cats_dir)
test_dogs_dir = os.path.join(test_dir, 'dogs')
os.mkdir(test_dogs_dir)
fnames = ['cat.{}.jpg'.format(i) for i in range(1000)]
for fname in fnames:
 src = os.path.join(original_dataset_dir, fname)
 dst = os.path.join(train_cats_dir, fname)
 shutil.copyfile(src, dst)
fnames = ['cat.{}.jpg'.format(i) for i in range(1000, 1500)]
for fname in fnames:
 src = os.path.join(original_dataset_dir, fname)
 dst = os.path.join(validation_cats_dir, fname)
 shutil.copyfile(src, dst)
fnames = ['cat.{}.jpg'.format(i) for i in range(1500, 2000)]
for fname in fnames:
 src = os.path.join(original_dataset_dir, fname)
 dst = os.path.join(test_cats_dir, fname)
 shutil.copyfile(src, dst)
fnames = ['dog.{}.jpg'.format(i) for i in range(1000)]
for fname in fnames:
 src = os.path.join(original_dataset_dir, fname)
 dst = os.path.join(train_dogs_dir, fname)
 shutil.copyfile(src, dst)
fnames = ['dog.{}.jpg'.format(i) for i in range(1000, 1500)]
for fname in fnames:
 src = os.path.join(original_dataset_dir, fname)
 dst = os.path.join(validation_dogs_dir, fname)
 shutil.copyfile(src, dst)
fnames = ['dog.{}.jpg'.format(i) for i in range(1500, 2000)]
for fname in fnames:
 src = os.path.join(original_dataset_dir, fname)
 dst = os.path.join(test_dogs_dir, fname)
 shutil.copyfile(src, dst)

As a sanity check, let's count how many pictures are in each split (train/validation/test).

>>> print('total training cat images:', len(os.listdir(train_cats_dir)))
total training cat images: 1000
>>> print('total training dog images:', len(os.listdir(train_dogs_dir)))
total training dog images: 1000
>>> print('total validation cat images:', len(os.listdir(validation_cats_dir)))
total validation cat images: 500
>>> print('total validation dog images:', len(os.listdir(validation_dogs_dir)))
total validation dog images: 500
>>> print('total test cat images:', len(os.listdir(test_cats_dir)))
total test cat images: 500
>>> print('total test dog images:', len(os.listdir(test_dogs_dir)))
total test dog images: 500

So we do indeed have 2,000 training images, 1,000 validation images, and 1,000 test images. Each split contains the same number of samples from each class: this is a balanced binary classification problem, which means classification accuracy will be an appropriate measure of success.

5.2.3 Building the network

We built a small convolutional neural network for MNIST in the previous example, so you should be familiar with such networks. We will reuse the same general structure: the convolutional neural network will be a stack of alternating Conv2D layers (with relu activation) and MaxPooling2D layers.

But because we are dealing with bigger images and a more complex problem, we will make the network larger accordingly: it will have one more Conv2D + MaxPooling2D stage. This both augments the capacity of the network and further reduces the size of the feature maps so they aren't overly large when we reach the Flatten layer. Here, because we start from inputs of size 150 × 150 (a somewhat arbitrary choice), we end up with feature maps of size 7 × 7 just before the Flatten layer.

Note that the depth of the feature maps progressively increases in the network (from 32 to 128), whereas the size of the feature maps decreases (from 150 × 150 to 7 × 7). This is a pattern you will see in almost all convolutional neural networks.

Because we are attacking a binary classification problem, we end the network with a single unit (a Dense layer of size 1) and a sigmoid activation. This unit will encode the probability of one class or the other.

Listing 5-5 instantiates a small convolutional neural network for dogs vs. cats classification

from keras import layers
from keras import models
model = models.Sequential()
model.add(layers.Conv2D(32, (3, 3), activation='relu',
 input_shape=(150, 150, 3)))
model.add(layers.MaxPooling2D((2, 2)))
model.add(layers.Conv2D(64, (3, 3), activation='relu'))
model.add(layers.MaxPooling2D((2, 2)))
model.add(layers.Conv2D(128, (3, 3), activation='relu'))
model.add(layers.MaxPooling2D((2, 2)))
model.add(layers.Conv2D(128, (3, 3), activation='relu'))
model.add(layers.MaxPooling2D((2, 2)))
model.add(layers.Flatten())
model.add(layers.Dense(512, activation='relu'))
model.add(layers.Dense(1, activation='sigmoid'))

Let's look at how the dimensions of the feature maps change with every successive layer.

>>> model.summary()
_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
conv2d_1 (Conv2D) (None, 148, 148, 32) 896
_________________________________________________________________
max_pooling2d_1 (MaxPooling2D) (None, 74, 74, 32) 0
_________________________________________________________________
conv2d_2 (Conv2D) (None, 72, 72, 64) 18496
_________________________________________________________________
max_pooling2d_2 (MaxPooling2D) (None, 36, 36, 64) 0
_________________________________________________________________
conv2d_3 (Conv2D) (None, 34, 34, 128) 73856
_________________________________________________________________
max_pooling2d_3 (MaxPooling2D) (None, 17, 17, 128) 0
_________________________________________________________________
conv2d_4 (Conv2D) (None, 15, 15, 128) 147584
_________________________________________________________________
max_pooling2d_4 (MaxPooling2D) (None, 7, 7, 128) 0
_________________________________________________________________
flatten_1 (Flatten) (None, 6272) 0
_________________________________________________________________
dense_1 (Dense) (None, 512) 3211776
_________________________________________________________________
dense_2 (Dense) (None, 1) 513
=================================================================
Total params: 3,453,121
Trainable params: 3,453,121
Non-trainable params: 0

For the compilation step, we will go with the RMSprop optimizer, as before. Because the network ends with a single sigmoid unit, we will use binary crossentropy as the loss function (as a reminder, Table 4-1 lists the loss functions to use in various situations).

Listing 5-6 configures the model for training

from keras import optimizers
model.compile(loss='binary_crossentropy',
 optimizer=optimizers.RMSprop(lr=1e-4),
 metrics=['acc'])

5.2.4 Data preprocessing

As you know by now, data should be formatted into appropriately preprocessed floating-point tensors before being fed into the network. Currently, the data sits on disk as JPEG files, so the preprocessing steps are roughly as follows.

  1. Read the picture files.
  2. Decode the JPEG content to RGB grids of pixels.
  3. Convert these pixel grids into floating-point tensors.
  4. Rescale the pixel values (between 0 and 255) to the [0, 1] interval (as you know, neural networks prefer to deal with small input values).
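As a minimal sketch, these four steps could be done by hand with Keras's image utilities (the file path below is hypothetical); in practice we will let Keras automate the whole pipeline, as described next.

from keras.preprocessing import image

img_path = '/path/to/some_image.jpg'                      # hypothetical JPEG file
img = image.load_img(img_path, target_size=(150, 150))    # steps 1-2: read the file and decode it to RGB
x = image.img_to_array(img)                               # step 3: convert to a float tensor of shape (150, 150, 3)
x = x.astype('float32') / 255                             # step 4: rescale pixel values to the [0, 1] interval
print(x.shape, x.min(), x.max())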

Fortunately, Keras has utilities to take care of these steps automatically. It has a module with image-processing helper tools, located at keras.preprocessing.image. In particular, it contains the class ImageDataGenerator, which lets you quickly set up Python generators that automatically turn image files on disk into batches of preprocessed tensors. That is what we will use here.

Listing 5-7 reads an image from a directory using the ImageDataGenerator

from keras.preprocessing.image import ImageDataGenerator
train_datagen = ImageDataGenerator(rescale=1./255)
test_datagen = ImageDataGenerator(rescale=1./255)
train_generator = train_datagen.flow_from_directory(
 train_dir,
 target_size=(150, 150),
 batch_size=20,
 class_mode='binary')
validation_generator = test_datagen.flow_from_directory(
 validation_dir,
 target_size=(150, 150),
 batch_size=20,
 class_mode='binary')

Understanding Python generators

A Python generator is an object that acts as an iterator: it can be used with the for ... in operator. Generators are built using the yield statement. Here is an example of a generator that yields integers.

def generator():
    i = 0
    while True:
        i += 1
        yield i

for item in generator():
    print(item)
    if item > 4:
        break

The output is as follows.

1
2
3
4
5

Let's look at the output of one of these generators: it yields batches of 150 × 150 RGB images (of shape (20, 150, 150, 3)) and binary labels (of shape (20,)). There are 20 samples in each batch (the batch size). Note that the generator yields these batches indefinitely: it loops endlessly over the images in the target folder. For this reason, you need to break the iteration loop at some point.

>>> for data_batch, labels_batch in train_generator:
>>> print('data batch shape:', data_batch.shape)
>>> print('labels batch shape:', labels_batch.shape)
>>> break
data batch shape: (20, 150, 150, 3)
labels batch shape: (20,)

Let's fit the model to the data using the generator. We do so with the fit_generator method, the equivalent of fit for data generators like this one. It expects as its first argument a Python generator that will yield batches of inputs and targets indefinitely, like train_generator does. Because the data is being generated endlessly, the Keras model needs to know how many samples to draw from the generator before declaring an epoch over. This is the role of the steps_per_epoch argument: after having drawn steps_per_epoch batches from the generator (that is, after having run for steps_per_epoch gradient descent steps), the fitting process will go to the next epoch. In this case, batches contain 20 samples, so it will take 100 batches to see all 2,000 samples.

When using fit_generator, you can pass a validation_data argument, much as with the fit method. It's worth noting that this argument is allowed to be a data generator, but it could also be a tuple of Numpy arrays. If you pass a generator as validation_data, then this generator is expected to yield batches of validation data endlessly; in that case you should also specify the validation_steps argument, which tells the process how many batches to draw from the validation generator for evaluation.
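For example, passing the validation data as Numpy arrays rather than as a generator would look roughly like this (a sketch only; validation_images and validation_labels are assumed to be arrays you have already loaded into memory, in which case validation_steps is not needed).

history = model.fit_generator(
    train_generator,
    steps_per_epoch=100,
    epochs=30,
    validation_data=(validation_images, validation_labels))   # hypothetical preloaded Numpy arrays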

Listing 5-8 uses a batch generator to fit the model

history = model.fit_generator(
 train_generator,
 steps_per_epoch=100,
 epochs=30,
 validation_data=validation_generator,
 validation_steps=50)

It is a good practice to always save the model after training.

Listing 5-9 saves the model

model.save('cats_and_dogs_small_1.h5')

Let's plot the loss and accuracy of the model over the training and validation data during training (see Figures 5-9 and 5-10).

Listing 5-10 plots the loss and accuracy curves during training

import matplotlib.pyplot as plt
acc = history.history['acc']
val_acc = history.history['val_acc']
loss = history.history['loss']
val_loss = history.history['val_loss']
epochs = range(1, len(acc) + 1)
plt.plot(epochs, acc, 'bo', label='Training acc')
plt.plot(epochs, val_acc, 'b', label='Validation acc')
plt.title('Training and validation accuracy')
plt.legend()
plt.figure()
plt.plot(epochs, loss, 'bo', label='Training loss')
plt.plot(epochs, val_loss, 'b', label='Validation loss')
plt.title('Training and validation loss')
plt.legend()
plt.show()

Figure 5-9 Training and validation accuracy

Figure 5-10 Training and validation loss

These plots are characteristic of overfitting. The training accuracy increases linearly over time, until it reaches nearly 100%, whereas the validation accuracy stalls at 70% to 72%. The validation loss reaches its minimum after only five epochs and then stalls, whereas the training loss keeps decreasing linearly until it reaches nearly 0.

Because we have relatively few training samples (2,000), overfitting will be our number-one concern. You already know about several techniques that can help mitigate overfitting, such as dropout and weight decay (L2 regularization). We are now going to use a new one, specific to computer vision and used almost universally when processing images with deep learning models: data augmentation.

5.2.5 Using data augmentation

Overfitting is caused by having too few samples to learn from, which makes it impossible to train a model that can generalize to new data. Given infinite data, the model would be exposed to every possible aspect of the data distribution at hand, so it would never overfit. Data augmentation takes the approach of generating more training data from existing training samples, by augmenting the samples via a number of random transformations that yield believable-looking images. The goal is that, at training time, the model will never see the exact same picture twice. This helps expose the model to more aspects of the data and generalize better.

In Keras, this can be done by configuring a number of random transformations to be performed on the images read by an ImageDataGenerator instance. Let's get started with an example.

Listing 5-11 uses ImageDataGenerator to set up data augmentation

datagen = ImageDataGenerator(
 rotation_range=40,
 width_shift_range=0.2,
 height_shift_range=0.2,
 shear_range=0.2,
 zoom_range=0.2,
 horizontal_flip=True,
 fill_mode='nearest')

These are just a few of the options available (for more, see the Keras documentation). Let's quickly go over what each argument means.

  • rotation_range is a value in degrees (0 to 180), a range within which to randomly rotate pictures.
  • width_shift_range and height_shift_range are ranges (as a fraction of total width or height) within which to randomly translate pictures horizontally or vertically.
  • shear_range is for randomly applying shearing transformations.
  • zoom_range is for randomly zooming inside pictures.
  • horizontal_flip is for randomly flipping half the images horizontally. This is relevant when there are no assumptions of horizontal asymmetry (for example, real-world pictures).
  • fill_mode is the strategy used for filling in newly created pixels, which can appear after a rotation or a width/height shift.

Let's look at some of the augmented images (see Figure 5-11).

Figure 5-11 Cat pictures generated via random data augmentation

Listing 5-12 displays some randomly augmented training images

from keras.preprocessing import image
fnames = [os.path.join(train_cats_dir, fname) for
          fname in os.listdir(train_cats_dir)]
img_path = fnames[3]
img = image.load_img(img_path, target_size=(150, 150))
x = image.img_to_array(img)
x = x.reshape((1,) + x.shape)
i = 0
for batch in datagen.flow(x, batch_size=1):
    plt.figure(i)
    imgplot = plt.imshow(image.array_to_img(batch[0]))
    i += 1
    if i % 4 == 0:
        break
plt.show()

If you train a new network using this data-augmentation configuration, the network will never see the same input twice. But the inputs it sees are still heavily intercorrelated, because they come from a small number of original images: you can't produce new information, you can only remix existing information. As such, this may not be enough to completely get rid of overfitting. To further fight overfitting, you will also add a Dropout layer to the model, right before the densely connected classifier.

Listing 5-13 defines a new convolutional neural network with dropout

model = models.Sequential()
model.add(layers.Conv2D(32, (3, 3), activation='relu',
 input_shape=(150, 150, 3)))
model.add(layers.MaxPooling2D((2, 2)))
model.add(layers.Conv2D(64, (3, 3), activation='relu'))
model.add(layers.MaxPooling2D((2, 2)))
model.add(layers.Conv2D(128, (3, 3), activation='relu'))
model.add(layers.MaxPooling2D((2, 2)))
model.add(layers.Conv2D(128, (3, 3), activation='relu'))
model.add(layers.MaxPooling2D((2, 2)))
model.add(layers.Flatten())
model.add(layers.Dropout(0.5))
model.add(layers.Dense(512, activation='relu'))
model.add(layers.Dense(1, activation='sigmoid'))
model.compile(loss='binary_crossentropy',
 optimizer=optimizers.RMSprop(lr=1e-4),
 metrics=['acc'])

Let's train the network using data augmentation and dropout.

Listing 5-14 trains the convolutional neural network using data-augmentation generators

train_datagen = ImageDataGenerator(
 rescale=1./255,
 rotation_range=40,
 width_shift_range=0.2,
 height_shift_range=0.2,
 shear_range=0.2,
 zoom_range=0.2,
 horizontal_flip=True,)
test_datagen = ImageDataGenerator(rescale=1./255)
train_generator = train_datagen.flow_from_directory(
 train_dir,
 target_size=(150, 150),
 batch_size=32,
 class_mode='binary')
validation_generator = test_datagen.flow_from_directory(
 validation_dir,
 target_size=(150, 150),
 batch_size=32,
 class_mode='binary')
history = model.fit_generator(
 train_generator,
 steps_per_epoch=100,
 epochs=100,
 validation_data=validation_generator,
 validation_steps=50)

We'll save the model and you'll use it in section 5.4.

Listing 5-15 saves the model

model.save('cats_and_dogs_small_2.h5')

Let's plot the results again (see Figures 5-12 and 5-13). Thanks to data augmentation and dropout, the model no longer overfits: the training curves closely track the validation curves. We now reach an accuracy of 82%, a 15% relative improvement over the non-regularized model.

Figure 5-12 Training and validation accuracy with data augmentation

Figure 5-13 Training and validation loss with data augmentation

By using regularization techniques even further, and by tuning the network's parameters (such as the number of filters per convolution layer, or the number of layers in the network), you may be able to get an even better accuracy, likely up to 86% or 87%. But it would prove difficult to go any higher just by training your own convolutional neural network from scratch, because you have so little data to work with. As a next step to improve accuracy on this problem, you will have to use a pretrained model, which is the focus of the next two sections.

5.3 Using a pretrained convolutional neural network

A common and highly effective approach to deep learning on small image datasets is to use a pretrained network. A pretrained network is a saved network that was previously trained on a large dataset, typically on a large-scale image classification task. If this original dataset is large enough and general enough, then the spatial hierarchy of features learned by the pretrained network can effectively act as a generic model of the visual world, and hence its features can prove useful for many different computer vision problems, even though these new problems may involve completely different classes than those of the original task. For instance, you might train a network on ImageNet (where classes are mostly animals and everyday objects) and then repurpose this trained network for something as remote as identifying furniture items in images. Such portability of learned features across different problems is a key advantage of deep learning compared to many older, shallow learning approaches, and it makes deep learning very effective for small-data problems.

In this case, let's consider a large convolutional neural network trained on the ImageNet dataset (1.4 million labeled images and 1,000 different classes). ImageNet contains many animal classes, including different species of cats and dogs, so you can expect it to perform well on the cat-versus-dog classification problem.

We will use the VGG16 architecture, developed by Karen Simonyan and Andrew Zisserman in 2014; it's a simple and widely used convolutional neural network architecture for ImageNet. Although VGG16 is an older model, far from the current state of the art and somewhat heavier than many more recent models, I chose it because its architecture is similar to what you're already familiar with and is easy to understand without introducing any new concepts. This may be your first encounter with one of these cutesy model names: VGG, ResNet, Inception, Inception-ResNet, Xception, and so on. You will get used to them, because they will come up frequently if you keep doing deep learning for computer vision.

There are two ways to use a pretrained network: feature extraction and fine-tuning. We will cover both of them. Let's start with feature extraction.

5.3.1 Feature extraction

Feature extraction consists of using the representations learned by a previous network to extract interesting features from new samples. These features are then run through a new classifier, which is trained from scratch.

As you saw previously, convolutional neural networks used for image classification comprise two parts: they start with a series of pooling and convolution layers, and they end with a densely connected classifier. The first part is called the convolutional base of the model. In the case of convolutional neural networks, feature extraction consists of taking the convolutional base of a previously trained network, running the new data through it, and training a new classifier on top of the output (see Figure 5-14).

Figure 5-14 Swapping classifiers while keeping the same convolutional base

Why only reuse the convolutional base? Could we reuse the densely connected classifier as well? In general, doing so should be avoided. The reason is that the representations learned by the convolutional base are likely to be more generic and therefore more reusable: the feature maps of a convolutional neural network are presence maps of generic concepts over a picture, which are likely to be useful regardless of the computer vision problem at hand. But the representations learned by the classifier will necessarily be specific to the set of classes on which the model was trained: they will only contain information about the presence probability of this or that class in the entire picture. Additionally, representations found in densely connected layers no longer contain any information about where objects are located in the input image: these layers get rid of the notion of space, whereas the object locations are still described by the convolutional feature maps. For problems where object location matters, densely connected features are largely useless.

Note that the level of generality (and therefore reusability) of the representations extracted by specific convolution layers depends on the depth of the layer in the model. Layers that come earlier in the model extract local, highly generic feature maps (such as visual edges, colors, and textures), whereas layers that are higher up extract more abstract concepts (such as "cat ear" or "dog eye"). So if your new dataset differs a lot from the dataset on which the original model was trained, you may be better off using only the first few layers of the model to do feature extraction, rather than the entire convolutional base.
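If you did want to keep only the earlier, more generic layers, one way to do it in Keras is to build a new model that stops at an intermediate layer. The sketch below assumes the VGG16 convolutional base instantiated as conv_base a bit further on, and uses block3_pool, one of its standard layer names (visible in conv_base.summary()).

from keras import models

# Keep only the lower, more generic blocks of the pretrained base.
truncated_base = models.Model(inputs=conv_base.input,
                              outputs=conv_base.get_layer('block3_pool').output)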

In this case, because the ImageNet class set contains multiple dog and cat classes, it would probably be beneficial to reuse the information contained in the densely connected layers of the original model. But we will choose not to, in order to cover the more general case where the class set of the new problem doesn't overlap the class set of the original model. Let's put this into practice by using the convolutional base of the VGG16 network, trained on ImageNet, to extract interesting features from cat and dog images, and then train a cat-versus-dog classifier on top of these features.

The VGG16 model, among others, comes prepackaged with Keras. You can import it from the keras.applications module. Here is a list of image classification models (all pretrained on the ImageNet dataset) that are available as part of keras.applications:

  • Xception
  • Inception V3
  • ResNet50
  • VGG16
  • VGG19
  • MobileNet

We instantiate the VGG16 model.

Listing 5-16 instantiates the VGG16 convolutional base

from keras.applications import VGG16
conv_base = VGG16(weights='imagenet',
 include_top=False,
 input_shape=(150, 150, 3))

We pass three arguments to the constructor.

  • weights specifies the weight checkpoint from which to initialize the model.
  • include_top refers to including (or not) the densely connected classifier on top of the network. By default, this densely connected classifier corresponds to the 1,000 classes of ImageNet. Because we intend to use our own densely connected classifier (with only two classes: cat and dog), we don't need to include it.
  • input_shape is the shape of the image tensors that we will feed to the network. This argument is purely optional: if we don't pass it, the network will be able to process inputs of any size.

Here is the detail of the architecture of the VGG16 convolutional base. It is similar to the simple convolutional neural networks you are already familiar with.

>>> conv_base.summary()
_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
input_1 (InputLayer) (None, 150, 150, 3) 0 
_________________________________________________________________
block1_conv1 (Conv2D) (None, 150, 150, 64) 1792
_________________________________________________________________
block1_conv2 (Conv2D) (None, 150, 150, 64) 36928
_________________________________________________________________
block1_pool (MaxPooling2D) (None, 75, 75, 64) 0
_________________________________________________________________
block2_conv1 (Conv2D) (None, 75, 75, 128) 73856
_________________________________________________________________
block2_conv2 (Conv2D) (None, 75, 75, 128) 147584
_________________________________________________________________
block2_pool (MaxPooling2D) (None, 37, 37, 128) 0
_________________________________________________________________
block3_conv1 (Conv2D) (None, 37, 37, 256) 295168
_________________________________________________________________
block3_conv2 (Conv2D) (None, 37, 37, 256) 590080
_________________________________________________________________
block3_conv3 (Conv2D) (None, 37, 37, 256) 590080
_________________________________________________________________
block3_pool (MaxPooling2D) (None, 18, 18, 256) 0
_________________________________________________________________
block4_conv1 (Conv2D) (None, 18, 18, 512) 1180160
_________________________________________________________________
block4_conv2 (Conv2D) (None, 18, 18, 512) 2359808
_________________________________________________________________
block4_conv3 (Conv2D) (None, 18, 18, 512) 2359808
_________________________________________________________________
block4_pool (MaxPooling2D) (None, 9, 9, 512) 0
_________________________________________________________________
block5_conv1 (Conv2D) (None, 9, 9, 512) 2359808
_________________________________________________________________
block5_conv2 (Conv2D) (None, 9, 9, 512) 2359808
_________________________________________________________________
block5_conv3 (Conv2D) (None, 9, 9, 512) 2359808
_________________________________________________________________
block5_pool (MaxPooling2D) (None, 4, 4, 512) 0
=================================================================
Total params: 14,714,688
Trainable params: 14,714,688
Non-trainable params: 0

The final feature map has shape (4, 4, 512). That is the feature map on top of which we will stick a densely connected classifier. At this point, there are two ways we could proceed.

  • Run the convolutional base over our dataset, save its output to a Numpy array on disk, and then use this data as input to a standalone, densely connected classifier similar to those introduced in the first part of this book. This solution is fast and cheap to run, because it only requires running the convolutional base once for every input image, and the convolutional base is by far the most expensive part of the pipeline. But for the same reason, this technique won't allow us to use data augmentation.
  • Extend the model we have (conv_base) by adding Dense layers on top, and run the whole thing end to end on the input data. This allows us to use data augmentation, because every input image goes through the convolutional base every time it is seen by the model. But for the same reason, this technique is far more expensive than the first.

We will cover both techniques. First, let's walk through the code required for the first one: recording the output of conv_base on our data and using these outputs as inputs to a new model.

  1. Fast feature extraction without data augmentation

First, we will run instances of the previously introduced ImageDataGenerator to extract images as Numpy arrays, as well as their labels. We will extract features from these images by calling the predict method of the conv_base model.

Listing 5-17 extracts features using the pretrained convolutional base

import os
import numpy as np
from keras.preprocessing.image import ImageDataGenerator
base_dir = '/Users/fchollet/Downloads/cats_and_dogs_small'
train_dir = os.path.join(base_dir, 'train')
validation_dir = os.path.join(base_dir, 'validation')
test_dir = os.path.join(base_dir, 'test')
datagen = ImageDataGenerator(rescale=1./255)
batch_size = 20
def extract_features(directory, sample_count):
    features = np.zeros(shape=(sample_count, 4, 4, 512))
    labels = np.zeros(shape=(sample_count))
    generator = datagen.flow_from_directory(
        directory,
        target_size=(150, 150),
        batch_size=batch_size,
        class_mode='binary')
    i = 0
    for inputs_batch, labels_batch in generator:
        features_batch = conv_base.predict(inputs_batch)
        features[i * batch_size : (i + 1) * batch_size] = features_batch
        labels[i * batch_size : (i + 1) * batch_size] = labels_batch
        i += 1
        if i * batch_size >= sample_count:
            break
    return features, labels
train_features, train_labels = extract_features(train_dir, 2000)
validation_features, validation_labels = extract_features(validation_dir, 1000)
test_features, test_labels = extract_features(test_dir, 1000)

The extracted features are currently of shape (samples, 4, 4, 512). We will feed them to a densely connected classifier, so first we must flatten them to (samples, 8192).

train_features = np.reshape(train_features, (2000, 4 * 4 * 512))
validation_features = np.reshape(validation_features, (1000, 4 * 4 * 512))
test_features = np.reshape(test_features, (1000, 4 * 4 * 512))

At this point, we can define our densely connected classifier (note the use of dropout for regularization) and train it on the data and labels that we just saved.

Listing 5-18 defines and trains the densely connected classifier

from keras import models
from keras import layers
from keras import optimizers
model = models.Sequential()
model.add(layers.Dense(256, activation='relu', input_dim=4 * 4 * 512))
model.add(layers.Dropout(0.5))
model.add(layers.Dense(1, activation='sigmoid'))
model.compile(optimizer=optimizers.RMSprop(lr=2e-5),
 loss='binary_crossentropy',
 metrics=['acc'])
history = model.fit(train_features, train_labels,
 epochs=30,
 batch_size=20,
 validation_data=(validation_features, validation_labels))

Training is very fast, because we only have to deal with two Dense layers: an epoch takes less than one second even on a CPU.

Let's look at the loss and accuracy curves during training (see Figures 5-15 and 5-16).

Listing 5-19 plots the results

import matplotlib.pyplot as plt
acc = history.history['acc']
val_acc = history.history['val_acc']
loss = history.history['loss']
val_loss = history.history['val_loss']
epochs = range(1, len(acc) + 1)
plt.plot(epochs, acc, 'bo', label='Training acc')
plt.plot(epochs, val_acc, 'b', label='Validation acc')
plt.title('Training and validation accuracy')
plt.legend()
plt.figure()
plt.plot(epochs, loss, 'bo', label='Training loss')
plt.plot(epochs, val_loss, 'b', label='Validation loss')
plt.title('Training and validation loss')
plt.legend()
plt.show()

Figure 5-15 training and validation accuracy for simple feature extraction

Figure 5-16 training and validation loss for simple feature extraction

We reach a validation accuracy of about 90%, much better than the small model trained from scratch in the previous section. The plots also show, however, that despite using a fairly large dropout rate, the model starts overfitting almost from the beginning. That is because this technique does not use data augmentation, which is essential for preventing overfitting on small image datasets.

  2. Feature extraction with data augmentation

Let's look at the second technique for feature extraction, which is much slower and more expensive, but allows us to use data augmentation during training: extending the conv_base model and running it end to end on the inputs.

Note that this technique is very computationally expensive; you should only attempt it if you have access to a GPU, because it is intractable on a CPU. If you cannot run your code on a GPU, use the first method instead.

Because models behave just like layers, you can add a model (such as conv_base) to a Sequential model just as you would add a layer.

Listing 5-20 adds a densely connected classifier on top of the convolutional base

from keras import models
from keras import layers
model = models.Sequential()
model.add(conv_base)
model.add(layers.Flatten())
model.add(layers.Dense(256, activation='relu'))
model.add(layers.Dense(1, activation='sigmoid'))

Now the architecture of the model is as follows.

>>> model.summary()
_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
vgg16 (Model) (None, 4, 4, 512) 14714688
_________________________________________________________________
flatten_1 (Flatten) (None, 8192) 0
_________________________________________________________________
dense_1 (Dense) (None, 256) 2097408
_________________________________________________________________
dense_2 (Dense) (None, 1) 257
=================================================================
Total params: 16,812,353
Trainable params: 16,812,353
Non-trainable params: 0

As you can see, the convolutional base of VGG16 has 14,714,688 parameters, which is a lot. The classifier added on top of it has 2 million parameters.

Before compiling and training the model, it is very important to "freeze" the convolutional base. Freezing a layer or set of layers means keeping their weights unchanged during training. If you don't do this, the representations previously learned by the convolutional base will be modified during training. Because the Dense layers on top are randomly initialized, very large weight updates would be propagated through the network, doing great damage to the previously learned representations.

In Keras, the way to freeze a network is to set its trainable attribute to False.

>>> print('This is the number of trainable weights '
 'before freezing the conv base:', len(model.trainable_weights))
This is the number of trainable weights before freezing the conv base: 30
>>> conv_base.trainable = False
>>> print('This is the number of trainable weights '
 'after freezing the conv base:', len(model.trainable_weights))
This is the number of trainable weights after freezing the conv base: 4

With this setup, only the weights of the two Dense layers you added will be trained. That is four weight tensors in total: two per layer (the main weight matrix and the bias vector). Note that in order for these changes to take effect, you must first compile the model. If you ever modify the trainability of weights after compilation, you should recompile the model, otherwise these changes will be ignored.
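As a minimal sketch of that pattern (the names match the listings in this section), freezing is followed by a (re)compilation so the change is honored:

conv_base.trainable = False                      # freeze the convolutional base
model.compile(loss='binary_crossentropy',        # (re)compile so the change takes effect
              optimizer=optimizers.RMSprop(lr=2e-5),
              metrics=['acc'])
# If trainable flags are changed again later (e.g. for fine-tuning), call compile() once more.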

Now you can start training the model, with the same data-augmentation configuration used in the earlier example.

Listing 5-21 trains the model end to end with a frozen convolutional base

from keras.preprocessing.image import ImageDataGenerator
from keras import optimizers
train_datagen = ImageDataGenerator(
 rescale=1./255,
 rotation_range=40,
 width_shift_range=0.2,
 height_shift_range=0.2,
 shear_range=0.2,
 zoom_range=0.2,
 horizontal_flip=True,
 fill_mode='nearest')
test_datagen = ImageDataGenerator(rescale=1./255)
train_generator = train_datagen.flow_from_directory(
 train_dir,
 target_size=(150, 150),
 batch_size=20,
 class_mode='binary')
validation_generator = test_datagen.flow_from_directory(
 validation_dir,
 target_size=(150, 150),
 batch_size=20,
 class_mode='binary')
model.compile(loss='binary_crossentropy',
 optimizer=optimizers.RMSprop(lr=2e-5),
 metrics=['acc'])
history = model.fit_generator(
 train_generator,
 steps_per_epoch=100,
 epochs=30,
 validation_data=validation_generator,
 validation_steps=50)

Let's plot the results again (see Figure 5-17 and Figure 5-18). As you can see, we reach a validation accuracy of about 96%. This is much better than the small convolutional neural network trained from scratch.

Figure 5-17 training and validation accuracy for feature extraction with data augmentation

Figure 5-18 training and validation loss for feature extraction with data augmentation

5.3.2 fine-tuning the model

Another widely used technique for model reuse, complementary to feature extraction, is fine-tuning. Fine-tuning consists of unfreezing a few of the top layers of the frozen model base used for feature extraction, and jointly training these unfrozen layers and the newly added part of the model (in this case, the fully connected classifier) (see Figure 5-19). It is called fine-tuning because it only slightly adjusts the more abstract representations of the model being reused, in order to make them more relevant for the problem at hand.

Figure 5-19 fine-tuning the last convolutional block of the VGG16 network

As stated earlier, the convolutional base of VGG16 was frozen in order to train a randomly initialized classifier on top of it. For the same reason, it is only possible to fine-tune the top layers of the convolutional base once that classifier has already been trained. If the classifier is not already trained, the error signal propagating through the network during training will be too large, and the representations previously learned by the layers being fine-tuned will be destroyed. Thus the steps for fine-tuning a network are as follows.

  1. Add your custom network on top of an already-trained base network.
  2. Freeze the base network.
  3. Train the part you added.
  4. Unfreeze some layers in the base network.
  5. Jointly train both these layers and the part you added (a condensed sketch of the whole workflow follows this list).
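
For orientation, here is a condensed sketch of how these five steps map onto Keras calls, assuming the conv_base, the data generators, and the imports from the listings in this section; the full versions appear in listings 5-20 through 5-23.

# Step 1: add a custom classifier on top of the trained base (listing 5-20)
model = models.Sequential()
model.add(conv_base)
model.add(layers.Flatten())
model.add(layers.Dense(256, activation='relu'))
model.add(layers.Dense(1, activation='sigmoid'))

# Step 2: freeze the base network
conv_base.trainable = False

# Step 3: train only the part you added (listing 5-21)
model.compile(loss='binary_crossentropy',
              optimizer=optimizers.RMSprop(lr=2e-5), metrics=['acc'])
model.fit_generator(train_generator, steps_per_epoch=100, epochs=30,
                    validation_data=validation_generator, validation_steps=50)

# Step 4: unfreeze the top layers of the base network
# (compact equivalent of listing 5-22)
conv_base.trainable = True
set_trainable = False
for layer in conv_base.layers:
    if layer.name == 'block5_conv1':
        set_trainable = True
    layer.trainable = set_trainable

# Step 5: jointly train the unfrozen layers and the added classifier,
# with a very low learning rate (listing 5-23)
model.compile(loss='binary_crossentropy',
              optimizer=optimizers.RMSprop(lr=1e-5), metrics=['acc'])
model.fit_generator(train_generator, steps_per_epoch=100, epochs=100,
                    validation_data=validation_generator, validation_steps=50)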

You already completed the first three steps when doing feature extraction. Let's proceed with step 4: we will unfreeze conv_base and then freeze individual layers inside it.

As a reminder, here is the architecture of the convolutional base.

>>> conv_base.summary()
_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
input_1 (InputLayer) (None, 150, 150, 3) 0
_________________________________________________________________
block1_conv1 (Conv2D) (None, 150, 150, 64) 1792
_________________________________________________________________
block1_conv2 (Conv2D) (None, 150, 150, 64) 36928
_________________________________________________________________
block1_pool (MaxPooling2D) (None, 75, 75, 64) 0
_________________________________________________________________
block2_conv1 (Conv2D) (None, 75, 75, 128) 73856
_________________________________________________________________
block2_conv2 (Conv2D) (None, 75, 75, 128) 147584
_________________________________________________________________
block2_pool (MaxPooling2D) (None, 37, 37, 128) 0
_________________________________________________________________
block3_conv1 (Conv2D) (None, 37, 37, 256) 295168
_________________________________________________________________
block3_conv2 (Conv2D) (None, 37, 37, 256) 590080
_________________________________________________________________
block3_conv3 (Conv2D) (None, 37, 37, 256) 590080
_________________________________________________________________
block3_pool (MaxPooling2D) (None, 18, 18, 256) 0
_________________________________________________________________
block4_conv1 (Conv2D) (None, 18, 18, 512) 1180160
_________________________________________________________________
block4_conv2 (Conv2D) (None, 18, 18, 512) 2359808
_________________________________________________________________
block4_conv3 (Conv2D) (None, 18, 18, 512) 2359808
_________________________________________________________________
block4_pool (MaxPooling2D) (None, 9, 9, 512) 0
_________________________________________________________________
block5_conv1 (Conv2D) (None, 9, 9, 512) 2359808
_________________________________________________________________
block5_conv2 (Conv2D) (None, 9, 9, 512) 2359808
_________________________________________________________________
block5_conv3 (Conv2D) (None, 9, 9, 512) 2359808 
_________________________________________________________________
block5_pool (MaxPooling2D) (None, 4, 4, 512) 0
=================================================================
Total params: 14,714,688
Trainable params: 14,714,688
Non-trainable params: 0

We will fine-tune the last three convolutional layers: all layers up to block4_pool should be frozen, while block5_conv1, block5_conv2, and block5_conv3 should be trainable.

Why not fine-tune more layers? Why not fine-tune the entire convolutional base? You could, but you should consider the following points.

  • Layers closer to the bottom of the convolutional base encode more generic, reusable features, whereas layers closer to the top encode more specialized features. It is more useful to fine-tune the more specialized features, because these are the ones that need to be repurposed for your new problem. There would be fast-diminishing returns in fine-tuning the lower layers.
  • The more parameters you train, the greater the risk of overfitting. The convolutional base has 15 million parameters, so it would be risky to attempt to train that many parameters on your small dataset.

Thus, in this situation, a good strategy is to fine-tune only the top two or three layers of the convolutional base. Let's set this up, continuing from where the previous example left off.

Code listing 5-22 freezes all layers up to a certain layer

conv_base.trainable = True
set_trainable = False
for layer in conv_base.layers:
    if layer.name == 'block5_conv1':
        set_trainable = True
    if set_trainable:
        layer.trainable = True
    else:
        layer.trainable = False

Now you can begin fine-tuning the network. We will use the RMSprop optimizer with a very low learning rate. The reason for the low learning rate is that we want to limit the magnitude of the modifications made to the representations of the three layers being fine-tuned. Updates that are too large may harm these representations.

Code listing 5-23 fine-tunes the model

model.compile(loss='binary_crossentropy',
 optimizer=optimizers.RMSprop(lr=1e-5),
 metrics=['acc'])
history = model.fit_generator(
 train_generator,
 steps_per_epoch=100,
 epochs=100,
 validation_data=validation_generator,
 validation_steps=50)

We can plot the results using the same plotting code as before (see Figure 5-20 and Figure 5-21).

Figure 5-20 training and validation accuracy for the fine-tuned model

Figure 5-21 training and validation loss for the fine-tuned model

These curves look noisy. To make them more readable, we can smooth them by replacing each loss and accuracy value with an exponential moving average of the previous values. The following simple utility function does this (see Figure 5-22 and Figure 5-23).
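Concretely, each smoothed point follows the usual exponential-moving-average recurrence, with factor as the smoothing coefficient used by the function below:

$$s_0 = x_0, \qquad s_t = \text{factor} \cdot s_{t-1} + (1 - \text{factor}) \cdot x_t$$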

Listing 5-24 smoothes the curve

def smooth_curve(points, factor=0.8):
    smoothed_points = []
    for point in points:
        if smoothed_points:
            previous = smoothed_points[-1]
            smoothed_points.append(previous * factor + point * (1 - factor))
        else:
            smoothed_points.append(point)
    return smoothed_points

plt.plot(epochs,
         smooth_curve(acc), 'bo', label='Smoothed training acc')
plt.plot(epochs,
         smooth_curve(val_acc), 'b', label='Smoothed validation acc')
plt.title('Training and validation accuracy')
plt.legend()
plt.figure()
plt.plot(epochs,
         smooth_curve(loss), 'bo', label='Smoothed training loss')
plt.plot(epochs,
         smooth_curve(val_loss), 'b', label='Smoothed validation loss')
plt.title('Training and validation loss')
plt.legend()
plt.show()

Figure 5-22 smoothed curves of training and validation accuracy for the fine-tuned model

Figure 5-23 smoothed curves of training and validation loss for the fine-tuned model

The validation accuracy curve now looks much cleaner. You can see a nice 1% absolute improvement in accuracy, from about 96% to above 97%.

Note that the loss curve shows no real improvement (in fact, it is getting worse). You may wonder: how can accuracy stay stable or improve if the loss is not decreasing? The answer is simple: the plot shows an average of pointwise loss values, but what matters for accuracy is the distribution of those loss values, not their average, because accuracy is the result of a binary thresholding of the class probability predicted by the model. The model may still be improving even if this is not reflected in the average loss.
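To make this concrete, here is a small illustrative calculation (not taken from the experiments above): two sets of predictions with the same accuracy but very different average binary crossentropy, because a few confident mistakes inflate the average loss without changing how many predictions cross the 0.5 threshold.

import numpy as np

def binary_crossentropy(y_true, y_pred):
    # mean of -[y*log(p) + (1-y)*log(1-p)]
    y_pred = np.clip(y_pred, 1e-7, 1 - 1e-7)
    return float(np.mean(-(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred))))

y_true = np.array([1, 1, 0, 0])
cautious = np.array([0.6, 0.6, 0.4, 0.6])       # one mild mistake
confident = np.array([0.99, 0.51, 0.01, 0.95])  # one very confident mistake

for name, y_pred in [('cautious', cautious), ('confident', confident)]:
    acc = float(np.mean((y_pred > 0.5) == y_true))
    print(name, 'accuracy:', acc, 'mean loss:', round(binary_crossentropy(y_true, y_pred), 2))
# Both reach 75% accuracy, but the 'confident' set has a much higher mean loss.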

Now, you can finally evaluate the model on the test data.

test_generator = test_datagen.flow_from_directory(
 test_dir,
 target_size=(150, 150),
 batch_size=20,
 class_mode='binary')
test_loss, test_acc = model.evaluate_generator(test_generator, steps=50)
print('test acc:', test_acc)

We get a test accuracy of 97%, a result that would have ranked among the best in the original Kaggle competition on this dataset. With modern deep-learning techniques, we reached this result using only a small fraction of the training data available (about 10%). There is a huge difference between being able to train on 20,000 samples and on 2,000 samples!

5.3.3 summary

Here are the main points you should learn from the exercises in the above two sections.

  • Convolutional neural networks are the best type of machine learning model for computer vision tasks. It is possible to train one from scratch even on a very small dataset, with decent results.
  • On a small dataset, overfitting is the main issue. Data augmentation is a powerful way to reduce overfitting when working with image data.
  • It is easy to reuse an existing convolutional neural network on a new dataset via feature extraction. This is a valuable technique for small image datasets.
  • As a complement to feature extraction, you can use fine-tuning, which adapts to a new problem some of the representations previously learned by an existing model. This pushes performance a bit further. You now have a solid set of tools for dealing with image-classification problems, in particular with small datasets.

5.4 visualization of convolutional neural network

It is often said that deep-learning models are "black boxes": the representations they learn are difficult to extract and present in a way humans can understand. Although this is partly true for certain types of deep-learning models, it is definitely not true for convolutional neural networks. The representations learned by convolutional neural networks are highly amenable to visualization, in large part because they are representations of visual concepts. Since 2013, a wide array of techniques has been developed for visualizing and interpreting these representations. We won't cover all of them here, but we will introduce the three that are easiest to understand and most useful.

  • Visualizing the intermediate outputs of a convolutional neural network (intermediate activations): useful for understanding how successive layers transform their input, and for getting a first idea of the meaning of individual filters.
  • Visualizing the filters of a convolutional neural network: useful for understanding precisely what visual pattern or concept each filter is receptive to.
  • Visualizing heatmaps of class activation in an image: useful for understanding which parts of an image were identified as belonging to a given class, which allows you to localize objects in images.

For the first method (visualizing activations), we will use the small convolutional neural network trained from scratch on the cats-vs-dogs classification problem in section 5.2. For the other two visualization methods, we will use the VGG16 model introduced in section 5.3.

5.4.1 visualizing intermediate activations

Visualizing intermediate activations consists of displaying the feature maps output by the various convolution and pooling layers in the network, given a certain input (the output of a layer is often called its activation, i.e. the output of the activation function). This shows how an input is decomposed into the different filters learned by the network. The feature maps we want to visualize have three dimensions: width, height, and depth (channels). Each channel encodes relatively independent features, so the proper way to visualize these feature maps is to plot the contents of every channel independently as a 2D image. Let's start by loading the model saved in section 5.2.

>>> from keras.models import load_model
>>> model = load_model('cats_and_dogs_small_2.h5')
>>> model.summary() # As a reminder
_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
conv2d_5 (Conv2D) (None, 148, 148, 32) 896
_________________________________________________________________
max_pooling2d_5 (MaxPooling2D) (None, 74, 74, 32) 0
_________________________________________________________________
conv2d_6 (Conv2D) (None, 72, 72, 64) 18496
_________________________________________________________________
max_pooling2d_6 (MaxPooling2D) (None, 36, 36, 64) 0
_________________________________________________________________
conv2d_7 (Conv2D) (None, 34, 34, 128) 73856
_________________________________________________________________
max_pooling2d_7 (MaxPooling2D) (None, 17, 17, 128) 0
_________________________________________________________________
conv2d_8 (Conv2D) (None, 15, 15, 128) 147584
_________________________________________________________________
max_pooling2d_8 (MaxPooling2D) (None, 7, 7, 128) 0
_________________________________________________________________
flatten_2 (Flatten) (None, 6272) 0
_________________________________________________________________
dropout_1 (Dropout) (None, 6272) 0
_________________________________________________________________
dense_3 (Dense) (None, 512) 3211776
_________________________________________________________________
dense_4 (Dense) (None, 1) 513
=================================================================
Total params: 3,453,121
Trainable params: 3,453,121
Non-trainable params: 0

Next, we need an input image to feed to the network. We will use a cat picture from the test set, one the network was not trained on.

Code listing 5-25 preprocessing a single image

img_path = '/Users/fchollet/Downloads/cats_and_dogs_small/test/cats/cat.1700.jpg'
from keras.preprocessing import image
import numpy as np
img = image.load_img(img_path, target_size=(150, 150))
img_tensor = image.img_to_array(img)
img_tensor = np.expand_dims(img_tensor, axis=0)
img_tensor /= 255.
# Its shape is (1, 150, 150, 3)
print(img_tensor.shape)

Let's show this image (see Figure 5-24).

Listing 5-26 shows the test image

import matplotlib.pyplot as plt
plt.imshow(img_tensor[0])
plt.show()

Figure 5-24 cat images tested

In order to extract the feature maps we want to look at, we will create a Keras model that takes batches of images as input and outputs the activations of all convolution and pooling layers. To do this, we use the Keras Model class. A Model is instantiated with two arguments: an input tensor (or list of input tensors) and an output tensor (or list of output tensors). The resulting object is a Keras model that, just like the Sequential models you are familiar with, maps the specified inputs to the specified outputs. What sets the Model class apart is that it allows models with multiple outputs, unlike Sequential. For more information about the Model class, see section 7.1.

Code listing 5-27 instantiates the model with an input tensor and an output tensor list

from keras import models
layer_outputs = [layer.output for layer in model.layers[:8]]
activation_model = models.Model(inputs=model.input, outputs=layer_outputs) 

When fed an image input, this model returns the activation values of the first eight layers of the original model. This is the first multi-output model you have encountered in this book: until now, the models you have seen had exactly one input and one output. In general, a model can have any number of inputs and outputs. This one has one input and eight outputs: one output per layer activation.

Listing 5-28 runs the model in predictive mode

activations = activation_model.predict(img_tensor)

For example, this is the activation of the first convolution layer for the input cat image.

>>> first_layer_activation = activations[0]
>>> print(first_layer_activation.shape)
(1, 148, 148, 32)

It is a 148 × 148 feature map with 32 channels. Let's plot the fourth channel of the activation of the first layer of the original model (see Figure 5-25).

Listing 5-29 visualizes the fourth channel

import matplotlib.pyplot as plt
plt.matshow(first_layer_activation[0, :, :, 4], cmap='viridis')

Figure 5-25 fourth channel of the activation of the first layer on the test cat image

This channel appears to encode a diagonal edge detector. Let's look at the seventh channel (see Figure 5-26). Note that your own channels may differ, because the specific filters learned by convolution layers are not deterministic.

Listing 5-30 visualizes the seventh channel

plt.matshow(first_layer_activation[0, :, :, 7], cmap='viridis')

Figure 5-26 seventh channel of the activation of the first layer on the test cat image

This channel looks like a "bright green dot" detector, useful for finding cat eyes. Now let's plot a complete visualization of all the activations in the network (see Figure 5-27). We will extract and plot every channel in each of the eight feature maps, and stack the results in one big image tensor, with the channels side by side.

Listing 5-31 visualizes every channel in every intermediate activation

layer_names = []
for layer in model.layers[:8]:
    layer_names.append(layer.name)

images_per_row = 16

for layer_name, layer_activation in zip(layer_names, activations):
    n_features = layer_activation.shape[-1]
    size = layer_activation.shape[1]
    n_cols = n_features // images_per_row
    display_grid = np.zeros((size * n_cols, images_per_row * size))
    for col in range(n_cols):
        for row in range(images_per_row):
            channel_image = layer_activation[0,
                                             :, :,
                                             col * images_per_row + row]
            channel_image -= channel_image.mean()
            channel_image /= channel_image.std()
            channel_image *= 64
            channel_image += 128
            channel_image = np.clip(channel_image, 0, 255).astype('uint8')
            display_grid[col * size : (col + 1) * size,
                         row * size : (row + 1) * size] = channel_image
    scale = 1. / size
    plt.figure(figsize=(scale * display_grid.shape[1],
                        scale * display_grid.shape[0]))
    plt.title(layer_name)
    plt.grid(False)
    plt.imshow(display_grid, aspect='auto', cmap='viridis')

Figure 5-27 every channel of every layer activation on the test cat image

Here we need to pay attention to the following points.

  • The first layer acts as a collection of various edge detectors. At this stage, the activations retain almost all of the information present in the original image.
  • As you go deeper, the activations become increasingly abstract and less visually interpretable. They begin to encode higher-level concepts such as "cat ear" and "cat eye". The deeper the layer, the less information its representations carry about the visual contents of the image, and the more information they carry about the class of the image.
  • The sparsity of the activations increases with the depth of the layer. In the first layer, all filters are activated by the input image, but in the following layers more and more filters are blank, meaning the pattern encoded by the filter is not found in the input image (the quick check after this list makes this concrete).
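
As a rough numerical check of that last point, the following sketch reuses the activations and layer_names computed in the listings above and estimates, for each layer, the fraction of channels that stay essentially blank for this particular input (an illustrative sketch, not from the book's code):

# For each layer, estimate the fraction of channels that stay (almost) entirely at zero
# for this particular input image.
for layer_name, layer_activation in zip(layer_names, activations):
    channel_max = layer_activation[0].max(axis=(0, 1))    # peak activation per channel
    blank_fraction = float((channel_max < 1e-6).mean())   # channels that never fire
    print(layer_name, '- fraction of blank channels:', round(blank_fraction, 2))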

We have just revealed an important universal characteristic of the representations learned by deep neural networks: the features extracted by a layer become increasingly abstract with the depth of the layer. The activations of higher layers carry less and less information about the specific input being seen, and more and more information about the target (in this case, the class of the image: cat or dog). A deep neural network effectively acts as an information distillation pipeline: raw data goes in (here, RGB images) and is repeatedly transformed so that irrelevant information is filtered out (for example, the specific visual appearance of the image) while useful information is magnified and refined (for example, the class of the image).

This is analogous to the way humans and animals perceive the world: after observing a scene for a few seconds, a human can remember which abstract objects were in it (such as a bicycle or a tree) but cannot remember the specific appearance of these objects. In fact, if you try to draw an ordinary bicycle from memory, chances are you will not get it remotely right, even though you have seen thousands of bicycles in your life (see Figure 5-28). Try it right now: the effect is quite real. Your brain has learned to fully abstract its visual input, transforming it into high-level visual concepts and filtering out irrelevant visual details, which makes it very hard to remember how things around you actually look.

Figure 5-28 (left) an attempt to draw a bicycle from memory; (right) a schematic drawing of a bicycle

5.4.2 visualizing convolutional neural network filters

Another easy way to inspect the filters learned by a convolutional neural network is to display the visual pattern that each filter responds to. This can be done with gradient ascent in input space: starting from a blank input image, we apply gradient ascent to the values of the input image so as to maximize the response of a specific filter. The resulting input image is one that the chosen filter is maximally responsive to.

The process is simple: we build a loss function that maximizes the value of a given filter in a given convolution layer, and then we use stochastic gradient ascent to adjust the values of the input image so as to maximize this activation value. For instance, here is the loss for the activation of filter 0 in the block3_conv1 layer of the VGG16 network pretrained on ImageNet.

Listing 5-32 defines the loss tensor for filter visualization

from keras.applications import VGG16
from keras import backend as K
model = VGG16(weights='imagenet',
 include_top=False)
layer_name = 'block3_conv1'
filter_index = 0
layer_output = model.get_layer(layer_name).output
loss = K.mean(layer_output[:, :, :, filter_index])

To implement gradient ascent, we need the gradient of this loss with respect to the model's input. To do this, we use the gradients function packaged with the backend module of Keras.

Code listing 5-33 gets the gradient of the loss relative to the input

grads = K.gradients(loss, model.input)[0] 

A non-obvious trick that helps the gradient-ascent process go smoothly is to normalize the gradient tensor by dividing it by its L2 norm (the square root of the average of the squares of the values in the tensor). This ensures that the magnitude of the updates made to the input image is always within the same range.

Code listing 5-34 gradient-normalization trick

grads /= (K.sqrt(K.mean(K.square(grads))) + 1e-5) 

Now you need a way to compute the values of the loss tensor and the gradient tensor given an input image. You can define a Keras backend function to do this: iterate is a function that takes a Numpy tensor (as a list of tensors of size 1) and returns a list of two Numpy tensors: the loss value and the gradient value.

Code listing 5-35 gives the Numpy input value and obtains the Numpy output value

iterate = K.function([model.input], [loss, grads])
import numpy as np
loss_value, grads_value = iterate([np.zeros((1, 150, 150, 3))])

At this point you can define a Python loop to perform stochastic gradient ascent.

Code listing 5-36 maximizes the loss via stochastic gradient ascent

input_img_data = np.random.random((1, 150, 150, 3)) * 20 + 128. 
step = 1.
for i in range(40):
 loss_value, grads_value = iterate([input_img_data])
 input_img_data += grads_value * step 

The resulting image tensor is a floating-point tensor of shape (1, 150, 150, 3), with values that may not be integers within [0, 255]. Hence we need to post-process this tensor to turn it into a displayable image. The following simple utility function does that.

Listing 5-37 is a utility function for converting a tensor into a valid image

def deprocess_image(x):
 x -= x.mean()
 x /= (x.std() + 1e-5)
 x *= 0.1
 x += 0.5
 x = np.clip(x, 0, 1)
 x *= 255
 x = np.clip(x, 0, 255).astype('uint8')
 return x

Next, we put all the pieces above into a Python function that takes a layer name and a filter index as input, and returns a valid image tensor representing the pattern that maximizes the activation of the specified filter.

Listing 5-38 shows the function that generates the filter visualization

def generate_pattern(layer_name, filter_index, size=150):
    layer_output = model.get_layer(layer_name).output
    loss = K.mean(layer_output[:, :, :, filter_index])
    grads = K.gradients(loss, model.input)[0]
    grads /= (K.sqrt(K.mean(K.square(grads))) + 1e-5)
    iterate = K.function([model.input], [loss, grads])
    input_img_data = np.random.random((1, size, size, 3)) * 20 + 128.
    step = 1.
    for i in range(40):
        loss_value, grads_value = iterate([input_img_data])
        input_img_data += grads_value * step
    img = input_img_data[0]
    return deprocess_image(img)

Let's try this function (see Figure 5-29).

>>> plt.imshow(generate_pattern('block3_conv1', 0))

Figure 5-29 the pattern to which the 0th channel of layer block3_conv1 responds maximally

It seems that filter 0 in layer block3_conv1 responds to a polka-dot pattern. Now the fun part: we can visualize every filter in every layer. For simplicity, we will only look at the first 64 filters in each layer, and only at the first layer of each convolution block (block1_conv1, block2_conv1, block3_conv1, block4_conv1, block5_conv1). We will arrange the outputs on an 8 × 8 grid of 64 × 64 filter patterns, with some black margins between the filter patterns (see Figures 5-30 to 5-33).

Listing 5-39 generates a grid of all filter response patterns in a layer

layer_name = 'block1_conv1'
size = 64
margin = 5

results = np.zeros((8 * size + 7 * margin, 8 * size + 7 * margin, 3))
for i in range(8):
    for j in range(8):
        filter_img = generate_pattern(layer_name, i + (j * 8), size=size)
        horizontal_start = i * size + i * margin
        horizontal_end = horizontal_start + size
        vertical_start = j * size + j * margin
        vertical_end = vertical_start + size
        results[horizontal_start: horizontal_end,
                vertical_start: vertical_end, :] = filter_img

plt.figure(figsize=(20, 20))
plt.imshow(results)

Figure 5-30 filter patterns for layer block1_conv1

Figure 5-31 filter patterns for layer block2_conv1

Figure 5-32 filter patterns for layer block3_conv1

Figure 5-33 filter patterns for layer block4_conv1

These filter visualizations tell us a lot about how the layers of a convolutional neural network see the world: each layer learns a collection of filters such that its inputs can be expressed as a combination of those filters. This is similar to how the Fourier transform decomposes a signal onto a bank of cosine functions. The filters become increasingly complex and refined as you go deeper in the network.

  • The filters in the first layer of the model (block1_conv1) encode simple directional edges and colors (or colored edges, in some cases).
  • The filters in block2_conv1 encode simple textures made from combinations of edges and colors.
  • The filters in higher layers begin to resemble textures found in natural images: feathers, eyes, leaves, and so on.

5.4.3 visualizing heatmaps of class activation

We will introduce one more visualization method, one that helps you understand which parts of a given image led the convolutional neural network to its final classification decision. This is useful for debugging the decision process of a convolutional neural network, particularly in the case of a classification mistake. It also allows you to locate specific objects in an image.

This general category of techniques is called class activation map (CAM) visualization, and it consists of producing heatmaps of class activation over input images. A class activation heatmap is a 2D grid of scores associated with a specific output class, computed for every location in an input image, indicating how important each location is with respect to that class. For instance, given an image fed into our cats-vs-dogs convolutional neural network, CAM visualization can generate a heatmap for the class "cat", indicating how cat-like different parts of the image are, and likewise a heatmap for the class "dog", indicating how dog-like the different parts are.

The specific implementation we will use is the one described in the paper "Grad-CAM: Visual Explanations from Deep Networks via Gradient-based Localization". It is quite simple: given an input image, take the output feature map of a convolution layer and weight every channel in that feature map by the gradient of the class with respect to the channel. Intuitively, one way to understand this trick is to imagine weighting the spatial map of "how intensely the input image activates different channels" by "how important each channel is with regard to the class", resulting in a spatial map of "how intensely the input image activates the class".
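In the notation of the Grad-CAM paper, if $A^k$ denotes the k-th channel of the chosen feature map, $y^c$ the score for class $c$, and $Z$ the number of spatial positions, the class activation map is

$$\alpha_k^c = \frac{1}{Z}\sum_i\sum_j \frac{\partial y^c}{\partial A_{ij}^k}, \qquad L^c_{\text{Grad-CAM}} = \mathrm{ReLU}\Big(\sum_k \alpha_k^c A^k\Big)$$

Listing 5-42 below computes the $\alpha_k^c$ as pooled_grads and uses a channel-wise mean instead of the sum (a constant rescaling that disappears after normalization), while the ReLU corresponds to the np.maximum step in the post-processing listing.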

We use the pre trained VGG16 network again to demonstrate this method.

Listing 5-40 loads the VGG16 network with pretrained weights

from keras.applications.vgg16 import VGG16
model = VGG16(weights='imagenet') 

Figure 5-34 shows an image of two African elephants (released under a Creative Commons license), possibly a mother and her calf, strolling across the savanna. We need to convert this image into something the VGG16 model can read: the model was trained on images of size 224 × 224, preprocessed according to a few rules packaged in the utility function keras.applications.vgg16.preprocess_input. So we need to load the image, resize it to 224 × 224, convert it to a float32 Numpy tensor, and apply these preprocessing rules.

Figure 5-34 test image of African elephant

Code listing 5-41 preprocesses an input image for VGG16 model

from keras.preprocessing import image
from keras.applications.vgg16 import preprocess_input, decode_predictions
import numpy as np
img_path = '/Users/fchollet/Downloads/creative_commons_elephant.jpg'
img = image.load_img(img_path, target_size=(224, 224))
x = image.img_to_array(img)
x = np.expand_dims(x, axis=0)
x = preprocess_input(x)

Now you can run the pretrained VGG16 network on the image and decode its prediction vector into a human-readable format.

>>> preds = model.predict(x)
>>> print('Predicted:', decode_predictions(preds, top=3)[0])
Predicted: [(u'n02504458', u'African_elephant', 0.92546833),
(u'n01871265', u'tusker', 0.070257246),
(u'n02504013', u'Indian_elephant', 0.0042589349)]

The top three classes predicted for this image are:

  • African elephant (92.5% probability)
  • tusker (7% probability)
  • Indian elephant (0.4% probability)

The network has recognized the image as containing an undetermined quantity of African elephants. The entry in the prediction vector that was maximally activated is the one corresponding to the "African elephant" class, at index 386.

>>> np.argmax(preds[0])
386

To show which parts of the image are the most "African elephant"-like, let's apply the Grad-CAM algorithm.

Code listing 5-42 applies the Grad-CAM algorithm

african_elephant_output = model.output[:, 386]        # "African elephant" entry in the prediction vector
last_conv_layer = model.get_layer('block5_conv3')     # last convolutional layer of VGG16
grads = K.gradients(african_elephant_output, last_conv_layer.output)[0]
pooled_grads = K.mean(grads, axis=(0, 1, 2))          # mean gradient per channel (512 values)
iterate = K.function([model.input],
                     [pooled_grads, last_conv_layer.output[0]])
pooled_grads_value, conv_layer_output_value = iterate([x])
for i in range(512):
    conv_layer_output_value[:, :, i] *= pooled_grads_value[i]   # weight each channel by its importance
heatmap = np.mean(conv_layer_output_value, axis=-1)   # channel-wise mean gives the class activation heatmap

For visualization purposes, we also normalize the heatmap to the range 0 to 1. The result is shown in Figure 5-35.

Code listing 5-43 post-processes the heatmap

heatmap = np.maximum(heatmap, 0)
heatmap /= np.max(heatmap)
plt.matshow(heatmap)

Figure 5-35 "African elephant" class activation heatmap over the test image

Finally, we can use OpenCV to generate an image that superimposes the heatmap we just obtained on the original picture (see Figure 5-36).

Code listing 5-44 superimposes the heatmap on the original image

import cv2
img = cv2.imread(img_path)
heatmap = cv2.resize(heatmap, (img.shape[1], img.shape[0]))
heatmap = np.uint8(255 * heatmap)
heatmap = cv2.applyColorMap(heatmap, cv2.COLORMAP_JET)
superimposed_img = heatmap * 0.4 + img
cv2.imwrite('/Users/fchollet/Downloads/elephant_cam.jpg', superimposed_img)

Figure 5-36 class activation heatmap superimposed on the original image

This visualization technique answers two important questions:

  • Why did the network think this image contained an African elephant?

  • Where is the African elephant in the image?

In particular, it is interesting to note that the ears of the elephant calf are strongly activated: this is probably how the network tells the difference between African and Indian elephants.
