1. Introduction

GoogLeNet won the ILSVRC 2014 classification challenge. Instead of simply making the network wider and deeper, as other architectures did, the GoogLeNet team proposed the Inception structure: a "basic neuron" building block from which a sparse, high-performance network is constructed.

Advantages: it alleviates the overfitting, heavy computation, and training difficulty that come with very deep neural networks.

2. Inception network structure

The figure above shows the original Inception structure. In short, whereas a conventional network connects its convolution layers in series, the Inception structure connects different convolution layers in parallel: as the figure shows, filters with side lengths of 1, 3, and 5 are applied to the same input, and the resulting feature maps are concatenated along the depth dimension.

However, in this structure every convolution kernel operates on the full output of the previous layer, so the 3x3 and 5x5 convolutions produce feature maps with too many parameters and require too much computation. To avoid this, 1x1 convolutions are added before the 3x3 and 5x5 convolutions and after the max pooling. Their main purposes are to reduce the dimension and to apply a rectified linear activation (ReLU). This forms the network structure of Inception v1.

(Example of dimension reduction:

The input feature map is 28x28x192. Let the 1x1 convolution have 64 channels, the 3x3 convolution 128 channels, and the 5x5 convolution 32 channels. In the original structure, the number of convolution kernel parameters is: 1x1x192x64 + 3x3x192x128 + 5x5x192x32 = 387072

Adding 1x1 convolution layers with 96 and 16 channels before the 3x3 and 5x5 convolution layers respectively gives: 1x1x192x64 + (1x1x192x96 + 3x3x96x128) + (1x1x192x16 + 5x5x16x32) = 157184. The total number of parameters is reduced to about 41% of the original, that is, by more than half.)
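The two totals can be recomputed directly from the channel settings in the example. This is only a small arithmetic sketch: biases are ignored, and the helper name `conv_params` is made up for illustration.

```python
# Weight counts (biases ignored) for the dimension-reduction example.
# The input depth and branch widths are the values quoted in the text.

def conv_params(kernel, in_ch, out_ch):
    """Number of weights in a kernel x kernel convolution, without biases."""
    return kernel * kernel * in_ch * out_ch

IN_CH = 192

# Original structure: every branch convolves the full 192-channel input.
naive = (conv_params(1, IN_CH, 64)
         + conv_params(3, IN_CH, 128)
         + conv_params(5, IN_CH, 32))

# With 1x1 reductions to 96 and 16 channels before the 3x3 and 5x5 branches.
reduced = (conv_params(1, IN_CH, 64)
           + conv_params(1, IN_CH, 96) + conv_params(3, 96, 128)
           + conv_params(1, IN_CH, 16) + conv_params(5, 16, 32))

print(naive)                      # 387072
print(reduced)                    # 157184
print(round(reduced / naive, 2))  # 0.41
```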

A useful aspect of this architecture is that it allows the number of units at each stage to be increased significantly without an uncontrolled blow-up in computational complexity at later stages. This is achieved through the ubiquitous use of dimensionality reduction before the expensive convolutions with larger kernels. In addition, the design follows the practical intuition that visual information should be processed at several scales and then aggregated, so that the next stage can abstract features from the different scales simultaneously.
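To get a feel for how much computation the reduction saves, the multiply-accumulate count of the 5x5 branch can be estimated for the 28x28x192 input used earlier. This is a rough sketch that assumes stride 1 and padding that keeps the output at 28x28; `conv_macs` is a hypothetical helper name.

```python
# Multiply-accumulate operations of the 5x5 branch on a 28x28x192 input,
# assuming stride 1 and padding that keeps the output at 28x28.

def conv_macs(height, width, kernel, in_ch, out_ch):
    """MACs of a kernel x kernel convolution producing a height x width map."""
    return height * width * kernel * kernel * in_ch * out_ch

H = W = 28

# 5x5 convolution applied directly to all 192 input channels.
direct = conv_macs(H, W, 5, 192, 32)

# 1x1 reduction to 16 channels first, then the 5x5 convolution.
with_reduction = conv_macs(H, W, 1, 192, 16) + conv_macs(H, W, 5, 16, 32)

print(direct)          # 120422400
print(with_reduction)  # 12443648
```

Under these assumptions the reduced branch needs roughly a tenth of the operations of the direct 5x5 convolution.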

3. GoogLeNet

("#3 × 3 reduce" and "#5 × 5 reduce" denote the number of 1 × 1 filters used for dimension reduction before the 3 × 3 and 5 × 5 convolutions; all convolutions use the rectified linear activation function.)

Input: 224 × 224 × 3 RGB image (zero-mean preprocessing)

First layer (convolution):

7 × 7 convolution kernel (stride 2, 64 channels), output 112 × 112 × 64;

3 × 3 max pooling layer (stride 2), output 56 × 56 × 64

Second layer (convolution):

3 × 3 convolution kernel (stride 1, 192 channels), output 56 × 56 × 192;

3 × 3 max pooling layer (stride 2), output 28 × 28 × 192
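The output sizes of the first two layers can be traced with a few lines of arithmetic. The sketch below assumes "same"-style padding, where the spatial output size is ceil(input / stride); the helper name `same_out` and the `stem` list are illustrative only.

```python
import math

def same_out(size, stride):
    """Spatial output size under 'same'-style padding: ceil(size / stride)."""
    return math.ceil(size / stride)

# (layer name, stride, output channels; None keeps the channel count)
stem = [
    ("conv 7x7", 2, 64),
    ("max pool 3x3", 2, None),
    ("conv 3x3", 1, 192),
    ("max pool 3x3", 2, None),
]

size, channels = 224, 3  # 224 x 224 RGB input
shapes = []
for name, stride, out_ch in stem:
    size = same_out(size, stride)
    if out_ch is not None:
        channels = out_ch
    shapes.append((size, size, channels))
    print(name, shapes[-1])
# conv 7x7 (112, 112, 64)
# max pool 3x3 (56, 56, 64)
# conv 3x3 (56, 56, 192)
# max pool 3x3 (28, 28, 192)
```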

Third layer (Inception 3a):

It is divided into 4 branches:

(1) 1 × 1 convolution kernel (64 channels), output 28 × 28 × 64;

(2) 1 × 1 convolution kernel (96 channels) for dimension reduction, then 3 × 3 convolution kernel (128 channels), output 28 × 28 × 128;

(3) 1 × 1 convolution kernel (16 channels) for dimension reduction, then 5 × 5 convolution kernel (32 channels), output 28 × 28 × 32;

(4) 3 × 3 max pooling, then 1 × 1 convolution kernel (32 channels), output 28 × 28 × 32.

The remaining Inception modules are built in the same way.
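Because the four branches of Inception 3a keep the 28 × 28 spatial size, concatenating them only adds up their channel counts. A minimal sketch of that bookkeeping (the dictionary keys are illustrative labels, not names from any library):

```python
# Output channels of the four Inception 3a branches listed above.
IN_SHAPE = (28, 28, 192)
branch_out = {"1x1": 64, "3x3": 128, "5x5": 32, "pool_proj": 32}

# Depth-wise concatenation: spatial size unchanged, channels summed.
out_channels = sum(branch_out.values())
out_shape = IN_SHAPE[:2] + (out_channels,)

print(out_shape)  # (28, 28, 256)
```

The resulting 28 × 28 × 256 tensor is what the next module receives as input.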

The final network model is as follows:

(During training, the loss of each auxiliary classifier is multiplied by a weight, 0.3 in the GoogLeNet paper, and added to the total loss; at inference time the auxiliary classifiers are discarded.)
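A minimal sketch of this weighting, assuming that the weighted quantity is the training loss and that the weight is 0.3, as reported in the GoogLeNet paper. The function name and the loss values are made up for illustration.

```python
def total_loss(main_loss, aux_losses, aux_weight=0.3):
    """Training loss: main loss plus the down-weighted auxiliary losses."""
    return main_loss + aux_weight * sum(aux_losses)

# Hypothetical loss values for the main and the two auxiliary classifiers.
print(total_loss(1.0, [2.0, 3.0]))  # 2.5
```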

4. Code reference

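The following implements an Inception module with TensorFlow (TF-Slim, TensorFlow 1.x). The module shown, Mixed_7c, has the branch layout of Inception-v3 rather than the Inception v1 module described above: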

    import tensorflow as tf  # requires TensorFlow 1.x; tf.contrib was removed in 2.x

    slim = tf.contrib.slim

    # Basic slim.conv2d usage: 32 filters of size 3x3.
    # `inputs` is a stand-in input tensor; its shape is assumed for illustration.
    inputs = tf.placeholder(tf.float32, [None, 299, 299, 3])
    net = slim.conv2d(inputs, 32, [3, 3])

    # slim.arg_scope sets default argument values for the listed functions.
    # padding='SAME' with stride 1 keeps every branch at the same height and
    # width, which tf.concat along the depth axis requires.
    with slim.arg_scope([slim.conv2d, slim.max_pool2d, slim.avg_pool2d],
                        stride=1, padding='SAME'):
        # From here on, `net` stands for the output tensor of the previous layer.
        with tf.variable_scope('Mixed_7c'):
            # First path
            with tf.variable_scope('Branch_0'):
                # Convolution layer with filter side length 1 and depth 320
                branch_0 = slim.conv2d(net, 320, [1, 1], scope='Conv2d_0a_1x1')

            # Second path; the structure on this path is itself an Inception structure.
            with tf.variable_scope('Branch_1'):
                branch_1 = slim.conv2d(net, 384, [1, 1], scope='Conv2d_0a_1x1')
                # tf.concat(values, axis) joins multiple tensors; axis=3
                # concatenates along the depth dimension.
                branch_1 = tf.concat([
                    slim.conv2d(branch_1, 384, [1, 3], scope='Conv2d_0b_1x3'),
                    slim.conv2d(branch_1, 384, [3, 1], scope='Conv2d_0c_3x1')], axis=3)

            # Third path
            with tf.variable_scope('Branch_2'):
                branch_2 = slim.conv2d(net, 448, [1, 1], scope='Conv2d_0a_1x1')
                branch_2 = slim.conv2d(branch_2, 384, [3, 3], scope='Conv2d_0b_3x3')
                branch_2 = tf.concat([
                    slim.conv2d(branch_2, 384, [1, 3], scope='Conv2d_0c_1x3'),
                    slim.conv2d(branch_2, 384, [3, 1], scope='Conv2d_0d_3x1')], axis=3)

            # Fourth path
            with tf.variable_scope('Branch_3'):
                branch_3 = slim.avg_pool2d(net, [3, 3], scope='AvgPool_0a_3x3')
                branch_3 = slim.conv2d(branch_3, 192, [1, 1], scope='Conv2d_0b_1x1')

            # The module output is the depth-wise concatenation of the four paths.
            net = tf.concat([branch_0, branch_1, branch_2, branch_3], axis=3)
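Concatenating along the depth dimension only works if every branch produces the same height and width. The sketch below checks this with plain arithmetic, using TensorFlow's output-size rules for 'SAME' and 'VALID' padding. The 8 × 8 input size is an assumption for illustration, and `conv_out` and `out_hw` are hypothetical helper names.

```python
import math

def conv_out(size, kernel, stride=1, padding="SAME"):
    """Output length along one axis under TensorFlow's padding rules."""
    if padding == "SAME":
        return math.ceil(size / stride)
    return math.ceil((size - kernel + 1) / stride)  # 'VALID'

def out_hw(hw, kernel_hw, padding):
    """Output (height, width) of a stride-1 convolution."""
    return tuple(conv_out(s, k, 1, padding) for s, k in zip(hw, kernel_hw))

hw = (8, 8)  # assumed spatial size of the module input, for illustration only
results = {}
for padding in ("SAME", "VALID"):
    a = out_hw(hw, (1, 3), padding)  # 1x3 convolution
    b = out_hw(hw, (3, 1), padding)  # 3x1 convolution
    results[padding] = (a, b)
    print(padding, a, b, "concat ok" if a == b else "shape mismatch")
# SAME (8, 8) (8, 8) concat ok
# VALID (8, 6) (6, 8) shape mismatch

# Depth after concatenating the four paths of the module.
depth = 320 + (384 + 384) + (384 + 384) + 192
print(depth)  # 2048
```

With stride 1, 'SAME' padding keeps the 1 × 3 and 3 × 1 outputs the same size, whereas 'VALID' padding shrinks them along different axes, so they could not be concatenated.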