Author: Liu Weiwei

Editor: Zhao Yifan

Batchnorm is an algorithm widely used in deep networks to speed up neural network training and to improve convergence speed and stability. It is fair to say that it is an essential part of today's deep networks. This article aims to explain the principle and code implementation of batchnorm, a common deep learning algorithm, in simple language. It consists of the following parts.

01

Main problems solved by Batchnorm

First, why does a deep network need batchnorm? In deep learning, especially in computer vision (CV), the data needs to be normalized, because a deep neural network mainly learns the distribution of the training data and is expected to generalize well on the test set. If each input batch has a different distribution, training the network obviously becomes harder. In addition, the distribution of the data changes after each layer of computation, which makes learning harder for the next layer. This phenomenon is called Internal Covariate Shift and is explained in detail below. Batch normalization is designed to address this problem of changing distributions.

Internal Covariate Shift

1.1

Internal Covariate Shift: this term was introduced by the Google team in the Batch Normalization paper. It describes a difficulty in training deep networks: after every parameter update, the output that a given layer computes from the previous layer's data changes in distribution, which makes learning harder for the next layer (a neural network is supposed to learn the distribution of its data, and that is difficult if the distribution keeps changing). This phenomenon is called Internal Covariate Shift.
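The effect can be seen in a toy setting. The sketch below is only an illustration under assumed values: a hypothetical single linear layer with ReLU, random weights, and a made-up "update" that doubles the weights and raises the bias. It is not taken from the paper. It shows that the same input batch produces differently distributed layer outputs once the parameters change, which is what the next layer then has to cope with.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical setup: a fixed input batch and one linear layer followed by ReLU.
x = rng.normal(size=(256, 8))            # 256 samples, 8 features
w = rng.normal(scale=0.5, size=(8, 4))   # weights before the update
b = np.zeros(4)                          # bias before the update

def layer_output(inputs, weights, bias):
    """Linear layer followed by ReLU."""
    return np.maximum(inputs @ weights + bias, 0.0)

before = layer_output(x, w, b)

# Made-up parameter update standing in for a training step.
after = layer_output(x, 2.0 * w, b + 1.0)

# The input batch is identical, yet the statistics seen by the next layer have moved.
print("mean before/after:", before.mean(), after.mean())
print("std  before/after:", before.std(), after.std())
```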

Before Batch Normalization, the usual remedies were a small learning rate, careful parameter initialization, and whitening of the data, but these clearly treat the symptoms rather than the root cause.

Covariate Shift

1.2

Internal Covariate Shift and Covariate Shift are similar, but they are not the same thing. The former occurs inside the neural network, hence "internal"; the latter occurs in the input data. Covariate Shift describes how a difference in distribution between training data and test data affects the generalization and training speed of the network. The usual remedy is normalization or whitening. For an intuitive feel, look at the following figure:

Take a simple linear classification example. Suppose our data is distributed as shown in Figure a, and the parameters are initialized, as is typical, with zero mean and small variance. The line fitted by y=wx+b starts as the orange line in Figure b and reaches the purple line only after many iterations, at which point it classifies well. If we first normalize the data to lie near 0, training obviously becomes faster. The transformation also enlarges the relative differences between data points, so they are easier to separate.

Covariate Shift thus describes inconsistency in the distribution of the input data, and normalizing the data reduces that inconsistency and speeds up training. Batchnorm does exactly this. As mentioned earlier, batchnorm is a means of normalization. Taken to the extreme, it reduces the absolute differences between images, highlights the relative differences, and speeds up training. For that reason it cannot be used in every area of deep learning; the cases where it does not apply are described below.
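A minimal sketch of this effect, assuming plain per-feature standardization over a batch (the helper `standardize` and the sample values are hypothetical): two batches that differ only by an absolute offset and scale become practically identical after normalization, so only the relative differences between samples survive.

```python
import numpy as np

def standardize(batch, eps=1e-5):
    """Normalize each feature (column) to zero mean and unit variance over the batch."""
    mean = batch.mean(axis=0)
    var = batch.var(axis=0)
    return (batch - mean) / np.sqrt(var + eps)

# Hypothetical batch: 4 samples, 2 features.
a = np.array([[1.0, 10.0],
              [2.0, 20.0],
              [3.0, 30.0],
              [4.0, 40.0]])

# Same relative structure, very different absolute values.
b = 3.0 * a + 100.0

print(standardize(a))
print(standardize(b))   # matches the line above up to the tiny effect of eps
```

This is also why the absolute level of the input is lost: after normalization there is no way to tell `a` from `b`.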

02

Interpretation of Batchnorm principle

This part follows the original paper, leaving out some of the more complex mathematical formulas, and explains the principle of BatchNorm in as much detail as possible.

As mentioned before, to reduce Internal Covariate Shift it might seem enough to normalize every layer of the neural network. Suppose the output of each layer is normalized to zero mean and unit variance, so that it roughly follows a standard normal distribution. There is a problem, however: if the data at every layer follows the same standard distribution, the network can hardly learn the characteristics of the input data, because the feature distributions it worked hard to learn are normalized away. Normalizing every layer directly is therefore clearly unreasonable.

However, with a slight modification, namely adding trainable parameters to the normalization, the idea becomes workable; this is what BatchNorm does. Next, we analyze it in detail alongside the pseudocode in the figure below:

It is called batchnorm because the normalization is performed over a batch of data. Assume the input is a batch $\mathcal{B} = \{x_1, \ldots, x_m\}$ of m values in total, and the output is $y_i = \mathrm{BN}_{\gamma,\beta}(x_i)$. The steps of batchnorm are as follows:

- First, calculate the mean of the batch data x.

- Then, calculate the variance of the batch.

- Next, normalize each $x_i$ using that mean and variance to get $\hat{x}_i$.

- Finally, and most importantly, introduce the scaling and translation variables γ and β, and compute the output $y_i$ from the normalized value.
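For reference, the four steps correspond to the following formulas from the original paper, where $\epsilon$ is the small constant that keeps the denominator away from zero and $m$ is the batch size:

```latex
\begin{align}
\mu_{\mathcal{B}} &= \frac{1}{m} \sum_{i=1}^{m} x_i \\
\sigma_{\mathcal{B}}^{2} &= \frac{1}{m} \sum_{i=1}^{m} \left(x_i - \mu_{\mathcal{B}}\right)^{2} \\
\hat{x}_i &= \frac{x_i - \mu_{\mathcal{B}}}{\sqrt{\sigma_{\mathcal{B}}^{2} + \epsilon}} \\
y_i &= \gamma \hat{x}_i + \beta \equiv \mathrm{BN}_{\gamma,\beta}(x_i)
\end{align}
```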

Next, let's look at these two additional parameters in detail. As noted above, if you normalize directly without any other processing, the neural network cannot learn anything; adding these two parameters changes that. Consider a special case first: if γ and β equal the standard deviation and the mean of the batch respectively, then $y_i$ is restored to the $x_i$ before normalization. Scaling and translating back to the original distribution in this way is equivalent to batchnorm doing nothing. γ and β are called the scaling parameter and the translation parameter respectively. They ensure that the data can keep its learned features each time it is normalized, while the normalization itself still takes place and accelerates training.
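The special case can be checked numerically. The following is a small sketch with made-up input values, assuming the scale is set to the batch standard deviation (including the same `eps` used in the normalization) and the shift to the batch mean:

```python
import numpy as np

eps = 1e-5
x = np.array([[1.0, 50.0],
              [2.0, 60.0],
              [4.0, 90.0]])   # hypothetical batch: 3 samples, 2 features

mean = x.mean(axis=0)
var = x.var(axis=0)
x_hat = (x - mean) / np.sqrt(var + eps)   # normalized activations

# Special case: scale by the batch standard deviation, shift by the batch mean.
gamma = np.sqrt(var + eps)
beta = mean
y = gamma * x_hat + beta

print(np.allclose(y, x))   # checks whether the original activations are recovered
```

In practice γ and β are learned, so the network can settle anywhere between this identity mapping and full normalization.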

Let's start with some simple code as a small example:

    import numpy as np

    def Batchnorm_simple_for_train(x, gamma, beta, bn_param):
        """
        param x        : input data, shape (B, L)
        param gamma    : scaling factor γ
        param beta     : translation factor β
        param bn_param : dict of values batchnorm needs
            eps          : a number close to 0 that keeps the denominator from being 0
            momentum     : momentum parameter, typically 0.9, 0.99 or 0.999
            running_mean : moving average of the mean, updated during training for use at test time
            running_var  : moving average of the variance, updated during training for use at test time
        """
        eps = bn_param['eps']
        momentum = bn_param['momentum']
        running_mean = bn_param['running_mean']  # shape = [L]
        running_var = bn_param['running_var']    # shape = [L]

        x_mean = x.mean(axis=0)  # mean of x over the batch
        x_var = x.var(axis=0)    # variance of x over the batch
        x_normalized = (x - x_mean) / np.sqrt(x_var + eps)  # normalization
        results = gamma * x_normalized + beta               # scale and translate

        # Update the moving averages
        running_mean = momentum * running_mean + (1 - momentum) * x_mean
        running_var = momentum * running_var + (1 - momentum) * x_var

        # Record the new values
        bn_param['running_mean'] = running_mean
        bn_param['running_var'] = running_var

        return results, bn_param

After reading this code, batchnorm should be clearer: first compute the mean and variance, then normalize, then scale and translate. That is all. However, this describes training, where each step receives a batch and computes that batch's mean and variance. Testing is different: often only one image is fed in at a time, so how can the batch mean and variance be computed? That is the purpose of the following two lines in the code. The running mean and variance are accumulated during training and used directly at test time, with no need to compute batch statistics.

    running_mean = momentum * running_mean + (1 - momentum) * x_mean
    running_var = momentum * running_var + (1 - momentum) * x_var
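To get a feel for how this moving average behaves, here is a small worked sketch. It assumes an initial running mean of 0 and pretends that every training batch has the same mean of 1.0; both values are made up for illustration.

```python
momentum = 0.9
running_mean = 0.0   # assumed initial value
batch_mean = 1.0     # pretend every training batch has this mean

history = []
for step in range(50):
    running_mean = momentum * running_mean + (1 - momentum) * batch_mean
    history.append(running_mean)

# After n updates the estimate equals 1 - momentum**n, so it approaches the batch mean.
print(history[0], history[9], history[49])
```

A larger momentum makes the estimate smoother but slower to catch up, which is why values such as 0.99 or 0.999 need more training steps before the running statistics become reliable.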

At test time, the function therefore looks like this:

    import numpy as np

    def Batchnorm_simple_for_test(x, gamma, beta, bn_param):
        """
        param x        : input data, shape (B, L)
        param gamma    : scaling factor γ
        param beta     : translation factor β
        param bn_param : dict of values batchnorm needs
            eps          : a number close to 0 that keeps the denominator from being 0
            running_mean : moving average of the mean, computed during training
            running_var  : moving average of the variance, computed during training
        """
        eps = bn_param['eps']
        running_mean = bn_param['running_mean']  # shape = [L]
        running_var = bn_param['running_var']    # shape = [L]

        # Normalize with the stored statistics instead of batch statistics
        x_normalized = (x - running_mean) / np.sqrt(running_var + eps)
        results = gamma * x_normalized + beta    # scale and translate

        return results, bn_param
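The interplay between the two modes can be sketched end to end. The example below is self-contained and uses its own compact helpers (`bn_train`, `bn_test`, and the `state` dict are hypothetical names, and the data distribution is made up); it is meant only to show that, once running statistics have been accumulated, a single test sample can be normalized without any batch.

```python
import numpy as np

def bn_train(x, gamma, beta, state, momentum=0.9, eps=1e-5):
    """Training-mode forward pass: use batch statistics and update the running ones."""
    mean, var = x.mean(axis=0), x.var(axis=0)
    state["running_mean"] = momentum * state["running_mean"] + (1 - momentum) * mean
    state["running_var"] = momentum * state["running_var"] + (1 - momentum) * var
    return gamma * (x - mean) / np.sqrt(var + eps) + beta

def bn_test(x, gamma, beta, state, eps=1e-5):
    """Test-mode forward pass: use the stored running statistics only."""
    x_hat = (x - state["running_mean"]) / np.sqrt(state["running_var"] + eps)
    return gamma * x_hat + beta

rng = np.random.default_rng(0)
num_features = 3
gamma, beta = np.ones(num_features), np.zeros(num_features)
state = {"running_mean": np.zeros(num_features), "running_var": np.ones(num_features)}

# "Training": 200 batches drawn from a distribution with mean 5 and standard deviation 2.
for _ in range(200):
    batch = rng.normal(loc=5.0, scale=2.0, size=(64, num_features))
    bn_train(batch, gamma, beta, state)

# "Testing": a single sample works, because no batch statistics are needed.
single = np.array([[5.0, 7.0, 3.0]])
print(state["running_mean"], state["running_var"])
print(bn_test(single, gamma, beta, state))
```

The running mean and variance should end up close to 5 and 4, so the single sample maps to values near 0, 1 and -1.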

Is that clear? If not, feel free to read through it again.

03

Interpretation of Batchnorm source code

This section walks through code that can be used for batchnorm in TensorFlow, as follows:

The code comes from Zhihu. Comments are added here to help you read.

    import numpy as np
    import tensorflow as tf  # written against the TensorFlow 1.x API

    def batch_norm_layer(x, train_phase, scope_bn):
        with tf.variable_scope(scope_bn):
            # Create two new variables: the translation (beta) and scaling (gamma) factors
            beta = tf.Variable(tf.constant(0.0, shape=[x.shape[-1]]), name='beta', trainable=True)
            gamma = tf.Variable(tf.constant(1.0, shape=[x.shape[-1]]), name='gamma', trainable=True)

            # Calculate the mean and variance of this batch over all axes except the last
            axises = np.arange(len(x.shape) - 1)
            batch_mean, batch_var = tf.nn.moments(x, axises, name='moments')

            # Moving average with its decay rate
            ema = tf.train.ExponentialMovingAverage(decay=0.5)

            def mean_var_with_update():
                ema_apply_op = ema.apply([batch_mean, batch_var])
                with tf.control_dependencies([ema_apply_op]):
                    return tf.identity(batch_mean), tf.identity(batch_var)

            # train_phase is the training/test flag.
            # During training, update running_mean and running_var through mean_var_with_update().
            # During testing, reuse the values computed earlier via ema.average(batch_mean).
            mean, var = tf.cond(train_phase, mean_var_with_update,
                                lambda: (ema.average(batch_mean), ema.average(batch_var)))
            normed = tf.nn.batch_normalization(x, mean, var, beta, gamma, 1e-3)
        return normed
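The choice of reduction axes in this layer can be illustrated without TensorFlow. The NumPy sketch below assumes a hypothetical image batch in NHWC layout (batch, height, width, channels); reducing over every axis except the last yields one mean and one variance per channel, which matches the shape of beta and gamma.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(8, 5, 5, 3))   # hypothetical NHWC batch: 8 images, 5x5 pixels, 3 channels

# All axes except the last one, mirroring `len(x.shape) - 1` in the layer above.
axes = tuple(range(x.ndim - 1))     # (0, 1, 2)

batch_mean = x.mean(axis=axes)      # shape (3,): one value per channel
batch_var = x.var(axis=axes)        # shape (3,)

# Broadcasting applies the per-channel statistics to every pixel of every image.
normed = (x - batch_mean) / np.sqrt(batch_var + 1e-3)

print(axes, batch_mean.shape, batch_var.shape)
print(normed.mean(axis=axes), normed.var(axis=axes))
```

For a fully connected layer with input of shape (B, L), the same rule reduces over axis 0 only, as in the simple NumPy version earlier.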

The call tf.nn.batch_normalization() simply carries out the batchnorm calculation itself. Its implementation is as follows:

    def batch_normalization(x, mean, variance, offset, scale, variance_epsilon, name=None):
        with ops.name_scope(name, "batchnorm", [x, mean, variance, scale, offset]):
            inv = math_ops.rsqrt(variance + variance_epsilon)
            if scale is not None:
                inv *= scale
            return x * inv + (offset - mean * inv
                              if offset is not None else -mean * inv)
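The return expression folds the scale into `inv` and rearranges the terms, so it may not look like the usual formula at first glance. The NumPy sketch below is a re-statement written for illustration, not TensorFlow code (the function name `batch_normalization_np` and the sample values are hypothetical); it checks that the folded form agrees with γ · (x − mean) / sqrt(variance + eps) + β.

```python
import numpy as np

def batch_normalization_np(x, mean, variance, offset, scale, variance_epsilon):
    """NumPy re-statement of the folded form: x * inv + (offset - mean * inv)."""
    inv = 1.0 / np.sqrt(variance + variance_epsilon)   # reciprocal square root
    if scale is not None:
        inv = inv * scale
    shift = -mean * inv if offset is None else offset - mean * inv
    return x * inv + shift

rng = np.random.default_rng(0)
x = rng.normal(loc=2.0, scale=3.0, size=(16, 4))
mean, variance = x.mean(axis=0), x.var(axis=0)
gamma = np.array([1.0, 0.5, 2.0, 1.5])   # hypothetical scale values
beta = np.array([0.0, 1.0, -1.0, 0.5])   # hypothetical offset values
eps = 1e-3

folded = batch_normalization_np(x, mean, variance, beta, gamma, eps)
textbook = gamma * (x - mean) / np.sqrt(variance + eps) + beta

print(np.allclose(folded, textbook))
```

Folding the scale into `inv` means each element of `x` needs only one multiplication and one addition.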

04

Advantages and disadvantages of Batchnorm

Having finished the main part, let's make a summary of BatchNorm:

- Before BN, the learning rate and weight initialization had to be tuned carefully. With BN you can safely use a large learning rate without such careful tuning, and the larger learning rate greatly speeds up training.
- Batchnorm itself also acts as a regularization method and can reduce or remove the need for other regularization methods such as dropout.
- In addition, I personally believe that batchnorm reduces the absolute differences between data, has a decorrelating effect, and puts more weight on relative differences, so it works better on classification tasks.

Note: as is well known, the South Korean team took first place in the NTIRE 2017 image super-resolution challenge, largely by removing the batchnorm layers from their network. This shows that BN is not suitable for every task. In image-to-image tasks, and in super-resolution especially, the absolute differences within an image are particularly important, so the rescaling performed by batchnorm is not suitable.