CNN weight parameter initialization

Original address
https://www.bilibili.com/video/BV1ba411m72B

1. Why weight initialization needs to be carefully designed

  • 1. Poor initialization easily causes vanishing gradients (gradients very close to 0) or exploding gradients (gradients that become extremely large), so most of the gradients obtained by back propagation either do nothing or overshoot (see the sketch just below).
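
A one-line sketch of why depth makes this severe (my own notation, using the row-vector convention $h_k = f(h_{k-1} W_k)$ that the code below also uses): the gradient reaching the first layer is a product with one factor per layer,

$$
\frac{\partial L}{\partial h_0}
=\frac{\partial L}{\partial h_K}\,\prod_{k=K}^{1}\operatorname{diag}\!\big(f'(z_k)\big)\,W_k^{\top},
\qquad z_k=h_{k-1}W_k,\quad h_k=f(z_k).
$$

If the typical magnitude of each factor is slightly below 1, the product shrinks exponentially with depth (vanishing gradients); slightly above 1 and it grows exponentially (exploding gradients). Weight initialization controls the size of these factors.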

2. Design idea

The data passed through each layer of the neural network should stay meaningful.
"Meaningful" means the signal keeps its original sense and is not distorted as it is passed along, unlike a rumor where the sentence changes a little with every retelling.
For example, the heights of the boys in a primary-school class form a distribution with some variance.
Gather the same boys again a few years later,
and their heights form a new distribution with a new variance.
Variance describes how tightly this batch of data is clustered.
If they are the same group of boys, the shape of their height distribution should not change.
So we use variance to measure whether the input and output of each layer of the network follow the same distribution.
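
A minimal numpy sketch of this idea (my own illustration, not from the video): for zero-mean inputs and zero-mean weights, one layer's pre-activation variance is roughly n_in * Var(W) * Var(x), so the choice of Var(W) decides whether the distribution is preserved, shrunk, or blown up.

Click to view the code
import numpy as np

rng = np.random.default_rng(0)
n_in, n_out = 200, 300
x = rng.normal(0, 1, size=(10000, n_in))        # zero-mean inputs with variance 1

for std in (0.01, (1 / n_in) ** 0.5, 1.0):      # too small / variance-preserving / too large
    W = rng.normal(0, std, size=(n_in, n_out))
    y = x @ W                                   # pre-activation of one layer
    # Var(y) is roughly n_in * Var(W) * Var(x)
    print(f"std={std:.4f}  Var(y)={y.var():.4f}  n_in*Var(W)*Var(x)={n_in * std**2:.4f}")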

3. Can we initialize with all zeros, with small random values, or with large values?

Initializing with all zeros
The output of every neuron in a layer is identical, so no matter how many rounds the network trains,
the weights of every neuron in each layer stay the same and cannot learn (extract) different features.

Click to view the code
import numpy as np
from matplotlib import pyplot as plt

def init_weights(u, a0, a1=None, a2=None, a3=None, a4=None):   # u: mean; a0..a4: per-layer standard deviations (a1..a4 default to a0)
    a_1 = a1 or a0
    a_2 = a2 or a0
    a_3 = a3 or a0
    a_4 = a4 or a0
    W0 = np.random.normal(u, a0, 400).reshape(2, 200)
    W1 = np.random.normal(u, a_1, 60000).reshape(200, 300)
    W2 = np.random.normal(u, a_2, 120000).reshape(300, 400)
    W3 = np.random.normal(u, a_3, 120000).reshape(400, 300)
    W4 = np.random.normal(u, a_4, 600).reshape(300, 2)
    return W0, W1, W2, W3, W4
def sigmoid(x):
    return 1 / (1 + np.exp(-x))                 # Define sigmoid function
def derivative_sigmoid(x):
    return x * (1 - x)                          # x is already sigmoid(z), so sigmoid'(z) = x * (1 - x)
def relu(x):
    return np.maximum(x, 0)
def leaky_relu(x, p=0.1):
    return np.maximum(x, p*x)
def tanh(x):
    return (np.exp(x)-np.exp(-x))/(np.exp(x)+np.exp(-x))
def derivative_tanh(x):
    return 1 - x**2                             # x is already tanh(z), so tanh'(z) = 1 - x**2, consistent with derivative_sigmoid above
def model(X, W0, W1, W2, W3, W4, act='tanh'):                       # Define the forward propagation process of the model
    if act == 'tanh':
        output_0 = tanh(X @ W0)                  # [n,  2] @ [ 2, 200] = [n, 200]
#         print(X.shape, W0.shape, output_0.shape)
        output_1 = tanh(output_0 @ W1)           # [n, 200] @ [200, 300] = [n, 300]
#         print(output_0.shape, W1.shape, output_1.shape)
        output_2 = tanh(output_1 @ W2)           # [n, 300] @ [300,  400] = [n,  400]
#         print(output_1.shape, W2.shape, output_2.shape)
        output_3 = tanh(output_2 @ W3)           # [n, 400] @ [400,  300] = [n,  300]
#         print(output_2.shape, W3.shape, output_3.shape)
        output_4 = tanh(output_3 @ W4)           # [n, 300] @ [300,  2] = [n,  2]
#         print(output_3.shape, W4.shape, output_4.shape)
    elif act == 'relu':
        output_0 = relu(X @ W0)                 
        output_1 = relu(output_0 @ W1)          
        output_2 = relu(output_1 @ W2)           
        output_3 = relu(output_2 @ W3)         
        output_4 = relu(output_3 @ W4)          
    elif act == 'leaky_relu':
        output_0 = leaky_relu(X @ W0)                 
        output_1 = leaky_relu(output_0 @ W1)           
        output_2 = leaky_relu(output_1 @ W2)        
        output_3 = leaky_relu(output_2 @ W3)          
        output_4 = leaky_relu(output_3 @ W4)         
    else:
        output_0 = sigmoid(X @ W0)              
        output_1 = sigmoid(output_0 @ W1)          
        output_2 = sigmoid(output_1 @ W2)          
        output_3 = sigmoid(output_2 @ W3)         
        output_4 = sigmoid(output_3 @ W4)         
    return [output_0, output_1, output_2, output_3, output_4]
def plot_hist(outputs, xlim=(-1, 1), ylim=(0, 1)):
    n = len(outputs)
    fig, axes=plt.subplots(1, n, figsize=(3*n, 3), sharex=True, sharey=True)
    for i in range(n):
        axes[i].hist(outputs[i].flatten(),bins=50,histtype="stepfilled",density=True,alpha=0.6)
    plt.xlim(*xlim)
    plt.ylim(*ylim)
    plt.show()

W0, W1, W2, W3, W4 = init_weights(0, 0)
X = np.random.normal(0, 1, 1000).reshape(-1, 2) # Initialize X for all of the following tests
outputs = model(X, W0, W1, W2, W3, W4, 'tanh')
plot_hist(outputs, xlim=(-1, 1), ylim=(0, 1))

![](https://img2022.cnblogs.com/blog/2682749/202205/2682749-20220516083854776-1693978353.png)

Initializing with small random weights
This at least ensures that the updates are not all identical.
However, the activations pile up near 0 after a few layers, so the information still cannot get through, and the local gradient there is also close to 0: the gradient is too small.

Click to view the code
W0, W1, W2, W3, W4 = init_weights(0, 0.01)

outputs = model(X, W0, W1, W2, W3, W4, 'tanh')
plot_hist(outputs, xlim=(-1, 1), ylim=(0, 1))

outputs = model(X, W0, W1, W2, W3, W4, 'sigmoid')
plot_hist(outputs, xlim=(0, 1), ylim=(0, 1))


Initializing with larger values
Part of the information gets through, but because tanh saturates, the values are too large and the gradient becomes too small.

Click to view the code
W0, W1, W2, W3, W4 = init_weights(0, 1)

outputs = model(X, W0, W1, W2, W3, W4, 'tanh')
plot_hist(outputs, xlim=(-1, 1), ylim=(0, 1))

outputs = model(X, W0, W1, W2, W3, W4, 'sigmoid')
plot_hist(outputs, xlim=(0, 1), ylim=(0, 1))


4. Xavier initialization


The formula from the original video is not reproduced here; as shown there it also has a typo: the summation sign should have brackets wrapping all three terms inside.
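
As far as it can be reconstructed from the surrounding discussion, the formula in question is the variance of one layer's pre-activation under the assumption that the weights and inputs are independent:

$$
\operatorname{Var}(y)=\operatorname{Var}\!\left(\sum_{i=1}^{n_{in}} w_i x_i\right)
=\sum_{i=1}^{n_{in}}\Big[E(w_i)^2\operatorname{Var}(x_i)+E(x_i)^2\operatorname{Var}(w_i)+\operatorname{Var}(w_i)\operatorname{Var}(x_i)\Big].
$$

With zero-mean weights and inputs this reduces to $\operatorname{Var}(y)=n_{in}\operatorname{Var}(w)\operatorname{Var}(x)$, so keeping the forward variance unchanged requires $\operatorname{Var}(w)=1/n_{in}$, i.e. a standard deviation of $\sqrt{1/n_{in}}$, which is exactly what the code below passes to np.random.normal.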

sigmoid/tanh activation function verification

Click to view the code
# The argument passed in here is the square root of the reciprocal of the input dimension, i.e. the standard deviation, because numpy parameterizes the normal distribution by standard deviation rather than variance
W0, W1, W2, W3, W4 = init_weights(0, (1/2)**0.5, (1/200)**0.5, (1/300)**0.5, (1/400)**0.5, (1/300)**0.5)

outputs = model(X, W0, W1, W2, W3, W4, 'tanh')
plot_hist(outputs, xlim=(-1, 1), ylim=(0, 1))

outputs = model(X, W0, W1, W2, W3, W4, 'sigmoid')
plot_hist(outputs, xlim=(0, 1), ylim=(0, 1))
# The outputs clearly remain roughly normally distributed, so the data propagates forward smoothly


It is not enough to consider forward propagation alone; we must also consider how the gradient propagates backward during the update.
Ideally, whatever signal enters at the output side should be transmitted from right to left without being distorted.
Note that for back propagation you read the histograms from right to left.

Click to view the code
W0, W1, W2, W3, W4 = init_weights(0, (1/2)**0.5, (1/200)**0.5, (1/300)**0.5, (1/400)**0.5, (1/300)**0.5)
lr = 0.01
epochs = 1
Y_onehot = []  # Randomly generate one-hot labels Y, since they are needed to compute the loss and back-propagate the gradient updates
for i in range(X.shape[0]):
    temp = np.random.randint(0, 2)
    Y_onehot.append([temp, abs(1-temp)])
Y_onehot = np.array(Y_onehot)  

for epoch in range(epochs):
    for j in range(X.shape[0]):
        [output_0, output_1, output_2, output_3, output_4] = model(X[j], W0, W1, W2, W3, W4, 'tanh')
        # Back propagation computes the gradients; each gradient has the same shape as its weight matrix, since that weight will be updated with it
        loss_4 = derivative_tanh(output_4) * (Y_onehot[j] - output_4)    
        grad_4 = output_3.reshape(-1,1) @ loss_4.reshape(1,-1)                  
        loss_3 = derivative_tanh(output_3) * (W4 @ loss_4)                       
        grad_3 = output_2.reshape(-1,1) @ loss_3.reshape(1,-1)  
        loss_2 = derivative_tanh(output_2) * (W3 @ loss_3)                       
        grad_2 = output_1.reshape(-1,1) @ loss_2.reshape(1,-1) 
        loss_1 = derivative_tanh(output_1) * (W2 @ loss_2)                       
        grad_1 = output_0.reshape(-1,1) @ loss_1.reshape(1,-1)                     
        loss_0 = derivative_tanh(output_0) * (W1 @ loss_1)                      
        grad_0 = X[j].reshape(-1,1) @ loss_0.reshape(1,-1)                   

        # Gradient update
        W4 += lr*grad_4
        W3 += lr*grad_3 
        W2 += lr*grad_2
        W1 += lr*grad_1 
        W0 += lr*grad_0
    
outputs = model(X, W0, W1, W2, W3, W4, 'tanh')
plot_hist(outputs, xlim=(-0.5, 0.5), ylim=(0, 4))

Clearly, reading from right to left, the back-propagated information is not transmitted well, so Xavier also reconsiders back propagation.

Conclusion: compromise between forward propagation and back propagation

First, verify the initialization whose weight variance is twice the reciprocal of the sum of the input and output dimensions, i.e. Var(W) = 2/(Nin+Nout).
Verify with tanh and sigmoid.
If the effect is ideal, whatever the input is, it should be passed on in both directions (see the equations just below).
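
Stated as equations (just restating the compromise described above): the forward pass wants $n_{in}\operatorname{Var}(W)=1$ and the backward pass wants $n_{out}\operatorname{Var}(W)=1$, so Xavier splits the difference:

$$
\operatorname{Var}(W)=\frac{2}{n_{in}+n_{out}},\qquad
\sigma=\sqrt{\frac{2}{n_{in}+n_{out}}}.
$$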

Click to view the code
# The argument passed in here is the standard deviation sqrt(2/(fan_in + fan_out)), because numpy parameterizes the normal distribution by standard deviation rather than variance
W0, W1, W2, W3, W4 = init_weights(0, (2/(2+200))**0.5, (2/(200+300))**0.5, (2/(300+400))**0.5, (2/(400+300))**0.5, (2/(300+2))**0.5)
outputs = model(X, W0, W1, W2, W3, W4, 'tanh')
plot_hist(outputs, xlim=(-0.75, 0.75), ylim=(0, 1))
# The outputs clearly remain roughly normally distributed, so the data propagates forward smoothly

lr = 0.01
epochs = 1
for epoch in range(epochs):
    for j in range(X.shape[0]):
        [output_0, output_1, output_2, output_3, output_4] = model(X[j], W0, W1, W2, W3, W4, 'tanh')
        # Back propagation computes the gradients; each gradient has the same shape as its weight matrix, since that weight will be updated with it
        loss_4 = derivative_tanh(output_4) * (Y_onehot[j] - output_4)    
        grad_4 = output_3.reshape(-1,1) @ loss_4.reshape(1,-1)              
        loss_3 = derivative_tanh(output_3) * (W4 @ loss_4)                       
        grad_3 = output_2.reshape(-1,1) @ loss_3.reshape(1,-1) 
        loss_2 = derivative_tanh(output_2) * (W3 @ loss_3)                       
        grad_2 = output_1.reshape(-1,1) @ loss_2.reshape(1,-1) 
        loss_1 = derivative_tanh(output_1) * (W2 @ loss_2)                       
        grad_1 = output_0.reshape(-1,1) @ loss_1.reshape(1,-1) 
        loss_0 = derivative_tanh(output_0) * (W1 @ loss_1)                      
        grad_0 = X[j].reshape(-1,1) @ loss_0.reshape(1,-1)                   

        # Gradient update
        W4 += lr*grad_4
        W3 += lr*grad_3 
        W2 += lr*grad_2
        W1 += lr*grad_1 
        W0 += lr*grad_0
    
outputs = model(X, W0, W1, W2, W3, W4, 'tanh')
plot_hist(outputs, xlim=(-0.75, 0.75), ylim=(0, 1))


First, run a forward pass with tanh, shown in the figure above.
With a learning rate of 0.01, the back propagation after training one epoch is shown in the figure below.
The first figure is forward propagation: reading from left to right, the information is basically transmitted.
The second figure is back propagation: reading from right to left, the information is basically transmitted.

Click to view the code
# The argument passed in here is the standard deviation sqrt(2/(fan_in + fan_out)), because numpy parameterizes the normal distribution by standard deviation rather than variance
W0, W1, W2, W3, W4 = init_weights(0, (2/(2+200))**0.5, (2/(200+300))**0.5, (2/(300+400))**0.5, (2/(400+300))**0.5, (2/(300+2))**0.5)

outputs = model(X, W0, W1, W2, W3, W4, 'sigmoid')
n = len(outputs)
fig, axes=plt.subplots(1, n, figsize=(3*n, 3), sharex=True, sharey=True)
for i in range(n):
    axes[i].hist(outputs[i].flatten(),bins=25,histtype="stepfilled",density=True,alpha=0.6)
plt.xlim(0, 1)
plt.ylim(0, 1)
plt.show()
# The outputs clearly remain roughly normally distributed, so the data propagates forward smoothly

lr = 0.01
epochs = 1
# Y_onehot = []
# for i in range(X.shape[0]):
#     temp = np.random.randint(0, 1)
#     Y_onehot.append([temp, abs(1-temp)])
# Y_onehot = np.array(Y_onehot)  
for epoch in range(epochs):
    for j in range(X.shape[0]):
        [output_0, output_1, output_2, output_3, output_4] = model(X[j], W0, W1, W2, W3, W4, 'sigmoid')
        # Back propagation computes the gradients; each gradient has the same shape as its weight matrix, since that weight will be updated with it
        # "Loss" of the last layer = sigmoid derivative evaluated at the last layer's output * (true value - last layer's output)
        loss_4 = derivative_sigmoid(output_4) * (Y_onehot[j] - output_4)    
        # Gradient of this layer = previous layer's output (as a column) @ this layer's "loss" (as a row)
        grad_4 = output_3.reshape(-1,1) @ loss_4.reshape(1,-1)                      

        # "Loss" of this layer other than the last layer = sigmoid derivative brought in by the output of this layer * (weight of next layer - "loss" of next layer)
        loss_3 = derivative_sigmoid(output_3) * (W4 @ loss_4)                       
        grad_3 = output_2.reshape(-1,1) @ loss_3.reshape(1,-1)  

        # "Loss" of this layer other than the last layer = sigmoid derivative brought in by the output of this layer * (weight of next layer - "loss" of next layer)
        loss_2 = derivative_sigmoid(output_2) * (W3 @ loss_3)                       
        grad_2 = output_1.reshape(-1,1) @ loss_2.reshape(1,-1) 

        # "Loss" of this layer other than the last layer = sigmoid derivative brought in by the output of this layer * (weight of next layer - "loss" of next layer)
        loss_1 = derivative_sigmoid(output_1) * (W2 @ loss_2)                       
        grad_1 = output_0.reshape(-1,1) @ loss_1.reshape(1,-1)                     

        loss_0 = derivative_sigmoid(output_0) * (W1 @ loss_1)                      
        grad_0 = X[j].reshape(-1,1) @ loss_0.reshape(1,-1)                   

        # Gradient update
        W4 += lr*grad_4
        W3 += lr*grad_3 
        W2 += lr*grad_2
        W1 += lr*grad_1 
        W0 += lr*grad_0
    
outputs = model(X, W0, W1, W2, W3, W4, 'sigmoid')
plot_hist(outputs, xlim=(0, 1), ylim=(0, 1))


First, run a forward pass with sigmoid, shown in the figure above.
With a learning rate of 0.01, the back propagation after training one epoch is shown in the figure below.
The first figure is forward propagation: reading from left to right, the information is basically transmitted.
The second figure is back propagation: reading from right to left, the information is basically transmitted.

The ReLU function does not work with the Xavier initialization scheme,
because zeroing out roughly half of the activations destroys the distribution of the data.
ReLU never outputs values below 0, so only the part greater than 0 is plotted.
By layer 4 there is basically no information left.

Click to view the code
W0, W1, W2, W3, W4 = init_weights(0, (2/(2+200))**0.5, (2/(200+300))**0.5, (2/(300+400))**0.5, (2/(400+300))**0.5, (2/(300+2))**0.5)

outputs = model(X, W0, W1, W2, W3, W4, 'relu')
plot_hist(outputs, xlim=(0, 0.5), ylim=(0, 2))
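
A quick numpy check of the "killing half" claim above (my own sketch, not from the video): for zero-mean Gaussian pre-activations, ReLU keeps only half of the signal's power, so each layer roughly halves the variance that Xavier tried to preserve.

Click to view the code
import numpy as np

z = np.random.normal(0, 1, 1000000)      # zero-mean pre-activations, Var(z) = 1
r = np.maximum(z, 0)                     # ReLU
# for symmetric zero-mean z, E[relu(z)^2] = Var(z) / 2: half the power is gone
print(np.mean(z**2), np.mean(r**2))      # ~1.0 vs ~0.5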

5. Kaiming (He) initialization

If the ReLU function is used, adopt Kaiming He's initialization method.
Simply doubling the variance used by Xavier compensates for ReLU and makes it work again.

Click to view the code
W0, W1, W2, W3, W4 = init_weights(0, (2*2/(2+200))**0.5, (2*2/(200+300))**0.5, (2*2/(300+400))**0.5, (2*2/(400+300))**0.5, (2*2/(300+2))**0.5)

outputs = model(X, W0, W1, W2, W3, W4, 'relu')
plot_hist(outputs, xlim=(0, 0.5), ylim=(0, 2))

But this does not yet cover the variants of ReLU, so Kaiming He generalized it further.
The denominator can be understood like this: if the real line is split into a positive half and a negative half, ReLU kills one of the two halves entirely, which corresponds to a = 0.
Leaky ReLU(a) only suppresses that half instead of killing it, with suppression coefficient a, which is where the (1 + a^2) factor comes from.

Click to view the code
a = 0.3
W0, W1, W2, W3, W4 = init_weights(0, (2*2/((1+a**2)*(2+200)))**0.5, (2*2/((1+a**2)*(200+300)))**0.5, 
                                  (2*2/((1+a**2)*(300+400)))**0.5, (2*2/((1+a**2)*(400+300)))**0.5, (2*2/((1+a**2)*(300+2)))**0.5)

outputs = model(X, W0, W1, W2, W3, W4, 'leaky_relu')
plot_hist(outputs, xlim=(-0.1, 0.5), ylim=(0, 2))
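
The (1 + a^2) factor can be sanity-checked the same way (again my own sketch, reusing the a = 0.3 from the code above): Leaky ReLU keeps the positive half of a zero-mean Gaussian and scales the negative half by a, so its second moment is roughly (1 + a^2)/2 times the input variance.

Click to view the code
import numpy as np

a = 0.3
z = np.random.normal(0, 1, 1000000)      # zero-mean pre-activations, Var(z) = 1
lz = np.maximum(z, a * z)                # leaky ReLU with negative slope a
# positive half contributes Var(z)/2, negative half contributes a^2 * Var(z)/2
print(np.mean(lz**2), (1 + a**2) / 2)    # both ~0.545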

Xavier is better suited to the sigmoid and tanh functions, while the Kaiming method is better suited to ReLU and Leaky ReLU.

Click to view the code
import torch
def cal(x):
    return (x.mean(), x.var())
uniform_w = torch.nn.init.uniform_(torch.empty(300, 500), a=0.0, b=1.0)
normal_w = torch.nn.init.normal_(torch.empty(300, 500), mean=0.0, std=1.0)
xavier_uniform_w = torch.nn.init.xavier_uniform_(torch.empty(300, 500), gain=1.0)
kaiming_uniform_w = torch.nn.init.kaiming_uniform_(torch.empty(300, 500), a=0,mode='fan_in',nonlinearity='relu')
xavier_normal_w = torch.nn.init.xavier_normal_(torch.empty(300, 500), gain=1.0)
kaiming_normal_w = torch.nn.init.kaiming_normal_(torch.empty(300, 500), a=0,mode='fan_in',nonlinearity='relu')
kaiming_uniform_w_l = torch.nn.init.kaiming_uniform_(torch.empty(300, 500), a=0.3,mode='fan_in',nonlinearity='leaky_relu')
kaiming_normal_w_l = torch.nn.init.kaiming_normal_(torch.empty(300, 500), a=0.3,mode='fan_in',nonlinearity='leaky_relu')
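
In practice these initializers are usually applied to a model's layers rather than to bare tensors. A minimal sketch (the nn.Sequential model here is just an assumption for illustration, not part of the original post):

Click to view the code
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(2, 200), nn.ReLU(),
                      nn.Linear(200, 300), nn.ReLU(),
                      nn.Linear(300, 2))

def init_linear(m):
    # Kaiming initialization for Linear layers followed by ReLU; zero the biases
    if isinstance(m, nn.Linear):
        nn.init.kaiming_normal_(m.weight, mode='fan_in', nonlinearity='relu')
        nn.init.zeros_(m.bias)

model.apply(init_linear)   # apply() walks every submodule and calls init_linear on it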


Click to view the code
print('uniform_w:', cal(uniform_w))
print('normal_w:', cal(normal_w))
print('xavier uniform distribution:', cal(xavier_uniform_w))
print('kaiming uniform distribution by torch:', cal(kaiming_uniform_w))
print('xavier Normal distribution:', cal(xavier_normal_w))
print('kaiming Normal distribution by torch:', cal(kaiming_normal_w))
print('kaiming uniform distribution leakyrelu:', cal(kaiming_uniform_w_l))
print('kaiming Normal distribution leakyrelu by torch:', cal(kaiming_normal_w_l))

uniform_w: (tensor(0.4999), tensor(0.0833))
normal_w: (tensor(-0.0016), tensor(0.9989))
xavier uniform distribution: (tensor(0.0001), tensor(0.0025))
kaiming uniform distribution by torch: (tensor(7.6385e-05), tensor(0.0040))
xavier normal distribution: (tensor(6.1033e-05), tensor(0.0025))
kaiming normal distribution by torch: (tensor(7.5544e-05), tensor(0.0040))
kaiming uniform distribution leakyrelu: (tensor(-1.6571e-05), tensor(0.0037))
kaiming normal distribution leakyrelu by torch: (tensor(-0.0001), tensor(0.0037))

Click to view the code
# The variance of the uniform distribution is the square of the interval length divided by 12
# Kaiming doubles Xavier's variance, but PyTorch makes a final approximation: if the input and output dimensions were equal, (fan_in + fan_out)/2 collapses to a single fan and the factor of 2 cancels; the mode argument chooses whether to preserve the forward (fan_in) or backward (fan_out) pass.
print('[0, 1]Mean and variance of uniform distribution', 1/2, (1-0)**2/12)
print('xavier Uniformly distributed variance', (2*((6/(800))**0.5))**2/12)
print('kaiming Uniformly distributed variance', (2*((2*6/(800))**0.5))**2/12)  # 2*6/(400+400) -> 6/400
print('kaiming Variance approximation of uniform distribution', (2*((6/(500))**0.5))**2/12)
print('xavier Variance of normal distribution', 2/(800))
print('kaiming Variance of normal distribution', 2*2/(800))     # 2*2/(400+400) -> 2/400
print('kaiming Variance approximation of normal distribution', 2/(500))
print('kaiming Uniformly distributed variance leakyrelu', (2*((2*6/((1+0.3**2)*800))**0.5))**2/12)
print('kaiming Variance approximation of uniform distribution leakyrelu', (2*((6/((1+0.3**2)*500))**0.5))**2/12)
print('kaiming Variance of normal distribution leakyrelu', 2*2/((1+0.3**2)*800))
print('kaiming Variance approximation of normal distribution leakyrelu', 2/((1+0.3**2)*500))

[0, 1] mean and variance of uniform distribution 0.5 0.08333
The variance of xavier uniform distribution is 0.0025
The variance of kaiming uniform distribution is 0.005
The variance of kaiming uniform distribution is approximately 0.004
The variance of xavier normal distribution is 0.0025
The variance of kaiming normal distribution is 0.005
The variance of kaiming normal distribution is approximately 0.004
kaiming uniformly distributed variance leakyrelu 0.0045871559633027525
The variance of kaiming uniform distribution is approximately leakyrelu 0.003669724770642106
Variance of kaiming normal distribution leakyrelu 0.004587155963302752
The variance of kaiming normal distribution is approximately leakyrelu 0.00366972477064202

Kaiming's fan_in / fan_out modes

kaiming_uniform_w = torch.nn.init.kaiming_uniform_(torch.empty(300, 500), a=0,mode='fan_in',nonlinearity='relu')
Note that in this line the denominator is not 400 (the mean of 300 and 500) but is controlled by the mode parameter. With mode='fan_in' the input dimension is preserved, so the denominator is 500. Compare with this line:
print('kaiming Variance approximation of uniform distribution', (2*((6/(500))**0.5))**2/12)
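
To see the effect of mode directly, compare the empirical variances under both settings (a quick check, not in the original post); for a (300, 500) tensor PyTorch treats 500 as fan_in and 300 as fan_out:

Click to view the code
import torch

w_in = torch.nn.init.kaiming_uniform_(torch.empty(300, 500), a=0, mode='fan_in', nonlinearity='relu')
w_out = torch.nn.init.kaiming_uniform_(torch.empty(300, 500), a=0, mode='fan_out', nonlinearity='relu')
print(w_in.var().item(), 2 / 500)    # ~0.0040, denominator is fan_in = 500
print(w_out.var().item(), 2 / 300)   # ~0.0067, denominator is fan_out = 300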
