Original address
https://www.bilibili.com/video/BV1ba411m72B
1. Why weight initialization needs to be carefully designed
- Poor initialization easily leads to vanishing gradients (gradients extremely close to 0) or exploding gradients (gradients that are extremely large), so that most of the gradients obtained by backpropagation either have no effect or overshoot.
2. Design idea
The data passed through each layer of the neural network should remain meaningful.
"Meaningful" means that what was originally expressed is not distorted along the way, unlike a rumor where the sentence changes a little with every retelling.
For example, the heights of the boys in a primary-school class form a distribution with some variance.
Years later, the same group gets together again,
and their heights form a new distribution with a new variance.
Variance describes how tightly this batch of data is spread.
If they are the same group of boys, the shape of their height distribution should not change.
So we use variance to measure whether the input and output of each layer of the neural network follow the same distribution.
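As a rough standalone illustration of this idea (a minimal sketch; the 256-wide layer and the two weight scales are arbitrary choices, not from the video), we can compare the variance of a layer's input and output for two different weight scales:

import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(0, 1, (1000, 256))          # input batch with variance ~1

for std in (0.01, (1/256)**0.5):           # a "too small" scale vs. a variance-preserving scale
    W = rng.normal(0, std, (256, 256))
    y = x @ W                              # one linear layer, no activation
    print(f"weight std={std:.4f}  input var={x.var():.3f}  output var={y.var():.3f}")

With the second scale the output variance stays close to the input variance; with the first it collapses by a factor of roughly 256 * 0.01**2.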
3. Can the weights be initialized to all zeros, to small values, or to large values?
All-zero initialization
The output of every neuron in a layer is identical, so no matter how many rounds the network is trained,
the weights of every neuron in a layer stay identical and cannot learn (extract) different features.
import numpy as np
from matplotlib import pyplot as plt

def init_weights(u, a0, a1=None, a2=None, a3=None, a4=None):
    # u is the mean, a0..a4 are the standard deviations of each layer's weights
    a_1 = a1 or a0
    a_2 = a2 or a0
    a_3 = a3 or a0
    a_4 = a4 or a0
    W0 = np.random.normal(u, a0, 400).reshape(2, 200)
    W1 = np.random.normal(u, a_1, 60000).reshape(200, 300)
    W2 = np.random.normal(u, a_2, 120000).reshape(300, 400)
    W3 = np.random.normal(u, a_3, 120000).reshape(400, 300)
    W4 = np.random.normal(u, a_4, 600).reshape(300, 2)
    return W0, W1, W2, W3, W4

def sigmoid(x):  # sigmoid activation
    return 1 / (1 + np.exp(-x))

def derivative_sigmoid(x):  # x is the sigmoid output
    return x * (1 - x)

def relu(x):
    return np.maximum(x, 0)

def leaky_relu(x, p=0.1):
    return np.maximum(x, p * x)

def tanh(x):
    return (np.exp(x) - np.exp(-x)) / (np.exp(x) + np.exp(-x))

def derivative_tanh(x):  # x is the tanh output, so the derivative is 1 - x**2
    return 1 - x**2

def model(X, W0, W1, W2, W3, W4, act='tanh'):
    # Define the forward propagation of the model
    if act == 'tanh':
        output_0 = tanh(X @ W0)          # [n, 2]   @ [2, 200]   = [n, 200]
        output_1 = tanh(output_0 @ W1)   # [n, 200] @ [200, 300] = [n, 300]
        output_2 = tanh(output_1 @ W2)   # [n, 300] @ [300, 400] = [n, 400]
        output_3 = tanh(output_2 @ W3)   # [n, 400] @ [400, 300] = [n, 300]
        output_4 = tanh(output_3 @ W4)   # [n, 300] @ [300, 2]   = [n, 2]
    elif act == 'relu':
        output_0 = relu(X @ W0)
        output_1 = relu(output_0 @ W1)
        output_2 = relu(output_1 @ W2)
        output_3 = relu(output_2 @ W3)
        output_4 = relu(output_3 @ W4)
    elif act == 'leaky_relu':
        output_0 = leaky_relu(X @ W0)
        output_1 = leaky_relu(output_0 @ W1)
        output_2 = leaky_relu(output_1 @ W2)
        output_3 = leaky_relu(output_2 @ W3)
        output_4 = leaky_relu(output_3 @ W4)
    else:
        output_0 = sigmoid(X @ W0)
        output_1 = sigmoid(output_0 @ W1)
        output_2 = sigmoid(output_1 @ W2)
        output_3 = sigmoid(output_2 @ W3)
        output_4 = sigmoid(output_3 @ W4)
    return [output_0, output_1, output_2, output_3, output_4]

def plot_hist(outputs, xlim=(-1, 1), ylim=(0, 1)):
    # One histogram per layer output
    n = len(outputs)
    fig, axes = plt.subplots(1, n, figsize=(3 * n, 3), sharex=True, sharey=True)
    for i in range(n):
        axes[i].hist(outputs[i].flatten(), bins=50, histtype="stepfilled", density=True, alpha=0.6)
    plt.xlim(*xlim)
    plt.ylim(*ylim)
    plt.show()

W0, W1, W2, W3, W4 = init_weights(0, 0)
X = np.random.normal(0, 1, 1000).reshape(-1, 2)  # Initialize X, reused for all of the following tests
outputs = model(X, W0, W1, W2, W3, W4, 'tanh')
plot_hist(outputs, xlim=(-1, 1), ylim=(0, 1))
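As a small standalone check of the symmetry argument (a sketch, not part of the video's code; it uses a constant value 0.5 instead of 0 so the "identical neurons" effect is visible, since all-zero weights with tanh simply give all-zero activations and gradients):

import numpy as np

# Constant (including all-zero) weights give every neuron in a layer the same output
# and the same gradient, so the neurons can never diverge from one another.
rng = np.random.default_rng(0)
x = rng.normal(0, 1, (5, 4))
W1 = np.full((4, 3), 0.5)          # every hidden neuron has identical weights
W2 = np.full((3, 2), 0.5)

h = np.tanh(x @ W1)                # all hidden columns are identical
y = np.tanh(h @ W2)
err = y - np.ones_like(y)          # an arbitrary error signal
grad_W1 = x.T @ (((err * (1 - y**2)) @ W2.T) * (1 - h**2))

print(np.allclose(h[:, :1], h), np.allclose(grad_W1[:, :1], grad_W1))  # True True -> columns identical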
Initialization with small random weights
This at least makes sure the updates are not all identical.
However, the outputs pile up around 0, so the information still cannot get through the layers, and the gradients reaching the earlier layers (products of many small weights and near-zero activations) are also vanishingly small.
W0, W1, W2, W3, W4 = init_weights(0, 0.01)
outputs = model(X, W0, W1, W2, W3, W4, 'tanh')
plot_hist(outputs, xlim=(-1, 1), ylim=(0, 1))
outputs = model(X, W0, W1, W2, W3, W4, 'sigmoid')
plot_hist(outputs, xlim=(0, 1), ylim=(0, 1))
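A quick standalone check of this collapse (a sketch with an arbitrary stack of 200-wide tanh layers rather than the video's exact network): with a standard deviation of 0.01, each layer multiplies the activation's standard deviation by roughly sqrt(200) * 0.01 ≈ 0.14, so the outputs shrink toward 0 layer by layer.

import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(0, 1, (1000, 200))
for layer in range(5):
    W = rng.normal(0, 0.01, (200, 200))   # "small" initialization
    x = np.tanh(x @ W)
    print(f"layer {layer}: output std = {x.std():.5f}")   # shrinks by ~sqrt(200)*0.01 each layer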
Initialization with large values
Some of the information gets through, but because of the shape of the tanh (and sigmoid) function, overly large pre-activations push the outputs into the saturated regions, where the gradient is again too small.
W0, W1, W2, W3, W4 = init_weights(0, 1)
outputs = model(X, W0, W1, W2, W3, W4, 'tanh')
plot_hist(outputs, xlim=(-1, 1), ylim=(0, 1))
outputs = model(X, W0, W1, W2, W3, W4, 'sigmoid')
plot_hist(outputs, xlim=(0, 1), ylim=(0, 1))
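To quantify the saturation (another standalone sketch with arbitrary widths, not the video's network): with weight standard deviation 1 the pre-activations have a very large variance, the tanh outputs sit near ±1, and the local derivative 1 - output² is close to 0.

import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(0, 1, (1000, 200))
W = rng.normal(0, 1.0, (200, 200))        # "large" initialization
out = np.tanh(x @ W)
print("mean |output|      :", np.abs(out).mean())    # close to 1 -> saturated
print("mean local gradient:", (1 - out**2).mean())   # close to 0 -> gradients vanish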
4. Xavier initialization
There is a typo in the formula shown in the video: the summation sign needs brackets wrapping all three terms inside, i.e. Var(y) = Σ_i [ E[w_i]²·Var(x_i) + E[x_i]²·Var(w_i) + Var(w_i)·Var(x_i) ].
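For completeness, the derivation this formula belongs to (the standard Xavier reasoning, written in the same notation as below):

Var(y) = Var(Σ_{i=1..Nin} w_i·x_i) = Σ_{i=1..Nin} [ E[w_i]²·Var(x_i) + E[x_i]²·Var(w_i) + Var(w_i)·Var(x_i) ]
With zero-mean, independent w and x this reduces to Var(y) = Nin·Var(w)·Var(x),
so keeping Var(y) = Var(x) in the forward pass requires Var(w) = 1/Nin,
and the same argument applied to the backward pass requires Var(w) = 1/Nout.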
Verification with the sigmoid/tanh activation functions
# The values passed in are sqrt(1/fan_in), i.e. standard deviations, because np.random.normal
# takes a standard deviation rather than a variance
W0, W1, W2, W3, W4 = init_weights(0, (1/2)**0.5, (1/200)**0.5, (1/300)**0.5, (1/400)**0.5, (1/300)**0.5)
outputs = model(X, W0, W1, W2, W3, W4, 'tanh')
plot_hist(outputs, xlim=(-1, 1), ylim=(0, 1))
outputs = model(X, W0, W1, W2, W3, W4, 'sigmoid')
plot_hist(outputs, xlim=(0, 1), ylim=(0, 1))
# The distributions are clearly close to normal, so the data flows forward smoothly
Considering forward propagation alone is not enough; we also have to consider how backpropagation updates the gradients.
Ideally, whatever distribution the error signal starts with on the right should be transmitted all the way to the left.
Note that these histograms are read from right to left.
W0, W1, W2, W3, W4 = init_weights(0, (1/2)**0.5, (1/200)**0.5, (1/300)**0.5, (1/400)**0.5, (1/300)**0.5)
lr = 0.01
epochs = 1
Y_onehot = []  # Randomly generate one-hot y labels, used only to compute a loss so backpropagation can update the gradients
for i in range(X.shape[0]):
    temp = np.random.randint(0, 2)
    Y_onehot.append([temp, abs(1 - temp)])
Y_onehot = np.array(Y_onehot)
for epoch in range(epochs):
    for j in range(X.shape[0]):
        [output_0, output_1, output_2, output_3, output_4] = model(X[j], W0, W1, W2, W3, W4, 'tanh')
        # Backpropagation: each gradient has the same shape as its weight matrix, since it is used to update that weight
        loss_4 = derivative_tanh(output_4) * (Y_onehot[j] - output_4)
        grad_4 = output_3.reshape(-1, 1) @ loss_4.reshape(1, -1)
        loss_3 = derivative_tanh(output_3) * (W4 @ loss_4)
        grad_3 = output_2.reshape(-1, 1) @ loss_3.reshape(1, -1)
        loss_2 = derivative_tanh(output_2) * (W3 @ loss_3)
        grad_2 = output_1.reshape(-1, 1) @ loss_2.reshape(1, -1)
        loss_1 = derivative_tanh(output_1) * (W2 @ loss_2)
        grad_1 = output_0.reshape(-1, 1) @ loss_1.reshape(1, -1)
        loss_0 = derivative_tanh(output_0) * (W1 @ loss_1)
        grad_0 = X[j].reshape(-1, 1) @ loss_0.reshape(1, -1)
        # Gradient update
        W4 += lr * grad_4
        W3 += lr * grad_3
        W2 += lr * grad_2
        W1 += lr * grad_1
        W0 += lr * grad_0
outputs = model(X, W0, W1, W2, W3, W4, 'tanh')
plot_hist(outputs, xlim=(-0.5, 0.5), ylim=(0, 4))
Clearly, reading from right to left, the backpropagated information is not transmitted well, so Xavier reconsidered back propagation as well.
Conclusion: take a compromise between forward propagation and back propagation.
First, verify the compromise: the weight initialization variance is set to 2/(Nin+Nout), i.e. twice the reciprocal of the sum of the input and output dimensions (the forward pass alone asks for 1/Nin and the backward pass alone for 1/Nout).
Verification with tanh and sigmoid
If it works as intended, whatever distribution goes in should be passed along in both directions.
# The values passed in are sqrt(2/(fan_in+fan_out)), i.e. standard deviations, because np.random.normal
# takes a standard deviation rather than a variance
W0, W1, W2, W3, W4 = init_weights(0, (2/(2+200))**0.5, (2/(200+300))**0.5, (2/(300+400))**0.5, (2/(400+300))**0.5, (2/(300+2))**0.5)
outputs = model(X, W0, W1, W2, W3, W4, 'tanh')
plot_hist(outputs, xlim=(-0.75, 0.75), ylim=(0, 1))
# The distributions are clearly close to normal, so the data flows forward smoothly
lr = 0.01
epochs = 1
for epoch in range(epochs):
    for j in range(X.shape[0]):
        [output_0, output_1, output_2, output_3, output_4] = model(X[j], W0, W1, W2, W3, W4, 'tanh')
        # Backpropagation: each gradient has the same shape as its weight matrix
        loss_4 = derivative_tanh(output_4) * (Y_onehot[j] - output_4)
        grad_4 = output_3.reshape(-1, 1) @ loss_4.reshape(1, -1)
        loss_3 = derivative_tanh(output_3) * (W4 @ loss_4)
        grad_3 = output_2.reshape(-1, 1) @ loss_3.reshape(1, -1)
        loss_2 = derivative_tanh(output_2) * (W3 @ loss_3)
        grad_2 = output_1.reshape(-1, 1) @ loss_2.reshape(1, -1)
        loss_1 = derivative_tanh(output_1) * (W2 @ loss_2)
        grad_1 = output_0.reshape(-1, 1) @ loss_1.reshape(1, -1)
        loss_0 = derivative_tanh(output_0) * (W1 @ loss_1)
        grad_0 = X[j].reshape(-1, 1) @ loss_0.reshape(1, -1)
        # Gradient update
        W4 += lr * grad_4
        W3 += lr * grad_3
        W2 += lr * grad_2
        W1 += lr * grad_1
        W0 += lr * grad_0
outputs = model(X, W0, W1, W2, W3, W4, 'tanh')
plot_hist(outputs, xlim=(-0.75, 0.75), ylim=(0, 1))
First run a forward pass with tanh, shown in the first figure above.
With a learning rate of 0.01, backpropagation over one epoch of training gives the second figure.
The first figure is forward propagation: reading from left to right, the information is basically transmitted.
The second figure is back propagation: reading from right to left, the information is basically transmitted.
# The values passed in are sqrt(2/(fan_in+fan_out)), i.e. standard deviations, because np.random.normal
# takes a standard deviation rather than a variance
W0, W1, W2, W3, W4 = init_weights(0, (2/(2+200))**0.5, (2/(200+300))**0.5, (2/(300+400))**0.5, (2/(400+300))**0.5, (2/(300+2))**0.5)
outputs = model(X, W0, W1, W2, W3, W4, 'sigmoid')
n = len(outputs)
fig, axes = plt.subplots(1, n, figsize=(3 * n, 3), sharex=True, sharey=True)
for i in range(n):
    axes[i].hist(outputs[i].flatten(), bins=25, histtype="stepfilled", density=True, alpha=0.6)
plt.xlim(0, 1)
plt.ylim(0, 1)
plt.show()
# The distributions are clearly close to normal, so the data flows forward smoothly
lr = 0.01
epochs = 1
# Y_onehot = []
# for i in range(X.shape[0]):
#     temp = np.random.randint(0, 1)
#     Y_onehot.append([temp, abs(1-temp)])
# Y_onehot = np.array(Y_onehot)
for epoch in range(epochs):
    for j in range(X.shape[0]):
        [output_0, output_1, output_2, output_3, output_4] = model(X[j], W0, W1, W2, W3, W4, 'sigmoid')
        # Backpropagation: each gradient has the same shape as its weight matrix
        # "Loss" of the last layer = sigmoid derivative at the last layer's output * (true value - last layer's output)
        loss_4 = derivative_sigmoid(output_4) * (Y_onehot[j] - output_4)
        # Gradient of a layer = (previous layer's output) outer product ("loss" of this layer)
        grad_4 = output_3.reshape(-1, 1) @ loss_4.reshape(1, -1)
        # "Loss" of every other layer = sigmoid derivative at this layer's output * (next layer's weights @ next layer's "loss")
        loss_3 = derivative_sigmoid(output_3) * (W4 @ loss_4)
        grad_3 = output_2.reshape(-1, 1) @ loss_3.reshape(1, -1)
        loss_2 = derivative_sigmoid(output_2) * (W3 @ loss_3)
        grad_2 = output_1.reshape(-1, 1) @ loss_2.reshape(1, -1)
        loss_1 = derivative_sigmoid(output_1) * (W2 @ loss_2)
        grad_1 = output_0.reshape(-1, 1) @ loss_1.reshape(1, -1)
        loss_0 = derivative_sigmoid(output_0) * (W1 @ loss_1)
        grad_0 = X[j].reshape(-1, 1) @ loss_0.reshape(1, -1)
        # Gradient update
        W4 += lr * grad_4
        W3 += lr * grad_3
        W2 += lr * grad_2
        W1 += lr * grad_1
        W0 += lr * grad_0
outputs = model(X, W0, W1, W2, W3, W4, 'sigmoid')
plot_hist(outputs, xlim=(0, 1), ylim=(0, 1))
Now run a forward pass with sigmoid first, shown in the first figure above.
With a learning rate of 0.01, backpropagation over one epoch of training gives the second figure.
The first figure is forward propagation: reading from left to right, the information is basically transmitted.
The second figure is back propagation: reading from right to left, the information is basically transmitted.
The ReLU function does not work with the Xavier initialization scheme,
because zeroing out roughly half of the activations destroys the distribution of the data.
ReLU never outputs values below 0, so only the part greater than 0 is plotted.
By layer 4 there is basically no information left.
W0, W1, W2, W3, W4 = init_weights(0, (2/(2+200))**0.5, (2/(200+300))**0.5, (2/(300+400))**0.5, (2/(400+300))**0.5, (2/(300+2))**0.5)
outputs = model(X, W0, W1, W2, W3, W4, 'relu')
plot_hist(outputs, xlim=(0, 0.5), ylim=(0, 2))
5. Kaiming (He) initialization
If the ReLU function is used, adopt Kaiming He's initialization scheme.
Simply doubling the variance used in Xavier compensates for ReLU.
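Where the extra factor of 2 comes from can be sanity-checked numerically (a standalone sketch, not part of the video's code): for a zero-mean, symmetric pre-activation z, ReLU discards half of the second moment, so E[relu(z)²] ≈ E[z²]/2 and the weight variance has to be doubled to compensate.

import numpy as np

rng = np.random.default_rng(0)
z = rng.normal(0, 1, 1_000_000)            # symmetric, zero-mean pre-activation
relu_z = np.maximum(z, 0)

print("E[z^2]       :", np.mean(z**2))        # ~1.0
print("E[relu(z)^2] :", np.mean(relu_z**2))   # ~0.5, i.e. half of E[z^2]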
W0, W1, W2, W3, W4 = init_weights(0, (2*2/(2+200))**0.5, (2*2/(200+300))**0.5, (2*2/(300+400))**0.5, (2*2/(400+300))**0.5, (2*2/(300+2))**0.5)
outputs = model(X, W0, W1, W2, W3, W4, 'relu')
plot_hist(outputs, xlim=(0, 0.5), ylim=(0, 2))
But this does not yet cover the ReLU variants, so Kaiming He generalized it further.
The denominator can be understood as follows: if the real line is split into a positive half and a negative half, ReLU kills the negative half completely, which corresponds to a = 0;
Leaky ReLU(a) only suppresses the negative half, with suppression coefficient a, which is where the extra (1 + a²) factor in the denominator comes from.
a = 0.3
W0, W1, W2, W3, W4 = init_weights(0, (2*2/((1+a**2)*(2+200)))**0.5, (2*2/((1+a**2)*(200+300)))**0.5, (2*2/((1+a**2)*(300+400)))**0.5, (2*2/((1+a**2)*(400+300)))**0.5, (2*2/((1+a**2)*(300+2)))**0.5)
outputs = model(X, W0, W1, W2, W3, W4, 'leaky_relu')
plot_hist(outputs, xlim=(-0.1, 0.5), ylim=(0, 2))
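The (1 + a²) factor can be checked the same way (a standalone sketch with a = 0.3; note that the leaky_relu used in the network code above defaults to a negative slope of 0.1, while this initialization assumes 0.3):

import numpy as np

a = 0.3
rng = np.random.default_rng(0)
z = rng.normal(0, 1, 1_000_000)
leaky = np.maximum(z, a * z)

print("E[z^2]             :", np.mean(z**2))        # ~1.0
print("E[leaky_relu(z)^2] :", np.mean(leaky**2))    # ~(1 + a**2) / 2 = 0.545
print("(1 + a**2) / 2     :", (1 + a**2) / 2)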
Xavier is better suited to the sigmoid and tanh activations, while the Kaiming method is better suited to ReLU and Leaky ReLU. Below, the corresponding variances are checked against PyTorch's built-in initializers.
import torch

def cal(x):
    return (x.mean(), x.var())

uniform_w = torch.nn.init.uniform_(torch.empty(300, 500), a=0.0, b=1.0)
normal_w = torch.nn.init.normal_(torch.empty(300, 500), mean=0.0, std=1.0)
xavier_uniform_w = torch.nn.init.xavier_uniform_(torch.empty(300, 500), gain=1.0)
kaiming_uniform_w = torch.nn.init.kaiming_uniform_(torch.empty(300, 500), a=0, mode='fan_in', nonlinearity='relu')
xavier_normal_w = torch.nn.init.xavier_normal_(torch.empty(300, 500), gain=1.0)
kaiming_normal_w = torch.nn.init.kaiming_normal_(torch.empty(300, 500), a=0, mode='fan_in', nonlinearity='relu')
kaiming_uniform_w_l = torch.nn.init.kaiming_uniform_(torch.empty(300, 500), a=0.3, mode='fan_in', nonlinearity='leaky_relu')
kaiming_normal_w_l = torch.nn.init.kaiming_normal_(torch.empty(300, 500), a=0.3, mode='fan_in', nonlinearity='leaky_relu')
print('uniform_w:', cal(uniform_w))
print('normal_w:', cal(normal_w))
print('xavier uniform distribution:', cal(xavier_uniform_w))
print('kaiming uniform distribution by torch:', cal(kaiming_uniform_w))
print('xavier normal distribution:', cal(xavier_normal_w))
print('kaiming normal distribution by torch:', cal(kaiming_normal_w))
print('kaiming uniform distribution leakyrelu:', cal(kaiming_uniform_w_l))
print('kaiming normal distribution leakyrelu by torch:', cal(kaiming_normal_w_l))
uniform_w: (tensor(0.4999), tensor(0.0833))
normal_w: (tensor(-0.0016), tensor(0.9989))
xavier uniform distribution: (tensor(0.0001), tensor(0.0025))
kaiming uniform distribution by torch: (tensor(7.6385e-05), tensor(0.0040))
xavier normal distribution: (tensor(6.1033e-05), tensor(0.0025))
kaiming normal distribution by torch: (tensor(7.5544e-05), tensor(0.0040))
kaiming uniform distribution leakyrelu: (tensor(-1.6571e-05), tensor(0.0037))
kaiming normal distribution leakyrelu by torch: (tensor(-0.0001), tensor(0.0037))
# The variance of a uniform distribution is the squared interval length divided by 12.
# Kaiming multiplies the Xavier variance by 2, and then makes an approximation: instead of the
# average of the input and output dimensions it uses only one of them (fan_in or fan_out),
# selectable via mode, depending on whether you want to preserve the forward or the backward pass.
print('Mean and variance of the [0, 1] uniform distribution', 1/2, (1-0)**2/12)
print('Variance of the xavier uniform distribution', (2*((6/(800))**0.5))**2/12)
print('Variance of the kaiming uniform distribution', (2*((2*6/(800))**0.5))**2/12)   # 2*6/(400+400) -> 6/400
print('Approximate variance of the kaiming uniform distribution', (2*((6/(500))**0.5))**2/12)
print('Variance of the xavier normal distribution', 2/(800))
print('Variance of the kaiming normal distribution', 2*2/(800))                       # 2*2/(400+400) -> 2/400
print('Approximate variance of the kaiming normal distribution', 2/(500))
print('Variance of the kaiming uniform distribution, leakyrelu', (2*((2*6/((1+0.3**2)*800))**0.5))**2/12)
print('Approximate variance of the kaiming uniform distribution, leakyrelu', (2*((6/((1+0.3**2)*500))**0.5))**2/12)
print('Variance of the kaiming normal distribution, leakyrelu', 2*2/((1+0.3**2)*800))
print('Approximate variance of the kaiming normal distribution, leakyrelu', 2/((1+0.3**2)*500))
Mean and variance of the [0, 1] uniform distribution 0.5 0.08333
Variance of the xavier uniform distribution 0.0025
Variance of the kaiming uniform distribution 0.005
Approximate variance of the kaiming uniform distribution 0.004
Variance of the xavier normal distribution 0.0025
Variance of the kaiming normal distribution 0.005
Approximate variance of the kaiming normal distribution 0.004
Variance of the kaiming uniform distribution, leakyrelu 0.0045871559633027525
Approximate variance of the kaiming uniform distribution, leakyrelu 0.003669724770642106
Variance of the kaiming normal distribution, leakyrelu 0.004587155963302752
Approximate variance of the kaiming normal distribution, leakyrelu 0.00366972477064202
Kaiming's fan_in and fan_out modes
kaiming_uniform_w = torch.nn.init.kaiming_uniform_(torch.empty(300, 500), a=0,mode='fan_in',nonlinearity='relu')
Note that in this line the denominator is not 400, the mean of 300 and 500; it is controlled by the mode parameter. With fan_in the denominator is the input dimension, 500, which preserves the forward pass. Compare with this line:
print('Approximate variance of the kaiming uniform distribution', (2 * ((6 / (500)) ** 0.5)) ** 2 / 12)
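In practice these initializers are applied directly to each layer's weight tensor; a minimal sketch with torch.nn (the layer sizes and module structure here are only illustrative, not from the original post):

import torch
from torch import nn

net = nn.Sequential(
    nn.Linear(500, 300),
    nn.ReLU(),
    nn.Linear(300, 2),
)

def init_layer(m):
    if isinstance(m, nn.Linear):
        # fan_in preserves the forward pass, fan_out the backward pass
        nn.init.kaiming_normal_(m.weight, mode='fan_in', nonlinearity='relu')
        nn.init.zeros_(m.bias)

net.apply(init_layer)        # applies init_layer to every submodule
print(net[0].weight.var())   # should be close to 2/500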