Original source

https://www.bilibili.com/video/BV1ba411m72B

# 1. Why weight initialization needs careful design

- Poorly chosen initial weights make the network prone to vanishing gradients (gradients very close to 0) and exploding gradients (very large gradients), so most of the gradients obtained by backpropagation have no useful effect.
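To make the two failure modes concrete, here is a minimal sketch. It assumes a toy chain with one sigmoid neuron per layer and the same weight in every layer; `chained_gradient` is a hypothetical helper, not something from the video. The backpropagated gradient is the product of `weight * sigmoid'(z)` over the layers, so it shrinks or grows geometrically with depth.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def sigmoid_grad(z):
    # Derivative of the sigmoid with respect to its pre-activation z
    s = sigmoid(z)
    return s * (1.0 - s)


def chained_gradient(weight, pre_activation, depth):
    """Product of weight * sigmoid'(z) over `depth` layers of a one-neuron chain."""
    grad = 1.0
    for _ in range(depth):
        grad *= weight * sigmoid_grad(pre_activation)
    return grad


# sigmoid'(0) = 0.25, so each layer multiplies the gradient by weight * 0.25
vanishing = chained_gradient(weight=1.0, pre_activation=0.0, depth=10)  # 0.25 ** 10
exploding = chained_gradient(weight=8.0, pre_activation=0.0, depth=10)  # 2.0 ** 10
print(vanishing, exploding)
```

With a per-layer factor below 1 the gradient is practically gone after ten layers; with a factor above 1 it has already grown a thousandfold.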

# 2. Design idea

The data passed through each layer of a neural network should stay meaningful.

"Meaningful" here means that what was originally expressed is not distorted in transmission, unlike a rumor whose wording changes a little every time it is passed on.

For example, the heights of the boys in a primary school class form a distribution with some variance.

Suppose the class gets together again a few years later.

The heights of these boys now form a new distribution with a new variance.

Variance describes how tightly this batch of data is clustered.

If it is the same group of boys, the shape of their height distribution should not change.

So we use variance to measure whether the input and the output of a neural network layer follow the same distribution.
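As a rough illustration of using variance as the yardstick, the sketch below compares the variance of a batch before and after a single linear layer with no activation. The layer sizes, the seed, and the helper `layer_output_variance` are assumptions made for this example only, and it uses plain Python rather than numpy so that it runs anywhere.

```python
import math
import random
import statistics

random.seed(0)  # fixed seed so the sketch is reproducible


def layer_output_variance(inputs, n_out, weight_std):
    """Variance of the outputs of one linear layer with N(0, weight_std^2) weights."""
    n_in = len(inputs[0])
    weights = [[random.gauss(0.0, weight_std) for _ in range(n_out)] for _ in range(n_in)]
    outputs = [
        sum(row[i] * weights[i][j] for i in range(n_in))
        for row in inputs
        for j in range(n_out)
    ]
    return statistics.pvariance(outputs)


n_in, n_out, batch = 100, 100, 50
inputs = [[random.gauss(0.0, 1.0) for _ in range(n_in)] for _ in range(batch)]
var_in = statistics.pvariance([v for row in inputs for v in row])

var_small = layer_output_variance(inputs, n_out, weight_std=0.01)
var_scaled = layer_output_variance(inputs, n_out, weight_std=math.sqrt(1.0 / n_in))
var_large = layer_output_variance(inputs, n_out, weight_std=1.0)
print(var_in, var_small, var_scaled, var_large)
```

Only the weights whose standard deviation is scaled by the input dimension keep the output variance close to the input variance; the other two shrink or inflate it by roughly two orders of magnitude in a single layer.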

# 3. Can the weights be initialized to all zeros, small values, or large values?

All-zero initialization

Every layer outputs the same value, no matter how many rounds the network is trained.

The neurons in each layer all have the same weights, so they cannot learn (extract) different features.
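The symmetry problem can be checked on a very small example. This is a sketch under assumptions: a hypothetical 2-3-2 tanh network with squared-error loss, written in plain Python. With all-zero weights every gradient is zero, and with any other constant value every hidden neuron receives exactly the same gradient, so the neurons can never become different from each other.

```python
import math


def gradients(x, target, w_hidden, w_out):
    """One forward/backward pass of a tiny 2-layer tanh network with squared-error loss."""
    n_in, n_hid, n_out = len(x), len(w_out), len(target)
    hidden = [math.tanh(sum(x[i] * w_hidden[i][j] for i in range(n_in))) for j in range(n_hid)]
    out = [math.tanh(sum(hidden[j] * w_out[j][k] for j in range(n_hid))) for k in range(n_out)]
    # Error signal at the output layer and at the hidden layer
    delta_out = [(out[k] - target[k]) * (1.0 - out[k] ** 2) for k in range(n_out)]
    delta_hid = [
        (1.0 - hidden[j] ** 2) * sum(w_out[j][k] * delta_out[k] for k in range(n_out))
        for j in range(n_hid)
    ]
    grad_w_hidden = [[x[i] * delta_hid[j] for j in range(n_hid)] for i in range(n_in)]
    grad_w_out = [[hidden[j] * delta_out[k] for k in range(n_out)] for j in range(n_hid)]
    return grad_w_hidden, grad_w_out


def constant_weights(rows, cols, value):
    return [[value] * cols for _ in range(rows)]


x, target = [0.5, -1.0], [1.0, 0.0]
zero_grads = gradients(x, target, constant_weights(2, 3, 0.0), constant_weights(3, 2, 0.0))
same_grads = gradients(x, target, constant_weights(2, 3, 0.5), constant_weights(3, 2, 0.5))
print(zero_grads)     # every entry is zero: nothing is learned
print(same_grads[0])  # each row repeats one value: all hidden neurons get the same update
```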

![](https://img2022.cnblogs.com/blog/2682749/202205/2682749-20220516083854776-1693978353.png)

    import numpy as np
    from matplotlib import pyplot as plt


    def init_weights(u, a0, a1=None, a2=None, a3=None, a4=None):
        # u: mean; a0..a4: standard deviation of each layer's weights (default: a0)
        a_1 = a1 or a0
        a_2 = a2 or a0
        a_3 = a3 or a0
        a_4 = a4 or a0
        W0 = np.random.normal(u, a0, 400).reshape(2, 200)
        W1 = np.random.normal(u, a_1, 60000).reshape(200, 300)
        W2 = np.random.normal(u, a_2, 120000).reshape(300, 400)
        W3 = np.random.normal(u, a_3, 120000).reshape(400, 300)
        W4 = np.random.normal(u, a_4, 600).reshape(300, 2)
        return W0, W1, W2, W3, W4


    def sigmoid(x):  # Define sigmoid function
        return 1 / (1 + np.exp(-x))


    def derivative_sigmoid(y):
        # y is the sigmoid output, so the derivative is y * (1 - y)
        return y * (1 - y)


    def relu(x):
        return np.maximum(x, 0)


    def leaky_relu(x, p=0.1):
        return np.maximum(x, p*x)


    def tanh(x):
        return (np.exp(x)-np.exp(-x))/(np.exp(x)+np.exp(-x))


    def derivative_tanh(y):
        # y is the tanh output (the backpropagation code passes layer outputs),
        # so the derivative is 1 - y**2 rather than 1 - tanh(y)**2
        return 1 - y**2


    def model(X, W0, W1, W2, W3, W4, act='tanh'):
        # Define the forward propagation process of the model
        if act == 'tanh':
            output_0 = tanh(X @ W0)         # [n, 2] @ [2, 200] = [n, 200]
            output_1 = tanh(output_0 @ W1)  # [n, 200] @ [200, 300] = [n, 300]
            output_2 = tanh(output_1 @ W2)  # [n, 300] @ [300, 400] = [n, 400]
            output_3 = tanh(output_2 @ W3)  # [n, 400] @ [400, 300] = [n, 300]
            output_4 = tanh(output_3 @ W4)  # [n, 300] @ [300, 2] = [n, 2]
        elif act == 'relu':
            output_0 = relu(X @ W0)
            output_1 = relu(output_0 @ W1)
            output_2 = relu(output_1 @ W2)
            output_3 = relu(output_2 @ W3)
            output_4 = relu(output_3 @ W4)
        elif act == 'leaky_relu':
            output_0 = leaky_relu(X @ W0)
            output_1 = leaky_relu(output_0 @ W1)
            output_2 = leaky_relu(output_1 @ W2)
            output_3 = leaky_relu(output_2 @ W3)
            output_4 = leaky_relu(output_3 @ W4)
        else:
            output_0 = sigmoid(X @ W0)
            output_1 = sigmoid(output_0 @ W1)
            output_2 = sigmoid(output_1 @ W2)
            output_3 = sigmoid(output_2 @ W3)
            output_4 = sigmoid(output_3 @ W4)
        return [output_0, output_1, output_2, output_3, output_4]


    def plot_hist(outputs, xlim=(-1, 1), ylim=(0, 1)):
        # One histogram per layer, left to right
        n = len(outputs)
        fig, axes = plt.subplots(1, n, figsize=(3*n, 3), sharex=True, sharey=True)
        for i in range(n):
            axes[i].hist(outputs[i].flatten(), bins=50, histtype="stepfilled", density=True, alpha=0.6)
        plt.xlim(*xlim)
        plt.ylim(*ylim)
        plt.show()


    W0, W1, W2, W3, W4 = init_weights(0, 0)
    X = np.random.normal(0, 1, 1000).reshape(-1, 2)  # Initialize X for all of the following tests
    outputs = model(X, W0, W1, W2, W3, W4, 'tanh')
    plot_hist(outputs, xlim=(-1, 1), ylim=(0, 1))

Initialization with small random weights

This ensures that the updates are not all the same.

However, the activations still cluster around 0, so the information cannot pass through the layers. The local gradients are also close to 0, so the gradients are too small.

    W0, W1, W2, W3, W4 = init_weights(0, 0.01)
    outputs = model(X, W0, W1, W2, W3, W4, 'tanh')
    plot_hist(outputs, xlim=(-1, 1), ylim=(0, 1))
    outputs = model(X, W0, W1, W2, W3, W4, 'sigmoid')
    plot_hist(outputs, xlim=(0, 1), ylim=(0, 1))

Initialization with larger values

Part of the information gets through, but tanh saturates: the pre-activations are too large, so the outputs pile up near ±1, where the gradient is too small.

    W0, W1, W2, W3, W4 = init_weights(0, 1)
    outputs = model(X, W0, W1, W2, W3, W4, 'tanh')
    plot_hist(outputs, xlim=(-1, 1), ylim=(0, 1))
    outputs = model(X, W0, W1, W2, W3, W4, 'sigmoid')
    plot_hist(outputs, xlim=(0, 1), ylim=(0, 1))

# 4. Xavier initialization

Note that the formula as presented contains an error: the summation symbol should be followed by brackets that wrap all three terms inside it.
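For reference, the standard variance computation that this remark appears to refer to can be sketched as follows. The notation is an assumption introduced here, not copied from the original formula: a linear unit $y = \sum_{i=1}^{n_{in}} w_i x_i$ whose weights $w_i$ and inputs $x_i$ are independent. The summation wraps all three terms:

```latex
\operatorname{Var}(y)
  = \sum_{i=1}^{n_{in}} \Big[ E(w_i)^2 \operatorname{Var}(x_i)
                            + E(x_i)^2 \operatorname{Var}(w_i)
                            + \operatorname{Var}(w_i)\operatorname{Var}(x_i) \Big]

% With zero-mean weights and inputs the first two terms vanish:
\operatorname{Var}(y) = n_{in} \operatorname{Var}(w) \operatorname{Var}(x)
\quad\Longrightarrow\quad
\operatorname{Var}(w) = \frac{1}{n_{in}} \ \text{ keeps } \ \operatorname{Var}(y) = \operatorname{Var}(x)
```

This is why the verification code in this section passes standard deviations such as `(1/200)**0.5`, the square root of the reciprocal of each layer's input dimension.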

## Verification with sigmoid/tanh activation functions

    # The value passed in is the standard deviation, i.e. the square root of the
    # reciprocal of the layer's input dimension, because numpy parameterizes the
    # normal distribution by its standard deviation
    W0, W1, W2, W3, W4 = init_weights(0, (1/2)**0.5, (1/200)**0.5, (1/300)**0.5, (1/400)**0.5, (1/300)**0.5)
    outputs = model(X, W0, W1, W2, W3, W4, 'tanh')
    plot_hist(outputs, xlim=(-1, 1), ylim=(0, 1))
    outputs = model(X, W0, W1, W2, W3, W4, 'sigmoid')
    plot_hist(outputs, xlim=(0, 1), ylim=(0, 1))
    # The distributions are clearly close to normal, and the data propagates forward smoothly

Considering forward propagation alone is not enough. We should also consider the gradient updates of backpropagation.

Ideally, whatever signal enters at the output end should be passed back from right to left.

Note that these plots are read from right to left.

    W0, W1, W2, W3, W4 = init_weights(0, (1/2)**0.5, (1/200)**0.5, (1/300)**0.5, (1/400)**0.5, (1/300)**0.5)
    lr = 0.01
    epochs = 1

    # Randomly generate one-hot labels; they are only needed to compute the
    # loss that drives the backpropagation update
    Y_onehot = []
    for i in range(X.shape[0]):
        temp = np.random.randint(0, 2)
        Y_onehot.append([temp, abs(1-temp)])
    Y_onehot = np.array(Y_onehot)

    for epoch in range(epochs):
        for j in range(X.shape[0]):
            [output_0, output_1, output_2, output_3, output_4] = model(X[j], W0, W1, W2, W3, W4, 'tanh')
            # Backpropagation: each gradient has the same shape as its weight
            # matrix, because it is used to update that weight
            loss_4 = derivative_tanh(output_4) * (Y_onehot[j] - output_4)
            grad_4 = output_3.reshape(-1,1) @ loss_4.reshape(1,-1)
            loss_3 = derivative_tanh(output_3) * (W4 @ loss_4)
            grad_3 = output_2.reshape(-1,1) @ loss_3.reshape(1,-1)
            loss_2 = derivative_tanh(output_2) * (W3 @ loss_3)
            grad_2 = output_1.reshape(-1,1) @ loss_2.reshape(1,-1)
            loss_1 = derivative_tanh(output_1) * (W2 @ loss_2)
            grad_1 = output_0.reshape(-1,1) @ loss_1.reshape(1,-1)
            loss_0 = derivative_tanh(output_0) * (W1 @ loss_1)
            grad_0 = X[j].reshape(-1,1) @ loss_0.reshape(1,-1)
            # Gradient update
            W4 += lr*grad_4
            W3 += lr*grad_3
            W2 += lr*grad_2
            W1 += lr*grad_1
            W0 += lr*grad_0

    outputs = model(X, W0, W1, W2, W3, W4, 'tanh')
    plot_hist(outputs, xlim=(-0.5, 0.5), ylim=(0, 4))

Clearly, reading from right to left, the backpropagated information is not transmitted well, so Xavier initialization also takes backpropagation into account.

#### Conclusion: a compromise between forward propagation and backpropagation

First, verify the compromise in which the variance of the initial weights is twice the reciprocal of the sum of the input and output dimensions, 2/(Nin+Nout).
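The compromise rule can be written as a one-line helper. This is a small sketch: `xavier_std` and `layer_sizes` are names introduced here for illustration, and the sizes are the 2-200-300-400-300-2 network used throughout this post.

```python
import math


def xavier_std(n_in, n_out):
    """Standard deviation for a weight variance of 2 / (Nin + Nout)."""
    return math.sqrt(2.0 / (n_in + n_out))


# Layer widths of the network used in this post
layer_sizes = [2, 200, 300, 400, 300, 2]
stds = [xavier_std(n_in, n_out) for n_in, n_out in zip(layer_sizes, layer_sizes[1:])]
print(stds)
```

Each entry of `stds` is the standard deviation for one weight matrix, for example `(2/(2+200))**0.5` for the first layer.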

Verification with tanh and sigmoid

Ideally, whatever distribution goes in should be passed on unchanged.

    # The value passed in is the standard deviation sqrt(2/(Nin+Nout)), because
    # numpy parameterizes the normal distribution by its standard deviation
    W0, W1, W2, W3, W4 = init_weights(0, (2/(2+200))**0.5, (2/(200+300))**0.5, (2/(300+400))**0.5, (2/(400+300))**0.5, (2/(300+2))**0.5)
    outputs = model(X, W0, W1, W2, W3, W4, 'tanh')
    plot_hist(outputs, xlim=(-0.75, 0.75), ylim=(0, 1))
    # The distributions are clearly close to normal, and the data propagates forward smoothly

    lr = 0.01
    epochs = 1
    for epoch in range(epochs):
        for j in range(X.shape[0]):
            [output_0, output_1, output_2, output_3, output_4] = model(X[j], W0, W1, W2, W3, W4, 'tanh')
            # Backpropagation: each gradient has the same shape as its weight
            # matrix, because it is used to update that weight
            loss_4 = derivative_tanh(output_4) * (Y_onehot[j] - output_4)
            grad_4 = output_3.reshape(-1,1) @ loss_4.reshape(1,-1)
            loss_3 = derivative_tanh(output_3) * (W4 @ loss_4)
            grad_3 = output_2.reshape(-1,1) @ loss_3.reshape(1,-1)
            loss_2 = derivative_tanh(output_2) * (W3 @ loss_3)
            grad_2 = output_1.reshape(-1,1) @ loss_2.reshape(1,-1)
            loss_1 = derivative_tanh(output_1) * (W2 @ loss_2)
            grad_1 = output_0.reshape(-1,1) @ loss_1.reshape(1,-1)
            loss_0 = derivative_tanh(output_0) * (W1 @ loss_1)
            grad_0 = X[j].reshape(-1,1) @ loss_0.reshape(1,-1)
            # Gradient update
            W4 += lr*grad_4
            W3 += lr*grad_3
            W2 += lr*grad_2
            W1 += lr*grad_1
            W0 += lr*grad_0

    outputs = model(X, W0, W1, W2, W3, W4, 'tanh')
    plot_hist(outputs, xlim=(-0.75, 0.75), ylim=(0, 1))

The code above first runs a forward pass with tanh, which produces the first figure.

With a learning rate of 0.01, it then trains for one epoch with backpropagation, which produces the second figure.

The first figure shows forward propagation: reading from left to right, the information is largely transmitted.

The second figure shows backpropagation: reading from right to left, the information is largely transmitted.

    # The value passed in is the standard deviation sqrt(2/(Nin+Nout)), because
    # numpy parameterizes the normal distribution by its standard deviation
    W0, W1, W2, W3, W4 = init_weights(0, (2/(2+200))**0.5, (2/(200+300))**0.5, (2/(300+400))**0.5, (2/(400+300))**0.5, (2/(300+2))**0.5)
    outputs = model(X, W0, W1, W2, W3, W4, 'sigmoid')
    n = len(outputs)
    fig, axes = plt.subplots(1, n, figsize=(3*n, 3), sharex=True, sharey=True)
    for i in range(n):
        axes[i].hist(outputs[i].flatten(), bins=25, histtype="stepfilled", density=True, alpha=0.6)
    plt.xlim(0, 1)
    plt.ylim(0, 1)
    plt.show()
    # The distributions are clearly close to normal, and the data propagates forward smoothly

    lr = 0.01
    epochs = 1
    # Y_onehot is reused from the earlier experiment
    for epoch in range(epochs):
        for j in range(X.shape[0]):
            [output_0, output_1, output_2, output_3, output_4] = model(X[j], W0, W1, W2, W3, W4, 'sigmoid')
            # Backpropagation: each gradient has the same shape as its weight
            # matrix, because it is used to update that weight
            # "Loss" of the last layer = sigmoid derivative at the last layer's output * (true value - last layer's output)
            loss_4 = derivative_sigmoid(output_4) * (Y_onehot[j] - output_4)
            # Gradient of a layer = previous layer's output (as a column) @ this layer's "loss" (as a row)
            grad_4 = output_3.reshape(-1,1) @ loss_4.reshape(1,-1)
            # "Loss" of any other layer = sigmoid derivative at this layer's output * (next layer's weight @ next layer's "loss")
            loss_3 = derivative_sigmoid(output_3) * (W4 @ loss_4)
            grad_3 = output_2.reshape(-1,1) @ loss_3.reshape(1,-1)
            loss_2 = derivative_sigmoid(output_2) * (W3 @ loss_3)
            grad_2 = output_1.reshape(-1,1) @ loss_2.reshape(1,-1)
            loss_1 = derivative_sigmoid(output_1) * (W2 @ loss_2)
            grad_1 = output_0.reshape(-1,1) @ loss_1.reshape(1,-1)
            loss_0 = derivative_sigmoid(output_0) * (W1 @ loss_1)
            grad_0 = X[j].reshape(-1,1) @ loss_0.reshape(1,-1)
            # Gradient update
            W4 += lr*grad_4
            W3 += lr*grad_3
            W2 += lr*grad_2
            W1 += lr*grad_1
            W0 += lr*grad_0

    outputs = model(X, W0, W1, W2, W3, W4, 'sigmoid')
    plot_hist(outputs, xlim=(0, 1), ylim=(0, 1))

The code above first runs a forward pass with sigmoid, which produces the first figure.

With a learning rate of 0.01, it then trains for one epoch with backpropagation, which produces the second figure.

The first figure shows forward propagation: reading from left to right, the information is largely transmitted.

The second figure shows backpropagation: reading from right to left, the information is largely transmitted.

ReLU is not suited to the Xavier weight initialization method.

ReLU zeroes out roughly half of the values, which destroys the distribution of the data.

ReLU never outputs a value below 0, so only the part greater than 0 is plotted.

By layer 4 there is essentially no information left.

    W0, W1, W2, W3, W4 = init_weights(0, (2/(2+200))**0.5, (2/(200+300))**0.5, (2/(300+400))**0.5, (2/(400+300))**0.5, (2/(300+2))**0.5)
    outputs = model(X, W0, W1, W2, W3, W4, 'relu')
    plot_hist(outputs, xlim=(0, 0.5), ylim=(0, 2))
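The effect of "killing half" can be measured directly. The sketch below assumes zero-mean Gaussian pre-activations (the sample size and seed are arbitrary choices for this example) and compares the mean squared value before and after ReLU.

```python
import random

random.seed(0)  # fixed seed so the sketch is reproducible

samples = [random.gauss(0.0, 1.0) for _ in range(100_000)]

# Mean squared value before and after ReLU
second_moment_before = sum(z * z for z in samples) / len(samples)
second_moment_after = sum(max(z, 0.0) ** 2 for z in samples) / len(samples)

ratio = second_moment_after / second_moment_before
zeroed_fraction = sum(1 for z in samples if z <= 0.0) / len(samples)
print(ratio, zeroed_fraction)
```

Both numbers come out close to one half: ReLU zeroes about half of the values and keeps about half of the signal's mean square, so the signal shrinks layer after layer when the weights are scaled for a symmetric activation.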

# 5. Kaiming He initialization

When ReLU is used, adopt the weight initialization method proposed by Kaiming He.

Simply doubling the variance used in Xavier initialization compensates for the effect of ReLU.

    W0, W1, W2, W3, W4 = init_weights(0, (2*2/(2+200))**0.5, (2*2/(200+300))**0.5, (2*2/(300+400))**0.5, (2*2/(400+300))**0.5, (2*2/(300+2))**0.5)
    outputs = model(X, W0, W1, W2, W3, W4, 'relu')
    plot_hist(outputs, xlim=(0, 0.5), ylim=(0, 2))

But this does not cover the variants of ReLU, so Kaiming He generalized the method.

One way to read the denominator: split the real line R into two halves. ReLU kills one of the halves entirely, so a = 0.

Leaky ReLU with slope a only suppresses that half, and the suppression coefficient is a.
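The role of the slope in the denominator can be sketched with a short derivation. It assumes a pre-activation $z$ that is symmetric around 0 and a leaky ReLU $f(z) = \max(z, az)$; this is the standard argument written out here, not a transcription of the original slide, and $n$ stands for whichever dimension the variance is normalized by.

```latex
% One half of the real line passes unchanged, the other half is scaled by a:
E\big[f(z)^2\big] = \tfrac{1}{2} E\big[z^2\big] + \tfrac{1}{2} a^2 E\big[z^2\big]
                  = \frac{1 + a^2}{2} \operatorname{Var}(z)

% Keeping the signal scale constant across a layer of width n then requires
n \operatorname{Var}(w) \, \frac{1 + a^2}{2} = 1
\quad\Longrightarrow\quad
\operatorname{Var}(w) = \frac{2}{(1 + a^2)\, n}
```

Setting $a = 0$ recovers the ReLU case $2/n$. The verification code in this post uses $n = (N_{in} + N_{out})/2$, which gives the `2*2/((1+a**2)*(Nin+Nout))` expressions.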

    a = 0.3
    # Note: leaky_relu inside model() uses its default slope p=0.1, while
    # a = 0.3 is the slope assumed here when computing the variance
    W0, W1, W2, W3, W4 = init_weights(0, (2*2/((1+a**2)*(2+200)))**0.5, (2*2/((1+a**2)*(200+300)))**0.5, (2*2/((1+a**2)*(300+400)))**0.5, (2*2/((1+a**2)*(400+300)))**0.5, (2*2/((1+a**2)*(300+2)))**0.5)
    outputs = model(X, W0, W1, W2, W3, W4, 'leaky_relu')
    plot_hist(outputs, xlim=(-0.1, 0.5), ylim=(0, 2))

Xavier initialization is better suited to sigmoid and tanh, while Kaiming He's method is better suited to ReLU and leaky ReLU.

    import torch


    def cal(x):
        # Return the mean and the variance of a tensor
        return (x.mean(), x.var())


    uniform_w = torch.nn.init.uniform_(torch.empty(300, 500), a=0.0, b=1.0)
    normal_w = torch.nn.init.normal_(torch.empty(300, 500), mean=0.0, std=1.0)

    xavier_uniform_w = torch.nn.init.xavier_uniform_(torch.empty(300, 500), gain=1.0)
    kaiming_uniform_w = torch.nn.init.kaiming_uniform_(torch.empty(300, 500), a=0, mode='fan_in', nonlinearity='relu')

    xavier_normal_w = torch.nn.init.xavier_normal_(torch.empty(300, 500), gain=1.0)
    kaiming_normal_w = torch.nn.init.kaiming_normal_(torch.empty(300, 500), a=0, mode='fan_in', nonlinearity='relu')

    kaiming_uniform_w_l = torch.nn.init.kaiming_uniform_(torch.empty(300, 500), a=0.3, mode='fan_in', nonlinearity='leaky_relu')
    kaiming_normal_w_l = torch.nn.init.kaiming_normal_(torch.empty(300, 500), a=0.3, mode='fan_in', nonlinearity='leaky_relu')

    print('uniform_w:', cal(uniform_w))
    print('normal_w:', cal(normal_w))
    print('xavier uniform distribution:', cal(xavier_uniform_w))
    print('kaiming uniform distribution by torch:', cal(kaiming_uniform_w))
    print('xavier Normal distribution:', cal(xavier_normal_w))
    print('kaiming Normal distribution by torch:', cal(kaiming_normal_w))
    print('kaiming uniform distribution leakyrelu:', cal(kaiming_uniform_w_l))
    print('kaiming Normal distribution leakyrelu by torch:', cal(kaiming_normal_w_l))

Output:

    uniform_w: (tensor(0.4999), tensor(0.0833))
    normal_w: (tensor(-0.0016), tensor(0.9989))
    xavier uniform distribution: (tensor(0.0001), tensor(0.0025))
    kaiming uniform distribution by torch: (tensor(7.6385e-05), tensor(0.0040))
    xavier Normal distribution: (tensor(6.1033e-05), tensor(0.0025))
    kaiming Normal distribution by torch: (tensor(7.5544e-05), tensor(0.0040))
    kaiming uniform distribution leakyrelu: (tensor(-1.6571e-05), tensor(0.0037))
    kaiming Normal distribution leakyrelu by torch: (tensor(-0.0001), tensor(0.0037))

    # The variance of a uniform distribution is the squared interval length divided by 12.
    # Kaiming He multiplies the Xavier variance by 2 and then applies an approximation:
    # assuming the input and output dimensions are equal, the factor of 2 cancels.
    # The mode parameter selects whether the forward or the backward direction is preserved.
    print('[0, 1]Mean and variance of uniform distribution', 1/2, (1-0)**2/12)

    print('xavier Uniformly distributed variance', (2*((6/(800))**0.5))**2/12)
    print('kaiming Uniformly distributed variance', (2*((2*6/(800))**0.5))**2/12)  # 2*6/(400+400) -> 6/400
    print('kaiming Variance approximation of uniform distribution', (2*((6/(500))**0.5))**2/12)

    print('xavier Variance of normal distribution', 2/(800))
    print('kaiming Variance of normal distribution', 2*2/(800))  # 2*2/(400+400) -> 2/400
    print('kaiming Variance approximation of normal distribution', 2/(500))

    print('kaiming Uniformly distributed variance leakyrelu', (2*((2*6/((1+0.3**2)*800))**0.5))**2/12)
    print('kaiming Variance approximation of uniform distribution leakyrelu', (2*((6/((1+0.3**2)*500))**0.5))**2/12)
    print('kaiming Variance of normal distribution leakyrelu', 2*2/((1+0.3**2)*800))
    print('kaiming Variance approximation of normal distribution leakyrelu', 2/((1+0.3**2)*500))

Output:

    [0, 1]Mean and variance of uniform distribution 0.5 0.08333
    xavier Uniformly distributed variance 0.0025
    kaiming Uniformly distributed variance 0.005
    kaiming Variance approximation of uniform distribution 0.004
    xavier Variance of normal distribution 0.0025
    kaiming Variance of normal distribution 0.005
    kaiming Variance approximation of normal distribution 0.004
    kaiming Uniformly distributed variance leakyrelu 0.0045871559633027525
    kaiming Variance approximation of uniform distribution leakyrelu 0.003669724770642106
    kaiming Variance of normal distribution leakyrelu 0.004587155963302752
    kaiming Variance approximation of normal distribution leakyrelu 0.00366972477064202
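The link between a uniform bound and the resulting variance can be packaged into two small helpers. This is a sketch: `uniform_variance` and `xavier_uniform_bound` are hypothetical names introduced here, and the 500/300 dimensions match the `torch.empty(300, 500)` tensors used in this post.

```python
import math


def uniform_variance(low, high):
    """Variance of a uniform distribution on [low, high]: interval length squared over 12."""
    return (high - low) ** 2 / 12.0


def xavier_uniform_bound(n_in, n_out):
    """Bound b such that U(-b, b) has variance 2 / (n_in + n_out)."""
    return math.sqrt(6.0 / (n_in + n_out))


bound = xavier_uniform_bound(500, 300)
variance = uniform_variance(-bound, bound)
print(bound, variance)
```

Because the variance of U(-b, b) is b²/3, the bound must be sqrt(6/(Nin+Nout)) to reach the same variance 2/(Nin+Nout) = 0.0025 as the normal variant.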

### fan_in and fan_out modes in Kaiming initialization

    kaiming_uniform_w = torch.nn.init.kaiming_uniform_(torch.empty(300, 500), a=0, mode='fan_in', nonlinearity='relu')

Note that in this line the denominator is not 400, the mean of 300 and 500. It is controlled by the mode parameter: with fan_in, the input side is preserved and the denominator is 500. Compare it with this line:

    print('kaiming Variance approximation of uniform distribution', (2 * ((6 / (500)) ** 0.5)) ** 2 / 12)
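How the mode changes the denominator can be sketched in plain Python. The helpers below are hypothetical and are not the torch API. They assume the convention that a 2-D weight has shape (out_features, in_features), which is consistent with the torch results printed in this post, and a squared gain of 2/(1+a²).

```python
def fans(shape):
    """fan_in and fan_out of a 2-D weight with shape (out_features, in_features)."""
    fan_out, fan_in = shape
    return fan_in, fan_out


def kaiming_variance(shape, a=0.0, mode="fan_in"):
    """Target weight variance 2 / ((1 + a^2) * fan) for the selected mode."""
    fan_in, fan_out = fans(shape)
    fan = fan_in if mode == "fan_in" else fan_out
    return 2.0 / ((1.0 + a * a) * fan)


var_fan_in = kaiming_variance((300, 500), a=0.0, mode="fan_in")    # denominator 500
var_fan_out = kaiming_variance((300, 500), a=0.0, mode="fan_out")  # denominator 300
var_leaky = kaiming_variance((300, 500), a=0.3, mode="fan_in")
print(var_fan_in, var_fan_out, var_leaky)
```

With fan_in the sketch gives 0.004 for ReLU and about 0.0037 for a slope of 0.3, the same values as the measured variances of the torch tensors. With fan_out the denominator becomes 300 instead.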