# 1, Backpropagation overview

Backpropagation is arguably the most important algorithm in the history of neural networks: without effective backpropagation, it would be impossible to train deep networks to the depths we see today. Backpropagation is the cornerstone of modern neural networks and deep learning.

The original version of backpropagation was introduced in the 1970s, but it was not until the groundbreaking 1986 paper by Rumelhart, Hinton, and Williams, which described learning representations by back-propagating errors, that we were able to devise faster algorithms and train deeper networks more effectively.

There is no shortage of backpropagation tutorials on the web. Some of my favorites include:

1. Andrew Ng's discussion of backpropagation in his Coursera machine learning course.

2. The mathematically motivated Chapter 2, "How the backpropagation algorithm works", of Michael Nielsen's Neural Networks and Deep Learning.

3. Stanford's CS231n exploration and analysis of backpropagation.

4. Matt Mazur's excellent concrete example demonstrating how backpropagation works.

As you can see, there is no lack of explanations of the principles behind backpropagation. Here, we will instead use Python to build an intuitive, easy-to-understand implementation of the backpropagation algorithm.

We'll build a real neural network and train it with backpropagation. By the end, you'll understand how backpropagation works and, perhaps more importantly, have a deeper understanding of how this algorithm can be used to train neural networks from scratch.

# 2, The backpropagation algorithm

The backpropagation algorithm consists of two stages:

1. The forward pass, where our input is passed through the network and an output prediction is obtained (also known as the propagation stage).

2. The backward pass, where we compute the gradient of the loss function at the final layer of the network (i.e., the prediction layer) and use this gradient to recursively apply the chain rule and update the weights in our network (also known as the weight update stage).

Next, we will implement the backpropagation algorithm in Python.

We will then use the following two datasets to demonstrate how to train custom neural networks with backpropagation and Python:

1. The XOR dataset

2. The MNIST dataset

## 1. Forward pass

The purpose of the forward pass is to propagate our input through the network by applying a series of dot products and activations until we reach the output layer of the network (i.e., our predictions). To visualize this process, let's first consider the XOR dataset (Table 2, left).

Each entry in the design matrix on the left is represented by two numbers. For example, the first data point is represented by the feature vector (0, 0), the second by (0, 1), and so on. The output value y is taken as the right column and serves as our target class label. Given the inputs from the design matrix, our goal is to correctly predict the target output values.

To obtain perfect classification accuracy on this problem, we need a feedforward neural network with at least one hidden layer, so let's start with a 2-2-1 architecture (Figure 9, top). This is a good start; however, we have forgotten to include the bias term. As we know from Chapter 9, there are two ways to include the bias term b in our network. We can:

1. Use a separate variable.

2. Treat the bias as a trainable parameter within the weight matrix by inserting a column of 1's into the feature vectors.

Inserting a column of 1's into the feature vectors is done programmatically, but to make sure we understand it, let's visualize the XOR design matrix with this change applied (Table 2, right). As you can see, a column of 1's has been appended to our feature vectors. In practice you can insert this column anywhere you like, but we typically place it either as the first or the last entry of the feature vector.
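This bias trick is a one-liner in NumPy. As a minimal sketch (the variable names here are illustrative), appending a column of 1's to the XOR design matrix looks like:

```python
import numpy as np

# the XOR design matrix: four 2-dim feature vectors
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])

# append a column of 1's as the last entry of each feature
# vector -- the bias is now just another trainable weight
X_bias = np.c_[X, np.ones((X.shape[0]))]
print(X_bias)
```

Each row now ends in a 1, matching Table 2 (right); the weight multiplying that constant 1 plays the role of the bias b.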

Since we have changed the size of the input feature vector (this is usually done inside the neural network implementation itself, so we do not need to explicitly modify our design matrix), the (perceived) network architecture changes from 2-2-1 to the (internal) 3-3-1 (Figure 9, bottom).

We still refer to this architecture as 2-2-1, but as implemented it is actually 3-3-1 due to the bias term embedded in the weight matrix.

Finally, recall that our input layer and all hidden layers require a bias term, while the final output layer does not. The advantage of this bias trick is that we no longer need to explicitly keep track of a bias parameter: it is now a trainable parameter within the weight matrix, making training more efficient and substantially easier to implement. See Chapter 9 for a more in-depth discussion of why this bias trick works.

To see the forward pass in action, we first initialize the weights in the network, as shown in Figure 10. Notice how each arrow in the weight matrix has a value associated with it: this is the current weight of a given node, indicating how much a given input is amplified or attenuated. These weight values will be updated during the backpropagation phase.

On the far left of Figure 10, we present the feature vector (0, 1, 1) (and the network's target output value, 1). Here we can see that 0, 1, and 1 have been assigned to the three input nodes of the network. To propagate the values through the network and obtain the final classification, we take the dot product between the inputs and the weights, then apply the activation function (in this case, the sigmoid function, σ).

Let's compute the value of each of the three nodes in the hidden layer:

1. σ((0 × 0.351) + (1 × 1.076) + (1 × 1.116)) = 0.899

2. σ((0 × 0.097) + (1 × -0.165) + (1 × 0.542)) = 0.593

3. σ((0 × 0.457) + (1 × -0.165) + (1 × -0.331)) = 0.378

Looking at the node values of the hidden layer (Figure 10, middle), we can see that the nodes have been updated to reflect our computation.

We now have the values of the hidden layer nodes. To compute the output prediction, we once again take a dot product, followed by a sigmoid activation:

The output of the network is therefore 0.506. We can apply a step function to determine whether this output is the correct classification:

Applying the step function to net = 0.506, the network predicts 1, which is indeed the correct class label. However, our network is not very confident in this label: the predicted value, 0.506, is very close to the step threshold. Ideally, the prediction should be closer to 0.98-0.99, which would mean the network has truly learned the underlying pattern in the dataset. For our network to actually "learn", we need to apply backpropagation.
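The hand calculations above can be verified in a few lines of NumPy. This is a sketch, assuming hidden-layer weights of (0.351, 1.076, 1.116), (0.097, -0.165, 0.542), and (0.457, -0.165, -0.331), values chosen to reproduce the three results computed above; the output layer is omitted because the dot-product-then-sigmoid pattern is identical:

```python
import numpy as np

def sigmoid(x):
    # the sigmoid activation function
    return 1.0 / (1 + np.exp(-x))

# feature vector with the bias entry appended: (0, 1, 1)
x = np.array([0.0, 1.0, 1.0])

# assumed hidden-layer weights, one row per hidden node
W = np.array([[0.351,  1.076,  1.116],
              [0.097, -0.165,  0.542],
              [0.457, -0.165, -0.331]])

# dot product followed by the sigmoid activation gives the
# hidden-layer values 0.899, 0.593, and 0.378 from the text
hidden = sigmoid(W.dot(x))
print(hidden)

# the step function turns the final net output into a class label
step = lambda net: 1 if net > 0.5 else 0
print(step(0.506))  # -> 1, matching the prediction in the text
```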

## 2. Backward pass

To apply the backpropagation algorithm, our activation function must be differentiable so that we can compute the partial derivative of the error E with respect to a given weight w_{i,j}, node output o_j, and net input net_j.

Since the calculus behind backpropagation has been covered in detail many times in previous works (see Andrew Ng, Michael Nielsen, and Matt Mazur above), we skip the derivation of the chain-rule update here and instead explain it through the code that follows.

For readers comfortable with the mathematics, please refer to the resources above for more information on the chain rule and its role in backpropagation. Walking through the process in code gives a more intuitive understanding of backpropagation.
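As a concrete example of the differentiability requirement, the sigmoid has the convenient closed-form derivative σ'(x) = σ(x)(1 − σ(x)), which is exactly what the sigmoid_deriv method in the implementation below exploits (there, the argument is assumed to already be a sigmoid output). A quick numerical sanity check of that identity against a central finite difference:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1 + np.exp(-x))

def sigmoid_deriv_analytic(x):
    # closed form: sigma'(x) = sigma(x) * (1 - sigma(x))
    s = sigmoid(x)
    return s * (1 - s)

x, h = 0.5, 1e-5

# central finite-difference approximation of the derivative
numeric = (sigmoid(x + h) - sigmoid(x - h)) / (2 * h)
analytic = sigmoid_deriv_analytic(x)

# the two values agree to high precision
print(analytic, numeric)
```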

## 3. Backpropagation with Python

Let's create a file named neuralnetwork.py:

```python
# import the necessary packages
import numpy as np

class NeuralNetwork:
    def __init__(self, layers, alpha=0.1):
        # layers: a list of integers representing the actual
        # architecture of the feedforward network
        # alpha: the learning rate applied during the weight
        # update phase
        # initialize the list of weight matrices W, then store
        # the architecture and learning rate
        self.W = []
        self.layers = layers
        self.alpha = alpha

        # start looping from the index of the first layer but
        # stop before we reach the last two layers
        for i in np.arange(0, len(layers) - 2):
            # randomly initialize a weight matrix connecting each
            # pair of layers by sampling from a standard normal
            # distribution; to account for the bias we add one to
            # layers[i] and layers[i + 1] -- for the XOR network
            # this makes w a 3x3 matrix
            w = np.random.randn(layers[i] + 1, layers[i + 1] + 1)

            # scale w by dividing by the square root of the number
            # of nodes in the current layer, normalizing the
            # variance of each neuron's output
            self.W.append(w / np.sqrt(layers[i]))

        # the last two layers are a special case where the input
        # connections need a bias term but the output does not
        w = np.random.randn(layers[-2] + 1, layers[-1])
        self.W.append(w / np.sqrt(layers[-2]))

    def __repr__(self):
        # construct and return a string that represents the network
        # architecture
        return "NeuralNetwork: {}".format(
            "-".join(str(l) for l in self.layers))

    def sigmoid(self, x):
        # compute and return the sigmoid activation value for a
        # given input value
        return 1.0 / (1 + np.exp(-x))

    def sigmoid_deriv(self, x):
        # compute the derivative of the sigmoid function ASSUMING
        # that 'x' has already been passed through the 'sigmoid'
        # function
        return x * (1 - x)

    def fit(self, X, y, epochs=1000, displayUpdate=100):
        # insert a column of 1's as the last entry in the feature
        # matrix -- this little trick allows us to treat the bias
        # as a trainable parameter within the weight matrix
        X = np.c_[X, np.ones((X.shape[0]))]

        # loop over the desired number of epochs
        for epoch in np.arange(0, epochs):
            # loop over each individual data point and train
            # our network on it
            for (x, target) in zip(X, y):
                self.fit_partial(x, target)

            # check to see if we should display a training update
            if epoch == 0 or (epoch + 1) % displayUpdate == 0:
                loss = self.calculate_loss(X, y)
                print("[INFO] epoch={}, loss={:.7f}".format(
                    epoch + 1, loss))

    def fit_partial(self, x, y):
        # construct our list of output activations for each layer
        # as our data point flows through the network; the first
        # activation is a special case -- it's just the input
        # feature vector itself
        A = [np.atleast_2d(x)]

        # FEEDFORWARD:
        # loop over the layers in the network
        for layer in np.arange(0, len(self.W)):
            # feedforward the activation at the current layer by
            # taking the dot product between the activation and
            # the weight matrix -- this is called the "net input"
            # to the current layer
            net = A[layer].dot(self.W[layer])

            # computing the "net output" is simply applying our
            # nonlinear activation function to the net input
            out = self.sigmoid(net)

            # once we have the net output, add it to our list of
            # activations
            A.append(out)

        # BACKPROPAGATION
        # the first phase of backpropagation is to compute the
        # difference between our *prediction* (the final output
        # activation in the activations list) and the true target
        # value
        error = A[-1] - y

        # from here, we need to apply the chain rule and build our
        # list of deltas 'D'; the first entry in the deltas is
        # simply the error of the output layer times the derivative
        # of our activation function for the output value
        D = [error * self.sigmoid_deriv(A[-1])]

        # once you understand the chain rule it becomes super easy
        # to implement with a 'for' loop -- simply loop over the
        # layers in reverse order (ignoring the last two since we
        # already have taken them into account)
        for layer in np.arange(len(A) - 2, 0, -1):
            # the delta for the current layer is equal to the delta
            # of the *previous layer* dotted with the weight matrix
            # of the current layer, followed by multiplying the delta
            # by the derivative of the nonlinear activation function
            # for the activations of the current layer
            delta = D[-1].dot(self.W[layer].T)
            delta = delta * self.sigmoid_deriv(A[layer])
            D.append(delta)

        # since we looped over our layers in reverse order we need to
        # reverse the deltas
        D = D[::-1]

        # WEIGHT UPDATE PHASE
        # loop over the layers
        for layer in np.arange(0, len(self.W)):
            # update our weights by taking the dot product of the layer
            # activations with their respective deltas, then multiplying
            # this value by some small learning rate and adding to our
            # weight matrix -- this is where the actual "learning" takes
            # place
            self.W[layer] += -self.alpha * A[layer].T.dot(D[layer])

    def predict(self, X, addBias=True):
        # initialize the output prediction as the input features -- this
        # value will be (forward) propagated through the network to
        # obtain the final prediction
        p = np.atleast_2d(X)

        # check to see if the bias column should be added
        if addBias:
            # insert a column of 1's as the last entry in the feature
            # matrix (bias)
            p = np.c_[p, np.ones((p.shape[0]))]

        # loop over our layers in the network
        for layer in np.arange(0, len(self.W)):
            # computing the output prediction is as simple as taking
            # the dot product between the current activation value 'p'
            # and the weight matrix associated with the current layer,
            # then passing this value through a nonlinear activation
            # function
            p = self.sigmoid(np.dot(p, self.W[layer]))

        # return the predicted value
        return p

    def calculate_loss(self, X, targets):
        # make predictions for the input data points then compute
        # the loss
        targets = np.atleast_2d(targets)
        predictions = self.predict(X, addBias=False)
        loss = 0.5 * np.sum((predictions - targets) ** 2)

        # return the loss
        return loss
```
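To confirm what the constructor above produces for our 2-2-1 XOR architecture, here is a standalone sketch of just the initialization loop; with the bias trick, the two weight matrices come out as 3×3 and 3×1:

```python
import numpy as np

# a 2-2-1 architecture, as used for the XOR example below
layers = [2, 2, 1]
W = []

# every layer except the last two gets a bias node on both sides
for i in np.arange(0, len(layers) - 2):
    w = np.random.randn(layers[i] + 1, layers[i + 1] + 1)
    W.append(w / np.sqrt(layers[i]))

# the final weight matrix: bias on the input side only
w = np.random.randn(layers[-2] + 1, layers[-1])
W.append(w / np.sqrt(layers[-2]))

print([m.shape for m in W])  # [(3, 3), (3, 1)]
```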

## 4. Example 1 - XOR dataset

Create a file named nn_xor.py:

```python
# import the necessary packages
from neuralnetwork import NeuralNetwork
import numpy as np

# construct the XOR dataset
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y = np.array([[0], [1], [1], [0]])

# define our 2-2-1 neural network and train it
nn = NeuralNetwork([2, 2, 1], alpha=0.5)
nn.fit(X, y, epochs=20000)

# now that the network is trained, loop over the XOR data points
for (x, target) in zip(X, y):
    # make a prediction on the data point and print the result
    pred = nn.predict(x)[0][0]
    step = 1 if pred > 0.5 else 0
    print("[INFO] data={}, ground-truth={}, pred={:.4f}, step={}".format(
        x, target[0], pred, step))
```

Executing the file produces the following output:

For each data point, the neural network correctly learns the XOR pattern, demonstrating that our multilayer neural network can learn nonlinear functions.

To demonstrate that learning the XOR function requires at least one hidden layer, modify the line nn = NeuralNetwork([2, 2, 1], alpha=0.5) in the code above to nn = NeuralNetwork([2, 1], alpha=0.5), then retrain to obtain the following output.

No matter how you tune the learning rate or the weight initialization, a network with no hidden layer can never approximate the XOR function. This is why multilayer networks with nonlinear activation functions, trained via backpropagation, are so important.
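To see why the 2-1 network fails, note that a single sigmoid unit with a step threshold is just a linear classifier, and no linear decision boundary can classify more than 3 of the 4 XOR points correctly. A brute-force sketch (assuming a threshold of 0 on w1·x1 + w2·x2 + b) makes this concrete:

```python
import numpy as np
from itertools import product

# the XOR dataset
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y = np.array([0, 1, 1, 0])

# exhaustively try linear classifiers over a grid of weights
# and biases, tracking the best accuracy achieved
grid = np.linspace(-2, 2, 21)
best = 0.0
for w1, w2, b in product(grid, grid, grid):
    preds = (X.dot([w1, w2]) + b > 0).astype(int)
    best = max(best, (preds == y).mean())

print(best)  # 0.75 -- at best 3 of the 4 points are correct
```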

## 5. Example 2 - MNIST dataset

The second example uses a subset of the MNIST dataset (below) for handwritten digit recognition. This subset of MNIST is built into the scikit-learn library and includes 1,797 example digits, each an 8 × 8 grayscale image (the original MNIST images are 28 × 28). When flattened, each image is represented by an 8 × 8 = 64-dimensional vector.
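A quick way to confirm those dimensions (this assumes scikit-learn is installed, which the example below requires anyway):

```python
from sklearn import datasets

# load the 8x8 digits subset that ships with scikit-learn
digits = datasets.load_digits()

# 1797 samples: flattened 64-dim vectors and 8x8 images
print(digits.data.shape)    # (1797, 64)
print(digits.images.shape)  # (1797, 8, 8)
```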

Create a file named nn_mnist.py:

```python
# import the necessary packages
from neuralnetwork import NeuralNetwork
from sklearn.preprocessing import LabelBinarizer
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report
from sklearn import datasets

# load the MNIST dataset and apply min/max scaling to scale the
# pixel intensity values to the range [0, 1] (each image is
# represented by an 8 x 8 = 64-dim feature vector)
print("[INFO] loading MNIST (sample) dataset...")
digits = datasets.load_digits()
data = digits.data.astype("float")
data = (data - data.min()) / (data.max() - data.min())
print("[INFO] samples: {}, dim: {}".format(data.shape[0],
    data.shape[1]))

# construct the training and testing splits
(trainX, testX, trainY, testY) = train_test_split(data,
    digits.target, test_size=0.25)

# convert the labels from integers to vectors
trainY = LabelBinarizer().fit_transform(trainY)
testY = LabelBinarizer().fit_transform(testY)

# train the network
print("[INFO] training network...")
nn = NeuralNetwork([trainX.shape[1], 32, 16, 10])
print("[INFO] {}".format(nn))
nn.fit(trainX, trainY, epochs=1000)

# evaluate the network
print("[INFO] evaluating network...")
predictions = nn.predict(testX)
predictions = predictions.argmax(axis=1)
print(classification_report(testY.argmax(axis=1), predictions))
```

Run the file to obtain the following output:

We obtain roughly 98% classification accuracy on the test set.

# 3, Summary

We have learned how to implement the backpropagation algorithm from scratch using Python. Backpropagation is a generalization of the gradient descent family of algorithms, used specifically to train multilayer feedforward networks.

The backpropagation algorithm consists of two stages:

1. The forward pass, where we pass our input through the network to obtain our output classification.

2. The backward pass (i.e., the weight update phase), where we compute the gradient of the loss function and iteratively apply the chain rule to update the weights in the network.

Backpropagation is used to train both simple feedforward neural networks and complex deep convolutional neural networks. This works by ensuring that the activation functions inside the network are differentiable, allowing the chain rule to be applied. In addition, any other layers in the network with weights/parameters to be updated must also be compatible with backpropagation.

We implemented a simple backpropagation algorithm in Python by designing a multilayer feedforward NeuralNetwork class. We trained this implementation on the XOR dataset to show that, given at least one hidden layer, our neural network can learn nonlinear functions via backpropagation. We then applied the same backpropagation + Python implementation to a subset of the MNIST dataset to show that the algorithm also works on image data.

In practice, backpropagation is not only tricky to implement correctly (gradient computations are error-prone), it is also hard to make efficient without specialized optimization libraries. This is why we typically rely on libraries such as Keras, TensorFlow, and mxnet, which implement backpropagation correctly using well-optimized strategies.