
Neural Networks to Solve Multiclass Classification Problems. Example: Digit Recognition
 1. Visualizing the data
 2. Designing the Neural Network
 3. Write the cost function calculation function (nnCostFunction)
 4. Advanced optimization training neural network (Learning parameters using fmincg)
 5. Calculate empirical error / adjust parameters / model evaluation (Trying out different learning settings)
 6. Visualizing the hidden layer
1. Visualizing the data
The training set provides 5000 digit images. Each image is 20x20 pixels and is unrolled into a 1x400 vector, so the sample input is a 5000x400 matrix and the output is a 5000x1 vector of labels. Coursera provides a function to convert the grayscale values back into pictures, but this does not substantially help us solve the problem.
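As a quick sketch of what the data looks like (assuming X and y have already been loaded, e.g. from the course's ex4data1.mat), a single 1x400 row can be reshaped back into a 20x20 image:

```octave
% Sketch: view one training image (assumes X is 5000x400 and y is 5000x1,
% e.g. loaded via: load('ex4data1.mat');)
img = reshape(X(1, :), 20, 20);    % unroll the 1x400 row back into 20x20
imagesc(img'); colormap(gray);     % transpose: pixels are stored column-wise
title(sprintf('Label: %d', y(1)));
```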
2. Designing the Neural Network
Since each sample input is a 1x400 vector, there should be 400 input neurons. We predict a total of 10 digits, so there should be 10 output neurons. Since the problem is not complicated, a single hidden layer is enough (the neural network used in the early autonomous-driving experiments had 3 hidden layers), and we use 25 neurons in the hidden layer.
Define three variables to store the number of neurons in each layer, which is convenient for subsequent calls.
input_layer_size  = 400;
hidden_layer_size = 25;
output_layer_size = 10;
3. Write the cost function calculation function (nnCostFunction)
3.1 Implement the necessary tool functions
Several operations recur throughout the implementation of a BP neural network. If you don't encapsulate them as functions, it's easy to confuse yourself when writing them inline.

sigmoid function
function g = sigmoid(z)
    g = 1 ./ (1 + exp(-z));
end

addBias function
We did not include the bias neuron when designing the network, but during forward propagation the bias parameter \(\theta_0\) is needed to compute the next layer's input, so adding the bias column is encapsulated here as a function.
function out = addBias(X)
    out = [ones(size(X,1), 1), X];
end

oneHot function
One-hot encoding converts the m x 1 sample output y into an m x output_layer_size matrix: each row has output_layer_size entries, indicating whether each output neuron should be 0 or 1.
function out = oneHot(y, output_layer_size)
    out = zeros(size(y,1), output_layer_size);
    for i = 1:size(y,1)
        out(i, y(i)) = 1;
    end
end
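A quick sanity check of the encoding (a hypothetical example, assuming oneHot is on the path):

```octave
% Encode labels 3 and 1 with 4 output neurons
Y = oneHot([3; 1], 4)
% Y = [0 0 1 0;
%      1 0 0 0]
```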
3.2 Normalized matrix definition
When implementing a neural network, matrix-dimension errors may be the problem beginners hit most often. Every time a dimension mismatch appears, you typically have to re-derive the correct dimensions step by step from the input layer and rework the order of the matrix computations in the code. When this happens, the logic gets muddled: a matrix is written as m*n one moment and n*m the next. If you design the representation of each matrix in advance, standardizing the actual meaning of rows and columns before writing code, you can correct errors quickly once found, or avoid them entirely.
Here we follow the matrix-definition conventions from coursera's lecture notes:
| Matrix name | Common notation | Matrix size | Row meaning | Column meaning |
|---|---|---|---|---|
| input value matrix | \(X, z^{(i)}\) | m * n | number of samples | number of neurons in this layer (number of attributes) |
| parameter matrix | \(\Theta^{(j)}\) | b * (a+1) | number of neurons in this layer (number of activation units in this layer) | number of neurons in the previous layer + 1 (attributes per activation unit + the bias attribute) |
| output value matrix | \(Y, a^{(i)}\) | m * k | number of samples | number of neurons in this layer (number of output values) |
Another shorthand used below: input_layer_size = n, hidden_layer_size = l, output_layer_size = K.
3.3 Implement forward propagation (Feedforward)
Forward propagation is the process of computing the output value of each output neuron. The computation can be fully vectorized.
Define the forward pass of the neural network in the nnCostFunction function
Input arguments: [nn_params (all parameter values \(\theta\) unrolled into a vector), input_layer_size, hidden_layer_size, output_layer_size, X, y, lambda]
function [J grad] = nnCostFunction(nn_params, ...
                                   input_layer_size, ...
                                   hidden_layer_size, ...
                                   output_layer_size, ...
                                   X, y, lambda)
% Use reshape to restructure the vector nn_params into the two matrices
% Theta1 and Theta2. Note that both matrices include the bias neurons.
Theta1 = reshape(nn_params(1:hidden_layer_size * (input_layer_size + 1)), ...
                 hidden_layer_size, (input_layer_size + 1));
Theta2 = reshape(nn_params((1 + (hidden_layer_size * (input_layer_size + 1))):end), ...
                 output_layer_size, (hidden_layer_size + 1));

m = size(X, 1);                      % number of samples
J = 0;                               % cost value
Theta1_grad = zeros(size(Theta1));   % gradient values
Theta2_grad = zeros(size(Theta2));
% loading finished
% ==============================================================================
% Part 1: implement forward propagation
z2 = addBias(X) * Theta1';
a2 = sigmoid(z2);
z3 = addBias(a2) * Theta2';
a3 = sigmoid(z3);

% one-hot encoding
encodeY = oneHot(y, output_layer_size);

% tempTheta is used to compute the regularization term; the theta of each
% bias neuron is set to 0 so that it is not regularized
tempTheta2 = Theta2; tempTheta2(:,1) = 0;
tempTheta1 = Theta1; tempTheta1(:,1) = 0;

J = 1/m * sum(sum(-encodeY .* log(a3) - (1-encodeY) .* log(1-a3))) ...
    + lambda/(2*m) * (sum(sum(tempTheta1 .^ 2)) + sum(sum(tempTheta2 .^ 2)));
Cost function calculation formula
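For reference, the regularized cost implemented above is:

$$J(\Theta) = \frac{1}{m}\sum_{i=1}^{m}\sum_{k=1}^{K}\left[-y_k^{(i)}\log\left((h_\Theta(x^{(i)}))_k\right)-\left(1-y_k^{(i)}\right)\log\left(1-(h_\Theta(x^{(i)}))_k\right)\right] + \frac{\lambda}{2m}\sum_{l=1}^{L-1}\sum_{i=1}^{s_l}\sum_{j=1}^{s_{l+1}}\left(\Theta_{ji}^{(l)}\right)^2$$

where \(K\) is the number of output neurons and the regularization sum skips the bias columns, matching tempTheta1 and tempTheta2 in the code.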
Neural networks are expressive enough to fit highly complex models, so without regularization they are likely to overfit: the empirical (training) error becomes small, but the generalization error does not.
3.4 Implementing Backpropagation
This section is the core of the backpropagation algorithm, and it is also the hardest part to understand and the most complicated to code. It is a headache for beginners; I was stuck on this part for 3 days without understanding it.
This part involves the chain rule and matrix derivatives; the chain rule is a must. If you cannot do the matrix derivation directly, you can deduce the result from the matrix dimensions (which is slower).
A video on bilibili is recommended that explains the gradient computation of the backpropagation algorithm in detail. The author also posted a Python implementation of Andrew Ng's neural-network digit-recognition exercise. I haven't watched that part, but the general idea is surely similar to the matlab implementation, and it works well as an explanation of backpropagation:
"Teaching everyone to implement Andrew Ng's deep learning homework, week 2, 06: backpropagation derivation"
The backpropagation algorithm is used to compute the gradients. The gradient of each parameter \(\theta\) is derived via the chain rule and matrix calculus.
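In symbols, using \(g'(z^{(l)}) = a^{(l)} \odot (1-a^{(l)})\) for the sigmoid derivative, the error terms and gradients the code below computes are:

$$\delta^{(3)} = a^{(3)} - y$$
$$\delta^{(2)} = \left(\left(\Theta^{(2)}\right)^{T}\delta^{(3)}\right) \odot a^{(2)} \odot \left(1-a^{(2)}\right)$$
$$\frac{\partial J}{\partial \Theta^{(l)}} = \frac{1}{m}\,\delta^{(l+1)}\left(a^{(l)}\right)^{T} + \frac{\lambda}{m}\,\Theta^{(l)}$$

where the bias column of \(\Theta^{(2)}\) is dropped when propagating \(\delta^{(3)}\) backwards, and the bias column of \(\Theta^{(l)}\) is not regularized (hence tempTheta in the code).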
Theta2_grad = 1/m * (a3 - encodeY)' * addBias(a2) + lambda * tempTheta2 / m;
Theta1_grad = 1/m * ((Theta2(:,2:end)' * (a3 - encodeY)') .* a2' .* (1 - a2') * addBias(X)) ...
              + lambda * tempTheta1 / m;

% Unroll gradients
grad = [Theta1_grad(:) ; Theta2_grad(:)];
4. Advanced optimization training neural network (Learning parameters using fmincg)
4.1 Random initialization parameters
function W = randInitializeWeights(L_in, L_out)
    W = zeros(L_out, 1 + L_in);
    % Randomly initialize the weights to small values
    epsilon_init = 0.12;
    W = rand(L_out, 1 + L_in) * 2 * epsilon_init - epsilon_init;
end
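One effective way to choose \(\epsilon_{init}\), from the course notes, is to base it on the number of units adjacent to the layer:

$$\epsilon_{init} = \frac{\sqrt{6}}{\sqrt{L_{in} + L_{out}}} = \frac{\sqrt{6}}{\sqrt{400 + 25}} \approx 0.12$$

which is where the hard-coded 0.12 above comes from for this architecture.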
The importance of random initialization for neural networks (breaking symmetry between neurons) is basic knowledge; if it is unclear, revisit Andrew Ng's lectures.
initial_Theta1 = randInitializeWeights(input_layer_size, hidden_layer_size);
initial_Theta2 = randInitializeWeights(hidden_layer_size, output_layer_size);

% Unroll into a vector, easy to pass as a parameter
initial_nn_params = [initial_Theta1(:) ; initial_Theta2(:)];
4.2 Calling Advanced Optimization Functions to Train Neural Networks
This is the advanced optimization function fmincg provided by coursera. Learn how to call it before using it; if it doesn't work for you, use fminunc instead, which is slightly slower.
function [X, fX, i] = fmincg(f, X, options, P1, P2, P3, P4, P5)
% Minimize a continuous differentiable multivariate function. Starting point
% is given by "X" (D by 1), and the function named in the string "f", must
% return a function value and a vector of partial derivatives. The Polack-
% Ribiere flavour of conjugate gradients is used to compute search directions,
% and a line search using quadratic and cubic polynomial approximations and the
% Wolfe-Powell stopping criteria is used together with the slope ratio method
% for guessing initial step sizes. Additionally a bunch of checks are made to
% make sure that exploration is taking place and that extrapolation will not
% be unboundedly large. The "length" gives the length of the run: if it is
% positive, it gives the maximum number of line searches, if negative its
% absolute gives the maximum allowed number of function evaluations. You can
% (optionally) give "length" a second component, which will indicate the
% reduction in function value to be expected in the first line-search (defaults
% to 1.0). The function returns when either its length is up, or if no further
% progress can be made (ie, we are at a minimum, or so close that due to
% numerical problems, we cannot get any closer). If the function terminates
% within a few iterations, it could be an indication that the function value
% and derivatives are not consistent (ie, there may be a bug in the
% implementation of your "f" function). The function returns the found
% solution "X", a vector of function values "fX" indicating the progress made
% and "i" the number of iterations (line searches or function evaluations,
% depending on the sign of "length") used.
%
% Usage: [X, fX, i] = fmincg(f, X, options, P1, P2, P3, P4, P5)
%
% See also: checkgrad
%
% Copyright (C) 2001 and 2002 by Carl Edward Rasmussen. Date 2002-02-13
%
% (C) Copyright 1999, 2000 & 2001, Carl Edward Rasmussen
%
% Permission is granted for anyone to copy, use, or modify these
% programs and accompanying documents for purposes of research or
% education, provided this copyright notice is retained, and note is
% made of any changes that have been made.
%
% These programs and documents are distributed without any warranty,
% express or implied. As the programs were written for research
% purposes only, they have not been tested to the degree that would be
% advisable in any important application. All use of these programs is
% entirely at the user's own risk.
%
% [ml-class] Changes Made:
% 1) Function name and argument specifications
% 2) Output display

% Read options
if exist('options', 'var') && ~isempty(options) && isfield(options, 'MaxIter')
    length = options.MaxIter;
else
    length = 100;
end

RHO = 0.01;                            % a bunch of constants for line searches
SIG = 0.5;       % RHO and SIG are the constants in the Wolfe-Powell conditions
INT = 0.1;    % don't reevaluate within 0.1 of the limit of the current bracket
EXT = 3.0;                    % extrapolate maximum 3 times the current bracket
MAX = 20;                         % max 20 function evaluations per line search
RATIO = 100;                                      % maximum allowed slope ratio

argstr = ['feval(f, X'];              % compose string used to call function
for i = 1:(nargin - 3)
  argstr = [argstr, ',P', int2str(i)];
end
argstr = [argstr, ')'];

if max(size(length)) == 2, red=length(2); length=length(1); else red=1; end
S=['Iteration '];

i = 0;                                            % zero the run length counter
ls_failed = 0;                             % no previous line search has failed
fX = [];
[f1 df1] = eval(argstr);                      % get function value and gradient
i = i + (length<0);                                            % count epochs?!
s = -df1;                                        % search direction is steepest
d1 = -s'*s;                                                 % this is the slope
z1 = red/(1-d1);                                  % initial step is red/(|s|+1)

while i < abs(length)                                      % while not finished
  i = i + (length>0);                                      % count iterations?!

  X0 = X; f0 = f1; df0 = df1;                   % make a copy of current values
  X = X + z1*s;                                             % begin line search
  [f2 df2] = eval(argstr);
  i = i + (length<0);                                          % count epochs?!
  d2 = df2'*s;
  f3 = f1; d3 = d1; z3 = -z1;             % initialize point 3 equal to point 1
  if length>0, M = MAX; else M = min(MAX, -length-i); end
  success = 0; limit = -1;                             % initialize quantities
  while 1
    while ((f2 > f1+z1*RHO*d1) || (d2 > -SIG*d1)) && (M > 0)
      limit = z1;                                         % tighten the bracket
      if f2 > f1
        z2 = z3 - (0.5*d3*z3*z3)/(d3*z3+f2-f3);                 % quadratic fit
      else
        A = 6*(f2-f3)/z3+3*(d2+d3);                                 % cubic fit
        B = 3*(f3-f2)-z3*(d3+2*d2);
        z2 = (sqrt(B*B-A*d2*z3*z3)-B)/A;       % numerical error possible - ok!
      end
      if isnan(z2) || isinf(z2)
        z2 = z3/2;                  % if we had a numerical problem then bisect
      end
      z2 = max(min(z2, INT*z3),(1-INT)*z3);  % don't accept too close to limits
      z1 = z1 + z2;                                           % update the step
      X = X + z2*s;
      [f2 df2] = eval(argstr);
      M = M - 1; i = i + (length<0);                           % count epochs?!
      d2 = df2'*s;
      z3 = z3-z2;                    % z3 is now relative to the location of z2
    end
    if f2 > f1+z1*RHO*d1 || d2 > -SIG*d1
      break;                                                % this is a failure
    elseif d2 > SIG*d1
      success = 1; break;                                             % success
    elseif M == 0
      break;                                                          % failure
    end
    A = 6*(f2-f3)/z3+3*(d2+d3);                      % make cubic extrapolation
    B = 3*(f3-f2)-z3*(d3+2*d2);
    z2 = -d2*z3*z3/(B+sqrt(B*B-A*d2*z3*z3));        % num. error possible - ok!
    if ~isreal(z2) || isnan(z2) || isinf(z2) || z2 < 0 % num prob or wrong sign?
      if limit < -0.5                               % if we have no upper limit
        z2 = z1 * (EXT-1);                % then extrapolate the maximum amount
      else
        z2 = (limit-z1)/2;                                   % otherwise bisect
      end
    elseif (limit > -0.5) && (z2+z1 > limit)        % extrapolation beyond max?
      z2 = (limit-z1)/2;                                               % bisect
    elseif (limit < -0.5) && (z2+z1 > z1*EXT)      % extrapolation beyond limit
      z2 = z1*(EXT-1.0);                           % set to extrapolation limit
    elseif z2 < -z3*INT
      z2 = -z3*INT;
    elseif (limit > -0.5) && (z2 < (limit-z1)*(1.0-INT))  % too close to limit?
      z2 = (limit-z1)*(1.0-INT);
    end
    f3 = f2; d3 = d2; z3 = -z2;                  % set point 3 equal to point 2
    z1 = z1 + z2; X = X + z2*s;                      % update current estimates
    [f2 df2] = eval(argstr);
    M = M - 1; i = i + (length<0);                             % count epochs?!
    d2 = df2'*s;
  end                                                      % end of line search

  if success                                         % if line search succeeded
    f1 = f2; fX = [fX' f1]';
    fprintf('%s %4i | Cost: %4.6e\r', S, i, f1);
    s = (df2'*df2-df1'*df2)/(df1'*df1)*s - df2;      % Polack-Ribiere direction
    tmp = df1; df1 = df2; df2 = tmp;                         % swap derivatives
    d2 = df1'*s;
    if d2 > 0                                      % new slope must be negative
      s = -df1;                              % otherwise use steepest direction
      d2 = -s'*s;
    end
    z1 = z1 * min(RATIO, d1/(d2-realmin));          % slope ratio but max RATIO
    d1 = d2;
    ls_failed = 0;                              % this line search did not fail
  else
    X = X0; f1 = f0; df1 = df0;  % restore point from before failed line search
    if ls_failed || i > abs(length)         % line search failed twice in a row
      break;                             % or we ran out of time, so we give up
    end
    tmp = df1; df1 = df2; df2 = tmp;                         % swap derivatives
    s = -df1;                                                    % try steepest
    d1 = -s'*s;
    z1 = 1/(1-d1);
    ls_failed = 1;                                    % this line search failed
  end
  if exist('OCTAVE_VERSION')
    fflush(stdout);
  end
end
fprintf('\n');
The calling code:
options = optimset('MaxIter', 50);

% You should also try different values of lambda
lambda = 1;

% Create "short hand" for the cost function to be minimized
costFunction = @(p) nnCostFunction(p, ...
                                   input_layer_size, ...
                                   hidden_layer_size, ...
                                   output_layer_size, X, y, lambda);

% Now, costFunction is a function that takes in only one argument (the
% neural network parameters)
[nn_params, cost] = fmincg(costFunction, initial_nn_params, options);

% Obtain Theta1 and Theta2 back from nn_params
Theta1 = reshape(nn_params(1:hidden_layer_size * (input_layer_size + 1)), ...
                 hidden_layer_size, (input_layer_size + 1));
Theta2 = reshape(nn_params((1 + (hidden_layer_size * (input_layer_size + 1))):end), ...
                 output_layer_size, (hidden_layer_size + 1));
5. Calculate empirical error / adjust parameters / model evaluation (Trying out different learning settings)
function p = predict(Theta1, Theta2, X)
%PREDICT Predict the label of an input given a trained neural network
%   p = PREDICT(Theta1, Theta2, X) outputs the predicted label of X given the
%   trained weights of a neural network (Theta1, Theta2)

% Useful values
m = size(X, 1);
num_labels = size(Theta2, 1);

% You need to return the following variables correctly
p = zeros(size(X, 1), 1);

h1 = sigmoid([ones(m, 1) X] * Theta1');
h2 = sigmoid([ones(m, 1) h1] * Theta2');
[dummy, p] = max(h2, [], 2);
% =========================================================================
end
Calling it:
pred = predict(Theta1, Theta2, X);
fprintf('\nTraining Set Accuracy: %f\n', mean(double(pred == y)) * 100);
By adjusting the parameters, you can observe how the empirical error changes:
| \(\lambda\) | \(Max \ \ Iterations\) | \(accuracy\) |
|---|---|---|
| 1 | 50 | 94.34% ~ 96.00% |
| 1 | 100 | 98.64% |
| 1.5 | 1000 | 99.04% |
| 1.5 | 5000 (tortures the computer) | 99.26% |
| 2 | 2000 | 98.68% |
These are empirical (training) errors, while our goal is to reduce the generalization error. However, coursera does not provide a test set, so we will not measure the generalization error for now. Strictly speaking, there should also be a generalization-error evaluation step.
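Since no test set is provided, one possible sketch (under the assumption that X, y, and the training code above are available) is to hold out part of the 5000 samples yourself:

```octave
% Sketch: a simple hold-out split to estimate the generalization error.
% Shuffle the samples, keep 80% for training and 20% for testing.
m = size(X, 1);
idx = randperm(m);
m_train = round(0.8 * m);
Xtrain = X(idx(1:m_train), :);      ytrain = y(idx(1:m_train));
Xtest  = X(idx(m_train+1:end), :);  ytest  = y(idx(m_train+1:end));
% ...train on (Xtrain, ytrain) with fmincg as above, then:
% pred = predict(Theta1, Theta2, Xtest);
% fprintf('Test Set Accuracy: %f\n', mean(double(pred == ytest)) * 100);
```

The test accuracy obtained this way is what we would actually compare when tuning \(\lambda\) and the iteration count.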
6. Visualizing the hidden layer
This part is not essential, but it lets us see what the hidden-layer neurons are doing.
For each hidden-layer neuron, find a 1x400 input vector that makes its activation close to 1 (meaning it most strongly detects some particular pattern, while the other neurons stay close to 0), then render that vector as a 20x20 pixel image.
fprintf('\nVisualizing Neural Network... \n')
displayData(Theta1(:, 2:end));

fprintf('\nProgram paused. Press enter to continue.\n');
pause;