# Neural Networks to Solve Multi-Classification Problems Example: Digit Recognition

## 1. Visualizing the data

The training set provides 5000 digital images, each image is 20x20 pixels, and is converted into a 1x400 vector storage. The sample input is a 5000x400 matrix and the output is a 5000x1 vector. coursera provides functions to convert grayscale values ​​to pictures, but this does not help us substantially to solve the problem.

## 2. Designing Nural Network

Since each sample input is a 1x400 vector, there should be 400 input neurons. We predict a total of 10 numbers, so the output neuron should consist of 10. Since the problem is not complicated, just use 1 hidden layer (the neural network used in the early automatic driving test is 3 hidden layers), and the number of neurons in the hidden layer is 25.

Define three variables to store the number of neurons in each layer, which is convenient for subsequent calls.

input_layer_size = 400;
hidden_layer_size = 25;
output_layer_size = 10;


## 3. Write the cost function calculation function (nnCostFunction)

### 3.1 Implement the necessary tool functions

In the implementation process of BP neural network, several operations are often used. If you don't encapsulate it as a function, it's easy to confuse yourself by writing it.

• sigmoid function

function g = sigmoid(z)
g = 1 ./ (1+exp(-z));
end


Since we did not consider the bias neuron when designing the neural network, and during forward propagation, the parameters of the bias neuron must be used to calculate the input value of the next layer$$\theta_0$$, so it is encapsulated here as function.

function ans = addBias(X)
ans = [ones(size(X,1)),X];
end

• oneHot function

one-hot-encoding translates to one-hot encoding. It is used to convert m*1 sample output y to a matrix of m*output_layer_size. For each row, there are output_layer_size numbers, corresponding to whether each output neuron is 0 or 1.

function ans = oneHot(y,output_layer_size)
ans = zeros(size(y,1),output_layer_size);
for i = 1:size(y,1)
ans(i,y(i)) = 1;
end
end


### 3.2 Normalized matrix definition

In the implementation process of neural network, the error of matrix dimension may be one of the most frequent problems encountered by beginners. Every time the matrix dimension is found to be wrong, it is often necessary to derive the correct matrix dimension from the input layer step by step, and re-modify the calculation method and calculation order of the matrix in the code. When this happens, the logic is often confused when writing the code, and the matrix is ​​written as m*n at once, and the matrix is ​​written as n*m at once. If you can design the representation of each matrix in advance and standardize the actual meaning of rows and columns before writing code, you can quickly correct errors after they are found, or avoid errors directly.

Here we follow the specification of the matrix definition in coursera's lecture notes

matrix name Common code Matrix scale row meaning column meaning
input value matrix $$X,z^{(i)}$$ m * n number of samples Number of neurons in this layer (number of attributes)
parameter matrix $$\Theta_{(j)}$$ (a+1) * b The number of neurons in this layer (the number of activation functions in this layer) The number of neurons in the previous layer + 1 (the number of attributes of each activation function + the bias attribute)
output value matrix $$Y,a^{(i)}$$ m * k number of samples Number of neurons in this layer (number of output values)

Another short note: input_layer_size = n, hidden_layer_size = l, output_layer_size = K

### 3.3 Implement forward propagation (Feedforward)

Forward propagation is the process of computing the output value of each output neuron. It can vectorize computations. Define the forward pass of the neural network in the nnCostFunction function

Incoming parameters [ nn_params (expanded vector of all parameter values$$\theta$$), input_layer_size, hidden_layer_size, output_layer_size, X, y, lambda]

function [J grad] = nnCostFunction(nn_params, ...
input_layer_size, ...
hidden_layer_size, ...
output_layer_size, ...
X, y, lambda)
% use reshape The function converts the vector nn_params restructured into Theta1,Theta2 two matrices. Notice, Theta1，Theta2 two matrices
% All take into account the bias neurons.
Theta1 = reshape(nn_params(1:hidden_layer_size * (input_layer_size + 1)), ...
hidden_layer_size, (input_layer_size + 1));

Theta2 = reshape(nn_params((1 + (hidden_layer_size * (input_layer_size + 1))):end), ...
output_layer_size, (hidden_layer_size + 1));
% number of samples
m = size(X, 1);
% cost value
J = 0;

% ==============================================================================
% ==============================================================================
% Part1: implement forward propagation

% forward propagation
a2 = sigmoid(z2);
a3 = sigmoid(z3);
% one-hot encoding
encodeY = oneHot(y,output_layer_size);

% tempTheta Used to calculate the regular term, which will bias the corresponding neuron theta All set to 0
tempTheta2 = Theta2;
tempTheta2(:,1) = 0;
tempTheta1 = Theta1;
tempTheta1(:,1) = 0;

J = 1/m * sum(sum((-encodeY .* log(a3) - (1-encodeY) .* log(1-a3)))) + 1/(2*m) * lambda * (sum(sum(tempTheta1 .^2)) + sum(sum(tempTheta2 .^ 2)) );


Cost function calculation formula

$J(\theta) = \frac{1}{m}\sum_{i=1}^{m}\sum_{k=1}^{K}[-y_k^{(i)}\log{(h_\theta(x^{(i)})_k)} - (1-y_k^{(i)})\log{(1-h_\theta(x^{(i)})_k)}] + \frac{\lambda}{2m}[\sum_{i=1}^l\sum_{j=1}^n\Theta_{ij}^2 + \sum_{i=1}^K\sum_{j=1}^l\Theta_{ij}^2]$

The learning ability of neural networks is too strong to fit highly complex models. Therefore, if it is not regularized, it is likely to overfit. At this time, the empirical error is small and the generalization error is not small enough

### 3.4 Implementing Backpropagation

This section is the core part of the backpropagation algorithm, and it is also the most difficult to understand and the most complicated part of writing code. It must have been a headache for beginners, I was stuck on this part for 3 days and couldn't understand it.

This part designs the chain rule and matrix derivation, in which the chain rule must be known. If the matrix derivation is not possible, the result can be deduced first through the dimension of the matrix (the derivation is relatively slow).

A video is recommended at station b, which talks about the gradient value calculation of the backpropagation algorithm in detail. She also posted the Python implementation of Ng Enda's neural network solution for digit recognition. I haven't seen it, but the general idea is definitely similar to the matlab implementation, and it can be used as an explanation video for understanding the back propagation algorithm.

Teach everyone to realize Wu Enda's deep learning homework second week 06-backpropagation derivation

Backpropagation algorithm is used to solve the gradient calculation. The gradient of each parameter$$\theta$$ is derived by the chain rule and matrix calculation

Theta2_grad = 1/m*(a3-encodeY)' * addBias(a2) + lambda * tempTheta2/m;
Theta1_grad = 1/m*(Theta2(:,2:end)' * (a3-encodeY)' .* a2' .* (1-a2') * addBias(X)) + lambda * tempTheta1 / m;


## 4. Advanced optimization training neural network (Learning parameters using fmincg)

### 4.1 Random initialization parameters

function W = randInitializeWeights(L_in, L_out)

W = zeros(L_out, 1 + L_in);

% Randomly initialize the weights to small values
epsilon_init = 0.12;
W = rand(L_out, 1 + L_in) * 2 * epsilon_init - epsilon_init;

end


Needless to say, the importance of random initialization parameters for neural networks is basic knowledge. If you don't understand it, you can revisit Wu Enda's class.

initial_Theta1 = randInitializeWeights(input_layer_size, hidden_layer_size);
initial_Theta2 = randInitializeWeights(hidden_layer_size, output_layer_size);
% Expand into a vector, easy to pass parameters
initial_nn_params = [initial_Theta1(:) ; initial_Theta2(:)];


### 4.2 Calling Advanced Optimization Functions to Train Neural Networks

This is the advanced optimization function fmincg given by coursera. Learn how to use it before using it. If you can't use it, use fminunc, which is slightly slower.

function [X, fX, i] = fmincg(f, X, options, P1, P2, P3, P4, P5)
% Minimize a continuous differentialble multivariate function. Starting point
% is given by "X" (D by 1), and the function named in the string "f", must
% return a function value and a vector of partial derivatives. The Polack-
% Ribiere flavour of conjugate gradients is used to compute search directions,
% and a line search using quadratic and cubic polynomial approximations and the
% Wolfe-Powell stopping criteria is used together with the slope ratio method
% for guessing initial step sizes. Additionally a bunch of checks are made to
% make sure that exploration is taking place and that extrapolation will not
% be unboundedly large. The "length" gives the length of the run: if it is
% positive, it gives the maximum number of line searches, if negative its
% absolute gives the maximum allowed number of function evaluations. You can
% (optionally) give "length" a second component, which will indicate the
% reduction in function value to be expected in the first line-search (defaults
% to 1.0). The function returns when either its length is up, or if no further
% progress can be made (ie, we are at a minimum, or so close that due to
% numerical problems, we cannot get any closer). If the function terminates
% within a few iterations, it could be an indication that the function value
% and derivatives are not consistent (ie, there may be a bug in the
% implementation of your "f" function). The function returns the found
% solution "X", a vector of function values "fX" indicating the progress made
% and "i" the number of iterations (line searches or function evaluations,
% depending on the sign of "length") used.
%
% Usage: [X, fX, i] = fmincg(f, X, options, P1, P2, P3, P4, P5)
%
%
% Copyright (C) 2001 and 2002 by Carl Edward Rasmussen. Date 2002-02-13
%
%
% (C) Copyright 1999, 2000 & 2001, Carl Edward Rasmussen
%
% Permission is granted for anyone to copy, use, or modify these
% programs and accompanying documents for purposes of research or
% education, provided this copyright notice is retained, and note is
%
% These programs and documents are distributed without any warranty,
% express or implied.  As the programs were written for research
% purposes only, they have not been tested to the degree that would be
% advisable in any important application.  All use of these programs is
% entirely at the user's own risk.
%
% 1) Function name and argument specifications
% 2) Output display
%

if exist('options', 'var') && ~isempty(options) && isfield(options, 'MaxIter')
length = options.MaxIter;
else
length = 100;
end

RHO = 0.01;                            % a bunch of constants for line searches
SIG = 0.5;       % RHO and SIG are the constants in the Wolfe-Powell conditions
INT = 0.1;    % don't reevaluate within 0.1 of the limit of the current bracket
EXT = 3.0;                    % extrapolate maximum 3 times the current bracket
MAX = 20;                         % max 20 function evaluations per line search
RATIO = 100;                                      % maximum allowed slope ratio

argstr = ['feval(f, X'];                      % compose string used to call function
for i = 1:(nargin - 3)
argstr = [argstr, ',P', int2str(i)];
end
argstr = [argstr, ')'];

if max(size(length)) == 2, red=length(2); length=length(1); else red=1; end
S=['Iteration '];

i = 0;                                            % zero the run length counter
ls_failed = 0;                             % no previous line search has failed
fX = [];
[f1 df1] = eval(argstr);                      % get function value and gradient
i = i + (length<0);                                            % count epochs?!
s = -df1;                                        % search direction is steepest
d1 = -s'*s;                                                 % this is the slope
z1 = red/(1-d1);                                  % initial step is red/(|s|+1)

while i < abs(length)                                      % while not finished
i = i + (length>0);                                      % count iterations?!

X0 = X; f0 = f1; df0 = df1;                   % make a copy of current values
X = X + z1*s;                                             % begin line search
[f2 df2] = eval(argstr);
i = i + (length<0);                                          % count epochs?!
d2 = df2'*s;
f3 = f1; d3 = d1; z3 = -z1;             % initialize point 3 equal to point 1
if length>0, M = MAX; else M = min(MAX, -length-i); end
success = 0; limit = -1;                     % initialize quanteties
while 1
while ((f2 > f1+z1*RHO*d1) || (d2 > -SIG*d1)) && (M > 0)
limit = z1;                                         % tighten the bracket
if f2 > f1
z2 = z3 - (0.5*d3*z3*z3)/(d3*z3+f2-f3);                 % quadratic fit
else
A = 6*(f2-f3)/z3+3*(d2+d3);                                 % cubic fit
B = 3*(f3-f2)-z3*(d3+2*d2);
z2 = (sqrt(B*B-A*d2*z3*z3)-B)/A;       % numerical error possible - ok!
end
if isnan(z2) || isinf(z2)
z2 = z3/2;                  % if we had a numerical problem then bisect
end
z2 = max(min(z2, INT*z3),(1-INT)*z3);  % don't accept too close to limits
z1 = z1 + z2;                                           % update the step
X = X + z2*s;
[f2 df2] = eval(argstr);
M = M - 1; i = i + (length<0);                           % count epochs?!
d2 = df2'*s;
z3 = z3-z2;                    % z3 is now relative to the location of z2
end
if f2 > f1+z1*RHO*d1 || d2 > -SIG*d1
break;                                                % this is a failure
elseif d2 > SIG*d1
success = 1; break;                                             % success
elseif M == 0
break;                                                          % failure
end
A = 6*(f2-f3)/z3+3*(d2+d3);                      % make cubic extrapolation
B = 3*(f3-f2)-z3*(d3+2*d2);
z2 = -d2*z3*z3/(B+sqrt(B*B-A*d2*z3*z3));        % num. error possible - ok!
if ~isreal(z2) || isnan(z2) || isinf(z2) || z2 < 0 % num prob or wrong sign?
if limit < -0.5                               % if we have no upper limit
z2 = z1 * (EXT-1);                 % the extrapolate the maximum amount
else
z2 = (limit-z1)/2;                                   % otherwise bisect
end
elseif (limit > -0.5) && (z2+z1 > limit)         % extraplation beyond max?
z2 = (limit-z1)/2;                                               % bisect
elseif (limit < -0.5) && (z2+z1 > z1*EXT)       % extrapolation beyond limit
z2 = z1*(EXT-1.0);                           % set to extrapolation limit
elseif z2 < -z3*INT
z2 = -z3*INT;
elseif (limit > -0.5) && (z2 < (limit-z1)*(1.0-INT))  % too close to limit?
z2 = (limit-z1)*(1.0-INT);
end
f3 = f2; d3 = d2; z3 = -z2;                  % set point 3 equal to point 2
z1 = z1 + z2; X = X + z2*s;                      % update current estimates
[f2 df2] = eval(argstr);
M = M - 1; i = i + (length<0);                             % count epochs?!
d2 = df2'*s;
end                                                      % end of line search

if success                                         % if line search succeeded
f1 = f2; fX = [fX' f1]';
fprintf('%s %4i | Cost: %4.6e\r', S, i, f1);
s = (df2'*df2-df1'*df2)/(df1'*df1)*s - df2;      % Polack-Ribiere direction
tmp = df1; df1 = df2; df2 = tmp;                         % swap derivatives
d2 = df1'*s;
if d2 > 0                                      % new slope must be negative
s = -df1;                              % otherwise use steepest direction
d2 = -s'*s;
end
z1 = z1 * min(RATIO, d1/(d2-realmin));          % slope ratio but max RATIO
d1 = d2;
ls_failed = 0;                              % this line search did not fail
else
X = X0; f1 = f0; df1 = df0;  % restore point from before failed line search
if ls_failed || i > abs(length)          % line search failed twice in a row
break;                             % or we ran out of time, so we give up
end
tmp = df1; df1 = df2; df2 = tmp;                         % swap derivatives
s = -df1;                                                    % try steepest
d1 = -s'*s;
z1 = 1/(1-d1);
ls_failed = 1;                                    % this line search failed
end
if exist('OCTAVE_VERSION')
fflush(stdout);
end
end
fprintf('\n');



call part

options = optimset('MaxIter', 50);

%  You should also try different values of lambda
lambda = 1;

% Create "short hand" for the cost function to be minimized
costFunction = @(p) nnCostFunction(p, ...
input_layer_size, ...
hidden_layer_size, ...
output_layer_size, X, y, lambda);

% Now, costFunction is a function that takes in only one argument (the
% neural network parameters)
[nn_params, cost] = fmincg(costFunction, initial_nn_params, options);

% Obtain Theta1 and Theta2 back from nn_params
Theta1 = reshape(nn_params(1:hidden_layer_size * (input_layer_size + 1)), ...
hidden_layer_size, (input_layer_size + 1));

Theta2 = reshape(nn_params((1 + (hidden_layer_size * (input_layer_size + 1))):end), ...
output_layer_size, (hidden_layer_size + 1));


## 5. Calculate empirical error / adjust parameters / model evaluation (Trying out different learning settings)

function p = predict(Theta1, Theta2, X)
%PREDICT Predict the label of an input given a trained neural network
%   p = PREDICT(Theta1, Theta2, X) outputs the predicted label of X given the
%   trained weights of a neural network (Theta1, Theta2)

% Useful values
m = size(X, 1);
num_labels = size(Theta2, 1);

% You need to return the following variables correctly
p = zeros(size(X, 1), 1);

h1 = sigmoid([ones(m, 1) X] * Theta1');
h2 = sigmoid([ones(m, 1) h1] * Theta2');
[dummy, p] = max(h2, [], 2);

% =========================================================================
end


call function

pred = predict(Theta1, Theta2, X);
fprintf('\nTraining Set Accuracy: %f\n', mean(double(pred == y)) * 100);


Adjust the parameters, you can observe the change of the empirical error

$$\lambda$$ $$Max \ \ Iterations$$ $$accuracy$$
1 50 94.34% ~ 96.00%
1 100 98.64%
1.5 1000 99.04%
1.5 5000 (torture computer) 99.26%
2 2000 98.68%

These are empirical errors, and our goal is to reduce generalization errors. However, coursera does not provide a test set, and we will not test the generalization error for the time being. Strictly speaking there should also be a measurement part of the generalization error.

## 6. Visualizing the hidden layer

This part is not so important, we can see what the hidden layer neurons are doing.

For each hidden layer neuron, find a set of input vector[1,400] to make its activation value close to 1 (at this time, it means that it is most likely to be in a certain state, and at this time the rest of the neurons are close to 0). Then convert it into a 20x20 pixel image.

fprintf('\nVisualizing Neural Network... \n')

displayData(Theta1(:, 2:end));

fprintf('\nProgram paused. Press enter to continue.\n');
pause; Tags: Machine Learning

Posted by jaoudestudios on Mon, 16 May 2022 01:00:12 +0300