Basics of Artificial Neural Networks
A neural network is a network of neurons or, in a contemporary context, an artificial neural network made up of artificial neurons or nodes. An artificial neural network is inspired by a biological neural network: just as a biological neural network is made up of real biological neurons, an artificial neural network is made up of artificial neurons, often called “perceptrons”. Artificial neural networks are developed for solving artificial intelligence (AI) problems. The connections between artificial neurons carry weights. Every input is multiplied by its weight and the results are summed; this operation is called a linear combination. Finally, an activation function applied to that linear combination determines the output.
A neural network takes several inputs, processes them through multiple neurons arranged in one or more hidden layers, and returns the result through an output layer. This process of estimating the output is technically called “forward propagation”.
Next, we compare the obtained output with the actual output. The difference between the two is technically called the error or loss. We have to minimize that error. But the question arises: how do you reduce that error?
To minimize the error, we change the weights of each neuron that contributes to it. This is done by traveling back through the network and finding the error associated with each hidden layer. This process is called “backward propagation”.
To minimize the error or loss in a neural network, we use a common iterative algorithm known as “gradient descent”, which adjusts the weights step by step in the direction that reduces the error.
That’s it – this is how a neural network works in general!
Now let’s dive a bit deeper and cover some important topics for building an artificial neural network and understanding how it works.
Multi-Layer Perceptron and its basics
A perceptron is the basic unit (you can also call it a building block) of an artificial neural network. It can be understood as a machine that takes multiple inputs and produces one output. The image below shows the typical structure of a perceptron.
The above perceptron takes several inputs and produces one output from them. But a question arises here: what is the relationship between those inputs and the output?
Let’s understand it.
Every input is connected to the output by a connection that carries a weight, as shown in the figure above. Weights give importance to the inputs. To calculate the output, we multiply the inputs by their respective weights, sum the products, and compare the sum with a threshold value: w1*x1 + w2*x2 + w3*x3 + … > threshold. But this is not sufficient. There is something called the bias (b), which is added to the weighted sum (w1*x1 + w2*x2 + w3*x3 + … + b). The bias is similar to the constant b of a linear function y = ax + b. It allows us to move the line up and down to fit the prediction to the data better. Without b the line will always go through the origin (0, 0) and you may get a poorer fit. The perceptron above has three inputs, so it requires three weights and one bias. The linear representation of the input then looks like w1*x1 + w2*x2 + w3*x3 + 1*b.
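As a rough illustration of the weighted sum with a bias and a threshold, here is a minimal Python sketch. The function name perceptron_output and all the numbers are made up for this example; they do not come from any library.

```python
def perceptron_output(inputs, weights, bias, threshold=0.0):
    """Return 1 if the weighted sum plus bias exceeds the threshold, else 0."""
    # Linear combination: w1*x1 + w2*x2 + w3*x3 + ... + b
    weighted_sum = sum(w * x for w, x in zip(weights, inputs)) + bias
    return 1 if weighted_sum > threshold else 0


# Three inputs need three weights and one bias (illustrative values only).
x = [1.0, 0.0, 1.0]
w = [0.5, -0.6, 0.2]
b = -0.4

# The weighted sum here is 0.5*1 + (-0.6)*0 + 0.2*1 - 0.4, roughly 0.3
print(perceptron_output(x, w, b))
```

Changing b shifts the point at which the perceptron switches from 0 to 1, which is the "moving the line up and down" effect described above.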
But all of this is still linear. For a neural network to model relationships that are not linear, the linear combination must be passed through a non-linear function. For this, we use something called an activation function (or simply a non-linearity).
What is an activation function?
An activation function takes the weighted sum of the inputs plus the bias (w1*x1 + w2*x2 + w3*x3 + 1*b) as its input and returns the output of the neuron.
In other words, the argument to the activation function is the sum of the products of weights and inputs, plus the bias.
The activation function performs a non-linear transformation, which allows the network to fit non-linear hypotheses and to estimate complex functions. There are several activation functions, such as “Sigmoid”, “Tanh”, “ReLU”, “Softmax”, and many others. Which one to use depends on the layer and the task; Sigmoid, ReLU, and Softmax are more commonly used than the others.
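To make the activation functions named above concrete, here is a small sketch using only Python's math module. The definitions are the standard textbook ones; the input value z is just an example.

```python
import math


def sigmoid(z):
    """Squash z into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def tanh(z):
    """Squash z into the range (-1, 1)."""
    return math.tanh(z)


def relu(z):
    """Pass positive values through and clip negative values to 0."""
    return max(0.0, z)


def softmax(scores):
    """Turn a list of scores into probabilities that sum to 1."""
    shift = max(scores)  # subtracting the max keeps exp() from overflowing
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


z = 0.3  # e.g. a weighted sum w1*x1 + w2*x2 + w3*x3 + b
print(sigmoid(z), tanh(z), relu(z))
print(softmax([1.0, 2.0, 3.0]))
```

Sigmoid, Tanh, and ReLU act on a single number, whereas Softmax acts on a whole vector of scores, which is why it is typically used at an output layer with several classes.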
Forward Propagation, Back Propagation, and Epochs
We have calculated the output so far, and this process is called “forward propagation”. But what if the predicted output is far away from the real output (high error)? In that case, we update the weights and biases of the neural network to minimize that error. This method of updating the weights and biases is called “back propagation”.
The back-propagation algorithm works by determining the error (or loss) at the output and then propagating it back into the network. The weights and biases are updated to reduce each neuron's contribution to that error. The first step in minimizing the loss is to determine the gradient (derivatives) of the error with respect to the weights and biases at each layer.
One round of forward propagation and back-propagation over the training data is known as an “epoch”, or one complete training iteration.
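The sketch below shows one forward pass, one backward pass, and one update, assuming the simplest possible case: a single sigmoid neuron with one input and a squared-error loss. All the values are made up for illustration, and a finite-difference estimate is included only as a sanity check on the chain-rule gradient.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def loss(w, b, x, y):
    """Squared-error loss of a single sigmoid neuron for one sample."""
    out = sigmoid(w * x + b)
    return 0.5 * (y - out) ** 2


def gradients(w, b, x, y):
    """Gradients of the loss with respect to w and b, via the chain rule."""
    out = sigmoid(w * x + b)        # forward propagation
    d_loss_d_out = -(y - out)       # derivative of 0.5 * (y - out)^2
    d_out_d_z = out * (1.0 - out)   # derivative of the sigmoid
    delta = d_loss_d_out * d_out_d_z
    return delta * x, delta         # dL/dw, dL/db


w, b, x, y = 0.5, 0.1, 2.0, 1.0    # illustrative values only
learning_rate = 0.1

grad_w, grad_b = gradients(w, b, x, y)

# Finite-difference estimate of dL/dw, used only to check the analytic value
eps = 1e-6
numeric_grad_w = (loss(w + eps, b, x, y) - loss(w - eps, b, x, y)) / (2 * eps)

# One gradient-descent update: step against the gradient
new_w = w - learning_rate * grad_w
new_b = b - learning_rate * grad_b

print(grad_w, numeric_grad_w)
print(loss(w, b, x, y), loss(new_w, new_b, x, y))
```

The two gradient values printed on the first line should agree closely, and the loss after the update should be smaller than the loss before it.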
Multi-layer perceptron (MLP)
So far, we have seen just a single layer consisting of 3 input nodes, i.e. x1, x2, and x3, and an output layer consisting of a single neuron. But in practical applications, a single-layer neural network may not be sufficient to meet our needs. An MLP contains additional layers between the input layer and the output layer, called hidden layers, as shown below. You can use as many hidden layers as you wish, but two or three hidden layers are often sufficient. In addition, using a larger number of hidden layers is computationally expensive. A simple diagram of an MLP is shown below.
Figure: Multi-layer perceptron (MLP)
The image above shows just a single hidden layer but, as mentioned previously, an MLP can in practice contain multiple hidden layers. Another important point to remember is that all the layers of an MLP are fully connected (i.e. every node in a hidden layer is connected to every node in the previous layer and in the following layer).
Full Batch Gradient Descent and Stochastic Gradient Descent
Full Batch Gradient Descent and Stochastic Gradient Descent are variants of Gradient Descent. Both do the same work, i.e. updating the weights and biases of the MLP using the same update rule; the difference lies in the number of training samples used in each iteration to compute that update.
The Full Batch Gradient Descent algorithm, as the name implies, uses all the training data points to update each of the weights once, whereas Stochastic Gradient Descent uses one (or a few) data points, but never the entire training data, to update the weights once.
Let us understand this with a simple example of a dataset of 20 data points with two weights w1 and w2.
In the Full Batch Gradient Descent algorithm, you use 20 data points (i.e. the entire training data) and calculate the change in w1 (Δw1) and change in w2 (Δw2) and update w1 and w2 accordingly.
In the Stochastic Gradient Descent algorithm, you use the 1st data point to calculate the change in w1 (Δw1) and the change in w2 (Δw2) and update w1 and w2. Next, when you use the 2nd data point, you work on the updated weights, and so on for the remaining data points.
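The 20-point, two-weight example can be sketched as follows. This is only an illustration under stated assumptions: the data is synthetic (generated from y = 2*x1 - 3*x2 without noise), the model is a plain linear one rather than an MLP, and the function names and learning rate are chosen for this example.

```python
import random

random.seed(0)  # fixed seed so the run is repeatable

# 20 made-up data points generated from y = 2*x1 - 3*x2
data = []
for _ in range(20):
    x1, x2 = random.uniform(-1, 1), random.uniform(-1, 1)
    data.append((x1, x2, 2.0 * x1 - 3.0 * x2))


def gradient(w1, w2, points):
    """Average gradient of 0.5 * (prediction - y)^2 over the given points."""
    g1 = g2 = 0.0
    for x1, x2, y in points:
        err = (w1 * x1 + w2 * x2) - y
        g1 += err * x1
        g2 += err * x2
    n = len(points)
    return g1 / n, g2 / n


def full_batch_epoch(w1, w2, lr):
    """One update per pass, computed from all 20 points."""
    g1, g2 = gradient(w1, w2, data)
    return w1 - lr * g1, w2 - lr * g2


def sgd_epoch(w1, w2, lr):
    """Twenty updates per pass, each computed from a single point."""
    for point in data:
        g1, g2 = gradient(w1, w2, [point])
        w1, w2 = w1 - lr * g1, w2 - lr * g2  # the next point sees updated weights
    return w1, w2


wb = ws = (0.0, 0.0)
for _ in range(500):
    wb = full_batch_epoch(*wb, lr=0.5)
    ws = sgd_epoch(*ws, lr=0.5)

print("full batch:", wb)
print("stochastic:", ws)
```

Both variants should end up close to w1 = 2 and w2 = -3 here; the difference is that the full-batch version makes one update per pass over the data, while the stochastic version makes twenty.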
Algorithm for solving problems using Artificial neural network (ANN)
Let’s look at the algorithm of a neural network (an MLP with one hidden layer, similar to the architecture shown above). At the output layer, we have only one neuron as we are solving a binary classification problem (predict 0 or 1). We could also use two output neurons, one for each class.
The steps are as follows:
1.) We take the training dataset (i.e. the input). The input is in the form of a matrix.
x as the input matrix
2.) Then, we initialize the weights and biases with some random values. This is a one-time initialization: from the next iteration onward, we use the updated weights and biases.
Let us define:
W_hidden as weight matrix to the hidden layer
b_hidden as bias matrix to the hidden layer
w_out as weight matrix to the output layer
b_out as bias matrix to the output layer
3.) We take the matrix dot product of the input matrix (x) and the weight matrix (W_hidden) assigned to the edges between the input and hidden layers, then add the biases (b_hidden) of the hidden layer neurons. This is the linear transformation described above.
Thus the hidden layer inputs are obtained as:
hidden_layer_input = matrix_dot_product(x,W_hidden) + b_hidden
4.) Now the linear transformation must be made non-linear, so we apply an activation function (any non-linearity can be used; for convenience I am using the sigmoid activation function).
sigmoid(x) returns 1 / (1 + exp(-x)).
Hidden_layer_output = sigmoid(hidden_layer_input)
5.) We perform the same kind of linear transformation as above: we take the matrix dot product of Hidden_layer_output with the output layer weights, add the bias of the output layer neuron, and then apply an activation function to predict the output.
First we get the input to the output layer, and from it the final output, as follows:
output_layer_input = matrix_dot_product(Hidden_layer_output, w_out) + b_out
output = sigmoid(output_layer_input)
NOTE: All above steps are known as “Forward Propagation” and the upcoming steps are known as “Back-propagation”.
6.) Compare the prediction (obtained output) with the actual output to get the error, and calculate the gradient of the error. The error is the squared-error loss, ((y - output)^2)/2.
Its derivative with respect to output is -(y - output), so the direction that reduces the error is E = y – output (Actual – Predicted). This is why the updates in step 11 are added rather than subtracted.
7.) Now compute the slope (gradient/derivative) of the activation at the hidden and output layer neurons. For the sigmoid, the derivative can be written in terms of its output: if s = sigmoid(x), then the derivative is s * (1 – s). The function derivatives_sigmoid below therefore takes the already-computed layer output as its argument.
I have written this result directly; the derivation is simple. (Hint: take the natural log of both sides and then apply the chain rule.)
grad_output_layer = derivatives_sigmoid(output)
grad_hidden_layer = derivatives_sigmoid(Hidden_layer_output)
8.) Calculate the change factor (delta) at the output layer. It is obtained as the product of the gradient of the error and the gradient of the output layer activation.
delta_output = E * grad_output_layer
9.) Here the back-propagation starts, i.e. the error now propagates back into the neural network. The error at the hidden layer is obtained as the dot product of the output layer delta with the transpose of the weights between the hidden and output layers (w_out.Transpose).
Error_hidden_layer = matrix_dot_product(delta_output, w_out.Transpose)
10.) In a similar pattern to step 8, we compute the change factor (delta) at the hidden layer. It is obtained as the product of the error at the hidden layer and the gradient of the hidden layer output.
delta_hidden_layer = Error_hidden_layer * grad_hidden_layer
11.) The final step is to update the weights and biases at the output and hidden layers, as follows:
w_out = w_out + matrix_dot_product(Hidden_layer_output.Transpose, delta_output) * learning_rate
W_hidden = W_hidden + matrix_dot_product(x.Transpose, delta_hidden_layer) * learning_rate
b_hidden = b_hidden + sum(delta_hidden_layer, row wise) * learning_rate
b_out = b_out + sum(delta_output, row wise) * learning_rate
learning_rate: The size of each weight update is controlled by a configuration parameter called the learning rate. Its value should be chosen carefully. If the learning rate is very small, learning is very slow (although the small steps allow the minimum to be approached closely); if it is very large, the updates may overshoot and we may never reach the minimum value of the error.
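The effect of the learning rate can be seen on a toy problem. The sketch below assumes a deliberately simple error function, f(w) = w**2, whose minimum is at w = 0; it is not the network's loss, and the three learning rates are chosen only to show the slow, reasonable, and overshooting cases.

```python
def descend(learning_rate, steps=20, w=1.0):
    """Run gradient descent on f(w) = w**2 and return the final w."""
    for _ in range(steps):
        gradient = 2.0 * w                  # derivative of w**2
        w = w - learning_rate * gradient    # gradient-descent update
    return w


for lr in (0.001, 0.1, 1.1):
    print(lr, descend(lr))
```

With the smallest rate, w has barely moved from its starting value after 20 steps; with the middle rate it is close to the minimum; with the largest rate each step overshoots the minimum by more than it corrects, so w moves further away instead of converging.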
We repeat steps 3 to 11 several times until the error is sufficiently small and the accuracy is sufficiently high.
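Putting steps 1 to 11 together, a minimal sketch in plain Python might look like the following. It is one possible reading of the steps, not a reference implementation: the toy data, the number of hidden neurons, the learning rate, the number of epochs, and the helper names (dot, transpose, add_bias, apply, elementwise, column_sums, random_matrix, step) are all assumptions made for this example, and the biases start at zero for simplicity. Matrices are stored as lists of rows, with one row per training sample.

```python
import math
import random

random.seed(1)  # fixed seed so the run is repeatable


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def dot(a, b):
    """Matrix product of two matrices stored as lists of rows."""
    return [[sum(p * q for p, q in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(m):
    return [list(col) for col in zip(*m)]


def add_bias(m, bias):
    """Add the bias vector to every row of m."""
    return [[v + b for v, b in zip(row, bias)] for row in m]


def apply(fn, m):
    return [[fn(v) for v in row] for row in m]


def elementwise(fn, a, b):
    """Combine two same-shaped matrices entry by entry."""
    return [[fn(p, q) for p, q in zip(ra, rb)] for ra, rb in zip(a, b)]


def column_sums(m):
    """Sum over the rows (samples), giving one value per column (neuron)."""
    return [sum(col) for col in zip(*m)]


def random_matrix(rows, cols):
    return [[random.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]


def forward(x, W_hidden, b_hidden, w_out, b_out):
    """Steps 3-5: forward propagation."""
    hidden_out = apply(sigmoid, add_bias(dot(x, W_hidden), b_hidden))
    output = apply(sigmoid, add_bias(dot(hidden_out, w_out), b_out))
    return hidden_out, output


def loss(y, output):
    return sum(0.5 * (t[0] - o[0]) ** 2 for t, o in zip(y, output))


# Step 1: toy data, made up for illustration (3 samples, 4 features)
x = [[1, 0, 1, 0], [1, 0, 1, 1], [0, 1, 0, 1]]
y = [[1], [1], [0]]

# Step 2: one-time initialization (random weights, zero biases)
n_hidden, learning_rate, epochs = 3, 0.5, 5000
W_hidden, b_hidden = random_matrix(4, n_hidden), [0.0] * n_hidden
w_out, b_out = random_matrix(n_hidden, 1), [0.0]


def step(param, update):
    return param + learning_rate * update


initial_loss = loss(y, forward(x, W_hidden, b_hidden, w_out, b_out)[1])

for _ in range(epochs):
    hidden_out, output = forward(x, W_hidden, b_hidden, w_out, b_out)
    E = elementwise(lambda t, o: t - o, y, output)                   # step 6
    grad_output = apply(lambda s: s * (1 - s), output)               # step 7
    grad_hidden = apply(lambda s: s * (1 - s), hidden_out)
    delta_output = elementwise(lambda e, g: e * g, E, grad_output)   # step 8
    error_hidden = dot(delta_output, transpose(w_out))               # step 9
    delta_hidden = elementwise(lambda e, g: e * g, error_hidden, grad_hidden)  # step 10
    # Step 11: update weights and biases
    w_out = elementwise(step, w_out, dot(transpose(hidden_out), delta_output))
    W_hidden = elementwise(step, W_hidden, dot(transpose(x), delta_hidden))
    b_out = [step(b, s) for b, s in zip(b_out, column_sums(delta_output))]
    b_hidden = [step(b, s) for b, s in zip(b_hidden, column_sums(delta_hidden))]

final_output = forward(x, W_hidden, b_hidden, w_out, b_out)[1]
print("loss:", initial_loss, "->", loss(y, final_output))
print("predictions:", [round(o[0], 3) for o in final_output])
```

In practice a numerical library would replace the hand-written matrix helpers; they are spelled out here only so that each step of the algorithm maps to one visible line inside the training loop.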