Neural Networks with Linear Algebra II

Part III

Welcome back! This is a continuation of the essay Neural Networks through Linear Algebra, where I construct a Neural Network purely through mathematics. Picking up where we left off, the process of using the derivatives of a function with respect to its variables in order to minimize that function is known as Gradient Descent. Gradient, as mentioned previously, refers to those derivatives, and descent simply refers to the geometry of the graph, where we have to “descend” down the slope in order to reach the lowest point (where the loss function is at its minimum, ideally zero). When the gradient is calculated and applied after each individual sample, as we do here, the method is known as Stochastic Gradient Descent.

Up until now, we have only done some Linear Regression with Stochastic Gradient Descent; we have to go through one more step before reaching an actual Neural Network. What is this next step? Predicting, using the predictions! Let’s review what we have up until now. We have the input, a weight matrix and a bias vector, which together give us the output, which we compare with the true values to calculate the gradients. But what if... we use those weights again? On what? The output. Multiply the weights with the output, add a bias vector and see what happens?

If you have paid attention till now, you would notice we can’t really do that. The dimensions of the weight matrix and the output vector do not match. But do not be discouraged! It’s just dimensions, and we can change those to our liking. All we have to do is pick a size for the intermediate result, and the rules of matrix multiplication will lead us to where we want. Let me explain. Let’s decide our starting parameters (weights and biases) alongside the input and output.

$\begin{array}{l} I = {[\begin{array}{c} i_{1} \\ i_{2} \\ i_{3} \\ ⋮ \\ i_{n} \end{array}]}_{1 * n}^{T} \quad W_{1} = {[\begin{array}{ccccc} w_{11} & w_{12} & w_{13} & \dots & w_{1 m} \\ w_{21} \\ w_{31} & ⋱ & ⋮ \\ ⋮ \\ w_{n 1} & \dots & w_{n m} \end{array}]}_{n * m} \\ B_{1} = {[\begin{array}{c} b_{1} \\ b_{2} \\ b_{3} \\ ⋮ \\ b_{m} \end{array}]}_{1 * m}^{T} \quad O = {[\begin{array}{c} o_{1} \\ o_{2} \\ o_{3} \\ ⋮ \\ o_{z} \end{array}]}_{1 * z}^{T} \end{array}$

Now all we will do is land on another form of output before going to our final output (the one which we will actually compare with the true values). The dimension of this new “middle” output vector will be 1 x m (can you guess why?). The calculation will be pretty similar,

$H = {[\begin{array}{c} i_{1} \\ i_{2} \\ i_{3} \\ ⋮ \\ i_{n} \end{array}]}_{1 * n}^{T} {[\begin{array}{ccccc} w_{11} & w_{12} & w_{13} & \dots & w_{1 m} \\ w_{21} \\ w_{31} & ⋱ & ⋮ \\ ⋮ \\ w_{n 1} & \dots & w_{n m} \end{array}]}_{n * m} + {[\begin{array}{c} b_{1} \\ b_{2} \\ b_{3} \\ ⋮ \\ b_{m} \end{array}]}_{1 * m}^{T} = {[\begin{array}{c} h_{1} \\ h_{2} \\ h_{3} \\ ⋮ \\ h_{m} \end{array}]}_{1 * m}^{T}$

This particular vector is called the hidden vector/layer in Machine Learning lingo. We shall call it the hidden vector. Now that we have a hidden vector, we simply continue the process, i.e., multiply by a weight matrix (W₂) and add a bias vector (B₂) to get the output. Remember: the dimension of the output is already determined (by you, depending on what you’re using the network for). In this case it’s (1 x z). Therefore the dimension of the weight matrix this time will be (m x z), since a (1 x m) vector times an (m x z) matrix gives a (1 x z) vector, and that of the bias vector will be (1 x z). Thus completing the final step we get,

$O = {[\begin{array}{c} h_{1} \\ h_{2} \\ h_{3} \\ ⋮ \\ h_{m} \end{array}]}_{1 * m}^{T} {[\begin{array}{ccccc} w_{11} & w_{12} & w_{13} & \dots & w_{1 z} \\ w_{21} \\ w_{31} & ⋱ & ⋮ \\ ⋮ \\ w_{m 1} & \dots & w_{m z} \end{array}]}_{m * z} + {[\begin{array}{c} b_{1} \\ b_{2} \\ b_{3} \\ ⋮ \\ b_{z} \end{array}]}_{1 * z}^{T} = {[\begin{array}{c} o_{1} \\ o_{2} \\ o_{3} \\ ⋮ \\ o_{z} \end{array}]}_{1 * z}^{T}$

So just for viewing pleasure, let’s pull everything together in one picture !

$\begin{array}{l} ({[\begin{array}{c} i_{1} \\ i_{2} \\ i_{3} \\ ⋮ \\ i_{n} \end{array}]}_{1 * n}^{T} {[\begin{array}{ccccc} w_{11} & w_{12} & w_{13} & \dots & w_{1 m} \\ w_{21} \\ w_{31} & ⋱ & ⋮ \\ ⋮ \\ w_{n 1} & \dots & w_{n m} \end{array}]}_{n * m} + {[\begin{array}{c} b_{1} \\ b_{2} \\ b_{3} \\ ⋮ \\ b_{m} \end{array}]}_{1 * m}^{T}) \\ {[\begin{array}{ccccc} w_{11} & w_{12} & w_{13} & \dots & w_{1 z} \\ w_{21} \\ w_{31} & ⋱ & ⋮ \\ ⋮ \\ w_{m 1} & \dots & w_{m z} \end{array}]}_{m * z} + {[\begin{array}{c} b_{1} \\ b_{2} \\ b_{3} \\ ⋮ \\ b_{z} \end{array}]}_{1 * z}^{T} \end{array}$
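To make the dimensions concrete, here is a minimal NumPy sketch of the two steps above. The sizes `n`, `m`, `z` and the random values are arbitrary choices for illustration only; the point is just to watch the shapes line up.

```python
import numpy as np

rng = np.random.default_rng(0)
n, m, z = 4, 3, 2                # input, hidden and output sizes (illustrative values)

I = rng.normal(size=(1, n))      # input row vector, 1 x n
W1 = rng.normal(size=(n, m))     # first weight matrix, n x m
B1 = rng.normal(size=(1, m))     # first bias vector, 1 x m
W2 = rng.normal(size=(m, z))     # second weight matrix, m x z
B2 = rng.normal(size=(1, z))     # second bias vector, 1 x z

H = I @ W1 + B1                  # hidden vector, 1 x m
O = H @ W2 + B2                  # output vector, 1 x z

print(H.shape, O.shape)
```

Changing `m` changes only the size of the hidden vector; the input and output sizes stay whatever the task demands.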

So basically, we still get an input, we multiply it with some matrices, add some vectors and we get the output! But there’s a small catch. The second process? Of getting the hidden vector? That adds absolutely no value to our little network. Adding the hidden vector has no advantage at all. But why? Or why did we even assume it did in the first place? This is where a good knowledge of Linear Algebra helps. A transformation in Linear Algebra means we change or transform one matrix/vector into another matrix/vector. How? By multiplying it with, or adding to it, another matrix or vector (keeping in mind the dimensions). So the above process can simply be seen as the transformation of the input into the output (or an estimation of the output). But here’s the problem: any chain of such linear steps can be done in a single matrix multiplication and a single addition. Expanding the brackets shows it: (I W₁ + B₁) W₂ + B₂ = I (W₁ W₂) + (B₁ W₂ + B₂), so the two weight matrices collapse into the single matrix W₁ W₂ and the two bias vectors into the single vector B₁ W₂ + B₂. By this rule, the entire series of steps that we did above can be achieved in just one step; there was no need for a second one at all. You may wonder, isn’t that good? It reduces the cost of computation by reducing the number of operations we have to perform. But it also just brings us back to the original process, and we have a new barrier to overcome before we hit upon a deep neural network.

Now to address the second question: why do we even want a hidden vector at all? The answer lies in correlation. The input and output are simply numbers that represent real world data, and the weight matrix (and bias vector) tend to capture the relationship, or correlation, between those data points, by adjusting and re-adjusting to minimize the loss function.
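The claim that two purely linear steps are no better than one can be checked numerically. The sketch below uses arbitrary sizes and random values (assumptions for illustration) and compares the two-step output with a single combined weight matrix and bias vector.

```python
import numpy as np

rng = np.random.default_rng(1)
n, m, z = 4, 3, 2                                   # illustrative sizes
I = rng.normal(size=(1, n))
W1, B1 = rng.normal(size=(n, m)), rng.normal(size=(1, m))
W2, B2 = rng.normal(size=(m, z)), rng.normal(size=(1, z))

# Two linear steps, with no activation in between.
two_step = (I @ W1 + B1) @ W2 + B2

# The same map written as one weight matrix and one bias vector.
W = W1 @ W2                                         # n x z
B = B1 @ W2 + B2                                    # 1 x z
one_step = I @ W + B

print(np.allclose(two_step, one_step))
```

Whatever values are chosen, the two results agree up to floating-point rounding, which is why the hidden vector adds nothing until something non-linear is placed between the two steps.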
But it’s the real world, and hence, in some cases, a direct correlation simply does not exist. No matter what weight matrices or bias vectors we use, if there is no correlation, the network will be able to learn nothing. In such cases, a solution is to create our own correlation. That is the purpose of the hidden vector: to create a sort of artificial correlation, first between the input and the hidden vector, and then between the hidden vector and the output. By doing so, the network can learn from data even when there is no direct correlation between the input and the output. But here, we have the above mentioned problem again. Hidden vectors (as they are now) do not add any new information; they just sort of jumble or mix-and-match the existing correlations. Overall, the above given equation is linear. The equation being linear means there is always a straightforward connection, or a direct relation, between input and output variables (which can be impossible in real life). Therefore what we need is something that breaks this straight line. Any sort of condition that we add into the equation can make it non-linear. Non-linearity does exactly what we want: it creates artificial correlation. This process is known as activation. We use an activation function on our hidden vector (we shall see what it is in a moment), in order to introduce non-linearity in our equations. One of the industry standards is the ReLU (Rectified Linear Unit) activation function. This function is indeed very simple, but makes a world of difference in our model. It helps us to go from simple linear regression to deep neural networks. The function is this: if the number (an element of the hidden vector) is below zero, set it to zero; otherwise leave it as it is.
This simple function does the job, as it forces some elements of the hidden vector to be zero (which ones depends on the input), which goes on to reflect in the loss function and, through gradient descent, affects the entire network in different ways. More importantly, the condition takes away the linearity of the equation, making it non-linear. How exactly do we use this activation function? We simply apply it to each element of the hidden vector. For a ReLU activation, we can write the function using only addition and the absolute value, like so,

$\text{ReLU activation function: } \frac{1}{2} ({[\begin{array}{c} h_{1} \\ h_{2} \\ h_{3} \\ ⋮ \\ h_{m} \end{array}]}_{1 * m}^{T} + {[\begin{array}{c} | h_{1} | \\ | h_{2} | \\ | h_{3} | \\ ⋮ \\ | h_{m} | \end{array}]}_{1 * m}^{T})$
Note: The “||” sign signifies modulus, it removes the minus sign, i.e, |-x| = x.

The above function should achieve what we want: it sets anything less than zero to zero, and changes nothing otherwise. For h ≥ 0 we get ½(h + h) = h, and for h < 0 we get ½(h − h) = 0. And once again, for viewing pleasure, we shall see the entire equation again, with the ReLU activation function.

$\begin{array}{l} {[\begin{array}{c} i_{1} \\ i_{2} \\ i_{3} \\ ⋮ \\ i_{n} \end{array}]}_{1 * n}^{T} {[\begin{array}{ccccc} w_{11} & w_{12} & w_{13} & \dots & w_{1 m} \\ w_{21} \\ w_{31} & ⋱ & ⋮ \\ ⋮ \\ w_{n 1} & \dots & w_{n m} \end{array}]}_{n * m} + {[\begin{array}{c} b_{1} \\ b_{2} \\ b_{3} \\ ⋮ \\ b_{m} \end{array}]}_{1 * m}^{T} = {[\begin{array}{c} h_{1} \\ h_{2} \\ h_{3} \\ ⋮ \\ h_{m} \end{array}]}^{T} \\ \frac{1}{2} ({[\begin{array}{c} h_{1} \\ h_{2} \\ h_{3} \\ ⋮ \\ h_{m} \end{array}]}_{1 * m}^{T} + {[\begin{array}{c} | h_{1} | \\ | h_{2} | \\ | h_{3} | \\ ⋮ \\ | h_{m} | \end{array}]}_{1 * m}^{T}) {[\begin{array}{ccccc} w_{11} & w_{12} & w_{13} & \dots & w_{1 z} \\ w_{21} \\ w_{31} & ⋱ & ⋮ \\ ⋮ \\ w_{m 1} & \dots & w_{m z} \end{array}]}_{m * z} + {[\begin{array}{c} b_{1} \\ b_{2} \\ b_{3} \\ ⋮ \\ b_{z} \end{array}]}_{1 * z}^{T} = {[\begin{array}{c} o_{1} \\ o_{2} \\ o_{3} \\ ⋮ \\ o_{z} \end{array}]}_{1 * z}^{T} \end{array}$
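As a quick check, here is a small NumPy sketch of the ReLU formula written exactly as ½(h + |h|), compared against the more common "maximum with zero" form, and then used between the two linear steps. The function name `relu`, the sample values and the sizes are illustrative assumptions.

```python
import numpy as np

def relu(h):
    # ReLU written as in the equation above: (h + |h|) / 2, element by element
    return 0.5 * (h + np.abs(h))

h = np.array([[-2.0, -0.5, 0.0, 1.5, 3.0]])
activated = relu(h)
print(activated)                                      # negatives become 0, the rest is unchanged
print(np.array_equal(activated, np.maximum(h, 0.0)))  # same result as max(h, 0)

# Forward pass with the activation placed between the two linear steps.
rng = np.random.default_rng(0)
n, m, z = 4, 3, 2                                     # illustrative sizes
I = rng.normal(size=(1, n))
W1, B1 = rng.normal(size=(n, m)), rng.normal(size=(1, m))
W2, B2 = rng.normal(size=(m, z)), rng.normal(size=(1, z))

O = relu(I @ W1 + B1) @ W2 + B2                       # output vector, 1 x z
print(O.shape)
```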

This is the first step of the deep neural network, also known as forward propagation. We are sending the input data forward to be processed by weights and biases. When we apply gradient descent, we will be sending the error backwards through the network to be learned from, hence that step is called back-propagation. Once we achieve that, we will have a neural network at hand. But ReLU is not the only activation function we have, and thus I will be going through some more activations too. Here are a few more activation functions:

$\begin{array}{l} \text{Sigmoid activation function: } \frac{1}{1 + e^{- {[\begin{array}{ccccc} h_{1} & h_{2} & h_{3} & \dots & h_{m} \end{array}]}_{1 * m}}} \\ \text{Tanh activation function: } \frac{e^{{[\begin{array}{ccccc} h_{1} & h_{2} & h_{3} & \dots & h_{m} \end{array}]}_{1 * m}} - e^{- {[\begin{array}{ccccc} h_{1} & h_{2} & h_{3} & \dots & h_{m} \end{array}]}_{1 * m}}}{e^{{[\begin{array}{ccccc} h_{1} & h_{2} & h_{3} & \dots & h_{m} \end{array}]}_{1 * m}} + e^{- {[\begin{array}{ccccc} h_{1} & h_{2} & h_{3} & \dots & h_{m} \end{array}]}_{1 * m}}} \end{array}$
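Both formulas apply element by element to the hidden vector. A minimal NumPy sketch, with illustrative function names and sample values, writes them out literally and compares the hand-written tanh with NumPy's own `numpy.tanh`:

```python
import numpy as np

def sigmoid(h):
    # 1 / (1 + e^(-h)), applied to every element of h
    return 1.0 / (1.0 + np.exp(-h))

def tanh(h):
    # (e^h - e^(-h)) / (e^h + e^(-h)), applied to every element of h
    return (np.exp(h) - np.exp(-h)) / (np.exp(h) + np.exp(-h))

h = np.array([[-2.0, 0.0, 0.5, 3.0]])

print(sigmoid(h))                           # every value lies between 0 and 1
print(tanh(h))                              # every value lies between -1 and 1
print(np.allclose(tanh(h), np.tanh(h)))     # matches the built-in version
```

The literal tanh formula overflows for very large inputs, which is one practical reason to prefer the built-in function in real code.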

The equations above may look complex and very tedious to implement, but that is not true. In fact, it is probably the most trivial thing to implement in Neural Networks from a computer science perspective (as I’ve mentioned before, I am focusing on the pure math part of Neural Networks, hence the complicated equations). Most textbooks simply define a function or use built-in ones like numpy.tanh(), whereas I’ve defined the actual equations here. Now we move towards the back-propagation part: where our deep neural network will finally learn.

Part IV

After the above given equations have been solved, we once again find ourselves needing to solve for the gradients, only this time there are two weight matrices (W₁, W₂) and two bias vectors (B₁, B₂) in one loss function. Let’s once again pull everything together to see how our loss function should look,

$\begin{array}{l} ({[\begin{array}{c} t_{1} \\ t_{2} \\ t_{3} \\ ⋮ \\ t_{z} \end{array}]}_{1 * z}^{T} - (\frac{1}{1 + e^{- ({[\begin{array}{c} i_{1} \\ i_{2} \\ i_{3} \\ ⋮ \\ i_{n} \end{array}]}_{1 * n}^{T} {[\begin{array}{ccccc} w_{11} & w_{12} & w_{13} & \dots & w_{1 m} \\ w_{21} \\ w_{31} & ⋱ & ⋮ \\ ⋮ \\ w_{n 1} & \dots & w_{n m} \end{array}]}_{n * m} + {[\begin{array}{c} b_{1} \\ b_{2} \\ b_{3} \\ ⋮ \\ b_{m} \end{array}]}_{1 * m}^{T})}} \\ {[\begin{array}{ccccc} w_{11} & w_{12} & w_{13} & \dots & w_{1 z} \\ w_{21} \\ w_{31} & ⋱ & ⋮ \\ ⋮ \\ w_{m 1} & \dots & w_{m z} \end{array}]}_{m * z} + {[\begin{array}{c} b_{1} \\ b_{2} \\ b_{3} \\ ⋮ \\ b_{z} \end{array}]}_{1 * z}^{T}))^{2} \end{array}$

I have used the Sigmoid activation function in this particular case. As you can see, the loss function is made up of two sets of data we obtain from the real world (the true values and the input data) and four sets of parameters we can control in order to give more accurate predictions of the real world data (the weight matrices and the bias vectors). Now just like before, we have to calculate the gradients of the loss function with respect to all four sets of parameters, and then tweak them so as to minimize the loss. Let’s term the loss function as Z. Now we have to calculate, one by one, these four gradients,

$\frac{δ Z}{δ W_{1}}, \frac{δ Z}{δ W_{2}}, \frac{δ Z}{δ B_{1}}, \frac{δ Z}{δ B_{2}}$

and subtract them from their respective parameters. For the ease of calculating the derivatives, let’s replace the matrices and vectors in the Loss function with variables, with T for true values and I for input. Thus our equation is :

$Z = {(T - ((\frac{1}{1 + e^{- (I * W_{1} + B_{1})}}) * W_{2} + B_{2}))}^{2}$

This simplifies our calculation of the partial derivatives of the function with respect to the parameters. The T and I values are known constants.

$\begin{array}{l} \frac{δ Z}{δ W_{1}} = - 2 (T - ((\frac{1}{1 + e^{- (I * W_{1} + B_{1})}}) * W_{2} + B_{2})) (\frac{I W_{2} e^{- (I * W_{1} + B_{1})}}{{(1 + e^{- (I * W_{1} + B_{1})})}^{2}}) \\ \frac{δ Z}{δ W_{2}} = - 2 (T - ((\frac{1}{1 + e^{- (I * W_{1} + B_{1})}}) * W_{2} + B_{2})) (\frac{1}{1 + e^{- (I * W_{1} + B_{1})}}) \\ \frac{δ Z}{δ B_{1}} = - 2 (T - ((\frac{1}{1 + e^{- (I * W_{1} + B_{1})}}) * W_{2} + B_{2})) (\frac{W_{2} e^{- (I * W_{1} + B_{1})}}{{(1 + e^{- (I * W_{1} + B_{1})})}^{2}}) \\ \frac{δ Z}{δ B_{2}} = - 2 (T - ((\frac{1}{1 + e^{- (I * W_{1} + B_{1})}}) * W_{2} + B_{2})) \end{array}$
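As a sanity check on these derivatives, here is a minimal NumPy sketch of the four gradients in matrix form. It assumes the loss is the sum of the squared differences over the output elements, and it uses the fact that the sigmoid's derivative e^{-x} / (1 + e^{-x})² equals σ(x)(1 − σ(x)). The function names, sizes and random values are illustrative. The transposes are what make the shapes line up, and the last lines compare one entry against a finite-difference estimate.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def loss(I, T, W1, B1, W2, B2):
    S = sigmoid(I @ W1 + B1)              # activated hidden vector, 1 x m
    O = S @ W2 + B2                       # output vector, 1 x z
    return np.sum((T - O) ** 2)           # Z, summed over the output elements

def gradients(I, T, W1, B1, W2, B2):
    S = sigmoid(I @ W1 + B1)              # reused from the forward pass
    E = T - (S @ W2 + B2)                 # error, 1 x z
    dO = -2.0 * E                         # dZ/dO, also dZ/dB2
    dW2 = S.T @ dO                        # m x z
    dB2 = dO                              # 1 x z
    dA = (dO @ W2.T) * S * (1.0 - S)      # back through W2 and the sigmoid, 1 x m
    dW1 = I.T @ dA                        # n x m
    dB1 = dA                              # 1 x m
    return dW1, dB1, dW2, dB2

rng = np.random.default_rng(0)
n, m, z = 4, 3, 2                         # illustrative sizes
I, T = rng.normal(size=(1, n)), rng.normal(size=(1, z))
W1, B1 = rng.normal(size=(n, m)), rng.normal(size=(1, m))
W2, B2 = rng.normal(size=(m, z)), rng.normal(size=(1, z))

dW1, dB1, dW2, dB2 = gradients(I, T, W1, B1, W2, B2)

# Finite-difference estimate of dZ/dW1 for a single entry of W1.
eps = 1e-6
W_plus, W_minus = W1.copy(), W1.copy()
W_plus[0, 0] += eps
W_minus[0, 0] -= eps
numeric = (loss(I, T, W_plus, B1, W2, B2) - loss(I, T, W_minus, B1, W2, B2)) / (2 * eps)

print(abs(numeric - dW1[0, 0]))           # should be very close to zero
```

Each gradient has the same shape as the parameter it belongs to, which is what lets us subtract one from the other.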

These gradients now just need to be subtracted from the original weights and biases (typically after scaling by a small learning rate) to give new weights and biases, which are then used again, repeating the process until the required loss is achieved. If you look carefully, a lot in the above equations is already calculated during forward propagation, and all we need to do is reuse it (in practice, libraries handle this bookkeeping for you). A good approach would be to understand the underlying principles and then move on to the computer science part of building a Neural Network; you’ll be surprised to find out how simple and easy it is to implement. For more clarity, you can substitute the matrices/vectors back into the equations. One thing to note, which I couldn’t show in the above equations, is that some of the matrices need to be transposed during this process in order to carry out the matrix multiplications.

With this, we have successfully back-propagated through our network, and the network has learned from its mistakes. This process can keep happening until the loss function gets to the value of zero, i.e., there is nothing more for the network to learn from. This seems like the ideal situation, but it is not. If this network (or any that you design) gets to very low or zero error on the training data, it has most likely overfitted to the given input and output data. Overfitting is a common problem in any neural network. What exactly is overfitting? It means the network has managed to find weights and biases that represent a perfect correlation between the given input and output. Now considering the real world, that should be impossible. No two phenomena in this world are perfectly correlated with each other. Then what is happening? We must remember that we are simply representing the real world through numbers, and numbers are but a crude representation. They have certain characteristics and properties that we humans have invented. So when we put in such numbers, our network, through its precise calculations, manages to find a perfect relation within the given data. This is troublesome. Such a network is too rigid to be used on any data other than the data we trained it on, as it has tweaked and nudged its parameters to exactly represent that data, making it useless for any other data we might want to use it on (it will have very low accuracy or high error). Now we can always use more and more data, but in the real world the amount of data needed can be endless.

One way to tackle this problem is to stop the training process early, before it reaches the zero error level. We might get a premature network this way, but it often provides a cheap way to overcome the problem of overfitting. Another method is dropout. This is a very simple yet effective strategy: simply set random elements of the hidden vector to zero during training, which switches off the weights attached to them. What this does is force the network to use different weights and biases each time (as others may be switched off), sort of like looking at the same problem from a different perspective. In this way, the network learns to capture signal and ignore noise. Signal is a useful feature of the dataset that every data point possesses (some pattern invisible to us, but not to the network), while noise is any randomness that is encoded in the dataset. In the case of overfitting, the network learns the noise as well, which is undesired. You may be wondering: if we switch off random elements, what guarantees that the weights and biases will learn signal and not noise? Here the different perspective analogy helps. Noise is something random, which means it is not repeated across the dataset. Signal is an important feature of the dataset, hence it is repeated throughout. Therefore, each iteration leads the network to learn the signal and keep strengthening it (because it keeps coming up), while it fails to learn the noise (as it doesn’t come up again, or is random). A simple way to turn random elements to zero is to use a matrix of ones and zeros,

$\text{dropout matrix} = {[\begin{array}{c} D \end{array}]}_{1 * m}, \text{ a matrix of 1's and 0's}$

Each element of the matrix D has a 50% chance of being 1, and is 0 otherwise. Multiplying this matrix element-wise with the hidden vector will have the effect we want. Since the mask eliminates about half of the hidden values, the surviving values must be multiplied by 2 to counter-balance the reduction (this factor is left out of the equations below to keep them short). We must also not forget the mask’s contribution to the gradients:

$H_{\text{dropout}} = {[\begin{array}{c} h_{1} \\ h_{2} \\ h_{3} \\ ⋮ \\ h_{m} \end{array}]}_{1 * m}^{T} \odot {[\begin{array}{c} D \end{array}]}_{1 * m}$

$\frac{δ Z}{δ W_{2}} = - 2 (T - ((\frac{1}{1 + e^{- (I * W_{1} + B_{1})}} \odot D) * W_{2} + B_{2})) (\frac{1}{1 + e^{- (I * W_{1} + B_{1})}} \odot D)$

Note: In case you’re wondering why the dimensions don’t match, it is an element-wise multiplication.
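The mask can be sketched in a few lines of NumPy. The size `m`, the random seed and the variable names are illustrative assumptions; the factor of 2 is the counter-balancing described in the text.

```python
import numpy as np

rng = np.random.default_rng(0)
m = 8                                    # hidden size (illustrative value)

H = rng.normal(size=(1, m))              # hidden vector, 1 x m
D = rng.integers(0, 2, size=(1, m))      # each element is 0 or 1 with equal probability

# Element-wise product: entries where D is 0 are dropped,
# and the surviving entries are doubled to counter-balance the loss.
H_dropped = 2.0 * H * D

print(D)
print(H_dropped)
```

A fresh mask `D` would be drawn for every training step, so a different set of hidden elements is switched off each time. Dropout is normally applied only while training and left out when the finished network makes predictions.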

The process/study of tackling overfitting is known as Regularization. With this, we are officially done with creating a Deep Neural Network, purely through math. Overall, each individual step can be modified or improved upon, depending on the task at hand, to create a new architecture or type of neural network that learns new things (for instance, we can add many more hidden vectors, or increase the size of each individual one), but the basic premise will remain the same. Even at this point, this network is pretty powerful. Implementing these equations is comparatively simpler (much simpler actually), but I wrote this article from a mathematical perspective for the reasons mentioned in the intro, and also to gain a deeper understanding of the subject myself. If you have read this far, you have a deep mathematical understanding of the subject, and thank you very much!
