Exploring Architectures - CNN II

Building up to: Backpropagation

With our loss function defined in the previous essay, we arrive at perhaps the most important and complex part of the network (any network, that is): backpropagation. We start with a short recap of every operation we performed to arrive at our probability distribution, and then put that distribution into our loss function. As a starting point, let's define an operation to denote a convolution and once again follow the entire process (a bit lengthy, but it helps for revision).

Convolutional Layer: $\mathbf{I} * \mathbf{K} := \mathbf{C}$

$$\begin{bmatrix} i_{1,1} & i_{1,2} & \cdots & i_{1,20} \\ i_{2,1} & & & \\ \vdots & & \ddots & \vdots \\ i_{20,1} & & \cdots & i_{20,20} \end{bmatrix}_{20\times 20} * \begin{bmatrix} k_{11} & k_{12} & k_{13} \\ k_{21} & k_{22} & k_{23} \\ k_{31} & k_{32} & k_{33} \end{bmatrix}_{3\times 3} := \begin{bmatrix} c_{1,1} & c_{1,2} & \cdots & c_{1,20} \\ c_{2,1} & & & \\ \vdots & & \ddots & \vdots \\ c_{20,1} & & \cdots & c_{20,20} \end{bmatrix}_{20\times 20}$$

Flattening: $\mathbf{C} \rightarrow \mathbf{C}_F$

$$\begin{bmatrix} c_{1,1} & \cdots & c_{1,20} \\ \vdots & \ddots & \vdots \\ c_{20,1} & \cdots & c_{20,20} \end{bmatrix}_{20\times 20} \rightarrow \begin{bmatrix} c_{1,1} & c_{1,2} & c_{1,3} & \cdots & c_{20,20} \end{bmatrix}_{1\times 400}$$

Forward propagation I: $\mathbf{C}_F\,\mathbf{W}_1 + \mathbf{B}_1 = \mathbf{O}_1$

$$\begin{bmatrix} c_{1,1} & \cdots & c_{20,20} \end{bmatrix}_{1\times 400} \begin{bmatrix} w_{1,1} & \cdots & w_{1,m} \\ \vdots & \ddots & \vdots \\ w_{400,1} & \cdots & w_{400,m} \end{bmatrix}_{400\times m} + \begin{bmatrix} b_1 & \cdots & b_m \end{bmatrix}_{1\times m} = \begin{bmatrix} o_1 & \cdots & o_m \end{bmatrix}_{1\times m}$$

Activation Layer: $\mathrm{ReLU}(\mathbf{O}_1) = \mathbf{H} = \tfrac{1}{2}\big(\mathbf{O}_1 + |\mathbf{O}_1|\big)$

$$\tfrac{1}{2}\left(\begin{bmatrix} o_1 & \cdots & o_m \end{bmatrix} + \begin{bmatrix} |o_1| & \cdots & |o_m| \end{bmatrix}\right) = \begin{bmatrix} h_1 & \cdots & h_m \end{bmatrix}_{1\times m}$$

Forward propagation II: $\mathbf{H}\,\mathbf{W}_2 + \mathbf{B}_2 = \mathbf{O}_2$

$$\begin{bmatrix} h_1 & \cdots & h_m \end{bmatrix}_{1\times m} \begin{bmatrix} w_{11} & w_{12} & w_{13} \\ \vdots & \vdots & \vdots \\ w_{m1} & w_{m2} & w_{m3} \end{bmatrix}_{m\times 3} + \begin{bmatrix} b_1 & b_2 & b_3 \end{bmatrix}_{1\times 3} = \begin{bmatrix} o_1 & o_2 & o_3 \end{bmatrix}_{1\times 3}$$

Activation Layer: $\mathrm{Softmax}(\mathbf{O}_2) = P^*(y\mid x;\theta)$, where $\theta = \{\mathbf{W}_1,\mathbf{W}_2,\mathbf{B}_1,\mathbf{B}_2,\mathbf{K}\}$

$$\sigma\!\left(\begin{bmatrix} o_1 & o_2 & o_3 \end{bmatrix}\right) = \begin{bmatrix} \dfrac{e^{o_1}}{\sum_{j=1}^{3} e^{o_j}} & \dfrac{e^{o_2}}{\sum_{j=1}^{3} e^{o_j}} & \dfrac{e^{o_3}}{\sum_{j=1}^{3} e^{o_j}} \end{bmatrix}_{1\times 3}$$
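
To make the recap concrete, here is the same forward pass sketched in plain NumPy. The "same" zero padding (so that C stays 20 by 20), the hidden width m and the random initialisation are my own assumptions for the sketch, not choices from the essay; the variable names mirror the matrices above.

```python
import numpy as np

rng = np.random.default_rng(0)

# Shapes mirroring the text: 20x20 image, 3x3 kernel, hidden width m, 3 classes.
m  = 32                                    # hidden size (an arbitrary choice for this sketch)
I  = rng.random((20, 20))                  # input image
K  = rng.standard_normal((3, 3))           # convolution kernel
W1 = rng.standard_normal((400, m)) * 0.1   # 400 = 20*20 flattened features
B1 = np.zeros(m)
W2 = rng.standard_normal((m, 3)) * 0.1
B2 = np.zeros(3)

def conv2d_same(img, ker):
    """'Same' 2-D convolution (strictly, cross-correlation) with zero padding."""
    p = ker.shape[0] // 2
    padded = np.pad(img, p)
    out = np.zeros_like(img)
    for r in range(img.shape[0]):
        for c in range(img.shape[1]):
            out[r, c] = np.sum(padded[r:r + ker.shape[0], c:c + ker.shape[1]] * ker)
    return out

C  = conv2d_same(I, K)            # convolutional layer: I * K = C   (20x20)
CF = C.reshape(-1)                # flattening: a 400-vector
O1 = CF @ W1 + B1                 # forward propagation I
H  = 0.5 * (O1 + np.abs(O1))      # ReLU written as (x + |x|) / 2
O2 = H @ W2 + B2                  # forward propagation II
e  = np.exp(O2 - O2.max())
P  = e / e.sum()                  # softmax -> P*(y|x; theta)
print(P, P.sum())                 # three class probabilities summing to 1
```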

In the last layer, the various parameters that θ represents are simply the weights, the biases and the kernel. Parameters are the objects (matrices and vectors in our case) that our model uses to arrive at the probability distribution, and the ones it is allowed to tweak and adjust. The only other variables are the input and the output, which we provide and which cannot be tampered with; as far as the model is concerned, they are constants. For ease of backpropagation, let's plug these variables into the loss function.

$$\mathrm{Loss} = -\sum_i P(y\mid x_i)\,\log\big(P^*(y\mid x_i;\theta)\big)$$

This is the cross entropy loss. Starting from the input layer I, which is the image, we use the parameters W1, W2, B1, B2 and K to calculate a probability distribution: our model's guess at the true distribution T. After that, we simply check how wrong it was using the cross entropy loss. Here P*(y|x;θ) is the probability our model came up with using the parameters, and P(y|x) is the true probability, T.
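
As a small numerical illustration (the numbers are made up, not values from the essay), the loss is low when the model puts probability mass on the correct class and large when it is confidently wrong:

```python
import numpy as np

def cross_entropy(T, P):
    """Loss = -sum_i T_i * log(P_i); the epsilon keeps log away from zero."""
    return -np.sum(T * np.log(P + 1e-12))

T = np.array([0.0, 1.0, 0.0])                        # true distribution: class 2 is correct
print(cross_entropy(T, np.array([0.1, 0.8, 0.1])))   # ~0.22 -> confident and right
print(cross_entropy(T, np.array([0.8, 0.1, 0.1])))   # ~2.30 -> confident and wrong
```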

$$P(y\mid x) = \mathbf{T}$$

$$\mathrm{Loss} = -\,\mathbf{T}\cdot\log\!\left(\frac{e^{\left(\frac{1}{2}\left((\mathbf{I}*\mathbf{K})\,\mathbf{W}_1+\mathbf{B}_1+\left|(\mathbf{I}*\mathbf{K})\,\mathbf{W}_1+\mathbf{B}_1\right|\right)\right)\mathbf{W}_2+\mathbf{B}_2}}{\sum_i e^{\left(\left(\frac{1}{2}\left((\mathbf{I}*\mathbf{K})\,\mathbf{W}_1+\mathbf{B}_1+\left|(\mathbf{I}*\mathbf{K})\,\mathbf{W}_1+\mathbf{B}_1\right|\right)\right)\mathbf{W}_2+\mathbf{B}_2\right)_i}}\right)$$

The above loss function is made up of the input I and the output T, which cannot be tampered with and are constants, and of the parameters (W1, W2, B1, B2 and K), which the model must tweak in order to bring the overall loss down. The equation looks so complicated because I expressed every intermediate variable in terms of either the input or one of the parameters. It need not be written out this fully, but the mathematics demands it (and it's fun as well), so I tried my best. Perhaps the most difficult part is yet to come: we must find the derivative of the loss function with respect to each of the parameters in order to perform gradient descent optimization.

Backpropagation

As discussed in previous essays, the loss function derived above must be minimized. But how? To answer that, we must remember how it was formed in the first place: from certain parameters (mathematical objects of our model's choosing) and from the real-world data we obtained (I and T, the input and the true output values). After every iteration we adjust the parameters in a way that reduces the loss function. By how much should we change the parameters? By the loss's derivative with respect to each parameter. This is one of the most important optimization algorithms in deep learning, known as gradient descent optimization. (For further details, read my previous essay here.)
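
In code, the update itself is tiny. A minimal sketch, assuming the parameters and their gradients live in dictionaries keyed by name (the gradients themselves are what the rest of this essay derives), with a learning rate of my own choosing:

```python
def sgd_step(params, grads, lr=0.01):
    """One step of gradient descent: move every parameter against its gradient."""
    for name, value in params.items():
        value -= lr * grads[name]    # in place: param = param - lr * dLoss/dparam

# e.g. sgd_step({"W1": W1, "B1": B1, "W2": W2, "B2": B2, "K": K}, grads)
```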

Solving the various gradients with respect to our parameters by hand is already a tedious task, since the loss function is complex and has multiple inputs. Nevertheless, I have broken the equation into parts and constructed a computational graph, where we can more easily see what is happening under the hood and calculate the derivatives of simpler functions before chaining them together. We start by assigning sub-variables and creating what we call a computational graph (I would highly recommend reading this essay before going on, as it will make things crystal clear).

Thus we have formed our graph, with each sub-variable representing an operation performed on our original parameters or input on the way to the output. This view is much simpler, and it makes the next step easier: calculating each variable's gradient with respect to the variables that were used to create it. Before that, let's state in notation what we want: the derivative of the loss function with respect to W1, B1, W2, B2 and K.

$$\frac{\partial\,\mathrm{Loss}}{\partial \mathbf{B}_2},\quad \frac{\partial\,\mathrm{Loss}}{\partial \mathbf{W}_2},\quad \frac{\partial\,\mathrm{Loss}}{\partial \mathbf{B}_1},\quad \frac{\partial\,\mathrm{Loss}}{\partial \mathbf{W}_1},\quad \frac{\partial\,\mathrm{Loss}}{\partial \mathbf{K}}$$

In order to get these, we just need to traverse the graph backwards from the output towards the parameters, hence backpropagation. Reading the graph from input to output, the sub-variables are: $\mathbf{C} = \mathbf{I}*\mathbf{K}$, the flattened $\mathbf{C}_F$, $\mathbf{H} = \mathbf{C}_F\mathbf{W}_1$, $\mathbf{H}_B = \mathbf{H} + \mathbf{B}_1$, $\mathbf{R} = \mathrm{ReLU}(\mathbf{H}_B)$, $\mathbf{O} = \mathbf{R}\mathbf{W}_2$, $\mathbf{O}_B = \mathbf{O} + \mathbf{B}_2$, $\mathbf{O}_S = \mathrm{Softmax}(\mathbf{O}_B)$, $\mathbf{O}_L = \log(\mathbf{O}_S)$ and finally $\mathbf{L} = -\mathbf{T}\cdot\mathbf{O}_L$. Since the full expressions are too complex to differentiate in one go, I have first calculated the individual gradients of the sub-variables as follows:

$$\frac{\partial \mathbf{L}}{\partial \mathbf{O}_L} = -\mathbf{T},\qquad \frac{\partial \mathbf{O}_L}{\partial \mathbf{O}_S} = \frac{1}{\mathbf{O}_S}\ \text{(elementwise)}$$

$$\frac{\partial \mathbf{O}_S}{\partial \mathbf{O}_B} = \begin{bmatrix} s_1 - s_1^2 & -s_1 s_2 & -s_1 s_3 \\ -s_2 s_1 & s_2 - s_2^2 & -s_2 s_3 \\ -s_3 s_1 & -s_3 s_2 & s_3 - s_3^2 \end{bmatrix},\quad \text{where } s_i = (\mathbf{O}_S)_i = \frac{e^{(\mathbf{O}_B)_i}}{\sum_j e^{(\mathbf{O}_B)_j}}$$

$$\frac{\partial \mathbf{O}_B}{\partial \mathbf{O}} = 1,\qquad \frac{\partial \mathbf{O}_B}{\partial \mathbf{B}_2} = 1,\qquad \frac{\partial \mathbf{O}}{\partial \mathbf{W}_2} = \mathbf{R} = \tfrac{1}{2}\big((\mathbf{I}*\mathbf{K})\,\mathbf{W}_1+\mathbf{B}_1+\left|(\mathbf{I}*\mathbf{K})\,\mathbf{W}_1+\mathbf{B}_1\right|\big),\qquad \frac{\partial \mathbf{O}}{\partial \mathbf{R}} = \mathbf{W}_2$$

$$\frac{\partial \mathbf{R}}{\partial \mathbf{H}_B} = \tfrac{1}{2}\big(1 + \operatorname{sign}(\mathbf{H}_B)\big),\qquad \frac{\partial \mathbf{H}_B}{\partial \mathbf{H}} = 1,\qquad \frac{\partial \mathbf{H}_B}{\partial \mathbf{B}_1} = 1$$

$$\frac{\partial \mathbf{H}}{\partial \mathbf{W}_1} = \mathbf{C}_F = \mathrm{flatten}(\mathbf{I}*\mathbf{K}),\qquad \frac{\partial \mathbf{H}}{\partial \mathbf{C}} = \mathbf{W}_1,\qquad \frac{\partial c_{m,n}}{\partial k_{p,q}} = i_{m+p-1,\;n+q-1}\ \text{(the input pixel that } k_{p,q} \text{ multiplies when producing } c_{m,n}\text{)}$$
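
Of these, the two easiest to get wrong are the softmax Jacobian and the ReLU derivative, so here is a quick NumPy check of just those two local gradients (the test vector is made up for illustration):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def softmax_jacobian(z):
    """d softmax_i / d z_j = s_i * (delta_ij - s_j): s_i - s_i^2 on the diagonal, -s_i*s_j off it."""
    s = softmax(z)
    return np.diag(s) - np.outer(s, s)

def relu_grad(x):
    """Derivative of (x + |x|)/2: 1 where x > 0, 0 where x < 0."""
    return 0.5 * (1.0 + np.sign(x))

z = np.array([1.0, -0.5, 2.0])     # a made-up pre-activation vector
print(softmax_jacobian(z))         # 3x3 matrix: the dO_S/dO_B above
print(relu_grad(z))                # [1. 0. 1.]: the dR/dH_B above (elementwise)
```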

And finally, through the chain rule, we get to our final parameter gradients:

$$\frac{\partial \mathbf{L}}{\partial \mathbf{B}_2} = \frac{\partial \mathbf{L}}{\partial \mathbf{O}_L}\,\frac{\partial \mathbf{O}_L}{\partial \mathbf{O}_S}\,\frac{\partial \mathbf{O}_S}{\partial \mathbf{O}_B}\,\frac{\partial \mathbf{O}_B}{\partial \mathbf{B}_2}$$

$$\frac{\partial \mathbf{L}}{\partial \mathbf{W}_2} = \frac{\partial \mathbf{L}}{\partial \mathbf{O}_L}\,\frac{\partial \mathbf{O}_L}{\partial \mathbf{O}_S}\,\frac{\partial \mathbf{O}_S}{\partial \mathbf{O}_B}\,\frac{\partial \mathbf{O}_B}{\partial \mathbf{O}}\,\frac{\partial \mathbf{O}}{\partial \mathbf{W}_2}$$

$$\frac{\partial \mathbf{L}}{\partial \mathbf{B}_1} = \frac{\partial \mathbf{L}}{\partial \mathbf{O}_L}\,\frac{\partial \mathbf{O}_L}{\partial \mathbf{O}_S}\,\frac{\partial \mathbf{O}_S}{\partial \mathbf{O}_B}\,\frac{\partial \mathbf{O}_B}{\partial \mathbf{O}}\,\frac{\partial \mathbf{O}}{\partial \mathbf{R}}\,\frac{\partial \mathbf{R}}{\partial \mathbf{H}_B}\,\frac{\partial \mathbf{H}_B}{\partial \mathbf{B}_1}$$

$$\frac{\partial \mathbf{L}}{\partial \mathbf{W}_1} = \frac{\partial \mathbf{L}}{\partial \mathbf{O}_L}\,\frac{\partial \mathbf{O}_L}{\partial \mathbf{O}_S}\,\frac{\partial \mathbf{O}_S}{\partial \mathbf{O}_B}\,\frac{\partial \mathbf{O}_B}{\partial \mathbf{O}}\,\frac{\partial \mathbf{O}}{\partial \mathbf{R}}\,\frac{\partial \mathbf{R}}{\partial \mathbf{H}_B}\,\frac{\partial \mathbf{H}_B}{\partial \mathbf{H}}\,\frac{\partial \mathbf{H}}{\partial \mathbf{W}_1}$$

$$\frac{\partial \mathbf{L}}{\partial \mathbf{K}} = \frac{\partial \mathbf{L}}{\partial \mathbf{O}_L}\,\frac{\partial \mathbf{O}_L}{\partial \mathbf{O}_S}\,\frac{\partial \mathbf{O}_S}{\partial \mathbf{O}_B}\,\frac{\partial \mathbf{O}_B}{\partial \mathbf{O}}\,\frac{\partial \mathbf{O}}{\partial \mathbf{R}}\,\frac{\partial \mathbf{R}}{\partial \mathbf{H}_B}\,\frac{\partial \mathbf{H}_B}{\partial \mathbf{H}}\,\frac{\partial \mathbf{H}}{\partial \mathbf{C}}\,\frac{\partial \mathbf{C}}{\partial \mathbf{K}}$$
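
To see the chain rule in action numerically, here is a sketch of the full backward pass in NumPy. It continues from the forward-pass and local-gradient sketches above (reusing I, K, W1, B1, W2, B2, CF, O1, H, O2, P, softmax_jacobian and relu_grad), and the one-hot target T is made up; the shapes are my own assumptions, and each comment names the chain-rule factor that line contributes.

```python
import numpy as np

# In terms of the graph's sub-variables, O1 here plays the role of H_B,
# H plays R, O2 plays O_B and P plays O_S.
T = np.array([0.0, 1.0, 0.0])                          # made-up one-hot true distribution

dL_dOL  = -T                                           # dL/dO_L
dOL_dOS = 1.0 / P                                      # dO_L/dO_S (elementwise)
dL_dOB  = (dL_dOL * dOL_dOS) @ softmax_jacobian(O2)    # x dO_S/dO_B; works out to P - T

dL_dB2 = dL_dOB                                        # dO_B/dB2 = 1
dL_dW2 = np.outer(H, dL_dOB)                           # dO/dW2 = R        -> shape (m, 3)
dL_dR  = dL_dOB @ W2.T                                 # dO/dR  = W2       -> shape (m,)
dL_dHB = dL_dR * relu_grad(O1)                         # dR/dH_B = (1 + sign(H_B)) / 2
dL_dB1 = dL_dHB                                        # dH_B/dB1 = 1
dL_dW1 = np.outer(CF, dL_dHB)                          # dH/dW1 = C_F      -> shape (400, m)
dL_dC  = (dL_dHB @ W1.T).reshape(20, 20)               # dH/dC = W1, then undo the flattening

# dC/dK: each c_{r,c} multiplied the image patch under the kernel, so dL/dK is a
# correlation of the (padded) input with dL/dC.
pad   = K.shape[0] // 2
I_pad = np.pad(I, pad)
dL_dK = np.zeros_like(K)
for a in range(K.shape[0]):
    for b in range(K.shape[1]):
        dL_dK[a, b] = np.sum(I_pad[a:a + 20, b:b + 20] * dL_dC)

grads = {"W1": dL_dW1, "B1": dL_dB1, "W2": dL_dW2, "B2": dL_dB2, "K": dL_dK}
```

The resulting `grads` dictionary is exactly what the `sgd_step` sketch earlier expects, which closes the loop from loss to parameter update.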

Substituting the actual values into the equations above shows that the gradients are indeed quite complex, even though we are dealing with a fairly rudimentary network. All that remains is to subtract each gradient (scaled by a learning rate) from its parameter and to keep repeating the process until we reach a desired loss level.

OK, that was a lot of calculation and maths to deal with! The entire process discussed above and in the previous essay constitutes the convolutional neural network, which gives computers vision. More sophisticated models are far more complex and far larger in parameter count (as many might know, ChatGPT has 175 billion parameters, compared to the measly five parameter objects we used here). But no model strays too far from this basic architecture, and so we always benefit from knowing these architectures inside out. Believe it or not, the entire process above can be achieved in just a few lines of code using deep learning frameworks such as PyTorch or TensorFlow. Anyway, thank you for reading this far!
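
To back up that claim, here is a rough PyTorch sketch of the same tiny network. The layer sizes mirror the essay, while the padding, hidden width and dummy training data are my own stand-ins; autograd replaces every hand-derived gradient above.

```python
import torch
import torch.nn as nn

# Roughly the network from this essay: one 3x3 conv, flatten, two linear layers.
model = nn.Sequential(
    nn.Conv2d(1, 1, kernel_size=3, padding=1, bias=False),  # plays the role of K
    nn.Flatten(),
    nn.Linear(400, 32),          # W1, B1 (hidden width 32 is an arbitrary choice)
    nn.ReLU(),
    nn.Linear(32, 3),            # W2, B2
)
loss_fn   = nn.CrossEntropyLoss()                       # softmax + cross entropy in one call
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

x = torch.rand(1, 1, 20, 20)     # a dummy 20x20 "image"
y = torch.tensor([1])            # a dummy class label
for _ in range(100):             # a few steps of gradient descent
    optimizer.zero_grad()
    loss = loss_fn(model(x), y)
    loss.backward()              # autograd performs the backpropagation for us
    optimizer.step()
print(loss.item())
```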
