Exploring Architectures- CNN

In my earlier essays on neural networks, I argued that they can be seen as simply an implementation of linear algebra and vector calculus, and explaining them from this perspective exposed me to some new concepts and a new way of looking at them. Following on from those topics, the upcoming essays will explore the different architectures of neural networks, using the same procedure of constructing them from math alone. In this essay, we shall explore Convolutional Neural Networks, the networks that give computers vision. Modern versions, with all their improvements and modifications, can be very powerful and complex, so I shall explore only a basic convolutional neural network.

Convolutional Neural Networks

In the previous essays, I wrote about linear regression trained with stochastic gradient descent, which is powerful and efficient, but only for certain types of problems. Regression works well when our output is one-dimensional; such problems are well suited to linear regression and to the loss function we used before, the mean-squared error. That is not the case for other problems. Take vision: we want computers to recognize images. Here linear regression can prove slow and inefficient, and we can do better on the loss function as well. There are a few things we must settle first: the color of the image, the number of predetermined categories, and a new loss function. Since we are assuming a simple(r) network, we shall take only grayscale values (0-255) and three categories. What about the loss function? Last time we used the mean-squared error, which is not exactly the worst loss function to use, but there is a better one for this task, and we benefit more from using it. Let’s start by defining our input!

Since our input will be an image, it will be represented by a matrix of numbers between 0 and 255, where black is 0 and white is 255. This means our raw input will be a matrix instead of a vector, unlike in linear regression. Let us define the input as I.

\begin{matrix} I = \begin{bmatrix} i_{11} & i_{12} & i_{13} & \dots & i_{1m} \\ i_{21} & i_{22} & i_{23} & \dots & i_{2m} \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ i_{n1} & i_{n2} & i_{n3} & \dots & i_{nm} \end{bmatrix}_{n \times m} \\ \text{where } i_{jk} \in [0, 255] \end{matrix}

This is an (n x m) resolution image, passed in as a matrix. If we were to follow the usual steps (discussed in previous essays), we would simply multiply the input by a weight matrix and add a bias vector/matrix to get to the output. With the addition of a hidden layer, we created a dense or deep neural network. But in this case, we are dealing with a matrix, possibly a large one, with each value representing a pixel that contributes to a feature of the image (something unique to a particular class of image). A breakthrough was made when it was discovered that using convolutions (explained shortly) can make training much faster and more efficient, while greatly increasing the accuracy of our models.

Kernels

We have the input matrix I. This matrix has certain features in it. Let’s assume it is a picture of a dog; then some features would be an eye, ears, paws or a tail. Our model must associate each feature with the correct class of images (dog or cat or human, etc.). This is the learning part of the model. Given a cat, our model must identify its most prominent features, those which set it apart from dogs or humans, and associate them with the class "cat". This indeed seems like a daunting task, especially for a model (models lack a sophisticated brain like ours). But this is where linear algebra shows its brilliant magic and comes to our aid, and we finish up in style with our usual gradient descent optimization. So what does linear algebra offer us? In a nutshell: smaller matrices. Let me explain. Let us define a small matrix of dimensions (3 * 3), and an example matrix of random colors (20 * 20).

I = \begin{bmatrix} i_{1,1} & i_{1,2} & i_{1,3} & \dots & i_{1,20} \\ i_{2,1} & i_{2,2} & i_{2,3} & \dots & i_{2,20} \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ i_{20,1} & i_{20,2} & i_{20,3} & \dots & i_{20,20} \end{bmatrix}_{20 \times 20} \qquad K = \begin{bmatrix} k_{11} & k_{12} & k_{13} \\ k_{21} & k_{22} & k_{23} \\ k_{31} & k_{32} & k_{33} \end{bmatrix}_{3 \times 3}

The smaller matrix is known as a *kernel*. The input matrix **I** simply consists of numbers ranging from 0 to 255. Applying a convolution simply means doing an **elementwise multiplication of the kernel with each submatrix of *I* of the same dimensions and adding up the results, which effectively reduces the dimensions of the resultant matrix.** A submatrix simply means a sub-part of a larger matrix, for example:

I_{1} = {[\begin{array}{ccc} i_{11} & i_{12} & i_{13} \\ i_{21} & i_{22} & i_{23} \\ i_{31} & i_{32} & i_{33} \end{array}]}_{3 * 3}

is a submatrix of I. We simply do an elementwise multiplication of this submatrix with our kernel and add up the nine products:

\begin{array}{l} \begin{bmatrix} i_{11} & i_{12} & i_{13} \\ i_{21} & i_{22} & i_{23} \\ i_{31} & i_{32} & i_{33} \end{bmatrix} \times \begin{bmatrix} k_{11} & k_{12} & k_{13} \\ k_{21} & k_{22} & k_{23} \\ k_{31} & k_{32} & k_{33} \end{bmatrix} = \begin{bmatrix} c \end{bmatrix} \\ \text{where} \\ \begin{bmatrix} c \end{bmatrix} = i_{11} k_{11} + i_{12} k_{12} + i_{13} k_{13} + i_{21} k_{21} + i_{22} k_{22} + i_{23} k_{23} + i_{31} k_{31} + i_{32} k_{32} + i_{33} k_{33} \end{array}

Given an input matrix I of height i_h and width i_w, and a kernel K of height k_h and width k_w, the dimensions of the result of a convolution are given by:

\text{dim} = (i_{h} - k_{h} + 1) \times (i_{w} - k_{w} + 1)
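To make the sliding-window operation and the dimension formula concrete, here is a minimal NumPy sketch. The function name `conv2d_valid` and the toy 4 x 4 input are my own choices for illustration, and the sketch assumes the kernel moves one step at a time with no padding around the image.

```python
import numpy as np

def conv2d_valid(image, kernel):
    """Slide `kernel` over `image` one step at a time (no padding) and
    sum the elementwise products at every position."""
    i_h, i_w = image.shape
    k_h, k_w = kernel.shape
    out = np.zeros((i_h - k_h + 1, i_w - k_w + 1))
    for r in range(out.shape[0]):
        for c in range(out.shape[1]):
            window = image[r:r + k_h, c:c + k_w]   # submatrix under the kernel
            out[r, c] = np.sum(window * kernel)    # multiply elementwise, then add
    return out

image = np.arange(16, dtype=float).reshape(4, 4)   # toy 4 x 4 "image" (made up)
kernel = np.ones((3, 3))                           # toy 3 x 3 kernel (made up)
result = conv2d_valid(image, kernel)

print(result.shape)   # (4 - 3 + 1, 4 - 3 + 1) = (2, 2)
print(result)
```

With a (20 * 20) input and a (3 * 3) kernel, the same function returns an (18 * 18) matrix, as the formula predicts.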

We follow this by moving the kernel over by one column and repeating the process until we reach the end of the row, then moving down one row and starting again, until the whole image is covered. This is the trick behind convolutional neural networks. OK, but *why*? What exactly does it do? Remember how I said that the network must be able to tell apart the features that belong to a particular class of images? Well, if we select the kernel properly, it *highlights* certain features of images. For example, here is a very simple picture in our (20 * 20) grid input (forgive my lack of talent in art):

This is a (20 * 20) image, with the white squares having a value of 255 and the black ones a value of 0. Suppose a prominent feature of the image above is its vertical lines. Thus, we need to capture that particular feature of the image, i.e., a vertical line (or any vertical part for that matter), in a kernel. For now, let's try it out with the kernel below:

K = {[\begin{array}{ccc} 0 & 0 & 0 \\ 0.5 & 0.5 & 0.5 \\ 0 & 0 & 0 \end{array}]}_{3 * 3}

We repeat the above operation, multiplying this kernel elementwise with each submatrix and adding up the products (remember: the matrix is full of 255’s or 0’s), and we end up with the following image:

\begin{array}{l} \text{Here's what happened:} \\ \begin{bmatrix} \begin{bmatrix} i_{11} \cdot k_{11} & i_{12} \cdot k_{12} & i_{13} \cdot k_{13} \\ i_{21} \cdot k_{21} & i_{22} \cdot k_{22} & i_{23} \cdot k_{23} \\ i_{31} \cdot k_{31} & i_{32} \cdot k_{32} & i_{33} \cdot k_{33} \end{bmatrix} & \dots & i_{1,20} \\ \vdots & \ddots & \vdots \\ i_{20,1} & \dots & i_{20,20} \end{bmatrix} \to \begin{bmatrix} \dots & \begin{bmatrix} i_{12} \cdot k_{11} & i_{13} \cdot k_{12} & i_{14} \cdot k_{13} \\ i_{22} \cdot k_{21} & i_{23} \cdot k_{22} & i_{24} \cdot k_{23} \\ i_{32} \cdot k_{31} & i_{33} \cdot k_{32} & i_{34} \cdot k_{33} \end{bmatrix} & \dots \\ \vdots & \ddots & \vdots \\ i_{20,1} & \dots & i_{20,20} \end{bmatrix} \\ \text{And we continue till the end} \end{array}

Note: The values of the kernel did not change!
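For readers who want to reproduce this step, here is a rough NumPy sketch. The drawing is not available as data, so the image below is a hypothetical stand-in of my own (a black grid with one white vertical line in column 10); only the kernel K is taken from the text.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

# Hypothetical stand-in for the drawing: a black 20 x 20 image
# with a single white vertical line in column 10.
image = np.zeros((20, 20))
image[:, 10] = 255.0

# The kernel from the text.
K = np.array([[0.0, 0.0, 0.0],
              [0.5, 0.5, 0.5],
              [0.0, 0.0, 0.0]])

# All 3 x 3 submatrices of the image at once: shape (18, 18, 3, 3).
windows = sliding_window_view(image, K.shape)

# Multiply each submatrix elementwise by K and add up the nine products.
C = (windows * K).sum(axis=(2, 3))

print(C.shape)       # (18, 18)
print(C[0, 6:13])    # responses along the first output row, around the line
```

In this stand-in image, the line survives the operation as a band of non-zero responses while the all-black background stays at zero, and K itself is never modified along the way.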

It is pretty evident (and satisfying) to see what the kernel did: it sort of highlighted the vertical lines in the image! This means that the kernel K captures vertical lines in this image. Just as is the case here, every image has some feature which it shares with other images of the same type (every stick man has a vertical line for a torso), and thus if we correctly capture that exact feature in a kernel, we can more easily identify the class of an image. That is the learning part in these networks: they learn to find a kernel which best represents a particular feature (as the above kernel represents vertical lines), except they do so taking into account all of the images we provide, and they find kernels which capture features that are similar but also have slight variations (a dog sitting vs. standing). With this, we can now finally move forward!

Forward Propagation

After we get the convolved matrix from the previous operations, we simply follow the same procedure as we did for normal neural networks: multiply by weights and add a bias term. But before we do that, let us not forget that we have a matrix and not a vector. Why not convert our matrix into one, then? It makes things much simpler to deal with, as we always prefer less complicated or lower-dimensional stuff! Let C be the convolved matrix.

For our (20 * 20) input and (3 * 3) kernel, the dimension formula gives C the dimensions (18 * 18), so flattening it yields a vector of 324 values.

\begin{array}{l} \begin{bmatrix} c_{1,1} & c_{1,2} & c_{1,3} & \dots & c_{1,18} \\ c_{2,1} & c_{2,2} & c_{2,3} & \dots & c_{2,18} \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ c_{18,1} & c_{18,2} & c_{18,3} & \dots & c_{18,18} \end{bmatrix}_{18 \times 18} \to \begin{bmatrix} c_{1,1} & c_{1,2} & c_{1,3} & \dots & c_{18,18} \end{bmatrix}_{1 \times 324} \\ \text{also written as:} \\ \begin{bmatrix} c_{1,1} \\ c_{1,2} \\ c_{1,3} \\ \vdots \\ c_{18,18} \end{bmatrix}_{1 \times 324}^{T} \end{array}

Now we can apply weights and biases to the flattened convolution vector to get what we want: the output of the first “layer”. Let W be the weight matrix and B be the bias vector.

\begin{array}{l} C W + B = O \\ \begin{bmatrix} c_{1,1} \\ c_{1,2} \\ c_{1,3} \\ \vdots \\ c_{18,18} \end{bmatrix}_{1 \times 324}^{T} \begin{bmatrix} w_{1,1} & w_{1,2} & w_{1,3} & \dots & w_{1,m} \\ w_{2,1} & w_{2,2} & w_{2,3} & \dots & w_{2,m} \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ w_{324,1} & w_{324,2} & w_{324,3} & \dots & w_{324,m} \end{bmatrix}_{324 \times m} + \begin{bmatrix} b_{1} \\ b_{2} \\ b_{3} \\ \vdots \\ b_{m} \end{bmatrix}_{1 \times m}^{T} = \begin{bmatrix} o_{1} \\ o_{2} \\ o_{3} \\ \vdots \\ o_{m} \end{bmatrix}_{1 \times m}^{T} \end{array}

This gives us our first output. Since I have covered this process in the previously mentioned essays, I will only go over it briefly: we put the output through an activation function and add a hidden layer, except this time there are new functions and dimensions to think (worry) about! Anyway, here is the ReLU activation function applied to the output:

\text{ReLU activation function: } \frac{1}{2} \left( \begin{bmatrix} o_{1} \\ o_{2} \\ o_{3} \\ \vdots \\ o_{m} \end{bmatrix}_{1 \times m}^{T} + \begin{bmatrix} | o_{1} | \\ | o_{2} | \\ | o_{3} | \\ \vdots \\ | o_{m} | \end{bmatrix}_{1 \times m}^{T} \right) = \begin{bmatrix} h_{1} \\ h_{2} \\ h_{3} \\ \vdots \\ h_{m} \end{bmatrix}_{1 \times m}^{T}
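The formula above is just another way of writing "keep the positive values and replace the negative ones with zero". A small sketch to check this, with made-up input values and a function name (`relu`) of my own choosing:

```python
import numpy as np

def relu(o):
    """ReLU written exactly as in the formula above: (o + |o|) / 2."""
    return 0.5 * (o + np.abs(o))

o = np.array([[-2.0, -0.5, 0.0, 1.5, 3.0]])   # example pre-activations (made up)
h = relu(o)

print(h)                   # negatives become 0, positives pass through unchanged
print(np.maximum(o, 0))    # the more common way of writing the same function
```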

With this, we can finally arrive at the last problem: what exactly will our output be? Remember that this is a classification problem, so we do not want a quantity but instead the class of an image (whether it is a dog, cat or human). Therefore our output must represent each of these categories. A very simple way would be to assign each label a vector, such as:

\begin{array}{l} \text{Cat: } \begin{bmatrix} 0 & 0 & 1 \end{bmatrix} \\ \text{Dog: } \begin{bmatrix} 0 & 1 & 0 \end{bmatrix} \\ \text{Human: } \begin{bmatrix} 1 & 0 & 0 \end{bmatrix} \end{array}

With this, not only we, but even a computer model can distinguish between the three. Hence our output must be a (1 * 3) dimensional vector, with each number representing a class. With that, we can go ahead with our hidden layer to calculate our final output. Let W₂ be the weights and B₂ be the bias (with W₁ and B₁ being the previous ones).

\begin{bmatrix} h_{1} \\ h_{2} \\ h_{3} \\ \vdots \\ h_{m} \end{bmatrix}_{1 \times m}^{T} \begin{bmatrix} w_{11} & w_{12} & w_{13} \\ w_{21} & w_{22} & w_{23} \\ \vdots & \vdots & \vdots \\ w_{m1} & w_{m2} & w_{m3} \end{bmatrix}_{m \times 3} + \begin{bmatrix} b_{11} \\ b_{12} \\ b_{13} \end{bmatrix}_{1 \times 3}^{T} = \begin{bmatrix} o_{11} \\ o_{12} \\ o_{13} \end{bmatrix}_{1 \times 3}^{T}
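Putting the steps of this section together, the forward pass can be sketched roughly as follows. This is only a shape check under my own assumptions: the convolved matrix is taken to be (18 * 18), which is what the dimension formula gives for a (20 * 20) image and a (3 * 3) kernel, the hidden width m = 16 is arbitrary, and the weights are random rather than learned.

```python
import numpy as np

rng = np.random.default_rng(0)

m = 16                                   # hidden width (arbitrary choice)
C = rng.random((18, 18))                 # stand-in for a convolved 20 x 20 image

c = C.reshape(1, -1)                     # flatten to a (1, 324) row vector
W1 = rng.normal(size=(c.shape[1], m))    # first weight matrix, (324, m)
B1 = np.zeros((1, m))                    # first bias, (1, m)
W2 = rng.normal(size=(m, 3))             # second weight matrix, (m, 3)
B2 = np.zeros((1, 3))                    # second bias, (1, 3)

O1 = c @ W1 + B1                         # first layer output, (1, m)
H = 0.5 * (O1 + np.abs(O1))              # ReLU, (1, m)
O2 = H @ W2 + B2                         # one raw score per class, (1, 3)

print(c.shape, O1.shape, H.shape, O2.shape)
```

The three numbers in `O2` are raw scores, one per class; nothing yet forces them to be positive or to add up to one.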

With this process, we finally arrive at our final answer. The logical next step would be to somehow compare this vector to the true answer, but the question is how? Do we simply subtract and square, like in mean-squared error? No. What we need here is probability: our model must give probabilities expressing how certain it is about the given image, and we are going to build our loss function around that as well. This is a major difference between linear regression and classification: the use of probabilities and a different loss function. So let’s get started!

Softmax Function

One problem when thinking about probabilities is that nothing restricts our output vector from containing negative values, which is bothersome, as there cannot be a negative probability. There is also the constraint of normalization: all the values in the vector must add up to one. To take care of both of these problems, we pass the output through what is known as the softmax function. This function takes in the values and returns a probability distribution. It is defined as:

\text{Softmax function: } \sigma(o)_{i} = \frac{e^{o_{i}}}{\sum_{j = 1}^{K} e^{o_{j}}}, \quad i = 1, \dots, K

This function works because it takes the exponential of every term in the vector, which deals with the negative values, and it divides each one by the sum of all the exponentials of the vector elements, so the entries of the vector after softmax add up to one. Thus we must simply pass our final output into the function to get our true final probability distribution.

σ \left( \begin{bmatrix} o_{11} \\ o_{12} \\ o_{13} \end{bmatrix}_{1 \times 3}^{T} \right) = \begin{bmatrix} \frac{e^{o_{11}}}{\sum_{j = 1}^{3} e^{o_{1j}}} \\ \frac{e^{o_{12}}}{\sum_{j = 1}^{3} e^{o_{1j}}} \\ \frac{e^{o_{13}}}{\sum_{j = 1}^{3} e^{o_{1j}}} \end{bmatrix}_{1 \times 3}^{T}

Note: The σ symbol denotes the softmax function
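Here is a short NumPy sketch of softmax, mostly to confirm the two properties claimed above. The input scores are made up, and subtracting the maximum before exponentiating is a common numerical safeguard that I have added; it does not change the result.

```python
import numpy as np

def softmax(o):
    """Exponentiate every score and divide by the sum of the exponentials."""
    shifted = o - np.max(o)      # subtracting a constant leaves the result unchanged
    exps = np.exp(shifted)       # but avoids overflow for large scores
    return exps / np.sum(exps)

o = np.array([2.0, -1.0, 0.5])   # made-up raw scores, one of them negative
p = softmax(o)

print(p)          # every entry is positive
print(p.sum())    # and together they add up to one
```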

Let us clearly define what we have as our final product: we have what our model thinks is the distribution of probabilities over the classes for the given image. For example, if the output vector is [0.4, 0.3, 0.3], then the model thinks the given image is 40% human, 30% dog and 30% cat, which indicates that it is far from the true probability distribution (which could be [1,0,0] or [0,1,0] or [0,0,1]). Our next step is to define a loss function, only this time around probabilities.

Cross-Entropy Loss

As mentioned above, suppose we initialize the kernels, weights and biases randomly, and go through the processes described above to get a final probability distribution. Let’s take [0.2, 0.5, 0.3] as an example of our model’s initial guess. Let’s assume the image was of a human, so the actual probability distribution is [1,0,0]. What we need now is to somehow calculate the distance between these two distributions, or how wrong the model’s guess was. Doing so can be a bit unintuitive compared to the mean-squared error. We start with a way to find the difference between two distributions, the Kullback-Leibler divergence, which in statistics is used to measure how far one probability distribution is from another.

Kullback-Leibler Divergence

Let’s assume we have two coins. Coin 1 gives heads with probability p₁ and tails with probability p₂, whereas Coin 2 gives heads with probability q₁ and tails with probability q₂. The probability each coin assigns to a given sequence of tosses is then simply:

\begin{array}{l} \text{Coin 1 distribution: } p_{1}^{H} \cdot p_{2}^{T} \\ \text{Coin 2 distribution: } q_{1}^{H} \cdot q_{2}^{T} \end{array}

Where H is the number of heads and T is the number of tails (giving H + T = N as the total number of tosses). What we do next is very simple: take their ratio. And since we cannot forget the normalization condition, we must take the Nth root, giving our equation the form:

\left( \frac{p_{1}^{H} \cdot p_{2}^{T}}{q_{1}^{H} \cdot q_{2}^{T}} \right)^{\frac{1}{N}}

Believe it or not, we have already arrived at the KL divergence! Now we only have to simplify the equation by taking the log of the equation, and get to the final form.

\begin{array}{l} \log \left( \frac{p_{1}^{\frac{H}{N}} \cdot p_{2}^{\frac{T}{N}}}{q_{1}^{\frac{H}{N}} \cdot q_{2}^{\frac{T}{N}}} \right) \\ = \log (p_{1}^{\frac{H}{N}} \cdot p_{2}^{\frac{T}{N}}) - \log (q_{1}^{\frac{H}{N}} \cdot q_{2}^{\frac{T}{N}}) \\ = \log (p_{1}^{\frac{H}{N}}) + \log (p_{2}^{\frac{T}{N}}) - \log (q_{1}^{\frac{H}{N}}) - \log (q_{2}^{\frac{T}{N}}) \\ = \frac{H}{N} \log (p_{1}) + \frac{T}{N} \log (p_{2}) - \frac{H}{N} \log (q_{1}) - \frac{T}{N} \log (q_{2}) \end{array}

As N approaches infinity, H/N approaches p₁ and T/N approaches p₂ (assuming the tosses come from Coin 1), so our equation becomes:

\begin{array}{l} p_{1} \log (p_{1}) + p_{2} \log (p_{2}) - p_{1} \log (q_{1}) - p_{2} \log (q_{2}) \\ = p_{1} \log (p_{1}) - p_{1} \log (q_{1}) + p_{2} \log (p_{2}) - p_{2} \log (q_{2}) \\ = p_{1} \log \left( \frac{p_{1}}{q_{1}} \right) + p_{2} \log \left( \frac{p_{2}}{q_{2}} \right) \end{array}
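The derivation can be checked numerically. The sketch below uses made-up coin probabilities and idealised counts (exactly N·p₁ heads) instead of random tosses, so the log of the Nth root of the ratio should match the final expression exactly.

```python
import math

p1, p2 = 0.5, 0.5      # Coin 1, the coin actually being tossed (made-up values)
q1, q2 = 0.2, 0.8      # Coin 2 (made-up values)

N = 1000
H = round(N * p1)      # idealised counts: exactly the expected number of heads
T = N - H

# Log of the N-th root of the likelihood ratio, computed in log space
# so that powers like q1**H never underflow.
normalised_log_ratio = (H * math.log(p1) + T * math.log(p2)
                        - H * math.log(q1) - T * math.log(q2)) / N

# The final expression of the derivation.
kl = p1 * math.log(p1 / q1) + p2 * math.log(p2 / q2)

print(normalised_log_ratio, kl)
```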

Hence we have arrived at our final equation: given the probabilities of two different distributions, we can effectively quantify the difference between them, or how similar they are. This formula can be applied to more than two probabilities as well, which gives us the general form of the KL divergence:

D_{KL} (P \| Q) = \sum_{i} P(i) \log \left( \frac{P(i)}{Q(i)} \right)

\begin{array}{l} \text{For continuous distributions:} \\ D_{KL} (P \| Q) = \int P(x) \log \left( \frac{P(x)}{Q(x)} \right) dx \end{array}

Since we are only dealing with discrete distributions, you can ignore the second formula. (If anyone is interested, here is an excellent YouTube video on the same topic!) Before we get eager and use this formula to calculate what we want, we must take a little detour and derive our loss function from the KL divergence.
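As a quick numerical illustration of the discrete formula, here is a small sketch. The function name `kl_divergence` and the two distributions are my own, and terms with P(i) = 0 are skipped, following the usual convention that they contribute nothing.

```python
import numpy as np

def kl_divergence(p, q):
    """Discrete KL divergence D_KL(P || Q) = sum_i P(i) * log(P(i) / Q(i)).
    Terms with P(i) = 0 are skipped, since they contribute nothing."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    mask = p > 0
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

p = [0.5, 0.3, 0.2]      # made-up distributions over three outcomes
q = [0.4, 0.3, 0.3]

print(kl_divergence(p, q))   # positive: the distributions differ
print(kl_divergence(q, p))   # a different value: the measure is not symmetric
print(kl_divergence(p, p))   # 0.0: identical distributions
```

Note that swapping the arguments changes the value, so the "distance" here is not a distance in the strict geometric sense.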

The Loss function

There is only a tiny difference between cross-entropy and the KL divergence, and it is rooted in conditional probability. As mentioned before, we want to find the true probability distribution of a given output. Let us define the parameters of our model to be 𝜃 (I am assuming just one, but there can be as many as you want). Our model, using these parameters, arrives at a distribution through the processes described above. Let this probability be,

P^{*} (y | x; θ)

Which means the probability distribution of the output, given the input and the parameter 𝜃 ([0.4,0.3,0.3] in our case). On the other side, we have the true probability distribution, which is simply the probability of the output given the input ([1,0,0] in our case).

P (y | x)

We use the KL divergence on these to find how far away P^* is from P,

D_{KL} (P \| P^{*}) = \sum_{i} P (y_{i} | x) \log \left( \frac{P (y_{i} | x)}{P^{*} (y_{i} | x; θ)} \right)

And on further simplification we get,

\begin{array}{l} \sum_{i} P (y_{i} | x) \left( \log (P (y_{i} | x)) - \log (P^{*} (y_{i} | x; θ)) \right) \\ = \sum_{i} \left( P (y_{i} | x) \log (P (y_{i} | x)) - P (y_{i} | x) \log (P^{*} (y_{i} | x; θ)) \right) \\ = \sum_{i} P (y_{i} | x) \log (P (y_{i} | x)) - \sum_{i} P (y_{i} | x) \log (P^{*} (y_{i} | x; θ)) \end{array}

Now we must remember the purpose of the loss function: to somehow tweak the parameter 𝜃, so as to decrease the loss function. Hence, we only care about the part of the equation that we can control, or the part with 𝜃 in it. Upon closer inspection, we see that the first half of the equation has no 𝜃, and thus we can completely ignore it to finally arrive at our loss function: the cross-entropy loss.

\text{Cross Entropy Loss} = - \sum_{i} P (y_{i} | x) \log (P^{*} (y_{i} | x; θ))
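To see the loss in action, here is a small sketch that evaluates it on the example from earlier, with the sum running over the three classes. The function name `cross_entropy` and the "better" guess are my own additions for illustration.

```python
import numpy as np

def cross_entropy(p_true, p_model):
    """Cross-entropy -sum_i P(i) * log(P*(i)); terms with P(i) = 0 are skipped."""
    p_true = np.asarray(p_true, dtype=float)
    p_model = np.asarray(p_model, dtype=float)
    mask = p_true > 0
    return float(-np.sum(p_true[mask] * np.log(p_model[mask])))

p_true = [1.0, 0.0, 0.0]     # the image is a human
guess = [0.2, 0.5, 0.3]      # the model's initial guess from the text
better = [0.9, 0.05, 0.05]   # a hypothetical guess after some training

print(cross_entropy(p_true, guess))    # -log(0.2), about 1.609
print(cross_entropy(p_true, better))   # -log(0.9), about 0.105
```

With a one-hot target, the loss reduces to minus the log of the probability the model assigned to the correct class, so it shrinks towards zero as that probability approaches one.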

We shall continue with backpropagation in the next essay, or else this essay will be too long (scrolling is not that fun after a point). Follow the link below for Part II.

Continue Reading