**Despite the rapid development of Computer Vision and Deep Learning tools, convolutional neural networks still perform well on applied problems in these areas. Unfortunately, when we use out-of-the-box models, we often skip understanding how the model is structured and how it works. Let's fix this and go over the building blocks that a convolutional neural network is usually made of.**

Hi there! Before we start, let's clarify the terminology to avoid confusion. For brevity, convolutional neural networks are usually called CNNs or ConvNets. All these names mean the same thing, so don't be surprised when you meet any of them in the text.

In this article, we will explain the structure of ConvNets and dive into the implementation using the most popular deep learning frameworks, PyTorch and TensorFlow Keras. All source code is split into small snippets to make it easier to understand and reproduce. I propose to cover the following building blocks.

Convolutional layer.

Pooling layer.

Fully connected layer.

Dropout layer.

Activation function.

**Convolutional layer**

As the name suggests, the convolutional layer is the core of any CNN. It is built on the convolution operation, which takes a convolutional kernel (usually a small square matrix) and a part of the image of the same size as the kernel. The two are multiplied element by element and the products are summed, so each position of the kernel produces a single number (example in the picture below).
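A single step of the convolution can be sketched in plain Python, without any framework. The function name and the values of the patch and the kernel are made up for illustration.

```python
def convolve_patch(patch, kernel):
    """Multiply a patch and a kernel element by element and sum the products."""
    return sum(
        p * k
        for patch_row, kernel_row in zip(patch, kernel)
        for p, k in zip(patch_row, kernel_row)
    )

# Hypothetical 3x3 image fragment and 3x3 kernel (values chosen for illustration).
patch = [
    [1, 2, 3],
    [4, 5, 6],
    [7, 8, 9],
]
kernel = [
    [1, 0, -1],
    [1, 0, -1],
    [1, 0, -1],
]

result = convolve_patch(patch, kernel)
print(result)  # -6
```

Each row contributes (left value - right value), that is -2, so the three rows sum to -6.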

Let's go deeper into the convolutional layer and speak about its parameters. We are going to review *kernel size, feature maps, stride*, and *padding* parameters.

The **kernel size** parameter determines the size of the convolutional kernel. It can be a single integer or a pair of integers (height and width). In most cases both values are the same and odd, such as 3 or 5.

The **feature maps** parameter sets how many kernels are applied in this layer and, as a result, how many channels the output of the layer has. Keras calls this parameter `filters`, and PyTorch calls it `out_channels`.

The kernel is applied across the image by sliding it according to the **stride** parameter. The stride is the step, in pixels, by which the kernel moves over the image. It can be a single integer or a pair of integers. For example, with a stride of 1 we move the kernel by 1 pixel to the right or down on each iteration. The kernel keeps moving right for as long as it fits inside the image. Then it returns to the start of the row, moves down, and repeats the procedure.

With a stride like (3, 2), the two directions use different steps. Both Keras and PyTorch read the pair as (vertical, horizontal), so here the kernel shifts by 2 pixels on each movement to the right and by 3 pixels on each movement down. This is rare in practice, since the stride is usually the same in both directions, but different values are allowed. Of course, they affect the output shape.

**Padding** is another parameter of the convolutional layer. It controls how many empty pixels are added around the image before the convolution is performed. If we set this parameter to zero, the output will be smaller than the input because of the nature of the convolution operation. If the value is greater than zero, a border of the corresponding width is added around the image. For example, with padding=1 the border is 1 pixel wide, with padding=2 it is 2 pixels wide, and so on. Usually, the padded pixels are filled with 0. The general idea of this parameter is to preserve the size of the image between layers. As with the kernel size and the stride, padding can be a single integer or a pair of integers. This is how PyTorch takes it, while the Keras `Conv2D` layer takes `padding="valid"` for no padding or `padding="same"` to pad so that, with stride 1, the output keeps the input size.
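To see what the border looks like, here is a minimal sketch of zero padding in plain Python. The helper name `zero_pad` and the sample image are hypothetical and only serve the illustration.

```python
def zero_pad(image, padding):
    """Return a copy of a 2D image with a border of zeros `padding` pixels wide."""
    width = len(image[0]) + 2 * padding
    top = [[0] * width for _ in range(padding)]
    middle = [[0] * padding + list(row) + [0] * padding for row in image]
    bottom = [[0] * width for _ in range(padding)]
    return top + middle + bottom

# Hypothetical 3x3 image (values chosen for illustration).
image = [
    [1, 2, 3],
    [4, 5, 6],
    [7, 8, 9],
]

padded = zero_pad(image, padding=1)
for row in padded:
    print(row)
# [0, 0, 0, 0, 0]
# [0, 1, 2, 3, 0]
# [0, 4, 5, 6, 0]
# [0, 7, 8, 9, 0]
# [0, 0, 0, 0, 0]
```

A 3x3 image becomes 5x5, so a following 3x3 convolution with stride 1 returns a 3x3 output again.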

In the animation below you can see how the convolutional kernel works. To keep the picture simple, we use a convolution with stride 1 and no padding (padding = 0).
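The sliding procedure can also be written down as a short framework-free sketch. It assumes a single-channel image stored as a list of rows and a stride given as (vertical step, horizontal step). The function name, image, and kernel are made up for illustration.

```python
def conv2d(image, kernel, stride=(1, 1)):
    """Slide `kernel` over a 2D `image` with no padding and return the output map.

    `stride` is (vertical step, horizontal step).
    """
    kernel_h, kernel_w = len(kernel), len(kernel[0])
    stride_h, stride_w = stride
    output = []
    # Move down while the kernel still fits inside the image.
    for top in range(0, len(image) - kernel_h + 1, stride_h):
        row = []
        # Move right while the kernel still fits inside the image.
        for left in range(0, len(image[0]) - kernel_w + 1, stride_w):
            total = 0
            for i in range(kernel_h):
                for j in range(kernel_w):
                    total += image[top + i][left + j] * kernel[i][j]
            row.append(total)
        output.append(row)
    return output

# Hypothetical 4x4 image and a 3x3 kernel that sums the main diagonal of each patch.
image = [
    [1, 2, 3, 4],
    [5, 6, 7, 8],
    [9, 10, 11, 12],
    [13, 14, 15, 16],
]
kernel = [
    [1, 0, 0],
    [0, 1, 0],
    [0, 0, 1],
]

print(conv2d(image, kernel))  # [[18, 21], [30, 33]]
```

The 4x4 image shrinks to 2x2 because a 3x3 kernel fits in only two positions along each axis.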

Now let's do some coding exercises for a general understanding of the theory.

We are going to build a random "image fragment" of 28x28 pixels with 3 channels. Then we apply to it a convolutional layer with 5 feature maps, kernel size 3, stride 1, and no padding (padding = 0), and check the size of the output.
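Before running the layer, the expected shape can be worked out by hand. The sketch below is plain Python rather than framework code, and assumes the standard output-size formula for a convolution without dilation. The function name is hypothetical.

```python
def conv_output_size(size, kernel_size, stride=1, padding=0):
    """Output length along one axis: floor((size + 2*padding - kernel) / stride) + 1."""
    return (size + 2 * padding - kernel_size) // stride + 1

height = width = 28   # image fragment from the exercise
feature_maps = 5      # number of kernels in the layer

out_h = conv_output_size(height, kernel_size=3, stride=1, padding=0)
out_w = conv_output_size(width, kernel_size=3, stride=1, padding=0)

print((1, out_h, out_w, feature_maps))  # (1, 26, 26, 5), channels-last layout
print((1, feature_maps, out_h, out_w))  # (1, 5, 26, 26), channels-first layout

# With padding=1 the 3x3 kernel keeps the original size.
print(conv_output_size(28, kernel_size=3, stride=1, padding=1))  # 28
```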

The result for TensorFlow is exactly what we expected: a 26x26 image with 5 channels. The number of channels comes from the layer's feature maps parameter. The reduced size (28 -> 26) comes from applying a 3x3 kernel without padding. The first dimension of the tensor means that we have only one sample in our small batch.

**TensorShape([1, 26, 26, 5])**

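Let's do the same in PyTorch. One important thing to remember is that the order of dimensions differs: TensorFlow uses channels-last tensors (batch, height, width, channels) by default, while PyTorch uses channels-first tensors (batch, channels, height, width).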

The result is the same - 26x26 image with 5 channels.

**torch.Size([1, 5, 26, 26])**

**Pooling layer**

Pooling layers are usually placed between convolutional layers. Pooling aims to solve two problems. The first is to reduce the size of the feature maps coming out of the convolutional layer, which cuts the amount of computation and the number of trainable parameters in the layers that follow. The second is to fight overfitting caused by the network's sensitivity to the exact location of features. By discarding some information, pooling keeps the algorithm from focusing on it during training.

The pooling layer has two main parameters: the pooling window and the stride. The ***pooling window*** determines the area over which the pooling operation is performed. The ***stride*** is the same as for the convolutional layer: the step by which the pooling window moves over the image. Both parameters can be a single integer or a pair of integers.

Many different operations can be used inside the pooling layer, but the most common are the maximum and the average. Max pooling takes the maximum value in each pooling window, and average pooling takes the mean of all values inside the pooling window. An example of both max pooling and average pooling is shown in the image below.
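Both operations fit into one small framework-free sketch. The helper names and the 4x4 feature map are hypothetical, and the window is assumed to be square.

```python
def pool2d(image, window, stride, reduce_fn):
    """Apply `reduce_fn` to every `window` x `window` block of a 2D image."""
    output = []
    for top in range(0, len(image) - window + 1, stride):
        row = []
        for left in range(0, len(image[0]) - window + 1, stride):
            values = [
                image[top + i][left + j]
                for i in range(window)
                for j in range(window)
            ]
            row.append(reduce_fn(values))
        output.append(row)
    return output

def mean(values):
    """Arithmetic mean of a non-empty list of numbers."""
    return sum(values) / len(values)

# Hypothetical 4x4 feature map (values chosen for illustration).
feature_map = [
    [1, 3, 2, 4],
    [5, 7, 6, 8],
    [9, 2, 1, 0],
    [4, 6, 3, 5],
]

print(pool2d(feature_map, window=2, stride=2, reduce_fn=max))   # [[7, 8], [9, 5]]
print(pool2d(feature_map, window=2, stride=2, reduce_fn=mean))  # [[4.0, 5.0], [5.25, 2.25]]
```

With window 2 and stride 2 the windows do not overlap, so the 4x4 map is reduced to 2x2.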

As we can see from the image, the two pooling approaches give quite different results. Max pooling keeps only the largest value in each window, while average pooling depends on every value inside it. There are no strict rules for choosing between them, so the best way is to try both and decide experimentally. The most typical value for both the pooling window and the stride is 2.

As with the convolutional layer, let's go over coding examples. We will generate an image fragment of shape (4, 4, 1) and pass it through max pooling and average pooling layers with pooling window size 2 and stride 2.

We have the following output:

**(<tf.Tensor: shape=(2, 2), dtype=float32, numpy=
array([[ 5., 8.],
[12., 15.]], dtype=float32)>,
<tf.Tensor: shape=(2, 2), dtype=float32, numpy=
array([[ 3. , 6.25],
[ 9.5 , 13. ]], dtype=float32)>)**

Let's do the same with PyTorch.

As we can see, the results are the same as for the TensorFlow implementation.

**(tensor([[[ 5., 8.],
[12., 15.]]]),
tensor([[[ 3.0000, 6.2500],
[ 9.5000, 13.0000]]]))**

**Fully Connected layer**

This layer consists of weights and classical neurons, as in a feed-forward neural network. It is usually placed between the convolutional/pooling layers and the output layer. There are many ways to build the input of a fully connected layer from the feature maps, but the most popular are **flatten** and **global pooling** (image below).

Flatten takes all feature maps and unrolls them into a single vector. So, for example, if we have 5 feature maps of size 3x3, flatten will produce 5x3x3 = 45 units. In the case of global pooling (it can be average, maximum, etc.), we apply the pooling operation to each feature map and put the result into the output vector. The one thing to remember is that global pooling uses a pooling window equal to the feature map size. So in the same case with 5 feature maps of size 3x3, we get a vector of 5 units (pooling with a window of 3 over a 3x3 feature map gives a single value). Global pooling layers are therefore useful when the last convolution outputs many large feature maps. In such cases the flatten layer can cause overfitting because of the huge number of parameters in the fully connected layer that follows, while global pooling reduces that number significantly.

Let's do a coding exercise. We generate a random image with 3 channels of 9x9 size and pass it through a convolutional layer with 8 feature maps and kernel size 3. After this, we apply two layers in parallel: flatten and global average pooling.
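The sizes in this exercise can be checked with a framework-free sketch. It assumes the convolution output is stored as a list of 8 feature maps of size 7x7 filled with random numbers. The helper names and the seed are arbitrary.

```python
import random

random.seed(0)  # arbitrary seed, only to make the sketch repeatable

# Stand-in for the convolution output: 8 feature maps of size 7x7
# (a 9x9 input and a 3x3 kernel without padding give 9 - 3 + 1 = 7).
feature_maps = [
    [[random.random() for _ in range(7)] for _ in range(7)]
    for _ in range(8)
]

def flatten(maps):
    """Unroll all feature maps into one long vector."""
    return [value for fmap in maps for row in fmap for value in row]

def global_average_pooling(maps):
    """Replace every feature map with the mean of all its values."""
    return [
        sum(value for row in fmap for value in row) / (len(fmap) * len(fmap[0]))
        for fmap in maps
    ]

print(len(flatten(feature_maps)))                 # 392
print(len(global_average_pooling(feature_maps)))  # 8
```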

As we can see from the outputs, in the first case we have 392 units (8 feature maps of size 7x7), and in the second case just 8 (one per feature map).

**(TensorShape([1, 392]), TensorShape([1, 8]))**

We will do the same in PyTorch, with one difference. PyTorch has no layer called global average pooling, so we do it manually: we use the average pooling function with a pooling window equal to the feature map size. This takes the average over the whole feature map and gives the same result as global average pooling. (`nn.AdaptiveAvgPool2d(1)` is another way to get the same values, up to the trailing 1x1 dimensions.)

As we can see, we have the same output as for the Tensorflow.Keras implementation.

**(torch.Size([1, 392]), torch.Size([1, 8]))**

**Dropout layer**

The dropout layer was developed to deal with overfitting in neural networks. The idea is quite simple: during the training phase, we randomly ignore some of the signals from the layer where dropout is applied. The share of skipped signals is set by the *dropout_rate* parameter: each unit is reset to zero with this probability. The remaining values are scaled up by 1 / (1 - dropout_rate), so that the expected sum of the signals stays the same. In this way, we try to erase well-trodden paths and make the network more flexible.
Please remember that dropout layers are active only during the training phase. They do nothing during inference.
Below we do a quick experiment. We will create a random tensor of 10 elements and apply a dropout layer with a dropout rate of 0.5.
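The mechanics can be sketched in plain Python as so-called inverted dropout. This is an illustration under stated assumptions, not the frameworks' actual implementation; the function signature and the seed are made up.

```python
import random

def dropout(values, rate, training=True, rng=random):
    """Inverted dropout: zero each value with probability `rate`, scale the rest."""
    if not training or rate == 0:
        return list(values)  # dropout does nothing at inference time
    scale = 1 / (1 - rate)
    return [0.0 if rng.random() < rate else v * scale for v in values]

rng = random.Random(42)  # arbitrary seed for repeatability
signal = [rng.gauss(0, 1) for _ in range(10)]

dropped = dropout(signal, rate=0.5, rng=rng)
print(signal)
print(dropped)

# Every surviving value is exactly twice the original one (1 / (1 - 0.5) = 2).
# The number of zeros varies from run to run, it is 5 only on average.
```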

As we can see, five units now have zero values. The remaining values were multiplied by 2 because of the dropout rate of 0.5 (1 / (1 - 0.5) = 2).

**tf.Tensor(
[[-1.1899604 1.1257381 0.8904383 0.812129 0.36669895 0.56058276
-0.48239076 -0.29940066 -0.39049834 0.69480914]], shape=(1, 10), dtype=float32)
tf.Tensor(
[[-2.3799207 0. 0. 1.624258 0. 0.
-0.9647815 -0.5988013 -0. 1.3896183]], shape=(1, 10), dtype=float32)**

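The same goes for PyTorch.

The behavior is the same, apart from the different initial random values. Here six units were zeroed rather than five, because each unit is dropped independently with probability 0.5.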

**tensor([[ 0.4601, 0.0929, 0.9562, -0.4803, -0.5623, -1.1717, -1.6377, 0.7694,
-0.5219, -0.7005]])
tensor([[ 0.9201, 0.0000, 1.9124, -0.9606, -0.0000, -0.0000, -0.0000, 0.0000,
-0.0000, -1.4010]])**

**Activation function**

Now it's time to speak about the last part of the convolutional neural network architecture: the activation function. It is a mathematical operation that adds non-linearity to our network.
In theory, we can apply any function we can come up with, but there are common recommendations for CNNs, such as the *ReLU* or *Tanh* functions. In feed-forward networks it is clear how to apply the activation function: we sum all input signals of the current neuron and pass the sum into the function. But how does it work for convolutional layers? It is quite simple: we apply the activation function to each unit of every feature map.
Let's check it in a coding exercise. We will prepare a random sample, pass it through the convolutional layer, and then apply the ReLU function.
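Element-wise application is easy to show without a framework. In the sketch below the feature map values are made up, and `apply_elementwise` is a hypothetical helper.

```python
import math

def relu(x):
    """ReLU keeps positive values and replaces negative ones with zero."""
    return max(0.0, x)

def apply_elementwise(fn, feature_map):
    """Apply an activation function to every unit of a 2D feature map."""
    return [[fn(value) for value in row] for row in feature_map]

# Hypothetical 3x3 feature map, as if it came out of a convolutional layer.
feature_map = [
    [-0.7, 0.9, -0.4],
    [0.4, 0.9, -0.2],
    [0.5, 0.6, 0.0],
]

print(apply_elementwise(relu, feature_map))
# [[0.0, 0.9, 0.0], [0.4, 0.9, 0.0], [0.5, 0.6, 0.0]]

# Tanh squashes every value into the range (-1, 1) instead.
squashed = apply_elementwise(math.tanh, feature_map)
print(squashed[0][0] < 0 < squashed[0][1])  # True
```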

As we can see, it works as discussed. All values greater than zero stay the same, while all negative values become zeros. This is the typical behavior of the ReLU function.

**(<tf.Tensor: shape=(1, 3, 3, 2), dtype=float32, numpy=
array([[[[-0.72591144, -0.35904017],
[ 0.89689165, -1.9028585 ],
[-0.4273774 , 0.08491676]],
[[ 0.3625314 , -0.7813939 ],
[ 0.85972095, -0.9588212 ],
[-0.24983402, -0.40457973]],
[[ 0.5333179 , 0.37630466],
[ 0.6402214 , 0.07430702],
[ 0.0438032 , -0.9578871 ]]]], dtype=float32)>,
<tf.Tensor: shape=(1, 3, 3, 2), dtype=float32, numpy=
array([[[[0. , 0. ],
[0.89689165, 0. ],
[0. , 0.08491676]],
[[0.3625314 , 0. ],
[0.85972095, 0. ],
[0. , 0. ]],
[[0.5333179 , 0.37630466],
[0.6402214 , 0.07430702],
[0.0438032 , 0. ]]]], dtype=float32)>)**

The same code in PyTorch.

As we can see, the behavior is exactly the same as with TensorFlow Keras.

**(tensor([[[[ 9.3237e-02, -1.3700e+00, -2.3280e-01],
[-3.3886e-01, 2.7572e-01, 3.2930e-01],
[ 2.5606e-01, -2.5666e-01, 7.6732e-01]],
[[ 4.5599e-01, 1.3371e-03, -5.7177e-01],
[-3.0741e-01, -4.1037e-01, 1.2240e+00],
[ 8.4792e-01, -9.5745e-01, 1.8718e-01]]]],
grad_fn=<ThnnConv2DBackward0>),
tensor([[[[0.0932, 0.0000, 0.0000],
[0.0000, 0.2757, 0.3293],
[0.2561, 0.0000, 0.7673]],
[[0.4560, 0.0013, 0.0000],
[0.0000, 0.0000, 1.2240],
[0.8479, 0.0000, 0.1872]]]], grad_fn=<ReluBackward0>))**

Finally, we have a complete overview of the main components of convolutional neural networks. Now we can easily build networks like the ones presented in our previous article, with a full understanding of how they work.

As usual, all source code is available in the GitHub repository. I hope this article was useful and that you found something interesting in it. Please ask your questions in the comments, and thanks for reading!
