Comparative analysis of models for image classification on the CIFAR-100 dataset

Researchers have developed many convolutional neural network (CNN) based models for computer vision. Well-known models such as GoogLeNet, Residual Network (ResNet), Visual Geometry Group (VGG), and You Only Look Once (YOLO) differ in architecture and performance, so deciding which model to use can be a troublesome problem for those just starting to study image classification. To address this problem, we introduce the GoogLeNet, ResNet-18, and VGG-16 models and compare their architectures, features, and performance. We then give suggestions based on the test results to help beginners choose a suitable model. We conducted experiments to train and test GoogLeNet, ResNet-18, and VGG-16 on the CIFAR-100 dataset with the same hyperparameters. Based on the test results (test accuracy, average test loss, and training loss), we analyze the curves for trends, key points, rates of increase, and other features, and combine these observations with each model's architecture to draw our conclusions. The experimental results show that ResNet-18 is a good choice when training on the CIFAR-100 dataset: it performs well after training, has low time complexity, and also has the fastest convergence speed. GoogLeNet is the second choice; its final performance is similar to, and even slightly better than, ResNet-18's, but training it is time-consuming. VGG-16 is not recommended in this experiment because it has the worst performance despite a training complexity similar to ResNet-18's.


Overview of Image Classification
The convolutional neural network (CNN) is a famous deep learning architecture inspired by the natural visual perception mechanisms of living organisms. It is widely accepted in academia that the earliest prototype of the concept was the neocognitron proposed by Kunihiko Fukushima in 1980. By 1989, LeCun's team had built the first modern CNN framework that could be trained using a back-propagation algorithm. They later introduced a multi-layer artificial neural network named LeNet-5, which can perform simple image recognition tasks, such as classifying handwritten digits [1]. Since then, as deep learning has evolved, researchers have gained access to larger training datasets and more computational power. CNNs have gradually become capable of handling more complex tasks and have become the typical architecture for building models for image classification [2].
This progress has led to the widespread use of CNN-based image classification techniques in computer engineering, radiology, and the military. For example, face recognition on electronic devices is a standard image classification application, and image classification techniques are also used in advanced driver assistance systems and medical research [3].

The Importance of Comparing CNN Models
During this rapid development phase, researchers have proposed several well-known models built on the CNN architecture, such as VGG, LeNet, GoogLeNet, ResNet, AlexNet, and YOLO. Each of these models has a different structure and exhibits its own advantages. For example, there are clear performance differences between established models such as GoogLeNet, VGG, and ResNet. GoogLeNet is characterized by its low computational cost and its deep, wide network, which enabled it to win the 2014 ImageNet Large Scale Visual Recognition Challenge (ILSVRC) [4]. In contrast, the VGG model uses small convolutional kernels to reduce computational cost and parameters while maintaining a simple structure [5]. In addition, ResNet addresses the problems of degradation and error-rate saturation, achieving a deeper network structure while maintaining a low computational cost compared to GoogLeNet and VGG [6].
Because these models have different structures and performance, choosing which model to use for an image classification task is a practical issue. This paper therefore analyzes and compares the designs and principles of three CNN models, namely VGG, GoogLeNet, and ResNet, and identifies their differences. By comparing their respective advantages in processing images, we make recommendations on the choice of model.
The remainder of the paper is structured as follows. Section 2 reviews several well-known models for image classification tasks and briefly describes their designs and features. In Section 3, we focus on GoogLeNet, VGG, and ResNet, analyze the depth of their architectures, and compare their structural differences. Section 4.1 describes the experiments: which dataset we used, how we set the hyperparameters, and which libraries and platforms we used to run the experiments. Sections 4.2 and 4.3 analyze and summarize the experimental results.

Related Work
As the winner of ILSVRC 2014, GoogLeNet demonstrated a powerful ability to reduce parameters and computational cost by increasing the depth and width of the network [4]. The inception module is what allows GoogLeNet to be both wider and deeper.
The VGG model also performed well in ILSVRC 2014, ranking second. Unlike GoogLeNet, VGG extracts features using only convolutional and max-pooling layers. To reduce the number of parameters and make the decision function more discriminative, it stacks multiple small receptive fields to achieve the same functionality as a larger receptive field [5]. Thanks to the small convolutional filters, VGG can add more convolutional layers, increasing its network depth to a maximum of 19 weight layers. The VGG developers found that the architecture's error rate saturated at 19 layers, but they believed that a deeper network would benefit larger datasets. Because each convolutional layer has the same structure, the layers can simply be stacked. These advantages allow VGG to perform well in large-scale image classification.
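The equivalence between stacked small receptive fields and one large receptive field follows from simple arithmetic: n stacked k×k convolutions with stride 1 have an effective receptive field of 1 + n(k−1), while using fewer weights. A minimal sketch of the check (the channel count C is an illustrative assumption, not a value from the paper):

```python
def receptive_field(kernel: int, n_layers: int) -> int:
    """Effective receptive field of n stacked kernel x kernel convs, stride 1."""
    return 1 + n_layers * (kernel - 1)

def conv_params(kernel: int, channels: int) -> int:
    """Weights in one conv layer mapping `channels` -> `channels` (bias ignored)."""
    return kernel * kernel * channels * channels

# Two 3x3 layers cover the same region as one 5x5 layer ...
assert receptive_field(3, 2) == receptive_field(5, 1) == 5
# ... and three 3x3 layers match one 7x7 layer,
assert receptive_field(3, 3) == receptive_field(7, 1) == 7

# while needing fewer weights (C = 64 is an illustrative choice):
C = 64
print(2 * conv_params(3, C), "vs", conv_params(5, C))  # 18*C^2 vs 25*C^2
print(3 * conv_params(3, C), "vs", conv_params(7, C))  # 27*C^2 vs 49*C^2
```

This is the parameter saving VGG exploits: the stacked 3×3 layers also interleave extra non-linearities, which the single large kernel does not.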
In 2014, having 16-19 weight layers was a significant achievement; the ResNet developed in 2015, however, has 152 layers, 8× deeper than the VGG nets. This very deep network still maintains low complexity and won first place in the classification task at ILSVRC 2015 [6]. An essential feature of ResNet is that it addresses the problem of degradation. Typically, the deeper the network, the better it performs, but increasing depth can also cause saturation and degradation of the model [7]. The solution to this problem is to use deep residual nets with shortcut connections. Deep residual networks can be easily optimized and gain accuracy from significantly increased depth, whereas plain stacked networks exhibit higher training error as depth increases [6].
In 2016, J. Redmon et al. introduced the You Only Look Once (YOLO) model, which became famous for its speed and accuracy. YOLO has 26 layers, consisting of 24 convolutional layers and two fully connected layers [8]. On the ImageNet 2012 validation dataset, it achieved 88% accuracy after about one week of training. The principle behind YOLO is to predict a confidence score for each bounding box created in the image. A notable feature of YOLO is that it sees the whole picture, which reduces its background errors.
As networks grow deeper and deeper, another problem arises: important information, such as input features or gradients, may be lost after being processed through many layers [9]. Therefore, G. Huang et al. proposed DenseNet to solve this problem. A key feature of DenseNet is that it achieves maximum information flow from layer to layer by interconnecting all layers: each layer passes its output as additional input to the next layer, which in turn passes information on to its own successor.
The inherently sequential nature of earlier models precludes parallelization within training instances, which becomes critical at longer sequence lengths, as memory limits batching across instances [10]. Therefore, Vaswani et al. proposed the Transformer model, which uses the attention mechanism, in 2017. Attention does not consider the distance between dependent elements in the input or output sequence. Without attention, every element carries the same weight; with attention, the importance of features can be determined from their correlation, helping the model remember information contained in the context. Many models now successfully use attention to deal with different situations.
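As a rough illustration of how attention weights elements by correlation rather than distance, here is a minimal scaled dot-product attention in NumPy; the shapes and random inputs are illustrative assumptions, not tied to any model in the text:

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Each output row is a weighted mix of V's rows; weights come from Q-K similarity."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)               # similarity, independent of position
    scores -= scores.max(axis=-1, keepdims=True)  # subtract max for numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)  # softmax: each row sums to 1
    return weights @ V, weights

rng = np.random.default_rng(0)
Q = rng.standard_normal((4, 8))   # 4 query positions
K = rng.standard_normal((6, 8))   # 6 key positions
V = rng.standard_normal((6, 8))   # 6 value vectors
out, w = scaled_dot_product_attention(Q, K, V)
assert out.shape == (4, 8)
assert np.allclose(w.sum(axis=-1), 1.0)
```

Without attention every element would effectively carry the same weight; here the softmax over Q-K similarities produces a per-query distribution over all positions, near or far.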

Architecture of Models and Their Difference
In this section, we compare the architectures of GoogLeNet, VGG-16, and ResNet and discuss their differences. The inception module is essential in GoogLeNet, as shown in Figure 1. Different-sized convolution kernels are needed because the salient part of each image varies in size. Increasing the network depth is the standard solution to this problem; however, a deeper network increases the computational cost and makes it difficult to propagate gradient updates through the entire network. Therefore, the inception module was developed to widen the network instead [4], as Figure 2 shows. The 1×1 convolutional kernel plays an essential role here: it removes computational bottlenecks while allowing the width and depth of the network to increase. Counting pooling layers, GoogLeNet has 27 layers in total; counting only weight layers, it has 21 convolutional layers and one fully connected layer, for 22 layers. In a traditional convolutional network there is only one transformation per layer, but in the inception module each layer performs multiple transformations whose results are concatenated and passed on to the next layer. To cover regions of different sizes, filters of 1×1, 3×3, and 5×5 are used, and each module also includes a 3×3 max-pooling operation. However, the 3×3 and 5×5 filters are computationally expensive, so the solution is to apply a 1×1 filter first to reduce the input dimensions, followed by the 3×3 or 5×5 filter.
VGG has fewer layers and a more straightforward structure than GoogLeNet, with up to 19 weight layers. All VGG variants have three fully connected layers, so VGG has up to 16 convolutional layers, while VGG-16 has 13, as shown in Figure 3 [5]. It also has a softmax layer and five max-pooling layers. Unlike GoogLeNet, the convolutional layers of VGG achieve the functionality of one large convolutional kernel by stacking multiple small kernels: two 3×3 convolutional kernels have the same functionality as one 5×5 kernel, and three 3×3 kernels match one 7×7 kernel. The 3×3 kernel was chosen because it is the smallest size that can distinguish left/right, top/bottom, and center. In addition, VGG uses a 1×1 filter which, unlike GoogLeNet's, preserves the input dimension, i.e., the output dimension is the same as the input dimension; the 1×1 filter also increases the nonlinearity of the decision function.
The beauty of the ResNet design is that it solves the problem of network degradation in very deep networks by using residual networks and shortcut connections [6]. The principle of the shortcut connection in Figure 5 is to skip layers and perform an identity mapping without affecting the back-propagation operation. There are three methods for making shortcut connections when the input dimension is smaller than the output dimension. The first is to increase the size by padding with extra zeros, keeping the shortcut an identity mapping; this requires no additional parameters. The second is to use a 1×1 convolutional layer for projection shortcuts where dimensions change, keeping the other shortcuts as identity mappings. The third is to let all shortcuts perform projections. Although the third method offers the highest accuracy, it requires more parameters and is more computationally expensive than the second. Compared to VGG and GoogLeNet, Figure 4 shows that ResNet significantly increases network depth while maintaining lower complexity than VGG. It has up to 152 weight layers, consisting of one 7×7 convolutional layer, one 3×3 max-pooling layer, and 150 further convolutional layers. These 150 layers form blocks of three convolutional layers each with the structure 1×1, 3×3, 1×1: as in VGG and GoogLeNet, the 1×1 layers reduce and then restore the dimensionality, leaving the 3×3 layer as a bottleneck with smaller input and output dimensions. Finally, ResNet has an average-pooling layer, a fully connected layer, and a softmax operation.
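The saving from placing a 1×1 reduction before an expensive 5×5 filter, as described above for GoogLeNet, can be made concrete by counting multiply-accumulates. The feature-map and channel sizes below are illustrative assumptions (loosely GoogLeNet-scale), not values taken from this paper:

```python
def conv_macs(h, w, k, c_in, c_out):
    """Multiply-accumulates for a k x k convolution on an h x w feature map."""
    return h * w * k * k * c_in * c_out

H = W = 28                          # feature-map size (illustrative)
C_IN, C_RED, C_OUT = 192, 16, 32    # input, 1x1-reduced, and 5x5 output channels

direct = conv_macs(H, W, 5, C_IN, C_OUT)  # 5x5 applied straight to the input
reduced = conv_macs(H, W, 1, C_IN, C_RED) + conv_macs(H, W, 5, C_RED, C_OUT)

print(f"direct 5x5: {direct:,} MACs; with 1x1 bottleneck: {reduced:,} MACs")
assert reduced < direct / 9   # roughly an order-of-magnitude saving here
```

With these numbers the bottleneck path needs under a ninth of the direct path's arithmetic, which is why the inception module can afford parallel 3×3 and 5×5 branches.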

CIFAR-100 Dataset
The CIFAR-100 dataset has 100 classes of images, each with 600 images [11]: 500 training images and 100 test images per class. In total, CIFAR-100 has 50,000 training images and 10,000 test images. In addition to the 100 classes (fine labels), the images are grouped into 20 superclasses (coarse labels). All images have a size of 32×32 and have three channels because they are color images. We use Google Colab to train and test the models in our experiments. The hyperparameters follow [12]: the batch size is 128; the learning rate starts at 0.1 and is divided by 5 at the 60th, 120th, and 160th epochs; the weight decay is 5×10⁻⁴; the Nesterov momentum is 0.9; and training runs for a total of 200 epochs. The experiments are based on the open-source GitHub project created by Kwon et al. and use the PyTorch library [13].
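The step schedule from [12] — start at 0.1 and divide by 5 at epochs 60, 120, and 160 — can be written as a small helper (equivalent in spirit to PyTorch's MultiStepLR with gamma = 0.2; the function below is a sketch, not the project's actual code):

```python
def learning_rate(epoch, base_lr=0.1, milestones=(60, 120, 160), factor=5):
    """Learning rate in effect at a given epoch under the step schedule from [12]."""
    lr = base_lr
    for m in milestones:
        if epoch >= m:   # each milestone passed divides the rate by `factor`
            lr /= factor
    return lr

for e in (0, 59, 60, 120, 160, 199):
    print(e, learning_rate(e))
```

So the rate is 0.1 for epochs 0-59, 0.02 for 60-119, 0.004 for 120-159, and 0.0008 thereafter; these drops line up with the jumps visible in the result figures.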

Results
In Figure 6, the number of training epochs is on the horizontal axis and test accuracy is on the vertical axis. The figure contains three models: VGG-16, ResNet-18, and GoogLeNet. In general, the accuracy of all three models increases with the number of epochs. There are three key stages where accuracy rises rapidly, around epochs 5-20, 60, and 120, because the project follows the hyperparameter settings in [12]: the initial learning rate of 0.1 is divided by 5 at the 60th, 120th, and 160th epochs.
The accuracy of all three curves rises rapidly from epochs 0 to 20, slows from epochs 20 to 60, jumps again around epoch 60, and then fluctuates slightly from epochs 60 to 120 before leveling off. After epoch 120, the models hit a bottleneck, and accuracy ceases to improve as the epoch number increases.
Figure 6 shows that ResNet-18 initially converges fastest, in epochs 0-20, followed by GoogLeNet. In epochs 60-120, GoogLeNet and ResNet-18 have similar accuracy, with ResNet-18 slightly higher. The accuracy of VGG-16 is about 0.3 lower than the others. After epoch 120, when the models reach their bottleneck, GoogLeNet's accuracy is slightly higher than ResNet-18's, while VGG-16's accuracy remains about 0.3 lower.
In Figure 7, the number of training epochs is on the horizontal axis and the average test loss is on the vertical axis; the figure again contains VGG-16, ResNet-18, and GoogLeNet. The average test loss decreases as the number of epochs increases, corresponding to Figure 6: as the test accuracy of the three models rises, the average test loss falls. Around epochs 5-20, 60, and 120, the average test loss drops rapidly due to the learning-rate changes described in [12]; these rapid drops match the rapid increases in test accuracy at the same three stages.
The average loss of all three curves declines rapidly from epochs 0 to 20, slows from epochs 20 to 60, drops quickly again around epoch 60, and then fluctuates slightly from epochs 60 to 120 before leveling off. After the rapid decline at epoch 120, the average losses of GoogLeNet and ResNet-18 gradually stop falling and level off. In contrast, the average loss of VGG-16 drops only slightly at epoch 120 and then trends upward until epoch 200.
The figure shows that VGG-16 always has the highest average test loss among the three models. GoogLeNet's average loss is second-lowest from epochs 0 to 60, and then ResNet-18's average loss slightly exceeds GoogLeNet's from epochs 60 to 120. After epoch 150, the average losses of GoogLeNet and ResNet-18 remain similar until epoch 200. This trend corresponds to the accuracy of the three models: VGG-16, with the consistently highest average loss, maintains the lowest accuracy. From epoch 0 to 60, ResNet-18's accuracy is higher than GoogLeNet's because ResNet-18 has a lower average loss. From epoch 60 to 120, ResNet-18's average loss is slightly higher than GoogLeNet's, although they remain close, so the accuracy of the two models is almost identical. After epoch 120, both GoogLeNet and ResNet-18 reach their bottlenecks, and their accuracy and average loss become stable.
For training, we trained on the 50,000 training images for 200 epochs with a batch size of 128 per iteration, giving 78,200 iterations on the horizontal axis of the training-loss plot in Figure 8; the average training loss is on the vertical axis. The experimental data show that the training losses of all three models gradually decrease, indicating that they learn and fit the training data well and that the loss function eventually converges. The comparison also shows that, for the same epoch count, VGG-16 has the slowest convergence while ResNet-18 converges fastest. Differences in model structure can explain these differences in training efficiency to a certain extent. The GoogLeNet model has 22 weight layers and 6.8 million parameters, and the size of the experimentally deployed model was 286 MB. The experimental data show that VGG-16 has the most parameters and GoogLeNet the fewest, with ResNet-18 in between. However, after deployment and training, the model size did not correlate with the number of parameters: GoogLeNet, with the fewest parameters, has roughly twice the model size of the other two, while ResNet-18 has the smallest model size despite having almost twice as many parameters as GoogLeNet.
Table 2 shows that, with the termination condition set to 70% accuracy, the training times of the three models were 22.69, 9.73, and 18.39 minutes, respectively. Correspondingly, the fastest inference time among the three models was 31.77 minutes for ResNet-18 and the slowest was 59.79 minutes for GoogLeNet, with VGG-16 in between at 37.29 minutes. Comparing these results with the earlier data, the experiments verify that a model's size after deployment correlates positively with its inference time, and they also demonstrate that the correlation between a model's training time and its number of parameters is weak.
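The iteration count on Figure 8's horizontal axis follows directly from the dataset and batch sizes: 50,000 images at a batch size of 128 give ceil(50,000 / 128) = 391 iterations per epoch, and 391 × 200 epochs = 78,200 iterations. As a quick check:

```python
import math

train_images, batch_size, epochs = 50_000, 128, 200
iters_per_epoch = math.ceil(train_images / batch_size)  # last batch of each epoch is partial
total_iters = iters_per_epoch * epochs

print(iters_per_epoch, total_iters)  # 391 78200
assert total_iters == 78_200
```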

Conclusion
Comparing the above data, the experiments show that VGG-16 has the largest number of parameters of the three models. Although VGG-16 uses small 3×3 convolutional kernels to keep its structure concise and its per-layer computational cost low, its very large number of parameters and its two additional fully connected layers consume too many computational resources, making training less efficient than ResNet-18's. GoogLeNet's network is deeper and more complex than ResNet-18 and VGG-16: besides having more layers, it uses the inception module, which contains four parallel convolutional branches. This design significantly increases computational complexity and slows the convergence of training loss and test accuracy. As a result, although GoogLeNet has the fewest parameters of the three networks, its training time is the slowest. Nevertheless, the trained GoogLeNet achieved the highest accuracy in the experiment, slightly higher than ResNet-18 by a margin of less than 1%. Finally, compared to the two networks above, ResNet-18 balances parameter count and network design when training on CIFAR-100. Its test accuracy is very close to GoogLeNet's, and its inference efficiency is much higher than GoogLeNet's and VGG-16's (in inference time, ResNet-18 is 33% faster than GoogLeNet and 15% faster than VGG-16). Therefore, when training models on datasets such as CIFAR-100, we recommend ResNet-18: it outperforms VGG-16, requires less training time, and performs similarly to GoogLeNet. The experiment also verified that reducing the number of parameters does not necessarily reduce training time or weaken a model's accuracy. In later experiments, the focus will therefore be on testing the inference performance of different models on different datasets and on measuring the impact of preprocessing algorithms and network-structure optimizations on these models.

Table 1. Layer, Parameter, and Size Details of GoogLeNet [4], ResNet-18 [14], and VGG-16 [5]
As Table 1 shows, the VGG-16 model has 16 weight layers: 3 fully connected layers and 13 convolutional layers. In our experiments, this model has 138.4 million parameters; after training and deployment, the obtained VGG-16 model is 101 MB. The ResNet-18 model has 17 convolutional layers and 1 fully connected layer, with 11.2 million parameters; after experimental deployment, the obtained ResNet-18 model is 93 MB.