Research on application of model ensemble in sports image classification based on environmental information

In recent years, deep convolutional neural networks have been widely used in the field of image classification. However, training a satisfactory network is arduous: the hyper-parameters must be tuned, and the overfitting that deep networks are prone to must be avoided. Moreover, it is difficult for a neural network to learn subtle details without human annotation. This paper therefore proposes a classification algorithm for sports classification tasks that combines a deep neural network with an object detection algorithm to obtain the prediction result. The author compares, from a holistic perspective, classification performed directly with a neural network against the modified model-ensemble approach, and discusses the advantages and disadvantages of the two approaches. The results show that the improved model-ensemble classification algorithm outperforms the direct use of a neural network and achieves high accuracy on the test set.


Introduction
Image classification is a subfield of artificial intelligence in which theoretical development has produced a growing number of diverse approaches, all of which are crucial to the discipline [1]. Deep learning is currently one of the most prominent of these fields. For image classification, the most frequently used architecture is the deep convolutional neural network (CNN), and the continual discovery of new insights and applications in deep learning has driven the development of CNNs. Among the most widely used CNN models for image classification at the moment are AlexNet, ResNet, and VGG, among others. These models contain a large number of layers, which makes them difficult to train, and they require tuning many hyper-parameters, which makes it hard to obtain an excellent model; nevertheless, they perform exceptionally well at image classification. Increasingly, the multi-model ensemble is employed as a technique for improving accuracy on the test set. A single classifier cannot ensure that all feature descriptions are mapped appropriately; a combined classifier can successfully circumvent the shortcomings of a single classifier. Classification accuracy can be significantly improved by choosing the optimal classifier combination and a more exhaustive set of features [2]. However, since most model ensembles consist of training several CNN models and fusing the output of each network to obtain the classification result, there have been few studies on ensembling a CNN with the image's environmental information. Therefore, the author evaluates the viability of integrating environmental information with CNN outputs in this research. The paper proposes an integrated model consisting of two classifiers: a Darknet53 model, used to obtain the basic features in the image, and an object detection model based on YOLOV3, used to detect environmental information in the image, such as the presence of a baseball bat and a baseball glove in an image of baseball, or the presence of both a wheelchair and a basketball in an image of wheelchair basketball. As a result of this study, one more option will be available for model selection in model ensembles.

Dataset overview
The first dataset, used to train the CNN model, comes from Kaggle and is titled "100 Sports Image Classification". It covers 100 sports categories, with each image measuring 224*224*3 [3]. The author chose 10 of the 100 sports: basketball, wheelchair basketball, volleyball, water polo, baseball, golf, skiing, surfing, field hockey, and hockey. Only these 10 sports were selected because the similarities between them are so great that it is difficult for a CNN to detect their specific distinctions, which allows the environmental information in the image to play its role. This dataset's test set serves as the test set for the entire model.
The second dataset is used to train the YOLOV3 object detection model, which recognizes crucial environmental information in an image; for a wheelchair basketball image, for example, it determines whether or not the image contains a wheelchair and a basketball. The author constructed this dataset himself. It includes 11 categories to be detected: baseball bat, baseball gloves, basketball, wheelchair, snowboard, little ball, golf stick, hockey stick, surfboard, ice hockey, and volleyball. The dataset was produced by manually annotating the essential information in the training set of the CNN dataset with the labelImg software and dividing the annotated data into YOLOV3 training and test sets in a 9:1 ratio. To enable input into the YOLOV3 model, these data were resized to 416*416*3.

The third dataset is used to train the ensemble of the CNN model and the object detection model. First, the author performs some processing on the YOLOV3 outputs. The detection results are recorded in an array; for example, if a baseball bat and a little ball are detected in a picture, the array will be ["baseball bat", "little ball"]. Then, according to the pseudocode shown in Figure 1, the array is converted into a vector of predicted possibilities over the 10 sports. For example, if a little ball is detected, one is added to the entries for baseball (0), field hockey (2), and golf (3), resulting in the one_hot_res array [1,0,1,1,0,0,0,0,0,0]; this is primarily what the add_the_index function does. The algorithm applies normalization to one_hot_res at the end. The output of the CNN model (1*classes) and the processed YOLOV3 output (1*classes) are then concatenated into a vector of size 1*(classes*2). The training set and test set used for the CNN model are converted into a new training set and test set in this way.
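The processing described above can be sketched in Python. This is a reconstruction of the Figure 1 pseudocode: the indices for baseball (0), field hockey (2), and golf (3) follow the paper's example, but the remaining index assignments, the object-to-sport mapping, and sum-normalization as the final step are illustrative assumptions.

```python
# Hypothetical mapping from detected objects to the sport indices they support.
# Only indices 0 (baseball), 2 (field hockey), and 3 (golf) come from the
# paper's example; the other entries are illustrative assumptions.
OBJECT_TO_SPORTS = {
    "baseball bat": [0],
    "baseball gloves": [0],
    "little ball": [0, 2, 3],  # a small ball is evidence for three sports
    "wheelchair": [1],         # assumed index for wheelchair basketball
    "basketball": [1, 4],      # assumed indices
}

def add_the_index(detections, num_classes=10):
    """Convert a list of detected object names into a normalized vector of
    predicted possibilities over the sports classes."""
    one_hot_res = [0.0] * num_classes
    for obj in detections:
        for idx in OBJECT_TO_SPORTS.get(obj, []):
            one_hot_res[idx] += 1.0
    total = sum(one_hot_res)
    if total > 0:  # normalize so the entries sum to 1 (assumed scheme)
        one_hot_res = [v / total for v in one_hot_res]
    return one_hot_res
```

The resulting 1*10 vector is then concatenated with the CNN's 1*10 output to form the 1*20 input of the ensemble network.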

Model architecture overview
The first model discussed here is Darknet53, which serves both as the backbone network of the YOLO model and as a CNN for classifying sports images. This network contains 53 convolutional layers and incorporates residual connections and batch normalization. The residual connections address the network-degradation issue caused by having too many layers [4], and batch normalization accelerates learning and keeps the input distribution of each layer of the network stable [5]. The model's specific architecture is depicted in Figure 2.

The second model presented here is YOLOV3, a relatively traditional object detection technique. The network is built on the DarkNet53 network without the FC and AvgPool layers, and it employs upsampling as well as the fusion of shallow-layer and deep-layer features [6]. The model's essential concepts are depicted in Figure 3 [7]. The model produces three outputs of sizes 13*13*48, 26*26*48, and 52*52*48, which correspond to the detection of large, medium, and tiny targets, respectively. The number of filters is calculated as 3*(11+4+1), where 3 is the number of anchor boxes used for detection at each scale and (11+4+1) is the 16 parameters each anchor box requires: the conditional probabilities of the 11 target classes, 4 parameters determining the location of the anchor box, and one parameter indicating the possibility that an object exists in the anchor box. The loss comprises three terms:

(1) \; L_{loc} = \sum_{i=1}^{S^2} \sum_{j=1}^{B} 1_{i,j}^{obj} \, \| b_{i,j} - \hat{b}_{i,j} \|^2

(2) \; L_{obj} = \sum_{i=1}^{S^2} \sum_{j=1}^{B} 1_{i,j}^{obj} \Big[ -\log(p_c) + \sum_{k=1}^{11} \mathrm{BCE}(\hat{c}_k, c_k) \Big]

(3) \; L_{noobj} = \sum_{i=1}^{S^2} \sum_{j=1}^{B} \big(1 - 1_{i,j}^{obj}\big) \big[ -\log(1 - p_c) \big]

The symbol S^2 denotes the number of grids in the prediction; B denotes the number of anchor boxes per grid; 1_{i,j}^{obj} is set to 1 if the (i,j) grid cell is used to make the prediction and 0 otherwise; and b_{i,j} and \hat{b}_{i,j} denote the predicted and ground-truth box parameters.
Formula 1 measures the difference between the ground-truth bounding box and the anchor box used to detect the object. Unlike YOLOV1 and YOLOV2, YOLOV3 identifies positive samples not by defining an IOU threshold with the ground-truth box, but by selecting among the three prediction scales: the anchor box with the largest IOU with the ground-truth box is chosen as the positive sample. Formula 2 computes the objectness confidence loss, where -log(p_c) drives the confidence of the positive sample toward 1, and the BCE (Binary Cross Entropy) between each predicted conditional category probability of the positive sample and its actual conditional category probability drives each category probability toward its true value, where BCE(\hat{c}_k, c_k) = -\hat{c}_k \log(c_k). Formula 3 computes the loss of the objectness score of the negative samples (the anchor boxes not used for detection); since -log(1 - p_c) is used here, p_c must approach 0 for the loss to decrease.
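The per-box loss terms described above can be illustrated numerically as follows. This is a minimal sketch, not the paper's code: the function names and the restriction to a single box are the author's illustration, and the category term uses the simplified BCE form given in the text.

```python
import math

def positive_box_loss(p_c, c_true, c_pred):
    """Per-box terms of Formula 2 for a positive sample: -log(p_c) drives the
    objectness confidence toward 1, and the category term drives each
    conditional class probability toward its target, using the text's
    BCE(c_hat_k, c_k) = -c_hat_k * log(c_k)."""
    conf_loss = -math.log(p_c)
    cls_loss = sum(-t * math.log(p) for t, p in zip(c_true, c_pred) if t > 0)
    return conf_loss + cls_loss

def negative_box_loss(p_c):
    """Per-box term of Formula 3 for a negative sample: -log(1 - p_c) shrinks
    as the objectness score p_c approaches 0."""
    return -math.log(1.0 - p_c)
```

A confident correct detection (p_c and the true-class probability near 1) contributes almost no loss, while a confident detection on a negative box makes negative_box_loss grow without bound.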
The third model presented here is the stacking model that fuses Darknet53 and YOLOV3; the details are displayed in Figure 4. It is a neural network with two groups of FC layers, taking a 1*20 input and producing a 1*10 output, with ReLU as the activation function and softmax as its final layer [8].
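The fusion head can be sketched in plain Python as follows. The paper specifies only the 1*20 input, the 1*10 output, ReLU, and the final softmax; the hidden width of 16 and the random weights used in the example are illustrative assumptions.

```python
import math

def relu(v):
    return [max(0.0, x) for x in v]

def softmax(v):
    m = max(v)
    exps = [math.exp(x - m) for x in v]  # subtract max for numerical stability
    s = sum(exps)
    return [e / s for e in exps]

def linear(x, weights, bias):
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weights, bias)]

def fusion_forward(x, w1, b1, w2, b2):
    """Two FC layers: a 1*20 input (CNN probabilities concatenated with the
    processed YOLOV3 environment vector) -> hidden layer -> 1*10 softmax."""
    hidden = relu(linear(x, w1, b1))
    return softmax(linear(hidden, w2, b2))
```

Because the final layer is a softmax, the output is a proper probability distribution over the 10 sports.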

Overview of the experiment process
This chapter outlines the steps of the experiment. The first step is to use the Darknet53 network to train a sports classification model and record the model's maximum test accuracy. After that training is finished, the labeling of the dataset used for training YOLOV3 begins. Once the labeling is complete, YOLOV3 training is initiated. When all of these procedures are finished, the author processes the data in preparation for the model ensemble and then trains the fusion network. After that training completes, its accuracy is recorded. Finally, the DarkNet53 model and the ensemble of DarkNet53 and YOLOV3 are compared.

The DarkNet53 model result
The Darknet53 model's best performance on the test set of the 10-sport classification dataset was 71.9%, obtained at epoch 36 with a learning rate of 0.1, weight decay of 1e-4, and momentum of 0.9. Figure 5 depicts the model's accuracy against the number of epochs. The graph demonstrates that there is no overfitting issue in the model, which is partially attributable to the use of data augmentation. However, there is a ceiling on the achievable test accuracy: even though the training accuracy continues to improve after epoch 36, the test accuracy declines. Table 1 reports the precision, recall, and F1-score for the sports in the dataset. The overall F1-score of this model is 65%; the identification of ice hockey, field hockey, and water polo is the most successful, with F1-scores near 100%, while the detection of basketball, wheelchair basketball, and golf is poor [9].
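The per-class scores reported in Table 1 combine precision and recall in the standard way; as a reminder, F1 is their harmonic mean, and the overall score can be taken as the unweighted (macro) average over classes:

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall, as reported per class."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

def macro_f1(pairs):
    """Unweighted average of per-class F1 scores over (precision, recall) pairs."""
    return sum(f1_score(p, r) for p, r in pairs) / len(pairs)
```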

The YoloV3 model result
For the detection of pivotal information in the environment, incorrectly categorizing an object is frequently detrimental [10]. For instance, identifying a golf stick as a hockey stick will reduce the performance of the model fusion. However, improving precision alone is not sufficient: as long as the confidence threshold is set to a large value, precision will unquestionably improve, but recall will decline rapidly, and if the majority of the targets to be detected are disregarded, the fusion model will make little progress. Consequently, we must identify the model with the greatest mAP at an IOU threshold of 0.5. After several attempts, the hyper-parameters LR: 0.01, momentum: 0.937, and weight decay: 0.0005 achieved the highest mAP@0.5, at 0.836. For a better analysis, the confusion matrix and the PR curve are shown in Figure 6 and Figure 7.
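The precision/recall trade-off described above can be illustrated with a small sketch (toy detections, not the paper's evaluation code): raising the confidence threshold discards uncertain detections, which tends to raise precision while lowering recall.

```python
def precision_recall_at(detections, num_ground_truth, threshold):
    """detections: list of (confidence, is_true_positive) pairs.
    Keep only detections at or above the threshold, then compute
    precision = TP / (TP + FP) and recall = TP / number of ground truths."""
    kept = [is_tp for conf, is_tp in detections if conf >= threshold]
    tp = sum(kept)
    fp = len(kept) - tp
    precision = tp / (tp + fp) if kept else 1.0
    recall = tp / num_ground_truth
    return precision, recall
```

Sweeping the threshold traces out the PR curve of Figure 7; mAP@0.5 averages the area under these curves over classes, with IOU >= 0.5 deciding which detections count as true positives.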

The model ensemble result analysis
This model has made substantial progress: DarkNet53's 65% F1-score increased to 97% with the model ensemble, and the test-set accuracy increased from 72% to 96%. Comparing Table 1 with Table 2 (the precision, recall, and F1-score of the model ensemble), the F1-score for recognizing surfing grew from 18% to 100%, the F1-score for predicting wheelchair basketball improved from 17% to 100%, and the F1-score for distinguishing basketball rose from 49% to 84%. Even though hockey's F1-score dropped from a perfect 100% to a still-impressive 95%, the overall F1-score increased significantly. This demonstrates both the importance of fusing critical environmental information into the model's prediction and the viability of integrating such information.

Conclusion
When combined with a CNN, extracting environmental information from images yields superior results to a standard CNN network alone. However, from a user-friendliness perspective, this approach is less than ideal. Manually annotating and extracting environmental information from different types of images is time-consuming; the effort required to cover these 10 categories is already substantial, and when the number of categories to be classified is high, this approach becomes challenging. Applying data augmentation to the model-ensemble data to enhance the accuracy of the fusion, and identifying the precise objects to be detected for each distinct class, are potential areas for future research.

Figure 1 .
Figure 1. The pseudocode for processing the data.
In Figure 2, ConvB denotes the convolutional block, which consists of one 2D convolutional layer and one batch normalization layer, with LeakyReLU as the activation function. ResB denotes the residual block; this sequence combines the initial input with the results from two ConvB blocks. The Half block is designed to halve the image's spatial dimensions with stride=2: for instance, an input of 224*224*in_channel becomes 112*112*out_channel.
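The halving behavior of the Half block follows the standard convolution output-size formula. The kernel size of 3 and padding of 1 below are assumptions; the text only states stride=2.

```python
def conv_out_size(in_size, kernel=3, stride=2, padding=1):
    """floor((in + 2*padding - kernel) / stride) + 1: with kernel=3, stride=2,
    and padding=1, an even input size is exactly halved."""
    return (in_size + 2 * padding - kernel) // stride + 1
```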

Figure 2 .
Figure 2. The architecture of the DarkNet53 model. The Res-n block denotes n residual blocks: the input first passes through a Half block to halve the spatial dimensions and then through n ResB blocks. The last section of the model contains an average pooling layer and an FC layer. The AvgPool layer converts any input size to 1*1*out_channel, and after the FC layer and softmax operation the output becomes the number of categories to be classified.

Figure 4 .
Figure 4. The architecture of the fusion model.

Figure 7 .
Figure 7. Precision-recall curve. Observing Figure 7 reveals that the detection of the golf stick and baseball gloves in this model is not optimal but passable: either the golf stick is mistaken for a hockey stick, or the baseball gloves are not detected. Based on the misidentified images, it is impossible to distinguish a golf stick from a hockey stick even with the human eye, as the difference is not clear in a 224*224 low-resolution image. Furthermore, the reason for the low recall on baseball gloves is that the training set contains only a few images of them, which may easily be remedied by adding more images to the training set. The PR curve shows that at IOU=0.5 the AP of the golf stick and hockey stick were very low, only 0.398 and 0.587. The problem caused by the image resolution was difficult to solve, but the model performed well in the other categories. The outcomes of several YOLOV3 model suites are presented in Figure 9.

Table 1
The precision, recall, and F1-scores of the DarkNet53 model.

Table 2
The precision, recall, and F1-scores of the model ensemble.