Garbage detection and classification based on improved YOLOV4

Object detection is an important research task in computer vision. As environmental problems become increasingly serious, garbage detection has become an active direction in object detection. Although great progress has been made in garbage detection, general-purpose detectors still face challenges: no public datasets are available and recognition accuracy is low. Based on the YOLOV4 model, this paper proposes an improved YOLOV4 detector for garbage detection. Specifically, CBAM is added to the feature extraction network to better extract features in deep networks, and the focal function is integrated into the loss function to mitigate class imbalance. When preparing the data, environment information and random information are added to the pictures to simulate the distribution of garbage in real environments; the result is TrashSet, an urban household garbage dataset containing 45,910 images in 47 classes. Experimental results on TrashSet verify that our detector performs well, with mAP reaching 97.15%.


Introduction
Object detection is a basic computer vision task, and various object detectors have been devised, such as YOLO [1], SSD [2] and Faster R-CNN [3]. On general datasets such as COCO [4] and VOC2007 [5], these general detectors have achieved exciting results. Garbage detection and classification is an important application of object detection. However, the garbage detection task differs from ordinary object detection tasks in its complicated backgrounds, irregular shapes and arbitrary orientations. Most general object detectors do not account for the difference between such specific scenes and the scenes in standard datasets, and so cannot achieve the same results as on general datasets. It is therefore of great practical significance and application value to devise a special object detector for garbage detection.
In this paper, we discuss the applicability of the improved YOLOV4 [6] object detector to garbage detection. There are four main contributions. Firstly, since there is at present no public dataset of urban household garbage in the object detection field, we construct an urban household garbage dataset, which is one of the key points and difficulties of our experiments. Secondly, because of the complicated background, irregular shape and arbitrary orientations of the objects to be detected, we add CBAM [7] to the backbone network to extract effective features in the deep network. Thirdly, to address class imbalance, the focal function [8] is added to the loss function. Lastly, ablation experiments show that adding CBAM to the backbone network and the focal function to the loss function improves detection performance.

Materials and Methods
Garbage detection is an active but difficult research direction in object detection because of the complex backgrounds, irregular shapes and arbitrary orientations of the objects to be detected. Considering the rationality and applicability of neural network models in urban garbage detection scenes, this section describes the design of the neural network structure and its improvements.
Based on the YOLOV4 model, we propose an improved YOLOV4 detector consisting of three parts: feature extraction network, feature pyramid network and YOLO Head detector. An overview of our improved YOLOV4 model is given in figure 1.

Design and improvement of feature extraction network
The YOLOV4 object detection model adopts the CSPDarknet53 network as its feature extraction network. As shown in figure 2, the network is composed of a series of residual structures. The residual networks adopted in this part help to fuse shallow and deep information about objects and to better express object characteristics. CSPDarknet53 uses multiple feature layers for object prediction. These layers are located at the middle, lower-middle and bottom levels of the network and correspond to objects of different sizes. Therefore, the CSPDarknet53 network detects objects of different sizes well.
Attention mechanisms are critical in human perception [9]. The purpose of adding an attention mechanism to the network is to amplify important features and suppress unwanted ones, so that the network learns in a way closer to how we think. The urban household garbage detection task suffers from complicated backgrounds, irregular shapes and arbitrary orientations, and these problems seriously interfere with correct classification and prediction by the detector. Adding an attention mechanism to the feature extraction network of the detector therefore lets the network attend to meaningful information and ignore unimportant features.
The attention model is divided into channel and spatial attention models. Channel attention models aim to evaluate the importance of different channels, strengthening important channels during training and suppressing minor ones so as to better extract the features of the inputs. Spatial attention models are intended to mimic the human eye, focusing on pixel locations where features are prominent in the images. In this paper, the attention model CBAM is adopted, which attends to both channel and space.
CBAM, the Convolutional Block Attention Module, is a simple but effective attention mechanism for feedforward neural networks. Given an intermediate feature map, CBAM infers attention maps along two dimensions, channel and space, and multiplies these attention maps with the input feature map to obtain adaptively refined features. In this paper, CBAM is added to the residual structure of the backbone to extract features better during training. Figure 3 shows the structure of CBAM.
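The two attention steps described above can be sketched without any framework. This is a minimal illustration and not the implementation used in this paper: the function names, the weight arguments `w1`, `w2` and `kernel`, and the nested-list layout (channels x height x width) are assumptions made for the example. Following [7], channel attention passes the average-pooled and max-pooled channel descriptors through a shared two-layer MLP, and spatial attention applies a convolution to the channel-wise average and max maps.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def channel_attention(fmap, w1, w2):
    """One weight per channel. fmap is C x H x W; w1 is (C/r) x C, w2 is C x (C/r)."""
    def mlp(vec):
        # Shared two-layer MLP with a ReLU hidden layer.
        hidden = [max(0.0, sum(w * v for w, v in zip(row, vec))) for row in w1]
        return [sum(w * h for w, h in zip(row, hidden)) for row in w2]
    avg = [sum(sum(row) for row in ch) / (len(ch) * len(ch[0])) for ch in fmap]
    mx = [max(max(row) for row in ch) for ch in fmap]
    return [sigmoid(a + m) for a, m in zip(mlp(avg), mlp(mx))]

def spatial_attention(fmap, kernel):
    """One weight per pixel. kernel is 2 x k x k, applied with zero padding."""
    C, H, W = len(fmap), len(fmap[0]), len(fmap[0][0])
    avg = [[sum(fmap[c][i][j] for c in range(C)) / C for j in range(W)] for i in range(H)]
    mx = [[max(fmap[c][i][j] for c in range(C)) for j in range(W)] for i in range(H)]
    k = len(kernel[0])
    pad = k // 2
    out = [[0.0] * W for _ in range(H)]
    for i in range(H):
        for j in range(W):
            s = 0.0
            for m, plane in enumerate((avg, mx)):
                for di in range(k):
                    for dj in range(k):
                        y, x = i + di - pad, j + dj - pad
                        if 0 <= y < H and 0 <= x < W:
                            s += kernel[m][di][dj] * plane[y][x]
            out[i][j] = sigmoid(s)
    return out

def cbam(fmap, w1, w2, kernel):
    """Channel attention first, then spatial attention on the refined map."""
    ca = channel_attention(fmap, w1, w2)
    refined = [[[ca[c] * v for v in row] for row in fmap[c]] for c in range(len(fmap))]
    sa = spatial_attention(refined, kernel)
    return [[[sa[i][j] * refined[c][i][j] for j in range(len(sa[0]))]
             for i in range(len(sa))] for c in range(len(refined))]
```

In a trained network the weights are learned; with all weights at zero both attention maps equal 0.5 everywhere, so the output is simply the input scaled by 0.25, which is a convenient sanity check.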

Feature pyramid network
SPP [10] and PANet [11] are used as the feature pyramid component of YOLOV4. The feature pyramid combines deep and shallow features with multi-resolution prediction to improve recognition accuracy.

Object detector head
The YOLOV4 detector head combines three parts: network outputs, loss calculation and prediction result analysis. The three feature layers output by the feature pyramid are processed and anchors are generated; the loss is then calculated after encoding. Finally, the prediction results are decoded.
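The decoding step can be illustrated with the box parameterisation commonly used by YOLO-style heads. The sketch below assumes that parameterisation carries over to the detector described here; the function name `decode_box`, the argument layout and the example anchor size are hypothetical and are not taken from this paper.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def decode_box(raw, cell_xy, anchor_wh, stride):
    """Turn one raw prediction into a box in input-image pixels.

    raw       : (tx, ty, tw, th, tconf) as emitted by the head for one anchor
    cell_xy   : (column, row) index of the grid cell on the feature layer
    anchor_wh : anchor width and height in pixels
    stride    : input size divided by feature-layer size (e.g. 32 for the deepest layer)
    """
    tx, ty, tw, th, tconf = raw
    cx, cy = cell_xy
    # The centre is an offset inside the cell, squashed to (0, 1) by the sigmoid.
    bx = (sigmoid(tx) + cx) * stride
    by = (sigmoid(ty) + cy) * stride
    # Width and height rescale the anchor exponentially.
    bw = anchor_wh[0] * math.exp(tw)
    bh = anchor_wh[1] * math.exp(th)
    return bx, by, bw, bh, sigmoid(tconf)
```

For example, an all-zero raw prediction in cell (3, 4) of a stride-32 layer decodes to a box centred in that cell with exactly the anchor's size and a confidence of 0.5.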

Design and improvement of loss function
This paper uses the PyTorch neural network framework to implement the improved YOLOV4 object detector. The loss function, shown in equation (1), is designed in three parts: location loss, confidence loss and classification loss. Location loss is calculated using CIOU Loss [12], as shown in equation (2).
Here IOU represents the ratio of the intersection to the union of the prediction box and the real box, ρ represents the Euclidean distance between the centre of the prediction box and that of the real box, and c represents the diagonal length of the smallest enclosing region that contains both the prediction box and the real box.
The confidence loss consists of two parts. The first covers locations where a real object exists, where the predicted confidence is compared with 1.0. The second covers locations where no object exists, where the predicted confidence is compared with 0. The design of the confidence loss is shown in equation (3):
$$L_{conf} = \mathrm{BCELoss}(conf, mask) \cdot mask + \mathrm{BCELoss}(conf, noobj\_mask) \cdot noobj\_mask \tag{3}$$
The classification loss compares the predicted class with the real class for the actual object boxes, as shown in equation (4):
$$L_{cls} = \mathrm{BCELoss}(pred\_cls, smooth\_labels) \tag{4}$$
where BCELoss represents the cross-entropy loss function, as shown in equation (5):
$$\mathrm{BCELoss}(y', y) = -\left[ y \log y' + (1 - y) \log(1 - y') \right] \tag{5}$$
Here y stands for the real results and y' stands for the prediction results.
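As a concrete illustration of the location loss, the sketch below computes the CIoU loss in the form given in [12], 1 − IOU + ρ²/c² + αv, where v measures the aspect-ratio mismatch and α is its trade-off weight. It is assumed here that this standard form corresponds to the CIOU Loss referred to in this section. The function name `ciou_loss`, the corner format (x1, y1, x2, y2) and the `eps` guard are choices made for the example.

```python
import math

def ciou_loss(pred, target, eps=1e-9):
    """CIoU loss for two axis-aligned boxes given as (x1, y1, x2, y2)."""
    px1, py1, px2, py2 = pred
    tx1, ty1, tx2, ty2 = target
    pw, ph = px2 - px1, py2 - py1
    tw, th = tx2 - tx1, ty2 - ty1

    # IOU: intersection area over union area.
    iw = max(0.0, min(px2, tx2) - max(px1, tx1))
    ih = max(0.0, min(py2, ty2) - max(py1, ty1))
    inter = iw * ih
    union = pw * ph + tw * th - inter
    iou = inter / (union + eps)

    # rho^2: squared distance between the two box centres.
    rho2 = ((px1 + px2) / 2 - (tx1 + tx2) / 2) ** 2 \
         + ((py1 + py2) / 2 - (ty1 + ty2) / 2) ** 2

    # c^2: squared diagonal of the smallest box enclosing both boxes.
    cw = max(px2, tx2) - min(px1, tx1)
    ch = max(py2, ty2) - min(py1, ty1)
    c2 = cw ** 2 + ch ** 2 + eps

    # v penalises aspect-ratio mismatch; alpha weights it.
    v = (4 / math.pi ** 2) * (math.atan(tw / (th + eps)) - math.atan(pw / (ph + eps))) ** 2
    alpha = v / (1 - iou + v + eps)

    return 1 - iou + rho2 / c2 + alpha * v
```

Identical boxes give a loss of essentially zero, while two 2 x 2 boxes offset by one unit in each direction give 1 − 1/7 + 2/18: the IOU term and the centre-distance term both contribute, and the aspect-ratio term vanishes because the shapes match.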
In this paper, the focal function is used to deal with class imbalance in the samples. The focal function is an improved version of the cross-entropy loss function that reduces the weight of easy negative samples in training. In the design of the YOLOV4 loss function, the location loss and confidence loss functions are replaced by the focal loss function, as shown in equation (6).
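For reference, the binary focal loss as defined in [8] is FL(p_t) = −α_t (1 − p_t)^γ log(p_t), where p_t is the predicted probability of the true class. The sketch below implements that standard form next to plain cross-entropy; it is an illustration rather than the exact loss used in this paper, and the defaults `alpha=0.25` and `gamma=2.0` are the values reported in [8], not settings taken from these experiments.

```python
import math

def bce(y_true, y_pred, eps=1e-7):
    """Binary cross-entropy for one prediction; y_true is 0 or 1."""
    p = min(max(y_pred, eps), 1 - eps)
    return -(y_true * math.log(p) + (1 - y_true) * math.log(1 - p))

def focal_loss(y_true, y_pred, alpha=0.25, gamma=2.0, eps=1e-7):
    """Binary focal loss for one prediction; y_true is 0 or 1."""
    p = min(max(y_pred, eps), 1 - eps)
    # p_t is the probability assigned to the true class.
    p_t = p if y_true == 1 else 1 - p
    # alpha_t balances positives against negatives.
    alpha_t = alpha if y_true == 1 else 1 - alpha
    # (1 - p_t) ** gamma shrinks the loss of well-classified (easy) samples.
    return -alpha_t * (1 - p_t) ** gamma * math.log(p_t)
```

With `gamma=0` and `alpha=0.5` the expression reduces to half the cross-entropy, which shows that the modulating factor (1 − p_t)^γ is what down-weights easy samples such as confidently rejected background boxes.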

Results & Discussion
Training is implemented in PyTorch on a server with an Nvidia GeForce RTX 2080Ti and 8 GB of memory. We run experiments on our self-made dataset, TrashSet, to verify the applicability of our design, and then analyse and discuss the experimental results. In the experiments, the score thresholds used for the precision rate and recall rate statistics are set to 0.5, and mPrecision and mRecall denote the precision rate and recall rate averaged over categories. Note that detection time is measured on a different platform: an Alibaba Cloud lightweight application server with a 1-core CPU and 2 GB of memory.
Figure 4.
We process the garbage pictures obtained as follows. The length and width of each picture are set to random numbers between 416 and 608, and 10 scenes and 120 backgrounds, such as desktop, floor tiles, wallpaper, grass, land and solid colours, are added to the pictures. Some small pictures are pasted onto the background at random distances from one another. The pictures are also rotated by random angles to simulate how garbage is placed in everyday scenes, and mosaic is added to some pictures.
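The random choices in this processing step can be sketched as a small parameter sampler. The sketch only draws the parameters and does not touch image data; the function name `sample_augmentation`, the dictionary keys, the 0 to 360 degree angle range and the example background names are assumptions for illustration, while the 416 to 608 size range comes from the description above.

```python
import random

def sample_augmentation(rng, backgrounds, min_side=416, max_side=608):
    """Draw one set of augmentation parameters for a garbage picture."""
    width = rng.randint(min_side, max_side)    # inclusive on both ends
    height = rng.randint(min_side, max_side)
    return {
        "size": (width, height),                   # random picture size
        "background": rng.choice(backgrounds),     # scene placed behind the object
        "angle_deg": rng.uniform(0.0, 360.0),      # random rotation
        "paste_xy": (rng.randint(0, width - 1),    # where a small picture is pasted
                     rng.randint(0, height - 1)),
    }

# Usage with a seeded generator so that a run can be reproduced.
rng = random.Random(0)
params = sample_augmentation(rng, ["desktop", "floor tiles", "grass"])
```

Passing an explicit `random.Random` instance rather than using the module-level functions keeps the sampling reproducible, which helps when a generated dataset has to be rebuilt.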

Evaluation indicators.
We use precision rate, recall rate, mean average precision (mAP) and time complexity as the evaluation indicators of the experimental results. Precision rate is the proportion of objects predicted to be positive that are truly positive. Recall rate is the proportion of positive objects in the sample that are predicted correctly. mAP evaluates precision and recall jointly. In addition, time complexity is an important evaluation indicator in this experiment.
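These indicators can be made concrete with a short sketch. It assumes that detections have already been matched to ground-truth boxes, so that each detection is marked as a true or false positive, and it computes average precision as the area under the precision-recall curve with all-point interpolation. Which AP variant was used in the experiments is not stated, so that choice, like the function names, is an assumption. mAP is then the mean of the per-class AP values.

```python
def precision_recall(tp, fp, fn):
    """Precision and recall from true-positive, false-positive and false-negative counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

def average_precision(scored_hits, num_gt):
    """AP for one class.

    scored_hits : list of (confidence, is_true_positive) pairs, one per detection
    num_gt      : number of ground-truth objects of this class
    """
    tp = fp = 0
    points = []  # (recall, precision) after each detection, highest confidence first
    for _, hit in sorted(scored_hits, key=lambda s: s[0], reverse=True):
        if hit:
            tp += 1
        else:
            fp += 1
        points.append((tp / num_gt, tp / (tp + fp)))

    ap, prev_recall = 0.0, 0.0
    for i, (recall, _) in enumerate(points):
        # Interpolated precision: the best precision at this recall or higher.
        best_precision = max(p for _, p in points[i:])
        ap += (recall - prev_recall) * best_precision
        prev_recall = recall
    return ap

def mean_average_precision(ap_per_class):
    """mAP as the plain mean of per-class AP values."""
    return sum(ap_per_class) / len(ap_per_class)
```

For instance, three detections ranked true, false, true against two ground-truth objects give recall steps of 0.5 at precision 1 and 0.5 at precision 2/3, so AP = 5/6.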

Experiment methods.
In the training phase of the improved YOLOV4 model, the learning rate follows a cosine annealing schedule. The learning rate first rises and then falls: it rises linearly, and then decreases along a cosine curve. This process is executed multiple times. The initial learning rate is set to 0.001, and the number of epochs is set to 100,000. The loss curve is shown in figure 6, and the learning rate curve is shown in figure 7.
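The shape of this schedule can be sketched as a function of the training step. The sketch assumes that 0.001 is the peak reached at the end of each linear rise; the floor value `lr_min`, the cycle length and the warm-up length are hypothetical parameters, since the text does not specify them.

```python
import math

def warmup_cosine_lr(step, cycle_len, warmup_len, lr_max=1e-3, lr_min=1e-5):
    """Learning rate at a given step: linear rise, cosine fall, repeated every cycle."""
    pos = step % cycle_len  # position inside the current cycle
    if pos < warmup_len:
        # Linear rise from lr_min to lr_max.
        return lr_min + (lr_max - lr_min) * pos / warmup_len
    # Cosine fall from lr_max back to lr_min over the rest of the cycle.
    progress = (pos - warmup_len) / (cycle_len - warmup_len)
    return lr_min + 0.5 * (lr_max - lr_min) * (1 + math.cos(math.pi * progress))

# Usage: sample the schedule over two cycles of 100 steps with a 10-step rise.
schedule = [warmup_cosine_lr(s, cycle_len=100, warmup_len=10) for s in range(200)]
```

Restarting the cycle periodically raises the learning rate again after each decay, which matches the description that the rise and fall are executed multiple times.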

Experiment results
To verify the applicability of our detector, we perform ablation experiments on the parts improved in this paper. The ablation experiments all use the same training and test datasets, the same training batch and the same learning rate. The pictures are resized to 418 × 418 before training, and the results of the ablation experiments are shown in figure 8.

Discussion
Through the ablation experiments, we can draw the following points:
Effect of CBAM. Figure 8 shows that the evaluation indicators improve to varying degrees after adding CBAM: mPrecision increases by 0.14%, mRecall by 0.87% and mAP by 0.31%, while the detection time increases by 0.8s. As discussed above, CBAM effectively suppresses noise and highlights object information along both the channel and spatial dimensions.
Effect of focal loss. Figure 8 shows that the focal function greatly improves the overall results: mPrecision increases by 2.62%, mRecall by 9.73% and mAP by 5.20%, while the time cost is unchanged compared with before the focal function is added. Our analysis suggests two reasons. Firstly, TrashSet has a class imbalance problem: for example, the original garbage pictures include 1,251 of newspaper flyers and 1,674 of vegetables and tea, but only 98 of leftovers, 74 of diapers and 31 of expired drugs. The focal function has a class-balancing factor, which makes the network give higher weight to categories with fewer original images and effectively compensates for the large difference in the number of samples between categories. Secondly, in the feature maps of one-stage object detection algorithms, a large number of background boxes are unrelated to the objects, i.e. they are negative samples, and the samples containing targets occupy only a small part of the region. This unbalanced ratio of positive to negative samples greatly affects the convergence of the loss function. The focal loss function also includes a factor that balances the loss of positive samples against that of negative (background) samples; with the value we set for it, the impact of negative samples such as background boxes on the loss function is greatly reduced, so that the network can concentrate on the difficult samples.
Effect of CBAM and focal loss. Figure 8 shows that the experiments perform best when both CBAM and the focal function are added: mAP reaches 97.15%. However, compared with before adding CBAM, the time cost increases by 0.8s. From table 1, we can see that the precision rate, recall rate and mAP of most classes of garbage exceed 95%, but the recognition accuracy for similarly shaped garbage, such as pesticide containers and plastic bottles, is still not high. This is a direction we will continue to study.

Conclusion
We have presented an improved YOLOV4 detector and a dataset for urban household garbage detection. We study the complicated backgrounds, irregular shapes and arbitrary orientations that characterise the objects to be detected, and propose an improved YOLOV4 detector model adapted to the urban household garbage scene. Constructing TrashSet, a dataset of urban household garbage containing 47 categories, adjusting the feature layer, adding CBAM to the feature extraction network, adjusting anchor boxes through the k-means algorithm and adding the focal function to the loss function together strengthen the ability of our detector to detect garbage. Experimental results show that the improved YOLOV4 object detector can meet the needs of real-time detection, and that its precision rate, recall rate and mAP are higher than those of the original model, which effectively alleviates the problem that general detectors fall short of expectations in garbage detection scenes. However, this article does not test the generalization of the improved YOLOV4 object detector on other urban household garbage datasets. Testing and improving its generalization is left for future work.