DC-YOLOv3: A novel efficient object detection algorithm

Feature pyramids have become an essential component in most modern object detectors, such as Mask RCNN, YOLOv3, and RetinaNet. These detectors commonly use pyramidal feature representations, which represent an image with multi-scale feature layers. However, such detectors cannot be used in many real-world applications that require real-time performance under computationally limited circumstances. In this paper, we study the network architecture of YOLOv3 and modify its classical backbone, darknet53, using a group of convolutions and dilated convolutions (DC). We then propose a novel one-stage object detection framework called DC-YOLOv3. Extensive experiments on the Pascal VOC 2007 benchmark demonstrate the effectiveness of our framework. The results show that DC-YOLOv3 achieves results comparable to YOLOv3 while being about 1.32× faster in training and 1.38× faster in inference.


Introduction
In recent years, object detection has become a focal issue in computer vision; its core challenge is to detect and localize many objects at different scales and in different locations. Object detection has made great progress owing to the success of convolutional neural networks (CNNs). Many CNN-based object detection frameworks [1][2] and their variants [3][4][5] have been proposed, remarkably improving both the speed and the accuracy of object detection. These developments also accelerate the application of object detection in many fields, including surveillance, autonomous driving, robotics, and pedestrian detection.
In modern object detectors, feature pyramids have become an essential component for detecting and localizing multiple objects across a wide range of scales and locations. Pyramidal feature representations, which represent an image with multi-scale feature layers, are commonly used, for example in FPN [20], Mask RCNN, YOLOv3 [13], and RetinaNet [18].
More recently, YOLOF [21] was presented, which studies the influence of FPN's two benefits in one-stage detectors and regards FPN as a Multiple-in-Multiple-out (MiMo) encoder. Inspired by this, in this paper we present a novel one-stage network structure based on YOLOv3, called DC-YOLOv3.

Multi-scale feature representation
Detecting and localizing many objects at different scales and locations remains a challenge, and many works have addressed it. Faster R-CNN and SSD use multiple feature maps at different resolutions to fit objects of various scales. FPN builds a feature pyramid by summing multi-scale features. Based on FPN, [22] proposes an additional bottom-up pathway, and [23] builds stronger pyramidal feature representations using multiple U-shape modules after a backbone model. [24] presents a cascaded pyramid network that learns a pyramidal feature representation with sufficient context information. [25] proposes to combine features at all scales and generate features at each scale with a global attention operation. [26] proposes a "Gated CNN", introducing a "gate" structure fed by multi-scale feature layers to integrate multiple convolutional layers for object detection. [27] uses dilated convolutions to systematically aggregate multi-scale contextual information without losing resolution. In this paper, we investigate the use of dilated convolutions to aggregate multiple feature maps at different scales.
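As a quick illustration of why dilated convolutions help aggregate context without losing resolution: a k×k convolution with dilation d has an effective kernel size of k + (k−1)(d−1), so stacking layers with growing dilation rates enlarges the receptive field much faster than plain convolutions, with no pooling. A minimal sketch (plain Python, function names ours):

```python
def effective_kernel(k: int, d: int) -> int:
    """Effective kernel size of a k x k convolution with dilation d."""
    return k + (k - 1) * (d - 1)

def receptive_field(layers) -> int:
    """Receptive field of a stack of stride-1 (kernel, dilation) conv layers."""
    rf = 1
    for k, d in layers:
        rf += effective_kernel(k, d) - 1
    return rf

# Three plain 3x3 convs vs. three 3x3 convs with dilations 1, 2, 4:
print(receptive_field([(3, 1)] * 3))              # 7
print(receptive_field([(3, 1), (3, 2), (3, 4)]))  # 15
```

The same three layers cover more than twice the input span once dilated, which is the property [27] exploits.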

YOLOv3
The YOLO series are one-stage detectors. YOLO [11] and YOLOv2 [12] only use the last output feature of the backbone, so their accuracy cannot meet the demands of some applications. More specifically, ResNet and Faster R-CNN's RPN are used for reference, and in YOLOv3 darknet53 [28] is used. Owing to the greater depth and more successful design of darknet53, compared with YOLO and YOLOv2, YOLOv3 is not only deeper but also semantically stronger and more accurate.
Despite all this, under computationally limited non-gpu circumstances, YOLOv3 cannot be used in many real-world applications that require real-time performance. The main reason is the large and deep darknet53 network, which includes 53 convolutional layers and multiple residual blocks and carries a heavy computational cost. In this paper, we investigate darknet53 and modify it to fit such real-world applications.

Architecture of DC-YOLOv3
Our method is similar to YOLOv3 while being faster and more efficient. While other works [29] focus on optimization techniques or strategies, our method focuses on modifying and optimizing the backbone to shrink the size of the network without degrading overall performance. Our network is one-stage and has two main components: a novel backbone network (often a state-of-the-art feature extraction network) and a classification-regression network. The former is modified from darknet53 with dilated convolution blocks, and the latter is the same as in YOLOv3. Figure 1 illustrates the architecture of darknet53, the designed backbone, and the DC blocks. Figure 1(a) shows that darknet53 consists of multiple feature layers at different scales, from which features can be extracted using a concept similar to the feature pyramid network. In YOLOv3, the authors demonstrated the power of darknet53 through extensive experiments: the features it extracts exhibit superb performance. The last three feature layers of darknet53 are used for class prediction and bounding-box regression in YOLOv3.

Design of our novel backbone network
Under non-gpu circumstances, YOLOv3 cannot support real-time applications, mainly owing to the deeper and larger architecture of darknet53. Since only the last three feature layers are used to predict small, middle, and big objects, and dilated convolutions [30][31][32] can be used to enlarge the receptive field of convolutional kernels, we can try to substitute dilated convolutions with different dilation rates for the last several residual blocks.
In darknet53, the third-from-last feature layer extracts fine-grained features to predict small-scale objects, and this layer is the cornerstone of the last two layers, so it is kept unchanged. The last two layers extract middle- and coarse-grained features to predict middle and big objects, so we can substitute dilated convolutions for them. However, dilated convolution poses a gridding problem that leads to the complete loss of some local information: the samples drawn from the input become sparser, which may not be good for feature extraction and may degrade the system's performance.
To solve this problem, we design two DC blocks (see Figure 1(c)) in the framework. Take DC block1 for example: the output of the previous layer (256×52×52) is fed into two branches, one a convolutional module and the other a dilated convolution combined with a convolutional module. The features of the two branches are then concatenated along the channel dimension as the final output of DC block1. Similar to YOLOv3, the feature extracted from DC block1 is fed into two branches, one a concatenate block in DC-YOLOv3 and the other DC block2.
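A possible PyTorch sketch of DC block1 under these assumptions: the branch channel counts, kernel sizes, ordering within the dilated branch, and the dilation rate are our guesses for illustration, not the paper's exact configuration.

```python
import torch
import torch.nn as nn

class DCBlock(nn.Module):
    """Two-branch block: a plain conv branch and a dilated-conv branch,
    concatenated along the channel dimension (sketch of DC block1)."""
    def __init__(self, in_ch=256, branch_ch=256, dilation=2):
        super().__init__()
        # Branch 1: ordinary 3x3 convolutional module.
        self.plain = nn.Sequential(
            nn.Conv2d(in_ch, branch_ch, 3, padding=1, bias=False),
            nn.BatchNorm2d(branch_ch),
            nn.LeakyReLU(0.1),
        )
        # Branch 2: 3x3 convolutional module combined with a dilated 3x3 conv.
        self.dilated = nn.Sequential(
            nn.Conv2d(in_ch, branch_ch, 3, padding=1, bias=False),
            nn.BatchNorm2d(branch_ch),
            nn.LeakyReLU(0.1),
            nn.Conv2d(branch_ch, branch_ch, 3,
                      padding=dilation, dilation=dilation, bias=False),
            nn.BatchNorm2d(branch_ch),
            nn.LeakyReLU(0.1),
        )

    def forward(self, x):
        # Channel-wise concatenation of the two branch outputs.
        return torch.cat([self.plain(x), self.dilated(x)], dim=1)

x = torch.randn(1, 256, 52, 52)   # output of the previous layer
y = DCBlock()(x)
print(y.shape)  # torch.Size([1, 512, 52, 52])
```

With matching padding, both branches preserve the 52×52 spatial size, so the concatenation only doubles the channel dimension.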
In contrast to darknet53, DC block1 involves less computation and is much simpler and more efficient, as verified in the subsequent experiments.

Extension of the architecture
The idea behind our design can easily be extended to other object detection frameworks: we can substitute DC blocks for their multi-scale feature extraction modules. The only thing needed is to fine-tune the hyper-parameters of the DC blocks, such as the dilation rates. In addition, hybrid dilated convolution (HDC) [30] can alleviate the gridding problem of dilated convolution, so designing an HDC module and substituting HDC for the relevant layers is also a promising option.
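To see why HDC-style dilation rates matter, one can check which input offsets a stack of dilated 3-tap convolutions actually samples: repeated identical rates leave gaps (the gridding artefact), while rates such as 1, 2, 3 cover the receptive field densely. A small 1-D check (our own illustration, not code from [30]):

```python
def sampled_offsets(dilations):
    """1-D input offsets reachable by stacking 3-tap convs with given dilations."""
    offsets = {0}
    for d in dilations:
        # Each dilated 3-tap conv shifts every reachable offset by -d, 0, or +d.
        offsets = {o + t for o in offsets for t in (-d, 0, d)}
    return sorted(offsets)

def has_gaps(offsets):
    """True if the sampled positions skip pixels inside their span."""
    return any(b - a > 1 for a, b in zip(offsets, offsets[1:]))

print(has_gaps(sampled_offsets([2, 2, 2])))  # True  -- gridding: only even offsets
print(has_gaps(sampled_offsets([1, 2, 3])))  # False -- dense coverage of the span
```

Three layers with dilation 2 can only ever reach even offsets, so half the pixels in the receptive field are never sampled, whereas the mixed rates reach every pixel.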

Experiments on Pascal VOC 2007 dataset
In this section, we present evaluation results of the proposed method on Pascal VOC 2007 [33]. In the experiments, YOLOv3 with darknet53 is used as the baseline. Both the training and testing experiments were conducted on a ThinkPad T470 laptop (limited by the available computation resources).
In the object detection experiments, to accelerate training and obtain the desired performance, we adopt a model pretrained on MS COCO [34]. Since darknet53 and our backbone network share the same first three layers, as shown in Figure 1, we use only the parameters of those first three layers to initialize the corresponding parameters in the model we train.
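This partial initialization amounts to copying only the parameters whose names belong to the shared first three layers. A generic sketch with plain dicts standing in for framework state dicts (the layer names are hypothetical):

```python
def partial_init(target, pretrained, shared_prefixes):
    """Copy pretrained parameters into `target` only for the shared layers;
    everything else keeps its fresh initialization."""
    copied = {}
    for name, value in pretrained.items():
        if any(name.startswith(p) for p in shared_prefixes):
            target[name] = value
            copied[name] = value
    return copied

# Toy state dicts; "layer1".."layer3" stand in for the shared darknet53 stages.
pretrained = {"layer1.w": 1.0, "layer2.w": 2.0, "layer3.w": 3.0, "layer4.w": 4.0}
target = {"layer1.w": 0.0, "layer2.w": 0.0, "layer3.w": 0.0, "dc_block1.w": 0.0}

copied = partial_init(target, pretrained, ("layer1", "layer2", "layer3"))
print(sorted(copied))         # ['layer1.w', 'layer2.w', 'layer3.w']
print(target["dc_block1.w"])  # 0.0 -- DC-specific layers stay freshly initialized
```

The deeper pretrained layers ("layer4" here) are simply ignored, since their shapes no longer match the modified backbone.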
For the Pascal VOC 2007 train and test sets, we use the 5k trainval images and the 5k test images of VOC 2007. The main hyper-parameters are as follows: the initial learning rate and the weight decay are set to 0.001 and 0.92 respectively, and the IoU threshold for assigning ground truth is 0.3. The training batch size is 8, and the Adam optimization strategy is adopted.
Training phase. We train the two networks in the same way, freezing the parameters of the first three layers. As Table 1 shows, compared with YOLOv3, both the training time per epoch and the parameter size of our method are much smaller, which clearly saves memory and computation resources. Figure 2 shows how the total loss varies with training time, indicating that the two networks have comparable convergence speed and accuracy.
Test phase. We test the two trained models on the 5k test images of Pascal VOC 2007. Our method is slightly inferior to YOLOv3 in accuracy: after about 25 training epochs, the mAP of our model is 45.21%, while YOLOv3's is 47.07%. However, as Table 1 shows, our method has an obvious advantage in speed and can be used in real-world applications that require real-time performance on a non-gpu computer. Figure 3 shows some object detection results of our method on the Pascal VOC 2007 test images.
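The freezing step described above can be sketched in PyTorch as follows; the toy four-stage model is only a stand-in, with the first three modules playing the role of the shared darknet53 layers:

```python
import torch
import torch.nn as nn

# Toy model: the first three convs stand in for the shared darknet53 stages.
model = nn.Sequential(
    nn.Conv2d(3, 8, 3, padding=1),   # stage 1 (frozen)
    nn.Conv2d(8, 8, 3, padding=1),   # stage 2 (frozen)
    nn.Conv2d(8, 8, 3, padding=1),   # stage 3 (frozen)
    nn.Conv2d(8, 8, 3, padding=1),   # DC-specific part (trainable)
)

# Freeze the parameters of the first three layers.
for layer in list(model.children())[:3]:
    for p in layer.parameters():
        p.requires_grad = False

# Adam over the remaining trainable parameters, lr = 0.001 as in the setup above.
optimizer = torch.optim.Adam(
    (p for p in model.parameters() if p.requires_grad), lr=1e-3,
)
print(sum(p.requires_grad for p in model.parameters()))  # 2 (weight + bias of last conv)
```

Frozen parameters receive no gradients and are skipped by the optimizer, which is what shortens the per-epoch training time for both networks.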

Conclusion
DC-YOLOv3 achieves its desired goal of making object detection work on non-gpu computers. In addition, DC-YOLOv3 makes the following contributions to the field of object detection. First, it provides an idea for simplifying a multi-scale backbone that easily extends to other network modules. Second, to increase the accuracy of DC-YOLOv3, one can substitute hybrid dilated convolution or a well-designed combination of dilated convolutions for the plain dilated convolutions.