Lightweight Crowd Counting Network based on Depthwise Separable Convolution

Crowd counting in images is a challenging problem. Many neural network-based methods use two-branch or multi-branch networks to extract high-level features at different scales or densities and then merge these features with a fusion operation. Although these methods reduce the counting error, they require an enormous number of parameters, so training and optimizing the model is inefficient and consumes substantial computing resources. To this end, a residual network based on depthwise separable convolution is proposed for image crowd counting. The network not only reduces the amount of calculation through depthwise separable convolution, but also deepens the network through the residual structure to extract more effective high-level features. Experiments show that, compared with state-of-the-art methods, the method in this paper reduces the number of parameters to 1.91 million while achieving comparable accuracy.


Introduction
The coronavirus disease that broke out in early 2020 swept through China, but the epidemic was brought under control within a short period. This is closely related to the national prevention and control measures: preventing crowds from gathering and limiting the number of people in necessary places, for example by limiting the number of people and the distance between them in supermarkets, vegetable markets and student cafeterias. The spread of the disease caused by gatherings has done irreparable damage to people's lives and safety, and it has made national prevention and control more difficult. To prevent and reduce the spread of the virus caused by crowd gatherings, surveillance equipment was installed in different places to monitor people's activities [1]. Crowd counting has therefore aroused great interest, and many researchers have tried to solve this problem. With the development of computer vision and deep learning, researchers have proposed many deep learning-based methods for crowd counting. These methods have significantly improved the handling of scale diversity and density diversity. However, because of their large number of parameters and calculations, they are difficult to apply in real-world applications. To solve

Related Works
Whether they use regression or density maps, Convolutional Neural Network (CNN) methods achieve excellent results [2][3][4]. Zhang et al. [5] proposed a deep convolutional neural network for crowd counting, confirming that CNNs are more efficient and accurate than traditional methods. Zhang et al. [6] proposed a three-column CNN structure. Sam et al. [7] proposed a switched CNN model that uses different networks for dividing scale information and for counting. The distinct disadvantage of these multi-column methods is that the number of network columns limits the diversity of crowd scales that can be handled, and the number of network parameters is enormous. Sindagi et al. [8] proposed a cascaded CNN model to learn global correlation features and crowd density estimates. Liu et al. [9] proposed a method that combines detection and regression. Li et al. [10] proposed a network model divided into a front-end and a back-end. Ranjan et al. [11] proposed a two-branch CNN structure. However, these methods have many branch networks with a large number of parameters, which makes training and optimization inefficient.

Proposed method
Crowd counting still suffers from the scale diversity of people in the image. When features are extracted through various convolutional networks to generate density maps, the number of parameters is enormous and the efficiency is low, so substantial computing resources are required. A shallow network improves efficiency but cannot extract enough useful features. A deeper network extracts more features, but the model is arduous to train and optimize and is usable only in an experimental environment. To tackle these problems, this paper proposes a residual structure network based on depthwise separable convolution. Compared with the straight structure of the VGG network, the residual structure reuses image features better and eases training as the network deepens. The introduction of depthwise separable convolutions dramatically reduces the number of network parameters.

Depthwise separable convolution
Depthwise separable convolution splits the convolution over the image feature matrix along two dimensions: space and channel. It consists of two parts: a depthwise convolution in the spatial dimension and a pointwise convolution in the channel dimension. The depthwise separable convolution operation is shown in figure 1. For depthwise separable convolution and ordinary convolution with stride = 1, the calculation amounts are as follows:
C_D = H_K \cdot H_K \cdot M \cdot H_I \cdot H_I + M \cdot N \cdot H_I \cdot H_I   (1)
C_O = H_K \cdot H_K \cdot M \cdot N \cdot H_I \cdot H_I   (2)
\frac{C_D}{C_O} = \frac{1}{N} + \frac{1}{H_K^2}   (3)
where C_D is the calculation amount of the depthwise separable convolution, C_O is the calculation amount of the ordinary convolution, M is the number of channels of the input feature matrix, N is the number of channels of the output feature matrix, H_K is the width of the convolution kernel, and H_I is the width of the input feature matrix.
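The comparison of calculation amounts can be checked numerically. The following is a minimal sketch, assuming square kernels and feature maps, stride 1, and padding that keeps the output width equal to the input width; the function names and the layer sizes are hypothetical and chosen only for illustration.

```python
def ordinary_conv_cost(m, n, hk, hi):
    """Multiplications of an ordinary hk x hk convolution (stride 1)."""
    return hk * hk * m * n * hi * hi


def depthwise_separable_cost(m, n, hk, hi):
    """Multiplications of a depthwise hk x hk convolution plus a 1x1 pointwise one."""
    depthwise = hk * hk * m * hi * hi  # one hk x hk filter per input channel
    pointwise = m * n * hi * hi        # 1x1 convolution mixing M channels into N
    return depthwise + pointwise


# Hypothetical layer sizes (not taken from the paper).
M, N, HK, HI = 32, 64, 3, 56
c_o = ordinary_conv_cost(M, N, HK, HI)
c_d = depthwise_separable_cost(M, N, HK, HI)
ratio = c_d / c_o

print("ordinary:", c_o)
print("depthwise separable:", c_d)
print("ratio C_D / C_O:", ratio)
print("1/N + 1/HK^2:", 1 / N + 1 / HK ** 2)
```

With a 3×3 kernel the ratio is dominated by the 1/H_K^2 term, so under these assumptions the depthwise separable layer needs roughly one eighth to one ninth of the calculations of the ordinary layer.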

Residual structure
The traditional residual structure first applies a 1×1 convolution to reduce the dimension of the input feature matrix, then uses a 3×3 convolution to extract image features, and finally outputs a feature matrix with a higher dimension. This design also aims to reduce the amount of calculation in the residual structure. In the new residual structure, to reduce the amount of calculation, the ordinary 3×3 convolutional layer is replaced by a depthwise separable convolutional layer. To extract more deep features, this paper first applies a 1×1 convolution to increase the dimension of the input matrix and then uses the 3×3 depthwise separable convolution to extract features in the spatial and channel dimensions. The shortcut branch ensures that image features are reused, and the depthwise separable convolution reduces the amount of calculation and the number of parameters. The residual structure is shown in figure 2, where C is the number of channels of the input feature matrix, D is the number of feature matrix channels inside the depthwise separable convolution, and P is the number of channels of the output matrix.
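The parameter saving of the new residual structure can be estimated with a short calculation. This is a rough sketch, assuming the block consists of a 1×1 expansion from C to D channels, a 3×3 depthwise convolution on D channels, and a 1×1 pointwise convolution from D to P channels, with biases and normalization layers ignored; the function names and channel numbers are hypothetical.

```python
def new_residual_params(c, d, p, k=3):
    """Weights in the block when the k x k layer is depthwise separable."""
    expand = c * d         # 1x1 convolution raising C channels to D
    depthwise = k * k * d  # one k x k filter per channel
    pointwise = d * p      # 1x1 convolution mapping D channels to P
    return expand + depthwise + pointwise


def ordinary_residual_params(c, d, p, k=3):
    """Weights in the same block if the k x k layer were an ordinary convolution."""
    expand = c * d           # same 1x1 expansion
    ordinary = k * k * d * p  # ordinary k x k convolution mapping D channels to P
    return expand + ordinary


# Hypothetical channel numbers (not taken from the paper).
C, D, P = 24, 144, 24
separable = new_residual_params(C, D, P)
ordinary = ordinary_residual_params(C, D, P)
print(separable, ordinary, separable / ordinary)
```

For these illustrative sizes the depthwise separable variant uses 8208 weights against 34560 for the ordinary variant, which is the kind of reduction the residual structure is designed to achieve.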

Crowd counting network based on depthwise separable convolution
The lightweight crowd counting network based on depthwise separable convolution proposed in this paper is shown in figure 3. The input image passes through each layer of the network, the density map of the image is output, and the number of people is obtained by regression from the density map. The network includes low-level feature extraction, high-level feature extraction and density map regression. The feature remapping block is composed of a traditional convolutional layer with a 3×3 convolution kernel that remaps the image features. The Multi-Residual blob is a network structure consisting of a series of the new residual structures, which extract high-level features. In each residual structure, the kernel parameters of the depthwise separable convolution, the number of channels and the stride over the feature matrix are different. The final output of the residual structure is produced by dimensionality reduction, and the ReLU activation function causes a severe loss of low-dimensional feature information. Therefore, in this residual structure, the last layer uses a linear activation, and every other layer is activated with ReLU. The density map regression block is mainly composed of a traditional convolutional layer with a 1×1 convolution kernel that generates the density map. Since the feature values of the density map are always positive, the ReLU activation function can be used after the density map regression layer to improve the recovery of the density map.
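The last step, from density map to head count, can be illustrated as follows. This is a minimal sketch, assuming that the count is the sum of all density values, as is common in density-map methods, and that ReLU is applied to the raw regression output first; the function names and the toy values are hypothetical.

```python
def relu(x):
    """ReLU activation: negative values become zero."""
    return max(0.0, x)


def count_from_density_map(raw_map):
    """Apply ReLU to each raw regression value, then sum to estimate the count."""
    return sum(relu(v) for row in raw_map for v in row)


# Toy 2 x 3 regression output; the negative entries are not valid densities.
raw_output = [
    [0.2, -0.05, 0.0],
    [0.5, 0.3, -0.1],
]
estimated_count = count_from_density_map(raw_output)
print(estimated_count)
```

Without the ReLU, the two negative entries would lower the sum, which is one way to see why clipping the regression output is reasonable for a quantity that cannot be negative.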

Experiment
In this section, we first introduce the datasets and the evaluation metrics. We then report a cross-dataset evaluation to demonstrate the transferability of the proposed method. Finally, we give the evaluation results and compare the proposed method with recent methods.

Dataset
ShanghaiTech dataset: 1198 images in total, divided into Part A and Part B. As shown in figure 4, Part A consists of 482 images randomly selected from the Internet, of which 300 form the training set and 182 the test set; Part B consists of 716 images taken in the streets of Shanghai, of which 400 form the training set and 316 the test set. In formulas (4) and (5), N is the total number of images in the test set, g_i is the ground truth value of the i-th image, and p_i is the predicted value of the i-th image.
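Formulas (4) and (5) are not reproduced in this text. The following is a minimal sketch, assuming they are the mean absolute error (MAE) and the mean squared error (MSE) in the root form commonly reported in crowd counting; the function names and the example counts are hypothetical.

```python
import math


def mae(ground_truth, predicted):
    """Mean absolute error between true and predicted counts."""
    n = len(ground_truth)
    return sum(abs(g - p) for g, p in zip(ground_truth, predicted)) / n


def mse(ground_truth, predicted):
    """Root of the mean squared count error (assumed form of the MSE metric)."""
    n = len(ground_truth)
    return math.sqrt(sum((g - p) ** 2 for g, p in zip(ground_truth, predicted)) / n)


# Made-up counts for three test images.
g = [100, 250, 40]
p = [90, 260, 45]
print("MAE:", mae(g, p))
print("MSE:", mse(g, p))
```

Under this assumption, MAE reflects the average accuracy of the estimates, while the squared term in MSE penalizes large errors on individual images more strongly.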

Implementation Details
To improve the training of the model, data augmentation is applied to the samples: each original image is divided into nine sub-images of equal size. We use 90% of the training set images as training samples and the remaining 10% as validation samples. We train with the Adam optimizer, with the learning rate set to 0.00001 and the momentum set to 0.9.
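The data preparation can be sketched as follows. This is only an illustration, assuming that the nine equal sub-images form a non-overlapping 3×3 grid and that the 90%/10% split is a random shuffle; the text does not specify either detail, and the function names, image size and seed are hypothetical.

```python
import random


def grid_patches(width, height, rows=3, cols=3):
    """Return (left, top, right, bottom) boxes of a rows x cols grid of equal patches."""
    pw, ph = width // cols, height // rows
    return [
        (c * pw, r * ph, (c + 1) * pw, (r + 1) * ph)
        for r in range(rows)
        for c in range(cols)
    ]


def split_train_val(items, val_fraction=0.1, seed=0):
    """Shuffle the items and hold out a fraction of them for validation."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    n_val = round(len(shuffled) * val_fraction)
    return shuffled[n_val:], shuffled[:n_val]


boxes = grid_patches(1024, 768)         # nine boxes for one 1024 x 768 image
train, val = split_train_val(range(300))  # e.g. the 300 Part A training images
print(len(boxes), len(train), len(val))
```

Each box can be passed to an image library's crop routine, and the same boxes would be applied to the corresponding ground-truth density map so that the image and its label stay aligned.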

Comparisons with Recent Methods
We demonstrate the efficiency of our proposed method on a challenging crowd counting dataset. Table 1 reports the results on ShanghaiTech. The cross-dataset experimental results are presented in table 2. We can observe that the proposed method generalizes well to unseen datasets. The proposed method also performs better than MCNN when transferring models trained on ShanghaiTech Part A to Part B. Yet the improvement is not as significant as in the comparison with MCNN when transferring from ShanghaiTech Part B to Part A. This is probably because MCNN extracts features easily in scenes with low crowd density. This also confirms the generalizability of the proposed method.
As shown in table 3, the number of parameters of our proposed method is the smallest except for SANet [12]. Although the errors of Switch-CNN and CSRNet are slightly lower than that of the method in this paper, their number of parameters is almost eight times that of our method. Compared with other methods, the method in this paper achieves comparable results with far fewer parameters, which shows that the proposed method is more lightweight and more efficient while maintaining accuracy.

Conclusions
This paper proposes a lightweight network for crowd counting. Compared with many CNN-based methods, the method in this paper reduces the number of parameters to 1.91 million while maintaining comparable accuracy. However, the accuracy of this method is not yet high enough for crowd counting. Next, we will study how to improve the network so that it extracts more features at different scales and thereby improves the accuracy of crowd counting. Achieving accurate and reliable crowd counting with fewer parameters would make it possible to extend the method to real-world application scenarios, such as embedded devices and mobile devices.