Research on Semantic Segmentation Algorithm for Multiscale Feature Images Based on Improved DeepLab v3+

To address the erroneous and missed segmentation that occur because the DeepLab v3+ model cannot fully exploit high-resolution shallow features, a multiscale-feature image semantic segmentation algorithm based on an improved DeepLab v3+ is proposed. First, multi-scale pyramid convolution is introduced into the backbone network; second, the standard convolutions in the atrous spatial pyramid pooling module are replaced by depthwise separable convolutions, which reduces the number of parameters of the overall model; finally, a multi-scale structure in the decoding layer captures global context, this context is combined with the shallow features, and the fused shallow features are enriched by an attention mechanism to provide richer semantic information for the image. Experimental results show that on the Cityscapes validation set the proposed method yields better edge segmentation, with a mean intersection over union (MIoU) of 74.76%, 2.20 percentage points higher than the original algorithm. Comparison with other algorithms verifies the effectiveness of the method in reducing erroneous and missed segmentation.


Introduction
Semantic segmentation of an image is pixel-level classification: pixels belonging to the same category are grouped together, so semantic segmentation understands the image at the pixel level. Image semantic segmentation covers three elements, namely image segmentation, image classification and target detection, and is a long-standing core problem in computer vision. Deep convolutional neural networks (DCNNs) [1] work very well in semantic segmentation and can label each region or pixel of an image with a color indicating whether it belongs to a given target class. Semantic segmentation therefore plays an important role in understanding the information contained in an image and is the basis of image analysis. The advantage of deep convolutional neural networks is that they can learn richer representations than traditional hand-crafted features; they have high practical value and broad prospects in computer vision and artificial intelligence [2], autonomous driving [3][4], industrial inspection [5], remote sensing [6], agricultural science [7] and medical image analysis.
The DeepLab family is among the most successful and popular deep-network semantic segmentation models. Chen et al. [8] proposed the DeepLab v1 model, which first adopted atrous (dilated) convolution, keeping the feature-map size unchanged while enlarging the receptive field and thus replacing repeated down- and up-sampling; it also uses a Conditional Random Field (CRF) to smooth segmentation boundaries and reduce edge noise. DeepLab v2 [9] adds Atrous Spatial Pyramid Pooling (ASPP) to DeepLab v1, combining dilated convolution, ASPP and a fully connected CRF while keeping the feature map unchanged; multi-scale information is exploited, and multi-scale features of the image are obtained through different dilation rates, which enhances prediction at different image scales. DeepLab v3 [10] builds on DeepLab v2: the backbone is replaced with the residual network ResNet, the cascaded atrous convolutions are deepened, and the CRF is removed, making the model more concise and easier to understand; it incorporates an improved ASPP, batch normalization and better multi-scale context encoding, further improving segmentation performance. However, DeepLab v3+ only concatenates one set of shallow features from the backbone with the deep features obtained from the ASPP-aggregated context and classifies the fused features directly; because the shallow features used come from only one of the multiple shallow layers of the backbone, some effective information is lost, producing discontinuous segmentation and rough boundaries.
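The atrous convolution that the DeepLab line is built on can be illustrated with a minimal PyTorch sketch. The dilation rates 1, 6, 12 and 18 below are the values commonly used in ASPP-style modules and are assumptions here, not taken from this paper; the point is only that padding equal to the dilation rate keeps the spatial size unchanged while the effective receptive field grows.

```python
import torch
import torch.nn as nn

# Atrous (dilated) 3x3 convolution: with padding = dilation the
# spatial size is preserved, while the effective kernel extent grows
# to dilation*(3-1)+1 pixels per side.
x = torch.randn(1, 64, 65, 65)
outs = {}
for d in (1, 6, 12, 18):  # illustrative ASPP-style dilation rates
    conv = nn.Conv2d(64, 64, kernel_size=3, padding=d, dilation=d)
    outs[d] = conv(x).shape
    eff = d * (3 - 1) + 1  # effective kernel extent
    print(d, tuple(outs[d]), eff)
```

All four branches produce feature maps of the same size as the input, which is what allows ASPP to concatenate them directly.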
To address the loss of information at image edges in semantic segmentation, and the resulting erroneous and missed segmentation, this paper builds on the DeepLab v3+ framework and improves it according to its shortcomings in three ways. (1) Multi-scale pyramid convolution is introduced into the ResNet backbone, aggregating multi-scale information and enlarging the receptive field; it is combined with an attention mechanism that automatically learns attention weights, captures the correlation between the encoder's hidden states (candidate states) and the decoder's hidden state (query state), and enriches the semantic information. (2) Depthwise separable convolution is introduced into the atrous spatial pyramid pooling module, which shrinks the model, reduces its parameter count, speeds up training, and even allows the model to run on a lower-end server, with little impact on the results. (3) In the decoder, a multi-scale feature extraction structure captures global and local context and integrates features at multiple scales to obtain richer global semantic information; before merging with the shallow features, an attention mechanism effectively suppresses interference while re-weighting channels, enhancing the model's learning ability.

Improvements to the DeepLab v3+ Algorithm
As shown in Figure 1, the improved DeepLab v3+ network keeps the encoder-decoder architecture; the original algorithm uses ResNet [11] as the backbone. To address information loss, erroneous segmentation and missed segmentation in the semantic segmentation task, the main improvements lie in two aspects. (1) In the encoding layer, the PyConv [12] algorithm improves the backbone network: convolution kernels of different sizes and depths are set at different layers, the input features are divided into groups, and each group carries out its convolution independently; then the ordinary convolutions in the atrous spatial pyramid pooling module are replaced by depthwise separable convolutions, i.e. every 3×3 convolution layer is replaced by a 3×3 depthwise separable convolution, reducing the number of network parameters and improving training efficiency with only a small effect on accuracy. (2) In the decoding layer, the main improvement over the original algorithm is the fusion of results from different stages of the backbone residual network. The feature maps produced by every layer of the backbone affect the final segmentation map, yet the original DeepLab v3+ uses only the high-resolution features of the first layer, i.e. the 1/4-size feature maps. This paper also uses the feature maps output by the second and third layers of the backbone, with sizes of 1/4, 1/16 and 1/32 and channel counts of 512, 1024 and 2048, respectively. First, the outputs of each feature layer are up-sampled by bilinear interpolation and the channel counts of the three layers are reduced to 64 by 1×1 convolutions; the three feature maps are then channel-stacked into a 1/4-size feature map and processed by the attention module CBAM [13], which helps the network focus on the key locations of the extracted features and improves the extraction of edge features; next, the attention-processed feature map is stacked with the shallow features and superimposed with the four-times-upsampled deep features, and the channel count is adjusted to 256 by a 3×3 convolution; finally, the feature map is restored to the original image size by up-sampling and classified into predefined categories by the classifier.
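The decoder-side fusion described above can be sketched in PyTorch as follows. This is an illustration, not the authors' code: the input image size (256×512) and module names are assumptions, and since each of the three branches is reduced to 64 channels, concatenation here yields 192 channels (the exact channel bookkeeping in the paper may differ).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Backbone outputs at 1/4, 1/16 and 1/32 scale of a 256x512 image,
# with the channel counts given in the text (512, 1024, 2048).
f1 = torch.randn(1, 512, 64, 128)   # layer 1, 1/4 size
f2 = torch.randn(1, 1024, 16, 32)   # layer 2, 1/16 size
f3 = torch.randn(1, 2048, 8, 16)    # layer 3, 1/32 size

# 1x1 convolutions reduce each branch to 64 channels.
reduce1 = nn.Conv2d(512, 64, 1)
reduce2 = nn.Conv2d(1024, 64, 1)
reduce3 = nn.Conv2d(2048, 64, 1)

# Bilinear up-sampling brings the deeper maps to 1/4 size.
target = f1.shape[-2:]
up = lambda t: F.interpolate(t, size=target, mode="bilinear",
                             align_corners=False)

# Channel-stack the three branches into one 1/4-scale feature map;
# this is the map that then passes through the CBAM attention block.
fused = torch.cat([reduce1(f1), up(reduce2(f2)), up(reduce3(f3))], dim=1)
print(fused.shape)
```

The fused map keeps the 1/4 spatial resolution of the shallowest branch, so it can be stacked directly with the shallow features in the next step.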

Residual Networks with the Introduction of Pyramidal Convolution
DeepLab v3+ uses ResNet as its backbone; the original residual block, shown in Figure 2, uses a 3×3 convolution kernel. Since enlarging the convolution kernel in ResNet incurs a huge cost in parameters and computation, this paper adopts the Pyramid Convolution (PyConv) proposed by Duta et al. [12], which contains levels of convolution kernels with different sizes and depths. PyConv is also efficient: it keeps the number of parameters and the computational cost similar to a standard convolution, increasing model performance without compromising computational efficiency.
This paper uses a residual block that introduces pyramidal convolution; the new residual block is shown in Fig. 3, and the pyramidal convolution kernel structure in Fig. 4. The main idea of PyConv is to divide the input features into groups and perform the convolution independently in each group; when combined with the backbone, the number of branches is gradually reduced as the spatial dimensions of the feature map shrink. The feature map of the initial stage passes through four branches and that of the final stage through one branch. The kernel sizes, output channel counts and groupings used at each layer of the residual network are shown in Tab. 1. Compared with ordinary convolution, PyConv not only enlarges the receptive field of the kernel at no extra computational cost, but also applies different kinds of kernels in parallel, processing the input at different spatial resolutions and depths, which ensures that the new residual block captures more detailed information. Unlike ordinary convolution, which must account for the channel and spatial factors simultaneously, depthwise separable convolution first considers only the spatial factor and then the channel factor, thus separating the two.
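A minimal PyConv-style layer, in the spirit of Duta et al. [12], can be sketched as parallel grouped convolutions with growing kernel sizes whose outputs are concatenated. The branch settings below (kernel sizes 3/5/7/9 and group counts 1/4/8/16) are illustrative assumptions; the paper's actual per-layer settings are given in Tab. 1.

```python
import torch
import torch.nn as nn

class PyConv(nn.Module):
    """Parallel branches with different kernel sizes; larger kernels
    use more groups to keep the parameter cost comparable."""
    def __init__(self, in_ch=64, out_ch=64):
        super().__init__()
        specs = [(3, 1), (5, 4), (7, 8), (9, 16)]  # (kernel, groups)
        branch_out = out_ch // len(specs)          # channels per branch
        self.branches = nn.ModuleList(
            nn.Conv2d(in_ch, branch_out, k, padding=k // 2, groups=g)
            for k, g in specs
        )

    def forward(self, x):
        # Each branch preserves the spatial size; concatenating the
        # branch outputs restores the full channel count.
        return torch.cat([b(x) for b in self.branches], dim=1)

x = torch.randn(1, 64, 32, 32)
y = PyConv()(x)
print(y.shape)
```

Because grouping divides both the input and output channels of a branch, the larger-kernel branches stay cheap, which is what lets PyConv widen the receptive field without inflating the parameter count.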
In the atrous spatial pyramid pooling module, ordinary convolution is replaced by depthwise separable convolution. First, a channel-by-channel 3×3 (depthwise) convolution processes each channel of the input feature map separately; then a 1×1 (pointwise) convolution splices the features across channels. The improved ASPP module is shown in the ASPP part of Fig. 1. Finally, the concatenated output feature map has its channel count reduced by a depthwise separable convolution to remove useless features.
Replacing the convolutions of the ASPP module with depthwise separable convolutions effectively reduces the model's parameters, improving training efficiency with little loss of accuracy.
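The parameter saving can be made concrete by counting the weights of a standard 3×3 convolution against its depthwise-separable replacement (3×3 depthwise followed by 1×1 pointwise). The channel counts below are illustrative, not taken from the paper's configuration.

```python
import torch.nn as nn

cin, cout = 256, 256

# Standard 3x3 convolution: 3*3*cin*cout weights.
std = nn.Conv2d(cin, cout, 3, padding=1, bias=False)

# Depthwise separable replacement: 3*3*cin depthwise weights
# plus cin*cout pointwise weights.
sep = nn.Sequential(
    nn.Conv2d(cin, cin, 3, padding=1, groups=cin, bias=False),  # depthwise
    nn.Conv2d(cin, cout, 1, bias=False),                        # pointwise
)

count = lambda m: sum(p.numel() for p in m.parameters())
print(count(std), count(sep))  # 589824 vs 67840, roughly an 8.7x saving
```

The same substitution in every 3×3 layer of the ASPP module is what drives the reduction in overall model size reported later in Tab. 6.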

Multi-scale Feature Fusion
Multi-scale feature fusion has in recent years become a common means of improving model discrimination and enhancing image edge segmentation. A typical example is the pyramid network proposed by Zhao et al. [15]: by aggregating the feature maps generated by multiple dilated convolution blocks, the semantic information of feature maps at different scales can be extracted and exploited to improve the model's discrimination and its ability to segment image edges.
Utilizing features learned at multiple scales makes the representation of shallow features richer and helps encode both global and local context. Drawing on the multi-scale self-guided attention network for medical image segmentation proposed by Sinha et al. [16], this paper incorporates a multi-scale attention mechanism into the decoding layer of the conventional DeepLab v3+, applying CBAM attention after the features are combined. Specifically, the features at multiple scales are denoted F_S, where S is the index of the layer from which the feature map is taken. Since the features of each layer have different resolutions, they must be converted to the same resolution; bilinear interpolation up-sampling is used here, and the rescaled output feature maps are again denoted F_S, with S the corresponding layer index. Then, without changing the original network structure, 1×1 convolutions reduce the channel counts of the feature maps before fusion.
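The CBAM block applied after fusion can be sketched compactly: channel attention from a shared MLP over average- and max-pooled descriptors, followed by spatial attention from a 7×7 convolution over channelwise average and max maps. The reduction ratio r=16 and the 7×7 kernel are the CBAM paper's common defaults and are assumptions here.

```python
import torch
import torch.nn as nn

class CBAM(nn.Module):
    """Compact CBAM [13] sketch: channel then spatial attention."""
    def __init__(self, ch, r=16):
        super().__init__()
        self.mlp = nn.Sequential(nn.Conv2d(ch, ch // r, 1), nn.ReLU(),
                                 nn.Conv2d(ch // r, ch, 1))
        self.spatial = nn.Conv2d(2, 1, 7, padding=3)

    def forward(self, x):
        # Channel attention: shared MLP over avg- and max-pooled maps.
        avg = x.mean(dim=(2, 3), keepdim=True)
        mx = x.amax(dim=(2, 3), keepdim=True)
        x = x * torch.sigmoid(self.mlp(avg) + self.mlp(mx))
        # Spatial attention: 7x7 conv over channelwise avg/max maps.
        s = torch.cat([x.mean(1, keepdim=True),
                       x.amax(1, keepdim=True)], dim=1)
        return x * torch.sigmoid(self.spatial(s))

x = torch.randn(2, 128, 32, 64)
y = CBAM(128)(x)
print(y.shape)
```

The output keeps the input shape, so the block can be dropped into the decoder between fusion and the subsequent stacking with shallow features.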

Experiment dataset and the experiment environment
To test the validity of the experimental results and the generalization ability of the model, the validation set of the public CityScapes dataset is used in this experiment. The Cityscapes dataset contains stereoscopic video sequences recorded in street scenes of 50 different cities, covering 20,000 weakly annotated frames and 5,000 frames with high-quality pixel-level annotations. The 5,000 finely annotated images are divided into train, validation and test parts of 2,975, 500 and 1,525 images respectively, each of size 1024×2048.
In this experiment, the deep learning platform PyTorch is used to build the network; the configuration of the experimental machine is shown in Tab. 2.

Training Strategies
Due to equipment limitations, the CityScapes dataset was trained with the stochastic gradient descent (SGD) optimizer, with the weight decay set to 0.00011, the momentum to 0.91 and the base learning rate to 0.1; the learning rate declines with the "Poly" decay policy, and the network is trained end-to-end through backpropagation. For training on the CityScapes training set, the maximum number of iterations is set to 50,000.
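The "Poly" policy decays the learning rate polynomially from the base value to zero over the training run. A minimal sketch follows; the exponent power=0.9 is the customary setting for this policy and is an assumption here, as the paper does not state it.

```python
# "Poly" learning-rate decay: lr = base_lr * (1 - iter/max_iter)^power
def poly_lr(base_lr, it, max_iter, power=0.9):
    return base_lr * (1 - it / max_iter) ** power

base_lr, max_iter = 0.1, 50000   # values from the training setup
for it in (0, 25000, 49999):
    print(it, poly_lr(base_lr, it, max_iter))
```

The rate thus starts at the base value of 0.1 and falls smoothly toward zero as the 50,000-iteration budget is consumed.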
In this paper, the fine-tuned PyConvResNet-50 is initialized with weights pre-trained on ImageNet by means of transfer learning. Transfer learning significantly alleviates the vanishing- and exploding-gradient problems, increases the training and convergence speed of the improved network, and thus saves deep-network training time and improves learning efficiency.
The training loss is the cross-entropy function, which characterizes the difference between two probability distributions. The multiclass cross-entropy loss is stated in Eq. (2):

L = -Σ_{c=1}^{M} y_ic · log(p_ic)    (2)

where M denotes the number of categories; y_ic is the indicator variable, equal to 1 if sample i belongs to category c and 0 otherwise; and p_ic denotes the predicted probability that sample i belongs to category c.
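For a single sample with a one-hot target, Eq. (2) reduces to the negative log-probability of the true class, as this small worked example shows (the class count and probabilities are made up for illustration):

```python
import math

# Multiclass cross-entropy for one sample: L = -sum_c y_c * log(p_c),
# with y a one-hot target vector and p the predicted probabilities.
def cross_entropy(y, p):
    return -sum(yc * math.log(pc) for yc, pc in zip(y, p) if yc)

y = [0, 1, 0]        # true class is c = 1
p = [0.2, 0.7, 0.1]  # predicted probabilities (sum to 1)
loss = cross_entropy(y, p)
print(loss)          # equals -log(0.7)
```

Only the probability assigned to the true class contributes, so the loss shrinks as that probability approaches 1.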
The evaluation metric used in this paper is the Mean Intersection over Union (MIoU), given by Eq. (3) [17]:

MIoU = (1 / (k + 1)) · Σ_{i=0}^{k} p_ii / (Σ_{j=0}^{k} p_ij + Σ_{j=0}^{k} p_ji - p_ii)    (3)

where p_ii denotes the number of pixels whose true class is i and whose predicted class is also i; p_ji denotes the number of pixels whose true class is j and whose predicted class is i; and p_ij denotes the number of pixels whose true class is i and whose predicted class is j.
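Eq. (3) can be computed directly from a confusion matrix, as in this sketch (the 3-class matrix is an invented example, not experimental data):

```python
# MIoU from a confusion matrix:
# IoU_i = p_ii / (sum_j p_ij + sum_j p_ji - p_ii), averaged over classes.
def miou(conf):
    k = len(conf)
    ious = []
    for i in range(k):
        tp = conf[i][i]                              # p_ii
        denom = (sum(conf[i])                        # row: sum_j p_ij
                 + sum(row[i] for row in conf)       # col: sum_j p_ji
                 - tp)
        ious.append(tp / denom if denom else 0.0)
    return sum(ious) / k

conf = [[50, 2, 3],   # rows: true class, columns: predicted class
        [4, 40, 1],
        [5, 0, 45]]
print(miou(conf))
```

Each class's IoU divides its true positives by the union of its row and column totals, so classes with many confusions pull the mean down regardless of their pixel count.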

Model Performance Comparison.
To measure the performance of the proposed method and validate its effectiveness, the improved algorithm proposed in this paper, Z_DeepLab v3+, is compared not only with DeepLab v3 and DeepLab v3+ but also with the recent state-of-the-art models HRNet [18] and CCNet [19] on the CityScapes validation set; the prediction results are shown in Tab. 3, and the model parameters and detailed quantitative information in Tab. 4. Combining Tab. 3 and Tab. 4, although DeepLab v3+ already outperforms the other semantic segmentation models, the model constructed in this paper obtains more competitive results with an MIoU of 74.76%, 2.20 percentage points higher than DeepLab v3+, while also reducing the number of model parameters thanks to its improved ASPP module.
Compared with DeepLab v3+, the proposed model has fewer parameters and obtains better training results, at the cost of only a modest increase in training time.

Comparison of the effect of Integrating Different Scales.
Incorporating multi-scale features is clearly an effective way to improve the accuracy of image semantic segmentation. Compared with high-level features, low-level features have higher resolution and more positional and detail information, but because they undergo less convolutional processing they carry less semantics and more noise; high-level features have richer semantic information but lower resolution and a poorer ability to perceive details. This paper therefore explores experimentally how to combine the two efficiently, taking their strengths and discarding their weaknesses.
To determine how fusing information from different feature layers affects model prediction, the effectiveness of each improvement scheme is tested with the control-variable method. The results are shown in Tab. 5.

As can be seen from Tab. 5, feature fusion using three blocks of the backbone performs better than fusion using two, so manually setting the network to fuse the features of Block2, Block3 and Block4 of the backbone is desirable and achieves better results.

Comparison of Training Time and Model Parameters of the ASPP Module with Depthwise Separable Convolution.
To check the impact of the improved ASPP module on the experimental results and the extent of the improvement in model training time and efficiency, this paper compares the ASPP module before and after the improvement; the experimental outcomes are shown in Tab. 6. As can be seen from Tab. 6, the modified ASPP module reduces the model parameters and speeds up model training, shortening the overall training time with little effect on the final accuracy. Therefore, depthwise separable convolution is adopted in the model design. In addition, the right-hand box of the proposed model's segmentation result in the first column of Figure 5 contains no spurious road sign, with no erroneous or blurred segmentation. In the second column, the proposed model gives clear segmentation of image information such as street signs and road signs; the other models incorrectly segment the road sign in the right-hand box, an error the proposed model avoids, and its predictions are more accurate and comprehensive.

Conclusion
In this paper, improvement schemes are proposed for the erroneous segmentation and boundary blurring of the DeepLab v3+ model. By fusing multi-scale features with an attention mechanism, the multi-scale feature information of the image is effectively extracted; by introducing depthwise separable convolution, the model parameters are reduced and training is accelerated. In experiments, the proposed model achieves an MIoU of 74.76% on the validation set of the public CityScapes dataset, and the segmentation results also demonstrate its effectiveness in extracting boundary information. Although the algorithm improves accuracy over previous algorithms, the multi-scale combination used is set manually and the optimal multi-scale fusion method was not explored; subsequent research will investigate in depth how to further enhance the model's expressiveness and improve its segmentation on real-world datasets.

3.3.4. Comparison of This Paper's Method with Other Model Segmentation Methods.
The method in this paper is compared with the segmentation results of the DeepLab v3+ model on the CityScapes validation set, as shown in Figure 5.

Figure 5.
Figure 5. Comparison of segmentation results on the Cityscapes validation set. As can be seen from Figure 5, the accuracy of the improved algorithm correlates positively with the quality of its segmentation results, so the model's accuracy can be used to judge the segmentation quality. Comparing the pictures in the first column, the boundaries produced by this paper's algorithm are obviously clearer than those of the other models, identifying and segmenting out the traffic lights on the poles.

Table 1 .
Use distribution of pyramid convolution.
Depthwise separable convolution [14] consists of two steps, decomposing a traditional convolution into a depthwise convolution and a 1×1 pointwise convolution; this drastically decreases the number of model parameters with only a limited loss of accuracy, thus improving the training efficiency of the model.

Table 2 .
Experimental machine hardware and software configuration.

Table 3 .
Performance comparison of different models on Cityscapes.

Table 4 .
Model parameters and detailed quantitative information.

Table 5 .
Comparison of different multi-scale effects of fusion.

Table 6 .
Impact of improved ASPP module.