Fast-DecoupledNet: An Improved Multi-branch Edge Enhanced Semantic Segmentation Network

Existing semantic segmentation methods incorporate ideas from edge detection by using multi-branch networks to attend to edges and subjects separately, but they leave many aspects unaddressed and yield limited improvement. In this paper, we propose Fast-DecoupledNet, a semantic segmentation network. We design an Edge Feature Extractor to extract the target's edge features more accurately, and we compute the subject features and the final features from the global features obtained by joint downsampling. In addition, we employ a shallower ResNet as the backbone network to reduce computational complexity while preserving accuracy. Our proposed method achieves state-of-the-art results of 72.59 F-score and 77.64% mIoU on the DeepGlobe Land Cover Classification dataset.


Research Significance of Semantic Segmentation
Semantic segmentation, a crucial tool in computer vision, attempts to classify every pixel in the input image and thus understand the image at the pixel level. The technique groups the pixels of an image that belong to the same label into a region carrying the corresponding semantic category. With the continuing modernization and growing intelligence of everyday life, semantic segmentation has become the technical basis of several industries and is widely used in remote sensing imagery, self-driving vehicles, medical imaging, etc. For example, autonomous driving can apply semantic segmentation to distinguish obstacles from drivable areas. In medical images, it can be used to distinguish disease types in different parts of the human body. Semantic segmentation is therefore of great research importance and has wide application value.
Deep learning has recently demonstrated its power in numerous sectors, including remote sensing, and has significantly raised the precision and efficiency of semantic segmentation. At the same time, it brings many problems and challenges. First, semantic segmentation requires an enormous amount of training data, but the available datasets are few in number and low in quality. Moreover, datasets vary in type and size, and ultra-high-resolution images place high demands on computing power. In real, complex situations, semantic segmentation must overcome two major challenges: the variability among items of the same class and the similarity between items of different classes, often accompanied by disturbances such as oversized, undersized, and fragmented objects. The quality of semantic segmentation strongly affects subsequent image processing, so further research in this area is necessary.

Existing Semantic Segmentation Methods and Drawbacks
Convolutional neural networks are the foundation of today's widely used semantic segmentation approaches. U-Net [1] is one of the most effective strategies; based on fully convolutional networks, it applies data augmentation to the dataset and incorporates more scales, but it is not effective for segmenting large objects. DeepLabV3+ [2] uses atrous (dilated) convolution and improves the decoder structure to raise segmentation accuracy. Some later works adopt self-attention to improve the representation learning and feature detection ability of the model and use residual connections to enhance its depth and stability. Later research improved the segmentation of small objects by adding a separate branch in a two-stream CNN to learn details about edges. Subsequently, Li et al. [3] proposed a new paradigm for semantic segmentation by decoupling the high-level semantic feature map into two parts, subject features and boundary features: the subject features are generated by a flow-based approach that learns offsets to deform the internal pixel features of the target, and the boundary features are obtained by subtracting the subject features from the output feature map. However, these methods extract features only under the supervision of a loss function, so the extraction effect is limited, and the order of feature extraction is not considered further. To combine semantic segmentation and edge enhancement, collaborative multi-task learning architectures have been introduced that store shared latent semantics to facilitate interaction between the tasks. However, they focus more on coupling the two tasks than on enhancing semantic segmentation with edge detection.
In summary, previous work has explored joint edge enhancement and semantic segmentation, but many problems remain unaddressed.
Our main contributions are summarized as follows:
- We present a new network called Fast-DecoupledNet for semantic segmentation and achieve the best overall results on the corresponding dataset.
- We design the Edge Feature Extractor module, which extracts edge features more accurately and computes subject features in combination with global features.
- We use a shallower ResNet as the backbone network to maintain segmentation accuracy while reducing computational complexity.

Related Work
Here, we discuss other work related to ours, focusing on the three aspects central to this paper, and mainly explain how our work differs from others.

Semantic Segmentation
Traditional machine learning methods such as random forests were once used for semantic segmentation. With the ongoing advancement of deep learning, most research methods are now based on FCNs, the first network built on a CNN architecture for this task; it achieved a large accuracy leap by using deconvolution for upsampling. Chen et al. [4] proposed GLNet in 2019, which innovatively integrates global and local information and effectively aggregates features: it uses a global branch for rough segmentation and then a local branch for fine segmentation after obtaining the foreground region, though this treatment is relatively coarse. Shan et al. [5] improved GLNet's branching structure and proposed the first local feature fusion method, allowing cropped chunks to learn features from their surroundings. MBNet [6] designed a multi-branch structure to solve the multi-resolution input problem and introduced an attention mechanism to enhance the fusion effect.

Multi-branch Network
The downsampling operations in fully convolutional networks blur edge features, so multi-branch networks have been proposed to refine details. Yu et al. [7] devised a multi-branch network that uses separate branches to parse context and details and finally fuses them with a feature fusion module. Subsequent works built on such architectures, introducing attention mechanisms or improving the branches.
Our approach also uses a two-branch structure that captures global features through the ASPP branch; the difference is that we use the Edge Feature Extractor branch to detect edge features. We also design a shallower backbone network that reduces computational complexity while still extracting higher-quality edge features.
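To make the two-branch idea concrete, here is a minimal NumPy sketch, not the actual network: the learned ASPP branch is stood in for by average pooling followed by nearest-neighbour upsampling, the Edge Feature Extractor by a simple gradient-magnitude map, and the two branch outputs are fused by channel concatenation.

```python
import numpy as np

def global_branch(x, factor=4):
    """Coarse context: average-pool by `factor`, then nearest-neighbour
    upsample back to input resolution (stand-in for the ASPP branch)."""
    c, h, w = x.shape
    pooled = x.reshape(c, h // factor, factor, w // factor, factor).mean(axis=(2, 4))
    return pooled.repeat(factor, axis=1).repeat(factor, axis=2)

def edge_branch(x):
    """High-frequency detail: per-channel gradient magnitude
    (stand-in for the learned Edge Feature Extractor)."""
    gy = np.abs(np.diff(x, axis=1, prepend=x[:, :1, :]))
    gx = np.abs(np.diff(x, axis=2, prepend=x[:, :, :1]))
    return gx + gy

def two_branch(x):
    # Fuse both branches along the channel axis.
    return np.concatenate([global_branch(x), edge_branch(x)], axis=0)

x = np.random.default_rng(1).standard_normal((3, 16, 16))
print(two_branch(x).shape)  # (6, 16, 16)
```

In the real network both branches are learned; the point of the sketch is only that one branch carries smoothed context while the other carries edge detail, and fusion happens at full resolution.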

Edge Optimization
Back in 2016, Chen et al. [8] explicitly extracted edge cues as constraints and let CNNs learn edge feature maps. There have also been many efforts to improve edge extraction through better structural modeling, but they do not directly process the boundary pixels. Inspired by Gaussian filtering, Li et al. [3] extracted the subject and edge features of the image separately, which improved both the internal consistency of objects and the quality of edge segmentation. However, Decoupled SegNet [3] only supervises training through a special loss function and obtains the edge features by computation. Our work, in contrast, directly designs an Edge Feature Extractor to obtain the edge features; we then perform a few calculations and fuse the global features to get the final feature map, which yields better results on the test set.

Method
In this section, our entire framework is introduced in 3.1, and the multi-branch network structure and its advantages are discussed in 3.2. For the design of the whole network, we borrowed the structure of the state-of-the-art model DeepLabV3+ [2], using ResNet as the backbone network. In particular, we use a shallower ResNet, which reduces the computational complexity. In addition, we designed the Edge Feature Extractor to extract edge information. The Decoupled module processes the resulting features and integrates them to obtain the final feature map.

Decoupled Module
The features F_full of the whole image can be split into two components, F_body and F_edge. Under the same assumption as Decoupled SegNet [3], they obey the additivity rule F_full = F_body + F_edge. The Decoupled module mainly processes the edge features obtained by the Edge Feature Extractor: the edge features are subtracted from the global features extracted by the ASPP module to obtain the body features. Finally, the body features and the global features are fused and output.
In the traditional Decoupled SegNet [3], the final features are generated under the supervision of two loss functions, one for the subject and one for the edge; there, the edge features are obtained by subtracting the subject features from the global features. We instead employ the Edge Feature Extractor module to extract edge features directly, improving efficiency. At the same time, we optimize the edge branch separately with an acceleration step and use a shallower network.
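A minimal sketch of the decoupling arithmetic, assuming feature maps are plain arrays and that the final fusion is channel concatenation (one plausible choice; the fusion operator is not spelled out here):

```python
import numpy as np

def decouple(f_full, f_edge):
    """Recover body features from the additivity assumption
    F_full = F_body + F_edge, then fuse body and global features."""
    f_body = f_full - f_edge
    # Fuse body features with the global features along the channel axis.
    f_final = np.concatenate([f_body, f_full], axis=0)  # (2C, H, W)
    return f_body, f_final

# Toy feature maps: C=4 channels over an 8x8 grid.
rng = np.random.default_rng(0)
f_full = rng.standard_normal((4, 8, 8))
f_edge = rng.standard_normal((4, 8, 8))
f_body, f_final = decouple(f_full, f_edge)
print(f_final.shape)  # (8, 8, 8)
```

By construction, adding the recovered body features back to the edge features reproduces the global features exactly, which is the content of the additivity rule.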

Loss Function
The overall loss is a weighted sum of three terms:

L_total = λ_full · L_full(ŝ_full, y) + λ_body · L_body(ŝ_body, y) + λ_edge · L_edge(ŝ_edge, y),

where ŝ_body and ŝ_edge are the semantic segmentation maps predicted from the subject features and the edge features, respectively, and y is the ground truth with label values.
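A minimal sketch of a weighted three-term loss of this shape, assuming plain pixel-wise cross-entropy for all three terms (the actual edge term may be defined differently) and taking the weights in the L_full : L_edge : L_body order used in the ablation:

```python
import numpy as np

def pixel_ce(pred, y, eps=1e-9):
    """Mean pixel-wise cross-entropy.
    pred: (K, H, W) softmax scores; y: (H, W) integer labels."""
    h, w = y.shape
    # Gather the predicted probability of the true class at every pixel.
    p_true = pred[y, np.arange(h)[:, None], np.arange(w)]
    return -np.mean(np.log(p_true + eps))

def total_loss(p_full, p_edge, p_body, y, ratios=(1.0, 1.0, 1.0)):
    """Weighted sum  a*L_full + b*L_edge + c*L_body."""
    a, b, c = ratios
    return (a * pixel_ce(p_full, y)
            + b * pixel_ce(p_edge, y)
            + c * pixel_ce(p_body, y))
```

With a uniform prediction over K classes, each cross-entropy term equals log K, so the total is (a + b + c) · log K, which makes the effect of the ratio easy to check.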

Experiment
In this section, we evaluate our method on the well-known DeepGlobe dataset. For this dataset, we adopt the Precision, Recall, F-score, and mIoU metrics to report segmentation accuracy. We also test the impact of different backbone depths on computational complexity and accuracy. Finally, we vary the ratio of the loss functions for the different features to study its effect on segmentation accuracy.

Dataset
The dataset we use is DeepGlobe Land Cover Classification: sub-meter-resolution satellite images, mainly of countryside, captured by DigitalGlobe's satellite. The dataset contains 1146 images of 2448×2448 pixels in RGB format at a ground resolution of 0.5 m. It is used for a multi-class segmentation task in which the model must detect urban, agricultural, pasture, woodland, water, barren, and unknown regions in the image. The enormous number of pixels and labels makes the dataset particularly difficult to segment.
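For illustration, a sketch of converting DeepGlobe RGB label masks into class-index maps; the color-to-class mapping below is the commonly used DeepGlobe coding and should be verified against the official dataset release before use.

```python
import numpy as np

# Commonly cited DeepGlobe color coding (RGB -> class index); verify against
# the dataset's official class dictionary before relying on these values.
PALETTE = {
    (0, 255, 255): 0,    # urban
    (255, 255, 0): 1,    # agriculture
    (255, 0, 255): 2,    # rangeland / pasture
    (0, 255, 0): 3,      # forest / woodland
    (0, 0, 255): 4,      # water
    (255, 255, 255): 5,  # barren
    (0, 0, 0): 6,        # unknown
}

def rgb_to_labels(mask_rgb):
    """Convert an (H, W, 3) RGB label image into an (H, W) index map."""
    labels = np.full(mask_rgb.shape[:2], 6, dtype=np.int64)  # default: unknown
    for color, idx in PALETTE.items():
        labels[np.all(mask_rgb == color, axis=-1)] = idx
    return labels
```

Unmatched colors (e.g. from JPEG artifacts) fall through to the unknown class, which is a deliberate choice in this sketch rather than part of the dataset specification.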

Experimental Results and Analysis
Based on the above four evaluation metrics, we carried out extensive experiments. Precision evaluates the proportion of captured results that belong to the target; Recall measures the proportion of the target category that is retrieved; F-score, the harmonic mean of the two, reflects overall performance; and mean intersection over union (mIoU), a standard measure for semantic segmentation, averages the intersection-over-union ratio across all categories. Compared with ConDinet++ and other models, earlier models such as DlinkNet34 and DeepLabV3 achieve higher Precision scores, but the Precision scores of all models are lower than ours. MBNet and UhrsNet have low Recall scores, while the Recall scores of ConDinet++ and our model both exceed 70. On F-score, only our model exceeds 70 points; ConDinet++, the highest-scoring of the other models, reaches only 67.96. Our baseline, Decoupled SegNet, has a slightly worse Precision score but a higher Recall score. On the same dataset, our model trails the best Recall only slightly while achieving the highest Precision, F-score, and mIoU. Overall, our model performs best.
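All four metrics can be derived from a single confusion matrix; the following compact NumPy sketch uses macro averaging, one plausible convention among several.

```python
import numpy as np

def segmentation_metrics(pred, gt, num_classes):
    """Macro-averaged Precision, Recall, F-score, and mIoU from
    flat integer label arrays `pred` and `gt`."""
    cm = np.zeros((num_classes, num_classes), dtype=np.int64)
    np.add.at(cm, (gt.ravel(), pred.ravel()), 1)  # rows: truth, cols: prediction
    tp = np.diag(cm).astype(float)
    precision = tp / np.maximum(cm.sum(axis=0), 1)
    recall = tp / np.maximum(cm.sum(axis=1), 1)
    f_score = 2 * precision * recall / np.maximum(precision + recall, 1e-9)
    # IoU per class: TP / (TP + FP + FN), then averaged over classes.
    iou = tp / np.maximum(cm.sum(axis=0) + cm.sum(axis=1) - tp, 1)
    return precision.mean(), recall.mean(), f_score.mean(), iou.mean()
```

A sanity check: a prediction identical to the ground truth must score 1.0 on all four metrics when every class is present.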

Ablation Experiment
The two major contributions of our work are the multi-branch network and the shallow backbone, so our ablation experiments mainly aim to demonstrate their effectiveness. Our backbone network is ResNet. In addition, we test how the ratio of the loss functions for global, edge, and subject features affects the results.
The impact of the shallow backbone. Since changing the number of convolutional layers has only a small effect on segmentation accuracy, we mainly compare the computational complexity of ResNet variants of different depths in this section. We used ResNets with 6, 12, 24, and 50 convolutional layers, respectively; the results are shown in Table 2. A shallower backbone greatly reduces computational complexity while maintaining accuracy.
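To see why depth dominates cost, here is a toy calculation of parameters and multiply-accumulates for a stack of same-size 3×3 convolutions; real ResNets vary channel counts and use strides, so this is only illustrative of the scaling, not of the exact figures in Table 2.

```python
def conv_cost(c_in, c_out, k, h, w):
    """Parameter and multiply-accumulate counts for one k x k convolution
    producing a (c_out, h, w) output (stride and padding ignored)."""
    params = c_out * (c_in * k * k + 1)      # weights + biases
    macs = c_out * c_in * k * k * h * w      # one MAC per weight per output pixel
    return params, macs

def stack_cost(depth, channels=64, k=3, h=128, w=128):
    """Total cost of `depth` identical 3x3 conv layers: a toy stand-in
    for comparing shallower vs. deeper backbones."""
    p, m = conv_cost(channels, channels, k, h, w)
    return depth * p, depth * m

for depth in (6, 12, 24, 50):
    params, macs = stack_cost(depth)
    print(depth, params, macs)
```

Since every layer in the toy stack is identical, both parameters and MACs grow linearly with depth, so halving the depth halves the cost, the qualitative behaviour the ablation exploits.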
In Fast-DecoupledNet, we generally use the ResNet with 6 convolutional layers; this applies both to the ablation experiment on the Edge Feature Extractor module and to the tests with various loss function ratios.
The effect of the Edge Feature Extractor. We remove the Edge Feature Extractor module and instead obtain the edge features by traditional downsampling, leaving all remaining computations unchanged: the global features and the body features are still combined to produce the final result. Measured with the previous four metrics, Precision, Recall, F-score, and mIoU decrease by 0.4, 0.3, 0.34, and 0.30, respectively, a significant drop in performance.
The effect of the loss function ratio. In line with other works, we vary the ratio of L_full, L_edge, and L_body and measure the effect by mIoU. As the results in Table 3 show, changing the ratio to 1:2:2 or 1:2:4 lowers mIoU in every case. The gaps in mIoU are small, indicating that our network is fairly robust to the weights of the different loss functions, yet the ratio still has some effect on accuracy. Compared with Decoupled SegNet, our method improves the outcome in these typical cases.
Compared with Decoupled SegNet [3], the results of our model are significantly better. At the edges, our results are more precise and smoother; over the whole image, they are richer in detail and more accurate.

Conclusion
In this work, we propose a novel multi-branch network for semantic segmentation with fused edge enhancement. We obtain the edge features of the image with the Edge Feature Extractor and subtract them from the global features acquired by the ASPP module to derive the subject information. We use a modified ResNet as the backbone network; its shallower design reduces the computational complexity of the entire network, making it lighter. We achieve state-of-the-art results on the DeepGlobe Land Cover Classification dataset (72.59 F-score, 77.64% mIoU), demonstrating the effectiveness of the whole network.

Figure 1. Overview of the whole network architecture. The network consists of five parts: the Edge Feature Extractor, Backbone Network, ASPP Module, Decoupled Module, and Concatenation Module.

Table 1 .
Deepglobe test set boundary region segmentation results.

Table 2 .
Comparison of the computational complexity of ResNet with different layers.

Table 3 .
Effects of different loss function ratios on mIoU.
Figure 2. Example of segmentation results on Deepglobe Land Cover Classification dataset.