Research on an underwater image segmentation algorithm based on YOLOv8

Underwater image segmentation must cope with complex variation in target type, number, and scale, as well as environmental interference. Traditional segmentation methods rely on extensive manual annotation and expert knowledge, extract only surface-level image information, and deliver low segmentation accuracy and efficiency, so a deep-learning-based image segmentation method is chosen instead. Building on YOLOv8, this work adds an adaptive self-supervised learning module (ConvNeXt V2), a lightweight neck network (Slim Neck), and a dynamic sparse attention mechanism (Bi-level Routing Attention, BiFormer), improving YOLOv8's single-stage instance segmentation of different underwater targets. The ConvNeXt V2 module introduces a global response normalization (GRN) unit, which enhances feature competition between channels and reduces model computation. A Slim Neck network is designed that improves both model inference speed and accuracy through the dense (shuffle) convolution operation in the lightweight GSConv module and the cross-level network structure VoV-GSCSP. The BiFormer dynamic sparse attention mechanism enables more flexible computation allocation and content awareness. Comparison experiments show that the improved algorithm dramatically increases segmentation speed while also improving segmentation accuracy: on the dual-target dataset, mAP improves by 3.5% from 0.753 to 0.779, and FPS improves by 86% from 63 to 117. The experiments validate that the improved algorithm can be used for the segmentation of different underwater images.


Introduction
Image segmentation extracts the region of interest in an image; each pixel in the segmented image is assigned to a category. Traditional image segmentation technology relies on strictly hand-crafted algorithms and professional knowledge, extracts only surface-level image information, and struggles to meet the accuracy and efficiency requirements of practical applications [1]. In recent years, image segmentation based on deep learning has developed rapidly: a deep network can be designed according to actual needs, fed large amounts of raw image data, made to extract high-level abstract features through complex processing, and finally made to output a segmented image with the same resolution as the input [2]. In 2015, Long et al. proposed the fully convolutional network (FCN), one of the earliest image semantic segmentation models, which first convolves and downsamples the original image for whole-image classification, then upsamples the feature maps carrying the extracted semantic information, and finally obtains a segmented image of the same size as the original [3]. In 2017, Chen et al. proposed DeepLab, which avoids further downsampling and enlarges the receptive field with atrous (dilated) convolution while learning pixel-to-pixel neighborhood relationships, thereby obtaining deeper semantic information and better classification [4]. In 2015, Ronneberger et al. proposed U-Net, whose skip connections combine deep features with shallow features during upsampling. Instance segmentation combines detection and segmentation and is divided into single-stage (one-stage) and two-stage methods [5]. In 2017, Mask R-CNN, proposed by He et al.,
is a two-stage method that first obtains candidate regions through convolutional downsampling, then extracts features for each candidate region with RoIAlign, and finally performs detection and segmentation [6]. Two-stage instance segmentation has more model parameters and a more complex network structure, so it is difficult to deploy in embedded and mobile applications. Many deployment scenarios require not only good results but also compressed computation. To address this, recent research on single-stage instance segmentation has produced many excellent network structures whose accuracy now rivals that of two-stage models. In 2020, Wang et al. proposed SOLO, a YOLO-style single-stage segmentation method that adds a segmentation module on top of the detection module: the detection module locates target edges while the segmentation module extracts a mask for the region of interest [7]. In 2022, the Bi-Polar Mask proposed by Zhao et al. formulated instance segmentation as predicting instance contours through instance center classification and dense distance regression in polar coordinates [8]. In 2022, Poly-YOLO, proposed by Hurtik et al., built on the original idea of YOLOv3 to perform instance segmentation using bounding polygons, eliminating two of its weaknesses: heavily rewritten labels and ineffective anchor assignments [9]. In 2022, PatchDCT, proposed by Wen et al., achieved fine-grained segmentation through multistage cascaded structural refinement and vector compression [10].
This paper performs single-stage instance segmentation of underwater images based on YOLOv8. Only the underwater target region needs to be segmented from the background, which is an instance segmentation task, and the single-stage model is chosen to meet the real-time requirements of underwater scenes. By adding the ConvNeXt V2 module and the Slim Neck module, the original YOLOv8 model is improved, effectively increasing the accuracy and speed of underwater image segmentation. As shown in Figure 1, the ConvNeXt V2 module introduces a global response normalization (GRN) unit on top of the ConvNeXt V1 module and removes the LayerScale layer of ConvNeXt V1. The GRN layer is a new convolutional neural network layer that enhances feature competition between channels by normalizing the feature maps on each channel. Its implementation is very simple and has no learnable parameters. The GRN layer consists of three steps: global feature aggregation, feature normalization, and feature calibration. First, given an input feature X ∈ R^(H×W×C), a global function G aggregates each spatial feature map X_i into a vector, G(X) = gx = {‖X_1‖, ‖X_2‖, …, ‖X_C‖} ∈ R^C, as shown in Equation 1. Next, a response normalization function is applied to the aggregated vector using standard divisive normalization, N(‖X_i‖) = ‖X_i‖ / Σ_{j=1..C} ‖X_j‖, as shown in Equation 2. Finally, the normalized responses are used to calibrate the original spatial feature maps, X_i = X_i · N(G(X)_i), as shown in Equation 3. The network equipped with the GRN layer has 15.6 M parameters, compared with 28.6 M for the version with the BN layer.
The GRN layer has two advantages over the traditional BN layer: first, it needs no additional parameters because it only normalizes the feature maps; second, it can handle any batch size, whereas the BN layer must adjust its statistics according to the batch size, which increases computation. The computational cost of the whole GRN layer is very small, so it is easy to add to a neural network to enhance feature competition and improve model performance.
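As a minimal sketch of the three GRN steps described above, the parameter-free core can be written in a few lines of NumPy (the shapes and the small epsilon for numerical stability are our assumptions; the full ConvNeXt V2 layer additionally applies a residual connection and two learnable scalars, omitted here):

```python
import numpy as np

def grn(x: np.ndarray, eps: float = 1e-6) -> np.ndarray:
    """Core of Global Response Normalization for one feature map.

    x: array of shape (H, W, C). The three steps from the text:
      1) global feature aggregation: per-channel L2 norm over H and W;
      2) feature normalization: divisive normalization across channels;
      3) feature calibration: rescale each channel by its normalized response.
    """
    gx = np.linalg.norm(x, axis=(0, 1))   # (C,) global aggregation
    nx = gx / (gx.mean() + eps)           # (C,) divisive normalization
    return x * nx                         # broadcast calibration over H, W
```

Channels with above-average global response are amplified and weak channels are suppressed, which is the "feature competition" effect described above.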

Addition of Slim Neck
The C2f module in the neck network of the original YOLOv8 model is replaced with the VoV-GSCSP module, and the Conv module is replaced with the GSConv module. The structure of the Slim Neck is shown in Figure 2. To accelerate prediction, CNNs repeatedly compress feature maps spatially while expanding their channels, and each such transformation loses part of the semantic information. The computation of GSConv first downsamples the input with a normal convolution; then applies DWConv depthwise convolution and concatenates the results of the two convolutions; and finally performs a dense (shuffle) convolution operation that rearranges the channels so that the channels produced by the two previous convolutions are interleaved. Through this dense shuffle operation, GSConv retains as many of the hidden connections between channels as possible, but if it were used at every stage of the model, the network would become much deeper and inference time would increase significantly. By the time the feature map reaches the neck, its channel dimension is at its maximum and its width and height are at their minimum, and no further transformation is needed; therefore, GSConv is used only in the neck network. At this stage, the feature maps processed by GSConv contain less redundant and repeated information, need no compression, and the attention mechanism works more effectively. GSConv is a lightweight convolution whose computational cost is approximately 60-70% of that of standard convolution (SC). The VoV-GSCSP module uses a cross-level network structure to reduce computational and structural complexity while retaining as much semantic information as possible and maintaining sufficient accuracy.

The GPU was a Tesla T4; 100 epochs were trained, the batch size was set to 16, the image size was uniformly set to 640×640, the learning rate was set to 0.01, and mosaic augmentation was chosen not to be switched off for the last ten epochs of training. The improved algorithm is compared with the original algorithm, and the results are shown in Figure 3. The evaluation metrics are precision, recall, mean average precision (mAP), and frames per second (FPS). Precision is the proportion of predicted positive samples that are correct and evaluates the reliability of the model's predictions. Recall is the proportion of correctly identified positive cases among all positive cases. AP is the average precision of a single category; mAP averages the AP of all categories and measures the inference precision of the model. FPS is the number of images processed per second and measures the inference speed of the model.
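The GSConv pipeline described earlier in this section — a standard-convolution branch, a depthwise branch, concatenation, and a channel shuffle — can be sketched at the shape level as follows. This is a toy NumPy version with stride 1 and a 1×1 dense convolution standing in for the real strided 3×3 standard convolution, so it only illustrates the data flow, not the actual module:

```python
import numpy as np

def pointwise_conv(x, w):
    """Toy 1x1 convolution: x is (H, W, Cin), w is (Cin, Cout)."""
    return x @ w

def depthwise_conv3x3(x, k):
    """Toy 3x3 depthwise convolution with zero padding.
    x: (H, W, C); k: (3, 3, C), one kernel per channel."""
    H, W, C = x.shape
    p = np.pad(x, ((1, 1), (1, 1), (0, 0)))
    out = np.zeros_like(x)
    for i in range(H):
        for j in range(W):
            out[i, j] = np.sum(p[i:i+3, j:j+3] * k, axis=(0, 1))
    return out

def channel_shuffle(x, groups=2):
    """The 'shuffle' step: interleave the channels of the two branches."""
    H, W, C = x.shape
    return x.reshape(H, W, groups, C // groups).transpose(0, 1, 3, 2).reshape(H, W, C)

def gsconv(x, w1, dk):
    """Shape-level sketch of GSConv:
    1) dense conv to half the output channels;
    2) depthwise conv on that result;
    3) concatenate both halves and shuffle so information from the
       dense and depthwise paths is mixed across all output channels."""
    a = pointwise_conv(x, w1)      # (H, W, C_out/2) dense branch
    b = depthwise_conv3x3(a, dk)   # (H, W, C_out/2) cheap depthwise branch
    return channel_shuffle(np.concatenate([a, b], axis=-1), groups=2)
```

The shuffle is what preserves the "hidden connections between channels": after it, every pair of adjacent output channels contains one dense-branch channel and one depthwise-branch channel.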
From Table 1, it can be seen that the recall of the improved model increases while its precision decreases slightly, indicating that the improved model finds more true targets at a small cost in prediction confidence. The segmentation accuracy for the single target remains the same while the segmentation speed increases by 82%. The dual-target segmentation accuracy increases by 3.5% while the speed increases by 86%. The multitarget segmentation accuracy increases by 10% and the segmentation speed by 52%. The algorithm's inference speed on the datasets with different classes of targets exceeds 90 FPS, which meets real-time application requirements. On the multitarget dataset, both segmentation accuracy and speed improve greatly, showing the superiority of the improved model for complex target segmentation. As shown in Figure 3, while the speed of single-target segmentation is greatly improved, high segmentation accuracy is maintained, and the detection box and contour of the ship are well predicted. The dual-target results in Figure 3 show that the improved model segments the contours of the diver and the seagrass more clearly when they overlap; segmentation accuracy improves greatly when the diver takes different postures; and the improved model also segments dispersed seagrass regions more effectively. The multitarget results in Figure 3 show that the improved model clearly outperforms the original in segmentation accuracy: where many small fish are distributed over seafloor ruins, the original model misses and misdetects targets while the improved model segments them well; for overlapping target regions, the improved model infers more accurate target boundaries; and for similar targets, it differentiates them more finely.
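The AP and mAP quantities reported above can be illustrated with a short sketch. The AP computation below uses the simple rectangle rule over the precision-recall curve, which is one common convention; the exact interpolation used by YOLOv8's evaluator may differ:

```python
def average_precision(is_tp, num_gt):
    """AP for one class from detections ranked by confidence.
    is_tp: booleans (True = correct detection) in rank order;
    num_gt: number of ground-truth instances of the class."""
    tp = fp = 0
    ap, last_recall = 0.0, 0.0
    for hit in is_tp:
        tp, fp = tp + hit, fp + (not hit)
        recall = tp / num_gt               # fraction of ground truth found
        precision = tp / (tp + fp)         # fraction of detections correct
        ap += (recall - last_recall) * precision  # rectangle rule
        last_recall = recall
    return ap

def mean_average_precision(ap_per_class):
    """mAP: the mean of the per-class average precisions."""
    return sum(ap_per_class) / len(ap_per_class)
```

For example, a class with detections ranked [correct, wrong, correct] against 2 ground-truth instances yields AP = 0.5·1.0 + 0.5·(2/3) ≈ 0.833 under this convention.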

Conclusions
The effectiveness of the improved algorithm is demonstrated through designed experiments. The FPS of the improved underwater target segmentation reaches 117, 86% higher than that of the original algorithm, so the model can be deployed on hardware devices for real-time underwater image segmentation tasks. Comparative experiments show effective improvements in both accuracy and speed on the single-target, dual-target, and multitarget segmentation tasks, which better addresses the difficulties posed by the types, numbers, and scales of underwater targets. In future work, we will study how to counter external environmental interference and other factors affecting underwater image segmentation and further improve its accuracy.

Figure 2. Slim Neck structure.

Experiments

Three datasets for instance segmentation of underwater images were labeled using Roboflow: a single-target segmentation dataset, a dual-target segmentation dataset, and a multitarget segmentation dataset. The single-target dataset contains 1653 images of boats and their ground-truth labels in the training set and 73 images in the validation set. The dual-target dataset contains two categories, macrophytes and divers, with a training set of 1080 images and a validation set of 92 images. The multitarget dataset contains a training set of 5163 images and a validation set of 110 images, labeled at the pixel level with five underwater image categories: coral, debris, fish, human, and robot. Once the images and their corresponding categories are annotated in Roboflow, they can be converted directly into a YOLOv8-trainable format. Experiments for YOLOv8 model training and validation were conducted on the Google Colab cloud service; the training settings and evaluation metrics are given in the Slim Neck section above.
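For reference, the training run described above might be expressed with the Ultralytics Python API roughly as follows. This is a hedged sketch, not the authors' script: the dataset YAML filename and the model size (`yolov8n-seg.pt`) are placeholders, and `close_mosaic=0` is our reading of "mosaic augmentation not switched off" (it keeps mosaic enabled through the final epochs):

```python
# Hedged sketch of the training configuration described in the text.
# Assumptions (not from the paper): dataset YAML name, model size,
# and close_mosaic=0 as the reading of "mosaic not switched off".
from ultralytics import YOLO

model = YOLO("yolov8n-seg.pt")      # YOLOv8 segmentation variant
model.train(
    data="underwater.yaml",         # placeholder dataset config
    epochs=100,
    batch=16,
    imgsz=640,
    lr0=0.01,
    close_mosaic=0,                 # 0 = keep mosaic on for all epochs
)
```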

Figure 3. Predictions from the original and improved models.

Table 1. Performance metrics of the original and improved models.