Underwater target recognition algorithm based on improved convolutional neural network: YOLOV5-Improved

This paper proposes an underwater target recognition method based on a YOLOV5-Improved model. The MobileOne module is added to the Backbone network of the original model to make the model more lightweight and to reduce network complexity, the number of parameters, the amount of computation, and the inference time. The CA full-dimensional dynamic convolution module is embedded in the MP module of the original model to focus on the main, discriminative features of fishing nets, thereby improving the accuracy of fishing net recognition. The adaptive spatial feature fusion module ASFF is introduced into the head network of the original model to perform multi-scale feature fusion and remove useless background information, improving the detection of smaller targets in complex backgrounds under natural conditions. Experiments show that the YOLOV5-Improved model learns well and converges quickly during training, and its accuracy reaches 99.6%, so it can better detect underwater fishing nets and assist safe navigation at sea.

Existing approaches still require further research and technological innovation to improve the performance of underwater target recognition algorithms.

Target recognition method
Existing target detection methods include traditional methods and deep learning methods. Traditional methods based on feature extraction and machine learning, such as Haar features and HOG+SVM, perform well in simple scenarios but have limited accuracy. Deep learning methods such as Faster R-CNN, YOLOV5 and SSD extract features through convolutional neural networks and combine them with region proposals or anchor-box mechanisms. They offer high accuracy and real-time performance, perform well in complex scenes, and have become the first choice for target detection.
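As a concrete illustration of the traditional feature-extraction pipeline mentioned above, the following is a minimal HOG-style descriptor written in plain NumPy. It is a sketch only: the function name, cell size and bin count are illustrative choices, not details from this paper or from the canonical HOG implementation.

```python
import numpy as np

def hog_descriptor(img, cell=8, bins=9):
    """Minimal HOG-style descriptor: per-cell histograms of gradient
    orientation, weighted by gradient magnitude (illustrative sketch)."""
    gy, gx = np.gradient(img.astype(float))
    mag = np.hypot(gx, gy)
    ang = np.rad2deg(np.arctan2(gy, gx)) % 180.0  # unsigned orientation
    h, w = img.shape
    feats = []
    for i in range(0, h - cell + 1, cell):
        for j in range(0, w - cell + 1, cell):
            a = ang[i:i + cell, j:j + cell].ravel()
            m = mag[i:i + cell, j:j + cell].ravel()
            hist, _ = np.histogram(a, bins=bins, range=(0, 180), weights=m)
            feats.append(hist / (np.linalg.norm(hist) + 1e-6))  # normalize
    return np.concatenate(feats)
```

In the traditional pipeline, descriptors like this would then be fed to an SVM classifier trained on labeled positive and negative windows.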

Convolutional neural network
A convolutional neural network [2] is a deep learning model widely used in image recognition and computer vision tasks. It extracts image features through multiple convolutional and pooling layers and performs classification or regression through fully connected layers. Convolutional neural networks can effectively capture local patterns and spatial relationships in images, and their translation invariance and parameter sharing make them outstanding at processing large-scale image data. They are widely used in image classification, target detection, face recognition and other fields.
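The parameter sharing and translation-related behavior described above can be demonstrated with a tiny NumPy sketch (all names are illustrative): one shared kernel slides over the whole image, and shifting the input shifts the response map by the same amount, which is the property that pooling layers then turn into approximate translation invariance.

```python
import numpy as np

def conv2d(img, kernel):
    """Valid-mode 2D cross-correlation: one shared kernel slides over
    the image (parameter sharing), responding to local patterns."""
    kh, kw = kernel.shape
    h, w = img.shape
    out = np.zeros((h - kh + 1, w - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(img[i:i + kh, j:j + kw] * kernel)
    return out

edge = np.array([[1.0, 0.0, -1.0]] * 3)   # simple vertical-edge detector
img = np.zeros((8, 8)); img[:, 3] = 1.0   # vertical line at column 3
img_shift = np.roll(img, 2, axis=1)       # same line shifted right by 2
r1 = conv2d(img, edge)                    # response peaks at the line
r2 = conv2d(img_shift, edge)              # same peak, shifted by 2
```

Shifting the input by two columns shifts the peak of the response map by exactly two columns, without any change to the kernel's parameters.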

Underwater target detection tool
Related research efforts focus on improving underwater detection tools [3], including improving the resolution and target recognition capability of underwater sonar imaging, developing optical underwater imaging tools, autonomous navigation and multi-sensor integration for underwater robotic systems, developing underwater communication technology, and using deep learning and artificial intelligence for underwater data processing and target recognition. These works aim to improve underwater target detection capability and accuracy and to promote the development of underwater scientific research and applications.

Network Structure
In the data preprocessing part, LabelImg is used for image annotation, and the ratio of training, validation and test images is 7:2:1. The Mosaic-9 data enhancement method [4] is applied to the training images: nine training images are randomly cropped, scaled and arranged, and then spliced into a new image, as shown in Figure 1. Data enhancement increases the amount of training data, improves the generalization ability of the model, adds noise data to improve robustness, and reduces overfitting.
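The 7:2:1 split and the splicing step of Mosaic-9 can be sketched as follows. This is a simplified illustration: real Mosaic-9 also randomly scales and crops each image and remaps the bounding-box labels, which is omitted here, and all function names are made up for the sketch.

```python
import numpy as np

rng = np.random.default_rng(0)

def split_dataset(n, ratios=(0.7, 0.2, 0.1)):
    """Shuffle indices and split them 7:2:1 into train/val/test."""
    idx = rng.permutation(n)
    n_train = int(n * ratios[0]); n_val = int(n * ratios[1])
    return idx[:n_train], idx[n_train:n_train + n_val], idx[n_train + n_val:]

def mosaic9(images, tile=64):
    """Simplified Mosaic-9: place 9 images on a 3x3 grid of tiles.
    A naive top-left crop stands in for the random resize/crop step."""
    canvas = np.zeros((3 * tile, 3 * tile) + images[0].shape[2:],
                      dtype=images[0].dtype)
    for k, img in enumerate(images[:9]):
        r, c = divmod(k, 3)
        patch = img[:tile, :tile]
        canvas[r * tile:r * tile + patch.shape[0],
               c * tile:c * tile + patch.shape[1]] = patch
    return canvas
```

A 7:2:1 split of 100 images yields 70/20/10 disjoint subsets, and nine 64×64 images splice into one 192×192 mosaic.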

General framework
The enhanced training images are normalized to obtain preprocessed training images, and their feature scale is set to a preset size by resizing their length and width. A YOLOV5-Improved model is then built: the starting part of its Backbone network consists of four consecutive CBS modules and MobileOne modules, the middle part consists of two sets of improved MP modules [5] and ELAN modules, and the output part consists of the improved MP module and a MobileOne module. The CA module is embedded in the improved MP module, the ASFF module is embedded in the head network, and the model uses the Focal-CIoU loss function [6]. In summary, the MobileOne module is added to the Backbone network [7] of the YOLOV5 model to make it lightweight, the CA module is embedded in the MP module to focus on main features, and the ASFF module is embedded in the head network for multi-scale feature fusion. The YOLOV5-Improved model is shown in Figure 2.

MobileOne module
The introduction of the MobileOne module [8] reduces the number of parameters, the amount of computation, and the inference time of the network. It consists of two parts: the upper part is based on depthwise convolution, and the lower part is based on pointwise convolution. The depthwise convolution part consists of three branches: the leftmost branch is a 1×1 convolution, the middle branch is an over-parameterized 3×3 convolution, i.e. k parallel 3×3 convolutions, and the right branch is a skip connection containing a BN layer [9]. The 1×1 and 3×3 convolutions here are both depthwise convolutions [10], i.e. grouped convolutions whose number of groups g equals the number of input channels. The pointwise convolution part consists of two branches: the left branch is an over-parameterized 1×1 convolution composed of k parallel 1×1 convolutions, and the right branch is a skip connection containing a BN layer, as shown in Figure 3.

IOP Publishing doi:10.1088/1742-6596/2718/1/0120665

In the full-dimensional dynamic convolution, GAP (global average pooling) + FC (fully connected layer) + ReLU (activation layer) + FC (fully connected layer) + Sigmoid (activation layer) forms the attention module. W_1, W_2, ..., W_n are the convolution kernels of the dynamic convolution layer, α_wi denotes the attention scalar [12] of kernel W_i along the kernel-space dimension, and α_si, α_ci, α_fi denote the three newly introduced attentions [13] along the spatial dimension, the input-channel dimension and the output-channel dimension, respectively:

y = (α_w1 ⊙ α_f1 ⊙ α_c1 ⊙ α_s1 ⊙ W_1 + ... + α_wn ⊙ α_fn ⊙ α_cn ⊙ α_sn ⊙ W_n) (1)
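The branch structure described above can be sketched in PyTorch as a training-time MobileOne-style block. This is a simplified sketch of the published design, not the paper's exact implementation: the class name and defaults are illustrative, and the re-parameterization that folds all branches into a single inference-time convolution is omitted.

```python
import torch
import torch.nn as nn

class MobileOneBlock(nn.Module):
    """Training-time MobileOne-style block (sketch). Depthwise stage:
    k parallel 3x3 depthwise convs + a 1x1 depthwise conv + a BN skip
    branch. Pointwise stage: k parallel 1x1 convs + a BN skip branch
    (only when channel counts match). groups=c_in makes the 3x3/1x1
    convs depthwise, i.e. grouped with g equal to the input channels."""
    def __init__(self, c_in, c_out, k=2):
        super().__init__()
        self.dw3 = nn.ModuleList(
            nn.Sequential(
                nn.Conv2d(c_in, c_in, 3, padding=1, groups=c_in, bias=False),
                nn.BatchNorm2d(c_in))
            for _ in range(k))
        self.dw1 = nn.Sequential(
            nn.Conv2d(c_in, c_in, 1, groups=c_in, bias=False),
            nn.BatchNorm2d(c_in))
        self.dw_skip = nn.BatchNorm2d(c_in)
        self.pw = nn.ModuleList(
            nn.Sequential(nn.Conv2d(c_in, c_out, 1, bias=False),
                          nn.BatchNorm2d(c_out))
            for _ in range(k))
        self.pw_skip = nn.BatchNorm2d(c_in) if c_in == c_out else None
        self.act = nn.ReLU()

    def forward(self, x):
        y = self.dw1(x) + self.dw_skip(x) + sum(b(x) for b in self.dw3)
        y = self.act(y)
        z = sum(b(y) for b in self.pw)
        if self.pw_skip is not None:
            z = z + self.pw_skip(y)
        return self.act(z)
```

At inference time, each branch's conv+BN pair can be fused into a single convolution, which is what gives MobileOne its low inference latency.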

Improvements to other modules
In the original YOLOV5 network structure, three feature maps of different scales detect targets of different sizes; they are obtained through different sampling multiples and fed into the subsequent network. The feature pyramid structure used by YOLOV5 is PaFPN [14], which mainly improves the model's robustness to large and small objects across different input images. In an underwater environment, some features of fishing nets are difficult to identify accurately because of environmental noise or high-frequency noise introduced by the sampling equipment. When a network uses a feature pyramid to detect objects, it applies a heuristic feature selection mechanism: large instances correspond to high-level feature maps and small instances to low-level feature maps. When an instance is a positive sample on one feature layer, the corresponding areas on the other feature layers are treated as background. This conflict and inconsistency between different levels interferes with gradient computation during training and reduces the effectiveness of the feature pyramid. The ASFF module [15] is therefore added to the YOLOV5 network to resolve this inconsistency in the single-detector feature pyramid: it lets the network learn how to spatially filter out useless interference information from other layers and retain only useful fishing-net features. The operation is differentiable and easily learned through backpropagation, as shown in Figure 5. The three output feature maps of the head network [16] must be resized before fusion; for example, to fuse the outputs of layers 62, 63 and 64, the outputs of layers 63 and 64 are first downsampled to match the size of layer 62. The fusion is

y_ij^l = α_ij^l · x_ij^(1→l) + β_ij^l · x_ij^(2→l) + γ_ij^l · x_ij^(3→l) (2)

where x_ij^(n→l) denotes the feature from level n resized to level l, y_ij^l denotes the feature map output after spatial scale fusion, and α_ij^l, β_ij^l and γ_ij^l are weight coefficients generated by the network through 1×1 convolutions and a softmax function and learned through backpropagation, satisfying α_ij^l, β_ij^l, γ_ij^l ∈ [0,1] and α_ij^l + β_ij^l + γ_ij^l = 1.
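The per-pixel weighted fusion of the three levels can be sketched as a small PyTorch module. This is a sketch, not the reference ASFF implementation: it assumes the three inputs have already been resized to a common scale, and the class and layer names are illustrative.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ASFF(nn.Module):
    """Adaptive spatial feature fusion for one output level (sketch).
    Three same-shape feature maps are fused per pixel with weights
    produced by 1x1 convs and a softmax, so the three weights are in
    [0, 1] and sum to 1 at every spatial position."""
    def __init__(self, channels):
        super().__init__()
        # one 1x1 conv per level produces a single-channel weight map
        self.w = nn.ModuleList(nn.Conv2d(channels, 1, 1) for _ in range(3))

    def forward(self, x1, x2, x3):
        # inputs are assumed already resized to a common scale
        logits = torch.cat([self.w[i](x) for i, x in enumerate((x1, x2, x3))],
                           dim=1)
        a = F.softmax(logits, dim=1)  # (N, 3, H, W); channels sum to 1
        return a[:, 0:1] * x1 + a[:, 1:2] * x2 + a[:, 2:3] * x3
```

Because softmax produces the weights, the convex-combination constraint holds by construction, and gradients flow to the weight convolutions through ordinary backpropagation.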

Loss function
The YOLOV5-Improved model uses Focal-CIoU [17] as the loss function. Because IoU-based regression does not account for the aspect ratio, one of the three key elements of bbox regression, researchers proposed the CIoU loss:

L_CIoU = 1 - IoU + ρ²(b, b^gt)/c² + αv (3)

where ρ is the Euclidean distance between the two center points; c is the diagonal length of the minimum box enclosing the predicted and ground-truth boxes; b and b^gt are the center points of the predicted and ground-truth boxes; α is a weighting function

α = v / ((1 - IoU) + v) (4)

and v measures the consistency of the aspect ratio:

v = (4/π²) (arctan(w^gt/h^gt) - arctan(w/h))² (5)

The Focal-CIoU loss is then defined as

L_Focal-CIoU = IoU^γ · L_CIoU (6)

where γ is a hyperparameter controlling the curvature of the weighting. The gradient of the CIoU loss is similar to that of the DIoU loss, but the gradient of v must also be considered. When the width and height lie in [0,1], the value of w² + h² is usually very small, which can cause gradient explosion. CIoU makes the regression consistent with the overlap and brings the aspect ratio as close to the true value as possible; the introduction of Focal Loss alleviates the sample imbalance problem in bounding-box regression by reducing the contribution of the many anchor boxes that overlap little with the target box, so that regression focuses on high-quality anchor boxes.
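The CIoU and Focal-CIoU definitions above can be written as a small PyTorch function. This is a sketch under stated assumptions: boxes are in corner format (x1, y1, x2, y2), and the epsilon handling and default γ are illustrative choices, not values from the paper.

```python
import math
import torch

def focal_ciou_loss(pred, target, gamma=0.5, eps=1e-7):
    """Focal-CIoU for (N, 4) corner-format boxes (sketch).
    L_CIoU = 1 - IoU + rho^2/c^2 + alpha*v; Focal-CIoU = IoU^gamma * L_CIoU."""
    # intersection and union -> IoU
    ix1 = torch.max(pred[:, 0], target[:, 0]); iy1 = torch.max(pred[:, 1], target[:, 1])
    ix2 = torch.min(pred[:, 2], target[:, 2]); iy2 = torch.min(pred[:, 3], target[:, 3])
    inter = (ix2 - ix1).clamp(0) * (iy2 - iy1).clamp(0)
    area_p = (pred[:, 2] - pred[:, 0]) * (pred[:, 3] - pred[:, 1])
    area_t = (target[:, 2] - target[:, 0]) * (target[:, 3] - target[:, 1])
    iou = inter / (area_p + area_t - inter + eps)
    # squared center distance rho^2 and enclosing-box diagonal c^2
    cxp = (pred[:, 0] + pred[:, 2]) / 2; cyp = (pred[:, 1] + pred[:, 3]) / 2
    cxt = (target[:, 0] + target[:, 2]) / 2; cyt = (target[:, 1] + target[:, 3]) / 2
    rho2 = (cxp - cxt) ** 2 + (cyp - cyt) ** 2
    cw = torch.max(pred[:, 2], target[:, 2]) - torch.min(pred[:, 0], target[:, 0])
    ch = torch.max(pred[:, 3], target[:, 3]) - torch.min(pred[:, 1], target[:, 1])
    c2 = cw ** 2 + ch ** 2 + eps
    # aspect-ratio term v and trade-off weight alpha
    wp = pred[:, 2] - pred[:, 0]; hp = pred[:, 3] - pred[:, 1]
    wt = target[:, 2] - target[:, 0]; ht = target[:, 3] - target[:, 1]
    v = (4 / math.pi ** 2) * (torch.atan(wt / (ht + eps))
                              - torch.atan(wp / (hp + eps))) ** 2
    alpha = v / (1 - iou + v + eps)
    l_ciou = 1 - iou + rho2 / c2 + alpha * v
    return (iou ** gamma) * l_ciou  # IoU^gamma down-weights low-overlap boxes
```

For a perfect prediction the loss is essentially zero, while an offset box with partial overlap receives a clearly positive loss, modulated by IoU^γ so that low-overlap anchors contribute little.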

Experiment and result
Because the data set is limited, we selected fishing nets as the detection object and obtained the underwater images to be identified through an underwater laser gating camera. The pre-trained YOLOV5-Improved model [18] performs fishing-net recognition on these underwater images to obtain the recognition results.

Implementation Details
The experimental steps are as follows: first, the constructed YOLOV5-Improved model is trained on the preprocessed training images to obtain a pre-trained model, and training continues until the set number of learning iterations is reached. Then the validation and test images are used to validate and test the YOLOV5-Improved model, yielding the trained model.
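The iterate-until-budget procedure above can be sketched generically. The model, loss and optimizer here are placeholders to show the control flow only, not the actual YOLOV5-Improved training pipeline, and all names are illustrative.

```python
import torch
import torch.nn as nn

def train_until(model, loader, max_iters, lr=1e-3, device="cpu"):
    """Minimal training loop sketch: iterate over batches until the set
    number of iterations is reached, then return the trained model."""
    opt = torch.optim.SGD(model.parameters(), lr=lr, momentum=0.9)
    loss_fn = nn.MSELoss()  # stand-in for the detection loss
    it = 0
    model.train()
    while it < max_iters:
        for x, y in loader:
            if it >= max_iters:
                break
            opt.zero_grad()
            loss = loss_fn(model(x.to(device)), y.to(device))
            loss.backward()
            opt.step()
            it += 1
    return model
```

Validation and testing would then run the returned model in `eval()` mode over the held-out 2:1 validation and test splits.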

Experimental result
We randomly selected some experimental results to display in Figure 6. The value in the upper right corner of each box indicates the confidence of identifying it as a fishing net. The experimental results show confidence levels above 93%, indicating high recognition accuracy.

Comparison with other algorithms
Figures 7 and 8 show the comparative experimental results of the original YOLOV5 model and the YOLOV5-Improved model. As can be seen from Figure 7, the mAP@0.5 of the YOLOV5-Improved model reaches 99.6%, 0.6% higher than that of the original YOLOV5 model, and it converges much faster. As can be seen from Figure 8, the loss curve of the YOLOV5-Improved model converges better, with a lower loss, a better learning effect and higher robustness, and its detection rate is also improved. Therefore, the YOLOV5-Improved model learns well and converges quickly during training, its accuracy reaches 99.6%, and it can better detect underwater fishing nets and assist safe navigation at sea.

Conclusion
This paper improves the network structure of the YOLOV5 model. It absorbs the idea of re-parameterization to reduce the number of parameters, the amount of computation and the inference time of the network; it fuses feature maps of different scales and removes useless background information to improve the detection of smaller targets in complex backgrounds under natural conditions; and it introduces the full-dimensional dynamic convolution module CA to better focus on the discriminative features of fishing nets and improve the accuracy of fishing-net recognition.

Figure 2 .
Figure 2. The overall framework of the network

Figure 3 .
Figure 3. The framework of the MobileOne layer

CA module
CA [11] can capture not only cross-channel information but also direction-aware and position-aware information, which helps the model locate and identify objects of interest more accurately. In addition, coordinate attention is flexible and lightweight, as we can see in Figure 4.
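The pool-encode-reweight flow of coordinate attention can be sketched in PyTorch. This is a simplified sketch of the published CA design, not this paper's exact implementation: the class name, hidden width and reduction ratio are illustrative.

```python
import torch
import torch.nn as nn

class CoordAtt(nn.Module):
    """Coordinate attention (sketch). The feature map is pooled along
    height and width separately, jointly encoded by a shared 1x1 conv,
    split back into two direction-aware attention maps, and multiplied
    onto the input, keeping positional information along both axes."""
    def __init__(self, channels, reduction=8):
        super().__init__()
        mid = max(channels // reduction, 4)
        self.encode = nn.Sequential(
            nn.Conv2d(channels, mid, 1), nn.BatchNorm2d(mid), nn.ReLU())
        self.attn_h = nn.Conv2d(mid, channels, 1)
        self.attn_w = nn.Conv2d(mid, channels, 1)

    def forward(self, x):
        n, c, h, w = x.shape
        ph = x.mean(dim=3, keepdim=True)                    # (N, C, H, 1)
        pw = x.mean(dim=2, keepdim=True).transpose(2, 3)    # (N, C, W, 1)
        y = self.encode(torch.cat([ph, pw], dim=2))         # joint encoding
        yh, yw = torch.split(y, [h, w], dim=2)
        ah = torch.sigmoid(self.attn_h(yh))                 # (N, C, H, 1)
        aw = torch.sigmoid(self.attn_w(yw.transpose(2, 3)))  # (N, C, 1, W)
        return x * ah * aw                                  # broadcast reweight
```

Because the two attention maps are one-dimensional (per row and per column), the module adds very little computation while still encoding where, not just what, to attend to.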

Figure 4 .
Figure 4.The framework of the CA layer

Figure 5 .
Figure 5.The framework of the ASSF layer

Figure 7 .
Figure 7. Comparison of mAP value results

Figure 8 .
Figure 8. Comparison of Loss value results