Study on autonomous ship enhancement by optimizing object detection algorithms

The accuracy and efficiency of object detection are key to the wider adoption of autonomous driving. To address issues such as missed and false detections in traditional target detection algorithms, this article uses deep learning to optimize a ship target detection model with attention mechanisms. Specifically, this paper proposes an improved model based on the traditional YOLOv5s algorithm, and a self-created ship dataset is also provided. To refine ship target detection, various attention mechanisms were implemented and compared, including CBAM, ECA, GAM, SimAM, and SK-Net. Through comparative analysis of the effects of these mechanisms, GAM was identified as the optimal choice. The experimental results indicate that the mean average precision (mAP) of ship target detection increased by 1.0% after incorporating GAM. The YOLOv5s-GAM model is therefore an effective target detection approach for enhancing the safety of ship autopilot, with potential future applications.


Introduction
As ships continue to play vital roles in China's marine economy, the pursuit of autonomous ship technology is growing. Improving the identification and control capabilities of autonomous ships can enhance their safety and intelligence and meet increasing practical demands. Research on computer vision and deep learning is gaining more and more attention as a way to address the challenges of ship target detection in large-scale and complex situations. Deep learning-based target detection algorithms can be classified into two categories: two-stage algorithms and one-stage algorithms. The former, exemplified by R-CNN (Regions with Convolutional Neural Network features), have processing speeds too slow to meet real-time ship detection requirements. The latter, represented by SSD (Single Shot MultiBox Detector) and YOLO (You Only Look Once), can meet real-time demands but exhibit lower detection accuracy [1].
Currently, both types of algorithm are used for ship detection. The idea of the two-stage algorithm is to select candidate regions in the first stage and then classify or regress these candidate regions in the second stage. It mainly includes R-CNN, Fast R-CNN, and Faster R-CNN. Zhang et al. [2] adopted an effective target detection framework, Faster R-CNN, and improved its original convolutional neural network (CNN), VGG16, by utilizing multi-resolution convolutional features and performing region-of-interest (ROI) pooling on a larger feature map in the region proposal network (RPN). This provides a highly effective method for offshore and inland river ship detection based on high-resolution remote sensing imagery. Guo et al. [3] proposed a rotational Libra R-CNN method that considers resultant force and rotational invariance to predict ship positions at three levels. Experimental results on the DOTA dataset demonstrate state-of-the-art accuracy compared with other methods. Zhao et al. [4] created a coupled CNN for detecting small and densely clustered Synthetic Aperture Radar (SAR) ships. Their method has two subnetworks: (1) the exhaustive ship proposal network (ESPN), which generates ship-like regions from multiple layers with multiple receptive fields, and (2) the accurate ship discrimination network (ASDN), which eliminates false alarms by considering the contextual information of each proposal from the ESPN. Experimental results indicate that Zhao et al.'s method is more effective than the multi-step constant false alarm rate (CFAR-MS) approach.
The one-stage algorithm samples densely at different points of an image and uses a neural network for fast feature extraction and real-time recognition. The most commonly used methods include YOLO and SSD. Sun et al. [5] explored rotated-target detection and introduced a ship detection model that uses YOLO to output a target ship's real length, width, and axial information. The model accurately predicts the minimum external rectangular area of the ship target to detect multiple targets, contributing to a significant improvement in detection performance. Huang et al. [6] proposed a new approach, Ship-YOLOv3, which integrates several preprocessing techniques to enhance the model's performance. A guided filter is first applied, followed by grayscale enhancement of the input image. The dimensions of the bounding boxes are then clustered with k-means++ to obtain favorable priors. The YOLOv3 structure is additionally customized by utilizing skip connections to minimize feature redundancy and reduce the amount of convolution. Finally, Ship-YOLOv3 is trained on the ship dataset using weights obtained from the PASCAL VOC dataset. The results indicate that this method accelerates the network's convergence rate when compared with other existing YOLO algorithms. Chen et al. [7] presented an improved variant of YOLOv3, called ImYOLOv3, which utilizes an attention mechanism to achieve an optimal trade-off between detection accuracy and speed. They integrate a new lightweight dilated attention module (DAM), which can be smoothly integrated into the basic YOLOv3 structure, to extract discriminative features for ship targets. This approach accurately identifies ships of varying scales against various backgrounds while maintaining real-time processing speed.
Despite the optimizations described above for the characteristics of ship target detection, detection capability remains insufficient. In real-world scenarios, ships exhibit a vast range of shapes and sizes, from small fishing boats to mammoth oil tankers, which poses challenges such as missed or false detections. Designing feature extractors and classifiers that can differentiate between target sizes and shapes is therefore a daunting task [8]. This study explores an improved YOLOv5 model that utilizes attention mechanisms to improve the accuracy and recall rate of ship target detection. The proposed model incorporates an attention mechanism that emphasizes the distinguishing features of different ship types, mitigates the impact of noise on the model, and enhances both the accuracy and efficiency of ship target detection. This research is crucial to advancing autonomous ship technology and can offer reliable support for safer and more efficient navigation.

Standard YOLOv5 target detection network structure
The YOLOv5 framework is developed and maintained by Ultralytics. Compared with the original YOLOv1, the YOLOv5 model incorporates and enhances effective and reliable deep learning techniques, which has elevated its popularity in the field of target detection [9]. The YOLOv5 algorithm is an end-to-end neural network model that employs a single convolutional neural network to divide the complete image into a grid and predict a category and bounding box for each grid cell, providing high accuracy while detecting images quickly. Over the years, YOLOv5 has undergone multiple iterations; in this paper, we use the YOLOv5s model, version 6.0, for the experiments. The network has three primary components: backbone, neck, and head. Figure 1 illustrates the YOLOv5s network structure [9]. The purpose of the Backbone is to conduct initial feature extraction, producing three preliminary effective feature layers. The Neck then performs further feature extraction on these three layers, fusing them to produce three enhanced effective feature layers. Lastly, the Head output layer performs prediction, taking each effective feature layer as input to the prediction network to obtain the results.
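The three-part layout described above can be sketched as follows. This is a minimal illustrative sketch, not the actual Ultralytics implementation: `TinyBackbone`, `TinyHead`, the channel widths, and the stride choices are invented for illustration, and the Neck (feature fusion) is omitted for brevity.

```python
import torch
import torch.nn as nn

class TinyBackbone(nn.Module):
    """Initial feature extraction: three feature maps at strides 8/16/32."""
    def __init__(self, c=8):
        super().__init__()
        self.c3 = nn.Conv2d(3, c, 3, stride=8, padding=1)        # stride-8 map
        self.c4 = nn.Conv2d(c, c * 2, 3, stride=2, padding=1)    # stride-16 map
        self.c5 = nn.Conv2d(c * 2, c * 4, 3, stride=2, padding=1)  # stride-32 map

    def forward(self, x):
        p3 = self.c3(x)
        p4 = self.c4(p3)
        p5 = self.c5(p4)
        return p3, p4, p5

class TinyHead(nn.Module):
    """One 1x1 prediction conv per level: (5 + num_classes) * anchors channels."""
    def __init__(self, chs, num_classes=9, anchors=3):
        super().__init__()
        out = (5 + num_classes) * anchors  # box (4) + objectness (1) + classes
        self.preds = nn.ModuleList(nn.Conv2d(c, out, 1) for c in chs)

    def forward(self, feats):
        return [m(f) for m, f in zip(self.preds, feats)]

backbone = TinyBackbone()
head = TinyHead([8, 16, 32])  # channel widths of p3, p4, p5 above
outs = head(backbone(torch.randn(1, 3, 640, 640)))
```

With nine ship classes and three anchors per cell, each output map has (5 + 9) × 3 = 42 channels, one grid of predictions per feature level.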

Improved YOLOv5 network design with attention mechanism
To apply the YOLOv5s model to ship target detection more effectively, the model needs further customization for the specific detection task and dataset. In order to extract more attention information from the feature map, we introduce five attention modules into the Head component of the YOLOv5s model: the Convolutional Block Attention Module (CBAM), Efficient Channel Attention (ECA), the Global Attention Module (GAM), the Simple, Parameter-Free Attention Module (SimAM), and the Selective Kernel Network (SK-Net). The optimal attention module is then selected through comparative experiments.
The attention mechanism, which originated from simulating human visual attention processing, was first applied in the field of images. Since the Google DeepMind team published "Recurrent Models of Visual Attention" in 2014, attention has been widely used in many deep learning domains, such as natural language processing, target detection, and semantic segmentation [10]. By introducing the attention mechanism into target detection, the model can focus on the key information in an image, filter out irrelevant information, save computing resources, and enhance its detection capability for small targets. In this paper, we introduce the CBAM, ECA, GAM, SimAM, and SK-Net attention modules into the three blue box positions of the Head component, as shown in Figure 2, compare their performance improvements, and determine the optimal attention module.

Improved YOLOv5 network with CBAM attention mechanism network.
The CBAM attention mechanism is composed primarily of the channel attention module (CAM) and the spatial attention module (SAM) [11]. Starting from the input feature map, CBAM progressively applies attention weights along the channel and spatial dimensions to extract valuable information from the feature map. The CAM process is depicted in Figure 4: the input feature map F undergoes global maximum pooling and global average pooling, each followed by a shared two-layer neural network. The resulting features are fused via element-wise addition, and the channel attention feature MC is obtained by applying a Sigmoid non-linear operation:

MC(F) = σ(W1(W0(Favg)) + W1(W0(Fmax))),

where σ is the Sigmoid activation function, W0 and W1 are the weights of the shared multi-layer perceptron, Favg and Fmax are the average-pooled and max-pooled features, and r is the dimensionality reduction factor (r = 16 in this paper).
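The CAM formula above, together with the SAM step described under Figure 4, can be sketched in PyTorch. This is an illustrative re-implementation based on the description in the text (class and variable names are our own), not the exact code used in the experiments:

```python
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    """MC = sigmoid(MLP(AvgPool(F)) + MLP(MaxPool(F))), reduction factor r = 16."""
    def __init__(self, channels, r=16):
        super().__init__()
        self.mlp = nn.Sequential(            # shared two-layer network (W0, W1)
            nn.Linear(channels, channels // r),
            nn.ReLU(inplace=True),
            nn.Linear(channels // r, channels),
        )

    def forward(self, x):
        b, c, _, _ = x.shape
        avg = self.mlp(x.mean(dim=(2, 3)))   # global average pooling branch
        mx = self.mlp(x.amax(dim=(2, 3)))    # global max pooling branch
        return torch.sigmoid(avg + mx).view(b, c, 1, 1)

class SpatialAttention(nn.Module):
    """MS = sigmoid(conv7x7([mean_c(F'); max_c(F')])) over the channel axis."""
    def __init__(self):
        super().__init__()
        self.conv = nn.Conv2d(2, 1, kernel_size=7, padding=3)

    def forward(self, x):
        s = torch.cat([x.mean(dim=1, keepdim=True),
                       x.amax(dim=1, keepdim=True)], dim=1)
        return torch.sigmoid(self.conv(s))

class CBAM(nn.Module):
    def __init__(self, channels, r=16):
        super().__init__()
        self.ca = ChannelAttention(channels, r)
        self.sa = SpatialAttention()

    def forward(self, x):
        x = x * self.ca(x)    # channel-refined feature F'
        return x * self.sa(x)  # spatially refined output
```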

Improved YOLOv5 network with ECA attention mechanism network.
The efficient channel attention (ECA) module avoids dimensionality reduction and effectively captures cross-channel interactions [12]. The ECA structural diagram can be seen in Figure 6. Once ECA receives a feature input, global average pooling (GAP) processes it, primarily highlighting the regions in the image that require attention. A fast 1×k one-dimensional convolution then performs local cross-channel information interaction without dimensionality reduction. The activation function σ produces the channel weight ω, which is multiplied by the input to yield the channel feature map:

ω = σ(C1Dk(y)),

where C1D denotes one-dimensional convolution, y denotes the channel descriptor, and σ represents the Sigmoid activation function.
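The channel-weight computation above can be sketched in PyTorch. This is an illustrative re-implementation (class and variable names are our own) that also includes ECA's adaptive choice of the kernel size k from the channel dimension C, with γ = 2 and b = 1 as discussed next:

```python
import math
import torch
import torch.nn as nn

class ECA(nn.Module):
    """omega = sigmoid(C1D_k(GAP(x))); k adapted from C with gamma=2, b=1."""
    def __init__(self, channels, gamma=2, b=1):
        super().__init__()
        t = int(abs(math.log2(channels) / gamma + b / gamma))
        k = t if t % 2 else t + 1  # |.|_odd: nearest odd kernel size
        self.conv = nn.Conv1d(1, 1, kernel_size=k, padding=k // 2, bias=False)

    def forward(self, x):
        y = x.mean(dim=(2, 3))                     # GAP -> (B, C)
        y = self.conv(y.unsqueeze(1)).squeeze(1)   # 1-D conv across channels
        return x * torch.sigmoid(y)[..., None, None]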
It is essential to note that the larger the channel dimension, the greater the range of local cross-channel interaction. The kernel size k is therefore determined adaptively from the channel dimension C:

k = ψ(C) = |log2(C)/γ + b/γ|odd,

where |x|odd denotes the odd number nearest to x, γ is set to 2, and b is set to 1.

Improved YOLOv5 network with GAM attention mechanism network.
The Global Attention Mechanism (GAM) minimizes information reduction in the network and intensifies the influence of global dimension-interaction features [13]. GAM builds on the sequential channel-spatial attention mechanism of CBAM, with its sub-modules optimized to yield better results. The complete module is illustrated in Figure 7. Given the input feature map F1, the intermediate state F2 and the output F3 are defined as

F2 = MC(F1) ⊗ F1,  F3 = MS(F2) ⊗ F2,

where MC and MS are the channel and spatial attention maps, respectively, and ⊗ denotes element-wise multiplication.
Figure 8 shows the channel attention sub-module: the input feature F1, with dimensions C×W×H, is first permuted to W×H×C. It then passes through a two-layer multi-layer perceptron (MLP), in which the first (encoding) layer reduces the number of channels from C to C/r and the second (decoding) layer restores the original number of channels. The Sigmoid activation function is then applied to obtain the weight coefficient MC, which strengthens cross-dimensional channel dependencies. The spatial attention sub-module is illustrated in Figure 9. The input feature F2 passes through two 7×7 convolutional layers to blend spatial information, scaled with the same reduction ratio r as the channel sub-module, yielding new features. Finally, the weight coefficient MS is obtained via the Sigmoid activation function.
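The two GAM sub-modules and the F2/F3 equations above can be sketched as follows. This is an illustrative re-implementation based on the description in the text (class names and the reduction ratio r = 4 are our assumptions), not the exact code used in the experiments:

```python
import torch
import torch.nn as nn

class GAM(nn.Module):
    """F2 = MC(F1) * F1; F3 = MS(F2) * F2 (element-wise), reduction ratio r."""
    def __init__(self, channels, r=4):
        super().__init__()
        # channel sub-module: two-layer MLP (encode C -> C/r, decode back to C)
        self.mlp = nn.Sequential(
            nn.Linear(channels, channels // r),
            nn.ReLU(inplace=True),
            nn.Linear(channels // r, channels),
        )
        # spatial sub-module: two 7x7 convs with the same reduction ratio r
        self.spatial = nn.Sequential(
            nn.Conv2d(channels, channels // r, 7, padding=3),
            nn.BatchNorm2d(channels // r),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // r, channels, 7, padding=3),
            nn.BatchNorm2d(channels),
        )

    def forward(self, x):
        # permute C x H x W so the MLP acts along the channel axis, then restore
        att = self.mlp(x.permute(0, 2, 3, 1)).permute(0, 3, 1, 2)
        x = x * torch.sigmoid(att)               # F2
        return x * torch.sigmoid(self.spatial(x))  # F3
```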

Improved YOLOv5 network with SimAM attention mechanism network.
The Simple, Parameter-Free Attention Module (SimAM), illustrated in Figure 10, is a lightweight and straightforward yet very effective attention module. Unlike earlier proposals such as channel attention and spatial attention mechanisms, SimAM does not increase network complexity, since it adds no extra parameters. SimAM is a 3-D attention mechanism motivated by neuroscience theory: it defines an energy function for each neuron and derives an analytical solution to that function, which is used to compute the attention weights [14].
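The analytical solution to the energy function can be sketched as follows. This is an illustrative re-implementation of the parameter-free weighting (the regularization constant λ = 1e-4 is an assumption taken from the SimAM paper's default, and the class name is our own):

```python
import torch
import torch.nn as nn

class SimAM(nn.Module):
    """Parameter-free 3-D attention: each position is weighted by the closed-form
    solution of its energy function, then gated with a Sigmoid."""
    def __init__(self, lam=1e-4):
        super().__init__()
        self.lam = lam  # regularization constant lambda (assumed default)

    def forward(self, x):
        n = x.shape[2] * x.shape[3] - 1                    # neurons per channel - 1
        d = (x - x.mean(dim=(2, 3), keepdim=True)).pow(2)  # squared deviation
        v = d.sum(dim=(2, 3), keepdim=True) / n            # channel variance
        e_inv = d / (4 * (v + self.lam)) + 0.5             # inverse energy
        return x * torch.sigmoid(e_inv)
```

Because the weights are computed directly from the feature statistics, the module adds no learnable parameters to the network.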

Improved YOLOv5 network with SK-Net attention mechanism network.
The Selective Kernel Network (SK-Net) is a channel attention mechanism developed for feature maps. SK-Net recalibrates the weights of the various channels by incorporating global context, as depicted in Figure 11 [15]. The attention mechanism enables each neuron to adjust the size of its receptive field (convolution kernel) adaptively according to the multi-scale input information. SK-Net is called a Selective Kernel network because this improves its capacity to capture multi-scale features in complex image spaces. The model comprises three essential components: Split, Fuse, and Select.
Split: perform complete convolution operations (efficient grouped/depthwise convolutions, Batch Normalization, and the ReLU function) on the input tensor X with different convolution kernel sizes. As shown in the structure diagram, 3×3 and 5×5 kernel convolutions are applied to X to obtain two outputs.
Fuse: perform element-wise summation on the two outputs above to obtain the feature map U, where Fgp is the global average pooling operation and Ffc is a two-layer fully connected layer that first reduces and then restores the dimensionality.
Select: two different weight matrices weight the two branch results, and the weighted outputs are then summed. Because different convolution kernels are used throughout, the module can adaptively adjust its own receptive field and achieves higher detection ability and accuracy for multi-scale targets.
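The Split/Fuse/Select pipeline can be sketched as follows. This is an illustrative re-implementation (class names and the reduction settings are our assumptions); following the SK-Net paper's convention, the 5×5 branch is realized as a dilated 3×3 depthwise convolution:

```python
import torch
import torch.nn as nn

class SKConv(nn.Module):
    """Split: 3x3 and (dilated) 5x5 branches; Fuse: GAP + two-layer FC;
    Select: softmax branch weights, weighted sum of branch outputs."""
    def __init__(self, channels, r=8, L=16):
        super().__init__()
        d = max(channels // r, L)  # hidden width of the FC bottleneck
        self.branch3 = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1, groups=channels),
            nn.BatchNorm2d(channels), nn.ReLU(inplace=True))
        self.branch5 = nn.Sequential(  # 5x5 receptive field via dilation=2
            nn.Conv2d(channels, channels, 3, padding=2, dilation=2, groups=channels),
            nn.BatchNorm2d(channels), nn.ReLU(inplace=True))
        self.fc1 = nn.Linear(channels, d)       # reduce (Ffc, first layer)
        self.fc2 = nn.Linear(d, channels * 2)   # restore, one set per branch

    def forward(self, x):
        u3, u5 = self.branch3(x), self.branch5(x)  # Split
        u = u3 + u5                                # Fuse: element-wise sum
        z = torch.relu(self.fc1(u.mean(dim=(2, 3))))  # Fgp then Ffc
        b, c = x.shape[:2]
        w = self.fc2(z).view(b, 2, c).softmax(dim=1)  # Select weights
        return u3 * w[:, 0, :, None, None] + u5 * w[:, 1, :, None, None]
```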

Experimental environment
The operating system for this experiment is Ubuntu 18.04, the CPU is a 12-vCPU Intel(R) Xeon(R) Platinum 8255C @ 2.50 GHz, and the GPU is a single RTX 3090 (24 GB). The YOLOv5 model is based on the PyTorch deep learning framework, the programming language is Python, and CUDA 11.3 and cuDNN 8 are used for GPU acceleration. Parameter settings are shown in Table 1.

Dataset
The ship dataset is an important component of ship target detection and carries significant importance. Owing to the limited number of available public ocean datasets, it is challenging to obtain data at the required scale and diversity. Developing ship target detection datasets can improve this situation and benefit numerous practical applications [16]. Web crawling was used to collect a large number of ship images, which were then screened, cleaned, and labeled, resulting in a dataset of 17,988 images. Using an 8:2 ratio, the dataset is split into a training set of 14,395 images and a verification set of 3,593 images. The dataset is divided into nine classes: "engineering_ship", "freighter", "passenger_ship", "public_service_vessel", "sailboat", "speedboat", "submarine", "warship", and "ship". Figure 12 visualizes the dataset analysis: (a) shows the distribution of object categories; (b) shows the distribution of object center points, with the horizontal and vertical coordinates giving the center position; and (c) shows the distribution of object sizes, with the horizontal and vertical coordinates giving object width and height.
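An 8:2 train/verification split of the image list can be sketched as follows. This is a generic illustration (the file names, shuffling, and exact rounding used by the authors are not specified in the text):

```python
import random

def split_dataset(image_paths, train_ratio=0.8, seed=0):
    """Shuffle the image paths reproducibly and split them at the given ratio."""
    paths = list(image_paths)
    random.Random(seed).shuffle(paths)  # fixed seed for a reproducible split
    cut = int(len(paths) * train_ratio)
    return paths[:cut], paths[cut:]     # (training set, verification set)
```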

Evaluation indicators
This paper uses Recall, Precision, average precision (AP), and mean average precision (mAP) to evaluate the accuracy of the detection model [17]. Precision (P) represents the proportion of true positives among all positive predictions; it measures the model's ability to identify positive samples correctly:

P = TP / (TP + FP).

Recall (R) represents the proportion of true positives among all actual positive samples; it measures the model's ability to find the positive samples:

R = TP / (TP + FN).

Average precision (AP) is the area under the precision-recall curve, with recall on the horizontal axis and precision on the vertical axis; integrating this curve gives the average precision, which reflects the model's detection ability for a single category:

AP = ∫ P(R) dR, with R from 0 to 1.

Mean average precision (mAP) is the mean of the AP values over all m categories in the dataset. It jointly reflects the detector's precision and recall; the higher the mAP, the better the detector's performance:

mAP = (1/m) Σ APi, summed over the m categories.
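The four metrics can be computed as follows. This is a minimal sketch with our own function names, using trapezoidal integration as one common way to approximate the area under the precision-recall curve:

```python
def precision(tp, fp):
    """P = TP / (TP + FP)."""
    return tp / (tp + fp) if tp + fp else 0.0

def recall(tp, fn):
    """R = TP / (TP + FN)."""
    return tp / (tp + fn) if tp + fn else 0.0

def average_precision(recalls, precisions):
    """Trapezoidal area under the precision-recall curve (recalls ascending)."""
    ap = 0.0
    for i in range(1, len(recalls)):
        ap += (recalls[i] - recalls[i - 1]) * (precisions[i] + precisions[i - 1]) / 2
    return ap

def mean_average_precision(aps):
    """mAP = mean of the per-category AP values."""
    return sum(aps) / len(aps)
```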

Experimental results and analysis
In this paper, uniform parameters and hyperparameters are employed during training. A comparative analysis evaluates the impact of introducing the five attention mechanisms on model performance. Table 2 presents the comparison of the models in terms of precision, recall, and mean average precision (mAP).
To demonstrate the superiority of the enhanced algorithm more intuitively, the original YOLOv5s model and the five attention-augmented models are used to detect three test images, as shown in Figure 13. The yellow boxes in the figure indicate missed targets: the third photo detected by (b) YOLOv5s-CBAM, the second photo detected by (f) YOLOv5s-SK-Net, and the first photo detected by (a)-(e) all contain missed detections. In the second photo detected by (a) YOLOv5s, there is a false detection in which an "engineering_ship" is labeled "public_service_vessel". As Figure 13 shows, the improved YOLOv5s algorithm not only reduces missed and false detections but also enhances accuracy. In particular, (d) YOLOv5s-GAM detects essentially all targets correctly, and its confidence scores are generally higher than those of the other models, indicating improved robustness compared with the original algorithm. The experimental results show that introducing an attention mechanism into the YOLOv5s model allows it to learn richer information from the feature map and improves its target detection performance. After being introduced into the model, the GAM attention mechanism yields better detection results and a better mAP than the CBAM, ECA, SimAM, and SK-Net modules, and can therefore be used for ship target detection.

Conclusions
To address the issue of missed and false detections in ship target detection, this paper incorporated five attention mechanisms (CBAM, ECA, GAM, SimAM, and SK-Net) into the Head part of the YOLOv5 model and trained the resulting models on a self-built dataset. Comparative experiments show that the mAP of the YOLOv5s-GAM model is 1.0 percentage points higher than that of the original model, the largest increase among the five attention mechanisms. Comparing the detection results on three different test images shows that YOLOv5s-GAM achieves good accuracy and confidence scores, demonstrating that introducing an attention mechanism can effectively improve model performance and detection accuracy. The next research focus is to make the model structure lightweight: under the premise of ensuring speed and accuracy, the model will be further optimized to achieve better detection results.

Figure 1 .
Figure 1. YOLOv5s model network structure diagram [9].

Figure 2 .
Figure 2. The introduction position of the attention module.

Figure 4 .
Figure 4. The process of the Channel Attention Module [11]. The SAM process can be seen in Figure 5: the original input F is multiplied element-wise with the MC feature generated by the channel attention module, yielding the SAM input F'. F' then undergoes channel-wise global maximum pooling and global average pooling, the two resulting feature maps are merged by channel concatenation, and a convolution layer followed by a Sigmoid non-linear operation produces the spatial attention feature MS. Finally, F' is multiplied element-wise with MS to give the final output feature. In summary, the spatial attention is calculated as

MS(F') = σ(f7×7([AvgPool(F'); MaxPool(F')])),

where f7×7 denotes a convolution with a 7×7 kernel.

Figure 8 .
Figure 8. The process of the Channel Attention sub-module [13].

Figure 10 .
Figure 10. Comparison of different attention schemes ((a) channel-wise attention; (b) spatial-wise attention; (c) full 3-D weights for attention) [14]. The same color indicates that a single scalar is used for each channel, spatial position, or point-wise feature.
TP (true positives) is the number of positive samples predicted by the model as positive; FP (false positives) is the number of negative samples predicted as positive; FN (false negatives) is the number of positive samples predicted as negative. m denotes the number of categories in the dataset.

Figure 12 .
Figure 12. Dataset analysis ((a) dataset object category distribution; (b) position distribution of object center points; (c) object size distribution).

Figure 13 .
Figure 13. Comparison of the detection results of different algorithms ((a) YOLOv5s; (b) YOLOv5s-CBAM; (c) YOLOv5s-ECA; (d) YOLOv5s-GAM; (e) YOLOv5s-SimAM; (f) YOLOv5s-SK-Net).