Research on the classification of seabed sediment sonar images based on MoCo self-supervised learning

The discrimination of seafloor substrate type is an extremely important part of seafloor science, and substrate information is of great significance for the development of marine science and the protection of the marine environment. Current sonar equipment can efficiently generate seafloor images and present seafloor information visually, so seafloor substrate classification based on sonar images has become a hot research topic. The convolutional neural network, one of the most important classification algorithms for seabed substrate sonar images, performs well in most cases. However, the limited size of its convolutional kernels restricts its global feature extraction ability, so its capacity to discriminate global features in sonar images is weak. In addition, labeled seabed substrate sonar data are difficult and costly to acquire, so acoustic seabed substrate classification in practice is generally a small-sample classification scenario. To address these problems, this paper selects the Swin Transformer, which has strong global feature extraction ability, as the classifier, and uses MoCo self-supervised learning to pre-train on unlabeled data in order to achieve better results.


Introduction
Identification of seafloor substrate types is a key part of seafloor science and is of great significance in the fields of seafloor resource development, marine spatial planning, fishery resource surveys, underwater communications, and marine environmental protection [1,2]. The substrate is closely related to the geologic body of the seafloor, which usually includes bedrock and loose sediments. The development of accurate seafloor substrate type determination techniques therefore has a significant impact on the advancement of marine science and the protection of the marine environment.
Because CNNs are relatively weak at extracting global information from images [3], the Swin Transformer is chosen as the classifier in this paper. The Swin Transformer is a widely used vision model built on the attention mechanism, which analyzes the correlations between elements of the input sequence and identifies the "key points", so that local features can be learned and global features can be integrated. In addition, considering that unlabeled sonar images are easy to obtain, we explored the application of MoCo self-supervised learning to the Swin Transformer. Pre-training with unlabeled data improves the overall performance of the network in a small-sample setting and enhances the classifier's feature extraction capability. This also helps to improve the performance of seabed substrate sonar image classification with existing classifiers.

Introduction to the Swin Transformer model
The Swin Transformer consists of a stack of multiple Swin Transformer blocks. Varying the number of blocks in each stage produces models with different numbers of parameters [4], which leads to differences in convergence speed and fitting ability. With an adequate training set, the larger the number of parameters, the more patterns the model can learn, improving classification accuracy. Smaller models, however, converge faster.
As shown in Table 1, the Swin Transformer comes in four sizes, of which Swin-B is the base model; the model size and computational complexity of the Swin-T, Swin-S, and Swin-L versions are 0.25×, 0.5×, and 2× those of the base model, respectively. Since the number of parameters and the computational complexity of the Swin-L model are too large, and the amount of data for seabed substrate sonar image classification is generally small, Swin-T, Swin-S, and Swin-B are selected for the experiments in this study.
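As an illustration, the three candidate model sizes can be instantiated and compared in a few lines of PyTorch. This is a minimal sketch using the timm library, not code from the original work; the 3-class head follows this paper's rock/sand/mud task.

```python
import timm  # pip install timm

# The three Swin variants considered in this study, each with a
# 3-class head for the rock/sand/mud substrate categories.
variant_names = {
    "Swin-T": "swin_tiny_patch4_window7_224",
    "Swin-S": "swin_small_patch4_window7_224",
    "Swin-B": "swin_base_patch4_window7_224",
}

for label, name in variant_names.items():
    model = timm.create_model(name, pretrained=False, num_classes=3)
    n_params = sum(p.numel() for p in model.parameters())
    print(f"{label}: {n_params / 1e6:.1f}M parameters")
```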

MoCo self-supervised learning
Self-supervised learning is a type of unsupervised learning that requires no manual labels and instead learns features from the data itself; the learned model can then be transferred to downstream tasks. To extract feature representations that distinguish an image from other samples, a "pseudo-task", called a pretext task, must be constructed. MoCo is a self-supervised learning model [5] that can use unlabeled images to train the feature extraction ability of a network. Its basic idea follows the general approach of self-supervised learning and is similar to SimCLR [6], aiming to discriminate between data-augmented versions of the same image. By maintaining a queue of encoded keys, it addresses SimCLR's need for a large amount of memory to store the dictionary of negative samples and reduces the difficulty of model training.
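To make the mechanism concrete, the sketch below shows the two ingredients that distinguish MoCo: the momentum update of the key encoder and the InfoNCE loss computed against a queue of negative keys. This is a simplified illustration following the MoCo paper's pseudocode, not the authors' implementation; the encoder architecture, feature dimension, queue size, and hyperparameters are placeholders.

```python
import torch
import torch.nn.functional as F

def moco_step(encoder_q, encoder_k, queue, x_q, x_k, m=0.999, tau=0.07):
    """One MoCo training step on two augmented views of the same batch.

    encoder_q / encoder_k: query and key encoders (same architecture).
    queue: (dim, K) tensor of previously encoded keys used as negatives.
    x_q / x_k: two random augmentations of the same images.
    """
    # Momentum update: the key encoder slowly tracks the query encoder.
    with torch.no_grad():
        for p_q, p_k in zip(encoder_q.parameters(), encoder_k.parameters()):
            p_k.data = m * p_k.data + (1.0 - m) * p_q.data

    q = F.normalize(encoder_q(x_q), dim=1)           # (N, dim)
    with torch.no_grad():
        k = F.normalize(encoder_k(x_k), dim=1)       # (N, dim), no gradient

    # Positive logits: similarity of each query with its own key.
    l_pos = torch.einsum("nc,nc->n", q, k).unsqueeze(-1)   # (N, 1)
    # Negative logits: similarity with everything in the queue.
    l_neg = torch.einsum("nc,ck->nk", q, queue)            # (N, K)

    logits = torch.cat([l_pos, l_neg], dim=1) / tau
    labels = torch.zeros(logits.size(0), dtype=torch.long)  # positives at index 0
    return F.cross_entropy(logits, labels), k  # k is enqueued afterwards
```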

Classification of seafloor substrate sonar images based on Swin-Base network
The Swin-B network is the base network of the Swin Transformer, with moderate size, high flexibility, and a wide range of applications. The dataset used in this paper was collected from the sea area north of Chu Island, Weihai, Shandong Province, where three types of seafloor substrate (rock, sand, and mud) exist; the collection area is about 16 km².
In this paper, we use a warm-up strategy: the learning rate starts at a fraction (0.1×) of the target value at the beginning of training and is increased as training progresses until it reaches the set value. After this initial stage, the model can use a larger learning rate to approach a stable region. Warm-up improves the stability of training and reduces the chance of excessively large parameter updates. After warm-up, the experiment uses a cosine annealing strategy. Maintaining a high learning rate throughout would cause the network to oscillate violently around the global optimum without approaching it, so a strategy is needed to gradually reduce the learning rate. Cosine annealing decays the learning rate following the cosine function, i.e.:

$\eta_t = \eta_{\min} + \frac{1}{2}\left(\eta_{\max} - \eta_{\min}\right)\left(1 + \cos\left(\frac{T_{cur}}{T_{max}}\pi\right)\right)$

where $t$ denotes the index of the current epoch, $\eta_{\max}$ and $\eta_{\min}$ denote the maximum and minimum learning rates, which define the range of the learning rate, $T_{cur}$ denotes how many epochs have been executed so far, and $T_{max}$ denotes the total number of epochs. In this experiment, the warm-up period is set to 5 epochs, $\eta_{\max}$ is 0.0001, and $\eta_{\min}$ is 0.00001.

After 100 rounds of training, repeated five times, the test accuracy reached an optimum of 96.24% at the 45th round and oscillated slightly around that value afterwards, indicating that no significant overfitting had yet arisen. In the prediction results, the three per-class accuracies are 92.5%, 95.3%, and 100%, respectively. Overall, the Swin-B model converges quickly on the training set, the training process is stable, and the resulting model achieves a high overall classification accuracy of 96.24%.

Figure 1 shows the training loss over the first 50 rounds for the three networks. Swin-T converges the fastest, although after 20 rounds it differs little from Swin-B. Swin-S converges the slowest, taking until the 50th round to approach the other two. Swin-T converged fastest because it has the fewest parameters. Swin-S increases the number of Swin blocks but not the width of the hidden layers, which slows convergence even though fitting ability increases. Swin-B has the same number of Swin blocks as Swin-S but wider hidden layers, so its learning capacity is greater, which in turn accelerates convergence.
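A schedule of this shape can be reproduced with standard PyTorch schedulers. The sketch below chains a 5-epoch linear warm-up with cosine annealing over the remaining epochs, using the $\eta_{\max} = 10^{-4}$ and $\eta_{\min} = 10^{-5}$ values reported above; the model and optimizer choice are placeholders, not details from the original experiments.

```python
import torch
from torch.optim.lr_scheduler import LinearLR, CosineAnnealingLR, SequentialLR

model = torch.nn.Linear(10, 3)  # placeholder model
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)  # eta_max

warmup_epochs, total_epochs = 5, 100
scheduler = SequentialLR(
    optimizer,
    schedulers=[
        # Linear warm-up from 0.1 * eta_max up to eta_max.
        LinearLR(optimizer, start_factor=0.1, total_iters=warmup_epochs),
        # Cosine decay from eta_max down to eta_min.
        CosineAnnealingLR(optimizer, T_max=total_epochs - warmup_epochs,
                          eta_min=1e-5),
    ],
    milestones=[warmup_epochs],
)

for epoch in range(total_epochs):
    # ... one epoch of training ...
    scheduler.step()
```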

Performance Analysis of Swin Transformer Image Classification
In this study, Swin-S's lower classification accuracy is attributed to its hidden layers being too narrow for its number of Swin blocks, preventing it from fully learning the data distribution. Swin-B improves image representation and accuracy by widening the hidden layers, but it still does not fully exploit the network's capacity. As a result, Swin-T, the smallest network, achieves the highest accuracy. To further exploit the potential of Swin-B, the MoCo self-supervised learning method is used in the next section to make use of unlabeled samples.

Downstream classification task performance analysis
After MoCo pre-trains the encoder on unlabeled data, the encoder can already extract image features to a first approximation. However, since the pretext task differs from the downstream task, the encoder parameters still need to be fine-tuned. For the image classification task, the encoder parameters are transferred into the classifier, which is then trained on the labeled training set.
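The transfer itself amounts to copying the pre-trained encoder weights into the classification network and attaching a fresh classification head. The sketch below illustrates this for a Swin-B backbone via timm; the checkpoint path and the "encoder_q." state-dict prefix are hypothetical placeholders, since they depend on how the MoCo training script saves its weights.

```python
import timm
import torch

# Fresh Swin-B classifier with a 3-class head (rock, sand, mud).
model = timm.create_model("swin_base_patch4_window7_224", num_classes=3)

# Hypothetical checkpoint produced by MoCo pre-training; the
# "encoder_q." prefix assumes the query encoder was saved as a submodule.
state = torch.load("moco_swin_b.pth", map_location="cpu")
backbone_weights = {
    k.removeprefix("encoder_q."): v
    for k, v in state.items()
    if k.startswith("encoder_q.") and not k.startswith("encoder_q.head")
}

# strict=False leaves the new classification head randomly initialized.
missing, unexpected = model.load_state_dict(backbone_weights, strict=False)
print("randomly initialized:", missing)

# Fine-tune all parameters on the labeled sonar images as usual.
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
```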
The training curves are shown in Figure 2. After self-supervised pre-training, the first-round training accuracy is close to 80%, the training process converges rapidly, accuracy exceeds 90% within 10 rounds, the best accuracy of 98.5% is reached after the 14th round, and accuracy subsequently stabilizes at about 95%. Comparing the training loss curves, the loss decreases much faster with self-supervised learning than without it, and both training processes are highly stable. Self-supervised learning uses a large number of unlabeled images to obtain an encoder with feature extraction capability; because of the large number of samples, the feature space the encoder learns is larger and richer in semantic information. The Swin-B network fine-tuned after transfer is therefore able to achieve higher classification accuracy. Overall, after MoCo pre-training on the unlabeled training set, Swin-B gains in training speed and test accuracy while maintaining training stability.

Transfer learning can provide additional learning pathways for the network when the number of training samples is too small, reducing overfitting and improving classification accuracy [7-10]. As a comparison, we performed transfer learning on Swin-B using the pre-trained Swin-B model from [11]. Comparing its training process with that of self-supervised learning in Figure 3, transfer learning converges quickly on Swin-B, but its accuracy begins to decline in the 10th round and its initial stability is poor. Owing to differences in the distribution of the training data, the network initially makes large adjustments, resulting in fluctuations [12,13]. Compared with MoCo pre-training based on self-supervised learning, the transferred network may not be able to learn additional features effectively because of these distribution differences, so the performance improvement falls short of expectations. Although transfer learning has advantages, self-supervised learning can be more effective in cases such as this.

In the experiments in Section 3, we used Swin Transformer networks of three sizes: Swin-T, Swin-S, and Swin-B. We used each of them to initialize MoCo's encoder for self-supervised pre-training and transferred the results to the downstream task. The results are shown in Table 2. The classification accuracy of both Swin-S and Swin-B improves after MoCo pre-training, with Swin-B performing best. However, the accuracy of the small Swin-T network did not change after MoCo pre-training, possibly because its smaller size limits the features it can learn and it already converges quickly, so self-supervised learning did not significantly improve its performance.

Conclusion
In this paper, the Swin Transformer is first applied to seabed substrate sonar image classification; experiments with three sizes of Swin network show that the Swin-T network has the fastest convergence and the highest classification accuracy. Second, MoCo self-supervised learning is introduced into the Swin Transformer, confirming that MoCo can effectively train the Swin encoder and give it preliminary feature extraction capability. Finally, the MoCo-trained encoder is used in the downstream seabed substrate classification task, which improves convergence speed and classification accuracy, alleviates the overfitting caused by the small training set, provides a solution to the small-sample problem, and reduces the cost of dataset labeling. Owing to limitations of time, personal ability, and experimental conditions, this work still has shortcomings: seabed substrate side-scan sonar image datasets are difficult to acquire, and this study used only one dataset for its experiments. The experiments demonstrate the feasibility of the algorithm and network to a certain extent, but analyzing and validating the approach on a variety of datasets would reveal more detailed information about the research questions and help improve the generalization ability of the algorithm.

Figure 1. Training loss curves versus training rounds for the three Swin Transformer networks.

Figure 2. Comparison of training curves with and without self-supervised learning.

Figure 3. Comparison of training curves for transfer learning and self-supervised learning.

Table 2. Comparison of validation accuracy before and after MoCo pre-training.