Small Object Detection Algorithm for Underwater Organisms based on Improved Transformer

Aiming at the problems of difficult feature extraction and missed detection of small targets, an improved Transformer-based underwater target detection algorithm is proposed. The proposed algorithm improves the linear embedding module of the Transformer to supplement the linear embedding information of small objects, and models only local relationships within the receptive field of each layer, in order to achieve end-to-end accurate detection. An experiment on the URPU underwater target detection data set is conducted. Compared with the Swin Transformer, the proposed algorithm reaches a detection accuracy of 77.4%, an improvement of 1.5%; a detection speed of 7 frame/s, an improvement of 35.6%; and a parameter count of 30.4 MB, a compression of 84.1%. Experimental results show that the improved algorithm improves the performance of underwater small target detection.

the former, but with much faster speeds; representative algorithms include the YOLO [4] and SSD [5] series. Currently, the use of deep learning methods to detect underwater targets has received much attention. For example, the YOLO structure has been streamlined and the network trained by transfer learning, overcoming the limitations of a small sample set and achieving more accurate recognition and detection of fish targets [6]; YOLOv3 networks have been improved with multi-scale training and testing combined with fine-grained features [7], realizing the detection of sonar image targets; and samples of specific target classes have been generated by introducing a generative adversarial network, alleviating the drop in detection accuracy caused by the imbalanced number of targets in the YOLO network and achieving rapid and accurate detection of underwater targets.
Underwater target detectors based on deep learning show good performance, but the anchor box is a common component of these detectors, and its use makes the network prone to producing an unbalanced number of positive and negative samples, which reduces detection accuracy. To address this problem, the proposed algorithm improves the linear embedding module of the Transformer to supplement the linear embedding information of small targets, constructs an image hierarchy in which each layer models only local relationships, and expands the receptive field in place of a CNN, achieving end-to-end accurate detection.

A. Image enhancement model based on visual Transformer
Image reconstruction can improve the resolution of small targets, but it usually loses high-frequency edge information, which has a significant impact on the detection effect. To address this issue, this experimental group fused a super-resolution reconstruction and edge enhancement model based on the CNN. For the generator, the generator model of ESRGAN is adopted, which improves generalization ability while reducing computational complexity. On this basis, this experimental group uses a Transformer block instead of a CNN dense connection block (Dense Block) to increase the feature representation capacity of the model (as shown in Figure 2). The input of the generator is the original low-resolution image, which is processed by linear embedding and several Transformer blocks to generate high-level features, and then up-sampled to obtain the high-resolution image. For the discriminator, a network with the VGG16 structure is used. The discriminator is relativistic: it predicts the probability that the real image x_r is more realistic than the intermediate pseudo-image x_f. The discriminator is calculated as

D_Ra(x_r, x_f) = σ(C(x_r) − E_{x_f}[C(x_f)])

where σ denotes the Sigmoid function, C(x) denotes the raw output of the discriminator, and E_{x_f}[·] denotes averaging over all pseudo images produced by the generator. The loss functions of the discriminator and the generator are as follows:

L_D = −E_{x_r}[log D_Ra(x_r, x_f)] − E_{x_f}[log(1 − D_Ra(x_f, x_r))]
L_G = −E_{x_r}[log(1 − D_Ra(x_r, x_f))] − E_{x_f}[log D_Ra(x_f, x_r)]
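Since the typeset equations are lost in the extracted text, the following is a minimal pure-Python sketch of the standard ESRGAN relativistic average discriminator and its symmetric losses (illustrative only, not the authors' code; scalar logits stand in for the discriminator output C(x)):

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def d_ra(c_real, c_fake_mean):
    """Relativistic average discriminator:
    D_Ra(x_r, x_f) = sigma(C(x_r) - E_{x_f}[C(x_f)])."""
    return sigmoid(c_real - c_fake_mean)

def discriminator_loss(c_reals, c_fakes):
    """L_D = -E[log D_Ra(x_r, x_f)] - E[log(1 - D_Ra(x_f, x_r))],
    with c_reals / c_fakes the raw logits C(x) on real / pseudo images."""
    mean_fake = sum(c_fakes) / len(c_fakes)
    mean_real = sum(c_reals) / len(c_reals)
    loss_real = -sum(math.log(d_ra(c, mean_fake)) for c in c_reals) / len(c_reals)
    loss_fake = -sum(math.log(1.0 - d_ra(c, mean_real)) for c in c_fakes) / len(c_fakes)
    return loss_real + loss_fake
```

The generator loss mirrors this expression with the roles of x_r and x_f swapped, which is the symmetry the next paragraph refers to.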
The losses of the generator and the discriminator are symmetric and both contain the real image x_r and the pseudo image x_f. In the construction of the edge-enhanced network (as shown in Figure 3), the intermediate super-resolution image produced by the generator first has its edge information extracted by the Laplacian operator; the result then passes through linear embedding and several Transformer blocks to extract features, with a Sigmoid activation function used to suppress edge noise. Finally, the enhanced edges are added to the input image from which the Laplacian-extracted edge information has been subtracted, yielding the final super-resolution image.
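The Laplacian edge-extraction step can be sketched as follows (a minimal pure-Python version with zero padding at the borders; the Transformer blocks that refine these edges are omitted):

```python
# 3x3 Laplacian kernel commonly used for edge extraction.
LAPLACIAN = [[0, 1, 0],
             [1, -4, 1],
             [0, 1, 0]]

def laplacian_edges(img):
    """Convolve a 2-D grayscale image (list of lists) with the Laplacian.
    Out-of-bounds neighbors are treated as zero (zero padding)."""
    h, w = len(img), len(img[0])
    out = [[0.0] * w for _ in range(h)]
    for i in range(h):
        for j in range(w):
            acc = 0.0
            for di in (-1, 0, 1):
                for dj in (-1, 0, 1):
                    ii, jj = i + di, j + dj
                    if 0 <= ii < h and 0 <= jj < w:
                        acc += LAPLACIAN[di + 1][dj + 1] * img[ii][jj]
            out[i][j] = acc
    return out
```

A flat region produces zero response in the interior, while an intensity step produces a nonzero response along the boundary, which is exactly the high-frequency information the edge-enhancement branch preserves.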
Figure 3. Edge enhancement network

B. Underwater target detection model based on visual Transformer
At present, visual Transformer target detection algorithms show strong performance. The reason is that the Transformer, unlike a CNN, does not need to stack convolution layers to move from local to global information extraction; its unique long-range dependence allows the model, from shallow to deep layers, to make good use of global information. The multi-head self-attention mechanism (MSA) ensures that the network can focus on the correlations among many image pixels, achieving a good detection effect. In contrast to CNN-based target detection algorithms, the algorithm adopted by this experimental group is a hierarchical visual Transformer detection algorithm with shifted windows, applying the recent Swin Transformer framework to the underwater small target detection task (Figure 4). The loss function of h(x) is:

ICAITA-2023
L(h, y) = (1/N) Σ_{i=1}^{N} L_cls(h(x_i), y_i)

where h(x) is the (Y+1)-dimensional estimate of the posterior distribution over categories; x_i is the network input; y_i is the category label; L_cls is the hinge loss; and N is the batch size. The label y of a preset box x is a piecewise function of the IoU threshold u:

y = g_y, if IoU(x, g) ≥ u;  y = 0, otherwise

where g is the ground-truth box and g_y is its class label. A candidate box has coordinates (b_x, b_y, b_w, b_h), and the regression loss is:

L_loc(b, g) = Σ_{i∈{x,y,w,h}} smooth_L1(b_i − g_i)
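The piecewise label assignment and the per-coordinate regression term can be sketched as follows (a minimal version assuming the commonly used smooth L1 regression term; the helper names are hypothetical, not from the paper):

```python
def smooth_l1(x, beta=1.0):
    """Smooth L1: quadratic near zero (|x| < beta), linear otherwise."""
    ax = abs(x)
    return 0.5 * ax * ax / beta if ax < beta else ax - 0.5 * beta

def box_regression_loss(b, g):
    """Sum smooth L1 over the four box coordinates (b_x, b_y, b_w, b_h)."""
    return sum(smooth_l1(bi - gi) for bi, gi in zip(b, g))

def assign_label(iou, g_class, u):
    """Piecewise label rule: keep the ground-truth class g_y when the
    proposal's IoU with g reaches threshold u, else background (0)."""
    return g_class if iou >= u else 0
```

With u = 0.5, a proposal overlapping a class-3 object at IoU 0.6 keeps label 3, while one at IoU 0.3 is assigned to background; a perfectly regressed box incurs zero localization loss.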
The experiment used the official data set of the 2018 Underwater Target Catch Competition (URPU), shown in Figure 6, containing 2901 training images and 800 test images across four target categories: scallop, sea cucumber (holothurian), starfish, and sea urchin (echinus). The images were taken in a real marine environment and exhibit low contrast, color distortion, blurred feature information, dense targets, and occlusion, all of which pose great challenges for target detection.

Figure 6. Example images

In the training set, the four target categories, scallop, sea cucumber, starfish, and sea urchin, account for 6.5%, 15.7%, 17.9%, and 60.0% of the data set, respectively. This extremely unbalanced distribution of targets poses a great challenge for network training.
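One common way to counter such imbalance is inverse-frequency class weighting in the loss; a minimal sketch using the proportions above (the paper does not state that this particular scheme was used, so this is purely illustrative):

```python
def inverse_frequency_weights(freqs):
    """Per-class loss weights proportional to 1/frequency, normalized so
    the mean weight is 1 (rare classes get weights above 1)."""
    inv = [1.0 / f for f in freqs]
    mean = sum(inv) / len(inv)
    return [w / mean for w in inv]

# Class proportions from the URPU training set:
# scallop, holothurian, starfish, echinus.
weights = inverse_frequency_weights([0.065, 0.157, 0.179, 0.60])
```

Under this scheme the rarest class (scallop, 6.5%) receives the largest weight and the dominant class (echinus, 60.0%) the smallest, so misclassifying rare targets costs the network more.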

Validity verification
To verify the validity of the proposed algorithm, it is compared with mainstream target detection algorithms on the PASCAL VOC public dataset. The training set was VOC07+VOC12 trainval, with a total of 16551 images, and the test set was the VOC07 test set, with 4952 images covering 20 target categories. The results are shown in Table 1. When the input size is 384*384, the average detection precision of the proposed algorithm reaches 78.1%, higher and faster than the two-stage object detection algorithms; when the input size is increased to 512*512, the average detection precision reaches 79.5%. The proposed algorithm thus has an advantage in object detection, which verifies the rationality of the network structure design.

Comparison between the proposed algorithm and the Swin Transformer algorithm
The proposed algorithm was compared with the Swin Transformer algorithm in three respects: accuracy, speed, and model complexity. The analyses are given in Table 2 and Table 3.
As shown in Table 2, compared with the Swin-T algorithm, the proposed algorithm slightly reduces the detection accuracy of the starfish category, but improves the detection accuracy of holothurian, echinus, and scallop by 1.5%, 1.2%, and 2.8%, respectively, with a 1.5-percentage-point mAP improvement. This indicates that the proposed algorithm has stronger feature extraction and multi-scale object detection ability and can detect various underwater targets more accurately. In terms of detection speed, the proposed algorithm is 1.35 times as fast, a clear advantage, although the repeated multi-scale feature fusion inside the network reduces inference speed to some extent. In terms of model complexity, the model size, number of parameters, and floating-point operations of the proposed algorithm are only 123.0 MB, 30.4 MB, and 3.26*10^10, respectively, far lower than most mainstream target detection models. The proposed improved network is therefore a relatively lightweight network that significantly reduces model complexity and improves inference speed; compared with other networks, it is easier to deploy on mobile and embedded devices, which is conducive to practical application. In summary, the proposed algorithm achieves higher detection accuracy and has significant advantages in detecting underwater targets.
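As a quick sanity check, the compression percentages reported against Swin-T can be recomputed from the absolute reductions in Table 3 and the proposed model's remaining sizes (a minimal sketch; all values are taken from the tables):

```python
def compression_ratio(reduction, remaining):
    """Fraction of the original size (remaining + reduction) removed."""
    return reduction / (remaining + reduction)

# Model size: reduced by 642.7 MB, 123.0 MB remaining.
size_compression = compression_ratio(642.7, 123.0)
# Parameters: reduced by 160.8 MB, 30.4 MB remaining.
param_compression = compression_ratio(160.8, 30.4)
# FLOPs: reduced by 1.319e11, 3.26e10 remaining.
flop_compression = compression_ratio(1.319e11, 3.26e10)
```

The three ratios come out to about 83.9%, 84.1%, and 80.2%, matching the figures quoted in the text.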

Conclusions
Drawing on the edge-enhanced super-resolution GAN (EESRGAN) idea, a Transformer-based image enhancement network model was established (Figure 1). It combines three parts: a super-resolution generator (G), a discriminator (D_Ra), and an edge enhancement network (EEN). From the low-resolution raw underwater image (LR), an intermediate super-resolution image (ISR) is generated by the generator, and then a super-resolution image (SR) with enhanced edges is produced. The discriminator receives the high-resolution image (HR) and the intermediate super-resolution image generated by the generator for discrimination.

Figure 1. Framework of the image enhancement model

Figure 2. Generator (G) (top) and Transformer block (bottom)

First, the input image is divided by patch partition into a set of non-overlapping patches, where each patch is 4 × 4 in size, the number of patches is H/4 × W/4, and the feature dimension is 4 × 4 × 3. The divided patches are then projected to dimension C by linear embedding and sent into several Swin Transformer blocks. Next, adjacent patches are combined 2 × 2 at a time by patch merging: the patch grid changes to H/8 × W/8 and the feature dimension to 2C. This process is repeated until the grid is H/32 × W/32 and the feature dimension is 8C. Finally, the features are sent to the regression head for target classification and localization regression.

Figure 4. Swin Transformer detection process

3.3. Overall framework for end-to-end underwater small target detection

Although Transformers with hierarchical structures and shifted windows already perform well in the detection field, this strategy does not make the best use of the Transformer for target detection, because the feature map size H/P × W/P is much smaller than the original image resolution H × W, which leads to the loss of low-level detail. Therefore, to compensate for this information loss, this experimental group used an architecture integrating an edge-enhanced and super-resolution-reconstructed visual Transformer to extract underwater image features, and then achieved accurate target detection through the regression head of Cascade Mask R-CNN, as shown in Figure 5.
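The patch-merging step described above can be sketched with NumPy: each 2 × 2 group of neighboring patches is concatenated along the channel axis, halving the spatial grid and quadrupling the channels (the subsequent linear projection 4C → 2C is omitted here):

```python
import numpy as np

def patch_merging(x):
    """Swin-style patch merging: (H, W, C) -> (H/2, W/2, 4C) by
    concatenating each 2x2 neighborhood of patches channel-wise."""
    assert x.shape[0] % 2 == 0 and x.shape[1] % 2 == 0
    return np.concatenate(
        [x[0::2, 0::2],   # top-left patch of each 2x2 group
         x[1::2, 0::2],   # bottom-left
         x[0::2, 1::2],   # top-right
         x[1::2, 1::2]],  # bottom-right
        axis=-1,
    )
```

Applying this repeatedly yields the H/8, H/16, H/32 hierarchy: a 4 × 4 grid of C-dimensional patches becomes a 2 × 2 grid of 4C-dimensional features in one step.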

Figure 7. Detection results of the two algorithms on the four target classes of the test set

Figure 7 shows the detection results of the two algorithms on the four target classes of the test set. Both algorithms can detect most targets, but the comparison algorithm has a lower detection rate for blurred and occluded targets. The proposed algorithm detects more targets, increases the detection confidence for all target classes, and produces detection boxes with smaller position deviations. The proposed algorithm therefore has more advantages than Swin Transformer-T in detecting underwater targets.

Table 1. Test results on the PASCAL VOC dataset

Table 2. Comparison of detection precision and speed

From Table 3, it can be seen that, under the same experimental settings, the model size, number of model parameters, and floating-point operations (GFLOPs) of the proposed algorithm are much lower than those of the Swin Transformer algorithm: the model size is reduced by 642.7 MB (a compression of 83.9%), the number of model parameters is reduced by 160.8 MB (a compression of 84.1%), and the floating-point operations are reduced by 1.319*10^11 (80.2%). This indicates that the algorithm in this paper reduces model complexity and is more advantageous in practical applications.

Table 3. Comparison of model complexity with the Swin Transformer-T algorithm

To address the difficulty of extracting target feature information and the missed detection of targets, an underwater image enhancement algorithm based on a GAN that fuses edge enhancement with super-resolution reconstruction, together with an underwater target detection algorithm based on a visual Transformer, is proposed. Compared with the Swin Transformer algorithm, the detection accuracy and detection speed are improved by 1.5% and 35.6%, respectively, and the model size, number of parameters, and floating-point operations are compressed by 83.9%, 84.1%, and 80.2%, respectively. While maintaining a fast detection speed, the algorithm has obvious advantages in detection precision and number of model parameters compared with other target detectors; that is, the proposed algorithm is more advantageous for detecting underwater targets.