Source localization in deep ocean based on complex convolutional neural network

To address the problem that phase information is not used effectively in underwater acoustic localization, a deep learning network based on a complex convolutional neural network is proposed in this paper. Complex convolution layers are used to exploit the phase features that are favorable to the source localization problem, and the phase information improves feature extraction. Through simulation, the performance of the network for source localization in the deep-sea direct-arrival region under different SNR conditions is analyzed. The results show that the proposed complex convolutional network can locate the sound source with less computation and performs better under low-SNR conditions.


Introduction
At present, localization methods based on deep learning [1] and deep convolutional neural networks (DCNN) [2][3] provide a new approach to underwater target localization. These methods use deep neural networks that are powerful enough to extract information from various forms of input and ultimately translate it into accurate source positions, but they simplify the processing of phase information. For example, in reference [2], the network input is the normalized sample covariance matrix of the broadband data received by a vertical line array. However, this input form means that the computation of the network increases rapidly with the number of elements. Reference [3] takes the amplitude of the sound pressure at frequency f and discards the phase information. Wide-band acoustic signals are usually transformed into the frequency domain for processing. The hydrophone signals after the Fourier transform are complex-valued, and the phase differences between array signals contain the position information of the target. Simplifying complex-valued operations into convolutions of real numbers does not make good use of phase information and has the following potential shortcomings. On the one hand, the convolutional kernels must perform complex operations in a latent way, which makes the network harder to converge; on the other hand, more convolutional kernels and a more complex network structure may be required to extract the information. Therefore, a more interpretable complex convolutional network is needed, one that expresses the meaning of its internal information processing and flow in a way that is intuitive, easy to understand, and open to further exploration.
Since 2017, complex convolution [4] has been used in computer vision to process images, and has subsequently been applied in radar imaging [5], magnetic resonance imaging (MRI) [6], and other fields; it has also shown potential in speech processing [7]. The operation of complex convolution intuitively reflects the operations involving phase. Based on the above considerations, a small localization neural network named CCONV, based on complex convolution, is designed, trained on simulation data, and evaluated for its performance.

Complex convolution
The complex matrix W can be expressed as W = A + iB, where A and B are real matrices. A complex convolution kernel can be expressed as h = x + iy, where x and y are real vectors, since complex arithmetic is simulated with real-valued entities. Because the convolution operator is distributive, convolving the vector h with the filter W yields:

W ∗ h = (A ∗ x − B ∗ y) + i(A ∗ y + B ∗ x)  (1)

Using matrix notation to represent the real and imaginary parts of the convolution operation, we have:

[ℜ(W ∗ h); ℑ(W ∗ h)] = [A, −B; B, A] [x; y]  (2)
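Equation (1) can be illustrated directly with real-valued operations. The following minimal sketch (not the paper's actual code) builds a complex 1-D convolution from four real convolutions and can be checked against NumPy's native complex convolution:

```python
import numpy as np

def complex_conv(A, B, x, y):
    """Complex convolution W * h of Eq. (1), with W = A + iB and h = x + iy,
    realized as four real-valued convolutions."""
    real = np.convolve(A, x) - np.convolve(B, y)  # A*x - B*y
    imag = np.convolve(A, y) + np.convolve(B, x)  # A*y + B*x
    return real, imag
```

Because convolution is distributive, the result matches convolving the complex-valued arrays directly; the real-valued form is what a real-arithmetic framework actually executes.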

Network Architecture
Assuming that the hydrophone array has N elements, the frequency-domain reception of the array, obtained by taking the discrete Fourier transform of the raw pressure data at each sensor, can be expressed as a complex matrix P ∈ C^(N×F), where P_{n,f} denotes the complex pressure at the n-th element and the f-th frequency. The array in this paper has 16 elements, and the source frequencies are 100–300 Hz with an increment of 1 Hz, so the input of the network is a complex matrix of size 16×200. Each element of the matrix is normalized according to its modulus.
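The normalization step can be sketched as follows. This is one plausible reading of "normalized according to their modulus" (dividing each entry by its own modulus, retaining only phase structure); the paper does not spell out the exact rule, so treat this as an assumption:

```python
import numpy as np

def normalize_input(P, eps=1e-12):
    """Divide each complex entry of the N x F pressure matrix by its modulus,
    so every entry has unit magnitude; eps guards against division by zero."""
    return P / (np.abs(P) + eps)
```

After this step, the information the network sees is carried entirely by the phase pattern across elements and frequencies.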
Architecture of the CCONV network. Source localization is treated as a classification problem. At intervals of 5 km, the detection range of 0.1 km to 20 km is divided into four intervals, namely 0.1–5 km, 5–10 km, 10–15 km, and 15–20 km, and the depth of each section is 1–200 m. Corresponding to these four intervals, four networks are trained respectively. In each interval, the range is discretized with an increment of 0.1 km and the source depth with an increment of 10 m. In this way, each range interval has a total of 50 classes in range and 20 classes in depth. The architecture of the source localization network CCONV is shown in Figure 1; it has 4 residual blocks. Each block has 2 complex convolution layers (kernel size 3×3) and 1 subsampling layer (kernel size 1×1), and the channels of the blocks are 32, 64, 128, and 256, respectively. Each complex convolution layer is followed by a complex batch normalization (CBN) layer and a ReLU layer. After the features extracted by the network pass through the complex fully connected layer (CFC), the modulus is computed. Finally, the results are classified into the range and depth of the sound source by two real fully connected layers (FC). Since the initialization of convolution kernel parameters can promote network convergence, the parameters of the complex convolution kernel h are initialized according to the method in reference [3], so that the phases of the parameters are uniformly distributed in [−π, π].
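A complex convolution layer of the kind used in each residual block can be built from two real `Conv2d` layers, applying Eq. (1) channel-wise. The PyTorch sketch below is illustrative only; the layer names and sizes are assumptions, not the exact CCONV implementation:

```python
import torch
import torch.nn as nn

class ComplexConv2d(nn.Module):
    """Complex 2-D convolution realized as two real convolutions (Eq. 1).
    Real and imaginary parts are carried as separate tensors."""
    def __init__(self, in_ch, out_ch, kernel_size, **kw):
        super().__init__()
        self.conv_re = nn.Conv2d(in_ch, out_ch, kernel_size, **kw)  # A
        self.conv_im = nn.Conv2d(in_ch, out_ch, kernel_size, **kw)  # B

    def forward(self, x_re, x_im):
        out_re = self.conv_re(x_re) - self.conv_im(x_im)  # A*x - B*y
        out_im = self.conv_re(x_im) + self.conv_im(x_re)  # A*y + B*x
        return out_re, out_im
```

Stacking such layers with complex batch norm and a final modulus gives the overall CCONV structure described above.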

Loss function
To estimate range and depth at the same time, CCONV adopts multi-task learning [8] (MTL), realized by the 2 FC layers at the end of the network. The loss function of MTL can be expressed as:

L = Σ_i λ_i L_i  (4)

where λ_i and L_i are the weight and loss of the i-th task. In this paper, the cross-entropy loss function is used for both the range and depth tasks, and the weight of each loss is set as λ_1 = λ_2 = 1.
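Equation (4) with two equally weighted cross-entropy terms can be sketched as follows (a minimal NumPy illustration, not the training code):

```python
import numpy as np

def cross_entropy(logits, target):
    """Cross-entropy for one sample from raw logits,
    using the log-sum-exp trick for numerical stability."""
    z = logits - logits.max()
    log_probs = z - np.log(np.exp(z).sum())
    return -log_probs[target]

def mtl_loss(range_logits, range_label, depth_logits, depth_label, w=(1.0, 1.0)):
    """Weighted multi-task loss of Eq. (4): L = w1*L_range + w2*L_depth."""
    return (w[0] * cross_entropy(range_logits, range_label)
            + w[1] * cross_entropy(depth_logits, depth_label))
```

With uniform logits, each term reduces to log of the class count (log 50 for range, log 20 for depth), which is a convenient sanity check on the implementation.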

Dataset and training details
BELLHOP is used to generate the simulation dataset for training CCONV and testing its performance. The simulation employs the Munk sound speed profile. The sound speed profile (SSP) of the experimental area is shown in Figure 2. The sea depth is set at 4200 m, and the 16-element vertical array spans depths from 3900 to 4012.5 m with a spacing of 7.5 m. The acoustic source emits a broadband continuous signal within the processing band of 100 to 300 Hz. To improve the anti-noise performance of the classification network, the SNR of the samples varies from −10 dB to 15 dB over 25 uniformly spaced levels. Therefore, there are 50×20×25 = 25,000 training samples for each classification network; 5% of the samples are randomly selected as the validation set. On the grid formed by depth and range, the test set contains samples at 6 SNRs of −10 dB, −5 dB, 0 dB, 5 dB, 10 dB, and 15 dB, so there are 50×20×6 = 6000 test samples. The SNR across the frequency band is defined as

SNR = 10 log₁₀( Σ_{f=1}^{F} |P_f|² / (F σ²) )  (5)

where P_f is the complex pressure at frequency f and σ² is the noise variance.
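Equation (5) also determines the noise variance needed to synthesize a sample at a target SNR. The following sketch (illustrative helper names, not the paper's code) computes the band SNR and its inverse:

```python
import numpy as np

def band_snr_db(P, sigma2):
    """SNR across the band per Eq. (5): mean signal power over the F
    frequency bins divided by the noise variance sigma^2, in dB."""
    signal_power = np.mean(np.abs(P) ** 2)
    return 10.0 * np.log10(signal_power / sigma2)

def noise_sigma2_for_snr(P, snr_db):
    """Invert Eq. (5): noise variance that yields the target SNR."""
    return np.mean(np.abs(P) ** 2) / (10.0 ** (snr_db / 10.0))
```

Sweeping `snr_db` from −10 to 15 over 25 levels and adding complex Gaussian noise with the returned variance reproduces the training-set construction described above.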
The PyTorch 1.13 deep learning framework was used for the experiments, with training conducted on an NVIDIA RTX 3090 GPU. The network parameters were updated by stochastic gradient descent (SGD) with a learning rate of 0.0002 and a batch size of 50.

Training process
Figure 3 shows the loss of CCONV during training and validation at two intervals of 0–5 km and 15–20 km. The validation loss is computed after each training epoch; after 1–6 epochs, the loss has stabilized and converges to about 0.2.
The localization error is measured by the mean absolute error of the estimates,

E_MAE = (1/M) Σ_{i=1}^{M} |ε̂_i − ε_i|  (6)

where ε̂_i and ε_i are the predicted and ground-truth parameters, for either range or depth, and M is the number of test samples. Figure 5 shows the E_MAE under low-SNR conditions. At −5 dB, the E_MAE for range and depth are reduced to 50 m and 5 m, respectively. When SNR = −10 dB, the E_MAE for range is about 0.3 km, and the E_MAE for depth is between 20 and 70 m. When predicting the range of the source, the E_MAE of CCONV is smaller than that of MFP, while when predicting the depth, the performance of the two methods is close.
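The error metric above is a plain mean absolute error over the test grid, which can be sketched in a few lines (illustrative only):

```python
import numpy as np

def emae(pred, truth):
    """Mean absolute error between predicted and ground-truth parameters
    (range in km or depth in m), matching Eq. (6)."""
    pred = np.asarray(pred, dtype=float)
    truth = np.asarray(truth, dtype=float)
    return np.mean(np.abs(pred - truth))
```

Applied separately to the range and depth predictions, this yields the two curves plotted against SNR in Figure 5.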

Floating point operations
The floating point operations (FLOPs) of CCONV are estimated and compared with those of MTL-CNN. FLOPs are often used to measure the time complexity of an algorithm or model. The FLOPs for each convolutional layer in the network are calculated as follows:

FLOPs = K_w × K_h × C_in × C_out × W × H / (S_w × S_h)  (7)

where K_w and K_h are the width and height of the convolution kernel, C_in and C_out are the numbers of input and output channels, W and H are the width and height of the input feature map, and S_w and S_h are the strides of the convolution kernel in the width and height directions, respectively. As can be seen from equation (7), the computational cost of the model depends on the size of the input. Table 1 shows the input size and FLOPs of each model, as well as the FLOPs when the array reaches 32 sensors. When the number of sensors increases from 16 to 32, the FLOPs of CCONV increase from 3.26×10⁸ to 6.52×10⁸, twice the original, while those of MTL-CNN increase from 4.65×10⁹ to 1.47×10¹¹, roughly 32 times the original.
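Equation (7) is simple enough to evaluate directly. The sketch below applies it to an illustrative layer (the parameters are examples, not the published per-layer figures):

```python
def conv_flops(k_w, k_h, c_in, c_out, width, height, s_w=1, s_h=1):
    """FLOPs of one convolutional layer per Eq. (7):
    K_w * K_h * C_in * C_out * W * H / (S_w * S_h)."""
    return k_w * k_h * c_in * c_out * width * height // (s_w * s_h)
```

Because the term W × H enters linearly, doubling the number of array elements (one input dimension) doubles the per-layer FLOPs, which is why CCONV scales linearly with the array size while a covariance-matrix input grows much faster.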

Discussion and Conclusion
In order to extract and utilize the phase information in array signals, a deep learning network based on a complex convolutional neural network is proposed in this paper. Based on BELLHOP simulation data, the performance of the localization network under different SNRs in the deep-sea direct-arrival region is analyzed. The results show that the proposed complex convolutional network can achieve sound source localization with less computation, and its range estimates are more accurate than those of MFP when SNR = −10 dB.

Figure 2. Environmental parameters used for simulations. The sea depth is 4200 m; the density ρ_b, sound speed c_b, and attenuation α_b of the bottom are 1.6 g/cm³, 1650 m/s, and 0.1 dB/λ, respectively.

Figure 3. Loss of CCONV during training and validation at two intervals of 0–5 km and 15–20 km.

Figure 4. Range and depth accuracy curves as a function of SNR. When SNR < 5 dB, the positioning accuracy begins to decline. When SNR = −5 dB, the localization accuracy for range decreases to about 80%, while the accuracy for depth decreases more slowly, remaining above 98%. As the SNR continues to decline, the accuracies for range and depth drop rapidly, reaching around 45% and 67% at SNR = −10 dB.