Ship Classification Method for Massive AIS Trajectories Based on GNN

Because criminals and maritime terrorists may tamper with AIS data, producing suspicious tracks, it is urgent to classify ships accurately and improve maritime navigation safety. Ship classification based on trajectory data can compensate for the deficiencies of traditional radar identification and optical identification, which gives it both academic significance and practical value. Target recognition based on traditional neural networks can only process conventional Euclidean-structured data, whereas the emerging graph neural network (GNN) shows great advantages in processing non-Euclidean data. Ship trajectory data has both temporal and spatial characteristics and exhibits a non-Euclidean structure; this paper therefore proposes a classification and recognition method based on the graph neural network to process ship AIS data. First, the ship trajectory data is preprocessed and converted into graph data with vertices and edges. Then a GNN is used to classify four types of ships: fishing vessels, passenger ships, oil tankers, and container ships. Finally, the results are compared with those of the SVM method. The comparison shows that the proposed method is an effective means of ship classification.


Introduction
Ship classification is widely used in both military and civilian fields, for example in detecting illegal ships, guarding against maritime terrorism, identifying spy ships disguised as civilian ships, and combating smuggling by the relevant departments [1]. At present, research in China and abroad on ship type classification is mainly based on traditional radar recognition and optical recognition, but these methods have their limitations. Optical recognition relies on video surveillance equipment, whose field of view is limited and whose operating range is short. It is easily affected by meteorological factors such as rain and fog, especially under conditions such as high humidity and low cloud at sea. Radar recognition is less affected by the environment, but it suffers from the "visible but unclear" problem and readily produces co-frequency interference clutter in a complex electromagnetic environment. In military applications, a radar is easy to detect once it is turned on, which threatens safety. By contrast, classification and identification of ships based on AIS data is less affected by the weather and can automatically identify the status of ships around the clock. The method has other advantages as well: it is not easily exposed to enemy reconnaissance, the data collection accuracy is high, and static data such as voyages and ship attributes can be collected. In summary, AIS data is of great significance to the classification and identification of ships. However, AIS data is large in volume and covers a wide area, which makes its classification and identification challenging.
Traditional research methods mainly include clustering algorithms based on the distance between track points, machine learning algorithms after manually extracting features, and classification methods

Data Preprocessing
The original data set consists of one year (2020) of AIS data for fishing vessels, passenger ships, oil tankers, and container ships in an area of the South China Sea.

Build Ship Feature Table
To construct the ship feature table, we extracted six attributes from the AIS data as the value: IMO number, timestamp, heading, speed, and the latitude and longitude of the track point. The IMO number of the ship is used as the key, so the trajectory characteristics of each ship are saved under its IMO number. The track points of each IMO number are arranged in timestamp order.

Data Cleaning
After data analysis, dirty data that meet the data cleaning conditions are discarded. Dirty data are mainly of two kinds: abnormal position data and redundant position data. Abnormal position data are adjacent track points that are too far apart given the short time interval between them. Redundant position data are adjacent track points whose features and attributes are exactly the same. The algorithm is as follows:
(1) For data with the same key, we calculate the time interval and the distance between the (i+1)-th track point and the i-th track point. To account for the curvature of the earth, we use the Haversine formula, shown in equation (1), to calculate the distance. The distance calculated by this formula is referred to as the Haversine distance, shown in equation (2).
$$\mathrm{hav}\left(\frac{l}{R}\right) = \mathrm{hav}(x_{lat2} - x_{lat1}) + \cos(x_{lat1})\cos(x_{lat2})\,\mathrm{hav}(y_{lon2} - y_{lon1}), \qquad \mathrm{hav}(\theta) = \sin^2\left(\frac{\theta}{2}\right) \tag{1}$$
$$l = 2R \arcsin\sqrt{\sin^2\left(\frac{x_{lat2} - x_{lat1}}{2}\right) + \cos(x_{lat1})\cos(x_{lat2})\sin^2\left(\frac{y_{lon2} - y_{lon1}}{2}\right)} \tag{2}$$
In equation (2), $l$ is the distance between the two track points; $R$ is the radius of the earth, generally taken as 6371 km; $x_{lat1}$ and $y_{lon1}$ are the latitude and longitude of the first point, and $x_{lat2}$ and $y_{lon2}$ are the latitude and longitude of the second point.
If the Haversine distance between two track points is too large for the short time interval between them, the (i+1)-th through n-th track points of that IMO ship are abnormal position data and are discarded, where n is the number of track points of the IMO ship within the same time window.
(2) If the (i+1)-th data point is exactly the same as the i-th data point, the (i+1)-th track point is redundant position data and is discarded.
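The two cleaning rules can be sketched as follows. This is a minimal illustration, not the authors' implementation: the text gives no numeric value for "too large", so the speed threshold `max_speed_kmh` is a hypothetical parameter, the whole track is treated as a single time window, and the dictionary layout of a track point is assumed for the example.

```python
import math

EARTH_RADIUS_KM = 6371.0  # earth radius as stated in the text


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def clean_track(points, max_speed_kmh=100.0):
    """Clean one ship's time-ordered track.

    points: list of dicts with keys t (seconds), lat, lon, heading, speed.
    max_speed_kmh is a hypothetical threshold; the text gives no value.
    """
    cleaned = []
    for p in points:
        if not cleaned:
            cleaned.append(p)
            continue
        prev = cleaned[-1]
        # Rule (2): drop a point that is identical to its predecessor.
        if p == prev:
            continue
        dt_h = (p["t"] - prev["t"]) / 3600.0
        dist = haversine_km(prev["lat"], prev["lon"], p["lat"], p["lon"])
        # Rule (1): an implausible jump discards this point and all later ones.
        # Assumption: a non-increasing timestamp is also treated as abnormal.
        if dt_h <= 0 or dist / dt_h > max_speed_kmh:
            break
        cleaned.append(p)
    return cleaned
```

Expressing the threshold as an implied speed is one way to combine "distance too large" and "time interval short" into a single test; separate distance and time thresholds would work equally well.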

The Graph
The graph is composed of vertices and edges, where edges are denoted by e and vertices by v, as shown in figure 1. Each vertex in the graph carries its own features. The vertex features can be represented by an X×Y matrix M, where X is the number of vertices and Y is the feature dimension of each vertex. The edges represent the relationships between the vertices and are represented by an X×X matrix B, called the adjacency matrix. Matrices M and B are the inputs of the graph neural network.
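As a small illustration of the two inputs, the sketch below builds M and B for a toy graph. The vertex count, feature values, and edge weights are invented for the example, and the edges are assumed to be undirected.

```python
import numpy as np

# Hypothetical toy graph with X = 4 vertices and Y = 1 feature per vertex.
M = np.array([[10.0], [12.0], [200.0], [205.0]])  # vertex feature matrix, X x Y

# Undirected weighted edges as (i, j, weight); the values are illustrative only.
edges = [(0, 1, 0.8), (2, 3, 0.6), (1, 2, 0.1)]

X = M.shape[0]
B = np.zeros((X, X))  # adjacency matrix, X x X
for i, j, w in edges:
    B[i, j] = w
    B[j, i] = w  # symmetric, because the edges are assumed to have no direction

print(M.shape, B.shape)  # prints "(4, 1) (4, 4)"
```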

Build a Sample Database of Track Data
The time interval of AIS data collection is irregular. Some adjacent track points are separated by a few seconds, and some by a few minutes or even more than ten minutes. The longer the time interval between collected points, the worse the sample quality. Therefore, the time interval threshold TT (time threshold) should be chosen appropriately and as small as possible to ensure sample reliability. The number N of track points contained in a sample segment determines the accuracy of classification and recognition: the larger N, the higher the recognition accuracy. Therefore, N should be chosen appropriately and as large as possible to ensure the validity of the sample. Generally speaking, the larger the time interval threshold, the more track points a sample segment can contain, so the reliability and the validity of the sample are in tension. It is therefore necessary to find a balance that gives the sample a larger N under a smaller TT. After testing, the experimental data used in this paper meet the requirements with TT=20 (in seconds) and N=160.
Ships with the same key may have multiple trajectory segments. One trajectory segment is one sample, and the number of track points in a sample is N. Following the sampling principle of TT=20 and N=160, trajectory samples of the four ship types are extracted and saved as a three-dimensional matrix called Sample_Matrix. The first dimension of the matrix is the number of samples, the second dimension is the number of track points per sample, and the third dimension holds the sample attributes (IMO, h, v, t-stamp, lat, lon), where h is the heading, v is the speed, t-stamp is the timestamp, lat is the latitude, and lon is the longitude.
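The sampling rule can be sketched as below. The text does not say whether samples from the same ship may overlap, so this sketch assumes non-overlapping windows; `extract_segments` is a hypothetical helper name.

```python
def extract_segments(timestamps, tt=20, n=160):
    """Return (start, end) index pairs of non-overlapping N-point segments.

    timestamps: sorted list of seconds for one ship (one key).
    A segment is kept only if every gap between adjacent points is <= tt.
    """
    segments = []
    start = 0
    for i in range(1, len(timestamps) + 1):
        # A run ends at the end of the data or where the gap exceeds TT.
        run_ended = i == len(timestamps) or timestamps[i] - timestamps[i - 1] > tt
        if run_ended:
            # Cut the run [start, i) into as many full N-point samples as fit.
            for s in range(start, i - n + 1, n):
                segments.append((s, s + n))
            start = i
    return segments
```

Each returned index pair selects one row block of track points; stacking the selected blocks gives an array with the (samples, N, attributes) layout described for Sample_Matrix.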

Build Topological Structure Graph Data
The full set of ship trajectory samples is transformed into topological graph data, with the vertices and edges described by appropriate features, and the adjacency matrix is constructed from them. The effectiveness of ship trajectory classification can be improved only by selecting appropriate feature data as the input of the neural network. Research and analysis of traditional algorithms show that heading and speed are effective features for ship classification [15]. Therefore, this paper uses heading features as vertex features and speed features as edge weights to construct the adjacency matrix. The specific process is as follows:
(1) Determine the receptive field. Calculate the mean Haversine distance $\bar{l}_{ij}$ between every pair of samples in the total sample set, as shown in equation (3).
$$\bar{l}_{ij} = \frac{1}{N}\sum_{n=1}^{N} l\left(S_i^n, S_j^n\right) \tag{3}$$
In equation (3), $l(S_i^n, S_j^n)$ is the Haversine distance between the n-th track point in trajectory $S_i$ and the n-th track point in trajectory $S_j$. A small Haversine distance indicates a strong distance-based connection between two vertices, and a large Haversine distance indicates a weak one. Sort the Haversine distances in ascending order and keep the smallest 5% to ensure strong connections between vertices; the distance at the 5% position is the threshold of relationship strength. A strong connection is represented by 1 and a weak connection by 0. The relationship matrix R, based on the strength of the spatial distance connection, is constructed in this way to determine the receptive field of each vertex, as shown in equation (4). The dimension of R is X×X, where X is the number of samples, which is also the number of vertices in the graph.
$$R_{ij} = \begin{cases} 1, & \bar{l}_{ij} \le Thr \\ 0, & \bar{l}_{ij} > Thr \end{cases} \tag{4}$$
In equation (4), $Thr$ is the threshold of relationship strength, and $\bar{l}_{ij}$ is the mean Haversine distance between samples.
(2) Define the weight of the edge. As shown in equation (5), the two-norm of the difference in average speed between any two samples is calculated from the speed features and used as the weight of the edge connecting the two vertices. This gives the edge weight matrix E, whose dimension is X×X.
$$E_{ij} = \left\| \overrightarrow{ave\_v_i} - \overrightarrow{ave\_v_j} \right\|_2 \tag{5}$$
In equation (5), $\overrightarrow{ave\_v_i}$ is the average sailing speed of trajectory $S_i$, and $\overrightarrow{ave\_v_j}$ is the average sailing speed of trajectory $S_j$.
(3) Construct the adjacency matrix. Multiply the edge weight matrix E by the relationship matrix R, which is based on the strength of the spatial distance connection, to obtain the adjacency matrix B, whose dimension is X×X, as shown in equation (6).
B is then normalized as shown in equation (7).
$$B \leftarrow \frac{B - \min(B)}{\max(B) - \min(B)} \tag{7}$$
In equation (7), $\min(B)$ is the minimum value in matrix B, and $\max(B)$ is the maximum value in matrix B.
(4) Determine the vertex features. The heading feature of the track points is extracted as the vertex feature, and the vertex feature matrix M is constructed, with dimension X×1.
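Steps (1) to (4) can be sketched with NumPy as follows. Several choices here are assumptions rather than details given in the text: the product of E and R is taken element by element so that R acts as a mask, each sample's N headings are reduced to one vertex feature by taking their mean, self-loops are excluded at this stage, and `build_graph` is a hypothetical helper name. The pairwise distance array has X×X×N entries, so the sketch is only practical for a modest number of samples.

```python
import numpy as np


def haversine_km(lat1, lon1, lat2, lon2, radius_km=6371.0):
    """Vectorised Haversine distance in km; inputs are in degrees."""
    phi1, phi2 = np.radians(lat1), np.radians(lat2)
    dphi = phi2 - phi1
    dlam = np.radians(lon2) - np.radians(lon1)
    a = np.sin(dphi / 2) ** 2 + np.cos(phi1) * np.cos(phi2) * np.sin(dlam / 2) ** 2
    return 2 * radius_km * np.arcsin(np.sqrt(a))


def build_graph(lat, lon, speed, heading, keep_fraction=0.05):
    """lat, lon, speed, heading: arrays of shape (X, N), one row per sample."""
    X = lat.shape[0]
    # Step (1): mean point-wise Haversine distance between every pair of samples.
    mean_dist = haversine_km(
        lat[:, None, :], lon[:, None, :], lat[None, :, :], lon[None, :, :]
    ).mean(axis=2)
    off_diag = mean_dist[~np.eye(X, dtype=bool)]
    thr = np.quantile(off_diag, keep_fraction)  # threshold at the closest 5%
    R = (mean_dist <= thr).astype(float)
    np.fill_diagonal(R, 0.0)  # assumption: no self-loops at this stage
    # Step (2): edge weight is the absolute difference of the average speeds.
    ave_v = speed.mean(axis=1)
    E = np.abs(ave_v[:, None] - ave_v[None, :])
    # Step (3): mask the weights with R (assumed element-wise), then min-max normalise.
    B = E * R
    span = B.max() - B.min()
    B_norm = (B - B.min()) / span if span > 0 else B
    # Step (4): vertex feature matrix of shape (X, 1); mean heading is an assumption.
    M = heading.mean(axis=1, keepdims=True)
    return B_norm, M
```

Because the average speed of a sample is a scalar, the two-norm of the difference reduces to an absolute value.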

Network Structure
The GNN structure proposed in this paper is shown in figure 2. The dots in the graph represent vertices, and different colors represent different labels. Training the graph neural network means training the weight matrix $W$ by gradient descent. The input graph features are passed through several GNN layers, and classification data with different labels are output. The relationship between the input and output of the graph neural network is given by equation (8).
$$h_i^{out} = \sigma\left(\sum_{j \in Neighbor(i)} \tilde{L}_{ij}\, h_j^{in}\, W\right) \tag{8}$$
In equation (8), $h_j^{in}$ is the feature value of vertex j in the input data, $h_i^{out}$ is the feature value of vertex i in the output data, $\sigma$ is the activation function, $j \in Neighbor(i)$ restricts vertex j to the receptive field of vertex i, $W$ is the convolution kernel of the graph convolution, and $\tilde{L}$ is the normalized Laplacian matrix, given by equation (9).
$$\tilde{L} = \tilde{A}^{-\frac{1}{2}}\, \hat{B}\, \tilde{A}^{-\frac{1}{2}} \tag{9}$$
In equation (9), $\hat{B} = B + I$, where $B$ is the normalized adjacency matrix and $I$ is the identity matrix. $\tilde{A}$ is the degree matrix of $\hat{B}$, with $\tilde{A}_{ii} = \sum_j \hat{B}_{ij}$.
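A single propagation step of this kind can be sketched in NumPy as below, assuming the symmetric normalization commonly used for graph convolutional networks. The ReLU activation and the function names are assumptions for the example; the text does not name the activation function or the number of layers.

```python
import numpy as np


def normalized_laplacian(B):
    """Return A^(-1/2) (B + I) A^(-1/2), where A is the degree matrix of B + I."""
    B_hat = B + np.eye(B.shape[0])  # add self-loops
    degree = B_hat.sum(axis=1)      # A_ii = sum_j B_hat_ij, at least 1 for non-negative B
    d_inv_sqrt = 1.0 / np.sqrt(degree)
    return B_hat * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]


def relu(x):
    return np.maximum(x, 0.0)


def gcn_layer(L_tilde, H, W, activation=relu):
    """One propagation step in matrix form: activation(L_tilde @ H @ W)."""
    return activation(L_tilde @ H @ W)
```

The matrix product sums over all vertices j, but entries of `L_tilde` outside the receptive field of vertex i are zero, so only neighbours (and the vertex itself) contribute to row i.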

Model Training
The GNN classification model is implemented in Python on top of TensorFlow and runs on the Windows 7 operating system. The training platform has a 64-core processor and 32 GB of memory. Samples are divided into a training set and a test set. After all the sample data are shuffled, 80% are used as the training set to train the network model, and 20% as the test set to verify the effectiveness of the classification model. The model uses the cross-entropy loss function and the Adam optimization algorithm, with 2000 iterations and a learning rate of 0.001. As shown in figure 3, the ship trajectory recognition method based on graph neural networks is divided into two parts: network training and network testing. The network training part feeds the constructed training-set graph data into the GNN for learning; its output is the trained graph neural network. The network testing part feeds the test-set graph data into the trained graph neural network for vertex classification and obtains the classification result of each vertex under test, which is the ship category. These results are used to test the accuracy of the model.
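The shuffled 80/20 split can be sketched as follows. Representing the split as boolean masks over the vertices is one common convention for vertex classification and is an assumption here, as are the fixed random seed and the helper name `split_vertices`.

```python
import numpy as np


def split_vertices(num_vertices, train_fraction=0.8, seed=0):
    """Shuffle vertex indices and return boolean train and test masks."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(num_vertices)
    n_train = int(round(train_fraction * num_vertices))
    train_mask = np.zeros(num_vertices, dtype=bool)
    train_mask[order[:n_train]] = True
    return train_mask, ~train_mask
```

With masks, the cross-entropy loss is evaluated only on the training vertices during training, and the evaluation metrics only on the test vertices.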

Evaluation Index
The classification of samples is measured by four indicators: accuracy, precision, recall, and F1 value. Accuracy is the proportion of correctly predicted samples among all samples, as shown in equation (10), where $N_{correct}$ is the number of samples whose category after model identification is the same as their true category, and $N$ is the total number of samples.
$$Accuracy = \frac{N_{correct}}{N} \tag{10}$$
For a given category, precision is the proportion of correctly predicted samples among all samples predicted as that category, as shown in equation (11), where $N_{correct}$ is counted for that category and $N_{pred}$ is the number of samples predicted by the model.
$$Precision = \frac{N_{correct}}{N_{pred}} \tag{11}$$
Recall is the proportion of correctly predicted samples among all samples that should have been predicted as that category, as shown in equation (12), where $N_{true}$ is the total number of samples of that category.
$$Recall = \frac{N_{correct}}{N_{true}} \tag{12}$$
The F1 value is the harmonic mean of precision and recall, as shown in equation (13).
$$F1 = \frac{2 \times Precision \times Recall}{Precision + Recall} \tag{13}$$
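The four indicators can be computed from true and predicted labels as in the sketch below. The function name and the convention of returning 0.0 when a denominator is zero are choices made for the example.

```python
def per_class_metrics(y_true, y_pred, labels):
    """Return overall accuracy and {label: (precision, recall, f1)}."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    accuracy = correct / len(y_true)
    scores = {}
    for c in labels:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        predicted = sum(p == c for p in y_pred)  # samples predicted as class c
        actual = sum(t == c for t in y_true)     # samples that truly are class c
        precision = tp / predicted if predicted else 0.0
        recall = tp / actual if actual else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        scores[c] = (precision, recall, f1)
    return accuracy, scores
```

For example, with true labels tanker, tanker, container, container and predictions tanker, container, container, container, the accuracy is 0.75, tankers have precision 1.0 and recall 0.5, and container ships have precision 2/3 and recall 1.0.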

Experimental Results
(1) As shown in tables 1 and 2, passenger ships have the highest precision and recall, both above 90%, so they are the easiest of the four ship types to identify. This may be because passenger ships sail most steadily: they generally travel at a constant, high speed and spend little time stationary during a voyage, so their characteristics are the most obvious.
(2) Fishing vessels have different movement modes according to different working conditions. For example, when a vessel is fishing, its trajectory is tortuous, its speed is low, and it is often at a berth; when it is sailing, the trajectory is smooth and the speed is high, so the trajectory characteristics are distinctive.
(3) Tankers have the lowest recall, with 25.1% of oil tankers classified as container ships. Container ships have the lowest precision, with 15.3% of container ships classified as oil tankers. This may be because oil tankers and container ships are both merchant ships. The two ship types not only sail in nearby areas but also have similar speed and heading characteristics, which makes them prone to classification errors.
(4) The precision and recall of oil tankers and container ships show opposite trends, so another parameter, the F1 value, is needed to measure them. As shown in table 3, the F1 value of passenger ships is the highest, followed by fishing vessels. The F1 values of tankers and container ships do not exceed 81%, indicating that tankers and container ships are easily confused. In addition, tankers have the lowest F1 value, indicating that oil tankers are the most difficult to identify. According to equation (10), the accuracy of the GNN ship trajectory classification method is 82.7%.

Comparative Experiment
The method in this paper is compared with ship classification based on SVM. SVM classifies high-dimensional nonlinear sample features by means of separating hypersurfaces. The comparison results are shown in table 4. The accuracy of the GNN-based ship trajectory classification method is 17.3% higher than that of the SVM-based classification method. The improvement is clear, which shows that GNN can effectively identify ship types.

Conclusion
This paper proposes a classification and recognition method for ship trajectory types based on the graph neural network (GNN). The heading feature of the track points is used as the vertex feature of the graph data. The speed feature determines the weight of the edge connecting the vertices, and the distance between track points is used as the connection strength of the edge to determine the receptive field. In this way, the adjacency matrix based on vertex features and spatiotemporal features can be