Application of Text Analysis Technology in Aviation Safety Information Analysis

This research aims to identify potential safety hazards quickly, accurately, and efficiently from the large volume of aviation safety information collected daily, and to provide a clear direction for improving safety risk control. Combining text analysis with machine learning to cluster a given type of aviation safety information by its content is an important basis for mining that information effectively. The system failure/jam/fault events collected by Chinese civil aviation in 2017 are taken as the sample. Through text preprocessing, feature extraction with logarithmic TF-IDF, and the k-means method in a Python 3.6 environment, an automatic clustering model of the sample information is established, and the results are then visualized with Multidimensional Scaling (MDS). The analysis results show that text clustering and visualization can quickly and automatically file information, identify similarities among sample information, easily pinpoint key information, and support targeted measures in the next step of risk management and control.


Introduction
In recent years, with the development of the civil aviation industry in China, the channels for obtaining aviation safety information have multiplied, and the volume of such information keeps growing. Aviation safety information is highly specialized, so in-depth analysis requires strong industry knowledge. However, the lack of analysis methods and the shortage of personnel have kept the analysis of aviation safety information at the stage of statistical analysis, and reliable, professional analysis tools are urgently needed. Faced with a large amount of aviation safety information, some methods [1][2][3] start from a specific event or event type and analyse its causes in detail, which cannot produce effective analysis results in a limited time. Domestic and foreign scholars have therefore begun to use computers to assist human analysis, mainly applying clustering algorithms, association rules, and classification algorithms to analyse aviation safety information automatically.
For example, references [4][5] build on the idea of clustering, using neural network algorithms and latent semantic indexing to classify airport noise annoyance model fuzzy sets and aviation reports effectively. Some scholars [6] used the support vector machine algorithm to cluster human factors involved in aviation safety reports and achieved positive results. Ludovic Tanguy and Nikola Tulechki [7] applied Natural Language Processing (NLP) tools to aviation safety report management, using supervised machine learning to find the cause of a single event and to identify reports similar to specified content; as a result, they realized data/text mining and active, interactive analysis of information. S. D. Robinson [8] used Latent Semantic Analysis (LSA) to identify correlations between corpora and proposed a computational method that combines macro and micro views of the data to successfully visualize the factors between [...]. Reference [9] used Bayesian machine learning methods and hierarchical structures to derive statistical estimates and prediction models of differing complexity and goals in order to identify aviation safety risks in advance. Liu [10] classified aviation safety reports with a convolutional neural network algorithm to predict aviation safety risks. This review shows that although data mining technology [11] based on traditional event analysis methods has seen preliminary application in aviation safety information analysis, most of that work relies on association rules to find potential relationships in the information. Few studies have used text clustering and visualization techniques to quickly extract the key content of the information.
Therefore, building on the existing classification of aviation safety information, this paper uses Python as a tool, clusters the information with K-means, and visualizes the results with MDS and a word cloud diagram to discover hidden safety risks in aviation safety information. This approach provides data-based directions for improving risk management.

Text Analysis
Text analysis involves text classification, text clustering, and cluster visualization. Among them, cluster analysis summarizes the relationships between data objects in a data set according to given requirements and rules [12], and thereby completes the grouping process. This paper uses key text analysis techniques in Python to analyse information automatically, including text normalization, cluster analysis, and visualization.

Text Normalization Processing Technology
Chinese word segmentation.
To provide feature input in numerical format, the text must first be cleaned and standardized. Chinese text segmentation [13] is the first step in text normalization; it breaks text data down into smaller, meaningful components. This paper uses the rule-based jieba library in accurate mode for word segmentation. Although jieba can recognize new words, a custom dictionary prevents terms such as "low oil level warning" from being forcibly split and so ensures segmentation accuracy.
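The effect of a custom dictionary can be illustrated with a small sketch. Note that this is not jieba's actual algorithm: it uses plain forward maximum matching as a stand-in, and the dictionaries, the function name `forward_max_match`, and the Chinese report fragment are all hypothetical, chosen only to show how adding a domain term keeps it from being split.

```python
def forward_max_match(text, dictionary):
    """Greedy forward maximum matching: at each position take the longest
    dictionary entry that matches, falling back to a single character."""
    max_len = max(len(word) for word in dictionary)
    tokens, i = [], 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if size == 1 or piece in dictionary:
                tokens.append(piece)
                i += size
                break
    return tokens


# Hypothetical general-purpose dictionary (assumption, for illustration only).
BASE_DICT = {"出现", "滑油", "低", "警告"}
# Hypothetical custom entry standing in for a civil aviation term list.
CUSTOM_DICT = BASE_DICT | {"滑油量低警告"}

text = "出现滑油量低警告"  # illustrative fragment: "low oil level warning appeared"
print(forward_max_match(text, BASE_DICT))    # ['出现', '滑油', '量', '低', '警告']
print(forward_max_match(text, CUSTOM_DICT))  # ['出现', '滑油量低警告']
```

With jieba itself, the analogous step would be registering domain terms before segmentation; the sketch only shows why that registration matters.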

Removing stop words.
Text data is unstructured and contains various kinds of noise, such as prepositions, punctuation, and adverbs. If this noise is not removed, it may distort the results of the subsequent cluster analysis. Therefore, after word segmentation, the stop word list is loaded and traversed to clean the segmented words, and words are added to or deleted from the list as required to minimize text noise.
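A minimal sketch of this cleaning step is given below. It assumes the text has already been segmented into tokens; the function names, the stop word entries, and the sample tokens are hypothetical and serve only to show loading a base list, adjusting it, and filtering.

```python
def build_stop_words(base, add=(), remove=()):
    """Start from a base stop word list, then add or delete entries as required."""
    return (set(base) | set(add)) - set(remove)


def remove_stop_words(tokens, stop_words):
    """Drop empty tokens and any token found in the stop word set."""
    return [t for t in tokens if t.strip() and t not in stop_words]


# Hypothetical base list and adjustments (assumptions, not the paper's list).
stop_words = build_stop_words(
    base={"the", "a", ",", "during"},
    add={"reported"},     # domain-specific noise word added to the list
    remove={"during"},    # word kept because it marks the flight phase
)

tokens = ["the", "crew", "reported", "a", "weather radar", "fault", ",", "during", "cruise"]
print(remove_stop_words(tokens, stop_words))
# ['crew', 'weather radar', 'fault', 'during', 'cruise']
```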

Feature extraction.
The purpose of feature extraction is to use text analysis techniques to select the content that best expresses the information in a text. The Vector Space Model [14] is a very useful concept and model for processing text data. Mathematically, assume there is a text D in the text vector space VS. The number of dimensions (columns) of each text is the total number of distinct terms or words across all texts in the vector space, so the vector space can be expressed as:

$$VS = \{W_1, W_2, \ldots, W_n\}$$

where n is the number of distinct words in the entire collection of texts. The text D can therefore be represented in the vector space as:

$$D = \{w_{D1}, w_{D2}, \ldots, w_{Dn}\}$$

where w_{Dn} represents the weight of the nth word in text D. This weight is a numerical value, which can be the frequency of the word in the text, the average frequency, or the TF-IDF weight. TF-IDF [15] still has certain advantages as a weighting method, and it is defined mathematically as:

$$tfidf = tf \times idf$$

where tf represents term frequency, that is, how often a word appears in a particular text, and idf represents inverse document frequency, whose logarithmic form is commonly written as:

$$idf(t) = 1 + \log\frac{C}{1 + df(t)}$$

where idf(t) represents the idf of the word t, C represents the total number of texts in the database, and df(t) represents the number of texts containing the word t. The value of idf(t) shows that words occurring frequently across the database are not necessarily valuable; a word that occurs in few texts of the database but often within specific texts may carry useful information. The final TF-IDF metric used in this paper is the normalized tf-idf, obtained by dividing the tf-idf matrix by its L2 norm. The L2 norm, also known as the Euclidean norm, is the square root of the sum of the squared tf-idf weights of each word, which gives:

$$tfidf = \frac{tfidf}{\lVert tfidf \rVert}$$

Cluster analysis.
The K-means algorithm [16] is a centroid-based clustering model. It is simple and scales to large amounts of data, and is therefore widely used.
The basic idea of the algorithm is as follows. Suppose there is a data set X with N data points or samples. The N data points are separated into K disjoint clusters C_k, each of which can be represented by the mean of the samples in the cluster. These means are the centroids μ_k of the clusters; they are not required to coincide with actual data points among the N samples of X. The algorithm selects the centroids and constructs the clustering model by minimizing inertia, the within-cluster sum of squared distances:

$$\min \sum_{k=1}^{K} \sum_{x_n \in C_k} \lVert x_n - \mu_k \rVert^2$$

The optimization over clusters and centroids proceeds as follows:
Step 1: Select K random samples from data set X as the K initial centroids;
Step 2: Update the clusters by assigning each data point or sample to its nearest centroid:

$$C_k = \{ x_n : \lVert x_n - \mu_k \rVert \le \lVert x_n - \mu_l \rVert \ \text{for all } l \}$$

Step 3: Recalculate the centroid of each cluster from the data points assigned to it in Step 2:

$$\mu_k = \frac{1}{|C_k|} \sum_{x_n \in C_k} x_n$$

Steps 2 and 3 are repeated until the cluster assignments and centroids no longer change, at which point the clustering has converged.
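The three steps can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions, not the implementation used in the paper (which relies on scikit-learn): the function names, the fixed random seed, and the toy two-dimensional points are hypothetical.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def k_means(points, k, seed=0, max_iter=100):
    """Return (labels, centroids) after iterating assignment and update steps."""
    rng = random.Random(seed)
    # Step 1: pick K distinct samples as the initial centroids.
    centroids = [list(p) for p in rng.sample(points, k)]
    labels = None
    for _ in range(max_iter):
        # Step 2: assign every point to its nearest centroid.
        new_labels = [
            min(range(k), key=lambda j: squared_distance(p, centroids[j]))
            for p in points
        ]
        if new_labels == labels:  # assignments stable: converged
            break
        labels = new_labels
        # Step 3: move each centroid to the mean of its assigned points.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # keep the old centroid if a cluster is empty
                centroids[j] = [sum(c) / len(members) for c in zip(*members)]
    return labels, centroids


# Toy data (assumption): two well-separated groups of points.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
labels, centroids = k_means(points, k=2)
print(labels)
print(centroids)
```

Because the initial centroids are sampled at random, the numeric cluster labels can swap between seeds, while the grouping itself stays the same for data this well separated.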

Cluster visualization.
The clustered information needs to be visualized so that it can be displayed intuitively in low-dimensional coordinates, the overall distribution of similar events can be grasped, and the research objects can be classified visually. Aviation safety information is unstructured, so Multidimensional Scaling (MDS) is well suited for non-linear dimensionality reduction. MDS works from a distance matrix holding the pairwise distances between data points, computed with a measure such as cosine similarity [17] or Euclidean distance [18]. It seeks a low-dimensional representation of the high-dimensional feature vectors that preserves these distances, and the result is plotted with matplotlib.
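The idea of turning a distance matrix into low-dimensional coordinates can be sketched with classical (Torgerson) MDS, which has a closed-form solution via eigendecomposition. This is a swapped-in variant chosen for brevity, not necessarily the MDS procedure used in the paper; the function name `classical_mds` and the sample points are assumptions.

```python
import numpy as np


def classical_mds(dist, n_components=2):
    """Embed points in n_components dimensions from a pairwise distance matrix."""
    d = np.asarray(dist, dtype=float)
    n = d.shape[0]
    # Double-centre the squared distances to obtain a Gram (inner product) matrix.
    centring = np.eye(n) - np.ones((n, n)) / n
    gram = -0.5 * centring @ (d ** 2) @ centring
    # eigh returns eigenvalues in ascending order; keep the largest ones.
    eigvals, eigvecs = np.linalg.eigh(gram)
    order = np.argsort(eigvals)[::-1][:n_components]
    scale = np.sqrt(np.clip(eigvals[order], 0.0, None))
    return eigvecs[:, order] * scale


# Toy check (assumption): four points on a 3-by-4 rectangle.
points = np.array([[0.0, 0.0], [3.0, 0.0], [0.0, 4.0], [3.0, 4.0]])
dist = np.linalg.norm(points[:, None] - points[None, :], axis=-1)
coords = classical_mds(dist)
print(coords.shape)  # (4, 2)
```

The recovered coordinates may be rotated or reflected relative to the originals, since only the pairwise distances are preserved. The resulting two columns are what would be passed to a scatter plot.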

Sample Analysis
This paper selects the information on domestic unsafe events in 2017. Mechanical failures are highly objective, so the reported information on them is more truthful and reliable. The system failure/jam/fault event type, which accounts for the largest number of mechanically caused events in 2017, is therefore taken as the sample; 931 pieces of sample information are sorted out and their text features extracted. Finally, information clustering and visualization are carried out quickly.

Information Preprocessing
Based on the characteristics of the research object, a dictionary of civil aviation terms is loaded, word segmentation is performed with the jieba library in Python, and the stop word list is loaded to delete invalid words. This yields 3757 words, some of which are shown in Table 1. Features denotes the feature terms and Index denotes the position of each feature term; Table 1 also shows the type and character size of the feature words. TF-IDF is used to calculate the weight of each feature word, and Table 2 shows the weights of "weather radar" in different pieces of information.
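How a single term such as "weather radar" ends up with different weights in different pieces of information can be sketched as below. The sketch assumes the smoothed logarithmic form idf(t) = 1 + log(C / (1 + df(t))) followed by L2 normalization of each text's weight vector; the function name and the three tokenized records are hypothetical and are not the paper's data.

```python
import math
from collections import Counter


def tfidf_matrix(docs):
    """Return (vocabulary, rows) with one L2-normalized TF-IDF row per text."""
    vocabulary = sorted({term for doc in docs for term in doc})
    n_docs = len(docs)
    # df(t): number of texts that contain the term at least once.
    df = Counter(term for doc in docs for term in set(doc))
    # Assumed smoothed logarithmic idf (see lead-in).
    idf = {t: 1.0 + math.log(n_docs / (1.0 + df[t])) for t in vocabulary}
    rows = []
    for doc in docs:
        tf = Counter(doc)
        weights = [tf[t] * idf[t] for t in vocabulary]
        norm = math.sqrt(sum(w * w for w in weights))
        rows.append([w / norm if norm else 0.0 for w in weights])
    return vocabulary, rows


# Hypothetical tokenized records (assumption, for illustration only).
docs = [
    ["weather radar", "fault", "cruise"],
    ["flap", "fault", "approach"],
    ["weather radar", "weather radar", "fault", "replaced"],
]
vocabulary, rows = tfidf_matrix(docs)
col = vocabulary.index("weather radar")
for i, row in enumerate(rows):
    print(i, round(row[col], 3))  # weight of "weather radar" in each record
```

A term that appears in every record, such as "fault" here, receives a lower weight than a term of equal frequency that appears in fewer records.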

Cluster Analysis
The core idea of cluster analysis is to take the feature matrix as input and divide the information into groups based on the distances between pieces of information. KMeans is imported from the sklearn.cluster module of the Python library and the parameter num_cluster is set to 13, that is, there are 13 cluster labels; the clustering results are shown in Table 3. Table 3 shows that each cluster label is assigned a set of records, so the information can be quickly organized and archived. The key features of each label convey its key content and can be used to retrieve the detailed information. For example, cluster 1 contains 86 pieces of information, and its extracted key features are "flaps", "approach", "position", "trailing edge", "sensor", etc. From these key features it can be quickly grasped that, within the system failure/jam/fault event type, aircraft are prone to flap asymmetry or flap jam events during the approach phase, and that the flap position sensor needs to be replaced after the flight. Therefore, according to the key features of the cluster label, safety management or maintenance personnel should pay attention to checking the flap position sensors, identify ways of eliminating the safety risk in advance, and put preventive measures in place in time.
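One plausible way to obtain the key features of each cluster label is to rank the vocabulary by the weights in each cluster centroid, as sketched below. In scikit-learn the fitted centroids are available as the `cluster_centers_` attribute of a `KMeans` object; here the centroid values, the vocabulary, and the function name `top_terms_per_cluster` are made up for illustration and do not reproduce Table 3.

```python
def top_terms_per_cluster(centroids, vocabulary, n_terms=5):
    """For each centroid, return the n_terms vocabulary entries with the largest weights."""
    summary = {}
    for label, centroid in enumerate(centroids):
        ranked = sorted(range(len(vocabulary)), key=lambda i: centroid[i], reverse=True)
        summary[label] = [vocabulary[i] for i in ranked[:n_terms]]
    return summary


# Hypothetical vocabulary and centroid weights (assumptions, not the paper's results).
vocabulary = ["approach", "flap", "sensor", "TCAS", "cruise"]
centroids = [
    [0.4, 0.7, 0.5, 0.0, 0.0],  # a flap-related cluster
    [0.0, 0.0, 0.1, 0.8, 0.3],  # a TCAS-related cluster
]
print(top_terms_per_cluster(centroids, vocabulary, n_terms=3))
# {0: ['flap', 'sensor', 'approach'], 1: ['TCAS', 'cruise', 'sensor']}
```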

Cluster Visualization
Because aviation safety information describes events in text form, and cosine distance is better suited to distinguishing how similar or different texts are based on their content, cosine distance is used to build the distance matrix for cluster visualization. The MDS-based visualization results produced with Python are shown in Figure 1.
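The distance matrix fed to MDS can be sketched as follows, taking cosine distance as one minus cosine similarity. The function names and the toy weight vectors are assumptions; treating a zero vector as maximally distant is a convention chosen here, not something stated in the paper.

```python
import math


def cosine_distance(u, v):
    """1 - cosine similarity; returns 1.0 if either vector is all zeros (convention)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 1.0
    return 1.0 - dot / (norm_u * norm_v)


def distance_matrix(vectors):
    """Pairwise cosine distances between all feature vectors."""
    return [[cosine_distance(u, v) for v in vectors] for u in vectors]


# Toy TF-IDF-like vectors (assumption): the first two share terms, the third does not.
vectors = [[0.8, 0.6, 0.0], [0.6, 0.8, 0.0], [0.0, 0.0, 1.0]]
for row in distance_matrix(vectors):
    print([round(x, 3) for x in row])
```

Cosine distance depends only on the direction of the weight vectors, so two reports with the same mix of terms are close even if one is much longer than the other.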

Figure 1. Clustering visualization of sample information
Figure 1 shows 13 cluster labels in total, each with its own colour and symbol, identified in the legend box. Each cluster is described by its topic, which defines the cluster and allows each piece of information to be retrieved quickly. For example, in the MDS-based visualization, the information related to the "TCAS" fault is displayed with a red circular cluster label, and the information related to the "landing gear" fault with a yellow pentagonal cluster label. Figure 1 shows that information with inherent relationships is linked together through visualization and that the similar features of the sample information are displayed graphically, which is more intuitive and achieves visual classification of aviation safety information.

To fully understand the function of cluster visualization and its practical application, the information in one cluster label is analysed in detail to sort out the differences between the pieces of information. In the Python environment, the cluster visualization drawn with matplotlib allows the target area to be enlarged, so that the specific information contained in a cluster label, and other information close to it, can be seen clearly. For example, cluster label 6, whose key feature is "TCAS" (the red circular area in Figure 1), is enlarged, and the result is shown in Figure 2.

Figure 2. Local visualization of clustering label related to "TCAS"
Figure 2 not only shows how frequently a specific aircraft system component fails, but also allows the correlations between pieces of information to be mined from the distances between them. Information under the same cluster label mostly relates to the same system fault, but the attributes of that fault differ.
As Figure 2 shows, the information mainly describes TCAS faults, but these may be caused by problems with the TCAS computer, the antenna, or a circuit breaker. Therefore, from the coordinates of the information in the low-dimensional space, the analyst can visually discern the gaps between pieces of safety information, retrieve and trace the more similar ones, and comprehensively identify the fault attributes of the different components involved in similar systems. This supports the identification of multi-level fault risks within a given system.

The Micro Word Cloud online analysis tool is then used to visualize the cluster label information as a keyword cloud (keyword rendering), which filters out further invalid words, so that the key theme of the information can be taken in at a glance during analysis. The main features of the information content in the red circular cluster label in Figure 2 are shown above, and the corresponding word cloud is shown in Figure 3.

Figure 3. Word cloud of clustering label related to "TCAS"
Figure 3 shows that words such as "TCAS", "RVSM", and "cruise" are relatively large and prominently placed, which indicates that these words occur in many cases in the information set and are of great importance. The visualization makes it quick and easy to grasp the key theme of a cluster label, helps the analyst or manager pinpoint key items, and improves the efficiency of information analysis. There are also words related to flight phase and region, such as "taxi", "cruise", "Chongqing", and "Shenzhen". This suggests that TCAS failures may result from a combination of factors, that the system fault information is not described specifically enough to support in-depth analysis of the failure, and that more detailed event reports need to be obtained.
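The sizing of words in a word cloud is typically driven by term frequency within the cluster. The sketch below shows that counting step only; it does not reproduce the online tool used above, and the function names, the linear size scaling, and the sample records are all hypothetical.

```python
from collections import Counter


def term_frequencies(cluster_docs, top_n=10):
    """Count how often each term occurs across the tokenized records of one cluster."""
    counts = Counter(term for doc in cluster_docs for term in doc)
    return counts.most_common(top_n)


def font_sizes(frequencies, min_size=10, max_size=48):
    """Scale counts linearly so the most frequent term gets max_size (assumed scheme)."""
    top = max(count for _, count in frequencies)
    return {term: min_size + (max_size - min_size) * count / top
            for term, count in frequencies}


# Hypothetical tokenized records from a TCAS-related cluster (not the paper's data).
cluster_docs = [
    ["TCAS", "fault", "cruise"],
    ["TCAS", "RVSM", "cruise"],
    ["TCAS", "antenna", "taxi"],
]
freqs = term_frequencies(cluster_docs, top_n=2)
print(freqs)              # [('TCAS', 3), ('cruise', 2)]
print(font_sizes(freqs))
```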