Research on Clustering Application Based on Stock Association Network

This article uses stock market data and text information to construct complex association networks of stock information, and compares the clustering performance of three community discovery algorithms on two networks in order to discuss how to classify stocks and give investors a better reference. First, the heterogeneous data are acquired and preprocessed into structured data. Then, association networks are built based on the similarity of stock price fluctuations and on the correlation of stock text information. Finally, three different cluster analysis algorithms are applied to the stock association network and the text similarity network to compare the effects of the different algorithms on stock classification.

Other non-overlapping algorithms also perform well in complex network community discovery.

Calculate Time Series Similarity of Stocks Based on Dynamic Time Warping (DTW)
Suppose $X = (x_1, \dots, x_n)$ and $Y = (y_1, \dots, y_m)$ are two time series of length n and m. The points of the two series are used to construct an $n \times m$ matrix M, in which $d(x_i, y_j)$ expresses the distance between $x_i$ and $y_j$. The DTW algorithm finds a warping path through this distance matrix that minimizes the cumulative cost between X and Y, and calculates the total cost of this path, as shown in equation 1:

$$D(i, j) = d(x_i, y_j) + \min\{D(i-1, j),\ D(i, j-1),\ D(i-1, j-1)\} \qquad (1)$$

In addition, the z-score is used to standardize the time series data and eliminate the impact of dimension. Equation 2 is the normalization formula, where x represents a sample, and μ and σ represent the mean and standard deviation of the samples:

$$x' = \frac{x - \mu}{\sigma} \qquad (2)$$

After the DTW distance between any two stocks is determined, equation 3 converts the DTW distance into a correlation coefficient. Finally, the $N \times N$ symmetric correlation coefficient matrix C, where N is the number of stocks, is calculated as shown in equation 4.
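The standardization and DTW steps described above can be sketched as follows. This is a minimal illustration under stated assumptions rather than the implementation used in the experiments: the point distance is taken to be the absolute difference, the population standard deviation is used for the z-score, and the function names `zscore`, `dtw_distance` and `dtw_matrix` are hypothetical. The conversion from distance to correlation coefficient (equation 3) is not reproduced here.

```python
from statistics import mean, pstdev


def zscore(series):
    """Standardize a series to zero mean and unit standard deviation."""
    mu, sigma = mean(series), pstdev(series)
    if sigma == 0:
        # A constant series carries no fluctuation information.
        return [0.0 for _ in series]
    return [(x - mu) / sigma for x in series]


def dtw_distance(x, y):
    """Total cost of the cheapest warping path between series x and y."""
    n, m = len(x), len(y)
    inf = float("inf")
    # D[i][j] is the minimal cumulative cost of aligning x[:i] with y[:j].
    D = [[inf] * (m + 1) for _ in range(n + 1)]
    D[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(x[i - 1] - y[j - 1])  # assumed point distance
            D[i][j] = cost + min(D[i - 1][j], D[i][j - 1], D[i - 1][j - 1])
    return D[n][m]


def dtw_matrix(series_list):
    """Symmetric matrix of pairwise DTW distances between z-scored series."""
    z = [zscore(s) for s in series_list]
    k = len(z)
    dist = [[0.0] * k for _ in range(k)]
    for a in range(k):
        for b in range(a + 1, k):
            dist[a][b] = dist[b][a] = dtw_distance(z[a], z[b])
    return dist
```

Because the series are z-scored first, two price series that differ only in scale (for example `[1, 2, 3]` and `[10, 20, 30]`) end up with a DTW distance of approximately zero.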

Generate Text Similarity Network Based on Text Information Similarity
Softmax is often used for classification problems in neural networks. Its expression is shown in equation 5, and equation 6 is used for the classification calculation.
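For reference, the standard softmax function can be sketched as below. This shows only the textbook definition, which maps a list of scores to probabilities that sum to one; how it is wired into the classification step of equation 6 is not shown, and the function name `softmax` is an illustrative choice.

```python
import math


def softmax(scores):
    """Map raw scores to a probability distribution."""
    peak = max(scores)
    # Subtracting the maximum avoids overflow and leaves the result unchanged.
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]
```

The class with the largest score receives the largest probability, so taking the index of the maximum output gives the predicted class.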
In practice, a 200-dimensional word vector is constructed for each word with word2vec. Calculating the cosine similarity between word vectors gives a quantized similarity that determines the similarity relationship between them. A symmetric matrix D is used to describe the degree of similarity between the text information of the stocks.
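The construction of the symmetric matrix D can be sketched as follows. This sketch assumes that each stock is already represented by a single vector (for example, an aggregate of its word2vec word vectors; the aggregation itself is not specified in the text), and the names `cosine_similarity` and `similarity_matrix` are hypothetical.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # assumed convention for an all-zero vector
    return dot / (norm_u * norm_v)


def similarity_matrix(vectors):
    """Symmetric matrix of pairwise cosine similarities."""
    k = len(vectors)
    # The diagonal is set to 1.0: every item is fully similar to itself.
    D = [[1.0] * k for _ in range(k)]
    for i in range(k):
        for j in range(i + 1, k):
            D[i][j] = D[j][i] = cosine_similarity(vectors[i], vectors[j])
    return D
```

In the setting described above the input vectors would have 200 dimensions; the code does not depend on the dimension.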

Use Three Community Discovery Algorithms to Perform Cluster Analysis on Networks
The Newman fast algorithm uses the modularity value Q as the quantitative target for measuring the quality of a community division. It initially treats each node of the complex network as a community and then iteratively merges the two communities whose merger yields the largest value of Q. At the end of the iteration, the community partition with the largest modularity value is selected as the result. The algorithm is as follows:
(1) Initially set each node of the network as a community, n communities in total. The initial values of $e_{ij}$ and $a_i$ are as follows: $e_{ij} = 1/(2m)$ if nodes i and j are connected and $e_{ij} = 0$ otherwise, and $a_i = k_i/(2m)$, where $k_i$ is the degree of node i and m is the number of edges;
(2) Merge pairs of connected communities and calculate the modularity increment after each merger;
(3) Repeat step (2) and continue to merge until everything is merged into a single community.

Louvain's fast algorithm is divided into two phases. In the first, it traverses the network nodes and merges each node into the community of one of its neighbors, so that the modularity value gradually increases until all nodes are stable. In the second, it reconstructs the network by merging the communities into super nodes. The algorithm is as follows:
(1) Initialize each node as its own community;
(2) Move each node into a community adjacent to it and calculate the modularity, then evaluate the difference ∆Q between the modularity before and after the move. If ∆Q is positive, accept the move; otherwise, reject it;
(3) Repeat step (2) until the modularity no longer increases;
(4) Construct a new graph in which each node represents a community obtained in step (3), and continue to perform (2) and (3) until the community structure no longer changes.

In the Louvain steps above, the modularity increment ∆Q is calculated as shown in equation 9:

$$\Delta Q = \left[\frac{\Sigma_{in} + k_{i,in}}{2m} - \left(\frac{\Sigma_{tot} + k_i}{2m}\right)^2\right] - \left[\frac{\Sigma_{in}}{2m} - \left(\frac{\Sigma_{tot}}{2m}\right)^2 - \left(\frac{k_i}{2m}\right)^2\right] \qquad (9)$$

where $\Sigma_{in}$ and $\Sigma_{tot}$ represent the sum of all edge weights within community c and the sum of all edge weights connected to nodes in community c, respectively; $k_i$ is the total weight of all edges connected to node i; $k_{i,in}$ is the total weight of the edges connecting i to nodes in community c; and m is the sum of all edge weights in the network graph.
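The greedy merging procedure of the Newman fast algorithm can be sketched as follows. This is an illustrative sketch, not the code used in the experiments. It assumes an undirected weighted graph without self-loops, node identifiers that can be ordered (integers or strings), and the standard merge increment ∆Q = 2(e_ij − a_i a_j) from Newman's method; the function name `newman_fast` is hypothetical. Nodes that appear in no edge are ignored.

```python
def newman_fast(edges):
    """Greedy modularity merging; returns (best_partition, best_q).

    edges: iterable of (u, v, weight) tuples with u != v.
    """
    edges = list(edges)
    two_m = 2.0 * sum(w for _, _, w in edges)
    # e[i][j]: fraction of edge-end weight joining communities i and j.
    e = {}
    for u, v, w in edges:
        e.setdefault(u, {})
        e.setdefault(v, {})
        e[u][v] = e[u].get(v, 0.0) + w / two_m
        e[v][u] = e[v].get(u, 0.0) + w / two_m
    # a[i]: fraction of edge-end weight attached to community i.
    a = {i: sum(row.values()) for i, row in e.items()}
    members = {i: [i] for i in e}
    # With every node alone and no self-loops, Q = -sum(a_i^2).
    q = -sum(x * x for x in a.values())
    best_q, best = q, [list(c) for c in members.values()]
    while len(members) > 1:
        # Modularity increment for every pair of connected communities.
        candidates = [(2.0 * (w - a[i] * a[j]), i, j)
                      for i in e for j, w in e[i].items() if i < j]
        if not candidates:
            break  # remaining communities are disconnected
        dq, i, j = max(candidates)
        # Merge community j into community i.
        for k, w in e.pop(j).items():
            del e[k][j]
            if k != i:
                e[i][k] = e[i].get(k, 0.0) + w
                e[k][i] = e[k].get(i, 0.0) + w
        a[i] += a.pop(j)
        members[i] += members.pop(j)
        q += dq
        if q > best_q:
            best_q, best = q, [sorted(c) for c in members.values()]
    return best, best_q
```

The loop keeps merging until one community remains, as in step (3), and returns the intermediate partition with the largest modularity. On two triangles joined by a single edge, for instance, it recovers the two triangles as communities.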
The LDA topic model can also be used for community discovery. Here both documents and words are drawn from the set of stock nodes, and each stock node is used as a document. G = (V, E) represents the association network graph, where $V = \{v_1, v_2, v_3, \dots, v_m\}$ and $E = \{e_1, e_2, e_3, \dots, e_n\}$ represent the sets of nodes and edges. $\vec{w}_i$ and $w_{ij}$ represent the neighbor nodes of stock node $v_i$ and its j-th neighbor node, respectively. The network interaction model can be expressed as in equation 10. Under this model, the node corresponding to the model's SRI can be regarded as a document, and the nodes adjacent to the stock node become the words of the document. $SRW_{i,j}$ is the interaction intensity between stock node $v_i$ and its neighbor $w_{ij}$, as defined in equation 11.
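The mapping from the network to an LDA corpus can be sketched as follows. Only the corpus construction is shown: each node becomes a document whose words are its neighbor nodes. The sketch assumes an unweighted edge list, so each neighbor appears once; the interaction intensity of equation 11 is not modeled, and the function name `network_to_corpus` is hypothetical. The resulting documents could then be passed to any LDA implementation.

```python
from collections import defaultdict


def network_to_corpus(edges):
    """Turn an undirected edge list into one 'document' per node.

    The words of a node's document are the identifiers of its neighbors.
    """
    docs = defaultdict(list)
    for u, v in edges:
        docs[u].append(v)
        docs[v].append(u)
    return {node: sorted(words) for node, words in docs.items()}
```

For example, `network_to_corpus([("A", "B"), ("A", "C")])` gives the document `["B", "C"]` for node `"A"` and the document `["A"]` for each of the other two nodes.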

Evaluation Criteria
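Modularity: Equation 12 is the modified formula for calculating the weighted modularity, where $\Sigma_{in}$ and $\Sigma_{tot}$ are the sum of the weights of all the edges within community c and the sum of the weights of all the edges connected to its nodes, and m is the sum of all edge weights in the network:

$$Q = \sum_{c} \left[\frac{\Sigma_{in}}{2m} - \left(\frac{\Sigma_{tot}}{2m}\right)^2\right] \qquad (12)$$

The larger the modularity value of a community structure, the better the community division and the denser the links within each community.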
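The weighted modularity of a given partition can be computed along these lines. This is a sketch under one explicit assumption about the convention: the within-community total counts every internal edge once from each endpoint (that is, as a sum over the adjacency matrix), so that the result agrees with the usual definition of modularity. The function name `weighted_modularity` and the `community_of` mapping are hypothetical.

```python
def weighted_modularity(edges, community_of):
    """Weighted modularity Q of a partition.

    edges: iterable of (u, v, weight); community_of: node -> community label.
    """
    edges = list(edges)
    two_m = 2.0 * sum(w for _, _, w in edges)
    sigma_in = {}   # internal weight, counted from both endpoints
    sigma_tot = {}  # total weight attached to the nodes of a community
    for u, v, w in edges:
        cu, cv = community_of[u], community_of[v]
        sigma_tot[cu] = sigma_tot.get(cu, 0.0) + w
        sigma_tot[cv] = sigma_tot.get(cv, 0.0) + w
        if cu == cv:
            sigma_in[cu] = sigma_in.get(cu, 0.0) + 2.0 * w
    return sum(sigma_in.get(c, 0.0) / two_m - (sigma_tot[c] / two_m) ** 2
               for c in sigma_tot)
```

Putting all nodes into a single community gives Q = 0 under this definition, which is a convenient sanity check.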
Average correlation coefficient: According to equation 13, the correlation degree Φ of all the stocks in community $c_i$ is measured as the mean value of the price-fluctuation correlations among the stocks in community $c_i$. Equation 14 is the average of the correlation degrees of all communities in the whole community division $C = (c_1, c_2, \dots, c_m)$; it is the overall correlation index for evaluating the division produced by an algorithm, where m is the number of communities obtained by the community division.
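The two averages described above can be sketched as follows. The sketch assumes that the within-community mean is taken over unordered pairs of distinct stocks and that a community with a single stock contributes 0; neither detail is stated in the text. The names `community_correlation` and `average_correlation` are hypothetical, and `corr` stands for the correlation coefficient matrix indexed by stock.

```python
def community_correlation(corr, members):
    """Mean pairwise correlation among the stocks of one community."""
    pairs = [(i, j) for pos, i in enumerate(members)
             for j in members[pos + 1:]]
    if not pairs:
        return 0.0  # assumed value for a single-stock community
    return sum(corr[i][j] for i, j in pairs) / len(pairs)


def average_correlation(corr, communities):
    """Mean of the community correlation degrees over a whole division."""
    return (sum(community_correlation(corr, c) for c in communities)
            / len(communities))
```

Applied to the same correlation matrix, this index lets the divisions produced by different algorithms, or a traditional industry classification, be compared on a common scale.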
Industry coverage: Equation 15 is the coverage calculation formula, where $c_i$ and $s_j$ represent the stock communities obtained by the community discovery algorithm and the traditional industry plates, respectively.

Conclusions
The modularity, average correlation, and industry coverage of the three algorithms for the clustering analysis of the n1 network are shown in Table 1 and Figure 1.

In this chapter, the Newman fast algorithm, the Louvain fast algorithm, and the LDA topic model community discovery algorithm are used to divide the stock association networks into clusters and analyze them. According to the experimental results, the LDA topic model community discovery algorithm obtains a higher correlation than the Newman fast algorithm and the Louvain fast algorithm. The clustering results obtained by the community discovery algorithms are also compared with the traditional plate classification, and the experiments show that the average correlation coefficients of the clusterings produced by these algorithms are higher than that of the plate division.