Prediction of protein functional modules using deep learning and the PPI network

Since the outbreak of the novel coronavirus in 2020, bioinformatics has become a new focus of development, and within it the prediction of protein functional modules has become a research hotspot. Existing algorithms for predicting protein functional modules suffer from low accuracy and identify few functional modules. This paper proposes IIMP, an algorithm that predicts protein functional modules by combining deep learning with the PPI network. First, three important attributes of the PPI network (node location, PPI network structure and core nodes) are integrated. An improved density algorithm is then used for clustering analysis, followed by an improved isometric mapping dimensionality reduction algorithm for principal component analysis. Finally, an MLP is trained to obtain the protein functional modules. Accuracy, F value, dimensionality reduction rate and the number of protein functional modules are compared against three other PPI-network-based algorithms: acc-fmd, MCL and MCODE. The experimental results show that the IIMP algorithm outperforms the other algorithms in accuracy, F value, dimensionality reduction rate and number of recognized modules.


1. Introduction
In recent years, and especially since the outbreak of the novel coronavirus in 2020, bioinformatics has become a new focus of development. Within it, the prediction of protein functional modules has become a research hotspot. Early approaches predicted unknown protein functional modules on the principle of protein sequence similarity. Sequence similarity is detected with tools such as BLAST (basic local alignment search tool) [1] and PSI-BLAST (position-specific iterated BLAST) [2]. If the sequence similarity is high, the protein is considered similar to the known protein functional module; if the similarity is low, it is considered not similar. However, recent studies have shown that predicting protein functional modules by sequence similarity is not rigorous [3]: sequences with low similarity can still share the same functional modules. The use of protein-protein interaction (PPI) networks to predict protein functional modules has been recognized by many researchers, and a large number of protein functional modules have been mined in this way. As protein three-dimensional structure databases become richer, FATCAT (functional and tracer connectivity analysis toolbox) [4] and PAST (polypeptide angle suffix tree) [5] have become new hot spots for researchers predicting protein functional modules. PPI networks can be used to predict protein functional modules effectively. However, it is still a problem to predict ...

The matrix Q (n × m) represents the attributes of the interacting nodes of the PPI network. Combining V (n × m) and Q (n × m) gives the protein feature matrix X (n × m), where X_i is the feature vector of the PPI network interaction sample v_i and X_ij (i = 1, 2, ..., n; j = 1, 2, ..., m) are its entries. The matrix Y (n × z) holds the annotation labels of proteins that have been successfully identified in the protein database, where z is the number of known protein functional module categories and n is the comparison label.

Fig. 1 Clustering steps of the IDBM algorithm

This paper proposes an improved density clustering algorithm to predict protein functional modules. As shown in Fig. 1 (a), the local density is calculated according to formulas (5) and (6). Node 7 is taken as the first central node, and the nodes that share its attributes are grouped with it: nodes 1, 2, 3, 4, 5, 6 and 7 form one cluster, in which node 7 is the core node of the PPI network. The result of this first clustering is shown in Fig. 1 (b). The remaining unassigned nodes are 8, 9, 10, 11, 12, 13, 14, 15 and 16. Their local densities are calculated again; the maximum local density obtained is 5, and node 10 is taken as the next PPI network core node, after which cluster analysis is carried out again. The clustering result is shown in Fig. 1 (c). The algorithm then checks whether any other PPI network nodes remain: if not, it ends; if so, clustering continues. Finally, a threshold condition determines whether the remaining nodes are regarded as noise. As shown in Fig. 1 (d), once the local density of the whole PPI network reaches the limiting density value, clustering stops and the sparse PPI network nodes are treated as noise.
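The clustering loop described above can be sketched in a few lines. This is only an illustrative sketch, not the IDBM algorithm itself: formulas (5) and (6) are not reproduced in the text, so local density is assumed here to be the number of still-unassigned neighbors within a hypothetical `radius`, and `min_density` is a hypothetical stand-in for the limiting density value.

```python
import numpy as np

def iterative_density_clustering(points, radius, min_density):
    """Repeatedly pick the densest unassigned node as a core and group its neighbors.

    Assumption: local density = number of unassigned neighbors within `radius`.
    Nodes left unassigned when the loop stops keep the label -1 (noise).
    """
    points = np.asarray(points, dtype=float)
    n = len(points)
    # Pairwise Euclidean distances between all nodes.
    dist = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=-1)
    labels = np.full(n, -1)
    unassigned = np.ones(n, dtype=bool)
    cores = []
    cluster_id = 0
    while unassigned.any():
        # within[i, j] is True when j is unassigned and lies within radius of i.
        within = (dist <= radius) & unassigned[None, :]
        density = within.sum(axis=1) - 1      # do not count the node itself
        density[~unassigned] = -1             # assigned nodes cannot become cores
        core = int(np.argmax(density))
        if density[core] < min_density:
            break                             # remaining nodes are too sparse: noise
        members = within[core]
        labels[members] = cluster_id
        unassigned[members] = False
        cores.append(core)
        cluster_id += 1
    return labels, cores
```

Called on two tight groups of points plus one isolated point, the function returns one label per group, the index of each core node, and -1 for the isolated point.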

Principal component analysis
As mentioned above, the three main attributes of the PPI network (core node, core node location and PPI network structure) play a key role in protein functional module prediction. Using these three attributes improves accuracy and reduces data noise. In this paper, a multi-layer perceptron (MLP) is used for training. If the clusters are used for training directly, their dimension is too high, which introduces considerable noise and degrades the prediction result. Given the correlation between protein functional attributes, principal component analysis must be carried out on the clusters obtained from clustering. This paper proposes an improved isometric mapping dimensionality reduction algorithm that keeps the data structure consistent before and after dimensionality reduction. The basic idea is as follows: build a matrix from the clusters obtained by clustering; then calculate the minimum distance between any two cluster nodes, expressed as the geodesic distance; finally, apply the multidimensional scaling (MDS) [8] algorithm to obtain low-dimensional data.
The improved isometric mapping dimensionality reduction algorithm proceeds as follows.

(1) Construct the adjacency matrix. Each cluster is regarded as a neighborhood point and ε is the neighborhood radius; L_j is a neighborhood point of L_i when the distance |L_i - L_j| is within ε.

(2) Calculate the shortest paths. If L_i and L_j are neighborhood points, the distance between them is taken to be the Euclidean distance s(i, j). The shortest-path matrix of geodesic distances between any two points, SC = [SC(i, j)] (n × m), is obtained from formulas (7) and (8), and this matrix is then passed to the MDS algorithm to compute the dimensionality reduction.

To improve the quality of the dimensionality reduction, the geodesic distance is modified according to formula (9), where SC(X_i, X_j) is determined by formula (10). The distance s in formula (11) is the Euclidean distance, and the kernel matrix K is calculated from the geodesic distances by formula (12), where e_n = [0 1]^T ∈ w^n. The matrix K is decomposed to obtain its largest n eigenvalues λ_1, λ_2, λ_3, ..., λ_n and the corresponding eigenvectors ω_1, ω_2, ω_3, ..., ω_n. Using the matrix ω = {ω_1, ω_2, ω_3, ..., ω_n}, the mapping ω → x'_ij yields the corresponding low-dimensional data x'_ij.

The core clusters of the PPI network after clustering and dimensionality reduction reflect the importance of protein functional modules in the PPI network. The higher the importance of the core cluster nodes, the more functional attributes the protein has. In this paper, the reduced core clusters are regarded as a feature matrix and combined with the matrix formed by the principal components of the known protein functional module attributes to obtain a new feature matrix. To improve the quality of training, min-max normalization is applied to the feature matrix so that each element undergoes a linear transformation.
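The three stages described above (ε-neighborhood graph, geodesic shortest paths, MDS) correspond to the standard Isomap pipeline, which can be sketched as follows. This is a minimal sketch of plain Isomap with classical MDS, not the improved variant: the modified geodesic distance of formulas (9) to (11) is not reproduced in the text and is therefore omitted. The function name `isomap_embed` and its parameters are hypothetical.

```python
import numpy as np

def isomap_embed(X, eps, n_components):
    """Plain Isomap: eps-neighborhood graph, geodesic distances, classical MDS."""
    X = np.asarray(X, dtype=float)
    n = len(X)
    # Pairwise Euclidean distances.
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    # Stage 1: keep only edges between points within the neighborhood radius.
    g = np.where(d <= eps, d, np.inf)
    # Stage 2: Floyd-Warshall shortest paths give the geodesic distances.
    for k in range(n):
        g = np.minimum(g, g[:, k:k + 1] + g[k:k + 1, :])
    if np.isinf(g).any():
        raise ValueError("neighborhood graph is disconnected; increase eps")
    # Stage 3: classical MDS on the geodesic distances (double centering).
    h = np.eye(n) - np.ones((n, n)) / n
    kernel = -0.5 * h @ (g ** 2) @ h
    eigvals, eigvecs = np.linalg.eigh(kernel)
    order = np.argsort(eigvals)[::-1][:n_components]
    lam = np.clip(eigvals[order], 0.0, None)   # guard against tiny negative values
    embedding = eigvecs[:, order] * np.sqrt(lam)
    return embedding, g
```

For three points forming a right angle with unit-length legs and `eps=1.1`, the two end points are not direct neighbors, so their geodesic distance is the path length 2 rather than the straight-line distance of about 1.41.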

Multilayer perceptron training and functional protein selection
In this paper, we combine the reduced core cluster elements of the PPI network with the principal components of the known protein functional module attributes to obtain a feature matrix F_ij with rows f_i = (f_i1, f_i2, ...). The value of F_ij is given by formula (14), in which the matrix x'_ij represents the clusters of the PPI network after dimensionality reduction and Y_ij represents the features of the attribute principal components of the protein functional modules. The principal components of the protein functional module attributes reflect the functional information of a protein, and the elements of the core clusters reflect the importance of the part of the PPI network in which the protein functional module is located. The higher the importance of a cluster's core node, the more functions the protein has [9].
To improve the convergence speed of the IIMP algorithm, min-max normalization is used to obtain the normalized matrix f'_ij. After this linear transformation every element falls into the interval [0, 1], as shown in formula (15). Each normalized feature vector f'_i = (f'_1, f'_2, ..., f'_n) denotes the attributes of a protein functional module.
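Min-max normalization as described above can be sketched as follows. Formula (15) is not reproduced in the text, so the sketch assumes the usual form (x - min) / (max - min) applied to each feature column; the per-column choice and the handling of constant columns are assumptions.

```python
import numpy as np

def min_max_normalize(features):
    """Linearly rescale each column of a feature matrix into [0, 1].

    Assumption: normalization is done per column (per feature).
    Constant columns are mapped to 0 to avoid division by zero.
    """
    features = np.asarray(features, dtype=float)
    col_min = features.min(axis=0)
    col_range = features.max(axis=0) - col_min
    safe_range = np.where(col_range == 0, 1.0, col_range)
    return (features - col_min) / safe_range
```

Normalizing per column keeps features measured on different scales from dominating the gradient updates, which is the usual reason this step speeds up convergence.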
In this paper, a multi-layer perceptron (MLP) with a single hidden layer is used as the classifier. The activation and loss functions are modeled on the data obtained by clustering the PPI network, and the hidden layer of the MLP is used to mine the complex relationship between predicted and known protein functional modules. As shown in Figure 2, the feature vectors derived from the terms of known protein functional modules are taken as input and the annotations of known protein functional modules are taken as output, establishing a mapping between multiple features and multiple functions in protein functional modules. The number of nodes in the input layer x equals the dimension of the cluster feature vector obtained by the isometric mapping dimensionality reduction algorithm. The hidden layer outputs f(W1 x + b1), where W1 is the weight (also known as the connection coefficient) and b1 is the bias. The number of nodes in the output layer is the number of annotations of the predicted protein functional modules in the data set.

The number of hidden layer nodes is chosen according to the empirical formula of [10], which has the form of a straight line in n-dimensional space, y = a1 x1 + a2 x2 + ... + an xn, where a1, a2, ..., an are empirical constants. The hidden layer uses the RReLU activation function, as in formula (16).
The tanh activation function is used in the output layer, as in formula (17): tanh(x) = (e^x - e^(-x)) / (e^x + e^(-x)). Cross entropy is used as the loss function of the output layer, as in formula (18): L(y, ŷ) = -(y log ŷ + (1 - y) log(1 - ŷ)), where y is the known label and ŷ is the network output. The neural network is trained by batch learning [11]; the batch size is 20% of the number of proteins in the training set, the number of iterations is 800, the learning rate is 0.08, and the momentum is 0.9.
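The hidden-layer computation f(W1 x + b1) and the training settings listed above can be illustrated with a short sketch. The learning rate 0.08, momentum 0.9 and 20% batch fraction come from the text; everything else is an assumption. In particular, formula (16) is not reproduced, so the RReLU slope bounds 1/8 and 1/3 are the commonly used defaults rather than values from this paper, and `momentum_step` shows one common form of SGD with momentum, not necessarily the update used by the authors' framework.

```python
import numpy as np

def rrelu(z, lower=1 / 8, upper=1 / 3, rng=None):
    """Randomized leaky ReLU.

    With `rng` given (training), each negative input gets a slope drawn
    uniformly from [lower, upper); without it (inference), the mean slope is used.
    The bounds are assumed defaults, not values taken from the paper.
    """
    z = np.asarray(z, dtype=float)
    if rng is None:
        slope = (lower + upper) / 2.0
    else:
        slope = rng.uniform(lower, upper, size=z.shape)
    return np.where(z >= 0, z, slope * z)

def hidden_layer(x, w1, b1, rng=None):
    """Hidden-layer output f(W1 x + b1) for a batch of row vectors x."""
    return rrelu(x @ w1 + b1, rng=rng)

def momentum_step(param, grad, velocity, lr=0.08, momentum=0.9):
    """One SGD-with-momentum update (a common form, assumed here)."""
    velocity = momentum * velocity - lr * grad
    return param + velocity, velocity

def batch_size_for(n_train, fraction=0.2):
    """Batch size as a fraction of the training-set size (20% in the text)."""
    return max(1, int(round(fraction * n_train)))
```

With a training set of 100 proteins, `batch_size_for(100)` gives a batch of 20, so one pass over the data takes five updates.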

Experimental environment and data set
The experiments were run on the Windows 7 operating system with an Intel dual-core processor at 3.1 GHz and 8 GB of physical memory. The programming tool is R 3.32. CPU acceleration technology and the MXNet deep learning framework are used to train the MLP.
To verify the effectiveness of the IIMP algorithm proposed in this paper, yeast PPI network data were selected as experimental data. Specifically, the yeast PPI network data come from the DIP [12] database (version dip20200719); the COGs database provides gene ontology; InterPro provides a database of domains and functional sites in proteins; and UniProtKB/Swiss-Prot is used for ID conversion of known protein functional modules. The effective protein terms were then screened by tandem affinity purification technology.

Evaluation index
In this paper, precision, recall and F-measure are the criteria for judging the strengths and weaknesses of an algorithm [13]. For the selection of protein functional modules, the matching degree between a functional module a mined by the algorithm and a known functional module b is calculated by formula (22). If the matching degree between the identified functional module a and the known functional module b exceeds the given threshold, the known functional module is said to be identified. Following reference [14], the threshold is set to 0.2: if OS(a, b) ≥ 0.2, the predicted protein functional module is considered to match the known protein functional module.

To analyze in detail how the IIMP algorithm detects protein functional modules on the PPI network, it was compared with the acc-fmd [15], MCL [16] and MCODE [17] algorithms. Acc-fmd is flexible and robust and can obtain high-quality data in the face of similarity problems. The MCL algorithm easily obtains data with good training quality, and the accuracy of the data obtained after clustering ... The MCODE algorithm readily finds close relationships between nodes in the PPI network, does not suffer a high false-positive rate from high-throughput data, and yields data with high accuracy. In the experiment, the IIMP algorithm is compared with each of these algorithms, and the number of functional modules extracted from the PPI network provided by the DIP database is analyzed.

Fig. 3 Comparison of the IDBM algorithm and the DBSCAN algorithm

As shown in Fig. 3, DBSCAN obtains 623 protein functional modules, while the improved density clustering algorithm IDBM obtains 925, a significant increase in the number of clustering modules.

Fig. 4 Comparison between the MDS Isomap algorithm and the Isomap algorithm

As shown in Fig. 4, the dimensionality reduction rate of Isomap is 50.2%. The dimensionality reduction rate of the protein functional modules obtained by the improved isometric mapping algorithm MDS Isomap is 72.1%, a significant increase.

Fig. 5 Basic information of the functional modules mined by the various algorithms

As shown in Fig. 5, PM is the total number of protein functional modules mined by each algorithm, full is the number of fully identified protein functional modules, and TP is the number of modules that match known protein functional modules. The IIMP algorithm mines 392 protein functional modules in total, of which 253 are matched. This is the largest number of matched protein functional modules among the four algorithms, which indicates that the IIMP algorithm proposed in this paper is more efficient.

Fig. 7 Comparison of the dimensionality reduction rates of the various algorithms

Fig. 6 and Fig. 7 show the dimensionality reduction rate of the protein attribute principal component feature extraction results obtained by the different clustering algorithms on the PPI network, where M is the number of PPI network clusters obtained after clustering, N is the number of clusters obtained after dimensionality reduction, and 1 - N/M is the dimensionality reduction rate. The dimensionality reduction rate of the IIMP algorithm is 75.36%, higher than those of acc-fmd, MCL and MCODE, which are 40.70%, 54.71% and 43.96%, respectively. The IIMP algorithm therefore has the highest dimensionality reduction rate among the four algorithms.

Fig. 9 Ratio of significant modules for each algorithm

As shown in Fig. 8 and Fig. 9, SC is the number of significant protein functional modules. Among the functional modules mined by the IIMP algorithm, the proportion of significant modules reaches 39.03%. For acc-fmd, MCL and MCODE, the proportions are 22.73%, 28.57% and 38.58%, respectively. The functional modules mined by the IIMP algorithm therefore have greater biological statistical significance.

As shown in Figure 10, this paper uses the leave-one-out method to evaluate the functional modules predicted by the IIMP algorithm. The proteins have been annotated in the MF, BP and CC networks. When predicting protein functional modules, all the clustering data are divided into k small data sets; k - 1 of them are used as training data and the remaining one as test data. Although this method is time-consuming, the overall results are close to the expected values. In the experiment, the precision, recall and F value on the PPI network provided by the DIP database are compared. The precision of the IIMP algorithm is 0.381, higher than that of acc-fmd (0.231), MCL (0.129) and MCODE (0.328). In recall, the IIMP algorithm is higher than acc-fmd and MCODE and lower only than MCL. In F value, the IIMP algorithm is higher than acc-fmd, MCL and MCODE, a clear advantage. Overall, the IIMP algorithm can effectively predict protein functional modules and is better than the acc-fmd, MCL and MCODE algorithms in accuracy, F value, number of recognized modules and dimensionality reduction rate.
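The evaluation measures used in this section can be sketched as follows. The dimensionality reduction rate 1 - N/M and the 0.2 threshold come from the text. Formula (22) is not reproduced, so the overlap score below assumes the form commonly used for module matching, |a ∩ b|^2 / (|a| · |b|); the F-measure is the standard harmonic mean of precision and recall. All function names are hypothetical.

```python
def overlap_score(a, b):
    """Matching degree between a predicted module a and a known module b.

    Assumption: OS(a, b) = |a & b|^2 / (|a| * |b|), a commonly used form;
    the paper's formula (22) may differ.
    """
    a, b = set(a), set(b)
    if not a or not b:
        return 0.0
    return len(a & b) ** 2 / (len(a) * len(b))

def is_matched(a, b, threshold=0.2):
    """A predicted module matches a known one when OS(a, b) >= threshold."""
    return overlap_score(a, b) >= threshold

def f_measure(precision, recall):
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

def reduction_rate(m, n):
    """Dimensionality reduction rate 1 - N/M for M clusters reduced to N."""
    return 1 - n / m
```

For example, two four-protein modules sharing two proteins score 2^2 / (4 · 4) = 0.25, which is above the 0.2 threshold and therefore counts as a match under this assumed definition.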

Conclusion
With the continuous growth of PPI network data, how to predict protein functional modules efficiently is an important research problem in modern bioinformatics. Predicting protein functional modules from the PPI network is a popular approach. It is low in cost and efficient, but the data obtained contain a lot of noise. This paper proposes the IIMP algorithm to address this problem. The algorithm integrates three core pieces of information in the PPI network (PPI network structure, core nodes and the position of each node) and so considers the various elements of the PPI network comprehensively. Three techniques are used to predict protein functional modules: a density clustering algorithm, an isometric mapping dimensionality reduction algorithm and a multi-layer perceptron. To verify that the IIMP algorithm is more effective than other algorithms, PPI network data and known protein functional module terms were downloaded from the DIP database, the corresponding protein functional module numbers were obtained from the InterPro database, and COGs functional annotation was used. The experimental results show that the IIMP algorithm can effectively predict protein functional modules and is better than the acc-fmd, MCL and MCODE algorithms in accuracy, F value, dimensionality reduction rate and number of recognized functional modules.
In future work, the prediction of protein functional modules can be studied further in the following directions. 1) A deeper understanding of how PPI network density affects the prediction of protein functional modules: different PPI network densities affect the size of the clusters, and density has a great influence on prediction, so how the density of the PPI network should be partitioned during data preprocessing needs further study. 2) Building on an understanding of the PPI network, it is necessary to study how to use currently popular intelligent optimization algorithms, such as Bayesian algorithms and neural network algorithms, to predict protein functional modules. 3) The PPI network is dynamic, yet most researchers study static PPI networks. How to predict different protein functional modules under dynamic conditions while the overall structure of the PPI network remains unchanged is also a new research direction.