A Network Topology Approach in Survival Analysis

Network topology can be used to reduce the complexity of a data set. We explore its use in survival analysis to identify the most important factors contributing to the survival time from diagnosis to death. The technique can readily illustrate some types of complex interactions in a data set, and the most important factors in the survival analysis are then identified from those interactions. In this paper, the network topology is interpreted using centrality measures. A case study of the survival time of cervical cancer patients is presented and discussed, and the most important factors influencing the survival of these patients are identified from the network topology.


Introduction
Survival analysis, or more generally time-to-event analysis, refers to a set of methods for analyzing the length of time until the occurrence of a well-defined end point of interest. Here, survival time is calculated from the time of diagnosis to death for cancer deaths, or to the date of last contact or death from other causes for censored patients. A distinctive feature of survival data is that typically not all patients experience the event (for example, death) by the end of the observation period, so the actual survival times of some patients are unknown. Many factors influence the survival period; see, for example, [1][2][3]. The statistical analysis of survival data has a history of several hundred years, reaching back at least as far as the publication of the population life table by the astronomer Halley in 1693 [4]. Schober and Vetter [5] mentioned that the most common statistical techniques used to analyze survival data are the Kaplan-Meier estimator, the log-rank test, and the Cox proportional hazards (PH) model. Further details about these techniques can be found in [6][7][8][9]. These three methods are examples of univariate analysis; they describe survival with respect to the factor under investigation, but necessarily ignore the impact of any others. Hence, in order to analyse the survival time with respect to several factors simultaneously, multivariate analysis is performed in this study. Multivariate analysis is a set of techniques for analysing data sets that contain more than one variable, especially when the variables are correlated. Because of this correlation and the complexity of the underlying data sets, multivariate analysis requires considerable computational effort. Therefore, to reduce the complexity of the multivariate analysis, a network topology approach is used in this study.
An example of the survival time of cervical cancer patients is discussed to illustrate the structure of the network topology, and a recommendation is presented. The rest of the paper is organized as follows. In Section 2, we present the methodology of network topology, followed by the results and discussion of the corresponding example in Section 3. The paper closes with a conclusion in Section 4.

2. Methodology
In this paper, network topology will be used to identify the most important factor in multivariate survival data. Hence, in this section, the structure of network topology will be discussed.

Network topology
In this paper, the network topology starts from a correlation matrix, which is then transformed into a distance matrix [10]. From this matrix, a minimum spanning tree (MST) is constructed, as suggested by Kruskal [11], using Kruskal's algorithm provided in Matlab version R2018b. From the MST, we construct the network topology of all variables. This is a simplification of the complex system described by the corresponding correlation matrix, and it is used to summarise the most important information. The MST is visualized using the open-source software Pajek [12][13][14][15]. To make the network topology more attractive and easier to interpret, we use the Kamada-Kawai procedure provided in Pajek. To interpret the MST we use standard tools, i.e., centrality measures: the degree centrality measure [12][13] and the betweenness centrality measure [12,16,17].
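The steps from correlation matrix to MST can be sketched as follows. This is a minimal illustration in Python rather than the Matlab routine used in the study. The transform d = sqrt(2(1 - rho)) is an assumption, since the text only cites [10] for the distance matrix, and the four variables A to D and their correlation values are hypothetical.

```python
import math

def correlation_to_distance(corr):
    """Map each correlation rho to d = sqrt(2 * (1 - rho)).
    Assumed transform: the paper cites [10] without stating the formula."""
    n = len(corr)
    return [[math.sqrt(max(0.0, 2.0 * (1.0 - corr[i][j]))) for j in range(n)]
            for i in range(n)]

def kruskal_mst(dist):
    """Return the MST edges (i, j, weight) of the complete graph on dist."""
    n = len(dist)
    parent = list(range(n))  # union-find forest, one root per component

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    # Kruskal: scan all links from shortest to longest distance.
    edges = sorted((dist[i][j], i, j) for i in range(n) for j in range(i + 1, n))
    tree = []
    for weight, i, j in edges:
        root_i, root_j = find(i), find(j)
        if root_i != root_j:  # keep the link only if it joins two components
            parent[root_i] = root_j
            tree.append((i, j, weight))
            if len(tree) == n - 1:
                break
    return tree

# Hypothetical correlation matrix for four illustrative variables.
labels = ["A", "B", "C", "D"]
corr = [
    [1.0, 0.8, 0.1, 0.3],
    [0.8, 1.0, 0.5, 0.2],
    [0.1, 0.5, 1.0, 0.6],
    [0.3, 0.2, 0.6, 1.0],
]
dist = correlation_to_distance(corr)
mst = kruskal_mst(dist)
mst_edges = sorted((labels[i], labels[j]) for i, j, _ in mst)
print(mst_edges)
```

For these four hypothetical variables the complete graph has 4 × 3/2 = 6 links, whereas the tree keeps only 3, one fewer than the number of nodes.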

Case study: survival time for cervical cancer
The example used in this study concerns the survival time of cervical cancer patients. The data set comprises 120 patients and seven predictor variables, and was retrieved from a local hospital in Malaysia. The event of interest is death. The predictor variables are ethnicity (ETH), lymph node involvement (LN), histologic type (HIS), age at diagnosis (AGE), stage at diagnosis (STG), primary treatment (PT) and distant metastasis (DM). The qualitative variables ETH, LN, HIS, STG, PT and DM are transformed into numeric codes so that they can be analysed as quantitative variables; for example, 0 represents negative LN and 1 represents positive LN. For the analysis, a response variable, status (STA; dead or alive), is added to determine which predictor variables have a significant influence on the response.
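The coding step described above can be illustrated with a short sketch. Only the LN coding (0 for negative, 1 for positive) is taken from the text. The coding of STA, the six patient records and the use of the Pearson coefficient are assumptions made for illustration, and none of the values come from the hospital data.

```python
import math

# LN coding as stated in the text; STA coding is an assumption.
LN_CODES = {"negative": 0, "positive": 1}
STA_CODES = {"alive": 0, "dead": 1}

def encode(values, codes):
    """Replace each category label by its numeric code."""
    return [codes[v] for v in values]

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length numeric lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

# Made-up records for six patients (not the study data).
ln_raw = ["negative", "positive", "positive", "negative", "positive", "negative"]
sta_raw = ["alive", "dead", "dead", "alive", "alive", "alive"]

ln = encode(ln_raw, LN_CODES)
sta = encode(sta_raw, STA_CODES)
r = pearson(ln, sta)
print(ln, sta, round(r, 4))
```

Repeating this calculation for every pair of the eight coded variables would give the entries of the correlation matrix from which the network is built.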

Results and Discussion
The correlation matrix of the survival data consists of 8 variables as nodes connected by 8 × (8 − 1)/2 = 28 links, each of which corresponds to the correlation between two different nodes. By using the MST, however, we only have to consider 8 − 1 = 7 links. This reduction in the number of links shows that the complexity of the multivariate analysis has been reduced. The MST is a subgraph that connects all the variables (nodes) and whose total weight, i.e., total distance, is minimal. Figure 1 shows the corresponding MST for the survival data. This figure shows the most important relationships, i.e., the interconnectivity among all variables in terms of the MST. The more links a node has, the more dominant it is relative to the other nodes. Based on the MST, STA is influenced most strongly by STG, HIS and DM. Hence, in terms of the MST, STG, HIS and DM are the most important predictors of the survival time of the cervical cancer patients. To better understand this finding, further information is presented using centrality measures, i.e., degree centrality and betweenness centrality. Degree centrality indicates the level of importance of a variable in terms of its connectivity with other variables, and gives the number of edges incident upon a given variable.
Figure 2 presents the network topology in which the colour of each node (predictor variable) represents its rank of importance based on degree centrality. The colours used in this analysis, in decreasing order of importance, are yellow and black. The higher the centrality score of a particular node, the more dominant that node is. In Figure 2, STG and HIS (yellow nodes) have the highest number of links, i.e., 3 links each in the network. They play the most important role in the network. This means that STA is strongly influenced by STG and HIS.
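The two centrality measures can be computed directly from the edge list of a tree, as in the sketch below. The five-node tree with nodes A to E is hypothetical and is not the MST of Figure 1. Betweenness is reported here as an unnormalised count of node pairs whose unique path passes through the node, which is one common convention and may differ from the scaling used by Pajek.

```python
from collections import defaultdict

def build_adjacency(edges):
    """Build an undirected adjacency map from a list of (u, v) links."""
    adj = defaultdict(set)
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    return dict(adj)

def degree_centrality(adj):
    """Number of links incident upon each node."""
    return {node: len(neighbours) for node, neighbours in adj.items()}

def branch_size(adj, start, blocked):
    """Count the nodes reachable from start without passing through blocked."""
    seen = {start, blocked}
    stack = [start]
    count = 0
    while stack:
        node = stack.pop()
        count += 1
        for nxt in adj[node]:
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return count

def tree_betweenness(adj):
    """Unnormalised betweenness in a tree: pairs of nodes lying in
    different branches of a node are exactly the pairs it separates."""
    result = {}
    for node in adj:
        sizes = [branch_size(adj, nbr, node) for nbr in adj[node]]
        total = sum(sizes)
        # Sum of size_a * size_b over all unordered pairs of branches.
        result[node] = (total * total - sum(s * s for s in sizes)) // 2
    return result

# Hypothetical tree (not the MST of the case study).
edges = [("A", "B"), ("B", "C"), ("B", "D"), ("D", "E")]
adj = build_adjacency(edges)
degree = degree_centrality(adj)
betweenness = tree_betweenness(adj)
print(degree)
print(betweenness)
```

In this toy tree, node B has the largest degree and also lies on the most paths, so both measures rank it first. The two rankings need not agree in general, which is why both measures are considered.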

Conclusion
Based on the analysis of the MST in Figure 1, the response variable STA is influenced by STG, HIS and DM. Further analysis based on two centrality measures leads to the following conclusions.
(i) Based on the degree centrality measure, the most important predictors contributing to the response variable are STG and HIS. (ii) According to the betweenness centrality measure, the most important predictor is STG, followed by HIS. Based on these findings, we conclude that the most important factors contributing to the survival of cervical cancer patients are STG, HIS and DM. This finding is supported by the results of Pomros et al. [18] and Carneiro et al. [19], who found that STG affected the survival of cervical cancer patients. These results suggest that network topology can be used to determine the most significant factors influencing the survival of cervical cancer patients. Consequently, these variables should be given special attention in the survival analysis of cervical cancer patients. For future research, we suggest using a robust network topology so that the survival analysis produces robust results. Such a topology can be developed by substituting a robust estimator for the classical correlation matrix used in the network topology [20][21][22][23].