Measuring the significance of higher-order dependency in networks

Higher-order networks (HONs), which go beyond the limitations of pairwise relation modeling by graphs, capture higher-order dependencies involving three or more components for various systems. As the number of potential higher-order dependencies increases exponentially with both network size and the order of dependency, it is of particular importance for HON models to balance their representation power against model complexity. In this study, we propose a method, significant k-order dependencies mining (SkDM), based on hypothesis testing and the Markov chain Monte Carlo (MCMC), to identify significant higher-order dependencies in real systems. Through synthetic clickstreams with elaborately designed higher-order dependencies, SkDM shows a powerful ability to correctly identify all significant dependencies at preset significance levels of α={0}{.01, 0}{.05, 0}{.10} , performing as the only method, in comparison to the state of the arts, that can robustly maintain the Type I error rate, and without generating any Type II error across all the experimental settings. We further apply the SkDM method to various empirical networks, including journal citations, air traffic, and email communications. Empirical results show that among those tested networks, only 6.03%, 1.47%, and 1.28% of all potential dependencies are of statistical significance ( α={0}{.01} ). The proposed SkDM method, therefore, provides an efficient tool for higher-order network analysis tasks at reduced computational complexity.

The first three models capture many-body interactions using higher-order structures such as hyperedges, simplices, and motifs and uncover the higher-order dynamics in complex systems, such as percolation,

Higher-order Markov network model
A system (V, S) is defined with S = {p 1 , p 2 , • • • , p N } the set of sequences and V = { v 0 , v 1 , . . ., v m−1 } the component set, where is an ordered record of l i + 1 vertices.Based on a first-order Markov model, the system could be represented as a graph G (1) = V (1) , E (1) with first-order nodes V (1) and directed edges E (1) ⊆ V (1) × V (1) , where (v i , v i +1 ) ∈ E (1) for i ∈ [0, l i − 1] is defined in p i .A k-1th order dependency with k nodes extracted from sequential data S could be defined as I = [v α , . . ., v β ] for v α , v β ∈ V, aggregated in the set Γ = {I 0 , I 1 , . . ., I n }.Based on higher-order Markov chains, real systems (V, S, Γ) could naturally be represented as conditional probability models [42,46].
First-order Markov network model.First-order networks rest on the assumption of Markovian property (first-order Markov process) which is memoryless-a random walker's next destination only depends on the currently visited component and is independent of its history [47,48].For example, when a walker is a web user, the first-order Markov model for user navigational behaviors assumes that the next page they visit depends solely on the current page and not on historical clickstreams.Given a real system (V, S), we view each component as a state and direct interactions between components as potential transitions in a standard Markovian network model.The model can be written as follows: where P ( v i +1 | v i ) is the transition probability of a walker moving from first-order node v i to first-order node v i +1 .The first-order transition probability matrix P (1) is given by p |V (1) |1 p |V (1) |2 • • • p |V (1) ||V (1) | where V (1) represents the number of elements in V (1) and the transition probability from first-order node v i to first-order node v j is proportional to the edge weight W(v i → v j ).k-order Markov network model.The properties of the first-order Markov model indicates that it would be too simplistic for capturing higher-order dependencies that impact network analysis tasks, such as ranking, clustering, and link prediction [5,41].Since higher-order Markov chains are processes that could keep the impact of the past few states on the walker's next step, it can be used to capture higher-order dependencies to improve the accuracy of the representation of real systems.For example, a k-order Markov chain model (M k ) holds that a walker's next movement depends on the last k components visited.
In order to embed higher-order dependencies in systems, it is critical to reconstruct the network.A first-order node can be split into different higher-order nodes based on variable-order dependencies [5].For instance, based on a k-order The HON model with k-order dependencies can be written as follows: where is the probability that a random walker moves from v i to v i +1 based on its (k-1)-step memory.

Significant k-order dependencies mining (SkDM)
Higher-order dependencies go beyond pairwise interactions and could better explain how the components directly and indirectly influence each other.Naturally, HON models embedding higher-order dependencies improve their descriptive powers and the accuracy of network analysis tasks [12].However, the exponentially increased number of potential higher-order dependencies introduces problems such as high computational complexity and 'state space explosion' .It is critical to develop a method to identify significant higher-order dependencies from the system to reduce the complexity of HON models.We develop a two-tailed statistical test framework, significant k-order dependencies mining (SkDM), based on hypothesis testing and the MCMC method to select significant higher-order dependencies from systems.The central topic is to focus on significant higher-order dependencies: Connected and sequential patterns occur in real-world complex systems at numbers significantly higher or lower than those in randomized networks generated according to the first-order transition probability matrix.Here, we plan to generate many randomized networks with the same first-order transition probability matrix as the real network: Each node in the randomized network has the same first-order transition probability as the corresponding node in the real network.In order to obtain randomized networks, we first generate 1000 simulation datasets with 10 6 trajectories using the MCMC method [49][50][51].Then, we extract k-order dependencies from simulation datasets and compute the corresponding k-order transition probabilities based on higher-order Markov chains.Furthermore, we propose the hypothesis testing framework for judging whether a k-order dependency ϕ is significant.Specifically, the null hypothesis (H 0 ) and the alternate hypothesis (H 1 ) can be described as: H 0 : A k-order dependency ϕ is not significant such that H 1 : A k-order dependency ϕ significantly exists in the real system.
where p real (ϕ) is the transition probability of k-order dependencies in real systems and p random (ϕ) is the sample mean θ of transition probabilities of k-order dependencies in simulation datasets.
According to the central limit theorem, the distribution for a sample mean is approximately normally distributed for large simulation datasets [52,53].Therefore, we establish a standard normal random variable z-score (Z) as a test statistic where SD (p random (ϕ)) is the standard deviation σ θ of transition probabilities of k-order dependencies in simulation datasets.Then we could derive a confidence interval with a confidence level 1 − α based on the population mean θ of transition probabilities of k-order dependencies in simulation datasets, as shown in equation ( 11) [52,54,55] In addition, we introduce the p-value to represent the extent that the test statistic disagrees with H 0 .The p-value is the probability of finding that the value of a test statistic (such as Z) is at least as contradictory to H 0 , assuming H 0 is correct [56].We could reject the null hypothesis when the p-value is less than the significance level α.
In this study, we take z 0.005 = 2.58, and the 99% confidence interval (α = 0.01) is Finally, we decide whether to accept or reject the null hypothesis.Since SkDM is of the two-tailed statistical test, both [−∞, −2.58 × σ θ + θ] and [2.58 × σ θ + θ, ∞] are rejection regions.When the transition probability p real (ϕ) of k-order dependencies in real systems falls in the rejection regions or the absolute value of Z is bigger than 2.58 (|Z| > z 0.005 ), we reject the null hypothesis and then obtain the significant k-order dependency ϕ.

Baselines
Here we compare the performance of SkDM with a classical sequential pattern mining method and two higher-order modeling methods: the Prefix-projected Sequential Pattern Mining algorithm (PrefixSpan), the multi-order graphical modeling framework (MON) and BuildHON+.
PrefixSpan quickly mines the complete set of sequential patterns in the database and greatly reduces the computation complexity [57].It first scans the database to find the frequent 1-sequences and generates the projected database for each frequent 1-sequence.In this way, Prefixspan recursively generates the projected database for each frequent k-sequence to find frequent (k+1)-sequences.It has a minSup threshold, which represents the minimum support required to be considered a frequent sequential pattern.
MON is a nested structure of multi-order graphical models which could infer higher-order dependencies at multiple lengths and capture topological and temporal characteristics of real-world datasets [42,43].For example, two multi-order models MK and MK+1 incorporate dependencies up to maximum orders K and K + 1 respectively.MK is considered as the null model, and MK+1 is the alternative model.The p-value of the null model MK is as follows, where S is the sequential dataset and d (K) are the degrees of freedom of MK .The likelihood ratio L( MK|S) represents the relative fitness between the alternative model and the null model, given the observed dataset.Γ is the Euler Gamma function and γ is the lower incomplete gamma function.The algorithm iteratively checks whether the p-value is below a significance threshold ϵ or not.If the p-value is below ϵ, we reject the alternative model MK .This process continues until the maximum order K opt is reached, where the p-value exceeds ϵ.Here, we set ϵ as 0.001 (ϵ = 0.001).
BuildHON+ is a scalable and parameter-free algorithm based on the Kullback-Leibler divergence, designed to extract variable and higher-order dependencies from big data [44].A k-order dependency ϕ could extend to ϕ ext of order k ext = k + 1 when the distribution of ϕ ext (D ext ) is significantly different from the distribution of ϕ (D).BuildHON+ uses the Kullback-Leibler divergence measuring the difference between ϕ and ϕ ext , as shown in equation (18), where δ is a dynamic threshold and Support(ϕ ext ) is the number of observations ϕ ext .When D KL (D ext ∥D) is bigger than δ, the k-order dependency ϕ could extend to ϕ ext of order k ext .

Model validation and comparison
To validate the effectiveness of the proposed method, we apply SkDM to two elaborately designed synthetic clickstream datasets containing variable orders of dependencies, the CK1 dataset and the CK2 dataset.The CK1 dataset incorporates only first-order dependencies, while the CK2 dataset includes both first and second-order dependencies.

Synthetic data
First, we assumed 1000 users navigating through 100 web pages, arranged in a 10 × 10 grid and indexed from 0 to 99.Each page had 2 out-links to neighboring pages, including a down-link and a right-link with wrapping, i.e. a user could move from a rightmost page to the leftmost page in the same row or from a bottom page to the top page in the same column.Each user started from a random page numbered from 0 to 99 and navigated through 100 pages within the given time.As a result, each clickstream dataset contained 1000 records with 100 000 clicks in a period.We aimed to generate two synthetic clickstream datasets with variable orders of user navigating dependencies, the CK1 and CK2 datasets.The CK1 dataset contained only first-order dependencies: Each user had a 50% chance of moving rightward or downward in the next step.Keeping the previous (first-order) Figure 1.Significant second-order dependencies (α = 0.01) extracted from two synthetic clickstream datasets: the CK1 and CK2 datasets.(a) Two synthetic web clickstream datasets, the CK1 and CK2 datasets, with variable orders of user-navigating preferences on 100 web pages as a 10 × 10 grid.The CK1 dataset contains only first-order dependencies: Each user had a 50% chance of navigating to the page on the right and 50% chance of navigating down in the next step.Twelve second-order dependencies on pages 25, 51 and 77 are embedded in the CK2 dataset, such as 15→25→26, 24→25→35, 41→51→52.For example, the second-order dependency 15→25→26 represents that users coming from page 15 to page 25 are more likely to click on page 26 rather than page 35.(b)-(c) The horizontal axis represents the last target component v k in a second-order dependency v i → v j → v k .The vertical axis represents the second-order node v j |v i which is the series of the first two components in a second-order dependency v i → v j → v k .For example, the second-order node 51|41 represents page 51 given page 41 as the previous step.SkDM extracts 4 significant second-order dependencies (highlighted in saddle brown) out of 400 potential second-order dependencies (highlighted in orange) from the CK1 dataset (see (b)).All 12 significant second-order dependencies are correctly extracted by SkDM from the CK2 dataset and represented in the higher-order clickstream network (see (c)).
dependencies, the CK2 dataset introduced 12 second-order dependencies to keep its first-order transition probability matrix P (1) the same as the first dataset.The imposed second-order dependencies were that (1) all users coming from page 24 to 25 (50 to 51, 76 to 77) would have a 10% chance of moving right (p d = 0.1) and a 90% chance of moving down (1 − p d = 0.9) in the next step; (2) all users coming from page 15 to 25 (41 to 51, 67 to 77) would have a 90% chance of moving right (1 − p d = 0.9) and a 10% chance of moving down (p d = 0.1) in the next step; (3) all users coming from page 50 to 51 would have a 10% chance of moving right (p d = 0.1) and a 90% chance of moving down (1 − p d = 0.9) in the next step; (4) all users coming from page 41 to 51 would have a 90% chance of moving right (1 − p d = 0.9) and a 10% chance of moving down (p d = 0.1) in the next step; (5) all users coming from page 76 to 77 would have a 10% chance of moving right (p d = 0.1) and a 90% chance of moving down (1 − p d = 0.9) in the next step; and (6) all users coming from page 67 to 77 would have a 90% chance of moving right (1 − p d = 0.9) and a 10% chance of moving down (p d = 0.1) in the next step.Figure 1(a) shows these defined dependencies in a 10 × 10 grid.

Model validation
Since our goal is to find significant second-order dependencies in the web clickstream datasets, the null hypothesis (H 0 ) and alternate hypothesis (H 1 ) in the selection process of SkDM can be described as follows: H 0 : the selected second-order pattern ϕ is not a significant dependency in the dataset; H 1 : the selected second-order pattern ϕ is a significant dependency in the dataset.Since SkDM can yield only one of two outcomes for each second-order pattern, either rejecting or not rejecting H 0 , the test could be subject to the type I error.Furthermore, the type I error rate (or significance level) is represented by the probability of rejecting the null hypothesis given that it is true, denoted by α [52].
As shown in figure 1(b), our method captures 4 significant second-order patterns on page 32 at the significance level α = 0.01 from the CK1 dataset.However, there are only first-order dependencies in the CK1 dataset: a user currently on page 32 has the same probability of visiting page 33 and page 42 in the CK1 dataset (figure 1(b)).Thus, SkDM rejects H 0 (when, in fact, it is true) and makes a type I error: where N (FS) is the number of false significant dependencies extracted by SkDM and N (TNS) is the number of true non-significant patterns extracted by SkDM.Since there are 4 false significant second-order dependencies and 396 true non-significant second-order patterns by SkDM, P (Type I error) is 0.010.The result that the probability of making a type I error is equal to the preset significance level α validates our method.Furthermore, figure 2 presents the probabilities of making a type I error at different significance levels of α = 0.01, 0.05, 0.10 in the CK1 dataset.The results show that SkDM could select higher-order dependencies at different significance levels from systems to reduce computational complexity.Furthermore, we establish two measures, the test statistic z-score (denoted as Z) and p-value, to represent the observed significance level of each higher-order dependency (see Materials and Methods).The test statistic Z describes the distance that test results differ from the null hypothesis, whether above or below the mean, measured in units of the standard deviation [52,56].The p-value indicates the extent to which the test statistic Z disagrees with H 0 [52].Using SkDM, we could calculate the p-values and z-scores of all higher-order dependencies to indicate their observed significance levels (see figure 6 and tables 4-5 in appendices).
In order to evaluate the ability of SkDM to capture significant dependencies, we embed 12 second-order dependencies on pages 25, 51, and 77 in the CK2 dataset; that is, a user's next click depends on not only the current page but also the previous page.For example, the probabilities of a user currently at page 25 moving to page 26 and page 35 are p d = 0.10 and 1 − p d = 0.90, respectively, if the previous click is page 24; the probabilities of a user currently at page 25 moving to page 26 and page 35 are 1 − p d = 0.90 and p d = 0.10, respectively, if the previous click is page 15. Figure 1(b) shows that our method correctly captures all 12 significant second-order dependencies in the CK2 dataset.Using SkDM, we could also calculate the p-values and z-scores of all higher-order dependencies to indicate their observed significance levels (see figure 7 and tables 6-7).The above findings imply that SkDM powerfully extracts significant higher-order dependencies to represent true flow patterns in systems.

Model comparison
We compare the performance of SkDM with a classical sequential pattern mining method and two higher-order modeling methods: the Prefix-projected Sequential Pattern Mining algorithm (PrefixSpan) [57], the multi-order graphical modeling framework (MON) [42] and the BuildHON+ algorithm [44] (see Baselines).Table 1 represents the performances and average runtimes in seconds of these four algorithms in capturing higher-order dependencies in the CK2 dataset with different values of p d (p d = 0.10, 0.20, 0.30, 1 − p d = 0.90, 0.80, 0.70).
Results show that SkDM captures 16 significant second-order patterns, of which 12 are true significant dependencies when p d is 0.10, 0.20 and 0.30, respectively.Since there are 4 false significant dependencies (N (FS) = 4) and 384 true non-significant second-order patterns (N (TNS) = 384) by SkDM with different values of SkDM, the type I error rate of SkDM is 0.010.Furthermore, we could calculate the type II error rate, which is the probability of failing to reject the null hypothesis when it is actually false, denoted by β where N (FNS) is the number of false non-significant patterns extracted by SkDM and N (TS) is the number of true significant dependencies extracted by SkDM.Since there are 0 false non-significant patterns and 12 true significant second-order dependencies extracted by SkDM, the type II error rate of SkDM is 0.00.As the output of PrefixSpan depends on the model parameter minSup, we represent the performance of PrefixSpan in capturing higher-order dependencies in the CK2 dataset with different values of p d and minSup in table 2. The minSup threshold represents the minimum frequency a pattern must meet to be considered frequent.In order to compare PrefixSpan with other methods, we identify the top 12 frequent patterns as captured patterns.Subsequently, we could obtain the values of minSup with different values of p d .In table 2, PrefixSpan could extract all 12 second-order dependencies from the CK2 dataset when p d is 0.10, and the type I error rate and type II error rate are both 0.00.It shows that PrefixSpan could capture all dependencies from the dataset when p d is 0.10.However, as p d growing up, the ratio of false significant dependencies to the total number of frequent patterns, N(FS)  N(d) also increases.When p d is 0.30, the type II error rate is as high as 0.83.It shows a large proportion of non-significant patterns among the frequent patterns when p d is 0.30.These results indicate that PrefixSpan effectively identifies frequent sequential patterns no less than the minimum support threshold.However, higher-order dependencies are sequential patterns with significantly different numbers than random patterns.Furthermore, SkDM is approximately twenty times faster than PrefixSpan (see table 1).Consequently, the PrefixSpan algorithm is not optimal for identifying higher-order dependencies.
We observe that the MON method treats all 400 second-order patterns in the CK2 dataset as dependencies when p d is 0.10, 0.20 and 0.30 (see table 1).MON could extract 388 false significant second-order dependencies and 0 false non-significant patterns, resulting in a type I error rate of 1.00 and a type II error rate of 0.00 (see table 8 in appendices).Furthermore, SkDM is approximately 100 times faster than MON (see table 1).
Table 1 shows that the BuildHON+ method successfully extracts all 12 second-order dependencies when p d is 0.10 and 0.20.However, BuildHON+ fails to identify any second-order dependencies when p d is 0.30, resulting in a type I error rate of 0.00 and a type II error rate of 1.00 (see table 9 in Appendices).As we have seen, compared to MON and PrefixSpan, SkDM has a relatively low rate of identifying non-significant patterns as higher-order dependencies.Despite being a time-efficient method and demonstrating satisfactory performance on the CK2 dataset at p d = 0.10 and p d = 0.20, BuildHON+ fails to distinguish significant differences between first-order and second-order dependencies at p d = 0.30.In conclusion, SkDM is an effective and robust solution to capture significant higher-order dependencies in networked systems.

Empirical demonstration and analysis
To study significant higher-order dependencies in real networks using SkDM, we collected three public real-world datasets: publications of the American Physical Society (the APS Dataset), flight itineraries between U.S. cities (the DB1B Dataset), and email communications between 146 executives in the Enron Corporation (the Enron Email Dataset).The detailed descriptions of the datasets are as follows.

Data description APS Dataset.
The American Physical Society (APS) provides rich data based on its publications about physics for research about networks science.The dataset covers 116 years, with 468 291 articles in nine journals and 906 398 citations from 1893 to 2009, and is available at https://journals.aps.org/datasets(table 11) [58].By modeling citations between journals, we construct the APS citation network and study interdisciplinary knowledge flows.In the APS citation network, a node represents a journal and an edge represents a citation from one journal to another.
DB1B Dataset.The DB1B dataset contains 19 415 369 itineraries between 464 U.S. airports from the first three quarters of 2011 and is available at https://transtats.bts.gov/PREZIP/[59].By modeling passengers' itineraries between 413 U.S. cities aggregated from 464 airports, we construct a U.S. air traffic network and study passengers' flight patterns among cities.In the U.S. air traffic network, a node represents a city and an edge represents a passenger's itinerary from one city to another.
Enron Email Dataset.The Enron email dataset (www.cs.cmu.edu/~enron/)contains 116 525 messages generated by 146 senior executives from Enron Corporation disclosed during the investigation of the Enron scandal [60].By modeling messages, we construct an Enron email network and study the communication patterns between senior executives.In the Enron email network, a node represents a senior executive and an edge represents messages between them.

Extracting higher-order dependencies in citation flows
The above experiments on CK1 and CK2 show that SkDM can correctly identify higher-order dependencies with the preset significance level.Furthermore, we apply SkDM to the APS dataset to extract significant higher-order dependencies and investigate citation flows in the network.The APS dataset comprises 468 291 articles published in nine journals and 906 398 citations from 1893 to 2009.figure 3(a) shows significant second-order dependencies extracted from the APS citation network by SkDM are highlighted in saddle brown.Table 3 represents the performances and average runtimes in seconds of these four algorithms in capturing higher-order dependencies in the APS dataset.Results show that 44 significant second-order dependencies are filtered out from 729 potential second-order dependencies in the network, with exactly 6.03% (see table 3).SkDM is approximately eighty times faster than MON in capturing higher-order dependencies.Therefore, by eliminating non-significant higher-order dependencies from HON models, SkDM greatly reduces the computational complexity and improves the efficiency of network analysis tasks.
To show citation flows among journals, we then build eight ego HONs for eight APS journals, simultaneously containing first-order, second-order, and third-order nodes.Figures 3(b)-(i) represents each dependency as a path from a first-order node to a third-order node.For example, the second-order dependency PRA→PRL→PRA is described as path PRA→PRL|PRA→PRA|PRL,PRA from the first-order node PRA through the second-order node PRL|PRA to the third-order node PRA|PRL,PRA.Note that a k-order node v i |v i −1 , . . ., v i −k+1 represents the component v i ; for instance, the first-order node PRA and the second-order node PRA|PRL represent the same journal, Phys Rev A. As shown in figures 3(b)-(i), we find a 'commuting pattern' , that is, citation flows mostly return to journals where they come from.For example, affected by Phys Rev A, the citation flow from Phys Rev Lett most likely returns to Phys Rev A (figure 3(b)).Influenced by Phys Rev STAB, there is no network flow from Phys Rev Lett to Phys Rev B. Yet, 36.81% of Figure 3. Significant second-order dependencies extracted from the APS citation network.(a) Significant second-order dependencies at the significance level of α = 0.01 are highlighted in saddle brown, such as PRA→PRL→PRA.The horizontal axis represents the target journals in second-order dependencies.For example, PRA is the target journal of the second-order dependency PRA→PRL→PRA.The vertical axis represents second-order nodes which are the series of the first two components in second-order dependencies, such as PRL|PRA.(b)-(i) Eight ego HONs composed of significant second-order dependencies.Different colors represent different journals in the APS citation network.Each node is colored based on the journal it indicates.For example, since the second-order node PRL|PRA represents Phys Rev Lett, it is marked with yellow.The thickness of an edge is proportional to its weight, representing the number of citations between two nodes.publications in Phys Rev Lett are cited by Phys Rev B, ignoring references of Phys Rev Lett (see figure 8).These findings indicate that HONs could reflect the true network flows in systems by embedding higher-order dependencies.

Extracting higher-order dependencies in air traffic
We continue by performing SkDM on the U.S. air traffic network to capture significant higher-order dependencies to investigate passengers' flight patterns between different cities using the DB1B dataset.The DB1B dataset contains 19 415 369 itineraries between 464 U.S. airports from the first three quarters of 2011.table 3 shows that 65 212 significant second-order dependencies extracted from the network accounted for 1.47% of all potential 4448 032 second-order dependencies.Moreover, SkDM is faster than MON and PrefixSpan in capturing higher-order dependencies.These results demonstrate that our method is extremely efficient at filtering insignificant higher-order dependencies in systems.Figure 4(a) shows a diagonal line and several horizontal zones composed of significant second-order dependencies marked in saddle brown.The diagonal line suggests that passengers often return to the city they departed from; this is the commuting pattern.Several horizontal zones show that passengers influenced by a previously visited city prefer to travel to several fixed cities.For example, using a Sankey diagram, we find that when passengers come from Atlanta (29), they mostly return to Atlanta or travel to other cities such as Chicago (284), Branson (50), Baltimore (69), Detroit (124), Newark (146), Houston (197), and Minneapolis (312; figure 4(b)).Those patterns demonstrate significant second-order dependencies in the air traffic network: a passenger's travel to the next city depends on the currently and previously visited cities.
To embed significant second-order dependencies into the network model, we reconstruct the higher-order U.S. air traffic network containing 405 first-order nodes, 9842 second-order nodes, and 10 676 directed edges with transition probabilities larger than 0.3 (figure 4(c)).Each second-order dependency v i → v j → v k is represented as a directed edge v j |v i → v k from a second-order node v j |v i to a first-order node (a target city) v k .For example, the directed edge from Chicago|Atlanta to Atlanta, Chicago|Atlanta→Atlanta indicates the second-order dependency Atlanta→Chicago→Atlanta. Figure 4(c) shows several clusters centered around different cities (such as Atlanta, Chicago, Los Angeles, Dallas, and Denver).There are many directed edges from second-order nodes to the first-order node in a cluster; this demonstrates that the airports in those cities are busy transportation hubs that serve a large passenger flow in the U.S. The results are consistent with reality: As shown in the list of the busiest airports by passenger traffic distributed by Airports Council International, Hartsfield-Jackson Atlanta International Airport, O'Hare International Airport in Chicago, Los Angeles International Airport, Dallas/Fort Worth International Airport, and Denver Based on significant second-order dependencies, we reconstruct a higher-order U.S. air traffic network composed of 405 first-order nodes (a target city v k ), 9842 second-order nodes (a city v j given a city v i , denoted as v j |v i ), and 10 676 directed edges (v j |v i → v k ) with transition probabilities greater than 0.3.Several clusters are centered around different cities (such as Atlanta, Chicago, Los Angeles, Dallas, and Denver), represented in distinct colors.In addition, two clusters composed of second-order nodes pointing to Atlanta and Dallas are represented in detail.International Airport were the top 5 busiest airports in the U.S. in 2011 (The world's top 100 airports: listed, ranked and mapped).

Extracting higher-order dependencies in email communications
We finally perform SkDM on the Enron email dataset to capture significant higher-order dependencies in human communications.The Enron email dataset consists of 116 525 messages sent between 146 senior executives of Enron from 1999 to 2003.Table 3 shows that 1299 significant second-order dependencies extracted from the network accounted for 1.28% of all 94 176 potential second-order dependencies.Moreover, SkDM is faster than MON and PrefixSpan in capturing higher-order dependencies.This demonstrates that our method is adept at selecting vital higher-order dependencies from large datasets.In addition, the percentage of significant dependencies in 2948 existing second-order patterns is as high as 44.06%, which shows a strong second-order Markov effect in email communications.Furthermore, in figure 5(a), we can observe a diagonal line marked by significant second-order dependencies.This suggests a significant commuting pattern in email communications between executives; that is, the next receiver to which an email is forwarded is influenced by the email's previous sender.When an executive received an email, the executive tended to reply to the sender and forward the received email to other executives.A Sankey diagram additionally describes the emails forwarded by John J. Lavorato (47), a chief operating officer of Enron Americas (figure 5(b)).We find that after receiving emails from Lavorato, most users replied to Lavorato and forwarded the received messages to others.Based on significant second-order dependencies, we construct the higher-order Enron email network, including 131 first-order nodes (an executive k), 645 second-order nodes (an executive j given an executive i, denoted as j|i), and 1299 directed edges (j|i→k; figure 5(c)).Using the Louvain method, second-order nodes j|i and its target node k form several tightly knit clusters in figure 5(c).Furthermore, we select 13 clusters centered around different executives and present their structures in detail.As shown in figure 5(c), the conditions i of most second-order nodes j|i are the same as the target nodes k they point to in a cluster.For example, when 16 second-order nodes point to the target node 47|, 14 of them take 47 as conditions, denoted as j|47 (approximately 88%).This also confirms the significant commuting pattern in email communications; that is, an individual j most likely sends an email to an individual i if they received an email from j.

Conclusion and discussion
Given that the exponentially increased number of higher-order dependencies introduces problems such as high computational complexity and 'state space explosion' , in this paper, we propose a statistical test framework-significant k-order dependencies mining (SkDM)-based on hypothesis testing and the MCMC to identify significant higher-order dependencies from systems.The simulation results indicate that our method can capture all embedded higher-order dependencies at different preset significance levels of α = 0.01, 0.05, 0.1 and calculate the p-values of all higher-order dependencies.While existing state-of-the-arts extract higher-order dependencies with high Type I and Type II error rates, our method demonstrates a robust capability for accurately identifying all significant dependencies, maintaining a low Type I error rate and without generating any Type II error across diverse experimental settings.Empirical results on three real-world networks (the APS citation network, the U.S. air traffic network, and the Enron email network) demonstrate that our method can eliminate maximumly 98.7% insignificant higher-order dependencies in large datasets.By capturing significant dependencies, our method can greatly reduce the computational complexity of network analysis tasks.It is worth noting, that in human communications, the proportion of significant dependencies in existing higher-order patterns is considerable (44.06%).Moreover, using SkDM, HON models can precisely describe true network flow patterns in systems, such as commuting patterns, where the network flow often returns to the component it previously visited.
Further research may consider the following two aspects.On the one hand, SkDM can be generalized to select different orders of dependencies from complex systems such as third-order or fourth-order dependencies.Additional research should determine how to reconstruct HONs, which could simultaneously embed variable orders of dependencies.On the other hand, higher-order Markov models offer a temporal-topological perspective for understanding complex systems.It is essential to extend SkDM to investigate how significant higher-order dependencies influence the dynamical processes taking place on time-varying systems, such as epidemic spreading and diffusion.
the Hunan Science and Technology Plan Project (2020TP1013, 2023JJ40685), and the Innovation Team Project of Colleges in Guangdong Province (2020KCXTD040).
J X L and X L have contributed equally to this work.X L conceived the project.J X L and X L designed the models and methods.J X L and X L performed the experiments and wrote the paper.The authors declare no competing interests.

Appendices
Appendices include:

Figure 2 .
Figure 2. Type I error rates at different significance levels (α = 0.01, 0.05, 0.10) by SkDM.The horizontal axis represents different values of parameter α.The left y-axis shows the significance level and the right y-axis shows the probability of making a type I error.

Figure 4 .
Figure 4. Significant second-order dependencies extracted from the U.S. air traffic network.(a) Significant second-order dependencies at the significance level of α = 0.01 are highlighted in saddle brown.The horizontal axis represents the target cities in second-order dependencies, such as Baltimore and Detroit.The vertical axis represents the series of the first two cities in second-order dependencies, such as Branson|Atlanta.(b) A Sankey diagram describes the traffic volume of passengers departing from Atlanta.(c) Higher-order U.S. air traffic network.Based on significant second-order dependencies, we reconstruct a higher-order U.S. air traffic network composed of 405 first-order nodes (a target city v k ), 9842 second-order nodes (a city v j given a city v i , denoted as v j |v i ), and 10 676 directed edges (v j |v i → v k ) with transition probabilities greater than 0.3.Several clusters are centered around different cities (such as Atlanta, Chicago, Los Angeles, Dallas, and Denver), represented in distinct colors.In addition, two clusters composed of second-order nodes pointing to Atlanta and Dallas are represented in detail.

Figure 5 .
Figure 5. Significant second-order dependencies extracted from the Enron email network.(a) Significant second-order dependencies at the significance level of α = 0.01 are highlighted in saddle brown.The horizontal axis represents the target receivers in second-order dependencies.For example, John J. Lavorato is the target receiver of the second-order dependency John J. Lavorato→Greg L. Whalley→John J. Lavorato.The vertical axis represents the series of the first two individuals in second-order dependencies, such as Greg L. Whalley|John J. Lavorato.(b) A Sankey diagram represents the emails forwarded by John J. Lavorato.(c) Higher-order Enron email network.The higher-order Enron email network comprises first-order nodes (an executive i), second-order nodes (an executive j given an executive i, denoted as j|i) and directed edges that take the number of emails between nodes as their edge weights.Each executive is indexed by a number, e.g. the index of John J. Lavorato is 47.Using the Louvain method, there are several clusters centered on different executives, represented in different colors.The details of clusters centered around B. Rapp (9), K. Watson (72), M. Lokay (90), S. Harris (77), R. Hayslett (134), M. Heard (113), S. Shackleton (107), M. Taylor (58), L. Kitchen (2), and John J. Lavorato (47) are also shown.

Figure 6 :
Distribution of z-scores and p-values of dependencies in the CK1 dataset.

Figure 7 :
Distribution of z-scores and p-values of dependencies in the CK2 dataset.

Figure 8 :
The Sankey diagram of the first-order APS citation network.

Figure 6 .
Figure 6.Distribution of z-scores and p-values of dependencies in the CK1 dataset.(a) The distribution of z-scores of all potential second-order dependencies is closely approximated by a normal distribution.(b) The distribution of p-values of all potential second-order dependencies.

Figure 7 .
Figure 7. Distribution of z-scores and p-values of dependencies in the CK2 dataset.(a) The distribution of z-scores of potential second-order dependencies (without twelve elaborately designed second-order dependencies on pages 25, 51 and 77) is closely approximated by a normal distribution.(b) The distribution of p-values of potential second-order dependencies (without twelve elaborately designed second-order dependencies on pages 25, 51 and 77).

Figure 8 .
Figure 8.The Sankey diagram of the first-order APS citation network.Different colors represent different journals in the APS citation network.The thickness of each edge is proportional to the citation flow that passes through it.

MON p d Table 9 .
= 0.10 p d = 0.20 p d = 0.30 N(d) = N (TS) + N (FS) Second-order dependencies captured by the BuildHON+ method in the CK2 dataset with different values of p d .BuildHON+ p d = 0.10 p d = 0.20 p d = 0.30 N(d) = N (TS) + N (FS)

Table 1 .
Second-order dependencies captured by four algorithms in the CK2 dataset with different values of p d .N(d) represents the number of patterns captured by each method.α and β represents the type I error rate and type II error rate of the four methods, respectively.

Table 2 .
Second-order dependencies captured by the PrefixSpan algorithm in the CK2 dataset with different values of p d and minSup.

Table 3 .
Second-order dependencies captured by four algorithms in three real-world datasets.N(d) represents the number of patterns captured by each method.

Table 8 :
Second-order dependencies captured by the MON method in the CK2 dataset with different values of p d .

Table 9 :
Second-order dependencies captured by the BuildHON+ method in the CK2 dataset with different values of p d .

Table 10 :
Basic statistics of three real-network datasets.

Table 11 :
Journals in the APS dataset.

Table 8 .
Second-order dependencies captured by the MON method in the CK2 dataset with different values of p d .

Table 10 .
Basic statistics of three real-network datasets.|V| and |E| is the number of components and directed pairwise links.The domain is the research field of a dataset.