Identifying key spreaders in complex networks based on local clustering coefficient and structural hole information

Identifying key spreaders in a network is one of the fundamental problems in the field of complex network research, and accurately identifying influential propagators in a network holds significant practical implications. In recent years, numerous effective methods have been proposed and widely applied. However, many of these methods still have certain limitations. For instance, some methods rely solely on the global position information of nodes to assess their propagation influence, disregarding local node information. Additionally, certain methods do not consider clustering coefficients, which are essential attributes of nodes. Inspired by the formula for mass (density multiplied by volume), this paper introduces a method called Structural Neighborhood Centrality (SNC) that takes into account the neighborhood information of nodes. SNC measures the propagation power of nodes based on first and second-order neighborhood degrees, local clustering coefficients, structural hole constraints, and other information, resulting in higher accuracy. A series of experiments conducted on 12 real-world datasets demonstrates that, in terms of accuracy, SNC outperforms methods like CycleRatio and KSGC. Additionally, SNC demonstrates heightened monotonicity, enabling it to distinguish subtle differences between nodes. Furthermore, when it comes to identifying the most influential Top-k nodes, SNC also displays superior capabilities compared to the aforementioned methods. Finally, we conduct a detailed analysis of SNC and discuss its advantages and limitations.


Introduction
The advancements in network science have propelled the analysis of complex networks into a prominent and burgeoning question within academia [1][2][3][4][5][6]. Within this realm of study, the identification of key nodes has persistently remained a crucial and foundational issue. Given the heterogeneity of network structures, it often transpires that a small number of nodes exert significant influence within a network, and certain nodes can even wield substantial impact on the structure and functionality of the network [7]. For instance, in networks modeling the spread of infectious diseases, the phenomenon of 'super-spreaders' is prevalent. These super-spreaders possess the capability to infect a majority of individuals in an exceptionally short span, thus augmenting the challenges of disease control and prevention. Rapidly pinpointing such 'super-spreaders' could potentially truncate the transmission pathways of the disease. Presently, various facets of life involve the issue of identifying key nodes, such as disease prevention and control [8], safeguarding vulnerable nodes in trade networks [9], and curtailing the propagation of rumors in online social networks [10]. Consequently, the efficient and accurate localization of key nodes within networks constitutes a matter of significant practical importance.
Currently, a multitude of methods for identifying crucial nodes in complex networks have been proposed [7,11], and researchers have introduced some novel approaches in recent years [12][13][14]. However, existing methods often neglect the clustering coefficient of nodes, which is an essential attribute [15]. Additionally, the majority of methods are designed around two principles: that nodes with higher degrees exert greater influence, and that nodes closer to the center possess higher influence. In reality, however, degree and node position are not always as decisive for influence as these principles assume. A typical example is the bridge node with low degree but high influence. If a node serves as a bridge between two communities in the network, its significance is self-evident. Similarly, if a node occupies a structural hole position at the network's periphery, it will also wield substantial influence.
Mass is a fundamental physical quantity in physics. Roughly speaking, it can be understood that objects with greater mass possess higher energy potential. In Newtonian mechanics, the magnitude of gravitational force is also directly proportional to mass. For instance, in systems like the solar system or other celestial arrangements, stars are typically the most massive entities. The computation of mass involves multiplying material density by volume. Inspired by this formula, we propose a novel method for identifying key nodes, which we term Structural Neighborhood Centrality (SNC). Its time complexity is related to the network's average degree, $n$, and the number of edges, and is represented as $O(n \cdot |E|)$. The principle behind SNC is straightforward. Initially, nodes are conceived as physical entities, with a node's local clustering coefficient regarded as its density, and its degree as its volume. The product of these attributes yields the node's mass. If a node's neighborhood includes numerous high-mass nodes, it suggests that the node itself possesses greater mass, enabling it to attract these high-mass nodes. Consequently, such a node may occupy a more significant position within the system. Beyond the aforementioned considerations, we also account for the structural hole position of nodes, as nodes with high mass are not inherently situated in vital locations.
In summary, unlike other methods, SNC incorporates three attributes: node degree, the less frequently considered local clustering coefficient, and structural hole information. Experimental results demonstrate that SNC exhibits enhanced accuracy and monotonicity, and that it also facilitates a more effective identification of the most influential top-k nodes.
The structure of this paper is organized as follows: section 2 reviews some relevant literature; section 3 presents baseline methods and details the proposed SNC method; section 4 describes the experiments conducted to assess the performance of SNC and provides the details of these experiments; section 5 analyzes the experimental outcomes and evaluates the SNC method; finally, the conclusion section summarizes the content of this paper.

Related works
In recent years, researchers have devised numerous methods for identifying key nodes in complex networks. Ma et al incorporated the universal law of gravitation from physics, where the k-shell (KS) value of a node was regarded as its mass, and the shortest distance between nodes was considered as the distance between them. Based on this, they proposed a key node identification method that simultaneously accounted for node positions and path information [16]. However, treating the KS value of a node as its mass seemed somewhat strained in Ma's approach. Therefore, Li et al made some improvements to Ma's method and introduced a novel key node identification approach known as LGM [17]. Li et al primarily made two modifications: first, they considered the node degree as node mass, and second, they expanded the calculation scope to the network's cutoff radius. Nevertheless, Li et al only treated the node degree as node mass, resulting in a relatively single-factor consideration, which inevitably weakened the performance of LGM.
The KS method and its improved variants are another effective approach for identifying key nodes in network analysis. The KS calculation is straightforward and accurately positions nodes within the network, determining their influence based on their network position [18]. However, a critical limitation of the KS method is its inability to differentiate the importance of same-layer nodes. Therefore, researchers have focused on enhancing the precision of the KS method by addressing this drawback. Zeng and Zhang made initial attempts in this direction by introducing the mixed degree decomposition (MDD) method, which considers the contribution of deleted nodes and their edges to the network [19]. MDD improves the accuracy of the KS method but increases computational costs, limiting its applicability to large-scale networks. Subsequently, Bae et al proposed the CNC+ method [20], which narrows the consideration to the core numbers of neighboring nodes, thus enhancing K-shell's precision. Following a similar approach, Li also proposed CN, which improved K-shell's accuracy by considering neighborhood information [21]. CN significantly enhances K-shell's precision and maintains the same time complexity. Additionally, other methods have been proposed, including the Layered KS introduced by Zareie and Sheikhahmadi [22], the KS Iteration Factor suggested by Wang et al [23], and approaches by Magi et al aimed at refining K-shell's accuracy [24,25].
In addition to refining the KS method, exploring important nodes based on the inherent information within the network structure is also a research direction worthy of thorough investigation. Koene first introduced the concept of degree centrality where nodes with higher degrees are deemed more important [26]. Subsequently, Chen et al proposed the concept of semi-local centrality, which considers both direct and indirect neighbors' impact on a node's importance [27]. In recent years, several novel methods have emerged. Tulu et al incorporated node degree information and community structure to assess node importance and proposed the CbM method [28]. Wang et al introduced the EFFC method, which evaluates node importance by measuring the change in information propagation efficiency upon node removal, thus considering information flow efficiency within the network [29]. Salavati et al combined community detection algorithms and closeness centrality (CC), introducing a new node importance ranking method that reduces computational complexity while rapidly identifying highly influential nodes in the network [30]. Fei et al presented an approach that combines node local information and shortest paths, determining node importance through the summation of the interactive forces between nodes [31]. Ullah et al considered both node-specific local information and the network's overall topology, proposing the LGC method for node importance identification, which significantly enhances the algorithm's recognition capability [32].
Another approach to studying important node identification methods is to integrate techniques from other disciplines. For instance, Wang et al combined information entropy of nodes with the KS method to rank node influence within networks [33]. Xu et al considered information entropy of nodes' different neighborhoods to distinguish the significance of various nodes [34]. Additionally, Fan et al pioneered the integration of reinforcement learning to discover the key nodes of complex networks with the Finder method, offering a novel paradigm for complex network research [35]. Yu et al merged graph convolutional networks, transforming the problem of identifying important nodes in complex networks into a regression challenge, resulting in an innovative approach for uncovering significant network nodes [36]. Du et al incorporated statistical techniques, utilizing the TOPSIS model to fuse multiple node importance algorithms for assessing node influence [37]. Building upon this, Kuo et al simplified the TOPSIS approach and introduced new node importance ranking methods, yielding novel ranking criteria [38]. Moreover, Yang et al integrated aspects of possibility theory to devise a method for computing $(k, \eta)$-cores in uncertain graphs. Although Yang's approach is not specifically aimed at identifying important nodes in complex networks, its underlying principles also offer valuable insights [39].

Proposed method
This section begins with an introduction to some node centrality metrics, followed by a detailed exposition of the Structural Neighborhood Centrality (SNC) method proposed in this paper.

CC
CC is a classic method for identifying important nodes in complex networks. CC considers the average shortest path length from each node to all other nodes [40]. In essence, CC posits that nodes closer to other nodes are of higher importance. While CC is highly accurate, its computational complexity limits its application in networks with a large number of nodes. The formula for CC is defined as follows:
$$CC_i = \frac{N - 1}{\sum_{j \neq i} d_{ij}}$$
where $d_{ij}$ represents the shortest path distance from node $i$ to node $j$, and $N$ is the number of nodes in the network.
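As an illustration of the idea behind CC, the following minimal sketch computes the closeness of one node with a breadth-first search. It assumes an unweighted graph stored as an adjacency dictionary and normalises by the number of reachable nodes, which is one common convention and not necessarily the one used in [40]; the function name `closeness_centrality` and the path graph are hypothetical.

```python
from collections import deque


def closeness_centrality(adj, source):
    """Closeness of `source`: (reachable nodes) / (sum of shortest-path distances).

    `adj` maps each node to an iterable of its neighbours (unweighted graph).
    The normalisation by the number of reachable nodes is an assumption.
    """
    dist = {source: 0}
    queue = deque([source])
    while queue:  # plain breadth-first search
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    total = sum(dist.values())
    if total == 0:  # isolated node: no other node is reachable
        return 0.0
    return (len(dist) - 1) / total


# Hypothetical path graph 0 - 1 - 2 - 3
path = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
scores = {v: closeness_centrality(path, v) for v in path}
```

One breadth-first search per node is what makes the measure expensive on large networks, in line with the limitation noted above.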

KS
The KS method, proposed by Kitsak et al [41], is a classic approach to rank the importance of nodes in complex networks based on their network position. The fundamental idea behind KS is to recursively remove nodes with degrees less than or equal to k in the network. For example, KS starts by removing nodes with degrees of 1 or lower, and subsequently eliminates nodes with degrees of 1 or lower that are generated after the previous removal. This process continues until no nodes with degrees of 1 or lower remain in the network. Nodes that are removed are then assigned a ks value of 1, while the remaining nodes in the network have degrees of at least 2. The procedure is repeated to determine nodes with ks values of 2, and this process continues iteratively until all nodes in the network are assigned ks values. The KS method offers a straightforward principle and a computationally efficient process. However, a major limitation of the method is its inability to precisely differentiate the importance of nodes within the same shell.
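The peeling procedure described above can be sketched as follows. This is a minimal illustration, assuming an undirected graph stored as a dictionary of neighbour sets; the function name `k_shell` and the example graph are hypothetical, and the code is not taken from [41].

```python
def k_shell(adj):
    """Assign a ks value to every node by iterative peeling.

    `adj` maps each node to the set of its neighbours (undirected graph).
    """
    degree = {v: len(adj[v]) for v in adj}
    remaining = set(adj)
    ks = {}
    k = 0
    while remaining:
        # The next shell index is the smallest remaining degree; it starts at 1,
        # so nodes of degree 1 or lower fall into the first shell.
        k = max(k, min(degree[v] for v in remaining), 1)
        stack = [v for v in remaining if degree[v] <= k]
        while stack:
            v = stack.pop()
            if v not in remaining:
                continue  # already peeled through another neighbour
            remaining.discard(v)
            ks[v] = k
            for u in adj[v]:
                if u in remaining:
                    degree[u] -= 1
                    if degree[u] <= k:  # removal exposed a new low-degree node
                        stack.append(u)
    return ks


# Hypothetical graph: a triangle (0, 1, 2) with a two-node tail 0 - 3 - 4
graph = {0: {1, 2, 3}, 1: {0, 2}, 2: {0, 1}, 3: {0, 4}, 4: {3}}
shells = k_shell(graph)
```

In this example the tail nodes 3 and 4 share one ks value and the triangle nodes share another, which illustrates why nodes within the same shell cannot be told apart.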

H-index (HI)
In complex networks, the HI of a node applies the concept of the H-index to determine node importance by considering the degrees of neighboring nodes [42][43][44]. Its definition is as follows:
$$H_i = \mathcal{H}\left(d_{v_1}, d_{v_2}, \ldots, d_{v_j}\right)$$
In this equation, $v_1, v_2, \ldots, v_j$ represent the neighbors of node $i$, and $d_{v_1}, d_{v_2}, \ldots, d_{v_j}$ represent the degrees of these neighboring nodes. $\mathcal{H}$ is a function that returns the largest value $h$ such that, in the set $\{d_{v_1}, d_{v_2}, d_{v_3}, \ldots, d_{v_j}\}$, there are at least $h$ values greater than or equal to $h$.
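A minimal sketch of this definition is given below, assuming an adjacency dictionary of neighbour sets; the helper names `h_index` and `node_h_index` and the example graph are hypothetical.

```python
def h_index(values):
    """Largest h such that at least h of the values are >= h."""
    ordered = sorted(values, reverse=True)
    h = 0
    for rank, value in enumerate(ordered, start=1):
        if value >= rank:
            h = rank
        else:
            break
    return h


def node_h_index(adj, node):
    """H-index of `node`, computed over the degrees of its neighbours."""
    return h_index(len(adj[u]) for u in adj[node])


# Hypothetical graph: a triangle (0, 1, 2) with a two-node tail 0 - 3 - 4
graph = {0: {1, 2, 3}, 1: {0, 2}, 2: {0, 1}, 3: {0, 4}, 4: {3}}
hi_scores = {v: node_h_index(graph, v) for v in graph}
```

Because the result is a small integer, many nodes typically receive the same HI value.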

Collective influence (CI)
CI is also a method for assessing node importance in a network, introduced by Morone and Makse [45]. CI determines node influence based on the extent of disruption to the giant connected component in the network when nodes are removed. Its definition is as follows:
$$CI_l(i) = (d_i - 1) \sum_{j \in \partial B(i, l)} (d_j - 1)$$
where $\partial B(i, l)$ denotes the frontier of the ball of radius $l$ centered on node $i$, that is, the set of nodes at distance $l$ from $i$, $d_i$ denotes the degree of node $i$, and $l$ is a pre-defined value typically set to 3 in large and medium-sized networks, and 2 in small networks.

CycleRatio (CR)
Cycle ratio was proposed by Fan et al [46]. The theoretical foundation of CR is the network's cycle structure, where the term 'cycle ratio' refers to the degree to which a node participates in the shortest cycles of other nodes. The term 'shortest cycle' refers to the smallest-length cycle containing the node. The definition of the cycle ratio is straightforward and given as follows:
$$r_i = \begin{cases} 0, & c_{ii} = 0 \\ \sum_{j,\, c_{ij} > 0} \dfrac{c_{ij}}{c_{jj}}, & c_{ii} > 0 \end{cases}$$
In this equation, $c_{ij}$ represents the number of shortest cycles in the network that pass through both nodes $i$ and $j$, while $c_{ii}$ represents the number of shortest cycles in the network that pass through node $i$.

KSGC
KSGC was introduced by Yang and Xiao [47]. Many algorithms for assessing node importance overlook a node's position within the network, whereas Yang et al regarded it as a crucial attribute of node importance. Thus, Yang et al devised an improved gravitational model based on the KS method. Its definition is as follows:
$$KSGC(i) = \sum_{j \neq i,\; d_{ij} \leqslant R} c_{ij} \, \frac{d_i d_j}{d_{ij}^2}$$
where $c_{ij}$ is a constraint coefficient, defined as $c_{ij} = e^{(ks_i - ks_j)/(ks_{max} - ks_{min})}$, $ks_i$ represents the ks value of node $i$, $d_i$ is the degree of node $i$, $d_{ij}$ represents the distance between nodes $i$ and $j$, $ks_{max}$ and $ks_{min}$ are the maximum and minimum ks values in the network, and $R$ denotes the network's truncation radius, which is half of the average shortest path length in the network.

Structural neighborhood centrality

Clustering coefficient
Clustering coefficient is a fundamental concept in graph theory, describing the degree to which nodes tend to cluster together. The clustering coefficient of a specific node in a graph is referred to as its local clustering coefficient, which quantifies the extent to which neighboring nodes form clusters or cliques [15]. The local clustering coefficient of a node is defined as the ratio of the number of actual edges between its neighbors to the number of potential edges that could exist among them, as outlined below:
$$C_i = \frac{2E}{k(k - 1)}$$
Here, $k$ represents the number of nodes within the neighborhood of node $i$, and $E$ denotes the actual number of edges that exist between nodes within the neighborhood.
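The ratio described above can be computed directly from an adjacency structure. The following is a minimal sketch, assuming an undirected graph stored as a dictionary of neighbour sets; the function name `local_clustering` and the example graph are hypothetical, and nodes with fewer than two neighbours are assigned a coefficient of 0 by convention.

```python
from itertools import combinations


def local_clustering(adj, node):
    """Fraction of neighbour pairs of `node` that are themselves connected."""
    neighbours = set(adj[node])
    k = len(neighbours)
    if k < 2:
        return 0.0  # convention: no neighbour pair exists
    links = sum(1 for u, v in combinations(neighbours, 2) if v in adj[u])
    return 2.0 * links / (k * (k - 1))


# Hypothetical graph: a triangle (0, 1, 2) with a two-node tail 0 - 3 - 4
graph = {0: {1, 2, 3}, 1: {0, 2}, 2: {0, 1}, 3: {0, 4}, 4: {3}}
clustering = {v: local_clustering(graph, v) for v in graph}
```

In this example node 1 sits inside the triangle and has coefficient 1, while node 3 lies on the tail and has coefficient 0.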

Structural hole
A structural hole refers to the gap formed between two nodes in a network where no redundant connections exist [48]. Nodes that occupy a structural hole position in a network have a competitive advantage, which can manifest in various ways. For example, individuals occupying a structural hole position can allow other individuals to get to know each other or compete with each other. To quantify the control that structural hole nodes exert over various relationships within the network, Burt introduced the concept of network constraint, defined as follows:
$$SH_i = \sum_{j \in N_i} \left( \mu_{ij} + \sum_{q \neq i, j} \mu_{iq} \mu_{qj} \right)^2$$
where node $q$ is a common neighbor of nodes $i$ and $j$, and $\mu_{ij}$ denotes the proportion of the total energy invested by node $i$ to maintain the relationship with node $j$, which is defined as follows:
$$\mu_{ij} = \frac{e_{ij}}{\sum_{m \in N_i} e_{im}}$$
In the above equation, $N_i$ is the set of neighboring nodes of node $i$. The value of $e_{ij}$ is 1 when there is an edge between $i$ and $j$, and 0 otherwise.
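As a worked illustration of network constraint, the sketch below evaluates Burt's measure on an unweighted graph, where the proportion of energy that node i invests in each neighbour reduces to one over the degree of i. The function name `burt_constraint` and the example graph are hypothetical, and the code follows the standard form of Burt's constraint, which is assumed to be the one intended here.

```python
def burt_constraint(adj, i):
    """Burt's network constraint of node `i` in an unweighted, undirected graph.

    `adj` maps each node to the set of its neighbours.
    """

    def mu(a, b):
        # Share of a's relational energy invested in b (uniform over neighbours).
        return 1.0 / len(adj[a]) if b in adj[a] else 0.0

    total = 0.0
    for j in adj[i]:
        # Indirect investment through every other neighbour q of i;
        # the term vanishes unless q is also a neighbour of j.
        indirect = sum(mu(i, q) * mu(q, j) for q in adj[i] if q != j)
        total += (mu(i, j) + indirect) ** 2
    return total


# Hypothetical graph: a triangle (0, 1, 2) with a two-node tail 0 - 3 - 4
graph = {0: {1, 2, 3}, 1: {0, 2}, 2: {0, 1}, 3: {0, 4}, 4: {3}}
constraint = {v: burt_constraint(graph, v) for v in graph}
```

Node 0 connects the triangle to the tail and obtains a lower constraint than node 1, whose neighbours are all connected to one another.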

Structural neighborhood centrality (SNC)
Consider a simple graph $G(V, E)$, where $V$ represents the set of nodes in the graph, and $E$ represents the set of edges. We define $N_{v1}$ as the first-order neighborhood set of a node in $V$ and $N_{v2}$ as the second-order neighborhood set of that node. The computation of SNC involves two main steps. The first step calculates the mass value $mass_i$ and the structural hole coefficient $\omega_i$ of each node, given by equations (9) and (10) respectively, where $d_i$ denotes the degree of node $i$ and $C_i$ denotes the local clustering coefficient of node $i$. The second step calculates the nc value of each node.
In the second step, the nc value of node $i$ is calculated from terms of the form $mass_j \times mass_i \times \omega_j$ (11), where $\omega_j$ represents the structural hole coefficient of node $j$ within the neighborhood of node $i$. It is noteworthy that, similar to the concept of CR, we posit that a node also exerts an influence on itself, so in our method the node itself is also included in the calculation. Ultimately, by incorporating the structural hole coefficient onto the foundation of the nc value, we obtain the significance score, denoted as the snc value. In equation (9), a higher local clustering coefficient of a node indicates a stronger tendency for nodes to cluster together, and the function $y = e^x$ amplifies the impact of the clustering coefficient on the node. In equation (10), as node $i$ occupies more structural holes, its structural hole constraint $SH_i$ becomes smaller, leading to a higher structural hole coefficient $\omega_i$.
Unlike most traditional node centrality measures, SNC takes into account both degree and local clustering coefficient, driven by two primary reasons. Firstly, neighbors of nodes with high clustering coefficients tend to cluster together. This not only enhances the robustness of local network structures but also promotes the efficiency of information propagation within the network. However, it is worth noting that nodes with high clustering coefficients may also restrict the range of information dissemination while speeding up its transmission. This happens because, once nodes cluster together to form a community structure, information might primarily circulate within these communities, rather than spreading beyond them. In such scenarios, if SNC only considers the clustering coefficient of nodes, it could lead to localized information propagation. By incorporating both degree and clustering coefficient, we can enhance network robustness and compensate for the limitations of clustering coefficient in measuring node influence. Secondly, the local clustering coefficient signifies the tendency of neighboring nodes to cluster together, and this might be influenced by higher-order neighboring nodes. Therefore, considering the clustering coefficients of first and second-order neighbors is akin to simultaneously taking into account the interaction information of nodes' third-order or even higher-order neighbors. This allows us to incorporate as much information as possible without increasing algorithm complexity significantly, thus compensating for some local centrality measures' limitations in terms of inadequate information consideration.
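Two ingredients of the method, the node mass and the first- and second-order neighborhoods, can be sketched as below. This is an illustration under stated assumptions and not the authors' implementation: the mass is assumed to take the form of the degree multiplied by the exponential of the local clustering coefficient, which is one reading of the density-times-volume analogy, and the second-order neighborhood is assumed to be the union of the neighbor sets of the first-order neighbors, so that it may contain the node itself. The function names and the edge list are hypothetical.

```python
import math
from itertools import combinations


def local_clustering(adj, node):
    """Fraction of neighbour pairs of `node` that are themselves connected."""
    neighbours = set(adj[node])
    k = len(neighbours)
    if k < 2:
        return 0.0
    links = sum(1 for u, v in combinations(neighbours, 2) if v in adj[u])
    return 2.0 * links / (k * (k - 1))


def mass(adj, node):
    """ASSUMED form: volume (degree) times density (e raised to the clustering)."""
    return len(adj[node]) * math.exp(local_clustering(adj, node))


def neighbourhoods(adj, node):
    """First-order set, and ASSUMED second-order set (neighbours of neighbours)."""
    first = set(adj[node])
    second = set()
    for u in first:
        second |= adj[u]  # may contain `node` itself and first-order neighbours
    return first, second


# Hypothetical edge list around a node labelled 9
edges = [(9, 1), (9, 12), (1, 12), (1, 2), (1, 3), (12, 5), (12, 6)]
adj = {}
for a, b in edges:
    adj.setdefault(a, set()).add(b)
    adj.setdefault(b, set()).add(a)

first, second = neighbourhoods(adj, 9)
```

With this edge list, a node can appear in both neighborhood sets, which is the situation where a direct and an indirect influence on the same target are both taken into account.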
As mentioned earlier, in the SNC method, a node's degree is regarded as its volume, and its clustering coefficient is seen as its density. The product of volume and density yields mass. Mass is a significant physical quantity in physics; typically, the greater an object's mass, the greater its energy, potentially rendering it an important element within a system. If a node is surrounded by numerous high-mass nodes, its mass may be higher, leading to a potentially more significant role within the system. Thus, we consider that a higher snc value for a node signifies greater importance.
The time complexity of the SNC method is determined mainly by two components. The first involves the calculation of the clustering coefficient and structural holes, while the second entails gathering information within the node's second-order neighborhood. Both the computation of structural holes and clustering coefficient necessitate considering a node's first-order neighbors and the edges connecting them. Consequently, the time complexity for this part is $O(n \cdot |E|)$, where $n$ represents the network's average degree. On the other hand, the collection of information within the second-order neighborhood depends only on the degrees of a node and its first-order neighbors, leading to a time complexity of $O(n^2)$. Generally, the number of edges in a network far exceeds the average degree of nodes. As a result, the time complexity of SNC is $O(n \cdot |E|)$.
Figure 1 illustrates an example network with 12 nodes and 17 edges. When computing the SNC value for each node, we commence by determining the mass value for each node within the network. Subsequently, the first-order and second-order neighbor nodes of the current node are identified. Taking node 9 as an illustration, its first-order neighbors are nodes 1 and 12, while its second-order neighbors include nodes 1, 2, 3, 5, 6, 9, and 12. It is notable that node 1 functions as both a first-order neighbor and a second-order neighbor to node 9. So, when calculating the SNC value, node 1 should not be counted only once. This is because node 1 can exert direct influence on node 9 and can also indirectly impact node 9 through its connection with node 12. Therefore, a comprehensive assessment of these influences is imperative when calculating the SNC value.
(5) Jazz: a collaborative network between jazz musicians [52]. (6) C_elegans: the neural network of the nematode Caenorhabditis elegans [15]. (7) USAir97: the US air transportation network [50]. (8) NS_GC: a collaborative network of scientists [51]. (9) Email: the email communication network of the Universitat Rovira i Virgili, Spain [53]. (10) Yeast: a protein-protein interaction network in yeast [54]. (11) Minnesota: the Minnesota road network [50]. (12) Kohonen: a citation network from Pajek [50]. Table 1 provides additional information about these networks.

Susceptible-Infectious-Recovered (SIR) model
The SIR propagation model [55,56] can faithfully simulate the spread of infectious diseases across networks, making it one of the standards for comparing node propagation ranking methods. In the standard SIR model, nodes can exist in three states: Susceptible (S), Infected (I), and Recovered (R). At the outset of the SIR model, all nodes are in the S state. At this time, a node is chosen as the initial spreading seed, transitioning its state to I. In each propagation round, an infected node (I) has a probability β of infecting each susceptible node within its neighborhood, changing its state from S to I. Additionally, infected nodes transition to the R state with a probability λ (λ = 1), after which they remain unchanged. Propagation terminates when there are no nodes in the I state. At this point, the count of nodes in the R state is computed, representing the spreading capability of the initial infected node. To mitigate potential stochasticity, multiple simulation runs are often conducted, recording data from each run and subsequently averaging the results to determine the actual propagation capability of nodes.
In this study, each node in the network is evaluated as a seed node to determine its propagation capability. The propagation capability test for each node is performed 100 times and the resulting data are averaged to reduce the impact of randomness. During the SIR simulation, it is necessary to set the magnitude of the propagation probability β. If β is too large, nodes are easily infected, causing most of the nodes in the network to become infected, which makes it difficult to differentiate nodes based on their propagation ability. On the other hand, if β is too small, the seed nodes will not be able to infect other nodes, resulting in most nodes remaining uninfected, which also makes it difficult to distinguish nodes based on their propagation ability. Therefore, in SIR simulation experiments, the value of β is usually set slightly larger than the propagation threshold $\beta_{th} = \langle k \rangle / \langle k^2 \rangle$, where $\langle k \rangle$ is the average degree of the network and $\langle k^2 \rangle$ is the second moment of the degree distribution, that is, the mean of the squared degrees.
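The simulation protocol described above can be sketched as follows. This is a minimal illustration, assuming an undirected graph stored as a dictionary of neighbour sets, a recovery probability of 1, and a threshold computed from the mean degree and the mean squared degree; the function names, the `runs` parameter, and the example graph are hypothetical.

```python
import random


def epidemic_threshold(adj):
    """Propagation threshold <k> / <k^2> (mean degree over mean squared degree)."""
    degrees = [len(adj[v]) for v in adj]
    mean_k = sum(degrees) / len(degrees)
    mean_k2 = sum(d * d for d in degrees) / len(degrees)
    return mean_k / mean_k2


def sir_outbreak_size(adj, seed, beta, rng):
    """Run one SIR process from `seed` and return the final number of R nodes."""
    infected = {seed}
    recovered = set()
    while infected:
        newly_infected = set()
        for u in infected:
            for v in adj[u]:
                susceptible = (
                    v not in infected
                    and v not in recovered
                    and v not in newly_infected
                )
                if susceptible and rng.random() < beta:
                    newly_infected.add(v)
        recovered |= infected  # recovery probability is 1: one round in state I
        infected = newly_infected
    return len(recovered)


def spreading_capability(adj, seed, beta, runs=100, rng=None):
    """Average outbreak size over `runs` independent simulations."""
    rng = rng or random.Random(0)
    total = sum(sir_outbreak_size(adj, seed, beta, rng) for _ in range(runs))
    return total / runs


# Hypothetical graph: a triangle (0, 1, 2) with a two-node tail 0 - 3 - 4
graph = {0: {1, 2, 3}, 1: {0, 2}, 2: {0, 1}, 3: {0, 4}, 4: {3}}
beta = 1.01 * epidemic_threshold(graph)
capability = {v: spreading_capability(graph, v, beta) for v in graph}
```

Ranking the nodes by `capability` yields the reference list against which a centrality measure is compared.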

Evaluation criteria

Kendall correlation coefficient
The Kendall tau correlation coefficient is used to evaluate the accuracy of the SNC method by measuring the correlation between the ranked list obtained by the SNC method and the ranked list obtained by the SIR simulation experiment. A higher Kendall tau correlation coefficient indicates a stronger similarity between the two lists, which implies a higher accuracy of the method.
The Kendall tau correlation coefficient is defined as follows [46]:
$$\tau(X, Y) = \frac{n_+ - n_-}{\sqrt{(n_+ + n_- + t_x)(n_+ + n_- + t_y)}}$$
where $X$ and $Y$ represent the sorted lists of node influence obtained by two different methods, and $n_+$ and $n_-$ denote the number of concordant and discordant pairs, respectively. Consider two sorted lists $X = (x_1, x_2, \ldots, x_n)$ and $Y = (y_1, y_2, \ldots, y_n)$. From these two lists, $n$ pairs $(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)$ can be formed. For any two pairs $(x_i, y_i)$ and $(x_j, y_j)$, if they are ordered in the same way, i.e. $x_i < x_j$ and $y_i < y_j$, or $x_i > x_j$ and $y_i > y_j$, then they are said to be concordant; if they are ordered in opposite ways, they are discordant. In addition, if $x_i = x_j$ and $y_i \neq y_j$, the value of $t_x$ is increased by 1. Similarly, if $x_i \neq x_j$ and $y_i = y_j$, the value of $t_y$ is increased by 1. In particular, if $x_i = x_j$ and $y_i = y_j$, the pair belongs to none of the above cases and no value is changed.
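The pair-counting rules above translate directly into code. The sketch below is a minimal quadratic-time illustration, assuming that the concordant, discordant, and tied counts are combined in the tie-corrected (tau-b) form; the function name `kendall_tau` is hypothetical.

```python
from itertools import combinations
from math import sqrt


def kendall_tau(x, y):
    """Tie-corrected Kendall tau between two equally long score lists."""
    n_plus = n_minus = t_x = t_y = 0
    for (xi, yi), (xj, yj) in combinations(list(zip(x, y)), 2):
        dx, dy = xi - xj, yi - yj
        if dx == 0 and dy == 0:
            continue  # tied in both lists: no counter changes
        elif dx == 0:
            t_x += 1  # tied only in x
        elif dy == 0:
            t_y += 1  # tied only in y
        elif dx * dy > 0:
            n_plus += 1  # concordant pair
        else:
            n_minus += 1  # discordant pair
    denom = sqrt((n_plus + n_minus + t_x) * (n_plus + n_minus + t_y))
    return (n_plus - n_minus) / denom if denom else 0.0


# Hypothetical centrality scores and simulated spreading capabilities
centrality = [4.0, 3.0, 2.0, 1.0]
spreading = [3.9, 2.5, 2.7, 1.1]
tau = kendall_tau(centrality, spreading)
```

For large networks an $O(n \log n)$ implementation from a statistics library would be preferable to this quadratic loop.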

Monotonicity
The primary purpose of assessing the monotonicity of ranking results is to determine whether a particular ranking method can effectively differentiate between the influence of individual nodes. Ranking methods with low monotonicity, such as KS or HI, may group nodes into the same rank, suggesting that the propagation abilities of various nodes are considered to be the same. However, there are often subtle differences between nodes, so a good ranking method should be capable of accurately measuring the propagation ability of each node, rather than roughly dividing them. The ordering monotonicity M is defined as follows [57]:
$$M(L) = \left[ 1 - \frac{\sum_{i \in L} s_i (s_i - 1)}{|V| \, (|V| - 1)} \right]^2 \qquad (14)$$
where $|V|$ denotes the number of nodes in the network, $L$ is the sorted list formed by the important node sorting method, and $s_i$ denotes the number of nodes of rank $i$ in the sorted list. M takes values in the range [0,1], and the closer its value is to 1, the stronger the monotonicity.
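A short sketch of this measure is given below, assuming that nodes with equal scores share a rank; the function name `monotonicity` and the score lists are hypothetical.

```python
from collections import Counter


def monotonicity(scores):
    """Monotonicity M of a ranking induced by a list of node scores."""
    n = len(scores)
    if n < 2:
        return 1.0  # a single node is trivially distinguishable
    # s * (s - 1) counts the ordered pairs of nodes sharing the same rank.
    ties = sum(s * (s - 1) for s in Counter(scores).values())
    return (1.0 - ties / (n * (n - 1))) ** 2


# Hypothetical scores: a coarse ranking and a fully resolved one
coarse = [1, 1, 2, 3]
fine = [0.4, 1.7, 2.2, 3.1]
m_coarse = monotonicity(coarse)
m_fine = monotonicity(fine)
```

A ranking in which every node has a distinct score reaches M = 1, while a ranking that assigns all nodes the same score gives M = 0.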

Jaccard similarity coefficient
The Jaccard similarity coefficient is commonly used to measure the similarity between two finite sets, with a higher Jaccard similarity coefficient indicating a greater similarity between the sets. In this study, we use this coefficient to evaluate the ability of SNC to identify Top-k nodes in a network, ranked by their propagation ability [20,58,59]. The Jaccard similarity coefficient is defined as follows:
$$J(X, Y) = \frac{|X \cap Y|}{|X \cup Y|}$$
where $X$ and $Y$ denote the Top-k node influence sorted lists obtained by two different methods. $X$ typically corresponds to the list of Top-k nodes derived from SIR experiments, while $Y$ represents the list of Top-k nodes obtained using another method. In this study, the closer the value of $J(X, Y)$ is to 1, the stronger the ability of method $Y$ to find Top-k nodes.
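The Top-k comparison can be sketched as follows, assuming each method is represented by a dictionary that maps nodes to scores; the helper names `top_k` and `jaccard` and the score values are hypothetical.

```python
def top_k(scores, k):
    """The k nodes with the highest scores, in descending order of score."""
    return sorted(scores, key=scores.get, reverse=True)[:k]


def jaccard(x, y):
    """Jaccard similarity of two collections, treated as sets."""
    x, y = set(x), set(y)
    union = x | y
    return len(x & y) / len(union) if union else 1.0


# Hypothetical scores from an SIR experiment and from a centrality measure
sir_scores = {"a": 9.1, "b": 7.4, "c": 6.0, "d": 2.2}
centrality_scores = {"a": 0.8, "b": 0.3, "c": 0.5, "d": 0.6}
similarity = jaccard(top_k(sir_scores, 2), top_k(centrality_scores, 2))
```

Evaluating the similarity for a range of k values gives a curve of the kind used to compare methods on their Top-k identification ability.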

Results and discussions
This section presents the results of experiments conducted to evaluate the performance of SNC. Section 5.1 describes the results of node importance ranking using each method on a manually designed miniature example network. Section 5.2 examines the accuracy of different methods on a real network using the Kendall tau correlation coefficient. Section 5.3 discusses the monotonicity of different methods, which refers to their ability to distinguish between the propagation abilities of different nodes. Section 5.4 examines the variation in the ability of different methods to identify the Top-k nodes with the highest propagation ability.

A simple experiment on a tiny artificial network
To assess the true spreading influence of nodes, we employed the classic SIR model for node evaluation. In the SIR experiments, each node underwent 100 propagation trials to reduce potential randomness. Figure 2 illustrates the top three nodes identified by each method in the example network from figure 1, while table 2 provides a more detailed ranking list generated by various sorting methods. From figure 2 and table 2, it can be observed that SNC, CI, CC, HI, and KSGC identify the same top three influential nodes as the SIR experiments, although with slight variations in rankings. CR identifies one different node in the top three in comparison to the SIR results, while the KS metric exhibits significant differences. Table 2 further reveals that SNC precisely quantifies the spreading abilities of each node, ensuring that no two nodes share the same spreading capability and exhibiting the highest level of distinguishability. Its ranking list of node influence closely aligns with the results obtained from the SIR method. In contrast, KS, HI, and CR fail to differentiate between nodes' influences effectively, as these methods assign multiple nodes to the same rank, despite potentially substantial variations in their influence levels.

Accuracy of the SNC method
We calculated the Kendall tau correlation coefficient between the ranking lists obtained from the SNC method and the baseline methods and the ranking lists obtained from the SIR experiments. The accuracy of the SNC method was evaluated based on this correlation coefficient. In this subsection, we set six different propagation rates for the SIR propagation experiment: β − 0.05 or β − 0.01, β, β + 0.05, β + 0.10, β + 0.15, and β + 0.20, where β = 1.01 × $\beta_{th}$, and the lowest propagation rate was set to β − 0.01 when β − 0.05 ⩽ 0. The experimental results are presented in table 3, appendix A and figure 3. It can be seen that the SNC method achieved the highest Kendall tau correlation coefficient and the best accuracy when the propagation rate is β. In the Lesmis, USAir97, and Polbooks networks, the SNC method maintained the highest accuracy at all five contagion probabilities. The KSGC method had higher accuracy in some networks but lower accuracy in others because it only considered node degree and did not take into account other local node information. In the Yeast and Dolphins networks, the CI method had the highest accuracy, but the difference between the SNC and CI methods was not significant. In the Minnesota network, CI opens a significant gap over the SNC method. This is because the Minnesota network has a higher propagation rate, making nodes more easily infected, which weakened the performance of the SNC method. The CR method had the lowest accuracy in all datasets because the nodes identified by CR were different from those identified by SIR. CR identifies important nodes that are evenly distributed throughout the network, while other methods tend to identify important nodes that are clustered together, forming a distinct association structure.

Monotonicity of the SNC method
Many contemporary node importance ranking methods face the challenge of accurately distinguishing the importance of each node. A good ranking method should not only identify the high-impact nodes but also uniquely determine the propagation ability of each node. The SNC method addresses this issue by considering not only the neighborhood information of nodes but also their clustering coefficients and SHs. Table 4 shows the monotonicity results of different methods. KS and HI have the worst monotonicity, while SNC, KSGC, CI, and CR perform better; SNC has the best monotonicity, with an average value closest to 1. In figure 4 and appendix B, if a certain rank appears more frequently in the ranking result, more nodes share that rank, and the method distinguishes nodes poorly; if the ranking result forms a straight line at the bottom, few nodes share each rank, and the method distinguishes nodes well. It can also be seen from figure 4 and appendix B that the node ranking obtained by the SNC method forms a straight line at the bottom, indicating that it has only a limited number of nodes in each rank and can differentiate between nodes effectively. In fact, SNC, KSGC, CI, and CR methods all partially or fully consider the local [...] identify the Top-10 nodes is weaker than that of HI, and its ability to identify the Top-45 nodes is weaker than that of CI. However, on the whole, the curve of the SNC method remains consistently at the highest position in almost all networks, with its similarity coefficients higher than those of other methods. Therefore, SNC not only accurately assesses the importance of each node in the network but also identifies the Top-k most influential nodes in the network.
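The two quantities discussed here can be sketched in a few lines. The monotonicity function below assumes the commonly used definition M = (1 − Σ_r n_r(n_r − 1) / (n(n − 1)))², where n_r is the number of nodes sharing rank r; the paper's exact formula is not restated in this section, so treat this as an assumption. The Top-k comparison uses the Jaccard similarity of the two Top-k node sets. All function names are hypothetical.

```python
from collections import Counter


def monotonicity(scores):
    """Monotonicity of a ranking (assumed definition); 1.0 means no ties at all.

    `scores` maps node -> score. Nodes with exactly equal scores share a rank.
    """
    n = len(scores)
    ties = Counter(scores.values())
    tied_pairs = sum(count * (count - 1) for count in ties.values())
    return (1 - tied_pairs / (n * (n - 1))) ** 2


def top_k(scores, k):
    """Set of the k highest-scoring nodes."""
    return set(sorted(scores, key=scores.get, reverse=True)[:k])


def jaccard_top_k(scores_a, scores_b, k):
    """Jaccard similarity between the Top-k sets of two rankings."""
    a, b = top_k(scores_a, k), top_k(scores_b, k)
    return len(a & b) / len(a | b)


if __name__ == "__main__":
    coarse = {"a": 2, "b": 2, "c": 1, "d": 1}        # k-shell-like, many ties
    fine = {"a": 0.9, "b": 0.7, "c": 0.4, "d": 0.1}  # all scores distinct
    print(monotonicity(coarse), monotonicity(fine))
    print(jaccard_top_k(coarse, fine, 2))
```

A method that assigns many nodes to the same rank drives the monotonicity toward 0, which is the behavior reported for KS and HI.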

Further analysis of SNC
While we have tested various aspects of SNC's performance and discussed some of its advantages in the preceding sections, we still lack a comprehensive understanding of the underlying principles behind SNC. Therefore, it is necessary for us to delve further into the analysis of SNC. Sensitivity analysis can help identify the influence of network topology characteristics on SNC. Statistical analysis can determine the significance and stability of SNC scores, providing reliable grounds for the interpretability of the SNC metric. Additionally, correlation analysis can establish the relationships between SNC and other centrality measures, aiding in our comprehension of the strengths and limitations of SNC compared to traditional centrality metrics. Hence, this section primarily focuses on further analyzing SNC from these three perspectives.

Sensitivity analysis
The centrality scores of nodes are often influenced by the network's topological features, making the generality of a method across different networks an important property. In this section, we will discuss the performance of SNC under different link densities and degree distributions, comparing it with centrality methods like CI, CR, and KSGC.
To assess the impact of different link densities on SNC and other methods, we constructed three ER random networks, all with 1000 nodes but with average degrees of 5, 10, and 15, as shown in figure 6(A). As the average degree increases, the network becomes denser, and the accuracy of these methods decreases. In sparse graphs, SNC's accuracy is slightly lower than that of CI, but it surpasses KSGC and CR. In other scenarios, SNC consistently exhibits the highest accuracy, although its margin over KSGC and CI is not significant. CR is somewhat unique, as the influence of link density on it seems to lack a clear pattern. This is because CR is designed based on cycle structures within the network and is unrelated to the network's link density. To test the influence of degree distributions on these centrality measures, we constructed three classic random graph models: ER graphs, Watts-Strogatz small-world networks (WS), and Barabási-Albert scale-free networks (BA). These three models represent completely random, small-world, and scale-free degree distribution patterns, respectively. All of these networks consist of 1133 nodes and have close to 5451 edges. Figure 6(B) illustrates that CI performs best in the BA network, while in other cases, SNC achieves the highest accuracy.
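The link-density experiment relies on ER graphs with a prescribed average degree. A minimal sketch of such a generator is shown below, assuming the G(n, p) model with p = ⟨k⟩ / (n − 1); the function names are hypothetical and the paper's actual generator is not specified in this section.

```python
import random


def er_graph(n, avg_degree, seed=None):
    """G(n, p) random graph as an adjacency dict, with p chosen so that the
    expected average degree equals avg_degree (assumption: p = <k> / (n - 1))."""
    rng = random.Random(seed)
    p = avg_degree / (n - 1)
    adj = {v: set() for v in range(n)}
    for u in range(n):
        for v in range(u + 1, n):
            if rng.random() < p:
                adj[u].add(v)
                adj[v].add(u)
    return adj


def average_degree(adj):
    """Realized average degree of an adjacency dict."""
    return sum(len(nbrs) for nbrs in adj.values()) / len(adj)


if __name__ == "__main__":
    # Three densities as in the sensitivity experiment (smaller n for speed).
    for k in (5, 10, 15):
        g = er_graph(300, k, seed=k)
        print(k, round(average_degree(g), 2))
```

The realized average degree fluctuates around the target, which is one reason results are averaged over many random networks.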
Indeed, due to SNC's consideration of both the node and its neighbors' local clustering coefficients, it tends to identify important nodes that are strongly clustered together. As a result, important nodes identified by SNC often aggregate and form a community structure. Consequently, when dealing with sparse graphs, SNC will rely primarily on degree to assess node importance, which can significantly diminish its performance.
Furthermore, specific network structures can also undermine SNC's performance. Consider a network structure like the one depicted in figure 7. In figure 7, the node propagation capability sequence given by the SIR propagation model is 1, 7, 2, 5, 8, etc., while the sequence given by the SNC method is 7, 2, 5, 8, etc., which fails to recognize the importance of node 1, a bridge node. The reason for this phenomenon is that SNC considers not only a node's own structural hole information but also the structural hole information of its neighboring nodes. Indeed, node 1 occupies the most structural hole positions, with a structural hole coefficient of 0.30, the highest among all nodes. However, nodes 2 and 7 also have high structural hole coefficients of 0.29. Therefore, according to SNC's measurement criteria, nodes 2 and 7, which have more neighbors, are identified as more important bridge nodes. One feasible solution to this issue is to regulate, by setting parameters, how much a node's local information and its neighbor information each contribute to its importance. However, this approach will increase the time complexity of SNC and require extensive experiments to determine the parameter values.
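To make the two local ingredients concrete, the sketch below computes the local clustering coefficient and Burt's constraint for a node of an unweighted, undirected graph. It assumes the standard Burt definition, c_i = Σ_j (p_ij + Σ_q p_iq p_qj)² with p_ij = 1/k_i for neighbors; the "structural hole coefficient" quoted for figure 7 may be a transformed version of this quantity, so the values here are not expected to match 0.30 or 0.29. The function names are hypothetical.

```python
def local_clustering(adj, i):
    """Fraction of pairs of i's neighbors that are themselves connected."""
    nbrs = list(adj[i])
    k = len(nbrs)
    if k < 2:
        return 0.0
    links = sum(
        1
        for a in range(k)
        for b in range(a + 1, k)
        if nbrs[b] in adj[nbrs[a]]
    )
    return 2 * links / (k * (k - 1))


def burt_constraint(adj, i):
    """Burt's constraint of node i (assumed standard definition).

    Lower values mean the node spans more structural holes. Returns 0.0 for
    an isolated node, which is a convention chosen for this sketch.
    """
    k_i = len(adj[i])
    total = 0.0
    for j in adj[i]:
        direct = 1 / k_i
        # Indirect investment in j through common neighbors q of i and j.
        indirect = sum(
            (1 / k_i) * (1 / len(adj[q]))
            for q in adj[i]
            if q != j and j in adj[q]
        )
        total += (direct + indirect) ** 2
    return total


if __name__ == "__main__":
    # Two triangles joined through a bridge node 0.
    bridge = {
        0: {1, 2, 3, 4},
        1: {0, 2}, 2: {0, 1},
        3: {0, 4}, 4: {0, 3},
    }
    print(burt_constraint(bridge, 0), burt_constraint(bridge, 1))
    print(local_clustering(bridge, 0), local_clustering(bridge, 1))
```

In the toy graph the bridge node has a lower constraint and a lower clustering coefficient than the triangle nodes, which is the kind of contrast a structural-hole term is meant to capture.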

Statistical analysis
In the problem of identifying important nodes in complex networks, various centrality metrics often have dimensionless absolute values. Moreover, due to the varying sizes and diverse structures of networks, we need to use statistical comparisons of relative results to illustrate the empirical properties of nodes or networks. In this paper, we accomplish this task by constructing random networks that share certain properties with empirical networks. Such random networks are also referred to as null models in statistical methods [60,61].
Generally, there are two types of methods for constructing null models of complex networks. One approach involves first calculating certain statistical properties of the network and then generating random networks based on these properties to create null models. While this method is straightforward, it is challenging to construct higher-order null models using it. The other approach involves random edge rewiring based on the original network, aiming to preserve the original network's properties while randomizing it. This method allows for the preservation or disruption of different network properties but is more complex to implement and computationally time-consuming, making it challenging for large-scale networks.
In this paper, we employ the method of constructing random networks to create two types of null models for complex networks. One null model has the same average degree as empirical networks, with all other properties randomized. We refer to this as the '0th-order null model.' The other null model maintains the same degree distribution as empirical networks, with all other properties randomized, and we call this the '1st-order null model.' In fact, higher-order null models could be constructed based on more constraints, such as having the same joint degree distribution, or the same joint degree distribution along with the average clustering coefficient, as empirical networks [61]. However, as this paper's primary focus lies elsewhere and the generation of higher-order null models is more complex, we do not discuss them here, leaving them for future work.
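The two null models can be illustrated with a short sketch. The 0th-order model below draws a random simple graph with the same number of nodes and edges, which fixes the average degree. For the 1st-order model, the sketch substitutes degree-preserving edge swaps (the rewiring approach described above) for the generative construction used in the paper, because swaps keep every node's degree exactly and are easy to verify. Function names and parameters are hypothetical.

```python
import random


def zeroth_order_null(n_nodes, n_edges, seed=None):
    """Random simple graph with the given node and edge counts."""
    if n_edges > n_nodes * (n_nodes - 1) // 2:
        raise ValueError("too many edges for a simple graph")
    rng = random.Random(seed)
    edges = set()
    while len(edges) < n_edges:
        u, v = rng.sample(range(n_nodes), 2)
        edges.add((min(u, v), max(u, v)))
    return edges


def first_order_null(edges, n_swaps, seed=None):
    """Randomize an edge set with double edge swaps, preserving all degrees."""
    rng = random.Random(seed)
    edge_list = sorted({tuple(sorted(e)) for e in edges})
    edge_set = set(edge_list)
    if len(edge_list) < 2:
        return edge_set
    done = attempts = 0
    while done < n_swaps and attempts < 100 * n_swaps:
        attempts += 1
        i, j = rng.sample(range(len(edge_list)), 2)
        (a, b), (c, d) = edge_list[i], edge_list[j]
        if rng.random() < 0.5:
            c, d = d, c
        # Proposed rewiring: (a, b), (c, d) -> (a, d), (c, b).
        new1, new2 = tuple(sorted((a, d))), tuple(sorted((c, b)))
        if a == d or c == b or new1 in edge_set or new2 in edge_set:
            continue  # would create a self-loop or a duplicate edge
        edge_set -= {edge_list[i], edge_list[j]}
        edge_set |= {new1, new2}
        edge_list[i], edge_list[j] = new1, new2
        done += 1
    return edge_set


if __name__ == "__main__":
    empirical = {(i, (i + 1) % 8) for i in range(8)} | {(0, 4), (2, 6)}
    print(sorted(zeroth_order_null(8, len(empirical), seed=0)))
    print(sorted(first_order_null(empirical, n_swaps=10, seed=0)))
```

Generating many such instances and scoring each one yields the score distributions from which confidence intervals are derived.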
To explore the potential impact of certain non-random characteristics in empirical networks on SNC, we constructed 500 instances of 0th-order null models and 500 instances of 1st-order null models. We calculated the distribution of node SNC scores in these null models and obtained confidence intervals for SNC scores. Figure 8 and appendix D depict the distribution of SNC scores on empirical networks. In these figures, each point represents a node in the network. Green nodes indicate SNC scores falling within the confidence interval provided by the 0th-order null model. Blue nodes represent SNC scores within the confidence interval of the 1st-order null model but outside the confidence interval of the 0th-order null model. Red nodes represent SNC scores exceeding the confidence interval of the 1st-order null model. From the figures, it can be observed that in the Jazz, NS_GC, and Yeast networks, most of the nodes have SNC scores falling within the confidence interval of the 0th-order null model. There are also some nodes that exceed the confidence interval of the 0th-order null model but fall within the confidence interval of the 1st-order null model. Furthermore, many nodes surpass the confidence interval of the 1st-order null model. This suggests that SNC is not highly sensitive to the degree of nodes, and certain unique structures and properties existing in real networks may have a greater impact on it. Therefore, SNC can identify important nodes in empirical networks that possess special structures and properties not typically found in random networks. It is worth noting that in the Minnesota network, all nodes' SNC scores fall within the confidence interval of the 0th-order null model, which might be related to the sparsity of the Minnesota network.
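The color coding described above amounts to a three-way classification of each node's score against two intervals. The sketch below assumes an empirical 2.5 to 97.5 percentile interval computed from pooled null-model scores; the paper does not state its confidence level or construction in this section, so both are assumptions, as are the function names. Scores outside both intervals are labeled red here.

```python
def percentile_interval(samples, lower=2.5, upper=97.5):
    """Empirical percentile interval with linear interpolation (assumed construction)."""
    ordered = sorted(samples)

    def pct(q):
        pos = (len(ordered) - 1) * q / 100
        lo = int(pos)
        hi = min(lo + 1, len(ordered) - 1)
        return ordered[lo] + (ordered[hi] - ordered[lo]) * (pos - lo)

    return pct(lower), pct(upper)


def classify(score, ci_zeroth, ci_first):
    """Color of a node given the 0th- and 1st-order null-model intervals."""
    if ci_zeroth[0] <= score <= ci_zeroth[1]:
        return "green"   # explained by average degree alone
    if ci_first[0] <= score <= ci_first[1]:
        return "blue"    # explained once the degree distribution is preserved
    return "red"         # not explained by either null model


if __name__ == "__main__":
    # Hypothetical pooled scores from the two families of null models.
    zeroth_scores = [1.0 + 0.01 * i for i in range(500)]
    first_scores = [1.0 + 0.03 * i for i in range(500)]
    ci0 = percentile_interval(zeroth_scores)
    ci1 = percentile_interval(first_scores)
    for s in (3.0, 9.0, 40.0):
        print(s, classify(s, ci0, ci1))
```

Nodes labeled red are the ones whose scores reflect structure that neither the average degree nor the degree distribution can account for.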
In conclusion, determining which features in empirical network data differ from those in random networks and assessing the contributions of clustering coefficients and structural hole constraints to SNC rely on the analysis of higher-order null models. For example, constructing random networks with the same joint degree distribution and clustering coefficient as the empirical network would allow an analysis of the contribution of structural hole constraints to SNC. This is one of the tasks we are planning to undertake in our future work.

Correlation analysis
Correlation analysis can examine potential associations between different centrality metrics. If centrality metrics yield highly correlated rankings of node importance, it indicates that they provide similar information. Conversely, if they are uncorrelated, it suggests that they provide distinct information. In this study, we conducted correlation tests between SNC and four other centrality metrics (CI, KSGC, CR, and CC) on the Jazz, NS_GC, Yeast, and Minnesota networks, and the results are shown in figure 9 and appendix E.
In figure 9, each point represents a node in the network. The x-axis represents the node's SNC score, the y-axis represents the node's score under another centrality metric, and the color represents the node's spreading capability in the SIR model. Nodes with higher spreading capability are closer to pink-purple in color. From the figure, it can be seen that SNC has some correlation with CI and KSGC. Additionally, the correlation between SNC and KSGC is higher than that between SNC and CI. However, there is almost no correlation between SNC and either CR or CC. Furthermore, nodes with high SNC scores also tend to have high spreading capabilities, which indirectly supports the feasibility of SNC.
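The scatter plots are read visually in the paper; one way to put a number on such a relationship is a Pearson coefficient between the two score vectors, optionally after a log transform when the relationship looks exponential. The sketch below is an illustration under that assumption, not the statistic used by the authors, and `pearson` is a hypothetical helper.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equally long score lists.

    Assumes neither list is constant (otherwise the denominator is zero).
    """
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    norm_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    norm_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (norm_x * norm_y)


if __name__ == "__main__":
    # Hypothetical scores with an exponential relationship.
    scores_a = [1.0, 2.0, 3.0, 4.0, 5.0]
    scores_b = [math.exp(v) for v in scores_a]
    print(pearson(scores_a, scores_b))
    # Taking logs of the second metric linearizes the relationship.
    print(pearson(scores_a, [math.log(v) for v in scores_b]))
```

A markedly higher coefficient after the log transform is consistent with an exponential rather than a linear association between two metrics.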
An interesting observation is the high exponential correlation between SNC and both KSGC and CI. SNC assesses node importance using information about node degree, clustering coefficients, and structural hole constraints. CI relies on the degree information of high-order neighbors, while KSGC utilizes ks values and degree information. In essence, ks values are related to node degrees, whereas clustering coefficients and structural hole constraints are not inherently correlated with ks values. In this context, SNC and KSGC should not exhibit such a high level of correlation. To investigate whether the correlation between SNC and KSGC is driven by degrees, we conducted tests using the 1st-order null model (which preserves the same degree distribution as the original network).
Figure 10 shows the correlation results between SNC and KSGC, as well as SNC and CI, on the 1st-order null model of the Jazz network. It can be observed that SNC and KSGC, as well as SNC and CI, exhibit stronger correlations in the 1st-order null model of the Jazz network. This suggests that in random networks, SNC and KSGC primarily rely on degree information to assess node importance, and the correlation between them is indeed driven by the shared use of degree information. In real networks, the correlation between SNC and these two metrics decreases. This may be due to SNC capturing certain features not present in random networks, such as nodes with high clustering coefficients or nodes occupying more structural hole positions in real networks. Consequently, SNC provides additional information that enhances its accuracy, which is challenging for KSGC to uncover.

Conclusion
Precisely locating highly influential nodes in a network is a problem of significant practical importance. The SNC method proposed in this paper draws inspiration from the mass formula in physics. It not only takes into account a node's degree and structural hole information but also incorporates the often overlooked clustering coefficient. Therefore, SNC not only identifies important nodes within a network but also provides nearly unique measurements of each node's importance. Experimental results from performance testing show that when the transmission rate is β, SNC achieved the highest Kendall tau correlation coefficient in 10 out of 12 networks, and in the remaining two networks, SNC ranked close to the top with a minimal difference from the first place. Thus, overall, SNC exhibits better accuracy. Additionally, SNC exhibits higher monotonicity, with an average monotonicity value of 0.9938, surpassing the baseline methods. This indicates that SNC can accurately measure the importance of each node. In terms of identifying the Top-k most influential nodes within a network, SNC also performs well. While in some networks the accuracy of SNC and its ability to identify the Top-k nodes may be slightly weaker than those of certain methods, overall, SNC's performance remains superior.
Further analysis of SNC reveals that it consistently demonstrates top-tier performance in various networks and has the capability to capture latent structural features in real networks, such as clustering coefficients or structural hole constraints, providing a solid foundation for its high accuracy. Nevertheless, SNC still has some limitations. Firstly, SNC exhibits lower performance in sparse graphs. This is because SNC assesses node importance based on the local clustering coefficients of nodes and their neighbors. In sparse graphs or some low-clustering-coefficient random networks, SNC can only rely on node degrees to assess importance. Secondly, SNC still exhibits a high degree of correlation with KSGC. Compared to KSGC, SNC only provides some specific information that exists in real networks but is absent in random networks. Furthermore, the design philosophy of SNC does not consider the position of nodes within the network. Although node position is not necessarily entirely related to node influence, it remains one of the node's essential attributes. Additionally, SNC only considers a node's second-order neighbors, while higher-order neighbors of nodes may also have an impact, necessitating further exploration.

Figure 1. Example network. This example network has 12 nodes and 17 edges.

Figure 2. Ranking performance of different methods. The three most influential nodes identified by the SIR model are marked in cyan; the three most influential nodes identified by SNC are marked with dashed circles; the nodes identified by KSGC are marked in dark green; the nodes identified by KS are marked with square boxes; and the nodes identified by Cycle Ratio are marked in pink.

Figure 3. The Kendall tau coefficients between various metrics and the SIR experimental results under different infection probabilities for the four networks. In the figure, the black dashed line represents the propagation threshold β_th, the x-axis denotes the infection probability, and the y-axis shows the Kendall tau correlation coefficient. The experimental results on all networks are presented in appendix A.

Figure 5. Jaccard similarity coefficient results for Top-k nodes obtained by different ranking methods in four networks. The x-axis is the number of Top-k nodes and the y-axis is the Jaccard similarity coefficient. The experimental results on all networks are presented in appendix C.

Figure 6. Performance comparison of the four centrality measures across different random networks. All results in the figure represent the averages across 100 random networks.

Figure 7. A specific network structure. This network consists of eight nodes and twelve edges.

Figure 8. The distribution of SNC scores on four empirical networks. In these figures, each point represents a node within the network. Green nodes indicate that SNC scores fall within the confidence interval provided by the 0th-order null model. Blue nodes represent SNC scores within the confidence interval of the 1st-order null model but outside the confidence interval of the 0th-order null model. Red nodes signify SNC scores that exceed the confidence interval of the 1st-order null model. Results for all networks are provided in appendix D.

Figure 9. Correlation results between SNC and other metrics on the Jazz network. Each point in the figure represents a node within the network. The x-axis corresponds to the node's SNC score, while the y-axis represents the node's score under the respective centrality metric. Node colors are determined based on their spreading capabilities in the SIR propagation model, with nodes displaying stronger spreading capabilities appearing closer to shades of pink-purple. Results for the other three networks are provided in appendix E.

Figure 10. Correlation results between SNC and other metrics on the 1st-order null model. Each point in the figure represents a node within the network. The x-axis corresponds to the node's SNC score, while the y-axis represents the node's score under the respective centrality metric. Node colors are determined based on their spreading capabilities in the SIR propagation model, with nodes displaying stronger spreading capabilities appearing closer to shades of pink-purple.
Appendix A. The accuracy performance of different node centrality metrics at various infection probabilities

Appendix B. The rank distribution plots of different node centrality metrics across different networks

Appendix C. The Jaccard similarity coefficient results for different node centrality metrics

Table 1. Basic topological features of network data. The leftmost column is the name of the network; N and E denote the number of nodes and edges of the network, respectively; ⟨K⟩ and Kmax denote the average degree and maximum degree of the network, respectively; C and r denote the clustering coefficient and congruence coefficient of the network, respectively.

In this paper, we investigate the performance of SNC on 12 real networks from different domains. (1) Dolphins: a network of interaction relationships between 62 dolphins [49]. (2) Lesmis: a social network composed of characters from Victor Hugo's book Les Misérables [50]. (3) Polbooks: a network for buying books on American politics [50]. (4) Adjnoun: a network between adjectives and nouns commonly used in Charles Dickens's novel David Copperfield [51]. (

Table 2. Ranking list obtained by different ranking methods. The number below the method name indicates the node number, and the number below the rank indicates the ranking. SIR represents the ranking list of node influence given by the SIR model; SNC is the ranking list of node influence given by the method proposed in this study; CI represents Collective Influence; CR represents Cycle Ratio; KS stands for K-Shell; HI stands for H-index.