Mean clustering coefficients: the role of isolated nodes and leafs on clustering measures for small-world networks

Many networks exhibit the small-world property of the neighborhood connectivity being higher than in comparable random networks. However, the standard measure of local neighborhood clustering is typically not defined if a node has one or no neighbors. In such cases, local clustering has traditionally been set to zero and this value influenced the global clustering coefficient. Such a procedure leads to underestimation of the neighborhood clustering in sparse networks. We propose to include θ as the proportion of leafs and isolated nodes to estimate the contribution of these cases and provide a formula for estimating a clustering coefficient excluding these cases from the Watts and Strogatz (1998 Nature 393 440–2) definition of the clustering coefficient. Excluding leafs and isolated nodes leads to values which are up to 140% higher than the traditional values for the observed networks indicating that neighborhood connectivity is normally underestimated. We find that the definition of the clustering coefficient has a major effect when comparing different networks. For metabolic networks of 43 organisms, relations changed for 58% of the comparisons when a different definition was applied. We also show that the definition influences small-world features and that the classification can change from non-small-world to small-world network. We discuss the use of an alternative measure, disconnectedness D, which is less influenced by leafs and isolated nodes.


Introduction
Many real-world networks show properties of small-world networks as their neighborhood connectivity, generally denoted by the clustering coefficient, is higher than in comparable random networks [1]. The local clustering coefficient for an individual node $i$ with $\mathrm{deg}_i$ neighbors and $\Gamma_i$ edges between its neighbors is

$$C_i = \frac{\Gamma_i}{\mathrm{deg}_i (\mathrm{deg}_i - 1)} \qquad (1)$$

This formula is not defined if the number of neighbors $\mathrm{deg}_i$ is zero or one, as the denominator becomes zero [2]. These cases are usually treated as $C_i = 0$, although some authors also set these values to one [3]. In the current scheme, these values would be part of the global calculation

$$C_1 = \frac{1}{N} \sum_{i=1}^{N} C_i \qquad (2)$$

In addition, we tested an alternative and more widely used definition of the clustering coefficient [4] in which

$$C_2 = \frac{\sum_{i=1}^{N} \Gamma_i}{\sum_{i=1}^{N} \mathrm{deg}_i (\mathrm{deg}_i - 1)} \qquad (3)$$

Including the undefined cases as zeros in $C_1$ might lead to biased assessments of neighborhood clustering, in the sense that values that are not defined (division by zero) should not be included in the averaging. Thus, instead of using $N$ as the number of evaluated nodes for the global $C_1$, a new number $N'$ indicating all nodes with defined local clustering should be used for a global measure $C'$. We show that using such an adjusted measure for the clustering coefficient has several implications for network analysis and can help to identify the contribution of leafs and isolated nodes to average clustering. On a conceptual level, the adjusted value $C'$ is more intuitive, as the clustering coefficient is commonly called a measure of neighborhood connectivity: if 30% of the local coefficients are zeros from cases where no neighbors exist, how can the classical definitions still give information about neighborhood? Cases of leafs and isolated nodes are more likely in sparse networks where the edge density $d$, the number of existing connections divided by the number of possible connections ($d = E/(N(N - 1))$ for a network with $N$ nodes and $E$ directed edges or arcs), is low. Therefore, the classical definition is a mixed measure of neighborhood clustering and sparseness (edge density) or, more precisely, the frequency of leafs and isolated nodes.
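The three global measures can be compared on a toy graph. The following is a minimal sketch, not the authors' code. It assumes an undirected network stored as a dictionary of neighbor sets and counts every undirected link between neighbors as two arcs, so that Γ_i is compatible with the normalization deg_i (deg_i − 1). The function names `gamma` and `clustering_measures` are hypothetical.

```python
from itertools import combinations


def gamma(adj, node):
    """Arcs between the neighbors of `node`; each undirected link counts twice."""
    links = sum(1 for u, v in combinations(adj[node], 2) if v in adj[u])
    return 2 * links


def clustering_measures(adj):
    """Return (C1, C2, C_adjusted, theta) for a non-empty graph {node: set(neighbors)}."""
    n = len(adj)
    defined = []      # local coefficients of nodes with at least two neighbors
    sum_gamma = 0     # numerator of C2
    sum_pairs = 0     # denominator of C2
    for node, nbrs in adj.items():
        k = len(nbrs)
        if k < 2:
            continue  # leaf or isolated node: local coefficient is undefined
        g = gamma(adj, node)
        defined.append(g / (k * (k - 1)))
        sum_gamma += g
        sum_pairs += k * (k - 1)
    theta = (n - len(defined)) / n
    c1 = sum(defined) / n                     # undefined cases enter as zeros
    c2 = sum_gamma / sum_pairs if sum_pairs else 0.0
    c_adj = sum(defined) / len(defined) if defined else 0.0
    return c1, c2, c_adj, theta


# Toy graph: triangle 0-1-2, leaf 3 attached to node 0, isolated node 4.
toy = {0: {1, 2, 3}, 1: {0, 2}, 2: {0, 1}, 3: {0}, 4: set()}
c1, c2, c_adj, theta = clustering_measures(toy)
print(c1, c2, c_adj, theta)
```

For this toy graph, two of five nodes are excluded (θ = 0.4), the classical average is 7/15, the alternative measure is 0.6, and the adjusted average is 7/9.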
A general problem of network measures, such as the clustering coefficient, is whether sampling or perturbations change the values of these measures. Network measures are frequently used for the classification of different networks [5] or of topological changes (addition or deletion of nodes or edges) within the same network. Incomplete sampling, that is, only observing a sub-network of a larger network, can lead to the wrong classification of a network as being a scale-free network [6]. This occurred, for example, when comparing the partial and complete protein-protein interaction networks [7] and the router and underlying communication network [8]. In addition to sampling, false scale-free classifications can also arise due to statistical errors [9]. Whereas previous studies investigated the effect of sampling on the degree distribution, a recent study [10] looked at the sensitivity to sampling and network perturbation for a range of measures: the clustering coefficient, as well as the hierarchical clustering coefficient, the hierarchical degree, and the divergence ratio, were found to be least sensitive to perturbations of the topology. Therefore, classifications using the clustering coefficient (e.g. small-world classification [1]) are less affected by the sampling problem. However, as we show here, the definition of the clustering coefficient can have a considerable effect on network classification.

Networks
We tested the effect of different definitions for the clustering coefficient on several real-world networks. All but one network, the German highway system, were small-world networks. The Caenorhabditis elegans neuronal network consisted of individual neurons as nodes and existing synaptic connections as edges [11]. The metabolic networks of C. elegans, Saccharomyces cerevisiae, and 41 other organisms included metabolic substrates as nodes and reactions as edges [12]. The protein-protein interaction network of S. cerevisiae (yeast) included proteins as nodes and their interactions, as discovered by the yeast two-hybrid method, as edges (http://dip.doe-mbi.ucla.edu, dataset from 2 Dec 2007). The German highway (Autobahn) system consisted of location nodes (that is, highway exits) and road links between them (Autobahn-Informations-System, AIS, from http://www.bast.de) [13]. Only the highway level was included in the analysis, discarding smaller and local roads ('Bundesstrassen' and 'Landstrassen'). For the power grid, nodes represent generators, transformers and substations, and edges represent high-voltage transmission lines between them [1]. For the world-wide web, individual pages are the nodes and links between them the edges [14]. Information about the size of the networks as well as a reference to the source of the datasets is included in Table 1. For comparisons, we also generated random networks with the same number of nodes and edges as the original networks described above. In such Erdős-Rényi random networks [15], the probability $p$ that an individual connection between two nodes is established equals the edge density $d$ of the desired network.
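A random benchmark of this kind can be sketched as follows. This is an illustration under stated assumptions, not the generator used in the study: arcs are directed, self-loops are excluded, and every possible arc is drawn independently with probability p = d, so the number of arcs matches the original network only in expectation. The names `edge_density` and `random_directed_network` are hypothetical.

```python
import random


def edge_density(n_nodes, n_arcs):
    """Edge density d = E / (N (N - 1)) for a directed network without self-loops."""
    return n_arcs / (n_nodes * (n_nodes - 1))


def random_directed_network(n_nodes, n_arcs, seed=None):
    """Draw each possible arc with probability p = d; returns a set of (source, target)."""
    rng = random.Random(seed)
    p = edge_density(n_nodes, n_arcs)
    arcs = set()
    for i in range(n_nodes):
        for j in range(n_nodes):
            if i != j and rng.random() < p:
                arcs.add((i, j))
    return arcs


# Example: benchmark for a network with 100 nodes and 198 arcs (d = 0.02).
benchmark = random_directed_network(100, 198, seed=1)
print(len(benchmark))
```

If the benchmark has to match the number of edges exactly, sampling E distinct arcs without replacement would be the alternative design.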

Adjusted clustering coefficient definition
In addition to the two definitions for neighborhood clustering given in the introduction, we looked at the effect of removing nodes with fewer than two neighbors, corresponding to leafs and isolated nodes, before averaging for the global clustering coefficient. The relation between the new coefficient $C'$ and the traditional measure $C_1$ can be derived as follows. Let $N'$ be the number of nodes with defined local clustering and $\theta = (N - N')/N$ the proportion of leafs and isolated nodes, so that $N' = N(1 - \theta)$. As the excluded nodes contribute $C_i = 0$ to $C_1$, the sum of the local coefficients is the same for both measures:

$$C' = \frac{1}{N'} \sum_{i:\, \mathrm{deg}_i \geq 2} C_i = \frac{1}{N(1 - \theta)} \sum_{i=1}^{N} C_i = \frac{1}{1 - \theta}\, C_1 \qquad (4)$$

Therefore, $f = \frac{1}{1 - \theta}$ is the factor of the increase of the clustering coefficient $C_1$ by using the new method. Unfortunately, there is no easy transformation between the new measure $C'$ and the other measure $C_2$ (e.g. the correlation between the two measures is r = 0.06 for 43 metabolic networks).

Results
What is the effect of the adjusted definition $C'$ above? If one third of local coefficients were undefined, for example, the clustering coefficient would increase by 50%, and it would double if half of the nodes were undefined. For the yeast protein-protein interaction network with 4,931 nodes, the clustering coefficient rose from 14.4% ($C_1$) and 8.4% ($C_2$) to 18.7% for $C'$. That means that the value increased by 30% compared to $C_1$ and more than doubled compared to $C_2$. For several real-world networks (Tab. 1), values of neighborhood connectivity increased by factors between 1.02 and 2.42; that means that the average clustering coefficient increased by up to 142%. This indicates that current definitions significantly underestimate neighborhood clustering. For Erdős-Rényi random networks [15] with the same number of nodes and edges as the yeast protein interaction network, the increase was maximally 4.3% and on average 0.7% for 100 generated networks; thus the new clustering coefficient is still comparable with the edge density of random networks.
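The two hypothetical cases at the start of this paragraph follow directly from the factor 1/(1 − θ). A small sketch, with the hypothetical helper names `increase_factor` and `percent_increase`:

```python
def increase_factor(theta):
    """Factor f = 1 / (1 - theta) by which C' exceeds C1."""
    if not 0 <= theta < 1:
        raise ValueError("theta must lie in [0, 1)")
    return 1.0 / (1.0 - theta)


def percent_increase(theta):
    """Relative increase of the clustering coefficient in percent."""
    return (increase_factor(theta) - 1.0) * 100.0


for theta in (1 / 3, 1 / 2):
    print(round(increase_factor(theta), 3), round(percent_increase(theta), 1))
```

With θ = 1/3 the factor is 1.5 (an increase of 50%), and with θ = 1/2 it is 2 (a doubling).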

Network comparison
In addition to the effect for single networks, measures such as the clustering coefficient are often used for comparing networks. Network comparisons can either involve different original networks or the same network before and after structural perturbations. Previous studies compared the clustering coefficient of 43 metabolic networks [17] and changes in neural correlation networks for Alzheimer [18], schizophrenia [19], and epilepsy [20,21] patients.
For network comparisons, the definition of neighborhood connectivity is critical (Fig. 1). Assume that we have two networks $G_a$ and $G_b$, where the first has higher classical clustering ($C_1$ measure) than the second one, i.e. $C_a > C_b$. Then, this relation will swap for the new definition to $C'_a < C'_b$ if

$$\frac{C_a}{1 - \theta_a} < \frac{C_b}{1 - \theta_b}$$

Let us look at the simpler case where we compare a sparse network with a dense network, still under the assumption that $C_a > C_b$. As the dense network has almost no nodes that are isolated or leafs, we can set $\theta_b = 0$. Then, using the new definition, $C'_b = C_b$ and the condition above reduces to

$$\frac{C_a}{1 - \theta_a} < C_b$$

How often do these swaps occur in real-world networks? We examined the effect of the adjusted definition for the case of comparing sparse networks by analyzing 43 metabolic networks [12]. Testing all 903 distinct relations between pairs of networks, the relations changed, using the adjusted definition, in 58% of the cases for the standard clustering $C_1$. For the alternative, more widely used definition $C_2$, the relation changed in 76% of the cases. Even switching between the traditional definitions $C_1$ and $C_2$ changed the relation in 77% of the cases. Comparing the different measures for all 43 networks, there was a linear correlation between $C'$ and $C_1$ but not between $C'$ and $C_2$ or $C_1$ and $C_2$ (Fig. 2). This indicates that the effect of using a different clustering coefficient definition often cannot be predicted from the value obtained under another definition (the factors of increase for switching from $C_1$ and $C_2$ to $C'$ are shown in Tab. 2).

Another way of comparing neighborhood clustering between networks is the use of clustering coefficient functions. One such clustering coefficient function is $C(k)$, where $k$ is the degree of a node and $C(k)$ is the average clustering coefficient over all nodes with degree $k$ [22,23]. Then, the distributions of $C(k)$ with $k > 1$ for the two networks can be compared. Such a comparison might detect cases where one network shows a linear, exponential, or power-law distribution whereas the other network does not. Comparing two networks with a similar distribution becomes more difficult. Whereas qualitative differences might be visible through comparing the distribution plots, getting a quantitative value for describing these differences is more challenging. Therefore, single values for describing networks will remain popular unless standard ways for distribution comparisons are established.
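The count of changed relations can be illustrated with a short sketch. It assumes that a relation between two networks counts as changed when the ordering of their clustering values (greater, smaller, or equal) differs between two measures; this reading, and the function names, are assumptions rather than the authors' implementation. For 43 networks there are 43 × 42 / 2 = 903 distinct pairs.

```python
from itertools import combinations


def order(x, y):
    """Return 1, 0 or -1 depending on whether x is greater than, equal to, or less than y."""
    return (x > y) - (x < y)


def fraction_of_changed_relations(measure_a, measure_b):
    """Share of network pairs whose ordering differs between two clustering measures."""
    if len(measure_a) != len(measure_b):
        raise ValueError("both measures must cover the same networks")
    pairs = list(combinations(range(len(measure_a)), 2))
    changed = sum(
        1
        for i, j in pairs
        if order(measure_a[i], measure_a[j]) != order(measure_b[i], measure_b[j])
    )
    return changed / len(pairs)


# Made-up values for three networks: only the pair (second, third) swaps its order.
print(fraction_of_changed_relations([0.1, 0.2, 0.3], [0.1, 0.3, 0.2]))
```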

Changes of small-world features
Many real-world networks show features of small-world networks [24]. In these networks, the characteristic path length $L$ remains comparable with random benchmark networks whereas the average connectivity between neighbors (clustering coefficient) $C$ of a node is much higher than for random networks, that is, $L \approx L_{random}$ but $C \gg C_{random}$ [1]. One way to assess the extent of small-world features is calculating the small-worldness $s = (C/C_{random})/(L/L_{random})$ (note that a comparison of small-worldness is only meaningful for similar edge densities, as the edge density influences the possible increase in the clustering coefficient). How do these small-world features, in particular the clustering coefficient component of the small-worldness $s$, change with the definition of the clustering coefficient?
Using the measure $C'$ would lead to higher small-worldness $s$ (the ratio between the clustering coefficient in the original and a comparable random network) if the numerator $C$ for the original networks increases at least as much as the denominator $C_{random}$ for the random benchmark networks. We tested the increase of changing from definition $C_1$ to $C'$, which can be calculated from the ratio $\theta$ (cf. equation 4): a larger value of $\theta$ results in a larger increase of the clustering coefficient. Therefore, the small-worldness would increase as long as $\theta$ is larger for the original than for the random benchmark network.
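The argument can be written out numerically. The sketch below assumes that both clustering values are scaled by 1/(1 − θ) of their own network while the path lengths stay unchanged, so that the small-worldness under the adjusted measure is s (1 − θ_random)/(1 − θ). The function and parameter names are hypothetical and the numbers in the example are made up.

```python
def small_worldness(c, c_random, path_len, path_len_random):
    """s = (C / C_random) / (L / L_random)."""
    return (c / c_random) / (path_len / path_len_random)


def adjusted_small_worldness(s, theta, theta_random):
    """Small-worldness after switching from C1 to the adjusted measure.

    The clustering of the original network grows by 1 / (1 - theta) and that of
    the random benchmark by 1 / (1 - theta_random).
    """
    return s * (1.0 - theta_random) / (1.0 - theta)


s = small_worldness(c=0.3, c_random=0.03, path_len=3.0, path_len_random=2.5)
print(s, adjusted_small_worldness(s, theta=0.2, theta_random=0.1))
```

The adjusted value exceeds s exactly when θ of the original network is larger than θ of the random benchmark.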
We tested the increase for the 43 metabolic networks by generating 50 random networks for each metabolic network and using the maximum value of $\theta$ out of the random networks. For all 43 metabolic networks, $\theta$ was larger for the original network than for random benchmark networks (Fig. 3A).
We also generated artificial small-world networks with 100 nodes and a variable edge density ranging from the minimum (0.5%) to the maximum (2%) value of the metabolic networks. For each edge density, 20 small-world networks were generated and for each such small-world network, 50 comparable random networks were analyzed. In contrast to the previous results of real-world networks, random networks show a higher ratio of leafs and isolated nodes than the generated small-world networks (Fig. 3B). The reason is that the small-world networks were generated starting from a lattice model followed by random rewiring of the network [1]. Despite the rewiring, the strong neighborhood connectivity of the lattice model remains and prevents the occurrence of leafs and isolated nodes.
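For orientation, a lattice-plus-rewiring generator in the spirit of [1] can be sketched as below. This is a generic illustration and not the generator used for Fig. 3B. It assumes an undirected ring lattice in which every node is linked to `k_half` neighbors on each side, and each lattice edge is rewired with probability `p` to a node that is not yet a neighbor. The function name and parameters are hypothetical.

```python
import random


def ring_lattice_rewired(n, k_half, p, seed=None):
    """Ring lattice with random rewiring; returns {node: set(neighbors)}.

    Assumes n > 2 * k_half so that the lattice has no duplicate edges.
    """
    rng = random.Random(seed)
    adj = {i: set() for i in range(n)}
    for i in range(n):
        for offset in range(1, k_half + 1):
            j = (i + offset) % n
            adj[i].add(j)
            adj[j].add(i)
    for offset in range(1, k_half + 1):
        for i in range(n):
            j = (i + offset) % n
            if j in adj[i] and rng.random() < p:
                # New endpoint must differ from i and must not already be a neighbor.
                candidates = [w for w in range(n) if w != i and w not in adj[i]]
                if candidates:
                    w = rng.choice(candidates)
                    adj[i].discard(j)
                    adj[j].discard(i)
                    adj[i].add(w)
                    adj[w].add(i)
    return adj


lattice = ring_lattice_rewired(20, 2, 0.0, seed=0)
rewired = ring_lattice_rewired(20, 2, 0.3, seed=0)
print(sum(len(v) for v in rewired.values()) // 2)  # number of edges is preserved
```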
To remove the effect of the lattice network being the starting point of rewiring, we developed a small-world network generator with inverse rewiring‡: the model starts with a random network and rewires edges so that the connectedness, but also the number of isolated nodes, increases. For a given network with E edges, 10 × E rewiring steps were performed. At each step, an existing edge is chosen and deleted. Thereafter, another existing edge is chosen and the starting node of that edge is connected with a randomly chosen node that has not previously been connected to that node. Each step elongates an existing chain of nodes by adding an edge, potentially leading to the formation of triangles, whereas the removed edge is either the internal or terminal part of a chain, leading towards a leaf node. For this model, in accordance with the results from the real-world metabolic networks, $\theta$ for the generated small-world network was below the value for random networks (Fig. 3C).

‡ The Matlab script is available at http://www.biological-networks.org/
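The rewiring steps can be read literally as the following sketch. It is not the Matlab script mentioned in the footnote, and several details are assumptions made for illustration: the network is undirected, the "starting node" is a randomly chosen end of the second edge, the new partner is drawn uniformly from all nodes not currently linked to it, and a step is undone if no such partner exists. The function name `inverse_rewiring` is hypothetical.

```python
import random


def inverse_rewiring(n_nodes, edges, steps_per_edge=10, seed=None):
    """Apply steps_per_edge * E delete-and-add steps; returns {node: set(neighbors)}."""
    rng = random.Random(seed)
    edge_list = sorted({tuple(sorted(e)) for e in edges})
    adj = {i: set() for i in range(n_nodes)}
    for u, v in edge_list:
        adj[u].add(v)
        adj[v].add(u)
    for _ in range(steps_per_edge * len(edge_list)):
        if len(edge_list) < 2:
            break
        # Step 1: choose an existing edge and delete it.
        a, b = edge_list.pop(rng.randrange(len(edge_list)))
        adj[a].discard(b)
        adj[b].discard(a)
        # Step 2: choose another existing edge and take one end as starting node.
        start = rng.choice(rng.choice(edge_list))
        candidates = [w for w in range(n_nodes) if w != start and w not in adj[start]]
        if not candidates:
            # No admissible partner: restore the deleted edge and skip this step.
            edge_list.append((a, b))
            adj[a].add(b)
            adj[b].add(a)
            continue
        target = rng.choice(candidates)
        edge_list.append(tuple(sorted((start, target))))
        adj[start].add(target)
        adj[target].add(start)
    return adj


ring_edges = [(i, (i + 1) % 10) for i in range(10)]
condensed = inverse_rewiring(10, ring_edges, seed=42)
print(sorted(len(nbrs) for nbrs in condensed.values()))
```

Each step keeps the number of edges constant, so any change in θ comes from how the edges are redistributed.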

Changes of small-world classification
We have seen in the previous section that the small-worldness $s$ of a network increases, or at least stays the same, when the new measure $C'$ is used compared to the classical measure $C_1$. Looking at both measures $C_1$ and $C_2$, could it be the case that networks that were previously classified as random would be classified as small-world with the new measure $C'$? This would be the case if the ratio $C/C_{random}$ is lower than or close to 1 for the classical measures but much higher than 1 for $C'$. Again we tested artificial small-world networks with 100 nodes and a variable edge density using standard and inverse rewiring as described above. For each edge density, 200 networks were generated and 100 benchmark random networks were evaluated for each generated network. The definition of a change in classification from random to small-world was a clustering coefficient ratio ≤ 1 for $C_1$ or $C_2$ and a ratio > 2 when using the measure $C'$. For standard rewiring (Fig. 4A), the fraction of changed cases was zero except for a small range of edge densities where the classification changed in up to 2.5% of the cases when shifting from $C_2$ to $C'$. For inverse rewiring (Fig. 4B), however, classification changed in around 10% of the cases (up to 15% for some edge densities) when shifting from $C_2$ to $C'$ in the edge density range of 0.6%-1.5%. For shifting from $C_1$ to $C'$, classification only changed in up to 1% of the cases, a fluctuation which might be due to the small sample size. In addition, a shift also occurred between the classical measures $C_1$ and $C_2$: whereas changing from $C_1$ to $C_2$ affected few cases, a shift from $C_2$ to $C_1$ affected up to 3% of the cases for standard rewiring and up to 14% for inverse rewiring. Therefore, changes in classification are possible for all clustering measures, in particular when using inverse rewiring.
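The classification criterion stated above (ratio ≤ 1 for a classical measure and ratio > 2 for the adjusted measure) can be expressed as a small sketch. How the ratio is aggregated over the 100 benchmark networks is not specified here, so the function simply takes one ratio per measure and network; the names and the example values are hypothetical.

```python
def changes_to_small_world(ratio_classical, ratio_adjusted, low=1.0, high=2.0):
    """True if C/C_random is at most `low` classically and above `high` for C'."""
    return ratio_classical <= low and ratio_adjusted > high


def fraction_reclassified(ratio_pairs):
    """Share of networks whose classification changes from random to small-world."""
    flags = [changes_to_small_world(rc, ra) for rc, ra in ratio_pairs]
    return sum(flags) / len(flags)


# Made-up ratios (classical, adjusted) for four generated networks.
example = [(0.9, 2.5), (1.2, 3.0), (1.0, 2.0), (0.5, 2.1)]
print(fraction_reclassified(example))
```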

Discussion
We have shown that current definitions underestimate neighborhood clustering in sparse networks with many isolated or leaf nodes. In addition, the outcome of comparisons of the extent of small-world features between different networks critically depended on the applied definition of the clustering coefficient. Furthermore, networks formerly classified as random can be classified as small-world when isolated or leaf nodes are excluded from the calculation of the average clustering coefficient. This can also happen when switching from $C_2$ to $C_1$.
Could the clustering coefficient definitions impact the analysis of small-world networks? There are three consequences of this study. First, networks that are small-world under the previous measures $C_1$ and $C_2$ will still be detected as small-world using $C'$, as this value will be higher than the previous values. Consequently, the small-worldness $s$ (the ratio of the clustering coefficient in the original and random benchmark networks divided by the unchanged ratio of the characteristic path lengths in original and random networks) will be higher. Second, networks which are currently not classified as small-world networks may be regarded as small-world due to the increase in clustering coefficient. This case will occur when the path length is comparable to that of random networks but the clustering coefficient, under the previous definitions $C_1$ and $C_2$, is not significantly higher than that of random networks. Third, comparison of networks could lead to opposite conclusions using the new measure. In conclusion, the novel measure $C'$ gives a clearer view of neighborhood connectivity and depends less on the sparseness (edge density) of the network.
A problem of the proposed measure $C'$ is that the percentage $\theta$ of nodes that are excluded from analysis could be considerable (Tab. 1). The percentage of excluded nodes could be as high as 14% for metabolic networks and as high as 59% for man-made networks (power grid). Note that the value for the protein interaction network in yeast is also high at 56%, as edge density is low and isolated nodes are not part of the largest connected cluster observed here. In general, however, exclusion from the average affected less than 10% of the nodes for most of the networks. In addition, using a subset of defined nodes is comparable to the procedure for calculating shortest paths or the characteristic path length, where unreachable paths with otherwise infinite distance are not included in calculating the average path length.
An alternative solution would be to describe the clustering coefficient using inverse neighborhood clustering. For the shortest paths, for example, the inverse measure of efficiency [25], where unreachable paths contribute 1/∞ = 0 to the local efficiency, circumvents the need for excluding unreachable paths. Similarly, the (neighborhood) disconnectedness $D$ could be defined as

$$D = \frac{1}{N} \sum_{i=1}^{N} D_i \quad \text{with} \quad D_i = \frac{\mathrm{deg}_i (\mathrm{deg}_i - 1)}{\Gamma_i}$$

and $D_i = 0$ for $\Gamma_i = 0$. Here, nodes which are leafs or isolated will contribute a zero value to the average $D$, as one of the factors in the numerator will be zero for these nodes. $D$ will be high when neighbors are not connected and low (→ 1) for high connectivity between neighbors. The correlations between disconnectedness and measures of the clustering coefficient are shown in Figure 5.
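A minimal sketch of the disconnectedness follows, assuming that D_i is the inverse of the local clustering coefficient, deg_i (deg_i − 1)/Γ_i, with D_i = 0 whenever Γ_i = 0, and that D is the average over all N nodes. As before, the network is taken to be undirected, each link between neighbors is counted as two arcs, and the function name is hypothetical.

```python
from itertools import combinations


def disconnectedness(adj):
    """Average of deg_i (deg_i - 1) / Gamma_i over all nodes, with 0 where Gamma_i = 0."""
    total = 0.0
    for node, nbrs in adj.items():
        k = len(nbrs)
        links = sum(1 for u, v in combinations(nbrs, 2) if v in adj[u])
        g = 2 * links  # each undirected link between neighbors counts as two arcs
        if g > 0:
            total += k * (k - 1) / g
    return total / len(adj)


# Complete graph on four nodes: all neighbors are connected, so D reaches its floor of 1.
complete4 = {i: {j for j in range(4) if j != i} for i in range(4)}
# Star: the hub's neighbors are unconnected and the leaves are undefined, so D is 0.
star = {0: {1, 2, 3}, 1: {0}, 2: {0}, 3: {0}}
print(disconnectedness(complete4), disconnectedness(star))
```

Under this convention a hub whose neighbors share no links contributes zero as well, which is worth keeping in mind when interpreting low values of D.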

Conclusion
Including the percentage $\theta$ in publications could help to understand the validity of the applied definition of the clustering coefficient, regardless of whether it is the Watts-Strogatz definition $C_1$, the Newman-Strogatz-Watts definition $C_2$, or the alternative definition $C'$ presented here. In addition, this information is critical for the classification as small-world networks. We therefore suggest that information about the applied definition and the number of leafs and isolated nodes, the ratio $\theta$, should be included in addition to the value of the average clustering coefficient.

Figure 1. Comparison of clustering coefficients for a sparse (left) and a dense (right) network with nine nodes. Whereas the clustering coefficient is higher in the dense network for the standard measure $C_1$, it is higher in the sparse network for the alternative, $C_2$, and adjusted, $C'$, neighborhood clustering. For the adjusted clustering coefficient, isolated nodes or nodes with only one neighbor (indicated here by red circles) are excluded from the averaging.

Figure 3. Change of small-worldness. Using $C'$ leads to higher small-worldness $s$ if dots are below the identity line and to lower small-worldness above the line. (A) $\theta$ for 43 metabolic networks. (B) $\theta$ for small-world networks generated by rewiring starting with a lattice model [1] (inset) with 20 different edge densities (maximum $\theta$ of 50 generated networks each). (C) $\theta$ for small-world networks generated by condensation (inverse rewiring) starting with a random model (inset) with 20 different edge densities (maximum $\theta$ of 50 generated networks each).

Figure 4. Fraction of cases when the network classification changed from random to small-world when switching from $C_1$ (x markers) or $C_2$ (o markers) to $C'$. (A) Small-world networks generated by rewiring starting with a lattice model (inset). (B) Small-world networks generated by condensation starting with a random model (inset).

Figure 5. Relations between disconnectedness $D$ and measures for the clustering coefficient in 43 metabolic networks. Correlations are r = 0.04 between $C_1$ and $D$ (A), r = −0.28 between $C_2$ and $D$ (B), and r = −0.43 between $C'$ and $D$ (C).

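Table 1. Number of nodes $N$, edge density $d$, ratio $\theta$ of nodes with fewer than two neighbors, and factor of increase ($C_1 \to C'$) for several biological and artificial networks: C. elegans neuronal [11] and metabolic network [12], yeast metabolic interaction network [16], yeast protein-protein interaction network (http://dip.doe-mbi.ucla.edu, dataset from 2 Dec 2007), German autobahn system [13], electrical power grid of the western United States [1], and world-wide web [14].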

Figure 2. Relations between different measures for the clustering coefficient in 43 metabolic networks. Whereas there is a linear correlation (r = 0.84) between the new definition $C'$ and $C_1$ (A), there is no correlation between $C'$ and $C_2$ (B, r = 0.004) or $C_1$ and $C_2$ (C, r = −0.06).

Table 2. Ratios $f$ of the adjusted clustering coefficient $C'$ of 43 metabolic networks with the Watts-Strogatz ($C_1$) and Newman-Strogatz-Watts ($C_2$) clustering coefficient.