Exploring 3D community inconsistency in human chromosome contact networks

Researchers have developed chromosome capture methods such as Hi-C to better understand DNA’s 3D folding in nuclei. The Hi-C method captures contact frequencies between DNA segment pairs across the genome. When analyzing Hi-C data sets, it is common to group these pairs using standard bioinformatics methods (e.g. PCA). Other approaches handle Hi-C data as weighted networks, where connected node pairs represent DNA segments in 3D proximity. In this representation, one can leverage community detection techniques developed in complex network theory to group nodes into mesoscale communities containing nodes with similar connection patterns. While there are several successful attempts to analyze Hi-C data in this way, it is common to report and study the most typical community structure. But in reality, there are often several valid candidates. Therefore, depending on algorithm design, different community detection methods focusing on slightly different connectivity features may have differing views on the ideal node groupings. In fact, even the same community detection method may yield different results if using a stochastic algorithm. This ambiguity is fundamental to community detection and shared by most complex networks whenever interactions span all scales in the network. This is known as community inconsistency. This paper explores this inconsistency of 3D communities in Hi-C data for all human chromosomes. We base our analysis on two inconsistency metrics, one local and one global, and quantify the network scales where the community separation is most variable. For example, we find that TADs are less reliable than A/B compartments and that nodes with highly variable node-community memberships are associated with open chromatin. Overall, our study provides a helpful framework for data-driven researchers and increases awareness of some inherent challenges when clustering Hi-C data into 3D communities.


I. INTRODUCTION
Chromosomes' three-dimensional (3D) folded structure is critical to understanding genetic processes and genome evolution.The discovery of these 3D structures relied on the analysis of pan-genome-wide chromosome capture data, represented by a pairwise interaction matrix called Hi-C map [1][2][3] .These maps reveal substructures of different scales, including the dichotomous division into A (active) versus B (inactive) compartments, determined from principal component analysis (PCA) 1 , and smaller-scale topologically associated domains (TADs) identified using the Arrowhead algorithm 3 .Several molecular biologists find these structures appealing because they define DNA regions with correlated gene expression and epigenetic modifications.
In addition, their borders enrich binding sites for architectural proteins such as CTCF (11-zinc finger protein CCCTC-binding factor) in humans and CP190 in Drosophila melanogaster 4 .
As researchers delved deeper into this crucial topic, they discovered that TADs act as shielded 3D domains with more internal than external contacts, similar to the definition of "communities" in network science [5][6][7] , and that some TADs are nested or partially overlapped.Also, studies of A/B compartments showed cross-scale organization as they split into six subclasses (A1, A2, B1, etc.) 3;8 .Furthermore, the authors of this paper developed a network-community-detection method that considers the average contact frequency between distant DNA segments based on their sequence distance.This method revealed a spectrum of mesoscale 3D communities in Hi-C data 9;10 , ranging from A/B compartments to TADs.All these findings point to chromosomes as having a complex multi-scale structure.
Like Hi-C networks, most complex networks have a blend of overlapping communities at different scales, making community detection challenging.This complex-arXiv:2302.14684v1[cond-mat.stat-mech]28 Feb 2023 ity complicates finding statistically significant communities, as most community detection methods rely on the objective function called modularity 6;7 .In this function, the scale is defined as the value of a "resolution parameter" 11 , and most community detection algorithms select its value rather arbitrarily without a principled guideline (see Ref. 12 for the relationship between the resolution parameter and a parameter from another established community detection framework called the stochastic block model).Recently, some researchers (including one of the authors of this paper) have directly acknowledged this fact and tried extracting informative structural properties based on the ensemble of cross-scale "inconsistently" detected communities [13][14][15] .
This paper utilizes two representative inconsistency metrics 14 , one local and one global, to quantitatively assess the scales that provide the most reliable 3D communities in the Hi-C data.The reliability is based on the multi-scale "landscape" of Hi-C communities.In Sec.II, we present the Hi-C data and the associated weighted networks.We also describe our inconsistency analysis framework and community detection method, and how we assign chromatin states to each Hi-C bin by calculating folds of enrichment ratios.In Sec.III, we present our findings and conclude our paper in Sec.IV.

A. Transforming Hi-C data as weighted network
We use the same Hi-C intra-chromosomal contact map as our previous series of studies 9;10 [human cell line GM12878 (B-lymphoblastoid) 3;16 ].Also, as before 9;10 , we use the MAPQG0 data set at the 100 kilobase-pair (kb) resolution and normalize the interaction map with the Knight-Ruiz (KR) matrix balancing 17 .As a result, we treat each 100 kb chromatin locus as the minimal unit, or "node", and the normalized interaction weights between nodes i and j as weighted edges, using network science terminology 5 .

B. Network community detection and inconsistency
One of the most popular ways to detect network communities 6;7 -densely-connected substructures-is to maximize the objective function called modularity1 Here, A ij denotes the adjacency matrix elements corresponding to the interaction weights between nodes i and j (A ij = 0 indicates no edge), and P ij is the expected edge weight based on a priori information.The most popular choice is when considering only the overall tendency of node-node interaction P ij = k i k j /(2m), where k i represents node i's strength (the sum of its weights) and m is a normalization constant ensuring that −1 ≤ M ≤ 1.Finally, g i is the community index of node i and δ is the Kronecker delta.A key parameter in our study is the resolution parameter γ, which controls the overall community scale 11 .
In principle, maximizing the modularity function with respect to all of the possible community divisions, encoded as {g i } in Eq. ( 1), is a mathematically welldefined deterministic concept.However, due to the computational limitation imposed by the problem, it is prohibitively difficult to find the exact solution, e.g., from the comprehensive enumeration of the network divisions.Therefore, most network community detection algorithms rely on various types of approximations or parameter restrictions.Many algorithms take a stochastic approach to sample the community partitions, just as in standard Monte Carlo 18 .One example is the Louvaintype algorithms 19;20 we use here (detailed in Sec.II.C).
Although stochastic approaches like Louvain have been successful in terms of speed and accuracy in many community detection applications, their stochastic nature may produce multiple results that sometimes include inconsistent elements.Researchers tend to work around this inconsistency 21;22 by choosing the most consistent, or reproducible, network partition.However, one of the authors of this paper has turned this inconsistency into an advantage, using it to probe network structural information 13;14 at both global and local levels (Figs. 1 and  2).In particular, by studying inconsistency measures one may pinpoint scale regimes or specific node collections that are the most statistically reliable (at the global level) or flexible (at the local level).For a detailed theoretical framework, we defer to Ref. 14 .But below, we remind the reader of the essential parts used in this analysis.
One metric we study is partition inconsistency (PaI).It quantifies the global degree of inconsistency among community partitions in the entire network (Fig. 1).PaI is based on a recently developed similarity measure 23 S αβ between community configurations α and β.The PaI value Ω(≥ 1) indicates the effective number of independent configurations.A small (large) Ω value represents more consistent (inconsistent) regimes, respectively.Using PaI, one extracts the most statistically reliable ranges of community scales by focusing on the (local) minima of Ω, in particular, alongside another meaningful evidence of stable communities: the number of communities stays flat at a specific integer value [illustrated in Fig. 1b)].
While PaI describes the network's global inconsistency, we use another metric, membership inconsistency (MeI), to quantify local (individual-node) inconsistencies (Fig. 2).MeI represents the effective number of inde- pendent communities for a specific node across different community configurations.As shown in Fig. 2b), the MeI values properly detect the functionally flexible or "bridge" nodes participating in different modules 2 .In Sec.III, we use PaI and MeI to study global and local community inconsistencies of Hi-C maps and relate to these metrics to other biological data. 2 The MeI measure introduced in Ref. 14 is a more principled and improved measure than the original "companionship inconsistency (CoI)" measure first introduced in Ref. 13 , by considering the possibility of more than two community memberships.

C. GenLouvain method
The stochastic community detection method we utilize throughout our work is version 2.1 of GenLouvain 20 (see https://github.com/GenLouvain/GenLouvain for the latest version).GenLouvain is a variant of the celebrated Louvain algorithm 19 , which is one of the most widely used algorithms and is popular due to its speed and established packages in various programming languages.Starting from single-node communities, the algorithm accepts or rejects trial merging processes based on the modularity change in a greedy fashion.To determine the community stability across network scales, we run GenLouvain several times, at least 100, for each resolution parameter γ and then calculate the global (PaI) and local (MeI) inconsistency metrics.

D. Cross-scale node-membership correlations
We use GenLouvain to produce an ensemble of community partitions from Hi-C data, for fixed scale parameters γ.However, some of these partitions seems correlated.To better understand these correlations, we use a graphical embedding technique designed to illustrate high-dimensional data on a 2D plane.Specifically, we use t-SNE (t-distributed stochastic neighbor embedding) 24 .
t-SNE is a general framework that aggregates data points based on some distance metric.While there are several choices, we use the so-called correlation distance D, which is common for random vectors and defined as where r(u, v) is the correlation between the vectors u and v, conventionally defined as where ū and v are the mean of the elements (so that ū According to Eq. ( 2), D = 0 if they are perfectly correlated (r = 1) and D ≈ 1 if they are uncorrelated (r ≈ 0).In our analysis, we create these vectors from 100 Gen-Louvain runs.Each vector is a Boolean representation of the node-community membership for one node (each element is either 1 or 0) depending on whether the node belongs to a specific community at a particular GenLouvain iteration.For our analysis, we used scikit-learn 25 .To reach the best visualization results, we tuned several parameters : perplexity: 20, early exaggeration: 8,initialization: random, and number of iterations: 1356.
The start and stop regions for these 15 HMM do not match perfectly with the Hi-C bins.To classify every Hi-C bin into one of these HMM states, we calculate the folds of enrichment (FE) relative to a chromosome-wide average according to the following steps.
1. Count the number of peaks k X per bin, where X = S1, . . ., S15.Because some peaks span multiple bins, we only count the peak starts.
2. Calculate the peak frequency's expected value using the hypergeometric test (chromosome-wide sampling without replacement).The expected number of X peaks per bin is calculated as k X = K X × (n/N ), where, n is the number of peaks of any state in a bin, N is the total number of peaks per chromosome, and K X is the total number of peaks for state X.

Calculate the folds of enrichment FE X for each
HMM state X per bin by dividing the observed by the expected peak number, FE X = k X / k X .
We note each Hi-C bin can be enriched in several chromatin states.Based on enrichment, we divide Hi-C bins into five groups (A-D) if FE X > 1:  Apart from these five groups, we assign bins that are not enriched in any state to the category "NA".

A. Local community inconsistency
We illustrated the local inconsistency associated with a single Hi-C map in Fig. 3.This map depicts the number of contacts between all 100 kb DNA-segment pairs (KR normalized) in human chromosome 10.Along the diagonal, we highlight the GenLouvain-derived communities 20  at resolution parameter γ = 0.75.Squares sharing colors have the same community membership.These colors are better illustrated in the stripe above the map, showing how communities appear along the linear DNA sequence.We note that some scattered segments have the same color, which indicates that communities assemble DNA segments in 3D proximity, not only 1D adjacent neighbors.This contrasts the conventional notion of TADs, which comprise contiguous DNA stretches.To separate notations, we denote unbroken units of DNA stretches in a community as domains.
It is essential to realize that the domains and communities in Figs.3a) and 3b) represent a single configuration, or partition, of Hi-C network communities at one specific resolution parameter value γ.Since GenLouvain uses a stochastic maximization algorithm, we expect to find other partitions if running it several times on the same data set, some of which may differ substantially.To quantify this variability, we generated 100 independent network partitions and calculated the local inconsistency measure MeI 14 that quantifies how many different community configurations a single node effectively belongs to.We plot the MeI profile along chromosome 10 in Fig. 3c).This profile shows that about half of the domains do not change community membership (the median value of MeI = 1.02), whereas the rest show significantly more variability (MeI ≈ 4).We also note that the MeI score is relatively uniform within each domain and that sharp MeI transitions occur near domain boundaries.
Based on previous work 14; 22 , we anticipate that the MeI profile changes with the network scale.Therefore, we scanned through a wide range of community scales, extracted 3D communities, and calculated the MeI profile.We show the result from such a sweep in Fig. 4a), where each MeI profile is associated with one γ value.We note that some DNA regions have low MeI scores (γ > 0.6), which indicates that nodes in those regions mostly appear in the same communities for most γ val-ues.We indicate this as colored rectangles below the MeI profiles.But other DNA regions show the opposite behavior.These regions contain nodes that often do not appear in the same communities, which results in high and variable MeI values.Overall, the local node inconsistency grows as γ becomes larger.

B. Local inconsistency and chromatin states
To appreciate the MeI variations from a biological perspective, we analyzed them relative to local chromatin states.As outlined in the Methods (Sec.II.E), we use five states and calculate the folds of enrichment for each node.We denote the chromatin states as promoters (A), enhancers (B), transcribed regions (C), heterochromatin and other repressive states (D), and insulators (E).
Below the MeI profile in Fig. 4, we show boxen plots for three γ values.Each subplot illustrates the distribution of MeI scores associated with each chromatin group (A-E); 'NA' represents nodes that are not enriched in any chromatin type.If following the MeI medians (horizontal lines), we note that groups A-C have consistently higher values than the rest.In panel (c), we explore this observation more thoroughly and plot the median MeI for several γ values.The lines show that MeI grows with γ and that groups A-C are more inconsistent than the chromosome-wide average (denoted 'all').This result suggests that nodes flagged as active chromatin have more variable node-community memberships.

C. Cross-scale global inconsistency
The previous subsection analyzed the cross-scale local inconsistency measure (MeI) for chromosome 10.Here, we extend the inconsistency analysis to all human chromosomes using the global inconsistency PaI instead of MeI (PaI yields one number per γ value instead of a chromosome-wide profile).The PaI score measures the effective number of independent network partitions (see Sec. II).By the mathematical construction of PaI 14 , if there is no special scale of communities, the "null-model" behavior of PaI as a function of γ would be as follows.As γ gets larger from γ = 0, the PaI value start to increase as the average number of communities increases enough to form a certain level of inconsistency (for γ = 0, PaI trivially vanishes because there cannot be any inconsistency for the single community composed of all of the nodes).On the other extreme case of γ → ∞, each individual node tends to form its own singleton community, so again there is no inconsistency, or PaI becomes zero.Therefore, if there is no particular characteristic scale of communities, the PaI curves against γ would be single-peaked ones without any nontrivial behavior such as local minima.In reality, there are characteristic scales of communities, where PaI reach its local minima 14 , which indicate the most meaningful community &KURPRVRPH RIFRPPXQLWLHV FIG. 5 Global inconsistency measured by PaI (the violet circles) for human chromosome 10, along with the average number of communities (the green circles) across a range of γ values with the small (smaller than the symbols themselves for all cases) error bars representing the standard deviation.
scale.As our results show, the Hi-C communities also exhibit such characteristic scales.To better understand this metric, we revisit chromosome 10 before analyzing all human chromosomes.
We plot the PaI values for chromosome 10 in Fig. 5 as violet circles.When γ is small (< 0.5), we note that PaI has a plateau extending over several γ values.Such a plateau is ideal for stable community partitions (see Fig. 1).However, this case is trivial because the community comprises the entire network (PaI = 1, thus one effective community).Next, if γ increases above 0.5, PaI starts to fluctuate, which indicates that partitions become more variable.Notably, the growing trend stops at γ ≈ 0.6, and PaI decays to eventually reach a local minimum at γ ≈ 0.65.This local minimum represents relatively stable communities, hinted by the small effective number of independent partitions at that scale.As we increase γ above 0.65, the community structure becomes less and less stable, along with the rapidly growing number of communities (the green circles).However, asymptotically, the number of independent community ensembles grows slightly less than the number of communities per ensemble.For example, at γ = 0.9, there are 1.75 independent ensembles (effective), each of which is composed of 25 communities.As a final remark, it is essential to realize that the growing trend of PaI with γ does not necessarily imply the lack of intrinsic organizational scales.Instead, it indicates fuzzy scale transitions where we observe a short range of stable communities at the local PaI minimum.
When examining the PaI curve for chromosome 10, we noticed one significant local minimum with a relatively stable community partition.Next, we ask if similar inconsistency patterns appear across all human chromosomes.To this end, we plotted PaI against γ for chromosomes 1-22 and X in Fig. 6.
We found several commonalities.First, similar to chromosome 10, most PaI curves have at least one local min- imum and maximum.Some chromosomes even have two minima (e.g., chromosomes 1, 3, 9, 14, etc.), which indicate multiple stable scales of communities.Also, when γ becomes large enough, the network enters the multicommunity regime.Second, as γ grows, so does PaI.This growth indicates that community structures become increasingly inconsistent.Although it is natural to observe higher inconsistency for larger numbers of communities as there are more possible combinations.As future work, it would be informative to check its scaling behavior across different chromosomes and classify chromosomes based on the functional shapes of PaI and the number of communities.

D. Node membership correlations
When analyzing the PaI and MeI scores across γ, we noted that the Hi-C networks exhibit relatively few independent partitions (e.g., PaI chr10 < 3), and that each node belongs to just a few communities (median MeI < 4).This suggests that the community partitions are correlated.To better understand these correlations, we use a stochastic embedding technique called t-SNE that projects high-dimensional data clusters on a 2D plane (see Sec. II.D).In our case, the data set is the community membership per node over 100 GenLouvain runs.
We show the t-SNE analysis in Fig. 7 for three γ values, where each filled circle represents a network node (again, using chromosome 10).The closer two circles appear in the plot, the more correlated their node-community

Euchromatin
Heterochromatin Eu-and heterochromatin NA memberships are.As we increase γ, we note that node clusters split and that some circles become isolated.We interpret this as the ensemble of network partitions grows with γ and becomes increasingly dissimilar.While the clustering is identical in panels a) and b), we color-coded them differently to highlight specific features.In a), the colors represent the local inconsistency score (MeI).We note that nodes having high MeI tend to separate from nodes with low MeI.We also see that the low MeI nodes have relatively stronger correlations, thereby forming more distinct clusters.Panel a) also indicates that nodes with low MeI have similar node-community memberships.
In panel b), the color-coding illustrates the chromatin type.
To simplify the analysis, we consider two large chromatin groups-Euchromatin and Heterochromatin-instead of the five we used before (Sec.II.E.These two large groups reflect the traditional division into open and closed chromatin that is associated with active transcription and repression, respectively.In terms of our previous definitions, we form the following two groups: Euchromatin: A, B, and C, Heterochromatin: D Note that we disregarded group E ("Insulators") as it is associated with boundaries rather than long chromatin stretches, such as Eu-and Heterochromatin.Also, while most nodes belong to either Eu-or Heterochromatin, some are enriched in both types3 .We call this group "mixed".Finally, there is yet another node group that does not enrich any of the two chromatin types ("NA").
In panel b), we observe that community membership correlations are associated with chromatin type.For example, when γ = 0.75, Euchromatin and Mixed nodes separate from the large cluster and form new sub-groups.The nodes in these subgroups generally have high MeI scores indicating a larger variability in their community memberships.
Overall, panels (a) and (b) show a scale-dependent separation between communities associated with active and inactive DNA regions (red and blue nodes repel each other).This separation resembles the A/B compartmentalization but for small-scale 3D structures.Furthermore, the MeI score suggests that these structures have multiple independent ways to assemble if formed from Euchromatin nodes.This observation hints at higher structural variability of the accessible genome, which may reflect the dynamic nature of gene expression processes.

IV. CONCLUSIONS
There is a growing awareness that Hi-C networks have a complex scale-dependent community structure.While some communities have a hierarchical or nested organization, others form a patchwork of partially overlapping communities.In this regard, Hi-C networks do not represent exceptions.Instead, they belong to the norm: most complex networks show convoluted multi-scale behaviors whenever competing organization principles shape the network structure.These principles force some nodes into ambiguous community memberships, making network partitioning challenging.
To better understand the scales where this might cause problems when clustering Hi-C data, we have analyzed the node-community variability over an ensemble of network partitions and estimated the ensemble's size.We have found that it typically grows as we zoom in to the network.However, this trend has significant breaks where the ensemble size drops at some specific network scales.This drop narrows the distribution of possible network partitions.We hypothesize that these minima represent the most common partitions of the average 3D chromosome organization (over a cell population).
Moreover, we have found nodes that belong to several communities when calculating the node-community membership variability.These ambiguous nodes act as bridges and are associated with specific chromatin types.For example, we have found the highest variability for nodes classified as enriched in active chromatin.This finding contrasts inactive (or repressed) chromatin nodes that typically exhibit a relatively more consistent community organization.One explanation is Euchromatin's somewhat higher physical flexibility when exploring the nuclear 3D proximity in search of other DNA regions to form functional contacts (see a recent review 28 ).An alternative explanation is that fuzzy node-community memberships reflect significant cell-tocell variations.While some physical interactions might be stable in one cell, they may be absent in another.Therefore, as Hi-C maps portray the average contact frequency over many cells, this variability may manifest in ambiguous nodes.
Finally, a recent study investigated the challenges in finding reliable communities in Hi-C data 29 .This work aims to map out the landscape of feasible network partitions in Hi-C networks and found that the width of the landscape is scale-dependent.Our study takes a more node-centric view, where we calculated the local inconsistency of individual nodes and discovered that some nodes have fuzzy node-community memberships.Both studies highlight that finding reliable communities in Hi-C data is challenging, especially on some scales.One root cause is that Hi-C networks are almost entirely connected (with many weak links).Under these circumstances, we expect that Hi-C networks have several community divisions.Divisions that cannot be distinguished without additional data, such as gene expression or epigenetic profiles.This fundamental problem suggests that there is a significant likelihood of disagreement on the ideal network division between any community-finding or dataclustering methods.This challenge has likely contributed to debates on the actual differences between TADs and sub-TADs 30;31 .

FIG. 1
FIG. 1 Illustration of partition inconsistency (PaI).(a) A community ensemble composed of two configurations α and β.The similarity measure 23 quantifies the degree of similarity between the community configurations.Based on this measure, we calculate the PaI value Ω as in Ref. 14 .(b) Cross-scale PaI curve (varying resolution parameter γ) alongside the average number of communities.Shaded areas show ranges of statistically reliable community scales.

FIG. 2
FIG. 2 Illustration of the membership inconsistency (MeI).(a)The co-membership composition for node i in configurations α and β.From this co-membership structure, we calculate the MeI value Ψi as in Ref.14 .(b) Example distribution of MeI for a small network.

FIG. 3
FIG. 3 Membership inconsistency analysis for γ = 0.75.The horizontal axis represents the genomic position along human chromosome 10.(a) Community membership of DNA segments shown as rectangles along the DNA sequence.Rectangles of the same color enclose DNA segments that are members of the same community (a single configuration).(b) Hi-C data is depicted as a heat map of contact frequencies between DNA-segment pairs.Rectangles along the main diagonal represent DNA segments' community membership from panel (a).(c) Nodes' membership inconsistency measured across 100 GenLouvain realizations.The dark-blue plot shows MeI scores of DNA segments in a chromosome (the network's nodes).The background of this panel shows the color-coded community membership from panel (a).

FIG. 4
FIG.4 Membership inconsistency (MeI) across network scales (0.5 ≤ γ ≤ 0.9) on human chromosome 10.(a) MeI profile for each resolution parameter γ.Below the waterfall plot, we draw blue squares that enclose DNA segments with low MeI scores across all scales.(b) Distribution of MeI scores shown for three network scales (γ = 0.75, 0.80, 0.85).We separated the distributions into A,..., and E groups (and 'NA') highlighting the nodes' chromatin state.Each distribution is shown as a boxen plot with a center line marking the median.(c) Median MeI across γ and chromatin groups.

FIG. 6
FIG. 6 Chromosome-wide global community inconsistency measured by PaI.Each panel shows the values of the PaI metric (the violet circles) and the average number of communities (the green circles) across a range of γ values with the small (smaller than the symbols themselves for all cases) error bars representing the standard deviation.

FIG. 7
FIG.7Node-community membership correlations across three resolution parameters (γ = 0.65, 0.75, 0.85) visualized with t-SNE dimension reduction.Nodes are represented as colored circles.Those having correlated community memberships over several GenLouvain realizations tend to cluster.In contrast, nodes that repel each other belong less likely to the same community.In (a), the colors indicate MeI scores (see legend under each plot).In (b) chromatin types (Eu-or Heterochromatin).