Higher-order co-mutation interactions in mitochondrial genomes

Pair-wise co-mutation networks of the mitochondrial genome have already provided ample evidences about the roles of genetic interactions in the manifestation of phenotype under altered environmental conditions. Here, we present a method to construct and analyze higher-order interactions, namely, three-uniform hypergraphs of the mitochondrial genome for different altitude populations to decipher the role of co-mutating variable sites beyond pair-wise interactions. We considered the human mitochondrial DNA residing at different altitudes with respect to Tibet in Asia as a case study. We found that in 50% of the gene triangles, two positions were occupied by coding genes, which suggested that coding genes are dominantly involved in forming the hyperedges. Based on weights of the gene triangles, we identified altitude-specific genes such as, in low-altitude ATP6 and ND genes, in mid-altitude CO1 and ND genes, and in high-altitude ATP6, CO1, CYB and ND genes. This framework of three-uniform hypergraph serves an avenue for future investigation of nuclear genomes in context of phenotypic association and genetic disorders beyond the pair-wise interactions.


Introduction
Networks equip us with a statistical approach to model real-world complex systems based on underlying interaction networks [1]. For instance, in cellular activities, biomolecules interact to perform diverse functions which can be studied through protein-protein interactions [2], gene-transcription factor interactions [3], and metabolic interactions [4] networks. Using the network framework, biologists have analyzed complex sub-cellular interactions to access various structural and evolutionary functionality of cells [5]. Such interaction analyses under the lens of network theory have assisted in deducing biomarkers and designing new experiments [6,7]. With an advent in network theory concepts, many other details of real-world systems, such as edge weights emphasising on relative importance of interactions, directed edges [8] indicating flow of signal through edges [9], multi-layer structure [10,11], and evolution of network properties with time aka temporal networks [12,13] got incorporated while constructing the corresponding model networks.
All these studies relied in describing complex systems through pairs of interacting units (pair-wise interactions), and demonstrated that properties and evolution of real-world complex systems indeed can be modeled as a net of interacting units. Nevertheless, quest for more accurate representations of properties of real-world interactions has continuously been revealing many exciting facets, particularly, as in real-world systems, the interactions occur in much more detailed fashion. Many factors affect the very nature of interactions themselves. One of the most exciting revelations of these pursuits is discovery of higher-order interactions, which emphasize on existence and importance of interactions beyond pair-wise in real-world complex systems. There are shred of evidences which indicate significance of higher-order interactions in governing properties of many real-world complex systems such as ecological [14], biological [15], neuronal [16], and social [17]. In ecological systems, more than two species interact together and affect the ecological dynamics of the ecosystem. In collaboration networks, more than two authors appear in the same article [18]. In epistatic networks, more than two variable sites interact/co-occur for phenotypic benefits of the species, and reactions involve more than two biochemicals to perform a cellular function. The prediction of critical genes or nodes based on graphs assume that two adjacent nodes have similar functional contribution or where we miss the information of other nodes contributing in that module (functional module). The representation of network as a hypergraph is a spontaneous way of overcoming this assumption. Here, we provide a method to construct three-uniform hypergraphs using the variable sites in the human mitochondrial DNA (mtDNA). Based on the three-uniform hypergraphs, we discussed the nature and characteristics of variable sites and genes to form these structures, and their importance in context to the specific population. Note that in a hypergraph all the hyperedges need not to have the same size. However, a hypergraph is referred to as a k-uniform hypergraph if all its hyperedges connect the exactly k nodes. Here, we considered a three-uniform hypergraph where each hyperedge connects three nodes.
Earlier studies on hypergraphs for the brain were based on the temporal co-variation of brain function during learning via cross-link structures [19]. In a similar approach, correlation hypergraphs were formulated based on edges rather than brain regions to analyse topological and functional changes in the brain during neurodevelopment [20]. Higher-order interactions of protein complexes demonstrated the scale-free degree distribution of both the nodes and the hyperedges [21,22]. Additionally, signalling pathways [23] and protein interactions [24,25] have been studied under the umbrella of higher-order interactions.
Furthermore, interactions between genetic variations (epistasis) and the genetic background alter the phenotype landscape of the respective population, thereby leading to various studies of mutations among genes [26,27]. All these investigations were comprised of only pair-wise interactions and hence limit our understanding of genetic evolution [15]. The presence of a third mutation has been displayed to alter the fitness and also modify the way pair-wise interactions occur [28,29]. An earlier study on higher-order interactions identified a few phenotypes that were not explained by pair-wise interactions while searching for missing heritability in the crossing of two yeast strains for 46 different phenotypes [30]. Another study involving 500 genotypes of yeast tRNA and over 45 000 interactions revealed abundance of higher-order interactions which were responsible for dynamics across different genotypes [31]. In another study, higher-order interactions of five different genes were found to affect the complex phenotype in Saccharomyces cerevisiae [32]. This and a few other studies investigated the presence and absence of mutations/variations and their effect on the fitness or morphology of the organism. However, [33] quantified the digenic and transgenic interactions with respect to the fitness of the population by introducing a transgenic score which combined double and triple mutant fitness extracted from the experimental results. Further, the word co-occurrence hypergraphs have been studied to explore the conceptual landscape of mathematics [34].
All these investigations provided sufficient evidence of the existence of higher-order interactions in complex systems and, moreover, highlighted the importance of such interactions and corresponding hypergraphs in steering the evolution and functionality of underlying complex systems. Here, we studied the co-mutation hypergraphs in the human mitochondrial genomes for three different altitude populations, low-altitude (0-500 m), middle-altitude (2001-2500 m), and high-altitude (>4000 m). As the altitude increases, the percent of oxygen in the air decreases. This decrease in the oxygen level is more or less 1% for each rise in 500 m altitude [35]. We took three populations, two from extremes and one from the middle, to have a comprehensive analysis of mtDNA variations in the framework of hypergraphs. It has been reported earlier that high-altitude leads to a change in mtDNA composition, which might help individuals to adapt better in low-oxygen environments [36]. In this work, we constructed three-uniform hypergraph based on co-mutation frequency defined for three variable sites simultaneously for all the possible combinations of the available variable sites. Thereafter, we used a two-step threshold selection method to construct the unweighted hypergraph of variable sites and a weighted hypergraph of genes.
In the following texts, the terms triangles and hyperedges were used interchangeably as we considered three-uniform hypergraphs only, where each hyperedge contained exactly the three nodes. By going one step forward from traditional pair-wise interactions to triangle interactions, we specifically calculated the co-mutation frequency of three variable sites. A two-step threshold selection was used to filter significant hyperedges, one by using a statistical score (P ijk ) and the second by using the size of the largest connected component (LCC) (N LCC ). Thereafter, to identify the significant genes, we calculated the weight of each triangle (hyperedge), followed by the mapping of genes yielding gene hypergraph. Then we analyzed the network properties of those genes which participated in higher-order interactions and survived after the application of thresholds for each altitude group. Note that this study considered only three-uniform hypergraphs. One could consider higher-order uniform hypergraphs or mixed hypergraphs, which would, accordingly, require more sample sizes for statistically reliable results.

Methods and material
mtDNA sequences were collected from [37] for three altitude groups, lowest (0-500 m), middle (2001-2500 m) and highest (4001 m). The ambiguous nucleotides were replaced by the default letter 'N' for the calculations. All the sequences were aligned together using Clustal Omega [38] and then mapped to rCRS [39] for gene annotation.

Construction of higher-order Gene-Gene Interaction (GGI) networks
Here, we construed two types of three-uniform hypergraphs; the co-mutation hypergraphs with variable sites as nodes and the weighted GGI hypergraphs with genes as nodes (figure 2) for each set of the population.

Step 1 (Co-mutation Network)
The mtDNA is 16 569 nucleotides long, circular DNA, and harbours 13 protein coding genes and 24 non-coding genes. Among these 16 569 nucleotides, there are certain positions which have more than one allele among the samples. For example, at the 5th position among all the samples, 95% contains A and 5% contains G, so this position is having A (major) and G (minor) as alleles. Any set of three variable sites can form a triangle independent of their position in the genome. Any position having more than one allele in the samples was considered a variable site having the information of major and minor alleles. The variable sites were extracted from the aligned sequences for each region separately. For genomic equality, ambiguous nucleotides such as X, M, Y, etc were replaced with 'N' for all the sequences, and tri-allelic sites were not considered (supplementary material).

Step 2
To construct a network for each altitude group, nodes were represented by the position of variable sites, and the edges were represented by co-mutation frequency among three nodes (C ijk ) defined as (figure 2(a)) where (m ijk ) denotes number of times the minor alleles occur together at ith, jth and kth positions, m i , m j and m k indicate total number of times the minor allele occurs at ith, jth and kth positions, respectively.

Step 3 (p-value calculation)
To check the significance of any co-mutation, the threshold has been calculated as where (C r ijk ) indicates the co-mutation frequency calculated after permuting the alleles at the ith, jth and kth positions randomly. 10 000 random simulations were generated and P ijk was set to ⩽ 0.05 (standard p-value) to consider a co-mutation triangle significance. This step provided us with the total triangles (figure 2(b)) to considered for further step.

Step 4 (Co-mutation threshold selection)
To select a co-mutation threshold for the total triangles (figure 2(c)), we analyzed the size of the LCC as a function of C ijk . A threshold was selected based on the change in the size of LCC (figure 1). This threshold is also referred as network efficiency score.

Step 5 (Higher-order and pair-wise interactions selection)
To define a triangle as filled or unfilled [40], we mapped the variable sites in each triangle (i, j and k) to their respective genes (figure 2(d)) and if all the genes are different, we considered the corresponding gene triangle as a filled triangle (true triangle) or hyperedge of order 3 of variable sites and genes (figures 2(e) and (f)), otherwise it was considered as an unfilled triangle. Afterwards, the weight of filled gene triangles is calculated by summing over all the occurrence of the given gene triangle. As we know that the more than one variable sites might belong to the same gene, the different variable site triangles can give rise to same gene triangle, thus giving gene triangles a weight. This is later analysed for each altitude population.  The schematic representation of deducing higher-order interactions from the mutation information of variables sites (a), and construction of co-mutation triangles (b). A threshold is selected based on network efficiency score (c). Co-mutation triangles were then mapped to extract gene triangles and the filled gene triangles were identified such that no gene is identical in a triangle (d). Finally, three-uniform hypergraphs of variable sites (Hv) (e) and three-uniform hypergraphs of gene (Hg) are obtained (f). and M are size of V and e H , respectively. Note that, each hyperedge e α , {α taking value = 1, 2, . . . , M, corresponds to E d α which is a subset of nodes i.e. E d α ⊂ V. Since, we considered three-uniform hypergraph, |E 3 α | = 3 for all α. In a hypergraph, the hyperdegree of a node i represents the number of hyperedges formed by the node i. Next, two types of hypergraphs are constructed in this paper as described in the method section, namely, hyperegraph of the variable sites (H v ), and hypergraph of genes (H g ). We defined the hyperdegree accordingly, let d(v) n,2 and d(v) n,3 be the degree of the nth variable site in the two-uniform and three-uniform hypergraphs, respectively. Similarly, let d(g) n,2 and d(g) n,3 be the degrees of the nth gene in two-uniform and three-uniform hypergraphs, respectively.

Relative weight of genes
Relative weight (R n ) of a gene n was calculated based on the number of triangles that gene is participating (which is equal to hyperdegree d(g) n,3 of nth gene) as where summation runs for all the triangles in which a given gene n is participating, W(g) i is the weight of ith gene triangle in which gene n is participating. Note that D-loop is a non-coding region in the mtDNA which controls the replication and transcription of the whole mtDNA. This region is observed to be highly mutated among all the other genes of mtDNA [41]. Hence, we have excluded theD-loop region to prevent any bias in the calculation of the relative weight.

Results
The following sections analyze the gene hyperedges in terms of their statistical parameters, and associated genetic information based on their function in energy metabolism. First, we discuss the characteristics of  true triangles based on their constituent nodes and hyperedge degrees. Then, we try to incorporate the information from network parameters with the specific mitochondrial haplogroups. In general, the mitochondrial haplogoups are a group of individuals who share similar changes in their mtDNA. Further, we analyse genes involved in each gene triangle to identify the specific genes for each altitude group.

Characteristics of hyperedges
Initially, after applying the first step filtration based on P ijk , the resultant co-mutation hypergraphs were quite dense, therefore a co-mutation threshold (C th ) also called as network efficiency score was further applied to achieve a sparse network. An increase in the threshold value would filter out many connections whose weights were smaller than the threshold value. For a threshold value of zero, all the hyperedges would persist. Whereas, for a threshold value of 1, only those hyperedge would persist which have weight exactly equal to 1. To identify a value of suitable C th , we analyzed the N LCC as a function of C ijk such that all the nodes remain the part of the connected network while removing the edges to achieve a sparse network (figure 3). We witnessed that there existed a sharp fall in the size of the largest connected cluster as we increased the value of C th . A value just before this value of C th at which the network becomes segmented was taken as the threshold. This threshold selection method has been defined as network efficiency score [42]. We started with ∼450 nodes in each altitude group with a potential of 0.1 million combinations and ended up with a very small number of total triangles and true triangles (table 1). For a given hyperedge, the C ijk was subjected to the minimal co-mutation frequency value for all three pairs of the triangle. Hence, C ijk is not a linear combination of pair-wise co-mutation frequencies for all the pairs of i, j, k variable sites. After getting the sparse network, we defined the true triangles, such as no variable sites in any gene triangle belong to the same gene for all the hyperedges as at the genetic level, such as a pair of variable sites where both the nodes belong to the same gene would give rise to a self-loop and also transform the triangle into a weighted edge. Hence such connections were also removed. In low-altitude, mid-altitude, and high-altitude ∼10% of all the triangles were found to be true triangles (table 1).These true triangles are then further analysed based on the information available for the variable sites and the genes.

Variable sites and haplogroups
From this small number of triangles, we checked for the variable sites (nodes) contributing to the formation of triangles in terms of their degree (d(v) n,3 ) (table 2). The mtDNA variations are categorized into certain haplogroups based on the human migration patterns and evolution. The word haplogroup is derived from haplotype, which refers to a set of DNA variations inherited through the single chromosome, and since the mtDNA is inherited only through the mother, is considered as haplotype. We mapped the variable sites to the haplogroups mentioned in the MitoMap database [43] and we found that in the low-altitude group, node 1598 represented M-haplogroup (M17, M28, M29, M30, M42, M52, B5 and N1). In the mid-altitude group

Top most values of d(v)n,3 (with corresponding genes in bracket)
Low . It is noted that in the low altitude, only one high-degree node was represented as a haplogroup marker. The haplogroup C is commonly found in Northeast Asia and America [44]. The haplogroup K is prominently found in South Asia and Europe [45]. In contrast, most high-degree nodes in mid and high altitudes represented a few haplogroups. It is to note that despite having very low variable sites in the low-altitude, tRNA genes, Trp, Tyr, Gly, and Ser are contributing to the highest node degree. When the difference (∆d(v)) of pair-wise (d(v) n,2 ) and triangle degrees (d(v) n, 3 ) was plotted, we found that certain variable sites were having large differences compared to most of the other variable sites. This suggests that a few variable sites show a higher tendency to form hyperedges (figure 4 upper panel).

Overlapping edges and associated genes
Next, we analyzed the statistics of the edge degree of each hypergraph. For three-uniform hypergraphs, the edge degree was defined as the number of triangles a given edge is part of. After calculating the d(v) n,3 , we mapped the edges (both nodes) to genes, yielding a gene pair for each edge. Since more than one edge can belong to the same gene pair, we get the gene-pair weight by adding the edge degree for all the edges of a given gene pair. For instance, edges (1,2) and (3,4) are part of two triangles each. Therefore, the edge degree of both these edges will be equal to 2. Next, mapping these edges to genes, both edges belong to the same gene-pair G1-G2. Hence, the gene-pair weight will be equal to the sum of both the edge degrees, i.e. equal to four (supplementary material). This provided us with the information of overlapping edges between the triangles. In all the three altitude groups, D-loop was commonly found to belong to hyperedges with the highest weights (supplementary material). Particularly, in low and mid-altitude, the D-loop formed the highest weight hyperedge with ND5 gene and in high-altitude with CYB gene. Apart from the D-loop, in low-altitude ND5 formed highest weight hyperedge each with ND1, ND2, in mid-altitude ND5 formed highest weight hyperedge each with CO1, ND1, ND4, and in high-altitude CYB formed highest weight hyperedge each with ATP6, ND2, ND5. In summary, in low-altitude ND gene yielded overlapping hyperedges. In high-altitude CYB, ATP6 and ND genes yield overlapping hyperedges. The CYB and ATP6 genes have already been established to play a significant role in high-altitude adaptation [46].

Common and exclusive triangles
There existed few variable sites which were common and few others are exclusive among all the altitudes. We investigated how these variable sites form triangles by identifying the presence of shared and exclusive true triangles. Note that a variable site common to two altitudes could be part of two different triangles. Hence common variable sites are not necessarily part of common triangles for the corresponding altitudes. We found 181 common triangles among all the altitude groups. In addition, there are 123 common triangles between low-and mid-latitudes, 169 between low-and high-altitudes, and 612 between mid-and high-altitudes ( figure 5). A large proportion of triangles were found to be exclusive to each altitude group. We were interested in identifying the genes forming triangles in these common and exclusive triangles for each altitude. One simple method to access such genes is to map these common triangles to corresponding genes to get weighted gene triangles. The triangles with the highest weight had {CYB and ND3} genes present ubiquitously among common triangles of all the altitudes. Among common triangles of low-and mid-latitudes, {ND2 and ND4} genes, and among common triangles of mid-and high-altitudes, {CO1 and ND5} genes were present ubiquitously in gene triangles with the highest weights. To extend this in terms of genomics, the interaction between two particular genes was significant in forming triangles with a third different gene. This provided evidence and strengthened the presence of particular pair-wise interactions among mitochondrial genes to form the gene hyperedges.

Codon positions in hyperedges
A variable site possesses information about codon position (CP) in the coding genes. The relation between co-evolving variable sites and CPs has been described previously in the human population using two, and three-order motifs [42]. For a coding gene, the codon positions were set based on nucleotide position in codons 1, 2, and 3. For non-coding genes, all the nucleotide positions were set to 0. It was detected that CP 1 and 2 are highly conserved in the formation of the codon triangles, whereas CP 3 is versatile in the formation of these hyperedges ( figure 6). The CP 3 predominates the formation of hyperedges with other coding and non-coding positions. The third codon position is considered to be weakly responsible for amino acid selection during protein synthesis, therefore, supporting the Wooble hypothesis it is not subjected to evolutionary constraint [47]. The codon-conservation has been observed in the regions of conopeptides revealing their accelerated evolution [48]. However, the hyperedges with all the associated sites being CP 3 are relatively less abundant than other hyperedges with one or two CPs occupied by coding and non-coding positions. This suggests that the third CP is subject to conformational stability provided by other CPs. Among all, CP 2 is the most conserved position, which supports the functional stability of amino acid selection during protein synthesis. Similarly, the hyperedges with all CP being 0 were also found to be relatively less. Since these do not code for any proteins, the associated hyperedges represent the intra-D-loop and intra-RNA genes. This suggests that non-coding regions favour the formation of higher-order hyperedges least.

Gene triangles
After identifying the true triangles, we calculated the actual number of true gene triangles by summing over their occurrence in the network, thereby calculating their weights. If we exclude the D-loop out of all the possible true gene triangles, only ∼10% were identified as true gene triangles (table 1) in all the altitudes. The D-loop is known to be a highly mutating region of the mitochondrial genome. Owing to this, we excluded D-loop to analyze the triangles of coding and non-coding genes explicitly. It is evident from the figure 7 that very few gene triangles have relatively higher weights as compared to most of the other triangles. This indicated that the mitochondrial genes prefer the formation of particular triangles over others due to the existence of conserved co-mutation patterns formed by variable sites. After selecting a specific C ijk threshold value, all the variables sites of each surviving triangle were mapped to the corresponding genes, and the contribution of each gene was calculated as a relative weight for each altitude group (figure 8). The relative weight of each gene was defined based on the number and weight of the hyperedges for each gene. Note that for this calculation D-loop variables sites were not considered. The D-loop is a highly mutating region of mtDNA and hence occurred in all the groups equally. In the high-altitude group, CYB gene showed high relative weight among coding genes and tRNA-Gln and tRNA-Met among non-coding genes (figure 8). Cytochrome b (CYB) protein is a subunit of mitochondrial inner membrane protein complex cytochrome bc1, this complex reduces cytochrome c by transferring electrons from ubiquinone in the mitochondrial electron transport chain. There are investigations which suggests that the DNA sequence alterations in CYB might be involved in thermal adaptations [49]. In low and mid altitudes, none of the coding genes showed distinctly high relative weight; however, in the mid-altitude group, tRNA-Ile, tRNA-Arg and tRNA-Leu, and in the low-altitude group, tRNA-Phe, tRNA-Ser and tRNA-Tyr showed distinct relative weights among non-coding genes ( figure 8). It was to note that in the low altitude, tRNA-Phe depicted distinctly high weight among non-coding genes; however, it was not found to be a high degree node.

Gene hyperedges categories
Furthermore, we classified gene hyperedges into four categories based on the information of genes. These categories are all-coding (with no non-coding gene), 2-coding (with two coding genes), 1-coding (with one coding gene) and 0-coding (with no coding genes). We found that ∼50% of the gene triangles were from 2-coding category, and ∼1% from all-coding category in all the altitude groups. All-coding and 1-coding were present equally around 25%. The existence of a relatively high percentage of 2-coding type in all the altitudes suggested that the presence of one non-coding gene favoured the formation of higher-order interactions in the mitochondrial genome in general independent of environmental conditions. However, the gene triangles with high weights were different for different altitudes. The non-coding gene in the 2-coding category with high weight was found to be 12SrRNA in low altitude and 16SrRNA in both middle and high altitudes. Primarily, we were interested in identifying specific gene triangles based on the hypergraphs for each altitude group. To achieve this, we inspected for gene triangles having high weights (table 3) From the table, it was clear that in low-altitude population ). From the table, it was clear that in low-altitude population ATP6 and ND genes, in mid-altitude population CO1 and ND genes, and in high altitude population ATP6, CO1, CYB, and ND genes contributed significantly in the formation of higher-order interactions. Earlier studies of perfectly co-occurring pair-wise interactions had revealed a similar set of genes with some additional genes for these different altitude groups. However, the variable sites forming gene triangles differed from those reported in [37]. This suggested that the formation of hyperedges was facilitated by a different set of variable sites, and might be capturing the genetic interactions according to their role in environmental adaptability. An important observation was that in each of the gene triangles, one of the three edges depicted considerable weight (discussed previously as edge weight), which was again observed when we inspected for the variable sites contributing to gene triangles with high weights. In the low-altitude group, ND1, ND5 genes or ATP6, ND5 genes corresponded to variable sites with the highest contribution, in mid-altitude, CO1, ND5 genes, and in high-altitude, CO1, ND1 or ATP6, CYB or ND5, CYB or ATP6, CO1 or ATP6, CYB genes in one or the other gene triangles corresponded to variable sites with the highest contribution. It was observed in the analysis of uniform hypergraphs for each altitude, specific genes were forming higher-weight triangles which might assist in further analysis at the phenotypic and functional level.

Conclusion
We investigated interaction hypergraphs of variable sites of mtDNA constructed based on the co-mutation frequency of three variable sites. Such hypergraphs were first filtered through the statistical score and second through gene-based information. We superimposed the variable site onto genes to finally realize the three-uniform gene hypergraphs. Using the p-value test, ∼10% higher-order interactions were found to be true hyperedges. These hyperedges belonged to distinct haplogroups in different altitude populations. At the low-altitude, they corresponded to M-haplogroup, in the mid-altitude, C-haplogroup, and in the high-altitude, the K-haplogroup. Moreover, codon-based hyperedges provided evidence about the codon bias and conservation of codon usage throughout the mitochondrial genome for all the altitude groups. Based on gene hyperedges, we found that in the low-altitude group, ATP6 and ND genes, in middle altitude group, CO1 and ND5 genes, and in high altitude group, CYB and ND5 genes were predominantly forming hyperedges. Here we have only considered three-uniform hypergraphs, it would be interesting to investigate other higher-order interactions, say four-, five-uniform hyper to get further insights and to identify the hyperedges-based communities in the mitochondrial genome.

Data availability statement
All data that support the findings of this study are included within the article (and any supplementary files). https://www.ncbi.nlm.nih.gov/nuccore/.