Filtering higher-order datasets

Abstract. Many complex systems contain interactions between more than two nodes, known as higher-order interactions, which can change the structure of these systems in significant ways. Researchers often assume that all interactions paint a consistent picture of a higher-order dataset’s structure. In contrast, the connection patterns of individuals or entities in empirical systems are often stratified by interaction size. Ignoring this fact can aggregate connection patterns that exist only at certain scales of interaction. To isolate these scale-dependent patterns, we present an approach for analyzing higher-order datasets by filtering interactions by their size. We apply this framework to several empirical datasets from three domains to demonstrate that data practitioners can gain valuable information from this approach.


I. INTRODUCTION
Complex systems science provides a powerful framework for analyzing empirical systems because it accounts not only for individual entities but also for the interactions between them. These interactions shape the structure of complex systems at many scales [1]. In these systems, one can quantify this structure by, for example, measuring the propensity of nodes with similar properties to connect with one another [2,3], finding influential nodes [4,5], partitioning a system into different communities [3,6], and measuring the heterogeneity of connection [7]. Such structural properties not only characterize a dataset efficiently [8] but can also inform its dynamical behavior, such as contagion spread [9], synchronization [10], and many other phenomena.
Pairwise networks, or graphs, represent complex systems as a collection of interactions involving only two entities. Common measures of pairwise networks include the degree and categorical assortativity [2,11], community structure [12], clustering coefficient [13], and centrality [4]. Many empirical systems, however, contain not only pairwise interactions but also interactions between more than two nodes, known as higher-order interactions. Higher-order networks, also known as hypergraphs, are the natural extension of pairwise networks [14] and can more accurately model the structure of higher-order empirical datasets [15]. Researchers have extended many notions of pairwise network structure to higher-order networks. These measures include degree assortativity [15,16], categorical assortativity [17], modularity [18], community structure [19], centrality [20,21], and degree heterogeneity [22], among others. Many of these higher-order measures extend metrics on pairwise networks through selection rules [15], pairwise projections [16,23], or the analysis of datasets whose interactions are all of the same size [20].
Our central premise is that connection patterns are stratified by the interaction size. For this reason, subsets of the original higher-order dataset, restricted by their size, offer more granular insights into the network structure. These size-restricted subsets, called higher-order filterings, are an effective tool for analyzing complex systems with higher-order interactions. Several studies assume that interactions of a certain size may be used to predict the interaction patterns of a different size [24,25] or that the dataset as a whole offers consistent information on the community structure [19], assortativity [15], or other metrics. We assume this is not the case and observe that removing this assumption increases the information encoded in higher-order datasets. A similar idea has been explored for pairwise networks in Ref. [26]. Recent work on subsets of hypergraphs has examined degree-degree mixing [16,27] and degree centrality [28]. In Ref. [29], the authors count the number of motifs of different sizes to describe the structure of datasets.
Furthermore, our approach presents a compelling framework to analyze the introversion/extroversion of individuals in a higher-order interaction network. This question has been studied for pairwise networks in Ref. [11], where the author associated introversion and extroversion with a node's degree and used degree assortativity to quantify how gregarious people interact. Higher-order interaction networks allow us to answer this question in a complementary way. Taking increasing interaction size as a proxy for increasing extroversion, we may filter the network into two parts: a part containing small-sized interactions and another part containing larger-sized ones. We then interpret individuals exhibiting large centralities among the smaller-sized interactions as strongly introverted and those with large centralities among the larger-sized interactions as strongly extroverted, disregarding nodes that are very central in both regimes. By changing the threshold for "small"- and "large"-scale interactions, we glean nuanced insight into the introversion/extroversion properties of the network. In fact, our filtering framework allows us to examine a whole range of interaction types, and we could, for example, introduce a range of "intermediate" interaction sizes where central nodes are neither strong introverts nor strong extroverts.
In this work, we begin by formalizing our filtering approach in Section II. Next, in Section III, we demonstrate its utility through a case study on an email dataset, where our framework allows us to examine unique associations, assortativity, centrality, and community structure across different filtering parameters. In the appendices, we develop further intuition for the filtering approach through analysis of synthetic datasets and then apply the framework to glean insights from real-life datasets across email, biology, and proximity domains. Our approach allows us to identify and study previously obscured structure in the data.

II. THE FILTERING APPROACH
Disaggregating interaction networks based on metadata is a common approach for pairwise networks, resulting in multiplex and multilayer networks. Common examples include separating transportation networks into layers by modality or company [30,31], social media platform [32], or potential for transmission [33]. The individual uniplex layers in these examples often have no intrinsic ordering, and the networks contain relatively few layers. In contrast to general multilayer networks, multislice networks are composed of ordered layers [34,35] such as variations across time [34,36,37], community structure of the same network at different scales [34], or network backbones with varying levels of sparsity [38].
From a higher-order perspective, several studies use subsets of higher-order data in their analysis. For example, Refs. [39,40] quantify significant interactions by measuring the over-expression of a given interaction with respect to the prediction of a null model, Ref. [41] examines the effect that removing interactions above a certain size has on disease dynamics, and Ref. [42] measures how higher-order structure changes with respect to the minimum interaction overlap size, s.
In this paper, we define filtering as an extension of the multislice approach for disaggregating network data. We present a general framework for more flexible analysis of higher-order datasets to better understand how structure is related to interaction size.

A. Empirical examples
We present examples from social science, academia, and biology that illustrate how the scale of interactions can impact the underlying analysis.
a. Social networks. The structure of social networks can be largely context-dependent. In multiplex networks, each uniplex layer formed by a different type of social tie may have significantly different structure [32] due to the various mechanisms of link formation [33]. Studies have also shown that the extroversion of individuals affects their social network [44,45]. In Ref. [45], the authors show that "extroverts tend to have more complete triads and less incomplete or empty ones, than introverts." There is also evidence of a "rich club" phenomenon [46] where high-degree nodes connect with one another more often than would be expected at random. These pairwise network phenomena point to the possibility that the structure of social networks may change when examining interactions of different sizes.
b. Email. The etiquette of email communication often dictates whether an individual email will suffice or if a group email is appropriate and, if so, the number of people who should receive the message. In Ref. [47], the authors report that "group communication is so prevalent that our analysis of the Google Mail email network shows that over 10% of emails are sent to more than one recipient, and over 4% of emails are sent to 5 or more recipients." These patterns are evident not only in the number of messages sent for different numbers of recipients, but also in the structure of the pairwise projection compared to the hypergraph as a whole [17].
c. Co-authorship networks. Co-authorship networks are inherently higher-order; for example, a paper with three authors is not equivalent to three two-author papers. In general, the number of co-authors on a paper is a field-dependent distribution [48]. Prior to the advent of higher-order network analysis, studies treated multi-author papers as cliques [49]. More recent studies, however, have indicated that including these higher-order interactions can impact a co-authorship network's structure [15]. The authors of Ref. [50] show that co-authorship networks largely obey triadic closure and that different scientific fields have different distributions of "maximal" collaborations.
d. Protein networks. Proteins interact with one another to form complex molecules that are essential to cell structure and function. The specific combination of these proteins can create very different resulting molecules, and combinations of more than two proteins can form a higher-order interaction network with vastly different topology than the equivalent pairwise network [51,52], suggesting that the structure of protein networks may vary significantly across interaction sizes.

B. Terminology
Consider a hypergraph $H = (V, E)$ where $V$ is the set of nodes and $E$ is the set of hyperedges, which are arbitrary nonempty subsets of the node set. We call an interaction between $k$ entities a $k$-hyperedge, and a hypergraph which solely consists of $k$-hyperedges is called $k$-uniform. A simplicial complex is a special case of a hypergraph where the existence of a $k$-hyperedge (in this instance called a $(k - 1)$-simplex) implies the existence of every possible subset of that hyperedge.
We now define the filtering of a hypergraph $H = (V, E)$ with respect to the filters $f : V \to \{0, 1\}$ and $g : E \to \{0, 1\}$ to be the hypergraph $H_{f,g} = (V_f, E_g)$, where
$$V_f = \{v \in V \mid f(v) = 1\}$$
is the set of filtered nodes with respect to $f$ and
$$E_g = \{e \in E \mid g(e) = 1\}$$
is the set of filtered hyperedges with respect to $g$. For simplicity, we denote the set of nodes induced by the filtered edges as $V_g = \{v \mid v \in e \in E_g\}$. We note that the choice of $f$ must yield a node set $V_f$ such that $V_g \subseteq V_f$. Three common choices for the filtered set of nodes are (1) $f(v) = 1$, so that each filtering contains all the nodes; (2) $f(v) = I_{C_g}(v)$, where $C_g$ is the giant component of the filtered edges $E_g$ and $I$ is the indicator function; or (3) $f(v) = I_{V_g}(v)$. In this paper, we solely consider the first choice to more easily compare nodal measures across different filterings. For ease of notation, we set $H_g \equiv H_{1,g}$.
We note that $g$ may be chosen to reflect a variety of network properties, as well as metadata associated with hyperedges. Both pairwise networks and hypergraphs may have metadata, however, so the benefit of using higher-order datasets is the representation of multiple interaction sizes. In addition, unlike categorical attributes, filtering edges by their size allows us to define an ordering on the hypergraph filterings. These reasons motivate us to solely consider filters based on hyperedge size.
For a given non-negative integer $k$ and a given comparison operator $*$, a size-dependent filtering $H_{(*,k)}$ is a filtering $H_g$ where $g$ is defined by
$$g(e) = \begin{cases} 1, & |e| * k, \\ 0, & \text{otherwise.} \end{cases}$$
In the following, we refer to $k$ as the filtering parameter. More generally, one may define a size-dependent filtering based on a set $K$ of hyperedge cardinalities, but in this work, we consider a single value $k$. Below, we present examples of common comparison operators.
• Uniform filtering, $H_{(=,k)}$. This type of filtering may be used to isolate the structure associated with a particular hyperedge size.
• GEQ filtering, $H_{(\geq,k)}$. This filtering can be used to see the effect of excluding pairwise interactions or hyperedges with smaller cardinality. One can call this the "higher-order" filtering for its ability to remove low-order interactions.
• LEQ filtering, $H_{(\leq,k)}$. This filtering can be used to exclude higher-order interactions. As before, one can also call this the "lower-order" filtering. This has been used to construct some of the datasets in Ref. [53] (the author constructs hypergraphs with hyperedges of size ≤ 25) and in other papers that observe the effect of higher-order interactions by only including 2- and 3-hyperedges [54-56].
• Exclusion filtering, $H_{(\neq,k)}$. This filtering can be used to exclude interactions of a particular size. This can be helpful when a particular interaction size significantly alters the structure of a hypergraph (whether through a large number of edges or a significantly anomalous structure) and one desires to exclude that interaction size.
In principle, one can combine these operations to select or exclude more than one interaction size, but the filters presented in this paper form the basis for all other size-dependent filters.
We can generate a set of filterings by considering different values of $k$ or filtering operations. A set of filterings can either be disjoint or overlapping. A set of filterings, $\{H_{g_i}\}_i$, is disjoint if $E_{g_i} \cap E_{g_j} = \emptyset$ for all $i \neq j$. A filtering set is overlapping when it is not disjoint. Examples of overlapping filtering sets are the complete sets of LEQ, GEQ, and exclusion filterings. Any hypergraph $H$ can be expressed as the union of all its filterings, namely,
$$H = \bigcup_{k} H_{(*,k)}.$$
For the uniform filtering set, the edge-wise union is disjoint. When filtering simplicial complexes, only the LEQ filtering preserves the simplicial structure of the dataset, but one could, in principle, convert the simplicial complex to its equivalent hypergraph to allow the use of other filtering types.
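The four size-dependent filterings above can be sketched in a few lines of plain Python. This is only an illustration under simple assumptions: hyperedges are stored as a list of sets, all nodes are kept (the choice $f(v) = 1$), and the names `COMPARATORS` and `filter_by_size` are hypothetical rather than part of any library.

```python
import operator

# Hypothetical mapping from filtering type to comparison operator.
COMPARATORS = {
    "eq": operator.eq,   # uniform filtering
    "geq": operator.ge,  # GEQ filtering
    "leq": operator.le,  # LEQ filtering
    "neq": operator.ne,  # exclusion filtering
}


def filter_by_size(nodes, edges, mode, k):
    """Return (nodes, kept_edges) for a size-dependent filtering.

    Every node is kept, so nodal measures stay comparable across filterings.
    """
    compare = COMPARATORS[mode]
    kept = [e for e in edges if compare(len(e), k)]
    return set(nodes), kept


nodes = {1, 2, 3, 4, 5, 6}
edges = [{1, 2}, {2, 3}, {1, 2, 3}, {3, 4, 5, 6}]

_, uniform3 = filter_by_size(nodes, edges, "eq", 3)
_, geq3 = filter_by_size(nodes, edges, "geq", 3)
_, leq2 = filter_by_size(nodes, edges, "leq", 2)
_, excl2 = filter_by_size(nodes, edges, "neq", 2)

# The uniform filterings over all sizes partition the edge set.
uniform_union = [e for k in (2, 3, 4) for e in filter_by_size(nodes, edges, "eq", k)[1]]
```

In this toy example, the uniform filterings for k = 2, 3, 4 together recover every hyperedge exactly once, which illustrates the disjointness of the uniform filtering set.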

C. Disconnected filterings
Even when the original dataset is fully connected, a filtered dataset often is not, and we can handle this in several ways.
First, we note that many higher-order measures are not dependent on the dataset being fully connected. For example, degree assortativity and modularity are aggregated over nodes and edges without any dependence on the connectedness of the dataset. Similarly, the degree of a node simply measures the number of hyperedges of which that node is a member. This is not always the case, however. Several measures of centrality [20,21] require the dataset to be connected. These eigenvector-based methods require a single connected component, although one can look at each connected component individually and determine the relative centrality of each node in that component. Alternatively, one can consider a regularized version of that metric, just as PageRank [5] is similar to eigenvector centrality and does not require the network to be connected. Likewise, in the case of community detection, when the dataset is not fully connected, there are several well-established ways to label isolated nodes and nodes not contained within the giant component [6].
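To decide whether a connectivity-dependent measure can be applied to a given filtering, one can first list its connected components. The sketch below uses a union-find structure over a list-of-sets representation; the function name `connected_components` and the toy data are illustrative assumptions, not the code used for the results in this paper.

```python
def connected_components(nodes, edges):
    """Connected components of a hypergraph, largest first (union-find)."""
    parent = {v: v for v in nodes}

    def find(v):
        # Follow parent pointers to the root, compressing the path as we go.
        while parent[v] != v:
            parent[v] = parent[parent[v]]
            v = parent[v]
        return v

    for e in edges:
        members = list(e)
        for u in members[1:]:
            root_a, root_b = find(members[0]), find(u)
            if root_a != root_b:
                parent[root_a] = root_b

    components = {}
    for v in nodes:
        components.setdefault(find(v), set()).add(v)
    return sorted(components.values(), key=len, reverse=True)


nodes = {1, 2, 3, 4, 5, 6, 7}
edges = [{1, 2}, {2, 3}, {1, 2, 3}, {3, 4, 5, 6}]

full = connected_components(nodes, edges)
leq2 = connected_components(nodes, [e for e in edges if len(e) <= 2])
```

Here the LEQ filtering for k = 2 breaks the giant component of the full toy dataset into one three-node component and several isolated nodes, so an eigenvector-based measure would have to be computed per component or replaced by a regularized variant.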

III. A CASE STUDY
In this section, we filter a single higher-order dataset to demonstrate how the structure of this dataset changes with interaction size. We present results for additional empirical datasets in Appendix C, where we also present intuitive filtering results on synthetic datasets. Those results demonstrate the utility of our approach across various domains.

A. Measures of higher-order structure
Here, we briefly describe the structural measures we utilize in this study and refer the reader to Appendix A for additional details.
Global measures. We consider the effective information and the degree assortativity as global measures of higher-order structure.
• Effective information (EI) is defined on the clique projection of the hypergraph as
$$EI = H\left(\langle W_i^{out} \rangle\right) - \langle H\left(W_i^{out}\right) \rangle,$$
where $W_i^{out}$ is the $\ell_1$-normalized vector of out-degrees for node $i$ and $H$ is the Shannon entropy. EI measures the strength of unique associations in a network.
• Degree assortativity measures the tendency of nodes with similar degrees to connect with one another more often than would be expected at random. We use four different measures of degree assortativity: the first three (top-bottom, top-2, and uniform) are defined as in Ref. [15], and the fourth, the dynamical assortativity [16], is defined as
$$\rho = \frac{\langle k \rangle^2 \langle k k_1 \rangle_E}{\langle k^2 \rangle^2} - 1,$$
where $\langle k^r \rangle$ is the $r$-th moment of the degree and $\langle k k_1 \rangle_E$ is the expected pairwise product of degrees over hyperedges.
Local measures. We consider the betweenness centrality and the community labels of nodes as local measures of higher-order structure.
• The node-based analogue of betweenness centrality for hypergraphs defined in Ref. [57] is
$$C_B(n) = \sum_{u \neq n \neq v} \frac{\sigma_{uv}(n)}{\sigma_{uv}},$$
where $\sigma_{uv}$ is the number of shortest paths from node $u$ to node $v$ and $\sigma_{uv}(n)$ is the number of these shortest paths that pass through node $n$. Central nodes via this measure serve as "bridge" nodes between different parts of a network.
• Community structure describes the mesoscale structure of complex systems by assigning the same labels to nodes that are densely connected with one another.To infer the community labels of nodes, we perform spectral clustering on the normalized Laplacian of the hypergraph, as proposed in Ref. [58].We use a Hungarian matching algorithm to heuristically match two different sets of community labels.
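The label-matching step mentioned in the last item can be sketched as an assignment problem on the overlap between two labelings. The sketch below assumes both labelings use the same number of communities and relies on SciPy's `linear_sum_assignment`, which solves the same assignment problem that the Hungarian algorithm addresses; the helper name `match_labels` is hypothetical and this is not necessarily the implementation used for the figures.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment


def match_labels(reference, labels):
    """Relabel `labels` so that they agree with `reference` as much as possible.

    Assumes both labelings contain the same number of distinct communities.
    """
    reference = np.asarray(reference)
    labels = np.asarray(labels)
    ref_ids = np.unique(reference)
    lab_ids = np.unique(labels)

    # overlap[i, j] = number of nodes with reference label i and new label j
    overlap = np.array(
        [[np.sum((reference == r) & (labels == l)) for l in lab_ids] for r in ref_ids]
    )

    # Choose the one-to-one label correspondence that maximizes total overlap.
    rows, cols = linear_sum_assignment(overlap, maximize=True)
    mapping = {lab_ids[c]: ref_ids[r] for r, c in zip(rows, cols)}
    return np.array([mapping[l] for l in labels])


reference = [0, 0, 0, 1, 1, 2, 2]
permuted = [2, 2, 2, 0, 0, 1, 1]  # same partition, different label names
matched = match_labels(reference, permuted)
```

Matching labels in this way makes community assignments visually comparable across filtering parameters, since a pure renaming of communities no longer appears as a change.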

B. Filtering the email-enron dataset
Here, we analyze the email-enron dataset, a higher-order dataset generated from emails sent from a core set of employees at Enron [24,59,60], where nodes represent email addresses and edges represent email messages. Before analyzing this dataset, we removed isolated nodes, multi-edges, and singleton edges. We note that our results are qualitatively similar to those computed when including these artifacts. The resulting dataset has 143 nodes and 1457 edges and has heterogeneous degree and edge size distributions.
We analyze this dataset using the four types of filtering that we introduced: the uniform, LEQ, GEQ, and exclusion filterings. This provides a broad overview of the different types of filtering that can be used to examine a dataset. The structural metrics that we consider are by no means exhaustive but provide an instructive example of how structure can change for different interaction sizes. To illustrate our filtering approach, we present the effective information, degree assortativity, betweenness centrality, and inferred community structure in this section. The interaction size affects not only the local structure of a higher-order network but the global structure as well. We start by analyzing the global structure presented in Figure 2.
As seen in Figure 2a, the trends in normalized effective information show substantial differences across the uniform, GEQ, LEQ, and exclusion filterings. While the normalized effective information largely decreases with k for the uniform, GEQ, and LEQ filterings, it largely increases for the exclusion filtering. However, these trends are not monotonic; for example, the normalized effective information for the GEQ filtering increases from k = 14 through k = 16, while that for the exclusion filtering decreases from k = 16 to k = 17. Note that the effective information is not defined for k = 14, 17, 18 for the uniform filtering since no hyperedges of those sizes appear in the dataset.
In Figure 2b, we see that for both the uniform and GEQ filtering, the degree assortativity increases with interaction size before declining again. For small values of k, the assortativity fluctuates less for the GEQ filtering when compared with the uniform filtering because there are far more edges over which to average. In addition, the exclusion filtering illustrates that the pairwise interactions most significantly affect the degree assortativity. Lastly, the LEQ filtering demonstrates that we can capture most of the assortative structure with interactions of size ten and smaller.
In Figure 3a, we see that the betweenness centrality of all nodes is sensitive to the filtering parameter. Comparing the LEQ and GEQ filterings, we see that the centralities are much more stable when excluding higher-order interactions compared with excluding lower-order interactions. Furthermore, these centrality results provide fertile ground for data insights. For example, Bill Williams (node 137), an Enron trader, has the largest centrality in the uniform filtering for k = 2 but relatively small centralities for all other parameters. It turns out that he participated in many pairwise interactions (emails involving just one sender and recipient). Interestingly, his betweenness centrality remains high for a much larger parameter range for GEQ, LEQ, and exclusion filterings, indicating that he also serves as an intermediary for higher-order information. Another interesting trend is the very high betweenness centrality of Stanley Horton (node 46), President and CEO of Enron Transportation Services, in the GEQ filtering for intermediate values of k, indicating that he serves as a vital conduit of information in groups of intermediate size. In the other filterings, Horton's centrality looks rather unremarkable across all filtering parameters.
The community structure presented in Figure 3b demonstrates that the assigned communities can change with differing interaction sizes. From the LEQ filtering, it seems that there are groups of nodes that switch memberships as larger interaction sizes are included. For the GEQ filtering, not only do we see nodes fail to join groups of a given size or larger, but their participation in a given community changes as well. In addition, the exclusion filtering indicates that interactions of a single size can be enough to change the community to which certain nodes belong. Existing literature often assumes that each node in a complex system has a single community label; relaxing this assumption reveals interesting trends in the community structure for different scales of interaction.
This case study illustrates the utility of filtering higher-order datasets by interaction size. Relaxing the assumption that all interactions contribute to a unified picture of a dataset can reveal structure that only exists at a particular scale. For all four measures of structure, we see noticeable changes when increasing the filtering parameter. Such changes can uncover the influence of different interaction sizes on the complete dataset.

C. Gleaning insight across data domains
In the appendices, we present multiple case studies across email, biological, and proximity domains. In the email and biological datasets, measuring effective information allows us to observe that higher-order interactions catalyze fewer unique associations than lower-order ones. Within the email datasets, taking interaction size as a proxy for gregariousness, we can identify key players at precise levels on the introvert/extrovert interaction scale. Stanley Horton stands as a compelling example from our Enron analysis, with his importance peaking within interactions of intermediate size. Further, among all the domains examined, community structure remains largely unchanged across all filterings only in the proximity datasets, with other domains showing complex changes in community structure with the filtering parameter.

IV. DISCUSSION
We have presented a framework for looking at subsets of higher-order datasets and demonstrated its usefulness by examining an empirical case study and offering insights from datasets across multiple domains. In particular, we used our filtering framework to study global properties like the strength of unique associations and assortativity, as well as local properties such as centrality and community structure. By examining centrality at different scales of interaction, we offered an approach for identifying introverts and extroverts in a population [11] by looking at the sizes of interactions in which they participate instead of simply their pairwise network degree. We believe that filtering higher-order datasets by interaction size is a valuable tool that reveals the stratification of connection patterns at different scales. There is still much to study on the theory and practice of filtering datasets; we have merely introduced it as a tool for the network science practitioner. Other important questions remain, however. When is it practical and helpful to look at subsets of datasets? Are there heuristics for deciding not only whether to filter a dataset but also the type of filter to use? Can we quantify the information gained by no longer constraining, for example, nodes to have a single community label when performing community detection?
When using this approach, caution is advised. By construction, filtering a higher-order dataset will yield fewer interactions, making the resulting network more susceptible to noise. This can be combatted by choosing sufficiently large or dense datasets. Quantifying the statistical significance and stability of different metrics for different filterings and datasets may be a fruitful future direction. Another pitfall is that the sparsity of a dataset's filterings can drive the observed structural changes, warranting more study to measure and correct for these effects. Despite these limitations, filtering higher-order datasets by interaction size is a powerful approach and further sheds light on the assumptions made when quantifying complex systems.
The consistency in the structural information that different filterings provide is a spectrum where, on one extreme, different filterings offer no information about each other, and on the other extreme, different filterings provide perfect information about each other. This notion of structural consistency across subsets of a dataset determines, for example, how similar the community labels are across all size-dependent filterings. Often, the default approach assumes perfect structural consistency, which we have demonstrated is not always the case. It may prove useful to quantify the information that two filterings of a higher-order dataset share to determine how much the structure of a dataset varies with interaction size.
This approach offers a counterpoint to the complex systems literature: in this case, the sum of the parts may be greater than that of the whole. This method of disaggregation allows us to observe how contact patterns can be stratified by interaction size. This has implications for sub-fields of network science, such as community detection, dynamics on networks, and structural measures, among other topics. We believe that our approach unifies studies that have indirectly examined the effect of size on structure and dynamics of higher-order datasets [15,27,28,39-41] and will be a fruitful area of research in the future.

DATA AND SOFTWARE
All software used to generate the results in this paper is available on GitHub [61] and utilizes the XGI library [43]. All datasets used are openly available in the XGI-DATA repository.

In this paper, we explore two types of structural measures: global and nodal. These measures are not exhaustive but are intended to give a large-scale picture of how the structure of complex systems changes with respect to interaction size.

Global measures
We consider the effective information and degree assortativity as global measures of higher-order structure. Effective information was introduced in Ref. [26] for directed graphs and is defined as
$$EI = H\left(\langle W_i^{out} \rangle\right) - \langle H\left(W_i^{out}\right) \rangle,$$
where $W_i^{out}$ is the $\ell_1$-normalized vector of out-degrees for node $i$ and $H$ is the Shannon entropy. We extend this definition to hypergraphs by performing the degree computations on the clique projection associated with the weighted adjacency matrix $A = BB^T$ (setting the diagonal elements to zero), where $B$ is the incidence matrix. The authors of Ref. [26] explain this metric as quantifying the strength of unique associations in a network.
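As a rough illustration of this definition, the sketch below computes the effective information from a node-by-hyperedge incidence matrix with NumPy. It assumes entropies in bits and, as a further assumption not stated in the text, skips isolated nodes because their normalized out-weight vectors are undefined; the function names are hypothetical.

```python
import numpy as np


def shannon_entropy(p):
    """Shannon entropy (in bits) of a probability vector."""
    p = p[p > 0]
    return float(-np.sum(p * np.log2(p)))


def effective_information(B):
    """Effective information of the clique projection A = B B^T (zero diagonal)."""
    B = np.asarray(B, dtype=float)
    A = B @ B.T
    np.fill_diagonal(A, 0.0)

    strengths = A.sum(axis=1)
    connected = strengths > 0  # assumption: isolated nodes are skipped
    W = A[connected] / strengths[connected, None]  # rows are W_i^out

    entropy_of_mean = shannon_entropy(W.mean(axis=0))
    mean_of_entropies = float(np.mean([shannon_entropy(w) for w in W]))
    return entropy_of_mean - mean_of_entropies


# One 2-hyperedge on two nodes, and one 3-hyperedge on three nodes.
ei_pair = effective_information([[1], [1]])
ei_triple = effective_information([[1], [1], [1]])
```

In these two toy cases, a single pairwise interaction yields one bit of effective information, whereas a single 3-hyperedge yields $\log_2 3 - 1$ bits, since each node's connections are split between two neighbors.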
Degree assortativity, the preferential tendency of nodes with similar degrees to connect with one another, has been extended to hypergraphs in several ways. In Ref. [15], the author defines the top-bottom, top-2, and uniform degree assortativity for non-uniform hypergraphs by selecting two nodes from a given hyperedge according to a given rule (the nodes with the smallest and largest degrees, the nodes with the two largest degrees, and two randomly selected nodes) and then computing the Pearson correlation coefficient. In our case, we approximate this metric by computing the assortativity for a sample of $10^3$ hyperedges.
In contrast to this method, Ref. [16] derives the dynamical assortativity for uniform hypergraphs with respect to a mean-field approximation of the largest eigenvalue as
$$\rho = \frac{\langle k \rangle^2 \langle k k_1 \rangle_E}{\langle k^2 \rangle^2} - 1,$$
where $\langle k^r \rangle$ is the $r$-th moment of the degree and $\langle k k_1 \rangle_E$ is the mean pairwise product of degrees over all possible 2-node combinations in each hyperedge in the hypergraph. We note that because Ref. [16] only defines dynamical assortativity for uniform hypergraphs, we can only compute it for the uniform filtering case (or whenever the filtering contains hyperedges of only one size, as is always the case for the largest parameter in GEQ).
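A direct way to evaluate the dynamical assortativity from its ingredients (the first two degree moments and the mean pairwise degree product within hyperedges) is sketched below in plain Python. The list-of-sets representation and the function name are assumptions for illustration; the expression used is $\langle k \rangle^2 \langle k k_1 \rangle_E / \langle k^2 \rangle^2 - 1$.

```python
from itertools import combinations


def dynamical_assortativity(edges):
    """Dynamical assortativity of a uniform hypergraph given as a list of sets."""
    if len({len(e) for e in edges}) != 1:
        raise ValueError("dynamical assortativity requires a uniform hypergraph")

    degree = {}
    for e in edges:
        for v in e:
            degree[v] = degree.get(v, 0) + 1

    # Isolated nodes would rescale <k> and <k^2> by the same factor,
    # so they do not change the ratio <k>^2 / <k^2>^2.
    n = len(degree)
    k1 = sum(degree.values()) / n
    k2 = sum(d * d for d in degree.values()) / n

    # Mean product of degrees over all 2-node combinations in each hyperedge.
    products = [degree[u] * degree[v] for e in edges for u, v in combinations(e, 2)]
    kk1 = sum(products) / len(products)

    return k1**2 * kk1 / k2**2 - 1


# Three 3-hyperedges that share only node 1.
rho_shared = dynamical_assortativity([{1, 2, 3}, {1, 4, 5}, {1, 6, 7}])
# Two disjoint 3-hyperedges: every node has degree 1.
rho_regular = dynamical_assortativity([{1, 2, 3}, {4, 5, 6}])
```

The uniformity check mirrors the restriction noted above: the measure is only evaluated when the filtering contains hyperedges of a single size.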

Nodal measures
The nodal structural properties that we consider are the centrality and community labels of nodes. There are many notions of centrality for hypergraphs [20,21,57]. In this study, we use a node-based analogue of the betweenness centrality for hypergraphs defined in Ref. [57] as
$$C_B(n) = \sum_{u \neq n \neq v} \frac{\sigma_{uv}(n)}{\sigma_{uv}},$$
where $\sigma_{uv}$ is the number of shortest paths from node $u$ to node $v$ and $\sigma_{uv}(n)$ is the number of these shortest paths that pass through node $n$. Here, the number of shortest paths is computed on the unweighted clique projection. In all experiments, we report the $\ell_\infty$-normalized betweenness centrality to facilitate easy comparisons across filterings and filtering parameters.
There are many algorithms for community detection on hypergraphs [18,19,58,62], and for this study, we use the method proposed in Ref. [58]. We start from the normalized Laplacian of the hypergraph, which is built from $A$, defined as before, and $D = \mathrm{diag}(A\mathbf{1})$, the diagonal matrix of degrees. We then compute the eigenvectors of this matrix associated with the $l$ largest eigenvalues and perform k-means clustering on these vectors. One caveat is that the number of clusters is fixed to $l$, regardless of the number of connected nodes. This can sometimes lead to well-defined clusters being split into two or more clusters or other changes in cluster assignments.
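The centrality computation described above can be sketched as follows: build the unweighted clique projection, run a standard shortest-path betweenness computation (here Brandes' algorithm), and divide by the maximum value. This is a generic sketch with hypothetical function names; conventions in Ref. [57], such as the treatment of endpoints, may differ from the ones assumed here.

```python
from collections import deque
from itertools import combinations


def clique_projection(nodes, edges):
    """Unweighted clique projection as an adjacency dictionary."""
    adj = {v: set() for v in nodes}
    for e in edges:
        for u, v in combinations(e, 2):
            adj[u].add(v)
            adj[v].add(u)
    return adj


def betweenness(adj):
    """Brandes' algorithm for an unweighted, undirected graph."""
    centrality = dict.fromkeys(adj, 0.0)
    for s in adj:
        sigma = dict.fromkeys(adj, 0)  # number of shortest paths from s
        sigma[s] = 1
        dist = {s: 0}
        preds = {v: [] for v in adj}
        order = []
        queue = deque([s])
        while queue:
            v = queue.popleft()
            order.append(v)
            for w in adj[v]:
                if w not in dist:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        # Accumulate pair dependencies from the farthest nodes back to s.
        delta = dict.fromkeys(adj, 0.0)
        for w in reversed(order):
            for v in preds[w]:
                delta[v] += sigma[v] / sigma[w] * (1 + delta[w])
            if w != s:
                centrality[w] += delta[w]
    # Each unordered pair (u, v) was counted once from each endpoint.
    return {v: c / 2 for v, c in centrality.items()}


def linf_normalize(values):
    """Divide by the largest value so that the most central node has value 1."""
    largest = max(values.values())
    return {v: (x / largest if largest > 0 else 0.0) for v, x in values.items()}


nodes = {1, 2, 3, 4, 5}
edges = [{1, 2, 3}, {3, 4}, {4, 5}]
raw = betweenness(clique_projection(nodes, edges))
normalized = linf_normalize(raw)
```

In this toy hypergraph, node 3 bridges the 3-hyperedge and the chain of pairwise edges, so it receives a normalized centrality of 1, while node 4 receives 3/4.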

Degree assortativity
For this metric, we use the sunflower hypergraph model as presented in Ref. [20], where we select the number of petals, $n_p$, the interaction size, $k$, and the number of nodes, $c$, that are members of every petal. We populate each petal with the $c$ center nodes and $k - c$ isolated nodes. The resulting hypergraph has $n_p$ edges and $(k - c)n_p + c$ nodes. We compute the terms in the expression for dynamical assortativity for each edge size $k$ as follows:
$$\langle k \rangle = \frac{k n_p}{(k - c)n_p + c}, \qquad \langle k^2 \rangle = \frac{n_p (c n_p + k - c)}{(k - c)n_p + c}, \qquad \langle k k_1 \rangle_E = \frac{c(c - 1)n_p^2 + 2c(k - c)n_p + (k - c)(k - c - 1)}{k(k - 1)},$$
and simplifying the final expression, we obtain
$$\rho = -\frac{c (k - c)(n_p - 1)^2}{(k - 1)(c n_p + k - c)^2}.$$
From this expression, we can see that for $n_p > 1$ the hypergraph will always be disassortative because $k > c \geq 1$, and as the edge size increases, the low-degree nodes will "dilute" the assortativity, eventually leading to a hypergraph with 0 assortativity. See Figure 5 for a visualization of this trend.
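As a sanity check on the sunflower calculation, one can build the sunflower explicitly and compare the dynamical assortativity computed from its definition, $\langle k \rangle^2 \langle k k_1 \rangle_E / \langle k^2 \rangle^2 - 1$, with the closed form $-c(k - c)(n_p - 1)^2 / [(k - 1)(c n_p + k - c)^2]$ obtained by simplifying it. The sketch below uses exact rational arithmetic; the function names and node labeling are illustrative assumptions.

```python
from fractions import Fraction
from itertools import combinations


def sunflower(n_petals, k, c):
    """Sunflower with n_petals hyperedges of size k sharing c center nodes."""
    center = list(range(c))
    edges, next_id = [], c
    for _ in range(n_petals):
        petal_only = list(range(next_id, next_id + k - c))
        next_id += k - c
        edges.append(center + petal_only)
    return edges


def rho_from_definition(edges):
    """Dynamical assortativity evaluated directly from degree statistics."""
    degree = {}
    for e in edges:
        for v in e:
            degree[v] = degree.get(v, 0) + 1
    n = len(degree)
    k1 = Fraction(sum(degree.values()), n)
    k2 = Fraction(sum(d * d for d in degree.values()), n)
    products = [degree[u] * degree[v] for e in edges for u, v in combinations(e, 2)]
    kk1 = Fraction(sum(products), len(products))
    return k1**2 * kk1 / k2**2 - 1


def rho_closed_form(n_petals, k, c):
    """Closed-form dynamical assortativity of the sunflower."""
    return Fraction(
        -c * (k - c) * (n_petals - 1) ** 2,
        (k - 1) * (c * n_petals + k - c) ** 2,
    )


# Compare both expressions over a small grid of parameters.
agree = all(
    rho_from_definition(sunflower(n_p, k, c)) == rho_closed_form(n_p, k, c)
    for n_p in range(1, 6)
    for k in range(2, 9)
    for c in range(1, k)
)
```

The closed form also makes the dilution effect explicit: for fixed $c$ and $n_p$, the numerator grows linearly in $k$ while the denominator grows cubically, so the assortativity approaches zero from below.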

Centrality
Consider a star graph with eight edges joined at a node to a sunflower with $n_p = 3$ and $k = 3$, as seen in Figure 6. In the subhypergraph generated by the LEQ filtering for k = 2, node 15 has an $\ell_\infty$-normalized betweenness centrality of 1 (and the rest 0). However, this drastically changes in the subhypergraph corresponding to k = 3, as the normalized centrality of node 15 decreases to 11/14, that of node 1 increases to 1, and that of node 2 increases to 24/35. Node 1 transitions from being the least to the most central node in the network.

Community structure
The hypergraph shown in Figure 7 contains two dense communities (left, surrounded by dotted ovals), but when applying the GEQ filtering for k = 3 (right), two completely different communities emerge.
Here, we present a table of the summary statistics of all datasets considered in this study.

email-eu
The "email-eu" dataset is a higher-order dataset generated from email data sent within a large European research institution between October 2003 and May 2005 (18 months) [24,59,64,65]. In this dataset, nodes represent email addresses and edges represent email messages. We account for neither the sender-receiver relationships nor the temporality of the emails sent. In addition, we remove duplicate emails involving the same group of people, emails sent to oneself, and email addresses that neither send nor receive emails. This dataset is heterogeneous in both the degree and edge size distributions.
In Figure 8a, we observe that small interactions contribute the most to the effective information and that there is a sharp drop-off in effective information for interactions larger than 32 email recipients. We also observe in Figure 8b that as the interaction size increases, the dataset becomes increasingly assortative. This is not the case for the top-bottom assortativity, which could be caused by a highly dense core with a few peripheral nodes: on average, the degrees are highly correlated, but the highest and lowest degrees are strongly negatively correlated. The community structure is harder to decipher; we have specified 42 communities to infer. Nonetheless, in Figure 9b, we see that as the email size increases, fewer and fewer nodes participate. We also see that excluding interactions of sizes two or three changes the community labels of many nodes. Several communities persist over larger scales of interactions, while other communities are much more sensitive to interaction size. In Figure 9a, we see that smaller interactions stabilize the centrality and that, when excluding larger and larger interactions, the centrality is scale-dependent.
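The cleaning steps listed for this dataset (dropping duplicate groups, self-addressed emails, and inactive addresses) can be sketched as below. This is an illustrative assumption about how such cleaning might be done, not the study's pipeline: the function name `clean_hyperedges` is hypothetical, each raw email is assumed to arrive as a list of participating addresses, and the retained node set is taken to be the addresses that appear in at least one retained interaction.

```python
def clean_hyperedges(raw_edges):
    """Deduplicate interactions and drop those involving a single address."""
    seen = set()
    edges = []
    for members in raw_edges:
        edge = frozenset(members)  # ignore order and repeated addresses
        if len(edge) < 2:
            continue  # an email sent only to oneself
        if edge in seen:
            continue  # same group of people as an earlier email
        seen.add(edge)
        edges.append(edge)
    # Addresses with no remaining interaction are dropped implicitly.
    nodes = set()
    for edge in edges:
        nodes |= edge
    return edges, nodes


raw = [["a", "b"], ["b", "a"], ["c"], ["a", "b", "d"], ["e", "e"]]
edges, nodes = clean_hyperedges(raw)
```

Using frozensets discards the sender-receiver direction, which matches the choice above not to account for it.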

disgenenet
The "disgenenet" dataset is a higher-order dataset describing the relationship between genes and the diseases associated with them [59,66]. In this dataset, a disease is a node, and a gene is a hyperedge (the dual of the dataset available from XGI-DATA [59]). We have removed disease-disease correlations to enforce only disease-gene relationships. Before analyzing the dataset, we applied the LEQ filter with k = 20 for ease of presentation, as the maximum hyperedge size in the complete dataset is in the hundreds. We also excluded diseases with no associated genes and genes that have the same set of associated diseases.
In Figure 10a, this dataset shows decreasing effective information for increasing interaction size. Unlike the email-eu dataset, the pairwise interactions do not have as significant an influence, and excluding larger and larger interactions slightly increases the effective information. Figure 10b indicates that associations between diseases become assortative for those involving more than two diseases. The assortativity remains relatively stable, however, and plateaus for mid-sized interactions. As seen in Figure 11a, for all filterings, the most central node remains so across most interaction sizes, indicating that the diseases most associated with other diseases are consistent across all scales of interaction. Figure 11b shows nodes participating in many higher-order interactions, with many nodes not participating in lower-order interactions. While some communities persist across filtering parameters, many nodes often switch communities.

diseasome
The "diseasome" dataset tracks diseases and the genes associated with them [67]. In this dataset, a disease is a node, and a gene is a hyperedge. The "label" attribute of the nodes is the disease description, and the "label" attribute of the edges is the gene name. The disease-disease correlations were filtered out to enforce only disease-gene relationships. This results in a very sparse hypergraph, precluding us from analyzing community structure.
When ignoring interactions of smaller sizes through filtering, the effective information of the dataset (shown in Figure 12a) decreases. Interestingly, unlike the email datasets, the effective information drops much more sharply for the GEQ filtering, which may be due to the sparsity of larger interactions. This dataset exhibits low assortativity for many filterings and filtering parameters, as seen in Figure 12b. As in other datasets, the pairwise interactions seem to have the most significant effect on the assortativity. As seen from Figure 13, centrality exhibits the most change in the smaller interactions, specifically those smaller than size 6.
… seems to, on the whole, gradually increase with interaction size. The most significant change is in the top-bottom assortativity, where interactions of size 4 are most disassortative. This indicates that the lowest and highest degrees are more anti-correlated, i.e., large-degree nodes tend not to interact with small-degree nodes. We see that the most central nodes are sensitive to pairwise interactions and that, when excluding pairwise interactions, the centrality can change quite randomly with interaction size. We see that the community structure is relatively stable; if anything, Figure 15b indicates that very few nodes participate in interactions larger than 4. This could be an artifact of the proximity data collection method.
FIG. 2. Global higher-order structural measures with respect to the filtering parameter k.
The ℓ∞-normalized betweenness centrality of each node. Disconnected nodes have a centrality of zero by definition.
FIG. 3. Nodal higher-order structural measures with respect to the filtering parameter k.
CONTRIBUTIONS
N.W.L., I.A., M.S., and S.G.A. designed the research; N.W.L. and I.A. performed the research; N.W.L. and I.A. wrote the article; N.W.L., I.A., and S.G.A. edited the draft; and S.G.A. provided supervision.
Appendix A: Structural measures

FIG. 5. A plot of the dynamical assortativity of a sunflower hypergraph for (a) different numbers of petals, n_p, and (b) different center sizes, c. The line colors range from dark to light as the number of petals in panel (a) and the size of the sunflower center in panel (b) are increased from two to nine. In panel (a), the center size is fixed at c = 3, and in panel (b), the number of petals is fixed at n_p = 5.

TABLE I. Summary statistics of cleaned datasets.