Hypernetwork Science via High-Order Hypergraph Walks

We propose high-order hypergraph walks as a framework to generalize graph-based network science techniques to hypergraphs. Edge incidence in hypergraphs is quantitative, yielding hypergraph walks with both length and width. Graph methods which then generalize to hypergraphs include connected component analyses, graph distance-based metrics such as closeness centrality, and motif-based measures such as clustering coefficients. We apply high-order analogs of these methods to real world hypernetworks, and show they reveal nuanced and interpretable structure that cannot be detected by graph-based methods. Lastly, we apply three generative models to the data and find that basic hypergraph properties, such as density and degree distributions, do not necessarily control these new structural measurements. Our work demonstrates how analyses of hypergraph-structured data are richer when utilizing tools tailored to capture hypergraph-native phenomena, and suggests one possible avenue towards that end.


Introduction
In the study of complex systems, graph theory is often perceived as the mathematical scaffold underlying network science [6]. Systems studied in biology, sociology, telecommunications, and physical infrastructure often afford a representation as a set of entities ("vertices") with binary relationships ("edges"), and hence may be analyzed utilizing graph theoretic methods. Graph models benefit from simplicity and a degree of universality. But as abstract mathematical objects, graphs are limited to representing pairwise relationships between entities. However, real-world phenomena in these systems can be rich in multi-way relationships involving interactions among more than two entities, dependencies between more than two variables, or properties of collections of more than two objects.
Hypergraphs are generalizations of graphs in which edges may connect any number of vertices, thereby representing k-way relationships. As such, hypergraphs are the natural representation of a broad range of systems, including those with the kinds of multi-way relationships mentioned above. Indeed, hypergraphstructured data (i.e. hypernetworks) are ubiquitous, occurring whenever information presents naturally as set-valued, tabular, or bipartite data. Additionally, as finite set systems, hypergraphs have identities related to a number of other mathematical structures important in data science, including finite topologies, simplicial complexes, and Sperner systems. This enables use of a wider range of mathematical methods, such as those from computational topology, to identify features specific to the high-dimensional complexity in hypernetworks, but not available using graphs. Although an expanding body of research attests to the increased utility of hypergraph-based analyses, many network science methods have been historically developed explicitly (and often, exclusively) for graph-based analyses. Moreover, it is common that data arising from hypernetworks are reduced to graphs.
In spite of this incongruity, effectively extending graph theoretical tools to hypergraphs has sometimes lagged or proven elusive. A critical aspect of this is one of axiomatization: as a generalization there are many, sometimes mutually inconsistent, sets of possible definitions of hypergraph concepts which can yield the same results consistent with graph theory when instantiated to the graph case. In some cases, developing any coherent hypergraph analog poses significant theoretical obstacles. For example, extending the spectral theory of graph adjacency matrices to hypergraphs poses an immediate challenge in that hyperedges may contain more than two vertices, thereby rendering the usual (two-dimensional) adjacency matrix insufficient for encoding adjacency relations. In other cases, graph theoretical concepts may be trivially extended to hypergraphs, but in doing so ignore structural nuance native to hypergraphs which are unobservable in graphs. For instance, while edge incidence and vertex adjacency can occur in at most one vertex or edge for graphs, these notions are set-valued and hence quantitative for hypergraphs. Consequently, while subsequent graph walk based notions, such as connectedness, are immediately applicable to hypergraphs, they ignore high-order structure in failing to account for the varying "widths" associated with hypergraph walks.
Due to these challenges, scientists seeking tools to study hypergraph-structured data are frequently left to contend with disparate approaches towards hypergraph research. One approach for grappling with hypergraph complexity is to limit attention to hypergraphs with only uniformly sized edges containing the same number of vertices. Much of the hypergraph research in the mathematics literature, such as in hypergraph coloring [21,42], the aforementioned spectral theory of hypergraphs [12,18], hypergraph transversals [3], and extremal problems [60], focus on this k-uniform case only. While imposing this assumption facilitates more mathematically sophisticated and structurally faithful analysis of the hypergraphs in question, real-world hypergraph data is unfortunately very rarely k-uniform. Consequently, such tools are problematic in lacking applicability to real hypernetwork data. Another approach towards hypergraph research is to limit attention to transformations of (potentially non-uniform) hypergraphs to graphs. Sometimes called the hypergraph line graph, 2-section, clique expansion, or one-mode projection, such transformations clearly enable the application of graph-theoretic tools to the data. Yet, unsurprisingly, such hypergraph-to-graph reductions are inevitably lossy [20,38]. Hence, although affording simplicity, such approaches are of limited utility in uncovering hypergraph structure.
To enable analyses of hypernetwork data that better reflect their complexity but remain tractable and applicable, we believe striking a balance between this faithfulness-simplicity tradeoff is essential. With this goal at heart, we study how a number of graph analytic tools popular in network science extend to hypergraphs under the framework of high-order hypergraph walks. We characterize a hypergraph walk as an "s-walk", where the order s controls the minimum walk "width" in terms of edge overlap size. High-order s-walks (s > 1) are possible on hypergraphs whereas for graphs, all walks are 1-walks. The hypergraph walk-based methods we consider include connected component analyses, graph-distance based metrics such as closeness-centrality, and motif-based measures such as clustering coefficients. As each of these methods is based fundamentally on the graph-theoretic notion of a walk, we extend them to hypergraphs by using hypergraph walks. Ultimately, our goal is not only to formulate these generalizations in a cogent manner, but to probe whether these tools reveal prevalent and meaningful structure in real hypernetwork data. To the latter end, we compute these measures based on hypergraph walks on three real datasets from different domains and discuss the results.
Our work is organized as follows: in Section 2, we provide background definitions and review preliminary topics relevant to hypernetwork theory. In Section 3, we define the s-walk notion underpinning our subsequent work and discuss related prior research. In Section 4, we introduce s-walk based analytical measures, apply them to the aforementioned datasets, and briefly analyze the results. In Section 5, we consider three generative hypergraph models, and experimentally test the extent to which the structural properties observed in Section 4 can be replicated by synthetic models. Finally, in Section 6 we conclude and outline several directions for future research.

Preliminaries
Hypergraphs are generalizations of graphs in which edges may link any number of vertices together. Just as "network" is often used to refer to processes or systems which yield data streams which are graphstructured, we will use the term "hypernetwork" to refer to those yielding hypergraph-structured data. More formally, we define a hypergraph as follows: The degree of a vertex is the number of hyperedges to which it belongs, d(v) = |{e : v ∈ e}|, and the size of an edge is its cardinality, |e|. A hypergraph in which all hyperedges have size k is called k-uniform, and a 2-uniform hypergraph is simply a graph. We note that definitions of hypergraphs given in the literature may differ slightly from author to author. For instance, Bretto's definition [10] of a hypergraph is identical to ours, apart from prohibiting empty edges (ei such that ei = ∅). Berge [8] similarly prohibits empty edges, as well as isolated vertices (vi such that vi ∈ ∪ m i=1 ei). In contrast, Katona [36] allows empty edges and isolated vertices, but defines E = {e1, . . . , em} as a set and explicitly prohibits pairs of duplicated edges ei = ej for i = j. In defining E as an (indexed) family of sets, we allow for duplicated edges but require edges be distinguishable by index. The overall generality of Definition 1 in permitting isolated vertices, as well as empty, duplicated, and singleton edges is intended to facilitate the application of hypergraphs to real data, which commonly possess such features.
Under Definition 1, any rectangular binary matrix uniquely defines a labeled hypergraph 1 ; conversely any labeled hypergraph uniquely defines an incidence matrix. Consequently, it can be easily seen there is a bijection between hypergraphs and bicolored graphs. Recall a bicolored graph is a triple (V, E, f ) where V is a set of vertices, E is a set of unordered pairs of vertices, and f : Indexing the rows and columns by vertices such that f (vi) = 0 and f (vj) = 1, respectively, bicolored graphs may also be uniquely associated with incidence matrices (and hence hypergraphs) by defining S(i, j) = 1 if and only if {vi, vj} ∈ E. To avoid confusion, it is worth emphasizing bicolored graphs differ from bipartite graphs in explicitly specifying a fixed bicoloring f . In contrast, a bipartite graph is defined as a graph admitting some bicoloring f . Accordingly, since a bipartite graph with k connected components has 2 k possible bicolorings, a single bipartite graph may correspond to up to 2 k distinct hypergraphs, depending on the bicoloring. In applications, however, it is often the case that a natural bicoloring is already given explicitly by the data (e.g. in an author-paper network) and hence the terms "bipartite" and "bicolored" graphs have been used synonymously.
As an upshot of the hypergraph-bicolored graph correspondence, a number of complex network analytics for bipartite graph data extend naturally to hypergraphs, and vice versa. However, interpreting this correspondence in light of the fact that bicolored graphs are graphs does not mean that graph theoretic methods suffice for studying hypergraphs. To the contrary, bicolored graph data often require entirely different network science methods than (general) graphs. An obvious example is triadic measures like the graph clustering coefficient: these cannot be applied to bicolored graphs since (by definition) bicolored graphs have no triangles. Detailed work developing bipartite analogs of modularity [7], community structure inference techniques [43], and other graph-based network science topics [44] further attests that bipartite graphs (and hypergraphs) require a different network science toolset than for graphs.
While graph theoretic methods should not be conflated with those of bicolored graphs, it is natural to ask why we gear our exposition towards hypergraphs. We utilize the language of hypergraphs because of the fundamentally set-theoretic nature of our approach. Our focus in this work is on hyperedge incidences and hyperwalks that arise from sequences of incident hyperedges. Hyperedges themselves are defined explicitly for hypergraphs, but only implicitly for bicolored graphs (as the neighborhood of a vertex in the color class designated for hyperedges). For this reason, framing our exposition using the language of arbitrary set systems is natural, whereas adopting the constrained language of bicolored graphs would be cumbersome and confusing.
Another important topic highlighted by the bicolored graphs-hypergraph correspondence is the inherent duality of hypergraphs. That is, just as it may be arbitrary to label one partition in a bicolored graph "left" and the other partition "right", which class of objects one designates as "vertices" versus "hyperedges" in a hypernetwork may also be arbitrarily chosen. However, hypergraph properties and methods may be vertex-based or edge-based, and hence differ depending on which choice is made. To avoid limiting one's analysis toward either a vertex-centric or edge-centric approach, it may be prudent to consider both the hypergraph and its dual hypergraph. Loosely speaking, the dual of a hypergraph is the hypergraph constructed by swapping the roles of vertices and edges. More precisely: . . , vn} and family of edges E = (e1, . . . , em). The dual hypergraph of H, denoted H * = (E * , V * ), has vertex set E * = {e * 1 , . . . , e * m } and family of edges Put equivalently, the dual of a hypergraph with incidence matrix S is the hypergraph associated with the transposed incidence matrix, S T . Clearly, (H * ) * = H. Furthermore, observe that two vertices belonging to the same set of edges in H correspond to multi-edges in the H * and isolated vertices in H correspond to empty edges in H * . Thus, the generality of our Definition 1 in permitting multi-edges, empty edges, and isolated vertices ensures the dual of a hypergraph is also a hypergraph. Indeed, as a formal matter, one could go so far as to always consider that hypergraphs present in dual pairs. Note, however, this is not necessarily true when restricted to the graph case: for a graph G, its dual G * is 2-uniform (and hence still a graph) if and only if G is 2-regular, in which case G is a cycle or disjoint union of cycles. Continuing this line of observation, in the complex networks literature, one of the most oft-used tools for studying hypergraph data is its line graph. In the line graph of a hypergraph, each vertex represents a hyperedge, and each edge represents an intersection between a pair of hyperedges. More formally: i , e * j } : ei ∩ ej = ∅ for i = j}. In order to additionally capture information about the size of hyperedge edge intersections, line graphs of hypergraphs may be defined with additional edge weights where {e * i , e * j } has weight |ei ∩ ej|. By definition of matrix multiplication, it is easy to see that for a hypergraph with incidence matrix S, its line graph has edge-weighted adjacency matrix S T S with diagonal entries all converted to zero. Figure 1 gives an example of a hypergraph, its dual, and their respective line graphs. All the hypergraph visualizations in this paper were created using HyperNetX (HNX) [56], a recently released 2 Python library echoing NetworkX [28] for exploratory data analytics and visualization of hypergraphs.
Hypergraph line graphs are also referred to by a plethora of other names. Berge [8] refers to L(H) as both the "line graph" and "representative graph" of H; Naik [49,50] refers to L(H) the "intersection graph" of H. In the complex networks literature on bipartite graphs or "2-mode" graphs, the oftmentioned "one-mode projections" are equivalent to hypergraph line graphs. For instance, L(H) and L(H * ) are referred to as the "top and bottom" projections in [44], similarly, [24] dubs these the "column and row" projections. Moreover, L(H * ) is commonly referred to as the "2-section", "clique graph", or "clique expansion" of the hypergraph H, since the edge set of L(H * ) is generated by taking all 2-element subsets of each edge in H, hence vertices within each hyperedge H form a clique in L(H * ). Consequently, if G is a graph, L(G * ) is identical to G.
Line graphs play an important role in hypernetwork science. Due to the relative dearth of hypergraph analytic tools, line graphs are often analyzed in place of the hypergraphs they were derived from so that classical network science techniques can be applied. However, as has been noted, hypergraph line graphs are fundamentally limited in several ways. First, line graphs are lossy representations of hypergraphs in the sense that distinct hypergraphs can have identical line graphs. We note that such structural loss does not occur for graphs, as Whitney's theorem [67] states, apart from the triangle and 4-vertex star graph, any pair of connected graphs with isomorphic line graphs must be isomorphic. In the case of hypergraph line graphs, Kirkland [38] recently illustrated the structural loss in a severe sense by giving an example of two distinct 19 × 19 incidence matrices S and R respectively, such that both Put equivalently in the language of hypergraphs: although non-isomorphic, the weighted line graphs of the hypergraphs represented by S and R, as well as those of their dual hypergraphs, are both identical. Using tools from combinatorial matrix theory, Kirkland also constructed infinite familities of such pairs of hypergraphs and showed they constitute a vanishingly small proportion of hypergraphs. Accordingly, while one isn't likely to encounter such pairs of hypergraphs in empirical data, Kirkland's work illustrates structural properties of hypergraphs may be lost even when simultaneously accounting for hypergraph duality and using weighted line graphs. Consequently, depending on the properties under consideration, the extent to which line graphs faithfully represent hypergraphs may be unclear. Nonetheless, researchers have offered preliminary evidence that some meaningful, albeit incomplete, hypergraph structure can be extracted from their line graphs [24].
Lastly, as has been previously noted [44,62], another important limitation of line graphs of hypergraphs is computational: sparse hypergraphs can still yield relatively dense line graphs that may be difficult to analyze or store in computer memory. Indeed, this can be easily seen by observing that k-way edge intersections (guaranteed by a vertex of degree k) in the hypergraph yield k 2 edges in its line graph. Particularly if the hypergraph is large and its vertex degree and edge cardinality distributions heavily skewed (common features in real world network data), its line graphs may be too dense to analyze computationally or even construct at all.

From Graph Walks to Hypergraph Walks
One of the most fundamental concepts in graph theory, underpinning a myriad of areas including Hamiltonian and Eulerian graphs, distance and centrality measures, stochastic processes on graphs and PageRank, is that of a walk. For a graph G = (V, E), a walk of length k is a sequence of vertices v0, v1, . . . , v k , such that each pair of successive vertices are adjacent. By definition of a (simple) graph, two adjacent vertices belong to exactly one edge, and conversely, two incident edges intersect in exactly one vertex. Consequently, any valid graph walk can be equivalently expressed as either a sequence of adjacent vertices or as a sequence of incident edges, i.e.
In the setting of hypergraphs, this simple observation no longer holds. Hypergraph edge incidence and vertex adjacency is set-valued and quantitative in the sense that two hyperedges can intersect at any number of vertices, and two vertices can belong to any number of shared hyperedges. This motivates two walk concepts for hypergraphs that are dual but nonetheless distinct: walks on the vertex level (consisting of successively adjacent vertices), and walks on the edge level (consisting of successively intersecting edges). For ease of presentation, and to be consistent with the presentation of similar notions appearing in the literature, we limit our exposition to edge-level hypergraph walks. Nonetheless, we emphasize that both notions are captured when duality is considered, as a vertex-based walk on a hypergraph H is simply an edge-walk on the dual hypergraph H * . We define a hypergraph walk as an "s-walk" on a hypergraph, where s controls for the size of edge intersection, as follows: Definition 5. For a positive integer s, an s-walk of length k between hyperedges f and g is a sequence of hyperedges, f = ei 0 , ei 1 , . . . , ei k = g, where for j = 1, . . . , k, we have s ≤ |ei j−1 ∩ ei j | and ij−1 = ij.
In this way, a graph walk is simply a 1-walk, whereas s-walks for s > 1 are only possible for hypergraphs. When interpreted on the dual hypergraph H * , an s-walk corresponds to a sequence of adjacent vertices in which each successive pair of vertices belong to at least s shared hyperedges. As will become apparent in subsequent sections, a number of basic, yet important, properties of walks in graphs immediately extend to s-walks on hypergraphs. For instance, just as any graph walk ending at vertex v k can be concatenated with any walk starting at vertex v k to form another walk, any s-walk ending at a particular edge can be concatenated to any other s-walk starting at the edge. Consequently, the existence of an s-walk between hyperedges defines an equivalence relation under which hyperedges can be partitioned into s-connected components, which we explore in Section 4.2. Furthermore, this also ensures the length of the shortest s-walk between edges, called s-distance (Section 4.3), satisfies the triangle inequality and defines a bona-fide distance metric on the hypergraph. Finally, in Section 4.4 we explore how one may distinguish between different kinds of s-walks in a hierarchical way, and how the subsequent notions of s-traces, s-meanders, s-paths, and s-cycles lend themselves to discerning substructures native to hypergraphs, such as s-triangles.
Many researchers have considered different notions of "high-order walks" on hypergraphs, abstract simplicial complexes, and related set systems. Indeed, a number of concepts closely related to s-walks have for long appeared in the mathematics literature. Bermond, Heydemann, and Sotteau [9] introduced and analyzed k-line graphs of uniform hypergraphs, which are derived from hypergraphs by representing each hyperedge as a vertex, and linking two such vertices if their corresponding hyperedges intersect in at least k vertices. In this way, a (graph) walk on their line graphs corresponds to an s-walk on a hypergraph.
In [47], Lu and Peng define higher order walks on hypergraphs for k-uniform hypergraphs as sequences of intersecting edges in which the vertices within each edge are ordered. Their work is related to a rich literature on Hamiltonian cycles in k-uniform hypergraphs (e.g. [30,37]) and takes a spectral approach: they introduce these generalized walks as a means for studying a generalized s-Laplacian matrix. Wang and Lee [64] define hypergraph paths as edge sequences in which no successive intersection is a subset of any other. Their motivation is to prove enumeration formulas for certain cycle structures in hypergraphs. In a series of three recent papers [15,16,17], Kang, Cooley, Koch, and others consider a notion of s-walk between s-tuples of vertices. They conduct a rigorous mathematical analysis of the asymptotic s-walk properties of binomial random k-uniform hypergraphs, considering hitting times, the evolution of highorder s-components, and high-order "hypertree" structures. Lastly, in [33,57], authors of the present work briefly considered the s-walk based notion of s-distance as applied to Domain Name System (DNS) cyber data, and the Enron email dataset, respectively.
Like all of the work above, the present work considers high-order hypergraph walks, but for different ends. Most importantly, our approach towards higher-order hypergraph walks differs in being geared first and foremost towards the application and analysis of real hypernetwork data. Accordingly (in contrast to all the work mentioned above, apart from that of Wang and Lee [64]) we do not assume k-uniformity, as real hypernetworks are frequently non-uniform. Furthermore, while our notion of s-walk is more general in the sense of applying to non-uniform hypergraphs, it is also simpler than some of those mentioned above in order to ensure our methods are more computationally tractable in light of the combinatorial explosion inherent in hypergraphs. For instance, the aforementioned work of Lu and Peng [47] defines s-walks between arbitrary ordered s-tuples of vertices; consequently, the s-Laplacian matrix they study is n s × n s , where n denotes the number of vertices and x k = x k k! denotes the falling factorial. Even for a modestly sized hypergraph on n = 10 4 vertices with s = 20, this matrix has size n s ≈ 10 80 , approximately the number of atoms in the known universe. Thus, while this matrix is amenable to theoretical analysis, computational methods are likely infeasible in practice. By defining our notion of s-walk between pairs of unordered hyperedges (or when working with the dual, pairs of single vertices), we develop s-walk based methods that are more tractable in application to real data.
In short, our goals for the present work are threefold: (1) to investigate how walk-based network science measures for graphs generalize to hypergraphs under the s-walk framework; (2) to apply these methods to real hypernetworks and briefly discuss insights they reveal; and (3) to experimentally test the degree to which existing generative hypergraph null models are able to replicate the properties seen in real data captured by these methods.

Hypergraph Walk Framework
In this section, we explore how analytic tools from network science extend to hypergraphs in the hypergraph walk framework. Within each subsection, we focus on one such particular topic (e.g. s-distance, a hypergraph geodesic), and introduce relevant methods in the "Methods" section. In the "Application to Data" section, we apply these methods to real hypernetworks, and briefly analyze the results. Here, we emphasize our goal is not to explain why the observed structure exists using domain-specific analyses. Rather, we identify abstract structural properties revealed by these measures, and highlight how these properties differ within each dataset (as we vary the walk order s), as well as across datasets. In doing so, we illuminate particular properties these measures capture, and point out additional insights revealed by considering s-walk based metrics. While we take a broader, methods-based viewpoint here, we believe such an approach may be more useful in guiding future application-specific studies of these methods across multiple domains. To that end, we consider three datasets from three domains: corporate governance, biology, and text analysis.

Test Data Sets
For each of our data sets, we briefly define the associated hypergraph, review prior graph or hypergraph analyses of these (or closely related) datasets, and discuss basic properties of the data, summarized in Figure 2. In this figure and throughout, we use the same notation as in Definition 4 to refer to the dual hypergraph associated with a data set (e.g. LesMis * refers to the dual hypergraph of LesMis, in which the roles of the vertices and edges are swapped relative to how they are defined below). Figure 2 presents plots of the edge cardinality distribution and pairwise edge intersection cardinality distribution. For instance, the point (x, y) = (3, 100) on the "edges" distribution means there are 100 edges which consist of exactly 3 vertices in that hypergraph; the same point on the "pairwise edge intersection" distribution    means there are 100 distinct, unordered pairs of hyperedges whose intersection contains exactly 3 vertices. We remind the reader that the edge cardinality distribution of the dual hypergraph H * is the same as the vertex degree distribution of H. The table in Figure 2 highlights some basic statistics for each hypergraph. The number of "multiedges" is the number of edges that duplicate (in the sense of set equality) another edge, i.e. |{i : ei = ej for j < i}|. The maximum edge size and edge intersection sizes, reported below the multiedge counts, are particularly pertinent because they determine the range of interest for our measures: the former determines the largest value of s for which s-walk based measures are defined while the latter determines the maximum value of s for which s-walk based measures are non-trivial. Finally, "density" measures the number of vertex-hyperedge memberships relative to the number of possible vertex-hyperedge memberships. Put equivalently, this is the number of nonzero entries in the incidence matrix S divided by the product |V | · |E|. Note that, by definition, density is always the same for the hypergraph and its dual, whereas the other reported values are edge-based and hence may differ.

CompBoard
Data set. A company-board network. Vertices represent board directors (i.e. members of the board), and hyperedges represent company boards. A vertex belongs to a hyperedge if that person sits on the company board. The data consists of 4,573 companies and 32,189 directors that sit on their boards. The companies are identified by their ticker symbols, excluding any location or exchange code suffixes (e.g. Vodafone group is represented solely by VOD, not VOD.L or VOD.O) and were taken from the NYSE, AMEX, and NASDAQ stock exchange listings 3 on 10/1/2018. The data was collected from publicly available 4 board director information listed on Reuters. Board director names were cross referenced against age data to better ensure that different people with the same full name were distinguished.
Prior work. Company-board network studies are historically rooted in corporate elite theory, focusing on companies which share a common board member, called interlocking directorates. Many such studies focus on the line graph representations of the network, linking companies whose boards interlock. For instance, Conyon and Muldoon [14] studied the small-world properties of companyboard networks from the US, UK, and Germany, focusing on the clustering coefficient and average path length of the line graphs. In [53], Newman compares the clustering coefficient of a companyboard network line graph to that of a random model. Measuring hypergraph clustering on line graphs rather than the hypergraph has been rightly noted to be potentially misleading (see, for instance, [48,54]) since the cliques generated by the line graph heavily skew the number of triangles. Levine and Roy [46] appear to be among the first to analyze bipartite representations of companyboard networks directly, rather than solely line graphs. They considered topics such as the average path length, connected component sizes, and proposed a "rubber-band model" 5 to cluster the bipartite network. Later, Robins and Alexander devised a bipartite global clustering coefficient, based on the ratio of bipartite 4-cycles to 3-paths, in order to measure "the extent to which directors re-meet one another on two or more boards" [58]. In Section 4.4, we propose a new notion of hypergraph clustering coefficients and explain how it compares to that of Robins and Alexander, as well graph clustering coefficients measuring on the line graph. More generally, since an "interlocking directorate" is represented by a hyperedge intersection, our methods can be interpreted in this context as not only based on the existence of interlocks (i.e. a pure line graph analysis) but also the size and relative set relationships of that interlock.
Basic properties. The edge size distribution shows that the sizes of company boards are tightly concentrated around 7-10 members and drop off sharply at either end: only about 3% of companies have fewer than 4 members, and 3% have more than 14. In contrast, the edge size distribution of the dual hypergraph is monotonically decreasing, showing more than 99% of board members belong to between 1-3 company boards. The pairwise edge intersection distribution for the hypergraph and its dual similarly exhibit a sharp decrease, and the range of these distributions imply that different companies share up to a maximum of 12 board members, while different members serve on up to maximum of 5 common company boards. Among the three datasets, CompBoard is the sparsest: it contains about 0.03% of possible vertex-hyperedge memberships, as opposed to 0.33% and 2.68% for Diseasome and LesMis, respectively.

Diseasome
Data set. A human gene-disease network from [27]. Vertices represent genes and hyperedges represent genetic disorders. A vertex belongs to a hyperedge if mutations in that gene are implicated by that disease. The data consists of 903 genes and 516 diseases.
Prior work. Goh et al. [27] collected the list of genes, disorders, and their associations from the Online Mendelian Inheritance in Man (OMIM) [29] compendium in 2005. Their study considered the line graphs of hypergraph and its dual, which they dubbed the Human Disease Network and Disease Gene Network. One of their structural observations was that the size of the largest connected component in these networks differed with those generated by random models. In Section 4.2 we study a generalized notion of high-order connected components and compare these against those of random hypergraph models in Section 5.1. For a broader discussion of the potential applications of hypergraphs and hypergraph statistics in biology and genomics, see [39].
Basic properties. The edge size distributions of Diseasome and its dual show that the most genes implicated by a disease is 41 while the most diseases implicated by a gene is 11. The pairwise edge intersection size distribution show that, among pairs of diseases that implicate a common gene, the 94% of such pairs share exactly 1 gene; conversely, examining the same distribution for the dual reveals that among genes associated with a common disease, 98% of such pairs are associated with exactly 1 common disease. Among the three datasets, Diseasome and its dual features the narrowest range of pairwise edge intersection sizes, with maximum edge sizes of 5 and 4, respectively.

LesMis
Data set. A character-scene network from [40]. Vertices represent characters and hyperedges represent scenes from Victor Hugo's novel, Les Misérables. There are 80 characters and 402 scenes.
Prior work. This dataset was collected by Donald Knuth [40] and can be structured according to different granularities in which hyperedges represent the scenes, chapters, books, or volumes of the novel. The line graph of the LesMis hypergraph, often dubbed the Les Mis co-appearence network, has appeared frequently in network science literature for the purpose of demonstrating clustering or modularity methods [25,52], or centrality and ranking methods [4]. With regard to the latter, we apply our proposed hypergraph centrality measure to rank LesMis characters in Section 4.3.
Basic properties. In LesMis and its dual, the largest hyperedge features 9 characters and 137 scenes, respectively, with the latter hyperedge (unsurprisingly) corresponding to the protagonist, Jean Valjean. Compared against other datasets, the edge intersection size distributions are particularly distinct. LesMis * features the largest range of edge intersection sizes across all datasets. Both LesMis and its dual are also notable for featuring the most edge intersections relative to the number of possible edge intersections: respectively, 22% and 8% of all pairs of edges in LesMis and its dual intersect, whereas this ratio is an order of magnitude smaller for Diseasome and its dual and two orders of magnitude smaller for CompBoard and its dual.

Connected Components
Methods Under Definition 1, the graph notions of connectedness and connected components extend naturally to the s-walk framework. Since for any e ∈ E, there can be no s-walk from e to any other hyperedges if |e| < s, the order of an s-connected component is bounded above by |Es|, where Es = {e ∈ E : |e| ≥ s}. More precisely, for any positive integer s and hypergraph H, the edges in Es can always be partitioned into s-connected components. We call a hypergraph s-connected if Es is s-connected.
Observe that, while an s-connected component of a hypergraph H is an equivalence class of edges, a vertex-based notion of s-connected components is obtained by simply applying the above definition to the dual hypergraph H * . In comparing these edge and vertex-based notions, note that the total number of 1-connected components for H and H * are always the same: it is straightforward to see that, in either case, the number of such components is the same as the number of nontrivial connected components (i.e. excluding isolated vertices) in the bipartite graph representation of the hypergraph. In this sense, edge and vertex-based connectedness are equivalent for s = 1 and whenever H is a graph. However, for s ≥ 2, the number of s-connected components for H and H * may differ. Hence, s-connectedness in hypergraphs is richer and more varied for high-orders, yielding dual but distinct vertex and edge-based notions.
An effective way of visualizing and studying basic properties of s-connected components is via its sline graph. As previously mentioned, s-line graphs were studied for k-uniform hypergraphs by Bermond, Heydemann, and Sotteau [9] as early as 1977. A definition for the general case may be stated as follows:  Although more general, s-line graphs are still subject to the same limitations underlying (the usual) hypergraph line graphs, and do not uniquely identify a hypergraph, up to isomorphism. Nevertheless, s-line graphs can be utilized to determine a number of s-walk properties, including s-distance, which we explore in the next section. It is worth stressing, however, that the study of high-order s-walks in hypergraphs is not limited to studying s-line graphs. As we will see in Section 4.4, s-line graphs cannot be used to distinguish between finer classes of s-walks, such as s-meanders and s-paths, and consequently cannot be used to compute s-clustering coefficients, for example. Table 2 we visualize the s-components of the LesMis, Diseasome, and Company Board datasets for s = 1, . . . , 5 by plotting their s-line graphs. A page-sized version of this table is included in Appendix A.

Application to Data In
A number of qualitative differences are readily apparent from the visualization. In the case of LesMis, we observe the majority of hyperedges are contained within a giant component for s = 1, . . . , 5. This means that one can link the majority of characters with each other via a pathway of characters cooccurring in at least one to five scenes together. In the case of Diseasome, for s = 1 we similarly observe a giant component; however, for s ≥ 2, this giant component fragments into small, roughly equally sized components. In this case, as s increases from one to five, many of the shared-gene pathways linking diseases for s = 1 break down. By s = 5, the s-components consist almost entirely of isolated hyperedges (apart from a single pair of closely related diseases,"Diabetes Mellitus" and "Mature onset diabetes of the young (MODY)") meaning that diseases associated with 5 or more genes do not share 5 or more of those genes with other disorders. The most dramatic fragmentation occurs for the CompBoard dataset. For s = 1, we note that 74% of the companies are contained within the giant component (pictured in the lower-left hand corner), while for s = 2, this drops to 0.5%. This affirms that shared board-member pathways linking companies almost always rely on single shared board members.
In order to quantify these observations about the changes in s-connected component sizes more rigorously and completely, we compute several entropy-based measures on the underlying s-connected component size probability distributions. The s-connected component size probability distribution, ps = p s 1 , . . . , p s k , is defined by taking p s j to be the fraction of hyperedges in Es that are in the j'th s-component. For a discrete probability distribution p, its Shannon entropy is given by H(p) = − k i=1 pi log 2 (pi). However, direct comparisons of Shannon entropy on our data may be problematic, as we note the number of hyperedges in Es and number of s-components varies not only between datasets, but also as s varies within each dataset, thereby complicating cross-comparisons of the (unitless) Shannon entropy. In order to facilitate a more meaningful entropy comparison, we consider the normalized entropy called p-smoothness by the authors [34]. Note that H achieves its maximum value of log 2 (k) for the uniform distribution, 1 /k, . . . , 1 /k , and a minimum value of 0 for a fully skewed distribution, e.g. 0, . . . , 0, 1 . For the special case in which k = 1, one takes the limiting definition as k → 1, and defines H(p) := 1. We consider the p-smoothness of the s-component size distribution ps, which we denote Hs = H(ps). In our context, if all s-components are equally sized, then Hs is 1, whereas if the disparity between component sizes is maximal (e.g. |Es| − 1 hyperedges in one s-component, and 1 hyperedge in the other), then Hs approaches 0. In this sense, p-smoothness reflects how smooth or uniform the s-component sizes are, but may not reflect how dispersedly the hyperedges are distributed among s-connected components. For that purpose, we consider an additional measure, again from [34], aptly called dispersion. In our context, the dispersion of the s-component size distribution compares the the number of s-components to the number of possible s-components on a logarithmic scale, i.e.
where Cs denotes the set of s-connected components and Es denotes the set of s-hyperedges.
In Figure 3, we plot the p-smoothness and dispersion for s = 1, . . . smax, where smax = max f,g∈E |f ∩g|. For s > smax, the s-components are either all isolated hyperedges, or non-existent. In all datasets, we observe that both dispersion and p-smoothness tend to increase in s, although, as evident from LesMis, this increase is not always monotonic. In the case of LesMis * , we observe lower values of p-smoothness for each of s = 1, . . . , 5 relative to those for corresponding values of s in the other datasets, consistent with the highly skewed distribution associated with the large component we observed in the visualization. For CompBoard, we observe a large separation between between p-smoothness and dispersion for s = 1. In this case, while the component size distribution is still skewed -and hence has low p-smoothnessthe remaining s-components consist of many isolated hyperedges, reflected in the high dispersion value. Lastly, for the Diseasome dataset, p-smoothness is maximal while dispersion is minimal for s = 1, while for s ≥ 2 both p-smoothness and dispersion closely coincide at values near 1. This reflects the fragmentation of a single giant component into many s-components (hence the high dispersion) that are equally sized (hence the high p-smoothness).

Distance and Centrality
Methods Under Definition 1, it is straightforward to show the length of the shortest s-walk serves as a distance metric function over a set of hyperedges. More precisely Then (Es, ds) is a metric space.
We omit the proof, as the triangle inequality can proved constructively, and the other metric space axioms follow immediately from Definition 3. If the hypergraph H is 2-uniform (i.e. is a graph), then the graph distance between vertices x and y in H is equivalent to the 1-distance between hyperedges x * and y * in H * . With s-distance serving as hypergraph geodesic distance, hypergraph s-analogs of local and global distance-based graph invariants easily extend. Important caveats arise when applying Definition 8 to real data. As we observed in the previous section, it may often be the case that H contains more than one s-component for some values of s, in which case the s-distance between some pairs of edges is infinite. Consequently, the s-eccentricity of every edge (and hence s-diameter and s-radius) and mean s-distance are all infinite; similarly, the s-closeness centrality of every edge is trivially 0. Similar to how these issues are sometimes addressed for graphs, one alternative may be to compute these measures on only the largest s-component. Depending on the analyst's aims, such an approach might be satisfactory, particularly if the majority of hyperedges in Es are contained within the largest s-component, as was seemingly the case in LesMis * .
However, restricting to the largest component may be unsatisfactory in cases where the largest scomponent does not constitute the overwhelming majority of edges in Es, as we observed in the Company Board network for s ≥ 2. In such cases, one may wish to compute s-eccentricity on a per-component basis, taking the extrema over all s-components as the s-diameter and s-radius. One may similarly compute mean s-distance or s-closeness per-component, however, it is unclear how to properly synthesize these values in order to obtain (in the former case) a single global numerical measure or (in the latter case) a ranking over all hyperedges in the entire network. Instead of a per-component approach, an elegant  alternative for averaging graph distances in disconnected graphs, advocated by Newman [51], is to use the harmonic mean instead of the arithmetic mean. This approach was adopted by Latora and Marchiori [45] to define network efficiency as the reciprocal of the harmonic mean path length, proposed as a quantitative measure of small-worldness. Latora and Marchiori termed this measure "efficiency" in reference to how efficiently information might be exchanged over the network. Later, a similar approach was adopted by Rochat [59] to define the harmonic closeness centrality index of vertices in a disconnected graph. Extending these notions to the hypergraph context, a more practical definition of the aforementioned s-distance based notions is given by: (iii) The harmonic s-closeness centrality index of a hyperedge f is In this way, we take the limiting value of 0 for the summand in (ii) and (iii) when f and g are in different s-components. Both (ii) and (iii) are numerical quantities between 0 and 1, with larger values indicative of closer s-distances between hyperedges, either globally (in the former case) or locally (in the latter case). Next, we apply the notions in the above definition to real data.
Application to Data Turning our attention to the data, we compute three of the aforementioned sdistance-based measures: the average s-efficiency index, the s-diameter (i.e. the maximum s-eccentricity over all s-components) and the harmonic s-closeness centrality index.
In Figure 4, we plot the maximum s-diameter (over all s-components) and the average s-efficiency for the hypergraph and its dual of each of our datasets. We recall that, by definition, larger values of average s-efficiency imply smaller s-distances among the hyperedges in question. For a number of the datasets (e.g. LesMis, Diseasome, Diseasome * ) we observe that average s-efficiency and s-diameter tends to decrease as s increases. In these networks, the shortest s-walks linking hyperedges tend to become longer (or infinite) as s is increased. However, in the case of CompBoard * , we observe that average s-efficiency increases in s for each s ≥ 2. This suggests that, among company board members who sit on multiple boards, those who sit on more boards tend to (on average) be closer to one another in sdistance. We observe a similar phenomena regarding average s-efficiency for the LesMis * dataset, where for characters appearing in at least s ≥ 22 scenes, the more scenes they appear in, the closer they are to each other in s-distance.
Turning our attention to s-diameter, we remark it is possible for s-diameter (taken as the maximum over all s-components) to increase or decrease in s. In the former scenario, as s is increased, shorter swalks linking hyperedges may disappear, and those edges may only be linked via longer s-walks, thereby increasing s-diameter. We observe this most prominently for LesMis, where s-diameter increases from 5 to 9 as s increases from 1 to 2. On the other hand, if increasing s eliminates all s-walks between pairs of hyperedges, then these hyperedges are separated into different s-components in which hyperedges may be closer to each other. In such cases, one may observe the s-diameter decrease, as in Diseasome. Consistent with our intuition from the Diseasome visualization in Figure 2, this s-diameter drop reflects the fragmentation of the network into small components; accordingly average s-efficiency also drops because of the infinite s-distances between edges in different s-components.
Lastly, in Figure 5 (top row), we list the top 15 hyperedges in the CompBoard, Diseasome, and LesMis * datasets for s = 1, 2, 3, as ranked according to their harmonic s-closeness centrality. Boxes enclosing hyperedges indicate a tie in s-closeness centrality. Comparing the ordinal rankings across the datasets, we make several observations. In some cases the top 15 ranked hyperedges for s = 1 remain within the top ranked for s = 2 (e.g. for LesMis * , 10 remain in the top 15) whereas in other cases, the top ranked hyperedges may change completely (e.g. in CompBoard, none of the top 15 companies with highest 1-closeness centrality remain in the top 15 for 2-closeness centrality).
A drop in a hyperedge's rank from s = 1 to 2 may indicate that short pathways linking that hyperedge to others rely on sparse hyperedge intersections. To illustrate this with an example, in Diseasome we observe that while "Colon cancer" and "Breast cancer" remain in the top 3 ranked hyperedges for s = 1, 2, 3, "Diabetes mellitus" drops from having the second largest 1-centrality, to having the 34th largest 2-centrality. "Diabetes mellitus" shares genes with 24 other diseases and hence this hyperedge intersects with 24 other hyperedges. However, of these 24 diseases, "Diabetes mellitus" shares at least 2 genes with only two diseases: "Obesity" and "Mature Onset Diabetes of the Young (MODY)". Thus, any 2-walk between "Diabetes mellitus" and another disease can only go through one of these diseases, which (in this case) results in larger average 2-distance between diabetes and other diseases, relative to the average 2-distance between other pairs of diseases. In contrast, "Breast cancer" shares at least 2 genes with 9 other diseases, and (on average) can be linked to other diseases via a shorter 2-walk than for "Diabetes mellitus".
In order to more rigorously explore these the changes in ordinal rankings by s-closeness, we compute Kendall's τB rank correlation coefficient between the top k ranked hyperedges for one value of s and the rankings of those same hyperedges under another value of s. We compute this coefficient for each of k = 10, . . . , |E| and for each ordered pair of s-values from {1, 2, 3}. Hyperedges with equal s-closeness centrality are considered tied in rank, and we assign the minimum s-closeness centrality score of 0 to any hyperedge with fewer than s vertices. We remind the reader that values of Kendall's τB range from -1 (if the ordinal rankings are perfectly inverted) to 1 (if the ordinal rankings are identical), and that that Kendall's τB is explicitly formulated to handle ties in rank [1]. We plot the results in Figure 5 for the CompBoard, Diseasome, and LesMis * datasets. For CompBoard, we see an absence of correlation for the 1-closeness rankings when compared against 2 or 3, and a stronger correlation for the 3-closeness rankings compared against the 2-closeness rankings. When all hyperedges in the network are considered (i.e. for k = |E|, given by the rightmost points in each plot), the 1-closeness rankings of LesMis * exhibits the strongest correlations between the 2 and 3-closeness rankings.

Paths, Cycles, and Clustering Coefficients
Methods So far, we've centered our methods solely around the base definition of an s-walk. However, just as graph walks may be distinguished into finer classes such as trails, paths, circuits and cycles, s-walks may also be distinguished from each other and organized in a hierarchical way. As we will soon see, doing so allows us to distinguish between high-order substructures native to hypergraphs, such as s-triangles, that cannot be determined from their s-line graphs.
Definition 10. For a hypergraph H = (V, E), let the sequence of hyperedges ω = (ei 0 , ei 1 , . . . , ei k ) be an s-walk of length k. For ease of notation let Ij = ei j−1 ∩ ei j be the j'th intersection. The s-walk ω may be further defined as: (i) An s-trace if ix = iy for all x = y (all hyperedges are pairwise distinct by label).
(ii) An s-meander if ω is an s-trace in which Ix = Iy for all x = y (all intersections are pairwise distinct).
(iii) An s-path if ω is an s-meander in which Ix \ Iy = ∅ for all x = y (no intersection is included in another).
If the hypergraph is 2-uniform (i.e. is a graph), the s-meander condition requiring successive edge intersections to be pairwise distinct is equivalent to the s-path condition requiring no intersection is a subset of any other. Hence, s-meanders and s-paths are equivalent for graphs, and both reduce to the usual graph path. We note that Wang and Lee [64] also define hypergraph paths using the same subset condition stated in 10 above.
Observe that the notions of s-walk, s-trace, s-meander, and s-path form a nested hierarchy: every s-trace, s-meander, or s-path is an s-walk; every s-meander and s-path is an s-trace; and every s-path is an s-meander. However, in each case, the reverse may not be true (e.g. an s-meander may not be an s-path). With regard to s-distance (Section 4.3), it is straightforward to show constructively that if there exists an s-walk (resp. s-trace, s-meander) of length k between two hyperedges, there exists an s-trace (resp. s-meander, s-path) of length at most k. This implies the length of the shortest s-walk between two hyperedges is equivalent to the length of the shortest s-path; consequently, s-distance as given by the length of the shortest s-walk is equivalent to that given by the length of the shortest s-path.
While not having ramifications for the notion of s-distance, the finer classes of s-walks above provide a means, within the s-walk framework, to define high-order substructures or motifs that cannot be determined from the s-line graph. To define an example of these substructures, we require the notion of a closed walk. Analogous to its usage in graph theory, we call an s-walk is closed if i0 = i k , and call a closed s-path an s-cycle. As a point of clarification, closed s-traces, meanders, or paths are still considered valid s-traces, meanders or paths (that is, only the terminal edges are exempt from the s-trace requirement that all edges be distinct by label). Using s-cycles, we define hypergraph s-analogs of triadic measures commonly applied to graph data. We note that whereas graph triadic notions like the local clustering coefficient [66] are defined for vertices, the s-analogs below are defined for hyperedges, keeping consistent with the rest of our presentation. We remind the reader that vertex-based notions are obtained by simply applying the below definition to the dual hypergraph, H * .
Definition 11. For a hypergraph H, an s-triangle is a closed s-path of length 3 and an s-wedge is an s-path of length 2. For an s-wedge e0, f, e2, we say f is the center of the s-wedge. In the same way as for the LCC of graphs, one may obtain a global measure for the s-LCC of a hypergraph by taking the mean s-local clustering coefficient over all edges in Es. If H is a graph (i.e is 2-uniform), then the above triadic notions reduce to their usual graph counterparts on H * (e.g. the 1-LCC of hyperedge x * in H * is equivalent to the usual LCC of vertex x).
In Figure 6, we give examples of three different hypergraphs induced by a closed s-walk e1, e2, e3, e1 of length 3; namely, in 6a, a closed s-trace that is not an s-meander, in 6b, a closed s-meander that is not an s-path, and in 6c, a closed s-path of length 3 (i.e. an s-triangle). Observe that the 1-line graphs of all three of these hypergraphs consists of a single triangle, while only the hypergraph in 6c is a 1-triangle (as well as a 2-triangle). This illustrates that s-triangles cannot be determined from line graphs (a fact that is unsurprising, given that the s-line graphs do not encode the subset relationships stipulated for s-paths).
Given that many other definitions of hypergraph clustering coefficients have appeared in the complex networks literature, it is worth clarifying how these notions compare to ours. Estrada [23] proposes a global hypergraph clustering coefficient as a ratio of (non s-walk based) hypergraph triangles to hypergraph wedges. More precisely, Estrada defines a hypertriangle as an alternating vertex-hyperedge sequence with three distinct vertices and three distinct hyperedges such that for each subsequence vi, e k , vj, we have that vi, vj ∈ e k (put equivalently, these are 6-cycles in the bipartite representation of the hypergraph). Thus, returning to the example in Figure 6c, the alternating sequence given by interlacing the pair of vertex and hyperedge triples (v1, v3, v6) and (e1, e2, e3) constitutes a triangle, as does the same pair with v1 replaced with v2. Indeed, it is easy to see that the existence of an s-triangle implies the existence of at least one such hypertriangle as defined by Estrada; however, the converse is not necessarily true (e.g. while the hypergraph in Figure 6b contains many such hypertriangles, neither this hypergraph nor its dual contain any s-triangles). In this sense, Estrada's notion of clustering and ours are fundamentally different.
Other proposed notions of hypergraph clustering differ to ours in being based on averaging various pairwise set theoretic measures between pairs of hyperedges or vertices. For instance, Latapy, Magnien and Vecchio [44] propose a pairwise clustering coefficient between hyperedges ei, ej as |e i ∩e j | |e i ∪e j | . In other words, this is the Jaccard similarity coefficient between the sets of vertices constituting the two hyperedges (or, when applied to the dual, the Jaccard similarity between the sets of hyperedges to which two vertices belong). They then define a local and global notion of hypergraph clustering by averaging this quantity. Zhou and Nakhleh [68] propose local and global hypergraph clustering coefficients based on the pairwise excess overlap between hyperedges. As described and studied further by the authors in [20], excess overlap measures the proportion of the vertices in exactly one of the edges that are neighbors of vertices in only the other edge. Lastly, notions of bipartite graph clustering proposed in the literature, (which via the bicolored graph-hypergraph correspondence mentioned in Section 2 apply naturally in the hypergraph setting) are frequently based on bipartite 4-cycles [2,58]. In the language of hypergraphs, a bipartite 4-cycle is a subhypergraph on two hyperedges and two vertices. Hence (in addition to again not being based in high-order s-walks) these bipartite 4-cycle based notions of clustering differ from our s-triangle based notions in involving only pairs (rather than triples) of hyperedges.
Application to Data In Figure 7, we plot the mean s-LCC and s-GCC (left block) as well as the proportion of triangles and wedges in s-line graph that correspond to s-triangles and s-wedges in the hypergraph (right block) for each of our datasets. We recall that every triangle and wedge in the s-line graph represents a closed s-walk of length 3 and s-trace of length 2; which, in turn, may or may not be an s-triangle or s-wedge, respectively. For all three datasets, we observe that a higher proportion of wedges in the s-line graph correspond to s-wedges compared with the proportion of triangles in Ls(H) that correspond s-triangles.
On the other hand, the datasets exhibit different behavior regarding the absolute size of these proportions, as well as how these proportions vary as s varies. For LesMis * , a relatively large proportion of triangles in Ls correspond to s-triangles than for CompBoard * . Furthermore, the proportions of striangles to s-wedges are much greater, both on average locally (given by the mean s-LCC) as well as globally (given by the s-GCC) than for CompBoard * . In contrast, CompBoard * exhibits an extremely small proportion of triangles in Ls(H) corresponding to s-triangles. This means that whenever there is a triad of board members such that each pair of members belong to common company boards, it is almost always the case that for at least one pair of board members, the set of companies in common are either identical (i.e. forming s-trace that is not an s-meander) or subsets of each other (i.e. an s-meander that is not an s-path). Indeed, s-triangles in both CompBoard and its dual are scarce, which is reflected in the extremely low s-LCC and s-GCC coefficients. Here, it is worth mentioning that (in the context of a company-board network) an s-wedge is a generalization of a different represenatives' interlock 6 , a topic which features prominently in the corporate governance literature [5,46,58]. Furthermore, pairs of hyperedges having an s-distance of 1 and 2 represent so-called "direct interlocks" 7 and "third company interlocks" 8 , respectively, between competing companies. This parallel illustrates how s-path and s-distance based notions may provide a generalized framework for describing and measuring phenomenon already important to particular domains. In the case of CompBoard, given that the aforementioned interlocks between competing companies are regulated by Section 8 of the Clayton Act [5], it is unsurprising that s-wedges (and hence s-triangles) are relatively rare.

Comparison with Generative Hypergraph Null Models
Graph generation serves far-ranging purposes across scientific disciplines. In particular, generative graph models are used for benchmarking, algorithm testing, and creating surrogate graphs to protect the anonymity of restricted data. Here, we apply hypergraph generative models as null models to experimentally test the significance of the high-order properties explored in Section 4. By "null model", we mean a generative model that controls for certain basic features of the data. Such models may be utilized to test whether observed measurements in the data are necessarily consistent with the controlled features. For example, in the Erdős-Rényi graph model, the user specifies the desired number of vertices n and edge-probability p; hence, by controlling n and p one can generate ensembles of random graphs with the same expected edge density. A subsequent comparison of measurements on given graph data against those on random graphs with the same edge density tests whether the measured features can be explained as sole consequences of edge density. To the extent to which the properties of the real and synthetic graphs tend to diverge, this provides evidence that the properties under question cannot be explained as sole consequences of the structural properties preserved by the model. In comparison to their graph counterparts, generative hypergraph models are relatively few. Nonetheless, researchers have recently begun developing a wider variety of hypergraph models, both for the case of uniform hypergraphs [15,16,17,55] and non-uniform hypergraphs [11,19,20,26,35]. In the present work, we consider three generative hypergraph models from [2], which can be thought of as hypergraph interpretations of the graph models Erdős-Rényi (ER) [22], Chung-Lu (CL) [13], and Block Two-Level Erdős-Rényi (BTER) [41,63]. These models were originally presented as "bipartite models" in [2], with similar acknowledgment of the bicolored graph-hypergraph correspondence we discussed in Section 2. While these models were inspired from their graph counterparts and named as such, there may be multiple ways of conceiving these models in the hypergraph/bicolored graph setting, as is often the case with graph-to-hypergraph extensions. In fact, others have proposed non-uniform hypergraph analogs of Erdős-Rényi and Chung-Lu (see [20] and [35], respectively) differing to those considered here with regard to the inputs required, the model itself, and the underlying definition of hypergraph assumed.
We've chosen these particular models for several reasons. First, these models can generate nonuniform hypergraphs in accordance with the full generality of Definition 1. Notably, all three of these models can generate hypergraphs with duplicated edges, which occur frequently in hypergraph-structured data and are highly prevalent on our particular data (see Figure 2). This is in contrast to the nonuniform hypergraph ER and CL models proposed by [20] and [35], which do not permit duplicated hyperedges, but treat hyperedges themselves as multisets. Secondly, taken as a suite, these models have the advantage of providing tiered control over three fundamental properties: (1) vertex-hyperedge density, (2) vertex degree and edge cardinality distributions, and lastly, (3) metamorphosis coefficients, a measure of community structure from [2] which we comment on further shortly. More specifically, ER controls for vertex-hyperedge density, CL controls for density as well as degree distributions, and BTER controls for all three of the aforementioned properties. In fact, taken in sequence, ER, CL and BTER can each be conceived formally as a generalization of the previous model. Lastly, all three three models afford scalable implementations and [2,31] report results on hypergraphs generated using these models with hundreds of millions vertex-hyperedge memberships. We note that open source implementations of all three of the above generative hypergraphs models are available 9 as part of The Chapel HyperGraph Library (CHGL, [31]), a prototype HPC library [32] for large-scale hypergraph generation and analysis written in the emerging programming language of Chapel.

Three Generative Hypergraph Models
We define each of the generative models we consider below. Then, we briefly discuss and compare the model properties.
1. Erdős-Rényi, ER(n, m, p). The user specifies three scalar parameters: the desired number of vertices n, desired number of hyperedges m, and vertex-hyperedge membership probability, p ∈ [0, 1]. For each of the nm vertex-hyperedge pairs, the probability of membership is the same, Pr(v ∈ e) = p.
2. Chung-Lu, CL( dv, de). The user specifies a desired vertex degree sequence dv = (dv 1 , . . . , dv n ) and and desired hyperedge size sequence de = (de 1 , . . . , de m ), which (in order to be realizable by a hypergraph) must satisfy c = n i=1 dv i = m i=1 de i . The probability that a vertex belongs to a hyperedge is proportional to the product of the desired vertex degree and edge size, i.e.
In order to ensure this probability is always less than 1, one may further require the input sequences satisfy maxi,j dv i de j < c.
3. Block Two-Level Erdős-Rényi, BTER( dv, de, mv, me) . In addition to the desired vertex degree and edge size sequences mentioned in Chung-Lu, the user also specifies desired vertex and edge metamorphosis coefficients, mv and me, which, as clarified further in the discussion below, are measures of community structure based on the prevalence of small, dense substructures in the hypergraph. The BTER model is designed to output a hypergraph that matches the input degree distribution and metamorphosis coefficients. The BTER model proceeds in two phases: in the first, metamorphosis coefficients are approximately matched by grouping vertices and hyperedges into small, disjoint sets called affinity blocks and applying the Erdős-Rényi model on each block. In the second, the degree distributions are matched by running the Chung-Lu model on the excess desired degrees, thereby linking the blocks. As formal details of the BTER model are complicated, the reader is referred to [2] for a complete specification.
Before proceeding, we briefly discuss the three models. For the ER model, the expected number of vertex-hyperedge memberships is pnm, and hence this simple model can be used to generate random hypergraphs with a specified vertex-hyperedge membership density. We reported this density for our datasets in Figure 2. For the CL model, each vertex v achieves its user-specified desired degree dv in expectation since An identical argument also shows each hyperedge e achieves its desired size de in expectation. In this way, CL not only matches the desired vertex-hyperedge membership density in expectation like ER, but additionally matches the vertex degree and edge size distributions in expectation. We reported these degree distributions for our datasets in Figure 2.
The CL model is a generalization of the ER model in the sense that the ER can be obtained from CL by taking the degree and edge size sequences to be constant, i.e. CL (mp, mp . . . , mp Lastly, the BTER model (which, as explained in [2], utilizes the CL model as a subroutine) can be understood as a generalization of CL. The BTER model is designed to match not only vertex and edge size distributions, but also per-degree metamorphosis coefficients. A complete definition of metamorphosis coefficients is involved; interested readers are referred to [2] for full details. Nonetheless, to elucidate how metamorphosis coefficients are interpreted in the hypergraph setting, we provide a high-level description here.
Metamorphosis coefficients are measures of network community structure based on counts of bipartite 4-cycles, also called butterflies, and bipartite 3-paths, also called caterpillars. In the language of hypergraphs, a butterfly is a subhypergraph consisting of two vertices and two edges intersecting in those two vertices; a caterpillar is an edge with two vertices intersecting with another edge in one of those vertices. The authors in [2] define metamorphosis coefficients for vertices within each of the two partitions of a bipartite graph, based on the ratios of butterfly to caterpillar counts those vertices participate in. Stated equivalently, this defines metamorphosis for the vertices and hyperedges of a hypergraph. If a hyperedge e has a large metamorphosis coefficient, this means a large proportion of the edges that e intersects with intersect in (at least) 2 vertices; dually, if vertex v has large metamorphosis, then a large proportion of vertices v shares an edge with share (at least) 2 edges. The BTER model is designed to match degree distributions, as well as the average metamorphosis coefficients for vertices and hyperedges of a given degree and cardinality, respectively.
Taken as a suite, these three models serve particularly well as null-models since each provides successively more control over hypergraph structure than the previous, providing the flexibility to choose different tiers of structural nuance for the generated hypergraphs. In the next section, we run each model multiple times on each dataset, and compare their s-walk properties. By (for example) "running CL on LesMis", we mean extracting the model inputs (in this case, the vertex degrees and hyperedge sizes) from the data, and using the Chung-Lu model to generate a hypergraph under these inputs.

Comparison
In Figure 8, we compare s-walk based properties of LesMis * , Diseasome, and CompBoard against those of synthetic hypergraphs generated by ER, CL, and BTER. For each dataset, we generate 100 instances of each synthetic model and compute the properties in question for each instance. The plot reports the average values observed over the 100 trials, for each s.
In the leftmost column in Figure 8, we use Kolmogorov-Smirnov (KS) distance to compare the scomponent size probability distributions of the original and synthetic hypergraphs. We remind the reader that KS distance is normalized between 0 and 1, with smaller KS-values indicating greater similarity 10 .
Comparing the models, we observe that as s increases, the ER model tends to exhibit higher KS distance than for CL and BTER, indicating s-component size distributions that are more dissimilar to the original. All three models seem to exhibit larger KS distances for larger s values, although in some cases (e.g. for CL and BTER on LesMis * ) this increase is not monotonic in s. One notable exception is CompBoard, in which all models exhibit much larger KS distance for s = 1 than s = 2. This can likely be attributed to the large number of isolated hyperedges observed in 1-components of CompBoard: in contrast, all three models tend to output hypergraphs in which the majority of hyperedges are contained within a single giant 1-component. Turning our attention to our s-distance based measures, we compare the original s-diameter (center right) and average s-efficiency (far right) of the three models against those of the synthetic model's hypergraphs. As the average s-efficiency plots are in log-scale, average values of 0 (which occur whenever no two hyperedges intersect in s vertices) are not plotted. Comparing the three models, ER tends to have lower s-diameter and average s-efficiency as s increases, when compared to CL and BTER. For LesMis * and Diseasome, CL and BTER seem to perform comparably; for CompBoard, however, BTER does noticeably better than both in matching average s-efficiency for s ≥ 2, although still diverging from the original values considerably for s ≥ 5. Lastly, we consider the model's performance with regard to the mean s-local clustering coefficients (center left). For all three datasets, ER produces smaller clustering coefficients than observed in the original data, for all s. For CL and BTER, we observe that for smaller values of s, the mean s-local clustering coefficients sometimes exceed those of the original data (e.g. for s ≤ 2 on Diseasome), while for some larger values of s (e.g. for 13 ≤ s ≤ 19 on LesMis * ), BTER and CL produce smaller local clustering coefficients than those of the original data.
Taking a broader view of these results, note that none of these three models are able to provide a consistent, close match across values of s. This suggests the s-walk-based measures in question cannot be explained as sole consequences of the model inputs (e.g. degree distributions for the Chung-Lu model), that are preserved in expectation in the output hypergraphs. Nonetheless, this experiment should not be extrapolated to provide generalized guidance on which model best preserves certain s-walk properties. Indeed, depending on the properties of the data in question, it may be the case that ER (the least accurate model on our data) provides a closer match than CL or BTER. In order to provide more a comprehensive approach to such questions, it would be of interest to determine conditions on model inputs under which certain s-walk properties of the output hypergraphs can be tightly bounded or controlled. While such work is outside the scope of the present paper, the aforementioned research by Kang, Cooley, and Koch [15,16,17] illustrates that proving analytical guarantees on even basic high-order walk based properties in random hypergraphs (such as the size of the largest s-component) requires sophisticated probabilistic analysis.

Conclusion
The prevalence and complexity of hypernetwork data necessitates analytic methods that are both applicable and able to capture hypergraph-native phenomena. We have proposed that hypergraph s-walks provide a framework under which a number of graph analytic tools popular in network science extend more meaningfully to hypergraphs. In applying these measures to real data, we've explored how they may reveal varied, interpretable, and significant structural properties of the data otherwise lost when analyzing hypergraphs under the lens of the usual graph walk. The methods that we've focused onconnected component analyses, distance-based measures, high-order motifs and clustering coefficientsare meant to illustrate the breadth of tools to which this approach is relevant. However, ours is clearly far from an comprehensive exploration. We conclude by briefly outlining several promising lines of future work that highlight the limits of our approach.
One immediate open question concerns how the methods we've developed here may be generalized further. For instance, it would be of both theoretical and practical interest to develop tractable s-walk based measures for weighted hypergraphs (with real-valued vertex and/or edge weights), directed hypergraphs (in which each edge's vertices are either in its "head" or "tail"), ordered hypergraphs (in which each edge's vertices are totally ordered), or temporal hypergraphs (consisting of sequences of hypergraphs). Another open direction lies in devising efficient computational methods for the s-walk measures put forth here. Indeed, we did not explore the algorithmic aspects underlying these methods. In some cases, the methods we utilized -while sufficient on our data -were not scalable to massive hypergraph data (e.g. computing s-centrality via the s-line graph quickly becomes infeasible for large hypergraphs with skewed degree distributions, as the density of s-line graphs increase quadratically in the maximum vertex degree). Developing algorithms that leverage the sparsity of the hypergraph (rather than resorting to computation on dense s-line graphs) would help facilitate the application of these methods to larger-scale data. Furthermore, just as researchers have begun developing efficient schemes for computing atomic bipartite graph motifs such as cycles of length 4 [61,65], work in a similar vein would prove useful for enabling large-scale s-triangle counting in hypergraphs.