BiFold visualization of bipartite datasets

Jiang, Yazhen; Skufca, Joseph D; Sun, Jie

doi:10.1140/epjds/s13688-017-0098-4

Regular article
Open access
Published: 20 April 2017

EPJ Data Science volume 6, Article number: 2 (2017)

Abstract

The emerging domain of data-enabled science necessitates development of algorithms and tools for knowledge discovery. Human interaction with data through well-constructed graphical representation can take special advantage of our visual ability to identify patterns. We develop a data visualization framework, called BiFold, for exploratory analysis of bipartite datasets that describe binary relationships between groups of objects. Typical data examples would include voting records, organizational memberships, and pairwise associations, or other binary datasets. BiFold provides a low dimensional embedding of data that represents similarity by visual nearness, analogous to Multidimensional Scaling (MDS). The unique and new feature of BiFold is its ability to simultaneously capture both within-group and between-group relationships among objects, enhancing knowledge discovery. We benchmark BiFold using the Southern Women Dataset, where social groups are now visually evident. We construct BiFold plots for two US voting datasets: For the presidential election outcomes since 1976, BiFold illustrates the evolving geopolitical structures that underlie these election results. For Senate congressional voting, BiFold identifies a partisan coordinate, separating senators into two parties while simultaneously visualizing a bipartisan-coalition coordinate which captures the ultimate fate of the bills (pass/fail). Finally, we consider a global cuisine dataset of the association between recipes and food ingredients. BiFold allows us to visually compare and contrast cuisines while also allowing identification of signature ingredients of individual cuisines.

1 Introduction

Despite the dominance of automated algorithms for data mining and knowledge discovery, it has been increasingly recognized that human perception can play an essential and often favorable role in exploring patterns and developing insights [1]. For instance, the Hertzsprung-Russell diagram of stellar luminosity versus temperature provides a classic example of a data analysis problem that is easily tackled by a person but remains a challenge for automated methods [2]. Typically, the utilization of human cognition in exploratory data analysis relies on proper representation and visualization of the data in a low-dimensional embedding space [3–8].

The standard concept of a “dataset” is a tabular array, where each row corresponds to an object in the dataset and every column corresponds to a variable (or factor) measured on each object. A natural question about such a dataset is “how are objects like (or unlike) other objects and are there relevant relationships among collections of objects?” Multidimensional scaling (MDS) refers to a family of techniques that address these questions by visualizing the objects as a set of points embedded in a low-dimensional (typically 2-D or 3-D) geometric space, with the goal of representing the dissimilarities between objects by the distances between the corresponding points in the embedded space [9, 10]. The generality of MDS approaches makes them suitable for a broad range of practical problems, as demonstrated in many classical examples [9, 10] as well as in several recent scientific breakthroughs: mapping of brainwide neural behavior [11], discovery of sex-specific and species-specific perceptual spaces among different biological species [12], and analysis of biogeographic differentiation between geographical regions [13]. On a more fundamental level, several recent developments focused on generalizing different measures of “distance” in the MDS formulation to allow for embedding from and/or onto general nonlinear manifolds [14–16].

Frequently, we encounter a dataset that encodes a binary relation between two sets (or “classes”) of objects, with elements of one set corresponding to the rows, elements of the other corresponding to the columns, and the data entries (“1” or “0”) indicating whether or not there is a relationship between the associated row and column. Common examples include politicians and the bills they supported, movie-goers and the movies they attend, or students and the courses in which they enroll. Such examples can be regarded as decision-makers and choices, although similar datasets arise in many other contexts that are often described by bipartite graphs, such as the association between genes and diseases [17] and the relation between chemical reactants and reactions [18].

Knowledge discovery on binary relation datasets can benefit from a visualization of both decision-makers and choices in a common embedding space, where (simultaneously)

(1) “Similar” objects (whether decision-makers or choices) ought to be “nearby” in the visualization;
(2) Decision-makers should be positioned “close” to their preferred choices.

The BiFold method developed here relates to a set of ordination methods that attempt to resolve various aspects of this challenge. From a classical perspective, there are three primary choices on which to build such a framework (we point the reader to [9, 10] and the references therein for details of common methods): Biplot [19] aims to satisfy requirement (1), with points (typically referred to as “samples”) representing one set of objects and coordinate axes (often referred to as “variables” and plotted as position vectors) describing the other set; Unfolding [20] considers only between-class distances, and therefore focuses only on requirement (2); Correspondence analysis [21–23] focuses on contingency table data rather than general binary relation data. The BiFold method developed herein merges the respective goals of Biplot and Unfolding methods, satisfying both requirements (1) and (2), and it is this connection that motivates the name.

In addition to the classical ordination methods described above, we note that BiFold has similar goals to nonlinear and generalized biplot methods described in [24, 25]. In particular, the generalized biplot addresses categorical variables (of which dichotomous variables are a subset) with consideration of both requirements (1) and (2) in developing the ordination. Their approach is to ordinate each entry in the dataset, such that each level of a categorical variable is separately visualized. For binary variables, that approach would require representation of one of the classes of objects by two sets of ordination coordinates, one to represent the “1” and the other to represent the “0” in the data. Our original contribution here is two-fold: (i) Our treatment is completely symmetric with respect to the classes, with neither being treated as the “variables.” The resulting ordination is identical, even if we transpose our dataset. Consequently, each object, regardless of class, is assigned only one coordinate. (ii) We consider an ordination scheme that accounts for the difference in information quality of cross-group and within-group distances, as well as the difference in information content across groups of different size, as specialized to the binary data framework. The ordination approach is more naturally able to account for the difference in information content arising from the non-square data matrices, missing data, and differences in interpretation of matches for categorical variables [25].

We begin with an introductory example of BiFold below, leaving the details of the approach and more examples to the later sections.

An introductory example - BiFold plot of the southern women dataset

Consider the Southern Women dataset, collected in the 1930s in a small town in the southern United States. The data records the participation of 18 ladies (Southern Women) in 14 social events [26] and can be represented by matrix $B=[b_{ij}]_{18\times 14}$, where $b_{ij}=1$ indicates that woman i attended event j, and $b_{ij}=0$ otherwise (see Figure 1 middle panel as well as Materials and Methods). Due to its relatively small size and simple structure, the dataset serves as a popular benchmark for techniques that consider social stratification, group formation, and other social structure questions [27].

One way to visualize the Southern Women dataset is to use MDS to place the 18 women at suitable 2-D locations, where distance between embedded coordinates reflects the degree to which the women attended similar events, as in Figure 1-top left. In this case, we are treating the women as the entities to be plotted, while the events are regarded as factors that characterize each individual. Alternatively, we can treat the events as entities and the women as factors, allowing us to obtain an MDS configuration of events, as shown in Figure 1-bottom left. The goal is to “overlay” the two embeddings to not only capture the within-class relationships (woman to woman, event to event) but also the cross-class relationship of woman to event. BiFold produces such a joint visualization (Figure 1-right panel) in which social group structure [27, 28] is easily identified through proximity: (a) nearby women attended similar events, (b) nearby events were attended by similar groups of women, (c) nearby woman-event pairs indicate that the woman likely attended that event.

2 Results

2.1 The BiFold approach

BiFold provides a procedural framework to produce a low-dimensional embedding from a binary data matrix. First, we create a joint dissimilarity matrix that appropriately fuses information from both within-class and cross-class relations. Secondly, we construct a weighting matrix to reflect the relative uncertainty associated with the dissimilarities. Finally, we minimize a weighted stress function to obtain a BiFold embedding, coordinates in $\mathbb{R}^{d}$ for each row and each column of the data matrix. In this section, we describe and explain this framework, leaving the detailed specification of the algorithms and parameters to Materials and Methods.

Consider a binary relation between two types (classes) of objects, encoded as a matrix $B=[b_{ij}]_{m\times n}$, where

$$ b_{ij}= \begin{cases} 1,&\text{if object $i$ of type 1 relates to object $j$ of type 2};\\ 0,&\text{otherwise}. \end{cases} \tag{1} $$

Such data equivalently encodes a bipartite graph, where an edge in the graph corresponds to a binary relationship, and the matrix B is the biadjacency matrix of the graph [29].

Focusing on objects represented by the rows of B, we quantify, using some appropriate measure, the dissimilarity between row i and row j, denoted $\delta^{(x)}_{ij}$, producing matrix $\Delta ^{(x)}=[\delta^{(x)}_{ij}]_{m\times m}$. Likewise, we generate dissimilarity matrix $\Delta^{(y)}=[\delta ^{(y)}_{ij}]_{n\times n}$ by comparing the columns of B. Finally, the dissimilarity between row object i and column object j is defined by a monotonic transformation of the entries in matrix B, which yields a cross-class dissimilarity matrix $\Delta^{(xy)}=[\delta ^{(xy)}_{ij}]_{m\times n}$. A binary relation dataset typically falls into one of the two categories: (1) choice data, for which each data entry (whether “0” or a “1”) reflects an active decision (either a positive or negative relation); and (2) association data, for which the “0”s indicate only an absence of a relation, and are usually much less informative than the “1”s. For data from each of these categories, we have developed some sensible dissimilarity measures (see Materials and Methods).

Given within-class dissimilarity matrices $\Delta^{(x)}$ and $\Delta ^{(y)}$ together with the cross-class dissimilarity matrix $\Delta ^{(xy)}$, we form a joint dissimilarity matrix of size $(m+n)\times(m+n)$, as:

$$ \Delta= \begin{bmatrix} \alpha_{x}\Delta^{(x)} & \alpha_{xy}\Delta^{(xy)}+\beta\mathbf{1} \\ \alpha_{xy}\Delta^{(yx)}+\beta\mathbf{1} & \alpha_{y}\Delta^{(y)} \end{bmatrix} , \tag{2} $$

where $\mathbf{1}$ is a matrix of ones (of the same size as the block in which it appears), and $\Delta^{(yx)}$ is the matrix transpose of $\Delta^{(xy)}$. The joint dissimilarity matrix contains a few tunable parameters: $\alpha_{x}$, $\alpha_{y}$, and $\alpha_{xy}$ control the relative scale of the within-class and cross-class distances in the embedded space, while β allows for explicit translation (i.e., shifting) of the row objects away from the column objects. In the examples of this paper, we use $\beta= 0$ (no translation). Note, however, that for some datasets the visualization might be improved by a translation (realized by a nonzero β value) as determined by the end-user.
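As an illustration of Eq. (2), the block assembly can be sketched in a few lines of NumPy. This is a minimal sketch, not the authors' implementation: the function name `joint_dissimilarity`, the default scale parameters of 1, and the toy matrix are assumptions made for the example, and the within-class blocks are left as zeros purely as placeholders.

```python
import numpy as np

def joint_dissimilarity(d_x, d_y, d_xy, alpha_x=1.0, alpha_y=1.0,
                        alpha_xy=1.0, beta=0.0):
    """Assemble the (m+n) x (m+n) joint dissimilarity matrix of Eq. (2).

    d_x  : (m, m) within-class dissimilarities of the row objects
    d_y  : (n, n) within-class dissimilarities of the column objects
    d_xy : (m, n) cross-class dissimilarities
    The default parameter values are illustrative assumptions.
    """
    d_x, d_y, d_xy = (np.asarray(a, dtype=float) for a in (d_x, d_y, d_xy))
    m, n = d_xy.shape
    if d_x.shape != (m, m) or d_y.shape != (n, n):
        raise ValueError("block shapes are inconsistent")
    # Adding the scalar beta is the same as adding beta times a matrix of ones.
    cross = alpha_xy * d_xy + beta
    return np.block([[alpha_x * d_x, cross],
                     [cross.T, alpha_y * d_y]])

# Toy example (hypothetical data): 3 row objects, 2 column objects.
B = np.array([[1, 0],
              [1, 1],
              [0, 1]])
d_xy = 1.0 - B                      # cross-class dissimilarity 1 - b_ij
d_x = np.zeros((3, 3))              # placeholder within-class blocks
d_y = np.zeros((2, 2))
delta = joint_dissimilarity(d_x, d_y, d_xy)
```

Because the lower-left block is built as the transpose of the upper-right block, the result is symmetric whenever the two within-class blocks are symmetric.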

Dissimilarities are generated from data and should be viewed as a measurement with uncertainty. To capture such uncertainty, we associate a “weight” to each dissimilarity following the principle that the weight should reflect the information content (or reliability). We denote the corresponding joint weighting matrix as

$$ W = \begin{bmatrix} W^{(x)} & W^{(xy)} \\ W^{(yx)} & W^{(y)} \end{bmatrix} . \tag{3} $$

Once the joint dissimilarity and weighting matrices are specified, a d-dimensional BiFold embedding yields coordinates $X=(\boldsymbol {x}_{1},\boldsymbol {x}_{2},\dots,\boldsymbol{x}_{m})$ and $Y=(\boldsymbol{y}_{1},\boldsymbol {y}_{2},\dots,\boldsymbol{y}_{n})$, the sets of points corresponding to the row objects and the column objects, respectively. Denote the full coordinate set as $Z=(X,Y)=(\boldsymbol{z}_{1},\boldsymbol{z}_{2},\dots,\boldsymbol{z}_{m+n})$. Such an embedding is computed by minimization of the multivariate stress function $S:\mathbb{R}^{d\times(m+n)}\rightarrow\mathbb{R}$, defined as

$$ S(\boldsymbol{z}_{1},\boldsymbol{z}_{2},\dots, \boldsymbol{z}_{m+n})=\sum_{k,\ell=1}^{m+n}w_{k\ell } \bigl( \Vert \boldsymbol{z}_{k}-\boldsymbol{z}_{\ell} \Vert - \delta_{k\ell} \bigr)^{2}. \tag{4} $$

Here $\Vert \cdot \Vert $ denotes a metric distance in the common embedding space. The stress function S, which is given by a weighted sum of the discrepancies between the embedded distances and the dissimilarities, is a standard type of loss function frequently used in MDS [9, 10].
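To make the stress function of Eq. (4) concrete, the sketch below evaluates it for a given configuration and performs textbook weighted SMACOF (Guttman transform) updates, which are known not to increase the stress. It is a sketch under stated assumptions rather than the authors' code: the Euclidean norm is assumed, points are stored as rows (the transpose of the $d\times(m+n)$ convention above), the weights are assumed symmetric with a zero diagonal, and all function names and the synthetic data are hypothetical.

```python
import numpy as np

def pairwise_distances(z):
    """Euclidean distances between all rows of z (shape: points x d)."""
    diff = z[:, None, :] - z[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))

def stress(z, delta, w):
    """Weighted stress of Eq. (4), summed over all ordered pairs (k, l)."""
    return float((w * (pairwise_distances(z) - delta) ** 2).sum())

def smacof_step(z, delta, w):
    """One Guttman-transform update (w symmetric, zero diagonal)."""
    d = pairwise_distances(z)
    # delta / d where d > 0, and 0 elsewhere (including the diagonal)
    ratio = np.divide(delta, d, out=np.zeros_like(d), where=d > 0)
    b = -w * ratio
    np.fill_diagonal(b, 0.0)
    np.fill_diagonal(b, -b.sum(axis=1))   # rows of b sum to zero
    v = np.diag(w.sum(axis=1)) - w        # weighted graph Laplacian
    # The pseudo-inverse is recomputed every step for brevity only.
    return np.linalg.pinv(v) @ b @ z

# Synthetic check: dissimilarities that are exactly realizable in 2-D.
rng = np.random.default_rng(0)
truth = rng.normal(size=(6, 2))
delta = pairwise_distances(truth)
w = np.ones((6, 6)) - np.eye(6)
z0 = rng.normal(size=(6, 2))
z = z0.copy()
for _ in range(200):
    z = smacof_step(z, delta, w)
s_start, s_end = stress(z0, delta, w), stress(z, delta, w)
```

The iteration converges only to a local minimizer, so in practice several random initial configurations would typically be compared.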

In the following sections, we illustrate BiFold via three additional binary datasets: US presidential election results, US senate voting records from the 112th Congress, and a food-recipe relational dataset from five major global cuisines.

2.2 Examples: voting datasets

A common type of binary relation comes from voting data: for each item to be voted on, a voter either votes “for” or “against,” with (perhaps) the ability to abstain. We consider two such examples. The first is the US presidential election results for the past ten elections, for which the “voters” are the individual US states and the “items” are the winning presidents in each election. The second is the Senate roll call votes, with US senators as the voters and Senate bills as the items.

2.2.1 Presidential election by states

Consider the state-level votes for the United States presidential elections for the period from 1976 to 2012. There are 51 decision makers (50 states plus the District of Columbia) and a total of 10 decisions, resulting in a data matrix $B=[b_{ij}]_{51 \times10}$ where

$$ b_{ij}= \begin{cases} 1 & \text{if state $i$ voted for the winner of election $j$}, \\ 0 & \text{otherwise}. \end{cases} \tag{5} $$

As with the Southern Women example, we seek a low-dimensional visualization which captures the within-class relationships (state-to-state and president-to-president) while also accounting for the between-class relationships (state-to-president). To quantify these relationships, we define the dissimilarity between two states as the fraction of elections for which they voted differently; the dissimilarity between two elections is computed as the fraction of states which voted differently in those elections; finally, the dissimilarity between state i and election j is quantified as $1-b_{ij}$. Given these dissimilarities, Figure 2 visualizes these election results, where coordinates are determined using BiFold.
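The three dissimilarities described above can be sketched as follows. This is an illustrative sketch on a hypothetical 3-by-4 matrix, not the actual election data; it assumes that "voted differently" is read off by comparing rows (for states) or columns (for elections) of $B$ entry by entry, and the function name `hamming_dissimilarity` is our own.

```python
import numpy as np

def hamming_dissimilarity(rows):
    """Fraction of positions in which each pair of rows differs."""
    rows = np.asarray(rows)
    return (rows[:, None, :] != rows[None, :, :]).mean(axis=-1)

# Hypothetical toy data: 3 "states" (rows) by 4 "elections" (columns).
B = np.array([[1, 1, 1, 1],
              [1, 0, 1, 1],
              [0, 0, 1, 0]])

d_states = hamming_dissimilarity(B)       # state-to-state
d_elections = hamming_dissimilarity(B.T)  # election-to-election
d_cross = 1 - B                           # state-to-election: 1 - b_ij
```

Here the first two states differ in one of four elections, so their dissimilarity is 0.25, while the first and last elections have identical columns and therefore a dissimilarity of 0.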

Using BiFold for positional layout, we may encode additional information using other aesthetics. In Figure 3, states are colored according to party affinity (based on the fraction of these 10 elections in which that state voted for the presidential candidate from that party, with Republican in red and Democrat in blue).

Aided by the additional encoded information, the BiFold layout yields some interesting observations.

Not surprisingly, the primary coordinate axis (left/right) strongly encodes the party affinity (blue state/red state).
Over time, the election positions have (generally) moved toward the left/right extremities, capturing the increasing partisanship of the elections.
Most of the purple-colored “swing states” are near the center of the visualization, which implies that they align with most of the election winners, with slight variation based on the particular set of presidents that they supported. As an interesting exception, West Virginia lies far above the main cloud, which we attribute to its trend of having often supported the non-winning candidates.
Noting several “paired” election coordinates, we observe that such pairs associate with two-term presidents, likely because their constituent support did not change much between elections.
Positional outlier Carter ‘76 reflects support from a non-typical coalition of states, likely attributed to Carter being the first president elected from the Deep South since the Civil War. Reagan ‘84 is the most centrally positioned, reflecting broad national support. Figure 3b connects each of these elections to the supporting states.
Comparing Bush ‘00 to Obama ‘12 (Figure 3c) we see both elections driven primarily by partisan support.

We remark that any of the visually indicated hypotheses should be viewed as exploratory and be confirmed by additional quantitative analysis (as would also be appropriate for most other data visualizations). However, we note that the BiFold visualization motivates a rich palette of such hypotheses, many of which directly exploit the between-class information.

2.2.2 Senate congressional roll call votes

We consider the voting record of the United States (U.S.) Senate. The U.S. legislative body is composed of two chambers, known as the Senate and the House of Representatives. A particular congress serves for two years, with this time frame divided into two sessions. We focus on voting data from the first session of the Senate in the 112th Congress, which conducted 235 roll call votes. For each roll call vote, there are at most 100 senators. However, the replacement of Senator Ensign by Senator Heller mid-session leads to a data matrix with 101 rows and 235 columns, recording the action of senator i on bill j:

$$b_{ij}= \textstyle\begin{cases} 1 & \text{if a ``yes'' vote}, \\ 0 & \text{if a ``no'' vote}. \end{cases} $$

If senator i did not act on bill j, entry $b_{ij}$ is undefined and is treated as missing data. (See Materials and Methods for treatment of missing data.)

As with the previous examples, the goal (achieved by BiFold) is to obtain an embedding that captures both the within-class relationships (senator-to-senator and bill-to-bill) and the between-class relationships (senator-to-bill). From the data, we quantify the dissimilarity between two senators (bills) as the estimated probability that they vote (were voted) differently. Dissimilarity between senator i and bill j is the estimated likelihood that senator i objects to bill j.
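One simple way to estimate such a probability in the presence of missing votes is to count disagreements only over the bills on which both senators actually voted. The sketch below does exactly that; it is a plug-in estimate chosen for illustration and is not necessarily the estimator specified in Materials and Methods. The encoding of missing votes as NaN, the function name, and the toy matrix are all assumptions.

```python
import numpy as np

def pairwise_disagreement(votes):
    """Fraction of jointly observed columns on which each pair of rows differs.

    votes: array with entries 1 ("yes"), 0 ("no"), or NaN (missing).
    Pairs of rows with no jointly observed column get NaN.
    """
    votes = np.asarray(votes, dtype=float)
    observed = ~np.isnan(votes)
    both = observed[:, None, :] & observed[None, :, :]
    differ = (votes[:, None, :] != votes[None, :, :]) & both
    n_both = both.sum(axis=-1).astype(float)
    n_diff = differ.sum(axis=-1).astype(float)
    return np.divide(n_diff, n_both,
                     out=np.full(n_both.shape, np.nan), where=n_both > 0)

nan = np.nan
# Hypothetical toy roll calls: 3 senators (rows) by 4 bills (columns).
votes = np.array([[1,   0,   nan, 1],
                  [1,   1,   0,   nan],
                  [nan, nan, nan, 0]])

d_senators = pairwise_disagreement(votes)
d_bills = pairwise_disagreement(votes.T)
```

A pair that never voted on the same bill has no data to support an estimate, which is one reason a weighting matrix that discounts poorly supported dissimilarities is useful.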

As shown in Figure 4, the BiFold plot clearly shows the two-party structure of the senate, allowing for convenient visual comparison of the relative “spread” of the parties, and identification of senators that are “moderate” versus those that are more “extreme” (Figure 4, top panels). The pattern of bills revealed by BiFold is reminiscent of the diamond structure previously identified from classical MDS [30]. In addition, BiFold provides visual information regarding the relationships between bills and senators by positioning bills “close to” the senators supporting them. This unique feature enables a clear classification of the main clusters of bills as shown in Figure 4:

Bills in the “left” (liberal) cluster received strong support from the Democratic Senators;
Bills in the “right” (conservative) cluster received strong support from the Republicans;
Bills in the “top” (bipartisan supportive) cluster were strongly supported by both parties, as visually being “pulled” between the two parties;
Bills in the “bottom” (bipartisan opposition) cluster are pushed far away from both parties, indicating bills that were supported only by a small number of senators.

Thus, by simultaneous embedding of both the senators and the bills, the BiFold visualization not only captures patterns within the senators and those within the bills, but also reveals salient features of the senator-bill cross relations.

2.3 Association datasets: a recipe - ingredient dataset

We envision the BiFold approach to be broadly useful, certainly beyond the visualization of voting data. Another important category of binary data captures the association between “members” and “affiliations.” A key feature of such association datasets is that the non-association relations carry little information compared to the association relations; in sharp contrast, in a voting dataset the “yes” and “no” votes both convey valuable information about the relation between the decision makers and the choices. Association datasets are often collected to form sparse, bipartite networks, where sparsity arises from the reality that there are (typically) many more non-associations than associations in these data.

We focus here on a specific example relating recipes with their included ingredients. A recipe defines a procedure for cooking, along with a list of food items (ingredients) used in the recipe. Gathering this data over a broad spectrum of recipes allows us to more completely understand how ingredients are used in combination, which may vary from one cuisine to another. As our data source, we consider the recipe-ingredient association dataset generated in [31], which assembled over $50{,}000$ recipes taken from two American online repositories and one Korean online repository. The data is (again) represented by a matrix B, where $b_{ij}=1$ indicates that recipe i contains ingredient j, and 0 otherwise. To proceed with the BiFold approach, we must define the dissimilarities between the entities. Recall that in the voting examples, both a “1” (a yes vote) and a “0” (a no vote) contain actual information regarding a voter’s opinion. In contrast, in the recipe-ingredient dataset, a given recipe typically includes only a small fraction of all available ingredients and carries essentially no information on those ingredients that are not used in the recipe.

The between-class dissimilarity measure is as before, $\delta _{ij}=1-b_{ij}$. However, the within-class dissimilarities require more careful consideration. If we were to quantify the dissimilarity between two recipes in the same way that we did for two voters, we would conclude that most recipes are very “similar.” This apparent similarity is artificial, resulting not from the ingredients the recipes share, but from the overwhelmingly large set of ingredients that neither recipe contains. A dissimilarity measure that symmetrically incorporates “1”s and “0”s will therefore be dominated by the sparsity of the data rather than the actual relation between the entities of interest. In this context, we consider the 0s as carrying relatively little information. As such, the Jaccard distance provides a natural measure of dissimilarity [32], where we treat rows (or columns) of B as a characteristic function indicating set membership. For two recipes, the Jaccard distance is

$$ J^{\mathrm{R}}=1-\frac{\#\text{ ingredients shared by the two recipes}}{\#\text{ ingredients used by either recipe}}. \tag{6} $$

Likewise, the Jaccard distance between two ingredients is

$$ J^{\mathrm{I}}=1-\frac{\#\text{ recipes using both ingredients}}{\#\text{ recipes using either ingredient}}. \tag{7} $$
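Both Jaccard distances can be computed from the same routine, applied to $B$ for recipes and to its transpose for ingredients. The following is a minimal sketch on a hypothetical 3-by-4 matrix; the function name and the convention that two empty sets have distance 0 are assumptions made for the example.

```python
import numpy as np

def jaccard_distance(rows):
    """Pairwise Jaccard distance between the binary rows of a matrix."""
    r = np.asarray(rows, dtype=float)
    inter = r @ r.T                                # sizes of intersections
    sizes = r.sum(axis=1)
    union = sizes[:, None] + sizes[None, :] - inter
    # Assumption: two all-zero rows are treated as identical (similarity 1).
    sim = np.divide(inter, union, out=np.ones_like(union), where=union > 0)
    return 1.0 - sim

# Hypothetical toy data: 3 recipes (rows) by 4 ingredients (columns).
B = np.array([[1, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 0, 0, 1]])

j_recipes = jaccard_distance(B)        # Eq. (6)
j_ingredients = jaccard_distance(B.T)  # Eq. (7)
```

The first two recipes share one ingredient out of three used by either, giving a distance of 2/3; the many ingredients that neither recipe uses play no role, which is the property motivating the choice of this measure for sparse association data.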

In addition to the recipe-ingredient relationship information, the original dataset also categorized each recipe as belonging to a particular cuisine. We focus our analysis on a random subsample (of 1,000 recipes) of the five cuisines in the original dataset that contain more than 1,000 recipes. We compute a 2-D BiFold embedding to support visualization of this reduced dataset. In Figure 5, we use the BiFold coordinates to plot food ingredients (circles, colored by ingredient category), with that layout the same for all five cuisines. Each cuisine is visualized in its own panel, where we use a density plot to capture the distribution of recipes from that cuisine.

As expected, ingredients that are commonly used together in recipes are positioned near each other in the plot, and recipes with similar ingredients appear close together as well. A unique outcome of applying BiFold to this data is that we may now visually associate ingredients to cuisines, whereas the original data only associates recipes to cuisines, facilitating an entirely new level of interpretation enabled by embedding both recipes and ingredients using a common coordinate frame:

From the collection of cuisine plots, we can visually identify similar cuisines (North America - Western Europe, Latin America - Southern Europe).
The East Asian cuisine appears visually distinct from the western heritage cuisines.
The protein group, primarily meat, appears centrally in the figure of ingredients, with all the cuisines showing significant density in that region of the plot. (In other words, the meat group does not identify any particular cuisine.)
The density plots allow us to visually identify certain ingredients as the “signature” of a cuisine: basil and oregano (Southern European); sesame oil and soy sauce (East Asian); cocoa and vanilla (North American and Western European).

3 Discussion

The BiFold framework described in this article has primarily focused on a fixed, binary dataset, interpretable as associations between two types of objects. We consider that framework to be broadly applicable to datasets describing relationships between entities from different classes, where we want to simultaneously visualize the different classes such that visual distance can be associated with a dissimilarity measure, both within class and between classes. For the datasets examined, we remark that although the knowledge discovery facilitated by the visualization is possibly achievable by other analysis techniques, BiFold has a unique ability to simultaneously visualize those discoveries. Note that the extent to which a BiFold plot (or any visualization) reflects the actual similarities and dissimilarities between objects in the dataset - as measured by the stress function - depends intrinsically on the dataset itself. In typical real-world datasets, the representation would not be perfect, even if the dimensionality of the embedding is large. For the datasets considered here, we find that in the Southern Women example, as well as the two US voting examples, a low-dimensional (2-D or 3-D) BiFold embedding achieves an almost minimal stress which cannot be further decreased by increasing dimensionality (see Figure 6), supporting the notion that the opinions are well expressed by a low-dimensional model. On the other hand, for the recipe-ingredient example, increasing the dimensionality beyond 3-D continues to decrease stress and improve the match to the original data (Figure 6), suggesting an enormous diversity and complexity in the cuisine space which cannot be accounted for using just a few variables or parameters.

In addition, we note that the BiFold framework described here may be easily extended to a number of interesting and related problems:

As an (almost trivial) extension, we note that interpretation of the data as representing a bipartite network implies that BiFold could act as a graph layout algorithm for bipartite network data.
BiFold can be viewed as a generalization of several other classical techniques which can be recovered by specific choice of parameters:
- $w^{(xy)}_{ij}=0$: Only within-class dissimilarities are considered, yielding separate MDS embeddings of the two types of objects [9, 10].
- $w^{(x)}_{ij}=w^{(y)}_{ij}=0$: only between-class dissimilarities are considered, yielding an unfolding of the data [9, 10, 20].
The entries in the data matrix, B, need not be binary, but could represent a continuous or ordinal variable, such as ratings, rankings, or preferences.
Some datasets might naturally contain more than two groups, such as actors, movies, and viewers. Such datasets can be treated as multipartite, rather than bipartite data. We envision a natural extension of BiFold, where the joint dissimilarity and weighting matrices must be appropriately constructed based on the within-group and between-group relationships.
We focused on Hamming distance and Jaccard distance to compute within-class dissimilarity, with each providing a natural interpretation for the datasets considered. We note that the BiFold framework is not dependent upon any particular choice of dissimilarity measure, and a reasonable practitioner may choose other methods for defining dissimilarities (and weights) that might be appropriate for their data. The BiFold approach, based on the joint dissimilarity matrix, will still provide a means to develop the joint visualization.
For some of the methods, we interpret the raw (binary) data from a Bayesian perspective, but with an uninformative prior. That approach could easily be adapted to accommodate other a priori understanding of the data.
For dynamic datasets (parameterized by time, for example) each data “snapshot” would yield a BiFold layout. A stress functional that incorporates a regularity condition in time could compute an optimal sequence of layouts, computed over many snapshots.
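The two special cases listed above (separate MDS embeddings and unfolding) correspond to zeroing out blocks of the weighting matrix $W$ of Eq. (3). The sketch below builds such a matrix with a single constant weight per block; that constant-per-block form, the function name `block_weights`, and the sizes used are assumptions for illustration, whereas the weights in the paper are meant to reflect the reliability of each individual dissimilarity.

```python
import numpy as np

def block_weights(m, n, w_x=1.0, w_y=1.0, w_xy=1.0):
    """Joint weighting matrix with one constant weight per block (illustrative)."""
    w = np.block([[np.full((m, m), w_x), np.full((m, n), w_xy)],
                  [np.full((n, m), w_xy), np.full((n, n), w_y)]])
    np.fill_diagonal(w, 0.0)   # self-dissimilarities carry no information
    return w

m, n = 3, 2
w_bifold = block_weights(m, n)                    # all blocks active
w_mds = block_weights(m, n, w_xy=0.0)             # within-class only
w_unfold = block_weights(m, n, w_x=0.0, w_y=0.0)  # between-class only
```

With `w_mds`, the stress separates into two independent sums, one per class, so the relative placement of the two point clouds is left undetermined; with `w_unfold`, only row-to-column distances are constrained.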

As a caution, we note some of the challenges associated with analysis via the BiFold framework:

The computational complexity of stress minimization using the SMACOF algorithm is roughly $O(n^{4})$ for reaching a local, approximate solution. As such, the current implementation of stress minimization will likely struggle with very large datasets. Because the technique is meant to support visual knowledge discovery (human interaction), speed of visualization is important. Data aggregation might be a way to handle large datasets, but the aggregation procedures will almost certainly be domain specific.
Comparing one BiFold layout to another (exploring parameter space) can be challenging in that the solution layout is rotation and reflection invariant. Normalizing the orientation of the generated solution is important. As an additional complication, the configuration solution to the optimization problem is a local minimizer, so the solution may “jump” to a different minimizer under small changes in the data.
The non-Euclidean nature of the dissimilarity measures results in a dissimilarity matrix that is not necessarily well approximated by a low-dimensional embedding. In such cases, visually interesting effects may sometimes be an artifact of the data, particularly with sparse datasets.
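One common way to normalize the orientation of a layout before comparing it with another is orthogonal Procrustes alignment, which finds the rotation or reflection that best matches one configuration to a reference in the least-squares sense. The sketch below is offered as one possible approach, not as the normalization used by the authors; the function name and the synthetic test layout are hypothetical.

```python
import numpy as np

def align_layout(z, reference):
    """Rotate/reflect and translate layout z to best match reference.

    Both arrays have shape (points, d), with rows in the same object order.
    """
    z_c = z - z.mean(axis=0)
    ref_c = reference - reference.mean(axis=0)
    # Orthogonal Procrustes: q = u @ vt minimizes ||z_c @ q - ref_c||_F
    u, _, vt = np.linalg.svd(z_c.T @ ref_c)
    q = u @ vt
    return z_c @ q + reference.mean(axis=0)

# Synthetic check: a rotated, reflected, and shifted copy of a layout.
rng = np.random.default_rng(1)
reference = rng.normal(size=(5, 2))
theta = 0.7
rot = np.array([[np.cos(theta), -np.sin(theta)],
                [np.sin(theta),  np.cos(theta)]])
moved = (reference @ rot) * np.array([1.0, -1.0]) + 3.0
aligned = align_layout(moved, reference)
```

Because the stress depends only on pairwise distances, applying such a transformation changes the picture but not the quality of the embedding.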

Despite these challenges, we note that the BiFold framework developed here appears to have broad applicability in many settings related to complex networks, social sciences, and those areas of data analysis that focus on binary relations.

4 Materials and methods

4.1 Datasets

The Southern Women dataset is a popular dataset used in social network analysis. The dataset first appeared in the book “Deep South: A Social Anthropological Study of Caste and Class” [26] (p.148), and can also be found in several online network data repositories. Collected in the 1930s in the small southern town of Natchez (Mississippi, United States), the data record the participation of 18 women in a series of 14 informal social events over a nine-month period. Only the events in which at least two women participated are included in the dataset. Figure 1 shows the data table without including the names of the women or dates of the events. We represent the dataset by a woman-by-event matrix $B=[b_{ij}]_{18\times14}$, where $b_{ij}=1$ indicates that woman i attended event j, and $b_{ij}=0$ otherwise.
The U.S. presidential election dataset considered in this paper includes the state-level voting results of the United States presidential elections for the period from 1976 to 2012. The dataset, available at the U.S. government archive (http://www.archives.gov/federal-register/electoral-college/), includes the voting outcomes of the 51 voting entities (50 states plus the District of Columbia) for the past 10 presidential elections. We number the states from 1 to 51 alphabetically by name, and the elections from 1 to 10 in chronological order. We then represent the dataset by a state-by-president matrix $B=[b_{ij}]_{51\times10}$, where $b_{ij}=1$ indicates that state i voted for the elected president in the jth election, and $b_{ij}=0$ otherwise. For example, in all 10 of these elections Ohio voted for the presidential candidate who eventually won the election, regardless of party affiliation. Florida and Nevada each “missed” one election: in the 1992 election, Florida voted for G.H.W. Bush (the elected president was B. Clinton); in the 1976 election, Nevada voted for G. Ford (the elected president was J. Carter). All three are well-known examples of “swing” states, characterized by flexible voting patterns and importance in determining the election outcome.
The U.S. Senate Congressional Voting dataset used in this paper is obtained from the congressional voting records of the 112th United States congress, first session of the Senate. There are at most 100 senators at any time, with occasional need to replace a senator in mid session, which happened once during the voting portion of this session. As such the roll calls indicate 101 senators voting, 51 Democrats (D), 48 Republicans (R), and 2 Independents (I). There were 235 recorded roll call votes, 167 passed and 68 rejected. We number the senators from 1 to 101 by last name, and the bills from 1 to 235 in chronological order. We formulate data matrix $B=[b_{ij}]_{101\times235}$ by defining $b_{ij}$ using the voting of senator i on bill j: for a “yes” vote $b_{ij}=1$, for a “no” vote $b_{ij}=0$. The abstained votes are treated as “missing” data in the matrix (see the “Treatment of partial and missing data” section below for details).
The recipe-ingredient dataset is retrieved from the Supplementary Information of Ref. [31], a paper that studied the similarity and difference in food pairings across different geographical regions. The dataset contains more than 50,000 recipes extracted from three cuisine websites: allrecipes.com, epicurious.com, and menupan.com. The recipes were divided into 11 geographical regions, covering ∼50 popular cuisines around the world. The recipes and ingredients are indexed. Focusing on the 5 geographical regions (cuisines) that contain over 1,000 recipes, we construct data matrix $B=[b_{ij}]_{5{,}000\times335}$, with $b_{ij}=1$ if recipe i contains ingredient j. This subsample of the original dataset contains 1,000 randomly selected recipes from each of the 5 selected cuisines: East Asian, Latin American, North American, Southern European, and Western European. The subsampled data contains a total of 335 different ingredients.

4.2 The BiFold framework: dissimilarity measures, weights, and stress minimization

The BiFold framework describes a general approach to produce a low-dimensional embedding from a data matrix, where that matrix encodes the relationship between two classes of objects. First, one needs to create a joint dissimilarity matrix using some appropriate within-class and cross-class dissimilarity measures, as well as scaling to make the within-class and cross-class dissimilarities commensurate. Second, one needs to construct a weighting matrix to reflect the relative focus to be given to the computed dissimilarities. Finally, the BiFold embedding is obtained by minimizing a weighted stress function, similar to the determination of an MDS solution.

We now present the mathematical details of the BiFold procedure. For a given data matrix $B=[b_{ij}]_{m\times n}$, a d-dimensional BiFold embedding is based upon minimization of the multivariate stress function $S:\mathbb{R}^{d\times(m+n)}\rightarrow\mathbb{R}$, given by

$$ S(\boldsymbol{z}_{1},\boldsymbol{z}_{2},\dots, \boldsymbol{z}_{m+n})=\sum_{k,\ell=1}^{m+n} w_{k\ell} \Phi\bigl( \Vert \boldsymbol{z}_{k}-\boldsymbol{z}_{\ell} \Vert _{2},\delta_{k\ell} \bigr). $$

(8)

The joint dissimilarity matrix is given by Eq. (2), where $\Delta^{(x)}=[\delta^{(x)}_{ij}]_{m\times m}$ and $\Delta ^{(y)}=[\delta^{(y)}_{ij}]_{n\times n}$ are the within-class dissimilarity matrices and $\Delta^{(xy)}=[\delta^{(xy)}_{ij}]_{m\times n}$ is the cross-class dissimilarity matrix ($\Delta^{(yx)}=\Delta ^{(xy)\top}$). The parameters $\alpha_{x}$, $\alpha_{y}$, and $\alpha _{xy}$ provide flexible scaling of the within-class and cross-class distances in the embedded space, while β can be used to visually translate the type-1 objects away from the type-2 objects in the embedding.
The weighting matrix is defined in Eq. (3), where $W^{(x)}=[w^{(x)}_{ij}]_{m\times m}$ and $W^{(y)}=[w^{(y)}_{ij}]_{n\times n}$ are the within-class weighting matrices and $W^{(xy)}=[w^{(xy)}_{ij}]_{m\times n}$ is the cross-class weighting matrix ($W^{(yx)}=W^{(xy)\top}$).
A typical choice for the above stress function S is to let $\Phi(d,\delta)=(d-\delta)^{2}$. For a given dissimilarity and weight matrix, this fully specified stress function may then be minimized to obtain coordinates $\{z_{1}, \ldots,z_{m+n}\}$.
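As a concrete reading of Eq. (8) with the quadratic choice of Φ, the following minimal sketch evaluates the weighted stress for a candidate configuration. The function name `weighted_stress` and the storage of coordinates as rows of an $(m+n)\times d$ array are assumptions made for illustration.

```python
import numpy as np

def weighted_stress(Z, Delta, W):
    """Evaluate S = sum_{k,l} w_kl * (||z_k - z_l||_2 - delta_kl)**2.

    Z:     (m+n, d) array, one embedded point per row (layout assumed here).
    Delta: (m+n, m+n) joint dissimilarity matrix.
    W:     (m+n, m+n) joint weighting matrix.
    """
    Z = np.asarray(Z, dtype=float)
    diff = Z[:, None, :] - Z[None, :, :]          # all pairwise differences
    D = np.sqrt((diff ** 2).sum(axis=-1))         # Euclidean distances
    return float((W * (D - Delta) ** 2).sum())
```

The sum runs over all ordered pairs, as in Eq. (8), so each unordered pair contributes twice when Δ and W are symmetric.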

4.3 Dissimilarity measures and weights used in the examples

In the data matrix $B=[b_{ij}]_{m\times n}$ of the Southern Women dataset, $b_{ij}=1$ if woman i attended event j and $b_{ij}=0$ otherwise. For the BiFold plot in Figure 1, we used the following within-class and cross-class dissimilarities:

$$ \textstyle\begin{cases} \mbox{(woman-to-woman dissimilarity)}&\delta^{(x)}_{ij}=\sum_{k=1}^{n} \vert b_{ik}-b_{jk} \vert ,\\ \mbox{(event-to-event dissimilarity)}&\delta^{(y)}_{ij}=\sum_{k=1}^{m} \vert b_{ki}-b_{kj} \vert ,\\ \mbox{(woman-to-event dissimilarity)}&\delta^{(xy)}_{ij}=1-b_{ij}. \end{cases} $$

(9)

Then, to balance the spread of the points from the two classes in the embedding, we set the scaling parameters $\alpha_{x}=1/n$, $\alpha _{y}=1/m$, and $\alpha_{xy}=1$. The shifting parameter is $\beta=0$. All entries of the joint weighting matrix W are equal to 1. These choices were made primarily for simplicity and are unlikely to be appropriate for the other, much larger datasets considered in the paper. Below we develop a set of dissimilarity measures and corresponding weights suitable for two common types of data matrices encoding voting and association relations, respectively.
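The choices above can be sketched in a few lines. This is an illustrative sketch, not the authors' code: the function name `hamming_bifold_inputs` is hypothetical, and the block layout of the joint matrix (class x first, then class y) is an assumption, since Eq. (2) is not reproduced in this section. Because $\beta=0$ here, the shift does not appear.

```python
import numpy as np

def hamming_bifold_inputs(B):
    """Build joint dissimilarity and weight matrices from Eq. (9) with
    alpha_x = 1/n, alpha_y = 1/m, alpha_xy = 1, beta = 0, and unit weights."""
    B = np.asarray(B, dtype=float)
    m, n = B.shape
    Bt = B.T
    # Eq. (9): counts of differing entries between rows and between columns
    dx = np.abs(B[:, None, :] - B[None, :, :]).sum(axis=2)     # m x m
    dy = np.abs(Bt[:, None, :] - Bt[None, :, :]).sum(axis=2)   # n x n
    dxy = 1.0 - B                                              # m x n
    # Assumed block layout: [[x-x, x-y], [y-x, y-y]]
    Delta = np.block([[dx / n, dxy], [dxy.T, dy / m]])
    W = np.ones_like(Delta)
    return Delta, W
```

With this scaling every within-class dissimilarity lies in $[0,1]$, the same range as the cross-class values $1-b_{ij}$.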

Voting data: the BiFold Bernoulli Method. Where the data matrix B represents ‘voting’ data, such that $b_{ij}=1$ indicates that object $X^{i}$ voted positively for object $Y^{j}$, one may consider the preference selection (‘1’ or ‘0’) to be a forced binary decision on a continuous variable that represents preference. One model for this situation is to view $b_{ij}$ as the observed outcome of the forced decision, treated as a Bernoulli trial whose parameter $p:=p_{ij}=:p_{ij}^{(xy)}$ is not known. (For real voting datasets, we treat ‘yes’ as ‘1’ and ‘no’ as ‘0.’ As a third outcome, a voter will sometimes ‘abstain’ on a particular vote, which we view as “missing data” and handle with the technique described below.) Applying this model within a group (for example, within group 1), we could assert a Bernoulli process with $p:=p_{ij}^{(xx)}$, the (unknown) probability that objects $X^{i}$ and $X^{j}$ would vote the same way on an arbitrarily selected vote. Comparing rows i and j of the data matrix B provides n observations of outcomes from that Bernoulli process. Comparison of columns, treated in the same way, provides m observations of the Bernoulli process associated with objects $Y^{i}$ and $Y^{j}$. Ideally, we would like to construct a BiFold configuration using dissimilarities computed from the actual preference values, the unknown $p_{ij}^{(*)}$. Instead, we must assign dissimilarities from estimated probabilities, $\delta_{ij}^{(*)}:=1-\hat{p}_{ij}^{(*)}$. Following the standard development for estimating proportions, we count the number of within-group differences between pairs of entities in each class:
$$\begin{aligned}& s_{ij}^{(x)}= \sum_{k} \vert b_{ik}-b_{jk} \vert , \end{aligned}$$
(10)

$$\begin{aligned}& s_{ij}^{(y)}= \sum_{k} \vert b_{ki}-b_{kj} \vert . \end{aligned}$$
(11)
For the cross-class data, we pool all observations to define an average rate of positive voting:
$$ \bar{p}=\frac{\sum_{i,j} b_{ij}}{nm}. $$
(12)

Because we have significantly more observations for the ‘within class’ data, we expect those estimates to be more accurate. Consequently, we choose weights $w_{ij}$ proportional to the information content. Borrowing from approaches used in regression of heteroscedastic data, we weight the error term (stress) inversely to the (estimated) variance of the observation, as applied in equations (8) and (3). We focus on three primary alternatives for the estimation of the parameters and the variance: (1) Bayesian, with uniform prior; (2) Bayesian, with Jeffreys’ prior; and (3) Non-Bayesian, maximum likelihood estimate. Table 1 shows the resultant formulas associated with these methods. We note that the specific Bayesian approaches described assume no prior belief regarding the parameters $p_{ij}$. However, the concept is easily generalized to cases where prior information is available, in which one would simply encode that knowledge into an assumed prior distribution.
Table 1 Bifold Bernoulli methods: coefficient estimation formulas for the distances
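To make the Bayesian alternatives concrete, the sketch below uses the textbook Beta-Bernoulli posterior for the probability that two objects agree, given $s$ observed disagreements out of a number of comparisons. It is offered as an assumption-laden illustration: the exact expressions of Table 1 are not reproduced in this section and may differ, and the function name `beta_posterior_dissimilarity` and its parameter `a` are hypothetical.

```python
import numpy as np

def beta_posterior_dissimilarity(s, n_obs, a=1.0):
    """Dissimilarity and inverse-variance weight from a Beta(a, a) prior.

    s:     number of observed disagreements (scalar or array).
    n_obs: number of comparisons behind each count.
    a:     symmetric prior parameter; a=1 is the uniform prior and
           a=0.5 is Jeffreys' prior for a Bernoulli parameter.
    """
    s = np.asarray(s, dtype=float)
    n_obs = np.asarray(n_obs, dtype=float)
    alpha = a + (n_obs - s)      # prior plus observed agreements
    beta = a + s                 # prior plus observed disagreements
    total = alpha + beta
    p_hat = alpha / total                                # posterior mean
    var = alpha * beta / (total ** 2 * (total + 1.0))    # posterior variance
    delta = 1.0 - p_hat          # delta := 1 - p_hat, as in the text
    weight = 1.0 / var           # weight inversely to estimated variance
    return delta, weight
```

For comparison, the maximum likelihood estimate is $\hat{p}=1-s/n$ with estimated variance $\hat{p}(1-\hat{p})/n$, which vanishes when $s=0$ or $s=n$; the posterior variance above stays strictly positive in those cases.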
Association data: the BiFold Membership Method. For association data (such as the recipe-ingredient dataset), an entry $b_{ij}=1$ of the sparse biadjacency matrix indicates an association between object i from class x and object j from class y. Unlike the case of voting datasets, a “0” in an association dataset carries relatively little information compared with a “1”. This asymmetry, if not accounted for appropriately, will result in an embedding (and visualization) that is dominated by the count of 1s instead of revealing more useful features.

The between-class dissimilarity is quantified as
$$ \delta_{ij}^{(xy)}=1-b_{ij}. $$
(13)
The within-class dissimilarities are computed using a Jaccard distance. Specifically, for two objects represented by rows i and j of the matrix B, their dissimilarity is given by
$$ \delta^{(x)}_{ij}=1-\frac{\sum_{k} b_{ik}b_{jk}}{\sum_{k} (b_{ik}+b_{jk}-b_{ik}b_{jk} )}. $$
(14)
Likewise, the dissimilarity between columns i and j is computed as
$$ \delta^{(y)}_{ij}=1-\frac{\sum_{k} b_{ki}b_{kj}}{\sum_{k} (b_{ki}+b_{kj}-b_{ki}b_{kj} )}. $$
(15)
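Equations (14) and (15) can be evaluated for all pairs at once with matrix products. The sketch below is illustrative; the function name `jaccard_distance_rows` is hypothetical, and assigning distance 1 when both objects have no 1s (where the ratio is 0/0) is an assumption of this example.

```python
import numpy as np

def jaccard_distance_rows(B):
    """Jaccard distance between all pairs of rows of a binary matrix B."""
    B = np.asarray(B, dtype=float)
    inter = B @ B.T                                   # sum_k b_ik * b_jk
    counts = B.sum(axis=1)
    union = counts[:, None] + counts[None, :] - inter
    # Assumption: pairs with an empty union are treated as maximally dissimilar.
    sim = np.divide(inter, union, out=np.zeros_like(inter), where=union > 0)
    return 1.0 - sim
```

Applying the same function to the transpose, `jaccard_distance_rows(B.T)`, gives the column dissimilarities of Eq. (15).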

For weights, we treat $b_{ij}=1$ as representing unit information, while $b_{ij}=0$ carries no information, so that

$$ w_{ij}^{(xy)}=b_{ij}. $$

(16)

For within class, the weights are computed by counting the number of common “1’s,” yielding

$$ w^{(x)}_{ij}=\sum_{k} b_{ik}b_{jk}, \qquad w^{(y)}_{ij}=\sum _{k} b_{ki}b_{kj}. $$

(17)

As a result of the typical sparsity of such datasets, the matrix W will also be sparse. We remark that

$$ w_{ij}=0 \quad \iff\quad \delta_{ij}=1, $$

(18)

meaning that under this condition of maximal dissimilarity of i with j, that particular dissimilarity does not directly affect the computed stress functional or the resultant BiFold embedding. Without this weighting scheme, a sparse association dataset would be completely dominated (visually) by the large number of objects forced to lie at the outside of the unit ball because most objects are ‘very far’ from most other objects.
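A minimal sketch of the membership weights follows, taking the cross-class weight to be 1 where an association exists and 0 otherwise, as the unit-information statement implies. The function name `membership_weights` is hypothetical, and the block layout of the joint matrix (class x first, then class y) is an assumption.

```python
import numpy as np

def membership_weights(B):
    """Within-class and cross-class weights for a binary association matrix B."""
    B = np.asarray(B, dtype=float)
    w_xy = B.copy()     # cross-class: unit weight for b_ij = 1, none for b_ij = 0
    w_x = B @ B.T       # Eq. (17): number of shared 1s between rows i and j
    w_y = B.T @ B       # Eq. (17): number of shared 1s between columns i and j
    # Assumed block layout of the joint weighting matrix.
    W = np.block([[w_x, w_xy], [w_xy.T, w_y]])
    return w_x, w_y, w_xy, W
```

A zero weight arises exactly when two objects share no 1s, or when $b_{ij}=0$ for a cross-class pair, which is the case of maximal dissimilarity in relation (18). The diagonal entries of the within-class blocks are nonzero, but they do not affect the stress because a point is at distance zero from itself.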

5 Stress minimization

After specifying the stress function (8) and choosing the embedding dimension d, a BiFold representation of the data is obtained by minimizing the stress function over the coordinates of $m+n$ points in a d-dimensional Euclidean space. This optimization problem is within the class of MDS problems, with several alternative tools available to find a local minimum [9, 10]. For the BiFold plots reported in this paper, the stress minimization is done via the (iterative) SMACOF algorithm [10]. For reproducibility of results, the starting configuration for the first iteration of the algorithm is obtained from a classical MDS solution of the joint dissimilarity matrix (without weighting). After applying the SMACOF algorithm to obtain a set of coordinates, we further perform a PCA (principal component analysis) to standardize the alignment, noting that the stress function is invariant under the resulting rotation and reflection. As a consequence, in all BiFold plots the horizontal axis is the principal direction.
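The pipeline described above (classical MDS start, weighted SMACOF iterations, PCA alignment) can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: it uses the standard Guttman transform with a pseudoinverse, assumes symmetric Δ and W, stores one point per row, and all function names, the iteration cap, and the tolerance are hypothetical.

```python
import numpy as np

def classical_mds(Delta, d=2):
    """Classical (Torgerson) MDS, used only as a deterministic starting layout."""
    N = Delta.shape[0]
    J = np.eye(N) - np.ones((N, N)) / N
    G = -0.5 * J @ (Delta ** 2) @ J                 # double-centered Gram matrix
    vals, vecs = np.linalg.eigh(G)
    idx = np.argsort(vals)[::-1][:d]                # d largest eigenvalues
    return vecs[:, idx] * np.sqrt(np.clip(vals[idx], 0.0, None))

def pairwise_distances(Z):
    diff = Z[:, None, :] - Z[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))

def smacof_weighted(Delta, W, d=2, n_iter=300, tol=1e-9):
    """Weighted SMACOF via the Guttman transform, followed by PCA alignment."""
    Delta = np.asarray(Delta, dtype=float)
    W = np.array(W, dtype=float)
    np.fill_diagonal(W, 0.0)                        # self-pairs carry no stress
    V = np.diag(W.sum(axis=1)) - W
    V_pinv = np.linalg.pinv(V)
    Z = classical_mds(Delta, d)
    prev = np.inf
    for _ in range(n_iter):
        D = pairwise_distances(Z)
        ratio = np.divide(Delta, D, out=np.zeros_like(D), where=D > 0)
        Bz = -W * ratio                             # off-diagonal entries
        Bz[np.diag_indices_from(Bz)] = -Bz.sum(axis=1)
        Z = V_pinv @ Bz @ Z                         # Guttman transform
        stress = (W * (pairwise_distances(Z) - Delta) ** 2).sum()
        if prev - stress < tol:                     # stress is non-increasing
            break
        prev = stress
    # PCA alignment: the first coordinate becomes the principal direction.
    Z = Z - Z.mean(axis=0)
    _, _, Vt = np.linalg.svd(Z, full_matrices=False)
    return Z @ Vt.T
```

The pseudoinverse is computed once and costs on the order of $(m+n)^3$ operations, so this sketch is only practical for moderately sized datasets.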

6 Treatment of partial and missing data

For real datasets, the choice of methods for dealing with missing data can become a critical component of the data processing. In general, the BiFold approach admits a well-reasoned approach that does not depend upon imputation and remains robust across a wide variety of datasets. The key enabler is recognizing that the data matrix B contains mn pieces of information, while the solution (a configuration) allows just $d \times (m+n)$ free variables. Under typical scenarios, with the visualization dimension $d=2$ or $d=3$ and $n,m\gg d$, we may view the data matrix as providing a significant amount of “redundant” information. In the same way that a regression line should not suffer too much if a small fraction of the dataset is removed, a similar robustness should persist in the BiFold visualization. As such, we follow two general guidelines when dealing with missing data:

1. Use only available data when computing dissimilarities $\delta_{ij}$.
2. Weights $w_{ij}$ should be selected to account for the actual (non-missing) data that is used to compute the associated dissimilarity.

Consider, for example, the congressional voting data described above. For these data, it is typical that not all senators would vote on every bill. Some may “abstain” during the roll call, but others may simply not be present. In this case, a typical dataset structure might assign

$$ b_{ij} = \mathrm{NA} $$

(19)

if senator i did not vote on bill j. To perform BiFold under this condition of missing data, we proceed as follows:

If $b_{ij} = \mathrm{NA}$ then $w_{ij}^{(xy)}=0$, and $\delta _{ij}^{(xy)}=c$, where c is an arbitrary, finite constant.
For within group differences for group 1, define index sets $\kappa_{ij}$ as
$$\kappa_{ij} = \{k\mid b_{ik} \neq\mathrm{NA}, b_{jk} \neq\mathrm{NA} \}, $$
compute
$$ s_{ij}^{(x)}= \sum_{k \in\kappa_{ij}} \vert b_{ik}-b_{jk} \vert , $$
(20)
and determine the number of information elements as
$$ n_{ij}^{(x)}= \vert \kappa_{ij} \vert . $$
(21)
Apply the Table 1 formulae to compute $\delta_{ij}^{(x)}$ and $w_{ij}^{(x)}$, replacing n by $n_{ij}^{(x)}$.
Use similarly modified formulas to compute $\delta_{ij}^{(y)}$ and $w_{ij}^{(y)}$.

After forming the matrices Δ and W, we may simply minimize the weighted stress to determine a coordinate representation.
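The within-group counts of Eqs. (20) and (21) can be computed for all row pairs with a mask of observed entries. The sketch below is illustrative: it assumes missing votes are encoded as NaN (standing in for NA), and the function name `pairwise_counts_with_missing` is hypothetical.

```python
import numpy as np

def pairwise_counts_with_missing(B):
    """Disagreement counts s_ij and co-observation counts n_ij between rows.

    B: m x n array with entries 1, 0, or np.nan for a missing vote.
    """
    B = np.asarray(B, dtype=float)
    observed = ~np.isnan(B)
    O = observed.astype(float)            # 1 where a vote was recorded
    Bz = np.where(observed, B, 0.0)       # missing entries replaced by 0
    n_obs = O @ O.T                       # Eq. (21): |kappa_ij|
    # For binary values, |b_ik - b_jk| = b_ik + b_jk - 2 b_ik b_jk,
    # restricted to columns k where both rows are observed.
    s = Bz @ O.T + O @ Bz.T - 2.0 * (Bz @ Bz.T)   # Eq. (20)
    return s, n_obs
```

Calling the function on the transpose, `pairwise_counts_with_missing(B.T)`, gives the corresponding counts between columns. The boolean mask of observed entries also identifies the cross-class pairs whose weight $w_{ij}^{(xy)}$ should be set to zero.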

References

Fekete J-D, van Wijk JJ, Stasko JT, North C (2008) The value of information visualization. In: Kerren A et al. (eds) Information visualization. LNCS, vol 4950, pp 1-18
Spence I, Garrison RF (1993) A remarkable scatterplot. Am Stat 47(1):12-19
Fayyad U, Grinstein G, Wierse A (eds.) (2001) Information visualization in data mining and knowledge discovery. Kaufmann, San Francisco
Gastner MT, Newman MEJ (2004) Diffusion-based method for producing density equalizing maps. Proc Natl Acad Sci USA 101:7499-7504
Sims GE, Choi I-G, Kim S-H (2005) Protein conformational space in higher order ϕ-ψ maps. Proc Natl Acad Sci USA 102:618-621
Chen M et al. (2009) Data, information and knowledge in visualization. IEEE Comput Graph Appl 29(1):12-19
Nishikawa T, Motter AE (2011) Discovering network structure beyond communities. Sci Rep 1:151
Shekhar K, Brodin P, Davis MM, Chakraborty AK (2014) Automatic classification of cellular expression by nonlinear stochastic embedding (ACCENSE). Proc Natl Acad Sci USA 111:202-207
Cox TF, Cox MAA (2000) Multidimensional scaling, 2nd edn. Chapman & Hall/CRC, London
Borg I, Groenen PJF (2005) Modern multidimensional scaling: theory and applications, 2nd edn. Springer, New York
Vogelstein JT et al. (2014) Discovery of brainwide neural-behavioral maps via multiscale unsupervised structure learning. Science 344:386-392
Engeszer RE, Wang G, Ryan MJ, Parichy DM (2008) Sex-specific perceptual spaces for a vertebrate basal social aggregative behavior. Proc Natl Acad Sci USA 105:929-933
Cermeño P, Falkowski PG (2009) Controls on diatom biogeography in the ocean. Science 325:1539-1541
Bronstein AM, Bronstein MM, Kimmel R (2006) Generalized multidimensional scaling: a framework for isometry-invariant partial surface matching. Proc Natl Acad Sci USA 103:1168-1172
Shieh AD, Hashimoto TB, Airoldi EM (2011) Tree preserving embedding. Proc Natl Acad Sci USA 108:16916-16921
Aflalo Y, Kimmel R (2013) Spectral multidimensional scaling. Proc Natl Acad Sci USA 110:18052-18057
Bauer-Mehren A et al. (2011) Gene-disease network analysis reveals functional modules in mendelian, complex and environmental diseases. PLoS ONE 6(6):e20284
Craciun G, Feinberg M (2006) Multiple equilibria in complex chemical reaction networks: II. The species-reaction graph. SIAM J Appl Math 66(4):1321-1338
Gabriel KR (1971) The biplot graphic display of matrices with application to principal component analysis. Biometrika 58(3):453-467
Bennett JF, Hays HL (1960) Multidimensional unfolding: determining the dimensionality of ranked preference data. Psychometrika 25(1):27-43
Hotelling H (1933) Analysis of a complex of statistical variables into principal components. J Educ Psychol 24:417-441, 498-520
Richardson M, Kuder GF (1933) Making a rating scale that measures. Pers J 12:36-40
Hirschfeld HO (1935) A connection between correlation and contingency. Proc Camb Philos Soc 31:520-524
Gower JC, Harding SA (1988) Nonlinear biplots. Biometrika 75(3):445-455
Gower JC (1992) Generalized biplots. Biometrika 79(3):475-493
Davis A, Gardner BB, Gardner MR (1941) Deep South: a social anthropological study of caste and class. University of Chicago Press, Chicago
Freeman LC (2003) Finding social groups: a meta-analysis of the southern women data. In: Ronald B, Kathleen C, Philippa P (eds) Dynamic social network modeling and analysis. National Academies Press, Washington
Field S, Frank KA, Schill K (2006) Identifying positions from affiliation networks: preserving the duality of people and events. Soc Netw 28:97-123
Beineke LW, Wilson RJ (2004) Topics in algebraic graph theory. Cambridge University Press, Cambridge
Porter MA, Mucha PJ, Newman MEJ, Warmbrand CW (2005) A network analysis of committees in the U.S. House of Representatives. Proc Natl Acad Sci USA 102(20):7057-7062
Ahn Y-Y, Ahnert SE, Bagrow JP, Barabási A-L (2011) Flavor network and the principles of food pairing. Sci Rep 1:196
Levandowsky M, Winter D (1971) Distance between sets. Nature 234:34-35


Acknowledgements

The authors wish to thank Daniel B. Larremore for useful feedback on the manuscript. This work was partially supported by a Clarkson University Provost Award, Army Research Office grants W911NF-12-1-0276 and W911NF-16-1-0081, and the Simons Foundation grant 318812. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect those of the funding agencies.

Author information

Authors and Affiliations

Department of Mathematics, Clarkson University, Potsdam, NY, 13699, USA
Yazhen Jiang, Joseph D Skufca & Jie Sun
Clarkson Center for Complex Systems Science, Potsdam, NY, 13699, USA
Jie Sun
Department of Physics, Clarkson University, Potsdam, NY, 13699, USA
Jie Sun
Department of Computer Science, Clarkson University, Potsdam, NY, 13699, USA
Jie Sun

Authors

Yazhen Jiang
Joseph D Skufca
Jie Sun

Corresponding author

Correspondence to Jie Sun.

Additional information

Competing interests

The authors declare no competing financial interests.

Authors’ contributions

JDS and JS designed the research. All authors contributed to methodological and algorithm developments, data collection, visualization and analysis. JDS and JS wrote the manuscript. All authors have read and approved the final manuscript.

Publisher’s Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.


About this article

Cite this article

Jiang, Y., Skufca, J.D. & Sun, J. BiFold visualization of bipartite datasets. EPJ Data Sci. 6, 2 (2017). https://doi.org/10.1140/epjds/s13688-017-0098-4


Received: 14 December 2016
Accepted: 08 April 2017
Published: 20 April 2017
DOI: https://doi.org/10.1140/epjds/s13688-017-0098-4
