 Regular article
 Open Access
BiFold visualization of bipartite datasets
EPJ Data Science volume 6, Article number: 2 (2017)
Abstract
The emerging domain of data-enabled science necessitates development of algorithms and tools for knowledge discovery. Human interaction with data through well-constructed graphical representation can take special advantage of our visual ability to identify patterns. We develop a data visualization framework, called BiFold, for exploratory analysis of bipartite datasets that describe binary relationships between groups of objects. Typical examples include voting records, organizational memberships, pairwise associations, and other binary datasets. BiFold provides a low-dimensional embedding of data that represents similarity by visual nearness, analogous to Multidimensional Scaling (MDS). The unique and new feature of BiFold is its ability to simultaneously capture both within-group and between-group relationships among objects, enhancing knowledge discovery. We benchmark BiFold using the Southern Women dataset, where social groups are now visually evident. We construct BiFold plots for two US voting datasets: for the presidential election outcomes since 1976, BiFold illustrates the evolving geopolitical structures that underlie these election results. For Senate congressional voting, BiFold identifies a partisan coordinate, separating senators into two parties, while simultaneously visualizing a bipartisan-coalition coordinate which captures the ultimate fate of the bills (pass/fail). Finally, we consider a global cuisine dataset of the association between recipes and food ingredients. BiFold allows us to visually compare and contrast cuisines while also allowing identification of signature ingredients of individual cuisines.
Introduction
Despite the dominance of automated algorithms for data mining and knowledge discovery, it has been increasingly recognized that human perception can play an essential and often favorable role in exploring patterns and developing insights [1]. For instance, the Hertzsprung-Russell diagram of stellar luminosity versus temperature provides a classic example of a data analysis problem easily tackled by a person but one that remains a challenge for automated methods [2]. Typically, the utilization of human cognition in exploratory data analysis relies on proper representation and visualization of the data in a low-dimensional embedding space [3–8].
The standard concept of a “dataset” is a tabular array, where each row corresponds to an object in the dataset and every column corresponds to a variable (or factor) measured on each object. A natural question about such a dataset is “how are objects like (or unlike) other objects and are there relevant relationships among collections of objects?” Multidimensional scaling (MDS) refers to a family of techniques that address these questions by visualizing the objects as a set of points embedded in a low-dimensional (typically 2D or 3D) geometric space, with the goal of representing the dissimilarities between objects by the distances between the corresponding points in the embedded space [9, 10]. The generality of MDS approaches makes them suitable for a broad range of practical problems, as demonstrated in many classical examples [9, 10] as well as in several recent scientific breakthroughs: mapping of brain-wide neural behavior [11], discovery of sex-specific and species-specific perceptual spaces among different biological species [12], and analysis of biogeographic differentiation between geographical regions [13]. On a more fundamental level, several recent developments focused on generalizing different measures of “distance” in the MDS formulation to allow for embedding from and/or onto general nonlinear manifolds [14–16].
Frequently, we encounter a dataset which encodes a binary relation between two sets (or “classes”) of objects, with elements of one set corresponding to the rows, elements of the other corresponding to the columns, and the data entries (“1” or “0”) indicating whether or not there is a relationship between the associated row and column. Common examples include politicians and the bills they supported, moviegoers and the movies they attend, or students and the courses in which they enroll. Such examples can be regarded as decision-makers and choices, while we note that similar datasets arise in many contexts that are often described by bipartite graphs, such as the association between genes and diseases [17] and the relation between chemical reactants and reactions [18].
Knowledge discovery on binary relation datasets can benefit from a visualization of both decision-makers and choices in a common embedding space, where (simultaneously)

(1) “Similar” objects (whether decision-makers or choices) ought to be “nearby” in the visualization;

(2) Decision-makers should be positioned “close” to their preferred choices.
The BiFold method developed here relates to a set of ordination methods that attempt to resolve various aspects of this challenge. From a classical perspective, one could build such a framework upon three primary approaches, and we point the reader to [9, 10] and the references therein for details of common methods: Biplot [19] aims to satisfy requirement (1), with points (typically referred to as “samples”) representing one set of objects and coordinate axes (often referred to as “variables” and plotted as position vectors) describing the other set; Unfolding [20] considers only between-class distances, and therefore focuses only on requirement (2); Correspondence analysis [21–23] focuses on contingency table data rather than general binary relation data. The BiFold method developed herein merges the respective goals of the Biplot and Unfolding methods, satisfying both requirements (1) and (2), and it is this connection that motivates the name.
In addition to the classical ordination methods described above, we note that BiFold has similar goals to the nonlinear and generalized biplot methods described in [24, 25]. In particular, the generalized biplot addresses categorical variables (of which dichotomous variables are a subset) with consideration of both requirements (1) and (2) in developing the ordination. Their approach is to ordinate each entry in the dataset, such that each level of a categorical variable is separately visualized. For binary variables, that approach would require representation of one of the classes of objects by two sets of ordination coordinates, one to represent the “1”s and the other to represent the “0”s in the data. Our original contribution here is twofold: (i) Our treatment is completely symmetric with respect to the classes, with neither being treated as the “variables.” The resulting ordination is identical even if we transpose our dataset; consequently, each object, regardless of class, is assigned only one coordinate. (ii) We consider an ordination scheme that accounts for the difference in information quality of cross-group and within-group distances, as well as the difference in information content across groups of different size, as specialized to the binary data framework. This ordination approach is more naturally able to account for the differences in information content arising from non-square data matrices, missing data, and differences in interpretation of matches for categorical variables [25].
We begin with an introductory example of BiFold below, leaving the details of the approach and more examples to the later sections.
An introductory example: BiFold plot of the Southern Women dataset
Consider the Southern Women dataset, collected in the 1930s in a small town in the southern United States. The data records the participation of 18 ladies (Southern Women) in 14 social events [26] and can be represented by matrix \(B=[b_{ij}]_{18\times 14}\), where \(b_{ij}=1\) indicates that woman i attended event j, and \(b_{ij}=0\) otherwise (see Figure 1 middle panel as well as Materials and Methods). Due to its relatively small size and simple structure, the dataset serves as a popular benchmark for techniques that consider social stratification, group formation, and other social structure questions [27].
One way to visualize the Southern Women dataset is to use MDS to place the 18 women at suitable 2D locations, where distance between embedded coordinates reflects the degree to which the women attended similar events, as in Figure 1, top left. In this case, we are treating the women as the entities to be plotted, while the events are regarded as factors that characterize each individual. Alternatively, we can treat the events as entities and the women as factors, allowing us to obtain an MDS configuration of events, as shown in Figure 1, bottom left. The goal is to “overlay” the two embeddings to not only capture the within-class relationships (woman to woman, event to event) but also the cross-class relationship of woman to event. BiFold produces such a joint visualization (Figure 1, right panel) in which social group structure [27, 28] is easily identified through proximity: (a) nearby women attended similar events, (b) nearby events were attended by similar groups of women, (c) nearby woman-event pairs indicate that the woman likely attended that event.
Results
The BiFold approach
BiFold provides a procedural framework to produce a low-dimensional embedding from a binary data matrix. First, we create a joint dissimilarity matrix that appropriately fuses information from both within-class and cross-class relations. Secondly, we construct a weighting matrix to reflect the relative uncertainty associated with the dissimilarities. Finally, we minimize a weighted stress function to obtain a BiFold embedding, coordinates in \(\mathbb{R}^{d}\) for each row and each column of the data matrix. In this section, we describe and explain this framework, leaving the detailed specification of the algorithms and parameters to Materials and Methods.
Given a binary relation between two types (classes) of objects, encoded as matrix \(B=[b_{ij}]_{m\times n}\), where
\[
b_{ij} =
\begin{cases}
1, & \text{if row object } i \text{ is related to column object } j,\\
0, & \text{otherwise.}
\end{cases}
\]
Such data equivalently encodes a bipartite graph, where an edge in the graph corresponds to a binary relationship, and the matrix B is the biadjacency matrix of the graph [29].
Focusing on objects represented by the rows of B, we quantify, using some appropriate measure, the dissimilarity between row i and row j, denoted \(\delta^{(x)}_{ij}\), producing matrix \(\Delta^{(x)}=[\delta^{(x)}_{ij}]_{m\times m}\). Likewise, we generate dissimilarity matrix \(\Delta^{(y)}=[\delta^{(y)}_{ij}]_{n\times n}\) by comparing the columns of B. Finally, the dissimilarity between row object i and column object j is defined by a monotonic transformation of the entries in matrix B, which yields a cross-class dissimilarity matrix \(\Delta^{(xy)}=[\delta^{(xy)}_{ij}]_{m\times n}\). A binary relation dataset typically falls into one of two categories: (1) choice data, for which each data entry (whether a “0” or a “1”) reflects an active decision (either a positive or negative relation); and (2) association data, for which the “0”s indicate only an absence of a relation, and are usually much less informative than the “1”s. For data from each of these categories, we have developed some sensible dissimilarity measures (see Materials and Methods).
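As a concrete sketch of the choice-data case (assuming NumPy; the function name is ours, not from the paper), the within-class dissimilarities can be computed as disagreement fractions, with the cross-class dissimilarity a simple monotone transform of B:

```python
import numpy as np

def choice_dissimilarities(B):
    """Dissimilarities for choice data from a binary matrix B (m x n).

    Row-row: fraction of columns on which two row objects disagree.
    Column-column: fraction of rows on which two column objects disagree.
    Cross-class: 1 - b_ij, a monotone transform of the entries of B.
    """
    B = np.asarray(B, dtype=float)
    # delta^(x): m x m matrix of row-row disagreement fractions
    D_x = np.abs(B[:, None, :] - B[None, :, :]).mean(axis=2)
    # delta^(y): n x n matrix of column-column disagreement fractions
    D_y = np.abs(B.T[:, None, :] - B.T[None, :, :]).mean(axis=2)
    # delta^(xy): m x n cross-class dissimilarities
    D_xy = 1.0 - B
    return D_x, D_y, D_xy
```

For example, two rows that disagree on half their entries receive a dissimilarity of 0.5.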
Given within-class dissimilarity matrices \(\Delta^{(x)}\) and \(\Delta^{(y)}\) together with the cross-class dissimilarity matrix \(\Delta^{(xy)}\), we form a joint dissimilarity matrix of size \((m+n)\times(m+n)\), as:
\[
\Delta =
\begin{pmatrix}
\alpha_{x}\Delta^{(x)} & \alpha_{xy}\Delta^{(xy)}+\beta\mathbf{1} \\
\alpha_{xy}\Delta^{(yx)}+\beta\mathbf{1} & \alpha_{y}\Delta^{(y)}
\end{pmatrix},
\]
where \(\mathbf{1}\) is a matrix of 1s, and \(\Delta^{(yx)}\) is the matrix transpose of \(\Delta^{(xy)}\). The joint dissimilarity matrix contains a few tunable parameters: \(\alpha_{x}\), \(\alpha_{y}\), and \(\alpha_{xy}\) control the relative scale of the within-class and cross-class distances in the embedded space, while β allows for explicit translation (i.e., shifting) of the row objects away from the column objects. In the examples of this paper, we use \(\beta= 0\) (no translation). Note, however, that for some datasets the visualization might be improved by a translation (realized by a nonzero β value) as determined by the end-user.
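One plausible assembly of this joint matrix (a NumPy sketch; the placement of the β shift in the cross-class blocks follows our reading of the description, and the function name is ours):

```python
import numpy as np

def joint_dissimilarity(D_x, D_y, D_xy,
                        alpha_x=1.0, alpha_y=1.0, alpha_xy=1.0, beta=0.0):
    """Fuse within-class (D_x, D_y) and cross-class (D_xy) dissimilarities
    into one (m+n) x (m+n) joint matrix; beta shifts the cross-class block."""
    D_x, D_y, D_xy = (np.asarray(M, dtype=float) for M in (D_x, D_y, D_xy))
    top = np.hstack([alpha_x * D_x, alpha_xy * D_xy + beta])
    bottom = np.hstack([alpha_xy * D_xy.T + beta, alpha_y * D_y])
    return np.vstack([top, bottom])
```

With the default parameters (all α equal to 1, β = 0), the result is simply the four dissimilarity blocks stacked symmetrically.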
Dissimilarities are generated from data and should be viewed as measurements with uncertainty. To capture such uncertainty, we associate a “weight” to each dissimilarity following the principle that the weight should reflect the information content (or reliability). We denote the corresponding joint weighting matrix as
\[
W =
\begin{pmatrix}
W^{(x)} & W^{(xy)} \\
W^{(yx)} & W^{(y)}
\end{pmatrix},
\]
where the blocks weight the within-class and cross-class dissimilarities, respectively, and \(W^{(yx)}\) is the transpose of \(W^{(xy)}\).
Once the joint dissimilarity and weighting matrices are specified, a d-dimensional BiFold embedding yields coordinates \(X=(\boldsymbol{x}_{1},\boldsymbol{x}_{2},\dots,\boldsymbol{x}_{m})\) and \(Y=(\boldsymbol{y}_{1},\boldsymbol{y}_{2},\dots,\boldsymbol{y}_{n})\) denoting the sets of points corresponding to the row objects and column objects. Denote the full coordinate set as \(Z=(X,Y)=(\boldsymbol{z}_{1},\dots,\boldsymbol{z}_{m+n})\). Such an embedding is computed by minimization of the multivariate stress function \(S:\mathbb{R}^{d\times(m+n)}\rightarrow\mathbb{R}\), defined as
\[
S(Z) = \sum_{i<j} w_{ij} \bigl( \Vert \boldsymbol{z}_{i}-\boldsymbol{z}_{j} \Vert - \delta_{ij} \bigr)^{2},
\]
where \(\delta_{ij}\) and \(w_{ij}\) are the entries of the joint dissimilarity and weighting matrices.
Here \(\Vert \cdot \Vert \) denotes a metric distance in the common embedding space. The stress function S, which is given by a weighted sum of the discrepancies between the embedded distances and the dissimilarities, is a standard type of loss function frequently used in MDS [9, 10].
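As an illustrative sketch of the minimization (not the SMACOF algorithm referenced later in this article; here we simply hand the stress function to SciPy's general-purpose L-BFGS-B optimizer, and all names are ours):

```python
import numpy as np
from scipy.optimize import minimize

def bifold_embed(Delta, W, d=2, seed=0):
    """Minimize S(Z) = sum_{i<j} w_ij (||z_i - z_j|| - delta_ij)^2
    over configurations Z of N = m + n points in R^d."""
    Delta = np.asarray(Delta, dtype=float)
    W = np.asarray(W, dtype=float)
    N = Delta.shape[0]
    iu = np.triu_indices(N, k=1)          # count each pair (i < j) once
    rng = np.random.default_rng(seed)
    z0 = rng.normal(scale=0.1, size=N * d)  # small random start

    def stress(z):
        Z = z.reshape(N, d)
        diff = Z[:, None, :] - Z[None, :, :]
        dist = np.sqrt((diff ** 2).sum(axis=2))
        return float(np.sum(W[iu] * (dist[iu] - Delta[iu]) ** 2))

    res = minimize(stress, z0, method="L-BFGS-B")
    return res.x.reshape(N, d)
```

For three points with all pairwise dissimilarities equal to 1 and uniform weights, the minimizer recovers an (approximately) equilateral configuration.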
In the following sections, we illustrate BiFold via three additional binary datasets: US presidential election results, US Senate voting records from the 112th Congress, and a food-recipe relational dataset from five major global cuisines.
Examples: voting datasets
A common type of binary relation comes from voting data: for each item to be voted on, a voter either votes “for” or “against,” with (perhaps) the ability to abstain. We consider two such examples: US presidential election results for the past ten elections, for which the “voters” are the individual US states and the “items” are the winning presidents in each election; and senate congressional roll call votes, with US senators as the voters and senate bills as the items.
Presidential election by states
Consider the state-level votes for the United States presidential elections for the period from 1976 to 2012. There are 51 decision makers (50 states plus the District of Columbia) and a total of 10 decisions, resulting in a data matrix \(B=[b_{ij}]_{51 \times10}\) where
\[
b_{ij} =
\begin{cases}
1, & \text{if state } i \text{ voted for the elected president in election } j,\\
0, & \text{otherwise.}
\end{cases}
\]
As with the Southern Women example, we seek a low-dimensional visualization which captures the within-class relationships (state-to-state and president-to-president) while also accounting for the between-class relationships (state-to-president). To quantify these relationships, we define the dissimilarity between two states as the fraction of elections for which they voted differently; the dissimilarity between two elections is computed as the fraction of states which voted differently in those elections; finally, the dissimilarity between state i and election j is quantified as \(1-b_{ij}\). Given these dissimilarities, Figure 2 visualizes these election results, where coordinates are determined using BiFold.
Using BiFold for positional layout, we may encode additional information using other aesthetics. In Figure 3, states are colored according to party affinity (based on the fraction of these 10 elections in which that state voted for the presidential candidate from that party, with Republican in red and Democrat in blue).
Aided by the additional encoded information, the BiFold layout yields some interesting observations.

Not surprisingly, the primary coordinate axis (left/right) strongly encodes the party affinity (blue state/red state).

Over time, the election positions have (generally) moved toward the left/right extremities, capturing the increasing partisanship of the elections.

Most of the purple colored “swing states” are near the center of the visualization, which implies that they align with most of the election winners, with slight variation based on the particular set of presidents that they supported. As an interesting exception, West Virginia lies far above the main cloud, which we attribute to its trend of having often supported the non-winning candidates.

Noting several “paired” election coordinates, we observe that such pairs associate with twoterm presidents, likely because their constituent support did not change much between elections.

Positional outlier Carter ‘76 reflects support from a non-typical coalition of states, likely attributed to Carter being the first president elected from the Deep South since the Civil War. Reagan ‘84 is the most centrally positioned, reflecting broad national support. Figure 3b connects each of these elections to the supporting states.

Comparing Bush ‘00 to Obama ‘12 (Figure 3c) we see both elections driven primarily by partisan support.
We remark that any of the visually indicated hypotheses should be viewed as exploratory and be confirmed by additional quantitative analysis (as would also be appropriate for most other data visualizations). However, we note that the BiFold visualization motivates a rich palette of such hypotheses, many of which directly exploit the between-class information.
Senate congressional roll call votes
We consider the voting record of the United States (U.S.) Senate. The U.S. legislative body is composed of two chambers, known as the Senate and the House of Representatives. A particular congress serves for two years, with this time frame divided into two sessions. We focus on voting data from the 112th congress, first session of the Senate, which conducted 235 roll call votes. For each roll call vote, there are at most 100 senators. However, the replacement of Senator Ensign by Senator Heller mid-session leads to a data matrix with 101 rows and 235 columns, recording the action of senator i on bill j:
\[
b_{ij} =
\begin{cases}
1, & \text{if senator } i \text{ voted “yes” on bill } j,\\
0, & \text{if senator } i \text{ voted “no” on bill } j.
\end{cases}
\]
If senator i did not act on bill j, entry \(b_{ij}\) is undefined and is treated as missing data. (See Materials and Methods for treatment of missing data.)
As with the previous examples, the goal (achieved by BiFold) is to obtain an embedding that captures both the within-class relationships (senator-to-senator and bill-to-bill) and the between-class relationships (senator-to-bill). From the data, we quantify the dissimilarity between two senators (bills) as the estimated probability that they vote (were voted) differently. Dissimilarity between senator i and bill j is the estimated likelihood that senator i objects to bill j.
As shown in Figure 4, the BiFold plot clearly shows the two-party structure of the senate, allowing for convenient visual comparison of the relative “spread” of the parties, and identification of senators that are “moderate” versus those that are more “extreme” (Figure 4, top panels). The pattern of bills revealed by BiFold is reminiscent of the diamond structure previously identified from classical MDS [30]. In addition, BiFold provides visual information regarding the relationships between bills and senators by positioning bills “close to” the senators supporting them. This unique feature enables a clear classification of the main clusters of bills as shown in Figure 4:

Bills in the “left” (liberal) cluster received strong support from the Democratic Senators;

Bills in the “right” (conservative) cluster received strong support from the Republicans;

Bills in the “top” (bipartisan supportive) cluster were strongly supported by both parties, and appear visually “pulled” between the two parties;

Bills in the “bottom” (bipartisan opposition) cluster are pushed far away from both parties, indicating bills that were supported only by a small number of senators.
Thus, by simultaneous embedding of both the senators and the bills, the BiFold visualization not only captures patterns within the senators and those within the bills, but also reveals salient features of the senatorbill cross relations.
Association datasets: a recipe-ingredient dataset
We envision the BiFold approach to be broadly useful, certainly beyond the visualization of voting data. Another important category of binary data captures the association between “members” and “affiliations.” A key feature of such association datasets is that the non-association relations carry little information compared to the association relations; in sharp contrast, in a voting dataset the “yes” and “no” votes both convey valuable information about the relation between the decision makers and the choices. Association datasets often form sparse, bipartite networks, where sparsity arises from the reality that there are (typically) many more non-associations than associations in these data.
We focus here on a specific example relating recipes with their included ingredients. A recipe defines a procedure for cooking, along with a list of food items (ingredients) used in the recipe. Gathering this data over a broad spectrum of recipes allows us to more completely understand how ingredients are used in combination, which may vary from one cuisine to another. As our data source, we consider the recipe-ingredient association dataset generated in [31], which assembled over \(50{,}000\) recipes taken from two American online repositories and one Korean online repository. The data is (again) represented by a matrix B, where \(b_{ij}=1\) indicates that recipe i contains ingredient j, and 0 otherwise. To proceed with the BiFold approach, we must define the dissimilarities between the entities: Recall that in the voting examples, both a “1” (a yes vote) and a “0” (a no vote) contain actual information regarding a voter’s opinion. In contrast, in the recipe-ingredient dataset, a given recipe typically includes only a small fraction of all available ingredients and carries essentially no information on those ingredients that are not used in the recipe.
The between-class dissimilarity measure is as before, \(\delta^{(xy)}_{ij}=1-b_{ij}\). However, the within-class dissimilarities require more careful consideration. If we were to quantify the dissimilarity between two recipes in the same way that we did for two voters, we would conclude that most recipes are very “similar.” This apparent similarity is artificial, resulting not from commonality of the ingredients they share, but from the overwhelmingly large set of ingredients that neither recipe contains. A dissimilarity measure that symmetrically incorporates “1”s and “0”s will therefore be dominated by the sparsity of the data rather than the actual relation between the entities of interest. In this context, we consider the “0”s as carrying relatively little information. As such, the Jaccard distance provides a natural measure of dissimilarity [32], where we treat rows (or columns) of B as a characteristic function indicating set membership. For two recipes i and j, the Jaccard distance is
\[
\delta^{(x)}_{ij} = 1 - \frac{\vert R_{i}\cap R_{j} \vert}{\vert R_{i}\cup R_{j} \vert},
\]
where \(R_{i}\) denotes the set of ingredients contained in recipe i.
Likewise, the Jaccard distance between two ingredients i and j is
\[
\delta^{(y)}_{ij} = 1 - \frac{\vert S_{i}\cap S_{j} \vert}{\vert S_{i}\cup S_{j} \vert},
\]
where \(S_{i}\) denotes the set of recipes that contain ingredient i.
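Assuming SciPy is available, both within-class Jaccard distance matrices can be computed directly from B and its transpose (shown here on a small toy matrix, not the actual dataset):

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

# Toy recipe-by-ingredient matrix: 3 recipes, 4 ingredients
B = np.array([[1, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 0, 1, 1]], dtype=bool)

D_recipes = squareform(pdist(B, metric="jaccard"))        # recipe-recipe
D_ingredients = squareform(pdist(B.T, metric="jaccard"))  # ingredient-ingredient
```

For recipes 0 and 1 above, one shared ingredient out of three used in total gives a Jaccard distance of 2/3.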
In addition to the recipe-ingredient relationship information, the original dataset also categorized each recipe as belonging to a particular cuisine. We focus our analysis on a random subsample (1,000 recipes per cuisine) of the five cuisines in the original dataset that contain more than 1,000 recipes. We compute a 2D BiFold embedding to support visualization of this reduced dataset. In Figure 5, we use the BiFold coordinates to plot food ingredients (circles, colored by ingredient category), with that layout the same for all five cuisines. Each cuisine is visualized in its own panel, where we use a density plot to capture the distribution of recipes from that cuisine.
As expected, ingredients that are commonly used together in recipes are positioned near each other in the plot, and recipes with similar ingredients appear close together as well. A unique outcome of applying BiFold to this data is that we may now visually associate ingredients to cuisines, whereas the original data only associates recipes to cuisines; this entirely new level of interpretation is enabled by embedding both recipes and ingredients using a common coordinate frame:

From the collection of cuisine plots, we can visually identify similar cuisines (North American and Western European; Latin American and Southern European).

The East Asian cuisine appears visually distinct from the western heritage cuisines.

The protein group, primarily meat, appears centrally in the figure of ingredients, with all the cuisines showing significant density in that region of the plot. (In other words, the meat group does not identify any particular cuisine.)

The density plots allow us to visually identify certain ingredients as the “signature” of a cuisine: basil and oregano (Southern European); sesame oil and soy sauce (East Asian); cocoa and vanilla (North American and Western European).
Discussion
The BiFold framework described in this article has primarily focused on a fixed, binary dataset, interpretable as associations between two types of objects. We consider that framework to be broadly applicable to datasets describing relationships between entities from different classes, where we want to simultaneously visualize the different classes such that visual distance can be associated with a dissimilarity measure, both within class and between classes. For the datasets examined, we remark that although the discoveries facilitated by the visualization are possibly achievable by other analysis techniques, BiFold has a unique ability to visualize those discoveries simultaneously. Note that the extent to which a BiFold plot (or any visualization) reflects the actual similarities and dissimilarities between objects in the dataset, as measured by the stress function, depends intrinsically on the dataset itself. In typical real-world datasets, the representation would not be perfect, even if the dimensionality of the embedding were large. For the datasets considered here, we find that in the Southern Women example, as well as the two US voting examples, a low-dimensional (2D or 3D) BiFold embedding achieves an almost minimal stress which cannot be further decreased by increasing dimensionality (see Figure 6), supporting the notion that the opinions are well expressed by a low-dimensional model. On the other hand, for the recipe-ingredient example, increasing the dimensionality beyond 3D continues to decrease stress and improve the match to the original data (Figure 6), suggesting an enormous diversity and complexity in the cuisine space which cannot be accounted for using just a few variables or parameters.
In addition, we note that the BiFold framework described here may be easily extended to a number of interesting and related problems:

As an (almost trivial) extension, we note that interpretation of the data as representing a bipartite network implies that BiFold could act as a graph layout algorithm for bipartite network data.

BiFold can be viewed as a generalization of several other classical techniques, which can be recovered by specific choices of parameters.

The entries in the data matrix, B, need not be binary, but could represent a continuous or ordinal variable, such as ratings, rankings, or preferences.

Some dataset might naturally contain more than two groups, such as actors, movies, and viewers. Such datasets can be treated as multipartite, rather than bipartite data. We envision a natural extension of BiFold, where the joint dissimilarity and weighting matrices must be appropriately constructed based on the withingroup and betweengroup relationships.

We focused on Hamming distance and Jaccard distance to compute within-class dissimilarity, with each providing a natural interpretation for the datasets considered. We note that the BiFold framework is not dependent upon any particular choice of dissimilarity measure, and a reasonable practitioner may choose other methods for defining dissimilarities (and weights) that might be appropriate for their data. The BiFold approach, based on the joint dissimilarity matrix, will still provide a means to develop the joint visualization.

For some of the methods, we interpret the raw (binary) data from a Bayesian perspective, but with an uninformed prior. That approach could easily be adapted to accommodate other a priori understanding of the data.

For dynamic datasets (parameterized by time, for example), each data “snapshot” would yield a BiFold layout. A stress functional that incorporates a regularity condition in time could then produce an optimal sequence of layouts over many snapshots.
As a caution, we note some of the challenges associated with analysis via the BiFold framework:

Computational complexity of the stress minimization as an optimization problem using the SMACOF algorithm is roughly \(O(n^{4})\) for reaching a local, approximate solution. As such, the current implementation of stress minimization will likely struggle with very large datasets. Because the technique is meant to support visual knowledge discovery (human interaction), speed of visualization is important. Data aggregation might be a way to handle large datasets, but the aggregation procedures will almost certainly be domain specific.

Comparing one BiFold layout to another (exploring parameter space) can be challenging in that the solution layout is rotation and reflection invariant, so normalizing the orientation of the generated solution is important. As an additional complication, the configuration solving the optimization problem is a local minimizer, so the solution may “jump” to a different minimizer under small changes in the data.
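One standard way to normalize orientation before comparing layouts is Procrustes alignment, which removes translation, scale, and rotation/reflection; a sketch using SciPy on synthetic coordinates (a stand-in for two BiFold solutions differing only by a rigid motion):

```python
import numpy as np
from scipy.spatial import procrustes

# A reference layout and a rotated copy of it
Z1 = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 2.0]])
theta = np.pi / 3
R = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])
Z2 = Z1 @ R.T

# procrustes standardizes both layouts and finds the best orthogonal
# alignment; a near-zero disparity means the layouts agree up to
# translation, scaling, and rotation/reflection.
m1, m2, disparity = procrustes(Z1, Z2)
```

In practice one would align each newly computed BiFold solution to a fixed reference layout before visual comparison.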

The non-Euclidean nature of the dissimilarity measures results in a dissimilarity matrix that is not necessarily well approximated by a low-dimensional embedding. In such cases, visually interesting effects may sometimes be an artifact of the data, particularly with sparse datasets.
Despite these challenges, we note that the BiFold framework developed here appears to have broad applicability in many settings related to complex networks, social sciences, and those areas of data analysis that focus on binary relations.
Materials and methods
Datasets

The Southern Women dataset is a popular dataset used in social network analysis. The dataset first appeared in the book “Deep South: A Social Anthropological Study of Caste and Class” [26] (p.148), and can also be found in several online network data repositories. Collected in the 1930s in the small southern town of Natchez (Mississippi, United States), the data records the participation of 18 women in a series of 14 informal social events over a nine-month period. Only the events in which at least two women participated are included in the dataset. Figure 1 shows the data table without including the names of the women or dates of the events. We represent the dataset by a woman-by-event matrix \(B=[b_{ij}]_{18\times14}\), where \(b_{ij}=1\) indicates that woman i attended event j, and \(b_{ij}=0\) otherwise.

The U.S. presidential election dataset considered in this paper includes the state-level voting results of the United States presidential elections for the period from 1976 to 2012. The dataset, available at the U.S. government archive (http://www.archives.gov/federalregister/electoralcollege/), includes the state voting outcome from the 51 voting entities (50 states plus the District of Columbia) for the past 10 presidential elections. We number the states alphabetically from 1 to 51 by name, and the elections from 1 to 10 in chronological order. We then represent the dataset by a state-by-president matrix \(B=[b_{ij}]_{51\times10}\), where \(b_{ij}=1\) indicates that state i voted for the elected president in the jth election, and \(b_{ij}=0\) otherwise. For example, in all of the past 10 elections, Ohio voted for the presidential candidate who eventually won the election, regardless of his party affiliation. Florida and Nevada both “missed” one election: in the 1992 election, Florida voted for G.H.W. Bush (the elected president was B. Clinton); in the 1976 election, Nevada voted for G. Ford (the elected president was J. Carter). All three are well-known examples of “swing” states, characterized by flexible voting patterns and importance in determining the election outcome.

The U.S. Senate congressional voting dataset used in this paper is obtained from the congressional voting records of the 112th United States Congress, first session of the Senate. There are at most 100 senators at any given time, with an occasional need to replace a senator mid-session, which happened once during the voting portion of this session. As such, the roll calls record 101 senators voting: 51 Democrats (D), 48 Republicans (R), and 2 Independents (I). There were 235 recorded roll call votes, of which 167 passed and 68 were rejected. We number the senators from 1 to 101 by last name, and the bills from 1 to 235 in chronological order. We form the data matrix \(B=[b_{ij}]_{101\times235}\) by defining \(b_{ij}\) from the vote of senator i on bill j: for a “yes” vote, \(b_{ij}=1\); for a “no” vote, \(b_{ij}=0\). Abstentions are treated as “missing” data in the matrix (see the “Treatment of partial and missing data” section below for details).

The recipe-ingredient dataset is retrieved from the Supplementary Information of Ref. [31], a paper that studied the similarities and differences in food pairings across geographical regions. The dataset contains more than 50,000 recipes extracted from three cuisine websites: allrecipes.com, epicurious.com, and menupan.com. The recipes are divided into 11 geographical regions, covering ∼50 popular cuisines around the world, and the recipes and ingredients are indexed. Focusing on the 5 geographical regions (cuisines) that contain over 1,000 recipes each, we construct the data matrix \(B=[b_{ij}]_{5{,}000\times335}\), with \(b_{ij}=1\) if recipe i contains ingredient j and \(b_{ij}=0\) otherwise. This subsample of the original dataset contains 1,000 randomly selected recipes from each of the 5 selected cuisines: East Asian, Latin American, North American, Southern European, and Western European. The subsampled data contain a total of 335 different ingredients.
The BiFold framework: dissimilarity measures, weights, and stress minimization
The BiFold framework describes a general approach to producing a low-dimensional embedding from a data matrix that encodes the relationship between two classes of objects. First, one creates a joint dissimilarity matrix using appropriate within-class and cross-class dissimilarity measures, with scaling to make the within-class and cross-class dissimilarities commensurate. Second, one constructs a weighting matrix to reflect the relative focus to be given to the computed dissimilarities. Finally, the BiFold embedding is obtained by minimizing a weighted stress function, similar to the determination of an MDS solution.
We now present the mathematical details of the BiFold procedure. For a given data matrix \(B=[b_{ij}]_{m\times n}\), a d-dimensional BiFold embedding is based upon minimization of the multivariate stress function \(S:\mathbb{R}^{d\times(m+n)}\rightarrow\mathbb{R}\), given by
$$ S(z_{1},\ldots,z_{m+n})=\sum_{1\leq i<j\leq m+n} w_{ij}\, \Phi \bigl( \Vert z_{i}-z_{j} \Vert ,\delta_{ij} \bigr), $$(1)

where \(z_{i}\in\mathbb{R}^{d}\) are the embedding coordinates, and \(\delta_{ij}\) and \(w_{ij}\) are the entries of the joint dissimilarity matrix Δ and the joint weighting matrix W defined below.
The joint dissimilarity matrix is given by Eq. (2), where \(\Delta^{(x)}=[\delta^{(x)}_{ij}]_{m\times m}\) and \(\Delta^{(y)}=[\delta^{(y)}_{ij}]_{n\times n}\) are the within-class dissimilarity matrices and \(\Delta^{(xy)}=[\delta^{(xy)}_{ij}]_{m\times n}\) is the cross-class dissimilarity matrix (with \(\Delta^{(yx)}=\Delta^{(xy)\top}\)). The parameters \(\alpha_{x}\), \(\alpha_{y}\), and \(\alpha_{xy}\) provide flexible scaling of the within-class and cross-class distances in the embedded space, while β can be used to visually translate the type-1 objects away from the type-2 objects in the embedding.
$$ \Delta= \begin{pmatrix} \alpha_{x}\Delta^{(x)} & \alpha_{xy}\Delta^{(xy)}+\beta \\ \alpha_{xy}\Delta^{(yx)}+\beta & \alpha_{y}\Delta^{(y)} \end{pmatrix}. $$(2)
The weighting matrix is defined in Eq. (3), where \(W^{(x)}=[w^{(x)}_{ij}]_{m\times m}\) and \(W^{(y)}=[w^{(y)}_{ij}]_{n\times n}\) are the within-class weighting matrices and \(W^{(xy)}=[w^{(xy)}_{ij}]_{m\times n}\) is the cross-class weighting matrix (with \(W^{(yx)}=W^{(xy)\top}\)).
$$ W= \begin{pmatrix} W^{(x)} & W^{(xy)} \\ W^{(yx)} & W^{(y)} \end{pmatrix}. $$(3)
A typical choice for the above stress function S is to let \(\Phi(d,\delta)=(d-\delta)^{2}\). For a given dissimilarity matrix and weighting matrix, this fully specified stress function may then be minimized to obtain the coordinates \(\{z_{1}, \ldots,z_{m+n}\}\).
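With the quadratic choice of Φ, the weighted stress can be evaluated directly from a coordinate matrix together with the joint dissimilarity and weighting matrices. The sketch below is illustrative only; the function and variable names are ours, not from the paper's code:

```python
import numpy as np

def weighted_stress(Z, Delta, W):
    """S(Z) = sum over pairs i < j of w_ij * (||z_i - z_j|| - delta_ij)^2."""
    D = np.linalg.norm(Z[:, None, :] - Z[None, :, :], axis=-1)
    iu = np.triu_indices(len(Z), k=1)   # each unordered pair counted once
    return float(np.sum(W[iu] * (D[iu] - Delta[iu]) ** 2))

# Sanity check: an exact embedding of three mutually equidistant points
# (an equilateral triangle with unit sides) has zero stress.
Z = np.array([[0.0, 0.0], [1.0, 0.0], [0.5, np.sqrt(3) / 2]])
Delta = np.ones((3, 3)) - np.eye(3)   # all pairwise dissimilarities equal 1
W = np.ones((3, 3))
print(weighted_stress(Z, Delta, W))   # ~0.0
```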
Dissimilarity measures and weights used in the examples
In the data matrix \(B=[b_{ij}]_{m\times n}\) of the Southern Women dataset, \(b_{ij}=1\) if woman i attended event j and \(b_{ij}=0\) otherwise. For the BiFold plot in Figure 1, we used the following within-class and cross-class dissimilarities:

$$ \delta^{(x)}_{ij}=\sum_{k} \vert b_{ik}-b_{jk} \vert ,\qquad \delta^{(y)}_{ij}=\sum_{k} \vert b_{ki}-b_{kj} \vert ,\qquad \delta^{(xy)}_{ij}=1-b_{ij}. $$
Then, to balance the spread of the points from the two classes in the embedding, we set the scaling parameters \(\alpha_{x}=1/n\), \(\alpha_{y}=1/m\), and \(\alpha_{xy}=1\), with shifting parameter \(\beta=0\); all entries of the joint weighting matrix W are equal to 1. These choices were made primarily for simplicity and are unlikely to be appropriate for the other, much larger datasets considered in this paper. Below we develop a set of dissimilarity measures and corresponding weights suitable for two common types of data matrices, encoding voting and association relations, respectively.
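Assuming the joint matrix combines the blocks as described above (within-class blocks scaled by \(\alpha_{x}\) and \(\alpha_{y}\), the cross-class block scaled by \(\alpha_{xy}\) and shifted by β), it can be assembled as follows. The toy matrix and Hamming-count dissimilarities mirror the Southern Women setup:

```python
import numpy as np

def joint_dissimilarity(Dx, Dy, Dxy, ax, ay, axy, beta=0.0):
    """Assemble the (m+n) x (m+n) joint dissimilarity matrix from the
    within-class blocks Dx, Dy and the cross-class block Dxy."""
    top = np.hstack([ax * Dx, axy * Dxy + beta])
    bottom = np.hstack([axy * Dxy.T + beta, ay * Dy])
    return np.vstack([top, bottom])

# Toy 2 x 3 binary matrix: Hamming-count within-class dissimilarities,
# 1 - b_ij cross-class dissimilarities, balanced by ax = 1/n, ay = 1/m.
B = np.array([[1, 0, 1],
              [1, 1, 0]])
m, n = B.shape
Dx = np.abs(B[:, None, :] - B[None, :, :]).sum(axis=2)      # m x m
Dy = np.abs(B.T[:, None, :] - B.T[None, :, :]).sum(axis=2)  # n x n
Dxy = 1 - B
Delta = joint_dissimilarity(Dx, Dy, Dxy, 1 / n, 1 / m, 1.0)
print(Delta.shape)  # (5, 5)
```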

Voting data: the BiFold Bernoulli Method. When the data matrix B represents ‘voting’ data, such that \(b_{ij}=1\) indicates that object \(X^{i}\) voted positively for object \(Y^{j}\), one may consider the preference selection (‘1’ or ‘0’) to be a forced binary decision on a continuous variable that represents preference. One model for this situation is to view \(b_{ij}\) as the observation of the forced decision outcome, treated as a Bernoulli trial, where the Bernoulli parameter \(p:=p_{ij}=:p_{ij}^{(xy)}\) is not known. (For real voting datasets, we treat ‘yes’ as ‘1’ and ‘no’ as ‘0.’ As a third outcome, a voter sometimes ‘abstains’ on a particular vote, which we view as “missing data” and handle with the technique described below.) Applying this model within a group (for example, within group 1), we assert a Bernoulli process with \(p:=p_{ij}^{(xx)}\) the (unknown) probability that objects \(X^{i}\) and \(X^{j}\) would vote the same way on an arbitrarily selected vote. Comparing rows i and j of the data matrix B then provides n observations of outcomes from that Bernoulli process. Comparing columns in the same way provides m observations of the Bernoulli process associated with objects \(Y^{i}\) and \(Y^{j}\). Ideally, we would like to construct a BiFold configuration using dissimilarities computed from the actual values of preference, i.e., the unknown values of \(p_{ij}\). Instead, we must assign dissimilarities from estimated probabilities, \(\delta_{ij}^{(*)}:=1-\hat{p}_{ij}^{(*)}\). Following the standard development for estimating proportions, we count the number of within-group differences between pairs of entities in each class:
$$\begin{aligned}& s_{ij}^{(x)}= \sum_{k} \vert b_{ik}-b_{jk} \vert , \end{aligned}$$(10)

$$\begin{aligned}& s_{ij}^{(y)}= \sum_{k} \vert b_{ki}-b_{kj} \vert . \end{aligned}$$(11)

For the cross-class data, we pool all observations to define an average rate of positive voting:
$$ \bar{p}=\frac{\sum_{i,j} b_{ij}}{nm}. $$(12)

Because we have significantly more observations for the within-class data, we expect those estimates to be more accurate. Consequently, we choose weights \(w_{ij}\) proportional to the information content. Borrowing from approaches used in regression of heteroscedastic data, we weight the error term (stress) inversely to the (estimated) variance of the observation, as applied in equations (8) and (3). We focus on three primary alternatives for estimating the parameters and the variance: (1) Bayesian, with a uniform prior; (2) Bayesian, with Jeffreys’ prior; and (3) non-Bayesian, the maximum likelihood estimate. Table 1 shows the resulting formulas for these methods. We note that the specific Bayesian approaches described assume no prior belief regarding the parameters \(p_{ij}\); however, the concept is readily generalized to cases where prior information is available, by encoding that knowledge into the assumed prior distribution.
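Table 1 is not reproduced in this excerpt, so the sketch below implements only the uniform-prior (Laplace) variant, following standard Beta-Bernoulli conjugacy: with s disagreements out of n comparisons, the posterior for the disagreement probability is Beta(s + 1, n − s + 1), whose mean supplies the dissimilarity and whose inverse variance supplies the weight. This is an illustration of the idea; the exact formulas should be taken from Table 1.

```python
import numpy as np

def bernoulli_within_class(B):
    """Within-class dissimilarities and weights for the rows of a binary
    voting matrix B, sketched with a uniform (Laplace) prior."""
    n = B.shape[1]
    # s[i, j]: number of votes on which rows i and j differ (cf. Eq. (10)).
    s = np.abs(B[:, None, :] - B[None, :, :]).sum(axis=2)
    delta = (s + 1) / (n + 2)            # posterior mean of disagreement prob.
    var = delta * (1 - delta) / (n + 3)  # Beta posterior variance
    w = 1.0 / var                        # weight: inverse estimated variance
    return delta, w

B = np.array([[1, 1, 0, 1],
              [1, 0, 0, 1],
              [0, 1, 1, 0]])
delta, w = bernoulli_within_class(B)
print(round(delta[0, 1], 3))  # rows differ on 1 of 4 votes -> (1+1)/(4+2)
```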

Association data: the BiFold Membership Method. For association data (such as the recipe-ingredient dataset), an entry \(b_{ij}=1\) of the sparse biadjacency matrix indicates an association between object i from class x and object j from class y. Unlike the case of voting datasets, a “0” in an association dataset carries relatively little information compared with a “1”. This asymmetry, if not accounted for appropriately, will result in an embedding (and visualization) that is dominated by the count of 1s instead of revealing more useful features.
The cross-class dissimilarity is quantified as

$$ \delta_{ij}^{(xy)}=1-b_{ij}. $$(13)

The within-class dissimilarities are computed using a Jaccard distance. Specifically, for two objects represented by rows i and j of the matrix B, their dissimilarity is given by

$$ \delta^{(x)}_{ij}=1-\frac{\sum_{k} b_{ik}b_{jk}}{\sum_{k} (b_{ik}+b_{jk}-b_{ik}b_{jk} )}. $$(14)

Likewise, the dissimilarity between columns i and j is computed as

$$ \delta^{(y)}_{ij}=1-\frac{\sum_{k} b_{ki}b_{kj}}{\sum_{k} (b_{ki}+b_{kj}-b_{ki}b_{kj} )}. $$(15)
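These Jaccard dissimilarities can be computed for all row pairs at once via matrix products; a minimal sketch:

```python
import numpy as np

def jaccard_dissimilarity(B):
    """Pairwise Jaccard distance between the rows of a binary matrix B,
    as in Eq. (14): 1 - (shared 1s) / (1s in either row)."""
    inter = B @ B.T                                  # sum_k b_ik * b_jk
    ones = B.sum(axis=1)
    union = ones[:, None] + ones[None, :] - inter    # sum_k (b_ik + b_jk - b_ik b_jk)
    with np.errstate(invalid="ignore"):              # all-zero rows give 0/0
        return 1.0 - inter / union

B = np.array([[1, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 0, 1, 1]])
D = jaccard_dissimilarity(B)
print(D[0, 1])  # rows share 1 of 3 items -> 1 - 1/3
```

(Column dissimilarities, Eq. (15), follow by applying the same function to `B.T`.)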
For the weights, we treat \(b_{ij}=1\) as representing unit information, while \(b_{ij}=0\) carries no information, so that

$$ w^{(xy)}_{ij}=b_{ij}. $$
For the within-class weights, we count the number of common “1’s,” yielding

$$ w^{(x)}_{ij}=\sum_{k} b_{ik}b_{jk},\qquad w^{(y)}_{ij}=\sum_{k} b_{ki}b_{kj}. $$
As a result of the typical sparsity of such datasets, the matrix W will also be sparse. We remark that

$$ \delta_{ij}=1 \quad\Longrightarrow\quad w_{ij}=0, $$

meaning that under this condition of maximal dissimilarity between i and j, that particular dissimilarity does not directly affect the computed stress functional or the resulting BiFold embedding. Without this weighting scheme, a sparse association dataset would be completely dominated (visually) by the large number of objects forced to lie at the outside of the unit ball, because most objects are ‘very far’ from most other objects.
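Under this scheme (cross-class weight equal to the entry itself, within-class weight counting common 1s, as described above), the weights reduce to simple matrix products; a small sketch with an illustrative matrix:

```python
import numpy as np

# Membership-method weights for a toy binary association matrix.
B = np.array([[1, 1, 0],
              [0, 1, 0],
              [0, 0, 1]])

W_xy = B          # unit information only where b_ij = 1
W_x = B @ B.T     # number of columns in which both rows have a 1
W_y = B.T @ B     # number of rows in which both columns have a 1

# Rows 0 and 2 share no 1s (maximal Jaccard dissimilarity), so their
# within-class weight vanishes and the pair does not contribute to the stress.
print(W_x[0, 2])  # 0
```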
Stress minimization
After formulating the stress function (8) and choosing an embedding dimension d, a BiFold representation of the data is obtained by minimizing the stress function over the coordinates of the \(m+n\) points in d-dimensional Euclidean space. This optimization problem falls within the class of MDS problems, for which several alternative tools are available to find a local minimum [9, 10]. For the BiFold plots reported in this paper, the stress minimization is done via the (iterative) SMACOF algorithm [10]. For reproducibility, the starting configuration of the algorithm is the classical MDS solution of the joint dissimilarity matrix (without weighting). After applying the SMACOF algorithm to obtain a set of coordinates, we further perform a PCA (principal component analysis) to standardize the alignment, noting that the stress function is invariant under such transformations. As a consequence, in all BiFold plots the horizontal axis is the principal direction.
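A bare-bones version of the weighted SMACOF iteration (the Guttman transform) together with the PCA alignment can be sketched as follows. This is an illustration under our own naming, not the implementation used for the paper's figures; for simplicity the demo starts from a random configuration rather than the classical MDS solution.

```python
import numpy as np

def stress(Z, Delta, W):
    D = np.linalg.norm(Z[:, None] - Z[None, :], axis=-1)
    return float((W * (D - Delta) ** 2).sum() / 2)  # each pair counted twice

def smacof(Delta, W, Z0, n_iter=200):
    """Weighted SMACOF: iterate the Guttman transform, which never
    increases the weighted stress."""
    V = np.diag(W.sum(axis=1)) - W
    Vp = np.linalg.pinv(V)             # V is singular (translation invariance)
    Z = Z0.copy()
    for _ in range(n_iter):
        D = np.linalg.norm(Z[:, None] - Z[None, :], axis=-1)
        Bmat = np.zeros_like(D)
        mask = D > 0
        Bmat[mask] = -(W * Delta)[mask] / D[mask]
        np.fill_diagonal(Bmat, 0.0)
        np.fill_diagonal(Bmat, -Bmat.sum(axis=1))
        Z = Vp @ Bmat @ Z              # majorization (Guttman) step
    return Z

def pca_align(Z):
    """Center and rotate so the horizontal axis is the principal direction;
    the stress depends only on pairwise distances, so it is unchanged."""
    Zc = Z - Z.mean(axis=0)
    _, _, Vt = np.linalg.svd(Zc, full_matrices=False)
    return Zc @ Vt.T

# Demo: recover a unit square from its exact distance matrix.
rng = np.random.default_rng(0)
pts = np.array([[0, 0], [1, 0], [0, 1], [1, 1]], dtype=float)
Delta = np.linalg.norm(pts[:, None] - pts[None, :], axis=-1)
W = np.ones((4, 4)) - np.eye(4)
Z0 = rng.normal(size=(4, 2))
Z = pca_align(smacof(Delta, W, Z0))
print(stress(Z, Delta, W) <= stress(Z0, Delta, W))  # True
```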
Treatment of partial and missing data
For real datasets, the choice of methods for dealing with missing data can become a critical component of the data processing. In general, the BiFold approach admits a principled treatment that does not depend on imputation and remains robust across a wide variety of datasets. The key observation is that the data matrix B contains mn pieces of information, while the solution (a configuration) has only \(d \times (m+n)\) free variables. In typical scenarios, with visualization dimension \(d=2\) or \(d=3\) and \(n,m\gg d\), we may view the data matrix as providing a significant amount of “redundant” information. Just as a regression line should not suffer much if a small fraction of the data is removed, a similar robustness should persist in the BiFold visualization. We therefore follow two general guidelines when dealing with missing data:

1.
Use only available data when computing dissimilarities \(\delta_{ij}\).

2.
Weights \(w_{ij}\) should be selected to account for the actual (non-missing) data used to compute the associated dissimilarity.
Consider, for example, the congressional voting data described above. For these data, it is typical that not all senators vote on every bill: some may “abstain” during the roll call, while others may simply not be present. In this case, a typical dataset structure might assign

$$ b_{ij}=\mathrm{NA} $$

if senator i did not vote on bill j. To perform BiFold under this condition of missing data, we proceed as follows:

If \(b_{ij} = \mathrm{NA}\), then \(w_{ij}^{(xy)}=0\) and \(\delta_{ij}^{(xy)}=c\), where c is an arbitrary, finite constant.

For within-group differences for group 1, define the index sets \(\kappa_{ij}\) as
$$\kappa_{ij} = \{k\mid b_{ik} \neq\mathrm{NA}, b_{jk} \neq\mathrm{NA} \}, $$

compute

$$ s_{ij}^{(x)}= \sum_{k \in\kappa_{ij}} \vert b_{ik}-b_{jk} \vert , $$(20)

and determine the number of information elements as

$$ n_{ij}^{(x)}= \vert \kappa_{ij} \vert . $$(21)
Apply the formulas of Table 1 to compute \(\delta_{ij}^{(x)}\) and \(w_{ij}^{(x)}\), replacing n by \(n_{ij}^{(x)}\).

Use similarly modified formulas to compute \(\delta_{ij}^{(y)}\) and \(w_{ij}^{(y)}\).
After forming the matrices Δ and W, we may simply minimize the weighted stress to determine a coordinate representation.
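The per-pair bookkeeping in the steps above can be vectorized; the sketch below (our naming) encodes NA entries as np.nan:

```python
import numpy as np

def pairwise_counts_with_missing(B):
    """For each pair of rows of B (np.nan marks a missing vote), count the
    disagreements s_ij and usable comparisons n_ij over the index set
    kappa_ij where both rows are observed (cf. Eqs. (20)-(21))."""
    obs = ~np.isnan(B)
    both = obs[:, None, :] & obs[None, :, :]   # k in kappa_ij
    filled = np.nan_to_num(B)                  # NaNs replaced; masked out below
    diff = np.abs(filled[:, None, :] - filled[None, :, :])
    s = np.where(both, diff, 0.0).sum(axis=2)
    n = both.sum(axis=2)
    return s, n

B = np.array([[1.0, 0.0, np.nan, 1.0],
              [1.0, 1.0, 0.0, np.nan]])
s, n = pairwise_counts_with_missing(B)
print(int(s[0, 1]), int(n[0, 1]))  # 1 2: one disagreement over two shared votes
```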
References
 1.
Fekete JD, van Wijk JJ, Stasko JT, North C (2008) The value of information visualization. In: Kerren A et al. (eds) Information visualization. LNCS, vol 4950, pp 1-18
 2.
Spence I, Garrison RF (1993) A remarkable scatterplot. Am Stat 47(1):12-19
 3.
Fayyad U, Grinstein G, Wierse A (eds.) (2001) Information visualization in data mining and knowledge discovery. Kaufmann, San Francisco
 4.
Gastner MT, Newman MEJ (2004) Diffusion-based method for producing density equalizing maps. Proc Natl Acad Sci USA 101:7499-7504
 5.
Sims GE, Choi IG, Kim SH (2005) Protein conformational space in higher order ϕ-ψ maps. Proc Natl Acad Sci USA 102:618-621
 6.
Chen M et al. (2009) Data, information and knowledge in visualization. IEEE Comput Graph Appl 29(1):12-19
 7.
Nishikawa T, Motter AE (2011) Discovering network structure beyond communities. Sci Rep 1:151
 8.
Shekhar K, Brodin P, Davis MM, Chakraborty AK (2014) Automatic classification of cellular expression by nonlinear stochastic embedding (ACCENSE). Proc Natl Acad Sci USA 111:202-207
 9.
Cox TF, Cox MAA (2000) Multidimensional scaling, 2nd edn. Chapman & Hall/CRC, London
 10.
Borg I, Groenen PJF (2005) Modern multidimensional scaling: theory and applications, 2nd edn. Springer, New York
 11.
Vogelstein JT et al. (2014) Discovery of brainwide neural-behavioral maps via multiscale unsupervised structure learning. Science 344:386-392
 12.
Engeszer RE, Wang G, Ryan MJ, Parichy DM (2008) Sex-specific perceptual spaces for a vertebrate basal social aggregative behavior. Proc Natl Acad Sci USA 105:929-933
 13.
Cermeño P, Falkowski PG (2009) Controls on diatom biogeography in the ocean. Science 325:1539-1541
 14.
Bronstein AM, Bronstein MM, Kimmel R (2006) Generalized multidimensional scaling: a framework for isometry-invariant partial surface matching. Proc Natl Acad Sci USA 103:1168-1172
 15.
Shieh AD, Hashimoto TB, Airoldi EM (2011) Tree preserving embedding. Proc Natl Acad Sci USA 108:16916-16921
 16.
Aflalo Y, Kimmel R (2013) Spectral multidimensional scaling. Proc Natl Acad Sci USA 110:18052-18057
 17.
Bauer-Mehren A et al. (2011) Gene-disease network analysis reveals functional modules in mendelian, complex and environmental diseases. PLoS ONE 6(6):e20284
 18.
Craciun G, Feinberg M (2006) Multiple equilibria in complex chemical reaction networks: II. The species-reaction graph. SIAM J Appl Math 66(4):1321-1338
 19.
Gabriel KR (1971) The biplot graphic display of matrices with application to principal component analysis. Biometrika 58(3):453-467
 20.
Bennett JF, Hays WL (1960) Multidimensional unfolding: determining the dimensionality of ranked preference data. Psychometrika 25(1):27-43
 21.
Hotelling H (1933) Analysis of a complex of statistical variables into principal components. J Educ Psychol 24:417-441, 498-520
 22.
Richardson M, Kuder GF (1933) Making a rating scale that measures. Pers J 12:36-40
 23.
Hirschfeld HO (1935) A connection between correlation and contingency. Proc Camb Philos Soc 31:520-524
 24.
Gower JC, Harding SA (1988) Nonlinear biplots. Biometrika 75(3):445-455
 25.
Gower JC (1992) Generalized biplots. Biometrika 79(3):475-493
 26.
Davis A, Gardner BB, Gardner MR (1941) Deep South: a social anthropological study of caste and class. University of Chicago Press, Chicago
 27.
Freeman LC (2003) Finding social groups: a meta-analysis of the southern women data. In: Breiger R, Carley K, Pattison P (eds) Dynamic social network modeling and analysis. National Academies Press, Washington
 28.
Field S, Frank KA, Schill K (2006) Identifying positions from affiliation networks: preserving the duality of people and events. Soc Netw 28:97-123
 29.
Beineke LW, Wilson RJ (2004) Topics in algebraic graph theory. Cambridge University Press, Cambridge
 30.
Porter MA, Mucha PJ, Newman MEJ, Warmbrand CW (2005) A network analysis of committees in the U.S. House of Representatives. Proc Natl Acad Sci USA 102(20):7057-7062
 31.
Ahn YY, Ahnert SE, Bagrow JP, Barabási AL (2011) Flavor network and the principles of food pairing. Sci Rep 1:196
 32.
Levandowsky M, Winter D (1971) Distance between sets. Nature 234:34-35
Acknowledgements
The authors wish to thank Daniel B. Larremore for useful feedback on the manuscript. This work was partially supported by a Clarkson University Provost Award, Army Research Office grants W911NF-12-1-0276 and W911NF-16-1-0081, and the Simons Foundation grant 318812. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect those of the funding agencies.
Author information
Affiliations
Corresponding author
Additional information
Competing interests
The authors declare no competing financial interests.
Authors’ contributions
JDS and JS designed the research. All authors contributed to methodological and algorithm developments, data collection, visualization and analysis. JDS and JS wrote the manuscript. All authors have read and approved the final manuscript.
Publisher’s Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.
About this article
Cite this article
Jiang, Y., Skufca, J.D. & Sun, J. BiFold visualization of bipartite datasets. EPJ Data Sci. 6, 2 (2017). https://doi.org/10.1140/epjds/s13688-017-0098-4
Received:
Accepted:
Published:
Keywords
 bipartite datasets
 BiFold visualization
 low dimensional embedding