BiFold visualization of bipartite datasets
 Yazhen Jiang^{1},
 Joseph D Skufca^{1} and
 Jie Sun^{1, 2, 3, 4}
Received: 14 December 2016
Accepted: 8 April 2017
Published: 20 April 2017
Abstract
The emerging domain of data-enabled science necessitates development of algorithms and tools for knowledge discovery. Human interaction with data through well-constructed graphical representation can take special advantage of our visual ability to identify patterns. We develop a data visualization framework, called BiFold, for exploratory analysis of bipartite datasets that describe binary relationships between groups of objects. Typical examples include voting records, organizational memberships, pairwise associations, and other binary datasets. BiFold provides a low-dimensional embedding of data that represents similarity by visual nearness, analogous to Multidimensional Scaling (MDS). The unique and new feature of BiFold is its ability to simultaneously capture both within-group and between-group relationships among objects, enhancing knowledge discovery. We benchmark BiFold using the Southern Women Dataset, where social groups are now visually evident. We construct BiFold plots for two US voting datasets: For the presidential election outcomes since 1976, BiFold illustrates the evolving geopolitical structures that underlie these election results. For Senate congressional voting, BiFold identifies a partisan coordinate, separating senators into two parties while simultaneously visualizing a bipartisan-coalition coordinate which captures the ultimate fate of the bills (pass/fail). Finally, we consider a global cuisine dataset of the association between recipes and food ingredients. BiFold allows us to visually compare and contrast cuisines while also allowing identification of signature ingredients of individual cuisines.
1 Introduction
Despite the dominance of automated algorithms for data mining and knowledge discovery, it has been increasingly recognized that human perception can play an essential and often favorable role in exploring patterns and developing insights [1]. For instance, the Hertzsprung-Russell diagram of stellar luminosity versus temperature provides a classic example of a data analysis problem that is easily tackled by a person but remains a challenge for automated methods [2]. Typically, the utilization of human cognition in exploratory data analysis relies on proper representation and visualization of the data in a low-dimensional embedding space [3–8].
The standard concept of a “dataset” is a tabular array, where each row corresponds to an object in the dataset and every column corresponds to a variable (or factor) measured on each object. A natural question about such a dataset is “how are objects like (or unlike) other objects and are there relevant relationships among collections of objects?” Multidimensional scaling (MDS) refers to a family of techniques that address these questions by visualizing the objects as a set of points embedded in a low-dimensional (typically 2D or 3D) geometric space, with the goal of representing the dissimilarities between objects by the distances between the corresponding points in the embedded space [9, 10]. The generality of MDS approaches makes them suitable for a broad range of practical problems, as demonstrated in many classical examples [9, 10] as well as in several recent scientific breakthroughs: mapping of brain-wide neural behavior [11], discovery of sex-specific and species-specific perceptual spaces among different biological species [12], and analysis of biogeographic differentiation between geographical regions [13]. On a more fundamental level, several recent developments focused on generalizing different measures of “distance” in the MDS formulation to allow for embedding from and/or onto general nonlinear manifolds [14–16].
Frequently, we encounter a dataset that encodes a binary relation between two sets (or “classes”) of objects, with elements of one set corresponding to the rows, elements of the other corresponding to the columns, and the data entries (“1” or “0”) indicating whether or not there is a relationship between the associated row and column. Common examples include politicians and the bills they supported, moviegoers and the movies they attend, or students and the courses in which they enroll. Such examples can be regarded as decision-makers and choices, while we note that similar datasets arise in many contexts that are often described by bipartite graphs, such as the association between genes and diseases [17] and the relation between chemical reactants and reactions [18].
 (1) “Similar” objects (whether decision-makers or choices) ought to be “nearby” in the visualization;
 (2) Decision-makers should be positioned “close” to their preferred choices.
In addition to the classical ordination methods described above, we note that BiFold has similar goals to nonlinear and generalized biplot methods described in [24, 25]. In particular, the generalized biplot addresses categorical variables (of which dichotomous variables are a subset) with consideration of both requirements (1) and (2) in developing the ordination. Their approach is to ordinate each entry in the dataset, such that each level of a categorical variable is separately visualized. For binary variables, that approach would require representation of one of the classes of objects by two sets of ordination coordinates, one to represent the “1” and the other to represent the “0” in the data. Our original contribution here is twofold: (i) Our treatment is completely symmetric with respect to the classes, with neither being treated as the “variables.” The resulting ordination is identical, even if we transpose our dataset. Consequently, each object, regardless of class, is assigned only one coordinate. (ii) We consider an ordination scheme that accounts for the difference in information quality of cross-group and within-group distances, as well as the difference in information content across groups of different size, as specialized to the binary data framework. The ordination approach is more naturally able to account for the difference in information content arising from the non-square data matrices, missing data, and differences in interpretation of matches for categorical variables [25].
We begin with an introductory example of BiFold below, leaving the details of the approach and more examples to the later sections.
An introductory example: BiFold plot of the Southern Women dataset
One way to visualize the Southern Women dataset is to use MDS to place the 18 women at suitable 2D locations, where distance between embedded coordinates reflects the degree to which the women attended similar events, as in Figure 1, top left. In this case, we are treating the women as the entities to be plotted, while the events are regarded as factors that characterize each individual. Alternatively, we can treat the events as entities and the women as factors, allowing us to obtain an MDS configuration of events, as shown in Figure 1, bottom left. The goal is to “overlay” the two embeddings to not only capture the within-class relationships (woman to woman, event to event) but also the cross-class relationship of woman to event. BiFold produces such a joint visualization (Figure 1, right panel) in which social group structure [27, 28] is easily identified through proximity: (a) nearby women attended similar events, (b) nearby events were attended by similar groups of women, (c) nearby woman-event pairs indicate that the woman likely attended that event.
2 Results
2.1 The BiFold approach
BiFold provides a procedural framework to produce a low-dimensional embedding from a binary data matrix. First, we create a joint dissimilarity matrix that appropriately fuses information from both within-class and cross-class relations. Second, we construct a weighting matrix to reflect the relative uncertainty associated with the dissimilarities. Finally, we minimize a weighted stress function to obtain a BiFold embedding: coordinates in \(\mathbb{R}^{d}\) for each row and each column of the data matrix. In this section, we describe and explain this framework, leaving the detailed specification of the algorithms and parameters to Materials and Methods.
Focusing on objects represented by the rows of the \(m\times n\) data matrix B, we quantify, using some appropriate measure, the dissimilarity between row i and row j, denoted \(\delta^{(x)}_{ij}\), producing matrix \(\Delta^{(x)}=[\delta^{(x)}_{ij}]_{m\times m}\). Likewise, we generate dissimilarity matrix \(\Delta^{(y)}=[\delta^{(y)}_{ij}]_{n\times n}\) by comparing the columns of B. Finally, the dissimilarity between row object i and column object j is defined by a monotonic transformation of the entries in matrix B, which yields a cross-class dissimilarity matrix \(\Delta^{(xy)}=[\delta^{(xy)}_{ij}]_{m\times n}\). A binary relation dataset typically falls into one of two categories: (1) choice data, for which each data entry (whether a “0” or a “1”) reflects an active decision (either a positive or negative relation); and (2) association data, for which the “0”s indicate only an absence of a relation, and are usually much less informative than the “1”s. For data from each of these categories, we have developed some sensible dissimilarity measures (see Materials and Methods).
In the following sections, we illustrate BiFold via three additional binary datasets: US presidential election results, US Senate voting records from the 112th Congress, and a food-recipe relational dataset from five major global cuisines.
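As a minimal sketch of the three dissimilarity blocks for choice data, the following Python snippet uses a normalized Hamming distance within each class and \(1-b_{ij}\) as the cross-class transformation. The function name `dissimilarity_blocks` and the toy matrix are illustrative assumptions, and the normalization is only one of the possible measures; it is not meant as the definitive specification of the method.

```python
import numpy as np

def dissimilarity_blocks(B):
    """Return (Dx, Dy, Dxy) for a binary m-by-n matrix B (illustrative sketch)."""
    B = np.asarray(B, dtype=float)
    m, n = B.shape
    # Within-class (rows): fraction of columns on which rows i and j differ.
    Dx = np.abs(B[:, None, :] - B[None, :, :]).sum(axis=2) / n
    # Within-class (columns): fraction of rows on which columns i and j differ.
    Bt = B.T
    Dy = np.abs(Bt[:, None, :] - Bt[None, :, :]).sum(axis=2) / m
    # Cross-class: a monotonically decreasing transformation of the entries of B.
    Dxy = 1.0 - B
    return Dx, Dy, Dxy

# Toy example (hypothetical data): 2 row objects, 3 column objects.
B_toy = np.array([[1, 1, 0],
                  [1, 0, 0]])
Dx, Dy, Dxy = dissimilarity_blocks(B_toy)
```

The three arrays have shapes \(m\times m\), \(n\times n\), and \(m\times n\), matching \(\Delta^{(x)}\), \(\Delta^{(y)}\), and \(\Delta^{(xy)}\).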
2.2 Examples: voting datasets
A common type of binary relation comes from voting data: for each item to be voted on, a voter either votes “for” or “against,” with (perhaps) the ability to abstain. We consider two such examples. The first is US presidential election results for the past ten elections, for which the “voters” are the individual US states and the “items” are the winning presidents in each election. The second is Senate roll call votes, with US senators as voters and Senate bills as items.
2.2.1 Presidential election by states

Not surprisingly, the primary coordinate axis (left/right) strongly encodes the party affinity (blue state/red state).

Over time, the election positions have (generally) moved toward the left/right extremities, capturing the increasing partisanship of the elections.

Most of the purple-colored “swing states” are near the center of the visualization, which implies that they align with most of the election winners, with slight variation based on the particular set of presidents that they supported. As an interesting exception, West Virginia lies far above the main cloud, which we attribute to its trend of having often supported the non-winning candidates.

Noting several “paired” election coordinates, we observe that such pairs associate with two-term presidents, likely because their constituent support did not change much between elections.

Positional outlier Carter ’76 reflects support from a non-typical coalition of states, likely attributable to Carter being the first president elected from the Deep South since the Civil War. Reagan ’84 is the most centrally positioned, reflecting broad national support. Figure 3b connects each of these elections to the supporting states.

Comparing Bush ’00 to Obama ’12 (Figure 3c), we see both elections driven primarily by partisan support.
We remark that any of the visually indicated hypotheses should be viewed as exploratory and be confirmed by additional quantitative analysis (as would also be appropriate for most other data visualizations). However, we note that the BiFold visualization motivates a rich palette of such hypotheses, many of which directly exploit the between-class information.
2.2.2 Senate congressional roll call votes
As with the previous examples, the goal (achieved by BiFold) is to obtain an embedding that captures both the within-class relationships (senator-to-senator and bill-to-bill) and the between-class relationships (senator-to-bill). From the data, we quantify the dissimilarity between two senators (bills) as the estimated probability that they vote (were voted) differently. Dissimilarity between senator i and bill j is the estimated likelihood that senator i objects to bill j.

Bills in the “left” (liberal) cluster received strong support from the Democratic Senators;

Bills in the “right” (conservative) cluster received strong support from the Republicans;

Bills in the “top” (bipartisan supportive) cluster were strongly supported by both parties, and visually appear “pulled” between the two parties;

Bills in the “bottom” (bipartisan opposition) cluster are pushed far away from both parties, indicating bills that were supported only by a small number of senators.
2.3 Association datasets: a recipe-ingredient dataset
We envision the BiFold approach to be broadly useful, certainly beyond the visualization of voting data. Another important category of binary data captures the association between “members” and “affiliations.” A key feature of such association datasets is that the non-association relations carry little information compared to the association relations; in sharp contrast, in a voting dataset the “yes” and “no” votes both convey valuable information about the relation between the decision-makers and the choices. Association datasets are often collected to form sparse, bipartite networks, where sparsity arises from the reality that there are (typically) many more non-associations than associations in these data.
We focus here on a specific example relating recipes with their included ingredients. A recipe defines a procedure for cooking, along with a list of food items (ingredients) used in the recipe. Gathering this data over a broad spectrum of recipes allows us to more completely understand how ingredients are used in combination, which may vary from one cuisine to another. As our data source, we consider the recipe-ingredient association dataset generated in [31], which assembled over \(50{,}000\) recipes taken from two American and one Korean online repositories. The data is (again) represented by a matrix B, where \(b_{ij}=1\) indicates that recipe i contains ingredient j, and \(b_{ij}=0\) otherwise. To proceed with the BiFold approach, we must define the dissimilarities between the entities. Recall that in the voting examples, both a “1” (a yes vote) and a “0” (a no vote) contain actual information regarding a voter’s opinion. In contrast, in the recipe-ingredient dataset, a given recipe typically includes only a small fraction of all available ingredients and carries essentially no information on those ingredients that are not used in the recipe.

From the collection of cuisine plots, we can visually identify similar cuisines (North America and Western Europe; Latin America and Southern Europe).

The East Asian cuisine appears visually distinct from the Western-heritage cuisines.

The protein group, primarily meat, appears centrally in the figure of ingredients, with all the cuisines showing significant density in that region of the plot. (In other words, the meat group does not identify any particular cuisine.)

The density plots allow us to visually identify certain ingredients as the “signature” of a cuisine: basil and oregano (Southern European); sesame oil and soy sauce (East Asian); cocoa and vanilla (North American and Western European).
3 Discussion

As an (almost trivial) extension, we note that interpretation of the data as representing a bipartite network implies that BiFold could act as a graph layout algorithm for bipartite network data.

BiFold can be viewed as a generalization of several other classical techniques, which can be recovered by specific choices of parameters.

The entries in the data matrix, B, need not be binary, but could represent a continuous or ordinal variable, such as ratings, rankings, or preferences.

Some datasets might naturally contain more than two groups, such as actors, movies, and viewers. Such datasets can be treated as multipartite, rather than bipartite, data. We envision a natural extension of BiFold, where the joint dissimilarity and weighting matrices must be appropriately constructed based on the within-group and between-group relationships.

We focused on Hamming distance and Jaccard distance to compute within-class dissimilarity, with each providing a natural interpretation for the datasets considered. We note that the BiFold framework is not dependent upon any particular choice of dissimilarity measure, and a reasonable practitioner may choose other methods for defining dissimilarities (and weights) that might be appropriate for their data. The BiFold approach, based on the joint dissimilarity matrix, will still provide a means to develop the joint visualization.

For some of the methods, we interpret the raw (binary) data from a Bayesian perspective, but with an uninformative prior. That approach could easily be adapted to accommodate other a priori understanding of the data.

For dynamic datasets (parameterized by time, for example), each data “snapshot” would yield a BiFold layout. A stress functional that incorporates a regularity condition in time could yield an optimal sequence of layouts over many snapshots.

The computational complexity of stress minimization using the SMACOF algorithm is roughly \(O(n^{4})\) for reaching a local, approximate solution. As such, the current implementation of stress minimization will likely struggle with very large datasets. Because the technique is meant to support visual knowledge discovery (human interaction), speed of visualization is important. Data aggregation might be a way to handle large datasets, but the aggregation procedures will almost certainly be domain-specific.

Comparing one BiFold layout to another (exploring parameter space) can be challenging in that the solution layout is rotation and reflection invariant. Normalizing the orientation of the generated solution is important. As an additional complication, the configuration solution to the optimization problem is a local minimizer, so the solution may “jump” to a different minimizer under small changes in the data.

The non-Euclidean nature of the dissimilarity measures results in a dissimilarity matrix that is not necessarily well approximated by a low-dimensional embedding. In such cases, visually interesting effects may sometimes be an artifact of the data, particularly with sparse datasets.
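Regarding the orientation issue raised above (a layout is determined only up to rotation and reflection), one standard way to compare two layouts is orthogonal Procrustes alignment. The sketch below is a swapped-in, generic technique rather than a procedure prescribed in this paper; the function name `align_layout` is hypothetical, and both layouts are assumed to list the same objects in the same order.

```python
import numpy as np

def align_layout(Z, Z_ref):
    """Rotate/reflect layout Z to best match Z_ref in the least-squares sense.

    Both inputs are (N, d) arrays of coordinates for the same N objects.
    Returns the aligned, centered copy of Z.
    """
    Zc = Z - Z.mean(axis=0)            # remove translation
    Rc = Z_ref - Z_ref.mean(axis=0)
    # Orthogonal Procrustes: the minimizer of ||Zc Q - Rc||_F over orthogonal Q
    # is Q = U V^T, where U S V^T is the SVD of Zc^T Rc.
    U, _, Vt = np.linalg.svd(Zc.T @ Rc)
    Q = U @ Vt
    return Zc @ Q
```

After alignment, remaining differences between the two layouts reflect changes in the configuration itself rather than an arbitrary choice of orientation. This does not address the separate issue of the optimizer jumping between local minimizers.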
4 Materials and methods
4.1 Datasets

The Southern Women dataset is a popular dataset used in social network analysis. The dataset first appeared in the book “Deep South: A Social Anthropological Study of Caste and Class” [26] (p.148), and can also be found in several online network data repositories. Collected in the 1930s in the small southern town of Natchez (Mississippi, United States), the data records the participation of 18 women in a series of 14 informal social events over a nine-month period. Only the events in which at least two women participated are included in the dataset. Figure 1 shows the data table without including the names of the women or dates of the events. We represent the dataset by a woman-by-event matrix \(B=[b_{ij}]_{18\times14}\), where \(b_{ij}=1\) indicates that woman i attended event j, and \(b_{ij}=0\) otherwise.

The U.S. presidential election dataset considered in this paper includes the state-level voting results of the United States presidential elections for the period from 1976 to 2012. The dataset, available at the U.S. government archive (http://www.archives.gov/federalregister/electoralcollege/), includes the state voting outcome from the 51 voting entities (50 states plus the District of Columbia) for the past 10 presidential elections. We number the states from 1 to 51 alphabetically by name, and the elections from 1 to 10 in chronological order. We then represent the dataset by a state-by-president matrix \(B=[b_{ij}]_{51\times10}\), where \(b_{ij}=1\) indicates that state i voted for the elected president in the jth election, and \(b_{ij}=0\) otherwise. For example, in all of the past 10 elections Ohio voted for the presidential candidate who eventually won the election, regardless of party affiliation. Florida and Nevada both “missed” one election: in the 1992 election, Florida voted for G.H.W. Bush (the elected president was B. Clinton); in the 1976 election, Nevada voted for G. Ford (the elected president was J. Carter). All three are well-known examples of “swing” states characterized by flexible voting patterns and importance in determining the election outcome.

The U.S. Senate Congressional Voting dataset used in this paper is obtained from the congressional voting records of the 112th United States Congress, first session of the Senate. There are at most 100 senators at any time, with an occasional need to replace a senator mid-session, which happened once during the voting portion of this session. As such, the roll calls indicate 101 senators voting: 51 Democrats (D), 48 Republicans (R), and 2 Independents (I). There were 235 recorded roll call votes, of which 167 passed and 68 were rejected. We number the senators from 1 to 101 by last name, and the bills from 1 to 235 in chronological order. We formulate data matrix \(B=[b_{ij}]_{101\times235}\) by defining \(b_{ij}\) using the vote of senator i on bill j: for a “yes” vote \(b_{ij}=1\), for a “no” vote \(b_{ij}=0\). The abstained votes are treated as “missing” data in the matrix (see the “Treatment of partial and missing data” section below for details).

The recipe-ingredient dataset is retrieved from the Supplementary Information of Ref. [31], a paper that studied the similarity and difference in food pairings across different geographical regions. The dataset contains more than 50,000 recipes extracted from three cuisine websites: allrecipes.com, epicurious.com, and menupan.com. The recipes were divided into 11 geographical regions, covering ∼50 popular cuisines around the world. The recipes and ingredients are indexed. Focusing on the 5 geographical regions (cuisines) that contain over 1,000 recipes, we construct data matrix \(B=[b_{ij}]_{5{,}000\times335}\), with \(b_{ij}=1\) if recipe i contains ingredient j. This subsample of the original dataset contains 1,000 randomly selected recipes from each of the 5 selected cuisines: East Asian, Latin American, North American, Southern European, and Western European. The subsampled data contains a total of 335 different ingredients.
4.2 The BiFold framework: dissimilarity measures, weights, and stress minimization
The BiFold framework describes a general approach to produce a low-dimensional embedding from a data matrix, where that matrix encodes the relationship between two classes of objects. First, one needs to create a joint dissimilarity matrix using some appropriate within-class and cross-class dissimilarity measures, as well as scaling to make the within-class and cross-class dissimilarities commensurate. Second, one needs to construct a weighting matrix to reflect the relative focus to be given to the computed dissimilarities. Finally, the BiFold embedding is obtained by minimizing a weighted stress function, similar to the determination of an MDS solution.

The joint dissimilarity matrix is given by Eq. (2), where \(\Delta^{(x)}=[\delta^{(x)}_{ij}]_{m\times m}\) and \(\Delta^{(y)}=[\delta^{(y)}_{ij}]_{n\times n}\) are the within-class dissimilarity matrices and \(\Delta^{(xy)}=[\delta^{(xy)}_{ij}]_{m\times n}\) is the cross-class dissimilarity matrix (\(\Delta^{(yx)}=\Delta^{(xy)\top}\)). The parameters \(\alpha_{x}\), \(\alpha_{y}\), and \(\alpha_{xy}\) provide flexible scaling of the within-class and cross-class distances in the embedded space, while β can be used to visually translate the type-1 objects away from the type-2 objects in the embedding.

The weighting matrix is defined in Eq. (3), where \(W^{(x)}=[w^{(x)}_{ij}]_{m\times m}\) and \(W^{(y)}=[w^{(y)}_{ij}]_{n\times n}\) are the within-class weighting matrices and \(W^{(xy)}=[w^{(xy)}_{ij}]_{m\times n}\) is the cross-class weighting matrix (\(W^{(yx)}=W^{(xy)\top}\)).

A typical choice for the above stress function S is to let \(\Phi(d,\delta)=(d-\delta)^{2}\). For a given dissimilarity and weight matrix, this fully specified stress function may then be minimized to obtain coordinates \(\{z_{1}, \ldots,z_{m+n}\}\).
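To make the three steps concrete, the following Python sketch assembles a joint dissimilarity matrix and a joint weight matrix from their blocks and evaluates a weighted stress with \(\Phi(d,\delta)=(d-\delta)^{2}\). Because the displayed forms of Eqs. (2), (3), and (8) are not reproduced in this text, the block layout (in particular, applying β as an additive shift to the scaled cross-class block) and the sum over distinct pairs are assumptions chosen to be consistent with the description above. All function names are hypothetical.

```python
import numpy as np

def joint_dissimilarity(Dx, Dy, Dxy, ax=1.0, ay=1.0, axy=1.0, beta=0.0):
    """Assemble an (m+n)-by-(m+n) joint dissimilarity matrix (assumed block form)."""
    cross = axy * Dxy + beta              # assumption: scale, then shift by beta
    return np.block([[ax * Dx, cross],
                     [cross.T, ay * Dy]])

def joint_weights(Wx, Wy, Wxy):
    """Assemble the joint weight matrix with W^(yx) equal to the transpose of W^(xy)."""
    return np.block([[Wx, Wxy],
                     [Wxy.T, Wy]])

def weighted_stress(Z, Delta, W):
    """Sum over pairs i<j of w_ij * (d_ij - delta_ij)^2, with d_ij Euclidean."""
    diff = Z[:, None, :] - Z[None, :, :]
    D = np.sqrt((diff ** 2).sum(axis=2))
    iu = np.triu_indices(len(Z), k=1)     # each unordered pair counted once
    return float((W[iu] * (D[iu] - Delta[iu]) ** 2).sum())
```

With \(m\) row objects and \(n\) column objects, `Z` is an \((m+n)\times d\) array whose first \(m\) rows are the coordinates of the row objects.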
4.3 Dissimilarity measures and weights used in the examples

Voting data: the BiFold Bernoulli Method. Where the data matrix B represents ‘voting’ data, such that \(b_{ij}=1\) indicates that object \(X^{i}\) voted positively for object \(Y^{j}\), one may consider that the preference selection (‘1’ or ‘0’) is a forced binary decision on a continuous variable that represents preference. One model for this situation would be to view \(b_{ij}\) as the observation of the forced decision outcome, treated as a Bernoulli trial, where the Bernoulli parameter \(p:=p_{ij}=:p_{ij}^{(xy)}\) is not known. (For real data sets of voting data, we treat ‘yes’ as ‘1’ and ‘no’ as ‘0.’ As a third outcome, sometimes a voter will ‘abstain’ on a particular vote, which we view as “missing data,” with the technique described below.) Applying this model within a group (for example, within group 1), we could assert a Bernoulli process with \(p:=p_{ij}^{(xx)}\) the (unknown) probability that objects \(X^{i}\) and \(X^{j}\) would vote the same way on an arbitrarily selected vote. Comparing rows i and j in the data matrix B would provide n observations of outcomes from that Bernoulli process. Comparison of columns, treated in the same way, would represent m observations of the Bernoulli process associated to objects \(Y^{i}\) and \(Y^{j}\). Ideally, we would like to construct a BiFold configuration using dissimilarities computed from the actual values for preference, the unknown values for \(p_{ij}^{kl}\). Instead, we must assign dissimilarities from estimated probabilities, \(\delta_{ij}^{(*)}:=1-\hat{p}_{ij}^{(*)}\). Following standard development for estimating proportions, we count the number of within-group differences between pairs of entities in each class:
$$ s_{ij}^{(x)}= \sum_{k} \vert b_{ik}-b_{jk} \vert , \tag{10} $$
$$ s_{ij}^{(y)}= \sum_{k} \vert b_{ki}-b_{kj} \vert . \tag{11} $$
For the cross-class data, we pool all observations to define an average rate of positive voting:
$$ \bar{p}=\frac{\sum_{i,j} b_{ij}}{nm}. \tag{12} $$
Because we have significantly more observations for the ‘within class’ data, we expect those estimates to be more accurate. Consequently, we choose weights \(w_{ij}\) proportional to the information content. Borrowing from approaches used in regression of heteroscedastic data, we weight the error term (stress) inversely as the (estimated) variance in the observation, as applied in equations (8) and (3). We focus on three primary alternatives for the estimation of the parameters and the variance: (1) Bayesian, with uniform prior; (2) Bayesian, with Jeffreys’ prior; and (3) Non-Bayesian, maximum likelihood estimate. Table 1 shows the resultant formulas associated with these methods. We note that the specific Bayesian approaches described assume no prior belief regarding the parameters \(p_{ij}\). However, the concept is easily generalized to those cases where prior information is available, where one would simply encode that knowledge into an assumed prior distribution.
Table 1
BiFold Bernoulli methods: coefficient estimation formulas for the distances

| Groups | Uniform prior: \(\delta_{ij}\) | Uniform prior: \(1/w_{ij}\) | Jeffreys’ prior: \(\delta_{ij}\) | Jeffreys’ prior: \(1/w_{ij}\) | Non-Bayes: \(\delta_{ij}\) | Non-Bayes: \(1/w_{ij}\) |
| --- | --- | --- | --- | --- | --- | --- |
| 1↔2 | \(\frac{2-b_{ij}}{3}\) | \(\bar{p}(1-\bar{p})\) | \(\frac{3/2-b_{ij}}{2}\) | \(\bar{p}(1-\bar{p})\) | \(1-b_{ij}\) | \(\bar{p}(1-\bar{p})\) |
| 1↔1 | \(\frac{s_{ij}^{(11)}+1}{n+2}\) | \(\frac{\delta_{ij} (1-\delta_{ij})}{n}\) | \(\frac{s_{ij}^{(11)}+1/2}{n+1}\) | \(\frac{\delta_{ij} (1-\delta_{ij})}{n}\) | \(\frac{s_{ij}^{(11)}}{n}\) | \(\frac{ (s_{ij}^{(11)}+1/2 ) (n-s_{ij}^{(11)}+1/2 )}{(n+1)^{2} n}\) |
| 2↔2 | \(\frac{s_{ij}^{(22)}+1}{m+2}\) | \(\frac{\delta_{ij} (1-\delta_{ij})}{m}\) | \(\frac{s_{ij}^{(22)}+1/2}{m+1}\) | \(\frac{\delta_{ij} (1-\delta_{ij})}{m}\) | \(\frac{s_{ij}^{(22)}}{m}\) | \(\frac{ (s_{ij}^{(22)}+1/2 ) (m-s_{ij}^{(22)}+1/2 )}{(m+1)^{2} m}\) |

Here \(s_{ij}^{(11)}\) and \(s_{ij}^{(22)}\) denote the within-class difference counts for class 1 (rows, Eq. (10)) and class 2 (columns, Eq. (11)), respectively.
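The “Uniform prior” column of Table 1 can be sketched in Python as follows. The function name `bernoulli_uniform_prior` and its return layout are illustrative assumptions; the formulas are transcribed from Table 1 and Eqs. (10) to (12), and the data are assumed complete (no abstentions).

```python
import numpy as np

def bernoulli_uniform_prior(B):
    """Dissimilarities and inverse weights, 'Uniform prior' column of Table 1.

    B is a complete binary m-by-n matrix. Returns three (delta, 1/w) pairs
    for the blocks 1<->1 (rows), 2<->2 (columns), and 1<->2 (cross-class).
    """
    B = np.asarray(B, dtype=float)
    m, n = B.shape
    Bt = B.T
    # Within-class difference counts, Eqs. (10) and (11).
    s_x = np.abs(B[:, None, :] - B[None, :, :]).sum(axis=2)    # m x m
    s_y = np.abs(Bt[:, None, :] - Bt[None, :, :]).sum(axis=2)  # n x n
    # Pooled rate of positive votes, Eq. (12).
    p_bar = B.sum() / (n * m)

    d_x = (s_x + 1) / (n + 2)
    d_y = (s_y + 1) / (m + 2)
    d_xy = (2 - B) / 3
    inv_w_x = d_x * (1 - d_x) / n
    inv_w_y = d_y * (1 - d_y) / m
    # Note: p_bar equal to 0 or 1 would give a zero variance estimate here.
    inv_w_xy = np.full_like(B, p_bar * (1 - p_bar))
    return (d_x, inv_w_x), (d_y, inv_w_y), (d_xy, inv_w_xy)
```

The diagonal entries of the within-class blocks are not zero under these formulas; this is harmless if the stress sums only over distinct pairs of objects, which is the usual MDS convention and is assumed here.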

Association data: the BiFold Membership Method. For association data (such as the recipe-ingredient dataset), an entry \(b_{ij}=1\) of the sparse biadjacency matrix B indicates an association between object i from class x and object j from class y. Unlike the case of voting datasets, a “0” in an association dataset carries relatively little information as opposed to a “1”. This asymmetry, if not accounted for appropriately, will result in an embedding (and visualization) that is dominated by the count of 1s instead of revealing more useful features.
The between-class dissimilarity measure is quantified as
$$ \delta_{ij}^{(xy)}=1-b_{ij}. \tag{13} $$
The within-class dissimilarities are computed using a Jaccard distance. Specifically, for two objects represented by rows i and j of the matrix B, their dissimilarity is given by
$$ \delta^{(x)}_{ij}=1-\frac{\sum_{k} b_{ik}b_{jk}}{\sum_{k} (b_{ik}+b_{jk}-b_{ik}b_{jk} )}. \tag{14} $$
Likewise, the dissimilarity between columns i and j is computed as
$$ \delta^{(y)}_{ij}=1-\frac{\sum_{k} b_{ki}b_{kj}}{\sum_{k} (b_{ki}+b_{kj}-b_{ki}b_{kj} )}. \tag{15} $$
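A minimal Python sketch of Eqs. (13) to (15) is given below. The function name `jaccard_blocks` is hypothetical, and the handling of two all-zero rows (or columns), for which the Jaccard ratio is 0/0, is an assumption made for the sketch: such pairs are assigned dissimilarity 0.

```python
import numpy as np

def jaccard_blocks(B):
    """Return (Dx, Dy, Dxy) for a binary association matrix B, per Eqs. (13)-(15)."""
    B = np.asarray(B, dtype=float)

    def jaccard_rows(A):
        inter = A @ A.T                                  # sum_k a_ik * a_jk
        sizes = A.sum(axis=1)
        union = sizes[:, None] + sizes[None, :] - inter  # sum_k (a_ik + a_jk - a_ik a_jk)
        # Assumption: when both rows are empty (union == 0), similarity is set to 1.
        sim = np.divide(inter, union, out=np.ones_like(inter), where=union > 0)
        return 1.0 - sim

    Dx = jaccard_rows(B)        # Eq. (14): rows against rows
    Dy = jaccard_rows(B.T)      # Eq. (15): columns against columns
    Dxy = 1.0 - B               # Eq. (13): cross-class
    return Dx, Dy, Dxy
```

Because the denominators count only positions where at least one of the two objects has a “1”, shared zeros do not make two objects look similar, which is the point of using a Jaccard distance for sparse association data.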
5 Stress minimization
After choosing a stress function (8) and an embedding dimension d, a BiFold representation of the data is obtained by minimizing the stress function over the coordinates of \(m+n\) points in a d-dimensional Euclidean space. This optimization problem is within the class of MDS problems, with several alternative tools available to find a local minimum [9, 10]. For the BiFold plots reported in this paper, the stress minimization is done via the (iterative) SMACOF algorithm [10]. For reproducibility of results, the starting configuration for the first iteration of the algorithm is obtained by a classical MDS solution of the joint dissimilarity matrix (without weighting). After applying the SMACOF algorithm to obtain a set of coordinates, we further perform a PCA (principal component analysis) to standardize the alignment, noting that the stress function is invariant under such transformations. As a consequence, in all BiFold plots the horizontal axis is the principal direction.
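The pipeline described above (classical MDS start, weighted SMACOF iterations, PCA alignment) can be sketched with NumPy as follows. This is a minimal textbook-style version of the weighted SMACOF iteration (the Guttman transform with a pseudoinverse), not the authors' implementation; all function names, the iteration cap, and the tolerance are assumptions made for illustration.

```python
import numpy as np

def pairwise_dist(Z):
    diff = Z[:, None, :] - Z[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=2))

def stress(Z, Delta, W):
    """Weighted stress with Phi(d, delta) = (d - delta)^2 over pairs i<j."""
    iu = np.triu_indices(len(Z), k=1)
    return float((W[iu] * (pairwise_dist(Z)[iu] - Delta[iu]) ** 2).sum())

def classical_mds(Delta, d=2):
    """Classical (Torgerson) MDS, used only as the starting configuration."""
    N = len(Delta)
    J = np.eye(N) - np.ones((N, N)) / N          # centering matrix
    G = -0.5 * J @ (Delta ** 2) @ J              # double-centered Gram matrix
    vals, vecs = np.linalg.eigh(G)
    top = np.argsort(vals)[::-1][:d]
    return vecs[:, top] * np.sqrt(np.clip(vals[top], 0.0, None))

def smacof_weighted(Delta, W, d=2, n_iter=300, tol=1e-10):
    """Minimize the weighted stress; returns (coordinates, stress history)."""
    Delta = np.asarray(Delta, dtype=float)
    W = np.array(W, dtype=float)
    np.fill_diagonal(W, 0.0)
    V = np.diag(W.sum(axis=1)) - W               # weighted Laplacian-like matrix
    V_pinv = np.linalg.pinv(V)
    Z = classical_mds(Delta, d)
    history = [stress(Z, Delta, W)]
    for _ in range(n_iter):
        D = pairwise_dist(Z)
        ratio = np.divide(Delta, D, out=np.zeros_like(D), where=D > 0)
        Bz = -W * ratio                          # off-diagonal entries
        np.fill_diagonal(Bz, -Bz.sum(axis=1))    # rows sum to zero
        Z = V_pinv @ Bz @ Z                      # Guttman transform
        history.append(stress(Z, Delta, W))
        if history[-2] - history[-1] < tol:
            break
    # PCA alignment: center, then rotate so the first axis is the principal direction.
    Z = Z - Z.mean(axis=0)
    _, _, Vt = np.linalg.svd(Z, full_matrices=False)
    return Z @ Vt.T, history
```

Each Guttman transform cannot increase the stress, so the returned history is non-increasing up to rounding. The result is still only a local minimizer and depends on the starting configuration.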
6 Treatment of partial and missing data
 1. Use only available data when computing dissimilarities \(\delta_{ij}\).
 2. Weights \(w_{ij}\) should be selected to account for the actual (non-missing) data that is used to compute the associated dissimilarity.

If \(b_{ij} = \mathrm{NA}\) then \(w_{ij}^{(xy)}=0\), and \(\delta _{ij}^{(xy)}=c\), where c is an arbitrary, finite constant.

For within-group differences for group 1, define index sets \(\kappa_{ij}\) as
$$ \kappa_{ij} = \{k\mid b_{ik} \neq\mathrm{NA}, b_{jk} \neq\mathrm{NA} \}, $$
compute
$$ s_{ij}^{(x)}= \sum_{k \in\kappa_{ij}} \vert b_{ik}-b_{jk} \vert , \tag{20} $$
and determine the number of information elements as
$$ n_{ij}^{(x)}= \vert \kappa_{ij} \vert . \tag{21} $$

Apply the Table 1 formulas to compute \(\delta_{ij}^{(x)}\) and \(w_{ij}^{(x)}\), replacing n by \(n_{ij}^{(x)}\).

Use similarly modified formulas to compute \(\delta_{ij}^{(y)}\) and \(w_{ij}^{(y)}\).
After forming the matrices Δ and W, we may simply minimize the weighted stress to determine a coordinate representation.
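The missing-data bookkeeping above can be sketched in Python, using `np.nan` to stand for NA. The function names are hypothetical, and the cross-class helper uses the non-Bayes dissimilarity \(1-b_{ij}\) with a caller-supplied weight purely for illustration.

```python
import numpy as np

def within_row_counts(B):
    """Return (s, n_eff) for the rows of B, Eqs. (20) and (21); np.nan marks NA."""
    B = np.asarray(B, dtype=float)
    avail = ~np.isnan(B)
    B0 = np.where(avail, B, 0.0)                     # placeholder value, masked below
    both = avail[:, None, :] & avail[None, :, :]     # True where k is in kappa_ij
    diffs = np.abs(B0[:, None, :] - B0[None, :, :])
    s = np.where(both, diffs, 0.0).sum(axis=2)       # Eq. (20)
    n_eff = both.sum(axis=2)                         # Eq. (21)
    return s, n_eff

def cross_class_with_missing(B, weight=1.0, c=0.0):
    """Cross-class blocks: NA entries get weight 0 and an arbitrary finite value c."""
    B = np.asarray(B, dtype=float)
    missing = np.isnan(B)
    d_xy = np.where(missing, c, 1.0 - B)
    w_xy = np.where(missing, 0.0, weight)
    return d_xy, w_xy
```

The column-wise counts follow by calling `within_row_counts(B.T)`. Pairs with no jointly observed entries (\(n_{ij}=0\)) would need separate handling, for example a zero weight; that choice is an assumption not discussed in the text.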
Declarations
Acknowledgements
The authors wish to thank Daniel B. Larremore for useful feedback on the manuscript. This work was partially supported by a Clarkson University Provost Award, Army Research Office grants W911NF1210276 and W911NF1610081, and the Simons Foundation grant 318812. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect those of the funding agencies.
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.
References
1. Fekete JD, van Wijk JJ, Stasko JT, North C (2008) The value of information visualization. In: Kerren A et al. (eds) Information visualization. LNCS, vol 4950, pp 1-18
2. Spence I, Garrison RF (1993) A remarkable scatterplot. Am Stat 47(1):12-19
3. Fayyad U, Grinstein G, Wierse A (eds) (2001) Information visualization in data mining and knowledge discovery. Kaufmann, San Francisco
4. Gastner MT, Newman MEJ (2004) Diffusion-based method for producing density-equalizing maps. Proc Natl Acad Sci USA 101:7499-7504
5. Sims GE, Choi IG, Kim SH (2005) Protein conformational space in higher order ϕ-ψ maps. Proc Natl Acad Sci USA 102:618-621
6. Chen M et al. (2009) Data, information and knowledge in visualization. IEEE Comput Graph Appl 29(1):12-19
7. Nishikawa T, Motter AE (2011) Discovering network structure beyond communities. Sci Rep 1:151
8. Shekhar K, Brodin P, Davis MM, Chakraborty AK (2014) Automatic classification of cellular expression by nonlinear stochastic embedding (ACCENSE). Proc Natl Acad Sci USA 111:202-207
9. Cox TF, Cox MAA (2000) Multidimensional scaling, 2nd edn. Chapman & Hall/CRC, London
10. Borg I, Groenen PJF (2005) Modern multidimensional scaling: theory and applications, 2nd edn. Springer, New York
11. Vogelstein JT et al. (2014) Discovery of brainwide neural-behavioral maps via multiscale unsupervised structure learning. Science 344:386-392
12. Engeszer RE, Wang G, Ryan MJ, Parichy DM (2008) Sex-specific perceptual spaces for a vertebrate basal social aggregative behavior. Proc Natl Acad Sci USA 105:929-933
13. Cermeño P, Falkowski PG (2009) Controls on diatom biogeography in the ocean. Science 325:1539-1541
14. Bronstein AM, Bronstein MM, Kimmel R (2006) Generalized multidimensional scaling: a framework for isometry-invariant partial surface matching. Proc Natl Acad Sci USA 103:1168-1172
15. Shieh AD, Hashimoto TB, Airoldi EM (2011) Tree preserving embedding. Proc Natl Acad Sci USA 108:16916-16921
16. Aflalo Y, Kimmel R (2013) Spectral multidimensional scaling. Proc Natl Acad Sci USA 110:18052-18057
17. Bauer-Mehren A et al. (2011) Gene-disease network analysis reveals functional modules in mendelian, complex and environmental diseases. PLoS ONE 6(6):e20284
18. Craciun G, Feinberg M (2006) Multiple equilibria in complex chemical reaction networks: II. The species-reaction graph. SIAM J Appl Math 66(4):1321-1338
19. Gabriel KR (1971) The biplot graphic display of matrices with application to principal component analysis. Biometrika 58(3):453-467
20. Bennett JF, Hays HL (1960) Multidimensional unfolding: determining the dimensionality of ranked preference data. Psychometrika 25(1):27-43
21. Hotelling H (1933) Analysis of a complex of statistical variables into principal components. J Educ Psychol 24:417-441, 498-520
22. Richardson M, Kuder GF (1933) Making a rating scale that measures. Pers J 12:36-40
23. Hirschfeld HO (1935) A connection between correlation and contingency. Proc Camb Philos Soc 31:520-524
24. Gower JC, Harding SA (1988) Nonlinear biplots. Biometrika 75(3):445-455
25. Gower JC (1992) Generalized biplots. Biometrika 79(3):475-493
26. Davis A, Gardner BB, Gardner MR (1941) Deep South: a social anthropological study of caste and class. University of Chicago Press, Chicago
27. Freeman LC (2003) Finding social groups: a meta-analysis of the southern women data. In: Breiger R, Carley K, Pattison P (eds) Dynamic social network modeling and analysis. National Academies Press, Washington
28. Field S, Frank KA, Schill K (2006) Identifying positions from affiliation networks: preserving the duality of people and events. Soc Netw 28:97-123
29. Beineke LW, Wilson RJ (2004) Topics in algebraic graph theory. Cambridge University Press, Cambridge
30. Porter MA, Mucha PJ, Newman MEJ, Warmbrand CW (2005) A network analysis of committees in the U.S. House of Representatives. Proc Natl Acad Sci USA 102(20):7057-7062
31. Ahn YY, Ahnert SE, Bagrow JP, Barabási AL (2011) Flavor network and the principles of food pairing. Sci Rep 1:196
32. Levandowsky M, Winter D (1971) Distance between sets. Nature 234:34-35