While the focus has been on the history of the ‘genome wars’, they did not occur in isolation, as discussions regarding information access have been occurring simultaneously in a variety of disciplines (Figure 1). Two forces are currently impacting the research community - the need to protect individual privacy regarding information and the push towards open access to data - and the outcomes are not yet clear. Data available from such sources as mobile phone calls, social media information, consumer information collected by companies, and government data have been described as revolutionizing the social sciences to an extent comparable to the effect that the microscope had on the biological sciences [41]. However, such data individually (and, more importantly, in combination) could render current notions of privacy obsolete.
Large proprietary datasets have been made available to particular researchers because of the relationships or contractual agreements they have forged with the data-generating companies, which raises questions about reproducibility and review for the social science community [42]. For example, Wang et al. in 2009 [43] published research on the way that mobile viruses can spread, which has implications for the telecommunications infrastructure. Their findings were based on the anonymized billing records of 6.2 million mobile phone subscribers. As the privacy of the records was protected by law, the authors noted in the Supplemental Online Material that they would provide further information on request and told Science that as long as the researchers were willing to observe the same privacy, technological, security, and legal limitations that they were subject to at the time of the request, they would be glad to facilitate data access at their center (personal communication, A.-L. Barabasi). The motivations for some of these companies to release data to investigators are to learn more about their own clients and operations or to forge deals in which the researchers look at questions of interest to them. However, commentators have complained that such ‘private’ data threaten the capacity for independent replication on which science is based [44]. (As Bernardo Huberman said, ‘if an independent set of data fails to validate results derived from privately owned data, how do we know whether it is because those data are not universal or because the authors made a mistake?’ [45].) Whether privately arranged access to data that form the basis of scientific publications, or release only of aggregated data that protect company or individual privacy, will continue remains to be seen. Certainly, further investment in technology that can ensure control over the anonymization of data is warranted.
Nor is it easy for private companies to release such data without strings attached. Understanding how people search for information is an active area of research in the social sciences. In 2006 researchers employed by America Online, Inc. (AOL) published log files encompassing 36,389,567 searches done by users of AOL’s proprietary client software on the Internet. This was done without the consent of the individuals involved and, even though screen names had been removed, New York Times reporters and others rapidly showed that it was possible to infer the identities of some of the searchers from the information that was released [46]. Even with these problems, some scientists said that the data release was a service to the research community [47]. AOL was widely criticized in the blogosphere and in mainstream media, some individuals at AOL lost their jobs, and the company was the target of a class-action lawsuit.
There is a long history of efforts to protect individual privacy in the United States and Europe covering a wide range of human activity. Some recent events should be noted as they could affect researchers’ ability to gather data. In the European Union, a General Data Protection Regulation was proposed in January of 2012 and is expected to take effect by 2016 [48]. Two provisions could be especially challenging for the research community. The regulation stipulates that data can only be saved for as long as a need can be demonstrated. It also codifies a ‘right to be forgotten’ - that any individual has the right to have his/her data removed from a database at any time. The European Commission has stated that personal data ‘is any information relating to an individual, whether it relates to his or her private, professional or public life. It can be anything from a name, a photo, an email address, bank details, your posts on social networking websites, your medical information, or your computer’s IP address’ [48]. In the United States, the Federal Trade Commission recently released recommendations for protecting consumer privacy, which included provisions for controls over how much data companies can collect on individuals and how long they can retain it as well as a recommendation that companies establish a ‘do-not-track’ mechanism for consumers who do not wish to have their information gathered [49]. There need to be consistent, transparent regulations that will safeguard the public but allow research to move forward.
While concerns about privacy could restrict data dissemination, another force is acting to promote access. The idea that sequence information should be freely available was a reflection of a much broader effort by diverse parts of the scientific community to make access to the results of scientific research faster and cost-free. The open source movement started between the 1970s and 1980s with the goal of creating, developing, and disseminating free computer operating systems that would break the hold of corporate entities [50]. While journals that are freely available to readers began to appear by the early 1980s [51], the open access movement gained momentum at about the same time as the human genome project was officially starting. In 1991, a year after the official start of the human genome project, Paul Ginsparg created ArXiv, a pre-publication server whose purpose was to facilitate rapid dissemination of scientific information, without subscription fees, page charges, or peer review [52]. It is currently possible to access more than 800,000 e-prints on ArXiv in physics, mathematics, computer science, quantitative biology, quantitative finance, and statistics. The National Library of Medicine formed PubMedCentral in 2000 in order to have a free repository for NIH-funded research. Initially deposition was voluntary, but after statistics showed poor compliance by grantees, Congressional action (the Consolidated Appropriations Act of 2008 (H.R. 2764)) made deposition mandatory, with release no later than a year post-publication.
At roughly the same time as the report of the Cech committee came down hard on the idea of restrictions to data access in published papers, two major manifestos of the open access movement, the Budapest open access initiative [53] and the Berlin declaration [54], appeared. Certainly some of the statements of groups and individuals who were most vociferous in their opposition to Science’s efforts to look for alternatives in publishing data reflect the philosophy of the open access movement. The Wellcome Trust, which has been a prime mover in supporting open access, declared recently that ‘Our support for open access publishing was a natural progression of our involvement in the international Human Genome Project during the 1990s and early 2000s, where the decision to place the human genetic sequence in the public domain immediately as it was generated helped to ensure this key research resource could be used by scientists the world over’ [55].
There have been several attempts, via proposals in 2006, 2010, and 2012, to legislate a Federal Research Public Access Act to require that research funded by 11 federal agencies be made freely and publicly available in government-sponsored repositories within 6 months of publication. While the advantages and disadvantages for researchers, funding agencies, and publishers are still being debated, the open access movement has a great deal of energy and backers willing to finance it (at least in the short term).
Does publication, whether in a repository or a journal, mean that enough information is released to form a solid foundation for future research? Despite high-minded principles, the published literature reflects the fact that many researchers do not, left to their own devices, rush to share data. Even when the journal Biostatistics offered to give formal recognition to authors who provided enough data and methods in their papers to allow an editor to replicate the findings, only a small percentage complied [56]. Although the extent of sharing varies from field to field, common reasons given for withholding data are similar: it is too much work, it removes the competitive advantage from the scientists who generated the data and who require publications for their careers, or the raw data were received under confidentiality agreements. The Dataverse Network is a repository for social science data that allows depositors to note concerns and restrictions. In a 2010 survey of the conditions for use posted by more than 30,000 users, Stodden [57] found the most common were ‘maintaining subject confidentiality, preventing further sharing, making a specific citation form a condition of use, restricting access by commercial or profit-making entities, and restricting use to a specific community, such as that of the researcher’s home institution.’
Certainly, as people have been saying for years, the granting, funding, and tenure cultures need to enforce good behavior. NIH now mandates that provisions for data sharing be included in research applications for $500,000 or more of direct costs in any single year, and several other agencies have similar provisions. The NSF states that to apply for a grant as of January 18, 2011 ‘All proposals must describe plans for data management and sharing of the products of research, or assert the absence of the need for such plans’ [58]. The two-page data management plan submitted as part of the application may include ‘policies and provisions for re-use, re-distribution, and the production of derivatives’ and ‘plans for archiving data, samples, and other research products, and for preservation of access to them.’ Although the NSF-wide mandate takes precedence, there are variations within the directorates [59]. For example, the NSF Division of Earth Sciences allows up to 2 years of exclusive data use for selected principal investigators. The directorate for Social, Behavioral, and Economic Sciences allows for the possibility of ethical and legal restrictions regarding access to non-aggregated data.
Community service, whether through generation of shared data or sharing of knowledge and communication with the public, needs to be formally recognized. A Data and Informatics Working Group has recommended that NIH provide incentives for data sharing by providing information on the number of times datasets in its central repository are accessed or downloaded [60].
Even when academic communities are willing to share data, public repositories do not always exist and those that do are under siege in unstable economic environments. It can be easy to drum up funds to create databases, but not so easy to find federal or other moneys to sustain them. Data repositories need continuing support.
The momentum in academia is clearly toward the view that releasing data, rather than hoarding them, is a virtue. However, the history and examples cited in this paper show that while data sharing may become second nature, it is not an easy, seamless process and is not happening without challenges and compromises.