The Open Data movement: international consolidation

Image source: JulieBeck.

As reported in this blog¹, the first issue of the open access journal Scientific Data² was published on 24th June by Nature Publishing Group, the publisher of the prestigious Nature collection. The new journal hosts the formal publication of datasets in its primary article type, known as Data Descriptors³.

The journal is peer-reviewed and published in electronic form only. Authors pay a publication fee starting at US$900, depending on the type of license and country of affiliation, which ensures that their article is immediately available in open access, with its content under a Creative Commons Attribution License (CC-BY). The metadata of the datasets is also made available in machine-readable form.

The launch of this journal has been brought about by the growing awareness on the part of the academic world, research institutions, funding agencies, the private sector, governments and civil society of the importance of making available the experimental data resulting from scientific research and providing the interoperability of this data with the articles which resulted from it. The ability of science to advance based on previously conducted research, and the ability for self-correction on a continual basis, finds one of its major cornerstones in the open availability of data.

A few decades ago, the world experienced a paradigm shift in scholarly communication as a result of the Internet and digital technologies, with the online publication of journals and new forms of dissemination, evaluation, and communication between authors, editors, referees and readers. A natural consequence of this process – and possibly a new paradigm – envisages that the data used in the creation of a scientific article should be made available in open access repositories. This mirrors the way an ever-increasing number of articles is becoming available in open access, either after the embargo period imposed by the publishers has elapsed (Green Road) or immediately on publication (Golden Road).

According to the editorial in the first issue (2014) of Scientific Data, “the question is no longer whether research data should be shared, but how to make effective data sharing a common and well-rewarded part of research culture”. With this new journal, a space is opening up in which researchers can formally describe a body of datasets and the techniques used to obtain them, and refer readers to articles which have already incorporated this data. This also allows due credit to be given to researchers responsible for the production of data who would not qualify as authors in a traditional publication.

Now that datasets can be shared, duly refereed and cited, it is hoped that the scientific community will recognize and give credit to the authors of this data, in the same way as happens today with publications that have gone through peer review.

The sharing of datasets has particular relevance in the areas of climate change and health sciences, in the opinion of the Editor-in-Chief of Nature, Philip Campbell. Campbell visited Brazil in March 2014, when he took part in the conference entitled “Science as an open enterprise: open data for an open science”⁴, which was held at the headquarters of FAPESP in São Paulo. At this conference, he stressed that it is necessary to consider the costs and implications of managing large quantities of data, and cited the report⁵ published in 2012 by the Royal Society with the same title as the São Paulo conference. The report is a collection of chapters by UK specialists that analyze the impact of the new technologies dominating scholarly communication and show how researchers should adapt to the changes ahead. It makes a series of recommendations on the storage, availability, sharing and interoperability of research data so that it can be better used and re-used.

In addition to the reasons already given, sharing datasets is bound to increase the reproducibility of research outcomes. The greater the number of researchers making their data openly available in open access repositories, the greater the probability that others will be able to reproduce their work, with obvious benefit for all. As has already been noted in this blog⁶, irreproducibility in research outcomes is an issue that concerns not only the scientific community but also private enterprise, governments and society as a whole.

A particularly important reason for adopting storage and sharing policies for datasets is their long-term digital preservation. A study by researchers in Canada⁷ examined whether the authors of articles published between two and twenty-two years earlier had preserved the underlying data. The results indicate that the older the publication, the greater the loss of data: the odds that an author still holds the data from an article fall by 17% per year. Added to this is the difficulty of tracking down authors, since older publications do not include e-mail addresses or, if they do, these may be out of date; the odds of being able to contact an author drop by 7% per year. It is therefore estimated that 80% of data will not be available twenty years after it was generated.
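
Because the 17% and 7% figures are annual rates, their effect compounds over time. The sketch below illustrates that arithmetic in Python, assuming the rates are read as year-on-year declines in the odds (odds = p / (1 - p)) rather than in the probability itself. The starting probability of 0.9 is a hypothetical value chosen only for illustration and is not taken from the study.

```python
def probability_to_odds(p: float) -> float:
    """Convert a probability (0 <= p < 1) to odds."""
    return p / (1.0 - p)


def odds_to_probability(odds: float) -> float:
    """Convert odds back to a probability."""
    return odds / (1.0 + odds)


def probability_after(p_start: float, annual_odds_decline: float, years: int) -> float:
    """Probability left after `years` if the odds shrink by a fixed fraction each year."""
    odds = probability_to_odds(p_start) * (1.0 - annual_odds_decline) ** years
    return odds_to_probability(odds)


if __name__ == "__main__":
    P_START = 0.9  # hypothetical starting probability, for illustration only
    for years in (2, 10, 20):
        data_kept = probability_after(P_START, 0.17, years)
        author_reachable = probability_after(P_START, 0.07, years)
        print(f"{years:>2} years: data kept {data_kept:.2f}, "
              f"author reachable {author_reachable:.2f}")
```

The output depends entirely on the assumed starting value, so it should not be expected to reproduce the 80% estimate quoted above; the point is only that a modest annual rate erodes availability quickly over two decades.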

Funding agencies are important partners that support – and finance – initiatives such as repositories for the storage, retrieval and sharing of data. An example is the US National Science Foundation, which has implemented a detailed policy on the deposition of data from research funded by the institution⁸.

Canadian federal funding agencies are developing a joint initiative to improve access to publicly funded research – and the data relating to it – in line with international norms and standards. The terms of the Tri-Council Open Access Policy are available on the University of Waterloo web site⁹.

Datasets, as stated above, are peer reviewed content, constituting a reference source for which authors will receive credit, as happens with traditional journal publications. In addition, datasets will be assigned a DOI (digital object identifier), thus making them citable. Because of this, it is hoped that citations to an article will increase due to the open availability of its data.
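
As a rough illustration of what a DOI adds, the sketch below builds a resolvable link and a simple citation string for a dataset. The function names, the citation layout and the example values (including the DOI `10.1234/example.dataset`) are hypothetical and are not taken from any journal's guidelines; the only fixed convention relied on is that a DOI resolves through `https://doi.org/`.

```python
def doi_to_url(doi: str) -> str:
    """Return the resolver URL for a DOI, accepting an optional 'doi:' prefix."""
    cleaned = doi.strip()
    if cleaned.lower().startswith("doi:"):
        cleaned = cleaned[4:].strip()
    return f"https://doi.org/{cleaned}"


def format_data_citation(authors: list[str], year: int, title: str,
                         repository: str, doi: str) -> str:
    """Assemble an illustrative dataset citation that ends in the DOI link."""
    author_part = "; ".join(authors)
    return f"{author_part} ({year}). {title}. {repository}. {doi_to_url(doi)}"


if __name__ == "__main__":
    # All values below are made up for the example.
    print(format_data_citation(
        ["SILVA, A.", "SOUZA, B."],
        2014,
        "Example field measurements",
        "Example Repository",
        "doi:10.1234/example.dataset",
    ))
```

Because the link is derived from the identifier rather than from the repository's current address, the citation stays valid even if the dataset is moved.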

Keeping in mind this extensive source of citations, Thomson Reuters created the Data Citation Index, which complements the existing Science Citation Index, Social Sciences Citation Index and the SciELO Citation Index, the last of which came online at the beginning of 2014. These are all part of the Thomson Reuters Web of Science service, the world’s largest and most prestigious international database of scientific journals.

The Data Citation Index¹⁰, inaugurated at the beginning of 2013, allows researchers to access numerous repositories of datasets in one single database, thus bringing the impact of research beyond the published content. By standardizing the practice of citing data, researchers will have more opportunity to get credit for their work. Similarly, funding agencies will ensure greater visibility and impact of the research they fund, including allowing for the research results to be used by other researchers.

Keeping in mind the progress of the open data movement, many publishers have already taken steps to elaborate policies and methodologies for the storage and retrieval of data. Let’s look at some examples.

PLoS

The Public Library of Science, PLoS, one of the most important pioneering initiatives in open access publishing, has published Data Access for the Open Access Literature: PLOS’s Data Policy¹¹, a policy which has been in force since March 1, 2014.

The Editor-in-Chief of PLoS Biology, Theo Bloom, explains that, in line with the open access policy of the collection, the underlying data should be freely available to researchers for replication, re-analysis, interpretation or inclusion in meta-analyses, to facilitate the reproducibility of research results and scientific progress. Since their creation, PLoS journals have requested that research data be made available; however, it wasn’t until 2013 that a set of methodologies and policies for data storage was drawn up.

According to these policies, upon the online submission of a paper, authors must provide a Data Availability Statement, which is published once the paper is accepted for publication. Refusal to share data and related metadata is sufficient reason for rejecting the paper. PLoS recommends depositing data in public repositories such as the Dryad Digital Repository. Genetic sequences, protein structures, clinical trials and biological models can be deposited in specific databases, such as GenBank, Protein Data Bank, ClinicalTrials.gov and the BioModels database, respectively. PLoS further defines a minimal dataset as the dataset used to reach the conclusions contained in the paper. This dataset should likewise be made available.

Taylor and Francis

This publisher has not yet defined its policies for the sharing of datasets originating from the articles published in its journals, some of which are Gold Road Open Access while others are hybrids. However, the publisher’s web site¹² specifies that authors who use other researchers’ datasets in their articles must indicate how this data was selected, and specify the URL of the source of the data so that its reproduction is possible by other authors.

Springer

The publisher advises on its web site¹³ that the submission of a paper to an open access journal in the Springer Open collection implies that the reproducible elements described in the paper, including the relevant underlying data, are made available to any researcher who wants to use them for non-commercial purposes. Springer does not have a repository or specific policies for this procedure, but advises that repositories for this purpose are widely available, such as databases for sequences of nucleic acids and proteins, and the repositories of funding agencies. A complete list of these is available on the Web¹⁴.

SciELO

The SciELO Program has been closely following the global trends in the sharing of research data. At the SciELO 15 Years Conference, the preservation and sharing of data was the main topic of the talk¹⁵ given by Todd Vision, biologist and researcher at the University of North Carolina at Chapel Hill and co-founder of the Dryad Digital Repository, an initiative undertaken in conjunction with journal publishers and scholarly societies. This initiative is concerned with the preservation and availability, in open access, of datasets in the scientific and medical literature.

Starting in 2015, SciELO intends to begin a policy of asking authors who publish in the journals of the SciELO collections to make their research data available in repositories. The SciELO Program is collaborating with the DataFAIRport initiative – Find, Access, Interoperate & Re-use. This initiative, founded in January 2014, aims to make scientific research data “FAIRer” in the sense of being Findable, Accessible, Interoperable and Re-usable. It is being developed by a network of specialists and institutions.

Thus, one of the most important programs of publishing journals in open access in the southern hemisphere – and the world – distinguishes itself once again by adopting state-of-the-art methodologies and policies in favor of open access in all respects.

Notes

¹ Scientific Data: Nature Publishing Group moves the communication of scientific data forward with its new online open access publication. SciELO in Perspective. [viewed 01 June 2014]. Available from: http://blog.scielo.org/en/2014/02/04/scientific-data-nature-publishing-group-moves-the-communication-of-scientific-data-forward-with-its-new-online-open-access-publication/

² YU, Y., et al. Comprehensive RNA-Seq transcriptomic profiling across 11 organs, 4 ages, and 2 sexes of Fischer 344 rats. Scientific Data. 2014. Available from: http://www.nature.com/articles/sdata201413

³ Data Descriptors – http://www.nature.com/sdata/

⁴ CAMPBELL, P. Conference Science as an Open Enterprise: open data for open Science. 2013. Available from: http://www.fapesp.br/eventos/scienceOpenEnterprise

⁵ Science as an open enterprise. The Royal Society Science Policy Centre report. 2012, n. 02. Available from: http://royalsociety.org/uploadedFiles/Royal_Society_Content/policy/projects/sape/2012-06-20-SAOE.pdf

⁶ Reproducibility of research results: the tip of the iceberg. SciELO in Perspective. [viewed 01 June 2014]. Available from: http://blog.scielo.org/en/2014/02/27/reproducibility-of-research-results-the-tip-of-the-iceberg/

⁷ VINES, T. H., et al. The Availability of Research Data Declines Rapidly with Article Age. Curr. Biol. 2014, vol. 24, n. 1. Available from: http://www.cell.com/current-biology/abstract/S0960-9822%2813%2901400-0

⁸ Dissemination and Sharing of Research Results. NSF Data Sharing Policy. Available from: http://www.nsf.gov/bfa/dias/policy/dmp.jsp

⁹ Open data Guide. University of Waterloo. Available from: http://subjectguides.uwaterloo.ca/content.php?pid=333963&sid=3122909

¹⁰ Data Citation Index – http://wokinfo.com//products_tools/multidisciplinary/dci/

¹¹ Data Access for the Open Access Literature: PLOS’s Data Policy. Plos. Available from: http://www.plos.org/data-access-for-the-open-access-literature-ploss-data-policy/

¹² Datasets. Taylor & Francis Author Services. Available from: http://journalauthors.tandf.co.uk/preparation/writing.asp#link21

¹³ Availability of supporting data. SpringerOpen. Available from: http://www.springeropen.com/about/supportingdata

¹⁴ DataCite. Repositories for research data – http://www.datacite.org/repolist/

¹⁵ SciELO 15 Years Conference – http://www.scielo15.org/todd-vision/

Video of the presentation (in English) – https://www.youtube.com/watch?v=-4xshxMqZsU

Reference

More bang for your byte. Editorial. Scientific Data. 2014. Available from: http://www.nature.com/articles/sdata201410

External links

Dryad Digital Repository – http://datadryad.org/

BioModels database – http://www.ebi.ac.uk/biomodels/

GenBank – http://www.ncbi.nlm.nih.gov/Genbank/

Protein Data Bank – http://www.rcsb.org/pdb/

ClinicalTrials.gov – http://clinicaltrials.gov/

Data FAIRport initiative – http://www.datafairport.org/

DTL – http://www.dtls.nl/

ELIXIR – http://www.elixir-europe.org/

Force11 Data Citation Principles – https://www.force11.org/datacitation

Nature – http://www.nature.com/

About Lilian Nassi-Calò

Lilian Nassi-Calò studied chemistry at Instituto de Química – USP, holds a doctorate in Biochemistry from the same institution and completed a post-doctorate as an Alexander von Humboldt fellow in Wuerzburg, Germany. After her studies, she was a professor and researcher at IQ-USP. She also worked as an industrial chemist and is presently Coordinator of Scientific Communication at BIREME/PAHO/WHO and a collaborator of SciELO.

Translated from the original in Portuguese by Nicholas Cop Consulting.

How to cite this post [ISO 690/2010]:

NASSI-CALÒ, L. The Open Data movement: international consolidation [online]. SciELO in Perspective, 2014 [viewed ]. Available from: https://blog.scielo.org/en/2014/07/14/the-open-data-movement-international-consolidation/
