By Claudia Bauzer Medeiros
The management of scientific data covers the so-called “life cycle” of the data, i.e., from collection to long-term storage, through a series of cleaning, curation, annotation, indexing, and transformation steps. Much of today’s scientific research requires some form of data processing and analysis. Planning the management of the data used and generated in a research project has therefore become an integral part of scientific methodology and is considered one of the requirements of good research practice.
Research projects focus mainly on the beginning and middle of the cycle – that is, planning the collection of the data to be used, eliminating errors (data cleaning), and storing the data appropriately so that the desired analyses can be carried out to produce knowledge. These activities present major challenges, both for researchers who use and produce data in their own research and for those who conduct research on data management itself. The latter may be, for example, computing researchers or data librarians (usually engaged in curation and preservation activities). Regardless of what it is called, the management of research data has been giving rise to many new lines of research in Computer Science, and their number tends to increase as new challenges appear.
The figure above¹, from the JISC² website in the UK (https://www.jisc.ac.uk/), shows one of the many possible views of the research data life cycle.
The emphasis that the figure above places on data maintenance and preservation points to a very important fact – management planning extends far beyond the duration of a project, since the data must remain available for as long as possible. This raises the problem of the cost associated with the life cycle. Several studies show that the cost of preservation rises over time and that, in the medium and long term, it far outweighs the initial collection (or generation) and cleaning costs. One reason for this is the technological evolution of digital storage media – within a few years they become obsolete, requiring data curators to copy the data to a different, more modern medium so that the data do not become unreadable.
Curation therefore also needs to consider which data sets should be preserved and for how long. A 2013 study found that, after 20 years, 80% of the data used to produce scientific articles are no longer available.
The figure above³, taken from the article by Gibney and Van Noorden⁴, illustrates the disappearance of these data.
Open Science presupposes Open Data (where the concept of “data” is very broad, including any type of stored digital object). There are several definitions of “open data”, but perhaps the most interesting is the one that defines it as data sets whose metadata must be public. In other words, anyone can find out, using search engines, whether the data exist and how to obtain them. However, the data themselves are not necessarily public – access may be limited to restricted research groups, for ethical or privacy reasons, for example.
In this context, data management presents yet another challenge: How to specify metadata in a way that allows the associated data to be considered “open”? This, in turn, requires the development of new metadata standards, organization of metadata repositories, and metadata mining and search systems.
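To make the idea of public metadata with possibly restricted data concrete, here is a minimal sketch in Python. It does not follow any existing metadata standard: the class `DatasetRecord`, its field names, the sample values, and the group names are all hypothetical and chosen only for illustration.

```python
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class DatasetRecord:
    """Hypothetical metadata record; the field names are illustrative only."""
    identifier: str          # persistent identifier of the data set
    title: str
    creators: tuple
    description: str
    access_level: str        # "public" or "restricted"
    how_to_request: str      # tells a reader how to obtain the data
    authorized_groups: frozenset = frozenset()


def public_metadata(record):
    """Return the part of the record that is always published,
    whatever the access level of the data themselves."""
    return {
        "identifier": record.identifier,
        "title": record.title,
        "creators": list(record.creators),
        "description": record.description,
        "access_level": record.access_level,
        "how_to_request": record.how_to_request,
    }


def may_access_data(record, group):
    """The data are usable by everyone only if marked public;
    otherwise only by the groups listed in the record."""
    return record.access_level == "public" or group in record.authorized_groups


# Made-up example: anyone can discover the data set, few can use it.
record = DatasetRecord(
    identifier="example-dataset-001",
    title="Interview transcripts (example)",
    creators=("A. Researcher",),
    description="Illustrative record for a privacy-sensitive data set.",
    access_level="restricted",
    how_to_request="Contact the data curator of the hosting repository.",
    authorized_groups=frozenset({"ethics-approved-team"}),
)

print(json.dumps(public_metadata(record), indent=2))
```

In this sketch the metadata can always be serialized and indexed by a search system, while `may_access_data` decides separately who can use the data. Keeping the two concerns apart is one possible design, not a prescribed one.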
Notes
1. KAYE, J. Storing and sharing research data after the ‘Space Race’ [online]. Jisc. 2015 [viewed 22 June 2018]. Available from: https://www.jisc.ac.uk/blog/storing-and-sharing-research-data-after-the-space-race-25-feb-2015
2. This view emphasizes the storage, preservation, and organization of scientific data repositories. JISC is one of the leading UK bodies supporting the management and curation of scientific data associated with education. It therefore supports universities and educational institutions in all aspects of data management. Another equally important UK body is the Digital Curation Centre (DCC – http://www.dcc.ac.uk/), which is mainly concerned with data curation. DCC and JISC provide a wealth of teaching materials on scientific data management, as well as training for researchers and information management professionals. Several other large centers deal with these aspects, such as the Australian ANDS (http://ands.org.au), the Dutch DANS (https://dans.knaw.nl/en), and the Canadian Portage (https://portagenetwork.ca).
3. GIBNEY, E. and VAN NOORDEN, R. Scientists losing data at a rapid rate [online]. Nature. 2013 [viewed 22 June 2018]. Available from: https://www.nature.com/news/scientists-losing-data-at-a-rapid-rate-1.14416
4. It should be noted that the article only examined data associated with publications. However, there are huge data sets that serve as a basis for research of all kinds but are not directly associated with any particular article. A typical example is satellite image time series, which feed studies on crop forecasting or climatology. Yet another example is data from carbon capture towers, installed around the world, used in global warming research. These types of data, once collected and preserved, serve a large number of studies over many years. Another major challenge of data management is, thus, associated with preservation procedures.
References
GIBNEY, E. and VAN NOORDEN, R. Scientists losing data at a rapid rate [online]. Nature. 2013 [viewed 22 June 2018]. Available from: https://www.nature.com/news/scientists-losing-data-at-a-rapid-rate-1.14416
KAYE, J. Storing and sharing research data after the ‘Space Race’ [online]. Jisc. 2015 [viewed 22 June 2018]. Available from: https://www.jisc.ac.uk/blog/storing-and-sharing-research-data-after-the-space-race-25-feb-2015
About Claudia Bauzer Medeiros
Full Professor at the Instituto de Computação, UNICAMP. She has received national and international awards for excellence in teaching and research, and for attracting women to IT. She coordinates the FAPESP eScience and Data Science Program. Commander of the National Order of Scientific Merit; Doctor Honoris Causa at University Antenor Orrego (Peru) and University Paris-Dauphine (France). Member of the Research Data Alliance Board.
Translated from the original in Portuguese by Lilian Nassi-Calò.