{"id":2942,"date":"2018-06-22T16:00:24","date_gmt":"2018-06-22T19:00:24","guid":{"rendered":"http:\/\/blog.scielo.org\/en\/?p=2942"},"modified":"2018-07-05T11:45:17","modified_gmt":"2018-07-05T14:45:17","slug":"scientific-data-management-from-collection-to-preservation","status":"publish","type":"post","link":"https:\/\/blog.scielo.org\/en\/2018\/06\/22\/scientific-data-management-from-collection-to-preservation\/","title":{"rendered":"Scientific Data Management \u2013 from collection to preservation"},"content":{"rendered":"<p><strong>By Claudia Bauzer Medeiros<\/strong><\/p>\n<div id=\"attachment_2944\" style=\"width: 310px\" class=\"wp-caption alignright\"><a href=\"http:\/\/blog.scielo.org\/en\/wp-content\/uploads\/sites\/2\/2018\/06\/gestao-de-dados.jpg\" target=\"_blank\" rel=\"noopener\"><img loading=\"lazy\" decoding=\"async\" aria-describedby=\"caption-attachment-2944\" class=\"wp-image-2944 size-medium\" src=\"http:\/\/blog.scielo.org\/en\/wp-content\/uploads\/sites\/2\/2018\/06\/gestao-de-dados-300x199.jpg\" alt=\"\" width=\"300\" height=\"199\" srcset=\"https:\/\/blog.scielo.org\/en\/wp-content\/uploads\/sites\/2\/2018\/06\/gestao-de-dados-300x199.jpg 300w, https:\/\/blog.scielo.org\/en\/wp-content\/uploads\/sites\/2\/2018\/06\/gestao-de-dados-768x510.jpg 768w, https:\/\/blog.scielo.org\/en\/wp-content\/uploads\/sites\/2\/2018\/06\/gestao-de-dados-150x100.jpg 150w, https:\/\/blog.scielo.org\/en\/wp-content\/uploads\/sites\/2\/2018\/06\/gestao-de-dados.jpg 1000w\" sizes=\"auto, (max-width: 300px) 100vw, 300px\" \/><\/a><p id=\"caption-attachment-2944\" class=\"wp-caption-text\"><i>Image: <a href=\"https:\/\/www.flickr.com\/photos\/rh2ox\/9990024683\/\" target=\"_blank\" rel=\"noopener\">rh2ox<\/a>.<\/i><\/p><\/div>\n<p>The management of scientific data covers the so-called &#8220;life cycle&#8221; of the data, i.e., from collection to long-term storage, through a series of cleaning, curation, annotation, indexing, and transformation processes. 
Much of today&#8217;s scientific research requires some sort of data analysis and processing. Therefore, planning the management of the data used and generated in a research project has become an integral part of scientific methodology and is considered one of the necessary elements of good research practice.<\/p>\n<p>Research projects focus mainly on the beginning and middle of the cycle \u2013 that is, planning the collection of the data to be used, eliminating errors (data cleaning), and storing the data appropriately so that the desired analyses can be carried out for knowledge production. These activities present major challenges, both for researchers who use and produce data in their own research and for those who conduct research on data management. The latter may be, for example, computing researchers or data librarians (usually engaged in curation and preservation activities). Regardless of the label, the management of research data has been giving rise to many new lines of research in Computer Science, and this number tends to increase as new challenges appear.<\/p>\n<p style=\"text-align: center;\"><a href=\"http:\/\/blog.scielo.org\/en\/wp-content\/uploads\/sites\/2\/2018\/06\/fig-1.jpg\" target=\"_blank\" rel=\"noopener\"><img loading=\"lazy\" decoding=\"async\" class=\"aligncenter wp-image-2945 size-full\" src=\"http:\/\/blog.scielo.org\/en\/wp-content\/uploads\/sites\/2\/2018\/06\/fig-1.jpg\" alt=\"\" width=\"700\" height=\"702\" srcset=\"https:\/\/blog.scielo.org\/en\/wp-content\/uploads\/sites\/2\/2018\/06\/fig-1.jpg 700w, https:\/\/blog.scielo.org\/en\/wp-content\/uploads\/sites\/2\/2018\/06\/fig-1-150x150.jpg 150w, https:\/\/blog.scielo.org\/en\/wp-content\/uploads\/sites\/2\/2018\/06\/fig-1-300x300.jpg 300w\" sizes=\"auto, (max-width: 700px) 100vw, 700px\" \/><\/a><strong>The figure above<sup>1<\/sup>, from the JISC<sup>2<\/sup> website in the UK (<a href=\"https:\/\/www.jisc.ac.uk\/\" target=\"_blank\" 
rel=\"noopener\">https:\/\/www.jisc.ac.uk\/<\/a>), shows one of the many possible views of the research data life cycle.<\/strong><\/p>\n<p>The emphasis that the figure above places on data maintenance and preservation points to a very important fact \u2013 management planning goes far beyond the duration of a project, since the data must remain available for as long as possible. This raises the problem of the costs associated with the life cycle. Several studies show that the cost of preservation rises over time and that, in the medium and long term, it far outweighs the initial collection (or generation) and cleaning costs. One reason for this is the technological evolution of digital storage media \u2013 within a few years they become obsolete, requiring data curators to copy the data to a different, more modern medium to keep it from becoming unreadable.<\/p>\n<p>Curatorial activity therefore also needs to take into consideration which data sets should be preserved and for how long. 
A 2013 study found that, after 20 years, 80% of the data used to produce scientific articles are no longer available.<\/p>\n<p style=\"text-align: center;\"><a href=\"http:\/\/blog.scielo.org\/en\/wp-content\/uploads\/sites\/2\/2018\/06\/fig-2.jpg\" target=\"_blank\" rel=\"noopener\"><img loading=\"lazy\" decoding=\"async\" class=\"aligncenter wp-image-2943 size-full\" src=\"http:\/\/blog.scielo.org\/en\/wp-content\/uploads\/sites\/2\/2018\/06\/fig-2.jpg\" alt=\"\" width=\"949\" height=\"776\" srcset=\"https:\/\/blog.scielo.org\/en\/wp-content\/uploads\/sites\/2\/2018\/06\/fig-2.jpg 949w, https:\/\/blog.scielo.org\/en\/wp-content\/uploads\/sites\/2\/2018\/06\/fig-2-300x245.jpg 300w, https:\/\/blog.scielo.org\/en\/wp-content\/uploads\/sites\/2\/2018\/06\/fig-2-768x628.jpg 768w, https:\/\/blog.scielo.org\/en\/wp-content\/uploads\/sites\/2\/2018\/06\/fig-2-150x123.jpg 150w\" sizes=\"auto, (max-width: 949px) 100vw, 949px\" \/><\/a><strong>The figure above<sup>3<\/sup>, taken from the article by Gibney and Van Noorden<sup>4<\/sup>, illustrates the disappearance of these data.<\/strong><\/p>\n<p>Open Science presupposes Open Data (where the concept of &#8220;data&#8221; is very broad, including any type of stored digital object). There are several definitions of &#8220;open data&#8221;, but perhaps the most interesting is the one that defines it as data sets whose metadata must be public. In other words, anyone can find out, using search engines, whether the data exists and how to get it. However, the data itself is not necessarily public \u2013 it may be usable only by restricted research groups, for ethical or privacy reasons, for example.<\/p>\n<p>In this context, data management presents yet another challenge: how to specify metadata in a way that allows the associated data to be considered &#8220;open&#8221;? 
This, in turn, requires the development of new metadata standards, organization of metadata repositories, and metadata mining and search systems.<\/p>\n<h3>Notes<\/h3>\n<p>1. KAYE, J. Storing and sharing research data after the \u2018Space Race\u2019 [online]. Jisc. 2015 [viewed 22 June 2018]. Available from: <a href=\"https:\/\/www.jisc.ac.uk\/blog\/storing-and-sharing-research-data-after-the-space-race-25-feb-2015\" target=\"_blank\" rel=\"noopener\">https:\/\/www.jisc.ac.uk\/blog\/storing-and-sharing-research-data-after-the-space-race-25-feb-2015<\/a><\/p>\n<p>2. This vision privileges the aspects of storage, preservation and organization of scientific data repositories. JISC is one of the leading UK bodies supporting the management and curation of scientific data associated with education. It therefore supports universities and educational institutions in all aspects associated with data management. Another equally important UK body is the Digital Curation Center (DCC &#8211; <a href=\"http:\/\/www.dcc.ac.uk\/\" target=\"_blank\" rel=\"noopener\">http:\/\/www.dcc.ac.uk\/<\/a>), which is mainly concerned with data curation. DCC and JISC provide a wealth of teaching materials on scientific data management, as well as training for researchers and information management professionals. Several other large centers deal with these aspects, such as Australian ANDS (<a href=\"http:\/\/ands.org.au\" target=\"_blank\" rel=\"noopener\">http:\/\/ands.org.au<\/a>), Dutch DANS (<a href=\"https:\/\/dans.knaw.nl\/en\" target=\"_blank\" rel=\"noopener\">https:\/\/dans.knaw.nl\/en<\/a>), or Canadian Portage (<a href=\"https:\/\/portagenetwork.ca\" target=\"_blank\" rel=\"noopener\">https:\/\/portagenetwork.ca<\/a>).<\/p>\n<p>3. GIBNEY, E. and VAN NOORDEN, R. Scientists losing data at a rapid rate [online]. Nature. 2013 [viewed 22 June 2018]. 
Available from: <a href=\"https:\/\/www.nature.com\/news\/scientists-losing-data-at-a-rapid-rate-1.14416\" target=\"_blank\" rel=\"noopener\">https:\/\/www.nature.com\/news\/scientists-losing-data-at-a-rapid-rate-1.14416<\/a><\/p>\n<p>4. It should be noted that the article only examined data associated with publications. However, there are huge data sets that serve as a basis for research of all kinds but are not directly associated with any particular article. A typical example is satellite image time series, which feed studies on crop forecasting or climatology. Yet another example is data from carbon capture towers, installed around the world, used in global warming research. These types of data, once collected and preserved, serve a large number of studies over many years. Another major challenge of data management is, thus, associated with preservation procedures.<\/p>\n<h3>References<\/h3>\n<p>GIBNEY, E. and VAN NOORDEN, R. Scientists losing data at a rapid rate [online]. Nature. 2013 [viewed 22 June 2018]. Available from: <a href=\"https:\/\/www.nature.com\/news\/scientists-losing-data-at-a-rapid-rate-1.14416\" target=\"_blank\" rel=\"noopener\">https:\/\/www.nature.com\/news\/scientists-losing-data-at-a-rapid-rate-1.14416<\/a><\/p>\n<p>KAYE, J. Storing and sharing research data after the \u2018Space Race\u2019 [online]. Jisc. 2015 [viewed 22 June 2018]. 
Available from: <a href=\"https:\/\/www.jisc.ac.uk\/blog\/storing-and-sharing-research-data-after-the-space-race-25-feb-2015\" target=\"_blank\" rel=\"noopener\">https:\/\/www.jisc.ac.uk\/blog\/storing-and-sharing-research-data-after-the-space-race-25-feb-2015<\/a><\/p>\n<p>&nbsp;<\/p>\n<h3>About Claudia Bauzer Medeiros<\/h3>\n<p><a href=\"http:\/\/blog.scielo.org\/en\/wp-content\/uploads\/sites\/2\/2018\/06\/BauzerMedeiros-Claudia2_carousel.jpg\" target=\"_blank\" rel=\"noopener\"><img loading=\"lazy\" decoding=\"async\" class=\"alignright wp-image-2955 size-thumbnail\" src=\"http:\/\/blog.scielo.org\/en\/wp-content\/uploads\/sites\/2\/2018\/06\/BauzerMedeiros-Claudia2_carousel-150x150.jpg\" alt=\"\" width=\"150\" height=\"150\" srcset=\"https:\/\/blog.scielo.org\/en\/wp-content\/uploads\/sites\/2\/2018\/06\/BauzerMedeiros-Claudia2_carousel-150x150.jpg 150w, https:\/\/blog.scielo.org\/en\/wp-content\/uploads\/sites\/2\/2018\/06\/BauzerMedeiros-Claudia2_carousel.jpg 300w\" sizes=\"auto, (max-width: 150px) 100vw, 150px\" \/><\/a><\/p>\n<p>Full Professor at Instituto de Computa\u00e7\u00e3o, UNICAMP, received national and international awards for excellence in teaching, research, and for attracting women in IT. Coordinates the FAPESP eScience and Data Science Program. Commander of the National Order of Scientific Merit; Doctor <em>Honoris Causa<\/em> at University Antenor Arrego (Peru) and University Paris-Dauphine (France). Member of the Research Data Alliance Board.<\/p>\n<p>&nbsp;<\/p>\n<p>Translated from the original in <a href=\"https:\/\/blog.scielo.org\/blog\/2018\/06\/22\/gestao-de-dados-cientificos-da-coleta-a-preservacao\/\" target=\"_blank\" rel=\"noopener\">Portuguese<\/a> by Lilian Nassi-Cal\u00f2.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>Proper management of data used in scientific research has become a mandatory part of good research practices. 
The Open Science era has revolutionized scientific methodology, motivating the emergence of new lines of research in all areas of knowledge. This post describes some challenges of this management from the computational point of view. <span class=\"ellipsis\">&hellip;<\/span> <span class=\"more-link-wrap\"><a href=\"https:\/\/blog.scielo.org\/en\/2018\/06\/22\/scientific-data-management-from-collection-to-preservation\/\" class=\"more-link\"><span>Read More &rarr;<\/span><\/a><\/span><\/p>\n","protected":false},"author":5,"featured_media":2946,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":"","_links_to":"","_links_to_target":""},"categories":[3],"tags":[44,50,68,67],"class_list":["post-2942","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-analysis","tag-digital-preservation","tag-open-data","tag-open-science","tag-scielo-20-years"],"aioseo_notices":[],"_links":{"self":[{"href":"https:\/\/blog.scielo.org\/en\/wp-json\/wp\/v2\/posts\/2942","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/blog.scielo.org\/en\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/blog.scielo.org\/en\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/blog.scielo.org\/en\/wp-json\/wp\/v2\/users\/5"}],"replies":[{"embeddable":true,"href":"https:\/\/blog.scielo.org\/en\/wp-json\/wp\/v2\/comments?post=2942"}],"version-history":[{"count":7,"href":"https:\/\/blog.scielo.org\/en\/wp-json\/wp\/v2\/posts\/2942\/revisions"}],"predecessor-version":[{"id":2957,"href":"https:\/\/blog.scielo.org\/en\/wp-json\/wp\/v2\/posts\/2942\/revisions\/2957"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/blog.scielo.org\/en\/wp-json\/wp\/v2\/media\/2946"}],"wp:attachment":[{"href":"https:\/\/blog.scielo.org\/en\/wp-json\/wp\/v2\/media?parent=2942"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/blog.scielo.org\/en\/wp-js
on\/wp\/v2\/categories?post=2942"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/blog.scielo.org\/en\/wp-json\/wp\/v2\/tags?post=2942"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}