Why XML?

By Abel L. Packer, Eliana Salgado, Javani Araujo, Letícia Aquino, Renata Almeida, Jesner Santos, Suely Lucena, Caroline M. Soares

Image: Example of article produced with the “SciELO Publishing Schema”

A notable advance in the editing, publishing and interoperability of SciELO journals is the structuring of full texts in XML, which will be adopted in the operation of all SciELO journals from 2015. The SciELO Program has been promoting this improvement in the methodology and technology of text treatment since 2012. Health sciences journals began adopting it in 2014.

Why this change? What are the advantages? What are the main challenges to be overcome?

XML stands for eXtensible Markup Language. It is a language, or rather a metalanguage, that allows you to define rules (or languages, hence the name ‘extensible’) that specify how to label significant parts of a text, including words, phrases, numbers, formulas, etc. For example, in the text of an article, you can mark its bibliographic elements such as title, authors, abstract, keywords, sections, paragraphs, tables, figures, citations, bibliographic references, etc. In the case of scientific articles, the original manuscript is normally prepared with the aid of a text editor and, after several rounds of evaluation and editing, is ready for publication. The text of an article follows, in general, a certain structure, starting with the title, followed by the authors and so on. In the contemporary publishing process, XML is used to structure precisely these elements of articles and other documents. Each element is identified by a label (tag). Thus, for example, the author Albert Einstein could be identified or labeled as:

<author> <surname>Einstein</surname><name>Albert</name></author>
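To illustrate how software can read such a label, here is a minimal sketch in Python using the standard library module xml.etree.ElementTree. The "SURNAME, Given name" display style is an assumption chosen for illustration and is not a SciELO rule.

```python
import xml.etree.ElementTree as ET

# The author element from the example above.
snippet = "<author><surname>Einstein</surname><name>Albert</name></author>"

author = ET.fromstring(snippet)

# Each labeled part can be retrieved by its tag name.
surname = author.findtext("surname")
given_name = author.findtext("name")

# Hypothetical display style: "SURNAME, Given name".
citation_name = f"{surname.upper()}, {given_name}"
print(citation_name)  # EINSTEIN, Albert
```

Because the surname and the given name are labeled separately, the program does not have to guess which word is which.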

Structured texts can be processed extensively by computer software. It is thus possible to extract the article metadata from the marked text (title, authors, abstract, keywords, journal, volume, number, pages, date of submission, date of approval, and others) and to build its bibliographic reference. In other words, the classic process of preparing the bibliographic reference visually and manually from the first page of the printed manuscript is no longer necessary. This extraction, moreover, ensures that the bibliographic reference is true to the text of the article, avoiding transcription errors. Precise identification of the bibliographic elements also makes it possible, as part of the processing of the marked texts, to verify their correctness and consistency, that is, to check whether all expected bibliographic elements are present and obey formation rules. For example, if a date of submission states “April 31, 2014”, the error can be detected. This ability applies to all structured elements: you can check the specification and consistency of sections, paragraphs, tables, figures, and especially the bibliographic references of the documents cited in the text. Having its structure and components identified and processed by computer software is the main feature a text gains from the use of the XML standard. Thanks to this capacity, marked texts can be stored in databases, exchanged between systems on the Web and presented in different formats.
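The two operations described above, building a bibliographic reference from the marked metadata and detecting an impossible date, can be sketched as follows. This is only an illustration: the tag names, the sample title and journal, and the reference layout are all hypothetical and are not taken from JATS or from the SciELO Publishing Schema.

```python
import datetime
import xml.etree.ElementTree as ET

# Hypothetical, simplified article record; tag names and values are illustrative.
ARTICLE_XML = """
<article>
  <title>On a Sample Topic</title>
  <author><surname>Einstein</surname><name>Albert</name></author>
  <journal>Sample Journal</journal>
  <volume>12</volume>
  <number>3</number>
  <pages>45-67</pages>
  <submitted year="2014" month="4" day="31"/>
</article>
""".strip()


def build_reference(article):
    """Assemble a reference string from the labeled metadata."""
    authors = "; ".join(
        f"{a.findtext('surname')}, {a.findtext('name')}"
        for a in article.findall("author")
    )
    return (
        f"{authors}. {article.findtext('title')}. "
        f"{article.findtext('journal')}, v. {article.findtext('volume')}, "
        f"n. {article.findtext('number')}, p. {article.findtext('pages')}."
    )


def check_date(element):
    """Return an error message if the date cannot exist, otherwise None."""
    try:
        datetime.date(
            int(element.get("year")),
            int(element.get("month")),
            int(element.get("day")),
        )
    except (TypeError, ValueError) as exc:
        return f"invalid <{element.tag}> date: {exc}"
    return None


article = ET.fromstring(ARTICLE_XML)
reference = build_reference(article)
date_error = check_date(article.find("submitted"))

print(reference)
print(date_error)  # April has 30 days, so an error message is reported
```

The reference is generated from the same labeled elements that appear in the article, so it cannot drift from the text, and the date check fails because no calendar contains April 31.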

Thus, in SciELO, articles in XML are exchanged with bibliographic indexes and other scientific information systems, each with its own computer systems and text structures. Another key characteristic of XML texts is that they can be presented in different reading formats, with different font sizes, line lengths and page sizes, navigation between sections, etc. This capability is particularly important today, given the variety of reading devices, from traditional desktop displays to mobile devices such as tablets and smartphones.
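As a rough sketch of this idea, the same marked text can be turned into an HTML view and into a plain-text view wrapped to a narrow line length. The tag names, the sample sentences and both functions are hypothetical; production systems would more likely use dedicated stylesheets.

```python
import textwrap
import xml.etree.ElementTree as ET
from html import escape

# Hypothetical, simplified document with a title and two paragraphs.
DOC = (
    "<article><title>Why XML?</title>"
    "<p>Structured texts can be processed by software.</p>"
    "<p>They can be shown on many devices.</p></article>"
)


def to_html(article):
    """Render the title as a heading and each paragraph as an HTML paragraph."""
    parts = [f"<h1>{escape(article.findtext('title'))}</h1>"]
    parts += [f"<p>{escape(p.text)}</p>" for p in article.findall("p")]
    return "\n".join(parts)


def to_plain_text(article, width):
    """Render plain text wrapped to the given line width, e.g. for a small screen."""
    blocks = [article.findtext("title").upper()]
    blocks += [textwrap.fill(p.text, width=width) for p in article.findall("p")]
    return "\n\n".join(blocks)


article = ET.fromstring(DOC)
html_view = to_html(article)
narrow_view = to_plain_text(article, width=30)

print(html_view)
print(narrow_view)
```

Both views come from a single source file, so a correction made in the XML appears in every presentation format.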

With XML you can define different structures for tagging the text of articles. However, it is advisable to follow an international standard that is recognized by different international systems. SciELO could have defined its own article structure, but then all interaction with external systems would have required specific treatment or conversion to international standards. Therefore, SciELO adopted an already established international standard, the Journal Article Tag Suite (JATS). JATS has its origin in the Journal Archiving and Interchange Tag Suite created by the National Library of Medicine of the United States to mark the texts of articles published and stored in PubMed Central (PMC). Revisions of the PMC tag set gave rise to JATS as a standard of the National Information Standards Organization (NISO) of the United States, identified as JATS: Journal Article Tag Suite, version 1.0 (ANSI/NISO Z39.96-2012).

However, to meet the demands of SciELO processing, it was necessary to add new tags to JATS. The possibility of adding new tags is one of the features of the XML language. The new tags specify in detail the different levels of author affiliation, including, for example, university, faculty and department. SciELO also added a tag that identifies research funding agencies. Another important change introduced by SciELO is the detailed specification of bibliographic references, which is necessary for assembling the bibliometric database.

Using XML to specify the structure of a particular kind of text, be it a scientific article, a cooking recipe or an invoice, yields a specification that is usually stored in a file containing the rules for structuring the text. These rules may be formulated in one of two ways. The first is called Document Type Definition (DTD) and the other XML Schema Definition (XSD). DTD was the first way to specify an XML application, but due to various limitations it has gradually been replaced by XSD, which offers greater flexibility. The DTD or XSD files tell computer software how to “read” and process the marked texts. For SciELO articles, the XSD used is, as seen above, derived from JATS and identified as the SciELO Publishing Schema.
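To give a feel for what such a file of rules does, the sketch below checks a text against a tiny rule table written directly in Python. This is only a stand-in: the table and the function are hypothetical, a real DTD or XSD expresses far richer rules, and actual validation would be done with a validating parser, which Python's standard library does not include.

```python
import xml.etree.ElementTree as ET

# Hypothetical rules: for each element, the child elements it must contain.
REQUIRED_CHILDREN = {
    "article": ["title", "author"],
    "author": ["surname", "name"],
}


def find_missing(element, path=""):
    """Walk the tree and report every required child that is absent."""
    here = f"{path}/{element.tag}"
    problems = [
        f"{here}: missing <{child}>"
        for child in REQUIRED_CHILDREN.get(element.tag, [])
        if element.find(child) is None
    ]
    for child in element:
        problems.extend(find_missing(child, here))
    return problems


# An article whose author lacks the <name> element.
incomplete = ET.fromstring(
    "<article><title>Why XML?</title>"
    "<author><surname>Einstein</surname></author></article>"
)
problems = find_missing(incomplete)
print(problems)  # ['/article/author: missing <name>']
```

Keeping the rules in a table separate from the checking code mirrors the role of a schema file: the rules can change without rewriting the software that applies them.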

SciELO has used text tagging since the creation of its publication methodology in 1997. At that time, the standard markup language for texts was the Standard Generalized Markup Language (SGML). SciELO adopted SGML based on the ISO standard DTD for marking texts, identified as ISO 12083-1994 (Electronic Manuscript Preparation and Markup). The use of the DTD was restricted to marking the bibliographic elements at the front of the article, to generate the bibliographic reference, and the final part, to identify the references. In the body of the articles, only the beginning and end of each paragraph were labeled, without further detail. This tagging is done on the final text of the article, usually in PDF format, previously converted to HTML. This solution allowed SciELO to work with marked texts without interfering in the publication process of the journals. However, this markup methodology no longer meets current demands for the structuring, exchange and presentation of articles and other scientific documents. For this reason, in 2012 the SciELO Program began promoting the adoption of the new system of detailed markup of full texts.

An essential feature of this change is that the XML file becomes the final version of the text, i.e. the PDF, ePUB and HTML presentation formats are derived from the XML. XML files are also better suited to digital preservation, as they can be processed by new storage, transfer and presentation technologies.

This advance in the text structuring of SciELO articles is an integral part of SciELO's priority lines of action for professionalization and internationalization, which translate into the development of national capacity to produce journals according to the international state of the art. To promote this advance, SciELO has encouraged domestic companies to develop the capacity and their own solutions to process the new SciELO Publishing Schema, and has also encouraged the participation of international companies. SciELO also offers training to journal teams that choose to process their own texts.

Important changes like this always present a challenge, particularly for journals produced with limited financial resources, limited professionalization and limited capacity to adopt innovations. In addition, some publishers have difficulty recognizing and evaluating the gains they will obtain from adopting XML. To respond to these situations, the SciELO Program has been promoting this change well in advance, so that all journals can develop the conditions to incorporate it into their production processes.