Self-archiving covers how different forms of academic output, such as original research, theses, study materials and hand-outs, and other activities connected with academic knowledge, are published in open access. It is nothing new that the majority of universities maintain institutional repositories. This happens in Iberoamerica too, although it is mainly the case in developed countries. The most direct and comprehensive way to ascertain whether these repositories accomplish their objectives is to find out what sort of presence and impact they have on the web, and in particular in the two major search engines, Google and Google Scholar.
A recent article entitled “Are Latin-American repositories invisible on Google and Google Scholar?”¹ revealed some surprisingly poor results compared with the expectations of the researchers who authored it. These results prompted an interesting discussion on the INCYT network² during the last week of June of this year.
In this post, we will analyze the most likely reasons which could account for the poor performance of repositories, and also question whether repositories are the right resource for increasing the visibility of academic output, and for that matter, its academic impact. We will examine the conclusions reached by the discussions on INCYT and will put forward additional technical material to support them.
The poor visibility and coverage that repositories generally have in Google and Google Scholar are nothing new. A seminal article by Arlitsch and O’Brien³ in 2012, which analyzed 21 US university repositories, revealed that the indexing coverage in Google and Google Scholar is low, with approximately 30% of the documents in the repositories indexed in Google Scholar. A similar analysis was subsequently carried out on the document repository of the World Bank⁴, which revealed that only 17.5% of more than 15,000 documents are indexed in Google and Google Scholar. Finally, the investigation we are commenting on analyzed the visibility and web impact of 127 Latin-American repositories containing a total of 113,000 PDF documents. It found that Google achieved a coverage of only 48.3%, with scarcely 2.5% of the documents being detected in Google Scholar. If the search is widened to all document types, the retrieval ratio is much greater in Google, whereas Google Scholar only reaches around a third of the 113,000 documents contained in these repositories.
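The coverage figures quoted above are simple ratios of indexed documents to deposited documents. As a minimal sketch of that arithmetic (the function names are illustrative, and the document counts it produces are estimates derived from the quoted percentages, not figures reported by the studies), the calculation could look like this:

```python
def coverage_ratio(indexed: int, total: int) -> float:
    """Return the share of deposited documents a search engine found, as a percentage."""
    if total <= 0:
        raise ValueError("total must be a positive number of documents")
    return 100.0 * indexed / total


def estimated_indexed(total: int, coverage_percent: float) -> int:
    """Estimate how many documents a given coverage percentage corresponds to."""
    return round(total * coverage_percent / 100.0)


# Percentages quoted in the text for the 127 Latin-American repositories.
TOTAL_PDFS = 113_000
google_docs = estimated_indexed(TOTAL_PDFS, 48.3)   # PDF coverage in Google
scholar_docs = estimated_indexed(TOTAL_PDFS, 2.5)   # PDF coverage in Google Scholar

print(f"Estimated PDFs found in Google: {google_docs}")
print(f"Estimated PDFs found in Google Scholar: {scholar_docs}")
```

Read this way, the gap between the two search engines is a matter of tens of thousands of documents, not a rounding difference.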
The questions which arise are, at the very least, the following:
- Why do repositories have such poor visibility?
- What measures can be taken to ensure that the documents contained therein are correctly indexed to make them visible?
- What effect does this situation have on the career prospects of a researcher who wants to publish in open access?
- Are repositories suitable tools for making academic output visible?
Some of the technical reasons behind these poor results could be:
- Problems with the Google and Google Scholar web crawlers and with the procedures for the retrieval of indexed documents. It must be noted that Google and Google Scholar use different databases as well as different crawlers and indexing criteria. This is why the results differ so much between the two.
- Problems with the structure of the deposited documents and their associated metadata which do not follow “best practices”.
- Problems with the architecture of the websites of the repositories which host the documents.
The reasons given above do not imply that institutional repositories are badly managed, or that they do not keep statistics relating to their contents or number of downloads, or that it is not possible to retrieve the information deposited there. Two examples of important benchmark repositories are the Biblioteca Digital de Unicamp (Unicamp Digital Library) with more than 40,000 theses, and the Red Federada de Repositorios Institucionales de Publicaciones Científicas (Federated Network of Institutional Repositories of Academic Information) with more than 800,000 documents. The problem is that, generally speaking, people carrying out research do not go directly to a particular repository to find out what has been deposited there; rather, in the vast majority of cases, they look for what they want using Google and/or Google Scholar. In other words, it is these search engines themselves which have a significant impact on content visibility. Years ago, it used to be said that “If you are not on the Internet, then you don’t exist”. Now, it could be said, “If you are on the Internet, but not visible to Google, you don’t exist either”.
Some of the technical problems experienced by repositories are mentioned in references (ORDUÑA-MALEA 2014; TAY 2014; Inclusion Guidelines for Webmasters), given at the end of this article. These are summarized briefly below:
- Articles stored in a repository should be available in full text in open access or, at the very least, with an abstract prepared by the author.
- The repository should not request pre-registration by users or web crawlers to gain access to it, nor expect users to install special software, accept “disclaimers”, block popups, view advertisements, click on links or buttons, or scroll down the page before they can read the article abstracts.
- Repositories which have login pages, or simply provide bibliographic data without abstracts will not be included and will be removed from Google Scholar even if they have been previously indexed there.
- If the web crawlers are unable to retrieve pages because of server errors, incorrect configurations or slow response times, it may be that these documents, if they are already in Google or Google Scholar, will be removed from the respective search engine.
- HTML or PDF documents should be indexable, and must have searchable text i.e. users must be able to search for and find words in the document concerned using Adobe Acrobat Reader.
- Each document should be less than 5MB in size. Documents that are larger than this, or whose pages consist of images, should be uploaded to Google Book Search.
- One of the most frequent reasons why documents are incorrectly indexed is the use of the Dublin Core (DC) metadata schema. Google clearly states in its “Inclusion Guidelines for Webmasters” that DC is the least recommended schema, and gives preference to other schemas such as those of Highwire Press and JSTOR. It is interesting to note that Google also recommends the indexing schema of SciELO (see the paragraph at the end of reference Inclusion Guidelines for Webmasters⁵). The reason why Google Scholar does not favor DC and does not support OAI-PMH is that DC metadata is not adequate for describing journal articles, because it is ambiguous when it comes to describing a journal title, volume, issue and pagination. In the previously mentioned article by Arlitsch (2012) on repositories of American universities, when the schema was changed from Dublin Core to other metadata schemas more acceptable to Google and Google Scholar, the visibility of the repositories increased considerably.
- Google does not include the full text of articles in its indexing, but Google Scholar does. Google Scholar is not a limited version of Google. It is, in fact, a completely different service. For example, Google Scholar does not include entries in Wikipedia or blogs, whereas Google does. Only Google Scholar includes full text in its retrieval index. For this reason, Google Scholar has negotiated special rights with Elsevier, Sage, Science Direct and others to index the full text of articles which are behind a paywall, but for which the titles and abstracts appear in the search results⁶.
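To illustrate the metadata point in the list above, here is a minimal sketch of how a repository landing page could expose Highwire Press-style `citation_*` meta tags instead of relying on Dublin Core alone. The tag names are the ones I understand the Google Scholar inclusion guidelines to describe and should be checked against the current guidelines; the function name, the keys of the record dictionary, and the example record are hypothetical and do not come from any real repository software.

```python
from html import escape


def highwire_meta_tags(record: dict) -> str:
    """Build Highwire Press-style <meta> tags for one repository record."""
    tags = []

    def add(name: str, value) -> None:
        # Skip missing fields rather than emit empty tags.
        if value:
            content = escape(str(value), quote=True)
            tags.append(f'<meta name="{name}" content="{content}">')

    add("citation_title", record.get("title"))
    # One tag per author, in the order given on the paper.
    for author in record.get("authors", []):
        add("citation_author", author)
    add("citation_publication_date", record.get("date"))
    # Journal title, volume, issue and pages each get their own field,
    # which is where plain Dublin Core is ambiguous.
    add("citation_journal_title", record.get("journal"))
    add("citation_volume", record.get("volume"))
    add("citation_issue", record.get("issue"))
    add("citation_firstpage", record.get("first_page"))
    add("citation_lastpage", record.get("last_page"))
    add("citation_pdf_url", record.get("pdf_url"))
    return "\n".join(tags)


# Hypothetical record; none of these values describe a real deposit.
example = {
    "title": "An example article",
    "authors": ["Pérez, Ana", "Silva, João"],
    "date": "2014/06/30",
    "journal": "Example Journal",
    "volume": "30",
    "issue": "1",
    "first_page": "60",
    "last_page": "81",
    "pdf_url": "http://repository.example.org/123.pdf",
}
print(highwire_meta_tags(example))
```

The resulting tags would be placed in the `<head>` of the abstract page for each document, so that a crawler can read the journal title, volume, issue and pagination as separate, unambiguous fields.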
This research highlights the low visibility of Latin American academic output on the web which, for the most part, is not published in mainstream journals and so does not end up in WoS or Scopus. The low indexing coverage of repositories in Google and Google Scholar seriously undermines the advantages of OA, in particular Green Road OA, because the mass of material in repositories will remain hidden from users and will not be found in Google Scholar. It will only be found by accessing it directly in the respective repositories.
One of the main reasons for this problem is flaws in the architecture of the systems used to create and maintain repositories. Another important reason is the use of inappropriate metadata schemas, such as Dublin Core. Overcoming these problems is in the hands of Latin Americans.
Repositories are valuable institutional tools where materials produced by the institution’s academic activity are deposited, and these materials go beyond the traditional academic journal article to include items such as presentations at congresses, theses, slide presentations, videos, statistical information, and so on. Consequently, the value of these repositories should be measured from different perspectives and objectives, in the same way that journals that are not mainstream are evaluated.
When academics deposit their work in a repository because they must comply with the “ritual” of “publish or perish”, their main intention is not to generate “impact”. Likewise, when an academic publishes in a local “vanity” journal, or deposits a PowerPoint presentation or a graduate thesis, the repository serves legitimate goals, but these have nothing to do with the global competition for citations, impact, and so on.
The objective changes once a researcher seeks career advancement and competes in the “Big League”. In this case, efforts will be directed at being published in the best possible journals in the field, and the deposit of a copy of the article in an OA repository is really a Plan B.
My thoughts
- Assessment policies for research, institutions and research departments, research groups, and individual researchers are by and large based on the traditional scientometric indicators, whether we like it or not. Thus, repositories carry little or no weight in assessing the performance of academic research and its players.
- Generally speaking, repositories do not have selection criteria based on quality and academic innovation, although they have other important purposes:
  - bibliographic control.
  - preservation.
  - dealing with institutional and national policies of open access.
  - complementing bibliographic indexes with access to the full texts.
- The evaluation of repositories in relation to the above-mentioned functions should be performed using internal comparisons over time, measuring the increase in the number of documents, downloads, and mentions in social networks, and possibly by comparing them with repositories of equal standing.
- Finally, looking beyond the technical problems of management, interoperability and visibility of repositories, the conclusion is that, as a means of scholarly communication, they are very limited.
- If the aim is to work towards improving the impact and visibility of scholarly production, works should be published in journals that have professional support, that meet the highest technical requirements to achieve the maximum indexation and impact, that have state-of-the-art editorial processes, that have peer review, measures to combat plagiarism and so on.
This is what the SciELO Program does.
Notes
¹ ORDUÑA-MALEA, E., et al. Are Latin-American repositories invisible on Google and Google Scholar? EC3 Google Scholar Digest Reviews. 2014, nº 3. Available from: http://googlescholardigest.blogspot.com.es/2014/06/are-latin-americanrepositories.html
² INCYT: Indicadores en Ciencia y Tecnología. – http://listserv.rediris.es/cgi-bin/wa?A1=ind1406D&L=INCYT (June 2014 archive).
³ ARLITSCH, K., and O’BRIEN, P.S. Invisible institutional repositories: addressing the low indexing ratios of IRs in Google. Library Hi Tech. 2012, vol. 30, nº 1, pp. 60-81. Available from: https://jira.duraspace.org/secure/attachment/13020/Invisible_institutional.pdf
⁴ MARTÍN-MARTÍN, A., et al. The World Bank’s policy reports in Google Scholar. Are they visible, cited, and downloaded? EC3 Google Scholar Digest Reviews. 2014, nº 2. Available from: http://googlescholardigest.blogspot.com.es/2014/06/world-banks-policy-reports-google-scholar.html
⁵ Inclusion Guidelines for Webmasters. Google Scholar. Available from: http://scholar.google.com/intl/en/scholar/inclusion.html
⁶ Aaron Tay – http://3.bp.blogspot.com/-5ASx7eh_exA/U46oG0wE51I/AAAAAAAALms/Rf1d3sqf0Z8/s1600/eslevier2013.png
References
Inclusion Guidelines for Webmasters. Google Scholar. Available from: http://scholar.google.com/intl/en/scholar/inclusion.html
Inclusion Guidelines for Webmasters: indexing. Google Scholar. Available from: http://scholar.google.com.sg/intl/en/scholar/inclusion.html#indexing
ORDUÑA-MALEA, E., and LÓPEZ-CÓZAR, E.D. The dark side of Open Access in Google and Google Scholar: the case of Latin-American repositories. Paper accepted for publication in Scientometrics. Available from: http://arxiv.org/ftp/arxiv/papers/1406/1406.4331.pdf
ORDUÑA-MALEA, E., et al. Are Latin-American repositories invisible on Google and Google Scholar? EC3 Google Scholar Digest Reviews. 2014, nº 3. Available from: http://googlescholardigest.blogspot.com.es/2014/06/are-latin-americanrepositories.html
TAY, A. 8 surprising things I learnt about Google Scholar. Musings about librarianship. 2014. Available from: http://musingsaboutlibrarianship.blogspot.sg/2014/06/8-surprising-things-i-learnt-about.html#.U95p9fnZSYI
External links
Unicamp Digital Library – http://www.bibliotecadigital.unicamp.br/indicadores/index.php
Red Federada de Repositorios Institucionales de Publicaciones Científicas – http://www.lareferencia.info/vufind/
About Ernesto Spinak
Collaborator on the SciELO program, a Systems Engineer with a Bachelor’s degree in Library Science, and a Diploma of Advanced Studies from the Universitat Oberta de Catalunya (Barcelona, Spain) and a Master’s in “Sociedad de la Información” (Information Society) from the same university. Currently has a consulting company that provides services in information projects to 14 government institutions and universities in Uruguay.
Translated from the original in Spanish by Nicholas Cop Consulting.