Elsevier

Journal of Informetrics

Volume 12, Issue 4, November 2018, Pages 1160-1177

Regular article
Google Scholar, Web of Science, and Scopus: A systematic comparison of citations in 252 subject categories

https://doi.org/10.1016/j.joi.2018.09.002

Highlights

  • Google Scholar found nearly all citations found by WoS (95%) and Scopus (92%), and a large number of unique citations.

  • About half of Google Scholar unique citations are not from journals. A significant minority (19–38%) are not in English.

  • Google Scholar unique citations have, on average, a much lower scientific impact than citations also found by WoS/Scopus.

  • Spearman correlations of citation counts between Google Scholar and WoS/Scopus are strong across all subjects (0.78–0.99).

Abstract

Despite citation counts from Google Scholar (GS), Web of Science (WoS), and Scopus being widely consulted by researchers and sometimes used in research evaluations, there is no recent or systematic evidence about the differences between them. In response, this paper investigates 2,448,055 citations to 2299 English-language highly-cited documents from 252 GS subject categories published in 2006, comparing GS, the WoS Core Collection, and Scopus. GS consistently found the largest percentage of citations across all areas (93%–96%), far ahead of Scopus (35%–77%) and WoS (27%–73%). GS found nearly all the WoS (95%) and Scopus (92%) citations. Most citations found only by GS were from non-journal sources (48%–65%), including theses, books, conference papers, and unpublished materials. Many were non-English (19%–38%), and they tended to be much less cited than citing sources that were also in Scopus or WoS. Despite the many unique GS citing sources, Spearman correlations between citation counts in GS and WoS or Scopus are high (0.78–0.99). They are lower in the Humanities, and lower between GS and WoS than between GS and Scopus. The results suggest that in all areas GS citation data is essentially a superset of WoS and Scopus, with substantial extra coverage.

Introduction

The launch of Google Scholar (GS) in November of 2004 brought the simplicity of Google searches to the academic environment, and revolutionized the way researchers and the public searched, found, and accessed academic information. Until that point, the coverage of academic databases depended on lists of selected sources (usually scientific journals). In contrast, and using automated methods, Google Scholar crawled the web and indexed any document with a seemingly academic structure. This inclusive approach gave GS potentially more comprehensive coverage of the scientific and scholarly literature compared to the two major existing multidisciplinary databases with selective journal-based inclusion policies, the Web of Science (WoS) and Scopus (Orduna-Malea, Ayllón, Martín-Martín, & Delgado López-Cózar, 2015).

Although citation data in Google Scholar was originally intended to be a means of identifying the most relevant documents for a given query, it could also be used for formal or informal research evaluations. The availability of free citation data in Google Scholar, together with the free software Publish or Perish (Harzing, 2007) to gather it, made citation analysis possible without a citation database subscription (Harzing & van der Wal, 2008). Nevertheless, GS has not enabled bulk access to its data, reportedly because its agreements with publishers preclude it (Van Noorden, 2014). Thus, third-party web-scraping software is currently the only practical way to extract more data from GS than permitted by Publish or Perish.

Despite its known errors and limitations, which are a consequence of its automated approach to document indexing (Delgado López-Cózar, Robinson-García, & Torres-Salinas, 2014; Jacsó, 2010), GS has been shown to be reliable and to have good coverage of disciplines and languages, especially in the Humanities and Social Sciences, where WoS and Scopus are known to be weak (Chavarro, Ràfols, & Tang, 2018; Mongeon & Paul-Hus, 2016; van Leeuwen, Moed, Tijssen, Visser, & Van Raan, 2001). Analyses of the coverage of GS, WoS, and Scopus across disciplines have compared the numbers of publications indexed or their average citation counts for samples of documents, authors, or journals, finding that GS consistently returned higher numbers of publications and citations (Harzing & Alakangas, 2016; Harzing, 2013; Mingers & Lipitakis, 2010; Prins, Costas, van Leeuwen, & Wouters, 2016). Citation counts from a range of different sources have been shown to correlate positively with GS citation counts at various levels of aggregation (Amara & Landry, 2012; De Groote & Raszewski, 2012; Delgado López-Cózar, Orduna-Malea, & Martín-Martín, 2018; Kousha & Thelwall, 2007; Martín-Martín, Orduna-Malea, & Delgado López-Cózar, 2018; Meho & Yang, 2007; Minasny, Hartemink, McBratney, & Jang, 2013; Moed, Bar-Ilan, & Halevi, 2016; Pauly & Stergiou, 2005; Rahimi & Chandrakumar, 2014; Wildgaard, 2015). See the supplementary materials; Delgado López-Cózar et al. (2018); Orduña-Malea, Martín-Martín, Ayllón, and Delgado López-Cózar (2016); and Halevi, Moed, and Bar-Ilan (2017) for discussions of the wider strengths and weaknesses of GS.

A key issue is the ability of GS, WoS, and Scopus to find citations to documents, and the extent to which each indexes citations that the others cannot find. The results of prior studies are difficult to reconcile, however, because they examined different sets of articles, all but one of them small. A summary of the results found in these previous studies is presented in Table 1. For example, the share of citations that are unique to GS varies between 13% and 67%, with the differences probably being due to the study year or the document types or disciplines covered. The only multidisciplinary study (Moed et al., 2016) checked articles in 12 journals from 6 subject areas, which is still a limited set.

The fields previously compared for citation sources (Table 1) are Library and Information Science (5 out of 10 articles analyse case studies about LIS documents/journals/researchers), Medicine (3 papers, analysing oncology, general medicine, and dentistry), Physics (2 articles: general and condensed matter), Chemistry (2 articles: general and inorganic), Computer Science (2 articles: general, and computational linguistics), Biology (2 articles: general, and virology), Social Work, Political Science, and Chinese Studies (1 article each). From this list it is clear that most academic fields have not been analysed for Google Scholar coverage. The studies used small samples of documents and citations (9 out of 10 papers analysed fewer than 10,000 citations), probably because of the difficulty of extracting data from GS, caused by the lack of a public API (Else, 2018; Van Noorden, 2014). Moreover, the most recent data in these studies was collected in 2015 (three years before the current study), and the oldest data is from 2005 (13 years ago).

Given the limited nature of all prior studies of citing sources for GS and the need to update all previous research, a comprehensive analysis of citation sources in GS, WoS, and Scopus across all subject areas is needed. This information is important for those deciding whether to use GS citation counts for informal or formal research evaluations. The following research questions drive this investigation.

  • How much overlap is there between GS, WoS, and Scopus in the citations that they find to academic documents and does this vary by subject?

  • Do the citing documents found only by GS differ in document type from the citing documents that GS shares with WoS or Scopus, and does this vary by subject?

  • How similar are citation counts in GS to those found in WoS and Scopus, at the level of subjects?
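The abstract answers the third question with Spearman correlations between citation counts. As a minimal sketch of that statistic, the Python below ranks two lists of citation counts (averaging ranks for ties) and computes the Pearson correlation of the ranks. The function names and the citation counts are hypothetical and for illustration only; this is not the authors' code or data.

```python
def average_ranks(values):
    """Return 1-based ranks, giving tied values the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the whole group of tied values.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / (var_x ** 0.5 * var_y ** 0.5)


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical citation counts for five documents in one subject category.
gs_counts = [120, 340, 95, 410, 60]
wos_counts = [50, 150, 40, 130, 20]

rho = spearman(gs_counts, wos_counts)
print(f"Spearman rho = {rho:.2f}")
```

Because the statistic depends only on rank order, GS can report far more citations than WoS for every document and still correlate strongly with it, provided the documents are ordered similarly in both sources.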

Section snippets

Methods

The sample used for this study is taken from GS’s Classic Papers product (GSCP). The 2017 edition of GSCP lists 2515 highly-cited documents written in English and published in 2006. These documents were classified by GS into 252 subject categories within 8 broad subject areas. Background about GSCP can be found in Orduna-Malea, Martín-Martín, and Delgado López-Cózar (2018) and Martín-Martín et

RQ1: citing source overlap

Overall, 46.9% of all citations were found by the three databases (Fig. 3). GS found the most citations, including most of the citations found by WoS and Scopus. In contrast, only 6% of all citations were found by WoS and/or Scopus, and not by GS. An additional 10.2% of all citations were found by both GS and Scopus (7.7%), or GS and WoS (2.5%). Over a third (36.9%) of all citations were only found by GS.
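The shares reported above partition the full set of citations: 46.9% (all three databases) + 10.2% (GS plus one other) + 36.9% (GS only) + 6% (not in GS) = 100%, which implies that GS found roughly 94% of all citations. A minimal sketch of how such a breakdown can be derived from three sets of citation identifiers follows. The function names and the toy identifiers are hypothetical; the study's actual matching procedure is not described in this excerpt.

```python
from collections import Counter


def overlap_breakdown(gs, wos, scopus):
    """Percentage of all distinct citations in each overlap region.

    Keys are (in_gs, in_wos, in_scopus) boolean tuples.
    """
    universe = gs | wos | scopus
    counts = Counter((c in gs, c in wos, c in scopus) for c in universe)
    total = len(universe)
    return {region: 100 * n / total for region, n in counts.items()}


def share_also_found(reference, other):
    """Percentage of the citations in `reference` that `other` also found."""
    return 100 * len(reference & other) / len(reference)


# Toy citation identifiers (hypothetical, not data from the study).
gs = {"a", "b", "c", "d", "e", "f", "g", "h"}
wos = {"a", "b", "c", "i"}
scopus = {"a", "b", "d", "e", "j"}

breakdown = overlap_breakdown(gs, wos, scopus)
for region, pct in sorted(breakdown.items(), reverse=True):
    print(region, f"{pct:.1f}%")

print("WoS citations also in GS:", share_also_found(wos, gs))
print("Scopus citations also in GS:", share_also_found(scopus, gs))
```

The regions are mutually exclusive, so their percentages always sum to 100. The second function corresponds to statements of the form "GS found 95% of the WoS citations", which use one database's citations as the denominator, not the union of all three.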

When citations are disaggregated by the broad subject area in which the cited document was

Limitations

This study analyses a large sample of citations to highly-cited documents from all subject areas published in English. In order to generalize the results to all articles, it must be assumed that the population of documents that cite highly cited articles is not significantly different from the general population of documents that cite articles. This may not be fully true since, for example, highly cited articles are presumably more likely to be in emerging research areas and larger specialisms.

Conclusions

This study provides evidence that GS finds significantly more citations than the WoS Core Collection and Scopus across all subject areas. Nearly all citations found by WoS (95%) and Scopus (92%) were also found by GS, which found a substantial number of unique citations that were not found by the other databases. In the Humanities, Literature & Arts, Social Sciences, and Business, Economics & Management, unique GS citations surpass 50% of all citations in the area.

About half (48%–65%, depending

Author contributions

Alberto Martín-Martín: Conceived and designed the analysis, Collected the data, Contributed data or analysis tools, Performed the analysis, Wrote the paper.

Enrique Orduna-Malea: Conceived and designed the analysis, Wrote the paper.

Mike Thelwall: Conceived and designed the analysis, Wrote the paper.

Emilio Delgado López-Cózar: Conceived and designed the analysis, Wrote the paper.

Acknowledgements

Alberto Martín-Martín is funded by a four-year doctoral fellowship (FPU2013/05863) granted by the Ministerio de Educación, Cultura, y Deportes (Spain). An international mobility grant from Universidad de Granada and CEI BioTic Granada funded a research stay at the University of Wolverhampton.

References (51)

  • F.J. Damerau. A technique for computer detection and correction of spelling errors. Communications of the ACM (1964)

  • D. De Solla Price. A general theory of bibliometric and other cumulative advantage processes. Journal of the American Society for Information Science (1976)

  • J.C.F. de Winter et al. The expansion of Google Scholar versus Web of Science: A longitudinal study. Scientometrics (2014)

  • E. Delgado López-Cózar et al. Google Scholar as a data source for research assessment

  • E. Delgado López-Cózar et al. The Google Scholar experiment: How to index false papers and manipulate bibliometric indicators. Journal of the Association for Information Science and Technology (2014)

  • M. Dowle et al. data.table: Extension of “data.frame” (2018)

  • H. Else. How I scraped data from Google Scholar. Nature (2018)

  • Elsevier. Scopus source list (April 2018) (2018)

  • A.-W. Harzing. A longitudinal study of Google Scholar coverage between 2012 and 2013. Scientometrics (2013)

  • A.W. Harzing. Publish or Perish (2007)

  • A.-W. Harzing et al. Google Scholar, Scopus and the Web of Science: A longitudinal and cross-disciplinary comparison. Scientometrics (2016)

  • A.W.K. Harzing et al. Google Scholar as a new source for citation analysis. Ethics in Science and Environmental Politics (2008)

  • J. Jacimovic et al. A citation analysis of Serbian Dental Journal using Web of Science, Scopus and Google Scholar. Stomatoloski Glasnik Srbije (2010)

  • P. Jacsó. Metadata mega mess in Google Scholar. Online Information Review (2010)

  • K. Kousha et al. Google Scholar citations and Google Web/URL citations: A multi-discipline exploratory analysis. Journal of the American Society for Information Science and Technology (2007)