The problem of plagiarism in scientific literature is well known. A study by two researchers at Cornell University, published last December in the Proceedings of the National Academy of Sciences¹ and based on the corpus of arXiv.org documents, shows that the practice of reusing text from one article in another is more common in some countries than in others. Thankfully, the results also suggest that authors who copy extensively from others are cited less.
The study used the 757,000 full-text documents that arXiv accumulated between 1991 and 2012. The repository specializes in Physics, Mathematics and Computer Science and receives approximately 80,000 submissions per year. The researchers looked for duplicated excerpts among the documents using a computational linguistics technique based on “n-grams” (see the explanation in the note below). Finding duplicated segments among documents does not by itself demonstrate plagiarism, but above a certain threshold it may be considered a warning sign. Accordingly, the authors use the term “text overlap”, which is not the same as plagiarism but serves as a warning indicator.
The study took several precautions to avoid false positives. For example, the software can exclude block quotes, italicized text, text in quotation marks, and properly referenced quotations, as well as review articles, conference proceedings, and dissertations. Moreover, the research was conducted only with arXiv documents; no comparison with full texts from other publishers or from the Web was performed.
Using data from the Citron and Ginsparg study, ScienceInsider² produced a map of the countries covered by those authors. Of the 151 countries represented at arXiv.org, only the 57 that had each contributed more than 100 articles were selected, so that the results would be representative. Overall, 6% of all articles were flagged for a high rate of “text overlap”. The result, however, was not uniform across countries and regions.
The countries that consistently show the highest percentages of flagged articles, regardless of the metric used, are (in alphabetical order):
Armenia, Bangladesh, Belarus, Bulgaria, Colombia, Cyprus, Egypt, Georgia, Greece, Iran, Jordan, Kazakhstan, Kyrgyzstan, Latvia, Luxembourg, Micronesia, Moldova, Pakistan, Romania, Saudi Arabia, Uzbekistan.
The scientific tradition of each country must be kept in mind when reading these results. In the United States, Canada and a few industrialized countries in Europe and Asia, the share of flagged articles was around 1%. Japan stood at 6%, China and India came in at about twice the average, with 10%, and some values were higher still, such as Iran (15%) and Bulgaria (20%), the latter eight times higher than New Zealand.
Because the study covered only certain areas of research (arXiv.org specializes in Physics and Mathematics), its results cannot be extrapolated to other disciplines where the reuse of text is common practice, for example in descriptions of infrastructure or experimental procedures. Moreover, the authors acknowledge that, in order to reduce false positives, the alert thresholds set in the software make detection much more lenient than that of any refereed publication.
Faced with the disparity in results among countries, Ginsparg and Citron attribute these practices, which come close to scientific plagiarism, to “differences in academic infrastructure and mentoring, or incentives that emphasize quantity of publications over quality.”
This brings to mind an old saying: “the directors cannot read, they can only count”.
Note regarding the n-gram model
The n-gram model applied in computational linguistics may be used to detect duplication of text (or text overlap). The procedure used by Ginsparg and Citron considered 7-grams, i.e. sequences of 7 consecutive words extracted by sliding a window over the text one word at a time. For example, taking the beginning of this paragraph as input, the program will extract the following keys:
The n-gram model applied in computational linguistics
n-gram model applied in computational linguistics may
model applied in computational linguistics may be
applied in computational linguistics may be used
…
This continues until the end of the sentence, then starts again with the next sentence, and so on until the end of the document.
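The sliding extraction described above can be sketched in a few lines of Python. This is only an illustration, not the authors' software: the function name extract_ngrams is hypothetical, and splitting on whitespace is an assumed, simplified way of finding words.

```python
def extract_ngrams(sentence, n=7):
    """Return every run of n consecutive words in the sentence, in order.

    Assumption: words are separated by whitespace. The tokenization
    actually used in the study is not described in this post.
    """
    words = sentence.split()
    # Slide a window of n words along the sentence, one word at a time.
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]


sentence = ("The n-gram model applied in computational linguistics "
            "may be used to detect duplication of text")
keys = extract_ngrams(sentence)

for key in keys[:4]:
    print(key)
```

The first four keys printed are the four listed in the example above. A sentence with fewer than seven words yields no keys at all.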
Each extracted phrase becomes a key, and the document is represented by its set of keys, expressed as a vector in an n-dimensional space. This vector representation of the document is called its “fingerprint”. To avoid false positives, the procedure drops very common phrases such as “the rest of this article is organized”.
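As a rough sketch of this step, a fingerprint can be modeled as a Python set of keys, which is equivalent to a vector with one yes/no dimension per possible key. The names fingerprint and COMMON_PHRASES are hypothetical, the list of common phrases contains only the one example quoted above, and lowercasing the text is an assumption not taken from the study.

```python
# Assumed, illustrative list of phrases too common to be meaningful.
COMMON_PHRASES = frozenset({"the rest of this article is organized"})


def fingerprint(text, common_phrases=COMMON_PHRASES, n=7):
    """Return the set of n-word keys in text, minus very common phrases."""
    words = text.lower().split()  # lowercasing is an assumption
    keys = {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}
    # Set difference drops every key that appears in the common-phrase list.
    return keys - common_phrases


doc = "The rest of this article is organized as follows"
fp = fingerprint(doc)
print(sorted(fp))
```

The nine-word example contains three 7-grams; the one matching the common phrase is dropped, so two keys remain in the fingerprint.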
Once the fingerprints of all documents in the corpus are available, the analysis of overlapping keys begins. Every pair of documents in the collection is characterized by the number of keys the two share. Pairs of documents with more than 100 7-grams in common are flagged as “suspicious”.
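The pairwise comparison can be illustrated as follows, assuming each fingerprint is held as a Python set. The function name flag_suspicious_pairs and the toy document identifiers are hypothetical. Comparing every pair directly, as done here, is only practical for small collections; how the study handled a corpus of 757,000 documents is not described in this post.

```python
from itertools import combinations


def flag_suspicious_pairs(fingerprints, threshold=100):
    """Return (doc_a, doc_b, shared) for pairs sharing more than threshold keys.

    fingerprints maps a document identifier to its set of keys.
    """
    flagged = []
    for (id_a, keys_a), (id_b, keys_b) in combinations(fingerprints.items(), 2):
        shared = len(keys_a & keys_b)  # keys present in both documents
        if shared > threshold:
            flagged.append((id_a, id_b, shared))
    return flagged


# Toy data with a lowered threshold, so the example stays readable.
toy = {
    "doc_a": {"k1", "k2", "k3", "k4"},
    "doc_b": {"k2", "k3", "k4", "k5"},
    "doc_c": {"k9"},
}
result = flag_suspicious_pairs(toy, threshold=2)
print(result)
```

With the default threshold of 100, the function applies the cut-off described above; the toy call lowers it to 2 so that the pair sharing three keys is flagged.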
Notes
1 CITRON, D.T., GINSPARG, P. Patterns of text reuse in a scientific corpus. PNAS. 2014, vol. 112, nº 1. Available from: http://arxiv.org/ftp/arxiv/papers/1412/1412.2716.pdf
2 BOHANNON, J. Study of massive preprint archive hints at the geography of plagiarism. ScienceInsider. Available from: http://news.sciencemag.org/scientific-community/2014/12/study-massive-preprint-archive-hints-geography-plagiarism
References
BOHANNON, J. Study of massive preprint archive hints at the geography of plagiarism. ScienceInsider. Available from: http://news.sciencemag.org/scientific-community/2014/12/study-massive-preprint-archive-hints-geography-plagiarism
CITRON, D.T., GINSPARG, P. Patterns of text reuse in a scientific corpus. PNAS. 2014, vol. 112, nº 1. Available from: http://arxiv.org/ftp/arxiv/papers/1412/1412.2716.pdf
CITRON, D.T., GINSPARG, P. Supplemental material for patterns of text reuse in a scientific corpus. PNAS. 2014, vol. 112, nº 1. Available from: http://arxiv.org/ftp/arxiv/papers/1412/1412.2716.pdf
N-grama. Wikipedia. Available from: http://es.wikipedia.org/wiki/N-grama.
External Link
arXiv – <http://arxiv.org>
About Ernesto Spinak
Ernesto Spinak is a collaborator on the SciELO program and a Systems Engineer with a Bachelor’s degree in Library Science, a Diploma of Advanced Studies from the Universitat Oberta de Catalunya (Barcelona, Spain) and a Master’s in “Sociedad de la Información” (Information Society) from the same university. He currently runs a consulting company that provides services in information projects to 14 government institutions and universities in Uruguay.