Editorial ethics: the detection of plagiarism by automated means

As we have seen in previous pieces¹,²,³, plagiarism, although it violates ethical standards (in addition to copyright), occurs far more frequently than people naively imagine. The major reason behind its growth is the ease of access to online content. Over the last 10 years the practice has increased to such an extent that, since 2011, the National Science Foundation has earmarked 100 million dollars for the analysis of this problem, and academic journals report that the number of retractions has increased by a factor of 10 over the last 20 years. For more details on retraction and plagiarism, see the notes at the end of this article.

In the academic world, worrying figures have come to light. For example, a plagiarism audit carried out on 285,000 scientific texts in arXiv.org retrieved more than 500 documents which were most likely plagiarized, along with another 30,000 documents (20% of the collection [sic]) which showed strong indications of excessive self-plagiarism.

This problem requires those responsible for editorial policy in academic publications, and in particular for the journals published in SciELO, to adopt preventive measures against plagiarism, and these measures should include the use of computerized tools.

The detection of plagiarism is the process of locating, within a particular work or document, those sections which have been taken from other sources without appropriate references. Plagiarism can occur in any type of document, not solely in academic works: it is also found in the press, in graduation theses, in the code of computer programs, in art designs, etc.

Computerized plagiarism detection systems (known as PDS in English) take two basic approaches: (i) external comparison and (ii) intrinsic analysis. External detection requires access to a vast collection of documents that are accepted as genuine; the document under analysis is then compared with the documents in this database. Intrinsic analysis carries out statistical checks on the vocabulary and linguistic style of the text itself, applying techniques from the specialized field known as stylometry⁴.
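As a rough illustration of the intrinsic approach, the sketch below computes three simple style features for a passage. The function name style_profile and the choice of features are assumptions made for this example; real stylometric systems rely on far richer feature sets.

```python
import re


def style_profile(text):
    """Compute three simple style features for a passage (illustrative only)."""
    # Assumption: plain English text; only ASCII letters and apostrophes count as words.
    words = re.findall(r"[a-z']+", text.lower())
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    if not words or not sentences:
        return {"avg_word_len": 0.0, "avg_sentence_len": 0.0, "type_token_ratio": 0.0}
    return {
        # Mean number of characters per word.
        "avg_word_len": sum(len(w) for w in words) / len(words),
        # Mean number of words per sentence.
        "avg_sentence_len": len(words) / len(sentences),
        # Distinct words divided by total words (vocabulary richness).
        "type_token_ratio": len(set(words)) / len(words),
    }


profile = style_profile("The cat sat. The cat ran.")
```

A passage whose profile differs sharply from the rest of the same document could then be flagged for human review; the profile alone proves nothing.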

A common classification of the procedures used for the detection of plagiarism is as follows:

  • Fingerprinting
  • String matching
  • Vector space retrieval
  • Analysis of citations
  • Stylometry

In all cases, computer algorithms represent the documents by means of text patterns (fingerprints), argumentative structures, or citation patterns, and these representations are then compared with those of the suspect documents. The algorithms that work on text representations belong to Information Retrieval (IR) theory. Depending on their complexity and the size of the database of genuine articles, these processes can take up significant machine time, especially when Internet sources are used to supplement the database.
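The fingerprinting idea can be sketched in a few lines. The example below is a minimal illustration, not the method of any particular PDS: it assumes word trigrams as the text pattern, a truncated SHA-1 hash as the fingerprint, and Jaccard overlap as the comparison. The function names fingerprints and jaccard are hypothetical.

```python
import hashlib


def fingerprints(text, k=3):
    """Return the set of hashed word k-grams (shingles) of a text."""
    words = text.lower().split()
    shingles = (" ".join(words[i:i + k]) for i in range(len(words) - k + 1))
    # Assumption: a truncated SHA-1 digest is a good enough fingerprint for a demo.
    return {hashlib.sha1(s.encode("utf-8")).hexdigest()[:12] for s in shingles}


def jaccard(a, b):
    """Shared fingerprints divided by all fingerprints seen in either set."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


source = "the detection of plagiarism is the process of locating copied sections"
suspect = "the detection of plagiarism is the process of finding copied passages"
score = jaccard(fingerprints(source), fingerprints(suspect))
```

Here the two sentences share six of twelve distinct trigrams, so the score is 0.5. Changing only two words already removes half of the matching fingerprints.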

Unfortunately, this is a simplistic view of the problem, because it assumes that the plagiarism concerned is simply a verbatim copy (“cut and paste”) and as such is easily detectable. In practice, the accuracy of automated detection decreases as the plagiarism is concealed by any of the following techniques:

  • Masked plagiarism
    • Shake and paste plagiarism: the copying and merging of segments from different sources to form a coherent text;
    • Expansive plagiarism: portions of additional text are inserted into the copied segment;
    • Contractive plagiarism: a summary, or an original text which has been “trimmed”;
    • Mosaic plagiarism: segments of texts from different sources are merged together, changing the word order, using synonyms, deleting and inserting filler words.
  • Undue paraphrasing: the intentional rewriting of other people’s ideas.
  • Translated plagiarism: the machine translation of paragraphs from other sources from one language to another. The translated content is then revised by polishing the style.
  • Idea plagiarism: the appropriation of research methods, experimental procedures, argumentative structures, and background sources. In this case, it is not the text that is copied, just the methods.
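The loss of accuracy under these disguises can be illustrated with two simple measures. The sketch below is an assumption-laden toy, not a real PDS: trigram_overlap stands in for verbatim string matching and cosine_similarity for a vector space comparison. Both function names and all three sample sentences are invented for this example.

```python
import math
from collections import Counter


def trigram_overlap(a, b):
    """Jaccard overlap of word trigrams: sensitive to word order."""
    def trigrams(text):
        w = text.lower().split()
        return {tuple(w[i:i + 3]) for i in range(len(w) - 2)}
    ta, tb = trigrams(a), trigrams(b)
    return len(ta & tb) / len(ta | tb) if ta | tb else 0.0


def cosine_similarity(a, b):
    """Cosine of term-frequency vectors: ignores word order."""
    va, vb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(va[w] * vb[w] for w in va.keys() & vb.keys())
    norm = (math.sqrt(sum(c * c for c in va.values()))
            * math.sqrt(sum(c * c for c in vb.values())))
    return dot / norm if norm else 0.0


original = "automated tools compare a suspect text against a large collection"
# Same words, different order (a crude stand-in for mosaic plagiarism).
reordered = "against a large collection automated tools compare a suspect text"
# Same idea, different words (a crude stand-in for paraphrasing).
paraphrased = "software checks a questionable document versus a big corpus"

for label, text in [("reordered", reordered), ("paraphrased", paraphrased)]:
    print(label, trigram_overlap(original, text), cosine_similarity(original, text))
```

Reordering lowers the trigram overlap while the word-frequency comparison still reports an identical vocabulary. Paraphrasing removes every shared trigram and leaves only a weak vocabulary signal, which is why disguised copies tend to escape detection.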

In addition, more sophisticated text manipulation methods exist which have been designed to dupe computer algorithms. These are most often used with languages which have diacritics (Scandinavian and Slavic languages, etc.), or with languages written in non-Roman scripts such as Japanese and Hebrew.

A technique that is frequently employed is the use of homoglyphs⁵. These are letters or sequences of letters which look alike but have different internal representations. Examples are the substitution of 0 (zero) for lower case “o” or upper case “O”, or the interchange of look-alike letters from the Greek and Roman alphabets. More than 40 possible substitutions exist, and they are frequently used in graduation theses to prevent plagiarism from being detected, since very few software applications are able to detect substitutions of this type (Turnitin and Urkund are practically the only exceptions as far as PDS are concerned). The state of the art of PDS can be summarized in the following phrase: “PDS finds the copies, not the plagiarism” (Gipp and Meuschke 2011).
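The homoglyph substitution described above can be neutralized, at least in part, by mapping look-alike characters back to a common form before comparing texts. The sketch below is a minimal illustration with a deliberately tiny, hypothetical table of six Cyrillic and Greek look-alikes; it does not cover the 40 or so substitutions mentioned above, and it is not how any named product works.

```python
# Hypothetical, deliberately incomplete table of look-alike characters.
HOMOGLYPHS = {
    "\u0430": "a",  # Cyrillic small a
    "\u0435": "e",  # Cyrillic small ie
    "\u043e": "o",  # Cyrillic small o
    "\u0440": "p",  # Cyrillic small er
    "\u0441": "c",  # Cyrillic small es
    "\u03bf": "o",  # Greek small omicron
}
TABLE = str.maketrans(HOMOGLYPHS)


def normalize(text):
    """Replace known look-alike characters with their Latin counterparts."""
    return text.translate(TABLE)


# Looks like "plagiarism detection" on screen but contains four Cyrillic letters.
disguised = "pl\u0430gi\u0430rism dete\u0441ti\u043en"

print(disguised == "plagiarism detection")             # False
print(normalize(disguised) == "plagiarism detection")  # True
```

The zero-for-“o” substitution is left out of the table on purpose: mapping every 0 to “o” would also corrupt genuine numbers, so that case needs context rather than a simple lookup.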

A search on the Internet reveals the existence of dozens of software applications, both commercial and free, of different degrees of effectiveness. This has developed into an area of research and development which has shown strong growth over the last 10 years. But one must take into consideration that current PDS technology is imprecise, and that there are many myths surrounding this subject, one of which is that any copied work will be detected. Although the most important commercial systems such as iThenticate, Copyscape, Turnitin, and Urkund have databases of genuine documents which contain tens of billions of web pages, almost one hundred million offline academic works, and almost forty million articles from tens of thousands of academic journals, this is not enough because:

  • Published information continues to grow much faster than Google can index it, and there is no database of authentic documents in existence that contains everything that has been published.
  • For example, plagiarism detectors are of no use if two similar papers are submitted simultaneously to two journals: neither has been published yet, so the overlap cannot be detected.

The topic is so important that, since 2004, the Hochschule für Technik und Wirtschaft Berlin (HTW Berlin), a university of applied sciences, has maintained a specialized site on PDS software and has conducted international evaluations in which the applications are put through rigorous stress tests. The portal on plagiarism is maintained by Dr. Debora Weber-Wulff, professor of media and computing at the HTW Berlin.

The results of the evaluation undertaken in 2013, in which 28 systems were analyzed and only 15 were able to complete the series of more than 70 tests to which they were submitted, are available in the Plagiarism Portal of HTW⁶ and will be published as a book at the end of January 2014⁷.

The problems identified by the tests as the most difficult to resolve were the large numbers of false negatives and false positives, which were generally caused by phrases common to the subject specialty of the document. False positives matter because, if human judgment is not applied, the reputation of the authors of the work under analysis could be damaged.

According to the HTW report, there are many problems in the machine determination of plagiarism that require human decision making criteria. Some of the problems mentioned are:

  • What exactly constitutes plagiarism: a simple “cut & paste”, or also paraphrasing without identifying the source, or just taking the ideas?
  • How much can be copied from a work before it is considered to be plagiarized?
  • Is it considered to be plagiarism only when copying is done with the intent to deceive?
  • Do PDS verify the complete text, or only sample extracts?

The 2013 tests placed only three applications in the category of “partially useful”, the highest category awarded. These were Urkund, Turnitin (used by the iThenticate and WriteCheck products) and Copyscape. Urkund had the best score for effectiveness but had usability problems.

Note about retraction

According to the Real Academia de la Lengua Española⁸, a retraction is the action of expressly revoking what has been said. For the National Library of Medicine⁹, after an article has been published it can be the object of the following modifications: errata, retraction (total or partial), correction and re-publication, plagiarism (duplicate publication), commentaries (including author responses), updated versions, and re-publications (reprints). Of these, retraction and plagiarism carry the greatest academic and social weight.

The SciELO Program provides guidance on how to publish retractions¹⁰.

Conclusions

Given that the results are mixed, it is not possible to recommend any one particular system: each application may be particularly useful in specific situations, but none is useful in general.

The software systems considered as detectors of plagiarism do not in reality detect it. They can only detect parallel texts. The decision of whether or not something is plagiarism rests with the reviewers that use the software. What is made available is a tool and not a proof of plagiarism.

It would be important for the community of SciELO editors, who today publish more than 1,000 journals in 16 countries, to take the initiative to train in and use these tools to improve the quality of what is published in the SciELO Network.

Notes

¹ Ética editorial y el problema del plagio – http://blog.scielo.org/es/2013/10/02/etica-editorial-y-el-problema-del-plagio/#.Usfw9Wx3u00

² Ética editorial y el problema del autoplagio – http://blog.scielo.org/es/2013/11/11/etica-editorial-y-el-problema-del-autoplagio/#.UsfxeGx3u00

³ Ética editorial – el Ghostwriting es una práctica insalubre – http://blog.scielo.org/es/2014/01/16/etica-editorial-el-ghostwriting-es-una-practica-insalubre/#.UuZDK7TJ2Hs

⁴ Stylometry – http://en.wikipedia.org/wiki/Stylometry

⁵ Homoglyph: In typography, a homoglyph is one of two or more characters, or glyphs, with shapes that either appear identical or cannot be differentiated by quick visual inspection. This designation is also applied to sequences of characters sharing these properties.

⁶ Results of the Plagiarism Detection System Test 2013 – http://plagiat.htw-berlin.de/software-en/test2013/

⁷ WEBER-WULFF, D. False Feathers: A Perspective on Academic Plagiarism. 200p. 2014.

⁸ Real Academia Española – Retractar – http://buscon.rae.es/drae/?type=3&val=retractarse&val_aux=&origen=REDRAE

⁹ U.S. National Library of Medicine – http://www.nlm.nih.gov/pubs/factsheets/errata.html

¹⁰ SciELO – Procedimento para retractación de Artículos – http://www.scielo.org/php/level.php?lang=es&component=44&item=49

References

The Scientist: exploring life, inspiring innovation. Defending Against Plagiarism: publishers need to be proactive about detecting and deterring copied text. June 1, 2013. Available from: <http://www.the-scientist.com/?articles.view/articleNo/35677/title/Defending-Against-Plagiarism/>.

SOROKINA, D., et al. Plagiarism Detection in arXiv. 2007. Available from: <http://arxiv.org/ftp/cs/papers/0702/0702012.pdf>.

GIPP, B., and MEUSCHKE, N. Citation Pattern Matching Algorithms for Citation-based Plagiarism Detection: Greedy Citation Tiling, Citation Chunking and Longest Common Citation Sequence. Proceedings of the 11th ACM. 2011. Available from: <http://sciplore.org/wp-content/papercite-data/pdf/gipp11c.pdf>.

GIPP, B., MEUSCHKE, N., and BEEL, J. Comparative evaluation of text- and citation-based plagiarism detection approaches using guttenplag. In Proceedings of JCDL. pp. 255-258. 2011. Available from: <http://gipp.com/wp-content/papercite-data/pdf/gipp11.pdf>.

iThenticate. Plagiarism Detection Software Misconceptions. Free paper: 7 Misconceptions of Plagiarism Detection Software. Available from: <http://www.ithenticate.com/resources/papers/plagiarism-detection-software-misconceptions>.

External links

ArXiv – arxiv.org

HTW, Berlin – http://plagiat.htw-berlin.de/software-en/

Urkund – http://www.urkund.com/int/en/

Turnitin – http://www.turnitin.com/

Copyscape – http://www.copyscape.com/

iThenticate – http://www.ithenticate.com/

WriteCheck – http://en.writecheck.com/

About Ernesto Spinak

Collaborator on the SciELO program, a Systems Engineer with a Bachelor’s degree in Library Science, a Diploma of Advanced Studies from the Universitat Oberta de Catalunya (Barcelona, Spain) and a Master’s in “Sociedad de la Información” (Information Society) from the same university. Currently has a consulting company that provides services in information projects to 14 government institutions and universities in Uruguay.

Translated from the original in Spanish by Nicholas Cop Consulting.

How to cite this post [ISO 690/2010]:

SPINAK, E. Editorial ethics: the detection of plagiarism by automated means [online]. SciELO in Perspective, 2014 [viewed ]. Available from: https://blog.scielo.org/en/2014/02/12/editorial-ethics-the-detection-of-plagiarism-by-automated-means/

 
