{"id":915,"date":"2014-02-12T13:37:48","date_gmt":"2014-02-12T16:37:48","guid":{"rendered":"http:\/\/blog.scielo.org\/en\/?p=915"},"modified":"2016-01-15T16:40:28","modified_gmt":"2016-01-15T18:40:28","slug":"editorial-ethics-the-detection-of-plagiarism-by-automated-means","status":"publish","type":"post","link":"https:\/\/blog.scielo.org\/en\/2014\/02\/12\/editorial-ethics-the-detection-of-plagiarism-by-automated-means\/","title":{"rendered":"Editorial ethics: the detection of plagiarism by automated means"},"content":{"rendered":"<p><a href=\"http:\/\/blog.scielo.org\/en\/wp-content\/uploads\/sites\/2\/2014\/02\/plagiarism_en.png\" target=\"_blank\"><img loading=\"lazy\" decoding=\"async\" class=\"alignright size-medium wp-image-917\" src=\"http:\/\/blog.scielo.org\/en\/wp-content\/uploads\/sites\/2\/2014\/02\/plagiarism_en-300x220.png\" alt=\"\" width=\"300\" height=\"220\" srcset=\"https:\/\/blog.scielo.org\/en\/wp-content\/uploads\/sites\/2\/2014\/02\/plagiarism_en-300x220.png 300w, https:\/\/blog.scielo.org\/en\/wp-content\/uploads\/sites\/2\/2014\/02\/plagiarism_en.png 1000w\" sizes=\"auto, (max-width: 300px) 100vw, 300px\" \/><\/a>As we have seen in previous pieces<sup>1,2,3<\/sup>, plagiarism, in spite of it being a practice which violates ethical standards (in addition to copyright), is occurring far more frequently than people naively imagine. The major reason behind the growth in plagiarism is the ease with which access can be achieved to online content. Over the last 10 years this practice has increased to such an extent that since 2011, the National Science Foundation has earmarked 100 million dollars for the analysis of this problem, and academic journals are reporting that the number of retractions has increased by a factor of 10 over the last 20 years. For more details on retraction and plagiarism, see the comments at the end of this article.<\/p>\n<p>In the academic world, worrying figures have come to light. 
For example, a plagiarism audit carried out on 285,000 scientific texts in arXiv.org retrieved more than 500 documents which were most likely plagiarized, along with another 30,000 documents (20% of the collection [<i>sic<\/i>]) which showed strong indications of excessive self-plagiarism.<\/p>\n<p>This problem makes it necessary that those who are responsible for editorial policy for academic publications, and in particular those items which are published in SciELO, should adopt the necessary preventive measures to counteract plagiarism, and these procedures should also include the use of computerized tools.<\/p>\n<p>The detection of plagiarism is the process of locating within a particular work or document those sections which have been taken from other sources without the appropriate references being made. Plagiarism can occur in any type of document, not solely in academic works, since it is also found in the Press, in graduation theses, in the code used in computer programs, in art designs, etc.<\/p>\n<p>Computerized plagiarism detection systems (known as PDS in English) have two basic approaches: <b>i.<\/b> external comparison and <b>ii.<\/b> intrinsic analysis. The processes involved in external detection necessitate having access to a vast collection of documents which are accepted as being <i>\u201c<\/i><i>genuine<\/i><i>\u201d<\/i>. The document under analysis is then compared with the documents in this database. 
Intrinsic analysis runs statistical checks on the vocabulary and linguistic style of the text itself, applying techniques from the specialized field known as stylometry<sup>4<\/sup>.<\/p>\n<p>A common classification of the procedures used for the detection of plagiarism is as follows:<\/p>\n<ul>\n<li>Fingerprinting<\/li>\n<li>String matching<\/li>\n<li>Vector space retrieval<\/li>\n<li>Analysis of citations<\/li>\n<li>Stylometry<\/li>\n<\/ul>\n<p>In all cases, computer algorithms represent the documents by means of text patterns (fingerprints), argumentative structures, or citation patterns, and these representations are then checked against the suspect documents. The algorithms which use text representation form part of Information Retrieval (IR) theory. These processes, depending on their complexity and the size of the database of genuine articles which is used, can take up significant machine time, especially when Internet sources are used to supplement the databases of <i>genuine<\/i> articles.<\/p>\n<p>Unfortunately, this is a simplistic view of the problem, because it is based on the assumption that the plagiarism concerned is simply a verbatim copy \u2013 \u201ccut and paste\u201d \u2013 and as such is easily detectable. 
However, the accuracy of automated detection decreases as the plagiarism becomes more concealed by any of the following techniques:<\/p>\n<ul>\n<li>Masked plagiarism\n<ul>\n<li>Shake and paste plagiarism: the copying and merging of segments from different sources to form a coherent text;<\/li>\n<li>Expansive plagiarism: portions of additional text are inserted into the copied segment;<\/li>\n<li>Contractive plagiarism: a summary or \u201ctrimmed\u201d version of the original text;<\/li>\n<li>Mosaic plagiarism: segments of texts from different sources are merged together, changing the word order, using synonyms, deleting and inserting filler words.<\/li>\n<\/ul>\n<\/li>\n<li>Undue paraphrasing: the intentional rewriting of other people\u2019s ideas.<\/li>\n<li>Translated plagiarism: the machine translation of paragraphs from other sources from one language to another. The translated content is then revised by polishing the style.<\/li>\n<li>Idea plagiarism: the appropriation of research methods, experimental procedures, argumentative structures, and background sources. In this case, it is not the text that is copied, just the methods.<\/li>\n<\/ul>\n<p>In addition, there are more sophisticated text manipulation methods designed to dupe computer algorithms. These are most used in the case of those languages which have diacritics (Scandinavian and Slavic languages, etc.), or for languages written in non-Roman scripts such as Japanese and Hebrew.<\/p>\n<p>A technique that is frequently employed is the use of homoglyphs<sup>5<\/sup>. These are letters or sequences of letters which appear similar but have different internal representations. Examples are the substitution of <b>0<\/b> (zero) for lower case \u201co\u201d or upper case <b>O<\/b>, or the transcription of letters from the Greek or Roman alphabet. 
More than 40 possible substitutions exist, and they are frequently used in graduation theses to prevent plagiarism from being detected, since very few software applications are able to detect substitutions of this type (Turnitin and Urkund are practically the only exceptions as far as PDS are concerned). The state of the art of PDS can be summarized in the following phrase: \u201cPDS finds the copies, not the plagiarism\u201d (Gipp and Meuschke 2011).<\/p>\n<p>A search on the Internet reveals the existence of dozens of software applications, both commercial and free, of different degrees of effectiveness. This has developed into an area of research and development which has shown strong growth over the last 10 years. But one must take into consideration that current PDS technology is imprecise, and that there are many myths surrounding this subject, one of which is that any copied work will be detected. Although the most important commercial systems such as iThenticate, Copyscape, Turnitin, and Urkund have databases of genuine documents which contain tens of billions of web pages, almost one hundred million offline academic works, and almost forty million articles from tens of thousands of academic journals, this is not enough because:<\/p>\n<ul>\n<li>Published information continues to grow much faster than Google can index it, and there is no database of authentic documents in existence that contains everything that has been published.<\/li>\n<li>For example: plagiarism detectors are of no use if two similar papers are sent simultaneously to two journals for publication, since the duplication cannot be detected while neither paper has yet been published.<\/li>\n<\/ul>\n<p>The topic is so important that, since 2004, the Hochschule f\u00fcr Technik und Wirtschaft Berlin (HTW Berlin), a university of applied sciences, has been maintaining a specialized site on PDS software and conducting international evaluations in which the applications are put through rigorous stress tests. 
The portal on plagiarism is maintained by Dr. Debora Weber-Wulff, professor of media and computing at the HTW Berlin.<\/p>\n<p>The results of the evaluation undertaken in 2013, where 28 systems were analyzed with only 15 able to complete the series of more than 70 tests to which they were submitted, are available in the <i>Plagiarism Portal<\/i> of HTW<sup>6<\/sup> and will be published as a book at the end of January 2014<sup>7<\/sup>.<\/p>\n<p>The problems most difficult to resolve as \u201c<i>detected\u201d <\/i>by the tests were the large numbers of false negatives and false positives which were generally caused by the use of phrases common to the subject specialty of the document. The errors from false positives are important because if human criteria are not applied, damage could be caused to the reputation of the authors of the work under analysis.<\/p>\n<p>According to the HTW report, there are many problems in the machine determination of plagiarism that require human decision making criteria. Some of the problems mentioned are:<\/p>\n<ul>\n<li>What exactly constitutes plagiarism: a simple \u201ccut &amp; paste\u201d, or also paraphrasing without identifying the source, or just taking the ideas?<\/li>\n<li>How much can be copied from a work before it is considered to be plagiarized?<\/li>\n<li>Is it considered to be plagiarism only when copying is done with the intent to deceive?<\/li>\n<li>Do PDS verify the complete text, or only sample extracts?<\/li>\n<\/ul>\n<p>The tests of 2013 resulted in the selection of only three applications to the category of \u201cpartially useful\u201d, the highest category given. These were Urkund, Turnitin (used by the iThenticate and WriteCheck products) and Copyscape. 
Urkund had the best score in effectiveness but had problems of usability.<\/p>\n<h3>Note about retraction<\/h3>\n<p>According to the <i>Real Academia Espa\u00f1ola<sup>8<\/sup><\/i>, a retraction is the action of expressly revoking what has been said. For the National Library of Medicine<sup>9<\/sup>, after an article has been published it can be the object of the following modifications: errata, retraction (total or partial), correction and re-publication, plagiarism (duplicate publication), commentaries (including author responses), updated versions, and re-publications (reprints). Of these, retraction and plagiarism have a greater academic and social weight.<\/p>\n<p>The SciELO Program provides guidance on how to publish retractions<sup>10<\/sup>.<\/p>\n<h3>Conclusions<\/h3>\n<p>Given that the results are mixed, with pros and cons, it is not possible to recommend any one particular system: each application may be particularly useful in specific situations, but none is useful in general.<\/p>\n<p>The software systems considered as detectors of plagiarism do not in reality detect it. They can only detect parallel texts. The decision of whether or not something is plagiarism rests with the reviewers that use the software. 
What is made available is a tool and not a proof of plagiarism.<\/p>\n<p>It would be important for the community of SciELO editors, who today publish more than 1,000 journals in 16 countries, to take the initiative to train in and use these tools to improve the quality of what is published in the SciELO Network.<\/p>\n<h3>Notes<\/h3>\n<p>\u00b9 \u00c9tica editorial y el problema del plagio &#8211; <a href=\"http:\/\/blog.scielo.org\/es\/2013\/10\/02\/etica-editorial-y-el-problema-del-plagio\/#.Usfw9Wx3u00\">http:\/\/blog.scielo.org\/es\/2013\/10\/02\/etica-editorial-y-el-problema-del-plagio\/#.Usfw9Wx3u00<\/a><\/p>\n<p>\u00b2 \u00c9tica editorial y el problema del autoplagio &#8211; <a href=\"http:\/\/blog.scielo.org\/es\/2013\/11\/11\/etica-editorial-y-el-problema-del-autoplagio\/#.UsfxeGx3u00\">http:\/\/blog.scielo.org\/es\/2013\/11\/11\/etica-editorial-y-el-problema-del-autoplagio\/#.UsfxeGx3u00<\/a><\/p>\n<p>\u00b3 \u00c9tica editorial \u2013 el Ghostwriting es una pr\u00e1ctica insalubre &#8211; <a href=\"http:\/\/blog.scielo.org\/es\/2014\/01\/16\/etica-editorial-el-ghostwriting-es-una-practica-insalubre\/#.UuZDK7TJ2Hs\">http:\/\/blog.scielo.org\/es\/2014\/01\/16\/etica-editorial-el-ghostwriting-es-una-practica-insalubre\/#.UuZDK7TJ2Hs<\/a><\/p>\n<p>\u2074 Stylometry &#8211; <a href=\"http:\/\/en.wikipedia.org\/wiki\/Stylometry\" target=\"_blank\">http:\/\/en.wikipedia.org\/wiki\/Stylometry<\/a><\/p>\n<p>\u2075 <a href=\"http:\/\/en.wikipedia.org\/wiki\/Homoglyph\" target=\"_blank\">Homoglyph<\/a>: In typography, a homoglyph is one of two or more characters, or glyphs, with shapes that either appear identical or cannot be differentiated by quick visual inspection. 
This designation is also applied to sequences of characters sharing these properties.<\/p>\n<p><sup>6<\/sup> Results of the Plagiarism Detection System Test 2013 &#8211; <a href=\"http:\/\/plagiat.htw-berlin.de\/software-en\/test2013\/\" target=\"_blank\">http:\/\/plagiat.htw-berlin.de\/software-en\/test2013\/<\/a><\/p>\n<p><sup>7<\/sup> WEBER-WULFF, D. <i>False Feathers<\/i>: A Perspective on Academic Plagiarism. 200p. 2014.<\/p>\n<p><sup>8 <\/sup><b>Real Academia Espa\u00f1ola \u2013 Retractar &#8211; <\/b><a href=\"http:\/\/buscon.rae.es\/drae\/?type=3&amp;val=retractarse&amp;val_aux=&amp;origen=REDRAE\" target=\"_blank\">http:\/\/buscon.rae.es\/drae\/?type=3&amp;val=retractarse&amp;val_aux=&amp;origen=REDRAE<\/a><\/p>\n<p><sup>9 <\/sup><b>U.S. National Library of Medicine &#8211; <\/b><a href=\"http:\/\/www.nlm.nih.gov\/pubs\/factsheets\/errata.html\" target=\"_blank\">http:\/\/www.nlm.nih.gov\/pubs\/factsheets\/errata.html<\/a><\/p>\n<p><sup>10<b> <\/b><\/sup><b>SciELO \u2013 Procedimento para retractaci\u00f3n de Art\u00edculos &#8211; <\/b><a href=\"http:\/\/www.scielo.org\/php\/level.php?lang=es&amp;component=44&amp;item=49\" target=\"_blank\">http:\/\/www.scielo.org\/php\/level.php?lang=es&amp;component=44&amp;item=49<\/a><\/p>\n<h3>References<\/h3>\n<p>The Scientist: exploring life, inspiring innovation. <i>Defending Against Plagiarism<\/i>: publishers need to be proactive about detecting and deterring copied text. June 1, 2013. Available from: &lt;<a href=\"http:\/\/www.the-scientist.com\/?articles.view\/articleNo\/35677\/title\/Defending-Against-Plagiarism\/\" target=\"_blank\">http:\/\/www.the-scientist.com\/?articles.view\/articleNo\/35677\/title\/Defending-Against-Plagiarism\/<\/a>&gt;.<\/p>\n<p>SOROKINA, D., <i>et al.<\/i> <b>Plagiarism Detection in arXiv<\/b>. 2007. 
Available from: &lt;<a href=\"http:\/\/arxiv.org\/ftp\/cs\/papers\/0702\/0702012.pdf\" target=\"_blank\">http:\/\/arxiv.org\/ftp\/cs\/papers\/0702\/0702012.pdf<\/a>&gt;.<\/p>\n<p>GIPP, B., and MEUSCHKE, N. <b>Citation Pattern Matching Algorithms for Citation-based Plagiarism Detection: Greedy Citation Tiling, Citation Chunking and Longest Common Citation Sequence<\/b>. <i>Proceedings of the 11<sup>th<\/sup> ACM<\/i>. 2011. Available from: &lt;<a href=\"http:\/\/sciplore.org\/wp-content\/papercite-data\/pdf\/gipp11c.pdf\" target=\"_blank\">http:\/\/sciplore.org\/wp-content\/papercite-data\/pdf\/gipp11c.pdf<\/a>&gt;.<\/p>\n<p>GIPP, B., MEUSCHKE, N., and BEEL, J. <b>Comparative evaluation of text- and citation-based plagiarism detection approaches using guttenplag<\/b>. <i>In Proceedings of JCDL<\/i>. pp. 255-258. 2011. Available from: &lt;<a href=\"http:\/\/gipp.com\/wp-content\/papercite-data\/pdf\/gipp11.pdf\" target=\"_blank\">http:\/\/gipp.com\/wp-content\/papercite-data\/pdf\/gipp11.pdf<\/a>&gt;.<\/p>\n<p>iThenticate. Plagiarism Detection Software Misconceptions. <i>Free paper: 7 Misconceptions of Plagiarism Detection Software. 
<\/i>Available from: &lt;<a href=\"http:\/\/www.ithenticate.com\/resources\/papers\/plagiarism-detection-software-misconceptions\" target=\"_blank\">http:\/\/www.ithenticate.com\/resources\/papers\/plagiarism-detection-software-misconceptions<\/a>&gt;.<\/p>\n<h3>External links<\/h3>\n<p>ArXiv<b> &#8211; <\/b><a href=\"http:\/\/arxiv.org\/\" target=\"_blank\">arxiv.org<\/a><b><\/b><\/p>\n<p>HTW, Berlin &#8211; <a href=\"http:\/\/plagiat.htw-berlin.de\/software-en\/\" target=\"_blank\">http:\/\/plagiat.htw-berlin.de\/software-en\/<\/a><\/p>\n<p>Urkund &#8211; <a href=\"http:\/\/www.urkund.com\/int\/en\/\" target=\"_blank\">http:\/\/www.urkund.com\/int\/en\/<\/a><\/p>\n<p>Turnitin &#8211; <a href=\"http:\/\/www.turnitin.com\/\" target=\"_blank\">http:\/\/www.turnitin.com\/<\/a><\/p>\n<p>Copyscape &#8211; <a href=\"http:\/\/www.copyscape.com\/\" target=\"_blank\">http:\/\/www.copyscape.com\/<\/a><\/p>\n<p>iThenticate &#8211; <a href=\"http:\/\/www.ithenticate.com\/\" target=\"_blank\">http:\/\/www.ithenticate.com\/<\/a><\/p>\n<p>WriteCheck &#8211; <a href=\"http:\/\/en.writecheck.com\/\" target=\"_blank\">http:\/\/en.writecheck.com\/<\/a><\/p>\n<p>&nbsp;<\/p>\n<h3><a href=\"http:\/\/blog.scielo.org\/en\/wp-content\/uploads\/sites\/2\/2013\/10\/spinak.jpg\" target=\"_blank\"><img loading=\"lazy\" decoding=\"async\" class=\"alignright wp-image-640\" title=\"Ernesto Spinak\" src=\"http:\/\/blog.scielo.org\/en\/wp-content\/uploads\/sites\/2\/2013\/10\/spinak-300x271.jpg\" alt=\"Ernesto Spinak\" width=\"180\" height=\"163\" srcset=\"https:\/\/blog.scielo.org\/en\/wp-content\/uploads\/sites\/2\/2013\/10\/spinak-300x271.jpg 300w, https:\/\/blog.scielo.org\/en\/wp-content\/uploads\/sites\/2\/2013\/10\/spinak-768x694.jpg 768w, https:\/\/blog.scielo.org\/en\/wp-content\/uploads\/sites\/2\/2013\/10\/spinak-1024x926.jpg 1024w, https:\/\/blog.scielo.org\/en\/wp-content\/uploads\/sites\/2\/2013\/10\/spinak-150x136.jpg 150w\" sizes=\"auto, (max-width: 180px) 100vw, 180px\" 
\/><\/a>About Ernesto Spinak<\/h3>\n<p>Collaborator on the SciELO program, a Systems Engineer with a Bachelor\u2019s degree in Library Science, and a Diploma of Advanced Studies from the Universitat Oberta de Catalunya (Barcelona, Spain) and a Master\u2019s in \u201cSociedad de la Informaci\u00f3n\u201d (Information Society) from the same university. Currently has a consulting company that provides services in information projects\u00a0 to 14 government institutions and universities in Uruguay.<\/p>\n<p>&nbsp;<\/p>\n<p>Translated from the original in\u00a0<a href=\"http:\/\/blog.scielo.org\/es\/2014\/02\/12\/etica-editorial-como-detectar-el-plagio-por-medios-automatizados\/\">Spanish<\/a>\u00a0by\u00a0<a href=\"http:\/\/www.nicholascopconsulting.com\/\" target=\"_blank\">Nicholas Cop Consulting<\/a>.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>The growth in plagiarism in academic articles requires publishers to have effective plagiarism detection systems, known as PDS, since there are multiple ways that this dishonest practice can be concealed. The issue is of such importance that, since 2004, the University of Applied Sciences in Berlin has been maintaining a specialized site of PDS software evaluations. 
<span class=\"ellipsis\">&hellip;<\/span> <span class=\"more-link-wrap\"><a href=\"https:\/\/blog.scielo.org\/en\/2014\/02\/12\/editorial-ethics-the-detection-of-plagiarism-by-automated-means\/\" class=\"more-link\"><span>Read More &rarr;<\/span><\/a><\/span><\/p>\n","protected":false},"author":8,"featured_media":916,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":"","_links_to":"","_links_to_target":""},"categories":[3],"tags":[37,7],"class_list":["post-915","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-analysis","tag-ethics-in-scholarly-communication","tag-scholarly-communication"],"aioseo_notices":[],"_links":{"self":[{"href":"https:\/\/blog.scielo.org\/en\/wp-json\/wp\/v2\/posts\/915","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/blog.scielo.org\/en\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/blog.scielo.org\/en\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/blog.scielo.org\/en\/wp-json\/wp\/v2\/users\/8"}],"replies":[{"embeddable":true,"href":"https:\/\/blog.scielo.org\/en\/wp-json\/wp\/v2\/comments?post=915"}],"version-history":[{"count":6,"href":"https:\/\/blog.scielo.org\/en\/wp-json\/wp\/v2\/posts\/915\/revisions"}],"predecessor-version":[{"id":1727,"href":"https:\/\/blog.scielo.org\/en\/wp-json\/wp\/v2\/posts\/915\/revisions\/1727"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/blog.scielo.org\/en\/wp-json\/wp\/v2\/media\/916"}],"wp:attachment":[{"href":"https:\/\/blog.scielo.org\/en\/wp-json\/wp\/v2\/media?parent=915"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/blog.scielo.org\/en\/wp-json\/wp\/v2\/categories?post=915"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/blog.scielo.org\/en\/wp-json\/wp\/v2\/tags?post=915"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}