In the beginning it was just plagiarism – now it’s computer-generated fake papers as well

A number of articles have appeared in the press recently that give the impression that academic publishing is being flooded with fake papers created by computer programs and presented at conferences. This impression was first given by the news item “Publishers withdraw more than 120 gibberish papers”¹, published in Nature on 24 February of this year, closely followed by the article “How computer-generated fake papers are flooding academia”²; over the following few days, other media outlets published their own comments on the subject (Scientific American, Reddit, etc.).

The news came as a shock because these fake papers, automatically created by a computer program, had been accepted by prestigious publishers: 16 by the German publisher Springer and more than 100 by the Institute of Electrical and Electronics Engineers (IEEE) in the USA. However, none of this should really shock us. Spamming exists on the Internet, so we should not be taken aback when it also turns up in academic publications. To clarify my viewpoint, I am going to relate what I did when I read the news item in Nature, and I invite readers to do the same, because it was a very entertaining experience.

The news item in Nature informed us that the fake papers had been created by a program called SCIgen – An Automatic CS Paper Generator³, developed at MIT by three graduate students in 2005. This program produces papers which, even though they are fake, are cleanly typeset and strictly follow an academic format. In their introduction to the program, the authors state the following:

SCIgen is a program that generates random computer science research papers, including graphs, figures and citations. It uses a hand-written context-free grammar to form all elements of the papers. Our aim here is to maximise amusement, rather than coherence³.
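To give a feel for what “a hand-written context-free grammar” means in practice, here is a minimal sketch in Python. The grammar, the rule names and the `generate_sentence` function are invented for illustration and are not taken from SCIgen, whose real grammar is far larger. The idea is simply that each rule is replaced at random by one of its alternatives until only plain words remain.

```python
import random

# Toy grammar, invented for illustration; it is NOT SCIgen's actual grammar.
# Keys are non-terminals; each maps to a list of alternative expansions.
GRAMMAR = {
    "SENTENCE": [["We", "VERB", "that", "NOUN_PHRASE", "can", "ACTION", "."]],
    "VERB": [["demonstrate"], ["argue"], ["confirm"]],
    "NOUN_PHRASE": [["ADJECTIVE", "NOUN"], ["NOUN"]],
    "ADJECTIVE": [["scalable"], ["probabilistic"], ["peer-to-peer"]],
    "NOUN": [["archetypes"], ["hash tables"], ["neural networks"]],
    "ACTION": [["be improved"], ["interfere with", "NOUN_PHRASE"]],
}


def expand(symbol, rng):
    """Recursively expand a symbol into a list of terminal words."""
    if symbol not in GRAMMAR:
        return [symbol]  # terminal: emit as-is
    production = rng.choice(GRAMMAR[symbol])
    words = []
    for part in production:
        words.extend(expand(part, rng))
    return words


def generate_sentence(seed=None):
    """Generate one random sentence; a fixed seed makes it repeatable."""
    rng = random.Random(seed)
    words = expand("SENTENCE", rng)
    # The last token is always the period; attach it to the preceding word.
    return " ".join(words[:-1]) + words[-1]


if __name__ == "__main__":
    for _ in range(3):
        print(generate_sentence())
```

Every sentence produced this way is grammatical but meaningless, which is exactly the effect the SCIgen authors describe.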

The SCIgen program was analyzed by Cyril Labbé, a French researcher working at the IT Laboratory of the University of Grenoble, who created the program known as “antiScigen” and went on to publish a paper in Scientometrics⁴ in 2012 dealing with fake computer-generated papers. In 2010, Cyril Labbé also demonstrated the vulnerability of h-index calculations based on Google Scholar by feeding it more than 100 papers generated by SCIgen which all cited each other. Using this method he succeeded in getting the totally fictitious author “Ike Antkare” ranked among the most highly cited authors.
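The arithmetic behind this kind of manipulation is easy to reproduce. The sketch below computes an h-index from a list of citation counts and applies it to a simplified scenario, assumed here for illustration, in which every fake paper cites every other one. The function names are hypothetical, and the citation structure Labbé actually used may differ.

```python
def h_index(citation_counts):
    """Largest h such that at least h papers have at least h citations each."""
    ranked = sorted(citation_counts, reverse=True)
    h = 0
    for rank, count in enumerate(ranked, start=1):
        if count >= rank:
            h = rank
        else:
            break
    return h


def mutual_citation_counts(n_papers):
    """Citations per paper when each of n papers cites all the others.

    Simplifying assumption for illustration only.
    """
    return [n_papers - 1] * n_papers


if __name__ == "__main__":
    # 100 generated papers, each cited by the other 99.
    print(h_index(mutual_citation_counts(100)))
```

Under this assumption, 100 generated papers give an h-index of 99 without a single genuine citation, which is why an index built on automatically harvested citations is so easy to inflate.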

I then decided to carry out my own test. I went to the SCIgen site³ and, in the form entitled “Generate a Random Paper”, entered the names of some of my librarian friends in the author field. I created three papers using different combinations of authors, and all this took me less than 5 minutes. Next I generated the PDFs for these papers and downloaded them to my computer for review. They were perfect in every way. Even the bibliography was correctly laid out, and some of the bibliographic references helpfully cited my librarian friends as authors!

Once I had prepared the fake papers, I carried on with my investigation and submitted the three papers “written” by my librarian friends to the “antiScigen” program for checking. I went to the AntiScigen page⁵ and saw that the only requirement was that the group of PDF files be sent as a single .zip archive. I did what it said, and in two minutes I received the report in graphic form which can be seen below. The graphic shows three trees:

  • In black, the sections of original text (those that do not correspond to known fingerprints);
  • In red, the sections of text that are recognized as produced by SCIgen;
  • In blue, the sections of text that have been copied from other sources in the field of computing, but not generated by the program.

In less than 10 minutes I was able to create three academic papers and have them checked by the “antiScigen” program. Seems simple, right?

None of this is new in the field of computer science. Today, whether with good or bad intentions, there are plenty of imitators who manipulate papers, create false profiles in Google Scholar Citations, and then manipulate the numbers. These imitators fabricate not only conference papers but all kinds of other works as well. Two examples of generators similar to SCIgen are given below. It is worth your while to try them yourself.

  • An essay generator⁶;
  • Automatic SBIR grant proposal generator⁷.

But the case I found most surprising of all is that of Philip Parker who, in his small company and with the help of a few computers and programmers, has “written” 200,000 books which he sells on Amazon. Parker produces a book every 20 minutes using a patented process.

Well, spamming has entered the very heart of science. As the Nature news item states, it does not matter whether papers are submitted to a controlled environment (prestigious publishers and journals with peer review systems in place), to more or less controlled environments, or to openly uncontrolled ones (Web pages, repositories, etc.), as in the Google world. No safeguard is infallible and, as Emilio Delgado López-Cózar (2007) writes about peer review as a means of detecting fraud:

there are no infallible means which can prevent fraud from occurring, nor is publication a guarantee of the reliability and validity of a research activity, nor is evaluation done by experts capable of detecting and neutralizing it. Basically, there are two reasons for this. The first is that science rests upon an axiomatic pillar that can be falsified, and which is based on the goodwill of scientists … but if scientists want to lie, they will. The second is that the warning system used by science to verify the likelihood and truthfulness of a discovery is applied in very few cases … wider application is impractical given the current volume of scientific outcomes. (free translation)

In the world of computing, we are accustomed to viruses, trojans, hackers, phishing, spamming and so on, and for these we set up firewalls, antivirus programs, blacklists, passwords and all other kinds of security. The people who create and use those “IT works” are computer science graduates who often develop these programs as part of their studies or simply as personal challenges, even just for fun.

Reflections

When fraudulent works are detected, responsible publishers will certainly remove them, but they should leave a note explaining the removal. This raises further questions: what happens to the counts and indices of Google Scholar if the indicators are adjusted downward, and what happens to the works and pages that maintain links to the works that have been removed? Do they remain valid?

Research scientists are like any other human beings, and in a highly competitive environment where much money and prestige are at stake, there will always be those open to “forgetting” the rules.

Publishing systems must incorporate appropriate controls in their peer review procedures. As mentioned in previous posts on plagiarism in this blog, publishers participating in the SciELO Program should also incorporate professional procedures into their systems to avoid this kind of fraudulent publication.

Notes

¹ VAN NOORDEN, R. Publishers withdraw more than 120 gibberish papers. Conference proceedings removed from subscription databases after scientist reveals that they were computer-generated. Nature. [viewed 24 February 2014]. Available from: <http://www.nature.com/news/publishers-withdraw-more-than-120-gibberish-papers-1.14763>.

² How computer-generated fake papers are flooding academia. The Guardian. [viewed 27 February 2014]. Available from: <http://www.theguardian.com/technology/shortcuts/2014/feb/26/how-computer-generated-fake-papers-flooding-academia>.

³ SCIgen – An Automatic CS Paper Generator – http://pdos.csail.mit.edu/scigen/.

⁴ LABBÉ, C., and LABBÉ, D. Duplicate and fake publications in the scientific literature: how many SCIgen papers in computer science. Scientometrics. [viewed 22 June 2012]. Available from: <http://hal.archives-ouvertes.fr/docs/00/71/35/55/PDF/0-FakeDetectionSci-Perso.pdf>.

⁵ AntiScigen – http://scigendetection.imag.fr/main.php.

⁶ An essay generator – http://www.essaygenerator.com/.

⁷ SBIR grant proposal generator – http://www.nadovich.com/chris/randprop/.

References

COHEN, N. He Wrote 200,000 Books (but Computers Did Some of the Work). The New York Times. [14 April 2008]. Available from: <http://www.nytimes.com/2008/04/14/business/media/14link.html?pagewanted=all&_r=0>.

HILL, D.J. Patented book writing system creates, sells hundreds of thousands of books on amazon. Singularity HUB. [13 December 2012]. Available from: <http://singularityhub.com/2012/12/13/patented-book-writing-system-lets-one-professor-create-hundreds-of-thousands-of-amazon-books-and-counting/>.

LABBÉ, C. Ike Antkare one of the greatest stars in the scientific firmament. LIG Laboratory. [14 April 2010]. Available from: <http://hal.inria.fr/docs/00/71/35/64/PDF/TechReportV2.pdf>.

DELGADO LÓPEZ-CÓZAR, E., TORRES-SALINAS, D., and ROLDÁN LÓPEZ, A. El fraude en la ciencia: reflexiones a partir del caso Hwang. El profesional de la información. 2007, marzo-abril, vol. 16, nº 2. Available from: <http://eprints.rclis.org/9979/1/g61n63522lg20818.pdf>.


 

About Ernesto Spinak

Collaborator on the SciELO Program, Systems Engineer with a Bachelor’s degree in Library Science, a Diploma of Advanced Studies and a Master’s in “Sociedad de la Información” (Information Society), both from the Universitat Oberta de Catalunya (Barcelona, Spain). He currently runs a consulting company that provides services for information projects to 14 government institutions and universities in Uruguay.

 

Translated from the original in Spanish by Nicholas Cop Consulting.

 

How to cite this post [ISO 690/2010]:

SPINAK, E. In the beginning it was just plagiarism – now it’s computer-generated fake papers as well [online]. SciELO in Perspective, 2014 [viewed ]. Available from: http://blog.scielo.org/en/2014/03/31/in-the-beginning-it-was-just-plagiarism-now-its-computer-generated-fake-papers-as-well/

 
