
Turnitoff

Turnitoff. defeating plagiarism detection systems. Lee Gillam, John Marinuzzi and Paris Ioannou. Talk Outline. Introduction Plagiarism in the Wild Examples of how (not) to do it Defeating Plagiarism Detection (PD) Systems Exploiting their weaknesses Conclusions.


Presentation Transcript


  1. Turnitoff: defeating plagiarism detection systems • Lee Gillam, John Marinuzzi and Paris Ioannou

  2. Talk Outline • Introduction • Plagiarism in the Wild • Examples of how (not) to do it • Defeating Plagiarism Detection (PD) Systems • Exploiting their weaknesses • Conclusions

  3. Introduction, or …. But … why? • Plagiarism detection becomes a distraction from assessing written work, with an inherent cost in investigating suspicious material. • Pressure to perform makes such strategies a likely last resort. • Systematically run student work through PD systems • Legal / ethical dimension is interesting here • Outputs should be used as an indication of need for further (human) investigation; pressure to perform not limited to students? • Those who can become knowledgeable about detection strategies may also become adept at avoiding risk of detection by suitable adaptation of the sourced material, bringing the material below any arbitrary threshold of suspicion. • Danger of becoming reliant on the safety net but not knowing its strength?

  4. Introduction, or …. But … why? • Intrigued by presentations at last HEA-ICS • Essay spinning [Lancaster and Clarke] • “ALT-1055” (Cyrillic e), so what impact could we have by looking at Unicode more widely? • effective if the machine translation (MT) produces results that are sufficiently divergent from the original – advantage to using MT that produces lower quality results; Google/Yahoo MT systems too accurate? • Subsequent rewrite requires effort; also, risk that the rewrite lowers the required divergence. • Thesaural substitution and character substitution [Culwin] • 6 words for Google (maximum likelihood estimation); but need judicious selection of 6 words? • PD systems, search engines and systematic variations?

  5. Introduction, or …. But … why? • Belt and braces / another safety net; plenty to choose from, but want to know how they work: • Plagium – built over Yahoo Search • Seesources – built over Yahoo Search • PlagiarismDetector – “uses Google, Bing and Altavista” • Plagiarism Checker – first 32 words to Google or 50 to Yahoo. • EssayRater, now Grammarly.com - …. • Plagiarism Detect – “powered by Google” • Turnitin. • Many such alternatives will be as good as the search engine and its use. • So, if we can impact on the search query .....

  6. Introduction, or …. But … why? • Character and word substitutions, following on from last HEA-ICS conference to see what defeats what best. • Systematic changes, so also think about attempting to avoid suspicions being raised on reading. • Ideally, retain visual (and semantic) similarity. • Only assess risk of human detection in relation to character changes; semantic similarity beyond present scope. • Visual similarity also an issue for website domain names and phishing. • Developing prototypes to automate the generation of test texts and a web service for submission to parallel PD systems. • Key (initial) findings: We have identified systems that are not defeated in the same way as the apparently de facto Turnitin, so multiples can work.

  7. A hint of (initial) findings Alternative PD system needed

  8. Plagiarism in the Wild • A variety of high profile cases in recent years. • Three to pick on: • The Downing Street “dodgy dossier”. • Student accuser is the one to be disciplined. • The paper that never was. • The FamousPlagiarists.com / WarOnPlagiarism.org website features further alleged, and proven, cases of plagiarism in a variety of professions: • Al Gore, Martin Luther King, Osama Bin Laden, J.K. Rowling, Madonna, Britney Spears …. • …. and an unfortunate number of academics. • Unfortunate precedents - appear acceptable, go unchallenged, defended at senior levels, may be described as unintentional/mistakes with second chances offered. • Are penalties proportionate for small or moderate quantities of plagiarism in educational settings?

  9. Plagiarism in the Wild • The Downing Street “dodgy dossier” • Ibrahim Al Marashi’s “Iraq’s Security and Intelligence Network: A Guide and Analysis”, published in September 2002 by the Middle East Review of International Affairs (MERIA). • Journal editor has suggested, possibly on reflection, that this endorses the quality of their published articles. • Even “grammatical mistakes made on the internet version ended up in this February 2003 document”. • MERIA Editor’s response posted at: http://meria.idc.ac.il/british-govt-plagiarizes-meria.html (Accessed 20 May 2010) • Further discussed on the FamousPlagiarists.com / WarOnPlagiarism.org website, amongst others. • See, also, Tony Blair, Colin Powell and the Case of the “Sexed Up” British Intelligence Dossier - A Linguistic Analysis by Dr. John P. Lesko: http://www.famousplagiarists.com/MLSsexedupdossier.ppt (Accessed 20 May 2010)

  10. MERIA Journal’s Free Publicity (Middle East Review of International Affairs) Prime Minister Tony Blair’s office has apologized to Mr. Marashi but not to MERIA Journal for this plagiarism. Barry Rubin, MERIA Editor Tony Blair, Colin Powell and the Case of the “Sexed Up” British Intelligence Dossier - A Linguistic Analysis by Dr. John P. Lesko

  11. Plagiarism in the Wild • Alleged Plagiarism? Adapted from: Tony Blair, Colin Powell and the Case of the “Sexed Up” British Intelligence Dossier - A Linguistic Analysis by Dr. John P. Lesko

  12. Plagiarism in the Wild • Bindu Ganga (From www.famousplagiarists.com): • Argosy University student Marla Decker reported possible instances of plagiarism in the thesis of Bindu Ganga, Director of Training and member of faculty at Argosy University-Chicago. • Hearings held on supposed "ethics charges" against Decker. • Turnitin.com "originality report" commissioned by the Sun-Times substantiated the allegations ("a 45% match" as revealed by the "originality report"); attempts to silence the student exposed. • Marla Decker did receive her degree after the university tried to have her dismissed, but her 'ethics violations' ended up as a "part of her permanent academic record" • Argosy University eventually decided to fire Bindu Ganga over the plagiarism allegations. As reported in the Sun-Times, Argosy "also took away Ganga's doctorate in clinical psychology" • …… but she was allowed to submit a new project for, and be awarded, another doctorate.

  13. Plagiarism in the Wild • Dubin, D., The most influential paper Gerard Salton never wrote. Library Trends, 52-4 (2004) • Many citations by academics and practitioners of a supposed landmark reference in information retrieval: "A Vector Space Model for Information Retrieval" • But, there has never been an article with that particular title. • At minimum, this suggests that referring to spurious articles will help to distinguish the diligent researchers.

  14. Plagiarism in the Wild • Final note: Plagiarism by the world’s most wanted? • Osama Bin Laden • Allegation: Plagiarism of a poem by a Jordanian poet, Yusuf Abu Hilalah • Results: Criticism in a London-based newspaper; skillful use of Classical Arabic poetic conventions admired by followers • From “Poems in the Time of Oppression” by Yusuf Abu Hilalah: The fighter’s winds blew, / Striking their monuments, telling / The assailant that / “The swords will not be thrown down / until you free our lands” • From poetry recited by Osama bin Laden on Pentagon videotapes: The fighter’s winds blew, / Striking their towers, telling / The assailant that / “We will not stop our raids / until you free our lands” • “Views on plagiarism differ from culture to culture, and what might be seen as acceptable borrowing in one cultural context might be considered outright theft in another.” www.famousplagiarists.com

  15. Defeating PD Systems? • (Multiple) n-word segments (fingerprinting or shingling) • similar insertion will impact, depending on how systematic • costs by number of segments; full-text shingling is expensive • hashing • Frequency matching – distance/divergence measures; shared and unique words, ... • frequencies shifted by characters/words • This is a relatively expensive many-to-many search problem • thresholds may be applied to the number of segment matches required in order to reduce cost of subsequent operations
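The n-word segment idea on this slide can be illustrated with a minimal sketch. It is not the method of any system named in the talk: the segment length, the use of SHA-1, and the function names `shingles`, `fingerprints` and `containment` are all assumptions chosen for illustration.

```python
import hashlib


def shingles(text, n=6):
    """Return the set of overlapping n-word segments of a text."""
    words = text.lower().split()
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}


def fingerprints(text, n=6):
    """Hash each segment so that only short digests need storing and comparing."""
    return {
        hashlib.sha1(segment.encode("utf-8")).hexdigest()[:16]
        for segment in shingles(text, n)
    }


def containment(suspect, source, n=6):
    """Fraction of the suspect's segments that also occur in the source."""
    suspect_prints = fingerprints(suspect, n)
    if not suspect_prints:
        return 0.0
    return len(suspect_prints & fingerprints(source, n)) / len(suspect_prints)
```

A single altered character changes the hash of every segment that contains the altered word, which is why systematic insertions have a large effect. A threshold on the `containment` score would then decide whether a document goes forward for further (human) investigation.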

  16. Defeating PD Systems?

  17. Defeating PD Systems?

  18. Defeating PD Systems? • Anti-anti-plagiarism detection system (AAPS): http://sourceforge.net/projects/aaps • “a Perl application which accepts a text file as input, and produces an output with the same English meaning, but as many textual changes as possible, while maintaining grammer [sic.] and spelling” • Example substitution pairs: "in other words" <=> "alternatively", "a plethora of" <=> "many", "then there is" <=> "next comes", "took part in" <=> "participated in", "years of age" <=> "years old", "most common cause" <=> "leading cause", ….. "on the other hand" <=> "however", "up till now there" <=> "to date there"

  19. Defeating PD Systems?

  20. Defeating PD Systems? • Three small experiments • Character substitutions within Unicode • Thesaural substitutions • (human) detection of character substitutions. • For the first and second experiments, a 266 word text was used. • 266 words of “History of London, History of England a unique and stimulating site”. http://www.historyofengland.net/content/view/119/49/, (Accessed 20 May 2010), starting at “London 1500 Years Ago” and ending “made London their winter HQ”. • For the third, we made use of 20 intelligent humans for segments of a relevant Internet Engineering Task Force (IETF) document. • Request for Comments (RFC) 4690 on Internationalized Domain Names (IDNs), available at: http://www.rfc-archive.org/getrfc.php?rfc=4690 (Accessed 20 May 2010).

  21. Defeating PD Systems (1 & 3) • Substituted characters remain similar in a number of “typical” fonts. • Samples shown in Arial, 44pt and Perpetua, 44pt

  22. Defeating PD Systems (1) • Simple tests: • Test 1: Change Latin o,c,p,y,a,e to similar Cyrillic and Greek letters; • Test 2: Change Latin i and v to similar Hebrew (Vav) and Greek letters; • Test 3: Test 1 plus change Latin A,B,C,E,H,I,J,K,M,N,O,P,T,Y to similar Greek and Cyrillic letters; • Results for the seven systems show six are defeated by these changes • Plagiarism Detect appears not to recognize plagiarism for the original article, suggesting its database does not include this. Turnitin performs well, though Test 2 suggests it may be possible to push a text below a threshold.
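Substitutions of this kind leave a trace that a detector could look for: words whose letters come from more than one script. The sketch below is one possible check, not something any of the tested systems is known to do. The function names are hypothetical, and reading the script from the first word of the Unicode character name is a rough heuristic rather than a lookup of the Unicode Script property.

```python
import unicodedata


def scripts_in_word(word):
    """Collect script labels (e.g. LATIN, CYRILLIC, GREEK) for the letters in a word.

    Assumption: the first word of the Unicode character name is used as a
    stand-in for the character's script.
    """
    scripts = set()
    for ch in word:
        if ch.isalpha():
            name = unicodedata.name(ch, "")
            if name:
                scripts.add(name.split()[0])
    return scripts


def mixed_script_words(text):
    """Return the words that mix letters from more than one script."""
    return [word for word in text.split() if len(scripts_in_word(word)) > 1]
```

A word written entirely in another script is not flagged by this check, so it only catches partial substitution within a word.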

  23. Defeating PD Systems (2) • Tests 4-7: replace every 5th, 6th, 7th and 8th word, respectively (thesaural substitution) • Alternative PD system needed
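Why the replacement interval matters can be seen with a little arithmetic on n-word segments. The sketch below assumes a detector that matches exact 6-word segments (the talk mentions 6 words only in connection with Google queries, so the segment length is an assumption) and counts how many segments of a 266-word text contain no replaced word. The function name is hypothetical.

```python
def surviving_segment_fraction(num_words, n, k):
    """Fraction of n-word segments left untouched when every k-th word is replaced."""
    # 0-based positions of the replaced words: the k-th, 2k-th, ... words.
    replaced = {i for i in range(num_words) if (i + 1) % k == 0}
    total = num_words - n + 1
    if total <= 0:
        return 0.0
    intact = sum(
        1
        for start in range(total)
        if not any(pos in replaced for pos in range(start, start + n))
    )
    return intact / total


# Under these assumptions, the share of intact 6-word segments per test interval:
for k in (5, 6, 7, 8):
    print(k, round(surviving_segment_fraction(266, 6, k), 3))
```

Whenever the interval is no longer than the segment length, every segment contains a replaced word and no exact match survives; longer intervals leave only a small share intact.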

  24. Defeating PD Systems (3) • A small crowdsourcing experiment (20 undergraduate students, predominantly from the Department of Computing). • Each crowd member briefed to detect as many changes as they could to a set of paragraphs and given both the original and a revised version containing zero or more character substitutions. • For a set of substitutions, “intuitively selected”, we were looking for approximate detectability for an informed, but moderately motivated, audience. Arial, 20pt

  25. Defeating PD Systems (3) • To demonstrate these replacements, the first four (up to 20% risk of detection) are applied to a segment of the Marashi article • 16 replacements of “e”, 2 of “h”, 2 of “v” and 4 of “l” • Visually, these appear highly similar, and this is also likely to pass detection by most of the systems we tested. Arial, 20pt

  26. Defeating PD Systems (3) • For 8 texts taken from BBC News on a particular day, we assessed the average number of changes attributable to these substitutions to determine the average impact on a document of making such changes. • 8 texts taken from BBC News, e.g. http://news.bbc.co.uk/1/hi/world/asia-pacific/8681833.stm • Over 50% of words changed (in an apparently risk-free manner) by the first two substitutions. • We are initially assuming 100% plagiarism; amounts starting below this might well be brought under any threshold for suspicion.
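The kind of measurement described on this slide can be sketched as a count of words containing at least one of the target characters. Taking the first two substitutions to be “e” and “h” is an assumption based on the order in which the replacements are listed on the previous slide, and the function name is hypothetical.

```python
def fraction_of_words_affected(text, targets=("e", "h")):
    """Share of words containing at least one of the target characters."""
    words = text.split()
    if not words:
        return 0.0
    affected = sum(1 for word in words if any(t in word for t in targets))
    return affected / len(words)
```

Applied to the headline quoted on the next slide, “Students face a scramble for university places”, this gives 5 of 7 words affected. Averaging the figure over a set of texts would give the per-document impact the slide refers to.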

  27. Plagiarism Checker: "Students face a scramble for university places" OR "Last year 47,600 places were awarded through clearing" OR "Competition for university places is expected to be intense this year, with fewer places likely to be"
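The query on this slide is consistent with the earlier note that Plagiarism Checker sends the first 32 words to Google: the three quoted phrases contain 7, 8 and 17 words. A sketch of how such a query might be assembled follows. The sentence splitting rule and the function name `build_query` are assumptions, not the tool's actual code.

```python
import re


def build_query(text, max_words=32):
    """Quote successive sentences, joined with OR, up to a total word budget."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    phrases = []
    budget = max_words
    for sentence in sentences:
        if budget <= 0:
            break
        # Drop sentence-final punctuation and cut the phrase at the word budget.
        words = sentence.rstrip(".!?").split()[:budget]
        if not words:
            continue
        phrases.append('"' + " ".join(words) + '"')
        budget -= len(words)
    return " OR ".join(phrases)
```

Because each phrase is quoted, the search engine is asked for an exact match, so a change to any one word or character in a phrase is enough to make that phrase miss.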

  28. www.cs.surrey.ac.uk

  29. Defeating PD Systems (3) As seen in Google query box Change the font.... "Studеnts fасеаsсrаmblеfοr univеrsitурlасеs" OR "Lаstуеаr 47,600 рlасеs wеrеаwаrdеd thrοugh сlеаring" OR "Cοmреtitiοn fοr univеrsitурlасеs is еxресtеd tοbеintеnsеthis уеаr, with fеwеr рlасеs likеlуtοbе“ Lucida Calligraphy, 20pt "Studеnts fасе а sсrаmblе fοr univеrsitу рlасеs" OR "Lаstуеаr 47,600 рlасеs wеrе аwаrdеd thrοugh сlеаring" OR "Cοmреtitiοn fοr univеrsitу рlасеs is еxресtеd tο bе intеnsе this уеаr, with fеwеr рlасеs likеlу tο bе“ Arial, 20pt www.cs.surrey.ac.uk

  30. Defeating PD Systems (3) • More work needed to make thesaural approach sufficiently systematic and impactful

  31. Would students actually do this? • Copy: “Though the focus of this chapter is on the effect of ECommerce on agribusiness, it is helpful to define the technology that makes ECommerce possible. Network technology is not new. The telephone and the fax have been around a long time, and their effects are well understood. More relevant to ECommerce are the newer technologies that link computers together.” • Original: “Though the focus of this paper is on the effect of e-commerce on agribusiness, it is helpful to define the technology that makes e-commerce possible. Network technology is not new. The telephone and the fax have been around a long time, and their effects are well understood. More relevant to e-commerce are the newer technologies that link computers together”. • Copy: “Although data mining still in its infancy, companies is a wide range of industries - including retail, finance, healthcare, manufactory, transportation, aerospace, are already using data mining tools and techniques to take advantage of history data. For marketing, data mining is used to discover patterns and relationship in the data in order to help make better market decisions. Data mining can help sport sales trend, develop smart marketing campaigns and accurately predicts customer satisfaction.” • Original: “Although data mining is still in its infancy, companies in a wide range of industries - including retail, finance, heath care, manufacturing transportation, and aerospace - are already using data mining tools and techniques to take advantage of historical data. For businesses, data mining is used to discover patterns and relationships in the data in order to help make better business decisions. Data mining can help spot sales trends, develop smarter marketing campaigns, and accurately predict customer loyalty.” • “A decision tree is a class discriminator that partitions the training dataset until each partition consists entirely or dominantly of examples from one class”

  32. Concluding Remarks • Experiments demonstrate how the determined, but lazy, plagiarist could learn to avoid detection. • Most of the systems evaluated are based on search engines, so are readily susceptible either to character substitutions, or to systematic word substitutions; some are susceptible to both. • Given the relative ease of making such substitutions with typical word processing software, manually or via a macro, or using or buying an article rewriter, reliance on a single plagiarism detection system may be risky. • Turnitin together with either Seesources or PlagiarismDetector could help to avoid the weaknesses discussed; certain pre-processing strategies could help to improve likelihood of detection. • Without such pre-processing, burden of detection remains with the human reader, who has to become increasingly adept at spotting stylistic variations and any other flags relating to such kinds of trickery as may have been used in order to avoid detection.
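One pre-processing strategy of the kind mentioned above is to fold look-alike characters back to their Latin counterparts before any comparison or search query is made. The sketch below uses a small hand-built table covering the lowercase letters from Test 1 plus Cyrillic “х”; it is an illustrative assumption, not a complete table, and the names `LOOKALIKES` and `fold_lookalikes` are hypothetical.

```python
# Hand-built, deliberately incomplete table of look-alike characters (an assumption).
LOOKALIKES = {
    "\u0430": "a",  # Cyrillic small a
    "\u0435": "e",  # Cyrillic small ie
    "\u043e": "o",  # Cyrillic small o
    "\u0440": "p",  # Cyrillic small er
    "\u0441": "c",  # Cyrillic small es
    "\u0443": "y",  # Cyrillic small u
    "\u0445": "x",  # Cyrillic small ha
    "\u03bf": "o",  # Greek small omicron
}

_TABLE = str.maketrans(LOOKALIKES)


def fold_lookalikes(text):
    """Replace known look-alike characters with their Latin counterparts."""
    return text.translate(_TABLE)
```

A production table would need to be far larger, and applying it blindly would also alter genuine Cyrillic or Greek text, so it is better suited to documents expected to be in a Latin-script language.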

  33. Concluding Remarks • Evaluations of plagiarism detection systems have been reported, but the majority have not assessed weaknesses of these systems. • Relatively recently, the first international competition on plagiarism detection was held within the 3rd Workshop on Uncovering Plagiarism, Authorship and Social Software Misuse (PAN-09). • The competition involves the detection of two kinds of plagiarism within a reasonably large text corpus, and some 47,000 (constructed) plagiarized articles. • The 2nd competition has just finished (and involved a second task on detecting vandalism in Wikipedia). • (with Neil Cooke) working on a fast detection approach – first version is comparable to the 4th best PAN 2009 system; we know how to improve on this.

  34. Thank you for listening
