Large-scale Plagiarism Detection and Authorship Attribution

References
• Juxtapp: A Scalable System for Detecting Code Reuse Among Android Applications. Steve Hanna, Ling Huang, Edward Wu, Saung Li, Charles Chen, and Dawn Song
• On the Feasibility of Internet-Scale Author Identification. Arvind Narayanan, Hristo Paskov, Neil Zhenqiang Gong, John Bethencourt, Eui Chul Richard Shin, Dawn Song, Emil Stefanov

Plagiarism detection
• Once applied mainly to literary corpora and academic writing
• Detecting source code similarity and plagiarism is now just as important
• “Moss” is the most widely known software similarity detection tool
• Code similarity detection can also provide valuable insight into malware detection

Code similarity ⇒ malware?
• Generally not true
• In the Android app domain, it can be
• 86% of Android malware samples are repackaged versions of legitimate apps with malicious payloads (source: “Dissecting Android Malware: Characterization and Evolution”)
• Similarity detection is therefore crucial

JuxtApp: a scalable system for code-reuse detection in Android apps
• Each Android app is packaged as a file with the .apk extension
• Each apk contains a .dex file, a Dalvik executable that is run by the Dalvik virtual machine
• JuxtApp fingerprints the apk using bithashing

Bithashing

JuxtApp workflow contd.
• Application preprocessing: each app is segmented into basic blocks. Only the opcodes are retained, except for opcodes that store constant data, e.g. const-string; in that case the opcode is concatenated with the value it references.
• Feature extraction: k-grams of opcodes are extracted by sliding a window of size k over the opcodes. Each k-gram is hashed with the djb2 hash function, and the bit corresponding to each hash value is set in the bitvector.
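The feature extraction step can be sketched as follows. This is a minimal illustration, not the JuxtApp implementation: the function names, the space-joined encoding of a k-gram, the 32-bit truncation of djb2, and the choice to skip basic blocks shorter than k and to keep windows inside a single block are all assumptions made here.

```python
from typing import Iterable, List, Sequence


def djb2(data: str) -> int:
    """Classic djb2 string hash (h = h * 33 + c), truncated to 32 bits."""
    h = 5381
    for ch in data:
        h = ((h << 5) + h + ord(ch)) & 0xFFFFFFFF
    return h


def fingerprint(blocks: Iterable[Sequence[str]], k: int = 5, m: int = 240007) -> List[int]:
    """Build an m-bit vector from the opcode k-grams of every basic block.

    Each block is a sequence of opcode strings. Windows do not cross block
    boundaries, and blocks with fewer than k opcodes contribute nothing
    (both are assumptions of this sketch).
    """
    bits = [0] * m
    for opcodes in blocks:
        for i in range(len(opcodes) - k + 1):
            kgram = " ".join(opcodes[i:i + k])  # hypothetical k-gram encoding
            bits[djb2(kgram) % m] = 1           # set the bit for this hash value
    return bits
```

A block of six opcodes with k = 5 yields two windows, so at most two bits are set; hash collisions can only lower that count.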

Feature extraction contd.
• k was set to 5, chosen by experiment: pairs of apps were drawn from a random sample of 6,000 apps and the distance between each pair was computed. From k = 5 onward, the value of k has little impact on the distance calculation.
• Basic blocks are short: the mean size is 5.35 opcodes and the median is 2 opcodes, while the largest basic block in the dataset contains 35,517 opcodes.

Feature extraction contd.
• The bitvector size m is chosen by experiment. m should be much larger than N, the number of k-grams extracted from an application (m >> N), so that hash collisions have little effect on the comparison between two k-gram feature sets.
• 30,000 apps were used to determine m: m ≈ 9 × N90, giving m = 240,007, a prime number.
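One way to turn "a prime roughly nine times the k-gram count" into a concrete size is sketched below. The slide does not say how the prime was picked, so the rule used here (smallest prime at or above factor × N) and both function names are assumptions for illustration.

```python
def is_prime(n: int) -> bool:
    """Trial-division primality test, adequate for bitvector-sized numbers."""
    if n < 2:
        return False
    if n % 2 == 0:
        return n == 2
    d = 3
    while d * d <= n:
        if n % d == 0:
            return False
        d += 2
    return True


def pick_bitvector_size(n_kgrams: int, factor: int = 9) -> int:
    """Return the smallest prime >= factor * n_kgrams (assumed selection rule)."""
    candidate = n_kgrams * factor
    while not is_prime(candidate):
        candidate += 1
    return candidate
```

For example, `pick_bitvector_size(10)` starts at 90 and walks up to 97, the first prime it meets.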

Similarity between a pair of apps
• Given the bitvector representations A and B of two apps, their similarity is computed as J(A,B) = |A ∧ B| / |A ∨ B|, where |·| counts the bits that are set
• This formula is a variation of the original Jaccard similarity
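The formula above can be computed directly on two equal-length bitvectors. The sketch below stores each bitvector as a plain list of 0/1 integers; the function name and the convention of returning 0.0 when both vectors are empty are assumptions of this example.

```python
from typing import Sequence


def jaccard_similarity(a: Sequence[int], b: Sequence[int]) -> float:
    """J(A, B) = |A AND B| / |A OR B| over two equal-length 0/1 vectors."""
    if len(a) != len(b):
        raise ValueError("bitvectors must have the same length")
    intersection = sum(1 for x, y in zip(a, b) if x and y)
    union = sum(1 for x, y in zip(a, b) if x or y)
    # Assumed convention: two all-zero vectors have similarity 0.0.
    return intersection / union if union else 0.0
```

With A = [1, 1, 0, 0] and B = [1, 0, 1, 0], one bit is shared and three bits are set in either vector, so the similarity is 1/3.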

Future challenges
• If an app is heavily obfuscated, JuxtApp may not perform well
• Third-party libraries can add a lot of noise and adversely affect the similarity score

Authorship Attribution
• Who wrote it?
• Identify an anonymous author by comparing their writing style against a corpus of texts of known authorship
• The primary application has shifted from the literary domain to forensics: terrorist threats, harassment

“On the Feasibility of Internet-Scale Author Identification”
• 2.4 million posts from 100,000 blogs (almost a billion words)
• Stylometry: identify the author based on writing style
• Are n-gram techniques suitable? Not really, because they reveal more about the context than about the author

Experiment
• Prepare a training set and a test set
• Build a classifier with the training set
• Test the classifier with the test set
• Which features should be considered?
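The train-then-test loop can be illustrated with a deliberately small stand-in. The slides do not specify the classifiers or the feature set, so everything below is an assumption chosen for brevity: a nearest-centroid classifier, relative frequencies of ten common function words as features, and the names `features`, `train`, and `rank_authors`.

```python
import math
from collections import Counter
from typing import Dict, List, Tuple

# Illustrative feature set only; not the features used in the paper.
FUNCTION_WORDS = ("the", "of", "and", "to", "a", "in", "that", "is", "it", "for")


def features(text: str) -> List[float]:
    """Relative frequency of each function word in the text."""
    words = text.lower().split()
    counts = Counter(words)
    total = len(words) or 1
    return [counts[w] / total for w in FUNCTION_WORDS]


def train(labelled: List[Tuple[str, str]]) -> Dict[str, List[float]]:
    """Build one centroid per author from (author, text) training pairs."""
    sums: Dict[str, List[float]] = {}
    seen: Counter = Counter()
    for author, text in labelled:
        vec = features(text)
        acc = sums.setdefault(author, [0.0] * len(vec))
        for i, value in enumerate(vec):
            acc[i] += value
        seen[author] += 1
    return {a: [v / seen[a] for v in acc] for a, acc in sums.items()}


def rank_authors(model: Dict[str, List[float]], text: str) -> List[str]:
    """Return candidate authors ordered from closest to farthest centroid."""
    vec = features(text)
    return sorted(model, key=lambda author: math.dist(model[author], vec))
```

The first entry of `rank_authors` is the classifier's guess; keeping the full ranking makes it possible to ask whether the true author appears among the top few candidates.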

Feature selection

Feature selection contd.
• Syntax tree produced by the Stanford parser
• Yule’s K: K = 10000 * (M - N) / (N * N)
  N = total number of words in the text
  M = ∑ i * i * Vi, where Vi is the number of words that occur i times
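Yule's K follows directly from the definitions of N, M, and Vi given on this slide. The sketch below assumes the text has already been split into words; the function name and the tokenization are not from the source.

```python
from collections import Counter
from typing import Sequence


def yules_k(words: Sequence[str]) -> float:
    """Yule's K = 10000 * (M - N) / (N * N) for a tokenized text."""
    n = len(words)                       # N: total number of words
    if n == 0:
        raise ValueError("text must contain at least one word")
    occurrences = Counter(words)         # word -> number of occurrences
    v = Counter(occurrences.values())    # i -> Vi, words occurring i times
    m = sum(i * i * v_i for i, v_i in v.items())
    return 10000 * (m - n) / (n * n)
```

A text in which every word is distinct has M = N and therefore K = 0; repeated words push K up, so the value reflects how strongly an author reuses vocabulary.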

How well does it work?
• In 20% of cases the classifiers correctly identify an anonymous author given a corpus of texts from 100,000 authors
• In 35% of cases the correct author is among the top 20 guesses
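Both figures are instances of top-k accuracy: the fraction of test texts whose true author appears among the first k guesses (k = 1 for the first figure, k = 20 for the second). A minimal sketch, with a hypothetical function name and toy inputs:

```python
from typing import List, Sequence


def top_k_accuracy(rankings: Sequence[List[str]], truths: Sequence[str], k: int) -> float:
    """Fraction of cases where the true author is within the top k guesses.

    rankings[i] is the ordered list of guessed authors for test text i,
    and truths[i] is that text's actual author.
    """
    if not truths or len(rankings) != len(truths):
        raise ValueError("rankings and truths must be non-empty and aligned")
    hits = sum(1 for ranked, truth in zip(rankings, truths) if truth in ranked[:k])
    return hits / len(truths)
```

Top-k accuracy can only grow as k grows, which is why the top-20 figure is higher than the exact-match figure.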

Other challenges of authorship attribution
• Malware author identification from:
  – Plain-text source code
  – Binary executables
  – Intermediate code
