DSpin: Detecting Automatically Spun Content on the Web

Network and Distributed System Security Symposium (NDSS 2014) • DSpin: Detecting Automatically Spun Content on the Web • Qing Zhang, David Y. Wang, Geoffrey M. Voelker • University of California, San Diego • Speaker: Ting Luo • 2014/05/26

Outline 1. Introduction 2. Background And Previous Work 3. The Best Spinner 4. Similarity 5. Methodology 6. Spinning In The Wild 7. Discussion 8. Conclusion

Introduction • Search Engine Optimization (SEO) • Black Hat SEO • techniques that are used to get higher search rankings in an unethical manner • Spinning • used to generate and post Web spam • What is spinning? • replaces words • restructures original content • creates new versions with similar meaning but different appearance

Introduction • Using spinning in SEO to increase page rank • create many different versions of a single seed article • post those versions on multiple Web sites with links pointing to the site being promoted • [Figure labels: Original, A, B, C, D, Target Site]

Introduction • Goal • detect automatically spun content on the Web • Input • a set of article pages crawled from various Web sites • Output • a set of pages flagged as automatically spun content

Introduction • Contributions • 1. Spinning characterization • The Best Spinner • 2. Spun content detection • detecting automatically spun content based upon immutables • 3. Behavior of article spammers

Background And Previous Work A. Spinning Overview

Background And Previous Work A. Spinning Overview • Example • Both link to adult webcam sites • The spun content is in English, but has been posted to German and Japanese wikis • You have actually seen the feared demon-eye impact that occurs when the camera flash bounces off the eye of a person or animal • You’ve seen the dreaded demon-eye impact that happens when the camera flash bounces off the eye of an individual or animal

Background And Previous Work A. Spinning Overview (6) SPAM Content

Background And Previous Work B. Article Spam Detection • Web spam taxonomies • content spam • Quilted pages • Keyword stuffing • link spam • Page hijacking • Link farms

Background And Previous Work C. Near-duplicate Document Detection • Near-duplicate documents • two documents that differ from each other only in a very small portion, for example the part that displays advertisements • Fingerprinting algorithm • a procedure that maps an arbitrarily large data item (such as a computer file) to a much shorter bit string • reduces storage and computation costs

Background And Previous Work C. Near-duplicate Document Detection • From: http://en.wikipedia.org/wiki/Fingerprint_(computing)

Background And Previous Work C. Near-duplicate Document Detection • The classic approach - Shingles [1] • A shingle is the hash value of a k-gram, which is a sub-sequence of k successive words • The set of shingles constitutes the set of features of a document • Enables a graph representation for similarity among pages • pages as nodes • edges between two pages that share shingles above a threshold [1] Gurmeet Singh Manku, Arvind Jain, Anish Das Sarma, ‘Detecting Near-Duplicates for Web Crawling,’ 2007

The Best Spinner (TBS) A. TBS

The Best Spinner (TBS) A. TBS • A popular spinning tool • $77 per year • requires registration with a username and password • synonym dictionary • requires credentials at runtime to allow the tool to download an updated version • Spintax • {Home|House|Residence|Household}

The Best Spinner (TBS) A. TBS • Parameters • Frequency • every word, or one in every second, third, or fourth word • Remove original • removes the original word from the spintax alternatives • {Home|House|Residence|Household} → {House|Residence|Household} • Auto-select inside spun text • when selected, spins already spun text

The Best Spinner (TBS) A. TBS • {You can|You are able to|It is possible to|You’ll be able to|You possibly can}
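A minimal sketch of how a spintax string such as the ones above could be expanded into its variants. This is an illustration only, not TBS's implementation; the function name `expand_spintax` and the sample sentence are assumptions, and nested groups are not handled specially.

```python
import re

# Matches one {a|b|c} group that contains no nested braces.
SPINTAX_GROUP = re.compile(r"\{([^{}]*)\}")


def expand_spintax(text):
    """Return every variant encoded by a (non-nested) spintax string."""
    match = SPINTAX_GROUP.search(text)
    if match is None:
        # No groups left: the text is a finished variant.
        return [text]
    variants = []
    for option in match.group(1).split("|"):
        # Substitute one alternative, then expand the remaining groups.
        replaced = text[:match.start()] + option + text[match.end():]
        variants.extend(expand_spintax(replaced))
    return variants


# Hypothetical seed sentence: 4 alternatives x 2 alternatives = 8 variants.
variants = expand_spintax("My {Home|House|Residence|Household} is {big|large}")
```

Each group multiplies the number of possible articles, which is why a single seed article can yield many distinct-looking versions.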

The Best Spinner (TBS) B. Reverse Engineering TBS • During every startup • downloads the latest version of the synonym dictionary • saves it as the file tbssf.dat in an encrypted format (base64 encoding) • After reverse engineering TBS • use an authentication key to download the synonym dictionary • Synonym dictionary • 8.4 MB in size • has a total of 750,114 synonyms grouped into 92,386 lines

The Best Spinner (TBS) B. Reverse Engineering TBS • Authentication key

The Best Spinner (TBS) C. Controlled Experiments • 5-12% • 6-14%

Similarity • Similarity score • classic Jaccard Coefficient • take all the words from the two documents, A and B • compute the set intersection over the set union across all the words
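A short sketch of the word-level Jaccard Coefficient described on this slide, |A ∩ B| / |A ∪ B|. Lowercasing and whitespace tokenization are simplifying assumptions, not details taken from the slides.

```python
def jaccard(doc_a, doc_b):
    """Jaccard Coefficient over the sets of words in two documents."""
    words_a = set(doc_a.lower().split())
    words_b = set(doc_b.lower().split())
    union = words_a | words_b
    if not union:
        # Two empty documents: define the score as 0 to avoid dividing by zero.
        return 0.0
    return len(words_a & words_b) / len(union)


# {b, c} shared out of {a, b, c, d} in total.
score = jaccard("a b c", "b c d")
```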

Similarity • How to compute the intersection and size of two documents? • Extensions • A. Methods Explored • B. The Immutable Method • C. Verification Process

Similarity A. Methods Explored • (1) Shingling • Computing shingles, or n-grams, over the entire text with a shingle size of four • the sentence “a b c d e f” becomes the set of three elements “a b c d”, “b c d e”, and “c d e f” • the intersection is the overlap of shingles between two documents
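The shingling step can be sketched as follows, using the slide's shingle size of four. Whitespace tokenization and the helper names are assumptions; the classic approach also hashes each shingle, which is omitted here for readability.

```python
def shingles(text, k=4):
    """Return the set of k-word shingles (word n-grams) of a text."""
    words = text.split()
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}


def shingle_similarity(doc_a, doc_b, k=4):
    """Jaccard Coefficient computed over shingles instead of single words."""
    a, b = shingles(doc_a, k), shingles(doc_b, k)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


example = shingles("a b c d e f")
# Changing a single word ("f" to "g") already removes one of three shingles.
score = shingle_similarity("a b c d e f", "a b c d e g")
```

Because one replaced word invalidates every shingle that contains it, spun articles tend to score low under this measure.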

Similarity A. Methods Explored • low similarity scores, between 21.1% and 60.7% • Although useful for document similarity, shingling is not useful for identifying spun content given the low similarity scores

Similarity A. Methods Explored • (2) Parts-of-speech • Stanford NLP package • For each sentence, the NLP parser returns the original sentence with parts-of-speech tags for every word • use the parts-of-speech lists as the comparison unit

Similarity A. Methods Explored • TBS can replace single words with phrases, and phrases comprised of multiple words can be spun into a single word

Similarity B. The Immutable Method • Separate each article’s words into • mutables (words that appear in the synonym dictionary) • immutables (words that do not) • focus entirely on the list of immutable words from two articles to determine if they are similar
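A hedged sketch of the immutable method, assuming that an immutable is a word that does not appear in the synonym dictionary and that the score is a Jaccard Coefficient over the two sets of immutables. The tiny `MUTABLE_WORDS` set stands in for the real synonym dictionary and, like the sample sentences, is invented for illustration; phrase-level synonyms are ignored.

```python
# Hypothetical stand-in for the synonym dictionary (words a spinner may replace).
MUTABLE_WORDS = {"home", "house", "residence", "big", "large", "buy", "purchase"}


def immutables(text, mutable_words=MUTABLE_WORDS):
    """Return the words of a text that the spinner cannot replace."""
    return {w for w in text.lower().split() if w not in mutable_words}


def immutable_similarity(doc_a, doc_b):
    """Jaccard Coefficient restricted to immutable words."""
    a, b = immutables(doc_a), immutables(doc_b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


original = "buy a large home in San Diego"
spun = "purchase a big residence in San Diego"
score = immutable_similarity(original, spun)
```

In this toy pair every replaced word is mutable, so the immutable sets are identical even though only four of the ten distinct words are shared.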

Similarity A. Methods Explored • Ratios are above 90% for most spun content • provides a clear separation between spun and non-spun content

Similarity B. The Immutable Method • Benefit • greatly decreases the number of bytes needed for comparison by reducing the representation of each article by an order of magnitude

Similarity C. Verification Process • Mutable verifier steps • 1. It counts all the words that are common between the two pages and adds that count to the total overlap count • 2. It computes the synonyms of the remaining words from one page and determines if they match the words of the other page • 3. It takes the synonyms of the synonyms of the remaining words and compares them in a similar fashion to step two
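The three verifier steps might look roughly like this in code. This is a sketch under stated assumptions: `synonyms` is a hypothetical dict mapping a word to its set of synonyms, each word on the second page can be matched at most once, and how the overlap count is normalized into a score is left out because the slides do not specify it.

```python
def expand(words, synonyms):
    """Union of the synonym sets of all the given words."""
    result = set()
    for word in words:
        result |= synonyms.get(word, set())
    return result


def mutable_overlap(words_a, words_b, synonyms):
    """Count words of page A that match page B directly or via synonyms."""
    a, b = set(words_a), set(words_b)
    common = a & b
    overlap = len(common)                      # step 1: identical words
    left_a, left_b = a - common, b - common
    for word in sorted(left_a):
        first = synonyms.get(word, set())      # step 2: synonyms
        second = expand(first, synonyms)       # step 3: synonyms of synonyms
        for candidates in (first, second):
            hits = candidates & left_b
            if hits:
                left_b.discard(min(hits))      # consume one matched word
                overlap += 1
                break
    return overlap


# Hypothetical two-entry dictionary for illustration.
toy = {"home": {"house", "residence"}, "house": {"home", "dwelling"}}
direct = mutable_overlap(["my", "home"], ["my", "house"], toy)
indirect = mutable_overlap(["my", "home"], ["my", "dwelling"], toy)
```

The nested synonym lookups hint at why this verifier costs far more per pair than comparing immutables.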

Similarity A. Methods Explored • Has a much higher overhead

Methodology A. Data Sets • Wikis • purchased a Fiverr job offering to create 15,000 legitimate backlinks • crawled the recent posts on each of the wikis • 37M pages for December 2012 • GoArticles • allows users to build backlinks as “dofollow” links that can affect search engine page rankings • crawled over 1M articles posted between January 2012 and May 2013

Methodology B. Filters • Visible text • remove all pages that do not contain any visible text • Content tag • Wiki: div labeled “bodyContent” • GoArticles: div with “class=article” • if a page lacks this tag, remove it

Methodology B. Filters • Word count • Discard small pages • Threshold of 50 words • Link density • Discard pages with an unusually high link density • Foreign text • Only evaluate the immutable method on pages with mostly English text
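A rough sketch of the word-count and link-density filters. The 50-word threshold comes from the slide; the link-density definition (links per word) and the `MAX_LINK_DENSITY` cutoff are hypothetical, since the slides do not give them, and the foreign-text filter is omitted.

```python
MIN_WORDS = 50            # threshold stated on the slide
MAX_LINK_DENSITY = 0.5    # hypothetical cutoff, not taken from the slides


def keep_page(visible_text, link_count):
    """Return True if a page passes the word-count and link-density filters."""
    words = visible_text.split()
    if len(words) < MIN_WORDS:
        return False                       # discard small pages
    if link_count / len(words) > MAX_LINK_DENSITY:
        return False                       # discard link-heavy pages
    return True


long_page = "word " * 60
short_page = "word " * 10
```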

Methodology C. Inverted Indexing • Definition • < id, immu > • id is a unique index corresponding to an article • immu is an immutable that occurs in id • < immu, group < ids >> • each group represents all document ids that contain the immutable • < idi : idj , 1 > • < idi : idj, count > • count is the total number of immutables that overlap between idi and idj

Methodology C. Inverted Indexing • Calculate the similarity score between each pair of pages • Set the threshold to 75% • Example with 2 articles:
<id1, Alice> <id1, Mike> <id2, Mike> <id2, Alice>
→ <Alice, group<id1,id2>> <Mike, group<id1,id2>>
→ <id1,id2,1> <id1,id2,1>
→ <id1,id2,2>
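The inverted-index pipeline can be imitated in plain Python as a sketch. The original runs as Hadoop and Pig jobs; the function names here are invented, and scoring a pair as overlap divided by the size of the union of its immutables is an assumption.

```python
from collections import defaultdict
from itertools import combinations


def pair_overlaps(doc_immutables):
    """Count shared immutables per document pair via an inverted index."""
    index = defaultdict(set)                   # <immu, group<ids>>
    for doc_id, immus in doc_immutables.items():
        for immu in immus:                     # <id, immu>
            index[immu].add(doc_id)
    counts = defaultdict(int)                  # <idi:idj, count>
    for ids in index.values():
        for id_i, id_j in combinations(sorted(ids), 2):
            counts[(id_i, id_j)] += 1          # one <idi:idj, 1> per immutable
    return dict(counts)


def similar_pairs(doc_immutables, threshold=0.75):
    """Keep pairs whose overlap ratio reaches the threshold."""
    result = {}
    for (i, j), count in pair_overlaps(doc_immutables).items():
        union = len(doc_immutables[i] | doc_immutables[j])
        score = count / union
        if score >= threshold:
            result[(i, j)] = score
    return result


docs = {
    "id1": {"Alice", "Mike"},
    "id2": {"Mike", "Alice"},
    "id3": {"Mike", "Bob", "Carol", "Dave"},   # hypothetical third article
}
overlaps = pair_overlaps(docs)
pairs = similar_pairs(docs)
```

Only pairs that share at least one immutable are ever materialized, which avoids comparing every page against every other page.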

Methodology D. Clustering • Graph representation • each page (id) is a node • each pair of pages with a similarity score above the threshold has an edge • each connected subgraph represents a cluster
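Treating clusters as connected subgraphs can be sketched with a small union-find structure. The function name, the page ids, and the choice to drop single-page groups are assumptions made for illustration.

```python
def clusters(page_ids, edges):
    """Group pages into connected components given similarity edges."""
    parent = {page: page for page in page_ids}

    def find(page):
        # Follow parent links to the root, compressing the path on the way.
        while parent[page] != page:
            parent[page] = parent[parent[page]]
            page = parent[page]
        return page

    for a, b in edges:
        parent[find(a)] = find(b)              # merge the two components

    groups = {}
    for page in page_ids:
        groups.setdefault(find(page), set()).add(page)
    # A page with no similar partner is not a cluster of spun content.
    return [group for group in groups.values() if len(group) > 1]


found = clusters(["p1", "p2", "p3", "p4", "p5"], [("p1", "p2"), ("p2", "p3")])
```

Note that p1 and p3 end up in the same cluster without a direct edge, because both are similar to p2.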

Methodology E. Exact Duplicates and Near Duplicates • Exact duplicates • use a hash over each page (MD5 sum) • two articles are identical if their MD5 sums match • Near duplicates • use the mutable verifier • 100% mutable match, but with mismatching MD5 sums
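A small sketch of the exact versus near duplicate distinction. The `mutable_match_ratio` argument is assumed to come from a mutable verifier that is not shown here, and the label strings are illustrative.

```python
import hashlib


def md5_of(text):
    """MD5 sum of a page's text, used only as a content fingerprint."""
    return hashlib.md5(text.encode("utf-8")).hexdigest()


def classify_pair(text_a, text_b, mutable_match_ratio):
    """Label a pair of pages as exact duplicate, near duplicate, or other."""
    if md5_of(text_a) == md5_of(text_b):
        return "exact duplicate"
    if mutable_match_ratio == 1.0:
        # Same mutables everywhere, yet the bytes differ (e.g. formatting).
        return "near duplicate"
    return "other"


same = classify_pair("spun article", "spun article", 1.0)
near = classify_pair("spun article", "spun  article", 1.0)
```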

Methodology F. Hardware • 24 physical nodes running Fedora Core 14 • Each node has a single Xeon X3470 Quad-Core 2.93GHz CPU and 24 GB of memory • Runs Hadoop 1.1.2 and Pig 0.11.1 jobs

Spinning In The Wild A. Volume • Wiki • 68.0% as SEO spam • 35.6% are spun content • GoArticles has drastically less spun content (7.0%) than the wiki data set

Spinning In The Wild B. False Positives • False positives • two articles that appear in the same cluster but are unrelated • randomly sampled 99 clusters and chose 2 pages from each • found no evidence of false positives

Spinning In The Wild C. Cluster Sizes • [Figure panels: Wiki data set, GoArticles data set]

Spinning In The Wild D. Content • most of the popular words appear to relate to sales and services

Spinning In The Wild E. Domains • 1. Spun content across domains • the average cluster spans 12 ± 27 domains • spammers target multiple domains when posting spun content, instead of a single site

Spinning In The Wild E. Domains • The data indicates a strong, positive correlation between larger-scale spinning campaigns and a larger number of targeted domains
