NIST Blog Pre-Track, 14 Nov 2006
Blog Track Open Task: Spam Blog Detection
Presented by Tim Finin
Pranam Kolari, Akshay Java, Tim Finin, Anupam Joshi, Justin Martineau (University of Maryland, Baltimore County)
James Mayfield (Johns Hopkins University Applied Physics Laboratory)
http://ebiquity.umbc.edu/paper/html/id/318/
Spam in the Blogosphere
• Types: comment spam, ping spam, spam blogs
• Akismet: “87% of all comments are spam”
• 75% of update pings are spam (ebiquity 2005)
• 20% of blogs indexed by popular blog search engines are spam (Umbria 2006, ebiquity 2005)
• “Spam blogs, sometimes referred to by the neologism splogs, are weblog sites which the author uses only for promoting affiliated websites” 1
• “Spings, or ping spam, are pings that are sent from spam blogs” 1
1 Wikipedia
Why a problem?
• The blogosphere is an increasingly important segment of the Web; ~12 hours from post to Google index
• Splog content provides no additional value
• Splog content is often plagiarized
• Splogs demote the value of authentic content
• Splogs steal advertising (referral) revenue from authentic content producers
• Splogs stress the blogosphere infrastructure
• Splogs can skew blog analytics, as was observed in the TREC Blog Track 2006
Nature of Splogs in TREC 2006
• Around 83K identifiable blog home-pages in the collection, with 3.2M permalinks
• 81K blogs could be processed
• We used splog detection models developed on blog home-pages (87% accuracy)
• We identified 13,542 splogs
• Blacklisted 543K permalinks from these splogs (propagation sketched below)
  • ~16% of the entire collection
  • ~17% splog posts injected into the TREC dataset 1
1 The TREC Blog06 Collection: Creating and Analyzing a Blog Test Collection, C. Macdonald, I. Ounis
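As a rough illustration of how home-page level splog judgments translate into a permalink blacklist, the minimal sketch below maps each permalink to its hosting blog by hostname; the function, data, and host-matching rule are illustrative assumptions, not the pipeline actually used on Blog06.

```python
# Minimal sketch (not the actual Blog06 pipeline): propagate a home-page
# level splog judgment to every permalink on the same host.
# Matching by hostname is an assumption; shared hosts would need path matching.
from urllib.parse import urlparse

def build_permalink_blacklist(splog_homepages, permalinks):
    """Blacklist every permalink hosted on a blog judged to be a splog."""
    splog_hosts = {urlparse(h).netloc for h in splog_homepages}
    return [p for p in permalinks if urlparse(p).netloc in splog_hosts]

# Hypothetical example: one detected splog blacklists all of its posts.
blacklist = build_permalink_blacklist(
    ["http://cheap-casino-blog.example.com/"],
    ["http://cheap-casino-blog.example.com/2006/01/post1.html",
     "http://legit-blog.example.org/2006/01/hello.html"])
print(blacklist)   # only the post on the splog host remains blacklisted
```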
Impact of Splogs in TREC Queries
• [Figure: splog impact illustrated for three example queries: “cholesterol”, “hybrid cars”, “American Idol”]
Splog Detection Task Proposal
• Motivation
  • Detecting and eliminating spam is an essential requirement for any blog analysis
  • Splog detection has characteristics that set it apart from e-mail and web spam detection
• Constraint
  • Simulate how blog search systems operate
• Task Statement
  • Is an input permalink (post) spam?
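The task statement amounts to a per-permalink binary (or probabilistic) judgment. A hypothetical sketch of the interface a participant system might expose, with field names mirroring the input format shown later; the heuristic is only a stand-in.

```python
# Hypothetical interface sketch for the task statement above; a real system
# would apply trained splog detection models rather than a keyword check.
from dataclasses import dataclass

@dataclass
class Permalink:
    url: str        # post URL to be judged
    homepage: str   # home page of the hosting blog
    feed: str       # RSS/Atom feed of the hosting blog
    when: str       # time the post was seen

def splog_probability(p: Permalink) -> float:
    """Return an estimated P(spam) for one permalink."""
    spammy = ("casino", "coupon-codes", "cheap-meds")   # illustrative terms
    return 1.0 if any(t in p.url.lower() for t in spammy) else 0.0
```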
Relation to E-mail Spam Detection
• TREC has an e-mail spam classification task
• Similar in
  • Fast online spam detection
• Different in
  • Nature of spamming: links, RSS feeds, web graph, metadata
  • Users targeted indirectly through search engines, e.g. “N1ST” not relevant for a “NIST” query
Relation to Web Spam Detection
• TREC does not have a web spam track
• Similar in
  • Spamming the web link structure
• Different in
  • Coverage: blog analytics engines don’t look beyond the blogosphere
  • Speed of detection is important, ~150K posts/hour
  • Presence of structured text through RSS feeds presents new opportunities, and challenges
Difficulty
• We have been experimenting with multiple approaches since mid-2005
• Data: http://ebiquity.umbc.edu/resource/html/id/212
Difficulty
• Evolving spamming techniques and splog creation genres
• Most basic spam techniques
  • Generate content by stuffing key dictionary words (a toy detection sketch follows this list)
  • Generate links to affiliates through link dumps on blogrolls, linkrolls, or after post content
• Evolving spam techniques
  • Scrape contextually similar content to generate posts
  • RSS hijacking
  • Aggregation software, e.g. Planet X
  • Intersperse links randomly
  • Make link placement meaningful
  • Add spam comments and then ping. Repeat.
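To make the most basic technique concrete, here is a toy keyword-stuffing check (not one of the detection models discussed here): stuffed posts repeat a small vocabulary, so the ratio of distinct to total words is abnormally low. The 0.2 threshold and 50-word minimum are arbitrary illustration values.

```python
import re

def looks_keyword_stuffed(text: str, min_ratio: float = 0.2) -> bool:
    """Flag text whose distinct-word ratio is suspiciously low (toy heuristic)."""
    words = re.findall(r"[a-z']+", text.lower())
    if len(words) < 50:               # too short to judge reliably
        return False
    return len(set(words)) / len(words) < min_ratio

stuffed = "cheap casino coupon codes " * 100
print(looks_keyword_stuffed(stuffed))   # True: tiny vocabulary, heavy repetition
```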
Task Details - Dataset Creation
• Similar to TREC Blog 2006: a collection of feeds, blog home-pages, and permalinks
• View dataset D as two sets, Dbase and Dtest
• Dbase spans (n-x) days and Dtest spans the remaining x days, for x ≤ 1 (split sketched below)
• D could be collected as a combination of
  • D as collected in 2006
  • A sample of the pings from a ping server over the period that D is collected
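A minimal sketch of the Dbase / Dtest split described above, assuming each collected permalink record carries the timestamp at which it was seen; the one-day test window reflects the x ≤ 1 constraint on the slide, and the record fields are illustrative.

```python
from datetime import datetime, timedelta

def split_base_test(permalinks, collection_end, test_days=1):
    """Dbase = everything before the final `test_days`; Dtest = the remainder."""
    cutoff = collection_end - timedelta(days=test_days)
    d_base = [p for p in permalinks if p["when"] < cutoff]
    d_test = [p for p in permalinks if p["when"] >= cutoff]
    return d_base, d_test

# Hypothetical records; only the "when" field matters for the split.
posts = [{"url": "http://blog.example.com/p1", "when": datetime(2006, 1, 10)},
         {"url": "http://blog.example.com/p2", "when": datetime(2006, 2, 27, 18)}]
d_base, d_test = split_base_test(posts, collection_end=datetime(2006, 2, 28))
```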
Task Details - Assessment
• Assessors classify each spam post into one or more classes, based on the kind of spam featured by the post or by the blog hosting it
  • Non-blog
  • Keyword-stuffed
  • Post-stitching
  • Post-plagiarism
  • Post-weaving
  • Blog/link-roll spam
• Each assessment typically takes 1-2 minutes
• Detailed assessments will enable participants to identify the classes they handle well and where they can improve
Non-Blog
• Pings at weblogs.com
• No RSS feeds
• No dated entries, no comments
• Possibly plagiarized content

Keyword Stuffed Blog
• e.g. ‘coupon codes’, ‘casino’

Post Stitching
• Excerpts scraped from other sources

Post Weaving
• Spam links contextually placed in post content

Link-roll Spam
• With fully plagiarized text
Evaluation
• Dbase distributed first; Dtest distributed subsequently as 50 independent sets of permalinks
• The Dbase / Dtest division will mimic how blog search engines operate
• Participants build models to detect splogs using the individual posts, feeds, or blog home-pages seen so far
• Detect spam in an incoming stream of new blog postings
• Teams will be judged by how well they detect “spamminess” for new posts (one possible scoring sketched below)
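The slide does not fix a metric; one plausible reading of “how well they detect spamminess” is ranking quality over the assessor labels, for example ROC AUC, sketched below (ties between scores are ignored for simplicity).

```python
def roc_auc(labels, scores):
    """AUC = probability that a random spam post outscores a random non-spam post."""
    pos = sum(labels)
    neg = len(labels) - pos
    seen_neg, better = 0, 0.0
    for _, label in sorted(zip(scores, labels)):   # ascending by score
        if label == 0:
            seen_neg += 1                          # non-spam ranked below later spam
        else:
            better += seen_neg
    return better / (pos * neg)

print(roc_auc([1, 0, 1, 0], [0.9, 0.2, 0.7, 0.4]))   # 1.0: spam always scored higher
```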
Input/Output
• Input format (one test set):
  <set>
    <num>...</num>
    <test>
      <permalink>
        <url>...</url>
        <homepage>...</homepage>
        <feed>...</feed>
        <when>...</when>
      </permalink>
      <permalink> ... </permalink>
      ...
    </test>
  </set>
• Each <permalink> is to be judged by participants
• Each <set> is an individual set of test input; 1 or y such sets can be used, with each set biased to a specific splog genre, blog publishing host, or TLD
• Output format: {set Q0 docno rank prob runtag}
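A small sketch of writing a run in the output format above, {set Q0 docno rank prob runtag}; the docno strings and run tag are illustrative assumptions, not prescribed values.

```python
def write_run(set_num, scored_permalinks, runtag, out_path):
    """scored_permalinks: (docno, prob) pairs, one per test permalink in the set."""
    ranked = sorted(scored_permalinks, key=lambda x: x[1], reverse=True)
    with open(out_path, "w") as out:
        for rank, (docno, prob) in enumerate(ranked, start=1):
            out.write(f"{set_num} Q0 {docno} {rank} {prob:.4f} {runtag}\n")

# Hypothetical docnos and run tag, for illustration only.
write_run(901, [("BLOG06-PERMALINK-012345", 0.93),
                ("BLOG06-PERMALINK-000042", 0.08)],
          "exampleRun1", "splog.run")
```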
Summary
• Spam blogs present a major challenge to the quality of blog mining/analytics
• Splog detection is different from spam detection in other communication platforms
• Development of a TREC task will help further the state of the art
• Task requirements can easily be aligned with the existing task of opinion identification