
Detecting Spam Blogs: An Adaptive Online Approach Pranam Kolari


Presentation Transcript


  1. Detecting Spam Blogs: An Adaptive Online Approach. Pranam Kolari, Ph.D. Defense, Sept 25, 2007

  2. THESIS STATEMENT It is possible to develop an effective, efficient and adaptive system to detect spam blogs.

  3. CONTRIBUTIONS • a principled study of the characteristics of the problem, • a well motivated feature discovery effort, • a cost-sensitive, real-time filtering implementation, and • an ensemble driven classifier co-evolution.

  4. OUTLINE • Introduction • Characterization • Feature Discovery • Cost-aware pipeline • Adaptive Classifiers • Evaluation • Conclusions • Future Directions

  5. WHAT IS SPAM? • “Unsolicited usually commercial e-mail sent to a large number of addresses” – Merriam Webster Online • As the Internet has supported new applications, many other forms have become common, requiring a much broader definition: capturing user attention unjustifiably in Internet-enabled applications (e-mail, Web, social media, etc.)

  6. SPAM TAXONOMY [Diagram] INTERNET SPAM: DIRECT, INDIRECT. [Forms] Bookmark Spam, E-Mail Spam, Comment Spam, IM Spam (SPIM), Spam Blogs (Splogs), Social Network Spam, General Web Spam. [Mechanisms] Spamdexing, Social Media Spam

  7. SPAMDEXING [Diagram labels: Affiliate Programs; Context Ads; (i) arbitrage ads/affiliate links; (ii) in-links; Spam pages, Spam Blogs [DOORWAY]; JavaScript Redirect; Spammer owned domains; Affiliate Program Buyers; spamdex (iii); Spam pages, Spam Blogs, Spam Comments, Guestbook Spam, Wiki Spam; SERP; Search Engines]

  8. SPAM BLOG • Advertisements in Profitable Contexts • Auto-generated and/or Plagiarized Content • Link Farms to promote other spam pages

  9. OUTLINE • Introduction • Characterization • Feature Discovery • Cost-aware pipeline • Adaptive Classifiers • Evaluation • Conclusions • Future Directions

  10. CONTRIBUTIONS • a principled study of the characteristics of the problem, • a well motivated feature discovery effort, • a cost-sensitive, real-time filtering implementation, and • an ensemble driven classifier co-evolution.

  11. CHARACTERIZATION • WordNet defines characterize as “to describe or portray the characters or the qualities or peculiarities” • Our efforts • Define and Scope the Problem • Field Study • Principled Empirical Analysis • Publicize and solicit feedback

  12. SCOPE [Diagram with three numbered steps, labeled Update Pings, Ping Stream and Fetch Content] Splog filtering sits between steps 2 and 3 (pre-indexing) and is used by the blog harvester

  13. BLOGS & SPAMDEXING • Bias of Search engines to blogs • through quick indexing (ping servers) • and higher relevance (temporal) • Availability of third party blogging platforms • providing service for free • supporting programmatic content injection • enjoying high authority and trust (e.g. blogspot) • enabling obfuscation (doorways) to search engines and DMCA notices

  14. SPLOGS BY NUMBERS • 75% of update pings (eBiquity 2006) • 20% of indexed Blogosphere (Umbria 2006) • 56% of update pings (eBiquity 2007) 56% of all active blogs are splogs! (2007)

  15. SPLOG DETECTION PROBLEM • Given a blog, is it authentic or spam? • Explore evidence space • Contents of the Blogs (Local Attributes) • Evidence from Neighbors (Global Attributes) P(splog(x) | O(x)), P(splog(x) | L(x))

  16. EXISTING CONTEXTS [Table comparing E-MAIL, BLOGS and WEB] NATURE: time/posts, time. WHO USES IT? • Web Search Engines • Blog Search Engines • Blog Hosting Services (Ping Servers) • Users | • E-mail Service Provider | • Search Engines • Page Hosting Services (e.g. Tripod). CONSTRAINTS: • Fast Detection • Low Overhead • Online | • Batch Detection • Mostly Offline | • Fast Detection • Low Overhead. ATTACKS: • Scripts, Doorways • Temporal Deception | • Image Spam, Character Salad | • Scripts, Doorways

  17. RELATED WORK – WEB SPAM • Local Content (Drost et al., 2005) • using TFIDF word-features, specialized features etc. • Statistical Properties (Fetterly et al., 2004) • using page updates, identical pages through page-stitching • TrustRank (Gyöngyi et al., 2004) • As an extension to PageRank • Splog Detection (Salvetti et al., Lin et al.)

  18. OUTLINE • Introduction • Characterization • Feature Discovery • Cost-aware pipeline • Adaptive Classifiers • Evaluation • Conclusions • Future Directions

  19. CONTRIBUTIONS • a principled study of the characteristics of the problem, • a well motivated feature discovery effort, • a cost-sensitive, real-time filtering implementation, and • an ensemble driven classifier co-evolution.

  20. MACHINE LEARNING CLASSIFICATION • Documents as vectors (f1, f2, f3, ..., fm) in a feature space • Feature Space • Discovery • Representation • Selection • Classification Techniques • Support Vector Machines (Discriminative) • Naïve Bayes Classifier (Generative) • Tools (libsvm, weka)

  21. MACHINE LEARNING EVALUATION • Precision (P) • a measure of correctness of classified documents • Recall (R) • a measure of completeness of classified documents • F-1 = 2*P*R/(P+R) • ROC AUC* – Area Under the Curve • a measure of discriminatory power * Presented in Thesis Document
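The precision, recall and F-1 measures on this slide can be illustrated with a minimal sketch. The function name `precision_recall_f1` and the label strings `"splog"` and `"blog"` are assumptions made for this example, not names from the thesis.

```python
def precision_recall_f1(y_true, y_pred, positive="splog"):
    """Compute precision, recall and F-1 for the `positive` class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if p == positive and t == positive)
    fp = sum(1 for t, p in pairs if p == positive and t != positive)
    fn = sum(1 for t, p in pairs if p != positive and t == positive)

    # Guard against empty denominators (no predicted or no actual positives).
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1
```

Precision falls when authentic blogs are flagged as splogs, while recall falls when splogs slip through; F-1 is the harmonic mean of the two.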

  22. DATASETS • SPLOG-2005 • Sampled Summer 2005 at Technorati • Labeled samples of 700 blogs and 700 splogs • Only Blog-homepages • SPLOG-2006 • Sampled Oct 2006 at Weblogs.com • Labeled samples of 750 blogs and 750 splogs • Blog-homepages + feeds

  23. EXPERIMENTAL SETUP • Binary feature encoding • Top 50K selected using frequency count • SVMs • Default parameters • Linear Kernel • No stemming or stop word elimination • Naïve Bayes • Ten fold cross-validation
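The feature-handling part of this setup can be sketched in plain Python. This is an illustration only: it assumes that "frequency count" means the number of documents containing a feature, and the helper names (`select_top_features`, `binary_encode`, `k_fold_indices`) are hypothetical. The SVM and Naïve Bayes training itself (libsvm, weka) is not reproduced here.

```python
from collections import Counter


def select_top_features(docs, k):
    """Return the k features that occur in the most documents.

    `docs` is a list of token lists. Ties are broken alphabetically so the
    result is deterministic.
    """
    counts = Counter()
    for tokens in docs:
        counts.update(set(tokens))  # count each feature once per document
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [feature for feature, _ in ranked[:k]]


def binary_encode(tokens, vocabulary):
    """Binary encoding: 1 if the feature is present in the document, else 0."""
    present = set(tokens)
    return [1 if feature in present else 0 for feature in vocabulary]


def k_fold_indices(n, k=10):
    """Yield (train_indices, test_indices) pairs for k-fold cross-validation."""
    folds = [list(range(i, n, k)) for i in range(k)]
    for i, test in enumerate(folds):
        train = [j for m, fold in enumerate(folds) if m != i for j in fold]
        yield train, test
```

With the slide's settings, `k` would be 50,000 for feature selection and 10 for cross-validation.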

  24. URL [Figure: plots for 2005 and 2006]

  25. URL • 3, 4, 5 charactergrams from URL • Captures profitable contexts • Highly effective at ping streams • Supports an extremely low cost classifier [Figure: plots for 2005 and 2006]
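As a rough illustration of the 3, 4, 5 charactergram features, the sketch below extracts all character n-grams of those lengths from a string. The function name `char_ngrams` and the lowercasing step are assumptions of this example; the same routine would apply to blog content as well as URLs.

```python
def char_ngrams(text, sizes=(3, 4, 5)):
    """Return the set of character n-grams of the given sizes in `text`."""
    text = text.lower()  # assumption: case is not informative
    grams = set()
    for n in sizes:
        for i in range(len(text) - n + 1):
            grams.add(text[i:i + n])
    return grams
```

Because only the URL string is needed, such features can be computed directly from an update ping without fetching the page, which is what keeps the classifier cheap.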

  26. WORDS [Figure: plots for 2005 and 2006]

  27. WORDS • Words (Text) on a Blog • Previously effective in topic classification • Captures profitable advertising contexts • Interesting Authentic Genre Observed [Figure: plots for 2005 and 2006]

  28. WORDGRAMS [Figure: plots for 2005 and 2006]

  29. WORDGRAMS • Word-2-grams, 2 adjacent words • Shallow NLP technique to tackle word salad • Word salad less common in web spam (TFIDF) • Word-x-gram features, exponential with x [Figure: plots for 2005 and 2006]
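A word-2-gram feature is simply a pair of adjacent words. The sketch below assumes whitespace tokenization and lowercasing, neither of which is specified on the slide; `word_ngrams` is a hypothetical helper name.

```python
def word_ngrams(text, n=2):
    """Return the list of n adjacent-word sequences in `text`."""
    words = text.lower().split()
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]
```

Randomly shuffled "word salad" keeps the individual words of a real page but rarely reproduces its natural adjacent pairs, which is why 2-grams help where single words do not. The number of possible features grows as the vocabulary size raised to the power n, hence the slide's note on exponential growth.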

  30. CHARACTERGRAMS [Figure: plots for 2005 and 2006]

  31. CHARACTERGRAMS • 3, 4, 5 charactergrams from blog content • Can capture character salad (e.g. p1lls) • Feature selection important [Figure: plots for 2005 and 2006]

  32. OUTLINKS [Figure: plots for 2005 and 2006]

  33. OUTLINKS • Out-links tokenized by non-alphabets • Similar to URL n-grams, likely more robust • Novel feature space [Figure: plots for 2005 and 2006]
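Tokenizing an out-link "by non-alphabets" can be sketched with a single regular expression split. The name `tokenize_outlink` and the restriction to ASCII letters are assumptions of this example.

```python
import re


def tokenize_outlink(url):
    """Split a URL into alphabetic tokens, discarding digits and punctuation."""
    return [token for token in re.split(r"[^a-z]+", url.lower()) if token]
```

For example, `tokenize_outlink("http://www.cheap-pills4u.example.com/buy?id=7")` yields the tokens `http`, `www`, `cheap`, `pills`, `u`, `example`, `com`, `buy` and `id`.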

  34. ANCHORS [Figure: plots for 2005 and 2006]

  35. ANCHORS • Anchor text tokenized into words • Subsumed by words, but obfuscation difficult • Capture personalization of publishing template • Novel feature space [Figure: plots for 2005 and 2006]
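One way to collect anchor-text words is with Python's standard `html.parser` module, as in the sketch below. The class and function names are hypothetical, and whitespace tokenization is an assumption; the thesis does not say how its features were extracted.

```python
from html.parser import HTMLParser


class AnchorTextParser(HTMLParser):
    """Collect the words that appear inside <a>...</a> elements."""

    def __init__(self):
        super().__init__()
        self._open_anchors = 0  # nesting depth of currently open <a> tags
        self.words = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._open_anchors += 1

    def handle_endtag(self, tag):
        if tag == "a" and self._open_anchors > 0:
            self._open_anchors -= 1

    def handle_data(self, data):
        # Text nested in child tags of an anchor still counts as anchor text.
        if self._open_anchors > 0:
            self.words.extend(data.lower().split())


def anchor_words(html):
    """Return the lowercased words found in anchor text of `html`."""
    parser = AnchorTextParser()
    parser.feed(html)
    parser.close()
    return parser.words
```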

  36. Splog software ?! “Honestly, Do you think people who make $10k/month from adsense make blogs manually? Come on, they need to make them as fast as possible. Save Time = More Money! It's Common SENSE! How much money do you think you will save if you can increase your work pace by a hundred times? Think about it…” “Discover The Amazing Stealth Traffic Secrets Insiders Use To Drive Thousands Of Targeted Visitors To Any Site They Desire!” “Holy Grail Of Advertising...” “Easily Dominate Any Market, Any Search Engine, Any Keyword.” $ 197

  37. Capture HTML Stylistic Patterns in Authentic Blogs

  38. HTMLTAGS [Figure: plots for 2005 and 2006]

  39. HTMLTAGS • Use HTML Tags – stylistic information • Capture signatures of splog software • Fully language independent • Novel feature space [Figure: plots for 2005 and 2006]
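A minimal sketch of HTML-tag features, again using the standard `html.parser` module: the page text is ignored entirely and only tag names are kept, which is what makes the features language independent. `TagCollector` and `tag_features` are hypothetical names, and counting tags (rather than recording presence only) is a choice made for this example.

```python
from collections import Counter
from html.parser import HTMLParser


class TagCollector(HTMLParser):
    """Record the name of every opening tag, ignoring all text content."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        # Self-closing tags such as <br/> also arrive here by default.
        self.tags.append(tag)


def tag_features(html):
    """Return a Counter mapping tag name to number of occurrences."""
    collector = TagCollector()
    collector.feed(html)
    collector.close()
    return Counter(collector.tags)
```

For a binary encoding, the keys of the returned counter are enough.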

  40. FEED BASED DETECTION • Limitations using only home-pages • No knowledge of blog lifetime • Classifiers less effective in early lifecycle • Benefits of using feeds • Most recent posts, lifetime, metadata • Capture correlations across posts • Limitations of using only feeds • Lose out on signatures in publishing template

  41. FEED ITEM DISTRIBUTION • Plot number of items in feeds (SPLOG-2006) • Authentic Blogs feature normal distribution • Splogs – many with just one post • Knowledge of classifier effectiveness vs. lifetime

  42. FEED BASED DETECTION • Disjoint feature spaces – Words, Tags • Trained and Tested with n (x-axis) posts • Publishing template signatures important • Tags much more effective – early lifecycle

  43. RELATED CLASSIFIERS • Blog Identification • Competency requirement for blog harvesters • F-1 measure of 98% • Relational Features • Less Effective (High P, Low R) • Short-lived blogs, lifetime dependent • Knowledge of Web-graph • Derived Features • Less Effective

  44. FEATURE SPACE OBSERVATIONS • Cost based classifier bucketing • Known Feature Spaces • Words continue to be effective • Word-grams against obfuscation • Novel Feature Spaces • Out-links, Anchors capture useful signals • HTML Tags very effective, even early lifecycle • Feature Space Exploration • Tags, JavaScript, Feed Classification

  45. OUTLINE • Introduction • Characterization • Feature Discovery • Cost-aware pipeline • Adaptive Classifiers • Evaluation • Conclusions • Future Directions

  46. CONTRIBUTIONS • a principled study of the characteristics of the problem, • a well motivated feature discovery effort, • a cost-sensitive, real-time filtering implementation, and • an ensemble driven classifier co-evolution.

  47. META-PING SYSTEM • Regular Expression Filtering (March 2005) • List of Authentic Blogs (August 2005) • Blog Home-page Classifier (December 2005) • URL Classifier (October 2006) • Feed Classifier (May 2007) • Cost-Aware Pipeline Implementation (Jan 2007)

  48. META-PING SYSTEM [Architecture diagram. Components: LANGUAGE IDENTIFIER, BLOG IDENTIFIER and PRE-INDEXING SPING FILTER, connected by ping streams. Filters in order of increasing cost: REGULAR EXPRESSIONS, BLACKLISTS, WHITELISTS, URL FILTERS, HOMEPAGE FILTERS, FEED FILTERS. Data stores: PING LOG, IP BLACKLISTS, AUTHENTIC BLOGS]

  49. META-PING SYSTEM • Static Design • Project specific thresholds • Classifiers in pipeline • Based on accrued domain knowledge • Dynamic Possibilities • Classifier Thresholds • Classifier use • Queuing analysis and Precision/Recall requirements
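The idea of a cost-ordered pipeline with per-classifier thresholds can be sketched as a simple cascade: cheap filters run first, and a more expensive stage is consulted only when no earlier stage is confident. Everything below (stage names, scoring functions, thresholds, the whitelist entry) is a hypothetical toy, not the META-PING implementation.

```python
import re

# Toy stand-ins for two of the cheaper filters; both return a spam score in [0, 1].
SPAM_PATTERN = re.compile(r"cheap|casino|pills")
WHITELIST = {"http://myblog.example.org"}


def regex_stage(url):
    return 1.0 if SPAM_PATTERN.search(url) else 0.5


def whitelist_stage(url):
    return 0.0 if url in WHITELIST else 0.5


# (name, scoring function, splog threshold, blog threshold), cheapest first.
STAGES = [
    ("regex", regex_stage, 0.9, 0.1),
    ("whitelist", whitelist_stage, 0.9, 0.1),
]


def run_pipeline(url, stages=STAGES):
    """Return (label, deciding_stage); stop at the first confident stage."""
    for name, score_fn, splog_threshold, blog_threshold in stages:
        score = score_fn(url)
        if score >= splog_threshold:
            return "splog", name
        if score <= blog_threshold:
            return "blog", name
    # No stage was confident; a real system would pass the ping onward.
    return "unknown", None
```

In this sketch the static design corresponds to fixed thresholds and a fixed stage list, while the dynamic possibilities would adjust the thresholds or skip stages according to load and the required precision and recall.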

  50. OUTLINE • Introduction • Characterization • Feature Discovery • Cost-aware pipeline • Adaptive Classifiers • Evaluation • Conclusions • Future Directions
