
BlogVox: Separating Blog Wheat from Blog Chaff

BlogVox: Separating Blog Wheat from Blog Chaff. Akshay Java, Pranam Kolari, Tim Finin, Anupam Joshi, Justin Martineau (UMBC), James Mayfield (JHU/APL). Motivation: Cleaning the Harvest. BlogVox: a blog analytics engine developed for the TREC 2006 Blog Track.


Presentation Transcript


  1. BlogVox: Separating Blog Wheat from Blog Chaff. Akshay Java, Pranam Kolari, Tim Finin, Anupam Joshi, Justin Martineau (UMBC), James Mayfield (JHU/APL)

  2. Motivation: Cleaning the Harvest • BlogVox: a blog analytics engine developed for the TREC 2006 Blog Track. • The presence of spam blogs (splogs) and extraneous content waters down the quality of the index. • Narrowing down to the content of the post is essential in the absence of clearly demarcated opinion sentences (as found on Epinions, IMDB, Amazon, etc.). • Noisy and unstructured text on the Blogosphere can skew blog analytics and business intelligence tools (as observed in TREC 2006).

  3. BlogVox Opinion Extraction System • TREC 06 task: finding opinionated posts, either positive or negative, about a query • 2006 TREC Blog corpus: 80K blogs, 300K posts, 50 test queries • BlogVox opinion extraction system: document- and sentence-level scorers; scores combined using an SVM meta-learner; data cleaning (splogs and post identification) • BlogVox challenges: data cleaning and splog removal; slang; semantic orientation of words; contradictions, sarcasm, ungrammatical text

  4. Separating Blog Wheat from Blog Chaff Data cleaning for • Splog removal • Post content identification

  5. Spam in the Blogosphere • Types: comment spam, ping spam, splogs • Akismet: “87% of all comments are spam” • 75% of update pings are spam (ebiquity 2005) • 56% of blogs are spam (ebiquity 2005) • 20% of the blogs indexed by popular blog search engines are spam (Umbria 2006, ebiquity 2005) • Spam blogs (splogs) are weblogs used to promote affiliated websites or to host ads • “Spings, or ping spam, are pings that are sent from spam blogs”

  6. Motivation: host ads

  7. Motivation: index affiliates, promote PageRank

  8. Data Cleaning: Splogs • Splog detection using an SVM • 700 blogs and 700 splogs used for training • Model based on blog home-page and local blog features • Diagram labels: host ads; index affiliates, promote PageRank; plagiarized content • [Figure: Splog Detection Performance]

  9. Nature of Splogs in TREC 2006 • Around 83K identifiable blog home-pages in the collection, with 3.2M permalinks • 81K blogs could be processed • We use splog detection models developed on blog home-pages; 87% accuracy • We identified 13,542 splogs • Blacklisted 543K permalinks from these splogs • ~16% of the entire collection • ~17% splog posts injected into the TREC dataset [1] [1] The TREC Blog06 Collection: Creating and Analyzing a Blog Test Collection, C. Macdonald, I. Ounis
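The percentages on slide 9 can be sanity-checked from the counts it gives. The sketch below only recomputes ratios from the rounded figures on the slide; the slide does not say which denominator its "~16%" refers to, so all three candidate ratios are shown, and the variable names are illustrative.

```python
# Rounded counts as stated on slide 9 (the slide gives "83K", "81K", "543K", "3.2M").
splogs_identified = 13_542
blogs_identified = 83_000
blogs_processed = 81_000
permalinks_blacklisted = 543_000
permalinks_total = 3_200_000


def pct(part, whole):
    """Return part/whole as a percentage rounded to one decimal place."""
    return round(100.0 * part / whole, 1)


# Which denominator the slide's "~16%" uses is an assumption, so compute each.
splog_share_of_identified = pct(splogs_identified, blogs_identified)
splog_share_of_processed = pct(splogs_identified, blogs_processed)
permalink_share = pct(permalinks_blacklisted, permalinks_total)

print(splog_share_of_identified, splog_share_of_processed, permalink_share)
```

All three ratios fall between 16% and 17%, in line with the approximate figures on the slide.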

  10. Impact of Splogs in TREC Queries • Cholesterol • Hybrid Cars • American Idol

  11. Higher in Spam-Prone Contexts • Card • Interest • Mortgage • Spam query terms based on analysis by Macdonald et al. 2006

  12. Separating Blog Wheat from Blog Chaff Data cleaning for • Splog removal • Post content identification

  13. Data Cleaning: Content Identification • Page regions: navigation, ads, post content, recent posts

  14. Data cleaning: Baseline heuristic • Eliminate link a if there exists a link b such that: • b is within θ distance of a • there are no title tags between the two links • the average length of text-bearing nodes is less than a threshold • b is the nearest link to a • [Figure: an example DOM tree with navigational links, post content, sidebar, and ads]
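The baseline heuristic on slide 14 can be sketched roughly as follows. This is a minimal illustration, not the BlogVox implementation: it assumes the DOM has already been flattened into a list of nodes in document order, that "distance" is the number of positions between two links in that list, and that text length is measured between the two links. The `Node` type, the `extraneous_links` function, and both default thresholds are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Node:
    kind: str       # "link", "text", or "title" (assumed node categories)
    text: str = ""


def extraneous_links(nodes, theta=3, max_avg_text_len=20.0):
    """Return indices of link nodes that the heuristic would eliminate."""
    link_positions = [i for i, n in enumerate(nodes) if n.kind == "link"]
    flagged = []
    for idx, a in enumerate(link_positions):
        # b is the nearest link to a, looking both backwards and forwards.
        neighbours = []
        if idx > 0:
            neighbours.append(link_positions[idx - 1])
        if idx + 1 < len(link_positions):
            neighbours.append(link_positions[idx + 1])
        if not neighbours:
            continue
        b = min(neighbours, key=lambda p: abs(p - a))
        lo, hi = sorted((a, b))
        if hi - lo > theta:
            continue  # b is not within theta distance
        between = nodes[lo + 1:hi]
        if any(n.kind == "title" for n in between):
            continue  # a title tag separates the two links
        lengths = [len(n.text) for n in between if n.kind == "text"]
        avg_len = sum(lengths) / len(lengths) if lengths else 0.0
        if avg_len < max_avg_text_len:
            flagged.append(a)
    return flagged


# Toy page: three navigation links, then a titled post containing one link.
page = [
    Node("link", "Home"),
    Node("link", "About"),
    Node("link", "Archive"),
    Node("title", "My trip to the lake"),
    Node("text", "A long paragraph of post content. " * 3),
    Node("link", "photo album"),
    Node("text", "More post content follows the link."),
]

flagged = extraneous_links(page)
print(flagged)
```

On this toy page the three adjacent navigation links are flagged, while the link inside the post survives because a title node sits between it and its nearest neighbouring link.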

  15. Data cleaning: SVM cleaner • Random collection of 150 blog posts • Human annotators tagged 400 links as content or extraneous • We trained an SVM with a linear kernel for this analysis • Feature groups: DOM features, tag features, position features, word features • [Table: Evaluation]
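To make the idea of a linear-kernel SVM link classifier concrete, here is a small self-contained sketch. It is not the authors' model or feature set: the three features (whether the link sits inside a list tag, its relative position on the page, and a scaled anchor-text length) are hypothetical stand-ins for the tag, position, and word feature groups, the eight training rows are made up, and training uses plain subgradient descent on the regularized hinge loss rather than an SVM library.

```python
def train_linear_svm(X, y, lam=0.01, lr=0.1, epochs=200):
    """Fit w, b by subgradient descent on the L2-regularized hinge loss."""
    w = [0.0] * len(X[0])
    b = 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            margin = yi * (sum(wj * xj for wj, xj in zip(w, xi)) + b)
            # Regularization step: shrink the weights slightly.
            w = [wj - lr * lam * wj for wj in w]
            # Hinge-loss step: only examples inside the margin push on w and b.
            if margin < 1:
                w = [wj + lr * yi * xj for wj, xj in zip(w, xi)]
                b += lr * yi
    return w, b


def predict(w, b, x):
    """Return +1 (extraneous link) or -1 (content link)."""
    score = sum(wj * xj for wj, xj in zip(w, x)) + b
    return 1 if score >= 0 else -1


# Hypothetical features: [inside_list_tag, relative_position, anchor_length_scaled]
X = [
    [1, 0.05, 0.1], [1, 0.95, 0.1], [1, 0.90, 0.2], [1, 0.10, 0.1],  # extraneous
    [0, 0.50, 0.4], [0, 0.40, 0.6], [0, 0.60, 0.3], [0, 0.45, 0.5],  # content
]
y = [1, 1, 1, 1, -1, -1, -1, -1]

w, b = train_linear_svm(X, y)
print([predict(w, b, x) for x in X])
```

A linear model keeps the learned weights directly interpretable per feature, which is convenient when the training set is as small as the 400 labeled links mentioned on the slide.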

  16. Data Cleaning: Effect of sidebar content

  17. Related Work • Web spam detection: coverage (blog analytics engines don't look beyond the Blogosphere); speed of detection is important (150K posts/hour); RSS feeds present new opportunities and challenges • Email spam detection: nature of spamming (links, RSS feeds, web graph, metadata); users are targeted indirectly through search engines, e.g. “N1ST” is not relevant for a “NIST” query • Template detection: repeated structural components are detected via sampling; customization and the use of JavaScript and AJAX are increasing; simple heuristics using DOM traversal work well in general cases • Sentiment analysis: open-domain opinion extraction is complex; opinions are part of a narrative; the subject about which the opinion is expressed is not easy to detect

  18. Conclusions • Noisy content on the Blogosphere presents a major challenge to the quality of blog analytics tools. • A combination of heuristics and ML can be used to clean the data effectively. Ongoing Work • DOM subtree elimination • Identifying the subject of the opinion • Slang • More training examples!

  19. Thank you! http://ebiquity.umbc.edu/

  20. Backup Slides

  21. Opinions in Social Media • Reader's Perspective: “Starbucks Sandwiches are bad!” • Narrative: “I went to school early so I would have time to grab some lunch. Which ended up consisting of a crappy sandwich from starbucks and a chai latte. Lacey came into Starbucks while I was there so we chatted for a little bit and she thought that I might be in her class. After I finished eating I headed to school and checked the board…” [1] • Expressed Opinions • Opinions can influence buying decisions of customers [1] http://annamay13x.livejournal.com/7061.html

  22. Keyword Stuffed Blog • ‘coupon codes’, ‘casino’

  23. Post Stitching • Excerpts scraped from other sources

  24. Post Weaving • Spam links contextually placed in post

  25. Link-roll spam • With fully plagiarized text

  26. Difficulty • We have been experimenting with multiple approaches since mid-2005 • Data: http://ebiquity.umbc.edu/resource/html/id/212

  27. Difficulty • Evolving spamming techniques and splog creation genres • Most basic spam techniques • Generate content by stuffing key dictionary words • Generate links to affiliates, through link dumps on blogrolls, linkrolls or after post content • Evolving spam techniques • Scrape contextually similar content to generate posts • RSS hijacking • Aggregation software, e.g. Planet X • Intersperse links randomly • Make link placement meaningful • Add spam comments and then ping. Repeat.

  28. TREC Submissions (Topic Relevance)

  29. TREC Submissions (Opinion Extraction)
