BlogVox: Separating Blog Wheat from Blog Chaff. Akshay Java, Pranam Kolari, Tim Finin, Anupam Joshi, Justin Martineau (UMBC), James Mayfield (JHU/APL). Motivation: Cleaning the Harvest. BlogVox: a blog analytics engine developed for the TREC 2006 Blog Track.
BlogVox: Separating Blog Wheat from Blog Chaff
Akshay Java, Pranam Kolari, Tim Finin, Anupam Joshi, Justin Martineau (UMBC), James Mayfield (JHU/APL)
Motivation: Cleaning the Harvest • BlogVox: a blog analytics engine developed for the TREC 2006 Blog Track. • Spam blogs (splogs) and extraneous content water down the quality of the index. • Narrowing down to the content of the post is essential in the absence of clearly demarcated opinion sentences (as found on Epinions, IMDb, Amazon, etc.). • Noisy and unstructured text on the Blogosphere can skew blog analytics and business intelligence tools (as observed in TREC 2006).
BlogVox Opinion Extraction System
• TREC 2006 task: finding opinionated posts, either positive or negative, about a query
• 2006 TREC Blog corpus: 80K blogs, 300K posts, 50 test queries
• BlogVox opinion extraction system
  • Document-level and sentence-level scorers
  • Scores combined using an SVM meta-learner
  • Data cleaning: splogs and post identification
• BlogVox challenges
  • Data cleaning and splog removal
  • Slang
  • Semantic orientation of words
  • Contradictions, sarcasm, ungrammatical text
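The score-combination step can be illustrated with a small stacking sketch. The slides do not say which scorer outputs are fed to the meta-learner or which SVM implementation was used, so everything below is an assumption: the two-column toy scores, the labels, the hyperparameters, and the hand-rolled subgradient trainer (a real system would more likely call an SVM library).

```python
# Toy stacking sketch: combine two scorer outputs with a linear SVM.
# All data, names, and hyperparameters are illustrative assumptions,
# not the BlogVox configuration.

def train_linear_svm(X, y, epochs=200, lr=0.1, lam=0.01):
    """Fit weights w and bias b by subgradient descent on the
    L2-regularised hinge loss."""
    w = [0.0] * len(X[0])
    b = 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            margin = yi * (sum(wj * xj for wj, xj in zip(w, xi)) + b)
            # Regularisation part of the subgradient: shrink the weights.
            w = [wj * (1.0 - lr * lam) for wj in w]
            if margin < 1.0:
                # Hinge loss is active: move towards the correct side.
                w = [wj + lr * yi * xj for wj, xj in zip(w, xi)]
                b += lr * yi
    return w, b

def predict(w, b, x):
    """Return +1 (opinionated) or -1 (not opinionated)."""
    return 1 if sum(wj * xj for wj, xj in zip(w, x)) + b >= 0 else -1

# Each row: (document-level score, sentence-level score); values are made up.
scores = [(0.9, 0.8), (0.8, 0.9), (0.7, 0.9), (0.9, 0.6),
          (0.1, 0.2), (0.2, 0.1), (0.3, 0.2), (0.1, 0.4)]
labels = [1, 1, 1, 1, -1, -1, -1, -1]

w, b = train_linear_svm(scores, labels)
print(predict(w, b, (0.85, 0.85)), predict(w, b, (0.15, 0.15)))
```

The meta-learner only sees the scorers' outputs, not the post text, which is what lets document-level and sentence-level evidence be weighted against each other.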
Separating Blog Wheat from Blog Chaff Data cleaning for • Splog removal • Post content identification
Spam in the Blogosphere • Types: comment spam, ping spam, splogs • Akismet: “87% of all comments are spam” • 75% of update pings are spam (ebiquity 2005) • 56% of blogs are spam (ebiquity 2005) • 20% of the blogs indexed by popular blog search engines are spam (Umbria 2006, ebiquity 2005) • Spam blogs (splogs) are weblogs used to promote affiliated websites or host ads • “Spings, or ping spam, are pings that are sent from spam blogs”
Data Cleaning: Splogs
• Splogs: host ads; index affiliates and promote PageRank; plagiarized content
• Splog detection using SVM
• 700 blogs and 700 splogs used for training
• Model based on blog home-page and local blog features
[Figure: Splog Detection Performance]
Nature of Splogs in TREC 2006 • Around 83K identifiable blog home-pages in the collection, with 3.2M permalinks • 81K blogs could be processed • We use splog detection models developed on blog home-pages; 87% accuracy • We identified 13,542 splogs • Blacklisted 543K permalinks from these splogs • ~16% of the entire collection • ~17% splog posts injected into the TREC dataset [1]
[1] C. Macdonald, I. Ounis. The TREC Blog06 Collection: Creating and Analyzing a Blog Test Collection.
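As a quick sanity check, the quoted percentages can be reproduced from the rounded counts on this slide. The slide does not state which ratio its "~16%" refers to, so the pairing of numerators and denominators below is an assumption, and the results are only as precise as the rounded inputs.

```python
# Rounded counts taken from the slide; results are therefore approximate.
blog_homepages = 83_000        # identifiable blog home-pages
blogs_processed = 81_000       # blogs that could be processed
splogs_found = 13_542          # blogs classified as splogs
permalinks_total = 3_200_000   # permalinks in the collection
permalinks_blacklisted = 543_000

# Assumed pairings (not stated on the slide).
splog_share_of_homepages = splogs_found / blog_homepages
splog_share_of_processed = splogs_found / blogs_processed
blacklisted_share_of_posts = permalinks_blacklisted / permalinks_total

print(f"splogs / blog home-pages:    {splog_share_of_homepages:.1%}")
print(f"splogs / processed blogs:    {splog_share_of_processed:.1%}")
print(f"blacklisted / all permalinks: {blacklisted_share_of_posts:.1%}")
```

All three ratios fall between 16% and 17%, in line with both the slide's estimate and the roughly 17% figure attributed to the collection's creators.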
Impact of Splogs in TREC Queries [Figure panels: Cholesterol, Hybrid Cars, American Idol]
Higher in Spam-Prone Contexts [Figure panels: Card, Interest, Mortgage] Spam query terms based on the analysis by Macdonald et al. 2006.
Separating Blog Wheat from Blog Chaff Data cleaning for • Splog removal • Post content identification
Data Cleaning: Content Identification [Figure labels: Navigation, Ads, Post content, Recent Posts]
Data cleaning: Baseline heuristic
Eliminate link a if there exists a link b such that:
• b is within distance θ of a
• there are no title tags between the links
• the average length of the text-bearing nodes is less than a threshold
• b is the nearest link to a
[Figure: an example DOM tree with regions labeled Navigational Links, Post Content, Sidebar, Ads]
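The heuristic above can be sketched on a flattened DOM. This is a minimal illustration under several assumptions that the slide does not spell out: the page is a list of nodes in document order, distance is the difference in list positions, the text-bearing nodes considered are those between the two links, and ties for "nearest link" are resolved by accepting either neighbour. The `Node` class, function names, and default thresholds are all hypothetical.

```python
from dataclasses import dataclass

@dataclass
class Node:
    kind: str        # "link", "title", or "text" (simplified DOM node types)
    text: str = ""

def pair_is_extraneous(nodes, a, b, theta, max_avg_len):
    """Check the distance, title-tag, and text-length conditions for links a, b."""
    lo, hi = min(a, b), max(a, b)
    if hi - lo > theta:                       # b must be within distance theta
        return False
    between = nodes[lo + 1:hi]
    if any(n.kind == "title" for n in between):   # no title tags in between
        return False
    lengths = [len(n.text) for n in between if n.kind == "text"]
    avg_len = sum(lengths) / len(lengths) if lengths else 0.0
    return avg_len < max_avg_len              # only short text between the links

def extraneous_links(nodes, theta=3, max_avg_len=20):
    """Return positions of links that the heuristic would eliminate."""
    links = [i for i, n in enumerate(nodes) if n.kind == "link"]
    flagged = []
    for k, a in enumerate(links):
        # Only the previous and next link can be the nearest link to a.
        neighbours = links[max(k - 1, 0):k] + links[k + 1:k + 2]
        if not neighbours:
            continue
        nearest = min(abs(p - a) for p in neighbours)
        candidates = [p for p in neighbours if abs(p - a) == nearest]
        if any(pair_is_extraneous(nodes, a, b, theta, max_avg_len)
               for b in candidates):
            flagged.append(a)
    return flagged

# Made-up page: a navigation bar, a post with one in-text link, two ads.
page = [
    Node("link", "Home"), Node("link", "About"), Node("link", "Archive"),
    Node("title", "My trip"),
    Node("text", "We spent the first week hiking along the coast and camping."),
    Node("link", "photo album"),
    Node("text", "The second week was mostly rain, so we stayed in town instead."),
    Node("link", "Ad A"), Node("text", "|"), Node("link", "Ad B"),
]

removed = extraneous_links(page)
print([page[i].text for i in removed])
```

On this toy page the clustered navigation and ad links are flagged, while the link surrounded by long post text is kept, which is the behaviour the heuristic is designed to produce.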
Data cleaning: SVM cleaner • Random collection of 150 blog posts • Human evaluation of 400 links tagged as content or extraneous links • We trained an SVM with a linear kernel in this analysis • Feature groups: DOM features, tag features, position features, word features [Figure: Evaluation]
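To make the four feature groups concrete, here is a hedged sketch of how a single link might be turned into a feature vector for a linear classifier. The slide names only the groups, not the individual features, so every feature below, the `NAV_WORDS` list, and the `link_features` signature are hypothetical illustrations rather than the features used in BlogVox.

```python
# Hypothetical vocabulary of words that often appear in navigation links.
NAV_WORDS = {"home", "about", "archive", "archives", "recent",
             "comments", "subscribe"}

def link_features(anchor_text, parent_tag, depth, index, total_links):
    """Build an illustrative feature dictionary for one link.

    anchor_text: visible text of the link
    parent_tag:  name of the enclosing element, e.g. "li" or "p"
    depth:       depth of the link in the DOM tree
    index:       0-based position of the link among all links on the page
    total_links: number of links on the page
    """
    words = anchor_text.lower().split()
    return {
        # DOM feature: how deeply the link is nested.
        "dom_depth": float(depth),
        # Tag features: what kind of element encloses the link.
        "tag_is_li": 1.0 if parent_tag == "li" else 0.0,
        "tag_is_p": 1.0 if parent_tag == "p" else 0.0,
        # Position feature: 0.0 for the first link, 1.0 for the last.
        "position": index / max(total_links - 1, 1),
        # Word features: anchor length and navigation vocabulary.
        "word_count": float(len(words)),
        "has_nav_word": 1.0 if any(w in NAV_WORDS for w in words) else 0.0,
    }

sidebar_link = link_features("Recent Posts", "li", depth=6, index=0, total_links=10)
post_link = link_features("the photo album from our trip", "p",
                          depth=4, index=5, total_links=10)
print(sidebar_link)
print(post_link)
```

Vectors of this kind, paired with the human labels (content or extraneous), are what a linear-kernel SVM would be trained on.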
Related Work
• Web spam detection
  • Coverage: blog analytics engines don’t look beyond the Blogosphere
  • Speed of detection is important: 150K posts/hour
  • RSS feeds present new opportunities and challenges
• Email spam detection
  • Nature of spamming: links, RSS feeds, web graph, metadata
  • Users are targeted indirectly through search engines, e.g. “N1ST” is not relevant for a “NIST” query
• Template detection
  • Repeated structural components detected via sampling
  • Customization and the use of JavaScript and AJAX are increasing
  • Simple heuristics using DOM traversal work well in general cases
• Sentiment analysis
  • Open-domain opinion extraction is complex
  • Opinions are part of a narrative
  • The subject about which the opinion is expressed is not easy to detect
Conclusions • Noisy content on the Blogosphere presents a major challenge to the quality of blog analytics tools. • A combination of heuristics and ML can be used to clean the data effectively.
Ongoing Work • DOM subtree elimination • Identifying the subject of the opinion • Slang • More training examples!
Thank you! http://ebiquity.umbc.edu/
Opinions in Social Media
• Narrative: “I went to school early so I would have time to grab some lunch. Which ended up consisting of a crappy sandwich from starbucks and a chai latte. Lacey came into Starbucks while I was there so we chatted for a little bit and she thought that I might be in her class. After I finished eating I headed to school and checked the board……..” [1]
• Expressed opinion (reader’s perspective): “Starbucks Sandwiches are bad!”
• Opinions can influence buying decisions of customers
[1] http://annamay13x.livejournal.com/7061.html
Keyword Stuffed Blog • ‘coupon codes’, ‘casino’
Post Stitching • Excerpts scraped from other sources
Post Weaving • Spam Links contextually placed in post
Link-roll spam • With fully plagiarized text
Difficulty • We have been experimenting with multiple approaches since mid-2005 • Data: http://ebiquity.umbc.edu/resource/html/id/212
Difficulty • Evolving spamming techniques and splog creation genres • Most basic spam techniques • Generate content by stuffing key dictionary words • Generate links to affiliates through link dumps on blogrolls, linkrolls, or after post content • Evolving spam techniques • Scrape contextually similar content to generate posts • RSS hijacking • Aggregation software, e.g. Planet X • Intersperse links randomly • Make link placement meaningful • Add spam comments and then ping. Repeat.