SVMs for the Blogosphere: Blog Identification and Splog Detection Pranam Kolari, Tim Finin, Anupam Joshi http://ebiquity.umbc.edu Computational Approaches to Analyzing Weblogs, Stanford, March 27-29, 2006
Blogosphere - the brighter side • Panel View • Market Research • PR Monitoring • From Presentations • Opinion Extraction • Demography based analysis P. Kolari, T. Finin, A. Joshi :-: Blog Identification and Splog Detection
Blogosphere - the darker side (1) • From the Panel • Blogger is cracking down on splogs • SixApart and TypePad • Content Hijacking • From Presentations • Removing spam is an essential part of a blog search engine • Cost of cleaning up splogs and its effect on results
Blogosphere - the darker side (2)
The Blogosphere [Diagram: flow from Information to Audience through blog hosts (Blogger, msn-spaces, livejournal) and ping servers, with splogs entering at the blog hosts and spings at the ping servers]
Spings – weblogs.com
Spings – weblogs.com (2)
Spings – weblogs.com (3)
Splogs – icerocket.com
Splogs – icerocket.com (2)
A Featured Splog?
Splogs – technorati.com (2)
Splogs – The Source! “Honestly, Do you think people who make $10k/month from adsense make blogs manually? Come on, they need to make them as fast as possible. Save Time = More Money! It's Common SENSE! How much money do you think you will save if you can increase your work pace by a hundred times? Think about it…” “Discover The Amazing Stealth Traffic Secrets Insiders Use To Drive Thousands Of Targeted Visitors To Any Site They Desire!” “Easily Dominate Any Market, Any Search Engine, Any Keyword.” “Holy Grail Of Advertising...”
Spam we target -- summarized • Non-blogs • For increased search engine exposure • Through BLOG IDENTIFICATION • Splogs • Adsense clicks for high-paying contexts (i) • Unjustifiably increase page-rank (importance) of affiliates – link farms (ii) • Combination of (i) and (ii) • Through SPLOG DETECTION
This work • Can machine learning models be effective at countering splogs in the blogosphere? • How do they perform when using only features local to a blog?
Dataset for Training • Technorati random sampling • 500K blogs – May/June 2005 • Dropped those from top blogging hosts • Blog identification is an easy task there, using just URL patterns/domains • Sampled the rest in different ways to create training datasets
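A minimal sketch of the URL-pattern filtering described on this slide. The host suffixes below are assumptions for illustration only (the slides name Blogger, msn-spaces and livejournal as hosts but do not list the exact domains that were dropped), and the function names are hypothetical.

```python
from urllib.parse import urlparse

# Assumed host suffixes, for illustration; the slides do not give the actual list.
KNOWN_BLOG_HOST_SUFFIXES = ("blogspot.com", "livejournal.com", "spaces.msn.com")


def is_on_known_blog_host(url: str) -> bool:
    """Return True if the URL's hostname is, or is a subdomain of, a known blog host."""
    host = (urlparse(url).hostname or "").lower()
    return any(host == s or host.endswith("." + s) for s in KNOWN_BLOG_HOST_SUFFIXES)


def drop_known_hosts(urls):
    """Keep only URLs whose blog status cannot be decided from the domain alone."""
    return [u for u in urls if not is_on_known_blog_host(u)]
```

URLs that survive `drop_known_hosts` are the ones that need a learned classifier. Note that `urlparse` only finds a hostname when the URL includes a scheme such as `http://`.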
Blog-HomePage/Non-Blog • Sampled for blog home-pages • Sampled for external links from these blogs to capture contextually similar pages – but from non-blogs • All samples were manually verified • Training set consists of 2100 positive and 2100 negative samples – multiple languages • Let's call this (BH, NB)
Blog-SubPage/Non-Blog • Sampled for local-links from BH • Sampled for out-links similar to NB • No manual verification • 2600 positive and 2600 negative samples • Let's call this (BNH, NB)
Authentic Blog/Splog • Manually identified 700 splogs (English) in the BH sample • Sampled for 700 blogs from the rest • 700 positive and 700 negative samples • Let's call this (AB, S)
Comparison Baselines • Blog Identification • Splog Detection is a known problem!
Evaluation - Background • SVMs as implemented by libsvm • Leave-One-Out cross-validation • No stop word elimination • No stemming • Mutual Information for feature selection • Frequency count provided similar results • Binary feature encoding • Other encodings give similar results
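The feature selection and encoding steps listed on this slide can be sketched as follows. This is an illustration under stated assumptions, not the authors' code: each document is assumed to be a set of tokens, mutual information is computed between a token's presence and the class label, and the helper names are hypothetical. The SVM training itself (libsvm) is not shown.

```python
import math
from collections import Counter


def mutual_information(docs, labels, feature):
    """Mutual information (in bits) between presence of `feature` and the class label."""
    n = len(docs)
    joint = Counter((feature in doc, label) for doc, label in zip(docs, labels))
    count_f, count_c = Counter(), Counter()
    for (present, label), count in joint.items():
        count_f[present] += count
        count_c[label] += count
    mi = 0.0
    for (present, label), count in joint.items():
        p_joint = count / n
        p_indep = (count_f[present] / n) * (count_c[label] / n)
        mi += p_joint * math.log2(p_joint / p_indep)
    return mi


def select_features(docs, labels, k):
    """Return the k tokens with the highest mutual information."""
    vocab = sorted(set().union(*docs))
    return sorted(vocab, key=lambda t: mutual_information(docs, labels, t), reverse=True)[:k]


def binary_encode(doc, features):
    """Binary encoding: 1 if the feature occurs in the document, else 0."""
    return [1 if f in doc else 0 for f in features]
```

A token that appears in every document of one class and in none of the other scores 1 bit on a balanced two-class set; a token spread evenly across both classes scores 0.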
New features for blogs • Hyper-links on a page • Tokenized by “/” and “-” • Anchor-text on a page • Meta tags • From HTML HEAD element • 4-grams • Contiguous blocks of 4 characters • Combinations • words and urls • meta and link • urls, anchors, meta
Blog Identification – (BH, NB)
Blog Identification – (BNH, NB)
Splog Detection - (AB, S)
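A rough sketch of how the features on this slide could be extracted with Python's standard library. The class and function names are hypothetical, and two simplifications are assumptions of this sketch: meta tags are collected wherever they appear rather than only inside HEAD, and anchor text and meta content are split on whitespace.

```python
import re
from html.parser import HTMLParser


def url_tokens(url):
    """Split a hyperlink on "/" and "-", dropping empty pieces."""
    return [t for t in re.split(r"[/-]", url) if t]


def char_4grams(text):
    """Contiguous blocks of 4 characters."""
    return [text[i:i + 4] for i in range(len(text) - 3)]


class PageFeatures(HTMLParser):
    """Collects URL tokens, anchor-text words and meta-tag words from a page."""

    def __init__(self):
        super().__init__()
        self.urls, self.anchors, self.meta = [], [], []
        self._in_anchor = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "a" and attrs.get("href"):
            self.urls.extend(url_tokens(attrs["href"]))
            self._in_anchor = True
        elif tag == "meta" and attrs.get("content"):
            self.meta.extend(attrs["content"].split())

    def handle_endtag(self, tag):
        if tag == "a":
            self._in_anchor = False

    def handle_data(self, data):
        if self._in_anchor:
            self.anchors.extend(data.split())
```

After `parser = PageFeatures(); parser.feed(html)`, the three lists can be used separately or concatenated to form the combined feature sets named on the slide.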
A Quick Analysis • Ping Servers • Our analysis in December 2005 • At least 75% of pings are spings • Technorati Index • Data from week of March 20, 2006 • Random queries to sample for 10K blogs • 3K blogspot, 2.5K livejournal, 1.8K msn • We predict that 1.5K from blogspot and 250 from LJ are splogs • Overall 2.5K/10K are splogs ~ 25% of the fresh index!
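The percentages implied by the counts on this slide can be checked with a few lines of arithmetic. The counts are taken directly from the slide; the variable and function names are only illustrative.

```python
# Counts from the Technorati sample described on the slide.
sampled = {"blogspot": 3000, "livejournal": 2500, "msn": 1800}
predicted_splogs = {"blogspot": 1500, "livejournal": 250}
total_sampled = 10_000
total_predicted_splogs = 2_500


def splog_rate(splogs, total):
    """Fraction of sampled blogs predicted to be splogs."""
    return splogs / total


per_host = {host: splog_rate(predicted_splogs[host], sampled[host]) for host in predicted_splogs}
overall = splog_rate(total_predicted_splogs, total_sampled)

for host, rate in per_host.items():
    print(f"{host}: {rate:.0%}")
print(f"overall: {overall:.0%}")
```

This gives 50% for blogspot, 10% for livejournal and 25% overall. The slide does not break down the remaining 750 predicted splogs by host.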
Blogosphere Spam - Summary [Diagram: flow from Information to Audience annotated with estimated spam share: ping servers 75%; blog hosts Blogger 50%, livejournal 10%, msn-spaces; 25% of fresh content overall]
And it's not getting easier … But spammers still leave trails that can be exploited
Conclusion • Blogosphere is prone to spam at various infrastructure points • Local content-based models can be quite effective by themselves • 75% of pings and, further downstream, 25% of fresh content is spam • Blogger’s problem is now livejournal’s problem, and now everyone’s problem • Combining local and global splog models is our current direction
Questions? • Google “Splog Detection” • memeta • http://memeta.umbc.edu • eBiquity • http://ebiquity.umbc.edu • http://ebiquity.umbc.edu/blogger • Check out Umbria’s report on splogs • http://www.umbrialistens.com/files/uploads/umbria_splog.pdf