Memeta: A Framework for Analytics on the Blogosphere
Pranam Kolari, Tim Finin

What is memeta?
• Our framework for putting research into real-world use
• Features blog identification and splog detection modules
• Includes language identification modules for more than 10 languages (provided by James Mayfield)
• memeta has been used on an as-needed basis to analyze the blogosphere

[Diagram: Blogosphere Analytics. Labels shown: Ping Servers, Blog Directories, Search Engines, Blog Crawler, Language Identifier, Blog Identifier (98% accuracy), Splog Detector (87% accuracy), Blogs]

[Figures: Nature of pinging URLs at weblogs.com; Host distribution of pings at weblogs.com]

1. Welcome to the Splogosphere: 75% of pings are spings (pings from splogs)
• Monitored a ping server, weblogs.com, for 3 weeks, from 20 Nov 2005 to 11 Dec 2005
• A total of 16 million update pings
• See figure 1 for the ping distribution of URLs
• Pings were first classified by language
• Italian-language blogs followed a predictable pattern, with more pings during the day
• English-language blogs followed a similar pattern, though less pronounced than the Italian one
• Splogs followed no pattern, and their pings numbered three times those of authentic English blogs (figures 2, 3)

[Figures: Ping time-series of Italian blogs, authentic blogs, and spam blogs, each shown for a single day and over five days]

2. Characterizing the Splogosphere
• Blogosphere dump covering 21 days of July 2005
• 1.3 million blogs in total
• Blogs were run through the splog detector
• Link distributions of blogs vs. splogs were plotted on a log-log scale
• Predictably, only authentic blogs follow a power law (figures 4, 5)

[Figures: Only the in-degree distribution of authentic blogs follows a power law; only the out-degree distribution of authentic blogs follows a power law]

Continuing Work
• Inducing new features for splog detection
• Language-independent and adaptive techniques for splog detection
• Splog taxonomy and evaluation metrics
• Multi-relational local models for splog detection
• Tuning memeta to harvest blogs regularly

[Diagram: Splog Detector. Labels shown: Blog Identification Heuristics, Language Identifiers, Spam Blog Detectors, IP Blacklists, Authentic Blogs, Spam Blogs]

Partially supported by NSF awards ITR-IIS-0326460 and ITR-IDM-0219649 and by IBM
http://ebiquity.umbc.edu
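
The ping time-series comparison in section 1 (Italian, authentic English, and spam blogs) can be illustrated with a minimal sketch. It assumes each ping has already been reduced to a (timestamp, category) pair. The function names, the category labels, and the peak-to-mean ratio used as a crude indicator of a daily rhythm are all hypothetical choices made for illustration, not part of memeta.

```python
from collections import Counter
from datetime import datetime


def hourly_ping_counts(pings):
    """Count pings per category for each hour of the day.

    `pings` is an iterable of (timestamp, category) pairs, where timestamp
    is a datetime and category is an assumed label such as "italian",
    "authentic", or "splog". Returns {category: [count for hour 0..23]}.
    """
    per_category = {}
    for timestamp, category in pings:
        per_category.setdefault(category, Counter())[timestamp.hour] += 1
    # Expand each sparse counter into a fixed 24-slot series.
    return {
        category: [counter.get(hour, 0) for hour in range(24)]
        for category, counter in per_category.items()
    }


def peak_to_mean(series):
    """Ratio of the busiest hour to the average hour.

    A flat series gives 1.0; a series concentrated in a few hours gives a
    larger value. An all-zero series returns 0.0.
    """
    mean = sum(series) / len(series)
    return max(series) / mean if mean else 0.0


if __name__ == "__main__":
    # Made-up sample pings, only to show the call shape.
    sample = [
        (datetime(2005, 11, 20, 9, 0), "italian"),
        (datetime(2005, 11, 20, 9, 30), "italian"),
        (datetime(2005, 11, 20, 14, 0), "italian"),
        (datetime(2005, 11, 20, 3, 0), "splog"),
    ]
    for category, series in hourly_ping_counts(sample).items():
        print(category, series, peak_to_mean(series))
```

A category whose pings track human activity should show a higher ratio than one that pings around the clock, though any threshold for separating the two would have to be chosen from real data.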
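
The log-log link-distribution comparison in section 2 can be approximated with a short sketch as well. It assumes a list of per-blog in-degrees or out-degrees is available. The helper names are hypothetical, and fitting a straight line to the log-log points by least squares is only a rough stand-in for reading the plot by eye, not a rigorous power-law test and not necessarily the method used for the poster.

```python
import math
from collections import Counter


def degree_distribution(degrees):
    """Return sorted (degree, number of blogs with that degree) pairs.

    Degrees of zero are dropped because they cannot be shown on a log scale.
    """
    counts = Counter(d for d in degrees if d > 0)
    return sorted(counts.items())


def loglog_fit(points):
    """Least-squares line through (log degree, log count).

    Returns (slope, intercept, r_squared). A slope that is clearly negative
    with r_squared near 1 is consistent with a power law; a poor fit
    suggests the distribution has some other shape.
    """
    xs = [math.log(k) for k, _ in points]
    ys = [math.log(c) for _, c in points]
    n = len(xs)
    if n < 2:
        raise ValueError("need at least two distinct degrees")
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0:
        raise ValueError("need at least two distinct degrees")
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    r_squared = (sxy * sxy) / (sxx * syy) if syy else 1.0
    return slope, intercept, r_squared


if __name__ == "__main__":
    # Synthetic degrees with count(k) = 1000 / k**2, for illustration only.
    degrees = [1] * 1000 + [2] * 250 + [5] * 40 + [10] * 10
    print(loglog_fit(degree_distribution(degrees)))
```

Running the fit separately on blogs labeled authentic and blogs labeled spam would give two (slope, r_squared) pairs to compare.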