SEARCHING THE BLOGOSPHERE Nilesh Bansal Nick Koudas University of Toronto
67M KNOWN BLOGS 100K NEW EVERY DAY DOUBLING EVERY 200 DAYS
WHAT ARE THEY WRITING ABOUT?? PERSONAL LIFE PRODUCT REVIEWS POLITICS TECHNOLOGY TOURISM SPORTS ENTERTAINMENT
HUGE DATA REPOSITORY WILL CONTINUE TO GROW EXTRACT PUBLIC OPINION VALUABLE INSIGHTS
KEY INSIGHTS MARKET RESEARCH PUBLIC RELATIONS STRATEGIES CUSTOMER OPINION TRACKING
MACHINE CREATED WEBLOGS MORE THAN HALF OF BLOGSPOT IS SPAM 33% OF WEBSPAM HOSTED AT BLOGSPOT
Gruhl et al., The Predictive Power of Online Chatter, KDD 2005; Kumar et al., On the Bursty Evolution of Blogspace, WWW 2003; Chi et al., Eigen-trend: Trend Analysis in the Blogosphere Based on Singular Value Decompositions, CIKM 2006; Mishne et al., MoodViews: Tool for Blog Mood Analysis, AAAI-CAAW 2006; Mei et al., Topic Sentiment Mixture: Modeling Facets and Opinions in Weblogs, WWW 2007
CRAWLER RUNNING 24x7 TRACKING 9M BLOGS INDEXING 70M ARTICLES AGGREGATION AND PREPROCESSING INTERACTIVE SEARCH AND ANALYSIS
ANY STREAMING TEXT SOURCE NEWS MAILING LISTS FORUMS SOCIAL MEDIA
www.blogscope.net Hot Keywords
Geo Search Related Terms Search Results Popularity Curve
Taiwan Undersea Earthquake Sumatra Earthquake Hawaii Earthquake
December 15 2006 March 06 2007
CRAWLS RSS FEEDS 250 THOUSAND NEW POSTS DAILY PING SERVER: WEBLOGS.COM
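As a rough illustration of what reading posts out of an RSS feed involves, here is a minimal sketch using only the Python standard library. The feed content, the `SAMPLE_FEED` name, and the `extract_posts` helper are hypothetical and are not taken from BlogScope; a real crawler would also fetch feeds over HTTP, react to ping-server notifications, and handle malformed XML.

```python
import xml.etree.ElementTree as ET

# Hypothetical RSS 2.0 feed, inlined so the example needs no network access.
SAMPLE_FEED = """<rss version="2.0"><channel>
<title>Example Blog</title>
<item><title>First post</title><link>http://example.com/1</link></item>
<item><title>Second post</title><link>http://example.com/2</link></item>
</channel></rss>"""


def extract_posts(feed_xml):
    """Return one dict per <item> element, holding its title and link."""
    root = ET.fromstring(feed_xml)
    posts = []
    for item in root.iter("item"):
        posts.append({
            "title": item.findtext("title", default=""),
            "link": item.findtext("link", default=""),
        })
    return posts


for post in extract_posts(SAMPLE_FEED):
    print(post["title"], post["link"])
```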
LINK BASED ANALYSIS IS NOT EFFECTIVE SPAMMERS ARE INTELLIGENT WE USE HEURISTICS ONGOING BATTLE [Wang et al.] Spam Double-Funnel: Connecting Web Spammers with Advertisers, WWW 2007 [Gyöngyi et al.] Combating Web Spam with TrustRank, VLDB 2004 [Kolari et al.] Detecting Spam Blogs: A Machine Learning Approach, AAAI 2006
INTERACTIVE APPLICATION TWO SECOND RESPONSE TIME HUGE AMOUNTS OF DATA SEVEN THOUSAND UNIQUE IP ADDRESSES DAILY SCALABILITY
BURST DETECTION [Kleinberg] Bursty and Hierarchical Structure in Streams, DMKD 2003 [Fung et al.] Parameter Free Bursty Events Detection in Text Streams, VLDB 2005
POPULARITY = BASE + ZERO MEAN GAUSSIAN BURST = STATISTICAL OUTLIER
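The model on this slide can be sketched as follows: if daily popularity is a base level plus zero-mean Gaussian noise, the base can be estimated by the sample mean, the noise scale by the sample standard deviation, and a burst is a day that lies unusually far above the base. The function name `find_bursts`, the threshold of two standard deviations, and the sample counts are assumptions made for illustration; the slides do not state the cutoff actually used.

```python
from statistics import mean, stdev


def find_bursts(counts, threshold=2.0):
    """Return indices of days whose count is a high statistical outlier.

    Assumed model: count = base + zero-mean Gaussian noise. The base is
    estimated by the sample mean and the noise scale by the sample
    standard deviation. `threshold` (in standard deviations) is an
    illustrative choice, not a value from the slides.
    """
    if len(counts) < 2:
        return []
    base = mean(counts)
    sigma = stdev(counts)
    if sigma == 0:
        return []  # perfectly flat series: nothing stands out
    return [i for i, c in enumerate(counts) if (c - base) / sigma > threshold]


# Hypothetical daily mention counts for one keyword.
daily_mentions = [10, 12, 9, 11, 10, 13, 11, 60, 12, 10]
print(find_bursts(daily_mentions))  # -> [7]
```

Only upward deviations are flagged here, since a burst is a spike in popularity; a symmetric test would use the absolute deviation instead.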
COLLOCATIONS POINTWISE MUTUAL INFORMATION EXPENSIVE [Ott and Longnecker] An Introduction to Statistical Methods and Data Analysis [Manning and Schütze] Foundations of Statistical Natural Language Processing [Church and Hanks] Word Association Norms, Mutual Information, and Lexicography, ACL 1989
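For reference, pointwise mutual information between two terms x and y is log2( P(x, y) / (P(x) P(y)) ), with the probabilities estimated from document frequencies. The sketch below computes it exactly over a small in-memory collection; the `pmi` function and the sample documents are hypothetical. It also hints at why the exact computation is expensive: every candidate pair needs a pass over the matching documents.

```python
import math


def pmi(docs, x, y):
    """Pointwise mutual information of terms x and y.

    `docs` is a list of token sets; probabilities are estimated as the
    fraction of documents containing the term(s).
    """
    n = len(docs)
    n_x = sum(1 for d in docs if x in d)
    n_y = sum(1 for d in docs if y in d)
    n_xy = sum(1 for d in docs if x in d and y in d)
    if n_xy == 0:
        return float("-inf")  # the terms never co-occur
    return math.log2((n_xy / n) / ((n_x / n) * (n_y / n)))


# Hypothetical mini-collection of four posts.
docs = [
    {"apple", "iphone"},
    {"apple", "iphone", "launch"},
    {"weather"},
    {"soccer"},
]
print(pmi(docs, "apple", "iphone"))  # -> 1.0
print(pmi(docs, "apple", "soccer"))  # -> -inf
```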
FAST COMPUTATION OF RELATED TERMS RANDOM SAMPLE MUTUAL INFORMATION IN EXPECTATION USE TF WITH PRECOMPUTED IDF
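One way to read this slide: instead of scanning every matching post, draw a random sample of the posts that match the query, count term frequencies (TF) in the sample, and weight each count by an inverse document frequency (IDF) computed ahead of time. The sketch below follows that reading; the function `related_terms`, its parameters, the IDF values, and the sample posts are all assumptions, and the slides do not specify how BlogScope combines these pieces.

```python
import random
from collections import Counter


def related_terms(matching_docs, idf, query, sample_size=100, top_k=3, seed=0):
    """Rank terms co-occurring with `query` by sampled TF times precomputed IDF.

    `matching_docs` is a list of token lists for posts that match the query.
    `idf` maps a term to a precomputed inverse document frequency.
    """
    rng = random.Random(seed)
    if len(matching_docs) > sample_size:
        sample = rng.sample(matching_docs, sample_size)
    else:
        sample = list(matching_docs)
    # Term frequency over the sample, skipping the query term itself.
    tf = Counter(tok for doc in sample for tok in doc if tok != query)
    scores = {term: count * idf.get(term, 0.0) for term, count in tf.items()}
    # Highest score first; ties broken alphabetically for determinism.
    return sorted(scores, key=lambda t: (-scores[t], t))[:top_k]


# Hypothetical posts matching the query "quake", with made-up IDF values.
posts = [
    ["quake", "taiwan", "the"],
    ["quake", "taiwan", "cable", "the"],
    ["quake", "the", "cable"],
]
idf = {"taiwan": 3.0, "cable": 2.5, "the": 0.1}
print(related_terms(posts, idf, "quake", top_k=2))  # -> ['taiwan', 'cable']
```

The cost now depends on the sample size rather than on the number of matching posts, which is what makes an interactive response time plausible.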
POPULAR DOES NOT MEAN HOT INTERESTING = SURPRISING MIXTURE OF DIFFERENT SCORING FUNCTIONS DEVIATION FROM EXPECTED
INTELLIGENT ALERT SERVICE BURST SYNOPSIS AUTHORITATIVE RANKING
JUST THE BEGINNING Nilesh Bansal, Fei Chiang, Nick Koudas, Frank Wm. Tompa, Seeking Stable Clusters in the Blogosphere, to appear in VLDB 2007. Nilesh Bansal, Nick Koudas, BlogScope: System for Online Analysis of High Volume Text Streams, to appear in VLDB 2007 (Demonstration Proposal).
THANK YOU. QUESTIONS? Source: xkcd.com