
SEARCHING THE BLOGOSPHERE



Presentation Transcript


  1. SEARCHING THE BLOGOSPHERE Nilesh Bansal Nick Koudas University of Toronto

  2. BLOGOSPHERE

  3. 67M KNOWN BLOGS; 100K NEW EVERY DAY; DOUBLING EVERY 200 DAYS

  4. WHAT ARE THEY WRITING ABOUT? PERSONAL LIFE; PRODUCT REVIEWS; POLITICS; TECHNOLOGY; TOURISM; SPORTS; ENTERTAINMENT

  5. WHY SHOULD WE CARE?

  6. HUGE DATA REPOSITORY; WILL CONTINUE TO GROW; EXTRACT PUBLIC OPINION; VALUABLE INSIGHTS

  7. KEY INSIGHTS; MARKET RESEARCH; PUBLIC RELATIONS STRATEGIES; CUSTOMER OPINION TRACKING

  8. CHALLENGES AND OPPORTUNITIES

  9. HUGE AMOUNTS OF UNSTRUCTURED TEXT

  10. MACHINE-CREATED WEBLOGS; MORE THAN HALF OF BLOGSPOT IS SPAM; 33% OF WEBSPAM HOSTED AT BLOGSPOT

  11. TEMPORAL DIMENSION

  12. GEOGRAPHICAL ASSOCIATION

  13. CONVERSATION

  14. Gruhl et al., The Predictive Power of Online Chatter, KDD 2005; Kumar et al., On the Bursty Evolution of Blogspace, WWW 2003; Chi et al., Eigen-trend: trend analysis in the blogosphere based on singular value decompositions, CIKM 2006; Mishne et al., MoodViews: Tool for Blog Mood Analysis, AAAI-CAAW 2006; Mei et al., Topic sentiment mixture: modeling facets and opinions in weblogs, WWW 2007

  15. BLOGSCOPE

  16. CRAWLER RUNNING 24x7; TRACKING 9M BLOGS; INDEXING 70M ARTICLES; AGGREGATION AND PREPROCESSING; INTERACTIVE SEARCH AND ANALYSIS

  17. ANY STREAMING TEXT SOURCE; NEWS; MAILING LISTS; FORUMS; SOCIAL MEDIA

  18. www.blogscope.net Hot Keywords

  19. Geo Search; Related Terms; Search Results; Popularity Curve

  20. Taiwan Undersea Earthquake; Sumatra Earthquake; Hawaii Earthquake

  21. December 15 2006 March 06 2007

  22. IPHONE ON JAN 09 2007

  23. Curves are usually correlated, except at one point

  24. TECHNIQUES

  25. CRAWLS RSS FEEDS; 250 THOUSAND NEW POSTS DAILY; PING SERVER: WEBLOGS.COM

  26. LINK-BASED ANALYSIS IS NOT EFFECTIVE; SPAMMERS ARE INTELLIGENT; WE USE HEURISTICS; ONGOING BATTLE; [Wang et al.] Spam Double-Funnel: Connecting Web Spammers with Advertisers, WWW 2007; [Gyöngyi et al.] Combating Web Spam with TrustRank, VLDB 2004; [Kolari et al.] Detecting Spam Blogs: A Machine Learning Approach, AAAI 2006

  27. INTERACTIVE APPLICATION; TWO SECOND RESPONSE TIME; HUGE AMOUNTS OF DATA; SEVEN THOUSAND UNIQUE IP ADDRESSES DAILY; SCALABILITY

  28. BURST DETECTION; [Kleinberg] Bursty and Hierarchical Structure in Streams, DMKD 2003; [Fung et al.] Parameter Free Bursty Events Detection in Text Streams, VLDB 2005

  29. POPULARITY = BASE + ZERO MEAN GAUSSIAN; BURST = STATISTICAL OUTLIER
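
The model on this slide can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the BlogScope implementation: the base is estimated as the sample mean of daily counts, the Gaussian noise scale as the sample standard deviation, and a day counts as a burst when its z-score exceeds a threshold. The function name `find_bursts`, the threshold of 2.0, and the sample counts are all hypothetical.

```python
from statistics import mean, stdev


def find_bursts(counts, threshold=2.0):
    """Return indices of days whose count is a high statistical outlier.

    Assumes popularity = base + zero-mean Gaussian noise, so the base is
    estimated by the sample mean and the noise scale by the sample
    standard deviation. The threshold is an illustrative choice.
    """
    if len(counts) < 2:
        return []
    base = mean(counts)
    sigma = stdev(counts)
    if sigma == 0:
        return []  # perfectly flat series: nothing can be an outlier
    return [i for i, c in enumerate(counts) if (c - base) / sigma > threshold]


# Hypothetical daily mention counts for one keyword; day 6 is a spike.
daily_mentions = [10, 12, 11, 9, 10, 11, 60, 10, 12, 9]
bursts = find_bursts(daily_mentions)
```

Because the spike itself inflates both the mean and the standard deviation, a more robust variant might estimate the base with a median over a sliding window.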

  30. IDENTIFYING RELATED TERMS

  31. COLLOCATIONS; POINTWISE MUTUAL INFORMATION; EXPENSIVE; [Ott and Longnecker] An Introduction to Statistical Methods and Data Analysis; [Manning and Schütze] Foundations of Statistical Natural Language Processing; [Church and Hanks] Word Association Norms, Mutual Information and Lexicography, ACL 1989
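
For reference, pointwise mutual information between two terms can be computed as below. This is a minimal sketch that assumes document-level co-occurrence (a term either appears in a post or it does not); the slides do not say which co-occurrence window is used. The function name `pmi` and the toy posts are hypothetical.

```python
import math


def pmi(docs, term_a, term_b):
    """PMI of two terms based on document-level co-occurrence.

    docs is a list of sets of terms. Returns None when the pair never
    co-occurs, since the logarithm is undefined in that case.
    """
    n = len(docs)
    count_a = sum(1 for d in docs if term_a in d)
    count_b = sum(1 for d in docs if term_b in d)
    count_ab = sum(1 for d in docs if term_a in d and term_b in d)
    if n == 0 or count_ab == 0:
        return None
    # PMI = log2( P(a, b) / (P(a) * P(b)) )
    return math.log2((count_ab / n) / ((count_a / n) * (count_b / n)))


# Toy corpus: each post is reduced to its set of terms.
posts = [
    {"iphone", "apple", "launch"},
    {"iphone", "apple"},
    {"earthquake", "taiwan"},
    {"apple", "pie"},
]
score = pmi(posts, "iphone", "apple")
```

Each call scans every document, and finding the best related terms for a query means repeating that for every candidate term, which is why the slide labels the exact computation as expensive.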

  32. FAST COMPUTATION OF RELATED TERMS; RANDOM SAMPLE; MUTUAL INFORMATION IN EXPECTATION; USE TF WITH PRECOMPUTED IDF
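
One way to read this slide is sketched below: draw a random sample of the posts that match the query, count term frequency only within the sample, and weight each count by an IDF value computed ahead of time. This is an interpretation, not the authors' algorithm; the function `related_terms`, its parameters, and the IDF numbers are hypothetical, and a real system would fetch matching posts from an index instead of scanning a list.

```python
import random
from collections import Counter


def related_terms(docs, idf, query, sample_size=100, top_k=3, seed=0):
    """Rank terms that co-occur with `query` using a random sample.

    docs is a list of token lists and idf is a dict of precomputed IDF
    values. Term frequency is counted only in the sample, and terms
    missing from `idf` are skipped.
    """
    matching = [d for d in docs if query in d]
    rng = random.Random(seed)  # fixed seed keeps the sketch reproducible
    if len(matching) > sample_size:
        matching = rng.sample(matching, sample_size)
    tf = Counter(t for d in matching for t in d if t != query)
    scores = {t: tf[t] * idf[t] for t in tf if t in idf}
    # Highest score first; ties are broken alphabetically.
    return sorted(scores, key=lambda t: (-scores[t], t))[:top_k]


# Toy corpus and made-up IDF values (a common word such as "the" gets a low IDF).
posts = [
    ["iphone", "apple", "the"],
    ["iphone", "apple", "launch", "the"],
    ["iphone", "the"],
    ["earthquake", "taiwan", "the"],
]
idf = {"apple": 2.0, "launch": 3.0, "the": 0.1, "taiwan": 3.0, "earthquake": 3.0}
related = related_terms(posts, idf, "iphone")
```

The cost now depends on the sample size, not on the number of matching posts, which is what makes an interactive response time plausible.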

  33. COMPUTING HOT KEYWORDS

  34. POPULAR DOES NOT MEAN HOT; INTERESTING = SURPRISING; MIXTURE OF DIFFERENT SCORING FUNCTIONS; DEVIATION FROM EXPECTED
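
The idea that "hot" means deviation from the expected count, not raw popularity, can be illustrated as follows. This sketch uses a single scoring function (a smoothed z-score against each keyword's own history), whereas the slide mentions a mixture of scoring functions whose details are not given. The names `hotness` and `hot_keywords`, the smoothing constant, and the counts are hypothetical.

```python
from statistics import mean, stdev


def hotness(history, today):
    """Score how surprising today's count is given past daily counts."""
    expected = mean(history) if history else 0.0
    spread = stdev(history) if len(history) > 1 else 0.0
    # Adding 1 avoids division by zero and damps very rare keywords.
    return (today - expected) / (spread + 1.0)


def hot_keywords(history_by_term, today_by_term, top_k=2):
    """Return the top_k keywords ranked by deviation from expectation."""
    scores = {
        term: hotness(history, today_by_term.get(term, 0))
        for term, history in history_by_term.items()
    }
    return sorted(scores, key=lambda t: (-scores[t], t))[:top_k]


# "weather" is the most popular keyword but behaves as expected;
# "iphone" is far less frequent overall but far above its usual level.
history = {
    "weather": [1000, 1010, 990, 1000],
    "iphone": [20, 25, 15, 20],
    "taiwan": [5, 5, 5, 5],
}
today = {"weather": 1005, "iphone": 400, "taiwan": 6}
hot = hot_keywords(history, today)
```

Ranking by raw count would put "weather" first; ranking by surprise puts "iphone" first.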

  35. INTELLIGENT ALERT SERVICE; BURST SYNOPSIS; AUTHORITATIVE RANKING

  36. JUST THE BEGINNING Nilesh Bansal, Fei Chiang, Nick Koudas, Frank Wm. Tompa, Seeking Stable Clusters in the Blogosphere, to appear in VLDB 2007. Nilesh Bansal, Nick Koudas, BlogScope: System for Online Analysis of High Volume Text Streams, to appear in VLDB 2007 (Demonstration Proposal).

  37. THANK YOU. QUESTIONS? Source: xkcd.com
