Quiz Time

melvyn

Presentation Transcript


  1. Quiz Time • Consider searching over an archive of a really old newspaper, say the Times of India (founded in 1838). Assume that we are also given all the entities that have changed their names over time (e.g., Kolkata -> Calcutta, Kolokata; Mumbai -> Bombay, etc.). Now, given a keyword query, what additional information would be needed for efficient retrieval of results? Briefly sketch your proposal. • If we want to include issues of temporal dynamics in TREC, how can we enhance the TREC evaluation process? Briefly sketch your ideas (in no more than 5 sentences).

  2. The Web Changes Everything: Understanding the Dynamics of Web Content Eytan Adar, Jaime Teevan, Susan Dumais, Jon Elsas Presented by: Srikanta Bedathur

  3. Web is Constantly Changing • In fact, each Web-page changes constantly! • Visually too! December 18, 2008

  4. Web is Constantly Changing • In fact, each Web-page changes constantly! • Visually too! April 7, 2009

  5. Web is Constantly Changing • In fact, each Web-page changes constantly! • Visually too! May 10, 2009

  6. Change Measurement • Ntoulas et al. [2004]: 150 websites, weekly over 1 year • No change w.r.t. bag-of-words in individual Web pages • Frequency of change was not a good predictor of degree of change • Fetterly et al. [2003]: 150 million pages, weekly for 11 weeks • Nearly 65% of Web pages remain the same (degree of change is very small) • Past change predicts future change, page length is correlated with change, domain level is correlated with change • Olston and Pandey [2008]: 20K pages, every 2 days over several months • Little correlation between information longevity and change frequency • Here we consider “micro-level” changes • Readable content/structure, at hourly and sub-hourly frequency

  7. Organization • Data Collection • Random Sampling vs. Selecting • Crawling • Change Measurements • Overall change • Page-level evolution • Textual content change • DOM level page-structure change • Applications • Improving search • Improving browsing experience

  8. Data To Collect • Focus on the Web that is actually visited • Using Live-Search Toolbar, with opt-in mechanism • Starting on August 1, 2006, for 5 weeks with 612K users • Behavior driven sampling • Unique-visitors • Average inter-arrival time for a page • Average number of revisits per user per page • Exponential binning to obtain equable sampling

  9. URL Sampling • 468 (avg), 650 (med) X 120 = 54788 • All crawlable, min 2 users, 2 times • [Figure: sampling grid with axes “Visits Per User” and “Inter-arrival time”] • Full details: Adar et al., CHI08

  10. Data Crawling • Each URL was crawled hourly for 5 weeks • Starting May 24, 2007 • Sub-hourly crawls for fast-changing pages • [Figure labels: controller; original crawl 1; original crawl 2; 2 minute delay; 16 minute delay; 32 minute delay; 60 minute delay]

  11. Organization • Data Collection • Random Sampling vs. Selecting • Crawling • Change Measurements • Overall change • Page-level evolution • Textual content change • DOM level page-structure change • Applications • Improving search • Improving browsing experience

  12. Measuring Change • Bag-of-words measure using Dice coefficient • 66% displayed change in 5 week period • 123 hours average time for change • Average 0.7954 Dice • Compare this to 35% change in 11 weeks!
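The bag-of-words Dice measure named on this slide can be sketched as follows. This is a minimal illustration, assuming the set-based form Dice(A, B) = 2|A ∩ B| / (|A| + |B|) over the words of two page versions. The regex tokenizer and the function names `word_set` and `dice` are assumptions made for the example, not the authors' implementation.

```python
import re


def word_set(text):
    """Lowercase the text and return the set of its word tokens (assumed tokenizer)."""
    return set(re.findall(r"\w+", text.lower()))


def dice(text_a, text_b):
    """Dice coefficient between the word sets of two page versions.

    1.0 means the two versions use exactly the same words;
    0.0 means they share no words at all.
    """
    a, b = word_set(text_a), word_set(text_b)
    if not a and not b:
        # Two empty pages are treated as identical (a convention chosen here).
        return 1.0
    return 2 * len(a & b) / (len(a) + len(b))
```

For example, `dice("a b c d", "a b e f")` shares two of four words on each side and gives 0.5. Under this definition, the average of 0.7954 on the slide means that consecutive differing versions still share most of their vocabulary.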

  13. Detailed Analysis of Change • [Figure: mean Dice coefficient by page category; labelled categories: Sports/Recreation, News/Magazine, Music, Personal Pages, Adult, Industry/Trade] • More visitors => Faster change • Shallower the depth => Faster change

  14. Sub-hourly Crawl Analysis • [Figure: page counts (0 to 40000) and mean Dice at re-crawl delays of 0, 2, 16, 32, and 60 minutes. Recoverable labels: “At least once”: 19%, 9% pages, 23%, 24%, 11%, 66%; “Change every sample”: 6%, 11%, 12%, 42%] • It is still not clear how many of these are really “interesting changes”

  15. Page-level Evolution • Change curve per page • Dice value of document D_t w.r.t. the original at D_r1 vs. time • Can be used to classify a page as • Knotted (70%) • Flat (2%) • Sloped (28%)
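The per-page change curve described on this slide can be sketched as the Dice value of every later snapshot against the first one. This is a rough illustration only: the snapshot format (pairs of elapsed hours and page text), the tokenizer, and the name `change_curve` are assumptions, and the knotted/flat/sloped classification itself is not implemented here.

```python
import re


def word_set(text):
    """Lowercase the text and return the set of its word tokens (assumed tokenizer)."""
    return set(re.findall(r"\w+", text.lower()))


def dice_sets(a, b):
    """Dice coefficient between two word sets."""
    if not a and not b:
        return 1.0
    return 2 * len(a & b) / (len(a) + len(b))


def change_curve(snapshots):
    """Return [(hours, dice_vs_first_snapshot), ...] for one page.

    `snapshots` is a non-empty list of (hours_since_first_crawl, page_text)
    pairs in crawl order; the first entry is the reference version.
    """
    reference = word_set(snapshots[0][1])
    return [(hours, dice_sets(reference, word_set(text)))
            for hours, text in snapshots]
```

A curve that stays near 1.0 would correspond to a flat page, while one that keeps decreasing would look sloped. Deciding where a curve levels off (the "knot") needs an additional fitting step that this sketch leaves out.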

  16. Text Evolution • Extended crawling for another 6 months • Compute term-level lifespan plots Bottom – longer staying terms characterize the content of the page

  17. Staying Power & Divergence • Staying power of word w in document D: the likelihood of observing term w in document D at two different timestamps, t and t+α • Divergence of word w w.r.t. document D: the contribution of word w towards the K-L divergence of the document from the collection distribution
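Both quantities on this slide can be estimated from crawl data along the following lines. This is a sketch under stated assumptions, not the paper's exact estimator: staying power is taken as the fraction of snapshot pairs (t, t + lag) that both contain the term, and divergence as the term's summand p(w|D) · log(p(w|D) / p(w|C)) in the K-L divergence. The function names, the `lag` parameter (counted in snapshots), and the unsmoothed probabilities are all hypothetical choices for illustration.

```python
import math
from collections import Counter


def staying_power(term, versions, lag):
    """Fraction of version pairs (t, t + lag) in which `term` appears in both.

    `versions` is a list of word sets, one per crawl, in time order.
    """
    pairs = [(versions[i], versions[i + lag])
             for i in range(len(versions) - lag)]
    if not pairs:
        return 0.0
    both = sum(1 for earlier, later in pairs
               if term in earlier and term in later)
    return both / len(pairs)


def divergence_contribution(term, doc_counts, collection_counts):
    """The term's summand in KL(document || collection).

    Assumes a non-empty document and that the collection counts include
    the document, so a term seen in the document never has zero
    collection probability (no smoothing is applied).
    """
    p_doc = doc_counts[term] / sum(doc_counts.values())
    if p_doc == 0:
        return 0.0
    p_col = collection_counts[term] / sum(collection_counts.values())
    return p_doc * math.log(p_doc / p_col)
```

In use, `versions` would hold the word set of each crawl of one page, while `doc_counts` and `collection_counts` would be `Counter` objects of term frequencies. A term that is both frequent on the page relative to the collection and present in nearly every snapshot pair scores high on both measures.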

  18. Analysis of Staying Power and Divergence • High divergence, low staying power indicates ephemeral topics unique to the page • High divergence, high staying power can be seen as the “signature” of the site over time • Low divergence, low staying power terms are not really interesting

  19. Using Change Analyses • Crawling • Focus only on the relatively important, static content of the page; do not worry about unimportant changes • Ranking • Additional features for ranking • Adaptive weighting of terms based on their dynamic or static occurrence in the document • Snippet Generation • Take into account the survivability of items

  20. Revisitation and Change [CHI’09] • [Figure: revisitation curves over time, with three annotated examples] • Implying interest in the newest news • Implying interest in the newest deal (once every 24 hours) • Implying interest in the stable (slow changing) content • [CHI09] Adar et al., “Resonance on the Web: Web Dynamics and Revisitation Patterns”

  21. Inferred intent • Filter content by removing content changing faster or slower than peak revisitation [CHI09] Adar et al., “Resonance on the Web: Web Dynamics and Revisitation Patterns”

  22. Done! Questions?
