END
Quiz Time • Consider searching over an archive of a really old newspaper – let’s say the Times of India (founded in 1838). Let’s also assume that we are given all the entities that have changed their names over time (e.g., Kolkata -> Calcutta, Kolokata; Mumbai -> Bombay, etc.). Now, given a keyword query, what additional information would be needed for efficient retrieval of results? Briefly sketch your proposal. • If we want to incorporate temporal dynamics into TREC, how can we enhance the TREC evaluation process? Briefly sketch your ideas (not more than 5 sentences).
The Web Changes Everything: Understanding the Dynamics of Web Content Eytan Adar, Jaime Teevan, Susan Dumais, Jon Elsas Presented by: Srikanta Bedathur
Web is Constantly Changing • In fact, each Web-page changes constantly! • Visually too! December 18, 2008
Web is Constantly Changing • In fact, each Web-page changes constantly! • Visually too! April 7, 2009
Web is Constantly Changing • In fact, each Web-page changes constantly! • Visually too! May 10, 2009
Change Measurement • Ntoulas et al. [2004] – 150 websites, weekly over 1 year • No change w.r.t. bag-of-words in individual Web pages • Frequency of change was not a good predictor of degree of change • Fetterly et al. [2003] – 150 million pages, weekly for 11 weeks • Nearly 65% of Web pages remain the same (degree of change is very small) • Past change predicts future change, page length is correlated with change, domain-level is correlated with change • Olston and Pandey [2008] – 20K pages, every 2 days over several months • Little correlation between information longevity and change frequency • Here we consider “micro-level” changes • Readable content/structure, at hourly and sub-hourly frequency
Organization • Data Collection • Random Sampling vs. Selecting • Crawling • Change Measurements • Overall change • Page-level evolution • Textual content change • DOM level page-structure change • Applications • Improving search • Improving browsing experience
Data To Collect • Focus on the Web that is actually visited • Using Live-Search Toolbar, with opt-in mechanism • Starting on August 1, 2006, for 5 weeks with 612K users • Behavior driven sampling • Unique-visitors • Average inter-arrival time for a page • Average number of revisits per user per page • Exponential binning to obtain equable sampling
URL Sampling • 468 (avg), 650 (med) X 120 = 54788 • All crawlable, min 2 users, 2 times • Full details: Adar et al., CHI08 [Figure: sampling grid; axes: Visits Per User, Inter-arrival time]
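The behavior-driven sampling described above can be illustrated with a minimal sketch. It assumes each page is summarized by three statistics (unique visitors, average inter-arrival time, average revisits per user), that bins are powers of two, and that a fixed number of URLs is drawn from each bin. The names `exp_bin` and `sample_per_bin`, the base-2 bins, and the per-bin quota are hypothetical and are not taken from the original study.

```python
import math
import random
from collections import defaultdict


def exp_bin(value):
    """Index k of the power-of-two bin [2**k, 2**(k+1)) that holds value.

    Values below 1 are clamped into bin 0 (an assumption of this sketch).
    """
    if value < 1:
        return 0
    return int(math.floor(math.log2(value)))


def sample_per_bin(pages, per_bin, seed=0):
    """Draw up to per_bin URLs from every (visitors, inter-arrival, revisits) bin.

    pages: iterable of (url, unique_visitors, interarrival_time, revisits_per_user).
    Sampling evenly across exponential bins keeps rare, heavily visited pages
    from being swamped by the long tail of rarely visited ones.
    """
    rng = random.Random(seed)
    bins = defaultdict(list)
    for url, visitors, interarrival, revisits in pages:
        key = (exp_bin(visitors), exp_bin(interarrival), exp_bin(revisits))
        bins[key].append(url)
    sample = []
    for key in sorted(bins):
        urls = bins[key]
        sample.extend(rng.sample(urls, min(per_bin, len(urls))))
    return sample
```

For example, ten pages that fall into one bin plus a single page in another bin yield three URLs when `per_bin=2`.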
Data Crawling • Each URL was crawled hourly for 5 weeks • Starting May 24, 2007 • Sub-hourly crawls for fast-changing pages [Figure: crawl controller; Original crawl 1, Original crawl 2; 2 minute delay, 16 minute delay, 32 minute delay, 60 minute delay]
Organization • Data Collection • Random Sampling vs. Selecting • Crawling • Change Measurements • Overall change • Page-level evolution • Textual content change • DOM level page-structure change • Applications • Improving search • Improving browsing experience
Measuring Change • Bag-of-words measure using Dice coefficient • 66% displayed change in 5 week period • 123 hours average time for change • Average 0.7954 Dice • Compare this to 35% change in 11 weeks!
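As a minimal sketch of the bag-of-words measure named above, the Dice coefficient between two page versions can be computed as 2|A ∩ B| / (|A| + |B|) over their word sets. The tokenizer (lowercased alphanumeric runs) and the function names are assumptions of this sketch; the original crawl may have extracted words differently.

```python
import re


def bag_of_words(text):
    """Set of lowercased alphanumeric tokens (a simplifying assumption)."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def dice(text_a, text_b):
    """Dice coefficient between the word sets of two page versions.

    1.0 means identical word sets; 0.0 means no words in common.
    """
    words_a, words_b = bag_of_words(text_a), bag_of_words(text_b)
    if not words_a and not words_b:
        return 1.0  # two empty pages are treated as unchanged
    return 2 * len(words_a & words_b) / (len(words_a) + len(words_b))
```

Two versions that share two of their three words each, such as `"a b c"` and `"a b d"`, score 2·2/6 ≈ 0.67.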
Detailed Analysis of Change • More visitors => Faster change • Shallower the depth => Faster change [Figure: Mean Dice Coefficient (y-axis, 0.5 to 0.95) by page category: Sports/Recreation, News/Magazine, Music, Personal Pages, Adult, Industry/Trade; x-axis 0 to 250]
Sub-hourly Crawl Analysis • It is still not clear how many of these are really “interesting changes” [Figure: number of pages (0 to 40000) and Mean Dice at crawl delays of 0, 2, 16, 32, and 60 minutes; chart labels as extracted: 19%, At least once, 9% pages, 23%, 24%, 11%, 66%, Change every sample, 6%, 11%, 12%, 42%]
Page-level Evolution • Change curve per page • Dice value of document D_t w.r.t. the original D_r1, plotted against time • Can be used to classify a page as • Knotted (70%) • Flat (2%) • Sloped (28%)
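A change curve of this kind can be sketched as the Dice value of every later snapshot against the first one. The classifier below is only a crude heuristic added for illustration: it labels a curve flat when it barely moves, knotted when it drops and then levels off in its second half, and sloped otherwise. The threshold `eps`, the half-way split, and all function names are hypothetical and do not reproduce the authors' procedure.

```python
import re


def _words(text):
    """Set of lowercased alphanumeric tokens (a simplifying assumption)."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def _dice(a, b):
    """Dice coefficient between two word sets."""
    if not a and not b:
        return 1.0
    return 2 * len(a & b) / (len(a) + len(b))


def change_curve(snapshots):
    """Dice value of each snapshot's text w.r.t. the first snapshot, in time order."""
    reference = _words(snapshots[0])
    return [_dice(reference, _words(text)) for text in snapshots]


def classify_curve(curve, eps=0.02):
    """Heuristic label for a change curve: 'flat', 'knotted', or 'sloped'."""
    if len(curve) < 3 or max(curve) - min(curve) < eps:
        return "flat"  # the page essentially never changes
    tail = curve[len(curve) // 2:]
    if max(tail) - min(tail) < eps:
        return "knotted"  # early drop, then a stable plateau
    return "sloped"  # similarity keeps decaying
```

A curve such as `[1, 0.8, 0.6, 0.6, 0.6, 0.6]` is labeled knotted under these assumptions, while a steady decline is labeled sloped.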
Text Evolution • Extended crawling for another 6 months • Compute term-level lifespan plots Bottom – longer staying terms characterize the content of the page
Staying Power & Divergence • Staying power of word w in document D: the likelihood of observing w in D at two different timestamps, t and t+α • Divergence of word w w.r.t. document D: the contribution of w towards the K-L divergence of the document from the collection distribution
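Both quantities can be sketched under stated assumptions. Staying power is estimated here as the fraction of snapshot pairs (t, t+α) in which the word appears in both versions, and divergence as the word's pointwise term p(w|D) · log(p(w|D) / p(w|C)) of the K-L divergence between document and collection. The estimators, the absence of smoothing, and the function names are choices of this sketch, not details confirmed by the slides.

```python
import math
from collections import Counter


def staying_power(word, snapshots, alpha):
    """Fraction of timestamps t where word is in both snapshots[t] and snapshots[t + alpha].

    snapshots: list of word sets, one per crawl, in time order.
    """
    pairs = [(snapshots[t], snapshots[t + alpha])
             for t in range(len(snapshots) - alpha)]
    if not pairs:
        return 0.0
    hits = sum(1 for earlier, later in pairs if word in earlier and word in later)
    return hits / len(pairs)


def divergence(word, doc_counts, coll_counts):
    """Contribution of word to KL(document || collection), without smoothing.

    doc_counts, coll_counts: non-empty Counters of term frequencies.
    """
    p_doc = doc_counts[word] / sum(doc_counts.values())
    p_coll = coll_counts[word] / sum(coll_counts.values())
    if p_doc == 0:
        return 0.0  # a word absent from the document contributes nothing
    if p_coll == 0:
        return math.inf  # unseen in the collection; smoothing would avoid this
    return p_doc * math.log(p_doc / p_coll)
```

A word that is present in every snapshot has staying power 1.0 for any α, and a word that is relatively more frequent in the document than in the collection has positive divergence.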
Analysis of Staying Power and Divergence • High divergence, low staying power indicates ephemeral topics unique to the page • High divergence, high staying power can be seen as the “signature” of the site over time • Low divergence, low staying power are not really interesting
Using Change Analyses • Crawling • Focus only on the relatively important, static content of the page. Do not worry about unimportant changes. • Ranking • Additional features for ranking • Adaptive weighting of terms based on their dynamic or static occurrence in the document • Snippet Generation • Take into account the survivability of items
Revisitation and Change [CHI’09] • Implying interest in the newest news • Implying interest in the newest deal (once every 24 hours) • Implying interest in the stable (slow changing) content [Figure: x-axis Time] [CHI09] Adar et al., “Resonance on the Web: Web Dynamics and Revisitation Patterns”
Inferred intent • Filter content by removing content changing faster or slower than peak revisitation [CHI09] Adar et al., “Resonance on the Web: Web Dynamics and Revisitation Patterns”
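The filtering idea above can be sketched as follows, assuming each page element comes with an observed change interval and the page has a known peak revisitation interval. Elements whose change interval lies within a multiplicative tolerance of the peak are kept. The tolerance factor, the data layout, and the function name are hypothetical and serve only to make the idea concrete.

```python
def filter_by_revisitation(elements, peak_hours, tolerance=2.0):
    """Keep elements that change at roughly the peak revisitation rate.

    elements: list of (name, change_interval_hours) pairs.
    peak_hours: revisitation interval at which most revisits occur.
    tolerance: hypothetical factor; intervals in
               [peak_hours / tolerance, peak_hours * tolerance] are kept.
    """
    lower, upper = peak_hours / tolerance, peak_hours * tolerance
    return [name for name, change_hours in elements
            if lower <= change_hours <= upper]
```

With a peak revisitation of 24 hours, a stock ticker changing every few minutes and a footer that almost never changes are both dropped, while a deal that changes daily is kept.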