Eytan Adar, Jaime Teevan, Susan Dumais, and Jon Elsas. University of Washington, Microsoft Research, and Carnegie Mellon University. WSDM’09. The Web Changes Everything: Understanding the Dynamics of Web Content
Who Cares About Web Change? • Revisitation • Monitoring • Page Structure • Fragility • Dynamic language • Search engine design
Quantifying Change • Dynamics of the Web are well researched • Fetterly et al. (150 million pages): 65% stay the same • Koehler et al. (5 years): stabilization • Ntoulas et al. (turnover): 50% new content a year • And many others (see the paper for a summary) • But: eye towards systems issues • Crawl rates, indexing, storage needs, etc. • Always random samples • What about the visited Web? • Slow (crawled every day at best)
Outline • Behavior-Driven Sampling & Crawling • Measuring change • Basic change behavior • Page evolution • Text changes • DOM level changes • Applications
Behavior Driven Sampling • Can we measure dynamics of the actually used Web? • Usage Logs • Live Toolbar • 600k from August of ’06 • Subset of total
Sampling URLs [Figure: sampling dimensions: Visits Per User, Inter-arrival time, Unique Users (popularity)] • All crawlable, min 2 users, 2 times • 468 (avg), 650 (med) X 120 = 54788 • Full details: Adar et al., CHI08
Behavior Driven Sampling • Can we measure dynamics of the actually used Web? • Usage Logs • Live Toolbar • 600k from August of ’06 • Subset of total • Sampled URLs • Around 55k (use the 40k that had revisits in May/June) • Crawled hourly (and sub-hourly) for a year • May/June ’07
URL Annotations • Visitation properties • Revisits, popularity, etc. • Broad type • News, Sports, Personal, Adult, etc. • Structural location • Top level page or deep within site?
Outline • Behavior-Driven Sampling & Crawling • Measuring change • Basic change behavior • Page evolution • Text changes • DOM level changes • Applications
Basic measures of change [Figure: page version 1 and page version 2 on a time axis] • How long? (inter-version time) • How much? Dice: 2*|A ⋂ B| / (|A|+|B|)
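The two measures on this slide can be sketched in a few lines. This is a minimal illustration, not the authors' code: it assumes A and B are the sets of whitespace-separated terms in the two page versions, and that a crawl is a list of (hour, text) pairs. The names `dice` and `mean_inter_version_time` are hypothetical.

```python
def dice(a, b):
    """Dice coefficient 2*|A ∩ B| / (|A| + |B|) over two collections of terms."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0  # assumption: two empty pages count as identical
    return 2 * len(a & b) / (len(a) + len(b))


def mean_inter_version_time(snapshots):
    """Mean number of hours between successive distinct versions.

    snapshots: list of (hour, text) pairs sorted by hour.
    Returns None if the page never changed.
    """
    gaps = []
    last_change, prev_text = snapshots[0]
    for hour, text in snapshots[1:]:
        if text != prev_text:  # a new version was observed
            gaps.append(hour - last_change)
            last_change, prev_text = hour, text
    return sum(gaps) / len(gaps) if gaps else None


v1 = "top stories weather sports".split()
v2 = "top stories weather markets".split()
print(dice(v1, v2))  # 3 shared terms out of 4 + 4

crawl = [(0, "a"), (1, "a"), (2, "b"), (3, "b"), (6, "c")]
print(mean_inter_version_time(crawl))
```

Comparing exact text is the simplest possible change detector; any real pipeline would likely strip markup first.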
• 66% displayed change in 5 week sample (every 123 hours on average) • Random web: 35% change after 11 weeks
Average Inter-version Time by Page Popularity [Figure: inter-version time (hours) vs. visitors] • More visitors = faster change
Average Inter-version Time by Page “Depth” [Figure: inter-version time (hours) vs. URL depth] • More shallow (closer to homepage) = faster change
Change Plot by Type [Figure: mean Dice coefficient vs. mean inter-version time (hours) by page type: Sports/Recreation, News/Magazine, Music, Personal Pages, Adult, Industry/Trade]
Sub-hourly crawls • Over 60% of pages displayed some change when crawled every 60 minutes • What is the “true” change rate of the page?
Sub-hourly crawls [Figure: controller with original crawl 1, original crawl 2, and re-crawls at 2, 16, 32, and 60 minute delays] • Round-robin crawling • 8 samples over 3 (week)days, shifted by at least 4 hours
Range of Changes in Sub-hourly crawls [Figure: number of pages (0 to 40000) at crawl delays of 0, 2, 16, 32, and 60 minutes, split into pages that changed at least once and pages that changed every sample, with mean Dice]
Range of Changes in Sub-hourly crawls • Examples of rapidly changing content: “623 Users Online”, “Page generated in .6 ms”, “Served to IP address…”
Outline • Behavior-Driven Sampling & Crawling • Measuring change • Basic change behavior • Page evolution • Text changes • DOM level changes • Applications
Measuring change [Figure: Dice similarity to the version at t0 over time (hours), sampled at t1 through t5] • Pages are equally (dis)similar • Similarity based on • navigation elements • base language model
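A change curve of this kind can be sketched by comparing every later version against the version at t0. This is an illustrative sketch under stated assumptions: versions are plain text split on whitespace, and the names `dice` and `change_curve` are hypothetical, not taken from the paper.

```python
def dice(a, b):
    """Dice coefficient over two collections of terms."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0  # assumption: two empty pages count as identical
    return 2 * len(a & b) / (len(a) + len(b))


def change_curve(versions):
    """Dice similarity of each later version to the first one.

    versions: list of (hour, text) pairs sorted by hour; the first is t0.
    Returns a list of (hours since t0, Dice) points.
    """
    t0_hour, t0_text = versions[0]
    base = t0_text.split()
    return [(hour - t0_hour, dice(base, text.split()))
            for hour, text in versions[1:]]


versions = [
    (0, "home search news alpha beta"),
    (1, "home search news alpha gamma"),
    (2, "home search news delta gamma"),
    (3, "home search news epsilon zeta"),
]
for hours, similarity in change_curve(versions):
    print(hours, similarity)
```

In this toy input the navigation-like terms (home, search, news) never change, so the curve drops and then levels off instead of falling to zero.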
Two Segment Model • Dynamic versus static steady state • Knot point: time at which proportion of dynamic to static remains constant • 2 segment (linear): hockey stick
Calculating the Knot Point • Knot point: optimization problem
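The slides do not spell out the optimization, so the following is only one plausible sketch: try every candidate knot, fit an ordinary least-squares line to each side, and keep the knot with the smallest total squared error. The brute-force search, the shared knot point, and the names `fit_line` and `find_knot` are all assumptions made for illustration.

```python
def fit_line(xs, ys):
    """Least-squares line; returns (slope, intercept, sum of squared errors)."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx if sxx else 0.0
    intercept = my - slope * mx
    sse = sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys))
    return slope, intercept, sse


def find_knot(hours, dice_values):
    """Return the hour whose two-segment fit has the lowest total error.

    Both segments include the candidate knot point, so each has at least
    two points. Returns None when there are fewer than three points.
    """
    best_error, best_hour = None, None
    for k in range(1, len(hours) - 1):
        left = fit_line(hours[:k + 1], dice_values[:k + 1])[2]
        right = fit_line(hours[k:], dice_values[k:])[2]
        if best_error is None or left + right < best_error:
            best_error, best_hour = left + right, hours[k]
    return best_hour


hours = [0, 1, 2, 3, 4, 5, 6, 7]
values = [1.0, 0.9, 0.8, 0.7, 0.6, 0.6, 0.6, 0.6]  # hockey-stick shape
print(find_knot(hours, values))
```

The search is quadratic in the number of samples, which is fine for a few hundred hourly points per page but would need a smarter method at larger scale.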
Types of Change Curves • 3 main types • Knotted (two-segment) • Sloped • Unchanging • Automatic classification (93% accuracy*) • 70% are knotted • 145 hours mean, 92 median • 28% sloped • 2% unchanging (flat) • *Consistent with the proportions of hand labeled data
Change curves [Figure: change curves for http://www.nytimes.com and http://www.allrecipes.com] • Different stable segment: different ratios of dynamic to stable content
Change curves [Figure: Dice vs. hours (10 to 40) for Craigslist, Anchorage, AK and Craigslist, Los Angeles, CA]
Outline • Behavior-Driven Sampling & Crawling • Measuring change • Basic change behavior • Page evolution • Text changes • DOM level changes • Applications
Nature of the Text • What terms vanish here? • Or are still here?
Term Longevity Plot [Figure: terms present on the page over time, Sep. through Dec.]
Term Longevity Plot • Term level representation of change curve • Pick a vertical (t0) • Compare overlap of terms to next vertical
Features of Terms • Divergence • Which terms distinguish current document from the collection (at a point in time) • Staying power (σ) • Likelihood of observing a word (w) at two different times, t and t+α, in document D • σ(w,D) ≈ P(t) P(α) P(w | D_t, D_{t+α})
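Staying power can be approximated empirically from a sequence of page versions. The sketch below is an assumption-laden illustration, not the paper's estimator: it treats P(t) and P(α) as uniform over the observed version pairs and scores P(w | D_t, D_{t+α}) as 1 when the word appears in both versions and 0 otherwise. The name `staying_power` is hypothetical.

```python
from itertools import combinations


def staying_power(word, versions):
    """Fraction of version pairs (earlier, later) that both contain the word.

    versions: list of term sets, ordered by crawl time.
    """
    pairs = list(combinations(versions, 2))
    if not pairs:
        return 0.0  # assumption: undefined with fewer than two versions
    hits = sum(1 for earlier, later in pairs
               if word in earlier and word in later)
    return hits / len(pairs)


versions = [
    {"home", "search", "bbq"},
    {"home", "search", "salads"},
    {"home", "search", "pork"},
    {"home", "bbq", "cheese"},
]
for term in ("home", "search", "bbq"):
    print(term, staying_power(term, versions))
```

Under this scoring, a term that appears in every version gets 1.0, while a seasonal term that comes and goes scores close to 0.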
Distribution of terms by staying power (σ) (allrecipes.com) • Low staying power, high divergence: bbq, salads, sandwiches, pork, cheese, cool • High staying power, ordered from high to low divergence: cooks, cookbooks, ingredient, desserts, home, you, search, …
Outline • Behavior-Driven Sampling & Crawling • Measuring change • Basic change behavior • Page evolution • Text changes • DOM level changes • Applications
DOM Level Changes • DOM structure • How long does structure hold? • Applications with assumed stability • Programming by Demonstration (PbD) • Mashups • Scrapers, etc. • [UIST08] Adar et al., “Zoetrope: Interacting with the Ephemeral Web”
Tree Isomorphism • The “general” approach: • Compare the DOM structure of 2 trees • Produces alignment, edit distances, etc. [Grandi’04] • But: somewhat inefficient at large scale • We want: • A method for comparing many (1000s of) versions of the same page at the same time
The Idea • Serialize each DOM structure • Example @time = 0: <a>foo <b>bar</b></a> • Columns per node: full path, type path, node hash, subtree hash, version
The Idea • Serialize each DOM structure • Example @time = 1: <a>foo <b>jar</b></a>
The Idea • Serialize each DOM structure • Example @time = 2: <a><b>jar</b></a>
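The serialization step can be sketched with the standard library, using the three example versions from the slides. The column definitions here are assumptions: the full path carries sibling indices, the type path carries tag names only, the node hash covers a node's tag and own text, and the subtree hash also folds in its children. Tail text after a child element is ignored to keep the sketch short.

```python
import hashlib
import xml.etree.ElementTree as ET


def _short_hash(value):
    """Stable short hash of a string (truncated SHA-256)."""
    return hashlib.sha256(value.encode("utf-8")).hexdigest()[:8]


def serialize(markup, version):
    """Flatten a small well-formed document into one row per element.

    Row layout: (full_path, type_path, node_hash, subtree_hash, version).
    """
    rows = []

    def walk(node, full_path, type_path):
        text = (node.text or "").strip()
        seen = {}  # per-tag sibling counters for the full path
        child_hashes = []
        for child in node:
            index = seen.get(child.tag, 0)
            seen[child.tag] = index + 1
            child_hashes.append(walk(child,
                                     f"{full_path}/{child.tag}[{index}]",
                                     f"{type_path}/{child.tag}"))
        node_hash = _short_hash(node.tag + "|" + text)
        subtree_hash = _short_hash(node_hash + "".join(child_hashes))
        rows.append((full_path, type_path, node_hash, subtree_hash, version))
        return subtree_hash

    root = ET.fromstring(markup)
    walk(root, f"/{root.tag}[0]", f"/{root.tag}")
    return rows


pages = ["<a>foo <b>bar</b></a>", "<a>foo <b>jar</b></a>", "<a><b>jar</b></a>"]
table = [row for version, markup in enumerate(pages)
         for row in serialize(markup, version)]
for row in table:
    print(row)
```

Real pages are rarely well-formed XML, so an HTML parser would replace `ET.fromstring` in practice. The flat row layout is what makes the sort and reduce operators applicable.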
Operators on Serialized Data • sort(columns) • Sorts by the variables • reduce(columns) • Generates a set of sets • Look familiar?
• sort(full_path, version) • S = reduce(full_path) • foreach s in S: calculate the difference between the minimum version id and last reported id
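The sort and reduce steps map naturally onto `sorted` and `itertools.groupby`. This is a sketch under assumptions: rows are (full_path, version) pairs, the paths below are hypothetical keys that include text content so that vanished text shows up as a vanished path, and the "last reported id" is taken to be the largest version in which the path appears.

```python
from itertools import groupby
from operator import itemgetter


def survival(rows):
    """Map each full path to (last version seen) - (first version seen)."""
    ordered = sorted(rows, key=itemgetter(0, 1))          # sort(full_path, version)
    lifetimes = {}
    for path, group in groupby(ordered, key=itemgetter(0)):  # reduce(full_path)
        versions = [version for _, version in group]
        lifetimes[path] = versions[-1] - versions[0]
    return lifetimes


# Hypothetical rows for the three example versions (times 0, 1, 2).
rows = [
    ("/a/b/jar", 2), ("/a", 0), ("/a/foo", 1), ("/a/b", 1),
    ("/a/b/bar", 0), ("/a", 2), ("/a/foo", 0), ("/a/b", 0),
    ("/a", 1), ("/a/b/jar", 1), ("/a/b", 2),
]
for path, lifetime in sorted(survival(rows).items()):
    print(path, lifetime)
```

Because the data is sorted once and then scanned, thousands of versions of a page can be compared in a single pass rather than with pairwise tree alignment.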
Structure Survival Over Time • Smaller dataset ([UIST’08]) shows that mean survival after a year is only 23%
Frequencies and Motion • Frequency of change of DOM elements • Motion of elements on a page • Can we predict the motion of a page element?