Ranking Web Sites with Real User Traffic • Mark Meiss, Filippo Menczer, Santo Fortunato, Alessandro Flammini, Alessandro Vespignani • Web Search and Data Mining, Stanford, California, February 11, 2008
Outline • Data collection • Structural properties • Behavioral patterns • PageRank validation • Temporal patterns
Sources for Ranking Data: Dynamic Sources • Network flow data • Web server logs • Toolbars and plugins
Sources for Ranking Data: Packet Inspection • ISP, ~100K users
Data Collection • HTTP (port 80) GET requests, 30% @ peak, via anonymizer • Fields per request (h/p/r/a/t): Host, Path, Referer, User-Agent, Timestamp • Data sets: HUMAN (h/p/r/a/t; requests from IU only) and FULL (h/p/r/a/t)
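One way the captured fields could be turned into a weighted host graph is sketched below. It is an illustration only, not the authors' pipeline: the record keys `host` and `referer`, the function name `build_host_graph`, and the rule that a request with no referer counts as a jump are all assumptions made for this example.

```python
from collections import Counter
from urllib.parse import urlsplit

def build_host_graph(requests):
    """Aggregate (referer host -> requested host) clicks into edge weights.

    `requests` is an iterable of dicts with the hypothetical keys 'host'
    and 'referer'. A missing or empty referer is counted as a jump.
    """
    edges = Counter()  # (source host, target host) -> number of clicks
    jumps = Counter()  # target host -> requests that arrived without a referer
    for req in requests:
        target = req["host"].lower()
        referer = req.get("referer") or ""
        source = urlsplit(referer).hostname or ""
        if source:
            edges[(source, target)] += 1
        else:
            jumps[target] += 1
    return edges, jumps

# Made-up records, for illustration only.
sample = [
    {"host": "b.example", "referer": "http://a.example/index.html"},
    {"host": "b.example", "referer": "http://a.example/news"},
    {"host": "c.example", "referer": "http://b.example/"},
    {"host": "a.example", "referer": ""},
]
edges, jumps = build_host_graph(sample)
```

Only the host part of the referer is kept here, so paths collapse into a single node per site.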
Outline • Data collection • Structural properties • Behavioral patterns • PageRank validation • Temporal patterns
Behavioral patterns (HUMAN) • [Figure: proportion of total out-strength]
Ratios are stable • [Figure; axis: requests (x 10^6)]
Outline • Data collection • Structural properties • Behavioral patterns • PageRank validation • Temporal patterns
Validation of PageRank • PR is a stationary distribution of visit frequency by a modified random walk (with jumps) on the Web graph • Compare with actual site traffic (in-strength) • From an application perspective, we care about the resulting ranking of sites rather than the actual values
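Because only the ranking matters, a rank correlation is a natural way to compare PageRank against measured traffic. The sketch below uses Kendall's tau-a; the choice of measure is an assumption of this example (the slide does not name one), and the scores are made-up numbers.

```python
from itertools import combinations

def kendall_tau(x, y):
    """Kendall's tau-a between two equal-length score lists.

    Tied pairs contribute 0 to the numerator.
    """
    concordant = discordant = 0
    for i, j in combinations(range(len(x)), 2):
        s = (x[i] - x[j]) * (y[i] - y[j])
        if s > 0:
            concordant += 1
        elif s < 0:
            discordant += 1
    n_pairs = len(x) * (len(x) - 1) // 2
    return (concordant - discordant) / n_pairs

# Hypothetical scores for four sites, listed in the same site order.
pagerank_scores = [0.40, 0.30, 0.20, 0.10]
traffic = [500, 900, 100, 300]  # in-strength, i.e. clicks received
tau = kendall_tau(pagerank_scores, traffic)
```

A value of 1 means the two rankings agree on every pair of sites, -1 means they disagree on every pair, and values near 0 mean the rankings are largely unrelated.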
PageRank Assumptions • Equal probability of teleporting to each of the nodes • Equal probability of teleporting from each of the nodes • Equal probability of following each link from any given node
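The three assumptions show up directly in a textbook power iteration. The following is a minimal sketch, not the code used in the study: the function name, the damping value 0.85, the toy graph, and the uniform handling of nodes without out-links are assumptions of this example.

```python
def pagerank(out_links, damping=0.85, iterations=100):
    """Textbook PageRank by power iteration.

    `out_links` maps every node to a list of its link targets; each target
    must also appear as a key.
    """
    nodes = sorted(out_links)
    n = len(nodes)
    rank = {v: 1.0 / n for v in nodes}
    for _ in range(iterations):
        # Teleporting TO each node is equally likely: same jump mass everywhere.
        new_rank = {v: (1.0 - damping) / n for v in nodes}
        for v in nodes:
            targets = out_links[v]
            if targets:
                # Teleporting FROM each node is equally likely (one shared
                # damping factor), and each out-link is followed equally often.
                share = damping * rank[v] / len(targets)
                for t in targets:
                    new_rank[t] += share
            else:
                # Assumption: a node with no out-links spreads its rank uniformly.
                for t in nodes:
                    new_rank[t] += damping * rank[v] / n
        rank = new_rank
    return rank

# Toy graph, for illustration only.
graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
ranks = pagerank(graph)
```

Replacing any of the three uniform choices with measured frequencies (jump targets, per-site jump rates, per-link click counts) changes the resulting ranking, which is what the comparison with real traffic tests.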
Local Link Heterogeneity • HH index of concentration or disparity • [Figure: scale from perfect homogeneity to perfect concentration]
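The HH (Herfindahl-Hirschman) index is the sum of squared shares. A minimal sketch follows, assuming it is applied to the click counts on one site's outgoing links; the function name and the numbers are illustrative.

```python
def hh_index(weights):
    """Herfindahl-Hirschman index: sum of squared shares of the total.

    Returns 1/k when k links carry equal traffic (perfect homogeneity)
    and approaches 1 when a single link carries nearly everything
    (perfect concentration).
    """
    total = sum(weights)
    return sum((w / total) ** 2 for w in weights)

# Made-up click counts on the four out-links of one site.
even = hh_index([10, 10, 10, 10])
skewed = hh_index([97, 1, 1, 1])
```

With four equally used links the index is 1/4; in the skewed case it is close to 1, even though the site has the same number of links.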
Teleportation Source Heterogeneity (“hubness”) • s_out > s_in: teleport sources (popular hubs) • s_out < s_in: browsing sinks • [Figure; plot annotation: -2]
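The comparison of s_out with s_in can be computed per site from weighted click edges. This is a sketch under assumptions: the edge dictionary, the function name `strengths`, and the toy counts are made up for illustration.

```python
from collections import defaultdict

def strengths(edges):
    """Return (s_in, s_out): clicks arriving at and leaving each host via links."""
    s_in = defaultdict(int)
    s_out = defaultdict(int)
    for (source, target), clicks in edges.items():
        s_out[source] += clicks
        s_in[target] += clicks
    return s_in, s_out

# Hypothetical weighted host graph: (source, target) -> clicks.
edges = {("a", "b"): 8, ("a", "c"): 4, ("b", "c"): 3, ("c", "a"): 1}
s_in, s_out = strengths(edges)

# Ratio above 1: more link traffic leaves the host than arrives through links,
# so the surplus must have started there. Ratio below 1: traffic tends to stop.
ratio = {h: s_out[h] / s_in[h] for h in s_in if s_in[h] > 0}
```

Under uniform teleportation every site would start the same share of sessions, so a wide spread of these ratios across sites is evidence against that assumption.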
Outline • Data collection • Structural properties • Behavioral patterns • PageRank validation • Temporal patterns
Temporal patterns • How predictable are traffic patterns? -- Cache refreshing (e.g. proxies) -- Capacity allocation (e.g. peering and provisioning for spikes) -- Site design (e.g. expose content based on time of day)
Temporal patterns • Predict future host graph (clicks) from current one, as a function of delay • Generalized temporal precision and recall:
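The formula on this slide did not survive extraction. The sketch below shows one plausible generalization of precision and recall to weighted edges, using the per-edge minimum as the overlap; that definition, the function name, and the click counts are assumptions of this example, not the authors' stated formula.

```python
def temporal_precision_recall(current, future):
    """Weighted precision and recall of `current` as a prediction of `future`.

    Both arguments map (source, target) edges to click counts and are
    assumed to be non-empty. The overlap on an edge is the smaller count.
    """
    all_edges = set(current) | set(future)
    overlap = sum(min(current.get(e, 0), future.get(e, 0)) for e in all_edges)
    precision = overlap / sum(current.values())  # share of predicted clicks that occur
    recall = overlap / sum(future.values())      # share of actual clicks that were predicted
    return precision, recall

# Made-up click counts for the host graph now and after some delay.
now = {("a", "b"): 6, ("b", "c"): 4}
later = {("a", "b"): 3, ("b", "c"): 4, ("c", "a"): 13}
precision, recall = temporal_precision_recall(now, later)
```

Evaluating the pair for increasing delays between the two graphs gives precision and recall as a function of delay.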
Summary • Heterogeneity: incoming and outgoing site traffic, link traffic • Less than half of traffic is from following links • Only 5% of traffic is directly from search engines • High temporal regularity • PageRank is a poor predictor of traffic: random walk and random teleportation assumptions violated
Next • Sampling bias and search bias • From host graph to page graph • Modeling traffic: Beyond random walk?
THANKS! Questions? • Mark Meiss, Filippo Menczer, Santo Fortunato, Alessandro Flammini, Alessandro Vespignani