
Ranking Web Sites with Real User Traffic

Mark Meiss, Filippo Menczer, Santo Fortunato, Alessandro Flammini, Alessandro Vespignani. Web Search and Data Mining, Stanford, California, February 11, 2008. Outline: data collection, structural properties, behavioral patterns, PageRank validation, temporal patterns.


Presentation Transcript


  1. Ranking Web Sites with Real User Traffic • Mark Meiss, Filippo Menczer, Santo Fortunato, Alessandro Flammini, Alessandro Vespignani • Web Search and Data Mining, Stanford, California, February 11, 2008

  2. Outline • Data collection • Structural properties • Behavioral patterns • PageRank validation • Temporal patterns

  3. Sources for Ranking Data: The Link Graph

  4. Sources for Ranking Data: Dynamic Sources • Network flow data • Web server logs • Toolbars and plugins

  5. Sources for Ranking Data: Packet Inspection • ISP • ~100 K users

  6. Data Collection • HTTP (80), 30% @ peak • Anonymizer • GET request fields: Host, Path, Referer, User-Agent, Timestamp (h/p/r/a/t) • Data sets: HUMAN (h/p/r/a/t) and FULL (h/p/r/a/t) • Requests from IU only

  7. Outline • Data collection • Structural properties • Behavioral patterns • PageRank validation • Temporal patterns

  8. Structural properties: Degree

  9. Caveat: Sampling Bias

  10. Structural properties: Strength (Site Traffic)

  11. Structural properties: Weights (Link Traffic)
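The structural slides treat the traffic data as a weighted host graph: link weights count clicks between sites, and a site's strength is the total traffic on its incoming or outgoing links. A minimal sketch of that aggregation follows, assuming each request has already been reduced to a (referer URL, target host) pair. The function names, the handling of empty referers, and the choice to ignore clicks inside one site are illustrative assumptions, not details taken from the slides.

```python
from collections import defaultdict
from urllib.parse import urlsplit


def host_of(url):
    """Return the lowercase host part of a URL, or None if it has none."""
    host = urlsplit(url).hostname
    return host.lower() if host else None


def build_host_graph(clicks):
    """Aggregate (referer_url, target_host) pairs into weighted host edges.

    Requests with an empty referer are not edges; they are counted
    separately as arrivals that did not follow a link.
    """
    weights = defaultdict(int)  # (source host, target host) -> click count
    jumps = defaultdict(int)    # target host -> requests with no referer
    for referer, target in clicks:
        source = host_of(referer) if referer else None
        if source is None:
            jumps[target] += 1
        elif source != target:  # assumption: ignore clicks inside one site
            weights[(source, target)] += 1
    return dict(weights), dict(jumps)


def strengths(weights):
    """Return (in_strength, out_strength) dicts computed from edge weights."""
    s_in, s_out = defaultdict(int), defaultdict(int)
    for (source, target), w in weights.items():
        s_out[source] += w
        s_in[target] += w
    return dict(s_in), dict(s_out)
```

In this sketch, in-strength is the quantity later compared against PageRank as "actual site traffic".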

  12. Outline • Data collection • Structural properties • Behavioral patterns • PageRank validation • Temporal patterns

  13. Behavioral patterns (HUMAN) • (Proportion of total out-strength)

  14. Ratios are stable • Requests (x 10^6)

  15. Ratios are stable • Requests (x 10^6)

  16. Outline • Data collection • Structural properties • Behavioral patterns • PageRank validation • Temporal patterns

  17. Validation of PageRank • PR is a stationary distribution of visit frequency by a modified random walk (with jumps) on the Web graph • Compare with actual site traffic (in-strength) • From an application perspective, we care about the resulting ranking of sites rather than the actual values
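The random walk with jumps described on this slide can be sketched as a power iteration. The code below is a generic textbook formulation, not the authors' implementation. The damping factor of 0.85, the uniform treatment of dangling nodes, and the function name are assumptions chosen for illustration.

```python
def pagerank(out_links, damping=0.85, tol=1e-10, max_iter=200):
    """Approximate the stationary visit frequencies of a random walk with jumps.

    out_links maps each node to a list of the nodes it links to.
    With probability `damping` the walker follows a uniformly chosen
    out-link; otherwise it jumps to a uniformly chosen node.
    """
    nodes = sorted(set(out_links) | {v for ts in out_links.values() for v in ts})
    n = len(nodes)
    rank = {u: 1.0 / n for u in nodes}
    for _ in range(max_iter):
        # Rank held by nodes without out-links is spread uniformly.
        dangling = sum(rank[u] for u in nodes if not out_links.get(u))
        new = {u: (1.0 - damping) / n + damping * dangling / n for u in nodes}
        for u in nodes:
            targets = out_links.get(u, [])
            for v in targets:
                new[v] += damping * rank[u] / len(targets)
        change = sum(abs(new[u] - rank[u]) for u in nodes)
        rank = new
        if change < tol:
            break
    return rank
```

Sorting sites by the returned values gives the ranking that the talk compares against the ranking by measured in-strength.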

  18. Kendall’s Rank Correlation
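Kendall's rank correlation compares two orderings of the same items by counting pairs that agree and pairs that disagree. The sketch below computes the simple tau-a variant with a quadratic pair scan. Which variant the talk uses, and how ties are handled there, is not stated on the slide, so both choices here are assumptions.

```python
from itertools import combinations


def kendall_tau(x, y):
    """Return Kendall's tau-a for two equal-length score lists.

    A pair of items is concordant when both lists order it the same way
    and discordant when they order it oppositely. Tied pairs count as
    neither, but still appear in the denominator.
    """
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two lists of equal length >= 2")
    concordant = discordant = 0
    for i, j in combinations(range(len(x)), 2):
        sign = (x[i] - x[j]) * (y[i] - y[j])
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
    n_pairs = len(x) * (len(x) - 1) // 2
    return (concordant - discordant) / n_pairs
```

A value of 1 means the two rankings agree on every pair, -1 means they are exactly reversed, and values near 0 mean the rankings are close to unrelated.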

  19. PageRank Assumptions • Equal probability of teleporting to each of the nodes • Equal probability of teleporting from each of the nodes • Equal probability of following each link from any given node

  20. Kendall’s Rank Correlation

  21. Local Link Heterogeneity • HH index of concentration or disparity • Ranges from perfect homogeneity to perfect concentration
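Assuming "HH" refers to the Herfindahl-Hirschman index, the disparity of one site's outgoing traffic can be computed from the weights of its outgoing links. The slide itself shows no formula, so the definition below (sum of squared traffic shares) and the function name are assumptions.

```python
def hh_index(link_weights):
    """Return the sum of squared traffic shares over one node's links.

    For k links the value is 1/k when traffic is spread evenly
    (perfect homogeneity) and 1 when a single link carries all of it
    (perfect concentration).
    """
    total = sum(link_weights)
    if total <= 0:
        raise ValueError("link weights must sum to a positive value")
    return sum((w / total) ** 2 for w in link_weights)
```

Under this definition, a value well above 1/k for a site with k outgoing links indicates that the "equal probability of following each link" assumption does not hold for that site.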

  22. Teleportation Target Heterogeneity

  23. Teleportation Source Heterogeneity ("hubness") • Regions: s_out > s_in, s_out < s_in • Figure labels: popular hubs, teleport sources, browsing sinks, -2

  24. Navigation vs. Jumps: Sources of Popularity

  25. Outline • Data collection • Structural properties • Behavioral patterns • PageRank validation • Temporal patterns

  26. Temporal patterns • How predictable are traffic patterns? • Cache refreshing (e.g. proxies) • Capacity allocation (e.g. peering and provisioning for spikes) • Site design (e.g. expose content based on time of day)

  27. Temporal patterns • Predict future host graph (clicks) from current one, as a function of delay • Generalized temporal precision and recall:
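The formula for generalized temporal precision and recall did not survive in this transcript. One plausible weighted generalization, shown here purely as an assumption and not as the authors' definition, treats the current host graph as the prediction and the later graph as the truth, with the overlap on each edge being the smaller of its two click counts.

```python
def temporal_precision_recall(current, future):
    """Score how well the current weighted graph predicts a future one.

    Both arguments map an edge (source, target) to a click count.
    Precision is the overlap divided by total current clicks and recall
    is the overlap divided by total future clicks. This definition is
    an illustrative assumption.
    """
    overlap = sum(min(w, future.get(edge, 0)) for edge, w in current.items())
    total_current = sum(current.values())
    total_future = sum(future.values())
    precision = overlap / total_current if total_current else 0.0
    recall = overlap / total_future if total_future else 0.0
    return precision, recall
```

Evaluating this for pairs of snapshots separated by increasing delays would give the "function of delay" curve that the slide describes.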

  28. HUMAN host graph (FULL is about 10% more predictable)

  29. Summary • Heterogeneity: incoming and outgoing site traffic, link traffic • Less than half of traffic is from following links • Only 5% of traffic is directly from search engines • High temporal regularity • PageRank is a poor predictor of traffic: random walk and random teleportation assumptions violated

  30. Next • Sampling bias and search bias • From host graph to page graph • Modeling traffic: Beyond random walk?

  31. CNLL • THANKS! • ? • Mark Meiss, Filippo Menczer, Santo Fortunato, Alessandro Flammini, Alessandro Vespignani
