Towards Understanding Modern Web Traffic Sunghwan Ihm and Vivek S. Pai Google Inc. / Princeton University
Web Changes and Growth
• Simple static documents → complex rich media applications
• Heavy client-side interactions (e.g., Ajax)
• Traffic increase
  • Social networking, file-sharing, and video streaming sites
• Trends expected to continue
  • Applications migrated to the Web
  • A de facto standard interface of cloud services
Sunghwan Ihm, Princeton University
Understanding Changes
• Goal: shape system design by better understanding the traffic optimization opportunities
  • Improve response times
  • Understand caching effectiveness
  • Design intermediary systems: firewalls, security analyzers, and reporting/monitoring systems
Challenges
• Tracking changes
  • Requires a large-scale data set spanning many years, collected under the same conditions
• Web page analysis
  • Requires new analysis techniques suitable for dynamic Web pages with client-side interactions (e.g., Ajax)
• Redundancy and caching
  • Requires full content instead of simple access logs to assess the implications of content-based caching
We address these challenges by
• Analyzing large-scale data with full content
• Developing a new Web page analysis technique
CoDeeN Traffic
• CoDeeN content distribution network (CDN)
  • http://codeen.cs.princeton.edu/
  • A semi-open, globally distributed open proxy on 500+ PlanetLab nodes
  • Running since 2003
  • 30+ million requests per day
Data Collection
[Diagram labels: User, Browser Cache, Local Proxy Cache, CoDeeN Cache, WAN, Origin Web Server; Access Logs, Full Content]
• Assume local proxy caches
• 1. Access logs (all requests, but limited information)
  • URL, Timestamp, Content-Length, Content-Type, Referer, etc.
• 2. Full content (cache misses)
  • Header + body
Data Set
• 5 years: from 2006 to 2010
• Focus on one month (April) per year
• Full content data only for 2010
• Total volume per month
  • 3.3~6.6 TB
  • 280~460 million requests
  • 240~360K unique client IPs (40~60% of /8 networks)
  • 168~187 countries and regions
  • 820K~1.2 million servers
Focus on US, CN, FR, BR: 100M+ requests / 1 TB+ / 100K+ users
Analysis Outline
1. High-level analysis
2. Page-level analysis
3. Caching analysis
[Diagram labels: Access Logs, Full Content]
1. High-Level Analysis
• Q: What has changed over five years?
  • Connection speed
  • NAT usage
  • Max # of concurrent browser connections
  • Content type
  • Object size
  • Traffic share of Web sites
Content Type
• US, 2006 → 2010, both X and Y axes in log scale
• A sharp increase of Ajax: JavaScript / CSS / XML
• A sharp increase of Flash video (FLV) (<5% → 25%)
Traffic Share of Web Sites
• Increase in video sites' traffic
• Increase in ad networks' and analytics sites' requests (~12%)
  • Ad network market growth
• Most accessed sites by users
  • Search / analytics
  • google.com, baidu.com, google-analytics.com
  • User share increasing, tracking up to 65% of users
2. Page-Level Analysis
• Q: How have Web pages changed?
  • New page detection heuristic
  • Initial page characteristics
    • Page size / # of embedded objects / latency
  • Page load latency simulation
  • Entire page characterization
Page Detection Problem
• Given a set of access logs, detect the page boundaries
  • Needed to measure # of embedded objects, page size, time, etc.
• Challenge: previous approaches from the 1990s are a poor fit and inaccurate for modern Web traffic
[Diagram: requests along a time axis, labeled as main or embedded objects]
Previous Approach #1: Time-based
• Check idle time between requests
• If within a threshold (e.g., 1 second), they belong to the same page
• Misclassifies client-side interactions (Ajax) with longer idle time as pages
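The time-based rule above can be sketched in a few lines of Python. This is only an illustration: the function name `split_by_idle_time`, the sample timestamps, and the choice to compare each request against the immediately preceding one are assumptions, not details taken from the slides.

```python
from typing import List


def split_by_idle_time(timestamps: List[float], threshold: float = 1.0) -> List[List[float]]:
    """Group request timestamps (seconds) into pages.

    A request joins the current page when the gap since the previous
    request is within `threshold`; otherwise it starts a new page.
    """
    pages: List[List[float]] = []
    for t in sorted(timestamps):
        if pages and t - pages[-1][-1] <= threshold:
            pages[-1].append(t)
        else:
            pages.append([t])
    return pages


# Hypothetical trace: a page load, a second page load, then one late Ajax call.
requests = [0.0, 0.2, 0.5, 4.0, 4.1, 9.0]
pages = split_by_idle_time(requests)
print(len(pages))  # the lone request at 9.0 is counted as its own "page"
```

The last request illustrates the weakness named on the slide: a client-side interaction that arrives after a long idle period is indistinguishable from a new page under this rule.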
Previous Approach #2: Type-based
• Check file extension / content type
• Regard every HTML object as a main object
• Misclassifies frames/iframes within a page as separate pages
StreamStructure Algorithm
1. Group logs into streams by the Referer field
2. Consider all HTML objects as main object candidates (Type-based)
3. Ignore those with no children (embedded objects)
4. Apply idle time among the candidates to finalize the selection (Time-based)
[Diagram annotations: Ajax, frames/iframes]
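The four steps can be sketched as follows. This is a simplified reading of the slide, not the authors' implementation: the `LogEntry` fields, the way streams are built by following Referer chains to a root URL, and the rule that a candidate is accepted only when the stream was idle for at least `idle_threshold` seconds before it are all assumptions made for illustration.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class LogEntry:  # hypothetical access-log record
    timestamp: float
    url: str
    referer: Optional[str]
    content_type: str


def find_main_objects(entries: List[LogEntry], idle_threshold: float = 1.0) -> List[str]:
    entries = sorted(entries, key=lambda e: e.timestamp)
    referer_of: Dict[str, Optional[str]] = {}
    for e in entries:
        referer_of.setdefault(e.url, e.referer)

    def stream_root(url: str) -> str:
        # Follow Referer links upward; `seen` guards against cycles.
        seen = set()
        while url not in seen and referer_of.get(url):
            seen.add(url)
            url = referer_of[url]
        return url

    # Step 1: group logs into streams by Referer.
    streams: Dict[str, List[LogEntry]] = defaultdict(list)
    for e in entries:
        streams[stream_root(e.url)].append(e)

    # URLs that some other request names as its Referer have children.
    has_children = {e.referer for e in entries if e.referer}

    main_objects: List[str] = []
    for stream in streams.values():
        last_time: Optional[float] = None
        for e in stream:
            # Steps 2 and 3: HTML objects with at least one child are candidates.
            candidate = e.content_type.startswith("text/html") and e.url in has_children
            # Step 4: keep a candidate only if it follows enough idle time.
            if candidate and (last_time is None or e.timestamp - last_time >= idle_threshold):
                main_objects.append(e.url)
            last_time = e.timestamp
    return main_objects


base = "http://a.example/"  # hypothetical site
log = [
    LogEntry(0.0, base, None, "text/html"),
    LogEntry(0.1, base + "style.css", base, "text/css"),
    LogEntry(0.2, base + "frame.html", base, "text/html"),             # iframe
    LogEntry(0.3, base + "frame.png", base + "frame.html", "image/png"),
    LogEntry(5.0, base + "api?x=1", base, "application/json"),         # Ajax
    LogEntry(20.0, base + "page2.html", base, "text/html"),
    LogEntry(20.1, base + "page2.css", base + "page2.html", "text/css"),
    LogEntry(30.0, base + "popup.html", base + "page2.html", "text/html"),  # no children
]
main = find_main_objects(log)
print(main)
```

In this hypothetical log the iframe is rejected because it arrives without idle time, the Ajax call is rejected because it is not HTML, and the childless popup is rejected in step 3, leaving the two real page loads.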
Validation
• Ground truth: browse Alexa's top 100 sites
  • Visit about 10 pages per site
  • Record Web page URLs (main objects)
  • Total 1197 pages
• Precision: # correct pages found / # total pages found
• Recall: # correct pages found / # total correct pages
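The two metrics can be computed directly from the set of detected main-object URLs and the recorded ground truth. The sketch below uses made-up URLs; the function name `precision_recall` is an assumption for illustration.

```python
from typing import Iterable, Tuple


def precision_recall(found: Iterable[str], truth: Iterable[str]) -> Tuple[float, float]:
    """Return (precision, recall) for detected pages against ground truth."""
    found_set, truth_set = set(found), set(truth)
    correct = len(found_set & truth_set)
    precision = correct / len(found_set) if found_set else 0.0
    recall = correct / len(truth_set) if truth_set else 0.0
    return precision, recall


# Hypothetical example: 3 pages detected, 2 of them correct, 4 pages in the ground truth.
truth_pages = ["/a", "/b", "/c", "/d"]
found_pages = ["/a", "/b", "/x"]
precision, recall = precision_recall(found_pages, truth_pages)
print(precision, recall)
```

Here precision is 2/3 (one detected page is wrong) and recall is 2/4 (two true pages were missed).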
Validation Result
• StreamStructure outperforms other approaches
• Robust to the idle time parameter selection
[Chart residue: labels "Better" and "1 sec"; values 4, 26~33, 19~30, 4~24]
Identifying Initial Page Loads
[Diagram labels: Initial Page Load, Client-side Interactions (e.g., Ajax)]
• Initial page: user-perceived page → user-perceived latency → traffic/revenue of Web sites
• Apply the Time-based approach, but DNS lookup or browser processing time can vary significantly
• Use the Google Analytics beacon
  • JavaScript collecting various client-side information
  • Fires when the document is loaded
40-60% of traffic occurs after initial page loads
Initial Page Size and # Objects
• Initial pages become increasingly complex
• US: about 2x increase
  • 2006: 69 KB / 6 objects
  • 2010: 133 KB / 12 objects
[Slide annotation: Caching Effectiveness]
Initial Page Load Latency
• Median latency dropped in 2009 and 2010
  • Increased # of concurrent browser connections
  • Reduced per-object latency from improved caching behavior / client bandwidth
3. Caching Analysis
• Q: Implications for caching?
  • URL popularity
  • Caching effectiveness
  • Required cache storage size
  • Impact of aborted transfers
Two Caching Approaches
• HTTP object-based approach
  • Whole object
  • HTTP-cacheable only
  • Previously reported cache hit rate: 35~50%
  • Byte hit rate usually much less
• Content-based approach
  • Cache smaller chunks instead of objects
  • Protocol independent
  • Effective for uncacheable content as well
  • WAN accelerators, storage/file systems
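The difference between the two approaches can be illustrated with a toy byte-hit-rate calculation over an unbounded cache. This is a sketch under stated assumptions: the `(url, body, cacheable)` tuple format and the sample trace are invented, the object cache is keyed on URL plus a digest of the body, and chunks are cut at fixed 128-byte offsets for simplicity (systems of this kind often use content-defined chunk boundaries instead).

```python
import hashlib
from typing import List, Tuple

Response = Tuple[str, bytes, bool]  # (url, body, http_cacheable), hypothetical format


def object_byte_hit_rate(responses: List[Response]) -> float:
    """Whole-object cache: a hit needs the same URL and body, and HTTP cacheability."""
    seen = set()
    hit = total = 0
    for url, body, cacheable in responses:
        total += len(body)
        if cacheable:
            key = (url, hashlib.sha1(body).digest())
            if key in seen:
                hit += len(body)
            seen.add(key)
    return hit / total if total else 0.0


def chunk_byte_hit_rate(responses: List[Response], chunk_size: int = 128) -> float:
    """Content-based cache: any previously seen chunk is a hit, regardless of URL."""
    seen = set()
    hit = total = 0
    for _url, body, _cacheable in responses:
        total += len(body)
        for i in range(0, len(body), chunk_size):
            chunk = body[i:i + chunk_size]
            digest = hashlib.sha1(chunk).digest()
            if digest in seen:
                hit += len(chunk)
            else:
                seen.add(digest)
    return hit / total if total else 0.0


shared = bytes(range(256))  # 256 bytes of content reused across responses
trace = [
    ("/news", shared + b"A" * 128, False),  # uncacheable page, version 1
    ("/news", shared + b"B" * 128, False),  # same URL, updated content
    ("/logo", shared, True),                # cacheable object
    ("/logo", shared, True),                # repeat request
]
object_rate = object_byte_hit_rate(trace)
chunk_rate = chunk_byte_hit_rate(trace)
print(object_rate, chunk_rate)
```

In this invented trace the object cache only hits on the repeated `/logo` request, while the chunk cache also finds the shared bytes inside the uncacheable, updated `/news` responses.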
Ideal Cache Hit Rate
• HTTP object-based: 17~28%
  • Mainly effective for JavaScript and images
• Content-based: 42~51% with 128-byte chunks
  • Effective for any content type
• Growth of the tail hurts caching
[Chart annotation: 1.8~2.5x]
Origins of Redundancy
• Most of the additional savings come from redundancy
  • across different versions (intra-URL)
  • across different objects (inter-URL)
[Figure: US, 128-byte chunks; chart labels: Aborted, Content updates]
Required Cache Storage Size
• 1-KB chunks outperform 128-byte chunks when metadata overhead is included
  • MRC: Multi-Resolution Chunking (USENIX '10)
• Increases working set size
• Large cache storage highly desirable
[Chart annotation: CN: 218 GB]
Conclusions
• Analyzed five years of real Web traffic with over 70,000 users
• Observed a rise of Ajax and Flash video, and search engine / analytics sites tracking 65% of users
• Developed StreamStructure
  • Half of the traffic occurs due to client-side interactions after initial page loads
  • Pages have become increasingly complex
• Content-based caching with large cache storage highly desirable
  • 2x larger byte hit rate, aborted transfers
Thank You
sihm@cs.princeton.edu
http://www.cs.princeton.edu/~sihm/