Workload Characterization of a Personalized Web Site And Its Implication for Dynamic Content Caching. Weisong Shi, Randy Wright*, Eli Collins, and Vijay Karamcheti. Department of Computer Science, New York University. * NYUHome Team. http://www.cs.nyu.edu/~weisong/conca.html
Trends in Web Content Access • Rapid growth of traffic for dynamic and personalized content • Dynamic web services • E.g., My Yahoo!, MyCiti.com • Trickle-down effect for static web pages • Web caching and CDN • 50% of requests for dynamically generated content • Wolman/Voelker/Levy, SOSP’99 • 30% of requests carry cookies (indicates personalization) • Caceres/Douglis/Rabinovich, SIGMETRICS Server Perf. Workshop’98 • However, traditional web caching architectures do not work well with these trends
Problem and Solution • Problem: How to efficiently generate/deliver dynamic and personalized content? • Solution: object composition technique • Basic idea: reuse at sub-document level • Quasi-static document template (e.g., ESI or XSL-FO) • Multiple objects with different characteristics • 60% of bytes of dynamic content can be reused (Shi’02, Wills’00) • Several projects • Server-side: DUP (Challenger’99) • Cache-side: HPP (Douglis’97), Content Assembly (Mikhailov/Wills’00), EdgeSuite (Akamai), WebSphere (IBM) • The CONCA project is our effort • Reuse “sharable” portion of personalized content • Transcode content to suit client device and network connection
What is Missing? • Questions • Are object composition techniques in fact required, and are they likely to be beneficial? • What architecture is well suited for dynamic content caching? • To answer these questions, we need... • A better understanding of the characteristics of dynamic and personalized content from both a server and a client perspective • This study focuses on characterization of a personalized web site • Complements previous work • Analysis of the MSNBC web site (Padmanabhan and Qiu, 2000) • Analysis of an e-commerce site (Arlitt et al., 2001)
Roadmap • Motivation • NYUHome • Trace gathering • Analysis of characteristics • Implications for dynamic content caching • Related work • Summary
NYUHome • Portal for NYU students, faculty and staff (44,000 users) • Personalized web site • Tab-based design • Users choose channels and lay out the chosen channels as desired
NYUHome: Tabs and Channels • Five tabs: HOME, ACADEMICS, RESEARCH, NEWS, FILES • Twenty channels
Trace Gathering • Process time (Tp) • Tp = T3 - T2 • Network latency (Tn) • Estimated by adding a blank pixel image to the response • Tn = T5 - T3 • [Figure: request timeline with timestamps T1, T2, T4, T5]
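The two measurements can be derived from per-request timestamps. The following is a minimal Python sketch, assuming that the operators lost from the slide are subtractions (Tp = T3 - T2 and Tn = T5 - T3) and that all timestamps are in seconds on one clock. The `RequestTimes` record and the sample values are hypothetical and are not taken from the trace.

```python
from dataclasses import dataclass


@dataclass
class RequestTimes:
    """Hypothetical per-request record; fields follow the slide's labels T2, T3, T5."""
    t2: float
    t3: float
    t5: float


def processing_time(r: RequestTimes) -> float:
    # Tp = T3 - T2 (assumed reading of the slide)
    return r.t3 - r.t2


def network_latency(r: RequestTimes) -> float:
    # Tn = T5 - T3 (assumed reading of the slide)
    return r.t5 - r.t3


# Made-up timestamps, for illustration only.
sample = RequestTimes(t2=10.0, t3=11.5, t5=14.0)
print(processing_time(sample), network_latency(sample))
```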
Overall Characteristics • Two-week period (02/13/2002 to 02/28/2002) • Aggregate statistics • 643,853 total requests (1706 requests/hour) • 27,576 total users (62% of registered users) • 73,119 total distinct IP addresses
Distribution of Requests • Classify the source IP addresses into 5 categories • Campus, Resnet, Dialup, Overseas, and Others • Findings • Machines on the NYU campus contribute 17% of IPs but 69% of requests • 83% of IPs fall outside NYU control • Grouped into 4183 network clusters (Krishnamurthy’00) • 60 clusters have more than 100 IPs
Requests to Tabs • 90.1% to default HOME tab • Most users use NYUHome primarily for checking e-mail • Template occupies a significant portion (30% to 60%) • Agrees with other studies on dynamic content (Wills’00, Shi’02)
Channel Size vs. Document Size • 99% of channels are less than 3000 bytes • Modeled well by Weibull distribution with • Agrees with our previous study on e-commerce sites • 70% of the documents lie in a small range (9,725, 10,688) • Popularity of HOME tab, and sizeable fraction of template size
User Behavior: Session Characteristics • Number of requests per session • A session is defined as the requests sharing the same session key • 82.85% of sessions contain only one request • Inter-request time within a session • Average is 492.7 seconds, median is 92.9 seconds • Reason why persistent connections are disabled at NYUHome • Captured very well by a Lognormal distribution
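The gap between the average and the median inter-request time is what a heavy-tailed Lognormal predicts. As a rough illustration, the Python sketch below matches a Lognormal to the two reported statistics, using the standard identities median = exp(mu) and mean = exp(mu + sigma^2 / 2). This moment matching is an assumption made here for illustration; the slides do not give the fitted parameters or the fitting method.

```python
import math

MEAN_S = 492.7    # average inter-request time reported on the slide (seconds)
MEDIAN_S = 92.9   # median inter-request time reported on the slide (seconds)


def lognormal_params(mean: float, median: float) -> tuple:
    """Return (mu, sigma) of a Lognormal with the given mean and median."""
    mu = math.log(median)                              # median = exp(mu)
    sigma = math.sqrt(2.0 * math.log(mean / median))   # mean = exp(mu + sigma^2 / 2)
    return mu, sigma


mu, sigma = lognormal_params(MEAN_S, MEDIAN_S)
print(f"mu = {mu:.2f}, sigma = {sigma:.2f}")
```

A mean far above the median forces a large sigma, which is consistent with most sessions having short gaps and a few having very long ones.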
User Behavior: Client Popularity • Relation between user rank and the number of requests • Rank is based on the number of requests a user issues • If client popularity follows a Zipf-like distribution, the log-log plot should appear linear with a slope near -α • α = 0.35 for the top 2000 users
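The linearity test can be sketched as a least-squares fit in log-log space. The Python example below assumes the usual Zipf-like form, in which the request count of the user at a given rank is proportional to rank^(-α), so the fitted slope estimates -α. The request counts are synthetic, generated with the exponent 0.35 reported on the slide; they are not trace data, and `loglog_slope` is a helper written for this illustration.

```python
import math


def loglog_slope(counts):
    """Least-squares slope of log(count) against log(rank), with ranks starting at 1."""
    ranked = sorted(counts, reverse=True)
    xs = [math.log(rank) for rank in range(1, len(ranked) + 1)]
    ys = [math.log(c) for c in ranked]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var = sum((x - mean_x) ** 2 for x in xs)
    return cov / var


# Synthetic counts following rank^(-0.35) for 2000 users; for illustration only.
synthetic = [1000.0 * rank ** -0.35 for rank in range(1, 2001)]
alpha = -loglog_slope(synthetic)
print(f"estimated alpha = {alpha:.2f}")
```

On real logs the points scatter around the line, so the fit is usually restricted to the most active users, as the slide does with the top 2000.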
User Behavior: Personalization • Calculate the total number of channel combinations • Default vs. personalized users • Counted the number of distinct users who used a particular channel combination for each tab
User Behavior: Personalization • Percentage of requests to different channel combinations • Observation: significant percentage differs only in layout
Request Processing Cost • Apache 1.3 on 12-processor Sun E10000 (399MHz) • Average Tp = 1.41s • Relationship with server load • Observation: average Tp is independent of load • Inherent overhead of dynamic generation of personalized content
Request Processing Cost: A Closer Look • Correlation coefficient between • Tp and the number of channels: 0.98 • which explains the lower Tp of ACADEMICS and RESEARCH • Tp and document size: 0.04 • A simple model of processing overhead, where Tc is the cost of obtaining content from cache, Tg the cost of generating the content synchronously, and Ta the cost of assembling the content into a document • Relationship: Tc + Ta = 0.32s, Tg + Ta = 0.52s • which means generating incurs an additional 0.2 seconds
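The 0.2-second figure follows from subtracting the two measured sums, because the assembly cost Ta appears in both and cancels. A small Python check, using only the two values stated on the slide:

```python
TC_PLUS_TA = 0.32  # seconds: content obtained from cache, plus assembly (from the slide)
TG_PLUS_TA = 0.52  # seconds: content generated synchronously, plus assembly (from the slide)

# (Tg + Ta) - (Tc + Ta) = Tg - Tc, so Ta drops out of the difference.
extra_generation_cost = TG_PLUS_TA - TC_PLUS_TA
print(f"Tg - Tc = {extra_generation_cost:.2f} s")
```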
Network Latency and Throughput • Average Tn = 2.45s, 15% of requests require more than 5s • Both latency and throughput are captured well by a Lognormal distribution • Agrees with Balakrishnan et al.’s study of 1996 Atlanta Olympic traces • Correlation coefficient between Tn and document size • -0.0031, but strong correlation after categorization • Diversity of network connections • Two LAN-like, one WAN-like, one phone modem, and others • Access over NYU Dialup is 7 times slower than access from Resnet
Implications for Dynamic Content Caching • Need for efficient delivery of personalized content • 30% of users are using personalization • Larger if we count the “email checker phenomenon” • Increased server overheads and larger network latencies • Potential benefits from using the object composition technique • Advocate caching of channel content at proxy caches or surrogates • 6 of 11 channels larger than 1K bytes are sharable • 30% to 60% contributed by quasi-static template • Taking the HOME tab as an example, 96% potential bandwidth saving • Only the Email channel needs to be fetched
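The bandwidth argument can be made concrete with a small Python sketch of object composition at a proxy: the template and sharable channels are served from cache, and only non-sharable parts are fetched from the server. All byte sizes and the "news" and "weather" channel names are invented for illustration; they are chosen to sit near the document sizes mentioned earlier and are not measurements from the trace. `Part` and `bandwidth_saving` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass
class Part:
    name: str
    size_bytes: int
    sharable: bool  # True if a proxy may reuse this part across users


def bandwidth_saving(parts):
    """Fraction of document bytes that need not be fetched from the origin server."""
    total = sum(p.size_bytes for p in parts)
    fetched = sum(p.size_bytes for p in parts if not p.sharable)
    return 1.0 - fetched / total


# Hypothetical HOME tab: a template plus three channels (sizes are made up).
home_tab = [
    Part("template", 5000, True),
    Part("news", 2500, True),
    Part("weather", 2100, True),
    Part("email", 400, False),  # personal content, fetched on every request
]
print(f"potential saving: {bandwidth_saving(home_tab):.0%}")
```

The saving depends only on the byte share of the non-sharable parts, which is why a small personal channel inside a large template yields a high potential saving.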
Implications for Dynamic Content Caching • Benefits from proxy prefetching and/or server pushing • Sizeable fraction (40%) of personalized channels • Solution: Server pushing or proxy prefetching • Long inter-request interval within a session allows more sophisticated prefetching policies • Benefits from predicting access patterns • Conflicting demand between prefetching and personalization needs to be reconciled • Zipf-like client popularity allows us to predict a small group • Need for customizing content based on network connection • Solution: different default layouts and channel content
Related Work • Workload characterization • A lot of previous work for static web content • Few on dynamic web content • Analysis of MSNBC by Padmanabhan and Qiu (2000) • Analysis of a large shopping site by Arlitt et al. (2001) • Sub-document level and instrumented logs • Personalization • My Yahoo! user experience analysis by Manber et al. (2000) • Only general information and high-level implications • Our study looks at quantitative aspects • Server performance • Web server performance for static content • Flash (Pai’98), SEDA (Welsh’01) • Server processing overhead for personalized content
Summary and Future Work • Characteristics of NYUHome • Document composition, personalization behavior, server-side overhead and network latency • Implications for dynamic content caching • Personalization functionality is increasingly accepted • Substantial benefits are likely by applying object composition technique for personalized content • Both server load and latency can be further reduced by prefetching the content of a small number of personalized channels • Client-perceived latencies can be reduced by specializing the document layout and content to the network connection • Next step: integrating with the CONCA prototype • ESI-based prototype is now running
Additional Information Moving to Wayne State University next week weisong@cs.wayne.edu