
Workload Characterization of a Personalized Web Site And Its Implication for Dynamic Content Caching

Weisong Shi, Randy Wright*, Eli Collins, and Vijay Karamcheti. Department of Computer Science, New York University. * NYUHome Team. http://www.cs.nyu.edu/~weisong/conca.html


Presentation Transcript


  1. Workload Characterization of a Personalized Web Site And Its Implication for Dynamic Content Caching Weisong Shi, Randy Wright*, Eli Collins, and Vijay Karamcheti Department of Computer Science New York University * NYUHome Team http://www.cs.nyu.edu/~weisong/conca.html

  2. Trends in Web Content Access • Rapid growth of traffic for dynamic and personalized content • Dynamic web services • E.g., My Yahoo!, MyCiti.com • Trickle-down effect for static web pages • Web caching and CDN • 50% of requests for dynamically generated content • Wolman/Voelker/Levy, SOSP’99 • 30% of requests carry cookies (indicates personalization) • Caceres/Douglis/Rabinovich, SIGMETRICS Server Perf. Workshop’98 • However, traditional web caching architectures do not work well with these trends

  3. Problem and Solution • Problem: How to efficiently generate/deliver dynamic and personalized content? • Solution: object composition technique • Basic idea: reuse at sub-document level • Quasi-static document template (e.g., ESI or XSL-FO) • Multiple objects with different characteristics • 60% of bytes of dynamic content can be reused (Shi’02, Wills’00) • Several projects • Server-side: DUP (Challenger’99) • Cache-side: HPP (Douglis’97), Content Assembly (Mikhailov/Wills’00), EdgeSuite (Akamai), Websphere (IBM) • The CONCA project is our effort • Reuse “sharable” portion of personalized content • Transcode content to suit client device and network connection
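
The object composition idea on this slide can be sketched in a few lines of Python. This is only an illustration, not the CONCA or ESI implementation: the `{{name}}` slot syntax, the `assemble` function, and the `generate` callback are all hypothetical names introduced here. Sharable fragments are generated once and reused from a cache; personalized fragments are regenerated on every request.

```python
import re
from typing import Callable, Dict, Set

# Hypothetical quasi-static template: {{name}} marks a slot for a named fragment.
TEMPLATE = "<html><body>{{news}}{{email}}</body></html>"

def assemble(template: str,
             cache: Dict[str, str],
             generate: Callable[[str], str],
             sharable: Set[str]) -> str:
    """Fill every slot, reusing cached copies of sharable fragments."""
    def fill(match: "re.Match") -> str:
        name = match.group(1)
        if name in sharable:
            # Sharable fragment: generate once, then serve from the cache.
            if name not in cache:
                cache[name] = generate(name)
            return cache[name]
        # Personalized fragment: always generated for this request.
        return generate(name)
    return re.sub(r"\{\{(\w+)\}\}", fill, template)
```

With `sharable={"news"}`, a second call with the same cache regenerates only the `email` fragment, which is the sub-document reuse the slide describes.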

  4. What is Missing? • Questions • Are object composition techniques in fact required, and are they likely to be beneficial? • What architecture is well suited for dynamic content caching? • To answer the questions, we need… • A better understanding of the characteristics of personalized content from both a server and client perspective • This study focuses on characterization of a personalized web site • Complements previous work • Analysis of the MSNBC web site (Padmanabhan and Qiu, 2000) • Analysis of an e-commerce site (Arlitt et al., 2001)

  5. Roadmap • Motivation • NYUHome • Trace gathering • Analysis of characteristics • Implications for dynamic content caching • Related work • Summary

  6. NYUHome • Portal for NYU students, faculty and staff (44,000 users) • Personalized web site • Tab-based design • Users choose channels and lay out the chosen channels as desired

  7. NYUHome: Tabs and Channels • Five tabs: HOME, ACADEMICS, RESEARCH, NEWS, FILES • Twenty channels

  8. Trace Gathering • Processing time (Tp) • Tp = T3 − T2 • Network latency (Tn) • Estimated by adding a blank pixel image to the response • T2 − T1 ≈ T5 − T4 • Tn = T5 − T3
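
A minimal sketch of how the two quantities on this slide could be computed from logged timestamps, reading them as Tp = T3 − T2 and Tn = T5 − T3. The slide's timing figure is not in the transcript, so the meaning attached to each timestamp in the comments is an assumption, and `RequestTimestamps` is a hypothetical record type, not part of the NYUHome instrumentation.

```python
from dataclasses import dataclass

@dataclass
class RequestTimestamps:
    # Assumed meanings; the original timing diagram is not in the transcript.
    t2: float  # server receives the page request
    t3: float  # server sends the response
    t5: float  # server receives the request for the blank pixel image

def processing_time(ts: RequestTimestamps) -> float:
    """Tp = T3 - T2: time the server spends producing the page."""
    return ts.t3 - ts.t2

def network_latency(ts: RequestTimestamps) -> float:
    """Tn = T5 - T3: response sent until the pixel request arrives back."""
    return ts.t5 - ts.t3
```

Both values use only server-side clocks, so no clock synchronization with the client is needed.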

  9. Roadmap • Motivation • NYUHome • Trace gathering • Analysis of characteristics • Implications for dynamic content caching • Related work • Summary

  10. Overall Characteristics • Two-week period (02/13/2002 to 02/28/2002) • Aggregate statistics • 643,853 total requests (1706 requests/hour) • 27,576 total users (62% of registered users) • 73,119 total distinct IP addresses

  11. Distribution of Requests • Classify the source IP addresses into 5 categories • Campus, Resnet, Dialup, Overseas, and Others • Findings • Machines on the NYU campus contribute 17% of IP addresses but 69% of requests • 83% of IP addresses fall outside NYU control • Grouped into 4183 network clusters (Krishnamurthy’00) • 60 clusters have more than 100 IPs

  12. Requests to Tabs • 90.1% to default HOME tab • Most users use NYUHome primarily for checking e-mail • Template occupies a significant portion (30% to 60%) • Agrees with other studies on dynamic content (Wills’00, Shi’02)

  13. Channel Characteristics

  14. Channel Size vs. Document Size • 99% of channels are less than 3000 bytes • Modeled well by Weibull distribution with • Agrees with our previous study on e-commerce sites • 70% of the documents lie in a small range (9,725, 10,688) • Popularity of HOME tab, and sizeable fraction of template size

  15. User Behavior: Session Characteristics • Number of requests per session • A session is defined as the set of requests sharing the same session key • 82.85% of sessions contain only one request • Inter-request time within a session • Average is 492.7 seconds, median is 92.9 seconds • Reason why persistent connections are disabled at NYUHome • Captured very well by a Lognormal distribution
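
The session statistics above can be derived from a request log roughly as follows. This is a sketch under assumptions: the log is taken to be a sequence of `(session_key, timestamp_seconds)` pairs, and `session_stats` is a hypothetical helper, not the authors' analysis script.

```python
from collections import defaultdict

def session_stats(log):
    """Return (inter-request gaps in seconds, fraction of one-request sessions).

    log: iterable of (session_key, timestamp_seconds) pairs, in any order.
    """
    by_session = defaultdict(list)
    for key, timestamp in log:
        by_session[key].append(timestamp)

    gaps = []
    single_request_sessions = 0
    for stamps in by_session.values():
        stamps.sort()
        if len(stamps) == 1:
            single_request_sessions += 1
        # Gaps between consecutive requests of the same session.
        gaps.extend(later - earlier for earlier, later in zip(stamps, stamps[1:]))

    fraction = single_request_sessions / len(by_session) if by_session else 0.0
    return gaps, fraction
```

The mean and median of the returned gaps correspond to the inter-request figures on the slide.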

  16. User Behavior: Client Popularity • Relation between user rank and the number of requests • A user's rank is based on the number of requests he/she issues • If client popularity follows a Zipf-like distribution, the log-log plot should appear linear with a slope near −α • α = 0.35 for the top 2000 users
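
One common way to estimate the Zipf exponent mentioned on this slide is an ordinary least-squares fit of log(request count) against log(rank); the negated slope is the estimate of α. The transcript does not say which fitting method the authors used, so the function below (`zipf_alpha`, a hypothetical name) is only one reasonable sketch.

```python
import math

def zipf_alpha(request_counts, top=None):
    """Estimate alpha as minus the slope of log(count) versus log(rank).

    request_counts: per-user request totals (needs at least two positive values).
    top: if given, fit only the `top` most active users.
    """
    counts = sorted((c for c in request_counts if c > 0), reverse=True)
    if top is not None:
        counts = counts[:top]
    xs = [math.log(rank) for rank in range(1, len(counts) + 1)]
    ys = [math.log(count) for count in counts]
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return -sxy / sxx
```

Calling `zipf_alpha(counts, top=2000)` restricts the fit to the most active users, matching the "top 2000 users" scope on the slide.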

  17. User Behavior: Personalization • Calculate the total number of channel combinations • Default vs. personalized users • Counted the number of distinct users who used a particular channel combination for each tab

  18. User Behavior: Personalization • Percentage of requests to different channel combinations • Observation: a significant percentage differs only in layout

  19. Request Processing Cost • Apache 1.3 on 12-processor Sun E10000 (399MHz) • Average Tp = 1.41s • Relationship with server load • Observation: average Tp is independent of load • Inherent overhead of dynamic generation of personalized content

  20. Request Processing Cost: A Closer Look • Correlation coefficient between • Tp and the number of channels: 0.98 • which explains the lower Tp of ACADEMICS and RESEARCH • Tp and document size: 0.04 • A simple model of processing overhead, where Tc is the time to obtain content from the cache, Tg the time to generate content synchronously, and Ta the time to assemble the content into a document • Relationship: Tc+Ta = 0.32s, Tg+Ta = 0.52s • which means generating incurs an additional 0.2 seconds
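
The correlation coefficients quoted on this slide are presumably Pearson coefficients; the transcript does not say so explicitly, so treat that as an assumption. A minimal sketch of the computation, with `pearson` as a hypothetical helper name:

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)
```

Applied to per-request pairs of (number of channels, Tp), a value near 1 would indicate that processing time grows almost linearly with the channel count, while a value near 0 for (document size, Tp) would indicate no linear relationship.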

  21. Network Latency and Throughput • Average Tn = 2.45s, 15% of requests require more than 5s • Both latency and throughput are captured well by a Lognormal distribution • Agrees with Balakrishnan et al.'s study of the 1996 Atlanta Olympic traces • Correlation coefficient between Tn and document size • -0.0031, but strong correlation after categorization • Diversity of network connections • Two LAN-like, one WAN-like, one phone modem, and others • Access via NYU Dialup is 7 times slower than access from Resnet

  22. Roadmap • Motivation • NYUHome • Trace gathering • Analysis of characteristics • Implications for dynamic content caching • Related work • Summary

  23. Implications for Dynamic Content Caching • Need for efficient delivery of personalized content • 30% of users are using personalization • Larger if we count the “email checker phenomenon” • Increased server overheads and larger network latencies • Potential benefits from using the object composition technique • Advocate caching of channel content at proxy caches or surrogates • 6 of 11 channels larger than 1K bytes are sharable • 30% to 60% contributed by quasi-static template • Taking the HOME tab as an example, 96% potential bandwidth saving • Only the Email channel needs to be fetched
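
The bandwidth-saving estimate on this slide amounts to the fraction of page bytes that a proxy or surrogate can serve from its cache. A minimal sketch of that arithmetic follows; `bandwidth_saving` is a hypothetical helper, and any sizes passed to it in examples are made up for illustration, not NYUHome measurements.

```python
def bandwidth_saving(template_bytes, channel_bytes, must_fetch):
    """Fraction of a page's bytes that can be served from cache.

    template_bytes: size of the quasi-static template (assumed cacheable).
    channel_bytes:  dict mapping channel name to its size in bytes.
    must_fetch:     names of channels that cannot be reused from the cache.
    """
    total = template_bytes + sum(channel_bytes.values())
    fetched = sum(size for name, size in channel_bytes.items()
                  if name in must_fetch)
    return 1 - fetched / total
```

For instance, with an invented 4000-byte template and channels `{"email": 500, "news": 3000, "weather": 2500}`, fetching only `email` leaves all but 500 of 10,000 bytes to be served locally.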

  24. Implications for Dynamic Content Caching • Benefits from proxy prefetching and/or server pushing • Sizeable fraction (40%) of personalized channels • Solution: Server pushing or proxy prefetching • Long inter-request interval within a session allows more sophisticated prefetching policies • Benefits from predicting access patterns • Conflicting demand between prefetching and personalization needs to be reconciled • Zipf-like client popularity allows us to predict a small group • Need for customizing content based on network connection • Solution: different default layouts and channel content

  25. Related Work • Workload characterization • A lot of previous work for static web content • Few on dynamic web content • Analysis of MSNBC by Padmanabhan and Qiu (2000) • Analysis of a large shopping site by Arlitt et al. (2001) • Sub-document level and instrumented logs • Personalization • My Yahoo! user experience analysis by Manber et al. (2000) • Only general information and high-level implications • Our study looks at quantitative aspects • Server performance • Web server performance for static content • Flash (Pai’98), SEDA (Welsh’01) • Server processing overhead for personalized content

  26. Summary and Future Work • Characteristics of NYUHome • Document composition, personalization behavior, server-side overhead and network latency • Implications for dynamic content caching • Personalization functionality is increasingly accepted • Substantial benefits are likely by applying object composition technique for personalized content • Both server load and latency can be further reduced by prefetching the content of a small number of personalized channels • Client-perceived latencies can be reduced by specializing the document layout and content to the network connection • Next step: integrating with the CONCA prototype • ESI-based prototype is now running

  27. Additional Information Moving to Wayne State University next week weisong@cs.wayne.edu
