Empirical Quantification of Opportunities for Content Adaptation in Web Servers. Michael Gopshtein and Dror Feitelson, School of Engineering and Computer Science, The Hebrew University of Jerusalem. Supported by a grant from the Israel Internet Association.
Capacity Planning • Daily cycle of activity [Figure: capacity vs. time, with regions labeled "utilized capacity" and "wasted capacity"]
Capacity Planning • Flash crowd [Figure: capacity vs. time]
Capacity Planning • The problem: • Required capacity for flash crowds cannot be anticipated in advance • Even capacity for daily fluctuations is highly wasteful • Academic solution: use admission control • Business practice: unacceptable to reject any clients • Especially in cases of surge in traffic
Content Adaptation • Trade off quality for throughput • Installed capacity matches normal load • Handle abnormal load by reducing quality • But still manage to provide meaningful service to all clients • Assumes normal optimizations have been made already • Compress or combine images, promote caching, … • Empirically this usually is not the case
Content Adaptation • Low load [Figure: a few smiley icons]
Content Adaptation • High load [Figure: many smiley icons]
Content Adaptation • Maintain the invariant: • Need to change quality (and cost!) of content • Prepare multiple versions in advance
The Questions • What are the main costs in web service? • Is the bottleneck CPU, network, or disk? • What do we gain by eliminating HTTP requests? • What do we gain by reducing file sizes? • What can realistically be done? • What is the structure of a “random” site? • How much can we reduce quality? • Assumption: static web pages only
Measuring Random Web Sites • http://en.wikipedia.org/wiki/Special:Random • Use title of page as input to Google search • Extract domain of first link to get home page • Retrieve it using IE • Collect statistical data by intercepting system calls to send and receive
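The "extract domain of first link" step can be sketched as follows. This is only an illustration: the helper name `home_page_of` is hypothetical, and the slides do not say how the domain was actually extracted.

```python
from urllib.parse import urlsplit


def home_page_of(link: str) -> str:
    """Reduce a search-result link to the home page of its site (scheme + host)."""
    parts = urlsplit(link)
    if not parts.scheme or not parts.netloc:
        raise ValueError(f"not an absolute URL: {link!r}")
    # Drop path, query, and fragment; keep only scheme and host.
    return f"{parts.scheme}://{parts.netloc}/"
```

For example, a first search result of `http://www.example.org/a/b.html?x=1` would lead to retrieving `http://www.example.org/`.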
Retrieved Component Sizes • A quarter of the total data comes from components larger than 200 KB • These are only 0.02% of the components
Download Times Download time (and bandwidth requirements) roughly proportional to image size
Network Bandwidth • Typical Ethernet packets are 1526 bytes • Ethernet and TCP/IP headers require 54 bytes • HTTP response headers require 280-325 bytes • Most components fit into a few packets • 43% fit into a single packet • 24% more fit into 2 packets • Save bandwidth by reducing the number of small components or the size of large components
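A rough sketch of the packet arithmetic behind these figures. The constants are assumptions, not measurements from the slides: a TCP payload of 1460 bytes per packet (a 1500-byte IP MTU minus 20-byte IP and 20-byte TCP headers) and a 300-byte HTTP response header, which lies inside the 280-325 byte range quoted above.

```python
import math

MSS_BYTES = 1460          # assumption: TCP payload per packet for a 1500-byte MTU
HTTP_HEADER_BYTES = 300   # assumption: within the 280-325 byte range on the slide


def packets_needed(body_bytes, header_bytes=HTTP_HEADER_BYTES, mss=MSS_BYTES):
    """Number of data packets needed to send one HTTP response."""
    # Even an empty body still needs one packet for the response headers.
    return max(1, math.ceil((body_bytes + header_bytes) / mss))
```

Under these assumptions a component of up to 1160 bytes fits into a single packet, so shrinking an already small component saves little; removing it saves a whole packet plus the request.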
Locality and Caching • Flash crowds typically involve a very small number of pages (possibly the home page) • Servers allocate gigabytes of memory for cache • This is enough for thousands of files • Disk is not expected to be a bottleneck
CPU Overhead • CPU usage reflects several activities • Opening TCP connection • Processing request • Sending data • Measure using combinatorial microbenchmarks • Open connection only • One extremely large file • Many small files • Many requests for non-existent file
CPU Overhead • Example: single 10KB file • Equal processing and transfer at 240KB • Only 0.3% of files are that big • If the CPU is the bottleneck, need to reduce the number of requests
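One way to read the 240KB break-even point is through a simple linear cost model. The model is an assumption made for illustration, not something stated on the slide: per-request CPU cost is a fixed processing cost plus a transfer cost proportional to file size, and the two are equal at 240KB (taken here as 240 × 1024 bytes).

```python
BREAK_EVEN_BYTES = 240 * 1024  # from the slide: processing cost equals transfer cost here


def transfer_share(size_bytes, break_even=BREAK_EVEN_BYTES):
    """Fraction of per-request CPU spent sending data, under the assumed linear model.

    Fixed processing cost is taken as equal to the cost of sending `break_even`
    bytes, so the share is size / (size + break_even).
    """
    return size_bytes / (size_bytes + break_even)
```

Under this model, `transfer_share(10 * 1024)` is 0.04: for the 10KB example file about 96% of the CPU cost is fixed per-request work, which is why eliminating requests helps more than shrinking small files.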
Guidelines • Either CPU or network is the bottleneck • Network bandwidth saved by reducing large components • CPU saved by eliminating small components • Maintaining “acceptable” quality is subjective
Eliminating Images • Images have many functions • Story (main illustrative item) • Preview (for other page) • Commercial • Logo • Decoration (bullets, background) • Navigation (buttons, menus) • Text (special formatting) • Some can be eliminated or replaced
Distribution of Types • Manually classified 959 images from 30 random sites • 50% decoration • 18% preview • 11% commercial • 6% logo • 6% text
Automatic Identification • Decorations are candidates for elimination • Identified by combination of attributes: • Uses GIF format • Appears in HTML tags other than <IMG> • Appears multiple times in same page • Small original size • Displayed size much bigger than original • Large change in aspect ratio when displayed
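The attribute list above could be combined into a simple score, sketched below. The slide does not give thresholds or say how the attributes are weighted, so every number here (the 32×32 size limit, the 4× area growth, the 2× aspect-ratio change, the minimum score of 3) and the names `ImageUse`, `decoration_score`, and `is_decoration` are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class ImageUse:
    fmt: str            # file format, e.g. "gif" or "jpeg"
    in_img_tag: bool    # True if referenced from an <IMG> tag
    occurrences: int    # times the image appears in the page
    orig_w: int         # original size in pixels
    orig_h: int
    shown_w: int        # displayed size in pixels
    shown_h: int


def decoration_score(img: ImageUse) -> int:
    """Count how many of the six decoration attributes an image exhibits."""
    orig_area = max(img.orig_w * img.orig_h, 1)
    shown_area = img.shown_w * img.shown_h
    orig_ratio = max(img.orig_w, 1) / max(img.orig_h, 1)
    shown_ratio = max(img.shown_w, 1) / max(img.shown_h, 1)
    ratio_change = max(orig_ratio / shown_ratio, shown_ratio / orig_ratio)
    checks = [
        img.fmt.lower() == "gif",
        not img.in_img_tag,
        img.occurrences > 1,
        orig_area <= 32 * 32,          # hypothetical "small" threshold
        shown_area >= 4 * orig_area,   # hypothetical "much bigger" threshold
        ratio_change >= 2.0,           # hypothetical "large change" threshold
    ]
    return sum(checks)


def is_decoration(img: ImageUse, min_score: int = 3) -> bool:
    """Flag an image as a decoration candidate (min_score is a hypothetical cutoff)."""
    return decoration_score(img) >= min_score
```

A repeated 8×8 GIF bullet used as a CSS background scores 4 and is flagged, while a single 800×600 JPEG in an `<IMG>` tag scores 0.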
Image Size Distribution [Figure: size distributions of commercial, preview, and decoration images]
Auxiliary Files • JavaScript • May be crucial for page function • Impossible to understand automatically • CSS (style sheets) • May be crucial for page structure • May be possible to identify those parts that are used
Auxiliary Files • Cannot be eliminated • Common wisdom: use separate files • Allow caching at client • Save retransmission with each page • Alternative: embed in HTML • Reduce number of requests • May be better for flash crowds that do not request multiple pages
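A minimal sketch of the "embed in HTML" alternative, assuming the auxiliary files are already available in memory as a mapping from URL to text. The function name `embed_aux_files` is hypothetical, and the regular expressions only cover simple, well-formed `<link>` and `<script src>` tags; a real implementation would need a proper HTML parser.

```python
import re

LINK_RE = re.compile(r'<link\b[^>]*\brel=["\']stylesheet["\'][^>]*>', re.IGNORECASE)
HREF_RE = re.compile(r'\bhref=["\']([^"\']+)["\']', re.IGNORECASE)
SCRIPT_RE = re.compile(
    r'<script\b[^>]*\bsrc=["\']([^"\']+)["\'][^>]*>\s*</script>', re.IGNORECASE
)


def embed_aux_files(html, files):
    """Inline external CSS and JavaScript whose URLs appear in `files`."""

    def inline_css(match):
        href = HREF_RE.search(match.group(0))
        if href and href.group(1) in files:
            return f"<style>{files[href.group(1)]}</style>"
        return match.group(0)  # unknown stylesheet: leave the tag untouched

    def inline_js(match):
        src = match.group(1)
        if src in files:
            return f"<script>{files[src]}</script>"
        return match.group(0)  # unknown script: leave the tag untouched

    return SCRIPT_RE.sub(inline_js, LINK_RE.sub(inline_css, html))
```

Each inlined file removes one HTTP request at the price of a larger, less cacheable page, which matches the trade-off described for flash crowds that request a single page.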
Text and HTML • Some areas may be eliminated under extreme conditions • Commercials • Some previews and navigation options • Often encapsulated in <DIV> tags • Sometimes identified by ID or class names, e.g. “sidebanner” • Especially when using modular design
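Removing blocks identified by ID or class name could look roughly like the sketch below, which uses Python's standard `html.parser`. It is an illustration under simplifying assumptions: only `<div>` elements are dropped, HTML comments are discarded, and the class `BlockStripper` and function `strip_blocks` are hypothetical names.

```python
from html.parser import HTMLParser


class BlockStripper(HTMLParser):
    """Re-emit HTML while dropping <div> blocks whose id or class is listed."""

    def __init__(self, drop_names):
        super().__init__(convert_charrefs=False)
        self.drop_names = set(drop_names)
        self.out = []
        self.skip_depth = 0  # > 0 while inside a dropped <div>

    def handle_starttag(self, tag, attrs):
        if self.skip_depth:
            if tag == "div":
                self.skip_depth += 1  # nested div: remember to skip its </div> too
            return
        attr = dict(attrs)
        names = set((attr.get("class") or "").split())
        if attr.get("id"):
            names.add(attr["id"])
        if tag == "div" and names & self.drop_names:
            self.skip_depth = 1
            return
        self.out.append(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        if not self.skip_depth:
            self.out.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if self.skip_depth:
            if tag == "div":
                self.skip_depth -= 1
            return
        self.out.append(f"</{tag}>")

    def handle_data(self, data):
        if not self.skip_depth:
            self.out.append(data)

    def handle_entityref(self, name):
        if not self.skip_depth:
            self.out.append(f"&{name};")

    def handle_charref(self, name):
        if not self.skip_depth:
            self.out.append(f"&#{name};")

    def handle_decl(self, decl):
        self.out.append(f"<!{decl}>")


def strip_blocks(html, drop_names):
    """Return `html` without the <div> blocks named in `drop_names`."""
    parser = BlockStripper(drop_names)
    parser.feed(html)
    parser.close()
    return "".join(parser.out)
```

Calling `strip_blocks(page, {"sidebanner"})` removes a `<div class="sidebanner">` block together with everything nested inside it, and leaves the rest of the page as it was.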
Content Adaptation • Degraded content usually better than exclusion • Only way to handle flash crowds that overwhelm installed capacity • Empirical results identify main options • Identify and eliminate decorations • Compress large images (story, commercial) • Embed JavaScript and CSS • Hide unnecessary blocks
Next Paper Preview • Implementation in Apache • Monitor CPU utilization and idle threads to switch between modes • Use mod_rewrite to redirect URLs to adapted content • Achieve up to a 10x increase in throughput for extreme adaptation
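The mode switch and URL redirection can be illustrated with a small sketch. It is not the Apache implementation: the thresholds, the two mode names, the rule that either signal alone triggers adaptation, and the `/adapted` URL prefix are all assumptions made for the example.

```python
def choose_mode(cpu_util, idle_threads, cpu_high=0.9, min_idle=5):
    """Pick a serving mode from current load (thresholds are hypothetical)."""
    overloaded = cpu_util >= cpu_high or idle_threads <= min_idle
    return "adapted" if overloaded else "normal"


def rewrite_url(path, mode, prefix="/adapted"):
    """Map a request path to its pre-generated adapted version when overloaded."""
    if mode == "adapted" and not path.startswith(prefix + "/"):
        return prefix + path
    return path
```

With these assumptions, a request for `/index.html` at 95% CPU utilization is served from `/adapted/index.html`, while the same request under normal load is left unchanged.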