
Web Caching

Web Caching. Elliot Jaffe. Presentation for The Seminar on Database and Internet, Hebrew University, Fall 2002.


Presentation Transcript


  1. Web Caching Elliot Jaffe Presentation for The Seminar on Database and Internet Hebrew University, Fall 2002

  2. Agenda • Caching: Why, Where, How, What • Some empirical data: Zipf’s Law • Content Delivery Networks • Bibliography

  3. Why cache? • Number of unique pages: 800M < X < 2.2B • Number of unique web sites: 8,500,000 • Static pages: 30% - 40% • Pages revisited: 80% • Expected hit-rate: 24% - 32%
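The expected hit-rate on the slide appears to be the product of the static-page share and the revisit share. The short sketch below reproduces that arithmetic; the multiplication rule and the variable names are assumptions, since the slide only lists the numbers.

```python
# Figures are taken from the slide; combining them by multiplication is an assumption.
static_fraction = (0.30, 0.40)  # share of pages that are static (low, high)
revisit_fraction = 0.80         # share of pages that are revisited


def expected_hit_rate(static_share: float, revisit_share: float) -> float:
    """Fraction of requests that are revisits to static (cacheable) pages."""
    return static_share * revisit_share


low, high = (expected_hit_rate(s, revisit_fraction) for s in static_fraction)
print(f"expected hit rate: {low:.0%} - {high:.0%}")
```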

  4. Why cache? • Bandwidth • Latency • Performance = Response Time • Server Load • Failure Redundancy

  5. Where [Diagram: cache placement across the network. Labels: Browser cache, Intranet cache, Local ISP cache, cdn, Data Center ISP, L4 Switch, Reverse Proxy, Content Server]

  6. Hot-potato routing • Get traffic off of your network as soon as possible • Bounces traffic around the internet • Increases chance of dropped packets • Increases latency [Diagram labels: "You are here", "Destination"]

  7. How: Types of Caches • Simple Proxy • Transparent Proxy • Reverse Proxy • Adaptive Caching • Push Caching • Active Caching • Streaming Caches

  8. How: Simple Proxy • Harvest/Squid • Provide web content for a fixed user base • Standalone operation • May be transparent • Commodity product/technology • Easy to get 90% correct

  9. How: Transparent Proxy • No client configuration • Violates end-to-end paradigm • Client thinks it is talking directly to server • Server thinks it is talking to cache • Implemented as • Pass-through unit • L4 switch

  10. How: Reverse Proxy • Designed to offload duties from one or more specific servers • Data size is limited to size of static content on the server • Challenge is fast, disk-less operation • Cache consistency is easy • Single point of failure

  11. How: Adaptive Caching • ISP Level caching • Cooperating multiple distributed caches • Operate as a cache-mesh based on content demand • Multicast for group membership (GCS) • Content Routing Protocol sends request to the appropriate cache within the mesh

  12. How: Push Caching • Send the data out proactively • Content Delivery Networks • Paid for by data providers • More on this later!

  13. How: Active Caching • Use an applet inside of the cache to customize dynamic pages on the fly • How do you identify dynamic pages? • Where does the custom data come from? • Who is going to pay for this service?

  14. How: Streaming Caches • What about streaming content? • Movies • Audio • Proprietary streaming protocols • Challenge is to maintain quality of content and service • Who pays for this?

  15. What: Content and Protocols • Mostly Static Content • HTML • XML • GIF • AVI • EXE • Etc.

  16. What: Content and Protocols • HTTP 1.0 Basic protocol • Send request based on a fixed number of verbs • GET • HEAD • POST • Receive response, meta-data, content

  17. What: Content and Protocols • HTTP Request
      Request = Simple-Request | Full-Request
      Simple-Request = "GET" SP Request-URI CRLF
      Full-Request = Request-Line *( General-Header | Request-Header | Entity-Header ) CRLF [ Entity-Body ]

  18. What: Content and Protocols • Example:
      GET /pub/www/index.html HTTP/1.0
      • Response:
      HTTP/1.1 200 OK
      Server: Microsoft-IIS/5.0
      Date: Sat, 19 Oct 2002 05:46:53 GMT
      Expires: Sun, 20 Oct 2002 16:00:00 GMT
      Content-Length: 2291
      Content-Type: text/html
      Cache-control: private

  19. What: Content and Protocols • Example “if-modified-since”:
      GET /pub/www/index.html HTTP/1.0
      If-Modified-Since: Sat, 19 Oct 2002 19:43:31 GMT
      • Response:
      HTTP/1.1 200 OK
      Server: Microsoft-IIS/5.0
      Date: Thu, 13 Jul 2000 05:46:53 GMT
      Expires: Sun, 20 Oct 2002 16:00:00 GMT
      Content-Length: 2291
      Content-Type: text/html
      Cache-control: private

  20. What: Content and Protocols • Example “if-modified-since”:
      GET /pub/www/index.html HTTP/1.0
      If-Modified-Since: Sat, 19 Oct 2002 19:43:31 GMT
      • Response:
      HTTP/1.1 304 Not Modified
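The two "if-modified-since" examples differ only in whether the page changed after the date the client sent. A minimal sketch of that server-side decision follows; the function name and the sample modification times are assumptions made for illustration, and only the standard library is used.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime


def conditional_get_status(if_modified_since: str, last_modified: datetime) -> int:
    """Return 304 if the page has not changed since the client's copy, else 200."""
    client_copy_time = parsedate_to_datetime(if_modified_since)
    return 304 if last_modified <= client_copy_time else 200


# Header value copied from the slide; the modification times are hypothetical.
header = "Sat, 19 Oct 2002 19:43:31 GMT"
unchanged = datetime(2002, 10, 19, 5, 46, 53, tzinfo=timezone.utc)
changed = datetime(2002, 10, 20, 8, 0, 0, tzinfo=timezone.utc)

print(conditional_get_status(header, unchanged))  # page older than client's copy
print(conditional_get_status(header, changed))    # page modified afterwards
```

A 304 response carries no body, so the cache serves its stored copy and only the headers cross the network.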

  21. Basic caching algorithm Pages may be • Fresh: up-to-date • Expired: current date > expiration date • Stale: “old”

  22. Basic caching algorithm - #2
      If (page is in the cache)
          If (page is expired or stale)
              Get from server with if-modified-since
              If not modified, get from cache
          Else
              Get from cache
      Else
          Get from server
      [Slide label: Soft Miss]
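The lookup logic on this slide can be sketched as a small function. This is an illustration under stated assumptions, not the presenter's implementation: the `Entry` type, the `MAX_AGE` staleness threshold, the `fetch` and `revalidate` callbacks, and the reading of "Soft Miss" as "revalidated and found unmodified" are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Optional, Tuple


@dataclass
class Entry:
    body: bytes
    stored_at: float   # seconds; when the copy was fetched or last validated
    expires_at: float  # seconds; taken from the Expires header

MAX_AGE = 3600.0  # assumed staleness threshold; the slides only say "old"


def needs_validation(entry: Entry, now: float) -> bool:
    expired = now > entry.expires_at
    stale = now - entry.stored_at > MAX_AGE
    return expired or stale


def lookup(
    cache: Dict[str, Entry],
    url: str,
    now: float,
    fetch: Callable[[str], Entry],
    revalidate: Callable[[str, Entry], Optional[Entry]],
) -> Tuple[bytes, str]:
    entry = cache.get(url)
    if entry is None:
        entry = fetch(url)  # not in the cache: get from server
        cache[url] = entry
        return entry.body, "miss"
    if needs_validation(entry, now):
        fresh = revalidate(url, entry)  # conditional GET; None means 304
        if fresh is None:
            # Not modified: serve the cached copy. A real cache would also
            # refresh the expiry from the 304 response headers.
            entry.stored_at = now
            return entry.body, "soft miss"
        cache[url] = fresh
        return fresh.body, "miss"
    return entry.body, "hit"
```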

  23. Basic caching algorithm - #3 If cache has space Store the file Else • Delete expired from cache • Delete stale from cache • Delete LRU from cache • Delete largest/smallest from cache?
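The replacement steps above can be sketched with an ordered dictionary. This is a minimal sketch, assuming a byte-sized capacity and only two of the listed steps (delete expired, then delete LRU); the class and method names are hypothetical, and the "largest/smallest" policy the slide leaves open is not implemented.

```python
from collections import OrderedDict
from typing import Optional


class SmallCache:
    """Byte-bounded cache: evicts expired entries first, then least recently used."""

    def __init__(self, capacity_bytes: int) -> None:
        self.capacity = capacity_bytes
        self.used = 0
        # url -> (body, expires_at); least recently used entries come first.
        self.items: "OrderedDict[str, tuple]" = OrderedDict()

    def _remove(self, url: str) -> None:
        body, _expires_at = self.items.pop(url)
        self.used -= len(body)

    def get(self, url: str) -> Optional[bytes]:
        if url not in self.items:
            return None
        self.items.move_to_end(url)  # mark as most recently used
        return self.items[url][0]

    def put(self, url: str, body: bytes, expires_at: float, now: float) -> bool:
        if len(body) > self.capacity:
            return False  # the file can never fit
        if url in self.items:
            self._remove(url)
        # Step 1: delete expired entries while space is still short.
        for key in [k for k, (_b, exp) in self.items.items() if exp < now]:
            if self.used + len(body) <= self.capacity:
                break
            self._remove(key)
        # Step 2: delete least recently used entries until the file fits.
        while self.used + len(body) > self.capacity:
            self._remove(next(iter(self.items)))
        self.items[url] = (body, expires_at)
        self.used += len(body)
        return True
```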

  24. Agenda • Caching: Why, Where, How, What • Some empirical data: Zipf’s Law • Content Delivery Networks • Bibliography

  25. Zipf’s law • Zipf’s law: The frequency of an event P as a function of rank i is a power law function: $P_i = \Omega / i^{\alpha}$, where $\alpha \le 1$
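To make the formula concrete, the sketch below computes $P_i$ for a finite set of ranked items. It assumes that $\Omega$ is the constant that makes the probabilities sum to 1, which the slide does not state; the function name and the sample parameters are illustrative.

```python
from typing import List


def zipf_probabilities(n: int, alpha: float) -> List[float]:
    """Return P_i = omega / i**alpha for ranks i = 1..n, normalized to sum to 1."""
    weights = [1.0 / (i ** alpha) for i in range(1, n + 1)]
    omega = 1.0 / sum(weights)  # assumed normalizing constant
    return [omega * w for w in weights]


probabilities = zipf_probabilities(1000, 0.75)
print(f"rank 1: {probabilities[0]:.4f}, rank 2: {probabilities[1]:.4f}")
```

Whatever the value of $\Omega$, the ratio between rank 1 and rank 2 is $2^{\alpha}$, so a smaller $\alpha$ means a flatter popularity curve.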

  26. Zipf’s law • Observed to be true for • Frequency of written words in English texts • Population of cities • Income of a company as a function of rank

  27. Zipf’s law and web access • For a given server, page access by rank follows Zipf’s law • Web requests from a fixed population of users follow Zipf’s law, with 0.64 < α < 0.83

  28. Observations • Top 1% of all documents account for 20% - 35% of proxy requests • Top 10% account for 45% - 55% of requests • It takes 25% to 40% of all documents to account for 70% of requests • It takes 70% to 80% of all documents to account for 90% of requests

  29. Observations

  30. Observations • For an infinite sized cache, the hit-ratio for a web-proxy grows in a log-like fashion as a function of the client population of the proxy and the number of requests seen by the proxy.

  31. Observations • The hit-ratio of a web cache grows in a log-like fashion as a function of the cache size.
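A rough way to explore how hit-ratio depends on cache size is to compute the share of requests that go to the most popular documents under a Zipf distribution. The sketch below is a simplified model, not a reproduction of the measured results: it assumes independent requests, equally sized documents, and a cache that always holds the top-ranked documents. The function name and parameters are hypothetical.

```python
def ideal_hit_ratio(n_docs: int, cache_docs: int, alpha: float) -> float:
    """Share of requests served by a cache holding the cache_docs most popular documents."""
    weights = [i ** -alpha for i in range(1, n_docs + 1)]
    return sum(weights[:cache_docs]) / sum(weights)


# Each tenfold increase in cache size buys a smaller relative gain.
for size in (100, 1_000, 10_000):
    print(size, round(ideal_hit_ratio(100_000, size, 0.75), 3))
```

Under these assumptions the curve shows diminishing returns as the cache grows, which is the practical point of the observation, although the exact shape depends on the model.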

  32. Observations Locality of Reference • The probability that a document will be referenced k requests after it was last referenced is roughly proportional to 1/k.

  33. Observations - NOT • There is very little correlation between access frequency and document size • There is no correlation between access frequency and the change rate of a document • No single web server contributes to most of the popular pages

  34. Zipf’s Law and Caching Discussion • How does this help in cache design? • Are there any business implications?

  35. Agenda • Caching: Why, Where, How, What • Some empirical data: Zipf’s Law • Content Delivery Networks • Bibliography

  36. CDN • “Traditional” CDN • Dirty Secrets • P2P content delivery systems

  37. Why use a CDN? [Diagram: cache placement across the network. Labels: Browser cache, Intranet cache, Local ISP cache, cdn, Data Center ISP, L4 Switch, Reverse Proxy, Content Server]

  38. What is a CDN? • Content Delivery Networks = PUSH • PUSH = Prefetch

  39. CDN Mechanisms • DNS redirection • Complete • Partial • URL rewrite

  40. Network Model: A DNS-redirecting CDN [Diagram labels: Client, DNS redirector, Original server, HTTP servers A, B, and C; the client looks up example.com and sends GET http://example.com/foo] Slide originally from http://www.iwcw.org/2000/Proceedings/S4/S4-1.ppt

  41. CDN DNS Full Redirection • (Semi)automatic mechanism to replicate the original site on CDN servers • Replace the original DNS entry with an enhanced DNS server that uses knowledge of network and server load to direct clients to an appropriate CDN server • TTLs on DNS entries are very short • Adero, NetCaching, IntelliDNS
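The "enhanced DNS server" can be pictured as a function that picks a replica from load and network estimates and returns it with a short TTL. The sketch below is purely illustrative: the `Replica` type, the scoring formula, the load cutoff, the 20-second TTL, and the addresses (taken from the documentation range 192.0.2.0/24) are all assumptions, not how any named vendor works.

```python
from dataclasses import dataclass
from typing import Sequence, Tuple


@dataclass(frozen=True)
class Replica:
    address: str
    load: float    # 0.0 idle .. 1.0 saturated
    rtt_ms: float  # estimated round-trip time to the client's resolver


def pick_replica(replicas: Sequence[Replica], max_load: float = 0.9) -> Replica:
    """Prefer nearby replicas, penalize loaded ones, skip overloaded ones."""
    candidates = [r for r in replicas if r.load < max_load] or list(replicas)
    return min(candidates, key=lambda r: r.rtt_ms * (1.0 + r.load))


def answer(replicas: Sequence[Replica], ttl_seconds: int = 20) -> Tuple[str, int]:
    """Return (address, TTL); the short TTL lets the choice be revised quickly."""
    return pick_replica(replicas).address, ttl_seconds


replicas = [
    Replica("192.0.2.1", load=0.2, rtt_ms=80.0),
    Replica("192.0.2.2", load=0.95, rtt_ms=10.0),  # closest, but overloaded
    Replica("192.0.2.3", load=0.5, rtt_ms=30.0),
]
print(answer(replicas))
```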

  42. CDN DNS Partial Redirection • Statically modify selected URLs within pages to point to the CDN service • Replicate selected objects to the CDN service • Redirect clients of selected URLs using an enhanced DNS server that uses knowledge of network and server load • Akamai, Digital Island, MirrorImage, SolidSpeed, Speedera

  43. CDN rewrite • Modify pages at the origin server on the fly • Change embedded URLs based on up-to-date knowledge of the network and CDN server loads • Does not require additional DNS lookups • Fasttide, Clearway
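Both partial redirection and URL rewrite come down to changing embedded URLs so that selected objects are fetched from the CDN. A minimal sketch follows, assuming a fixed CDN host name and a suffix list for "selected objects"; `cdn.example.net`, the suffix list, and the regular expression are hypothetical, and a real rewriter would pick the target from current network and load data.

```python
import re
from urllib.parse import urlsplit, urlunsplit

CDN_HOST = "cdn.example.net"  # hypothetical CDN host name
STATIC_SUFFIXES = (".gif", ".jpg", ".png", ".css", ".js")  # assumed selection rule
URL_ATTRIBUTE = re.compile(r'(src|href)="([^"]+)"')


def rewrite_urls(html: str, origin_host: str, cdn_host: str = CDN_HOST) -> str:
    """Point static objects hosted on origin_host at cdn_host; leave other URLs alone."""

    def replace(match: "re.Match[str]") -> str:
        attribute, url = match.group(1), match.group(2)
        parts = urlsplit(url)
        if parts.netloc == origin_host and parts.path.lower().endswith(STATIC_SUFFIXES):
            url = urlunsplit(parts._replace(netloc=cdn_host))
        return f'{attribute}="{url}"'

    return URL_ATTRIBUTE.sub(replace, html)


page = '<a href="http://example.com/index.html"><img src="http://example.com/logo.gif"></a>'
print(rewrite_urls(page, "example.com"))
```

The HTML page itself still comes from the origin server; only the embedded image is redirected.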

  44. Measuring a CDN’s performance • Two papers • K. L. Johnson, J. F. Carr, M. S. Day, and M. F. Kaashoek, “The measured performance of content distribution networks,” in Proceedings of the 5th International Web Caching Workshop and Content Delivery Workshop (Lisbon, Portugal), May 2000. • B. Krishnamurthy, C. Wills, Y. Zhang, “On the Use and Performance of Content Distribution Networks,” in ACM SIGCOMM Internet Measurement Workshop, 2001.

  45. The measured performance of content distribution networks: Client Actions • R: Resolve domain name • F: Fetch content • Ordinary client use of CDN: RF • Instead of doing (RF)+ we do R+ then F+ • This allows us to compare the server chosen to some other servers that could have been chosen, over a large number of fetches. Slide originally from http://www.iwcw.org/2000/Proceedings/S4/S4-1.ppt

  46. The measured performance of content distribution networks: Procedure • R+: Collect a set of servers by repeated DNS queries • to a variety of name servers • over a number of hours • F+: Fetch a particular piece of content from each member of the set, measuring latency Slide originally from http://www.iwcw.org/2000/Proceedings/S4/S4-1.ppt

  47. The measured performance of content distribution networks: Important Details • Interleaved fetches • Fetch1 at server1, fetch1 at server2, etc. • Not fetch1 at server1, fetch2 at server1, etc. • Unmeasured fetch before measured fetch • Avoids cache misses • Measure only HTTP fetch latency • CDN not penalized for cost of DNS resolution Slide originally from http://www.iwcw.org/2000/Proceedings/S4/S4-1.ppt
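The F+ phase with these details can be sketched as a loop that visits every server in each round and discards a warm-up fetch before timing. This is an illustration of the procedure as described on the slides, not the authors' measurement tool; the `fetch` callback, the injectable `clock`, and the function name are assumptions.

```python
import time
from typing import Callable, Dict, List, Sequence


def measure_fetch_latency(
    servers: Sequence[str],
    fetch: Callable[[str], object],
    rounds: int,
    clock: Callable[[], float] = time.perf_counter,
) -> Dict[str, List[float]]:
    """Time HTTP fetches only; servers are assumed to be already resolved (R+)."""
    latencies: Dict[str, List[float]] = {server: [] for server in servers}
    for _ in range(rounds):
        for server in servers:  # interleave: fetch1 at server1, fetch1 at server2, ...
            fetch(server)       # unmeasured fetch, so the timed one avoids a cache miss
            start = clock()
            fetch(server)       # measured fetch
            latencies[server].append(clock() - start)
    return latencies
```

Interleaving spreads each server's samples over the whole measurement period, so a temporary network problem does not distort the results for a single server.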

  48. The measured performance of content distribution networks: Looking at these graphs • Note: log plot of latency • Gray line: cumulative distribution at one server • Red line: cumulative distribution at all servers • Blue line: cumulative distribution at CDN Slide originally from http://www.iwcw.org/2000/Proceedings/S4/S4-1.ppt

  49. The measured performance of content distribution networks: Cumulative Distribution • Right way to look at this data • Want to understand frequency and magnitude of bad choices • Consistent = vertical • Fast = to the left Slide originally from http://www.iwcw.org/2000/Proceedings/S4/S4-1.ppt
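A cumulative distribution of this kind can be computed directly from the latency samples. The sketch below is a generic empirical CDF with hypothetical sample values, not the data from the paper.

```python
from typing import List, Sequence, Tuple


def empirical_cdf(samples: Sequence[float]) -> List[Tuple[float, float]]:
    """Return (latency, fraction of fetches at or below that latency) pairs."""
    ordered = sorted(samples)
    count = len(ordered)
    return [(value, (index + 1) / count) for index, value in enumerate(ordered)]


# Hypothetical latencies in seconds.
for latency, fraction in empirical_cdf([0.3, 0.1, 0.2, 0.4]):
    print(f"{latency:.1f}s  {fraction:.2f}")
```

Plotted with latency on a log-scale x axis, a server whose curve rises steeply has consistent latency, and a curve further to the left means faster fetches.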

  50. The measured performance of content distribution networks: Results • Akamai does a better job than Digital Island • Neither does a particularly good job of selecting the optimal server Slide originally from http://www.iwcw.org/2000/Proceedings/S4/S4-1.ppt
