Web Caching Elliot Jaffe Presentation for The Seminar on Database and Internet Hebrew University, Fall 2002
Agenda • Caching: Why, Where, How, What • Some empirical data: Zipf’s Law • Content Delivery Networks • Bibliography
Why cache? • Number of unique pages: 800M < X < 2.2B • Number of unique web sites: 8,500,000 • Static pages: 30% - 40% • Pages revisited: 80% • Expected hit rate: 24% - 32%
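The slide does not say how the expected hit rate is derived, but the figures are consistent with multiplying the static-page fraction by the revisit fraction. A minimal sketch under that assumption (the function name is illustrative, and the two fractions are treated as independent):

```python
def expected_hit_rate(static_fraction: float, revisit_fraction: float) -> float:
    """Assumed model: only static pages are cacheable, and a cached page
    produces a hit only when it is revisited."""
    return static_fraction * revisit_fraction


low = expected_hit_rate(0.30, 0.80)
high = expected_hit_rate(0.40, 0.80)
print(f"expected hit rate: {low:.0%} - {high:.0%}")
```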
Why cache? • Bandwidth • Latency • Performance = Response Time • Server Load • Failure Redundancy
Where • [Figure: network diagram of cache locations. Labels: Browser cache, Intranet cache, Local ISP cache, CDN, Data Center ISP, L4 Switch, Reverse Proxy, Content Server]
Hot-potato routing • Get traffic off of your network as soon as possible • Bounces traffic around the internet • Increases chance of dropped packets • Increases latency • [Figure labels: You are here, Destination]
How: Types of Caches • Simple Proxy • Transparent Proxy • Reverse Proxy • Adaptive Caching • Push Caching • Active Caching • Streaming Caches
How: Simple Proxy • Harvest/Squid • Provide web content for a fixed user base • Standalone operation • May be transparent • Commodity product/technology • Easy to get 90% correct
How: Transparent Proxy • No client configuration • Violates end-to-end paradigm • Client thinks it is talking directly to server • Server thinks it is talking to cache • Implemented as • Pass-through unit • L4 switch
How: Reverse Proxy • Designed to offload duties from one or more specific servers • Data size is limited to size of static content on the server • Challenge is fast, disk-less operation • Cache consistency is easy • Single point of failure
How: Adaptive Caching • ISP Level caching • Cooperating multiple distributed caches • Operate as a cache-mesh based on content demand • Multicast for group membership (GCS) • Content Routing Protocol sends request to the appropriate cache within the mesh
How: Push Caching • Send the data out proactively • Content Delivery Networks • Paid for by data providers • More on this later!
How: Active Caching • Use an applet inside of the cache to customize dynamic pages on the fly • How do you identify dynamic pages? • Where does the custom data come from? • Who is going to pay for this service?
How: Streaming Caches • What about streaming content • Movies • Audio • Proprietary streaming protocols • Challenge is to maintain Quality of content and service • Who pays for this?
What: Content and Protocols • Mostly Static Content • HTML • XML • GIF • AVI • EXE • Etc.
What: Content and Protocols • HTTP 1.0 basic protocol • Send request based on a fixed number of verbs • GET • HEAD • POST • Receive response, meta-data, content
What: Content and Protocols • HTTP Request
Request        = Simple-Request | Full-Request
Simple-Request = "GET" SP Request-URI CRLF
Full-Request   = Request-Line
                 *( General-Header
                  | Request-Header
                  | Entity-Header )
                 CRLF
                 [ Entity-Body ]
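The grammar above can be sketched as two small builders, one per request form. This is an illustrative sketch rather than a complete HTTP client: the function names are hypothetical, and header values are passed through without validation.

```python
from typing import Dict, Optional

ALLOWED_VERBS = ("GET", "HEAD", "POST")  # the verbs named on the slides


def build_simple_request(uri: str) -> bytes:
    """Simple-Request = "GET" SP Request-URI CRLF"""
    return f"GET {uri}\r\n".encode("ascii")


def build_full_request(method: str, uri: str,
                       headers: Optional[Dict[str, str]] = None,
                       body: bytes = b"") -> bytes:
    """Full-Request = Request-Line *(headers) CRLF [ Entity-Body ]"""
    if method not in ALLOWED_VERBS:
        raise ValueError(f"unsupported verb: {method}")
    lines = [f"{method} {uri} HTTP/1.0"]          # Request-Line
    lines += [f"{name}: {value}" for name, value in (headers or {}).items()]
    head = "\r\n".join(lines) + "\r\n\r\n"         # blank line ends the headers
    return head.encode("ascii") + body


print(build_full_request("GET", "/pub/www/index.html"))
```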
What: Content and Protocols • Example:
GET /pub/www/index.html HTTP/1.0
• Response:
HTTP/1.1 200 OK
Server: Microsoft-IIS/5.0
Date: Sat, 19 Oct 2002 05:46:53 GMT
Expires: Sun, 20 Oct 2002 16:00:00 GMT
Content-Length: 2291
Content-Type: text/html
Cache-control: private
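To show how a cache might read the response above, here is a hedged sketch that parses the status line and headers, then derives two facts: whether a shared cache may store the response, and how long it stays fresh. The helper names are hypothetical. Treating `private` and `no-store` as "do not store in a shared cache" follows HTTP/1.1 semantics, and the freshness lifetime is simply Expires minus Date.

```python
from email.utils import parsedate_to_datetime

RESPONSE_HEAD = (
    "HTTP/1.1 200 OK\r\n"
    "Server: Microsoft-IIS/5.0\r\n"
    "Date: Sat, 19 Oct 2002 05:46:53 GMT\r\n"
    "Expires: Sun, 20 Oct 2002 16:00:00 GMT\r\n"
    "Content-Length: 2291\r\n"
    "Content-Type: text/html\r\n"
    "Cache-control: private\r\n"
    "\r\n"
)


def parse_response_head(raw):
    """Return (status code, headers) with header names lower-cased."""
    status_line, *header_lines = raw.split("\r\n")
    _version, code, _reason = status_line.split(" ", 2)
    headers = {}
    for line in header_lines:
        if not line:            # blank line marks the end of the headers
            break
        name, _, value = line.partition(":")
        headers[name.strip().lower()] = value.strip()
    return int(code), headers


def shared_cache_may_store(headers):
    directives = {d.strip().lower()
                  for d in headers.get("cache-control", "").split(",")}
    return not ({"private", "no-store"} & directives)


def freshness_lifetime_seconds(headers):
    """Expires minus Date, or None when either header is missing."""
    if "date" not in headers or "expires" not in headers:
        return None
    delta = (parsedate_to_datetime(headers["expires"])
             - parsedate_to_datetime(headers["date"]))
    return delta.total_seconds()


status, headers = parse_response_head(RESPONSE_HEAD)
print(status, shared_cache_may_store(headers), freshness_lifetime_seconds(headers))
```

For this response a browser cache could keep the page, but an ISP or intranet proxy would have to pass it through.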
What: Content and Protocols • Example “if-modified-since”:
GET /pub/www/index.html HTTP/1.0
If-Modified-Since: Sat, 19 Oct 2002 19:43:31 GMT
• Response:
HTTP/1.1 200 OK
Server: Microsoft-IIS/5.0
Date: Thu, 13 Jul 2000 05:46:53 GMT
Expires: Sun, 20 Oct 2002 16:00:00 GMT
Content-Length: 2291
Content-Type: text/html
Cache-control: private
What: Content and Protocols • Example “if-modified-since”:
GET /pub/www/index.html HTTP/1.0
If-Modified-Since: Sat, 19 Oct 2002 19:43:31 GMT
• Response:
HTTP/1.1 304 Not Modified
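The two If-Modified-Since exchanges above can be summarized in a short sketch: one function formats the conditional request from the time the cached copy was stored, and another decides which body to serve from the status code. Both function names are illustrative assumptions, not part of any library.

```python
from datetime import datetime, timezone
from email.utils import format_datetime


def conditional_get(uri: str, cached_at: datetime) -> str:
    """Build a conditional GET; cached_at must be a timezone-aware datetime."""
    stamp = format_datetime(cached_at.astimezone(timezone.utc), usegmt=True)
    return f"GET {uri} HTTP/1.0\r\nIf-Modified-Since: {stamp}\r\n\r\n"


def body_to_serve(status: int, cached_body: bytes, response_body: bytes) -> bytes:
    if status == 304:       # Not Modified: the cached copy is still valid
        return cached_body
    if status == 200:       # the page changed: use (and re-cache) the new body
        return response_body
    raise ValueError(f"unhandled status: {status}")


request = conditional_get(
    "/pub/www/index.html",
    datetime(2002, 10, 19, 19, 43, 31, tzinfo=timezone.utc),
)
print(request)
```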
Basic caching algorithm Pages may be • Fresh: up-to-date • Expired: current date > expiration date • Stale: “old”
Basic caching algorithm - #2
If (page is in the cache)
    If (page is expired or stale)
        Get from server with If-Modified-Since (soft miss)
        If not modified, get from cache
    Else
        Get from cache
Else
    Get from server
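A minimal Python sketch of this lookup follows. Several details are assumptions for illustration: the slides define "stale" only as "old", so `MAX_AGE` is an arbitrary threshold, and `fetch` and `revalidate` are hypothetical callables standing in for the unconditional and conditional GET.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Optional, Tuple

MAX_AGE = 86400.0  # assumed: entries older than one day count as stale


@dataclass
class Entry:
    body: bytes
    expires_at: float   # expiration time, seconds
    stored_at: float    # time the copy was stored or last validated


def lookup(url: str, cache: Dict[str, Entry], now: float,
           fetch: Callable[[str], Tuple[bytes, float]],
           revalidate: Callable[[str, float], Optional[Tuple[bytes, float]]],
           ) -> Tuple[bytes, str]:
    entry = cache.get(url)
    if entry is None:                       # not in the cache: get from server
        body, expires_at = fetch(url)
        cache[url] = Entry(body, expires_at, now)
        return body, "miss"
    expired = now > entry.expires_at
    stale = now - entry.stored_at > MAX_AGE
    if not (expired or stale):              # fresh: get from cache
        return entry.body, "hit"
    changed = revalidate(url, entry.stored_at)   # If-Modified-Since request
    if changed is None:                     # 304: serve the cached copy
        entry.stored_at = now
        return entry.body, "soft miss"
    body, expires_at = changed              # 200: replace the cached copy
    cache[url] = Entry(body, expires_at, now)
    return body, "miss"
```

Here `revalidate` returns `None` for a 304 response and a `(body, expires_at)` pair for a 200 response.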
Basic caching algorithm - #3 If cache has space Store the file Else • Delete expired from cache • Delete stale from cache • Delete LRU from cache • Delete largest/smallest from cache?
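The eviction order on this slide (expired, then stale, then least recently used) can be sketched with an `OrderedDict` whose iteration order runs from least to most recently used. The entry fields, the `max_age` staleness threshold, and the choice to stop evicting as soon as there is room are all assumptions. The size-based policy, which the slide leaves as an open question, is omitted.

```python
from collections import OrderedDict


def touch(cache: OrderedDict, url: str) -> None:
    """Call on every cache hit so iteration order stays LRU-first."""
    cache.move_to_end(url)


def make_room(cache: OrderedDict, needed: int, capacity: int,
              now: float, max_age: float = 86400.0) -> bool:
    """Evict entries until `needed` bytes fit; return True on success."""
    def has_room() -> bool:
        return sum(e["size"] for e in cache.values()) + needed <= capacity

    passes = [
        lambda e: now > e["expires_at"],           # 1. delete expired
        lambda e: now - e["stored_at"] > max_age,  # 2. delete stale
        lambda e: True,                            # 3. delete in LRU order
    ]
    for should_evict in passes:
        for url in list(cache):                    # least recently used first
            if has_room():
                return True
            if should_evict(cache[url]):
                del cache[url]
    return has_room()


cache = OrderedDict()
cache["a"] = {"size": 4, "expires_at": 100.0, "stored_at": 0.0}
cache["b"] = {"size": 4, "expires_at": 5.0, "stored_at": 0.0}
cache["c"] = {"size": 4, "expires_at": 100.0, "stored_at": 0.0}
fits = make_room(cache, needed=4, capacity=12, now=10.0)
print(fits, list(cache))
```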
Agenda • Caching: Why, Where, How, What • Some empirical data: Zipf’s Law • Content Delivery Networks • Bibliography
Zipf’s law • Zipf’s law: the frequency $P_i$ of an event as a function of its rank $i$ is a power-law function: $P_i = \Omega / i^{\alpha}$, where $\alpha \le 1$
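As a small illustration of the formula, the sketch below computes the probabilities for a finite catalogue of `n` ranked items. Choosing Ω so that the probabilities sum to 1 is an assumption; the slide leaves Ω unspecified.

```python
def zipf_probabilities(n: int, alpha: float) -> list:
    """P_i = omega / i**alpha for ranks i = 1..n, normalized to sum to 1."""
    weights = [1.0 / (rank ** alpha) for rank in range(1, n + 1)]
    omega = 1.0 / sum(weights)   # assumed normalization constant
    return [omega * w for w in weights]


probs = zipf_probabilities(1000, 0.75)
print(probs[0], probs[1], probs[-1])
```

With this definition the most popular item is requested 2^α times as often as the second most popular one, whatever the catalogue size.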
Zipf’s law • Observed to be true for • Frequency of written words in English texts • Population of cities • Income of a company as a function of rank
Zipf’s law and web access • For a given server, page access by rank follows Zipf’s law • Web requests from a fixed population of users follow Zipf’s law, with 0.64 < α < 0.83
Observations • Top 1% of all documents account for 20% - 35% of proxy requests • Top 10% account for 45% - 55% of requests • It takes 25% to 40% of all documents to account for 70% of requests • It takes 70% to 80% of all documents to account for 90% of requests
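To connect these observations with Zipf's law, the following sketch computes what share of requests the most popular documents receive under a pure Zipf model. The catalogue size of 100,000 documents and α = 0.75 are assumed values chosen for illustration. Real traces deviate from the pure model, so the output is a rough comparison and not a reproduction of the measurements.

```python
def top_share(n_docs: int, alpha: float, top_fraction: float) -> float:
    """Fraction of requests going to the top `top_fraction` of documents."""
    weights = [1.0 / rank ** alpha for rank in range(1, n_docs + 1)]
    top_n = max(1, int(n_docs * top_fraction))
    return sum(weights[:top_n]) / sum(weights)


share_top_1 = top_share(100_000, 0.75, 0.01)
share_top_10 = top_share(100_000, 0.75, 0.10)
print(f"top 1%: {share_top_1:.1%}, top 10%: {share_top_10:.1%}")
```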
Observations • For an infinite sized cache, the hit-ratio for a web-proxy grows in a log-like fashion as a function of the client population of the proxy and the number of requests seen by the proxy.
Observations • The hit-ratio of a web cache grows in a log-like fashion as a function of the cache size.
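One way to explore this observation is to replay a synthetic Zipf-distributed request stream through an LRU cache at several sizes. Everything here is an assumption made for illustration: the catalogue of 2,000 documents, α = 0.75, the fixed random seed, and a cache that counts documents instead of bytes.

```python
import random
from collections import OrderedDict


def lru_hit_ratio(requests, cache_size: int) -> float:
    cache = OrderedDict()
    hits = 0
    for doc in requests:
        if doc in cache:
            hits += 1
            cache.move_to_end(doc)         # mark as most recently used
        else:
            cache[doc] = True
            if len(cache) > cache_size:
                cache.popitem(last=False)  # evict the least recently used
    return hits / len(requests)


rng = random.Random(0)
n_docs, alpha = 2000, 0.75
weights = [1.0 / rank ** alpha for rank in range(1, n_docs + 1)]
requests = rng.choices(range(n_docs), weights=weights, k=20000)

ratios = {size: lru_hit_ratio(requests, size) for size in (10, 100, 1000)}
for size, ratio in ratios.items():
    print(size, round(ratio, 3))
```

Each tenfold increase in cache size adds to the hit ratio, but the gain per added slot shrinks, which is the diminishing-returns shape the slide describes.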
Observations Locality of Reference • The probability that a document will be referenced k requests after it was last referenced is roughly proportional to 1/k.
Observations - NOT • There is very little correlation between access frequency and document size • There is no correlation between access frequency and the change rate of a document • No single web server contributes to most of the popular pages
Zipf’s Law and Caching Discussion • How does this help in cache design? • Are there any business implications?
Agenda • Caching: Why, Where, How, What • Some empirical data: Zipf’s Law • Content Delivery Networks • Bibliography
CDN • “Traditional” CDN • Dirty Secrets • P2P content delivery systems
Why use a CDN? • [Figure: network diagram of cache locations. Labels: Browser cache, Intranet cache, Local ISP cache, CDN, Data Center ISP, L4 Switch, Reverse Proxy, Content Server]
What is a CDN? • Content Delivery Networks = PUSH • PUSH = Prefetch
CDN Mechanisms • DNS redirection • Complete • Partial • URL rewrite
Network Model: A DNS-redirecting CDN • [Figure: network diagram. Labels: Client, DNS redirector, Original server, HTTP servers A, B, and C; the client asks "example.com ?" for http://example.com/foo and then sends GET http://example.com/foo] Slide originally from http://www.iwcw.org/2000/Proceedings/S4/S4-1.ppt
CDN DNS Full Redirection • (Semi)automatic mechanism to replicate original site on CDN servers • Replace original DNS entry with enhanced DNS server that uses knowledge of network and server load to direct clients to appropriate CDN server • TTLs on DNS entries are very short • Adero, NetCaching, IntelliDNS
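The server selection performed by the enhanced DNS server might look like the sketch below. The slides do not describe how any of the named products choose a server, so the scoring rule (estimated round-trip time scaled by load), the overload cutoff, the 20-second TTL, and every name and address are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass
class Replica:
    address: str               # addresses below use the documentation range
    load: float                # 0.0 (idle) to 1.0 (saturated)
    rtt_ms: Dict[str, float]   # client region -> estimated round-trip time


def resolve(client_region: str, replicas: List[Replica],
            ttl_seconds: int = 20) -> Tuple[str, int]:
    """Return (address, TTL) for the replica with the best assumed score."""
    usable = [r for r in replicas if r.load < 0.9] or replicas

    def score(r: Replica) -> float:
        return r.rtt_ms.get(client_region, float("inf")) * (1.0 + r.load)

    best = min(usable, key=score)
    return best.address, ttl_seconds   # short TTL so clients ask again soon


replicas = [
    Replica("192.0.2.1", 0.20, {"eu": 20.0, "us": 120.0}),
    Replica("192.0.2.2", 0.95, {"eu": 10.0, "us": 100.0}),
    Replica("192.0.2.3", 0.50, {"eu": 90.0, "us": 30.0}),
]
print(resolve("eu", replicas), resolve("us", replicas))
```

The short TTL is what lets the redirector react to changing load: a cached answer expires quickly, so the next lookup can be steered to a different server.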
CDN DNS Partial Redirection • Statically modify selected URLs within pages to point to CDN service • Replicate selected objects to CDN service • Redirect clients of selected URLs using enhanced DNS server that uses knowledge of network and server load • Akamai, Digital Island, MirrorImage, SolidSpeed, Speedera
CDN rewrite • Modify pages at the origin server on the fly • Change embedded URLs based on up-to-date knowledge of the network and CDN server loads • Does not require additional DNS lookups • Fasttide, Clearway
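A rough sketch of rewriting embedded URLs follows. The regular expression only handles quoted `src` and `href` attributes, and the host names and extension list are hypothetical. In an on-the-fly rewriter the `cdn_host` argument would be chosen per request from current load data, which this sketch does not model.

```python
import re
from urllib.parse import urlsplit, urlunsplit


def rewrite_embedded_urls(html: str, origin_host: str, cdn_host: str,
                          extensions=(".gif", ".jpg", ".avi")) -> str:
    """Point selected static objects at the CDN; leave other URLs alone."""
    def swap(match):
        attr, quote, url = match.group(1), match.group(2), match.group(3)
        parts = urlsplit(url)
        if parts.netloc == origin_host and parts.path.lower().endswith(extensions):
            url = urlunsplit(parts._replace(netloc=cdn_host))
        return f"{attr}={quote}{url}{quote}"

    return re.sub(r'\b(src|href)=(["\'])(.*?)\2', swap, html)


page = ('<a href="http://example.com/index.html">'
        '<img src="http://example.com/logo.gif"></a>')
rewritten = rewrite_embedded_urls(page, "example.com", "cdn.example.net")
print(rewritten)
```

The HTML page itself stays on the origin server, and only the embedded object is fetched from the CDN host.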
Measuring a CDN’s performance • Two papers • K. L. Johnson, J. F. Carr, M. S. Day, and M. F. Kaashoek, “The measured performance of content distribution networks,” in Proceedings of the 5th International Web Caching Workshop and Content Delivery Workshop (Lisbon, Portugal), May 2000. • B. Krishnamurthy, C. Wills, and Y. Zhang, “On the Use and Performance of Content Distribution Networks,” in ACM SIGCOMM Internet Measurement Workshop, 2001.
The measured performance of content distribution networks: Client Actions • R: Resolve domain name • F: Fetch content • Ordinary client use of CDN: RF • Instead of doing (RF)+ we do R+ then F+ • This allows us to compare the server chosen to some other servers that could have been chosen, over a large number of fetches. Slide originally from http://www.iwcw.org/2000/Proceedings/S4/S4-1.ppt
The measured performance of content distribution networks: Procedure • R+: Collect a set of servers by repeated DNS queries • to a variety of name servers • over a number of hours • F+: Fetch a particular piece of content from each member of the set, measuring latency Slide originally from http://www.iwcw.org/2000/Proceedings/S4/S4-1.ppt
The measured performance of content distribution networks: Important Details • Interleaved fetches • Fetch1 at server1, fetch1 at server2, etc. • Not fetch1 at server1, fetch2 at server1, etc. • Unmeasured fetch before measured fetch • Avoids cache misses • Measure only HTTP fetch latency • CDN not penalized for cost of DNS resolution Slide originally from http://www.iwcw.org/2000/Proceedings/S4/S4-1.ppt
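The R+ then F+ procedure and the details above can be sketched as a small harness. This is not the authors' tooling: `resolve` and `fetch` are hypothetical callables supplied by the caller (a DNS query and an HTTP fetch in practice), which keeps the sketch runnable without network access.

```python
import time
from typing import Callable, Dict, Iterable, List, Set


def collect_servers(resolve: Callable[[str], Iterable[str]],
                    name_servers: List[str], rounds: int) -> Set[str]:
    """R+: repeat DNS queries against several name servers, keep every answer."""
    servers: Set[str] = set()
    for _ in range(rounds):
        for name_server in name_servers:
            servers.update(resolve(name_server))
    return servers


def measure(servers: Set[str], fetch: Callable[[str], object],
            n_fetches: int, clock=time.perf_counter) -> Dict[str, List[float]]:
    """F+: time only the fetch, interleaving servers within each round."""
    order = sorted(servers)
    latencies: Dict[str, List[float]] = {s: [] for s in order}
    for server in order:
        fetch(server)                  # unmeasured fetch warms the server cache
    for _ in range(n_fetches):         # fetch1 at server1, fetch1 at server2, ...
        for server in order:
            start = clock()
            fetch(server)
            latencies[server].append(clock() - start)
    return latencies
```

Because DNS resolution happens entirely in `collect_servers`, its cost never enters the recorded latencies.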
The measured performance of content distribution networks: Looking at these graphs • Note: log plot of latency • Gray line: cumulative distribution at one server • Red line: cumulative distribution at all servers • Blue line: cumulative distribution at CDN Slide originally from http://www.iwcw.org/2000/Proceedings/S4/S4-1.ppt
The measured performance of content distribution networks: Cumulative Distribution • Right way to look at this data • Want to understand frequency and magnitude of bad choices • Consistent = vertical • Fast = to the left Slide originally from http://www.iwcw.org/2000/Proceedings/S4/S4-1.ppt
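For readers who want to build such a plot from their own latency samples, a minimal sketch of the empirical cumulative distribution follows. The function names are illustrative and plotting is left out.

```python
def empirical_cdf(latencies):
    """Return (latency, fraction of fetches at or below it), sorted by latency."""
    ordered = sorted(latencies)
    n = len(ordered)
    return [(value, (index + 1) / n) for index, value in enumerate(ordered)]


def fraction_within(latencies, threshold):
    """Share of fetches that completed within `threshold` seconds."""
    return sum(1 for value in latencies if value <= threshold) / len(latencies)


samples = [0.3, 0.1, 0.2, 0.4]
print(empirical_cdf(samples))
print(fraction_within(samples, 0.25))
```

A curve that rises steeply means most fetches took about the same time (consistent), and a curve that sits further left means those times were short (fast).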
The measured performance of content distribution networks: Results • Akamai does a better job than Digital Island • Neither does a particularly good job of selecting the optimal server Slide originally from http://www.iwcw.org/2000/Proceedings/S4/S4-1.ppt