World Wide Web Caching: Trends and Technologies

Greg Barish & Katia Obraczka, USC Information Sciences Institute, USA, 2000

Plan • Introduction • Expected gains • Desirable properties of a Web caching system • Caching architectures • Cache deployment options • Design techniques • Summary • Future work

Introduction • What is Web caching? • Proxy servers are placed at certain points in the network to cache Web documents for faster client access. • Comparable to cache memory in a computer system. • Why is it needed? • HTTP traffic has grown rapidly to become the largest share of Internet traffic, causing more network congestion and server unavailability. • The number of static Web pages almost doubles every year.

Expected gains: • Bandwidth savings • Improved content availability • Improved Web server availability • Reduced network latency • Server load balancing • Better user perception of network performance

Desirable properties: • Fast access • Transparency • Scalability • Efficiency • Adaptivity • Stability • Load balancing • Simplicity

Caching Architectures • Proxy Caching • Deployed at the edges of the network • Unavailable cache ⇒ unavailable network • Single point of failure • Browsers must be reconfigured manually when the cache fails • Browser auto-reconfiguration is a recent trend [Figure (a): standalone proxy cache, showing clients, cache, router, and the Web]

Caching Architectures • Reverse Proxy Caching • Places proxies near the content provider • Transparent Caching • Eliminates the need to configure Web browsers manually • Router-based transparent proxy caching • Switch-based transparent proxy caching [Figure (b): router-based transparent proxy caching; Figure (c): L4-switch-based transparent proxy caching; each shows clients, router, caches, and the Web]

Caching Architectures • Adaptive Web Caching • Uses distributed cache meshes to solve the hot-spot problem • Caches dynamically join and leave cache groups based on content demand • Adaptive and self-organizing • Cache Group Management Protocol (CGMP) • Content Routing Protocol (CRP) • Administrative boundaries must be relaxed

Caching Architectures [Figures: overlapping multicast groups of Web caches; self-organization of Web caches]

Caching Architectures • Push Caching • Keeps data close to the clients requesting it • Assumption: we are able to launch caches that may cross administrative boundaries • Incurs cost (storage and transmission) • Active Caching • Applies caching to dynamic documents • 30% of client HTTP requests contain cookies • Cache applets • The server provides the cache with the objects and any associated cache applets

Cache Deployment Options • Near the content consumer (consumer-oriented) • Better response time • Requests are served locally • Near the content provider (provider-oriented) • Improves access to logical sets of data • Improves the scalability and availability of content • Critical for delay-sensitive content (audio, video) • At strategic points in the network • Based on user access patterns, network topology, and network conditions • Problem: administrative control

Design Techniques • Main concerns: • Speed • Reliability • Scalability • Design techniques: • Hierarchical caching • Intercache communication • Hash-based request routing • Optimized disk I/O • Microkernel operating system • Content prefetching • Cache consistency methods

Hierarchical Caching • Caches are arranged in a tree-like structure. • A child cache can query its parent caches and its siblings. • A parent cache never queries its children. • This keeps information filtering gradually down toward the leaves. • To avoid swamping parents with requests, clustering may be applied to hierarchies.
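
The lookup rule on this slide can be sketched as follows. This is a minimal illustration, not the design of any particular proxy: the class `CacheNode`, its fields, and the `fetch_origin` callback are all hypothetical names, and real caches would also handle expiry, timeouts, and capacity.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass(eq=False)
class CacheNode:
    """One cache in the tree. All names here are illustrative assumptions."""

    name: str
    parent: CacheNode | None = None
    siblings: list[CacheNode] = field(default_factory=list)
    store: dict[str, bytes] = field(default_factory=dict)

    def lookup(self, url: str, fetch_origin) -> tuple[bytes, str]:
        """Return (body, source), where source names who supplied the object."""
        if url in self.store:
            return self.store[url], self.name
        # Siblings are asked only about what they already hold.
        for sibling in self.siblings:
            if url in sibling.store:
                body, source = sibling.store[url], sibling.name
                break
        else:
            # No sibling had it: go up the tree. A parent never asks a child.
            if self.parent is not None:
                body, source = self.parent.lookup(url, fetch_origin)
            else:
                body, source = fetch_origin(url), "origin"
        # Keeping a copy on the way back is what lets popular pages
        # filter down toward the leaves.
        self.store[url] = body
        return body, source
```

A first request from a leaf climbs to the root and the origin server; later requests from the same leaf or its siblings are answered lower in the tree.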

Hierarchical Caching • Caches are placed at multiple levels of the network. • Bottom: client/browser caches. [Figure: cache hierarchy from the bottom level up through institutional, regional, and national caches; a request moves up one level when the Web page is not found]

Hierarchical Caching • Advantages: • Bandwidth efficient, especially when cache servers are slow. • Efficiently diffuses popular Web pages toward the demand. • Disadvantages: • Cache servers must be placed at key access points of the network, which requires coordination among caches. • Each level adds delay. • High levels are bottlenecks. • Multiple copies are kept at different cache levels.

Distributed Caching • Multiple distributed caches arranged in meshes • Caches at the bottom level only • No intermediate caching levels • Improves scalability, availability, and physical locality • Each cache server holds metadata about the content stored on other servers • A hierarchy is used only to distribute information about where copies are located • No copying of actual documents

Intercache Communication • Needed when a system is composed of multiple distributed caches. • Protocols: • ICP (Internet Cache Protocol) [Squid]: caches query other caches to determine the best location from which to retrieve an object. The main problem is message overhead. • CRP (Content Routing Protocol): like ICP, with a multicast feature for querying cache meshes. • Cache digests [Squid]: a summary of the objects held in a cache. • WCCP (Web Cache Communication Protocol) [Cisco]: enables transparent redirection of HTTP traffic to a Cisco Cache Engine. • CARP (Cache Array Routing Protocol) [Microsoft]: uses a hashing scheme to determine which proxy holds the requested information.
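
To make the cache digest idea concrete, here is a small sketch of a digest built as a Bloom filter: a bit array that summarizes which URLs a cache holds, so a peer can check the summary instead of sending a query per request. The class name `CacheDigest`, the bit-array size, and the way bit positions are derived from an MD5 hash are assumptions for illustration, not a description of Squid's actual format.

```python
import hashlib


class CacheDigest:
    """Bloom-filter summary of cached URLs (sizes are illustrative)."""

    def __init__(self, num_bits: int = 8192, num_hashes: int = 4):
        if not 1 <= num_hashes <= 4:
            raise ValueError("this sketch derives at most 4 positions per URL")
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits // 8 + 1)

    def _positions(self, url: str):
        # Split one 16-byte MD5 digest into 4-byte chunks, one per position.
        digest = hashlib.md5(url.encode("utf-8")).digest()
        for i in range(self.num_hashes):
            chunk = digest[4 * i: 4 * i + 4]
            yield int.from_bytes(chunk, "big") % self.num_bits

    def add(self, url: str) -> None:
        for pos in self._positions(url):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, url: str) -> bool:
        # False means "definitely not cached"; True may be a false positive.
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(url))
```

The trade-off against ICP is that a digest replaces per-request query messages with a periodically exchanged summary, at the price of occasional false positives and of staleness between exchanges.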

Hashing Function • Points the local cache toward other caches that have the object or can get it. • Hash-based request routing • Uses a hash function to map a key (such as the URL) to a cache within a cluster • Reduces (or eliminates) the need for caches to query each other • Examples: NetCache (MD5-indexed URL hash function), CARP
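
Hash-based request routing can be sketched in a few lines. The version below scores every cache against the URL and picks the highest score, which is in the spirit of CARP but does not reproduce its actual hash function; the function name `pick_cache`, the cache names, and the use of MD5 over a joined string are assumptions for illustration.

```python
import hashlib


def pick_cache(url: str, caches: list[str]) -> str:
    """Map a URL to one cache in the cluster without asking any cache."""

    def score(cache: str) -> int:
        # Combine the cache name and the URL so each (cache, URL) pair
        # gets its own pseudo-random score.
        digest = hashlib.md5(f"{cache}|{url}".encode("utf-8")).digest()
        return int.from_bytes(digest[:8], "big")

    return max(caches, key=score)


if __name__ == "__main__":
    cluster = ["cache-a", "cache-b", "cache-c"]
    for page in ("http://example.org/", "http://example.org/news"):
        print(page, "->", pick_cache(page, cluster))
```

Because every client computes the same answer from the URL alone, no intercache query is needed. Scoring each cache, rather than taking the hash modulo the cluster size, means that removing one cache only remaps the URLs that were assigned to it.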

Optimized I/O: • Treat the object cache as a high-performance database. • Use an in-memory data structure to determine whether an object has been cached. • Disk operations are then needed only to locate and read the content on disk. • Costly I/O operations can be avoided.

Microkernel Operating System: • Controls how resources are managed. • Improves resource allocation. • Optimizes cache performance.
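
The optimized I/O idea above can be illustrated with a toy object store: an in-memory index answers "is it cached?" and records where each object lives on disk, so a miss costs no disk access and a hit costs one seek and one read. The class `IndexedDiskCache`, the single append-only data file, and the helper `open_temp_cache` are hypothetical choices for this sketch, not how any specific product lays out its storage.

```python
import os
import tempfile


class IndexedDiskCache:
    """Objects live in one data file; the index lives in memory."""

    def __init__(self, path: str):
        self.path = path
        self.index = {}  # url -> (offset, length)
        open(path, "ab").close()  # make sure the data file exists

    def put(self, url: str, body: bytes) -> None:
        with open(self.path, "ab") as f:
            offset = f.seek(0, os.SEEK_END)  # new objects are appended
            f.write(body)
        # Re-putting a URL simply points the index at the newest copy.
        self.index[url] = (offset, len(body))

    def get(self, url: str):
        entry = self.index.get(url)
        if entry is None:
            return None  # miss decided in memory, no disk I/O at all
        offset, length = entry
        with open(self.path, "rb") as f:
            f.seek(offset)
            return f.read(length)


def open_temp_cache() -> IndexedDiskCache:
    """Create a cache backed by a file in a fresh temporary directory."""
    directory = tempfile.mkdtemp(prefix="webcache-")
    return IndexedDiskCache(os.path.join(directory, "objects.bin"))
```

A production cache would also persist or rebuild the index after a restart and reclaim the space of replaced objects; both are left out here.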

Content Prefetching • Local based • Server-hint based: uses data accumulated by the server, such as historical information. • Implementation: • Between clients and servers • Between clients and proxies • Between proxies and servers • Improvements: • Lower latency (improvements from 26% to 57%) • Improved access time
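
As one possible reading of local-based prefetching, the sketch below learns from a single client's own request history which page usually follows which, and suggests that page as a prefetch candidate. The class `LocalPrefetcher`, the first-order "next page" model, and the `min_count` threshold are assumptions made for this example; the slides do not prescribe a prediction method.

```python
from collections import Counter, defaultdict


class LocalPrefetcher:
    """Predicts the next URL from locally observed request sequences."""

    def __init__(self, min_count: int = 2):
        self.followers = defaultdict(Counter)  # url -> Counter of next urls
        self.last_url = None
        self.min_count = min_count

    def record(self, url: str) -> None:
        """Call once per request, in the order requests are made."""
        if self.last_url is not None:
            self.followers[self.last_url][url] += 1
        self.last_url = url

    def predict(self, url: str):
        """Return the URL worth prefetching after `url`, or None."""
        counts = self.followers.get(url)
        if not counts:
            return None
        candidate, count = counts.most_common(1)[0]
        # Only prefetch when the pattern has been seen often enough,
        # since wrong guesses waste bandwidth.
        return candidate if count >= self.min_count else None
```

A server-hint-based variant would keep the same kind of statistics on the server, aggregated over all clients, and send the prediction along with the response.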

Cache Coherency (Consistency) • Ensures that a cached object does not reflect stale or defunct data. • Consistency techniques: • Client polling: compare the cached object with the original object. • Invalidation callbacks: the server contacts the proxies when objects change. • TTL and adaptive TTL • If-Modified-Since: objects are revalidated only when they are requested and their expiration date has been reached.
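
Adaptive TTL and If-Modified-Since revalidation can be sketched together. The assumption behind adaptive TTL is that an object that has not changed for a long time is unlikely to change soon, so its time-to-live is set to a fraction of its age when it was fetched. The names `CachedObject`, `adaptive_ttl`, `is_fresh`, and `revalidate`, as well as the factor and the clamping bounds, are illustrative; `revalidate` only emulates the outcome of a conditional GET and performs no network I/O.

```python
from dataclasses import dataclass


@dataclass
class CachedObject:
    body: bytes
    fetched_at: float      # seconds, when the cache stored the object
    last_modified: float   # seconds, as reported by the origin server


def adaptive_ttl(obj: CachedObject, factor: float = 0.2,
                 min_ttl: float = 60.0, max_ttl: float = 86400.0) -> float:
    """TTL grows with how long the object had been unchanged when fetched."""
    age_at_fetch = max(0.0, obj.fetched_at - obj.last_modified)
    return min(max_ttl, max(min_ttl, factor * age_at_fetch))


def is_fresh(obj: CachedObject, now: float) -> bool:
    return now - obj.fetched_at < adaptive_ttl(obj)


def revalidate(obj: CachedObject, now: float,
               server_last_modified: float, fetch_body) -> CachedObject:
    """Emulates If-Modified-Since: refetch the body only if it changed."""
    if server_last_modified <= obj.last_modified:
        # Equivalent of "304 Not Modified": keep the body, restart the clock.
        return CachedObject(obj.body, now, obj.last_modified)
    return CachedObject(fetch_body(), now, server_last_modified)
```

A cache would serve the object directly while `is_fresh` holds and call `revalidate` on the first request after it expires, which matches the "only when requested and expired" rule on the slide.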

Summary: • Cache designs differ, but some issues are common to all of them. • Advantages: • Improves content availability. • Reduces network latency. • Addresses increasing bandwidth demands. • Can hide network problems. • Reduces server load. • Disadvantages: • Stale pages. • Information is retained in caches.

Open Issues and Future Work (trends): • Content security (1. NetCache, 2. CacheFlow): - The appliance is deployed in parallel with the firewall. - The appliance can be used to control who accesses a Web site. - Virus scanning for all incoming content. - Content filtering added to its caches. • Handling more complex objects and real-time data: RTEE (Real-Time Event Engine) captures, caches, and queries data at speeds greater than 12,000 events/s. • Web caching based on ontology? • User access pattern prediction • Prefetching • Cache placement/replacement
