p-Jigsaw: A Cluster-based Web Server with Cooperative Caching Supports Ge Chen, Cho-Li Wang, Francis C.M. Lau (Presented by Cho-Li Wang) The Systems Research Group Department of Computer Science and Information Systems The University of Hong Kong
What's a cluster? • "A cluster is a type of parallel or distributed processing system, which consists of a collection of interconnected stand-alone/complete computers cooperatively working together as a single, integrated computing resource" -- IEEE TFCC
Rich Man's Cluster • Computational Plant (C-Plant cluster) • Rank: 30 on the TOP500 list (11/2001) • 1536 Compaq DS10L 1U servers (466 MHz Alpha 21264 (EV6) microprocessor, 256 MB ECC SDRAM) • Each node contains a 64-bit, 33 MHz Myrinet network interface card (1.28 Gbps) connected to a 64-port Mesh64 switch • 48 cabinets, each containing 32 nodes (48 x 32 = 1536)
Poor Man's Cluster • HKU Linux Cluster • 32 x 733 MHz Pentium III PCs, 392 MB memory each • Hierarchical Ethernet-based network: four 24-port Fast Ethernet switches + one 8-port Gigabit Ethernet backbone switch • An additional 80-port Cisco Catalyst 2980G Fast Ethernet switch
Cluster Computer Architecture [Figure: layered view -- applications (database, Web server, OLTP, etc.) and programming environments (Java, C, MPI, HPF) with Web/Windows user interfaces sit on top of a Single System Image infrastructure and an availability infrastructure, which run over per-node operating systems on the cluster nodes, interconnected by a high-speed LAN (Fast/Gigabit Ethernet, SCI, Myrinet).]
Talk Outline • Motivation -- The Need for Speed • Cluster-based Solutions • System Architecture of p-Jigsaw • Performance Evaluation • Conclusion and Future Work • Other SRG Projects
The Challenges • Netscape Web site in November 1996: 120 million hits per day • Microsoft Corp. Web site received more than 100M hits per day (1,200 hits per second) • Olympic Winter Games 1998 (Japan): 634.7M hits over 16 days; peak day 57M, peak minute 110K • Wimbledon, July 1999: 942M hits over 14 days; peak day 125M, peak minute 430K • Olympic Games 2000: peak day 502.6M hits, peak minute 600K hits (10K hits per second)
The Need for Speed • The Internet user population is growing very fast • According to the United States Internet Council's report, the number of regular Internet users grew from fewer than 9M in 1993 to more than 300M by the summer of 2000, and is still growing fast • Broadband is becoming popular • According to IDG's report, 57% of workers in the U.S. access the Internet via broadband in the office; the figure will exceed 90% by 2005. Home broadband users will also increase from fewer than 9M now to over 55M by 2005 • HTTP requests now account for a larger portion of Internet traffic • One study shows that HTTP activity has grown to account for 75%~80% of all Internet traffic
The Need for Speed • Growing user numbers • Faster last-mile connection speeds • HTTP requests accounting for an increasing portion of all Internet traffic • Together, these trends require a more powerful Web server architecture
Cluster-Based Solution • Cluster -- a low-cost yet efficient parallel computing architecture
Dispatcher-based • A network component of the Web-server system acts as a dispatcher, routing each request to one of the Web servers to achieve load balancing; each Web server works individually • Layer-4 switching with layer-2 address translation: One-IP, IBM eNetwork, WebMux, LVS in DR mode • Layer-4 switching with layer-3 address translation: Cisco LocalDirector, Alteon ACEDirector, F5 Big/IP, LVS in NAT mode • Layer-7 switching (content-based): LARD, IBM Web Accelerator, Zeus Load Balancer (ZLB) [Figure: clients on the Internet send requests to the dispatcher, which forwards them to the back-end Web servers.]
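Not from the slides: a toy Java sketch of the dispatcher idea, i.e. a front node picking which back-end Web server handles each request. Real dispatchers balance on measured load, layer-4 connection state, or layer-7 content; plain round-robin is used here only to make the routing step concrete, and all names are hypothetical.

```java
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;

class RoundRobinDispatcher {
    private final List<String> backends;            // e.g. server node addresses
    private final AtomicInteger next = new AtomicInteger();

    RoundRobinDispatcher(List<String> backends) {
        this.backends = backends;
    }

    /** Pick the back-end server that will handle the next request. */
    String route() {
        // floorMod keeps the index valid even after the counter overflows
        int i = Math.floorMod(next.getAndIncrement(), backends.size());
        return backends.get(i);
    }
}
```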
p-Jigsaw -- Goals • High Efficiency: • Exploit the aggregate power of cluster resources (CPU, memory, disk, network bandwidth) • Explore in-memory Web caching on cluster-based Web servers • High Scalability: • Maintain a high cache hit rate and high throughput as the cluster size grows • Eliminate potential bottlenecks in the overall design • High Portability: • Multi-platform support • Heterogeneous clusters
Main Features of p-Jigsaw Web servers • Global Object Space (GOS) • Hot Objects Caching • Cooperative Object Caching • Distributed Cache Replacement Algorithms
Global Object Space (GOS) • All Web objects in the system are visible and accessible to every node through the GOS [Figure: four server nodes, each running p-Jigsaw on a JVM over its local OS and holding a hot object cache in memory; the per-node caches combine across the high-speed LAN to form the Global Object Space.]
Example: Serving a Request through the GOS • Incoming request (received by node 1): http://p-Jigsaw.org/node4/dir2/pic1.jpg • 1. Node 1 searches its global object table (GOT) by hashing the URL • 2. Miss: the object is not in node 1's hot object cache • 3. The request is redirected to node 4, the home node (HN) of the requested page • 4. A cached copy is forwarded from node 2, 3, or 4, depending on the servers' workload • 5. The object is then cached in node 1 • Legend: HN = Home Node; AGAC = Approximated Global Access Counter; CCNN = Cache Copy Node Number; LAC = Local Access Counter
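A minimal Java sketch of the lookup-and-redirect path above, assuming hash-based home-node placement; the class and helper names are hypothetical, not the p-Jigsaw source.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

class GlobalObjectSpace {
    private final int numNodes;   // cluster size
    private final int thisNode;   // id of the local server node
    private final Map<String, byte[]> hotObjectCache = new ConcurrentHashMap<>();

    GlobalObjectSpace(int numNodes, int thisNode) {
        this.numNodes = numNodes;
        this.thisNode = thisNode;
    }

    /** Home node (HN) of a URL: a hash over the request path. */
    int homeNode(String url) {
        return Math.floorMod(url.hashCode(), numNodes);
    }

    /** Serve from the local cache if possible, otherwise go through the HN. */
    byte[] handle(String url) {
        byte[] obj = hotObjectCache.get(url);         // search local hot object cache
        if (obj != null) return obj;                  // local hit
        int hn = homeNode(url);
        if (hn != thisNode) return fetchVia(hn, url); // redirect to the home node
        obj = loadFromFileServer(url);                // global miss: disk path
        hotObjectCache.put(url, obj);                 // cache for later requests
        return obj;
    }

    private byte[] fetchVia(int node, String url) { /* peer fetch over the LAN */ return new byte[0]; }
    private byte[] loadFromFileServer(String url) { /* file-server (NFS) read */ return new byte[0]; }
}
```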
Distributed Cache Replacement • Two LFU-based algorithms are implemented: • LFU-Aging: every aging interval Δt, each AGAC is halved (AGAC / 2) • Weighted-LFU: eviction priority = AGAC / (file size) • Global LRU (GLRU) is implemented for comparison • The goal is to keep the "hottest" objects cached in the global object space • A cached object's lifetime is set according to its HTTP timestamp; cache consistency is maintained by an invalidation scheme
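A minimal sketch of the two LFU variants named above, assuming each cache entry carries its AGAC and size; the field and method names are illustrative, not taken from the p-Jigsaw code.

```java
class CacheEntry {
    long agac;       // approximated global access counter
    long sizeBytes;  // object size

    /** LFU-Aging: halve the counter once per aging interval (Δt),
     *  so stale popularity decays instead of pinning old objects. */
    void age() {
        agac /= 2;
    }

    /** Weighted-LFU: normalize hotness by object size, so a large object
     *  must be proportionally hotter than a small one to stay cached. */
    double weightedValue() {
        return (double) agac / sizeBytes;
    }
}
```

On eviction, the entry with the smallest counter (LFU-Aging) or smallest weighted value (Weighted-LFU) is dropped first.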
Update of Access Counters • Each node's LAC is periodically sent back to the object's home node (HN) to maintain an approximated global access counter for every cached object [Figure: four hot object caches (HOCs) on nodes 1-4; a reported LAC of 45 is added to the home node's count of 200 (45 + 200 = 245), and the resulting AGAC of 245 is propagated to the nodes holding cached copies.]
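A hedged sketch of that counter flow, assuming per-URL atomic counters: each node accumulates local hits in a LAC, then periodically drains it into the home node's AGAC. The class and method names are assumptions for illustration.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicLong;

class AccessCounters {
    // LAC: per-node hit counts since the last flush
    private final Map<String, AtomicLong> lac = new ConcurrentHashMap<>();
    // AGAC: maintained only on an object's home node
    private final Map<String, AtomicLong> agac = new ConcurrentHashMap<>();

    /** Record a hit on a locally cached copy. */
    void recordLocalHit(String url) {
        lac.computeIfAbsent(url, k -> new AtomicLong()).incrementAndGet();
    }

    /** Periodic task: drain the LAC and fold the delta into the home
     *  node's AGAC (e.g. a LAC of 45 added to 200 yields 245). */
    void flushTo(AccessCounters homeNode, String url) {
        AtomicLong counter = lac.get(url);
        if (counter == null) return;
        long delta = counter.getAndSet(0);   // reset LAC after reporting
        homeNode.agac.computeIfAbsent(url, k -> new AtomicLong()).addAndGet(delta);
    }
}
```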
Experiment Setup • 32-node PC cluster; each node is a 733 MHz Pentium III PC running Linux 2.2.4 • The nodes are connected by an 80-port Cisco Catalyst 2980G Fast Ethernet switch • An NFS server (2-way SMP) with a Gigabit Ethernet link to the switch • 16 nodes act as clients, and the rest as Web servers • Each server node has 392 MB of physical memory installed
Experiment Results: Effects of Scaling the Cluster Size [Chart: throughput as the cluster size grows; the values 4.56 and 2.02 appear as data labels on the plot.]
Experiment Results: Effects of Scaling the Cache Size [Chart: aggregated cache size for 16 nodes = 1.8% (8 MB per node), 3.6%, 7.2%, and 14.4% (64 MB per node) of the data set size.]
Analysis of Request Handling Patterns • Local Cache Object (in local memory) • The server that receives the request has the requested object in its local hot object cache • Peer Node Cache Object (in remote memory) • The server that receives the request does not have the requested object in its local hot object cache; the object is fetched from the home node or another peer node • Disk Object (local or remote disk) • The requested object is not in the global object space and has to be fetched from the file server; this has the longest serving time
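A small Java sketch of the three serving paths just defined, useful for reasoning about the hit-rate breakdowns that follow; the tier names and helpers are hypothetical, not from the p-Jigsaw source.

```java
import java.util.Map;
import java.util.Optional;
import java.util.concurrent.ConcurrentHashMap;

class RequestServer {
    /** The three request-handling patterns, fastest to slowest. */
    enum Source { LOCAL_CACHE, PEER_CACHE, DISK }

    private final Map<String, byte[]> localHotCache = new ConcurrentHashMap<>();

    /** Serve a URL and report which tier satisfied it, for accounting. */
    Source serve(String url) {
        if (localHotCache.containsKey(url)) return Source.LOCAL_CACHE; // local memory
        if (fetchFromPeer(url).isPresent()) return Source.PEER_CACHE;  // remote memory
        loadFromFileServer(url);                                       // slowest path
        return Source.DISK;
    }

    private Optional<byte[]> fetchFromPeer(String url) { /* GOS lookup */ return Optional.empty(); }
    private byte[] loadFromFileServer(String url)      { /* file-server read */ return new byte[0]; }
}
```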
Analysis of Request Handling Patterns • The LFU-based algorithms show high local cache hit rates: with 64 MB of cache per node, the local cache hit rate is around 60% for both Weighted-LFU and LFU-Aging
Analysis of Request Handling Patterns • With a small cache size (8 MB per node), cooperative caching improves the global cache hit rate and reduces costly file-server disk accesses, a common bottleneck for a Web site [Chart: request breakdown with values of roughly 50%, 35.2%, 6.7%, and 25% across the serving categories.]
Conclusions • Using cluster-wide physical memory as an object cache can improve the performance and scalability of Web server systems • With a relatively small amount of memory dedicated to object content caching, we are able to achieve a high hit rate through cooperative caching • The results favor replicating more hot objects over squeezing more distinct objects into the global object space
Future Work • The HKU "Hub2World" Project • Build a giant proxy cache server on HKU's 300-node Gideon cluster, based on p-Jigsaw • Cache hot objects in a 150 GB in-memory cache (0.5 GB x 300) plus 12 terabytes of disk space (40 GB x 300) • Design new caching algorithms
Other SRG Projects Welcome to download our software packages and test them on your clusters. URL: http://www.srg.csis.hku.hk/
JESSICA2 – A Distributed JVM [Figure: a multithreaded Java program runs across a cluster of JVMs; threads migrate between JVMs, which together provide a Global Object Space.]
JUMP Software DSM • Allows programmers to assume a globally shared virtual memory, even when their programs execute on nodes that do not physically share memory • The DSM system maintains memory consistency among the different machines; data faulting, location, and movement are handled by the DSM [Figure: processors 1..N, each with its local memory, connected by a network and presenting a globally shared virtual memory.]
HKU DP-II on Gigabit Ethernet • Single-trip latency test: min 16.3 µs • Bandwidth test: max 79.5 MB/s • For comparison, RWCP GigaE PM: 48.3 µs round-trip latency and 56.7 MB/s on an Essential Gigabit Ethernet NIC (Pentium II 400 MHz); RWCP GigaE PM II: 44.6 µs round-trip time and 98.2 MB/s bandwidth on a Packet Engines G-NIC II connecting Compaq XP-1000s (Alpha 21264 at 500 MHz)
SPARKLE Project • A dynamic software architecture for pervasive computing -- "Computing in Small" [Figure: applications distributed as monolithic blocks won't fit on small devices; our component-based solution splits applications into facets.]
SPARKLE Project [Figure: overview of the proposed software architecture -- clients (Linux + JVM) issue facet queries to intelligent proxies for facet retrieval; facet servers are run by service providers; execution servers and a computational grid support delegation/mobile code; co-operative caching supports user mobility, and clients interact peer-to-peer.]
Q&A For more information, please visit http://www.csis.hku.hk/~clwang