Scalability & Availability

Scalability & Availability Paul Greenfield CSIRO

Building Real Systems • Scalable • Fast enough to handle expected load • Grow easily when load grows • Available • Available enough of the time • Performance and availability cost • Aim for ‘enough’ of each but not more

Scalable • Scale-up • Bigger and faster systems • Scale-out • Systems working together to handle load • Server farms • Clusters • Implications for application design

Available • Goal is 100% availability • 24x7 operations • Redundancy is the key • No single points of failure • Spare everything • Disks, disk channels, processors, power supplies, fans, memory, .. • Automated fail-over and recovery
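A rough way to see why redundancy matters is to compare components in series (every one must work) with redundant copies (any one is enough). The sketch below is an illustration only: it assumes failures are independent, which real systems only approximate, and the function names and the 99% figure are made up for the example.

```python
def series_availability(parts):
    """Availability when every part must be up: multiply the availabilities."""
    result = 1.0
    for availability in parts:
        result *= availability
    return result


def redundant_availability(availability, copies):
    """Availability when at least one of `copies` identical,
    independently failing parts must be up (assumed independence)."""
    return 1.0 - (1.0 - availability) ** copies


# Two 99% components in a chain are worse than either one alone...
chained = series_availability([0.99, 0.99])
# ...while two redundant 99% components are much better.
mirrored = redundant_availability(0.99, 2)
print(f"chained: {chained:.4f}, mirrored: {mirrored:.4f}")
```

A single point of failure sits in the series part of this calculation, which is why it caps the availability of everything behind it.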

Performance • How fast is this system? • Not the same as scalability but related • Scalability is concerned with the limits to possible performance • Measured by response time and throughput • Aim for enough performance • Have a performance target • Tune and add hardware until target hit • Then worry about tomorrow…

Performance Measures • Response time • What delay does the user see? • Instantaneous is good but 95% under 2 seconds is acceptable • Response time varies with ‘heaviness’ of transactions • Fast read-only transactions • Slower update transactions • Effects of database contention
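A target such as "95% under 2 seconds" can be checked against measured response times with a percentile calculation. The following is a minimal sketch using the nearest-rank method; the sample data and function names are hypothetical.

```python
import math


def percentile(samples, pct):
    """Nearest-rank percentile: the smallest sample such that at least
    pct% of all samples are less than or equal to it."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def meets_target(samples, pct=95, limit_secs=2.0):
    """True if pct% of the response times are within limit_secs."""
    return percentile(samples, pct) <= limit_secs


# Hypothetical measurements in seconds: many fast read-only transactions,
# some slower updates, and a few that hit database contention.
response_times = [0.2] * 90 + [1.5] * 6 + [4.0] * 4
print(percentile(response_times, 95), meets_target(response_times))
```

Here the slowest 4% of transactions take 4 seconds, yet the 95% target is still met, which shows why the target names a percentile rather than a maximum.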

Response Times

Throughput • How many transactions can be handled in some period of time • Transactions/second or tpm, tph or tpd • A measure of overall capacity • Transaction Processing Performance Council (TPC) • Standard benchmarks for TP systems • TPC-C for typical transaction system • www.tpc.org • Current record is 227,000 tpmC

Throughput • Throughput increases until some resource limit is hit • Adding more clients just increases the response time • Run out of processor, disk bandwidth, network bandwidth • Some resources overload badly • Ethernet network performance degrades
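The saturation behaviour described above can be sketched with a very simple model: each client thinks for a while, then issues a request that needs a fixed amount of time on one bottleneck resource. This is an idealised bound, not a measurement, and the service and think times are assumed values chosen for the example.

```python
def throughput(clients, service_secs, think_secs):
    """Upper bound on transactions/second for a given number of clients."""
    unsaturated = clients / (service_secs + think_secs)  # every client served at once
    ceiling = 1.0 / service_secs                         # bottleneck resource is full
    return min(unsaturated, ceiling)


def response_time(clients, service_secs, think_secs):
    """Response time implied by the throughput bound (Little's law)."""
    return clients / throughput(clients, service_secs, think_secs) - think_secs


# Assumed figures: 50 ms of bottleneck time per transaction, 950 ms think time.
for n in (10, 20, 40, 80):
    x = throughput(n, 0.05, 0.95)
    r = response_time(n, 0.05, 0.95)
    print(f"{n:3d} clients: {x:5.1f} tps, {r:5.2f} s response time")
```

Under these assumptions throughput stops growing at 20 transactions/second, and every client added beyond that point only lengthens the response time.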

Throughput

System Capacity • How many clients can you support? • Name an acceptable response time • Average 95% under 2 secs is common • And what is ‘average’? • Plot response time vs # of clients • Great if you can run benchmarks • Reason for prototyping and proving proposed architectures before leaping into full-scale implementation

System Capacity

Load Balancing I • A few different but related meanings • 1. Balancing across server processes • CORBA-style where clients use objects that live inside server processes • Want all server processes to be busy • Client calls have to go to the process containing their object, even if this process is busy and others are idle

Load Balancing I

Load Balancing I • Client calls on name server to find the location of a suitable server • Name server can spread client objects across multiple servers • Often ‘round robin’ • Client is bound to server and stays bound forever • Can lead to performance problems

Load Balancing I • Clients per server object under static allocation (the slide labels its two tables "Initial" and "Later")
Even allocation:
Server object reference | Client numbers | Total clients per server object
1 | 1-100 | 100
2 | 101-200 | 100
3 | 201-300 | 100
4 | 301-400 | 100
5 | 401-500 | 100
Skewed allocation:
Server object reference | Client numbers | Total clients per server object
1 | 1-100, 201, 206, 211, …, 496 | 160
2 | 101-200, 202, 207, 212, …, 497 | 160
3 | 203, 208, 213, …, 498 | 60
4 | 204, 209, 214, …, 499 | 60
5 | 205, 210, 215, …, 500 | 60

Load Balancing I • Solution to static allocation problem is for clients to throw away their server objects and get new ones every now and again • Application coding problem • And can objects be discarded? • What kind of ‘objects’ are they if they can be discarded?
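The skew in the allocation table, and the effect of clients rebinding, can be reproduced with a short simulation. This is a sketch under assumptions: it supposes clients 1-200 are already bound to server objects 1 and 2 and never move, while clients 201-500 are handed out round robin across all five. The function names are hypothetical.

```python
from collections import Counter


def bind_round_robin(client_ids, server_ids, bindings=None):
    """Bind each client to the next server in rotation.
    Existing bindings are copied and never changed."""
    bindings = dict(bindings or {})
    for i, client in enumerate(client_ids):
        bindings[client] = server_ids[i % len(server_ids)]
    return bindings


def clients_per_server(bindings):
    """Count how many clients are bound to each server."""
    return Counter(bindings.values())


servers = [1, 2, 3, 4, 5]

# Assumed starting point: clients 1-200 stay bound to servers 1 and 2.
existing = {client: 1 for client in range(1, 101)}
existing.update({client: 2 for client in range(101, 201)})

# Clients 201-500 are spread round robin across all five servers.
skewed = clients_per_server(bind_round_robin(range(201, 501), servers, existing))

# If every client discards its reference and asks again, the load evens out.
rebalanced = clients_per_server(bind_round_robin(range(1, 501), servers))

print(sorted(skewed.items()))
print(sorted(rebalanced.items()))
```

The rebinding step is the part that needs application support: it only works if a client can safely discard its server object and ask for a new one.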

Name Servers • Server processes call name server when they come up • Advertising their services • Clients call name server to find the location of a server process • Up to the name server to match clients to servers • Client calls server process to create objects

Load Balancing I • [Figure: load balancing across processes within a server. Server processes advertise their services to the name server; a client requests and receives a server reference from the name server, gets a server object reference from the server process, then calls the server object’s methods.]
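The advertise and lookup steps can be sketched as a small in-memory registry. This is an illustration only, not the CORBA naming service API: the class, method names and addresses are hypothetical, and round robin is just one of the matching policies a name server might use.

```python
class NameServer:
    """Toy name server: servers advertise, clients resolve (round robin)."""

    def __init__(self):
        self._servers = {}  # service name -> list of server addresses
        self._cursors = {}  # service name -> number of references handed out

    def advertise(self, service, address):
        """Called by a server process when it comes up."""
        self._servers.setdefault(service, []).append(address)

    def resolve(self, service):
        """Called by a client to find a server process for a service."""
        servers = self._servers.get(service)
        if not servers:
            raise LookupError(f"no server advertises {service!r}")
        index = self._cursors.get(service, 0)
        self._cursors[service] = index + 1
        return servers[index % len(servers)]


names = NameServer()
names.advertise("orders", "host-a:9001")  # hypothetical addresses
names.advertise("orders", "host-b:9001")

# Each client resolves once, then talks to its server process directly.
handed_out = [names.resolve("orders") for _ in range(3)]
print(handed_out)
```

The name server is consulted only when a reference is handed out, so it stays out of the path of ordinary method calls.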

Load Balancing II • What happens when our single system is full? • Use faster systems • Scale-up • Use additional systems • Scale-out • Now load-balancing is used to spread load across systems

Load Balancing II • CORBA world… • Name server can distribute across server processes running on different systems • Scales well… • Name server only involved when handing out a reference to a server, not on every method call

Load Balancing II • [Figure: load balancing across multiple systems. Server processes on different systems advertise their services to the name server; a client requests and receives a server reference from the name server, gets a server object reference from the server process, then calls the server object’s methods.]

Load Balancing II • COM+ world… • No need for load-balancing within a system • Multithreaded server process • All objects live in a single process space • Component load balancing across systems • Client calls router when creating object • Router returns reference to an object in a COM+ server process • Load balanced at time of object creation

Load Balancing II • [Figure: COM+/MTS using thread pools rather than load balancing within a single system. Clients call over DCOM into a single MTS process that holds a thread pool, a shared object space and the application code (App DLL).]

COM+ Component Load Balancing • [Figure: COM+ CLB balancing load across multiple systems. A client asks the router to create an object; the router, which includes a response time tracker, passes the request to a server; the server creates the object and passes back a reference; the client then calls the object’s methods.]

Load Balancing II • COM+ scales well… • Router only involved when object is created • May change in later release to support dynamic re-balancing as server load changes • Method calls direct from client to server • Allocation based on response time rather than round-robin • Allocate to least-loaded server
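Allocation by response time rather than round robin can be sketched as a router that tracks a running average per server and picks the fastest one at object creation time. This is not the COM+ CLB implementation: the class, the smoothing factor and the use of an exponentially weighted average are assumptions made for the example.

```python
class ResponseTimeRouter:
    """Toy router: create each new object on the server that currently
    shows the lowest smoothed response time."""

    def __init__(self, servers, smoothing=0.2):
        self._average = {server: 0.0 for server in servers}
        self._smoothing = smoothing  # assumed weight given to each new sample

    def record(self, server, secs):
        """Fold a measured response time into the server's running average."""
        old = self._average[server]
        self._average[server] = old + self._smoothing * (secs - old)

    def create_object(self):
        """Choose a server for a new object. Ties go to the first server listed."""
        return min(self._average, key=self._average.get)


router = ResponseTimeRouter(["a", "b", "c"])
router.record("a", 0.5)
router.record("b", 0.1)
router.record("c", 0.3)
first_choice = router.create_object()

router.record("b", 2.0)  # server b slows down under load
second_choice = router.create_object()
print(first_choice, second_choice)
```

Because the choice is made only when an object is created, objects already placed on a server that later slows down stay there, which is the gap dynamic re-balancing would address.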

Load Balancing II • No name server in COM world? • COM/MTS clients ‘know’ the name of the server • Set at client installation time • Can change using GUI tools • Admin problem if server app is moved • COM+ uses Active Directory to find services

Load Balancing II • Some systems involve the router in every method call/request • Request goes to router process, which then passes it on to a server process • Scales poorly as the router can be a major bottleneck • Some availability concerns as well • What happens if the router fails?

Load Balancing II • [Figure: load balancing with router in main call path. Clients send every request to the router, which passes it on to one of the server processes.]

Scale-up • No need for load-balancing across systems • Just use a bigger box • Add processors, memory, …. • SMP (symmetric multiprocessing) • Runs into limits eventually • Could be less available

Scale-up • Example from the Web • Large auction site • Server farm of NT boxes (scale-out) • Single database server (scale-up) • 64-processor SUN box • More capacity needed? • Add more NT boxes easily • SUN box is full so have to shift some databases to another box

Clusters • A group of independent computers acting like a single system • Shared disks • Single IP address • Single set of services • Fail-over to other members of cluster • Load sharing within the cluster • DEC, IBM, MS, …

Clusters • [Figure: client PCs connect to Server A and Server B, which are linked by heartbeat and cluster management connections and attached to Disk cabinet A and Disk cabinet B.]

Clusters • Address scalability • Add more boxes to the cluster • Address availability • Fail-over • Add & remove boxes from the cluster for upgrades and maintenance • Can be used as one element of a highly-available system

Web Server Farms • Web servers are highly scalable • Web applications are normally stateless • Next request can go to any Web server • State comes from client or database • Just need to spread incoming requests • IP sprayers (hardware, software) • >1 Web server looking at same IP address with some coordination (see MS WLB docs) • Same technique for other network apps

Available System • [Figure: Web clients connect to Web servers that are load balanced using Convoy; App servers use COM+ LB through a COM+ LBS router node; the database is installed on a Wolfpack cluster for high availability.]

Availability • How much? • 99%: 87.6 hours of downtime a year • 99.9%: 8.76 hours a year • 99.99%: 0.876 hours (about 53 minutes) a year • Need to consider operations as well • Maintenance, software upgrades, backups, application changes • Not just faults and recovery time
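The downtime figures follow directly from the number of hours in a year. A minimal sketch of the arithmetic, assuming a 365-day year and a hypothetical function name:

```python
HOURS_PER_YEAR = 365 * 24  # 8760 hours, ignoring leap years


def downtime_hours_per_year(availability_pct):
    """Hours per year a system is down at the given availability percentage."""
    return (100 - availability_pct) / 100 * HOURS_PER_YEAR


for pct in (99, 99.9, 99.99):
    print(f"{pct}% available: {downtime_hours_per_year(pct):.3f} hours down per year")
```

Planned work such as maintenance, upgrades and backups has to fit inside the same budget as faults, so a 99.99% target leaves under an hour a year for everything.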

Availability and Scalability • Often a question of application design • Stateful vs stateless • What happens if a server fails? • Can requests go to any server? • What language and database API • Balance cost vs speed – VB/C++ - ODBC/ADO • Synchronous method calls or asynchronous messaging? • Reduce dependency between components • Failure tolerant designs

Next Week • Distributed application architectures • How to design systems that will work, scale and be available • Web-based systems • Web technology
