Scalability Terminology: Farms, Clones, Partitions, and Packs: RACS and RAPS
Bill Devlin, Jim Gray, Bill Laing, George Spix
Microsoft Research, Dec. 1999
Based on a presentation by Hongwei Zhang, Ohio State
Outline
• Introduction
• Basic scalability terminology / techniques
• Software requirements for these scalable systems
• Cost/performance metrics to consider
• Summary
Why Scale?
• Server systems must be able to start small
  • small company (garage-scale) vs. international company (kingdom-scale)
• The system should be able to grow as demand grows, e.g.
  • eCommerce has made system growth more rapid and dynamic
  • ASPs (application service providers) also need dynamic growth
How to scale?
• Scale up: expand a system by incrementally adding more devices to an existing node (CPUs, disks, NICs, etc.)
  • inherently limited
• Scale out: expand the system by adding more nodes
  • convenient (computing capacity can be purchased incrementally), with no theoretical scalability limit
  • slogans: "Buy computing by the slice", "Build systems from CyberBricks"
  • slices and CyberBricks are the fundamental building blocks of a scalable system
Basic terminology associated with scale-out
• Ways to organize a massive computation
  • Farm
  • Geoplex
• Ways to scale a farm
  • Clone (RACS)
  • Partition (RAPS)
  • Pack
Farm / Geoplex
• Farm: the collection of servers, applications, and data at a particular site
  • features:
    • functionally specialized services (email, WWW, directory, database, etc.)
    • administered as a unit (common staff, management policies, facilities, networking)
• Geoplex: a farm replicated at two or more sites
  • provides disaster protection
  • may be:
    • active-active: all farms carry some of the load
    • active-passive: one or more farms are hot-standbys (waiting for fail-over of the corresponding active farms)
Clone
• A replica of a server or a service
• Allows load balancing, which may be:
  • external to the clones: an IP sprayer such as Cisco LocalDirector™
    • the sprayer dispatches (sprays) requests across the clones to balance the load
  • internal to the clones: an IP sieve such as Network Load Balancing in Windows 2000 (see the sketch below)
    • every request arrives at every node, but each node intelligently accepts only a share of the requests
    • requires distributed coordination among the nodes
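A minimal sketch of the IP-sieve idea: every node applies the same deterministic filter to each incoming connection, so exactly one node accepts it without any central dispatcher. The class name and hash rule are illustrative assumptions, not the actual Windows 2000 NLB algorithm.

```python
import hashlib

class SieveNode:
    """One clone in an IP-sieve cluster: all nodes see every request,
    but only one of them accepts each connection."""

    def __init__(self, my_index: int, cluster_size: int):
        self.my_index = my_index
        self.cluster_size = cluster_size

    def accepts(self, client_ip: str, client_port: int) -> bool:
        # Deterministic filter: every node computes the same bucket,
        # so exactly one node accepts each connection.
        key = f"{client_ip}:{client_port}".encode()
        bucket = int(hashlib.md5(key).hexdigest(), 16) % self.cluster_size
        return bucket == self.my_index

# Every node runs the same filter; for any request, exactly one accepts.
nodes = [SieveNode(i, 3) for i in range(3)]
assert sum(n.accepts("10.0.0.7", 51234) for n in nodes) == 1
```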
RACS
• Definition:
  • the collection of clones for a particular service is called a RACS (Reliable Array of Cloned Services)
• Two types of RACS (Fig. 2):
  • shared-nothing RACS
    • each node duplicates all the storage locally
  • shared-disk RACS (also called a cluster)
    • all the nodes (clones) share a common storage manager: stateless servers at different nodes access a common backend storage server
RACS (contd.)
• Advantages of cloning and RACS
  • offers both scalability and availability
    • scalability: an excellent way to add processing power, network bandwidth, and storage bandwidth to a farm
    • nodes can act as backups for one another: if one node fails, the others continue to offer service (possibly with degraded performance)
    • failures can be masked if node- and application-failure detection is integrated with the load-balancing system or with the client applications
  • easy to manage
    • administrative operations on the service instance at one node can be replicated to all the others
RACS (contd.)
• Challenges
  • shared-nothing RACS
    • not a good way to grow storage capacity: updates at one node must be applied to every other node's storage
    • problematic for write-intensive services: all clones must perform all writes, so there is no write-throughput improvement, and subtle coordination is needed (illustrated below)
  • shared-disk RACS can ameliorate (to some extent) this cost and complexity of cloned storage, but:
    • the storage server must be fault-tolerant for availability (there is only one copy of the data)
    • subtle algorithms are still required to manage updates (cache invalidation, lock managers, transaction logs, etc.)
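A toy model of the write-amplification problem, assuming an in-memory dictionary stands in for each clone's local storage (a hypothetical simplification): reads spread across clones, but every write must touch every copy.

```python
class SharedNothingRACS:
    """Toy shared-nothing RACS: each clone holds a full copy of the
    data, so reads scale with the clone count while writes hit all."""

    def __init__(self, n_clones: int):
        self.clones = [{} for _ in range(n_clones)]
        self._rr = 0  # round-robin cursor for reads

    def read(self, key):
        # Any single clone can serve a read: capacity grows with clones.
        clone = self.clones[self._rr % len(self.clones)]
        self._rr += 1
        return clone.get(key)

    def write(self, key, value):
        # Every clone must apply the write: no write-throughput gain,
        # and a real system also needs coordination for consistency.
        for clone in self.clones:
            clone[key] = value

racs = SharedNothingRACS(n_clones=4)
racs.write("item", 42)          # touches all 4 copies
assert racs.read("item") == 42  # served by a single copy
```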
Partition
• Definition:
  • grow a service by duplicating the hardware and software but dividing the data among the nodes (Fig. 3)
• Features
  • only the software is cloned; the data is divided among the nodes (unlike a shared-nothing clone)
  • transparent to applications
  • simple partitioning keeps only one copy of the data, so it does not improve availability:
    • use a geoplex to guard against loss of storage, or
    • more commonly, locally duplex (RAID 1) or parity protect (RAID 5) the storage
Partition (contd.)
• Example
  • typically, the application middleware partitions the data and workload by object:
    • mail servers partition by mailboxes
    • sales systems partition by customer accounts or product lines
• Challenge
  • when a partition (node) is added, the data should be automatically repartitioned among the nodes to balance the storage and computational load (one possible approach is sketched below)
  • the partitioning should automatically adapt as new data is added and as the load changes
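One well-known way to limit data movement when a partition is added is consistent hashing; the paper does not prescribe it, so the sketch below is an assumption. Mailboxes map to points on a hash ring, and adding a node moves only the mailboxes that fall into its ring segments.

```python
import bisect
import hashlib

def _h(s: str) -> int:
    return int(hashlib.md5(s.encode()).hexdigest(), 16)

class PartitionRouter:
    """Consistent-hash ring mapping mailboxes to partition nodes.
    Node names and the virtual-node count are illustrative."""

    def __init__(self, nodes, vnodes: int = 64):
        self.vnodes = vnodes
        self.ring = []  # sorted (hash, node) pairs
        for n in nodes:
            self.add_node(n)

    def add_node(self, node: str):
        # Each node owns many small ring segments ("virtual nodes"),
        # so load spreads evenly and rebalancing stays incremental.
        for i in range(self.vnodes):
            bisect.insort(self.ring, (_h(f"{node}#{i}"), node))

    def node_for(self, mailbox: str) -> str:
        keys = [h for h, _ in self.ring]
        i = bisect.bisect(keys, _h(mailbox)) % len(self.ring)
        return self.ring[i][1]

router = PartitionRouter(["node1", "node2", "node3"])
owner = router.node_for("alice@example.com")
# Adding node4 moves only ~1/4 of the mailboxes rather than rehashing all.
router.add_node("node4")
```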
Pack
• Purpose
  • to deal with hardware/software failures at a partition
• Definition:
  • each partition is implemented as a pack of two or more nodes that provide access to the storage (Fig. 3)
Pack (contd.)
• Two types of pack
  • shared-disk pack
    • all members of the pack may access all the disks in the pack
    • similar to a shared-disk clone, except that the pack serves just one part of the total database
  • shared-nothing pack
    • each member of the pack serves just one partition of the disk pool during normal conditions, but takes over a failed partition if that partition's primary server fails
Shared-nothing Pack (contd.)
• Two modes:
  • active-active pack (see the sketch below):
    • each member of the pack has primary responsibility for one or more partitions
    • when a node in the pack fails, service of its partitions migrates to another node of the pack
  • active-passive pack:
    • just one node of the pack actively serves requests while the other nodes act as hot-standbys
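A minimal sketch of active-active failover, assuming an in-memory assignment table (node and partition names are hypothetical; failure detection and storage reattachment are elided):

```python
class ActiveActivePack:
    """Toy active-active shared-nothing pack: each node is primary for
    some partitions; on failure its partitions migrate to survivors."""

    def __init__(self, assignment):
        # assignment: node -> partitions it serves as primary
        self.assignment = {n: set(p) for n, p in assignment.items()}

    def primary_for(self, partition: str) -> str:
        for node, parts in self.assignment.items():
            if partition in parts:
                return node
        raise KeyError(partition)

    def fail_node(self, dead: str):
        orphans = self.assignment.pop(dead)
        for part in orphans:
            # Move each orphaned partition to the least-loaded survivor.
            target = min(self.assignment,
                         key=lambda n: len(self.assignment[n]))
            self.assignment[target].add(part)

pack = ActiveActivePack({"A": {"p1", "p2"}, "B": {"p3", "p4"}})
pack.fail_node("A")
assert pack.primary_for("p1") == "B"  # service migrated; pack still up
```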
RAPS
• Definition:
  • the collection of nodes that supports a packed-partitioned service is called a RAPS (Reliable Array of Partitioned Services)
• Advantages
  • provides both scalability and availability
  • better performance than RACS for write-intensive services
Summary (contd.)
• Clones and RACS
  • for read-mostly applications with low consistency requirements and modest storage (≤ 100 GB)
  • e.g. web / file / security / directory servers
• Partitions and RAPS
  • for update-intensive and large-database applications (requests are routed to specific partitions)
  • e.g. email / instant messaging / ERP / record keeping
Example
• Multi-tier applications (functional separation)
  • front tier: web and firewall services (read mostly)
  • middle tier: file servers (read mostly)
  • data tier: SQL servers (update intensive)
Example (contd.)
• Load balancing and routing at each tier (sketched below)
  • front tier: IP-level load distribution scheme
  • middle tier: data- and process-specific load steering, according to request semantics
  • data tier: routing to the correct partition
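A compact illustration of how the routing decision differs by tier, under assumed names (three web clones, two SQL partitions split by customer initial): any web clone will do, but the data-tier request must reach the partition that owns the customer.

```python
import itertools

web_clones = ["web1", "web2", "web3"]          # front tier: cloned RACS
sql_partitions = ["sql_a_to_m", "sql_n_to_z"]  # data tier: RAPS

_rr = itertools.cycle(web_clones)

def route_front_tier() -> str:
    # IP-level spraying: the clones are interchangeable,
    # so a simple rotation balances the load.
    return next(_rr)

def route_data_tier(customer: str) -> str:
    # Affinity routing: only one partition owns this customer's data
    # (here, a toy split on the first letter of the name).
    return sql_partitions[0] if customer[0].lower() <= "m" else sql_partitions[1]

print(route_front_tier(), route_data_tier("alice"))  # e.g. web1 sql_a_to_m
```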
Software Requirements for Geoplexes, Farms, RACS, and RAPS (more of a wish list than a reflection of current tools and capabilities)
• Ability to manage everything from a single remote console, treating RACS and RAPS as entities
• Automated operations software to handle "normal events" (summarizing, auditing, etc.) and to help the operator manage exceptional events (detecting failures, orchestrating repair, etc.), reducing operator error and thus enhancing site availability
Software requirements (contd.)
• Both the software and hardware components must allow online maintenance and replacement
• Tools to support versioned software deployment and staging across a site
• Good tools to design user interfaces, services, and databases
• Good tools to configure and then load balance the system as it evolves
Software requirements (contd.)
• RACS
  • automatic replication of software and data to new nodes
  • automatic request routing to balance the load and to route around failures
  • recognition of repaired and newly added nodes
• RAPS
  • automatic routing of requests to the nodes dedicated to serving a partition of the data (affinity routing)
  • middleware to provide transparent partitioning and load balancing, as an application-level service (see the sketch below)
  • for packs, manageability features similar to those of a cloned system
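A sketch of what "transparent partitioning" middleware might look like, assuming hypothetical class and method names: the application talks to a stub that exposes a single-server interface and forwards each call to the owning partition.

```python
import hashlib

class InMemoryMailServer:
    """Toy backend standing in for one partition's mail server."""
    def __init__(self):
        self.boxes = {}
    def deliver(self, mailbox: str, message: str):
        self.boxes.setdefault(mailbox, []).append(message)
    def fetch(self, mailbox: str):
        return self.boxes.get(mailbox, [])

class MailStoreStub:
    """Client-side middleware: same interface as a single mail server,
    but each call is routed to the partition that owns the mailbox."""

    def __init__(self, partition_servers):
        self.servers = partition_servers

    def _owner(self, mailbox: str):
        h = int(hashlib.md5(mailbox.encode()).hexdigest(), 16)
        return self.servers[h % len(self.servers)]

    def deliver(self, mailbox: str, message: str):
        # The application never sees the partitioning.
        self._owner(mailbox).deliver(mailbox, message)

    def fetch(self, mailbox: str):
        return self._owner(mailbox).fetch(mailbox)

store = MailStoreStub([InMemoryMailServer() for _ in range(3)])
store.deliver("alice", "hello")
assert store.fetch("alice") == ["hello"]
```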
Price/Performance Metrics
• Why are cloning and partitioning needed?
  • one cannot buy a single 60-billion-instructions-per-second processor or a single 100 TB server
  • so at least some degree of cloning and partitioning is required
• What is the right building block for a site?
Right building block
• Mainframe vendors: the mainframe is the right choice!
  • their hardware and software offer high availability
  • it is easier to manage their systems than to manage cloned PCs
  • but mainframe prices are fairly high: 3x to 10x more expensive
Right building block (contd.)
• Commodity servers and storage
  • it is less costly to use inexpensive clones for CPU-intensive services such as web service
  • commodity software is easier to manage than traditional services that require skilled operators and administrators
• Consensus
  • it is much easier to manage homogeneous sites (all NT, all FreeBSD, etc.) than heterogeneous sites
  • in practice, middleware (such as Netscape, IIS, Notes, Exchange) is where administrators spend most of their time
Summary
• Scalability technique: replicate a service at many nodes
• Simplest form of replication: duplicate both programs and data (RACS)
• For large databases or update-intensive services: partition the data (RAPS)
  • packs make partitions highly available
• Against disaster: replicate the entire farm to form a geoplex