Tackling Challenges of Scale in Highly Available Computing Systems Ken Birman Dept. of Computer Science Cornell University
Members of the group • Ken Birman • Robbert van Renesse • Einar Vollset • Krzysztof Ostrowski • Mahesh Balakrishnan • Maya Haridasan • Amar Phanishayee
Our topic • Computing systems are growing • … larger, • … and more complex, • … and we are hoping to use them in a more and more “unattended” manner • Peek under the covers of the toughest, most powerful systems that exist • Then ask: Can we discern a research agenda?
Some “factoids” • Companies like Amazon, Google, eBay are running data centers with tens of thousands of machines • Credit card companies, banks, brokerages, insurance companies close behind • Rate of growth is staggering • Meanwhile, a new rollout of wireless sensor networks is poised to take off
How are big systems structured? • Typically a “data center” of web servers • Some human-generated traffic • Some automatic traffic from WS clients • The front-end servers are connected to a pool of clustered back-end application “services” • All of this load-balanced, multi-ported • Extensive use of caching for improved performance and scalability • Publish-subscribe very popular
A glimpse inside eStuff.com • "Front-end applications" feed a tier of load balancers (LB), which route requests to clustered back-end services • Pub-sub combined with point-to-point communication technologies like TCP
Hierarchy of sets • A set of data centers, each having • A set of services, each structured as • A set of partitions, each consisting of • A set of programs running in a clustered manner on • A set of machines … raising the obvious question: how well do platforms support hierarchies of sets?
A RAPS of RACS (Jim Gray) • RAPS: a reliable array of partitioned subservices • RACS: a reliable array of cloned server processes • Example: Ken Birman searching for "digital camera" is routed via the pmap entry "B-C": {x, y, z}, a RACS of equivalent replicas; here y gets picked, perhaps based on load
RAPS of RACS in data centers • Services are hosted at data centers (A, B, …) but accessible system-wide to query and update sources • A pmap gives the logical partitioning of each service; an l2P map takes logical services to a physical server/resource pool, perhaps many-to-one • Operators can control the pmap, the l2P map, and other parameters • Large-scale multicast is used to disseminate updates
Technology needs? • Programs will need a way to • Find the “members” of the service • Apply the partitioning function to find contacts within a desired partition • Dynamic resource management, adaptation of RACS size and mapping to hardware • Fault detection • Within a RACS we also need to: • Replicate data for scalability, fault tolerance • Load balance or parallelize tasks
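To make the partition-lookup idea concrete, here is a minimal sketch of how a client-side library might apply a partitioning function and then choose one replica within a RACS by load. The names (PartitionMap, partition_for, pick_replica) and the load figures are illustrative placeholders, not an actual Cornell API:

```python
import hashlib

class PartitionMap:
    """Illustrative RAPS-of-RACS lookup: a partition map ("pmap") takes a
    request key to a partition, and each partition is a RACS, i.e. a set of
    equivalent replicas from which one is picked (here, the least loaded)."""

    def __init__(self, partitions):
        # partitions: dict of partition id -> list of (replica, reported load)
        self.partitions = partitions

    def partition_for(self, key):
        # Hash the key onto one of the partitions (the "pmap" step).
        h = int(hashlib.sha1(key.encode()).hexdigest(), 16)
        ids = sorted(self.partitions)
        return ids[h % len(ids)]

    def pick_replica(self, key):
        # Within the chosen RACS, pick a member, e.g. by lowest reported load.
        replicas = self.partitions[self.partition_for(key)]
        return min(replicas, key=lambda r: r[1])[0]

# Hypothetical pmap: partition "B-C" is a RACS of equivalent replicas x, y, z.
pmap = PartitionMap({"A":   [("u", 0.4), ("v", 0.9)],
                     "B-C": [("x", 0.7), ("y", 0.2), ("z", 0.5)]})
print(pmap.pick_replica("digital camera"))  # least-loaded replica of the key's partition
```

In a real deployment the pmap and the load figures would themselves be maintained by the membership, monitoring, and resource-management machinery listed above.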
Membership: within a RACS, of the service, of the services in data centers • Communication: point-to-point, multicast • Resource management: the pool of machines, the set of services, the subdivision into RACS • Fault-tolerance, consistency • Scalability makes this hard!
… hard in what sense? • Sustainable workload often drops at least linearly in system size • And this happens because overheads grow worse than linearly (quadratic is common) • Reasons vary… but share a pattern: • Frequency of “disruptive” events rises with scale • Protocols have property that whole system is impacted when these events occur
QuickSilver project • We’ve been building a scalable infrastructure addressing these needs • Consists of: • Some existing technologies, notably Astrolabe, gossip “repair” protocols • Some new technology, notably a new publish-subscribe message bus and a new way to automatically create a RAPS of RACS for time-critical applications
Gossip 101 • Suppose that I know something • I’m sitting next to Fred, and I tell him • Now 2 of us “know” • Later, he tells Mimi and I tell Anne • Now 4 • This is an example of a push epidemic • Push-pull occurs if we exchange data
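A toy simulation, written here purely for illustration, makes the spreading behavior easy to see: each node that knows the rumor pushes it to one randomly chosen peer per round, and the number of rounds needed tracks log2 of the system size.

```python
import math
import random

def push_gossip_rounds(n, seed=None):
    """Return how many rounds a push epidemic needs to reach all n nodes,
    when every informed node pushes to one uniformly random peer per round."""
    rng = random.Random(seed)
    infected = {0}                                   # a single initial source
    rounds = 0
    while len(infected) < n:
        pushes = {rng.randrange(n) for _ in range(len(infected))}
        infected |= pushes
        rounds += 1
    return rounds

for n in (100, 1_000, 10_000):
    print(f"n={n:6d}  rounds={push_gossip_rounds(n, seed=42):3d}  "
          f"log2(n)={math.log2(n):4.1f}")
```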
Gossip scales very nicely • Participants' loads independent of size • Network load linear in system size • Information spreads in log(system size) time • [Plot: fraction of nodes infected climbs from 0.0 to 1.0 over time]
Gossip in distributed systems • We can gossip about membership • Need a bootstrap mechanism, but then discuss failures, new members • Gossip to repair faults in replicated data • “I have 6 updates from Charlie” • If we aren’t in a hurry, gossip to replicate data too
Bimodal Multicast • Send multicasts to report events; some messages don't get through • Periodically, but not synchronously, gossip about messages • A typical exchange: "You seem to be missing two messages from Charlie that I have; here they are. And I'm missing Mimi's 7th message; could you send me a copy?" ("The meeting of our Q exam study group will start late on Wednesday…") • ACM TOCS 1999
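The gossip exchange above can be sketched as a digest comparison followed by retransmission in both directions. The Node class below is a toy illustration under that assumption, not the ACM TOCS 1999 implementation:

```python
class Node:
    """Toy sketch of bimodal multicast's gossip repair (not the real protocol):
    each node buffers received multicasts keyed by (sender, seqno) and
    periodically compares digests with a peer so both can pull what they miss."""

    def __init__(self, name):
        self.name = name
        self.buffer = {}                  # (sender, seqno) -> payload

    def deliver(self, sender, seqno, payload):
        self.buffer[(sender, seqno)] = payload

    def digest(self):
        return set(self.buffer)

    def gossip_with(self, peer):
        # Exchange digests, then each side retransmits what the other lacks.
        mine, theirs = self.digest(), peer.digest()
        for key in mine - theirs:         # e.g. the two messages from Charlie
            peer.deliver(*key, self.buffer[key])
        for key in theirs - mine:         # e.g. Mimi's 7th message
            self.deliver(*key, peer.buffer[key])

a, b = Node("A"), Node("B")
a.deliver("Charlie", 3, "...")
a.deliver("Charlie", 4, "...")
b.deliver("Mimi", 7, "The meeting of our Q exam study group will start late on Wednesday")
a.gossip_with(b)
assert a.digest() == b.digest()           # both now hold all three messages
```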
Stock Exchange Problem: reliable multicast is too "fragile" • Most members are healthy… but one is slow
The problem gets worse as the system scales up • [Plot: virtually synchronous Ensemble multicast protocols; average throughput on non-perturbed members vs. perturb rate (0 to 0.9) for group sizes 32, 64, and 96; throughput drops as perturb rate and group size increase]
Bimodal multicast with perturbed processes • [Plot: bimodal multicast scales well, while traditional multicast throughput collapses under stress]
Bimodal Multicast • Imposes a constant overhead on participants • Many optimizations and tricks needed, but nothing that isn't practical to implement • Hardest issues involve "biased" gossip to handle LANs connected by WAN long-haul links • Reliability is easy to analyze mathematically using epidemic theory • Use the theory to derive optimal parameter settings • Theory also lets us predict behavior • Despite the simplified model, the predictions work!
Kelips • A distributed “index” • Put(“name”, value) • Get(“name”) • Kelips can do lookups with one RPC, is self-stabilizing after disruption
Kelips • Take a collection of "nodes" (e.g., 110, 230, 202, 30)
Kelips • Map nodes to affinity groups 0, 1, 2, …, √N−1: peer membership through a consistent hash • Roughly N/√N members per affinity group
Kelips • Node 110 knows about the other members of its own affinity group (230, 30, …): its affinity group view, held as affinity group pointers
Kelips • Node 202 is a "contact" for 110 in affinity group 2 • Contacts are pointers into the other affinity groups
Kelips • "cnn.com" maps to affinity group 2, so node 110 tells group 2 to "route" inquiries about cnn.com to it • The resulting resource tuples are replicated cheaply within the group by a gossip protocol
Kelips • To look up "cnn.com", just ask some contact in group 2; it returns "110" (or forwards your request) • IP2P, ACM TOIS (submitted)
Kelips • Per-participant loads are constant • Space required grows as O(√N) • Finds an object in “one hop” • Most other DHTs need log(N) hops • And isn’t disrupted by churn, either • Most other DHTs are seriously disrupted when churn occurs and might even “fail”
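A small sketch of the Kelips idea (simplified and hypothetical; the real system gossips membership, contacts, and resource tuples as soft state) shows why a lookup needs only one hop: the key hashes to an affinity group, and any contact in that group already holds the replicated resource tuple.

```python
import hashlib
import math

def h(x, buckets):
    """Hash a string onto one of `buckets` affinity groups."""
    return int(hashlib.sha1(x.encode()).hexdigest(), 16) % buckets

class KelipsSketch:
    """Toy Kelips-style index: nodes hash into ~sqrt(N) affinity groups, and a
    resource tuple such as "cnn.com" -> home node is replicated within the
    group that its key hashes to. The per-group dict below stands in for the
    tuples that gossip would replicate to every member of that group."""

    def __init__(self, nodes):
        self.k = max(1, round(math.sqrt(len(nodes))))
        self.groups = [[] for _ in range(self.k)]        # affinity group views
        for n in nodes:
            self.groups[h(str(n), self.k)].append(n)
        self.tuples = [{} for _ in range(self.k)]        # per-group resource tuples

    def insert(self, key, home_node):
        self.tuples[h(key, self.k)][key] = home_node

    def lookup(self, key):
        # "One hop": ask any contact in the key's affinity group.
        return self.tuples[h(key, self.k)].get(key)

idx = KelipsSketch([30, 110, 202, 230])
idx.insert("cnn.com", 110)        # node 110 advertises that it serves cnn.com
print(idx.lookup("cnn.com"))      # -> 110, with no multi-hop routing
```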
Astrolabe: Distributed Monitoring • [Table: one row per machine, with numeric attributes such as load (1.9, 2.1, 0.8, …)] • A row can have many columns • Total size should be kilobytes, not megabytes • A configuration certificate determines what data is pulled into the table (and can change) • ACM TOCS 2003
State Merge: Core of the Astrolabe epidemic • [Figure: two nodes, swift.cs.cornell.edu and cardinal.cs.cornell.edu, exchange and merge their rows epidemically]
Scaling up… and up… • With a stack of domains, we don't want every system to "see" every domain • Cost would be huge • So instead, we'll see a summary
Build a hierarchy using a P2P protocol that "assembles the puzzle" without any servers • An SQL query "summarizes" the data • Dynamically changing query output is visible system-wide • [Figure: leaf domains in New Jersey and San Francisco roll up into the hierarchy]
(1) The query goes out… (2) each domain computes locally… (3) results flow to the top level of the hierarchy • [Figure: the three steps shown across the New Jersey and San Francisco domains]
The hierarchy is virtual… the data is replicated • ACM TOCS 2003
Astrolabe • Load on participants, in the worst case, grows as log_rsize(N), i.e., the log of N taken with the region size as the base • Most participants see a constant, low load • Incredibly robust, self-repairing • Information becomes visible in log time • And we can reconfigure or change the aggregation query in log time, too • Well matched to data mining
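The following sketch is illustrative only (it reuses the load values from the monitoring table above, and the domain and column names are made up); it shows the aggregation pattern: each domain reduces its rows to a summary with an SQL-style aggregate, and only summaries roll up toward the root.

```python
# Toy sketch of Astrolabe-style aggregation (not the real system): each leaf
# domain holds one row per machine; an aggregation function, standing in for
# the configured SQL summary query, reduces a domain's rows to one summary
# row, and only summaries flow up the virtual hierarchy.

leaf_domains = {
    "new_jersey":    [{"host": "nj1", "load": 1.9}, {"host": "nj2", "load": 3.1}],
    "san_francisco": [{"host": "sf1", "load": 0.8}, {"host": "sf2", "load": 5.3}],
}

def summarize(rows):
    # e.g. SELECT MIN(load) AS min_load, COUNT(*) AS machines
    return {"min_load": min(r["load"] for r in rows), "machines": len(rows)}

# Each domain computes its summary locally...
domain_summaries = {name: summarize(rows) for name, rows in leaf_domains.items()}

# ...and the root aggregates the summaries, never the raw rows.
root = {
    "min_load": min(s["min_load"] for s in domain_summaries.values()),
    "machines": sum(s["machines"] for s in domain_summaries.values()),
}
print(domain_summaries)
print(root)   # {'min_load': 0.8, 'machines': 4}
```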
QuickSilver: Current work • One goal is to offer scalable support for: • Publish(“topic”, data) • Subscribe(“topic”, handler) • Topic associated w/ protocol stack, properties • Many topics… hence many protocol stacks (communication groups) • Quicksilver scalable multicast is running now and demonstrates this capability in a web services framework • Primary developer is Krzys Ostrowski
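Here is a minimal in-process stand-in for that publish/subscribe surface; the TopicBus class and its method signatures are invented for illustration and are not QuickSilver's API:

```python
from collections import defaultdict
from typing import Any, Callable

class TopicBus:
    """In-process stand-in for a scalable pub-sub bus: each topic is associated
    with its own protocol stack (represented here only by a properties label),
    and publish() fans data out to every handler registered on the topic."""

    def __init__(self):
        self.handlers = defaultdict(list)
        self.stacks = {}

    def subscribe(self, topic: str, handler: Callable[[Any], None],
                  properties: str = "reliable") -> None:
        self.stacks.setdefault(topic, properties)   # topic -> protocol stack
        self.handlers[topic].append(handler)

    def publish(self, topic: str, data: Any) -> None:
        for handler in self.handlers.get(topic, []):
            handler(data)

bus = TopicBus()
bus.subscribe("quotes/IBM", lambda d: print("tick:", d))
bus.publish("quotes/IBM", {"price": 98.5})
```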
Tempest • This project seeks to automate a new drag-and-drop style of clustered application development • Emphasis is on time-critical response • You start with a relatively standard web service application having good timing properties (inheriting from our data class) • Tempest automatically clones services, places them, load-balances, repairs faults • Uses Ricochet protocol for time-critical multicast
Ricochet • Core protocol underlying Tempest • Delivers a multicast with • Probabilistically strong timing properties • Three orders of magnitude faster than prior record! • Probability-one reliability, if desired • Key idea is to use FEC and to exploit patterns of numerous, heavily overlapping groups. • Available for download from Cornell as a library (coded in Java)
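As a hedged illustration of the FEC idea (not Ricochet's actual encoding), an XOR parity packet computed over a small window of data packets lets a receiver repair a single loss locally instead of waiting for a retransmission:

```python
def xor_parity(packets):
    """XOR repair packet over a window of equal-length data packets."""
    parity = bytearray(len(packets[0]))
    for packet in packets:
        for i, byte in enumerate(packet):
            parity[i] ^= byte
    return bytes(parity)

def recover(survivors, parity):
    """Reconstruct the single missing packet from the survivors plus parity."""
    return xor_parity(survivors + [parity])

window = [b"pkt-0001", b"pkt-0002", b"pkt-0003"]
repair = xor_parity(window)            # sent alongside the data packets

survivors = [window[0], window[2]]     # suppose pkt-0002 was lost
print(recover(survivors, repair))      # b'pkt-0002', repaired with no retransmit
```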
Our system will be used in… • Massive data centers • Distributed data mining • Sensor networks • Grid computing • Air Force “Services Infosphere”
Next major project? • We're starting a completely new effort • Goal is to support a new generation of mobile platforms that can collaborate, learn, and query a surrounding mesh of sensors using wireless ad-hoc communication • Stefan Pleisch has worked on the mobile query problem. Einar Vollset and Robbert van Renesse are building the new mobile platform software. Epidemic gossip remains our key idea…
Summary • Our project builds software • Software that real people will end up running • But we tell users when it works and prove it! • The focus lately is on scalability and QoS • Theory, engineering, experiments and simulation • For scalability, set probabilistic goals, use epidemic protocols • But outcome will be real systems that we believe will be widely used.