Computing in the Reliable Array of Independent Nodes. Vasken Bohossian, Charles Fan, Paul LeMahieu, Marc Riedel, Lihao Xu, Jehoshua Bruck. California Institute of Technology. IEEE Workshop on Fault-Tolerant Parallel and Distributed Systems. May 5, 2000.
RAIN Project Collaboration: • Caltech’s Parallel and Distributed Computing Group www.paradise.caltech.edu • JPL’s Center for Integrated Space Microsystems www.csmt.jpl.nasa.gov
RAIN Platform: Heterogeneous network of nodes and switches. [Diagram: nodes connected through switches, a bus, and a network.]
RAIN Testbed: 10 Pentium boxes with multiple NICs, 4 eight-way Myrinet switches. www.paradise.caltech.edu
Proof of Concept: Video Server. Video client and server on every node. [Diagram: nodes A, B, C, D connected through two switches.]
Limited Storage: Insufficient storage to replicate all the data on each node.
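k-of-n Code: Erasure-correcting code: recover the data from any k of n columns. Example with four columns, where + is bitwise XOR: column 1 holds a and d+c; column 2 holds b and d+a; column 3 holds c and a+b; column 4 holds d and b+c. From columns 1 and 3 alone: b = (a+b) + a and d = (d+c) + c.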
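The recovery step on this slide can be tried out in a few lines. The sketch below assumes the column layout read off the slide (column 1 stores a and d+c, column 2 stores b and d+a, column 3 stores c and a+b, column 4 stores d and b+c, with + meaning XOR). The names `PARITY`, `encode`, and `decode` are illustrative only and are not taken from the RAIN software.

```python
# Assumed layout: column i stores data symbol i plus the XOR of two other symbols.
# Indices refer to the data list [a, b, c, d].
PARITY = [(3, 2), (3, 0), (0, 1), (1, 2)]  # d+c, d+a, a+b, b+c

def encode(data):
    """Return four columns, each a (data symbol, parity symbol) pair."""
    return [(data[i], data[p] ^ data[q]) for i, (p, q) in enumerate(PARITY)]

def decode(columns):
    """Recover [a, b, c, d] from a dict {column index: (symbol, parity)}."""
    known = {i: sym for i, (sym, _) in columns.items()}
    progress = True
    while len(known) < 4 and progress:
        progress = False
        for i, (_, parity) in columns.items():
            p, q = PARITY[i]
            # A parity x+y yields the missing symbol once the other is known.
            if p in known and q not in known:
                known[q] = parity ^ known[p]
                progress = True
            elif q in known and p not in known:
                known[p] = parity ^ known[q]
                progress = True
    if len(known) < 4:
        raise ValueError("not enough columns to decode")
    return [known[i] for i in range(4)]
```

With this layout, `decode({0: cols[0], 2: cols[2]})` follows the slide's example: b comes from (a+b) + a and d from (d+c) + c. The other five pairs of columns decode the same way.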
Encoding: Encode video using 2-of-4 code.
Decoding: Retrieve data and decode.
Node Failure: Dynamically switch to another node.
Link Failure: Dynamically switch to another network path.
Switch Failure: Dynamically switch to another network path.
Node Recovery: Continuous reconfiguration (e.g., load-balancing).
Features (Certified Buzz-Word Compliant): • High availability: tolerates multiple node/link/switch failures, no single point of failure • Efficient use of resources: multiple data paths, redundant storage, graceful degradation • Dynamic scalability/reconfigurability
RAIN Project: Goals. Efficient, reliable distributed computing and storage systems. Key building blocks: Applications, Storage, Communication, Networks.
Topics. Today’s Talk: • Fault-Tolerant Interconnect Topologies • Connectivity • Group Membership • Distributed Storage. [Diagram: layers Applications, Storage, Communication, Networks.]
Interconnect Topologies: Goal: lose at most a constant number of nodes for a given network loss. [Diagram: computing/storage nodes (N) attached to a network.]
Resistance to Partitions: Large partitions are problematic for distributed services and computation. [Diagram: computing/storage nodes (N) attached to a network.]
Related Work. Embedding hypercubes, rings, meshes, trees in fault-tolerant networks: • Hayes et al., Bruck et al., Boesch et al. Bus-based networks which are resistant to partitioning: • Ku and Hayes, 1997. “Connective Fault-Tolerance in Multiple-Bus Systems”
A Ring of Switches (a naïve solution): degree-2 compute nodes, degree-4 switches. Easily partitioned. [Diagram: switches in a ring with compute nodes attached; N = Node, S = Switch.]
Resistance to Partitioning: degree-2 compute nodes, degree-4 switches, nodes on diagonals. • tolerates any 3 switch failures (optimal) • generalizes to arbitrary node/switch degrees. Details: paper IPPS’98, www.paradise.caltech.edu [Diagram: ring of switches with compute nodes 1 to 8 attached across diagonals.]
Resistance to Partitioning: [Diagram: two isomorphic drawings of the same network, nodes 1 to 8.] Details: paper IPPS’98, www.paradise.caltech.edu
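To get a feel for why the wiring matters, here is a rough sketch that fails a set of switches and reports how many compute nodes can still reach one another. It rests on several assumptions that are not on the slides: switches form a ring, each compute node forwards traffic between its two switches, the naïve wiring plugs node i into switches i and i+1, and the "diagonal" wiring plugs node i into switches i and i + n/2 + 1 (mod n). The exact construction is in the IPPS’98 paper; the function names here are made up for illustration.

```python
from itertools import combinations

def largest_group(n, wiring, failed):
    """Largest set of compute nodes that can still reach each other.

    n      -- number of switches, linked in a ring (switch i to i+1 mod n)
    wiring -- wiring[c] is the pair of switches compute node c plugs into
    failed -- set of failed switch indices
    """
    parent = list(range(n))  # union-find over switches

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for s in range(n):  # ring links between surviving switches
        t = (s + 1) % n
        if s not in failed and t not in failed:
            parent[find(s)] = find(t)
    for a, b in wiring:  # a node with two live switches bridges them
        if a not in failed and b not in failed:
            parent[find(a)] = find(b)

    sizes = {}
    for a, b in wiring:  # count each surviving node once, by component
        live = [s for s in (a, b) if s not in failed]
        if live:
            root = find(live[0])
            sizes[root] = sizes.get(root, 0) + 1
    return max(sizes.values(), default=0)

def worst_case(n, wiring, k):
    """Smallest value of largest_group over every set of k failed switches."""
    return min(largest_group(n, wiring, set(f))
               for f in combinations(range(n), k))

n = 8
naive = [(i, (i + 1) % n) for i in range(n)]              # assumed naïve wiring
diagonal = [(i, (i + n // 2 + 1) % n) for i in range(n)]  # assumed diagonal wiring
```

Under these assumptions, failing switches 0 and 4 splits the naïve ring into two groups of four nodes (`largest_group(n, naive, {0, 4})` is 4), while the diagonal wiring keeps all eight nodes connected for the same failure. `worst_case(n, diagonal, 3)` can be used to explore the three-failure claim for this assumed wiring.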
Point-to-Point Connectivity: Is the path from A to B up or down? [Diagram: nodes A and B attached to a network of nodes.]
Connectivity: Bi-directional communication. The link is seen as up (U) or down (D) by each node. Each node sends out pings. A node may time out, deciding the link is down.
Consistent History: [Diagram: timelines of the link state (U or D) recorded by nodes A and B.]
The Slack: Slack n=2: at most 2 unacknowledged transitions before a node waits. [Timeline of node states for A and B: A is 1 ahead; A is 2 ahead; now A will wait for B to transition.]
Consistent History: Consistency in error reporting: if A sees a channel error, B sees a channel error. Birman et al.: “Reliability Through Consistency”. Details: paper IPPS’99, www.paradise.caltech.edu
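The slack rule can be illustrated with a small bookkeeping sketch. This is a simplification for intuition only, not the protocol from the IPPS’99 paper: it assumes each endpoint counts its own U/D transitions and the peer's transitions it has heard about, and refuses to transition again once it is `slack` transitions ahead. `Endpoint` and `flip` are hypothetical names.

```python
class Endpoint:
    """One side of a link, tracking its own view of the link state."""

    def __init__(self, slack=2):
        self.slack = slack
        self.state = "U"
        self.history = ["U"]
        self.transitions = 0       # transitions this endpoint has made
        self.peer_transitions = 0  # peer transitions this endpoint knows of

    def lead(self):
        """How many transitions this endpoint is ahead of its peer."""
        return self.transitions - self.peer_transitions

    def try_flip(self):
        """Toggle U/D unless that would exceed the slack; report success."""
        if self.lead() >= self.slack:
            return False  # must wait for the peer to catch up
        self.state = "D" if self.state == "U" else "U"
        self.history.append(self.state)
        self.transitions += 1
        return True

    def peer_flipped(self):
        """Record that the peer made one more transition."""
        self.peer_transitions += 1

def flip(node, peer):
    """Flip `node` if the slack allows and notify `peer` (lossless delivery assumed)."""
    if not node.try_flip():
        return False
    peer.peer_flipped()
    return True
```

Starting from two fresh endpoints `a` and `b` with slack 2, two calls to `flip(a, b)` succeed (A is 1 ahead, then 2 ahead) and a third returns `False`, matching the slide: A now waits for B. After B flips twice, both histories read U, D, U and A may transition again.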
Group Membership: Consistent global view given local, point-to-point connectivity information. • link/node failures • dynamic reconfiguration [Diagram: nodes A, B, C, D, each holding the membership view ABCD.]
Related Work. Theory: Chandra et al., Impossibility of Group Membership in an Asynchronous Environment. Systems: Totem, Isis/Horus, Transis.
Group Membership: Token-Ring based Group Membership Protocol. Token carries: • group membership list • sequence number [Diagram sequence: the token, carrying membership ABCD, travels around the ring of nodes A, B, C, D; its sequence number goes 1, 2, 3, 4, 5, and each node keeps the last number it saw (A: 5, B: 2, C: 3, D: 4).]
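The failure-free token circulation shown in these slides can be sketched as follows. This is an illustration under assumptions, not the RAIN implementation: the ring order A, B, C, D, the rule that the sequence number increments once per visit, and the names `Token`, `Node`, and `circulate` are all chosen here for the example. Failure detection and recovery are not modeled.

```python
from dataclasses import dataclass, field

@dataclass
class Token:
    seq: int       # sequence number, bumped on every visit
    members: list  # current group membership list

@dataclass
class Node:
    name: str
    last_seq: int = 0                         # sequence number seen on the last visit
    view: list = field(default_factory=list)  # local copy of the membership list

def circulate(nodes, hops):
    """Pass one token around `nodes` for `hops` visits, starting at nodes[0]."""
    token = Token(seq=0, members=[n.name for n in nodes])
    for hop in range(hops):
        node = nodes[hop % len(nodes)]
        token.seq += 1
        node.last_seq = token.seq        # remember when the token was last here
        node.view = list(token.members)  # adopt the membership the token carries
    return token
```

Running `circulate([Node(x) for x in "ABCD"], 5)` leaves the nodes with last-seen sequence numbers 5, 2, 3, 4, the state pictured once the token has returned to A.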
Group Membership: Node or link fails. [Diagram: ring of nodes A, B, C, D with their last-seen sequence numbers.]