The virtue of dependent failures in multi-site systems. Flavio Junqueira and Keith Marzullo, University of California, San Diego. Workshop on Hot Topics in System Dependability (HotDep), Yokohama, Japan, June 2005.
Multi-site systems • Collection of sites across a WAN • Multiple processors per site • Storage nodes • Computing nodes • Share resources • E.g. BIRN, Geon, TeraGrid • Failures • Processors unavailable • Services do not mask failures • Improve availability under failures • Replication • Minimize overhead
Introduction • Failures in multi-site systems • Processor failures • Site failures • Processors of the site become unavailable • A new failure model • Availability through replication • Replica placement • Operations on replicas: quorums • Replicated data: quorum update • Replicated functionality: state machine using Paxos • Quorum constructions • Failure model in practice • Implement the model • Site availability in BIRN • Model for processor failures within a site
Software and hardware faults • Misconfigured software • Shared resources • Storage • Power circuits • Cooling pipes • Air conditioning • Network
A dependent failure model • Threshold model • Limit on the number of processor failures • Simple • Models homogeneous processors that fail independently well • Multi-site: sites become unavailable frequently enough • Processor failures are not IID • All processors of a site become unavailable • The multi-site threshold model • Two components • Threshold on the number of site failures (fs) • One threshold per site on processor failures (t) • Assumptions • Sites are homogeneous • Processors within a site are homogeneous • Processor failure = crash
Quorum systems • Quorum system Q • Quorum system: set of quorums • Quorum: set of processors • Intersection property: every pair of quorums in Q intersects • Algorithms: access a quorum • Example: Majority system • n processors • Every subset of size ⌈(n+1)/2⌉ is a quorum • Optimal availability for IID processor failures
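The Majority system and the intersection property can be checked mechanically. The following is a minimal sketch, not code from the talk; the function names `majority_quorums` and `is_quorum_system` are hypothetical, and processors are represented simply as integers.

```python
from itertools import combinations


def majority_quorums(processors):
    """Return every subset of size ceil((n+1)/2) as a frozenset (hypothetical helper)."""
    procs = list(processors)
    n = len(procs)
    size = n // 2 + 1  # equals ceil((n+1)/2) for every n >= 1
    return [frozenset(q) for q in combinations(procs, size)]


def is_quorum_system(quorums):
    """Intersection property: every pair of quorums shares at least one processor."""
    return all(a & b for a, b in combinations(quorums, 2))


quorums = majority_quorums(range(5))
print(len(quorums), is_quorum_system(quorums))  # 10 True
```

Enumerating all quorums is only practical for small n; an algorithm that accesses a quorum needs just one available subset of the right size.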
A quorum construction: QSite • QSite • Select at least (2fs + 1) sites: S • Select at least (2t + 1) processors from each site in S • Quorum • Majority of sites in S • Majority of processors in each site • An example (fs = 1, t = 1) • [Figure: quorums over Site 1, Site 2, Site 3]
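The QSite construction can be enumerated for small parameters. This is a sketch under stated assumptions: it uses exactly 2fs + 1 sites and 2t + 1 processors per site (the minimum the slide allows), and the names `build_sites`, `qsite_quorums`, and the processor labels are hypothetical.

```python
from itertools import combinations, product


def build_sites(fs, t):
    """Hypothetical layout: 2*fs + 1 sites, each with 2*t + 1 processors."""
    return {
        f"site{i}": [f"s{i}p{j}" for j in range(2 * t + 1)]
        for i in range(2 * fs + 1)
    }


def qsite_quorums(sites):
    """A quorum is a majority of sites plus a majority of processors in each chosen site."""
    site_majority = len(sites) // 2 + 1
    quorums = []
    for chosen in combinations(sorted(sites), site_majority):
        # All processor majorities for each chosen site.
        per_site = [
            list(combinations(sites[s], len(sites[s]) // 2 + 1)) for s in chosen
        ]
        for parts in product(*per_site):
            quorums.append(frozenset(p for part in parts for p in part))
    return quorums


quorums = qsite_quorums(build_sites(fs=1, t=1))
print(len(quorums), {len(q) for q in quorums})  # 27 {4}
print(all(a & b for a, b in combinations(quorums, 2)))  # True
```

Two quorums always share a site, because two majorities of sites overlap, and inside that site their processor majorities overlap, which is why the intersection property holds. With fs = 1 and t = 1 each quorum has 4 processors, compared with 5 for Majority over the same 9 replicas.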
QSite vs. Majority • Properties of multi-site threshold model hold • Same replicas for QSite and Majority • Availability • fs unavailable sites • Remaining fs + 1 sites: t unavailable processors each • Majority: no quorum available (requires more processors than remain available) • QSite: one quorum available • QSite has better availability • Majority is not optimal • Quorum sizes • QSite produces smaller quorums • Reduces load • Increases capacity
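The availability comparison can be worked through for fs = 1, t = 1 with 3 sites of 3 processors. The sketch below is illustrative only: the failure pattern (one site fully down, one crashed processor in each remaining site) is the worst case allowed by the multi-site threshold model, and the helper names are hypothetical.

```python
def majority_available(alive, total):
    """Majority has a quorum iff a strict majority of all processors is alive."""
    return alive >= total // 2 + 1


def qsite_available(state):
    """state: list of (alive, total) per site.

    QSite has a quorum iff a majority of sites each retain a majority of processors.
    """
    good_sites = sum(1 for alive, total in state if alive >= total // 2 + 1)
    return good_sites >= len(state) // 2 + 1


# fs = 1, t = 1: one site unavailable, one crash in each of the other two sites.
state = [(0, 3), (2, 3), (2, 3)]
alive = sum(a for a, _ in state)   # 4 processors still up
total = sum(n for _, n in state)   # 9 replicas in total

print(majority_available(alive, total))  # False: 4 alive, 5 needed
print(qsite_available(state))            # True: 2 of 3 sites keep 2 of 3 processors
```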
Reducing quorum sizes and sites • QSite, fs = 2, t = 1 • 5 sites • 3 processors per site • 6 processors per quorum • Compromise availability • [Figure: quorums over Site 1, Site 2, Site 3, Site 4]
Site availability • Goals • Show that sites are unavailable frequently enough • Threshold on the number of site failures • BIRN - Biomedical Informatics Research Network • Test bed projects centered around brain imaging • Currently: 19 universities, 26 research groups • Availability • Monthly basis • Pings (BIRN-CC) • Storage broker logs • Site availability • Jan/04-Aug/04 • Availability under 100% • On average in 5 out of the 8 months
BIRN site availability • 10 sites experience at least one outage • One site under 97%
Threshold on unavailable sites • Worst-case scenario • Assumption: independent site failures • n most unavailable sites in each month • Probability that all n sites are unavailable • Each 1% of unavailability is approximately 7 hours of downtime per month
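The arithmetic behind this slide can be sketched as follows. Under the stated independence assumption, the probability that all n sites are unavailable is the product of their individual unavailabilities. The unavailability values below are hypothetical and are not BIRN measurements; a 30-day month is also an assumption.

```python
from math import prod

HOURS_PER_MONTH = 30 * 24  # assumption: a 30-day month, 720 hours


def downtime_hours(unavailability):
    """Convert a monthly unavailability fraction into hours of downtime."""
    return unavailability * HOURS_PER_MONTH


def prob_all_unavailable(unavailabilities):
    """Probability that every listed site is down at once, assuming independence."""
    return prod(unavailabilities)


# 1% unavailability is 0.01 * 720 = 7.2 hours, i.e. roughly 7 hours per month.
print(downtime_hours(0.01))

# Hypothetical worst month: the two most unavailable sites at 3% and 2%.
print(prob_all_unavailable([0.03, 0.02]))
```

The product shrinks quickly as n grows, which is what lets a small threshold on simultaneous site failures cover the observed behavior under the independence assumption.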
Modeling failures in a site • Homogeneous set of processors • Independent processor failures • Identical probability of failure • Processors are repaired • Repair probabilities change with number of failures • Markov chain • From the model: threshold on the number of failures (t) • Desired degree of availability • Stationary probabilities
An example • Three processors per site • Probabilities • Failure probability much smaller than repair probabilities • Repair probabilities increase with failures • t = 1 • Availability 0.001
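One way to derive a threshold from stationary probabilities is sketched below. It assumes a birth-death Markov chain whose state is the number of failed processors in a site; the slides do not give the chain's structure or its probabilities, so the transition values, the 0.001 target interpretation, and the function names are all assumptions chosen for illustration.

```python
def stationary(up, down):
    """Stationary distribution of a birth-death chain on states 0..n.

    up[k]   : probability of moving from k to k+1 failed processors
    down[k] : probability of moving from k+1 back to k failed processors
    Uses detailed balance: pi[k] * up[k] == pi[k+1] * down[k].
    """
    weights = [1.0]
    for u, d in zip(up, down):
        weights.append(weights[-1] * u / d)
    total = sum(weights)
    return [w / total for w in weights]


def smallest_threshold(pi, target):
    """Smallest t such that P(more than t processors failed) <= target."""
    for t in range(len(pi)):
        if sum(pi[t + 1:]) <= target:
            return t
    return len(pi) - 1


# Hypothetical values for three processors per site: failures are rare,
# and repair probabilities increase with the number of failures.
up = [0.003, 0.002, 0.001]
down = [0.1, 0.2, 0.3]

pi = stationary(up, down)
print(pi)
print(smallest_threshold(pi, target=0.001))  # 1 with these hypothetical values
```

With these made-up numbers the probability of more than one simultaneous failure is about 0.0003, so t = 1 meets a 0.001 target, while t = 0 does not.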
Discussion & Future work • Multi-site systems: important class of distributed systems • Share resources • Collaboration among distant groups • Improve availability through replication • A useful abstraction: quorum systems • Algorithms built on top of quorum systems • Dependent failures • Site failures • Enable smaller, more highly available quorums • Lessons to learn • Considering dependent failures may improve results • Models are not necessarily complex • Future work • Validate model, evaluate constructions in practice, more constructions, etc.
Introduction • Failures in multi-site systems • Processor failures • Site failures • Processors of the site become unavailable • A new failure model • Availability through replication • Replica placement • Operations on replicas: quorums • Replicated data (quorum update) • Replicated functionality (state machine using Paxos) • Quorum constructions • Failure model in practice • Implementability of the model • Real system for site availability (BIRN) • Model for processor failures within a site
Software and hardware faults • Software incompatibility, misconfiguration • Shared resources (e.g. storage) • Power failures • Broken pipes • Loss of air conditioning • Network problems
Introduction • Failures in multi-site systems • Processor failures • E.g. HW failures • Site failures • Strategies for replica placement • Large number of sites and nodes • Updates • Naïve approach: every non-faulty replica up to date • Quorum update: contact a quorum of processors • Distributed shared register (replicated data) • Multiple copies of a data set (quorum update) • E.g. brain images (BIRN); geological data (Geon) • Consensus (replicated functionality) • State-machine approach (Paxos algorithm) • E.g. parallel computation (TeraGrid)
Site failures • Software incompatibility, misconfiguration • Shared resources (e.g. storage) • Power failures • Broken pipes • Loss of air conditioning • Network problems
Why sites fail • Software incompatibility, misconfiguration • Shared resources (e.g. storage) • Power failures • Broken pipes • Loss of air conditioning • Network problems
Quorums in a multi-site system • Data replication • Multiple copies of data sets • Functionality replication • State-machine approach • Paxos (Coteries for Classic Paxos) • Question: How do we choose nodes to replicate? • Flat organization • Organization into sites
Quorum systems • Quorum system Q • Quorum system: set of quorums • Quorum: set of processors • Intersection property: every pair of quorums in Q intersects • Algorithms: access a quorum when executing some operation • Examples • Majority system: • n processors • Every subset of size ⌈(n+1)/2⌉ is a quorum • Optimal availability for IID processor failures • Multi-colored: colors as sites • [Figure: processors and quorums]
Quorum systems (cont.) • In multi-site systems • Replicated data • Multiple copies of a data set (quorum update) • E.g. brain images (BIRN); geological data (Geon) • Replicated functionality • State-machine approach (Paxos algorithm) • E.g. parallel computation (TeraGrid) • Quorums for multi-site systems • Replicating on every node is excessive • Quorum construction • Set of processors to replicate on • Quorums
Examples of quorum systems • Majority system: • n processors • Every subset of size ⌈(n+1)/2⌉ is a quorum • Multi-colored: colors as sites • Majority has optimal availability for independent and identically distributed (IID) processor failures • [Figure: universe and quorum patterns]
BIRN site availability • 10 sites have at least one outage • One site under 97%
Discussion & Future work • Multi-site systems: important class of distributed systems • Share resources • Collaboration among distant groups • Improve availability through replication • A useful abstraction: quorum systems • Algorithms built on top of quorum systems • Dependent failures • Site failures • Enable smaller, more highly available quorums • Future work • Validate multi-site threshold model • Evaluate proposed constructions in practice • More constructions • More issues with dependent failures