Large-Scale Distributed Systems Andrew Whitaker CSE451
Textbook Definition • “A distributed system is a collection of loosely coupled processors interconnected by a communication network” • Typically, the nodes run software to create an application/service • e.g., 1000s of Google nodes work together to build a search engine
Why Not to Build a Distributed System (1) • Must handle partial failures • The system must stay up even when individual components fail (e.g., Amazon.com must keep serving customers when a single server dies)
Why Not to Build a Distributed System (2) • No global state • Machines can only communicate with messages • This makes it difficult to agree on anything • “What time is it?” • “Which happened first, A or B?” • Theory: consensus is slow and cannot be guaranteed in the presence of failures • So, we try to avoid needing to agree in the first place
Reasons to Build a Distributed System (1) • The application or service is inherently distributed • e.g., two users (Andrew Whitaker and Joan Whitaker) communicating from different machines
Reasons to Build a Distributed System (2) • Application requirements • Must scale to millions of requests / sec • Must be available despite component failures • This is why Amazon, Google, eBay, etc. are all large distributed systems
Internet Service Requirements • Basic goal: build a site that satisfies every user request • Detailed requirements: • Handle billions of transactions per day • Be available 24/7 • Handle load spikes that are 10x normal capacity • Do it with a random selection of mismatched hardware
An Overview of HotMail (Jim Gray) • ~7,000 servers • 100 backend stores with 300TB (cooked) • Many data centers • Links to • Internet Mail gateways • Ad-rotator • Passport • ~ 5 B messages per day • 350M mailboxes, 250M active • ~1M new per day. • New software every 3 months (small changes weekly).
Availability Strategy #1: Perfect Hardware • Pay extra $$$ for components that do not fail • People have tried this • “fault tolerant computing” • This isn’t practical for Amazon / Google: • It’s impossible to get rid of all faults • Software and administrative errors still exist
Availability Strategy #2: Over-provision • Step 1: buy enough hardware to handle your workload • Step 2: buy more hardware and replicate, replicate, replicate
Benefits of Replication • Scalability • Guards against hardware failures • Guards against software failures (bugs)
Replication Meets Probability • Let p be the probability that a single machine fails • If failures are independent, the probability that all N replicas fail is p^N • Site unavailability = p^N; site availability = 1 - p^N
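A minimal sketch of this arithmetic (the per-machine failure probability below is made up for illustration, and it assumes independent failures, an idealization that the later slides on correlated failures call into question):

```python
# Sketch: how adding independent replicas drives down site unavailability.
# p is a hypothetical per-machine failure probability, not a measured value.

def site_unavailability(p: float, n: int) -> float:
    """Probability that all n replicas are down at once (independent failures)."""
    return p ** n

p = 0.05
for n in (1, 2, 3, 4):
    down = site_unavailability(p, n)
    print(f"n={n}: unavailability={down:.6f}  availability={1 - down:.6f}")
```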
Availability in the Real World • Phone network: 5 9’s • 99.999% available • ATMs: 4 9’s • 99.99% available • What about Internet services? • Not very good…
2006: typical availability was 97.48% Source: Jim Gray
What Gives? • Why isn’t simple redundancy enough to give very high availability?
Failure Modes • Fail-stop failure: A component fails by stopping • It’s totally dead: doesn’t respond to input or output • Ideally, this happens fast • Like a light-bulb • Byzantine failure: Component fails in an arbitrary way • Produces unpredictable output
Byzantine Generals • Basic goal: reach consensus in the presence of arbitrary failures • Results: • More than 2/3 of the nodes must be “loyal” • Need at least 3t + 1 nodes to tolerate t traitors • Consensus is possible, but expensive • Lots of messages • Many rounds of communication • In practice, people assume that failures are fail-stop, and hope for the best…
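A quick worked instance of the 3t + 1 bound, as a sketch (the traitor counts are illustrative only):

```python
# Arithmetic behind the Byzantine bound: tolerating t traitors requires at
# least 3t + 1 nodes, i.e. strictly more than 2/3 of the nodes stay loyal.

def min_nodes(traitors: int) -> int:
    return 3 * traitors + 1

for t in (1, 2, 3):
    n = min_nodes(t)
    loyal = n - t
    print(f"t={t}: need n >= {n}; loyal fraction = {loyal}/{n} = {loyal / n:.1%}")
```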
Example of a Non-Fail-Stop Failure • [Diagram: Internet clients reach Amazon.com through a load balancer that fronts a pool of servers] • The load balancer uses a “Least Connections” policy • One server fails by immediately returning an HTTP error 400 • Because it answers instantly, the failed server always has the fewest open connections, so the balancer keeps routing new requests to it • Net result: the “failed” server becomes a black hole
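A toy simulation of why this happens (the server names, service times, and request counts are all invented for illustration; this is a sketch of a least-connections policy, not the actual load-balancer logic):

```python
from collections import Counter

# Toy model: a least-connections balancer fronting three servers. The "failed"
# server s3 answers instantly with an error, so it holds connections for the
# shortest time, almost always looks least loaded, and soaks up the traffic.

SERVICE_TIME = {"s1": 10, "s2": 10, "s3": 1}   # ticks a request keeps a connection open
in_flight = []                                  # list of (server, ticks_remaining)
routed = Counter()

for tick in range(1000):
    # retire requests that have finished
    in_flight = [(s, t - 1) for (s, t) in in_flight if t > 1]
    # count open connections per server
    open_conns = {s: 0 for s in SERVICE_TIME}
    for s, _ in in_flight:
        open_conns[s] += 1
    # least-connections policy: route the next request to the emptiest server
    target = min(open_conns, key=open_conns.get)
    routed[target] += 1
    in_flight.append((target, SERVICE_TIME[target]))

print(routed)   # s3, the black hole, receives the large majority of requests
```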
Correlated Failures • In practice, components often fail at the same time • Natural disasters • Security vulnerabilities • Correlated manufacturing defects • Human error…
Sources of Failure • [Chart: causes of failure for the Public Switched Telephone Network vs. an average of 3 Internet sites; human error is the largest category] • Human operator error is the leading cause of dependability problems in many domains • Source: D. Patterson et al. Recovery Oriented Computing (ROC): Motivation, Definition, Techniques, and Case Studies, UC Berkeley Technical Report UCB//CSD-02-1175, March 2002.
Understanding Human Error • Administrator actions tend to involve many nodes at once: • Upgrade from Apache 1.3 to Apache 2.0 • Change the root DNS server • Network / router misconfiguration • This can lead to (highly) correlated failures
Learning to Live with Failures • If we can’t prevent failures outright, how can we make their impact less severe? • Understanding availability: • MTTF: Mean Time To Failure • MTTR: Mean Time To Repair • Availability = MTTF / (MTTF + MTTR) • Unavailability ≈ MTTR / MTTF (when MTTF >> MTTR) • Note: recovery time is just as important as failure time!
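A small sketch of the availability formula (the MTTF/MTTR hour values are made-up examples, not numbers from the lecture):

```python
# Availability = MTTF / (MTTF + MTTR): with the same failure rate, cutting
# repair time from 10 hours to 1 hour buys roughly an extra "nine".

def availability(mttf_hours: float, mttr_hours: float) -> float:
    return mttf_hours / (mttf_hours + mttr_hours)

print(f"slow repair: {availability(1000, 10):.4%}")   # ~99.01%
print(f"fast repair: {availability(1000, 1):.4%}")    # ~99.90%
```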
Summary • Large distributed systems are built from many flaky components • Key challenge: don’t let component failures become system failures • Basic approach: throw lots of hardware at the problem; hope everything doesn’t fail at once • Try to decouple failures • Try to avoid single points-of-failure • Try to fail fast • Availability is affected as much by recovery time as by error frequency