
Large-Scale Distributed Systems


Presentation Transcript


  1. Large-Scale Distributed Systems Andrew Whitaker CSE451

  2. Textbook Definition • “A distributed system is a collection of loosely coupled processors interconnected by a communication network” • Typically, the nodes run software to create an application/service • e.g., 1000s of Google nodes work together to build a search engine

  3. Why Not to Build a Distributed System (1) • Must handle partial failures • System must stay up, even when individual components fail (e.g., Amazon.com)

  4. Why Not to Build a Distributed System (2) • No global state • Machines can only communicate with messages • This makes it difficult to agree on anything • “What time is it?” • “Which happened first, A or B?” • Theory: consensus is slow and cannot be guaranteed in the presence of failures • So, we try to avoid needing to agree in the first place
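As a toy illustration of the ordering problem (the clock skews below are made up for this sketch), comparing local timestamps across machines can report the wrong order:

```python
# Toy example: two machines with hypothetical clock skew timestamp two events.
# An omniscient observer knows A's event happened 0.2 s before B's,
# but no machine has access to that "true" time.

TRUE_TIME = 100.0
SKEW = {"A": +0.250, "B": -0.100}   # per-machine clock error, in seconds

timestamp_a = TRUE_TIME + 0.0 + SKEW["A"]   # A stamps its event at 100.250
timestamp_b = TRUE_TIME + 0.2 + SKEW["B"]   # B stamps its (later) event at 100.100

print(timestamp_a < timestamp_b)  # False -- the timestamps claim B's event came first
```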

  5. Reasons to Build a Distributed System (1) • The application or service is inherently distributed • [Figure: two users in different places, Andrew Whitaker and Joan Whitaker]

  6. Reason to Build a Distributed System (2) • Application requirements • Must scale to millions of requests / sec • Must be available despite component failures • This is why Amazon, Google, eBay, etc. are all large distributed systems

  7. Internet Service Requirements • Basic goal: build a site that satisfies every user request • Detailed requirements: • Handle billions of transactions per day • Be available 24/7 • Handle load spikes that are 10x normal capacity • Do it with a random selection of mismatched hardware

  8. An Overview of HotMail (Jim Gray) • ~7,000 servers • 100 backend stores with 300TB (cooked) • Many data centers • Links to • Internet Mail gateways • Ad-rotator • Passport • ~ 5 B messages per day • 350M mailboxes, 250M active • ~1M new per day. • New software every 3 months (small changes weekly).

  9. Availability Strategy #1: Perfect Hardware • Pay extra $$$ for components that do not fail • People have tried this • “fault tolerant computing” • This isn’t practical for Amazon / Google: • It’s impossible to get rid of all faults • Software and administrative errors still exist

  10. Availability Strategy #2: Over-provision • Step 1: buy enough hardware to handle your workload • Step 2: buy more hardware (replicate, replicate, replicate…)

  11. Benefits of Replication • Scalability • Guards against hardware failures • Guards against software failures (bugs)

  12. Replication Meets Probability • p is the probability that a single machine fails • Assuming failures are independent, the probability that all N machines fail at once is p^N • Site unavailability is therefore p^N, and site availability is 1 - p^N
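A minimal Python sketch of this arithmetic; the 5% per-machine failure probability is just an illustrative assumption, and the calculation assumes failures are independent:

```python
# Sketch: site availability under replication, assuming independent failures.
# The site is down only if every replica is down at the same time.

def site_availability(p: float, n: int) -> float:
    """p = probability a single machine is down, n = number of replicas."""
    return 1 - p ** n

# Illustrative assumption: each machine is down 5% of the time (p = 0.05).
for n in (1, 2, 3, 4):
    print(f"{n} replica(s): {site_availability(0.05, n):.6f} available")
# 1 replica(s): 0.950000 available
# 2 replica(s): 0.997500 available
# 3 replica(s): 0.999875 available
# 4 replica(s): 0.999994 available
```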

  13. Availability in the Real World • Phone network: 5 9’s • 99.999% available • ATMs: 4 9’s • 99.99% available • What about Internet services? • Not very good…
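As a rough Python sketch, here is what those “nines” mean in allowed downtime per year (the 97.48% figure is from the next slide; the conversion assumes a 365-day year):

```python
# Sketch: converting "nines" of availability into allowed downtime per year.

MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600

for label, availability in [("five 9's (phone network)", 0.99999),
                            ("four 9's (ATMs)", 0.9999),
                            ("97.48% (typical web site, 2006)", 0.9748)]:
    downtime = (1 - availability) * MINUTES_PER_YEAR
    print(f"{label}: ~{downtime:,.0f} minutes of downtime per year")
# five 9's (phone network): ~5 minutes of downtime per year
# four 9's (ATMs): ~53 minutes of downtime per year
# 97.48% (typical web site, 2006): ~13,245 minutes of downtime per year
```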

  14. 2006: Typical Availability is 97.48% • Source: Jim Gray

  15. Netcraft’s Crisis-of-the-Day

  16. What Gives? • Why isn’t simple redundancy enough to give very high availability?

  17. Failure Modes • Fail-stop failure: A component fails by stopping • It’s totally dead: doesn’t respond to input or output • Ideally, this happens fast • Like a light-bulb • Byzantine failure: Component fails in an arbitrary way • Produces unpredictable output
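To show why the fail-stop assumption is so convenient, here is a hypothetical heartbeat-based failure detector sketched in Python (the class and timeout value are made up for illustration). It catches nodes that simply stop, but a Byzantine node that keeps sending heartbeats while producing garbage output sails right through:

```python
import time

# Sketch: a heartbeat-based failure detector (names and timeout are made up).
# It only handles fail-stop failures: a node that stops sending heartbeats is
# declared dead after a timeout. A Byzantine node that keeps heartbeating while
# returning garbage would pass this check unnoticed.

HEARTBEAT_TIMEOUT = 5.0  # seconds of silence before a node is declared dead

class FailureDetector:
    def __init__(self) -> None:
        self.last_heartbeat: dict[str, float] = {}

    def record_heartbeat(self, node_id: str) -> None:
        """Called whenever a heartbeat message arrives from node_id."""
        self.last_heartbeat[node_id] = time.monotonic()

    def dead_nodes(self) -> list[str]:
        """Nodes that have been silent longer than the timeout."""
        now = time.monotonic()
        return [node for node, last in self.last_heartbeat.items()
                if now - last > HEARTBEAT_TIMEOUT]
```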

  18. Byzantine Generals • Basic goal: reach consensus in the presence of arbitrary failures • Results: • More than 2/3 of the nodes must be “loyal”: tolerating t traitors requires at least 3t + 1 nodes • Consensus is possible, but expensive • Lots of messages • Many rounds of communication • In practice, people assume that failures are fail-stop, and hope for the best…
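A small Python sketch of the 3t + 1 bound (the helper names are made up for illustration):

```python
# Sketch: the 3t + 1 bound from the Byzantine generals result.

def min_nodes(traitors: int) -> int:
    """Minimum total nodes needed to reach consensus despite `traitors` Byzantine nodes."""
    return 3 * traitors + 1

def tolerable_traitors(nodes: int) -> int:
    """Maximum number of traitors a group of `nodes` can tolerate."""
    return (nodes - 1) // 3

print(min_nodes(1))              # 4   -- even one traitor requires four generals
print(tolerable_traitors(7))     # 2
print(tolerable_traitors(1000))  # 333 -- more than 2/3 of the nodes must stay loyal
```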

  19. Example of a Non-Fail-Stop Failure • [Figure: Internet traffic to Amazon.com passes through a load balancer in front of several servers] • The load balancer uses a “Least Connections” policy • One server fails by returning HTTP error 400 • Because it answers with an error almost instantly, it always has the fewest open connections, so the balancer keeps routing new requests to it • Net result: the “failed” server becomes a black hole
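A toy Python simulation of this scenario (all numbers are hypothetical): one server answers instantly with errors while the healthy servers take 200 ms per request, and least-connections routing funnels almost all traffic to the broken one:

```python
# Toy simulation (all numbers hypothetical): a fast-failing server under a
# least-connections load balancer. The broken server answers instantly with an
# error, so it almost never has a connection open -- and keeps getting picked.

RESPONSE_TIME = {"s1": 0.200, "s2": 0.200, "s3": 0.001}  # seconds; s3 fails fast
open_connections = {name: 0 for name in RESPONSE_TIME}
requests_routed = {name: 0 for name in RESPONSE_TIME}

in_flight = []   # (finish_time, server) pairs
now = 0.0
for _ in range(10_000):
    now += 0.01  # one new request arrives every 10 ms
    # Retire requests that have finished.
    for done in [e for e in in_flight if e[0] <= now]:
        open_connections[done[1]] -= 1
        in_flight.remove(done)
    # Least-connections routing decision.
    target = min(RESPONSE_TIME, key=lambda s: open_connections[s])
    open_connections[target] += 1
    requests_routed[target] += 1
    in_flight.append((now + RESPONSE_TIME[target], target))

print(requests_routed)  # the "failed" s3 absorbs the vast majority of the traffic
```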

  20. Correlated Failures • In practice, components often fail at the same time • Natural disasters • Security vulnerabilities • Correlated manufacturing defects • Human error…

  21. Sources of Failure • [Chart: causes of failure for the Public Switched Telephone Network vs. an average of 3 Internet sites] • Human operator error is the leading cause of dependability problems in many domains • Source: D. Patterson et al., Recovery Oriented Computing (ROC): Motivation, Definition, Techniques, and Case Studies, UC Berkeley Technical Report UCB//CSD-02-1175, March 2002.

  22. Understanding Human Error • Administrator actions tend to involve many nodes at once: • Upgrade from Apache 1.3 to Apache 2.0 • Change the root DNS server • Network / router misconfiguration • This can lead to (highly) correlated failures

  23. Learning to Live with Failures • If we can’t prevent failures outright, how can we make their impact less severe? • Understanding availability: • MTTF: mean time to failure • MTTR: mean time to repair • Availability = MTTF / (MTTF + MTTR) • Unavailability = MTTR / (MTTF + MTTR), approximately MTTR / MTTF • Note: recovery time is just as important as failure time!
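A quick Python sketch with hypothetical numbers, showing how shrinking MTTR moves availability just as much as stretching MTTF does:

```python
# Sketch (hypothetical numbers): availability as a function of MTTF and MTTR.

def availability(mttf_hours: float, mttr_hours: float) -> float:
    return mttf_hours / (mttf_hours + mttr_hours)

# A node that fails roughly once a month (MTTF = 720 hours):
print(availability(720, 8.0))  # ~0.9890  -- an 8-hour manual repair: roughly two 9's
print(availability(720, 0.1))  # ~0.99986 -- a 6-minute automated recovery: close to four 9's
```

Cutting repair time by roughly 80x buys nearly two extra nines without changing the failure rate at all, which is the point of the note above.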

  24. Summary • Large distributed systems are built from many flaky components • Key challenge: don’t let component failures become system failures • Basic approach: throw lots of hardware at the problem; hope everything doesn’t fail at once • Try to decouple failures • Try to avoid single points-of-failure • Try to fail fast • Availability is affected as much by recovery time as by error frequency
