
Limplock: Understanding the Impact of Limpware on Scale-out Cloud Systems

This presentation examines the detrimental impact of limpware on scale-out cloud systems through a comprehensive study. It explores limpware-related failures and their cascading effects on system performance and reliability, investigates how limpware-tolerant designs can mitigate such issues, and presents findings from experiments that benchmark key systems and protocols under limpware scenarios, along with the implications for system design and operation.


Presentation Transcript


  1. Limplock: Understanding the Impact of Limpware on Scale-out Cloud Systems Thanh Do*, Mingzhe Hao, Tanakorn Leesatapornwongsa, Tiratat Patana-anake, and Haryadi S. Gunawi

  2. Hardware fails • Growing complexity of … • Technology scaling • Manufacturing • Design logic • Usage • Operating environment • … makes HW fail differently • Complete fail-stop • Fail-partial • Corruption • Performance degradation? • A rich literature covers the first three modes; performance degradation is far less studied

  3. The 1st anecdote • Degraded NIC! (1,000,000x) • “… 1Gb NIC card on a machine that suddenly starts transmitting at 1 kbps, this slow machine caused a chain reaction upstream in such a way that the performance of entire workload for a 100 node cluster was crawling at a snail's pace, effectively making the system unavailable for all practical purposes.” – Borthakur of Facebook • Cascading impact!

  4. Cases of Degraded HW • “In 2011, one of the DDN 9900 units had 4 servers having high wait times on I/O for a certain set of disk LUNs. The maximum wait time was 103 seconds. This was left uncorrected for 50 days.” – Kasick of CMU, Harms of Argonne • “The disk attempts to re-read each block multiple times before responding.” – Baptist of Cleversafe • “On Intrepid, we had a bad batch of optical transceivers with an extremely high error rate. That results in an effective throughput of 1-2 Kbps.” – Harms of Argonne • Many others: “Yes we've seen that in production”

  5. Limpware • Does HW degrade? Yes • Limpware: Hardware whose performance degrades significantly compared to its specification • Is this a destructive failure mode? Yes • Cascading failures, no “fail in place” • No systematic analysis on its impact

  6. Study Summary • 56 experiments that benchmark 5 systems • Hadoop, HDFS, Zookeeper, Cassandra, HBase • 22 protocols • 8 hours under normal scenarios • 207 hours under limpware scenarios • Unearth many limpware-intolerant designs Our findings: A single piece of limpware (e.g. NIC) causes severe impact on a whole cluster

  7. Outline • Introduction • System analysis • Limplock • Limpware-Tolerant Systems • Conclusion

  8. Anecdotal impacts • “The performance of a 100 node cluster was crawling at a snail's pace” – Facebook • But, … why?

  9. System analysis • Goals • Measure system-level impacts • Find design flaws • Methodology • Target cloud systems (e.g., HDFS, Hadoop, ZooKeeper) • Inject load + limpware • E.g., slow a NIC to 1 Mbps, 0.1 Mbps, etc. (see the sketch below) • White-box analysis (internal probes) • Find design flaws
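As a rough illustration of the fault-injection step above, the sketch below throttles a NIC with the Linux `tc` traffic shaper. The slide does not name the authors' actual injection tool, so the use of `tc`/tbf and the `NicLimper` helper are assumptions for illustration only, not the paper's methodology.

```java
import java.io.IOException;

/** Hypothetical helper (not the authors' tool) that emulates limpware by
 *  throttling a NIC's egress rate with the Linux `tc` traffic shaper. */
public class NicLimper {

    /** Cap the egress rate of `device` (e.g., "eth0") to `rateKbit` kilobits/sec. */
    public static void limp(String device, int rateKbit) throws IOException, InterruptedException {
        run("tc", "qdisc", "add", "dev", device, "root", "tbf",
            "rate", rateKbit + "kbit", "burst", "32kbit", "latency", "400ms");
    }

    /** Remove the throttle, restoring the NIC to full speed. */
    public static void heal(String device) throws IOException, InterruptedException {
        run("tc", "qdisc", "del", "dev", device, "root");
    }

    private static void run(String... cmd) throws IOException, InterruptedException {
        Process p = new ProcessBuilder(cmd).inheritIO().start();   // requires root
        if (p.waitFor() != 0) {
            throw new IOException("command failed: " + String.join(" ", cmd));
        }
    }

    public static void main(String[] args) throws Exception {
        limp("eth0", 1000);   // limp the NIC to ~1 Mbps while the benchmark runs
        // ... run the workload and collect white-box probes here ...
        heal("eth0");
    }
}
```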

  10. Example • Run a distributed protocol • E.g., a 3-node write in HDFS • Measure execution slowdowns under: no failure, a crash, and a degraded NIC • [Chart: slowdown of the workload vs. NIC speed: roughly 10x slower with a 10 Mbps NIC, 100x with 1 Mbps, 1000x with 0.1 Mbps]

  11. Outline • Introduction • System analysis • Hadoop case study • Limplock • Limpware-Tolerant Cloud Systems • Conclusion

  12. Hadoop Spec. Exec. • Is Hadoop tail-tolerant? Why is speculative exec not triggered? • Consider a degraded NIC on a map node • Task M2's speed ≈ that of M1 and M3, because its input data is local! • But all reducers are slow • Straggler: slow vs. others of the same job • No straggler detected! (see the sketch below) • Flaws • Task-level straggler detection • Single point of failure! • [Chart: wordcount on Hadoop, slowdown on a log scale (1-10x); diagram: mappers M1-M3 feeding reducers]
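To make the flaw above concrete, here is a much-simplified sketch of task-level straggler detection; the class name, threshold, and progress rates are hypothetical, not Hadoop's actual scheduler code. Because every reducer of the job fetches map output over the same degraded NIC, all peers are equally slow and none ever looks like a straggler.

```java
import java.util.List;

/** Simplified sketch (not Hadoop's real scheduler): a task is speculated only
 *  if it runs much slower than its peers in the same job. */
public class StragglerDetector {

    /** Progress rate = fraction of work done per second. */
    public static boolean isStraggler(double taskRate, List<Double> peerRates) {
        double mean = peerRates.stream().mapToDouble(Double::doubleValue).average().orElse(0.0);
        return taskRate < 0.5 * mean;   // hypothetical threshold
    }

    public static void main(String[] args) {
        // A map node's NIC limps: its local map task M2 still reads local input at
        // full speed, but every reducer fetches M2's output over the slow NIC.
        List<Double> reducerRates = List.of(0.001, 0.001, 0.001, 0.001);
        for (double r : reducerRates) {
            // All reducers are equally slow, so none is ever flagged relative to its peers.
            System.out.println("straggler? " + isStraggler(r, reducerRates)); // always false
        }
    }
}
```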

  13. Cascading failures • A degraded NIC → degraded tasks • (Degraded tasks are slower by orders of magnitude) • Slow tasks use up slots → degraded node • Default: 2 mappers and 2 reducers per node • If all slots are used → the node is “unavailable” • All nodes in limp mode → degraded cluster (rough arithmetic below) • [Diagram: the node with the slow NIC drags healthy nodes into limp mode as their map/reduce slots fill up]
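A purely illustrative back-of-envelope of this cascade, assuming each incoming job pins at least one reduce slot on a healthy node for the duration of a limplocked shuffle; the slowdown factor and task duration below are made up for the example, not measured values from the paper.

```java
/** Illustrative arithmetic for the slot-exhaustion cascade (numbers are assumed). */
public class SlotExhaustion {
    public static void main(String[] args) {
        int nodes = 30, reduceSlotsPerNode = 2;               // Hadoop default: 2 reduce slots/node
        int totalSlots = nodes * reduceSlotsPerNode;          // 60 reduce slots cluster-wide

        double normalReduceMinutes = 1.0;                     // assumed healthy reduce time
        double limpedReduceMinutes = normalReduceMinutes * 1000;  // orders of magnitude slower

        // Assume each job has at least one reducer that fetches from the limping node,
        // so each job pins one reduce slot for ~1000 minutes instead of ~1.
        double jobsPerHour = 172;                             // healthy throughput (slide 14)
        double hoursUntilAllSlotsPinned = totalSlots / jobsPerHour;

        System.out.printf("Each limplocked reducer holds its slot ~%.0fx longer%n",
                limpedReduceMinutes / normalReduceMinutes);
        System.out.printf("All %d reduce slots are pinned after only ~%.1f hours of load%n",
                totalSlots, hoursUntilAllSlotsPinned);
    }
}
```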

  14. Cluster collapse • Macrobenchmark: Facebook workload • 30-node cluster • One node w/ a degraded NIC (0.1 Mbps) • Throughput drops from 172 jobs/hour to 1 job/hour • Cluster collapse! Why?

  15. Fail-stop tolerant, but not limpware-tolerant (no failover recovery)

  16. Outline • Introduction • System analysis • Formalizing the problem: Limplock • Definitions and causes • Limpware-Tolerant Systems • Conclusion

  17. Limplock • Definition • The system progresses slowly due to limpware and is not capable of failing over to healthy components • (i.e., the system is “locked” in limping mode) • 3 levels of limplock • Operation • Node • Cluster

  18. Limplock Levels • Operation Limplock • Operation involving limpware is “locked” in limping mode; no failover • Node Limplock • A situation where operations that must be served by this node experience limplock, although the operations do not involve limpware • Cluster Limplock • The whole cluster is in limplock due to limpware

  19. Causes of Limplock • Operation Limplock • Single point of failure • Hadoop slow map task • HBase “Gateway” • [Diagram: mappers M1-M3 feeding reducers]

  20. Causes of Limplock • Operation Limplock • Single point of failure • Coarse-grained timeout • (more in the paper) • Reason: no timeout is triggered • Coarse-grained timeout in HDFS: a 60-second timeout on every 64 KB packet • Could limp to almost 1 KB/s without triggering it (arithmetic below) • [Chart: slowdown of a 512 MB write to HDFS, log scale (1-100x)]
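The arithmetic behind the “almost 1 KB/s” bullet, assuming the write proceeds packet by packet and each 64 KB packet arrives just before its 60-second timeout expires:

```java
/** Why a per-packet timeout of 60 s per 64 KB never fires under a badly limping NIC. */
public class TimeoutFloor {
    public static void main(String[] args) {
        double packetKB = 64.0;            // HDFS data is streamed in 64 KB packets (slide 20)
        double timeoutSec = 60.0;          // timeout is refreshed on every packet

        // Slowest throughput that still beats the timeout on every packet:
        double floorKBps = packetKB / timeoutSec;                 // ~1.07 KB/s
        System.out.printf("Timeout only fires below ~%.2f KB/s%n", floorKBps);

        // A 512 MB write that limps along just above that floor:
        double writeKB = 512 * 1024;
        double worstCaseDays = (writeKB / floorKBps) / 86400;     // ~5.7 days
        System.out.printf("A 512 MB write can take up to ~%.1f days without a timeout%n",
                worstCaseDays);
    }
}
```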

  21. Causes of Limplock • Operation Limplock • Single point of failure • Coarse-grained timeout • … • Node Limplock • Bounded multi-purpose thread pool • Resource exhaustion by a limplocked operation • In-memory metadata reads at the master are blocked (sketch below) • In-memory reads > 100x slower than normal • [Chart: slowdown of in-memory meta reads and meta writes at the master, log scale (1-100x)]
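A minimal, self-contained sketch of the bounded multi-purpose thread pool flaw; the pool size, sleep times, and class names are illustrative, not HDFS's actual handler configuration. A few limplocked operations occupy every worker, so an in-memory read that touches no limping hardware still waits behind them.

```java
import java.util.concurrent.*;

/** Minimal sketch of node limplock via a bounded, shared thread pool. */
public class SharedPoolLimplock {
    public static void main(String[] args) throws Exception {
        ExecutorService handlers = Executors.newFixedThreadPool(4);  // size is illustrative

        // Four writes whose data path crosses the degraded NIC: each limps for a long time.
        for (int i = 0; i < 4; i++) {
            handlers.submit(() -> {
                try { TimeUnit.HOURS.sleep(1); }      // stand-in for a limplocked transfer
                catch (InterruptedException e) { Thread.currentThread().interrupt(); }
            });
        }

        // An in-memory metadata read: microseconds of work, but it needs a free handler.
        Future<String> read = handlers.submit(() -> "metadata");

        try {
            System.out.println(read.get(2, TimeUnit.SECONDS));
        } catch (TimeoutException e) {
            // The read touched no limping hardware, yet it is stuck behind the writes.
            System.out.println("in-memory read still queued: node limplock");
        }
        handlers.shutdownNow();
    }
}
```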

  22. Causes of Limplock • Operation Limplock • Single point of failure • Coarse-grained timeout • … • Node Limplock • Bounded multi-purpose thread pool • Bounded multi-purpose queue (sketch below) • [Diagram: a single queue of messages shared across destinations]
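A similarly simplified sketch of the bounded multi-purpose queue flaw; the capacity and names are illustrative. Once the shared outbox fills with messages bound for the limping node, sends to perfectly healthy peers cannot even be enqueued.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.TimeUnit;

/** Minimal sketch of a bounded queue shared by messages for all peers. */
public class SharedQueueLimplock {
    record Message(String destination, byte[] payload) {}

    public static void main(String[] args) throws InterruptedException {
        ArrayBlockingQueue<Message> outbox = new ArrayBlockingQueue<>(16);  // illustrative capacity

        // Fill the shared outbox with messages for the node behind the degraded NIC;
        // the drain thread for that node is limplocked, so these are never removed.
        for (int i = 0; i < 16; i++) {
            outbox.put(new Message("limping-node", new byte[64]));
        }

        // A message for a perfectly healthy peer now cannot even be enqueued.
        boolean enqueued = outbox.offer(
                new Message("healthy-node", new byte[64]), 2, TimeUnit.SECONDS);
        System.out.println("send to healthy peer enqueued? " + enqueued);  // false
    }
}
```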

  23. Causes of Limplock • Operation Limplock • Single point of failure • Coarse-grained timeout • … • Node Limplock • Bounded multi-purpose thread pool • Bounded multi-purpose queue • Unbounded thread pool/queue • Ex: backlogged queue at the ZooKeeper leader • Node limplock at the leader because garbage collection works hard • Quorum write: 10x slowdown (sketch below) • [Diagram: client quorum write through the ZooKeeper leader to followers; chart: slowdown under a 20-second vs. a 600-second stress load, log scale (1-10x)]
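Finally, a simplified model of the unbounded-queue case; these are not ZooKeeper's actual sender classes. The leader keeps appending proposals for the limping follower, nothing pushes back, and the growing backlog drives up heap usage and garbage-collection work, which in turn slows every operation the leader serves, including quorum writes that healthy followers could acknowledge quickly.

```java
import java.util.concurrent.LinkedBlockingQueue;

/** Simplified model of a backlogged, unbounded per-follower queue on the leader. */
public class UnboundedQueueBacklog {
    public static void main(String[] args) {
        // Unbounded queue of pending proposals for the follower behind the limping NIC.
        LinkedBlockingQueue<byte[]> toSlowFollower = new LinkedBlockingQueue<>();

        // Drain thread: the limping link delivers roughly one proposal per second.
        Thread sender = new Thread(() -> {
            try {
                while (true) { toSlowFollower.take(); Thread.sleep(1000); }
            } catch (InterruptedException e) { Thread.currentThread().interrupt(); }
        });
        sender.setDaemon(true);
        sender.start();

        // The leader keeps enqueueing proposals at full speed; nothing pushes back.
        for (int i = 0; i < 100_000; i++) {
            toSlowFollower.add(new byte[1024]);          // ~1 KB payload per proposal, ~100 MB total
        }
        Runtime rt = Runtime.getRuntime();
        System.out.printf("backlog: %d proposals, ~%d MB of heap retained%n",
                toSlowFollower.size(), (rt.totalMemory() - rt.freeMemory()) >> 20);
        // In the real system, this retained backlog drives GC activity up,
        // limplocking the leader node and slowing quorum writes ~10x (slide 23).
    }
}
```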

  24. Causes of Limplock • Operation Limplock • Single point of failure • Coarse-grained timeout • … • Node Limplock • Bounded multi-purpose thread pool • Bounded multi-purpose queue • Unbounded thread pool/queue • Cluster Limplock • All nodes in limplock • Ex: resource exhaustion in Hadoop, HDFS Regeneration • Master limplock in master-slave architecture • Ex: cases in ZooKeeper, HDFS

  25. Analysis Results • Found 15 protocols that exhibit limplock • 8 in HDFS • 1 in Hadoop • 2 in ZooKeeper • 4 in HBase Limplock happens in almost all systems we have analyzed

  26. Outline • Introduction • System analysis • Limplock • Limpware-Tolerant Cloud Systems • Conclusion

  27. Principles of limpware … • Recovery • How to “fail in place”? • Better to fail-stop than fail-slow? • Quarantine? • Utilization • Fail-stop: failed or working • Limpware: degraded anywhere from 1-100% • Anticipation • Limpware-tolerant design patterns • Limpware static analysis • Limpware statistics • Existing work: memory failure, disk failure, etc. • Detection • Performance degradation → implicit (no hard errors) • Study explicit causes (e.g. block remapping, error correcting)

  28. Conclusion • New failure modes → transform systems • Limpware is a “new”, destructive failure mode • Orders of magnitude slowdown • Cascading failures • No “fail in place” in current systems • A need for Limpware-Tolerant Systems

  29. Thank you! Questions?
