
Presentation: Charles Yang 2005/3/17

The Seven Deadly Sins of Distributed Systems, by Steve Muir, Princeton University. Workshop on Real Large Distributed Systems, 2004.


Presentation Transcript


  1. The Seven Deadly Sins of Distributed Systems • by Steve Muir, Princeton University • Workshop on Real Large Distributed Systems, 2004 • Presentation: Charles Yang 2005/3/17

  2. Introduction • PlanetLab • Time of paper: 400+ Nodes, 175 sites • Now: 537 machines, 254 sites

  3. What does it do? • IA-32 machines running Linux • Distributed virtualization • Lets users experiment with a variety of planetary-scale services: • file sharing and network-embedded storage • content distribution networks • routing and multicast overlays • network measurement tools • etc.

  4. The Paper (and this Presentation) • Node Manager (NM) • Each user creates a slice • NM configures slivers on actual nodes • Paper describes 7 challenges encountered • Applicable to all large-scale heterogeneous environments

  5. The se7en Sins • Networks are unreliable in the worst way • DNS is not a good naming system • Local clocks are inaccurate/unreliable • Large-scale systems always inconsistent • The improbable will happen • Over-utilization is the steady-state condition • Limited system transparency hampers debugging

  6. Large Heterogeneous Networks are Fundamentally Unreliable • Specifications of IP, TCP, UDP say that • They really really mean it • Packets are not just lost • They are also delayed, duplicated, corrupted • Highly variable latency: 24 hours to download a small file • Unexpected termination: RPC interrupted

  7. Unreliable Networks – Solutions • ALL possible errors should be handled gracefully • It will happen • For variable latency • Multithreading or async I/O + timeouts • For RPC operations • Transactions may be too heavyweight • NM: acquire & bind • Interference from other users • Someone might port-scan you • SEARCH \x90\x90\x90\x90\x90…. +
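
The "async I/O + timeouts" advice on this slide can be sketched as follows. This is a minimal illustration, not the Node Manager's code: the helper name `call_with_timeout`, the retry count, and the backoff constants are all assumptions made for the example.

```python
import asyncio


async def call_with_timeout(coro_factory, timeout_s, retries=3):
    """Run coro_factory() with a per-attempt timeout.

    coro_factory is a hypothetical zero-argument callable that returns a
    fresh coroutine for each attempt (for example, one remote call).
    """
    last_error = None
    for attempt in range(retries):
        try:
            # Bound every attempt so a stalled peer cannot block us forever.
            return await asyncio.wait_for(coro_factory(), timeout=timeout_s)
        except (asyncio.TimeoutError, ConnectionError) as exc:
            last_error = exc
            # Short exponential backoff before retrying (values are arbitrary).
            await asyncio.sleep(min(0.01 * 2 ** attempt, 1.0))
    raise RuntimeError(f"gave up after {retries} attempts") from last_error
```

Retrying is only safe when the remote operation is idempotent, or when a retry can detect that an earlier attempt already took effect.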

  8. DNS Names Make Poor Node Identifiers • Suffers from ambiguity & instability • Human errors • Network reorganizations, renaming of hosts • DNS servers may be overloaded -> secondary servers • Network asymmetry: internal names, NAT, etc. • Non-static addresses & multihoming • NM: unique numeric IDs (MAC address)
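
One way to picture the "unique numeric IDs" point is a registry keyed by a stable node ID, with hostnames and addresses stored as attributes that are allowed to change. The sketch below is illustrative only: `NodeRecord`, `NodeRegistry`, and the sample names are hypothetical, and how an ID is actually assigned (for instance, tied to a MAC address) is left out.

```python
from dataclasses import dataclass, field


@dataclass
class NodeRecord:
    node_id: int                                   # stable key (hypothetical)
    hostnames: set = field(default_factory=set)    # may change or be ambiguous
    addresses: set = field(default_factory=set)    # may change (DHCP, multihoming)


class NodeRegistry:
    """Index nodes by numeric ID; names and addresses are just observations."""

    def __init__(self):
        self._by_id = {}

    def observe(self, node_id, hostname=None, address=None):
        record = self._by_id.setdefault(node_id, NodeRecord(node_id))
        if hostname:
            record.hostnames.add(hostname)
        if address:
            record.addresses.add(address)
        return record

    def __len__(self):
        return len(self._by_id)
```

A host that is renamed or reachable under two names then shows up as one record with two hostnames rather than as two nodes.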

  9. Local Clocks are Unreliable and Untrustworthy • Local clocks are bad • NTP helps, but some sites block it • Bottom line: make sure your application knows about the problem
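
A small sketch of what "knowing about the problem" might look like in application code: measure intervals with a monotonic clock, which is unaffected by wall-clock adjustments, and treat timestamps from other nodes as suspect. The `Deadline` class, the `plausible_timestamp` helper, and the 300-second tolerance are assumptions for illustration.

```python
import time


class Deadline:
    """Timeout tracking based on a monotonic clock, not the wall clock."""

    def __init__(self, seconds, clock=time.monotonic):
        self._clock = clock
        self._expires = clock() + seconds

    def remaining(self):
        return max(0.0, self._expires - self._clock())

    def expired(self):
        return self.remaining() == 0.0


def plausible_timestamp(remote_ts, local_ts, max_skew_s=300.0):
    """Return True if a remote timestamp is within the tolerated skew.

    The tolerance is an arbitrary example value; a real application would
    choose it from what it knows about its deployment.
    """
    return abs(remote_ts - local_ts) <= max_skew_s
```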

  10. Inconsistent Node Configuration is the Norm • Multiple versions of software packages • Multiple versions of your own application • Updates not well ordered • NM • Incorporate failsafe behavior • No slices in XML file -> probably major format change • Version numbering
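
The failsafe idea on this slide (an XML file with no slices more likely signals a format change than an instruction to delete everything) can be sketched like this. The element name `slice`, the `name` and `version` attributes, and `SUPPORTED_VERSIONS` are hypothetical; the real file format is not described in the slides.

```python
import xml.etree.ElementTree as ET

SUPPORTED_VERSIONS = {"1"}  # hypothetical version numbers this code understands


def next_slices(xml_text, current):
    """Return the slice set to apply, keeping `current` when in doubt."""
    try:
        root = ET.fromstring(xml_text)
    except ET.ParseError:
        return current  # unreadable file: keep the existing state
    if root.get("version") not in SUPPORTED_VERSIONS:
        return current  # unknown format version: do not guess
    names = {el.get("name") for el in root.iter("slice") if el.get("name")}
    if not names:
        return current  # empty list is suspicious: keep the existing state
    return names
```

The design choice is to fail toward the last known good state: a wrongly skipped update is usually easier to recover from than wrongly deleted slices.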

  11. There’s No Such Thing as “One-in-a-Million” • Hundreds of nodes, 24/7 • Murphy’s Law • Unexpected reboots happen more than you’d think • Power outages all over the world • Must not cut corners when handling errors
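
Unexpected reboots and power outages make half-written files a real risk. One common defensive pattern, shown here as a sketch rather than anything the slides prescribe, is to write to a temporary file in the same directory and rename it into place. The function name `atomic_write` is an assumption.

```python
import os
import tempfile


def atomic_write(path, data: bytes):
    """Replace `path` with `data` so readers see the old or the new file."""
    directory = os.path.dirname(os.path.abspath(path))
    # The temp file must live on the same filesystem for the rename to be atomic.
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    try:
        with os.fdopen(fd, "wb") as handle:
            handle.write(data)
            handle.flush()
            os.fsync(handle.fileno())  # push the bytes to disk before renaming
        os.replace(tmp_path, path)
    except BaseException:
        # Clean up the temp file if anything failed before the rename.
        try:
            os.unlink(tmp_path)
        except FileNotFoundError:
            pass
        raise
```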

  12. No PlanetLab Node is Under-Utilized • The Norm: • 100% utilization, load in 5-10 range • 20+ concurrent slices • Several hundred processes are not uncommon • Hence: • Applications run much slower • So make your apps aware • System-wide solution • Smart scheduler
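
As one possible reading of "make your apps aware", an application could stretch its local timeouts when the node is heavily loaded. The sketch below is an assumption-laden example: `scaled_timeout`, the linear scaling rule, and the cap of 20 are invented for illustration. `os.getloadavg()` is only available on Unix-like systems, so the code falls back to no scaling elsewhere.

```python
import os


def scaled_timeout(base_s, load=None, cpus=None, cap=20.0):
    """Scale a timeout by load per CPU, never below base_s, at most cap times it."""
    if load is None:
        try:
            load = os.getloadavg()[0]  # 1-minute load average
        except (OSError, AttributeError):
            load = 0.0  # load unavailable on this platform: do not scale
    cpus = cpus or os.cpu_count() or 1
    factor = min(max(1.0, load / cpus), cap)
    return base_s * factor
```

With a load of 8 on a single CPU, a 2-second base timeout becomes 16 seconds; the cap keeps a pathological load value from producing an effectively infinite wait.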

  13. Limited System Transparency Hampers Debugging • Virtualization -> not a complete view • System solutions • Develop debugging tools • App side: • Make sure you debug thoroughly before deploying • Report as much info as possible in a readable format
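
For "report as much info as possible in a readable format", a minimal sketch is to emit each error as a single structured record with its context attached. The function name `error_report` and the chosen fields are assumptions; a real application would add whatever identifies the slice, node, and software version.

```python
import json
import platform
import traceback


def error_report(exc, **context):
    """Serialize an exception plus caller-supplied context as one JSON line."""
    return json.dumps(
        {
            "error": type(exc).__name__,
            "message": str(exc),
            "trace": traceback.format_exception(type(exc), exc, exc.__traceback__),
            "host": platform.node(),
            "context": context,  # e.g. node ID, app version (caller's choice)
        },
        sort_keys=True,
        default=str,  # fall back to str() for values JSON cannot encode
    )
```

Such a record can be read by a person and parsed by a script, which helps when the only view into a remote node is its log output.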

  14. Guidelines • Assumptions made in non-distributed apps are not valid in large-scale and/or heterogeneous environments • Distributed apps must gracefully handle a broad range of unlikely errors • Resource management in distributed environments is much different than in non-distributed ones • Even local operations can behave radically differently in a heavily utilized system
