The Seven Deadly Sins of Distributed Systems by Steve Muir, Princeton University. Workshop on Real Large Distributed Systems, 2004. Presentation: Charles Yang, 2005/3/17
Introduction • PlanetLab • At the time of the paper: 400+ nodes, 175 sites • Now: 537 machines, 254 sites
What does it do? • IA-32 machines running Linux • Distributed virtualization • Experiment with a variety of planetary-scale services • File sharing and network-embedded storage • Content distribution networks • Routing and multicast overlays • Network measurement tools • Etc.
The Paper (and this Presentation) • Node Manager (NM) • Each user creates a slice • NM configures slivers on actual nodes • Paper describes 7 challenges encountered • Applicable to all large-scale heterogeneous environments
The se7en Sins • Networks are unreliable in the worst way • DNS is not a good naming system • Local clocks are inaccurate/unreliable • Large-scale systems are always inconsistent • The improbable will happen • Over-utilization is the steady-state condition • Limited system transparency hampers debugging
Large Heterogeneous Networks are Fundamentally Unreliable • The IP, TCP, and UDP specifications say delivery is unreliable • They really, really mean it • Packets are not merely lost • They are also delayed, duplicated, and corrupted • Highly variable latency – 24 hours to download a small file • Unexpected termination – RPCs get interrupted
Unreliable Networks – Solutions • ALL possible errors should be handled gracefully • They will happen • For variable latency • Multithreading or async I/O plus timeouts (see the sketch below) • For RPC operations • Transactions may be too heavyweight • NM: acquire & bind • Interference from other users • Someone might port-scan you – e.g., requests like "SEARCH \x90\x90\x90\x90\x90…"
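As a rough illustration of "handle every error, and bound every wait with a timeout", here is a minimal Python sketch; the host, port, retry count, and backoff policy are invented for the example and are not part of the paper or of the Node Manager.

    import socket
    import time

    def fetch_with_timeout(host, port, payload, timeout=5.0, retries=3):
        """Send a request and read a reply, assuming any network error can occur."""
        for attempt in range(retries):
            try:
                with socket.create_connection((host, port), timeout=timeout) as sock:
                    sock.settimeout(timeout)   # bound every read, not just the connect
                    sock.sendall(payload)
                    return sock.recv(4096)     # may still be garbage; validate before use
            except (socket.timeout, ConnectionError, OSError):
                # Delays, duplicates, resets and corruption all surface as exceptions
                # or bad data; back off and retry rather than assuming success.
                time.sleep(2 ** attempt)
        raise RuntimeError("no usable reply after %d attempts" % retries)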
DNS Names Make Poor Node Identifiers • Suffer from ambiguity & instability • Human errors • Network reorganizations, renaming of hosts • DNS servers may be overloaded -> fall back to secondary servers • Network asymmetry: internal names, NAT, etc. • Non-static addresses & multihoming • NM: unique numeric IDs based on the MAC address (see the sketch below)
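The paper's fix is a unique numeric ID derived from the MAC address; the exact scheme is not given, so the following is only a sketch of the general idea using Python's standard library.

    import uuid

    def stable_node_id():
        """Numeric node identifier tied to hardware, not to a DNS name."""
        mac = uuid.getnode()        # 48-bit MAC address of one interface
        # uuid.getnode() falls back to a random value when no MAC can be read;
        # a real deployment would persist the first value it obtains and reuse it.
        return "%012x" % mac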
Local Clocks are Unreliable and Untrustworthy • Local clocks drift and can be plain wrong • NTP helps, but some sites block it • Bottom line: make sure your application knows about the problem (see the sketch below)
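One concrete way an application can "know about the problem" is to avoid the wall clock when measuring intervals. A minimal sketch; the monotonic-clock technique is a general one, not something prescribed by the paper.

    import time

    # Wall-clock time (time.time) can jump when NTP steps the clock, or drift badly
    # when a site blocks NTP. Use a monotonic clock for durations, and treat
    # absolute timestamps received from other nodes as approximate.
    start = time.monotonic()
    time.sleep(0.1)                  # stand-in for the application's real work
    elapsed = time.monotonic() - start
    print("elapsed: %.3f s" % elapsed)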
Inconsistent Node Configuration is the Norm • Multiple versions of software packages • Multiple versions of your own application • Updates are not well ordered • NM • Incorporate failsafe behavior • No slices in XML file -> probably a major format change • Version numbering (see the sketch below)
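A hedged sketch of the failsafe-plus-version-number idea: refuse to act on a configuration that looks structurally wrong instead of treating a misparse as "remove everything". The file name, tag names, and version attribute are invented for illustration; the NM's actual format is not described here.

    import xml.etree.ElementTree as ET

    def load_slices(path="slices.xml", expected_version="1"):
        """Parse a slice list, failing safe when the format looks wrong."""
        root = ET.parse(path).getroot()
        if root.get("version") != expected_version:
            # Probably a format change we don't understand; keep the current
            # configuration rather than acting on a misparsed one.
            raise ValueError("unexpected config version: %r" % root.get("version"))
        slices = root.findall("slice")
        if not slices:
            # An empty list more likely means a major format change than an
            # instruction to delete everything -- fail safe and keep running.
            raise ValueError("no slices found; refusing to apply empty config")
        return slices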
There’s No Such Thing as “One-in-a-Million” • Hundreds of nodes, 24/7 • Murphy’s Law • Unexpected reboots happen more than you’d think • Power outages all over the world • Must not cut corners when handling errors
No PlanetLab Node is Under-Utilized • The norm: • 100% utilization, load in the 5-10 range • 20+ concurrent slices • Several hundred processes are not uncommon • Hence: • Applications run much slower • So make your applications load-aware (see the sketch below) • System-wide solution • Smart scheduler
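To make an application load-aware, one simple approach is to check the load average and defer work while the node is overloaded. A sketch under the assumption that the work is periodic and can be skipped; the threshold and interval are arbitrary.

    import os
    import time

    def run_periodic_task(task, interval=60, load_threshold=10.0):
        """Run a task periodically, skipping rounds while the node is overloaded."""
        while True:
            one_minute_load, _, _ = os.getloadavg()
            if one_minute_load < load_threshold:
                task()
            # Load in the 5-10+ range is normal on PlanetLab; measurements or
            # latency-sensitive work taken now would be misleading anyway.
            time.sleep(interval)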
Limited System Transparency Hampers Debugging • Virtualization -> applications don't get a complete view of the system • System solutions • Develop debugging tools • App side: • Debug thoroughly before deploying • Report as much info as possible in a readable format (see the sketch below)
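For "report as much info as possible in a readable format", a minimal sketch using Python's standard logging module; the log file name, logger name, and request handler are hypothetical.

    import logging

    logging.basicConfig(
        filename="app.log",                     # hypothetical log file name
        level=logging.DEBUG,
        format="%(asctime)s %(levelname)s %(name)s: %(message)s",
    )
    log = logging.getLogger("myservice")        # hypothetical service name

    def handle_request(peer_addr, payload):
        """Hypothetical request handler; only the logging pattern matters here."""
        try:
            return payload.decode("utf-8")      # stand-in for real processing
        except Exception:
            # Log full context plus the traceback; on a virtualized node you may
            # not get a second chance to reproduce the failure interactively.
            log.exception("request failed: peer=%s len=%d", peer_addr, len(payload))
            raise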
Guidelines • Assumptions made in non-distributed apps are not valid in large-scale and/or heterogeneous environments • Distributed apps must gracefully handle a broad range of unlikely errors • Resource management in a distributed environment is much different from the non-distributed case • Even local operations can behave radically differently on a heavily utilized system