The Seven Deadly Sins of Distributed Systems by Steve Muir, Princeton University. Workshop on Real Large Distributed Systems, 2004. Presentation: Charles Yang, 2005/3/17
Introduction • PlanetLab • At the time of the paper: 400+ nodes, 175 sites • Now: 537 machines, 254 sites
What does it do? • IA-32 machines running Linux • Distributed virtualization • Experiment with a variety of planetary-scale services • File sharing and network-embedded storage • Content distribution networks • Routing and multicast overlays • Network measurement tools • Etc.
The Paper (and this Presentation) • Node Manager (NM) • Each user creates a slice • NM configures slivers on actual nodes • Paper describes 7 challenges encountered • Applicable to all large-scale heterogeneous environments
The se7en Sins • Networks are unreliable in the worst way • DNS is not a good naming system • Local clocks are inaccurate/unreliable • Large-scale systems are always inconsistent • The improbable will happen • Over-utilization is the steady-state condition • Limited system transparency hampers debugging
Large Heterogeneous Networks are Fundamentally Unreliable • The IP, TCP, and UDP specifications say delivery is unreliable • They really, really mean it • Packets are not merely lost • They are also delayed, duplicated, and corrupted • Highly variable latency – 24 hours to download a small file • Unexpected termination – RPCs get interrupted
Unreliable Networks – Solutions • ALL possible errors should be handled gracefully • They will happen • For variable latency • Multithreading or async I/O plus timeouts (see the sketch below) • For RPC operations • Transactions may be too heavyweight • NM: acquire & bind • Interference from other users • Someone might port-scan you – e.g., requests like "SEARCH \x90\x90\x90\x90\x90…"
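As a rough illustration of "handle every error, and bound every wait with a timeout", here is a minimal Python sketch; the host, port, retry count, and backoff policy are invented for the example and are not part of the paper or of the Node Manager.

    import socket
    import time

    def fetch_with_timeout(host, port, payload, timeout=5.0, retries=3):
        """Send a request and read a reply, assuming any network error can occur."""
        for attempt in range(retries):
            try:
                with socket.create_connection((host, port), timeout=timeout) as sock:
                    sock.settimeout(timeout)   # bound every read, not just the connect
                    sock.sendall(payload)
                    return sock.recv(4096)     # may still be garbage; validate before use
            except (socket.timeout, ConnectionError, OSError):
                # Delays, duplicates, resets and corruption all surface as exceptions
                # or bad data; back off and retry rather than assuming success.
                time.sleep(2 ** attempt)
        raise RuntimeError("no usable reply after %d attempts" % retries)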
DNS Names Make Poor Node Identifiers • Suffer from ambiguity & instability • Human errors • Network reorganizations, renaming of hosts • DNS servers may be overloaded -> fall back to secondary servers • Network asymmetry: internal names, NAT, etc. • Non-static addresses & multihoming • NM: unique numeric IDs based on the MAC address (see the sketch below)
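The paper's fix is a unique numeric ID derived from the MAC address; the exact scheme is not given, so the following is only a sketch of the general idea using Python's standard library.

    import uuid

    def stable_node_id():
        """Numeric node identifier tied to hardware, not to a DNS name."""
        mac = uuid.getnode()        # 48-bit MAC address of one interface
        # uuid.getnode() falls back to a random value when no MAC can be read;
        # a real deployment would persist the first value it obtains and reuse it.
        return "%012x" % mac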
Local Clocks are Unreliable and Untrustworthy • Local clocks drift and can be plain wrong • NTP helps, but some sites block it • Bottom line: make sure your application knows about the problem (see the sketch below)
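One concrete way an application can "know about the problem" is to avoid the wall clock when measuring intervals. A minimal sketch; the monotonic-clock technique is a general one, not something prescribed by the paper.

    import time

    # Wall-clock time (time.time) can jump when NTP steps the clock, or drift badly
    # when a site blocks NTP. Use a monotonic clock for durations, and treat
    # absolute timestamps received from other nodes as approximate.
    start = time.monotonic()
    time.sleep(0.1)                  # stand-in for the application's real work
    elapsed = time.monotonic() - start
    print("elapsed: %.3f s" % elapsed)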
Inconsistent Node Configuration is the Norm • Multiple versions of software packages • Multiple versions of your own application • Updates are not well ordered • NM • Incorporate failsafe behavior • No slices in XML file -> probably a major format change • Version numbering (see the sketch below)
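A hedged sketch of the failsafe-plus-version-number idea: refuse to act on a configuration that looks structurally wrong instead of treating a misparse as "remove everything". The file name, tag names, and version attribute are invented for illustration; the NM's actual format is not described here.

    import xml.etree.ElementTree as ET

    def load_slices(path="slices.xml", expected_version="1"):
        """Parse a slice list, failing safe when the format looks wrong."""
        root = ET.parse(path).getroot()
        if root.get("version") != expected_version:
            # Probably a format change we don't understand; keep the current
            # configuration rather than acting on a misparsed one.
            raise ValueError("unexpected config version: %r" % root.get("version"))
        slices = root.findall("slice")
        if not slices:
            # An empty list more likely means a major format change than an
            # instruction to delete everything -- fail safe and keep running.
            raise ValueError("no slices found; refusing to apply empty config")
        return slices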
There’s No Such Thing as “One-in-a-Million” • Hundreds of nodes, 24/7 • Murphy’s Law • Unexpected reboots happen more than you’d think • Power outages all over the world • Must not cut corners when handling errors
No PlanetLab Node is Under-Utilized • The norm: • 100% utilization, load in the 5-10 range • 20+ concurrent slices • Several hundred processes are not uncommon • Hence: • Applications run much slower • So make your applications load-aware (see the sketch below) • System-wide solution • Smart scheduler
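To make an application load-aware, one simple approach is to check the load average and defer work while the node is overloaded. A sketch under the assumption that the work is periodic and can be skipped; the threshold and interval are arbitrary.

    import os
    import time

    def run_periodic_task(task, interval=60, load_threshold=10.0):
        """Run a task periodically, skipping rounds while the node is overloaded."""
        while True:
            one_minute_load, _, _ = os.getloadavg()
            if one_minute_load < load_threshold:
                task()
            # Load in the 5-10+ range is normal on PlanetLab; measurements or
            # latency-sensitive work taken now would be misleading anyway.
            time.sleep(interval)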
Limited System Transparency Hampers Debugging • Virtualization -> applications don't get a complete view of the system • System solutions • Develop debugging tools • App side: • Debug thoroughly before deploying • Report as much info as possible in a readable format (see the sketch below)
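For "report as much info as possible in a readable format", a minimal sketch using Python's standard logging module; the log file name, logger name, and request handler are hypothetical.

    import logging

    logging.basicConfig(
        filename="app.log",                     # hypothetical log file name
        level=logging.DEBUG,
        format="%(asctime)s %(levelname)s %(name)s: %(message)s",
    )
    log = logging.getLogger("myservice")        # hypothetical service name

    def handle_request(peer_addr, payload):
        """Hypothetical request handler; only the logging pattern matters here."""
        try:
            return payload.decode("utf-8")      # stand-in for real processing
        except Exception:
            # Log full context plus the traceback; on a virtualized node you may
            # not get a second chance to reproduce the failure interactively.
            log.exception("request failed: peer=%s len=%d", peer_addr, len(payload))
            raise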
Guidelines • Assumptions made in non-distributed apps are not valid in large-scale and/or heterogeneous environments • Distributed apps must gracefully handle a broad range of unlikely errors • Resource management in a distributed environment is much different from the non-distributed case • Even local operations can behave radically differently on a heavily utilized system