Emergent (Mis)behavior vs. Complex Software Systems Jeff Mogul HP Labs – Palo Alto April 2006
Emergent behavior? • Ants are dumb • Anthills are “smart” • The global behavior of the anthill emerges from the local behaviors of the ants • The individual ants don’t know what the global behavior is supposed to be
Opening day on the Millennium Footbridge • Opening day (10 June 2000): • “unexpected lateral vibrations occurred” • “a significant number of pedestrians [had] difficulty walking” • The bridge was closed; the engineers got back to work • They had already done very careful modelling of a novel design • What went wrong? • People on a swaying surface tend to synchronize their footsteps to the swaying, even if the initial amplitude is small • Bridge’s natural frequency was close to normal footsteps • This effect was unknown in engineering literature • Novel bridge design + unusual pedestrian-only load • Once the problem was understood, modelling and retrofit were fairly straightforward
Why is that bridge interesting to us? • People have been designing bridges for millennia • Civil engineering is a well-regulated profession • Lots of experience with unexpected dynamic failures • Lots of computer modelling expertise • But the engineers still got it wrong: why? • Answer: emergent misbehavior • The system’s behavior emerged – it wasn’t easy to predict • Particularly, not from understanding of individual “parts” • And the result was unexpected and bad • If these engineers got it wrong, what about us? • Computer systems are worse than bridges!
The importance of emergent misbehavior in computer systems Much past focus has been on: • Fault-tolerant systems • Correctness-by-construction Both are valuable, but … • System-wide failures are not always caused by “faults” • Modern systems are too complex to understand • Performance matters! All three issues can result from emergent misbehavior Goals of this talk: • Illustrate the scope and nature of the problem • Propose a research agenda
What this talk is NOT about • Dealing with malicious behavior • Game theory and incentives for people • Telling anyone that their approach is wrong • We still need fault tolerance, program verification, correct-by-construction techniques, etc.! • Improving peak (best-case) system performance This talk is 100% uncontaminated by: • Implementation or architecture • Experiments or results
Outline • Examples • What is/is not “emergent misbehavior”? • A research agenda • Thoughts about visions of the future • Related work
Examples of emergent misbehavior Examples can be found in: • Non-computer technology • Millennium Footbridge (London); Traffic jams • Computer hardware • Vibrations in large disk arrays • Networking • Ethernet capture effect, Router synchronization; BGP Route flap damping; TCP’s Nagle algorithm • Distributed systems and operating systems • Misconfigured load balancer; Herd behavior; Priority inversion in the Mars Pathfinder
Examples of emergent misbehavior Examples described in this talk: • Non-computer technology • Millennium Footbridge (London); Traffic jams • Computer hardware • Vibrations in large disk arrays • Networking • Ethernet capture effect, Router synchronization; BGP Route flap damping; TCP’s Nagle algorithm • Distributed systems and operating systems • Misconfigured load balancer; Herd behavior; Priority inversion in the Mars Pathfinder
Ethernet Capture Effect: an example scenario • Assume both hosts have full transmit queues • Host A decides to transmit; Host B decides to transmit • Host A, count = 1, flips “backoff coin” = 0; Host B, count = 1, flips “backoff coin” = 1 • Host A wins, transmits; Host B idle • Host A decides to transmit; Host B decides to transmit • Host A, count = 1, flips “backoff coin” = 0; Host B, count = 2, flips “backoff coin” = 01 • Host A wins, transmits; Host B idle • … ad infinitum • B’s disadvantage doubles on each round
Ethernet Capture Effect (II) • No component here has failed • Problem didn’t show up until chips met the spec • Older chips were too slow to send back-to-back packets • The extra delay left B a chance to sneak in • Apparently was not caught in original modelling • Problem doesn’t require large scale to show up • In fact, adding more hosts tends to blur the picture • Solution involved adding extra delay • “Don’t send back-to-back if you just won a collision” • [Ramakrishnan and Yang, 1994]
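The capture scenario above can be reproduced with a very small simulation. The sketch below is a simplified, slotted model and not a faithful 802.3 implementation: it assumes two hosts that always have a frame queued, truncated binary exponential backoff (exponent capped at 10, frame dropped after 16 attempts), and no propagation delay or inter-frame gap. The function names `simulate_capture` and `longest_run` are made up for illustration.

```python
import random


def simulate_capture(num_frames, seed=0, max_exponent=10, max_attempts=16):
    """Return the winner ('A' or 'B') of each of num_frames transmitted frames.

    Simplified model: both hosts always have a frame ready, so every
    contention starts with a collision followed by a random backoff.
    """
    rng = random.Random(seed)
    collisions = {"A": 0, "B": 0}  # per-host collision count for current frame
    winners = []
    while len(winners) < num_frames:
        backoff = {}
        for host in collisions:
            collisions[host] += 1
            if collisions[host] > max_attempts:
                # Frame is dropped; the next frame starts with a fresh count.
                collisions[host] = 1
            exponent = min(collisions[host], max_exponent)
            backoff[host] = rng.randrange(2 ** exponent)
        if backoff["A"] == backoff["B"]:
            continue  # both picked the same slot: they collide again
        winner = min(backoff, key=backoff.get)
        winners.append(winner)
        # Only the winner resets its count; the loser keeps its history,
        # so its backoff range keeps doubling while the winner's stays small.
        collisions[winner] = 0
    return winners


def longest_run(winners):
    """Length of the longest streak of consecutive wins by one host."""
    best = run = 0
    previous = None
    for w in winners:
        run = run + 1 if w == previous else 1
        best = max(best, run)
        previous = w
    return best


if __name__ == "__main__":
    result = simulate_capture(10_000)
    print("frames won by A:", result.count("A"))
    print("longest streak by one host:", longest_run(result))
```

Each host follows the same "correct" local rule, yet the streak lengths show one host monopolizing the channel for long stretches, which is the emergent part.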
Herd behavior in a distributed system • Planetary-Scale Event Prop & Routing System • (a.k.a. PsEPR) [Brett et al., WORLDS 2005] • Runs on PlanetLab • Aims for very large scale • Requires clients to be distributed evenly among servers • Clients keep ordered preference lists of servers • Prefer “nearby” servers (based on all-pairs-ping) • On server failure: • Demote failed server • Try to connect to top server on list
PsEPR system structures • [Figure: two panels, “Desirable” and “Undesirable”]
Herd behavior in a distributed system: what went wrong with PsEPR • Initially, clients generally balanced among servers • As servers/links failed: • Same servers tended to look bad to most clients • So, client preference lists tended to converge • So, clients tended to connect to a small subset of servers • Clients mostly converged on a few servers: • These servers became overloaded • Server-local response-time monitors caused restarts • Causing further convergence of client preference lists • Clients all moved to the next server on their list • At rate governed by server restart times • Fix: adjust ordering by success count + random number
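The fix on this slide can be illustrated with a short sketch. The slide only says "success count + random number"; the scoring formula, the `jitter` range, and the names `rank_servers` and `top_choice_histogram` below are assumptions made for illustration, not the PsEPR implementation.

```python
import random
from collections import Counter


def rank_servers(success_counts, rng, jitter=1.0):
    """Order servers by success count plus a per-client random term.

    success_counts maps server name -> number of successful connections.
    jitter is the (assumed) upper bound of the random term; 0 disables it.
    """
    scores = {
        server: count + rng.uniform(0.0, jitter)
        for server, count in success_counts.items()
    }
    return sorted(scores, key=scores.get, reverse=True)


def top_choice_histogram(num_clients, success_counts, jitter, seed=0):
    """Count how many clients would pick each server as their first choice."""
    rng = random.Random(seed)
    return Counter(
        rank_servers(success_counts, rng, jitter)[0] for _ in range(num_clients)
    )


if __name__ == "__main__":
    counts = {"s1": 3, "s2": 3, "s3": 3, "s4": 3}
    print("no random term:  ", dict(top_choice_histogram(1000, counts, 0.0)))
    print("with random term:", dict(top_choice_histogram(1000, counts, 1.0)))
```

When every client sees the same success counts, a deterministic ordering sends the whole herd to one server, while the random term spreads first choices across servers that look equally good.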
Outline • Examples • What is/is not “emergent misbehavior”? • A research agenda • Thoughts about visions of the future • Related work
One definition of emergent behavior Emergent behavior is that which cannot be predicted through analysis at any level simpler than that of the system as a whole. • George Dyson (1998) • Emergent misbehavior is just emergent behavior that we don’t want
Distinguishing between emergent and “normal” misbehavior • Misbehavior that is not emergent: • Single-component bugs that break the whole system • Inherently inefficient algorithms • Insufficient resources • Much work on computer systems reliability • Focuses on handling faults • Aims for “correct by construction” • Emergent misbehavior tends to be: • Global misbehavior arising from “correct” local behaviors • Related to the composition of independent parts • Related to delays and to decentralized control • It might not ever be possible to be definitive
Outline • Examples • What is/is not “emergent misbehavior”? • A research agenda • Thoughts about visions of the future • Related work
Outline of a proposed research agenda • Create a taxonomy of emergent misbehaviors • To guide the rest of the agenda • Create a taxonomy of frequent causes • Generalize when possible; tie back to taxonomy #1 • Develop detection and diagnosis techniques • Look for distinctive signatures from taxonomies • Develop prediction techniques • For better prediction of performance and failures • Develop amelioration techniques • System design tricks to avoid emergent misbehavior • Develop testing techniques • Strategies for smoking out emergent misbehavior during testing
Taxonomy #1: kinds of emergent misbehavior • Thrashing • Unwanted synchronization • Unwanted oscillation or periodicity • Deadlock • Livelock • Phase change • Chaotic behavior • etc.
Taxonomy #2: frequent causes of emergent misbehavior • Unexpected resource sharing • Massive scale • Decentralized control • Lack of composability • Misconfiguration • Unexpected inputs or loads • Communication delay • etc.
There’s a lot more work to do! • A little more discussion in the paper … • Hopefully, a few dissertations, from people with more energy than I have.
Outline • Examples • What is/is not “emergent misbehavior”? • A research agenda • Thoughts about visions of the future • Related work
Visions of the future (large-scale and enterprise systems) • Automatic control of data centers and services • Beyond “lights out” to “minimal human involvement” • Feedback control of almost everything • Service-oriented computing • Construction by composition of “services” • Correctness by construction • Loose coupling via networks • Declarative approaches • “Models” for components and their composition
Visions of the future: ignoring emergent misbehavior? • Automatic control of data centers and services • Feedback loops can lead to surprises • Especially when several loops are working at cross purposes • Service-oriented computing • Composition of dynamic behaviors could yield surprises • Loose coupling via networks: adds latency • Declarative approaches • Rule-based systems are hard to debug • Less explicit control over dynamics than procedural style?
Outline • Examples • What is/is not “emergent misbehavior”? • A research agenda • Thoughts about visions of the future • Related work
Related work • Lots of related work on good side of emergence • E.g.: Dyson, Darwin Among the Machines (1998) • Non-computer work on misbehavior: • Parunak & VanderBok (1997) • “Managing emergent behavior in distributed control systems” • Computer systems work on emergent misbehavior: • Term first(?) used by Ed Nisley (Dr. Dobb’s J., 2004) • Steven Gribble (HotOS, 2001) • Making systems more robust in the face of the unexpected • National Research Council report: A Research Agenda for Networked Systems of Embedded Computers (2001)
Summary • We’ve already seen lots of emergent misbehavior • Trends could make things worse in the future • CS research on reliability has focussed on faults • We need to understand emergent misbehavior • We need ways to cope with it • A lot more detail in the paper
Advice for OSDI Authors • There will be no extensions to the deadline • Papers that violate the format requirements will be rejected.