Towards an Internet that “Never Fails”
Hari Balakrishnan, MIT
Joint work with Nick Feamster, Scott Shenker, Mythili Vutukuru
What We Should Aim Toward
• Carrier airlines (2002 FAA Fact Book)
  • 41 accidents, 6.7 million flights (five “nines” availability)
• 911 phone service (1993 NRIC report)
  • 29 minutes downtime per year per line (four “nines” availability)
• Standard phone service (various sources)
  • 53 minutes downtime per year per line (four “nines” availability)
• The Internet?
  • One to two “nines”
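The “nines” figures above follow from simple arithmetic. The sketch below is illustrative only: the function name `nines` is an assumption, and treating accidents per flight as "unavailability" just follows the slide's analogy. It returns a fractional count of nines, which shows that the slide's figures are rounded (53 minutes per year is just under four nines).

```python
import math

MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes in a non-leap year


def nines(unavailability: float) -> float:
    """Return the (fractional) number of nines for a given unavailability.

    An unavailability of 1e-4 means 99.99% availability, i.e. 4.0 nines.
    """
    return -math.log10(unavailability)


# Figures taken from the slide; each ratio is the fraction of "bad" outcomes.
airline = nines(41 / 6.7e6)              # accidents per flight
e911 = nines(29 / MINUTES_PER_YEAR)      # downtime minutes per year per line
phone = nines(53 / MINUTES_PER_YEAR)     # downtime minutes per year per line

for name, value in [("airline", airline), ("911", e911), ("phone", phone)]:
    print(f"{name}: {value:.2f} nines (rounds to {round(value)})")
```

By the same arithmetic, one to two nines corresponds to somewhere between roughly 1% and 10% unavailability.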
Example Catastrophic Failures
“…a glitch at a small ISP… triggered a major outage in Internet access across the country. The problem started when MAI Network Services...passed bad router information from one of its customers onto Sprint.”
-- news.com, April 25, 1997
“Microsoft's websites were offline for up to 23 hours...because of a [router] misconfiguration…it took nearly a day to determine what was wrong and undo the changes.”
-- wired.com, January 25, 2001
“WorldCom Inc…suffered a widespread outage on its Internet backbone that affected roughly 20 percent of its U.S. customer base. The network problems…affected millions of computer users worldwide. A spokeswoman attributed the outage to "a route table issue."”
-- cnn.com, October 3, 2002
“A number of Covad customers went out from 5pm today due to, supposedly, a DDOS (distributed denial of service attack) on a key Level3 data center, which later was described as a route leak (misconfiguration).”
-- dslreports.com, February 23, 2004
NANOG List Failure “Analysis”
More than 70% of threads discussing failures are related to router configuration or route announcement problems.
Note: Only includes problems openly discussed on this list.
Faults and Failures
• Fault = Underlying defect in a component that causes it to violate a specification
  • Latent or Active (i.e., cause errors)
  • Unmasked faults (errors) cause failures
  • Failure of subsystem (spec violation) causes fault in system
• Internet faults occur for complex reasons
  • Hardware, software, protocol, design, implementation, operational faults: could be triggered by malice
• Internet failure: A cannot communicate with B
Three Directions
• Configuration as programming
  • Defines BGP behavior
  • Tools to cope with routing complexity
• Coping with protocol faults: failure-atomic interdomain routing
  • Prefix-based routing considered harmful
• End-to-end routing
  • Exposing multiple paths to end systems (and stubs)
Today: Reactive Operation
What happens if I tweak this policy…?
• Problems cause downtime
• Problems often not immediately apparent
Coping with Complexity
• View configuration as (distributed) programming
  • Large-scale: over 1M lines of code in some networks
• Programming tools to reduce fault frequency
  • Static analysis can detect many faults [rcc]
  • Sandboxing to overcome current “stimulus-response” reasoning [FR03]
• Centralize configuration platform
  • More “intentional” config specs
  • Push configs to routers
  • Push routes to routers [RCP:F+04]
  • Use static analysis and sandboxing tools
Proactive Operation with rcc
http://nms.csail.mit.edu/rcc
[Figure: workflow with stages Configure, Detect Faults (rcc), Deploy]
• Represent complex, distributed configuration
• Define a correctness specification
• Map specification to constraints
[Figure: rcc pipeline. Labels: Distributed router configurations (Single AS), Normalized Representation, Correctness Specification, Constraints, Faults]
Correctness Specification
• Path Visibility: Every destination with a usable path has a route advertisement
  • If there exists a path, then there exists a route
  • Example violation: Signaling partition
• Route Validity: Every route advertisement corresponds to a usable path
  • If there exists a route, then there exists a path
  • Example violation: Routing loop
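The two properties are converses of each other, which a toy model makes concrete. The sketch below is not how rcc works (rcc analyzes router configurations statically); it only restates the specification over two hypothetical sets of destinations, and the function name, dictionary keys, and example prefixes are all assumptions made for illustration.

```python
def check_spec(usable_paths: set[str], advertised_routes: set[str]) -> dict[str, set[str]]:
    """Compare destinations that have a usable path with those that have a route.

    Path visibility: path exists  => route exists.
    Route validity:  route exists => path exists.
    """
    return {
        # Destinations reachable in the topology but never advertised.
        "path_visibility_violations": usable_paths - advertised_routes,
        # Destinations advertised even though no usable path backs them.
        "route_validity_violations": advertised_routes - usable_paths,
    }


# Hypothetical example data (documentation and private prefixes).
usable = {"10.0.0.0/8", "192.0.2.0/24"}
routes = {"192.0.2.0/24", "198.51.100.0/24"}

report = check_spec(usable, routes)
print(report)
```

A configuration satisfies the specification in this toy model only when both sets of violations are empty, that is, when the two input sets are equal.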
Results: Faults across 17 ASes
• Every AS had faults, regardless of network size
• Most faults can be attributed to distributed configuration
[Figure panels: Route Validity, Path Visibility]
Three Directions
• Configuration as programming
  • Tools to cope with routing complexity
• Coping with protocol faults: failure-atomic interdomain routing
  • Prefix-based routing considered harmful
• End-to-end routing
  • Exposing multiple paths to end systems
Prefixes are too coarse-grained
Validity: If a failure occurs that makes a network unreachable via a given path, then the route corresponding to that path must be withdrawn
70% of intra-AS failures are not visible in BGP [FABK03]
…but they are also too fine-grained!
• ~70% of discontiguous prefix pairs from the same AS are announced from the same location
• Allocation explains about 60% of these cases:
  • Registries often allocate discontiguous address blocks to a single AS on the same day
• Routes for these prefixes will “flap” together.
  • 135.36.0.0/16 (Agere) and 135.12.0.0/14 (Lucent)
Route objects should correspond to an “atom” of hosts that share fate
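To make “discontiguous” concrete, the sketch below uses Python's standard `ipaddress` module to confirm that the two example prefixes neither overlap nor aggregate into a single prefix. It says nothing about whether they share fate; that is the slide's claim. The variable names are illustrative.

```python
import ipaddress

# The two example prefixes from the slide.
agere = ipaddress.ip_network("135.36.0.0/16")
lucent = ipaddress.ip_network("135.12.0.0/14")

# Discontiguous: the address ranges do not overlap...
overlapping = agere.overlaps(lucent)

# ...and they cannot be merged into one covering prefix, so BGP must carry
# them as two separate route objects.
merged = list(ipaddress.collapse_addresses([agere, lucent]))

print("overlap:", overlapping)
print("route objects after aggregation:", [str(n) for n in merged])
```

Because aggregation leaves two prefixes, a prefix-based routing system announces and withdraws them independently even when one underlying event affects both.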
Proposal: Atomic Interdomain Protocol (AIP)
• Exterminate prefixes
• Name “atomic domains” (AD) directly
  • Addressing, forwarding and routing on ADs
  • Like current AS numbers, but finer-grained
  • Example: MIT, Microsoft Redmond, one PoP of a large ISP, …
• Flat AD IDs can carry cryptographic meaning
  • Self-certifying (hash of public key)
• End-system addresses have the form [AD : LocalID]
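A self-certifying identifier can be sketched in a few lines. The slide only says the AD ID is a hash of a public key; everything else below is an assumption made for illustration: the choice of SHA-256, the hex encoding, the placeholder key bytes, and the function names `ad_id`, `make_address`, and `verify_ad`.

```python
import hashlib


def ad_id(public_key: bytes) -> str:
    """Derive a flat AD identifier by hashing the domain's public key.

    SHA-256 and hex encoding are assumptions; the slide names no algorithm.
    """
    return hashlib.sha256(public_key).hexdigest()


def make_address(public_key: bytes, local_id: str) -> str:
    """Build an end-system address of the form [AD : LocalID]."""
    return f"[{ad_id(public_key)} : {local_id}]"


def verify_ad(claimed_ad: str, presented_key: bytes) -> bool:
    """Check that a presented public key really hashes to the claimed AD ID."""
    return ad_id(presented_key) == claimed_ad


# Placeholder bytes standing in for real public keys.
key = b"example-public-key-bytes"
other_key = b"some-other-key"

address = make_address(key, "host-17")
print(address)
print(verify_ad(ad_id(key), key), verify_ad(ad_id(key), other_key))
```

The point of the design is that anyone holding the AD ID can check a presented key against it without consulting a registry; a real system would pair this check with a signature made by the matching private key.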
Summary
It’s worth shooting for a two or three order-of-magnitude improvement in Internet availability
It’s possible to get four or five nines of Internet availability, if we:
• Develop tools to cope with configuration complexity
• Develop a failure-atomic routing system
• Expose multiple IP-layer paths to higher layers