Networked Systems Laboratory. Online Testing of BGP. Marco Canini, EPFL, Switzerland. Joint work with: Vojin Jovanović, Daniele Venzano, Gautam Kumar, Dejan Novaković, Boris Spasojević, Olivier Crameri, and Dejan Kostić. Work supported by the European Research Council.
Networked Systems Laboratory
Online Testing of BGP
Marco Canini, EPFL, Switzerland
Joint work with: Vojin Jovanović, Daniele Venzano, Gautam Kumar, Dejan Novaković, Boris Spasojević, Olivier Crameri, and Dejan Kostić
Work supported by the European Research Council
Marco Canini, RIPE 62
Is it hard to crash the Internet?
• Software bugs in inter-domain routers
• "At 17:07:26 UTC on August 19, 2009 CNCI (AS9354), a small network service provider in Nagoya, Japan, advertised a handful of BGP updates containing an empty AS4_PATH attribute." [renesys blog]
[Figure: a protocol-compliant but confusing message (0-length AS4_PATH attribute) travels between routers and triggers a session reset; legend: router type A, router type B]
Is it hard to crash the Internet?
• What went wrong?
• Repeated service disruptions: routing instabilities!
[Figure: parts of the network become unreachable; legend: unaffected router, affected router]
BGP not always reliable
• Distributed system behavior
• Aggregate result of interleaved actions of multiple routers
• Federated, heterogeneous and failure-prone environment
• Difficult to reason about all corner cases or combinations of configurations
• Unanticipated interactions, subtle differences in inter-operable implementations, system-wide conflicts, seemingly valid local fault handling
Agenda
• Our system for online testing
• Disclaimer: this is still research work!
• Not going to be an immediate solution
• Hope it will be a tool for this community
• Solicit feedback
• Which faults would you look for?
• What would convince you to deploy our system?
• … discussion
DiCE comes to the rescue
• Key idea: automatically explore system behavior to detect potential faults
• Create an isolated snapshot of a BGP neighborhood
• Subject a router’s BGP process to many inputs that systematically exercise router actions
• For each input, check if the snapshot misbehaves
• An error in the snapshot is evidence of possible future behavior of the production system
[Figure: DiCE attached to a BGP process and its BGP neighbors]
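The loop described on this slide (snapshot, feed inputs, check a property) can be sketched as follows. This is a minimal illustration, not DiCE's implementation: `ToyBgpProcess`, `handle_update`, and the toy bug are hypothetical stand-ins, and a deep copy stands in for the isolated snapshot.

```python
import copy
from dataclasses import dataclass, field


@dataclass
class ToyBgpProcess:
    """Hypothetical stand-in for a router's BGP process (not BIRD's API)."""
    routes: dict = field(default_factory=dict)
    session_up: bool = True

    def handle_update(self, prefix, as4_path):
        # Toy bug for illustration: an empty AS4_PATH resets the session.
        if len(as4_path) == 0:
            self.session_up = False
            return
        self.routes[prefix] = as4_path


def explore(production, inputs, misbehaves):
    """Run each input against a fresh clone and collect the faulty inputs."""
    faults = []
    for inp in inputs:
        clone = copy.deepcopy(production)  # isolated snapshot of current state
        clone.handle_update(*inp)
        if misbehaves(clone):
            faults.append(inp)
    return faults


production = ToyBgpProcess(routes={"10.0.0.0/8": [64500]})
inputs = [
    ("192.0.2.0/24", [64500, 64501]),  # ordinary update
    ("192.0.2.0/24", []),              # 0-length AS4_PATH
]
faults = explore(production, inputs, lambda p: not p.session_up)
```

Because every input runs against a clone, the production object keeps its session and routing table regardless of what the exploration finds.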
BGP snapshot
• Isolate testing from production environment
• Special IP prefix
• Custom attribute
• Local checkpoint of current state and configuration
• BGP’s federated environment
• Each router keeps its local checkpoint
• Private state & config stays in the AS
• ASes collaborate to detect potential faults
[Figure: a BGP process and its cloned BGP process, with FIB, sockets, BGP peers and BGP checkpoints]
Exploration of behavior
• Use a path exploration engine
• Concolic (CONCrete + symbOLIC) execution systematically exercises code paths
[Figure: in steps 1 to 3, DiCE drives a clone of the BGP process and asks "Is there an error?"; outcome shown: "Error!"]
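To make the idea of concolic exploration concrete, here is a toy sketch: run the code on a concrete input, record the branch outcomes as path constraints, then negate one branch at a time to find inputs that reach new paths. Everything here is an assumption for illustration: `handler` is an invented length check, and a brute-force search over 0..255 replaces the constraint solver a real engine would use.

```python
def handler(length, trace):
    """Toy attribute-length handler; `trace` records each branch taken."""
    def branch(name, pred):
        outcome = pred(length)
        trace.append((name, pred, outcome))
        return outcome

    if branch("length > 0", lambda v: v > 0):
        if branch("length % 4 == 0", lambda v: v % 4 == 0):
            return "accept"
        return "malformed"
    return "reset-session"  # toy bug reached only with a zero length


def solve(constraints, domain):
    """Brute-force stand-in for a constraint solver."""
    for v in domain:
        if all(pred(v) == want for _, pred, want in constraints):
            return v
    return None


def explore(seed, domain=range(256)):
    """Return {path: (input, result)} for every path reachable from `seed`."""
    seen, results, worklist = set(), {}, [seed]
    while worklist:
        value = worklist.pop()
        trace = []
        result = handler(value, trace)
        path = tuple((name, out) for name, _, out in trace)
        if path in seen:
            continue
        seen.add(path)
        results[path] = (value, result)
        # Keep the path prefix, negate one branch, and ask for a new input.
        for i, (name, pred, out) in enumerate(trace):
            candidate = solve(trace[:i] + [(name, pred, not out)], domain)
            if candidate is not None:
                worklist.append(candidate)
    return results


results = explore(seed=8)
outcomes = {result: value for value, result in results.values()}
```

Starting from the benign input 8, the search reaches all three paths, including the zero-length case that a random tester could easily miss.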
Driving behavior by inputs
• Inputs: messages, configuration changes, failures, timeouts, random choices
• Input generation works from the code and the current configuration
• Example of explored code, route selection: route ranking, is this the most preferred route?
• UPDATE message fields: Header; Withdrawn Routes; Path Attributes (Attribute Type | Length | Value); Network Layer Reachability Information (NLRI: Length | Prefix)
[Figure: the path exploration engine, with symbolic inputs and path constraints]
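The UPDATE fields listed on this slide can be illustrated by building one such message by hand. The sketch below encodes an UPDATE carrying a 0-length AS4_PATH attribute (type code 17) and parses the attributes back. The addresses and AS number are documentation values chosen for the example, the helper names are hypothetical, and this is not the input generator used by DiCE.

```python
import struct

ORIGIN, AS_PATH, NEXT_HOP, AS4_PATH = 1, 2, 3, 17  # attribute type codes
WELL_KNOWN = 0x40           # transitive flag
OPTIONAL_TRANSITIVE = 0xC0  # optional + transitive flags


def path_attribute(flags, type_code, value=b""):
    """Encode Attribute Type | Length | Value with a 1-byte length."""
    return struct.pack("!BBB", flags, type_code, len(value)) + value


def nlri(prefix, prefix_len):
    """Encode NLRI as Length (in bits) | Prefix (minimal number of octets)."""
    octets = bytes(int(part) for part in prefix.split("."))
    return bytes([prefix_len]) + octets[: (prefix_len + 7) // 8]


def update_message(attributes, nlri_bytes, withdrawn=b""):
    """Assemble Header | Withdrawn Routes | Path Attributes | NLRI."""
    body = (struct.pack("!H", len(withdrawn)) + withdrawn
            + struct.pack("!H", len(attributes)) + attributes
            + nlri_bytes)
    header = b"\xff" * 16 + struct.pack("!HB", 19 + len(body), 2)  # 2 = UPDATE
    return header + body


def parse_attributes(message):
    """Return [(type_code, value)] for the path attributes of an UPDATE."""
    pos = 19
    (withdrawn_len,) = struct.unpack_from("!H", message, pos)
    pos += 2 + withdrawn_len
    (attrs_len,) = struct.unpack_from("!H", message, pos)
    pos += 2
    end, found = pos + attrs_len, []
    while pos < end:
        flags, code = struct.unpack_from("!BB", message, pos)
        pos += 2
        if flags & 0x10:  # extended-length flag: 2-byte length
            (length,) = struct.unpack_from("!H", message, pos)
            pos += 2
        else:
            length = message[pos]
            pos += 1
        found.append((code, message[pos:pos + length]))
        pos += length
    return found


attributes = (
    path_attribute(WELL_KNOWN, ORIGIN, b"\x00")                          # IGP
    + path_attribute(WELL_KNOWN, AS_PATH, struct.pack("!BBH", 2, 1, 64500))
    + path_attribute(WELL_KNOWN, NEXT_HOP, bytes([192, 0, 2, 1]))
    + path_attribute(OPTIONAL_TRANSITIVE, AS4_PATH)                      # empty
)
message = update_message(attributes, nlri("198.51.100.0", 24))
parsed = parse_attributes(message)
```

The last parsed attribute has type 17 and an empty value, which is the kind of well-formed but unusual input the exploration is meant to generate.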
Detecting faults
• Check properties that capture desired behavior
• Example: Harmful Global Events (session resets)
• Inputs of interest: valid but ambiguous messages
• Each router evaluates a local check f() and reports 0 or 1 BGP error
• The DiCE controller sums the reports (∑), e.g. 5 BGP errors
• Error count > threshold?
• Log inputs that have harmful global behavior
[Figure: routers reporting to the DiCE controller; legend: unaffected router, affected router]
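The aggregation step on this slide amounts to summing per-router error counts and comparing the total to a threshold. A minimal sketch, assuming a hypothetical threshold of 3 and made-up router names; the slide does not state the actual threshold or report format:

```python
def check_harmful_global_event(error_counts, threshold):
    """Sum per-router BGP error counts and compare the total to a threshold."""
    total = sum(error_counts.values())
    return total > threshold, total


def log_if_harmful(test_input, error_counts, threshold, log):
    """Append the input to `log` when its global error count is too high."""
    harmful, total = check_harmful_global_event(error_counts, threshold)
    if harmful:
        log.append((test_input, total))
    return harmful


# Five routers report one BGP error each, three report none (as in the figure).
reports = {f"r{i}": 1 for i in range(5)}
reports.update({f"r{i}": 0 for i in range(5, 8)})

fault_log = []
harmful = log_if_harmful("0-length AS4_PATH", reports, threshold=3, log=fault_log)
```

Only the counts leave each router, which fits the later slide on keeping private state and configuration inside the AS.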
Other properties
• Policy-induced divergence
• Origin misconfiguration
• Check: routing tables polluted in external ASes?
• Route leaks (hijacks) by customer or provider
• Output: list of prefixes that can leak
[Figure: customer C and provider P; an UPDATE with AS_PATH C for prefix d]
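One way to picture the "list of prefixes that can leak" is a comparison between what exploration shows a customer could announce and what that customer is expected to originate. The sketch below is an assumption about how such a check might look, not the property checker from the talk; `leaked_prefixes` and the example prefixes are invented.

```python
import ipaddress


def leaked_prefixes(announced, allowed):
    """Return announced prefixes not covered by any allowed prefix."""
    allowed_nets = [ipaddress.ip_network(a) for a in allowed]
    leaks = []
    for prefix in announced:
        net = ipaddress.ip_network(prefix)
        covered = any(
            net.version == a.version and net.subnet_of(a) for a in allowed_nets
        )
        if not covered:
            leaks.append(prefix)
    return leaks


# Prefixes customer C may originate (hypothetical allow-list).
allowed = ["198.51.100.0/24"]
# Prefixes that exploration found C could end up announcing upstream.
announced = ["198.51.100.0/25", "203.0.113.0/24"]

leaks = leaked_prefixes(announced, allowed)
```

Here the /25 is a more-specific of an allowed prefix and passes, while 203.0.113.0/24 is reported as a potential leak.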
Keeping confidential information
• Potential router behavior
• Common code paths already exposed
• Reverse engineering any easier than today?
• Private state or configuration
• Information hiding through randomization
• Avoid inputs driven by confidential data, so that it cannot leak
• Rate limit, refuse certain explorers
• Anonymous property checks
• Secure multi-party computation: no need for a trusted third party
Implementation details
• Integrated DiCE in BIRD 1.1.7
• Open source router, coded in C
• Concolic execution instruments code to track symbolic inputs
• Instrumentation needed only for testing
• Negligible impact on the production environment
Evaluation
• Multiple BIRD instances on a 48-core machine
• Properties checked
• Harmful global events
• Origin misconfiguration
• Policy conflict
Evaluation topology
• Topology from [Haeberlen et al., NSDI ’09] + annotations
• Loaded ~300k BGP prefixes
• Replayed 15-min trace
• Policy and filtering
• Installed in ModelNet network emulator [OSDI ’02]
• 30 ms intra-AS
• 5 ms inter-AS
• 620 Mbps peering link
[Figure: ASes labeled AS 1, AS 2, AS 3, AS 4, AS 5, AS 6, AS 8, AS 9, AS 10 and AS 165053, plus the rest of the Internet; annotations: customer-provider link, backup link, router that resets session due to 0-length AS4_PATH]
Micro benchmarks
• CPU overhead, metric: BGP updates per second
• Stress test during RIB load: baseline 15.1, with exploration 13.9 (8% impact)
• Realistic test during trace replay: negligible impact
• Memory overhead: cloned process has 37% overhead on average
• Bandwidth: 8 Kbps on average for exploratory messaging
Results
• Avg: 243 s, 756 explorations
• Max: 670 s, 2002 explorations
• Without ModelNet: avg 155 s
• Detected session reset and origin misconfiguration
• Explored all paths in the UPDATE handlers, plus propagation across the Internet-like testbed, in ~4 min on average (11 min max)
Deployment option 1
• Convince Cisco, Juniper, Huawei, etc. to integrate DiCE
Deployment option 2
• Deploy DiCE+BIRD in a server
• Potentially run multiple router instances
• Configure with the AS policy & BGP feed
• Connect with DiCE servers in neighboring ASes
Incentives
• Common infrastructure
• ISP benefits as an exploration target
• Knowing about its faults
• Upstream ISPs can incentivize customer ISPs to serve as an “explorer”
• Fewer faults, lower operational costs
Conclusion
• We have an online testing system for BGP
• Are you interested in trying out our prototype?
• Do you have suggestions for properties to check?
• Get in touch: marco.canini@epfl.ch
• Thank you! Questions?
• More info in our papers [LADIS ’10, USENIX ATC ’11]
Backup slides
My Research
• Improving the reliability of distributed systems
• Why?
• Foundation of our society’s infrastructure
• ... but it is difficult to make them reliable
• Produce robust design and implementation
• Deploy and operate reliably
• A prime example: BGP (inter-domain routing)
• Fundamental service for Internet’s operation
• Additional challenges: federation & heterogeneity
DiCE/BGP Prototype in Action
[Figure: message sequence between Node 1 (explorer) and Node 2. Step 1: create snapshot (1’: annotated message, 1’: fork(), 1’’: fork(), 1’’: ack). Step 2: input constraints (2’: fork()/run, 2’’: connect, 2’’’: fork()/run). Step 3: message (3’: ack). Step 4: check / property check. The path exploration engine exchanges constraints/inputs with ctrl.]
Original input vs. inputs produced by DiCE
[Figure: three runs through the input generation code (fuzz? → fuzz attr) and the router update handling code (import filter1?, import filter2?, apply update, then drop update or send update). The original input x.y.z.w/l passes through unchanged; the inputs produced by DiCE yield x.y.z.w/l with a 0-length AS4_PATH and a.b.c.d/l as a leaked prefix.]
Property 3: BGP Policy Conflicts
• Checking convergence is hard [Varadhan et al., ’96, Griffin et al., ’00]
• Check: Dispute wheel?
• Absence of a dispute wheel is a sufficient condition for robust convergence
• Nodes locally prefer not routing directly to 0
• Cycle!
[Figure: BAD GADGET II, nodes 0 to 4 with their ranked paths to 0; from Timothy G. Griffin, Leiden Global Internet talk ’00]
Dispute Wheel Detection with DiCE
• Use symbolic input to change policy
• Can cause a dispute wheel in a single step
• Use global precedence metric to detect and resolve conflict [Ee et al., SIGCOMM ’07]
• Metric invoked: dispute wheel (DW) in the cloned snapshot
• Fault report: list of policy changes that cause oscillations
[Figure: GOOD GADGET and BAD GADGET II, nodes 0 to 4 with their ranked paths to 0]