Validating Datacenters at Scale. Karthick Jayaraman, Nikolaj Bjørner, Jitu Padhye, Amar Agrawal, Ashish Bhargava, Paul-Andre C Bissonnette, Shane Foster, Andrew Helwer, Mark Kasten, Ivan Lee, Anup Namdhari, Haseeb Niaz, Aniruddha Parkhi, Hanukumar Pinnamraju, Adrian Power, Neha Milind Raje, Parag Sharma. Microsoft Azure Networking
Hyperscale Azure Datacenter Network • 54 regions worldwide • 140 countries • network devices • maintenance changes/day • servers • policies
Reliability at Hyperscale: Is the network operating as expected? Will my change affect the network?
Reality Checker for Datacenters (RCDC) What is the Reality? What is the Intent? How to scale verification? What do we do with the results?
Forwarding Information Base (FIB) • Determines forwarding behavior of each device • Longest prefix matching • FIBs collectively determine forwarding behavior of the network [Figure: device with interfaces i1, i2, i3, i4; example packets dstIp=100.26.0.1 and dstIp=100.25.0.1]
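Longest prefix matching can be illustrated with a minimal Python sketch. The FIB contents below are hypothetical: the slide shows interfaces i1 to i4 and two destination addresses, but not which prefix maps to which interface, so that mapping is an assumption made for illustration.

```python
import ipaddress


def longest_prefix_match(fib, dst_ip):
    """Return (network, next_hops) for the most specific FIB entry covering dst_ip.

    Returns None when no entry covers the address.
    """
    addr = ipaddress.ip_address(dst_ip)
    best = None
    for prefix, next_hops in fib.items():
        net = ipaddress.ip_network(prefix)
        # Keep the covering entry with the longest prefix length.
        if addr in net and (best is None or net.prefixlen > best[0].prefixlen):
            best = (net, next_hops)
    return best


# Hypothetical FIB: prefix -> outgoing interfaces (assumed, not from the slides).
fib = {
    "0.0.0.0/0": ["i1", "i2"],
    "100.25.0.0/16": ["i3"],
    "100.26.0.0/16": ["i4"],
}

for dst in ("100.26.0.1", "100.25.0.1", "8.8.8.8"):
    net, hops = longest_prefix_match(fib, dst)
    print(f"dstIp={dst} matches {net} via {hops}")
```

A linear scan keeps the sketch short; real devices typically use tries or TCAMs for the same lookup.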
Reality Checker for Datacenters (RCDC) What is the Reality? What is the Intent? How to scale verification? What do we do with the results?
What is the intent? • All pairs ToR reachability [Figure: topology with routers R1-R4, D1-D4, A1-A4, B1-B4 and ToR1-ToR4 in Cluster 1 and Cluster 2; prefixes 10.0.0.0/16, 11.0.0.0/16, 12.0.0.0/16, 13.0.0.0/16]
What is the intent? • All pairs ToR reachability • Traffic must follow shortest path • Intra-cluster path length = 2 • Intra-datacenter path length = 4 [Figure: topology with routers R1-R4, D1-D4, A1-A4, B1-B4 and ToR1-ToR4 in Cluster 1 and Cluster 2; prefixes 10.0.0.0/16, 11.0.0.0/16, 12.0.0.0/16, 13.0.0.0/16]
What is the intent? • All pairs ToR reachability • Traffic must follow shortest path • All Equal Cost Multi Paths (ECMP) must be available [Figure: topology with routers R1-R4, D1-D4, A1-A4, B1-B4 and ToR1-ToR4 in Cluster 1 and Cluster 2; prefixes 10.0.0.0/16, 11.0.0.0/16, 12.0.0.0/16, 13.0.0.0/16]
Where does the intent come from? • Network Graph Service provides the topology • Automatic intent extraction: all pairs ToR reachability; traffic must follow shortest path; ECMP redundancy [Figure: topology with routers R1-R4, D1-D4, A1-A4, B1-B4 and ToR1-ToR4; prefixes 10.0.0.0/16, 11.0.0.0/16, 12.0.0.0/16, 13.0.0.0/16]
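The three intents can be made concrete on a toy topology. The sketch below is an illustration under stated assumptions, not RCDC itself: it assumes ToR1 and ToR2 attach to leaves A1-A4, ToR3 and ToR4 attach to leaves B1-B4, and every leaf connects to every spine D1-D4 (the slides do not give the wiring, and the backbone routers R1-R4 are left out). It works on the topology graph rather than on FIBs, and uses breadth-first search to obtain the hop count and the number of equal-cost shortest paths between ToRs.

```python
from collections import deque


def build_topology():
    """Build an assumed two-cluster Clos graph as an adjacency map."""
    graph = {}

    def link(a, b):
        graph.setdefault(a, set()).add(b)
        graph.setdefault(b, set()).add(a)

    clusters = {"A": ["ToR1", "ToR2"], "B": ["ToR3", "ToR4"]}
    spines = [f"D{i}" for i in range(1, 5)]
    for letter, tors in clusters.items():
        for leaf in [f"{letter}{i}" for i in range(1, 5)]:
            for tor in tors:
                link(tor, leaf)
            for spine in spines:
                link(leaf, spine)
    return graph, clusters


def shortest_paths(graph, src):
    """BFS from src: return {node: (hop_count, number_of_shortest_paths)}."""
    dist, count = {src: 0}, {src: 1}
    queue = deque([src])
    while queue:
        u = queue.popleft()
        for v in graph[u]:
            if v not in dist:
                dist[v], count[v] = dist[u] + 1, count[u]
                queue.append(v)
            elif dist[v] == dist[u] + 1:
                count[v] += count[u]  # another equal-cost path into v
    return {n: (dist[n], count[n]) for n in dist}


def check_intent(graph, clusters):
    """Check all-pairs ToR reachability and the expected path lengths (2 and 4)."""
    cluster_of = {tor: c for c, tors in clusters.items() for tor in tors}
    violations = []
    for src in cluster_of:
        reach = shortest_paths(graph, src)
        for dst in cluster_of:
            if dst == src:
                continue
            expected = 2 if cluster_of[src] == cluster_of[dst] else 4
            if dst not in reach:
                violations.append((src, dst, "unreachable"))
            elif reach[dst][0] != expected:
                violations.append((src, dst, f"length {reach[dst][0]}, expected {expected}"))
    return violations


graph, clusters = build_topology()
print(check_intent(graph, clusters))
print(shortest_paths(graph, "ToR1")["ToR2"], shortest_paths(graph, "ToR1")["ToR3"])
```

The path counts returned by `shortest_paths` are what an ECMP intent would compare against the forwarding state. Running this for every ToR pair is the kind of global analysis whose cost grows quickly with network size.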
Reality Checker for Datacenters (RCDC) What is the Reality? What is the Intent? How to scale verification? What do we do with the results?
Challenges • All pairs ToR reachability analysis is O(N^3) • Composite FIB snapshot is a hard engineering problem • Prior work: Anteater [Mai 2011], HSA [Kazemian 2012], Veriflow [Khurshid 2013], NetKAT [Anderson 2014], Libra [Zeng 2014], NoD [Lopes 2015], Symmetries [Plotkin 2016, Beckett 2018]
Local Validation: Exploit the Azure network's regular structure • Each router has a fixed role for a set of addresses • Enough to verify that the role is enforced on each router • Decompose into local contracts [Figure: backbone R1-R4; spine routers D1-D4; leaf routers A1-A4, B1-B4; ToR1-ToR4; prefixes 10.0.0.0/16, 11.0.0.0/16, 12.0.0.0/16, 13.0.0.0/16]
What are the contracts? • ToR1 contracts: default contracts, specific contracts [Figure: backbone R1-R4; spine routers D1-D4; leaf routers A1-A4, B1-B4; ToR1-ToR4; prefixes 10.0.0.0/16, 11.0.0.0/16, 12.0.0.0/16, 13.0.0.0/16]
What are the contracts? • A1 contracts [Figure: backbone R1-R4; spine routers D1-D4; leaf routers A1-A4, B1-B4; ToR1-ToR4; prefixes 10.0.0.0/16, 11.0.0.0/16, 12.0.0.0/16, 13.0.0.0/16]
What are the contracts? [Figure: backbone R1-R4; spine routers D1-D4; leaf routers A1-A4, B1-B4; ToR1-ToR4; prefixes 10.0.0.0/16, 11.0.0.0/16, 12.0.0.0/16, 13.0.0.0/16]
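A local contract check can be sketched in a few lines of Python. The slides name default and specific contracts but do not spell them out, so the definitions here are assumptions: a contract is taken to be a prefix plus the set of next hops the device's FIB is expected to hold for it, the prefix-to-ToR assignment is invented for illustration, and the check compares only the FIB entry for exactly that prefix (it does not model more-specific routes overriding it).

```python
def check_contracts(fib, contracts):
    """Compare each contract with the FIB entry for the same prefix.

    fib and contracts both map prefix -> collection of next hops.
    Returns a list of human-readable violation strings.
    """
    violations = []
    for prefix, expected in contracts.items():
        actual = fib.get(prefix)
        if actual is None:
            violations.append(f"{prefix}: no route")
        elif set(actual) != set(expected):
            missing = sorted(set(expected) - set(actual))
            extra = sorted(set(actual) - set(expected))
            violations.append(f"{prefix}: missing {missing}, unexpected {extra}")
    return violations


# Assumed contracts for leaf router A1 (hypothetical prefix assignment).
a1_contracts = {
    "10.0.0.0/16": {"ToR1"},                 # specific: prefix hosted under ToR1
    "11.0.0.0/16": {"ToR2"},                 # specific: prefix hosted under ToR2
    "0.0.0.0/0": {"D1", "D2", "D3", "D4"},   # default: all spines via ECMP
}

# Hypothetical FIB snapshot for A1 in which one ECMP next hop has been lost.
a1_fib = {
    "10.0.0.0/16": {"ToR1"},
    "11.0.0.0/16": {"ToR2"},
    "0.0.0.0/0": {"D1", "D2", "D4"},
}

print(check_contracts(a1_fib, a1_contracts))
```

Because each device is checked against its own contracts, no network-wide FIB snapshot is needed and devices can be validated independently.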
Live Monitoring of Forwarding Behavior • Network Graph Service • Reachability invariants • Error reports • Validation time for one datacenter < 3 minutes [Figure: topology with routers R1-R4, D1-D4, A1-A4, B1-B4 and ToR1-ToR4; prefixes 10.0.0.0/16, 11.0.0.0/16, 12.0.0.0/16, 13.0.0.0/16]
Reality Checker for Datacenters (RCDC) What is the Reality? What is the Intent? How to scale verification? What do we do with the results?
Latent Error [Figure: backbone R1-R4; spine routers D1-D4; leaf routers A1-A4, B1-B4; ToR1-ToR4; prefixes 10.0.0.0/16, 11.0.0.0/16, 12.0.0.0/16, 13.0.0.0/16]
Latent Errors [Figure: backbone R1-R4; spine routers D1-D4; leaf routers A1-A4, B1-B4; ToR1-ToR4; prefixes 10.0.0.0/16, 11.0.0.0/16, 12.0.0.0/16, 13.0.0.0/16]
What did we do about the errors? O(100) • Risk categorization • Role of device • Number of additional faults required to cause an impact
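One way to read the two risk criteria is sketched below in Python. Everything in it is an assumption made for illustration: the `ContractError` record, the role weights, and the rule that the number of additional faults needed to cause an impact equals the number of ECMP next hops that are still healthy. The slides do not describe the actual scoring.

```python
from dataclasses import dataclass

# Assumed ordering: errors on devices higher in the hierarchy rank first.
ROLE_WEIGHT = {"spine": 3, "leaf": 2, "tor": 1}


@dataclass
class ContractError:
    device: str
    role: str
    expected_next_hops: int
    healthy_next_hops: int

    @property
    def additional_faults_to_impact(self):
        # Traffic is dropped only once every remaining next hop has failed.
        return self.healthy_next_hops


def rank(errors):
    """Order errors by fewest additional faults needed, then by device role."""
    return sorted(
        errors,
        key=lambda e: (e.additional_faults_to_impact, -ROLE_WEIGHT[e.role]),
    )


# Hypothetical error reports.
errors = [
    ContractError("ToR1", "tor", expected_next_hops=4, healthy_next_hops=3),
    ContractError("D2", "spine", expected_next_hops=4, healthy_next_hops=1),
    ContractError("A3", "leaf", expected_next_hops=4, healthy_next_hops=1),
]

for e in rank(errors):
    print(e.device, e.role, e.additional_faults_to_impact)
```

Sorting on a tuple keeps the two criteria separate, so either one can be replaced without touching the other.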
Experience: Types of Errors • Categories: software bugs; hardware failures; operational drift; migrations • Examples: software bug that caused RIB-FIB inconsistency; operationally down links; BGP sessions that are shut; port channels not configured on T1s; two T1 sets configured with the same ASN
Reliability at Hyperscale: Is the network operating as expected? Will my change affect the network?
Verifying Device Access-Control Lists (ACLs) • Rule format: srcIp, dstIp, protocol, action • Example rules: (*, 100.64.0.0/16, UDP, deny); (*, *, *, permit); (*, *, *, deny) • SecGuru: contracts and policy are parsed into bit-vector logic formulas and checked with Z3
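The rule format can be exercised with a small Python sketch. This is deliberately not the symbolic check described on the slide: instead of encoding policy and contracts as bit-vector formulas for Z3, it evaluates a first-match ACL on a handful of concrete packets, which tests a contract rather than proving it. First-match semantics, the implicit final deny, and the sample packets are assumptions for illustration; the two policy rules are taken from the slide.

```python
import ipaddress


def matches(rule, packet):
    """True if the packet falls within the rule's src, dst and protocol fields."""
    src, dst, proto, _action = rule
    return (
        (src == "*" or ipaddress.ip_address(packet["src"]) in ipaddress.ip_network(src))
        and (dst == "*" or ipaddress.ip_address(packet["dst"]) in ipaddress.ip_network(dst))
        and (proto == "*" or proto == packet["proto"])
    )


def evaluate(acl, packet):
    """Return the action of the first matching rule (assumed implicit deny)."""
    for rule in acl:
        if matches(rule, packet):
            return rule[3]
    return "deny"


def check(acl, contracts):
    """Return (packet, expected, actual) for every contract the ACL violates."""
    failures = []
    for packet, expected in contracts:
        actual = evaluate(acl, packet)
        if actual != expected:
            failures.append((packet, expected, actual))
    return failures


# Policy rows as (srcIp, dstIp, protocol, action).
policy = [
    ("*", "100.64.0.0/16", "UDP", "deny"),
    ("*", "*", "*", "permit"),
]

# Hypothetical contracts expressed as concrete packets with expected actions.
contracts = [
    ({"src": "10.1.1.1", "dst": "100.64.1.1", "proto": "UDP"}, "deny"),
    ({"src": "10.1.1.1", "dst": "100.64.1.1", "proto": "TCP"}, "permit"),
    ({"src": "10.1.1.1", "dst": "8.8.8.8", "proto": "UDP"}, "permit"),
]

print(check(policy, contracts))
print(check(list(reversed(policy)), contracts))
```

Reversing the rule order makes the permit-all rule shadow the deny, which is the kind of regression a contract check is meant to catch. A solver-based check covers every packet in the contract's range instead of a few samples.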
Refactoring a Large Legacy ACL • Legacy edge ACL: several thousand lines; intent was poorly understood; difficult to make changes • Refactor: move out service-specific protections • Refactored edge ACL: a few hundred lines
Refactoring a Large Legacy ACL • Check regression contracts with SecGuru • Fix errors in policy • Deploy refactored ACL • SecGuru error report: "Contract expects:", "Policy only allows:"
Summary • Captured and checked intent in Azure datacenters • Incorporated verification to monitor drift and check impact of changes • Optimized for hyperscale
More Challenges • Wide area networks • Better abstractions for intent • Model-based testing of device firmware • Verifying virtual network policies. Contact: dmaltz@microsoft.com, karjay@microsoft.com, padhye@microsoft.com