Learn about the importance of internet measurement in self-driving networks and how it can improve availability, latency, and traffic engineering. Explore the challenges and future directions in this field.
Internet Measurement for Self-Driving Networks
Matt Calder
Minerva Chen, Jose Nunez de Caceres Estrada, Diego Perez Botero, Madhura Phadke, Manuel Schröder
April 4, 2019
Background
• Azure Frontdoor: Microsoft’s content delivery network
  • 1st and (recently) 3rd party CDN
  • Servers deployed on Microsoft's network edge
  • Global application load balancing
  • Reverse proxy / Split TCP
• Dedicated Internet measurement team
  • Systems used across Microsoft
    • Network experimentation
    • Network monitoring
    • User-facing analytics
    • Build out and capacity planning
    • Traffic engineering
We Need Self-Driving Networks
• Use cases
  • Availability drops / outages
    • Example: Comcast users in Seattle see a failure rate of 10% over the last 10 minutes
  • Latency regressions
    • Example: P90 RTT of users in Taiwan increased by 100 ms
• These issues are frequent and unending
• Operate self-driving networks to mitigate
  • Traffic engineering
• Building blocks to enable self-driving networks took years
Outline Introduction Path to Self-Driving Solutions Data Collection Odin Measurement Platform Challenges and Future Directions
Leading up to Self-Driving Networks
#1 No insight
• Customer-reported incidents
• Can't measure the problem, or unaware of how to measure it
• Ad-hoc measurement and analysis
  • Get info from the customer
  • Run a TCP dump on a production machine, copy the trace locally, write a script to analyze it
  • Traceroute from a prod machine; have the customer send a traceroute from their end, or find a looking glass
• Time-consuming investigations for engineers
• Best outcome is a troubleshooting guide
Leading up to Self-Driving Networks
#1 No insight
#2 Automate Data Collection
Data is there when you need it
• Measurements
• Telemetry
• Large geo-distributed service
  • Data ingestion latency
  • Raw or aggregate
  • Queryability
Leading up to Self-Driving Networks
#1 No insight
#2 Automate Data Collection
#3 Automate Issue Detection
• Methodology
  • Invest in statistics, data quality, and validation
  • Translate networking domain knowledge into process
• Schedule recurring jobs to look for issues
• Produce reports
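A recurring detection job of the kind described in #3 might look roughly like the sketch below. It is illustrative only, not the system from the talk: the `Sample` record, the grouping key, and the thresholds (10% failure rate, 100 ms P90 increase, borrowed from the use-case examples earlier in the deck) are all assumptions.

```python
from collections import defaultdict
from dataclasses import dataclass
from statistics import quantiles


@dataclass(frozen=True)
class Sample:
    group: str      # hypothetical key, e.g. "Seattle/Comcast"
    rtt_ms: float
    failed: bool


def p90(values):
    # quantiles(n=10) returns 9 cut points; the last is the 90th percentile.
    return quantiles(values, n=10, method="inclusive")[-1]


def _by_group(samples):
    grouped = defaultdict(list)
    for s in samples:
        grouped[s.group].append(s)
    return grouped


def detect_issues(baseline, current, max_failure_rate=0.10,
                  max_p90_increase_ms=100.0, min_samples=20):
    """Compare the current window against a baseline window, per group."""
    base = _by_group(baseline)
    issues = []
    for group, samples in sorted(_by_group(current).items()):
        if len(samples) < min_samples:
            continue  # data-quality guard: too few samples to trust
        failure_rate = sum(s.failed for s in samples) / len(samples)
        if failure_rate >= max_failure_rate:
            issues.append((group, "availability", round(failure_rate, 3)))
        now_rtts = [s.rtt_ms for s in samples if not s.failed]
        base_rtts = [s.rtt_ms for s in base.get(group, []) if not s.failed]
        if len(now_rtts) >= 2 and len(base_rtts) >= 2:
            increase = p90(now_rtts) - p90(base_rtts)
            if increase >= max_p90_increase_ms:
                issues.append((group, "latency", round(increase, 1)))
    return issues
```

The `min_samples` guard stands in for the "data quality and validation" bullet: a group with too little data is skipped rather than reported.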
Leading up to Self-Driving Networks
#1 No insight
#2 Automate Data Collection
#3 Automate Issue Detection
#4 Alerting
• True test of #3
• Raise alert to on-call engineer
• Follow troubleshooting guide
• Too many alerts -> on-call burnout
• Issues are mostly short-lived
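Since most issues are short-lived and too many alerts burn out the on-call engineer, one common mitigation is to page only when an issue persists across several consecutive detection runs. The slides do not say how alert volume was actually handled; the sketch below is one possible approach, and `PersistenceFilter` and `required_runs` are hypothetical names.

```python
class PersistenceFilter:
    """Page only for issues seen in `required_runs` consecutive detection runs."""

    def __init__(self, required_runs=3):
        self.required_runs = required_runs
        self._streak = {}  # issue key -> number of consecutive runs seen

    def update(self, active_issues):
        """Feed the issue keys from one run; return the keys to page on now."""
        active = set(active_issues)
        # An issue that cleared loses its streak, so short-lived blips never page.
        self._streak = {k: n for k, n in self._streak.items() if k in active}
        to_alert = []
        for key in sorted(active):
            self._streak[key] = self._streak.get(key, 0) + 1
            # Fire exactly once, when the streak first reaches the threshold.
            if self._streak[key] == self.required_runs:
                to_alert.append(key)
        return to_alert
```

The trade-off is detection delay: with a 10-minute job and `required_runs=3`, an alert fires no sooner than the third consecutive run that sees the issue.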
Arriving at Self-Driving Networks
#1 No insight
#2 Automate Data Collection
#3 Automate Issue Detection
#4 Alerting
#5 Closing the loop
• Feed data to traffic engineering system
  • Auth DNS, BGP
• Missing piece is measurements of alternate paths
• Examples
  • Change egress traffic links
  • Change ingress traffic PoPs
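To illustrate why measurements of alternate paths are the missing piece, here is a minimal sketch of a decision step that a traffic engineering system could apply once such measurements exist. It is not the production logic; `PathStats`, `choose_path`, and the thresholds are assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PathStats:
    name: str            # hypothetical label for an egress link or ingress PoP
    p90_rtt_ms: float
    failure_rate: float


def choose_path(current, alternates, max_failure_rate=0.05, min_gain_ms=10.0):
    """Return the name of the path to use, given measured alternates."""
    healthy = [p for p in alternates if p.failure_rate <= max_failure_rate]
    if not healthy:
        return current.name  # no measured alternate is safe to move to
    best = min(healthy, key=lambda p: p.p90_rtt_ms)
    current_unhealthy = current.failure_rate > max_failure_rate
    # Require a minimum latency gain so traffic does not flap between paths.
    if current_unhealthy or current.p90_rtt_ms - best.p90_rtt_ms >= min_gain_ms:
        return best.name
    return current.name
```

Without measurements for the alternates, `alternates` is empty and the function can only keep the current path.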
Data Collection at Azure Frontdoor
Passive Server-side
• Client requests instrumented at server
• Collect TCP and application layer metrics
Active Server-side
• Traceroute, ping
• From servers to Internet destinations
Active Client-side
• Adds HTTP(S) measurements
• From Microsoft users to Internet destinations
Azure Global Telemetry
• Data access: real-time, near real-time, offline
Measurement Limitations
Passive server-side
• Issue 1: No explicit outage signal
• Issue 2: Alternate path exploration adds risk
Active layer 3 measurements from servers
• Issue 1: Poor coverage
  • 74% of end-users are unresponsive
• Issue 2: Missing layer 7 behaviors
  • HTTP redirection
  • SSL/TLS
Odin Design
[Diagram: a client (web page or rich client, e.g. a stock ticker desktop app) issues an HTTP(S) GET for tiny.png to a Microsoft server (Server A, Server B), then sends the result (e.g. "A: 20ms", "B: 20ms", or "B: ERROR") to the Odin report endpoint, which feeds offline analysis and online alerting.]
1. Client-side Platform
2. Active Measurement
3. Application Layer
4. Both Web and Rich Clients
5. Explicit Failure Notification
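The measurement flow in these slides (fetch a small object over HTTP(S), time it, and report either the latency or an explicit error) can be sketched as follows. This is not Odin's client code; `measure`, `http_get`, and the record fields are hypothetical, and the fetch and clock are injectable so the logic can be exercised without a network.

```python
import time
import urllib.request


def http_get(url, timeout_s=5.0):
    # Default fetcher: download the whole object, raising on any failure.
    with urllib.request.urlopen(url, timeout=timeout_s) as resp:
        resp.read()


def measure(target, url, fetch=http_get, clock=time.monotonic):
    """Time one application-layer fetch and build a report record."""
    start = clock()
    try:
        fetch(url)
    except Exception as exc:
        # Explicit failure notification: report the error, not silence.
        return {"target": target, "ok": False, "error": type(exc).__name__}
    latency_ms = round((clock() - start) * 1000.0, 1)
    return {"target": target, "ok": True, "latency_ms": latency_ms}
```

Because the fetch happens at the application layer, failures such as TLS errors or HTTP error statuses surface as exceptions and end up in the report, which layer 3 probes cannot capture.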
Odin Design
• Examples showed measurements to the application server
• Want richer measurements
[Diagram labels: Mail, Server B, Odin, Report Upload Endpoint, Offline Analysis, Online Alerting, Microsoft]
Odin Design
Orchestration Service
• Target URLs
• Primary and backup report endpoints
[Diagram labels: Odin, Orchestration Service, Report Upload Endpoint, Offline Analysis, Online Alerting, Microsoft]
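Odin Design
[Diagram: numbered flow (1, 2, 3) involving the Orchestration Service, a GET tiny.png measurement, and the Report Endpoint. Labels: Odin, Server B, m1.contoso.com, m3.contoso.com, Microsoft Europe, Microsoft U.S., Offline Analysis, Online Alerting. Example result shown: "m1.contoso.com: 20ms".]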
Odin Design: Fault tolerance
• Need to receive measurements even if Microsoft’s network is unavailable
• Report Proxy on a 3rd party network
[Diagram labels: Odin, GET tiny.gif, Server B, "B: ERROR", Orchestration Service, Report Endpoint, Report Proxy, 3rd Party Network, Offline Analysis, Online Alerting, Microsoft]
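The fault-tolerance idea (a primary report endpoint plus a backup, such as a report proxy on a third-party network) amounts to ordered failover on upload. The sketch below shows that pattern under stated assumptions: `upload_report` and `post_json` are hypothetical helpers, the endpoint URLs are placeholders, and the real client's upload protocol is not described in the slides.

```python
import json
import urllib.request


def post_json(endpoint, report, timeout_s=5.0):
    # Default sender: POST the report as JSON, raising on any failure.
    body = json.dumps(report).encode("utf-8")
    req = urllib.request.Request(
        endpoint,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req, timeout=timeout_s) as resp:
        resp.read()


def upload_report(report, endpoints, send=post_json):
    """Try endpoints in order; return the one that accepted the report."""
    for endpoint in endpoints:
        try:
            send(endpoint, report)
        except Exception:
            continue  # this endpoint is unreachable, try the next one
        return endpoint
    return None  # every endpoint failed; the caller may retry later
```

Usage would be along the lines of `upload_report(record, ["https://report.example.com/upload", "https://proxy.example.net/upload"])`, with the primary endpoint listed first and the third-party proxy second.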
Summary: Odin enables Self-Driving Networks
• Coverage
  • No better vantage points than your actual customers
• Safety
  • Don’t need to experiment/measure with prod traffic
  • Ability to validate
• Flexibility
  • Supports enterprise network and privacy requirements
• Fault tolerance
  • Measurements available during outages
Challenges and Future Directions
• Data quality
  • How do we build systems that avoid making bad decisions?
• When do we get humans back in the loop?
  • TE keeps fixing recurring problems
  • May require a change in service or additional capacity
• Need for collaboration
  • Issues impacting common resources, e.g. IXPs, transit, end-users
  • Self-driving networks will route accordingly
  • Want to help fix the underlying issue
  • Still need to email the NOC
  • Signals published by content providers
  • Network operators subscribe