Minimizing Wide-Area Performance Disruptions in Inter-Domain Routing Yaping Zhu yapingz@cs.princeton.edu Advisor: Prof. Jennifer Rexford Princeton University
Minimize Performance Disruptions • Network changes affect user experience • Equipment failures • Routing changes • Network congestion • Network operators have to react and fix problems • Fix equipment failure • Change route selection • Change server selection
Diagnosis Framework: Enterprise Network • Measure: network changes • Diagnose • Fix: equipment, config, etc. • The operator has full visibility and full control
Challenges to Minimize Wide-Area Disruptions • The Internet is composed of many networks • ISP (Internet Service Provider): provides connectivity • CDN (Content Distribution Network): provides services • Each network has limited visibility and control [Figure: client, small ISPs, large ISP, CDN]
ISP’s Challenge: Provide Good Transit for Packets • Limited visibility • Small ISP: lack of visibility into the problem • Limited control • Large ISP: lack of direct control to fix congestion [Figure: client, small ISPs, large ISP, CDN]
CDN’s Challenge: Maximize Performance for Services • Limited visibility • CDN: can’t figure out the exact root cause • Limited control • CDN: lack of direct control to fix the problem [Figure: client, small ISPs, large ISP, CDN]
Summary of Challenges of Wide-Area Diagnosis • Measure: large volume and diverse kinds of data • Diagnosis today: ad hoc • Takes a long time to get back to customers • Does not scale to a large number of events • Our goal: build systems for wide-area diagnosis • Formalize and automate the diagnosis process • Analyze a large volume of measurement data
Rethink Routing Protocol Design • Many performance problems are caused by routing • Route selection is not based on performance • 42.2% of the large latency increases in a large CDN correlated with inter-domain routing changes • No support for multi-path routing • Our goal: a routing protocol for better performance • Fast convergence to reduce disruptions • Route selection based on performance • Scalable multi-path to avoid disruptions • Less complexity for fewer errors
Route Oracle: Where Have All the Packets Gone? Joint work with Jennifer Rexford (Princeton University) and Aman Shaikh and Subhabrata Sen (AT&T Research)
Route Oracle: Where Have All the Packets Gone? • Inputs: • Destination: IP address • When? Time • Where? Ingress router • Outputs: • Where does the packet leave the network? Egress router • What is the route to the destination? AS path [Figure: an IP packet enters the AT&T IP network at the ingress router, leaves at the egress router, and follows the AS path to the destination IP address]
Application: Service-Level Performance Management • Troubleshoot CDN throughput drop • Case provided by the AT&T ICDS (Intelligent Content Distribution Service) project [Figure: AT&T CDN server in Atlanta, router in Atlanta, traffic leaving AT&T in Atlanta vs. leaving AT&T in Washington DC, Sprint, Atlanta users]
Background: IP Prefix and Prefix Nesting • IP prefix: IP address / prefix length • E.g., 12.0.0.0/8 stands for [12.0.0.0, 12.255.255.255] • Suppose the routing table has routes for prefixes: • 12.0.0.0/8: [12.0.0.0-12.255.255.255] • 12.0.0.0/16: [12.0.0.0-12.0.255.255] • [12.0.0.0-12.0.255.255] is covered by both the /8 and the /16 prefix • Prefix nesting: IPs covered by multiple prefixes • 24.2% of IP addresses are covered by more than one prefix
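The prefix arithmetic on this slide can be checked with Python's standard `ipaddress` module. The sketch below is illustrative only; the helper names `prefix_range` and `covering_prefixes` are hypothetical and not part of Route Oracle.

```python
import ipaddress

def prefix_range(prefix):
    """Return the first and last IPv4 address covered by a prefix."""
    net = ipaddress.ip_network(prefix)
    return str(net.network_address), str(net.broadcast_address)

def covering_prefixes(ip, table):
    """Return every prefix in `table` that covers `ip` (prefix nesting)."""
    addr = ipaddress.ip_address(ip)
    return [p for p in table if addr in ipaddress.ip_network(p)]

# The two prefixes used as the example on the slide.
table = ["12.0.0.0/8", "12.0.0.0/16"]
print(prefix_range("12.0.0.0/8"))
print(covering_prefixes("12.0.3.4", table))   # nested: covered by /8 and /16
print(covering_prefixes("12.1.0.1", table))   # covered by /8 only
```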
Background: Longest Prefix Match (LPM) • BGP update format • by IP prefix • egress router, AS path • Longest prefix match (LPM): • Routers use LPM to forward IP packets • LPM changes as routes are announced and withdrawn • 13.0% of BGP updates cause LPM changes • Challenge: to determine the route for an IP address, find the LPM for that address and track how the LPM changes over time
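A minimal sketch of how an announcement or withdrawal changes the LPM for one address. This is a linear scan for clarity, not how a router or Route Oracle implements lookup. The router names and AS numbers are hypothetical (the AS numbers come from the range reserved for documentation).

```python
import ipaddress

def longest_prefix_match(ip, routes):
    """Return the most specific prefix in `routes` covering `ip`, or None."""
    addr = ipaddress.ip_address(ip)
    best = None
    for prefix in routes:
        net = ipaddress.ip_network(prefix)
        if addr in net and (best is None or net.prefixlen > best.prefixlen):
            best = net
    return None if best is None else str(best)

# Hypothetical routing state: prefix -> (egress router, AS path).
routes = {
    "12.0.0.0/8": ("egress-1", "64500 64501"),
    "12.0.0.0/16": ("egress-2", "64500 64502"),
}

before = longest_prefix_match("12.0.3.4", routes)  # the /16 wins
del routes["12.0.0.0/16"]                          # BGP withdrawal of the /16
after = longest_prefix_match("12.0.3.4", routes)   # falls back to the /8
print(before, "->", after)
```

The route for 12.0.3.4 changes even though the /8 itself was never updated, which is why tracking LPM changes per address is the core difficulty.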
Challenge: Scale of the BGP Data • Data collection: BGP Monitor • Have a BGP session with each router • Receive incremental updates of best routes • Data scale • Dozens of routers (one per city) • Each router has many prefixes (~300K) • Each router receives lots of updates (millions per day) [Figure: BGP routers send best routes to a software router on a centralized server]
Background: BGP is an Incremental Protocol • Incremental protocol • Routes that have not changed are not re-announced • How to log routes for an incremental protocol? • Routing table dump: daily • Incremental updates: every 15 minutes [Figure: BGP routers send best routes to a software router on a centralized server, which logs daily table dumps and 15-minute updates]
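Because BGP is incremental, the routing state at an arbitrary time has to be rebuilt from the most recent table dump plus the updates logged since then. The sketch below assumes a simplified log format (timestamp, prefix, route, with `None` meaning a withdrawal); the actual log layout is not given in the slides.

```python
def routes_at(dump, updates, t):
    """Rebuild the routing table at time `t`.

    dump:    dict prefix -> route, the state at dump time
    updates: list of (timestamp, prefix, route) sorted by timestamp;
             route is None for a withdrawal (assumed encoding)
    """
    table = dict(dump)
    for ts, prefix, route in updates:
        if ts > t:
            break
        if route is None:
            table.pop(prefix, None)   # withdrawal
        else:
            table[prefix] = route     # announcement or replacement
    return table

dump = {"12.0.0.0/8": "r1"}
updates = [
    (10, "12.0.0.0/16", "r2"),
    (20, "12.0.0.0/16", None),
    (30, "12.0.0.0/8", "r3"),
]
print(routes_at(dump, updates, 15))
print(routes_at(dump, updates, 25))
```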
Route Oracle: Interfaces and Challenges • Challenges • Track longest prefix match • Scale of the BGP data • Need answers to queries • At scale: for many IP addresses • In real time: for network operation • Interface • Inputs: BGP routing data, destination IP address, ingress router, time • Outputs: egress router, AS path
Strawman Solution: Track LPM Changes by Forwarding Table • How to implement • Run routing software to update the forwarding table • The forwarding table answers queries based on LPM • Answer a query for one IP address • Suppose: n prefixes in the routing table at t1, m updates from t1 to t2 • Time complexity: O(n+m) • Space complexity: O(P), where P is the number of prefixes covering the query IP address
Strawman Solution: Track LPM Changes by Forwarding Table • Answer queries for k IP addresses • Keep all prefixes in the forwarding table • Space complexity: O(n) • Time complexity: major steps • Initialize n routes: n*log(n)+k*n • Process m updates: m*log(n)+k*m • In sum: (n+m)*(log(n)+k) • Goal: reduce query processing time • Trade more space for less time: pre-processing • Storing pre-processed results per address does not scale to 2^32 IPs • Need to track LPM scalably
Track LPM Scalably: Address Range • Prefix set • Collection of all matching prefixes for given IP address • Address range • Contiguous addresses that have the same prefix set • E.g. 12.0.0.0/8 and 12.0.0.0/16 in routing table • [12.0.0.0-12.0.255.255] has prefix set {/8, /16} • [12.1.0.0-12.255.255.255] has prefix set {/8} • Benefits of address range • Track LPM scalably • No dependency between different address ranges
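The address-range idea can be sketched by cutting the address space at every prefix boundary and recording which prefixes cover each piece. This is a naive quadratic illustration under that assumption, not the tree-based structure Route Oracle uses; the function name `address_ranges` is hypothetical.

```python
import ipaddress

def address_ranges(prefixes):
    """Split the covered address space into ranges with identical prefix sets.

    Returns (first_ip, last_ip, prefix_set) tuples; prefix_set is ordered
    from least to most specific, so its last entry is the LPM for the range.
    """
    nets = [ipaddress.ip_network(p) for p in prefixes]
    bounds = set()
    for n in nets:
        bounds.add(int(n.network_address))        # a range starts here
        bounds.add(int(n.broadcast_address) + 1)  # a range ends just before
    cuts = sorted(bounds)
    out = []
    for lo, nxt in zip(cuts, cuts[1:]):
        covering = sorted(
            (n for n in nets
             if int(n.network_address) <= lo <= int(n.broadcast_address)),
            key=lambda n: n.prefixlen,
        )
        if covering:  # skip gaps that no prefix covers
            out.append((
                str(ipaddress.IPv4Address(lo)),
                str(ipaddress.IPv4Address(nxt - 1)),
                [str(n) for n in covering],
            ))
    return out

for r in address_ranges(["12.0.0.0/8", "12.0.0.0/16"]):
    print(r)
```

Each range can then be updated and queried without looking at any other range, which is the independence property the slide relies on.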
Track LPM by Address Range: Data Structure and Algorithm • Tree-based data structure: each node stands for an address range • Real-time algorithm for incoming updates • Example: a routing table with /8, /16, and /24 prefixes yields [12.0.0.0-12.0.0.255] with prefix set {/8, /16, /24}, [12.0.1.0-12.0.255.255] with {/8, /16}, and [12.1.0.0-12.255.255.255] with {/8}
Track LPM by Address Range: Complexity • Pre-processing: • for n initial routes in the routing table and m updates • Time complexity: (n+m)*log(n) • Space complexity: O(n+m) • Query processing: for k queries • Time complexity: O((n+m)*k) • Parallelization using c processors: O((n+m)*k/c)
Route Oracle: System Implementation • BGP routing data: daily table dump, 15-minute updates • Precomputation • Daily snapshot of routes by address ranges • Incremental route updates for address ranges • Query inputs: destination IP, ingress router, time • Query processing • Output for each query: egress router, AS path
Query Processing: Optimizations • Optimize for multiple queries • Amortize the cost of reading address range records: across multiple queried IP addresses • Parallelization • Observation: address range records could be processed independently • Parallelization on multi-core machine
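One way to amortize the cost of reading address-range records is to group the queried IPs by the range that contains them, so each record is read once and each group can be handed to a separate worker. The sketch below assumes sorted, non-overlapping ranges; the function names are hypothetical and the slides do not describe the actual implementation.

```python
import bisect
import ipaddress
from collections import defaultdict

def to_int(ip):
    return int(ipaddress.IPv4Address(ip))

def group_queries_by_range(ranges, query_ips):
    """Map each queried IP to the index of the address range containing it.

    ranges: sorted, non-overlapping (first_ip, last_ip) pairs
    Returns {range index: [queried IPs]}; IPs outside every range are dropped.
    """
    starts = [to_int(lo) for lo, _ in ranges]
    groups = defaultdict(list)
    for ip in query_ips:
        x = to_int(ip)
        i = bisect.bisect_right(starts, x) - 1   # last range starting <= x
        if i >= 0 and x <= to_int(ranges[i][1]):
            groups[i].append(ip)
    return dict(groups)

ranges = [("12.0.0.0", "12.0.255.255"), ("12.1.0.0", "12.255.255.255")]
groups = group_queries_by_range(
    ranges, ["12.0.3.4", "12.9.9.9", "12.0.0.1", "8.8.8.8"]
)
print(groups)
```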
Performance Evaluation: Pre-processing • Experiment on SMP server • Two quad-core Xeon X5460 processors • Each CPU: 3.16 GHz and 6 MB cache • 16 GB of RAM • Experiment design • BGP updates received over fixed time intervals • Compute the pre-processing time for each batch of updates • Can we keep up? Pre-processing time: • 5 minutes of updates: ~2 seconds • 20 minutes of updates: ~5 seconds
Performance Evaluation: Query Processing • Query for one IP (duration: 1 day) • Route Oracle 3-3.5 secs; Strawman approach: minutes • Queries for many IPs: scalability (duration: 1 hour)
NetDiag: Diagnosing Wide-Area Latency Changes for CDNs Joint work with Jennifer Rexford (Princeton University) and Benjamin Helsley, Aspi Siganporia, and Sridhar Srinivasan (Google Inc.)
Background: CDN Architecture • Life of a client request • Front-end (FE) server selection • Latency map • Load balancing (LB) [Figure: client, AS path, ingress router, CDN network, front-end server (FE), egress router]
Challenges • Many factors contribute to latency increase • Internal factors • External factors • Separate cause from effect • e.g., FE changes lead to ingress/egress changes • The scale of a large CDN • Hundreds of millions of users, grouped by ISP/Geo • Clients served at multiple FEs • Clients traverse multiple ingress/egress routers
Contributions • Classification: • Separating cause from effect • Identify threshold for classification • Metrics: analyze over sets of servers and routers • Metrics for each potential cause • Metrics by an individual router or server • Characterization: • Events of latency increases in Google’s CDN (06/2010)
Background: Client Performance Data • Performance data format: IP prefix, FE, requests per day (RPD), round-trip time (RTT) [Figure: client, AS path, ingress router, CDN network, front-end server (FE), egress router; performance data is collected at the FE]
Background: BGP Routing and Netflow Traffic • Netflow traffic (at edge routers): 15 mins by prefix • Incoming traffic: ingress router, FE, bytes-in • Outgoing traffic: egress router, FE, bytes-out • BGP routing (at edge routers): 15 mins by prefix • Egress router and AS path
Background: Joint Data Set • Granularity • Daily • By IP prefix • Format • FE, requests per day (RPD), round-trip time (RTT) • List of {ingress router, bytes-in} • List of {egress router, AS path, bytes-out} • Sources joined: BGP routing data, Netflow traffic data, performance data
Classification of Latency Increases • Performance data: group by region, identify events • For each event: FE change vs. FE latency increase • FE changes: latency map change vs. load balancing (inputs: latency map, FE capacity and demand) • FE latency increase: routing changes, ingress router vs. egress router and AS path (inputs: BGP routing, Netflow traffic)
Case Study: Flash Crowd Leads Some Requests to a Distant Front-End Server • Identify event: RTT doubled for an ISP in Malaysia • Diagnose: follow the decision tree • FE change vs. FE latency increase: 97.9% by FE changes • Latency map change vs. load balancing: 32.3% FE change by load balancing • RPD (requests per day) jumped: RPD2/RPD1 = 2.5
Classification: FE Server and Latency Metrics [Figure: classification decision tree]
FE Change vs. FE Latency Increase • RTT: weighted by requests from FEs • Break down RTT change by two factors • FE change • Clients switch from one FE to another (with higher RTT) • FE latency change • Clients using the same FE, latency to FE increases
FE Change vs. FE Latency Change Breakdown • FE change • FE latency change • Important properties • Analysis over a set of FEs • Sum up to 1
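The formulas for the two fractions did not survive on this slide. The sketch below shows one decomposition that has the stated properties (computed over a set of FEs, and the two parts sum to 1). It is an assumption, not necessarily the exact definition used in the talk. It weights each FE's RTT by its share of requests and assumes the same FEs appear on both days and that the overall RTT change is nonzero.

```python
def rtt_breakdown(day1, day2):
    """Split a region's RTT change into FE change and FE latency change.

    day1, day2: dict FE -> (requests per day, RTT in ms), same FEs on both days.
    Returns (delta_rtt_ms, fe_change_fraction, fe_latency_fraction).
    """
    if set(day1) != set(day2):
        raise ValueError("this sketch assumes the same FEs on both days")
    total1 = sum(rpd for rpd, _ in day1.values())
    total2 = sum(rpd for rpd, _ in day2.values())
    fe_change = 0.0    # clients moved between FEs
    fe_latency = 0.0   # clients stayed, RTT to the FE changed
    for fe in day1:
        rpd1, rtt1 = day1[fe]
        rpd2, rtt2 = day2[fe]
        w1, w2 = rpd1 / total1, rpd2 / total2
        fe_change += rtt2 * (w2 - w1)
        fe_latency += (rtt2 - rtt1) * w1
    delta_rtt = fe_change + fe_latency  # equals weighted RTT2 minus weighted RTT1
    return delta_rtt, fe_change / delta_rtt, fe_latency / delta_rtt

# Hypothetical numbers: traffic shifts toward the distant FE "B",
# and the RTT to FE "A" also rises.
day1 = {"A": (900, 50.0), "B": (100, 200.0)}
day2 = {"A": (500, 80.0), "B": (500, 200.0)}
delta, frac_fe, frac_lat = rtt_breakdown(day1, day2)
print(delta, frac_fe, frac_lat)
```

In this made-up example the weighted RTT rises by about 75 ms, of which roughly 64% is attributed to FE change and 36% to FE latency change.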
FE Changes: Latency Map vs. Load Balancing • Classify FE changes by two metrics: • Fraction of traffic shifted by the latency map • Fraction of traffic shifted by load balancing
Latency Map: Closest FE Server • Calculate latency map • Latency map format: (prefix, closest FE) • Aggregate by groups of clients: list of (FE_i, r_i), where r_i is the fraction of requests directed to FE_i by the latency map • Define latency map metric
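The metric itself is not preserved on this slide. A plausible sketch, stated as an assumption, aggregates the per-prefix map into per-FE request fractions r_i and then measures how much of the request volume the map redirects between two days, as half the sum of absolute differences in r_i. The prefixes and FE names below are hypothetical.

```python
from collections import defaultdict

def latency_map_fractions(latency_map, requests):
    """Aggregate (prefix -> closest FE) into FE -> fraction of requests r_i.

    latency_map: dict prefix -> closest FE
    requests:    dict prefix -> request count for the client group
    """
    total = sum(requests.values())
    r = defaultdict(float)
    for prefix, count in requests.items():
        r[latency_map[prefix]] += count / total
    return dict(r)

def latency_map_shift(r1, r2):
    """Fraction of requests whose closest FE changed (assumed metric)."""
    fes = set(r1) | set(r2)
    return sum(abs(r2.get(fe, 0.0) - r1.get(fe, 0.0)) for fe in fes) / 2

requests = {"192.0.2.0/25": 600, "192.0.2.128/25": 400}
map_day1 = {"192.0.2.0/25": "FE-A", "192.0.2.128/25": "FE-A"}
map_day2 = {"192.0.2.0/25": "FE-A", "192.0.2.128/25": "FE-B"}
r1 = latency_map_fractions(map_day1, requests)
r2 = latency_map_fractions(map_day2, requests)
shift = latency_map_shift(r1, r2)
print(r1, r2, shift)
```

Dividing by two counts each redirected request once rather than once at the FE it left and once at the FE it joined.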
Load Balancing: Avoiding Busy Servers • FE request distribution change • Fraction of requests shifted by the load balancer • Sum only if positive: target request load > actual load • Metric: more traffic load balanced on day 2
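A sketch of the "sum only if positive" rule: for each FE, compare the target share of requests (from the latency map) with the share it actually served, and add up only the shortfalls. Comparing the two days then shows whether more traffic was load balanced on day 2. The exact formula is not on the slide, so this is an assumption, as are the FE names and shares.

```python
def load_balanced_fraction(target, actual):
    """Fraction of requests moved away from their target FE.

    target: dict FE -> fraction of requests the latency map directs to it
    actual: dict FE -> fraction of requests it actually served
    Only FEs where target exceeds actual contribute.
    """
    fes = set(target) | set(actual)
    return sum(
        max(target.get(fe, 0.0) - actual.get(fe, 0.0), 0.0) for fe in fes
    )

# Hypothetical shares: on day 2, 30% of requests are moved off FE-A.
lb_day1 = load_balanced_fraction({"FE-A": 1.0}, {"FE-A": 1.0})
lb_day2 = load_balanced_fraction({"FE-A": 1.0}, {"FE-A": 0.7, "FE-B": 0.3})
delta_lb = lb_day2 - lb_day1   # positive: more load balancing on day 2
print(lb_day1, lb_day2, delta_lb)
```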
FE Latency Increase: Routing Changes • Correlate with routing changes: • Fraction of traffic that shifted ingress router • Fraction of traffic that shifted egress router or AS path • Inputs: BGP routing, Netflow traffic
Routing Changes: Ingress, Egress, AS Path • Identify the FE with the largest impact • Calculate the fraction of traffic that shifted routes • Ingress router: • f1j, f2j: fraction of traffic entering ingress j on days 1 and 2 • Egress router and AS path: • g1k, g2k: fraction of traffic leaving egress/AS path k on days 1 and 2
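Given the per-day fractions f1j, f2j (or g1k, g2k), one way to turn them into a single "fraction of traffic which shifted" is half the sum of absolute differences. That choice is an assumption, since the slide's formula is not preserved. The byte counts, router names, and AS paths below are hypothetical.

```python
def traffic_fractions(byte_counts):
    """Convert key -> bytes into key -> fraction of total bytes."""
    total = sum(byte_counts.values())
    return {k: v / total for k, v in byte_counts.items()}

def shifted_fraction(bytes_day1, bytes_day2):
    """Fraction of traffic that moved between keys from day 1 to day 2.

    Keys are ingress routers, or (egress router, AS path) pairs.
    """
    f1 = traffic_fractions(bytes_day1)
    f2 = traffic_fractions(bytes_day2)
    keys = set(f1) | set(f2)
    return sum(abs(f2.get(k, 0.0) - f1.get(k, 0.0)) for k in keys) / 2

# Ingress: half of the traffic moves from in-1 to in-2.
ingress_shift = shifted_fraction(
    {"in-1": 800, "in-2": 200},
    {"in-1": 300, "in-2": 700},
)
# Egress/AS path: same egress router, but the AS path changed entirely.
egress_shift = shifted_fraction(
    {("out-1", "64500 64501"): 1000},
    {("out-1", "64500 64502"): 1000},
)
print(ingress_shift, egress_shift)
```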
Identify Significant Performance Disruptions • Focus on large events • Large increases: >= 100 msec, or doubles • Many clients: for an entire region (country/ISP) • Sustained period: for an entire day • Characterize latency changes • Calculate daily latency changes by region
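The size threshold on this slide can be written as a simple predicate over a region's daily RTT. The sketch below covers only the "large increase" test (at least 100 ms, or a doubling); the region names and RTT values are made up for illustration.

```python
def is_large_increase(rtt_day1_ms, rtt_day2_ms):
    """True if daily RTT rose by at least 100 ms or at least doubled."""
    increase = rtt_day2_ms - rtt_day1_ms
    doubled = rtt_day1_ms > 0 and rtt_day2_ms >= 2 * rtt_day1_ms
    return increase >= 100 or doubled

def find_events(daily_rtt):
    """Return (region, day index) pairs where a large increase occurred.

    daily_rtt: dict region -> list of daily average RTTs in ms
    """
    events = []
    for region, series in daily_rtt.items():
        for day in range(1, len(series)):
            if is_large_increase(series[day - 1], series[day]):
                events.append((region, day))
    return events

daily_rtt = {
    "region-1": [40.0, 42.0, 90.0],     # doubles on day 2
    "region-2": [200.0, 250.0, 360.0],  # +110 ms on day 2
    "region-3": [80.0, 85.0, 90.0],     # no event
}
events = find_events(daily_rtt)
print(events)
```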
Latency Characterization for Google’s CDN • Apply the classification to one month of data (06/2010)