High-Fidelity Latency Measurements in Low-Latency Networks
Ramana Rao Kompella, Myungjin Lee (Purdue), Nick Duffield (AT&T Labs – Research)
Low Latency Applications
• Many important data center applications require low end-to-end latencies (microseconds)
  • High Performance Computing – lose parallelism
  • Cluster Computing, Storage – lose performance
  • Automated Trading – lose arbitrage opportunities
• Cloud applications
  • Recommendation Systems, Social Collaboration
  • All-up SLAs of 200ms [AlizadehSigcomm10]
  • Backend computation consumes most of this budget, leaving little for network latencies
Latency Measurements are Needed
[Figure: a path through ToR switches, edge routers, and core routers with a 1ms end-to-end latency – which router causes the problem?]
• Measurement within a router is necessary
• At every router, high-fidelity measurements are critical to localize root causes
• Once the root cause is localized, operators can fix it by rerouting traffic, upgrading links, or performing detailed diagnosis
Vision: Knowledge Plane
[Figure: a knowledge plane collects latency measurements from the data center network (push and pull) and answers queries from SLA diagnosis, routing/traffic engineering, and scheduling/job placement through a query interface]
Contributions Thus Far…
• Aggregate Latency Estimation
  • Lossy Difference Aggregator – Sigcomm 2009
  • FineComb – Sigmetrics 2011
  • mPlane – ReArch 2009
• Differentiated Latency Estimation (per-flow latency measurements at every hop)
  • Multiflow Estimator – Infocom 2010
  • Reference Latency Interpolation – Sigcomm 2010
  • RLI across Routers – Hot-ICE 2011
  • Delay Sketching – (under review at Sigcomm 2011)
• Scalable Query Interface (per-packet latency measurements)
  • MAPLE – (under review at Sigcomm 2011)
1) Per-Flow Measurements with Reference Latency Interpolation [Sigcomm 2010]
Obtaining Fine-Grained Measurements
• Native router support: SNMP, NetFlow
  • No latency measurements
• Active probes and tomography
  • Too many probes (~10,000 Hz) required, wasting bandwidth
• Use expensive high-fidelity measurement boxes
  • London Stock Exchange uses Corvil boxes
  • Cannot place them ubiquitously
• Recent work: LDA [Kompella09Sigcomm]
  • Computes average latency/variance accurately within a switch
  • Provides a good start but may not be sufficient to diagnose flow-specific problems
From Aggregates to Per-Flow …
[Figure: packet delays through a switch queue range from small to large within a time interval, around the interval's average latency]
• Observation: Significant differences in average latencies across flows at a router
• Goal of this paper: How to obtain per-flow latency measurements in a scalable fashion?
Measurement Model
[Figure: packets traverse a router from ingress interface I to egress interface E]
• Assumption: Time synchronization between router interfaces
• Constraint: Cannot modify regular packets to carry timestamps
  • Would require intrusive changes to the router forwarding path
Naïve Approach
[Figure: ingress I and egress E each record per-packet timestamps; egress computes per-packet delays and per-flow averages, e.g., 22/2 = 11 and 32/2 = 16]
• For each flow key:
  • Store timestamps for each packet at I and E
  • After a flow stops sending, I sends the packet timestamps to E
  • E computes individual packet delays
  • E aggregates average latency, variance, etc. for each flow
• Problem: High communication costs
  • At 10 Gbps, a few million packets per second
  • Sampling reduces communication, but also reduces accuracy
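A minimal sketch of this naïve scheme, assuming hypothetical flow/packet identifiers and lossless, clock-synchronized capture at both interfaces; the egress matches its timestamps against the ingress timestamps it receives and aggregates per-flow statistics:

```python
from collections import defaultdict

def naive_per_flow_stats(ingress_ts, egress_ts):
    """Compute per-flow average delay and variance from matched timestamps.

    ingress_ts / egress_ts: dict mapping (flow_key, packet_id) -> timestamp.
    Assumes both interfaces saw the same packets (no loss, synchronized clocks).
    """
    sums = defaultdict(lambda: [0, 0.0, 0.0])   # flow_key -> [count, sum, sum_sq]
    for (flow, pkt), t_in in ingress_ts.items():
        d = egress_ts[(flow, pkt)] - t_in       # per-packet delay
        c = sums[flow]
        c[0] += 1
        c[1] += d
        c[2] += d * d
    stats = {}
    for flow, (n, s, sq) in sums.items():
        mean = s / n
        stats[flow] = (mean, sq / n - mean * mean)  # (avg delay, variance)
    return stats

# Example in the spirit of the slide: two packets of one flow
ingress = {("f1", 1): 10, ("f1", 2): 15}
egress  = {("f1", 1): 20, ("f1", 2): 27}
print(naive_per_flow_stats(ingress, egress))   # avg delay = (10 + 12) / 2 = 11
```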
A (Naïve) Extension of LDA
[Figure: one LDA per flow of interest at ingress I and egress E, coordinated to produce per-flow latency from packet counts and sums of timestamps]
• Maintain LDAs with many counters for the flows of interest
• Problem: (Potentially) high communication costs
  • Proportional to the number of flows
Key Observation: Delay Locality
[Figure: packet delays D1, D2, D3 over time, and the average delays WD1, WD2, WD3 of the windows containing them]
• True mean delay = (D1 + D2 + D3) / 3
• Localized mean delay = (WD1 + WD2 + WD3) / 3
• How close is the localized mean delay to the true mean delay as the window size varies?
Key Observation: Delay Locality
[Figure: localized vs. true mean delay per key (ms), for data sets from a real router and synthetic queueing models; window sizes of 1s, 10ms, and 0.1ms give RMSRE of 1.72, 0.16, and 0.054 respectively, i.e., smaller windows track the per-key true mean far better than the global mean]
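For reference, RMSRE here presumably denotes the root mean square relative error over flow keys; a sketch of the standard definition, with $D_k$ the true mean delay of key $k$ and $\hat{D}_k$ its localized estimate over $n$ keys:

$$\mathrm{RMSRE} = \sqrt{\frac{1}{n}\sum_{k=1}^{n}\left(\frac{\hat{D}_k - D_k}{D_k}\right)^{2}}$$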
Exploiting Delay Locality
[Figure: reference packets carrying ingress timestamps are interleaved with regular packets over time]
• Reference packets are injected regularly at the ingress I
  • Special packets carrying an ingress timestamp
• Provide reference delay values (a substitute for window averages)
• Used to approximate the latencies of regular packets
RLI Architecture
[Figure: a reference packet generator at ingress I and a latency estimator at egress E; reference packets carry the ingress timestamp]
• Component 1: Reference packet generator
  • Injects reference packets regularly
• Component 2: Latency estimator
  • Estimates packet latencies and updates per-flow statistics
  • Estimates directly at the egress with no extra state maintained at the ingress side (reduces storage and communication overheads)
Component 1: Reference Packet Generator
• Question: When to inject a reference packet?
• Idea 1: 1-in-n: Inject one reference packet every n packets
  • Problem: low accuracy under low utilization
• Idea 2: 1-in-τ: Inject one reference packet every τ seconds
  • Problem: bad when short-term delay variance is high
• Our approach: Dynamic injection based on utilization
  • High utilization → low injection rate
  • Low utilization → high injection rate
  • Adaptive scheme works better than fixed-rate schemes
  • Details in the paper
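A hypothetical sketch of what utilization-adaptive injection could look like; the linear mapping, bounds, and constants below are illustrative assumptions, not the scheme from the paper:

```python
def reference_injection_gap(link_utilization,
                            min_gap_pkts=10, max_gap_pkts=10_000):
    """Return how many regular packets to let pass before injecting the next
    reference packet. Higher utilization -> larger gap (lower injection rate),
    lower utilization -> smaller gap (higher injection rate).

    The linear mapping and bounds here are illustrative assumptions.
    """
    u = min(max(link_utilization, 0.0), 1.0)   # clamp utilization to [0, 1]
    gap = min_gap_pkts + u * (max_gap_pkts - min_gap_pkts)
    return int(gap)

# e.g., at 10% utilization inject roughly every ~1k packets,
# at 90% utilization roughly every ~9k packets
print(reference_injection_gap(0.1), reference_injection_gap(0.9))
```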
Component 2: Latency Estimator
[Figure: a regular packet's delay is estimated from the linear interpolation line between the surrounding reference packets, whose arrival times and delays are known; the gap between the interpolated and true delay is the estimation error]
• Question 1: How to estimate latencies using reference packets?
• Solution: Different estimators possible
  • Use only the delay of the left reference packet (RLI-L)
  • Use linear interpolation of the left and right reference packets (RLI)
  • Other non-linear estimators possible (e.g., shrinkage)
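A minimal sketch of the linear-interpolation estimator, assuming arrival times and delays are in consistent units:

```python
def interpolate_delay(t_pkt, t_left, d_left, t_right, d_right):
    """Estimate a regular packet's delay by linearly interpolating between
    the left and right reference packets that bracket its arrival time
    (the RLI estimator described above). Falls back to the left reference
    packet's delay (RLI-L) if the two reference packets coincide.
    """
    if t_right == t_left:
        return d_left
    alpha = (t_pkt - t_left) / (t_right - t_left)
    return d_left + alpha * (d_right - d_left)

# Reference packets at t=0 (delay 100 us) and t=10 (delay 200 us):
# a packet arriving at t=4 gets an estimated delay of 140 us.
print(interpolate_delay(4, 0, 100, 10, 200))
```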
Component 2: Latency Estimator
[Figure: packets wait in an interpolation buffer; when the right reference packet arrives, their delays are estimated and the per-flow counters are updated; average latency = C2 / C1; any flow selection strategy decides which flows keep counters until export]
• Question 2: How to compute per-flow latency statistics?
• Solution: Maintain 3 counters per flow at the egress side
  • C1: Number of packets
  • C2: Sum of packet delays
  • C3: Sum of squares of packet delays (for estimating variance)
• To minimize state, can use any flow selection strategy to maintain counters for only a subset of flows
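A sketch of the per-flow counter updates; the class name, flow keys, and the assumption that packets have already been interpolated (e.g., with the interpolate_delay helper above) are illustrative:

```python
from collections import defaultdict

class PerFlowStats:
    """Maintain C1 (packet count), C2 (sum of delays), and
    C3 (sum of squared delays) per flow at the egress."""

    def __init__(self):
        self.counters = defaultdict(lambda: [0, 0.0, 0.0])

    def update(self, flow_key, estimated_delay):
        c = self.counters[flow_key]
        c[0] += 1                              # C1
        c[1] += estimated_delay                # C2
        c[2] += estimated_delay ** 2           # C3

    def export(self, flow_key):
        n, s, sq = self.counters.pop(flow_key)
        mean = s / n                           # avg latency = C2 / C1
        variance = sq / n - mean * mean        # E[d^2] - E[d]^2
        return mean, variance

stats = PerFlowStats()
for d in (140.0, 160.0, 150.0):               # interpolated delays of one flow
    stats.update("f1", d)
print(stats.export("f1"))                      # (150.0, ~66.7)
```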
Experimental Setup
• Data sets
  • No public data center traces with timestamps
  • Real router traces with synthetic workloads: WISC
  • Real backbone traces with synthetic queueing: CHIC and SANJ
• Simulation tool: open-source NetFlow software – YAF
  • Supports the reference packet injection mechanism
  • Simulates a queueing model with the RED active queue management policy
• Experiments with different link utilizations
Accuracy under High Link Utilization
[Figure: CDF of relative error; median relative error is 10–12%]
Comparison with Other Solutions
[Figure: average relative error vs. utilization at a packet sampling rate of 0.1%; RLI's error is 1–2 orders of magnitude lower]
Overhead of RLI
• Bandwidth overhead is low
  • Less than 0.2% of link capacity
• Impact on packet loss is small
  • Packet loss difference with and without RLI is at most 0.001% at around 80% utilization
Summary
• A scalable architecture to obtain high-fidelity per-flow latency measurements between router interfaces
• Achieves a median relative error of 10–12%
• Obtains 1–2 orders of magnitude lower relative error compared to existing solutions
• Measurements are obtained directly at the egress side
Contributions Thus Far…
• Aggregate Latency Estimation
  • Lossy Difference Aggregator – Sigcomm 2009
  • FineComb – Sigmetrics 2011
  • mPlane – ReArch 2009
• Differentiated Latency Estimation
  • Multiflow Estimator – Infocom 2010
  • Reference Latency Interpolation – Sigcomm 2010
  • RLI across Routers – Hot-ICE 2011
  • Virtual LDA – (under review at Sigcomm 2011)
• Scalable Query Interface
  • MAPLE – (under review at Sigcomm 2011)
2) Scalable Per-Packet Latency Measurement Architecture (under review at Sigcomm 2011)
MAPLE Motivation
• LDA and RLI are ossified in their aggregation level
  • Not suitable for obtaining arbitrary sub-population statistics
  • A single packet's delay may be important
• Key Goal: How to enable a flexible and scalable architecture for packet latencies?
MAPLE Architecture
[Figure: each router runs a timestamp unit and a packet latency store; a central monitor sends queries Q(P1) to the query engine and receives answers A(P1)]
• Timestamping not strictly required
  • Can work with RLI-estimated latencies
Packet Latency Store (PLS)
• Challenge: How to store packet latencies in the most efficient manner?
• Naïve idea: Hash tables do not scale well
  • At a minimum, require a label (32 bits) + timestamp (32 bits) per packet
  • To avoid collisions, need a large number of hash table entries (~147 bits/pkt for a collision rate of 1%)
• Can we do better?
Our Approach
• Idea 1: Cluster packets
  • Typically few dominant delay values
  • Cluster packets into equivalence classes
  • Associate one delay value with each cluster
  • Choose cluster centers such that the error is small
• Idea 2: Provision storage
  • Naïvely, we can use one Bloom filter per cluster (Partitioned Bloom Filter)
  • We propose a new data structure called the Shared-Vector Bloom Filter (SVBF) that is more efficient
Selecting Representative Delays
• Approach 1: Logarithmic delay selection
  • Divide the delay range into logarithmic intervals
  • E.g., 0.1–10,000 μs → 0.1–1 μs, 1–10 μs, …
  • Simple to implement, bounded relative error, but accuracy may not be optimal
• Approach 2: Dynamic clustering
  • k-means (medians) clustering formulation
  • Minimizes the average absolute error of packet latencies (minimizes total Euclidean distance)
• Approach 3: Hybrid clustering
  • Split centers equally across static and dynamic
  • Best of both worlds
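A sketch of the logarithmic delay selection idea; mapping each delay to the geometric midpoint of its decade is an illustrative assumption (it is one way to get a bounded relative error), and the range and bucket width are the example values from the slide:

```python
import math

def log_bucket_center(delay_us, lo=0.1, hi=10_000.0, decades_per_bucket=1):
    """Map a packet delay (in microseconds) to a representative delay: the
    geometric midpoint of its logarithmic interval, e.g. 0.1-1, 1-10,
    10-100 us for one decade per bucket."""
    d = min(max(delay_us, lo), hi)                   # clamp to the covered range
    exponent = math.floor(math.log10(d / lo) / decades_per_bucket)
    last_bucket = int(math.log10(hi / lo) / decades_per_bucket) - 1
    exponent = min(exponent, last_bucket)            # keep hi in the last bucket
    lower = lo * 10 ** (exponent * decades_per_bucket)
    upper = lower * 10 ** decades_per_bucket
    return math.sqrt(lower * upper)                  # geometric midpoint

print(log_bucket_center(3.7))     # 3.7 us falls in the 1-10 us bucket -> ~3.16
print(log_bucket_center(420.0))   # 420 us falls in the 100-1000 us bucket -> ~316
```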
K-means
• Goal: Determine k centers every measurement cycle
  • Can be formulated as a k-means clustering problem
• Problem 1: Running k-means exactly is typically hard
  • The basic algorithm has O(n^(k+1) log n) run time
  • Heuristics (Lloyd's algorithm) are also complicated in practice
• Solution: Sampling and streaming algorithms
  • Use sampling to reduce n to pn
  • Use a streaming k-medians algorithm (approximate but sufficient)
• Problem 2: Can't find centers and record membership at the same time
• Solution: Pipelined implementation
  • Use the previous interval's centers as an approximation for this interval
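A hypothetical sketch of the pipelined idea: assign this epoch's packets with the previous epoch's centers while sampling delays for the next epoch's clustering. The `cluster_k_medians` parameter stands in for the streaming k-medians algorithm, and the sampling rate and names are illustrative:

```python
import random

def nearest_center(delay, centers):
    """Assign a delay to the closest of the current centers."""
    return min(centers, key=lambda c: abs(c - delay))

def run_epoch(packet_delays, prev_centers, k=50, sample_prob=0.01,
              cluster_k_medians=None):
    """Assign this epoch's packets using the previous epoch's centers,
    while sampling delays to compute centers for the next epoch."""
    sample = []
    assignments = []
    for d in packet_delays:
        assignments.append(nearest_center(d, prev_centers))
        if random.random() < sample_prob:      # reduce n to p*n
            sample.append(d)
    next_centers = (cluster_k_medians(sample, k)
                    if cluster_k_medians else prev_centers)
    return assignments, next_centers
```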
Streaming k-Medians [CharikarSTOC03]
[Figure: in software, np sampled packets from the i-th epoch pass through an online clustering stage producing O(k log(np)) centers and an offline clustering stage producing k centers for the (i+1)-th epoch; in hardware, packets in the (i+2)-th epoch update the storage data structure, which is flushed to DRAM/SSD after every epoch for archival]
Naïve: Partitioned BF (PBF)
[Figure: insertion – the packet's latency is matched in parallel against the closest center, and bits are set by hashing the packet contents into that center's Bloom filter; lookup – the packet contents are queried against all Bloom filters, and a filter matches when all of its hashed bits are 1]
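A minimal sketch of a partitioned Bloom filter along these lines; the hashing scheme (Python's salted `hash`), filter sizes, and center values are illustrative assumptions, not the paper's implementation:

```python
class PartitionedBF:
    """One Bloom filter per cluster center: insert a packet into the filter
    of its closest center, look it up by querying every filter."""

    def __init__(self, centers, bits_per_filter=1 << 16, num_hashes=4):
        self.centers = centers
        self.m = bits_per_filter
        self.k = num_hashes
        self.filters = [bytearray(self.m) for _ in centers]

    def _positions(self, packet_key):
        return [hash((packet_key, i)) % self.m for i in range(self.k)]

    def insert(self, packet_key, delay):
        # match the closest center (done in parallel in hardware)
        idx = min(range(len(self.centers)),
                  key=lambda i: abs(self.centers[i] - delay))
        for pos in self._positions(packet_key):
            self.filters[idx][pos] = 1

    def lookup(self, packet_key):
        # query all Bloom filters; return every center whose bits are all 1
        positions = self._positions(packet_key)
        return [c for f, c in zip(self.filters, self.centers)
                if all(f[p] for p in positions)]

pbf = PartitionedBF(centers=[1.0, 10.0, 100.0, 1000.0])
pbf.insert("pkt-42", delay=12.5)
print(pbf.lookup("pkt-42"))   # [10.0] (plus possible false positives)
```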
Problems with PBF
• Provisioning is hard
  • Cluster sizes are not known a priori
  • Over- or under-estimation of BF sizes
• Lookup complexity is higher
  • Need the data structure to be re-partitioned every cycle
  • Need to look up multiple random locations in the bitmap (based on the number of hash functions)
Shared-Vector Bloom Filter
[Figure: insertion – the packet contents are hashed to a bit position, the packet's latency is matched in parallel against the closest center, and the bit is set after offsetting by the index of the matched center; lookup – the packet contents are hashed, a contiguous block of bits (one per center) is bulk-read at each hash position and ANDed, and the offset of the surviving bit is the center id]
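A sketch of the shared-vector idea, assuming each hash position expands into a block of one bit per center and the matched center's offset within that block is set; the hashing and sizing are illustrative:

```python
class SharedVectorBF:
    """Single shared bit vector: each hash maps a packet to a block of
    len(centers) bits, and the bit at the matched center's offset is set.
    Lookup bulk-reads the blocks and ANDs them across hash functions."""

    def __init__(self, centers, num_blocks=1 << 16, num_hashes=4):
        self.centers = centers
        self.c = len(centers)
        self.blocks = num_blocks
        self.k = num_hashes
        self.bits = bytearray(num_blocks * self.c)

    def _block_starts(self, packet_key):
        return [(hash((packet_key, i)) % self.blocks) * self.c
                for i in range(self.k)]

    def insert(self, packet_key, delay):
        offset = min(range(self.c),
                     key=lambda i: abs(self.centers[i] - delay))
        for start in self._block_starts(packet_key):
            self.bits[start + offset] = 1

    def lookup(self, packet_key):
        # AND the per-center bit blocks read at each hash position
        result = [1] * self.c
        for start in self._block_starts(packet_key):
            block = self.bits[start:start + self.c]   # bulk read
            result = [a & b for a, b in zip(result, block)]
        return [c for bit, c in zip(result, self.centers) if bit]

svbf = SharedVectorBF(centers=[1.0, 10.0, 100.0, 1000.0])
svbf.insert("pkt-42", delay=12.5)
print(svbf.lookup("pkt-42"))   # [10.0] (plus possible false positives)
```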
Comparing PBF and SVBF
• PBF
  − Lookup is not easily parallelizable
  − Provisioning is hard since the number of packets per BF is not known a priori
• SVBF
  + A single Bloom filter is used
  + Burst read at the length of a word
• COMB [Hao10Infocom]
  + Single BF with groups of hash functions
  − More memory usage than SVBF, and burst read not possible
Comparing Storage Needs
[Table: storage per packet for the same classification failure rate of 1% and 50 centers (k = 50)]
Tie-Breaking Heuristic
• Bloom filters have false positives
  • Lookups involve a search across all BFs
  • So, multiple BFs may return a match
• The tie-breaking heuristic returns the group with the highest cardinality
  • Store a counter per center recording the number of packets that matched it (cluster cardinality)
• Works well in practice (especially with skewed distributions)
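A small sketch of the tie-breaker, layered on top of a latency store with the insert/lookup interface of the sketches above; where the cardinality counters live is an assumption made for illustration:

```python
from collections import Counter

cardinality = Counter()            # packets inserted per center

def insert_with_count(store, packet_key, delay, centers):
    center = min(centers, key=lambda c: abs(c - delay))
    cardinality[center] += 1       # track cluster cardinality
    store.insert(packet_key, delay)

def lookup_with_tiebreak(store, packet_key):
    matches = store.lookup(packet_key)
    if not matches:
        return None                                    # no-match
    return max(matches, key=lambda c: cardinality[c])  # break ties by cardinality
```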
Estimation Accuracy
[Figure: CDF of absolute error (μs)]
Accuracy of Aggregates
[Figure: CDF of relative error]
MAPLE Architecture
[Figure: the central monitor issues a query Q(P1) to a router's query engine and receives the answer A(P1)]
Query Interface
• Assumption: The path of a packet is known
  • Possible to determine using forwarding tables
  • In OpenFlow-enabled networks, the controller has this information
• Query answer:
  • Latency estimate
  • Type: (1) Match, (2) Multi-Match, (3) No-Match
Query Bandwidth
[Figure: query messages carry a flow key and continuous IPID blocks, e.g., (f1, 1–5) and (f1, 20–35)]
• Query method 1: Query using a packet hash
  • Hashed using invariant fields in the packet header
  • High query bandwidth for aggregate latency statistics (e.g., flow-level latencies)
• Query method 2: Query using the flow key and IP identifier
  • Supports range search to reduce query bandwidth overhead
  • Inserts: use the flow key and IPID for hashing
  • Queries: send a flow key and ranges of continuous IPIDs
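A sketch of how contiguous IPID blocks could compress a per-flow query; the message format is purely illustrative:

```python
def compress_ipids(flow_key, ipids):
    """Collapse a flow's IP identifiers into (flow_key, start, end) ranges,
    e.g. [1,2,3,4,5,20,...,35] -> (f1, 1, 5), (f1, 20, 35)."""
    ranges = []
    for ipid in sorted(set(ipids)):
        if ranges and ipid == ranges[-1][2] + 1:
            ranges[-1][2] = ipid                 # extend the current range
        else:
            ranges.append([flow_key, ipid, ipid])
    return [tuple(r) for r in ranges]

print(compress_ipids("f1", list(range(1, 6)) + list(range(20, 36))))
# [('f1', 1, 5), ('f1', 20, 35)]
```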
Query Bandwidth Compression
[Figure: CDF of compression ratio; the median per-flow compression reduces query bandwidth by 90%]
Storage
• OC-192 interface
  • ~5 million packets per second
  • ~60 Mbits of storage per second
  • Assuming 10% utilization, 6 Mbits per second
• DRAM – 16 GB
  • ~40 minutes of packets
• SSD – 256 GB
  • ~10 hours – enough time for diagnosis
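The slide's numbers are consistent with roughly 12 bits of storage per packet (an inference from the figures above, not a stated parameter); a quick check of the arithmetic:

$$
\frac{60\ \text{Mbit/s}}{5\times10^{6}\ \text{pkt/s}} = 12\ \text{bits/pkt},\qquad
\frac{16\ \text{GB}\times 8}{60\ \text{Mbit/s}} \approx 2{,}133\ \text{s} \approx 36\ \text{min},\qquad
\frac{256\ \text{GB}\times 8}{60\ \text{Mbit/s}} \approx 34{,}133\ \text{s} \approx 9.5\ \text{h}
$$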
Summary
• RLI and LDA are ossified in their aggregation level
• Proposed MAPLE as a mechanism to compute measurements across arbitrary sub-populations
  • Relies on clustering of dominant delay values
  • Novel SVBF data structure to reduce storage and lookup complexity
Conclusion
• Many applications demand low latencies
• Network operators need high-fidelity tools for latency measurements
• Proposed RLI for fine-grained per-flow measurements
• Proposed MAPLE to:
  • Store per-packet latencies in a scalable way
  • Compose latency aggregates across arbitrary sub-populations
• Many other solutions (papers on my web page)
Sponsors
• CNS-1054788: NSF CAREER: Towards a Knowledge Plane for Data Center Networks
• CNS-0831647: NSF NECO: Architectural Support for Fault Management
• Cisco Systems: Designing Router Primitives for Monitoring Network Health