Software-defined Measurement
Minlan Yu, University of Southern California
Joint work with Lavanya Jose, Rui Miao, Masoud Moshref, Ramesh Govindan, Amin Vahdat
Management = Measurement + Control
• Accounting
  • Count resource usage for tenants
• Traffic engineering
  • Identify large traffic aggregates, traffic changes
  • Understand flow characteristics (flow size, etc.)
• Performance diagnosis
  • Why does my application have high delay, low throughput?
Yet, measurement is underexplored
• Measurement is an afterthought in network devices
  • Control functions are optimized w/ many resources
  • Limited, fixed measurement support with NetFlow/sFlow
• Traffic analysis is incomplete and indirect
  • Incomplete: May not catch all the events from samples
  • Indirect: Offline analysis based on pre-collected logs
• Network-wide view of traffic is especially difficult
  • Data are collected at different times/places
Software-defined Measurement
[Figure: a controller running Heavy Hitter detection and Change detection over the switches. Steps: 1) (Re)Configure resources, 2) Fetch statistics]
• SDN offers unique opportunities for measurement
  • Simple, reusable primitives at switches
  • Diverse and dynamic analysis at controller
  • Network-wide view
Challenges
• Diverse measurement tasks
  • Generic measurement primitives for diverse tasks
  • Measurement library for easy programming
• Limited resources at switches
  • New data structures to reduce memory usage
  • Multiplexing across many tasks
Software-defined Measurement
• OpenSketch (NSDI’13)
  • Data plane primitives: sketch-based, commodity switch components
  • Resource allocation across tasks: optimization w/ provable resource-accuracy bounds
  • Prototype: open source, NetFPGA + sketch library
• DREAM (SIGCOMM’14)
  • Data plane primitives: flow-based, OpenFlow TCAM
  • Resource allocation across tasks: dynamic allocation w/ accuracy estimator
  • Prototype: networks of hardware switches and Open vSwitch
Software Defined Networking
• Controller: configure devices and collect measurements
• API to the data plane (OpenFlow)
  • Rule format: fields, action, counters
  • Example: Src=1.2.3.4, drop, #packets, #bytes
• Switches: forward/measure packets
• Rethink the abstractions for measurement
Tradeoff of Generality and Efficiency
• Generality
  • Supporting a wide variety of measurement tasks
  • Who’s sending a lot to 23.43.0.0/16?
  • Is someone being DDoS-ed?
  • How many people downloaded files from 10.0.2.1?
• Efficiency
  • Enabling high link speed (40 Gbps or larger)
  • Ensuring low cost (cheap switches with small memory)
  • Easy to implement with commodity switch components
NetFlow: General, Not Efficient
• Cisco NetFlow/sFlow
  • Log sampled packets, or flow-level counters
• General
  • Ok for many measurement tasks
  • Not ideal for any single task
• Not efficient
  • It’s hard to determine the right sampling rate
  • Measurement accuracy depends on traffic distribution
  • Turned off or not even available in datacenters
Streaming Algo: Efficient, Not General
[Figure: Count-Min Sketch. Data plane: # bytes from 23.43.12.1 is added to one counter per row, selected by Hash1, Hash2, Hash3. Counter rows: Hash1: 3 0 5 1 9; Hash2: 0 1 9 3 0; Hash3: 1 2 0 3 4. Control plane: query 23.43.12.1 reads 5, 3, 4; pick min: 3]
• Streaming algorithms
  • Summarize packet information with sketches
  • E.g. Count-Min Sketch: Who’s sending a lot to host A?
• Not general: each algorithm solves just one question
  • Require customized hardware or network processors
  • Hard to implement every solution in practice
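The Count-Min Sketch in the figure can be sketched in a few lines of Python. This is an illustration, not the OpenSketch data plane: the class name `CountMinSketch`, the salted BLAKE2 row hashes, and the default size (3 rows of 5 counters, mirroring the figure) are assumptions made for the example.

```python
import hashlib


class CountMinSketch:
    """Illustrative Count-Min Sketch: `depth` rows of `width` counters."""

    def __init__(self, depth=3, width=5):
        self.depth = depth
        self.width = width
        self.rows = [[0] * width for _ in range(depth)]

    def _index(self, row, key):
        # One hash function per row, derived by salting BLAKE2 with the row id
        # (an assumption; a switch would use simple hardware hashes instead).
        digest = hashlib.blake2b(
            key.encode(), digest_size=8, salt=bytes([row])
        ).digest()
        return int.from_bytes(digest, "big") % self.width

    def update(self, key, nbytes):
        # Data plane: add the packet size to one counter in every row.
        for row in range(self.depth):
            self.rows[row][self._index(row, key)] += nbytes

    def query(self, key):
        # Control plane: collisions only inflate counters, so the minimum
        # over the rows is the tightest estimate and never undercounts.
        return min(
            self.rows[row][self._index(row, key)] for row in range(self.depth)
        )
```

Calling `update("23.43.12.1", nbytes)` per packet and `query("23.43.12.1")` at the controller corresponds to the "pick min" step in the figure.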
Where is the Sweet Spot?
[Figure: spectrum from General to Efficient. NetFlow/sFlow (too expensive) sits at the general end, Streaming Algo (not practical) at the efficient end, and OpenSketch in between]
• OpenSketch
  • General and efficient data plane based on sketches
  • Modularized control plane with automatic configuration
Flexible Measurement Data Plane
• Picking the packets to measure
  • Hashes to represent a compact set of flows
    • A set of blacklisted IPs
  • Classify flows with different resources/accuracy
    • Filter out traffic for 23.43.0.0/16
• Storing and exporting the data
  • A table with flexible indexing
  • Complex indexing using hashes and classification
  • Diverse mappings between counters and flows
A three-stage pipeline
[Figure: # bytes from 23.43.12.1 passes through Hash1, Hash2, Hash3 into three counter rows: 3 0 5 1 9; 0 1 9 3 0; 1 2 0 3 4]
• Hashing: A few hash functions on packet source
• Classification: Based on hash value or packets
• Counting: Update a few counters with simple calc.
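The three stages can be sketched as three small functions chained per packet. The code below is a hedged illustration, not the NetFPGA design: the function names, the CRC32-based hashes, the table size `NUM_COUNTERS`, and the choice to classify on destination prefix 23.43.0.0/16 are all assumptions.

```python
import ipaddress
import zlib

NUM_COUNTERS = 8  # counters per table (assumed size)
MONITORED = ipaddress.ip_network("23.43.0.0/16")  # example prefix from the talk


def hash_stage(src_ip, seed):
    # Stage 1: a simple seeded hash on the packet source.
    return zlib.crc32(f"{seed}:{src_ip}".encode())


def classify_stage(dst_ip):
    # Stage 2: decide whether this packet should be measured at all.
    return ipaddress.ip_address(dst_ip) in MONITORED


def count_stage(table, index, nbytes):
    # Stage 3: update one counter with a simple calculation (an addition).
    table[index] += nbytes


def process_packet(tables, src_ip, dst_ip, nbytes):
    """Run one packet through hashing, classification, and counting."""
    hashes = [hash_stage(src_ip, seed) for seed in range(len(tables))]
    if not classify_stage(dst_ip):
        return False
    for table, value in zip(tables, hashes):
        count_stage(table, value % NUM_COUNTERS, nbytes)
    return True
```

With `tables = [[0] * NUM_COUNTERS for _ in range(3)]`, each measured packet touches exactly one counter per table, which is the access pattern the figure shows.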
Build on Existing Switch Components
• A few simple hash functions
  • 4-8 three-wise or five-wise independent hash functions
  • Leverage traffic diversity to approx. truly random func.
• A few TCAM entries for classification
  • Match on both packets and hash values
  • Avoid matching on individual micro-flow entries
• Flexible counters in SRAM
  • Many logical tables for different sketches
  • Different numbers and sizes of counters
  • Access counters by addresses
Modularized Measurement Library
• A measurement library of sketches
  • Bitmap, Bloom filter, Count-Min Sketch, etc.
  • Easy to implement with the data plane pipeline
  • Support diverse measurement tasks
• Implement Heavy Hitters with OpenSketch
  • Who’s sending a lot to 23.43.0.0/16?
  • Count-min sketch to count volume of flows
  • Reversible sketch to identify flows with heavy counts in the count-min sketch
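As one example of a library building block, a bitmap can estimate how many distinct sources were seen. The sketch below uses the standard linear-counting estimate n ≈ -m ln(z/m), where m is the bitmap size and z the number of zero bits. The class name `Bitmap`, the BLAKE2 hash, and the default size are assumptions, not the OpenSketch library API.

```python
import hashlib
import math


class Bitmap:
    """Illustrative bitmap sketch for counting distinct keys."""

    def __init__(self, size=1024):
        self.size = size
        self.bits = [False] * size

    def add(self, key):
        # Hash the key to one bit position; duplicates hit the same bit.
        digest = hashlib.blake2b(key.encode(), digest_size=8).digest()
        self.bits[int.from_bytes(digest, "big") % self.size] = True

    def estimate(self):
        # Linear counting: n is approximately -m * ln(zero_bits / m).
        zeros = self.bits.count(False)
        if zeros == 0:
            return float("inf")  # bitmap saturated; a larger size is needed
        return -self.size * math.log(zeros / self.size)
```

Feeding every source address that contacts 10.0.2.1 into `add` and reading `estimate()` gives an approximate answer to the "how many people downloaded files" question with a fixed amount of memory.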
Resource management
• Automatic configuration within a task
  • Pick the right sketches for measurement tasks
  • Allocating resources across sketches
  • Based on provable resource-accuracy curves
• Resource allocation across tasks
  • Operators simply specify relative importance of tasks
  • Minimizing weighted error using convex optimization
  • Decompose to optimization problem of individual tasks
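To make the weighted-error objective concrete, here is a minimal sketch under a simplifying assumption: each task's error follows a curve error(m) = c / m in its memory m (the shape of the Count-Min error bound). Minimizing the weighted sum of errors subject to a fixed total memory then has a closed form, with each task's share proportional to sqrt(weight * c). The function names and the task constants are hypothetical; OpenSketch's actual curves and solver may differ.

```python
import math


def allocate_memory(tasks, total_memory):
    """Split memory to minimize sum(weight * c / m) subject to sum(m) = total.

    tasks maps a task name to (weight, c), where error(m) = c / m is the
    assumed resource-accuracy curve. Setting the derivative of the
    Lagrangian to zero gives m proportional to sqrt(weight * c).
    """
    roots = {name: math.sqrt(w * c) for name, (w, c) in tasks.items()}
    norm = sum(roots.values())
    return {name: total_memory * r / norm for name, r in roots.items()}


def weighted_error(tasks, alloc):
    # The objective being minimized: weighted sum of per-task errors.
    return sum(w * c / alloc[name] for name, (w, c) in tasks.items())
```

For two tasks with the same curve but weights 4 and 1, the split is 2:1 rather than 4:1, because the error curve flattens as memory grows.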
Evaluation
• Prototype on NetFPGA
  • No effect on data plane throughput
  • Line speed measurement performance
• Trace-driven simulators
  • OpenSketch, NetFlow, and streaming algorithms
  • One-hour CAIDA packet traces on a backbone link
• Tradeoff between generality and efficiency
  • How efficient is OpenSketch compared to NetFlow?
  • How accurate is OpenSketch compared to specific streaming algorithms?
Heavy Hitters: false positives/negatives
• Identify flows taking > 0.5% bandwidth
• OpenSketch requires less memory with higher accuracy
Tradeoff Efficiency for Generality
• In theory, OpenSketch requires 6 times the memory of a complex streaming algorithm
OpenSketch Conclusion
• OpenSketch: Bridging the gap between theory and practice
• Leveraging good properties of sketches
  • Provable accuracy-memory tradeoff
• Making sketches easy to implement and use
  • Generic support for different measurement tasks
  • Easy to implement with commodity switch hardware
  • Modularized library for easy programming
Dynamic Resource Allocation For TCAM-based Measurement (SIGCOMM’14)
SDM Challenges
[Figure: a controller running many tasks (Heavy Hitter detection, Change detection, ...) on top of a Dynamic Resource Allocator. Steps: 1) (Re)Configure resources, 2) Fetch statistics]
• Many management tasks
• Limited resources (TCAM)
Dynamic Resource Allocator
[Figure: Recall = detected true HHs / all true HHs]
• Diminishing return of resources
  • More resources make smaller accuracy gain
  • More resources find less significant outputs
  • Operators can accept an accuracy bound <100%
Dynamic Resource Allocator
[Figure: Recall = detected true HHs / all true HHs]
• Temporal and spatial resource multiplexing
  • Traffic varies over time and switches
  • Resource for an accuracy bound depends on traffic
Challenges
• No ground truth of resource-accuracy
  • Hard to do traditional convex optimization
  • New ways to estimate accuracy on the fly
  • Adaptively increase/decrease resources accordingly
• Spatial & temporal changes
  • Task and traffic dynamics
  • Coordinate multiple switches to keep a task accurate
  • Spatial and temporal resource adaptation
Dynamic Resource Allocator
[Figure: controller tasks (Heavy Hitter detection, Change detection, ...) report estimated accuracy to the Dynamic Resource Allocator, which returns the allocated resource for each task]
• Decompose the resource allocator to each switch
  • Each switch separately increases/decreases resources
  • When and how to change resources?
Per-switch Resource Allocator: When?
[Figure: a Heavy Hitter detection task on switches A and B. Switch A: detected HH 5 out of 20, local accuracy = 25%. Switch B: detected HH 9 out of 10, local accuracy = 90%. Controller: detected HH 14 out of 30, global accuracy = 47%]
• When does a task on a switch need more resources?
  • Looking only at A’s local accuracy (25%) is not enough
    • If the bound is 40%, there is no need to increase A’s resources
  • Looking only at the global accuracy (47%) is not enough
    • If the bound is 80%, increasing B’s resources is not helpful
  • Conclusion: when max(local, global) < accuracy bound
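The rule on this slide is a one-line predicate. The sketch below replays the slide's numbers; the function name `needs_more_resources` is hypothetical.

```python
def needs_more_resources(local_accuracy, global_accuracy, bound):
    """A task on a switch needs more resources only if both its local
    accuracy on that switch and its global accuracy are below the bound."""
    return max(local_accuracy, global_accuracy) < bound


# Numbers from the slide: A detects 5 of 20, B detects 9 of 10, 14 of 30 overall.
local_a, local_b, global_acc = 5 / 20, 9 / 10, 14 / 30

# Bound 40%: the task already meets the bound globally, so A gets nothing more.
print(needs_more_resources(local_a, global_acc, 0.40))  # False
# Bound 80%: A needs more resources, B (already at 90% locally) does not.
print(needs_more_resources(local_a, global_acc, 0.80))  # True
print(needs_more_resources(local_b, global_acc, 0.80))  # False
```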
Per-Switch Resource Allocator: How?
• How to adapt resources?
  • Take from rich tasks, give to poor tasks
• How much resource to take/give?
  • Adaptive change step for fast convergence
  • Small steps close to bound, large steps otherwise
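One way to picture "take from rich, give to poor" with an adaptive step is sketched below. The step rule (linear in the distance from the bound, between `min_step` and `max_step` TCAM entries) and the function names are assumptions chosen for illustration; the talk does not specify the exact rule DREAM uses.

```python
def change_step(accuracy, bound, min_step=1, max_step=32):
    # Small step when accuracy is close to the bound, large step otherwise.
    gap = abs(bound - accuracy)  # both values are fractions in [0, 1]
    return int(min_step + gap * (max_step - min_step))


def rebalance(alloc, accuracy, bound):
    """Move TCAM entries on one switch from rich tasks to poor tasks.

    alloc maps task name to its current entry count; accuracy maps task
    name to its estimated accuracy. Returns a new allocation.
    """
    new = dict(alloc)
    # Rich tasks (at or above the bound) offer one step each, keeping >= 1 entry.
    spare = {
        task: max(0, min(change_step(accuracy[task], bound), alloc[task] - 1))
        for task in alloc
        if accuracy[task] >= bound
    }
    for task in alloc:
        if accuracy[task] >= bound:
            continue
        want = change_step(accuracy[task], bound)
        for donor in spare:
            give = min(want, spare[donor])
            new[donor] -= give
            new[task] += give
            spare[donor] -= give
            want -= give
            if want == 0:
                break
    return new
```

The total number of entries on the switch is preserved: a poor task only grows by what rich tasks give up.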
Task Implementation
[Figure: controller tasks (Heavy Hitter detection, Change detection, ...) exchange estimated accuracy and allocated resource with the Dynamic Resource Allocator. Steps: 1) (Re)Configure resources, 2) Fetch statistics]
Flow-based algorithms using TCAM
[Figure: prefix trie with counters. ***: 36; 0**: 26, 1**: 10; 00*: 12, 01*: 14, 10*: 5, 11*: 5; leaves 000 to 111: 5, 7, 12, 2, 0, 5, 2, 3. "Current" and "New" mark the monitored prefixes before and after an update]
• Goal: Maximize accuracy given limited resources
• A general resource-aware algorithm
  • Different tasks: e.g., HH, HHH, Change detection
  • Multiple switches: e.g., HHs from different switches
• Assume: Each flow is seen at one switch (e.g., at sources)
Divide & Merge at Multiple Switches
[Figure: 0** (26, seen at switches {A,B,C}) is divided into 00* (12, seen at {A,B}) and 01* (14, seen at {B,C}). Current: A:0**, B:0**, C:0**. New: A:00*, B:00*,01*, C:01*]
• Divide: Monitor children to increase accuracy
  • Requires more resources on a set of switches
  • Example: Needs an additional entry on switch B
• Merge: Monitor parent to free resources
  • Each node keeps the switch set it frees after merge
  • Finding the least important prefixes to merge is the minimum set cover problem
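The divide and merge steps on the slide's example can be sketched as operations on per-switch rule sets. The function names and the data layout (a dict from switch name to a set of prefix strings) are assumptions; the sketch only tracks which prefixes each switch monitors, not the counters or the choice of which prefix to merge.

```python
def divide(rules, prefix, child_switches):
    """Replace `prefix` with its children on the switches that see them.

    rules maps switch name to the set of prefixes it monitors.
    child_switches maps each child prefix to the switches seeing its traffic.
    """
    for prefixes in rules.values():
        prefixes.discard(prefix)
    for child, switches in child_switches.items():
        for switch in switches:
            rules[switch].add(child)


def merge(rules, parent, children):
    """Replace `children` with `parent`; return switches that free an entry."""
    freed = set()
    for switch, prefixes in rules.items():
        held = prefixes & set(children)
        if not held:
            continue
        prefixes -= held  # in-place update of the switch's rule set
        prefixes.add(parent)
        if len(held) > 1:  # several child entries collapse into one parent
            freed.add(switch)
    return freed


# The slide's example: 0** is seen at A, B, C; 00* at {A, B}; 01* at {B, C}.
rules = {"A": {"0**"}, "B": {"0**"}, "C": {"0**"}}
divide(rules, "0**", {"00*": {"A", "B"}, "01*": {"B", "C"}})
# Only switch B now needs two entries; merging back frees one entry on B.
```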
Accuracy Estimation: Heavy Hitter Detection
[Figure: prefix trie with counters, threshold = 10. ***: 76; 0**: 26, 1**: 50; 00*: 12, 01*: 14, 10*: 15, 11*: 35; leaves 000 to 111: 5, 7, 12, 2, 0, 15, 20, 15. Annotations: "At level 2 missed <= 2 HHs"; "With size 26 missed <= 2 HHs"]
• Any monitored leaf with volume > threshold is a true HH
• Recall:
  • Estimate missing HHs using volume and level of counter
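A rough sketch of the idea: monitored leaves above the threshold count as detected HHs, and each monitored internal prefix contributes an upper bound on the HHs hidden beneath it, taken as the smaller of its volume divided by the threshold and the number of leaves it covers. This is a simplified reading of the two annotations in the figure, not DREAM's exact estimator; the function name and the bit-string encoding of prefixes are assumptions.

```python
def estimate_recall(counters, threshold, address_bits=3):
    """Estimate heavy-hitter recall from the monitored counters alone.

    counters maps a monitored prefix, written as its fixed bits (for
    example "0" for 0** and "110" for the leaf 110), to its volume.
    """
    detected = 0
    missed_bound = 0
    for prefix, volume in counters.items():
        wildcard_bits = address_bits - len(prefix)
        if volume <= threshold:
            continue  # cannot contain a heavy hitter
        if wildcard_bits == 0:
            detected += 1  # a monitored leaf above threshold is a true HH
        else:
            # Volume bound: each hidden HH needs more than `threshold` bytes.
            # Level bound: a prefix with k wildcard bits covers 2**k leaves.
            missed_bound += min(volume // threshold, 2 ** wildcard_bits)
    total = detected + missed_bound
    return detected / total if total else 1.0


# Monitoring 0**, 10*, 110, 111 from the figure with threshold 10.
print(estimate_recall({"0": 26, "10": 15, "110": 20, "111": 15}, 10))  # 0.4
```

With the figure's leaf volumes the true recall for this monitoring set is 2 of 4, so the estimate of 0.4 errs on the conservative side because it uses upper bounds on the missed HHs.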
DREAM Overview
• Task specification
  • Task type (Heavy hitter, Hierarchical heavy hitter, Change detection)
  • Task specific parameters (HH threshold)
  • Packet header field (source IP)
  • Filter (srcIP=10/24, dstIP=10.2/16)
  • Accuracy bound (80%)
• Workflow: 1) Instantiate task, 2) Accept/Reject, 3) Configure counters, 4) Fetch counters, 5) Report, 6) Estimate accuracy, 7) Allocate / Drop
[Figure: DREAM on the SDN controller, with task objects 1 to n and a Resource Allocator]
• Prototype implementation with DREAM algorithms on Floodlight and Open vSwitches
Evaluation
• Evaluation goals
  • How accurate are tasks in DREAM?
    • Satisfaction: Fraction of a task's lifetime above the given accuracy
  • How many more accurate tasks can DREAM support?
    • % of rejected/dropped tasks
  • How fast is the DREAM control loop?
• Compare to
  • Equal: Divide resources equally at each switch, no reject
  • Fixed: 1/n resources to each task, reject extra tasks
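The satisfaction metric can be computed directly from a task's per-epoch accuracy. The sketch assumes accuracy is sampled once per measurement epoch over the task's lifetime; the function name and the sample values are hypothetical.

```python
def satisfaction(accuracy_per_epoch, bound):
    """Fraction of a task's lifetime (in epochs) with accuracy >= bound."""
    if not accuracy_per_epoch:
        return 0.0
    satisfied = sum(1 for acc in accuracy_per_epoch if acc >= bound)
    return satisfied / len(accuracy_per_epoch)


# A task that meets an 80% bound in 3 of its 4 epochs.
print(satisfaction([0.90, 0.85, 0.70, 0.95], 0.80))  # 0.75
```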
Prototype Results
[Figure: satisfaction (mean and 5th percentile) and rejection for 256 tasks (various task types) on 8 switches]
• DREAM: High satisfaction for the mean & 5th percentile of tasks, with low rejection
• Equal: Only keeps small tasks satisfied
• Fixed: High rejection, as it over-provisions for small tasks
Prototype Results
• DREAM: High satisfaction for the mean & 5th percentile of tasks, at the expense of more rejection
• Equal & Fixed: Only keep small tasks satisfied
Control Loop Delay
• Allocation delay is negligible vs. other delays
• Incremental saving reduces the save delay
DREAM Conclusion
• Challenges with software-defined measurement
  • Diverse and dynamic measurement tasks
  • Limited resources at switches
• Dynamic resource allocation across tasks
  • Accuracy estimators for TCAM-based algorithms
  • Spatial and temporal resource multiplexing
Summary
• Software-defined measurement
  • Measurement is important, yet underexplored
  • SDN brings new opportunities to measurement
  • Time to rebuild the entire measurement stack
• Our work
  • OpenSketch: Generic, efficient measurement on sketches
  • DREAM: Dynamic resource allocation for many tasks