Multi-Level Architecture for Data Plane Virtualization Eric Keller Oral General Exam 5/5/08
The Internet (and IP) • Usage of the Internet is continuously evolving • The way packets are forwarded hasn't (IP) • Meant for communication between machines • Address tied to a fixed location • Hierarchical addressing • Best-effort delivery • Addresses easy to spoof • Great innovation at the edge (Skype/VoIP, BitTorrent) • Programmability of hosts at the application layer • Can't add new functionality into the network
Proposed Modifications • Many proposals to modify some aspect of IP • No single one is best • Difficult to deploy • Publish/subscribe mechanisms for objects: instead of routing on a machine address, route on an object ID, e.g. DONA (Data-Oriented Network Architecture) • Routing through intermediary points: instead of communication between machines, e.g. i3 (Internet Indirection Infrastructure), DOA (Delegation-Oriented Architecture) • Flat addressing to separate location from ID: instead of hierarchical addressing based on location, e.g. ROFL (Routing On Flat Labels), SEIZE (Scalable and Efficient, Zero-configuration Enterprise)
Challenges • Want to innovate in the network • Can't, because networks are closed • Need to lower the barrier for who can innovate: allow individuals to create a network and define its functionality • Virtualization as a possible solution, for both the network of the future and overlay networks • Programmable and sharable • Examples: PlanetLab, VINI
Network Virtualization • Running multiple virtual networks at the same time over a shared physical infrastructure • Each virtual network composed of virtual routers having custom functionality (Figure: physical machines each host virtual routers; a virtual network is, e.g., the blue virtual routers plus the blue links)
Virtual Network Tradeoffs • Goal: Enable custom data planes per virtual network • Challenge: How to create the shared network nodes • Programmability: How easy is it to add new functionality? What is the range of new functionality that can be added? Does it extend beyond "software routers"? • Isolation: Does resource usage by one virtual network have an effect on the others? What about faults? How secure is it given a shared substrate? • Performance: How much overhead is there for sharing? What is the forwarding rate? Throughput? Latency?
Virtual Network Tradeoffs • Network containers: duplicate the stack or data structures, e.g. Trellis, OpenVZ, Logical Router • Extensible routers: assemble custom routers from common functions, e.g. Click, Router Plugins, Scout • Virtual machines + Click: run one operating system on top of another, e.g. Xen, PL-VINI (Linux-VServer) • Each approach strikes a different balance among programmability, isolation, and performance
Outline • Architecture • Implementation • Virtualizing Click in the kernel • Challenges with kernel execution • Extending beyond commodity hardware • Evaluation • Conclusion/Future Work
User Experience (Creating a virtual network) • Custom functionality • Custom user environment on each node (for controlling the virtual router) • Specify a single node's packet handling as a graph of common functions • Isolated from others sharing the same node • Allocated a share of resources (e.g. CPU, memory, bandwidth) • Protected from faults in others (e.g. another virtual router crashing) • Highest performance possible (Example graph: a user control environment determines shortest paths and populates routing tables through a config/query interface; packet-handling elements A1–A5 check headers and perform destination lookups between the input and output devices)
Lightweight Virtualization • Combine the per-network graphs into a single master graph • Provides lightweight virtualization • Add extra packet processing (e.g. mux/demux), needed to direct packets to the correct graph • Add resource accounting (Figure: Graph 1 and Graph 2 are combined into one master graph, with a demux after the input port and a mux before the output port; see the sketch below)
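To make the mux/demux idea concrete, here is a minimal C++ sketch of a master graph dispatching each packet to its virtual network's subgraph and charging the work to that network's account. This is an illustration only, not the system's code: Packet, Subgraph, and MasterGraph are hypothetical names, and the accounting unit is a placeholder.

```cpp
// Hypothetical sketch of the master-graph demux: each packet carries a
// virtual-network ID, and the master graph dispatches it to that
// network's subgraph while charging the work to that network's account.
#include <cstdint>
#include <functional>
#include <vector>

struct Packet { uint16_t vnet_id; /* headers, payload, ... */ };

using Subgraph = std::function<void(Packet&)>;  // entry point of one network's graph

struct MasterGraph {
    std::vector<Subgraph> subgraphs;   // one per virtual network, sized at setup
    std::vector<uint64_t> work_done;   // per-network resource accounting

    void input(Packet& p) {
        if (p.vnet_id >= subgraphs.size()) return;  // demux: drop unknown networks
        subgraphs[p.vnet_id](p);                    // run that network's graph
        work_done[p.vnet_id] += 1;                  // placeholder accounting unit
    }
};
```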
Increasing Performance and Isolation • Partition the master graph into multiple graphs across multiple targets • Each target has different capabilities: performance, programmability, isolation • Add connectivity between targets • Unified run-time interface to query and configure the forwarding capabilities (it appears as a single graph) (Figure: Graph 1 and Graph 2 are combined into a master graph, which is then partitioned into per-target graphs on Target0, Target1, and Target2)
Examples of Multi-Level • Fast Path/Slow Path • IPv4: forwarding in fast path, exceptions in slow path • i3: Chord ring lookup function in fast path, handling requests in slow path • Preprocessing • IPSec – do encryption/decryption in HW, rest in SW • Offloading • TCP Offload • TCP Splicing • Pipeline of coarse grain services • e.g. transcoding, firewall • SoftRouter from Bell Labs
Outline • Architecture • Implementation • Virtualizing Click in the kernel • Challenges with kernel execution • Extending beyond commodity hardware • Evaluation • Conclusion/Future Work
Implementation • Each network has custom functionality • Specified as graph of common functions • Click modular router • Each network allocated share of resources • e.g. CPU • Linux-VServer – single resource accounting for both control and packet processing • Each network protected from faults in others • Library of elements considered safe • Container for unsafe elements • Highest performance possible • FPGA for modules with HW option, Kernel for modules without
Click Background: Overview • Software architecture for building flexible and configurable routers • Widely used, commercially and in research • Easy to use, flexible, high performance (but not sharable) • Routers assembled from packet-processing modules (elements), both simple and complex • Processing is a directed graph • Includes a scheduler that schedules tasks (a series of elements) • Example configuration: FromDevice(eth0) -> Counter -> Discard
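For reference, a Click element is a small C++ class. The sketch below is modeled on Click's element API (class_name, port_count, simple_action, and the CLICK_DECLS/EXPORT_ELEMENT macros are Click's real hooks), but it is abbreviated and untested, so treat it as illustrative rather than as code from the Click tree.

```cpp
// Minimal Click element sketch, modeled on Click's C++ element API.
// A real element lives in the Click source tree and is compiled into
// the router; this is an abbreviated illustration.
#include <click/config.h>
#include <click/element.hh>
CLICK_DECLS

class CountAndPass : public Element {
public:
    const char *class_name() const { return "CountAndPass"; }
    const char *port_count() const { return PORTS_1_1; }  // one input, one output

    // Called for each packet; returning it passes it to the next element.
    Packet *simple_action(Packet *p) {
        _count++;
        return p;
    }

private:
    uint64_t _count = 0;
};

CLICK_ENDDECLS
EXPORT_ELEMENT(CountAndPass)
```

Such an element could then appear in a configuration like FromDevice(eth0) -> CountAndPass -> Discard.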
Linux-VServer + Click + NetFPGA (Figure: per-virtual-network click processes run in user-space containers; a coordinating process drives per-target installers, which load configurations into user-space Click, kernel-mode Click, and Click on the NetFPGA)
Outline • Architecture • Implementation • Virtualizing Click in the kernel • Challenges with kernel execution • Extending beyond commodity hardware • Evaluation • Conclusion/Future Work
Virtual Kernel Mode Click • Want to run in kernel mode: close to 10x higher performance than user mode • Use a library of 'safe' elements, since the kernel is a shared execution space • Need resource accounting: Click's scheduler does not do resource accounting, and we want resource accounting system-wide (i.e. not just inside of packet processing)
Resource Accounting with VServer • Purpose of Resource Accounting • Provides isolation between virtual networks • Unified resource accounting • For packet processing and control • VServer’s Token Bucket Extension to Linux Scheduler • Controls eligibility of processes/threads to run • Integrating with Click • Each individual Click configuration assigned to its own thread • Each thread associated with VServer context • Basic mechanism is to manipulate the task_struct
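A minimal sketch of the token-bucket mechanism described above, using the parameters from the Unyielding Threads slide (tokens added at rate A, bucket size S, minimum M to execute, one token consumed per scheduler tick). This illustrates the idea; it is not VServer's actual implementation.

```cpp
// Illustrative token bucket in the style of VServer's scheduler
// extension: a context may run only while it holds enough tokens.
// Not VServer's actual code.
#include <algorithm>
#include <cstdint>

struct TokenBucket {
    int64_t tokens = 0;
    int64_t rate;         // A: tokens added per refill interval
    int64_t bucket_size;  // S: cap on accumulated tokens
    int64_t min_tokens;   // M: tokens required to become eligible

    void refill()         { tokens = std::min(tokens + rate, bucket_size); }
    bool eligible() const { return tokens >= min_tokens; }

    // Each scheduler tick of execution consumes one token. The bucket
    // never goes negative, which is why an unyielding task can exceed
    // its share before the scheduler reacts (see the next section).
    void tick()           { if (tokens > 0) tokens--; }
};
```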
Outline • Architecture • Implementation • Virtualizing Click in the kernel • Challenges with kernel execution • Extending beyond commodity hardware • Evaluation • Conclusion/Future Work
Unyielding Threads • Linux kernel threads are cooperative (i.e. they must yield) • The token scheduler only controls when a thread is eligible to start • A single long task can cause short-term disruptions, affecting delay and jitter on other virtual networks • The token bucket does not go negative, so over the long term a virtual network can get more than its share (Token bucket parameters: tokens added at rate A; bucket size S; minimum tokens to execute M; one token consumed per scheduler tick)
Unyielding Threads (solution) • Determine the maximum allowable execution time, e.g. from the token bucket parameters and network guarantees • Determine the pipeline's execution time: elements from the library have known execution times; custom elements' execution times are unknown • Break the pipeline up (for known elements) • Execute inside a container (for unknown elements) (Figure: a pipeline elem1 -> elem2 -> elem3 can run as one task, be broken into separately scheduled pieces, or have an element moved across a ToUser/FromKernel boundary; a sketch of the splitting step follows below)
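A sketch of the splitting step under stated assumptions: per-element cycle costs come from the element library, the budget comes from the token-bucket parameters, and Elem and split_pipeline are hypothetical names, not the thesis code.

```cpp
// Hypothetical sketch of pipeline splitting: insert a yield point
// before any segment would exceed the per-task budget; elements with
// unknown cost are cut out to run in a user-space container instead.
#include <cstdint>
#include <cstdio>
#include <vector>

struct Elem { const char *name; uint64_t cycles; bool known; };

// Returns the indices at which the pipeline is cut (yield or container crossing).
std::vector<size_t> split_pipeline(const std::vector<Elem> &pipe, uint64_t budget) {
    std::vector<size_t> cuts;
    uint64_t acc = 0;
    for (size_t i = 0; i < pipe.size(); ++i) {
        if (!pipe[i].known) {            // unknown cost: isolate in a container
            cuts.push_back(i);
            acc = 0;
            continue;
        }
        if (acc + pipe[i].cycles > budget && acc > 0) {
            cuts.push_back(i);           // start a new kernel segment here
            acc = 0;
        }
        acc += pipe[i].cycles;
    }
    return cuts;
}

int main() {
    // Costs loosely follow the profiling slide (~cycles per packet).
    std::vector<Elem> pipe = {{"CheckLength", 400, true},
                              {"RadixIPLookup", 1000, true},
                              {"Custom", 0, false},
                              {"Counter", 700, true}};
    for (size_t c : split_pipeline(pipe, 1200))
        std::printf("cut before element %zu (%s)\n", c, pipe[c].name);
}
```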
Custom Elements Written in C++ • Elements have access to global state • Kernel state/functions • Click global state • Could… • Pre-compile in user mode • Pre-compile with restricted header files • Not perfect: • With C++, you can manipulate pointers • Instead, custom elements are unknown (“unsafe”) • Execute in container in user space
Outline • Architecture • Implementation • Virtualizing Click in the kernel • Challenges with kernel execution • Extending beyond commodity hardware • Evaluation • Conclusion/Future Work
Extending beyond commodity HW • PC + programmable NIC (e.g. NetFPGA): FPGA on a PCI card, 4 GigE ports, on-board SRAM and DRAM • Jon Turner's "pool of processing elements" with crossbar: PEs can be GPP, NPU, or FPGA; the switch fabric is a crossbar (figure: line cards LC1…LCn and processing engines PE1…PEm connected by a switch fabric) • Partition between FPGA and software; generalize: partition among PEs
FPGA Click • Two previous approaches: Cliff (compiles a Click graph to Verilog, with a standard interface on modules) and CUSP (optimizes the Click graph by parallelizing internal statements) • Our approach: build on Cliff by integrating FPGAs into Click (the tool) • Software analogies: connection to the outside environment, packet transfer, element specification and implementation, run-time querying and configuration, memory, notifiers, annotations • Example: FromDevice(eth0) -> Element(LEN 5) -> ToDevice(eth0)
Outline • Architecture • Implementation • Virtualizing Click in the kernel • Challenges with kernel execution • Extending beyond commodity hardware • Evaluation • Conclusion/Future Work
Experimental Evaluation • Is multi-level the right approach? (i.e. is it worth the effort to support kernel and FPGA?) • Does programmability imply less performance? • What is the overhead of virtualization? From the container, when you need to go to user space; from using multiple threads, when running in the kernel • Are the virtual networks isolated in terms of resource usage? • What is the maximum short-term disruption from unyielding threads? • How long can a task run without leading to long-term unfairness?
Setup • Nodes n0–n3 around a router under test (rtr), all PC3000s on Emulab (3 GHz, 2 GB RAM) • n0 generates packets addressed from n0 to n1, tagged with the send time • rtr is the router under test (Linux or a Click config); it rewrites the IP and Ethernet headers so packets travel from n1 to n2 • The receiver diffs the current time against each packet's timestamp (and stores the average in memory)
Is multi-level the right approach? • There is a performance benefit going from user mode to kernel mode, and from kernel to FPGA • Does programmability imply less performance? • No: we do not sacrifice performance by introducing programmability
What is the overhead of virtualization? (from the container) • When you must go to user space, what is the cost of executing in a container? • The overhead of executing in a VServer is minimal
What is the overhead of virtualization? (from using multiple threads) • Put the same Click graph in each thread and round-robin traffic between them (Setup: PollDevice -> RoundRobin -> per-thread 4portRouter compound elements -> ToDevice; each thread runs X tasks per yield)
How long to run before yielding • # tasks per yield: • Low => high context switching, I/O executes often • High => low context switching, I/O executes infrequently
What is the overhead of virtualization? (from using multiple threads) • Given the sweet spot for each number of virtual networks, increasing the number of virtual networks from 1 to 10 does not hurt aggregate performance significantly • Alternatives to consider: single-threaded with VServer; single-threaded with Click modified to do resource accounting; integrating polling into the threads
What is the maximum short-term disruption from unyielding threads? • Profile of (some) elements • Standard N-port router example: ~5400 cycles (1.8 us) • RadixIPLookup (167k entries): ~1000 cycles • Simple elements: CheckLength ~400 cycles, Counter ~700 cycles, HashSwitch ~450 cycles • The maximum disruption is the length of the longest task • Possible to break up pipelines (Measurement pipeline: InfiniteSource -> RoundTripCycleCount -> element under test -> Discard, with NoFree; a cycle-counting sketch follows below)
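Cycle counts like those above can be gathered with a TSC-based harness in the spirit of the measurement pipeline; a minimal sketch, assuming an x86 machine and GCC/Clang's __rdtsc intrinsic (cycles_per_packet is a hypothetical helper, not the thesis harness):

```cpp
// Minimal TSC-based measurement sketch mirroring the
// InfiniteSource -> RoundTripCycleCount -> element -> Discard harness.
// Assumes x86 and the __rdtsc intrinsic (GCC/Clang); averages over many
// packets to smooth out per-read TSC noise.
#include <cstdint>
#include <x86intrin.h>

template <typename ElementFn, typename PacketT>
uint64_t cycles_per_packet(ElementFn &&elem, PacketT &pkt, int iters = 100000) {
    uint64_t start = __rdtsc();
    for (int i = 0; i < iters; ++i)
        elem(pkt);                      // push one packet through the element
    uint64_t end = __rdtsc();
    return (end - start) / iters;       // average cycles per packet
}
```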
How long can a task run without leading to long-term unfairness? (Setup: two pipelines of InfiniteSource -> 4portRouter (compound element) -> Discard; one also contains a cycle-burning "Chewy" element and is limited to 15%; the cycles consumed by each are counted)
How long can a task run without leading to long-term unfairness? • Tasks longer than 1 token can lead to unfairness • Run long-executing elements in user space, where the performance overhead (~10k extra cycles per task) is not as big of an issue
Outline • Architecture • Implementation • Virtualizing Click in the kernel • Challenges with kernel execution • Extending beyond commodity hardware • Evaluation • Conclusion/Future Work
Conclusion • Goal: Enable custom data planes per virtual network • Tradeoffs • Performance • Isolation • Programmability • Built a multi-level version of Click • FPGA • Kernel • Container
Future Work • Scheduler • Investigate alternatives to improve efficiency • Safety • Process to certify element as safe (can it be automated?) • Applications • Deploy on VINI testbed • Virtual router migration • HW/SW Codesign Problem • Partition decision making • Specification of elements (G language)
Multi-Level Click: Questions?
Signs of Openness • There are signs that network owners and equipment providers are opening up • Peer-to-peer and network provider collaboration • Allowing intelligent selection of peers • e.g. Pando/Verizon (P4P), BitTorrent/Comcast • Router Vendor API • allowing creation of software to run on routers • e.g. Juniper PSDP, Cisco AXP • Cheap and easy access to compute power • Define functionality and communication between machines • e.g. Amazon EC2, Sun Grid
Example 1: User/Kernel Partition • Execute "unsafe" elements in a container • Add communication elements (Figure: safe elements s1, s2, s3 run in the kernel; the unsafe element u1 runs in a user-space container; the crossings use ToUser (tu), FromKernel (fk), ToKernel (tk), and FromUser (fu) elements; a structural sketch follows below)
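A structural sketch of the split, under stated assumptions: plain queues stand in for the tu/fk/tk/fu crossing elements (a real system would use a kernel/user channel such as a character device), the placement of u1 between s2 and s3 is illustrative, and all names are hypothetical.

```cpp
// Hypothetical sketch of the user/kernel split: safe elements run in
// the kernel, the unsafe element u1 runs in a user-space container, and
// the safe suffix resumes in the kernel. Plain queues stand in for the
// ToUser/FromKernel/ToKernel/FromUser crossing elements.
#include <queue>

struct Packet { /* headers, payload, ... */ };

std::queue<Packet> to_user;    // ToUser (kernel) -> FromKernel (container)
std::queue<Packet> to_kernel;  // ToKernel (container) -> FromUser (kernel)

void kernel_ingress(Packet p) {
    /* s1(p); s2(p); */                 // safe prefix runs in the kernel
    to_user.push(p);                    // tu: hand off to the container
}

void container_loop() {
    while (!to_user.empty()) {          // fk: receive from the kernel
        Packet p = to_user.front(); to_user.pop();
        /* u1(p); */                    // unsafe element, isolated here
        to_kernel.push(p);              // tk: hand back to the kernel
    }
}

void kernel_egress() {
    while (!to_kernel.empty()) {        // fu: resume the safe suffix
        Packet p = to_kernel.front(); to_kernel.pop();
        /* s3(p); emit downstream */
    }
}
```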
Example 2: Non-Commodity HW • PC + programmable NIC (e.g. NetFPGA): FPGA on a PCI card, 4 GigE ports, on-board SRAM and DRAM • Jon Turner's "pool of processing elements" with crossbar: PEs can be GPP, NPU, or FPGA; the switch fabric is a crossbar (figure: line cards LC1…LCn and processing engines PE1…PEm connected by a switch fabric) • Partition between FPGA and software; generalize: partition among PEs