Framework For Supporting Multi-Service Edge Packet Processing On Network Processors
Arun Raghunath, Aaron Kunze, Erik J. Johnson (Intel Research and Development)
Vinod Balakrishnan (Openwave Systems Inc.)
ANCS 2005
Overview: Problem — Workload Variations
[Figure: average HTTP traffic rate (http_data) over time, measured at the network edge in front of a group of Internet clients over 5 days. Source: "A Case for Run-time Adaptation in Packet Processing Systems", R. Kokku et al., HotNets-II / ACM SIGCOMM CCR, vol. 34, issue 1, January 2004; trace from http://ita.ee.lbl.gov/html/contrib/UCB.home-IP-HTTP.html]
There is no representative workload!
Overview: Problem
• Edge routers need to support large sets of sophisticated services
• How best to use the numerous hardware resources that network processors provide: cores, multiple memory levels, inter-core queuing, crypto assists
• Workloads fluctuate over time, and there is no representative workload
• Systems are therefore usually over-provisioned to handle the worst case
• Run-time adaptation: the ability to change the mapping of services to hardware resources
Overview: Adaptation Opportunities
[Figure: diagrams of a network processor with sixteen MEv2 microengines plus the Intel XScale® core, with services mapped to groups of microengines.]
• Ex. 1: Change the allocation to increase an individual service's performance (e.g., shift microengines between the IPv6 and IPv4 compression-and-forwarding pipelines)
• Ex. 2: Support a large set of services in the "fast path", according to use (e.g., add VPN encrypt/decrypt alongside IPv4/IPv6 compression and forwarding)
• Ex. 3: Power down unneeded processors
Overview: Theory of Operation
[Figure: run-time system architecture. Executable binaries (Intel XScale® core and microengine) for services A, B, and C feed a Linker, which binds resources through the Resource Abstraction Layer (RAL). A System Monitor observes queue info and the traffic mix; the run-time system checkpoints processors and installs a new resource mapping.]
Monitoring: Rate-Based Monitoring
• Observe the queue between two pipeline stages; its arrival and departure rates are indicative of processing needs
• Rarr = current arrival rate; Rdep = current departure rate; Rworst = worst-case arrival rate; tsw = time to switch on a core; Qsize = queue capacity
• Assumption: Rdep scales linearly with cores, so for a stage running on n cores, Rdep = n × Rdep1, where Rdep1 is the single-core departure rate
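To make this concrete, here is a minimal sketch in C of how such a monitor might derive Rarr and Rdep from queue counters sampled at a fixed interval. The structure and names (queue_counters, monitor_sample, and so on) are illustrative assumptions, not the framework's actual interface.

```c
#include <stdint.h>

/* Hypothetical counters assumed to be exported by the queue that
 * sits between two pipeline stages. */
struct queue_counters {
    uint64_t enqueued;   /* total packets enqueued so far */
    uint64_t dequeued;   /* total packets dequeued so far */
};

struct rate_monitor {
    struct queue_counters last;  /* snapshot from the previous interval */
    double r_arr;                /* current arrival rate, Rarr (pkts/s) */
    double r_dep;                /* current departure rate, Rdep (pkts/s) */
};

/* Sample the counters once per monitoring interval and derive rates. */
static void monitor_sample(struct rate_monitor *m,
                           const struct queue_counters *now,
                           double interval_s)
{
    m->r_arr = (double)(now->enqueued - m->last.enqueued) / interval_s;
    m->r_dep = (double)(now->dequeued - m->last.dequeued) / interval_s;
    m->last = *now;
}

/* Linear-scaling assumption: a stage on n cores departs at n * Rdep1,
 * where Rdep1 is the measured single-core departure rate. */
static double expected_departure_rate(int n_cores, double r_dep1)
{
    return n_cores * r_dep1;
}
```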
Policy: Allocation Policy
• Number of cores needed: NumCores(Rarr) = ⌈Rarr / Rdep1⌉
• If Rarr = Rworst, the system moves directly to the worst-case provisioned state
• Only request cores as needed
• Qadapt: queue depth threshold that still leaves enough buffer space to handle the worst burst
• If Rarr >> Rdep, request allocation of processors immediately; how many is a function of Rarr / Rdep1
• If Rarr is only slightly larger than Rdep, let the queue grow to Qadapt, then request allocation of one processor
Policy: De-allocation Policy
• While increasing the allocation, latch Rdep1
• If Rarr / Rdep1 < current allocation, request de-allocation of one core
• Hysteresis: wait some number of cycles before requesting de-allocation again, which avoids fluctuations on transient dips in the arrival rate
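A minimal sketch in C combining the allocation and de-allocation policies above. Qadapt, the burst threshold, the hysteresis length, and the allocator interface (request_cores, release_core) are all hypothetical names and values chosen for illustration, not the framework's real API.

```c
#include <math.h>
#include <stdint.h>

/* Assumed allocator interface (illustrative, not the paper's API). */
extern void request_cores(int n);
extern void release_core(void);

/* Illustrative policy state for one pipeline stage. */
struct stage_policy {
    double r_dep1;    /* latched single-core departure rate, Rdep1 */
    int    cores;     /* cores currently allocated to this stage */
    int    cooldown;  /* hysteresis: intervals left before another
                         de-allocation request is allowed */
};

#define Q_ADAPT        4096  /* hypothetical adaptation threshold, Qadapt */
#define BURST_FACTOR   2.0   /* hypothetical meaning of "Rarr >> Rdep" */
#define COOLDOWN_IVLS  10    /* hypothetical hysteresis length */

/* Run once per monitoring interval with the latest measurements. */
static void policy_step(struct stage_policy *p, double r_arr,
                        double r_dep, uint32_t q_depth)
{
    int needed = (int)ceil(r_arr / p->r_dep1);   /* NumCores(Rarr) */

    if (r_arr > r_dep * BURST_FACTOR && needed > p->cores) {
        /* Arrival far exceeds departure: jump straight to the computed
         * allocation instead of adding one core at a time. */
        request_cores(needed - p->cores);
        p->cores = needed;
    } else if (q_depth >= Q_ADAPT) {
        /* Arrival only slightly higher: the queue grew to Qadapt, so
         * request a single extra core. */
        request_cores(1);
        p->cores++;
    } else if (needed < p->cores && p->cooldown == 0) {
        /* De-allocation with hysteresis: release one core, then wait
         * before releasing another, to ride out transient dips. */
        release_core();
        p->cores--;
        p->cooldown = COOLDOWN_IVLS;
    }

    if (p->cooldown > 0)
        p->cooldown--;
}
```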
Overview: Theory of Operation (with Resource Allocator)
[Figure: the same run-time system, now including the Resource Allocator. The System Monitor raises triggers to the Resource Allocator based on queue info and the traffic mix; the allocator decides a new resource mapping, and the Linker re-binds the service binaries through the RAL.]
Resource Allocation: Resource Allocator
• Handles allocation/de-allocation requests from individual stages
• Aware of the global system state; decides:
  • which specific processor to allocate or free
  • whether to de-allocate or migrate a stage when no free processors are available (steal from a stage only when its arrival rate is lower than the requesting stage's)
  • whether the request is declined
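A sketch, under assumed data structures, of the global decision described above: grant a free processor if one exists, otherwise steal only from a stage whose arrival rate is lower than the requester's, otherwise decline. The structures and helper names are invented for illustration.

```c
#include <stddef.h>

/* Illustrative per-stage state visible to the global allocator. */
struct stage {
    double r_arr;   /* current arrival rate of this stage */
    int    cores;   /* cores currently held by this stage */
};

enum alloc_result { GRANTED, GRANTED_BY_STEAL, DECLINED };

/* Global decision: grant a free processor if one exists; otherwise
 * steal from a stage whose arrival rate is lower than the requester's
 * (de-allocating or migrating its work); otherwise decline. */
static enum alloc_result allocate_core(struct stage *stages, int n,
                                       int free_cores,
                                       struct stage *requester)
{
    if (free_cores > 0)
        return GRANTED;

    struct stage *victim = NULL;
    for (int i = 0; i < n; i++) {
        struct stage *s = &stages[i];
        if (s == requester || s->cores <= 1)
            continue;   /* never steal a stage's last core */
        if (s->r_arr < requester->r_arr &&
            (victim == NULL || s->r_arr < victim->r_arr))
            victim = s; /* prefer the least-loaded eligible victim */
    }
    if (victim != NULL) {
        victim->cores--;        /* victim's stage is shrunk or migrated */
        requester->cores++;
        return GRANTED_BY_STEAL;
    }
    return DECLINED;
}
```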
Overview: Theory of Operation (complete)
[Figure: the complete run-time system. The System Monitor feeds queue info and the traffic mix, via triggers, to the Resource Allocator, which performs a system evaluation and emits a new mapping; the Linker then binds the executable binaries (XScale and ME) through the Resource Abstraction Layer (RAL).]
Results: Experimental Setup
• Radisys, Inc. ENP-2611 board
• 600 MHz Intel® IXP2400 processor
• MontaVista Linux
• 3 optical Gigabit Ethernet ports
• IXIA traffic generator for packet stimulus
Results: Adaptation Costs
• Overhead of function calls into the resource abstraction layer: 14% performance degradation when processing minimum-size packets at line rate
• Overall adaptation time = binding time + (checkpointing-and-loading time × number of cores)
• Cumulative effect: ~100 ms, dominated by the cost of the binding mechanism
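For illustration only, with hypothetical per-step numbers (not reported in the deck) chosen to be consistent with the ~100 ms total and the observation that binding dominates:

$$T_{\text{adapt}} = T_{\text{bind}} + T_{\text{ckpt+load}} \cdot N_{\text{cores}} \approx 80\,\text{ms} + 5\,\text{ms} \times 4 = 100\,\text{ms}$$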
Results: Adaptation Benefits — Testing Methodology
• Need to measure the system's ability to handle long-term workload variations
• Systems compared:
  • Static system (profile-driven compilation)
  • Adaptive system
Results: Adaptation Benefits — Testing Methodology (continued)
[Figure: the Layer 3 switching application used for evaluation: Rx → L2 classifier → L2 bridges / L3 forwarders → Ethernet encapsulation → Tx, mapped onto the MEv2 microengines and the Intel XScale® core. A profile compiler turns the application plus a traffic profile into a static binary; both systems are then driven with test traffic and their performance compared.]
Results: Benefits of Run-Time Adaptation
[Figure: performance of the static and adaptive systems across traffic mixes ranging from (0%, 100%) through (50%, 50%) to (100%, 0%). Source: Intel.]
Conclusion: Future Work
• Study the ability of an adaptive system to handle short-term fluctuations: would it drop more packets than a non-adaptive system?
• Enable flow-aware run-time adaptation
• Explore more sophisticated resource-allocation algorithms that support properties like fairness and performance guarantees
Conclusion: Related Work
• Ease of programming
  • NP-Click: N. Shah et al., NP-2 Workshop, 2003
  • Nova: L. George and M. Blume, ACM SIGPLAN 2003
  • Auto-partitioning programming model: Intel whitepaper, 2003
• Dynamic extensibility
  • Router Plugins: D. Decasper et al., SIGCOMM 1998
  • PromethOS: R. Keller et al., IWAN 2002
  • VERA: S. Karlin and L. Peterson, Computer Networks, 2002
  • NetBind: M. Kounavis et al., Software: Practice and Experience, 2004
• Load balancing
  • ShaRE: R. Kokku, Ph.D. thesis, UT Austin, 2005
Conclusion
• Run-time adaptation is an attractive approach for handling traffic fluctuations
• Implemented a framework capable of adapting the processing cores allocated to network services
• Implemented a policy that:
  • automatically balances the service pipeline
  • overcomes the code-store limitation of fixed-control-store processor cores
Mechanisms: Checkpointing — Leveraging Domain Characteristics
• Finding the best checkpoint is easier in packet processing than in general domains
• Characteristics of data-flow applications:
  • Typically implemented as a dispatch loop
  • The dispatch loop executes at high frequency
  • The top of the dispatch loop has no live stack state
• Since the compiler creates the dispatch loop, the compiler can insert checkpoints into the code
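A minimal sketch of the dispatch-loop structure the slide describes, with the checkpoint test at the top of the loop where no stack state is live. The checkpoint_requested flag and the save/yield helpers are illustrative assumptions; in the real framework the compiler inserts this code automatically.

```c
struct packet;   /* opaque packet handle */

/* Assumed run-time interfaces (illustrative, not the paper's API). */
extern volatile int checkpoint_requested;   /* set by the run-time system */
extern void save_stage_state(void);
extern void yield_core_to_rts(void);
extern struct packet *next_packet(void);
extern void process_packet(struct packet *pkt);
extern void enqueue_downstream(struct packet *pkt);

void stage_dispatch_loop(void)
{
    for (;;) {
        /* Compiler-inserted checkpoint at the top of the dispatch loop:
         * no stack state is live here, and the loop runs at high
         * frequency, so checkpoints are both cheap and prompt. */
        if (checkpoint_requested) {
            save_stage_state();   /* persist the little state that exists */
            yield_core_to_rts();  /* hand the core back for remapping */
        }

        struct packet *pkt = next_packet();  /* dequeue the next packet */
        process_packet(pkt);                 /* the stage's real work */
        enqueue_downstream(pkt);             /* pass to the next stage */
    }
}
```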
Mechanisms: Why Have Binding?
[Figure: two mappings of stages A and B across the MEv2 microengines and the Intel XScale™ core. When communicating stages land on adjacent microengines, the binding can use next-neighbor (NN) rings and local locks.]
• Want to be able to use the fastest implementations of resources available
Mechanisms: Binding
• Goal: use the fastest implementations of resources available
• Resource abstraction:
  • Programmers write to abstract resources (packet channels, uniform memory, locks, etc.)
  • The abstraction must have little impact on run-time performance
• Our approach: adaptation-time linking
Mechanisms: Resource Binding Approach — Adaptation-Time Linking
[Figure: a microengine-based example. The application .o file contains initially undefined RAL calls; a separate RAL .o file contains several concrete RAL implementations.]
• At run time, the RTS has the application .o file and the RAL .o file
• The linker adjusts jump targets using the import-variable mechanism, producing the final .o file
• The process is repeated after each adaptation
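As a sketch of the idea, the linker's choice between concrete channel implementations might look like the C below. The adjacency test and all names are simplifications invented here, not the RAL's real logic.

```c
#include <stdlib.h>

/* Concrete implementations the linker can bind an abstract
 * packet channel to. */
enum chan_impl {
    CHAN_NN_RING,      /* next-neighbor ring: fastest, but only valid
                          between physically adjacent microengines */
    CHAN_SCRATCH_RING  /* scratch ring: slower, works between any MEs */
};

/* Pick the fastest implementation that is valid for the new mapping.
 * Adjacency is modeled here simply as consecutive ME numbers. */
static enum chan_impl choose_channel_impl(int producer_me, int consumer_me)
{
    if (abs(producer_me - consumer_me) == 1)
        return CHAN_NN_RING;
    return CHAN_SCRATCH_RING;
}
```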
Mechanisms: Binding — The Value of Choosing the Right Resource
[Figure: performance comparison of alternative resource implementations; chart data not recoverable. Source: Intel.]
Problem Domain
[Figure: edge services deployed where the access network, MAN/WAN, and enterprise LAN meet:]
• Compression; monitoring (billing, QoS); forwarding; switching
• VPN gateway; firewall; intrusion detection
• XML & SSL acceleration; L4–L7 switching; application acceleration
Policy: Determining Qadapt and the Monitoring Interval
• Want to maximize Qadapt; Qadapt is a function of the queue-monitoring interval
[Figure: a queue with the threshold Qadapt marked, contrasting the buffer space needed to handle the worst burst with n cores versus n+1 cores. The queue continues to fill while a core comes online; the theoretical maximum Qadapt is achieved when the queue depth can be detected instantaneously.]
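The definitions on the monitoring slide suggest a simple reconstruction (an assumption consistent with those definitions; the deck does not state this formula explicitly): while the extra core is being switched on, the queue fills at the worst-case excess rate, so the trigger threshold must leave that much headroom:

$$Q_{adapt} \le Q_{size} - (R_{worst} - n \cdot R_{dep1}) \cdot t_{sw}$$

Equality gives the theoretical maximum Qadapt when the queue depth can be detected instantaneously; a finite monitoring interval shrinks Qadapt further, since the depth may already have overshot between samples.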