1. NePSim: A Network Processor Simulator with Power Evaluation Framework
IEEE Micro, Sept/Oct 2004 – Source code at http://www.cs.ucr.edu/~yluo/nepsim
Yan Luo, Jun Yang, Laxmi N. Bhuyan, Li Zhao
Computer Science & Engineering
University of California at Riverside
2. NP Architecture Design - Research Goals (9/13/2012, Laxmi N. Bhuyan, University of California, Riverside)
NPs provide performance and programmability for packet processing without O/S overhead
Future design of NPs (with hardware accelerators) should be based on accurate estimation of execution performance gain instead of intuition
Power consumption of NPs is becoming a big concern. Techniques are needed to save power when traffic is low.
Need for an open-source, execution-driven simulator that can be used to explore architectural modifications for improved performance and power consumption
An ST200 edge router can support up to 8 NP boards, each of which consumes 95~150W. The total power of such an 88.4cm x 44cm x 58cm router can reach 2700W when two chassis are supported in a single rack! – Laurel Networks ST series router data sheet
3. NP Simulation Tools
Intel IXA SDK
+ accuracy, visualization
- closed-source, low speed, inflexible, no power model; cannot incorporate new hardware designs
SimpleScalar
+ open-source, popular, power model (wattch)
- Uniprocessor architecture - disparity with real NP
NePSim
+ open-source, real NP, power model, accuracy
- currently targets IXP1200 (IXP2400 support under development)
About 60 software downloads and 300 hits at http://www.cs.ucr.edu/~yluo/nepsim/
4. Objectives of NePSim
An open-source simulator for a real NP (Intel® IXP1200, later IXP2400/2800…)
Cycle-level accuracy of performance simulation
Flexibility for users to add new instructions and functional units
Integrated power model to enable power dissipation simulation and optimization
Extensibility for future NP architectures
Faster simulation compared to SDK
5. NePSim Software Architecture
ME core (ME ISA, 5-stage pipeline, GPR, xfer registers)
SRAM and SDRAM (data memory and controller with command queues)
FBI unit (ixbus and CSRs)
Device (network interface with in/out buffer, packet streams)
Dlite (a lightweight debugger)
Stats (collection of statistics data)
Traffic generator, program parser and loader
Each block is a software module in NePSim.
6. NePSim Overview
The SDK compiler compiles microengine C programs. NePSim includes a parser that converts the compiler-generated code into an internal format, and NePSim takes programs in this internal format as input.
NePSim currently cannot take assembler-generated code or microcode.
The host C compiler is used to build the NePSim executable from the NePSim source code.
7. NePSim Internals (I)
NePSim internal instruction, command, and event data structures. The number of cycles taken by each instruction depends on the stages of execution in the pipeline. Details were derived from the SDK hardware manual.
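The slide's figure is not reproduced here. As a rough sketch only (all field names are hypothetical, not NePSim's actual declarations), the three kinds of records might look like:

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical sketch of NePSim-style internal records; the real field
 * names and layouts come from the simulator source, not this slide. */

typedef struct {            /* decoded ME instruction */
    uint32_t pc;            /* program counter of this instruction */
    int      opcode;        /* internal opcode after parsing */
    int      cycles_left;   /* pipeline stages still to traverse */
} nep_inst_t;

typedef struct {            /* memory reference command */
    int      unit;          /* 0 = SRAM, 1 = SDRAM, 2 = FBI */
    int      priority;      /* queue priority level at the controller */
    uint32_t addr;          /* target address */
    int      thread_id;     /* thread to wake when data is ready */
} nep_cmd_t;

typedef struct {            /* completion event */
    long     ready_cycle;   /* cycle at which "data ready" fires */
    int      thread_id;     /* sleeping thread to signal */
} nep_event_t;
```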
8. NePSim Internals (II)
The yellow box is the ME. It has a 5-stage pipeline, and instructions (I) go through this pipeline. Stage P4 may generate memory reference commands (C) to the sram/sdram/fbi controllers.
Each controller has queues with different priority levels: SRAM has 3 queues and SDRAM has 4.
Arbiters service commands according to their arbitration policies.
When a memory command is served, a corresponding "data ready" event is put into the event queue.
The head of the event queue generates a wake-up signal to sleeping threads.
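The event-queue/wake-up mechanism described above can be sketched as follows. This is a minimal, self-contained model with hypothetical names, not NePSim's actual code:

```c
#include <assert.h>

#define MAX_EVENTS 64

typedef struct {
    long ready_cycle;   /* cycle at which "data ready" fires */
    int  thread_id;     /* sleeping thread to wake */
} event_t;

/* Simple event queue kept sorted so the earliest event is at the head. */
static event_t queue[MAX_EVENTS];
static int n_events = 0;

static void enqueue_event(long ready_cycle, int thread_id)
{
    int i = n_events++;
    /* insertion sort by ready_cycle: the head is always the next event */
    while (i > 0 && queue[i - 1].ready_cycle > ready_cycle) {
        queue[i] = queue[i - 1];
        i--;
    }
    queue[i].ready_cycle = ready_cycle;
    queue[i].thread_id   = thread_id;
}

/* Called once per simulated cycle: wake every thread whose "data ready"
 * event has reached the head of the queue. Returns number of wake-ups. */
static int tick(long now, int awake[])
{
    int woken = 0;
    while (n_events > 0 && queue[0].ready_cycle <= now) {
        awake[queue[0].thread_id] = 1;  /* wake-up signal */
        for (int i = 1; i < n_events; i++)
            queue[i - 1] = queue[i];
        n_events--;
        woken++;
    }
    return woken;
}
```

For example, a memory command served with a completion time of cycle 30 enqueues an event for cycle 30; when the simulator's clock reaches that cycle, `tick` signals the sleeping thread.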
9. NePSim parameters
-d enable debug messages
-I: start simulation in Dlite debugger mode
-proc:speed processor speed in MHz
-vdd default chip power supply
-me0 program for microengine 0
-script script file used for initialization
-strmconf stream config file of packet traces to network devices
-max:cycle maximum number of cycles to execute
-indstrm mark packet stream file as indefinitely repeated
-power flag to enable power calculations
Some example parameters to run NePSim simulations.
10. Dlite Debugger
Similar to Dlite in SimpleScalar
Run simulation in debug mode
Set/delete breakpoints
Step into pipeline execution, check ALU condition code
Examine threads’ PC, status etc.
Examine register contents
Examine memory (sram, sdram) contents
We provide an integrated debugger. Users can run simulations in debug mode. This is very helpful for debugging both benchmark applications and NePSim itself.
11. IVERI – a verification tool
Verifying NePSim against the IXP1200 is not a trivial task
Multiple MEs, threads, memory units
A huge number of events (pipeline and memory) is generated during simulation
Errors have to be pin-pointed by scanning huge log traces
IVERI tool
Assertion checking based on Linear Temporal Logic (LTL) and Logic of Constraint (LOC)
Log architectural events in both NePSim and the IXP1200 (with the SDK)
<cycle, PC, alu_out, address, event_type>
Event_type can be pipeline, sram_enq, sdram_deq etc.
Use LOC Assertion to specify performance requirement
E.g. "the execution time of an instruction in the NePSim pipeline is no more than D cycles away from the execution time of the same instruction in the IXP1200":
PC(pipeline[i]) == PC(pipeline_IXP[i]) ∧
|cycle(pipeline[i]) - cycle(pipeline_IXP[i])| <= D
IVERI generates verification code based on the assertions
The verification code scans the log traces and reports the number and locations of constraint violations.
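The scan over logged events can be sketched as below. This is a simplified checker for the single LOC assertion above, assuming the two traces have already been aligned by instruction index; the record layout and function names are illustrative, not IVERI's actual output:

```c
#include <assert.h>
#include <stdlib.h>

typedef struct {
    long     cycle;  /* cycle at which the pipeline event was logged */
    unsigned pc;     /* program counter of the instruction */
} log_rec_t;

/* Check the LOC assertion on two aligned pipeline traces:
 * for every index i, the PCs must match and the cycle difference
 * must be at most D. Returns the number of constraint violations. */
static int check_loc(const log_rec_t *sim, const log_rec_t *hw,
                     int n, long d)
{
    int violations = 0;
    for (int i = 0; i < n; i++) {
        if (sim[i].pc != hw[i].pc ||
            labs(sim[i].cycle - hw[i].cycle) > d)
            violations++;   /* a real tool would also report index i */
    }
    return violations;
}
```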
12. Performance Validation of NePSim
Throughput
13. Power Model
GPR – General Purpose Registers; XFER – transfer registers for memory references
14. Benchmarks
Ipfwdr
IPv4 forwarding (header validation, trie-based lookup)
Medium SRAM access
Url
Examining the payload for URL patterns, used in content-aware routing
Heavy SDRAM access
Nat
Network address translation
Medium SRAM access
Md4
Message digest (computes a 128-bit message "signature")
Heavy computation and SDRAM access
15. Performance Implications
More MEs do not necessarily bring a performance gain
More MEs cause more memory contention
ME idle time is abundant (up to 42%)
Faster ME core results in more ME idle time with the same memory
Non-optimal rcv/xmit configuration for NAT (the transmitting ME is a bottleneck)
Speed ratio is the ratio between processor speed and memory speed. Memory speed is fixed at 116 MHz, so the bars represent processor speeds of 232 and 464 MHz.
All figures are for 4 receiving and 2 transmitting MEs.
16. Where Does the Power Go?
Power dissipation by rcv and xmit MEs is similar across benchmarks
Transmitting MEs consume ~5% more power than receiving MEs
ALU consumes significant power ~45% (wattch model)
Control store uses ~28% (accessed almost every cycle)
GPRs burn ~13%, the shifter ~7%, static power ~7%
The ALU power is high. This may be because we derive it from the Wattch model: Wattch's ALU is more complex than the ME's, and Wattch derives its number from an old reference.
The figures show the power consumed by different MEs. The top blue part is the power consumed by the SRAM and SDRAM controllers, command bus, etc. The IX bus is not modelled.
17. Power Efficiency Observations
Power consumption increases faster than performance
More MEs/threads bring more idle time due to memory contention
Motivation for DVS
18. Dynamic Voltage Scaling in NPs
During ME idle time, all threads are in the "wait" state and the pipeline has no activity.
Applying DVS while MEs are not very active can reduce the total power consumption substantially.
DVS control scheme
Observes the ME idle time (%) periodically.
When idle > threshold, scale down the voltage and frequency (VF in short) by one step unless the minimum allowable VF is hit.
When idle < threshold, scale up the VF by one step unless it is at the maximum allowable values.
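The two rules above amount to a small threshold controller. A minimal sketch, assuming discrete VF steps as on the next slide; the table values and function names are illustrative, not the IXP1200's actual operating points:

```c
#include <assert.h>

/* Discrete VF steps, highest first (values are illustrative). */
static const struct { double vdd; int mhz; } vf_table[] = {
    {1.3, 600}, {1.2, 500}, {1.1, 400}, {1.0, 300},
};
#define N_STEPS (int)(sizeof vf_table / sizeof vf_table[0])

/* One DVS decision per observation period: scale down one step when the
 * ME idle fraction exceeds the threshold, otherwise scale up one step.
 * `step` indexes vf_table (0 = highest VF); returns the new step. */
static int dvs_decide(int step, double idle_frac, double threshold)
{
    if (idle_frac > threshold) {
        if (step < N_STEPS - 1)
            step++;              /* lower VF by one step */
    } else if (step > 0) {
        step--;                  /* raise VF by one step */
    }
    return step;
}
```

Each period the controller would read the ME idle counter, call `dvs_decide`, and pay the VF transition cost only when the step actually changes.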
19. DVS Considerations
Transition step: whether to use continuous or discrete changes in VF
We use discrete VF steps
Transition status: whether the ME is allowed to continue working during VF regulation
We pause the ME while regulating VF
Transition time: the time between two different VF states
10 us [Burd][Shang][Sidiropoulos]
Transition logic complexity: the overhead of the control circuit that monitors and determines a transition
20. DVS Power-Performance
Initial VF = 1.3V, 600MHz
DVS period: every 15K, 20K, or 30K cycles, make a DVS decision to reduce or increase VF.
Up to 17% power savings with less than 6% performance loss
On average 8% power savings with <1% performance degradation
Threshold = ?
21. Ongoing and Future Work
Extend NePSim to IXP2400/2800
Dynamically shutdown/activate MEs
Dynamically allocate task on MEs
Model SRAM and SDRAM module power
Integrate StrongARM/Xscale simulation
…
Dynamically allocating tasks means an ME can be configured or instructed to run rcv/xmit/processing tasks based on the current system status. For example, if transmitting is the bottleneck, a receiving ME can be turned into a transmitting ME at run-time.
22. NP-Based Projects at UCR
Intel IXA Program: NP Architecture Lab – Architecture research – CS 162 assignments based on IXP 2400
NSF: Design and Analysis of a Web Switch (Layer 5/7 switch) Using NP – TCP Splicing – Load Balancing etc.
Intel and UC Micro: Architectures to Accelerate Data Center Servers – TCP and SSL Offload to Dedicated Servers and NPs, XML Servers, etc.
Los Alamos National Lab: Intelligent NP-Based NIC Design for Clusters – O/S Bypass Protocols, User Level Communication, etc.
23. References
[Burd] T. Burd and R. Brodersen, "Design issues for dynamic voltage scaling," International Symposium on Low Power Electronics and Design, pp. 9-14, 2000.
[Shang] L. Shang, L.-S. Peh, and N. K. Jha, "Dynamic voltage scaling with links for power optimization of interconnection networks," 9th International Symposium on High-Performance Computer Architecture, pp. 91-102, 2003.
[Sidiropoulos] S. Sidiropoulos, D. Liu, J. Kim, G. Wei, and M. Horowitz, "Adaptive bandwidth DLLs and PLLs using regulated supply CMOS buffers," IEEE Symposium on VLSI Circuits, pp. 124-127, 2000.