cheap silicon: myth or reality?
Picking the right data plane hardware for software defined networking
Gergely Pongrácz, László Molnár, Zoltán Lajos Kis, Zoltán Turányi
TrafficLab, Ericsson Research, Budapest, Hungary
DP CHIP landscape: the usual way of thinking
Chips are usually placed on a programmability vs. performance trade-off (assuming the same use case and table sizes):
• generic NP, run-to-completion (most programmable, lower performance): SNP, Netronome
• programmable pipeline: NP4
• fixed pipeline (least programmable, higher performance): Broadcom/Marvell, Fulcrum
The main question that is seldom asked: how big is the difference?
first comparison
Approximate power per 10G of throughput:
• CPUs: ~25 W / 10G
• NPUs: ~4-5 W / 10G
• programmable pipelines: ~3-4 W / 10G
• switches: ~0.5 W / 10G
So it seems there is a 5-10x difference between “cheap silicon” and programmable devices. But do we compare apples to apples?
Simple NP/CPU model
A generic NP/CPU is modelled as the following blocks connected by an internal bus:
• I/O ports: Ethernet (e.g., 10G, 40G), fabric (e.g., Interlaken), system (e.g., PCIe)
• processing unit(s) (e.g., pipeline, execution units)
• on-chip memory (e.g., cache, scratchpad)
• accelerators (e.g., RE engines, TCAM, hardware queues, encryption)
• external resource control for optional off-chip resources: external memory (e.g., DDR3), optional accelerators (e.g., TCAM)
example configuration of the NP/CPU model:
• external ports: 96 x 10G
• cores: 256 @ 1 GHz
• L1: SRAM, 4 B/clock per core, >128 B
• L2: eDRAM, 24 Gtps, shared, >2 MB
• high-capacity memory (e.g., DDR3): 8 MCT, 340 Mtps, >1 GB
• low-latency RAM (e.g., RLDRAM)
packet walkthrough
• read frame from I/O: copy to L2 memory, copy header to L1 memory
• parse fixed header fields
• find extended VLAN, {sport, S-VID} → eVLAN: table in L2 memory
• MAC lookup and learning, {eVLAN, C-DMAC} → {B-DMAC, dport, flags}: table in ext. memory
• encapsulate: create new header, fill in values (no further lookup)
• send frame to I/O
assembly code • Don’t worry, no time for this • but the code pieces can be found in the paper
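The code measured in the paper is NP assembly; purely as an illustration, here is a minimal C-style sketch of the walkthrough above. All types and helper functions (read_frame, evlan_lookup, mac_lookup_learn, push_pbb_header, send_frame) are hypothetical names, not an actual NPU SDK API.

/* Illustrative sketch only: mirrors the sequence of steps on the
 * "packet walkthrough" slide; the real implementation is NP assembly. */
#include <stdint.h>

struct frame;                                   /* opaque frame handle in L2 memory */
struct eth_hdr { uint8_t dmac[6], smac[6]; uint16_t svid; };

/* hypothetical helpers, declared so the sketch compiles as a unit */
struct frame *read_frame(int sport, struct eth_hdr *h);
uint16_t evlan_lookup(int sport, uint16_t svid);
void mac_lookup_learn(uint16_t evlan, const uint8_t *cdmac, const uint8_t *csmac,
                      uint8_t *bdmac, uint16_t *dport, uint32_t *flags);
void push_pbb_header(struct frame *f, const uint8_t *bdmac, uint16_t evlan, uint32_t flags);
void send_frame(struct frame *f, uint16_t dport);

void pbb_encap(int sport)
{
    struct eth_hdr hdr;
    uint8_t  bdmac[6];
    uint16_t dport;
    uint32_t flags;

    /* 1. read frame from I/O: body to L2 memory, header copy to L1 */
    struct frame *f = read_frame(sport, &hdr);

    /* 2-3. parse fixed header fields, then {sport, S-VID} -> eVLAN
     *      (table held in on-chip L2 memory)                        */
    uint16_t evlan = evlan_lookup(sport, hdr.svid);

    /* 4. MAC lookup and learning: {eVLAN, C-DMAC} -> {B-DMAC, dport, flags}
     *    (table held in external memory)                             */
    mac_lookup_learn(evlan, hdr.dmac, hdr.smac, bdmac, &dport, &flags);

    /* 5. encapsulate: build the new backbone (MAC-in-MAC) header,
     *    no further table lookup needed                              */
    push_pbb_header(f, bdmac, evlan, flags);

    /* 6. send frame to I/O */
    send_frame(f, dport);
}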
pps/bw calculation: summary only*
• PBB processing on average:
• 104 clock cycles
• 25 L2 operations (depends on packet size)
• 1 external RAM operation
• Calculated performance (pps):
• cores + L1 memory: 2462 Mpps
• L2 memory: 960 Mpps
• ext. memory: 2720 Mpps
• packet size = 750B → 960 Mpps = 5760 Gbps
• packet size = 64B → bottleneck moves to the cores → 2462 Mpps = 1260 Gbps
* assembly code and detailed calculation are available in the paper
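A minimal sketch, assuming the example configuration from the model slide (256 cores @ 1 GHz, shared L2 at 24 Gtps, 8 memory controllers at 340 Mtps each) and the per-packet costs listed above for the 750 B PBB case, that reproduces the summary numbers:

#include <stdio.h>

int main(void)
{
    /* chip resources per second */
    double core_cycles = 256 * 1e9;   /* 256 cores @ 1 GHz             */
    double l2_tps      = 24e9;        /* shared L2 eDRAM transactions  */
    double ext_tps     = 8 * 340e6;   /* 8 MCT x 340 Mtps              */

    /* per-packet cost of the PBB walkthrough at 750 B */
    double cycles_per_pkt = 104, l2_per_pkt = 25, ext_per_pkt = 1;

    double mpps_cores = core_cycles / cycles_per_pkt / 1e6;   /* ~2462 Mpps */
    double mpps_l2    = l2_tps      / l2_per_pkt    / 1e6;    /*   960 Mpps */
    double mpps_ext   = ext_tps     / ext_per_pkt   / 1e6;    /*  2720 Mpps */

    /* the slowest resource is the bottleneck */
    double bottleneck = mpps_l2;
    if (mpps_cores < bottleneck) bottleneck = mpps_cores;
    if (mpps_ext   < bottleneck) bottleneck = mpps_ext;

    printf("cores %.0f, L2 %.0f, ext %.0f Mpps\n", mpps_cores, mpps_l2, mpps_ext);
    printf("750 B: %.0f Mpps = %.0f Gbps\n", bottleneck,
           bottleneck * 1e6 * 750 * 8 / 1e9);                 /* ~5760 Gbps */

    /* the L2 op count depends on packet size, so at 64 B the cores
     * become the bottleneck: 2462 Mpps * 64 B * 8 ~= 1260 Gbps        */
    printf("64 B:  %.0f Mpps = %.0f Gbps\n", mpps_cores,
           mpps_cores * 1e6 * 64 * 8 / 1e9);
    return 0;
}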
Ethernet PBB scenario: overview of results
• 13-16 vs. 10-13 Mpps/Watt: a 20-30% advantage, i.e., around 1.25x instead of 10x
• Results are theoretical: programmable chips today are designed for more complex tasks, with fewer I/O ports
summary and next steps
I'll have to make this really quick if I've already spent more than 8 minutes
what we’ve learned so far…
• Performance depends mainly on the use case, not on the selected hardware solution
• not valid for Intel-like generic CPUs – much lower performance on simple use cases
• but even this might change with manycore Intel products (e.g., Xeon Phi)
• at the board/card level the local processor also counts – a known problem for the NP4
• Future memory technologies (e.g., HMC, HBM, 3D stacking) might change the picture again
• much higher transaction rate, lower power consumption
But! – no free lunch. The hard part: I/O balance
• So far it seems that a programmable NPU would be suitable for all tasks
• BUT! For which use case should we balance the I/O against the processing complex?
• today the (mostly) static I/O is built together with the NPU
• and there is a >10x packet processing performance difference between important use cases
• How to solve it?
• different NPU–I/O flavors: still a quite static solution
• but an (almost) always oversubscribed I/O could do the job
• I/O–forwarding separation: modular hardware
what is next: ongoing and planned activities
• Prove by prototyping
• use the ongoing OpenFlow prototyping activity: the OF switch can be configured to act as PBB
• SNP hardware will be available in our lab in 2013 Q4
• the Intel (DPDK) version is ready; first results will be demonstrated @ EWSDN 13
• Evaluate the model and make it more accurate
• more accurate memory and processor models, e.g., calculate with utilization-based power consumption
• identify possible other bottlenecks, e.g., backplane, on-chip network
thank you! And let’s discuss these further