220 likes | 653 Views
Smart Memory. Bill Lynch Sailesh Kumar and Team. Overview. High Performance Packet Processing Challenges Solution – Smart Memory Smart Memory Architecture. Packet Processing Workload Challenges. Sequential memory references For lookups (L2, L3, L4, and L7) Finite automata traversal
E N D
Smart Memory • Bill Lynch • Sailesh Kumar • and Team
Overview • High Performance Packet Processing Challenges • Solution – Smart Memory • Smart Memory Architecture
Packet Processing Workload Challenges • Sequential memory references • For lookups (L2, L3, L4, and L7) • Finite automata traversal • Read-modify-write • Statistics, counters, token-bucket, mutex, etc • Pointer and link-list management • Buffer management, packet queues, etc • Traditional implementations use • Commodity memory to store data • NPs and ASICs to process data in memory Tons of memory references and minimal compute Memory Memory Memory Memory • Performance Barriers: • Memory and chip I/O bandwidth • Memory latency • Lock for atomic access P P P P P P P P P P P P
1 0 1 0 2 3 1 1 P1 4 5 6 1 0 P3 P2 7 0 8 P4 9 P5 Low IPC Need more processors More latency In interconnect Illustration of Performance Barrier I Memory Memory Memory Memory Interconnection network IP lookup tree P P P P P P P P P P P P Requires several transactions between memory and processors High lookup delay due tohigh interconnect and memory latency
Enqueue Dequeue Counters Lock free-list Lock counter Get free node Read counter Unlock free-list Write counter Lock list tail Unlock counter Read list tail Link free node Update list tail Unlock list tail Illustration of Performance Barrier II Memory Memory Memory • Lookups are read-only so relatively easy • Link-list, counters, policers, etc are read-modify-write • Requires per memory address lock in multi-core systems Interconnection network P P P P P P P P P P P P Locks often kept in memory Requires another transaction Adds significant latency Single queue or single counter operations are extremely slow
Overview • High Performance Packet Processing Challenges • Memory bandwidth and latency • Limited I/O bandwidth • Solution – Smart Memory • Attach simple compute with data • Attach lock with data • Enable local memory communication • Smart Memory Architecture • Hybrid memory – eDRAM + DDR3-DRAM • Serial I/O
Memory Memory Memory Memory compute compute compute compute Interconnection network P P P P P P P P P P P P Introduction to Smart Memory Memory Memory Memory • What is the real problem? • Compute occurs far away from data • Lock acquire/release occurs far from data • Solution: Make memory smarter by: Interconnection network Fortunately, compute for packet processing jobs are very modest! P P P P P P P P P P P P Enabling local communication Managing lock close to data Keeping compute close to data
Memory Memory Memory Memory compute compute compute compute Interconnection network P P P P P P P P P P P P Introduction to Smart Memory Memory Memory Memory • What is the real problem? • Compute occurs far away from data • Lock acquire/release occurs far from data Interconnection network Fortunately, compute for packet processing jobs are very modest! P P P P P P P P P P P P • Smart Memory Advantages • (Get more off fewer transactions!) • Lower I/O bandwidth • Lower processing latency • Higher IPC • Significantly higher single counter/queue performance
Overview • High Performance Packet Processing Challenges • Memory bandwidth and latency • Limited I/O bandwidth • Solution – Smart Memory • Attach simple compute with data • Attach lock with data • Enable local memory communication • Smart Memory Architecture • Hybrid memory – eDRAM + DDR3-DRAM • Serial chip I/O
Smart Memory Capacity and Bandwidth @100G 40 20 10 5 2.5 1.25 .62 .31 .15 DPI (string) DPI (regex) ACL (algorithmic) FIB (algorithmic) Memory bandwidth (Billion accesses / packet) Queuing/ Scheduling Layer2 fwding Statistics/ Counter Basic Layer2 Packet Buffer Video Buffer 2 4 8 16 32 64 128 256 512+ Memory Capacity (MB)
Smart Memory Capacity and Bandwidth @100G 40 20 10 5 2.5 1.25 .62 .31 .15 DPI (string) DPI (regex) Smart Memory uses intelligent algorithms to split the data-structures 64 banks eDRAM ACL (algorithmic) FIB (algorithmic) Memory bandwidth (Billion accesses / packet) Queuing/ Scheduling Layer2 fwding Statistics/ Counter Basic Layer2 Packet Buffer 8 channels of DDR3-DRAM Video Buffer 2 4 8 16 32 64 128 256 512+ Memory Capacity (MB)
P P P P P P P P P P P P eDRAM eDRAM eDRAM eDRAM eDRAM eDRAM eDRAM eDRAM eDRAM eDRAM eDRAM eDRAM eDRAM SM engine SM engine SM engine SM engine SM engine SM engine SM engine SM engine SM engine SM engine SM engine SM engine SM engine P P P P Smart Memory High Level Architecture DDR3 DRAM Global interconnect: provides fair communication between processors and smart memory Packet processor complex DRAM SM Engine eDRAM eDRAM SM engine SM engine eDRAM SM engine Local interconnect: provides local communication between smart memory blocks Smart Memory complex
P P P P P P P P P P P P eDRAM eDRAM eDRAM eDRAM eDRAM eDRAM eDRAM eDRAM eDRAM eDRAM eDRAM eDRAM eDRAM data SM engine SM engine SM engine SM engine SM engine SM engine SM engine SM engine SM engine SM engine SM engine SM engine SM engine FIB key P P P P Smart Memory High Level Architecture DDR3 DRAM Packet processor complex FIB key result DRAM SM Engine read Computation occurs close to memory reducing latency Requires fewer memory transactions eDRAM eDRAM read SM engine SM engine eDRAM Split tables into eDRAM and DRAM SM engine Smart Memory complex
I/O Technology Choice in Smart Memory • Smart Memory reduces the chip I/O bandwidth significantly • How to further optimize it? - Bandwidth, latency and I/O bandwidth gap is growing - On-chip bandwidth is much higher than memory I/O Smart Memory use serial I/O - 4X throughput than RLDRAM and QDR - 3X fewer pins than DDR3 and DDR4 - 2.5X reduces I/O power Based on MoSys data
High Speed Line Card with Smart Memory 540+ w 212- w Power 472+ cm2 148 cm2 Area Traditional Line Card Line Card with SM 2520 $ 5600+ $ Cost 2-3 times
Concluding Remarks • Packet Processing Bottlenecks • Data away from compute • I/O and memory bandwidth • Smart Memory • Keep compute close to data • Keep locking close to data • Provide inter-memory connect • Advantages • Reduced chip I/O bandwidth • High performance and low latency • Feature rich, flexible and programmable • Lower cost • One chip for several functions
Thank You www.huawei.com