This paper discusses the use of on-chip DRAM in a reconfigurable architecture, including the Configurable Memory Block (CMB) and its evaluation. It also examines the challenges and advantages of on-chip DRAM compared to SRAM.
Embedded DRAM for a Reconfigurable Array. S. Perissakis, Y. Joo¹, J. Ahn¹, A. DeHon, J. Wawrzynek. University of California, Berkeley; ¹LG Semicon Co., Ltd.
Outline • Reconfigurable architecture overview • Motivation for on-chip DRAM • Configurable Memory Block (CMB) • Evaluation • Conclusion
Long Term Architecture Goal • On-chip CPU • LUT-based compute pages • DRAM memory pages • Fat pyramid network: fat tree + shortcuts
Long Term Architecture Goal (figure): the CPU reconfigures the array between Kernel 1 (producer) and Kernel 2 (consumer)
Motivation: need large on-chip memory for: • Stream buffers: reduce reconfiguration frequency • Configuration memory: speed up reconfiguration • Application memory: speed up individual kernels
Challenges: DRAM offers increased density (10X to 20X that of SRAM), but: • Harder to use: row/column accesses & variable latency, refresh • Lower performance: increased access latency • Q: Is it worth the trouble?
Trumpet test chip • One compute page • One memory page • Corresponding fraction of the network
CMB Functions • Configuration source • State source/sink • Data store • Input/output
CMB Overview (block diagram): CMB Controller (Cmd, Ctl[1:0], Addr[9:0] from host), DRAM Macro (Ctl[1:0], Addr[17:0], DQ[127:0]), network ports from the compute page (Tree[159:0], Short[159:0]), plus rate matching, address & data crossbars, stall buffers, and retiming registers
DRAM Macro (designed by LG Semicon) • 0.25µm, 4-metal eDRAM process • 1 to 8 Mbits (2 Mbits in test chip) • 128-bit wide SDRAM interface • Up to 125 MHz clock: 2 GB/s peak bandwidth • 36ns/12ns row/col latencies • Row buffers to hide precharge & refresh
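As a quick back-of-the-envelope check of the peak bandwidth figure above, the 128-bit interface at 125 MHz works out as follows (a sketch; both numbers are taken from this slide):

```python
# Peak bandwidth of the 128-bit SDRAM interface at 125 MHz.
bus_width_bits = 128           # DQ[127:0]
clock_hz = 125e6               # up to 125 MHz

peak_bw_bytes_per_s = (bus_width_bits / 8) * clock_hz
print(f"{peak_bw_bytes_per_s / 1e9:.1f} GB/s")   # -> 2.0 GB/s peak
```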
SRAM Abstraction • SRAM-like interface: Req, R/W, Address, Data • Row buffers act as a simple direct-mapped cache • 6-cycle minimum latency, pipelined • Misses handled by logic stalls • 10-cycle miss latency "hidden" from logic
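To make the "row buffers as a simple direct-mapped cache" abstraction concrete, here is a minimal behavioral sketch; the number of row buffers and the 1 Kbit row size are assumptions, while the hit and miss latencies are the ones quoted on this slide:

```python
# Minimal model of the SRAM abstraction: an access either hits an open row
# (6-cycle pipelined latency) or misses and stalls the logic for 10 extra cycles.
ROW_BITS = 1024          # assumed row size: 1 Kbit per DRAM row
HIT_LATENCY = 6          # minimum pipelined access latency (cycles)
MISS_PENALTY = 10        # row-buffer miss stall (cycles)

class RowBufferCache:
    def __init__(self, num_buffers=2):       # number of row buffers is assumed
        self.open_rows = [None] * num_buffers

    def access(self, bit_addr):
        row = bit_addr // ROW_BITS
        slot = row % len(self.open_rows)      # direct-mapped placement
        if self.open_rows[slot] == row:
            return HIT_LATENCY                # hit: row already open
        self.open_rows[slot] = row            # miss: open the new row
        return HIT_LATENCY + MISS_PENALTY     # logic stalls while the miss is serviced
```

A sequential scan keeps hitting the open row, so the 10-cycle miss latency stays hidden; a random access stream pays it on nearly every request.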
Stalls • Stall sources: • Row buffer miss (10 cycles) • Write after read (4 cycles) • DRAM/logic clock alignment (1 cycle) • Refresh (Halt from host) • Multicycle stall distribution
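A small accounting sketch of how the stall sources above add up over a run; only the per-event penalties come from the slide, and the event counts in the example are hypothetical:

```python
# Cycle penalties per stall source, as quoted on the slide.
STALL_CYCLES = {
    "row_buffer_miss": 10,
    "write_after_read": 4,
    "clock_alignment": 1,
}

def total_stall_cycles(event_counts):
    """Sum stall cycles over a run; refresh is handled separately via Halt from the host."""
    return sum(STALL_CYCLES[src] * n for src, n in event_counts.items())

# Hypothetical run: 50 misses, 20 write-after-read turnarounds, 100 clock alignments.
print(total_stall_cycles({"row_buffer_miss": 50,
                          "write_after_read": 20,
                          "clock_alignment": 100}))   # -> 680 cycles
```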
Stall Buffers (figure: input and output stall buffers between the DRAM macro, CMB logic, and user logic) • The memory page is never stalled • Must buffer read data during stall • Must buffer requests during stall distribution
Trumpet Test Chip • 0.25 µm DRAM, 0.4 µm logic • 2 Mbits + 64 LUTs • 125 MHz operation • 1 GB/sec peak bandwidth • 10 µs reconfiguration • 10 x 5 mm2 die • 1 W @ 125 MHz
CMB Area Breakdown (die plot: DRAM core, fuse, SDRAM i/f controller, DRAM macro datapath, CMB datapath, controller, clock) • 13.95 mm2 total • 2 Mbits capacity: 147 Kbits/mm2 average density • Compare to 700-900 Kbits/mm2 for commodity DRAM
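The average density quoted above follows directly from the capacity and area figures (a simple check):

```python
capacity_kbits = 2 * 1024   # 2 Mbits
area_mm2 = 13.95            # total CMB area
print(f"{capacity_kbits / area_mm2:.0f} Kbits/mm2")   # -> 147 Kbits/mm2
```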
Using a Custom Macro • Existing: 13.95 mm2, 147 Kbits/mm2 • Custom: 9.4 mm2, 218 Kbits/mm2
Comparison to SRAM CMB • DRAM (custom macro): 218 Kb/mm2 • SRAM (equal area): 25 Kb/mm2, assuming typical SRAM core densities, no stall buffers, and a simplified controller • Close to one order of magnitude density advantage for DRAM
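The ratio behind the "close to one order of magnitude" claim:

```python
dram_density_kb_mm2 = 218   # custom macro
sram_density_kb_mm2 = 25    # equal-area SRAM CMB estimate
print(f"{dram_density_kb_mm2 / sram_density_kb_mm2:.1f}x")   # -> 8.7x, close to 10x
```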
Performance • Configuration / state swap: peak 1 GB/s • User accesses: dependent on access patterns • Peak if high locality • Near peak for sequential patterns (62-93%) • Column latency exposed when dependencies exist, or on mixed R/W • Row latency exposed on random accesses
Performance (example): 8x8 DCT blocks over an input image read in scanline order, with 1 Kbit = 1 DRAM row • Row: ~4 misses / DCT block • Col: 2 misses / DCT block • 73% efficiency
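One way to express the efficiency figure in this example is useful access cycles divided by useful plus stall cycles. The sketch below uses hypothetical per-block access and miss-penalty counts that happen to land near the quoted 73%; the paper's exact accounting may differ:

```python
def access_efficiency(useful_accesses, misses, miss_penalty_cycles, cycles_per_access=1):
    """Fraction of memory cycles spent on useful transfers rather than stalls."""
    useful = useful_accesses * cycles_per_access
    return useful / (useful + misses * miss_penalty_cycles)

# Hypothetical accounting for one 8x8 DCT block: 16 pipelined accesses,
# 2 exposed misses at ~3 extra cycles each.
print(f"{access_efficiency(16, 2, 3):.0%}")   # -> 73%
```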
Refresh Overhead • 8 to 16 ms retention time expected • 2.5% to 5.0% bandwidth loss • Can reduce by refreshing only active part of memory • May skip refresh for short-lived data
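A rough model of where the 2.5% to 5.0% bandwidth loss comes from; the retention times and the 125 MHz clock are from these slides, while the row count and per-refresh cycle cost below are assumptions for illustration:

```python
def refresh_overhead(num_rows, cycles_per_refresh, retention_ms, clock_mhz=125):
    """Fraction of memory cycles consumed by distributed refresh."""
    cycles_per_retention_period = retention_ms * 1e-3 * clock_mhz * 1e6
    return num_rows * cycles_per_refresh / cycles_per_retention_period

# Assuming an 8 Mbit macro with 1 Kbit rows (8192 rows) and ~6 cycles per row refresh:
print(f"{refresh_overhead(8192, 6,  8):.1%}")   # -> ~4.9% at  8 ms retention
print(f"{refresh_overhead(8192, 6, 16):.1%}")   # -> ~2.5% at 16 ms retention
```

Refreshing only the active part of the memory, as noted above, scales the effective row count (and hence the overhead) down proportionally.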
Conclusion • Q: Is on-chip DRAM advantageous compared to SRAM? • Our experience so far: • A user-friendly abstraction is possible • The density advantage can be maintained • Effect on application performance: • Large buffer space → less frequent reconfiguration • High bandwidth → faster reconfiguration • Effect on individual kernels is often limited by DRAM core latency