The Systolic Ring: A Scalable Dynamically Reconfigurable Core for Embedded Systems
Pascal BENOIT, G. SASSATELLI, M. ROBERT, L. TORRES, G. CAMBON, T. GIL
Introduction - SoC architectures
• Reuse: Intellectual Property (IP) cores
• SoC block diagram: CPU, CPU, DSP, RAM1, RAM2, RAM3, ROM, Flash, MPEG2, USB 2.0 controller, PCI, PCI-X, arbitration, reconfigurable blocks
• Reconfiguration: software; hardware (static / dynamic); test
• Reconfigurable interconnect (Network on Chip...)
• Test (BIST): Built-In Self-Test
Introduction
• Target: multimedia, data-flow-oriented applications
• Focus: hardware reconfiguration (static / dynamic)
• SoC block diagram: CPU, CPU, DSP, RAM1, RAM2, RAM3, ROM, Flash, MPEG2, USB 2.0 controller, PCI, PCI-X, arbitration, reconfigurable blocks, test (BIST)
Introduction - What kind of base block is suitable for multimedia?
• Fine grain (granularity: bit): suited to prototyping and encryption; high reconfiguration overhead; low operating frequencies
• Coarse grain (granularity: word): suited to DSP and data-flow-oriented processing; low reconfiguration overhead; high performance
• Base block components shown: ALU, multiplier, MUXes, registers
Introduction - What kind of base block is suitable for multimedia?
• Coarse grain (granularity: word): suited to DSP and data-flow-oriented processing; low reconfiguration overhead; high performance
• Base block components shown: ALU, multiplier, MUXes, registers
• Choice: the Systolic Ring, a coarse-grain dynamically reconfigurable architecture
Outline
• The Systolic Ring: building block, operative layer topology, system overview, features
• Application example: 8*8 2D DCT, structural mapping, performance comparisons
• Conclusion
The Systolic Ring - System Overview
• Coarse-grain dynamically reconfigurable architecture based on two layers
• Operative layer: Dnodes
• Configuration layer: configuration sequencer, configware, RAM
The Systolic Ring - Building block: Dnode (Data Node)
• Data-processing-oriented block: ALU + multiplier (MAC)
• Programmable component with a local sequencer
• Dynamic and autonomous configuration management
• One instruction per cycle
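To make the Dnode description concrete, here is a minimal behavioural sketch in Python. It is not the authors' hardware design: the class name `Dnode`, the operation names (`add`, `sub`, `mac`, `clear`) and the `(operation, coefficient)` program format are all hypothetical, chosen only to illustrate a local sequencer issuing one instruction per cycle to an ALU/multiplier with an accumulator.

```python
from dataclasses import dataclass


@dataclass
class Dnode:
    """Hypothetical model of a Dnode: ALU + multiplier with an accumulator."""
    program: list      # local sequencer contents: one (operation, coefficient) per cycle
    acc: float = 0     # accumulator used by the MAC operation
    pc: int = 0        # position of the local sequencer

    def step(self, a=0, b=0):
        """Execute exactly one instruction (one clock cycle) and return its result."""
        op, coeff = self.program[self.pc]
        self.pc = (self.pc + 1) % len(self.program)  # the sequencer loops over its program
        if op == "add":
            return a + b
        if op == "sub":
            return a - b
        if op == "mac":
            self.acc += a * coeff
            return self.acc
        if op == "clear":
            result, self.acc = self.acc, 0
            return result
        raise ValueError(f"unknown operation: {op}")


# Two multiply-accumulate cycles followed by a read-and-clear cycle.
mac = Dnode([("mac", 2), ("mac", 3), ("clear", None)])
outputs = [mac.step(1), mac.step(4), mac.step()]
```

Here `outputs` holds the accumulator after each cycle (1*2, then 1*2 + 4*3, then the same value as it is read out and cleared).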
The Systolic Ring - Dnode Clusters
• Dnodes are organised in layers
• Macro node (diagram: Switch, Dnode, Dnode)
• Direct data injection through I/O
The Systolic Ring - Switch components
• Full connectivity between two layers (layer n-1 and layer n)
[Figure: a switch placed between the Dnodes of layer n-1 and the Dnodes of layer n]
The Systolic Ring - Operative Layer Topology
• Ring structure
• Customisable...
[Figure: ring of switches and Dnode layers with I/O ports]
The Systolic Ring - Operative Layer Topology: Data Flows
• Forward data flow: unidirectional data transit between successive layers (circular pipeline)
• Reverse data flow: feedback pipeline network for recursive algorithms
[Figure: ring of switches and Dnodes showing the forward and reverse data flows]
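The forward data flow can be pictured as a circular pipeline in which each layer consumes what the previous layer produced one cycle earlier, and the last layer feeds the first. The sketch below is a toy model under stated assumptions: one value per layer, one hypothetical operation per layer, and an optional externally injected value at layer 0. The function name `ring_step` is invented for illustration, and the reverse (feedback) data flow is not modelled.

```python
def ring_step(layers, ops, injected=None):
    """Advance a circular pipeline by one clock cycle.

    layers[i] is the value latched at the output of layer i (None = empty).
    Layer i reads layer i-1; layer 0 reads the last layer, which closes the ring.
    """
    n = len(layers)
    inputs = [layers[(i - 1) % n] for i in range(n)]
    if injected is not None:
        inputs[0] = injected  # direct data injection replaces the ring input of layer 0
    return [None if x is None else op(x) for op, x in zip(ops, inputs)]


# Hypothetical per-layer operations for a 4-layer ring.
ops = [lambda v: v + 1, lambda v: v * 2, lambda v: v - 3, lambda v: v]

state = [None] * 4
trace = []
for cycle in range(5):
    state = ring_step(state, ops, injected=5 if cycle == 0 else None)
    trace.append(state)
```

After four cycles the injected value has crossed all four layers; on the fifth cycle it re-enters layer 0, which is the circular behaviour the slide describes.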
The Systolic Ring - System Overview
• Not a stand-alone solution
• Works with a general-purpose processor as a coprocessor for data-flow-oriented applications
• Diagram labels: general-purpose processor, management code, configuration sequencer, configware, configuration layer, operative layer, data RAM
The Systolic Ring - Features
• RING-8 (8 Dnodes)
• 0.18 µm technology
• 3.3 mm²
• 200 MHz
• 1600 MIPS
• 1600 MMAC/s
• Shrinking process geometries allow a larger number of Dnodes
[Figure: operative layer layout with Switch 1 to Switch 4, Dnodes N1,1 to N4,2 and BN1 to BN4]
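The peak figures follow from simple arithmetic if each Dnode completes one instruction (or one MAC) per clock cycle, as stated for the building block. The helper name `peak_rate_millions` is hypothetical, and the per-area figure is derived here rather than quoted from the slides.

```python
def peak_rate_millions(dnodes, clock_mhz, ops_per_cycle=1):
    """Peak rate in millions of operations per second.

    Assumes every Dnode completes ops_per_cycle operations on every clock cycle.
    """
    return dnodes * clock_mhz * ops_per_cycle


ring8_mips = peak_rate_millions(dnodes=8, clock_mhz=200)  # 8 * 200 = 1600
ring8_mips_per_mm2 = ring8_mips / 3.3                      # derived figure, about 485
```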
Application example - 8*8 2D DCT
• Even-odd frequency decomposition
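As a rough illustration of even-odd decomposition, the sketch below computes an unnormalised 8-point DCT-II two ways: directly, and by first forming the sums x[n] + x[7-n] (used by even-indexed outputs) and differences x[n] - x[7-n] (used by odd-indexed outputs). The scaling convention, function names and sample values are assumptions for the example; the slides do not specify them.

```python
import math


def dct8_direct(x):
    """Unnormalised 8-point DCT-II, computed from the definition."""
    return [sum(x[n] * math.cos((2 * n + 1) * k * math.pi / 16) for n in range(8))
            for k in range(8)]


def dct8_even_odd(x):
    """Same transform via even-odd decomposition (4 products per output instead of 8)."""
    sums = [x[n] + x[7 - n] for n in range(4)]    # feed the even-indexed outputs
    diffs = [x[n] - x[7 - n] for n in range(4)]   # feed the odd-indexed outputs
    out = []
    for k in range(8):
        half = sums if k % 2 == 0 else diffs
        out.append(sum(half[n] * math.cos((2 * n + 1) * k * math.pi / 16)
                       for n in range(4)))
    return out


samples = [52, 55, 61, 66, 70, 61, 64, 73]  # arbitrary test data
max_error = max(abs(a - b)
                for a, b in zip(dct8_direct(samples), dct8_even_odd(samples)))
```

The decomposition works because the cosine term for sample 7-n equals the term for sample n multiplied by (-1)^k, so each output needs only four multiply-accumulate steps on pre-added or pre-subtracted inputs.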
Application example - 8*8 2D DCT: cycle-by-cycle mapping
• Cycle 0: inputs x0, x7
• Cycle 1: inputs x1, x6; x0+x7 and x0-x7 available; 1/8, MAC, MAC
• Cycle 2: inputs x2, x5; x1+x6 and x1-x6 available; 1/8, MAC, MAC
• Cycle 3: inputs x3, x4; x2+x5 and x2-x5 available; 1/8, MAC, MAC
• Cycle 4: x3+x4 and x3-x4 available; 1/8, MAC, MAC
• Cycle 5: outputs z0 and z1; both accumulators cleared
• 2 transformed samples are computed every 5 clock cycles
• The 8*8 2D-transformed samples are obtained every 320 clock cycles
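The 320-cycle figure can be reproduced from the 2-samples-per-5-cycles rate if the 2D transform is done as 8 row transforms followed by 8 column transforms. That row-column organisation is an assumption made for this sketch (the slides do not state it), and the function name is hypothetical.

```python
def cycles_per_block(size=8, samples_per_pass=2, cycles_per_pass=5):
    """Clock cycles for one size*size 2D DCT at a fixed output rate.

    Assumes a row-column scheme: `size` 1D transforms on the rows, then `size` on the columns.
    """
    one_d_transforms = 2 * size                   # 8 rows + 8 columns
    output_samples = one_d_transforms * size      # 16 transforms * 8 outputs = 128
    passes = output_samples // samples_per_pass   # 64 passes of 2 samples each
    return passes * cycles_per_pass


block_cycles = cycles_per_block()  # 64 * 5 = 320
```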
Application example - 8*8 2D DCT: 64*64 image example
• One 8*8 2D DCT is performed by 4 Dnodes
• RING-N implementation (N Dnodes)
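A back-of-the-envelope estimate of how the cycle count scales with ring size, assuming a RING-N is split into N/4 independent groups of 4 Dnodes that each process whole 8*8 blocks at 320 cycles per block. The grouping and the function name are assumptions for illustration; the results agree with the cycle counts given for RING-16 and RING-64 on the comparison slide.

```python
import math


def image_cycles(n_dnodes, image_size=64, block_size=8,
                 dnodes_per_dct=4, cycles_per_block=320):
    """Estimated cycles to transform a square image on a RING-N.

    Assumes N/4 groups of 4 Dnodes, each handling complete 8*8 blocks in parallel.
    """
    blocks = (image_size // block_size) ** 2   # a 64*64 image holds 64 blocks
    groups = n_dnodes // dnodes_per_dct        # parallel 2D-DCT units
    return math.ceil(blocks / groups) * cycles_per_block


ring16_cycles = image_cycles(16)   # 16 rounds of 320 cycles
ring64_cycles = image_cycles(64)   # 4 rounds of 320 cycles
```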
Application example - 8*8 2D DCT: 64*64 image example - Comparisons
• Pentium IV (Intel): 21248 cycles, 1200 MHz, 17.7 µs; superscalar
• TMS320C62 (TI): 10240 cycles, 300 MHz, 34.1 µs; VLIW
• DCT Core (Xilinx): 4171 cycles, 80 MHz*, 52.1 µs; fine-grain reconfigurable
• RING-16: 5120 cycles, 200 MHz, 12.8 µs; coarse-grain reconfigurable
• RING-64: 1280 cycles, 200 MHz, 6.4 µs; coarse-grain reconfigurable
• Comments listed on the slide: SSE2, Matrix, Even-Odd decomposition
• *Device dependent
• Processing time only
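Processing time in µs is the cycle count divided by the clock frequency in MHz. The sketch below applies that to the comparison figures. The pairing of each cycle count with a frequency is inferred from the arithmetic, because the slide's table layout did not survive extraction, so treat it as an assumption. Four quotients reproduce slide values (17.7, 34.1, 52.1 and 6.4 µs). For RING-16 the quotient is 25.6 µs, which does not appear among the slide's values (the remaining one is 12.8 µs); the source gives no explanation for the difference.

```python
# (cycles, clock in MHz); the pairing is inferred, not stated explicitly in the source.
figures = {
    "Pentium IV": (21248, 1200),
    "TMS320C62": (10240, 300),
    "DCT Core": (4171, 80),
    "RING-16": (5120, 200),
    "RING-64": (1280, 200),
}

# cycles / MHz gives microseconds directly.
times_us = {name: round(cycles / mhz, 1) for name, (cycles, mhz) in figures.items()}
```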
Application example - 8*8 2D DCT: 64*64 image example - Comparisons
[Figure: bar chart of cycle count (# cycles) and processing time (µs) for Pentium IV, DCT Core, C62, RING-16 and RING-64, with the value 52.1 µs annotated]
Conclusion
• Design: reconfigurable IP core for SoC; assembling software; RING-8 prototype
• Features: customisable IP core; good performance / area trade-off: RING-8 @ 200 MHz (0.18 µm), 3.3 mm², 1600 MIPS
• Results for DCT, wavelet transform, motion estimation