Developing an Architecture for a Single-Flux Quantum Based Reconfigurable Accelerator

F. Mehdipour, Hiroaki Honda*, H. Kataoka, K. Inoue and K. Murakami
Graduate School of Information Science and Electrical Engineering, Kyushu University, Japan
*Institute of Systems, Information Technologies and Nanotechnologies (ISIT), Fukuoka, Japan
E-mail: farhad@c.csce.kyushu-u.ac.jp

Agenda
• Introduction
• SFQ-LSRDP General Architecture
• The Design Procedure and Tool Chain
• Input/Output Nodes Placement
• Area Minimization
• Experimental Results
• Conclusions

CREST-JST SFQ-RDP Project (2006~): A low-power, high-performance reconfigurable processor based on single-flux quantum circuits
• Superconducting Research Lab. (SRL): SFQ process (Dr. S. Nagasawa et al.)
• Yokohama National Univ.: SFQ-FPU chip, cell library (Prof. N. Yoshikawa et al.)
• Nagoya Univ.: SFQ-RDP chip, cell library, and wiring (Prof. A. Fujimaki et al.)
• Nagoya Univ.: CAD for logic design and arithmetic circuits (Prof. N. Takagi (Leader) et al.)
• Kyushu Univ.: Architecture, compiler and applications, SFQ-LSRDP (Prof. K. Murakami et al.)

Goals
• Discovering appropriate scientific applications
• Developing compiler tools
• Developing performance analysis tools
• Designing and implementing the SFQ-LSRDP architecture, considering the features and limitations of SFQ circuits

How a reconfigurable processor works
[Figure: application code held in main memory is split between the GPP, which runs the non-critical code, and the LSRDP (rows of PEs connected by ORNs), which runs the computation-intensive (critical) code.]

Single-flux quantum (SFQ) against CMOS
• Main CMOS issues in implementing a large accelerator:
• High electric power consumption
• High heat radiation
• Difficulties in high-density packing
• SFQ features:
• High-speed switching and signal transmission
• Low power consumption
• Compact implementation (smaller area)
• Suitable for pipeline processing of data streams

Outline of large-scale reconfigurable data-path (LSRDP) processor
• Features:
• Handling data flow graphs (DFGs) extracted from scientific applications
• Pipeline execution
• Burst transfer of rearranged input/output data from/to memory
• Reduced number of memory accesses (alleviating the memory wall problem)
• Reconfigurable data-path components:
• A matrix of a large number of floating-point Functional Units (FUs)
• Reconfigurable Operand Routing Network (ORN)
• Dynamic reconfiguration facilities
• Streaming Buffer (SB) for I/O ports
[Figure: the LSRDP (rows of PEs separated by ORNs, with SBs at the I/O ports) next to the GPP, connected through the SMAC to the main memory and the scratchpad memory.]

SFQ-LSRDP General Architecture

LSRDP architecture
• Processing Elements (PEs), each including two components:
• FU (functional unit): implements basic 64-bit double-precision floating-point operations, including ADD/SUB and MUL
• TU (transfer unit): a routing resource for transferring data between non-consecutive rows
[Figure: PE array between the input ports and the output ports, with an example DFG node (MUL, node 15) mapped onto an FU. Figure labels: "PE including two components", "TU: four functionalities".]

PE structures
• Basic arch.: 3 inputs / 2 outputs
• PE arch. I: 4 inputs / 3 outputs
• PE arch. II: 3 inputs / 3 outputs
[Figure: FU/TU composition of each PE structure and the configurations it supports.]

Layout types - Type I
• Each PE implements ADD/SUB and MUL
• Flexible but consumes a lot of resources
[Figure: W × H array in which every PE contains an ADD/SUB unit (A), a MUL unit (M), and a Transfer Unit (T), with an ORN between consecutive rows.]

Layout types - Type II
• Each PE implements ADD/SUB or MUL
[Figure: W × H array in which every PE contains either an ADD/SUB unit (A) or a MUL unit (M), plus a Transfer Unit (T), with an ORN between consecutive rows.]

Maximum connection length (MCL) - Definition
• MCL: the maximum horizontal distance between two connected PEs located in two subsequent rows
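The definition above can be illustrated with a minimal Python sketch. The data layout is an assumption made for illustration only: `columns` maps each placed node to its column index, and `edges` lists connections between nodes in subsequent rows. Neither name comes from the original tool chain.

```python
def max_connection_length(columns, edges):
    """Return the largest horizontal distance over all row-to-row connections.

    columns: dict mapping node name -> column index (hypothetical layout)
    edges:   iterable of (source, destination) pairs in subsequent rows
    """
    return max(abs(columns[src] - columns[dst]) for src, dst in edges)


# Node "a" sits in column 0 and feeds nodes in columns 2 and 1 of the next row.
example_columns = {"a": 0, "b": 2, "c": 1}
example_edges = [("a", "b"), ("a", "c")]
print(max_connection_length(example_columns, example_edges))
```

A smaller MCL means a narrower ORN between rows, which is why the later slides try to minimize it.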

An ORN structure
• The ORN consists of 2-bit shift registers and 1-by-2 and 2-by-2 cross-bar switches
A. Fujimaki, et al., "Demonstration of an SFQ-Based Accelerator Prototype for a High-Performance Computer," ASC08, 2008.

Dynamic reconfiguration architecture
• Three bit-stream lines for dynamic reconfiguration of:
• Immediate registers (64-bit) in each PE
• Selector bits for the muxes selecting the input data of FUs
• Cross-bar switches in ORNs

What should be decided during the design procedure
• Width and height?
• The number of I/O ports?
• Maximum Connection Length (MCL)?
• ORN size and structure?
• Layout: FU types (ADD/SUB and MUL)?
• Reconfiguration mechanism? (PE, ORN, immediate data)
• On-chip memory configuration?

The Design Procedure and Tool Chain

Compiler and design flow
• DFGs are manually generated
• DFG mapping results are employed for:
• Analyzing LSRDP architecture statistics (a quantitative approach)
• Generating LSRDP configuration bit-streams

Benchmark applications
• Finite difference method calculation of 2nd-order partial differential equations
• 1-dim. heat equation (Heat)
• 1-dim. vibration equation (Vibration)
• 2-dim. Poisson equation (Poisson)
• Quantum chemistry application
• Recursive parts of the Electron Repulsion Integral calculation (ERI-Rec)
• Types of operations in the calculations: ADD/SUB and MUL

DFG extraction - Heat equation
• 1-dim. heat equation for T(x,t)
• Calculation by the Finite Difference Method (FDM) (A is const.)
• The basic DFG can be extended in the horizontal and vertical directions to make a larger DFG
[Figure: the basic DFG of one FDM update.]
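The equation on this slide did not survive extraction, so the following Python sketch assumes the standard explicit FDM update T(x, t+dt) = T(x, t) + A * (T(x+dx, t) - 2 T(x, t) + T(x-dx, t)), with A the constant mentioned above. The function name and the fixed-boundary handling are illustrative choices, not taken from the original work. Each interior point needs only ADD/SUB and MUL operations, and repeating the update over more points (horizontal) or more time steps (vertical) corresponds to extending the basic DFG.

```python
def heat_step(T, A):
    """Apply one explicit FDM time step to the temperature samples in T.

    T: list of temperatures at consecutive grid points
    A: constant coefficient (assumed to be alpha * dt / dx**2)
    The two boundary samples are held fixed (an assumption of this sketch).
    """
    new_T = list(T)
    for i in range(1, len(T) - 1):
        # One "basic DFG": two ADD/SUB on the neighbours, MULs by constants, one ADD.
        new_T[i] = T[i] + A * (T[i + 1] - 2.0 * T[i] + T[i - 1])
    return new_T


# Horizontal extension = more grid points; vertical extension = more time steps.
state = [0.0, 1.0, 0.0]
state = heat_step(state, 0.25)
print(state)
```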

A sample DFG - Heat
• Inputs: 32
• Outputs: 16
• Operations: 721
• Immediates: 364

DFG mapping flow
[Figure: DFG mapping flow; in the mapped example the longest connections give MCL = 2.]

Placing Input/Output Nodes

Fan-out based I/O nodes placement
• ni: the number of children of input node i
• Ci1, Ci2, Ci3, …, Ci,ni: locations of the children of input node i (in sorted order)
• X: location of the input node i
• Total Connection Length: TCL = |Ci1-X| + |Ci2-X| + … + |Ci,ni-X|
• Objective: minimize TCL
• ni = 1: X = Ci1
• ni = 2: Ci1 <= X <= Ci2
• ni = 3: X = Ci2
• ni > 3: X = Cij, j = 2…ni-1
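A short Python sketch of the objective above. It relies on the well-known fact that a sum of absolute distances is minimized at a median of the points, which agrees with the ni = 1, 2, 3 cases listed on the slide. The function names are hypothetical, and picking the lower median for even ni is a choice of this sketch rather than something the slide specifies.

```python
def total_connection_length(x, children):
    """TCL = |C1 - x| + |C2 - x| + ... for an input node placed at column x."""
    return sum(abs(c - x) for c in children)


def best_input_column(children):
    """Return a column minimizing TCL: the (lower) median of the child columns."""
    ordered = sorted(children)
    return ordered[(len(ordered) - 1) // 2]


children = [9, 2, 5]
x = best_input_column(children)
print(x, total_connection_length(x, children))
```

For an even number of children any column between the two middle children gives the same TCL, so the lower median is just one valid pick.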

One main reason for the large MCL
• Input ports are far from each other

Proximity-factor based placement
• The proximity factor indicates how far a pair of input ports should be located from each other
• For a pair of input nodes, the larger the number of close common descendants, the higher the proximity factor assigned
• Si,j: the set of common descendants of input nodes i and j
• Dk,i (= Dk,j): distance of common descendant node k to the input nodes i and j (it is equal to the ASAP execution level of the node)
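The exact proximity-factor formula is not present in this transcript. The Python sketch below therefore uses an assumed form, PF(i, j) = sum over k in Si,j of 1 / Dk, chosen only because it grows with the number of common descendants and with their closeness, as the bullets above require. The DFG encoding (a dict from node to list of children) and all function names are hypothetical.

```python
def descendants(dfg, node):
    """Collect every node reachable from `node` in the DFG (dict: node -> children)."""
    seen, stack = set(), list(dfg.get(node, ()))
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(dfg.get(current, ()))
    return seen


def asap_levels(dfg, inputs):
    """ASAP level of each node: inputs are level 0, others 1 + max parent level."""
    level = {i: 0 for i in inputs}
    changed = True
    while changed:  # repeated relaxation; terminates because a DFG is acyclic
        changed = False
        for parent, kids in dfg.items():
            if parent not in level:
                continue
            for kid in kids:
                if level.get(kid, -1) < level[parent] + 1:
                    level[kid] = level[parent] + 1
                    changed = True
    return level


def proximity_factor(dfg, inputs, i, j):
    """ASSUMED formula: sum of 1 / ASAP level over the common descendants of i and j."""
    level = asap_levels(dfg, inputs)
    common = descendants(dfg, i) & descendants(dfg, j)
    return sum(1.0 / level[k] for k in common)


# Tiny hypothetical DFG: I1 and I2 share a level-1 node, I3 only joins at level 2.
dfg = {"I1": ["n1"], "I2": ["n1"], "I3": ["n2"], "n1": ["n2"], "n2": []}
inputs = ["I1", "I2", "I3"]
print(proximity_factor(dfg, inputs, "I1", "I2"), proximity_factor(dfg, inputs, "I1", "I3"))
```

With this assumed weighting, I1 and I2 get a higher factor than either gets with I3, so they would be placed closer together.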

Proximity factor - Example
• Input nodes I1 and I2 should be located closer to each other than to I3
[Figure: example DFG with input nodes I1, I2, I3 and operation nodes 1 to 7.]

Input nodes placement alg.: Example
• Rule for placing input node j: if C(l) > C(r) then l = l+1, L[l] = j; else r = r+1, L[r] = j
• Placing the 1st input node with the highest proximity factor: … N/2-3, N/2-2, N/2-1, [1], N/2+1, N/2+2, N/2+3 …
• Placing the 2nd input node with the highest proximity factor: … N/2-3, N/2-2, [2], [1], N/2+1, N/2+2, N/2+3 …

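Input ports placement alg.: Example
• Placing the i-th input node; current row: … N/2-K … [2] [1] [3] … N/2+M, with l and r marking the left and right ends of the placed nodes
• If C(l) > C(r): … [i] … [2] [1] [3] … N/2+M (node i is placed at the left end)
• If C(r) > C(l): … N/2-K … [2] [1] [3] … [i] (node i is placed at the right end)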
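The two example slides above suggest a centre-out placement: the first node goes to the middle port and each later node is attached to the left or right end of the placed group. The Python sketch below is one possible reading. The slides do not define C(l) and C(r), so here they are assumed to be the proximity factor between the candidate node and the node currently at the left or right end; the seed choice and all names are likewise assumptions.

```python
from collections import deque


def place_inputs(nodes, pf):
    """Order input nodes centre-out using pairwise proximity factors.

    nodes: list of input node names
    pf:    dict mapping frozenset({a, b}) -> proximity factor (missing pairs = 0)
    Returns the nodes in left-to-right port order.
    """
    def closeness(a, b):
        return pf.get(frozenset((a, b)), 0.0)

    remaining = list(nodes)
    # Seed (assumption): the node with the largest total proximity to all others.
    seed = max(remaining, key=lambda n: sum(closeness(n, m) for m in remaining if m != n))
    remaining.remove(seed)
    row = deque([seed])
    while remaining:
        # Next node: the one most strongly tied to either end of the placed group.
        j = max(remaining, key=lambda n: max(closeness(n, row[0]), closeness(n, row[-1])))
        remaining.remove(j)
        if closeness(j, row[0]) > closeness(j, row[-1]):  # plays the role of C(l) > C(r)
            row.appendleft(j)
        else:
            row.append(j)
    return list(row)


pf = {
    frozenset(("I1", "I2")): 1.5,
    frozenset(("I1", "I3")): 0.5,
    frozenset(("I2", "I3")): 0.5,
}
print(place_inputs(["I1", "I2", "I3"], pf))
```

The returned order would then be centred on port N/2, so strongly related inputs end up in neighbouring ports.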

Area Minimization

Estimating the area of a PE
• Area(FU) = Area(ADD/SUB) = Area(MUL)
• Area(TU) = Area(MUX) ~ 0.1 × Area(FU)
• PE basic arch. (FU, TU): Layout I: Area(PE) = 2.1 × Area(FU); Layout II: Area(PE) = 1.1 × Area(FU)
• PE arch. I (TU, FU, TU): Layout I: Area(PE) = 2.2 × Area(FU); Layout II: Area(PE) = 1.2 × Area(FU)
• PE arch. II (FU, TU, mux): Layout I: Area(PE) = 2.2 × Area(FU); Layout II: Area(PE) = 1.2 × Area(FU)
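The PE area figures above follow from simple arithmetic, sketched here in Python with Area(FU) as the unit. The count of 0.1-sized units per architecture (one for the basic arch., two for arch. I and arch. II) is inferred from the totals on the slide and should be treated as an assumption; the names are illustrative.

```python
TU_AREA = 0.1  # Area(TU) = Area(MUX) ~ 0.1 x Area(FU), as stated on the slide

# Number of TU/mux-sized units per PE, inferred from the slide's totals (assumption).
SMALL_UNITS = {"basic": 1, "arch_I": 2, "arch_II": 2}


def pe_area(arch, layout):
    """PE area in units of Area(FU).

    Layout "I":  each PE holds both an ADD/SUB and a MUL unit (2 FUs).
    Layout "II": each PE holds either an ADD/SUB or a MUL unit (1 FU).
    """
    fu_count = 2 if layout == "I" else 1
    return fu_count + SMALL_UNITS[arch] * TU_AREA


for arch in ("basic", "arch_I", "arch_II"):
    print(arch, round(pe_area(arch, "I"), 2), round(pe_area(arch, "II"), 2))
```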

Estimating the ORN area - PE basic arch. (3 inputs / 2 outputs)
• Number of rows = 1.5 × W
• Number of columns = 4 × MCL
• Area(ORN) = 1.5 × W × (4 × MCL) × Area(CB)
• W: the number of PEs in an RDP row
[Figure: ORN for the basic PE arch. (FU, TU), drawn for MCL = 1.]

Estimating the ORN area - PE arch. I (4 inputs / 3 outputs)
• Number of rows = 2 × W
• Number of columns = 6 × MCL + 2
• Area(ORN) = 2 × W × (6 × MCL + 2) × Area(CB)
[Figure: ORN for PE arch. I (TU, FU, TU), drawn for MCL = 1.]

Estimating the ORN area - PE arch. II (3 inputs / 3 outputs)
• Number of rows = 1.5 × W
• Number of columns = 4 × MCL + 1
• Area(ORN) = 1.5 × W × (4 × MCL + 1) × Area(CB)
[Figure: ORN for PE arch. II, drawn for MCL = 2.]
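The three ORN area formulas can be collected into one small Python helper, which makes it easy to compare architectures for a given W and MCL. The result is in units of Area(CB); the table and function names are illustrative, and the coefficients are copied from the three slides above.

```python
# (row factor, column factor, column offset) per PE architecture, from the slides:
#   rows = row_factor * W, columns = col_factor * MCL + col_offset
ORN_SHAPE = {
    "basic": (1.5, 4, 0),
    "arch_I": (2.0, 6, 2),
    "arch_II": (1.5, 4, 1),
}


def orn_area(arch, width, mcl):
    """ORN area between two PE rows, in units of Area(CB)."""
    row_factor, col_factor, col_offset = ORN_SHAPE[arch]
    rows = row_factor * width
    columns = col_factor * mcl + col_offset
    return rows * columns


# Hypothetical comparison for a row of 10 PEs and MCL = 2.
for arch in ORN_SHAPE:
    print(arch, orn_area(arch, width=10, mcl=2))
```

The formulas show that ORN area grows linearly in both W and MCL, which is why reducing MCL pays off directly in area.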

A modified connection length measurement
• New measurement technique for the net length between a source (src) and a destination (dest), with horizontal distance dh and vertical distance dv
• Initial C.L. = dh
• Modified C.L. = dh / dv
• Example, src to dest1: C.L. (previous) = 3, C.L. (new) = 3
• Example, src to dest2: C.L. (previous) = 3, C.L. (new) = 1
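A minimal Python sketch of the two measurements. The pairs (dh = 3, dv = 1) and (dh = 3, dv = 3) are assumed values chosen to reproduce the slide's numbers (new C.L. of 3 and 1); the function names are hypothetical.

```python
def initial_connection_length(dh, dv):
    """Original measure: only the horizontal distance counts."""
    return dh


def modified_connection_length(dh, dv):
    """Modified measure: horizontal distance spread over the rows crossed (dh / dv)."""
    return dh / dv


# Same horizontal distance, different vertical distance (assumed dv values).
for dh, dv in [(3, 1), (3, 3)]:
    print(dh, dv, initial_connection_length(dh, dv), modified_connection_length(dh, dv))
```

The modified measure rewards connections that span several rows, since the horizontal shift can then be distributed over multiple ORN stages.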

A modified connection length measurement - Example
• Parent 2 is chosen when C.L. is measured as dh/dv (MCL = 1)
• Parent 1 is chosen when C.L. is measured as dh (MCL = 2)
First table (each entry is a pair of values):
dh        dh/dv
0, 4      0, 4/3
1, 3      1, 1
2, 2      2, 2/3
3, 1      3, 1/3
4, 0      4, 0
Second table (each entry is a pair of values):
dh        dh/dv
0, 4      0, 1
1, 3      0.5, 0.75
2, 2      1, 0.5
3, 1      3/2, 1/4
4, 0      2, 0

MCL minimization - Using an MCL threshold
• A maximum threshold is assumed for the MCL
• During the placement process, for each C.L. larger than the threshold, the vertical distance is increased according to dv = CL / MCL_Threshold
• The PE with the minimum C.L. to the source (src) is selected
• Example: max permitted length = 2, dh = 3 > max permitted length, dv = 1
• The destination is moved down: dv = dv + [3/2] = dv + 1 = 2
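The threshold rule can be sketched in Python as follows. Reading the bracket [3/2] as the integer part is an assumption, made because it reproduces the slide's result (dv goes from 1 to 2); the function name is hypothetical.

```python
def stretch_vertical(dh, dv, threshold):
    """Increase dv when the horizontal distance exceeds the MCL threshold.

    Mirrors the slide's example: dv = dv + [dh / threshold], with [.] taken
    as the integer part (assumption).
    """
    if dh > threshold:
        dv += dh // threshold
    return dv


dh, dv, threshold = 3, 1, 2
new_dv = stretch_vertical(dh, dv, threshold)
print(new_dv, dh / new_dv)  # new vertical distance and resulting dh/dv
```

In the slide's example the stretched connection has dh/dv = 3/2, which is within the permitted length of 2.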

Basic placement and routing vs. integrated placement and routing
• Both flows take the DFG and the LSRDP Architecture Description as inputs and produce a Final Map
• Basic Placement and Routing Flow (steps as labeled in the figure): Placing Input Nodes; Placing Operational & Output Nodes; Routing Nets; Routing IO Nets
• Integrated Placement and Routing Flow (steps as labeled in the figure): Placing Input Nodes using PF-based alg.; Placing Operational Nodes & Routing Nets (node by node); Placing Output Nodes; Routing Output Nets

Experimental Results

Specifications of the benchmark DFGs

Evaluation results for various architectures - MCL and ORN sizes
• S2 results in smaller MCL and ORN size for both layout types

Evaluation results for various architectures - no. of utilized PEs
• By using lhv, a larger number of RDP rows is utilized
• A larger number of PEs is employed for S2

Evaluation results for various architectures - overall LSRDP area (KJJ)
• Compared structures: Basic PE arch. (3 inputs / 2 outputs), PE arch. I (4 inputs / 3 outputs), PE arch. II (3 inputs / 3 outputs)
• S2 results in smaller overall area in terms of KJJ for both layout types
• Layout II results in smaller area
• PE arch. II gives smaller area

A sample ORN implementation
[Figure: block diagram of a high-frequency test bench. Labels: clkin_hf, ladder, clkin_lfout, clkin_lfin, circuit under test, data_in, data_out, input shift register, output shift register.]
[Figure: photograph of a chip (5 mm scale) with a 1-to-3 ORN prototype, showing the test bench, circuit under test, ladder, input shift register, and output shift register.]

Conclusions
• SFQ-LSRDP is a basic core of a high-performance, low-power computer
• Data Flow Graphs (DFGs) extracted from scientific applications are mapped onto the LSRDP
• The LSRDP micro-architecture is designed based on the characteristics of DFGs via a quantitative approach
• LSRDP is promising for resolving issues originating from CMOS technology as well as for achieving remarkable performance
Acknowledgement: This research was supported in part by Core Research for Evolutional Science and Technology (CREST) of Japan Science and Technology Corporation (JST).

Thanks for your attention! Any questions?
