Farhad Mehdipour, Hiroaki Honda, Hiroshi Kataoka, Koji Inoue, Kazuaki Murakami

Hardware and Software Requirements for Implementing a High-Performance Superconductivity Circuits-Based Accelerator
Farhad Mehdipour, Hiroaki Honda, Hiroshi Kataoka, Koji Inoue, Kazuaki Murakami
Kyushu University, Japan

CREST-JST (2006~): Low-power, high-performance, reconfigurable processor using single-flux quantum (SFQ) circuits
• Superconducting Research Lab. (SRL): SFQ process (S. Nagasawa et al.)
• Yokohama National Univ.: SFQ-FPU chip, cell library (N. Yoshikawa et al.)
• Nagoya Univ.: SFQ-RDP chip, cell library, and wiring (A. Fujimaki et al.)
• Nagoya Univ.: CAD for logic design and arithmetic circuits (N. Takagi (Leader) et al.)
• Kyushu Univ.: Architecture, compiler and applications (K. Murakami, K. Inoue, H. Honda, F. Mehdipour, H. Kataoka)
Target: SFQ-LSRDP
Our mission: Architecture, compiler and application development

Outline of Large-Scale Reconfigurable Data-Path (LSRDP) Processor
SFQ features:
• High-speed switching and signal transmission
• Low power consumption
• Compact implementation (smaller area)
• Suitable for pipeline processing

How it works
Figure labels: GPPs, memory controllers, memory, buffers, LSRDP, configuration bit-stream.
GPP code with inserted LSRDP calls:
    inst; inst; …
    conf_LSRDP();
    Loop:
        rearrange_input_data();
        set_IO_info();
        run_LSRDP();
        inst; …
        sync_lsrdp();
        rearrange_output_data();
    End_Loop
    inst; …
Timeline: the GPP executes conf_LSRDP(), rearrange_input_data(), set_IO_info() and run_LSRDP(); at sync_lsrdp() it waits for the LSRDP; once the LSRDP terminates the operation, the GPP runs rearrange_output_data().
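The call sequence on this slide can be mimicked with a small host-side mock. This is only a sketch under stated assumptions: the class MockLSRDP, the function offload, and the idea of passing a Python callable as the "configuration" are hypothetical stand-ins, not the project's API. The mock also computes its result inside sync_lsrdp, whereas the real accelerator would run concurrently with the GPP.

```python
class MockLSRDP:
    """Stand-in for the accelerator; all names here are hypothetical."""

    def __init__(self):
        self.log = []        # order of calls, kept for inspection
        self.kernel = None   # set once by conf_lsrdp
        self.pending = None  # input block handed over by run_lsrdp

    def conf_lsrdp(self, kernel):
        # Configure once, before the loop (conf_LSRDP on the slide).
        self.kernel = kernel
        self.log.append("conf")

    def run_lsrdp(self, block):
        # Start the accelerator; the GPP may continue with other instructions.
        self.pending = block
        self.log.append("run")

    def sync_lsrdp(self):
        # Wait for the accelerator to finish, then hand back its outputs.
        self.log.append("sync")
        result = [self.kernel(x) for x in self.pending]
        self.pending = None
        return result


def offload(blocks, kernel):
    """Run every input block through the mock, following the slide's loop."""
    dev = MockLSRDP()
    dev.conf_lsrdp(kernel)
    outputs = []
    for block in blocks:
        staged = list(block)              # rearrange_input_data + set_IO_info
        dev.run_lsrdp(staged)
        # ... independent GPP instructions could execute here ...
        outputs.extend(dev.sync_lsrdp())  # rearrange_output_data
    return outputs, dev.log


out, log = offload([[1, 2], [3, 4]], lambda x: 2 * x + 1)
print(out)
print(log)
```

The log makes the ordering visible: one configuration step, then a run/sync pair per loop iteration.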

Architecture Exploration
Explored dimensions: LSRDP layouts, PE structures, ORN structures
PE structures:
• Basic PE arch.: 3 inputs / 2 outputs
• PE arch. I: 4 inputs / 3 outputs
• PE arch. II: 3 inputs / 3 outputs
Layout labels from the figure, in the order shown:
• Number of rows = 2×M; 1.5×M; 1.5×M
• Number of columns = 6×MCL+2; 4×MCL; 4×MCL+1
• MCL = 1; 1; 2

LSRDP Tool Chain
Inputs:
• Application C code
• LSRDP architecture description
• LSRDP library file (function definitions & declarations)
Steps:
• Modifying application code: inserting LSRDP instructions in the code → modified application code
• ISAcc or COINS compiler → binary code
• DFG extraction → data flow graphs
• Placing and Routing Tool → configuration file + various text & schematic reports
• Simulator → performance evaluation
Flows:
• 1: flow of the assembly code generation for GPP
• 2: flow of configuration bit-stream generation for the LSRDP

Mapping DFGs onto LSRDP
Inputs: DFG, LSRDP architecture description
Steps:
• Placing input nodes
• Placing operational & output nodes
• Routing nets
• Routing IO nets
• Final map
Figure label: longest connections
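As a rough illustration of the placement steps, the sketch below assigns each DFG node to a row and reports the longest connection. It assumes a plain as-soon-as-possible levelization (inputs in row 0, every other node one row below its deepest predecessor). That rule, the names asap_rows and longest_connection, and the toy DFG are assumptions made for this example; the slide does not state which placement heuristic the tool actually uses.

```python
def asap_rows(preds):
    """Map each DFG node to a row index.

    preds maps a node name to the list of nodes it reads from.
    Nodes without predecessors (inputs) go to row 0.
    """
    rows = {}

    def row_of(node):
        if node not in rows:
            sources = preds.get(node, [])
            rows[node] = 0 if not sources else 1 + max(row_of(s) for s in sources)
        return rows[node]

    for node in preds:
        row_of(node)
    return rows


def longest_connection(preds, rows):
    """Largest row distance spanned by any single DFG edge."""
    return max(rows[n] - rows[s] for n, srcs in preds.items() for s in srcs)


# Toy DFG (hypothetical): out = (a + b) * c combined with a again.
dfg = {
    "a": [], "b": [], "c": [],
    "add": ["a", "b"],
    "mul": ["add", "c"],
    "out": ["mul", "a"],
}
rows = asap_rows(dfg)
print(rows)
print(longest_connection(dfg, rows))
```

In this toy graph the edge from "a" to "out" spans three rows, which is the kind of long connection the routing steps then have to handle.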

Global routing algorithms
Routing DFG connections between source and destination PEs:
• Exhaustive search-based: very time consuming
• Branch and bound algorithm: very fast
Figure labels: src, dest, vacant PE, fully occupied PE
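To show why bounding helps compared with exhaustive search, here is a minimal branch-and-bound router on a generic grid. It is a sketch, not the tool's algorithm: the 4-neighbour grid model, the Manhattan-distance lower bound, and the name route_bnb are all assumptions introduced for illustration. Cells in the occupied set play the role of the fully occupied PEs in the figure.

```python
def route_bnb(width, height, occupied, src, dst):
    """Return a shortest path (list of cells) from src to dst, or None.

    Depth-first search that abandons a partial route as soon as
    hops so far + Manhattan distance to dst cannot beat the best route.
    """
    best = {"hops": None, "path": None}

    def lower_bound(cell):
        return abs(cell[0] - dst[0]) + abs(cell[1] - dst[1])

    def search(cell, path, visited):
        hops = len(path) - 1
        if best["hops"] is not None and hops + lower_bound(cell) >= best["hops"]:
            return  # bound: this branch cannot improve on the best route
        if cell == dst:
            best["hops"], best["path"] = hops, list(path)
            return
        x, y = cell
        candidates = [(x + 1, y), (x - 1, y), (x, y + 1), (x, y - 1)]
        candidates = [
            c for c in candidates
            if 0 <= c[0] < width and 0 <= c[1] < height
            and c not in occupied and c not in visited
        ]
        # Branch: try the neighbours closest to the destination first.
        for nxt in sorted(candidates, key=lower_bound):
            visited.add(nxt)
            path.append(nxt)
            search(nxt, path, visited)
            path.pop()
            visited.remove(nxt)

    search(src, [src], {src})
    return best["path"]


# A 4x3 grid where column 1 is blocked except in the last row.
path = route_bnb(4, 3, {(1, 0), (1, 1)}, (0, 0), (2, 0))
print(path)
```

Without the bound check the same search would enumerate every simple path, which is the exhaustive variant the slide describes as very time consuming.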

Micro-Routing Problem Definition
Figure labels: FUs and TUs of the i-th row, the ORN, and the (i+1)-th row.
• Inputs
  • LSRDP basic specifications: layout, width (W), MCL, PE arch., etc.
  • List of connections between consecutive rows
  • ORN structure, including:
    • The number of CBs and T2s in each row
    • The number of CB rows
    • Topology of connections among CBs
• Output
  • Detailed routes via cross-bar switches
  • The list of CBs used for routing each connection
  • Configuration of CBs
A micro-routing algorithm has been implemented for the LSRDP with underlying layout II and PE arch. III.

ORN Micro-routing Example
• CB: 2-input/2-output
• ½CB: 1-input/2-output
Nets between the two PE rows:
• PE1 → PE5
• PE2 → PE5, PE6, PE7
• PE3 → PE6, PE8
• PE4 → PE7, PE8
Figure labels: PE1 to PE8, CBs and ½CBs, micro-nets numbered 1 to 4, CB configuration codes 00, 01, 10, 11.
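The switch types and nets in this example can be modeled in a few lines. The sketch assumes that a CB's two-bit code selects, per output, which of the two inputs is forwarded; the slide lists the codes 00, 01, 10 and 11 but does not define their meaning, so this encoding is a guess. The function names cb, half_cb and fan_in are hypothetical.

```python
def cb(config, in0, in1):
    """2-input/2-output crossbar.

    Assumed encoding: config[0] selects the source of output 0 and
    config[1] the source of output 1 ('0' -> in0, '1' -> in1).
    """
    inputs = (in0, in1)
    return inputs[int(config[0])], inputs[int(config[1])]


def half_cb(in0):
    """1-input/2-output half crossbar: the input fans out to both outputs."""
    return in0, in0


# Nets taken from the slide: source PE -> destination PEs.
nets = {
    "PE1": ["PE5"],
    "PE2": ["PE5", "PE6", "PE7"],
    "PE3": ["PE6", "PE8"],
    "PE4": ["PE7", "PE8"],
}


def fan_in(nets):
    """Collect, for every destination PE, the list of PEs driving it."""
    result = {}
    for src, dests in nets.items():
        for d in dests:
            result.setdefault(d, []).append(src)
    return result


print(cb("01", "a", "b"))  # under the assumed encoding: pass straight through
print(cb("10", "a", "b"))  # under the assumed encoding: swap the two signals
print(fan_in(nets))
```

Under this model, codes 00 and 11 broadcast one input to both outputs, which is one way a multi-destination net such as PE2 → PE5, PE6, PE7 could be served.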

ORN Micro-Routing Example: Heat 8x2, ORN between the 3rd and 4th rows
[Figure: routed ORN between the PEs in the 3rd row and the PEs in the 4th row]

Specifications of Attempted DFGs

Example of a DFG Mapping: Vibration 8x2

Results of routing nets using the proposed algorithms

Thank You for Your Attention! Any Questions?

10 TFLOPS SFQ-RDP computer
• SFQ RDP (4.2 K, SFQ 0.5 μm process): 32 PEs × 32 chips, 2.5 GFLOPS/PE
• 1024 FPUs per MCM (34 chips) × 4 MCMs
• CMOS CPU (one chip)
• 2 TB memory module: FB-DIMM (DDR3 @ 1333 MHz, 128 GB) × 16 modules
• Memory bandwidth per MCM: 256 GB/s (= 16 GB/s × 16 channels)
Figure labels: FPU, ORN, SB, streaming memory access controller (SMAC)
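The headline figures on this slide follow from simple multiplication, which the short check below reproduces. All input numbers are taken from the slide; the variable names are made up for the example, and the bytes-per-FLOP ratio at the end is a derived quantity that the slide itself does not state.

```python
# Figures taken from the slide.
PES_PER_CHIP = 32
RDP_CHIPS_PER_MCM = 32
GFLOPS_PER_PE = 2.5
MCMS = 4
DIMM_CAPACITY_GB = 128
DIMM_COUNT = 16
CHANNEL_BW_GBS = 16
CHANNELS_PER_MCM = 16

fpus_per_mcm = PES_PER_CHIP * RDP_CHIPS_PER_MCM       # 1024 FPUs per MCM
peak_gflops_per_mcm = fpus_per_mcm * GFLOPS_PER_PE    # peak rate of one MCM
peak_tflops = peak_gflops_per_mcm * MCMS / 1000       # whole system
memory_gb = DIMM_CAPACITY_GB * DIMM_COUNT             # 2048 GB, i.e. 2 TB
bw_per_mcm_gbs = CHANNEL_BW_GBS * CHANNELS_PER_MCM    # 256 GB/s

# Derived ratio (not on the slide): memory bytes available per peak FLOP.
bytes_per_flop = bw_per_mcm_gbs / peak_gflops_per_mcm

print(fpus_per_mcm, peak_gflops_per_mcm, peak_tflops)
print(memory_gb, bw_per_mcm_gbs, bytes_per_flop)
```

The system peak comes out at 10.24 TFLOPS, consistent with the "10 TFLOPS" label, and the low bytes-per-FLOP ratio suggests why data reuse inside the data path matters for this design.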

Development of RDP Architecture
Chip micro-architecture:
• Two types of PEs: FPA and FPM
• PE layout: checkered pattern
• PE: three inputs (A, B, C) → three outputs (A(*B), B, C)
• Three scales of RDP (small, medium and large)
• TU: data through
Figure labels: FU, TU

Development of RDP Compiler
Inputs:
• Application C code
• RDP architecture description
• RDP library file (function definitions & declarations)
Steps:
• Modifying application code (manual): inserting LSRDP instructions in the code → modified code
• ISAcc or COINS compiler → .asm code for MIPS-based GPP
• DFG extraction (semi-manual) → data flow graphs
• Placement and Routing Tool → configuration file + various text and schematic reports
• Simulator → performance evaluation
Flows:
• 1: flow of the assembly code generation for GPP
• 2: flow of configuration bit-stream generation for the RDP

Development of RDP-Oriented Algorithms
• One-dimensional heat and vibration equations
• Two-dimensional heat and FDTD equations
• Two-electron repulsion integral calculation in quantum chemistry
• Runge-Kutta calculation for ordinary differential equations
Performance Evaluation
• Two-dimensional heat equation (1024x1024 mesh)
• SFQ-RDP 1): 50.6 GFlop/s vs. GPU 2): 63.0 GFlop/s
1) Evaluation method:
   RDP: execution time model; the DFG has 21 inputs, 9 outputs, and 63 operations
   GPP: cycle-accurate processor simulator; BW: 159.0 GB/s
2) T. Aoki and A. Nukada, "CUDA programming primer," Kougakusya, ISBN-10: 4777514773, 2009 (in Japanese).
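For readers unfamiliar with the listed kernels, the sketch below shows one explicit time step of a one-dimensional heat equation. It assumes a standard forward-time, centred-space discretization with fixed boundary values; the slides do not say which scheme the authors used, and the function name heat_step and the coefficient r are introduced here for illustration only.

```python
def heat_step(u, r):
    """One explicit step of the 1D heat equation.

    u is the list of temperatures, r = alpha * dt / dx**2.
    The two boundary values are kept fixed.
    """
    new = list(u)
    for i in range(1, len(u) - 1):
        # Each interior point needs only its two neighbours: a small,
        # loop-free expression of the kind a DFG can represent.
        new[i] = u[i] + r * (u[i - 1] - 2 * u[i] + u[i + 1])
    return new


u0 = [0.0, 0.0, 1.0, 0.0, 0.0]
print(heat_step(u0, 0.25))
```

The loop body has no data-dependent control flow, which is the property that makes such stencil updates natural candidates for mapping onto a data-path accelerator.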
