Processor Architectures and Program Mapping

Exploiting DLP: SIMD architectures
TU/e 5kk10
Henk Corporaal, Jef van Meerbergen, Bart Mesman

[Figure: flexibility versus efficiency across processor types: programmable CPU, programmable DSP, application specific instruction set processor (ASIP), application specific processor]
Processor Architectures and Program Mapping H. Corporaal, J. van Meerbergen, and B. Mesman

SIMD Performance
[Figure: computational efficiency [MOPS/W] (log scale, 10^0 to 10^6) versus feature size [µm] (2 down to 0.07), with regions for application specific cores, SIMD, and programmable processors. Source: Roza]

What are we talking about?
• ILP = Instruction Level Parallelism = the ability to perform multiple operations (or instructions), from a single instruction stream, in parallel
• VLIW = Very Long Instruction Word architecture
• Instruction format: | operation 1 | operation 2 | operation 3 | operation 4 | operation 5 |

SIMD: Topics Overview
• Enhance performance: architecture methods
• Data Level Parallelism
• Application area
• Subword parallelism
• Locally connected SIMDs
  • Xetal
• Fully connected SIMDs
  • Imagine
• Communication in SIMD processors
  • RC-SIMD
  • DC-SIMD

Enhance performance: 3 architecture methods
• (Super)-pipelining
• Powerful instructions
  • MD-technique
    • multiple data operands per operation
  • MO-technique
    • multiple operations per instruction
• Multiple instruction issue

Characteristics of Media Applications
• Poorly matched to conventional architectures
  • Caches
  • Instruction-Level Parallelism
  • Few arithmetic units
• Well-matched to modern VLSI technology
  • Lots (100's to 1000's) of ALUs fit on a single chip
Communication bandwidth is the scarce resource

Architecture methods: Powerful Instructions (1)
MD-technique
• Multiple data operands per operation
• SIMD: Single Instruction Multiple Data
Vector instruction:
    for (i = 0; i < 64; i++)
        c[i] = a[i] + 5*b[i];
which corresponds to the vector operation c = a + 5*b
Assembly:
    set   vl,64
    ldv   v1,0(r2)
    mulvi v2,v1,5
    ldv   v1,0(r1)
    addv  v3,v1,v2
    stv   v3,0(r3)
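The vector example can also be written as a small, self-contained C function. This is only a scalar sketch of what the vector assembly computes; the function name vec_axpy5 and the constant VL are illustrative assumptions, and the comments map loop statements to the slide's instructions only loosely.

```c
#include <stddef.h>

#define VL 64  /* vector length, matching "set vl,64" on the slide (name assumed) */

/* Computes c = a + 5*b element by element. On a vector machine each
   statement in the loop body is done for all VL elements by one instruction. */
void vec_axpy5(const int *a, const int *b, int *c)
{
    for (size_t i = 0; i < VL; i++) {
        int v2 = b[i] * 5;   /* corresponds to mulvi v2,v1,5 */
        c[i] = a[i] + v2;    /* corresponds to addv v3,v1,v2 followed by stv */
    }
}
```

The loop needs 64 iterations of scalar code, whereas the assembly on the slide issues six instructions in total, which is the point of the MD-technique.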

Architecture methods: Powerful Instructions (1)
SIMD computing
• Exploit data locality of e.g. image processing applications
• Effect on code size?
• Effect on power consumption?
[Figure: SIMD execution method; along the time axis, nodes node1, node2, ..., node-K all execute the same sequence Instruction 1, Instruction 2, Instruction 3, ..., Instruction n]

Architecture methods: Powerful Instructions (1)
• Sub-word parallelism
  • SIMD on restricted scale
  • Used for multi-media instructions
  • Motivation: use a powerful 64-bit ALU as 4 x 16-bit ALUs
• Examples
  • MMX, SUN-VIS, HP MAX-2, AMD-K7/Athlon 3DNow!, Trimedia II
• Example: $\sum_{i=1}^{4} |a_i - b_i|$
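The sum-of-absolute-differences example can be sketched in C by treating a 64-bit word as four 16-bit lanes. This is a scalar model of what a single sub-word instruction would compute, not the encoding of any of the instruction sets listed; the names sad4x16 and pack4x16, and the choice of unsigned lanes, are assumptions made for illustration.

```c
#include <stdint.h>

/* Sum of absolute differences over four unsigned 16-bit lanes
   packed into 64-bit words (lane 0 in the low 16 bits). */
uint32_t sad4x16(uint64_t a, uint64_t b)
{
    uint32_t sum = 0;
    for (int lane = 0; lane < 4; lane++) {
        uint16_t ai = (uint16_t)(a >> (16 * lane));
        uint16_t bi = (uint16_t)(b >> (16 * lane));
        sum += (ai > bi) ? (uint32_t)(ai - bi) : (uint32_t)(bi - ai);
    }
    return sum;
}

/* Hypothetical helper: packs four 16-bit values, x0 in the lowest lane. */
uint64_t pack4x16(uint16_t x0, uint16_t x1, uint16_t x2, uint16_t x3)
{
    return (uint64_t)x0 | ((uint64_t)x1 << 16) |
           ((uint64_t)x2 << 32) | ((uint64_t)x3 << 48);
}
```

For example, sad4x16(pack4x16(10, 20, 30, 40), pack4x16(13, 15, 30, 100)) adds the lane differences 3, 5, 0 and 60.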

LC-SIMD (Locally connected; e.g. Xetal, Imap)
• long communication delays: shift operations
[Figure: LC-SIMD; an instruction bus drives PE0, PE1, PE2, ..., PE319, which access memory through one wide port]
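Why shift operations lead to long delays can be seen in a small C model, assuming each shift step moves data by exactly one PE position. The function names and the fill value for the edge PE are assumptions for illustration, not part of Xetal or Imap.

```c
#include <stddef.h>

/* One shift step: every PE takes the value held by its right-hand
   neighbour. The last PE has no right neighbour and receives `fill`. */
void shift_left_once(int *pe, size_t n, int fill)
{
    for (size_t i = 0; i + 1 < n; i++)
        pe[i] = pe[i + 1];
    if (n > 0)
        pe[n - 1] = fill;
}

/* Moves data `distance` PE positions to the left and returns the
   number of shift steps that were needed. */
size_t shift_left(int *pe, size_t n, size_t distance, int fill)
{
    size_t steps = 0;
    while (steps < distance) {
        shift_left_once(pe, n, fill);
        steps++;
    }
    return steps;
}
```

In this model the cost grows linearly with the distance, so a value travelling across most of a 320-PE array needs hundreds of steps.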

FC-SIMD (Fully Connected; Imagine)
• expensive communication network
[Figure: FC-SIMD; an instruction bus drives PE0, PE1, PE2, ..., PE319, which are linked by a fully connected communication network]

LC: Xetal Objectives
• High degree of system integration: CMOS imaging + DSP, low-cost camera systems
• Low power consumption: mobile & remote sensing
• Flexibility: programmable DSP and control functions

Xetal Architecture 1

Global Controller
• tuned for the Xetal architecture
• functions
  • loop/iteration control
  • system synchronization
  • exposure-time control
  • white balancing
  • ...

Xetal Architecture 1

Parallel Processing (SIMD)
• 2 columns per processor
• neighbour communication
• low-speed clock (16 MHz)
• clock gating
• shared address decoding
• minimal memory read access
• LOW-POWER

Parallel Processing (Contd.)

Xetal Specs & Performance

Simulation Results (1-input)

Simulation Results (1-output)

Simulation Results (2)

Imagine
• Combining DLP (SIMD) and ILP (VLIW)
  • top level: SIMD
  • per PE: VLIW

Imagine: Representative Applications
• Stereo Depth Extraction
• Polygon Rendering
• MPEG Encoding/Decoding
[Figure: rendering, and encoding/decoding between a 2D video stream and encoded 2D data (101100 010110 001001)]

Stream Processing
• Little data reuse (pixels never revisited)
• Highly data parallel (output pixels not dependent on other output pixels)
• Compute intensive (60 arithmetic ops per memory reference)
[Figure: stream program with input data, kernels, streams and output data; Image 0 and Image 1 each pass through two convolve kernels, and a SAD kernel produces the Depth Map]

Stream Architecture Provides Data Bandwidth Hierarchy
[Figure: four SDRAMs feed the Stream Register File, which feeds eight ALU clusters under SIMD/VLIW control]
Peak BW: 2 GB/s (SDRAM), 32 GB/s (Stream Register File), 544 GB/s (ALU clusters)

Application Data: Bandwidth Usage
[Figure: application data mapped onto the same hierarchy of SDRAM, Stream Register File and ALU clusters, annotated with 2 GB/s, 32 GB/s and 544 GB/s]

Stream Register File: Details
• SRF: single-ported 128 KB SRAM (1024 x 32W)
• Stream buffers, 32W/cycle
• Arbiter
• To/From: arithmetic clusters, I/O, interprocessor communication, and main memory
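The SRF dimensions can be cross-checked with a short calculation, assuming one word (W) is 32 bits, i.e. 4 bytes. The slides do not state the word width here, so that is an assumption.

```latex
\begin{align*}
1024 \times 32\ \text{words} &= 32\,768\ \text{words} = 32\ \text{kW} \\
32\,768\ \text{words} \times 4\ \text{bytes/word} &= 131\,072\ \text{bytes} = 128\ \text{KB}
\end{align*}
```

Under that assumption the 128 KB figure agrees with the 32kW SRAM size quoted for the Stream Register File in the Imagine block diagram.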

Arithmetic Cluster: Details
• Units support floating-point / 32-bit / dual 16-bit / quad 8-bit instructions
• 4-cycle pipelined FMUL, FADD, FSUB, FTOI, ITOF, FFRAC
• 17-cycle FDIV (pipelined for 1 FDIV every 7 cycles)
[Figure: arithmetic cluster with functional units (+, +, +, *, *, /) and a CU, each with a local register file, connected through a cross point to the intercluster network and to/from the SRF]

The Imagine Stream Processor
[Figure: block diagram of the Imagine stream processor; host processor, stream controller, network interface and network, microcontroller (2K VLIW instrs), ALU clusters 0 to 7, Stream Register File (32kW SRAM), and a streaming memory system connected to four SDRAMs]

Imagine Floorplan
• 22 million transistors
• 500 MHz
• TI GS30KA:
  • 0.15 µm L_drawn
  • 0.13 µm L_eff
  • CMOS process

Imagine Programming Environment

    StereoDepthExtraction(…) {
      // Load Input Images
      ...
      // Run Kernels
      convolve7x7 (RawImage,ConvImage);
      convolve3x3 (ConvImage,Conv2Image);
      ...
      // Store Output
    }

    Convolve7x7(…) {
      ...
      while(!In.empty()) {
        ...
        p0 = k0 * in10;
        p12 = k21 * in32;
        p34 = k43 * in54;
        p56 = k65 * in76;
        sum = (p0 + p12) + (p34 + p56);
        ...
      }
    }

RC-SIMD
• Imagine supports a full interconnect between PEs
• Do we need this expensive interconnect?
• Alternative: RC-SIMD

Basic template of communication architecture
[Figure: an instruction bus drives PE0, PE1, PE2, PE3; S0, S1 and S2 sit between neighbouring PEs, with inputs labelled 0 and 1]

Example
• 4-tap filter
[Figure: data-flow graph; loads LD 0, LD +1, LD +2, LD +3 are multiplied by C0, C1, C2, C3, summed (+) and stored (ST)]
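The 4-tap filter data-flow graph can be written out in C as follows. This is a sequential sketch under the assumption that "LD +t" means reading the input sample t positions further along; on the SIMD array each output index would belong to a different PE. The name fir4 and its signature are illustrative.

```c
#include <stddef.h>

/* out[i] = C0*in[i] + C1*in[i+1] + C2*in[i+2] + C3*in[i+3]
   `in` must hold at least n_out + 3 samples. */
void fir4(const int *in, const int coeff[4], int *out, size_t n_out)
{
    for (size_t i = 0; i < n_out; i++) {
        int acc = 0;
        for (size_t t = 0; t < 4; t++)
            acc += coeff[t] * in[i + t];   /* LD +t, then multiply by Ct */
        out[i] = acc;                      /* sum, then ST */
    }
}
```

Three of the four loads reach into neighbouring positions, which is what makes the communication pattern of this kernel interesting for a locally connected array.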

Example
• Resource sharing conflict
• How to solve it? Pipeline (shift 1 cycle)
[Figure: PE0, PE1, PE2, PE3 connected through S0, S1 and S2, with the conflicting use of the shared connections marked]

RC-SIMD: Basic architecture
• Schedule with delay-line
[Figure: PE0, PE1, PE2, PE3 connected through S0, S1 and S2, with a delay element on the instruction path between successive PEs]

Conflict model
• Schedule PE0 (using FACTS)
• Node: resource usage
• Sequence edge: timing dependency
• FACTS tools
• Move problem from hardware to software
[Figure: RC-SIMD basic architecture with delay elements, next to a conflict graph whose nodes (e.g. Ld +2) carry the usage of S0, S1 and S2 and whose sequence edges carry timing offsets]

Basic architecture
• Valid schedule
[Figure: PE0, PE1, PE2, PE3 with delay elements and S0, S1, S2, showing a valid schedule]

Drawback
• 319 cycles between PE0 and PE319
• Size of conflict model (compile time)

    PE 0    PE 1    PE 2    PE 3    ...    PE 319
    Ins 1
    Ins 2   Ins 1
    Ins 3   Ins 2   Ins 1
    Ins 4   Ins 3   Ins 2   Ins 1
            Ins 4   Ins 3   Ins 2
                    Ins 4   Ins 3
                            Ins 4

Updated Architecture

               PE 0    PE 1    PE 2    PE 3    PE 4    PE 5    PE 6    PE 7
    Cycle 1    Ins 1                           Ins 1
    Cycle 2    Ins 2   Ins 1                   Ins 2   Ins 1
    Cycle 3    Ins 3   Ins 2   Ins 1           Ins 3   Ins 2   Ins 1
    Cycle 4    Ins 4   Ins 3   Ins 2   Ins 1   Ins 4   Ins 3   Ins 2   Ins 1
    Cycle 5            Ins 4   Ins 3   Ins 2           Ins 4   Ins 3   Ins 2
    Cycle 6                    Ins 4   Ins 3                   Ins 4   Ins 3
    Cycle 7                            Ins 4                           Ins 4
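The timing in these schedules can be reproduced with two small C functions, assuming one cycle of delay per PE along a delay chain and 1-based instruction and cycle numbers. The names exec_cycle and last_cycle are illustrative.

```c
/* Cycle (1-based) in which the PE at 0-based position `pe_in_chain`
   within its delay chain executes instruction `ins` (1-based),
   assuming one cycle of delay per PE. */
unsigned exec_cycle(unsigned ins, unsigned pe_in_chain)
{
    return ins + pe_in_chain;
}

/* Cycle in which the last PE of a chain of `chain_len` PEs
   executes the last of `n_ins` instructions. */
unsigned last_cycle(unsigned n_ins, unsigned chain_len)
{
    return exec_cycle(n_ins, chain_len - 1);
}
```

With four instructions, a single chain of 320 PEs finishes in cycle 323, while chains of four PEs finish in cycle 7, matching the last row of the table.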

Updated RC-SIMD Architecture
[Figure: PE0, PE1, PE2, PE3, PE4 connected through S0, S1, S2 and S3, with delay elements between successive PEs and two rows of inputs labelled 0 and 1]

Results of mapping several kernels

Imap

Difficult SIMD Applications
• Algorithms need dynamic communication:
  • lens distortion
  • bucket processing
  • mirroring, ...

DC-SIMD Architecture
• PEs: PE_1 PE_2 PE_3 PE_4 PE_5 PE_6 PE_7
• Bus_0: R1 R4 R7
• Bus_1: R2 R5
• Bus_2: R3 R6
• Example transfers: PE_6 -> PE_3, PE_4 -> PE_2
• Message format: | V | dst-add | data | src-add |
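The message format can be modelled as a plain C struct. The field order follows the slide, but the field widths, the type and function names, and the distance helper are all assumptions made for illustration; the slides do not give bit widths.

```c
#include <stdbool.h>
#include <stdint.h>

/* Field order follows the slide: V, dst-add, data, src-add.
   All widths here are assumed, not taken from the slides. */
typedef struct {
    bool     valid;    /* V: the slot carries a live message */
    uint16_t dst_add;  /* index of the destination PE */
    int32_t  data;     /* payload */
    uint16_t src_add;  /* index of the source PE */
} dc_message;

/* Builds a valid message from PE `src` to PE `dst`. */
dc_message dc_make_message(uint16_t src, uint16_t dst, int32_t data)
{
    dc_message m = { true, dst, data, src };
    return m;
}

/* Number of PE positions between source and destination. */
uint16_t dc_distance(const dc_message *m)
{
    return (m->dst_add > m->src_add) ? (uint16_t)(m->dst_add - m->src_add)
                                     : (uint16_t)(m->src_add - m->dst_add);
}
```

Carrying the source address in the message is what lets the destination be chosen at run time, which the fixed shift patterns of an LC-SIMD cannot express.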

DC-SIMD Architecture
• Larger distance: PE_7 -> PE_1
[Figure: PE_1 to PE_7 with Bus_0 (R1, R4, R7), Bus_1 (R2, R5) and Bus_2 (R3, R6)]

DC-SIMD Architecture
• Priority: PE_7 -> PE_5, PE_6 -> PE_2
[Figure: PE_1 to PE_7 with Bus_0 (R1, R4, R7), Bus_1 (R2, R5) and Bus_2 (R3, R6)]

DC-SIMD: arbitration
• Read: xor
• Write: give priority to further PEs
• Message fields: | V | des-add | data | src-add |
• Priority terms
  • n+2 : 2.v
  • n+1 : (2+v).1
  • n : (1+2+v).0
• Select (ab): a = v'.2', b = a'.v' + a.1'
• Buffer instruction:
[Figure: a PE with its PEid and read data path, and PEn, PEn+1, PEn+2 writing into the next register]
