Processor Architectures and Program Mapping
Exploiting DLP: SIMD architectures
TU/e 5kk10
Henk Corporaal, Jef van Meerbergen, Bart Mesman

[Figure: design spectrum with axes flexibility and efficiency. Labels: DSP, Programmable CPU, Programmable DSP, Application specific instruction set processor (ASIP), Application specific processor.]
Processor Architectures and Program Mapping H. Corporaal, J. van Meerbergen, and B. Mesman

SIMD Performance
[Figure: computational efficiency [MOPS/W], log scale from 10^0 to 10^6, versus feature size [um] (2, 1, 0.5, 0.25, 0.13, 0.07). Bands shown: programmable processors, SIMD, application specific cores. Source: [Roza]]

What are we talking about?
ILP = Instruction Level Parallelism = ability to perform multiple operations (or instructions), from a single instruction stream, in parallel
VLIW = Very Long Instruction Word architecture
Instruction format: | operation 1 | operation 2 | operation 3 | operation 4 | operation 5 |

SIMD: Topics Overview
• Enhance performance: architecture methods
• Data Level Parallelism
• Application area
• Subword parallelism
• Locally connected SIMDs
  • Xetal
• Fully connected SIMDs
  • Imagine
• Communication in SIMD processors
  • RC-SIMD
  • DC-SIMD

Enhance performance: 3 architecture methods
• (Super)-pipelining
• Powerful instructions
  • MD-technique: multiple data operands per operation
  • MO-technique: multiple operations per instruction
• Multiple instruction issue

Characteristics of Media Applications
• Poorly matched to conventional architectures
  • Caches
  • Instruction-Level Parallelism
  • Few arithmetic units
• Well-matched to modern VLSI technology
  • Lots (100’s - 1000’s) of ALUs fit on a single chip
Communication bandwidth is the scarce resource

Architecture methods: Powerful Instructions (1)
MD-technique
• Multiple data operands per operation
• SIMD: Single Instruction Multiple Data
Vector instruction:
    for (i = 0; i < 64; i++)
        c[i] = a[i] + 5*b[i];
which is the vector operation c = a + 5*b
Assembly:
    set   vl,64
    ldv   v1,0(r2)
    mulvi v2,v1,5
    ldv   v1,0(r1)
    addv  v3,v1,v2
    stv   v3,0(r3)

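To make the mapping from the loop to the vector assembly concrete, here is a minimal Python sketch that emulates each vector instruction on whole lists. It assumes that r1, r2 and r3 point to a, b and c respectively (inferred from c = a + 5*b, not stated on the slide); the function names are hypothetical.

```python
VL = 64  # vector length, as in "set vl,64"

def vector_axpy5(a, b):
    """Emulate the vector sequence for c = a + 5*b, one list per vector register."""
    assert len(a) == len(b) == VL
    v1 = list(b)                          # ldv   v1,0(r2)   (assumed: r2 -> b)
    v2 = [5 * x for x in v1]              # mulvi v2,v1,5
    v1 = list(a)                          # ldv   v1,0(r1)   (assumed: r1 -> a)
    v3 = [x + y for x, y in zip(v1, v2)]  # addv  v3,v1,v2
    return v3                             # stv   v3,0(r3)   (assumed: r3 -> c)

def scalar_axpy5(a, b):
    """Reference: the scalar loop from the slide."""
    c = [0] * VL
    for i in range(VL):
        c[i] = a[i] + 5 * b[i]
    return c

a = list(range(VL))
b = list(range(VL, 2 * VL))
c = vector_axpy5(a, b)
```

The six vector instructions replace 64 iterations of the scalar loop body, which is the point of the MD-technique.
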
Architecture methods: Powerful Instructions (1)
SIMD computing
• Exploit data locality of e.g. image processing applications
• Effect on code size?
• Effect on power consumption?
[Figure: SIMD Execution Method. Along the time axis, Instruction 1, Instruction 2, Instruction 3, ..., Instruction n are each executed on node1, node2, ..., node-K.]

Architecture methods: Powerful Instructions (1)
• Sub-word parallelism
  • SIMD on restricted scale
  • Used for multi-media instructions
  • Motivation: use a powerful 64-bit ALU as 4 x 16-bit ALUs
• Examples
  • MMX, SUN-VIS, HP MAX-2, AMD-K7/Athlon 3Dnow, Trimedia II
  • Example: Σ(i=1..4) |ai - bi|

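The sub-word idea can be sketched in Python by treating one 64-bit integer as four independent 16-bit lanes and computing the sum of absolute differences over those lanes. This is an illustration only: the helper names are hypothetical, lanes are assumed unsigned, and no specific instruction set (MMX, VIS, ...) is being modelled.

```python
LANES = 4
LANE_BITS = 16
LANE_MASK = (1 << LANE_BITS) - 1

def pack(values):
    """Pack four unsigned 16-bit values into one 64-bit word (lane 0 in the low bits)."""
    assert len(values) == LANES
    word = 0
    for lane, v in enumerate(values):
        assert 0 <= v <= LANE_MASK
        word |= v << (lane * LANE_BITS)
    return word

def unpack(word):
    """Split a 64-bit word back into its four 16-bit lanes."""
    return [(word >> (lane * LANE_BITS)) & LANE_MASK for lane in range(LANES)]

def packed_abs_diff(wa, wb):
    """Lane-wise |a - b|: one operation on the wide word handles all four lanes."""
    return pack([abs(x - y) for x, y in zip(unpack(wa), unpack(wb))])

def sad4(wa, wb):
    """Sum over the four lanes of |a_i - b_i|."""
    return sum(unpack(packed_abs_diff(wa, wb)))

a = [10, 200, 3, 65535]
b = [12, 100, 3, 0]
result = sad4(pack(a), pack(b))
```
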
LC-SIMD (Locally connected; e.g. Xetal, Imap)
• Long communication delays: shift operations
[Figure: an instruction bus drives PE0, PE1, PE2, ..., PE319; the PEs reach memory through one wide port.]

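Why shifts make communication slow can be seen in a small sketch. It assumes that one shift operation moves every PE's value to its immediate right-hand neighbour, so a transfer over a distance d costs d shift steps; the function names and the zero fill value are hypothetical.

```python
def shift_right(row, fill=0):
    """One shift operation: each PE passes its value to the neighbour on its right."""
    return [fill] + row[:-1]

def send(row, distance):
    """Move data 'distance' PEs to the right and count the shift steps needed."""
    steps = 0
    for _ in range(distance):
        row = shift_right(row)
        steps += 1
    return row, steps

row = [1, 2, 3, 4, 5, 6, 7, 8]   # one value per PE
moved, steps = send(row, 3)
```
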
FC-SIMD (Fully Connected; Imagine)
• Expensive communication network
[Figure: an instruction bus drives PE0, PE1, PE2, ..., PE319; the PEs are joined by a fully connected communication network.]

LC: Xetal
Objectives
• High degree of system integration: CMOS imaging + DSP, low cost camera systems
• Low power consumption: mobile & remote sensing
• Flexibility: programmable DSP and control functions

Global Controller
• Tuned for Xetal architecture
• Functions:
  • loop/iteration control
  • system synchronization
  • exposure-time control
  • white balancing
  • ...

Parallel Processing (SIMD)
• 2 columns / processor
• neighbour communication
• low-speed clock (16 MHz)
• clock gating
• shared address decoding
• minimal memory read access
LOW-POWER

Parallel Processing (Contd.)

Xetal Specs & Performance

Simulation Results (1-input)

Simulation Results (1-output)

Simulation Results (2)

Imagine
• Combining DLP (SIMD) and ILP (VLIW)
  • top level: SIMD
  • per PE: VLIW

Imagine: Representative Applications
• Stereo Depth Extraction
• Polygon Rendering
• MPEG Encoding/Decoding
[Figure labels: Render; Encode/Decode; 2D Video Stream; Encoded 2D Data (101100 010110 001001)]

Stream Processing
• Little data reuse (pixels never revisited)
• Highly data parallel (output pixels not dependent on other output pixels)
• Compute intensive (60 arithmetic ops per memory reference)
[Figure: kernels connected by streams. Input data Image 0 and Image 1 each pass through two convolve kernels; a SAD kernel combines the two streams into the output data, a Depth Map.]

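The kernel-and-stream structure above can be sketched with Python generators: each kernel consumes a stream element by element and never revisits earlier input. This is a heavily simplified illustration, not the actual stereo depth algorithm: the images are 1-D lists, the 3-tap weights are hypothetical, and SAD is reduced to a single score instead of a depth map.

```python
def convolve3(stream, k=(1, 2, 1)):
    """Kernel: 1-D 3-tap convolution over a stream (weights are an assumption)."""
    window = []
    for x in stream:
        window.append(x)
        if len(window) == 3:
            yield sum(w * v for w, v in zip(k, window))
            window.pop(0)   # slide the window; old samples are never revisited

def sad(stream0, stream1):
    """Kernel: sum of absolute differences of two streams, element by element."""
    return sum(abs(p - q) for p, q in zip(stream0, stream1))

image0 = [1, 2, 3, 4, 5, 6]
image1 = [1, 2, 4, 4, 5, 7]

# Image 0 -> convolve -> convolve \
#                                  SAD -> score
# Image 1 -> convolve -> convolve /
score = sad(convolve3(convolve3(image0)), convolve3(convolve3(image1)))
```
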
Stream Architecture Provides Data Bandwidth Hierarchy
[Figure: four SDRAM banks feed the Stream Register File, which feeds eight ALU clusters under SIMD/VLIW control.]
Peak BW: 2GB/s (SDRAM), 32GB/s (Stream Register File), 544GB/s (ALU clusters)

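The size of each step in the hierarchy follows directly from the three peak figures. The short sketch below only divides the numbers on the slide; the assignment of each figure to a level (SDRAM, SRF, clusters) is read off the figure layout and should be taken as an assumption.

```python
# Peak bandwidth per level in GB/s, as quoted on the slide
PEAK_BW_GBPS = {"SDRAM": 2, "SRF": 32, "clusters": 544}

def ratio(inner, outer):
    """How many times more bandwidth the inner level offers than the outer one."""
    return PEAK_BW_GBPS[inner] / PEAK_BW_GBPS[outer]

srf_vs_sdram = ratio("SRF", "SDRAM")
clusters_vs_srf = ratio("clusters", "SRF")
clusters_vs_sdram = ratio("clusters", "SDRAM")
```

Each level offers roughly 16 to 17 times the bandwidth of the level outside it, so an application keeps the ALUs busy only if most operands are served from the inner levels.
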
Application Data: Bandwidth Usage
[Figure: the same hierarchy (SDRAM, Stream Register File, ALU clusters) annotated with 2GB/s, 32GB/s and 544GB/s.]

Stream Register File: Details
• SRF: single-ported 128KB SRAM (1024 x 32W)
• Stream buffers, 32W/cycle
• Arbiter
• To/From: arithmetic clusters, I/O, interprocessor communication, and main memory

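The SRF figures can be cross-checked with a little arithmetic. The sketch assumes 32-bit (4-byte) words, which matches the 32-bit units mentioned for the arithmetic clusters but is not stated on this slide; the constant names are hypothetical.

```python
ROWS = 1024
WORDS_PER_ROW = 32
BYTES_PER_WORD = 4   # assumption: 32-bit words

def srf_capacity_words():
    return ROWS * WORDS_PER_ROW

def srf_capacity_bytes():
    return srf_capacity_words() * BYTES_PER_WORD

words = srf_capacity_words()
kbytes = srf_capacity_bytes() // 1024
```

Under that assumption 1024 x 32W gives 32K words, which agrees with the "32kW SRAM" label in the Imagine block diagram, and 128KB in total.
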
Arithmetic Cluster: Details
• Units support floating-point / 32-bit / dual 16-bit / quad 8-bit instructions
• 4-cycle pipelined FMUL, FADD, FSUB, FTOI, ITOF, FFRAC
• 17-cycle FDIV (pipelined for 1 FDIV every 7 cycles)
[Figure labels: functional units +, +, +, *, *, / and CU; Local Register File; Cross Point; From SRF; To SRF; Intercluster Network]

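The difference between latency and throughput in these figures can be worked out with a one-line formula: the first result appears after the latency, and each further result after one initiation interval. The sketch assumes independent operations and, for the 4-cycle units, that "pipelined" means one new operation can start every cycle; the function name is hypothetical.

```python
def pipelined_cycles(n_ops, latency, initiation_interval):
    """Cycles until the last of n_ops independent operations completes."""
    if n_ops == 0:
        return 0
    return latency + (n_ops - 1) * initiation_interval

fdiv_10 = pipelined_cycles(10, latency=17, initiation_interval=7)           # ten FDIVs
fdiv_10_unpipelined = pipelined_cycles(10, latency=17, initiation_interval=17)
fmul_10 = pipelined_cycles(10, latency=4, initiation_interval=1)            # assumed 1/cycle
```
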
The Imagine Stream Processor
[Figure: block diagram of the Imagine Stream Processor. Labels: Host Processor; Host Interface; Stream Controller; Microcontroller: 2K VLIW Instrs; Stream Register File: 32kW SRAM; ALU Cluster 0 to ALU Cluster 7; Streaming Memory System with four SDRAMs; Network Interface; Network.]

Imagine Floorplan
• 22 million transistors
• 500 MHz
• TI GS30KA CMOS process:
  • 0.15 um Ldrawn
  • 0.13 um Leff

Imagine Programming Environment
Stream program (calls the kernels):
    StereoDepthExtraction(…) {
        // Load Input Images
        ...
        // Run Kernels
        convolve7x7 (RawImage, ConvImage);
        convolve3x3 (ConvImage, Conv2Image);
        ...
        // Store Output
    }
Kernel:
    Convolve7x7(…) {
        ...
        while(!In.empty()) {
            ...
            p0 = k0 * in10;
            p12 = k21 * in32;
            p34 = k43 * in54;
            p56 = k65 * in76;
            sum = (p0 + p12) + (p34 + p56);
            ...
        }
    }

RC-SIMD
• Imagine supports full interconnect between PEs
• Do we need this expensive interconnect?
• Alternative: RC-SIMD

Basic template of communication architecture
[Figure: an instruction bus drives PE0, PE1, PE2, PE3; switches S0, S1, S2 sit between neighbouring PEs, with inputs labelled 0 and 1.]

Example
• 4-tap filter
[Figure: data-flow graph. Four loads (LD 0, LD +1, LD +2, LD +3) are multiplied by C0, C1, C2, C3; the products are added (+) and the result is stored (ST).]

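The data-flow graph of the 4-tap filter can be written out as a short sketch in which every PE executes the same operations on its own data. It assumes that "LD +k" means loading the sample held k PEs to the right, which is an interpretation of the figure labels; the function name and sample values are hypothetical.

```python
def four_tap_filter(x, c):
    """Each PE p computes C0*x[p] + C1*x[p+1] + C2*x[p+2] + C3*x[p+3]."""
    assert len(c) == 4
    n_pe = len(x) - 3                                # PEs that have all four neighbours
    out = []
    for pe in range(n_pe):                           # conceptually all PEs run in lockstep
        taps = [x[pe + k] for k in range(4)]         # LD 0, LD +1, LD +2, LD +3
        prods = [ck * t for ck, t in zip(c, taps)]   # * C0 .. * C3
        out.append(sum(prods))                       # + then ST
    return out

x = [1, 2, 3, 4, 5, 6]
y = four_tap_filter(x, [1, 0, 0, 1])
```

Three of the four loads reach into neighbouring PEs, which is why this small kernel already stresses the inter-PE communication resources.
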
Example
• Resource sharing conflict
• How to solve?
• Pipeline (shift 1 cycle)
[Figure: PE0, PE1, PE2, PE3 with switches S0, S1, S2 and inputs labelled 0 and 1.]

RC-SIMD: Basic architecture
• Schedule with delay-line
[Figure: PE0, PE1, PE2, PE3 with switches S0, S1, S2; a delay element sits between consecutive PEs.]

Conflict model
• Schedule PE0 (using FACTS)
• Node: resource usage
• Sequence edge: timing dependency
• FACTS tools
• Move problem from hardware to software
[Figure: PE0 to PE3 with delay elements and switches S0, S1, S2; conflict graph with nodes Ld +2, S0, S1, S2 and edge weights 0, -1, 1.]

Basic architecture
• Valid schedule
[Figure: PE0, PE1, PE2, PE3 with delay elements and switches S0, S1, S2.]

Drawback
• 319 cycles between PE0 & PE319
• Size of conflict model (compile time)
[Figure: Ins 1 to Ins 4 start one cycle later on each successive PE, across PE 0, PE 1, PE 2, PE 3, ..., PE 319.]

Update Architecture
          PE 0   PE 1   PE 2   PE 3   PE 4   PE 5   PE 6   PE 7
Cycle 1   Ins 1                       Ins 1
Cycle 2   Ins 2  Ins 1                Ins 2  Ins 1
Cycle 3   Ins 3  Ins 2  Ins 1         Ins 3  Ins 2  Ins 1
Cycle 4   Ins 4  Ins 3  Ins 2  Ins 1  Ins 4  Ins 3  Ins 2  Ins 1
Cycle 5          Ins 4  Ins 3  Ins 2         Ins 4  Ins 3  Ins 2
Cycle 6                 Ins 4  Ins 3                Ins 4  Ins 3
Cycle 7                        Ins 4                       Ins 4

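The effect of the update on instruction skew can be checked with a small model. It assumes that with the plain delay line instruction i reaches PE p in cycle i + p, and that the updated architecture restarts the delay line every 4 PEs (the segment length is read off the schedule, not stated on the slide). The function names are hypothetical.

```python
def issue_cycle(ins, pe, segment=None):
    """Cycle in which instruction 'ins' (1-based) executes on PE 'pe' (0-based)."""
    offset = pe if segment is None else pe % segment
    return ins + offset

def max_skew(n_pe, segment=None):
    """Largest delay of Ins 1 on any PE relative to PE 0."""
    return max(issue_cycle(1, pe, segment) for pe in range(n_pe)) - 1

basic_skew = max_skew(320)               # one delay line across all 320 PEs
updated_skew = max_skew(320, segment=4)  # delay line restarted every 4 PEs (assumed)
```

With segments, the skew and therefore the size of the conflict model no longer grow with the number of PEs.
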
Updated RC-SIMD Architecture
[Figure: PE0, PE1, PE2, PE3, PE4 with delay elements and switches S0, S1, S2, S3; inputs labelled 0 and 1.]

Results of mapping several kernels

Imap

Difficult SIMD Applications
• Algorithms need dynamic communication:
  • lens distortion
  • bucket processing
  • mirroring, …

DC-SIMD Architecture
[Figure: PE_1 to PE_7 attached to three buses with registers: Bus_0 (R1, R4, R7), Bus_1 (R2, R5), Bus_2 (R3, R6). Transfers shown involve PE_6, PE_3, PE_4, PE_2.]
Message format: V | dst-add | data | src-add

DC-SIMD Architecture
[Figure: PE_1 to PE_7 with Bus_0 (R1, R4, R7), Bus_1 (R2, R5), Bus_2 (R3, R6).]
Larger distance: PE_7, PE_1

DC-SIMD Architecture
[Figure: PE_1 to PE_7 with Bus_0 (R1, R4, R7), Bus_1 (R2, R5), Bus_2 (R3, R6).]
Priority: PE_7, PE_5, PE_6, PE_2

DC-SIMD: arbitration
• Read: xor
  [Figure labels: PEid; PE; Read data; message V | des-add | data | src-add]
• Write: give priority to further PEs
  [Figure labels: PEn, PEn+1, PEn+2; Next reg.; message V | des-add | data | src-add]
    n+2 : 2.v
    n+1 : (2+v).1
    n   : (1+2+v).0
  Select (ab):
    a = v’.2’
    b = a’.v’ + a.1’
• Buffer instruction:

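The select equations can be evaluated directly to see the priority order they imply. The sketch below is an interpretation, not a statement of the actual design: it assumes v is the valid bit of the message already on the bus, that 1 and 2 are the write requests of PEn+1 and PEn+2, and that ’ denotes complement. The mapping of the select code (a, b) to a source is hypothetical and chosen to match "give priority to further PEs".

```python
def select(v, r2, r1):
    """Evaluate a = v'.2' and b = a'.v' + a.1' for valid bit v and requests r2, r1."""
    a = (not v) and (not r2)
    b = ((not a) and (not v)) or (a and (not r1))
    return int(a), int(b)

# Hypothetical meaning of each select code
SOURCE = {
    (0, 0): "pass incoming message",
    (0, 1): "PEn+2",
    (1, 0): "PEn+1",
    (1, 1): "PEn",
}

def winner(v, r2, r1):
    return SOURCE[select(v, r2, r1)]
```

Under these assumptions a valid incoming message always wins, then PEn+2, then PEn+1, and PEn gets the register only when nobody further along asks for it.
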