A Data-Driven Approach for Pipelining Sequences of Data-Dependent Loops • João M. P. Cardoso • Portugal • ITIV, University of Karlsruhe, July 2, 2007
Motivation • Many applications have sequences of tasks • E.g., in image and video processing algorithms • Contemporary FPGAs • Plenty of room to accommodate highly specialized, complex architectures • It is time to creatively “use available resources” rather than simply “save resources”
Motivation • Computing Stages • Sequentially Task A Task B Task C TIME
Motivation • Computing Stages • Concurrently Task A Task B Task C TIME
Outline • Objective • Loop Pipelining • Producer/Consumer Computing Stages • Pipelining Sequences of Loops • Inter-Stage Communication • Experimental Setup and Results • Related Work • Conclusions • Future Work
Objectives • To speed up applications with multiple, data-dependent stages • each stage seen as a set of nested loops • How? • By pipelining those sequences of data-dependent stages using fine-grain synchronization schemes • By taking advantage of field-customizable computing structures (FPGAs)
Loop Pipelining • Attempts to overlap loop iterations (I1, I2, I3, I4, ... execute in overlapped fashion over time) • Significant speedups are achieved • But how to pipeline sequences of loops?
Computing Stages • Sequentially Producer: ...A[2]A[1]A[0] Consumer: A[0]A[1]A[2]...
Computing Stages • Concurrently • Ordered producer/consumer pairs • Send/receive through a FIFO with N stages • Producer: ...A[2]A[1]A[0] → FIFO → Consumer: A[0]A[1]A[2]...
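For ordered pairs the FIFO itself provides the synchronization: the producer stalls when the FIFO is full and the consumer stalls when it is empty. A minimal software sketch of such an N-stage FIFO (the names `fifo_t`, `fifo_push`, `fifo_pop` are illustrative, not taken from the toolchain):

```c
#include <assert.h>
#include <stdbool.h>

#define FIFO_N 4  /* number of FIFO stages (illustrative) */

typedef struct {
    int data[FIFO_N];
    int head, tail, count;
} fifo_t;

/* Producer side: returns false when the FIFO is full (producer stalls). */
static bool fifo_push(fifo_t *f, int v) {
    if (f->count == FIFO_N) return false;
    f->data[f->tail] = v;
    f->tail = (f->tail + 1) % FIFO_N;
    f->count++;
    return true;
}

/* Consumer side: returns false when the FIFO is empty (consumer stalls). */
static bool fifo_pop(fifo_t *f, int *v) {
    if (f->count == 0) return false;
    *v = f->data[f->head];
    f->head = (f->head + 1) % FIFO_N;
    f->count--;
    return true;
}
```

Because both sides see values in the same order, no addresses need to be communicated, which is exactly why this scheme only works for ordered producer/consumer pairs.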
Computing Stages • Concurrently • Unordered producer/consumer pairs • Empty/full table alongside the data • Producer: ...A[3]A[5]A[1] Consumer: A[3]A[1]A[5]...
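For unordered pairs, each data cell carries a full/empty bit: a write sets it, and a read is only allowed once it is set. A minimal single-write/single-read sketch (the helper names `produce`/`consume` are illustrative; a hardware consumer would busy-wait instead of returning false):

```c
#include <assert.h>
#include <stdbool.h>

#define SIZE 8                 /* size of the shared array (illustrative) */

static int  buf[SIZE];         /* the data itself */
static bool full[SIZE];        /* 1-bit empty/full table, one bit per cell */

/* Producer: store the value and flag it as available. */
static void produce(int idx, int v) {
    buf[idx]  = v;
    full[idx] = true;
}

/* Consumer: succeed only if the cell has been produced; clearing the
   bit is the "empty/full update" that frees the cell for reuse. */
static bool consume(int idx, int *v) {
    if (!full[idx]) return false;   /* not produced yet: would busy-wait */
    *v = buf[idx];
    full[idx] = false;
    return true;
}
```

The key property is that reads may happen in any order relative to writes, as long as each individual cell is written before it is read.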
Main Idea • FDCT • Loops 1 and 2 read the data input and produce an intermediate data array; Loop 3 consumes it to produce the data output • A global FSM runs the stages one after the other: execution of Loops 1–2, then execution of Loop 3
Main Idea • FDCT • Out-of-order producer/consumer pairs • How to overlap computing stages?
Main Idea • Pipelined FDCT • Intermediate data stored in a dual-port RAM, paired with a dual-port 1-bit table (empty/full) • FSM 1 controls Loops 1–2 (producer); FSM 2 controls Loop 3 (consumer) • Execution of Loop 3 overlaps in time with the execution of Loops 1–2
Main Idea • Tasks communicate through memories: Task A → memory → Task B
Possible Scenarios • Single write, single read • Accepted without code changes • Single write, multiple reads • Accepted without code changes (by using an N-bit table) • Multiple writes, single read • Need code transformations • Multiple writes, multiple reads • Need code transformations
Inter-Stage Communication • Responsible for: • Communicating data between pipelined stages • Flagging data availability • Solutions • Perfect associative memory • Cost too high • Memory for the data plus a 1-bit table (each cell holds full/empty information) • Sized to the data set to communicate • Decrease the size using a hash-based solution
Inter-Stage Communication • Memory plus 1-bit table

boolean tab[SIZE] = {0, 0, …, 0};
…
for (i = 0; i < num_fdcts; i++) {      // Loop 1
  for (j = 0; j < N; j++) {            // Loop 2
    // loads
    // computations
    // stores
    tmp[48 + i_1] = F6 >> 13;
    tab[48 + i_1] = true;
    tmp[56 + i_1] = F7 >> 13;
    tab[56 + i_1] = true;
    i_1++;
  }
  i_1 += 56;
}
i_1 = 0;
for (i = 0; i < N * num_fdcts; i++) {  // Loop 3
  L1: f0 = tmp[i_1];     if (!tab[i_1])     goto L1;  // busy-wait until produced
  L2: f1 = tmp[1 + i_1]; if (!tab[1 + i_1]) goto L2;
  // remaining loads
  // computations
  …
  // stores
  i_1 += 8;
}
Inter-Stage Communication • Hash-based solution

boolean tab[SIZE] = {0, 0, …, 0};
…
for (i = 0; i < num_fdcts; i++) {      // Loop 1
  for (j = 0; j < N; j++) {            // Loop 2
    // loads
    // computations
    // stores
    tmp[H(48 + i_1)] = F6 >> 13;
    tab[H(48 + i_1)] = true;
    tmp[H(56 + i_1)] = F7 >> 13;
    tab[H(56 + i_1)] = true;
    i_1++;
  }
  i_1 += 56;
}
i_1 = 0;
for (i = 0; i < N * num_fdcts; i++) {  // Loop 3
  L1: f0 = tmp[H(i_1)];     if (!tab[H(i_1)])     goto L1;
  L2: f1 = tmp[H(1 + i_1)]; if (!tab[H(1 + i_1)]) goto L2;
  // remaining loads
  // computations
  …
  // stores
  i_1 += 8;
}
Inter-Stage Communication • Hash-based solution • We did not want to add extra delays to the load/store operations • Use H(k) = k MOD m • When m is a power of two (here, a multiple of 2*N), H(k) can be implemented by just using the least significant log2(m) bits of k to address the buffer (translates to simple interconnections)
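The power-of-two case is easy to check in software: masking with m-1 keeps exactly the low log2(m) bits, which equals k MOD m. A short sketch (function names are illustrative):

```c
#include <assert.h>

/* H(k) = k MOD m, computed with the modulo operator. */
static unsigned hash_mod(unsigned k, unsigned m) {
    return k % m;
}

/* Same hash when m is a power of two: keep the low log2(m) bits of k.
   In hardware this is no logic at all, just wiring of address bits. */
static unsigned hash_mask(unsigned k, unsigned m) {
    return k & (m - 1);
}
```

This is why the hash adds no delay to loads and stores: selecting address bits costs nothing, unlike a general modulo.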
Inter-Stage Communication • Hash-based solution: H(k) = k MOD m • Single read (L = 1) • R = 1 on write; reset to 0 after the read • Steps: a) write, b) read, c) empty/full update
Inter-Stage Communication • Hash-based solution: H(k) = k MOD m • Multiple reads (L > 1) • R = 11...1 (L bits) on write; R >>= 1 on each read • Steps: a) write, b) read, c) empty/full update
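One plausible reading of this R-register scheme, sketched in software under that assumption (names `ef_write`/`ef_read` and the single-entry table are illustrative): a write loads the entry with L ones, each read shifts it right, and after L reads the entry is zero, i.e. empty and reusable.

```c
#include <assert.h>
#include <stdbool.h>

#define L_READS 3          /* L: number of reads per written value (illustrative) */

static unsigned tab_entry; /* one L-bit empty/full entry; 0 means empty */

/* a) write: set R = 11...1 (L ones). */
static void ef_write(void) {
    tab_entry = (1u << L_READS) - 1u;
}

/* b) read + c) empty/full update: allowed while R != 0; each read
   shifts R right, so the L-th read leaves the entry empty. */
static bool ef_read(void) {
    if (tab_entry == 0) return false;  /* empty: a consumer would busy-wait */
    tab_entry >>= 1;
    return true;
}
```

With L = 1 this degenerates to the single-read case above: one write bit, cleared by the single read.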
Buffer Size Calculation • By monitoring the behavior of the communication component • For each read and write, determine the buffer size needed to avoid collisions • Done during RTL simulation
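A software analogue of that monitoring (the paper does it during RTL simulation; the counter names here are illustrative): track how many cells are simultaneously occupied, and the running maximum gives the smallest buffer that avoids collisions for that trace.

```c
#include <assert.h>

static int occupied     = 0;  /* cells currently written but not yet consumed */
static int max_occupied = 0;  /* peak occupancy = required buffer size */

/* Called on every producer write observed during simulation. */
static void on_write(void) {
    occupied++;
    if (occupied > max_occupied) max_occupied = occupied;
}

/* Called on every consumer read observed during simulation. */
static void on_read(void) {
    occupied--;
}
```

For example, the trace write, write, read, write, read, read never holds more than two live values, so a two-entry buffer would suffice for it.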
Experimental Setup • Compilation flow • Uses our previous work on compiling algorithms in a Java subset to FPGAs
Experimental Setup • Simulation back-end • Architecture descriptions (datapath.xml, fsm.xml, rtg.xml) are translated by XSLT transformations (to dotty, to hds, to vhdl, to java) into datapath.hds, fsm.java, and rtg.java • The compiled classes (fsm.class, rtg.class), an ANT build file, and I/O data (RAMs and stimulus) drive the HADES simulator, using a library of operators (Java)
Experimental Results • Benchmarks
Experimental Results • FDCT (speed-up achieved by Pipelining Sequences of Loops)
Experimental Results • What happens to the buffer sizes?
Experimental Results • Adjust task latencies in order to balance the pipeline stages: • Slow down tasks with lower latency • Optimize the slower tasks in order to reduce their latency • Slowing down producer tasks usually reduces the size of the inter-stage buffers
Experimental Results • Buffer sizes, measured for: original; +1 cycle per iteration of the producer; +2 cycles per iteration of the producer; optimizations in the producer; +optimizations in the consumer
Experimental Results • Buffer sizes
Experimental Results • Resources and Frequency (Spartan-3 400)
Related Work • Previous approach (Ziegler et al.) • Coarse-grained communication and synchronization scheme • FIFOs are used to communicate data between pipeline stages • The width of the FIFO stages depends on the producer/consumer ordering • Less applicable
Conclusions • We presented a scheme to accelerate applications by pipelining sequences of loops • I.e., before the end of one stage (a set of nested loops), a subsequent stage (another set of nested loops) can start executing based on the data already produced • A data-driven scheme based on empty/full tables is used • A scheme to reduce the size of the memory buffers for inter-stage pipelining (using a simple hash function) • Depending on the consumer/producer ordering, speedups close to the theoretical ones are achieved • as if the stages executed concurrently and independently
Future Work • Research other hash functions • Study slowdown effects • Apply the technique in the context of multi-core systems (processor cores communicating through memories)
Acknowledgments • Work partially funded by • CHIADO - Compilation of High-Level Computationally Intensive Algorithms to Dynamically Reconfigurable COmputing Systems • Portuguese Foundation for Science and Technology (FCT), POSI and FEDER, POSI/CHS/48018/2002 • Based on the work done by Rui Rodrigues • In collaboration with Pedro C. Diniz
A Data-Driven Approach for Pipelining Sequences of Data-Dependent Loops