Automatic Data Movement and Computation Mapping for Multi-level Parallel Architectures with Explicitly Managed Memories

Muthu Baskaran(1), Uday Bondhugula(1), Sriram Krishnamoorthy(1), J. Ramanujam(2), Atanas Rountev(1), P. Sadayappan(1)

(1) Department of Computer Science & Engineering, The Ohio State University
(2) Department of Electrical and Computer Engineering, Louisiana State University

PPoPP 2008
Talk Outline
• Introduction
• Challenges
• Automatic Data Management
• Multi-level Tiling
• Experiments
• Related Work
• Summary
• Ongoing and Future Work
Emergence of Multi-core Architectures
• Single-processor performance
  – Improved by ~50%/yr for almost two decades
  – Driven by clock speed, ILP, …
  – Clock speed increased over 100x
• Limits to single-processor performance growth
  – Increase in power density
  – Flattening of clock speed due to power limitations
• Transistor density continues to rise unabated
• Multiple cores are now the best option for sustained performance growth
Scratchpad Memories (1/2)
• Need to optimize memory bandwidth and latency in multi-core architectures
• Traditional solution: introduce a cache hierarchy
  – Drawback: caches are hardware-managed, making miss behavior difficult to model and program execution times difficult to predict
• Solution in many modern architectures: fast on-chip, explicitly managed memory – a scratchpad memory (local store)
Scratchpad Memories (2/2)
• Scratchpads
  – Software-managed: control over data movement, and easier to model performance
  – Burden on the programmer/compiler to manage and utilize
  – Lower power per chip area than a cache
• Some modern architectures with scratchpad memories: GPU, Cell, MPSoC
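To make the software-managed model concrete, here is a minimal CUDA sketch (a hypothetical kernel of ours, not from the paper): each thread block explicitly stages data into its multiprocessor's scratchpad (shared memory), synchronizes, and computes out of the on-chip copy.

    // Hypothetical kernel: stage one tile of the input into the
    // per-multiprocessor scratchpad (__shared__), then compute from it.
    __global__ void scale_from_scratchpad(const float *in, float *out, int n)
    {
        __shared__ float tile[256];               // assumes blockDim.x <= 256
        int i = blockIdx.x * blockDim.x + threadIdx.x;

        if (i < n)
            tile[threadIdx.x] = in[i];            // explicit move-in
        __syncthreads();                          // tile visible to whole block

        if (i < n)
            out[i] = 2.0f * tile[threadIdx.x];    // compute from scratchpad
    }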
Challenges
• Effective management of on-chip scratchpads in multi-core architectures
  – Utilize the limited capacity of the scratchpad
  – Optimize data movement
• Effective computation mapping in many-core architectures with multiple levels of parallelism
  – Exploit available parallelism
  – Account for scratchpad capacity constraints
Data Management Issues
• Orchestration of data movement between off-chip (global) memory and on-chip scratchpad memory
• Decisions on
  – What data elements to move in and out of the scratchpad
  – When to move data
  – How to move data
  – How to access the data elements copied to the scratchpad
Overview of Automatic Data Management Approach (1/2)
• Allocation of storage space (as arrays) in the scratchpad memory for local copies
• Determination of access functions for the arrays in scratchpad memory
• Generation of code for moving data between scratchpad (local) and off-chip (global) memory
Overview of Automatic Data Management Approach (2/2)
• Targeted at affine programs
  – Dense arrays
  – Loop bounds: affine functions of outer loop variables, constants, and program parameters
  – Array access functions: affine functions of surrounding loop variables, constants, and program parameters
• Developed using the polyhedral model
  – An algebraic framework for representing affine programs (statement domains, dependences, array access functions) and affine program transformations
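As a concrete illustration (ours, not from the paper), the loop nest below is affine, while the commented variants fall outside the model:

    /* Affine: bounds and subscripts are affine in the outer loop
     * variables and the parameter N (assumes N <= 30 so the subscripts
     * stay within the 64x64 arrays). */
    void affine_example(int N, float A[64][64], float B[64][64])
    {
        for (int i = 0; i < N; i++)
            for (int j = i; j < N; j++)          /* affine bound: j >= i */
                A[2*i + 3][j + 1] += B[i][j];    /* affine subscripts */
        /* Non-affine (outside the model): subscripts such as A[i*j][0],
         * or data-dependent bounds such as j < B[i][0]. */
    }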
Polyhedral Model

for (i=1; i<=4; i++)
  for (j=2; j<=4; j++)
S1: a[i][j] = a[j][i] + a[i][j-1];

Iteration vector: $x_{S1} = (i, j)^T$, with bounds $i \ge 1$, $i \le 4$, $j \ge 2$, $j \le 4$.

Iteration domain:
$$I_{S1} = \left\{ (i,j)^T \;\middle|\; \begin{pmatrix} 1 & 0 \\ -1 & 0 \\ 0 & 1 \\ 0 & -1 \end{pmatrix} \begin{pmatrix} i \\ j \end{pmatrix} + \begin{pmatrix} -1 \\ 4 \\ -2 \\ 4 \end{pmatrix} \ge 0 \right\}$$

Affine access functions of the three references to a:
$$F_{1a}(x_{S1}) = \begin{pmatrix} 1 & 0 \\ 0 & 1 \end{pmatrix} \begin{pmatrix} i \\ j \end{pmatrix}, \quad F_{2a}(x_{S1}) = \begin{pmatrix} 0 & 1 \\ 1 & 0 \end{pmatrix} \begin{pmatrix} i \\ j \end{pmatrix}, \quad F_{3a}(x_{S1}) = \begin{pmatrix} 1 & 0 \\ 0 & 1 \end{pmatrix} \begin{pmatrix} i \\ j \end{pmatrix} + \begin{pmatrix} 0 \\ -1 \end{pmatrix}$$

for a[i][j], a[j][i], and a[i][j-1], respectively.

Data space of a accessed through the first reference: $D_{S1a} = F_{1a}(I_{S1})$.

(Figure: the iteration space of S1 drawn as a 2-D polytope in the (i, j) plane.)
Automatic Data Allocation
• Given a program block, identify the storage space needed for each non-overlapping accessed region of every array
  – Access functions of the array references may be non-uniformly generated
• For architectures supporting direct data access from off-chip memory (e.g., the nVIDIA GeForce GPU)
  – Estimate the extent of data reuse to decide whether or not to copy to the scratchpad
Algorithm and Illustration

for (i=10; i<=14; i++) {
  for (j=10; j<=14; j++) {
    A[i][j+1] = A[i+j][j+1] * 3;
    for (k=11; k<=20; k++)
      B[i][j+k] = A[i][k] + B[i+j][k];
  }
}

• Find the set of all data spaces accessed by all references to an array, using
  – The access function of each reference
  – The iteration space of the statement that holds the reference
• Partition the set of all data spaces into maximal disjoint (non-overlapping) subsets of data spaces
• Find the bounding box of each partition; allocate one local memory array per bounding box

For array A, this yields two local arrays:
  Local array LA0: lb(i) = 10, ub(i) = 14; lb(j) = 11, ub(j) = 20
  Local array LA1: lb(i) = 20, ub(i) = 28; lb(j) = 11, ub(j) = 15

(Figure: accessed regions of array A, with the two bounding boxes spanning rows 10–14 and 20–28, and columns 11–20 and 11–15.)
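A minimal C sketch of the bounding-box step (a hypothetical helper of ours, not the paper's implementation): for an affine subscript over a rectangular iteration space, each dimension's extremes are attained at the corners of the space, so the sign of each coefficient selects which loop bound to substitute.

    /* Bounding box of one array dimension accessed through an affine
     * function f(x) = c[0]*x[0] + ... + c[m-1]*x[m-1] + c[m]
     * over a rectangular iteration space lb[k] <= x[k] <= ub[k]. */
    void dim_bounds(int m, const int *c, const int *lb, const int *ub,
                    int *dim_lb, int *dim_ub)
    {
        int lo = c[m], hi = c[m];          /* start from the constant term */
        for (int k = 0; k < m; k++) {
            if (c[k] >= 0) { lo += c[k] * lb[k]; hi += c[k] * ub[k]; }
            else           { lo += c[k] * ub[k]; hi += c[k] * lb[k]; }
        }
        *dim_lb = lo;  *dim_ub = hi;
    }

For the reference A[i+j][j+1] above (i, j in [10, 14]), dimension 0 has coefficients (1, 1) and constant 0, giving [20, 28], and dimension 1 has coefficients (0, 1) and constant 1, giving [11, 15] – exactly the extents of LA1.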
Accessing Arrays in Scratchpad

Original code:
for (i=10; i<=14; i++) {
  for (j=10; j<=14; j++) {
    A[i][j+1] = A[i+j][j+1] * 3;
    for (k=11; k<=20; k++)
      B[i][j+k] = A[i][k] + B[i+j][k];
  }
}

Code accessing the scratchpad copies:
for (i=10; i<=14; i++) {
  for (j=10; j<=14; j++) {
    LA0[i-10][j+1-11] = LA1[i+j-20][j+1-11] * 3;
    for (k=11; k<=20; k++)
      LB0[i-10][j+k-21] = LA0[i-10][k-11] + LB1[i+j-20][k-11];
  }
}

• The array dimensionality in the scratchpad may be lower than the original array dimensionality, depending on the accessed data
• Access function for the local memory array: the original (or reduced) access function, offset by the lower bounds (in each dimension) of the scratchpad array
Data Movement Code Generation

/* Data move-in code */
for (i=10; i<=14; i++)
  for (j=11; j<=20; j++)
    LA0[i-10][j-11] = A[i][j];
for (i=20; i<=28; i++)
  for (j=max(i-13,11); j<=min(15,i-9); j++)
    LA1[i-20][j-11] = A[i][j];

/* Data move-out code */
for (i=10; i<=14; i++)
  for (j=11; j<=15; j++)
    A[i][j] = LA0[i-10][j-11];

• Generation of the loop structure
  – Scan the polytopes (using CLooG, a tool for code generation) corresponding to the data spaces of
    · read references – for moving data into the scratchpad
    · write references – for moving data out of the scratchpad
• Generation of the loop body (the data movement statement)
  – Copy from a location in the scratchpad buffer to the off-chip memory location, or vice versa
GPU Architecture

(Figure: a set of multiprocessors, each with its own scratchpad, connected to off-chip memory.)

• Architectural components
  – Slow off-chip (global) memory
  – Two levels of parallelism
    · A set of multiprocessors
    · A set of processor cores in each multiprocessor
  – A scratchpad on each multiprocessor, shared by its processor cores
Multi-level Tiling Approach
• Builds on the tiling transformation framework recently developed at OSU by Bondhugula (CC-08, PLDI-08)
  – Finds tiling transformations (hyperplanes) for sequences of imperfectly nested loops
  – Enables communication-minimal parallelization and locality optimization
  – Identifies loops to tile for parallelism and data locality
• Multiple levels of tiling, for exploiting parallelism across multiple parallel levels
• Additional (sequential) tiling at each level that has a scratchpad memory, if the data required by the tile executing at that level exceeds the scratchpad capacity
  – Data movement at the start and end of each sequential tile
  – Synchronization points to ensure consistency
Example

Original loop nest:
FORALL i = 1, Ni
  FORALL j = 1, Nj
    FOR k = 1, WS
      FOR l = 1, WS
        S1
      END FOR
    END FOR
  END FORALL
END FORALL

Multi-level tiled version:
// Tiling to distribute at the outer level
FORALL iT = 1, Ni, Ti
  FORALL jT = 1, Nj, Tj
    // Tiling to satisfy the scratchpad memory limit
    FOR i' = iT, min(iT+Ti-1,Ni), ti'
      FOR j' = jT, min(jT+Tj-1,Nj), tj'
        FOR k' = 1, WS, tk'
          FOR l' = 1, WS, tl'
            <Data move-in code>
            // Tiling to distribute at the inner level
            FORALL it = i', min(i'+ti'-1,Ni), ti
              FORALL jt = j', min(j'+tj'-1,Nj), tj
                FOR i = it, min(it+ti-1,Ni)
                  FOR j = jt, min(jt+tj-1,Nj)
                    FOR k = k', min(k'+tk'-1,WS)
                      FOR l = l', min(l'+tl'-1,WS)
                        S1
                      END FOR
                    END FOR
                  END FOR
                END FOR
              END FORALL
            END FORALL
            <Data move-out code>
          END FOR
        END FOR
      END FOR
    END FOR
  END FORALL
END FORALL
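On the GPU, the two FORALL distribution levels map naturally onto the two hardware levels of parallelism. A sketch of the launch (hypothetical names, ours; assumes the tile sizes divide the extents evenly):

    // One thread block per outer-level tile (iT, jT); one thread per
    // inner-level tile (it, jt). The thread body runs the sequential
    // intra-tile FOR loops of S1 (omitted here).
    __global__ void tile_kernel(float *A) { (void)A; /* intra-tile loops */ }

    void launch(float *dA, int Ni, int Nj, int Ti, int Tj, int ti, int tj)
    {
        dim3 grid(Ni / Ti, Nj / Tj);    // outer-level tiles -> thread blocks
        dim3 block(Ti / ti, Tj / tj);   // inner-level tiles -> threads
        tile_kernel<<<grid, block>>>(dA);
    }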
Tile Size Determination
• Handling scratchpad memory constraints
• Cost model for data movement:
    C = N × (S + (V × L) / P)
  – N: number of data movements
  – S: synchronization cost per data movement
  – V: number of elements per data movement (based on tile sizes)
  – L: cost to transfer one element
  – P: number of processes involved in the data movement
• Tile size search formulation
  – Constraint: memory requirement within the scratchpad limit
  – Objective function: minimize the data movement cost C
Illustration of the Tile Size Search Formulation
• A loop nest of m loops with tile sizes t1, t2, …, tm
• nl local arrays
• Mj – memory (as a function of the tile sizes) for local array j
• Vin_j and Vout_j – volume (as a function of the tile sizes) moved into and out of local array j, respectively
• rj – position in the loop nest where the data movement code of array j is placed
• Mup – total scratchpad memory

Variables: t1, t2, …, tm
Memory constraint: $\sum_{j=1}^{nl} M_j(t_1, \ldots, t_m) \le M_{up}$
Objective function: minimize $C = \sum_{j=1}^{nl} N_j \left( S + \frac{(V^{in}_j + V^{out}_j)\, L}{P} \right)$, where Nj, the number of data movements for array j, is the number of times the data movement code at position rj executes (determined by the trip counts of the enclosing tile loops).
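A minimal sketch of the search (ours, simplified to two tile sizes and a single local array moved in and out once per tile; the paper's formulation handles m tile sizes and nl arrays):

    #include <float.h>

    /* Exhaustive search for tile sizes minimizing the data-movement cost
     * C = N * (S + V*L/P), subject to scratchpad capacity Mup (bytes).
     * 'elem' is the element size in bytes; Ni, Nj are the loop extents. */
    void search_tile_sizes(int Ni, int Nj, double S, double L, int P,
                           long Mup, int elem, int *best_ti, int *best_tj)
    {
        double best = DBL_MAX;
        for (int ti = 1; ti <= Ni; ti++) {
            for (int tj = 1; tj <= Nj; tj++) {
                long M = (long)ti * tj * elem;         /* memory requirement  */
                if (M > Mup) continue;                 /* capacity constraint */
                long N = (long)((Ni + ti - 1) / ti)
                       * ((Nj + tj - 1) / tj);         /* number of movements */
                double V = (double)ti * tj;            /* elements per move   */
                double C = N * (S + V * L / P);        /* cost model          */
                if (C < best) { best = C; *best_ti = ti; *best_tj = tj; }
            }
        }
    }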
Motion Estimation Kernel (1/2)
Machine: NVIDIA GeForce 8800 GTX – 16 multiprocessors × 8 cores @ 1.35 GHz, 768 MB off-chip memory, 16 × 16 KB scratchpad
(Figure: performance of the generated code for the motion estimation kernel.)
1-D Jacobi Kernel (1/2)
Machine: NVIDIA GeForce 8800 GTX – 16 multiprocessors × 8 cores @ 1.35 GHz, 768 MB off-chip memory, 16 × 16 KB scratchpad
(Figure: performance of the generated code for the 1-D Jacobi kernel.)
Motion Estimation Kernel (2/2)
Machine: NVIDIA GeForce 8800 GTX – 16 multiprocessors × 8 cores @ 1.35 GHz, 768 MB off-chip memory, 16 × 16 KB scratchpad
(Figure: performance as a function of tile size; the tile size selected by the model is marked.)
1-D Jacobi Kernel (2/2)
Machine: NVIDIA GeForce 8800 GTX – 16 multiprocessors × 8 cores @ 1.35 GHz, 768 MB off-chip memory, 16 × 16 KB scratchpad
(Figure: performance as a function of tile size; the tile size selected by the model is marked.)
Related Work
• Scratchpad memory management
  – Data reuse – Issenin et al. [DAC06]
  – Allocation for uniformly generated references
    · Schreiber and Cronquist [HPLTR04]
    · Anantharaman and Pande [RTSS98]
    · Kandemir et al. [CAD04]
• Improving performance on cached architectures
  – Ferrante et al. [LCPC92]
  – Gallivan et al. [ICS88]
• Multi-level tiling
  – Fatahalian et al. [SC06] – various levels of memory
  – Bikshandi et al. [PPoPP06] and Renganarayanan et al. [SC07, IPDPS07] – parallelism and locality
Summary
• Addressed two issues in compiling for modern multi-level parallel architectures with scratchpads
  – Data management in scratchpad memory
    · Data allocation
    · Access in the scratchpad
    · Code generation for data movement
  – Mapping of computation in regular programs onto multiple levels of parallel units
• Experimental evaluation using an nVIDIA GPU
Ongoing and Future Work
• Developing an end-to-end compiler framework for modern many-core architectures such as GPUs
  – The algorithms developed in this work are an integral part of the overall compiler framework
• Further optimizing transformations such as tiling for modern architectures like GPUs, using model-driven empirical search
Thank you