Effective Automatic Parallelization of Stencil Computations *

Sriram Krishnamoorthy1, Muthu Baskaran1, Uday Bondhugula1, Atanas Rountev1, J. Ramanujam2, P. Sadayappan1
1 The Ohio State University
2 Louisiana State University
* Work supported by NSF

Introduction
• Stencil computations
• Sweep through large data set
• Multiple time iterations
• Simple load-balanced schedule
• Tiling: essential to improve data locality
• Dependences between tiles
• Pipelined execution
• Skewed iteration spaces: load imbalance
• Solution: adjust tiling to re-enable concurrent execution

Motivation
FOR t = 0 TO T-1
  FOR i = 1 TO N-1
    A[t,i] = (A[t,i-1] + A[t,i] + A[t,i+1]) / 3
[Figure: iteration space, axes t and i]
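A minimal Python sketch of the kind of sweep this slide motivates. It assumes the usual Jacobi form, in which each point is averaged from its three neighbours at the previous time step, and it assumes the two end cells are held fixed. The function name `jacobi_1d` and the two-buffer layout are illustrative choices, not taken from the slides.

```python
def jacobi_1d(initial, steps):
    """Run `steps` sweeps of a 3-point averaging stencil.

    Assumption: values are read from the previous time step and the
    two end cells keep their initial values.
    """
    prev = list(initial)
    n = len(prev)
    for _ in range(steps):
        cur = prev[:]                      # copy keeps the boundary cells
        for i in range(1, n - 1):
            cur[i] = (prev[i - 1] + prev[i] + prev[i + 1]) / 3
        prev = cur
    return prev


if __name__ == "__main__":
    print(jacobi_1d([0.0, 0.0, 3.0, 0.0, 0.0], 1))  # [0.0, 1.0, 1.0, 1.0, 0.0]
```

Every point at one time step can be computed independently of the others, which is the "simple load balanced schedule" available before tiling.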

Notation
• Iteration space B: n-dim polyhedron
• Dependences D: n-dim vectors
• Hyperplanes H: n-dim normal vectors
• Tile bounded by pairs of hyperplanes

Approach
• Concurrent start in non-tiled iteration space
• Identify hyperplanes inhibiting concurrent start in tiled space
• Replace one face for each inhibiting pair
• Overlapped Tiling: replace “back-face”
• Split Tiling: replace “front-face”

Concurrent Start: Before Tiling
Condition: a boundary that does not carry any dependence

Inter-tile Dependences
• Shift vectors
• Tile traversal order
• Normal to all other hyperplanes
• Hyperplane carries dependence
• A dependence “pokes” through
• Inter-tile dependence vector
• Shift vector
• Corresponding hyperplane carries dependence
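The notion of a hyperplane carrying a dependence can be made concrete with dot products. The sketch below assumes the common tiling convention that a hyperplane with normal h carries dependence d when h·d > 0, and that a set of tiling hyperplanes is legal when h·d ≥ 0 for every pair. The dependence vectors (1,-1), (1,0), (1,1) are an assumption for a 3-point stencil that reads the previous time step. The helper names and the chosen normals are illustrative only.

```python
def dot(h, d):
    """Dot product of a hyperplane normal and a dependence vector."""
    return sum(a * b for a, b in zip(h, d))


def is_legal(hyperplanes, deps):
    """Assumed legality test: no dependence points backward across any hyperplane."""
    return all(dot(h, d) >= 0 for h in hyperplanes for d in deps)


def carried_by(h, deps):
    """Dependences that cross hyperplane h in the forward direction."""
    return [d for d in deps if dot(h, d) > 0]


# Assumed (dt, di) dependences of a 3-point stencil reading time step t-1.
deps = [(1, -1), (1, 0), (1, 1)]

if __name__ == "__main__":
    rectangular = [(1, 0), (0, 1)]   # time and unskewed space hyperplanes
    skewed = [(1, 0), (1, 1)]        # time and skewed space hyperplanes
    print(is_legal(rectangular, deps))   # False
    print(is_legal(skewed, deps))        # True
    for h in skewed:
        print(h, carried_by(h, deps))
```

Under these assumptions the rectangular tiling is illegal, and in the skewed tiling both hyperplanes carry dependences, so tiles depend on their neighbours along both tile dimensions.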

Concurrent Start Inhibition
• Concurrent start in original iteration space along a boundary
• But that boundary carries an inter-tile dependence
[Formula annotations: a boundary has concurrent start; S_j is an inter-tile dependence; that boundary carries the inter-tile dependence]
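The cost of an inhibiting boundary can be illustrated with a small scheduling sketch. It computes the earliest step at which each tile in a grid of tiles can start, given a list of inter-tile dependence vectors. The function `start_steps`, the 4 x 8 tile grid, and the two dependence sets are hypothetical and chosen only to contrast a pipelined schedule with a concurrent start.

```python
def start_steps(n_time, n_space, tile_deps):
    """Earliest start step of each tile (t, s).

    tile_deps lists (dt, ds) offsets from a predecessor tile to the tile
    that depends on it; each offset must be lexicographically positive so
    predecessors are visited first.
    """
    start = {}
    for t in range(n_time):
        for s in range(n_space):
            preds = [(t - dt, s - ds) for dt, ds in tile_deps]
            ready = [start[p] + 1 for p in preds if p in start]
            start[(t, s)] = max(ready, default=0)
    return start


def summary(start):
    """(number of tiles that can start immediately, total number of steps)."""
    steps = max(start.values()) + 1
    at_zero = sum(1 for v in start.values() if v == 0)
    return at_zero, steps


if __name__ == "__main__":
    # Dependences along both tile dimensions: pipelined start.
    print(summary(start_steps(4, 8, [(1, 0), (0, 1)])))            # (1, 11)
    # Dependences only on the previous time band: concurrent start.
    print(summary(start_steps(4, 8, [(1, -1), (1, 0), (1, 1)])))   # (8, 4)
```

In the pipelined case only one tile can start at step 0 and the schedule needs 11 steps; with dependences confined to the previous time band all 8 tiles of a band start together and 4 steps suffice.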

Companion Hyperplane
• Hyperplane that destroys the inter-tile dependence
• Swivel a hyperplane “backward”
• Dependences carried by original hyperplane are “neutralized”
• Incoming dependences become non-incoming
• Outgoing dependences become non-outgoing

Overlapped Tiling
• Replace “back face” with companion hyperplane
• Additional region is shared with preceding tile
• Region of preceding tile that caused the dependence
• Each new tile independent of preceding tile (“do-all” parallelism)
• Increased computation cost; communication volume
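A minimal sketch of overlapped tiling for a 1-D 3-point stencil, assuming Jacobi-style updates with fixed end cells. Each tile copies its own cells plus a halo as wide as the time-tile height, recomputes the halo redundantly, and writes back only its own cells, so tiles within a time band never wait on each other. The function names and the parameters `tile_width` and `tile_height` are illustrative; this is not the code generated by the authors' framework.

```python
def jacobi_1d(cells, steps):
    """Reference sweep: 3-point average, end cells held fixed (assumed form)."""
    prev = list(cells)
    for _ in range(steps):
        cur = prev[:]
        for i in range(1, len(prev) - 1):
            cur[i] = (prev[i - 1] + prev[i] + prev[i + 1]) / 3
        prev = cur
    return prev


def overlapped_tiled(initial, steps, tile_width, tile_height):
    """Time-tiled sweep in which every tile of a band is independent."""
    grid = list(initial)
    n = len(grid)
    done = 0
    while done < steps:
        h = min(tile_height, steps - done)      # height of this time band
        nxt = grid[:]
        for lo in range(0, n, tile_width):      # iterations are independent
            hi = min(lo + tile_width, n)
            a, b = max(0, lo - h), min(n, hi + h)   # tile plus halo of width h
            # Halo cells next to a cut go stale one cell per step, so after
            # h steps exactly the cells in [lo, hi) are still correct.
            local = jacobi_1d(grid[a:b], h)
            nxt[lo:hi] = local[lo - a:hi - a]
        grid = nxt
        done += h
    return grid


if __name__ == "__main__":
    data = [float(i % 7) for i in range(50)]
    print(overlapped_tiled(data, 10, 8, 4) == jacobi_1d(data, 10))  # True
```

The redundant halo work is the increased computation cost noted on the slide: in this sketch an interior tile updates about `tile_width + 2 * tile_height` cells per step instead of `tile_width`.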

Split Tiling
• Replace “front face” with companion hyperplane
• Tile split into independent and dependent regions
• Execute independent region followed by dependent region
• Increased #communications
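A minimal sketch of split tiling for the same kind of 1-D 3-point stencil, again assuming Jacobi-style updates with fixed end cells. Within a time band, phase 1 computes the part of each tile that shrinks by one cell per step on each side and so needs nothing from its neighbours; phase 2 then fills the triangular gaps around each tile boundary. No cell is computed twice. All names, and the requirement that tiles be at least twice as wide as the band height, belong to this sketch rather than to the slides.

```python
def jacobi_1d(cells, steps):
    """Reference sweep: 3-point average, end cells held fixed (assumed form)."""
    prev = list(cells)
    for _ in range(steps):
        cur = prev[:]
        for i in range(1, len(prev) - 1):
            cur[i] = (prev[i - 1] + prev[i] + prev[i + 1]) / 3
        prev = cur
    return prev


def avg3(row, i):
    return (row[i - 1] + row[i] + row[i + 1]) / 3


def split_tiled(initial, steps, tile_width, tile_height):
    """Time-tiled sweep: independent regions first, dependent regions second."""
    assert tile_width >= 2 * tile_height, "tiles too narrow to split"
    grid = list(initial)
    n = len(grid)                                   # assumes n >= 2
    cuts = list(range(tile_width, n, tile_width))   # internal tile boundaries
    done = 0
    while done < steps:
        h = min(tile_height, steps - done)
        # levels[k] is time level done + k; fixed end cells are pre-filled.
        levels = [grid] + [[grid[0]] + [None] * (n - 2) + [grid[-1]]
                           for _ in range(h)]
        # Phase 1: independent regions, one per tile (tiles can run in parallel).
        for lo in [0] + cuts:
            hi = min(lo + tile_width, n)
            for k in range(1, h + 1):
                start = lo + k if lo > 0 else 1
                end = hi - k if hi < n else n - 1
                for i in range(start, end):
                    levels[k][i] = avg3(levels[k - 1], i)
        # Phase 2: dependent regions around each cut (cuts can run in parallel).
        for x in cuts:
            for k in range(1, h + 1):
                for i in range(max(1, x - k), min(x + k, n - 1)):
                    levels[k][i] = avg3(levels[k - 1], i)
        grid = levels[h]
        done += h
    return grid


if __name__ == "__main__":
    data = [float((i * i) % 11) for i in range(50)]
    print(split_tiled(data, 10, 8, 4) == jacobi_1d(data, 10))  # True
```

Phase 2 needs values produced by two neighbouring tiles in phase 1, which in a distributed setting means an extra exchange per band; that corresponds to the increased number of communications on the slide.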

Experimental Evaluation
• Cluster
• 2.8 GHz dual-processor Opteron 254
• 1 MB L2 cache; 4 GB RAM
• Linux 2.6.9; Intel compiler (icc) -O3
• Comparison
• Two pipelined schedules: along space and along time
• 1000 time steps
• 1 to 32 processors

Pipelined Execution: Parameters
64000 elements; 32 processors
Space tile size: 1000
Time tile size: 16

Performance with Problem Size

Weak Scaling
• Problem size = #procs * 20000
• Horizontal line: linear scaling

Conclusion
• Time tiling of stencils: crucial for data locality
• Might inhibit concurrent execution
• Presented: two approaches to enabling concurrent execution
• Ongoing work: modeling relative benefits of the two approaches

Thank You!
