Effective Automatic Parallelization of Stencil Computations *

Sriram Krishnamoorthy1, Muthu Baskaran1, Uday Bondhugula1, Atanas Rountev1, J. Ramanujam2, P. Sadayappan1
1 The Ohio State University; 2 Louisiana State University
* Work supported by NSF

Introduction • Stencil computations • Sweep through large data set • Multiple time iterations • Simple load balanced schedule • Tiling – essential to improve data locality • Dependences between tiles • Pipelined execution • Skewed iteration spaces – load imbalance • Solution: Adjust tiling – re-enable concurrent execution

Motivation
FOR t = 0 TO T-1
  FOR i = 1 TO N-1
    A[t,i] = (A[t,i-1] + A[t,i] + A[t,i+1]) / 3
(Figure: iteration space with axes t and i)
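The loop nest on this slide can be sketched in Python as follows. This is a minimal sketch, assuming that each sweep reads the values of the previous time step and that the two end cells stay fixed; the function name jacobi_1d is hypothetical and does not come from the talk.

```python
def jacobi_1d(initial, steps):
    """Run `steps` sweeps of the 3-point averaging stencil.

    Assumptions (not stated on the slide): every sweep reads the values
    of the previous time step, and the first and last cells are fixed.
    """
    prev = list(initial)
    for _ in range(steps):
        cur = prev[:]  # end cells are carried over unchanged
        for i in range(1, len(prev) - 1):
            cur[i] = (prev[i - 1] + prev[i] + prev[i + 1]) / 3
        prev = cur
    return prev
```

Every interior cell at one time step depends on three cells of the step before it, which is what creates the dependences between tiles discussed in the following slides.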

Notation • Iteration space B: n-dim polyhedron • Dependences D: n-dim vectors • Hyperplanes H: • n-dim normal vectors • Tile bounded by pairs of hyperplanes
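To make the notation concrete, the sketch below lists the dependence vectors of a 3-point stencil in (t, i) coordinates and tests candidate hyperplane normals against them. The dependence set assumes each point reads its three neighbours from the previous time step. The legality test (non-negative dot product with every dependence) is the standard tiling condition, not a formulation taken from the talk, and all names are hypothetical.

```python
# Dependence vectors (dt, di), assuming each point reads its three
# neighbours from the previous time step.
DEPENDENCES = [(1, -1), (1, 0), (1, 1)]


def dot(h, d):
    """Dot product of a hyperplane normal and a dependence vector."""
    return sum(a * b for a, b in zip(h, d))


def is_legal_tiling_hyperplane(h, deps=DEPENDENCES):
    """Standard legality test: no dependence points backward across h."""
    return all(dot(h, d) >= 0 for d in deps)


def crossing_dependences(h, deps=DEPENDENCES):
    """Dependences with a strictly positive component along h.

    Treating these as the dependences the hyperplane "carries" is an
    assumed reading for this sketch.
    """
    return [d for d in deps if dot(h, d) > 0]
```

Under these assumptions the normal (0, 1) fails the test, so rectangular tiles in (t, i) are not legal, while the skewed pair (1, 1) and (1, -1) passes.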

Approach • Concurrent start in non-tiled iteration space • Identify hyperplanes inhibiting concurrent start in tiled space • Replace one face for each inhibiting pair • Overlapped Tiling – Replace “back-face” • Split Tiling – Replace “front-face”

Concurrent Start: Before Tiling • Condition: a boundary that does not carry any dependence

Inter-tile Dependences • Shift vectors • Tile traversal order • Normal to all other hyperplanes • Hyperplane carries dependence • A dependence “pokes” through • Inter-tile dependence vector • Shift vector • Corresponding hyperplane carries dependence

Concurrent Start Inhibition • Concurrent start in original iteration space along a boundary • But that boundary carries an inter-tile dependence • A boundary has concurrent start • S_j is an inter-tile dependence • That boundary carries the inter-tile dependence

Companion Hyperplane • Hyperplane that destroys the inter-tile dependence • Swivel a hyperplane “backward” • Dependences carried by original hyperplane are “neutralized” • Incoming dependences become non-incoming • Outgoing dependences become non-outgoing

Overlapped Tiling • Replace “back face” with companion hyperplane • Additional region is shared with preceding tile • Region of preceding tile that caused the dependence • Each new tile independent of preceding tile (“do-all” parallelism) • Increased computation cost; communication volume
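The overlapped idea can be illustrated on a 1-D 3-point stencil. This is a minimal sketch, assuming fixed end cells and sweeps that read the previous time step; it is not the generated code from the talk, and the function names and the width/height parameters are hypothetical. Each tile recomputes a halo of `height` cells on either side (the region shared with its neighbours), so all tiles of a time band are independent. For brevity the tile updates the whole widened range at every step instead of a shrinking trapezoid, so it does more redundant work than necessary.

```python
def jacobi_reference(a, steps):
    """Plain sweep used to check the tiled version (end cells fixed)."""
    prev = list(a)
    for _ in range(steps):
        cur = prev[:]
        for i in range(1, len(prev) - 1):
            cur[i] = (prev[i - 1] + prev[i] + prev[i + 1]) / 3
        prev = cur
    return prev


def overlapped_tile(prev, lo, hi, height):
    """Return cells [lo, hi) after `height` steps, reading only `prev`."""
    n = len(prev)
    left = max(lo - height, 0)    # widen by the halo, clipped to the domain
    right = min(hi + height, n)
    local = prev[left:right]
    for _ in range(height):
        nxt = local[:]
        for j in range(1, len(local) - 1):
            nxt[j] = (local[j - 1] + local[j] + local[j + 1]) / 3
        local = nxt
    # Halo cells may hold stale values by now; only [lo, hi) is returned.
    return local[lo - left:hi - left]


def jacobi_overlapped(a, steps, width, height):
    """Time-tiled sweep in which the tiles of a band do not communicate."""
    prev = list(a)
    done = 0
    while done < steps:
        h = min(height, steps - done)
        cur = []
        for lo in range(0, len(prev), width):  # iterations are independent
            hi = min(lo + width, len(prev))
            cur.extend(overlapped_tile(prev, lo, hi, h))
        prev = cur
        done += h
    return prev
```

Because each tile reads only the values from the start of the band, the loop over tiles could be run as a do-all loop; the price is the recomputed halo.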

Split Tiling • Replace “front face” with companion hyperplane • Tile split into independent and dependent regions • Execute independent region followed by dependent region • Increased #communications
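Split tiling can be sketched on the same 1-D 3-point stencil. This is a minimal sketch under stated assumptions (fixed end cells, sweeps that read the previous time step, and width >= 2 * height so that the fill-in regions do not touch); it is not the scheme as implemented by the authors, and all names are hypothetical. Phase 1 computes the part of each tile that needs no neighbour data; phase 2 fills in the regions around the tile boundaries, which depend on phase 1 results from both sides.

```python
def jacobi_reference(a, steps):
    """Plain sweep used to check the tiled version (end cells fixed)."""
    prev = list(a)
    for _ in range(steps):
        cur = prev[:]
        for i in range(1, len(prev) - 1):
            cur[i] = (prev[i - 1] + prev[i] + prev[i + 1]) / 3
        prev = cur
    return prev


def jacobi_split(a, steps, width, height):
    """Time-tiled sweep with an independent and a dependent phase."""
    if width < 2 * height:
        raise ValueError("this sketch needs width >= 2 * height")
    n = len(a)
    if n < 3:
        return list(a)  # no interior cells to update
    prev = list(a)
    done = 0
    while done < steps:
        h = min(height, steps - done)
        # levels[s][i] holds cell i after s steps of the current band.
        levels = [prev] + [[None] * n for _ in range(h)]
        for s in range(1, h + 1):
            levels[s][0], levels[s][n - 1] = prev[0], prev[n - 1]

        def update(s, i):
            below = levels[s - 1]
            levels[s][i] = (below[i - 1] + below[i] + below[i + 1]) / 3

        bounds = list(range(0, n, width)) + [n]
        # Phase 1: independent regions, shrinking by one cell per step
        # on every side that faces a neighbouring tile.
        for lo, hi in zip(bounds, bounds[1:]):
            for s in range(1, h + 1):
                start = 1 if lo == 0 else lo + s
                stop = n - 1 if hi == n else hi - s
                for i in range(start, stop):
                    update(s, i)
        # Phase 2: dependent regions, growing around each tile boundary.
        for b in bounds[1:-1]:
            for s in range(1, h + 1):
                for i in range(max(b - s, 1), min(b + s, n - 1)):
                    update(s, i)
        prev = levels[h]
        done += h
    return prev
```

In this sketch no cell is computed twice, in contrast to overlapped tiling, but each band needs two phases with an exchange of boundary data between them.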

Experimental Evaluation • Cluster • 2.8 GHz dual-processor Opteron 254 • 1MB L2 cache; 4GB RAM • Linux 2.6.9; Intel compiler (icc) -O3 • Comparison • Two pipelined schedules – along space and time • 1000 time steps • 1 – 32 processors

Pipelined Execution: Parameters • 64000 elements; 32 processors • Space tile size: 1000 • Time tile size: 16

Performance with Problem Size

Weak Scaling • Problem size = #procs * 20000 • Horizontal line – Linear Scaling

Conclusion • Time tiling stencils – crucial for data locality • Might inhibit concurrent execution • Presented: Two approaches to enabling concurrent execution • Ongoing work: Modeling relative benefits of the two approaches

Thank You!
