Simplifying Parallel Programming with Compiler Transformations
Matt Frank, University of Illinois
mif@illinois.edu
What I’m ranting about
• Transformations that alleviate tedium
  • Analogous to code generation, register allocation, and instruction scheduling
  • (Not really “optimizations”)
• Mainly: loop distribution, reassociation, “scalar” expansion, inspector-executor, and hashing
  • These cover much more parallel-language expressivity than you might think
Assumptions
• Cache-coherent shared-memory many-cores
  • (I’m not addressing distributed-memory issues)
• Synchronization is somewhat expensive
  • Don’t use barriers gratuitously (but don’t avoid them at all costs)
• Analysis is not my problem
  • The programmer annotates
• Non-determinism is outside the realm of this talk
• No race detection in this talk either
Compiler Flow
• Front-end: type systems and whole-program analysis
  • New information: type systems (e.g., DPJ), domain-specific objects, run-time feedback
• Dependence-graph (PDG) based compiler
  • Program analysis (information about high-level program invariants) enables more efficient coherence, checkpointing, and q.o.s.
• Runtime/execution platform
  • New capabilities: checkpointing, q.o.s. guarantees
  • Feedback flows back up to the compiler
I’m leaving out locality
• Front-end: type systems and whole-program analysis
• Parallelism-exposing transformations (this talk)
• Tiling, etc. (the locality part I’m leaving out)
• Runtime/execution platform
What’s enabled?
• Loops that contain arbitrary control flow
  • Including early exits, arbitrary function calls, etc.
• Arbitrary iterators (even sequential ones)
  • They can’t depend on the main body of the computation, though
• Arbitrary combinations of data-parallel work, scans, and reductions
  • Can use “partial sums” inside the loop
• Buffered printf
The transformations
• Scalar expansion
  • Eliminates anti- and output dependences
  • Can be applied to properly scoped aggregates
• Reassociation
  • Integer reassociation is extraordinarily useful
  • Can use partial sums later in the loop!
• Loop distribution
  • Think of it as scheduling
• Inspector-executor
  • As long as the data access pattern is invariant in the loop
You’ve heard of map-reduce

Before:

    doall i (1..n)
      private j = f(X[i])
      total = total + j

After scalar expansion of j, with a sequential reduction:

    shared j[n]
    doall i (1..n)
      j[i] = f(X[i])
    do i (1..n)
      total = total + j[i]
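In a real shared-memory setting this is what OpenMP’s reduction clause does under the hood: each thread accumulates a private partial sum, and the partials are combined at the end. A minimal C sketch, where f() and the array contents are placeholder assumptions:

    #include <stdio.h>

    #define N 1000

    static double f(double x) { return x * x; }  /* placeholder "map" function */

    int main(void) {
        double X[N], total = 0.0;
        for (int i = 0; i < N; i++) X[i] = i * 0.001;

        /* reduction(+:total) performs, in effect, the scalar expansion
           above: private partial sums, combined after the loop. */
        #pragma omp parallel for reduction(+:total)
        for (int i = 0; i < N; i++)
            total += f(X[i]);

        printf("total = %f\n", total);
        return 0;
    }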
How ‘bout scan-map?

    struct { data; *next; } *p;

Before:

    doall p != NULL
      modify(p->data)
      p = p->next

After (a sequential scan gathers the node pointers, then the body runs as a doall):

    n = 0
    do p != NULL
      a[n++] = p
      p = p->next
    doall i (0..n)
      modify(a[i]->data)
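A minimal C sketch of the same transformation, assuming a hypothetical node layout and a placeholder modify(): the pointer chase stays sequential, and the (presumably expensive) loop body goes parallel.

    #include <stdlib.h>

    struct node { double data; struct node *next; };

    static void modify(double *d) { *d *= 2.0; }  /* placeholder body */

    void scan_map(struct node *head, size_t max_nodes) {
        struct node **a = malloc(max_nodes * sizeof *a);
        size_t n = 0;

        /* Sequential scan: the pointer chase itself can't be a doall. */
        for (struct node *p = head; p != NULL && n < max_nodes; p = p->next)
            a[n++] = p;

        /* Parallel map over the gathered node pointers. */
        #pragma omp parallel for
        for (long i = 0; i < (long)n; i++)
            modify(&a[i]->data);

        free(a);
    }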
Sparse matrix construction

    scan int ptr = 0
    shared float data[m]
    shared int rows[n]
    doall row (1..n)
      private j
      rows[row] = ptr
      for j in non_zeros(row)
        data[ptr] = foo(row, j)
        ptr++

[figure: the data and rows arrays being filled, with the row and ptr cursors]
Partial Sum Expansion

Before:

    scan int ptr = 0
    shared float data[m]
    shared int rows[n]
    doall row (1..n)
      private j
      rows[row] = ptr
      for j in non_zeros(row)
        data[ptr] = foo(row, j)
        ptr++

After expanding the partial sum:

    scan int ptr[n]        # scalar-expand ptr
    shared float data[m]
    shared int rows[n]
    doall row (1..n)
      private j
      ptr[row] = 0
      rows[row] = rows[row-1] + ptr[row-1]
      for j in non_zeros(row)
        data[rows[row] + ptr[row]] = foo(row, j)
        ptr[row]++
Scalar Expansion (and inner loop fission)

Before:

    scan int ptr[n]
    shared float data[m]
    shared int rows[n]
    doall row (1..n)
      private j
      ptr[row] = 0
      rows[row] = rows[row-1] + ptr[row-1]
      for j in non_zeros(row)
        data[rows[row] + ptr[row]] = foo(row, j)
        ptr[row]++

After:

    scan int ptr[n]
    shared float data[m]
    shared int rows[n]
    doall row (1..n)
      private j
      private vector mydata
      ptr[row] = 0
      rows[row] = rows[row-1] + ptr[row-1]
      for j in non_zeros(row)
        mydata.pushback(foo(row, j))
        ptr[row]++
      for j (rows[row], rows[row]+ptr[row])
        data[j] = mydata.popfront()
Outer Loop Fission

Before:

    scan int ptr[n]
    shared float data[m]
    shared int rows[n]
    doall row (1..n)
      private j
      private vector mydata
      ptr[row] = 0
      rows[row] = rows[row-1] + ptr[row-1]
      for j in non_zeros(row)
        mydata.pushback(foo(row, j))
        ptr[row]++
      for j (rows[row], rows[row]+ptr[row])
        data[j] = mydata.popfront()

After (the carried prefix sum becomes its own sequential loop between two doalls):

    scan int ptr[n]
    shared float data[m]
    shared int rows[n]
    doall row (1..n)
      private j
      private vector mydata
      ptr[row] = 0
      for j in non_zeros(row)
        mydata.pushback(foo(row, j))
        ptr[row]++
    do row (1..n)
      rows[row] = rows[row-1] + ptr[row-1]
    doall row (1..n)
      for j (rows[row], rows[row]+ptr[row])
        data[j] = mydata.popfront()
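A minimal C sketch of this final three-phase shape, under assumptions not in the slides: hypothetical callbacks nnz() and foo(), and a known per-row bound standing in for the private mydata vectors. Phase 1 buffers each row in parallel, phase 2 is the sequential prefix sum, phase 3 concatenates in parallel.

    #include <stdlib.h>

    void build_csr(int n, int max_row_nnz,
                   int (*nnz)(int row),          /* count of nonzeros in row */
                   float (*foo)(int row, int j), /* value of the j-th nonzero */
                   float *data, int *rows /* length n+1 */)
    {
        float *buf = malloc((size_t)n * max_row_nnz * sizeof *buf);
        int *cnt = calloc(n, sizeof *cnt);

        /* Phase 1 (doall): each row fills its own private buffer. */
        #pragma omp parallel for
        for (int row = 0; row < n; row++) {
            int m = nnz(row);
            for (int j = 0; j < m; j++)
                buf[(size_t)row * max_row_nnz + j] = foo(row, j);
            cnt[row] = m;
        }

        /* Phase 2 (do): sequential exclusive prefix sum -> row offsets. */
        rows[0] = 0;
        for (int row = 0; row < n; row++)
            rows[row + 1] = rows[row] + cnt[row];

        /* Phase 3 (doall): concatenate the private buffers into data[]. */
        #pragma omp parallel for
        for (int row = 0; row < n; row++)
            for (int j = 0; j < cnt[row]; j++)
                data[rows[row] + j] = buf[(size_t)row * max_row_nnz + j];

        free(cnt);
        free(buf);
    }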
Concatenation

[figure: each processor’s private data/rows segments concatenated into the shared data and rows arrays; the per-row phases are parallel, the prefix sum over ptr is sequential]
printf() is the same pattern

    doall i (1..n)
      private mystring = s(i)
      printf(mystring)

[figure: the private mystrings are concatenated, in iteration order, into the stdout buffer]
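A minimal sketch of buffered printf in C, with a placeholder snprintf() standing in for s(): formatting runs in parallel into private buffers, then a sequential pass emits them in iteration order, so the output is deterministic.

    #include <stdio.h>
    #include <stdlib.h>

    #define N 16
    #define LINE_LEN 64

    int main(void) {
        char (*lines)[LINE_LEN] = malloc(N * sizeof *lines);

        /* Parallel "map": each iteration formats into its own buffer. */
        #pragma omp parallel for
        for (int i = 0; i < N; i++)
            snprintf(lines[i], LINE_LEN, "iteration %d\n", i);

        /* Sequential "concatenation": deterministic output order. */
        for (int i = 0; i < N; i++)
            fputs(lines[i], stdout);

        free(lines);
        return 0;
    }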
Sparse array updates

    doall i (1..n)
      private j
      for j in neighbors_of(i)
        private temp = foo(i, j)
        x[i] += temp
        x[j] += temp
Becomes

    doall i (1..n)
      private j
      for j in neighbors_of(i)
        private temp = foo(i, j)
        continue[hash(i)][myproc].push(i, temp)
        continue[hash(j)][myproc].push(j, temp)
    doall p (1..P)
      for t (1..P)
        for (ptr, val) in continue[p][t]
          x[ptr] += val

[figure: the P-by-P continuation matrix, indexed by owning processor and pushing processor]
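A C sketch of the continuation-matrix idea, under stated assumptions: hash(i) is taken to be i mod P, neighbors_of() and foo() are hypothetical callbacks, and each bucket is a simple growable array. Pushes need no locks because each (owner, pusher) bucket is written by exactly one thread; the apply phase is race-free because every index hashes to exactly one owner.

    #include <stdlib.h>
    #include <omp.h>

    struct update { int ptr; double val; };
    struct bucket { struct update *u; int len, cap; };

    static void push(struct bucket *b, int ptr, double val) {
        if (b->len == b->cap) {
            b->cap = b->cap ? 2 * b->cap : 64;
            b->u = realloc(b->u, b->cap * sizeof *b->u);
        }
        b->u[b->len++] = (struct update){ ptr, val };
    }

    void sparse_updates(int n, double *x,
                        int (*neighbors_of)(int i, int *nbrs), /* fills nbrs */
                        double (*foo)(int i, int j)) {
        int P = omp_get_max_threads();
        struct bucket *cont = calloc((size_t)P * P, sizeof *cont);
        #define CONT(owner, from) cont[(owner) * P + (from)]

        #pragma omp parallel
        {
            int me = omp_get_thread_num();
            int nbrs[1024];  /* assumed maximum degree, a placeholder */

            /* Phase 1: push each update to the bucket owned by hash(index). */
            #pragma omp for
            for (int i = 0; i < n; i++) {
                int d = neighbors_of(i, nbrs);
                for (int k = 0; k < d; k++) {
                    int j = nbrs[k];
                    double t = foo(i, j);
                    push(&CONT(i % P, me), i, t);  /* hash(i) = i mod P */
                    push(&CONT(j % P, me), j, t);
                }
            }
            /* (implicit barrier here, between the two omp for loops) */

            /* Phase 2: owner p applies every update hashed to it. */
            #pragma omp for
            for (int p = 0; p < P; p++)
                for (int t = 0; t < P; t++)
                    for (int k = 0; k < CONT(p, t).len; k++)
                        x[CONT(p, t).u[k].ptr] += CONT(p, t).u[k].val;
        }

        for (int b = 0; b < P * P; b++) free(cont[b].u);
        free(cont);
        #undef CONT
    }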
Graph updates

    doall i (1..n)
      newvalue = value[i]
      for pred in predecessors[i]
        newvalue = f(newvalue, value[pred])
      value[i] = newvalue
Inspector-Executor
(Polychronopoulos ’88; Saltz ’91; Leung/Zahorjan ’93)

The inspector computes each node’s wavefront depth, one level deeper than its deepest predecessor:

    int wavefront[n] = {0}
    do i (1..n)
      wavefront[i] = 1 + max(wavefront[i's predecessors])

The executor then runs one doall per wavefront:

    do w (1..maxdepth)
      doall i suchthat wavefront[i] == w
        newvalue = value[i]
        for pred in predecessors[i]
          newvalue = f(newvalue, value[pred])
        value[i] = newvalue
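A minimal C sketch of this inspector-executor, assuming a CSR-style predecessor representation (pred_off/pred) and a placeholder f(), neither of which is in the slides. The inspector is sequential; the executor parallelizes within each wavefront.

    #include <stdlib.h>

    void graph_update(int n, const int *pred_off, const int *pred,
                      double *value, double (*f)(double, double)) {
        int *wf = calloc(n, sizeof *wf);
        int maxdepth = 0;

        /* Inspector (sequential): node i sits one level deeper than its
           deepest predecessor. Assumes predecessors have smaller indices,
           matching the do i (1..n) order of the original loop. */
        for (int i = 0; i < n; i++) {
            for (int k = pred_off[i]; k < pred_off[i + 1]; k++)
                if (wf[pred[k]] + 1 > wf[i])
                    wf[i] = wf[pred[k]] + 1;
            if (wf[i] > maxdepth) maxdepth = wf[i];
        }

        /* Executor: all nodes on one wavefront are independent. (A real
           implementation would bucket nodes by wavefront rather than
           rescanning all n nodes each pass.) */
        for (int w = 0; w <= maxdepth; w++) {
            #pragma omp parallel for
            for (int i = 0; i < n; i++) {
                if (wf[i] != w) continue;
                double nv = value[i];
                for (int k = pred_off[i]; k < pred_off[i + 1]; k++)
                    nv = f(nv, value[pred[k]]);
                value[i] = nv;
            }
        }

        free(wf);
    }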
Limits of what we know

    doall node in worklist
      modify graph structure
What I’ve shown you
• Scalar expansion
  • Eliminates anti- and output dependences
  • Can be applied to properly scoped aggregates
• Reassociation
  • Integer reassociation is extraordinarily useful
  • Can use partial sums later in the loop!
• Loop distribution
  • Think of it as scheduling
• Inspector-executor
  • As long as the data access pattern is invariant in the loop
Where next?
• Relieve tedium
  • (Build the compiler, or frameworks, or ...)
• Find new patterns
  • Delaunay triangulation
  • Pick an example application: there will be something new you wish could be transformed automatically
• Parallel languages beyond “doall” and “reduce”