Compiler and Runtime Support for Parallelizing Irregular Reductions on a Multithreaded Architecture Gary M. Zoppetti Gagan Agrawal Rishi Kumar
Motivation: Irregular Reductions • Frequently arise in scientific computations • Widely studied in the context of distributed memory machines, shared memory machines, distributed shared memory machines, uniprocessor cache • Main difficulty: can’t apply traditional compile-time optimizations • Runtime optimizations: trade-off between runtime costs and efficiency of execution
Motivation: Multithreaded Architectures • Multiprocessors based upon multithreading • Support multiple threads of execution on each processor • Support low-overhead context switching and thread initiation • Low-cost point-to-point communication and synchronization
Problem Addressed • Can we use multiprocessors based upon multithreading for irregular reductions? • What kind of runtime and compiler support is required? • What level of performance and scalability is achieved?
Outline • Irregular Reductions • Execution Strategy • Runtime Support • Compiler Analysis • Experimental Results • Related Work • Summary
Irregular Reductions: Example

for (tstep = 0; tstep < num_steps; tstep++) {
  for (i = 0; i < num_edges; i++) {
    node1 = nodeptr1[i];
    node2 = nodeptr2[i];
    force = h(node1, node2);
    reduc1[node1] += force;
    reduc1[node2] += -force;
  }
}
Irregular Reductions • Irregular Reduction Loops • Elements of LHS arrays may be incremented in multiple iterations, but only using commutative & associative operators • No loop-carried dependences other than those on elements of the reduction arrays • One or more arrays are accessed using indirection arrays • Codes from many scientific & engineering disciplines contain them (simulations involving irreg. meshes, molecular dynamics, sparse codes)
Execution Strategy Overview • Partition edges (interactions) among processors • Challenge: updating reduction arrays • Divide reduction arrays into NUM_PROCS portions – revolving ownership • Execute NUM_PROCS phases on each processor
Execution Strategy • To exploit multithreading, use k * NUM_PROCS phases and reduction portions
[Figure: reduction array portions 0–7 laid out across processors P0–P3, with ownership revolving across phases (Phase 0 and Phase 2 shown)]
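The slides do not spell out how node indices map to portions; below is a minimal sketch of one plausible block partitioning, where the helper names portion_size and portion_of are hypothetical:

/* Minimal sketch (hypothetical helpers): the reduction array is split into
 * k * NUM_PROCS equal-sized, contiguous portions; portion_of() maps a node
 * index to the portion that owns it. */
int portion_size(int num_nodes, int k, int num_procs)
{
    int num_portions = k * num_procs;
    /* Ceiling division so the last portion is not dropped when num_nodes
     * is not a multiple of the portion count. */
    return (num_nodes + num_portions - 1) / num_portions;
}

int portion_of(int node, int num_nodes, int k, int num_procs)
{
    return node / portion_size(num_nodes, k, num_procs);
}

Because each portion covers a contiguous block of node indices, revolving ownership only has to pass whole portions between neighboring processors from phase to phase.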
Execution Strategy (Example)

for (phase = 0; phase < k * NUM_PROCS; phase++) {
  Receive (reduc1_array_portion) from processor PROC_ID + 1;
  // main calculation loop
  for (i = loop1_pt[phase]; i < loop1_pt[phase + 1]; i++) {
    node1 = nodeptr1[i];
    node2 = nodeptr2[i];
    force = h(node1, node2);
    reduc1[node1] += force;
    reduc1[node2] += -force;
  }
  . . .
  Send (reduc1_array_portion) to processor PROC_ID - 1;
}
Execution Strategy • Make communication independent of data distribution and values of indirection arrays • Exploit MTA’s ability to overlap communication & computation • Challenge: partition iterations into phases (each iteration updates 2 or more reduction array elements)
Execution Strategy (Updating Reduction Arrays)

// main calculation loop
for (i = loop1_pt[phase]; i < loop1_pt[phase + 1]; i++) {
  node1 = nodeptr1[i];
  node2 = nodeptr2[i];
  force = h(node1, node2);
  reduc1[node1] += force;
  reduc1[node2] += -force;
}
// update from buffer loop
for (i = loop2_pt[phase]; i < loop2_pt[phase + 1]; i++) {
  local_node = lbuffer_out[i];
  buffered_node = rbuffer_out[i];
  reduc1[local_node] += reduc1[buffered_node];
}
Runtime Processing • Responsibilities • Divide iterations on each processor into phases • Manage buffer space for reduction arrays • Set up second loop

// runtime preprocessing on each processor
LightInspector (. . .);
for (phase = 0; phase < k * NUM_PROCS; phase++) {
  Receive (. . .);
  // main calculation loop
  // second loop to update from buffer
  Send (. . .);
}
[Figure: LightInspector example. Input: indirection arrays nodeptr1 and nodeptr2. Output: per-phase reordered arrays nodeptr1_out and nodeptr2_out with phase boundaries, copy1_out / copy2_out entries for the update-from-buffer loop, and buffer space appended to reduc1 in a remote area]
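One piece of what the slides call LightInspector can be sketched as a counting sort that groups iterations by the portion owning node1. The names below are hypothetical; buffer-space management and setup of the update-from-buffer loop are omitted, and the per-portion boundaries become the per-phase loop1_pt boundaries once the ownership rotation is applied.

/* Much-simplified inspector sketch (hypothetical names): reorder the edges so
 * that all edges whose node1 falls in the same reduction-array portion are
 * contiguous, recording the start of each portion's range in portion_pt. */
#include <stdlib.h>

void group_edges_by_portion(const int *nodeptr1, const int *nodeptr2,
                            int num_edges, int num_nodes,
                            int k, int num_procs,
                            int *nodeptr1_out, int *nodeptr2_out,
                            int *portion_pt /* size k * num_procs + 1 */)
{
    int num_portions = k * num_procs;
    int psize = (num_nodes + num_portions - 1) / num_portions;

    /* Count how many edges fall into each portion. */
    for (int p = 0; p <= num_portions; p++)
        portion_pt[p] = 0;
    for (int i = 0; i < num_edges; i++)
        portion_pt[nodeptr1[i] / psize + 1]++;

    /* Prefix sum turns counts into range boundaries. */
    for (int p = 0; p < num_portions; p++)
        portion_pt[p + 1] += portion_pt[p];

    /* Scatter edges into their portion's range (stable counting sort). */
    int *next = malloc(num_portions * sizeof(int));
    for (int p = 0; p < num_portions; p++)
        next[p] = portion_pt[p];
    for (int i = 0; i < num_edges; i++) {
        int p = nodeptr1[i] / psize;
        nodeptr1_out[next[p]] = nodeptr1[i];
        nodeptr2_out[next[p]] = nodeptr2[i];
        next[p]++;
    }
    free(next);
}

The counting sort keeps this preprocessing linear in the number of edges assigned to each processor.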
Compiler Analysis • Identify reduction array sections • updated through an associative, commutative operator • Identify indirection array (IA) sections • Form reference groups of reduction array sections accessed through same IA sections • Each reference group can use same LightInspector • EARTH-C compiler infrastructure
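As an illustration of a reference group (this fragment is not taken from the paper; all names are made up), two reduction arrays updated through the same indirection-array sections can share a single LightInspector:

/* Illustrative reference group: reduc1 and reduc2 are both updated through
 * the same indirection arrays nodeptr1 / nodeptr2 with commutative,
 * associative operators, so one inspector run serves both arrays. */
void two_reductions(const int *nodeptr1, const int *nodeptr2, int num_edges,
                    double *reduc1, double *reduc2,
                    const double *force1, const double *force2)
{
    for (int i = 0; i < num_edges; i++) {
        int node1 = nodeptr1[i];
        int node2 = nodeptr2[i];
        reduc1[node1] += force1[i];
        reduc1[node2] -= force1[i];
        reduc2[node1] += force2[i];   /* same IA sections as reduc1 */
        reduc2[node2] -= force2[i];
    }
}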
Experimental Results • Three scientific kernels • Euler: 2k and 10k mesh • Moldyn: 2k and 10k dataset • sparse MVM: class W (7k), A (14k), & B (75k) matrices • Distribution of edges (interactions) • block • cyclic • block-cyclic (in thesis) • Three values of k (1, 2, & 4) • EARTH-MANNA (SEMi)
Experimental Results (Moldyn 2k) (Moldyn 10k in thesis)
Summary and Conclusions • Execution strategy: frequency and volume of communication are independent of the contents of indirection arrays • No mesh partitioning or communication optimizations required • Initial overheads (loss of locality), but high relative speedups