Compiler and Runtime Support for Parallelizing Irregular Reductions on a Multithreaded Architecture
Gary M. Zoppetti, Gagan Agrawal, Rishi Kumar

Presentation Transcript


  1. Compiler and Runtime Support for Parallelizing Irregular Reductions on a Multithreaded Architecture Gary M. Zoppetti Gagan Agrawal Rishi Kumar

  2. Motivation: Irregular Reductions
  • Frequently arise in scientific computations
  • Widely studied in the context of distributed memory machines, shared memory machines, distributed shared memory machines, and uniprocessor caches
  • Main difficulty: traditional compile-time optimizations cannot be applied
  • Runtime optimizations: trade-off between runtime costs and efficiency of execution

  3. Motivation: Multithreaded Architectures
  • Multiprocessors based upon multithreading
  • Support multiple threads of execution on each processor
  • Support low-overhead context switching and thread initiation
  • Low-cost point-to-point communication and synchronization

  4. Problem Addressed
  • Can we use multiprocessors based upon multithreading for irregular reductions?
  • What kind of runtime and compiler support is required?
  • What level of performance and scalability is achieved?

  5. Outline
  • Irregular Reductions
  • Execution Strategy
  • Runtime Support
  • Compiler Analysis
  • Experimental Results
  • Related Work
  • Summary

  6. Irregular Reductions: Example

  for (tstep = 0; tstep < num_steps; tstep++) {
    for (i = 0; i < num_edges; i++) {
      node1 = nodeptr1[i];
      node2 = nodeptr2[i];
      force = h(node1, node2);
      reduc1[node1] += force;
      reduc1[node2] += -force;
    }
  }

  7. Irregular Reductions
  • Irregular reduction loops:
    • Elements of LHS arrays may be incremented in multiple iterations, but only using commutative and associative operators
    • No loop-carried dependences other than those on elements of the reduction arrays
    • One or more arrays are accessed using indirection arrays
  • Codes from many scientific and engineering disciplines contain them (simulations involving irregular meshes, molecular dynamics, sparse codes)

  8. Execution Strategy Overview
  • Partition edges (interactions) among processors
  • Challenge: updating the reduction arrays
  • Divide reduction arrays into NUM_PROCS portions – revolving ownership
  • Execute NUM_PROCS phases on each processor

  9. Execution Strategy
  • To exploit multithreading, use (k * NUM_PROCS) phases and reduction portions
  [Figure: reduction portions 0–7 revolving among processors P0–P3 across phases]

  10. Execution Strategy (Example)

  for (phase = 0; phase < k * NUM_PROCS; phase++) {
    Receive (reduc1_array_portion) from processor PROC_ID + 1;
    // main calculation loop
    for (i = loop1_pt[phase]; i < loop1_pt[phase + 1]; i++) {
      node1 = nodeptr1[i];
      node2 = nodeptr2[i];
      force = h(node1, node2);
      reduc1[node1] += force;
      reduc1[node2] += -force;
    }
    . . .
    Send (reduc1_array_portion) to processor PROC_ID - 1;
  }

  11. Execution Strategy
  • Make communication independent of data distribution and values of indirection arrays
  • Exploit the MTA's ability to overlap communication and computation
  • Challenge: partition iterations into phases (each iteration updates 2 or more reduction array elements)

  12. Execution Strategy (Updating Reduction Arrays)

  // main calculation loop
  for (i = loop1_pt[phase]; i < loop1_pt[phase + 1]; i++) {
    node1 = nodeptr1[i];
    node2 = nodeptr2[i];
    force = h(node1, node2);
    reduc1[node1] += force;
    reduc1[node2] += -force;
  }
  // update from buffer loop
  for (i = loop2_pt[phase]; i < loop2_pt[phase + 1]; i++) {
    local_node = lbuffer_out[i];
    buffered_node = rbuffer_out[i];
    reduc1[local_node] += reduc1[buffered_node];
  }

  13. Runtime Processing
  • Responsibilities:
    • Divide iterations on each processor into phases
    • Manage buffer space for reduction arrays
    • Set up the second loop

  // runtime preprocessing on each processor
  LightInspector (. . .);
  for (phase = 0; phase < k * NUM_PROCS; phase++) {
    Receive (. . .);
    // main calculation loop
    // second loop to update from buffer
    Send (. . .);
  }

  14. [Figure: runtime preprocessing example — input indirection arrays nodeptr1/nodeptr2 and the reduc1 array with its remote buffer area; output per-phase nodeptr1_out/nodeptr2_out arrays and copy1_out/copy2_out arrays, indexed by phase number]

  15. Compiler Analysis
  • Identify reduction array sections
    • Updated through an associative, commutative operator
  • Identify indirection array (IA) sections
  • Form reference groups of reduction array sections accessed through the same IA sections
    • Each reference group can use the same LightInspector
  • EARTH-C compiler infrastructure

  16. Experimental Results
  • Three scientific kernels:
    • Euler: 2k and 10k meshes
    • Moldyn: 2k and 10k datasets
    • Sparse MVM: class W (7k), A (14k), and B (75k) matrices
  • Distribution of edges (interactions): block, cyclic, block-cyclic (in thesis)
  • Three values of k (1, 2, and 4)
  • EARTH-MANNA (SEMi)

  17. Experimental Results (Euler 10k)

  18. Experimental Results (Moldyn 2k) (Moldyn 10k in thesis)

  19. Experimental Results (MVM Class A)

  20. Summary and Conclusions
  • Execution strategy: frequency and volume of communication are independent of the contents of indirection arrays
  • No mesh partitioning or communication optimizations required
  • Overheads are incurred initially (locality), but high relative speedups are achieved
