
Multi-Grid

This overview discusses the problem description, implementation, shared memory and distributed memory approaches, and performance analysis of the Multi-Grid algorithm, presented by Esteban Pauli. The method speeds up a Jacobi-style solver by spreading boundary values faster: the problem is coarsened to successively smaller grids, solved at each level, and refined back to the original resolution. Implementation details for the shared memory and distributed memory paradigms are discussed, along with their respective advantages and challenges. Performance is analyzed on a 1024x1024 grid, comparing sequential and parallel execution. The conclusion highlights the suitability of Multi-Grid for the shared memory and MPI paradigms, and the limitations of other parallel paradigms.


Presentation Transcript


  1. Multi-Grid Esteban Pauli 4/25/06

  2. Overview • Problem Description • Implementation • Shared Memory • Distributed Memory • Other • Performance • Conclusion

  3. Problem Description
     • Same input, output as Jacobi
     • Try to speed up algorithm by spreading boundary values faster
     • Coarsen to small problem, successively solve, refine
     • Algorithm:
           for i in 1 .. levels - 1
               coarsen level i to i + 1
           for i in levels .. 2, -1
               solve level i
               refine level i to i - 1
           solve level 1
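
     A minimal C sketch of the driver loop above (hypothetical coarsen(), solve()
     and refine() helpers stand in for the grid operations; the slide gives only
     pseudocode, and levels are numbered from 1 for the finest grid up to `levels`
     for the coarsest):

      /* Hypothetical helpers: restrict, relax, and interpolate one level. */
      void coarsen(int from, int to);
      void solve(int level);
      void refine(int from, int to);

      /* Coarsen all the way down, then solve and refine back up,
         finishing with a solve on the finest grid. */
      void multigrid(int levels)
      {
          for (int i = 1; i <= levels - 1; i++)
              coarsen(i, i + 1);
          for (int i = levels; i >= 2; i--) {
              solve(i);
              refine(i, i - 1);
          }
          solve(1);
      }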

  4. Problem Description
     [diagram: Coarsen → Coarsen → Solve → Refine → Solve → Refine → Solve]

  5. Implementation – Key Ideas • Assign a chunk to each processor • Coarsen, refine operations done locally • Solve steps done like Jacobi
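
     A small C sketch of the chunk assignment idea, assuming a hypothetical 2D grid
     of PEs (npes_x by npes_y) and a grid size n that divides evenly; each PE computes
     the index range it owns at a given level:

      /* Block of rows/columns owned by processor `pe`.  The PE-grid shape
         (npes_x, npes_y) and the even division of n are assumptions. */
      void my_block(int n, int pe, int npes_x, int npes_y,
                    int *my_start_i, int *my_end_i,
                    int *my_start_j, int *my_end_j)
      {
          int px = pe % npes_x;              /* my column in the PE grid */
          int py = pe / npes_x;              /* my row in the PE grid    */
          int bx = n / npes_x;               /* columns per PE           */
          int by = n / npes_y;               /* rows per PE              */
          *my_start_i = py * by;  *my_end_i = *my_start_i + by - 1;
          *my_start_j = px * bx;  *my_end_j = *my_start_j + bx - 1;
      }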

  6. Shared Memory Implementations
         for i in 1 .. levels - 1
             coarsen level i to i + 1     (in parallel)
             barrier
         for i in levels .. 2, -1
             solve level i                (in parallel)
             refine level i to i - 1      (in parallel)
             barrier
         solve level 1                    (in parallel)
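
     A minimal OpenMP sketch of the barrier structure above (an assumption: the
     original shared memory code is not shown; coarsen_chunk(), solve_chunk() and
     refine_chunk() are hypothetical helpers that touch only the calling thread's
     own chunk):

      /* Hypothetical per-chunk helpers: each thread works only on its block. */
      void coarsen_chunk(int from, int to);
      void solve_chunk(int level);
      void refine_chunk(int from, int to);

      void multigrid_shared(int levels)
      {
          #pragma omp parallel
          {
              for (int i = 1; i <= levels - 1; i++) {
                  coarsen_chunk(i, i + 1);   /* local work only            */
                  #pragma omp barrier
              }
              for (int i = levels; i >= 2; i--) {
                  solve_chunk(i);            /* true sharing, like Jacobi  */
                  refine_chunk(i, i - 1);    /* local work only            */
                  #pragma omp barrier        /* sync before the next level */
              }
              solve_chunk(1);
          }
      }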

  7. Shared Memory Details
     • Solve is like shared memory Jacobi – have true sharing
           /* my_* are all locals */
           for my_i = my_start_i .. my_end_i
               for my_j = my_start_j .. my_end_j
                   current[my_i][my_j][level] = …
     • Coarsen, Refine access only local data – only false sharing possible
           for my_i = my_start_i .. my_end_i
               for my_j = my_start_j .. my_end_j
                   current[my_i][my_j][level] = …[level ± 1]
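
     A sketch of a purely local coarsen step (the slide does not give the stencil,
     so the 2x2 averaging and the separate fine/coarse arrays below are assumptions);
     every read and write stays inside this thread's own block, so at worst false
     sharing can occur on cache lines at the block edges:

      /* Restrict this thread's block: each coarse cell is the average of a
         2x2 block of fine cells owned by the same thread. */
      void coarsen_chunk_avg(double **fine, double **coarse,
                             int my_start_i, int my_end_i,
                             int my_start_j, int my_end_j)
      {
          for (int my_i = my_start_i; my_i <= my_end_i; my_i++)
              for (int my_j = my_start_j; my_j <= my_end_j; my_j++) {
                  int fi = 2 * my_i, fj = 2 * my_j;
                  coarse[my_i][my_j] = 0.25 * (fine[fi][fj]     + fine[fi][fj + 1] +
                                               fine[fi + 1][fj] + fine[fi + 1][fj + 1]);
              }
      }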

  8. Shared Memory Paradigms • Barrier is all you really need, so should be easy to program in any shared memory paradigm (UPC, OpenMP, HPF, etc) • Being able to control distribution (CAF, GA) should help • If small enough, only have to worry about initial misses • If larger, will push out of cache, have to bring back over network • If have to switch to different syntax to access remote memory, it’s a minus on the “elegance” side, but a plus in that it makes communication explicit

  9. Distributed Memory (MPI)
     • Almost all work local, only communicate to solve a given level
     • Algorithm at each PE (looks very sequential):
           for i in 1 .. levels - 1
               coarsen level i to i + 1    // local
           for i in levels .. 2, -1
               solve level i               // see next slide
               refine level i to i - 1     // local
           solve level 1                   // see next slide
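
     A C/MPI sketch of the per-PE flow of control (coarsen_local(), refine_local()
     and solve_level() are hypothetical helpers; only solve_level() communicates, as
     described on the next slide, and the level count of 3 matches the 1024x1024 to
     256x256 configuration from the performance slide):

      #include <mpi.h>

      void coarsen_local(int from, int to);        /* no communication  */
      void refine_local(int from, int to);         /* no communication  */
      void solve_level(int level, MPI_Comm comm);  /* boundary exchange */

      int main(int argc, char **argv)
      {
          MPI_Init(&argc, &argv);
          int levels = 3;                /* e.g. 1024x1024 -> 512x512 -> 256x256 */

          for (int i = 1; i <= levels - 1; i++)
              coarsen_local(i, i + 1);
          for (int i = levels; i >= 2; i--) {
              solve_level(i, MPI_COMM_WORLD);
              refine_local(i, i - 1);
          }
          solve_level(1, MPI_COMM_WORLD);

          MPI_Finalize();
          return 0;
      }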

  10. MPI Solve function
      • “Dumb”:
            send my edges
            receive edges
            compute
      • Smarter:
            send my edges
            compute middle
            receive edges
            compute boundaries
      • Can do any other optimizations which can be done in Jacobi
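
     A C/MPI sketch of one Jacobi sweep for the smarter variant, assuming a 1D row
     decomposition with one ghost row on each side (rows 1..local_n are owned, rows 0
     and local_n + 1 hold the neighbours' edges, and up/down may be MPI_PROC_NULL at
     the domain boundary); the plain four-point averaging update is assumed, as in Jacobi:

      #include <mpi.h>

      /* Post the edge exchange, update the interior rows while the messages
         are in flight, then wait and update the two edge rows. */
      void jacobi_sweep(double *cur, double *next, int local_n, int ncols,
                        int up, int down, MPI_Comm comm)
      {
          MPI_Request req[4];

          /* send my edge rows, post receives for the neighbours' edge rows */
          MPI_Isend(&cur[1 * ncols],             ncols, MPI_DOUBLE, up,   0, comm, &req[0]);
          MPI_Isend(&cur[local_n * ncols],       ncols, MPI_DOUBLE, down, 1, comm, &req[1]);
          MPI_Irecv(&cur[0 * ncols],             ncols, MPI_DOUBLE, up,   1, comm, &req[2]);
          MPI_Irecv(&cur[(local_n + 1) * ncols], ncols, MPI_DOUBLE, down, 0, comm, &req[3]);

          /* compute the middle while the edges are in flight */
          for (int i = 2; i <= local_n - 1; i++)
              for (int j = 1; j < ncols - 1; j++)
                  next[i * ncols + j] =
                      0.25 * (cur[(i - 1) * ncols + j] + cur[(i + 1) * ncols + j] +
                              cur[i * ncols + j - 1]   + cur[i * ncols + j + 1]);

          /* wait for the ghost rows, then compute the boundary rows */
          MPI_Waitall(4, req, MPI_STATUSES_IGNORE);
          int edge[2] = { 1, local_n };
          for (int e = 0; e < 2; e++) {
              int i = edge[e];
              for (int j = 1; j < ncols - 1; j++)
                  next[i * ncols + j] =
                      0.25 * (cur[(i - 1) * ncols + j] + cur[(i + 1) * ncols + j] +
                              cur[i * ncols + j - 1]   + cur[i * ncols + j + 1]);
          }
      }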

  11. Distributed Memory (Charm++) • Again, do like Jacobi • Flow of control hard to show here • Can send just one message to do all coarsening (like in MPI) • Might get some benefits from overlapping computation and communication by waiting for smaller messages • No benefits from load balancing

  12. Other paradigms • BSP model (local computation, global communication, barrier): good fit • STAPL (parallel STL): not a good fit (could use parallel for_each, but lack of 2D data structure would make this awkward) • Treadmarks, CID, CASHMERe (distributed shared memory): getting a whole page to get just the boundaries might be too expensive, probably not a good fit • Cilk (spawn processes for graph search): not a good fit

  13. Performance
     • 1024x1024 grid coarsened down to a 256x256 grid, 500 iterations at each level
     • Sequential time: 42.83 seconds
     • [timing tables not transcribed: left table gives results on 4 PEs, right table on 16 PEs]

  14. Summary • Almost identical to Jacobi • Very predictable application • Easy load balancing • Good for shared memory, MPI • Charm++: virtualization helps, probably need more data points to see if it can beat MPI • DSM: false sharing might be too high a cost • Parallel paradigms for irregular programs not a good fit
