This overview covers the problem description, implementation, shared memory and distributed memory approaches, and performance analysis of a parallel Multi-Grid solver, from a presentation by Esteban Pauli. Multi-Grid tries to speed up Jacobi by spreading boundary values faster: it coarsens the problem, solves it at each level, and refines back to the original grid. Implementation details for the shared memory and distributed memory paradigms are discussed, along with their respective advantages and challenges. Performance is reported for grids from 1024x1024 down to 256x256, comparing sequential and parallel execution. The conclusion is that Multi-Grid suits shared memory and MPI paradigms, while paradigms aimed at irregular programs are a poor fit.
Multi-Grid Esteban Pauli 4/25/06
Overview • Problem Description • Implementation • Shared Memory • Distributed Memory • Other • Performance • Conclusion
Problem Description • Same input, output as Jacobi • Try to speed up algorithm by spreading boundary values faster • Coarsen to small problem, successively solve, refine • Algorithm: • for i in 1 .. levels - 1 • coarsen level i to i + 1 • for i in levels .. 2, -1 • solve level i • refine level i to i – 1 • solve level 1
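The algorithm on this slide can be sketched sequentially as follows. This is only an illustration under assumptions the slides do not state: the grid side is 2^k + 1, coarsening keeps every other point (injection), refinement uses bilinear interpolation, and each "solve" is a fixed number of Jacobi sweeps. The names `make_grid`, `jacobi`, `coarsen`, `refine` and `multigrid` are hypothetical.

```python
def make_grid(n, boundary=1.0):
    """n x n grid with fixed boundary values and a zero interior (assumed setup)."""
    g = [[0.0] * n for _ in range(n)]
    for k in range(n):
        g[0][k] = g[n - 1][k] = g[k][0] = g[k][n - 1] = boundary
    return g


def jacobi(grid, sweeps):
    """Run a fixed number of Jacobi sweeps; boundary rows and columns stay fixed."""
    n = len(grid)
    for _ in range(sweeps):
        new = [row[:] for row in grid]
        for i in range(1, n - 1):
            for j in range(1, n - 1):
                new[i][j] = 0.25 * (grid[i - 1][j] + grid[i + 1][j]
                                    + grid[i][j - 1] + grid[i][j + 1])
        grid = new
    return grid


def coarsen(fine):
    """Assumed operator: injection, keeping every other point in each dimension."""
    return [row[::2] for row in fine[::2]]


def refine(coarse, fine):
    """Assumed operator: bilinear interpolation of coarse into the fine interior."""
    n = len(fine)
    out = [row[:] for row in fine]          # fine boundary values are kept
    for i in range(1, n - 1):
        for j in range(1, n - 1):
            i0, j0 = i // 2, j // 2
            i1, j1 = i0 + i % 2, j0 + j % 2
            out[i][j] = 0.25 * (coarse[i0][j0] + coarse[i0][j1]
                                + coarse[i1][j0] + coarse[i1][j1])
    return out


def multigrid(finest, levels, sweeps):
    grids = [finest]                        # grids[0] is level 1 on the slide
    for _ in range(levels - 1):             # for i in 1 .. levels - 1
        grids.append(coarsen(grids[-1]))    #   coarsen level i to i + 1
    for lvl in range(levels - 1, 0, -1):    # for i in levels .. 2, -1
        grids[lvl] = jacobi(grids[lvl], sweeps)              # solve level i
        grids[lvl - 1] = refine(grids[lvl], grids[lvl - 1])  # refine to i - 1
    return jacobi(grids[0], sweeps)         # solve level 1
```

For example, `multigrid(make_grid(17), levels=3, sweeps=10)` works on 17x17, 9x9 and 5x5 grids. The coarse levels carry the boundary values to the centre in far fewer sweeps than the fine grid would need on its own.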
Problem Description • [Diagram: Coarsen → Coarsen → Solve → Refine → Solve → Refine → Solve]
Implementation – Key Ideas • Assign a chunk to each processor • Coarsen, refine operations done locally • Solve steps done like Jacobi
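One way to assign a chunk to each processor is a block decomposition. The sketch below is an assumption, since the slides do not say how chunks are laid out. `chunk_bounds` and `my_block` are hypothetical names, and the bounds are half-open `[start, end)`.

```python
def chunk_bounds(n, parts, rank):
    """Split range(n) into `parts` nearly equal pieces; return piece `rank`."""
    base, extra = divmod(n, parts)
    start = rank * base + min(rank, extra)        # first `extra` pieces get one more
    end = start + base + (1 if rank < extra else 0)
    return start, end


def my_block(n, pe_rows, pe_cols, rank):
    """2D block of an n x n grid owned by `rank` on a pe_rows x pe_cols PE grid."""
    r, c = divmod(rank, pe_cols)
    (my_start_i, my_end_i) = chunk_bounds(n, pe_rows, r)
    (my_start_j, my_end_j) = chunk_bounds(n, pe_cols, c)
    return (my_start_i, my_end_i), (my_start_j, my_end_j)
```

With 16 PEs arranged 4x4 on a 1024x1024 grid, every PE owns a 256x256 block. If the coarser levels are decomposed the same way, each PE's coarse block covers the same region as its fine block, which is what lets coarsen and refine stay local.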
Shared Memory Implementations • for i in 1 .. levels - 1 • coarsen level i to i + 1 (in parallel) • barrier • for i in levels .. 2, -1 • solve level i (in parallel) • refine level i to i – 1 (in parallel) • barrier • solve level 1 (in parallel)
Shared Memory Details • Solve is like shared memory Jacobi – has true sharing • /* all my_ variables are local */ • for my_i = my_start_i .. my_end_i • for my_j = my_start_j .. my_end_j • current[my_i][my_j][level] = … • Coarsen, Refine access only local data – only false sharing possible • for my_i = my_start_i .. my_end_i • for my_j = my_start_j .. my_end_j • current[my_i][my_j][level] = …[level ± 1]
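The barrier-separated solve step can be illustrated with Python threads. This is a sketch of the synchronization pattern only, not of the paradigms named in the slides. It assumes a row-band decomposition and two buffers that swap roles each sweep, and `parallel_jacobi` is a hypothetical name.

```python
import threading


def parallel_jacobi(grid, sweeps, num_threads):
    """Jacobi sweeps with one thread per row band and a barrier after each sweep."""
    n = len(grid)
    # Two shared buffers: read from one, write to the other, then swap roles.
    bufs = [[row[:] for row in grid], [row[:] for row in grid]]
    barrier = threading.Barrier(num_threads)
    interior_rows = n - 2

    def worker(rank):
        # All my_ variables are local to this thread.
        base, extra = divmod(interior_rows, num_threads)
        my_start_i = 1 + rank * base + min(rank, extra)
        my_end_i = my_start_i + base + (1 if rank < extra else 0)
        for s in range(sweeps):
            cur, nxt = bufs[s % 2], bufs[(s + 1) % 2]
            for my_i in range(my_start_i, my_end_i):
                for my_j in range(1, n - 1):
                    # Rows my_start_i - 1 and my_end_i belong to neighbours:
                    # this is the true sharing mentioned on the slide.
                    nxt[my_i][my_j] = 0.25 * (cur[my_i - 1][my_j] + cur[my_i + 1][my_j]
                                              + cur[my_i][my_j - 1] + cur[my_i][my_j + 1])
            barrier.wait()  # nobody starts the next sweep until all writes are done

    threads = [threading.Thread(target=worker, args=(r,)) for r in range(num_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return bufs[sweeps % 2]
```

Because each sweep reads only the previous buffer, the result is the same as a sequential Jacobi solve regardless of thread scheduling. The single barrier per sweep is the only synchronization required.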
Shared Memory Paradigms • A barrier is all you really need, so it should be easy to program in any shared memory paradigm (UPC, OpenMP, HPF, etc.) • Being able to control data distribution (CAF, GA) should help • If the problem is small enough to stay in cache, only have to worry about initial misses • If larger, data will be pushed out of cache and have to be brought back over the network • Having to switch to a different syntax to access remote memory is a minus on the “elegance” side, but a plus in that it makes communication explicit
Distributed Memory (MPI) • Almost all work local, only communicate to solve a given level • Algorithm at each PE (looks very sequential): • for i in 1 .. levels - 1 • coarsen level i to i + 1 // local • for i in levels .. 2, -1 • solve level i // see next slide • refine level i to i – 1 // local • solve level 1 // see next slide
MPI Solve function • “Dumb” • send my edges • receive edges • Compute • Smarter • send my edges • compute middle • receive edges • compute boundaries • Can do any other optimizations which can be done in Jacobi
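The "smarter" ordering can be sketched without MPI by simulating two PEs with threads and using `queue.Queue` objects as message channels. Everything here is an assumption made for illustration: the row-band split, the halo-row layout, and the hypothetical names `sweep_rows`, `smart_solve_step` and `run_two_pes`. A real MPI version would use nonblocking sends and receives instead.

```python
import queue
import threading


def sweep_rows(local, new, rows):
    """Jacobi update of the given local rows, reading `local`, writing `new`."""
    w = len(local[0])
    for i in rows:
        for j in range(1, w - 1):
            new[i][j] = 0.25 * (local[i - 1][j] + local[i + 1][j]
                                + local[i][j - 1] + local[i][j + 1])


def smart_solve_step(local, up_send, up_recv, down_send, down_recv):
    """One sweep on a band whose first and last rows are halo (or global boundary)."""
    last = len(local) - 2                       # index of the last owned row
    # 1. send my edges
    if up_send is not None:
        up_send.put(local[1][:])
    if down_send is not None:
        down_send.put(local[last][:])
    new = [row[:] for row in local]
    # 2. compute middle: rows that do not touch a halo row
    sweep_rows(local, new, range(2, last))
    # 3. receive edges
    if up_recv is not None:
        local[0] = up_recv.get()
    if down_recv is not None:
        local[-1] = down_recv.get()
    # 4. compute boundaries: first and last owned rows
    sweep_rows(local, new, sorted({1, last}))
    return new                                  # halo rows are refreshed next step


def run_two_pes(grid):
    """Split the rows of `grid` (side >= 4) between two simulated PEs; do one sweep."""
    n = len(grid)
    h = n // 2                                  # PE 0 owns rows 1..h-1, PE 1 rows h..n-2
    a_to_b, b_to_a = queue.Queue(), queue.Queue()
    bands = [[row[:] for row in grid[0:h + 1]], [row[:] for row in grid[h - 1:n]]]
    bands[0][-1] = [0.0] * n                    # halos start empty: the exchange
    bands[1][0] = [0.0] * n                     # must fill them in
    results = [None, None]

    def pe0():
        results[0] = smart_solve_step(bands[0], None, None, a_to_b, b_to_a)

    def pe1():
        results[1] = smart_solve_step(bands[1], b_to_a, a_to_b, None, None)

    threads = [threading.Thread(target=pe0), threading.Thread(target=pe1)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return [grid[0][:]] + results[0][1:h] + results[1][1:-1] + [grid[n - 1][:]]
```

The middle rows are computed while the edge messages are in flight, which is where the overlap of computation and communication comes from. Only the first and last owned rows have to wait for the receive.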
Distributed Memory (Charm++) • Again, do like Jacobi • Flow of control hard to show here • Can send just one message to do all coarsening (like in MPI) • Might get some benefits from overlapping computation and communication by waiting for smaller messages • No benefits from load balancing
Other paradigms • BSP model (local computation, global communication, barrier): good fit • STAPL (parallel STL): not a good fit (could use parallel for_each, but lack of 2D data structure would make this awkward) • Treadmarks, CID, CASHMERe (distributed shared memory): getting a whole page to get just the boundaries might be too expensive, probably not a good fit • Cilk (spawn processes for graph search): not a good fit
Performance • Grid levels from 1024x1024 down to 256x256, 500 iterations at each level • Sequential time: 42.83 seconds • Left table: 4 PEs • Right table: 16 PEs
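Given a measured parallel time, speedup and efficiency relative to the 42.83 second sequential run can be computed with the standard definitions. The helper names are hypothetical, and no parallel timings are assumed here; the caller supplies them.

```python
SEQUENTIAL_SECONDS = 42.83  # sequential time reported on the slide


def speedup(t_parallel, t_sequential=SEQUENTIAL_SECONDS):
    """Speedup = sequential time / parallel time."""
    return t_sequential / t_parallel


def efficiency(t_parallel, num_pes, t_sequential=SEQUENTIAL_SECONDS):
    """Efficiency = speedup / number of PEs (1.0 means perfect scaling)."""
    return speedup(t_parallel, t_sequential) / num_pes
```

For instance, `efficiency(t, 16)` for a 16 PE run time `t` shows how close that run comes to ideal scaling.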
Summary • Almost identical to Jacobi • Very predictable application • Easy load balancing • Good for shared memory, MPI • Charm++: virtualization helps, probably need more data points to see if it can beat MPI • DSM: false sharing might be too high a cost • Parallel paradigms for irregular programs not a good fit