
NSF/DARPA OPAAL Adaptive Parallelization Strategies using Data-driven Objects

Presentation Transcript


  1. NSF/DARPA OPAAL: Adaptive Parallelization Strategies using Data-driven Objects
  Laxmikant Kale
  First Annual Review, 27-28 October 1999, Iowa City

  2. Outline
  • Quench and solidification codes
  • Coarse-grain parallelization of the quench code
  • Adaptive parallelization techniques
  • Dynamic variations
  • Adaptive load balancing
  • Finite element framework with adaptivity
  • Preliminary results

  3. Coarse grain parallelization
  • Structure of the current sequential quench code:
    • 2-D array of elements (each independently refined)
    • within-row dependence
    • independent rows, but they share global variables
  • Parallelization using Charm++ (sketched below):
    • 3 hours of effort (after a false start)
    • about 20 lines of change to the F90 code
    • a 100-line Charm++ wrapper
  • Observations:
    • Global variables that are defined and used within inner-loop iterations are easily dealt with in Charm++, in contrast to OpenMP
    • Dynamic load balancing is possible, but was unnecessary
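A minimal sketch of the kind of Charm++ wrapper described above; names such as QuenchRow and quench_row_kernel are illustrative, not taken from the actual OPAAL code. Each row of elements becomes one chare, and the F90 "globals" used inside a row become per-chare members, so no privatization or locking is needed.

```cpp
// quench.ci (Charm++ interface file, compiled by charmc):
//   mainmodule quench {
//     readonly CProxy_Main mainProxy;
//     mainchare Main {
//       entry Main(CkArgMsg *m);
//       entry void rowDone();
//     };
//     array [1D] QuenchRow {
//       entry QuenchRow();
//       entry void compute();
//     };
//   };

#include "charm++.h"
#include "quench.decl.h"

// Hypothetical F90 row kernel, exposed to C++ with a C binding.
extern "C" void quench_row_kernel(int row, double *state, int n);

/*readonly*/ CProxy_Main mainProxy;

class QuenchRow : public CBase_QuenchRow {
  double state[256];   // was shared global data in the sequential F90 code
public:
  QuenchRow() {}
  QuenchRow(CkMigrateMessage *m) {}
  void compute() {
    quench_row_kernel(thisIndex, state, 256);  // rows are independent
    mainProxy.rowDone();                       // report completion
  }
};

class Main : public CBase_Main {
  int remaining;
public:
  Main(CkArgMsg *m) : remaining(64) {
    mainProxy = thisProxy;
    CProxy_QuenchRow rows = CProxy_QuenchRow::ckNew(remaining);
    rows.compute();   // broadcast: each row chare runs when scheduled
    delete m;
  }
  void rowDone() { if (--remaining == 0) CkExit(); }
};

#include "quench.def.h"
```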

  4. Performance results
  Contributors:
  • Engineering: N. Sobh, R. Haber
  • Computer Science: M. Bhandarkar, R. Liu, L. Kale

  5. OpenMP experience
  • Work by J. Hoeflinger and D. Padua, with N. Sobh, R. Haber, J. Dantzig, N. Provatas
  • Solidification code:
    • parallelized using OpenMP
    • relatively straightforward, after a key decision: parallelize by rows only

  6. OpenMP experience
  • Quench code on the Origin2000
  • Privatization of variables is needed, since the outer loop was parallelized (see the sketch below)
  • Unexpected initial difficulties with OpenMP:
    • led initially to a large slowdown in the parallelized code
    • traced to unnecessary locking in the MATMUL intrinsic
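A minimal C++ sketch of the privatization issue described above (illustrative only, not the quench code): once the outer loop over rows is parallelized, any global scratch storage reused by every iteration must be made per-thread, otherwise the threads race on it.

```cpp
#include <omp.h>
#include <cstdio>
#include <vector>

static double scratch[64];           // module-level work array, as in the F90 code
#pragma omp threadprivate(scratch)   // give each thread its own private copy

int main() {
    const int nrows = 128, nelems = 64;
    std::vector<double> rowResult(nrows, 0.0);

    #pragma omp parallel for schedule(dynamic)
    for (int row = 0; row < nrows; ++row) {     // outer (row) loop is parallel
        for (int e = 0; e < nelems; ++e)
            scratch[e] = 0.001 * row + e;       // stand-in for per-element work
        double sum = 0.0;
        for (int e = 0; e < nelems; ++e)        // within-row work stays sequential
            sum += scratch[e];
        rowResult[row] = sum;
    }
    std::printf("rowResult[0] = %g\n", rowResult[0]);
    return 0;
}
```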

  7. Adaptive Strategies
  • Advanced codes model dynamic and irregular behavior:
    • Solidification: adaptive grid refinement
    • Quench: complex dependencies, parallelization within elements
  • To parallelize these effectively, adaptive runtime strategies are necessary

  8. Multi-partition decomposition
  • Idea: decompose the problem into a number of partitions, independent of the number of processors
    • # partitions > # processors
  • The system maps partitions to processors
  • The system should be able to map and re-map objects as needed (see the sketch below)
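A short sketch of the multi-partition idea in Charm++ terms (the Partition chare array and its start() method are assumed for illustration): the partition count is chosen independently of, and larger than, the processor count, and the runtime owns the mapping.

```cpp
#include "charm++.h"
#include "decomp.decl.h"   // assumed .ci file declaring array [1D] Partition

void createPartitions() {
  int partitionsPerPe = 8;                            // deliberately > 1
  int numPartitions   = partitionsPerPe * CkNumPes();
  CkArrayOptions opts(numPartitions);                 // runtime chooses the initial map
  CProxy_Partition parts = CProxy_Partition::ckNew(opts);
  parts.start();   // each partition computes when the scheduler reaches it;
                   // migratable elements let the runtime re-map them later
}
```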

  9. Charm++
  • A parallel C++ library
  • Supports data-driven objects: singleton objects, object arrays, groups, ...
  • Many objects per processor, with method execution scheduled as data becomes available (see the sketch below)
  • The system supports automatic instrumentation and object migration
  • Works with other paradigms: MPI, OpenMP, ...
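A small illustration of the data-driven object style (Cell, recvBoundary, and the array size N are assumed names, not part of the OPAAL code): calling an entry method on an array element just deposits a message; the method body runs on whichever processor currently hosts that element, once the scheduler picks the message from its queue.

```cpp
#include "charm++.h"
#include "cells.decl.h"   // assumed .ci: array [1D] Cell with
                          //   entry void recvBoundary(int n, double data[n]);

#define N 64              // assumed number of Cell elements

class Cell : public CBase_Cell {
  double boundary[128];
public:
  Cell() { for (int i = 0; i < 128; ++i) boundary[i] = 0.0; }
  Cell(CkMigrateMessage *m) {}

  void sendToRight() {
    // Asynchronous invocation: returns immediately, the data travels as a message.
    thisProxy[(thisIndex + 1) % N].recvBoundary(128, boundary);
  }
  void recvBoundary(int n, double *data) {
    // Scheduled only once the neighbour's data has actually arrived here.
    for (int i = 0; i < n; ++i) boundary[i] = data[i];
  }
};

#include "cells.def.h"
```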

  10. Data-driven execution in Charm++ (diagram: per-processor schedulers, each with its own message queue)

  11. Load Balancing Framework
  • Aimed at handling:
    • continuous (slow) load variation
    • abrupt load variation (refinement)
    • workstation clusters in multi-user mode
  • Measurement based:
    • exploits temporal persistence of computation and communication structures
    • very accurate (compared with estimation)
    • instrumentation possible via Charm++/Converse
  (A usage sketch follows.)
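A hedged sketch of how a migratable object would use the measurement-based framework (the Worker array and its iterate() entry method are illustrative): the runtime instruments each object's load and communication automatically; the object only opts in, provides a pup routine so it can migrate, and periodically reaches an AtSync point where the balancer may move it.

```cpp
#include "charm++.h"
#include "pup_stl.h"       // PUP support for std::vector
#include "worker.decl.h"   // assumed .ci: array [1D] Worker with entries
                           //   iterate() and ResumeFromSync()
#include <vector>

class Worker : public CBase_Worker {
  std::vector<double> cells;   // per-partition state
  int step;
public:
  Worker() : cells(1024, 0.0), step(0) {
    usesAtSync = true;         // opt in to AtSync-based load balancing
  }
  Worker(CkMigrateMessage *m) {}

  void pup(PUP::er &p) {       // serialize state so this object can migrate
    CBase_Worker::pup(p);
    p | cells;
    p | step;
  }

  void iterate() {
    ++step;
    // ... one step of real work on this partition ...
    if (step % 20 == 0) AtSync();          // hand control to the load balancer
    else thisProxy[thisIndex].iterate();   // otherwise continue, message-driven
  }
  void ResumeFromSync() {                  // called once migrations are done
    thisProxy[thisIndex].iterate();
  }
};

#include "worker.def.h"
```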

  12. Object balancing framework

  13. Utility of the framework: workstation clusters
  • Cluster of 8 machines; one machine gets another job
  • The parallel job slows down on all machines
  • Using the framework:
    • detection mechanism
    • migrate objects away from the overloaded processor
    • restored almost the original throughput!

  14. Performance on timeshared clusters
  Another user logged on about 28 seconds into a parallel run on 8 workstations. Throughput dipped from 10 steps per second to 7. The load balancer intervened at 35 seconds and restored throughput to almost its initial value.

  15. Utility of the framework: intrinsic load imbalance
  • To test the abilities of the framework, a simple problem: Gauss-Jacobi iterations (sketched below), with selected sub-domains refined
  • ConSpector: a web-based tool to
    • submit parallel jobs
    • monitor performance and application behavior
    • interact with running jobs via GUI interfaces
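For reference, a minimal sequential version of the Gauss-Jacobi iteration used in the synthetic benchmark (1-D here for brevity): each sweep reads only the previous iterate, so sub-domains can be updated independently, and refining some sub-domains adds work to only some partitions.

```cpp
#include <cstdio>
#include <vector>

int main() {
    const int n = 66;                          // grid points, ends held fixed
    std::vector<double> u(n, 0.0), next(n, 0.0);
    u.front() = 1.0; u.back() = 1.0;           // boundary conditions

    for (int sweep = 0; sweep < 1000; ++sweep) {
        for (int i = 1; i + 1 < n; ++i)
            next[i] = 0.5 * (u[i - 1] + u[i + 1]);   // Jacobi update from old values only
        next.front() = u.front();
        next.back()  = u.back();
        u.swap(next);
    }
    std::printf("u[n/2] = %f\n", u[n / 2]);
    return 0;
}
```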

  16. AppSpector view of the load balancer on the synthetic Jacobi relaxation benchmark. Imbalance is introduced by interactively refining a subset of cells around 9 seconds. The resulting load imbalance brings utilization down from a peak of 96% to 80%. The load balancer kicks in around t = 16 s and restores utilization to around 94%.

  17. Using the Load Balancing Framework (architecture diagram; labels: automatic conversion from MPI, cross-module interpolation, structured FEM, MPI-on-Charm Irecv, framework path, load database + balancer, migration path, Charm++, Converse)

  18. Example application: crack propagation (P. Geubelle et al.)
  • Similar in structure to the Quench components
  • 1900 lines of F90, rewritten using the FEM framework in C++: 1200 lines of C++ code
  • Framework: 500 lines of code, reused by all applications
  • Parallelization handled completely by the framework (a generic sketch of this division of labour follows)
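A hedged, generic sketch of the division of labour the slide describes (hypothetical types, not the actual FEM framework API): the application supplies only the per-partition element loop, while the framework owns mesh partitioning, the mapping of partitions to processors, and the combination of values on nodes shared between partitions.

```cpp
#include <vector>

// What the framework would hand each partition (hypothetical layout).
struct PartitionMesh {
    std::vector<int>    connectivity;   // element -> node indices, 3 per triangle
    std::vector<double> nodeValue;      // one unknown per node
};

// Application-supplied kernel: one explicit step over this partition's elements.
void elementLoop(PartitionMesh &m) {
    std::vector<double> contribution(m.nodeValue.size(), 0.0);
    for (std::size_t e = 0; e + 2 < m.connectivity.size(); e += 3) {
        int a = m.connectivity[e], b = m.connectivity[e + 1], c = m.connectivity[e + 2];
        double avg = (m.nodeValue[a] + m.nodeValue[b] + m.nodeValue[c]) / 3.0;
        contribution[a] += avg;          // scatter element contributions to nodes
        contribution[b] += avg;
        contribution[c] += avg;
    }
    for (std::size_t i = 0; i < m.nodeValue.size(); ++i)
        m.nodeValue[i] += contribution[i];
    // The framework would now sum contributions on shared (boundary) nodes
    // across partitions, then schedule the next step.
}
```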

  19. Crack Propagation
  Decomposition into 16 chunks (left) and 128 chunks, 8 per PE (right). The middle area contains the cohesive elements. Both decompositions were obtained using Metis. Pictures: S. Breitenfeld and P. Geubelle

  20. “Overhead” of multi-partition method

  21. Overhead study on 8 processors
  On 8 processors, using multiple partitions per processor is also beneficial, due to better cache behavior.

  22. Cross-approach comparison (chart; series: original MPI-F90, Charm++ framework (all C++), F90 + Charm++ library)

  23. Load balancer in action

  24. Summary and Planned Research
  • Use the adaptive FEM framework to parallelize the Quench code further
  • Quad-tree based solidification code:
    • first phase: parallelize each phase separately
    • then parallelize across refinement phases
  • Refine the FEM framework:
    • use feedback from applications
    • support for implicit solvers and multigrid
