NSF/DARPA OPAAL
Adaptive Parallelization Strategies using Data-driven Objects
Laxmikant Kale
First Annual Review, 27-28 October 1999, Iowa City
Outline
• Quench and solidification codes
• Coarse grain parallelization of the quench code
• Adaptive parallelization techniques
• Dynamic variations
• Adaptive load balancing
• Finite element framework with adaptivity
• Preliminary results
Coarse grain parallelization
• Structure of the current sequential quench code:
  • 2-D array of elements (each independently refined)
  • Within-row dependence
  • Independent rows, but they share global variables
• Parallelization using Charm++:
  • 3 hours of effort (after a false start)
  • About 20 lines of change to the F90 code
  • A 100-line Charm++ wrapper
• Observations:
  • Global variables that are defined and used within inner loop iterations are easily dealt with in Charm++, in contrast to OpenMP
  • Dynamic load balancing is possible, but was unnecessary
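For concreteness, here is a minimal sketch of the shape such a Charm++ wrapper around a Fortran row kernel could take. The chare name QuenchRow and the Fortran entry point quench_process_row_ are illustrative assumptions, not the actual OPAAL code.

    // quench.ci (interface file, shown as a comment for context):
    //   mainmodule quench {
    //     array [1D] QuenchRow { entry QuenchRow(); entry void step(int iter); };
    //   }

    #include "quench.decl.h"

    // Existing F90 row kernel, reached through a C binding (name illustrative).
    extern "C" void quench_process_row_(int *row, int *iter);

    class QuenchRow : public CBase_QuenchRow {
     public:
      QuenchRow() {}
      QuenchRow(CkMigrateMessage *m) {}

      // One chare per row: rows are independent, and globals that are
      // defined and used within an iteration stay local to that row's
      // execution, so no locking or privatization is needed.
      void step(int iter) {
        int row = thisIndex;            // this chare's row index
        quench_process_row_(&row, &iter);
      }
    };

    #include "quench.def.h"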
Performance results
Contributors:
• Engineering: N. Sobh, R. Haber
• Computer Science: M. Bhandarkar, R. Liu, L. Kale
OpenMP experience
• Work by J. Hoeflinger and D. Padua, with N. Sobh, R. Haber, J. Dantzig, N. Provatas
• Solidification code:
  • Parallelized using OpenMP
  • Relatively straightforward, after a key decision: parallelize by rows only
OpenMP experience
• Quench code on the Origin2000:
  • Privatization of variables is needed, since the outer loop was parallelized
  • Unexpected initial difficulties with OpenMP led to a large slowdown in the parallelized code
  • Traced to unnecessary locking in the MATMUL intrinsic
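The privatization issue looks roughly like this in OpenMP terms. The kernel below is a hypothetical stand-in (the real loop body is in F90); without the private clause, the per-iteration temporaries would be shared by all threads and race.

    #include <omp.h>

    void quench_outer_loop(int nrows, int ncols, double *field) {
      double scratch, residual;                 // defined and used inside each iteration
      // Parallelizing the outer (row) loop requires privatizing the temporaries.
      #pragma omp parallel for private(scratch, residual)
      for (int row = 0; row < nrows; ++row) {
        for (int col = 0; col < ncols; ++col) {
          scratch = field[row * ncols + col];
          residual = 0.5 * scratch;             // stand-in for the real per-element work
          field[row * ncols + col] = residual;
        }
      }
    }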
Adaptive Strategies
• Advanced codes model dynamic and irregular behavior:
  • Solidification: adaptive grid refinement
  • Quench: complex dependencies, parallelization within elements
• To parallelize these effectively, adaptive runtime strategies are necessary
Multi-partition decomposition
• Idea: decompose the problem into a number of partitions, independent of the number of processors
  • # partitions > # processors
• The system maps partitions to processors
  • The system should be able to map and re-map objects as needed
Charm++
• A parallel C++ library
• Supports data-driven objects: singleton objects, object arrays, groups, ...
• Many objects per processor, with method execution scheduled by the availability of data
• The system supports automatic instrumentation and object migration
• Works with other paradigms: MPI, OpenMP, ...
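A minimal Charm++-style sketch tying the two previous slides together: the program creates many more array elements (partitions) than processors, the runtime owns the mapping, and each element's pup() routine lets the system migrate it later. The chare name Partition and its methods are illustrative, not the OPAAL code.

    // partition.ci (interface file, shown for context):
    //   mainmodule part {
    //     array [1D] Partition { entry Partition(); entry void iterate(int step); };
    //   }

    #include <vector>
    #include "pup_stl.h"
    #include "part.decl.h"

    class Partition : public CBase_Partition {
      std::vector<double> localData;        // this partition's share of the domain
     public:
      Partition() { localData.resize(1024); }
      Partition(CkMigrateMessage *m) {}

      // Entry method: runs whenever a message for this element is scheduled,
      // i.e. execution is driven by the availability of data.
      void iterate(int step) {
        // compute on localData, send boundary values to neighbor elements ...
      }

      // pup() lets the runtime move this object to another processor.
      void pup(PUP::er &p) {
        CBase_Partition::pup(p);
        p | localData;
      }
    };

    #include "part.def.h"

    // In the main chare: create far more partitions than processors and let
    // the runtime (not the programmer) decide the mapping.
    //   int numPartitions = 8 * CkNumPes();
    //   CProxy_Partition parts = CProxy_Partition::ckNew(numPartitions);
    //   parts.iterate(0);   // broadcast to all elements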
Data-driven execution in Charm++
[Diagram: per-processor schedulers, each with its own message queue]
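Conceptually, the diagram corresponds to a loop like the sketch below. This is an illustration of the idea only, not the actual Charm++ scheduler; dispatch() is a hypothetical stub standing in for the runtime's method lookup.

    #include <queue>
    #include <cstdio>

    struct Message { int objectId; int entryMethod; };

    // Hypothetical stub: find the target object and run the requested method.
    void dispatch(const Message &m) {
      std::printf("invoke method %d on object %d\n", m.entryMethod, m.objectId);
    }

    void scheduler_loop(std::queue<Message> &messageQ) {
      // Execution is driven entirely by messages: objects never block waiting
      // for data, so the processor stays busy as long as any object has a
      // deliverable message. (The real scheduler also polls the network.)
      while (!messageQ.empty()) {
        Message m = messageQ.front();
        messageQ.pop();
        dispatch(m);                  // run the entry method to completion
      }
    }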
Load Balancing Framework
• Aimed at handling:
  • Continuous (slow) load variation
  • Abrupt load variation (refinement)
  • Workstation clusters in multi-user mode
• Measurement based:
  • Exploits temporal persistence of computation and communication structures
  • Very accurate (compared with estimation)
  • Instrumentation possible via Charm++/Converse
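In Charm++ terms, a migratable object typically opts into measurement-based balancing through the AtSync protocol. A hedged sketch follows (the class name and the work inside doStep() are illustrative, and details of the 1999 framework may differ); the runtime instruments each object's load and communication automatically, so the object only has to say when it is safe to migrate.

    // Sketch of an array element participating in measurement-based load
    // balancing (interface file omitted).
    class Cell : public CBase_Cell {
     public:
      Cell() {
        usesAtSync = true;              // enable the AtSync load-balancing protocol
      }
      Cell(CkMigrateMessage *m) {}

      void doStep() {
        // ... one step of computation; load is measured by the runtime ...
        AtSync();                       // checkpoint: the balancer may migrate objects
      }

      // Called by the runtime once balancing (and any migration) is done.
      void ResumeFromSync() {
        thisProxy[thisIndex].doStep();  // continue with the next step
      }

      void pup(PUP::er &p) {
        CBase_Cell::pup(p);
        // pack and unpack member state here so the object can migrate
      }
    };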
Utility of the framework: workstation clusters
• Cluster of 8 machines; one machine gets another job
  • The parallel job slows down on all machines
• Using the framework:
  • Detection mechanism
  • Migrate objects away from the overloaded processor
  • Restored almost the original throughput!
Performance on timeshared clusters
Another user logged on at about 28 seconds into a parallel run on 8 workstations. Throughput dipped from 10 steps per second to 7. The load balancer intervened at 35 seconds and restored throughput to almost its initial value.
Utility of the framework: intrinsic load imbalance
• To test the abilities of the framework, a simple problem: Gauss-Jacobi iterations
  • Refine selected sub-domains
• ConSpector: a web-based tool to
  • Submit parallel jobs
  • Monitor performance and application behavior
  • Interact with running jobs via GUI interfaces
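A hypothetical sketch of how this synthetic benchmark generates imbalance: each sub-domain is a chare, and interactively refining a cell multiplies the work it does per iteration. Names and structure are illustrative; includes and the interface file follow the same pattern as the earlier Partition sketch.

    // One Jacobi sub-domain per chare. Refining a cell from the GUI raises
    // refineLevel, so that cell does proportionally more relaxation work
    // and the overall load becomes imbalanced.
    class JacobiCell : public CBase_JacobiCell {
      std::vector<double> u, uNew;      // local grid and its updated copy
      int refineLevel;
     public:
      JacobiCell() : u(4096), uNew(4096), refineLevel(1) {}
      JacobiCell(CkMigrateMessage *m) {}

      void refine() { ++refineLevel; }  // triggered interactively via the GUI

      void relax() {
        for (int r = 0; r < refineLevel; ++r) {
          // one Jacobi sweep: uNew[i] = average of u's neighbors, then swap
        }
        // exchange boundary values with neighboring cells, then iterate again
      }

      void pup(PUP::er &p) {
        CBase_JacobiCell::pup(p);
        p | u; p | uNew; p | refineLevel;
      }
    };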
AppSpector view of the load balancer on the synthetic Jacobi relaxation benchmark. Imbalance is introduced by interactively refining a subset of cells around 9 seconds. The resulting load imbalance brings utilization down to 80% from the peak of 96%. The load balancer kicks in around t = 16 and restores utilization to around 94%.
Using the Load Balancing Framework
[Diagram of the framework stack: automatic conversion from MPI; cross-module interpolation; structured FEM; MPI-on-Charm; Irecv+; framework path; load database + balancer; migration path; Charm++; Converse]
Example application: crack propagation (P. Geubelle et al.)
• Similar in structure to the Quench components
• 1900 lines of F90, rewritten using the FEM framework in C++
  • 1200 lines of C++ code
• Framework: 500 lines of code, reused by all applications
• Parallelization is handled completely by the framework
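Schematically, an application written on top of such a framework supplies only the per-chunk numerics; partitioning, communication, and load balancing stay in the framework. The sketch below uses hypothetical helper names (fem_chunk, fem_update_shared_nodes, driver's exact arguments) rather than the framework's real API.

    #include <vector>

    // Hypothetical data the framework hands to each mesh chunk.
    struct fem_chunk {
      std::vector<int>    elemConn;   // element connectivity for this chunk
      std::vector<double> nodeVal;    // nodal unknowns, including shared nodes
    };

    // Hypothetical framework call: combine contributions on nodes shared
    // with neighboring chunks (stubbed out here for illustration).
    void fem_update_shared_nodes(std::vector<double> &nodeVal) {}

    // The application writes only this per-chunk driver; the framework
    // partitions the mesh, creates one chunk per partition, and calls it.
    void driver(fem_chunk &chunk, int nsteps) {
      for (int step = 0; step < nsteps; ++step) {
        // 1. element loop: assemble per-node contributions from chunk.elemConn
        // 2. reconcile nodes shared between chunks
        fem_update_shared_nodes(chunk.nodeVal);
        // 3. update nodal unknowns for the next time step
      }
    }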
Crack propagation: decomposition into 16 chunks (left) and 128 chunks, 8 per PE (right). The middle area contains cohesive elements. Both decompositions were obtained using Metis. Pictures: S. Breitenfeld and P. Geubelle.
Overhead study on 8 processors
When running on 8 processors, using multiple partitions per processor is also beneficial, due to cache behavior.
Cross-approach comparison
[Chart comparing three versions: the original MPI-F90 code, the Charm++ framework version (all C++), and F90 + Charm++ library]
Summary and Planned Research
• Use the adaptive FEM framework to parallelize the Quench code further
• Quad-tree based solidification code:
  • First phase: parallelize each phase separately
  • Then parallelize across refinement phases
• Refine the FEM framework:
  • Use feedback from applications
  • Add support for implicit solvers and multigrid