NSF/DARPA OPAAL
Adaptive Parallelization Strategies using Data-driven Objects
Laxmikant Kale
First Annual Review, 27-28 October 1999, Iowa City
Outline
• Quench and solidification codes
• Coarse grain parallelization of the quench code
• Adaptive parallelization techniques
• Dynamic variations
• Adaptive load balancing
• Finite element framework with adaptivity
• Preliminary results
Coarse grain parallelization
• Structure of the current sequential quench code:
  • 2-D array of elements (each independently refined)
  • Within-row dependence
  • Independent rows, but they share global variables
• Parallelization using Charm++:
  • 3 hours of effort (after a false start)
  • About 20 lines of change to the F90 code
  • A 100-line Charm++ wrapper
• Observations:
  • Global variables that are defined and used within inner loop iterations are easily dealt with in Charm++, in contrast to OpenMP
  • Dynamic load balancing is possible, but was unnecessary
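For concreteness, here is a minimal sketch of the shape such a Charm++ wrapper around a Fortran row kernel could take. The chare name QuenchRow and the Fortran entry point quench_process_row_ are illustrative assumptions, not the actual OPAAL code.

    // quench.ci (interface file, shown as a comment for context):
    //   mainmodule quench {
    //     array [1D] QuenchRow { entry QuenchRow(); entry void step(int iter); };
    //   }

    #include "quench.decl.h"

    // Existing F90 row kernel, reached through a C binding (name illustrative).
    extern "C" void quench_process_row_(int *row, int *iter);

    class QuenchRow : public CBase_QuenchRow {
     public:
      QuenchRow() {}
      QuenchRow(CkMigrateMessage *m) {}

      // One chare per row: rows are independent, and globals that are
      // defined and used within an iteration stay local to that row's
      // execution, so no locking or privatization is needed.
      void step(int iter) {
        int row = thisIndex;            // this chare's row index
        quench_process_row_(&row, &iter);
      }
    };

    #include "quench.def.h"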
Performance results
Contributors:
• Engineering: N. Sobh, R. Haber
• Computer Science: M. Bhandarkar, R. Liu, L. Kale
OpenMP experience
• Work by J. Hoeflinger and D. Padua, with N. Sobh, R. Haber, J. Dantzig, N. Provatas
• Solidification code:
  • Parallelized using OpenMP
  • Relatively straightforward, after a key decision: parallelize by rows only
OpenMP experience
• Quench code on the Origin2000:
  • Privatization of variables is needed, since the outer loop was parallelized
  • Unexpected initial difficulties with OpenMP led to a large slowdown in the parallelized code
  • Traced to unnecessary locking in the MATMUL intrinsic
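The privatization issue looks roughly like this in OpenMP terms. The kernel below is a hypothetical stand-in (the real loop body is in F90); without the private clause, the per-iteration temporaries would be shared by all threads and race.

    #include <omp.h>

    void quench_outer_loop(int nrows, int ncols, double *field) {
      double scratch, residual;                 // defined and used inside each iteration
      // Parallelizing the outer (row) loop requires privatizing the temporaries.
      #pragma omp parallel for private(scratch, residual)
      for (int row = 0; row < nrows; ++row) {
        for (int col = 0; col < ncols; ++col) {
          scratch = field[row * ncols + col];
          residual = 0.5 * scratch;             // stand-in for the real per-element work
          field[row * ncols + col] = residual;
        }
      }
    }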
Adaptive Strategies
• Advanced codes model dynamic and irregular behavior:
  • Solidification: adaptive grid refinement
  • Quench: complex dependencies, parallelization within elements
• To parallelize these effectively, adaptive runtime strategies are necessary
Multi-partition decomposition
• Idea: decompose the problem into a number of partitions, independent of the number of processors
  • # partitions > # processors
• The system maps partitions to processors
  • The system should be able to map and re-map objects as needed
Charm++
• A parallel C++ library
• Supports data-driven objects: singleton objects, object arrays, groups, ...
• Many objects per processor, with method execution scheduled by the availability of data
• The system supports automatic instrumentation and object migration
• Works with other paradigms: MPI, OpenMP, ...
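A minimal Charm++-style sketch tying the two previous slides together: the program creates many more array elements (partitions) than processors, the runtime owns the mapping, and each element's pup() routine lets the system migrate it later. The chare name Partition and its methods are illustrative, not the OPAAL code.

    // partition.ci (interface file, shown for context):
    //   mainmodule part {
    //     array [1D] Partition { entry Partition(); entry void iterate(int step); };
    //   }

    #include <vector>
    #include "pup_stl.h"
    #include "part.decl.h"

    class Partition : public CBase_Partition {
      std::vector<double> localData;        // this partition's share of the domain
     public:
      Partition() { localData.resize(1024); }
      Partition(CkMigrateMessage *m) {}

      // Entry method: runs whenever a message for this element is scheduled,
      // i.e. execution is driven by the availability of data.
      void iterate(int step) {
        // compute on localData, send boundary values to neighbor elements ...
      }

      // pup() lets the runtime move this object to another processor.
      void pup(PUP::er &p) {
        CBase_Partition::pup(p);
        p | localData;
      }
    };

    #include "part.def.h"

    // In the main chare: create far more partitions than processors and let
    // the runtime (not the programmer) decide the mapping.
    //   int numPartitions = 8 * CkNumPes();
    //   CProxy_Partition parts = CProxy_Partition::ckNew(numPartitions);
    //   parts.iterate(0);   // broadcast to all elements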
Data-driven execution in Charm++
[Diagram: per-processor schedulers, each with its own message queue]
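Conceptually, the diagram corresponds to a loop like the sketch below. This is an illustration of the idea only, not the actual Charm++ scheduler; dispatch() is a hypothetical stub standing in for the runtime's method lookup.

    #include <queue>
    #include <cstdio>

    struct Message { int objectId; int entryMethod; };

    // Hypothetical stub: find the target object and run the requested method.
    void dispatch(const Message &m) {
      std::printf("invoke method %d on object %d\n", m.entryMethod, m.objectId);
    }

    void scheduler_loop(std::queue<Message> &messageQ) {
      // Execution is driven entirely by messages: objects never block waiting
      // for data, so the processor stays busy as long as any object has a
      // deliverable message. (The real scheduler also polls the network.)
      while (!messageQ.empty()) {
        Message m = messageQ.front();
        messageQ.pop();
        dispatch(m);                  // run the entry method to completion
      }
    }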
Load Balancing Framework
• Aimed at handling:
  • Continuous (slow) load variation
  • Abrupt load variation (refinement)
  • Workstation clusters in multi-user mode
• Measurement based:
  • Exploits temporal persistence of computation and communication structures
  • Very accurate (compared with estimation)
  • Instrumentation possible via Charm++/Converse
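In Charm++ terms, a migratable object typically opts into measurement-based balancing through the AtSync protocol. A hedged sketch follows (the class name and the work inside doStep() are illustrative, and details of the 1999 framework may differ); the runtime instruments each object's load and communication automatically, so the object only has to say when it is safe to migrate.

    // Sketch of an array element participating in measurement-based load
    // balancing (interface file omitted).
    class Cell : public CBase_Cell {
     public:
      Cell() {
        usesAtSync = true;              // enable the AtSync load-balancing protocol
      }
      Cell(CkMigrateMessage *m) {}

      void doStep() {
        // ... one step of computation; load is measured by the runtime ...
        AtSync();                       // checkpoint: the balancer may migrate objects
      }

      // Called by the runtime once balancing (and any migration) is done.
      void ResumeFromSync() {
        thisProxy[thisIndex].doStep();  // continue with the next step
      }

      void pup(PUP::er &p) {
        CBase_Cell::pup(p);
        // pack and unpack member state here so the object can migrate
      }
    };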
Utility of the framework: workstation clusters
• Cluster of 8 machines; one machine gets another job
  • The parallel job slows down on all machines
• Using the framework:
  • Detection mechanism
  • Migrate objects away from the overloaded processor
  • Restored almost the original throughput!
Performance on timeshared clusters
Another user logged on at about 28 seconds into a parallel run on 8 workstations. Throughput dipped from 10 steps per second to 7. The load balancer intervened at 35 seconds and restored throughput to almost its initial value.
Utility of the framework: intrinsic load imbalance
• To test the abilities of the framework, a simple problem: Gauss-Jacobi iterations
  • Refine selected sub-domains
• ConSpector: a web-based tool to
  • Submit parallel jobs
  • Monitor performance and application behavior
  • Interact with running jobs via GUI interfaces
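A hypothetical sketch of how this synthetic benchmark generates imbalance: each sub-domain is a chare, and interactively refining a cell multiplies the work it does per iteration. Names and structure are illustrative; includes and the interface file follow the same pattern as the earlier Partition sketch.

    // One Jacobi sub-domain per chare. Refining a cell from the GUI raises
    // refineLevel, so that cell does proportionally more relaxation work
    // and the overall load becomes imbalanced.
    class JacobiCell : public CBase_JacobiCell {
      std::vector<double> u, uNew;      // local grid and its updated copy
      int refineLevel;
     public:
      JacobiCell() : u(4096), uNew(4096), refineLevel(1) {}
      JacobiCell(CkMigrateMessage *m) {}

      void refine() { ++refineLevel; }  // triggered interactively via the GUI

      void relax() {
        for (int r = 0; r < refineLevel; ++r) {
          // one Jacobi sweep: uNew[i] = average of u's neighbors, then swap
        }
        // exchange boundary values with neighboring cells, then iterate again
      }

      void pup(PUP::er &p) {
        CBase_JacobiCell::pup(p);
        p | u; p | uNew; p | refineLevel;
      }
    };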
AppSpector view of the load balancer on the synthetic Jacobi relaxation benchmark. Imbalance is introduced by interactively refining a subset of cells around 9 seconds. The resulting load imbalance brings utilization down to 80% from the peak of 96%. The load balancer kicks in around t = 16 and restores utilization to around 94%.
Using the Load Balancing Framework
[Diagram of the framework stack: automatic conversion from MPI; cross-module interpolation; structured FEM; MPI-on-Charm; Irecv+; framework path; load database + balancer; migration path; Charm++; Converse]
Example application: crack propagation (P. Geubelle et al.)
• Similar in structure to the Quench components
• 1900 lines of F90, rewritten using the FEM framework in C++
  • 1200 lines of C++ code
• Framework: 500 lines of code, reused by all applications
• Parallelization is handled completely by the framework
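Schematically, an application written on top of such a framework supplies only the per-chunk numerics; partitioning, communication, and load balancing stay in the framework. The sketch below uses hypothetical helper names (fem_chunk, fem_update_shared_nodes, driver's exact arguments) rather than the framework's real API.

    #include <vector>

    // Hypothetical data the framework hands to each mesh chunk.
    struct fem_chunk {
      std::vector<int>    elemConn;   // element connectivity for this chunk
      std::vector<double> nodeVal;    // nodal unknowns, including shared nodes
    };

    // Hypothetical framework call: combine contributions on nodes shared
    // with neighboring chunks (stubbed out here for illustration).
    void fem_update_shared_nodes(std::vector<double> &nodeVal) {}

    // The application writes only this per-chunk driver; the framework
    // partitions the mesh, creates one chunk per partition, and calls it.
    void driver(fem_chunk &chunk, int nsteps) {
      for (int step = 0; step < nsteps; ++step) {
        // 1. element loop: assemble per-node contributions from chunk.elemConn
        // 2. reconcile nodes shared between chunks
        fem_update_shared_nodes(chunk.nodeVal);
        // 3. update nodal unknowns for the next time step
      }
    }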
Crack propagation: decomposition into 16 chunks (left) and 128 chunks, 8 per PE (right). The middle area contains cohesive elements. Both decompositions were obtained using Metis. Pictures: S. Breitenfeld and P. Geubelle.
Overhead study on 8 processors
When running on 8 processors, using multiple partitions per processor is also beneficial, due to cache behavior.
Cross-approach comparison
[Chart comparing three versions: the original MPI-F90 code, the Charm++ framework version (all C++), and F90 + Charm++ library]
Summary and Planned Research
• Use the adaptive FEM framework to parallelize the Quench code further
• Quad-tree based solidification code:
  • First phase: parallelize each phase separately
  • Then parallelize across refinement phases
• Refine the FEM framework:
  • Use feedback from applications
  • Add support for implicit solvers and multigrid