Parallel Objects: Virtualization & In-Process Components
Orion Sky Lawlor, Univ. of Illinois at Urbana-Champaign, POHLL-2002
Introduction
Parallel programming is hard:
• Communication takes time
  • Message startup cost
  • Bandwidth & contention
  • Synchronization, race conditions
• Parallelism breaks abstractions
  • Flatten data structures
  • Hand off control between modules
• Harder than serial programming
Motivation
Parallel applications are either:
• Embarrassingly Parallel
  • Trivial: ~1 RA-week of effort
  • E.g. Monte Carlo, parameter sweep, SETI@home
  • Communication totally irrelevant to performance
Motivation
Parallel applications are either:
• Embarrassingly Parallel
• Excruciatingly Parallel
  • Massive: 1+ RA-year effort
  • E.g. "pure" MPI codes of ≥10k lines
  • Communication and synchronization totally determine performance
Motivation
Parallel applications are either:
• Embarrassingly Parallel
• Excruciatingly Parallel
• "We'll be done in 6 months…"
  • Several parallel libraries, codes, and groups; dynamic & adaptive
  • E.g. multiphysics simulation
Serial Solution: Abstract!
Build layers of software:
• High-level: libc, C++ STL, …
• Mid-level: OS kernel
  • Silently schedules processes
  • Keeps the CPU busy even when some processes block
  • Allows a process to ignore other processes
• Low-level: assembler
Parallel Solution: Abstract!
The middle layers are missing:
• High-level: ScaLAPACK, POOMA, …
• Mid-level: ? kernel
  • Silently schedules components
  • Keeps the CPU busy even when some components block
  • Allows a component to ignore other components
• Low-level: MPI
The missing middle layer:
• Provides dynamic computation/communication overlap, even across separate modules
• Handles inter-module handoff
• Pipelines communication
• Improves cache utilization (smaller components)
• Provides a natural layer for advanced features, like process migration
Middle Layer: Implementation
• Real OS processes/threads
  • Robust, reliable, already implemented
  • High performance penalty
  • No parallel features (migration!)
• Converse/Charm++
  • In-process components: efficient
  • Piles of advanced features
  • AMPI: an MPI interface to Charm++
  • Application frameworks
Charm++
• Parallel library for object-oriented C++ applications
• Messaging via method calls
  • Communication "proxy" objects (see the sketch below)
• Methods called by the scheduler
  • System determines who runs next
• Multiple objects per processor
• Object migration fully supported
  • Even with broadcasts, reductions
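To make "messaging via method calls" concrete, here is a minimal Charm++-style sketch. It is an illustration only: the module name hello, the Greeter array, and the element count are invented for this example, not taken from the talk. A method call on a proxy object becomes an asynchronous message, and the scheduler invokes the matching entry method on whichever processor currently holds each object.

```cpp
// hello.ci -- Charm++ interface file; entry methods become messages:
//   mainmodule hello {
//     mainchare Main { entry Main(CkArgMsg *m); };
//     array [1D] Greeter {
//       entry Greeter(void);
//       entry void greet(int step);
//     };
//   };

// hello.C -- illustrative names, compiled with the generated headers
#include "hello.decl.h"

class Main : public CBase_Main {
public:
  Main(CkArgMsg *m) {
    // Create 100 objects; the runtime maps them onto the processors
    CProxy_Greeter grid = CProxy_Greeter::ckNew(100);
    grid.greet(0);  // proxy call == asynchronous broadcast message
    delete m;
  }
};

class Greeter : public CBase_Greeter {
public:
  Greeter() {}
  Greeter(CkMigrateMessage *m) {}  // constructor used on migration
  void greet(int step) {
    CkPrintf("Object %d running on PE %d\n", thisIndex, CkMyPe());
    if (thisIndex == 0) CkExit();  // end the run from one element
  }
};

#include "hello.def.h"
```

Because the proxy call returns immediately, many such objects can share a processor: whenever one is waiting, the scheduler simply runs another.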
Mapping Work to Processors
[Figure: the "User View" of many parallel objects vs. the "System Implementation" that maps them onto physical processors]
AMPI
• MPI interface, implemented on Charm++
• Multiple "virtual processors" per physical processor
  • Implemented as user-level threads
  • Very fast context switching
  • MPI_Recv blocks only the virtual processor, not the physical one (sketch below)
• All the benefits of Charm++
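To see what virtualization buys, consider the ordinary MPI token-ring program below (invented for this example; any MPI code would do). Compiled against AMPI, each rank becomes a user-level thread, so the blocking MPI_Recv suspends only that virtual processor while other ranks on the same physical processor keep computing.

```cpp
// ring.C -- plain MPI code; under AMPI each rank is a user-level thread
#include <mpi.h>
#include <cstdio>

int main(int argc, char **argv) {
  int rank, size, token = 0;
  MPI_Init(&argc, &argv);
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  MPI_Comm_size(MPI_COMM_WORLD, &size);
  if (size > 1) {
    if (rank == 0) {
      MPI_Send(&token, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
      MPI_Recv(&token, 1, MPI_INT, size - 1, 0, MPI_COMM_WORLD,
               MPI_STATUS_IGNORE);
      std::printf("Token incremented %d times\n", token);
    } else {
      // Blocks only this virtual processor; other VPs on the same
      // physical processor keep running.
      MPI_Recv(&token, 1, MPI_INT, rank - 1, 0, MPI_COMM_WORLD,
               MPI_STATUS_IGNORE);
      token++;
      MPI_Send(&token, 1, MPI_INT, (rank + 1) % size, 0, MPI_COMM_WORLD);
    }
  }
  MPI_Finalize();
  return 0;
}
```

An illustrative launch with many more virtual processors than physical ones might look like `./charmrun ./ring +p4 +vp64` (exact flags depend on the AMPI build).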
Application Frameworks
• Domain-specific interfaces: unstructured grids, structured grids, particle-in-cell
• Provide a natural interface for application scientists (Fortran!)
• "Encapsulate" communication (see the hypothetical sketch below)
• Built on Charm++
• The most popular interfaces to Charm++
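The talk does not show the framework API itself, so the following is a purely hypothetical C++ sketch meant only to suggest the shape of such an interface; none of these names come from Charm++, and the real frameworks also target Fortran. The idea is that the user writes serial per-chunk code while the framework owns partitioning and communication.

```cpp
// Hypothetical unstructured-grid framework interface (NOT the actual
// Charm++ framework API): users write per-chunk serial code; the
// framework handles partitioning and shared-node exchange.
#include <vector>

struct MeshChunk {
  std::vector<double> nodeField;  // one value per local mesh node

  void updateSharedNodes() {
    // In a real framework this would exchange shared-node values with
    // neighboring chunks; all communication is "encapsulated" here.
    // Stubbed out in this sketch.
  }
};

// Driver routine the framework would call once per chunk -- so there
// can be many chunks (virtual processors) per physical processor:
void driver(MeshChunk &c) {
  for (int step = 0; step < 100; step++) {
    // ... purely local computation on c.nodeField ...
    c.updateSharedNodes();  // framework-managed communication
  }
}
```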
Charm++ Features: Migration
• Automatic load balancing
  • Balances load by migrating objects
  • Application-independent
  • Built-in data collection (CPU, network)
  • Pluggable "strategy" modules
• Adaptive job scheduler
  • Shrinks/expands a parallel job by migrating objects
  • Dramatic utilization improvement
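Migration hinges on an object being able to serialize its own state. In Charm++ this is done with the PUP (pack/unpack) framework: the object describes its data to a PUP::er, and the same routine packs on the source processor and unpacks on the destination. A minimal sketch follows, with an invented Patch array element; the generated patch.decl.h would come from a corresponding interface file.

```cpp
// patch.C -- sketch of a migratable chare array element
#include <vector>
#include "patch.decl.h"  // generated from the .ci interface file
#include "pup_stl.h"     // PUP support for STL containers

class Patch : public CBase_Patch {
  int n;                     // number of local cells (illustrative)
  std::vector<double> temp;  // per-cell state (illustrative)
public:
  Patch() : n(0) {}
  Patch(CkMigrateMessage *m) {}  // constructor used on arrival

  // Called on both sides of a migration: packs the object's state on
  // the source PE and unpacks it on the destination PE.
  virtual void pup(PUP::er &p) {
    CBase_Patch::pup(p);  // serialize superclass (location) data
    p | n;
    p | temp;
  }
};
#include "patch.def.h"
```

With this in place, a pluggable strategy module can move elements based on the measured CPU and network load, with no further application involvement.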
Examples: Load Balancing
[Figure: 1. Adaptive refinement → 2. Load balancer invoked → 3. Chunks migrated]
Conclusions
• Parallel applications need something like a "kernel"
  • A neutral party to mediate CPU use
  • Significant utilization gains
• Easy to put good tools in the kernel
  • Work migration support
  • Load balancing
• Consider using Charm++: http://charm.cs.uiuc.edu/