Runtime Optimizations via Processor Virtualization
Laxmikant Kale, Parallel Programming Laboratory, Dept. of Computer Science, University of Illinois at Urbana-Champaign
http://charm.cs.uiuc.edu
Outline
• Where we stand: predictable behavior is very important for performance
• What is virtualization: Charm++, AMPI
• Consequences:
  • Software engineering: natural unit of parallelism; data vs. code abstraction; cohesion/coupling: MPI vs. objects
  • Message-driven execution: adaptive overlap; handling jitter; out-of-core execution; caches / controllable SRAM
  • Flexible and dynamic mapping: vacating; accommodating speed variation; shrink/expand
• Principle of persistence: measurement-based load balancing; communication optimizations; learning algorithms
• New uses of compiler analysis: apparently simple; threads, global variables; minimizing state at migration; border fusion
Technical Approach
• Seek optimal division of labor between “system” and programmer:
  • Decomposition done by programmer, everything else automated
• Develop standard library of reusable parallel components
[Diagram: spectrum of automation over decomposition, mapping, scheduling/expression, and specialization, with MPI, Charm++, and HPF placed along it]
Object-based Decomposition
• Idea:
  • Divide the computation into a large number of pieces
    • Independent of the number of processors
    • Typically larger than the number of processors
  • Let the system map objects to processors
• This is “virtualization++”
  • Language and runtime support for virtualization
  • Exploitation of virtualization to the hilt
Object-based Parallelization
• User is only concerned with interaction between objects
[Figure: user's view of interacting objects vs. the system's implementation on processors]
Realizations: Charm++ and AMPI
• Charm++:
  • Parallel C++ with data-driven objects (chares)
  • Object arrays / object collections
  • Object groups:
    • Global object with a “representative” on each PE
  • Asynchronous method invocation
  • Prioritized scheduling
  • Information-sharing abstractions: read-only data, tables, ...
  • Mature, robust, portable (http://charm.cs.uiuc.edu)
• AMPI:
  • Adaptive MPI, with many MPI threads on each processor
Chare Arrays
• Elements are data-driven objects
• Elements are indexed by a user-defined data type: [sparse] 1D, 2D, 3D, tree, ...
• Send messages to an index; receive messages at an element. Reductions and broadcasts across the array
• Dynamic insertion, deletion, migration: and everything still has to work!
Object Arrays
• A collection of chares (data-driven objects),
  • with a single global name for the collection, and
  • each member addressed by an index
• Mapping of element objects to processors is handled by the system
[Figure: user's view — a single logical array A[0], A[1], A[2], A[3], A[..]; system view — elements distributed across processors, e.g., A[0] and A[3] on one processor]
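A minimal sketch of what a 1D chare array looks like in Charm++. The module, file, and class names (hello, Hello) and the 4-element size are illustrative; CBase_Hello, thisIndex, thisProxy, and CProxy_Hello::ckNew come from Charm++'s generated and inherited machinery, and a complete program would also need a main chare (not shown).

```cpp
// hello.ci (Charm++ interface file; names illustrative):
//   module hello {
//     array [1D] Hello {
//       entry Hello();
//       entry void sayHi(int fromIndex);
//     };
//   };

// hello.C
#include "hello.decl.h"   // generated from hello.ci

class Hello : public CBase_Hello {
 public:
  Hello() {}
  Hello(CkMigrateMessage *m) {}     // migration constructor
  void sayHi(int fromIndex) {       // entry method: runs when its message arrives
    CkPrintf("element %d got a message from %d\n", thisIndex, fromIndex);
    if (thisIndex < 3)              // forward along a 4-element array
      thisProxy[thisIndex + 1].sayHi(thisIndex);
  }
};

// Creation, e.g., from a main chare; the runtime maps elements to processors:
//   CProxy_Hello arr = CProxy_Hello::ckNew(4);  // 4 elements, any number of PEs
//   arr[0].sayHi(-1);                           // asynchronous method invocation

#include "hello.def.h"
```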
Comparison with MPI
• Advantage: Charm++
  • Modules/abstractions are centered on application data structures,
    • Not processors
  • Several others ...
• Advantage: MPI
  • Highly popular, widely available, industry standard
  • “Anthropomorphic” view of the processor
    • Many developers find this intuitive
• But mostly:
  • There is no hope of weaning people away from MPI
  • There is no need to choose between them!
Adaptive MPI
• A migration path for legacy MPI codes
  • Gives them the dynamic load balancing capabilities of Charm++
• AMPI = MPI + dynamic load balancing
• Uses Charm++ object arrays and migratable threads
• Minimal modifications needed to convert existing MPI programs
  • Automated via AMPizer
• Bindings for C, C++, and Fortran 90
AMPI: MPI “processes”
• Implemented as virtual processors (user-level migratable threads)
[Figure: many MPI “processes” (virtual processors) mapped onto a smaller set of real processors]
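Under AMPI, an ordinary MPI program runs unchanged; each rank becomes a migratable user-level thread, so ranks can outnumber physical processors. A minimal example, with illustrative build/run commands (ampicxx, charmrun, and the +vp option are AMPI's usual tooling, but flags vary by version):

```cpp
#include <mpi.h>
#include <cstdio>

int main(int argc, char **argv) {
  MPI_Init(&argc, &argv);
  int rank, size;
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  MPI_Comm_size(MPI_COMM_WORLD, &size);   // = number of virtual processors
  std::printf("virtual processor %d of %d\n", rank, size);
  MPI_Finalize();
  return 0;
}

// Illustrative build and run:
//   ampicxx pgm.C -o pgm
//   ./charmrun +p4 ./pgm +vp64   // 64 MPI "processes" on 4 physical processors
```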
II: Consequences of Virtualization
• Better software engineering
• Message-driven execution
• Flexible and dynamic mapping to processors
Modularization
• “Number of processors” is decoupled from logical units
  • E.g., oct-tree nodes for particle data
  • No artificial restriction on the number of processors (such as a cube or a power of 2)
• Modularity:
  • Software engineering: cohesion and coupling
  • MPI's “are on the same processor” is a bad coupling principle
  • Objects liberate you from that:
    • E.g., the solid and fluid modules in a rocket simulation
Rocket Simulation
• Large collaboration headed by Mike Heath
  • DOE-supported center
• Challenge:
  • Multi-component code, with modules from independent researchers
  • MPI was the common base
• AMPI: new wine in an old bottle
  • Easier to convert
  • Can still run the original codes on MPI, unchanged
Rocket Simulation via Virtual Processors
[Figure: Rocflo, Rocface, and Rocsolid module instances realized as virtual processors, interleaved across the machine]
AMPI and Roc*: Communication
[Figure: communication among Rocflo, Rocface, and Rocsolid virtual processors]
Message-Driven Execution
[Figure: two processors, each with a scheduler and a message queue; the scheduler picks the next message and invokes the corresponding object's method]
Adaptive Overlap via Data-driven Objects
• Problem:
  • Processors wait too long at “receive” statements
• Routine communication optimizations in MPI:
  • Move sends up and receives down
  • Sometimes use irecvs, but be careful
• With data-driven objects:
  • Adaptive overlap of computation and communication
  • No object or thread holds up the processor
  • No need to guess which message is likely to arrive first
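A sketch of the contrast (not from the talk; copyIn, computeInterior, and the ghost-count bookkeeping are hypothetical names for a stencil-style exchange):

```cpp
// MPI style: the processor blocks here, and the programmer must guess
// a good receive order.
//   MPI_Recv(left_buf,  n, MPI_DOUBLE, left,  TAG, comm, &status);
//   MPI_Recv(right_buf, n, MPI_DOUBLE, right, TAG, comm, &status);

// Data-driven style: each boundary message invokes an entry method on
// arrival, in whatever order messages actually come; meanwhile the
// scheduler runs other ready objects on this processor.
void Block::recvGhost(int side, int n, double *ghost) {  // entry method
  copyIn(side, n, ghost);             // hypothetical: store the ghost layer
  if (++ghostsArrived_ == 2) {        // both neighbors have reported
    ghostsArrived_ = 0;
    computeInterior();                // hypothetical: now safe to compute
  }
}
```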
Adaptive Overlap and Modules
[Figure: SPMD vs. message-driven modules]
(From A. Gursoy, Simplified Expression of Message-Driven Programs and Quantification of Their Impact on Performance, Ph.D. thesis, April 1994.)
Modularity and Adaptive Overlap
“Parallel Composition Principle: For effective composition of parallel components, a compositional programming language should allow concurrent interleaving of component execution, with the order of execution constrained only by availability of data.”
(Ian Foster, Compositional Parallel Programming Languages, ACM Transactions on Programming Languages and Systems, 1996)
Handling OS Jitter via MDE
• MDE encourages asynchrony
  • Asynchronous reductions, for example
  • Only data dependence should force synchronization
• One benefit:
  • Consider an algorithm with N steps
  • Each step has a different load balance
  • Loose dependence between steps
    • (on neighbors, for example)
  • Sum-of-max (MPI) vs. max-of-sums (MDE)
• OS jitter:
  • Causes random processors to add delays in each step
  • Handled automatically by MDE
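To make sum-of-max vs. max-of-sums concrete (an illustrative two-processor example, not from the talk): suppose P1 and P2 take 10 and 6 time units in step 1, and 6 and 10 in step 2. With a barrier after every step, the total is max(10,6) + max(6,10) = 20; with only loose data dependences, each processor can flow into its next step, and the total approaches max(10+6, 6+10) = 16. In general sum_s max_p t(p,s) >= max_p sum_s t(p,s), and random OS jitter inflates every per-step max in the left-hand expression.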
Virtualization/MDE Leads to Predictability
• Ability to predict:
  • Which data is going to be needed, and
  • Which code will execute
  • Based on the ready queue of object method invocations
• So, we can:
  • Prefetch data accurately
  • Prefetch code if needed
• Out-of-core execution
• Caches vs. controllable SRAM
[Figure: per-processor scheduler (S) and message queue (Q)]
Flexible Dynamic Mapping to Processors
• The system can migrate objects between processors
• Vacate a workstation used by a parallel program
  • Dealing with extraneous loads on shared workstations
• Shrink and expand the set of processors used by an app
  • Adaptive job scheduling
  • Better system utilization
• Adapt to speed differences between processors
  • E.g., a cluster with 500 MHz and 1 GHz processors
• Automatic checkpointing
  • Checkpointing = migrate to disk!
  • Restart on a different number of processors
Load Balancing with AMPI/Charm++
• The Turing cluster has processors with different speeds

  Phase          16P3     16P2     8P3,8P2 w/o LB   8P3,8P2 w/ LB
  Fluid update   75.24    97.50    96.73            86.89
  Solid update   41.86    52.50    52.20            46.83
  Pre-Cor Iter   117.16   150.08   149.01           133.76
  Time Step      235.19   301.56   299.85           267.75
Principle of Persistence
• The parallel analog of the principle of locality
  • A heuristic: holds for most CSE applications
• Once the application is expressed in terms of interacting objects:
  • Object communication patterns and computational loads tend to persist over time
  • In spite of dynamic behavior
    • Abrupt but infrequent changes
    • Slow and small changes
Utility of the Principle of Persistence
• Learning / adaptive algorithms
• Adaptive communication libraries
  • Each-to-all individualized sends
  • Performance depends on many runtime characteristics
  • The library switches between different algorithms
• Measurement-based load balancing
Dynamic Load Balancing for CSE
• Many CSE applications exhibit dynamic behavior
  • Adaptive mesh refinement
  • Atom migration
  • Particle migration in cosmology codes
• Objects allow the RTS to remap them to balance load
  • Just move objects away from overloaded processors to underloaded processors
  • Just??
Measurement-Based Load Balancing
• Based on the principle of persistence
• Runtime instrumentation
  • Measures communication volume and computation time
• Measurement-based load balancers
  • Use the instrumented database periodically to make new decisions
  • Many alternative strategies can use the database
    • Centralized vs. distributed
    • Greedy improvements vs. complete reassignments
    • Taking communication into account
    • Taking dependences into account (more complex)
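A minimal sketch of what a migratable, measurement-balanced array element looks like in Charm++. The class name, the 50-step period, and doWork are illustrative; pup(), usesAtSync, AtSync(), and ResumeFromSync() are the standard Charm++ migration and load-balancing hooks:

```cpp
#include "block.decl.h"   // generated from a matching .ci file (not shown)
#include "pup_stl.h"      // PUP support for STL containers
#include <vector>

class Block : public CBase_Block {
  std::vector<double> data;   // element state that must travel with the object
  int step_ = 0;
  void doWork() { /* hypothetical per-step computation on data */ }
 public:
  Block() { usesAtSync = true; }     // opt in to AtSync()-style balancing
  Block(CkMigrateMessage *m) {}      // migration constructor
  void pup(PUP::er &p) {             // pack/unpack: lets the RTS move the element
    CBase_Block::pup(p);
    p | data;
    p | step_;
  }
  void iterate() {                   // entry method driving the computation
    doWork();
    if (++step_ % 50 == 0)           // arbitrary balancing period
      AtSync();                      // hand control to the load balancer
    else
      thisProxy[thisIndex].iterate();
  }
  void ResumeFromSync() {            // called by the RTS after balancing
    thisProxy[thisIndex].iterate();  // element may now be on a different PE
  }
};

#include "block.def.h"
```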
Load Balancer in Action: Automatic Load Balancing in Crack Propagation
[Figure sequence: 1. elements added; 2. load balancer invoked; 3. chunks migrated]
“Overhead” of Multipartitioning
Optimizing for Communication Patterns
• The parallel-objects runtime system can observe, instrument, and measure communication patterns
  • Communication is from/to objects, not processors
  • Load balancers can use this to optimize object placement
  • Communication libraries can optimize
    • By substituting the most suitable algorithm for each operation
    • Learning at runtime
(V. Krishnan, M.S. thesis, 1996)
Example: All to all on Lemieux
The Other Side: Pipelining
• A sends a large message to B, whereupon B computes
  • Problem: B is idle for a long time while the message gets there
  • Solution: pipelining
    • Send the message in multiple pieces, triggering a computation on each
• Objects make this easy to do (see the sketch below)
• Example:
  • Ab initio computations using the Car-Parrinello method
  • Multiple 3D FFT kernel
Recent collaboration with: R. Car, M. Klein, G. Martyna, M. Tuckerman, N. Nystrom, J. Torrellas
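A minimal Charm++-style sketch of the chunked-message idea (the .ci declarations are omitted, and K, CHUNK, processChunk, allChunksDone, and the counters are hypothetical names):

```cpp
// Sender A: instead of one big message, send K chunks as separate messages.
void A::sendAll() {
  for (int k = 0; k < K; k++)                      // K asynchronous sends
    bProxy.recvChunk(k, CHUNK, &big[k * CHUNK]);   // marshalled array parameter
}

// Receiver B: each arriving chunk triggers computation on that chunk,
// overlapping the remaining transfers with useful work.
void B::recvChunk(int k, int n, double *piece) {   // entry method
  processChunk(k, n, piece);                       // hypothetical per-chunk work
  if (++arrived_ == K) allChunksDone();            // hypothetical completion hook
}
```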
Effect of Pipelining
Multiple concurrent 3D FFTs on 64 processors of Lemieux (V. Ramkumar, PPL)
[Figure: performance data]
Control Points: Learning and Tuning
• The RTS can automatically optimize the degree of pipelining
  • If it is given a control point (knob) to tune
  • By the application
Controlling pipelining between a pair of objects: S. Krishnan, Ph.D. thesis, 1994
Controlling degree of virtualization: Orchestration framework, ongoing Ph.D. thesis
Example: Molecular Dynamics in NAMD
• Collection of [charged] atoms, with bonds
  • Newtonian mechanics
• At each time step:
  • Calculate forces on each atom
    • Bonded
    • Non-bonded: electrostatic and van der Waals
  • Calculate velocities and advance positions
• 1-femtosecond time step, millions needed!
• Thousands of atoms (1,000 - 500,000)
Collaboration with K. Schulten, R. Skeel, and coworkers
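A generic velocity-Verlet time step in C++, as a hedged illustration of the loop above (this is not NAMD's actual integrator; computeForces stands in for the bonded plus non-bonded evaluation that dominates the cost):

```cpp
#include <vector>

struct Atom { double x[3], v[3], f[3], m; };

// Bonds + electrostatics + van der Waals; the expensive part (not shown).
void computeForces(std::vector<Atom> &atoms);

// One ~1 fs step: half-kick, drift, recompute forces, half-kick.
void timeStep(std::vector<Atom> &atoms, double dt) {
  for (Atom &a : atoms)
    for (int d = 0; d < 3; d++) {
      a.v[d] += 0.5 * dt * a.f[d] / a.m;   // half-kick with old forces
      a.x[d] += dt * a.v[d];               // drift
    }
  computeForces(atoms);                    // forces at the new positions
  for (Atom &a : atoms)
    for (int d = 0; d < 3; d++)
      a.v[d] += 0.5 * dt * a.f[d] / a.m;   // half-kick with new forces
}
```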
NAMD Performance Using Virtualization
• Written in Charm++
• Uses measurement-based load balancing
• Object-level performance feedback
  • Using the “Projections” tool for Charm++
  • Identifies problems at the source level easily
  • Almost suggests fixes
• Attained unprecedented performance
Grainsize Analysis
[Figure: the problem — some compute objects have too much work]
• Solution: split compute objects that may have too much work, using a heuristic based on the number of interacting atoms
Integration Overhead Analysis
• Problem: integration time had doubled from the sequential run
[Figure: performance data highlighting the integration phase]
Improved Performance
• Data published in SC2000: Gordon Bell Award finalist
Role of Compilers
• New uses of compiler analysis
  • Apparently simple, but then again, data-flow analysis must have seemed simple once
• Supporting threads
  • Shades of global variables
  • Minimizing state at migration
  • Border fusion
• Split-phase semantics (UPC)
• Components (separately compiled)
• Compiler – RTS collaboration needed!
Summary
• Virtualization as a magic bullet:
  • Logical decomposition, better software engineering
  • Flexible and dynamic mapping to processors
• Message-driven execution:
  • Adaptive overlap, modularity, predictability
• Principle of persistence:
  • Measurement-based load balancing
  • Adaptive communication libraries
• Future:
  • Compiler support
  • Realize the potential: strategies and applications
More info: http://charm.cs.uiuc.edu
Component Frameworks
• Seek optimal division of labor between “system” and programmer:
  • Decomposition done by programmer, everything else automated
• Develop standard library of reusable parallel components
  • Domain-specific frameworks
[Diagram: the same decomposition / mapping / scheduling-expression / specialization spectrum, with MPI, Charm++, and HPF placed along it]