Runtime Optimizations via Processor Virtualization
Laxmikant Kale, Parallel Programming Laboratory, Dept. of Computer Science, University of Illinois at Urbana-Champaign
http://charm.cs.uiuc.edu
Outline
• Where we stand: predictable behavior is very important for performance
• What is virtualization: Charm++, AMPI
• Consequences:
  • Software engineering: natural unit of parallelism; data vs. code abstraction; cohesion/coupling: MPI vs. objects
  • Message-driven execution: adaptive overlap; handling jitter; out-of-core execution; caches / controllable SRAM
  • Flexible and dynamic mapping: vacating; accommodating speed variation; shrink/expand
• Principle of persistence: measurement-based load balancing; communication optimizations; learning algorithms
• New uses of compiler analysis: apparently simple; threads, global variables; minimizing state at migration; border fusion
Technical Approach
• Seek optimal division of labor between “system” and programmer:
  • Decomposition done by programmer, everything else automated
• Develop standard library of reusable parallel components
[Diagram: spectrum of automation over decomposition, mapping, scheduling/expression, and specialization, with MPI, Charm++, and HPF placed along it]
Object-based Decomposition
• Idea:
  • Divide the computation into a large number of pieces
    • Independent of the number of processors
    • Typically larger than the number of processors
  • Let the system map objects to processors
• This is “virtualization++”
  • Language and runtime support for virtualization
  • Exploitation of virtualization to the hilt
Object-based Parallelization
• User is only concerned with interaction between objects
[Figure: user's view of interacting objects vs. the system's implementation on processors]
Realizations: Charm++ and AMPI
• Charm++:
  • Parallel C++ with data-driven objects (chares)
  • Object arrays / object collections
  • Object groups:
    • Global object with a “representative” on each PE
  • Asynchronous method invocation
  • Prioritized scheduling
  • Information-sharing abstractions: read-only data, tables, ...
  • Mature, robust, portable (http://charm.cs.uiuc.edu)
• AMPI:
  • Adaptive MPI, with many MPI threads on each processor
Chare Arrays
• Elements are data-driven objects
• Elements are indexed by a user-defined data type: [sparse] 1D, 2D, 3D, tree, ...
• Send messages to an index; receive messages at an element. Reductions and broadcasts across the array
• Dynamic insertion, deletion, migration: and everything still has to work!
Object Arrays
• A collection of chares (data-driven objects),
  • with a single global name for the collection, and
  • each member addressed by an index
• Mapping of element objects to processors is handled by the system
[Figure: user's view — a single logical array A[0], A[1], A[2], A[3], A[..]; system view — elements distributed across processors, e.g., A[0] and A[3] on one processor]
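A minimal sketch of what a 1D chare array looks like in Charm++. The module, file, and class names (hello, Hello) and the 4-element size are illustrative; CBase_Hello, thisIndex, thisProxy, and CProxy_Hello::ckNew come from Charm++'s generated and inherited machinery, and a complete program would also need a main chare (not shown).

```cpp
// hello.ci (Charm++ interface file; names illustrative):
//   module hello {
//     array [1D] Hello {
//       entry Hello();
//       entry void sayHi(int fromIndex);
//     };
//   };

// hello.C
#include "hello.decl.h"   // generated from hello.ci

class Hello : public CBase_Hello {
 public:
  Hello() {}
  Hello(CkMigrateMessage *m) {}     // migration constructor
  void sayHi(int fromIndex) {       // entry method: runs when its message arrives
    CkPrintf("element %d got a message from %d\n", thisIndex, fromIndex);
    if (thisIndex < 3)              // forward along a 4-element array
      thisProxy[thisIndex + 1].sayHi(thisIndex);
  }
};

// Creation, e.g., from a main chare; the runtime maps elements to processors:
//   CProxy_Hello arr = CProxy_Hello::ckNew(4);  // 4 elements, any number of PEs
//   arr[0].sayHi(-1);                           // asynchronous method invocation

#include "hello.def.h"
```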
Comparison with MPI
• Advantage: Charm++
  • Modules/abstractions are centered on application data structures,
    • Not processors
  • Several others ...
• Advantage: MPI
  • Highly popular, widely available, industry standard
  • “Anthropomorphic” view of the processor
    • Many developers find this intuitive
• But mostly:
  • There is no hope of weaning people away from MPI
  • There is no need to choose between them!
Adaptive MPI
• A migration path for legacy MPI codes
  • Gives them the dynamic load balancing capabilities of Charm++
• AMPI = MPI + dynamic load balancing
• Uses Charm++ object arrays and migratable threads
• Minimal modifications needed to convert existing MPI programs
  • Automated via AMPizer
• Bindings for C, C++, and Fortran 90
AMPI: MPI “processes”
• Implemented as virtual processors (user-level migratable threads)
[Figure: many MPI “processes” (virtual processors) mapped onto a smaller set of real processors]
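Under AMPI, an ordinary MPI program runs unchanged; each rank becomes a migratable user-level thread, so ranks can outnumber physical processors. A minimal example, with illustrative build/run commands (ampicxx, charmrun, and the +vp option are AMPI's usual tooling, but flags vary by version):

```cpp
#include <mpi.h>
#include <cstdio>

int main(int argc, char **argv) {
  MPI_Init(&argc, &argv);
  int rank, size;
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  MPI_Comm_size(MPI_COMM_WORLD, &size);   // = number of virtual processors
  std::printf("virtual processor %d of %d\n", rank, size);
  MPI_Finalize();
  return 0;
}

// Illustrative build and run:
//   ampicxx pgm.C -o pgm
//   ./charmrun +p4 ./pgm +vp64   // 64 MPI "processes" on 4 physical processors
```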
II: Consequences of Virtualization
• Better software engineering
• Message-driven execution
• Flexible and dynamic mapping to processors
Modularization
• “Number of processors” is decoupled from logical units
  • E.g., oct-tree nodes for particle data
  • No artificial restriction on the number of processors (such as a cube or a power of 2)
• Modularity:
  • Software engineering: cohesion and coupling
  • MPI's “are on the same processor” is a bad coupling principle
  • Objects liberate you from that:
    • E.g., the solid and fluid modules in a rocket simulation
Rocket Simulation
• Large collaboration headed by Mike Heath
  • DOE-supported center
• Challenge:
  • Multi-component code, with modules from independent researchers
  • MPI was the common base
• AMPI: new wine in an old bottle
  • Easier to convert
  • Can still run the original codes on MPI, unchanged
Rocket Simulation via Virtual Processors
[Figure: Rocflo, Rocface, and Rocsolid module instances realized as virtual processors, interleaved across the machine]
AMPI and Roc*: Communication
[Figure: communication among Rocflo, Rocface, and Rocsolid virtual processors]
Message-Driven Execution
[Figure: two processors, each with a scheduler and a message queue; the scheduler picks the next message and invokes the corresponding object's method]
Adaptive Overlap via Data-driven Objects
• Problem:
  • Processors wait too long at “receive” statements
• Routine communication optimizations in MPI:
  • Move sends up and receives down
  • Sometimes use irecvs, but be careful
• With data-driven objects:
  • Adaptive overlap of computation and communication
  • No object or thread holds up the processor
  • No need to guess which message is likely to arrive first
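A sketch of the contrast (not from the talk; copyIn, computeInterior, and the ghost-count bookkeeping are hypothetical names for a stencil-style exchange):

```cpp
// MPI style: the processor blocks here, and the programmer must guess
// a good receive order.
//   MPI_Recv(left_buf,  n, MPI_DOUBLE, left,  TAG, comm, &status);
//   MPI_Recv(right_buf, n, MPI_DOUBLE, right, TAG, comm, &status);

// Data-driven style: each boundary message invokes an entry method on
// arrival, in whatever order messages actually come; meanwhile the
// scheduler runs other ready objects on this processor.
void Block::recvGhost(int side, int n, double *ghost) {  // entry method
  copyIn(side, n, ghost);             // hypothetical: store the ghost layer
  if (++ghostsArrived_ == 2) {        // both neighbors have reported
    ghostsArrived_ = 0;
    computeInterior();                // hypothetical: now safe to compute
  }
}
```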
Adaptive Overlap and Modules
[Figure: SPMD vs. message-driven modules]
(From A. Gursoy, Simplified Expression of Message-Driven Programs and Quantification of Their Impact on Performance, Ph.D. thesis, April 1994.)
Modularity and Adaptive Overlap
“Parallel Composition Principle: For effective composition of parallel components, a compositional programming language should allow concurrent interleaving of component execution, with the order of execution constrained only by availability of data.”
(Ian Foster, Compositional Parallel Programming Languages, ACM Transactions on Programming Languages and Systems, 1996)
Handling OS Jitter via MDE
• MDE encourages asynchrony
  • Asynchronous reductions, for example
  • Only data dependence should force synchronization
• One benefit:
  • Consider an algorithm with N steps
  • Each step has a different load balance
  • Loose dependence between steps
    • (on neighbors, for example)
  • Sum-of-max (MPI) vs. max-of-sums (MDE)
• OS jitter:
  • Causes random processors to add delays in each step
  • Handled automatically by MDE
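To make sum-of-max vs. max-of-sums concrete (an illustrative two-processor example, not from the talk): suppose P1 and P2 take 10 and 6 time units in step 1, and 6 and 10 in step 2. With a barrier after every step, the total is max(10,6) + max(6,10) = 20; with only loose data dependences, each processor can flow into its next step, and the total approaches max(10+6, 6+10) = 16. In general sum_s max_p t(p,s) >= max_p sum_s t(p,s), and random OS jitter inflates every per-step max in the left-hand expression.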
Virtualization/MDE Leads to Predictability
• Ability to predict:
  • Which data is going to be needed, and
  • Which code will execute
  • Based on the ready queue of object method invocations
• So, we can:
  • Prefetch data accurately
  • Prefetch code if needed
• Out-of-core execution
• Caches vs. controllable SRAM
[Figure: per-processor scheduler (S) and message queue (Q)]
Flexible Dynamic Mapping to Processors
• The system can migrate objects between processors
• Vacate a workstation used by a parallel program
  • Dealing with extraneous loads on shared workstations
• Shrink and expand the set of processors used by an app
  • Adaptive job scheduling
  • Better system utilization
• Adapt to speed differences between processors
  • E.g., a cluster with 500 MHz and 1 GHz processors
• Automatic checkpointing
  • Checkpointing = migrate to disk!
  • Restart on a different number of processors
Load Balancing with AMPI/Charm++
• The Turing cluster has processors with different speeds

  Phase          16P3     16P2     8P3,8P2 w/o LB   8P3,8P2 w/ LB
  Fluid update   75.24    97.50    96.73            86.89
  Solid update   41.86    52.50    52.20            46.83
  Pre-Cor Iter   117.16   150.08   149.01           133.76
  Time Step      235.19   301.56   299.85           267.75
Principle of Persistence
• The parallel analog of the principle of locality
  • A heuristic: holds for most CSE applications
• Once the application is expressed in terms of interacting objects:
  • Object communication patterns and computational loads tend to persist over time
  • In spite of dynamic behavior
    • Abrupt but infrequent changes
    • Slow and small changes
Utility of the Principle of Persistence
• Learning / adaptive algorithms
• Adaptive communication libraries
  • Each-to-all individualized sends
  • Performance depends on many runtime characteristics
  • The library switches between different algorithms
• Measurement-based load balancing
Dynamic Load Balancing for CSE
• Many CSE applications exhibit dynamic behavior
  • Adaptive mesh refinement
  • Atom migration
  • Particle migration in cosmology codes
• Objects allow the RTS to remap them to balance load
  • Just move objects away from overloaded processors to underloaded processors
  • Just??
Measurement-Based Load Balancing
• Based on the principle of persistence
• Runtime instrumentation
  • Measures communication volume and computation time
• Measurement-based load balancers
  • Use the instrumented database periodically to make new decisions
  • Many alternative strategies can use the database
    • Centralized vs. distributed
    • Greedy improvements vs. complete reassignments
    • Taking communication into account
    • Taking dependences into account (more complex)
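A minimal sketch of what a migratable, measurement-balanced array element looks like in Charm++. The class name, the 50-step period, and doWork are illustrative; pup(), usesAtSync, AtSync(), and ResumeFromSync() are the standard Charm++ migration and load-balancing hooks:

```cpp
#include "block.decl.h"   // generated from a matching .ci file (not shown)
#include "pup_stl.h"      // PUP support for STL containers
#include <vector>

class Block : public CBase_Block {
  std::vector<double> data;   // element state that must travel with the object
  int step_ = 0;
  void doWork() { /* hypothetical per-step computation on data */ }
 public:
  Block() { usesAtSync = true; }     // opt in to AtSync()-style balancing
  Block(CkMigrateMessage *m) {}      // migration constructor
  void pup(PUP::er &p) {             // pack/unpack: lets the RTS move the element
    CBase_Block::pup(p);
    p | data;
    p | step_;
  }
  void iterate() {                   // entry method driving the computation
    doWork();
    if (++step_ % 50 == 0)           // arbitrary balancing period
      AtSync();                      // hand control to the load balancer
    else
      thisProxy[thisIndex].iterate();
  }
  void ResumeFromSync() {            // called by the RTS after balancing
    thisProxy[thisIndex].iterate();  // element may now be on a different PE
  }
};

#include "block.def.h"
```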
Load Balancer in Action: Automatic Load Balancing in Crack Propagation
[Figure sequence: 1. elements added; 2. load balancer invoked; 3. chunks migrated]
“Overhead” of Multipartitioning
Optimizing for Communication Patterns
• The parallel-objects runtime system can observe, instrument, and measure communication patterns
  • Communication is from/to objects, not processors
  • Load balancers can use this to optimize object placement
  • Communication libraries can optimize
    • By substituting the most suitable algorithm for each operation
    • Learning at runtime
(V. Krishnan, M.S. thesis, 1996)
Example: All to all on Lemieux
The Other Side: Pipelining
• A sends a large message to B, whereupon B computes
  • Problem: B is idle for a long time while the message gets there
  • Solution: pipelining
    • Send the message in multiple pieces, triggering a computation on each
• Objects make this easy to do (see the sketch below)
• Example:
  • Ab initio computations using the Car-Parrinello method
  • Multiple 3D FFT kernel
Recent collaboration with: R. Car, M. Klein, G. Martyna, M. Tuckerman, N. Nystrom, J. Torrellas
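A minimal Charm++-style sketch of the chunked-message idea (the .ci declarations are omitted, and K, CHUNK, processChunk, allChunksDone, and the counters are hypothetical names):

```cpp
// Sender A: instead of one big message, send K chunks as separate messages.
void A::sendAll() {
  for (int k = 0; k < K; k++)                      // K asynchronous sends
    bProxy.recvChunk(k, CHUNK, &big[k * CHUNK]);   // marshalled array parameter
}

// Receiver B: each arriving chunk triggers computation on that chunk,
// overlapping the remaining transfers with useful work.
void B::recvChunk(int k, int n, double *piece) {   // entry method
  processChunk(k, n, piece);                       // hypothetical per-chunk work
  if (++arrived_ == K) allChunksDone();            // hypothetical completion hook
}
```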
Effect of Pipelining
Multiple concurrent 3D FFTs on 64 processors of Lemieux (V. Ramkumar, PPL)
[Figure: performance data]
Control Points: Learning and Tuning
• The RTS can automatically optimize the degree of pipelining
  • If it is given a control point (knob) to tune
  • By the application
Controlling pipelining between a pair of objects: S. Krishnan, Ph.D. thesis, 1994
Controlling degree of virtualization: Orchestration framework, ongoing Ph.D. thesis
Example: Molecular Dynamics in NAMD
• Collection of [charged] atoms, with bonds
  • Newtonian mechanics
• At each time step:
  • Calculate forces on each atom
    • Bonded
    • Non-bonded: electrostatic and van der Waals
  • Calculate velocities and advance positions
• 1-femtosecond time step, millions needed!
• Thousands of atoms (1,000 - 500,000)
Collaboration with K. Schulten, R. Skeel, and coworkers
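A generic velocity-Verlet time step in C++, as a hedged illustration of the loop above (this is not NAMD's actual integrator; computeForces stands in for the bonded plus non-bonded evaluation that dominates the cost):

```cpp
#include <vector>

struct Atom { double x[3], v[3], f[3], m; };

// Bonds + electrostatics + van der Waals; the expensive part (not shown).
void computeForces(std::vector<Atom> &atoms);

// One ~1 fs step: half-kick, drift, recompute forces, half-kick.
void timeStep(std::vector<Atom> &atoms, double dt) {
  for (Atom &a : atoms)
    for (int d = 0; d < 3; d++) {
      a.v[d] += 0.5 * dt * a.f[d] / a.m;   // half-kick with old forces
      a.x[d] += dt * a.v[d];               // drift
    }
  computeForces(atoms);                    // forces at the new positions
  for (Atom &a : atoms)
    for (int d = 0; d < 3; d++)
      a.v[d] += 0.5 * dt * a.f[d] / a.m;   // half-kick with new forces
}
```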
NAMD Performance Using Virtualization
• Written in Charm++
• Uses measurement-based load balancing
• Object-level performance feedback
  • Using the “Projections” tool for Charm++
  • Identifies problems at the source level easily
  • Almost suggests fixes
• Attained unprecedented performance
Grainsize Analysis
[Figure: the problem — some compute objects have too much work]
• Solution: split compute objects that may have too much work, using a heuristic based on the number of interacting atoms
Integration Overhead Analysis
• Problem: integration time had doubled from the sequential run
[Figure: performance data highlighting the integration phase]
Improved Performance
• Data published in SC2000: Gordon Bell Award finalist
Role of Compilers
• New uses of compiler analysis
  • Apparently simple, but then again, data-flow analysis must have seemed simple once
• Supporting threads
  • Shades of global variables
  • Minimizing state at migration
  • Border fusion
• Split-phase semantics (UPC)
• Components (separately compiled)
• Compiler – RTS collaboration needed!
Summary
• Virtualization as a magic bullet:
  • Logical decomposition, better software engineering
  • Flexible and dynamic mapping to processors
• Message-driven execution:
  • Adaptive overlap, modularity, predictability
• Principle of persistence:
  • Measurement-based load balancing
  • Adaptive communication libraries
• Future:
  • Compiler support
  • Realize the potential: strategies and applications
More info: http://charm.cs.uiuc.edu
Component Frameworks
• Seek optimal division of labor between “system” and programmer:
  • Decomposition done by programmer, everything else automated
• Develop standard library of reusable parallel components
  • Domain-specific frameworks
[Diagram: the same decomposition / mapping / scheduling-expression / specialization spectrum, with MPI, Charm++, and HPF placed along it]