
Optimized Processor Virtualization for Parallel Programming Efficiency

An overview of processor virtualization via Charm++ and AMPI, from the Parallel Programming Laboratory at the University of Illinois: how decomposing a program into many migratable objects improves software engineering, enables message-driven execution and adaptive overlap, and lets the runtime system remap work dynamically. Topics include object-based decomposition, chare arrays, a comparison with MPI, Adaptive MPI as a migration path for legacy codes, and measurement-based load balancing.


Presentation Transcript


  1. Runtime Optimizations via Processor Virtualization
  Laxmikant Kale, Parallel Programming Laboratory, Dept. of Computer Science, University of Illinois at Urbana-Champaign. http://charm.cs.uiuc.edu

  2. Outline: Where We Stand
  • What is virtualization: Charm++ and AMPI
  • Consequences:
  • Software engineering: natural unit of parallelism; data vs. code abstraction; cohesion/coupling: MPI vs. objects
  • Message-driven execution: adaptive overlap; handling jitter; out-of-core execution; caches / controllable SRAM; predictable behavior (very important for performance)
  • Flexible and dynamic mapping: vacating workstations; accommodating speed variation; shrink/expand
  • Principle of persistence: measurement-based load balancing; communication optimizations; learning algorithms
  • New uses of compiler analysis (apparently simple): threads, global variables, minimizing state at migration, border fusion

  3. Technical Approach
  [Diagram: decomposition, mapping, scheduling/expression, specialization across Charm++, MPI, and HPF]
  • Seek optimal division of labor between “system” and programmer:
  • Decomposition done by programmer, everything else automated
  • Develop a standard library of reusable parallel components

  4. Object-based Decomposition
  • Idea: divide the computation into a large number of pieces
  • Independent of the number of processors; typically larger than the number of processors
  • Let the system map objects to processors
  • This is “virtualization++”
  • Language and runtime support for virtualization
  • Exploitation of virtualization to the hilt

  5. Object-based Parallelization
  • The user is concerned only with the interaction between objects
  [Figure: user view vs. system implementation]

  6. Realizations: Charm++ and AMPI
  • Charm++: parallel C++ with data-driven objects (chares)
  • Object arrays / object collections
  • Object groups: a global object with a “representative” on each PE
  • Asynchronous method invocation
  • Prioritized scheduling
  • Information-sharing abstractions: read-only data, tables, ...
  • Mature, robust, portable (http://charm.cs.uiuc.edu)
  • AMPI: Adaptive MPI, with many MPI threads on each processor
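
To make the Charm++ model concrete, here is a minimal sketch of a 1D chare array with asynchronous method invocation. The two-file layout (.ci interface file plus C++ source) and the CProxy/ckNew/thisIndex machinery follow standard Charm++ conventions; the names Hello, sayHi, and nElements are illustrative, not from the talk.

    // hello.ci -- Charm++ interface file (sketch)
    mainmodule hello {
      readonly CProxy_Main mainProxy;
      readonly int nElements;
      mainchare Main {
        entry Main(CkArgMsg *m);
        entry void done();
      };
      array [1D] Hello {
        entry Hello();
        entry void sayHi(int from);
      };
    };

    // hello.C -- element objects pass a token around the array
    #include "hello.decl.h"

    /*readonly*/ CProxy_Main mainProxy;
    /*readonly*/ int nElements;

    class Main : public CBase_Main {
     public:
      Main(CkArgMsg *m) {
        delete m;
        mainProxy = thisProxy;
        nElements = 8;                 // many more objects than processors
        CProxy_Hello arr = CProxy_Hello::ckNew(nElements);
        arr[0].sayHi(-1);              // asynchronous method invocation on element 0
      }
      void done() { CkExit(); }
    };

    class Hello : public CBase_Hello {
     public:
      Hello() {}
      Hello(CkMigrateMessage *m) {}    // migration constructor
      void sayHi(int from) {
        CkPrintf("Hello from element %d (sent by %d)\n", thisIndex, from);
        if (thisIndex + 1 < nElements)
          thisProxy[thisIndex + 1].sayHi(thisIndex);   // message to the next index
        else
          mainProxy.done();
      }
    };

    #include "hello.def.h"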

  7. Chare Arrays
  • Elements are data-driven objects
  • Elements are indexed by a user-defined data type: [sparse] 1D, 2D, 3D, tree, ...
  • Send messages to an index; receive messages at an element; reductions and broadcasts across the array
  • Dynamic insertion, deletion, migration (and everything still has to work!)

  8. Object Arrays
  • A collection of data-driven objects (aka chares)
  • With a single global name for the collection
  • Each member addressed by an index
  • Mapping of element objects to processors is handled by the system
  [Figure: user’s view: A[0] A[1] A[2] A[3] A[..]]

  9. Object Arrays
  • A collection of chares, with a single global name for the collection, and each member addressed by an index
  • Mapping of element objects to processors is handled by the system
  [Figure: user’s view (A[0] A[1] A[2] A[3] A[..]) vs. system view (e.g., only A[0] and A[3] on a given processor)]

  11. Comparison with MPI
  • Advantage: Charm++
  • Modules/abstractions are centered on application data structures, not processors
  • Several others ...
  • Advantage: MPI
  • Highly popular, widely available, industry standard
  • “Anthropomorphic” view of the processor; many developers find this intuitive
  • But mostly: there is no hope of weaning people away from MPI
  • And there is no need to choose between them!

  12. Adaptive MPI
  • A migration path for legacy MPI codes: gives them the dynamic load-balancing capabilities of Charm++
  • AMPI = MPI + dynamic load balancing
  • Uses Charm++ object arrays and migratable threads
  • Minimal modifications needed to convert existing MPI programs (automated via AMPizer)
  • Bindings for C, C++, and Fortran90

  13. AMPI: MPI “Processes”
  • MPI processes are implemented as virtual processors (user-level migratable threads) mapped onto the real processors
  [Figure: MPI “processes” as virtual processors layered over real processors]
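
As a hedged illustration of the AMPI model: an unmodified MPI program is compiled with AMPI's wrapper and launched with more virtual processors than physical ones, so each MPI "rank" becomes a migratable user-level thread. The program below is plain MPI; the build and run commands follow AMPI's documented conventions (check your installation for exact flags).

    // rankdemo.cpp -- ordinary MPI code; under AMPI, ranks are virtual processors
    #include <mpi.h>
    #include <cstdio>

    int main(int argc, char **argv) {
      MPI_Init(&argc, &argv);
      int rank, size;
      MPI_Comm_rank(MPI_COMM_WORLD, &rank);
      MPI_Comm_size(MPI_COMM_WORLD, &size);  // under AMPI: the number of VIRTUAL processors
      std::printf("virtual rank %d of %d\n", rank, size);
      MPI_Finalize();
      return 0;
    }

    # Build and run (AMPI conventions; 16 virtual ranks on 4 physical processors):
    #   ampicxx rankdemo.cpp -o rankdemo
    #   ./charmrun +p4 ./rankdemo +vp16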

  14. II: Consequences of Virtualization
  • Better software engineering
  • Message-driven execution
  • Flexible and dynamic mapping to processors

  15. Modularization
  • “Number of processors” is decoupled from logical units (e.g., oct-tree nodes for particle data)
  • No artificial restriction on the number of processors (such as requiring a cube of a power of 2)
  • Modularity and software engineering: cohesion and coupling
  • MPI’s “are on the same processor” is a bad coupling principle
  • Objects liberate you from that: e.g., solid and fluid modules in a rocket simulation

  16. Rocket Simulation
  • Large collaboration headed by Mike Heath; DOE-supported center
  • Challenge: multi-component code, with modules from independent researchers; MPI was the common base
  • AMPI: new wine in an old bottle
  • Easier to convert; can still run the original codes on MPI, unchanged

  17. Rocket Simulation via Virtual Processors
  [Figure: Rocflo, Rocface, and Rocsolid module instances as virtual processors distributed across physical processors]

  18. AMPI and Roc*: Communication
  [Figure: communication among Rocflo, Rocface, and Rocsolid virtual processors]

  19. Message-Driven Execution
  [Figure: two processors, each with a scheduler and a message queue]

  20. Adaptive Overlap via Data-driven Objects
  • Problem: processors wait too long at “receive” statements
  • Routine communication optimizations in MPI: move sends up and receives down; sometimes use irecvs, but be careful
  • With data-driven objects: adaptive overlap of computation and communication
  • No object or thread holds up the processor
  • No need to guess which message is likely to arrive first (see the sketch below)
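
A small sketch of the contrast (the names Worker, Block, computeA, and computeB are hypothetical): with blocking receives the programmer must commit to an order, while message-driven entry methods run in whatever order the data actually arrives.

    // MPI style: the order is fixed at coding time; if B arrives first, it waits.
    //   MPI_Recv(&a, ...); computeA(a);
    //   MPI_Recv(&b, ...); computeB(b);

    // Message-driven style: each entry method fires as soon as its data is here.
    class Worker : public CBase_Worker {
     public:
      void recvA(const Block &a) { computeA(a); }  // scheduled when A arrives
      void recvB(const Block &b) { computeB(b); }  // scheduled when B arrives
    };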

  21. Adaptive Overlap and Modules
  [Figure: SPMD vs. message-driven modules]
  (From A. Gursoy, Simplified Expression of Message-Driven Programs and Quantification of Their Impact on Performance, Ph.D. thesis, April 1994.)

  22. Modularity and Adaptive Overlap
  “Parallel Composition Principle: For effective composition of parallel components, a compositional programming language should allow concurrent interleaving of component execution, with the order of execution constrained only by availability of data.” (Ian Foster, Compositional Parallel Programming Languages, ACM Transactions on Programming Languages and Systems, 1996)

  23. Handling OS Jitter via MDE
  • MDE encourages asynchrony: asynchronous reductions, for example
  • Only data dependence should force synchronization
  • One benefit: consider an algorithm with N steps, where each step has a different load balance and only loose dependence between steps (on neighbors, for example)
  • Sum-of-max (MPI) vs. max-of-sums (MDE); a worked example follows
  • OS jitter causes random processors to add delays in each step; this is handled automatically by MDE
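
A small worked example of the sum-of-max vs. max-of-sums point (the numbers are illustrative, not from the talk): take 2 processors and 2 steps with per-step times (10, 6) on processor 1 and (6, 10) on processor 2. With a barrier after every step (MPI style), the total is max(10, 6) + max(6, 10) = 20. If the steps may interleave subject only to data dependence (MDE style), the total is max(10 + 6, 6 + 10) = 16.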

  24. Virtualization/MDE Leads to Predictability
  • Ability to predict: which data is going to be needed, and which code will execute
  • Based on the ready queue of object method invocations

  25. Virtualization/MDE Leads to Predictability
  [Figure: schedulers (S) and message queues (Q)]
  • Ability to predict which data will be needed and which code will execute, based on the ready queue of object method invocations
  • So we can prefetch data accurately, and prefetch code if needed
  • Out-of-core execution
  • Caches vs. controllable SRAM

  26. Flexible Dynamic Mapping to Processors
  • The system can migrate objects between processors:
  • Vacate a workstation used by a parallel program; deal with extraneous loads on shared workstations
  • Shrink and expand the set of processors used by an app: adaptive job scheduling, better system utilization
  • Adapt to speed differences between processors (e.g., a cluster with 500 MHz and 1 GHz processors)
  • Automatic checkpointing: checkpointing = migration to disk! Restart on a different number of processors
  (A sketch of the migration hook appears below.)
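
Migration and checkpointing both rest on Charm++'s pup() mechanism: an object describes its state once, and the same code packs it for migration, disk checkpointing, or restart. A minimal sketch (Patch and its fields are illustrative names):

    #include <pup_stl.h>   // PUP support for STL containers
    #include <vector>

    class Patch : public CBase_Patch {
      int step;
      std::vector<double> coords;
     public:
      Patch() : step(0) {}
      Patch(CkMigrateMessage *m) {}   // constructor used on the destination processor
      void pup(PUP::er &p) {
        p | step;                     // packs on the source, unpacks on the destination
        p | coords;                   // the same line also sizes the buffer
      }
    };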

  27. Load Balancing with AMPI/Charm++
  The Turing cluster has processors of different speeds. Timings per phase:

  Phase          | 16 P3  | 16 P2  | 8 P3 + 8 P2, w/o LB | 8 P3 + 8 P2, w/ LB
  ---------------+--------+--------+---------------------+-------------------
  Fluid update   | 75.24  | 97.50  | 96.73               | 86.89
  Solid update   | 41.86  | 52.50  | 52.20               | 46.83
  Pre-Cor Iter   | 117.16 | 150.08 | 149.01              | 133.76
  Time Step      | 235.19 | 301.56 | 299.85              | 267.75

  28. Principle of Persistence
  • The parallel analog of the principle of locality
  • A heuristic; holds for most CSE applications
  • Once the application is expressed in terms of interacting objects, object communication patterns and computational loads tend to persist over time
  • In spite of dynamic behavior: abrupt but infrequent changes, or slow and small changes

  29. Utility of the Principle of Persistence
  • Learning / adaptive algorithms
  • Adaptive communication libraries: e.g., each-to-all individualized sends
  • Performance depends on many runtime characteristics, so the library switches between different algorithms
  • Measurement-based load balancing

  30. Dynamic Load Balancing for CSE
  • Many CSE applications exhibit dynamic behavior: adaptive mesh refinement, atom migration, particle migration in cosmology codes
  • Objects allow the RTS to remap them to balance load
  • “Just” move objects away from overloaded processors to underloaded processors. Just??

  31. Measurement-Based Load Balancing
  • Based on the principle of persistence
  • Runtime instrumentation measures communication volume and computation time
  • Measurement-based load balancers use the instrumented database periodically to make new decisions
  • Many alternative strategies can use the database: centralized vs. distributed; greedy improvements vs. complete reassignments; taking communication into account; taking dependences into account (more complex). A sketch of the greedy variant follows.
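
As a rough sketch of the "greedy" flavor of strategy (illustrative only; Charm++'s actual balancers also weigh communication and existing placement): sort objects by measured load, then repeatedly assign the heaviest remaining object to the currently least-loaded processor.

    #include <algorithm>
    #include <functional>
    #include <queue>
    #include <vector>

    // objLoad[i] is the measured compute time of object i from the instrumented
    // database; returns assignment[i] = processor chosen for object i.
    std::vector<int> greedyMap(const std::vector<double> &objLoad, int nProcs) {
      std::vector<int> order(objLoad.size());
      for (size_t i = 0; i < order.size(); ++i) order[i] = (int)i;
      std::sort(order.begin(), order.end(),
                [&](int a, int b) { return objLoad[a] > objLoad[b]; });  // heaviest first

      typedef std::pair<double, int> Proc;  // (current load, processor id)
      std::priority_queue<Proc, std::vector<Proc>, std::greater<Proc> > procs;
      for (int p = 0; p < nProcs; ++p) procs.push(Proc(0.0, p));

      std::vector<int> assignment(objLoad.size());
      for (size_t k = 0; k < order.size(); ++k) {
        Proc least = procs.top(); procs.pop();   // least-loaded processor so far
        assignment[order[k]] = least.second;
        least.first += objLoad[order[k]];
        procs.push(least);
      }
      return assignment;
    }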

  32. Load Balancer in Action: Automatic Load Balancing in Crack Propagation
  [Figure: (1) elements added, (2) load balancer invoked, (3) chunks migrated]

  33. “Overhead” of Multipartitioning

  34. Optimizing for Communication Patterns
  • The parallel-objects runtime system can observe, instrument, and measure communication patterns
  • Communication is from/to objects, not processors
  • Load balancers can use this to optimize object placement
  • Communication libraries can optimize by substituting the most suitable algorithm for each operation, learning at runtime (V. Krishnan, M.S. thesis, 1996); a sketch follows
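
The runtime-substitution idea can be pictured as a simple decision function (the thresholds and algorithm names are illustrative, not the library's actual rules): short messages pay per-message overhead and benefit from combining along a virtual topology, while long messages are bandwidth-bound and go direct.

    enum AllToAllAlgo { DIRECT, MESH, HYPERCUBE };

    // Chosen at runtime from observed characteristics, per the principle of
    // persistence: recently measured message sizes predict future ones.
    AllToAllAlgo chooseAllToAll(int nProcs, size_t bytesPerMsg) {
      if (bytesPerMsg >= 8192) return DIRECT;      // bandwidth-dominated: send directly
      return (nProcs > 512) ? HYPERCUBE : MESH;    // latency-dominated: combine messages
    }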

  35. Example: All-to-all on Lemieux

  36. The Other Side: Pipelining
  • A sends a large message to B, whereupon B computes
  • Problem: B is idle for a long time while the message gets there
  • Solution: pipelining; send the message in multiple pieces, triggering a computation on each
  • Objects make this easy to do (see the sketch below)
  • Example: ab initio computations using the Car-Parrinello method; multiple 3D FFT kernels
  Recent collaboration with: R. Car, M. Klein, G. Martyna, M. Tuckerman, N. Nystrom, J. Torrellas
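
A sketch of the pipelining pattern in object terms (CHUNK, recvChunk, and computeOn are illustrative names): the sender streams the data in pieces, and each piece triggers computation on the receiver while later pieces are still in flight.

    // Sender side: one logical message becomes many small ones.
    void A::sendWork(const std::vector<double> &data, CProxy_B b) {
      const size_t CHUNK = 4096;
      for (size_t off = 0; off < data.size(); off += CHUNK) {
        size_t n = std::min(CHUNK, data.size() - off);
        b.recvChunk((int)n, &data[off]);   // asynchronous; does not block the sender
      }
    }

    // Receiver side: compute on each chunk as it arrives.
    void B::recvChunk(int n, const double *chunk) {
      computeOn(chunk, n);                 // overlaps with communication of later chunks
    }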

  38. Effect of Pipelining
  [Figure: multiple concurrent 3D FFTs on 64 processors of Lemieux; V. Ramkumar (PPL)]

  39. Control Points: Learning and Tuning
  • The RTS can automatically optimize the degree of pipelining, if it is given a control point (knob) to tune by the application
  • Controlling pipelining between a pair of objects: S. Krishnan, Ph.D. thesis, 1994
  • Controlling the degree of virtualization: orchestration framework (ongoing Ph.D. thesis)

  40. Example: Molecular Dynamics in NAMD
  • A collection of [charged] atoms, with bonds, governed by Newtonian mechanics
  • At each time step: calculate forces on each atom (bonded; non-bonded: electrostatic and van der Waals), then calculate velocities and advance positions (a generic sketch follows)
  • 1-femtosecond time step, millions needed!
  • Thousands of atoms (1,000 - 500,000)
  Collaboration with K. Schulten, R. Skeel, and coworkers
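
A generic sketch of one such time step (this is textbook molecular dynamics, not NAMD's actual code; constants and bonded terms are omitted):

    #include <cmath>
    #include <vector>

    struct Atom { double x[3], v[3], f[3], mass, charge; };

    // Simplified non-bonded (electrostatic-only) forces; real codes add bonded
    // terms, van der Waals, cutoffs, and constants.
    void computeForces(std::vector<Atom> &atoms) {
      for (size_t i = 0; i < atoms.size(); ++i)
        for (int d = 0; d < 3; ++d) atoms[i].f[d] = 0.0;
      for (size_t i = 0; i < atoms.size(); ++i)
        for (size_t j = i + 1; j < atoms.size(); ++j) {
          double r[3], r2 = 0.0;
          for (int d = 0; d < 3; ++d) { r[d] = atoms[i].x[d] - atoms[j].x[d]; r2 += r[d] * r[d]; }
          double k = atoms[i].charge * atoms[j].charge / (r2 * std::sqrt(r2)); // q1*q2*r_vec/r^3
          for (int d = 0; d < 3; ++d) { atoms[i].f[d] += k * r[d]; atoms[j].f[d] -= k * r[d]; }
        }
    }

    void timeStep(std::vector<Atom> &atoms, double dt) {
      computeForces(atoms);
      for (size_t i = 0; i < atoms.size(); ++i)
        for (int d = 0; d < 3; ++d) {
          atoms[i].v[d] += dt * atoms[i].f[d] / atoms[i].mass;  // advance velocity
          atoms[i].x[d] += dt * atoms[i].v[d];                  // advance position
        }
    }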

  41. NAMD Performance Using Virtualization
  • Written in Charm++; uses measurement-based load balancing
  • Object-level performance feedback using the “Projections” tool for Charm++
  • Identifies problems at the source level easily; almost suggests fixes
  • Attained unprecedented performance

  42. Grainsize Analysis
  • Problem: some compute objects have too much work
  • Solution: split compute objects that may have too much work, using a heuristic based on the number of interacting atoms (sketched below)
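
The splitting heuristic can be stated in a few lines (the threshold and field names are illustrative guesses, not NAMD's actual rule): the work of a pairwise compute object grows with the product of the atom counts it covers, so that product is a cheap proxy for grainsize.

    struct ComputeObj { int atomsA, atomsB; };   // the two atom groups it pairs up

    bool shouldSplit(const ComputeObj &c, long maxPairs = 50000) {
      return (long)c.atomsA * c.atomsB > maxPairs;   // too much work: split the object
    }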

  43. Integration Overhead Analysis
  [Figure: time spent in integration]
  • Problem: integration time had doubled relative to the sequential run

  44. Improved Performance
  Data published in SC2000; Gordon Bell Award finalist

  46. Role of Compilers
  • New uses of compiler analysis; apparently simple, but then again, data-flow analysis must have seemed simple once
  • Supporting threads; shades of global variables
  • Minimizing state at migration; border fusion
  • Split-phase semantics (UPC)
  • Components (separately compiled)
  • Compiler/RTS collaboration needed!

  47. Summary
  • Virtualization as a magic bullet:
  • Logical decomposition, better software engineering
  • Flexible and dynamic mapping to processors
  • Message-driven execution: adaptive overlap, modularity, predictability
  • Principle of persistence: measurement-based load balancing, adaptive communication libraries
  • Future: compiler support; realize the potential with strategies and applications
  More info: http://charm.cs.uiuc.edu

  48. Component Frameworks
  [Diagram: decomposition, mapping, scheduling/expression, specialization across Charm++, MPI, and HPF]
  • Seek optimal division of labor between “system” and programmer:
  • Decomposition done by programmer, everything else automated
  • Develop a standard library of reusable parallel components
  • Domain-specific frameworks
