Techniques for Developing Efficient Petascale Applications Laxmikant (Sanjay) Kale http://charm.cs.uiuc.edu Parallel Programming Laboratory Department of Computer Science University of Illinois at Urbana Champaign
Application Oriented Parallel Abstractions Synergy between Computer Science research and applications has been beneficial to both [Diagram: Charm++ techniques & libraries at the center, connected to NAMD, LeanCP, ChaNGa, Rocket Simulation, Space-time meshing, and other applications/issues] SciDAC-08
Enabling CS technology of parallel objects and intelligent runtime systems (Charm++ and AMPI) has led to several collaborative applications in CSE
Outline • Techniques for Petascale Programming: • My biases: isoefficiency analysis, overdecomposition, automated resource management, sophisticated performance analysis tools • BigSim: • Early development of applications on future machines • Dealing with multicore processors and SMP nodes • Novel parallel programming models
Techniques For Scaling to Multi-PetaFLOP/s
1. Analyze Scalability of Algorithms (say, via the isoefficiency metric)
Isoefficiency Analysis Vipin Kumar (circa 1990) • An algorithm is scalable if, when you double the number of processors available to it, it can retain the same parallel efficiency by increasing the size of the problem by some amount • Not all algorithms are scalable in this sense • Isoefficiency is the rate at which the problem size must be increased, as a function of the number of processors, to keep the same efficiency • Parallel efficiency = T1/(Tp*P), where T1 is the time on one processor and Tp is the time on P processors [Figure: equal-efficiency curves in the problem-size vs. processors plane]
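To make the definition concrete, here is a small Python sketch (not from the talk; the overhead model and constants are invented for illustration) that checks the isoefficiency relation for an algorithm whose total parallel overhead grows as P log P:

```python
import math

def efficiency(work, p, total_overhead):
    """Parallel efficiency E = T1/(Tp*P) = W/(W + To),
    taking Tp = (W + To(W, P))/P."""
    return work / (work + total_overhead(work, p))

# Hypothetical overhead model: To = c * P * log2(P)
# (the replicated-data decomposition on a later slide has this form).
to = lambda w, p: 5.0 * p * math.log2(p)

e_64 = efficiency(10_000, 64, to)
e_128_same_work = efficiency(10_000, 128, to)   # doubling P alone hurts

# Growing the problem at the isoefficiency rate, W proportional to
# P*log2(P), restores the original efficiency.
w_iso = 10_000 * (128 * math.log2(128)) / (64 * math.log2(64))
e_128_iso = efficiency(w_iso, 128, to)
```

Since E depends only on the ratio W / (P log P) here, scaling W at exactly that rate leaves the efficiency unchanged, which is precisely the isoefficiency function for this model.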
NAMD: A Production MD program • NAMD • Fully featured program • NIH-funded development • Distributed free of charge (~20,000 registered users) • Binaries and source code • Installed at NSF centers • User training and support • Large published simulations
Molecular Dynamics in NAMD • Collection of [charged] atoms, with bonds • Newtonian mechanics • Thousands of atoms (10,000 – 5,000,000) • At each time-step • Calculate forces on each atom • Bonded • Non-bonded: electrostatic and van der Waals • Short-distance: every timestep • Long-distance: using PME (3D FFT) • Multiple Time Stepping: PME every 4 timesteps • Calculate velocities and advance positions • Challenge: femtosecond time-step, millions needed! Collaboration with K. Schulten, R. Skeel, and coworkers
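The per-timestep structure above can be sketched as a toy schedule (only the 4-step PME interval comes from the slide; function and label names are invented):

```python
PME_INTERVAL = 4  # long-range PME forces evaluated every 4 timesteps

def force_schedule(num_steps):
    """Return, per timestep, which force terms are evaluated."""
    schedule = []
    for step in range(num_steps):
        forces = ["bonded", "short_range_nonbonded"]  # every timestep
        if step % PME_INTERVAL == 0:
            forces.append("pme_long_range")           # multiple time stepping
        # ...then calculate velocities and advance positions...
        schedule.append(forces)
    return schedule

steps = force_schedule(8)
```

Multiple time stepping works because the long-range field varies slowly compared with the femtosecond timestep, so evaluating the expensive 3D FFT only every fourth step loses little accuracy.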
Traditional Approaches: non-isoefficient (1996-2002) • Replicated data: • All atom coordinates stored on each processor • Communication/computation ratio: O(P log P) (not scalable) • Partition the atoms array across processors • Nearby atoms may not be on the same processor • C/C ratio: O(P) (not scalable) • Distribute the force matrix to processors • Matrix is sparse, non-uniform • C/C ratio: O(sqrt(P)) (not scalable)
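The quoted ratios can be compared numerically; this sketch (illustrative processor counts, not measured data) shows that all three grow without bound, so none is isoefficient:

```python
import math

# Asymptotic communication-to-computation ratios from the slide.
cc_ratio = {
    "replicated data": lambda p: p * math.log2(p),
    "atom partition":  lambda p: p,
    "force matrix":    lambda p: math.sqrt(p),
}

# Growth factor of each ratio when scaling from 64 to 4,096 processors:
growth = {name: f(4096) / f(64) for name, f in cc_ratio.items()}
# The force-matrix ratio grows slowest, but still grows; an isoefficient
# decomposition needs a ratio bounded as P increases.
```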
Spatial Decomposition Via Charm • Atoms distributed to cubes based on their location • Size of each cube: just a bit larger than the cut-off radius • Communicate only with neighbors • Work: for each pair of neighboring objects • C/C ratio: O(1) • However: • Load imbalance • Limited parallelism • Charm++ is useful to handle this • Cells, cubes, or “patches”
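A minimal sketch of the patch assignment (coordinates and cutoff are hypothetical):

```python
def patch_index(pos, cell_side):
    """Map an atom position to the cube ('patch') containing it."""
    return tuple(int(c // cell_side) for c in pos)

cutoff = 12.0
cell_side = cutoff * 1.05   # "just a bit larger than cut-off radius"

atoms = [(1.0, 2.0, 3.0), (13.0, 2.0, 3.0), (1.5, 2.5, 3.5)]
patches = {}
for atom in atoms:
    patches.setdefault(patch_index(atom, cell_side), []).append(atom)

# Any two atoms within the cutoff lie in the same patch or in adjacent
# patches, so each patch exchanges data only with its (up to) 26 neighbors.
```

Because the neighbor count per patch is a constant independent of P, the communication-to-computation ratio stays O(1) as the machine grows.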
Object Based Parallelization for MD: Force Decomposition + Spatial Decomposition • Now, we have many objects to load balance: • Each diamond can be assigned to any processor • Number of diamonds (3D): 14 · number of patches • 2-away variation: half-size cubes, 5x5x5 interactions • 3-away variation: 7x7x7 interactions
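The object counts above follow from a simple formula; this sketch reproduces the 14-per-patch figure for the 1-away case:

```python
def compute_objects_per_patch(away=1):
    """Pairwise force objects per patch in a k-away decomposition:
    a patch interacts with the (2k+1)^3 - 1 patches around it; each
    pair is shared between two patches, plus one self-interaction."""
    neighbors = (2 * away + 1) ** 3 - 1
    return neighbors // 2 + 1

# 1-away: 26/2 + 1 = 14 diamonds per patch, as on the slide.
# 2-away (half-size cubes): 5x5x5 neighborhood gives 63 objects per patch,
# so many more, finer-grained objects are available to the load balancer.
```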
Performance of NAMD: STMV (~1 million atoms) [Figure: time (ms per step) vs. number of cores]
2. Decouple Decomposition from Physical Processors
Migratable Objects (aka Processor Virtualization) • Programmer: [over]decomposition into virtual processors • Runtime: assigns VPs to processors • Enables adaptive runtime strategies • Implementations: Charm++, AMPI [Diagram: user view vs. system implementation] Benefits • Software engineering • Number of virtual processors can be independently controlled • Separate VPs for different modules • Message driven execution • Adaptive overlap of communication • Predictability: • Automatic out-of-core • Prefetch to local stores • Asynchronous reductions • Dynamic mapping • Heterogeneous clusters • Vacate, adjust to speed, share • Automatic checkpointing • Change the set of processors used • Automatic dynamic load balancing • Communication optimization
Parallel Decomposition and Processors • MPI style encourages: • Decomposition into P pieces, where P is the number of physical processors available • If your natural decomposition is a cube, then the number of processors must be a cube • Charm++/AMPI style “virtual processors”: • Decompose into the natural objects of the application • Let the runtime map them to processors • Decouple decomposition from load balancing
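As a sketch of the decoupling (sizes invented; real runtime mapping strategies are far richer), 512 objects from a natural 8x8x8 cube decomposition run unchanged on 48 processors, which is not a cube:

```python
def initial_map(num_objects, num_procs):
    """A runtime's trivial initial placement: round-robin over processors."""
    return {obj: obj % num_procs for obj in range(num_objects)}

mapping = initial_map(8 * 8 * 8, 48)   # 512 virtual processors on 48 cores
counts = [0] * 48
for proc in mapping.values():
    counts[proc] += 1
# Every processor holds 10 or 11 objects; the runtime can later migrate
# objects to rebalance without touching the decomposition itself.
```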
Decomposition independent of numCores • Rocket simulation example under traditional MPI vs. the Charm++/AMPI framework [Diagram: P paired solid/fluid pieces under MPI vs. n solid and m fluid virtual processors under Charm++/AMPI] • Benefit: load balance, communication optimizations, modularity
Fault Tolerance • Automatic checkpointing • Migrate objects to disk • In-memory checkpointing as an option • Automatic fault detection and restart • “Impending fault” response • Migrate objects to other processors • Adjust processor-level parallel data structures • Scalable fault tolerance • When a processor out of 100,000 fails, all 99,999 shouldn’t have to roll back to their checkpoints! • Sender-side message logging • Latency tolerance helps mitigate costs • Restart can be sped up by spreading out objects from the failed processor
3. Use Dynamic Load Balancing Based on the Principle of Persistence • Principle of persistence: computational loads and communication patterns tend to persist, even in dynamic computations • So, the recent past is a good predictor of the near future
Two-Step Balancing • Computational entities (nodes, structured grid points, particles…) are partitioned into objects • Say, a few thousand FEM elements per chunk, on average • Objects are migrated across processors to balance load • A much smaller problem than, e.g., repartitioning the entire mesh • Occasional repartitioning for some problems
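A minimal sketch of the second step, feeding per-object loads measured over recent (instrumented) timesteps into a textbook greedy heuristic (Charm++'s actual balancers are considerably more sophisticated):

```python
import heapq

def greedy_balance(object_loads, num_procs):
    """Assign the heaviest objects first, each to the currently
    least-loaded processor (loads measured over recent timesteps)."""
    heap = [(0.0, p) for p in range(num_procs)]   # (load so far, proc)
    heapq.heapify(heap)
    assignment = {}
    for obj, load in sorted(object_loads.items(), key=lambda kv: -kv[1]):
        total, p = heapq.heappop(heap)
        assignment[obj] = p
        heapq.heappush(heap, (total + load, p))
    return assignment

# Four measured object loads, two processors: both end up with load 5.0.
assignment = greedy_balance({"a": 4.0, "b": 3.0, "c": 2.0, "d": 1.0}, 2)
```

By the principle of persistence, the measured loads are a good estimate of next-timestep loads, so the resulting assignment stays balanced for a while.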
Load Balancing Steps [Timeline: regular timesteps, then instrumented timesteps feeding detailed, aggressive load balancing, followed by periodic refinement load balancing]
ChaNGa: Parallel Gravity • Collaborative project (NSF ITR) with Prof. Tom Quinn, Univ. of Washington • Components: gravity, gas dynamics • Barnes-Hut tree codes • An oct-tree is the natural decomposition: the geometry has better aspect ratios, so you “open” fewer nodes • But it is often not used because it leads to bad load balance • Assumption: a one-to-one map between sub-trees and processors • Binary trees are considered better load balanced • With Charm++: use the oct-tree, and let Charm++ map sub-trees to processors
ChaNGa Load Balancing • Simple balancers worsened performance! (5.6 s to 6.1 s) • dwarf 5M on 1,024 BlueGene/L processors
Load balancing with OrbRefineLB (5.6 s to 5.0 s) • dwarf 5M on 1,024 BlueGene/L processors • Need sophisticated balancers, and the ability to choose the right ones automatically
ChaNGa Preliminary Performance • ChaNGa: parallel gravity code developed in collaboration with Tom Quinn (Univ. Washington) using Charm++
Load balancing for large machines: I • Centralized balancers achieve the best balance • Collect the object-communication graph on one processor • But won’t scale beyond tens of thousands of nodes • Fully distributed load balancers • Avoid the bottleneck, but achieve poor load balance • Not adequately agile • Hierarchical load balancers • Careful control of what information goes up and down the hierarchy can lead to fast, high-quality balancers
Load balancing for large machines: II • Interconnection topology starts to matter again • It was hidden due to wormhole routing, etc. • Latency variation is still small, but bandwidth occupancy is a problem • Topology-aware load balancers • Some general heuristics have shown good performance, but may require too much compute power • Special-purpose heuristics work fine when applicable • Still, many open challenges
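One common objective for such balancers is the "hop-bytes" metric: total bytes communicated, weighted by the number of network hops each message travels. A sketch (the mesh coordinates and traffic figures are invented):

```python
def hop_bytes(mapping, traffic, coords):
    """Sum over object pairs of (bytes exchanged) * (mesh hops between
    the processors the two objects are mapped to)."""
    total = 0
    for (a, b), nbytes in traffic.items():
        pa, pb = coords[mapping[a]], coords[mapping[b]]
        total += nbytes * sum(abs(x - y) for x, y in zip(pa, pb))
    return total

coords = {0: (0, 0), 1: (0, 1), 2: (3, 3)}   # 2D mesh positions of procs
traffic = {("u", "v"): 1000}                 # objects u and v talk a lot

near = hop_bytes({"u": 0, "v": 1}, traffic, coords)  # neighbors: 1 hop
far = hop_bytes({"u": 0, "v": 2}, traffic, coords)   # 6 hops apart
# A topology-aware balancer prefers the first mapping: lower hop-bytes
# means less bandwidth occupied on intermediate links.
```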
OpenAtom: Car-Parrinello ab initio MD • NSF ITR 2001-2007, IBM • G. Martyna, M. Tuckerman, L. Kale [Figure panels: molecular clusters, nanowires, 3D solids/liquids, semiconductor surfaces]
Goal: the accurate treatment of complex heterogeneous nanoscale systems (Courtesy: G. Martyna) [Figure: recording-head schematic: read head, bit, write head, soft under-layer, recording medium]
New OpenAtom Collaboration (DOE) • Principal Investigators • G. Martyna (IBM TJ Watson) • M. Tuckerman (NYU) • L.V. Kale (UIUC) • K. Schulten (UIUC) • J. Dongarra (UTK/ORNL) • Current focus • QM/MM via integration with NAMD2 • ORNL Cray XT4 Jaguar (31,328 cores) • A unique parallel decomposition of the Car-Parrinello method • Using Charm++ virtualization, we can efficiently scale small (32-molecule) systems to thousands of processors
Decomposition and Computation Flow
OpenAtom Performance [Figure: performance on BlueGene/L, BlueGene/P, and Cray XT3]
Improvements wrought by network-topology-aware mapping of computational work to processors • The simulation in the left panel maps computational work to processors taking the network connectivity into account; the right panel does not. The “black” (idle) time processors spend waiting for work to arrive is significantly reduced at left. (256 waters, 70R, on 4,096 BG/L cores)
4. Analyze Performance with Sophisticated Tools
Exploit sophisticated performance analysis tools • We use a tool called “Projections” • Many other tools exist • Need for scalable analysis • A recent example: • Trying to identify the next performance obstacle for NAMD • Running on 8,192 processors, with a 92,000-atom simulation • Test scenario: without PME • Time is 3 ms per step, but the lower bound is about 1.6 ms
5. Model and Predict Performance Via Simulation
How to tune performance for a future machine? • For example, Blue Waters will arrive in 2011, but we need to prepare applications for it starting now • Even for extant machines: • The full-size machine may not be available as often as needed for tuning runs • A simulation-based approach is needed • Our approach: BigSim • Full-scale emulation • Trace-driven simulation
BigSim: Emulation • Emulation: • Run an existing, full-scale MPI, AMPI or Charm++ application • Using an emulation layer that pretends to be (say) 100k cores • Leverages Charm++’s object-based virtualization • Generates traces: • Characteristics of SEBs (Sequential Execution Blocks) • Dependences between SEBs and messages
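Such traces might look like the following hypothetical record layout (the real BigSim trace format differs; this only illustrates the idea of SEBs plus the messages that connect them):

```python
from dataclasses import dataclass, field

@dataclass
class SEB:
    """One Sequential Execution Block captured during emulation."""
    duration: float                              # measured time (seconds)
    sends: list = field(default_factory=list)    # (destination SEB id, bytes)

# SEB 0 runs for 1.5 ms and sends a 4 KB message that triggers SEB 1.
trace = {
    0: SEB(duration=1.5e-3, sends=[(1, 4096)]),
    1: SEB(duration=0.8e-3),
}
```

Recording dependences rather than wall-clock timestamps is what lets the simulator later replay the run under a different machine model.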
BigSim: Trace-Driven Simulation • Trace-driven parallel simulation • Typically on tens to hundreds of processors • Multiple-resolution simulation of sequential execution: • From simple scaling to • Cycle-accurate modeling • Multiple-resolution simulation of the network: • From a simple latency/bandwidth model to • A detailed packet- and switching-port-level simulation • Generates timing traces just as a real app on the full-scale machine would • Phase 3: analyze performance • Identify bottlenecks, even without predicting exact performance • Carry out various “what-if” analyses
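At the simple end of that spectrum, a latency/bandwidth (alpha-beta) network model suffices; this sketch replays a critical-path trace under assumed machine parameters (the numbers are placeholders, not figures for any real machine):

```python
def message_time(nbytes, alpha=2e-6, beta=1e-9):
    """Alpha-beta model: fixed latency (s) plus bytes * seconds-per-byte."""
    return alpha + nbytes * beta

def predict_runtime(trace, alpha=2e-6, beta=1e-9):
    """trace: (SEB seconds, bytes sent) pairs along the critical path."""
    t = 0.0
    for seb_seconds, nbytes in trace:
        t += seb_seconds
        if nbytes:
            t += message_time(nbytes, alpha, beta)
    return t

# A 1 ms SEB sending 1,000 bytes, followed by a 2 ms SEB.
runtime = predict_runtime([(1e-3, 1000), (2e-3, 0)])
```

"What-if" analysis then amounts to rerunning the same trace with different alpha and beta (or a finer network model) and comparing predicted runtimes.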
Projections: Performance visualization
BigSim: Challenges • This simple picture hides many complexities • Emulation: • Automatic out-of-core support for apps with large memory footprints • Simulation: • Accuracy vs. cost tradeoffs • Interpolation mechanisms • Memory optimizations • An active area of research • But early versions will be available for application developers for Blue Waters next month