Preparing for Petascale and Beyond
Celso L. Mendes
http://charm.cs.uiuc.edu/people/cmendes
Parallel Programming Laboratory, Department of Computer Science
University of Illinois at Urbana-Champaign
Presentation Outline
• Present status
  • HPC landscape, petascale, exascale
• Parallel Programming Lab
  • Mission and approach
  • Programming methodology
  • Scalability results for S&E applications
  • Other extensions and opportunities
  • Some ongoing research directions
• Happening at Illinois
  • Blue Waters, NCSA/IACAT
  • Intel/Microsoft, NVIDIA, HP/Intel/Yahoo!, …
Current HPC Landscape (source: top500.org)
• The petascale era has started!
• Roadrunner@LANL (#1 in the Top500):
  • Linpack: 1.026 Pflops, Peak: 1.375 Pflops
• Current trends:
  • Heterogeneous systems starting to spread (Cell, GPUs, …)
  • Multicore processors widely used
Current HPC Landscape (cont.)
• Processor counts:
  • #1 Roadrunner@LANL: 122K cores
  • #2 BG/L@LLNL: 212K cores
  • #3 BG/P@ANL: 163K cores
• Exascale: sooner than we imagine…
  • U.S. Department of Energy town hall meetings in 2007: LBNL (April), ORNL (May), ANL (August)
  • Goals: discuss exascale possibilities and how to accelerate them
  • Sections: Climate, Energy, Biology, Socioeconomic Modeling, Astrophysics, Math & Algorithms, Software, Hardware
  • Report: http://www.er.doe.gov/ASCR/ProgramDocuments/TownHall.pdf
Current HPC Landscape (cont.)
• Current reality:
  • Steady increase in processor counts
  • Systems becoming multicore or heterogeneous
  • "Memory wall" effects worsening
  • MPI programming model still dominant
• Challenges (now and into the foreseeable future):
  • How to exploit the power of the new systems
    • Capacity vs. capability: different problems
    • Capacity is a concern for system managers
    • Capability is a concern for users
  • How to program in parallel effectively
    • Both multicore (desktop) and million-core (supercomputers)
Parallel Programming Lab
Parallel Programming Lab (PPL)
• http://charm.cs.uiuc.edu
• One of the largest research groups at Illinois
• Currently:
  • 1 faculty member, 3 research scientists, 4 research programmers
  • 13 graduate students, 1 undergraduate student
  • Open positions
(Photo: PPL group, April 2008)
PPL Mission and Approach
• Enhance performance and productivity in programming complex parallel applications
  • Performance: scalable to thousands of processors
  • Productivity: of human programmers
  • Complex: irregular structure, dynamic variations
• Application-oriented yet CS-centered research
  • Develop enabling technology for a wide collection of applications
  • Develop, use, and test it in the context of real applications
  • Embody it in easy-to-use abstractions
• Implementation: Charm++
  • Object-oriented runtime infrastructure
  • Freely available for non-commercial use
Application-Oriented Parallel Abstractions
• Synergy between Computer Science research and applications has been beneficial to both
(Diagram: Charm++ techniques & libraries linked to applications such as NAMD, LeanCP, ChaNGa, rocket simulation, space-time meshing, and others; application issues feed back into Charm++ development)
Programming Methodology
Methodology: Migratable Objects
• Programmer: [over]decomposition into objects ("virtual processors", VPs)
• Runtime: assigns VPs to real processors dynamically, during execution
  • Enables adaptive runtime strategies
  • Implementations: Charm++, AMPI
(Diagram: user view of many communicating objects vs. their system implementation on physical processors)
• Benefits of virtualization:
  • Software engineering: the number of virtual processors can be controlled independently; separate VP sets for different modules in an application
  • Message-driven execution: adaptive overlap of computation and communication
  • Dynamic mapping: heterogeneous clusters (vacate, adjust to speed, share), automatic checkpointing, changing the set of processors used
  • Automatic dynamic load balancing
  • Communication optimization
(A minimal Charm++ over-decomposition sketch follows.)
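To make the over-decomposition idea concrete, here is a minimal sketch of what a Charm++ program with many more objects than processors can look like. It is illustrative, not code from the talk: the module, chare, and method names (overdecomp, Main, Worker, compute) are hypothetical, and termination logic is omitted.

```cpp
// Minimal Charm++ over-decomposition sketch (hypothetical names, termination omitted).
// Matching interface (.ci) file, shown here as a comment:
//   mainmodule overdecomp {
//     mainchare Main    { entry Main(CkArgMsg* m); };
//     array [1D] Worker { entry Worker(); entry void compute(); };
//   };
#include "overdecomp.decl.h"

class Main : public CBase_Main {
public:
  Main(CkArgMsg* m) {
    // Over-decompose: create several times more objects (virtual processors)
    // than physical PEs; the runtime maps them and may migrate them later.
    int numObjects = 8 * CkNumPes();
    CProxy_Worker workers = CProxy_Worker::ckNew(numObjects);
    workers.compute();   // broadcast an asynchronous entry-method invocation
    delete m;
  }
};

class Worker : public CBase_Worker {
public:
  Worker() {}
  void compute() {
    CkPrintf("Object %d running on PE %d\n", thisIndex, CkMyPe());
  }
};

#include "overdecomp.def.h"
```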
Adaptive MPI (AMPI): MPI + Virtualization
• Each MPI "process" is implemented as a user-level migratable thread embedded in a Charm++ object
• Must properly handle globals and statics (analogous to what is needed in OpenMP)
• But thread context switching is much faster than with other techniques
(Diagram: many virtual MPI "processes" mapped onto a smaller set of real processors)
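For AMPI, the point is that ordinary MPI code can run with many more virtual ranks than physical processors. The sketch below is an illustration: the program itself is plain MPI, and the build/launch commands in the comments (ampicxx, charmrun, the +vp option) reflect typical AMPI usage, whose exact form may vary between versions.

```cpp
// Plain MPI code; under AMPI each "rank" becomes a user-level migratable thread.
// Typical (version-dependent) AMPI usage, shown here as an assumption:
//   ampicxx -o hello hello.cpp
//   ./charmrun ./hello +p4 +vp64    // 64 virtual ranks on 4 physical processors
#include <mpi.h>
#include <cstdio>

int main(int argc, char** argv) {
  MPI_Init(&argc, &argv);
  int rank, size;
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  MPI_Comm_size(MPI_COMM_WORLD, &size);  // reports the number of virtual ranks
  // Note: avoid mutable globals/statics, since virtual ranks are threads that
  // would share them (the same care OpenMP programs need).
  std::printf("virtual rank %d of %d\n", rank, size);
  MPI_Finalize();
  return 0;
}
```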
Parallel Decomposition and Processors
• MPI style:
  • Encourages decomposition into P pieces, where P is the number of physical processors available
  • If the natural decomposition is a cube, then the number of processors must be a cube
  • Overlap of computation and communication is the user's responsibility
• Charm++/AMPI style: "virtual processors"
  • Decompose into the natural objects of the application
  • Let the runtime map them to physical processors
  • Decouple decomposition from load balancing
Decomposition Independent of the Number of Cores
• Rocket simulation example under traditional MPI vs. the Charm++/AMPI framework
• Benefits: load balance, communication optimizations, modularity
(Diagram: under MPI, P solid and P fluid pieces, one pair per processor; under Charm++/AMPI, n solid and m fluid objects, independent of P)
Dynamic Load Balancing
• Based on the principle of persistence
  • Computational loads and communication patterns tend to persist, even in dynamic computations
  • The recent past is a good predictor of the near future
• Implementation in Charm++:
  • Computational entities (nodes, structured-grid points, particles, …) are partitioned into objects
  • The load of each object can be measured during execution
  • Objects are migrated across processors to balance the load
    • A much smaller problem than repartitioning the entire dataset
  • Several policies are available for load-balancing decisions (see the sketch below)
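The sketch below shows, with hypothetical names and some version-dependent details elided (e.g., base-class pup calls), how a Charm++ array element typically hooks into measurement-based load balancing: it sets usesAtSync, calls AtSync() at iteration boundaries, and provides a pup routine so the runtime can migrate it. A policy is then selected at launch time, for example with the +balancer GreedyLB runtime option.

```cpp
// Sketch of a migratable, load-balanced Charm++ array element (hypothetical names).
// Assumes a matching .ci declaration: array [1D] Block { entry Block(); entry void iterate(); };
#include "block.decl.h"
#include "pup_stl.h"
#include <vector>

class Block : public CBase_Block {
  std::vector<double> data;   // per-object state that travels with the object
  int step;
public:
  Block() : data(100000, 0.0), step(0) {
    usesAtSync = true;                   // participate in AtSync-based load balancing
  }
  Block(CkMigrateMessage* m) {}          // migration constructor

  void pup(PUP::er& p) {                 // serialize state so the object can move
    p | data;
    p | step;
  }

  void iterate() {
    // ... one timestep of real work on 'data' ...
    if (++step % 20 == 0) AtSync();      // periodically let the runtime rebalance
    else thisProxy[thisIndex].iterate(); // otherwise continue immediately
  }

  void ResumeFromSync() {                // invoked after load balancing completes
    thisProxy[thisIndex].iterate();
  }
};

#include "block.def.h"
```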
Typical Load-Balancing Phases
(Timeline: regular timesteps, then instrumented timesteps, a detailed and aggressive load-balancing step, and later periodic refinement load balancing)
Examples of Science & Engineering Charm++ Applications
NAMD: A Production MD Program
• Fully featured program
• NIH-funded development
• Distributed free of charge (~20,000 registered users)
  • Binaries and source code
• Installed at NSF centers
  • ~20% of cycles (NCSA, PSC)
• User training and support
• Large published simulations
  • Gordon Bell award in 2002
• URL: www.ks.uiuc.edu/Research/namd
Spatial Decomposition via Charm++
• Atoms are distributed to cubes (cells, or "patches") based on their location
• Size of each cube: just a bit larger than the cut-off radius
  • Communicate only with neighbors
  • Work: for each pair of neighboring objects
  • Computation-to-communication ratio: O(1)
• However:
  • Load imbalance
  • Limited parallelism
  • Charm++ is useful to handle this
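The cut-off idea can be illustrated with a generic sketch (not NAMD source): if the cell size is at least the cut-off radius, every interaction partner of an atom lies in the atom's own cell or in one of its 26 neighbors, so each cell only communicates with its neighbors.

```cpp
// Generic sketch of cut-off based spatial decomposition (illustrative, not NAMD code).
#include <array>
#include <cmath>
#include <cstdlib>

struct Vec3 { double x, y, z; };

// Assign an atom to a cell; cellSize is chosen to be >= the cut-off radius.
std::array<int, 3> cellOf(const Vec3& pos, double cellSize) {
  return { static_cast<int>(std::floor(pos.x / cellSize)),
           static_cast<int>(std::floor(pos.y / cellSize)),
           static_cast<int>(std::floor(pos.z / cellSize)) };
}

// Two cells exchange atom coordinates only if they are at most 1 apart in each dimension.
bool areNeighbors(const std::array<int, 3>& a, const std::array<int, 3>& b) {
  return std::abs(a[0] - b[0]) <= 1 &&
         std::abs(a[1] - b[1]) <= 1 &&
         std::abs(a[2] - b[2]) <= 1;
}
```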
Object-Based Parallelization for MD: Force Decomposition + Spatial Decomposition
• Now we have many objects over which to balance load:
  • Each "diamond" (pairwise force object) can be assigned to any processor
  • Number of diamonds (3D): 14 × number of patches (each patch has 26 neighbors, each neighboring pair is shared by two patches, giving 13 pair objects per patch, plus one self-interaction object)
• 2-away variation:
  • Half-size cubes, 5×5×5 interactions
• 3-away variation: 7×7×7 interactions
• Prototype NAMD versions created for Cell and GPUs
Performance of NAMD: STMV
(Plot: time per step, in ms, versus number of cores for the STMV benchmark, ~1 million atoms)
ChaNGa: Cosmological Simulations
• Collaborative project (NSF ITR) with Prof. Tom Quinn, University of Washington
• Components: gravity (done), gas dynamics (almost)
• Barnes-Hut tree code (opening criterion sketched below)
  • Particles are represented hierarchically in a tree according to their spatial position
  • "Pieces" of the tree are distributed across processors
• Gravity computation:
  • "Nearby" particles: computed precisely
  • "Distant" particles: approximated by the remote node's center of mass
  • Software caching mechanism, critical for performance
• Multi-timestepping: update frequently only the fastest particles (see Jetley et al., IPDPS 2008)
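The nearby/distant split is the classic Barnes-Hut acceptance test. The sketch below is a textbook version (not ChaNGa source): a tree node that subtends a small enough angle from the particle is treated as a single mass at its center of mass; otherwise it is opened and its children are visited.

```cpp
// Generic Barnes-Hut opening criterion (textbook sketch, not ChaNGa code).
#include <cmath>

struct TreeNode {
  double comX, comY, comZ;   // center of mass of the particles in this node
  double size;               // linear extent of the node's bounding box
};

// Accept the node as a single distant mass if size / distance < theta,
// where theta is the opening-angle parameter (e.g. 0.7).
bool farEnough(const TreeNode& n, double px, double py, double pz, double theta) {
  double dx = n.comX - px, dy = n.comY - py, dz = n.comZ - pz;
  double dist = std::sqrt(dx * dx + dy * dy + dz * dz);
  return n.size < theta * dist;   // true: use the center-of-mass approximation
                                  // false: open the node and recurse on its children
}
```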
ChaNGa Performance
(Plot: scaling results obtained on BlueGene/L, with no multi-timestepping and simple load balancers)
Other Opportunities
MPI Extensions in AMPI
• Automatic load balancing
  • MPI_Migrate(): collective operation, possible migration point
• Asynchronous collective operations
  • e.g. MPI_Ialltoall()
  • Post the operation, test/wait for completion, and do useful work in between
• Checkpointing support
  • MPI_Checkpoint(): checkpoint to disk
  • MPI_MemCheckpoint(): checkpoint in memory, with remote redundancy
(A usage sketch follows below.)
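A rough sketch of how these extensions are used inside a timestep loop follows. The calls are the ones named on the slide, but their exact signatures vary across AMPI (and MPI) versions, so treat the argument lists below as assumptions.

```cpp
// Hypothetical usage pattern for the AMPI extensions named above.
// Signatures are illustrative; consult the AMPI manual for the exact interfaces.
#include <mpi.h>
#include <vector>

void timestep_loop(int nsteps, MPI_Comm comm) {
  int size;
  MPI_Comm_size(comm, &size);
  std::vector<double> sendbuf(size), recvbuf(size);

  for (int step = 0; step < nsteps; ++step) {
    // Asynchronous collective: post it, overlap independent work, then wait.
    MPI_Request req;
    MPI_Ialltoall(sendbuf.data(), 1, MPI_DOUBLE,
                  recvbuf.data(), 1, MPI_DOUBLE, comm, &req);
    // ... computation that does not depend on recvbuf ...
    MPI_Wait(&req, MPI_STATUS_IGNORE);

    if (step % 50 == 0)
      MPI_Migrate();        // AMPI extension: collective migration point for load balancing

    if (step % 500 == 0)
      MPI_MemCheckpoint();  // AMPI extension: in-memory checkpoint with remote redundancy
  }
}
```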
Performance Tuning for Future Machines
• Example: Blue Waters will arrive in 2011
  • But we need to start preparing applications for it now
• Even for extant machines:
  • The full-size machine may not be available as often as needed for tuning runs
• A simulation-based approach is needed
• Our approach: BigSim
  • Based on the Charm++ virtualization approach
  • Full-scale program emulation
  • Trace-driven simulation
  • History: developed for BlueGene predictions
BigSim Simulation System
• General system organization
• Emulation:
  • Run an existing, full-scale MPI, AMPI, or Charm++ application
  • Use an emulation layer that pretends to be (say) 100K cores
  • Target cores are emulated as Charm++ virtual processors
• Resulting traces (a.k.a. logs):
  • Characteristics of SEBs (Sequential Execution Blocks)
  • Dependences between SEBs and messages
BigSim Simulation System (cont.)
• Trace-driven parallel simulation
  • Typically runs on tens to hundreds of processors
  • Multiple-resolution simulation of sequential execution: from a simple scaling factor to cycle-accurate modeling
  • Multiple-resolution simulation of the network: from a simple latency/bandwidth model to detailed packet- and switching-port-level modeling
  • Generates timing traces just as a real application would on the full-scale machine
• Phase 3: analyze performance
  • Identify bottlenecks, even without predicting exact performance
  • Carry out various "what-if" analyses
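At the simple end of the network-modeling range, a predicted message time is just latency plus size over bandwidth; a minimal illustration (not BigSim code) is:

```cpp
// Simplest latency/bandwidth network model (illustrative only, not BigSim code).
double predictMessageTime(double bytes, double latencySec, double bandwidthBytesPerSec) {
  return latencySec + bytes / bandwidthBytesPerSec;
}
```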
Projections: Performance Visualization
BigSim Validation: BG/L Predictions
Some Ongoing Research Directions
Load Balancing for Large Machines: I
• Centralized balancers achieve the best balance
  • Collect the object-communication graph on one processor
  • But they won't scale beyond tens of thousands of nodes
• Fully distributed load balancers
  • Avoid the bottleneck, but achieve poor load balance
  • Not adequately agile
• Hierarchical load balancers
  • Careful control of what information goes up and down the hierarchy can lead to fast, high-quality balancers
Load Balancing for Large Machines: II
• Interconnection topology starts to matter again
  • It was hidden by wormhole routing, etc.
  • Latency variation is still small…
  • But bandwidth occupancy (link contention) is a problem
• Topology-aware load balancers
  • Some general heuristics have shown good performance, but may require too much compute power
  • Special-purpose heuristics work fine when applicable
  • Preliminary results: see Bhatele & Kale's paper, LSPP@IPDPS 2008
• Still, many open challenges
Major Challenges in Applications
• NAMD:
  • Scalable PME (long-range forces): 3D FFT
• Specialized balancers for multi-resolution cases
  • Example: ChaNGa running highly clustered cosmological datasets with multi-timestepping
(Processor-activity timelines, black = processor activity: (a) single-stepping, (b) multi-timestepping, (c) multi-timestepping + special load balancing)
BigSim: Challenges
• BigSim's simple diagram hides many complexities
• Emulation:
  • Automatic out-of-core support for applications with large memory footprints
• Simulation:
  • Accuracy vs. cost tradeoffs
  • Interpolation mechanisms for predicting serial performance
  • Memory-management optimizations
  • I/O optimizations for handling (many) large trace files
• Performance analysis:
  • Needs scalable tools
  • Active area of research
Fault Tolerance
• Automatic checkpointing
  • Migrate objects to disk
  • In-memory checkpointing as an option
  • Both schemes are available in Charm++
• Proactive fault handling
  • Migrate objects to other processors upon detecting an imminent fault
  • Adjust processor-level parallel data structures
  • Rebalance load after migrations
  • HiPC'07 paper: Chakravorty et al.
• Scalable fault tolerance
  • When one processor out of 100,000 fails, the other 99,999 shouldn't have to roll back to their checkpoints!
  • Sender-side message logging
  • Restart can be sped up by spreading out the objects from the failed processor
  • IPDPS'07 paper: Chakravorty & Kale
  • Ongoing effort to minimize logging-protocol overheads
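As a rough sketch of how an application can trigger the two checkpointing schemes in Charm++ (the proxy, entry-method, and directory names here are hypothetical, and the exact call signatures may differ between Charm++ versions):

```cpp
// Sketch: triggering disk and in-memory checkpoints from a main chare (hypothetical names).
void Main::maybeCheckpoint(int step) {
  // Callback that resumes the computation once the checkpoint completes.
  CkCallback cb(CkIndex_Main::resume(), mainProxy);

  if (step % 1000 == 0)
    CkStartCheckpoint("ckpt_dir", cb);   // checkpoint object state to disk
  else if (step % 100 == 0)
    CkStartMemCheckpoint(cb);            // in-memory checkpoint with buddy redundancy
}
```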
Higher Level Languages & Interoperability
HPC at Illinois
HPC at Illinois
• Many other exciting developments
  • Microsoft/Intel parallel computing research center
  • Parallel programming classes
    • CS-420: Parallel Programming for Science and Engineering
    • ECE-498: NVIDIA/ECE collaboration
  • HP/Intel/Yahoo! institute
  • NCSA's Blue Waters system, approved for 2011
    • see http://www.ncsa.uiuc.edu/BlueWaters/
  • New NCSA/IACAT institute
    • see http://www.iacat.uiuc.edu/
Microsoft/Intel UPCRC
• Universal Parallel Computing Research Center
• 5-year funding, 2 centers: Univ. of Illinois and Univ. of California, Berkeley
• Joint effort by Intel and Microsoft: $2M/year
• Mission:
  • Conduct research to make parallel programming broadly accessible and "easy"
• Focus areas:
  • Programming, translation, execution, applications
• URL: http://www.upcrc.illinois.edu/
Parallel Programming Classes
• CS-420: Parallel Programming
  • Introduction to fundamental issues in parallelism
  • Students from both CS and other engineering areas
  • Offered every semester, by CS Profs. Kale or Padua
• ECE-498: Programming Massively Parallel Processors
  • Focus on GPU programming techniques
  • Taught by ECE Prof. Wen-Mei Hwu and NVIDIA's Chief Scientist David Kirk
  • URL: http://courses.ece.uiuc.edu/ece498/al1
HP/Intel/Yahoo! Initiative
• Cloud Computing Testbed (worldwide)
• Goal:
  • Study Internet-scale systems, focusing on data-intensive applications using distributed computational resources
• Areas of study:
  • Networking, OS, virtual machines, distributed systems, data mining, Web search, network measurement, and multimedia
• Illinois/CS testbed site:
  • 1,024-core HP system with 200 TB of disk space
  • External access via an upcoming proposal-selection process
• URL: http://www.hp.com/hpinfo/newsroom/press/2008/080729xa.html
Our Sponsors
PPL Funding Sources
• National Science Foundation
  • BigSim, cosmology, languages
• Department of Energy
  • Charm++ (load balancing, fault tolerance), quantum chemistry
• National Institutes of Health
  • NAMD
• NCSA/NSF, NCSA/IACAT
  • Blue Waters project, applications
• Department of Energy / UIUC Rocket Center
  • AMPI, applications
• NASA
  • Cosmology/visualization
Thank you! (Obrigado!)