IEEE DS-RT, Singapore, Oct 26, 2009
Switching to High Gear: Opportunities for Grand-scale Real-time Parallel Simulations
Kalyan S. Perumalla, Ph.D.
Senior Research Staff Member, Oak Ridge National Laboratory
Adjunct Professor, Georgia Institute of Technology
Main Theme
• Computational Power… unprecedented potential… exploit
• Simulation Scale… stretch imagination… new scopes
• "Think Big… Really Big"
Confluence of Opportunities, Needs
(diagram from the original slide; not recoverable from the extracted text)
Parallel Computing Power: It's Coming
High-end computing… coming soon to a center near you!
Access to 1000s of cores… for every parallel simulation researcher… in just 2–3 years from now
Switching Gears

Gear | Decade | Processors
-----|--------|------------
1    | 1980   | 10^1
2    | 1990   | 10^2
3    | 2000   | 10^3
4    | 2010   | 10^4
5    | 2010   | 10^5–10^6
R    | 2020   | —
Potential Areas for Discrete Event Execution at 10^5–10^6 Scale
• Cyber infrastructure simulations: Internet protocols, peer-to-peer designs, …
• Epidemiological simulations: disease spread models, mitigation strategies, …
• Social dynamics simulations: pre- and post-operations campaigns, foreign policy, …
• Vehicular mobility simulations: regional- or nation-scale, …
• Agent-based simulations: behavioral exploration, complex compositions, …
• Sensor network simulations: wide-area monitoring, situational awareness, …
• Organization simulations: command and control, business processes, …
• Logistics simulations: supply chain processes, contingency analyses, …
Initial models scaling to 10^3–10^4 cores
If Only We Look Harder…
• Many nation-scale and world-scale questions are becoming relevant
• New methods and methodologies are waiting to be discovered
Slippery Slopes
(figure: a slope from high-level Abstractions down to Gory Detail — the starting point for an experimental study tends to slide toward ever greater accuracy and detail as needs evolve)
How do we abstract immense complexity?
Answer: it is very difficult until we experiment with the system at scale
What Do We Mean by Gory Detail? A Cyber Security Example
• Network at large: topologies, bandwidths, latencies, link types, MAC protocols, TCP/IP, BGP, …
• Core systems: routers, databases, service level agreements, inter-AS relationships, …
• End systems: processor traits, disk traits, OS instances, daemons, services, S/W bugs, …
• "Heavy" applications and traffic: video (YouTube, …), VoIP, live streams; foreground, background
• Behavioral infusion: social nets (topologies, dynamics, agencies, advertisers), peer-to-peer
Example: Epidemiology or Computer Worm Propagation
• Typical dynamics model: multiple variants exist, but qualitatively similar
• Excellent fit to plots of collected data, but post facto (!)
• Difficult to use as a predictive model: a great amount of detail is buried in α
• Gory detail needed for better predictive power: interaction topology, resource limitations
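The slide leaves the "typical dynamics model" implicit. A common logistic form, given here as an assumed reconstruction rather than the speaker's exact equation, shows how all the interaction detail collapses into the single parameter α:

```latex
% Assumed logistic form of epidemic/worm growth; I(t) = infected count,
% N = susceptible population size, \alpha = effective infection rate.
\frac{dI}{dt} = \alpha\, I\,(N - I)
\qquad\Longrightarrow\qquad
I(t) = \frac{N}{1 + \left(N/I_0 - 1\right) e^{-\alpha N t}}
```

The closed form fits observed sigmoid growth curves well after the fact; the difficulty is that the interaction topology and resource limits — the "gory detail" — are precisely what determine α.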
Slippery Slope: Cost and Time
• Cost to realize experimentation capability
• Time to reach experimentation capability
Our Research Organization in Discrete Event Runtimes and Applications
• Applications: evacuation decision support, automated detection/tracking, design & analysis of communication effects, …
  – Customization, scenario generation, experimentation, visualization
• Domain simulations: transportation network simulations, sensor network simulations, …
  – Core models, feasibility demonstration, extensible frameworks, novel modeling methods
  – Trade-offs: memory vs. computation, speed vs. accuracy
• Model families: vehicular, communication network, logistics, enterprise, social network, and asynchronous scientific simulations, …
• Parallel/distributed discrete event simulation engines — the "enabling" layer
  – Scalability, efficiency, correctness, robustness, usability, extensibility, integration
  – Model execution, synchronization, data integration, interoperability, multi-scale, …
• Platforms: supercomputers, clusters, multi-cores, GPGPUs, PDAs, …
A Few of Our Current Areas, Projects
• State-level mobility: multi-million intersections and links — GARFIELD-EVAC: 10^6–10^7-link scenarios of FL, LA, …
• Epidemiological analyses: detailed, billion-entity dynamics — RCREDIF: 10^9-individual infection scenarios
• Wireless radio signal estimation: multi-million-cell cluttered terrains — RCTLM: 3-D, 10^7 cells simulated on 10^4 cores
• Supercomputer design: designing next architectures by simulating on current ones — μπ: performance prediction of 10^6-core MPI programs on 10^4 cores
• Internet security, protocol design: as-is instantiation of nodes and routers — NetWarp: hi-fi Internet test-bed
• Populace's cognitive behaviors: large-population cognition with connectionist networks
Scalable Experimentation for Cyber Security
NetWarp is our novel test-bed technology for highly scalable, detailed, rapid experimentation on cyber security and cyber infrastructures.
Real-Time or Faster Cyber Experimentation Approaches
(chart: fidelity vs. scalability, 10^2 to 10^8 — hardware testbeds and emulation systems at high fidelity but low scale; parallel, sequential, and mixed-abstraction packet-level simulation in the middle; aggregate models at low fidelity but high scale; NetWarp positioned as a fully virtualized system combining real-time-or-faster execution with high scalability)
NetWarp Architecture
(architecture diagram)
DOE-Sponsored Institute for Advanced Architectures and Algorithms
Need: highly scalable simulation methods and methodologies to simulate next-generation architectures and algorithms on future supercomputing platforms…
"…catalyst for the co-design and development of architectures, algorithms, and applications to create synergy in their respective evolutions…"
μπ (MUPI) Performance Investigation System
• μπ = micro parallel performance investigator
• Performance prediction for MPI, Portals, and other parallel applications
  – Actual application code executed on the real hardware
  – Platform is simulated at large virtual scale
  – Timing customized by a user-defined machine model
• Scale is the key differentiator
  – Target: 150,000 virtual cores, e.g., 150,000 virtual MPI ranks in a simulated scenario
• Based on µsik (micro simulator kernel)
  – Scalable PDES engine
  – TCP- or MPI-connected simulation kernels
Example: MPI Application over μπ
• Modify the MPI include and recompile: change #include <mpi.h> to #include <mupi.h>
• Relink to the mupi library: instead of -lmpi, use -lmupi
• Run the modified MPI application (now a μπ simulation):
  mpirun -np 4 test -nvp 32
  runs test with 32 virtual MPI ranks; the simulation uses 4 real cores
• μπ itself uses multiple real cores to run in parallel
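To make the recipe concrete, here is a minimal, generic MPI program (my own sketch, not part of the μπ distribution); per the slide, the only source change needed is the header swap marked in the comment:

```c
/* hello.c -- a minimal MPI program, unchanged except for the header swap.
 * To run under mupi, replace <mpi.h> with <mupi.h> and link with -lmupi
 * instead of -lmpi, as the slide describes. */
#include <mpi.h>   /* under mupi: #include <mupi.h> */
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);  /* virtual rank under mupi */
    MPI_Comm_size(MPI_COMM_WORLD, &size);  /* virtual core count under mupi */
    printf("rank %d of %d\n", rank, size);
    MPI_Finalize();
    return 0;
}
```

Per the slide's description, running this under mpirun -np 4 … -nvp 32 would report 32 ranks (the virtual scale) while only 4 physical cores execute the simulation.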
Epidemic Disease Propagation
• Can be an extremely challenging simulation problem
• Asymptotic behaviors are relatively well understood
• Transients are poorly understood, hard to predict well
• Defined and characterized by many interlinked processes
• "Gory detail" necessary
Epidemic Disease Propagation
• Reaction-diffusion processes
  – Infection probability based on interaction times, vulnerabilities, thresholds (a standard form is sketched below)
  – Short- and long-distance mobility, sojourn times
  – Probabilistic state transitions, infections, recoveries
• A Supercomputing'08 model reported scalability only to 400 cores
  – Synchronization costs become prohibitive; synchronous execution is our prime suspect
• Our discrete event execution relieves synchronization costs
  – Scales to tens of thousands of cores
  – Up to 1 billion affected entities
(Image from psc.edu)
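The "probability based on interaction times" bullet is commonly realized with exponentially accumulating risk over a contact; this standard form from the agent-based epidemiology literature is shown as an illustrative assumption, not necessarily the exact model used here:

```latex
% Illustrative contact-infection probability (standard form, assumed):
% t = contact duration, s = susceptibility of the contactee,
% \rho = infectivity of the infected contact, \lambda = base transmission rate.
P(\text{infection} \mid \text{contact}) \;=\; 1 - e^{-\lambda\, s\, \rho\, t}
```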
PDES Scaling Needs
• Anticipate impending opportunities in multiple application areas of grand-scale PDES scenarios
• Prepare to capitalize on increasing computational power (300K+ cores)
• Aim to achieve the computational capability to enable new PDES-based scientific solutions
Jaguar: NCCS' Cray XT5*
* Data and images from http://nccs.gov
Technological Upgrade: 10^5-Scalable PDES Frameworks
To realize scale with any of the PDES models and applications, we need the core frameworks themselves to scale
Recent Attempts at 10^5-Core PDES Frameworks
• Bauer et al. (Jun '09) on Blue Gene/P (Argonne)
• Perumalla & Tipparaju (Jan '09) on Cray XT5 (ORNL)
• Degradation beyond 64K cores observed by us as well as others
• Degradation observed in more than one metric (rollback efficiency, speedup)
Implications for Discrete Event Execution on High-Performance Computing Platforms
Some of Our Objectives
• Scale from 10^4 cores (current) to 10^5–10^6 cores (new)
• Realize very large-scale scenarios (multi-billion entity): cyber infrastructures, social computing, epidemiology, logistics
• Aid projects in simulation-based design of future-generation supercomputers
• Fill the technological gap by achieving the highest scaling capabilities of parallel discrete event simulations
• Ultimately, enable formulation of grand-scale solutions with non-traditional supercomputing simulations
Electromagnetic (EM) Wave Propagation
• Predict receiver signal
• Account for reflectivity, transmissivity, multi-path effects
• Power level (voltage) modeled per face of grid cell (see the sketch below)
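Since RCTLM (mentioned earlier) suggests a transmission-line-matrix approach, here is a minimal sketch of what "voltage modeled per face of grid cell" can look like, assuming the textbook 2-D shunt-node TLM formulation; this is an illustrative assumption, not the actual RCTLM implementation:

```cpp
// Minimal 2-D TLM-style scatter/connect step -- textbook shunt-node update,
// offered as an assumed sketch, NOT the actual RCTLM code.
#include <array>
#include <vector>

struct Cell { std::array<double,4> v{}; };  // incident voltages: N,E,S,W faces

// One time step: scatter within each cell, then connect reflected pulses to
// neighboring faces (boundaries act as short circuits, reflecting with -1).
// The caller swaps `cur` and `next` between steps.
void tlm_step(const std::vector<Cell>& cur, std::vector<Cell>& next,
              int nx, int ny)
{
    auto at = [nx](int x, int y) { return y * nx + x; };
    for (int y = 0; y < ny; ++y)
        for (int x = 0; x < nx; ++x) {
            const auto& v = cur[at(x, y)].v;
            double s = 0.5 * (v[0] + v[1] + v[2] + v[3]);
            // Shunt-node scattering: reflected_i = (sum/2) - incident_i.
            std::array<double,4> r = { s - v[0], s - v[1], s - v[2], s - v[3] };
            if (y > 0)    next[at(x, y-1)].v[2] = r[0]; else next[at(x, y)].v[0] = -r[0];
            if (x < nx-1) next[at(x+1, y)].v[3] = r[1]; else next[at(x, y)].v[1] = -r[1];
            if (y < ny-1) next[at(x, y+1)].v[0] = r[2]; else next[at(x, y)].v[2] = -r[2];
            if (x > 0)    next[at(x-1, y)].v[1] = r[3]; else next[at(x, y)].v[3] = -r[3];
        }
}
```

Modeling per-face voltages this way is what lets reflectivity, transmissivity, and multi-path effects emerge from local cell updates rather than global ray tracing.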
PHOLD Benchmark
• Relatively fine grained: ~5 microseconds of computation per event
• 10 "juggler" entities per processor core (analogous to grid cells, road intersections, or such)
• Total of 1,000 "juggling balls" per core (analogous to state updates exchanged among cells)
• Upon receipt of a ball event, a juggler throws it back a random (exponentially distributed) time into the future, to a random juggler
• 1 in every 1,000 juggling exchanges is constrained to be intra-core; the rest are inter-core
A simplified sketch of this event loop follows.
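The sketch below is a simplified, single-process rendition of the juggler/ball pattern (names and parameters are my own; a real PHOLD run distributes jugglers across cores and routes 999 of every 1,000 ball events off-core):

```cpp
// Simplified single-process PHOLD sketch -- illustrative only.
#include <cstdio>
#include <queue>
#include <random>
#include <vector>

struct Event { double ts; int juggler; };
struct Later { bool operator()(const Event& a, const Event& b) const { return a.ts > b.ts; } };

int main()
{
    const int    kJugglers = 10;     // entities per core (per the slide)
    const int    kBalls    = 1000;   // initial ball events per core
    const double kEndTime  = 100.0;  // assumed end of simulated time

    std::mt19937 rng(42);
    std::exponential_distribution<double> delay(1.0);  // assumed mean of 1.0
    std::uniform_int_distribution<int> pick(0, kJugglers - 1);

    std::priority_queue<Event, std::vector<Event>, Later> fel;  // future event list
    for (int b = 0; b < kBalls; ++b)
        fel.push({delay(rng), pick(rng)});

    long processed = 0;
    while (!fel.empty() && fel.top().ts < kEndTime) {
        Event e = fel.top(); fel.pop();
        ++processed;
        // A real PHOLD event burns ~5 us of computation here, and with
        // probability 999/1000 the ball goes to a juggler on ANOTHER core;
        // this sequential sketch keeps everything local.
        fel.push({e.ts + delay(rng), pick(rng)});
    }
    std::printf("processed %ld events\n", processed);
    return 0;
}
```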
Scalability – Observations
• Scalability problems with current approaches were not evident previously: fine up to 10^4 cores, but poor thereafter
• Even with discrete event, implementation is key
• Semi-asynchronous execution scales poorly; fully asynchronous execution is needed
Algorithm Design and Development for Scalable Discrete Event Execution
• Design algorithms optimized for Cray XT5 and Blue Gene P/Q
• Design a new virtual-time synchronization algorithm
• Design novel rollback control schemes
• Design discrete event-specific flow control
(figure: current synchronization algorithm)
Additional Important Algorithmic Aspects
• Novel separation of event communication from synchronization
  – Prioritization support in our communication layer
  – "QoS" support for fast synchronization
• Novel timestamp-aware buffering (see the sketch after this list)
  – Exploit near vs. far timestamps
  – Coordinated with virtual-time synchronization
• Efficient flow control
  – Highly unstructured inter-processor communication
• Optimized rollback dynamics
  – Stability and throttling mechanisms
  – Cancel-back protocols, e.g., for the "transient event" problem
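To make the timestamp-aware buffering idea concrete, here is an assumed sketch: events with near timestamps are sent immediately because the receiver may need them to advance virtual time, while far-future events are batched to amortize communication cost. The class name, threshold, and flush policy are illustrative assumptions, not the engine's actual design:

```cpp
// Illustrative timestamp-aware send buffering (assumed design).
#include <cstddef>
#include <vector>

struct Msg { double ts; int dest; /* payload omitted */ };

class TsAwareBuffer {
public:
    TsAwareBuffer(double horizon, std::size_t batch)
        : horizon_(horizon), batch_(batch) {}

    // Near-timestamp events are urgent: the receiver may need them to
    // advance virtual time, so bypass buffering and send immediately.
    void send(const Msg& m, double now) {
        if (m.ts - now < horizon_) { transmit(m); return; }
        far_.push_back(m);                 // far-future: safe to batch
        if (far_.size() >= batch_) flush();
    }

    // Flush before each virtual-time synchronization round so buffered
    // events cannot stall global progress.
    void flush() {
        for (const Msg& m : far_) transmit(m);
        far_.clear();
    }

private:
    void transmit(const Msg&) { /* hand off to the MPI/Portals layer */ }
    double horizon_;        // "near" cutoff in virtual-time units (assumed)
    std::size_t batch_;     // batch size for far events (assumed)
    std::vector<Msg> far_;
};
```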
Data Integration Interface Development
• Application programming interface (API) to incorporate streaming input into discrete event execution
• Runtime efficiency as an important consideration
• Novel concepts supporting latency hiding
  – Permit maximal concurrency without violating time-ordering between the live simulation and real-time inputs
  – Reuse optimistic synchronization to hide the latency of unpredictable data input from external sources
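One hypothetical shape such an interface could take (entirely my own sketch, not the project's API): live samples are injected as ordinary events at their corresponding virtual times, and the optimistic engine's rollback machinery absorbs late arrivals instead of blocking the simulation:

```cpp
// Hypothetical streaming-input injection interface (illustrative only).
#include <functional>

struct ExternalSample { double real_ts; int entity; double value; };

class StreamIngest {
public:
    using Inject   = std::function<void(int entity, double vts, double value)>;
    using Rollback = std::function<void(int entity, double vts)>;

    StreamIngest(Inject inject, Rollback rollback)
        : inject_(inject), rollback_(rollback) {}

    // Called from the I/O path whenever a live sample arrives.
    void onSample(const ExternalSample& s, double current_vts) {
        double vts = toVirtual(s.real_ts);
        if (vts < current_vts)
            rollback_(s.entity, vts);        // late input: optimistic rollback
        inject_(s.entity, vts, s.value);     // schedule as an ordinary event
    }

private:
    double toVirtual(double rts) const { return rts; } // 1:1 clock mapping assumed
    Inject inject_;
    Rollback rollback_;
};
```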
Software Implementation
• Runtime algorithms and data integration interfaces realized in software, primarily in C/C++
• Building on current software (scales to 10^4 cores)
• Optimized for performance on Cray XT5 and Blue Gene P
• Communication to be structured flexibly: use MPI or Portals or a combination
• Will explore potentially new layers: non-blocking collectives (MPI-3), the Chapel language
• Leveraging our current scalable data structures and our existing layered software
Performance Metrics
• Efficiency and speedup measured using event rates
• Event rate ≡ number of events processed per wall-clock second
• Weak scaling: ideal ≡ events/second/processor stays invariant as the number of processors grows
• Strong scaling: ideal ≡ aggregate events/second increases linearly with the number of processors
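In symbols, restating the bullet definitions above (the notation is my own):

```latex
% Event rate and ideal scaling; E = events processed,
% T = wall-clock seconds, P = processor count.
R \;\equiv\; \frac{E}{T},
\qquad
\text{weak scaling (ideal):}\;\; \frac{R(P)}{P} = \text{const},
\qquad
\text{strong scaling (ideal):}\;\; R(P) \propto P
```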
Application Benchmarking and Demonstration
• Entire runtime and data integration frameworks to be exercised
• Instantiate scenarios scaled up from smaller-scale scenarios in the literature
• Experiment with strong scaling as well as weak scaling, as appropriate for each application area
• At-scale simulation from each area: epidemiological, human behavioral, cyber infrastructure, and logistics simulations
(figures: probability of infection in an epidemiological model; example inter-entity networks)
Status
• Showed preliminary evidence that PDES is:
  – Feasible even at the largest core counts
  – Adequately scalable, to over 100,000 cores
  – But in need of much, much more improvement
• Applications can now move beyond "if" and begin to contemplate "how" to use petascale discrete event execution
Methodological Alternatives
• Sometimes, new modeling formulations may better suit scaling needs!
• Redefine and refine the model to suit the computing platform
• Example: ultra-scale vehicular mobility simulations on GPUs…
Example: Ultra-scale Vehicular Mobility Simulations
E.g., National Evacuation Conference, www.nationalevacuationconference.org
Our GARFIELD Simulation & Visualization System
(diagram: GPU pipeline with texture memory banks feeding arrays of fragment processors (FP); followed by a live demo)