IEEE DS-RT, Singapore, Oct 26, 2009
Switching to High Gear: Opportunities for Grand-scale Real-time Parallel Simulations
Kalyan S. Perumalla, Ph.D.
Senior Research Staff Member, Oak Ridge National Laboratory
Adjunct Professor, Georgia Institute of Technology
Main Theme
• Computational Power… unprecedented potential… exploit
• Simulation Scale… stretch imagination… new scopes
• "Think Big… Really Big"
Confluence of Opportunities, Needs
(diagram from the original slide; not recoverable from the extracted text)
Parallel Computing Power: It's Coming
High-end computing… coming soon to a center near you!
Access to 1000s of cores… for every parallel simulation researcher… in just 2–3 years from now
Switching Gears

Gear | Decade | Processors
-----|--------|------------
1    | 1980   | 10^1
2    | 1990   | 10^2
3    | 2000   | 10^3
4    | 2010   | 10^4
5    | 2010   | 10^5–10^6
R    | 2020   | —
Potential Areas for Discrete Event Execution at 10^5–10^6 Scale
• Cyber infrastructure simulations: Internet protocols, peer-to-peer designs, …
• Epidemiological simulations: disease spread models, mitigation strategies, …
• Social dynamics simulations: pre- and post-operations campaigns, foreign policy, …
• Vehicular mobility simulations: regional- or nation-scale, …
• Agent-based simulations: behavioral exploration, complex compositions, …
• Sensor network simulations: wide-area monitoring, situational awareness, …
• Organization simulations: command and control, business processes, …
• Logistics simulations: supply chain processes, contingency analyses, …
Initial models scaling to 10^3–10^4 cores
If Only We Look Harder…
• Many nation-scale and world-scale questions are becoming relevant
• New methods and methodologies are waiting to be discovered
Slippery Slopes
(figure: a slope from high-level Abstractions down to Gory Detail — the starting point for an experimental study tends to slide toward ever greater accuracy and detail as needs evolve)
How do we abstract immense complexity?
Answer: it is very difficult until we experiment with the system at scale
What Do We Mean by Gory Detail? A Cyber Security Example
• Network at large: topologies, bandwidths, latencies, link types, MAC protocols, TCP/IP, BGP, …
• Core systems: routers, databases, service level agreements, inter-AS relationships, …
• End systems: processor traits, disk traits, OS instances, daemons, services, S/W bugs, …
• "Heavy" applications and traffic: video (YouTube, …), VoIP, live streams; foreground, background
• Behavioral infusion: social nets (topologies, dynamics, agencies, advertisers), peer-to-peer
Example: Epidemiology or Computer Worm Propagation
• Typical dynamics model: multiple variants exist, but qualitatively similar
• Excellent fit to plots of collected data, but post facto (!)
• Difficult to use as a predictive model: a great amount of detail is buried in α
• Gory detail needed for better predictive power: interaction topology, resource limitations
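The slide leaves the "typical dynamics model" implicit. A common logistic form, given here as an assumed reconstruction rather than the speaker's exact equation, shows how all the interaction detail collapses into the single parameter α:

```latex
% Assumed logistic form of epidemic/worm growth; I(t) = infected count,
% N = susceptible population size, \alpha = effective infection rate.
\frac{dI}{dt} = \alpha\, I\,(N - I)
\qquad\Longrightarrow\qquad
I(t) = \frac{N}{1 + \left(N/I_0 - 1\right) e^{-\alpha N t}}
```

The closed form fits observed sigmoid growth curves well after the fact; the difficulty is that the interaction topology and resource limits — the "gory detail" — are precisely what determine α.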
Slippery Slope: Cost and Time
• Cost to realize experimentation capability
• Time to reach experimentation capability
Our Research Organization in Discrete Event Runtimes and Applications
• Applications: evacuation decision support, automated detection/tracking, design & analysis of communication effects, …
  – Customization, scenario generation, experimentation, visualization
• Domain simulations: transportation network simulations, sensor network simulations, …
  – Core models, feasibility demonstration, extensible frameworks, novel modeling methods
  – Trade-offs: memory vs. computation, speed vs. accuracy
• Model families: vehicular, communication network, logistics, enterprise, social network, and asynchronous scientific simulations, …
• Parallel/distributed discrete event simulation engines — the "enabling" layer
  – Scalability, efficiency, correctness, robustness, usability, extensibility, integration
  – Model execution, synchronization, data integration, interoperability, multi-scale, …
• Platforms: supercomputers, clusters, multi-cores, GPGPUs, PDAs, …
A Few of Our Current Areas, Projects
• State-level mobility: multi-million intersections and links — GARFIELD-EVAC: 10^6–10^7-link scenarios of FL, LA, …
• Epidemiological analyses: detailed, billion-entity dynamics — RCREDIF: 10^9-individual infection scenarios
• Wireless radio signal estimation: multi-million-cell cluttered terrains — RCTLM: 3-D, 10^7 cells simulated on 10^4 cores
• Supercomputer design: designing next architectures by simulating on current ones — μπ: performance prediction of 10^6-core MPI programs on 10^4 cores
• Internet security, protocol design: as-is instantiation of nodes and routers — NetWarp: hi-fi Internet test-bed
• Populace's cognitive behaviors: large-population cognition with connectionist networks
Scalable Experimentation for Cyber Security
NetWarp is our novel test-bed technology for highly scalable, detailed, rapid experimentation on cyber security and cyber infrastructures.
Real-Time or Faster Cyber Experimentation Approaches
(chart: fidelity vs. scalability, 10^2 to 10^8 — hardware testbeds and emulation systems at high fidelity but low scale; parallel, sequential, and mixed-abstraction packet-level simulation in the middle; aggregate models at low fidelity but high scale; NetWarp positioned as a fully virtualized system combining real-time-or-faster execution with high scalability)
NetWarp Architecture
(architecture diagram)
DOE-Sponsored Institute for Advanced Architectures and Algorithms
Need: highly scalable simulation methods and methodologies to simulate next-generation architectures and algorithms on future supercomputing platforms…
"…catalyst for the co-design and development of architectures, algorithms, and applications to create synergy in their respective evolutions…"
μπ (MUPI) Performance Investigation System
• μπ = micro parallel performance investigator
• Performance prediction for MPI, Portals, and other parallel applications
  – Actual application code executed on the real hardware
  – Platform is simulated at large virtual scale
  – Timing customized by a user-defined machine model
• Scale is the key differentiator
  – Target: 150,000 virtual cores, e.g., 150,000 virtual MPI ranks in a simulated scenario
• Based on µsik (micro simulator kernel)
  – Scalable PDES engine
  – TCP- or MPI-connected simulation kernels
Example: MPI Application over μπ
• Modify the MPI include and recompile: change #include <mpi.h> to #include <mupi.h>
• Relink to the mupi library: instead of -lmpi, use -lmupi
• Run the modified MPI application (now a μπ simulation):
  mpirun -np 4 test -nvp 32
  runs test with 32 virtual MPI ranks; the simulation uses 4 real cores
• μπ itself uses multiple real cores to run in parallel
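To make the recipe concrete, here is a minimal, generic MPI program (my own sketch, not part of the μπ distribution); per the slide, the only source change needed is the header swap marked in the comment:

```c
/* hello.c -- a minimal MPI program, unchanged except for the header swap.
 * To run under mupi, replace <mpi.h> with <mupi.h> and link with -lmupi
 * instead of -lmpi, as the slide describes. */
#include <mpi.h>   /* under mupi: #include <mupi.h> */
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);  /* virtual rank under mupi */
    MPI_Comm_size(MPI_COMM_WORLD, &size);  /* virtual core count under mupi */
    printf("rank %d of %d\n", rank, size);
    MPI_Finalize();
    return 0;
}
```

Per the slide's description, running this under mpirun -np 4 … -nvp 32 would report 32 ranks (the virtual scale) while only 4 physical cores execute the simulation.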
Epidemic Disease Propagation
• Can be an extremely challenging simulation problem
• Asymptotic behaviors are relatively well understood
• Transients are poorly understood, hard to predict well
• Defined and characterized by many interlinked processes
• "Gory detail" necessary
Epidemic Disease Propagation
• Reaction-diffusion processes
  – Infection probability based on interaction times, vulnerabilities, thresholds (a standard form is sketched below)
  – Short- and long-distance mobility, sojourn times
  – Probabilistic state transitions, infections, recoveries
• A Supercomputing'08 model reported scalability only to 400 cores
  – Synchronization costs become prohibitive; synchronous execution is our prime suspect
• Our discrete event execution relieves synchronization costs
  – Scales to tens of thousands of cores
  – Up to 1 billion affected entities
(Image from psc.edu)
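The "probability based on interaction times" bullet is commonly realized with exponentially accumulating risk over a contact; this standard form from the agent-based epidemiology literature is shown as an illustrative assumption, not necessarily the exact model used here:

```latex
% Illustrative contact-infection probability (standard form, assumed):
% t = contact duration, s = susceptibility of the contactee,
% \rho = infectivity of the infected contact, \lambda = base transmission rate.
P(\text{infection} \mid \text{contact}) \;=\; 1 - e^{-\lambda\, s\, \rho\, t}
```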
PDES Scaling Needs
• Anticipate impending opportunities in multiple application areas of grand-scale PDES scenarios
• Prepare to capitalize on increasing computational power (300K+ cores)
• Aim to achieve the computational capability to enable new PDES-based scientific solutions
Jaguar: NCCS' Cray XT5*
* Data and images from http://nccs.gov
Technological Upgrade: 10^5-Scalable PDES Frameworks
To realize scale with any of the PDES models and applications, we need the core frameworks themselves to scale
Recent Attempts at 10^5-Core PDES Frameworks
• Bauer et al. (Jun '09) on Blue Gene/P (Argonne)
• Perumalla & Tipparaju (Jan '09) on Cray XT5 (ORNL)
• Degradation beyond 64K cores observed by us as well as others
• Degradation observed in more than one metric (rollback efficiency, speedup)
Implications for Discrete Event Execution on High-Performance Computing Platforms
Some of Our Objectives
• Scale from 10^4 cores (current) to 10^5–10^6 cores (new)
• Realize very large-scale scenarios (multi-billion entity): cyber infrastructures, social computing, epidemiology, logistics
• Aid projects in simulation-based design of future-generation supercomputers
• Fill the technological gap by achieving the highest scaling capabilities of parallel discrete event simulations
• Ultimately, enable formulation of grand-scale solutions with non-traditional supercomputing simulations
Electromagnetic (EM) Wave Propagation
• Predict receiver signal
• Account for reflectivity, transmissivity, multi-path effects
• Power level (voltage) modeled per face of grid cell (see the sketch below)
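Since RCTLM (mentioned earlier) suggests a transmission-line-matrix approach, here is a minimal sketch of what "voltage modeled per face of grid cell" can look like, assuming the textbook 2-D shunt-node TLM formulation; this is an illustrative assumption, not the actual RCTLM implementation:

```cpp
// Minimal 2-D TLM-style scatter/connect step -- textbook shunt-node update,
// offered as an assumed sketch, NOT the actual RCTLM code.
#include <array>
#include <vector>

struct Cell { std::array<double,4> v{}; };  // incident voltages: N,E,S,W faces

// One time step: scatter within each cell, then connect reflected pulses to
// neighboring faces (boundaries act as short circuits, reflecting with -1).
// The caller swaps `cur` and `next` between steps.
void tlm_step(const std::vector<Cell>& cur, std::vector<Cell>& next,
              int nx, int ny)
{
    auto at = [nx](int x, int y) { return y * nx + x; };
    for (int y = 0; y < ny; ++y)
        for (int x = 0; x < nx; ++x) {
            const auto& v = cur[at(x, y)].v;
            double s = 0.5 * (v[0] + v[1] + v[2] + v[3]);
            // Shunt-node scattering: reflected_i = (sum/2) - incident_i.
            std::array<double,4> r = { s - v[0], s - v[1], s - v[2], s - v[3] };
            if (y > 0)    next[at(x, y-1)].v[2] = r[0]; else next[at(x, y)].v[0] = -r[0];
            if (x < nx-1) next[at(x+1, y)].v[3] = r[1]; else next[at(x, y)].v[1] = -r[1];
            if (y < ny-1) next[at(x, y+1)].v[0] = r[2]; else next[at(x, y)].v[2] = -r[2];
            if (x > 0)    next[at(x-1, y)].v[1] = r[3]; else next[at(x, y)].v[3] = -r[3];
        }
}
```

Modeling per-face voltages this way is what lets reflectivity, transmissivity, and multi-path effects emerge from local cell updates rather than global ray tracing.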
PHOLD Benchmark
• Relatively fine grained: ~5 microseconds of computation per event
• 10 "juggler" entities per processor core (analogous to grid cells, road intersections, or such)
• Total of 1,000 "juggling balls" per core (analogous to state updates exchanged among cells)
• Upon receipt of a ball event, a juggler throws it back a random (exponentially distributed) time into the future, to a random juggler
• 1 in every 1,000 juggling exchanges is constrained to be intra-core; the rest are inter-core
A simplified sketch of this event loop follows.
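The sketch below is a simplified, single-process rendition of the juggler/ball pattern (names and parameters are my own; a real PHOLD run distributes jugglers across cores and routes 999 of every 1,000 ball events off-core):

```cpp
// Simplified single-process PHOLD sketch -- illustrative only.
#include <cstdio>
#include <queue>
#include <random>
#include <vector>

struct Event { double ts; int juggler; };
struct Later { bool operator()(const Event& a, const Event& b) const { return a.ts > b.ts; } };

int main()
{
    const int    kJugglers = 10;     // entities per core (per the slide)
    const int    kBalls    = 1000;   // initial ball events per core
    const double kEndTime  = 100.0;  // assumed end of simulated time

    std::mt19937 rng(42);
    std::exponential_distribution<double> delay(1.0);  // assumed mean of 1.0
    std::uniform_int_distribution<int> pick(0, kJugglers - 1);

    std::priority_queue<Event, std::vector<Event>, Later> fel;  // future event list
    for (int b = 0; b < kBalls; ++b)
        fel.push({delay(rng), pick(rng)});

    long processed = 0;
    while (!fel.empty() && fel.top().ts < kEndTime) {
        Event e = fel.top(); fel.pop();
        ++processed;
        // A real PHOLD event burns ~5 us of computation here, and with
        // probability 999/1000 the ball goes to a juggler on ANOTHER core;
        // this sequential sketch keeps everything local.
        fel.push({e.ts + delay(rng), pick(rng)});
    }
    std::printf("processed %ld events\n", processed);
    return 0;
}
```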
Scalability – Observations
• Scalability problems with current approaches were not evident previously: fine up to 10^4 cores, but poor thereafter
• Even with discrete event, implementation is key
• Semi-asynchronous execution scales poorly; fully asynchronous execution is needed
Algorithm Design and Development for Scalable Discrete Event Execution
• Design algorithms optimized for Cray XT5 and Blue Gene P/Q
• Design a new virtual-time synchronization algorithm
• Design novel rollback control schemes
• Design discrete event-specific flow control
(figure: current synchronization algorithm)
Additional Important Algorithmic Aspects
• Novel separation of event communication from synchronization
  – Prioritization support in our communication layer
  – "QoS" support for fast synchronization
• Novel timestamp-aware buffering (see the sketch after this list)
  – Exploit near vs. far timestamps
  – Coordinated with virtual-time synchronization
• Efficient flow control
  – Highly unstructured inter-processor communication
• Optimized rollback dynamics
  – Stability and throttling mechanisms
  – Cancel-back protocols, e.g., for the "transient event" problem
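To make the timestamp-aware buffering idea concrete, here is an assumed sketch: events with near timestamps are sent immediately because the receiver may need them to advance virtual time, while far-future events are batched to amortize communication cost. The class name, threshold, and flush policy are illustrative assumptions, not the engine's actual design:

```cpp
// Illustrative timestamp-aware send buffering (assumed design).
#include <cstddef>
#include <vector>

struct Msg { double ts; int dest; /* payload omitted */ };

class TsAwareBuffer {
public:
    TsAwareBuffer(double horizon, std::size_t batch)
        : horizon_(horizon), batch_(batch) {}

    // Near-timestamp events are urgent: the receiver may need them to
    // advance virtual time, so bypass buffering and send immediately.
    void send(const Msg& m, double now) {
        if (m.ts - now < horizon_) { transmit(m); return; }
        far_.push_back(m);                 // far-future: safe to batch
        if (far_.size() >= batch_) flush();
    }

    // Flush before each virtual-time synchronization round so buffered
    // events cannot stall global progress.
    void flush() {
        for (const Msg& m : far_) transmit(m);
        far_.clear();
    }

private:
    void transmit(const Msg&) { /* hand off to the MPI/Portals layer */ }
    double horizon_;        // "near" cutoff in virtual-time units (assumed)
    std::size_t batch_;     // batch size for far events (assumed)
    std::vector<Msg> far_;
};
```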
Data Integration Interface Development
• Application programming interface (API) to incorporate streaming input into discrete event execution
• Runtime efficiency as an important consideration
• Novel concepts supporting latency hiding
  – Permit maximal concurrency without violating time-ordering between the live simulation and real-time inputs
  – Reuse optimistic synchronization to hide the latency of unpredictable data input from external sources
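One hypothetical shape such an interface could take (entirely my own sketch, not the project's API): live samples are injected as ordinary events at their corresponding virtual times, and the optimistic engine's rollback machinery absorbs late arrivals instead of blocking the simulation:

```cpp
// Hypothetical streaming-input injection interface (illustrative only).
#include <functional>

struct ExternalSample { double real_ts; int entity; double value; };

class StreamIngest {
public:
    using Inject   = std::function<void(int entity, double vts, double value)>;
    using Rollback = std::function<void(int entity, double vts)>;

    StreamIngest(Inject inject, Rollback rollback)
        : inject_(inject), rollback_(rollback) {}

    // Called from the I/O path whenever a live sample arrives.
    void onSample(const ExternalSample& s, double current_vts) {
        double vts = toVirtual(s.real_ts);
        if (vts < current_vts)
            rollback_(s.entity, vts);        // late input: optimistic rollback
        inject_(s.entity, vts, s.value);     // schedule as an ordinary event
    }

private:
    double toVirtual(double rts) const { return rts; } // 1:1 clock mapping assumed
    Inject inject_;
    Rollback rollback_;
};
```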
Software Implementation
• Runtime algorithms and data integration interfaces realized in software, primarily in C/C++
• Building on current software (scales to 10^4 cores)
• Optimized for performance on Cray XT5 and Blue Gene P
• Communication to be structured flexibly: use MPI or Portals or a combination
• Will explore potentially new layers: non-blocking collectives (MPI-3), the Chapel language
• Leveraging our current scalable data structures and our existing layered software
Performance Metrics
• Efficiency and speedup measured using event rates
• Event rate ≡ number of events processed per wall-clock second
• Weak scaling: ideal ≡ events/second/processor stays invariant as the number of processors grows
• Strong scaling: ideal ≡ aggregate events/second increases linearly with the number of processors
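In symbols, restating the bullet definitions above (the notation is my own):

```latex
% Event rate and ideal scaling; E = events processed,
% T = wall-clock seconds, P = processor count.
R \;\equiv\; \frac{E}{T},
\qquad
\text{weak scaling (ideal):}\;\; \frac{R(P)}{P} = \text{const},
\qquad
\text{strong scaling (ideal):}\;\; R(P) \propto P
```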
Application Benchmarking and Demonstration
• Entire runtime and data integration frameworks to be exercised
• Instantiate scenarios scaled up from smaller-scale scenarios in the literature
• Experiment with strong scaling as well as weak scaling, as appropriate for each application area
• At-scale simulation from each area: epidemiological, human behavioral, cyber infrastructure, and logistics simulations
(figures: probability of infection in an epidemiological model; example inter-entity networks)
Status
• Showed preliminary evidence that PDES is:
  – Feasible even at the largest core counts
  – Adequately scalable, to over 100,000 cores
  – But in need of much, much more improvement
• Applications can now move beyond "if" and begin to contemplate "how" to use petascale discrete event execution
Methodological Alternatives
• Sometimes, new modeling formulations may better suit scaling needs!
• Redefine and refine the model to suit the computing platform
• Example: ultra-scale vehicular mobility simulations on GPUs…
Example: Ultra-scale Vehicular Mobility Simulations
E.g., National Evacuation Conference, www.nationalevacuationconference.org
Our GARFIELD Simulation & Visualization System
(diagram: GPU pipeline with texture memory banks feeding arrays of fragment processors (FP); followed by a live demo)