SimuTools, Malaga, Spain, March 16, 2010. Efficient Simulation of Agent-based Models on Multi-GPU & Multi-Core Clusters. Kalyan S. Perumalla, Ph.D., Senior R&D Manager, Oak Ridge National Laboratory; Adjunct Professor, Georgia Institute of Technology.
In a Nutshell: Dramatic improvements in simulation speed
ABMS: Motivating Demonstrations • Agent-Based Modeling and Simulation (ABMS) examples: Game of Life (GOL) and Afghan Leadership (LDR)
GPU-based ABMS References • Examples: • K. S. Perumalla and B. Aaby, "Data Parallel Execution Challenges and Runtime Performance of Agent Simulations on GPUs," Agent-Directed Simulation Symposium, 2008 • R. D'Souza, M. Lysenko, and K. Rehmani, "SugarScape on Steroids: Simulating Over a Million Agents at Interactive Rates," AGENT Conference on Complex Interaction and Social Emergence, 2007
Computation Kernels on Each GPU, e.g., CUDA Threads • Host initiates a "launch" of many SIMD threads • Threads get "scheduled" in batches on GPU hardware • CUDA claims an extremely efficient thread-launch implementation • Millions of CUDA threads at once
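The launch model above can be sketched on a CPU: every logical thread runs the same kernel body and is distinguished only by its block and thread indices. This is a minimal illustrative sketch in Python (the function and parameter names are ours, not CUDA's); on real hardware the blocks run in parallel batches rather than in sequential loops.

```python
# Hypothetical CPU sketch of a CUDA-style "launch": every thread executes the
# same kernel, distinguished only by its (block, thread) index pair.
def kernel(block_idx, thread_idx, block_dim, out):
    gid = block_idx * block_dim + thread_idx  # global thread id, as in CUDA
    out[gid] = gid * 2                        # each thread writes one element

def launch(grid_dim, block_dim, out):
    # The GPU schedules blocks in batches; sequential loops stand in for that here.
    for b in range(grid_dim):
        for t in range(block_dim):
            kernel(b, t, block_dim, out)

out = [0] * 8
launch(grid_dim=2, block_dim=4, out=out)
# every slot is written by exactly one logical thread
```

The key point is data-parallelism: the kernel body is identical for all threads, and only the computed global index differs, which is what makes launching millions of threads cheap.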
GPU Memory Types (CUDA) • GPU memory comes in several flavors • Registers • Local Memory • Shared Memory • Constant Memory • Global Memory • Texture Memory • An important challenge is organizing the application to make the most effective use of this memory hierarchy
CUDA + MPI • An economical cluster solution • Affordable GPUs, each providing one-node CUDA • MPI on giga-bit Ethernet for inter-node comm. • Memory speed-constrained system • Inter-memory transfers can dominate runtime • Runtime overhead can be severe • Need a way to tie CUDA and MPI • Algorithmic solution needed • Need to overcome latency challenge
Parallel Execution: Conventional Method (diagram: a 3×3 grid of B-sized blocks Block0,0–Block2,2, each mapped to a processor P0,0–P2,2)
Latency Challenge: Conventional Method • High latency between GPU and CPU memories • CUDA inter-memory data transfer primitives • Very high latency across CPU memories • MPI communication for data transfers • The naïve method gives a very poor computation-to-communication ratio • Slow-downs instead of speedups • Need a latency-resilient method…
Our Solution: B2R Method (diagram: the same 3×3 grid of B-sized blocks, each extended by a surrounding ghost region of width R)
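The idea behind B2R can be illustrated with a toy 1-D averaging stencil (our own illustrative sketch in Python, not code from the paper): instead of exchanging a 1-cell ghost layer every step, each B-sized block fetches an R-wide ghost zone once, then advances R steps with no communication at all, and still produces exactly the results of global step-by-step execution.

```python
# Toy 1-D periodic averaging stencil illustrating the B2R idea.
# B = cells owned by a block, R = ghost-zone width = steps per exchange.
def step(a):
    # one global stencil step with periodic boundary (the reference)
    n = len(a)
    return [(a[(i - 1) % n] + a[i] + a[(i + 1) % n]) / 3.0 for i in range(n)]

def b2r_run(a, B, R, rounds):
    n = len(a)
    for _ in range(rounds):                      # each round advances R steps
        new = [0.0] * n
        for start in range(0, n, B):             # one "block" of B cells
            # fetch block plus R ghost cells on each side (one exchange)
            w = [a[(start - R + k) % n] for k in range(B + 2 * R)]
            for _ in range(R):                   # R steps, purely local:
                w = [(w[j - 1] + w[j] + w[j + 1]) / 3.0
                     for j in range(1, len(w) - 1)]  # window shrinks by 2/step
            # after R steps the valid interior is exactly the B owned cells
            for k in range(B):
                new[start + k] = w[k]
        a = new
    return a

a0 = [float(i % 5) for i in range(12)]
ref = a0
for _ in range(4):                               # 4 steps, exchanging each step
    ref = step(ref)
b2r = b2r_run(a0, B=4, R=2, rounds=2)            # same 4 steps, 2 per exchange
assert all(abs(x - y) < 1e-12 for x, y in zip(ref, b2r))
```

The trade-off is visible in the window size: each block redundantly computes on B + 2R cells so that it only pays the exchange latency once per R steps instead of once per step.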
Total Runtime Cost: Analytical Form • At any level in the hierarchy, the total runtime F is given by an analytical expression (equation shown on the slide, not transcribed) • Most interesting aspect: cubic in R!
Implications of Being Cubic in R • Benefits of B2R are not immediately seen for small R • In fact, there is degradation for small R! • Dramatic improvement becomes possible beyond small R • Our experiments confirm this trend! • Too large an R is also bad • Can't profit indefinitely! (plot: total execution time vs. R)
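The shape of this trade-off can be reproduced with a deliberately simplified cost model (our own illustration, not the paper's actual formula): over one round of R steps, the redundant ghost-zone compute contributes terms up to R³, while a fixed per-round exchange latency is amortized as latency/R, so the per-step cost first falls and then rises as R grows.

```python
# Illustrative-only cost model: compute cost of one R-step round on a 1-D-style
# block of width B with shrinking ghost zones, plus one fixed-latency exchange.
# All constants (B, latency, c) are arbitrary choices for illustration.
def per_step_cost(R, B=64.0, latency=5000.0, c=1.0):
    # terms up to R**3 from R sweeps over (B + 2R)-sized windows
    compute_round = c * (R * B**2 + 2 * B * R**2 + (4.0 / 3.0) * R**3)
    return (compute_round + latency) / R   # amortized over the R steps

costs = {R: per_step_cost(R) for R in range(1, 33)}
best = min(costs, key=costs.get)
print(best)  # an intermediate R minimizes the per-step cost
```

Even in this crude model the qualitative behavior on the slide appears: small R is dominated by latency (degradation), very large R by redundant computation, and the optimum lies in between.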
Sub-division Across Levels, e.g., MPI to Blocks to Threads • MPI: Rm • Block: Rb • Thread: Rt
Hierarchy and Recursive Use of B & R • B2R can be applied at all levels, e.g., across the CUDA hierarchy • A different R can be chosen at every level, e.g., Rb for block-level R and Rt for thread-level R • Simple constraints exist on the possible values of R: between R and B, and between the R's at different levels • Details in our paper
Performance • Over 100× speedup with MPI+CUDA • Speedup is relative to the naïve method with no latency-hiding
Summary • The B2R algorithm applies across heterogeneous, hierarchical platforms • Deep GPU hierarchies • Deep CPU multi-core systems • The cubic dependence of runtime on R is a remarkable aspect • A maximum and minimum exist • The optimal (minimum) runtime can be dramatically low • Results show clear performance improvement • Up to 150× in the best case (fine-grained)
Future Work • Generate cross-platform code • E.g., implement in OpenCL • Add to the CUDA+MPI levels • Multi-GPU per node • Implement and test with more benchmarks • E.g., from existing ABMS suites NetLogo & Repast • Generalize to unstructured inter-agent graphs • E.g., social networks • Potential to apply to other domains • E.g., stencil computations
Thank you! Questions? Additional material at our webpage: Discrete Computing Systems, www.ornl.gov/~2ip