
Efficient Simulation of Agent-based Models on Multi-GPU & Multi-Core Clusters

SimuTools, Malaga, Spain, March 16, 2010. Efficient Simulation of Agent-based Models on Multi-GPU & Multi-Core Clusters. Kalyan S. Perumalla, Ph.D., Senior R&D Manager, Oak Ridge National Laboratory; Adjunct Professor, Georgia Institute of Technology.


Presentation Transcript


  1. SimuTools, Malaga, Spain, March 16, 2010 • Efficient Simulation of Agent-based Models on Multi-GPU & Multi-Core Clusters • Kalyan S. Perumalla, Ph.D. • Senior R&D Manager, Oak Ridge National Laboratory • Adjunct Professor, Georgia Institute of Technology

  2. In a Nutshell • Dramatic improvements in speed

  3. Outline

  4. ABMS: Motivating Demonstrations • Agent-Based Modeling and Simulation (ABMS) • Game of Life (GOL) • Afghan Leadership (LDR)

  5. GPU-based ABMS References • Examples: • K. S. Perumalla and B. Aaby, "Data Parallel Execution Challenges and Runtime Performance of Agent Simulations on GPUs," in Agent-Directed Simulation Symposium, 2008 • R. D'Souza, M. Lysenko, and K. Rehmani, "SugarScape on Steroids: Simulating Over a Million Agents at Interactive Rates," in AGENT Conference on Complex Interaction and Social Emergence, 2007

  6. Hierarchical GPU System Hardware

  7. Computation Kernels on Each GPU, E.g., CUDA Threads • Host initiates “launch” of many SIMD threads • Threads get “scheduled” in batches on GPU hardware • CUDA claims extremely efficient thread-launch implementation • Millions of CUDA threads at once

  8. GPU Memory Types (CUDA) • GPU memory comes in several flavors • Registers • Local Memory • Shared Memory • Constant Memory • Global Memory • Texture Memory • An important challenge is organizing the application to make the most effective use of this hierarchy

  9. GPU Communication Latencies (CUDA)

  10. CUDA + MPI • An economical cluster solution • Affordable GPUs, each providing one-node CUDA • MPI on gigabit Ethernet for inter-node communication • Memory speed-constrained system • Inter-memory transfers can dominate runtime • Runtime overhead can be severe • Need a way to tie CUDA and MPI together • Algorithmic solution needed • Need to overcome the latency challenge

  11. Analogous Networked Multi-core System

  12. Parallel Execution: Conventional Method [Figure: 3×3 grid of blocks Block0,0 through Block2,2, each assigned to a processor P0,0 through P2,2, with label B]

  13. Latency Challenge: Conventional Method • High latency between GPU and CPU memories • CUDA inter-memory data transfer primitives • Very high latency across CPU memories • MPI communication for data transfers • Naïve method gives a very poor computation-to-communication ratio • Slowdowns instead of speedups • Need a latency-resilient method …
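To make the latency problem concrete, here is a minimal sketch of the conventional scheme applied to the Game of Life, assuming a toroidal grid split into square blocks whose side divides the grid evenly. Every name here (`life_rule`, `step_global`, `run_conventional`) is hypothetical, and the in-process list copy merely stands in for the CUDA or MPI transfer that a real implementation would perform. The point it illustrates is that each block needs one halo refresh per time step, so the exchange count grows with the number of steps.

```python
# Conventional block-parallel Game of Life (illustrative sketch only).
# Each block refreshes a one-cell halo from its neighbours before every step.

def life_rule(alive, neighbours):
    """Conway's rule: birth on 3 neighbours, survival on 2 or 3."""
    return 1 if neighbours == 3 or (alive and neighbours == 2) else 0

def step_global(grid):
    """Reference implementation: one step on the whole toroidal grid."""
    n, m = len(grid), len(grid[0])
    return [[life_rule(grid[i][j],
                       sum(grid[(i + di) % n][(j + dj) % m]
                           for di in (-1, 0, 1) for dj in (-1, 0, 1)
                           if (di, dj) != (0, 0)))
             for j in range(m)] for i in range(n)]

def run_conventional(grid, block, steps):
    """Advance `grid` by `steps`; returns (final_grid, halo_exchange_count).

    Assumes the grid dimensions are multiples of `block`.
    """
    n, m = len(grid), len(grid[0])
    exchanges = 0
    for _ in range(steps):
        new = [[0] * m for _ in range(n)]
        for bi in range(0, n, block):
            for bj in range(0, m, block):
                # Stand-in for a data transfer: block plus a 1-cell halo.
                local = [[grid[(bi + i) % n][(bj + j) % m]
                          for j in range(-1, block + 1)]
                         for i in range(-1, block + 1)]
                exchanges += 1
                # Update only the block's own cells from the local copy.
                for i in range(1, block + 1):
                    for j in range(1, block + 1):
                        nb = sum(local[i + di][j + dj]
                                 for di in (-1, 0, 1) for dj in (-1, 0, 1)
                                 if (di, dj) != (0, 0))
                        new[bi + i - 1][bj + j - 1] = life_rule(local[i][j], nb)
        grid = new
    return grid, exchanges
```

With four blocks and four steps, this sketch performs sixteen exchanges; when each exchange carries a high fixed latency, that per-step cost is what drags the computation-to-communication ratio down.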

  14. Our Solution: B2R Method [Figure: 3×3 grid of blocks Block0,0 through Block2,2, each assigned to a processor P0,0 through P2,2, with labels B and R]

  15. B2R Algorithm
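The transcript does not preserve the algorithm itself. The sketch below shows one common latency-hiding technique, wide halos with redundant computation, under the assumption that B denotes the block side and R the number of extra halo layers fetched per exchange; it should not be read as the authors' exact B2R algorithm. The function names (`life_rule`, `step_global`, `run_wide_halo`) and the toroidal Game of Life setting are illustrative choices. Each block copies a region of side B + 2R, advances it R steps locally while the valid region shrinks by one layer per step, and only then exchanges again.

```python
# Wide-halo Game of Life with redundant computation (illustrative sketch).
# ASSUMPTION: B = block side, R = halo width; not taken from the slides.

def life_rule(alive, neighbours):
    """Conway's rule: birth on 3 neighbours, survival on 2 or 3."""
    return 1 if neighbours == 3 or (alive and neighbours == 2) else 0

def step_global(grid):
    """Reference implementation: one step on the whole toroidal grid."""
    n, m = len(grid), len(grid[0])
    return [[life_rule(grid[i][j],
                       sum(grid[(i + di) % n][(j + dj) % m]
                           for di in (-1, 0, 1) for dj in (-1, 0, 1)
                           if (di, dj) != (0, 0)))
             for j in range(m)] for i in range(n)]

def run_wide_halo(grid, block, halo, steps):
    """Advance `grid` by `steps`, exchanging once per `halo` steps.

    Returns (final_grid, halo_exchange_count). Assumes the grid dimensions
    are multiples of `block` and that `halo` >= 1.
    """
    n, m = len(grid), len(grid[0])
    size = block + 2 * halo
    exchanges, done = 0, 0
    while done < steps:
        rounds = min(halo, steps - done)  # steps covered by this exchange
        new = [[0] * m for _ in range(n)]
        for bi in range(0, n, block):
            for bj in range(0, m, block):
                # Stand-in for one transfer: block plus `halo` layers.
                local = [[grid[(bi + i - halo) % n][(bj + j - halo) % m]
                          for j in range(size)] for i in range(size)]
                exchanges += 1
                for k in range(1, rounds + 1):
                    nxt = [row[:] for row in local]
                    # After local step k only cells [k, size - k) are valid.
                    for i in range(k, size - k):
                        for j in range(k, size - k):
                            nb = sum(local[i + di][j + dj]
                                     for di in (-1, 0, 1) for dj in (-1, 0, 1)
                                     if (di, dj) != (0, 0))
                            nxt[i][j] = life_rule(local[i][j], nb)
                    local = nxt
                # The central block is still valid; write it back.
                for i in range(block):
                    for j in range(block):
                        new[bi + i][bj + j] = local[halo + i][halo + j]
        grid = new
        done += rounds
    return grid, exchanges
```

The design trade-off is visible in the loops: fewer exchanges are paid for with extra cell updates in the halo region, so a larger R is not automatically better.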

  16. Total Runtime Cost: Analytical Form • At any level in the hierarchy, total runtime F is given by: [equation not preserved in this transcript] • Most interesting aspect: cubic in R!

  17. Implications of being Cubic in R • Benefits with B2R not immediately seen for small R • In fact, degradation for small R! • Dramatic improvement possible after small R • Our experiments confirm this trend! • Too large is too bad too • Can’t profit indefinitely! [Plot: total execution time versus R]
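The trade-off behind an optimal R can be illustrated with a toy cost model. This is not the formula F from the talk, which the transcript does not preserve; it assumes a two-dimensional block of side B with a halo of width R, one exchange of fixed latency every R steps, and a uniform cost per cell update. The parameter names `latency` and `cell_cost` and all function names are hypothetical. Under these assumptions the redundant work between two exchanges is a cubic polynomial in R, and the amortized time per step has a minimum at some intermediate R. The model is too simple to reproduce the degradation at small R noted above.

```python
# Toy cost model for wide-halo execution (illustrative assumptions only).

def work_per_round(B, R):
    """Cell updates one block performs between two exchanges.

    Local step k (k = 1..R) updates a square of side B + 2 * (R - k).
    """
    return sum((B + 2 * (R - k)) ** 2 for k in range(1, R + 1))

def work_per_round_closed(B, R):
    """Closed form of the same sum: R*B^2 + 2*B*R*(R-1) + (2/3)*R*(R-1)*(2R-1)."""
    return R * B * B + 2 * B * R * (R - 1) + 2 * R * (R - 1) * (2 * R - 1) // 3

def time_per_step(B, R, latency, cell_cost):
    """Amortized time per simulated step: one exchange plus redundant work, over R steps."""
    return (latency + cell_cost * work_per_round(B, R)) / R

def best_halo(B, latency, cell_cost, max_R):
    """Halo width in 1..max_R that minimizes the modeled time per step."""
    return min(range(1, max_R + 1),
               key=lambda R: time_per_step(B, R, latency, cell_cost))
```

For example, `best_halo(16, 10000.0, 1.0, 64)` returns an intermediate value rather than either extreme, while setting `latency` to zero makes R = 1 the best choice, since extra halo layers then only add work.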

  18. Sub-division Across Levels, E.g., MPI to Blocks to Threads • MPI: Rm • Block: Rb • Thread: Rt

  19. Hierarchy and Recursive Use of B & R • B2R can be applied at all levels! E.g., CUDA hierarchy • A different R can be chosen at every level, e.g., • Rb for block-level R • Rt for thread-level R • Simple constraints exist for possible values of R • Between R and B • Between R’s at different levels • Details in our paper

  20. B2R Implementation within CUDA

  21. Performance • Over 100× speedup with MPI+CUDA • Speedup is relative to the naïve method with no latency hiding

  22. Multi-GPU MPI+CUDA – Game of Life

  23. Multi-core MPI+pthreads – Game of Life

  24. Multi-core MPI+pthreads – Game of Life

  25. Multi-core MPI+pthreads – Leadership

  26. Summary • B2R Algorithm applies across heterogeneous, hierarchical platforms • Deep GPU hierarchies • Deep CPU multi-core systems • Cubic nature of runtime dependence on R is a remarkable aspect • A maximum and a minimum exist • Optimal (minimum) can be dramatically low • Results show clear performance improvement • Up to 150× in the best case (fine-grained)

  27. Future Work • Generate cross-platform code • E.g., implement in OpenCL • Add to CUDA-MPI levels • Multi-GPU per node • Implement and test with more benchmarks • E.g., from existing ABMS suites NetLogo & Repast • Generalize to unstructured inter-agent graphs • E.g., social networks • Potential to apply to other domains • E.g., stencil computations

  28. Thank you! Questions? • Additional material at our webpage: Discrete Computing Systems, www.ornl.gov/~2ip
