Performance Evaluation of OhHelp'ed PIC Simulation. Hiroshi Nakashima (ACCMS, Kyoto U.), in cooperation with Yohei Miyake (ACCMS, Kyoto U.), Hideyuki Usui (Kobe U.), and Yoshiharu Omura (RISH, Kyoto U.)
Contents
• Introduction: what I said last June
• PIC simulation: overview & problems
• OhHelp: load balancer for PIC simulation
  • algorithm overview
  • some detail of load balancing
  • total view of OhHelp'ed PIC simulation
  • performance evaluation
• Conclusion
Introduction: I Said Last June ... Why Plasma Simulation? A big user group of plasma simulation insisted that our new system include a power- and money-hungry subsystem of large-scale shared-memory nodes (128 cores, 1 TB, 1.28 TFlops) for their memory-hungry SM-parallel application. I failed to persuade them to build an Open-Supercomputer-only system, so I swore revenge on them by coding a much more efficient DM-parallel program to run on the Open Supercomputer. (Now we are very friendly with each other and are pursuing tightly collaborative research work.)
Introduction: Also I Showed Last June ... How Efficient on SMP & Small T2K
• performance @ 16-128 proc on HPC2500
• T2K Open Supercomputer, 4 nodes (64 cores)
[Figure: performance of the original, unbalanced, and balanced cases; speedup labels in the plots: x1.66, x3.20, x4.02, x8.76, x10.7, x11.71]
PIC Simulation: What to Do? Simulate the movement of a large number of (e.g. > 1 trillion) charged particles in a large-scale (e.g. 1000x1000x1000 grid) electromagnetic field (e.g. magnetosphere).
PIC Simulation: What Is the Problem?
• Inherently has much parallelism, but is believed to be hardly scalable because ...
  • Particle decomposition, which copies the fields to every process, cannot sustain a large space domain and global operations on it.
  • Domain decomposition cannot work when particles are distributed non-uniformly.
  • Dynamic domain decomposition also fails when particles populate a small subdomain too densely.
• Need a new idea!!
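The domain-decomposition problem can be made concrete with a small count. The sketch below is an illustration only, not code from the talk: it assumes a 2D 64x64 domain split into 4x4 uniform blocks, and the helper names `subdomain_counts` and `imbalance` are hypothetical. It compares a uniform particle distribution with one where every particle sits in a single corner block.

```python
from collections import Counter

def subdomain_counts(positions, domain_size, blocks):
    """Count particles per subdomain for a uniform blocks x blocks split (2D toy case)."""
    width = domain_size / blocks
    counts = Counter()
    for x, y in positions:
        # Clamp so a particle exactly on the upper boundary stays in the last block.
        ix = min(int(x / width), blocks - 1)
        iy = min(int(y / width), blocks - 1)
        counts[(ix, iy)] += 1
    return [counts[(ix, iy)] for ix in range(blocks) for iy in range(blocks)]

def imbalance(counts):
    """Ratio of the heaviest subdomain to the average load (1.0 means perfectly balanced)."""
    return max(counts) / (sum(counts) / len(counts))

# One particle at the centre of every cell of a 64x64 grid.
uniform = [(i + 0.5, j + 0.5) for i in range(64) for j in range(64)]
# The same number of particles, all packed into the 16x16 corner block.
clustered = [((i % 16) + 0.5, (j % 16) + 0.5) for i in range(64) for j in range(64)]

print(imbalance(subdomain_counts(uniform, 64, 4)))    # 1.0
print(imbalance(subdomain_counts(clustered, 64, 4)))  # 16.0
```

In the clustered case one process owns all the particles while the other fifteen idle, which is the situation plain domain decomposition cannot handle.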
OhHelp: Overview (OhHelp = One-handed Help)
[Figure: 4x4 grid of subdomains 00-33; each node has a primary subdomain and a secondary subdomain]
• uniform block decomposition
• well-balanced, i.e. #particle-in-subdomain ≤ (#p / #nodes)(1 + α) for a small tolerance α: simulate primary particles, neighboring comm. only
• each node helps another node having dense subdomain
  • balanced #particles
  • balanced subdomain size
• well-balanced: stable subdomain assignment
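The well-balanced condition on this slide is a one-line check. The following is a minimal sketch, assuming the tolerance in the bound is a small non-negative factor (called `alpha` here; the name and the example value 0.1 are assumptions, as is the function name `is_well_balanced`).

```python
def is_well_balanced(counts, alpha):
    """True if every subdomain holds at most (total / #nodes) * (1 + alpha) particles.

    counts: particles per primary subdomain, one entry per node.
    alpha:  assumed tolerance factor (hypothetical name and value).
    """
    limit = sum(counts) / len(counts) * (1 + alpha)
    return max(counts) <= limit

# 16 nodes with equal load: primary-only mode is enough.
print(is_well_balanced([256] * 16, alpha=0.1))          # True
# All particles in one subdomain: helpers are needed.
print(is_well_balanced([4096] + [0] * 15, alpha=0.1))   # False
```

When the check passes, each node only simulates its own primary particles; when it fails, secondary subdomains come into play.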
OhHelp: Load Balancing
• Secondary Subdomain Assignment
  • move p from heaviest to lightest so that lightest has av. #p
  • give p even if becoming less than average
  • get from somebody afterward
[Figure: nodes 00-33 with their particle loads relative to av. #p]
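The three rules on this slide can be read as a greedy loop. The sketch below is a simplified reading of the slide, not the actual OhHelp implementation; the function name `assign_helpers` and its return values are hypothetical. The lightest unassigned node becomes the helper of the heaviest one and takes exactly enough particles to reach the average; the heavy node may drop below the average and is topped up later by someone else.

```python
from fractions import Fraction

def assign_helpers(primary_counts):
    """Greedy sketch: returns (helps, moved, load).

    helps[h] = node whose subdomain becomes h's secondary subdomain
    moved[h] = number of particles h takes over from that node
    load[k]  = particles node k ends up simulating
    """
    n = len(primary_counts)
    average = Fraction(sum(primary_counts), n)   # exact, avoids float round-off
    load = {k: Fraction(c) for k, c in enumerate(primary_counts)}
    helps, moved = {}, {}
    remaining = list(range(n))
    while len(remaining) > 1:
        lightest = min(remaining, key=lambda k: load[k])
        deficit = average - load[lightest]
        if deficit == 0:
            break                                # everyone left is already at the average
        heaviest = max(remaining, key=lambda k: load[k])
        helps[lightest] = heaviest               # one-handed: each node helps one other
        moved[lightest] = deficit
        load[heaviest] -= deficit                # may fall below average; fixed afterward
        load[lightest] = average
        remaining.remove(lightest)
    return helps, moved, load

helps, moved, load = assign_helpers([50, 0, 45, 45, 10])
print(helps)  # {1: 0, 4: 2, 0: 3, 2: 3}
```

In this run node 0 first gives 30 particles to node 1 and falls to 20, below the average of 30, and later gets 10 from node 3. The resulting `helps` relation has node 3 at the root with nodes 0 and 2 as its helpers, which is the kind of family structure a helper tree describes.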
OhHelp: Helper-Tree
[Figure: helper tree over nodes 00-33]
• Helper-Tree is traversed each time-step (i.e. each particle movement)
  • bottom-up: does the helper assignment sustain the load variance?
  • top-down: how to (re)distribute particles among family members?
OhHelp: Balancing Check (1/2) [Figure: helper tree over nodes 00-33]
OhHelp: Balancing Check (2/2) [Figure: helper tree over nodes 00-33]
OhHelp: Simulation Flow [Figure: per-time-step flow with the stages current scatter, field solve, particle push, and load balance & particle transfer; labels: primary, secondary, particle transfer, all-reduce + broadcast]
OhHelp: Evaluation Setup
• strong scaling: 64 x 64 x 64 domain, #proc = 1 to 512
• weak scaling: 32x32x32 per process, from #proc = 1 up to 256 x 256 x 512 at #proc = 1024
OhHelp: Performance
• strong scaling: domain size = 64^3, #particles = 32 x 2^20
• weak scaling: domain size = 32^3 x #proc, #particles = 8 x 2^20 x #proc
[Figure: two plots of performance (10^6 particle/s) vs. # of processes; speedup labels: x253, x293, x626, x721]
• uniformly distributed: good scalability
• thicken particles in a 32^3 area: also well scalable
• particle decomp.: not scalable
• almost linear speed-up over 16 proc.: perf(1024) > perf(16) x 60 (second label: x35)
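The sizes and speedups on this slide can be cross-checked with a little arithmetic. The sketch below is only a sanity check; it assumes that the x626 and x721 labels are the weak-scaling speedups at 1024 processes (consistent with the 600-700 figure in the conclusion), and the function names are hypothetical.

```python
def weak_scaling_problem(n_proc):
    """Total grid cells and particles for the weak-scaling runs (sizes from the slide)."""
    cells = 32**3 * n_proc
    particles = 8 * 2**20 * n_proc
    return cells, particles

def parallel_efficiency(speedup, n_proc):
    """Speedup divided by process count; 1.0 would be ideal linear scaling."""
    return speedup / n_proc

cells, particles = weak_scaling_problem(1024)
print(cells == 256 * 256 * 512)   # True: matches the 256 x 256 x 512 domain
print(particles // cells)         # 256 particles per cell (weak scaling)
print((32 * 2**20) // 64**3)      # 128 particles per cell (strong scaling)

# Assumption: x626 and x721 were measured with 1024 processes.
for speedup in (626, 721):
    print(speedup, round(parallel_efficiency(speedup, 1024), 3))
```

Under that assumption the two labels correspond to roughly 61% and 70% parallel efficiency at 1024 processes.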
Conclusion
• We confirmed that the OhHelp'ed PIC simulator is scalable.
  • 600-700x speedup with 1024 processes
• (very near) Future work is to build an OhHelp library.
  • lev. 1: load balancing
  • lev. 2: particle transfer
  • lev. 3: inter-subdomain communication
  • lev. 4: semi-automatic PIC code transformation for OhHelp'ing
Breakdown: Strong Scaling @ 256 • exec. time/step [ms] • comm. time/step [ms]
Breakdown: Weak Scaling @ 256 • exec. time/step [ms] • comm. time/step [ms]