High performance parallel computing of climate models towards the Earth Simulator
--- computing science activities at CRIEPI ---
Yoshikatsu Yoshida and Koki Maruyama
Central Research Institute of Electric Power Industry (CRIEPI)
Outline
• Atmosphere model, CCM3
  • evaluation of vector/parallel computing performance
  • improvement of parallel computing performance
    • load balance, communication, MPI/multi-thread
  • performance prediction on GS40
  • improvement of communication performance
• Ocean model, POP
  • evaluation of vector/parallel computing performance
  • vector/parallel tuning
  • performance prediction on GS40
• Coupled model, CSM-1
  • ported to SX-4 and being ported to VPP5000
  • performance evaluation on SX-4
Performance prediction of CCM3 on GS40 (1)
• Method for performance prediction accounts for (see the sketch below):
  • communication
  • load imbalance
  • non-parallel (serial) sections and computation overheads
  • dependence of vector performance on vector length
• CCM3.6.6 w/ 2D domain decomposition is assumed
  • resolution: T341 (~40 km)
  • MPI/multi-thread hybrid parallelism
    • e.g., north-south decomposition by MPI, and east-west decomposition by OpenMP
• GS40
  • 640 nodes x 8 vector PEs (8 Gflops/PE, 40 Tflops in total)
  • communication bandwidth: 16 GB/s, communication latency: 5 ms
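As a rough illustration of how these four factors can be combined into a wallclock estimate, the sketch below uses the GS40 parameters quoted above; the workload numbers and the vector-efficiency form are assumptions for illustration, not the actual CRIEPI prediction code.

```c
/* Minimal sketch of a per-step wallclock model built from the four factors
 * listed above.  All workload values and the vector-efficiency form are
 * illustrative assumptions, not the actual CRIEPI prediction code. */
#include <stdio.h>

/* one common way to model vector performance vs. vector length
 * (half-performance-length form); assumed, not taken from the talk */
static double vector_efficiency(double vlen, double n_half)
{
    return vlen / (vlen + n_half);
}

int main(void)
{
    const double peak_pe   = 8.0e9;   /* flop/s per GS40 PE (8 Gflops, from the slide) */
    const double latency   = 5.0e-3;  /* comm startup in s, as quoted on the slide     */
    const double bandwidth = 16.0e9;  /* B/s, as quoted on the slide                   */

    /* assumed per-step workload for one PE (illustrative values) */
    double flops_per_pe = 4.0e9;
    double t_serial     = 0.02;       /* non-parallel sections + overheads */
    double imbalance    = 0.10;       /* 10% load imbalance                */
    int    n_msgs       = 100;
    double bytes        = 2.0e8;

    double t_comp = flops_per_pe / (peak_pe * vector_efficiency(256.0, 60.0));
    double t_comm = n_msgs * latency + bytes / bandwidth;
    double t_step = t_comp * (1.0 + imbalance) + t_serial + t_comm;

    printf("estimated time per step: %.3f s (comm: %.3f s)\n", t_step, t_comm);
    return 0;
}
```

With numbers of this order the communication startup term dominates, which is consistent with the conclusion drawn on the next slide.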
Performance prediction of CCM3 on GS40 (2)
• Predicted wallclock time
  • a 100-year integration needs ~10 days
  • estimated execution rate is ~1.5 Tflops when using 4096 PEs
  • communication startup is a principal cause of performance degradation → should be improved
How to improve communication performance? (1)
• “all-gather” communication in the original CCM3.6.6
  • each processor (or node) sends its own data to all the other processors (or nodes), and then receives data from all the other processors (or nodes)
  • # of communications per PE is O(P)
• Modification of “all-gather” communication
  • # of communications per PE is reduced from O(P) to O(log P) (see the sketch below)
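The O(P) → O(log P) reduction suggests a recursive-doubling exchange pattern; below is a minimal sketch of such an all-gather in C with MPI, assuming the number of PEs is a power of two and a fixed block size per PE. The function name and buffer layout are illustrative assumptions, not the actual CCM3 modification.

```c
/* Hypothetical sketch: recursive-doubling all-gather.  Assumes the number of
 * processes is a power of two and every process contributes `blocklen`
 * doubles.  Illustration only, not the CCM3 code. */
#include <mpi.h>
#include <string.h>

void allgather_recursive_doubling(const double *sendbuf, double *recvbuf,
                                  int blocklen, MPI_Comm comm)
{
    int rank, nprocs;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &nprocs);

    /* place our own block at its final position in the full buffer */
    memcpy(recvbuf + rank * blocklen, sendbuf, blocklen * sizeof(double));

    /* log2(P) exchange steps; the exchanged data doubles every step */
    for (int step = 1; step < nprocs; step <<= 1) {
        int partner   = rank ^ step;             /* exchange partner            */
        int my_base   = rank    & ~(step - 1);   /* first block we already hold */
        int peer_base = partner & ~(step - 1);   /* first block we will receive */

        MPI_Sendrecv(recvbuf + my_base   * blocklen, step * blocklen, MPI_DOUBLE,
                     partner, 0,
                     recvbuf + peer_base * blocklen, step * blocklen, MPI_DOUBLE,
                     partner, 0, comm, MPI_STATUS_IGNORE);
    }
}
```

The same pattern is what optimized MPI libraries typically use inside MPI_Allgather: the total data volume stays the same, but each PE pays only O(log P) message startups instead of O(P).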
How to improve communication performance? (2)
• Estimated performance of “all-gather” communication
  • much improved communication performance is expected
  (figures: a) original “all-gather”, b) modified “all-gather”)
Predicted performance of original and modified CCM3
• CCM3 turnaround time w/ and w/o modified “all-gather”
  • 100-year integration can be done within a week
  • 2.2 Tflops expected when using 4096 PEs
  (figures: a) w/ original “all-gather”, b) w/ modified “all-gather”)
Vector computing performance of POP (Ocean model)
• Vector computing performance on SX-4
  • 192 x 128 x 20 grid division
  • vector processor of SX-4
    • peak rate: 2 Gflops
    • length of vector register: 256 words
  • relatively minor modifications for vectorization resulted in good performance (see the sketch below)
  • POP is well suited even to vector platforms
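As a hypothetical illustration of a "minor modification" of this kind (not taken from the POP source), collapsing short nested horizontal loops into one long inner loop helps the compiler fill the 256-word vector registers:

```c
/* Hypothetical illustration only.  The grid extents match the slide's
 * 192 x 128 horizontal division; the field names are made up. */
#define NX 192
#define NY 128

void update_collapsed(double *t, const double *forcing, double dt)
{
    /* Before: nested loops over j (NY) and i (NX), so the vectorized inner
     * loop has length at most NX.  After collapsing, the single loop of
     * length NX*NY keeps the vector pipelines full with long vectors. */
    for (int ij = 0; ij < NX * NY; ij++)
        t[ij] += dt * forcing[ij];
}
```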
Parallel computing performance of modified POP (1)
• # of simulated years that can be integrated within a day
  • 192 x 128 x 20 grid division, measured on SX-4
  • 1.5-fold speedup achieved by vector/parallel tunings
Parallel computing performance of modified POP (2)
• Parallel efficiency at various model resolutions
  • measured on SX-4
  • efficiency on 16 PEs reaches 80% in the 768 x 512 x 20 grid case
  • communication in the PCG solver is a performance bottleneck (see the sketch below)
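The sketch below shows schematically why a parallel PCG iteration is communication-bound: besides a nearest-neighbour halo exchange, every iteration needs global dot products, i.e. an MPI_Allreduce across all PEs. The helper names are hypothetical; this is not the POP solver itself.

```c
/* Schematic of the communication pattern of one parallel CG iteration.
 * Helper routines are assumed to exist; vector updates are omitted. */
#include <mpi.h>

extern void   halo_exchange(double *x, MPI_Comm comm);   /* assumed helpers */
extern void   apply_operator(const double *x, double *ax);
extern double local_dot(const double *a, const double *b);

void cg_iteration_costs(double *x, double *r, double *p, double *ap,
                        int niter, MPI_Comm comm)
{
    for (int it = 0; it < niter; it++) {
        halo_exchange(p, comm);       /* nearest-neighbour communication      */
        apply_operator(p, ap);        /* local, vectorizable computation      */

        double loc[2] = { local_dot(r, r), local_dot(p, ap) }, glob[2];
        /* global reduction executed every iteration on every PE: a
         * latency-bound collective, and the main scaling bottleneck */
        MPI_Allreduce(loc, glob, 2, MPI_DOUBLE, MPI_SUM, comm);

        double alpha = glob[0] / glob[1];
        (void)alpha; (void)x;         /* x, r, p updates omitted in sketch    */
    }
}
```

Because the reduction cost is dominated by message startup rather than data volume, it grows relative to the shrinking per-PE computation as more PEs are used.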
Performance prediction method of POP on GS40 (1)
• Execution time of the POP model consists of
  • computation time
  • communication time (startup, transfer)
  because there is no significant load imbalance (land and bottom topography are treated by mask operations)
• Computation time is estimated from timing results for a decomposed sub-domain
  • measurements were done on a single processor of SX-4
• Communication time is estimated from the # of communications and the amount of transferred data
  • latency ~5 msec, bandwidth ~16 GB/s assumed
  (see the sketch below)
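A minimal sketch of this estimation procedure follows. The latency and bandwidth are the values quoted above; the sub-domain timing, peak-rate scaling, and message counts are illustrative assumptions, not CRIEPI's actual prediction code.

```c
/* Minimal sketch of the POP time estimate: computation from measured
 * sub-domain timings, communication from message counts and data volume.
 * All numerical inputs other than latency/bandwidth are assumed. */
#include <stdio.h>

int main(void)
{
    const double latency   = 5.0e-3;    /* s per message, as quoted           */
    const double bandwidth = 16.0e9;    /* B/s, as quoted                     */

    /* computation: per-grid-point time measured on one SX-4 processor,
     * scaled by the SX-4 (2 Gflops) to GS40 PE (8 Gflops) peak ratio and by
     * the sub-domain each GS40 PE would own (assumed values)               */
    double t_per_point_sx4 = 2.0e-7;    /* s per grid point per step          */
    double sx4_to_gs40     = 2.0 / 8.0;
    long   points_per_pe   = 48L * 32L * 40L; /* e.g. 3072x2048x40 on 4096 PEs */
    double t_comp = t_per_point_sx4 * sx4_to_gs40 * points_per_pe;

    /* communication: message count and transferred bytes per step (assumed) */
    int    n_msgs = 50;                 /* halo exchanges + reductions        */
    double bytes  = 1.0e7;
    double t_comm = n_msgs * latency + bytes / bandwidth;

    printf("per-step estimate: comp %.4f s, comm %.4f s\n", t_comp, t_comm);
    return 0;
}
```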
Performance prediction method of POP on GS40 (2)
• POP performance on SX-4
  • predicted results agreed very well with observations
Predicted performance of modified POP on GS40 (1)
• COWbench grid (large): 992 x 1280 x 40
  • parallel efficiency ~16% and 0.8 Tflops expected on 2048 PEs
  (figures: a) wallclock time per simulated day, b) Tflops and parallel efficiency)
Predicted performance of modified POP on GS40 (2)
• ~1/10 degree model (3072 x 2048 x 40)
  • 100-year integration can be done in 8 days when using 4096 PEs
  • predicted execution rate reaches 3 Tflops
  (figure legend: maximum, minimum)
Predicted performance of modified POP on GS40 (3)
• Prediction of turnaround time for the POP model (~1/10 deg)
  • communication startup cost (latency) is a main cause of performance degradation, even in the case of the POP model
Summary
• Performance prediction of CCM3 on GS40
  • ~7 days per simulated century, w/ a minor modification of the “all-gather” communication; 2.2 Tflops expected
• Performance evaluation of POP on SX-4
  • the POP code can sustain ~50% of the peak rate of the SX-4’s vector processor
  • communication in the CG solver for the barotropic mode is a bottleneck of its parallel computing performance
• Performance prediction of POP on GS40
  • COWbench grid (large): ~16% efficiency, 0.8 Tflops expected
  • ~1/10 degree model: 100 years can be integrated in 8 days
• In terms of CPU resource requirements, a coupled simulation using a ~T341 atmosphere model and a ~1/10 degree ocean model is well matched to GS40 (a simulated century takes ~10 days)