High performance parallel computing of climate models towards the Earth Simulator
--- computing science activities at CRIEPI ---
Yoshikatsu Yoshida and Koki Maruyama
Central Research Institute of Electric Power Industry (CRIEPI)
Outline
• Atmosphere model, CCM3
  • evaluation of vector/parallel computing performance
  • improvement of parallel computing performance: load balance, communication, MPI/multi-thread
  • performance prediction on GS40
  • improvement of communication performance
• Ocean model, POP
  • evaluation of vector/parallel computing performance
  • vector/parallel tuning
  • performance prediction on GS40
• Coupled model, CSM-1
  • ported to SX-4 and being ported to VPP5000
  • performance evaluation on SX-4
Performance prediction of CCM3 on GS40 (1)
• Method for performance prediction accounts for (a sketch of how these terms combine follows this slide)
  • communication
  • load imbalance
  • non-parallel (serial) sections and computation overheads
  • dependence of vector performance on vector length
• CCM3.6.6 w/ 2D domain decomposition, assumed
  • resolution: T341 (~40 km)
  • MPI/multi-thread hybrid parallelism
    • e.g., north-south decomposition by MPI, east-west decomposition by OpenMP
• GS40
  • 640 nodes × 8 vector PEs (8 Gflops/PE, 40 Tflops in total)
  • communication bandwidth: 16 GB/s, communication latency: 5 μs
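The slide lists the ingredients of the model without the combining formula. A minimal sketch of one such model, assuming the usual startup-plus-transfer form for communication (all symbols below are our notation, not necessarily the authors'):

```latex
T_{\mathrm{wall}} \approx
    \frac{W}{P\,\eta(\ell)}                            % computation; \eta(\ell): vector efficiency at vector length \ell
  + n_{\mathrm{msg}}\, t_{\mathrm{lat}} + \frac{V}{B}  % communication: startup + transfer
  + T_{\mathrm{imb}} + T_{\mathrm{serial}}             % load imbalance and non-parallel sections
```

Here W is the total floating-point work, P the number of PEs, n_msg the message count, t_lat ≈ 5 μs the startup latency, V the transferred data volume, and B ≈ 16 GB/s the bandwidth.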
Performance prediction of CCM3 on GS40 (2)
• Predicted wallclock time
  • a 100-year integration needs ~10 days; the estimated execution rate is ~1.5 Tflops when using 4096 PEs
  • communication startup is a principal cause of performance degradation → should be improved
How to improve communication performance? (1)
• “all-gather” communication in the original CCM3.6.6
  • each processor (or node) sends its own data to all the other processors (or nodes), and then receives data from all the other processors (or nodes)
  • # of communications per PE is O(P)
• Modification of “all-gather” communication
  • # of communications per PE is reduced from O(P) to O(log P) (a sketch of one such scheme follows)
[figure: data-exchange pattern across processors]
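The slides do not show the modified exchange pattern in detail; recursive doubling is one standard way to reach an O(log P) message count, so the following is a minimal sketch under that assumption (names and block layout are illustrative, not CCM3's):

```c
/* Recursive-doubling all-gather: each PE exchanges its accumulated block
 * with a partner at distance 1, 2, 4, ..., so only log2(P) messages are
 * needed instead of P-1. Power-of-two PE counts assumed for brevity. */
#include <mpi.h>
#include <stdio.h>

static void allgather_recursive_doubling(double *buf, int nelem, MPI_Comm comm)
{
    int rank, nprocs;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &nprocs);

    for (int dist = 1; dist < nprocs; dist *= 2) {
        int partner   = rank ^ dist;         /* exchange partner this step    */
        int my_base   = rank & ~(dist - 1);  /* first block currently held    */
        int peer_base = my_base ^ dist;      /* first block the partner holds */
        /* swap the `dist` blocks each side holds; held data doubles per step */
        MPI_Sendrecv(buf + (size_t)my_base * nelem, dist * nelem, MPI_DOUBLE,
                     partner, 0,
                     buf + (size_t)peer_base * nelem, dist * nelem, MPI_DOUBLE,
                     partner, 0, comm, MPI_STATUS_IGNORE);
    }
}

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank, nprocs;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    enum { NELEM = 4 };                 /* illustrative per-PE block size */
    double buf[64 * NELEM] = {0};       /* room for up to 64 PEs here     */
    for (int i = 0; i < NELEM; i++)
        buf[rank * NELEM + i] = rank;   /* each PE contributes its own block */

    allgather_recursive_doubling(buf, NELEM, MPI_COMM_WORLD);

    if (rank == 0 && nprocs > 1)
        printf("block 1 now holds %g (sent by PE 1)\n", buf[1 * NELEM]);
    MPI_Finalize();
    return 0;
}
```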
How to improve communication performance? (2)
• Estimated performance of “all-gather” communication
  • much improved communication performance expected (a cost comparison follows)
[figures: a) original “all-gather”, b) modified “all-gather”]
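Under the same startup-plus-transfer cost model as above (our assumption), the gain is almost entirely in the startup term: recursive doubling still moves the same (P − 1) blocks per PE, but in log₂ P messages instead of P − 1:

```latex
T_{\mathrm{orig}} \approx (P-1)\,(t_{\mathrm{lat}} + m\,t_w), \qquad
T_{\mathrm{mod}}  \approx (\log_2 P)\, t_{\mathrm{lat}} + (P-1)\, m\, t_w
```

where m is the per-PE block size and t_w the per-byte transfer time, so for latency-dominated messages the speedup approaches (P − 1)/log₂ P.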
Predicted performance of original and modified CCM3
• CCM3 turnaround time w/ and w/o the modified “all-gather”
  • a 100-year integration can be done within a week
  • 2.2 Tflops expected when using 4096 PEs
[figures: a) w/ original “all-gather”, b) w/ modified “all-gather”]
Vector computing performance of POP (ocean model)
• Vector computing performance on SX-4
  • 192 × 128 × 20 grid
  • vector processor of SX-4
    • peak rate: 2 Gflops
    • vector register length: 256 words
  • relatively minor modifications for vectorization resulted in good performance (an illustrative example follows)
  • POP is well suited even to vector platforms
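The slides do not show the tuned loops; as an illustration of the kind of minor change involved (our example, not POP source), collapsing the horizontal loops gives the compiler one long inner loop that fills the 256-word vector registers, with land points handled by a mask instead of branches:

```c
/* Illustrative only: a 192-long inner loop underuses 256-word vector
 * registers, while a collapsed 192*128-long loop vectorizes cleanly.
 * mask[ij] is 0.0 on land and 1.0 on ocean, avoiding branches. */
enum { NX = 192, NY = 128 };

void update_layer(double *restrict t_new, const double *restrict t_old,
                  const double *restrict tend, const double *restrict mask,
                  double dt)
{
    for (int ij = 0; ij < NX * NY; ij++)      /* single long vector loop */
        t_new[ij] = t_old[ij] + dt * mask[ij] * tend[ij];
}
```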
Parallel computing performance of modified POP (1)
• Number of simulated years that can be integrated within a day
  • 192 × 128 × 20 grid, measured on SX-4
  • 1.5-fold speedup achieved by vector/parallel tuning
Parallel computing performance of modified POP (2)
• Parallel efficiency at various model resolutions
  • measured on SX-4
  • efficiency on 16 PEs reaches 80% in the 768 × 512 × 20 grid case
  • communication in the PCG solver is the performance bottleneck (see the sketch below)
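The reason the PCG (preconditioned conjugate-gradient) solver for the barotropic mode becomes communication-bound is that every iteration needs global dot products; a minimal sketch of that step, with illustrative names rather than POP's own:

```c
/* Each CG iteration requires global dot products, i.e. an MPI_Allreduce
 * over all PEs, plus a halo exchange for the matrix-vector product. The
 * reduction cost grows with latency and PE count but not with the local
 * grid size, so it dominates as PEs are added. */
#include <mpi.h>

double global_dot(const double *a, const double *b, int n, MPI_Comm comm)
{
    double local = 0.0, global = 0.0;
    for (int i = 0; i < n; i++)
        local += a[i] * b[i];                      /* local partial sum    */
    MPI_Allreduce(&local, &global, 1, MPI_DOUBLE,  /* one global reduction */
                  MPI_SUM, comm);                  /* per dot product      */
    return global;
}
```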
Performance prediction method of POP on GS40 (1)
• Execution time of the POP model consists of
  • computation time
  • communication time (startup, transfer)
  since there is no significant load imbalance (land and bottom topography are handled by mask operations)
• Computation time is estimated from timing results for the decomposed sub-domain
  • measurements were done on a single processor of SX-4
• Communication time is estimated from the # of communications and the amount of transferred data
  • latency ~5 μs and bandwidth ~16 GB/s assumed (a sketch of the resulting model follows)
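A minimal sketch of the resulting timing model with the slide's assumed latency and bandwidth plugged in (function names, variable names, and the numbers in main() are ours, purely for illustration):

```c
/* Sketch of the POP prediction: per-step time = measured single-PE compute
 * time for the decomposed sub-domain + message count * latency + bytes
 * moved / bandwidth. */
#include <stdio.h>

static double predict_step_time(double t_comp,  /* s, measured sub-domain compute time */
                                double n_msgs,  /* messages per step per PE            */
                                double bytes)   /* bytes transferred per step per PE   */
{
    const double t_lat = 5.0e-6;   /* ~5 us startup, assumed  */
    const double bw    = 16.0e9;   /* ~16 GB/s, assumed       */
    return t_comp + n_msgs * t_lat + bytes / bw;
}

int main(void)
{
    double t = predict_step_time(2.0e-3, 40.0, 1.0e6);  /* illustrative inputs */
    printf("predicted time per step: %.3e s\n", t);
    return 0;
}
```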
Performance prediction method of POP on GS40 (2)
• POP performance on SX-4
  • predicted results agreed very well with observations
Predicted performance of modified POP on GS40 (1)
• COWbench grid (large): 992 × 1280 × 40
  • parallel efficiency ~16% and 0.8 Tflops expected on 2048 PEs
[figures: a) wallclock time per simulated day, b) Tflops and parallel efficiency]
Predicted performance of modified POP on GS40 (2)
• ~1/10 degree model (3072 × 2048 × 40)
  • a 100-year integration can be done in 8 days when using 4096 PEs
  • predicted execution rate reaches 3 Tflops
[figure: predicted performance range, maximum/minimum curves]
Predicted performance of modified POP on GS40 (3)
• Prediction of turnaround time for the POP model (~1/10 deg)
  • communication startup cost (latency) is a main cause of performance degradation, even for the POP model
Summary
• Performance prediction of CCM3 on GS40
  • ~7 days per simulated century with a minor modification of the “all-gather” communication; 2.2 Tflops expected
• Performance evaluation of POP on SX-4
  • the POP code can sustain ~50% of the peak rate of the SX-4 vector processor
  • communication in the CG solver for the barotropic mode is the bottleneck of its parallel computing performance
• Performance prediction of POP on GS40
  • COWbench grid (large): ~16% efficiency, 0.8 Tflops expected
  • ~1/10 degree model: 100 years can be integrated in 8 days
• In terms of CPU resource requirements, a coupled simulation using a ~T341 atmosphere model and a ~1/10 degree ocean model is well matched to GS40 (a simulated century takes ~10 days)