This project aims to enhance the performance of Arctic forecast models on advanced architectures. It focuses on the sea ice model (CICE), the global ocean model (HYCOM), and the wave model (WaveWatch III), addressing challenges in computational intensity, parallelism, and data transfer to enable more accurate predictions of polar ice and global ocean conditions.
Accelerated Prediction of Polar Ice and Global Ocean (APPIGO): Overview
Phil Jones (LANL), Eric Chassignet (FSU), Elizabeth Hunke and Rob Aulwes (LANL), Alan Wallcraft and Tim Campbell (NRL-SSC), Mohamed Iskandarani and Ben Kirtman (Univ. Miami)
Arctic Prediction • Polar amplification • Rapid ice loss, feedbacks • Impacts on global weather • Human activities • Infrastructure, coastal erosion, permafrost melt • Resource extraction • Shipping • Security/safety, staging • Regime change • Thin ice leads to more variability [Photos: Shell's Kulluk Arctic oil rig aground in the Gulf of Alaska (USCG photo); LNG carrier Ob River in a winter crossing with icebreakers]
Interagency Arctic efforts • Earth System Prediction Capability (ESPC) Focus Area • Sea ice prediction: up to seasonal • Sea Ice Prediction Network (SIPN) • Sea Ice Outlook • This project – enabling better prediction through model performance
APPIGO • Enhance performance of Arctic forecast models on advanced architectures with a focus on: • Los Alamos CICE – sea ice model • HYCOM – global ocean model • WaveWatch III – wave model • Components of Arctic Cap Nowcast/Forecast System (ACNFS), Global Ocean Forecast System (GOFS)
Proposed Approach • Refactoring: incremental • Profile • Accelerate sections (slower at first) • Expand accelerated sections • Can test along the way • Try directive-based and other approaches • Optimized • Best possible for specific kernels • Abstractions, larger-scale changes (data structures) • In parallel: optimized operator library • Stennis (HYCOM, Phi/many-core), LANL (GPU, CICE, HYCOM), Miami (operators), FSU (validation, science)
APPIGO proposed timeline • Year 1 • Initial profiling • Initial acceleration (deceleration!) • CICE: GPU • HYCOM: GPU, Phi (MIC) • WW3: hybrid scalability • Begin operator libs • Year 2 • Continued optimization • Expand accelerated regions (change sign) • Abstractions, operator lib • Year 3 • Deploy in models and validate with science
Focus on CICE: Challenges • CICE • Dynamics (EVP rheology) • Transport • Column physics (thermodynamics, ridging, etc.) • Quasi-2D • Number of levels and thickness classes is small • Parallelism • Not enough in the horizontal domain decomposition alone • Computational intensity • Maybe not enough work for efficient kernels • BGC and new improvements help
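Below is a minimal sketch (C with OpenACC, not actual CICE code) of one way to expose more parallelism than the horizontal decomposition alone provides: collapsing the small thickness-category loop with the horizontal-cell loop. The routine name, array layout, and the update itself are illustrative assumptions.

```c
/* Hedged sketch, not CICE source: expose more parallelism for column
 * physics by collapsing the (small) thickness-category loop with the
 * horizontal-cell loop. Names and sizes are illustrative only. */
#define NCAT   5        /* thickness categories: small             */
#define NCELLS 100000   /* horizontal cells in one block           */

void column_physics_step(double *restrict t_ice,      /* [NCAT][NCELLS] */
                         const double *restrict flux, /* [NCELLS]       */
                         double dt)
{
    /* collapse(2) gives NCAT*NCELLS independent work items instead of
     * only NCELLS, which helps keep a GPU busy despite the quasi-2D
     * nature of the column physics. */
    #pragma acc parallel loop collapse(2) copy(t_ice[0:NCAT*NCELLS]) copyin(flux[0:NCELLS])
    for (int n = 0; n < NCAT; n++) {
        for (int i = 0; i < NCELLS; i++) {
            /* stand-in for the real thermodynamic update */
            t_ice[n * NCELLS + i] += dt * flux[i];
        }
    }
}
```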
Accelerating CICE with OpenACC • Focused on dynamics • Halo updates presented a significant challenge • Attempted to use GPUDirect to avoid extra GPU-CPU data transfers • What we tried • Refactored loops to get more computation onto the GPU • Fused separate kernels • Used OpenACC streams (async queues) to get concurrent execution and hide data-transfer latencies
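A hedged sketch of the async-queue idea described above, in C with OpenACC rather than the model's Fortran; the routine, array names, and the update are illustrative, not the actual CICE dynamics. The point is that the halo rows are computed and copied back on one queue while the interior runs on another, hiding part of the transfer latency behind computation.

```c
/* Hedged sketch, not the actual CICE dynamics: split the halo rows
 * from the interior so the device-to-host copy needed for the MPI
 * halo exchange can overlap with interior computation. */
void evp_like_update(double *restrict u, const double *restrict strain,
                     int nx, int ny)
{
    const int n = nx * ny;

    #pragma acc data copy(u[0:n]) copyin(strain[0:n])
    {
        /* 1. Halo rows first, on queue 1 ... */
        #pragma acc parallel loop async(1)
        for (int i = 0; i < nx; i++) {
            u[i]          += 0.5 * strain[i];           /* south row */
            u[n - nx + i] += 0.5 * strain[n - nx + i];  /* north row */
        }
        /* ... so copying them back for the halo exchange can start
         * while the interior is still being computed. */
        #pragma acc update self(u[0:nx], u[n-nx:nx]) async(1)

        /* 2. Interior rows on queue 2, overlapping the queue-1 copy. */
        #pragma acc parallel loop async(2)
        for (int i = nx; i < n - nx; i++)
            u[i] += 0.5 * strain[i];

        #pragma acc wait   /* both queues done; MPI exchange would follow */
    }
}
```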
HYCOM Progress: Large Benchmark • Standard DoD HPCMP HYCOM 1/25 global benchmark • 9000 by 6595 by 32 layers • Includes typical I/O and data sampling • Benchmark updated from HYCOM version 2.2.27 to 2.2.98 • Land masks in place of do-loop land avoidance • Dynamic vs. static memory allocation
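For readers unfamiliar with the land-mask change, here is a minimal sketch (C, not HYCOM source) of the idea: replace loop bounds or branches that skip land with a precomputed 0/1 sea mask so the loop body is uniform and vectorizes cleanly. The routine and array names are hypothetical.

```c
/* Hedged sketch of the land-mask idea, not HYCOM source: multiply by a
 * precomputed 0/1 sea mask instead of branching around land points, so
 * the loop body is branch-free and easy for the compiler to vectorize. */
void update_layer_thickness(double *restrict h,
                            const double *restrict tend,
                            const double *restrict seamask, /* 1.0 = ocean, 0.0 = land */
                            int n, double dt)
{
    for (int i = 0; i < n; i++)
        h[i] += dt * tend[i] * seamask[i];   /* land points stay unchanged */
}
```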
HYCOM Progress: Large Benchmark • On the Cray XC40: • Using huge pages improves performance by about 3% • Making the first dimension of all arrays a multiple of 8 saved 3-6% • Requires changing only a single number in the run-time patch.input file • ifort -align array64byte • [Plot: total core hours per model day vs. number of cores, across three generations of Xeon cores; no single-core improvement, but 8 vs. 12 vs. 16 cores per socket]
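A small sketch of the padding and alignment idea, written in C rather than HYCOM's Fortran; alloc_field and the dimension names are illustrative assumptions. It rounds the leading dimension up to a multiple of 8 doubles and allocates on a 64-byte boundary, which is the same effect the patch.input change and ifort -align array64byte aim for.

```c
/* Hedged sketch, not HYCOM code: pad the leading (fastest-varying)
 * dimension to a multiple of 8 doubles so every column starts on a
 * 64-byte cache-line / SIMD boundary. */
#include <stdlib.h>

#define ROUND_UP8(n) (((n) + 7) & ~7)   /* next multiple of 8 */

double *alloc_field(int idm, int jdm, int *idm_padded)
{
    *idm_padded = ROUND_UP8(idm);       /* e.g. 1501 -> 1504 */
    double *p = NULL;

    /* 64-byte-aligned base plus a padded leading dimension keeps each
     * column starting on an aligned address. */
    if (posix_memalign((void **)&p, 64,
                       (size_t)(*idm_padded) * (size_t)jdm * sizeof(double)) != 0)
        return NULL;
    return p;
}
```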
HYCOM on Xeon Phi • Standard gx1v6 HYCOM benchmark run in native mode on 48 cores of a single 5120D Phi attached to the Navy DSRC's Cray XC30 • No additional code optimization • Compared to 24 cores of a single Xeon E5-2697v2 node • Individual subroutines run 6 to 13 times slower • Overall, 10 times slower • Memory capacity is too small • I/O is very slow • Native mode is not practical • Decided not to optimize for Knights Corner; Knights Landing is very different • Self-hosted Knights Landing nodes • Up to 72 cores per socket, lots of memory • Scalability of 1/25 global HYCOM makes this a good target • May need additional vector (AVX-512F) optimization • I/O must perform well
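As an illustration of the kind of loop that would need to thread and vectorize well on self-hosted Knights Landing, here is a hedged C/OpenMP sketch (not HYCOM source; the routine and arrays are assumptions): threads across the outer loop, an explicit SIMD hint on the inner loop, and an AVX-512 target at compile time (e.g. icc -qopenmp -xMIC-AVX512).

```c
/* Hedged sketch, not HYCOM source: OpenMP threads over the outer (j)
 * loop and a SIMD hint on the inner (i) loop so the compiler can emit
 * AVX-512 code for the many cores of a Knights Landing socket. */
#include <stddef.h>

void pressure_like_update(double *restrict p, const double *restrict dp,
                          int idm, int jdm)
{
    #pragma omp parallel for
    for (int j = 0; j < jdm; j++) {
        #pragma omp simd
        for (int i = 0; i < idm; i++)
            p[(size_t)j * idm + i] += dp[(size_t)j * idm + i];
    }
}
```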
Validation Case • CESM test case • HYCOM (2.2.35), CICE • Implementation of flux exchange • HYCOM, CICE in G compset • Three 50-year experiments • CORE v2 forcing • HYCOM in CESM w/ CICE • POP in CESM w/ CICE • HYCOM standalone w/ CICE
Lessons Learned • Hosted accelerators suck • Programming models, software stack immature • Couldn't even build at the Hackathon a year ago • Substantial improvement since: can build and run to break-even at the 2015 Hackathon • OpenACC can compete with CUDA, 2-3x speedup • Based on ACME atmosphere experience • GPU Direct • Need to expand accelerated regions beyond a single routine to gain performance (see the data-region sketch below) • We have learned a great deal and gained valuable experience
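The sketch below illustrates what expanding accelerated regions beyond a single routine can look like with OpenACC: an enclosing data region keeps fields resident on the GPU across several accelerated calls, so data moves only at the start and end. This is a generic C example under assumed names (timestep_loop, dynamics_kernel, transport_kernel), not the project's actual code.

```c
/* Hedged sketch: two accelerated routines share one enclosing data
 * region, so the fields stay on the device for the whole time loop. */
static void dynamics_kernel(double *restrict u, const double *restrict v, int n)
{
    /* "present" asserts the data is already on the device,
     * so this call performs no host<->device transfers of its own. */
    #pragma acc parallel loop present(u[0:n], v[0:n])
    for (int i = 0; i < n; i++)
        u[i] += 0.1 * v[i];
}

static void transport_kernel(const double *restrict u, double *restrict v, int n)
{
    #pragma acc parallel loop present(u[0:n], v[0:n])
    for (int i = 0; i < n; i++)
        v[i] += 0.1 * u[i];
}

void timestep_loop(double *u, double *v, int n, int nsteps)
{
    /* One data region spanning both routines: fields are copied to the
     * GPU once, reused every step, and copied back once at the end. */
    #pragma acc data copy(u[0:n], v[0:n])
    {
        for (int step = 0; step < nsteps; step++) {
            dynamics_kernel(u, v, n);
            transport_kernel(u, v, n);
        }
    }
}
```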
APPIGO Final Year • CICE • Continue, expand OpenACC work • Column physics • HYCOM • Revisit OpenACC • Continue work toward Intel Phi • Continue validation/comparison • Coupled and uncoupled
APPIGO Continuation? • Focus on path to operational ESPC model • Continued optimization, but focus on coverage, incorporation into production models • CICE, HYCOM on Phi (threading), GPU (OpenACC) • WWIII? • Science application • Use coupled sims to understand Arctic regime change • Throw Mo under the bus: Abandon stencils • Too fine granularity