Project Athena: Technical Issues Larry Marx and the Project Athena Team
Outline • Project Athena Resources • Models and Machine Usage • Experiments • Running Models • Initial and Boundary Data Preparation • Post Processing, Data Selection and Compression • Data Management
Project Athena Resources • Athena (4,512 nodes @ 4 cores, 2 GB memory): dedicated, Oct’09 – Mar’10, 79 million core-hours • Verne (5 nodes @ 32 cores, 128 GB memory): dedicated, Oct’09 – Mar’10, post-processing • Kraken (8,256 nodes @ 12 cores, 16 GB memory): shared, Oct’09 – Mar’10, 5 million core-hours • Storage: read-only scratch 78 TB (Lustre), nakji 360 TB (Lustre), homes 8 TB (NFS), 800+ TB HPSS tape archive
Models and Machine Usage • NICAM was initially the primary focus of implementation • Limited flexibility in scaling, due to the icosahedral grid • Limited testing on multicore/cache processor architectures; production primarily on the vector-parallel (NEC SX) Earth Simulator • Step 1: Port a low-resolution version with simple physics to Athena • Step 2: Determine the highest resolution possible on Athena and the minimum and maximum number of cores to be used • Unique solution: G-level = 10, i.e. 10,485,762 cells (7-km spacing), using exactly 2,560 cores • Step 3: Initially, NICAM jobs failed frequently due to improper namelist settings. During a visit by U. Tokyo and JAMSTEC scientists to COLA, new settings were determined that generally ran with little trouble. However, the 2003 case could never be stabilized and was abandoned.
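For reference, the G-level = 10 cell count and the quoted 7-km spacing follow from the icosahedral refinement formula; the short Python sketch below reproduces the arithmetic (the 6371-km mean Earth radius is our assumption, not stated on the slide).

```python
import math

# NICAM icosahedral grid: each refinement level quadruples the cells of the
# 10 base rhombuses; the 2 polar points are added once.
def nicam_cells(glevel: int) -> int:
    return 10 * 4**glevel + 2

EARTH_RADIUS_KM = 6371.0  # assumed mean Earth radius

cells = nicam_cells(10)                                 # 10,485,762 cells
area_per_cell = 4 * math.pi * EARTH_RADIUS_KM**2 / cells
spacing_km = math.sqrt(area_per_cell)                   # ~7 km mean spacing

print(cells, round(spacing_km, 1))
```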
Models and Machine Usage (cont’d) • IFS’s flexible scalability sustains good performance for the higher-resolution configurations (T1279 and T2047) using 2,560 processor cores • We defined one “slot” as 2,560 cores and managed a mix of NICAM and IFS jobs, one job per slot, for maximally efficient use of the resource • Having equal-size slots for both models permits either model to be queued and run in the event of a job failure • Selected jobs were given higher priority so that they continued to run ahead of others • Machine partition: 7 slots of 2,560 cores = 17,920 cores out of 18,048 • 99% machine utilization • 128 processors reserved for pre- and post-processing and as spares (to postpone reboots) • Lower-resolution IFS experiments (T159 and T511) were run on Kraken • IFS runs were initially made by COLA; once the ECMWF SMS model-management system was installed, runs could be made by either COLA or ECMWF.
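The slot arithmetic is easy to check; a minimal sketch of the partition and utilization numbers quoted above:

```python
CORES_PER_NODE = 4
NODES = 4512
SLOT_CORES = 2560                        # one NICAM or IFS job per slot

total_cores = NODES * CORES_PER_NODE     # 18,048 cores on Athena
slots = total_cores // SLOT_CORES        # 7 slots
used = slots * SLOT_CORES                # 17,920 cores running models
spare = total_cores - used               # 128 cores for pre/post-processing
utilization = used / total_cores         # ~0.993, i.e. the quoted ~99%

print(slots, used, spare, f"{utilization:.1%}")
```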
Initial and Boundary Data Preparation • IFS: • Most input data prepared by ECMWF; large files shipped by removable disk • Time Slice experiment input data prepared by COLA • NICAM: • Initial data from GDAS 1° files, available for all dates • Boundary files other than SST included with NICAM • SST from the ¼° NCDC OI daily analysis (version 2). Data starting 1 June 2002 include in situ, AVHRR (IR), and AMSR-E (microwave) observations; earlier data do not include AMSR-E • All data interpolated to the icosahedral grid
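As an illustration of the final interpolation step only, and not the project's actual preprocessing code, a nearest-neighbour remap of a regular ¼° SST field onto icosahedral cell centres might look like the following sketch; the function and array names, and the use of a SciPy KD-tree, are assumptions.

```python
import numpy as np
from scipy.spatial import cKDTree

def lonlat_to_xyz(lon_deg, lat_deg):
    """Convert longitude/latitude in degrees to unit-sphere Cartesian points."""
    lon, lat = np.radians(lon_deg), np.radians(lat_deg)
    return np.column_stack((np.cos(lat) * np.cos(lon),
                            np.cos(lat) * np.sin(lon),
                            np.sin(lat)))

def remap_to_icosahedral(sst, src_lon, src_lat, cell_lon, cell_lat):
    """Nearest-neighbour remap of a regular lat-lon SST field (nlat x nlon)
    onto icosahedral cell centres; a stand-in for the project's actual tool."""
    lon2d, lat2d = np.meshgrid(src_lon, src_lat)
    tree = cKDTree(lonlat_to_xyz(lon2d.ravel(), lat2d.ravel()))
    _, idx = tree.query(lonlat_to_xyz(cell_lon, cell_lat))
    return sst.ravel()[idx]
```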
Post Processing, Data Selection and Compression • All IFS (GRIB-1) data interpolated (coarsened) to the N80 reduced grid for common comparison among the resolutions and with the ERA-40 data. All IFS spectral data truncated to T159 coefficients and transformed to the N80 full grid. • Key fields at full model resolution were also processed, including transforming spectral coefficients to grids and compression to NetCDF-4 via GrADS. • Processing was done on Kraken, because Athena lacks sufficient memory and computing power on each node. • All the common-comparison and selected high-resolution data were electronically transferred to COLA via bbcp (up to 40 MB/s sustained).
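The compression itself was done via GrADS; purely to illustrate the idea, the sketch below writes a 2-D field as deflate-compressed NetCDF-4 with the netCDF4-python library. The variable and dimension names are invented.

```python
import numpy as np
from netCDF4 import Dataset

def write_compressed(path, field, lats, lons):
    """Write a 2-D field as deflate-compressed NetCDF-4 (illustrative only;
    the project used GrADS for this step)."""
    with Dataset(path, "w", format="NETCDF4") as nc:
        nc.createDimension("lat", lats.size)
        nc.createDimension("lon", lons.size)
        nc.createVariable("lat", "f4", ("lat",))[:] = lats
        nc.createVariable("lon", "f4", ("lon",))[:] = lons
        var = nc.createVariable("t2m", "f4", ("lat", "lon"),
                                zlib=True, complevel=4, shuffle=True)
        var[:] = field

# Example with synthetic data:
lats = np.linspace(-89.5, 89.5, 180)
lons = np.linspace(0.5, 359.5, 360)
write_compressed("t2m_n80_demo.nc", np.zeros((180, 360), dtype="f4"), lats, lons)
```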
Post Processing, Data Selection and Compression (cont’d) • Nearly all (91) NICAM diagnostic variables were saved. Each variable was saved in 2,560 separate files, one per model domain, resulting in over 230,000 files. The number of files quickly saturated the Lustre file system. • The original program to interpolate data to a regular lat-lon grid had to be revised to use less I/O and to multithread, thereby eliminating a processing backlog. • Selected 3-d fields were interpolated from z-coordinate to p-coordinate levels. • Selected 2- and 3-d fields were compressed (NetCDF-4) and electronically transferred to COLA. • All selected fields were coarsened to the N80 full grid.
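A minimal sketch of the z-to-p step for one 3-D field, done column by column with numpy.interp; the array names, and the assumption that pressure decreases monotonically with height in each column, are ours rather than taken from the slide.

```python
import numpy as np

def z_to_p_levels(field_z, p_z, p_targets):
    """Interpolate a field on model z-levels (nz, ny, nx) to fixed pressure
    levels. p_z holds the pressure at each z-level grid point. np.interp
    needs increasing abscissae, so each column is flipped because pressure
    decreases with height."""
    nz, ny, nx = field_z.shape
    out = np.empty((len(p_targets), ny, nx))
    for j in range(ny):
        for i in range(nx):
            out[:, j, i] = np.interp(p_targets,
                                     p_z[::-1, j, i],
                                     field_z[::-1, j, i])
    return out
```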
Data Management: NICS • All data archived to HPSS, approaching 1 PB in total • Workflow required complex data movement: • All model runs at high resolution done on Athena • Model output stored on scratch or nakji, and all copied to tape on HPSS • IFS data interpolation/truncation done directly from retrieved HPSS files • NICAM data processed using Verne and nakji (more capable CPUs and larger memory)
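A schematic of the archive and retrieve legs of that workflow; the paths are hypothetical and the basic hsi put/get calls are shown only to indicate the kind of commands involved, not the project's actual scripts.

```python
import subprocess

def archive_to_hpss(local_path: str) -> None:
    """Copy a model-output file from scratch/nakji to HPSS tape.
    'hsi put <file>' is the basic HSI archive command; real scripts would
    also handle directories, logging, and retries."""
    subprocess.run(["hsi", "put", local_path], check=True)

def retrieve_from_hpss(hpss_path: str) -> None:
    """Stage an archived IFS file back from HPSS before interpolation/truncation."""
    subprocess.run(["hsi", "get", hpss_path], check=True)
```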
Data Management: COLA • The Athena project was allocated 50 TB (26%) on COLA disk servers. • Considerable discussion and judgment were required to down-select variables from IFS and NICAM, based on factors including scientific use and data compressibility. • A large directory structure was needed to organize the data, particularly for IFS, with its many resolutions, sub-resolutions, data forms and ensemble members.
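An entirely hypothetical illustration of the kind of hierarchy involved, sketched with pathlib; the real COLA directory names were not given, and the resolution and data-form labels are borrowed from earlier slides.

```python
from pathlib import Path

# Hypothetical layout: model / resolution / ensemble member / data form.
ROOT = Path("athena_data")
LAYOUT = {"ifs": ["T159", "T511", "T1279", "T2047"],
          "nicam": ["glevel10"]}

for model, resolutions in LAYOUT.items():
    for res in resolutions:
        for member in ["e01"]:
            for form in ["native", "N80", "pressure_levels"]:
                (ROOT / model / res / member / form).mkdir(parents=True,
                                                           exist_ok=True)
```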
Data Management: Future • New machines at COLA and NICS will permit further analysis not currently possible due to lack of memory and compute power. • Some or all of the data will eventually be made publicly available, once long-term disposition is determined. • TeraGrid Science Portal?? • Earth System Grid??
Summary • A large, international team of climate and computer scientists, using dedicated and shared resources, introduces many challenges for production computing, data analysis and data management • The sheer volume and complexity of the data “break” everything: • Disk capacity • File name space • Bandwidth connecting systems within NICS • HPSS tape capacity • Bandwidth to remote sites for collaborating groups • Software for analysis and display of results (GrADS modifications) • COLA overcame these difficulties as they were encountered, in 24×7 production mode, to avoid leaving a dedicated computer idle.