Training session 2: Advanced training course on modipsl and libIGCM
November 14th 2013, MdS
Outline • IPSL climate modelling centre (ICMC) presentation • IPSLCM history and perspective • Mini how to use modipsl/libIGCM • Post-processing with libIGCM • Monitoring a simulation • Hands-on
ICMC organisation
PI: J-L Dufresne; Office: L. Bopp, MA Foujols, J. Mignot
Steering committee:
• Modeling platform (IPSL-ESM): Arnaud Caubel (LSCE), Marie-Alice Foujols (IPSL)
• Current and future climate changes: Jean-Louis Dufresne (LMD), Olivier Boucher (LMD)
• Atmospheric and surface physics and dynamics (LMDZ): Frédéric Hourdin (LMD), Laurent Fairhead (LMD)
• Paleoclimate and last millennium: Pascale Braconnot, Masa Kageyama (LSCE)
• Ocean and sea ice physics and dynamics (NEMO, LIM): C. Ethé (IPSL), Claire Lévy, Gurvan Madec (LOCEAN)
• "Near-term" prediction (seasonal to decadal): Eric Guilyardi (LOCEAN), Juliette Mignot (LOCEAN)
• Atmosphere and ocean interactions (IPSL-CM, different resolutions): Sébastien Masson (LOCEAN), Olivier Marti (LSCE)
• Regional climates: Robert Vautard (LSCE), Laurent Li (LMD)
• Atmospheric chemistry and aerosols (INCA, INCA_aer, Reprobus): Anne Cozic (LSCE), M. Marchand (LATMOS)
• Biogeochemical cycles (PISCES): Laurent Bopp (LSCE), Patricia Cadule (IPSL)
• Continental processes (ORCHIDEE): Philippe Peylin (LSCE), Josefine Ghattas (IPSL)
• Evaluation of the models, present-day and future climate change analysis: Sandrine Bony (LMD), Patricia Cadule (IPSL), Marion Marchand (LATMOS), Juliette Mignot (LOCEAN), Jérôme Servonnat (LSCE)
• Data Archive and Access Requirements: Sébastien Denvil (IPSL), Karim Ramage (IPSL)
Outline • IPSL climate modelling centre (ICMC) presentation • IPSLCM history and perspective • Mini how to use modipsl/libIGCM • Post-processing with libIGCM • Monitoring a simulation • Hands-on
IPSLCM history and scientific articles
Timeline of IPCC reports (FAR 1990, SAR 1995, TAR 2001, AR4 2007, AR5 2013), CMIP projects (CMIP1 & 2, CMIP3, CMIP5) and successive IPSL models: IPSL-CM1 (few articles), IPSL-CM2 (some articles), IPSL-CM4 (10+ articles, CMIP3), IPSL-CM5 (30+ articles, CMIP5), IPSL-CM6 (in preparation).
LMDZ: atmospheric component
http://lmdz.lmd.jussieu.fr/?set_language=en
Next LMDZ training session: 9-11 December 2013; registration before 15 November:
http://studs.unistra.fr/studs.php?sondage=1wgk8t9v44nsml27
Short history of IPSL modelhttp://icmc.ipsl.fr/index.php/icmc-models
1979: first Linpack performance list (80 Mflops)
Supercomputers timeline (top500.org): performance multiplied by 10 every 4 years
Complexity and resolution of models IPCC, AR4, WG1, Chap. 1, fig 1.2 and 1.4
top500.org: number of CPUs/cores per system, growing from about 10 in 1993 to 1 000 in 2003 and 100 000 in 2013.
Technical challenges: HPC
• More parallelism in each component:
• MPI: message passing
• hybrid, i.e. MPI/OpenMP: directives and shared memory
• More parallelism in the coupled model:
• 2 or 3 executables, each with MPI or MPI/OpenMP
• more executables with XIOS: I/O servers
• Huge amount of data produced, to be analysed
On the road to IPSL-CM6
• New physical packages: LMDZ, NEMO, ORCHIDEE
• Increased horizontal and vertical resolutions
• Ensembles of simulations
• Longer simulations: paleoclimate
• More complexity: INCA chemistry added
• More processors used in parallel
• New dynamical core: DYNAMICO
• Optimization of I/O and coupling
• Improvement and reliability of libIGCM
Outline • IPSL climate modelling centre (ICMC) presentation • IPSLCM history and perspective • Mini how to use modipsl/libIGCM • Post-processing with libIGCM • Monitoring a simulation • Hands-on
Summary: extract, compile and launch a simulation of a _v5 configuration
• Download modipsl:
  svn co http://forge.ipsl.jussieu.fr/igcmg/svn/modipsl/trunk modipsl
• Extract a configuration (e.g. IPSLCM5_v5):
  cd modipsl/util ; ./model IPSLCM5_v5
• Compile:
  cd modipsl/config/IPSLCM5_v5 ; gmake [resol]
• Create the submission directory:
  cp EXPERIMENT/IPSLCM5/piControl/config.card .
  vi config.card   ### modify at least JobName=MYEXP
  ../../util/ins_job   ### copies the piControl directory into MYEXP with COMP, DRIVER, PARAM
• Launch the simulation:
  cd modipsl/config/IPSLCM5_v5/MYEXP ; ccc_msub Job_MYEXP (TGCC) or llsubmit Job_MYEXP (IDRIS)
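Put end to end, the whole sequence looks like this. A minimal sketch for TGCC/curie, assuming the IPSLCM5_v5 configuration and a placeholder job name MYEXP (the sed line is just a non-interactive alternative to editing config.card with vi):

  # End-to-end sketch (TGCC/curie); MYEXP is a placeholder job name.
  svn co http://forge.ipsl.jussieu.fr/igcmg/svn/modipsl/trunk modipsl
  cd modipsl/util
  ./model IPSLCM5_v5                                  # extract the configuration
  cd ../config/IPSLCM5_v5
  gmake                                               # compile (optionally: gmake [resol])
  cp EXPERIMENT/IPSLCM5/piControl/config.card .
  sed -i 's/^JobName=.*/JobName=MYEXP/' config.card   # or edit JobName with vi
  ../../util/ins_job                                  # creates MYEXP/ with COMP, DRIVER, PARAM and Job_MYEXP
  cd MYEXP
  ccc_msub Job_MYEXP                                  # at IDRIS: llsubmit Job_MYEXP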
Overview: the components' sources live on IPSL cvs/svn servers. modipsl handles the connection, the downloading of a specific configuration and its compilation on the front end. libIGCM handles the simulation set-up (choice and set-up of the physical package) and the job set-up and submission on the computing nodes.
Generic job template: AA_Job, which runs the model period by period (PeriodLength)
libIGCM library: schematic description. An experiment directory EXP00 contains the driver scripts (EXP00/DRIVER) and the component cards (EXP00/COMP).
Job chaining (schematic): the computing job Job_EXP00 resubmits itself from period to period. It triggers the post-processing jobs: rebuild at RebuildFrequency (which then launches pack_output); pack_debug and pack_restart at PackFrequency; create_ts (then monitoring) at TimeSeriesFrequency; create_se (then atlas) at SeasonalFrequency.
TGCC computers and file systems in a nutshell (October 2013)
Computers:
• curie: front-end (login); thin nodes (-q standard), large nodes (-q xlarge), hybrid nodes (-q hybrid) for compute
• airain: front-end and compute nodes
File systems:
• $HOME: small precious files; saved space
• $CCCWORKDIR: sources and small results; IGCM_OUT: MONITORING/ATLAS; saved space; visible from the web via dods/work (cp, dods_cp)
• $SCRATCHDIR: temporary, non-saved space (quotas); REBUILD files; IGCM_OUT: files to be packed and outputs of post-processing jobs
• $CCCSTOREDIR: space on tapes (HPSS robotic tapes, ccc_hsm get); IGCM_OUT: packed results (Output, Analyse, SE and TS); visible from the web via dods/store (dods_cp)
Job sequence at TGCC (curie):
• The computing job Job_EXP00 runs on curie period after period (PeriodLength) and writes to $SCRATCHDIR/IGCM_OUT/XXX/Output, .../Restart, .../Debug and .../REBUILD.
• At RebuildFrequency, the rebuild job (post queue) reassembles the per-process files from $SCRATCHDIR/IGCM_OUT/.../REBUILD.
• At PackFrequency, pack_restart and pack_debug (tar) archive restarts and debug files to $CCCSTOREDIR/IGCM_OUT/.../RESTART and DEBUG; pack_output (ncrcat) concatenates outputs to $CCCSTOREDIR/IGCM_OUT/XXX/Output.
• At TimeSeriesFrequency, create_ts produces the TS; at SeasonalFrequency, create_se produces the SE; both are stored under $CCCSTOREDIR/IGCM_OUT/… and published via dods/store.
• monitoring and atlas write to $CCCWORKDIR, published via dods/work (controlled by DodsCopy=TRUE/FALSE).
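Files on $CCCSTOREDIR may be migrated to tape before you read them. A minimal sketch, assuming the standard TGCC ccc_hsm command; the path below is purely illustrative, replace it with a file from your own IGCM_OUT tree:

  # Check migration status, then recall the file from tape if needed (illustrative path)
  ccc_hsm status $CCCSTOREDIR/IGCM_OUT/IPSLCM5A/PROD/piControl/MYEXP/ATM/Output/MO/MYEXP_18500101_18591231_1M_histmth.nc
  ccc_hsm get    $CCCSTOREDIR/IGCM_OUT/IPSLCM5A/PROD/piControl/MYEXP/ATM/Output/MO/MYEXP_18500101_18591231_1M_histmth.nc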
IDRIS computers and file systems in a nutshell (October 2013)
Computers:
• ada: front-end (login) and compute
• adapp: front-end and post-processing compute
• turing: front-end and compute
File systems:
• $HOME: small precious files, sources and small results; saved space
• $WORKDIR: temporary REBUILD files; IGCM_OUT: files to be packed and outputs of post-processing jobs
• $TMPDIR: temporary, non-saved space
• gaya (robotic tapes, mfput/mfget, dmput/dmget): IGCM_OUT: Output, Analyse, MONITORING/ATLAS; visible from the web via dods (dods_cp)
Job sequence at IDRIS (ada):
• The computing job Job_EXP00 runs on ada period after period (PeriodLength) and writes to $WORKDIR/IGCM_OUT/XXX/Output, .../Restart, .../Debug and .../REBUILD.
• At RebuildFrequency, the rebuild job runs on adapp from $WORKDIR/IGCM_OUT/.../REBUILD.
• At PackFrequency, pack_restart and pack_debug (tar) archive restarts and debug files to gaya:IGCM_OUT/.../RESTART and DEBUG; pack_output (ncrcat) concatenates outputs to gaya:IGCM_OUT/XXX/Output.
• At TimeSeriesFrequency, create_ts runs; at SeasonalFrequency, create_se runs; monitoring and atlas run on adapp. Results in gaya:IGCM_OUT/… are published on dods.idris.fr (DodsCopy=TRUE/FALSE).
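At IDRIS the archived results live on gaya. A minimal sketch, assuming the mfget transfer command available from ada/adapp; the path is illustrative:

  # Copy one archived output file from gaya to the current directory (illustrative path)
  mfget IGCM_OUT/IPSLCM5A/DEVT/pdControl/MYEXP/ATM/Output/MO/MYEXP_20000101_20001231_1M_histmth.nc .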
Outline • IPSL climate modelling centre (ICMC) presentation • IPSLCM history and perspective • Mini how to use modipsl/libIGCM • Post-processing with libIGCM • Monitoring a simulation • Hands-on
Time Series: create_ts.job
• A Time Series is a file containing a single variable over the whole simulation period (ChunckJob2D = NONE) or over shorter chunks for 2D (e.g. ChunckJob2D = 100Y) or 3D (e.g. ChunckJob3D = 50Y) variables.
• The write frequency is defined in the config.card file: TimeSeriesFrequency=10Y means the time series are written every 10 years, over 10-year periods.
• The variables to extract are listed in the COMP/*.card files through the TimeSeriesVars2D and TimeSeriesVars3D options.
• Time Series computed from monthly (or daily) output files are stored on the file server under IGCM_OUT/TagName/[SpaceName]/[ExperimentName]/JobName/Composante/Analyse/TS_MO and TS_DA.
• Bonus: TS_MO_YE (annual-mean time series) are produced for all TS_MO variables.
• You can add or remove variables in the TimeSeries lists according to your needs.
config.card:
  [Post]
  ...
  #D- If you want to produce time series, this flag determines
  #D- frequency of post-processing submission (NONE if you don't want)
  TimeSeriesFrequency=10Y
COMP/lmdz.card:
  [OutputFiles]
  List= (histmth.nc, ${R_OUT_ATM_O_M}/${PREFIX}_1M_histmth.nc, Post_1M_histmth),\
  ...
  [Post_1M_histmth]
  Patches= ()
  GatherWithInternal = (lon, lat, presnivs, time_counter, time_counter_bnds, aire)
  TimeSeriesVars2D = (bils, cldh, ...
  ...
  ChunckJob2D = NONE
  TimeSeriesVars3D = (upwd, lwcon, ...
  ...
  ChunckJob3D = OFF
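For instance, to get monthly time series of an extra 2D variable, extend the list in COMP/lmdz.card. A minimal sketch; precip is shown as an example variable name and must exist in the histmth.nc output of your run:

  [Post_1M_histmth]
  ...
  # add the variable to the existing 2D list
  TimeSeriesVars2D = (bils, cldh, precip, ...)
  ChunckJob2D = NONE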
Intermonitoring : http://webservices.ipsl.jussieu.fr/monitoring/ More details in Appendix
How to add a new variable in MONITORING
• You can add or change the monitored variables by editing the monitoring configuration files. Those files are defined by default for each component.
• The default monitoring files are defined here: ~shared_account/atlas. For example for LMDZ on curie: ~p86ipsl/monitoring01_lmdz_LMD9695.cfg; for LMDZ on adapp: ~rpsl035/monitoring01_lmdz_LMD9695.cfg.
• You can override the monitoring by creating a POST directory inside your configuration: copy a .cfg file there and change it as you wish (the expressions use the Ferret language).
• You can monitor any variable produced as a time series and stored in TS_MO.
• More information (in French): wiki.ipsl.jussieu.fr/IGCMG/Outils/ferret/Monitoring
POST/monitoring01_lmdz_LMD9695.cfg:
  #--------------------------------------------------------------------------------------------------------
  # field | files patterns | files additionnal | operations | title | units | calcul of area
  #--------------------------------------------------------------------------------------------------------
  nettop_global | "tops topl" | LMDZ4.0_9695_grid.nc | "(tops[d=1]-topl[d=2])" | "TOA. total heat flux (GLOBAL)" | "W/m^2" | "aire[d=3]"
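As an illustration, a hypothetical line monitoring the global precipitation average from the precip time series could look like this. A sketch following the same column layout as above; the field name, title and the kg/m2/s to mm/day conversion factor are assumptions to adapt to your variable:

  # field | files patterns | files additionnal | operations | title | units | calcul of area
  precip_global | "precip" | LMDZ4.0_9695_grid.nc | "precip[d=1]*86400" | "Precipitation (GLOBAL)" | "mm/day" | "aire[d=2]"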
Seasonal mean: create_se.job
• A seasonal mean (SE) file contains the average of each month of the year (jan, feb, ...) over a period defined in config.card:
• SeasonalFrequency=10Y: the seasonal means are computed every 10 years.
• SeasonalFrequencyOffset=0: number of years to skip before computing the first seasonal means.
• All files with a requested post-processing (Seasonal=ON in COMP/*.card) are averaged with ncra, then stored in the directory IGCM_OUT/IPSLCM5A/DEVT/pdControl/MyExp/ATM/Analyse/SE. There is one file per SeasonalFrequency.
• The ATLAS jobs are launched by create_se. ATLAS sources: ~rpsl035 and ~p86ipsl/atlas.
• More information (in French): wiki.ipsl.jussieu.fr/IGCMG/Outils/ferret/Atlas
config.card:
  #========================================================================
  #D-- Post -
  [Post]
  ...
  #D- If you want to produce seasonal average, this flag determines
  #D- the period of this average (NONE if you don't want)
  SeasonalFrequency=10Y
  #D- Offset for seasonal average first start dates ; same unit as SeasonalFrequency
  #D- Usefull if you do not want to consider the first X simulation's years
  SeasonalFrequencyOffset=0
COMP/lmdz.card:
  [OutputFiles]
  List=(histmth.nc, ${R_OUT_ATM_O_M}/${PREFIX}_1M_histmth.nc, Post_1M_histmth),\
  ...
  [Post_1M_histmth]
  ...
  Seasonal=ON
Outline • IPSL climate modelling centre (ICMC) presentation • IPSLCM history and perspective • Mini how to use modipsl/libIGCM • Post-processing with libIGCM • Monitoring a simulation • Hands-on
Monitoring the simulation: verification and correction
Monitoring a simulation
• We strongly encourage you to check your simulation frequently while it runs. First of all, check the job status: ccc_mstat (TGCC) or llq (IDRIS).
• Real time limit exceeded: on ada, such jobs are killed without any message.
• RunChecker.job: this tool, provided with libIGCM, lets you find out the status of your simulations.
• One historical simulation (156 years, 1850-2005) is made of about 50 computing jobs and 1000 post-processing jobs.
Documentation: http://forge.ipsl.jussieu.fr/igcmg/wiki/platform/en/documentation
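A minimal sketch of the status check on both centres; the -u option to restrict the listing to your own jobs is an assumption to verify against the local documentation:

  # TGCC (curie): list your batch jobs
  ccc_mstat -u $USER
  # IDRIS (ada, LoadLeveler): list your batch jobs
  llq -u $USER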
RunChecker.job • RunChecker.job helps you to monitor all the jobs produced by libIGCM for a simulation
RunChecker.job: usage and options
This script can be launched from anywhere.
Usage:
  path/to/libIGCM/RunChecker.job [-u user] [-q] [-j n] [-s] job_name
  path/to/libIGCM/RunChecker.job [-u user] [-q] [-j n] -p config.card_path
  path/to/libIGCM/RunChecker.job [-u user] [-q] [-j n] -r
Options:
  -h : print this help and exit
  -u user : owner of the job
  -q : quiet
  -j n : print n post-processing jobs (default is 20)
  -s : search for a new job in $WORKDIR and fill in the catalog before printing information
  -p path : give the absolute path to the directory containing the config.card instead of the job name (needed only once)
  -r : check all running simulations
Examples:
  1) path/to/libIGCM/RunChecker.job -p $CCCWORKDIR/CURIE/CMIP5/R1414/IPSLCM5A_20120731/modipsl/config/IPSLCM5A/v5.rcp45CMR2
  2) path/to/libIGCM/RunChecker.job v5.rcp45CMR2
Monitoring a simulation: mail
• You receive a message at the end of the simulation.
• The simulation can be completed or failed.
Completed example:
  From: rpsl003@idris.fr
  Subject: COURSNIV2 completed
  Date: 22 October 2013 18:29:24 UTC+02:00
  To: rpsl003@idris.fr
  Dear rpsl003,
  Simulation COURSNIV2 completed on supercomputer ada027
  Simulation started : 20000101
  Simulation ended : 20000102
  Output files are available in /u/rech/psl/rpsl003/IGCM_OUT/IPSLCM5A/DEVT/pdControl/COURSNIV2
  Files to be rebuild are temporarily available in /workgpfs/rech/psl/rpsl003/IGCM_OUT/IPSLCM5A/DEVT/pdControl/COURSNIV2/REBUILD
  Pre-packed files are temporarily available in /workgpfs/rech/psl/rpsl003/IGCM_OUT/IPSLCM5A/DEVT/pdControl/COURSNIV2
  Script files, Script Outputs and Debug files (if necessary) are available in /gpfs5r/workgpfs/rech/psl/rpsl003/ADA/COURS/NIV2/IPSLCM5_v5/modipsl/config/IPSLCM5_v5/COURSNIV2
  Greetings!
  Check this out for more information : https://forge.ipsl.jussieu.fr/igcmg/wiki/platform/documentation
Failed example (beginning of forwarded message):
  From: rpsl003@idris.fr
  Subject: MyJobTest failed
  Date: 22 October 2013 17:17:41 UTC+02:00
  To: rpsl003@idris.fr
  Dear rpsl003,
Monitoring a simulation: run.card
• When the simulation starts, the file run.card is created by libIGCM from the template run.card.init.
• run.card contains information about the current run period and about the previous, already finished, periods.
• This file is updated at each run period by libIGCM.
• It also records the time consumption of each period.
• The job status is set to OnQueue, Running, Completed or Fatal.
run.card:
  [Configuration]
  #last PREFIX
  OldPrefix= COURSNIV2_20000103
  #Compute date of loop
  PeriodDateBegin= 2000-01-04
  PeriodDateEnd= 2000-01-04
  CumulPeriod= 4
  # State of Job "Start", "Running", "OnQueue", "Completed"
  PeriodState= Completed
  SubmitPath= /gpfs5r/workgpfs/rech/psl/rpsl003/ADA/COURS/NIV2/IPSLCM5_v5/modipsl/config/IPSLCM5_v5/COURSNIV2
  #========================================================================
  [PostProcessing]
  TimeSeriesRunning=n
  TimeSeriesCompleted=
  #========================================================================
  [Log]
  # Executables Size
  LastExeSize= ( 88011086, 0, 0, 19956686, 0, 0, 1523952 )
  #-----------------------------------------------------------------------------------------------------------------------------------
  # CumulPeriod | PeriodDateBegin | PeriodDateEnd | RunDateBegin | RunDateEnd | RealCpuTime | UserCpuTime |
  #-----------------------------------------------------------------------------------------------------------------------------------
  # 1 | 20000101 | 20000101 | 2013-10-22T17:53:48 | 2013-10-22T17:55:10 | 82.01000 | 4.21000 |
  # 2 | 20000102 | 20000102 | 2013-10-22T18:28:03 | 2013-10-22T18:29:17 | 74.19000 | 4.09000 |
  # 3 | 20000103 | 20000103 | 2013-10-23T17:28:50 | 2013-10-23T17:30:26 | 95.21000 | 4.30000 |
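A quick way to see where a run stands, without opening the file (a minimal sketch; run it from the experiment directory, where run.card lives):

  # Show the current period and job state recorded by libIGCM
  grep -E "PeriodState|CumulPeriod|PeriodDateBegin|PeriodDateEnd" run.card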
Verification and correction 1/6
• Where did the problem occur?
• One "failed" email, main computation job: did gaya stop at IDRIS? A hardware problem? Check Script_output_xxxx.
• When gaya is back, or if there is no clear error message, try relaunching (after a clean_month), as shown below:
  path/to/libIGCM/clean_month.job
  ccc_msub (llsubmit) Job_...
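Concretely, the relaunch sequence looks like this. A sketch; run it from the experiment directory, with Job_MYEXP standing for your own job:

  # Remove the files of the failed, incomplete period, then resubmit
  path/to/libIGCM/clean_month.job
  ccc_msub Job_MYEXP        # TGCC; at IDRIS: llsubmit Job_MYEXP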
Verification and correction 2/6
• Where did the problem occur?
• One "failed" email, main computation job: analyse Script_output_xxxx, which has three parts:
  #######################################
  #      ANOTHER GREAT SIMULATION       #
  #######################################
  Part 1 (copying the input files)
  #######################################
  #      DIR BEFORE RUN EXECUTION       #
  #######################################
  Part 2 (running the model)
  #######################################
  #      DIR AFTER RUN EXECUTION        #
  #######################################
  Part 3 (post-processing)
  #######################################
http://forge.ipsl.jussieu.fr/igcmg/wiki/platform/en/documentation/G_suivi#AnalyzingtheJoboutput:Script_Output
Verification and correction 3/6
Analyse Script_output_xxxx: in general, if your simulation stops, look for the keyword "IGCM_debug_Exit" or "ERROR" in this file. The keyword comes after a line explaining the error you are experiencing:
  =====================================================================
  EXECUTION of : /usr/bin/time ccc_mprun -E-K1 -f ./run_file
  Return code of executable : 153
  IGCM_debug_Exit : EXECUTABLE
  !!!!!!!!!!!!!!!!!!!!!!!!!!
  !!   ERROR TRIGGERED    !!
  !!   EXIT FLAG SET      !!
  !------------------------!
  IGCM_sys_Mkdir : …/modipsl/config/IPSLCM5_v5/COURSNIV2KO/Debug
  IGCM_sys_Cp : out_execution …/modipsl/config/IPSLCM5_v5/COURSNIV2KO/Debug/COURSNIV2KO_20050401_20050430_out_execution_error
  =====================================================================
Updated 19/11/2013
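A sketch of the search; the Script_output file name follows the pattern shown above, so adjust it to your own job:

  # Locate the error marker and show a few lines of context around it
  grep -n -B2 -A6 -E "IGCM_debug_Exit|ERROR" Script_output_xxxx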
Verification and correction 4/6
Look closely at the Debug subdirectory (if it exists). Check the file xxxxx_error in Debug/:
• it contains the LMDZ standard output; LMDZ often fails in its hgardfou sanity check ("Stopping in hgardfou")
• it records the abends (abnormal terminations / exceptions) of each component
Check the standard outputs of NEMO, ORCHIDEE, INCA, OASIS:
• Debug/xxxx_ocean.output
• Debug/xxxx_output_orchidee
• Debug/xxxx_inca.out
• Debug/xxxx_cplout
Debug examples
• Segmentation fault: check the file xxxxx_error in Debug: it tells you which model crashed.
  forrtl: severe (174): SIGSEGV, segmentation fault occurred
  Image              PC                Routine  Line     Source
  p25mpava_lmdz.x_2  0000000000EF005B  Unknown  Unknown  Unknown
  p25mpava_lmdz.x_2  00000000006F293D  Unknown  Unknown  Unknown
  p25mpava_lmdz.x_2  00000000006BB58F  Unknown  Unknown  Unknown
  p25mpava_lmdz.x_2  0000000000477A6F  Unknown  Unknown  Unknown
  p25mpava_lmdz.x_2  0000000000457C99  Unknown  Unknown  Unknown
  p25mpava_lmdz.x_2  00000000004568BC  Unknown  Unknown  Unknown
  libc.so.6          00000034AB81ECDD  Unknown  Unknown  Unknown
  p25mpava_lmdz.x_2  00000000004567B9  Unknown  Unknown  Unknown
• Next step: compile and run in "debug mode" (next slide).
Debug examples
• Compilation in "debug mode":
• The default mode is "prod" (i.e. optimized, for production runs).
• Compiler options may help locate the problem:
  • -traceback (to have stack details)
  • -check bounds (to check array bounds, ...)
  • -fp-stack-check (to check NaN, ...)
  • -g (in order to use a debugger)
  • others: see the compiler documentation
• Where to add these options? It depends on the model:
• ORCHIDEE and IOIPSL: modipsl/util/AA_make.gdef (then run the ins_make command):
  #-Q- curie F_O = -DCPP_PARA -xHost -O3 -g -traceback -fp-stack-check $(F_D) $(F_P) -I$(MODDIR) -module $(MODDIR)
• LMDZ and INCA: Makefile in config/xxx/, by adding -debug or -dev to the compilation line:
  (cd ../../modeles/INCA3; ./makeinca_fcm -debug -chimie CH4 -resol (...) ../../bin/inca.dat ; )
  (cd ../../modeles/LMDZ; ./makelmdz_fcm -cpp ORCHIDEE_NOOPENMP -debug -d (..) ../../bin/gcm.e;)
• NEMO: Makefile in modeles/NEMO/WORK/Makefile:
  F_O = -O3 -i4 -r8 -xHost -traceback -module $(MODDIR)/oce -I$(MODDIR) -I$(MODDIR)/oce -I$(NCDF_INC) $(USER_INC)
=> Work in progress to make this easier!
Debug examples
• Strange (or unexpected) values in output files, or other problems:
• Runtime, 1st debug level: keep output files always available (not migrated) in the output directory IGCM_OUT/… (see the sketch after this list):
  • SpaceName=TEST in config.card (i.e. no packing; everything stays on $SCRATCHDIR at curie or $WORKDIR at ada)
  • set RebuildFrequency to 1 period (e.g. 1M) in config.card
• Runtime, 2nd debug level: get output files quickly:
  • SpaceName=TEST in config.card (no packing, as above)
  • RebuildFrequency set to 1 period (e.g. 1M) in config.card
  • on curie: use the "test" queue (limits: 2 jobs per user, 8 nodes and 1800 s per job):
    #MSUB -T 1800 # time limit
    #MSUB -Q test # test queue
  • on ada: no "test" queue (not needed, since there has been no waiting time so far)
  • (no rebuild at all (expert level!): remove the output files from the cards)
• Runtime, 3rd debug level: use a debugger:
  • compile with the -g option (Intel compiler on curie and ada)
  • use the "test" queue on curie
  • see the IDRIS or TGCC documentation on ddt or totalview
  • use variable statistics, breakpoints, ...
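A minimal sketch of the corresponding config.card settings for the first two debug levels; only the relevant keys are shown, and the section names follow the config.card excerpts earlier in this course:

  [UserChoices]
  ...
  # no packing: everything stays on $SCRATCHDIR (curie) / $WORKDIR (ada)
  SpaceName=TEST
  ...
  [Post]
  # rebuild after every period, so outputs are usable immediately
  RebuildFrequency=1M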