Recent Developments for Parallel CMAQ • Jeff Young AMDB/ASMD/ARL/NOAA • David Wong SAIC – NESCC/EPA
AQF-CMAQ
• Running in “quasi-operational” mode at NCEP
• 26+ minutes for a 48-hour forecast (on 33 processors)
• Code modifications to improve data locality
• Some vdiff optimization
• Jerry Gipson’s MEBI/EBI chem solver
• Tried CGRID( SPC, LAYER, COLUMN, ROW ) – tests indicated it is not good for MPI communication of data (see the layout sketch below)
• Some background: MPI data communication – ghost (halo) regions
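A minimal layout sketch of the two CGRID index orderings just mentioned (the array sizes and the per-species slab framing are illustrative assumptions, not the actual AQF-CMAQ values). Fortran stores arrays column-major, so the leftmost index varies fastest in memory:

   ! Sketch only: sizes are illustrative, not the AQF-CMAQ dimensions.
   PROGRAM cgrid_layout_sketch
      IMPLICIT NONE
      INTEGER, PARAMETER :: NCOLS = 21, NROWS = 18, NLAYS = 22, NSPCS = 35

      ! Standard-style ordering: for a single species and layer, a row of
      ! cells is contiguous and a column has stride NCOLS, so the ghost-region
      ! slabs exchanged for horizontal transport stay reasonably compact.
      REAL :: cgrid_std( NCOLS, NROWS, NLAYS, NSPCS )

      ! Tested ordering CGRID( SPC, LAYER, COLUMN, ROW ): all species of one
      ! grid cell are contiguous, good locality for cell-by-cell chemistry,
      ! but the cells of a single species/layer slab are now NSPCS*NLAYS
      ! elements apart, scattering the data an MPI ghost exchange must move.
      REAL :: cgrid_alt( NSPCS, NLAYS, NCOLS, NROWS )

      cgrid_std = 0.0
      cgrid_alt = 0.0
      PRINT *, 'elements per array:', SIZE( cgrid_std ), SIZE( cgrid_alt )
   END PROGRAM cgrid_layout_sketch

With SPC first, the species of one cell are contiguous, which helps chemistry-solver locality, but the cells of a single-species slab become widely strided, consistent with the tests that found the ordering unfavorable for MPI communication.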
Ghost (Halo) Regions

   DO J = 1, N
      DO I = 1, M
         ! A( I+2,J ) and A( I,J-1 ) can reach past this processor's
         ! subdomain, so those off-processor cells must already sit in a
         ! local ghost (halo) region when the loop runs
         DATA( I,J ) = A( I+2,J ) * A( I,J-1 )
      END DO
   END DO
Horizontal Advection and Diffusion Data Requirements
[Diagram: processor subdomains with ghost regions and the exterior boundary]
Stencil Exchange Data Communication Function
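A minimal, generic sketch of the kind of ghost-region (stencil) exchange described above. This is not the actual CMAQ stencil-exchange/pario interface; the 1-D column decomposition, array sizes, and names are illustrative assumptions. Each process swaps one ghost column with its east and west neighbors using MPI_SENDRECV:

   ! Sketch only: 1-D decomposition across columns with a 1-cell ghost
   ! column on each side; the real stencil exchange handles 2-D layouts
   ! and ghost depths set by the numerical stencil (the I+2 reach in the
   ! loop on the previous slide would need a depth of 2).
   PROGRAM halo_exchange_sketch
      USE mpi
      IMPLICIT NONE
      INTEGER, PARAMETER :: NROWS = 142, MYCOLS = 21     ! illustrative sizes
      REAL    :: a( NROWS, 0:MYCOLS+1 )                  ! cols 0 and MYCOLS+1 are ghosts
      INTEGER :: ierr, rank, nprocs, west, east, stat( MPI_STATUS_SIZE )

      CALL MPI_INIT( ierr )
      CALL MPI_COMM_RANK( MPI_COMM_WORLD, rank, ierr )
      CALL MPI_COMM_SIZE( MPI_COMM_WORLD, nprocs, ierr )

      west = rank - 1
      IF ( west .LT. 0 ) west = MPI_PROC_NULL            ! exterior boundary: no neighbor
      east = rank + 1
      IF ( east .GE. nprocs ) east = MPI_PROC_NULL

      a = REAL( rank )                                   ! stand-in for real data

      ! Send my easternmost interior column to the east neighbor's west ghost;
      ! receive my west ghost from the west neighbor's easternmost column.
      CALL MPI_SENDRECV( a( :, MYCOLS ), NROWS, MPI_REAL, east, 1,            &
                         a( :, 0 ),      NROWS, MPI_REAL, west, 1,            &
                         MPI_COMM_WORLD, stat, ierr )
      ! Mirror-image exchange to fill the east ghost column.
      CALL MPI_SENDRECV( a( :, 1 ),        NROWS, MPI_REAL, west, 2,          &
                         a( :, MYCOLS+1 ), NROWS, MPI_REAL, east, 2,          &
                         MPI_COMM_WORLD, stat, ierr )

      CALL MPI_FINALIZE( ierr )
   END PROGRAM halo_exchange_sketch

A 2-D decomposition repeats the same pattern in the north-south direction, with the exterior-boundary sides taking their values from boundary-condition data rather than from a neighbor process.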
Architectural Changes for Parallel I/O
• Tests confirmed the latest releases (May, Sep 2003) are not very scalable
• Background: Original Version, Latest Release, AQF Version
Parallel I/O
• 2002 Release: computation only / read, write and computation
• 2003 Release: read and computation / write and computation
• AQF: asynchronous write only; data transfer by message passing
Modifications to the I/O API for Parallel I/O
For 3D data, e.g. …
• time interpolation:
  INTERP3 ( FileName, VarName, ProgName, Date, Time, Ncols*Nrows*Nlays, Data_Buffer )
• spatial subset:
  XTRACT3 ( FileName, VarName, StartLay, EndLay, StartRow, EndRow, StartCol, EndCol, Date, Time, Data_Buffer )
• time interpolation for a spatial window:
  INTERPX ( FileName, VarName, ProgName, StartCol, EndCol, StartRow, EndRow, StartLay, EndLay, Date, Time, Data_Buffer )
• patch write:
  WRPATCH ( FileID, VarID, TimeStamp, Record_No ) – called from PWRITE3 (pario)
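A usage sketch of the windowed read, following the INTERPX argument list above. The file and variable names ('METCRO3D', 'TA') and the my_* window indices are hypothetical, and the sketch assumes the Models-3 I/O API is linked and the file already opened; INTERPX is a LOGICAL function:

   ! Sketch only: each worker reads and time-interpolates just its own
   ! column/row window instead of the whole Ncols*Nrows*Nlays grid that
   ! INTERP3 would return.
   SUBROUTINE read_met_patch( my_startcol, my_endcol, my_startrow, my_endrow, &
                              nlays, jdate, jtime, patch_buffer )
      IMPLICIT NONE
      INTEGER, INTENT( IN )  :: my_startcol, my_endcol, my_startrow, my_endrow
      INTEGER, INTENT( IN )  :: nlays, jdate, jtime
      REAL,    INTENT( OUT ) :: patch_buffer( my_endcol - my_startcol + 1,    &
                                              my_endrow - my_startrow + 1, nlays )
      LOGICAL, EXTERNAL      :: INTERPX

      IF ( .NOT. INTERPX( 'METCRO3D', 'TA', 'READ_MET_PATCH',                 &
                          my_startcol, my_endcol,                             &
                          my_startrow, my_endrow,                             &
                          1, nlays, jdate, jtime, patch_buffer ) ) THEN
         PRINT *, 'INTERPX failed for TA'
      END IF
   END SUBROUTINE read_met_patch

On the output side, WRPATCH (called from PWRITE3 in pario) writes a single processor's patch of a variable.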
Standard (May 2003 Release)

DRIVER
  read IC’s into CGRID
  begin output timestep loop
    advstep (determine sync timestep)
    couple
    begin sync timestep loop
      SCIPROC:
        X-Y-Z advect
        adjadv
        hdiff
        decouple
        vdiff (writes DRYDEP)
        cloud (writes WETDEP)
        gas chem (aero writes VIS)
        couple
    end sync timestep loop
    decouple
    write conc and avg conc (CONC, ACONC)
  end output timestep loop
AQF-CMAQ

DRIVER
  set WORKERs, WRITER
  if WORKER, read IC’s into CGRID
  begin output timestep loop
    if WORKER
      advstep (determine sync timestep)
      couple
      begin sync timestep loop
        SCIPROC:
          X-Y-Z advect
          hdiff
          decouple
          vdiff
          cloud
          gas chem (aero)
          couple
      end sync timestep loop
      decouple
      MPI send conc, aconc, drydep, wetdep, (vis)
    if WRITER
      completion-wait: for conc, write conc (CONC); for aconc, write aconc; etc.
  end output timestep loop
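A minimal sketch of the worker/writer split in the AQF driver above. The rank assignment, message tag, flat buffer, and patch size are illustrative assumptions; the actual code uses its own pario machinery and a completion-wait rather than this simple blocking receive:

   ! Sketch only: the last rank acts as the dedicated WRITER; every other
   ! rank is a WORKER that sends its local concentration patch by MPI
   ! instead of writing to the output files itself.
   PROGRAM worker_writer_sketch
      USE mpi
      IMPLICIT NONE
      INTEGER, PARAMETER :: PATCH = 21 * 18 * 22 * 35    ! illustrative patch size
      REAL    :: conc( PATCH )
      INTEGER :: ierr, rank, nprocs, writer, w, stat( MPI_STATUS_SIZE )

      CALL MPI_INIT( ierr )
      CALL MPI_COMM_RANK( MPI_COMM_WORLD, rank, ierr )
      CALL MPI_COMM_SIZE( MPI_COMM_WORLD, nprocs, ierr )
      writer = nprocs - 1                                ! one WRITER, nprocs-1 WORKERs

      IF ( rank .NE. writer ) THEN                       ! WORKER: science, then send
         conc = REAL( rank )                             ! stands in for SCIPROC results
         CALL MPI_SEND( conc, PATCH, MPI_REAL, writer, 100, MPI_COMM_WORLD, ierr )
      ELSE                                               ! WRITER: gather patches, then write
         DO w = 0, nprocs - 2
            CALL MPI_RECV( conc, PATCH, MPI_REAL, w, 100, MPI_COMM_WORLD, stat, ierr )
            ! ... here the writer would assemble the patches and write the
            !     CONC (and ACONC, DRYDEP, WETDEP) records via the I/O API.
         END DO
      END IF

      CALL MPI_FINALIZE( ierr )
   END PROGRAM worker_writer_sketch

The intent of the dedicated writer is to overlap output with the workers’ computation: workers hand off data by message passing and continue, while the writer completes the file output.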
Power3 Cluster (NESCC’s LPAR)
[Diagram: three 16-CPU SP-Nighthawk nodes, each with shared memory, connected by a switch]
• The platform (cypress00, cypress01, cypress02) consists of 3 SP-Nighthawk nodes
• 2X slower than NCEP’s
• All CPUs share user applications with file servers, interactive use, etc.
Power4 p690 Servers (NCEP’s LPAR)
[Diagram: 4-CPU nodes, each with its own memory, connected by a switch; 2 “colony” SW connectors per node]
• Each platform (snow, frost) is composed of 22 p690 (Regatta) servers
• Each server has 32 CPUs, LPAR-ed into 8 nodes per server (4 CPUs per node)
• ~2X the performance of the cypress nodes
• Some nodes are dedicated to file servers, interactive use, etc.
• There are effectively 20 servers for general use (160 nodes, 640 CPUs)
“ice” Beowulf Cluster
[Diagram: six 2-CPU nodes, each with its own memory, on an internal network]
• Pentium 3 – 1.4 GHz
• Internal network isolated from outside network traffic
“global” MPICH Cluster
[Diagram: six 2-CPU nodes (global, global1, global2, global3, global4, global5), each with its own memory, on a network]
• Pentium 4 XEON – 2.4 GHz
RESULTS
• 5-hour (12Z–17Z) and 24-hour (12Z–12Z) runs
• 20 Sept 2002 test data set used for developing the AQF-CMAQ
• Input met from ETA, processed through PRDGEN and PREMAQ
• 166 columns X 142 rows X 22 layers at 12 km resolution (domain seen on following slides)
• CB4 mechanism, no aerosols
• Pleim’s Yamartino advection for AQF-CMAQ; PPM advection for the May 2003 Release
The Matrix
• Processor layouts: 4 = 2 X 2, 8 = 4 X 2, 16 = 4 X 4, 32 = 8 X 4, 64 = 8 X 8, 64 = 16 X 4
• run = comparison of run times
• wall = comparison of relative wall times for the main science processes
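A small sketch of how the 166-column X 142-row domain (from the RESULTS slide) splits under these layouts. Both the division rule (remainder cells go to the leading subdomains) and the assumption that the first factor counts subdomains along the columns are for illustration only, not the actual CMAQ partitioning code:

   ! Sketch only: prints the subdomain column/row counts for each layout
   ! in the matrix, assuming the first factor splits the columns.
   PROGRAM aspect_ratio_sketch
      IMPLICIT NONE
      INTEGER, PARAMETER :: GL_NCOLS = 166, GL_NROWS = 142   ! AQF 12 km domain
      INTEGER :: npcol( 6 ) = (/ 2, 4, 4, 8, 8, 16 /)
      INTEGER :: nprow( 6 ) = (/ 2, 2, 4, 4, 8,  4 /)
      INTEGER :: i, base, extra

      DO i = 1, 6
         base  = GL_NCOLS / npcol( i )
         extra = MOD( GL_NCOLS, npcol( i ) )
         PRINT *, npcol( i ), 'X', nprow( i ), ': cols per subdomain =',      &
                  base + MIN( extra, 1 ), 'or', base
         base  = GL_NROWS / nprow( i )
         extra = MOD( GL_NROWS, nprow( i ) )
         PRINT *, '          rows per subdomain =', base + MIN( extra, 1 ), 'or', base
      END DO
   END PROGRAM aspect_ratio_sketch

Under those assumptions the 8 X 4 layout, for example, gives patches of 21 or 20 columns by 36 or 35 rows, the kind of subdomain shapes pictured in the Aspect Ratios slides at the end.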
24-Hour AQF vs. 2003 Release
[Maps: 2003 Release vs. AQF-CMAQ, data shown at peak hour]
24-Hour AQF vs. 2003 Release
• Almost 9 ppb max difference between Yamartino and PPM
• Less than 0.2 ppb difference between AQF on cypress and snow
AQF vs. 2003 Release – Absolute Run Times: 24 hours, 8 worker processors [chart; run times in seconds]
AQF vs. 2003 Release on cypress, and AQF on snow – Absolute Run Times: 24 hours, 32 worker processors [chart; run times in seconds]
AQF-CMAQ – Various Platforms – Relative Run Times: 5 hours, 8 worker processors [chart; % of slowest]
AQF-CMAQ cypress vs. snow – Relative Run Times: 5 hours [chart; % of slowest]
AQF-CMAQ on Various Platforms – Relative Run Times: 5 hours [chart; % of slowest, by number of worker processors]
AQF-CMAQ on cypress – Relative Wall Times: 5 hours [chart; %]
AQF-CMAQ on snow – Relative Wall Times: 5 hours [chart; %]
AQF vs. 2003 Release on cypress – Relative Wall Times: 24 hours, 8 worker processors [chart; %; PPM vs. Yamo]
AQF vs. 2003 Release on cypress – Relative Wall Times: 24 hours, 8 worker processors [chart; %] – Add snow for 8 and 32 processors
AQF Horizontal Advection – Relative Wall Times: 24 hours, 8 worker processors [chart; %; legend: x-r-l = x-row-loop, y-c-l = y-column-loop; hppm kernel solver in each loop]
AQF Horizontal Advection – Relative Wall Times: 24 hours, 32 worker processors [chart; %]
AQF & Release Horizontal Advection – Relative Wall Times: 24 hours, 32 worker processors [chart; %; legend: r- = release]
Future Work
• Add aerosols back in
• TKE vdiff
• Improve horizontal advection/diffusion scalability
• Some I/O improvements
• Layer-variable horizontal advection time steps
Aspect Ratios
[Diagrams: subdomain shapes (columns X rows) for the processor layouts of the 166 X 142 domain, e.g. 83 X 71 for 2 X 2, 42/41 X 71 for 4 X 2, 42/41 X 36/35 for 4 X 4, 21/20 X 36/35 for 8 X 4, 21/20 X 18/17 for 8 X 8, and 11/10 X 36/35 for 16 X 4]