
Carlie J. Coats, Jr., MCNC (coats@ncsc) John N. McHenry, MCNC (mchenry@ncsc)

MM5 Optimization Experiences and Numerical Sensitivities Found in Convective/Non-Convective Cloud Interactions. Carlie J. Coats, Jr., MCNC (coats@ncsc.org) John N. McHenry, MCNC (mchenry@ncsc.org) Elizabeth Hayes, SGI (eah@sgi.com). Introduction.



  1. MM5 Optimization Experiences and Numerical Sensitivities Found in Convective/Non-Convective Cloud Interactions
     • Carlie J. Coats, Jr., MCNC (coats@ncsc.org)
     • John N. McHenry, MCNC (mchenry@ncsc.org)
     • Elizabeth Hayes, SGI (eah@sgi.com)

  2. Introduction
     • MM5 optimization for microprocessor/parallel systems
       • Started from MM5V2.[7,12]-GSPBL
       • Speedups so far: 1.4 on SGI, 1.9 on Linux/X86, 2.36 on IBM SP
     • Tiny numerical changes cause gross changes in the output
       • (but these changes seem to be unbiased)
     • Causative mechanisms include convective triggering
       • An inherent problem: this is ill-conditioned in nature
     • Need to be careful with algorithmic formulations and optimizations
       • Will not be fixed simply by improved compiler technology

  3. Optimization For Microprocessor/Parallel
     • Processor characteristics:
       • Pipelining and superscalarity: need lots of independent work
       • Hierarchical memory organization with registers and caches
     • Solutions:
       • Data structure transformations
       • Logic and loop re-factoring
       • Expensive-operation avoidance
       • Minimize and optimize memory traffic

  4. Pipelining and Superscalarity
     • Modern microprocessors try to keep multiple instructions in different stages of execution on each FPU or ALU at the same time.
     • Dependencies between instructions (where one must complete before another can start) stall the pipeline.
     • Current technology: 20-30 instructions "in flight" at one time; even more (50+?) in the future.
     • Standard solutions: need lots of "independent work" to fill pipelines
       • Loop unrolling for vectorizable loops (some compilers can do this)
       • Loop jamming, so that loop bodies are long and contain lots of independent work (some compilers can do some of this)
       • Logic refactoring, so that IFs are outside the loops, not inside (compilers can NOT do this)

  5. Caches and Memory Traffic
     • Memory traffic is a prime predictor of performance
       • McCalpin's "STREAM" benchmarks
     • Want stride-1 data access, especially for "store" sequences
     • Want small data structures that "live in cache" or (where possible) even scalars that "live in registers"
     • Parallel cache-line conflicts "can cost 100X performance" (SGI)
     • Standard solutions:
       • Loop unrolling and loop jamming lead to value re-use (some compilers can do some of this)
       • Loop refactoring and data structure reorganization (some compilers can do loop refactoring, but none do major data structure reorganization)

  6. Expensive Operations
     • Use of X**0.5 instead of SQRT(X) (this is also less accurate)
     • Use of divides and reciprocals
       • The code even contains examples of X=A/B/C/D instead of X=A/(B*C*D)
       • Use RPS* variables
       • Rationalize fractions
     • EXP(A)*EXP(B) vs. EXP(A+B) (happens in LWRAD)
     • Repeated calculations of the same trig or log functions (happens in SOUND)
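The rewrites listed on this slide are algebraically equivalent to the originals but need not be bit-for-bit identical, which matters given the sensitivities described in the introduction. The following is a minimal sketch in standard-library Python (double precision, with made-up sample values that do not come from MM5) comparing each expensive form with its cheaper counterpart. The helper `rel_diff` is hypothetical and exists only for this illustration.

```python
import math

# Hypothetical sample values; none of these come from MM5.
a, b, c, d = 7.3, 1.7, 2.9, 0.37

chained = a / b / c / d        # three divides, as in X=A/B/C/D
single = a / (b * c * d)       # two multiplies and one divide

exp_product = math.exp(0.3) * math.exp(1.1)   # two EXP calls
exp_sum = math.exp(0.3 + 1.1)                 # one EXP call

power_root = 2.0 ** 0.5        # general power function
sqrt_root = math.sqrt(2.0)     # dedicated square root


def rel_diff(p, q):
    """Relative difference between two nonzero floats."""
    return abs(p - q) / max(abs(p), abs(q))


print("divide chain vs single divide:", rel_diff(chained, single))
print("EXP(A)*EXP(B) vs EXP(A+B):    ", rel_diff(exp_product, exp_sum))
print("X**0.5 vs SQRT(X):            ", rel_diff(power_root, sqrt_root))
```

Any differences are at round-off level, so the cheaper forms are safe as arithmetic, but a model that amplifies round-off may still produce visibly different output after the change.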

  7. Logic Re-Factoring
     Simplified example adapted from MRFPBL:
           DO K=1,KL
             DO I=1,ILX
               QX(I,K) =QVB(I,J,K)*RPSB(I,J)
               QCX(I,K)=0.
               QIX(I,K)=0.
             END DO
           END DO
           IF ( IMOIST(IN).NE.1)THEN
             DO K=1,KL
               DO I=1,ILX
                 QCX(I,K)=QCB(I,J,K)*RPSB(I,J)
                 IF(IICE.EQ.1)QIX(I,K)=QIB(I,J,K)*RPSB(I,J)
               END DO
             END DO
           END IF

  8. Logic Re-Factoring, cont'd
     The same example with the IFs moved outside the loops:
           IF ( IMOIST(IN).EQ.1)THEN
             DO K=1,KL
               DO I=1,ILX
                 QX(I,K) =QVB(I,J,K)*RPSB(I,J)
                 QCX(I,K)=0.
                 QIX(I,K)=0.
               END DO
             END DO
           ELSE IF ( IICE.NE.1)THEN    ! where imoist.ne.1:
             DO K=1,KL
               DO I=1,ILX
                 QX(I,K) =QVB(I,J,K)*RPSB(I,J)
                 QCX(I,K)=QCB(I,J,K)*RPSB(I,J)
                 QIX(I,K)=0.
               END DO
             END DO
           ELSE                        ! imoist.ne.1 and iice.eq.1
             DO K=1,KL
               DO I=1,ILX
                 QX(I,K) =QVB(I,J,K)*RPSB(I,J)
                 QCX(I,K)=QCB(I,J,K)*RPSB(I,J)
                 QIX(I,K)=QIB(I,J,K)*RPSB(I,J)
               END DO
             END DO
           END IF
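A refactoring of this kind is only worthwhile if it leaves results unchanged, and that is easy to check mechanically. The sketch below is a Python transliteration of the two Fortran forms, not MM5 code: the J index is dropped (one fixed J slice is assumed), arrays are plain nested lists indexed `[k][i]`, and all sample values are invented. It compares the outputs for every combination of the two flags.

```python
def zeros(kl, ilx):
    """Fresh kl-by-ilx array of zeros."""
    return [[0.0] * ilx for _ in range(kl)]


def original(qvb, qcb, qib, rpsb, imoist, iice):
    """'Before' form: the IICE test sits inside the inner loop."""
    kl, ilx = len(qvb), len(rpsb)
    qx, qcx, qix = zeros(kl, ilx), zeros(kl, ilx), zeros(kl, ilx)
    for k in range(kl):
        for i in range(ilx):
            qx[k][i] = qvb[k][i] * rpsb[i]
    if imoist != 1:
        for k in range(kl):
            for i in range(ilx):
                qcx[k][i] = qcb[k][i] * rpsb[i]
                if iice == 1:
                    qix[k][i] = qib[k][i] * rpsb[i]
    return qx, qcx, qix


def refactored(qvb, qcb, qib, rpsb, imoist, iice):
    """'After' form: flags tested once, one branch-free loop nest per case."""
    kl, ilx = len(qvb), len(rpsb)
    qx, qcx, qix = zeros(kl, ilx), zeros(kl, ilx), zeros(kl, ilx)
    if imoist == 1:
        for k in range(kl):
            for i in range(ilx):
                qx[k][i] = qvb[k][i] * rpsb[i]
    elif iice != 1:
        for k in range(kl):
            for i in range(ilx):
                qx[k][i] = qvb[k][i] * rpsb[i]
                qcx[k][i] = qcb[k][i] * rpsb[i]
    else:
        for k in range(kl):
            for i in range(ilx):
                qx[k][i] = qvb[k][i] * rpsb[i]
                qcx[k][i] = qcb[k][i] * rpsb[i]
                qix[k][i] = qib[k][i] * rpsb[i]
    return qx, qcx, qix


# Invented sample data: 2 levels (K) by 3 points (I).
QVB = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]
QCB = [[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]]
QIB = [[0.01, 0.02, 0.03], [0.04, 0.05, 0.06]]
RPSB = [0.5, 0.25, 0.125]

for imoist in (1, 2):
    for iice in (0, 1):
        same = (original(QVB, QCB, QIB, RPSB, imoist, iice)
                == refactored(QVB, QCB, QIB, RPSB, imoist, iice))
        print("imoist =", imoist, "iice =", iice, "identical:", same)
```

Because each output element is produced by the same single multiply in both forms, the results here match exactly, unlike the arithmetic rewrites on other slides, where only round-off-level agreement can be expected.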

  9. EXMOISS Optimizations
     • Inside the (innermost) miter loop:
           RGV(K) =AMAX1( RGV(K)/DSIGMA(K), RGV(K-1)/DSIGMA(K-1) )*DSIGMA(K)
           RGVC(K)=AMAX1(RGVC(K)/DSIGMA(K),RGVC(K-1)/DSIGMA(K-1) )*DSIGMA(K)
     • Equivalent (for positive DSIGMA) to
           DSRAT(K)=DSIGMA(K)/DSIGMA(K-1)    !! K-only pre-calculation
           …..
           RGV(K) =AMAX1( RGV(K), RGV(K-1)*DSRAT(K) )
           RGVC(K)=AMAX1(RGVC(K), RGVC(K-1)*DSRAT(K) )
     • Rewrite loop structure and arrays as follows:
       • outermost I-loop, enclosing
         • sequence of K-loops, then
         • miter loop, enclosing internal K-loop
       • working arrays subscripted by K only (or are scalars, when possible)
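The DSRAT rewrite relies on the fact that multiplying both arguments of a maximum by a positive factor does not change which argument wins. The sketch below checks this numerically in Python. It assumes the update runs over increasing K so that `RGV(K-1)` has already been updated (the slide does not show the enclosing loop), uses 0-based indexing, and uses invented values for `dsigma` and `rgv`.

```python
import math

# Invented layer thicknesses (all positive) and flux values.
DSIGMA = [0.05, 0.10, 0.15, 0.20, 0.25, 0.25]
RGV = [0.0, 3.1e-4, 1.2e-4, 5.0e-4, 4.9e-4, 1.0e-5]


def original_form(rgv, dsigma):
    """Two divides and one multiply per level, as in the original loop."""
    out = list(rgv)
    for k in range(1, len(out)):
        out[k] = max(out[k] / dsigma[k], out[k - 1] / dsigma[k - 1]) * dsigma[k]
    return out


def precomputed_form(rgv, dsigma):
    """One multiply per level, using a ratio computed once outside the loop."""
    dsrat = [0.0] + [dsigma[k] / dsigma[k - 1] for k in range(1, len(dsigma))]
    out = list(rgv)
    for k in range(1, len(out)):
        out[k] = max(out[k], out[k - 1] * dsrat[k])
    return out


old = original_form(RGV, DSIGMA)
new = precomputed_form(RGV, DSIGMA)
for k, (p, q) in enumerate(zip(old, new)):
    print(k, p, q, math.isclose(p, q, rel_tol=1e-12, abs_tol=0.0))
```

The two forms agree to round-off but are not guaranteed to be bit-identical, since the divide-then-multiply in the original introduces rounding steps that the precomputed ratio replaces with different ones.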

  10. EXMOISS Optimizations, cont'd
     • Rain-accumulation numerics:
       • Original version adds each miter-step, single-layer rain increment directly to the 2-D array of cumulative rain totals; this causes serious truncation error in long runs.
       • Optimized version adds up the vertical-column, advection-step total in a scalar, then adds that scalar to the cumulative total: better round-off, less memory traffic.
     • New version is twice as fast, with greatly reduced round-off errors
     • Generates noticeably different MM5 model results
       • No evident bias in the changed results
       • Caused/amplified by interaction with convective cloud parameterizations?
       • See plots to come
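The round-off effect of the two accumulation orders can be reproduced with a small experiment. The sketch below emulates single-precision arithmetic in standard-library Python by rounding every sum through `struct`. The helper `f32`, the starting total, the increment size, and the layer and step counts are all assumptions chosen to make the effect visible, not values taken from EXMOISS.

```python
import math
import struct


def f32(x):
    """Round a Python float to the nearest IEEE single-precision value."""
    return struct.unpack("f", struct.pack("f", x))[0]


def add_each_layer(total, steps):
    """Original order: every per-layer increment goes straight into the total."""
    for column in steps:
        for inc in column:
            total = f32(total + inc)
    return total


def add_column_sums(total, steps):
    """Optimized order: sum the column in a scalar, then add it to the total once."""
    for column in steps:
        column_sum = 0.0
        for inc in column:
            column_sum = f32(column_sum + inc)
        total = f32(total + column_sum)
    return total


# Invented setup: a large running total and many tiny increments.
START = f32(100.0)
INC = f32(1.0e-6)
STEPS = [[INC] * 20 for _ in range(1000)]   # 1000 steps of 20 layers each

exact = math.fsum([START] + [inc for column in STEPS for inc in column])
per_layer = add_each_layer(START, STEPS)
per_column = add_column_sums(START, STEPS)

print("exact (double) :", exact)
print("per-layer adds :", per_layer, "error", abs(per_layer - exact))
print("per-column adds:", per_column, "error", abs(per_column - exact))
```

In this setup each individual increment is smaller than half the spacing between single-precision numbers near the running total, so the per-layer scheme never moves the total at all, while the column sums are large enough to register. The column-first order is still not exact, only markedly better.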

  11. Other Routines
     • Routines: SOUND, SOLVE3, EXMOISS, GSPBL, LWRAD, MRFPBL, HADV, VADV
     • Typical speedup factors for these routines:
       • 1.1-1.6 on SGI
       • 1.5-2.1 (but 2.54 for GSPBL) on IBM SP
     • Frequently, optimized versions have reduced round-off
     • Some optimizations will improve both vector and microprocessor performance
     • Side effects: reduced cache footprint in EXMOISS and MRFPBL caused a 5-8% speedup in SOUND and SOLVE3 on SGI Octane! (less effect on O-2000)

  12. Food for Thought
     • What does all this, especially the numerical sensitivities, say for future model formulations such as WRF?
       • Double-precision-only model? (and best-available values for physics constants!)
       • Ensemble forecasts? (These are very easy to achieve with the current MM5: just multiply some state variable by PSA, then by RPSA!)
       • (Most radically) stochastic models that predict cell means and variances instead of deterministic point values? (Due to theorems in integral operator theory, these have better stability and continuity properties than today's deterministic models, but sub-gridscale processes will be a challenge to formulate!)
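The ensemble remark works because multiplying by a number and then by its stored reciprocal is not always an exact round trip in floating point. The sketch below illustrates this in Python double precision. It assumes, as the names suggest, that RPSA holds a precomputed reciprocal of PSA. The function `round_trip` and the integer multipliers are hypothetical stand-ins, and MM5's own arithmetic precision may differ.

```python
def round_trip(x, psa):
    """Multiply by psa, then by its precomputed reciprocal."""
    rpsa = 1.0 / psa
    return (x * psa) * rpsa


# Hypothetical multipliers 1..199 applied to the value 1.0.
changed = [p for p in range(1, 200) if round_trip(1.0, float(p)) != 1.0]
worst = max(abs(round_trip(1.0, float(p)) - 1.0) for p in range(1, 200))

print("multipliers that perturb the value:", changed)
print("largest perturbation:", worst)
```

The perturbations are at the level of the last bit, which is exactly the kind of tiny numerical change that the introduction says can grow into gross differences in model output.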
