
Benchmarks of a Weather Forecasting Research Model


Presentation Transcript


  1. Benchmarks of a Weather Forecasting Research Model
     Daniel B. Weber, Ph.D.
     Research Scientist, CAPS/University of Oklahoma
     ****CONFIDENTIAL****
     August 3, 2001

  2. UNM Los Lobos INTEL Benchmark Summary
     • 20% increase in compute time for the 2-processors-per-node configuration on Intel-based systems, due to competition for the memory bus
     • File system is very slow on Intel-based systems without Fibre Channel
     • File system is the weak link (UNM-LL): 5.5 MB/sec sustained for the 480 2-proc/node tests, writing 2.1 MB files from 8 separate processors simultaneously
     • Traffic passes through a Linux file server, not the r6000

  3. ALPHA Benchmark Summary
     • ES-40 Alpha EV-67 (TCS) is 5 times faster computationally than the Intel PIII/733
     • Alpha (TCS) file system is very slow at times; the configuration needs review, but it shows potential for very fast transfer rates
     • MPI overhead for a 256-processor TCS job is on the order of 15%: very good network performance

  4. ALPHA Benchmark Summary
     • ES-45 Alpha EV-67 (TCS) is 1.5 times faster computationally than the ES-40
     • 4-5 times faster than the Intel PIII-1 GHz (using the Intel F90 compiler)

  5. ARPS Optimization Revisited
     • Two optimization modes:
       • Loop optimization
       • MPI optimization
     • MPI accounts for roughly 30% of run time on 450 processors on the Platinum IA-32 NCSA cluster
     • Calculations (70+% of run time) are primarily 3-D DO loops

  6. ARPS Optimization Revisited
     • MPI optimization: hide communications behind calculations
     • Requires hand coding and knowledge of the computational structure, a very time-intensive task
     • Maximum gain is limited to the communication cost (30%); realistically we may obtain a 15% improvement

  7. ARPS Optimization Revisited
     • Loop optimization for vector processors
     • Issues:
       • Length of the vector pipeline: the longer, the better
       • KMA work achieved nearly 75% of peak (6 GFLOPS per processor on the SX-5)
       • The code was hand tuned: hundreds of loops

  8. ARPS Optimization Revisited
     • Loop optimization for scalar processors
     • Issues:
       • Cheap, fast processors
       • Cache reuse is very important
       • Rethink the order/layout of the computational structure of ARPS
     • Some optimization was done in 1997, removing redundant computations and combining loops (good for both vector and scalar machines)
     • CPU utilization is only 10-20% of peak

  9. ARPS Optimization Revisited
     • New approach to loop optimization
     • Combine loops further; the result is fewer loads and stores, which is very important on the new Intel technology
     • Cache reuse is critical!
     • Force improvements in compiler technology
     • Our goal is to generate optimizations that are platform INDEPENDENT
     • Example follows

  10. Horizontal Advection - Original Version

      DO k=2,nz-2          ! compute avgx(u) * difx(u)
        DO j=1,ny-1
          DO i=1,nx-1
            tem2(i,j,k)=tema*(u(i,j,k,2)+u(i+1,j,k,2))*(u(i+1,j,k,2)-u(i,j,k,2))
          END DO
        END DO
      END DO

      DO k=2,nz-2          ! compute avg2x(u) * dif2x(u)
        DO j=1,ny-1
          DO i=2,nx-1
            tem3(i,j,k)=tema*(u(i-1,j,k,2)+u(i+1,j,k,2))*(u(i+1,j,k,2)-u(i-1,j,k,2))
          END DO
        END DO
      END DO

      DO k=2,nz-2          ! compute 4/3*avgx(tem2) + 1/3*avg2x(tem3)
        DO j=1,ny-1        ! signs are reversed for the force array.
          DO i=3,nx-2
            uforce(i,j,k)=uforce(i,j,k)
     :          +tema*(tem3(i+2,j,k)+tem3(i-1,j,k))
     :          -temb*(tem2(i-1,j,k)+tem2(i,j,k))
          END DO
        END DO
      END DO

  11. Horizontal Advection - Modified Version

      The three loops are merged into one large loop that reuses data and
      reduces loads and stores.

      DO k=2,nz-2
        DO j=1,ny-1
          DO i=3,nx-2
            uforce(i,j,k)=uforce(i,j,k)
     :          +tema*((u(i,j,k,2)+u(i+2,j,k,2))*(u(i+2,j,k,2)-u(i,j,k,2))
     :                +(u(i-2,j,k,2)+u(i,j,k,2))*(u(i,j,k,2)-u(i-2,j,k,2)))
     :          -temb*((u(i,j,k,2)+u(i+1,j,k,2))*(u(i+1,j,k,2)-u(i,j,k,2))
     :                +(u(i-1,j,k,2)+u(i,j,k,2))*(u(i,j,k,2)-u(i-1,j,k,2)))
          END DO
        END DO
      END DO

  12. [Chart: optimized vs. original performance]

  13. [Chart: optimized vs. original performance]
