Benchmarks of a Weather Forecasting Research Model
Daniel B. Weber, Ph.D.
Research Scientist, CAPS/University of Oklahoma
****CONFIDENTIAL****
August 3, 2001
UNM Los Lobos INTEL Benchmark Summary
• 20% increase in compute time for the 2 proc/node configuration on the Intel-based systems, because the two processors compete for the node's memory bus
• File system is very slow on the Intel-based systems without Fibre Channel
• File system is the weak link (UNM-LL)
• 5.5 MB/sec sustained for the 480 2 proc/node tests, writing 2.1 MB files from 8 separate processors simultaneously
• Data passes through the Linux file server, not the R6000
ALPHA Benchmark Summary
• The ES-40 Alpha EV-67 (TCS) is 5 times faster computationally than the Intel PIII/733
• The Alpha (TCS) file system is very slow at times; the configuration needs to be examined, but it shows potential for very fast transfer rates
• MPI overhead for a 256-processor TCS job is on the order of 15%, indicating very good network performance
ALPHA Benchmark Summary
• The ES-45 Alpha EV-67 (TCS) is 1.5 times faster computationally than the ES-40
• It is 4-5 times faster than the Intel PIII-1 GHz (using the Intel F90 compiler)
ARPS Optimization Revisited
• Two modes:
• Loop optimization
• MPI optimization
• MPI accounts for roughly 30% of run time on 450 processors of the NCSA Platinum IA-32 cluster
• Calculations (70+% of run time) are primarily 3-D DO loops
ARPS Optimization Revisited
• MPI optimization:
• Hide communication behind computation (a minimal sketch follows below)
• Requires hand coding and knowledge of the computational structure, a very time-intensive task
• Maximum gain is limited to the communication cost (30%); realistically we may obtain a 15% improvement
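A minimal sketch of the hide-communication pattern, not ARPS source: the array names (u, uforce), the grid size, the simple second-order stencil, and the 1-D east/west decomposition with packed halo buffers are all illustrative assumptions; only the overlap structure (post MPI_IRECV/MPI_ISEND, compute the interior, MPI_WAITALL, then finish the boundary points) is what the slide describes.

! Illustrative sketch only (not ARPS code): overlap halo exchange with
! interior computation. Names and sizes are assumptions.
      PROGRAM overlap_sketch
      USE mpi
      IMPLICIT NONE
      INTEGER, PARAMETER :: nx=67, ny=67, nz=35
      REAL :: u(nx,ny,nz), uforce(nx,ny,nz)
      REAL :: swest(ny*nz), seast(ny*nz), rwest(ny*nz), reast(ny*nz)
      INTEGER :: rank, nprocs, west, east, ierr
      INTEGER :: req(4), stats(MPI_STATUS_SIZE,4)
      INTEGER :: i, j, k, n

      CALL MPI_INIT(ierr)
      CALL MPI_COMM_RANK(MPI_COMM_WORLD, rank, ierr)
      CALL MPI_COMM_SIZE(MPI_COMM_WORLD, nprocs, ierr)
      west = rank-1
      east = rank+1
      IF (west < 0)       west = MPI_PROC_NULL
      IF (east >= nprocs) east = MPI_PROC_NULL
      u = 1.0
      uforce = 0.0

! 1. Pack the faces next to each boundary and post the halo exchange.
      n = 0
      DO k=1,nz
        DO j=1,ny
          n = n+1
          swest(n) = u(2,j,k)
          seast(n) = u(nx-1,j,k)
        END DO
      END DO
      CALL MPI_IRECV(rwest, ny*nz, MPI_REAL, west, 1, MPI_COMM_WORLD, req(1), ierr)
      CALL MPI_IRECV(reast, ny*nz, MPI_REAL, east, 2, MPI_COMM_WORLD, req(2), ierr)
      CALL MPI_ISEND(swest, ny*nz, MPI_REAL, west, 2, MPI_COMM_WORLD, req(3), ierr)
      CALL MPI_ISEND(seast, ny*nz, MPI_REAL, east, 1, MPI_COMM_WORLD, req(4), ierr)

! 2. Compute the interior (needs no halo data) while messages are in flight.
      DO k=2,nz-2
        DO j=1,ny-1
          DO i=3,nx-2
            uforce(i,j,k) = uforce(i,j,k) + 0.5*(u(i+1,j,k)-u(i-1,j,k))
          END DO
        END DO
      END DO

! 3. Wait for the halos, unpack them, then finish the boundary columns
!    (i=2 and i=nx-1) that depend on neighbour data.
      CALL MPI_WAITALL(4, req, stats, ierr)
      n = 0
      DO k=1,nz
        DO j=1,ny
          n = n+1
          IF (west /= MPI_PROC_NULL) u(1,j,k)  = rwest(n)
          IF (east /= MPI_PROC_NULL) u(nx,j,k) = reast(n)
        END DO
      END DO
      IF (rank == 0) PRINT *, 'interior done, sample value:', uforce(3,1,2)

      CALL MPI_FINALIZE(ierr)
      END PROGRAM overlap_sketch

The gain from this pattern is bounded by the smaller of the message time and the interior compute time available to hide it behind, which is why the realistic improvement quoted above sits well below the full 30% communication cost.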
ARPS Optimization Revisited
• Loop optimization for vector processors
• Issues:
• Length of the vector pipeline: the longer the better (illustrated below)
• KMA work shows nearly 75% of peak (6 GFLOPS per processor on the SX-5)
• The code was hand tuned, hundreds of loops
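To illustrate the vector-length point only (this is not ARPS code; the kernel, array names, and sizes are made up), the sketch below writes the same scale-and-copy operation two ways: with the usual short inner loop of length nx, and with the i and j loops collapsed via sequence association so the inner loop, and hence the vectors fed to the pipeline, are nx*ny long.

! Illustrative sketch: lengthening the vector pipeline by collapsing loops.
      PROGRAM vector_length_sketch
      IMPLICIT NONE
      INTEGER, PARAMETER :: nx=67, ny=67, nz=35
      REAL :: a(nx,ny,nz), b(nx,ny,nz)
      CALL RANDOM_NUMBER(b)
      CALL short_vectors(a, b, nx, ny, nz)   ! inner loop length nx
      CALL long_vectors (a, b, nx*ny, nz)    ! inner loop length nx*ny
      PRINT *, a(1,1,1)
      END PROGRAM vector_length_sketch

! Inner loop runs over i only, so each vector start-up processes nx elements.
      SUBROUTINE short_vectors(a, b, nx, ny, nz)
      IMPLICIT NONE
      INTEGER :: nx, ny, nz, i, j, k
      REAL :: a(nx,ny,nz), b(nx,ny,nz)
      DO k = 1, nz
        DO j = 1, ny
          DO i = 1, nx
            a(i,j,k) = 2.0*b(i,j,k)
          END DO
        END DO
      END DO
      END SUBROUTINE short_vectors

! Same arrays passed with i and j collapsed (legal Fortran sequence
! association): the inner loop is now nx*ny long, keeping the vector
! pipeline full far longer per start-up.
      SUBROUTINE long_vectors(a, b, nxy, nz)
      IMPLICIT NONE
      INTEGER :: nxy, nz, ij, k
      REAL :: a(nxy,nz), b(nxy,nz)
      DO k = 1, nz
        DO ij = 1, nxy
          a(ij,k) = 2.0*b(ij,k)
        END DO
      END DO
      END SUBROUTINE long_vectors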
ARPS Optimization Revisited
• Loop optimization for scalar processors
• Issues:
• Cheap, fast processors
• Cache reuse is very important (illustrated below)
• Rethink the order/layout of the computational structure of ARPS
• Some optimization was included in 1997, which removed redundant computations and combined loops (good for both vector and scalar machines)
• CPU utilization is only 10-20% of peak
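For the cache-reuse point, a generic illustration rather than the ARPS change itself (the ARPS change is the loop merging shown on the following slides): on a cache-based scalar processor the loop nest should be ordered so the innermost loop walks the leftmost, contiguous Fortran index with unit stride, so every element of a fetched cache line is used. Array names and sizes below are illustrative.

! Illustrative sketch: loop order and cache reuse on a scalar processor.
      PROGRAM cache_stride_sketch
      IMPLICIT NONE
      INTEGER, PARAMETER :: nx=67, ny=67, nz=35
      REAL :: a(nx,ny,nz), b(nx,ny,nz)
      INTEGER :: i, j, k
      CALL RANDOM_NUMBER(b)

! Cache-unfriendly: the innermost loop over k jumps nx*ny reals between
! consecutive accesses, so most of each fetched cache line goes unused.
      DO i = 1, nx
        DO j = 1, ny
          DO k = 1, nz
            a(i,j,k) = 2.0*b(i,j,k)
          END DO
        END DO
      END DO

! Cache-friendly: the innermost loop over i is unit stride (column-major
! layout), so every element of a fetched cache line is consumed.
      DO k = 1, nz
        DO j = 1, ny
          DO i = 1, nx
            a(i,j,k) = 2.0*b(i,j,k)
          END DO
        END DO
      END DO
      PRINT *, a(1,1,1)
      END PROGRAM cache_stride_sketch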
ARPS Optimization Revisited
• New approach to loop optimization
• Combine loops further; the result is fewer loads and stores, which is very important on the new Intel technology
• Cache reuse is critical!
• Force improvements in compiler technology
• Our goal is to generate optimizations that are platform INDEPENDENT
• Example (next two slides):
Horizontal Advection - Original Version

      DO k=2,nz-2          ! compute avgx(u) * difx(u)
        DO j=1,ny-1
          DO i=1,nx-1
            tem2(i,j,k)=tema*(u(i,j,k,2)+u(i+1,j,k,2))*(u(i+1,j,k,2)-u(i,j,k,2))
          END DO
        END DO
      END DO

      DO k=2,nz-2          ! compute avg2x(u) * dif2x(u)
        DO j=1,ny-1
          DO i=2,nx-1
            tem3(i,j,k)=tema*(u(i-1,j,k,2)+u(i+1,j,k,2))*(u(i+1,j,k,2)-u(i-1,j,k,2))
          END DO
        END DO
      END DO

      DO k=2,nz-2          ! compute 4/3*avgx(tem2)+1/3*avg2x(tem3);
        DO j=1,ny-1        ! signs are reversed for the force array.
          DO i=3,nx-2
            uforce(i,j,k)=uforce(i,j,k)
     :                   +tema*(tem3(i+1,j,k)+tem3(i-1,j,k))
     :                   -temb*(tem2(i-1,j,k)+tem2(i,j,k))
          END DO
        END DO
      END DO
Horizontal Advection - Modified Version

Three loops are merged into one large loop that reuses data and reduces loads and stores.

      DO k=2,nz-2
        DO j=1,ny-1
          DO i=3,nx-2
            uforce(i,j,k)=uforce(i,j,k)
     :       +tema*((u(i,j,k,2)+u(i+2,j,k,2))*(u(i+2,j,k,2)-u(i,j,k,2))
     :             +(u(i-2,j,k,2)+u(i,j,k,2))*(u(i,j,k,2)-u(i-2,j,k,2)))
     :       -temb*((u(i,j,k,2)+u(i+1,j,k,2))*(u(i+1,j,k,2)-u(i,j,k,2))
     :             +(u(i-1,j,k,2)+u(i,j,k,2))*(u(i,j,k,2)-u(i-1,j,k,2)))
          END DO
        END DO
      END DO
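As a rough count of what the merge saves, assuming tem2 and tem3 round-trip through memory rather than staying resident in cache: the three-loop form stores one element of tem2 and one of tem3 per grid point and later reloads two elements of each, roughly two extra stores and four extra loads per point, plus two additional sweeps over u and the extra loop start-up overhead. The merged loop removes all of the temporary-array references and touches only u and uforce.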
[Performance charts: optimized vs. original code]