The Synergy between Computer Science and CFD Modeling
Dov Kruger • Anne Pence • Alan Blumberg
Background of the Davidson Laboratory Team
• Professor Alan Blumberg, co-author of POM and author of ECOMSED
• Dov Kruger, computer science, transitioning into ocean research
• Anne Pence, PhD candidate with a background in naval architecture and oceanography
CFD is a Computing War Against Big Problems
• An overview of five key areas, with a focus on our research in single- and multiple-CPU performance and accuracy
• Understanding the battleground: architectural features of computers
• Having a tactical plan: coding efficiency
• Effective use of resources:
  • Parallelism on shared-memory architectures
    • Auto-parallelizing compilers
    • OpenMP
  • Parallelism using distributed computers (MPI)
• Hitting the correct target: accuracy
Ignorance is useful, knowledge is crucial
• We questioned everything, including the equation of state
• Surprising results: a new equation of state, and a new style of coding CFD models that coalesces loops and is significantly faster on modern computers
• Nothing worked until Professor Blumberg
  • Conveyed the analytics of the core routines
  • Designed suitable test cases
  • Debugged problems
Doubling Performance of Barotropic Mode: 10 months + 2.5 hours
• Blindly optimizing the code is very difficult; it took time to understand the model
• Problems
  • Validating results
  • Finding the right test case
  • Incremental test procedure
• Once the correct strategy was in place, it took only 2.5 hours to double the performance of the barotropic mode
• More optimizations are possible, but they get exponentially more difficult
  • We propose not to do them by hand
Computer Architecture
• Locality = speed!
  • Registers
  • Different levels of cache
  • Main memory
  • Swapping
  • Disk
• Sequential access is much faster than non-sequential (see the sketch below)
• Locality is also accuracy: values held in registers retain extra bits of precision
• Multiple execution units
• The limiting factor today is memory bandwidth
  • ECOMSED and POM are both memory-limited
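To make the sequential-access point concrete, here is a minimal sketch; the array name and sizes are ours, not the model's. Both loop nests touch every element of a 3D array, but the first runs in Fortran's column-major storage order while the second strides im*jm elements between consecutive accesses and misses the cache almost every time.

      program seqdemo
      implicit none
      integer im, jm, kb, i, j, k
      parameter (im=200, jm=200, kb=20)
      real var(im,jm,kb), total
c     fast: i varies fastest, matching Fortran's column-major storage
      do k = 1, kb
        do j = 1, jm
          do i = 1, im
            var(i,j,k) = 1.0
          enddo
        enddo
      enddo
c     slow: k varies fastest; each access jumps im*jm elements,
c     so nearly every reference misses the cache
      total = 0.0
      do i = 1, im
        do j = 1, jm
          do k = 1, kb
            total = total + var(i,j,k)
          enddo
        enddo
      enddo
      write(*,*) total
      end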
Efficiency
• There is a cost structure to operations
  • It changes over time
• Today's highest costs:
  • Writing to memory (cache must flush to memory)
  • Reading from memory (statistically, cached)
  • exp
  • log
  • sqrt
  • Division
  • Multiplication
Heuristics for Writing Fast Code
• Efficiency doesn't hurt
  • It may not help
• Minimize the cost of operations
  • Costs differ on each machine
  • But they are similar enough that heuristics will produce good results on most architectures
• On average, performance will improve if expressions are sufficiently large
Example: Computing Bottom Stress (original)

      do j=2,jmm1
        do i=2,imm1
          ubar(i,j,kbm1) = 0.5*(ua(i,j)+ua(i+1,j))
          vbar(i,j,kbm1) = 0.5*(va(i,j)+va(i,j+1))
        enddo
      enddo
      do j=2,jmm1
        do i=2,imm1
          qbar(i,j) = sqrt(ubar(i,j,kbm1)*ubar(i,j,kbm1) +
     $                     vbar(i,j,kbm1)*vbar(i,j,kbm1))
        enddo
      enddo
      do j=2,jmm1
        do i=2,imm1
          if (fsm(i,j).gt.0.0) then
            tau(i,j,kb) = 10000.*cbc(i,j)/
     $                    varbf(i,j)*qbar(i,j)*qbar(i,j)
          endif
        enddo
      enddo
Bottom Stress: 21 Times Faster

c     the three loops are fused and the temporaries become scalars;
c     the sqrt vanishes because qbar was only ever used squared; the
c     0.5**2, 10000., and 1/varbf factors are folded (we assume) into
c     the precomputed cbc2 array
      do j=2,jmm1
        do i=2,imm1
          tempUBAR = ua(i,j) + ua(i+1,j)
          tempVBAR = va(i,j) + va(i,j+1)
          tau(i,j,kb) = cbc2(i,j)*(tempUBAR**2 + tempVBAR**2)
        enddo
      enddo
The Current State of Compiler Optimization
• Compilers typically do not rearrange floating-point expressions
  • For fear of subtle roundoff effects
• Instruction scheduling can still be done as long as the two data streams do not interact
  • Example: (x * y) - (x / y) + (x + y) - (x - y)
• Common subexpressions are well handled, provided they appear in the same form
• Algebraic equivalences are not considered
• Constant subexpressions are pre-evaluated at compile time in C and FORTRAN
Examples: Writing Fast Code
• Only scalars are considered constant
• Writing to an array is not optimized away:
    var1(i,j,k) = c1 * c2 * c3
    var1(i,j,k) = var1(i,j,k) * c4
  is not collapsed into
    var1(i,j,k) = c1 * c2 * c3 * c4
• The order and grouping of constants matters:
    c1 * var2(i,j,k) * c3 * var3(i,j,k) * c4
    c1*c3*c4 * var2(i,j,k) * var3(i,j,k)
    var2(i,j,k) * var3(i,j,k) * c1*c3*c4
    var2(i,j,k) * var3(i,j,k) * (c1*c3*c4)
• Division is expensive:
    var1(i,j,k) / c1(i,j,k)
  (a sketch follows this list)
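Because the compiler will not reassociate floating-point expressions, these rewrites must be done by hand. A minimal sketch combining two of the heuristics, reusing the names from the examples above (the enclosing routine is ours, for illustration):

      subroutine fold(var1, var2, var3, c1, c3, c4, c5, im, jm, kb)
      implicit none
      integer im, jm, kb, i, j, k
      real var1(im,jm,kb), var2(im,jm,kb), var3(im,jm,kb)
      real c1, c3, c4, c5, cfold
c     slow form: constants multiplied element by element, plus a
c     costly division per element:
c       var1(i,j,k) = c1 * var2(i,j,k) * c3 * var3(i,j,k) * c4 / c5
c     fast form: fold the constants once outside the loop and turn
c     the division into a multiplication by the reciprocal
      cfold = c1 * c3 * c4 / c5
      do k = 1, kb
        do j = 1, jm
          do i = 1, im
            var1(i,j,k) = var2(i,j,k) * var3(i,j,k) * cfold
          enddo
        enddo
      enddo
      return
      end

This is precisely the rearrangement the compiler refuses to make on its own, because the rounding of the result can change slightly.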
Skipping Land
• Goals
  • Save CPU by processing only water boxes
  • Cost very little CPU on all-water problems
• IF-tests slow down the CPU, so preferably avoid them
• A stripwise "unstructured view" of the grid avoids additional tests (see the strip table and the sketch that follow)
  • Inefficient on a vector machine
• k-order would be better, testing an entire water column at once, but this would require re-engineering the entire model
Skipping land, continued

Each strip (istart, jrow, iend) spans the water points of one row:

  istart  jrow  iend
     2      2     2
     5      2     6
     2      3     2
     5      3     6
     2      4     6
     2      5     6
     2      6     6
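A minimal sketch of the stripwise traversal, assuming the strips above are held in arrays istart(n), jrow(n), iend(n); the routine name and loop body are ours, for illustration only:

      subroutine overwater(el, d, im, jm, nstrip, istart, jrow, iend)
      implicit none
      integer im, jm, nstrip, n, i, j
      integer istart(nstrip), jrow(nstrip), iend(nstrip)
      real el(im,jm), d(im,jm)
      do n = 1, nstrip
        j = jrow(n)
c       only water points are visited: the fsm if-test disappears from
c       the inner loop, and an all-water row costs just one strip entry
        do i = istart(n), iend(n)
          el(i,j) = el(i,j) + d(i,j)
        enddo
      enddo
      return
      end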
High Performance Advection
• Scalar fluxes in one preferred direction
  • No memory access for the flux
  • Full machine precision
• In the i direction, only the flux at face i+1 (xFluxIP1) is computed each iteration; the flux at face i is carried over in a scalar from the previous iteration:
    var(i,j,k) = xFluxIP1 - xFluxI
    xFluxI = xFluxIP1
Advection, continued

      yFluxJ(1) = 0
      yFluxJ(im) = 0
      diffusiveYFluxJ(1) = 0
      diffusiveYFluxJ(im) = 0
      do k=2,kbm1
        do i=2,imm1
          yFluxJ(i) = 0            ! because of land boundary condition
          diffusiveYFluxJ(i) = 0
        enddo
        do j=2,jmm1
          xFluxI = 0               ! because of land boundary condition
          do i=2,imm1
            xFluxIP1 = (f(i,j,k)+f(i+1,j,k)) * xmfl3d(i+1,j,k)
            yFluxJP1 = (f(i,j,k)+f(i,j+1,k)) * ymfl3d(i,j+1,k)
            diffusiveXFluxIP1 = -aamx(i,j,k) * addDTi(i,j,BACK) *
     $          (fb(i,j,k)-fb(i-1,j,k)) * uMaskedH2_2H1(i,j)
            diffusiveYFluxJP1 = -aamy(i,j,k) * addDTj(i,j,BACK) *
     $          (fb(i,j,k)-fb(i,j-1,k)) * vMaskedH1_2H2(i,j)
Advection, take 3

            ff(i,j,k) =
     $        (
     $          fB(i,j,k) * volT(i,j,BACK) -          ! advective part
     $          dti2 * 0.5 * invDZ(k) *
     $          (
     $            (f(i,j,k-1)+f(i,j,k))*w(i,j,k) -
     $            (f(i,j,k)+f(i,j,k+1))*w(i,j,k+1)*ART(i,j) +
     $            xFluxIP1 - xFluxI + yFluxJP1 - yFluxJ(i)
     $          ) +                                   ! diffusive part
     $          diffusiveXFluxIP1 - diffusiveXFluxI +
     $          diffusiveYFluxJP1 - diffusiveYFluxJ(i)
     $        ) *
     $        invVolT(i,j,FUTURE)
            xFluxI = xFluxIP1
            diffusiveXFluxI = diffusiveXFluxIP1
            yFluxJ(i) = yFluxJP1
            diffusiveYFluxJ(i) = diffusiveYFluxJP1
          enddo
        enddo
      enddo
K-order is Faster
• POM and ECOMSED currently store arrays with i varying most frequently: var(i,j,k)
• For most algorithms, traversal order is irrelevant, except:
  • Vertical models are naturally ordered top to bottom
• A model stored and accessed in k-order can implement the vertical profile algorithms much more efficiently (see the sketch below)
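A sketch of the difference for a simple vertical operation; the routine, the k-ordered array tk, and the loop body are ours, for illustration. With var(i,j,k), consecutive k values lie im*jm elements apart in memory; with a k-ordered tk(k,i,j), each water column is contiguous:

      subroutine vsweep(t, tk, a, im, jm, kb)
      implicit none
      integer im, jm, kb, i, j, k
      real t(im,jm,kb), tk(kb,im,jm), a(kb)
c     current layout: the k-sweep strides im*jm elements per access
      do j = 2, jm-1
        do i = 2, im-1
          do k = 2, kb-1
            a(k) = t(i,j,k-1) - 2.0*t(i,j,k) + t(i,j,k+1)
          enddo
        enddo
      enddo
c     k-ordered layout: the column tk(1:kb,i,j) is contiguous, so the
c     same sweep runs sequentially through memory
      do j = 2, jm-1
        do i = 2, im-1
          do k = 2, kb-1
            a(k) = tk(k-1,i,j) - 2.0*tk(k,i,j) + tk(k+1,i,j)
          enddo
        enddo
      enddo
      return
      end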
Performance of PROFT/PROFS
• On a sample problem:
  • 3 times faster for profiling temperature
  • 8 times faster for profiling salinity
• This includes the penalty of traversing T and S in the wrong memory order
  • For a k-ordered model, the code would be even faster
• Techniques as discussed before, and also:
  • Removal of exponentiation where unnecessary
Shared Memory Parallelism
• Loops are split up across multiple CPUs that share one memory
• Can be:
  • Automatic (an auto-parallelizing compiler recognizes the opportunities)
  • Manual (OpenMP directives; see the sketch below)
• The problem is memory bandwidth
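A minimal sketch of a manually parallelized loop using a standard OpenMP directive for fixed-form FORTRAN; the routine and array names are ours, not from POM or ECOMSED. Each thread receives a block of j rows, and the inner i loop preserves sequential memory access:

      subroutine smooth(el, d, im, jm)
      implicit none
      integer im, jm, i, j
      real el(im,jm), d(im,jm)
c$omp parallel do private(i,j)
      do j = 2, jm-1
        do i = 2, im-1
          el(i,j) = 0.25 * (d(i-1,j) + d(i+1,j)
     $                    + d(i,j-1) + d(i,j+1))
        enddo
      enddo
c$omp end parallel do
      return
      end

Even with perfect parallel speedup of the arithmetic, a loop like this remains memory-limited: all the threads compete for the same memory bus.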
Memory Bandwidth and Shared Memory Computers
• Different levels:
  • Commodity dual-processor PCs
  • Mid-range shared-memory computers
    • Larger caches
    • Floating-point performance of the CPUs is critical
  • Special-purpose machines with exotic memory and caches
• Write-back cache synchronization
MPI
• The Message Passing Interface requires manual coding intervention
• The programmer splits the domain across multiple computers
• Messages are passed to exchange information between the subdomain models (see the sketch below)
• This will not speed up small models, due to the overhead of message passing
  • The volume of the model must be large with respect to the data-exchange interface
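A minimal sketch of the idea, assuming a one-dimensional split in j with one ghost row on each side; all names and sizes are ours, not ECOMSED's actual decomposition:

      program halo
      implicit none
      include 'mpif.h'
      integer im, jloc
      parameter (im=100, jloc=50)
      real f(im, 0:jloc+1)
      integer ierr, rank, nprocs, up, down, i, j
      integer status(MPI_STATUS_SIZE)
      call MPI_INIT(ierr)
      call MPI_COMM_RANK(MPI_COMM_WORLD, rank, ierr)
      call MPI_COMM_SIZE(MPI_COMM_WORLD, nprocs, ierr)
      up   = rank + 1
      down = rank - 1
      if (up .ge. nprocs) up   = MPI_PROC_NULL
      if (down .lt. 0)    down = MPI_PROC_NULL
c     stand-in for the time step, which computes f(:,1:jloc) locally
      do j = 0, jloc+1
        do i = 1, im
          f(i,j) = real(rank)
        enddo
      enddo
c     send our top row up; receive our bottom ghost row from below
      call MPI_SENDRECV(f(1,jloc),   im, MPI_REAL, up,   0,
     $                  f(1,0),      im, MPI_REAL, down, 0,
     $                  MPI_COMM_WORLD, status, ierr)
c     send our bottom row down; receive our top ghost row from above
      call MPI_SENDRECV(f(1,1),      im, MPI_REAL, down, 1,
     $                  f(1,jloc+1), im, MPI_REAL, up,   1,
     $                  MPI_COMM_WORLD, status, ierr)
      call MPI_FINALIZE(ierr)
      end

The computation per step grows with im*jloc while the exchange grows only with im, which is the volume-to-interface ratio the last bullet above describes.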
Accuracy
• Values in registers retain extra bits of precision
  • This can substantially improve computational results
  • Every time values are stored to single-precision arrays, this extra information is lost
• Example: vectorization on the Pentium 4
  • Very fast (4 simultaneous 32-bit operations)
  • Values change substantially in the model
  • Loss of significant figures
Simulation is not Prediction
• Roundoff error slowly burns away digits of precision
  • No escape from information entropy
• Roundoff is indistinguishable from small differences in the initial conditions
• Single precision is not enough
  • After 12 hours, salinity results can differ by 3 ppt on different computers
Calibration of a Model Run on Different Machines?
• Calibration balances all the constituents of the equations as they are computed by the model
• Some parts of the model are more sensitive to roundoff error than others
• Some parts are driven more by forcing functions and are therefore more stable
• Turbulence closure is particularly twitchy
Density Computation
• The equation of state has been defined by Millero and Poisson in the UNESCO standard
• Accuracy is high
  • On the order of 6 digits
  • The original data (>500 samples) has 5 digits
• Computational cost is also high
  • Many terms, including an s^1.5 term
• How much accuracy is needed?
Comparison: Fofonoff 1952 vs. UNESCO
• ECOMSED used Fofonoff 1952 until recently
• Approximately 0.6% difference in the worst case, near 10°C and 10 ppt
• In a computer model, the absolute density is irrelevant
  • The density difference between adjacent cells drives the forcing
• In a test with a 20 ppt difference:
  • No significant difference in X1
  • Small differences in the distribution of the salt gradient between 0 and 20 ppt
A Faster Algorithm: densityDKAP

      c1  =  6.38582008014815d-12
      c2  = -1.07670862741285d-09
      c3  =  9.73156784811086d-08
      c4  = -9.0042779076007d-06
      c5  =  6.60339902342078d-05
      c6  = -0.000132403263360079d0
      c7  =  4.91156586724696d-12
      c8  = -7.81209323957205d-10
      c9  =  6.72419650942208d-08
      c10 = -3.5945086374423d-06
      c11 =  0.000809532551425d0
      c12 = -1.59139276805119d-07
      c13 = -2.00957653399204d-12
      c14 =  1.28333259714417d-10
      c15 =  2.34587013019438d-09
      do k=1,kb-1
        do j=2,jm-1
          do i=2,im-1
            rho(i,j,k) =
     $        ((((c13 * t(i,j,k) + c14) * t(i,j,k) + c15) *
     $            s(i,j,k) + c12) * s(i,j,k) +
     $         ((((c7 * t(i,j,k) + c8) * t(i,j,k)) + c9) *
     $            t(i,j,k) + c10) * t(i,j,k) + c11) * s(i,j,k) +
     $        ((((c1 * t(i,j,k) + c2) * t(i,j,k) + c3) *
     $            t(i,j,k) + c4) * t(i,j,k) + c5) * t(i,j,k) + c6
          enddo
        enddo
      enddo
UNESCO, DKAP and MS
• Dr. Martin Senator of Davidson Laboratory contributed a different fit, MS
• Superior to UNESCO in fit to the original data
Consistent Attention to Accuracy
• In POM and ECOMSED, rho is computed for each cell
  • In POM, density is stored as a ratio from rhoref to save digits
• Differences are then subtracted between adjacent cells
  • Salinity: a difference of 0.001 ppt yields a result in the 7th decimal place
  • Temperature: a difference of 0.01°C yields a result in the 7th decimal place
  • Result: 0 digits of accuracy left in single precision
• Conclusion: compute density in double precision; anything else may be inaccurate
Double-Precision Density vs. Single Precision
• Double precision is more stable (a sketch follows)
• Differential calculation of density will presumably be much more accurate, and therefore even more stable
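As an illustration of the last two slides, the sketch below evaluates the densityDKAP polynomial at salinities 0.001 ppt apart and subtracts. The program and its sample values are ours, not part of the model: the difference survives in double precision, but loses its leading digits once the densities are first rounded to single precision, as happens whenever rho is stored in single-precision arrays.

      program precdemo
      implicit none
      real*8 dkap
      real*8 t, s, ds, rho1, rho2, drho8
      real*4 r1s, r2s, drho4
      t  = 10.0d0
      s  = 25.000d0
      ds = 0.001d0
      rho1 = dkap(t, s)
      rho2 = dkap(t, s+ds)
c     difference carried entirely in double precision: full accuracy
      drho8 = rho2 - rho1
c     densities rounded to single precision before subtracting, as in
c     a model that stores rho in single-precision arrays
      r1s = rho1
      r2s = rho2
      drho4 = r2s - r1s
      write(*,*) 'double precision difference: ', drho8
      write(*,*) 'single precision difference: ', drho4
      end

      real*8 function dkap(t, s)
      implicit none
      real*8 t, s
      real*8 c1,c2,c3,c4,c5,c6,c7,c8,c9,c10,c11,c12,c13,c14,c15
      c1  =  6.38582008014815d-12
      c2  = -1.07670862741285d-09
      c3  =  9.73156784811086d-08
      c4  = -9.0042779076007d-06
      c5  =  6.60339902342078d-05
      c6  = -0.000132403263360079d0
      c7  =  4.91156586724696d-12
      c8  = -7.81209323957205d-10
      c9  =  6.72419650942208d-08
      c10 = -3.5945086374423d-06
      c11 =  0.000809532551425d0
      c12 = -1.59139276805119d-07
      c13 = -2.00957653399204d-12
      c14 =  1.28333259714417d-10
      c15 =  2.34587013019438d-09
      dkap =
     $  ((((c13*t + c14)*t + c15)*s + c12)*s +
     $   ((((c7*t + c8)*t) + c9)*t + c10)*t + c11)*s +
     $  ((((c1*t + c2)*t + c3)*t + c4)*t + c5)*t + c6
      end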
Calculate the Density Difference Analytically
• For greatest accuracy, calculate the differential density directly
• Slightly slower, but full accuracy
• Density(s,t) = Pn(s,t)
• DiffDensity(s,t,ds,dt)
Future Directions in Hardware
• New processors with more registers
  • Bigger expressions become even more advantageous
• Improved shared-memory performance
  • Improved write-back cache design, larger caches, faster interconnects
• Intelligent memory subsystems preloading values
  • Sequential access becomes even more important
Improved Software
• We propose to write a Model Construction Toolkit (MCT) that will do all the optimizations mentioned here, and more
• Reimplement POM/ECOMSED using the new toolkit
  • Allow calling FORTRAN routines
• Enter algorithms either in a FORTRAN-like syntax or in a higher-level syntax based on common notation for difference equations
• Automatically optimize expressions
  • Compute array subexpressions only when changed
  • Compute results using an optimal function given the situation
• The result: more robust, reliable modeling, at much higher speed, with less effort
Acknowledgements
• Professor Roger Pinkham, for superb numerical and statistical knowledge and insight
• Dr. Martin Senator, for his densityMS fit and help getting started with R
• The makers of R, an incredible statistics package
• John Gilson, who reviewed the presentation and helped us get the right computing focus, and whose discussions first got us started thinking about a model construction toolkit