370 likes | 376 Views
Explore the connection between computer science and computational fluid dynamics (CFD) modeling, highlighting research in CPU performance, coding efficiency, parallelism, and accuracy. Discover surprising results and optimization techniques for faster CFD models.
E N D
The Synergy between Computer Science and CFD Modeling Dov Kruger Anne Pence Alan Blumberg
Background of the Davidson Laboratory Team • Professor Alan Blumberg, co-author of POM and author of ECOMSED • Dov Kruger, computer science, transitioning into ocean research • Anne Pence, PhD candidate with a background in naval architecture and Oceanography
CFD is a Computing War Against Big Problems • Overview of 5 key areas, and focus on our research in single and multiple CPU performance and accuracy • Understanding the battleground: Architectural Features of Computers • Having a tactical plan: coding efficiency • Effective Use of resources: • Parallelism on shared memory architectures • Auto-parallelizing compilers • OpenMP • Parallelism using distributed computers (MPI) • Hitting the correct target: Accuracy
Ignorance is useful, knowledge is crucial • We questioned everything, including equation of state • Surprising results: a new equation of state, and a new style of coding CFD models that coalesces loops and is significantly faster on modern computers • Nothing worked until Professor Blumberg • Conveyed the analytics of the core routines • Designed suitable test cases • Debugged problems
Doubling Performance of Barotropic Mode: 10 months + 2.5 hours • Blindly optimizing the code is very difficult • It took time to understand the model • Problems • Validating results • Finding the right test case • Incremental test procedure • Once the correct strategy was in place, it took only 2.5 hours to double the performance of barotropic mode • More optimizations are possible, but they get exponentially more difficult • We propose not to do them by hand
Computer Architecture • Locality = Speed! • Registers • Different Levels of cache • Main Memory • Swapping • Disk • Sequential access is much faster than non-sequential • Locality is also accuracy • Multiple execution units • The limiting factor today is memory bandwidth • ECOMSED and POM are both memory limited
Efficiency • There is a cost structure to operations • Changes over time • Today’s Highest Costs • Writing to memory • Cache must flush to memory • Reading from memory • Statistically, cached • exp • log • sqrt • Division • Multiplication
Heuristics for Writing Fast Code • Efficiency doesn't hurt • It may not help • Minimize cost of operations • Different on each machine • But similar enough that heuristics will produce good results on most architectures • On average, performance will improve if expressions are sufficiently large
Example: Computing Bottom Stress: original do j=2,jmm1 do i=2,imm1 ubar(i,j,kbm1)=0.5*(ua(i,j)+ua(i+1,j)) vbar(i,j,kbm1)=0.5*(va(i,j)+va(i,j+1)) enddo enddo do j=2,jmm1 do i=2,imm1 qbar(i,j)=sqrt(ubar(i,j,kbm1)*ubar(i,j,kbm1)+ $ vbar(i,j,kbm1)*vbar(i,j,kbm1)) enddo enddo do j=2,jmm1 do i=2,imm1 if (fsm(i,j).gt.0.0) then tau(i,j,kb)=10000.*cbc(i,j)/ $ varbf(i,j)*qbar(i,j)*qbar(i,j) endif enddo enddo
Bottom Stress: 21 Times Faster do j=2,jmm1 do i=2,imm1 tempuBAR = (UA(I,J)+UA(I+1,J)) tempVBAR = (VA(I,J)+VA(I,J+1)) TAU(I,J,KB)=CBC2(I,J)*(tempUBAR**2 + tempVBAR**2) enddo enddo
The Current State of Compiler Optimization • Compilers typically do not rearrange floating point expressions • For fear of subtle roundoff effects • Instruction scheduling can still be done as long as the two data streams do not interact • Example: (x * y) - (x / y) + (x + y) – (x – y) • Common subexpressions are well-handled provided they are in the same form • Algebraic equivalences are not considered • Constant subexpressions are pre-evaluated at compile time in C and FORTRAN
Examples: Writing Fast Code • Only scalars are considered constant • Writing to an array is not optimized awayvar1(i,j,k) = c1 * c2 * c3var1(i,j,k) = var1(i,j,k) * c4var1(i,j,k) = c1 * c2 * c3 * c4c1 * var2(i,j,k) * c3 * var3(i,j,k) * c4c1*c3*c4 * var2(i,j,k) * var3(i,j,k)var2(i,j,k) * var3(i,j,k) * c1*c3*c4var2(i,j,k) * var3(i,j,k) * (c1*c3*c4)var1(i,j,k) / c1(i,j,k)
Skipping Land • Goals • Save CPU by only processing water boxes • Cost very little CPU for all-water problems • IF-tests slow down the CPU, so preferably avoid them • Stripwise “unstructured view” of the grid avoids additional tests • inefficient on a vector machine • k-order would be better to test an entire water column at once, but this would require re-engineering the entire model
Skipping land, continued istart,jrow, iend 2,2,2 5,2,6 2,3,2 5,3,6 2,4,6 2,5,6 2,6,6 istart,jrow, iend
High Performance Advection • Scalar Fluxes in one preferred direction • No memory access for flux • Full machine precision xfluxi Xfluxi+1 Var(i,j,k)=xFluxiP1 – xFluxixFluxi = xFluxiP1
Advection, cont yfluxJ(1) = 0 yFluxJ(im) = 0 diffusiveyfluxJ(1) = 0 diffusiveyFluxJ(im) = 0 do k=2,kbm1 do i=2,imm1 yfluxJ(i) = 0 ! because of land boundary condition diffusiveYFluxJ(i) = 0 enddo do j=2,jmm1 xfluxI = 0 ! because of land boundary condition do i=2,imm1 xfluxIP1 = (f(i,j,k)+f(i+1,j,k)) * xmfl3d(i+1,j,k) yFluxJP1 = (f(i,j,k)+f(i,j+1,k)) * ymfl3d(i,j+1,k) diffusiveXFluxIP1 = -aamx(i,j,k) * addDTi(i,j,BACK) * $ (fb(i,j,k)-fb(i-1,j,k))*uMaskedH2_2H1(i,j) diffusiveYFluxJP1 = -aamy(i,j,k) * addDTj(i,j, BACK) * $ (fb(i,j,k)-fb(i,j-1,k))*vMaskedH1_2H2(i,j)
Advection, take 3 ff(i,j,k) = $ ( $ fB(i,j,k) * volT(i,j,BACK) - !advective part $ dti2 * 0.5 * invDZ(k) * $ ( $ (f(i,j,k-1)+f(i,j,k))*w(i,j,k) - $ (f(i,j,k)+f(i,j,k+1))*w(i,j,k+1)*ART(i,j) + $ xFluxIP1 - xFluxI + yFluxJP1 - yFluxJ(i) $ ) + ! diffusive part $ diffusiveXfluxIP1 - diffusiveXFluxI + $ diffusiveYFluxJP1 - diffusiveYFluxJ(i) $ ) * $ invVolT(i,j,FUTURE) xFluxI = xFluxIp1 diffusiveXFluxI = diffusiveXFluxIp1 yFluxJ(i) = yFluxJP1 diffusiveYFluxJ(i)=diffusiveYFluxJP1 enddo enddo enddo
K-order is Faster • POM and ECOMSED are currently stored with I varying most frequently • var(i,j,k) • For most algorithms, traversal order is irrelevant, except: • Vertical models are naturally ordered top to bottom • A model stored and accessed in k-order can implement the vertical profile algorithms much more efficiently
Performance of PROFT/PROFS • On a sample problem • 3 times faster for profiling temperature • 8 times faster for profiling salinity • This includes the penalty of traversing in the wrong memory order for T,S • For a k-ordered model, code would be even faster • Techniques as discussed before, and also • Removal of exponentiation where unnecessary
Shared Memory Parallelism • Compiler automatically recognizes opportunities for multiple CPUs to split up loops • Can be • Automatic (under compiler control) • Manual (OpenMP) • Problem is memory bandwidth
Memory Bandwidth and Shared Memory Computers • Different levels • Commodity PC dual processors • Mid-range shared-memory computers • Larger cache • Floating point performance of CPUs are critical • Special-purpose machines with exotic memory and caches • Write-back cache synchronization
MPI • Message Passing Interface requires manual coding intervention • The programmer can split the domain across multiple computers • Messages are passed exchanging information between models • This will not speed up small models due to the overhead of message passing • The volume of the model must be large with respect to the data exchange interface
Accuracy • Values in registers retain extra bits of precision • This can substantially improve computational results • Every time values are stored to single precision arrays, this extra information is lost • Example: Vectorization on the Pentium 4 • Very fast (4 simultaneous 32-bit operations) • Values change substantially in the model • Loss of significant figures
Simulation is not Prediction • Roundoff error slowly burns away digits of precision • No escape from information entropy • Roundoff is indistinguishable from small differences in the initial conditions • Single precision is not enough • After 12 hours, salinity results can differ by 3ppt on different computers
Calibration of Model Run on Different Machines? • Calibration represents balancing all the constituents of the equations as they are computed by the model • Some parts of the model are more sensitive to roundoff error than others • Some parts are more driven by forcing functions and are therefore more stable • Turbulence closure is particularly twitchy
Density Computation • The Equation of State has been defined by Millero and Poisson in UNESCO • Accuracy is high • On the order of 6 digits • Original data (>500 samples) 5 digits • Computational cost is also high • Many terms including s1.5 • How much accuracy is needed?
Comparison: Fofonoff 1952 vs. UNESCO • ECOMSED used Fofonoff 1952 until recently • Approximately 0.6% difference at worst case at approximate 10°C, 10ppt • In computer model, the absolute density is irrelevant • The density difference between adjacent cells drives the forcing • In a 20ppt difference • No significant difference in X1 • Small differences in distribution of salt gradient between 0 and 20 ppt.
c1 = 6.38582008014815d-12 c2 = -1.07670862741285d-09 c3 = 9.73156784811086d-08 c4 = -9.0042779076007d-06 c5 = 6.60339902342078d-05 c6 = -0.000132403263360079d0 c7 = 4.91156586724696d-12 c8 = -7.81209323957205d-10 c9 = 6.72419650942208d-08 c10 = -3.5945086374423d-06 c11 = 0.000809532551425d0 c12 = -1.59139276805119d-07 c13 = -2.00957653399204d-12 c14 = 1.28333259714417d-10 c15 = 2.34587013019438d-09 do k=1,kb-1 do j=2,jm-1 do i=2,im-1 rho(i,j,k) = $ ((((c13 * t(i,j,k) + c14) * t(i,j,k) + c15) * $ s(i,j,k) + c12) * s(i,j,k) + $ ((((c7 * t(i,j,k) + c8) * t(i,j,k)) + c9) * $ t(i,j,k) + c10) * t(i,j,k) + c11) * s(i,j,k)+ $ ((((c1 * t(i,j,k) + c2) * t(i,j,k) + c3) * $ t(i,j,k)+c4) * t(i,j,k) + c5) * t(i,j,k) + c6 enddo enddo enddo A Faster Algorithm: densityDKAP
UNESCO, DKAP and MS • Dr. Martin Senator of Davidson Laboratory contributed a different fit, MSSuperior to UNESCO in fit to the original data
Consistent Attention to Accuracy • In POM and ECOMSED, density computation is performed • Rho is computed for each cell • In POM, density is stored as a ratio from rhoref to save digits • Differences are subtracted between adjacent cells • Salinity: difference of 0.001 ppt yields a result in the 7th decimal place • Temperature: difference of 0.01 °C yields a result in the 7th decimal place • Result: 0 digits accuracy • Conclusion: Compute density in double precision. Anything else may be inaccurate
Double precision Density vs. Single Precision • Double precision is more stable • Differential Calculation of density will be presumably much more accurate, and therefore even more stable
Calculate Density Difference Analytically • For greatest accuracy, calculate differential density directly • Slightly slower, but full accuracy • Density(s,t) = Pn(s,t) • DiffDensity(s,t,ds,dt)
Future Directions in Hardware • New processors with more registers • Bigger expressions become even more advantageous • Improved shared memory performance • Improved writeback cache design, larger caches, faster interconnects • Intelligent memory subsystems preloading values • Sequential access becomes even more important
Improved Software • We propose to write a Model Construction Toolkit (MCT) that will do all the optimizations mentioned here, and more • Reimplement POM/ECOMSED using the new toolkit • Allow calling FORTRAN routines • Enter algorithms either in a FORTRAN-like syntax or in a higher-level syntax based on common notation for difference equations • Automatically optimize expressions • Compute array subexpressions only when changed • Compute results using an optimal function given the situation • The result: More robust, reliable modeling, at much higher speed, with less effort
Acknowledgements • Professor Roger Pinkham, for superb numerical and statistical knowledge and insight • Dr. Martin Senator for his densityMS fit and help getting started with R • The makers of R, an incredible statistics package • John Gilson, who reviewed the presentation and helped us get the right computing focus, and whose discussions first got us started thinking about a model construction toolkit