The Synergy between Computer Science and CFD Modeling
Dov Kruger • Anne Pence • Alan Blumberg
Background of the Davidson Laboratory Team
• Professor Alan Blumberg, co-author of POM and author of ECOMSED
• Dov Kruger, computer science, transitioning into ocean research
• Anne Pence, PhD candidate with a background in naval architecture and oceanography
CFD is a Computing War Against Big Problems
• An overview of five key areas, with a focus on our research in single- and multiple-CPU performance and accuracy
• Understanding the battleground: architectural features of computers
• Having a tactical plan: coding efficiency
• Effective use of resources:
  • Parallelism on shared-memory architectures
    • Auto-parallelizing compilers
    • OpenMP
  • Parallelism using distributed computers (MPI)
• Hitting the correct target: accuracy
Ignorance is useful, knowledge is crucial
• We questioned everything, including the equation of state
• Surprising results: a new equation of state, and a new style of coding CFD models that coalesces loops and is significantly faster on modern computers
• Nothing worked until Professor Blumberg
  • Conveyed the analytics of the core routines
  • Designed suitable test cases
  • Debugged problems
Doubling Performance of Barotropic Mode: 10 months + 2.5 hours
• Blindly optimizing the code is very difficult; it took time to understand the model
• Problems
  • Validating results
  • Finding the right test case
  • Incremental test procedure
• Once the correct strategy was in place, it took only 2.5 hours to double the performance of the barotropic mode
• More optimizations are possible, but they get exponentially more difficult
  • We propose not to do them by hand
Computer Architecture
• Locality = speed!
  • Registers
  • Different levels of cache
  • Main memory
  • Swapping
  • Disk
• Sequential access is much faster than non-sequential (see the sketch below)
• Locality is also accuracy: values held in registers retain extra bits of precision
• Multiple execution units
• The limiting factor today is memory bandwidth
  • ECOMSED and POM are both memory-limited
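To make the sequential-access point concrete, here is a minimal sketch; the array name and sizes are ours, not the model's. Both loop nests touch every element of a 3D array, but the first runs in Fortran's column-major storage order while the second strides im*jm elements between consecutive accesses and misses the cache almost every time.

      program seqdemo
      implicit none
      integer im, jm, kb, i, j, k
      parameter (im=200, jm=200, kb=20)
      real var(im,jm,kb), total
c     fast: i varies fastest, matching Fortran's column-major storage
      do k = 1, kb
        do j = 1, jm
          do i = 1, im
            var(i,j,k) = 1.0
          enddo
        enddo
      enddo
c     slow: k varies fastest; each access jumps im*jm elements,
c     so nearly every reference misses the cache
      total = 0.0
      do i = 1, im
        do j = 1, jm
          do k = 1, kb
            total = total + var(i,j,k)
          enddo
        enddo
      enddo
      write(*,*) total
      end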
Efficiency
• There is a cost structure to operations
  • It changes over time
• Today's highest costs:
  • Writing to memory (cache must flush to memory)
  • Reading from memory (statistically, cached)
  • exp
  • log
  • sqrt
  • Division
  • Multiplication
Heuristics for Writing Fast Code
• Efficiency doesn't hurt
  • It may not help
• Minimize the cost of operations
  • Costs differ on each machine
  • But they are similar enough that heuristics will produce good results on most architectures
• On average, performance will improve if expressions are sufficiently large
Example: Computing Bottom Stress (original)

      do j=2,jmm1
        do i=2,imm1
          ubar(i,j,kbm1) = 0.5*(ua(i,j)+ua(i+1,j))
          vbar(i,j,kbm1) = 0.5*(va(i,j)+va(i,j+1))
        enddo
      enddo
      do j=2,jmm1
        do i=2,imm1
          qbar(i,j) = sqrt(ubar(i,j,kbm1)*ubar(i,j,kbm1) +
     $                     vbar(i,j,kbm1)*vbar(i,j,kbm1))
        enddo
      enddo
      do j=2,jmm1
        do i=2,imm1
          if (fsm(i,j).gt.0.0) then
            tau(i,j,kb) = 10000.*cbc(i,j)/
     $                    varbf(i,j)*qbar(i,j)*qbar(i,j)
          endif
        enddo
      enddo
Bottom Stress: 21 Times Faster

c     the three loops are fused and the temporaries become scalars;
c     the sqrt vanishes because qbar was only ever used squared; the
c     0.5**2, 10000., and 1/varbf factors are folded (we assume) into
c     the precomputed cbc2 array
      do j=2,jmm1
        do i=2,imm1
          tempUBAR = ua(i,j) + ua(i+1,j)
          tempVBAR = va(i,j) + va(i,j+1)
          tau(i,j,kb) = cbc2(i,j)*(tempUBAR**2 + tempVBAR**2)
        enddo
      enddo
The Current State of Compiler Optimization
• Compilers typically do not rearrange floating-point expressions
  • For fear of subtle roundoff effects
• Instruction scheduling can still be done as long as the two data streams do not interact
  • Example: (x * y) - (x / y) + (x + y) - (x - y)
• Common subexpressions are well handled, provided they appear in the same form
• Algebraic equivalences are not considered
• Constant subexpressions are pre-evaluated at compile time in C and FORTRAN
Examples: Writing Fast Code
• Only scalars are considered constant
• Writing to an array is not optimized away:
    var1(i,j,k) = c1 * c2 * c3
    var1(i,j,k) = var1(i,j,k) * c4
  is not collapsed into
    var1(i,j,k) = c1 * c2 * c3 * c4
• The order and grouping of constants matters:
    c1 * var2(i,j,k) * c3 * var3(i,j,k) * c4
    c1*c3*c4 * var2(i,j,k) * var3(i,j,k)
    var2(i,j,k) * var3(i,j,k) * c1*c3*c4
    var2(i,j,k) * var3(i,j,k) * (c1*c3*c4)
• Division is expensive:
    var1(i,j,k) / c1(i,j,k)
  (a sketch follows this list)
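Because the compiler will not reassociate floating-point expressions, these rewrites must be done by hand. A minimal sketch combining two of the heuristics, reusing the names from the examples above (the enclosing routine is ours, for illustration):

      subroutine fold(var1, var2, var3, c1, c3, c4, c5, im, jm, kb)
      implicit none
      integer im, jm, kb, i, j, k
      real var1(im,jm,kb), var2(im,jm,kb), var3(im,jm,kb)
      real c1, c3, c4, c5, cfold
c     slow form: constants multiplied element by element, plus a
c     costly division per element:
c       var1(i,j,k) = c1 * var2(i,j,k) * c3 * var3(i,j,k) * c4 / c5
c     fast form: fold the constants once outside the loop and turn
c     the division into a multiplication by the reciprocal
      cfold = c1 * c3 * c4 / c5
      do k = 1, kb
        do j = 1, jm
          do i = 1, im
            var1(i,j,k) = var2(i,j,k) * var3(i,j,k) * cfold
          enddo
        enddo
      enddo
      return
      end

This is precisely the rearrangement the compiler refuses to make on its own, because the rounding of the result can change slightly.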
Skipping Land
• Goals
  • Save CPU by processing only water boxes
  • Cost very little CPU on all-water problems
• IF-tests slow down the CPU, so preferably avoid them
• A stripwise "unstructured view" of the grid avoids additional tests (see the strip table and the sketch that follow)
  • Inefficient on a vector machine
• k-order would be better, testing an entire water column at once, but this would require re-engineering the entire model
Skipping land, continued

Each strip (istart, jrow, iend) spans the water points of one row:

  istart  jrow  iend
     2      2     2
     5      2     6
     2      3     2
     5      3     6
     2      4     6
     2      5     6
     2      6     6
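A minimal sketch of the stripwise traversal, assuming the strips above are held in arrays istart(n), jrow(n), iend(n); the routine name and loop body are ours, for illustration only:

      subroutine overwater(el, d, im, jm, nstrip, istart, jrow, iend)
      implicit none
      integer im, jm, nstrip, n, i, j
      integer istart(nstrip), jrow(nstrip), iend(nstrip)
      real el(im,jm), d(im,jm)
      do n = 1, nstrip
        j = jrow(n)
c       only water points are visited: the fsm if-test disappears from
c       the inner loop, and an all-water row costs just one strip entry
        do i = istart(n), iend(n)
          el(i,j) = el(i,j) + d(i,j)
        enddo
      enddo
      return
      end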
High Performance Advection
• Scalar fluxes in one preferred direction
  • No memory access for the flux
  • Full machine precision
• In the i direction, only the flux at face i+1 (xFluxIP1) is computed each iteration; the flux at face i is carried over in a scalar from the previous iteration:
    var(i,j,k) = xFluxIP1 - xFluxI
    xFluxI = xFluxIP1
Advection, continued

      yFluxJ(1) = 0
      yFluxJ(im) = 0
      diffusiveYFluxJ(1) = 0
      diffusiveYFluxJ(im) = 0
      do k=2,kbm1
        do i=2,imm1
          yFluxJ(i) = 0            ! because of land boundary condition
          diffusiveYFluxJ(i) = 0
        enddo
        do j=2,jmm1
          xFluxI = 0               ! because of land boundary condition
          do i=2,imm1
            xFluxIP1 = (f(i,j,k)+f(i+1,j,k)) * xmfl3d(i+1,j,k)
            yFluxJP1 = (f(i,j,k)+f(i,j+1,k)) * ymfl3d(i,j+1,k)
            diffusiveXFluxIP1 = -aamx(i,j,k) * addDTi(i,j,BACK) *
     $          (fb(i,j,k)-fb(i-1,j,k)) * uMaskedH2_2H1(i,j)
            diffusiveYFluxJP1 = -aamy(i,j,k) * addDTj(i,j,BACK) *
     $          (fb(i,j,k)-fb(i,j-1,k)) * vMaskedH1_2H2(i,j)
Advection, take 3

            ff(i,j,k) =
     $        (
     $          fB(i,j,k) * volT(i,j,BACK) -          ! advective part
     $          dti2 * 0.5 * invDZ(k) *
     $          (
     $            (f(i,j,k-1)+f(i,j,k))*w(i,j,k) -
     $            (f(i,j,k)+f(i,j,k+1))*w(i,j,k+1)*ART(i,j) +
     $            xFluxIP1 - xFluxI + yFluxJP1 - yFluxJ(i)
     $          ) +                                   ! diffusive part
     $          diffusiveXFluxIP1 - diffusiveXFluxI +
     $          diffusiveYFluxJP1 - diffusiveYFluxJ(i)
     $        ) *
     $        invVolT(i,j,FUTURE)
            xFluxI = xFluxIP1
            diffusiveXFluxI = diffusiveXFluxIP1
            yFluxJ(i) = yFluxJP1
            diffusiveYFluxJ(i) = diffusiveYFluxJP1
          enddo
        enddo
      enddo
K-order is Faster
• POM and ECOMSED currently store arrays with i varying most frequently: var(i,j,k)
• For most algorithms, traversal order is irrelevant, except:
  • Vertical models are naturally ordered top to bottom
• A model stored and accessed in k-order can implement the vertical profile algorithms much more efficiently (see the sketch below)
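A sketch of the difference for a simple vertical operation; the routine, the k-ordered array tk, and the loop body are ours, for illustration. With var(i,j,k), consecutive k values lie im*jm elements apart in memory; with a k-ordered tk(k,i,j), each water column is contiguous:

      subroutine vsweep(t, tk, a, im, jm, kb)
      implicit none
      integer im, jm, kb, i, j, k
      real t(im,jm,kb), tk(kb,im,jm), a(kb)
c     current layout: the k-sweep strides im*jm elements per access
      do j = 2, jm-1
        do i = 2, im-1
          do k = 2, kb-1
            a(k) = t(i,j,k-1) - 2.0*t(i,j,k) + t(i,j,k+1)
          enddo
        enddo
      enddo
c     k-ordered layout: the column tk(1:kb,i,j) is contiguous, so the
c     same sweep runs sequentially through memory
      do j = 2, jm-1
        do i = 2, im-1
          do k = 2, kb-1
            a(k) = tk(k-1,i,j) - 2.0*tk(k,i,j) + tk(k+1,i,j)
          enddo
        enddo
      enddo
      return
      end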
Performance of PROFT/PROFS
• On a sample problem:
  • 3 times faster for profiling temperature
  • 8 times faster for profiling salinity
• This includes the penalty of traversing T and S in the wrong memory order
  • For a k-ordered model, the code would be even faster
• Techniques as discussed before, and also:
  • Removal of exponentiation where unnecessary
Shared Memory Parallelism
• Loops are split up across multiple CPUs that share one memory
• Can be:
  • Automatic (an auto-parallelizing compiler recognizes the opportunities)
  • Manual (OpenMP directives; see the sketch below)
• The problem is memory bandwidth
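A minimal sketch of a manually parallelized loop using a standard OpenMP directive for fixed-form FORTRAN; the routine and array names are ours, not from POM or ECOMSED. Each thread receives a block of j rows, and the inner i loop preserves sequential memory access:

      subroutine smooth(el, d, im, jm)
      implicit none
      integer im, jm, i, j
      real el(im,jm), d(im,jm)
c$omp parallel do private(i,j)
      do j = 2, jm-1
        do i = 2, im-1
          el(i,j) = 0.25 * (d(i-1,j) + d(i+1,j)
     $                    + d(i,j-1) + d(i,j+1))
        enddo
      enddo
c$omp end parallel do
      return
      end

Even with perfect parallel speedup of the arithmetic, a loop like this remains memory-limited: all the threads compete for the same memory bus.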
Memory Bandwidth and Shared Memory Computers
• Different levels:
  • Commodity dual-processor PCs
  • Mid-range shared-memory computers
    • Larger caches
    • Floating-point performance of the CPUs is critical
  • Special-purpose machines with exotic memory and caches
• Write-back cache synchronization
MPI
• The Message Passing Interface requires manual coding intervention
• The programmer splits the domain across multiple computers
• Messages are passed to exchange information between the subdomain models (see the sketch below)
• This will not speed up small models, due to the overhead of message passing
  • The volume of the model must be large with respect to the data-exchange interface
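A minimal sketch of the idea, assuming a one-dimensional split in j with one ghost row on each side; all names and sizes are ours, not ECOMSED's actual decomposition:

      program halo
      implicit none
      include 'mpif.h'
      integer im, jloc
      parameter (im=100, jloc=50)
      real f(im, 0:jloc+1)
      integer ierr, rank, nprocs, up, down, i, j
      integer status(MPI_STATUS_SIZE)
      call MPI_INIT(ierr)
      call MPI_COMM_RANK(MPI_COMM_WORLD, rank, ierr)
      call MPI_COMM_SIZE(MPI_COMM_WORLD, nprocs, ierr)
      up   = rank + 1
      down = rank - 1
      if (up .ge. nprocs) up   = MPI_PROC_NULL
      if (down .lt. 0)    down = MPI_PROC_NULL
c     stand-in for the time step, which computes f(:,1:jloc) locally
      do j = 0, jloc+1
        do i = 1, im
          f(i,j) = real(rank)
        enddo
      enddo
c     send our top row up; receive our bottom ghost row from below
      call MPI_SENDRECV(f(1,jloc),   im, MPI_REAL, up,   0,
     $                  f(1,0),      im, MPI_REAL, down, 0,
     $                  MPI_COMM_WORLD, status, ierr)
c     send our bottom row down; receive our top ghost row from above
      call MPI_SENDRECV(f(1,1),      im, MPI_REAL, down, 1,
     $                  f(1,jloc+1), im, MPI_REAL, up,   1,
     $                  MPI_COMM_WORLD, status, ierr)
      call MPI_FINALIZE(ierr)
      end

The computation per step grows with im*jloc while the exchange grows only with im, which is the volume-to-interface ratio the last bullet above describes.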
Accuracy
• Values in registers retain extra bits of precision
  • This can substantially improve computational results
  • Every time values are stored to single-precision arrays, this extra information is lost
• Example: vectorization on the Pentium 4
  • Very fast (4 simultaneous 32-bit operations)
  • Values change substantially in the model
  • Loss of significant figures
Simulation is not Prediction
• Roundoff error slowly burns away digits of precision
  • No escape from information entropy
• Roundoff is indistinguishable from small differences in the initial conditions
• Single precision is not enough
  • After 12 hours, salinity results can differ by 3 ppt on different computers
Calibration of a Model Run on Different Machines?
• Calibration balances all the constituents of the equations as they are computed by the model
• Some parts of the model are more sensitive to roundoff error than others
• Some parts are driven more by forcing functions and are therefore more stable
• Turbulence closure is particularly twitchy
Density Computation
• The equation of state has been defined by Millero and Poisson in the UNESCO standard
• Accuracy is high
  • On the order of 6 digits
  • The original data (>500 samples) has 5 digits
• Computational cost is also high
  • Many terms, including an s^1.5 term
• How much accuracy is needed?
Comparison: Fofonoff 1952 vs. UNESCO
• ECOMSED used Fofonoff 1952 until recently
• Approximately 0.6% difference in the worst case, near 10°C and 10 ppt
• In a computer model, the absolute density is irrelevant
  • The density difference between adjacent cells drives the forcing
• In a test with a 20 ppt difference:
  • No significant difference in X1
  • Small differences in the distribution of the salt gradient between 0 and 20 ppt
A Faster Algorithm: densityDKAP

      c1  =  6.38582008014815d-12
      c2  = -1.07670862741285d-09
      c3  =  9.73156784811086d-08
      c4  = -9.0042779076007d-06
      c5  =  6.60339902342078d-05
      c6  = -0.000132403263360079d0
      c7  =  4.91156586724696d-12
      c8  = -7.81209323957205d-10
      c9  =  6.72419650942208d-08
      c10 = -3.5945086374423d-06
      c11 =  0.000809532551425d0
      c12 = -1.59139276805119d-07
      c13 = -2.00957653399204d-12
      c14 =  1.28333259714417d-10
      c15 =  2.34587013019438d-09
      do k=1,kb-1
        do j=2,jm-1
          do i=2,im-1
            rho(i,j,k) =
     $        ((((c13 * t(i,j,k) + c14) * t(i,j,k) + c15) *
     $            s(i,j,k) + c12) * s(i,j,k) +
     $         ((((c7 * t(i,j,k) + c8) * t(i,j,k)) + c9) *
     $            t(i,j,k) + c10) * t(i,j,k) + c11) * s(i,j,k) +
     $        ((((c1 * t(i,j,k) + c2) * t(i,j,k) + c3) *
     $            t(i,j,k) + c4) * t(i,j,k) + c5) * t(i,j,k) + c6
          enddo
        enddo
      enddo
UNESCO, DKAP and MS
• Dr. Martin Senator of Davidson Laboratory contributed a different fit, MS
• Superior to UNESCO in fit to the original data
Consistent Attention to Accuracy
• In POM and ECOMSED, rho is computed for each cell
  • In POM, density is stored as a ratio from rhoref to save digits
• Differences are then subtracted between adjacent cells
  • Salinity: a difference of 0.001 ppt yields a result in the 7th decimal place
  • Temperature: a difference of 0.01°C yields a result in the 7th decimal place
  • Result: 0 digits of accuracy left in single precision
• Conclusion: compute density in double precision; anything else may be inaccurate
Double-Precision Density vs. Single Precision
• Double precision is more stable (a sketch follows)
• Differential calculation of density will presumably be much more accurate, and therefore even more stable
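As an illustration of the last two slides, the sketch below evaluates the densityDKAP polynomial at salinities 0.001 ppt apart and subtracts. The program and its sample values are ours, not part of the model: the difference survives in double precision, but loses its leading digits once the densities are first rounded to single precision, as happens whenever rho is stored in single-precision arrays.

      program precdemo
      implicit none
      real*8 dkap
      real*8 t, s, ds, rho1, rho2, drho8
      real*4 r1s, r2s, drho4
      t  = 10.0d0
      s  = 25.000d0
      ds = 0.001d0
      rho1 = dkap(t, s)
      rho2 = dkap(t, s+ds)
c     difference carried entirely in double precision: full accuracy
      drho8 = rho2 - rho1
c     densities rounded to single precision before subtracting, as in
c     a model that stores rho in single-precision arrays
      r1s = rho1
      r2s = rho2
      drho4 = r2s - r1s
      write(*,*) 'double precision difference: ', drho8
      write(*,*) 'single precision difference: ', drho4
      end

      real*8 function dkap(t, s)
      implicit none
      real*8 t, s
      real*8 c1,c2,c3,c4,c5,c6,c7,c8,c9,c10,c11,c12,c13,c14,c15
      c1  =  6.38582008014815d-12
      c2  = -1.07670862741285d-09
      c3  =  9.73156784811086d-08
      c4  = -9.0042779076007d-06
      c5  =  6.60339902342078d-05
      c6  = -0.000132403263360079d0
      c7  =  4.91156586724696d-12
      c8  = -7.81209323957205d-10
      c9  =  6.72419650942208d-08
      c10 = -3.5945086374423d-06
      c11 =  0.000809532551425d0
      c12 = -1.59139276805119d-07
      c13 = -2.00957653399204d-12
      c14 =  1.28333259714417d-10
      c15 =  2.34587013019438d-09
      dkap =
     $  ((((c13*t + c14)*t + c15)*s + c12)*s +
     $   ((((c7*t + c8)*t) + c9)*t + c10)*t + c11)*s +
     $  ((((c1*t + c2)*t + c3)*t + c4)*t + c5)*t + c6
      end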
Calculate the Density Difference Analytically
• For greatest accuracy, calculate the differential density directly
• Slightly slower, but full accuracy
• Density(s,t) = Pn(s,t)
• DiffDensity(s,t,ds,dt)
Future Directions in Hardware
• New processors with more registers
  • Bigger expressions become even more advantageous
• Improved shared-memory performance
  • Improved write-back cache design, larger caches, faster interconnects
• Intelligent memory subsystems preloading values
  • Sequential access becomes even more important
Improved Software
• We propose to write a Model Construction Toolkit (MCT) that will do all the optimizations mentioned here, and more
• Reimplement POM/ECOMSED using the new toolkit
  • Allow calling FORTRAN routines
• Enter algorithms either in a FORTRAN-like syntax or in a higher-level syntax based on common notation for difference equations
• Automatically optimize expressions
  • Compute array subexpressions only when changed
  • Compute results using an optimal function given the situation
• The result: more robust, reliable modeling, at much higher speed, with less effort
Acknowledgements
• Professor Roger Pinkham, for superb numerical and statistical knowledge and insight
• Dr. Martin Senator, for his densityMS fit and help getting started with R
• The makers of R, an incredible statistics package
• John Gilson, who reviewed the presentation and helped us get the right computing focus, and whose discussions first got us started thinking about a model construction toolkit