1 / 37

The Synergy between Computer Science and CFD Modeling

The Synergy between Computer Science and CFD Modeling. Dov Kruger Anne Pence Alan Blumberg. Background of the Davidson Laboratory Team. Professor Alan Blumberg, co-author of POM and author of ECOMSED Dov Kruger, computer science, transitioning into ocean research

kevlyn
Download Presentation

The Synergy between Computer Science and CFD Modeling

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. The Synergy between Computer Science and CFD Modeling Dov Kruger Anne Pence Alan Blumberg

  2. Background of the Davidson Laboratory Team • Professor Alan Blumberg, co-author of POM and author of ECOMSED • Dov Kruger, computer science, transitioning into ocean research • Anne Pence, PhD candidate with a background in naval architecture and Oceanography

  3. CFD is a Computing War Against Big Problems • Overview of 5 key areas, and focus on our research in single and multiple CPU performance and accuracy • Understanding the battleground: Architectural Features of Computers • Having a tactical plan: coding efficiency • Effective Use of resources: • Parallelism on shared memory architectures • Auto-parallelizing compilers • OpenMP • Parallelism using distributed computers (MPI) • Hitting the correct target: Accuracy

  4. Ignorance is useful, knowledge is crucial • We questioned everything, including equation of state • Surprising results: a new equation of state, and a new style of coding CFD models that coalesces loops and is significantly faster on modern computers • Nothing worked until Professor Blumberg • Conveyed the analytics of the core routines • Designed suitable test cases • Debugged problems

  5. Doubling Performance of Barotropic Mode: 10 months + 2.5 hours • Blindly optimizing the code is very difficult • It took time to understand the model • Problems • Validating results • Finding the right test case • Incremental test procedure • Once the correct strategy was in place, it took only 2.5 hours to double the performance of barotropic mode • More optimizations are possible, but they get exponentially more difficult • We propose not to do them by hand

  6. A Sample Problem

  7. Computer Architecture • Locality = Speed! • Registers • Different Levels of cache • Main Memory • Swapping • Disk • Sequential access is much faster than non-sequential • Locality is also accuracy • Multiple execution units • The limiting factor today is memory bandwidth • ECOMSED and POM are both memory limited

  8. Efficiency • There is a cost structure to operations • Changes over time • Today’s Highest Costs • Writing to memory • Cache must flush to memory • Reading from memory • Statistically, cached • exp • log • sqrt • Division • Multiplication

  9. Heuristics for Writing Fast Code • Efficiency doesn't hurt • It may not help • Minimize cost of operations • Different on each machine • But similar enough that heuristics will produce good results on most architectures • On average, performance will improve if expressions are sufficiently large

  10. Example: Computing Bottom Stress: original do j=2,jmm1 do i=2,imm1 ubar(i,j,kbm1)=0.5*(ua(i,j)+ua(i+1,j)) vbar(i,j,kbm1)=0.5*(va(i,j)+va(i,j+1)) enddo enddo do j=2,jmm1 do i=2,imm1 qbar(i,j)=sqrt(ubar(i,j,kbm1)*ubar(i,j,kbm1)+ $ vbar(i,j,kbm1)*vbar(i,j,kbm1)) enddo enddo do j=2,jmm1 do i=2,imm1 if (fsm(i,j).gt.0.0) then tau(i,j,kb)=10000.*cbc(i,j)/ $ varbf(i,j)*qbar(i,j)*qbar(i,j) endif enddo enddo

  11. Bottom Stress: 21 Times Faster do j=2,jmm1 do i=2,imm1 tempuBAR = (UA(I,J)+UA(I+1,J)) tempVBAR = (VA(I,J)+VA(I,J+1)) TAU(I,J,KB)=CBC2(I,J)*(tempUBAR**2 + tempVBAR**2) enddo enddo

  12. The Current State of Compiler Optimization • Compilers typically do not rearrange floating point expressions • For fear of subtle roundoff effects • Instruction scheduling can still be done as long as the two data streams do not interact • Example: (x * y) - (x / y) + (x + y) – (x – y) • Common subexpressions are well-handled provided they are in the same form • Algebraic equivalences are not considered • Constant subexpressions are pre-evaluated at compile time in C and FORTRAN

  13. Examples: Writing Fast Code • Only scalars are considered constant • Writing to an array is not optimized awayvar1(i,j,k) = c1 * c2 * c3var1(i,j,k) = var1(i,j,k) * c4var1(i,j,k) = c1 * c2 * c3 * c4c1 * var2(i,j,k) * c3 * var3(i,j,k) * c4c1*c3*c4 * var2(i,j,k) * var3(i,j,k)var2(i,j,k) * var3(i,j,k) * c1*c3*c4var2(i,j,k) * var3(i,j,k) * (c1*c3*c4)var1(i,j,k) / c1(i,j,k)

  14. Skipping Land • Goals • Save CPU by only processing water boxes • Cost very little CPU for all-water problems • IF-tests slow down the CPU, so preferably avoid them • Stripwise “unstructured view” of the grid avoids additional tests • inefficient on a vector machine • k-order would be better to test an entire water column at once, but this would require re-engineering the entire model

  15. Skipping land, continued istart,jrow, iend 2,2,2 5,2,6 2,3,2 5,3,6 2,4,6 2,5,6 2,6,6 istart,jrow, iend

  16. High Performance Advection • Scalar Fluxes in one preferred direction • No memory access for flux • Full machine precision xfluxi Xfluxi+1 Var(i,j,k)=xFluxiP1 – xFluxixFluxi = xFluxiP1

  17. Advection, cont yfluxJ(1) = 0 yFluxJ(im) = 0 diffusiveyfluxJ(1) = 0 diffusiveyFluxJ(im) = 0 do k=2,kbm1 do i=2,imm1 yfluxJ(i) = 0 ! because of land boundary condition diffusiveYFluxJ(i) = 0 enddo do j=2,jmm1 xfluxI = 0 ! because of land boundary condition do i=2,imm1 xfluxIP1 = (f(i,j,k)+f(i+1,j,k)) * xmfl3d(i+1,j,k) yFluxJP1 = (f(i,j,k)+f(i,j+1,k)) * ymfl3d(i,j+1,k) diffusiveXFluxIP1 = -aamx(i,j,k) * addDTi(i,j,BACK) * $ (fb(i,j,k)-fb(i-1,j,k))*uMaskedH2_2H1(i,j) diffusiveYFluxJP1 = -aamy(i,j,k) * addDTj(i,j, BACK) * $ (fb(i,j,k)-fb(i,j-1,k))*vMaskedH1_2H2(i,j)

  18. Advection, take 3 ff(i,j,k) = $ ( $ fB(i,j,k) * volT(i,j,BACK) - !advective part $ dti2 * 0.5 * invDZ(k) * $ ( $ (f(i,j,k-1)+f(i,j,k))*w(i,j,k) - $ (f(i,j,k)+f(i,j,k+1))*w(i,j,k+1)*ART(i,j) + $ xFluxIP1 - xFluxI + yFluxJP1 - yFluxJ(i) $ ) + ! diffusive part $ diffusiveXfluxIP1 - diffusiveXFluxI + $ diffusiveYFluxJP1 - diffusiveYFluxJ(i) $ ) * $ invVolT(i,j,FUTURE) xFluxI = xFluxIp1 diffusiveXFluxI = diffusiveXFluxIp1 yFluxJ(i) = yFluxJP1 diffusiveYFluxJ(i)=diffusiveYFluxJP1 enddo enddo enddo

  19. K-order is Faster • POM and ECOMSED are currently stored with I varying most frequently • var(i,j,k) • For most algorithms, traversal order is irrelevant, except: • Vertical models are naturally ordered top to bottom • A model stored and accessed in k-order can implement the vertical profile algorithms much more efficiently

  20. Performance of PROFT/PROFS • On a sample problem • 3 times faster for profiling temperature • 8 times faster for profiling salinity • This includes the penalty of traversing in the wrong memory order for T,S • For a k-ordered model, code would be even faster • Techniques as discussed before, and also • Removal of exponentiation where unnecessary

  21. Shared Memory Parallelism • Compiler automatically recognizes opportunities for multiple CPUs to split up loops • Can be • Automatic (under compiler control) • Manual (OpenMP) • Problem is memory bandwidth

  22. Memory Bandwidth and Shared Memory Computers • Different levels • Commodity PC dual processors • Mid-range shared-memory computers • Larger cache • Floating point performance of CPUs are critical • Special-purpose machines with exotic memory and caches • Write-back cache synchronization

  23. MPI • Message Passing Interface requires manual coding intervention • The programmer can split the domain across multiple computers • Messages are passed exchanging information between models • This will not speed up small models due to the overhead of message passing • The volume of the model must be large with respect to the data exchange interface

  24. Accuracy • Values in registers retain extra bits of precision • This can substantially improve computational results • Every time values are stored to single precision arrays, this extra information is lost • Example: Vectorization on the Pentium 4 • Very fast (4 simultaneous 32-bit operations) • Values change substantially in the model • Loss of significant figures

  25. Simulation is not Prediction • Roundoff error slowly burns away digits of precision • No escape from information entropy • Roundoff is indistinguishable from small differences in the initial conditions • Single precision is not enough • After 12 hours, salinity results can differ by 3ppt on different computers

  26. Calibration of Model Run on Different Machines? • Calibration represents balancing all the constituents of the equations as they are computed by the model • Some parts of the model are more sensitive to roundoff error than others • Some parts are more driven by forcing functions and are therefore more stable • Turbulence closure is particularly twitchy

  27. Density Computation • The Equation of State has been defined by Millero and Poisson in UNESCO • Accuracy is high • On the order of 6 digits • Original data (>500 samples) 5 digits • Computational cost is also high • Many terms including s1.5 • How much accuracy is needed?

  28. Comparison: Fofonoff 1952 vs. UNESCO • ECOMSED used Fofonoff 1952 until recently • Approximately 0.6% difference at worst case at approximate 10°C, 10ppt • In computer model, the absolute density is irrelevant • The density difference between adjacent cells drives the forcing • In a 20ppt difference • No significant difference in X1 • Small differences in distribution of salt gradient between 0 and 20 ppt.

  29. Difference between UNESCO and Fofonoff 1950

  30. c1 = 6.38582008014815d-12 c2 = -1.07670862741285d-09 c3 = 9.73156784811086d-08 c4 = -9.0042779076007d-06 c5 = 6.60339902342078d-05 c6 = -0.000132403263360079d0 c7 = 4.91156586724696d-12 c8 = -7.81209323957205d-10 c9 = 6.72419650942208d-08 c10 = -3.5945086374423d-06 c11 = 0.000809532551425d0 c12 = -1.59139276805119d-07 c13 = -2.00957653399204d-12 c14 = 1.28333259714417d-10 c15 = 2.34587013019438d-09 do k=1,kb-1 do j=2,jm-1 do i=2,im-1 rho(i,j,k) = $ ((((c13 * t(i,j,k) + c14) * t(i,j,k) + c15) * $ s(i,j,k) + c12) * s(i,j,k) + $ ((((c7 * t(i,j,k) + c8) * t(i,j,k)) + c9) * $ t(i,j,k) + c10) * t(i,j,k) + c11) * s(i,j,k)+ $ ((((c1 * t(i,j,k) + c2) * t(i,j,k) + c3) * $ t(i,j,k)+c4) * t(i,j,k) + c5) * t(i,j,k) + c6 enddo enddo enddo A Faster Algorithm: densityDKAP

  31. UNESCO, DKAP and MS • Dr. Martin Senator of Davidson Laboratory contributed a different fit, MSSuperior to UNESCO in fit to the original data

  32. Consistent Attention to Accuracy • In POM and ECOMSED, density computation is performed • Rho is computed for each cell • In POM, density is stored as a ratio from rhoref to save digits • Differences are subtracted between adjacent cells • Salinity: difference of 0.001 ppt yields a result in the 7th decimal place • Temperature: difference of 0.01 °C yields a result in the 7th decimal place • Result: 0 digits accuracy • Conclusion: Compute density in double precision. Anything else may be inaccurate

  33. Double precision Density vs. Single Precision • Double precision is more stable • Differential Calculation of density will be presumably much more accurate, and therefore even more stable

  34. Calculate Density Difference Analytically • For greatest accuracy, calculate differential density directly • Slightly slower, but full accuracy • Density(s,t) = Pn(s,t) • DiffDensity(s,t,ds,dt)

  35. Future Directions in Hardware • New processors with more registers • Bigger expressions become even more advantageous • Improved shared memory performance • Improved writeback cache design, larger caches, faster interconnects • Intelligent memory subsystems preloading values • Sequential access becomes even more important

  36. Improved Software • We propose to write a Model Construction Toolkit (MCT) that will do all the optimizations mentioned here, and more • Reimplement POM/ECOMSED using the new toolkit • Allow calling FORTRAN routines • Enter algorithms either in a FORTRAN-like syntax or in a higher-level syntax based on common notation for difference equations • Automatically optimize expressions • Compute array subexpressions only when changed • Compute results using an optimal function given the situation • The result: More robust, reliable modeling, at much higher speed, with less effort

  37. Acknowledgements • Professor Roger Pinkham, for superb numerical and statistical knowledge and insight • Dr. Martin Senator for his densityMS fit and help getting started with R • The makers of R, an incredible statistics package • John Gilson, who reviewed the presentation and helped us get the right computing focus, and whose discussions first got us started thinking about a model construction toolkit

More Related