Optimizing Performance of the Lattice Boltzmann Method for Complex Structures

Presentation Transcript


  1. Optimizing Performance of the Lattice Boltzmann Method for Complex Structures. Friedrich-Alexander University Erlangen/Nuremberg, Department of Computer Science 10 (System Simulation); Regional Computing Center of Erlangen (RRZE)

  2. Outline • Introduction • Lattice Boltzmann Method • Implementation Aspects • Application • Implementation • Optimization • Results • Conclusion Stefan Donath Friedrich-Alexander University Erlangen-Nuremberg

  3. Lattice Boltzmann Method • Boltzmann Equation • Discretization of particle velocity space (finite set of discrete velocities)

  4. Lattice Boltzmann Method • Different discretization schemes • Numerical accuracy and stability • Computational speed and simplicity [Figure: velocity sets D3Q15, D3Q19, D3Q27]

  5. Lattice Boltzmann Method • Discretization in space x and time t gives two steps: a collision step and a streaming step [equations not preserved in the transcript]
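The two equations on this slide did not survive extraction. As a hedged reconstruction, assuming the common single-relaxation-time (BGK) collision operator, which the slides only call "relaxation", the two steps are usually written as follows. The symbols \tau (relaxation time), f_i^{eq} (equilibrium distribution), \mathbf{e}_i (discrete velocity) and \tilde f_i (post-collision value) are assumptions and do not appear in the transcript.

```latex
\begin{align}
\text{collision:}\quad
  \tilde f_i(\mathbf{x},t) &= f_i(\mathbf{x},t)
    - \frac{1}{\tau}\left[ f_i(\mathbf{x},t) - f_i^{\mathrm{eq}}(\mathbf{x},t) \right] \\
\text{streaming:}\quad
  f_i(\mathbf{x}+\mathbf{e}_i\,\Delta t,\; t+\Delta t) &= \tilde f_i(\mathbf{x},t)
\end{align}
```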

  6. Lattice Boltzmann Method (Implementation Aspects) • Discretization in space x and time t: collision step and streaming step • Stream-Collide (Pull-Method) • Get the distributions from the neighboring cells in the source array and store the relaxed values to one cell in the destination array • Collide-Stream (Push-Method) • Take the distributions from one cell in the source array and store the relaxed values to the neighboring cells in the destination array [Figure: source and destination arrays]
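The difference between the push and the pull variant can be sketched on a toy lattice. The following is a minimal illustration, not the presented solver: it assumes a periodic 1D lattice with three velocities (D1Q3), a BGK relaxation toward a velocity-free equilibrium, and a made-up relaxation rate `OMEGA`. All names are hypothetical.

```python
# Toy 1D lattice with three velocities (D1Q3) and periodic boundaries.
E = [0, 1, -1]              # discrete velocities (assumed toy model)
W = [2 / 3, 1 / 6, 1 / 6]   # lattice weights of the assumed equilibrium
OMEGA = 1.25                # relaxation rate 1/tau (assumed value)


def relax(cell):
    """Relax one cell's distributions toward w_i * rho (BGK, no velocity terms)."""
    rho = sum(cell)
    return [f - OMEGA * (f - w * rho) for f, w in zip(cell, W)]


def push_step(src):
    """Collide-stream: relax ONE source cell, scatter to the NEIGHBORING cells."""
    n = len(src)
    dst = [[0.0] * 3 for _ in range(n)]
    for x in range(n):
        post = relax(src[x])
        for i, e in enumerate(E):
            dst[(x + e) % n][i] = post[i]
    return dst


def pull_step(src):
    """Stream-collide: gather from the NEIGHBORING cells, relax, store to ONE cell."""
    n = len(src)
    dst = [[0.0] * 3 for _ in range(n)]
    for x in range(n):
        gathered = [src[(x - e) % n][i] for i, e in enumerate(E)]
        dst[x] = relax(gathered)
    return dst
```

Both variants perform the same collide and stream operations and differ only in which of the two comes first within a time step, and therefore in whether the scattered accesses are the writes (push) or the reads (pull).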

  7. Lattice Boltzmann Method (Implementation Aspects) • Walls and Obstacles: Bounce Back rule

  8. Implementation Aspects • Data Dependencies • Two Grids • Compressed Grid

  9. Implementation Aspects
      double precision f(0:xMax+1,0:yMax+1,0:zMax+1,0:18,0:1)
      do z=1,zMax
        do y=1,yMax
          do x=1,xMax
            if( fluidcell(x,y,z) ) then
              ! Collide
              LOAD f(x,y,z, 0:18,t)
              Relaxation (complex computations)
              ! Stream
              SAVE f(x  ,y  ,z  , 0,t+1)
              SAVE f(x+1,y+1,z  , 1,t+1)
              SAVE f(x  ,y+1,z  , 2,t+1)
              SAVE f(x-1,y+1,z  , 3,t+1)
              …
              SAVE f(x  ,y-1,z-1,18,t+1)
            endif
          enddo
        enddo
      enddo

  10. Application

  11. Application: Porous Media Combustion • New technology for heating installations: Porous Media Combustion (PMC) • The fuel-air mixture no longer reacts in a free flame • The combustion process takes place inside the pores of a porous medium that is placed in the reaction area [Figure: burner dimensions, 60 mm and 20 mm]

  12. Application: Porous Media Combustion [Figures courtesy of LSTM Uni-Erlangen, Thomas Zeiser]

  13. Application: Porous Media Combustion • Various applications: modern steam engines, vehicle heaters [Figures courtesy of LSTM Uni-Erlangen, Thomas Zeiser]

  14. Application: Introducing the Complex Geometries Used • Porous medium from PMC: Silicon Carbide (SiC) • Many small obstacles • Obstacle/fluid ratio: ~2% • High number of fluid-solid faces

  15. Application: Introducing the Complex Geometries Used • Second test geometry: “MC” • Huge obstacles, only a few fluid tubes • Obstacle/fluid ratio: ~50% • Low number of fluid-solid faces

  16. Implementation

  17. Implementation • Collision and streaming step in the same loop • Push-Method • Data representation in a 1D array • Stores only fluid cells (saves memory) • Indirect addressing of target cells (by an extra connectivity array) • Boundary conditions (Bounce Back) handled implicitly

  18. Implementation • Indirect addressing and implicit Bounce Back [Figure: cells i-1, i, i+1 next to an obstacle wall, and the corresponding entries of the connectivity array]

  19. Implementation • Indirect addressing and implicit Bounce Back [Figure: cells i-1, i, i+1 next to an obstacle wall, and the corresponding entries of the connectivity array]

  20. Implementation • Preprocessor • Sets up connectivity array • Specifies all domain parameters: • Geometry and obstacles • Traversing scheme • Solver • Reads in preprocessed information • Performs lattice Boltzmann method (in single loop)
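The split into a preprocessor and a single-loop solver can be illustrated with a small sketch. It is only one possible realization under stated assumptions: a 2D D2Q9 lattice instead of the D3Q19 model of the talk, a domain enclosed by walls, a toy BGK relaxation without velocity terms, and hypothetical names (`preprocess`, `step`, `conn`). Bounce back is implicit here because the connectivity entry of a link that hits an obstacle points back into the same cell, at the opposite direction.

```python
Q = 9
# D2Q9 velocities and, for each direction, the index of the opposite one.
E = [(0, 0), (1, 0), (0, 1), (-1, 0), (0, -1), (1, 1), (-1, 1), (-1, -1), (1, -1)]
OPP = [0, 3, 4, 1, 2, 7, 8, 5, 6]
W = [4 / 9] + [1 / 9] * 4 + [1 / 36] * 4   # standard D2Q9 weights


def preprocess(solid):
    """Number the fluid cells and build the connectivity array.

    conn[k][i] is the flat index in the destination array that receives
    distribution i of fluid cell k. Cells outside the grid count as solid.
    """
    cell_id = {}
    for y, row in enumerate(solid):
        for x, is_solid in enumerate(row):
            if not is_solid:
                cell_id[(x, y)] = len(cell_id)   # only fluid cells get an id
    conn = []
    for (x, y), k in cell_id.items():            # insertion order: k = 0, 1, ...
        links = []
        for i, (ex, ey) in enumerate(E):
            target = cell_id.get((x + ex, y + ey))
            if target is None:                   # obstacle: implicit bounce back
                links.append(k * Q + OPP[i])
            else:                                # fluid neighbor: plain streaming
                links.append(target * Q + i)
        conn.append(links)
    return conn


def relax(cell, omega=1.25):
    """Toy BGK relaxation toward w_i * rho (assumed, no velocity terms)."""
    rho = sum(cell)
    return [f - omega * (f - w * rho) for f, w in zip(cell, W)]


def step(src, conn):
    """One push-style time step over the 1D array, no if-statement for walls."""
    dst = [0.0] * len(src)
    for k, links in enumerate(conn):
        post = relax(src[k * Q:(k + 1) * Q])
        for i in range(Q):
            dst[links[i]] = post[i]
    return dst
```

Because every destination slot is the target of exactly one link, `step` fills the whole destination array without any branch on the cell type, which is the point of handling the boundary in the preprocessor.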

  21. Implementation Rating: 1D array compared to the standard implementation using a multidimensional array and three loops • Advantages • Saves memory: the more obstacles in the domain, the higher the compression • Implicit Bounce Back: no extra routine or if-statement needed • Drawbacks • Indirect addressing: prevents the compiler from vectorizing and applying other optimization techniques; consequently, worse performance

  22. Optimization - Outline • Memory Traversing Schemes • Space-Filling Curves • Blocking • Memory Layouts • Further Optimization Techniques

  23. Memory Traversing Schemes: Space-Filling Curves • What is a space-filling curve? Loosely speaking: a one-dimensional curve that fills a higher-dimensional space • Which curves were used? • Hilbert • Peano • How are they constructed? Again, loosely: by mapping a one-dimensional interval to the higher-dimensional space, then recursively applying a new mapping to each part of the interval

  24. Memory Traversing Schemes: Space-Filling Curves • And how does the construction really work? [Figure: curve construction, with labels 0 and 1]
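The recursive construction can be made concrete with a small sketch. Note the assumptions: the talk uses a table-based construction of 3D Hilbert and Peano curves, whereas the code below uses the widely known bitwise index-to-coordinate conversion for the 2D Hilbert curve, purely as an illustration. The function name `hilbert_d2xy` is hypothetical.

```python
def hilbert_d2xy(order, d):
    """Map position d along a 2D Hilbert curve to (x, y) on a 2**order grid."""
    n = 1 << order
    x = y = 0
    t = d
    s = 1
    while s < n:
        rx = 1 & (t // 2)        # which half in x at this refinement level
        ry = 1 & (t ^ rx)        # which half in y at this refinement level
        if ry == 0:              # rotate/reflect the sub-square if needed
            if rx == 1:
                x = s - 1 - x
                y = s - 1 - y
            x, y = y, x
        x += s * rx
        y += s * ry
        t //= 4                  # move on to the next (coarser) level
        s *= 2
    return x, y


# Traversal order of an 8 x 8 grid (order 3): consecutive entries are
# always direct neighbors, which is what improves spatial locality.
curve = [hilbert_d2xy(3, d) for d in range(64)]
```

A preprocessor could number the fluid cells in the order in which such a curve visits them; the solver itself would not need to know about the curve.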

  25. Memory Traversing Schemes: Space-Filling Curves • How does that look in 3D? [Figures: “Hilbert bne”, “Hilbert nwf”, “Hilbert fws”, “Hilbert enb”, “Peano fsw”] • OK, but how to construct them?

  26. Memory Traversing Schemes: Space-Filling Curves • How to construct them? • Table-based approach • Hilbert: 48 productions with 8 entries and 7 connectors • Peano: 8 productions with 27 entries and 26 connectors

  27. Memory Traversing Schemes: Space-Filling Curves • Summary for Space-Filling Curves • Recursive production by segmentation, hence a limitation in system sizes: • Hilbert: 2^(3n) • Peano: 3^(3n) • Increase spatial locality • Enable mesh refinement

  28. Memory Traversing Schemes: Blocking • Implicit blocking technique • Arrangement of data in a blocked manner • Increases spatial locality

  29. Memory Traversing Schemes • Notes on Memory Traversing Schemes • Pure preprocessing technique: • Only the arrangement of data in memory is changed • No change for the solver • No overhead for the solver (e.g. loop overhead for blocking)
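That blocking can be a pure preprocessing step is easy to sketch: the preprocessor only decides in which order the fluid cells are numbered in the 1D array. The following is a minimal illustration with hypothetical names (`blocked_order`, `is_fluid`) and an assumed cubic block of edge length `b`; it is not taken from the presented code.

```python
def blocked_order(nx, ny, nz, b, is_fluid=lambda x, y, z: True):
    """Return the fluid cells of an nx*ny*nz grid, block by block.

    Cells inside one b*b*b block are listed consecutively, so they end up
    close together in the 1D array that the solver walks through.
    """
    order = []
    for z0 in range(0, nz, b):                       # loop over blocks
        for y0 in range(0, ny, b):
            for x0 in range(0, nx, b):
                for z in range(z0, min(z0 + b, nz)):  # loop inside one block
                    for y in range(y0, min(y0 + b, ny)):
                        for x in range(x0, min(x0 + b, nx)):
                            if is_fluid(x, y, z):
                                order.append((x, y, z))
    return order


# Position of each cell in the 1D array; the connectivity array would be
# built from this mapping, and the solver loop stays a plain 0..N-1 loop.
index_of = {cell: k for k, cell in enumerate(blocked_order(4, 4, 4, 2))}
```

The six nested loops exist only in the preprocessor, which is why the solver carries no loop overhead for blocking.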

  30. Memory Layouts: Collision Optimized Layout • Standard array layout: F(i,x,y,z,t), “Array-of-Structures” • Collision optimized • Optimal read access: 2 cache lines per LUP (lattice site update) • Bad write access: 19 stores in 19 cache lines. But: depending on the system size, some of them are already/still in cache

  31. Memory Layouts: Collision Optimized Layout • 18 write accesses on one cell from 3 different z-layers • 8 write accesses on one cell from 3 different rows

  32. Memory Layouts: Collision Optimized Layout • Performance of the Collision Optimized Layout (P4, 512 kB L2)

  33. Memory Layouts: Propagation Optimized Layout • Optimized array layout: F(x,y,z,i,t), “Structure-of-Arrays” • Stride-1 access on x (inner loop) • 19 cache lines per 16 LUPs in both the read and the write process • 1 cache miss every 16th memory access
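The cache-line counts quoted for the two layouts can be checked with a little index arithmetic. The sketch below rests on assumptions: Fortran ordering (first index fastest), 8-byte values, 128-byte cache lines (which matches the "16 LUPs" figure on the slide), arrays that start on a cache-line boundary, and no time index. The helper names are hypothetical.

```python
Q = 19                    # distributions per cell (D3Q19)
LINE = 128 // 8           # doubles per cache line (assumed 128-byte lines)


def aos_offset(i, x, y, z, nx, ny):
    """Offset in F(i,x,y,z): the 19 values of one cell are contiguous."""
    return i + Q * (x + nx * (y + ny * z))


def soa_offset(i, x, y, z, nx, ny, nz):
    """Offset in F(x,y,z,i): the same direction of neighboring x is contiguous."""
    return x + nx * (y + ny * (z + nz * i))


def lines_touched(offsets):
    """Number of distinct cache lines covered by a set of element offsets."""
    return len({o // LINE for o in offsets})


nx, ny, nz = 16, 4, 4     # small example grid (assumed sizes)

# Collision optimized layout: reading all 19 values of the first cell.
aos_read = lines_touched(aos_offset(i, 0, 0, 0, nx, ny) for i in range(Q))

# Propagation optimized layout: reading 16 consecutive cells along x.
soa_read = lines_touched(
    soa_offset(i, x, 0, 0, nx, ny, nz) for i in range(Q) for x in range(16)
)
```

With these assumptions `aos_read` is 2 (cache lines for one lattice update) and `soa_read` is 19 (cache lines for 16 lattice updates). Cells whose 19 values are not aligned to a line boundary can straddle a third line in the collision optimized layout.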

  34. Memory Layouts: Propagation Optimized Layout • Performance on Pentium 4, 512 kB L2

  35. Further Optimization Techniques • Additional bottlenecks • Large loop body (causes register spills on IA32) • Concurrent writes to 19 different cache lines conflict with the limited number of write-combine buffers on IA32 (6 for Intel Xeon/Nocona) • Indirect addressing prevents the IA32 hardware prefetcher from preloading values for target cells (due to bounce back at obstacles) • Solutions (implemented in the solver): • Split the loop into 5 loops of length Nx • Manual block preload technique (Drawback: both techniques need a loop blocking scheme) • These solutions are only needed for IA32

  36. Results - Outline • Architecture Descriptions • Comparison 1D-Solver to Standard Solver • Memory Layouts • For Standard Solver • For 1D-Solver • Influence of Geometry • MC • SiC • Memory Traversing Schemes • Space-Filling Curves • Blocking

  37. Architecture Description: Nocona/Irwindale • Test system: test machine at RRZE • CPUs: Nocona (Irwindale), 3.6 GHz, 2 MB L2 cache • Memory: DDR400, 6.4 GB/s • Architectural specialties: • EM64T extension • Hyperthreading • One memory bus for both processors

  38. Architecture Description: Itanium2 / Altix • Test system: RRZE SGI Altix • CPUs: Itanium 2, 1.3 GHz, 3 MB L3 cache • Memory: 112 GB “distributed shared memory” • Architectural specialties: • Itanium 2: • “EPIC” (Explicitly Parallel Instruction Computing) • No out-of-order execution • Parallelization of instructions is in the hands of the compiler (bundles) • L1 cache only for integer data • Altix: • ccNUMA with NUMALink 3 • Memory connected hierarchically by SHUBs

  39. Architecture Description: AMD Opteron • Test system: LSS HPC cluster • CPUs: AMD Opteron, 2.2 GHz, 1 MB L2 cache, IA32-compatible • Memory: DDR333, 5.2 GB/s • Architectural specialties: • Compute nodes with four CPUs • 4 GB RAM per CPU, each CPU can access 16 GB via ccNUMA • Interconnect: • CPUs on one node: HyperTransport (6.4 GB/s) • Nodes: Infiniband (10 GB/s)

  40. Comparison 1D-Array Solver to Standard Solver

  41. Memory Layouts for Standard Solver

  42. Memory Layouts on Itanium 2

  43. Memory Layouts on AMD Opteron

  44. Memory Layouts on Nocona/Irwindale

  45. Influence of Geometry: SiC-foam (~2% obstacles)

  46. Influence of Geometry: MC (~50% obstacles)

  47. Memory Traversing Schemes

  48. Memory Traversing Schemes

  49. Conclusion • The 1D-array data representation makes performance independent of the obstacle-to-fluid ratio • Memory traversing by space-filling curves yields performance similar to spatial blocking • Implementing SFCs is not worth the effort (if they are used only as a memory traversing alternative) • Together with indirect addressing, the Collision Optimized Layout with blocking is the best technique if the cache is larger than 1 MB • Indeed, there are cases where the Propagation Optimized Layout is not the best

  50. Outlook • Future work could concern: • Space-Filling Curves: • A kind of staggered SFCs, with a separate curve for every direction • Avoid wasting underused cache lines where lattice sites are neighbors of cells that are visited much later • Galerkin discretization or pointwise evaluation of the LBM to enable a stack implementation in conjunction with SFCs • BUT: for real-world problems, construction on non-cubic grids is necessary first • Search for vectorization-enhancing techniques to overcome the problems with indirect addressing on Itanium 2 • Search for reasons why the Collision Optimized Layout is better than the Propagation Optimized Layout
