Optimizing Performance of the Lattice Boltzmann Method for Complex Structures Friedrich-Alexander University Erlangen/Nuremberg Department of Computer Science 10 (System Simulation) Regional Computing Center of Erlangen (RRZE)
Outline • Introduction • Lattice Boltzmann Method • Implementation Aspects • Application • Implementation • Optimization • Results • Conclusion Stefan Donath Friedrich-Alexander University Erlangen-Nuremberg
Lattice Boltzmann Method • Boltzmann Equation • Discretization of particle velocity space (finite set of discrete velocities)
Lattice Boltzmann Method • Different discretization schemes (D3Q15, D3Q19, D3Q27) • Numerical accuracy and stability • Computational speed and simplicity
Lattice Boltzmann Method • Discretization in space x and time t into a collision step and a streaming step
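The formulas for the two steps are not part of this transcript. As a hedged reference, the commonly used single-relaxation-time (BGK) form is given here; it is an assumption that the talk used this form. The symbols are likewise assumed: f_i is the distribution for discrete velocity e_i, f_i^eq its equilibrium, tau the relaxation time, and \tilde f_i the post-collision value.

```latex
\begin{align}
\tilde f_i(\mathbf{x}, t) &= f_i(\mathbf{x}, t)
  - \frac{1}{\tau}\left[ f_i(\mathbf{x}, t) - f_i^{\mathrm{eq}}(\rho, \mathbf{u}) \right]
  && \text{(collision step)} \\
f_i(\mathbf{x} + \mathbf{e}_i \Delta t,\; t + \Delta t) &= \tilde f_i(\mathbf{x}, t)
  && \text{(streaming step)}
\end{align}
```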
Lattice Boltzmann Method (Implementation Aspects) • Discretization in space x and time t: collision step, streaming step • Stream-Collide (Pull-Method): get the distributions from the neighboring cells in the source array and store the relaxed values to one cell in the destination array • Collide-Stream (Push-Method): take the distributions from one cell in the source array and store the relaxed values to the neighboring cells in the destination array
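The difference between the two orderings can be sketched on a tiny periodic 1D lattice with three velocities. This is an illustration only: the lattice, the toy `relax` function, and the names `push_step` and `pull_step` are assumptions, not the talk's D3Q19 code.

```python
# Discrete velocities of a 1D three-velocity lattice (illustrative stand-in for D3Q19).
E = [0, 1, -1]

def relax(cell):
    """Toy relaxation (assumed): move each distribution halfway to the cell average."""
    mean = sum(cell) / len(cell)
    return [0.5 * (v + mean) for v in cell]

def push_step(src):
    """Collide-Stream: read one cell, write the relaxed values to its neighbors."""
    n = len(src)
    dst = [[0.0] * len(E) for _ in range(n)]
    for x in range(n):
        post = relax(src[x])
        for i, e in enumerate(E):
            dst[(x + e) % n][i] = post[i]   # scattered writes, periodic wrap
    return dst

def pull_step(src):
    """Stream-Collide: gather from the neighbors, relax, write to one cell."""
    n = len(src)
    dst = [[0.0] * len(E) for _ in range(n)]
    for x in range(n):
        gathered = [src[(x - e) % n][i] for i, e in enumerate(E)]  # scattered reads
        dst[x] = relax(gathered)
    return dst
```

Both variants apply the same two operators, only in a different order within one sweep: push has contiguous reads and scattered writes, pull the reverse.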
Lattice Boltzmann Method (Implementation Aspects) • Walls and Obstacles: Bounce Back rule
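A minimal sketch of the bounce-back rule in a push-style streaming step, assuming a 1D channel whose first and last cells are solid. The names `stream_bounce_back`, `OPP`, and `solid` are hypothetical: a distribution that would enter a solid cell is written back into the same fluid cell in the opposite direction.

```python
E = [0, 1, -1]     # discrete velocities (1D stand-in for D3Q19)
OPP = [0, 2, 1]    # index of the direction opposite to direction i

def stream_bounce_back(post, solid):
    """Stream post-collision values; reflect those that would hit a solid cell.

    Assumes the outermost cells are solid, so x + e is always a valid index
    for every fluid cell x.
    """
    n = len(post)
    dst = [[0.0] * len(E) for _ in range(n)]
    for x in range(n):
        if solid[x]:
            continue                            # nothing to stream from obstacles
        for i, e in enumerate(E):
            target = x + e
            if solid[target]:
                dst[x][OPP[i]] = post[x][i]     # bounce back into the same cell
            else:
                dst[target][i] = post[x][i]     # regular streaming
    return dst
```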
Implementation Aspects • Data Dependencies • Two Grids • Compressed Grid
Implementation Aspects
double precision f(0:xMax+1,0:yMax+1,0:zMax+1,0:18,0:1)
do z=1,zMax
  do y=1,yMax
    do x=1,xMax
      if( fluidcell(x,y,z) ) then
        ! Collide
        LOAD f(x,y,z, 0:18,t)
        Relaxation (complex computations)
        ! Stream
        SAVE f(x  ,y  ,z  , 0,t+1)
        SAVE f(x+1,y+1,z  , 1,t+1)
        SAVE f(x  ,y+1,z  , 2,t+1)
        SAVE f(x-1,y+1,z  , 3,t+1)
        …
        SAVE f(x  ,y-1,z-1,18,t+1)
      endif
    enddo
  enddo
enddo
Application
Application: Porous Media Combustion • New technology for heating installations: Porous Media Combustion • The fuel-air mixture no longer reacts in a free flame • The combustion process takes place inside the pores of a porous medium that is placed in the reaction area [Figure labels: 60 mm, 20 mm]
Application: Porous Media Combustion (Figures by courtesy of LSTM Uni-Erlangen, Thomas Zeiser)
Application: Porous Media Combustion • Various applications: modern steam engines, vehicle heaters (Figures by courtesy of LSTM Uni-Erlangen, Thomas Zeiser)
Application: Complex Geometries Used • Porous medium from PMC: Silicon-Carbide (SiC) • Many small obstacles • Obstacle/fluid-ratio: ~2% • High number of fluid-solid faces
Application: Complex Geometries Used • Second test geometry: "MC" • Huge obstacles, only few fluid tubes • Obstacle/fluid-ratio: ~50% • Low number of fluid-solid faces
Implementation
Implementation • Collision and streaming step in the same loop • Push-Method • Data representation in a 1D-Array • Stores only fluid cells (saves memory) • Indirect addressing of target cells (by an extra connectivity array) • Boundary conditions (Bounce Back) handled implicitly
Implementation • Indirect addressing and implicit Bounce Back [Figure: cells i-1, i, i+1 next to an obstacle and a wall, with their entries in the connectivity array]
Implementation • Preprocessor: sets up the connectivity array; specifies all domain parameters (geometry and obstacles, traversing scheme) • Solver: reads in the preprocessed information; performs the lattice Boltzmann method (in a single loop)
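The split between preprocessor and solver can be sketched as follows. This is a minimal illustration under assumptions: a 1D three-velocity lattice, hypothetical names `build_connectivity` and `solver_step`, and relaxation left out. The preprocessor numbers only the fluid cells and stores one write target per distribution; where the neighbor is solid, the target is the opposite direction of the same cell, so bounce back needs no branch in the solver.

```python
E = [0, 1, -1]     # discrete velocities (1D stand-in for D3Q19)
OPP = [0, 2, 1]    # index of the direction opposite to direction i

def build_connectivity(solid):
    """Preprocessor: number the fluid cells and precompute every write target."""
    fluid_id = {}
    for x, is_solid in enumerate(solid):
        if not is_solid:
            fluid_id[x] = len(fluid_id)        # compressed 1D cell index
    nq = len(E)
    conn = []
    for x, k in fluid_id.items():              # dicts keep insertion order
        for i, e in enumerate(E):
            nb = x + e
            if nb in fluid_id:
                conn.append(fluid_id[nb] * nq + i)   # regular streaming target
            else:
                conn.append(k * nq + OPP[i])         # bounce back folded into the table
    return conn, len(fluid_id)

def solver_step(f_src, conn):
    """Solver: one loop over the 1D array, no geometry test, no bounce-back branch."""
    f_dst = [0.0] * len(f_src)
    for j, value in enumerate(f_src):          # relaxation omitted for brevity
        f_dst[conn[j]] = value
    return f_dst
```

For the geometry `[solid, fluid, fluid, solid]` only two cells (six values) are stored, and the solver never looks at the obstacle flags.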
Implementation Rating: 1D-Array compared to the standard implementation using a multidimensional array and three loops • Advantages: saves memory (the more obstacles in the domain, the higher the compression); implicit Bounce Back (no extra routine or if-statement needed) • Drawbacks: indirect addressing prevents the compiler from vectorizing and from applying other optimization techniques, which consequently lowers performance
Optimization - Outline • Memory Traversing Schemes: Space-Filling Curves, Blocking • Memory Layouts • Further Optimization Techniques
Memory Traversing Schemes: Space-Filling Curves • What is a space-filling curve? Loosely speaking: a one-dimensional curve that fills a higher-dimensional space • Which curves were used? Hilbert and Peano • How are they constructed? Again, loosely: by a mapping from a one-dimensional interval to the higher-dimensional space, then by recursively applying a new mapping to each part of the interval
Memory Traversing Schemes: Space-Filling Curves • And how does the construction really work? [Figure: mapping of the interval from 0 to 1]
Memory Traversing Schemes: Space-Filling Curves • How does that look in 3D? [Figure panels: "Hilbert bne", "Hilbert nwf", "Hilbert fws", "Hilbert enb", "Peano fsw"] • OK, but how to construct them?
Memory Traversing Schemes: Space-Filling Curves • How to construct them? • Table-based approach • Hilbert: 48 productions with 8 entries and 7 connectors • Peano: 8 productions with 27 entries and 26 connectors
Memory Traversing Schemes: Space-Filling Curves • Summary for Space-Filling Curves • Recursive production by segmentation limits the system sizes: Hilbert: 2^(3n) cells, Peano: 3^(3n) cells • Increase spatial locality • Enable mesh refinement
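To make the idea concrete, here is a sketch of a Hilbert ordering in 2D. Note the swap: the talk uses a table-based construction of 3D curves, whereas this example uses the well-known iterative bit-manipulation mapping for the 2D curve, and the name `hilbert_d2xy` is hypothetical. The side length must be a power of two, which mirrors the size limitation named above.

```python
def hilbert_d2xy(n, d):
    """Map position d along a 2D Hilbert curve to (x, y) on an n-by-n grid.

    n must be a power of two; d runs from 0 to n*n - 1.
    """
    x = y = 0
    t = d
    s = 1
    while s < n:
        rx = 1 & (t // 2)
        ry = 1 & (t ^ rx)
        if ry == 0:                 # rotate/flip the quadrant built so far
            if rx == 1:
                x = s - 1 - x
                y = s - 1 - y
            x, y = y, x
        x += s * rx                 # move into the quadrant selected at this level
        y += s * ry
        t //= 4
        s *= 2
    return x, y

# Traversal order a preprocessor could use to arrange cells in memory.
order = [hilbert_d2xy(8, d) for d in range(64)]
```

Consecutive entries of `order` are always direct grid neighbors, which is the spatial locality the curve is used for.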
Memory Traversing Schemes: Blocking • Implicit blocking technique • Data is arranged in memory block by block • Increases spatial locality
Memory Traversing Schemes • Notes on Memory Traversing Schemes: pure preprocessing technique • Only the arrangement of data in memory is changed • No change for the solver • No overhead for the solver (e.g. loop overhead for blocking)
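Implicit blocking as a pure preprocessing step can be sketched like this, assuming a 2D grid for brevity and a hypothetical helper `blocked_order`. The function only produces the order in which cells are numbered and stored; the solver loop itself stays a plain sweep over the 1D array.

```python
def blocked_order(nx, ny, bx, by):
    """Return all (x, y) cells so that each bx-by-by block is finished before the next."""
    order = []
    for y0 in range(0, ny, by):                      # loop over blocks
        for x0 in range(0, nx, bx):
            for y in range(y0, min(y0 + by, ny)):    # loop inside one block
                for x in range(x0, min(x0 + bx, nx)):
                    order.append((x, y))
    return order
```

Because the blocking lives in the data arrangement, the solver pays no extra loop overhead for it.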
Memory Layouts: Collision Optimized Layout • Standard Array-Layout: F(i,x,y,z,t), "Array-of-Structures", collision optimized • Optimal read access: 2 cache lines per LUP (lattice site update) • Bad write access: 19 stores in 19 cache lines; but depending on the system size, some of them are already/still in cache
Memory Layouts: Collision Optimized Layout • 18 write accesses on one cell from 3 different z-layers • 8 write accesses on one cell from 3 different rows
Memory Layouts: Collision Optimized Layout • Performance of Collision-Optimized Layout (P4, 512kB L2)
Memory Layouts: Propagation Optimized Layout • Optimized Array-Layout: F(x,y,z,i,t), "Structure-of-Arrays" • Stride-1 access on x (inner loop) • 19 cache lines per 16 LUPs in both the read and the write process • 1 cache miss on every 16th memory access
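The cache-line counts quoted for the two layouts can be reproduced with a small index calculation. The sketch assumes 8-byte doubles and 128-byte cache lines (16 doubles per line, inferred from the "every 16th access" figure), ignores the time index t, and uses hypothetical helper names.

```python
Q = 19              # distributions per cell (D3Q19)
LINE_DOUBLES = 16   # assumed: 128-byte cache line / 8-byte double

def offset_aos(i, x, y, z, nx, ny):
    """F(i,x,y,z): i is the fastest-running index (collision optimized)."""
    return i + Q * (x + nx * (y + ny * z))

def offset_soa(i, x, y, z, nx, ny, nz):
    """F(x,y,z,i): x is the fastest-running index (propagation optimized)."""
    return x + nx * (y + ny * (z + nz * i))

def lines_touched(offsets):
    """Number of distinct cache lines covered by a set of element offsets."""
    return len({o // LINE_DOUBLES for o in offsets})

nx = ny = nz = 16
# Reading all 19 values of the first cell in the collision-optimized layout.
aos_read = lines_touched(offset_aos(i, 0, 0, 0, nx, ny) for i in range(Q))
# Reading all 19 values of 16 consecutive x-cells in the propagation-optimized layout.
soa_read = lines_touched(
    offset_soa(i, x, 0, 0, nx, ny, nz) for i in range(Q) for x in range(16)
)
```

Under these assumptions `aos_read` is 2 (two lines for one cell) and `soa_read` is 19 (nineteen lines shared by sixteen cells).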
Memory Layouts: Propagation Optimized Layout • Performance on Pentium 4, 512kB L2
Further Optimization Techniques • Additional bottlenecks: large loop body (causes register spills on IA32); concurrent writing to 19 different cache lines conflicts with the limited number of write-combine buffers on IA32 (6 for Intel Xeon/Nocona); indirect addressing prevents the IA32 hardware prefetcher from preloading values for target cells (due to bounce back at obstacles) • Solutions (implemented in the solver, needed only on IA32): split the loop into 5 loops of length Nx; manual block preload technique • Drawback: both techniques need a loop blocking scheme
Results - Outline • Architecture Descriptions • Comparison of 1D-Solver to Standard Solver • Memory Layouts: for Standard Solver, for 1D-Solver • Influence of Geometry: MC, SiC • Memory Traversing Schemes: Space-Filling Curves, Blocking
Architecture Description: Nocona/Irwindale • Test System: test machine at RRZE • CPUs: Nocona (Irwindale), 3.6 GHz, 2 MB L2 cache • Memory: DDR400, 6.4 GB/s • Architectural features: EM64T extension; Hyperthreading; one memory bus for both processors
Architecture Description: Itanium2 / Altix • Test System: RRZE SGI Altix • CPUs: Itanium 2, 1.3 GHz, 3 MB L3 cache • Memory: 112 GB "distributed shared memory" • Architectural features of Itanium 2: "EPIC" (Explicitly Parallel Instruction Computing); no out-of-order execution; instruction-level parallelism is the compiler's responsibility (instruction bundles); L1 cache for integer data only • Architectural features of Altix: ccNUMA with NUMALink 3; memory connected hierarchically by SHUBs
Architecture Description: AMD Opteron • Test System: LSS HPC-Cluster • CPUs: AMD Opteron, 2.2 GHz, 1 MB L2 cache, IA32-compatible • Memory: DDR333, 5.2 GB/s • Architectural features: compute nodes with four CPUs; 4 GB RAM per CPU, each CPU can access all 16 GB via ccNUMA • Interconnect: CPUs on one node via HyperTransport (6.4 GB/s); nodes via Infiniband (10 GB/s)
Comparison 1D-Array Solver to Standard Solver
Memory Layouts for Standard Solver
Memory Layouts on Itanium 2
Memory Layouts on AMD Opteron
Memory Layouts on Nocona/Irwindale
Influence of Geometry: SiC-foam (~2% obstacles)
Influence of Geometry: MC (~50% obstacles)
Memory Traversing Schemes
Conclusion • The 1D-Array data representation makes performance independent of the obstacle-to-fluid ratio • Memory traversing by Space-Filling Curves results in performance similar to spatial blocking • Implementing SFCs is not worth the effort (if they are used only as a memory traversing alternative) • Combined with indirect addressing, the Collision Optimized Layout with blocking is the best technique if the cache is larger than 1 MB • Indeed, there are cases where the Propagation Optimized Layout is not the best
Outlook • Future work could concern: • Space-Filling Curves: a kind of staggered SFCs, with a separate curve for every direction; avoiding the waste of underused cache lines where neighboring lattice sites are cells that are visited much later; Galerkin discretization or pointwise evaluation of LBM to enable a stack implementation in conjunction with SFCs; BUT: for real-world problems, construction on non-cubic grids is necessary first • Search for vectorization-enhancing techniques to overcome the problems with indirect addressing on Itanium 2 • Search for reasons why the Collision Optimized Layout is better than the Propagation Optimized Layout