Mesh-free numerical methods for large-scale engineering problems. Basics of GPU-based computing. D. Stevens, M. Lees, P. Orsini. Supervisors: Prof. Henry Power, Hervé Morvan. January 2008
Outline: • Meshless numerical methods using Radial Basis Functions (RBFs) • Basic RBF interpolation • Brief overview of the work done in our research group, under the direction of Prof. Henry Power • Example results • Large-scale future problems • GPU computing • Introduction to the principle of using graphics processors (GPUs) as floating-point co-processors • Current state of GPU hardware and software • Basic implementation strategy for numerical simulations
GABARDINE - EU project • GABARDINE: • Groundwater Artificial recharge Based on Alternative sources of wateR: aDvanced INtegrated technologies and managEment • Aims to investigate the viability of artificial recharge in groundwater aquifers, and to produce decision-support mechanisms • Partners include: Portugal, Germany, Greece, Spain, Belgium, Israel, Palestine • The University of Nottingham is developing numerical methods to handle: • Phreatic aquifers (with associated moving surfaces) • Transport processes • Unsaturated-zone problems (with nonlinear governing equations)
RBF Interpolation Methods • The solution is approximated by a linear combination of radially symmetric basis functions $\psi$ centred at $N$ data points, giving $N$ unknown weights $\lambda_j$: $s(\mathbf{x}) = \sum_{j=1}^{N} \lambda_j \, \psi(\lVert \mathbf{x} - \boldsymbol{\xi}_j \rVert)$ • Apply the above at each of the N test locations, $s(\mathbf{x}_i) = f(\mathbf{x}_i)$, $i = 1,\dots,N$, which yields a dense $N \times N$ linear system for the weights (a concrete sketch follows)
Kansa's Method (diffusion operator) • The same interpolant is used, but the governing operator $L$ (here a diffusion operator, e.g. $L = \nabla \cdot (D\nabla)$) is applied to the basis functions. Collocating: $\sum_{j=1}^{N} \lambda_j \, L\psi(\lVert \mathbf{x}_i - \boldsymbol{\xi}_j \rVert) = f(\mathbf{x}_i)$ at interior points, and $\sum_{j=1}^{N} \lambda_j \, B\psi(\lVert \mathbf{x}_i - \boldsymbol{\xi}_j \rVert) = g(\mathbf{x}_i)$ at points on the boundary (boundary operator $B$), giving an unsymmetric linear system for the weights
Hermitian Method • Operators are also applied to the basis functions within the interpolant itself: $s(\mathbf{x}) = \sum_{j \in \Omega} \lambda_j \, L_{\xi}\psi(\lVert \mathbf{x} - \boldsymbol{\xi}_j \rVert) + \sum_{j \in \partial\Omega} \lambda_j \, B_{\xi}\psi(\lVert \mathbf{x} - \boldsymbol{\xi}_j \rVert)$ • Collocating with $L$ and $B$ as in Kansa's method then produces a SYMMETRIC SYSTEM
RBF – Features and Drawbacks • Partial derivatives are obtained cheaply and accurately by differentiating the (known) basis functions • Leads to a highly flexible formulation, allowing boundary operators to be implemented exactly and directly • Once the solution weights are obtained, a continuous solution can be reconstructed over the entire solution domain • A densely populated linear system must be solved to obtain the solution weights • This leads to high computational cost (the dense matrix alone is O(N²) in storage, and a direct solve is O(N³)) and to numerical conditioning issues with large systems, setting a practical limit of ~1000 points
Formulation: LHI method • Consider an initial value problem of the general form $\partial u / \partial t = Lu$, with boundary operator(s) $B$ where present • Hermitian interpolation is built using solution values, and boundary operator(s) if present
LHI method formulation cont… • Form N small systems, each based on a local support of neighbouring nodes: each matrix H is only ~10 × 10 • Hence, the solution can be reconstructed in the vicinity of local system k via $u^{(k)}(\mathbf{x}) = \boldsymbol{\Psi}(\mathbf{x}) \, H^{-1} \mathbf{d}^{(k)}$, where $\boldsymbol{\Psi}(\mathbf{x})$ is the vector of (operated-on) basis functions evaluated at $\mathbf{x}$ and $\mathbf{d}^{(k)}$ collects the local solution values and boundary data (see the sketch below)
CV-RBF (modified control-volume scheme) • Classic CV approach: polynomial interpolation to compute the flux across cell faces • CV-RBF approach: RBF interpolation to compute the flux, with the internal (PDE), Dirichlet and boundary operators applied within each local stencil (see the 1D sketch below)
Our Code – Simulation Workflow [Workflow diagram: pre-processing (CAD, GridGen, Triangle) → dataset generation → simulation (meshless RBF and CV-RBF codes, with RBF-specific stages) → post-processing (TecPlot)]
Convection-Diffusion: Validation • Both methods have been validated against a series of 1D and 3D linear and nonlinear advection-diffusion-reaction problems, i.e. equations of the general form $\partial c/\partial t + \mathbf{v} \cdot \nabla c = \nabla \cdot (D \nabla c) + R(c)$
CV-RBF: Infiltration well + Pumping – setup • Pumping location: 25 m from the infiltration well (height y = 15 m) • Well diameter, infiltration-pumping rate and soil properties: [values given in the original slide figures]
CV-RBF: Infiltration well + Pumping • 3D model: mesh (60,000 cells) and boundary conditions [figure; the conditions applied everywhere else are annotated in the original slide]
CV-RBF: Infiltration well + Pumping • Piezometric head contours and streamlines at t = 30 h [figures: planes at y = 29 m and z = 25 m; length scale 100 m; maximum displacements annotated]
LHI: Infiltration model - Setup Solving Richards' equation (conventionally written $\partial \theta / \partial t = \nabla \cdot \left[ K(h) \nabla (h + z) \right]$): • Infiltrate over a 10 m × 10 m region at the ground surface • Infiltration pressure = 2 m • Zero-flux boundary at the base ('solid wall') • Fixed pressure distribution at the sides • Initial pressure: [given in the original slide]
LHI: Infiltration model – Soil properties • Saturated conductivity and storativity: [values given in the original slide] • Using the van Genuchten soil representation, which closes Richards' equation with $\theta(h) = \theta_r + (\theta_s - \theta_r)\left[1 + |\alpha h|^{n}\right]^{-m}$, $m = 1 - 1/n$, and the associated Mualem conductivity $K(h) = K_s \, S_e^{1/2} \left[1 - (1 - S_e^{1/m})^{m}\right]^2$, where $S_e$ is the effective saturation (sketched below)
LHI: Infiltration model - Results • 11,585 points arranged in 17 layers • ‘Short’ runs: solution to 48 hours • ‘Long’ run: solution to 160 hours
Richards' equation - First example • Using the steady-state example given in Tracy (2006)* for the solution of Richards' equation • A prescribed pressure-head distribution is applied on the top face, with all other faces held at a constant (dry) reference head [the expressions and domain dimensions are given in the original slide] * F.T. Tracy, "Clean two- and three-dimensional analytical solutions of Richards' equation for testing numerical solvers", Water Resources Research 42 (2006)
Richards' equation - First example • Computed with (11 × 11 × 11) and (21 × 21 × 21) uniformly spaced points • [Figures: solution profiles for α = 0.164 and α = 0.328, each at the two point densities]
Richards' equation - First example (error analysis) • [Table: L2 and maximum error norms against the finite-volume maximum error from the Tracy paper; improvement factors of 9.34, 129.5 and 706.6] • Good improvement over the finite volume results from the Tracy paper, particularly with rapidly decaying K and θ functions • Reasonable convergence rate with increasing point density
Future work • Future work will focus on large-scale problems, including: • Regional-scale models of real-world experiments in Portugal and Greece • Country-scale models of aquifer pumping and seawater intrusion across Israel and Gaza • The large problem size will require a parallel implementation for efficient solution – hence our interest in HPC and GPUs • Practical implementation will require parallel iterative solvers for large, sparsely populated matrices • To our knowledge, we are the only group applying meshless numerical techniques to large-scale hydrology problems
GPU Computing: • GPU: Graphics Processing Unit • Originally designed to accelerate floating-point-heavy calculations in computer games, e.g.: • Pixel shader effects (lighting, reflection/refraction, other effects) • Geometry setup (character meshes etc.) • Solid-body physics (…not yet widely adopted) • Massively parallel architecture – currently up to 128 floating-point processing units • Recent hardware (Nov 2006) and software (Feb 2007) advances have introduced fully programmable processing units (rather than units specialised for pixel or vertex processing) • This has led to "General Purpose GPUs" - 'GPGPUs'
GPU Computing: • GPUs are extremely efficient at handling multiply-add instructions in small 'packets' – usually the dominant computational cost in numerical simulations • Floating-point capacity outstrips CPUs, in both theoretical peak and achievable efficiency (if algorithms are implemented appropriately)
GPU Computing: • Modern GPUs effectively work like a shared-memory cluster: • GPUs have extremely fast (~1000 MHz vs ~400 MHz), dedicated onboard memory • Onboard memory sizes currently range up to 1.5 GB (in addition to system memory)
CUDA - BLAS and FFT libraries • The CUDA toolkit is a C-language development environment for CUDA-enabled GPUs • Two pre-parallelised libraries are implemented on top of CUDA: • Basic Linear Algebra Subprograms (BLAS) • Fast Fourier Transform (FFT) • Available examples/demos: parallel bitonic sort; matrix multiplication; matrix transpose; performance profiling using timers; parallel prefix sum (scan) of large arrays; image convolution; 1D DWT using the Haar wavelet; graphics interoperation; CUDA BLAS and FFT usage; CPU-GPU C and C++ code integration; binomial, Black-Scholes and Monte Carlo option pricing; parallel Mersenne twister; parallel histogram; image denoising; Sobel edge-detection filter
TESLA - GPUs for HPC • Deskside (2 GPUs, 3 GB) and rackmount (4 GPUs, 6 GB) options • At ~500 GFLOPS per GPU, that is ~2 TFLOPS per rackmount unit • The deskside model is listed at around $4,200
GPU Computing – Some results • Using CUDA – nVidia's C-based API for GPUs [GPU specifications were listed in the original slide] • Example: multiplication of densely populated matrices – an O(N³) algorithm • The matrices are broken down into sub-blocks, which are streamed through the GPU's processing units, as sketched below
Various CPUs vs GPU • [Chart: measured GPU throughput against various CPUs] • Note: the performance of dual-core and quad-core CPUs is approximated by assuming an idealised parallelisation (i.e. 100% efficiency)
More GPU propaganda: • Good anecdotal evidence for improvement in real-world simulations is available from those who have switched to GPU computing: • Dr. John Stone, Beckman Institute for Advanced Science and Technology, NAMD virus simulations: • 110 CPU-hours on an SGI Itanium supercomputer => 47 minutes on a single GPU • Represents a roughly 140-fold speedup (6,600 minutes vs 47) • Dr. Graham Pullan, University of Cambridge, CFD on turbine blades (LES and RANS models): • 40× absolute speedup switching from a CPU cluster to 'a few' GPUs • Now uses 10 million cells on the GPU, up from 500,000 on the CPU cluster
Closing words • GPU performance is advancing at a much faster rate than CPU performance, and this is expected to continue for some time yet • With CUDA and its BLAS library, exploiting the parallelism of GPUs is in some cases easier than traditional MPI approaches • Expected later this year: • 2-3 times the performance of current hardware (over 1 TFLOP per card) • Native 64-bit (double precision) floating-point capability • More info: • www.nvidia.com/tesla • www.nvidia.com/cuda • www.GPGPU.org