FMM Algorithm Analysis for Innovative Computer Architecture Design • Lü Chao • Shanghai Jiao Tong University • School of Software • 2010
Outline • Background • Previous Work • Introduction to the N-body Problem • FMM Algorithm Analysis • Configuration Strategies for FMM Optimization • Conclusion
Background • Project source: Research and Development of a New-Concept High-Efficiency Computer Architecture and System • Key project of the National 863 Program (2009AA012201) • Major science and technology project of the Shanghai Science and Technology Commission (08dz501600) • Topic: application analysis and front-end design for the new architecture • Preliminary application analysis • Front-end design of the architecture • Compiler / software platform design • Application optimization • Main goal: design a reconfigurable special-purpose processor architecture for high-performance computing
Previous Work • Analysis of high-performance computing applications • Image reconstruction for CT and MRI • Local image feature extraction and matching based on the SURF algorithm • Application simulation and optimization • SURF optimization and analysis with multi-core CPU parallelism • GPGPU implementation of SURF (CUDA-SURF) • SURF optimization on heterogeneous CPU/GPU platforms
Introduction to the N-body Problem • Purpose • Analyze it as a representative application for the target architecture • Derive architecture configuration strategies tuned to the application • The N-body problem • Also called the many-body problem; one of the fundamental problems in astrophysics, fluid dynamics, and molecular dynamics • Simulates the motion of mutually interacting particles in a system • A typical high-performance computing application • Mathematical form: a system of ordinary differential equations with known initial values
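Written out for the gravitational case, the system of ordinary differential equations mentioned above takes the following standard form (not taken from the slides; G is the gravitational constant, m_j and r_j the mass and position of body j):

```latex
\ddot{\mathbf{r}}_i \;=\; \sum_{\substack{j=1 \\ j \neq i}}^{N}
  \frac{G\, m_j\, (\mathbf{r}_j - \mathbf{r}_i)}{\lVert \mathbf{r}_j - \mathbf{r}_i \rVert^{3}},
\qquad
\mathbf{r}_i(0) = \mathbf{r}_i^{0}, \quad
\dot{\mathbf{r}}_i(0) = \mathbf{v}_i^{0}, \qquad i = 1, \dots, N
```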
Introduction to the N-body Problem (cont.) • Common algorithms • PP (Particle-Particle) • Evaluates the pairwise interaction formula directly • Time complexity O(N²) • PM (Particle-Mesh) • Uses a particle mesh, treating the effect of many points as a whole (computing the potential on the mesh) • Time complexity O(N log N) • TM (Tree Method) • Approximates groups of distant particles using a hierarchical tree (Barnes-Hut style) • Time complexity O(N log N)
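As a concrete illustration of the O(N²) PP approach, here is a minimal CUDA sketch (kernel name, data layout, and softening constant are assumptions for illustration, not part of the original slides): each thread accumulates the acceleration on one body by looping over all others.

```cuda
// Minimal sketch of the PP (particle-particle) algorithm: O(N^2) pairwise
// interactions, one thread per body. Names and constants are illustrative.
__global__ void pp_forces(const float4 *pos,  // xyz = position, w = mass
                          float3 *acc,
                          int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;

    const float soft2 = 1e-9f;                 // softening avoids division by zero
    float4 pi = pos[i];
    float3 ai = make_float3(0.0f, 0.0f, 0.0f);

    for (int j = 0; j < n; ++j) {              // loop over all other bodies
        float4 pj = pos[j];
        float3 r = make_float3(pj.x - pi.x, pj.y - pi.y, pj.z - pi.z);
        float d2 = r.x * r.x + r.y * r.y + r.z * r.z + soft2;
        float invD = rsqrtf(d2);
        float s = pj.w * invD * invD * invD;   // m_j / |r|^3
        ai.x += r.x * s;  ai.y += r.y * s;  ai.z += r.z * s;
    }
    acc[i] = ai;                               // acceleration on body i
}
```

The tree and FMM methods reduce this cost by replacing the inner loop over distant bodies with interactions against aggregated cells.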
Overview of GPU Architecture • Graphics Pipeline / Programmable Hardware / Unified Shading Model / NVIDIA GeForce 8800 GTX
Graphics Pipeline • The Vertex/Geometry Stage • transforms each vertex from object space into screen space • assembles the vertices into triangles • traditionally performs lighting calculations on each vertex • The Rasterization Stage • determines the screen positions covered by each triangle • interpolates per-vertex parameters across the triangle • The Fragment/Pixel Stage • computes the color for each fragment • The Composition/Display Stage • assembles fragments into the final image of pixels
Programmable Hardware • In the programmable graphics pipeline • User-defined vertex program • User-defined fragment program • Limitations • Simple, incomplete instruction sets • Fragment program data types are mostly fixed-point • Limited number of instructions and a small number of registers • Limited number of inputs and outputs • No conditional branching
Unified Shader Model • A unified shader model must • Support at least 65k static instructions and unlimited dynamic instructions • Support both 32-bit integers and 32-bit floating-point numbers • Allow an arbitrary number of both direct and indirect reads from global memory (texture) • Support dynamic flow control in the form of loops and branches • Current GPUs support the unified Shader Model 4.0 on both vertex and fragment shaders
Unified Shading Architecture NVIDIA GeForce 8800 GTX Architecture • Green grid – Streaming Multiprocessor • Grid of purple blocks – Thread Processors • 16 streaming multiprocessors of 8 thread processors each (128 in total).
Unified Shading Architecture (cont.) NVIDIA GeForce 8800 GTX – Thread Processor • Each processing cluster contains a pair of streaming multiprocessors • Each streaming multiprocessor contains shared instruction and data caches, control logic, a 16 KB shared memory, eight stream processors, and two special function units.
How to Program GPGPU • GPU Programming Model / GPU Programming Flow Control / GPGPU Techniques / GPGPU Applications
GPU Programming Model • The GPU programming model combines • graphics API terminology • the stream programming model • A typical GPGPU program using the fragment processor is structured as follows • Segment the general-purpose program into independent parallel sections (kernels) • Specify the range of computation / the size of the output stream to invoke a kernel • Use the rasterizer to generate a fragment for every pixel location in the quad • Each generated fragment is then processed by the active kernel fragment program • The output of the fragment program is a value (or vector of values) per fragment
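Since the earlier work in this project also targets CUDA (CUDA-SURF), the same kernel/stream structure can be expressed directly in CUDA rather than through fragment programs. The sketch below is illustrative only (function names are assumptions): the launch grid plays the role of the output-stream size, and each thread corresponds to one generated fragment.

```cuda
// Illustrative CUDA analogue of the fragment-program pattern described above.
__global__ void map_kernel(const float *in, float *out, int n)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;  // one "fragment" per thread
    if (idx < n)
        out[idx] = in[idx] * in[idx];                 // the per-element kernel body
}

// Host side: choosing the launch configuration corresponds to specifying
// the range of computation / the size of the output stream.
void run_map(const float *d_in, float *d_out, int n)
{
    int block = 256;
    int grid  = (n + block - 1) / block;
    map_kernel<<<grid, block>>>(d_in, d_out, n);
}
```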
GPU Programming Flow Control • Three basic implementations of data-parallel branching • Predication • Both sides of the branch are evaluated • Multiple Instruction Multiple Data (MIMD) branching • Different processors follow different paths • Single Instruction Multiple Data (SIMD) branching • If the condition is identical for all pixels in the group, only the taken side of the branch must be evaluated • If one or more of the processors evaluates the branch condition differently, both sides must be evaluated and the results predicated • It is better to move branching up the pipeline
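The SIMD case can be made concrete with a small CUDA sketch (illustrative names; the warp here plays the role of the pixel group): when the condition is uniform across a warp only one side executes, otherwise both sides execute with inactive threads masked off.

```cuda
// Sketch of SIMD branch divergence within a warp (the "group" above).
__global__ void branch_demo(const float *x, float *y, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;

    if (x[i] > 0.0f)           // condition uniform in the warp: only this side runs
        y[i] = sqrtf(x[i]);
    else                       // condition divergent: both sides run serially,
        y[i] = 0.0f;           // inactive threads' results are predicated away
}
```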
GPGPU Techniques • Stream Operations: • Map and Reduce – Straightforward [BFH∗04b] BUCK I., FOLEY T., HORN D., SUGERMAN J., FATAHALIAN K., HOUSTON M., HANRAHAN P.: Brook for GPUs: Stream computing on graphics hardware. ACM Transactions on Graphics 23, 3 (Aug. 2004), 777–786. • Scatter and Gather – Avoid Scatter [Buc05b] BUCK I.: Taking the plunge into GPU computing. In GPU Gems 2, Pharr M., (Ed.). Addison Wesley, Mar. 2005, ch. 32, pp. 509–519. • Scan – All-prefix-sums operation [HS86] [Ble90] [Hor05] [HSC*05] [SLO06, GGK06] • Filtering – Using a combination of scan and search, O(log n) achieved. [Hor05] HORN D.: Stream reduction operations for GPGPU applications. In GPU Gems 2, Pharr M., (Ed.). Addison Wesley, Mar. 2005, ch. 36, pp. 573–589. • Sort – Based on sorting networks, such as parallel bitonic merge sort [BP04, CND03, GZ06, KSW04, KW05a, PDC∗03, Pur04] • Search – Binary search [Hor05, PDC∗03, Pur04] / Nearest neighbor search [Ben75, FS05, PDC∗03, Pur04]
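As an example of the reduce operation listed above, a common CUDA formulation uses a tree-shaped reduction in shared memory. The sketch below is a minimal version (kernel name and launch details are assumptions; the block size is assumed to be a power of two, and the per-block partial sums still need a second pass or an atomic combine).

```cuda
// Minimal block-wise sum reduction: each block reduces its tile into one value.
__global__ void reduce_sum(const float *in, float *out, int n)
{
    extern __shared__ float sdata[];            // size = blockDim.x floats
    int tid = threadIdx.x;
    int i   = blockIdx.x * blockDim.x + threadIdx.x;

    sdata[tid] = (i < n) ? in[i] : 0.0f;        // load one element per thread
    __syncthreads();

    for (int s = blockDim.x / 2; s > 0; s >>= 1) {   // tree reduction in shared memory
        if (tid < s)
            sdata[tid] += sdata[tid + s];
        __syncthreads();
    }

    if (tid == 0)
        out[blockIdx.x] = sdata[0];             // one partial sum per block
}
```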
GPGPU Techniques (cont.) • Data Structures • Iteration • Dense structures are supported straightforwardly • Sparse arrays, adaptive arrays, and grid-of-list structures require more complex iteration constructs [BFGS03, KW03, LKHW04] • Generalized Arrays via Address Translation • An address translator converts between a 1D array and a 2D texture [LKO05, PBMH02] • Optimization techniques pre-compute these address translation operations before the fragment processor [BFGS03, CHL04, KW03, LKHW04] • Differential Equations, Linear Algebra, Data Queries …
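The address-translation idea can be illustrated with the usual index arithmetic for storing a large 1D array as a 2D texture of some width texWidth (names are illustrative, not taken from the cited papers):

```cuda
// 1D <-> 2D address translation for array-as-texture storage.
__host__ __device__ inline void addr_1d_to_2d(int idx, int texWidth, int *x, int *y)
{
    *x = idx % texWidth;            // column
    *y = idx / texWidth;            // row
}

__host__ __device__ inline int addr_2d_to_1d(int x, int y, int texWidth)
{
    return y * texWidth + x;
}
```

Pre-computing these translations before the fragment stage, as the cited techniques do, avoids repeating the arithmetic for every fragment.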
GPGPU Applications • Physically Based Simulation • Signal and Image Processing • Computer Vision • Image Processing • Signal Processing • Tone Mapping • Audio • Image / Video Processing • Global Illumination • Ray tracing, photon mapping, radiosity, subsurface scattering … • Geometric Computing • Databases and Data Mining
Conclusion • Highly parallel nature • But currently only data-parallel general-purpose computation is practical • Many applications can be mapped onto the GPU • But double precision, scatter, and efficient branching are not supported • Programming is done through the graphics API • But it is hard to understand and use • What we look forward to • More programmable and flexible hardware • A higher-level programming model
The End Thank you