FMM Algorithm Analysis for Innovative Computer Architecture Design • Lü Chao • Shanghai Jiao Tong University • School of Software • 2010
Outline • Background • Previous Work • Introduction to the N-body Problem • FMM Algorithm Analysis • Configuration Strategies for FMM Optimization • Conclusion
Background • Project source: Research and Development of a New-Concept High-Efficiency Computer Architecture and System • Key project of the National 863 Program (2009AA012201) • Major science and technology project of the Shanghai Science and Technology Commission (08dz501600) • Topic: application analysis and front-end design for the new architecture • Preliminary application analysis • Front-end design of the architecture • Compiler / software platform design • Application optimization • Main goal: design a reconfigurable special-purpose processor architecture for high-performance computing
Previous Work • Analysis of high-performance computing applications • Image reconstruction for CT and MRI • Local image feature extraction and matching based on the SURF algorithm • Application simulation and optimization • SURF optimization and analysis with multi-core CPU parallelism • GPGPU implementation of SURF (CUDA-SURF) • SURF optimization on heterogeneous CPU/GPU platforms
Introduction to the N-body Problem • Purpose • Analyze it as a representative application for the target architecture • Derive architecture configuration strategies tuned to the application • The N-body problem • Also called the many-body problem; one of the fundamental problems in astrophysics, fluid dynamics, and molecular dynamics • Simulates the motion of mutually interacting particles in a system • A typical high-performance computing application • Mathematical form: a system of ordinary differential equations with known initial values
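Written out for the gravitational case, the system of ordinary differential equations mentioned above takes the following standard form (not taken from the slides; G is the gravitational constant, m_j and r_j the mass and position of body j):

```latex
\ddot{\mathbf{r}}_i \;=\; \sum_{\substack{j=1 \\ j \neq i}}^{N}
  \frac{G\, m_j\, (\mathbf{r}_j - \mathbf{r}_i)}{\lVert \mathbf{r}_j - \mathbf{r}_i \rVert^{3}},
\qquad
\mathbf{r}_i(0) = \mathbf{r}_i^{0}, \quad
\dot{\mathbf{r}}_i(0) = \mathbf{v}_i^{0}, \qquad i = 1, \dots, N
```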
Introduction to the N-body Problem (cont.) • Common algorithms • PP (Particle-Particle) • Evaluates the pairwise interaction formula directly • Time complexity O(N²) • PM (Particle-Mesh) • Uses a particle mesh, treating the effect of many points as a whole (computing the potential on the mesh) • Time complexity O(N log N) • TM (Tree Method) • Approximates groups of distant particles using a hierarchical tree (Barnes-Hut style) • Time complexity O(N log N)
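As a concrete illustration of the O(N²) PP approach, here is a minimal CUDA sketch (kernel name, data layout, and softening constant are assumptions for illustration, not part of the original slides): each thread accumulates the acceleration on one body by looping over all others.

```cuda
// Minimal sketch of the PP (particle-particle) algorithm: O(N^2) pairwise
// interactions, one thread per body. Names and constants are illustrative.
__global__ void pp_forces(const float4 *pos,  // xyz = position, w = mass
                          float3 *acc,
                          int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;

    const float soft2 = 1e-9f;                 // softening avoids division by zero
    float4 pi = pos[i];
    float3 ai = make_float3(0.0f, 0.0f, 0.0f);

    for (int j = 0; j < n; ++j) {              // loop over all other bodies
        float4 pj = pos[j];
        float3 r = make_float3(pj.x - pi.x, pj.y - pi.y, pj.z - pi.z);
        float d2 = r.x * r.x + r.y * r.y + r.z * r.z + soft2;
        float invD = rsqrtf(d2);
        float s = pj.w * invD * invD * invD;   // m_j / |r|^3
        ai.x += r.x * s;  ai.y += r.y * s;  ai.z += r.z * s;
    }
    acc[i] = ai;                               // acceleration on body i
}
```

The tree and FMM methods reduce this cost by replacing the inner loop over distant bodies with interactions against aggregated cells.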
Overview of GPU Architecture • Graphics Pipeline / Programmable Hardware / Unified Shading Model / NVIDIA GeForce 8800 GTX
Graphics Pipeline • The Vertex/Geometry Stage • transforms each vertex from object space into screen space • assembles the vertices into triangles • traditionally performs lighting calculations on each vertex • The Rasterization Stage • determines the screen positions covered by each triangle • interpolates per-vertex parameters across the triangle • The Fragment/Pixel Stage • computes the color for each fragment • The Composition/Display Stage • assembles fragments into the final image of pixels
Programmable Hardware • In the programmable graphics pipeline • User-defined vertex program • User-defined fragment program • Limitations • Simple, incomplete instruction sets • Fragment program data types are mostly fixed-point • Limited number of instructions and a small number of registers • Limited number of inputs and outputs • No conditional branching
Unified Shader Model • A unified shader model must • Support at least 65k static instructions and unlimited dynamic instructions • Support both 32-bit integers and 32-bit floating-point numbers • Allow an arbitrary number of both direct and indirect reads from global memory (texture) • Support dynamic flow control in the form of loops and branches • Current GPUs support the unified Shader Model 4.0 on both vertex and fragment shaders
Unified Shading Architecture NVIDIA GeForce 8800 GTX Architecture • Green grid – Streaming Multiprocessor • Grid of purple blocks – Thread Processors • 16 streaming multiprocessors of 8 thread processors each (128 in total).
Unified Shading Architecture (cont.) NVIDIA GeForce 8800 GTX – Thread Processor • Each processing cluster contains a pair of streaming multiprocessors • Each streaming multiprocessor contains shared instruction and data caches, control logic, a 16 KB shared memory, eight stream processors, and two special function units.
How to Program GPGPU • GPU Programming Model / GPU Programming Flow Control / GPGPU Techniques / GPGPU Applications
GPU Programming Model • The GPU programming model combines • graphics API terminology • the stream programming model • A typical GPGPU program using the fragment processor is structured as follows • Segment the general-purpose program into independent parallel sections (kernels) • Specify the range of computation / the size of the output stream to invoke a kernel • Use the rasterizer to generate a fragment for every pixel location in the quad • Each generated fragment is then processed by the active kernel fragment program • The output of the fragment program is a value (or vector of values) per fragment
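Since the earlier work in this project also targets CUDA (CUDA-SURF), the same kernel/stream structure can be expressed directly in CUDA rather than through fragment programs. The sketch below is illustrative only (function names are assumptions): the launch grid plays the role of the output-stream size, and each thread corresponds to one generated fragment.

```cuda
// Illustrative CUDA analogue of the fragment-program pattern described above.
__global__ void map_kernel(const float *in, float *out, int n)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;  // one "fragment" per thread
    if (idx < n)
        out[idx] = in[idx] * in[idx];                 // the per-element kernel body
}

// Host side: choosing the launch configuration corresponds to specifying
// the range of computation / the size of the output stream.
void run_map(const float *d_in, float *d_out, int n)
{
    int block = 256;
    int grid  = (n + block - 1) / block;
    map_kernel<<<grid, block>>>(d_in, d_out, n);
}
```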
GPU Programming Flow Control • Three basic implementations of data-parallel branching • Predication • Both sides of the branch are evaluated • Multiple Instruction Multiple Data (MIMD) branching • Different processors follow different paths • Single Instruction Multiple Data (SIMD) branching • If the condition is identical for all pixels in the group, only the taken side of the branch must be evaluated • If one or more of the processors evaluates the branch condition differently, both sides must be evaluated and the results predicated • It is better to move branching up the pipeline
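The SIMD case can be made concrete with a small CUDA sketch (illustrative names; the warp here plays the role of the pixel group): when the condition is uniform across a warp only one side executes, otherwise both sides execute with inactive threads masked off.

```cuda
// Sketch of SIMD branch divergence within a warp (the "group" above).
__global__ void branch_demo(const float *x, float *y, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;

    if (x[i] > 0.0f)           // condition uniform in the warp: only this side runs
        y[i] = sqrtf(x[i]);
    else                       // condition divergent: both sides run serially,
        y[i] = 0.0f;           // inactive threads' results are predicated away
}
```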
GPGPU Techniques • Stream Operations: • Map and Reduce – Straightforward [BFH∗04b] BUCK I., FOLEY T., HORN D., SUGERMAN J., FATAHALIAN K., HOUSTON M., HANRAHAN P.: Brook for GPUs: Stream computing on graphics hardware. ACM Transactions on Graphics 23, 3 (Aug. 2004), 777–786. • Scatter and Gather – Avoid Scatter [Buc05b] BUCK I.: Taking the plunge into GPU computing. In GPU Gems 2, Pharr M., (Ed.). Addison Wesley, Mar. 2005, ch. 32, pp. 509–519. • Scan – All-prefix-sums operation [HS86] [Ble90] [Hor05] [HSC*05] [SLO06, GGK06] • Filtering – Using a combination of scan and search, O(log n) achieved. [Hor05] HORN D.: Stream reduction operations for GPGPU applications. In GPU Gems 2, Pharr M., (Ed.). Addison Wesley, Mar. 2005, ch. 36, pp. 573–589. • Sort – Based on sorting networks, such as parallel bitonic merge sort [BP04, CND03, GZ06, KSW04, KW05a, PDC∗03, Pur04] • Search – Binary search [Hor05, PDC∗03, Pur04] / Nearest neighbor search [Ben75, FS05, PDC∗03, Pur04]
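As an example of the reduce operation listed above, a common CUDA formulation uses a tree-shaped reduction in shared memory. The sketch below is a minimal version (kernel name and launch details are assumptions; the block size is assumed to be a power of two, and the per-block partial sums still need a second pass or an atomic combine).

```cuda
// Minimal block-wise sum reduction: each block reduces its tile into one value.
__global__ void reduce_sum(const float *in, float *out, int n)
{
    extern __shared__ float sdata[];            // size = blockDim.x floats
    int tid = threadIdx.x;
    int i   = blockIdx.x * blockDim.x + threadIdx.x;

    sdata[tid] = (i < n) ? in[i] : 0.0f;        // load one element per thread
    __syncthreads();

    for (int s = blockDim.x / 2; s > 0; s >>= 1) {   // tree reduction in shared memory
        if (tid < s)
            sdata[tid] += sdata[tid + s];
        __syncthreads();
    }

    if (tid == 0)
        out[blockIdx.x] = sdata[0];             // one partial sum per block
}
```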
GPGPU Techniques (cont.) • Data Structures • Iteration • Dense structures are supported straightforwardly • Sparse arrays, adaptive arrays, and grid-of-list structures require more complex iteration constructs [BFGS03, KW03, LKHW04] • Generalized Arrays via Address Translation • An address translator converts between a 1D array and a 2D texture [LKO05, PBMH02] • Optimization techniques pre-compute these address translation operations before the fragment processor [BFGS03, CHL04, KW03, LKHW04] • Differential Equations, Linear Algebra, Data Queries …
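The address-translation idea can be illustrated with the usual index arithmetic for storing a large 1D array as a 2D texture of some width texWidth (names are illustrative, not taken from the cited papers):

```cuda
// 1D <-> 2D address translation for array-as-texture storage.
__host__ __device__ inline void addr_1d_to_2d(int idx, int texWidth, int *x, int *y)
{
    *x = idx % texWidth;            // column
    *y = idx / texWidth;            // row
}

__host__ __device__ inline int addr_2d_to_1d(int x, int y, int texWidth)
{
    return y * texWidth + x;
}
```

Pre-computing these translations before the fragment stage, as the cited techniques do, avoids repeating the arithmetic for every fragment.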
GPGPU Applications • Physically Based Simulation • Signal and Image Processing • Computer Vision • Image Processing • Signal Processing • Tone Mapping • Audio • Image / Video Processing • Global Illumination • Ray tracing, photon mapping, radiosity, subsurface scattering … • Geometric Computing • Databases and Data Mining
Conclusion • Highly parallel nature • But currently only data-parallel general-purpose computation is practical • Many applications can be mapped onto the GPU • But double precision, scatter, and efficient branching are not supported • Programming is done through the graphics API • But it is hard to understand and use • What we look forward to • More programmable and flexible hardware • A higher-level programming model
The End Thank you