Brook for GPUs

Ian Buck, Tim Foley, Daniel Horn, Jeremy Sugerman, Pat Hanrahan
February 10th, 2003

Brook: general purpose streaming language
[Figure: block diagram with labels Stream Execution Unit, Scalar Execution Unit, Stream Register File, Memory System, Network, DRDRAM, Network Interface]
• developed for PCA Program/Merrimac
• compiler: RStream
• Reservoir Labs
• DARPA PCA Program
• Stanford: SmartMemories
• UT Austin: TRIPS
• MIT: RAW
• Brook version 0.2 spec: http://merrimac.stanford.edu
• Brook for GPUs: http://brook.sourceforge.net

Brook: general purpose streaming language
• stream programming model
• enforce data parallel computing
• streams
• encourage arithmetic intensity
• kernels
• C with streams

Brook for GPUs
• demonstrate GPU streaming coprocessor
• make programming GPUs easier
• hide texture/pbuffer data management
• hide graphics based constructs in CG/HLSL
• hide rendering passes
• virtualize resources
• performance!
• … on applications that matter
• highlight GPU areas for improvement
• features required for general purpose stream computing

system outline
• .br: Brook source files
• brcc: source to source compiler
• brt: Brook run-time library

Brook language: streams
• streams
• collection of records requiring similar computation
• particle positions, voxels, FEM cell, …

    float3 positions<200>;
    float3 velocityfield<100,100,100>;

• encourage data parallelism

Brook language: kernels
• kernels
• functions applied to streams
• similar to for_all construct

    kernel void foo (float a<>, float b<>, out float result<>) {
        result = a + b;
    }

    float a<100>;
    float b<100>;
    float c<100>;
    foo(a,b,c);

equivalent C loop:

    for (i=0; i<100; i++)
        c[i] = a[i]+b[i];

• no dependencies between stream elements
• encourage high arithmetic intensity
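To make the kernel semantics above concrete, here is a minimal sketch in plain Python. It is not Brook and says nothing about how brcc or the runtime execute a kernel; it only emulates the "apply a function to every stream element" behaviour. The helper name `run_kernel` is hypothetical.

```python
def run_kernel(kernel, *input_streams):
    """Apply `kernel` to corresponding elements of equal-length streams.

    Hypothetical helper for illustration only. Each output element depends
    solely on the input elements at the same position, so the iterations
    could run in any order or in parallel.
    """
    lengths = {len(s) for s in input_streams}
    if len(lengths) != 1:
        raise ValueError("all input streams must have the same length")
    return [kernel(*elems) for elems in zip(*input_streams)]


def foo(a, b):
    # Body of the kernel from the slide: result = a + b
    return a + b


# Stand-ins for float a<100>, b<100>, c<100> with made-up contents.
a = [float(i) for i in range(100)]
b = [2.0 * i for i in range(100)]
c = run_kernel(foo, a, b)
```

Because `foo` never reads a neighbouring element, the list comprehension could be evaluated in any order, which is the property the slide's "no dependencies between stream elements" bullet refers to.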

Brook language: kernels
• Ray Triangle Intersection

    kernel void krnIntersectTriangle(Ray ray<>, Triangle tris[],
                                     RayState oldraystate<>,
                                     GridTrilist trilist[],
                                     out Hit candidatehit<>) {
        float idx, det, inv_det;
        float3 edge1, edge2, pvec, tvec, qvec;
        if(oldraystate.state.y > 0) {
            idx = trilist[oldraystate.state.w].trinum;
            edge1 = tris[idx].v1 - tris[idx].v0;
            edge2 = tris[idx].v2 - tris[idx].v0;
            pvec = cross(ray.d, edge2);
            det = dot(edge1, pvec);
            inv_det = 1.0f/det;
            tvec = ray.o - tris[idx].v0;
            candidatehit.data.y = dot( tvec, pvec ) * inv_det;
            qvec = cross( tvec, edge1 );
            candidatehit.data.z = dot( ray.d, qvec ) * inv_det;
            candidatehit.data.x = dot( edge2, qvec ) * inv_det;
            candidatehit.data.w = idx;
        } else {
            candidatehit.data = float4(0,0,0,-1);
        }
    }
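The arithmetic inside the kernel can be checked on the CPU with a scalar version. The sketch below is plain Python, mirrors only the `if` branch of the kernel, and uses hypothetical helper names (`sub`, `dot`, `cross`, `intersect_triangle`). Reading the three results as ray distance and two barycentric coordinates, in the style of the Möller-Trumbore test, is our interpretation rather than something the slide states.

```python
def sub(u, v):
    return (u[0] - v[0], u[1] - v[1], u[2] - v[2])


def dot(u, v):
    return u[0] * v[0] + u[1] * v[1] + u[2] * v[2]


def cross(u, v):
    return (u[1] * v[2] - u[2] * v[1],
            u[2] * v[0] - u[0] * v[2],
            u[0] * v[1] - u[1] * v[0])


def intersect_triangle(ray_o, ray_d, v0, v1, v2):
    """Return (x, y, z) as written to candidatehit.data.x/.y/.z in the kernel."""
    edge1 = sub(v1, v0)
    edge2 = sub(v2, v0)
    pvec = cross(ray_d, edge2)
    det = dot(edge1, pvec)
    inv_det = 1.0 / det  # like the kernel, no guard against det == 0
    tvec = sub(ray_o, v0)
    y = dot(tvec, pvec) * inv_det
    qvec = cross(tvec, edge1)
    z = dot(ray_d, qvec) * inv_det
    x = dot(edge2, qvec) * inv_det
    return (x, y, z)


# Made-up test case: a ray pointing straight down at a unit right triangle.
hit = intersect_triangle((0.25, 0.25, 1.0), (0.0, 0.0, -1.0),
                         (0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0))
```

For this input the ray starts one unit above the triangle's plane and lands at (0.25, 0.25, 0), so `hit` is `(1.0, 0.25, 0.25)`.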

Brook language: additional features
• reductions
• scalar
• stream
• stride & repeat
• GatherOp & ScatterOp
• a[i] += p
• p = a[i]++
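The reduction, `a[i] += p` and `p = a[i]++` forms listed above can be emulated sequentially. The following plain-Python sketch assumes the simplest possible semantics (elements processed one after another, in order); the function names are hypothetical, and the ordering guarantees of the actual Brook operators are not something the slide specifies.

```python
def reduce_sum(stream):
    """Scalar reduction: collapse a stream to a single value."""
    total = 0.0
    for x in stream:
        total += x
    return total


def scatter_add(a, indices, values):
    """Scatter with accumulation: a[i] += p for each (i, p) pair."""
    for i, p in zip(indices, values):
        a[i] += p


def gather_increment(a, indices):
    """Gather with update: p = a[i]++ for each index i."""
    fetched = []
    for i in indices:
        fetched.append(a[i])  # read the old value first
        a[i] += 1             # then increment in place
    return fetched


total = reduce_sum([1.0, 2.0, 3.0])

bins = [0.0, 0.0, 0.0, 0.0]
scatter_add(bins, [0, 2, 2, 3], [1.0, 2.0, 3.0, 4.0])

counters = [0, 0, 0]
fetched = gather_increment(counters, [1, 1, 2])
```

Two writes target `bins[2]`, so it ends up holding their sum; likewise the second fetch from `counters[1]` sees the value left by the first.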

brcc compiler: infrastructure
• based on ctool
• http://ctool.sourceforge.net
• parser
• build code tree
• extend C grammar to accept Brook
• convert
• tree transformations
• codegen
• generate cg & hlsl code
• call cgc, fxc
• generate stub function

Applications
• Ray-tracer
• FFT
• Segmentation
• Linear Algebra: BLAS, LINPACK, LAPACK

Brook Performance

GPU Gotchas
NVIDIA NV3x: Register usage vs. Time
[Figure: plot of Time against Registers Used]

GPU Gotchas
NVIDIA:
• Register Penalty
• Render to Texture Limitation
• Requires explicit copy or heavy pbuffer solution
• Superbuffer extension needed
http://mirror.ati.com/developer/SIGGRAPH03/Percy_OpenGL_Extensions SIG03.pdf

GPU Gotchas
ATI Radeon 9800 Pro
• Limited dependent texture lookup
• 96 instructions
• 24-bit floating point
• s16e7: Integers up to 131,072 (s23e8: 16,777,216)
[Figure: four panels labeled Memory Refs 1, 2, 3 and 4, each with Math Ops]
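The integer limits quoted above follow from the mantissa width. The sketch below assumes that "s16e7" means 1 sign bit, 16 stored mantissa bits and 7 exponent bits with an implicit leading one (and likewise "s23e8" for standard 32-bit floats); that reading is an assumption, though it reproduces both numbers on the slide. The function names are hypothetical.

```python
import struct


def max_exact_integer(mantissa_bits):
    """Largest N such that every integer in 0..N is exactly representable.

    Assumes an implicit leading one, giving mantissa_bits + 1 significant bits.
    """
    return 2 ** (mantissa_bits + 1)


def to_float32(x):
    """Round a Python float to 32-bit precision and back (stdlib only)."""
    return struct.unpack("f", struct.pack("f", x))[0]


limit_24bit = max_exact_integer(16)  # s16e7
limit_32bit = max_exact_integer(23)  # s23e8

# Cross-check the 32-bit case: the limit survives rounding, limit + 1 does not.
limit_survives = to_float32(float(limit_32bit)) == limit_32bit
next_is_lost = to_float32(float(limit_32bit + 1)) != limit_32bit + 1
```

This matters for stream code because array indices are carried in floats: with a 24-bit format, an index above 131,072 may silently round to a neighbouring element.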

GPU Catch-Up!
• Integer & Bit Ops & Double Precision
• Memory Addressing
• CGC/FXC Performance
• Hand code performance critical code
• No native reduction support
• No native scatter support
• p[i] = a (indirect write)
• No programmable blend
• GatherOp / ScatterOp
• Limited 4x4 output
• Brook virtualized kernel outputs
• Readback still slow
• NV35 OpenGL: 600 MB/sec Download, 170 MB/sec Readback
• ATI DirectX: 550 MB/sec Download, 50 MB/sec Readback
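To give the bandwidth figures above some scale, the following sketch estimates transfer times for one buffer. The buffer size (1024 x 1024 elements of four 32-bit floats) is a hypothetical example, not a measurement from the talk, and 1 MB is taken to mean 1,000,000 bytes.

```python
def transfer_seconds(num_bytes, megabytes_per_second):
    """Time to move num_bytes at the given rate, assuming 1 MB = 10**6 bytes."""
    return num_bytes / (megabytes_per_second * 1_000_000)


# Hypothetical buffer: 1024 x 1024 elements, four 32-bit floats each.
width, height = 1024, 1024
bytes_per_element = 4 * 4
buffer_bytes = width * height * bytes_per_element

# Rates are the ones quoted on the slide.
nv35_download = transfer_seconds(buffer_bytes, 600)
nv35_readback = transfer_seconds(buffer_bytes, 170)
ati_download = transfer_seconds(buffer_bytes, 550)
ati_readback = transfer_seconds(buffer_bytes, 50)
```

Under these assumptions, reading the buffer back on the ATI path takes about a third of a second, roughly eleven times longer than downloading it, which is why frequent readback is best avoided.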

GPUs of the future (we hope)
[Figure: block diagram with SDRAM banks, ALU Clusters and a Stream Register File]
• Complete Instruction Sets
• Integers, Bit Ops, Doubles, Mem Access
• Integration
• Streaming coprocessor not just a rendering device
• Streaming architectures

Brook for GPUs
• Release v0.3 available on Sourceforge
• Project Page: http://graphics.stanford.edu/projects/brook
• Source: http://www.sourceforge.net/projects/brook
• Over 4K downloads!
• Questions?
Fly-fishing fly images from The English Fly Fishing Shop
