Brook is a general-purpose streaming language for GPUs that enforces data parallel computing through streams and kernels, aiming to make GPU programming easier and more performance-efficient. It hides low-level graphics details, virtualizes resources, and optimizes arithmetic intensity. The language streamlines GPU coprocessor usage and supports powerful features like reductions and scatter operations. Learn more about its compiler, runtime library, and programming model.
Brook for GPUs
Ian Buck, Tim Foley, Daniel Horn, Jeremy Sugerman, Pat Hanrahan
February 10th, 2003
[Figure: architecture block diagram; labels: Stream Execution Unit, Scalar Execution Unit, Stream Register File, Memory System, Network, DRDRAM, Network Interface]
Brook: general purpose streaming language
• developed for PCA Program/Merrimac
• compiler: RStream
  • Reservoir Labs
• DARPA PCA Program
  • Stanford: SmartMemories
  • UT Austin: TRIPS
  • MIT: RAW
• Brook version 0.2 spec: http://merrimac.stanford.edu
• Brook for GPUs: http://brook.sourceforge.net
Brook: general purpose streaming language
• stream programming model
  • enforce data parallel computing
    • streams
  • encourage arithmetic intensity
    • kernels
• C with streams
Brook for GPUs
• demonstrate the GPU as a streaming coprocessor
• make programming GPUs easier
  • hide texture/pbuffer data management
  • hide graphics-based constructs in Cg/HLSL
  • hide rendering passes
  • virtualize resources
• performance!
  • … on applications that matter
• highlight GPU areas for improvement
  • features required for general purpose stream computing
system outline
• .br: Brook source files
• brcc: source to source compiler
• brt: Brook run-time library
Brook language: streams
• streams
  • collection of records requiring similar computation
  • particle positions, voxels, FEM cell, …
    float3 positions<200>;
    float3 velocityfield<100,100,100>;
• encourage data parallelism
Brook language: kernels
• kernels
  • functions applied to streams
  • similar to for_all construct
    kernel void foo (float a<>, float b<>, out float result<>) {
        result = a + b;
    }

    float a<100>;
    float b<100>;
    float c<100>;
    foo(a,b,c);

    /* equivalent C loop */
    for (i=0; i<100; i++)
        c[i] = a[i]+b[i];
• no dependencies between stream elements
• encourage high arithmetic intensity
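The kernel call on this slide can be read as an element-wise map over the input streams. The following is a minimal C sketch of that reading, assuming streams are plain float arrays. The names `foo_element` and `apply_kernel` are hypothetical and exist only for illustration; they are not part of Brook or its runtime.

```c
#include <stddef.h>

/* Per-element body of the kernel: it sees one element of each input stream. */
static float foo_element(float a, float b) {
    return a + b;
}

/* Hypothetical driver: applies the body to every element independently.
   No iteration reads another iteration's result, so the iterations could
   run in any order, or in parallel. */
static void apply_kernel(const float *a, const float *b, float *result, size_t n) {
    for (size_t i = 0; i < n; i++) {
        result[i] = foo_element(a[i], b[i]);
    }
}
```

Calling `apply_kernel(a, b, c, 100)` on three 100-element arrays mirrors `foo(a,b,c)` on the slide. The independence of iterations is what "no dependencies between stream elements" buys.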
Brook language: kernels
• Ray Triangle Intersection
    kernel void krnIntersectTriangle(Ray ray<>, Triangle tris[],
                                     RayState oldraystate<>,
                                     GridTrilist trilist[],
                                     out Hit candidatehit<>) {
        float idx, det, inv_det;
        float3 edge1, edge2, pvec, tvec, qvec;
        if(oldraystate.state.y > 0) {
            idx = trilist[oldraystate.state.w].trinum;
            edge1 = tris[idx].v1 - tris[idx].v0;
            edge2 = tris[idx].v2 - tris[idx].v0;
            pvec = cross(ray.d, edge2);
            det = dot(edge1, pvec);
            inv_det = 1.0f/det;
            tvec = ray.o - tris[idx].v0;
            candidatehit.data.y = dot( tvec, pvec ) * inv_det;
            qvec = cross( tvec, edge1 );
            candidatehit.data.z = dot( ray.d, qvec ) * inv_det;
            candidatehit.data.x = dot( edge2, qvec ) * inv_det;
            candidatehit.data.w = idx;
        } else {
            candidatehit.data = float4(0,0,0,-1);
        }
    }
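The arithmetic in this kernel can be checked on the CPU with a plain C sketch for a single ray and a single triangle. The `Vec3` type and the helper names are hypothetical. The parallel-ray guard and the final hit test are additions of this sketch, since the slide's kernel only stores the candidate values. Here `t` corresponds to `data.x` (distance along the ray), and `u`, `v` to `data.y`, `data.z` (barycentric coordinates).

```c
#include <math.h>

typedef struct { float x, y, z; } Vec3;

static Vec3 sub(Vec3 a, Vec3 b) {
    return (Vec3){a.x - b.x, a.y - b.y, a.z - b.z};
}

static Vec3 cross(Vec3 a, Vec3 b) {
    return (Vec3){a.y * b.z - a.z * b.y,
                  a.z * b.x - a.x * b.z,
                  a.x * b.y - a.y * b.x};
}

static float dot(Vec3 a, Vec3 b) {
    return a.x * b.x + a.y * b.y + a.z * b.z;
}

/* Same sequence of operations as the kernel body. Returns 1 on a hit
   and 0 otherwise; t, u, v are written whenever det is usable. */
static int intersect_triangle(Vec3 o, Vec3 d, Vec3 v0, Vec3 v1, Vec3 v2,
                              float *t, float *u, float *v) {
    Vec3 edge1 = sub(v1, v0);
    Vec3 edge2 = sub(v2, v0);
    Vec3 pvec = cross(d, edge2);
    float det = dot(edge1, pvec);
    if (fabsf(det) < 1e-8f) {
        return 0;  /* assumption: treat near-parallel rays as misses */
    }
    float inv_det = 1.0f / det;
    Vec3 tvec = sub(o, v0);
    *u = dot(tvec, pvec) * inv_det;
    Vec3 qvec = cross(tvec, edge1);
    *v = dot(d, qvec) * inv_det;
    *t = dot(edge2, qvec) * inv_det;
    /* Hit test added for this sketch: inside the triangle and in front of the ray. */
    return *u >= 0.0f && *v >= 0.0f && (*u + *v) <= 1.0f && *t >= 0.0f;
}
```

In the Brook version the same body runs once per ray in the stream, with `tris[]` and `trilist[]` read as gather arrays.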
Brook language: additional features
• reductions
  • scalar
  • stream
• stride & repeat
• GatherOp & ScatterOp
  • a[i] += p
  • p = a[i]++
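The semantics of these features can be sketched in sequential C. This assumes streams are float arrays and that the two expressions on the slide describe a ScatterOp (`a[i] += p`) and a GatherOp (`p = a[i]++`). The function names are hypothetical, indices are not bounds-checked, and stride & repeat are not covered.

```c
#include <stddef.h>

/* Scalar reduction: collapse a whole stream to one value (here, a sum). */
static float reduce_sum(const float *s, size_t n) {
    float acc = 0.0f;
    for (size_t i = 0; i < n; i++) {
        acc += s[i];
    }
    return acc;
}

/* Scatter with "+=": a[index[i]] += p[i]. Repeated indices accumulate. */
static void scatter_add(float *a, const size_t *index, const float *p, size_t n) {
    for (size_t i = 0; i < n; i++) {
        a[index[i]] += p[i];
    }
}

/* Gather with "++": p[i] = a[index[i]]++, i.e. fetch, then increment in place. */
static void gather_increment(float *a, const size_t *index, float *p, size_t n) {
    for (size_t i = 0; i < n; i++) {
        p[i] = a[index[i]]++;
    }
}
```

Unlike a plain kernel, these operations let several stream elements touch the same array location, which is why they need a defined combining operator.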
brcc compiler: infrastructure
• based on ctool
  • http://ctool.sourceforge.net
• parser
  • build code tree
  • extend C grammar to accept Brook
• convert
  • tree transformations
• codegen
  • generate cg & hlsl code
  • call cgc, fxc
  • generate stub function
Applications
• Ray-tracer
• FFT
• Segmentation
• Linear Algebra: BLAS, LINPACK, LAPACK
GPU Gotchas
NVIDIA NV3x: Register usage vs. Time
[Figure: plot with axes labeled Time and Registers Used]
GPU Gotchas
NVIDIA:
• Register Penalty
• Render to Texture Limitation
  • Requires explicit copy or heavy pbuffer solution
  • Superbuffer extension needed
    http://mirror.ati.com/developer/SIGGRAPH03/Percy_OpenGL_Extensions SIG03.pdf
GPU Gotchas
ATI Radeon 9800 Pro
• Limited dependent texture lookup
• 96 instructions
• 24-bit floating point
  • s16e7: integers up to 131,072 (s23e8: 16,777,216)
[Figure: four alternating levels of memory references and math ops, labeled Memory Refs 1 through 4]
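The integer limits quoted for the two formats follow from the mantissa width: with m explicit mantissa bits plus one implicit leading bit, every integer up to 2^(m+1) is exactly representable. A small C sketch of that arithmetic follows, with hypothetical function names. The second function assumes the host `float` is IEEE 754 single precision (s23e8).

```c
/* Largest N such that every integer in [0, N] is exactly representable,
   given the number of explicit mantissa bits (one implicit bit is added). */
static long max_contiguous_int(int mantissa_bits) {
    return 1L << (mantissa_bits + 1);
}

/* Shows the s23e8 limit on the host: 2^24 + 1 cannot be stored in a
   single-precision float and rounds back to 2^24. */
static int float_rounds_past_2_24(void) {
    volatile float f = 16777217.0f;
    return f == 16777216.0f;
}
```

`max_contiguous_int(16)` gives 131,072 and `max_contiguous_int(23)` gives 16,777,216, matching the slide. The practical consequence is that a 24-bit float cannot index arrays with more than 131,072 elements without skipping addresses.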
GPU Catch-Up!
• Integer & Bit Ops & Double Precision
• Memory Addressing
• CGC/FXC Performance
  • Hand code performance critical code
• No native reduction support
• No native scatter support
  • p[i] = a (indirect write)
• No programmable blend
  • GatherOp / ScatterOp
• Limited 4x4 output
  • Brook virtualized kernel outputs
• Readback still slow
  • NV35 OpenGL: 600 MB/sec Download, 170 MB/sec Readback
  • ATI DirectX: 550 MB/sec Download, 50 MB/sec Readback
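To get a feel for the readback figures, the transfer time for a given stream size can be estimated directly from the quoted rates. This is a back-of-the-envelope sketch, assuming 1 MB means 10^6 bytes (the slide does not say) and ignoring per-transfer latency. The function name and the example stream size are hypothetical.

```c
/* Seconds needed to move `bytes` at a sustained rate of `mb_per_sec`,
   taking 1 MB = 1e6 bytes (an assumption of this sketch). */
static double transfer_seconds(double bytes, double mb_per_sec) {
    return bytes / (mb_per_sec * 1.0e6);
}
```

For example, a 1024 x 1024 stream of float4 values occupies 16,777,216 bytes. At the ATI DirectX readback rate of 50 MB/sec, `transfer_seconds(16777216.0, 50.0)` comes to roughly a third of a second, which is why readback dominates any application that moves results back to the CPU every pass.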
[Figure: block diagram with Stream Register File, ALU Clusters, and SDRAM]
GPUs of the future (we hope)
• Complete Instruction Sets
  • Integers, Bit Ops, Doubles, Mem Access
• Integration
  • Streaming coprocessor not just a rendering device
• Streaming architectures
Brook for GPUs
• Release v0.3 available on Sourceforge
• Project Page
  • http://graphics.stanford.edu/projects/brook
• Source
  • http://www.sourceforge.net/projects/brook
• Over 4K downloads!
• Questions?
Fly-fishing fly images from The English Fly Fishing Shop