An Evaluation of Graphics Processors as Stream Co-Processors
Half-baked paper
Francois Labonte, Ian Buck, Mark Horowitz, Christos Kozyrakis
Graphics chips are fast, and they’re now programmable
[Figure: Traditional Pipeline vs. Programmable Pipeline. Stage labels: Application, Command, Geometry, Vertex Program, Rasterization, Texture, Fragment Program, Fragment, Display]
GPU ISA
• Short vectors of 4 words
  • Words can be bytes, 16-bit floating point, or 32-bit floating point
• Vector instructions
  • ADD, SUB, MUL, MAX, MIN, SGE, SLT, MAD, CMP
  • DP3, DP4
• Scalar instructions
  • RCP, RSQ, LG2, COS, SIN
• Texture instructions (memory reads)
  • Can also be dependent fetches
• Swizzle: allows the position of words in an operand vector to be switched
• Word mask: write enable on each output word
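To make the swizzle and word-mask bullets concrete, here is a small Python model of 4-word registers. It is only a sketch: the function names (`swizzle`, `masked_write`, `mad`, `dp3`), the `xyzw` component letters, and the choice to replicate the dot product across all four words are illustrative assumptions, not the assembly syntax of either chip.

```python
# Toy model of 4-word GPU registers. All names are illustrative assumptions,
# not the instruction syntax of any real GPU.
COMPONENTS = "xyzw"

def swizzle(vec, pattern):
    """Reorder or replicate the words of a 4-vector, e.g. 'wzyx' or 'xxxx'."""
    return [vec[COMPONENTS.index(c)] for c in pattern]

def masked_write(dest, result, mask):
    """Write result into dest only for the components named in mask."""
    return [r if c in mask else d for c, d, r in zip(COMPONENTS, dest, result)]

def mad(a, b, c):
    """Vector multiply-add: a * b + c, component by component."""
    return [x * y + z for x, y, z in zip(a, b, c)]

def dp3(a, b):
    """3-component dot product, replicated to all four words (assumed)."""
    d = sum(x * y for x, y in zip(a[:3], b[:3]))
    return [d] * 4

a = [1.0, 2.0, 3.0, 4.0]
b = [0.5, 0.5, 0.5, 0.5]
r0 = [0.0, 0.0, 0.0, 0.0]

reversed_a = swizzle(a, "wzyx")            # words in reverse order
r0 = masked_write(r0, mad(a, b, a), "xy")  # only x and y are updated
dot = dp3(a, b)
```

In this model the mask leaves the `z` and `w` words of `r0` untouched, which is the "write enable on each output word" behaviour the slide describes.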
Two programmable parts
• Vertex programs are run on each vertex
  • Vertex programs can have conditionals
  • But no texture access
  • 4 parallel pipes in current generation
• Fragment programs are run on each fragment (pixel)
  • No conditionals
  • Many textures, dependent texture access
  • 8 parallel pipes in current generation
• We’ll concentrate on fragment programs from now on
Meanwhile at Stanford and other places
• Some people are obsessed with stream computing.
• Being among them, we have devised a Stream Virtual Machine (SVM) so that different architectures can be programmed using streams.
• The SVM abstracts the underlying architecture so that a compiler can produce stream code for it.
SVM concept
• The SVM is a co-processor model: a thread processor controls a stream co-processor, which has a special fast memory
• This is not far from a GPU system: the CPU is the thread processor, graphics memory is the Stream Register File, and the GPU is the stream processor
The Goal of this Paper
• Evaluate the performance of the GPU as an SVM
• Look at the mechanisms that are available
• Limitations of GPU programming
• Architectural issues with the current generation
• Look at both Nvidia’s and ATI’s best offerings
  • Nvidia GeForce FX 5900 Ultra (NV35)
  • ATI Radeon 9800 Pro (R350)
Bandwidth between host and GPU
• Host -> GPU: 350 MB/s
• GPU -> Host: 181 MB/s
• For reference, peak rates: AGP 2.0 (4x) 1066 MB/s, AGP 3.0 (8x) 2133 MB/s
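To give a feel for what these rates mean in practice, the sketch below computes how long a single texture transfer would take at the measured bandwidths and how the upload rate compares with the AGP 4x peak. The payload size is a hypothetical example, not a figure from the slides, and decimal megabytes are assumed.

```python
MB = 1e6  # the slide quotes MB/s; decimal megabytes are assumed here

# Measured and peak rates quoted on the slide, in bytes per second.
MEASURED = {"host_to_gpu": 350 * MB, "gpu_to_host": 181 * MB}
AGP_PEAK = {"AGP 2.0 (4x)": 1066 * MB, "AGP 3.0 (8x)": 2133 * MB}

def transfer_seconds(num_bytes, bytes_per_second):
    """Time to move num_bytes at a sustained rate, ignoring setup latency."""
    return num_bytes / bytes_per_second

# Hypothetical payload: 1024 x 1024 texels, 4 words per texel, 32-bit floats.
payload = 1024 * 1024 * 4 * 4

upload_ms = 1e3 * transfer_seconds(payload, MEASURED["host_to_gpu"])
readback_ms = 1e3 * transfer_seconds(payload, MEASURED["gpu_to_host"])
fraction_of_4x = MEASURED["host_to_gpu"] / AGP_PEAK["AGP 2.0 (4x)"]

print(f"upload {upload_ms:.1f} ms, readback {readback_ms:.1f} ms")
print(f"upload rate is {fraction_of_4x:.0%} of the AGP 4x peak")
```

Under these assumptions, a 16 MB texture takes roughly 48 ms to upload and 93 ms to read back, so round trips to the host are expensive relative to on-card work.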
Strided memory access
[Figure labels: 1D view of memory, Fragments, Stride]
• Study memory accesses to 2D textures: the most common access pattern when the output of one fragment program is used as input to the next pass.
• We are mapping a 1D memory space into 2D
• There are 2 ways to do it: row major and column major
• We perform strided memory accesses and vary the stride
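The row-major and column-major mappings mentioned on the strided access slide can be sketched as follows. The helper names (`to_2d`, `strided_coords`) and the wrap-around at the end of the texture are assumptions made for illustration; the slides do not say how the benchmark handles the boundary.

```python
def to_2d(index, width, height, layout="row"):
    """Map a 1D element index to (x, y) texel coordinates."""
    if layout == "row":
        return index % width, index // width
    if layout == "column":
        return index // height, index % height
    raise ValueError("layout must be 'row' or 'column'")

def strided_coords(count, stride, width, height, layout="row"):
    """Texel coordinates touched by `count` accesses at a fixed 1D stride.

    Indices wrap around at the end of the texture (an assumption).
    """
    total = width * height
    return [to_2d((i * stride) % total, width, height, layout)
            for i in range(count)]

# Stride 1 in row-major order walks along a row; a stride equal to the
# width walks down a column instead.
along_row = strided_coords(4, 1, 8, 8, "row")
down_column = strided_coords(4, 8, 8, 8, "row")
```

The same 1D stride therefore produces very different 2D access patterns depending on the layout, which is why the stride is varied for both mappings.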
Strided Bandwidth
• ALU-limited up to 3 textures; the maximum is 13 GB/s
Random memory access
• Experiment setup:
  • Read a randomly initialized texture
  • Use the values read to index multiple other textures (which are thus accessed randomly)
• Goal: determine the size of the cache
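A CPU-side sketch of this experiment setup is shown below: an index texture holds random coordinates, and each fragment uses the coordinates it reads to fetch from several data textures (a dependent fetch). The function names and the use of a sum to combine the fetched values are assumptions for illustration; on real hardware one would time this kernel while sweeping the texture size to see where bandwidth falls off.

```python
import random

def make_index_texture(size, seed=0):
    """A size x size texture whose texels are random (u, v) coordinates."""
    rng = random.Random(seed)
    return [[(rng.randrange(size), rng.randrange(size)) for _ in range(size)]
            for _ in range(size)]

def dependent_fetch(index_tex, data_textures):
    """For each fragment, read (u, v) from the index texture, then fetch
    that texel from every data texture and sum the values."""
    size = len(index_tex)
    out = [[0.0] * size for _ in range(size)]
    for y in range(size):
        for x in range(size):
            u, v = index_tex[y][x]          # first read: the random index
            out[y][x] = sum(tex[v][u] for tex in data_textures)  # dependent reads
    return out

size = 4
index_tex = make_index_texture(size)
ones = [[1.0] * size for _ in range(size)]
twos = [[2.0] * size for _ in range(size)]
result = dependent_fetch(index_tex, [ones, twos])
```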
Random Memory Access Bandwidth
• Small texture cache size (8 x 8 x 4 x 4 = 1 kB)
• Need more data points between 8 and 16
Floating Point Ops per second
• ATI is stable at 3 G inst/s: 380 MHz clock x 8 parallel pipelines => 3040 M inst/s
• Some instructions are implemented as multiple instructions (cos = 10)
• Nvidia rocks for all mul and add
Nvidia’s funky architecture
• NV35 is rumored to process 2x2 fragments at a time, so 4 times what you see up there, which has 3 madd units => 12 x 450 MHz = 5400 M madd/s
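The peak figures quoted for the two chips both follow from clock rate times units per clock. The short sketch below redoes that arithmetic; `peak_rate` is a hypothetical helper, and the NV35 unit count rests on the rumor cited on the slide, not on a confirmed specification.

```python
def peak_rate(clock_hz, units_per_clock):
    """Peak operations per second, assuming every unit issues each cycle."""
    return clock_hz * units_per_clock

# ATI R350: 380 MHz clock, 8 parallel fragment pipelines (from the slides).
ati_peak = peak_rate(380e6, 8)

# Nvidia NV35: 450 MHz clock, rumored 2x2 fragments x 3 madd units each.
nv_peak = peak_rate(450e6, 2 * 2 * 3)

print(f"ATI peak:    {ati_peak / 1e9:.2f} G inst/s")
print(f"Nvidia peak: {nv_peak / 1e9:.2f} G madd/s")
```

Both results are in the low billions per second: 3040 million instructions per second for ATI and 5400 million madds per second for Nvidia.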
Dependence on number of live registers
• ATI is solid as a rock
• The Nvidia Ferrari quickly becomes a skateboard. From the previous diagram, my guess is that you lose the 2 final mul units if you use more than 3 registers, a drop of 1/3. But then it gets worse, much worse.
Reductions
• In a kernel, sometimes you want to do an operation that is commutative and associative, e.g.:
    for (i = 0; i < N; i++) {
        /* do some work, producing the value `work` */
        sum += work;
    }
• This is fully parallel: if we have N processors, we can compute N partial sums and add them at the end
• On the GPU we cannot carry state from one fragment to another, so we need multiple passes to combine the results in a tree fashion. This is a programming issue (the GPU doesn’t let us access
Drops at square textures? (need to check)
• Plot against the reduction we could do if state could be carried across fragments
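The multipass tree reduction described on the Reductions slide can be sketched on the CPU as follows. Each pass stands in for one rendering pass in which every output fragment combines two input elements without sharing state with its neighbours. The pairwise combining factor and the function names are assumptions; a real implementation might combine more elements per pass.

```python
import operator

def reduction_pass(values, op):
    """One pass: each output element combines two neighbouring inputs.
    No state is shared between output elements, as on the GPU."""
    out = []
    for i in range(0, len(values), 2):
        if i + 1 < len(values):
            out.append(op(values[i], values[i + 1]))
        else:
            out.append(values[i])  # odd element is carried to the next pass
    return out

def multipass_reduce(values, op):
    """Apply passes until one value remains; return (result, pass count)."""
    if not values:
        raise ValueError("cannot reduce an empty sequence")
    passes = 0
    while len(values) > 1:
        values = reduction_pass(values, op)
        passes += 1
    return values[0], passes

total, passes = multipass_reduce(list(range(16)), operator.add)
print(total, passes)
```

For N elements this takes about log2(N) passes, whereas a machine that could carry state from one fragment to the next could accumulate the result in a single pass.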