
An Evaluation of Graphics Processors as Stream Co-Processors


Presentation Transcript


  1. An Evaluation of Graphics Processors as Stream Co-Processors Half-baked paper Francois Labonte, Ian Buck, Mark Horowitz, Christos Kozyrakis

  2. Graphics chips are fast

  3. And they’re now programmable
  • Traditional pipeline: Application → Command → Geometry → Rasterization → Texture → Fragment → Display
  • Programmable pipeline: the Geometry stage becomes a Vertex Program and the Texture/Fragment stages become a Fragment Program, with the rest of the pipeline unchanged

  4. GPU ISA
  • Short vectors of 4 words; words can be bytes, 16-bit fp, or 32-bit fp
  • Vector instructions: ADD, SUB, MUL, MAX, MIN, SGE, SLT, MAD, CMP, DP3, DP4
  • Scalar instructions: RCP, RSQ, LG2, COS, SIN
  • Texture instructions (memory reads); these can also be dependent fetches
  • Swizzle – allows the positions of words in an operand vector to be switched
  • Word mask – write enable on each output word
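To make the swizzle and word-mask mechanics concrete, here is a minimal plain-C model of a 4-wide register. The vec4 type and the swizzle/mad helpers are illustrative stand-ins, not a real GPU API.

/* Plain-C model of the 4-wide vector semantics listed above: a swizzle
 * reorders the words of a source operand, and a write mask enables
 * writing only some output words. Illustrative only. */
#include <stdio.h>

typedef struct { float v[4]; } vec4;            /* one 4-word register */

static int lane(char c) {                       /* 'x','y','z','w' -> 0..3 */
    return c == 'x' ? 0 : c == 'y' ? 1 : c == 'z' ? 2 : 3;
}

/* Apply a swizzle like "wzyx" to a source operand. */
static vec4 swizzle(vec4 a, const char *s) {
    vec4 r;
    for (int i = 0; i < 4; i++) r.v[i] = a.v[lane(s[i])];
    return r;
}

/* MAD dst.mask, a, b, c : per-word a*b + c, written only to the
 * output words enabled by the mask (e.g. "xy"). */
static vec4 mad(vec4 dst, vec4 a, vec4 b, vec4 c, const char *mask) {
    for (const char *m = mask; *m; m++) {
        int i = lane(*m);
        dst.v[i] = a.v[i] * b.v[i] + c.v[i];
    }
    return dst;
}

int main(void) {
    vec4 a = {{1, 2, 3, 4}}, b = {{10, 20, 30, 40}}, c = {{0.5f, 0.5f, 0.5f, 0.5f}};
    vec4 d = {{0, 0, 0, 0}};
    /* Equivalent of: MAD d.xy, a.wzyx, b, c */
    d = mad(d, swizzle(a, "wzyx"), b, c, "xy");
    printf("%g %g %g %g\n", d.v[0], d.v[1], d.v[2], d.v[3]);  /* 40.5 60.5 0 0 */
    return 0;
}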

  5. Two programmable parts
  • Vertex programs run on each vertex
  • Vertex programs can have conditionals, but no texture access
  • 4 parallel pipes in the current generation
  • Fragment programs run on each fragment (pixel)
  • No conditionals; many textures, dependent texture access
  • 8 parallel pipes in the current generation
  • We’ll concentrate on fragment programs from now on

  6. Meanwhile at Stanford and other places
  • Some people are obsessed with stream computing.
  • Being among them, we have devised a Stream Virtual Machine (SVM) to program different architectures using streams.
  • The SVM abstracts the underlying architecture so that a compiler can produce stream code for it.

  7. SVM concept
  • The SVM is a co-processor model: a thread processor controls a stream co-processor, which has a special fast memory (the stream register file).
  • This is not a far stretch from GPUs: the CPU is the thread processor, graphics memory is the stream register file, and the GPU is the stream processor (sketched below).
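As a rough illustration of that mapping, the following runnable C sketch models the SVM split under our own naming assumptions: an array stands in for the stream register file (graphics memory), a function pointer stands in for a kernel, and main plays the thread processor. It is not the actual SVM API.

/* Minimal model of the SVM co-processor idea: the "thread processor"
 * (main) controls a "stream processor" (run_kernel) that applies a
 * kernel to every element of a stream held in the "stream register
 * file" (an array standing in for graphics memory). Illustrative only. */
#include <stdio.h>

#define SRF_WORDS 1024
static float srf[SRF_WORDS];                 /* stand-in for graphics memory */

typedef float (*kernel_fn)(float);           /* per-element kernel */

/* "Stream processor": apply the kernel to a stream living in the SRF. */
static void run_kernel(kernel_fn k, int in_off, int out_off, int n) {
    for (int i = 0; i < n; i++)
        srf[out_off + i] = k(srf[in_off + i]);
}

static float square(float x) { return x * x; }

int main(void) {
    int n = 8;
    for (int i = 0; i < n; i++) srf[i] = (float)i;      /* "load" stream into SRF */
    run_kernel(square, 0, n, n);                        /* one kernel pass on the co-processor */
    for (int i = 0; i < n; i++) printf("%g ", srf[n + i]);  /* "store" results back */
    printf("\n");
    return 0;
}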

  8. The Goal of this Paper
  • Evaluate the performance of the GPU as an SVM
  • Look at the mechanisms that are available
  • Limitations of GPU programming
  • Architectural issues with the current generation
  • Look at both NVIDIA’s and ATI’s best offerings: NVIDIA GeForce FX 5900 Ultra (NV35) and ATI Radeon 9800 Pro (R350)

  9. Bandwidth between host and GPU
  • Measured Host→GPU: 350 MB/s
  • Measured GPU→Host: 181 MB/s
  • Theoretical peak: AGP 2.0 (4x) 1066 MB/s, AGP 3.0 (8x) 2133 MB/s
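A sketch of how such transfer rates could be measured from the host, assuming an OpenGL context and a bound RGBA float texture have already been set up (that setup is omitted here); this is not the authors' actual benchmark code.

/* Time a texture upload (host->GPU) and a framebuffer readback
 * (GPU->host) and report MB/s. Requires a valid GL context and an
 * allocated RGBA float texture/framebuffer of size W x H. */
#include <GL/gl.h>
#include <sys/time.h>
#include <stdio.h>
#include <stdlib.h>

static double now_sec(void) {
    struct timeval tv;
    gettimeofday(&tv, NULL);
    return tv.tv_sec + tv.tv_usec * 1e-6;
}

int main(void) {
    const int W = 1024, H = 1024, ITERS = 100;
    float *buf = malloc((size_t)W * H * 4 * sizeof(float));  /* RGBA32F data */

    /* Host -> GPU: repeatedly update an already-allocated texture. */
    double t0 = now_sec();
    for (int i = 0; i < ITERS; i++)
        glTexSubImage2D(GL_TEXTURE_2D, 0, 0, 0, W, H, GL_RGBA, GL_FLOAT, buf);
    glFinish();                                   /* wait for the GPU to drain */
    double up = (double)ITERS * W * H * 16 / (now_sec() - t0) / 1e6;

    /* GPU -> Host: read the framebuffer back. */
    t0 = now_sec();
    for (int i = 0; i < ITERS; i++)
        glReadPixels(0, 0, W, H, GL_RGBA, GL_FLOAT, buf);
    double down = (double)ITERS * W * H * 16 / (now_sec() - t0) / 1e6;

    printf("upload %.0f MB/s, readback %.0f MB/s\n", up, down);
    free(buf);
    return 0;
}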

  10. Strided memory access (figure: 1-D view of fragment memory walked with a stride)
  • Study memory accesses on 2-D textures: the most common access pattern when the output of a fragment program is used as the input to the next pass.
  • We are mapping a 1-D memory space onto 2-D; there are two ways to do it: row major and column major.
  • We perform strided memory accesses where the stride is varied (modeled below).
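The plain-C sketch below only models the address pattern being studied: it maps a 1-D element index onto W×H texel coordinates in row-major or column-major order and walks it with a stride, which is the part that determines cache behavior. The sizes and helper names are illustrative, not the benchmark itself.

/* Print the texel touched by each strided access under the two
 * 1-D -> 2-D mappings (row major vs. column major). */
#include <stdio.h>

#define W 4
#define H 4

static void walk(int stride, int row_major) {
    printf("stride %d, %s-major:\n", stride, row_major ? "row" : "column");
    for (int i = 0; i < W * H; i += stride) {
        int x = row_major ? i % W : i / H;   /* texel coordinates of element i */
        int y = row_major ? i / W : i % H;
        printf("  element %2d -> texel (%d, %d)\n", i, x, y);
    }
}

int main(void) {
    walk(1, 1);   /* unit stride, row major: consecutive texels along a row  */
    walk(W, 1);   /* stride of one row, row major: walks down a column       */
    walk(W, 0);   /* same stride, column-major mapping: back to a row        */
    return 0;
}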

  11. Strided Bandwidth
  • ALU-limited up to 3 textures; 13 GB/s is the maximum

  12. Random memory access
  • Experiment setup:
  • Read a texture initialized with random values
  • Use the values read to index multiple other textures (which are therefore accessed randomly)
  • Determine the size of the texture cache
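A CPU-side model of the experiment, with plain arrays standing in for textures; it is only meant to show the dependent-fetch structure (sequential read of the index texture, random read of the data texture), not the actual GPU benchmark.

/* First texture holds random indices; the value read from it is used
 * as the address into a second texture, so the second texture is
 * accessed in random order. Sweeping the size of data_tex and timing
 * this on the GPU exposes the texture cache size. */
#include <stdio.h>
#include <stdlib.h>

#define N (256 * 256)              /* texels in each "texture" */

int main(void) {
    static float index_tex[N];     /* texture of random indices */
    static float data_tex[N];      /* texture accessed through those indices */
    float sum = 0.0f;

    srand(1);
    for (int i = 0; i < N; i++) {
        index_tex[i] = (float)(rand() % N);
        data_tex[i]  = (float)i;
    }

    /* Per "fragment": the first fetch is sequential, the second is
     * dependent on the value just read, i.e. effectively random. */
    for (int i = 0; i < N; i++) {
        int j = (int)index_tex[i];     /* dependent fetch address */
        sum += data_tex[j];
    }
    printf("checksum %g\n", sum);
    return 0;
}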

  13. Random Memory Access Bandwidth
  • Small texture cache size (8 × 8 × 4 × 4 = 1 kB); more data points are needed between 8 and 16

  14. Floating Point Ops per second
  • ATI is stable at ~3 G inst/s: 380 MHz clock × 8 parallel pipelines = 3040 M inst/s
  • Some instructions are implemented as multiple instructions (COS takes 10)
  • NVIDIA rocks for pure MUL and ADD

  15. Nvidia’s funky architecture
  • The NV35 is rumored to process 2×2 fragments at a time, i.e. 4 copies of the pipeline shown above; with 3 MADD units each, that is 12 × 450 MHz = 5400 M inst/s

  16. Dependence on number of live registers
  • ATI is solid as a rock
  • The NVIDIA Ferrari quickly becomes a skateboard: from the previous diagram, my guess is that you lose the 2 final MUL units if you use more than 3 registers (a drop of 1/3), but then it gets worse, much worse.

  17. Reductions
  • In a kernel you sometimes want to do an operation that is commutative and associative, e.g.: for (i = 0; i < N; i++) { /* do some work */ sum += work; }
  • This is fully parallel: if we have N processors, we can compute N partial sums and add them together at the end.
  • On the GPU we cannot carry state from one fragment to another, so we need multiple passes to combine the results in a tree fashion (sketched below). This is a programming issue (the GPU does not let us carry state across fragments).
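A small runnable C sketch of that tree-style, multi-pass combine: each iteration of the outer loop models one rendering pass in which every output fragment reads two inputs, halving the texture size until one value remains. The array and sizes are illustrative.

/* Multi-pass reduction: arrays stand in for the textures of successive
 * passes; log2(N) passes produce the final sum. */
#include <stdio.h>

#define N 16                       /* must be a power of two for this sketch */

int main(void) {
    float tex[N];
    for (int i = 0; i < N; i++) tex[i] = 1.0f;   /* per-fragment "work" results */

    /* Each iteration models one rendering pass over a smaller texture. */
    for (int n = N; n > 1; n /= 2)
        for (int i = 0; i < n / 2; i++)
            tex[i] = tex[2 * i] + tex[2 * i + 1];   /* each output fragment reads 2 inputs */

    printf("sum = %g (expected %d)\n", tex[0], N);
    return 0;
}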

  18. Drops at square textures? (need to check)
  • Plot against a reduction where state could be carried across fragments
