This thesis discusses the identification of a bottleneck in GPU rasterization of small triangles and proposes a scalable and area-free GPU design to efficiently rasterize them. The implementation of the design has been completed, but performance is still poor. New research ideas include improving memory access and reducing shading on GPUs using quad-fragment merging.
Microtriangle thesis mid-status September 22nd, 2010
My Thesis • So far • Summer work • New ideas
So far • Identified an important rasterization bottleneck in GPUs when processing small triangles (<10 pixels). • Found a scalable and “almost” area-free new GPU design/pipeline to efficiently rasterize uTriangles: • Uses the shader cores: an increasing GPU resource. • Independent fragment-in-triangle tests (no setup). • Crack-free by using fixed-point arithmetic. • BB optimization reduces shader rasterization work per uTriangle by ≈50%. • Switches to the traditional rasterization pipeline for macro “large” triangles.
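As a rough illustration of the independent fragment-in-triangle test described above, the sketch below evaluates three integer edge functions per pixel center, with no incremental triangle setup. It is not the ATTILA shader program: the 4-bit subpixel precision, the function names, and the reading of "BB" as a bounding-box restriction are all assumptions. Because every value is an exact integer, two triangles sharing an edge compute the same magnitude with opposite sign, so no sample falls into a gap between them. A tie-breaking fill rule (such as top-left) is omitted here, so samples lying exactly on a shared edge are reported by both triangles.

```python
SUBPIXEL_BITS = 4            # assumed precision: 1/16 of a pixel
ONE = 1 << SUBPIXEL_BITS


def to_fixed(v):
    """Snap a floating-point coordinate to the fixed-point grid."""
    return int(round(v * ONE))


def edge(ax, ay, bx, by, px, py):
    """Integer edge function: >= 0 when p is on the inner side of a->b."""
    return (bx - ax) * (py - ay) - (by - ay) * (px - ax)


def rasterize_micro(tri, width, height):
    """Return the pixels whose centers are covered by the triangle."""
    v0, v1, v2 = [(to_fixed(x), to_fixed(y)) for x, y in tri]
    area = edge(*v0, *v1, *v2)
    if area == 0:
        return []            # degenerate triangle
    if area < 0:
        v1, v2 = v2, v1      # normalize the winding order

    # Bounding-box restriction: only pixels inside the box are tested.
    xs = (v0[0], v1[0], v2[0])
    ys = (v0[1], v1[1], v2[1])
    min_x = max(min(xs) >> SUBPIXEL_BITS, 0)
    max_x = min(max(xs) >> SUBPIXEL_BITS, width - 1)
    min_y = max(min(ys) >> SUBPIXEL_BITS, 0)
    max_y = min(max(ys) >> SUBPIXEL_BITS, height - 1)

    covered = []
    for py in range(min_y, max_y + 1):
        for px in range(min_x, max_x + 1):
            # Sample at the pixel center; each test is independent.
            sx = (px << SUBPIXEL_BITS) + ONE // 2
            sy = (py << SUBPIXEL_BITS) + ONE // 2
            if (edge(*v0, *v1, sx, sy) >= 0
                    and edge(*v1, *v2, sx, sy) >= 0
                    and edge(*v2, *v0, sx, sy) >= 0):
                covered.append((px, py))
    return covered
```

For example, `rasterize_micro([(2.1, 2.1), (2.9, 2.2), (2.4, 2.9)], 8, 8)` tests a single pixel, since the bounding box of that microtriangle spans only pixel (2, 2).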
Summer work • Finished the implementation of the pipeline for a mixed flow of macro/micro triangles: • Now, triangles of different sizes inside a DrawPrim*() call execute in the corresponding pipeline, concurrently. • Added support in ACD to transparently manage and sync separate pixel shader programs, for macro and micro triangle fragments, in the ATTILA shader instruction memory. • Fixed some bugs found in the data pipeline implemented a year ago.
Summer work • The implementation is functional but performance is still poor (about 0.3x that of the traditional rasterizer alone). • Not related to the rasterization cost (low shader usage). • Possible pipeline bottlenecks and/or a need to adjust queue sizes. • New uTriangle workload to test the implementation: • Made an OpenGL app which renders a highly tessellated model of the Stanford Bunny (69K triangles) approaching the camera. • Each frame projects the Bunny with increased triangle size (closer to the camera).
A new testing workload • The Bunny uTriangle mesh (1024x1024 RTT, 69K-triangle model): • Frame 1: Bunny fills ≈ 0.4% of the viewport, ≈ 8.34 tris/pix. • Frame 60: Bunny fills ≈ 2% of the viewport, ≈ 3.3 tris/pix. • Frame 100: Bunny fills ≈ 15% of the viewport, ≈ 0.22 tris/pix. • Frame 150: Bunny fills ≈ 22% of the viewport, ≈ 0.15 tris/pix. • Will use even more extremely tessellated models: Happy Buddha and Chinese Dragon (1 million triangles).
New research ideas • Triangles are now rasterized faster: • By a more efficient computation of the rasterization job in the fastest/widest GPU resource (shaders). • But… the GPU pipeline still retains some inefficiencies due to data structures designed to leverage large triangles. • Overshading due to a quad (2x2 pixels) generated for each uTriangle. • Second issue: memory accesses (framebuffer, textures) for uTriangle fragments are no longer guaranteed to be optimal for caches (Hilbert pattern), since the access order now depends on surface tessellation.
Shading uTris using stamps is inefficient • When uTris fill just one pixel -> shader units are used at 1/4 after rasterization. • The more uTris per pixel, the lower the utilization. • Apply vector compaction after rasterization and before shading. • Form vectors of multi-triangle quads. • Fatahalian, K., Boulos, S., Hegarty, J., Akeley, K., Mark, W. R., Moreton, H., and Hanrahan, P. 2010. Reducing shading on GPUs using quad-fragment merging. In ACM SIGGRAPH 2010 Papers. • Derivatives computation must be decoupled. (Figure: sparse shader vectors.)
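The 1/4 figure above follows directly from the 2x2 stamp size. The small sketch below makes the arithmetic explicit; the function name and the per-quad coverage list are hypothetical and only illustrate the ratio of useful lanes to allocated lanes.

```python
QUAD_SIZE = 4  # fragments in a 2x2 stamp


def lane_utilization(covered_per_quad):
    """Fraction of shader lanes doing useful work.

    covered_per_quad: number of covered pixels (0..4) in each quad
    that was generated by rasterization.
    """
    if not covered_per_quad:
        return 0.0
    useful = sum(covered_per_quad)
    allocated = QUAD_SIZE * len(covered_per_quad)
    return useful / allocated
```

Four single-pixel uTriangles, each generating its own quad, give `lane_utilization([1, 1, 1, 1]) == 0.25`, while the same four fragments packed into one quad would give `lane_utilization([4]) == 1.0`.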
Vector Compaction Shader Inputs • Wait for rasterization completion of several sparse vector threads before interpolation/shading (use a thread fence instruction). • Merge valid threads into new dense vectors that resume execution (become ready to schedule). • Shader vector slots can be freed for new inputs as a result of compacting. (Figure labels: Rasterization, Thread fence, Compact queue, Freed slots, Ready-to-execute vector threads, FULL-WIDTH interpolation/fragment shading.)
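The merge step above can be sketched in a few lines. This is only a software model under assumed data structures (a vector is a list of thread IDs, with `None` marking a lane whose fragment failed the rasterization test); the function name is hypothetical and the hardware compact queue is not modeled.

```python
def compact_vectors(sparse_vectors, width):
    """Merge valid threads of several sparse vectors into dense ones.

    Returns (dense_vectors, freed_slots), where freed_slots is the
    number of vector slots that become available for new inputs.
    """
    # Gather every thread that survived rasterization, in arrival order.
    valid = [t for vec in sparse_vectors for t in vec if t is not None]

    # Repack them into full-width vectors.
    dense = [valid[i:i + width] for i in range(0, len(valid), width)]

    # Pad the last vector so all vectors have the same width.
    if dense and len(dense[-1]) < width:
        dense[-1] = dense[-1] + [None] * (width - len(dense[-1]))

    freed_slots = len(sparse_vectors) - len(dense)
    return dense, freed_slots
```

With four 4-wide vectors that each hold one valid thread, the result is a single dense vector and three freed slots.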
Vector Compaction Shader Inputs • Vector compaction implies changing the thread's initially assigned SIMD lane. • Direct implication on the RF organization: • Heavily-multiported single RF: • Long latency. • Banked RF: • Fewer ports. • Need to migrate register values with threads as they are compacted. (Figure labels: SIMD pipeline with I-Fetch, Decode, RF, ALU and Writeback stages; Ready-to-execute vector threads; Compact queue; Sparse vector; Compacted vector.)
Vector Compaction • # banks = SIMD length: RF0 holds Th0, Th16, …; RF1 holds Th1, Th17, …; RF2 holds Th2, Th18, …; … ; RF15 holds Th15, Th31, … • # banks = 4 (fragments in a quad): RF0 holds Th0, Th4, Th8, …; RF1 holds Th1, Th5, Th9, …; RF2 holds Th2, Th6, Th10, …; RF3 holds Th3, Th7, Th11, …, with one bank per quad pixel (bottom-left, bottom-right, top-left, top-right). • Compacted threads do not need to migrate register values as long as they stay in the same quad lane.
Vector Compaction • 4 banks: ideal case for uTriangle meshes • no migration needed. • uTriangles covering > 1 pixel can still be merged in a smart way that avoids register migration as much as possible. (Figure stages: Stamp Generation, Rasterization, Compaction.)
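One possible way to keep threads in their quad lane during compaction is sketched below. The greedy policy, the function name, and the `(thread_id, lane)` input format are assumptions made for illustration; the slides do not specify the merging policy. Each of the four lanes (one per register-file bank) has its own queue, and a thread only changes lane, which would imply a register migration, when its own lane cannot fill the quad.

```python
from collections import deque

QUAD_LANES = 4  # one register-file bank per pixel of the 2x2 quad


def compact_by_quad_lane(threads):
    """Pack (thread_id, lane) pairs into dense quads.

    Returns (quads, migrations). A migration is counted whenever a
    thread is placed in a lane other than the one it was assigned.
    """
    lanes = [deque() for _ in range(QUAD_LANES)]
    for tid, lane in threads:
        lanes[lane].append(tid)

    quads, migrations = [], 0
    while any(lanes):
        quad = [None] * QUAD_LANES
        # First choice: keep every thread in its original lane.
        for lane in range(QUAD_LANES):
            if lanes[lane]:
                quad[lane] = lanes[lane].popleft()
        # Fill remaining holes from the fullest lane (costs a migration).
        for lane in range(QUAD_LANES):
            if quad[lane] is None:
                donor = max(range(QUAD_LANES), key=lambda l: len(lanes[l]))
                if lanes[donor]:
                    quad[lane] = lanes[donor].popleft()
                    migrations += 1
        quads.append(quad)
    return quads, migrations
```

When the incoming fragments are spread evenly over the four quad positions, the migration count stays at zero; the worst case is a stream in which every fragment occupies the same quad position.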
Texture memory access • Macrotriangle fragments are rasterized in a special order (Hilbert, Morton) which favors texture cache locality. (Figure: texture cache line footprint of a regular macro triangle.)
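To make the traversal order concrete, the sketch below computes a Morton (Z-order) index by interleaving the bits of the pixel coordinates. It illustrates the general technique only; the function name and the 16-bit coordinate width are assumptions, and the Hilbert variant mentioned above is not shown.

```python
def morton_index(x, y, bits=16):
    """Interleave the bits of x and y: x in even bits, y in odd bits."""
    code = 0
    for i in range(bits):
        code |= ((x >> i) & 1) << (2 * i)
        code |= ((y >> i) & 1) << (2 * i + 1)
    return code


def morton_order(width, height):
    """Pixels of a width x height tile sorted by Morton index."""
    pixels = [(x, y) for y in range(height) for x in range(width)]
    return sorted(pixels, key=lambda p: morton_index(p[0], p[1]))
```

Sorting a 4x4 tile this way visits each 2x2 block completely before moving to the next one, so consecutive fragments tend to touch neighboring texels.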
Texture memory access • With microtriangles, texture accesses depend on the surface tessellation order -> cache locality is lost. (Figure: texture cache line footprint of a tessellated patch.)
Texture memory access • Proposed idea: • For uTriangles, vector compaction can recover texture locality by preferentially grouping threads that map to the same cache line in the same compacted vector. • Probably requires a much longer compact queue.
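A minimal software model of the proposed grouping is sketched below. The 64-byte line size, the function names, and the `(thread_id, texel_address)` input format are assumptions; a real compact queue would also have to bound how long threads wait for line-mates, which is why a longer queue is likely needed.

```python
from collections import defaultdict


def compact_by_cache_line(threads, width, line_bytes=64):
    """Pack (thread_id, texel_address) pairs into vectors so that
    threads touching the same texture cache line end up together."""
    by_line = defaultdict(list)
    for tid, addr in threads:
        by_line[addr // line_bytes].append(tid)

    # Emit threads line by line, then cut into full-width vectors.
    ordered = [tid for line in sorted(by_line) for tid in by_line[line]]
    return [ordered[i:i + width] for i in range(0, len(ordered), width)]


def lines_touched(vector, addr_of, line_bytes=64):
    """Number of distinct cache lines a compacted vector accesses."""
    return len({addr_of[tid] // line_bytes for tid in vector})
```

With eight threads whose accesses alternate between two cache lines, plain arrival-order packing would make both 4-wide vectors touch both lines, while the grouping above yields one line per vector.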