Implementing the Render Cache and the Edge-and-Point Image on Graphics Hardware. Edgar Velázquez-Armendáriz, Eugene Lee, Bruce Walter, Kavita Bala. GI 2006, Québec, June 9th 2006.
Motivation • High quality shading is still too slow. • Not ready for interactivity. • It is slow even on the GPU. • Potential applications. • Architecture. • Modeling. • Movies.
Overview • GPU acceleration of the Render Cache and the Edge-and-Point Image (EPI). (Figure labels: Points; Edges and Points; Render Cache reconstruction; EPI reconstruction.)
Render Cache overview • Pipeline stages (figure): Projection, Depth cull, Interpolation.
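A minimal CPU-side sketch of the three stages named on this slide, assuming a pinhole camera and grayscale samples. The function names, the `focal` parameter, the nearest-point depth test, and the 3x3 averaging filter are illustrative assumptions, not the authors' code; the actual Render Cache uses more elaborate depth-cull and interpolation heuristics, and the GPU version runs these stages in shaders.

```python
import math

def reproject(points, width, height, focal):
    """Project cached points (camera space, +z forward) into a new frame.

    points: list of (x, y, z, color) tuples with a scalar color.
    Returns (color_image, depth_image); empty pixels hold None / math.inf.
    """
    color = [[None] * width for _ in range(height)]
    depth = [[math.inf] * width for _ in range(height)]
    for x, y, z, c in points:
        if z <= 0:
            continue  # point is behind the camera
        # Projection: pinhole model centered on the image.
        px = math.floor(focal * x / z + width / 2)
        py = math.floor(focal * y / z + height / 2)
        if 0 <= px < width and 0 <= py < height:
            # Depth cull (simplified): keep only the nearest point per pixel.
            if z < depth[py][px]:
                depth[py][px] = z
                color[py][px] = c
    return color, depth

def fill_holes(color):
    """Interpolation (simplified): fill each empty pixel with the mean of
    its valid 3x3 neighbors, leaving it empty if there are none."""
    h, w = len(color), len(color[0])
    out = [row[:] for row in color]
    for y in range(h):
        for x in range(w):
            if color[y][x] is None:
                vals = [color[j][i]
                        for j in range(max(0, y - 1), min(h, y + 2))
                        for i in range(max(0, x - 1), min(w, x + 2))
                        if color[j][i] is not None]
                if vals:
                    out[y][x] = sum(vals) / len(vals)
    return out
```

Calling `reproject` and then `fill_holes` once per frame mirrors the slide's order: sparse points are projected, occluded ones are discarded, and the remaining gaps are interpolated.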
Edge-and-Point Image overview • Alternative display representation • Edge-constrained interpolation preserves sharp features • Fast anti-aliasing (Figure labels: Naive EPI.)
Presented work • Mapping to the hardware • The algorithm’s components differ from standard hardware rendering. • Overcome GPU limitations. • Results • GPU strategies. • Better interactivity.
Related Work • Interactive. • Shading cache. [Tole02] • Corrective texturing. [Stamminger00] • Tapestry. [Simmons00] • Adaptive Frameless Rendering. [Dayal05] • Distance impostors. [Szirmay-Kalos05] • Non-interactive. • Irradiance caching. [Smky05] • Pure Hardware implementations. • Ray tracing. [Purcell02, Carr06] • Photon mapping. [Purcell03]
Talk overview • Algorithm overview. • Mapping to the hardware: strategies and challenges. • Results. • Discussion.
Public availability • The complete Cg source of the shaders is available online: http://www.cs.cornell.edu/~kb/projects/epigpu/
Talk overview • Algorithm overview. • Mapping to the hardware: strategies and challenges. • Results. • Discussion.
Mapping to the hardware • Sections are grouped by computational similarity: • Point processing • Edge finding • Edge-constrained interpolation • Most of the processing has been moved to the GPU.
Point processing • Point Cloud as Vertex Buffer Object (VBO) and Texture. • Multiple Render Targets (MRT) used to write all information in a single pass. • Simplified predicted projection. • Not as accurate as the regular projection. (Figure labels: 4 one-pixel points; 1 splat point using one quarter of the point cloud.)
Point processing: Update • Render Cache’s structures are complex to map. • We cannot modify pipelined GPU data. • Use additional passes. (Diagram labels: Vertex and Pixel shaders; Point projector; Point Cloud; Point Image.)
Point processing: Bandwidth issues • Point projection is bandwidth limited. • Point cloud update. • New samples request. • Write only the new samples to the point cloud. • We use vertex scatter. • Faster than replacing the whole point cloud. • A static VBO is projected three times faster than a constantly modified one.
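The difference between scattering new samples and replacing the whole point cloud can be illustrated with a CPU analogy. This is only a sketch: the 16-byte point layout, the function names, and the list-based "buffer" are assumptions for illustration, and real costs depend on the driver and bus rather than on a simple byte count.

```python
BYTES_PER_POINT = 16  # assumed layout: four 32-bit floats per cached point

def replace_all(point_cloud, fresh_cloud):
    """Re-upload every entry; the cost grows with the size of the whole cloud."""
    point_cloud[:] = fresh_cloud
    return len(fresh_cloud) * BYTES_PER_POINT

def scatter_update(point_cloud, new_samples):
    """Write only the (index, point) pairs that changed; return bytes written."""
    for index, point in new_samples:
        point_cloud[index] = point
    return len(new_samples) * BYTES_PER_POINT
```

With a 1000-point cloud and two new samples, the scatter path touches 32 bytes under this layout, while a full replacement touches 16000.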
Silhouette detection • The original EPI uses hierarchical trees. • Does not map well to GPU. • Brute force method on the GPU. • Avoids transferring edges every frame. • Faster than hierarchical structures! • Shadow edge detection left on the CPU. (Figure labels: Edge texture; Model edges.)
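A brute-force silhouette test can be sketched as a loop that examines every model edge independently, which is the property that lets it run as one shader invocation per edge. The criterion used here (one adjacent face front-facing, the other back-facing) is the standard one and is an assumption about what the shaders compute; the edge tuple layout and function names are hypothetical.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def sub(a, b):
    return tuple(x - y for x, y in zip(a, b))

def silhouette_edges(edges, eye):
    """Return the indices of silhouette edges as seen from `eye`.

    edges: list of (v0, v1, normal_a, normal_b), where normal_a and normal_b
    are the normals of the two faces that share the edge.
    """
    result = []
    for i, (v0, _v1, normal_a, normal_b) in enumerate(edges):
        view = sub(eye, v0)  # direction from the edge toward the viewer
        # Silhouette: the two faces point to opposite sides of the viewer.
        if dot(normal_a, view) * dot(normal_b, view) < 0:
            result.append(i)
    return result
```

Each iteration reads only its own edge and writes one result, so no hierarchy or inter-edge communication is needed.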
Silhouette detection: Limitations • GPU silhouette detection is limited by the fill rate. • Texture memory constraints. • We need to keep all vertices as VBO. • Vertices and normals as textures. • One results texture. • Normals stored as fp16 to reduce space.
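The space saving from half-precision normals can be checked with a short sketch. It uses Python's `struct` module (format code `e` is IEEE 754 half precision) as a stand-in for an fp16 texture; the function names and the sample normal are illustrative assumptions.

```python
import struct

def pack_normal_fp32(normal):
    """Pack a 3-component normal as three 32-bit floats (12 bytes)."""
    return struct.pack("<3f", *normal)

def pack_normal_fp16(normal):
    """Pack a 3-component normal as three 16-bit floats (6 bytes)."""
    return struct.pack("<3e", *normal)

def unpack_normal_fp16(data):
    """Recover the (slightly quantized) normal from its fp16 encoding."""
    return struct.unpack("<3e", data)
```

Half precision halves the storage per normal, at the cost of a small quantization error that is usually acceptable for unit-length direction vectors.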
Edge Raster • Rasterize edges with subpixel precision. • Depends on model complexity. • Extended lines as described in [SEN03]. • Filtered depth as read-only depth buffer. • Free occlusion culling! (Figure labels: No depth texture; With depth texture.)
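The "free occlusion culling" idea can be sketched as a depth test against a buffer that is read but never written. The fragment representation, the `bias` tolerance, and the function name are assumptions for illustration; on the GPU the fixed-function depth test performs this comparison.

```python
def visible_edge_fragments(fragments, depth_buffer, bias=1e-3):
    """Keep the edge fragments that pass a read-only depth test.

    fragments: list of (x, y, z) with integer pixel coordinates.
    depth_buffer: 2D list indexed as depth_buffer[y][x]; it is never modified.
    """
    visible = []
    for x, y, z in fragments:
        # An edge fragment survives only if it is not behind the stored depth.
        if z <= depth_buffer[y][x] + bias:
            visible.append((x, y))
    return visible
```

Because the depth buffer already holds the filtered point depths, hidden edges are rejected without any extra visibility pass.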
Edge-Constrained Interpolation • Multi-pass pixel shaders. • Very long. • A lot of texture accesses. • Image resolution dependent. • Use look-up tables encoded as textures. • Avoid control code in shaders. • Encode original EPI operations.
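The look-up-table idea can be sketched with a toy filter: a pixel averages only the neighbors that are not separated from it by an edge, and the per-neighbor weights come from a table indexed by a reachability bitmask instead of from branches. The 4-neighbor layout, the mask encoding, and the names `LUT` and `interpolate` are hypothetical; the actual EPI filters and their tables are not specified on the slides.

```python
# Precompute, for every 4-bit reachability mask, the normalized weight of
# each of the four neighbors. Bit k set means neighbor k is reachable,
# i.e. no edge lies between it and the center pixel.
LUT = []
for mask in range(16):
    reachable = [(mask >> k) & 1 for k in range(4)]
    total = sum(reachable)
    LUT.append(tuple(r / total if total else 0.0 for r in reachable))

def interpolate(neighbors, mask):
    """Edge-constrained average of four neighbor values.

    A single table fetch replaces per-neighbor branching, the way a
    texture fetch would in a pixel shader.
    """
    weights = LUT[mask]
    return sum(w * v for w, v in zip(weights, neighbors))
```

For example, with mask `0b0011` only the first two neighbors contribute, so values on the far side of an edge never bleed into the result.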
Future trends • Branching granularity. • Some filters require fine granularity to take advantage of dynamic branching. • This issue is being solved with newer cards, beginning with the ATI X1000 series. • Bit operations not directly supported. • DirectX 10 will support them. • Bottom line: GPU implementation will get better and faster.
Limitations • Fill rate and texture access. • These characteristics constantly improve with newer hardware with more pipelines and faster clock frequencies. • Improve by reducing shader length. • Number of registers used is still important. • A 180-instruction shader with 25 registers performs 50% slower than a 215-instruction shader with 24 registers on our GPU.
Talk overview • Algorithm overview. • Mapping to the hardware: strategies and challenges. • Results. • Discussion.
Test platform • Test environment. • Software written in C++, Cg 1.4rc, and Java through JNI under Windows XP. • Pentium 4 EE 3.2 GHz dual core, 2 GB RAM, dual Nvidia GeForce 7800 GTX (driver 81.85). • Test scenes. • Cornell Box • Chains • Mackintosh Room • David Head • Dragon
Results: FPS • GPU version is 60–110% faster than the original. • Speed up increases along with scene complexity.
Talk overview • Algorithm overview. • Mapping to the hardware: strategies and challenges. • Results. • Discussion.
Discussion • Point projection, even though it maps straightforwardly to the GPU, is the bottleneck. • Image filters are very fast in spite of their multiple texture accesses and multiple passes. • We originally thought the opposite would be true!
Discussion • Projection is not optimal. • We wanted to use Vertex Texture Fetch (VTF) for mapping the point cloud update but it was slower than Render to Vertex Array (RTV). • Dual GPU rendering with Scalable Link Interface (SLI) showed marginal gains.
Future performance • Texture accesses are very fast and efficient. • Transferring vertex data on the GPU is too slow to be fully useful. • Scatter write on pixel shaders and geometry shaders may allow complete data management on the GPU.
Conclusions • We presented a hybrid GPU/CPU system for the Render Cache and the EPI using commodity graphics hardware. • Our implementation is 60–110% faster than a pure CPU implementation and frees the CPU up for other operations. • The system’s performance is likely to improve with the current trend of GPUs.
Questions? Implementing the Render Cache and the Edge-and-Point Image on Graphics Hardware http://www.cs.cornell.edu/~kb/projects/epigpu/