Caching Strategies for Textures Paul Arthur Navratil
Overview • Conceptual summary • Design and Analysis of a Cache Architecture for Texture Mapping (Hakura and Gupta 1997) • Prefetching in a Texture Cache Architecture (Igehy, Eldridge, and Proudfoot 1998) • Discussion!
Mip mapping • Achieves acceptable texture-mapping performance • Interpolation between fixed levels of detail is a constant computation cost per fragment (see the sketch below) • Reduces aliasing [Williams p.4] • Efficient memory use • Memory access pattern is well understood
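Why the per-fragment cost is constant can be made concrete with a minimal sketch (illustrative only, not code from either paper; bilinear_sample is a hypothetical helper): trilinear filtering always reads four texels from each of two adjacent mipmap levels and blends them with a fixed number of lerps.

    /* Constant per-fragment cost: 8 texel reads and a fixed lerp tree.
       bilinear_sample() is a hypothetical helper that fetches and blends
       the four nearest texels on one mipmap level. */
    extern float bilinear_sample(float u, float v, int level);

    static float lerp(float a, float b, float t) { return a + t * (b - a); }

    float trilinear_sample(float u, float v, float lod)
    {
        int   level = (int)lod;            /* finer of the two levels   */
        float frac  = lod - (float)level;  /* blend weight between them */
        float fine   = bilinear_sample(u, v, level);     /* 4 texels */
        float coarse = bilinear_sample(u, v, level + 1); /* 4 texels */
        return lerp(fine, coarse, frac);   /* final level-to-level lerp */
    }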
Hakura and Gupta: Problem • Motivation: need high-bandwidth, low-latency memory access for texture mapping • Previous work uses brute force • Dedicated DRAM for each fragment generator [Akeley p.3] • SGI RealityEngine can have 320MB of texture memory, but only 16MB of unique texture memory!
Hakura and Gupta: Idea • Observation: if textures exhibit spatial and temporal locality, design a system to exploit it • Use an SRAM cache for each fragment generator • Have a single, shared DRAM texture memory • Advantages • Unique texture memory is larger • Uses cheaper chips overall (small SRAM caches over dedicated DRAM) • SRAM gives higher bandwidth and lower latency
Hakura and Gupta: Locality • Mip mapping has inherent spatial locality • Four contiguous texels on each of two levels for trilinear interpolation, with texel area close to pixel area • Texture mapping has two forms of temporal locality • Texel reuse between fragments generated close together in time • Texture repetition across the image [color images.ps]
Hakura and Gupta: Caching • Observation: increasing DRAM density has decreased aggregate DRAM bandwidth (the same capacity now fits on fewer chips, with fewer pins)! • A cache decreases the bandwidth requirement by decreasing accesses to texture memory • Block transfers from memory to cache maximize DRAM bandwidth utilization • Texture memory can be shared (not dedicated) • No cache coherence issues (textures are read-only during rendering) • Cache characterized by: • Cache size • Cache line size • Associativity • Which combination is best?
Hakura and Gupta: Texture Representation in Memory • Base case: Linear (Non-Blocked) • Williams' original representation misses spatial locality • Use contiguous RGBA values per texel [Hakura p.5] (see the address sketch below) • Observations: • Gradual level-of-detail change uses more of a fetched cache line • Larger line sizes lower the cold-miss rate • Principle of Texture Thrift: the amount of texture information required to render is proportional to the resolution of the image, and is independent of the number of surfaces and the size of the textures [Peachey 90] • In the examples, the working set is limited to one texture • Worst case is bounded by either the texture size or the screen size • This representation is sensitive to the texture's orientation on screen
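A minimal sketch of linear addressing, assuming row-major storage of packed 4-byte RGBA texels (the function name and texel size are illustrative): texels adjacent in u share cache lines, but a walk that advances mostly in v strides a full row per step, which is why this layout is orientation sensitive.

    /* Linear (non-blocked) layout: the address grows with v * width,
       so vertical or rotated texture walks touch a new cache line on
       almost every access. */
    unsigned linear_address(unsigned u, unsigned v, unsigned width)
    {
        const unsigned texel_bytes = 4;        /* packed RGBA per texel */
        return (v * width + u) * texel_bytes;
    }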
Hakura and Gupta: Texture Representation in Memory • Blocked case: convert the 2-D texture array into a 4-D array • Address calculation becomes a two-step process (see the sketch below) • Block size remains constant across mipmap levels • Observations: • Reduces dependency on texture orientation and utilizes spatial locality • Lowest miss rates occur when block size matches cache line size [Hakura p.7] • Increasing line size alone creates worse miss rates • Can use a 2-way associative cache to avoid conflicts between blocks of different mipmap levels (see Igehy)
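The two-step address calculation can be sketched as follows, assuming a power-of-two block edge B and a texture width that is a multiple of B, with B chosen so one block of RGBA texels fills one cache line (all constants here are illustrative, not the papers' exact values):

    /* Blocked (4-D) layout: step 1 finds the block, step 2 finds the
       texel within it, so every texel of a block is contiguous in memory. */
    #define B 4   /* illustrative block edge: 4x4 RGBA texels = 64-byte line */

    unsigned blocked_address(unsigned u, unsigned v, unsigned width)
    {
        const unsigned texel_bytes    = 4;           /* packed RGBA */
        const unsigned blocks_per_row = width / B;
        unsigned block  = (v / B) * blocks_per_row + (u / B);  /* step 1 */
        unsigned offset = (v % B) * B + (u % B);               /* step 2 */
        return (block * B * B + offset) * texel_bytes;
    }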
Hakura and Gupta: Rasterization • Rasterization order affects the texture access pattern, and thus cache behavior • Use tiling (chunking) to exploit spatial locality (see the traversal sketch below) • If tiles are too large, the working set exceeds the cache size and capacity misses result [Hakura p.9] • Smaller triangles in the image reduce this effect
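A sketch of a tiled traversal order, with an illustrative tile edge T and a hypothetical shade_fragment() hook: consecutive fragments stay inside one screen tile, so they keep reusing the same texture blocks instead of streaming across the whole texture per scan line.

    extern void shade_fragment(int x, int y);  /* hypothetical per-fragment work */
    #define T 16                               /* illustrative tile edge */

    void rasterize_tiled(int width, int height)
    {
        for (int ty = 0; ty < height; ty += T)          /* walk tiles...      */
            for (int tx = 0; tx < width; tx += T)
                for (int y = ty; y < ty + T && y < height; y++)  /* ...then   */
                    for (int x = tx; x < tx + T && x < width; x++) /* pixels  */
                        shade_fragment(x, y);
    }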
Hakura and Gupta: Performance • Rendering performance and memory bandwidth are good measures of a texture mapping system • Fragment generator observations • Machine must access more than one texel per cycle • Must hide memory latency to achieve maximum throughput (address precomputation) • SRAM cache observations • Multiple banks with interleaved lines for multi-texel access (see the bank sketch below) • Interleave texels within each block • Without multi-texel access, trilinear interpolation can complete only once every two cycles!
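A minimal sketch of 2x2 texel interleaving across four SRAM banks (illustrative, not necessarily the papers' exact scheme): the four texels of any bilinear footprint always land in four distinct banks, so they can all be read in a single cycle.

    /* Bank selection from the low bits of the texel coordinates.
       Neighbors (u,v), (u+1,v), (u,v+1), (u+1,v+1) map to four
       different banks, enabling one bilinear access per cycle. */
    unsigned texel_bank(unsigned u, unsigned v)
    {
        return (u & 1u) | ((v & 1u) << 1);   /* banks 0..3 */
    }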
Hakura and Gupta: Conclusions • Caching yields a three-fold to fifteen-fold reduction in memory bandwidth requirements • Cache should be at least 16 KB and 2-way associative • Long cache lines utilize DRAM bandwidth more efficiently (at the cost of a slight increase in total bandwidth requirements) • Block size should match cache line size • Rasterization pattern should be tiled
Igehy et al: Problem • Motivation: memory bandwidth and latency are (becoming) the bottleneck for texture systems • Previous work shows caching benefits [Hakura97; Cox98], but fails to hide memory latency • Little literature on prefetching texels: • used in some systems, but the algorithms are not described (proprietary), e.g. [Torborg and Kajiya 1996]
Igehy et al: Idea • Combine prefetching and caching in an architecture with a clear description • Advantages: • Simple • Robust to variations in bandwidth requirements and latencies • Achieves within 3% of the performance of a zero-latency system
Igehy et al: Traditional Prefetching (no cache) • When a fragment is ready for texturing, queue it and request its texels • The fragment stays in the queue for a time equal to the memory latency • If the queue is sized correctly, latency will be masked (see the sizing sketch below) • Problems: • When the product of request rate and latency is large, data prefetched early can be evicted before it is used • Tags must be checked at double rate to maximize throughput (prefetch check and read check) • Prefetch buffer size must increase as request rate and latency increase
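The sizing rule can be stated as a one-line calculation (names are illustrative): the fragment queue must cover one full memory round trip at the peak issue rate, which is why its required depth grows with both the request rate and the latency. For example, a 50-cycle memory latency at one fragment per cycle needs at least 50 slots.

    /* Depth needed for a prefetch FIFO to hide memory latency:
       one slot per fragment issued during one memory round trip. */
    unsigned fifo_depth(unsigned latency_cycles, unsigned frags_per_cycle)
    {
        return latency_cycles * frags_per_cycle;
    }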
Igehy et al: Texture Prefetching • Differences from traditional prefetch: • Tag checks occur once per texel, before cache access • A reorder buffer handles early return of texel data • New cache blocks are committed to the cache only when the associated fragment reaches the head of the queue • Cache organization: • Two groups of four banks, with adjacent mipmap levels in alternating groups (see the mapping sketch below) • Data interleaved so the four accesses for bilinear interpolation can occur in parallel • Can process 8 requests in parallel, which is enough for trilinear interpolation
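Extending the earlier bank sketch, a hypothetical mapping consistent with this organization (an illustrative assumption, not the paper's exact circuit): the mipmap level's parity selects one of two four-bank groups, so the two levels of a trilinear access never collide and all eight texels can be read in parallel.

    /* 8 banks total: even mip levels use banks 0..3, odd levels 4..7;
       within a group, 2x2 interleaving serves one bilinear access. */
    unsigned cache_bank(unsigned u, unsigned v, unsigned level)
    {
        unsigned group = level & 1u;                    /* even/odd mip level */
        return group * 4u + ((u & 1u) | ((v & 1u) << 1));
    }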
Igehy et al: Texture Properties • Texture caching effectiveness is scene dependent • Observation: the unique-texel-to-fragment ratio is a lower bound on the number of texels that must be fetched per frame (unless inter-frame locality is exploited) • Want a low unique-texel-to-fragment ratio! (worked example below) • Ratio affected by: • Magnification (lowers the ratio) • Repetition (lowers the ratio if the cache holds the entire texture) • Minification (ratio depends on the texel-area-to-pixel-area ratio)
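As a worked example with illustrative numbers: if a frame produces 1,000,000 fragments and touches 250,000 unique texels, the ratio is 0.25, so no scheme that ignores inter-frame locality can fetch fewer than 250,000 texels for that frame, no matter how good the cache is.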
Igehy et al: Memory Organization • Extend the blocked texture representation of Hakura to 6-D [Igehy p.5] • Rasterize in a tiled pattern (not scan-line) • Cache associativity does not appreciably affect miss rate • The design minimizes conflict misses • General formula for determining associativity (worked example below): • m independent n-way associative caches can handle a rate of m bilinear accesses (four texels each) per cycle to m*n textures (or texture levels in a mipmap)
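For instance, with m = 2 and n = 2, two independent 2-way associative caches sustain two bilinear accesses (eight texels) per cycle across up to four textures or mipmap levels, which covers the trilinear case of two adjacent mipmap levels per fragment.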
Igehy et al: Bandwidth • Average texel requests per frame are not enough to determine actual requirements • High-request bursts occur [Igehy p.6] • e.g. color map vs. light map • When the system misses ideal (zero-latency) performance, bandwidth is to blame [Igehy p.8] • e.g. AGP vs. NUMA
Igehy et al: Conclusions • System that approximates zero-latency is possible • Achieved 97% utilization of available resources • Fragment queue should slightly exceed latency of memory system to account for miss bursts • Reserve reorder-buffer slot when memory request is made to avoid deadlock