Effective Non-Blocking Cache Architecture for High-Performance Texture Mapping • Dukki Hong1 Youngduke Seo1 Youngsik Kim2 Kwon-Taek Kwon3 • Sang-Oak Woo3 Seok-Yoon Jung3 Kyoungwoo Lee4 Woo-Chan Park1 • 1Media Processor Lab., Sejong University • 2Korea Polytechnic University • 3SAIT of Samsung Electronics Co., Ltd. • 4Yonsei University • dkhong@rayman.sejong.ac.kr • http://rayman.sejong.ac.kr
Contents • Introduction • Related Work • Texture Mapping • Non-Blocking Scheme • Proposed Non-Blocking Texture Cache • The Proposed Architecture • Buffers for the Non-Blocking Scheme • Execution Flow of the NBTC • Experimental Results • Conclusion
Introduction • Texture mapping • Core technique for 3D graphics • Maps texture images onto object surfaces • Problem: requires a huge number of memory accesses • Major bottleneck in graphics pipelines • Modern GPUs generally use texture caches to solve this problem • Improving texture cache performance • Improving cache hit rates • Reducing miss penalty • Reducing cache access time
Mobile 3D games • The visual quality of mobile 3D games has evolved enough to compare with PC games • Detailed texture images • e.g., Infinity Blade: 2048 [GDC 2011] • Demand high texture mapping throughput <Gameloft: Asphalt Series> <Epic Games: Infinity Blade Series>
Our approach • Improving texture cache performance • Improving cache hit rates • Reducing miss penalty • Reducing cache access time • In this presentation, we introduce a non-blocking texture cache (NBTC) architecture • Out-of-order (OOO) execution • Conditional in-order (IO) completion for texture requests with the same screen coordinate, to support the standard API effectively
What is texture mapping? • Texture mapping • Texture mapping glues n-D images onto geometrical objects • To increase realism <Texture-Mapped Object> <Object> <Texture> • Texture filtering • Texture filtering is an operation that reduces texture-aliasing artifacts caused by texture mapping • Bi-linear filtering: four samples per texture access • Tri-linear filtering: eight samples per texture access <Results of texture filtering>
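The "four samples per texture access" of bi-linear filtering can be made concrete with a small sketch. This is an illustrative model, not the paper's hardware; the function and variable names are our own.

```python
# Sketch of bi-linear texture filtering: each texture access reads the
# four texels surrounding the sample point and blends them by the
# fractional position within the 2x2 footprint.

def bilinear_sample(texture, u, v):
    """texture: 2D list of texel values indexed [y][x]; (u, v) in texel space."""
    h, w = len(texture), len(texture[0])
    x0, y0 = int(u), int(v)                          # top-left texel of footprint
    x1, y1 = min(x0 + 1, w - 1), min(y0 + 1, h - 1)  # clamp at texture edge
    fx, fy = u - x0, v - y0                          # fractional blend weights
    top = texture[y0][x0] * (1 - fx) + texture[y0][x1] * fx
    bot = texture[y1][x0] * (1 - fx) + texture[y1][x1] * fx
    return top * (1 - fy) + bot * fy                 # four texels -> one value
```

Tri-linear filtering applies the same blend on two adjacent mipmap levels and interpolates between them, hence eight samples per access.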
Related Work • Cache performance study • In [Hakura and Gupta 1997], the performance of a texture cache was measured on various benchmarks • In [Igehy et al. 1999], the performance of a texture cache was studied with multiple pixel pipelines • Pre-fetching scheme • In [Igehy et al. 1998], the latency of texture cache misses is hidden by an explicit pre-fetching scheme • Survey of texture caches • The introduction of texture caches and their integration into modern GPUs were surveyed in [Doggett 2012]
Related Work: Non-Blocking Scheme • Non-blocking cache (NBC) • Allows subsequent cache requests to proceed while a cache miss is handled • Reduces miss-induced processor stalls • Kroft first proposed an NBC using miss information/status holding registers (MSHRs) that keep track of multiple outstanding misses [Kroft 1981] <Blocking Cache> <Non-blocking Cache with MSHR> <Kroft's MSHR>
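The MSHR mechanism above can be sketched in a few lines: on a miss the request is recorded instead of stalling the pipeline, and later misses to the same block merge into the existing entry. This is a simplified behavioral model with hypothetical names, not Kroft's original design.

```python
# Minimal model of a non-blocking cache front end with MSHRs.

class NonBlockingCache:
    def __init__(self, num_mshrs=4):
        self.lines = {}          # block address -> cached data
        self.mshrs = {}          # outstanding miss: block address -> requesters
        self.num_mshrs = num_mshrs

    def access(self, block_addr, requester):
        if block_addr in self.lines:
            return ("hit", self.lines[block_addr])
        if block_addr in self.mshrs:              # secondary miss: merge it,
            self.mshrs[block_addr].append(requester)  # no new memory request
            return ("merged", None)
        if len(self.mshrs) == self.num_mshrs:
            return ("stall", None)                # MSHRs full: must block
        self.mshrs[block_addr] = [requester]      # primary miss: allocate MSHR
        return ("miss", None)

    def fill(self, block_addr, data):
        """Memory returns the block: wake every merged requester."""
        self.lines[block_addr] = data
        return self.mshrs.pop(block_addr)
```

The key property is visible in `access`: only a full MSHR file forces a stall; ordinary misses let the following requests continue.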
Related Work: Inverted MSHR • Performance study of non-blocking caches • Comparison of four different MSHR organizations [Farkas and Jouppi 1994] • Implicitly addressed MSHR: Kroft's MSHR • Explicitly addressed MSHR: a variant of the implicitly addressed MSHR that stores each destination explicitly • In-cache MSHR: uses each cache line as an MSHR • The first three MSHRs allow only one entry per missed block address • Inverted MSHR: a single entry per possible destination • The number of entries = the number of usable registers in the processor (possible destinations) <Inverted MSHR organization> • A recent study on a high-performance out-of-order (OOO) processor with the latest SPEC benchmarks [Li et al. 2011] • A hit-under-two-misses non-blocking cache improved the OOO processor's performance by 17.76% over a blocking data cache
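The inverted organization can be sketched as follows: instead of one entry per missed block address, there is one slot per possible destination, so several destinations may wait on the same block. A behavioral sketch under our own naming, not the paper's circuit:

```python
# Sketch of an inverted MSHR: one slot per possible destination
# (e.g., per processor register), each holding the block address
# that destination is waiting on, or None when idle.

class InvertedMSHR:
    def __init__(self, num_destinations):
        self.slots = [None] * num_destinations

    def allocate(self, dest, block_addr):
        """Record that destination `dest` is waiting on `block_addr`."""
        self.slots[dest] = block_addr

    def fill(self, block_addr):
        """A block returns from memory: wake every waiting destination."""
        woken = [d for d, a in enumerate(self.slots) if a == block_addr]
        for d in woken:
            self.slots[d] = None
        return woken
```

Because capacity is tied to destinations rather than to distinct miss addresses, any number of misses to the same block can be outstanding at once, which is the property the WLB later borrows.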
The Proposed Architecture • This architecture includes a typical blocking texture cache (BTC) as the level 1 (L1) cache, plus three kinds of buffers for the non-blocking scheme: • Retry buffer • Guarantees IO completion • Waiting list buffer • Keeps track of miss information • Block address buffer • Removes duplicate block addresses <Proposed NBTC architecture>
Retry Buffer: Fragment information • Features • The most important property of the retry buffer (RB) is its support of IO completion • The RB stores fragment information in input order • The RB is designed as a FIFO • Data format of each RB entry • Valid bit: 0 = empty, 1 = occupied • Screen coordinate: screen coordinate (x, y) for the output display unit • Texture request • Ready bit: 0 = filtered texture data invalid, 1 = filtered texture data valid • Filtered texture data: texture data for the accomplished texture mapping
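The entry format and the FIFO-based IO completion can be sketched together: entries enter in input order, and an entry may leave only when it is both at the head and ready. Field names follow the slide; the code itself is an illustrative model.

```python
# Sketch of the retry buffer (RB): a FIFO of fragment entries whose
# head may only be popped once its filtered texture data is valid,
# which is what guarantees in-order (IO) completion.

from collections import deque
from dataclasses import dataclass

@dataclass
class RBEntry:
    valid: bool                  # 0 = empty, 1 = occupied
    screen_xy: tuple             # (x, y) for the output display unit
    texture_request: object
    ready: bool = False          # set when filtered texture data becomes valid
    filtered_data: object = None

class RetryBuffer:
    def __init__(self):
        self.fifo = deque()

    def push(self, entry):
        self.fifo.append(entry)  # fragment information stored in input order

    def pop_completed(self):
        """IO completion: only the oldest entry may leave, and only if ready."""
        done = []
        while self.fifo and self.fifo[0].ready:
            done.append(self.fifo.popleft())
        return done
```

Note that a younger fragment finishing its texture mapping early (OOO execution) merely sets its ready bit; it still waits behind older fragments before being forwarded.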
Waiting List Buffer: Texture requests • Features • The waiting list buffer (WLB) is similar to the inverted MSHR proposed in [Farkas and Jouppi 1994] • The WLB stores information for both missed and hit addresses • A texture address in the WLB plays a role similar to a register in the inverted MSHR • Data format of each WLB entry • Valid bit: 0 = empty, 1 = occupied • Texture ID: ID number of a texture request • Filtering information: the information needed to accomplish the texture mapping • Texel addr N: the texture address of necessary texture data • Texel data N: the texel data of texel addr N • Ready bit N: 0 = texel data N invalid, 1 = texel data N valid
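A sketch of one WLB entry makes the per-texel ready bits concrete: a texture request (e.g., four texels for bi-linear filtering) can enter the texture filter only once every texel it needs is valid. Names follow the slide's field list; the model is ours.

```python
# Sketch of a waiting list buffer (WLB) entry: one entry per in-flight
# texture request, holding its N texel addresses with a ready flag per
# texel (modeled here as presence in the texel_data dict).

from dataclasses import dataclass, field

@dataclass
class WLBEntry:
    texture_id: int                # ID number of the texture request
    filtering_info: object         # info needed to accomplish the filtering
    texel_addrs: list              # texel addr 1..N
    texel_data: dict = field(default_factory=dict)  # addr -> data as it arrives

    def fill(self, addr, data):
        """Called when a loaded block supplies one of this entry's texels."""
        if addr in self.texel_addrs:
            self.texel_data[addr] = data

    def ready(self):
        """All ready bits set: the entry can enter the texture mapping unit."""
        return all(a in self.texel_data for a in self.texel_addrs)
```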
Block Address Buffer: Texel requests • Features • The block address buffer issues DRAM accesses sequentially for the texel requests that caused cache misses • The block address buffer removes duplicate DRAM requests • When the data are loaded, all the merged (removed) DRAM requests are satisfied • The block address buffer is designed as a FIFO
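The deduplicating FIFO behavior can be sketched directly: a block address already queued is not enqueued again, so one DRAM access serves every texel request waiting on the same block. An illustrative model with our own naming:

```python
# Sketch of the block address buffer (BAB): a FIFO of pending DRAM block
# requests that drops duplicates before they reach DRAM.

from collections import deque

class BlockAddressBuffer:
    def __init__(self):
        self.fifo = deque()
        self.pending = set()          # block addresses already queued

    def push(self, block_addr):
        """Returns True if a new DRAM request was enqueued."""
        if block_addr in self.pending:
            return False              # duplicate removed: merged with pending one
        self.pending.add(block_addr)
        self.fifo.append(block_addr)
        return True

    def issue(self):
        """Issue the oldest pending DRAM request, in order."""
        addr = self.fifo.popleft()
        self.pending.discard(addr)
        return addr
```

This deduplication is what keeps the NBTC's extra memory bandwidth requirement modest in the experimental results later.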
Execution Flow of our NBTC • Start → Generate texture addresses → Execute tag compare with texel requests • All hits → hit handling case • Miss occurred → miss handling case • Execute lookup RB
Execution Flow: Hit Handling Case • Hit handling case: • Read texel data from the L1 cache • Input texel data to the texture mapping unit via the MUX • Execute texture mapping • Update RB
Execution Flow: Miss Handling Case • Miss handling case: • Read hit texel data from the L1 cache • Input missed texture requests to the WLB and missed texel requests to the BAB ("concurrent execution") • Remove duplicate texel requests • Process the next texture request • On memory request completion, forward the loaded data to the WLB and the cache • Determine the ready entry in the WLB and invalidate it • Input texel data to the texture mapping unit via the MUX • Execute texture mapping • Update RB
Execution Flow: Update Retry Buffer • Update RB: • Determine the ready entry in the RB • Determine whether IO completion is possible • Forward the ready entry to the shading unit • Process the next fragment information
Experimental Environment • Simulator configuration • mRPsim: developed by SAIT [Yoo et al. 2010] • Execution-driven, cycle-accurate simulator for an SRP-based GPU • Modified texture mapping unit • Eight pixel processors • DRAM access latencies: 50, 100, 200, and 300 cycles • Benchmark • Taiji, which has nearest, bi-linear, and tri-linear filtering modes • Cache configuration • Four-way set associative, eight-word block size, 32 KByte cache size • Number of entries per buffer: 32
Pixel Shader Execution Cycle • Pixel shader cycles/frame • PS run cycles: running cycles • PS stall cycles: stall cycles • NBTC stall cycles: stall cycles caused by a full WLB • The pixel shader's execution cycles decreased by 12.47% (latency 50) up to 41.64% (latency 300)
Cache Miss Rate • Cache miss rates • The NBTC's cache miss rate increased slightly compared with the BTC's • The NBTC can handle the following cache accesses even when a cache update has not yet completed
Memory Bandwidth Requirement • Memory bandwidth requirement • The memory bandwidth requirement of the NBTC increased by up to 11% over that of the BTC • Because the block address buffer removes duplicate DRAM requests, the increase in memory bandwidth requirement was relatively small
Conclusion & Future Work • A non-blocking texture cache to improve the performance of texture caches • Basic OOO execution while maintaining IO completion for texture requests with the same screen coordinate • Three buffers to support the non-blocking scheme: • The retry buffer: IO completion • The waiting list buffer: tracking miss information • The block address buffer: removing duplicate block addresses • We plan to implement the proposed NBTC architecture in hardware and measure both its power consumption and hardware area
Thank you for your attention • http://rayman.sejong.ac.kr