Improving cache performance of MPEG video codec. Anne Pratoomtong, Aug 2002
Motivation • Memory bandwidth and memory space are limited. • Multimedia applications are data intensive, so data storage and transfer have a dominant impact on the power and area cost of the system. • The memory access patterns of most multimedia algorithms are complicated but predictable. • For computation-intensive algorithms such as motion estimation, memory bandwidth preservation and on-chip/off-chip memory organization should be investigated further.
Techniques • Hardware prefetching • Software prefetching • Code positioning • Data organization to reduce memory traffic
Hardware prefetching • One block lookahead (OBL): when fetching block i, also fetch block i+1. Efficient for instruction caches, where the access pattern is mostly 1-D and consecutive. • Stream buffer: a FIFO-type queue that sits on the refill path to the main cache.
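The OBL policy above can be sketched as a small cache simulation. This is only an illustration under stated assumptions: the function name `simulate_obl`, the direct-mapped organization, and the default sizes are hypothetical, and the prefetch is issued only on a miss, which is one of several possible OBL variants.

```python
def simulate_obl(addresses, num_lines=64, block_size=32, prefetch=True):
    """Count misses in a direct-mapped cache, optionally with one block lookahead."""
    cache = [None] * num_lines  # cache[index] = block number currently stored there
    misses = 0
    for addr in addresses:
        block = addr // block_size
        if cache[block % num_lines] != block:
            misses += 1
            cache[block % num_lines] = block            # demand fetch of block i
            if prefetch:
                # OBL: while fetching block i, also fetch block i+1
                cache[(block + 1) % num_lines] = block + 1
    return misses


# Sequential word accesses over 16 consecutive blocks (assumed workload).
sequential = list(range(0, 32 * 16, 4))
print(simulate_obl(sequential, prefetch=False))
print(simulate_obl(sequential, prefetch=True))
```

For a purely sequential stream like this one, every second block is already present when it is first referenced, which is why OBL suits instruction fetch.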
Hardware prefetching • Stream buffer (continued): on the miss that causes block i to be brought into the cache, blocks i+1, i+2, …, i+n are also fetched into the stream buffer. Not very efficient for data access patterns with non-unit strides. • Stride prediction table (SPT): a table, indexed by instruction address, that holds the address of the last access made by that instruction. Entries are replaced with an LRU policy. Works well with medium to large cache sizes.
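A stride prediction table of the kind described above might look like the following sketch. The class name `StridePredictionTable`, its `access` method, and the capacity are assumptions made for illustration; real SPT designs usually add state bits to confirm a stride before prefetching, which is omitted here.

```python
from collections import OrderedDict


class StridePredictionTable:
    """Maps an instruction address (pc) to the last data address it accessed."""

    def __init__(self, capacity=16):
        self.capacity = capacity
        self.last_addr = OrderedDict()  # ordered from least to most recently used

    def access(self, pc, addr):
        """Record a load at `pc` touching `addr`; return a prefetch address or None."""
        prediction = None
        if pc in self.last_addr:
            stride = addr - self.last_addr[pc]
            if stride != 0:
                prediction = addr + stride      # assume the same stride repeats
            self.last_addr.move_to_end(pc)      # mark as most recently used
        elif len(self.last_addr) >= self.capacity:
            self.last_addr.popitem(last=False)  # evict the least recently used entry
        self.last_addr[pc] = addr
        return prediction


spt = StridePredictionTable(capacity=2)
spt.access(0x400, 1000)         # first access by this instruction: no prediction
print(spt.access(0x400, 1352))  # stride of 352 bytes observed
```

Because the table is indexed per instruction, a load that walks down an image column sees a constant stride even when other loads are interleaved with it.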
Hardware prefetching • Stream cache: an additional small cache that is accessed in parallel with the main cache. • Prefetched data goes into the stream cache instead of the main cache. • Used together with an SPT to overcome the cache pollution that occurs when an SPT is used with a very small cache. • For medium to large caches, performance is approximately the same as with the SPT implementation alone.
Hardware prefetching • 2D prefetching: prefetch with a constant stride. • The stride value depends on the data structure (the image size in this case). • An image reference table maintains information on the displacement with respect to the physical address in order to find the next prefetch block.
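As a rough sketch of the address arithmetic behind 2D prefetching, the code below computes the block that lies one image row below the current access. The `ImageEntry` record stands in for one entry of the image reference table; its fields, the function name `next_prefetch_block`, and the 32-byte block size are all assumptions, not details taken from the slides.

```python
from dataclasses import dataclass


@dataclass
class ImageEntry:
    base: int   # address of the first pixel of the image
    pitch: int  # bytes per image row; this is the constant prefetch stride
    rows: int   # number of rows in the image


def next_prefetch_block(addr, image, block_size=32):
    """Return the block-aligned address one row below `addr`, or None."""
    size = image.pitch * image.rows
    offset = addr - image.base
    if not 0 <= offset < size:
        return None                      # address is not inside this image
    target = addr + image.pitch          # same column, one row down
    if target - image.base >= size:
        return None                      # already on the last row
    return target - target % block_size  # align down to the containing block


# Hypothetical 352x288 luminance plane with one byte per pixel.
luma = ImageEntry(base=0x10000, pitch=352, rows=288)
print(next_prefetch_block(0x10000 + 5 * 352 + 40, luma))
```

The stride equals the row pitch, so it changes with the image size, which is why a per-image table is needed rather than a fixed lookahead distance.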
Hardware prefetching • [Figure: cache with 1D blocks and 2D prefetch]
Code positioning • Direct-mapped (DM) caches offer the highest storage capacity for a given silicon area and have short access times. • Real-time signal processing applications typically involve a limited set of functions executed periodically to process the incoming data. • A heuristic approach reduces high I-cache miss rates in DM caches; it requires trace profiling. • It rearranges functions in memory, based on trace data, so as to minimize cache line conflicts. • Lookup tables are partitioned into smaller tables.
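The effect of code positioning on a DM instruction cache can be illustrated with a toy layout model. Everything below is an assumption made for the example: the function names, sizes, and call counts are invented, and packing functions contiguously in order of call frequency is a simple stand-in, not the heuristic the slides refer to.

```python
def cache_sets(start, size, line_size=32, num_lines=128):
    """Cache line indices that a function at [start, start + size) maps to."""
    first = start // line_size
    last = (start + size - 1) // line_size
    return {line % num_lines for line in range(first, last + 1)}


def overlap(layout, sizes, a, b):
    """Number of cache lines on which functions a and b conflict."""
    return len(cache_sets(layout[a], sizes[a]) & cache_sets(layout[b], sizes[b]))


def pack_by_frequency(sizes, call_counts, base=0):
    """Place the most frequently called functions contiguously, hottest first."""
    layout, addr = {}, base
    for name in sorted(sizes, key=lambda n: call_counts.get(n, 0), reverse=True):
        layout[name] = addr
        addr += sizes[name]
    return layout


# Hypothetical code sizes in bytes and call counts from a trace.
sizes = {"dct": 1024, "init": 3072, "motion_est": 2048}
counts = {"dct": 900, "init": 1, "motion_est": 1000}
original = {"dct": 0, "init": 1024, "motion_est": 4096}

packed = pack_by_frequency(sizes, counts)
print(overlap(original, sizes, "dct", "motion_est"))
print(overlap(packed, sizes, "dct", "motion_est"))
```

In the original layout the two hot functions sit exactly one cache size (4 KiB) apart and evict each other; after packing they occupy disjoint lines, and the rarely called `init` absorbs the conflicts.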
Data organization to reduce memory traffic • Selective caching • Line locking • Locality hints • Scratch memory • Loop merging
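Of the techniques listed, loop merging is the easiest to show in code. The sketch below uses two invented pixel operations (an offset followed by a scale); the function names are hypothetical and the point is only the shape of the transformation.

```python
def separate_loops(pixels, offset, scale):
    """Two passes: the intermediate array is written out and then read back."""
    shifted = [p - offset for p in pixels]  # pass 1 stores a temporary array
    return [s * scale for s in shifted]     # pass 2 reloads every element


def merged_loop(pixels, offset, scale):
    """One pass: each pixel is loaded once and the temporary array disappears."""
    return [(p - offset) * scale for p in pixels]


row = [16, 128, 235]
print(separate_loops(row, 16, 2) == merged_loop(row, 16, 2))
```

Both versions compute the same result, but the merged loop touches each input element once and never materializes the intermediate array, which is where the reduction in memory traffic comes from.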
HW/SW Co-Synthesis for SOC • Chooses cache sizes and allocates tasks to caches as part of co-synthesis. • Assumes that only a one-level DM cache is modeled and that tasks are well contained in the level-1 cache. • Partitions an application into an acyclic task graph whose nodes represent tasks and whose directed edges represent data dependencies between tasks.
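An acyclic task graph of this kind can be represented as a mapping from each task to the tasks it depends on. The sketch below uses Python's standard `graphlib` module (Python 3.9+); the task names are hypothetical encoder stages chosen for illustration and do not come from the slides.

```python
from graphlib import TopologicalSorter

# Each task maps to the set of tasks whose output it consumes (assumed graph).
deps = {
    "motion_estimation": set(),
    "dct": {"motion_estimation"},
    "quantize": {"dct"},
    "vlc": {"quantize", "motion_estimation"},  # coefficients plus motion vectors
}

# Any valid execution order must respect the data-dependency edges.
order = list(TopologicalSorter(deps).static_order())
print(order)
```

`TopologicalSorter` raises `graphlib.CycleError` if the graph contains a cycle, which gives a cheap check that a partition really is acyclic before tasks are allocated to processing elements.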
HW/SW Co-Synthesis for SOC • Initial solution: assign each task in the task graph to the fastest processing element (PE) that is available for the task. • PE and cache cost reduction: try to eliminate lightly loaded PEs by moving their tasks to other PEs that provide the best performance for those tasks. Then try to implement the remaining unmovable tasks with a cheaper PE. If no such PE can be found, the current PE is kept, but an attempt is made to cut its instruction and data cache sizes where applicable.
Future work • Hardware prefetching is widely implemented on GPP platforms, but is it necessary in an SOC implementation, and is it worth the extra area, power, and complexity? • These techniques are applied to software encoder/decoder applications that run on a GPP or a multimedia processor. Can they be applied efficiently to a hardware/software co-design implementation of an encoder/decoder on a reconfigurable platform?
Future work • Characterize the memory traffic and access patterns of various video codec algorithms and standards, and design the best techniques that can adapt as the application changes.