
EE398 – Project Presentation


Presentation Transcript


  1. EE398 – Project Presentation Performance-Complexity Tradeoff in H.264 Motion Search Ionut Hristodorescu ionuth@stanford.edu

  2. Outline • H.264 motion search algorithm • Mapping of the motion compensation algorithm on a memory hierarchy subsystem • Cache organization and its impact on the motion compensation speed • Making the internal H.264 data structures more cache friendly

  3. H.264 Motion Search Algorithm • Block matching algorithm • Computes the SADs for all the candidate positions in a given search area (exhaustive search) • Computationally intensive • Its complexity is equal to or greater than that of the rest of the encoding steps • Takes most of the encoding time
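The exhaustive-search kernel can be sketched as a plain SAD loop (a minimal sketch; the function name `sad16x16` and the 8-bit `uint8_t` pels are assumptions, not taken from the reference encoder):

```c
#include <stdint.h>
#include <stdlib.h>

/* Sum of absolute differences between a 16x16 current block and a
 * reference block; `stride` is the picture width in pels, since the
 * picture is stored line by line (row-major). */
static unsigned sad16x16(const uint8_t *cur, const uint8_t *ref, int stride)
{
    unsigned sad = 0;
    for (int y = 0; y < 16; y++)
        for (int x = 0; x < 16; x++)
            sad += abs(cur[y * stride + x] - ref[y * stride + x]);
    return sad;
}
```

The block matcher evaluates this SAD for every candidate displacement in the search window and keeps the minimum, which is why its memory-access pattern dominates encoding time.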

  4. Mapping on a memory hierarchy subsystem • The luma/chroma is represented internally in the motion compensation algorithm as a line-by-line (raster-order) matrix • So, consecutive lines of a macroblock are separated by (size(pel)*width) bytes • This means that accessing pels that sit in the same column generates 1 cache miss per pel !!!
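The stride arithmetic behind this is simple (an illustrative sketch; `raster_offset` is a hypothetical helper): with 8-bit pels, pel (x, y) lives at byte offset y*width + x, so vertically adjacent pels are a full picture width apart and each lands on a different cache line.

```c
#include <stddef.h>

/* Byte offset of pel (x, y) in the raster (line-by-line) layout,
 * assuming 8-bit pels so size(pel) == 1. Walking down one column
 * advances by `width` bytes per step, i.e. a new cache line per pel
 * once width exceeds the cache-line size. */
static size_t raster_offset(int x, int y, int width)
{
    return (size_t)y * (size_t)width + (size_t)x;
}
```

For a 720-pel-wide picture, two vertically adjacent pels are 720 bytes apart, far beyond a 32- or 64-byte cache line.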

  5. Mapping on a memory hierarchy subsystem • To overcome this, we can arrange the data so that consecutive block lines sit in consecutive memory locations

  6. Mapping on a memory hierarchy subsystem • So, a natural representation of the chroma/luma matrices is sequential, macroblock line by macroblock line • This way, the needed information is loaded into the cache more quickly

  7. Mapping on a memory hierarchy subsystem • The advantages are immediate • Each macroblock line is 16 pels • So, we can fit two consecutive 16-pel lines in one cache line • The macroblock is now accessed in a natural, sequential order

  8. Mapping on a memory hierarchy subsystem

  9. Mapping on a memory hierarchy subsystem • The biggest remaining problem is access that is not aligned to a macroblock-line boundary • Each macroblock line sits at a 16-pel boundary in our representation so far • For macroblock-line-aligned access, this is great • How about non-macroblock-line-aligned access?

  10. Mapping on a memory hierarchy subsystem • A problem arises: imagine we want to access 16 pels starting from position 4 in a macroblock line • In the original representation this is no problem, since the original picture lines are sequential in memory • In our blocked layout, we end up reading into the next macroblock line

  11. Mapping on a memory hierarchy subsystem • Solution 1: pretend we don’t know about this problem and let the encoder access the wrong pels • Solution 2: check each time whether we are crossing a macroblock-line boundary and proceed accordingly • Solution 3: keep two blocked versions of the picture: the original picture, blocked, and the same picture shifted by 32 pels, blocked

  12. Mapping on a memory hierarchy subsystem • We prefer solution 3 (even though it is more expensive in terms of memory) because this way the pels are accessed more quickly • If pel_pos%32 < 16, we pick up the pels from the blocked version of the original picture • Else, we pick them up from the blocked version of the original picture shifted by 32
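The selection rule on this slide can be sketched as a one-line dispatch (illustrative; `pick_blocked_copy` and its parameters are hypothetical names for the two precomputed layouts):

```c
#include <stdint.h>

/* Choose which of the two blocked copies to read from: offsets whose
 * position within a 32-pel span falls in the first 16 pels read the
 * plain blocked picture, the rest read the shifted blocked copy, so a
 * 16-pel run never crosses a block boundary in the copy it reads. */
static const uint8_t *pick_blocked_copy(int pel_pos,
                                        const uint8_t *blocked,
                                        const uint8_t *blocked_shifted)
{
    return (pel_pos % 32 < 16) ? blocked : blocked_shifted;
}
```

The test is a cheap modulo-and-compare, so the per-access cost is negligible next to the cache misses it avoids.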

  13. Mapping on a memory hierarchy subsystem • 32 pels fit exactly in one cache line (or, on better processors, even 64 pels) • So, each time we access two macroblock lines, we incur at most one cache miss, since both macroblock lines fit into a single cache line

  14. Results • Motion estimation time decreased by approx. 8% compared to the non-blocked exhaustive search • Cache misses/pixel decreased from 600-800 to approx. 15 !!! • Rate-distortion ratio was preserved*

  15. Further optimizations • Assembly language coding of the SAD computation, in particular using the PSADBW MMX/SSE instruction • Multi-threading of the motion compensation algorithm • By using the Performance API (PAPI), we could measure the runtime behavior of the cache and introduce the cache misses into the motion cost function, much like in [1]
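The PSADBW path can be sketched with the SSE2 intrinsic `_mm_sad_epu8`, which sums absolute byte differences into the two 64-bit halves of the result (an x86-only sketch; `sad_row16` is a hypothetical name, and unaligned loads are used for simplicity):

```c
#include <stdint.h>
#include <emmintrin.h>  /* SSE2 */

/* SAD of one 16-pel row in a single PSADBW instruction. The two
 * partial sums (bytes 0-7 and 8-15) land in the low 16 bits of each
 * 64-bit half of the result, so we extract and add them. */
static unsigned sad_row16(const uint8_t *a, const uint8_t *b)
{
    __m128i va = _mm_loadu_si128((const __m128i *)a);
    __m128i vb = _mm_loadu_si128((const __m128i *)b);
    __m128i s  = _mm_sad_epu8(va, vb);  /* psadbw */
    return (unsigned)(_mm_cvtsi128_si32(s) + _mm_extract_epi16(s, 4));
}
```

One instruction thus replaces 16 subtract/abs/accumulate steps of the scalar loop, which is why it is singled out here.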

  16. Further optimizations • Intelligent prefetching of data • Extend the blocking algorithm to the entire motion estimation engine
