Low-Cost Adaptive Data Prefetching


Presentation Transcript


  1. Low-Cost Adaptive Data Prefetching
     Luis M. Ramos, José Luis Briz, Pablo E. Ibáñez and Víctor Viñals
     University of Zaragoza (Spain)
     Euro-Par 2008 - Las Palmas de Gran Canaria - August 26-29th, 2008

  2. Introduction
     Hardware data prefetching
     • Effective at hiding memory latency
     • Recent successful proposals: GHB, SMS
     • Simple mechanisms in commercial processors:
       • UltraSPARC-IIIcu & SPARC64 VI (sequential tagged)
       • Power4 & Power5 (sequential stream buffers)
       • Intel Core (sequential & stride)
     Sequential tagged prefetching (SEQT)
     • Prefetches on a cache miss or on the 1st use of a prefetched block
     • Highest speed-ups
     • High pressure on memory, and performance losses in hostile applications
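
As a rough illustration of the sequential tagged scheme on this slide, the sketch below models a cache as a map from block number to a one-bit tag. The class name `SeqTaggedCache`, the unbounded capacity, the absence of timing, and the fixed degree of 1 are all assumptions made for illustration; none of them comes from the talk.

```python
class SeqTaggedCache:
    """Toy model of sequential tagged prefetching (no capacity limit, no timing)."""

    def __init__(self):
        # block number -> True while the block is prefetched but not yet used
        self.tagged = {}

    def _prefetch(self, block):
        # Bring in a block only if it is absent, and mark it as not yet used.
        if block not in self.tagged:
            self.tagged[block] = True
            return [block]
        return []

    def access(self, block):
        """Demand access. Returns (event, list of blocks prefetched)."""
        if block not in self.tagged:
            # Demand miss: fetch the block itself and prefetch the next one.
            self.tagged[block] = False
            return "miss", self._prefetch(block + 1)
        if self.tagged[block]:
            # First use of a prefetched block: clear the tag, prefetch again.
            self.tagged[block] = False
            return "first_use", self._prefetch(block + 1)
        # Ordinary hit: no prefetch is triggered.
        return "hit", []
```

A sequential walk over blocks 10, 11, 12 produces one miss followed by a chain of first uses, which is the behaviour that lets the prefetcher stay ahead of the demand stream.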

  3. Introduction
     Our aim:
     • Use the simplest prefetcher (SEQT)
     • Evaluate degree-distance policies and adaptive mechanisms
     • Compare them with: Stride, GHB, P-DFCM, SMS

  4. Outline
     • Prefetching mechanisms
     • Experimental framework and benchmarks
     • Preliminary results
       • Performance
       • Pressure on memory
     • Degree-distance policies
     • Results
     • Conclusions and future work

  5. Prefetching mechanisms
     Stride prefetching
     • Addresses separated by a constant distance
     • Table indexed by PC
     • On-miss insertion [Ibáñez et al. 98]
     SMS (Spatial Memory Streaming)
     • Spatial access patterns
     • Prefetches blocks inside a memory region
     • Avoids useless blocks
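
The stride scheme listed above can be sketched as a PC-indexed table that learns the distance between consecutive addresses of one load. This is a minimal sketch under stated assumptions: the class name `StridePrefetcher`, the unbounded table, and the rule "prefetch once the same non-zero stride is seen twice in a row" are illustrative choices, not details from the talk. Only the PC indexing and the on-miss insertion come from the slide.

```python
class StridePrefetcher:
    """Toy PC-indexed stride table with on-miss insertion."""

    def __init__(self):
        # pc -> [last address, last observed stride]
        self.table = {}

    def access(self, pc, addr, miss):
        """Record one access; return the address to prefetch, or None."""
        entry = self.table.get(pc)
        if entry is None:
            # On-miss insertion: a PC gets an entry only when it misses.
            if miss:
                self.table[pc] = [addr, 0]
            return None
        last_addr, stride = entry
        new_stride = addr - last_addr
        entry[0] = addr
        entry[1] = new_stride
        # Assumed confidence rule: the same non-zero stride twice in a row.
        if new_stride != 0 and new_stride == stride:
            return addr + new_stride
        return None
```

For a load at PC `0x40` touching addresses 100, 164, 228, the third access returns 292 as the prefetch candidate.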

  6. Prefetching mechanisms
     Correlating prefetchers
     • Tables store the program's memory behaviour (addresses or deltas)
     • Indexed by address or PC
     GHB (Global History Buffer) → PC/DC
     • Focused on reducing table sizes
     • 2 tables, several accesses to calculate deltas
     P-DFCM
     • Based on the DFCM value predictor
     • 2 tables, delta stream used to predict the next delta

  7. Experimental framework and benchmarks
     • SimpleScalar 3.0
     • Alpha binaries
     • Aggressive superscalar processor
     • 3-level memory hierarchy (Itanium2); figure labels: 16 KB, 256 KB, 4 MB
     • Spec2k
     • Simple Simpoints
     • 200 M instruction warming
     • Selection rule: ideal L2 speed-up > 2%

  8. Preliminary results: performance
     [Figure: a) CINT, b) CFP]

  9. Preliminary results: pressure

  10. Preliminary results: breakdown per application

  11. Degree-distance policies: Deg(4)
      [Figure: timeline of blocks i to i+8; legend: prefetch, demand hit, demand miss, 1st use of a prefetch]
      Deg(x): on a miss and on a 1st use, prefetches x blocks

  12. Degree-distance policies: Dist(4)
      [Figure: timeline of blocks i to i+8; legend: prefetch, demand hit, demand miss, 1st use of a prefetch]
      Dist(x): on a miss and on a 1st use, prefetches the x-th block

  13. Degree-distance policies: Deg-dist(4)
      [Figure: timeline of blocks i to i+8; legend: prefetch, demand hit, demand miss, 1st use of a prefetch]
      Deg-dist(x):
      • on a miss → x blocks
      • on a 1st use → the x-th block

  14. Degree-distance policies: Deg(1-4)
      [Figure: timeline of blocks i to i+8; legend: prefetch, demand hit, demand miss, 1st use of a prefetch]
      Deg(1-x):
      • deg_miss = 1
      • deg_1st use = x
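
The four static policies on slides 11 to 14 differ only in which blocks they request after a trigger, so they can be compared side by side. The sketch below is one possible reading: the function name `prefetch_candidates`, the policy strings, and the interpretation of "the x-th block" as block + x counted from the triggering block are assumptions. A real prefetcher would also drop candidates that are already cached, which is not modelled here.

```python
def prefetch_candidates(policy, block, event, x):
    """Blocks requested after `event` ('miss' or 'first_use') on `block`."""
    if event not in ("miss", "first_use"):
        raise ValueError("event must be 'miss' or 'first_use'")

    if policy == "deg":
        # Deg(x): x consecutive blocks on either trigger.
        return [block + k for k in range(1, x + 1)]
    if policy == "dist":
        # Dist(x): only the block at distance x on either trigger.
        return [block + x]
    if policy == "deg-dist":
        # Deg-dist(x): x blocks on a miss, the x-th block on a first use.
        if event == "miss":
            return [block + k for k in range(1, x + 1)]
        return [block + x]
    if policy == "deg(1-x)":
        # Deg(1-x): degree 1 on a miss, degree x on a first use.
        degree = 1 if event == "miss" else x
        return [block + k for k in range(1, degree + 1)]
    raise ValueError(f"unknown policy: {policy}")
```

With x = 4 and block 10, `deg-dist` requests blocks 11 to 14 on a miss but only block 14 on a first use, which keeps a steady stream at one new block per trigger.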

  15. Degree-distance policies: Ad1(4)
      [Figure: timeline of blocks i to i+7 with the current degree (deg) at each step; legend: prefetch, demand hit, demand miss, 1st use of a prefetch]
      Ad1(x):
      • deg_miss = 1
      • deg_1st use = f(usefulness), in [0..x]
      • 100x → deg--
      • 50x → deg++
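
The idea of a degree that follows prefetch usefulness can be sketched with a saturating counter. This is only an illustration of the general mechanism: the class name `AdaptiveDegree`, the per-epoch usefulness ratio, and the 0.25 / 0.75 thresholds are assumptions, and the sketch does not attempt to reproduce the slide's "100x" and "50x" conditions, whose exact meaning the transcript does not give.

```python
class AdaptiveDegree:
    """Toy adaptive degree: fixed degree 1 on a miss, variable on a first use."""

    def __init__(self, max_degree, low=0.25, high=0.75, initial=1):
        if not 0 <= initial <= max_degree:
            raise ValueError("initial degree must lie in [0, max_degree]")
        self.max_degree = max_degree
        self.low = low      # assumed threshold below which the degree shrinks
        self.high = high    # assumed threshold above which the degree grows
        self.degree = initial

    def degree_for(self, event):
        # deg_miss = 1; deg_1st use is the adaptive value in [0..x].
        return 1 if event == "miss" else self.degree

    def end_epoch(self, issued, useful):
        """Adjust the degree from one epoch's counts; return the new degree."""
        if issued == 0:
            return self.degree
        ratio = useful / issued
        if ratio < self.low:
            self.degree = max(0, self.degree - 1)
        elif ratio > self.high:
            self.degree = min(self.max_degree, self.degree + 1)
        return self.degree
```

Because the degree on a miss stays at 1, the prefetcher can still restart a stream after the adaptive degree has dropped to 0.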

  16. Degree-distance policies: Ad2(4)
      [Figure: timelines of blocks i-1 to i+4 and k-4 to k+1 with the current degree (deg) at each step; legend: prefetch, demand hit, demand miss, 1st use of a prefetch]
      Ad2(x):
      • deg_miss = 1 (both directions)
      • deg_1st use = f(usefulness), in [0..x]
      • 100x → deg--
      • 50x → deg++

  17. Degree-distance policies: Ad3(4)
      [Figure: timeline of blocks i to i+7 with the current degree (deg) at each step; legend: prefetch, demand hit, demand miss, 1st use of a prefetch]
      Ad3(x):
      • deg_miss = 1
      • deg_1st use = f(usefulness, timeliness, pollution), in [0..x]
      • 100x → deg--
      • 100x pollution → deg--
      • 50x → deg++
      • 50x late → deg++

  18. Degree-distance policies: Ad4(4,32)
      [Figure: timelines of blocks i-1 to i+4 and k-4 to k+1 with the current degree (deg) at each step; legend: prefetch, demand hit, demand miss, 1st use of a prefetch]
      Ad4(x,y):
      • region in [0..y-1]
      • deg_1st use = f(usefulness, region), in [0..x]
      • 100x → deg--
      • 50x → deg++

  19. Degree-distance policies: Ad5(4)
      [Figure: timeline of blocks i to i+7 with the current degree (deg) at each step; legend: prefetch, demand hit, demand miss, 1st use of a prefetch]
      Ad5(x) [Dahlgren-93]:
      • deg = f(usefulness), in [0..x]
      • Same degree on a miss and on a 1st use
      • A mechanism is needed when deg == 0

  20. Results: performance
      [Figure: a) CINT, b) CFP]
      • SMS as reference
      • Ad policies have no losses
      • INT → deg 4 or 8
      • FP → deg 8 or 16
      • Dist & Ad5 are the worst
      • The rest are similar to Deg
      • Among the Ad policies:
        • INT → Ad4(8,32) (diff. 1%)
        • FP → Ad3(8) (diff. 1% - 5%)
      • Ad4(8,32) & Ad2(8) are the best on average

  21. Results: pressure

  22. PAB
      [Figure: Deg(4) prefetch engine, PAB (4 entries) and L2, with prefetch addresses i+1 to i+5 in flight]

  23. PAB as a filter
      [Figure: Deg(4) prefetch engine, PAB (4 entries) and L2, with prefetch addresses i+1 to i+5 in flight]
      • L2 lookup reduction:
        • 2% for Deg-dist
        • 49% for SMS (which nevertheless remains the most demanding)
        • 25%-40% for the rest
      • Performance unaffected
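
The filtering effect described on this slide can be sketched as a small FIFO that sits between the prefetch engine and L2 and discards addresses it already holds. The transcript does not expand the acronym, so the sketch assumes the PAB is a buffer of pending prefetch addresses. The class name `PrefetchAddressBuffer`, the drop-oldest policy when the buffer is full, and the `filtered` counter are likewise assumptions for illustration.

```python
from collections import deque


class PrefetchAddressBuffer:
    """Toy buffer of pending prefetch addresses that filters duplicates."""

    def __init__(self, entries=4):
        self.entries = entries
        self.pending = deque()
        self.filtered = 0  # duplicates discarded, i.e. L2 lookups avoided

    def push(self, block):
        """Queue a prefetch address; return False if it was filtered out."""
        if block in self.pending:
            self.filtered += 1
            return False
        if len(self.pending) == self.entries:
            # Assumed overflow policy: discard the oldest pending address.
            self.pending.popleft()
        self.pending.append(block)
        return True

    def pop(self):
        """Hand the oldest pending address to L2, or None if empty."""
        return self.pending.popleft() if self.pending else None
```

With a Deg(4) engine, two consecutive triggers on blocks i and i+1 request i+1..i+4 and then i+2..i+5. Three of the four addresses in the second burst are already pending, so only i+5 reaches the buffer.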

  24. Conclusions and future work
      • Ways of tuning the aggressiveness of SEQT prefetchers
      • Ad2(8) and Ad4(8,32) perform the best
        • Adaptive: vary the degree according to prefetch usefulness
        • Ad2 prefetches forward and backward
        • Ad4 adjusts the degree for each of the 32 memory regions
        • Both equal SMS in CINT and outperform it in CFP (60% fewer lookups in L2)
        • Ad2: 2 bits/line; Ad4: 2 bits + 64 B table; SMS: 33 KB
      • PAB used to reduce the pressure on L2 (25%-40%)
      • No losses and very low hardware cost
      • Future work: use a realistic on-chip memory controller

  25. Thank you
