
Efficient Techniques for Bandwidth-Efficient Prefetching of LDS in Hybrid Systems

Explore advanced methods for prefetching linked data structures to enhance system performance by combining stream and content-directed prefetching. Evaluate and address challenges to improve prefetch accuracy.





Presentation Transcript


  1. Techniques for Bandwidth-Efficient Prefetching of Linked Data Structures in Hybrid Prefetching Systems • Eiman Ebrahimi*, Onur Mutlu‡, Yale N. Patt* • * HPS Research Group, University of Texas at Austin • ‡ Computer Architecture Laboratory, Carnegie Mellon University

  2. Motivation • Prefetching can significantly reduce the impact of memory latency on performance • Stream prefetching is very useful but cannot reduce the latency of many misses • Access patterns that follow pointers in linked data structures (LDS) are prevalent in many applications • High-performance and bandwidth-efficient LDS prefetchers are needed

  3. Potential Performance • [Chart: IPC delta of ideal LDS prefetching over stream prefetching; one data label reads 615]

  4. Our Goal • Develop techniques that 1) Enable low-cost and bandwidth-efficient prefetching of linked data structure accesses 2) Efficiently combine such prefetchers with commonly employed stream-based prefetchers

  5. Our Goal • Develop techniques that 1) Enable low-cost and bandwidth-efficient prefetching of linked data structure accesses 2) Efficiently combine such prefetchers with commonly employed stream-based prefetchers

  6. Outline • Background • Efficient Content Directed LDS Prefetching • Managing Multiple Prefetchers in a Hybrid Prefetching System • Evaluation • Conclusion

  7. Content-Directed Prefetching (CDP) (Cooksey et al. ASPLOS ’02) • Requires no state → Attractive approach • Searches for pointers as data is fetched from memory • Virtual address predictor • Compares high-order bits of values within the cache line with the cache line’s address • Generates prefetch requests on a match

  8. Content-Directed Prefetching (CDP) • [Figure: a cache line at address x80011100 arrives from DRAM into the L2 holding, among others, the values x80022220 and x40373551; the virtual address predictor compares bits [31:20] of each value with bits [31:20] of the line address and generates a prefetch for x80022220, whose upper bits (x800) match]
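The comparison drawn on this slide can be sketched in C. This is a minimal illustration under stated assumptions, not the hardware described by Cooksey et al.: the function name `cdp_scan_line`, the 64-byte line of sixteen 4-byte values, and the `compare_bits` parameter are all hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define WORDS_PER_LINE 16   /* assumed: 64-byte line of 4-byte values */

/* Scan one fetched cache line and collect every value whose top
 * `compare_bits` bits equal those of the line's own address.
 * Returns the number of candidate prefetch addresses written to out[]. */
size_t cdp_scan_line(uint32_t line_addr,
                     const uint32_t words[WORDS_PER_LINE],
                     unsigned compare_bits,
                     uint32_t out[WORDS_PER_LINE])
{
    assert(words != NULL && out != NULL);
    assert(compare_bits >= 1 && compare_bits <= 32);

    unsigned shift = 32u - compare_bits;
    size_t n = 0;
    for (size_t i = 0; i < WORDS_PER_LINE; i++) {
        if ((words[i] >> shift) == (line_addr >> shift)) {
            out[n++] = words[i];   /* looks like a pointer: prefetch it */
        }
    }
    return n;
}
```

With 12 compare bits (bits [31:20], as in the figure) and a line address of 0x80011100, the value 0x80022220 is selected and 0x40373551 is not.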

  9. Shortcomings of CDP • CDP prefetches all identified pointers • Indiscriminate prefetching of all discovered pointers leads to • Low prefetch accuracy • High cache pollution • High bandwidth consumption

  10. Shortcomings of CDP – An example (from mst)
      struct node {
          int Key;
          int *D1_ptr;
          int *D2_ptr;
          struct node *Next;
      };
      int *HashLookup(int Key) {
          …
          /* walk the chain until the key matches or the list ends */
          for (node = head; node && node->Key != Key; node = node->Next)
              ;
          if (node) return node->D1_ptr;
          return NULL;
      }
      [Figure: linked list of nodes, each holding a Key and pointers to its D1 and D2 data]

  11. Shortcomings of CDP – An example • [Figure: the virtual address predictor compares bits [31:20] of every value in the fetched cache line (the Key, D1_ptr, D2_ptr, and Next fields of the nodes in that line) with bits [31:20] of the cache line address; below it, the linked list of nodes with their D1 and D2 data]

  12. Shortcomings of CDP – An example • HashLookup(int Key) { … for (node = head; node && node->Key != Key; node = node->Next) ; if (node) return node->D1_ptr; } • [Figure: linked list of nodes, each holding a Key and pointers to its D1 and D2 data]

  13. Shortcomings of CDP – An example • [Figure: the virtual address predictor compares bits [31:20] of every value in the fetched cache line (the Key, D1_ptr, D2_ptr, and Next fields of the nodes in that line) with bits [31:20] of the cache line address; below it, the linked list of nodes with their D1 and D2 data]

  14. Outline • Background • Efficient Content Directed LDS Prefetching • Managing Multiple Prefetchers in a Hybrid Prefetching System • Evaluation • Conclusion

  15. Efficient Content Directed Prefetching (ECDP) – Basic Idea • A compiler-guided technique that identifies likely-useful pointer addresses to prefetch • Compiler profiles and provides hints as to which pointer addresses are likely useful to prefetch • Hardware uses hints to prefetch only likely-useful pointers

  16. Terminology – Pointer Group (PG) • struct node { int data; int key; node *left; node *right; } • … LD1: data = node->data; node = node->left; • PG(L, X) = { all pointers at offset X from the byte accessed by instruction L } • [Figure: successive nodes laid out as data, key, left, right; LD1 accesses data, and the pointers P1, P2, … sit at offset 8 from the accessed byte] • PG(LD1, 8) = {P1, P2, etc.}
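The offset in PG(L, X) can be checked with a small C sketch. It assumes the 32-bit layout implied by the slide (4-byte `int` and 4-byte pointers), which is modeled here with `uint32_t` fields so the arithmetic holds on any host; `struct node32` and `pg_offset` are hypothetical names.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed 32-bit layout of the slide's node: pointers stored as
 * 4-byte values so the offsets do not depend on the host machine. */
struct node32 {
    uint32_t data;
    uint32_t key;
    uint32_t left;    /* stands in for node *left  */
    uint32_t right;   /* stands in for node *right */
};

/* Offset X of a pointer field relative to the byte a load accesses. */
long pg_offset(size_t field_offset, size_t accessed_offset)
{
    assert(field_offset <= (size_t)LONG_MAX);
    assert(accessed_offset <= (size_t)LONG_MAX);
    return (long)field_offset - (long)accessed_offset;
}
```

Under this layout, a load that accesses `data` sees `left` at offset 8 and `right` at offset 12, so the `left` pointers of successive nodes form PG(LD1, 8).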

  17. Efficient Content-Directed Prefetching (ECDP) • The PG definition naturally associates a number of PGs with each load instruction 1) Compile-time profiling classifies PGs into beneficial/harmful 2) Hardware prefetches PGs that are beneficial • Information is conveyed to hardware with a hint bit vector embedded into the load instruction

  18. Beneficial vs Harmful PG • PG1 = {P1, P2, P3} • [Figure: three chains of data/key/left/right nodes, one reached through each pointer of PG1, annotated with prefetch counts] • Profiled prefetches through the three pointers: 25, 50, and 33 useful; 12, 9, and 10 useless • 25 + 50 + 33 > 12 + 9 + 10, so PG1’s useful prefetches > PG1’s useless prefetches • A pointer group whose majority of prefetches are useful is classified as beneficial
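The classification rule on this slide amounts to a sum and a comparison. The sketch below is an assumption about how a profiler might record the counts, not the authors' compiler pass; `struct pg_profile` and `pg_is_beneficial` are hypothetical, and the pairing of each useful count with a useless count is assumed (the totals do not depend on it).

```c
#include <assert.h>
#include <stddef.h>

struct pg_profile {        /* hypothetical per-pointer profile record */
    unsigned useful;       /* prefetches that were later used */
    unsigned useless;      /* prefetches that were never used */
};

/* Returns 1 if the group's useful prefetches outnumber its useless
 * ones (beneficial), 0 otherwise (harmful). */
int pg_is_beneficial(const struct pg_profile *ptrs, size_t count)
{
    assert(ptrs != NULL || count == 0);

    unsigned long useful = 0, useless = 0;
    for (size_t i = 0; i < count; i++) {
        useful  += ptrs[i].useful;
        useless += ptrs[i].useless;
    }
    return useful > useless;
}
```

For the slide's numbers, 108 useful against 31 useless prefetches, PG1 is classified as beneficial.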

  19. ECDP mechanism – Example • LD1’s associated beneficial pointer groups: PG1 = {LD1, 8}, PG2 = {LD1, 24}, PG3 = {LD1, 44} • Assuming 4-byte address values, these set bit 2, bit 6, and bit 11 of LD1’s hint bit vector: 0 0 1 0 0 0 1 0 0 0 0 1 0 0 0 0 • [Figure: consecutive data/key/left/right nodes starting at the byte LD1 accesses; pointers at offsets 8, 24, and 44 are marked Prefetch, and pointers at other offsets, such as offset 12, are marked Don’t Prefetch]
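The mapping from offsets to hint bits in this example (offset 8 to bit 2, 24 to bit 6, 44 to bit 11) is offset divided by the 4-byte address size. A minimal C sketch follows; the 16-bit width, the helper names, and numbering bits from the least significant end are assumptions made for illustration, and only non-negative offsets are handled.

```c
#include <assert.h>
#include <stdint.h>

#define ADDR_BYTES 4u     /* slide's assumption: 4-byte address values */
#define HINT_BITS  16u    /* assumed: width of the vector drawn on the slide */

/* Mark the pointer group at byte `offset` from the accessed byte
 * as beneficial by setting its hint bit. */
uint16_t hint_set(uint16_t hints, unsigned offset)
{
    assert(offset % ADDR_BYTES == 0);
    assert(offset / ADDR_BYTES < HINT_BITS);
    return (uint16_t)(hints | (1u << (offset / ADDR_BYTES)));
}

/* Should a pointer found `offset` bytes from the accessed byte be
 * prefetched? Unaligned or out-of-range offsets are never prefetched. */
int hint_allows(uint16_t hints, unsigned offset)
{
    if (offset % ADDR_BYTES != 0 || offset / ADDR_BYTES >= HINT_BITS) {
        return 0;
    }
    return (int)((hints >> (offset / ADDR_BYTES)) & 1u);
}
```

Setting offsets 8, 24, and 44 yields a vector with bits 2, 6, and 11 set, so a pointer at offset 12 is filtered out while one at offset 8 is prefetched.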

  20. Outline • Background • Efficient Content Directed LDS Prefetching • Managing Multiple Prefetchers in a Hybrid Prefetching System • Evaluation • Conclusion

  21. Managing Multiple Prefetchers in a Hybrid Prefetching System • ECDP can be complementary to a stream prefetcher • Multiple prefetchers can deny service to each other as they contend for • Memory request buffer entries • DRAM bus bandwidth and DRAM banks • Cache space • Unmanaged use of multiple prefetchers causes • Performance degradation • Inability to gain the full performance benefit of using multiple prefetchers

  22. Coordinated Throttling of Multiple Prefetchers – Basic Idea • Dynamic feedback gathered for every prefetcher in the system • Simple heuristics use feedback to adapt each prefetcher's aggressiveness

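  23. Adapting Stream Prefetcher Aggressiveness • Stream prefetcher aggressiveness • Prefetch distance • Prefetch degree • [Figure: an access stream with demand accesses A, A+1 and prefetched lines P, P+1, P+2, P+3, P+4 ahead of them; prefetch distance and prefetch degree are marked on the stream]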
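One common reading of the two knobs is that distance bounds how far the prefetcher may run ahead of the demand stream, and degree bounds how many prefetches are issued per trigger. The C sketch below follows that reading for an ascending stream; `struct stream` and `stream_on_access` are hypothetical and are not taken from the presentation.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct stream {                 /* hypothetical tracking entry */
    uint64_t last_prefetched;   /* P: last line address prefetched */
    unsigned distance;          /* max lines P may run ahead of demand */
    unsigned degree;            /* max prefetches issued per trigger */
};

/* On a demand access to line `a`, write up to `degree` new line
 * addresses into out[] (which must hold at least `degree` entries),
 * never going past a + distance. Returns how many were issued. */
size_t stream_on_access(struct stream *s, uint64_t a, uint64_t *out)
{
    assert(s != NULL && out != NULL);

    size_t n = 0;
    if (s->last_prefetched < a) {
        s->last_prefetched = a;        /* demand caught up with P */
    }
    while (n < s->degree && s->last_prefetched < a + s->distance) {
        out[n++] = ++s->last_prefetched;
    }
    return n;
}
```

Throttling up or down would then mean raising or lowering `distance` and `degree` for the entry.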

  24. Adapting CDP Aggressiveness • Each memory request is assigned a depth value • A demand-accessed line is assigned depth 0 • [Figure: the line fetched by a demand access (depth 0) holds ptr1, ptr2, ptr3, ptr4, …; the lines prefetched through those pointers (depth 1) hold ptr5, ptr6, ptr7, …; the next level (depth 2) holds ptr8, ptr9, ptr10, …] • CDP aggressiveness • Maximum allowed prefetch depth
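The depth rule can be written as two small C helpers. This is a sketch under the assumption that a prefetch inherits its parent line's depth plus one and is dropped once it would exceed the current maximum; the function names are hypothetical.

```c
#include <assert.h>
#include <limits.h>

/* Depth of a prefetch generated by scanning a line of depth
 * `parent_depth` (demand-fetched lines have depth 0). */
unsigned cdp_child_depth(unsigned parent_depth)
{
    assert(parent_depth < UINT_MAX);
    return parent_depth + 1u;
}

/* Pointers found in a line are prefetched only if the resulting
 * request stays within the current maximum allowed depth. */
int cdp_may_prefetch(unsigned parent_depth, unsigned max_depth)
{
    return cdp_child_depth(parent_depth) <= max_depth;
}
```

Lowering `max_depth` makes the prefetcher less aggressive: with a maximum depth of 4, pointers in a depth-3 line are still followed, while pointers in a depth-4 line are not.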

  25. Coordinated Prefetcher Aggressiveness Control Policies • Each prefetcher adapts its own aggressiveness • [Figure: the stream prefetcher and the content-directed prefetcher both send prefetches to the shared memory resources and both receive feedback; each is the deciding prefetcher for its own decision and the rival for the other’s] • The goal: allow the prefetcher most likely to improve performance to use more shared resources • The deciding prefetcher adapts its own aggressiveness based on • Deciding prefetcher coverage and accuracy • Rival prefetcher coverage

  26. Coordinated Prefetcher Aggressiveness Control Policies
      Columns: deciding prefetcher feedback | rival feedback | action | reason
      (a) Deciding Cov Low, Deciding Acc Low | any | throttle down | avoid unnecessary bandwidth consumption and cache pollution
      (b) Deciding Cov Low, Deciding Acc Med | Rival Cov High | throttle down | give rival prefetcher chance to use more shared resources
      (c) Deciding Cov Low, Deciding Acc Med or High | Rival Cov Low | throttle up | give deciding prefetcher chance to improve coverage
      (d) Deciding Cov Low, Deciding Acc High | Rival Cov High | do nothing | deciding prefetcher not causing trouble, rival performing well
      (e) Deciding Cov High | any | throttle up | deciding prefetcher performing well, avoid performance loss
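One reading of this policy table can be expressed as a single decision function. The sketch treats the first row as case (a), assumes that low deciding coverage applies to every case except (e), and assumes case (c) covers medium or high accuracy; the enum and function names are hypothetical, and the thresholds that map raw measurements to Low/Med/High are not modeled.

```c
#include <assert.h>

enum level  { LOW, MEDIUM, HIGH };
enum action { THROTTLE_DOWN, DO_NOTHING, THROTTLE_UP };

/* Decide how the deciding prefetcher changes its own aggressiveness.
 * Coverage is two-level (LOW/HIGH); accuracy is three-level. */
enum action throttle_decision(enum level deciding_cov,
                              enum level deciding_acc,
                              enum level rival_cov)
{
    assert(deciding_cov != MEDIUM && rival_cov != MEDIUM);

    if (deciding_cov == HIGH)   return THROTTLE_UP;    /* case (e) */
    if (deciding_acc == LOW)    return THROTTLE_DOWN;  /* case (a) */
    if (rival_cov == LOW)       return THROTTLE_UP;    /* case (c) */
    if (deciding_acc == MEDIUM) return THROTTLE_DOWN;  /* case (b) */
    return DO_NOTHING;                                 /* case (d) */
}
```

Each prefetcher would call this with itself as the deciding prefetcher and the other as the rival, so the stream prefetcher and CDP are throttled independently from the same feedback.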

  27. Hardware Cost of all Techniques • Major components • ‘prefetched’ bits for each L2 line – used to account for useful/useless prefetches • Eleven 16-bit counters to estimate prefetcher coverage and accuracy

  28. Outline • Background • Efficient Content Directed LDS Prefetching • Managing Multiple Prefetchers in a Hybrid Prefetching System • Evaluation • Conclusion

  29. Evaluation Methodology • Cycle-accurate x86 simulator • Baseline processor configuration • Per core: 4-wide issue, out-of-order, 256-entry ROB; 1 MB, 8-way L2 cache; stream prefetcher with 32 streams, prefetch degree 4, prefetch distance 32; content-directed prefetcher with 8 compare bits, max depth 4 • Shared: 450-cycle memory latency; 8 B wide core-to-memory bus; 32, 64, 128 L2 MSHRs for 1-, 2-, 4-core • Coordinated prefetcher throttling thresholds

  30. Overall Performance • [Chart: overall performance results; highlighted value 22.5%]

  31. Memory Bandwidth Consumption • [Chart: memory bandwidth consumption results; one data label reads 375]

  32. Comparison to other LDS/Correlation Prefetchers

  33. Summary of Other Results • Further comparisons and analysis are presented in the paper • Feedback Directed Prefetching • 5% avg. improvement • HW Prefetch Filtering • 17% avg. improvement • Multi-core Results • Dual Core (10.4% avg. improvement) • Quad Core (9.5% avg. improvement) • Effects of techniques on prefetcher accuracy and coverage

  34. Outline • Background • Efficient Content Directed LDS Prefetching • Managing Multiple Prefetchers in a Hybrid Prefetching System • Evaluation • Conclusion

  35. Conclusion • Developed a low-cost and bandwidth-efficient HW/SW cooperative linked data structure prefetcher • ECDP utilizes compiler hints to prefetch only likely-useful pointers • Inter-prefetcher interference can destroy potential performance • Coordinated throttling manages interference between multiple prefetchers • Efficient integration of ECDP with stream prefetching • Improves average performance by 22% over stream prefetching alone • Reduces bandwidth consumption by 25%

  36. Thank you ! Questions ?

  37. Comparison to Feedback-Directed Prefetching (Srinath et al. HPCA ‘07)

  38. Comparison to HW Prefetch Filtering (Zhuang and Lee ICPP ‘03)

  39. Performance on Dual-Core

  40. Performance on Quad-Core

  41. Stream Prefetcher Accuracy

  42. CDP Accuracy

  43. Sensitivity Study
