Explore advanced methods for prefetching linked data structures to enhance system performance by combining stream and content-directed prefetching. Evaluate and address challenges to improve prefetch accuracy.
Techniques for Bandwidth-Efficient Prefetching of Linked Data Structures in Hybrid Prefetching Systems Eiman Ebrahimi* Onur Mutlu‡ Yale N. Patt* * HPS Research Group University of Texas at Austin ‡ Computer Architecture Laboratory Carnegie Mellon University
Motivation • Prefetching can significantly reduce memory latency impact on performance • Stream prefetching very useful but unable to reduce latency of many misses • Access patterns that follow pointers in linked data structures (LDS) prevalent in many applications • High-performance and bandwidth-efficient LDS prefetchers are needed
Potential Performance [Figure: IPC delta of ideal LDS prefetching over stream prefetching; the only data label that survives is 615.]
Our Goal Develop techniques that 1) Enable low cost and bandwidth-efficient prefetching of linked data structure accesses 2) Efficiently combine such prefetchers with commonly-employed stream based prefetchers
Outline • Background • Efficient Content Directed LDS Prefetching • Managing Multiple Prefetchers in a Hybrid Prefetching System • Evaluation • Conclusion
Content-Directed Prefetching (CDP) (Cooksey et al. ASPLOS ’02) • Requires no state (an attractive approach) • Searches for pointers as data is fetched from memory • Virtual address predictor • Compares high order bits of values within cache line with cache line’s address • Generates prefetch requests on a match
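Content-Directed Prefetching (CDP) [Figure: a cache line at address x80022220 is fetched from DRAM into the L2. The virtual address predictor compares bits [31:20] of every value in the line (for example x40373551 and x80011100) with bits [31:20] of the line's address. The value x80011100 matches (x800 = x800), so a prefetch is generated for it.]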
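The comparison shown in the figure can be sketched in C as follows. This is a minimal illustration, not Cooksey et al.'s hardware design: the 64-byte line size, the function names and the `compare_bits` parameter are assumptions made for the example, and any further filtering the original proposal may apply is omitted. With `compare_bits = 12` the check covers bits [31:20], as in the figure.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define LINE_BYTES 64u                 /* assumed cache line size */
#define WORD_BYTES 4u                  /* 32-bit address values, as in the figure */
#define WORDS_PER_LINE (LINE_BYTES / WORD_BYTES)

/* Nonzero if `value` shares its top `compare_bits` bits with `line_addr`. */
static int looks_like_pointer(uint32_t value, uint32_t line_addr,
                              unsigned compare_bits)
{
    assert(compare_bits >= 1 && compare_bits <= 32);
    unsigned shift = 32u - compare_bits;
    return (value >> shift) == (line_addr >> shift);
}

/* Scans one fetched line and writes every value that looks like a pointer
 * to `out`. Returns the number of prefetch candidates found. */
static size_t cdp_scan_line(const uint32_t line[WORDS_PER_LINE],
                            uint32_t line_addr, unsigned compare_bits,
                            uint32_t out[WORDS_PER_LINE])
{
    size_t n = 0;
    for (size_t i = 0; i < WORDS_PER_LINE; i++) {
        if (looks_like_pointer(line[i], line_addr, compare_bits))
            out[n++] = line[i];
    }
    return n;
}
```

Applied to the figure's values, only x80011100 is reported as a candidate for the line at x80022220. Because every matching value becomes a prefetch, the sketch also makes plain why plain CDP is indiscriminate.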
Shortcomings of CDP • CDP prefetches all identified pointers • Indiscriminate prefetching of all discovered pointers leads to • Low prefetch accuracy • High cache pollution • High bandwidth consumption
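Shortcomings of CDP – An example
struct node {
    int Key;
    int *D1_ptr;
    int *D2_ptr;
    struct node *Next;
};

int *HashLookup(int Key) {
    …
    /* Walk the chain until the key matches or the list ends. */
    for (node = head; node && node->Key != Key; node = node->Next)
        ;
    if (node)
        return node->D1_ptr;
    return NULL;  /* key not found */
}
[Figure: linked list of nodes; each node holds Key, D1_ptr, D2_ptr and Next, and D1_ptr and D2_ptr point to separate data blocks D1 and D2.] Example from mst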
Shortcomings of CDP – An example [Figure: a fetched cache line holding the Key, D1_ptr, D2_ptr and Next fields of two nodes. The virtual address predictor compares bits [31:20] of each field with bits [31:20] of the cache line address, so every pointer field is a prefetch candidate.]
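Shortcomings of CDP – An example
int *HashLookup(int Key) {
    …
    for (node = head; node && node->Key != Key; node = node->Next)
        ;
    if (node)
        return node->D1_ptr;
    return NULL;  /* key not found */
}
The traversal dereferences only Next. In this function D1_ptr is used once, for the matching node, and D2_ptr is never used, so most D1_ptr and D2_ptr prefetches issued along the way are useless.
[Figure: linked list of nodes with their D1 and D2 data blocks.]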
Shortcomings of CDP – An example [Figure: the cache line fields Key, D1_ptr, D2_ptr and Next, each compared on bits [31:20] with the cache line address by the virtual address predictor.]
Outline • Background • Efficient Content Directed LDS Prefetching • Managing Multiple Prefetchers in a Hybrid Prefetching System • Evaluation • Conclusion
Efficient Content Directed Prefetching (ECDP) – Basic Idea • A compiler-guided technique that identifies likely-useful pointer addresses to prefetch • Compiler profiles and provides hints as to which pointer addresses are likely useful to prefetch • Hardware uses hints to prefetch only likely-useful pointers
Terminology – Pointer Group (PG)
struct node {
    int data;
    int key;
    struct node *left;
    struct node *right;
};
…
LD1: data = node->data;
node = node->left;
PG(L, X) = { all pointers at offset X from byte accessed by instruction L }
[Figure: consecutive nodes laid out as data, key, left, right. In each node LD1 accesses data, and with 4-byte fields the pointer at offset 8 from that byte is left (P1, P2, ...).]
PG(LD1, 8) = {P1, P2, etc.}
Efficient Content-Directed Prefetching (ECDP) • The PG definition naturally associates a number of PGs with each load instruction 1) Compile-time profiling classifies PGs into beneficial/harmful 2) Hardware prefetches PGs that are beneficial - Information conveyed to hardware with hint bit vector embedded into the load instruction
Beneficial vs Harmful PG
PG1 = {P1, P2, P3}
[Figure: node list (data, key, left, right) in which the prefetches triggered by the pointers of PG1 are annotated with useful and useless counts: 25, 50 and 33 useful; 12, 9 and 10 useless.]
25 + 50 + 33 > 12 + 9 + 10
PG1's useful prefetches > PG1's useless prefetches
A pointer group whose majority of prefetches are useful is classified as beneficial
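The classification rule can be written as a small C sketch. The struct and function names are hypothetical, and pairing each useful count with one useless count per pointer is an assumption made for the example. Only the two sums matter to the rule on the slide.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Profiled outcome of the prefetches triggered by one pointer of a group. */
struct pointer_profile {
    unsigned useful;
    unsigned useless;
};

/* A pointer group is beneficial when its useful prefetches outnumber its
 * useless ones, summed over every pointer in the group. */
static bool pg_is_beneficial(const struct pointer_profile *ptrs, size_t count)
{
    assert(ptrs != NULL || count == 0);
    unsigned long useful = 0, useless = 0;
    for (size_t i = 0; i < count; i++) {
        useful += ptrs[i].useful;
        useless += ptrs[i].useless;
    }
    return useful > useless;
}
```

With the slide's numbers the sums are 108 useful against 31 useless, so PG1 is classified as beneficial.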
ECDP mechanism - Example
LD1's associated beneficial pointer groups: PG1 = {LD1, 8}, PG2 = {LD1, 24}, PG3 = {LD1, 44}
Assuming 4 byte address values, offsets 8, 24 and 44 map to bit 2, bit 6 and bit 11 of the hint bit-vector.
LD1 hint bit-vector (bit 0 first): 0 0 1 0 0 0 1 0 0 0 0 1 0 0 0 0
[Figure: cache line holding data, key, left, right fields, with byte 12 marked. Pointers at offsets 8, 24 and 44 are labeled Prefetch; the pointer at offset 12 is labeled Don't Prefetch.]
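The hint bit-vector encoding can be sketched in C as below. The function names, the 16-bit vector width and the treatment of offsets as non-negative multiples of the 4-byte address size are assumptions taken from this one example, not a description of the actual instruction encoding.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define ADDR_BYTES 4u    /* 4-byte address values, as the slide assumes */
#define HINT_BITS  16u   /* vector width shown in the example */

/* Compiler side: mark the pointer group at `offset_bytes` as beneficial. */
static uint16_t hint_set(uint16_t hints, unsigned offset_bytes)
{
    assert(offset_bytes % ADDR_BYTES == 0);
    unsigned bit = offset_bytes / ADDR_BYTES;
    assert(bit < HINT_BITS);
    return (uint16_t)(hints | (1u << bit));
}

/* Hardware side: list the byte positions within the line that may be scanned
 * for pointers, given the byte the load accessed. Positions that would fall
 * outside the line are skipped. Returns the number of positions. */
static size_t hinted_positions(uint16_t hints, unsigned accessed_byte,
                               unsigned line_bytes, unsigned out[HINT_BITS])
{
    size_t n = 0;
    for (unsigned bit = 0; bit < HINT_BITS; bit++) {
        unsigned pos = accessed_byte + bit * ADDR_BYTES;
        if ((hints & (1u << bit)) && pos + ADDR_BYTES <= line_bytes)
            out[n++] = pos;
    }
    return n;
}
```

Setting offsets 8, 24 and 44 yields bits 2, 6 and 11. If the load accessed byte 12 of a 64-byte line, only the values at bytes 20, 36 and 56 would be considered for prefetching, and all other pointers in the line would be ignored.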
Outline • Background • Efficient Content Directed LDS Prefetching • Managing Multiple Prefetchers in a Hybrid Prefetching System • Evaluation • Conclusion
Managing Multiple Prefetchers in a Hybrid Prefetching System • ECDP can be complementary to a stream prefetcher • Multiple prefetchers can deny service to each other as they contend for • Memory request buffer entries • DRAM bus bandwidth and DRAM banks • Cache space • Unmanaged use of multiple prefetchers causes • Performance degradation • Inability to gain full performance benefit of using multiple prefetchers
Coordinated Throttling of Multiple Prefetchers – Basic Idea • Dynamic feedback gathered for every prefetcher in the system • Simple heuristics use feedback to adapt each prefetcher's aggressiveness
Adapting Stream Prefetcher Aggressiveness • Stream Prefetcher Aggressiveness • Prefetch Distance • Prefetch Degree [Figure: an access stream with demand accesses A and A+1 and prefetched lines P through P+4, annotated with the prefetch distance and the prefetch degree.]
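The two knobs can be illustrated with a simplified C model of one ascending stream. It assumes the common definitions: distance bounds how far ahead of the demand access the prefetcher may run, and degree bounds how many prefetches one access may trigger. The struct layout and function name are hypothetical, and stream training and descending streams are not modeled.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* State of one tracked ascending stream, in units of cache lines. */
struct stream {
    uint64_t last_prefetched;   /* line number of the most recent prefetch */
};

/* On a demand access to `access_line`, issue at most `degree` prefetches
 * while staying within `distance` lines ahead of the access. `out` must
 * have room for `degree` entries. Returns the number of prefetches issued. */
static size_t stream_on_access(struct stream *s, uint64_t access_line,
                               unsigned distance, unsigned degree,
                               uint64_t *out)
{
    assert(s != NULL && out != NULL);
    size_t n = 0;
    if (s->last_prefetched < access_line)
        s->last_prefetched = access_line;   /* never prefetch behind demand */
    while (n < degree && s->last_prefetched < access_line + distance)
        out[n++] = ++s->last_prefetched;
    return n;
}
```

Throttling down corresponds to calling the function with a smaller distance or degree. For instance, a stream that has already run four lines ahead issues nothing on the next access once the distance is cut to two.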
Adapting CDP Aggressiveness • Each memory request assigned a depth value • Demand accessed line assigned depth 0 • CDP Aggressiveness • Maximum allowed prefetch depth [Figure: the line fetched by a demand access has depth 0 and holds ptr1 to ptr4; the lines prefetched through those pointers have depth 1 and hold ptr5 to ptr7; the lines prefetched through those have depth 2 and hold ptr8 to ptr10.]
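The depth rule can be sketched in C as follows. The names are hypothetical, and the sketch assumes that a prefetch is allowed exactly when the depth of the line it would bring in does not exceed the maximum. The second function is only a back-of-the-envelope bound, added to show why the maximum depth is a natural aggressiveness knob.

```c
#include <assert.h>
#include <stdbool.h>

#define DEMAND_DEPTH 0u   /* a demand-fetched line has depth 0 */

/* A line prefetched through a pointer found in a depth-d line has depth d+1. */
static unsigned child_depth(unsigned parent_depth)
{
    return parent_depth + 1u;
}

/* Pointers in a line may be prefetched only while the resulting line stays
 * within the maximum allowed prefetch depth. */
static bool cdp_may_prefetch_from(unsigned line_depth, unsigned max_depth)
{
    return child_depth(line_depth) <= max_depth;
}

/* Upper bound on prefetches caused by one demand miss if every line held
 * `fanout` pointers: fanout + fanout^2 + ... + fanout^max_depth. */
static unsigned long long cdp_worst_case_prefetches(unsigned fanout,
                                                    unsigned max_depth)
{
    assert(fanout <= 16 && max_depth <= 8);   /* keeps the sum in range */
    unsigned long long total = 0, at_depth = 1;
    for (unsigned d = DEMAND_DEPTH; d < max_depth; d++) {
        at_depth *= fanout;
        total += at_depth;
    }
    return total;
}
```

With four pointers per line, lowering the maximum depth from 4 to 2 cuts the bound from 340 prefetches per demand miss to 20.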
Coordinated Prefetcher Aggressiveness Control Policies • Each prefetcher adapts its own aggressiveness • The goal: Allow the prefetcher most likely to improve performance to use more shared resources • Deciding prefetcher adapts its own aggressiveness based on • Deciding prefetcher coverage and accuracy • Rival prefetcher coverage [Figure: the stream prefetcher and the content-directed prefetcher both send prefetches to the shared memory resources and receive feedback. Each acts as the deciding prefetcher for its own throttling decision and as the rival for the other's.]
Coordinated Prefetcher Aggressiveness Control Policies
Case: Deciding Prefetcher Feedback | Rival Feedback | Action | Reason
(a) Deciding Cov Low, Deciding Acc Low | - | throttle down | avoid unnecessary bandwidth consumption and cache pollution
(b) Deciding Cov Low, Deciding Acc Med | Rival Cov High | throttle down | give rival prefetcher chance to use more shared resources
(c) Deciding Cov Low | Rival Cov Low | throttle up | give deciding prefetcher chance to improve coverage
(d) Deciding Cov Low, Deciding Acc High | Rival Cov High | do nothing | deciding prefetcher not causing trouble, rival performing well
(e) Deciding Cov High | - | throttle up | deciding prefetcher performing well, avoid performance loss
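The policy table can be read as a small decision function. The C sketch below is one possible encoding under stated assumptions: the enum and function names are hypothetical, "Deciding Cov Low" is taken to apply to cases (a) through (d), and the cases are checked in an order that makes case (a) win when accuracy is low. The thresholds that map measured coverage and accuracy to Low, Med and High are not given on the slide and are not modeled.

```c
#include <assert.h>

enum coverage { COV_LOW, COV_HIGH };
enum accuracy { ACC_LOW, ACC_MED, ACC_HIGH };
enum action   { THROTTLE_DOWN, DO_NOTHING, THROTTLE_UP };

/* Decision made by one prefetcher (the deciding prefetcher) from its own
 * coverage and accuracy and from the coverage of its rival. */
static enum action throttle_decision(enum coverage deciding_cov,
                                     enum accuracy deciding_acc,
                                     enum coverage rival_cov)
{
    if (deciding_cov == COV_HIGH)
        return THROTTLE_UP;      /* (e) performing well */
    if (deciding_acc == ACC_LOW)
        return THROTTLE_DOWN;    /* (a) wasting bandwidth, polluting cache */
    if (rival_cov == COV_LOW)
        return THROTTLE_UP;      /* (c) chance to improve coverage */
    if (deciding_acc == ACC_MED)
        return THROTTLE_DOWN;    /* (b) let the rival use more resources */
    assert(deciding_acc == ACC_HIGH);
    return DO_NOTHING;           /* (d) not causing trouble */
}
```

In a hybrid system each prefetcher would evaluate this function once per sampling interval with the roles swapped, so the stream prefetcher and the content-directed prefetcher can reach different decisions in the same interval.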
Hardware Cost of all Techniques • Major components • ‘prefetched’ bits for each L2 line – used to account for useful/useless prefetches • Eleven 16-bit counters to estimate prefetcher coverage and accuracy
Outline • Background • Efficient Content Directed LDS Prefetching • Managing Multiple Prefetchers in a Hybrid Prefetching System • Evaluation • Conclusion
Evaluation Methodology
• x86 cycle accurate simulator
• Baseline processor configuration
  - Per core
    - 4-wide issue, out-of-order, 256-entry ROB
    - 1MB, 8-way L2 cache
    - Stream prefetcher with 32 streams, prefetch degree: 4, prefetch distance: 32
    - Content Directed Prefetcher, compare bits: 8, max depth: 4
  - Shared
    - 450 cycle memory latency
    - 8B wide core to memory bus
    - 32, 64, 128 L2 MSHRs for 1-, 2-, 4-core
• Coordinated prefetcher throttling thresholds
Overall Performance 22.5%
Summary of Other Results • Further comparisons and analysis are presented in the paper • Feedback Directed Prefetching • 5% avg. improvement • HW Prefetch Filtering • 17% avg. improvement • Multi-core Results • Dual Core (10.4% avg. improvement) • Quad Core (9.5% avg. improvement) • Effects of techniques on prefetcher accuracy and coverage
Outline • Background • Efficient Content Directed LDS Prefetching • Managing Multiple Prefetchers in a Hybrid Prefetching System • Evaluation • Conclusion
Conclusion • Developed a low-cost and bandwidth-efficient HW/SW cooperative linked data structure prefetcher • ECDP utilizes compiler hints to prefetch only likely-useful pointers • Inter-prefetcher interference can destroy potential performance • Coordinated throttling manages interference between multiple prefetchers • Efficient integration of ECDP with stream prefetching • Improves average performance by 22% over stream prefetching alone • Reduces bandwidth consumption by 25%
Thank you! Questions?
Comparison to Feedback-Directed Prefetching (Srinath et al. HPCA ‘07)
Comparison to HW Prefetch Filtering (Zhuang and Lee ICPP ‘03)