Efficient Techniques for Bandwidth-Efficient Prefetching of LDS in Hybrid Systems

Techniques for Bandwidth-Efficient Prefetching of Linked Data Structures in Hybrid Prefetching Systems
Eiman Ebrahimi*, Onur Mutlu‡, Yale N. Patt*
* HPS Research Group, University of Texas at Austin
‡ Computer Architecture Laboratory, Carnegie Mellon University

Motivation
• Prefetching can significantly reduce the impact of memory latency on performance
• Stream prefetching is very useful but cannot reduce the latency of many misses
• Access patterns that follow pointers in linked data structures (LDS) are prevalent in many applications
• High-performance and bandwidth-efficient LDS prefetchers are needed

Potential Performance
[Figure: IPC delta of ideal LDS prefetching over stream prefetching; surviving data label: 615]

Our Goal
Develop techniques that
1) Enable low-cost and bandwidth-efficient prefetching of linked data structure accesses
2) Efficiently combine such prefetchers with commonly employed stream-based prefetchers

Outline
• Background
• Efficient Content Directed LDS Prefetching
• Managing Multiple Prefetchers in a Hybrid Prefetching System
• Evaluation
• Conclusion

Content-Directed Prefetching (CDP) (Cooksey et al. ASPLOS ’02)
• Requires no state, which makes it an attractive approach
• Searches for pointers as data is fetched from memory
• Virtual address predictor
  • Compares the high-order bits of each value within a cache line with the high-order bits of the cache line's address
  • Generates a prefetch request on a match
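The comparison performed by the virtual address predictor can be sketched in C as follows. This is a minimal illustration, not the hardware design from the talk: it assumes 32-bit virtual addresses and 4-byte words, and the function names `cdp_is_pointer_candidate` and `cdp_scan_line` are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

/* Assumption for this sketch: 32-bit virtual addresses, 4-byte words. */

/* Returns nonzero when the top `compare_bits` bits of `value` equal the top
   `compare_bits` bits of `line_addr`. Requires 1 <= compare_bits <= 32. */
int cdp_is_pointer_candidate(uint32_t value, uint32_t line_addr,
                             unsigned compare_bits)
{
    unsigned shift = 32u - compare_bits;
    return (value >> shift) == (line_addr >> shift);
}

/* Scans the `n_words` words of a fetched cache line and copies every word
   that looks like a pointer into `out` (which must hold n_words entries).
   Returns the number of prefetch addresses generated. */
size_t cdp_scan_line(const uint32_t *line, size_t n_words,
                     uint32_t line_addr, unsigned compare_bits,
                     uint32_t *out)
{
    size_t n = 0;
    for (size_t i = 0; i < n_words; i++) {
        if (cdp_is_pointer_candidate(line[i], line_addr, compare_bits))
            out[n++] = line[i];
    }
    return n;
}
```

With `compare_bits` set to 12 the sketch compares bits [31:20]. Every matching word produces a prefetch, whether or not the program will ever follow that pointer.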

Content-Directed Prefetching (CDP)
[Figure: a cache line at address x80011100 is fetched from DRAM into the L2. The virtual address predictor compares bits [31:20] of each value in the line (for example x80022220 and x40373551) with bits [31:20] of the line address, and generates a prefetch for x80022220 on a match.]

Shortcomings of CDP
• CDP prefetches all identified pointers
• Indiscriminate prefetching of all discovered pointers leads to
  • Low prefetch accuracy
  • High cache pollution
  • High bandwidth consumption

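Shortcomings of CDP – An example
Example from mst:
struct node {
    int Key;
    int *D1_ptr;
    int *D2_ptr;
    struct node *Next;
};
int *HashLookup(int Key) {
    …
    /* Walk the hash chain until the key matches or the chain ends. */
    for (node = head; node && node->Key != Key; node = node->Next)
        ;
    if (node)
        return node->D1_ptr;
    return NULL;  /* key not found */
}
The walk reads only Key and Next of each node; D1_ptr is used only for the matching node, and D2_ptr is not used at all.
[Figure: a chain of nodes, each holding Key, D1, and D2.]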

Shortcomings of CDP – An example (continued)
[Figure: the virtual address predictor compares bits [31:20] of every word in the fetched cache line (the Key, D1_ptr, D2_ptr, and Next fields of the nodes it holds) with bits [31:20] of the cache line address, while HashLookup follows Next through the chain and finally returns node->D1.]

Efficient Content Directed Prefetching (ECDP) – Basic Idea
• A compiler-guided technique that identifies likely-useful pointer addresses to prefetch
• The compiler profiles the program and provides hints as to which pointer addresses are likely useful to prefetch
• Hardware uses the hints to prefetch only likely-useful pointers

Terminology – Pointer Group (PG)
struct node {
    int data;
    int key;
    struct node *left;
    struct node *right;
};
…
data = node->data;    /* LD1 */
node = node->left;
PG(L, X) = { all pointers at offset X from the byte accessed by instruction L }
[Figure: consecutive nodes laid out as data, key, left, right. LD1 accesses data, and the left pointers P1, P2, … each sit at offset 8 from the byte LD1 accesses.]
PG(LD1, 8) = {P1, P2, etc.}

Efficient Content-Directed Prefetching (ECDP)
• The PG definition naturally associates a number of PGs with each load instruction
1) Compile-time profiling classifies PGs as beneficial or harmful
2) Hardware prefetches only the PGs that are beneficial
- This information is conveyed to hardware through a hint bit vector embedded in the load instruction

Beneficial vs Harmful PG
PG1 = {P1, P2, P3}
[Figure: three traversals of a node chain (data, key, left, right), one per pointer P1, P2, and P3, annotated with useful and useless prefetch counts: 25, 50, and 33 useful; 12, 9, and 10 useless.]
25 + 50 + 33 > 12 + 9 + 10, so PG1's useful prefetches > PG1's useless prefetches
A pointer group is classified as beneficial when the majority of its prefetches are useful
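The classification rule on this slide amounts to a comparison of two profile counters per pointer group. A minimal sketch in C, assuming the profiler keeps one useful count and one useless count per PG (the `pg_profile` type and both function names are hypothetical):

```c
/* Hypothetical per-pointer-group counters gathered during profiling. */
struct pg_profile {
    unsigned long useful;   /* prefetches that were later used */
    unsigned long useless;  /* prefetches that were never used */
};

/* Adds the counts observed for one pointer of the group. */
void pg_record(struct pg_profile *pg, unsigned long useful,
               unsigned long useless)
{
    pg->useful += useful;
    pg->useless += useless;
}

/* A group is beneficial when useful prefetches outnumber useless ones. */
int pg_is_beneficial(const struct pg_profile *pg)
{
    return pg->useful > pg->useless;
}
```

For PG1 on the slide the totals are 108 useful against 31 useless prefetches, so `pg_is_beneficial` would report the group as beneficial.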

ECDP mechanism - Example
LD1's associated beneficial pointer groups: PG1 = {LD1, 8}, PG2 = {LD1, 24}, PG3 = {LD1, 44}
Assuming 4-byte address values, these offsets select bits 2, 6, and 11 of LD1's hint bit vector:
0 0 1 0 0 0 1 0 0 0 0 1 0 0 0 0 (bit 0 on the left)
[Figure: a cache line of data, key, left, right fields with byte 12 marked. Pointers at offsets 8, 24, and 44 are labeled Prefetch, and pointers at offset 12 are labeled Don't Prefetch.]
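The mapping from pointer-group offsets to hint bits can be sketched in C. This is an illustration under stated assumptions: 4-byte address values as on the slide, a 16-bit hint vector as drawn, and only non-negative, word-aligned offsets. The names `hint_set` and `hint_should_prefetch` are hypothetical.

```c
#include <stdint.h>

#define ADDR_BYTES 4u  /* assumption from the slide: 4-byte address values */

/* Marks the pointer group at byte `offset` from the accessed byte as
   beneficial. Assumes offset is a multiple of ADDR_BYTES and
   offset / ADDR_BYTES < 16. */
uint16_t hint_set(uint16_t hints, unsigned offset)
{
    return (uint16_t)(hints | (1u << (offset / ADDR_BYTES)));
}

/* Returns nonzero when hardware should prefetch the pointer found at
   byte `offset` from the byte accessed by the load. */
int hint_should_prefetch(uint16_t hints, unsigned offset)
{
    return (hints >> (offset / ADDR_BYTES)) & 1u;
}
```

Setting offsets 8, 24, and 44 turns on bits 2, 6, and 11, which matches the slide's bit vector; offset 12 maps to bit 3, which stays clear, so that pointer is not prefetched.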

Outline
• Background
• Efficient Content Directed LDS Prefetching
• Managing Multiple Prefetchers in a Hybrid Prefetching System
• Evaluation
• Conclusion

Managing Multiple Prefetchers in a Hybrid Prefetching System
• ECDP can be complementary to a stream prefetcher
• Multiple prefetchers can deny service to each other as they contend for
  • Memory request buffer entries
  • DRAM bus bandwidth and DRAM banks
  • Cache space
• Unmanaged use of multiple prefetchers causes
  • Performance degradation
  • Inability to gain the full performance benefit of using multiple prefetchers

Coordinated Throttling of Multiple Prefetchers – Basic Idea
• Dynamic feedback is gathered for every prefetcher in the system
• Simple heuristics use the feedback to adapt each prefetcher's aggressiveness

Adapting Stream Prefetcher Aggressiveness
• Stream prefetcher aggressiveness is set by
  • Prefetch distance
  • Prefetch degree
[Figure: an access stream with the labels A, A+1, P, P+1, P+2, P+3, P+4, Prefetch Distance, and Prefetch Degree.]
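One common way to model the two knobs is sketched below in C. It assumes the usual meaning of the terms, which the slide does not spell out: the prefetch distance bounds how far ahead of the demand access the last prefetch P may run, and the prefetch degree is the number of prefetches issued per triggering access. The `stream` type and `stream_on_access` are hypothetical names, and only ascending streams are handled.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical state for one ascending stream, in cache-line units. */
struct stream {
    uint64_t last_prefetched; /* P: line index of the most recent prefetch */
    unsigned distance;        /* max lines P may run ahead of the demand access */
    unsigned degree;          /* prefetches issued per triggering access */
};

/* Called on a demand access to line `a`. Writes the line indices to
   prefetch into `out` (which must hold s->degree entries) and returns
   how many were issued. */
size_t stream_on_access(struct stream *s, uint64_t a, uint64_t *out)
{
    size_t n = 0;
    uint64_t limit = a + s->distance;

    /* If the demand stream overtook the prefetcher, restart from `a`. */
    if (s->last_prefetched < a)
        s->last_prefetched = a;

    while (n < s->degree && s->last_prefetched < limit)
        out[n++] = ++s->last_prefetched;
    return n;
}
```

In this model, throttling the stream prefetcher up or down means raising or lowering `distance` and `degree`.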

Adapting CDP Aggressiveness
• Each memory request is assigned a depth value
• A demand-accessed line is assigned depth 0
• CDP aggressiveness is set by the maximum allowed prefetch depth
[Figure: a line fetched by a demand access (depth 0) holding ptr1 to ptr4, …; depth 1 lines holding ptr5 to ptr7, …; depth 2 lines holding ptr8 to ptr10, ….]
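The depth rule can be sketched in C under one assumption that the slide implies but does not state outright: a prefetch generated from a line of depth d is assigned depth d + 1, and it is dropped when that exceeds the maximum allowed depth. The function name `cdp_child_depth` is hypothetical.

```c
/* Returns the depth assigned to a prefetch generated from a line of depth
   `parent_depth`, or -1 when the request would exceed `max_depth` and
   should not be issued. */
int cdp_child_depth(int parent_depth, int max_depth)
{
    int depth = parent_depth + 1;
    return depth <= max_depth ? depth : -1;
}
```

Lowering `max_depth` cuts off the recursive pointer chase earlier, which is how this knob reduces CDP's aggressiveness.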

Coordinated Prefetcher Aggressiveness Control Policies
• Each prefetcher adapts its own aggressiveness
• The goal: allow the prefetcher most likely to improve performance to use more shared resources
• The deciding prefetcher adapts its own aggressiveness based on
  • Deciding prefetcher coverage and accuracy
  • Rival prefetcher coverage
[Figure: the stream prefetcher and the content-directed prefetcher both send prefetches to the shared memory resources and receive feedback. Each is the deciding prefetcher for its own throttling decision and the rival for the other's.]

Coordinated Prefetcher Aggressiveness Control Policies
Case | Deciding Prefetcher Feedback | Rival Feedback | Action | Reason
(a) | Deciding Acc Low | - | throttle down | avoid unnecessary bandwidth consumption and cache pollution
(b) | Deciding Acc Med | Rival Cov High | throttle down | give rival prefetcher chance to use more shared resources
(c) | Deciding Cov Low | Rival Cov Low | throttle up | give deciding prefetcher chance to improve coverage
(d) | Deciding Acc High | Rival Cov High | do nothing | deciding prefetcher not causing trouble, rival performing well
(e) | Deciding Cov High | - | throttle up | deciding prefetcher performing well, avoid performance loss
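The policy cases can be written as a small decision function. The sketch below is in C and rests on one assumption the slide does not state: the order in which the cases are checked when more than one could apply. Here high deciding coverage is tested first, then low deciding accuracy, then rival coverage. The enum and function names are hypothetical.

```c
enum cov { COV_LOW, COV_HIGH };
enum acc { ACC_LOW, ACC_MED, ACC_HIGH };
enum action { THROTTLE_DOWN, DO_NOTHING, THROTTLE_UP };

/* Decision taken by the deciding prefetcher. The check order is an
   assumption of this sketch, not something the slide specifies. */
enum action throttle_decision(enum cov deciding_cov, enum acc deciding_acc,
                              enum cov rival_cov)
{
    if (deciding_cov == COV_HIGH)
        return THROTTLE_UP;    /* deciding prefetcher performing well */
    if (deciding_acc == ACC_LOW)
        return THROTTLE_DOWN;  /* avoid wasted bandwidth and cache pollution */
    if (rival_cov == COV_LOW)
        return THROTTLE_UP;    /* give deciding prefetcher a chance to improve coverage */
    if (deciding_acc == ACC_MED)
        return THROTTLE_DOWN;  /* rival coverage high: let the rival use more resources */
    return DO_NOTHING;         /* accuracy high, rival coverage high */
}
```

Each prefetcher would run this once per feedback interval with itself as the deciding prefetcher and the other as the rival, then adjust its own aggressiveness knob accordingly.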

Hardware Cost of all Techniques
• Major components
  • ‘prefetched’ bits for each L2 line – used to account for useful/useless prefetches
  • Eleven 16-bit counters to estimate prefetcher coverage and accuracy
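The counters feed two ratios. The slide does not give the formulas, so the C sketch below uses the commonly used definitions as an assumption: accuracy is the fraction of issued prefetches that were used, and coverage is the fraction of would-be misses that prefetching eliminated. Function and parameter names are hypothetical.

```c
/* Accuracy: useful prefetches divided by prefetches sent.
   Returns 0 when nothing has been sent yet. */
double prefetch_accuracy(unsigned long useful, unsigned long sent)
{
    return sent ? (double)useful / (double)sent : 0.0;
}

/* Coverage: useful prefetches divided by (useful prefetches plus
   remaining demand misses). Returns 0 when both counts are zero. */
double prefetch_coverage(unsigned long useful, unsigned long demand_misses)
{
    unsigned long total = useful + demand_misses;
    return total ? (double)useful / (double)total : 0.0;
}
```

The per-line ‘prefetched’ bit is what lets hardware count a prefetch as useful: a demand hit on a line whose bit is set increments the useful counter.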

Evaluation Methodology
• x86 cycle-accurate simulator
• Baseline processor configuration
  • Per core
    • 4-wide issue, out-of-order, 256-entry ROB
    • 1MB, 8-way L2 cache
    • Stream prefetcher with 32 streams, prefetch degree: 4, prefetch distance: 32
    • Content-directed prefetcher, compare bits: 8, max depth: 4
  • Shared
    • 450-cycle memory latency
    • 8B-wide core-to-memory bus
    • 32, 64, 128 L2 MSHRs for 1-, 2-, 4-core
• Coordinated prefetcher throttling thresholds

Overall Performance
[Figure: overall performance results; surviving data label: 22.5%]

Memory Bandwidth Consumption
[Figure: memory bandwidth consumption results; surviving data label: 375]

Comparison to other LDS/Correlation Prefetchers

Summary of Other Results
• Further comparisons and analysis are presented in the paper
  • Feedback Directed Prefetching
    • 5% avg. improvement
  • HW Prefetch Filtering
    • 17% avg. improvement
  • Multi-core Results
    • Dual Core (10.4% avg. improvement)
    • Quad Core (9.5% avg. improvement)
  • Effects of techniques on prefetcher accuracy and coverage

Conclusion
• Developed a low-cost and bandwidth-efficient HW/SW cooperative linked data structure prefetcher
  • ECDP utilizes compiler hints to prefetch only likely-useful pointers
• Inter-prefetcher interference can destroy potential performance
  • Coordinated throttling manages interference between multiple prefetchers
• Efficient integration of ECDP with stream prefetching
  • Improves average performance by 22% over stream prefetching alone
  • Reduces bandwidth consumption by 25%

Thank you! Questions?

Comparison to Feedback-Directed Prefetching (Srinath et al. HPCA ’07)

Comparison to HW Prefetch Filtering (Zhuang and Lee ICPP ’03)

Performance on Dual-Core

Performance on Quad-Core

Stream Prefetcher Accuracy

CDP Accuracy

Sensitivity Study
