Exploiting Load Latency Tolerance for Relaxing Cache Design Constraints
Ramu Pyreddy, Gary Tyson
Advanced Computer Architecture Laboratory
University of Michigan
Motivation
• Increasing Memory-Processor Frequency Gap
• Large Data Caches to Hide Long Latencies
• Larger Caches: Longer Access Latencies [McFarland 98]
• Processor Cycle Determines Cache Size
  • Intel Pentium III: 16K DL1 Cache, 3-cycle access
  • Intel Pentium 4: 8K DL1 Cache, 2-cycle access
• Need Large AND Fast Caches!
Related Work
• Load Latency Tolerance [Srinivasan & Lebeck, MICRO 98]
  • All Loads are NOT Equal
  • Determining Criticality: Very Complex
  • Sophisticated Simulator with Rollback
• Non-Critical Buffer [Fisk & Bahar, ICCD 99]
  • Determining Criticality: Performance Degradation/Dependency Chains
  • Non-Critical Buffer: Victim Cache for Non-Critical Loads
  • Small Performance Improvements (up to 4%)
Related Work (contd.)
• Locality vs. Criticality [Srinivasan et al., ISCA 01]
  • Determining Criticality: Practical Heuristics
  • Potential for Improvement: 40%
  • Locality is Better than Criticality
• Non-Vital Loads [Rakvic et al., HPCA 02]
  • Determining Criticality: Run-time Heuristics
  • Small and Fast Vital Cache for Vital Loads
  • 17% Performance Improvement
Criticality
• Criticality: Effect of Load Latency on Performance
• Two Thresholds: Performance and Latency
• A Very Direct Estimation of Criticality
• Computation Intensive!
• Static
Determining Criticality: A Closer Look
IPC Threshold = 99.6%, Latency Threshold = 8 cycles
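The slides do not spell out the classification procedure, so the following is only a minimal sketch of one plausible reading of the two thresholds: each static load is forced to the latency threshold in isolation, and it is marked critical if IPC falls below the IPC threshold fraction of the baseline. The function names, the `simulate_ipc` callback, and the toy simulator are all hypothetical; a real study would call a cycle-level simulator here.

```python
from typing import Callable, Dict, Iterable, Set

# Hypothetical simulator interface: maps {load PC: forced latency in cycles}
# to the resulting IPC. An empty dict means an unmodified baseline run.
Simulator = Callable[[Dict[int, int]], float]


def classify_critical_loads(
    load_pcs: Iterable[int],
    simulate_ipc: Simulator,
    ipc_threshold: float = 0.996,   # 99.6% of baseline IPC (from the slide)
    latency_threshold: int = 8,     # forced load latency in cycles (from the slide)
) -> Set[int]:
    """Return the static loads whose slowdown drops IPC below the threshold.

    Assumption: one simulation per static load, which is one reason such a
    direct estimate would be computation intensive.
    """
    baseline_ipc = simulate_ipc({})
    critical: Set[int] = set()
    for pc in load_pcs:
        slowed_ipc = simulate_ipc({pc: latency_threshold})
        if slowed_ipc < ipc_threshold * baseline_ipc:
            critical.add(pc)
    return critical


def make_toy_simulator(base_ipc: float, sensitivity: Dict[int, float]) -> Simulator:
    """Toy stand-in for a real simulator (illustrative only, not a model
    taken from the slides): each load has a made-up sensitivity to latency."""
    def simulate_ipc(forced_latency: Dict[int, int]) -> float:
        slowdown = sum(sensitivity[pc] * lat for pc, lat in forced_latency.items())
        return base_ipc / (1.0 + slowdown)
    return simulate_ipc


# Illustrative values only: one latency-sensitive load, one tolerant load.
toy_sensitivity = {0x400100: 0.01, 0x400200: 0.0001}
toy_sim = make_toy_simulator(2.0, toy_sensitivity)
print(sorted(hex(pc) for pc in classify_critical_loads(toy_sensitivity, toy_sim)))
```

Because the classification is per static load instruction, its result could be computed once offline and reused, which matches the "Static" bullet on the previous slide.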
Effectiveness?
• Load Reference Distribution
• What Percentage of Loads are Identified as Critical
• Miss Rate for Critical Load References
• Critical Cache Configuration compared with a Faster Conventional Cache Configuration
  • DL1/DL2 Latencies: 3/10, 6/20, 9/30 cycles
• Critical Cache Configuration compared with a Larger Conventional Cache Configuration
  • DL1 Sizes: 8KB, 16KB, 32KB, 64KB
Processor Configuration
Similar to Alpha 21264, using SimpleScalar-3.0 [Austin, Burger 97]
Results: Comparison with a Faster Conventional Cache Configuration
IPCs normalized to 16K-1cycle Configuration
25-66% of the penalty due to a slower cache is eliminated
Results: Comparison with a Faster Conventional Cache Configuration
IPCs normalized to 32K-1cycle Configuration
25-70% of the penalty due to a slower cache is eliminated
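The slides do not define how "penalty eliminated" is computed. One natural reading, assumed here, is the fraction of the IPC gap between the fast and slow conventional caches that the critical cache configuration recovers. The function name and the numbers below are illustrative assumptions, not results from the talk.

```python
def penalty_eliminated(ipc_fast: float, ipc_slow: float, ipc_critical: float) -> float:
    """Fraction of the slow-cache IPC penalty recovered by the critical cache.

    Assumed definition: (IPC_critical - IPC_slow) / (IPC_fast - IPC_slow).
    """
    penalty = ipc_fast - ipc_slow
    if penalty <= 0:
        raise ValueError("the slow configuration must have lower IPC than the fast one")
    return (ipc_critical - ipc_slow) / penalty


# Made-up normalized IPCs: fast cache = 1.00, slow cache = 0.90,
# slow cache plus critical cache = 0.95.
print(f"{penalty_eliminated(1.00, 0.90, 0.95):.0%}")  # prints 50%
```

With IPCs normalized to the fast configuration, as on these slides, `ipc_fast` is simply 1.0.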
Results: Comparison with a Larger Conventional Cache Configuration
IPCs normalized to 16K-3cycle Configuration
Results: Comparison with a Larger Conventional Cache Configuration
IPCs normalized to 32K-6cycle Configuration
Critical cache configuration outperforms a larger conventional cache
Conclusions & Future Work
• Conclusions
  • Compares well with a faster conventional cache
  • Outperforms a larger conventional cache in most cases
• Future Work
  • More heuristics to refine "criticality"
  • Why are "critical loads" critical?
  • Criticality of a memory address vs. criticality of a load instruction
  • Criticality for low-power caches