Accurate and Complexity-Effective Spatial Pattern Prediction
Chi Chen, Se-Hyun Yang, Babak Falsafi, Andreas Moshovos
Motivation – Variation in Spatial Locality
• Caches Exploit Spatial Locality via Block Size
  • Prefetch Nearby Data to Improve Performance
• “One Size Fits All” Solution
  • Large enough for prefetching
  • Small enough to avoid memory link saturation
• Opportunity: Variation Within and Across Applications
• If the “Best Block Size” were known:
  • Prefetch even further for Higher Performance
  • “Turn off” unused data in cache for Lower Leakage Power
This Work
• Dynamic Spatial Pattern Prediction
• Leakage Power Reduction
  • Sub-blocks of a block as a Group
  • Place “unused” block parts in low leakage state
• Prefetching
  • Consecutive Memory Blocks as a Group
  • Selectively Prefetch Blocks Upon First Access in Group
• Key Contribution: PC + Offset Within Group
  • Quick Learning
  • Compact Representation
  • High Coverage
How Well it Works
• Spatial Pattern Predictor (SPP)
  • 256-entry Tag-Less Direct-Mapped
  • ~95% coverage
• L1 Data Leakage Energy Reduction
  • ~40% reduction w/ 70nm CMOS technology
  • < 1% average performance degradation
• Prefetching w/ 1024-byte Group
  • Up to 2x speedup, 56% on average
  • Conventional Cache: 14% Slowdown
Outline
• Conventional Cache: Optimization Opportunities
• Variation in Spatial Locality
• Prediction Framework
• Prior Work
• Results
Optimization Opportunity #1: Conventional Cache

    typedef struct person {
        char name[20];
        …
        int age;
        int isAdult;
        struct person* next;
    };  // total 64 bytes

    // do something …
    while ( people ) {
        if ( people->age >= 21 )
            people->isAdult = TRUE;
        people = people->next;
    }

[Figure: L1D with 64-byte cache lines. Each miss brings in a full line, but only age, isAdult, and next are touched; the rest of the line stays untouched.]
• Resident untouched data: Wasteful Leakage
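The waste in the linked-list example can be made concrete with a small sketch. It assumes 8-byte sub-blocks (eight per 64-byte line) and a hypothetical field layout with age at byte 48, isAdult at byte 52, and next at byte 56; the slide elides the middle of the struct, so these offsets are illustrative only. The helper names are not from the slides.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed geometry: 64-byte line split into eight 8-byte sub-blocks. */
enum { LINE_BYTES = 64, SUB_BLOCK_BYTES = 8 };

/* Set the bit of every sub-block overlapped by an access of `size` bytes
 * that starts at byte `offset` within the cache line. */
static uint8_t mark_touched(uint8_t mask, unsigned offset, unsigned size)
{
    assert(size > 0 && offset + size <= LINE_BYTES);
    unsigned first = offset / SUB_BLOCK_BYTES;
    unsigned last = (offset + size - 1) / SUB_BLOCK_BYTES;
    for (unsigned i = first; i <= last; i++)
        mask |= (uint8_t)(1u << i);
    return mask;
}

/* Count the touched sub-blocks in a mask. */
static unsigned touched_count(uint8_t mask)
{
    unsigned n = 0;
    for (; mask; mask &= (uint8_t)(mask - 1))
        n++;
    return n;
}

/* Pattern left by one loop iteration, using the hypothetical offsets. */
static uint8_t person_loop_mask(void)
{
    uint8_t m = 0;
    m = mark_touched(m, 48, 4);  /* people->age     (assumed offset) */
    m = mark_touched(m, 52, 4);  /* people->isAdult (assumed offset) */
    m = mark_touched(m, 56, 8);  /* people->next    (assumed offset) */
    return m;
}
```

Under these assumptions only two of the eight sub-blocks are ever touched, so the other six are resident data that could sit in a low-leakage state.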
Optimization Opportunity #2: Conventional Cache

    struct person {
        char name[20];
        …
        int age;
        int isAdult;
    } people[LARGE];

    // do something …
    for ( int i = 0; i < LARGE; i++ ) {
        if ( people[i].age >= 21 )
            people[i].isAdult = TRUE;
    }

[Figure: L1D with 64-byte cache lines; the age and isAdult fields are touched in each line of Group #1 and Group #2.]
• Detect Access Patterns at Group Level
• Selectively Prefetch Same Block Members
• Improve Performance w/o Saturating Memory
Variation in Spatial Locality
[Figure: stacked bars for facerec, gcc, mcf, and vortex showing the share of all cache lines (0% to 100%) by fraction of the line touched, from 1/8 to 8/8. Average line usage labels: 40%, 89%, 26%, 48%.]
• Fraction of data used before eviction
• Measured on 64KB 2-way L1D w/ 64B cache lines
Prediction Framework
• Minimum Fetch Unit (MFU):
  • replacement unit of cache
  • e.g., cache line or sub block
• Spatial Group:
  • group of adjacent MFUs
  • indexed by logical tag
• Spatial Pattern:
  • reference pattern of a spatial group
• Spatial Group Generation:
  • starts with a new logical tag
[Figure: timeline of references labeled Tag0 and Tag1, with the bit pattern recorded for a spatial group.]
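The framework terms map onto simple address arithmetic. The sketch below assumes 64-byte cache lines as MFUs and 1024-byte spatial groups (the group size used in the prefetching results); the type and function names are hypothetical. It tracks a single generation at a time, which is a simplification for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed parameters: 64-byte MFUs and 1024-byte spatial groups (16 MFUs). */
enum { MFU_BYTES = 64, GROUP_BYTES = 1024, MFUS_PER_GROUP = GROUP_BYTES / MFU_BYTES };

/* Logical tag: identifies the spatial group that contains `addr`. */
static uint64_t group_tag(uint64_t addr)
{
    return addr / GROUP_BYTES;
}

/* Index of the MFU that contains `addr` within its spatial group. */
static unsigned group_offset(uint64_t addr)
{
    return (unsigned)((addr % GROUP_BYTES) / MFU_BYTES);
}

typedef struct {
    uint64_t tag;      /* logical tag of the current generation */
    uint16_t pattern;  /* bit i set: MFU i was referenced in this generation */
    int valid;         /* 0 until the first reference is recorded */
} generation_t;

/* Record one reference. Returns 1 if a new logical tag started a new
 * generation, 0 if the reference extended the current one. */
static int record_access(generation_t *g, uint64_t addr)
{
    int started = 0;
    if (!g->valid || g->tag != group_tag(addr)) {
        g->tag = group_tag(addr);
        g->pattern = 0;
        g->valid = 1;
        started = 1;
    }
    assert(group_offset(addr) < MFUS_PER_GROUP);
    g->pattern |= (uint16_t)(1u << group_offset(addr));
    return started;
}
```

With these parameters the spatial pattern of a group fits in 16 bits, one bit per cache line.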
Spatial Pattern Prediction
[Figure: the Spatial Pattern Predictor next to the data cache. Each Current Pattern Table (CPT) entry holds a spatial pattern register and a PHT entry pointer; the Pattern History Table (PHT) holds spatial pattern histories and is selected by the prediction index (32 bits), formed from the PC and the SPG offset.]
• Current Pattern Table records patterns
• Pattern History Table stores captured patterns
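One way to picture the interplay of the two tables is the sketch below. The 256-entry tag-less direct-mapped PHT and the PC + SPG offset index come from the slides; the hash in `prediction_index`, the 16-bit pattern width, and all function names are assumptions made for illustration, not the authors' design.

```c
#include <assert.h>
#include <stdint.h>

enum { PHT_ENTRIES = 256 };  /* 256-entry tag-less direct-mapped (from the slides) */

/* Pattern History Table: one captured pattern per entry, no tags. */
typedef struct {
    uint16_t pattern[PHT_ENTRIES];
} pht_t;

/* Current Pattern Table entry for one live spatial group generation. */
typedef struct {
    uint16_t pattern;    /* spatial pattern register */
    unsigned pht_index;  /* PHT entry pointer, fixed at the first access */
} cpt_entry_t;

/* Hypothetical hash: fold the PC and the SPG offset into a table index. */
static unsigned prediction_index(uint64_t pc, unsigned spg_offset)
{
    return (unsigned)(((pc >> 2) ^ ((uint64_t)spg_offset << 4)) % PHT_ENTRIES);
}

/* First access of a generation: remember the PHT slot, start recording,
 * and return the previously captured pattern as the prediction. */
static uint16_t generation_begin(const pht_t *pht, cpt_entry_t *e,
                                 uint64_t pc, unsigned spg_offset)
{
    assert(spg_offset < 16);
    e->pht_index = prediction_index(pc, spg_offset);
    e->pattern = (uint16_t)(1u << spg_offset);
    return pht->pattern[e->pht_index];
}

/* Later access in the same generation: extend the recorded pattern. */
static void generation_touch(cpt_entry_t *e, unsigned spg_offset)
{
    assert(spg_offset < 16);
    e->pattern |= (uint16_t)(1u << spg_offset);
}

/* End of the generation: store the observed pattern for future predictions. */
static void generation_end(pht_t *pht, const cpt_entry_t *e)
{
    pht->pattern[e->pht_index] = e->pattern;
}
```

Because the index depends only on the instruction and the offset within the group, a pattern learned on one group can be reused for any other group touched by the same code at the same offset, which is consistent with the quick learning and compact representation the slides claim.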
Prior Work
• Static profiling, V. Vleet, et al. ICCD 1999
• Adjustable block size, Dubnicki & LeBlanc. ISCA 1992
• Fetching adjacent cache lines, Temam & Jegou. ICS 1994
• Dual cache, Gonzalez, Aliagas & Valero. ICS 1995
• Spatial Locality Detection Table, Johnson, Merten & Hwu. MICRO 1998
• Spatial Footprint Predictor (SFP), Kumar & Wilkerson. ISCA 1998
Key Difference is Prediction Handle: PC + Group Offset
1. Compact Representation
2. Quick Learning
3. High Coverage
Results Overview
• Predictor Performance Statistics
• Leakage Power Reduction
• Performance Improvement w/ Prefetching
Methodology
• SimpleScalar simulator
  • 64KB 2-way L1D/L1I cache, 2-cycle latency
  • 2MB 8-way L2 cache, 12-cycle latency
• SPEC CPU2000
  • Alpha binaries + reference inputs
• Predictor performance evaluation
  • Simulated to completion
• Performance impact evaluation
  • Skipped the first 10B instructions and simulated the next 500M
• Energy reduction evaluation
  • SPICE w/ 70nm CMOS technology & 1V supply voltage
Practical Predictor: Performance
[Figure: stacked bars for gcc, mcf, vortex, and facerec; y-axis "% of perfect predictions" (0% to 160%); segments: Training, Over-Prediction, Under-Prediction, Correct Prediction. 256-entry configurations: A: 16-way, B: DM, C: FA.]
• 256-entry tag-less direct-mapped
• average prediction accuracy of 96%
Predictor Applications
• Leakage energy reduction
  • Sub blocks as minimum fetch units
  • Cache lines as spatial groups
  • A cache miss starts a spatial group generation
  • Assuming Gated-Ground by Agarwal, Li, & Roy
• Spatial group prefetcher
  • Cache lines as minimum fetch units
  • Adjacent cache lines grouped into spatial groups
  • A new logical tag starts a spatial group generation
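Both applications consume the same kind of predicted bit pattern. The sketch below shows one plausible way to turn a prediction into actions, assuming 64-byte lines, 1024-byte spatial groups for prefetching, and eight sub-blocks per line for leakage control; the function names and the policy of skipping the missing line are assumptions, not details given in the slides.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed geometry: 64-byte lines, 1024-byte spatial groups (16 lines). */
enum { LINE_BYTES = 64, GROUP_BYTES = 1024, LINES_PER_GROUP = GROUP_BYTES / LINE_BYTES };

/* Prefetcher use: write the addresses of the lines that `predicted` marks
 * as used, other than the line that missed, into `out`. Returns the number
 * of addresses written; `out` must have room for LINES_PER_GROUP entries. */
static size_t prefetch_candidates(uint64_t miss_addr, uint16_t predicted,
                                  uint64_t *out)
{
    assert(out != NULL);
    uint64_t base = miss_addr - (miss_addr % GROUP_BYTES);
    unsigned miss_line = (unsigned)((miss_addr % GROUP_BYTES) / LINE_BYTES);
    size_t n = 0;
    for (unsigned i = 0; i < LINES_PER_GROUP; i++) {
        if (i != miss_line && (predicted & (1u << i)))
            out[n++] = base + (uint64_t)i * LINE_BYTES;
    }
    return n;
}

/* Leakage use: sub-blocks predicted unused are the candidates for the
 * low-leakage state (bit i set in the result: gate sub-block i). */
static uint8_t gate_mask(uint8_t predicted_used)
{
    return (uint8_t)~predicted_used;
}
```

For example, a miss on line 2 of a group with predicted pattern 0x2C (lines 2, 3, and 5) would yield two prefetch candidates, lines 3 and 5, rather than the whole 1024-byte group.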
Leakage Energy Reduction
[Figure: relative leakage power and execution time increase for gcc, mcf, vortex, facerec, and AVG; lower is better on both axes.]
• Up to 73% leakage energy reduction
• ~40% average leakage energy reduction
• < 1% average performance degradation
Performance Improvement
• Up to 2x speedup with 1024B spatial groups
• ~60% average speedup with 1024B spatial groups
Summary
• Spatial Pattern Predictor (SPP)
  • Key Contribution: PC + Group Offset
  • Small and Effective, High Coverage
  • 256-entry Tag-Less Direct-Mapped
  • ~95% coverage
• L1 Data Leakage Energy Reduction
  • ~40% reduction w/ 70nm CMOS technology
  • < 1% average performance degradation
• Prefetching w/ 1024-byte Group
  • Up to 2x speedup, 56% on average
  • Conventional Cache: 14% Slowdown
Prediction Index
[Figure: stacked bars for facerec, gcc, mcf, and vortex (0% to 160%); segments: Training, Over-Prediction, Under-Prediction, Correct Prediction. Index options: A: PC, B: PC + SPG ID, C: PC + SPG offset, D: PC + ADDR.]
• Infinite Tables
• PC + SPG offset yields high prediction accuracy
• PC + SPG offset has low prediction memory requirements
Contributions
• Spatial Pattern Predictor (SPP)
  • 256-entry Tag-Less Direct-Mapped
  • ~95% coverage
• Leakage Energy Reduction
  • ~40% reduction w/ 70nm CMOS technology
  • < 1% average performance degradation
• Processor Performance Improvement
  • Up to 2x speedup
Prediction Index
• PC + SPG offset yields high prediction accuracy
• PC + SPG offset has low prediction memory requirements
Predictor Memory Organization
• 256-entry tag-less direct-mapped yields average prediction accuracy of 96%
Predictor Memory Organization
[Figure: stacked bars for gcc, mcf, vortex, and facerec (0% to 160%); segments: Training, Over-Prediction, Under-Prediction, Correct Prediction. Configurations: A: 128-entry 16-way, B: 128-entry DM, C: 128-entry FA, D: 256-entry 16-way, E: 256-entry DM, F: 256-entry FA.]
• 256-entry tag-less direct-mapped
• average prediction accuracy of 96%