Predictor Virtualization
Ioana Burcea*, Stephen Somogyi§, Andreas Moshovos*, Babak Falsafi§#
*University of Toronto, Canada  §Carnegie Mellon University  #École Polytechnique Fédérale de Lausanne
ASPLOS 13, March 4, 2008
Why Predictors? History Repeats Itself
[Figure: predictors attached to the CPU — branch prediction, prefetching, value prediction, pointer caching, cache replacement]
• Application footprints grow
• Predictors need to scale to remain effective
Extra Resources: CMPs With Large On-Chip Caches
[Figure: 4-core CMP with per-core I$/D$ and a shared L2 cache of 10's – 100's of MB, backed by main memory]
Predictor Virtualization
[Figure: the same CMP — predictor metadata is mapped into the physical memory address space and stored through the shared L2 cache]
Predictor Virtualization (PV)
• Emulate large predictor tables
• Reduce the resources dedicated to predictor tables
Research Contributions
• PV: predictor metadata stored in the conventional cache hierarchy
• Benefits
  • Emulate larger tables → increased accuracy
  • Fewer dedicated resources
• Why now? Large caches / CMPs / need for larger predictors
• Will this work? Metadata locality → intrinsically exploited by caches
• First step: a virtualized data prefetcher
  • Performance: within 1% of the original on average
  • Space: from 60KB down to < 1KB
  • Plus the advantages of virtualization
Talk Road Map • PV architecture • PV in action • Virtualized “Spatial Memory Streaming” [ISCA 06]* • Conclusions *[ISCA 06] S. Somogyi, T. Wenisch, A. Ailamaki, B. Falsafi, and A. Moshovos. “Spatial Memory Streaming”
PV Architecture
[Figure: baseline — an optimization engine next to the CPU sends requests to a dedicated predictor table and receives predictions; this table is the structure to virtualize into the L2 cache and main memory]
PV Architecture
[Figure: virtualized — the dedicated table becomes a small PVCache backed by a PVProxy; on a PVCache miss, the PVProxy uses PVStart and the entry index to locate the full PVTable, which lives in the physical memory address space and is cached by the L2]
PV: Variable Prediction Latency
[Figure: the common case hits in the PVCache; infrequently the access is served by the L2 cache; rarely it must reach the PVTable in main memory]
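The flow below is a minimal sketch of the lookup implied by these figures; the data-structure layout and names (pv_proxy_t, memory_fetch_block, a direct-mapped PVCache of 8 blocks) are assumptions for illustration, not the paper's implementation.

```c
/* Sketch of a PV metadata lookup (illustrative, not the paper's design):
 * a small PVCache holds recently used blocks of predictor metadata; on a
 * miss, the PVProxy forms a physical address from PVStart and the table-set
 * index and fetches the 64-byte block through the normal hierarchy, so the
 * prediction latency depends on where the metadata currently resides
 * (PVCache hit = common, L2 hit = infrequent, memory = rare). */
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define PV_BLOCK_BYTES  64         /* one cache block of packed entries */
#define PV_CACHE_BLOCKS 8          /* small dedicated PVCache (assumed) */

typedef struct {
    bool     valid;
    uint64_t block_addr;               /* physical address of the block */
    uint8_t  data[PV_BLOCK_BYTES];     /* packed predictor entries      */
} pv_block_t;

typedef struct {
    uint64_t   pv_start;               /* PVStart: base of the PVTable  */
    pv_block_t cache[PV_CACHE_BLOCKS]; /* direct-mapped PVCache (assumed) */
} pv_proxy_t;

/* Stand-in for the memory-system model: in a real simulator this would
 * fetch the block through the L2 / main memory; here it just zeroes it. */
static void memory_fetch_block(uint64_t addr, uint8_t *buf)
{
    (void)addr;
    memset(buf, 0, PV_BLOCK_BYTES);
}

/* Return the metadata block holding table set 'set_index'
 * (one table set per 64-byte block). */
static const uint8_t *pv_lookup(pv_proxy_t *pv, uint64_t set_index)
{
    uint64_t    addr = pv->pv_start + set_index * PV_BLOCK_BYTES;
    pv_block_t *line = &pv->cache[set_index % PV_CACHE_BLOCKS];

    if (line->valid && line->block_addr == addr)
        return line->data;             /* common case: PVCache hit */

    /* PVCache miss: the PVProxy fetches the block from the L2
     * (infrequent) or from main memory (rare). */
    memory_fetch_block(addr, line->data);
    line->valid      = true;
    line->block_addr = addr;
    return line->data;
}

int main(void)
{
    pv_proxy_t pv = { .pv_start = 0x80000000ull };
    const uint8_t *first  = pv_lookup(&pv, 0);  /* miss: fetched via the hierarchy */
    const uint8_t *second = pv_lookup(&pv, 0);  /* hit in the PVCache */
    return (first == second) ? 0 : 1;
}
```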
Metadata Locality
• Entry reuse
  • Temporal: one entry is used for multiple predictions
  • Spatial: can be engineered, so one miss is amortized by several subsequent hits
• Metadata access patterns are predictable → predictor metadata can itself be prefetched
Talk Road Map • PV architecture • PV in action • Virtualized “Spatial Memory Streaming” [ISCA 06]* • Conclusions *[ISCA 06] S. Somogyi, T. Wenisch, A. Ailamaki, B. Falsafi, and A. Moshovos. “Spatial Memory Streaming”
Spatial Memory Streaming [ISCA 06]*
[Figure: accesses within a memory region are summarized as spatial bit patterns (e.g., 1100001010001…, 1100000001101…), which are stored in a pattern history table (PHT)]
*[ISCA 06] S. Somogyi, T. Wenisch, A. Ailamaki, B. Falsafi, and A. Moshovos. “Spatial Memory Streaming”
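As a concrete illustration of the spatial-pattern idea, here is a minimal sketch assuming one bit per cache block within a fixed-size region (32 blocks of 64 bytes, matching the later slides); the names record_access, replay_pattern, and prefetch_block are hypothetical.

```c
/* Sketch of SMS-style spatial patterns: each region is summarized by a bit
 * vector with one bit per cache block, set when that block is accessed.
 * On a later trigger access to the region, the stored pattern is replayed
 * to prefetch exactly the blocks whose bits are set. */
#include <stdint.h>
#include <stdio.h>

#define BLOCK_BYTES       64
#define BLOCKS_PER_REGION 32   /* gives a 32-bit spatial pattern */

/* Record an access: set the bit of the block touched within its region. */
static uint32_t record_access(uint32_t pattern, uint64_t addr)
{
    unsigned block_in_region =
        (unsigned)((addr / BLOCK_BYTES) % BLOCKS_PER_REGION);
    return pattern | (1u << block_in_region);
}

/* Stand-in for the prefetch interface of the cache model. */
static void prefetch_block(uint64_t block_addr)
{
    printf("prefetch 0x%llx\n", (unsigned long long)block_addr);
}

/* Replay a stored pattern on a trigger access to region_base. */
static void replay_pattern(uint64_t region_base, uint32_t pattern)
{
    for (unsigned i = 0; i < BLOCKS_PER_REGION; i++)
        if (pattern & (1u << i))
            prefetch_block(region_base + (uint64_t)i * BLOCK_BYTES);
}

int main(void)
{
    uint32_t pattern = 0;
    pattern = record_access(pattern, 0x10000);  /* block 0 of the region  */
    pattern = record_access(pattern, 0x10280);  /* block 10 of the region */
    replay_pattern(0x10000, pattern);           /* prefetches blocks 0 and 10 */
    return 0;
}
```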
Virtualizing “Spatial Memory Streaming” (SMS)
Virtualize the pattern storage:
[Figure: the detector (~1KB) observes the data access stream and extracts patterns; the predictor (~60KB of patterns) issues prefetches on a trigger access — the pattern table is the structure being virtualized]
Virtualizing SMS
• Packed entry: 11-bit tag + 32-bit pattern; 11 entries per 64-byte cache block, 39 bits unused
• Virtual table: 1K sets, 11 ways
• PVCache: 8 sets, 11 ways
• One table set maps onto one 64-byte cache block (see the packing sketch below)
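A minimal sketch of this packing, using the numbers from the slide; the helper pack_entry and its bit layout (entries stored back-to-back from bit 0) are assumptions for illustration.

```c
/* Sketch of packing one virtualized SMS table set into a 64-byte block:
 * 11 entries of 11-bit tag + 32-bit pattern (43 bits each) use 473 of the
 * 512 bits, leaving 39 bits unused, as on the slide. Entries are stored
 * back-to-back starting at bit 0 (an assumed layout). */
#include <assert.h>
#include <stdint.h>

#define BLOCK_BYTES        64
#define ENTRIES_PER_BLOCK  11
#define TAG_BITS           11
#define PATTERN_BITS       32
#define ENTRY_BITS         (TAG_BITS + PATTERN_BITS)   /* 43 */

/* Write entry i (tag in the upper 11 bits, pattern in the lower 32 bits of
 * a 43-bit payload) into the block at bit offset i * 43. */
static void pack_entry(uint8_t block[BLOCK_BYTES], unsigned i,
                       uint16_t tag, uint32_t pattern)
{
    uint64_t payload = ((uint64_t)(tag & 0x7FF) << PATTERN_BITS) | pattern;
    unsigned bit = i * ENTRY_BITS;

    for (unsigned b = 0; b < ENTRY_BITS; b++, bit++) {
        unsigned byte = bit / 8, off = bit % 8;
        if (payload & (1ull << b))
            block[byte] |= (uint8_t)(1u << off);
        else
            block[byte] &= (uint8_t)~(1u << off);
    }
}

int main(void)
{
    /* Sanity-check the arithmetic behind the slide's layout. */
    assert(ENTRIES_PER_BLOCK * ENTRY_BITS == 473);
    assert(BLOCK_BYTES * 8 - ENTRIES_PER_BLOCK * ENTRY_BITS == 39);

    uint8_t block[BLOCK_BYTES] = {0};
    pack_entry(block, 10, 0x5A5, 0xC0000360u);  /* the last entry still fits */
    return 0;
}
```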
Current Implementation
• Non-intrusive
  • Virtual table stored in reserved physical address space
  • One table per core
  • Caches oblivious to metadata
• Options
  • Predictor tables stored in virtual memory
  • Single, shared table per application
  • Caches aware of metadata
Simulation Infrastructure
• SimFlex: full-system simulator based on Simics
• Base processor configuration
  • 4-core CMP
  • 8-wide OoO, 256-entry ROB
  • L1D/L1I: 64KB, 4-way set-associative
  • UL2: 8MB, 16-way set-associative
• Commercial workloads
  • TPC-C: DB2 and Oracle
  • TPC-H: Query 1, Query 2, Query 16, Query 17
  • SpecWeb: Apache and Zeus
Original Prefetcher – Accuracy vs. Predictor Size
[Figure: L1 read misses vs. predictor table size per workload; fewer misses is better]
Small tables diminish prefetching accuracy.
Virtualized Prefetcher – Performance
[Figure: speedup vs. hardware cost; the virtualized prefetcher (< 1KB of dedicated storage) matches the original prefetcher (~60KB); higher speedup is better]
Impact on L2 Memory Requests
[Figure: increase in L2 memory requests per workload; lower is better]
Dark side: virtualization increases L2 memory requests.
Impact of Virtualization on Off-Chip Bandwidth
[Figure: increase in off-chip bandwidth per workload; lower is better. Extra L2 requests affect performance only indirectly, whereas off-chip bandwidth affects it directly]
Minimal impact on off-chip bandwidth.
Conclusions
• Predictor Virtualization: metadata stored in the conventional cache hierarchy
• Benefits
  • Emulate larger tables → increased accuracy
  • Fewer dedicated resources
• First step: a virtualized data prefetcher
  • Performance: within 1% of the original on average
  • Space: from 60KB down to < 1KB
• Opportunities
  • Metadata sharing and persistence
  • Application-directed prediction
  • Predictor adaptation
Predictor Virtualization
Ioana Burcea* (ioana@eecg.toronto.edu), Stephen Somogyi§, Andreas Moshovos*, Babak Falsafi§#
*University of Toronto, Canada  §Carnegie Mellon University  #École Polytechnique Fédérale de Lausanne
ASPLOS 13, March 4, 2008
PV – Motivating Trends
• Dedicating resources to predictors is hard to justify
  • Larger predictor tables → increased performance
  • Chip multiprocessors: space dedicated to predictors scales with the number of processors
• Memory hierarchies offer the opportunity
  • Increased capacity
  • Diminishing returns
→ Use conventional memory hierarchies to store predictor metadata
Virtualizing the Predictor Table
[Figure: the SMS pattern history table — indexed by the trigger-access address and PC, each entry holds a tag and a spatial bit pattern that drives prefetches; this table is what gets virtualized]
• PHT stored in the physical address space
• Multiple PHT entries packed into one memory block
• One memory request brings in an entire table set
Packing Entries in One Cache Block
• Index: PC + offset within the spatial group
  • PC → 16 bits
  • 32 blocks per spatial group → 5-bit offset and a 32-bit spatial pattern
  • → 21-bit index
• Pattern table: 1K sets → 10 bits index the table, leaving an 11-bit tag
• Cache block: 64 bytes → 11 entries per cache block
• → Pattern table: 1K sets, 11-way set associative
[Figure: 64-byte block layout — tag/pattern pairs packed back-to-back (bit offsets 0, 11, 43, 54, 85, …), with the trailing bits unused]
Memory Address Calculation
[Figure: the 16-bit PC and 5-bit block offset form the 21-bit index; 10 of its bits select the table set and, followed by six zero bits (64-byte blocks), are added to the PV start address to produce the memory address, while the remaining 11 bits form the tag; see the sketch below]
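The sketch below combines the last two slides into one address calculation. Which PC bits are used, and whether the set index comes from the low bits of the 21-bit index, are assumptions made for illustration; pv_locate and its fields are hypothetical names.

```c
/* Illustrative address calculation for the virtualized SMS table: the
 * 21-bit index is PC (16 bits) concatenated with the block offset within
 * the spatial group (5 bits); 10 bits select a set and, shifted by 6
 * (64-byte blocks), are added to the PV start address, while the remaining
 * 11 bits become the entry tag. */
#include <stdint.h>
#include <stdio.h>

#define PC_BITS     16
#define OFFSET_BITS 5
#define SET_BITS    10
#define BLOCK_SHIFT 6              /* log2 of the 64-byte cache block */

typedef struct {
    uint64_t addr;                 /* physical address of the table set */
    uint16_t tag;                  /* 11-bit tag stored with the entry  */
} pv_location_t;

static pv_location_t pv_locate(uint64_t pv_start, uint64_t pc,
                               unsigned block_offset)
{
    uint32_t index = (uint32_t)((pc & ((1u << PC_BITS) - 1)) << OFFSET_BITS)
                   | (block_offset & ((1u << OFFSET_BITS) - 1)); /* 21 bits */
    uint32_t set   = index & ((1u << SET_BITS) - 1);             /* 10 bits */
    uint16_t tag   = (uint16_t)(index >> SET_BITS);              /* 11 bits */

    pv_location_t loc = {
        .addr = pv_start + ((uint64_t)set << BLOCK_SHIFT), /* one set per 64B block */
        .tag  = tag,
    };
    return loc;
}

int main(void)
{
    pv_location_t loc = pv_locate(0x80000000ull, 0x4008a4ull, 7);
    printf("table set at 0x%llx, tag 0x%x\n",
           (unsigned long long)loc.addr, loc.tag);
    return 0;
}
```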
Increase in Off-Chip Bandwidth – Different L2 Sizes
[Figure: off-chip bandwidth increase for different L2 cache sizes]
Increased L2 Latency
[Figure: speedup when the L2 latency is increased]
Conclusions
• PV: metadata stored in the conventional cache hierarchy
• Benefits
  • Fewer dedicated resources
  • Emulate larger tables → increased accuracy
• Example: a virtualized data prefetcher
  • Performance: within 1% of the original on average
  • Space: from 60KB down to < 1KB
• Why now? Large caches / CMPs / need for larger predictors
• Will this work?
  • Metadata locality → intrinsically exploited by caches
  • Metadata access patterns are predictable
• Opportunities
  • Metadata sharing and persistence
  • Application-directed prediction
  • Predictor adaptation