Predictor Virtualization
Ioana Burcea*, Stephen Somogyi§, Andreas Moshovos*, Babak Falsafi§#
*University of Toronto, Canada  §Carnegie Mellon University  #École Polytechnique Fédérale de Lausanne
ASPLOS 13, March 4, 2008
Why Predictors? History Repeats Itself
[Figure: predictors attached to the CPU — branch prediction, prefetching, value prediction, pointer caching, cache replacement]
• Application footprints grow
• Predictors need to scale to remain effective
Extra Resources: CMPs With Large On-Chip Caches
[Figure: 4-core CMP with per-core I$/D$ and a shared L2 cache of 10's – 100's of MB, backed by main memory]
Predictor Virtualization
[Figure: the same CMP — predictor metadata is mapped into the physical memory address space and stored through the shared L2 cache]
Predictor Virtualization (PV)
• Emulate large predictor tables
• Reduce the resources dedicated to predictor tables
Research Contributions
• PV: predictor metadata stored in the conventional cache hierarchy
• Benefits
  • Emulate larger tables → increased accuracy
  • Fewer dedicated resources
• Why now? Large caches / CMPs / need for larger predictors
• Will this work? Metadata locality → intrinsically exploited by caches
• First step: a virtualized data prefetcher
  • Performance: within 1% of the original on average
  • Space: from 60KB down to < 1KB
  • Plus the advantages of virtualization
Talk Road Map • PV architecture • PV in action • Virtualized “Spatial Memory Streaming” [ISCA 06]* • Conclusions *[ISCA 06] S. Somogyi, T. Wenisch, A. Ailamaki, B. Falsafi, and A. Moshovos. “Spatial Memory Streaming”
PV Architecture
[Figure: baseline — an optimization engine next to the CPU sends requests to a dedicated predictor table and receives predictions; this table is the structure to virtualize into the L2 cache and main memory]
PV Architecture
[Figure: virtualized — the dedicated table becomes a small PVCache backed by a PVProxy; on a PVCache miss, the PVProxy uses PVStart and the entry index to locate the full PVTable, which lives in the physical memory address space and is cached by the L2]
PV: Variable Prediction Latency
[Figure: the common case hits in the PVCache; infrequently the access is served by the L2 cache; rarely it must reach the PVTable in main memory]
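The flow below is a minimal sketch of the lookup implied by these figures; the data-structure layout and names (pv_proxy_t, memory_fetch_block, a direct-mapped PVCache of 8 blocks) are assumptions for illustration, not the paper's implementation.

```c
/* Sketch of a PV metadata lookup (illustrative, not the paper's design):
 * a small PVCache holds recently used blocks of predictor metadata; on a
 * miss, the PVProxy forms a physical address from PVStart and the table-set
 * index and fetches the 64-byte block through the normal hierarchy, so the
 * prediction latency depends on where the metadata currently resides
 * (PVCache hit = common, L2 hit = infrequent, memory = rare). */
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define PV_BLOCK_BYTES  64         /* one cache block of packed entries */
#define PV_CACHE_BLOCKS 8          /* small dedicated PVCache (assumed) */

typedef struct {
    bool     valid;
    uint64_t block_addr;               /* physical address of the block */
    uint8_t  data[PV_BLOCK_BYTES];     /* packed predictor entries      */
} pv_block_t;

typedef struct {
    uint64_t   pv_start;               /* PVStart: base of the PVTable  */
    pv_block_t cache[PV_CACHE_BLOCKS]; /* direct-mapped PVCache (assumed) */
} pv_proxy_t;

/* Stand-in for the memory-system model: in a real simulator this would
 * fetch the block through the L2 / main memory; here it just zeroes it. */
static void memory_fetch_block(uint64_t addr, uint8_t *buf)
{
    (void)addr;
    memset(buf, 0, PV_BLOCK_BYTES);
}

/* Return the metadata block holding table set 'set_index'
 * (one table set per 64-byte block). */
static const uint8_t *pv_lookup(pv_proxy_t *pv, uint64_t set_index)
{
    uint64_t    addr = pv->pv_start + set_index * PV_BLOCK_BYTES;
    pv_block_t *line = &pv->cache[set_index % PV_CACHE_BLOCKS];

    if (line->valid && line->block_addr == addr)
        return line->data;             /* common case: PVCache hit */

    /* PVCache miss: the PVProxy fetches the block from the L2
     * (infrequent) or from main memory (rare). */
    memory_fetch_block(addr, line->data);
    line->valid      = true;
    line->block_addr = addr;
    return line->data;
}

int main(void)
{
    pv_proxy_t pv = { .pv_start = 0x80000000ull };
    const uint8_t *first  = pv_lookup(&pv, 0);  /* miss: fetched via the hierarchy */
    const uint8_t *second = pv_lookup(&pv, 0);  /* hit in the PVCache */
    return (first == second) ? 0 : 1;
}
```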
Metadata Locality
• Entry reuse
  • Temporal: one entry is used for multiple predictions
  • Spatial: can be engineered, so one miss is amortized by several subsequent hits
• Metadata access patterns are predictable → predictor metadata can itself be prefetched
Talk Road Map • PV architecture • PV in action • Virtualized “Spatial Memory Streaming” [ISCA 06]* • Conclusions *[ISCA 06] S. Somogyi, T. Wenisch, A. Ailamaki, B. Falsafi, and A. Moshovos. “Spatial Memory Streaming”
Spatial Memory Streaming [ISCA 06]*
[Figure: accesses within a memory region are summarized as spatial bit patterns (e.g., 1100001010001…, 1100000001101…), which are stored in a pattern history table (PHT)]
*[ISCA 06] S. Somogyi, T. Wenisch, A. Ailamaki, B. Falsafi, and A. Moshovos. “Spatial Memory Streaming”
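As a concrete illustration of the spatial-pattern idea, here is a minimal sketch assuming one bit per cache block within a fixed-size region (32 blocks of 64 bytes, matching the later slides); the names record_access, replay_pattern, and prefetch_block are hypothetical.

```c
/* Sketch of SMS-style spatial patterns: each region is summarized by a bit
 * vector with one bit per cache block, set when that block is accessed.
 * On a later trigger access to the region, the stored pattern is replayed
 * to prefetch exactly the blocks whose bits are set. */
#include <stdint.h>
#include <stdio.h>

#define BLOCK_BYTES       64
#define BLOCKS_PER_REGION 32   /* gives a 32-bit spatial pattern */

/* Record an access: set the bit of the block touched within its region. */
static uint32_t record_access(uint32_t pattern, uint64_t addr)
{
    unsigned block_in_region =
        (unsigned)((addr / BLOCK_BYTES) % BLOCKS_PER_REGION);
    return pattern | (1u << block_in_region);
}

/* Stand-in for the prefetch interface of the cache model. */
static void prefetch_block(uint64_t block_addr)
{
    printf("prefetch 0x%llx\n", (unsigned long long)block_addr);
}

/* Replay a stored pattern on a trigger access to region_base. */
static void replay_pattern(uint64_t region_base, uint32_t pattern)
{
    for (unsigned i = 0; i < BLOCKS_PER_REGION; i++)
        if (pattern & (1u << i))
            prefetch_block(region_base + (uint64_t)i * BLOCK_BYTES);
}

int main(void)
{
    uint32_t pattern = 0;
    pattern = record_access(pattern, 0x10000);  /* block 0 of the region  */
    pattern = record_access(pattern, 0x10280);  /* block 10 of the region */
    replay_pattern(0x10000, pattern);           /* prefetches blocks 0 and 10 */
    return 0;
}
```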
Virtualizing “Spatial Memory Streaming” (SMS)
Virtualize the pattern storage:
[Figure: the detector (~1KB) observes the data access stream and extracts patterns; the predictor (~60KB of patterns) issues prefetches on a trigger access — the pattern table is the structure being virtualized]
Virtualizing SMS
• Packed entry: 11-bit tag + 32-bit pattern; 11 entries per 64-byte cache block, 39 bits unused
• Virtual table: 1K sets, 11 ways
• PVCache: 8 sets, 11 ways
• One table set maps onto one 64-byte cache block (see the packing sketch below)
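A minimal sketch of this packing, using the numbers from the slide; the helper pack_entry and its bit layout (entries stored back-to-back from bit 0) are assumptions for illustration.

```c
/* Sketch of packing one virtualized SMS table set into a 64-byte block:
 * 11 entries of 11-bit tag + 32-bit pattern (43 bits each) use 473 of the
 * 512 bits, leaving 39 bits unused, as on the slide. Entries are stored
 * back-to-back starting at bit 0 (an assumed layout). */
#include <assert.h>
#include <stdint.h>

#define BLOCK_BYTES        64
#define ENTRIES_PER_BLOCK  11
#define TAG_BITS           11
#define PATTERN_BITS       32
#define ENTRY_BITS         (TAG_BITS + PATTERN_BITS)   /* 43 */

/* Write entry i (tag in the upper 11 bits, pattern in the lower 32 bits of
 * a 43-bit payload) into the block at bit offset i * 43. */
static void pack_entry(uint8_t block[BLOCK_BYTES], unsigned i,
                       uint16_t tag, uint32_t pattern)
{
    uint64_t payload = ((uint64_t)(tag & 0x7FF) << PATTERN_BITS) | pattern;
    unsigned bit = i * ENTRY_BITS;

    for (unsigned b = 0; b < ENTRY_BITS; b++, bit++) {
        unsigned byte = bit / 8, off = bit % 8;
        if (payload & (1ull << b))
            block[byte] |= (uint8_t)(1u << off);
        else
            block[byte] &= (uint8_t)~(1u << off);
    }
}

int main(void)
{
    /* Sanity-check the arithmetic behind the slide's layout. */
    assert(ENTRIES_PER_BLOCK * ENTRY_BITS == 473);
    assert(BLOCK_BYTES * 8 - ENTRIES_PER_BLOCK * ENTRY_BITS == 39);

    uint8_t block[BLOCK_BYTES] = {0};
    pack_entry(block, 10, 0x5A5, 0xC0000360u);  /* the last entry still fits */
    return 0;
}
```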
Current Implementation
• Non-intrusive
  • Virtual table stored in reserved physical address space
  • One table per core
  • Caches oblivious to metadata
• Options
  • Predictor tables stored in virtual memory
  • Single, shared table per application
  • Caches aware of metadata
Simulation Infrastructure
• SimFlex: full-system simulator based on Simics
• Base processor configuration
  • 4-core CMP
  • 8-wide OoO, 256-entry ROB
  • L1D/L1I: 64KB, 4-way set-associative
  • UL2: 8MB, 16-way set-associative
• Commercial workloads
  • TPC-C: DB2 and Oracle
  • TPC-H: Query 1, Query 2, Query 16, Query 17
  • SpecWeb: Apache and Zeus
Original Prefetcher – Accuracy vs. Predictor Size
[Figure: L1 read misses vs. predictor table size per workload; fewer misses is better]
Small tables diminish prefetching accuracy.
Virtualized Prefetcher – Performance
[Figure: speedup vs. hardware cost; the virtualized prefetcher (< 1KB of dedicated storage) matches the original prefetcher (~60KB); higher speedup is better]
Impact on L2 Memory Requests
[Figure: increase in L2 memory requests per workload; lower is better]
Dark side: virtualization increases L2 memory requests.
Impact of Virtualization on Off-Chip Bandwidth
[Figure: increase in off-chip bandwidth per workload; lower is better. Extra L2 requests affect performance only indirectly, whereas off-chip bandwidth affects it directly]
Minimal impact on off-chip bandwidth.
Conclusions
• Predictor Virtualization: metadata stored in the conventional cache hierarchy
• Benefits
  • Emulate larger tables → increased accuracy
  • Fewer dedicated resources
• First step: a virtualized data prefetcher
  • Performance: within 1% of the original on average
  • Space: from 60KB down to < 1KB
• Opportunities
  • Metadata sharing and persistence
  • Application-directed prediction
  • Predictor adaptation
Predictor Virtualization
Ioana Burcea* (ioana@eecg.toronto.edu), Stephen Somogyi§, Andreas Moshovos*, Babak Falsafi§#
*University of Toronto, Canada  §Carnegie Mellon University  #École Polytechnique Fédérale de Lausanne
ASPLOS 13, March 4, 2008
PV – Motivating Trends
• Dedicating resources to predictors is hard to justify
  • Larger predictor tables → increased performance
  • Chip multiprocessors: space dedicated to predictors scales with the number of processors
• Memory hierarchies offer the opportunity
  • Increased capacity
  • Diminishing returns
→ Use conventional memory hierarchies to store predictor metadata
Virtualizing the Predictor Table
[Figure: the SMS pattern history table — indexed by the trigger-access address and PC, each entry holds a tag and a spatial bit pattern that drives prefetches; this table is what gets virtualized]
• PHT stored in the physical address space
• Multiple PHT entries packed into one memory block
• One memory request brings in an entire table set
Packing Entries in One Cache Block
• Index: PC + offset within the spatial group
  • PC → 16 bits
  • 32 blocks per spatial group → 5-bit offset and a 32-bit spatial pattern
  • → 21-bit index
• Pattern table: 1K sets → 10 bits index the table, leaving an 11-bit tag
• Cache block: 64 bytes → 11 entries per cache block
• → Pattern table: 1K sets, 11-way set associative
[Figure: 64-byte block layout — tag/pattern pairs packed back-to-back (bit offsets 0, 11, 43, 54, 85, …), with the trailing bits unused]
Memory Address Calculation
[Figure: the 16-bit PC and 5-bit block offset form the 21-bit index; 10 of its bits select the table set and, followed by six zero bits (64-byte blocks), are added to the PV start address to produce the memory address, while the remaining 11 bits form the tag; see the sketch below]
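The sketch below combines the last two slides into one address calculation. Which PC bits are used, and whether the set index comes from the low bits of the 21-bit index, are assumptions made for illustration; pv_locate and its fields are hypothetical names.

```c
/* Illustrative address calculation for the virtualized SMS table: the
 * 21-bit index is PC (16 bits) concatenated with the block offset within
 * the spatial group (5 bits); 10 bits select a set and, shifted by 6
 * (64-byte blocks), are added to the PV start address, while the remaining
 * 11 bits become the entry tag. */
#include <stdint.h>
#include <stdio.h>

#define PC_BITS     16
#define OFFSET_BITS 5
#define SET_BITS    10
#define BLOCK_SHIFT 6              /* log2 of the 64-byte cache block */

typedef struct {
    uint64_t addr;                 /* physical address of the table set */
    uint16_t tag;                  /* 11-bit tag stored with the entry  */
} pv_location_t;

static pv_location_t pv_locate(uint64_t pv_start, uint64_t pc,
                               unsigned block_offset)
{
    uint32_t index = (uint32_t)((pc & ((1u << PC_BITS) - 1)) << OFFSET_BITS)
                   | (block_offset & ((1u << OFFSET_BITS) - 1)); /* 21 bits */
    uint32_t set   = index & ((1u << SET_BITS) - 1);             /* 10 bits */
    uint16_t tag   = (uint16_t)(index >> SET_BITS);              /* 11 bits */

    pv_location_t loc = {
        .addr = pv_start + ((uint64_t)set << BLOCK_SHIFT), /* one set per 64B block */
        .tag  = tag,
    };
    return loc;
}

int main(void)
{
    pv_location_t loc = pv_locate(0x80000000ull, 0x4008a4ull, 7);
    printf("table set at 0x%llx, tag 0x%x\n",
           (unsigned long long)loc.addr, loc.tag);
    return 0;
}
```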
Increase in Off-Chip Bandwidth – Different L2 Sizes
[Figure: off-chip bandwidth increase for different L2 cache sizes]
Increased L2 Latency
[Figure: speedup when the L2 latency is increased]
Conclusions
• PV: metadata stored in the conventional cache hierarchy
• Benefits
  • Fewer dedicated resources
  • Emulate larger tables → increased accuracy
• Example: a virtualized data prefetcher
  • Performance: within 1% of the original on average
  • Space: from 60KB down to < 1KB
• Why now? Large caches / CMPs / need for larger predictors
• Will this work?
  • Metadata locality → intrinsically exploited by caches
  • Metadata access patterns are predictable
• Opportunities
  • Metadata sharing and persistence
  • Application-directed prediction
  • Predictor adaptation