Two Ways to Exploit Multi-Megabyte Caches
AENAO Research Group @ Toronto
Kaveh Aasaraai, Ioana Burcea, Myrto Papadopoulou, Elham Safi, Jason Zebchuk, Andreas Moshovos
{aasaraai, ioana, myrto, elham, zebchuk, moshovos}@eecg.toronto.edu
Future Caches: Just Larger?
• "Big Picture" Management
• Store Metadata
[Figure: several CPUs with private I$/D$ caches connected over an interconnect to a 10s – 100s of MB cache and main memory]
Conventional Block-Centric Cache: Fine-Grain View of Memory
• "Small" blocks
• Optimizes bandwidth and performance
  • Especially for large L2/L3 caches
• The "big picture" is lost
"Big Picture" View: Coarse-Grain View of Memory
• Region: a 2^n-sized, aligned area of memory
• Patterns and behavior exposed
  • Spatial locality
• Exploit for performance/area/power
Exploiting Coarse-Grain Patterns
• Many existing coarse-grain optimizations: Circuit-Switched Coherence, Stealth Prefetching, RegionScout, Destination-Set Prediction, Coarse-Grain Coherence Tracking, Spatial Memory Streaming, Run-time Adaptive Cache Hierarchy Management via Reference Analysis
• Each adds new structures to track coarse-grain information: hard to justify for a commercial design
• Instead, embed coarse-grain information in the tag array
• Support many different optimizations with less area overhead: an adaptable optimization FRAMEWORK
RegionTracker Solution
• Manage blocks, but also track and manage regions
[Figure: the L2 cache's tag array is replaced by RegionTracker; L1 block requests, region probes, and region responses are served against the data array of data blocks]
RegionTracker Summary
• Replace the conventional tag array:
  • 4-core CMP with 8MB shared L2 cache
  • Within 1% of original performance
  • Up to 20% less tag area
  • Average 33% less energy consumption
• Optimization framework:
  • Stealth Prefetching: same performance, 36% less area
  • RegionScout: 2x more snoops avoided, no area overhead
Road Map
• Introduction
• Goals
• Coarse-Grain Cache Designs
• RegionTracker: A Tag Array Replacement
• RegionTracker: An Optimization Framework
• Conclusion
Goals
• Conventional tag array functionality
  • Identify data block location and state
  • Leave the data array unchanged
• Optimization framework functionality (a sketch of this interface follows)
  • Is Region X cached?
  • Which blocks of Region X are cached? Where?
  • Evict or migrate Region X
  • Easy to assign properties to each region
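To make the framework goals concrete, here is a minimal Python sketch of the kind of region-level interface they imply. The class and method names (RegionDirectory, is_region_cached, and so on) are illustrative, not the actual RegionTracker interface; the 1KB region and 64B block sizes follow the example configuration used later in the talk.

```python
# A minimal sketch of the region-level interface these goals imply. All names are
# illustrative, not the actual RegionTracker interface.
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple

REGION_SIZE = 1024
BLOCK_SIZE = 64
BLOCKS_PER_REGION = REGION_SIZE // BLOCK_SIZE   # 16 blocks per region

@dataclass
class RegionEntry:
    ways: List[Optional[int]] = field(default_factory=lambda: [None] * BLOCKS_PER_REGION)
    props: Dict[str, object] = field(default_factory=dict)   # e.g., {"non_shared": True}

class RegionDirectory:
    def __init__(self) -> None:
        self._regions: Dict[int, RegionEntry] = {}

    def _base(self, addr: int) -> int:
        return addr & ~(REGION_SIZE - 1)          # region-aligned base address

    def is_region_cached(self, addr: int) -> bool:
        """Is Region X cached?"""
        return self._base(addr) in self._regions

    def cached_blocks(self, addr: int) -> List[Tuple[int, int]]:
        """Which blocks of Region X are cached, and in which way?"""
        entry = self._regions.get(self._base(addr))
        if entry is None:
            return []
        return [(i, w) for i, w in enumerate(entry.ways) if w is not None]

    def evict_region(self, addr: int) -> Optional[RegionEntry]:
        """Evict (or migrate) Region X as a single unit."""
        return self._regions.pop(self._base(addr), None)

    def set_property(self, addr: int, key: str, value: object) -> None:
        """Assign an arbitrary per-region property, e.g., for RegionScout."""
        self._regions.setdefault(self._base(addr), RegionEntry()).props[key] = value
```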
Coarse-Grain Cache Designs: Large Block Size
• Increased bandwidth, decreased hit-rates
[Figure: Region X stored as one large block in the tag and data arrays]
Sector Cache
• Decreased hit-rates
[Figure: Region X in the tag and data arrays of a sector cache]
Sector Pool Cache
• High associativity (2-4 times)
[Figure: Region X in the tag and data arrays of a sector pool cache]
Decoupled Sector Cache
• Region information not exposed
• Region replacement requires scanning multiple entries
[Figure: Region X spread across the tag array, status table, and data array]
Design Requirements
• Small block size (64B)
• Miss-rate does not increase
• Lookup associativity does not increase
• No additional access latency (i.e., no scanning, no multiple block evictions)
• Does not increase latency, area, or energy
• Allows banking and interleaving
• Fits in the conventional tag array "envelope"
RegionTracker: A Tag Array Replacement
• Three SRAM arrays, combined smaller than the conventional tag array: the Region Vector Array, the Evicted Region Buffer, and the Block Status Table
[Figure: L1 requests reach the L2 data array through the three RegionTracker structures]
Basic Structures
Example: 8MB, 16-way set-associative cache, 64-byte blocks, 1KB regions
• An address selects a specific RVA set and BST set
• An RVA entry covers multiple, consecutive BST sets
• A BST entry belongs to one of four RVA sets
[Figure: a Region Vector Array (RVA) entry holding a region tag plus per-block (block0..block15) valid, status, and way fields, alongside the Block Status Table (BST)]
Common Case: Hit
Example: 8MB, 16-way set-associative cache, 64-byte blocks, 1KB regions
• Address split (bits 49..0): Region Tag | RVA Index | Region Offset | Block Offset, with field boundaries at bits 21, 10, and 6
• The matching RVA entry supplies the block's valid, status, and way fields; the data array and BST are indexed by bits 19..6 plus the block offset (sketched below)
[Figure: the address feeding the RVA and BST, with the selected way sent on to the data array]
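The address split on this slide can be written out directly. A small sketch, assuming the field boundaries shown (bits 21, 10, and 6) and the BST/data-array index taken from bits 19..6; the function and field names are illustrative.

```python
# Hit-path address split for the example configuration (8MB, 16-way, 64-byte
# blocks, 1KB regions). Bit boundaries follow the slide; names are illustrative.

def split_address(addr: int) -> dict:
    block_offset  = addr & 0x3F            # bits [5:0]   - byte within the 64B block
    region_offset = (addr >> 6) & 0xF      # bits [9:6]   - which of the region's 16 blocks
    rva_index     = (addr >> 10) & 0x7FF   # bits [20:10] - selects the RVA set
    region_tag    = addr >> 21             # bits [49:21] - compared against the RVA tags
    bst_index     = (addr >> 6) & 0x3FFF   # bits [19:6]  - selects the BST / data array set
    return {"region_tag": region_tag, "rva_index": rva_index,
            "region_offset": region_offset, "bst_index": bst_index,
            "block_offset": block_offset}

# On a hit, the matching RVA entry's valid/status/way fields for region_offset
# say whether and where the block lives; bst_index plus that way pointer then
# drive the data array access.
print(split_address(0x2ABCD1234))
```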
Worst Case (Rare): Region Miss
• No RVA entry matches the region tag ("No Match!"); the Evicted Region Buffer (ERB) holds the entry of the region being displaced
• Same address split as before; entries carry a pointer (Ptr) along with the per-block valid, status, and way fields
[Figure: the address probing the RVA, ERB, and BST on a region miss]
Methodology
• Flexus simulator from the CMU SimFlex group, based on the Simics full-system simulator
• 4-core CMP modeled after Piranha
  • Private 32KB, 4-way set-associative L1 caches
  • Shared 8MB, 16-way set-associative L2 cache
  • 64-byte blocks
• Miss-rates: functional simulation of 2 billion instructions per core
• Performance and energy: timing simulation using the SMARTS sampling methodology
• Area and power: full-custom implementation in a 130nm commercial technology
• 9 commercial workloads:
  • WEB: SpecWEB on Apache and Zeus
  • OLTP: TPC-C on DB2 and Oracle
  • DSS: 5 TPC-H queries on DB2
[Figure: four processors with private I$/D$ caches connected over an interconnect to the shared L2]
Miss-Rates vs. Area
• Sector Cache: 512KB sectors; SPC and RegionTracker: 1KB regions
• Trade-offs comparable to a conventional cache
[Plot: relative miss-rate vs. relative tag array area; the Sector Cache point sits at (0.25, 1.26); other configurations labeled 14-way, 15-way, 48-way, and 52-way; lower and further left is better]
Performance & Energy Performance Energy better better Reduction in Tag Energy Normalized Execution Time • 12-way set-associative RegionTracker: 20% less area • Error bars: 95% confidence interval • Performance within 1%, with 33% tag energy reduction Aenao Group/Toronto
Road Map
• Introduction
• Goals
• Coarse-Grain Cache Designs
• RegionTracker: A Tag Array Replacement
• RegionTracker: An Optimization Framework
• Conclusion
RegionTracker: An Optimization Framework
• Stealth Prefetching: average 20% performance improvement; dropping in RegionTracker gives 36% less area overhead
• RegionScout: in-depth analysis
[Figure: L1 requests reaching the L2 data array through the RVA, ERB, and BST]
Snoop Coherence: Common Case
• Many snoops are to non-shared regions
[Figure: one CPU reads x, x+1, x+2, ..., x+n; each miss broadcasts a snoop to the other CPUs, which also miss, before the request goes to main memory]
RegionScout
• Eliminate broadcasts for non-shared regions
• Each node tracks its locally cached regions and remembers non-shared regions
[Figure: a read of x misses in the other CPUs' caches and regions, producing a global region miss; the requester records the region as non-shared and later requests to it go straight to main memory]
RegionTracker Implementation of RegionScout
• Locally cached regions: already provided by the RVA
• Non-shared regions: add 1 bit to each RVA entry
• Minimal overhead to support the RegionScout optimization (a behavioral sketch follows)
• Still uses less area than a conventional tag array
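A behavioral sketch of how the one extra bit per RVA entry filters snoops, following the RegionScout idea described above. The classes and the simplified miss protocol are illustrative assumptions; in particular, clearing stale non-shared bits when another node later caches the region is omitted.

```python
# Behavioral sketch: the RVA already says which regions this node caches, and one
# extra bit per RVA entry remembers regions known to be non-shared.

REGION_SIZE = 1024

class Node:
    def __init__(self, name: str):
        self.name = name
        self.cached_regions = set()   # already provided by the RVA
        self.non_shared = set()       # the added 1 bit per RVA entry

    def _region(self, addr: int) -> int:
        return addr & ~(REGION_SIZE - 1)

    def caches_region(self, addr: int) -> bool:
        return self._region(addr) in self.cached_regions

    def miss(self, addr: int, all_nodes) -> str:
        region = self._region(addr)
        if region in self.non_shared:
            return "memory"                       # snoop broadcast avoided
        shared = any(n.caches_region(addr) for n in all_nodes if n is not self)
        if not shared:
            self.non_shared.add(region)           # global region miss: remember it
        return "broadcast"                        # this first broadcast still goes out

nodes = [Node("P0"), Node("P1")]
print(nodes[0].miss(0x1000, nodes))   # broadcast (global region miss recorded)
print(nodes[0].miss(0x1040, nodes))   # memory (broadcast filtered within the region)
```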
RegionTracker + RegionScout
• 4 processors, 512KB L2 caches, 1KB regions
• Avoids 41% of snoop broadcasts with no area overhead compared to a conventional tag array
[Chart: reduction in snoop broadcasts, including a BlockScout (4KB) configuration for comparison; higher is better]
Result Summary
• Replace the conventional tag array:
  • 20% less tag area
  • 33% less tag energy
  • Within 1% of original performance
• Coarse-grain optimization framework:
  • 36% reduction in area overhead for Stealth Prefetching
  • Filters 41% of snoop broadcasts with no area overhead compared to a conventional cache
Predictor Virtualization
Ioana Burcea
Joint work with Stephen Somogyi and Babak Falsafi
Optimization Engines: Predictors
[Figure: a chip multiprocessor with many CPUs, each with private L1-I/L1-D caches, connected over an interconnect to a shared L2 and main memory; the per-core predictors are the optimization engines that Predictor Virtualization targets]
Motivating Trends
• Dedicating resources to predictors is hard to justify:
  • Chip multiprocessors: space dedicated to predictors scales with the number of processors
  • Larger predictor tables are needed for increased performance
• Memory hierarchies offer the opportunity:
  • Increased capacity
  • How many apps really use the space?
• Use conventional memory hierarchies to store predictor information
PV Architecture (cont'd)
[Figure: the optimization engine sends requests to a dedicated predictor table and receives predictions]
PV Architecture (cont'd)
[Figure: the dedicated predictor table is replaced by a Predictor Virtualization layer that still answers the engine's requests with predictions]
PV Architecture (cont'd)
• The optimization engine's requests go to a PVProxy sitting on the backside of the L1, which holds a small PVCache with MSHRs
• PVCache misses fetch entries of the PVTable, which lives in the L2 and main memory starting at the PVStart address and is accessed by index (sketched below)
[Figure: optimization engine, PVProxy/PVCache, L2-resident PVTable, and main memory]
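A rough sketch of the PVProxy behavior described on this slide: check the small PVCache first, and turn misses into ordinary reads of the memory-resident PVTable at PVStart plus an index. MSHR handling, evictions, and entry packing are omitted, and all names and values are illustrative.

```python
# Rough behavioral sketch of the PVProxy; not the actual hardware interface.

ENTRY_BYTES = 64   # assume one PVTable entry (or group of entries) per cache block

class FlatMemory:
    """Stand-in for the L2 / main-memory hierarchy."""
    def read(self, addr: int) -> int:
        return 0   # a real hierarchy would return the cached PVTable contents

class PVProxy:
    def __init__(self, pv_start: int, memory: FlatMemory):
        self.pv_start = pv_start   # base physical address of the PVTable
        self.memory = memory
        self.pvcache = {}          # index -> predictor entry (the small PVCache)

    def lookup(self, index: int) -> int:
        if index in self.pvcache:                    # common case: PVCache hit
            return self.pvcache[index]
        addr = self.pv_start + index * ENTRY_BYTES   # miss: fetch through the hierarchy
        entry = self.memory.read(addr)
        self.pvcache[index] = entry                  # a real PVCache would also evict
        return entry

proxy = PVProxy(pv_start=0x8000_0000, memory=FlatMemory())
print(proxy.lookup(42))
```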
To Virtualize Or Not to Virtualize?
• Common case: requests are served locally; trips across the interconnect to the L2/L3 and main memory are infrequent
• Two enablers: 1. Re-use  2. Predictor info prefetching
[Figure: CPU with I$/D$ connected over the interconnect to the L2/L3 and main memory; the long path is the infrequent one]
To Virtualize or Not?
• Challenge: hit in the PVCache most of the time
• Will not work for all predictors out of the box: reuse is necessary
  • Intrinsic reuse: easy to virtualize
  • Non-intrinsic reuse: must be engineered, more so if the predictor needs to be fast to start with
Will There Be Reuse?
• Intrinsic: multiple predictions per entry (we'll see an example)
• Can be engineered: group temporally correlated entries together in a cache block
[Figure: CPU with I$/D$ connected over the interconnect to the L2/L3 and main memory]
Spatial Memory Streaming
• Footprint: the blocks accessed within a memory region
• Predict that the footprint will be the same the next time the region is accessed
• Handle: PC + offset within the region (a sketch of the idea follows)
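A sketch of the footprint idea in Python: record which blocks of a region were touched during a generation, keyed by the handle (PC plus the trigger's offset within the region), and replay the stored bit pattern as prefetch addresses when the same trigger recurs. The 32-blocks-per-region size follows the packing example later in the talk; everything else (names, table organization) is an illustrative simplification of the real SMS predictor.

```python
# Footprint sketch: one bit per block in the region, keyed by (PC, trigger offset).

REGION_BLOCKS = 32
BLOCK_SIZE = 64
REGION_SIZE = REGION_BLOCKS * BLOCK_SIZE

pattern_table = {}   # (pc, trigger_offset) -> footprint bit vector

def _offset(addr: int) -> int:
    return (addr % REGION_SIZE) // BLOCK_SIZE

def record_generation(pc: int, accesses: list) -> None:
    """Fold one generation's accesses (the first one is the trigger) into a footprint."""
    footprint = 0
    for addr in accesses:
        footprint |= 1 << _offset(addr)
    pattern_table[(pc, _offset(accesses[0]))] = footprint

def predict(pc: int, trigger_addr: int) -> list:
    """On a trigger access, expand the stored footprint into prefetch addresses."""
    base = trigger_addr - (trigger_addr % REGION_SIZE)
    footprint = pattern_table.get((pc, _offset(trigger_addr)), 0)
    return [base + i * BLOCK_SIZE for i in range(REGION_BLOCKS) if footprint & (1 << i)]

record_generation(pc=0x400100, accesses=[0x10000, 0x10040, 0x10180])
print([hex(a) for a in predict(pc=0x400100, trigger_addr=0x20000)])
```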
Spatial Generations
Virtualizing SMS
• Virtualize the patterns
[Figure: the detector records access patterns and feeds the predictor; a trigger access looks up a pattern and issues prefetches]
Virtualizing SMS (cont'd)
[Figure: the pattern table becomes a virtual table of 1K sets backed by a small PVCache; each cache block packs consecutive (11-bit tag, 32-bit pattern) entries, with the leftover bits unused]
Packing Entries in One Cache Block
• Index: PC + offset within the spatial group
  • PC → 16 bits
  • 32 blocks in a spatial group → 5-bit offset → 32-bit spatial pattern
  • 21-bit index in total
• Pattern table: 1K sets → 10 bits to index the table → 11-bit tag
• Cache block: 64 bytes → 11 entries per cache block → the pattern table becomes 1K sets, 11-way set-associative (the arithmetic is worked through below)
[Figure: one cache block holding 11 packed (tag, pattern) entries, with the remaining bits unused]
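The bit accounting on this slide, spelled out as a quick check. The numbers come straight from the slide; only the variable names are new.

```python
# Packing arithmetic: 11 (tag, pattern) entries fit in one 64-byte cache block.

PC_BITS = 16
OFFSET_BITS = 5                            # 32 blocks per spatial group
INDEX_BITS = PC_BITS + OFFSET_BITS         # 21-bit prediction index

SETS = 1024                                # pattern table: 1K sets
SET_INDEX_BITS = SETS.bit_length() - 1     # 10 bits to index the table
TAG_BITS = INDEX_BITS - SET_INDEX_BITS     # 11-bit tag

PATTERN_BITS = 32                          # one bit per block in the spatial group
ENTRY_BITS = TAG_BITS + PATTERN_BITS       # 43 bits per (tag, pattern) entry

BLOCK_BITS = 64 * 8                        # 64-byte cache block = 512 bits
ENTRIES_PER_BLOCK = BLOCK_BITS // ENTRY_BITS                # 11 entries per block
UNUSED_BITS = BLOCK_BITS - ENTRIES_PER_BLOCK * ENTRY_BITS   # leftover, unused bits

print(INDEX_BITS, TAG_BITS, ENTRIES_PER_BLOCK, UNUSED_BITS)  # 21 11 11 39
```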
Memory Address Calculation
• The 16-bit PC and 5-bit block offset form the 21-bit index; its low 10 bits select the pattern table set
• Memory address = PVStart address + (set index with six zero bits appended, i.e., scaled by the 64-byte block size); see the sketch below
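The same calculation as code. The concatenation order (PC in the high bits) and the PV_START value are illustrative assumptions; the 10-bit set selection and the six appended zero bits come from the slide.

```python
# Virtualized-table address calculation for the SMS pattern table.

PV_START = 0x8000_0000   # base physical address of the virtualized pattern table (illustrative)

def pvtable_address(pc: int, block_offset: int) -> int:
    index = ((pc & 0xFFFF) << 5) | (block_offset & 0x1F)   # 16-bit PC + 5-bit offset = 21 bits
    set_index = index & 0x3FF                               # low 10 bits pick one of 1K sets
    return PV_START + (set_index << 6)                      # append six zeros: 64B per set

print(hex(pvtable_address(pc=0x1234, block_offset=7)))
```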
Simulation Infrastructure
• SimFlex (CMU Impetus): full-system simulator based on Simics
• Base processor configuration:
  • 8-wide OoO
  • 256-entry ROB / 64-entry LSQ
  • L1D/L1I: 64KB, 4-way set-associative
  • UL2: 8MB, 16-way set-associative
• Commercial workloads:
  • TPC-C: DB2 and Oracle
  • TPC-H: Query 1, Query 2, Query 16, Query 17
  • Web: Apache and Zeus
SMS – Performance Potential
[Chart: performance potential of SMS; higher is better]
Virtualized Spatial Memory Streaming
• Original prefetcher cost: 60KB; virtualized prefetcher cost: <1KB
• Nearly identical performance
[Chart: performance of the original vs. virtualized prefetcher; higher is better]
Impact of Virtualization on L2 Misses
[Chart: L2 misses with and without virtualization]
Impact of Virtualization on L2 Requests
[Chart: L2 requests with and without virtualization]
Coarse-Grain Tracking
Jason Zebchuk