Region-Centric Memory Design
AENAO Research Group
Patrick Akl, M.A.Sc.
Ioana Burcea, Ph.D. C.
Myrto Papadopoulou, M.A.Sc. C.
Elham Safi, Ph.D. C.
Jason Zebchuk, M.A.Sc. C.
Andreas Moshovos
{pakl, ioana, myrto, elham, zebchuk, moshovos}@eecg.toronto.edu
Future On-Chip Caches: Just Larger?
• Observe and exploit memory access behavior at a coarse grain
[Diagram: CPUs, each with I$ and D$; interconnect; 10s – 100s of MB; Main Memory]
Aenao Group/Toronto
Conventional Block-Centric Memory Hierarchy
• “Small” blocks
• Performance and bandwidth
• Several optimizations exist
Big picture is lost
[Diagram label: Conventional Fine-Grain Tracking]
“Big Picture” View
Supplemental Coarse-Grain Tracking
• Region: 2^n-sized, aligned memory area
• Concept already in use: TLBs
• Patterns emerge in space / time
• Exploit for performance & power
• Expose to software
This Presentation
• Examples of Coarse-Grain Optimizations
  • Snoop Coherence
  • Thread-level speculation disambiguation
• Region-Centric Memory Design
  • RegionTracker Cache
  • Snoop Coherence Revisited
• Current Activities
  • Coherence Delegation
  • Predictor Virtualization
An Example: Snoop Coherence
• Conventional considerations: complexity and correctness, NOT power/bandwidth
• Can we: (1) reduce power/bandwidth, (2) leverage snoop coherence?
• Remains attractive: simple / design re-use
Yes: exploit program behavior to dynamically identify requests that do not need snooping
[Diagram: three CPUs, each with I$ and D$; interconnect; Main Memory]
Coherence Basics
• Given request for memory block X (address)
• Detect where current value resides
[Diagram: three CPUs and Main Memory; labels: X, snoop, snoop, hit]
Conventional Coherence not Power-Aware/Bandwidth-Effective
• All L2 tags see all accesses
• Perf. & complexity: have L2 tags, why not use them
• Power: all L2 tags consume power on all accesses
• Bandwidth: broadcast all coherent requests
[Diagram: three CPUs with L2 and Main Memory; labels: miss, miss]
RegionScout Motivation: Sharing is Coarse
• Region: large contiguous memory area, power of 2 size
• CPU X asks for data block in region R
• No one else has X
• No one else has any block in R
RegionScout exploits this behavior
Layered extension over snoop coherence
[Figure: typical memory space snapshot; addresses colored by owner(s)]
Optimization Opportunities
• Power and bandwidth
  • Originating node: avoid asking others
  • Remote node: avoid tag lookup
[Diagram: three CPUs, each with I$ and D$; SWITCH; Memory]
Potential: Region Miss Frequency
Even with a 16K region, ~45% of requests miss in all remote nodes
[Chart: global region misses (% of all requests) vs. region size]
RegionScout at Work: Non-Shared Region Discovery
First request detects a non-shared region
[Diagram, steps 1, 2, 2, 3: three CPUs and Main Memory; labels: Region Miss, Region Miss, Global Region Miss. Per-node records: Non-Shared Regions, Locally Cached Regions]
RegionScout at Work: Avoiding Snoops
Subsequent request avoids snoops
[Diagram, steps 1, 2: three CPUs and Main Memory; label: Global Region Miss. Per-node records: Non-Shared Regions, Locally Cached Regions]
RegionScout is Self-Correcting
Request from another node invalidates non-shared record
[Diagram, steps 1, 2, 2: three CPUs and Main Memory. Per-node records: Non-Shared Regions, Locally Cached Regions]
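The three "RegionScout at Work" slides (discovery, snoop avoidance, self-correction) can be walked through with a toy model. This is only a sketch under simplifying assumptions: the names `Node`, `Bus`, and `request` are hypothetical, both per-node records are unbounded and exact, evictions are not modeled, and every request is treated as a miss that would normally be broadcast.

```python
class Node:
    def __init__(self):
        self.cached = {}         # region -> count of locally cached blocks
        self.non_shared = set()  # regions this node found to be non-shared


class Bus:
    def __init__(self, num_nodes, region_size=16 * 1024):
        self.shift = region_size.bit_length() - 1  # lg(region size)
        self.nodes = [Node() for _ in range(num_nodes)]
        self.broadcasts = 0

    def request(self, who, addr):
        node = self.nodes[who]
        region = addr >> self.shift
        if region not in node.non_shared:
            # Region not known to be non-shared: broadcast as usual.
            self.broadcasts += 1
            global_region_miss = True
            for other in self.nodes:
                if other is node:
                    continue
                # Self-correction: a remote request invalidates the record.
                other.non_shared.discard(region)
                if other.cached.get(region, 0) > 0:
                    global_region_miss = False
            if global_region_miss:
                node.non_shared.add(region)  # discovery
        # Assumed: the requested block is then allocated locally.
        node.cached[region] = node.cached.get(region, 0) + 1


bus = Bus(num_nodes=3)
bus.request(0, 0x4000)  # first request: broadcast, region found non-shared
bus.request(0, 0x4040)  # same region: no broadcast needed
bus.request(1, 0x4080)  # another node: broadcast, node 0's record invalidated
bus.request(0, 0x40C0)  # node 0 must broadcast again
```

In this run only the second request skips the broadcast, so `bus.broadcasts` ends at 3 for 4 requests.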
Implementation: Requirements
• Requesting node provides address
• At originating node (from CPU):
  • Have I discovered that this region is not shared?
• At remote nodes (from interconnect):
  • Do I have a block in the region?
[Diagram: address = Region Tag | offset; the offset is lg(Region Size) bits wide]
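Both questions on this slide start from the same address split. A minimal sketch, assuming byte addresses and a power-of-two region size (the function name `split_address` is illustrative, not from the slides):

```python
def split_address(addr: int, region_size: int) -> tuple[int, int]:
    """Return (region_tag, offset) for a power-of-two region size."""
    if region_size <= 0 or region_size & (region_size - 1):
        raise ValueError("region_size must be a power of two")
    offset_bits = region_size.bit_length() - 1  # lg(Region Size)
    return addr >> offset_bits, addr & (region_size - 1)


# With 16K regions the low 14 bits are the offset.
tag, offset = split_address(0x12345, 16 * 1024)  # tag 4, offset 0x2345
```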
Remembering Non-Shared Regions
Non-Shared Region Table
• Records non-shared regions
• Lookup by region portion of the address prior to issuing a request
• Snoop requests and invalidate
• Few entries: 16x4 in most experiments
[Diagram: address = Region Tag | offset; each table entry holds a valid bit]
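A small set-associative table is enough to express the three operations listed here. The sketch below reads "16x4" as 16 sets of 4 ways, which is an assumption, as are the LRU replacement, the modulo set index, and all method names.

```python
from collections import OrderedDict


class NonSharedRegionTable:
    def __init__(self, sets=16, ways=4, region_size=16 * 1024):
        self.ways = ways
        self.shift = region_size.bit_length() - 1
        # One ordered dict per set; insertion order doubles as LRU order.
        self.sets = [OrderedDict() for _ in range(sets)]

    def _locate(self, addr):
        tag = addr >> self.shift
        return self.sets[tag % len(self.sets)], tag

    def is_non_shared(self, addr):
        """Checked before issuing a request."""
        entries, tag = self._locate(addr)
        if tag in entries:
            entries.move_to_end(tag)
            return True
        return False

    def record(self, addr):
        """Called after a global region miss."""
        entries, tag = self._locate(addr)
        entries[tag] = True
        entries.move_to_end(tag)
        if len(entries) > self.ways:
            entries.popitem(last=False)  # evict least recently used

    def snoop(self, addr):
        """A request from another node invalidates the record."""
        entries, tag = self._locate(addr)
        entries.pop(tag, None)


nsrt = NonSharedRegionTable()
nsrt.record(0x4000)
hit = nsrt.is_non_shared(0x4ABC)   # same 16K region as 0x4000
nsrt.snoop(0x7FFF)                 # remote request to that region
after_snoop = nsrt.is_non_shared(0x4000)
```

Losing an entry to replacement is safe in this scheme: the node simply falls back to a normal broadcast.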
What Regions are Locally Cached?
• If we had as many counters as regions:
  • Block allocation: counter[region]++
  • Block eviction: counter[region]--
  • Region cached only if counter[region] is non-zero
• Not practical:
  • E.g., 16K regions and 4G memory require 256K counters
[Diagram: address = Region Tag | offset; the region tag selects a counter]
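The 256K figure follows directly from dividing the memory size by the region size, reading "4G" and "16K" as binary units (an assumption about the slide's notation):

```python
MEMORY_BYTES = 4 * 2**30   # 4G of memory
REGION_BYTES = 16 * 2**10  # 16K regions

# One counter per region in the exact scheme.
counters_needed = MEMORY_BYTES // REGION_BYTES  # 262144 = 256K
```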
What Regions are Locally Cached?
• Imprecise:
  • Records a superset of locally cached regions
  • False positives: lost opportunity, correctness preserved
• Small: e.g., 256 entries for 1M cache
• Power-optimized structures described in the paper
[Diagram: address = Region Tag | offset; hash() of the region tag selects a counter]
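The hashed-counter structure can be sketched as follows. The class name, the method names, and the modulo hash are assumptions made for illustration; the slide only specifies that a hash of the region tag selects a counter.

```python
class HashedRegionCounters:
    def __init__(self, entries=256, region_size=16 * 1024):
        self.counters = [0] * entries
        self.shift = region_size.bit_length() - 1

    def _index(self, addr):
        # Assumed hash: region tag modulo table size.
        return (addr >> self.shift) % len(self.counters)

    def block_allocated(self, addr):
        self.counters[self._index(addr)] += 1

    def block_evicted(self, addr):
        self.counters[self._index(addr)] -= 1

    def may_cache_region(self, addr):
        """False means definitely not cached; True may be a false positive."""
        return self.counters[self._index(addr)] != 0


crh = HashedRegionCounters()
crh.block_allocated(1 << 14)                 # a block in region 1
exact = crh.may_cache_region(1 << 14)        # True: region 1 is cached
aliased = crh.may_cache_region(257 << 14)    # True: region 257 shares the counter
absent = crh.may_cache_region(2 << 14)       # False: counter is zero
```

Because several regions can share a counter, a zero counter proves none of them is cached, while a non-zero counter only forces a conventional tag lookup.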
LFSR-Based Implementation
• Linear-Feedback Shift Register array
• Increment / Decrement / Is Zero?
• 130nm commercial technology
• ISLPED ’06
• Faster: 1.6x to 3.7x
• More energy efficient: 1.4x to 2.3x
• But area: 3.2x
[Diagram: address = Region Tag | offset; hash(); LFSR; Zero Detector]
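Since only increment, decrement, and a zero test are needed, the counter value never has to be read as a binary number, so an LFSR state sequence can stand in for it. The sketch below is a software illustration only: the 4-bit width, the polynomial x^4 + x^3 + 1, and the choice of `0b0001` as the "zero" state are assumptions, not details from the slide.

```python
class LfsrCounter:
    """4-bit up/down counter stepping through a maximal-length LFSR sequence."""

    SEED = 0b0001  # state that encodes a count of zero (assumed)

    def __init__(self):
        self.state = self.SEED

    def increment(self):
        # Shift left; feedback is the XOR of the two top bits.
        feedback = ((self.state >> 3) ^ (self.state >> 2)) & 1
        self.state = ((self.state << 1) | feedback) & 0xF

    def decrement(self):
        # Inverse step: rebuild the bit that was shifted out of the top.
        top = ((self.state >> 3) ^ self.state) & 1
        self.state = (self.state >> 1) | (top << 3)

    def is_zero(self):
        # Zero detection is a comparison against one fixed pattern.
        return self.state == self.SEED


counter = LfsrCounter()
for _ in range(3):
    counter.increment()
nonzero_state = counter.state
for _ in range(3):
    counter.decrement()
```

The sequence has 15 distinct states, so this toy counter represents counts 0 through 14; overflow and underflow handling are left out.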
Filter Rates: SPLASH-II
[Chart: identified global region misses vs. CRH size]
Jason Cantin@Wisconsin studied commercial workloads: 40% filter rate
Region-Centric Disambiguation
Joint work w/ Greg Steffan and Mihai Burcea
Patrick Akl, Andreas Moshovos
Speculative Parallelization
• Models
  • Thread level speculation
  • Transactional Memory
Need to compare addresses across code pieces
[Diagram: Original vs. Speculative Parallelization over time; Good Scenario and Bad Scenario; labels: read a, read b, write a, write a]
Ex #2: Region-Centric Disambiguation
Region-Centric vs. Conventional
• Send digest at region level
• Region-conflict
• Send block-level info
• Reduced bandwidth, potential for performance and power
[Diagram: Task 1 and Task 2 over the Memory Space, shown for both schemes]
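One possible reading of this slide is a two-step check: tasks first exchange region-level digests, and block-level information is only sent for regions that appear in both digests. The sketch below illustrates that reading; the function names, the set-based digest, and the 1K region and 64-byte block sizes are all assumptions.

```python
def region_digest(addrs, region_size=1024):
    """Set of region tags touched by a task."""
    shift = region_size.bit_length() - 1
    return {a >> shift for a in addrs}


def find_conflicts(writes, reads, region_size=1024, block_size=64):
    """Return the block numbers written by one task and read by the other."""
    shared = region_digest(writes, region_size) & region_digest(reads, region_size)
    if not shared:
        return set()  # digests are disjoint: no block-level info needed
    r_shift = region_size.bit_length() - 1
    b_shift = block_size.bit_length() - 1
    # Region conflict: compare block-level info for the shared regions only.
    w_blocks = {a >> b_shift for a in writes if a >> r_shift in shared}
    r_blocks = {a >> b_shift for a in reads if a >> r_shift in shared}
    return w_blocks & r_blocks


no_overlap = find_conflicts([0x1000, 0x1040], [0x8000])  # different regions
region_only = find_conflicts([0x1000], [0x1040])         # same region, different blocks
true_conflict = find_conflicts([0x1000], [0x1004, 0x1400])
```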
How Much Traffic Can We Save?
• TLS benchmarks from STAMPEDE group (G. Steffan)
• Approximate timing model
Potential for traffic reduction by 38%
[Chart: traffic per benchmark]
Exploiting Region-Level Information
• Region Coherence Arrays
  • Cantin, Lipasti and Smith
• RegionScout
  • Our work
• Both of these reduce snoop lookups (and broadcasts) in snoop coherence protocols
• Spatial Memory Prefetching
  • Leverages spatial memory patterns for prefetching with commercial workloads
  • Impetus Group at CMU
• Stealth Prefetching
  • Cantin, Lipasti and Smith
Coarse-Grain Techniques Today
Conventional Cache + Auxiliary Tracking
• Overhead
  • Storage: e.g., 60% of tags
  • Functionality: restrict placement, region evictions
• Loss of information
Hard to justify for a commercial design
[Diagram: CPU, I$, D$; cache with TAGS and DATA plus Auxiliary Tracking]
Rethinking Cache Design
• Can we provide a common substrate for all these optimizations?
• Redesign caches:
  • Regions a first class citizen
  • RegionTracker Cache
[Diagram: CPU, I$, D$; cache with Dual-Grain TAGS (Embedded Tracking) and DATA]
RegionTracker Cache
• Goals
  • Expose region behavior
    • Is region X cached?
    • Which blocks are?
  • Facilitate management at the region level
    • Evict/migrate region X
    • Do something with all blocks in X
• Constraints:
  • Data movement only at the block level
  • No increase in area
  • No decrease in performance
  • Complexity
  • Associativity
Region-Based Caches
• Start with conventional 16-way cache and replace tag array
• Sector Caches
  • Hit rate suffers: 20% loss
• Sector Pool Caches
  • High associativity: 48-way for matching a 16-way cache
• Decoupled-Sector Caches
  • No coarse-grain info
  • Replacements require searching
• No previous design is adequate
• RegionTracker:
  • Meets all requirements
  • But does not save as many tag resources
Sector Cache
• Reduced area and power
• Increased miss-rates (2.5% - 96% for 1kB sectors)
• Replacement?
[Diagram: D-way Region Tags (RVA); D-way Data (Data Array)]
Sector Pool Cache
• M > D
• Requires highly associative cache to achieve same performance as RegionTracker (~48-way)
[Diagram: M-way Region Tags (RVA); DSR; D-way Data (Data Array)]
Decoupled-Sectored Cache
• Has multiple block evictions
• Requires scanning “status” array
• No simple mechanism to avoid this
• Does NOT expose region-level information
RegionTracker
• In practice L <= D
• Decouple data and lookup organizations
• Lower associativity lookups with no hit-rate penalty
• RegionTracker provides complete solution
[Diagram: L-way Region Tags (RVA); DSR; D-way Data (Data Array)]
RegionTracker Cache
• RVA
  • Block and region lookups
  • Region tag + way per block
• ERB
  • Evict region blocks lazily
  • Simplify replacement and reduce area
• BST
  • Status per block + RVA set backpointer
  • Can be banked and partitioned
[Diagram: L1 Data Array, L1 RVA, L1 ERB, L1 BST]
Region-Aware Cache: Performance vs. Area
• Commercial workloads: DB2, Oracle, TPC-C and TPC-H, Apache, Zeus
• SimICS + SimFlex, sampling, 2K regions
[Chart: performance vs. area]
RegionTracker-RegionScout
• One bit per region tag: known to be not shared
• 1KB regions, commercial workloads
• 512KB L2 private caches
Filter 41% of snoops at “zero cost” compared to conventional cache
[Chart: reduction in broadcasts; includes a BlockScout series]
Directory Optimizations
Base Architecture
[Diagram: Core, L2 Tags, L2 Data, Directory, L3 Tags, L3 Data, DRAM]
Coherence Delegation
• Eliminate 3-hop overhead
• Attract directory tracking to nodes
[Diagram: Requesting Node, Directory Lookup, Remote L2 containing data; the Ideal Path is marked]
Predictor Virtualization
Optimization Engines: Predictors
[Diagram: many CPUs, each with L1-D and L1-I; Interconnect; L2; Main Memory]
Motivating Trends
• Chip multiprocessors
  • Space dedicated to predictors x #processors
• Larger predictor table
  • Increased performance
• Memory hierarchies
  • Increased capacities
Use conventional memory hierarchies to store predictor information
PV Architecture
[Diagram: Optimization Engine and Predictor Table; signals: index, entry, prediction]
PV Architecture
[Diagram: Optimization Engine and Predictor Virtualization in place of the Predictor Table; signals: index, entry, prediction]
PV Architecture
[Diagram: Optimization Engine; PVProxy containing PVCache, MSHR, and PVStart + index; PVTable held in L2 and Main Memory; signals: index, entry, prediction]
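The diagram suggests that predictor entries live in a table in ordinary memory and that a small on-chip cache holds the recently used ones. The sketch below models that idea in software only. The class and method names, LRU replacement, write-back on eviction, and the `PVStart + index * entry size` address calculation are assumptions; the MSHR is not modeled, and a plain dictionary stands in for L2 and main memory.

```python
from collections import OrderedDict


class VirtualizedPredictor:
    def __init__(self, pv_start, cache_entries=8, entry_bytes=64):
        self.pv_start = pv_start
        self.entry_bytes = entry_bytes
        self.cache_entries = cache_entries
        self.pv_cache = OrderedDict()  # index -> entry, in LRU order
        self.memory = {}               # address -> entry (stand-in for L2/memory)
        self.memory_accesses = 0

    def _address(self, index):
        return self.pv_start + index * self.entry_bytes

    def _fill(self, index):
        if index in self.pv_cache:
            self.pv_cache.move_to_end(index)
            return
        # PVCache miss: fetch the entry from the in-memory PVTable.
        self.memory_accesses += 1
        self.pv_cache[index] = self.memory.get(self._address(index))
        if len(self.pv_cache) > self.cache_entries:
            victim, entry = self.pv_cache.popitem(last=False)
            self.memory[self._address(victim)] = entry  # write back

    def predict(self, index):
        self._fill(index)
        return self.pv_cache[index]

    def update(self, index, entry):
        self._fill(index)
        self.pv_cache[index] = entry


pv = VirtualizedPredictor(pv_start=0x1000, cache_entries=2)
pv.update(0, "a")
pv.update(1, "b")
pv.update(2, "c")            # evicts index 0, written back at 0x1000
restored = pv.predict(0)     # refetched from the in-memory table
```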
Virtualized Spatial Memory Streaming
• Original prefetcher: cost 80KB
• Virtualized prefetcher: cost <1KB
• Nearly identical performance
Region-Centric Memory Design
AENAO Research Group
Patrick Akl, M.A.Sc. C.
Ioana Burcea, Ph.D. C.
Myrto Papadopoulou, M.A.Sc. C.
Elham Safi, Ph.D. C.
Jason Zebchuk, M.A.Sc. C.
Andreas Moshovos
{pakl, ioana, myrto, elham, zebchuk, moshovos}@eecg.toronto.edu
Summary
• Caches are getting larger
  • Time to look at the “big picture”
• Region-Centric Memory Design
  • Expose region-level info
  • Allow management at the region level
• RegionScout
  • Eliminate broadcasts for snoop coherence
• Region-Centric Disambiguation
  • Reduce bandwidth for TLS or TM
• Region-Aware Memory
  • “Same” area and performance as conventional + region info
• Predictor Virtualization