Snoop Filtering and Coarse-Grain Memory Tracking. Andreas Moshovos Univ. of Toronto/ECE Short Course at the University of Zaragoza, July 2009 Some slides by J. Zebchuk or the original paper authors. JETTY Snoop-Filtering for Reduced Power in SMP Servers.
Snoop Filtering and Coarse-Grain Memory Tracking Andreas Moshovos Univ. of Toronto/ECE Short Course at the University of Zaragoza, July 2009 Some slides by J. Zebchuk or the original paper authors
JETTY: Snoop-Filtering for Reduced Power in SMP Servers • Andreas Moshovos • Babak Falsafi, ECE, Carnegie Mellon • Gokhan Memik, ECE, Northwestern • Alok Choudhary, ECE, Northwestern • Int’l Symposium on High-Performance Computer Architecture (HPCA), 2001
Power is Becoming Important • Architecture is a science of tradeoffs • Thus far: Performance vs. Cost vs. Complexity • Today: also vs. Power • Where? • Mobile Devices • Desktops/Servers (our focus)
Power-Aware Servers • Revisit the design of SMP servers • 2 or more CPUs per machine • Snoop coherence-based • Why? • File, web, databases, your typical desktop • Cost effective too • This work - a first step: Power-Aware Snoopy-Coherence
Power-Aware Snoop-Coherence • Conventional • All L2 caches snoop all memory traffic • Power expended by all on any memory access • Jetty-Enhanced • Tiny structure on L2-backside • Filters most “would-be-misses” • Less power expended on most snoop misses • No changes to protocol necessary • No performance loss
Roadmap • Why is Power a Concern for Servers? • Snoopy-Coherence Basics • An Opportunity for Reducing Power • JETTY • Results • Summary
Why is Power Important? • Power could ultimately limit performance • Power demands have been increasing • Must deliver energy to and on chip • Must dissipate heat • These limit: • Amount of resources & frequency • Feasibility • Cooling is a solution, but at what cost and integration effort? • Reducing power demands is much more convenient
What can be done? • Redesign Circuits • Clock Gating and Frequency Scaling • A lot has been done thus far • Still active • Rethink Architectural Decisions • Orthogonal to others Reduce Power Under Performance Constraints
The “Silver Bullet” Solution • Would be good if there were one • However, until one is found... • Look at all structures • Rethink Design • Propose Power-Optimized versions • This is what we already do for performance
Snoopy Cache Coherence • All L2 tags see all bus accesses • Intervene when necessary [Figure: CPU cores with L1 and L2 caches on a bus to main memory; a snoop hits in an L2]
How About Power? • All L2 tags see all bus accesses • Perf. & Complexity: we have the L2 tags, so why not use them? • Power: all L2 tags consume power on all accesses [Figure: CPU cores with L1 and L2 caches on a bus to main memory; the snoop misses in the remote L2s]
JETTY: A Would-be Snoop-Miss Filter • Would-be snoop-miss: JETTY can answer “Not here!” • Would-be snoop-hit: JETTY answers “Don’t know” • Detects most misses using fewer resources • Imprecise: may fail to filter a would-be miss • Never filters snoop-hits [Figure: snooped addr passing through the JETTY in front of CPU n, once for each case]
Potential for Savings Exists • Most Snoops miss • 91% AVG • Many L2 accesses are due to Snoop Misses • 55% AVG • Sizeable Potential Power Savings: • 20% - 50% of total L2 power
Exclude-Jetty • Subset of what is not cached • How? Cache recent snoop-misses locally [Figure: address space split into cached and not cached; Exclude JETTY covers part of not cached]
Exclude-Jetty • Subset of what you don’t have • Works well for producer-consumer sharing
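The Exclude-Jetty idea above can be sketched in a few lines of Python. This is a minimal illustration, not the design from the paper: the direct-mapped organization, the 32-entry size, the 64-byte block and the method names are all assumptions made for the example. The one property it tries to show is the safety rule: an entry may only say "not cached" while that is true, so a local fill must remove it.

```python
class ExcludeJetty:
    """Small direct-mapped table of blocks known NOT to be in the local L2.

    Sizes and names are illustrative assumptions, not taken from the paper.
    """

    def __init__(self, num_entries=32, block_bytes=64):
        self.num_entries = num_entries
        self.block_bytes = block_bytes
        self.entries = [None] * num_entries  # one block number (or None) per slot

    def _block(self, addr):
        return addr // self.block_bytes

    def filters(self, addr):
        """True means 'guaranteed not cached': the L2 tag lookup can be skipped."""
        blk = self._block(addr)
        return self.entries[blk % self.num_entries] == blk

    def record_snoop_miss(self, addr):
        """Call after a snoop probed the L2 tags and missed."""
        blk = self._block(addr)
        self.entries[blk % self.num_entries] = blk

    def on_local_fill(self, addr):
        """The local L2 now caches this block, so drop any stale entry for it."""
        blk = self._block(addr)
        slot = blk % self.num_entries
        if self.entries[slot] == blk:
            self.entries[slot] = None


ej = ExcludeJetty()
snoop_addr = 0x4000

first = ej.filters(snoop_addr)      # never seen: the L2 tags must be probed
ej.record_snoop_miss(snoop_addr)    # the probe missed, remember that
second = ej.filters(snoop_addr)     # a repeated snoop is now filtered
ej.on_local_fill(snoop_addr)        # the local CPU brings the block in
third = ej.filters(snoop_addr)      # must not be filtered any more
```

A repeated remote access to the same block, as in producer-consumer sharing, is the case where the second lookup pays off.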
Include-Jetty • Superset of what is cached • How? Well... [Figure: address space split into cached and not cached; Include JETTY encloses all of cached]
Include-Jetty • The address is hashed by f( ), g( ) and h( ) to index bit vectors 0, 1 and 2 • Not-Cached: any zero bit • May be Cached: all bits set • Later I was told this is a Bloom filter…
Include-Jetty • Superset of what you have • Partially overlapping indexes worked better • This is a counting Bloom filter: L-CBF: A Low Power, Fast Counting Bloom Filter Implementation, Elham Safi, Andreas Moshovos and Andreas Veneris, In Proc. Annual International Symposium on Low Power Electronics and Design (ISLPED), Oct. 2006.
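A counting Bloom filter of the kind described above might look like the following Python sketch. The table count, index width, the choice of bit slices as the f/g/h functions, and all names are assumptions for illustration; with skip_bits smaller than index_bits the slices partially overlap, which is one possible reading of "partially overlapping indexes".

```python
class IncludeJetty:
    """Counting Bloom filter over the blocks held in the local L2 (illustrative sketch)."""

    def __init__(self, num_tables=3, index_bits=10, skip_bits=5, block_bytes=64):
        self.num_tables = num_tables
        self.index_bits = index_bits
        self.skip_bits = skip_bits      # skip_bits < index_bits gives overlapping slices
        self.block_bytes = block_bytes
        self.counts = [[0] * (1 << index_bits) for _ in range(num_tables)]

    def _indexes(self, addr):
        blk = addr // self.block_bytes
        mask = (1 << self.index_bits) - 1
        # Table t is indexed by a slice of the block number starting at bit t * skip_bits.
        return [(blk >> (t * self.skip_bits)) & mask for t in range(self.num_tables)]

    def on_fill(self, addr):
        for t, i in enumerate(self._indexes(addr)):
            self.counts[t][i] += 1

    def on_evict(self, addr):
        for t, i in enumerate(self._indexes(addr)):
            self.counts[t][i] -= 1

    def filters(self, addr):
        """True if any indexed counter is zero, i.e. the block cannot be cached."""
        return any(self.counts[t][i] == 0 for t, i in enumerate(self._indexes(addr)))


ij = IncludeJetty()
ij.on_fill(0x1000)
cached_is_filtered = ij.filters(0x1000)    # a cached block is never filtered
other_is_filtered = ij.filters(0x2000)     # differs in table 0, so it is filtered
ij.on_evict(0x1000)
after_evict = ij.filters(0x1000)           # counters back to zero: filtered again
```

The "bit" of the slides corresponds to "counter is non-zero" here; counters are what allow evictions to be undone.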
Hybrid-Jetty • In some cases Exclude-J works well • In others Include-J is better • Combine • Access both in parallel on a snoop • Allocation • IJ always • If IJ fails to filter, then allocate in EJ • EJ coverage increases
Latency? • Jetty may increase snoop-response time • Can only be determined on a design by design basis • Largest Jetty: • Five 32x32 bit register files
Results • Used SPLASH-II • Scientific applications • “Large” Datasets • e.g., 4-80 MB of main memory allocated • Access Counts: 60M-1.7B • 4-way SMP, MOESI • 1M direct-mapped L2, 64b blocks, 32b subblocks • 32k direct-mapped L1, 32b blocks • Coverage & Power (analytical model)
Coverage: Hybrid-Jetty • Can capture 74% of all snoop-misses
Power-Savings • 28% of overall L2 power
Summary • Power is becoming important • Performance, Reliability and Feasibility • Unique Opportunities Exist for Servers • JETTY: Filter Snoops that would miss • 74% of all snoop-misses filtered • 28% of L2 power saved • No protocol changes • No performance loss
Power efficient cache coherence C. Saldanha, M. Lipasti Workshop on Memory Performance Issues (in conjunction with ISCA), June 2001.
Serial Snooping • Avoids speculative transmission of snoop packets • Check the nearest neighbor • Data supplied with minimum latency and power [Figure: nodes and MEMORY]
TLB and Snoop Energy-Reduction using Virtual Caches in Low-Power Chip-Multiprocessors • Magnus Ekman, *Fredrik Dahlgren, and Per Stenström • Chalmers University of Technology • *Ericsson Mobile Platforms • Int’l Symposium on Low Power Electronics and Design (ISLPED), Aug. 2002
Page Sharing Tables • On a snoop, the requesting node gets a page-level sharing vector • If a PST entry is evicted, the whole page must be evicted • A paper by the same authors demonstrates that Jetty is not beneficial for small-scale CMPs
RegionScout: Exploiting Coarse Grain Sharing in Snoop Coherence • Andreas Moshovos • moshovos@eecg.toronto.edu • Int’l Symposium on Computer Architecture (ISCA), 2005
Improving Snoop Coherence • Conventional Considerations: Complexity and Correctness, NOT Power/Bandwidth • Snoop coherence remains attractive: Simple / Design Re-use • Can we: (1) Reduce Power/Bandwidth (2) Leverage snoop coherence? • Yes: Exploit Program Behavior to Dynamically Identify Requests that do not Need Snooping [Figure: three CPUs, each with I$ and D$, on an interconnect to Main Memory]
RegionScout: Avoid Some Snoops • Frequent case: non-sharing even at a coarse level/Region • RegionScout: Dynamically Identify Non-Shared Regions • First Request to a Region Identifies it as not Shared • Subsequent Requests do not need to be broadcast • Uses Imprecise Information • Small structures • Layer on top of conventional coherence • No additional constraints [Figure: three CPUs, each with I$ and D$, on an interconnect to Main Memory]
Roadmap • Conventional Coherence: • The need for power-aware designs • Potential: Program Behavior • RegionScout: What and How • Implementation • Evaluation • Summary
Coherence Basics • Given request for memory block X (address) • Detect where its current value resides [Figure: three CPUs and Main Memory; the request for X is snooped by the other two CPUs and hits in one]
Conventional Coherence is not Power-Aware/Bandwidth-Effective • All L2 tags see all accesses • Perf. & Complexity: we have the L2 tags, so why not use them? • Power: all L2 tags consume power on all accesses • Bandwidth: broadcast all coherent requests [Figure: three CPUs with L2 caches and Main Memory; the snoop misses in both remote L2s]
RegionScout Motivation: Sharing is Coarse • Region: large contiguous memory area, power-of-2 size • A CPU asks for data block X in region R • No one else has X • No one else has any block in R • RegionScout Exploits this Behavior • Layered Extension over Snoop Coherence [Figure: typical memory space snapshot, addresses colored by owner(s)]
Optimization Opportunities • Power and Bandwidth • Originating node: avoid asking others • Remote node: avoid tag lookup [Figure: three CPUs, each with I$ and D$, connected through a SWITCH to Memory]
Potential: Region Miss Frequency • Even with a 16K Region, ~45% of requests miss in all remote nodes [Figure: Global Region Misses as % of all requests vs. Region Size]
RegionScout at Work: Non-Shared Region Discovery • First request detects a non-shared region • Record: Non-Shared Regions • Record: Locally Cached Regions [Figure: steps 1 to 3 across three CPUs and Main Memory; both remote nodes answer Region Miss, giving a Global Region Miss]
RegionScout at Work: Avoiding Snoops • Subsequent request avoids snoops • Record: Non-Shared Regions • Record: Locally Cached Regions [Figure: steps 1 and 2 across three CPUs and Main Memory; the earlier Global Region Miss is found in the requester's records]
RegionScout is Self-Correcting • Request from another node invalidates non-shared record • Record: Non-Shared Regions • Record: Locally Cached Regions [Figure: steps 1 and 2 across three CPUs and Main Memory]
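The three "at work" slides can be traced with a toy Python model. It is only a sketch under strong assumptions: each node keeps an exact, unbounded record of locally cached regions and of non-shared regions (the real structures are small and imprecise), evictions are ignored, and the names Node and request are made up for this example.

```python
from collections import Counter

REGION_BITS = 14  # 16K regions, an example size


class Node:
    def __init__(self):
        self.cached = Counter()   # region -> number of locally cached blocks
        self.non_shared = set()   # regions this node found to be non-shared


def request(nodes, who, addr):
    """Service one block request from node `who`; report whether it was broadcast."""
    region = addr >> REGION_BITS
    me = nodes[who]
    if region in me.non_shared:
        me.cached[region] += 1
        return "no-broadcast"               # avoiding snoops
    region_hit = False
    for i, node in enumerate(nodes):
        if i == who:
            continue
        node.non_shared.discard(region)     # self-correction at remote nodes
        if node.cached[region] > 0:
            region_hit = True               # someone holds a block of this region
    if not region_hit:
        me.non_shared.add(region)           # global region miss: remember it
    me.cached[region] += 1
    return "broadcast"


nodes = [Node(), Node(), Node()]
trace = [
    request(nodes, 0, 0x10000),  # first touch of the region: discovery
    request(nodes, 0, 0x10040),  # same region, same node: no snoop needed
    request(nodes, 1, 0x10080),  # another node touches it: record invalidated
    request(nodes, 0, 0x100C0),  # node 0 has to broadcast again
]
```

In this trace only the second request skips the broadcast; after node 1 touches the region, node 0 falls back to conventional snooping, which is the self-correcting behavior.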
Implementation: Requirements • Requesting Node provides address • At Originating Node, from CPU: • Have I discovered that this region is not shared? • At Remote Nodes, from Interconnect: • Do I have a block in the region? [Figure: address split into Region Tag and offset; the offset is lg(Region Size) bits wide]
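The address split above is plain bit manipulation. A short Python sketch, assuming a 16K region (one of the sizes used in the slides) and hypothetical helper names:

```python
REGION_BYTES = 16 * 1024                       # example region size, a power of two
OFFSET_BITS = REGION_BYTES.bit_length() - 1    # lg(Region Size)


def region_tag(addr):
    """Upper address bits: identify the region."""
    return addr >> OFFSET_BITS


def region_offset(addr):
    """Lower lg(Region Size) bits: position inside the region."""
    return addr & (REGION_BYTES - 1)


a, b, c = 0x12345678, 0x12345000, 0x12348000
same_region_ab = region_tag(a) == region_tag(b)
same_region_ac = region_tag(a) == region_tag(c)
```

Both questions on the slide (originating node and remote node) are asked about region_tag(addr), never about the offset.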
Remembering Non-Shared Regions • Non-Shared Region Table: records non-shared regions • Lookup by Region portion prior to issuing a request • Snoop requests and invalidate • Few entries: 16x4 in most experiments [Figure: address split into Region Tag and offset; the Region Tag indexes the Non-Shared Region Table, each entry with a valid bit]
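A Non-Shared Region Table with the three operations listed above could be sketched as follows. Reading "16x4" as 16 sets of 4 ways is an assumption, as are the LRU replacement policy, the 16K region size and the method names.

```python
class NonSharedRegionTable:
    """Set-associative table of region tags believed to be non-shared (sketch)."""

    def __init__(self, num_sets=16, ways=4, region_bytes=16 * 1024):
        self.num_sets = num_sets
        self.ways = ways
        self.offset_bits = region_bytes.bit_length() - 1
        self.sets = [[] for _ in range(num_sets)]  # per set: tags, most recent last

    def _locate(self, addr):
        region = addr >> self.offset_bits
        return region, self.sets[region % self.num_sets]

    def is_non_shared(self, addr):
        """Lookup by region, done before issuing a request."""
        region, entries = self._locate(addr)
        return region in entries

    def record(self, addr):
        """Remember a region after a global region miss."""
        region, entries = self._locate(addr)
        if region in entries:
            entries.remove(region)
        elif len(entries) == self.ways:
            entries.pop(0)                 # evict the least recently recorded tag
        entries.append(region)

    def on_remote_request(self, addr):
        """Snooped request from another node: the region may now be shared."""
        region, entries = self._locate(addr)
        if region in entries:
            entries.remove(region)


nsrt = NonSharedRegionTable()
nsrt.record(0x10000)
hit_same_region = nsrt.is_non_shared(0x10040)
hit_other_region = nsrt.is_non_shared(0x20000)
nsrt.on_remote_request(0x13FFF)            # last byte of the same 16K region
hit_after_snoop = nsrt.is_non_shared(0x10000)
```

Losing an entry, whether by replacement or by invalidation, only costs a broadcast that could have been avoided; it never affects correctness.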
What Regions are Locally Cached? • If we had as many counters as regions: • Block Allocation: counter[region]++ • Block Eviction: counter[region]-- • Region cached only if counter[region] non-zero • Not Practical: • E.g., 16K Regions and 4G Memory → 256K counters [Figure: address split into Region Tag and offset; the Region Tag selects a counter]
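The counter arithmetic and the exact scheme above, as a small Python check. The Counter-based bookkeeping and the function names are illustrative only.

```python
from collections import Counter

MEMORY_BYTES = 4 * 2**30     # 4G of memory
REGION_BYTES = 16 * 2**10    # 16K regions

# One counter per region: 2^32 / 2^14 = 2^18 counters.
exact_counters = MEMORY_BYTES // REGION_BYTES

counter = Counter()          # region -> number of locally cached blocks


def allocate(addr):
    counter[addr // REGION_BYTES] += 1


def evict(addr):
    counter[addr // REGION_BYTES] -= 1


def region_cached(addr):
    return counter[addr // REGION_BYTES] != 0


allocate(0x4000)
allocate(0x4040)
evict(0x4000)
still_cached = region_cached(0x7FFF)   # one block of the region remains
evict(0x4040)
now_cached = region_cached(0x7FFF)     # the region is empty again
```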
What Regions are Locally Cached? • Use few counters: the Cached Region Hash • The Region Tag is hashed to select a “Counter”: + on block allocation, - on block eviction • P-bit: 1 if counter non-zero, used for lookups • Few entries, e.g., 256 • Imprecise: • Records a superset of locally cached Regions • False positives: lost opportunity, correctness preserved [Figure: address split into Region Tag and offset; hash selects a counter entry with its p bit]
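The Cached Region Hash can be sketched the same way as the exact counters, only with the region tag folded onto a small table. The modulo hash, the 256-entry size and the names are assumptions; the example shows one false positive caused by two regions sharing a counter.

```python
class CachedRegionHash:
    """Few hashed counters recording a superset of locally cached regions (sketch)."""

    def __init__(self, num_entries=256, region_bytes=16 * 1024):
        self.num_entries = num_entries
        self.offset_bits = region_bytes.bit_length() - 1
        self.counters = [0] * num_entries

    def _index(self, addr):
        # Hypothetical hash: region tag modulo the table size.
        return (addr >> self.offset_bits) % self.num_entries

    def on_block_allocation(self, addr):
        self.counters[self._index(addr)] += 1

    def on_block_eviction(self, addr):
        self.counters[self._index(addr)] -= 1

    def p_bit(self, addr):
        """1 if the counter is non-zero: the region MAY be locally cached."""
        return 1 if self.counters[self._index(addr)] != 0 else 0


crh = CachedRegionHash()
crh.on_block_allocation(0x4000)      # region 1
p_region_1 = crh.p_bit(0x4000)       # truly cached
p_region_257 = crh.p_bit(0x404000)   # region 257 shares counter 1: false positive
p_region_2 = crh.p_bit(0x8000)       # different counter: definitely not cached
crh.on_block_eviction(0x4000)
p_after_evict = crh.p_bit(0x4000)
```

The false positive on region 257 only means a remote request gets a conservative "region hit" answer; a p-bit of 0 is always exact.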
Roadmap • Conventional Coherence • Program Behavior: Region Miss Frequency • RegionScout • Evaluation • Summary
Evaluation Overview • Methodology • Filter rates • Practical Filters can capture many Region Misses • Interconnect bandwidth reduction
Methodology • In-house simulator based on SimpleScalar • Execution driven • All instructions simulated, MIPS-like ISA • System calls faked by passing them to host OS • Synchronization using load-linked/store-conditional • Simple in-order processors • Memory requests complete instantaneously • MESI snoop coherence • 1 or 2 level memory hierarchy • WATTCH power models • SPLASH-II benchmarks • Scientific workloads • Feasibility study
Filter Rates • For small CRH, better to use large regions • Practical RegionScout filters capture a lot of the potential [Figure: Identified Global Region Misses vs. CRH Size]
Bandwidth Reduction • Moderate Bandwidth Savings for SMP (15%-22%) • More so for CMP (>25%) [Figure: Messages vs. Region Size, CMP]
Related Work • RegionScout • Technical Report, Dec. 2003 • Jetty • Moshovos, Memik, Falsafi, Choudhary, HPCA 2001 • PST • Ekman, Dahlgren, and Stenström, ISLPED 2002 • Coarse-Grain Coherence • Cantin, Lipasti and Smith, ISCA 2005