RegionScout: Exploiting Coarse Grain Sharing in Snoop Coherence www.eecg.toronto.edu/aenao Andreas Moshovos moshovos@eecg.toronto.edu
Improving Snoop Coherence • Conventional Considerations: Complexity and Correctness, NOT Power/Bandwidth • Can we: (1) Reduce Power/Bandwidth (2) Leverage snoop coherence? • Remains Attractive: Simple / Design Re-use • Yes: Exploit Program Behavior to Dynamically Identify Requests that do not Need Snooping [Figure: three CPUs, each with I$ and D$, connected through an interconnect to Main Memory]
RegionScout: Avoid Some Snoops • Frequent case: non-sharing even at a coarse level/Region • RegionScout: Dynamically Identify Non-Shared Regions • First Request to a Region Identifies it as not Shared • Subsequent Requests do not need to be broadcast • Uses Imprecise Information • Small structures • Layer on top of conventional coherence • No additional constraints [Figure: three CPUs, each with I$ and D$, connected through an interconnect to Main Memory]
Roadmap • Conventional Coherence: • The need for power-aware designs • Potential: Program Behavior • RegionScout: What and How • Implementation • Evaluation • Summary
Coherence Basics • Given request for memory block X (address) • Detect where its current value resides [Figure: three CPUs and Main Memory; labels: X, snoop, snoop, hit]
Conventional Coherence not Power-Aware/Bandwidth-Effective • All L2 tags see all accesses • Perf. & Complexity: Have L2 tags, so why not use them? • Power: All L2 tags consume power on all accesses • Bandwidth: broadcast all coherent requests [Figure: three CPUs with L2 caches and Main Memory; labels: miss, miss]
RegionScout Motivation: Sharing is Coarse • Region: large contiguous memory area, power-of-2 size • A CPU asks for data block X in region R • No one else has X • No one else has any block in R • RegionScout Exploits this Behavior • Layered Extension over Snoop Coherence [Figure: Typical Memory Space Snapshot, addresses colored by owner(s)]
Optimization Opportunities • Power and Bandwidth • Originating node: avoid asking others • Remote node: avoid tag lookup [Figure: three CPUs, each with I$ and D$, connected through a SWITCH to Memory]
Potential: Region Miss Frequency • Even with a 16K Region ~45% of requests miss in all remote nodes [Figure: Global Region Misses (% of all requests) vs. Region Size]
RegionScout at Work: Non-Shared Region Discovery • First request detects a non-shared region [Figure: three CPUs and Main Memory, steps numbered 1, 2, 2, 3; labels: Region Miss, Region Miss, Global Region Miss; per-node records: Non-Shared Regions, Locally Cached Regions]
RegionScout at Work: Avoiding Snoops • Subsequent request avoids snoops [Figure: three CPUs and Main Memory, steps numbered 1, 2; label: Global Region Miss; per-node records: Non-Shared Regions, Locally Cached Regions]
RegionScout is Self-Correcting • Request from another node invalidates non-shared record [Figure: three CPUs and Main Memory, steps numbered 1, 2, 2; per-node records: Non-Shared Regions, Locally Cached Regions]
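The three "at work" slides (discovery, avoiding snoops, self-correction) can be traced with a small protocol-level sketch. This is only an illustration under stated assumptions: the `Node` and `Bus` classes are hypothetical names, both per-node records are kept as exact Python sets rather than the small imprecise hardware structures, and block evictions are ignored.

```python
class Node:
    """One CPU node with its two region records (exact sets, illustration only)."""
    def __init__(self, name):
        self.name = name
        self.cached_regions = set()      # regions with at least one locally cached block
        self.non_shared_regions = set()  # regions discovered to be cached nowhere else


class Bus:
    """Hypothetical broadcast interconnect that counts how many requests were snooped."""
    def __init__(self, nodes):
        self.nodes = nodes
        self.broadcasts = 0

    def request(self, requester, region):
        """Return True if the request was broadcast, False if the snoop was avoided."""
        if region in requester.non_shared_regions:
            requester.cached_regions.add(region)
            return False  # known non-shared region: no broadcast needed

        self.broadcasts += 1
        global_region_miss = True
        for node in self.nodes:
            if node is requester:
                continue
            # Self-correction: a remote request invalidates the non-shared record.
            node.non_shared_regions.discard(region)
            if region in node.cached_regions:
                global_region_miss = False

        if global_region_miss:
            # Discovery: every remote node reported a region miss.
            requester.non_shared_regions.add(region)
        requester.cached_regions.add(region)
        return True


a, b, c = Node("A"), Node("B"), Node("C")
bus = Bus([a, b, c])
first = bus.request(a, 5)    # broadcast, discovers region 5 as non-shared
second = bus.request(a, 5)   # snoop avoided
third = bus.request(b, 5)    # broadcast, invalidates A's non-shared record
```

In this trace only the second request skips the broadcast; after node B touches the region, node A has to broadcast again.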
Implementation: Requirements • Requesting Node provides address • At Originating Node, from CPU: • Have I discovered that this region is not shared? • At Remote Nodes, from Interconnect: • Do I have a block in the region? [Figure: address split into Region Tag and offset; the offset is lg(Region Size) bits wide]
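Both questions on this slide are asked about the region portion of the address. A minimal sketch of that split, assuming a 16K region as an example size; the function names are hypothetical:

```python
REGION_SIZE = 16 * 1024  # bytes; assumed example value, must be a power of two


def region_bits(region_size):
    """Number of offset bits, i.e. lg(Region Size)."""
    if region_size <= 0 or region_size & (region_size - 1) != 0:
        raise ValueError("region size must be a power of two")
    return region_size.bit_length() - 1


def split_address(address, region_size=REGION_SIZE):
    """Split an address into (region tag, offset within the region)."""
    bits = region_bits(region_size)
    return address >> bits, address & (region_size - 1)


tag, offset = split_address(0x12345678)
print(hex(tag), hex(offset))  # prints "0x48d1 0x1678"
```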
Remembering Non-Shared Regions • Non-Shared Region Table: records non-shared regions • Lookup by Region portion prior to issuing a request • Snoop requests and invalidate • Few entries: 16x4 in most experiments [Figure: address split into Region Tag and offset; Non-Shared Region Table entries hold a Region Tag and a valid bit]
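A software sketch of such a table follows. It reads "16x4" as 16 sets of 4 ways, indexes sets with the low bits of the region tag, and evicts in LRU order; all three are assumptions made for illustration, and the class and method names are hypothetical.

```python
from collections import OrderedDict


class NonSharedRegionTable:
    """Small set-associative table of regions believed to be non-shared."""

    def __init__(self, num_sets=16, ways=4):
        self.num_sets = num_sets
        self.ways = ways
        # One ordered dict per set; insertion order doubles as LRU order (assumed policy).
        self.sets = [OrderedDict() for _ in range(num_sets)]

    def _set_for(self, region_tag):
        return self.sets[region_tag % self.num_sets]

    def record(self, region_tag):
        """Remember a region after a global region miss."""
        entries = self._set_for(region_tag)
        entries.pop(region_tag, None)
        if len(entries) >= self.ways:
            entries.popitem(last=False)  # evict the least recently used entry
        entries[region_tag] = True

    def is_non_shared(self, region_tag):
        """Lookup performed before issuing a request."""
        entries = self._set_for(region_tag)
        if region_tag in entries:
            entries.move_to_end(region_tag)
            return True
        return False

    def snoop_invalidate(self, region_tag):
        """Another node requested a block in this region: drop the record."""
        self._set_for(region_tag).pop(region_tag, None)
```

Losing an entry to eviction only costs a broadcast that might have been avoided, so the table can stay small.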
What Regions are Locally Cached? • If we had as many counters as regions: • Block Allocation: counter[region]++ • Block Eviction: counter[region]-- • Region cached only if counter[region] non-zero • Not Practical: • E.g., 16K Regions and 4G Memory → 256K counters [Figure: address split into Region Tag and offset; the Region Tag selects a counter]
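The arithmetic behind the 256K figure, together with the exact one-counter-per-region scheme the slide rules out, can be sketched as follows. The class name is hypothetical, and the sizes are the slide's example values.

```python
MEMORY_BYTES = 4 * 2**30   # 4G of memory (slide's example)
REGION_BYTES = 16 * 2**10  # 16K regions (slide's example)
NUM_REGIONS = MEMORY_BYTES // REGION_BYTES  # 2**32 / 2**14 = 2**18 = 256K


class ExactRegionCounters:
    """One counter per region: precise, but needs NUM_REGIONS counters."""

    def __init__(self, num_regions=NUM_REGIONS):
        self.counter = [0] * num_regions

    def on_block_allocation(self, region):
        self.counter[region] += 1

    def on_block_eviction(self, region):
        self.counter[region] -= 1

    def is_cached(self, region):
        # A region is locally cached only while its counter is non-zero.
        return self.counter[region] != 0


print(NUM_REGIONS)  # prints 262144, i.e. 256K
```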
What Regions are Locally Cached? • Use few Counters • Imprecise: • Records a superset of locally cached Regions • False positives: lost opportunity, correctness preserved • Cached Region Hash (CRH): few entries, e.g., 256 • “Counter”: + on block allocation, - on block eviction • P-bit: 1 if counter non-zero, used for lookups [Figure: address split into Region Tag and offset; the Region Tag is hashed to select a counter and its p bit]
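A minimal sketch of the hashed-counter idea, assuming a 256-entry table and a simple modulo hash of the region tag. The hash function, class name, and method names are assumptions; the slide does not specify them.

```python
class CachedRegionHash:
    """Few counters shared by many regions; records a superset of cached regions."""

    def __init__(self, entries=256):
        self.entries = entries
        self.counter = [0] * entries
        self.p_bit = [False] * entries  # True while the matching counter is non-zero

    def _index(self, region_tag):
        return region_tag % self.entries  # assumed hash, for illustration only

    def on_block_allocation(self, region_tag):
        i = self._index(region_tag)
        self.counter[i] += 1
        self.p_bit[i] = True

    def on_block_eviction(self, region_tag):
        i = self._index(region_tag)
        if self.counter[i] == 0:
            raise ValueError("eviction without a matching allocation")
        self.counter[i] -= 1
        self.p_bit[i] = self.counter[i] != 0

    def may_be_cached(self, region_tag):
        """Snoop-side lookup: False is definite, True may be a false positive."""
        return self.p_bit[self._index(region_tag)]


crh = CachedRegionHash()
crh.on_block_allocation(5)
alias = crh.may_be_cached(5 + 256)  # True: region 261 aliases with region 5
```

A false positive such as the aliased lookup above only means a region miss goes unreported; a cached region is never reported as absent, so correctness is preserved.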
Roadmap • Conventional Coherence • Program Behavior: Region Miss Frequency • RegionScout • Evaluation • Summary
Evaluation Overview • Methodology • Filter rates • Practical Filters can capture many Region Misses • Interconnect bandwidth reduction
Methodology • In-House simulator based on SimpleScalar • Execution driven • All instructions simulated – MIPS-like ISA • System calls faked by passing them to host OS • Synchronization using load-linked/store-conditional • Simple in-order processors • Memory requests complete instantaneously • MESI snoop coherence • 1- or 2-level memory hierarchy • WATTCH power models • SPLASH II benchmarks • Scientific workloads • Feasibility study
Filter Rates • For a small CRH it is better to use large regions • Practical RegionScout filters capture a lot of the potential [Figure: Identified Global Region Misses vs. CRH Size]
Bandwidth Reduction • Moderate Bandwidth Savings for SMP (15%-22%) • More so for CMP (>25%) [Figure: Messages vs. Region Size, CMP]
Related Work • RegionScout • Technical Report, Dec. 2003 • Jetty • Moshovos, Memik, Falsafi, Choudhary, HPCA 2001 • PST • Ekman, Dahlgren, and Stenström, ISLPED 2002 • Coarse-Grain Coherence • Cantin, Lipasti, and Smith, ISCA 2005
Summary • Exploit program behavior/optimize a frequent case • Many requests result in a global region miss • RegionScout • Practical filter mechanism • Dynamically detect would-be region misses • Avoid broadcasts • Save tag lookup power and interconnect bandwidth • Small structures • Layered extension over existing mechanisms • Invisible to programmer and the OS
RegionScout and Directories • Different information • Directory: block-level sharing • RegionScout: Region-level sharing • Could build Region-level directory • This work serves as motivation • Directories use precise information • RegionScout does not have to • Directories/Implementation • RegionScout can approximate a directory • If remote nodes sent sharing info as opposed to a single bit