Snoop Filtering and Coarse-Grain Memory Tracking

Andreas Moshovos, Univ. of Toronto/ECE. Short Course at the University of Zaragoza, July 2009. Some slides by J. Zebchuk or the original paper authors.

JETTY: Snoop-Filtering for Reduced Power in SMP Servers. Babak Falsafi, ECE, Carnegie Mellon; Gokhan Memik, ECE, Northwestern; Alok Choudhary, ECE, Northwestern; Andreas Moshovos. Int’l Symposium on High-Performance Computer Architecture (HPCA), 2001.

Power is Becoming Important • Architecture is a science of tradeoffs • Thus far: Performance vs. Cost vs. Complexity • Today: vs. Power • Where? • Mobile Devices • Desktops/Servers (our focus)

Power-Aware Servers • Revisit the design of SMP servers • 2 or more CPUs per machine • Snoop coherence-based • Why? • File, web, databases, your typical desktop • Cost effective too • This work - a first step: Power-Aware Snoopy-Coherence

Power-Aware Snoop-Coherence • Conventional • All L2 caches snoop all memory traffic • Power expended by all on any memory access • Jetty-Enhanced • Tiny structure on L2-backside • Filters most “would-be-misses” • Less power expended on most snoop misses • No changes to protocol necessary • No performance loss

Roadmap • Why is Power a Concern for Servers? • Snoopy-Coherence Basics • An Opportunity for Reducing Power • JETTY • Results • Summary

Why is Power Important? Power Could Ultimately Limit Performance • Power demands have been increasing • Deliver energy to and on chip • Dissipate heat • Limit: • Amount of resources & frequency • Feasibility • Cooling is a solution: Cost & Integration? • Reducing power demands is much more convenient

What can be done? • Redesign Circuits • Clock Gating and Frequency Scaling • A lot has been done thus far • Still active • Rethink Architectural Decisions • Orthogonal to others • Goal: Reduce Power Under Performance Constraints

The “Silver Bullet” Solution • Good if there was one • However, till one is found... • Look at all structures • Rethink Design • Propose Power-Optimized versions • This is what we’re doing for performance

Snoopy Cache Coherence • All L2 tags see all bus accesses • Intervene when necessary [Diagram: CPU cores with L1 and L2 caches above main memory; a snoop hits in a remote L2]

How About Power? • All L2 tags see all bus accesses • Perf. & Complexity: Have L2 tags, why not use them? • Power: All L2 tags consume power on all accesses [Diagram: three CPU cores with L1 and L2 caches above main memory; the snoop misses in both remote L2s]

JETTY: A Would-be Snoop-Miss Filter • Would-be snoop miss: JETTY at CPU n checks the address and answers “Not here!” • Would-be snoop hit: JETTY answers “Don’t know” • Detects most misses using fewer resources • Imprecise: may fail to filter some would-be misses • Never filters snoop hits

Potential for Savings Exists • Most snoops miss: 91% AVG • Many L2 accesses are due to snoop misses: 55% AVG • Sizeable potential power savings: 20% - 50% of total L2 power

Exclude-Jetty • Subset of what is not cached • How? Cache recent snoop-misses locally [Diagram: the Exclude-Jetty set lies inside the “not cached” part of the address space]

Exclude-Jetty • Subset of what you don’t have • Works well for producer-consumer sharing
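The "cache recent snoop-misses locally" idea can be sketched in a few lines. This is a minimal illustration, not the organization evaluated in the paper: the direct-mapped table, the 32-entry size, and the names `ExcludeJetty`, `can_filter`, `record_snoop_miss` and `on_local_fill` are all assumptions made for the example.

```python
class ExcludeJetty:
    """Small table of block addresses known NOT to be in the local L2 (hypothetical layout)."""

    def __init__(self, entries=32):
        self.entries = entries
        self.table = [None] * entries  # each slot holds one block address or None

    def _slot(self, block_addr):
        return block_addr % self.entries

    def can_filter(self, block_addr):
        """True means 'guaranteed not cached': the L2 tag lookup can be skipped."""
        return self.table[self._slot(block_addr)] == block_addr

    def record_snoop_miss(self, block_addr):
        """Called after a snoop looked up the L2 tags and missed."""
        self.table[self._slot(block_addr)] = block_addr

    def on_local_fill(self, block_addr):
        """The local L2 just allocated this block, so any record of it is stale."""
        slot = self._slot(block_addr)
        if self.table[slot] == block_addr:
            self.table[slot] = None


ej = ExcludeJetty()
before = ej.can_filter(0x40)        # nothing recorded yet: must check the L2 tags
ej.record_snoop_miss(0x40)
after_miss = ej.can_filter(0x40)    # a repeated snoop to the same block is filtered
ej.on_local_fill(0x40)
after_fill = ej.can_filter(0x40)    # block is now cached: never filter it
```

Clearing the entry on a local fill is what keeps the structure safe: it may forget a miss, but it must never claim "not cached" for a block the L2 holds.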

Include-Jetty • Superset of what is cached • How? Well... [Diagram: the Include-Jetty set contains everything that is cached plus part of what is not cached]

Include-Jetty • Hash functions f( ), g( ) and h( ) of the address index bit vectors 0, 1 and 2 • Not cached: any zero bit • May be cached: all bits set • Later I was told this is a Bloom filter…

Include-Jetty • Superset of what you have • Partially overlapping indexes worked better • This is a counting Bloom filter: “L-CBF: A Low Power, Fast Counting Bloom Filter Implementation,” Elham Safi, Andreas Moshovos and Andreas Veneris, in Proc. Annual International Symposium on Low Power Electronics and Design (ISLPED), Oct. 2006.
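A counting Bloom filter of the kind described on the Include-Jetty slides might look like the sketch below. The table count, the 5-bit indexes, and the 3-bit stride that makes the index fields partially overlap are illustrative assumptions, as are the class and method names; the slides only state that several index functions are used and that partially overlapping indexes worked better.

```python
class IncludeJetty:
    """Counting Bloom filter over cached block addresses (illustrative parameters)."""

    def __init__(self, index_bits=5, num_tables=3, stride=3):
        self.index_bits = index_bits
        self.stride = stride  # stride < index_bits makes the index fields overlap
        self.counters = [[0] * (1 << index_bits) for _ in range(num_tables)]

    def _indexes(self, block_addr):
        mask = (1 << self.index_bits) - 1
        return [(block_addr >> (i * self.stride)) & mask
                for i in range(len(self.counters))]

    def on_allocate(self, block_addr):
        """L2 allocated a block: increment one counter per table."""
        for table, idx in zip(self.counters, self._indexes(block_addr)):
            table[idx] += 1

    def on_evict(self, block_addr):
        """L2 evicted a block: decrement the same counters."""
        for table, idx in zip(self.counters, self._indexes(block_addr)):
            table[idx] -= 1

    def can_filter(self, block_addr):
        """Any zero counter proves the block is not cached; all non-zero means 'maybe'."""
        return any(table[idx] == 0
                   for table, idx in zip(self.counters, self._indexes(block_addr)))


ij = IncludeJetty()
ij.on_allocate(0x1234)
cached_is_filtered = ij.can_filter(0x1234)   # never filter a cached block
other_is_filtered = ij.can_filter(0x0)       # a zero counter proves absence
ij.on_evict(0x1234)
evicted_is_filtered = ij.can_filter(0x1234)  # counters are back to zero
```

Counters rather than single bits are what allow evictions to be tracked; the "bit vector" view on the slide corresponds to testing each counter for non-zero.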

Hybrid-Jetty • In some cases Exclude-J works well • In others Include-J is better • Combine • Access in parallel on snoop • Allocation • IJ always • If IJ fails to filter, then allocate in EJ • EJ coverage increases
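The allocation policy above can be sketched as follows. Both parts are deliberately simplified (a single counter array for IJ, a direct-mapped table for EJ, and a Python set standing in for the L2 tags); the sizes and all names are assumptions for illustration only.

```python
class HybridJetty:
    """Include part (counters) plus exclude part (recent snoop misses), simplified."""

    def __init__(self, ij_entries=64, ej_entries=16):
        self.ij = [0] * ij_entries      # IJ: updated on every L2 fill and eviction
        self.ej = [None] * ej_entries   # EJ: block addresses of recent snoop misses
        self.l2 = set()                 # stand-in for the L2 tag array

    def l2_allocate(self, block):
        if block in self.l2:
            return
        self.l2.add(block)
        self.ij[block % len(self.ij)] += 1
        slot = block % len(self.ej)
        if self.ej[slot] == block:      # block is now cached: the EJ record is stale
            self.ej[slot] = None

    def l2_evict(self, block):
        self.l2.remove(block)
        self.ij[block % len(self.ij)] -= 1

    def snoop(self, block):
        # Both parts are consulted together; either one may filter the snoop.
        ij_filters = self.ij[block % len(self.ij)] == 0
        ej_filters = self.ej[block % len(self.ej)] == block
        if ij_filters or ej_filters:
            return "filtered"
        if block in self.l2:
            return "hit"
        # IJ failed to filter a real miss: remember it in EJ for next time.
        self.ej[block % len(self.ej)] = block
        return "miss"


h = HybridJetty()
h.l2_allocate(5)
results = [h.snoop(5), h.snoop(69), h.snoop(69), h.snoop(7)]
```

Block 69 aliases with block 5 in the IJ counters, so IJ alone cannot filter it; after the first real miss EJ takes over, which is the sense in which EJ coverage increases.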

Latency? • Jetty may increase snoop-response time • Can only be determined on a design by design basis • Largest Jetty: • Five 32x32 bit register files

Results • Used SPLASH-II • Scientific applications • “Large” datasets • e.g., 4-80 MB of main memory allocated • Access counts: 60M-1.7B • 4-way SMP, MOESI • 1MB direct-mapped L2, 64B blocks with 32B subblocks • 32KB direct-mapped L1, 32B blocks • Coverage & Power (analytical model)

Coverage: Hybrid-Jetty • Can capture 74% of all snoop-misses [Chart omitted]

Power-Savings • 28% of overall L2 power [Chart omitted]

Summary • Power is becoming important • Performance, Reliability and Feasibility • Unique Opportunities Exist for Servers • JETTY: Filter Snoops that would miss • 74% of all snoop misses • 28% of L2 power saved • No protocol changes • No performance loss

Power Efficient Cache Coherence. C. Saldanha, M. Lipasti. Workshop on Memory Performance Issues (in conjunction with ISCA), June 2001.

Serial Snooping • Avoids speculative transmission of snoop packets • Check the nearest neighbor • Data supplied with minimum latency and power [Diagram: nodes and memory]
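A rough sketch of the serial-snooping idea: instead of broadcasting, the request visits one neighbor at a time and stops at the first node that holds the block. The ring ordering of neighbors, the use of Python sets as caches, and the function name `serial_snoop` are assumptions made for this example and are not taken from the paper.

```python
def serial_snoop(requester, block, caches):
    """Visit neighbors one at a time, nearest first, and stop at the first holder.

    caches is a list of sets of block addresses, one per node.
    Returns (supplier, tag_lookups); supplier is None when memory must supply the data.
    """
    n = len(caches)
    lookups = 0
    for hop in range(1, n):
        node = (requester + hop) % n   # assumed ring order: nearest neighbor first
        lookups += 1
        if block in caches[node]:
            return node, lookups       # no further snoop packets are sent
    return None, lookups


caches = [set(), {0x10}, set(), {0x10}]
near_hit = serial_snoop(0, 0x10, caches)   # the nearest neighbor already has the block
no_holder = serial_snoop(0, 0x20, caches)  # every neighbor is checked, then memory
```

A broadcast would have cost three tag lookups in both cases; the serial scheme pays that only when no cache holds the block, at the price of longer latency on the way to memory.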

TLB and Snoop Energy-Reduction using Virtual Caches in Low-Power Chip-Multiprocessors. Magnus Ekman, *Fredrik Dahlgren, and Per Stenström. Chalmers University of Technology; *Ericsson Mobile Platforms. Int’l Symposium on Low Power Electronics and Design (ISLPED), Aug. 2002

Page Sharing Tables (PST) • On a snoop, the requesting node gets a page-level sharing vector • If a PST entry is evicted, the whole page must be evicted • A paper by the same authors demonstrates that Jetty is not beneficial for small-scale CMPs

RegionScout: Exploiting Coarse Grain Sharing in Snoop Coherence. Andreas Moshovos, moshovos@eecg.toronto.edu. Int’l Symposium on Computer Architecture (ISCA), 2005

Improving Snoop Coherence • Conventional considerations: complexity and correctness, NOT power/bandwidth • Can we: (1) Reduce power/bandwidth (2) Leverage snoop coherence? • Remains attractive: simple / design re-use • Yes: Exploit program behavior to dynamically identify requests that do not need snooping [Diagram: three CPUs, each with I$ and D$, connected through an interconnect to main memory]

RegionScout: Avoid Some Snoops • Frequent case: non-sharing even at a coarse level/Region • RegionScout: Dynamically Identify Non-Shared Regions • First Request to a Region Identifies it as not Shared • Subsequent Requests do not need to be broadcast • Uses Imprecise Information • Small structures • Layer on top of conventional coherence • No additional constraints [Diagram: three CPUs, each with I$ and D$, connected through an interconnect to main memory]

Roadmap • Conventional Coherence: • The need for power-aware designs • Potential: Program Behavior • RegionScout: What and How • Implementation • Evaluation • Summary

Coherence Basics • Given request for memory block X (address) • Detect where its current value resides [Diagram: three CPUs above main memory; the request for X is snooped by the other CPUs and hits in one of them]

Conventional Coherence is not Power-Aware/Bandwidth-Effective • All L2 tags see all accesses • Perf. & Complexity: Have L2 tags, why not use them? • Power: All L2 tags consume power on all accesses • Bandwidth: broadcast all coherent requests [Diagram: three CPUs with L2 caches above main memory; the snoop misses in both remote L2s]

RegionScout Motivation: Sharing is Coarse • Region: large contiguous memory area, power of 2 size • CPU X asks for data block in region R • No one else has X • No one else has any block in R • RegionScout Exploits this Behavior • Layered Extension over Snoop Coherence [Figure: typical memory space snapshot, addresses colored by owner(s)]

Optimization Opportunities • Power and Bandwidth • Originating node: avoid asking others • Remote node: avoid tag lookup [Diagram: three CPUs, each with I$ and D$, connected through a switch to memory]

Potential: Region Miss Frequency • Even with a 16K Region ~45% of requests miss in all remote nodes [Chart: global region misses as % of all requests, by region size]

RegionScout at Work: Non-Shared Region Discovery • First request detects a non-shared region [Diagram, steps 1-3: three CPUs above main memory, each keeping a record of non-shared regions and a record of locally cached regions; the request causes a region miss at each remote CPU, which together form a global region miss]

RegionScout at Work: Avoiding Snoops • Subsequent request avoids snoops [Diagram, steps 1-2: three CPUs above main memory, each keeping a record of non-shared regions and a record of locally cached regions; the region was previously identified as a global region miss]

RegionScout is Self-Correcting • Request from another node invalidates non-shared record [Diagram: three CPUs above main memory, each keeping a record of non-shared regions and a record of locally cached regions]
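The three "RegionScout at Work" steps (discovery, snoop avoidance, self-correction) can be traced with a small model. This is a sketch under simplifying assumptions: the two per-node records are kept as an exact set and an exact per-region block count rather than the small imprecise hardware structures, evictions are not modeled, and the names `Node` and `request` are hypothetical.

```python
REGION_SIZE = 16 * 1024  # bytes; an example power-of-two region size


class Node:
    def __init__(self):
        self.non_shared = set()    # record: regions discovered to be non-shared
        self.region_blocks = {}    # record: locally cached regions -> block count


def request(nodes, who, addr):
    """Issue a memory request from node `who`; return True if it was broadcast."""
    region = addr // REGION_SIZE
    me = nodes[who]
    if region in me.non_shared:
        broadcast = False                      # known non-shared: skip the snoop
    else:
        broadcast = True
        shared = False
        for i, other in enumerate(nodes):
            if i == who:
                continue
            other.non_shared.discard(region)   # self-correction at remote nodes
            if other.region_blocks.get(region, 0) > 0:
                shared = True                  # some remote node caches the region
        if not shared:
            me.non_shared.add(region)          # global region miss: remember it
    me.region_blocks[region] = me.region_blocks.get(region, 0) + 1
    return broadcast


nodes = [Node(), Node(), Node()]
trace = [
    request(nodes, 0, 0x000),  # first request to the region: broadcast, discovery
    request(nodes, 0, 0x040),  # same region, same node: snoop avoided
    request(nodes, 1, 0x080),  # another node touches the region: broadcast
    request(nodes, 0, 0x0C0),  # node 0's record was invalidated: broadcast again
]
```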

Implementation: Requirements • Requesting node provides address, split into a Region Tag and an offset of lg(Region Size) bits • At Originating Node – from CPU: Have I discovered that this region is not shared? • At Remote Nodes – from Interconnect: Do I have a block in the region?
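The address split on this slide is plain bit manipulation. A minimal sketch, assuming byte addresses and a power-of-two region size; the function name `split_address` is made up for the example.

```python
def split_address(addr, region_size):
    """Split a byte address into (region_tag, offset) for a power-of-two region size."""
    assert region_size > 0 and region_size & (region_size - 1) == 0
    offset_bits = region_size.bit_length() - 1   # lg(Region Size)
    return addr >> offset_bits, addr & (region_size - 1)


tag, offset = split_address(0x12345678, 16 * 1024)   # 16K regions: 14 offset bits
```

With 16K regions the low 14 bits form the offset, so `0x12345678` splits into region tag `0x48D1` and offset `0x1678`.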

Remembering Non-Shared Regions: Non-Shared Region Table • Records non-shared regions • Lookup by Region Tag portion of the address prior to issuing a request • Snoop requests and invalidate • Each entry: valid bit and Region Tag • Few entries: 16x4 in most experiments
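A table like the one on this slide could be modeled as below. The 16x4 size is read here as 16 sets of 4 ways, and the LRU replacement within a set is an assumption; the slide does not state the replacement policy. Class and method names are hypothetical.

```python
class NonSharedRegionTable:
    """Set-associative table of region tags believed to be non-shared (sketch)."""

    def __init__(self, sets=16, ways=4):
        self.sets = sets
        self.ways = ways
        self.table = [[] for _ in range(sets)]  # per set: tags, most recently used last

    def lookup(self, region):
        """Checked before issuing a request: True means no broadcast is needed."""
        return region in self.table[region % self.sets]

    def record(self, region):
        """Called when a broadcast returned a global region miss."""
        entries = self.table[region % self.sets]
        if region in entries:
            entries.remove(region)
        elif len(entries) == self.ways:
            entries.pop(0)                      # assumed policy: evict least recent
        entries.append(region)

    def invalidate(self, region):
        """Called when a snooped request from another node touches the region."""
        entries = self.table[region % self.sets]
        if region in entries:
            entries.remove(region)


nsrt = NonSharedRegionTable()
nsrt.record(7)
known = nsrt.lookup(7)
nsrt.invalidate(7)
after_snoop = nsrt.lookup(7)
```

Losing an entry to replacement is harmless: the next request to that region is simply broadcast again and the region is rediscovered.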

What Regions are Locally Cached? • If we had as many counters as regions: • Block Allocation: counter[region]++ • Block Eviction: counter[region]-- • Region cached only if counter[region] non-zero • Not Practical: • E.g., 16K Regions and 4G Memory → 256K counters
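The counter count on this slide is just memory size divided by region size. A one-line check, with the helper name `counters_needed` chosen for the example:

```python
def counters_needed(memory_bytes, region_bytes):
    """One counter per region: the number of regions that fit in memory."""
    return memory_bytes // region_bytes


n = counters_needed(4 * 2**30, 16 * 2**10)   # 4G memory, 16K regions
in_k = n // 2**10                            # expressed in units of K
```

2^32 / 2^14 = 2^18, which is the 256K counters quoted above.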

What Regions are Locally Cached? Cached Region Hash • Use few counters, indexed by a hash of the Region Tag • “Counter”: + on block allocation, - on block eviction • P-bit: 1 if counter non-zero, used for lookups • Imprecise: records a superset of locally cached Regions • False positives: lost opportunity, correctness preserved • Few entries, e.g., 256
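A sketch of the Cached Region Hash as described above. The modulo hash over the region tag and the names `CachedRegionHash`, `on_block_allocate`, `on_block_evict` and `may_cache_region` are assumptions for illustration; the slide only fixes the counter behavior and the example size of 256 entries.

```python
class CachedRegionHash:
    """Few counters shared by many regions; records a superset of cached regions."""

    def __init__(self, entries=256, region_size=16 * 1024):
        self.entries = entries
        self.offset_bits = region_size.bit_length() - 1  # region_size is a power of two
        self.counters = [0] * entries

    def _index(self, addr):
        region_tag = addr >> self.offset_bits
        return region_tag % self.entries       # assumed hash: low bits of the tag

    def on_block_allocate(self, addr):
        self.counters[self._index(addr)] += 1

    def on_block_evict(self, addr):
        self.counters[self._index(addr)] -= 1

    def may_cache_region(self, addr):
        """The p-bit: True if the counter is non-zero. False is a definite 'no'."""
        return self.counters[self._index(addr)] != 0


crh = CachedRegionHash()
region = 16 * 1024
crh.on_block_allocate(0 * region + 64)              # cache one block of region 0
same_region = crh.may_cache_region(0 * region)      # correctly reported as cached
aliased = crh.may_cache_region(256 * region)        # region 256 shares the counter
untouched = crh.may_cache_region(1 * region)        # definite region miss
crh.on_block_evict(0 * region + 64)
after_evict = crh.may_cache_region(0 * region)
```

The aliased answer is the false positive from the slide: the node reports "maybe cached", a snoop that could have been avoided goes ahead, and correctness is unaffected.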

Roadmap • Conventional Coherence • Program Behavior: Region Miss Frequency • RegionScout • Evaluation • Summary

Evaluation Overview • Methodology • Filter rates • Practical Filters can capture many Region Misses • Interconnect bandwidth reduction

Methodology • In-house simulator based on SimpleScalar • Execution driven • All instructions simulated – MIPS-like ISA • System calls faked by passing them to host OS • Synchronization using load-linked/store-conditional • Simple in-order processors • Memory requests complete instantaneously • MESI snoop coherence • 1 or 2 level memory hierarchy • WATTCH power models • SPLASH II benchmarks • Scientific workloads • Feasibility study

Filter Rates • For small CRH better to use large regions • Practical RegionScout filters capture a lot of the potential [Chart: identified global region misses, by CRH size]

Bandwidth Reduction • Moderate bandwidth savings for SMP (15%-22%) • More so for CMP (>25%) [Chart: messages by region size, CMP]

Related Work • RegionScout • Technical Report, Dec. 2003 • Jetty • Moshovos, Memik, Falsafi, Choudhary, HPCA 2001 • PST • Ekman, Dahlgren, and Stenström, ISLPED 2002 • Coarse-Grain Coherence • Cantin, Lipasti and Smith, ISCA 2005
