Iman Faraji

Time-based Snoop Filtering in Chip Multiprocessors • Amirali Baniasadi, University of Victoria, Victoria, Canada • Iman Faraji, Amirkabir University of Technology, Tehran, Iran

This work: Reducing redundant snoops in chip multiprocessors • Our Goal: Improving the energy efficiency of write-through (WT) based CMPs • Our Motivation: There are long time intervals where snooping fails, wasting energy and bandwidth • Our Solution: Detect such intervals and avoid snoops during them • Key Results: Memory Energy 18%, Snoop Traffic 93%, Performance 3.8%

Conventional Snooping • [Figure: four CPUs, each with a private D$, connected through an interconnect to a controller; the steps of a snoop are numbered 1 to 6] • Redundant (miss) snoops: ~70%

WB vs. WT • Relative memory energy consumption

Previous Work: Snoop Filters • Eliminate redundant (local and global) snoop requests • Local: one core fails to provide the data • Global: all cores fail to provide the data • Examples: • RegionScout: detects memory regions that are not shared (Moshovos) • Selective Snoop Request: predicts the supplier (Atoofian & Baniasadi) • Serial Snooping: requests nodes one by one (Saldanha & Lipasti) • A good snoop filter is: • Fast and simple • Accurate and effective

Our Work • Time-based Snoop Filtering • Motivation: There are long intervals where snooping fails consecutively. But how long, and how often?

Our Work (Cont.)

Our Work (Cont.) • Global Read Miss (GRM): occurs whenever the most recent snoop fails in all processors • Local Read Miss (LRM): occurs whenever a snoop fails in a single processor, making that snoop redundant
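The two definitions above can be illustrated with a minimal sketch. The function name `classify_snoop` and the dictionary representation of snoop responses are assumptions made for illustration; they do not come from the slides.

```python
def classify_snoop(responses):
    """Classify one snoop request (hypothetical helper, not from the slides).

    responses maps a snooped node id to True if that node supplied the
    requested data and False if its snoop failed.
    Returns (local_misses, global_miss):
      local_misses: sorted node ids whose snoop failed (LRMs)
      global_miss:  True if every snooped node failed (GRM)
    """
    local_misses = sorted(node for node, hit in responses.items() if not hit)
    global_miss = len(responses) > 0 and len(local_misses) == len(responses)
    return local_misses, global_miss
```

For example, in a quad-core system where node 0 issues the request, `classify_snoop({1: False, 2: True, 3: False})` reports local misses at nodes 1 and 3 but no global miss, because node 2 supplied the data.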

Distribution • (a) LRM distribution for different processors • (b) GRM distribution • Periods of data scarcity are usually long
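One way such a distribution could be collected from a trace is sketched below. This is only an illustration: the helper name `miss_run_lengths` is hypothetical, and it measures interval length in consecutive snoops rather than in cycles, which may differ from how the slides' figures were produced.

```python
from collections import Counter
from itertools import groupby


def miss_run_lengths(outcomes):
    """Histogram of runs of consecutive snoop misses (hypothetical helper).

    outcomes is an iterable of booleans in program order:
    True for a snoop hit, False for a snoop miss.
    Returns a Counter mapping run length to number of runs of that length.
    """
    return Counter(
        len(list(group))          # length of this run of equal outcomes
        for hit, group in groupby(outcomes)
        if not hit                # keep only the runs of misses
    )
```

Applied to a per-node trace this gives an LRM-style histogram; applied to a trace of "did any node supply the data" it gives a GRM-style histogram.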

Time-based Global Miss predictor (TGM) • TGM Goals: • Detect GRM intervals • Shut down snooping in all processors but one (the surviving node) • TGM Types: • TGM-First: the first processor whose snoop failed survives • TGM-Last: the last processor whose snoop failed survives
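The difference between the two TGM types can be sketched as a choice of surviving node. The function `tgm_select`, its `policy` argument, and the list representation of the failure order are assumptions for illustration; the slides do not specify how the hardware records this order.

```python
def tgm_select(fail_order, policy="first"):
    """Pick the surviving node during a GRM interval (hypothetical sketch).

    fail_order lists node ids in the order in which their snoops failed.
    policy is "first" (TGM-First) or "last" (TGM-Last).
    Returns (survivor, disabled), where disabled lists the nodes whose
    snooping is shut down.
    """
    if not fail_order:
        raise ValueError("a GRM needs at least one failed snoop")
    if policy == "first":
        survivor = fail_order[0]
    elif policy == "last":
        survivor = fail_order[-1]
    else:
        raise ValueError("policy must be 'first' or 'last'")
    disabled = [node for node in fail_order if node != survivor]
    return survivor, disabled
```

With a failure order of `[2, 0, 3]`, TGM-First keeps node 2 snooping and TGM-Last keeps node 3; in both cases the other two nodes stop snooping.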

TGM implementation • TGM-enhanced CMP

TGM • (a) Coverage (b) Accuracy

Time-based Local Miss predictor (TLM) • Goal: Detect LRMs • How? • Count consecutive snoop misses in a node • Disable snooping when the count exceeds a threshold • Restart snooping after a number of cycles
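The three steps above can be sketched as a small per-node state machine. This is a minimal illustration, not the authors' hardware design: the class name `TLMPredictor`, the default threshold and restart values, and the choice to reset the miss counter on a hit are all assumptions.

```python
class TLMPredictor:
    """Hypothetical per-node sketch of a time-based local miss predictor."""

    def __init__(self, rsn_threshold=4, restart_cycles=100):
        self.rsn_threshold = rsn_threshold    # assumed value, not from the slides
        self.restart_cycles = restart_cycles  # assumed value, not from the slides
        self.rsn = 0  # consecutive redundant snoops observed in this node
        self.rst = 0  # cycles remaining until snooping restarts

    def snooping_enabled(self):
        return self.rst == 0

    def tick(self, cycles=1):
        """Advance time; snooping restarts once the restart counter hits zero."""
        if self.rst > 0:
            self.rst = max(0, self.rst - cycles)

    def record_snoop(self, hit):
        """Update the miss counter with the outcome of a performed snoop."""
        if not self.snooping_enabled():
            return  # filtered snoops are not performed, so nothing to record
        if hit:
            self.rsn = 0  # assumption: a hit ends the run of misses
            return
        self.rsn += 1
        if self.rsn > self.rsn_threshold:
            # Too many consecutive misses: disable snooping for a while.
            self.rsn = 0
            self.rst = self.restart_cycles
```

A larger threshold makes the predictor more conservative (fewer wrongly filtered snoops), while a longer restart period filters more snoops per detected interval at the risk of missing data that becomes available in the meantime.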

TLM implementation • TGM-enhanced CMP • [Figure: each processor contains a Processing Unit (PU), a first-level cache, and a predictor with a Redundant SNoop (RSN) counter and a ReStarT (RST) counter]

TLM features • (a) Coverage (b) Accuracy

Methodology • Our simulator: SESC • Benchmarks: SPLASH-2 • Energy evaluation: CACTI 6.5 • System used: quad-core CMP • [Tables: system parameters; SPLASH-2 benchmarks and input parameters]

Relative Snoop Traffic Reduction • TGM-F: 58% • TGM-L: 57% • TLM: 77%

Relative Memory Energy • TGM-F: 8% • TGM-L: 8.5% • TLM: 11%

Relative Memory Delay • TGM-F: 1.1% • TGM-L: 2.1% • TLM: 1.7%

Relative Performance • TGM-F: No Change • TGM-L: 0.4% • TLM: 0.3%

Summary • We showed: • Long data scarcity periods (DSPs) exist during workload runtime • During DSPs, redundant snoops happen frequently and consecutively • Our solutions: • TGM: • Uses snoop behavior on all processors to detect and filter redundant snoops • Shuts down snooping on as many processors as possible • TLM: • Filters redundant snoops in a single node • Counts recent redundant snoops to detect data scarcity periods and filter upcoming redundant snoops • Simulation Results: • Snoop Reduction: TGM-F: 58% TGM-L: 57% TLM: 77% • Memory Energy: TGM-F: 8% TGM-L: 8.5% TLM: 11% • Memory Delay: TGM-F: 1.1% TGM-L: 2.1% TLM: 1.7% • Performance: TGM-F: no change TGM-L: 0.4% TLM: 0.3%

Thanks for your attention

Backup Slides

Discussion • How do the characteristics of the benchmarks affect the memory energy/delay reduction achieved by our solution? 1. True detection of redundant snoops 2. Share of redundant snoops

Memory Energy/Delay • Memory Energy = energy consumed to provide the requested data • Memory Delay = time required to provide the requested data

Volrend Benchmark • While running, Volrend rarely sends snoop requests • This application renders a three-dimensional volume • It renders several frames from changing viewpoints; consecutive frames in rotation sequences often vary only slightly in viewpoint • High temporal locality • Volrend distributes load very well • High spatial locality
