Time-based Snoop Filtering in Chip Multiprocessors
Amirali Baniasadi, Iman Faraji
University of Victoria, Victoria, Canada
Amirkabir University of Technology, Tehran, Iran
This work: Reducing redundant snoops in chip multiprocessors
• Our Goal: Improving energy efficiency of WT-based CMP
• Our Motivation: There are long time intervals where snooping fails, wasting energy and bandwidth.
• Our Solution: Detect such intervals and avoid snoops
• Key Results: Memory Energy 18%, Snoop Traffic 93%, Performance 3.8%
Conventional Snooping
[Figure: four CPUs, each with a data cache (D$), connected through an interconnect and a controller; the steps of a snoop request are numbered 1 to 6.]
• Redundant (miss): ~70%
WB vs. WT • Relative memory energy consumption
Previous Work: Snoop Filters
• Eliminate redundant snoop requests (local & global)
  • Local: one core fails to provide data
  • Global: all cores fail
• Examples:
  • RegionScout: Detects Memory Regions Not Shared (Moshovos)
  • Selective Snoop Request: Predicts Supplier (Atoofian & Baniasadi)
  • Serial Snooping: Requests Nodes One by One (Saldanha & Lipasti)
• Good snoop filter
  • Fast & simple
  • Accurate and effective
Our Work
• Time-based Snoop Filtering
• Motivation: There are long intervals where snooping fails consecutively
• But how long & how often?
Our Work (Cont.)
• Global Read Miss (GRM): occurs whenever a snoop fails in all processors
• Local Read Miss (LRM): a redundant snoop in which a single processor fails to provide the data
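The two definitions can be sketched as a small classifier. This is only an illustration: the names `SnoopOutcome`, `classify` and `local_misses` are hypothetical, and the sketch assumes each snoop yields one hit/miss flag per snooped cache.

```python
from enum import Enum
from typing import List, Sequence


class SnoopOutcome(Enum):
    HIT = "hit"  # at least one snooped cache supplied the data
    GRM = "grm"  # global read miss: every snooped cache failed


def local_misses(remote_hits: Sequence[bool]) -> List[int]:
    """Indices of the snooped caches that failed to supply the data (LRMs)."""
    return [i for i, hit in enumerate(remote_hits) if not hit]


def classify(remote_hits: Sequence[bool]) -> SnoopOutcome:
    """A snoop is a GRM only if no snooped cache could supply the data."""
    return SnoopOutcome.HIT if any(remote_hits) else SnoopOutcome.GRM
```

With three snooped caches, `[False, True, False]` is a hit overall but still contains two local misses, while `[False, False, False]` is a global read miss.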
Distribution
[Figure: (a) LRM distribution for different processors; (b) GRM distribution]
• Periods of Data Scarcity are usually long
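One possible way to answer "how long & how often" from a trace is to collect the lengths of consecutive miss runs. The sketch below assumes a hypothetical per-snoop trace of booleans (True = the snoop missed) and measures run length in snoops rather than cycles; the slides do not say which unit the plotted distributions use.

```python
from collections import Counter
from itertools import groupby
from typing import Iterable, List


def miss_run_lengths(snoop_missed: Iterable[bool]) -> List[int]:
    """Length of every maximal run of consecutive snoop misses."""
    return [sum(1 for _ in run) for missed, run in groupby(snoop_missed) if missed]


def run_length_histogram(snoop_missed: Iterable[bool]) -> Counter:
    """How often each run length occurs in the trace."""
    return Counter(miss_run_lengths(snoop_missed))
```

Feeding the trace of one processor gives an LRM-style distribution; feeding a trace of global misses gives a GRM-style one.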
Time-based Global Miss predictor (TGM)
• TGM Goals:
  • Detect GRM intervals
  • Shut down snooping in all processors but one (the surviving node)
• TGM Types:
  • TGM-First: the first processor that has failed snooping survives
  • TGM-Last: the last processor that has failed snooping survives
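The difference between the two TGM types comes down to which node keeps snooping. A minimal sketch of that choice follows; `pick_survivor`, `snoop_enabled` and the `fail_order` list are hypothetical names, and how the GRM interval itself is detected is not modeled here.

```python
from typing import List, Sequence


def pick_survivor(fail_order: Sequence[int], policy: str) -> int:
    """Return the node that keeps snooping during a GRM interval.

    fail_order lists node ids in the order in which their snoops failed.
    """
    if not fail_order:
        raise ValueError("no node has failed snooping yet")
    if policy == "first":  # TGM-First
        return fail_order[0]
    if policy == "last":  # TGM-Last
        return fail_order[-1]
    raise ValueError(f"unknown policy: {policy!r}")


def snoop_enabled(num_nodes: int, fail_order: Sequence[int], policy: str) -> List[bool]:
    """Per-node snoop enable bits: only the surviving node stays on."""
    survivor = pick_survivor(fail_order, policy)
    return [node == survivor for node in range(num_nodes)]
```

For a failure order of nodes 2, 0, 3, 1, TGM-First keeps node 2 snooping and TGM-Last keeps node 1.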
TGM implementation • TGM-enhanced CMP
TGM • (a) Coverage (b) Accuracy
Time-based Local Miss predictor (TLM)
• Goal: Detect LRMs
• How?
  • Count consecutive snoop misses in a node
  • Disable snooping when the count exceeds a threshold
  • Restart snooping after a number of cycles
TLM implementation
• TGM-enhanced CMP
[Figure labels: Processing Unit (PU), First Level Cache, Predictor; each processor has a Redundant SNoop (RSN) Counter and a ReStarT (RST) Counter]
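The count / disable / restart behavior of TLM can be sketched as a per-node state machine built around the RSN and RST counters. The class below is an assumption-laden illustration, not the authors' hardware design: `miss_threshold` and `restart_cycles` are hypothetical parameters, and resetting RSN on a snoop hit is assumed from the word "consecutive".

```python
from dataclasses import dataclass


@dataclass
class TLMPredictor:
    miss_threshold: int  # hypothetical: RSN value that must be exceeded to filter
    restart_cycles: int  # hypothetical: cycles snooping stays disabled
    rsn: int = 0         # Redundant SNoop counter (consecutive misses)
    rst: int = 0         # ReStarT counter (cycles left while disabled)

    @property
    def snooping(self) -> bool:
        """The node snoops only while the restart counter is idle."""
        return self.rst == 0

    def tick(self) -> None:
        """Advance one cycle; snooping restarts when RST reaches zero."""
        if self.rst > 0:
            self.rst -= 1

    def observe(self, snoop_hit: bool) -> None:
        """Record the outcome of a snoop this node actually performed."""
        if not self.snooping:
            return  # filtered snoops are never looked up
        if snoop_hit:
            self.rsn = 0  # assumed: a hit ends the run of consecutive misses
            return
        self.rsn += 1
        if self.rsn > self.miss_threshold:
            self.rsn = 0
            self.rst = self.restart_cycles  # disable snooping for a while
```

With `miss_threshold=2` and `restart_cycles=3`, the third consecutive miss disables snooping, and the node resumes snooping three cycles later.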
TLM features • (a) Coverage (b) Accuracy
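The slides do not define coverage and accuracy. A common reading for filters of this kind, assumed here, is that coverage is the share of all redundant snoops that get filtered and accuracy is the share of filtered snoops that really were redundant. The function and argument names below are hypothetical.

```python
def coverage(filtered_redundant: int, total_redundant: int) -> float:
    """Share of all redundant snoops that the filter removed (assumed definition)."""
    return filtered_redundant / total_redundant if total_redundant else 0.0


def accuracy(filtered_redundant: int, total_filtered: int) -> float:
    """Share of filtered snoops that really were redundant (assumed definition)."""
    return filtered_redundant / total_filtered if total_filtered else 0.0
```

As a made-up example, a filter that removes 64 snoops, 60 of them truly redundant out of 80 redundant snoops in total, would score `coverage(60, 80)` and `accuracy(60, 64)`.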
Methodology
• Our Simulator: SESC
• Benchmarks: SPLASH-2
• To evaluate energy: CACTI 6.5
• System used: Quad-Core CMP
[Tables: System Parameters; SPLASH-2 Benchmarks and Input Parameters]
Relative Snoop Traffic Reduction • TGM-F: 58% • TGM-L: 57% • TLM: 77%
Relative Memory Energy • TGM-F: 8% • TGM-L: 8.5% • TLM: 11%
Relative Memory Delay • TGM-F: 1.1% • TGM-L: 2.1% • TLM: 1.7%
Relative Performance • TGM-F: No Change • TGM-L: 0.4% • TLM: 0.3%
Summary
• We showed:
  • Long data scarcity periods (DSPs) exist during workload runtime
  • During DSPs, redundant snoops happen frequently and consecutively
• Our solutions:
  • TGM:
    • Uses snoop behavior on all processors to detect and filter redundant snoops
    • Shuts down snooping on as many processors as possible
  • TLM:
    • Filters redundant snoops in a single node
    • Counts recent redundant snoops to detect data scarcity periods and filter upcoming redundant snoops
• Simulation Results:
  • Snoop Reduction: TGM-F: 58%, TGM-L: 57%, TLM: 77%
  • Memory Energy: TGM-F: 8%, TGM-L: 8.5%, TLM: 11%
  • Memory Delay: TGM-F: 1.1%, TGM-L: 2.1%, TLM: 1.7%
  • Performance: TGM-F: no change, TGM-L: 0.4%, TLM: 0.3%
Discussion
• How do the characteristics of the benchmarks affect the memory energy/delay reduction achieved by our solutions?
  1. True detection of redundant snoops
  2. Share of redundant snoops
Memory Energy.Delay
• Memory Energy = Energy consumed to provide the requested data
• Memory Delay = Time required to provide the requested data
Volrend Benchmark
• Volrend rarely sends snoop requests while running
• This application renders a three-dimensional volume. It renders several frames from changing viewpoints; consecutive frames in rotation sequences often vary only slightly in viewpoint
  • High Temporal Locality
• Volrend does Load Distribution very well
  • High Spatial Locality