Using Partial Tag Comparison in Low-Power Snoop-based Chip Multiprocessors
Ali Shafiee, Narges Shahidi, Amirali Baniasadi
Sharif University of Technology, University of Victoria
This Work: Improving Snoop Coherency
• Goal: Improving energy efficiency in snoop-based CMPs.
• Motivation: Broadcasting and processing the entire tag is inefficient.
• Our Solution: Using Partial Tag Comparison (PTC) prior to snoop.
• Key Results:
  • Performance: 2.9%
  • Tag array power: 52%
  • Bandwidth utilization: 78.5%
Our Solution (PTC) vs. Conventional
[Figure: two CMP organizations side by side, each with several D$ caches connected through an interconnect to an upper-level cache.]
• Conventional: Fast +, Power & Bandwidth −
• Our solution: Fast ++ (early miss detection), Power & Bandwidth Efficient +
Conventional Snooping
[Figure: four CPUs, each with a D$, attached to an address bus and a command bus with a snoop bus controller; numbered steps 1 to 5 trace a snoop request.]
• Redundant (miss): ~70%
Snoop Filters
• Goal: Eliminate redundant snoop requests.
• Examples: RegionScout (ISCA’05), CGCT (ISCA’05), SSP (ASPLOS’08)
• PTC:
  (1) Early miss detection using a subset of tag bits.
  (2) Once a miss is detected, the snoop is avoided.
• How often is that possible?
How often are n bits enough to detect a miss?
• 95+% of misses can be detected using 8 bits.
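The idea behind early miss detection can be sketched in a few lines. This is only an illustration, not the authors' hardware: the function names, the example tag values, and the choice of the low-order 8 bits are assumptions. If no stored tag shares the request's low-order bits, the access is certainly a miss; if some tag does match, a full comparison is still needed.

```python
PARTIAL_BITS = 8                      # assumed: compare the 8 low-order tag bits
MASK = (1 << PARTIAL_BITS) - 1


def partial_tag(tag: int) -> int:
    """Keep only the low-order bits of a tag."""
    return tag & MASK


def early_miss(request_tag: int, set_tags: list[int]) -> bool:
    """True when no stored tag matches on the partial bits: a guaranteed miss."""
    want = partial_tag(request_tag)
    return all(partial_tag(t) != want for t in set_tags)


def full_hit(request_tag: int, set_tags: list[int]) -> bool:
    """Conventional comparison over the entire tag."""
    return request_tag in set_tags


# Hypothetical tags stored in one 4-way set.
stored = [0x1A2B3, 0x0FF10, 0x7C4D5, 0x00001]

# Low bits 0x77 match nothing, so the miss is detected from 8 bits alone.
miss_detected_early = early_miss(0x5AB77, stored)

# Low bits 0xB3 match 0x1A2B3, so the partial compare cannot rule out a hit
# and the full tag must still be checked (it turns out to be a miss).
needs_full_compare = not early_miss(0x9ABB3, stored)
```

A partial match never produces a wrong answer; it only means the shortcut does not apply to that request.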
PTC-Filter
[Figure: a D$ with a PTC-Filter that compares the LSBs of the requested tag.]
• Filter miss: avoid snoop, access upper level.
• Filter hit: snoop potential targets over the address bus.
PTC-Filter
[Figure: four cores (0 to 3), each with a 4-way D$ and a PTC-Filter. Each filter holds the LSBs of the other cores' tags (Core1’s LSB, Core2’s LSB, Core3’s LSB); each entry has an 8-bit LSB field plus D and V bits.]
PTC: Filter Miss
[Figure: four CPUs with D$ caches on the address bus and command bus with a snoop bus controller; numbered steps 1 to 3.]
PTC: Filter Hit
[Figure: four CPUs with D$ caches on the address bus and command bus with a snoop bus controller; numbered steps 1 to 6, with ✓ and ✗ marks next to the caches.]
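The filter-miss and filter-hit cases can be sketched as a lookup followed by a routing decision. This is a minimal software model under stated assumptions: the `PTCFilter` class, its table layout, and the `route` function are hypothetical names, an invalid entry (V = 0) is modelled as `None`, and the D bit is left out because the slides do not spell out its role.

```python
class PTCFilter:
    """One core's copy of the other cores' partial tags (hypothetical layout)."""

    def __init__(self, remote_cores, n_sets, n_ways=4, bits=8):
        self.mask = (1 << bits) - 1
        # table[core][set][way] holds a partial tag, or None when invalid.
        self.table = {
            core: [[None] * n_ways for _ in range(n_sets)]
            for core in remote_cores
        }

    def record(self, core, set_index, way, tag):
        """Store the low-order bits of a tag that a remote core has cached."""
        self.table[core][set_index][way] = tag & self.mask

    def potential_targets(self, set_index, tag):
        """Remote cores whose partial tags match; an empty list is a filter miss."""
        want = tag & self.mask
        return [core for core, sets in self.table.items()
                if want in sets[set_index]]


def route(filt, set_index, tag):
    """Filter miss: skip the snoop and go to the upper level.
    Filter hit: snoop only the cores that might hold the block."""
    targets = filt.potential_targets(set_index, tag)
    if not targets:
        return ("upper_level", [])
    return ("snoop", targets)


# Core 0's filter tracks cores 1 to 3; core 2 caches a block in set 0, way 1.
filt = PTCFilter(remote_cores=[1, 2, 3], n_sets=4)
filt.record(core=2, set_index=0, way=1, tag=0xABC12)

hit_case = route(filt, 0, 0x55512)    # low bits 0x12 match core 2's entry
miss_case = route(filt, 0, 0x55513)   # no entry in set 0 ends in 0x13
```

In the hit case only core 2 is snooped, even though the full tags differ; the snooped cache resolves that with its own full tag comparison.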
Filter Maintenance
[Figure: Core 0 and Core i, each with a CPU, snoop controller, pending request table, and PTC-Filter, attached to the command bus and address bus; numbered steps 1 to 6.]
• Request = A
• Miss on A: place it in the position of tag F.
• Place A: insert in Way 1 of core 0.
• Bus message: {Address=A, C=0, W=1, D=1}
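One way to picture filter maintenance is as a fill message that every other core applies to its own filter. The sketch below is an assumption-laden illustration: it reads C as the core id and W as the way (which "Way 1 of core 0" suggests but the slide does not state outright), carries D through without interpreting it, and invents the `FillMessage` type, the dictionary-based filter, and the address split into set index and tag.

```python
from dataclasses import dataclass

BITS = 8      # partial tag width, per the slides
N_SETS = 4    # hypothetical number of cache sets


@dataclass(frozen=True)
class FillMessage:
    address: int  # block address A
    core: int     # C: core that filled the block (assumed meaning)
    way: int      # W: way the block was placed in (assumed meaning)
    d: int        # D: carried as-is; its role is not modelled here


def split(address):
    """Assumed address layout: low bits pick the set, the rest is the tag."""
    return address % N_SETS, address // N_SETS


def apply_fill(filters, msg):
    """Every core except the sender overwrites the entry for (core, set, way),
    replacing whatever partial tag was recorded there before."""
    set_index, tag = split(msg.address)
    for owner, table in filters.items():
        if owner != msg.core:
            table[(msg.core, set_index, msg.way)] = tag & ((1 << BITS) - 1)


# filters[owner] maps (remote core, set, way) to a partial tag.
filters = {0: {}, 1: {}, 2: {}}
apply_fill(filters, FillMessage(address=0x1234, core=0, way=1, d=1))
```

Because the message names the way, a later fill into the same set and way simply overwrites the old partial tag, which mirrors the slide's "place it in the position of tag F".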
Methodology
• SESC simulator, 4-way CMP
• SPLASH-2 benchmarks
• CACTI 6.0
Performance Average: 2.9%
Bandwidth Average: 78.5%
Tag Power Average: 52%
Discussion
• Why do benchmarks show different performance improvements?
  • Different cache miss frequency
  • Different early miss detection frequency
  • Not all cache misses are on the critical path
• Filter overhead:
  • Timing: 1 cycle
  • Power: 78.5% of single tag array access
Summary
• PTC:
  • Using a subset of tag bits to improve bandwidth/power efficiency.
• Results:
  • Performance: 2.9%
  • Tag Power: 52%
  • Bandwidth: 78.5%
Global vs. Local Miss
[Figure: two organizations of D$ caches over an interconnect and an upper-level cache. Global Miss: the request "Have B?" is sent to the other caches, which answer NO. Local Miss: the requester answers "Have B?" itself (NO/YES) before sending anything.]
• Local miss detection: better power/bandwidth profile
• Remote miss detection (source-based approach) vs. (destination-based filter)