430 likes | 593 Views
Power Efficient Cache Coherence. Craig Saldanha Mikko Lipasti. Motivation. Power consumption becoming a serious design constraint. Market demand for faster and more complex servers. Complex coherence protocol and interconnect. f clock + Complexity = P interconnect. Motivation.
E N D
Power Efficient Cache Coherence Craig Saldanha Mikko Lipasti
Motivation • Power consumption becoming a serious design constraint. • Market demand for faster and more complex servers. • Complex coherence protocol and interconnect. • fclock + Complexity = Pinterconnect
Motivation • Traditional power saving methodologies ineffective. • Minimize number of transaction packets. • At the end-points: Jetty. • At the source: Serial Snoop.
OUTLINE • Overview of Snoop based protocols and opportunity for power savings • Latency and Power consumption of parallel snooping techniques • Serial Snooping • Results • Conclusions • Future Work
Local Node Tag Lookup Bus Arbitration Snoop Transmission Remote Node Tag Lookup Snoop Response Combination of Responses Data Fetch Data Transmit Snoop Based Coherence P1 P2 P3 P4 Tag Array Memory Data Array
3 Degrees of Freedom to Speculate SnoopingData FetchData Transmit Degrees of Speculation Parallel Snoop,Spec Dfetch Spec DXmit D-Xmit Parallel Snoop,Spec Dfetch Non-Spec DXmit DFetch X D-Xmit Parallel Snoop,Non-Spec Dfetch Non-Spec DXmit Snoop Serial Snoop,Spec Dfetch Spec DXmit D-Xmit Serial Snoop,Spec Dfetch Non-Spec DXmit DFetch X D-Xmit Serial Snoop,Non-Spec Dfetch Non-Spec DXmit
Latency and Power assumptions • Consider only load misses • Tree of point-point connections. • Latency to traverse a link: 1Bus cycle (7ns) • Tag Look up :1 bus cycle (7ns) • D-Fetch: 2 bus cycles (14ns) • DRAM access: 10 bus cycles (70ns) Backplane MEMORY Root Node Switch2 Switch1 P1 P2 P3 P4 Board1
0 7 Parallel Snoop Speculative Fetch Speculative Transmit (PS/SF/ST) Snoop Broadcast Latency MEMORY Power Plink
Parallel Snoop Speculative Fetch Speculative Transmit (PS/SF/ST) Snoop Broadcast Latency MEMORY 0 7 14 Power Plink+Pswitch
0 7 14 Parallel Snoop Speculative Fetch Speculative Transmit (PS/SF/ST) Snoop Broadcast Latency MEMORY 21 Power 2Plink+Pswitch
0 7 14 Parallel Snoop Speculative Fetch Speculative Transmit (PS/SF/ST) Snoop Broadcast Latency MEMORY 21 28 Power 2Plink+2Pswitch
0 7 14 Parallel Snoop Speculative Fetch Speculative Transmit (PS/SF/ST) Snoop Broadcast MEMORY Latency 21 28 35 Power 5Plink+2Pswitch
0 7 14 Parallel Snoop Speculative Fetch Speculative Transmit (PS/SF/ST) Snoop Broadcast Latency MEMORY 21 28 35 42 Power 5Plink+4Pswitch
0 7 14 Parallel Snoop Speculative Fetch Speculative Transmit (PS/SF/ST) Snoop Broadcast Latency MEMORY 21 28 35 42 49 Power 8Plink+4Pswitch
Parallel Snoop Speculative Fetch Speculative Transmit (PS/SF/ST) Memory Access :Data Fetch MEMORY Latency 35 91 105 Power Pmem
Parallel Snoop Speculative Fetch Speculative Transmit (PS/SF/ST) Memory Access :Data Transmit MEMORY Latency 35 91 105 140 Power Pmem+3Plink+2Pswitch
Parallel Snoop Speculative Fetch Speculative Transmit (PS/SF/ST) Remote Node: Tag Lookup MEMORY Latency 49 56 Power 3Ptag
Parallel Snoop Speculative Fetch Speculative Transmit (PS/SF/ST) Remote Node :Snoop Response MEMORY Latency 49 56 105 Power 3Ptag+3*(Pswitch+2Plink)+Plink+2Pswitch+2Plink
Parallel Snoop Speculative Fetch Speculative Transmit (PS/SF/ST) Remote Node: Data Fetch MEMORY Latency 49 63 Power 3Pcache
Parallel Snoop Speculative Fetch Speculative Transmit (PS/SF/ST) Remote Node :Data Transmit MEMORY Latency 49 63 112 Power 3Ptag+3*(4Plink+3Pswitch)
49 0 TL TL Snoop BRDCST Local Node 49 49 56 56 105 105 TL RSP + CMB RSP + CMB Remote Node 49 49 63 63 112 DF DF Remote Node Data Xmit 35 35 91 91 105 105 140 140 Memory Access Memory Access Data Xmit Data Xmit Memory Parallel Snoop Speculative Fetch Speculative Transmit (PS/SF/ST) Latency Power Remote Node supplies the data 29Plink+18Pswitch+3Ptag+3Pcache+Pmem Memory supplies the data 20Plink+11Pswitch+3Ptag+3Pcache+Pmem
Parallel Snoop Speculative Fetch Non-Speculative Transmit (PS/SF/NT) Latency 0 49 Local Node TL Snoop BRDCST RSP 49 56 105 TL Remote Node + CMB Remote Node 49 63 105 154 DF Data Xmit 35 91 105 140 Memory Access Data Xmit Memory Power Remote Node supplies the data 21Plink+12Pswitch+3Ptag+Pcache+Pmem Memory supplies the data 20Plink+11Pswitch+3Ptag+3Pcache+Pmem
0 49 Local Node TL Snoop BRDCST 49 RSP+ 105 TL Remote Node CMB 105 119 168 DF Data Xmit Remote Node 91 161 196 Memory Access Data Xmit Memory Parallel Snoop Non-Speculative Fetch Non-Speculative Transmit (PS/NF/NT) Latency Power Remote Node supplies the data 21Plink+12Pswitch+3Ptag+Pcache Memory supplies the data 20Plink+11Pswitch+3Ptag+Pmem
Avoids Speculative transmission of Snoop packets. Serial Snooping MEMORY
Avoids Speculative transmission of Snoop packets. Check the nearest neighbor Data supplied with minimum latency and power Serial Snooping MEMORY
Forward snoop to next level Serial Snooping MEMORY
Forward snoop to next level Serial Snooping MEMORY
Search other half of tree Serial Snooping MEMORY
Search other half of tree Search leaf nodes serially Serial Snooping MEMORY
Search other half of tree Search leaf nodes serially Serial Snooping MEMORY
Latency to satisfy a request dependent on distance from requestor. Data resident at the nearest neighbor supplied with the lowest latency and power. Requests visible to memory controller only at root node. Latency is adversely affected when requested data present at the farthest node Worst case power consumption is still less than the parallel snooping. Serial Snooping : Features
Serial Snooping : Request satisfied by Nearest Node Latency MEMORY 0 21 SNP TL P1 28 21 49 TL RSP P2 21 35 56 DF Data Xmit P2 Power Xmit Snoop: 2Plink + Pswitch P2 Tag access and snoop response: Ptag + 2Plink + Pswitch P2 Data Fetch and Xmit: Pcache +2Plink + Pswitch Ptotal= 6Plink+3Pswitch+Ptag+Pcache
0 21 TL SNP P1 21 28 49 63 77 P2 TL RSP 84 77 133 TL Xmit P3 77 91 140 DF Xmit P3 63 133 168 Memory Access Xmit Memory Serial Snooping : Request satisfied by Next-Nearest Neighbor Latency MEMORY Power 16Plink+10Pswitch+2Ptag+Pcache
Serial Snooping : Request satisfied by farthest node Latency MEMORY 0 21 TL Xmit P1 21 28 49 63 77 TL P2 RSP 84 77 10 Xm TL P3 P4 105 112 161 TL RSP P4 105 119 147 168 DF Xmit 168 Memory Access 133 Xmit Spec-Memory 133 147 182 Memory Access Xmit 147 217 252 Memory Access Xmit Non-Spec-Memory Power Remote Node supplies the data 18Plink+11Pswitch+3Ptag+Pcache If Memory supplies the data 17Plink+10Pswitch+3Ptag+Pmem
CONCLUSIONS • Reducing degree of speculation has potential for significant power savings • Performance degradation is minimal for the set of benchmarks studied. • Serial Snooping with speculative memory fetch provides optimal latency and power consumption.
Future Work • Develop detailed execution-driven Power Model • Explore different interconnect topologies. • Examine the viability of adaptive mechanisms for protocol policy.
0 7 14 Serial Snooping Nearest Neighbor MEMORY Latency 21 Power 2Plink+Pswitch
49 0 TL TL Snoop BRDCST Parallel Snoop Speculative Fetch Speculative Transmit (PS/SF/ST) Local Node MEMORY 35 35 91 Memory Access Memory Access