Analyzing the Impact of Data Prefetching on Chip MultiProcessors
Naoto Fukumoto, Tomonobu Mihara, Koji Inoue, Kazuaki Murakami
Kyushu University, Japan
ACSAC-13, August 4, 2008
Background
• CMP (Chip MultiProcessor): several processor cores integrated on a chip
  • High performance through parallel processing
  • New feature: cache-to-cache data transfer
• Limiting factors of CMP performance
  • The memory-wall problem is more critical
  • High frequency of off-chip accesses
  • Off-chip bandwidth does not scale with the number of cores
[Figure: a CMP chip with two cores, private L1 caches, and a shared L2 cache]
Data prefetching is even more important in CMPs
Motivation & Goal
• Motivation
  • Conventional prefetch techniques have been developed for uniprocessors
  • It is not clear that these prefetch techniques achieve high performance in CMPs
  • Is it necessary for prefetch techniques to consider CMP features?
  • We need to know the effect of prefetching on CMPs
• Goal
  • Analyze the effect of prefetching on CMPs
Outline
• Introduction
• Prefetch Taxonomy
  • For multiprocessors
  • Extension for CMPs
• Quantitative Analysis
• Conclusions
Classification of Prefetches According to Impact on Memory Performance
• Focusing on each individual prefetch
• Definition of the prefetch states
  • Initial state: the state just after a block is prefetched into the cache
  • Final state: the state when the block is evicted from the cache
• The state transitions based on events during the lifetime of the prefetched block in the cache
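To make this bookkeeping concrete, here is a minimal Python sketch (not the authors' simulator code) of how each prefetched block could be tracked from its initial state to the final state that is counted at eviction; the class and variable names are illustrative assumptions.

```python
# Minimal sketch: track the taxonomy state of each prefetched block from
# its initial state ("Useless", just after the fill) to the final state
# recorded when the block is evicted.

class PrefetchRecord:
    """Bookkeeping attached to a block when it is brought in by a prefetch."""
    def __init__(self, block_addr):
        self.block_addr = block_addr
        self.state = "Useless"      # initial state: not yet referenced

tracked = {}                        # block address -> PrefetchRecord

def on_prefetch_fill(block_addr):
    # A block enters the cache due to a prefetch: start tracking it.
    tracked[block_addr] = PrefetchRecord(block_addr)

def on_eviction(block_addr, counters):
    # The block leaves the cache: its current state is the final state,
    # which is what the taxonomy counts.
    rec = tracked.pop(block_addr, None)
    if rec is not None:
        counters[rec.state] = counters.get(rec.state, 0) + 1
```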
Definition of Events
• Event 1: the prefetched block is accessed by the local core
• Event 2: the local core accesses a block that was evicted from the cache by the prefetch
• Event 3: the prefetch causes a downgrade followed by a subsequent upgrade in a remote core
[Figure: the local core prefetches block A and later loads A from its L1, hiding the off-chip access latency]
Definition of Events
• Event 1: the prefetched block is accessed by the local core
• Event 2: the local core accesses a block that was evicted from the cache by the prefetch
• Event 3: the prefetch causes a downgrade followed by a subsequent upgrade in a remote core
[Figure: the prefetch of block A evicts block B from the local L1; a later load of B by the local core then misses in the cache]
Definition of Events
• Event 1: the prefetched block is accessed by the local core
• Event 2: the local core accesses a block that was evicted from the cache by the prefetch
• Event 3: the prefetch causes a downgrade followed by a subsequent upgrade in a remote core
[Figure: a remote core stores to the prefetched block A, sending an invalidate request that invalidates A in the local L1]
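As a concrete illustration of the three events, the following hedged sketch shows how a single access could be classified for one tracked prefetched block; the function signature and the `evicted_by` bookkeeping are assumptions, not the paper's implementation.

```python
# Illustrative sketch only: detect Events 1-3 for a tracked prefetched
# block that evicted another block when it was filled.

evicted_by = {}   # victim block address -> prefetched block that evicted it

def classify_access(core, op, addr, local_core, prefetched_addr):
    """Return which event (if any) this access triggers."""
    if core == local_core and addr == prefetched_addr:
        return "Event1"   # local core references the prefetched block
    if core == local_core and evicted_by.get(addr) == prefetched_addr:
        return "Event2"   # local core touches the block the prefetch evicted
    if core != local_core and op == "store" and addr == prefetched_addr:
        return "Event3"   # remote store invalidates the prefetched copy
    return None
```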
The State Transition of Prefetch in the Local Core
[Figure: state-transition diagram; Events 1 and 2 move a prefetch between the states below, illustrated by the local core prefetching A (evicting B), then loading A and loading B, which misses]
• Useless (initial state): # of memory accesses is increased in the local core
• Useful: # of local L1 cache misses is decreased
• Useless/Conflict: # of local L1 cache misses and # of memory accesses are increased in the local core
• Useful/Conflict: # of memory accesses is increased in the local core
The State Transition of Prefetch in Local and Remote Cores*
[Figure: state-transition diagram with states Useless, Useful, Useless/Conflict, and Useful/Conflict, connected by Events 1 and 2]
* Jerger, N., Hill, E., and Lipasti, M., "Friendly Fire: Understanding the Effects of Multiprocessor Prefetching," In Proceedings of the IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS), 2006.
The State Transition of Prefetch in Local and Remote Cores*
[Figure: state-transition diagram extended with Event 3; a remote store to the prefetched block A sends an invalidate request while the local core's load of the evicted block B misses]
• Harmful (reached via Event 3): # of invalidation requests and # of memory accesses are increased
• Harmful/Conflict (reached via Event 3): # of invalidation requests, # of memory accesses, and # of cache misses are increased
* Jerger, N., Hill, E., and Lipasti, M., "Friendly Fire: Understanding the Effects of Multiprocessor Prefetching," In Proceedings of the IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS), 2006.
Considering Cache-to-Cache Data Transfer
• Event 4: the prefetched block, loaded from the L2 or main memory, is accessed by a remote core
[Figure: the local core prefetches block A; a remote core's load of A is served by a cache-to-cache transfer, hiding the off-chip access latency]
The State Transition in CMPs
[Figure: state-transition diagram with states Useless, Useful, Harmful, Useless/Conflict, Useful/Conflict, and Harmful/Conflict, connected by Events 1-3]
The State Transition in CMPs
[Figure: state-transition diagram extended with Event 4, which adds the states Useless/Remote and Useless/Conflict/Remote]
• Useless/Remote (reached via Event 4): # of L2 accesses is decreased in the remote core
• Useless/Conflict/Remote (reached via Event 4): # of L2 accesses is decreased in the remote core, # of cache misses is increased in the local core
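The full transition relation is only shown as a diagram in the original slides; the table below is a reconstructed sketch, and the exact edge set is an assumption where the flattened figures are ambiguous.

```python
# Sketch of the CMP prefetch-state transition table reconstructed from the
# slides; edges not listed here leave the state unchanged.

TRANSITIONS = {
    ("Useless",          "Event1"): "Useful",
    ("Useless",          "Event2"): "Useless/Conflict",
    ("Useless",          "Event4"): "Useless/Remote",
    ("Useless/Conflict", "Event1"): "Useful/Conflict",
    ("Useless/Conflict", "Event4"): "Useless/Conflict/Remote",
    ("Useful",           "Event2"): "Useful/Conflict",
    ("Useful",           "Event3"): "Harmful",
    ("Useful/Conflict",  "Event3"): "Harmful/Conflict",
}

def next_state(state, event):
    # A (state, event) pair not in the table keeps the current state.
    return TRANSITIONS.get((state, event), state)
```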
Classification of Prefetches in CMPs
• Useful: one cache miss is decreased in the local core (best case)
• Useless/Remote: one L2 access is decreased in the remote core (better case)
• Useless: one memory access is increased in the local core
• Harmful: one memory access in the local core and one invalidate request in the remote core are increased
• Conflict states (Useless/Conflict, Useful/Conflict, Harmful/Conflict, Useless/Conflict/Remote): additionally, one cache miss is increased in the local core (worse to worst cases)
Outline
• Introduction
• Prefetch Taxonomy
  • For multiprocessors
  • For CMPs
• Quantitative Analysis
• Conclusions
Simulation Environment
• Simulator
  • M5: CMP simulator
  • Prefetch mechanism attached to the L1 caches
  • Stride prefetch and tagged prefetch
  • MOESI coherence protocol
• Benchmark programs
  • SPLASH-2: scientific computation programs
[Figure: 4-core CMP; each core has private 64KB 2-way L1 I/D caches, with a shared 4MB 8-way L2 cache and main memory]
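For reference, here is a minimal sketch of the two prefetch policies named above, written from their textbook definitions rather than from the paper's M5 configuration; the block size, table organization, and prefetch distance of one block are illustrative assumptions.

```python
# Hedged sketches of tagged and stride prefetching at the L1 cache.

BLOCK = 64  # assumed cache-block size in bytes

class TaggedPrefetcher:
    """Tagged sequential prefetch: prefetch the next block on a miss or on
    the first demand reference to a previously prefetched (tagged) block."""
    def __init__(self):
        self.tagged = set()          # blocks brought in by prefetch, not yet used

    def access(self, addr, hit):
        block = addr // BLOCK
        if not hit or block in self.tagged:
            self.tagged.discard(block)   # first use clears the tag
            nxt = block + 1
            self.tagged.add(nxt)         # the new prefetch is tagged
            return [nxt * BLOCK]         # address to prefetch
        return []

class StridePrefetcher:
    """Stride prefetch: a per-PC table records the last address and stride;
    when the same stride repeats, prefetch the next address in the pattern."""
    def __init__(self):
        self.table = {}              # pc -> (last_addr, stride, confirmed)

    def access(self, pc, addr):
        last, stride, _ = self.table.get(pc, (addr, 0, False))
        new_stride = addr - last
        confirmed = (new_stride == stride and new_stride != 0)
        self.table[pc] = (addr, new_stride, confirmed)
        if confirmed:
            return [addr + new_stride]   # predicted next access
        return []
```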
Can Conventional Prefetch Techniques Exploit Cache-to-Cache Data Transfer?
[Figure: fraction of Useless/Remote and Useless/Conflict/Remote prefetches for FMM, LU, Radix, and Water; bars 1 and 2 are stride prefetch and tagged prefetch]
• The percentage of Useless/Remote and Useless/Conflict/Remote prefetches is only 5%
• Conventional prefetch techniques do not exploit cache-to-cache data transfer effectively
Are Prefetched-Block Invalidations a Serious Problem for CMPs?
[Figure: fraction of Harmful and Harmful/Conflict prefetches for FMM, LU, Radix, and Water; bars 1 and 2 are stride prefetch and tagged prefetch]
• Harmful and Harmful/Conflict prefetches are extremely few (0.2% on average)
• Invalidations of prefetched blocks are negligible
Multiprocessor vs. Chip Multiprocessor
• Harmful and Harmful/Conflict prefetches
  • 0.01~0.70% in the CMP (tagged prefetch): small negative impact
  • 2~18% in an MP* (sequential prefetch): large negative impact
• Why does this difference occur?
* Jerger, N., Hill, E., and Lipasti, M., "Friendly Fire: Understanding the Effects of Multiprocessor Prefetching," In Proceedings of the IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS), 2006.
The Reason for the Difference in Invalidation Rate
• Difference in the lifetime of prefetched blocks in the cache
  • Long lifetime (large cache size): high possibility of invalidation
  • Short lifetime (small cache size): low possibility of invalidation
• If the cache size is large, the negative impact is large (like MPs)
• If the cache size is small, the negative impact is small (like CMPs)
[Figure: a CMP with small private L1 caches and a shared L2, vs. a multiprocessor where each node's large cache loads the prefetched block and keeps coherence]
The Invalidation Rate of Prefetched Blocks with Varying L1 Cache Size (tagged prefetch)
[Figure: invalidation rate of prefetched blocks vs. L1 cache size]
• Larger cache: large negative impact (like MPs)
• Smaller cache: small negative impact (like CMPs)
Summary
• Contributions
  • A new method to analyze prefetch effects on CMPs
  • A quantitative analysis of two types of prefetching
• Observations
  • Conventional prefetch techniques DO NOT exploit cache-to-cache data transfer effectively
  • Harmful prefetches are NOT harmful in CMPs
• Future work
  • Propose a novel prefetch technique exploiting the features of CMPs
Thank you. Any questions? ~Please speak slowly~
Average Memory Access Time (AMAT)
[Figure: memory hierarchy used for AMAT — local L1 cache, remote L1 caches over the shared bus, shared L2 cache, memory bus, and main memory]
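The slide's own AMAT equation is not recoverable from the extraction; the expression below is a hedged reconstruction for the hierarchy shown, where the hit times t, the L1 miss rate m_L1, and the fractions f of L1 misses served by a remote L1 (cache-to-cache), the shared L2, or main memory are assumed symbols, not the authors' notation.

```latex
% Hedged reconstruction of an AMAT expression for the hierarchy above.
\mathrm{AMAT} = t_{L1}
  + m_{L1}\,\bigl( f_{remote}\,t_{remote\,L1}
                 + f_{L2}\,t_{L2}
                 + f_{mem}\,(t_{bus} + t_{mem}) \bigr),
\qquad f_{remote} + f_{L2} + f_{mem} = 1
```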
MultiProcessor Traffic and Miss Taxonomy (MPTMT) [Jerger'06]
• MPTMT is an extended version of the uniprocessor taxonomy (Srinivasan et al.)
• Prefetches are classified according to their effects on memory performance
• By counting the classified prefetches, we can measure the prefetch effects precisely