Design Exploration of an Instruction-Based Shared Markov Table on CMPs
Karthik Ramachandran & Lixin Su
Outline
• Motivation
  • Multiple cores on single chip
  • Commercial workloads
• Our study
  • Start from instruction sharing pattern analysis
    • Our experiments
  • Move on to instruction cache miss pattern analysis
    • Our experiments
• Conclusions
Motivation
• Technology push: CMPs
  • Lower access latency to other processors
• Application pull: commercial workloads
  • OS behavior
  • Database applications
• Opportunities for shared structures
  • Markov-based sharing structure
  • Address large instruction footprint vs. small, fast I-caches
Instruction Sharing Analysis
• How may instruction sharing occur?
  • OS: multiple processes, scheduling
  • DB: concurrent transactions, repeated queries, multiple threads
• How can CMPs benefit from instruction sharing?
  • Snoop/grab instructions from other cores
  • Shared structures
• Let's investigate it.
Methodology
• Two-step approach
• Experiment I
  • Targets instruction trace analysis
  • How much sharing occurs?
• Experiment II
  • Targets I-cache miss stream analysis
  • Examine the potential of a shared Markov structure
Experiment I
• Add instrumentation code to analyze committed instructions
• Focus on repeated sequences of 2, 3, 4, and 5 instructions across 16P
• Histogram-based approach
How do we count?
[Figure: occurrences of the sequence {A,B} on processors P1 to P4. P1: 3 times, P2: 1 time, P3: 0 times, P4: 2 times. Total: 10 times.]
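The histogram-based counting can be sketched roughly as follows. This is a minimal illustration, not the authors' instrumentation code: the function names (`count_sequences`, `shared_histogram`), the trace format (one list of instruction identifiers per processor), and the toy traces are all assumptions. It simply counts every window of n consecutive instructions per processor; the exact counting convention behind the slide's figure is not recoverable from the text.

```python
from collections import Counter

def count_sequences(trace, n):
    """Count every window of n consecutive instructions in one trace."""
    return Counter(tuple(trace[i:i + n]) for i in range(len(trace) - n + 1))

def shared_histogram(traces, n):
    """Build per-processor counts, a merged total, and the number of
    processors on which each sequence appears at least once."""
    per_proc = {p: count_sequences(t, n) for p, t in traces.items()}
    total = Counter()
    for counts in per_proc.values():
        total.update(counts)
    sharers = {seq: sum(1 for c in per_proc.values() if seq in c)
               for seq in total}
    return per_proc, total, sharers

# Toy traces (hypothetical data, for illustration only).
traces = {
    "P1": ["A", "B", "C", "A", "B"],
    "P2": ["A", "B", "D"],
    "P3": ["C", "D", "E"],
}
per_proc, total, sharers = shared_histogram(traces, 2)
```

In this toy input the sequence `("A", "B")` occurs twice on P1, once on P2, and never on P3. A sequence seen on more than one processor is a candidate for instruction sharing, while a very high count on a single processor may just be a loop.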
Results - Experiment I
Q) Is there any instruction sharing?
A) Maybe. Observe the number of times the sequences of 2 to 5 instructions repeat (~13,000 to 17,000).
Q) But why do the numbers for a sequence pattern of 5 instructions not differ much from those for a sequence pattern of 2 instructions?
A) Spin loops! For the non-warm-up case: 50%. For the warm-up case: 30%.
Experiment II
• Focus on instruction cache misses
  • Is there sharing involved here too?
  • Upper-bound performance benefit of a shared Markov table?
• Experiment setup
  • 16K-entry fully associative shared Markov table
  • Each entry has two consecutive misses from the same processor
  • Atomic lookup and hit/miss counter update when a processor has two consecutive I$ misses
  • On a miss, insert a new entry at the LRU head
  • On a hit, record the distance from the LRU head and move the hit entry to the LRU head
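The experiment setup can be approximated with a small software model. The sketch below is an assumption-laden illustration, not the authors' simulator: the class name `SharedMarkovTable`, the `record_miss` method, and the sample addresses are hypothetical, and a plain Python list stands in for the fully associative LRU structure (index 0 is the LRU head). Because the model is single-threaded, each lookup and counter update is trivially atomic.

```python
class SharedMarkovTable:
    """Toy model of a shared, fully associative, LRU-ordered Markov table."""

    def __init__(self, capacity=16 * 1024):
        self.capacity = capacity
        self.entries = []        # index 0 is the LRU head (most recent)
        self.last_miss = {}      # processor id -> previous I-cache miss
        self.hits = 0
        self.misses = 0
        self.hit_distances = []  # distance from the LRU head for each hit

    def record_miss(self, proc, addr):
        """Process one I-cache miss; return the hit distance, or None."""
        prev = self.last_miss.get(proc)
        self.last_miss[proc] = addr
        if prev is None:
            return None          # no pair of consecutive misses yet
        key = (prev, addr)       # two consecutive misses, same processor
        try:
            distance = self.entries.index(key)
        except ValueError:
            distance = None
        if distance is None:
            self.misses += 1
            if len(self.entries) >= self.capacity:
                self.entries.pop()       # evict the entry at the LRU tail
        else:
            self.hits += 1
            self.hit_distances.append(distance)
            del self.entries[distance]
        self.entries.insert(0, key)      # new or hit entry goes to the head
        return distance

# Hypothetical miss stream: P1 later repeats a miss pair first seen on P0.
table = SharedMarkovTable(capacity=4)
stream = [("P0", 0x100), ("P0", 0x140), ("P1", 0x200),
          ("P1", 0x240), ("P1", 0x100), ("P1", 0x140)]
results = [table.record_miss(p, a) for p, a in stream]
```

Here the last miss by P1 hits the pair that P0 inserted earlier, which is the kind of cross-processor hit the experiment is looking for. The list scan makes each lookup linear in the table size, which is acceptable for a sketch but too slow for long traces.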
Design Block Diagram
• Small, fast shared Markov table
• Prefetch when an I$ miss occurs
[Block diagram labels: P, P, I$, I$, Markov Table, L2$]
Table Lookup Hit Ratio
Q1) Is there a lot of miss sharing?
Q2) Does a constructive interference pattern exist to help a CMP?
Q3) Do equal opportunities exist for all processors?
Let's Answer the Questions
A1) Yes, of course.
A2) Definitely. A constructive interference pattern exists, as the figure shows.
A3) Yes. The hit/miss ratio remains fairly stable across processors despite variance in the number of I-cache misses.
How Big Should the Table Be?
• About 60% of hits are within 4K entries of the LRU head.
• A shared Markov table can fairly utilize I-cache miss sharing.
• What about snooping and grabbing instructions from other I-caches?
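Recorded hit distances translate into a table-size estimate because of the LRU stack property: in a fully associative LRU table, a hit found at distance d from the head would also be a hit in any table with more than d entries. The sketch below shows that calculation under assumptions: the helper names and the sample distances are made up for illustration and are not the measured data behind the 60% figure.

```python
def hit_fraction_within(distances, limit):
    """Fraction of recorded hits whose distance from the LRU head is
    below `limit`, i.e. hits an LRU table of `limit` entries would keep."""
    if not distances:
        return 0.0
    return sum(1 for d in distances if d < limit) / len(distances)

def smallest_size_for(distances, target, candidates):
    """Smallest candidate table size that retains at least `target`
    (a fraction between 0 and 1) of the recorded hits, or None."""
    for size in sorted(candidates):
        if hit_fraction_within(distances, size) >= target:
            return size
    return None

# Hypothetical hit distances, not measured data.
sample_distances = [0, 3, 10, 500, 2000, 5000, 9000, 12000, 100, 4095]
fraction_4k = hit_fraction_within(sample_distances, 4096)
best = smallest_size_for(sample_distances, 0.6, [1024, 4096, 16384])
```

With a real distance histogram, sweeping the candidate sizes this way gives the hit ratio a smaller table would retain without rerunning the simulation for each size.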
Real Design Issues
• Associativity and size of the table
• Choose the right path if multiple paths exist
• Separate address directory from data entries for the table and have multiple address directories
• What if a sequential prefetcher exists?
Conclusions
• Instruction sharing on CMPs exists. Spin loops occur frequently with current workloads.
• A Markov-based structure for storing I-cache misses may be helpful on CMPs.
Comparison with Real Markov Prefetching
• Miss to A and prefetch along A, B & C
• Misses to A & C and then look up in the table
  • Update hit/miss counters and change/record LRU
[Figure: processor P and table entries with counters (Cnt: 5, 2, 3), ordered from LRU head to LRU tail]
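For contrast with the lookup-only experiment, the sketch below shows a generic first-order Markov predictor of the kind the first bullet alludes to: on each miss it proposes the most frequently observed successor misses as prefetch candidates. It is a textbook-style illustration under assumptions, not the design evaluated in these slides; the class name `MarkovPredictor`, the `width` parameter, and the miss sequence are hypothetical.

```python
from collections import Counter, defaultdict

class MarkovPredictor:
    """First-order Markov predictor over a stream of miss addresses."""

    def __init__(self, width=2):
        self.width = width                      # max prefetch candidates
        self.successors = defaultdict(Counter)  # miss -> next-miss counts
        self.prev = None

    def on_miss(self, addr):
        """Return prefetch candidates for `addr`, then learn the
        transition from the previous miss to this one."""
        ranked = self.successors[addr].most_common(self.width)
        predictions = [a for a, _ in ranked]
        if self.prev is not None:
            self.successors[self.prev][addr] += 1
        self.prev = addr
        return predictions

# Hypothetical miss stream from one processor.
predictor = MarkovPredictor(width=2)
outputs = [predictor.on_miss(a) for a in ["A", "B", "A", "C", "A", "B", "A"]]
```

By the final miss to A the predictor has seen B follow A twice and C once, so it ranks B ahead of C. The experiment in these slides skips the prefetch step and only measures how often such transitions recur across processors.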
Lookup Example I
[Figure: processor P looks up the table; entries ordered from LRU head to LRU tail]
Lookup Example II
[Figure: processor P looks up the table; entries ordered from LRU head to LRU tail]