Design Exploration of an Instruction-Based Shared Markov Table on CMPs

Karthik Ramachandran & Lixin Su

Outline
• Motivation
  • Multiple cores on a single chip
  • Commercial workloads
• Our study
  • Start from instruction sharing pattern analysis
    • Our experiments
  • Move on to instruction cache miss pattern analysis
    • Our experiments
• Conclusions

Motivation
• Technology push: CMPs
  • Lower access latency to other processors
• Application pull: commercial workloads
  • OS behavior
  • Database applications
• Opportunities for shared structures
  • Markov-based sharing structure
  • Address large instruction footprint vs. small, fast I-caches

Instruction Sharing Analysis
• How may instruction sharing occur?
  • OS: multiple processes, scheduling
  • DB: concurrent transactions, repeated queries, multiple threads
• How can CMPs benefit from instruction sharing?
  • Snoop/grab instructions from other cores
  • Shared structures
• Let's investigate it.

Methodology
• Two-step approach
• Experiment I
  • Targets instruction trace analysis
  • How much sharing occurs?
• Experiment II
  • Targets I-cache miss stream analysis
  • Examines the potential of a shared Markov structure

Experiment I
• Add instrumentation code to analyze committed instructions
• Focus on repeated sequences of 2, 3, 4, and 5 instructions across 16 processors
• Histogram-based approach

How do we count?
[Figure: occurrences of the sequence {A,B} in the instruction streams of processors P1-P4]
• P1: 3 times
• P2: 1 time
• P3: 0 times
• P4: 2 times
• Total: 10 times
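The histogram-based counting can be sketched as follows. This is a minimal illustration, not the authors' instrumentation code: the `count_sequences` function, the `traces` layout, and the sample streams are all hypothetical. The sample data assumes one reading that is consistent with the slide's numbers, namely that the per-processor figures count repeats after the first occurrence (4, 2, 1, and 3 occurrences give 3, 1, 0, and 2 repeats), while the total counts all 10 occurrences.

```python
from collections import Counter

def count_sequences(traces, n):
    """Histogram n-instruction sequences per processor and across all processors.

    traces: hypothetical dict mapping a processor name to its list of
    committed instruction addresses (or labels), in commit order.
    """
    per_proc = {}
    total = Counter()
    for proc, trace in traces.items():
        # Slide a window of n instructions over the committed stream.
        windows = (tuple(trace[i:i + n]) for i in range(len(trace) - n + 1))
        per_proc[proc] = Counter(windows)
        total.update(per_proc[proc])
    return per_proc, total

# Made-up streams in which {A,B} occurs 4, 2, 1, and 3 times.
traces = {
    "P1": list("ABABABAB"),
    "P2": list("ABAB"),
    "P3": list("AB"),
    "P4": list("ABABAB"),
}
per_proc, total = count_sequences(traces, 2)

# Assumed interpretation: a "repeat" is any occurrence after the first.
repeats = {p: c[("A", "B")] - 1 for p, c in per_proc.items()}
```

Running the same function with n set to 3, 4, and 5 would give the other sequence lengths the slide mentions.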

Results - Experiment I
Q) Is there any instruction sharing?
A) Maybe; observe the number of times that sequences of 2-5 instructions repeat (~13000-17000).
Q) But why do the numbers for a 5-instruction sequence pattern not differ much from those for a 2-instruction sequence pattern?
A) Spin loops! For the non-warm-up case: 50%. For the warm-up case: 30%.

Experiment II
• Focus on instruction cache misses
  • Is there sharing involved here too?
  • Upper-bound performance benefit of a shared Markov table?
• Experiment setup
  • 16K-entry fully associative shared Markov table
  • Each entry holds two consecutive misses from the same processor
  • Atomic lookup and hit/miss counter update when a processor has two consecutive I-cache misses
  • On a miss, insert a new entry at the LRU head
  • On a hit, record the distance from the LRU head and move the hit entry to the LRU head
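The experiment setup can be sketched as a small trace-driven model. This is an illustrative sketch under stated assumptions, not the simulator used in the study: the class name `SharedMarkovTable`, the method `record_miss`, and the choice to key each entry by the pair (previous miss address, current miss address) are all hypothetical. A plain list stands in for the LRU stack, which keeps the code short at the cost of linear-time lookups.

```python
from collections import Counter

class SharedMarkovTable:
    """Fully associative table of miss pairs, shared by all processors."""

    def __init__(self, capacity=16 * 1024):
        self.capacity = capacity
        self.stack = []            # index 0 is the LRU head (most recent)
        self.last_miss = {}        # processor -> its previous miss address
        self.hits = Counter()      # per-processor table hits
        self.misses = Counter()    # per-processor table misses
        self.hit_distances = []    # distance from the LRU head on each hit

    def record_miss(self, proc, addr):
        """Process one I-cache miss; return the hit distance, or None."""
        prev = self.last_miss.get(proc)
        self.last_miss[proc] = addr
        if prev is None:
            return None            # need two consecutive misses to look up
        entry = (prev, addr)
        try:
            dist = self.stack.index(entry)
        except ValueError:
            # Table miss: insert at the LRU head, evict the tail if full.
            self.misses[proc] += 1
            self.stack.insert(0, entry)
            if len(self.stack) > self.capacity:
                self.stack.pop()
            return None
        # Table hit: record the distance and move the entry to the head.
        self.hits[proc] += 1
        self.hit_distances.append(dist)
        self.stack.insert(0, self.stack.pop(dist))
        return dist

# Processor 0 installs the pair; processor 1 then hits on it.
table = SharedMarkovTable(capacity=4)
for proc, addr in [(0, 0x100), (0, 0x140), (1, 0x100), (1, 0x140)]:
    table.record_miss(proc, addr)
```

In this toy run the hit by processor 1 on an entry inserted by processor 0 is the kind of cross-processor sharing the experiment measures. The atomicity the slide mentions is implicit here because the model is single-threaded.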

Design Block Diagram
• Small, fast shared Markov table
• Prefetch when an I$ miss occurs
[Figure: block diagram with components P, P, I$, I$, Markov Table, L2 $]

Table Lookup Hit Ratio
Q1) Is there a lot of miss sharing?
Q2) Does a constructive interference pattern exist to help a CMP?
Q3) Do equal opportunities exist for all processors?

Let's Answer the Questions
A1) Yes, of course.
A2) Definitely; a constructive interference pattern exists, as the figure shows.
A3) Yes. The hit/miss ratio remains fairly stable across processors despite variance in the number of I-cache misses.

How Big Should the Table Be?
• About 60% of hits are within 4K entries of the LRU head.
• A shared Markov table can fairly utilize I-cache miss sharing.
• What about snooping and grabbing instructions from other I-caches?
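A figure such as "60% of hits within 4K entries" can be derived from the recorded hit distances. The helper below is a hypothetical sketch; the name `fraction_within` and the sample distances are made up for illustration and are not data from the study.

```python
def fraction_within(hit_distances, limit):
    """Fraction of hits whose distance from the LRU head is below limit."""
    if not hit_distances:
        return 0.0
    near = sum(1 for d in hit_distances if d < limit)
    return near / len(hit_distances)

# Made-up distances: three of the five hits fall within 4K entries.
sample = [0, 10, 5000, 100, 9000]
share = fraction_within(sample, 4 * 1024)
```

Sweeping `limit` over candidate table sizes would show how many of the observed hits a smaller table could still capture.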

Real Design Issues
• Associativity and size of the table
• Choosing the right path if multiple paths exist
• Separating the address directory from the data entries for the table, and having multiple address directories
• What if a sequential prefetcher exists?

Conclusions
• Instruction sharing on CMPs exists. Spin loops occur frequently with current workloads.
• A Markov-based structure for storing I-cache misses may be helpful on CMPs.

Questions?

Comparison with Real Markov Prefetching
[Figure: table entries with a count (Cnt) column, ordered from LRU head to LRU tail]
• P: Miss to A and prefetch along A, B & C
• P: Misses to A & C and then look up in the table
• Update hit/miss counters and change/record LRU

Lookup Example I
[Figure: table lookup by P, with the LRU head and LRU tail marked]

Lookup Example II
[Figure: table lookup by P, with the LRU head and LRU tail marked]
