Using Interaction Cost (icost) for Microarchitectural Bottleneck Analysis

Brian Fields (1), Rastislav Bodik (1), Mark Hill (2), Chris Newburn (3)
(1) UC-Berkeley, (2) UW-Madison, (3) Intel

Outline
Interaction Cost
• Bottleneck analysis complicated by parallelism
• Parallelism causes interactions
  • Qualitative: parallel and serial interactions
  • Quantitative: interaction cost (icost)
• Icost case study: designing a deep pipeline
Hardware profiler
• Icost “shotgun” profiler
• Replace current performance counters

Bottleneck analysis is hard
Why? Microarchitectural parallelism complicates performance understanding
• Two parallel cache misses
• A multiply and window stall
• A branch mispredict and full-store-buffer stall occur in the same cycle that three loads are waiting on the memory system and two floating-point multiplies are executing

What we want from bottleneck analysis
• Performance cost (or reward)
  • speedup when the bottleneck is removed
Q: What if two bottlenecks interact?

Our solution: measure interactions
Two parallel cache misses (each 100 cycles): miss #1 (100), miss #2 (100)
Cost(miss #1) = 0
Cost(miss #2) = 0
Cost({miss #1, miss #2}) = 100
Aggregate cost > sum of individual costs (100 > 0 + 0) => parallel interaction
icost = aggregate cost - sum of individual costs = 100 - 0 - 0 = 100
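The arithmetic on this slide can be reproduced with a small sketch. It assumes a toy timing model that is not in the slides: all events run in parallel, execution time is the longest latency, and removing a bottleneck means setting its latency to zero. The names `exec_time`, `cost`, and `icost` are hypothetical.

```python
from typing import Dict, Iterable


def exec_time(latencies: Dict[str, int], idealized: Iterable[str] = ()) -> int:
    """Toy model (assumption): events overlap fully, so time is the longest latency.
    Idealized events are treated as taking 0 cycles."""
    removed = set(idealized)
    return max(0 if name in removed else lat for name, lat in latencies.items())


def cost(latencies: Dict[str, int], events: Iterable[str]) -> int:
    """Speedup (in cycles) obtained when the given events are removed together."""
    return exec_time(latencies) - exec_time(latencies, events)


def icost(latencies: Dict[str, int], a: str, b: str) -> int:
    """icost = aggregate cost - sum of individual costs."""
    return cost(latencies, [a, b]) - cost(latencies, [a]) - cost(latencies, [b])


# Two parallel cache misses of 100 cycles each, as on the slide.
misses = {"miss1": 100, "miss2": 100}
print(cost(misses, ["miss1"]))           # 0
print(cost(misses, ["miss2"]))           # 0
print(cost(misses, ["miss1", "miss2"]))  # 100
print(icost(misses, "miss1", "miss2"))   # 100
```

Removing either miss alone gains nothing because the other still takes 100 cycles; only removing both shortens execution, which is what the positive icost records.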

Interaction cost (icost)
icost = aggregate cost - sum of individual costs
• Positive icost => parallel interaction (figure: miss #1 alongside miss #2)
• Zero icost => ?

Interaction cost (icost)
icost = aggregate cost - sum of individual costs
• Positive icost => parallel interaction (figure: miss #1 alongside miss #2)
• Zero icost => independent (figure: miss #1 . . . miss #2)
• Negative icost => ?

Negative icost
Two serial cache misses (data dependent): miss #1 (100), miss #2 (100)
ALU latency (110 cycles)
Cost(miss #1) = ?

Negative icost
Two serial cache misses (data dependent): miss #1 (100), miss #2 (100), overlapped with ALU latency (110 cycles)
Cost(miss #1) = 90
Cost(miss #2) = 90
Cost({miss #1, miss #2}) = 90
icost = aggregate cost - sum of individual costs = 90 - 90 - 90 = -90
Negative icost => serial interaction
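The serial case can be checked with a path-based sketch. It assumes a simplified model that is not in the slides: execution time is the longest of several paths, each path being a chain of events whose latencies add up. The two dependent misses form one path and the 110-cycle ALU latency forms a second, parallel path. All names are hypothetical.

```python
from typing import Dict, Iterable, List


def exec_time(paths: List[List[str]], latency: Dict[str, int],
              idealized: Iterable[str] = ()) -> int:
    """Longest path through the toy graph; idealized events cost 0 cycles."""
    removed = set(idealized)
    return max(sum(0 if e in removed else latency[e] for e in path)
               for path in paths)


def cost(paths, latency, events) -> int:
    """Cycles saved when the given events are removed together."""
    return exec_time(paths, latency) - exec_time(paths, latency, events)


def icost(paths, latency, a: str, b: str) -> int:
    """icost = aggregate cost - sum of individual costs."""
    return (cost(paths, latency, [a, b])
            - cost(paths, latency, [a])
            - cost(paths, latency, [b]))


latency = {"miss1": 100, "miss2": 100, "alu": 110}
paths = [["miss1", "miss2"],  # data-dependent misses: 100 + 100 = 200 cycles
         ["alu"]]             # ALU work overlapping the misses: 110 cycles

print(cost(paths, latency, ["miss1"]))           # 90  (200 -> 110)
print(cost(paths, latency, ["miss2"]))           # 90
print(cost(paths, latency, ["miss1", "miss2"]))  # 90  (ALU path still takes 110)
print(icost(paths, latency, "miss1", "miss2"))   # -90
```

Once one miss is removed, the ALU path becomes the limit, so removing the second miss saves nothing more. That is why the aggregate cost equals each individual cost and the icost is negative.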

Interaction cost (icost)
icost = aggregate cost - sum of individual costs
• Positive icost => parallel interaction (figure: miss #1 alongside miss #2)
• Zero icost => independent (figure: miss #1 . . . miss #2)
• Negative icost => serial interaction (figure: miss #1 then miss #2, with ALU latency)
Other figure labels: Branch mispredict, Load-Replay Trap, Fetch BW, LSQ stall
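The sign rule on this slide is mechanical enough to write down. A minimal sketch with hypothetical function names follows; the first two inputs are the numbers from the earlier slides, and the third (two events that each save 100 cycles and together save 200) is an invented illustration of the independent case.

```python
def interaction_cost(cost_a: int, cost_b: int, cost_ab: int) -> int:
    """icost = aggregate cost - sum of individual costs."""
    return cost_ab - cost_a - cost_b


def classify(icost: int) -> str:
    """Map the sign of icost to the kind of interaction."""
    if icost > 0:
        return "parallel"
    if icost < 0:
        return "serial"
    return "independent"


# (cost_a, cost_b, cost_ab) triples
parallel_misses = (0, 0, 100)     # from the parallel-miss slide
serial_misses = (90, 90, 90)      # from the serial-miss slide
independent = (100, 100, 200)     # made-up numbers for illustration

for name, triple in [("parallel_misses", parallel_misses),
                     ("serial_misses", serial_misses),
                     ("independent", independent)]:
    value = interaction_cost(*triple)
    print(name, value, classify(value))
```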

Why care about serial interactions?
miss #1 (100), miss #2 (100), ALU latency (110 cycles)
Reason #1: We are over-optimizing!
Prefetching miss #2 doesn’t help if miss #1 is already prefetched (but the overhead still costs us)
Reason #2: We have a choice of what to optimize
Prefetching miss #2 has the same effect as prefetching miss #1

Icost Case Study: Deep pipelines
Deep pipelines cause long latency loops:
• level-one (DL1) cache access, issue-wakeup, branch misprediction, …
But can often mitigate them indirectly
Assume 4-cycle DL1 access; how to mitigate?
• Increase cache ports?
• Increase window size?
• Increase fetch BW?
• Reduce cache misses?
Really, looking for serial interactions!

Icost Case Study: Deep pipelines
[Figure: dependence graph for instructions i1-i6 with F, E, and C nodes and edge latencies; labels: DL1 access, window edge]

Icost Breakdown (6 wide, 64-entry window)

Vortex Breakdowns, enlarging the window

Outline
Interaction Cost
• Bottleneck analysis complicated by parallelism
• Parallelism causes interactions
  • Qualitative: parallel and serial interactions
  • Quantitative: interaction cost (icost)
• Icost case study: designing a deep pipeline
  • Exploiting serial interactions
Hardware profiler
• Icost “shotgun” profiler
• Overcome the limitations of performance counters

Profiling goal
Goal:
• Construct graph of many dynamic instructions
Constraint:
• Can only sample sparsely

Genome sequencing (analogy to the profiling goal)
Goal:
• Construct DNA strand (in place of the graph)
Constraint:
• Can only sample sparsely

“Shotgun” genome sequencing
[Figure: many short samples taken along a DNA strand]
• Find overlaps among samples

Mapping “shotgun” to our situation
[Figure: many dynamic instructions, each marked with one of: Icache miss, Dcache miss, Branch misp., No event]

Profiler hardware requirements
[Figure: samples of the instruction stream compared until two line up (“Match!”)]

Conclusion
Bottleneck analysis is complicated by parallelism
• Parallelism is interpreted with interaction cost (icost)
• Three possibilities: independent, parallel, or serial
Applies to all instructions, resources, events
Enabled by the “shotgun” profiler:
Interaction cost overcomes limitations of counters

Icost Case Study: Deep pipelines
[Figure: dependence graph for instructions i1-i6 with F, E, and C nodes and edge latencies; labels: Decode, rename; Icache miss; DL1 access; window edge; Multiply + pipe latency]

Profiler software requirements
Software puts the graph together
• Skeleton sample
• Detailed samples (with matching PC)
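One way to picture "software puts the graph together" is the sketch below. The slide gives only the two sample kinds and the matching-PC criterion, so everything else is an assumption for illustration: a skeleton sample is modeled as a list of PCs in execution order, a detailed sample as a dictionary holding a `pc` and a hypothetical `exec_latency` field, and the first detailed sample with a matching PC is chosen.

```python
from collections import defaultdict
from typing import Dict, List, Optional


def stitch(skeleton_pcs: List[int],
           detailed_samples: List[Dict[str, int]]) -> List[Optional[Dict[str, int]]]:
    """For each PC in the skeleton, attach a detailed sample with the same PC.

    Returns one entry per skeleton instruction; None where no detailed
    sample with a matching PC has been collected yet.
    """
    by_pc: Dict[int, List[Dict[str, int]]] = defaultdict(list)
    for sample in detailed_samples:
        by_pc[sample["pc"]].append(sample)

    stitched: List[Optional[Dict[str, int]]] = []
    for pc in skeleton_pcs:
        candidates = by_pc.get(pc)
        # Assumption: take the first match; a real tool might pick more carefully.
        stitched.append(candidates[0] if candidates else None)
    return stitched


# Hypothetical data: a three-instruction skeleton and two detailed samples.
skeleton = [0x10, 0x14, 0x18]
detailed = [{"pc": 0x14, "exec_latency": 3},
            {"pc": 0x10, "exec_latency": 1}]

fragment = stitch(skeleton, detailed)
print(fragment)
```

Entries left as None mark instructions for which more detailed samples are needed before that part of the graph can be filled in.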

Compare Icost and Sensitivity Study
Corollary to DL1 and ROB serial interaction: As load latency increases, the benefit from enlarging the ROB increases.
[Figure: dependence graph for instructions i1-i6 with F, E, and C nodes and edge latencies; label: DL1 access]

Compare Icost and Sensitivity Study

Compare Icost and Sensitivity Study
Sensitivity Study Advantages
• More information
  • e.g., concave or convex curves
Interaction Cost Advantages
• Easy (automatic) interpretation
  • Sign and magnitude have well defined meanings
• Concise communication
  • DL1 and ROB interact serially
