This presentation introduces interaction cost (icost) and applies it to the analysis of microarchitectural bottlenecks in processors that overlap many events in parallel. It presents a case study on designing a deep pipeline, explains why serial interactions are worth exploiting, and describes a hardware "shotgun" profiler that constructs a graph of dynamic instructions to overcome the limitations of traditional performance counters.
Using Interaction Cost (icost) for Microarchitectural Bottleneck Analysis
Brian Fields (1), Rastislav Bodik (1), Mark Hill (2), Chris Newburn (3)
(1) UC-Berkeley, (2) UW-Madison, (3) Intel

Outline
Interaction Cost
• Bottleneck analysis complicated by parallelism
• Parallelism causes interactions
  • Qualitative: parallel and serial interactions
  • Quantitative: interaction cost (icost)
• Icost case study: designing a deep pipeline
Hardware profiler
• Icost “shotgun” profiler
• Replace current performance counters

Bottleneck analysis is hard
Why? Microarchitectural parallelism complicates performance understanding
• Two parallel cache misses
• A multiply and window stall
• A branch mispredict and full-store-buffer stall occur in the same cycle that three loads are waiting on the memory system and two floating-point multiplies are executing

What we want from bottleneck analysis
• Performance cost (or reward): speedup when the bottleneck is removed
Q: What if two bottlenecks interact?

Our solution: measure interactions
Two parallel cache misses (each 100 cycles): miss #1 (100) overlaps miss #2 (100)
• Cost(miss #1) = 0
• Cost(miss #2) = 0
• Cost({miss #1, miss #2}) = 100
Aggregate cost > sum of individual costs (100 > 0 + 0): parallel interaction
icost = aggregate cost – sum of individual costs = 100 – 0 – 0 = 100

Interaction cost (icost)
icost = aggregate cost – sum of individual costs
• Positive icost: parallel interaction (miss #1 overlaps miss #2)
• Zero icost: ?

Interaction cost (icost)
icost = aggregate cost – sum of individual costs
• Positive icost: parallel interaction (miss #1 overlaps miss #2)
• Zero icost: independent (miss #1 . . . miss #2, far apart)
• Negative icost: ?

Negative icost
Two serial cache misses (data dependent): miss #1 (100) then miss #2 (100), in parallel with ALU latency (110 cycles)
Cost(miss #1) = ?

Negative icost
Two serial cache misses (data dependent): miss #1 (100) then miss #2 (100), in parallel with ALU latency (110 cycles)
• Cost(miss #1) = 90
• Cost(miss #2) = 90
• Cost({miss #1, miss #2}) = 90
icost = aggregate cost – sum of individual costs = 90 – 90 – 90 = -90
Negative icost: serial interaction

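The two worked examples (parallel misses and serial misses) can be checked with a small sketch. It assumes a toy timing model that is not taken from the slides: execution time is the longest of several parallel paths, and each path is a chain of named latencies. The names `exec_time`, `cost`, and `icost` are illustrative only.

```python
def exec_time(paths, idealized=frozenset()):
    """Toy model (assumption): execution time is the longest parallel path.

    `paths` is a list of paths; each path is a list of (event_name, latency)
    pairs that execute back to back. Events in `idealized` take 0 cycles.
    """
    return max(
        sum(0 if name in idealized else latency for name, latency in path)
        for path in paths
    )


def cost(paths, events):
    """Cycles saved when every event in `events` is idealized."""
    return exec_time(paths) - exec_time(paths, frozenset(events))


def icost(paths, a, b):
    """Interaction cost: aggregate cost minus the sum of individual costs."""
    return cost(paths, {a, b}) - cost(paths, {a}) - cost(paths, {b})


# Two overlapping 100-cycle misses.
parallel = [[("miss1", 100)], [("miss2", 100)]]
# Two data-dependent 100-cycle misses alongside 110 cycles of ALU latency.
serial = [[("miss1", 100), ("miss2", 100)], [("alu", 110)]]

print(icost(parallel, "miss1", "miss2"))  # 100
print(icost(serial, "miss1", "miss2"))    # -90
```

In the serial case each miss alone saves 90 cycles, but removing both still saves only 90 because the 110-cycle ALU path then limits execution time.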
Interaction cost (icost)
icost = aggregate cost – sum of individual costs
• Positive icost: parallel interaction (miss #1 overlaps miss #2)
• Zero icost: independent (miss #1 . . . miss #2, far apart)
• Negative icost: serial interaction (miss #1 then miss #2, alongside ALU latency)
[Figure labels for other interacting events: Fetch BW, LSQ stall, Branch mispredict, Load-Replay Trap]

Why care about serial interactions?
Example: miss #1 (100) then miss #2 (100), in parallel with ALU latency (110 cycles)
Reason #1: We are over-optimizing!
• Prefetching miss #2 doesn’t help if miss #1 is already prefetched (but the overhead still costs us)
Reason #2: We have a choice of what to optimize
• Prefetching miss #2 has the same effect as prefetching miss #1

Icost Case Study: Deep pipelines
Deep pipelines cause long latency loops:
• level-one (DL1) cache access, issue-wakeup, branch misprediction, …
But we can often mitigate them indirectly
Assume 4-cycle DL1 access; how to mitigate?
• Increase cache ports?
• Increase window size?
• Increase fetch BW?
• Reduce cache misses?
Really, we are looking for serial interactions!

Icost Case Study: Deep pipelines
[Figure, shown across several animation steps: dependence graph of dynamic instructions i1 to i6, each with F, E, and C nodes joined by weighted edges; the DL1 access edge and the window edge are labeled. Edge weights are not recoverable from the extraction.]

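To make the graph-based analysis concrete, here is a minimal sketch of computing costs on a dependence graph. It assumes that execution time is modeled as the longest path through the graph and that idealizing a bottleneck means setting its edge latency to zero. The two-instruction graph and every edge weight below are hypothetical, chosen for illustration rather than read from the figure.

```python
from collections import defaultdict


def critical_path_length(edges, idealized=frozenset()):
    """Longest path through a DAG given as {(src, dst): latency}.

    Edges listed in `idealized` are treated as taking 0 cycles.
    """
    succs = defaultdict(list)
    indegree = defaultdict(int)
    nodes = set()
    for (src, dst), latency in edges.items():
        weight = 0 if (src, dst) in idealized else latency
        succs[src].append((dst, weight))
        indegree[dst] += 1
        nodes.update((src, dst))

    # Kahn's algorithm: relax outgoing edges in topological order.
    dist = {n: 0 for n in nodes}
    ready = [n for n in nodes if indegree[n] == 0]
    while ready:
        node = ready.pop()
        for dst, weight in succs[node]:
            dist[dst] = max(dist[dst], dist[node] + weight)
            indegree[dst] -= 1
            if indegree[dst] == 0:
                ready.append(dst)
    return max(dist.values())


def edge_cost(edges, idealized):
    """Cycles saved on the critical path when `idealized` edges cost 0."""
    return critical_path_length(edges) - critical_path_length(edges, frozenset(idealized))


def edge_icost(edges, a, b):
    """Interaction cost of two edges."""
    return edge_cost(edges, {a, b}) - edge_cost(edges, {a}) - edge_cost(edges, {b})


# Hypothetical graph: i2 consumes the value loaded by i1.
graph = {
    ("F1", "E1"): 5,  # front-end latency of i1
    ("F1", "F2"): 1,  # in-order fetch
    ("F2", "E2"): 5,  # front-end latency of i2
    ("E1", "E2"): 4,  # DL1 access: i2 waits for i1's load
    ("E1", "C1"): 1,
    ("E2", "C2"): 1,
    ("C1", "C2"): 1,  # in-order commit
}

dl1 = ("E1", "E2")
front_end = ("F1", "E1")
print(critical_path_length(graph))        # 10
print(edge_cost(graph, {dl1}))            # 3
print(edge_icost(graph, dl1, front_end))  # -3
```

In this toy graph the DL1 edge and the front-end edge of i1 lie on the same path, so they interact serially: shortening either one saves the same 3 cycles, and shortening both saves no more.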
Outline
Interaction Cost
• Bottleneck analysis complicated by parallelism
• Parallelism causes interactions
  • Qualitative: parallel and serial interactions
  • Quantitative: interaction cost (icost)
• Icost case study: designing a deep pipeline
  • Exploiting serial interactions
Hardware profiler
• Icost “shotgun” profiler
• Overcome the limitations of performance counters

Profiling goal
Goal:
• Construct the graph of many dynamic instructions
Constraint:
• Can only sample sparsely

Genome sequencing (analogy to the profiling goal)
Goal:
• Construct the DNA strand
Constraint:
• Can only sample sparsely

“Shotgun” genome sequencing
[Figure: short samples taken from the DNA strand]
• Find overlaps among samples

Mapping “shotgun” to our situation
• The sequence to reconstruct: many dynamic instructions
• Figure legend (per-instruction events): Icache miss, Dcache miss, Branch misp., No event

Profiler hardware requirements
[Figure: sampled instruction sequences; one comparison is labeled “Match!”]

Conclusion
Bottleneck analysis is complicated by parallelism
• Parallelism is interpreted with interaction cost (icost)
• Three possibilities: independent, parallel, or serial
• Applies to all instructions, resources, events
Enabled by the “shotgun” profiler: interaction cost overcomes limitations of counters

Icost Case Study: Deep pipelines
[Figure: dependence graph of dynamic instructions i1 to i6 with F, E, and C nodes. Labeled edges: decode/rename, Icache miss, DL1 access, multiply + pipe latency, and the window edge. Edge weights are not recoverable from the extraction.]

Profiler software requirements
Software puts the graph together from:
• Skeleton sample
• Detailed samples (with matching PC)

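The assembly step can be sketched roughly as follows. The data layout is an assumption made for illustration, not the profiler's actual format: the skeleton is taken to be a list of `(pc, event)` pairs for consecutive dynamic instructions, and each detailed sample is a dict holding a `pc`, a `context` (the events of the previous, current, and next instruction), and measured `latencies`. The function names are hypothetical.

```python
def context_at(skeleton, index):
    """(previous, current, next) events around `index`; None past either end."""
    def event(i):
        return skeleton[i][1] if 0 <= i < len(skeleton) else None
    return (event(index - 1), event(index), event(index + 1))


def assemble(skeleton, detailed_samples):
    """Attach to each skeleton slot a detailed sample with equal PC and context.

    Returns one entry per skeleton instruction: the matching sample's
    latencies, or None when no detailed sample matches.
    """
    by_key = {}
    for sample in detailed_samples:
        # Keep the first sample seen for each (pc, context) key.
        by_key.setdefault((sample["pc"], sample["context"]), sample)

    graph = []
    for index, (pc, _event) in enumerate(skeleton):
        match = by_key.get((pc, context_at(skeleton, index)))
        graph.append(match["latencies"] if match else None)
    return graph


# Hypothetical data: the instruction at PC 0x44 executes twice,
# once with a data-cache miss and once without.
skeleton = [
    (0x40, "none"),
    (0x44, "dcache_miss"),
    (0x48, "none"),
    (0x44, "none"),
]
detailed = [
    {"pc": 0x44, "context": ("none", "dcache_miss", "none"),
     "latencies": {"execute": 100}},
    {"pc": 0x44, "context": ("none", "none", None),
     "latencies": {"execute": 4}},
]

graph = assemble(skeleton, detailed)
# graph == [None, {"execute": 100}, None, {"execute": 4}]
```

Matching on the surrounding events as well as the PC is what lets the two executions of the same static instruction pick up different detailed samples.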
Compare Icost and Sensitivity Study
[Figure: dependence graph of dynamic instructions i1 to i6 with F, E, and C nodes and the DL1 access edge labeled. Edge weights are not recoverable from the extraction.]
Corollary to the DL1 and ROB serial interaction: as load latency increases, the benefit from enlarging the ROB increases.

Compare Icost and Sensitivity Study
Sensitivity Study Advantages
• More information
  • e.g., concave or convex curves
Interaction Cost Advantages
• Easy (automatic) interpretation
  • Sign and magnitude have well defined meanings
• Concise communication
  • DL1 and ROB interact serially

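One way to relate the two approaches is to write icost in terms of measured execution times. The notation here is introduced for illustration and does not appear in the slides: $t(S)$ is the execution time with the events in $S$ idealized, and $t(\emptyset)$ is the baseline.

```latex
\begin{align*}
\mathrm{cost}(S) &= t(\emptyset) - t(S) \\
\mathrm{icost}(\{a,b\}) &= \mathrm{cost}(\{a,b\}) - \mathrm{cost}(\{a\}) - \mathrm{cost}(\{b\}) \\
  &= t(\{a\}) + t(\{b\}) - t(\emptyset) - t(\{a,b\})
\end{align*}
```

Under this reading, a pairwise icost condenses four execution-time measurements into one signed number. For the serial cache-miss example, $110 + 110 - 200 - 110 = -90$, which agrees with the value computed on the slide.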