Using Interaction Cost (icost) for Microarchitectural Bottleneck Analysis

This presentation introduces interaction cost (icost) and applies it to microarchitectural bottleneck analysis, where parallelism inside the processor makes the cost of an individual bottleneck hard to interpret. It presents a case study on designing a deep pipeline, in which serial interactions are exploited to mitigate long-latency loops, and describes a hardware “shotgun” profiler that overcomes the limitations of traditional performance counters by constructing a graph of many dynamic instructions from sparse samples.


Presentation Transcript


  1. Using Interaction Cost (icost) for Microarchitectural Bottleneck Analysis
     Brian Fields (UC-Berkeley), Rastislav Bodik (UC-Berkeley), Mark Hill (UW-Madison), Chris Newburn (Intel)

  2. Outline
     Interaction Cost
     • Bottleneck analysis complicated by parallelism
     • Parallelism causes interactions
       • Qualitative: parallel and serial interactions
       • Quantitative: interaction cost (icost)
     • Icost case study: designing a deep pipeline
     Hardware profiler
     • Icost “shotgun” profiler
     • Replace current performance counters

  3. Bottleneck analysis is hard
     Why? Microarchitectural parallelism complicates performance understanding:
     • Two parallel cache misses
     • A multiply and window stall
     • A branch mispredict and full-store-buffer stall occur in the same cycle that three loads are waiting on the memory system and two floating-point multiplies are executing

  4. What we want from bottleneck analysis
     • Performance cost (or reward)
       • speedup when the bottleneck is removed
     Q: What if two bottlenecks interact?

  5. Our solution: measure interactions
     Two parallel cache misses (each 100 cycles): miss #1 (100), miss #2 (100)
     Cost(miss #1) = 0
     Cost(miss #2) = 0
     Cost({miss #1, miss #2}) = 100
     Aggregate cost > sum of individual costs (100 > 0 + 0): parallel interaction
     icost = aggregate cost - sum of individual costs = 100 - 0 - 0 = 100
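The arithmetic on this slide can be reproduced with a minimal sketch. It assumes a toy timing model, not taken from the slides, in which two independent misses overlap completely so that the slower one sets the execution time; `exec_time` and `cost` are hypothetical helper names.

```python
def exec_time(miss1_latency, miss2_latency):
    """Toy model (assumption): two overlapping misses; the slower one sets the time."""
    return max(miss1_latency, miss2_latency)


def cost(base_time, idealized_time):
    """Cost of a bottleneck = cycles saved when it is removed (idealized to 0 cycles)."""
    return base_time - idealized_time


base = exec_time(100, 100)
cost_miss1 = cost(base, exec_time(0, 100))  # remove only miss #1
cost_miss2 = cost(base, exec_time(100, 0))  # remove only miss #2
cost_both = cost(base, exec_time(0, 0))     # remove both misses together
icost = cost_both - cost_miss1 - cost_miss2

print(cost_miss1, cost_miss2, cost_both, icost)  # 0 0 100 100
```

Removing either miss alone saves nothing because the other miss still covers the same 100 cycles, which is why the whole saving shows up only in the aggregate cost.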

  6. Interaction cost (icost)
     icost = aggregate cost - sum of individual costs
     • Positive icost → parallel interaction [diagram: miss #1, miss #2]
     • Zero icost → ?

  7. Interaction cost (icost)
     icost = aggregate cost - sum of individual costs
     • Positive icost → parallel interaction [diagram: miss #1, miss #2]
     • Zero icost → independent [diagram: miss #1 . . . miss #2]
     • Negative icost → ?

  8. Negative icost
     Two serial cache misses (data dependent): miss #1 (100), then miss #2 (100), alongside ALU latency (110 cycles)
     Cost(miss #1) = ?

  9. Negative icost
     Two serial cache misses (data dependent): miss #1 (100), then miss #2 (100), alongside ALU latency (110 cycles)
     Cost(miss #1) = 90
     Cost(miss #2) = 90
     Cost({miss #1, miss #2}) = 90
     icost = aggregate cost - sum of individual costs = 90 - 90 - 90 = -90
     Negative icost → serial interaction
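The costs of 90 follow if the two dependent misses form one 200-cycle chain that runs alongside a separate 110-cycle ALU chain. The sketch below assumes exactly that toy model (the slide text gives only the latencies and the resulting costs); `exec_time` is a hypothetical helper.

```python
ALU_CHAIN = 110  # cycles of ALU latency, as given on the slide


def exec_time(miss1_latency, miss2_latency):
    """Toy model (assumption): the dependent misses add up, and the ALU chain overlaps them."""
    return max(miss1_latency + miss2_latency, ALU_CHAIN)


base = exec_time(100, 100)                  # 200: the miss chain is critical
cost_miss1 = base - exec_time(0, 100)       # ALU chain becomes critical: 200 - 110
cost_miss2 = base - exec_time(100, 0)
cost_both = base - exec_time(0, 0)          # still bounded by the ALU chain
icost = cost_both - cost_miss1 - cost_miss2

print(cost_miss1, cost_miss2, cost_both, icost)  # 90 90 90 -90
```

Once one miss is removed, the ALU chain becomes the critical path, so removing the second miss as well buys nothing more. That shared saving is what the negative icost records.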

  10. Interaction cost (icost)
      icost = aggregate cost - sum of individual costs
      • Positive icost → parallel interaction [diagram: miss #1, miss #2]
      • Zero icost → independent [diagram: miss #1 . . . miss #2]
      • Negative icost → serial interaction [diagram: miss #1, miss #2, ALU latency]
      Other events labeled on the slide: branch mispredict, load-replay trap, fetch BW, LSQ stall
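The three cases can be summarized as a small classifier over measured costs. This is an illustrative sketch for a pair of events only; the function names are hypothetical, and the third set of numbers is made up to show the zero case.

```python
def icost(cost_a, cost_b, cost_both):
    """Pairwise interaction cost: aggregate cost minus the sum of individual costs."""
    return cost_both - cost_a - cost_b


def classify(value):
    """Map the sign of an icost value to the interaction type named on the slide."""
    if value > 0:
        return "parallel"
    if value < 0:
        return "serial"
    return "independent"


print(classify(icost(0, 0, 100)))      # parallel (two overlapping misses)
print(classify(icost(90, 90, 90)))     # serial (dependent misses beside an ALU chain)
print(classify(icost(100, 100, 200)))  # independent (hypothetical numbers)
```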

  11. Why care about serial interactions?
      [diagram: miss #1 (100), miss #2 (100), ALU latency (110 cycles)]
      Reason #1: We are over-optimizing! Prefetching miss #2 doesn’t help if miss #1 is already prefetched (but the overhead still costs us)
      Reason #2: We have a choice of what to optimize. Prefetching miss #2 has the same effect as prefetching miss #1

  12. Icost Case Study: Deep pipelines
      Deep pipelines cause long latency loops:
      • level-one (DL1) cache access, issue-wakeup, branch misprediction, …
      But can often mitigate them indirectly
      Assume 4-cycle DL1 access; how to mitigate?
      • Increase cache ports?
      • Increase window size?
      • Increase fetch BW?
      • Reduce cache misses?
      Really, looking for serial interactions!

  13. Icost Case Study: Deep pipelines [Figure: dependence graph for instructions i1 to i6, drawn as a row of F nodes, a row of E nodes, and a row of C nodes; edges carry latencies, and the DL1 access edge and the window edge are labeled]
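A graph like the one on this slide can be used to compute costs: execution time is the longest path, and a bottleneck's cost is the time saved when its edges are idealized. The sketch below is a simplified illustration, not the slide's graph: the four-instruction graph, its latencies, the two-entry window, and the edge labels are all made-up assumptions.

```python
from functools import lru_cache

# (source, destination, latency, label); a hypothetical 4-instruction graph.
EDGES = [
    ("F1", "E1", 1, "pipe"), ("E1", "C1", 4, "dl1"),
    ("F2", "E2", 1, "pipe"), ("E2", "C2", 1, "alu"),
    ("F3", "E3", 1, "pipe"), ("E3", "C3", 4, "dl1"),
    ("F4", "E4", 1, "pipe"), ("E4", "C4", 1, "alu"),
    ("F1", "F2", 0, "fetch"), ("F2", "F3", 0, "fetch"), ("F3", "F4", 0, "fetch"),
    ("C1", "C2", 0, "commit"), ("C2", "C3", 0, "commit"), ("C3", "C4", 0, "commit"),
    ("E1", "E2", 4, "dl1"),     # i2 consumes the value loaded by i1
    ("E3", "E4", 4, "dl1"),     # i4 consumes the value loaded by i3
    ("C1", "F3", 0, "window"),  # 2-entry window: i3 waits for i1 to commit
    ("C2", "F4", 0, "window"),
]


def exec_time(idealized=frozenset()):
    """Longest path to the last commit node, with the given edge labels idealized."""
    preds = {}
    for src, dst, latency, label in EDGES:
        if label in idealized:
            if label == "window":
                continue        # an unbounded window removes the constraint
            latency = 0         # other bottlenecks are idealized to zero latency
        preds.setdefault(dst, []).append((src, latency))

    @lru_cache(maxsize=None)
    def ready(node):
        return max((ready(s) + lat for s, lat in preds.get(node, [])), default=0)

    return ready("C4")


def cost(*labels):
    return exec_time() - exec_time(frozenset(labels))


icost = cost("dl1", "window") - cost("dl1") - cost("window")
print(exec_time(), cost("dl1"), cost("window"), cost("dl1", "window"), icost)
# 11 7 5 9 -3
```

In this toy graph the DL1 latency and the window constraint lie on the same critical path, so their icost is negative, the kind of serial interaction the case study looks for.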

  20. Icost Breakdown (6 wide, 64-entry window)

  24. Vortex Breakdowns, enlarging the window

  26. Outline
      Interaction Cost
      • Bottleneck analysis complicated by parallelism
      • Parallelism causes interactions
        • Qualitative: parallel and serial interactions
        • Quantitative: interaction cost (icost)
      • Icost case study: designing a deep pipeline
        • Exploiting serial interactions
      Hardware profiler
      • Icost “shotgun” profiler
      • Overcome the limitations of performance counters

  27. Profiling goal
      Goal:
      • Construct graph of many dynamic instructions
      Constraint:
      • Can only sample sparsely

  28. Profiling goal, by analogy with genome sequencing
      Goal:
      • Construct graph (genome sequencing: construct DNA strand)
      Constraint:
      • Can only sample sparsely

  29. “Shotgun” genome sequencing DNA

  31. “Shotgun” genome sequencing DNA . . . . . .

  32. “Shotgun” genome sequencing [diagram: DNA strand and sampled fragments] Find overlaps among samples
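Finding overlaps among samples can be illustrated with a greedy merge of short fragments. This is a simplified sketch of the general idea only, not the algorithm used by real sequencers or by the profiler; the fragment strings and function names are made up.

```python
def overlap(a, b):
    """Length of the longest suffix of `a` that is also a prefix of `b`."""
    for k in range(min(len(a), len(b)), 0, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def assemble(fragments):
    """Greedily merge the pair with the largest overlap until none overlap."""
    frags = list(fragments)
    while len(frags) > 1:
        k, i, j = max(
            (overlap(a, b), i, j)
            for i, a in enumerate(frags)
            for j, b in enumerate(frags)
            if i != j
        )
        if k == 0:
            break  # remaining fragments share no overlap
        merged = frags[i] + frags[j][k:]
        frags = [f for idx, f in enumerate(frags) if idx not in (i, j)] + [merged]
    return frags


print(assemble(["TACGGA", "ACGTAC", "GGATTC"]))  # ['ACGTACGGATTC']
```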

  33. Mapping “shotgun” to our situation [diagram: many dynamic instructions, with samples marked Icache miss, Dcache miss, Branch misp., or No event]

  34. Profiler hardware requirements

  35. Profiler hardware requirements [diagram: Match!]

  36. Conclusion
      Bottleneck analysis is complicated by parallelism
      • Parallelism is interpreted with interaction cost (icost)
      • Three possibilities: independent, parallel, or serial
      Applies to all instructions, resources, events
      Enabled by the “shotgun” profiler: Interaction cost overcomes limitations of counters

  37. Icost Case Study: Deep pipelines [Figure: dependence graph for instructions i1 to i6 (rows of F, E, and C nodes), with edges labeled decode/rename, Icache miss, DL1 access, window edge, and multiply + pipe latency]

  38. Profiler software requirements
      Software puts the graph together
      • Skeleton sample
      • Detailed samples (with matching PC)
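One way to picture this step: the skeleton sample gives a sequence of instruction PCs, and software fills each position with a detailed sample taken at the same PC. The sketch below assumes hypothetical record fields and simply takes the first matching sample; the slides do not specify the sample format or how ties between several matches are resolved.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class DetailedSample:
    """Hypothetical per-instruction record; field names are assumptions."""
    pc: int
    exec_latency: int
    event: str


def fill_skeleton(skeleton_pcs, detailed_samples):
    """Attach a detailed sample with a matching PC to each skeleton position."""
    by_pc = defaultdict(list)
    for sample in detailed_samples:
        by_pc[sample.pc].append(sample)
    fragment = []
    for pc in skeleton_pcs:
        matches = by_pc.get(pc)
        # No detailed sample for this PC yet: leave the position empty.
        fragment.append(matches[0] if matches else None)
    return fragment


skeleton = [0x400, 0x404, 0x408, 0x400]
detailed = [
    DetailedSample(0x404, 100, "dcache_miss"),
    DetailedSample(0x400, 1, "none"),
]
fragment = fill_skeleton(skeleton, detailed)
print([s.event if s else None for s in fragment])
# ['none', 'dcache_miss', None, 'none']
```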

  39. Compare Icost and Sensitivity Study [Figure: dependence graph for instructions i1 to i6 (rows of F, E, and C nodes) with the DL1 access edge labeled]
      Corollary to DL1 and ROB serial interaction: as load latency increases, the benefit from enlarging the ROB increases.
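The corollary can be illustrated with a tiny sensitivity sweep. The model below is a deliberately crude assumption, not the simulator behind the slides: loads within one window overlap fully, windows execute back to back, and the load count and window sizes are made up.

```python
import math


def exec_time(num_loads, load_latency, window):
    """Toy model (assumption): one load latency is paid per window of loads."""
    return math.ceil(num_loads / window) * load_latency


def window_benefit(load_latency, small=64, large=128, num_loads=256):
    """Cycles saved by enlarging the window from `small` to `large` entries."""
    return (exec_time(num_loads, load_latency, small)
            - exec_time(num_loads, load_latency, large))


for latency in (2, 4, 8):
    print(latency, window_benefit(latency))
# 2 4
# 4 8
# 8 16
```

In this toy model the saving from the larger window grows in proportion to the load latency, which is the trend the corollary describes.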

  40. Compare Icost and Sensitivity Study

  41. Compare Icost and Sensitivity Study
      Sensitivity Study Advantages
      • More information
        • e.g., concave or convex curves
      Interaction Cost Advantages
      • Easy (automatic) interpretation
        • Sign and magnitude have well defined meanings
      • Concise communication
        • DL1 and ROB interact serially
