This talk discusses the importance of bottleneck analysis and its applications in areas such as run-time optimization, effective speculation, dynamic reconfiguration, energy efficiency, design decisions, and programmer performance tuning. It also explores the challenges of bottleneck analysis and introduces the concept of criticality and its role in determining the performance effect of events on execution time.
Using Criticality to Attack Performance Bottlenecks Brian Fields UC-Berkeley (Collaborators: Rastislav Bodik, Mark Hill, Chris Newburn)
Bottleneck Analysis: determining the performance effect of an event on execution time. An event could be: • an instruction’s execution • an instruction-window-full stall • a branch mispredict • a network request • inter-processor communication • etc.
Bottleneck Analysis Applications • Run-time Optimization • Resource arbitration • e.g., how to schedule memory accesses? • Effective speculation • e.g., which branches to predicate? • Dynamic reconfiguration • e.g., when to enable hyperthreading? • Energy efficiency • e.g., when to throttle frequency? • Design Decisions • Overcoming technology constraints • e.g., how to mitigate the effect of long wire latencies? • Programmer Performance Tuning • Where have the cycles gone? • e.g., which cache misses should be prefetched?
Current state of the art: event counts. Exe. time = (CPU cycles + Mem. cycles) * Clock cycle time, where: Mem. cycles = Number of cache misses * Miss penalty. Example: miss1 (100 cycles) and miss2 (100 cycles) overlap, so there are 2 misses but only 1 miss penalty, yet the formula charges a full penalty for each.
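To make the shortcoming concrete, here is a minimal arithmetic sketch of the event-count model; the cycle counts and the `event_count_model` helper are made-up illustrations, not numbers from the talk:

```python
# Hypothetical machine parameters for illustration only.
CLOCK_CYCLE_NS = 1.0      # assumed clock cycle time
CPU_CYCLES = 1000         # assumed busy (non-memory) cycles
MISS_PENALTY = 100        # assumed miss penalty in cycles

def event_count_model(num_misses: int) -> float:
    """Exe. time = (CPU cycles + Mem. cycles) * clock cycle time,
    with Mem. cycles = misses * miss penalty."""
    mem_cycles = num_misses * MISS_PENALTY
    return (CPU_CYCLES + mem_cycles) * CLOCK_CYCLE_NS

# Two fully overlapped misses: the machine actually stalls ~100 cycles,
# but the event-count model charges 200.
print(event_count_model(2))                  # model says 1200 ns
print((CPU_CYCLES + 100) * CLOCK_CYCLE_NS)   # overlapped reality: 1100 ns
```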
Parallelism in systems complicates performance understanding • Two parallel cache misses • Two parallel threads • A branch mispredict and full-store-buffer stall occur in the same cycle that three loads are waiting on the memory system and two floating-point multiplies are executing
Criticality Challenges • Cost • How much speedup possible from optimizing an event? • Slack • How much can an event be “slowed down” before increasing execution time? • Interactions • When do multiple events need to be optimized simultaneously? • When do we have a choice? • Exploit in Hardware
Our Approach: Criticality. Critical events affect execution time; non-critical events do not. Recall bottleneck analysis: determining the performance effect of an event on execution time.
Defining criticality: we need performance sensitivity • slowing down a “critical” event should slow down the entire program • speeding up a “noncritical” event should leave execution time unchanged
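This definition suggests a direct sensitivity test: inflate one event's latency and see whether execution time moves. A minimal sketch, where `run()` and the event names are hypothetical stand-ins for a real simulator:

```python
def is_critical(run, latencies, event, delta=10):
    """An event is critical if inflating its latency inflates run time."""
    base = run(latencies)
    perturbed = dict(latencies)
    perturbed[event] += delta
    return run(perturbed) > base

# Toy "machine": two parallel misses feeding one ALU op.
def run(lat):
    return max(lat["miss1"], lat["miss2"]) + lat["alu"]

lats = {"miss1": 100, "miss2": 60, "alu": 10}
print(is_critical(run, lats, "miss1"))  # True: slowing it slows the program
print(is_critical(run, lats, "miss2"))  # False: 40 cycles of slack absorb it
```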
Annotated with dependence edges. [Figure: example instruction stream annotated with dependence edges: fetch bandwidth, data dependences, ROB (window) stalls, and branch mispredictions (MISP).]
Edge weights added. [Figure: the same dependence graph with edge weights (latencies in cycles) added.]
Convert to graph: each instruction becomes three nodes, F (fetch), E (execute), and C (commit), connected by weighted dependence edges. [Figure: the resulting F/E/C graph with edge latencies.]
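A minimal sketch of this graph model, with an assumed instruction count, assumed edge weights, and one hypothetical data dependence; execution time is the longest path through the DAG:

```python
from collections import defaultdict

edges = defaultdict(list)   # node -> [(successor, weight)]

def add(u, v, w):
    edges[u].append((v, w))

N = 5  # instructions i0..i4
for i in range(N):
    add(("F", i), ("E", i), 1)          # fetch -> execute
    add(("E", i), ("C", i), 1)          # execute -> commit
    if i + 1 < N:
        add(("F", i), ("F", i + 1), 1)  # in-order fetch bandwidth
        add(("C", i), ("C", i + 1), 1)  # in-order commit
add(("E", 1), ("E", 3), 2)              # assumed data dependence i1 -> i3

def longest_path(edges, source):
    """Longest distance from source to every node in a DAG."""
    dist = defaultdict(lambda: float("-inf"))
    order, seen = [], set()
    def dfs(u):                          # DFS postorder = reverse topo order
        if u in seen: return
        seen.add(u)
        for v, _ in edges[u]: dfs(v)
        order.append(u)
    dfs(source)
    dist[source] = 0
    for u in reversed(order):            # relax edges in topological order
        for v, w in edges[u]:
            dist[v] = max(dist[v], dist[u] + w)
    return dist

print(longest_path(edges, ("F", 0))[("C", N - 1)])  # critical path: 6 cycles
```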
Smaller graph instance: the icache miss is critical, but how costly? The other edges are non-critical, but how much slack? [Figure: five-instruction F/E/C graph dominated by a 10-cycle icache-miss edge.]
Add “hidden” constraints. [Figure: the same graph with additional constraint edges made explicit.]
With the hidden constraints in place: Cost of the critical icache miss = 13 – 7 = 6 cycles. Slack of the non-critical edge = 13 – 7 = 6 cycles.
Slack “sharing”: two edges each show slack = 6 cycles, but we can delay one edge by 6 cycles, not both!
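A minimal sketch of per-edge cost and slack on a toy DAG (the nodes and weights are invented): cost is the cycles saved if an edge took zero time; slack is how much an edge can grow before the critical path does. Note how the two edges on the short path each report 7 cycles of slack, yet delaying both by 7 would lengthen the program, which is exactly the sharing problem above:

```python
NODES = ["s", "a", "b", "t"]            # already in topological order
EDGES = {("s", "a"): 10, ("s", "b"): 1, ("a", "t"): 1, ("b", "t"): 3}

def fwd(weights):
    """Longest distance from source s to every node."""
    d = {n: float("-inf") for n in NODES}; d["s"] = 0
    for u in NODES:
        for (x, y), w in weights.items():
            if x == u and d[u] > float("-inf"):
                d[y] = max(d[y], d[u] + w)
    return d

def bwd(weights):
    """Longest distance from every node to sink t."""
    d = {n: float("-inf") for n in NODES}; d["t"] = 0
    for u in reversed(NODES):
        for (x, y), w in weights.items():
            if y == u and d[u] > float("-inf"):
                d[x] = max(d[x], d[u] + w)
    return d

cp = fwd(EDGES)["t"]                      # critical-path length (11)
f_dist, b_dist = fwd(EDGES), bwd(EDGES)
for (u, v), w in EDGES.items():
    zeroed = dict(EDGES); zeroed[(u, v)] = 0
    cost = cp - fwd(zeroed)["t"]                  # speedup if edge took 0
    slack = cp - (f_dist[u] + w + b_dist[v])      # delay the edge tolerates
    print((u, v), "cost =", cost, "slack =", slack)
```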
Machine imbalance: ~80% of instructions have at least 5 cycles of apportioned slack. [Figure: distribution of global vs. apportioned slack.]
Criticality Challenges • Cost • How much speedup possible from optimizing an event? • Slack • How much can an event be “slowed down” before increasing execution time? • Interactions • When do multiple events need to be optimized simultaneously? • When do we have a choice? • Exploit in Hardware
Simple criticality is not always enough. Sometimes events have nearly equal criticality: miss #1 (99 cycles) vs. miss #2 (100 cycles). We want to know • how critical is each event? • how far from critical is each event? Actually, even that is not enough.
Our solution: measure interactions. Two parallel cache misses: miss #1 (99 cycles) and miss #2 (100 cycles). Cost(miss #1) = 0, Cost(miss #2) = 1, Cost({miss #1, miss #2}) = 100. Aggregate cost > sum of individual costs: a parallel interaction. icost = aggregate cost – sum of individual costs = 100 – 0 – 1 = 99
Interaction cost (icost): icost = aggregate cost – sum of individual costs • Positive icost: parallel interaction (e.g., the two parallel misses) • Zero icost: independent events • Negative icost: ?
Negative icost: two serial cache misses (data dependent), miss #1 (100 cycles) feeding miss #2 (100 cycles), in parallel with a 110-cycle ALU chain. Cost(miss #1) = 90, Cost(miss #2) = 90, Cost({miss #1, miss #2}) = 90. icost = aggregate cost – sum of individual costs = 90 – 90 – 90 = –90. Negative icost: a serial interaction.
Interaction cost (icost) summary: icost = aggregate cost – sum of individual costs • Positive icost: parallel interaction • Zero icost: independent • Negative icost: serial interaction. [Figure: interacting event classes include branch mispredicts, load-replay traps, fetch bandwidth, LSQ stalls, and ALU latency.]
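A minimal sketch reproducing both numeric examples above; the `parallel` and `serial` timing models are toy stand-ins for the full graph analysis:

```python
def cost(exec_time, events_removed, base):
    """Cycles saved when every event in events_removed takes zero time."""
    return base - exec_time(events_removed)

# Parallel case: two independent misses race; total = max of the two.
def parallel(removed):
    m1 = 0 if "m1" in removed else 99
    m2 = 0 if "m2" in removed else 100
    return max(m1, m2)

# Serial case: two dependent misses (100 + 100) in parallel with a
# 110-cycle ALU chain; total = max(miss chain, ALU path).
def serial(removed):
    m1 = 0 if "m1" in removed else 100
    m2 = 0 if "m2" in removed else 100
    return max(m1 + m2, 110)

for name, model in [("parallel", parallel), ("serial", serial)]:
    base = model(set())
    c1 = cost(model, {"m1"}, base)
    c2 = cost(model, {"m2"}, base)
    c12 = cost(model, {"m1", "m2"}, base)
    print(name, "icost =", c12 - c1 - c2)   # parallel: +99, serial: -90
```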
Why care about serial interactions? (Recall: miss #1 (100), miss #2 (100), ALU latency (110 cycles).) Reason #1: we are over-optimizing! Prefetching miss #2 doesn’t help if miss #1 is already prefetched (but the overhead still costs us). Reason #2: we have a choice of what to optimize. Prefetching miss #2 has the same effect as prefetching miss #1.
Icost Case Study: deep pipelines and the level-one data cache (DL1) access. Looking for serial interactions!
[Figure: F/E/C dependence graph for instructions i1–i6, with DL1 access latencies and a window edge; used to locate serial interactions with the DL1 access.]
Criticality Challenges • Cost • How much speedup possible from optimizing an event? • Slack • How much can an event be “slowed down” before increasing execution time? • Interactions • When do multiple events need to be optimized simultaneously? • When do we have a choice? • Exploit in Hardware
Exploit in Hardware • Criticality Analyzer • online, fast feedback • limited to a critical/not-critical verdict • Replacement for Performance Counters • requires offline analysis • constructs the entire graph
Only last-arriving edges can be critical. Example: R1 ← R2 + R3. Observation: if the dependence edge into E via R2 is on the critical path, then the value of R2 arrived last. A dependence that resolves early (here R3) arrives with slack and cannot be critical.
Determining last-arrive edges: observe events within the machine. • last_arrive[F] = E→F if branch mispredicted; C→F if ROB stall; F→F otherwise • last_arrive[E] = F→E if data ready on fetch; E→E otherwise (observe arrival order of operands) • last_arrive[C] = E→C if commit pointer is delayed; C→C otherwise
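A minimal sketch of these selection rules, assuming a hypothetical per-instruction record of observable machine events; the field names are invented:

```python
def last_arrive_F(instr):
    if instr["prev_branch_mispredicted"]: return "E->F"  # redirected fetch
    if instr["rob_stall"]:                return "C->F"  # window full
    return "F->F"                                        # in-order fetch BW

def last_arrive_E(instr):
    if instr["data_ready_on_fetch"]:      return "F->E"  # no operand wait
    return "E->E"   # the operand that arrived last names the producer edge

def last_arrive_C(instr):
    if instr["commit_pointer_delayed"]:   return "E->C"  # waits on execute
    return "C->C"                                        # in-order commit

instr = {"prev_branch_mispredicted": False, "rob_stall": True,
         "data_ready_on_fetch": False, "commit_pointer_delayed": False}
print(last_arrive_F(instr), last_arrive_E(instr), last_arrive_C(instr))
```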
The last-arrive rule: the critical path (CP) consists only of “last-arrive” edges.
Prune the graph: only last-arrive edges need to go in the graph; no other edge could be on the CP.
…and we’ve found the critical path! Backward-propagate from the newest node along last-arrive edges. • Found the CP by only observing last-arrive edges • but this still requires constructing the entire graph
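A minimal sketch of the backward walk, assuming a `last_arrive` map from each node to the unique predecessor whose edge arrived last (the trace below is made up):

```python
# last_arrive[node] = the predecessor whose edge arrived last
last_arrive = {("C", 3): ("E", 3), ("E", 3): ("E", 1),
               ("E", 1): ("F", 1), ("F", 1): ("F", 0)}  # assumed trace

def critical_path(last_arrive, sink):
    """Walk backward from the newest commit node; visited nodes = CP."""
    path, node = [sink], sink
    while node in last_arrive:
        node = last_arrive[node]
        path.append(node)
    return list(reversed(path))

print(critical_path(last_arrive, ("C", 3)))
# [('F', 0), ('F', 1), ('E', 1), ('E', 3), ('C', 3)]
```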
Step 2: Reducing storage requirements. The CP is a “long” chain of last-arrive edges: the longer a given chain of last-arrive edges, the more likely it is part of the CP. Algorithm: find sufficiently long last-arrive chains • Plant a token into a node n • Propagate it forward, only along last-arrive edges • Check for the token after several hundred cycles • If the token is alive, n is assumed critical
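A minimal sketch of this token-passing heuristic, assuming a hypothetical `succ` map giving, for each node, the successors reached via a last-arrive edge:

```python
succ = {("F", 0): {("F", 1)}, ("F", 1): {("E", 1)},
        ("E", 1): {("E", 3)}, ("E", 3): {("C", 3)}, ("C", 3): set(),
        ("E", 0): set()}        # assumed last-arrive successor map

def token_alive(succ, start, horizon):
    """Plant a token at start and push it forward one last-arrive hop per
    step; if it survives `horizon` steps, the chain is long enough."""
    frontier = {start}
    for _ in range(horizon):
        frontier = {s for n in frontier for s in succ.get(n, set())}
        if not frontier:
            return False        # token dropped: start deemed non-critical
    return True                 # long last-arrive chain: assume critical

print(token_alive(succ, ("F", 0), 4))   # True: chain survives 4 steps
print(token_alive(succ, ("E", 0), 4))   # False: no successor carries it
```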