Using Criticality to Attack Performance Bottlenecks
Brian Fields, UC-Berkeley
(Collaborators: Rastislav Bodik, Mark Hill, Chris Newburn)
Bottleneck Analysis
Bottleneck analysis: determining the performance effect of an event on execution time. An event could be:
• an instruction's execution
• an instruction-window-full stall
• a branch mispredict
• a network request
• inter-processor communication
• etc.
Bottleneck Analysis Applications
• Run-time Optimization
  • Resource arbitration: e.g., how to schedule memory accesses?
  • Effective speculation: e.g., which branches to predicate?
  • Dynamic reconfiguration: e.g., when to enable hyperthreading?
  • Energy efficiency: e.g., when to throttle frequency?
• Design Decisions
  • Overcoming technology constraints: e.g., how to mitigate the effect of long wire latencies?
• Programmer Performance Tuning
  • Where have the cycles gone? e.g., which cache misses should be prefetched?
Current state of the art
Event counts: Exe. time = (CPU cycles + Mem. cycles) * Clock cycle time
where: Mem. cycles = Number of cache misses * Miss penalty
[Figure: miss1 (100 cycles) and miss2 (100 cycles) overlap: 2 misses, but only 1 miss penalty]
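To see why the additive event-count model breaks under parallelism, here is a minimal arithmetic sketch; the constants are illustrative, not from the talk:

```python
# Sketch: the additive event-count model vs. overlapped misses.
# All constants here are illustrative assumptions.

CLOCK = 1.0          # cycle time (ns), assumed
CPU_CYCLES = 1000    # assumed compute cycles
MISS_PENALTY = 100   # cycles per cache miss, as on the slide

def exe_time(num_misses):
    """Additive model: Exe. time = (CPU cycles + Mem. cycles) * clock."""
    mem_cycles = num_misses * MISS_PENALTY
    return (CPU_CYCLES + mem_cycles) * CLOCK

print(exe_time(2))   # model charges 200 memory cycles...
# ...but if the two misses overlap completely, the machine only stalls
# for ~100 cycles: the additive model overestimates by a full penalty.
```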
Parallelism
Parallelism in systems complicates performance understanding:
• Two parallel cache misses
• Two parallel threads
• A branch mispredict and a full-store-buffer stall occur in the same cycle that three loads are waiting on the memory system and two floating-point multiplies are executing
Criticality Challenges
• Cost: how much speedup is possible from optimizing an event?
• Slack: how much can an event be "slowed down" before increasing execution time?
• Interactions: when do multiple events need to be optimized simultaneously? When do we have a choice?
• Exploit in Hardware
Our Approach: Criticality
Critical events affect execution time; non-critical events do not.
Defining criticality
We need performance sensitivity:
• slowing down a "critical" event should slow down the entire program
• speeding up a "non-critical" event should leave execution time unchanged
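As an illustration of this sensitivity definition, here is a minimal sketch, assuming execution time equals the longest path through a latency-weighted dependence DAG whose nodes are numbered in topological order; the encoding and function names are my own:

```python
def longest_path_length(edges, n_nodes):
    """Longest path in a DAG; edges are (u, v, weight) tuples with u < v
    because nodes 0..n_nodes-1 are numbered in topological order."""
    dist = [0] * n_nodes
    for u, v, w in sorted(edges):   # process sources in topological order
        dist[v] = max(dist[v], dist[u] + w)
    return max(dist)

def is_critical(edges, n_nodes, edge_idx, delta=1):
    """Sensitivity test: an edge is critical iff slowing it down by
    delta cycles lengthens the whole program."""
    base = longest_path_length(edges, n_nodes)
    u, v, w = edges[edge_idx]
    slowed = edges[:edge_idx] + [(u, v, w + delta)] + edges[edge_idx + 1:]
    return longest_path_length(slowed, n_nodes) > base
```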
Annotated with Dependence Edges
[Figure: instruction stream annotated with dependence edges: fetch bandwidth, data dependences, ROB stalls, branch mispredicts]
Edge Weights Added
[Figure: the same dependence edges annotated with latency weights]
Convert to Graph
[Figure: dependence graph with fetch (F), execute (E), and commit (C) nodes for each instruction, connected by latency-weighted edges]
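For concreteness, a sketch of how such an F/E/C graph might be encoded, assuming three nodes per instruction and illustrative unit latencies; the numbering scheme is mine:

```python
# Each instruction i gets three nodes: fetch, execute, commit.
# Numbering 3*i + offset keeps the node order topological.

def node(i, stage):                 # stage in {"F", "E", "C"}
    return 3 * i + {"F": 0, "E": 1, "C": 2}[stage]

N_INSTS = 5
edges = []
for i in range(N_INSTS):
    edges.append((node(i, "F"), node(i, "E"), 1))        # fetch -> execute
    edges.append((node(i, "E"), node(i, "C"), 1))        # execute -> commit
    if i > 0:
        edges.append((node(i - 1, "F"), node(i, "F"), 1))  # in-order fetch
        edges.append((node(i - 1, "C"), node(i, "C"), 1))  # in-order commit
# Data-dependence (E->E), branch-mispredict (E->F), and window-stall (C->F)
# edges would be appended the same way, with their observed latencies.
```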
Smaller graph instance
[Figure: five-instruction F/E/C graph containing a 10-cycle icache miss]
The icache miss is critical, but how costly? Another edge is non-critical, but how much slack does it have?
Add "hidden" constraints
[Figure: the same graph with additional edges for implicit machine constraints]
Add "hidden" constraints (cont.)
Cost = 13 - 7 = 6 cycles: removing the icache miss shortens the longest path from 13 to 7 cycles.
Slack = 13 - 7 = 6 cycles: the longest path through the non-critical edge is 7 cycles, so it can be delayed by 6 cycles before lengthening the 13-cycle critical path.
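Both quantities fall out of longest-path lengths. A self-contained sketch under the same topologically-numbered DAG model as the earlier sketches; this is one common formulation, and the helper names are mine:

```python
def forward_dist(edges, n):
    """Longest path from any source to each node (u < v on every edge)."""
    d = [0] * n
    for u, v, w in sorted(edges):
        d[v] = max(d[v], d[u] + w)
    return d

def backward_dist(edges, n):
    """Longest path from each node to any sink."""
    d = [0] * n
    for u, v, w in sorted(edges, reverse=True):
        d[u] = max(d[u], d[v] + w)
    return d

def cost(edges, n, i):
    """Cycles saved if event i's latency dropped to zero."""
    u, v, w = edges[i]
    ideal = edges[:i] + [(u, v, 0)] + edges[i + 1:]
    return max(forward_dist(edges, n)) - max(forward_dist(ideal, n))

def slack(edges, n, i):
    """Extra cycles event i can take before execution time grows."""
    u, v, w = edges[i]
    fwd, bwd = forward_dist(edges, n), backward_dist(edges, n)
    return max(fwd) - (fwd[u] + w + bwd[v])   # total minus path through edge
```

On the slide's graph, both evaluate to 13 - 7 = 6 cycles.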
Slack "sharing"
[Figure: two non-critical edges, each with slack = 6 cycles]
We can delay one edge by 6 cycles, but not both!
Machine Imbalance
[Plot: distribution of global vs. apportioned slack per instruction]
~80% of instructions have at least 5 cycles of apportioned slack.
Criticality Challenges
• Cost: how much speedup is possible from optimizing an event?
• Slack: how much can an event be "slowed down" before increasing execution time?
• Interactions: when do multiple events need to be optimized simultaneously? When do we have a choice?
• Exploit in Hardware
Simple criticality is not always enough
Sometimes events have nearly equal criticality, e.g., parallel misses of 99 and 100 cycles.
We want to know:
• how critical is each event?
• how far from critical is each event?
Actually, even that is not enough.
Our solution: measure interactions
Two parallel cache misses: miss #1 (99 cycles) and miss #2 (100 cycles).
Cost(miss #1) = 0
Cost(miss #2) = 1
Cost({miss #1, miss #2}) = 100
Aggregate cost (100) > sum of individual costs (0 + 1): a parallel interaction.
icost = aggregate cost - sum of individual costs = 100 - 0 - 1 = 99
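These numbers can be reproduced with a few lines; the tiny timing model is my own, with the 99- and 100-cycle latencies taken from the slide:

```python
def T(m1, m2):
    """Execution time of the toy program: the two misses fully overlap."""
    return max(m1, m2)

base = T(99, 100)                    # 100
cost1 = base - T(0, 100)             # 0: hiding miss #1 alone saves nothing
cost2 = base - T(99, 0)              # 1
cost_both = base - T(0, 0)           # 100: hiding both saves everything
icost = cost_both - cost1 - cost2    # 100 - 0 - 1 = 99: parallel interaction
print(cost1, cost2, cost_both, icost)
```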
Interaction cost (icost)
icost = aggregate cost - sum of individual costs
• Positive icost: parallel interaction (e.g., miss #1 alongside miss #2)
• Zero icost: independent events
• Negative icost: ?
Negative icost
Two serial cache misses (data dependent), 100 cycles each, overlapped with a 110-cycle ALU-latency chain.
Cost(miss #1) = ?
Negative icost (cont.)
Cost(miss #1) = 90
Cost(miss #2) = 90
Cost({miss #1, miss #2}) = 90
icost = aggregate cost - sum of individual costs = 90 - 90 - 90 = -90
Negative icost: a serial interaction.
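The same style of sketch reproduces the serial case; the timing model is mine, with latencies from the slide:

```python
def T(m1, m2):
    """Misses are serial (data dependent); a 110-cycle ALU chain runs in parallel."""
    return max(m1 + m2, 110)

base = T(100, 100)                   # 200
cost1 = base - T(0, 100)             # 200 - max(100, 110) = 90
cost2 = base - T(100, 0)             # 90
cost_both = base - T(0, 0)           # 200 - 110 = 90
print(cost_both - cost1 - cost2)     # 90 - 90 - 90 = -90: serial interaction
```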
Interaction cost (icost) summary
icost = aggregate cost - sum of individual costs
• Positive icost: parallel interaction (e.g., a cache miss overlapped with a branch mispredict, load-replay trap, fetch bandwidth limit, or LSQ stall)
• Zero icost: independent events
• Negative icost: serial interaction (e.g., dependent misses behind ALU latency)
Why care about serial interactions?
(Example: two serial 100-cycle misses overlapped with a 110-cycle ALU-latency chain.)
• Reason #1: We are over-optimizing! Prefetching miss #2 doesn't help if miss #1 is already prefetched (but the overhead still costs us).
• Reason #2: We have a choice of what to optimize. Prefetching miss #2 has the same effect as prefetching miss #1.
Icost Case Study: Deep Pipelines
DL1 (level-1 data cache) access. Looking for serial interactions!
Icost Case Study: Deep Pipelines (cont.)
[Figure: F/E/C dependence graph for instructions i1-i6, showing DL1 access latencies and the instruction-window edge]
Criticality Challenges
• Cost: how much speedup is possible from optimizing an event?
• Slack: how much can an event be "slowed down" before increasing execution time?
• Interactions: when do multiple events need to be optimized simultaneously? When do we have a choice?
• Exploit in Hardware
Exploit in Hardware
• Criticality Analyzer: online, fast feedback; limited to critical/not-critical
• Replacement for Performance Counters: requires offline analysis; constructs the entire graph
Only last-arriving edges can be critical
Observation: for R3 = R1 + R2, if the dependence edge into R2 is on the critical path, then the value of R2 arrived last at the E node; a dependence resolved early cannot be critical.
Critical implies arrives-last, but arrives-last does not imply critical.
Determining last-arrive edges
Observe events within the machine:
• last_arrive[F] = E→F if branch mispredict; C→F if ROB stall; F→F otherwise
• last_arrive[E] = F→E if data was ready on fetch; E→E otherwise (observe the arrival order of operands)
• last_arrive[C] = E→C if the commit pointer is delayed; C→C otherwise
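These rules amount to a small per-stage decision procedure. A sketch, assuming the listed events are observable as per-instruction flags; the attribute names are my own, not the paper's hardware interface:

```python
from dataclasses import dataclass

@dataclass
class InstEvents:
    branch_mispredicted: bool = False
    rob_stall: bool = False
    data_ready_at_fetch: bool = False
    commit_pointer_delayed: bool = False

def last_arrive_F(prev: InstEvents):
    """Which edge arrived last at this instruction's F node."""
    if prev.branch_mispredicted:
        return ("E", "F")   # fetch redirected by the mispredicted branch's E
    if prev.rob_stall:
        return ("C", "F")   # fetch stalled until an older instruction commits
    return ("F", "F")       # otherwise limited by in-order fetch bandwidth

def last_arrive_E(inst: InstEvents):
    if inst.data_ready_at_fetch:
        return ("F", "E")   # execution waited only on fetch
    return ("E", "E")       # else on the last-arriving operand's E node

def last_arrive_C(inst: InstEvents):
    if inst.commit_pointer_delayed:
        return ("E", "C")   # commit waited on this instruction's own execution
    return ("C", "C")       # else on the previous instruction's commit
```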
The last-arrive rule
The critical path (CP) consists only of last-arrive edges.
[Figure: F/E/C graph with only the last-arrive edges highlighted]
Prune the graph
Only last-arrive edges need to go into the graph; no other edges could be on the CP.
[Figure: pruned F/E/C graph, ending at the newest node]
…and we've found the critical path!
Backward-propagate along last-arrive edges, starting from the newest node.
• Found the CP by observing only last-arrive edges
• but this still requires constructing the entire graph
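A sketch of that backward pass, assuming each node's single last-arriving predecessor has been recorded; the structure names are mine:

```python
def critical_path(last_arrive_pred, newest):
    """Walk last-arrive edges backwards from the newest node.
    last_arrive_pred maps each node to its last-arriving predecessor
    (absent for the oldest node)."""
    cp, node = [], newest
    while node is not None:
        cp.append(node)
        node = last_arrive_pred.get(node)   # None once we reach the oldest node
    return list(reversed(cp))               # oldest-to-newest critical path
```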
Step 2: Reducing storage requirements
The CP is a "long" chain of last-arrive edges: the longer a given chain of last-arrive edges, the more likely it is part of the CP.
Algorithm: find sufficiently long last-arrive chains.
• Plant a token into a node n
• Propagate it forward, only along last-arrive edges
• Check for the token after several hundred cycles
• If the token is still alive, n is assumed critical
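A software sketch of this token heuristic, assuming the last-arrive successor relation is observable each cycle; the names and the 500-cycle horizon are illustrative (the slide says only "several hundred"):

```python
def token_alive(last_arrive_succ, start, horizon=500):
    """Plant a token at `start`; each step it survives only by moving
    along a last-arrive edge. Alive after `horizon` steps => the chain
    is long, so `start` is assumed critical."""
    frontier = {start}
    for _ in range(horizon):
        frontier = {s for n in frontier for s in last_arrive_succ.get(n, ())}
        if not frontier:
            return False   # token died: the last-arrive chain was short
    return True            # long last-arrive chain: likely on the CP
```

The appeal of this formulation is that it needs only local, per-cycle propagation state rather than the entire graph.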