How to Turn the Technological Constraint Problem into a Control Policy Problem Using Slack. Brian Fields, Rastislav Bodík, Mark D. Hill (University of Wisconsin-Madison). Constraint: memory latency. Design: cache hierarchy. Non-uniformity: load latencies. Policy: what to replace?
How to Turn the Technological Constraint Problem into a Control Policy Problem Using Slack. Brian Fields, Rastislav Bodík, Mark D. Hill. University of Wisconsin-Madison
The Problem: Managing constraints. Technological constraints dominate memory design. • Constraint: memory latency • Design: cache hierarchy • Non-uniformity: load latencies • Policy: what to replace?
The Problem: Managing constraints. In the future, technological constraints will also dominate microprocessor design. • Constraint: wires, power, complexity • Design: clusters; fast/slow ALUs; Grid, ILDP • Non-uniformity: bypasses, execution latencies, L1 latencies • Policy: ?, ?, ? • Policy goal: minimize the effect of lower-quality resources
Key Insight: Control policy crucial. With non-uniform machines, the technological constraint problem becomes a control policy problem.
Key Insight: Control policy crucial. The best possible policy: impose delays only on instructions that can absorb them, so that execution time is not increased. Achieved through slack: the amount an instruction can be delayed without increasing execution time.
Contributions/Outline. Understanding (how to measure slack in a simulator?) • determining slack: resource constraints are important • reporting slack: apportion to individual instructions • analysis: suggests which non-uniform machines to build. Predicting (how to predict slack in hardware?) • a simple delay-and-observe approach works well. Case study (how to design a control policy?) • on a power-efficient machine, up to 20% speedup
Determining slack: Why hard? "Probe the processor" approach: delay and observe. 1. Delay a dynamic instruction by n cycles 2. See if execution time increased 3. If not, increase n, restart, and go to step 1. Microprocessors are complex: sometimes slack is determined by resources (e.g. ROB). Srinivasan and Lebeck approximation, for loads (MICRO '98) • heuristics to predict execution time increase
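The delay-and-observe loop on this slide can be sketched as follows. This is a minimal illustration, not the authors' tooling: `probe_slack`, `toy_run`, and the `max_delay` cap are hypothetical names, and the "simulator" is a toy function that returns an execution time for a given set of injected delays.

```python
from typing import Callable, Dict


def probe_slack(run: Callable[[Dict[int, int]], int],
                inst: int, max_delay: int = 64) -> int:
    """Largest n <= max_delay such that delaying `inst` by n cycles
    leaves the execution time reported by `run` unchanged."""
    baseline = run({})                      # execution time with no delays
    n = 0
    # Step 1-3 of the slide: delay by one more cycle, rerun, compare.
    while n < max_delay and run({inst: n + 1}) == baseline:
        n += 1
    return n


def toy_run(delays: Dict[int, int]) -> int:
    """Toy stand-in for a simulator (assumption): two independent
    chains whose results are joined at the end of the program."""
    chain_a = 10 + delays.get(0, 0)         # instruction 0 is on the long chain
    chain_b = 4 + delays.get(1, 0)          # instruction 1 is on the short chain
    return max(chain_a, chain_b)


print(probe_slack(toy_run, 0))  # 0: any delay lengthens the long chain
print(probe_slack(toy_run, 1))  # 6: the short chain can absorb 6 cycles
```

Each probe needs a full rerun per candidate n and per dynamic instruction, which gives a feel for why this approach is costly at scale.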
Determining slack. Alternative approach: dependence-graph analysis. 1. Build a resource-sensitive dependence graph 2. Analyze it to find slack. But how to build a resource-sensitive graph? Casmira and Grunwald's solution (Kool Chips Workshop '00): graphs built only from instructions in the issue window
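The "analyze to find slack" step can be illustrated with a standard earliest/latest-time pass over a dependence DAG. This is a sketch under assumptions: nodes are numbered in program order so every edge goes from a lower to a higher index, `node_slack` is a hypothetical helper, and the example graph is made up rather than taken from the slides.

```python
from collections import defaultdict


def node_slack(num_nodes, edges):
    """edges: list of (src, dst, latency) with src < dst.
    Returns, for each node, how many cycles it can slip without
    lengthening the longest path through the graph."""
    succs, preds = defaultdict(list), defaultdict(list)
    for src, dst, lat in edges:
        succs[src].append((dst, lat))
        preds[dst].append((src, lat))

    # Forward pass: earliest time each node can happen.
    earliest = [0] * num_nodes
    for v in range(num_nodes):
        for p, lat in preds[v]:
            earliest[v] = max(earliest[v], earliest[p] + lat)
    end = max(earliest)                     # execution time of the graph

    # Backward pass: latest time each node may happen without moving `end`.
    latest = [end] * num_nodes
    for v in reversed(range(num_nodes)):
        for s, lat in succs[v]:
            latest[v] = min(latest[v], latest[s] - lat)

    return [latest[v] - earliest[v] for v in range(num_nodes)]


# Node 0 feeds nodes 1 and 2; both feed node 3. The path via node 1 is longer.
edges = [(0, 1, 1), (0, 2, 1), (1, 3, 3), (2, 3, 1)]
print(node_slack(4, edges))  # [0, 0, 2, 0]
```

Adding resource edges (for example, one that models a full reorder buffer) to such a graph changes the longest path and therefore the slack that the analysis reports.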
Data-Dependence Graph. [Figure: data-dependence graph with edge latencies 1, 2, 1, 1, 1, 1, 3.] Slack = 0 cycles
Our Dependence Graph Model (ISCA '01). [Figure: five instructions, each with F, E, and C nodes.] Slack = 0 cycles
Our Dependence Graph Model (ISCA '01). [Figure: five instructions, each with F, E, and C nodes, with latencies labeled on the edges.] Slack = 6 cycles • Modeling resources increases observable slack
Reporting slack. Global slack: # cycles a dynamic operation can be delayed without increasing execution time. Apportioned slack: distribute global slack among operations using an apportioning strategy. [Figure: example graph with two operations labeled GS = 15, AS = 10 and GS = 15, AS = 5.]
Slack measurements (Perl). Machine: 6-wide out-of-order superscalar, 128-entry issue window, 12-stage pipeline. [Figure: slack measurements; series: global, apportioned.]
Analysis via apportioning strategy. What non-uniform designs can slack tolerate? • Design: fast/slow ALU • Non-uniformity: execution latency • Apportioning strategy: double latency. Good news: 80% of dynamic instructions can have their latency doubled
Measuring slack in hardware: delay and observe. Goal: determine whether a static instruction has n cycles of slack. 1. Delay a dynamic instance by n cycles 2. Check if critical (via critical-path analyzer, ISCA '01): • No: the instruction has n cycles of slack • Yes: the instruction does not have n cycles of slack
Two predictor designs. 1. Implicit slack predictor • delay and observe with natural non-uniform delays • "bin" instructions to match non-uniform hardware. 2. Explicit slack predictor • retry delay and observe with different values of slack. Problem: obtaining unperturbed measurements
Fast/slow pipeline microarchitecture. P ∝ F²: save ~37% core power. [Figure: Fetch + Rename, then Steer, feeding a fast, 3-wide pipeline and a slow, 3-wide pipeline, each with WIN, Reg, and ALUs; Bypass Bus; Data Cache.] • Design has three non-uniformities: • Higher execution latencies • Increased (cross-domain) bypass latency • Decreased effective issue bandwidth
Selecting bins for implicit slack predictor. Two decisions: 1. Steer to fast/slow pipeline, then 2. Schedule with high/low priority within a pipeline. Use implicit slack predictor with four (2²) bins: bin 1 = fast, high priority; bin 2 = fast, low priority; bin 3 = slow, high priority; bin 4 = slow, low priority
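As a small illustration, the 2×2 bin table on this slide can be written as a lookup from slack bin to the two control decisions. The bin numbering follows one reading of the flattened table (bins 1 and 2 under "Fast", 3 and 4 under "Slow"; 1 and 3 in the "High" row, 2 and 4 in the "Low" row); `Decision`, `BIN_POLICY`, and `decide` are hypothetical names.

```python
from typing import NamedTuple


class Decision(NamedTuple):
    pipeline: str   # steering decision: "fast" or "slow"
    priority: str   # scheduling decision within that pipeline: "high" or "low"


# Assumed reading of the slide's table: more slack -> higher bin number.
BIN_POLICY = {
    1: Decision("fast", "high"),
    2: Decision("fast", "low"),
    3: Decision("slow", "high"),
    4: Decision("slow", "low"),
}


def decide(slack_bin: int) -> Decision:
    """Map a predicted slack bin to (steer, schedule) decisions."""
    return BIN_POLICY[slack_bin]


print(decide(1))  # Decision(pipeline='fast', priority='high')
print(decide(4))  # Decision(pipeline='slow', priority='low')
```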
Putting it all together. Prediction path: the PC indexes a 4 KB slack prediction table, which supplies a slack bin # to the fast/slow pipeline core. Training path: criticality analyzer (~1 KB) and a 4-bin slack state machine.
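A software model of the prediction and training paths might look like the sketch below. The slides give only the table size (4 KB) and the existence of a 4-bin state machine; everything else here is an assumption made for illustration: the entry count, the PC hash, the initial bin, and the saturating up/down update rule. `SlackBinPredictor` is a hypothetical name.

```python
class SlackBinPredictor:
    """PC-indexed table of slack bins, trained from criticality feedback.
    Assumed update rule: move one bin toward more slack when the delayed
    instance was not critical, one bin toward less slack when it was."""

    def __init__(self, entries: int = 1024, num_bins: int = 4):
        self.entries = entries
        self.num_bins = num_bins
        self.table = [1] * entries          # assumption: start in bin 1 (no slack)

    def _index(self, pc: int) -> int:
        return (pc >> 2) % self.entries     # assumption: 4-byte instructions

    def predict(self, pc: int) -> int:
        """Prediction path: PC in, slack bin number out."""
        return self.table[self._index(pc)]

    def train(self, pc: int, was_critical: bool) -> None:
        """Training path: feedback from a criticality analyzer."""
        i = self._index(pc)
        if was_critical:
            self.table[i] = max(1, self.table[i] - 1)
        else:
            self.table[i] = min(self.num_bins, self.table[i] + 1)


pred = SlackBinPredictor(entries=16)
for _ in range(5):
    pred.train(0x400, was_critical=False)   # instance tolerated its delay
print(pred.predict(0x400))                  # 4 (saturated at the top bin)
pred.train(0x400, was_critical=True)
print(pred.predict(0x400))                  # 3
```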
Fast/slow pipeline performance. [Figure: performance; series: 2 fast, high-power pipelines; slack-based policy; reg-dep steering.]
Slack used up. [Figure: average global slack per dynamic instruction; series: 2 fast, high-power pipelines; slack-based policy; reg-dep steering.]
Conclusion: Future processor design flow. Future processors will be non-uniform. A slack-based policy can control them. 1. Measure slack in a simulator • decide early on what designs to build 2. Predict slack in hardware • simple implementation 3. Design a control policy • policy decisions map to slack bins
Define local slack. Local slack: # cycles an edge's latency can be increased without delaying subsequent instructions. [Figure: dependence graph with edge latencies 1, 1, 1, 1, 1, 1, 3; edges marked with local slacks of 2 cycles, 1 cycle, and 1 cycle.] In real programs, ~20% of instructions have local slack of at least 5 cycles
Compute local slack. Local slack: # cycles an edge's latency can be increased without delaying subsequent instructions. [Figure: the same dependence graph annotated with arrival times; edges marked with local slacks of 2 cycles, 1 cycle, and 1 cycle.] In real programs, ~20% of instructions have local slack of at least 5 cycles
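One way to compute local slack from arrival times is sketched below. It assumes nodes are numbered in program order (every edge goes from a lower to a higher index) and defines a node's arrival time as the latest of its incoming edges; `local_slacks` and the example graph are illustrative and do not reproduce the slide's figure.

```python
def local_slacks(num_nodes, edges):
    """edges: list of (src, dst, latency) with src < dst.
    Local slack of an edge = arrival time of its destination minus the
    time at which this edge alone would deliver its value."""
    arrival = [0] * num_nodes
    # Process edges by destination so each source's arrival time is final.
    for src, dst, lat in sorted(edges, key=lambda e: e[1]):
        arrival[dst] = max(arrival[dst], arrival[src] + lat)
    return {(src, dst): arrival[dst] - (arrival[src] + lat)
            for src, dst, lat in edges}


edges = [(0, 2, 1), (1, 2, 3), (2, 3, 1), (0, 3, 1)]
slack = local_slacks(4, edges)
print(slack[(0, 2)])  # 2: node 2 waits for the 3-cycle edge from node 1
print(slack[(1, 2)])  # 0: this edge determines node 2's arrival time
print(slack[(0, 3)])  # 3: node 3 is not ready until cycle 4
```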
Define global slack. Global slack: # cycles an edge's latency can be increased without delaying the last instruction in the program. [Figure: dependence graph with edge latencies 1, 1, 1, 1, 1, 1, 3; edges marked with global slacks of 2 cycles, 2 cycles, 1 cycle, and 1 cycle.] In real programs, >90% of instructions have global slack of at least 5 cycles
Compute global slack. Calculate global slack: propagate backward, accumulating local slacks. Local slacks: LS1 = 1, LS2 = 0, LS3 = 1, LS5 = 2, LS6 = 0. Global slacks: GS6 = LS6 = 0; GS5 = LS5 = 2; GS3 = GS6 + LS3 = 1; GS1 = MIN(GS3, GS5) + LS1 = 2. In real programs, >90% of instructions have global slack of at least 5 cycles
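The backward propagation on this slide can be reproduced with a short recursive pass. The local slacks and the successor relations for elements 1, 3, 5, and 6 follow the slide's formulas; the successor of element 2 is not shown in the transcript, so "2 feeds 3" is an assumption, and `global_slack` is a hypothetical helper.

```python
def global_slack(local, succs):
    """GS(n) = LS(n) + min over successors s of GS(s); GS = LS at a sink."""
    gs = {}

    def visit(n):
        if n not in gs:
            nxt = succs.get(n, [])
            downstream = min(visit(s) for s in nxt) if nxt else 0
            gs[n] = local[n] + downstream
        return gs[n]

    for n in local:
        visit(n)
    return gs


local = {1: 1, 2: 0, 3: 1, 5: 2, 6: 0}       # LS values from the slide
succs = {1: [3, 5], 3: [6], 2: [3]}          # 2 -> 3 is an assumption

print(sorted(global_slack(local, succs).items()))
# [(1, 2), (2, 1), (3, 1), (5, 2), (6, 0)]
```

With that assumed edge, the result also gives GS2 = 1, which agrees with the value used in the apportioned-slack example later in the deck.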
Apportioned slack. Goal: distribute slack to instructions that need it. Thus, the apportioning strategy depends upon the nature of the non-uniformities in the machine. E.g.: • non-uniformity: two-speed bypass buses (1 cycle, 2 cycles) • strategy: give 1 cycle of slack to as many edges as possible
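A "give k cycles to as many as possible" strategy can be sketched as a greedy pass that recomputes global slack after every grant, since slack along a path is shared. This is an illustration under assumptions, not the authors' algorithm: it works on node latencies rather than edge latencies, visits nodes in program order (a different order can give a different apportionment), and uses a made-up graph. `slack_per_node` and `apportion` are hypothetical names.

```python
def slack_per_node(lat, preds, order):
    """Return (critical path length, global slack of each node).
    `order` must list the nodes in topological order."""
    head = {}                                # longest path ending at v, inclusive
    for v in order:
        head[v] = lat[v] + max((head[p] for p in preds.get(v, [])), default=0)
    total = max(head.values())

    succs = {v: [] for v in order}
    for v, ps in preds.items():
        for p in ps:
            succs[p].append(v)
    tail = {}                                # longest path after v, exclusive
    for v in reversed(order):
        tail[v] = max((lat[s] + tail[s] for s in succs[v]), default=0)

    return total, {v: total - head[v] - tail[v] for v in order}


def apportion(lat, preds, order, k=1):
    """Greedily add k cycles to every node that still has >= k global slack."""
    lat = dict(lat)                          # do not modify the caller's latencies
    granted = {}
    for v in order:
        _, slack = slack_per_node(lat, preds, order)
        granted[v] = k if slack[v] >= k else 0
        lat[v] += granted[v]
    return granted, lat


lat = {"a": 3, "b": 1, "c": 1, "d": 1, "e": 1}
preds = {"c": ["a", "b"], "d": ["c"], "e": ["b"]}
order = ["a", "b", "c", "d", "e"]

base, _ = slack_per_node(lat, preds, order)
granted, slowed = apportion(lat, preds, order, k=1)
print(granted)                                     # {'a': 0, 'b': 1, 'c': 0, 'd': 0, 'e': 1}
print(slack_per_node(slowed, preds, order)[0] == base)  # True: no slowdown
print(apportion(lat, preds, order, k=2)[0])        # {'a': 0, 'b': 2, 'c': 0, 'd': 0, 'e': 0}
```

In the k = 2 run, node `e` starts with 3 cycles of global slack but receives nothing, because `b` has already consumed 2 cycles of the slack they share.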
Define apportioned slack. Apportioned slack: distribute global slack among edges. For example: GS1 = 2, AS1 = 1; GS5 = 2, AS5 = 1; GS3 = 1, AS3 = 0; GS2 = 1, AS2 = 1. In real programs, >75% of instructions can be apportioned slack of at least 5 cycles
Slack measurements. [Figure: slack measurements; series: global, apportioned, local.]
Multi-speed ALUs. Can we tolerate ALUs running at half frequency? Yes, but: • For all types of operations? (needed for multi-speed clusters) • Can we make all integer ops double latency?
Load slack. Can we tolerate a long-latency L1 hit? • design: wire-constrained machine, e.g. Grid • non-uniformity: multi-latency L1 • apportioning strategy: apportion ALL slack to load instructions
Apportion all slack to loads. Most loads can tolerate the latency of an L2 cache hit. [Figure]
Multi-speed ALUs. Can we tolerate ALUs running at half frequency? • design: fast/slow ALUs • non-uniformity: multi-latency execution, bypass • apportioning strategy: give slack equal to original latency + 1
Latency+1 apportioning. Most instructions can tolerate doubling their latency. [Figure]
Validation. Two steps: 1. Increase latencies of instructions by their apportioned slack • for three apportioning strategies: 1) latency+1, 2) 5 cycles to as many instructions as possible, 3) 12 cycles to as many loads as possible 2. Compare to baseline (no delays inserted)
Validation. Worst case: inaccuracy of 0.6%. [Figure]
Predicting slack. Two steps to PC-indexed, history-based prediction: 1. Measure slack of a dynamic instruction 2. Store in array indexed by PC of static instruction. Need: ability to measure slack of a dynamic instruction. Need: locality of slack • can capture 80% of potential exploitable slack
Locality of slack experiment. For each static instruction: • Measure % of slackful dynamic instances • Multiply by # of dynamic instances • Sum across all static instructions • Compare to total slackful dynamic instructions (ideal case). Slackful = has enough apportioned slack to double latency
Locality of slack. A PC-indexed, history-based predictor can capture most of the available slack. [Figure]