
How to Turn the Technological Constraint Problem into a Control Policy Problem Using Slack

Brian Fields, Rastislav Bodík, Mark D. Hill (University of Wisconsin-Madison)


Presentation Transcript


  1. How to Turn the Technological Constraint Problem into a Control Policy Problem Using Slack Brian Fields Rastislav Bodík Mark D. Hill University of Wisconsin-Madison

  2. The Problem: Managing constraints — Technological constraints dominate memory design. For example: Constraint: memory latency → Design: cache hierarchy → Non-uniformity: load latencies → Policy: what to replace?

  3. The Problem: Managing constraints — In the future, technological constraints will also dominate microprocessor design. Constraints: wires, power, complexity → Designs: clusters, fast/slow ALUs, Grid, ILDP → Non-uniformities: bypasses, execution latencies, L1 latencies → Policies: ? • Policy goal: Minimize the effect of lower-quality resources

  4. Key Insight: Control policy crucial — With non-uniform machines, the technological constraint problem becomes a control policy problem

  5. Key Insight: Control policy crucial — The best possible policy imposes delays only on instructions where execution time is not increased. This is achieved through slack: the amount an instruction can be delayed without increasing execution time

  6. Contributions/Outline
  • Understanding (measure slack in a simulator?) — determining slack: resource constraints important; reporting slack: apportion to individual instructions; analysis: suggest non-uniform machines to build
  • Predicting (how to predict slack in hardware?) — simple delay-and-observe approach works well
  • Case study (how to design a control policy?) — on a power-efficient machine, up to 20% speedup

  7. Determining slack: Why hard? "Probe the processor" approach: delay and observe • Delay a dynamic instruction by n cycles • See if execution time increased • If not, increase n, restart, and go to step 1. Microprocessors are complex: sometimes slack is determined by resources (e.g., the ROB). Srinivasan and Lebeck's approximation for loads (MICRO '98): heuristics to predict execution-time increase
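The probe loop on this slide can be sketched as code. This is a minimal sketch, not the paper's implementation: `simulate` is an assumed stand-in for a cycle-accurate simulator that takes a map of per-instruction extra delays and returns total execution time.

```python
def measure_slack(simulate, inst, max_delay=64):
    """Delay-and-observe probe: grow the delay on one dynamic
    instruction until total execution time increases."""
    baseline = simulate({})            # run with no extra delays
    n = 0
    while n < max_delay:
        if simulate({inst: n + 1}) > baseline:
            break                      # delaying by n+1 hurt: slack is n
        n += 1
    return n
```

Each probe is a full re-run of the program, which is why this approach works in a simulator but is too expensive to apply directly in hardware.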

  8. Determining slack — Alternative approach: dependence-graph analysis • Build a resource-sensitive dependence graph • Analyze it to find slack. But how to build a resource-sensitive graph? Casmira and Grunwald's solution (Kool Chips Workshop '00): graphs built only from instructions in the issue window

  9. Data-Dependence Graph — [figure: example data-dependence graph with edge latencies] Slack = 0 cycles

  10. Our Dependence Graph Model (ISCA '01) — [figure: graph with fetch (F), execute (E), and commit (C) nodes for each instruction] Slack = 0 cycles

  11. Our Dependence Graph Model (ISCA '01) — [figure: the same F/E/C graph with edge latencies added] Slack = 6 cycles • Modeling resources increases observable slack

  12. Reporting slack — Global slack: # cycles a dynamic operation can be delayed without increasing execution time. Apportioned slack: distribute global slack among operations using an apportioning strategy. [figure: example graph annotated with global slack (GS) and apportioned slack (AS) values]

  13. Slack measurements (Perl) — 6-wide out-of-order superscalar, 128-entry issue window, 12-stage pipeline

  14. Slack measurements (Perl) — [chart: global slack distribution]

  15. Slack measurements (Perl) — [chart: global and apportioned slack distributions]

  16. Analysis via apportioning strategy — What non-uniform designs can slack tolerate? Design: fast/slow ALUs → Non-uniformity: execution latency → Apportioning strategy: double latency. Good news: 80% of dynamic instructions can have their latency doubled

  17. Contributions/Outline
  • Understanding (measure slack in a simulator?) — determining slack: resource constraints important; reporting slack: apportion to individual instructions; analysis: suggest non-uniform machines to build
  • Predicting (how to predict slack in hardware?) — simple delay-and-observe approach works well
  • Case study (how to design a control policy?) — on a power-efficient machine, up to 20% speedup

  18. Measuring slack in hardware: delay and observe — Goal: determine whether a static instruction has n cycles of slack • Delay a dynamic instance by n cycles • Check if it is critical (via the critical-path analyzer, ISCA '01): if not, the instruction has n cycles of slack; if so, it does not

  19. Two predictor designs • Implicit slack predictor: delay and observe using the natural non-uniform delays; "bin" instructions to match the non-uniform hardware • Explicit slack predictor: retry delay and observe with different values of slack. Problem: obtaining unperturbed measurements
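The explicit predictor's retry scheme can be sketched as follows; `probe` is a hypothetical hook that delays one dynamic instance by n cycles and reports whether the criticality analyzer flagged it as critical. The candidate delay values are illustrative, not from the slides.

```python
def explicit_slack(probe, candidates=(1, 2, 5, 12)):
    """Retry delay-and-observe with growing delay values; the largest
    delay that does not make the instance critical is the measured slack."""
    slack = 0
    for n in candidates:
        if probe(n):       # instance became critical: stop growing the delay
            break
        slack = n
    return slack
```

Each retry needs a fresh, unperturbed dynamic instance, which is exactly the measurement problem the slide notes.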

  20. Contributions/Outline
  • Understanding (measure slack in a simulator?) — determining slack: resource constraints important; reporting slack: apportion to individual instructions; analysis: suggest non-uniform machines to build
  • Predicting (how to predict slack in hardware?) — simple delay-and-observe approach works well
  • Case study (how to design a control policy?) — on a power-efficient machine, up to 20% speedup

  21. Fast/slow pipeline microarchitecture — P ∝ F² (save ~37% core power). [diagram: fetch + rename feed a steering stage that routes instructions to a fast 3-wide pipeline or a slow 3-wide pipeline, each with its own register file, issue window, and ALUs, connected by a bypass bus to a shared data cache] • The design has three non-uniformities: • Higher execution latencies • Increased (cross-domain) bypass latency • Decreased effective issue bandwidth

  22. Selecting bins for implicit slack predictor — Two decisions: • Steer to the fast or slow pipeline, then • Schedule with high or low priority within a pipeline. Use the implicit slack predictor with four (2²) bins:
  Steer:             Fast  Slow
  Schedule: High      1     3
  Schedule: Low       2     4
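The four bins combine the two decisions and can be decoded directly. The exact numbering below is an assumption consistent with the slide's 2×2 table: bins 1-2 steer fast, 3-4 steer slow, and odd bins schedule at high priority.

```python
def decode_bin(b):
    """Map a slack bin (1-4) to the two control decisions:
    bins 1-2 steer to the fast pipeline, 3-4 to the slow one;
    bins 1 and 3 schedule with high priority, 2 and 4 with low."""
    steer = "fast" if b in (1, 2) else "slow"
    priority = "high" if b in (1, 3) else "low"
    return steer, priority
```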

  23. Putting it all together — [diagram: prediction path: the PC indexes a 4 KB slack prediction table, which supplies a slack bin # to the fast/slow pipeline core; training path: a criticality analyzer (~1 KB) drives a 4-bin slack state machine that updates the table]
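A PC-indexed slack prediction table like the one in this diagram might look as follows. The entry count, default bin, and single-step up/down training rule are all assumptions; the slide specifies only a 4 KB table, a criticality analyzer, and a 4-bin state machine.

```python
class SlackPredictor:
    """Direct-mapped, PC-indexed table of slack bins (1-4).
    Bin 1 = least slack (fast/high); bin 4 = most slack (slow/low)."""

    def __init__(self, entries=2048):
        self.entries = entries
        self.table = [1] * entries   # default to the least-slack bin (assumption)

    def _index(self, pc):
        return (pc >> 2) % self.entries   # drop byte offset, direct-map

    def predict(self, pc):
        return self.table[self._index(pc)]

    def train(self, pc, was_critical):
        i = self._index(pc)
        if was_critical:             # less slack than assumed: move toward fast/high
            self.table[i] = max(1, self.table[i] - 1)
        else:                        # instruction had slack: move toward slow/low
            self.table[i] = min(4, self.table[i] + 1)
```

Training feedback comes from the criticality analyzer on sampled dynamic instances; the predictor then steers and schedules future instances of the same static instruction.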

  24. Fast/slow pipeline performance — [chart: compares 2 fast high-power pipelines, the slack-based policy, and register-dependence steering]

  25. Slack used up — Average global slack per dynamic instruction. [chart: 2 fast high-power pipelines vs. the slack-based policy]

  26. Slack used up — Average global slack per dynamic instruction. [chart: 2 fast high-power pipelines vs. the slack-based policy vs. register-dependence steering]

  27. Conclusion: Future processor design flow — Future processors will be non-uniform; a slack-based policy can control them. • Measure slack in a simulator: decide early on what designs to build • Predict slack in hardware: simple implementation • Design a control policy: policy decisions → slack bins

  28. Backup slides

  29. Define local slack — Local slack: # cycles an edge latency can be increased without delaying subsequent instructions. [figure: example graph with edge latencies and per-edge local slack labels] In real programs, ~20% of instructions have local slack of at least 5 cycles

  30. Compute local slack — [figure: the same graph annotated with each node's arrival time; an edge's local slack is the extra latency it can absorb before its consumer's arrival time changes] In real programs, ~20% of instructions have local slack of at least 5 cycles
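The arrival-time computation behind this slide can be sketched for a latency-labeled graph. Assuming edges are listed in topological order, a node's arrival time is its longest path from the start, and an edge's local slack is the extra latency it can absorb before its consumer's arrival time would change:

```python
def arrival_times(nodes, edges):
    """edges: (producer, consumer, latency) triples in topological order."""
    arr = {n: 0 for n in nodes}
    for u, v, lat in edges:
        arr[v] = max(arr[v], arr[u] + lat)   # longest-path arrival
    return arr

def edge_local_slack(edges, arr):
    """Local slack of an edge = consumer arrival - (producer arrival + latency)."""
    return {(u, v): arr[v] - (arr[u] + lat) for u, v, lat in edges}
```

In the diamond example below, the short b-side path has 2 cycles of local slack on its final edge because the longer c-side path determines d's arrival time.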

  31. Define global slack — Global slack: # cycles an edge latency can be increased without delaying the last instruction in the program. [figure: example graph with edge latencies and per-edge global slack labels] In real programs, >90% of instructions have global slack of at least 5 cycles

  32. Compute global slack — Calculate global slack by backward propagation, accumulating local slacks: a node's global slack is its local slack plus the minimum global slack of its successors. Example: with LS1=1, LS2=0, LS3=1, LS5=2: GS5=LS5=2, GS6=LS6=0, GS3=GS6+LS3=1, GS1=MIN(GS3,GS5)+LS1=2. In real programs, >90% of instructions have global slack of at least 5 cycles
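The backward propagation on this slide can be sketched directly: a node's global slack is its local slack plus the minimum global slack of its successors (zero for nodes with no successors). The usage below reuses the slide's own numbers.

```python
def global_slack(local_slack, successors):
    """Backward-propagate local slack: GS[i] = LS[i] + min(GS of successors)."""
    gs = {}

    def compute(i):
        if i not in gs:
            succ = successors.get(i, [])
            gs[i] = local_slack[i] + (min(compute(j) for j in succ) if succ else 0)
        return gs[i]

    for i in local_slack:
        compute(i)
    return gs
```

Running it on the slide's example (node 1 feeding nodes 3 and 5, node 3 feeding node 6) reproduces GS5=2, GS6=0, GS3=1, GS1=2.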

  33. Apportioned slack — Goal: distribute slack to the instructions that need it. Thus, the apportioning strategy depends on the nature of the non-uniformities in the machine. E.g., non-uniformity: two bypass-bus speeds (1 cycle, 2 cycles); strategy: give 1 cycle of slack to as many edges as possible
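The example strategy ("give 1 cycle to as many edges as possible") can be sketched for the simplified case of a single dependence chain whose edges share one pool of global slack. Real apportioning must respect overlapping paths through the graph, which this toy version ignores.

```python
def apportion_chain(total_global_slack, num_edges, amount=1):
    """Greedily hand `amount` cycles of slack to as many edges as
    possible along one chain, never exceeding the shared slack pool."""
    out = []
    remaining = total_global_slack
    for _ in range(num_edges):
        if remaining >= amount:
            out.append(amount)
            remaining -= amount
        else:
            out.append(0)
    return out
```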

  34. Define apportioned slack — Apportioned slack: distribute global slack among edges. For example: GS1=2, AS1=1; GS5=2, AS5=1; GS3=1, AS3=0; GS2=1, AS2=1. In real programs, >75% of instructions can be apportioned slack of at least 5 cycles

  35. Slack measurements — [chart: local, apportioned, and global slack distributions]

  36. Multi-speed ALUs Can we tolerate ALUs running at half frequency? Yes, but: • For all types of operations? (needed for multi-speed clusters) • Can we make all integer ops double latency?

  37. Load slack — Can we tolerate a long-latency L1 hit? Design: wire-constrained machine (e.g., Grid); non-uniformity: multi-latency L1; apportioning strategy: apportion ALL slack to load instructions

  38. Apportion all slack to loads — [chart] Most loads can tolerate an L2 cache hit

  39. Multi-speed ALUs — Can we tolerate ALUs running at half frequency? Design: fast/slow ALUs; non-uniformity: multi-latency execution, bypass; apportioning strategy: give slack equal to original latency + 1

  40. Latency+1 apportioning — [chart] Most instructions can tolerate doubling their latency

  41. Breakdown by operation (Latency+1 apportioning) — [chart]

  42. Validation — Two steps: • Increase instruction latencies by their apportioned slack, for three apportioning strategies: 1) latency+1, 2) 5 cycles to as many instructions as possible, 3) 12 cycles to as many loads as possible • Compare to baseline (no delays inserted)
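The two-step validation reduces to one comparison. In this sketch, `simulate` is again a hypothetical cycle-accurate run that accepts per-instruction extra delays; a good apportioning should yield an inaccuracy near zero.

```python
def validation_inaccuracy(simulate, apportioned_delays):
    """Delay each instruction by its apportioned slack and report the
    relative slowdown versus the no-delay baseline."""
    baseline = simulate({})
    delayed = simulate(apportioned_delays)
    return (delayed - baseline) / baseline
```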

  43. Validation — Worst case: inaccuracy of 0.6%

  44. Predicting slack — Two steps to PC-indexed, history-based prediction: • Measure the slack of a dynamic instruction • Store it in an array indexed by the PC of the static instruction. Need: ability to measure the slack of a dynamic instruction. Need: locality of slack • can capture 80% of the potential exploitable slack

  45. Locality of slack experiment — For each static instruction: • Measure the % of slackful dynamic instances • Multiply by the # of dynamic instances • Sum across all static instructions • Compare to the total slackful dynamic instructions (ideal case). Slackful = has enough apportioned slack to double its latency

  46. Locality of slack

  47. Locality of slack

  48. Locality of slack — A PC-indexed, history-based predictor can capture most of the available slack

  49. Predicting slack — Two steps to PC-indexed, history-based prediction: • Measure the slack of a dynamic instruction • Store it in an array indexed by the PC of the static instruction. Need: ability to measure the slack of a dynamic instruction. Need: locality of slack • can capture 80% of the potential exploitable slack

  50. Measuring slack in hardware: delay and observe — Goal: determine whether a static instruction has n cycles of slack • Delay a dynamic instance by n cycles • Check if it is critical (via the critical-path analyzer): if not, the instruction has n cycles of slack; if so, it does not
