
Energy Efficient Latency Tolerance: Single-Thread Performance for the Multi-Core Era

Andrew Hilton, University of Pennsylvania (adhilton@cis.upenn.edu). Duke :: March 18, 2010.


Presentation Transcript


  1. Energy Efficient Latency Tolerance: Single-Thread Performance for the Multi-Core Era
  Andrew Hilton, University of Pennsylvania (adhilton@cis.upenn.edu)
  Duke :: March 18, 2010

  2. Multi-Core Architecture
  Single-thread performance growth has diminished
  • Clock frequency has hit an energy wall
  • Instruction-level parallelism (ILP) has hit energy, memory, and idea walls
  Future chips will be heterogeneous multi-cores
  • A few high-performance out-of-order cores (Core i7) for serial code
  • Many low-power in-order cores (Atom) for parallel code
  [Slide figure: one Core i7 core alongside many Atom cores]

  3. Multi-Core Performance
  Obvious performance key: write more parallel software
  Less obvious performance key: speed up existing cores
  • Core i7? Keep the serial portion from becoming a bottleneck (Amdahl)
  • Atoms? Parallelism is typically not elastic
  Key constraint: energy
  • Thermal limitations of the chip, cost of energy, cooling costs, …
  [Slide figure: one Core i7 core alongside many Atom cores]

  4. “TurboBoost”
  Existing technique: Dynamic Voltage and Frequency Scaling (DVFS)
  • Increase clock frequency (requires increasing voltage)
  • Simple
  • Applicable to both types of cores
  • Not very energy-efficient (energy ≈ frequency²)
  • Doesn’t help “memory bound” programs (speedup < frequency increase)

  5. Effectiveness of “TurboBoost”
  [Charts: speedup (higher is better) and ED² (lower is better)]
  • Example: TurboBoost 3.2 GHz → 4.0 GHz (+25%)
  • Ideal conditions: 25% speedup, constant Energy × Delay²
  • Memory-bound programs: far from ideal
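To see why a pure frequency boost is only “ED²-neutral” at best, here is a minimal sketch of the slide’s arithmetic, assuming the energy ≈ frequency² model from the previous slide (the 5% memory-bound speedup is a hypothetical number for illustration):

```python
# ED^2 of a 3.2 GHz -> 4.0 GHz TurboBoost under the energy ~ f^2 model.
f_base, f_boost = 3.2, 4.0
scale = f_boost / f_base           # 1.25x frequency
energy = scale ** 2                # ~1.56x energy per unit of work

delay_ideal = 1 / scale            # CPU-bound code: 0.80x delay
print(energy * delay_ideal**2)     # 1.00 -> ED^2 constant (ideal case)

delay_membound = 1 / 1.05          # hypothetical: only a 5% speedup
print(energy * delay_membound**2)  # ~1.42 -> ED^2 considerably worse
```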

  6. “Memory Bound”
  Main memory is slow relative to the core (~250 cycles)
  The cache hierarchy makes most accesses fast
  • “Memory bound” = many L3 misses
  • … or in some cases many L2 misses
  • … or, for in-order cores, many L1 misses
  • Clock frequency (“TurboBoost”) accelerates only the core/L1/L2
  [Slide figure: Core i7 and Atom cores, each with private L1$ and L2$ (10 cycles), a shared L3$ (40 cycles), and main memory (250 cycles)]
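As a concrete illustration of why L3 misses dominate, a standard average-memory-access-time (AMAT) calculation layered over the slide’s latencies; the L1 hit time and all miss rates below are made-up assumptions, not from the talk:

```python
# AMAT = hit time + miss rate * (cost of the next level), applied per level.
L1_HIT, L2, L3, MEM = 3, 10, 40, 250   # cycles; L1 hit time assumed
m1, m2, m3 = 0.10, 0.30, 0.30          # hypothetical miss rates

amat = L1_HIT + m1 * (L2 + m2 * (L3 + m3 * MEM))
print(f"AMAT = {amat:.2f} cycles")     # 7.45 with these numbers

# Doubling only the L3 miss rate (m3 = 0.6) pushes the 250-cycle memory
# latency into many more accesses:
amat2 = L1_HIT + m1 * (L2 + m2 * (L3 + 0.6 * MEM))
print(f"AMAT = {amat2:.2f} cycles")    # 9.70: a faster clock can't touch this
```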

  7. Goal: Help Memory Bound Programs
  Wanted: a technique complementary to TurboBoost
  Successful applicants should
  • Help “memory bound” programs
  • Be at least as energy efficient as TurboBoost (at least constant ED²)
  • Work well with both out-of-order and in-order cores
  Promising previous idea: latency tolerance
  • Helps “memory bound” programs
  My work: energy efficient latency tolerance for all cores
  • Today: primarily out-of-order (BOLT) [HPCA’10]

  8. Talk Outline
  Introduction
  Background: memory latency & latency tolerance
  My work: energy efficient latency tolerance in BOLT
  • Implementation aspects
  • Runtime aspects
  Other work and future plans

  9. LLC (Last-Level Cache) Misses
  What is this picture? Loads A & H miss the caches
  This is an in-order processor
  • Misses serialize → latencies add → dominate performance
  We want Miss-Level Parallelism (MLP): overlap A & H
  [Timeline: two back-to-back 250-cycle misses (not to scale)]

  10. Miss-Level Parallelism (MLP)
  One option: prefetching
  • Requires predicting the address of H at A
  Another option: out-of-order execution (Core i7)
  • Requires a sufficiently large “window” to do this
  [Timeline: the two 250-cycle misses now overlap]
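The payoff is easy to quantify in a toy model (the 10-cycle issue gap below is a made-up number):

```python
# Two loads, each missing to memory for ~250 cycles.
MISS = 250

serialized = 2 * MISS      # in-order: H waits for A's miss to return
gap = 10                   # hypothetical distance between the two loads
overlapped = gap + MISS    # MLP: H's miss issued while A's is in flight

print(serialized, overlapped)   # 500 vs. 260 cycles of miss time
```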

  11. Out-of-Order Execution & “Window”
  [Pipeline diagram: I$ → Fetch → Rename → Reorder Buffer / Issue Queue / Register File → FU → D$; load A waits on an LLC miss, with completed and unexecuted instructions behind it]
  Important “window” structures
  • Register file (limits the number of in-flight instructions): 128 insns on Core i7
  • Issue queue (limits the number of un-executed instructions): 36 on Core i7
  • Sized to “tolerate” (keep the core busy for) ~30-cycle latencies
  • To tolerate ~250 cycles, we need structures an order of magnitude bigger
  Latency tolerance big idea: scale the window virtually
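The order-of-magnitude claim follows from a Little’s-Law-style argument: to stay busy across a latency of L cycles, the window must hold roughly L × (issue width) instructions. A minimal sketch, where the 4-wide issue width is my assumption:

```python
# Window sizing: in-flight instructions ~ issue width * tolerated latency.
width = 4          # assumed peak issue width (hypothetical)

print(width * 30)  # ~120: roughly a 128-entry Core i7 window
print(width * 250) # ~1000: an order of magnitude bigger
```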

  12. Latency Tolerance
  Prelude: add a slice buffer
  • A new structure (not in conventional processors)
  • Can be relatively large: low bandwidth, not in the critical execution core
  [Pipeline diagram: the slice buffer sits alongside the reorder buffer, off the critical path]

  13. Latency Tolerance
  Phase #1: long-latency cache miss → slice out
  • Pseudo-execute: copy to the slice buffer, release the register & IQ slot

  14. Latency Tolerance
  Phase #1: long-latency cache miss → slice out
  • Pseudo-execute: copy to the slice buffer, release the register & IQ slot
  [Animation: missing load A moves from the issue queue into the slice buffer]

  15. Latency Tolerance
  Phase #1: long-latency cache miss → slice out
  • Pseudo-execute: copy to the slice buffer, release the register & IQ slot
  • Propagate “poison” to identify dependents
  [Animation: a miss-dependent instruction is flagged in the issue queue]

  16. Latency Tolerance
  Phase #1: long-latency cache miss → slice out
  • Pseudo-execute: copy to the slice buffer, release the register & IQ slot
  • Propagate “poison” to identify dependents
  • Pseudo-execute them too
  [Animation: dependent D joins A in the slice buffer]

  17. Latency Tolerance
  Phase #1: long-latency cache miss → slice out
  • Pseudo-execute: copy to the slice buffer, release the register & IQ slot
  • Propagate “poison” to identify dependents
  • Pseudo-execute them too
  • Proceed under the miss
  [Animation: younger instructions E–I enter the window; the slice buffer now holds A, D, E, and H]
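Phase #1 in miniature. The sketch below is my own toy rendering of slice-out (data structures and names are illustrative, not BOLT’s): a missing load poisons its destination register, and anything reading a poisoned register pseudo-executes into the slice buffer instead of executing.

```python
# Toy slice-out: poison propagation (illustrative names, not BOLT's).
poisoned = set()    # registers whose values await the miss
slice_buffer = []   # pseudo-executed, miss-dependent instructions

def dispatch(name, srcs, dst, missed=False):
    if missed or any(s in poisoned for s in srcs):
        slice_buffer.append(name)   # copy out; register & IQ slot freed
        if dst:
            poisoned.add(dst)       # dependents will follow it out
    else:
        print(name, "executes under the miss")

dispatch("A", [], "r1", missed=True)   # the LLC miss
dispatch("B", ["r2"], "r3")            # independent: executes
dispatch("C", ["r3"], "r4")            # independent: executes
dispatch("D", ["r1"], "r5")            # reads poisoned r1: sliced out
print("slice buffer:", slice_buffer)   # ['A', 'D']
```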

  18. Latency Tolerance
  Phase #2: cache miss return → slice in

  19. Latency Tolerance
  Phase #2: cache miss return → slice in
  • Allocate new registers

  20. Latency Tolerance
  Phase #2: cache miss return → slice in
  • Allocate new registers
  • Put in the issue queue

  21. Latency Tolerance
  Phase #2: cache miss return → slice in
  • Allocate new registers
  • Put in the issue queue
  • Re-execute the instruction

  22. Latency Tolerance
  Phase #2: cache miss return → slice in
  • Allocate new registers
  • Put in the issue queue
  • Re-execute the instruction
  • Problems with sliced-in instructions (exceptions, mis-predictions)?
  [Animation: re-executed E raises an exception]

  23. Latency Tolerance
  Phase #2: cache miss return → slice in
  • Allocate new registers
  • Put in the issue queue
  • Re-execute the instruction
  • Problems with sliced-in instructions (exceptions, mis-predictions)?
  • Recover to the checkpoint (taken before A)
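Phase #2 in the same toy style (all names hypothetical): drain the slice buffer in program order, re-executing each instruction with a fresh register; if anything goes wrong, fall back to the checkpoint taken before the miss.

```python
# Toy slice-in with checkpoint recovery (illustrative, not BOLT's code).
class Fault(Exception):
    pass

def slice_in(slice_buf, checkpoint, state):
    saved = dict(checkpoint)             # snapshot taken before load A
    try:
        for name, update in slice_buf:   # program order
            if update is None:           # stand-in for an exception
                raise Fault(name)
            state.update(update)         # re-execute into a new register
    except Fault:
        state.clear()
        state.update(saved)              # recover to the checkpoint

state = {"r1": 0}
slice_in([("A", {"r1": 42}), ("E", None)], checkpoint={"r1": 0}, state=state)
print(state)   # {'r1': 0}: E faulted, so execution rolled back
```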

  24. Slice Self-Containment
  Important for latency tolerance: self-contained slices
  • A, D, & E have miss-independent inputs
  • Capture these values during slice out
  • This decouples the slice from the rest of the program
  [Dataflow graph: the slice (A, D, E, H) threaded through independent instructions B, C, F, G]
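Capture is the detail that makes the slice self-contained. A minimal sketch under the same illustrative naming: at slice-out time, any source operand whose value is already available is copied into the slice-buffer entry, so slice-in never needs to read it from the register file.

```python
# Toy input capture at slice-out (illustrative encoding, not BOLT's).
regfile = {"r2": 7}     # miss-independent value, ready at slice-out
poisoned = {"r1"}       # will be produced by the outstanding miss

def slice_out(name, srcs):
    entry = {"insn": name, "captured": {}, "waiting": []}
    for s in srcs:
        if s in poisoned:
            entry["waiting"].append(s)         # filled when the miss returns
        else:
            entry["captured"][s] = regfile[s]  # copy the live value now
    return entry

print(slice_out("D", ["r1", "r2"]))
# {'insn': 'D', 'captured': {'r2': 7}, 'waiting': ['r1']}
```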

  25. Latency Tolerance
  Latency tolerance example (in the timeline figure, energy ≈ number of boxes)
  • Slice out the miss and dependent instructions → “grow” the window
  • Slice in after the miss returns
  • Delay: 0.5×, Energy: 1.5× → combined ED²: 0.38× (ED² < 1.0 = good)
  [Timeline figure comparing baseline execution with latency-tolerant execution]
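Checking the slide’s numbers: re-execution costs extra energy, but delay enters the metric squared, so halving delay more than pays for a 1.5× energy bill.

```python
# The slide's example: 1.5x energy at 0.5x delay.
energy, delay = 1.5, 0.5
ed2 = energy * delay**2
print(f"ED^2 = {ed2:.3f}x")   # 0.375x, rounded to 0.38x on the slide
```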

  26. Previous Design: CFP
  Prior design: Continual Flow Pipelines [Srinivasan’04]
  • Obtains speedups, but…
  [Chart: speedups over baseline (higher is better)]

  27. Previous Design: CFP
  Prior design: Continual Flow Pipelines [Srinivasan’04]
  • Obtains speedups, but also slowdowns
  [Chart: speedups over baseline (higher is better)]

  28. Previous Design: CFP
  Prior design: Continual Flow Pipelines [Srinivasan’04]
  • Obtains speedups, but also slowdowns
  • Typically not energy efficient
  [Charts: speedup (higher is better) and ED² (lower is better)]

  29. Energy-Efficient Latency Tolerance?
  Efficient implementation
  • Re-use existing structures when possible
  • New structures must be simple and low-overhead
  Runtime efficiency
  • Minimize superfluous re-executions
  Previous designs have not achieved (or considered) these
  • Waiting Instruction Buffer [Lebeck’02]
  • Continual Flow Pipelines [Srinivasan’04]
  • Decoupled Kilo-Instruction Processor [Pericas ’06,’07]

  30. Sneak Preview: Final Results
  This talk: my work on efficient latency tolerance
  • Improved performance
  • Performance robustness (do no harm)
  • Performance is energy efficient
  [Charts: speedup (higher is better) and ED² (lower is better)]

  31. Talk Outline
  Introduction
  Background: memory latency & latency tolerance
  My work: energy efficient latency tolerance in BOLT
  • Implementation aspects
  • Runtime aspects
  Other work and future plans

  32. Examination of the Problem
  Problem with the existing design: register management
  • Miss-dependent instructions free registers when they execute
  [Pipeline diagram: miss-dependent instructions accumulate in the slice buffer]

  33. Examination of the Problem
  Problem with the existing design: register management
  • Miss-dependent instructions free registers when they execute
  • Actually, all instructions free registers when they execute
  What’s wrong with this?
  • No instruction-level precise state → hurts on branch mispredictions
  • Execution-order slice buffer → hard to re-rename & re-acquire registers

  34. BOLT Register Management
  Youngest instructions: kept in the reorder buffer
  • Conventional, in-order register freeing
  Miss-dependent instructions: in the slice buffer
  • Execution-based register freeing

  35. BOLT Register Management
  In-order speculative retirement stage
  • Is the head of the ROB completed or poisoned?

  36. BOLT Register Management
  In-order speculative retirement stage
  • Is the head of the ROB completed or poisoned?
  • Release registers

  37. BOLT Register Management
  In-order speculative retirement stage
  • Is the head of the ROB completed or poisoned?
  • Release registers
  [Animation: A’s register returns to the free pool]

  38. BOLT Register Management
  In-order speculative retirement stage
  • Is the head of the ROB completed or poisoned?
  • Release registers
  • Poisoned instructions enter the slice buffer

  39. BOLT Register Management
  In-order speculative retirement stage
  • Is the head of the ROB completed or poisoned?
  • Release registers
  • Poisoned instructions enter the slice buffer
  • Completed instructions are done and simply removed

  40. BOLT Register Management
  In-order speculative retirement stage
  • Is the head of the ROB completed or poisoned?
  • Release registers
  • Poisoned instructions enter the slice buffer
  • Completed instructions are done and simply removed
  [Animation: the ROB head retires speculatively]
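The retirement loop, in the same toy style (structure and field names are mine, not BOLT’s): examine the ROB head in order; free the destination’s previous mapping either way, then route poisoned instructions to the slice buffer and drop completed ones.

```python
# Toy speculative retirement for BOLT-style register management.
from collections import deque

rob = deque([
    {"insn": "A", "poisoned": True,  "completed": False, "prev_preg": "p1"},
    {"insn": "B", "poisoned": False, "completed": True,  "prev_preg": "p2"},
    {"insn": "C", "poisoned": False, "completed": False, "prev_preg": "p3"},
])
slice_buffer, free_list = [], []

while rob and (rob[0]["completed"] or rob[0]["poisoned"]):
    head = rob.popleft()
    free_list.append(head["prev_preg"])    # in-order register release
    if head["poisoned"]:
        slice_buffer.append(head["insn"])  # re-executed later, at slice-in
    # completed instructions need nothing further

print(slice_buffer, free_list, [e["insn"] for e in rob])
# ['A'] ['p1', 'p2'] ['C']   (C is still executing, so retirement stalls)
```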

  41. BOLT Register Management
  Benefits of BOLT’s register management
  • Youngest instructions (ROB) get conventional recovery (do no harm)

  42. BOLT Register Management
  Benefits of BOLT’s register management
  • Youngest instructions (ROB) get conventional recovery (do no harm)
  • Program-order slice buffer allows re-use of SMT (“HyperThreading”)

  43. BOLT Register Management
  Benefits of BOLT’s register management
  • Youngest instructions (ROB) get conventional recovery (do no harm)
  • Program-order slice buffer allows re-use of SMT (“HyperThreading”)
  • Scales a single, conventionally sized register file
  Contribution #1: hybrid register management, the best of both worlds

  44. BOLT Register Management
  Benefits of BOLT’s register management
  • Youngest instructions (ROB) get conventional recovery (do no harm)
  • Program-order slice buffer allows re-use of SMT (“HyperThreading”)
  • Scales a single, conventionally sized register file
  Challenging part: two algorithms, one register file
  • Note: two register files are not a good solution

  45. Two Algorithms, One Register File
  Conventional algorithm (ROB)
  • In-order allocation/freeing from a circular queue
  • Efficient squashing support by moving a queue pointer

  46. Two Algorithms, One Register File
  Conventional algorithm (ROB)
  • In-order allocation/freeing from a circular queue
  • Efficient squashing support by moving a queue pointer
  Aggressive algorithm (slice instructions)
  • Execution-driven reference counting scheme

  47. Two Algorithms, One Register File
  How to combine these two algorithms?
  • The execution-based algorithm uses reference counting
  • Efficiently encode the conventional algorithm as reference counting too
  • Combine both into one reference-count matrix
  Contribution #2: efficient implementation of the new hybrid algorithm
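The unifying idea, reduced to a sketch (my simplification; BOLT’s actual encoding is a reference-count matrix, not a Python dict): if the rename map, the ROB, and sliced-out consumers each hold a reference to a physical register, both freeing disciplines become “free on last release.”

```python
# Register freeing as reference counting (simplified illustration).
refcount, free_list = {}, []

def add_ref(preg):
    refcount[preg] = refcount.get(preg, 0) + 1

def drop_ref(preg):
    refcount[preg] -= 1
    if refcount[preg] == 0:
        free_list.append(preg)   # no map entry, ROB entry, or slice
                                 # consumer still needs this register

add_ref("p7")    # the rename map points at p7
add_ref("p7")    # a sliced-out consumer of p7 was pseudo-executed
drop_ref("p7")   # conventional in-order retirement drops the map's ref
drop_ref("p7")   # slice-in re-executes the consumer: last ref gone
print(free_list) # ['p7']
```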

  48. Management of Loads and Stores
  A large window requires support for many loads and stores
  • The window is effectively A-V now; what about the loads & stores?
  • This could be an hour+ talk by itself… so just a small piece

  49. Store-to-Load Dependences
  Different from register state: we cannot capture inputs
  • Store → load dependences are determined by addresses
  • Cannot “capture” them like registers
  • Must be able to find the proper (older, matching) stores
  [Figure: a load searching stores A-F for its producer]

  50. Store-to-Load Dependences
  Different from register state: we cannot capture inputs
  • Store → load dependences are determined by addresses
  • Cannot “capture” them like registers
  • Must be able to find the proper (older, matching) stores
  • Must avoid younger matching stores (“write-after-read” hazards)
  [Figure: a younger matching store is crossed out of the search]
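What the search has to do, in the same toy style (names mine): among all stores older than the load with a matching address, take the youngest; matching stores younger than the load must be ignored.

```python
# Toy store->load forwarding search (illustrative structures).
def find_producer(stores, load_seq, load_addr):
    """stores: iterable of (seq, addr, value); seq gives program order."""
    older = [s for s in stores if s[0] < load_seq and s[1] == load_addr]
    return max(older)[2] if older else None   # youngest older match

stores = [(1, 0x40, "v1"), (3, 0x40, "v2"), (9, 0x40, "v3")]
print(find_producer(stores, load_seq=5, load_addr=0x40))
# 'v2': seq 3 is the youngest older match; seq 9 is younger (WAR hazard)
```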
