Dynamic Management of Microarchitecture Resources in Future Processors
Rajeev Balasubramonian

Presentation Transcript

  1. Dynamic Management of Microarchitecture Resources in Future Processors Rajeev Balasubramonian Dept. of Computer Science, University of Rochester

  2. Talk Outline • Trade-offs in future microprocessors • Dynamic resource management • On-chip cache hierarchy • Clustered processors • Pre-execution threads • Future work

  3. Talk Outline • Trade-offs in future microprocessors • Dynamic resource management • On-chip cache hierarchy • Clustered processors • Pre-execution threads • Future work

  4. Design Goals in Modern Processors Microprocessor designs strive for: • High performance • High clock speed • High parallelism • Low power • Low design complexity • Short, simple pipelines Unfortunately, not all can be achieved simultaneously.

  5. Trade-Off in the Cache Size [Diagram: CPU with L1 data cache, shown at two sizes]
  • 32KB cache / 2-cycle access: “sort 4000” miss rate very low, execution time t; “sort 16000” miss rate high, execution time T
  • 128KB cache / 4-cycle access: “sort 4000” miss rate very low, execution time t + x; “sort 16000” miss rate very low, execution time T - X

  6. Trade-Off in the Register File Size [Diagram: register file] The register file stores results for all active instructions in the processor. Large register file → more active instructions → high parallelism, but also → long access times → slow clock speed / more pipeline stages → high power, design complexity

  7. Trade-Offs Involving Resource Sizes Trade-offs influence the design of the cache, register file, issue queue, etc. Large resource size → high parallelism, ability to support more threads, but also → long latency → long pipelines / low clock speed, high power, high design complexity

  8. Parallelism-Latency Trade-Off • For each resource, performance depends on: the parallelism it can help extract, and the negative impact of its latency • Every program has different parallelism and latency needs.

  9. Limitations of Conventional Designs • Resource sizes are fixed at design time – the size that works best, on average, for all programs • This average size is often too small or too large for many programs • For optimal performance, the hardware should match the program’s parallelism needs.

  10. Dynamic Resource Management • Reconfigurable memory hierarchy (MICRO’00, IEEE TOC, PACT’02) • Trade-offs in clusters (ISCA’03) • Selective pre-execution (ISCA’01) • Efficient register file design (MICRO’01) • Dynamic voltage/frequency scaling (HPCA’02)

  11. Talk Outline • Trade-offs in future microprocessors • Dynamic resource management • On-chip cache hierarchy • Clustered processors • Pre-execution threads • Future work

  12. Conventional Cache Hierarchies [Diagram: CPU → L1 → L2 → main memory; capacity grows and speed drops down the hierarchy]
  • L1: 32KB, 2-way set-associative, 2 cycles, miss rate 2.3%
  • L2: 2MB, 8-way, 20 cycles, miss rate 0.2%
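The hierarchy above maps onto the standard average memory access time (AMAT) formula. A minimal sketch using the slide's numbers, assuming a 100-cycle main-memory latency (a value not quoted in the talk):

```python
def amat(l1_hit, l1_miss_rate, l2_hit, l2_miss_rate, mem_latency=100):
    """Average memory access time (cycles) for a two-level hierarchy:
    every access pays the L1 hit time; L1 misses additionally pay the
    L2 hit time, and L2 misses pay the main-memory latency."""
    return l1_hit + l1_miss_rate * (l2_hit + l2_miss_rate * mem_latency)

# Numbers from the slide: L1 = 2 cycles / 2.3% misses, L2 = 20 cycles / 0.2% misses.
print(amat(2, 0.023, 20, 0.002))
```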

  13. Conventional Cache Layout [Diagram: address decoder driving a wordline across ways 0 and 1; bitlines feed output drivers that return the data]

  14. Wire Delays
  • Delay is a quadratic function of the wire length
  • By inserting repeaters/buffers, delay grows roughly linearly with length
    – Length x → delay ~ t
    – Length 2x, unrepeated → delay ~ 4t
    – Length 2x with a repeater → delay ~ 2t + logic_delay
  • Repeaters electrically isolate the wire segments
  • Commonly used today in long wires
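The quadratic-vs-linear behavior above can be checked with a toy delay model (the constants and segment length are illustrative assumptions, not from the talk):

```python
import math

def unbuffered_delay(length, k=1.0):
    """RC delay of a single wire segment: quadratic in its length."""
    return k * length ** 2

def repeated_delay(length, segment=1.0, logic_delay=0.3, k=1.0):
    """Wire split into fixed-length segments joined by repeaters:
    total delay grows roughly linearly with length."""
    n = max(1, math.ceil(length / segment))
    return n * unbuffered_delay(length / n, k) + (n - 1) * logic_delay

for length in (1, 2, 4):
    print(length, unbuffered_delay(length), round(repeated_delay(length), 2))
```

With these constants, doubling the length from x to 2x takes the unbuffered delay from t to 4t, while one repeater brings it back to 2t plus the repeater's logic delay, matching the slide.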

  15. Exploiting Technology [Diagram: cache array with decoder, wires segmented by repeaters]

  16. The Reconfigurable Cache Layout [Diagram: decoder driving four ways (way 0 – way 3)]

  17. The Reconfigurable Cache Layout [Same diagram] 32KB 1-way cache, 2 cycles

  18. The Reconfigurable Cache Layout [Same diagram] 64KB 2-way cache, 3 cycles. The disabled portions of the cache are used as the non-inclusive L2.
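The three layouts above trace out a small configuration space. A sketch under the assumption that each way holds 32KB; the latency rule is an interpolation chosen to reproduce the points quoted on slides 5, 17, and 18 (2, 3, and 4 cycles for 1, 2, and 4 ways):

```python
import math

WAY_KB = 32  # per-way capacity implied by the 32KB 1-way configuration

def config(ways):
    """L1 size and access latency when `ways` ways are enabled; the
    disabled ways serve as the non-inclusive L2."""
    return {"ways": ways,
            "l1_kb": ways * WAY_KB,
            "latency_cycles": 2 + int(math.log2(ways))}

print([config(w) for w in (1, 2, 4)])
```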

  19. Changing the Boundary between L1-L2 [Diagram: CPU with adjacent L1 and L2; a small L1]

  20. Changing the Boundary between L1-L2 [Same diagram with the L1/L2 boundary moved to enlarge the L1]

  21. Trade-Off in the Cache Size [Diagram: CPU with L1 data cache, shown at two sizes]
  • 32KB cache / 2-cycle access: “sort 4000” miss rate very low, execution time t; “sort 16000” miss rate high, execution time T
  • 128KB cache / 4-cycle access: “sort 4000” miss rate very low, execution time t + x; “sort 16000” miss rate very low, execution time T - X

  22. Salient Features • Low-cost: exploits the benefits of repeaters • Optimizes the access time/capacity trade-off • Can reduce energy -- most efficient when cache size equals working set size

  23. Control Mechanism [Flowchart] Gather statistics at periodic intervals (every 10K instructions) → inspect stats: is there a phase change? If yes, explore: run each configuration for an interval, then pick the best configuration. If no, remain at the selected configuration.
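The flowchart above can be sketched as a control loop. The names and interfaces here are illustrative assumptions, not the paper's implementation:

```python
def tune(configs, run_interval, phase_changed):
    """Interval-based tuning: run_interval(cfg) returns the IPC measured
    over one 10K-instruction interval; phase_changed(stats) returns True
    when program behavior shifts. Yields the active configuration."""
    current = configs[0]
    while True:
        stats = run_interval(current)
        if phase_changed(stats):
            # Exploration: run each configuration for one interval...
            scores = {cfg: run_interval(cfg) for cfg in configs}
            # ...then pick the best (highest IPC) and stay there.
            current = max(scores, key=scores.get)
        yield current
```

Usage: drive the generator once per interval; it settles on one configuration until the next phase change triggers a fresh exploration.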

  24. Metrics • Optimizing performance: the metric for the best configuration is simply instructions per cycle (IPC) • Detecting a phase change: a change in branch frequency or miss rate, or a sudden change in IPC → a change in program phase • To avoid unnecessary explorations, the thresholds can be adapted at run-time
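One way to code the phase test on this slide; the relative-change metric and the specific threshold are assumptions for illustration:

```python
class PhaseDetector:
    """Flags a phase change when branch frequency, miss rate, or IPC
    moves by more than `threshold` (relative) since the last interval."""

    def __init__(self, threshold=0.1):
        self.threshold = threshold
        self.prev = None

    def changed(self, branch_freq, miss_rate, ipc):
        cur = (branch_freq, miss_rate, ipc)
        if self.prev is None:      # first interval: nothing to compare
            self.prev = cur
            return False
        delta = max(abs(a - b) / max(abs(b), 1e-9)
                    for a, b in zip(cur, self.prev))
        self.prev = cur
        return delta > self.threshold
```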

  25. Simulation Methodology • Modified version of Simplescalar-3.0 -- includes many details on bus contention • Executing programs from various benchmark sets (a mix of many program types)

  26. Performance Results Overall harmonic mean (HM) improvement: 17%

  27. Energy Results Overall energy savings: 42%

  28. Talk Outline • Trade-offs in future microprocessors • Dynamic resource management • On-chip cache hierarchy • Clustered processors • Pre-execution threads • Future work

  29. Conventional Processor Design [Diagram: I-Cache, branch predictor, and rename & dispatch feeding a single issue queue, register file, and several functional units (FUs)] Large structures → slower clock speed

  30. The Clustered Processor [Diagram: I-Cache, branch predictor, and rename & dispatch steering instructions such as r1 ← r3 + r4, r2 ← r1 + r41, and r41 ← r43 + r44 across four clusters, each with its own register file, issue queue (IQ), and FU] Small structures → faster clock speed, but high latency for some instructions

  31. Emerging Trends • Wire delays and faster clocks will make each cluster smaller • Larger transistor budgets and low design cost will enable the implementation of many clusters on the chip • The support of many threads will require many resources and clusters • → Numerous, small clusters will be a reality!

  32. Communication Costs [Diagram: 4-cluster and 8-cluster organizations, each cluster with its own Regs, IQ, and FU] More clusters → more communication

  33. Communication vs Parallelism [Diagram: instruction windows, with ready instructions highlighted] 4 clusters → 100 active instrs: r1 ← r2 + r3, r5 ← r1 + r3, …, r7 ← r2 + r3, r8 ← r7 + r3. 8 clusters → 200 active instrs: the same window plus further instructions such as r5 ← r1 + r7, …, r9 ← r2 + r3. Distant parallelism: distant instructions that are ready to execute.

  34. Communication-Parallelism Trade-Off • More clusters → more communication, but also → more parallelism • Selectively use more clusters: if communication is tolerable, and if there is additional distant parallelism

  35. IPC with Many Clusters (ISCA’03)

  36. Trade-Off Management • The clustered processor abstraction exposes the trade-off between communication and parallelism • It also simplifies the management of resources -- we can disable a cluster by simply not dispatching instructions to it
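The dispatch-gating idea above can be sketched in a few lines; the round-robin steering policy and all names are illustrative assumptions (real steering heuristics consider register dependences):

```python
def steer(instructions, n_clusters, enabled):
    """Dispatch instructions round-robin over the enabled clusters only.
    A cluster is 'disabled' simply by never appearing in `enabled`."""
    active = [c for c in range(n_clusters) if c in enabled]
    assignment = {}
    for i, instr in enumerate(instructions):
        assignment[instr] = active[i % len(active)]
    return assignment

# With clusters 1 and 3 disabled, instructions only ever reach 0 and 2.
print(steer(["i0", "i1", "i2"], 4, {0, 2}))
```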

  37. Control Mechanism [Flowchart] Gather statistics at periodic intervals (every 10K instructions) → inspect stats: is there a phase change? If yes, explore: run each configuration for an interval, then pick the best configuration. If no, remain at the selected configuration.

  38. The Interval Length • Success depends on the ability to repeat behavior across successive intervals • Every program is likely to have phase changes at different granularities • Must also pick the interval length at run-time

  39. Picking the Interval Length • Start with the minimum allowed interval length • If phase changes are too frequent, double the interval length – find a coarse enough granularity such that behavior is consistent • Repeat every 10 billion instructions • Small interval lengths can result in noisy measurements
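The doubling search above can be sketched as follows; the 5% instability threshold and the 10K-instruction minimum are illustrative assumptions:

```python
def pick_interval(instability_at, min_len=10_000,
                  max_len=10_000_000_000, threshold=0.05):
    """Find a coarse enough interval length: instability_at(length)
    returns the fraction of intervals that flag a phase change at that
    granularity; double the length while behavior is too unstable."""
    length = min_len
    while length < max_len and instability_at(length) > threshold:
        length *= 2   # too noisy: coarsen the granularity
    return length
```

In the talk's scheme, this search itself is rerun every 10 billion instructions so the interval length can track the program.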

  40. Varied Interval Lengths Instability factor: percentage of intervals that flag a phase change.

  41. Results with Interval-Based Scheme Overall improvement: 11%

  42. Talk Outline • Trade-offs in future microprocessors • Dynamic resource management • On-chip cache hierarchy • Clustered processors • Pre-execution threads • Future work

  43. Pre-Execution • Executing a subset of the program in advance • Helps warm up various processor structures such as the cache and branch predictor

  44. The Future Thread (ISCA’01) [Diagram: the main thread and the pre-execution thread advancing along the instruction stream, with the pre-execution thread running ahead] • The main program thread executes every single instruction • Some registers are reserved for the future thread so it can jump ahead

  45. Key Innovations • Ability to advance much further: eager recycling of registers; skipping idle instructions • Integrating pre-executed results: re-using register results; correcting branch mispredicts; prefetch into the caches • Allocation of resources

  46. Trade-Offs in Resource Allocation [Diagram: registers split between the main thread and the future thread] • Allocating more registers for the main thread favors nearby parallelism • Allocating more registers for the future thread favors distant parallelism • The interval-based mechanism can pick the optimal allocation

  47. Pre-Execution Results Overall improvement with 12 registers: 11% Overall improvement with dynamic allocation: 18%

  48. Conclusion • Emerging technologies will make trade-off management very vital • Approaches to hardware adaptation: cache hierarchy; clustered processors; pre-execution threads • The interval-based mechanism with exploration is robust and applies to most problem domains

  49. Talk Outline • Trade-offs in future microprocessors • Dynamic resource management • On-chip cache hierarchy • Clustered processors • Pre-execution threads • Future work

  50. Future Scenarios • Clustered designs can be used to produce all classes of processors • A library of simple cluster cores – with different energy, clock speed, latency, and parallelism characteristics • The role of the architect: putting these cores together on the chip and exploiting them to maximize performance
