
Overview



  1. Overview • Motivation (Kevin) • Thermal issues (Kevin) • Power modeling (David) • Thermal management (David) • Optimal DTM (Lev) • Clustering (Antonio) • Power distribution (David) • What current chips do (Lev) • HotSpot (Kevin)

  2. The clustering approach • Reduce complexity by partitioning • Less latency, area, power and temperature • Fast, simple, distributed units • Communication latency is heterogeneous and exposed to the microarchitecture • Localize critical communication within clusters (fast wires) [Figure: global resources, cluster 0 to cluster 3, interconnection network]

  3. The clustering approach (...) • Smaller structures consume less power • Higher power efficiency [Zyuban, IEEE Transactions 01] • Partitioning simplifies power management • Via clock/power gating techniques [Bahar, ISCA 01] • Via dynamic cluster resizing [González, ICCD 03] • Via DVS/DFS • Partitioning reduces temperature • Activity is distributed [Chaparro, TACS 04] • Hopping schemes can be applied [Chaparro, TACS 04] • Adds flexibility for temperature-effective layouts • IPC overheads due to communication/imbalance • Compensated by shorter latency/clock period [Palacharla, ISCA 97], [Canal, HPCA 00]

  4. Clustered microarchitecture • Dynamic steering • Distributed Issue, Registers, FUs • Inter-cluster register communication [Figure: Icache, fetch & decode, cluster steering logic, clusters C0 to C3 with issue queue, register file and FUs, IC network]

  5. On-demand communication [Canal, PACT 99] • Map table tracks locations of register values • At rename • allocate register for result, in the assigned cluster • if a source operand is in a remote cluster • insert a copy instruction in remote cluster • allocate register for a copy • At commit • free the register(s) allocated by the previous mapping [Figure: register map table from logical register to physical register in clusters C0 to C3]
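The rename and commit steps on this slide can be sketched roughly as follows. This is a minimal illustration, not the mechanism from [Canal, PACT 99]: the `Renamer` class, its free-list layout and the `(logical register, source cluster, destination cluster)` copy tuple are all hypothetical names chosen for the example.

```python
class Renamer:
    """Toy map table: logical register -> {cluster: physical register}."""

    def __init__(self, n_clusters, regs_per_cluster):
        # One free list of physical registers per cluster (assumed layout).
        self.free = [list(range(regs_per_cluster)) for _ in range(n_clusters)]
        self.map = {}

    def rename(self, dst, srcs, cluster):
        """Rename one instruction steered to `cluster`.

        Returns (source physical regs, destination physical reg,
        copy instructions to insert, previous mapping of dst).
        """
        copies, src_phys = [], []
        for s in srcs:
            locations = self.map[s]
            if cluster not in locations:
                # Source lives only in a remote cluster: allocate a local
                # register for the copy and request a copy instruction.
                remote = next(iter(locations))
                locations[cluster] = self.free[cluster].pop()
                copies.append((s, remote, cluster))
            src_phys.append(locations[cluster])
        previous = self.map.get(dst, {})
        # Allocate the result register in the assigned cluster.
        self.map[dst] = {cluster: self.free[cluster].pop()}
        return src_phys, self.map[dst][cluster], copies, previous

    def commit(self, previous):
        """At commit, free every register held by the previous mapping."""
        for cluster, phys in previous.items():
            self.free[cluster].append(phys)
```

A second consumer of the same value in the same cluster finds the copy already mapped, so no further copy instruction is generated.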

  6. Rename [Figure: renaming table with logical and physical rows for src1 to src5 and dst, steering logic, cluster 1]

  7. Copy instructions [Figure: copy instruction with src1 CL1:10 and dst CL2:27, renaming table with logical and physical rows for src1 to src5 and dst, steering logic, cluster 2]

  8. Broadcast communication • Values sent to all register files • Local file is updated earlier than remote ones • Registers are replicated in all files • Register storage waste • Increase in power • Values are written multiple times • Increase in power • May reduce communication penalties • Values are present everywhere • But not at the same time • E.g.: Alpha 21264

  9. Cluster assignment schemes • Main goals • Minimize inter-cluster communication penalty • Maximize workload balance • Main approaches • Static approaches [Farkas, Micro 97] [Sastry, PLDI 98] • Less flexible than dynamic ones: poor load balancing • Dynamic, dependence-based [Palacharla, ISCA 97] [Alpha 21264] [Kemp, ICPP 96] • Only consider dependences through unavailable operands • Lack specific balancing mechanisms • Dynamic, workload balance oriented [Baniasadi 00] • Only suitable with low communication penalty architectures • Dynamic, dependence-based and workload balance oriented [Canal, HPCA 2000] [Parcerisa, PACT 2002] • Tries to find best trade-off between communications and workload balance

  10. Cluster assignment schemes • Accurate-Rebalancing Priority RMB • Rule 1, to minimize communication penalties: if a source register is unavailable, choose its producer’s cluster; else select the clusters with the highest number of source registers mapped • Rule 2: choose the least loaded of the clusters selected by rule 1 • Exception: if imbalance > threshold, exclude clusters with positive workload before applying rules 1 and 2
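The two rules and the exception can be sketched as a small selection function. The slide does not define "workload" or "imbalance", so this sketch assumes workload is a cluster's instruction count minus the average and imbalance is the gap between the most and least loaded clusters; the function name and parameters are hypothetical. It also assumes that a producer cluster excluded by the exception falls back to the mapping count.

```python
def steer(src_clusters, pending_producer, load, threshold):
    """Pick a cluster for one instruction.

    src_clusters: clusters where each available source register is mapped
    pending_producer: cluster producing an unavailable source, or None
    load: instructions currently assigned to each cluster (assumed metric)
    """
    n = len(load)
    average = sum(load) / n
    candidates = list(range(n))

    # Exception: under high imbalance, drop clusters with positive workload.
    # The least loaded cluster is never dropped, so candidates stays non-empty.
    if max(load) - min(load) > threshold:
        candidates = [c for c in candidates if load[c] - average <= 0]

    # Rule 1: follow the producer of an unavailable operand, otherwise
    # prefer the clusters holding the most source registers.
    if pending_producer is not None and pending_producer in candidates:
        selected = [pending_producer]
    else:
        mapped = {c: src_clusters.count(c) for c in candidates}
        best = max(mapped.values())
        selected = [c for c in candidates if mapped[c] == best]

    # Rule 2: among those, take the least loaded cluster.
    return min(selected, key=lambda c: load[c])
```

For example, with loads `[5, 1, 1, 1]` and a threshold of 2, cluster 0 is excluded even if it holds a source operand, which trades one extra communication for better balance.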

  11. Evaluation SpecInt95

  12. Dynamic vs. static steering [S. Sastry, S. Palacharla and J. E. Smith, PLDI 1998]

  13. Data cache architectures [González, WMPI 04] • Centralized • Dcache is a cluster • Single Load/Store queue • Simple disambiguation [Figure: four backends, L1 Dcache, UL2]

  14. Data cache architecture (II) • Attraction caches • Lines are copied on demand • A coherence scheme is needed • Steering must exploit data locality [Figure: UL2, four DL1 caches, backends BE 1 to BE 4]

  15. Data cache architecture (III) • Replicated • Area cost • Traffic and activity due to store broadcast [Figure: UL2, four DL1 caches, backends BE 1 to BE 4]

  16. Data cache architecture (IV) • Interleaved • Word/line interleaved • Steering needs to predict the bank [Figure: UL2, four DL1 caches, backends BE 1 to BE 4]
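As a rough illustration of word versus line interleaving, the bank that owns an address could be computed as below. The word size (8 bytes), line size (64 bytes) and bank count (4) are assumptions for the example, not values from the slides.

```python
WORD_BYTES = 8    # assumed word size
LINE_BYTES = 64   # assumed cache line size
N_BANKS = 4       # one DL1 bank per backend, as in the figure


def word_interleaved_bank(address):
    """Consecutive words map to consecutive banks."""
    return (address // WORD_BYTES) % N_BANKS


def line_interleaved_bank(address):
    """Consecutive cache lines map to consecutive banks."""
    return (address // LINE_BYTES) % N_BANKS
```

Because the address is not known when an instruction is steered, the steering logic would have to apply such a function to a predicted address, which is why the slide says the bank must be predicted.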

  17. Memory issues • Disambiguation • Load/Store queues are distributed • Stores are allocated in all clusters • Address is computed in one and broadcast • Loads go to memory once previous stores know their addresses • Memory coherence • Write-Invalidate / Write-Update protocols
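The disambiguation bullets can be sketched with a replicated store queue. This is only an illustration under simplifying assumptions: instructions carry a program-order sequence number, every cluster keeps its own copy of the store entries, and store-to-load forwarding and coherence are left out. `DistributedLSQ` and its methods are hypothetical names.

```python
class DistributedLSQ:
    """Per-cluster store queues holding store sequence number -> address."""

    def __init__(self, n_clusters):
        self.queues = [dict() for _ in range(n_clusters)]

    def allocate_store(self, seq):
        # A store gets an entry in every cluster; address still unknown.
        for queue in self.queues:
            queue[seq] = None

    def broadcast_address(self, seq, address):
        # The address is computed in one cluster and broadcast to all.
        for queue in self.queues:
            queue[seq] = address

    def load_may_issue(self, cluster, load_seq):
        # A load goes to memory once all older stores know their addresses.
        return all(address is not None
                   for seq, address in self.queues[cluster].items()
                   if seq < load_seq)
```

Younger stores with unknown addresses do not block the load, since only stores earlier in program order can alias with it.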

  18. Performance comparison

  19. Thermal benefits of clustering [Chaparro, TACS 04] [Figure: floorplan for a quad-cluster architecture. Blocks shown: trace cache, ITLB, decode (DECO), branch predictors, rename table, reorder buffer, unified L2 cache, and clusters 0 to 3 with integer scheduler, FP scheduler, copy scheduler, memory scheduler, integer register file, FP register file, level 1 data cache, DTLB, integer execution units and floating point execution units]

  20. Temperature metrics • AbsMax • Maximum sensed temperature • Average • Average temperature across time and area • AverageMax • Average across time of the maximum sensed temperature
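One possible reading of the three metrics, given periodic temperature samples per floorplan block, is sketched below. The input layout (a list of per-interval samples, one temperature per block) and the area weighting of Average are assumptions, since the slide does not say how the metrics are computed.

```python
def temperature_metrics(samples, areas):
    """samples[t][b]: temperature of block b at sample t; areas[b]: block area."""
    total_area = sum(areas)

    # AbsMax: the highest temperature sensed anywhere, at any time.
    abs_max = max(max(sample) for sample in samples)

    # Average: area-weighted mean per sample, then averaged across time.
    per_sample_mean = [
        sum(t * a for t, a in zip(sample, areas)) / total_area
        for sample in samples
    ]
    average = sum(per_sample_mean) / len(samples)

    # AverageMax: the hottest block of each sample, averaged across time.
    average_max = sum(max(sample) for sample in samples) / len(samples)

    return abs_max, average, average_max
```

AbsMax captures the worst-case hotspot, while AverageMax shows how hot the hottest spot is on a sustained basis.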

  21. Clustering reduces temperature • If clustering is smart

  22. Clustering effects • May end up with higher power densities! • Simpler and smaller units may create hotspots • Layout must be thermal-effective • Surround hotspots by cold areas • Activity steering must be smart • Other techniques (e.g. throttling) can be applied at smaller granularity • Aim at particular clusters without affecting others

  23. Dynamic cluster resizing [González, ICCD 03] • Motivation

  24. Dynamic cluster resizing • Proposal • Dynamically compute the energy of blocks • Schedulers, FUs, DL0s, etc. • Dynamically compute the energy × delay² (ED2P) of the processor • Use different configurations for different intervals • Measure the optimal configuration • Gate-off (disable) useless units • Scheduler level • Backend level
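A minimal sketch of the interval-based selection, assuming a configuration is identified by an integer (for example, the number of enabled backends) and that energy and delay have already been measured for the current configuration and its neighbours. The function names and the `measured` dictionary are hypothetical; the actual control loop in [González, ICCD 03] may differ.

```python
def ed2p(energy, delay):
    """Energy x delay^2 for one measured interval."""
    return energy * delay ** 2


def next_configuration(current, measured, smallest, largest):
    """Keep, shrink or grow the configuration, whichever has the lowest ED2P.

    measured: configuration -> (energy, delay) from trial intervals.
    """
    candidates = [c for c in (current - 1, current, current + 1)
                  if smallest <= c <= largest and c in measured]
    return min(candidates, key=lambda c: ed2p(*measured[c]))
```

Units outside the chosen configuration would then be gated off until the next measurement interval.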

  25. Dynamic cluster resizing • ED2P(x) < ED2P(x+1) < ED2P(x-1)? [Figure: I$, decode, rename, steer, UL2 cache, backends BE1 to BEn, memory bus and disambiguation bus; configurations x-y to x+y labelled with their ED2P values]

  26. Dynamic cluster resizing

  27. Cluster hopping [Chaparro, TACS 04] • Motivation • Power and average temperature savings when statically Vdd gating clusters * Temperatures in the backend area when gating all but the indicated cluster(s). Reductions over in-box ambient temperature (45°) with respect to a baseline quad-cluster architecture.

  28. Cluster hopping • Based on activity migration [Heo, ISLPED 03] • Vdd gate a subset of clusters • Rotate clusters to spread activity over time • Gated clusters cannot provide any register value • Before gating, some register values must be evicted • Cache/DTLB contents are lost • Unless some low power (e.g. drowsy) mode is used • Proactive and/or reactive behavior • Proactive: Per interval basis • Reactive: On thermal events
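The proactive, per-interval variant can be illustrated with a simple rotation. The round-robin order and the function name are assumptions for this sketch; the schemes evaluated in the deck may choose gated clusters differently.

```python
def gated_clusters(interval, n_clusters=4, n_gated=1):
    """Return the set of clusters to Vdd-gate during a given interval.

    The gated window advances by one cluster per interval, so activity
    (and heat) is spread over all clusters in turn.
    """
    start = interval % n_clusters
    return {(start + i) % n_clusters for i in range(n_gated)}
```

Before a cluster in the returned set is gated, live register values it holds would have to be evicted to an active cluster, as the slide notes. A reactive policy could call the same function on a thermal event instead of at every interval boundary.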

  29. Cluster hopping schemes • Effective at reducing average temperature (thus leakage) but not max temperature [Figure: schemes 2dis-dia, 1dis-rot, 3dis-rot, 2dis-alt]

  30. Thermal-aware steering [Chaparro, TACS 04] • Try to minimize max temperature • Take into account cluster temperature when deciding destination • Some examples • Cold • Dispatch to coldest cluster with available resources • Lowest average temperature • Lowest peak temperature • T-Cold • Like Cold but discard clusters that are too hot • If difference in temperature with previous cluster (ordered by temperature) is higher than a threshold

  31. Thermal-aware steering • T-Thermal • Minimize communications unless candidate cluster is too hot • If temperature difference > threshold → priority to the colder • Otherwise → priority to the one that minimizes communications and, in case of a tie, maximizes workload balance (#instructions in the schedulers)
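The Cold, T-Cold and T-Thermal policies can be sketched as follows. This is one interpretation of the bullets, not the implementation from [Chaparro, TACS 04]: the temperature per cluster is assumed to be a single number (average or peak), T-Thermal is assumed to reuse the T-Cold filter to decide which clusters are "too hot", and all names are hypothetical.

```python
def cold(temps, has_room):
    """Cold: coldest cluster that still has available resources."""
    candidates = [c for c in range(len(temps)) if has_room[c]]
    return min(candidates, key=lambda c: temps[c]) if candidates else None


def t_cold_candidates(temps, threshold):
    """T-Cold filter: walk clusters from coldest to hottest and stop at the
    first one that is more than `threshold` hotter than the previous one."""
    order = sorted(range(len(temps)), key=lambda c: temps[c])
    kept = [order[0]]
    for previous, current in zip(order, order[1:]):
        if temps[current] - temps[previous] > threshold:
            break
        kept.append(current)
    return kept


def t_thermal(temps, communications, load, threshold):
    """T-Thermal: among clusters that are not too hot, minimize the
    communications the instruction would need, then the scheduler load."""
    candidates = t_cold_candidates(temps, threshold)
    return min(candidates, key=lambda c: (communications[c], load[c]))
```

With a large threshold no cluster is filtered out and T-Thermal behaves like a communication-driven scheme with a load-balance tie-break; with a small threshold it approaches Cold.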

  32. Thermal-aware steering • Thermal-aware steering standalone

  33. Hopping + thermal steering • Putting it all together

  34. Clustering the front-end [Parcerisa, TR 02] [Figure: Br. Prediction, Fetch, Decode, Dependence Checking, Cluster Assignment (steering) and Rename blocks ahead of the distributed back-end; signals labelled PC, hit/miss, src/dst regs. and assignments]

  35. Distributed branch predictor • Broadcast every prediction (next PC) to all clusters • Hardware loop: predictor uses PC as index • Insert a bubble when switching the predictor cluster (2) • If interleaving by low-order bits: frequent bubbles • Solution • Pipeline prediction ahead of I-cache + interleave by high-order bits • Bubble only when a high-level interleave boundary is crossed (2) [Figure: predictor table distributed over clusters 0 to 3; pipeline stages St, BrP, F, Dec, R, D and back-end; paths marked (1) and (2)]
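The difference between low-order and high-order interleaving can be made concrete by counting how often the predictor bank changes along a sequence of PCs. The instruction size (4 bytes), region size (1024 bytes) and bank count (4) are assumptions for the example only.

```python
def bank_low_bits(pc, n_banks=4, instruction_bytes=4):
    """Interleave by low-order PC bits: neighbouring instructions differ."""
    return (pc // instruction_bytes) % n_banks


def bank_high_bits(pc, n_banks=4, region_bytes=1024):
    """Interleave by higher PC bits: a whole code region shares one bank."""
    return (pc // region_bytes) % n_banks


def count_bank_switches(pcs, bank_of):
    """Each switch of predictor bank would cost one pipeline bubble."""
    banks = [bank_of(pc) for pc in pcs]
    return sum(a != b for a, b in zip(banks, banks[1:]))
```

For straight-line code covering 2048 bytes, low-order interleaving changes bank on every instruction, while the high-order scheme changes bank only once, at the region boundary.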

  36. Impact of distributing branch predictor • Bank switching • SpecInt95: every 24 instructions • Mbench: every 133 instructions • IPC loss • SpecInt95: 0.5% • Mbench: no loss

  37. Distributed cluster assignment • Make local assignments and broadcast them to all clusters • Loop: steering logic uses assignments made by other clusters • Partial solution: use outdated info (2 cycles) • Problem: outdated dependences → generate communications • Solution: anticipate dependence checking and override the assignment if a dependence was violated [Figure: two pipeline diagrams with stages St, BrP, F, Dec, R, D and back-end, the second with an added Dep stage; annotations: broadcast assignments, override assignments, broadcast register designators]

  38. Impact of distributing assignment • W/o assignment overriding • 0.42 communications / instruction • More than 10% IPC loss • With assignment overriding • 0.17 communications / instruction • Less than 2% IPC loss

  39. Thermal benefits • Clustering the rename table and the reorder buffer [Chaparro, 04]

  40. Summary • Clustering is thermal-effective (in addition to complexity-effective) • Reduces power • Distributes activity • Clustering enables effective temperature control schemes • Adaptive configuration • DVS/DFS • Cluster hopping • Thermal steering
