Overview • Motivation (Kevin) • Thermal issues (Kevin) • Power modeling (David) • Thermal management (David) • Optimal DTM (Lev) • Clustering (Antonio) • Power distribution (David) • What current chips do (Lev) • HotSpot (Kevin)
The clustering approach • Reduce complexity by partitioning • Less latency, area, power and temperature • Fast, simple, distributed units • Communication latency is heterogeneous and exposed to the microarchitecture • Localize critical communication within clusters (fast wires) [Figure: global resources and clusters 0 to 3 connected by an interconnection network]
The clustering approach (...) • Smaller structures consume less power • Higher power efficiency [Zyuban, IEEE Transactions 01] • Partitioning simplifies power management • Via clock/power gating techniques [Bahar, ISCA 01] • Via dynamic cluster resizing [González, ICCD 03] • Via DVS/DFS • Partitioning reduces temperature • Activity is distributed [Chaparro, TACS 04] • Hopping schemes can be applied [Chaparro, TACS 04] • Adds flexibility for temperature-effective layouts • IPC overheads due to communication/imbalance • Compensated by shorter latency/clock period [Palacharla, ISCA 97], [Canal, HPCA 00]
Clustered microarchitecture • Dynamic steering • Distributed Issue, Registers, FUs • Inter-cluster register communication [Figure: Icache, fetch & decode and cluster steering logic feeding clusters C0 to C3, each with its own issue queue, register file and FUs, linked by an inter-cluster (IC) network]
On-demand communication [Canal, PACT 99] • Map table tracks locations of register values • At rename • allocate a register for the result in the assigned cluster • if a source operand is in a remote cluster • insert a copy instruction in the remote cluster • allocate a register for the copy • At commit • free the register(s) allocated by the previous mapping [Figure: register map table indexed by logical register, with one physical-register entry per cluster C0 to C3]
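The rename steps above can be sketched in a few lines of Python. This is only an illustration under assumptions: the `MapTable` class, its fields and the tuple layout of a copy are hypothetical names, registers are allocated from a simple per-cluster counter, and freeing at commit is not modeled.

```python
class MapTable:
    """Hypothetical map table: logical register -> {cluster: physical register}."""

    def __init__(self, num_clusters):
        self.next_free = [0] * num_clusters  # naive per-cluster allocator (assumption)
        self.location = {}

    def _alloc(self, cluster):
        phys = self.next_free[cluster]
        self.next_free[cluster] += 1
        return phys

    def rename(self, srcs, dst, cluster):
        """Rename one instruction steered to `cluster`; return the copies to insert."""
        copies = []
        for src in srcs:
            homes = self.location.get(src)
            if homes and cluster not in homes:
                # Source lives only in remote clusters: insert a copy instruction
                # in one of them and allocate a local register for the copy.
                remote, remote_phys = next(iter(homes.items()))
                local_phys = self._alloc(cluster)
                homes[cluster] = local_phys
                copies.append((src, remote, remote_phys, cluster, local_phys))
        # Allocate the result register in the assigned cluster. The previous
        # mapping would be freed at commit, which this sketch does not model.
        self.location[dst] = {cluster: self._alloc(cluster)}
        return copies
```

A second consumer of the same value in the same cluster finds the copy already mapped, so no further communication is generated.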
Rename [Figure: renaming table and steering logic for Cluster 1; columns src1 to src5 and dst, with logical and physical register rows]
Copy instructions [Figure: renaming table and steering logic for Cluster 2; a copy instruction with src1 = CL1:10 and dst = CL2:27; columns src1 to src5 and dst, with logical and physical register rows]
Broadcast communication • Values sent to all register files • Local file is updated earlier than remote ones • Registers are replicated in all files • Register storage waste • Increase in power • Values are written multiple times • Increase in power • May reduce communication penalties • Values are present everywhere • But not at the same time • E.g.: Alpha 21264
Cluster assignment schemes • Main goals • Minimize inter-cluster communication penalty • Maximize workload balance • Main approaches • Static approaches [Farkas, Micro 97] [Sastry, PLDI 98] • Less flexible than dynamic ones: poor load balancing • Dynamic, dependence-based [Palacharla, ISCA 97] [Alpha 21264] [Kemp, ICPP 96] • Only consider dependences through unavailable operands • Lack specific balancing mechanisms • Dynamic, workload-balance oriented [Baniasadi 00] • Only suitable for architectures with a low communication penalty • Dynamic, dependence-based and workload-balance oriented [Canal, HPCA 00] [Parcerisa, PACT 02] • Try to find the best trade-off between communication and workload balance
Cluster assignment schemes • Accurate-Rebalancing Priority RMB • 1. To minimize communication penalties: • If a source register is unavailable: choose the producer's cluster • Else: select the clusters with the highest number of source registers mapped • 2. Choose the least loaded of the clusters selected above • Exception: if imbalance > threshold, exclude clusters with positive workload before applying rules 1 and 2
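A minimal sketch of this steering rule follows. The slide does not define how imbalance or "positive workload" are measured, so the code assumes imbalance is the gap between the most and least loaded clusters, and that a cluster has positive workload when its load is above the average. The function name `steer` and all parameter names are hypothetical.

```python
def steer(src_clusters, pending_producer, load, threshold):
    """Pick a cluster for one instruction.

    src_clusters: cluster id of each source register that is already mapped
    pending_producer: cluster producing an unavailable source, or None
    load: pending instructions per cluster
    """
    n = len(load)
    candidates = list(range(n))
    avg = sum(load) / n
    # Exception: under strong imbalance, drop overloaded clusters first.
    # Imbalance = max - min load and "positive" = above average are assumptions.
    if max(load) - min(load) > threshold:
        candidates = [c for c in candidates if load[c] <= avg]
    # Rule 1: minimize communication penalties.
    if pending_producer is not None and pending_producer in candidates:
        preferred = [pending_producer]
    else:
        counts = {c: src_clusters.count(c) for c in candidates}
        best = max(counts.values())
        preferred = [c for c in candidates if counts[c] == best]
    # Rule 2: least loaded among the preferred clusters.
    return min(preferred, key=lambda c: load[c])
```

For example, `steer([0, 0], None, [9, 1, 1, 1], 4)` skips cluster 0 even though it holds both sources, because the imbalance exceeds the threshold.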
Evaluation SpecInt95
Dynamic vs. static steering • S. Sastry, S. Palacharla and J. E. Smith, PLDI 1998
Data cache architectures [González, WMPI 04] • Centralized • Dcache is a cluster • Single Load/Store queue • Simple disambiguation [Figure: four backends sharing one L1 Dcache, backed by UL2]
Data cache architecture (II) • Attraction caches • Lines are copied on demand • A coherence scheme is needed • Steering must exploit data locality [Figure: back ends BE 1 to BE 4, each with its own DL1, sharing UL2]
Data cache architecture (III) • Replicated • Area cost • Traffic due to store broadcast [Figure: back ends BE 1 to BE 4, each with its own DL1, sharing UL2]
Data cache architecture (IV) • Interleaved • Word/line interleaved • Steering needs to predict the bank [Figure: back ends BE 1 to BE 4, each with its own DL1 bank, sharing UL2]
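To make the two interleaving options concrete, the sketch below maps an address to a DL1 bank. The bank count, the 64-byte line and the 8-byte word are assumed values chosen for illustration, not figures from the slides, and the function names are hypothetical.

```python
def bank_line_interleaved(addr, num_banks=4, line_bytes=64):
    """Consecutive cache lines go to consecutive banks (sizes are assumptions)."""
    return (addr // line_bytes) % num_banks


def bank_word_interleaved(addr, num_banks=4, word_bytes=8):
    """Consecutive words go to consecutive banks (sizes are assumptions)."""
    return (addr // word_bytes) % num_banks
```

Because the address is not yet known when an instruction is steered, the steering logic can only predict the result of such a mapping, which is why bank prediction appears on the slide.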
Memory issues • Disambiguation • Load/Store queues are distributed • Stores are allocated in all clusters • Address is computed in one and broadcast • Loads go to memory once previous stores know their addresses • Memory coherence • Write-Invalidate / Write-Update protocols
Thermal benefits of clustering • Example layout for a quad-cluster architecture [Figure: floorplan with Clusters 0 to 3 and blocks ROB, FPS, CS, IS, ITLB, FPRF, IRF, RAT, DECO, TC, BP, MS/MOB, FPFU, IFU, DL0, DTLB, UL2]
Temperature metrics • AbsMax • Maximum sensed temperature • Average • Average temperature across time and area • AverageMax • Average temperature across time of maximum sensed temperature
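The three metrics can be written down directly. In this sketch, `samples` is assumed to be a list of time steps, each holding one reading per sensor, and the area average is unweighted, which implicitly assumes every sensor covers the same area. The function names are hypothetical.

```python
def abs_max(samples):
    """AbsMax: highest temperature sensed at any time by any sensor."""
    return max(max(step) for step in samples)


def average(samples):
    """Average: mean over all time steps and all sensors (equal-area assumption)."""
    total = sum(sum(step) for step in samples)
    count = sum(len(step) for step in samples)
    return total / count


def average_max(samples):
    """AverageMax: time average of the per-step maximum sensed temperature."""
    return sum(max(step) for step in samples) / len(samples)
```

With readings `[[60.0, 70.0], [80.0, 50.0]]`, AbsMax is 80.0, Average is 65.0 and AverageMax is 75.0.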
Clustering reduces temperature • If clustering is smart
Clustering effects • May end up with higher power densities! • Simpler and smaller units may create hotspots • Layout must be thermal-effective • Surround hotspots by cold areas • Activity steering must be smart • Other techniques (e.g. throttling) can be applied at smaller granularity • Aim at particular clusters without affecting others
Dynamic cluster resizing [González, ICCD 03] • Motivation
Dynamic cluster resizing • Proposal • Dynamically compute the energy of blocks • Schedulers, FUs, DL0s, etc. • Dynamically compute the energy × delay² (ED2P) of the processor • Use different configurations for different intervals • Measure the optimal configuration • Gate-off (disable) useless units • Scheduler level • Backend level
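As a rough illustration of the selection step, the sketch below computes energy × delay² for a set of measured configurations and keeps the one with the lowest value. How energy and delay are sampled per interval is not specified here, so `measured` is simply assumed to map a configuration (for example, a number of active back ends) to an `(energy, delay)` pair. Both function names are hypothetical.

```python
def ed2p(energy, delay):
    """Energy-delay-squared product of one interval."""
    return energy * delay ** 2


def best_config(measured):
    """Return the configuration with the lowest ED2P.

    measured: dict mapping configuration -> (energy, delay), an assumed format.
    """
    return min(measured, key=lambda cfg: ed2p(*measured[cfg]))
```

For instance, with `{3: (2.0, 1.0), 4: (2.5, 0.8), 2: (1.5, 1.3)}` the four-unit configuration wins: it spends more energy, but the squared delay term outweighs that.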
Dynamic cluster resizing [Figure: I$, Decode, Rename and Steer stages feeding back ends BE1 to BEn over a memory bus and a disambiguation bus, with the UL2 cache; configurations X-3 to X+3 (in general X-y to X+y) are each labeled with their ED2P, and neighbors are compared: ED2P(X) < ED2P(X+1) < ED2P(X-1)?]
Cluster hopping • Motivation • Power and average temperature savings when statically Vdd-gating clusters • Note: temperatures are measured in the backend area when gating all but the indicated cluster(s). Reductions over in-box ambient temperature (45 °C) are given with respect to a baseline quad-cluster architecture.
Cluster hopping • Based on activity migration [Heo, ISLPED 03] • Vdd gate a subset of clusters • Rotate clusters to spread activity over time • Gated clusters cannot provide any register value • Before gating, some register values must be evicted • Cache/DTLB contents are lost • Unless some low power (e.g. drowsy) mode is used • Proactive and/or reactive behavior • Proactive: Per interval basis • Reactive: On thermal events
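The proactive, per-interval variant can be sketched as a simple rotation of the gated set. This is an assumed round-robin policy for illustration only: it does not model register eviction, cache or DTLB loss, or reactive hops on thermal events, and `rotation_schedule` is a hypothetical name.

```python
def rotation_schedule(num_clusters, num_gated, intervals):
    """Return, for each interval, the set of Vdd-gated clusters.

    The gated window advances by one cluster per interval (assumed policy),
    so activity is spread over all clusters over time.
    """
    schedule = []
    for t in range(intervals):
        gated = {(t + i) % num_clusters for i in range(num_gated)}
        schedule.append(gated)
    return schedule
```

With four clusters and one gated at a time, every cluster rests once every four intervals.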
Cluster hopping schemes • Effective at reducing average temperature (thus leakage) but not max temperature [Figure: schemes 2dis-dia, 1dis-rot, 3dis-rot and 2dis-alt]
Thermal-aware steering • Try to minimize max temperature • Take into account cluster temperature when deciding destination • Some examples • Cold • Dispatch to coldest cluster with available resources • Lowest average temperature • Lowest peak temperature • T-Cold • Like Cold but discard clusters that are too hot • If difference in temperature with previous cluster (ordered by temperature) is higher than a threshold
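The Cold and T-Cold policies can be sketched as follows. The temperature passed in may be either the average or the peak per cluster, matching the two variants listed. The T-Cold cut is an interpretation: clusters are sorted by temperature, and the first gap larger than the threshold discards that cluster and every hotter one. Function and parameter names are hypothetical.

```python
def cold(temps, has_resources):
    """Coldest cluster that still has available resources, or None."""
    candidates = [c for c in range(len(temps)) if has_resources[c]]
    return min(candidates, key=lambda c: temps[c]) if candidates else None


def t_cold(temps, has_resources, threshold):
    """Like cold(), but clusters past a large temperature gap are discarded."""
    order = sorted(range(len(temps)), key=lambda c: temps[c])
    kept = [order[0]]
    for prev, cur in zip(order, order[1:]):
        if temps[cur] - temps[prev] > threshold:
            break  # cur and every hotter cluster are too hot (assumed reading)
        kept.append(cur)
    candidates = [c for c in kept if has_resources[c]]
    return min(candidates, key=lambda c: temps[c]) if candidates else None
```

The difference shows when only hot clusters have free resources: `cold` still dispatches to one of them, whereas `t_cold` returns `None`, which a pipeline model would treat as a dispatch stall.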
Thermal-aware steering • T-Thermal • Minimize communications unless the candidate cluster is too hot • If temperature difference > threshold: priority to the colder cluster • Otherwise: priority to the cluster that minimizes communications and, in case of a tie, maximizes workload balance (#instructions in the schedulers)
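One way to read the T-Thermal rule is as a pairwise preference applied across all clusters. The sketch below takes that reading, which is an assumption: the slide does not say how more than two candidates are compared, and a pairwise rule with a threshold is not guaranteed to be order-independent. The names `prefer` and `t_thermal` are hypothetical; `comms[c]` is the number of communications that steering to cluster `c` would require.

```python
from functools import reduce


def prefer(a, b, temps, comms, load, threshold):
    """Pick between clusters a and b following the T-Thermal priorities."""
    if abs(temps[a] - temps[b]) > threshold:
        return a if temps[a] < temps[b] else b  # too hot: colder one wins
    if comms[a] != comms[b]:
        return a if comms[a] < comms[b] else b  # fewer communications wins
    return a if load[a] <= load[b] else b       # tie: fewer scheduler entries


def t_thermal(temps, comms, load, threshold):
    """Fold the pairwise preference over all clusters (assumed combination)."""
    return reduce(
        lambda a, b: prefer(a, b, temps, comms, load, threshold),
        range(len(temps)),
    )
```

With a very large threshold the policy degenerates to plain communication-driven steering, while a small threshold makes it behave much like the coldest-cluster policy.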
Thermal-aware steering • Thermal-aware steering standalone
Hopping + thermal steering • Putting it all together
Clustering the front-end [Parcerisa, TR 02] [Figure: Fetch, Decode, Rename, Cluster Assignment, Dependence Checking and Br. Prediction in front of a distributed back-end; signals shown: PC, hit/miss, steering assignments, src/dst regs.]
Distributed branch predictor • Broadcast every prediction (next PC) to all clusters • Hardware loop: predictor uses PC as index • Insert a bubble when switching the predictor cluster (2) • If interleaving by low-order bits: frequent bubbles • Solution • Pipeline prediction ahead of the I-cache and interleave by high-order bits • Bubble only when a high-level interleave boundary is crossed (2) [Figure: Clusters 0 to 3, each holding part of the predictor table, with pipeline stages BrP, F, Dec, R, St, D ahead of the back-end; paths marked (1) and (2)]
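The effect of the interleaving choice on bubbles can be illustrated by counting how often consecutive PCs fall in different predictor banks. The 4-byte instruction size and the 4 KB region used for high-order interleaving are assumed values, and all function names are hypothetical.

```python
def low_bits_bank(pc, num_banks=4, inst_bytes=4):
    """Bank chosen by the low-order PC bits (instruction size is an assumption)."""
    return (pc // inst_bytes) % num_banks


def high_bits_bank(pc, num_banks=4, region_bytes=4096):
    """Bank chosen by higher PC bits (region size is an assumption)."""
    return (pc // region_bytes) % num_banks


def count_bank_switches(pcs, bank_of):
    """Number of times consecutive predictions come from different banks."""
    banks = [bank_of(pc) for pc in pcs]
    return sum(1 for prev, cur in zip(banks, banks[1:]) if prev != cur)
```

For 16 sequential instructions, low-order interleaving switches bank on every step, while high-order interleaving never leaves the first bank, which is the intuition behind the slide's solution.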
Impact of distributing branch predictor • Bank switching • SpecInt95: every 24 instructions • Mbench: every 133 instructions • IPC loss • SpecInt95: 0.5% • Mbench: no loss
Distributed cluster assignment • Make local assignments and broadcast them to all clusters • Loop: steering logic uses assignments made by other clusters • Partial solution: use outdated info (2 cycles) • Problem: outdated dependences generate communications • Solution: • anticipate dependence checking and • override the assignment if a dependence was violated [Figure: two pipeline diagrams with stages St, BrP, F, Dec, R, D and the back-end; (*) broadcast assignments; (**) broadcast register designators, early dependence check (Dep) and assignment override]
Impact of distributing assignment • W/o assignment overriding • 0.42 communications / instruction • More than 10% IPC loss • With assignment overriding • 0.17 communications / instruction • Less than 2% IPC loss
Thermal benefits • Clustering the rename table and the reorder buffer [Chaparro, 04]
Summary • Clustering is thermal-effective (in addition to complexity-effective) • Reduces power • Distributes activity • Clustering enables effective temperature control schemes • Adaptive configuration • DVS/DFS • Cluster hopping • Thermal steering