Hardware Performance Counters for Detailed Runtime Power and Thermal Estimations: Experiences & Proposals
Canturk ISCI, Gilberto CONTRERAS, Margaret MARTONOSI
Hardware Performance Counters (HPCs) Go beyond Performance
• Several explored research avenues
  • Runtime power/thermal estimations
  • Dynamic management
  • Workload phases and application behavior prediction
• HPCs provide value beyond simulations
  • Long timescales
  • Real-system behavior
Canturk Isci, Gilberto Contreras, Margaret Martonosi
Hardware Performance Counters (HPCs) Go beyond Performance
• Runtime power
  • Isci & Martonosi [MICRO 2003]
  • Contreras & Martonosi [Submitted 2005]
• Runtime thermal
  • Lee & Skadron [HP-PAC in IPDPS 2005]
• Dynamic power management
  • Choi et al. [ISLPED 2004]
  • Weißel & Bellosa [CASES 2002]
• Dynamic thermal management
  • Bellosa et al. [COLP 2003]
• Workload phases and application behavior prediction
  • Isci & Martonosi [WWC 2003]
  • Duesterwald et al. [PACT 2003]
High-Performance Corner: P4 Power Estimation
• Idea:
  Power of component I = MaxPower[I] × ArchScaling[I] × AccessRate[I] + NonGatedPower[I]
• Motivation:
  • Fast (real-time)
  • Estimated view of on-chip detail (per physical component)
• Design:
  • Developed heuristics using 24 events to approximate access rates for 22 chip components
  • Used 15 counters with 4 rotations to collect all event data
• Validation:
  • Real-time estimates against real-time measured power
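The per-component formula on this slide can be sketched in a few lines of Python. This is only an illustration of the arithmetic: the `Component` class, the function names, and every numeric value are hypothetical and are not taken from the authors' estimator.

```python
from dataclasses import dataclass


@dataclass
class Component:
    """One physical chip component (all values here are illustrative)."""
    name: str
    max_power_w: float        # MaxPower[I]
    arch_scaling: float       # ArchScaling[I]
    non_gated_power_w: float  # NonGatedPower[I]


def component_power(component, access_rate):
    """MaxPower[I] x ArchScaling[I] x AccessRate[I] + NonGatedPower[I]."""
    return (component.max_power_w * component.arch_scaling * access_rate
            + component.non_gated_power_w)


def total_power(components, access_rates):
    """Sum the per-component estimates; access_rates maps name -> rate."""
    return sum(component_power(c, access_rates[c.name]) for c in components)


# Hypothetical example: two components with counter-derived access rates.
chip = [
    Component("int_alu", max_power_w=4.0, arch_scaling=0.5, non_gated_power_w=0.25),
    Component("l1_cache", max_power_w=2.0, arch_scaling=1.0, non_gated_power_w=0.5),
]
rates = {"int_alu": 0.5, "l1_cache": 0.25}
estimate_w = total_power(chip, rates)
```

In a real estimator the access rates would come from the counter-based heuristics mentioned under "Design"; here they are simply supplied as a dictionary.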
P4 Power Estimator Results
• Average difference: ~5% among all benchmarks
• SPEC CPU2000 & other applications
[Figure: modeled vs. measured power for Gcc, Gzip, Vpr, Vortex, Gap, Crafty]
Embedded Corner: PXA255 Power Estimation
• Idea:
  CPU Power [n×1] = PerformanceEvents [n×5] × LinearParameters [5×1] + IdlePower
  Mem Power [n×1] = PerformanceEvents [n×2] × LinearParameters [2×1] + IdlePower
• Motivation:
  • Runtime power optimizations under DVFS
• Design:
  • Parameter estimation (OLS) using dominant counter readings and live power measurements
  • Power estimation at various CPU configurations
• Validation:
  • Comparison between estimates and real-time measured power
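A minimal sketch of the OLS step for a linear model of this shape, using only the Python standard library. It assumes IdlePower is fitted as the intercept alongside the linear parameters (the slide does not say whether it is fitted or measured separately). The function names and the synthetic two-event data are hypothetical.

```python
def solve_linear_system(a, b):
    """Solve a * x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):  # back substitution
        s = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - s) / m[r][r]
    return x


def fit_power_model(event_rows, measured_power):
    """Ordinary least squares via the normal equations.

    event_rows: n samples, each a list of event rates.
    Returns (linear_parameters, idle_power); idle power is the intercept
    (an assumption of this sketch).
    """
    x = [list(row) + [1.0] for row in event_rows]  # trailing 1 -> intercept
    k = len(x[0])
    xtx = [[sum(r[i] * r[j] for r in x) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * p for r, p in zip(x, measured_power)) for i in range(k)]
    beta = solve_linear_system(xtx, xty)
    return beta[:-1], beta[-1]


def estimate_power(event_row, linear_parameters, idle_power):
    """PerformanceEvents x LinearParameters + IdlePower for one sample."""
    return sum(e * w for e, w in zip(event_row, linear_parameters)) + idle_power


# Synthetic two-event data generated from power = 2*e1 + 3*e2 + 0.5.
events = [[0, 0], [1, 0], [0, 1], [1, 1], [2, 1]]
power = [0.5, 2.5, 3.5, 5.5, 7.5]
params, idle = fit_power_model(events, power)
```

The same fit would apply to the five-event CPU model by passing rows of five event rates; in practice a numerical library would replace the hand-written solver.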
PXA255 Results
• 5% average error across 3 domains
  • Java CDC
  • Java CLDC
  • SPEC2000
[Figure: results plot; surviving labels: DB, CDC, Java]
Proposals from Experiences
• 1. Track each physical unit individually for power & thermal:
  • Ex: DispatchPorts, Instr-n Queue1, TraceCache, MEM, μopQueue, Allocate, Rename, Schedulers, μCodeROM, Instr-n Queue2, EXE: all tracked with in-flight μops written to μop queue
  • Need individual utilization counts for each physical unit available on die for power and hotspot analyses
Proposals from Experiences
• 2. Need bitline activity counts
  • Utilization is not complete information; power depends in part on switching factor
  • Not necessarily fully detailed counts
  • Accumulate bitwise XOR of current and previous input/output ports
  • Sample RegFile ports/bit populations
[Figure: 30 mW (10%) swing, 400 MHz, 1.3 V PXA255 processor]
Proposals from Experiences
• 2. Need bitline activity counts (continued)
[Figure: add-operand sequences A and B with different bit patterns (e.g. 111…11 + 000…01); 20 mW swing, 400 MHz, 1.3 V PXA255 processor]
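The "accumulate bitwise XOR of current and previous" idea can be sketched as a software model of such a counter. The class name, the 32-bit width, and the two value streams below are hypothetical; they are not the operand sequences from the slide's figure.

```python
def toggled_bits(previous, current, width=32):
    """Number of bit positions that differ between two port values."""
    mask = (1 << width) - 1
    return bin((previous ^ current) & mask).count("1")


class BitlineActivityCounter:
    """Accumulates bit toggles seen on one port (hypothetical counter model)."""

    def __init__(self, width=32):
        self.width = width
        self.previous = 0
        self.accesses = 0  # plain utilization count
        self.toggles = 0   # accumulated popcount of XORs

    def observe(self, value):
        self.toggles += toggled_bits(self.previous, value, self.width)
        self.previous = value & ((1 << self.width) - 1)
        self.accesses += 1


# Two streams with identical utilization but very different switching.
steady = BitlineActivityCounter()
for value in [0x00000001] * 4:
    steady.observe(value)

flipping = BitlineActivityCounter()
for value in [0xFFFFFFFF, 0x00000000, 0xFFFFFFFF, 0x00000000]:
    flipping.observe(value)
```

Both counters record four accesses, yet the first sees a single bit toggle and the second sees 128, which is the gap between utilization and switching activity that this proposal targets.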
Proposals from Experiences
• 3. More detailed off-chip/memory access support in the embedded domain
  • Mem power ~40% of system power
  • Tracking memory hierarchy transactions may help render better memory power estimates
  • Main memory reads/writes
  • Core + DMA
  • Transaction length in bytes
  • Activity factors can be shared with RegFile
[Figure: REX, memory power consumption (one 16b bank)]
Proposals from Experiences
• 4. Metrics related to queue occupancy
  • Modern processor ≡ Several queues
  • Depending on implementation, Power ∝ Queue occupancy
  • Buyuktosunoglu et al. [ISLPED’02], Tradeoffs in Power-Efficient Issue Queue Design
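One way to picture an occupancy metric is a counter that adds the number of valid queue entries every cycle, so that average occupancy is the accumulated sum divided by elapsed cycles. The sketch below assumes a simple linear relation between occupancy and power; the class, the function, and the per-entry and base wattages are hypothetical.

```python
class OccupancyCounter:
    """Accumulates valid-entry counts per cycle for one queue (hypothetical)."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.cycles = 0
        self.entry_cycles = 0  # sum over cycles of valid entries

    def tick(self, valid_entries):
        if not 0 <= valid_entries <= self.capacity:
            raise ValueError("valid_entries outside queue capacity")
        self.cycles += 1
        self.entry_cycles += valid_entries

    def average_occupancy(self):
        return self.entry_cycles / self.cycles if self.cycles else 0.0


def queue_power_w(counter, watts_per_entry, base_watts=0.0):
    """Linear occupancy-to-power model (an assumption of this sketch)."""
    return base_watts + watts_per_entry * counter.average_occupancy()


# Hypothetical 8-entry issue queue observed for four cycles.
issue_queue = OccupancyCounter(capacity=8)
for valid in [2, 4, 6, 4]:
    issue_queue.tick(valid)
estimate_w = queue_power_w(issue_queue, watts_per_entry=0.25, base_watts=0.5)
```

Whether power actually tracks occupancy depends on the queue implementation, as the slide notes, so the linear form is only a placeholder.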
Proposals from Experiences
• 5. General/aggregate metrics in addition to specialized cases/breakdowns simplify runtime sampling for unit accesses
  • P4 ex1. MOB: Only event MOB_load_replays
    • Counts replays for unknown store addr./data, partial/unaligned addr. match
    • No info for MOB entries/accesses/updates
  • P4 ex2. FPU: Has 8 separate events (with 2 dedicated ESCRs)
    • Need at least 4 rotations to collect
  • P4 ex3. INT ALU: No dedicated event
Additional Comments for HPC Design
• General/aggregate metrics in addition to specialized cases/breakdowns simplify runtime sampling for unit accesses
  • Metrics related to RegFile accesses vs. forwarding
• Semi-distributed implementations will always induce dependencies among simultaneously countable events
  • Higher parallelism among (power-oriented) metrics for minimal counter rotations at runtime
• Implementations that allow counter rotations without need for intermediate logging
  • Partitioned / Dual-mode / Buffered counters
• Different events for different types of accesses to same units with different magnitude power implications
  • e.g. branch scan < BHT update < BTA update
• Different API/SW demands:
  • Lightweight implementations for runtime analyses
  • Per-thread for application profiling vs. global for real-time measurement comparisons and hotspots
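To make the cost of counter rotations concrete, the sketch below shows one common way to handle events that cannot be counted simultaneously: each rotation counts only a subset of events, and each raw count is then scaled by the ratio of total time to the time that event was actually being counted. This is a generic time-multiplexing approach, not necessarily what the authors implemented; the function name, event names, and sample values are hypothetical.

```python
def extrapolate_counts(samples, total_cycles):
    """Scale rotated counter readings up to the full measurement window.

    samples: list of (interval_cycles, {event_name: raw_count}) tuples,
             one per rotation interval; only the events programmed in that
             rotation appear in the dictionary.
    Returns {event_name: estimated count over total_cycles}.
    """
    raw = {}
    active_cycles = {}
    for interval_cycles, counts in samples:
        for event, count in counts.items():
            raw[event] = raw.get(event, 0) + count
            active_cycles[event] = active_cycles.get(event, 0) + interval_cycles
    return {event: raw[event] * total_cycles / active_cycles[event]
            for event in raw}


# Two rotations alternating over four 1000-cycle intervals (made-up numbers).
samples = [
    (1000, {"uops_retired": 500}),
    (1000, {"loads": 200}),
    (1000, {"uops_retired": 700}),
    (1000, {"loads": 100}),
]
estimates = extrapolate_counts(samples, total_cycles=4000)
```

Each event here is observed for only half of the window, so its estimate rests on the assumption that behavior in the unobserved intervals resembles the observed ones. More rotations shrink the observed fraction further, which is why higher parallelism among power-related events matters.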
Wishlist for Power/Thermal
• 1) For each physical unit on die, separate events to track utilization rates
  • Sub-events for different types of accesses with different power costs
• 2) Bitline activity counters for switching units
• 3) Occupancy counters for related queues
• 4) Counter support for off-core memory accesses
• 5) High parallelism among power events for minimal counter rotations
Conclusions
• New opportunities remain to be explored in future PMC designs for power and thermal studies
  • Direct correspondence to physical units
  • Bitline and occupancy counters
• We believe these additions are feasible given the continuing emphasis on counter design, as long as power is also considered a primary design target.