
Canturk ISCI Gilberto CONTRERAS Margaret MARTONOSI

Hardware Performance Counters for Detailed Runtime Power and Thermal Estimations: Experiences & Proposals


Presentation Transcript


  1. Hardware Performance Counters for Detailed Runtime Power and Thermal Estimations: Experiences & Proposals
     Canturk ISCI, Gilberto CONTRERAS, Margaret MARTONOSI

  2. Hardware Performance Counters (HPCs) Go beyond Performance
     • Several explored research avenues
        • Runtime power/thermal estimations
        • Dynamic management
        • Workload phases and application behavior prediction
     • HPCs provide value beyond simulations
        • Long timescales
        • Real-system behavior

  3. Hardware Performance Counters (HPCs) Go beyond Performance
     • Runtime power
        • Isci & Martonosi [MICRO 2003]
        • Contreras & Martonosi [Submitted 2005]
     • Runtime thermal
        • Lee & Skadron [HP-PAC in IPDPS 2005]
     • Dynamic power management
        • Choi et al. [ISLPED 2004]
        • Weißel & Bellosa [CASES 2002]
     • Dynamic thermal management
        • Bellosa et al. [COLP 2003]
     • Workload phases and application behavior prediction
        • Isci & Martonosi [WWC 2003]
        • Duesterwald et al. [PACT 2003]

  4. High-Performance Corner: P4 Power Estimation
     • Idea: Power of component I = MaxPower[I] × ArchScaling[I] × AccessRate[I] + NonGatedPower[I]
     • Motivation:
        • Fast (real-time)
        • Estimated view of on-chip detail (per physical component)
     • Design:
        • Developed heuristics using 24 events to approximate access rates for 22 chip components
        • Used 15 counters with 4 rotations to collect all event data
     • Validation:
        • Real-time estimates against real-time measured power
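
The per-component model on this slide can be illustrated with a minimal Python sketch. Only the formula (MaxPower × ArchScaling × AccessRate + NonGatedPower) comes from the slide; the component names, parameter values, and access rates below are hypothetical placeholders, not figures from the talk.

```python
def component_power(max_power, arch_scaling, access_rate, non_gated_power):
    """Power of one component, following the slide's formula."""
    return max_power * arch_scaling * access_rate + non_gated_power


# Hypothetical parameters (watts and unitless factors); not measured P4 data.
components = {
    "trace_cache": {"max_power": 4.0, "arch_scaling": 1.0, "non_gated_power": 0.5},
    "int_alu": {"max_power": 2.0, "arch_scaling": 0.5, "non_gated_power": 0.25},
}

# Hypothetical access rates, as would be derived from counter readings.
access_rates = {"trace_cache": 0.5, "int_alu": 0.25}


def estimate_power(components, access_rates):
    """Return a per-component power breakdown for one sampling interval."""
    return {
        name: component_power(
            p["max_power"], p["arch_scaling"], access_rates[name], p["non_gated_power"]
        )
        for name, p in components.items()
    }


per_component = estimate_power(components, access_rates)
total_power = sum(per_component.values())
print(per_component, total_power)
```

In a real deployment the access rates would be recomputed every sampling interval from the rotated counter readings, and the total would be compared against measured power.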

  5. P4 Power Estimator Results
     • Average difference: ~5% among all benchmarks
        • SPEC CPU2000 & other applications
     [Figure: measured vs. modeled power for Gcc, Gzip, Vpr, Vortex, Gap, Crafty]

  6. Embedded Corner: PXA255 Power Estimation
     • Idea:
        • CPU Power (n×1) = PerformanceEvents (n×5) × LinearParameters (5×1) + IdlePower
        • Mem Power (n×1) = PerformanceEvents (n×2) × LinearParameters (2×1) + IdlePower
     • Motivation:
        • Runtime power optimizations under DVFS
     • Design:
        • Parameter estimation (OLS) using dominant counter readings and live power measurements
        • Power estimation at various CPU configurations
     • Validation:
        • Comparison between estimates and real-time measured power
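
The linear model and its OLS parameter estimation can be sketched as follows. This is an illustration under stated assumptions, not the authors' tooling: the two-event layout mirrors the memory-power model's n×2 shape, but the event readings, power samples, and "true" parameters are synthetic, and the fit uses plain normal equations in pure Python.

```python
def solve_linear_system(a, b):
    """Solve a x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    aug = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(n):
            if r != col:
                factor = aug[r][col] / aug[col][col]
                aug[r] = [x - factor * y for x, y in zip(aug[r], aug[col])]
    return [aug[i][n] / aug[i][i] for i in range(n)]


def fit_ols(event_rows, power_samples):
    """Fit power = events . params + idle_power by ordinary least squares."""
    design = [row + [1.0] for row in event_rows]  # extra column for idle power
    k = len(design[0])
    xtx = [[sum(r[i] * r[j] for r in design) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * p for r, p in zip(design, power_samples)) for i in range(k)]
    coeffs = solve_linear_system(xtx, xty)
    return coeffs[:-1], coeffs[-1]


def predict(event_row, params, idle_power):
    """Estimate power for one sample from its counter readings."""
    return sum(e * w for e, w in zip(event_row, params)) + idle_power


# Synthetic training data generated from params [0.2, 0.05] and idle power 0.1.
event_rows = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, 1.0], [1.0, 3.0]]
power_samples = [0.3, 0.15, 0.35, 0.55, 0.45]

params, idle_power = fit_ols(event_rows, power_samples)
estimate = predict([2.0, 2.0], params, idle_power)
print(params, idle_power, estimate)
```

Under DVFS, one plausible arrangement (an assumption here) is to fit a separate parameter set per voltage/frequency configuration, which matches the slide's "power estimation at various CPU configurations".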

  7. PXA255 Results
     • 5% average error across 3 domains
        • Java CDC
        • Java CLDC
        • SPEC2000
     [Figure: results plot; surviving labels: DB, CDC, Java]

  8. Proposals from Experiences
     • 1. Track each physical unit individually for power & thermal:
        • Ex: [Diagram units: DispatchPorts, Instr-n Queue1, TraceCache, MEM, μopQueue, Allocate, Rename, Schedulers, μCodeROM, Instr-n Queue2, EXE] All tracked with in-flight μops written to μop queue
        • Need individual utilization counts for each physical unit available on die for power and hotspot analyses

  9. Proposals from Experiences
     • 2. Need bitline activity counts
        • Utilization is not complete information; power in part depends on switching factor
        • Not necessarily fully detailed counts
           • Accumulate bitwise XOR of current and previous input/output ports
           • Sample RegFile ports/bit populations
     [Figure: 30 mW (10%) swing; 400 MHz, 1.3 V PXA255 processor]

  10. Proposals from Experiences
     • 2. Need bitline activity counts (continued)
     [Figure: addition operand sequences A and B with differing bit patterns (111…11, 000…01, 000…00, ...); 20 mW swing; 400 MHz, 1.3 V PXA255 processor]
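
The proposed XOR accumulation can be sketched in a few lines of Python. The 32-bit port width and the two value streams are assumptions chosen for illustration; they are not the operand sequences from the slide's figure.

```python
def toggled_bits(previous, current, width=32):
    """Number of bit positions that differ between two consecutive port values."""
    mask = (1 << width) - 1
    return bin((previous ^ current) & mask).count("1")


def accumulate_activity(port_values, width=32):
    """Accumulate the bitwise-XOR population count over a stream of port values."""
    total = 0
    for previous, current in zip(port_values, port_values[1:]):
        total += toggled_bits(previous, current, width)
    return total


# Two hypothetical streams with identical utilization (8 accesses each)
# but very different switching activity.
alternating = [0xFFFFFFFF, 0x00000000] * 4  # every bit flips on each access
steady = [0x00000001] * 8                   # no bit ever flips

alternating_activity = accumulate_activity(alternating)
steady_activity = accumulate_activity(steady)
print(alternating_activity, steady_activity)
```

Both streams would look the same to a utilization counter, which is the gap the slide points at: only the toggle count separates them.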

  11. Proposals from Experiences
     • 3. More detailed off-chip/memory access support in the embedded domain
        • Mem Power ~40% of system power
        • Tracking memory hierarchy transactions may help render better memory power estimates
           • Main memory reads/writes
           • Core + DMA
           • Transaction length in bytes
        • Activity factors can be shared with RegFile
     [Figure: REX, memory power consumption (one 16b bank)]

  12. Proposals from Experiences
     • 4. Metrics related to queue occupancy
        • Modern processor ≡ Several queues
        • Depending on implementation, Power ∝ Queue occupancy
     [Figure source: Buyuktosunoglu et al., "Tradeoffs in Power-Efficient Issue Queue Design" [ISLPED’02]]
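
A queue occupancy metric of the kind proposed here could behave like the sketch below. The class name, the per-cycle sampling interface, and the per-entry power constant are all hypothetical; the only idea taken from the slide is that, for some implementations, power scales with occupancy.

```python
class OccupancyCounter:
    """Accumulates occupied entries per cycle, like a hypothetical hardware counter."""

    def __init__(self):
        self.cycles = 0
        self.entry_cycles = 0

    def tick(self, occupied_entries):
        """Record one cycle in which `occupied_entries` queue slots were in use."""
        self.cycles += 1
        self.entry_cycles += occupied_entries

    def average_occupancy(self):
        """Mean number of occupied entries per cycle over the sampled window."""
        return self.entry_cycles / self.cycles if self.cycles else 0.0


def queue_power(average_occupancy, power_per_entry):
    """Occupancy-proportional power estimate (arbitrary units, assumed model)."""
    return power_per_entry * average_occupancy


counter = OccupancyCounter()
for occupied in [0, 4, 8, 4]:  # hypothetical per-cycle occupancy samples
    counter.tick(occupied)

average = counter.average_occupancy()
power = queue_power(average, power_per_entry=0.25)
print(average, power)
```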

  13. Proposals from Experiences
     • 5. General/aggregate metrics, in addition to specialized cases/breakdowns, simplify runtime sampling for unit accesses
        • P4 ex1. MOB: Only event is MOB_load_replays
           • Counts replays for unknown store address/data and partial/unaligned address matches
           • No info for MOB entries/accesses/updates
        • P4 ex2. FPU: Has 8 separate events (with 2 dedicated ESCRs)
           • Need at least 4 rotations to collect
        • P4 ex3. INT ALU: No dedicated event

  14. Additional Comments for HPC Design
     • General/aggregate metrics, in addition to specialized cases/breakdowns, simplify runtime sampling for unit accesses
     • Metrics related to RegFile accesses vs. forwarding
     • Semi-distributed implementations will always induce dependencies among simultaneously countable events
        • Higher parallelism among (power-oriented) metrics for minimal counter rotations at runtime
     • Implementations that allow counter rotations without need for intermediate logging
        • Partitioned / Dual-mode / Buffered counters
     • Different events for different types of accesses to the same units with different-magnitude power implications
        • e.g., branch scan < BHT update < BTA update
     • Different API/SW demands:
        • Lightweight implementations for runtime analyses
        • Per-thread for application profiling vs. global for real-time measurement comparisons and hotspots
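
Counter rotation, as used in the P4 work and discussed on this slide, can be sketched as a round-robin schedule over event groups. The slides do not say how rotated counts are combined; the proportional scaling below is an assumption borrowed from common time-multiplexing practice, and the event names and trace values are hypothetical.

```python
def rotate_and_extrapolate(event_groups, trace):
    """Count one event group per interval, then scale to the full window.

    event_groups: list of lists of event names that fit on the counters together.
    trace: per-interval dict of true event counts (stands in for the hardware).
    """
    observed = {event: 0 for group in event_groups for event in group}
    active_intervals = {event: 0 for event in observed}

    for i, interval in enumerate(trace):
        group = event_groups[i % len(event_groups)]  # round-robin rotation
        for event in group:
            observed[event] += interval[event]
            active_intervals[event] += 1

    total_intervals = len(trace)
    # Assumed extrapolation: scale by the fraction of time each event was counted.
    return {
        event: observed[event] * total_intervals / active_intervals[event]
        for event in observed
        if active_intervals[event]
    }


# Hypothetical setup: four events, but only two can be counted at once.
event_groups = [["uops", "loads"], ["fp_ops", "branches"]]
trace = [{"uops": 10, "loads": 4, "fp_ops": 2, "branches": 1} for _ in range(4)]

estimates = rotate_and_extrapolate(event_groups, trace)
print(estimates)
```

The extrapolation is exact only when event rates are steady across intervals. With more rotations, each event is observed for a smaller share of the time, which is one reason the slide asks for higher parallelism among power-oriented events.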

  15. Wishlist for Power/Thermal
     • 1) For each physical unit on die, separate events to track utilization rates
        • Sub-events for different types of accesses with different power costs
     • 2) Bitline activity counters for switching units
     • 3) Occupancy counters for related queues
     • 4) Counter support for off-core memory accesses
     • 5) High parallelism among power events for minimal counter rotations

  16. Conclusions
     • New opportunities remain to be explored in future PMC designs for power and thermal studies
        • Direct correspondence to physical units
        • Bitline and occupancy counters
     • We believe in the feasibility of these additions with the continuing emphasis given to counter design, as long as power is also considered a primary design target.
