670 likes | 887 Views
Runtime Processor Power Monitoring. VICE VERSA 07/18/2003 Talk Canturk ISCI. Detailed. Touch. INTRODUCTION. Runtime processor power Measurement with HW Estimation with Performance counters CPU Unit Power Breakdowns Runtime verification Processor thermal modeling
E N D
Runtime Processor Power Monitoring VICE VERSA 07/18/2003 Talk Canturk ISCI
Detailed Touch INTRODUCTION • Runtime processor power • Measurement with HW • Estimation with Performance counters • CPU Unit Power Breakdowns • Runtime verification • Processor thermal modeling • Power Phase Behavior of programs • Mapping between power behavior and program structure
Real Power Measurement Performance Monitoring Program Profiling Real Power Measurement Performance Monitoring Power Modeling Program Structure Power Modeling Thermal Modeling Power Phases Thermal Modeling THE BIG PICTURE Bottom line… • To Estimate component power & temperature breakdowns for P4 at runtime… • To analyze how power phase behavior relates to program structure
Real Power Measurement Real Power Measurement Performance Monitoring Performance Monitoring Power Modeling Power Modeling Thermal Modeling Thermal Modeling Agenda • Performance Monitoring • P4 Performance Counters • Performance Reader LKM • Real Power Measurement • P4 Power Measurement Setup • Examples • Power Modeling • P4 Power Model • Model + Measurement Sync Setup, Verification • Thermal Modeling • Brief Thermal Model Intro • Ppro Thermal Model Results
Program Profiling Program Profiling Power Modeling Program Structure Program Structure Power Phases Power Phases Bonus Material • Power Phase Behavior • Similarity Based on Power Vectors • Identifying similar program regions • Profiling Execution Flow • Sampling process’ execution • “PCsampler” LKM • Program Structure • Execution vs. Code space • Power Phases Exec. Phases • <OR VICE VERSA>
Real Power Measurement Performance Monitoring Performance Monitoring Power Modeling Thermal Modeling Performance Monitoring • Related Work • Performance Monitoring • P4 Performance Counters • Performance Reader LKM • Real Power Measurement • P4 Power Measurement Setup • Examples • Power Modeling • P4 Power Model • Model + Measurement Sync Setup, Verification • Thermal Modeling • Refined Thermal Model • Ex: Ppro Thermal Model
Live CPU Performance Monitoring with Hardware Counters • Most CPUs have hardware performance counters • P4Performance Monitoring HW: • 18 Event Counters • 18 Counter Configuration Control Registers • Configure how to count • 45 Event Selection Control Registers • Configure what to count • Additional Control Registers
Our Event-Counter: Performance Reader • Performance Reader implemented as Linux Loadable Kernel Module • Implements 6 syscalls: • select_events() • reset_event_counter() • start_event_counter() • stop_event_counter() • get_event_counts() • set_replay_MSRs() • User Level Interface: • Defines the events Starts counters • Stops counters Reads counters & TSC • Event Types: • 59 event classes • 100s of events to count
Performance Reader: Example Validation • L1_Dcache benchmark • Controls cache hit behavior • Validated against measured cache events • Vary hit rate from 0-100%
Real Power Measurement Real Power Measurement Performance Monitoring Power Modeling Thermal Modeling Processor Power Measurement • Related Work • Performance Monitoring • P4 Performance Counters • Performance Reader LKM • Real Power Measurement • P4 Power Measurement Setup • Examples • Power Modeling • P4 Power Model • Model + Measurement Sync Setup, Verification • Thermal Modeling • Refined Thermal Model • Ex: Ppro Thermal Model
Serial Reader (PowerMeter) (PowerPlotter) P4 Power Measurement Setup Clamp ammeter on 12V lines on measured CPU DMM reading clamp voltages 1mV/Adc conversion Voltage readings via RS232 to logging machine Convert to Power vs. time window
“Branch exercise” (Taken rate: 1) “High-Low” “L1Dcache” Array Size 1/100 of L1 “L1Dcache” Array Size x25 of L1~L2 “L1Dcache” Array Size x4 of L2 “Fast” Benchmark Execution Initialization PowerPlotter: Example
SPEC Power Examples • Different programs show very different power characteristics • Timescale of interest can be huge => inaccessible via simulation
Real Power Measurement Performance Monitoring Power Modeling Power Modeling Thermal Modeling Processor Power Modeling • Related Work • Performance Monitoring • P4 Performance Counters • Performance Reader LKM • Real Power Measurement • P4 Power Measurement Setup • Examples • Power Modeling • P4 Power Model • Model + Measurement Sync Setup, Verification • Thermal Modeling • Refined Thermal Model • Ex: Ppro Thermal Model
Define components (I.e. L1 cache, BPU, Regs, etc.), whose powers we’ll model: • from annotated layout Define Components Define Components • Determine combination of P4 events that represent component accesses best Define Events Define Events • Gather counter info with minimal power overhead and program interruption Real Power Measurement Real Power Measurement Performance Monitoring Performance Monitoring • Convert counter info into component power breakdowns Power Modeling Power Modeling • Verify total power against measured processor power P4 POWER MODEL
Memory Subsystem ITLB & Ifetch 2nd Level BPU Mem Cntrl L1 Cache Int RF Instr-n Queue Int & FP Execution Int & FP Execution Int & FP Execution Int & FP Execution Int EXE MOB DTLB Scheduler Check Replay Retire Instr-n Decode FP FP FP RF O-O-O Engine O-O-O Engine O-O-O Engine In Order Front End In Order Front End Trace Trace Instr-n Queue EXE EXE Cache Cache 1st Level BPU Rename Ucode ROM Allocate Defining Components Bus Control L2 Cache ITLB & Ifetch 2nd Level BPU Mem Cntrl L1 Cache Int RF Instr-n Queue Int EXE MOB DTLB Scheduler Check Replay Retire Instr-n Decode FP RF Instr-n Queue 1st Level BPU Rename Ucode ROM Allocate
Defining Events Access Rates • We determined 24 events to approximate access rates for 22 components • Used Several Heuristicsto represent each access rate • Examples: • Need to rotate counters 4 times to collect all event data • Used 15 counters & 4 rotationsto collect all event data
1 Access Rates Component Powers • “Performance Counter based Access Rate estimations are used as proxy for max component power weighting together with microarchitectural details in order to estimate processor sub-unit powers” • EX: Trace cache delivers 3 uops/cycle in deliver mode and 1 uop/cycle in build mode: • Power(TC)=[Access-Rate(TC)/3 + Access-Rate(ID)] x MaxPower(TC) + Non-gated TC CLK power • Total power is computed as the sum of all 22 component powers + measured idle power (8W):
Serial Reader (PowerMeter) (PowerPlotter) Experiment Setup – Recall: Clamp ammeter on 12V lines on measured CPU DMM reading clamp voltages 1mV/Adc conversion Voltage readings via RS232 to logging machine Convert to Power vs. time window
Experiment Setup 1mV/Adc conversion Voltage readings via RS232 to logging machine
Experiment Setup 1mV/Adc conversion POWER SERVER Voltage readings via RS232 to logging machine Component access rates over ethernet POWER CLIENT Convert voltage to measured power Convert access rates to modeled powers Sync together in time window
Measured Modeled Tuning Benchmarks “Branch exercise” (Taken rate: 1) “High-Low” “L1Dcache” (Hit Rate : 0.1) “Fast”
Component Breakdowns Component Breakdowns for “branch_exercise” Colors for 4 CPU subsystems Execution Issue - Retire
High issue, exec. & branch power High L2 Cache Power High L1 Cache Power High Bus Power Benchmark Power Breakdowns
SPEC2000 Results VPR Elaboration:Integer benchmark2 runs: 1st Placement, 2nd Route1st run much stable power, 2nd more variablePlacement has higher miss than route < L2 pwr>Significant FPE power due to x87_SIMD_moves Equake Elaboration: (FP benchmark)Initialization and computation phasesFP intensive mesh computation phaseInitialization with high complex IA32 instructions Twolf Elaboration:(Integer benchmark)Several loop computations traversing memory<High Memory Power>Although ~const. Total power, component powers have slight gradients
Average SPEC Total Powers • 1st set: Overall, 2nd set: Non-idle power • Average difference between measurement and estimation: 3W • Worst case: Equake (5.8W)
Stdev of SPEC Total Powers • 1st set: Overall, 2nd set: Non-idle power • Average difference: 2W • Worst case: Vortex (3.5W)
Desktop Applications • We aim to track low power utilizations as well. • Desktop applications are usually low power with intermittent power bursts • 3 applications, with common operations such as open/close application, web, streaming media, text editing, save to disk, statistical computations.
Real Power Measurement Performance Monitoring Power Modeling Thermal Modeling Thermal Modeling Thermal Model <Rush Mode!> • Related Work • Performance Monitoring • P4 Performance Counters • Performance Reader LKM • Real Power Measurement • P4 Power Measurement Setup • Examples • Power Modeling • P4 Power Model • Model + Measurement Sync Setup, Verification • Thermal Modeling • Brief Thermal Model Intro • Ppro Thermal Model Results Skip Thermal
THERMAL MODELING: A Basic Model • Based on lumpedR-C model from packaging • Built uponpower modeling • Sampled Component Powers • Respective component areas • Physical processor Parameters • Packaging • Heat Transfer
TA R_hXA Rh Th Ch R_grXspr Rspr Tspr Cspr Rp,i Tp,i Ptotal Cp,i Rdie,i Tdie,i Cdie,i Pi Physical Structure vs. Thermal Model Ambient Temperature Ambient Airflow Heatsink Thermal Grease Heat Spreader Package Die
Simulation Outputs • Thermal nodes updated every t~20ms • Component Temperatures Build up to ~350K in ~5hrs • Theatsink moves very slowly as expected
Program Profiling Power Modeling Program Structure Power Phases Power Phases Power Phase Behavior • Power Phase Behavior • Similarity Based on Power Vectors • Identifying similar program regions • Profiling Execution Flow • Sampling process’ execution • “PCsampler” LKM • Program Structure • Execution vs. Code space • Power Phases Exec. Phases • <OR VICE VERSA>
1 Power Vectors for Similarity • Similar to basic block vectors • Use component power vector samples to represent program phases • Consider Manhattan distance between 2 vectors as the ‘measure of dissimilarity’ between the corresponding execution points • Construct a similarity matrix to represent similarity among all pairs of execution points • Each entry in the similarity matrix:
0.44 44.44 88.44 Timeline [s] Timeline [s] 132.44 176.44 Gcc & Gzip Similarity Matrices Gzip Elaboration: Much regular power behavior Spurious similarities such as 100-150s and 200-280 are distinguished by the similarity analysis Gcc Elaboration: Very variant power Almost identical power behavior at 30, 50, 180s. Although 88s, 110s, 140s, 210s and 230s show similar total power; 88, 210 and 230 share higher similarity. 44 88 132 176 220 264 308 352 396 220.44 440 264.44
Generating representative vectors • Gzip has ~1000 power vectors • Cluster vectors based on similarity • Could we represent power behavior with reasonable accuracy, with a small number of ‘signature’ vectors? • Ex: 26 representative vectors with “Thresholding Algorithm”:
Program Profiling Program Profiling Power Modeling Program Structure Program Structure Power Phases Program Execution Profile • Power Phase Behavior • Similarity Based on Power Vectors • Identifying similar program regions • Profiling Execution Flow • Sampling process’ execution • “PCsampler” LKM • Program Structure • Execution vs. Code space • Power Phases Exec. Phases • <OR VICE VERSA>
Program Execution Profile • Sample program flow simultaneously with power • Our LKM implementation: “PCsampler” • Generate code space similarity in parallel with power space similarity • Relative comparisons for: • Complexity • Accuracy • Applicability, etc.
Where all this is useful? • Measurement/Modeling for microarchitectural details • Compiler level power • SW power profiling • Power Aware OS • Dynamic power/thermal/March. Configuration • Dynamic memory allocate, Process cruise control, etc. • Demonstrates modern processor power • Need for speed! Long Timescales, thermal constants • Identify program phases w/o knowledge of code, no basic block info whatsoever • Program signatures for detailed simulation, say: “power points rather than simpoints”
ACCESS HEURISTICS
Counter Access Heuristics • 1) BUS CONTROL: • No 3rd Level cache BSQ allocations ~ IOQ allocations • Metric1: Bus accesses from all agents • Event: IOQ_allocation • Counts various types of bus transactions • Should account for BSQ as well • access based rather than duration • MASK: • Default req. type, all read (128B) and write (64B) types, include OWN,OTHER and PREFETCH • Metric2: Bus Utilization(The % of time Bus is utilized) • Event: FSB_data_activity • Counts DataReaDY and DataBuSY events on Bus • Mask: • Count when processor or other agents drive/read/reserve the bus • Expression: FSB_data_activity x BusRatio / Clocks Elapsed • To account for clock ratios
Counter Access Heuristics • 2) L2 Cache: • Metric: 2nd Level cache references • Event: BSQ_cache_reference • Counts cache ref-s as seen by bus unit • MASK: • All MESI read misses (LD & RFO) • 2nd level WR misses • 3) 2nd Level BPU: • Metric 1: Instructions fetched from L2 (predict) • Event: ITLB_Reference • Counts ITLB translations • Mask: • All hits, misses & UC hits • Metric 2: Branches retired (history update) • Event: branch_retired • Counts branches retired • Mask: • Count all Taken/NT/Predicted/MissP
Counter Access Heuristics • 4) ITLB & I-Fetch: • etc……… • 10) FP Execution: • Metric: FP instructions executed • event1: packed_SP_uop • counts packed single precision uops • event2: packed_DP_uop • counts packed single precision uops • event3: scalar_SP_uop • counts scalar double precision uops • event4: scalar_DP_uop • counts scalar double precision uops • event5: 64bit_MMX_uop • counts MMX uops with 64bit SIMD operands • event6: 128bit_MMX_uop • counts integer SSE2 uops with 128bit SIMD operands • event7: x87_FP_UOP • counts x87 FP uops • event8: x87_SIMD_moves_uop • counts x87, FP, MMX, SSE, SSE2 ld/st/mov uops Back
ADDITIONAL SLIDES
P4 Details • Karelian.ee: • P4 – 1.4GHz • 0.18, C4-FC-PGA-423 • Heatsink Folded Fin • M6, Al interconnect • Die Size: 217 mm2 • Package Size: 5.34cm x 5.17cm • Power: Idle/typ./max=??/51.8/71W • D$1&T$1/L2: 8K&12KUops/256K • Voltage: 1.7/1.75V
P4 Details • 1st LKM: <LKM_CPUinfo & UserLevel_CPUinfo> • Implements syscall: getCPUinfo() • Gathers CPU info from: • /asm/processor.h • Intel control registers (CR4) • CPUID instruction • Reveals: • Debug Store mechanism exists for PEBS • TSC exists • MSRs implemented • We can read/write performance counters • EX: • karelian (P4,willamette): UserLevel_CPUinfo • viale (P4, Northwood): UserLevel_CPUinfo Back
P4 Detector - Counter Clusters Event Detectors Event Counters EVENTS P4 Components 4 bit wide bus
Counters, ESCRs & CCCRs Simplified Recipe: • Select Event to count • Select a counter (also defines CCCR) • Select an ESCR • Set ESCR fields • Set CCCR fields • Enable CCCR
Counting Mechanisms • Counting Types • Non-retirement: Events occur any time during execution • At-Retirement: Events at the retirement of instruction • Can count BOGUS vs NBOGUS, Tag uops to count, etc. • Terminology • Mechanisms: • Front end tagging (i.e. LD/ST retired) • Execution tagging (i.e. packed_DP_retired) • Replay Tagging (i.e. L1 misses) • No Tags (i.e. uops retired) • Also: • Event Counting | IEBS | PEBS Back