Why do we need micro-benchmarks?

Systematic Energy Characterization of CMP/SMT Processor Systems via Automated Micro-Benchmarks
R. Bertran*+, A. Buyuktosunoglu*, M. Gupta*, M. Gonzalez+, P. Bose*
*IBM T.J. Watson Research Center
+Barcelona Supercomputing Center

Why do we need micro-benchmarks?
What is the maximum power consumption?
Any performance bug?
Any reliability issues?
…
Micro-benchmarks!
• Time consuming and tedious
• Error prone task
• Trial and error process
• Several micro-benchmarks are required
• Deep expertise limited to few designers
• Detailed knowledge of the underlying architecture is required
AUTOMATED SOLUTION NEEDED!

MicroProbe: a micro-benchmark generation framework

MicroProbe Workflow
[Figure: workflow diagram]
• Inputs: micro-benchmark generation policy (from the user), architecture definition files
• MicroProbe Framework
• Outputs: micro-benchmarks
• Examples: endless loop for each instruction of the ISA; endless loop with 50% INT and 50% FP; max-power stressmark
• External tools: real platforms, simulators, models

MicroProbe: Distinguishing Features

MicroProbe Usage and Design Overview
[Figure: research idea, then micro-benchmark generation policies (user-defined scripts), then the MicroProbe Framework, then micro-benchmarks]
Example micro-benchmark generation policies:
• Loop stressing the floating point unit
• Sequence of loads hitting 50% L1 and 50% L2
• Generate a stressmark for each functional unit of the architecture
• Search for the sequence of 2 loads and 2 integer operations with maximum IPC
MicroProbe Framework (Python API):
• Architecture module
• Code generation module
• Design space exploration module
Components shown in the diagram:
• ISA definitions
• Micro-architecture definitions
• Micro-architecture analytical models
• Properties
• Automatic bootstrap process
• Micro-benchmark synthesizer
• Passes
• Search drivers
• External tools

Max-power Stressmark Generation
Use MicroProbe to generate a max-power stressmark:
• Characterize energy per instruction (EPI) and IPC (Architecture Module)
• Select the N instructions with max (IPC * EPI), e.g. mulldo, xvnmsubmdp, lxvw4x
• Form a basic endless loop (e.g. 4K) using the selected instructions (Code Generation Module)
  Loop: … mulldo mulldo lxvw4x lxvw4x xvnmsubmdp xvnmsubmdp …
• Generate micro-benchmarks with different orders of the selected N instructions
  Loop: … mulldo lxvw4x mulldo xvnmsubmdp lxvw4x xvnmsubmdp …
• Evaluate using the Design Space Exploration Module
• Pick the highest-power micro-benchmark
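The selection and reordering steps above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the MicroProbe API: the function names, the generic instruction names, the placeholder (EPI, IPC) values and the `toy_power` scoring function are all hypothetical. In the real flow the power of each candidate would be measured on the platform.

```python
from itertools import permutations


def select_instructions(profile, n):
    """Return the n instruction names with the largest EPI * IPC product.

    `profile` maps an instruction name to an (epi, ipc) tuple.
    """
    ranked = sorted(
        profile,
        key=lambda name: profile[name][0] * profile[name][1],
        reverse=True,
    )
    return ranked[:n]


def candidate_orderings(selected, copies=2):
    """Return every distinct ordering of a loop body that holds
    `copies` instances of each selected instruction."""
    body = [name for name in selected for _ in range(copies)]
    return sorted(set(permutations(body)))


def pick_highest_power(candidates, measure_power):
    """Return the candidate for which `measure_power` reports the most power."""
    return max(candidates, key=measure_power)


# Placeholder (EPI, IPC) values for illustration only; not measurements.
profile = {
    "int_mul": (4.0, 1.0),
    "vec_fma": (3.0, 2.0),
    "vec_load": (2.5, 2.0),
    "int_add": (1.0, 2.0),
}
selected = select_instructions(profile, 3)
orderings = candidate_orderings(selected)


def toy_power(body):
    # Hypothetical stand-in for a real power measurement: rewards
    # loop bodies in which neighbouring instructions differ.
    return sum(a != b for a, b in zip(body, body[1:]))


best = pick_highest_power(orderings, toy_power)
```

With three selected instructions and two copies of each, the loop body has 6!/(2!·2!·2!) = 90 distinct orderings, so an exhaustive evaluation is cheap at this size. Larger bodies would need a search driver that prunes the space.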

Case Studies
MicroProbe: A Micro-benchmark Generation Framework

Experimental Methodology
• Platform:
  • Processor: POWER7 @ 3 GHz
  • 8-core, 4-way SMT
  • 32 KB L1, 256 KB L2 and 4 MB L3 per core
  • Memory: 32 GB DDR3 SDRAM @ 800 MHz
  • OS: RHEL 5.7 + Linux 3.0.1
• EnergyScale architecture
  • Power measurements in milliwatts
  • Sampling interval down to 1 ms
• In-house software collects power and performance counter traces [C. Lefurgy et al., IBM]

Case Study 1: EPI Characterization
• Large differences in EPI across instructions stressing different micro-architecture components
• Large differences in EPI across instructions stressing the same micro-architecture components at the same rate (IPC)
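To make the second observation concrete, here is a minimal sketch of how EPI can be derived from average power and throughput. The slides do not give the formula used, so this assumes EPI is power divided by instructions per second. Whether a baseline (idle) power is subtracted first is also an assumption, exposed as an optional parameter. The function name and the sample power values are hypothetical; only the 3 GHz clock comes from the platform description.

```python
def energy_per_instruction(avg_power_watts, ipc, frequency_hz,
                           baseline_power_watts=0.0):
    """Return energy per instruction in joules.

    Assumes EPI = (power - baseline) / (IPC * frequency).
    """
    instructions_per_second = ipc * frequency_hz
    return (avg_power_watts - baseline_power_watts) / instructions_per_second


FREQUENCY_HZ = 3e9  # POWER7 @ 3 GHz

# Two hypothetical instructions running at the same IPC but drawing
# different average power (illustrative values, not measurements).
epi_a = energy_per_instruction(30.0, 2.0, FREQUENCY_HZ)
epi_b = energy_per_instruction(36.0, 2.0, FREQUENCY_HZ)
```

At equal IPC the EPI ratio equals the power ratio, which is why instructions that stress the same unit at the same rate can still differ in EPI.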

Case Study 2: Max-power Stressmark Generation
Approaches compared:
• Expert manual: use complex instructions accessing different functional units with high IPC
• DAXPY: use a computationally intensive kernel
• Expert DSE: generate all possible combinations of complex instructions stressing different units
  Selected instructions: mullw, xvmaddadp, lxvd2x
  Loop: … mullw lxvd2x mullw xvmaddadp lxvd2x xvmaddadp …
  Loop: … mullw lxvd2x mullw xvmaddadp xvmaddadp lxvd2x …
  Loop: … mullw mullw xvmaddadp xvmaddadp lxvd2x lxvd2x …
• MicroProbe: use MicroProbe with the heuristic Max(EPI * IPC)
  Selected instructions: mulldo, xvnmsubmdp, lxvw4x

Max-power Stressmark Generation

Case Study 3: Counter-based Processor Power Model
Bottom-up power modeling method:
• 1. Dynamic power, f(PMCs): functional-unit micro-benchmarks (CMP1–SMT1); intercept SMT1 from random micro-benchmarks (CMP1–SMT1)
• 2. SMT effect: random micro-benchmarks (CMP1–SMT2/4); intercept SMT2-4
• 3. CMP effect and uncore power: random micro-benchmarks (CMP1/8–SMT2/4); linear regression f(CMP)
Model terms: dynamic power f(PMCs), SMT effect (SMT enabled), CMP effect (# cores), uncore power
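The slide names a linear regression f(CMP) for the CMP effect. A minimal sketch of such a fit follows, assuming the power left over after subtracting the per-core dynamic terms grows linearly with the number of active cores. The function name and the sample values are hypothetical and synthetic, not measurements. Reading the intercept as the part that does not scale with core count is likewise an assumption.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Synthetic residual power (watts) versus active cores, generated
# from 2.0 * cores + 10.0 purely to exercise the fit.
active_cores = [1, 2, 4, 8]
residual_power = [12.0, 14.0, 18.0, 26.0]

per_core_effect, core_independent_power = fit_line(active_cores, residual_power)
```

With real traces the points would not fall exactly on a line, and the quality of the fit would indicate whether a linear f(CMP) is adequate.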

Counter-based Processor Power Model Validation
Within acceptable error margins: < 4% on average

Counter-based Processor Power Model Validation on Corner Cases
• Models trained using non-micro-architecture aware training sets show high errors and variability
• Models trained using the micro-architecture aware training set show acceptable error margins: < 5% on average

Conclusions
• MicroProbe is a productive micro-benchmark generation framework
  • Adaptive and flexible
  • Includes micro-architecture semantics
  • Integrates design space exploration
• Presented three case studies:
  • Instruction-based EPI characterization
  • Automated max-power stressmark generation
  • CMP/SMT-aware bottom-up counter-based processor power model

QUESTIONS?
