Accuracy of Performance Monitoring Hardware
Michael E. Maxwell, Patricia J. Teller, and Leonardo Salayandia
University of Texas-El Paso
and Shirley Moore
University of Tennessee-Knoxville
PCAT Team
• Dr. Patricia Teller
• Alonso Bayona - Undergraduate
• Alexander Sainz - Undergraduate
• Trevor Morgan - Undergraduate
• Leonardo Salayandia - M.S. Student
• Michael Maxwell - Ph.D. Student
Credits (Financial)
• DoD PET Program
• NSF MIE (Model Institutions of Excellence) REU (Research Experiences for Undergraduates) Program
• UTEP Dodson Endowment
Motivation
• Facilitate performance-tuning efforts that employ aggregate event counts
• When possible, provide calibration data
• Identify unexpected results and errors
• Clarify misunderstandings of processor functionality
Road Map
• Scope of Research
• Methodology
• Results
• Future Work and Conclusions
Processors Under Study
• MIPS R10K and R12K: 2 counters, 32 events
• IBM Power3: 8 counters, 100+ events
• Linux/IA-64: 4 counters, 150 events
• Linux/Pentium: 2 counters, 80+ events
Events Studied So Far
• Number of load and store instructions executed
• Number of floating-point instructions executed
• Total number of instructions executed (issued/committed)
• Number of L1 I-cache and L1 D-cache misses
• Number of L2 cache misses
• Number of TLB misses
• Number of branch mispredictions
PAPI Overhead
• Extra instructions
• Read counter before and after workload
• Processing of counter overflow interrupts
• Cache pollution
• TLB pollution
Methodology
• Validation micro-benchmark
• Configuration micro-benchmark
• Prediction via tool, mathematical model, and/or simulation
• Hardware-reported event count collection via PAPI (instrumented benchmark run 100 times; mean event count and standard deviation calculated); a sketch of this step follows
• Comparison/analysis
• Report findings
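A minimal sketch of the collection step, assuming PAPI's low-level C API; workload() is a hypothetical stand-in for one of the micro-benchmarks, and PAPI_LD_INS (loads executed) is just an illustrative choice of event:

    #include <stdio.h>
    #include <math.h>
    #include <papi.h>

    #define RUNS 100

    extern void workload(void);              /* hypothetical micro-benchmark */

    int main(void)
    {
        int eventset = PAPI_NULL;
        long long count;
        double sum = 0.0, sumsq = 0.0;

        if (PAPI_library_init(PAPI_VER_CURRENT) != PAPI_VER_CURRENT)
            return 1;
        PAPI_create_eventset(&eventset);
        PAPI_add_event(eventset, PAPI_LD_INS);   /* count loads executed */

        for (int i = 0; i < RUNS; i++) {
            PAPI_start(eventset);                /* read counter before... */
            workload();
            PAPI_stop(eventset, &count);         /* ...and after the workload */
            sum   += (double)count;
            sumsq += (double)count * (double)count;
        }

        double mean = sum / RUNS;
        printf("mean = %.1f, std dev = %.1f\n",
               mean, sqrt(sumsq / RUNS - mean * mean));
        return 0;
    }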
Validation Micro-benchmark
• Simple, usually small program
• Stresses a portion of the microarchitecture or memory hierarchy
• Its size, simplicity, or execution time facilitates the tracing of its execution path and/or prediction of the number of times an event is generated
Validation Micro-benchmark
• Basic types: array, loop, in-line, and floating-point
• Scalable w.r.t. granularity, i.e., the number of generated events
Example - Loop Validation Micro-benchmark

    for (i = 0; i < number_of_loops; i++) {
        /* sequence of 100 instructions with data dependencies
           that prevent compiler reordering or optimization */
    }

Used to stress a particular functional unit, e.g., the load/store unit
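A concrete sketch of such a loop for the load/store unit; the volatile, chained accesses are one way to create the data dependencies described above (the real benchmark body is on the order of 100 instructions, abbreviated here):

    /* Each statement reads the result of the previous one, so the
     * compiler can neither reorder nor eliminate the memory traffic;
     * expected loads/stores = number_of_loops * (body length). */
    volatile double buf[2];

    void loop_benchmark(long number_of_loops)
    {
        for (long i = 0; i < number_of_loops; i++) {
            buf[0] = buf[1] + 1.0;   /* load buf[1], store buf[0] */
            buf[1] = buf[0] + 1.0;   /* depends on the store above */
            buf[0] = buf[1] + 1.0;
            buf[1] = buf[0] + 1.0;
        }
    }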
Configuration Micro-benchmark
• Program designed to provide insight into microarchitecture organization and/or the algorithms that control it
• Examples: page size used - for TLB miss counts (a probe for this is sketched below), cache prefetch algorithm, branch prediction buffer size/organization
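As one hedged illustration, a page-size configuration micro-benchmark might touch one byte every `stride` bytes of a large buffer while counting data-TLB misses (e.g., via PAPI_TLB_DM); sweeping the stride and watching where the miss count plateaus reveals the page size the counter reflects. BUF_SIZE is an assumed value:

    #include <stddef.h>

    #define BUF_SIZE (64 * 1024 * 1024)   /* assumed buffer size */

    /* Touch one byte per stride; once stride >= page size, every touch
     * hits a distinct page, so data-TLB misses track `touches` closely. */
    long touch_strided(volatile char *buf, size_t stride)
    {
        long touches = 0;
        for (size_t off = 0; off < BUF_SIZE; off += stride) {
            buf[off]++;
            touches++;
        }
        return touches;   /* compare against the reported TLB miss count */
    }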
Some Results
Reported Event Counts: Expected, Consistent, and Quantifiable Results
• Overhead related to PAPI and other sources is consistent and quantifiable
• Reported Event Count - Predicted Event Count = Overhead
Example 2: Number of Stores - Power3 and Itanium Multiplicative

Overhead (Reported - Predicted) per platform:

              MIPS R12K   IBM Power3   Linux/IA-64   Linux/Pentium
    Loads         46          28            86            N/A
    Stores        31           ?           129            N/A
Example 3: Total Number of Floating-Point Operations - Pentium II, R10K and R12K, and Itanium

    Processor          Accurate   Consistent
    Pentium II            yes        yes
    MIPS R10K, R12K       yes        yes
    Itanium               yes        yes

Even when counters overflow. No overhead due to PAPI.
Reported Event Counts: Unexpected and Consistent Results - Errors?
• The hardware-reported counts are multiples of the predicted counts
• Reported Event Count / Multiplier = Predicted Event Count
• Cannot identify overhead for calibration
Example: Total Number of Floating-Point Operations - Power3
[Table: Accurate / Consistent entries not recovered]
Reported Counts: Expected (Not Quantifiable) Results
• Predictions: only possible under special circumstances
• Reported event counts seem reasonable
• But are they useful without knowing more about the algorithm used by the vendor?
Example 1: Total Data TLB Misses
• Replacement policy can (unpredictably) affect event counts
• PAPI may (unpredictably) affect event counts
• Other processes may (unpredictably) affect event counts
Example 2: L1 D-Cache Misses
Number of misses relatively constant as the number of array references increases
Example 3: L1 D-Cache Misses with Random Access (Foils Prefetch Scheme Used by Stream Buffers)
[Figure: L1 D-cache misses as a function of % of cache filled; % error curves for Power3, R12K, and Pentium]
Example 4: A Mathematical Model that Verifies that Execution Time Increases Proportionately with L1 D-Cache Misses

    total_number_of_cycles = iterations * exec_cycles_per_iteration
                           + cache_misses * cycles_per_cache_miss
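A hedged numeric illustration of the model; the parameter values are made-up assumptions, not measurements from the study:

    /* Illustrative only: all parameter values below are assumptions. */
    long total_cycles(long iterations, long exec_cycles_per_iteration,
                      long cache_misses, long cycles_per_cache_miss)
    {
        return iterations * exec_cycles_per_iteration
             + cache_misses * cycles_per_cache_miss;
    }

    /* total_cycles(1000000, 4, 50000, 40) = 4,000,000 + 2,000,000
     * = 6,000,000 cycles; doubling the misses to 100,000 adds exactly
     * 2,000,000 more, so execution time grows linearly with the number
     * of L1 D-cache misses, as the model predicts. */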
Reported Event Counts: Unexpected but Consistent Results
• Predicted counts and reported counts differ significantly but in a consistent manner
• Is this an error?
• Are we missing something?
Example: Compulsory Data TLB Misses
• % difference per no. of references
• Reported counts are consistent
• Counts vary between platforms
A sketch of such a micro-benchmark follows.
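A minimal sketch of a compulsory data-TLB-miss micro-benchmark, assuming a 4 KB page size for illustration: touch each page of a fresh buffer exactly once, so the predicted miss count is simply the number of pages touched.

    #include <stdlib.h>

    #define PAGE_SIZE 4096          /* assumed page size for illustration */

    long touch_each_page_once(size_t pages)
    {
        volatile char *buf = malloc(pages * PAGE_SIZE);
        for (size_t p = 0; p < pages; p++)
            buf[p * PAGE_SIZE] = 1; /* one compulsory data-TLB miss per page */
        free((void *)buf);
        return (long)pages;         /* predicted compulsory miss count */
    }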
Reported Event Counts: Unexpected Results
• Outliers
• Puzzles
Example 1: Supporting Data
Example 2: L1 I-Cache Misses and Instructions Retired - Itanium
Both about 17% more than expected.
Future Work
• Extend events studied - include multiprocessor events
• Extend processors studied - include Power4
• Study sampling on Power4; IBM collaboration re: workload characterization/system resource usage using sampling
Conclusions
• Performance counters provide informative data that can be used for performance tuning
• Expected frequency of an event may determine the usefulness of its counts
• Calibration data can make event counts more useful to application programmers (loads, stores, floating-point instructions)
• The usefulness of some event counts, as well as our research, could be enhanced with vendor collaboration
• The usefulness of some event counts is questionable without documentation of the related behavior
Should we attach the following warning to some event counts on some platforms?

CAUTION: The values in the performance counters may be greater than you think.
And should we attach the PCAT Seal of Approval to others?
Invitation to Vendors

Help us understand what’s going on, when to attach the “warning,” and when to attach the “seal of approval.” Application programmers will appreciate your efforts, and so will we!
Question to You

On-board Performance Counters: What do they really tell you? With all the caveats, are they useful nonetheless?
Example 1: Total Compulsory Data TLB Misses for R10K
• % difference per no. of references
• Predicted values consistently lower than reported
• Small standard deviations
• Greater predictability with increased no. of references
Example 1: Compulsory Data TLB Misses for Itanium
• % difference per no. of references
• Reported counts consistently ~5 times greater than predicted
Example 3: Compulsory Data TLB Misses for Power3
• % difference per no. of references
• Reported counts consistently ~5 times greater than predicted for small counts and ~2 times greater for large counts
Example 3: L1 D-Cache Misses with Random Access - Itanium
Only when array size = 10x cache size
Example 2: L1 D-Cache Misses
• On some of the processors studied, as the number of accesses increased, the miss rate approached 0
• Accessing the array in strides of two cache-size units plus one cache line resulted in approximately the same event count as accessing the array in strides of one word (both patterns are sketched below)
• What’s going on?
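A sketch of the two access patterns, with assumed cache parameters; both perform the same number of accesses, so the reported miss counts are directly comparable:

    #include <stddef.h>

    #define CACHE_SIZE (32 * 1024)       /* assumed L1 D-cache size */
    #define LINE_SIZE  32                /* assumed cache-line size */
    #define N          (16 * CACHE_SIZE)
    #define ACCESSES   (1 << 20)

    volatile char arr[N];

    /* Walk the array with a fixed stride, wrapping so that both
     * patterns issue exactly ACCESSES references. */
    void strided(size_t stride)
    {
        size_t idx = 0;
        for (long i = 0; i < ACCESSES; i++) {
            arr[idx]++;
            idx = (idx + stride) % N;
        }
    }

    /* strided(sizeof(long));                stride of one word            */
    /* strided(2 * CACHE_SIZE + LINE_SIZE);  stride of two cache sizes     */
    /*                                       plus one cache line           */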
Example 2: R10K Floating-Point Division Instructions

Variant counted as 1 FP instruction:

    a = init_value; b = init_value; c = init_value;
    a = b / init_value;
    b = a / init_value;
    c = b / init_value;

Variant counted as 3 FP instructions:

    a = init_value; b = init_value; c = init_value;
    a = a / init_value;
    b = b / init_value;
    c = c / init_value;
Example 2: Assembler Code Analysis

Both variants compile to the same instruction sequence:

    l.d  s.d                # initialize a
    l.d  s.d                # initialize b
    l.d  s.d                # initialize c
    l.d  l.d  div.d  s.d    # first division
    l.d  l.d  div.d  s.d    # second division
    l.d  l.d  div.d  s.d    # third division

• No optimization
• Same instructions
• Different (expected) operands
• Three division instructions in both
• No reason for different FP counts