Accuracy of Performance Monitoring Hardware
Michael E. Maxwell, Patricia J. Teller, and Leonardo Salayandia
University of Texas-El Paso
and Shirley Moore
University of Tennessee-Knoxville
PCAT Team
• Dr. Patricia Teller
• Alonso Bayona - Undergraduate
• Alexander Sainz - Undergraduate
• Trevor Morgan - Undergraduate
• Leonardo Salayandia - M.S. Student
• Michael Maxwell - Ph.D. Student
Credits (Financial)
• DoD PET Program
• NSF MIE (Model Institutions of Excellence) REU (Research Experiences for Undergraduates) Program
• UTEP Dodson Endowment
Motivation
• Facilitate performance-tuning efforts that employ aggregate event counts
• When possible, provide calibration data
• Identify unexpected results and errors
• Clarify misunderstandings of processor functionality
Road Map
• Scope of Research
• Methodology
• Results
• Future Work and Conclusions
Processors Under Study
• MIPS R10K and R12K: 2 counters, 32 events
• IBM Power3: 8 counters, 100+ events
• Linux/IA-64: 4 counters, 150 events
• Linux/Pentium: 2 counters, 80+ events
Events Studied So Far
• Number of load and store instructions executed
• Number of floating-point instructions executed
• Total number of instructions executed (issued/committed)
• Number of L1 I-cache and L1 D-cache misses
• Number of L2 cache misses
• Number of TLB misses
• Number of branch mispredictions
PAPI Overhead
• Extra instructions
• Read counter before and after workload
• Processing of counter overflow interrupts
• Cache pollution
• TLB pollution
Methodology
• Validation micro-benchmark
• Configuration micro-benchmark
• Prediction via tool, mathematical model, and/or simulation
• Hardware-reported event count collection via PAPI (instrumented benchmark run 100 times; mean event count and standard deviation calculated); a sketch of this step follows
• Comparison/analysis
• Report findings
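A minimal sketch of the collection step, assuming PAPI's low-level C API; workload() is a hypothetical stand-in for one of the micro-benchmarks, and PAPI_LD_INS (loads executed) is just an illustrative choice of event:

    #include <stdio.h>
    #include <math.h>
    #include <papi.h>

    #define RUNS 100

    extern void workload(void);              /* hypothetical micro-benchmark */

    int main(void)
    {
        int eventset = PAPI_NULL;
        long long count;
        double sum = 0.0, sumsq = 0.0;

        if (PAPI_library_init(PAPI_VER_CURRENT) != PAPI_VER_CURRENT)
            return 1;
        PAPI_create_eventset(&eventset);
        PAPI_add_event(eventset, PAPI_LD_INS);   /* count loads executed */

        for (int i = 0; i < RUNS; i++) {
            PAPI_start(eventset);                /* read counter before... */
            workload();
            PAPI_stop(eventset, &count);         /* ...and after the workload */
            sum   += (double)count;
            sumsq += (double)count * (double)count;
        }

        double mean = sum / RUNS;
        printf("mean = %.1f, std dev = %.1f\n",
               mean, sqrt(sumsq / RUNS - mean * mean));
        return 0;
    }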
Validation Micro-benchmark
• Simple, usually small program
• Stresses a portion of the microarchitecture or memory hierarchy
• Its size, simplicity, or execution time facilitates the tracing of its execution path and/or prediction of the number of times an event is generated
Validation Micro-benchmark
• Basic types: array, loop, in-line, and floating-point
• Scalable w.r.t. granularity, i.e., the number of generated events
Example - Loop Validation Micro-benchmark

    for (i = 0; i < number_of_loops; i++) {
        /* sequence of 100 instructions with data dependencies
           that prevent compiler reordering or optimization */
    }

Used to stress a particular functional unit, e.g., the load/store unit
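A concrete sketch of such a loop for the load/store unit; the volatile, chained accesses are one way to create the data dependencies described above (the real benchmark body is on the order of 100 instructions, abbreviated here):

    /* Each statement reads the result of the previous one, so the
     * compiler can neither reorder nor eliminate the memory traffic;
     * expected loads/stores = number_of_loops * (body length). */
    volatile double buf[2];

    void loop_benchmark(long number_of_loops)
    {
        for (long i = 0; i < number_of_loops; i++) {
            buf[0] = buf[1] + 1.0;   /* load buf[1], store buf[0] */
            buf[1] = buf[0] + 1.0;   /* depends on the store above */
            buf[0] = buf[1] + 1.0;
            buf[1] = buf[0] + 1.0;
        }
    }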
Configuration Micro-benchmark
• Program designed to provide insight into microarchitecture organization and/or the algorithms that control it
• Examples: page size used - for TLB miss counts (a probe for this is sketched below), cache prefetch algorithm, branch prediction buffer size/organization
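As one hedged illustration, a page-size configuration micro-benchmark might touch one byte every `stride` bytes of a large buffer while counting data-TLB misses (e.g., via PAPI_TLB_DM); sweeping the stride and watching where the miss count plateaus reveals the page size the counter reflects. BUF_SIZE is an assumed value:

    #include <stddef.h>

    #define BUF_SIZE (64 * 1024 * 1024)   /* assumed buffer size */

    /* Touch one byte per stride; once stride >= page size, every touch
     * hits a distinct page, so data-TLB misses track `touches` closely. */
    long touch_strided(volatile char *buf, size_t stride)
    {
        long touches = 0;
        for (size_t off = 0; off < BUF_SIZE; off += stride) {
            buf[off]++;
            touches++;
        }
        return touches;   /* compare against the reported TLB miss count */
    }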
Some Results
Reported Event Counts: Expected, Consistent, and Quantifiable Results
• Overhead related to PAPI and other sources is consistent and quantifiable
• Reported Event Count - Predicted Event Count = Overhead
Example 2: Number of Stores - Power3 and Itanium Multiplicative

Overhead (Reported - Predicted) per platform:

              MIPS R12K   IBM Power3   Linux/IA-64   Linux/Pentium
    Loads         46          28            86            N/A
    Stores        31           ?           129            N/A
Example 3: Total Number of Floating-Point Operations - Pentium II, R10K and R12K, and Itanium

    Processor          Accurate   Consistent
    Pentium II            yes        yes
    MIPS R10K, R12K       yes        yes
    Itanium               yes        yes

Even when counters overflow. No overhead due to PAPI.
Reported Event Counts: Unexpected and Consistent Results - Errors?
• The hardware-reported counts are multiples of the predicted counts
• Reported Event Count / Multiplier = Predicted Event Count
• Cannot identify overhead for calibration
Example: Total Number of Floating-Point Operations - Power3
[Table: Accurate / Consistent entries not recovered]
Reported Counts: Expected (Not Quantifiable) Results
• Predictions: only possible under special circumstances
• Reported event counts seem reasonable
• But are they useful without knowing more about the algorithm used by the vendor?
Example 1: Total Data TLB Misses
• Replacement policy can (unpredictably) affect event counts
• PAPI may (unpredictably) affect event counts
• Other processes may (unpredictably) affect event counts
Example 2: L1 D-Cache Misses
Number of misses relatively constant as the number of array references increases
Example 3: L1 D-Cache Misses with Random Access (Foils Prefetch Scheme Used by Stream Buffers)
[Figure: L1 D-cache misses as a function of % of cache filled; % error curves for Power3, R12K, and Pentium]
Example 4: A Mathematical Model that Verifies that Execution Time Increases Proportionately with L1 D-Cache Misses

    total_number_of_cycles = iterations * exec_cycles_per_iteration
                           + cache_misses * cycles_per_cache_miss
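A hedged numeric illustration of the model; the parameter values are made-up assumptions, not measurements from the study:

    /* Illustrative only: all parameter values below are assumptions. */
    long total_cycles(long iterations, long exec_cycles_per_iteration,
                      long cache_misses, long cycles_per_cache_miss)
    {
        return iterations * exec_cycles_per_iteration
             + cache_misses * cycles_per_cache_miss;
    }

    /* total_cycles(1000000, 4, 50000, 40) = 4,000,000 + 2,000,000
     * = 6,000,000 cycles; doubling the misses to 100,000 adds exactly
     * 2,000,000 more, so execution time grows linearly with the number
     * of L1 D-cache misses, as the model predicts. */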
Reported Event Counts: Unexpected but Consistent Results
• Predicted counts and reported counts differ significantly but in a consistent manner
• Is this an error?
• Are we missing something?
Example: Compulsory Data TLB Misses
• % difference per no. of references
• Reported counts are consistent
• Counts vary between platforms
A sketch of such a micro-benchmark follows.
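A minimal sketch of a compulsory data-TLB-miss micro-benchmark, assuming a 4 KB page size for illustration: touch each page of a fresh buffer exactly once, so the predicted miss count is simply the number of pages touched.

    #include <stdlib.h>

    #define PAGE_SIZE 4096          /* assumed page size for illustration */

    long touch_each_page_once(size_t pages)
    {
        volatile char *buf = malloc(pages * PAGE_SIZE);
        for (size_t p = 0; p < pages; p++)
            buf[p * PAGE_SIZE] = 1; /* one compulsory data-TLB miss per page */
        free((void *)buf);
        return (long)pages;         /* predicted compulsory miss count */
    }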
Reported Event Counts: Unexpected Results
• Outliers
• Puzzles
Example 1: Supporting Data
Example 2: L1 I-Cache Misses and Instructions Retired - Itanium
Both about 17% more than expected.
Future Work
• Extend events studied - include multiprocessor events
• Extend processors studied - include Power4
• Study sampling on Power4; IBM collaboration re: workload characterization/system resource usage using sampling
Conclusions
• Performance counters provide informative data that can be used for performance tuning
• Expected frequency of an event may determine the usefulness of its counts
• Calibration data can make event counts more useful to application programmers (loads, stores, floating-point instructions)
• The usefulness of some event counts, as well as our research, could be enhanced with vendor collaboration
• The usefulness of some event counts is questionable without documentation of the related behavior
Should we attach the following warning to some event counts on some platforms?

CAUTION: The values in the performance counters may be greater than you think.
And should we attach the PCAT Seal of Approval to others?
Invitation to Vendors

Help us understand what’s going on, when to attach the “warning,” and when to attach the “seal of approval.” Application programmers will appreciate your efforts, and so will we!
Question to You

On-board Performance Counters: What do they really tell you? With all the caveats, are they useful nonetheless?
Example 1: Total Compulsory Data TLB Misses for R10K
• % difference per no. of references
• Predicted values consistently lower than reported
• Small standard deviations
• Greater predictability with increased no. of references
Example 1: Compulsory Data TLB Misses for Itanium
• % difference per no. of references
• Reported counts consistently ~5 times greater than predicted
Example 3: Compulsory Data TLB Misses for Power3
• % difference per no. of references
• Reported counts consistently ~5 times greater than predicted for small counts and ~2 times greater for large counts
Example 3: L1 D-Cache Misses with Random Access - Itanium
Only when array size = 10x cache size
Example 2: L1 D-Cache Misses
• On some of the processors studied, as the number of accesses increased, the miss rate approached 0
• Accessing the array in strides of two cache-size units plus one cache line resulted in approximately the same event count as accessing the array in strides of one word (both patterns are sketched below)
• What’s going on?
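A sketch of the two access patterns, with assumed cache parameters; both perform the same number of accesses, so the reported miss counts are directly comparable:

    #include <stddef.h>

    #define CACHE_SIZE (32 * 1024)       /* assumed L1 D-cache size */
    #define LINE_SIZE  32                /* assumed cache-line size */
    #define N          (16 * CACHE_SIZE)
    #define ACCESSES   (1 << 20)

    volatile char arr[N];

    /* Walk the array with a fixed stride, wrapping so that both
     * patterns issue exactly ACCESSES references. */
    void strided(size_t stride)
    {
        size_t idx = 0;
        for (long i = 0; i < ACCESSES; i++) {
            arr[idx]++;
            idx = (idx + stride) % N;
        }
    }

    /* strided(sizeof(long));                stride of one word            */
    /* strided(2 * CACHE_SIZE + LINE_SIZE);  stride of two cache sizes     */
    /*                                       plus one cache line           */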
Example 2: R10K Floating-Point Division Instructions

Variant counted as 1 FP instruction:

    a = init_value; b = init_value; c = init_value;
    a = b / init_value;
    b = a / init_value;
    c = b / init_value;

Variant counted as 3 FP instructions:

    a = init_value; b = init_value; c = init_value;
    a = a / init_value;
    b = b / init_value;
    c = c / init_value;
Example 2: Assembler Code Analysis

Both variants compile to the same instruction sequence:

    l.d  s.d                # initialize a
    l.d  s.d                # initialize b
    l.d  s.d                # initialize c
    l.d  l.d  div.d  s.d    # first division
    l.d  l.d  div.d  s.d    # second division
    l.d  l.d  div.d  s.d    # third division

• No optimization
• Same instructions
• Different (expected) operands
• Three division instructions in both
• No reason for different FP counts