
Accuracy of Performance Monitoring Hardware


Presentation Transcript


  1. Accuracy of Performance Monitoring Hardware Michael E. Maxwell, Patricia J. Teller, and Leonardo Salayandia University of Texas-El Paso and Shirley Moore University of Tennessee-Knoxville

  2. PCAT Team • Dr. Patricia Teller • Alonso Bayona - Undergraduate • Alexander Sainz - Undergraduate • Trevor Morgan - Undergraduate • Leonardo Salayandia – M.S. Student • Michael Maxwell – Ph.D. Student PCAT - The University of Texas at El Paso

  3. Credits (Financial) • DoD PET Program • NSF MIE (Model Institutions of Excellence) REU (Research Experiences for Undergraduates) Program • UTEP Dodson Endowment PCAT - The University of Texas at El Paso

  4. Motivation • Facilitate performance-tuning efforts that employ aggregate event counts • When possible provide calibration data • Identify unexpected results, errors • Clarify misunderstandings of processor functionality PCAT - The University of Texas at El Paso

  5. Road Map • Scope of Research • Methodology • Results • Future Work and Conclusions PCAT - The University of Texas at El Paso

  6. Processors Under Study • MIPS R10K and R12K: 2 counters, 32 events • IBM Power3: 8 counters, 100+ events • Linux/IA-64: 4 counters, 150 events • Linux/Pentium: 2 counters, 80+ events PCAT - The University of Texas at El Paso

  7. Events Studied So Far • Number of load and store instructions executed • Number of floating-point instructions executed • Total number of instructions executed (issued/committed) • Number of L1 I-cache and L1 D-cache misses • Number of L2 cache misses • Number of TLB misses • Number of branch mispredictions PCAT - The University of Texas at El Paso

  8. PAPI Overhead • Extra instructions • Read counter before and after workload • Processing of counter overflow interrupts • Cache pollution • TLB pollution PCAT - The University of Texas at El Paso

  9. Methodology • Validation micro-benchmark • Configuration micro-benchmark • Prediction via tool, mathematical model, and/or simulation • Hardware-reported event count collection via PAPI (instrumented benchmark run 100 times; mean event count and standard deviation calculated) • Comparison/analysis • Report findings PCAT - The University of Texas at El Paso
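
As an illustration of the hardware-reported event count collection step, the sketch below uses PAPI's high-level counter interface to read one event around a workload and accumulates the mean and standard deviation over repeated runs. The event choice (PAPI_LD_INS), the placeholder workload, and the omission of error checking are assumptions made for illustration; this is not the team's actual harness.

    /* Minimal sketch: collect a hardware event count with PAPI over 100 runs. */
    #include <stdio.h>
    #include <math.h>
    #include <papi.h>

    #define RUNS 100

    static volatile double buf[4096];

    static double workload(void)          /* placeholder workload: 4096 loads */
    {
        double s = 0.0;
        for (int i = 0; i < 4096; i++)
            s += buf[i];
        return s;
    }

    int main(void)
    {
        int event[1] = { PAPI_LD_INS };   /* assumed event: load instructions */
        long_long count[1];
        double sum = 0.0, sumsq = 0.0;

        for (int run = 0; run < RUNS; run++) {
            PAPI_start_counters(event, 1);   /* start counting before the workload */
            workload();
            PAPI_stop_counters(count, 1);    /* read and stop counting after it */
            sum   += (double)count[0];
            sumsq += (double)count[0] * (double)count[0];
        }

        double mean   = sum / RUNS;
        double stddev = sqrt(sumsq / RUNS - mean * mean);
        printf("mean = %.1f  std dev = %.1f\n", mean, stddev);
        return 0;
    }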

  10. Validation Micro-benchmark • Simple, usually small program • Stresses a portion of the microarchitecture or memory hierarchy • Its size, simplicity, or execution time facilitates the tracing of its execution path and/or prediction of the number of times an event is generated PCAT - The University of Texas at El Paso

  11. Validation Micro-benchmark • Basic types: • array • loop • in-line • floating-point • Scalable w.r.t. granularity, i.e., number of generated events PCAT - The University of Texas at El Paso

  12. Example – Loop Validation Micro-benchmark for (i = 0; i < number_of_loops; i++) { sequence of 100 instructions with data dependencies that prevent compiler reordering or optimization } Used to stress a particular functional unit, e.g., the load/store unit PCAT - The University of Texas at El Paso
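
A minimal sketch of such a loop validation micro-benchmark, here aimed at the load/store unit, is given below. The chain of three dependent statements stands in for the slide's sequence of 100 instructions, and the use of volatile to keep the compiler from removing the memory traffic is an assumption of this sketch, not necessarily the technique used in the original kernels.

    #include <stdio.h>

    #define NUMBER_OF_LOOPS 1000000L

    /* volatile forces real loads and stores on every reference */
    static volatile double x = 1.0, y = 2.0, z = 3.0;

    int main(void)
    {
        for (long i = 0; i < NUMBER_OF_LOOPS; i++) {
            /* short dependent chain: each statement needs the previous result */
            x = y + 1.0;
            y = z + x;
            z = x + y;
        }
        printf("%f\n", x + y + z);   /* keep the results live */
        return 0;
    }

Because the loop body is fixed, the number of loads, stores, and floating-point additions it generates can be predicted exactly and compared against the hardware-reported counts.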

  13. Configuration Micro-benchmark • Program designed to provide insight into microarchitecture organization and/or the algorithms that control it • Examples • Page size used – for TLB miss counts • Cache prefetch algorithm • Branch prediction buffer size/organization PCAT - The University of Texas at El Paso
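
As a sketch of the first example, a configuration micro-benchmark for inferring the page size used by the data TLB might touch one byte per candidate-page-sized stride and compare the predicted number of compulsory TLB misses with the hardware-reported count. The buffer size, the candidate strides, and the reuse of the PAPI calls from the earlier sketch are illustrative assumptions; a real probe would also use a fresh (or flushed) buffer for each guess so that earlier touches do not warm the TLB.

    #include <stdio.h>
    #include <stdlib.h>
    #include <papi.h>

    /* Touch one byte per candidate page; if the guess matches the real page
       size, the reported data-TLB miss count should be close to the return
       value (one compulsory miss per page touched). */
    static long touch(volatile char *buf, size_t bytes, size_t page_guess)
    {
        long npages = 0;
        for (size_t off = 0; off < bytes; off += page_guess, npages++)
            buf[off] = 1;
        return npages;
    }

    int main(void)
    {
        size_t bytes = 64 * 1024 * 1024;            /* assumed 64 MB buffer */
        volatile char *buf = malloc(bytes);
        int event[1] = { PAPI_TLB_DM };             /* data TLB misses */
        long_long misses[1];

        for (size_t guess = 4096; guess <= 65536; guess *= 2) {
            PAPI_start_counters(event, 1);
            long expected = touch(buf, bytes, guess);
            PAPI_stop_counters(misses, 1);
            printf("guess %6zu B: expected %ld, reported %lld\n",
                   guess, expected, misses[0]);
        }
        free((void *)buf);
        return 0;
    }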

  14. Some Results PCAT - The University of Texas at El Paso

  15. Reported Event Counts: Expected, Consistent and Quantifiable Results • Overhead related to PAPI and other sources is consistent and quantifiable • Reported Event Count – Predicted Event Count = Overhead PCAT - The University of Texas at El Paso
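
A worked instance of that calibration, read together with the per-platform table a few slides later and assuming its R12K Loads entry (46) is the fixed PAPI-related overhead for that event: a hypothetical reported count of 1,000,046 loads would calibrate to 1,000,046 - 46 = 1,000,000 loads attributable to the workload itself. The 1,000,046 figure is illustrative, not a measurement from the study.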

  16. Example 1: Number of Loads Itanium, Power3, and R12K

  17. Example 2: Number of Stores Itanium, Power3, and R12K

  18. Example 2: Number of Stores – Power3 and Itanium Multiplicative

      Platform   MIPS R12K   IBM Power3   Linux/IA-64   Linux/Pentium
      Loads         46           28            86            N/A
      Stores        31          129           N/A

  PCAT - The University of Texas at El Paso

  19. Example 3: Total Number of Floating Point Operations – Pentium II, R10K and R12K, and Itanium [Table: Processor vs. Accurate / Consistent, for Pentium II; MIPS R10K, R12K; Itanium] Even when counters overflow. No overhead due to PAPI. PCAT - The University of Texas at El Paso

  20. Reported Event Counts: Unexpected and Consistent Results --Errors? • The hardware-reported counts are multiples of the predicted counts • Reported Event Count / Multiplier = Predicted Event Count • Cannot identify overhead for calibration PCAT - The University of Texas at El Paso
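
As a worked instance, the backup slide on Itanium compulsory data-TLB misses reports counts consistently about 5 times the predicted values; if that factor is taken as the multiplier, the relationship reads Reported Event Count / 5 ≈ Predicted Event Count. Unlike an additive overhead, though, the multiplier must be known before the reported count can be corrected, which is why these cases cannot simply be calibrated away.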

  21. Example - Total Number of Floating-Point Operations – Power3 [Figure: Accurate / Consistent]

  22. Reported Counts: Expected (Not Quantifiable) Results • Predictions: only possible under special circumstances • Reported event counts seem reasonable • But are they useful without knowing more about the algorithm used by the vendor? PCAT - The University of Texas at El Paso

  23. Example 1: Total Data TLB Misses • Replacement policy can (unpredictably) affect event counts • PAPI may (unpredictably) affect event counts • Other processes may (unpredictably) affect event counts PCAT - The University of Texas at El Paso

  24. Example 2: L1 D-Cache Misses – # of misses relatively constant as # of array references increases

  25. Example 2 Enlarged

  26. Example 3: L1 D-Cache Misses with Random Access (Foil Prefetch Scheme Used by Stream Buffers) [Chart: % error in L1 D-cache misses vs. % of cache filled, for Power3, R12K, and Pentium]

  27. Example 4: A Mathematical Model that Verifies that Execution Time Increases Proportionately with L1 D-Cache Misses

      total_number_of_cycles = iterations * exec_cycles_per_iteration
                             + cache_misses * cycles_per_cache_miss
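
A small numerical instance of the model (all values are illustrative, not measurements from the study): with 1,000,000 iterations at 4 execution cycles per iteration and 50,000 L1 D-cache misses at 10 cycles per miss,

    total_number_of_cycles = 1,000,000 * 4 + 50,000 * 10 = 4,500,000 cycles

Doubling the miss count to 100,000 adds exactly 500,000 cycles, so execution time grows in direct proportion to the reported L1 D-cache miss count, which is what the model is used to verify.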

  28. Reported Event Counts: Unexpected but Consistent Results • Predicted counts and reported counts differ significantly but in a consistent manner • Is this an error? • Are we missing something? PCAT - The University of Texas at El Paso

  29. Example: Compulsory Data TLB Misses • % difference per no. of references • Reported counts are consistent • Vary between platforms

  30. Reported Event Counts: Unexpected Results • Outliers • Puzzles PCAT - The University of Texas at El Paso

  31. Example 1: Outliers L1 D-Cache Misses for Itanium

  32. Example 1: Supporting Data PCAT - The University of Texas at El Paso

  33. Example 2: L1 I-Cache Misses and Instructions Retired - Itanium Both about 17% more than expected.

  34. Future Work • Extend events studied – include multiprocessor events • Extend processors studied – include Power4 • Study sampling on Power4; IBM collaboration re: workload characterization/system resource usage using sampling PCAT - The University of Texas at El Paso

  35. Conclusions • Performance counters provide informative data that can be used for performance tuning • Expected frequency of event may determine usefulness of event counts • Calibration data can make event counts more useful to application programmers (loads, stores, floating-point instructions) • The usefulness of some event counts -- as well as our research – could be enhanced with vendor collaboration • The usefulness of some event counts is questionable without documentation of the related behavior PCAT - The University of Texas at El Paso

  36. Should we attach the following warning to some event counts on some platforms? CAUTION: The values in the performance counters may be greater than you think. PCAT - The University of Texas at El Paso

  37. And should we attach the PCAT Seal of Approval on others? PCAT PCAT - The University of Texas at El Paso

  38. Invitation to Vendors Help us understand what’s going on, when to attach the “warning”,and when to attach the “seal of approval.” Application programmers will appreciate your efforts and so will we! PCAT - The University of Texas at El Paso

  39. Question to You On-board Performance Counters: What do they really tell you? With all the caveats, are they useful nonetheless? PCAT - The University of Texas at El Paso

  40. PCAT - The University of Texas at El Paso

  41. Example 1: Total Compulsory Data TLB Misses for R10K • % difference per no. of references • Predicted values consistently lower than reported • Small standard deviations • Greater predictability with increased no. of references

  42. Example 1: Compulsory Data TLB Misses for Itanium • % difference per no. of references • Reported counts consistently ~5 times greater than predicted

  43. Example 3: Compulsory Data TLB Misses for Power3 • % difference per no. of references • Reported counts consistently ~5/~2 times greater than predicted for small/large counts

  44. Example 3: L1 D-Cache Misses with Random Access – Itanium (only when array size = 10x cache size)

  45. Example 2: L1 D-Cache Misses • On some of the processors studied, as the number of accesses increased, the miss rate approached 0 • Accessing the array in strides of size two cache-size units plus one cache-line resulted in approximately the same event count as accessing the array in strides of one word • What’s going on?
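
The two access patterns compared on this slide can be sketched as follows; the cache and line sizes, the array size, and the wrap-around used to equalize the number of references are assumptions made for illustration.

    #include <stddef.h>

    #define CACHE_SIZE  (32 * 1024)        /* assumed L1 D-cache size in bytes */
    #define LINE_SIZE   64                 /* assumed cache-line size in bytes */
    #define ARRAY_BYTES (64 * CACHE_SIZE)

    static volatile char a[ARRAY_BYTES];

    /* Pattern 1: walk the array in strides of one word. */
    static void stride_one_word(void)
    {
        for (size_t i = 0; i < ARRAY_BYTES; i += sizeof(long))
            a[i]++;
    }

    /* Pattern 2: strides of two cache-size units plus one cache line,
       wrapping around so the same number of references is made. */
    static void stride_two_caches_plus_line(void)
    {
        size_t stride = 2 * CACHE_SIZE + LINE_SIZE;
        size_t refs   = ARRAY_BYTES / sizeof(long);
        size_t i = 0;
        for (size_t n = 0; n < refs; n++) {
            a[i]++;
            i = (i + stride) % ARRAY_BYTES;
        }
    }

    int main(void)
    {
        stride_one_word();
        stride_two_caches_plus_line();
        return 0;
    }

Counting the L1 D-cache misses reported for each pattern over the same number of references is what raised the question on this slide.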

  46. Example 2: R10K Floating-Point Division Instructions

      Sequence A (1 FP instruction counted):
        a = init_value; b = init_value; c = init_value;
        a = b / init_value; b = a / init_value; c = b / init_value;

      Sequence B (3 FP instructions counted):
        a = init_value; b = init_value; c = init_value;
        a = a / init_value; b = b / init_value; c = c / init_value;

  47. Example 2: Assembler Code Analysis

      Sequence A:  l.d s.d l.d s.d l.d s.d l.d l.d div.d s.d l.d l.d div.d s.d l.d l.d div.d s.d
      Sequence B:  l.d s.d l.d s.d l.d s.d l.d l.d div.d s.d l.d l.d div.d s.d l.d l.d div.d s.d

  • No optimization • Same instructions • Different (expected) operands • Three division instructions in both • No reason for different FP counts PCAT - The University of Texas at El Paso
