
A Survey about Performance Counters, Libraries and Tools



Presentation Transcript


  1. A Survey about Performance Counters, Libraries and Tools Joseph Bryant Manzano Franco

  2. Agenda • Introduction • W3H: The Why, The What, The When, and The How • Hardware Performance Libraries • Performance Application Programming Interface (PAPI) • Performance Counters Libraries (PCL) • Visualization Tools • TAU: An example of a data collector • KOJAK: Semi automatic instrumentation tool • VAMPIR: An example of a script language • PE: The All levels approach

  3. Introduction Program Optimization Search for the most effective algorithms and data structures Algorithm Optimization Consider common architecture features such as cache structures Other ubiquitous optimizations Apply architecture-specific characteristics (PIM instructions, atomic loads and stores, massive memory allocations, etc.) Architecture Optimizations Data Collection Identify and solve unexpected problems with the interaction between hardware and software (memory and network bottlenecks, false sharing, poor cache management, etc.) Data Analysis

  4. Introduction: The Why Data Collection Data Analysis High Level Library Functions Manual Analysis Easy to use and available on almost all libraries. Restricted and intrusive Composed of timing functions and clever data manipulation Simple, but limited in its use Prone to human error Automatic Statistical Analysis Performance Counters Organizes the data in a suitable format Still need to deal with numbers Easy to use (especially with high level wrappers) Provide a range of measurements and are less intrusive Visualization Tools Simulation environments Graphical representation of data or its properties. Easy to identify trends even in large sets of data Complete control over the environment including hardware, memory hierarchies and application code. Development is long for new architectures Steep learning curve

  5. Introduction: The What Performance Counters Special registers that are present in a specific architecture Designed to count architectural events • An event is defined as an action that the hardware takes • Predefined • Examples: cache misses / hits, TLB misses / hits, context switches, cache invalidations, total instructions, etc. Sun Ultra SPARC: two 32-bit registers called PIC (Performance Instrumentation Counters). User control restricted Pentium Pro: two 40-bit registers called PerfCtr0/1. User control available

  6. Introduction: The When

  7. Introduction: The How Example: Ultra SPARC Architecture Two counters, 32 bits each Events being counted: Number of Instructions (pic0), and Cache invalidations (pic1) [Figure: two CPUs, each with a cache ($), connected by a bus. The instruction streams shown are load 0,s1; load 1,s2; inc s2; add s1, s2, s1; load 0,s1; store 0,s1. The values of pic0 and pic1 are shown for each CPU as the instructions execute.]

  8. Agenda • Introduction • W3H: The Why, The What, The When, and The How • Hardware Performance Libraries • Performance Application Programming Interface (PAPI) • Performance Counters Libraries (PCL) • Visualization Tools • TAU: An example of a data collector • KOJAK: Semi automatic instrumentation tool • VAMPIR: An example of a script language • PE: The All levels approach

  9. Hardware Performance Libraries • Performance Counters: Good idea, but only accessible to hardware experts. • Solution: High Level Wrappers. • Usually written in C and Fortran. • Easy to make them thread safe and to integrate them into existing code. • Examples: • Performance Application Programming Interface (PAPI) • Performance Counters Library (PCL)

  10. Performance Application Programming Interface • A set of high-level wrapper functions that covers a vast set of architectures and events • Available for Power3, Power4, Ultra SPARC II and III, all flavors of Pentium, Itanium, AMD Athlon, etc. • Well documented, stable and reliable programming interface. • Goals of the PAPI project: • To provide a solid foundation for cross platform performance analysis tools • To present a set of standard definitions for performance metrics on all platforms • To provide a standardized API among users, vendors, and academics • To be easy to use, well documented, and freely available (Excerpt obtained from the PAPI user guide) • PAPI is an effort of the Innovative Computing Laboratory (ICL), which is part of the Department of Computer Science at the University of Tennessee

  11. PAPI • Block Diagram • Portable Layer: High Level API, Low Level API • Machine Dependent Layer: Substrate, Kernel Extensions, Operating System, Hardware Performance Counters • Overhead of PAPI_read() in PAPI 3.0:
      Platform                            Cycles/Call
      Altix (Itanium 2, Madison chip)     1357
      IBM Power 4                         4034
      Itanium 2 (libpfm 2.0)              1606
      Pentium 3 (perfctr 2.4.5)           324
      Pentium 4 (perfctr 2.4.5)           401
      SGI R12k                            3681
      Ultrasparc II                       2150

  12. PAPI: Terminology • Native Events: • Defined as countable by a specific CPU. • Machine dependent • Hexadecimal value and a mask provided by PAPI libraries • Preset Events: • Predefined events. • Events (or groups of events) that are considered useful and relatively ubiquitous across architectures. • A PAPI identifier is provided • Event List: • An array of events (usually consisting of PAPI identifiers)

  13. PAPI: Terminology • High Level API: • A group of functions • A single list of events • Access to Native Events is prohibited. • Flexibility and performance are traded for ease of use • Low Level API: • Another group of functions • Multiple event list definitions and a native events interface. • Only one event list can be running at any point in time

  14. PAPI: Steps
      #include <papi.h>
      #include <stdio.h>

      #define NUM_EVENTS 2

      int main(int argc, char **argv)
      {
          int Events[NUM_EVENTS] = { PAPI_TOT_INS, PAPI_TOT_CYC };
          long_long values[NUM_EVENTS], val2[NUM_EVENTS];
          int a = 0;
          int retval;

          /* Initialization of the PAPI library */
          retval = PAPI_library_init(PAPI_VER_CURRENT);
          if (retval != PAPI_VER_CURRENT) {
              fprintf(stderr, "PAPI library initialization failed\n");
              return 1;
          }

          /* Start the counters */
          PAPI_start_counters(Events, NUM_EVENTS);

          /* Operate on the counters: each read also resets them */
          PAPI_read_counters(values, NUM_EVENTS);  /* discard the start-up counts */
          a++;
          PAPI_read_counters(values, NUM_EVENTS);  /* a++ plus the cost of one read */
          PAPI_read_counters(val2, NUM_EVENTS);    /* the cost of one read alone */

          printf("The value of a is: %i \n", a);
          printf("The Coarse Instructions are: %10lld\n", values[0]);
          printf("The Coarse Cycles are: %10lld\n", values[1]);
          printf("The Overhead Instructions are: %10lld\n", val2[0]);
          printf("The Overhead Cycles are: %10lld\n", val2[1]);
          printf("The Total Instructions are: %10lld\n", values[0] - val2[0]);
          printf("The Total Cycles are: %10lld\n", values[1] - val2[1]);

          /* Stop the counters; de-allocate any resource that has been allocated */
          PAPI_stop_counters(values, NUM_EVENTS);
          return 0;
      }

  15. PAPI: Output
      Assembly output of a++:
          ld  [%fp-52],%l0
          add %l0,1,%l0
          st  %l0,[%fp-52]
          add %fp,-32,%o0
      Program output:
          The value of a is: 1
          The Coarse Instructions are: 179
          The Coarse Cycles are: 641
          The Overhead Instructions are: 175
          The Overhead Cycles are: 395
          The Total Instructions are: 4
          The Total Cycles are: 246
      Note: the first access produces an (L2) cache miss.

  16. PAPI: Extra Features • Thread safe, with multithreading support • Multiplexing where available • Overflow control with thresholds • Statistical Profiling and related functions • Error detection and control features

  17. Performance Counters Libraries • Another example of a high-level performance counter library • Events are classified (as in PAPI) as Memory Hierarchy events (caches, TLB, memory, etc.), Instructions (instruction types, instructions completed, etc.), Status of Functional Units, and rates and ratios. • It supports the Pentium architectures up to Pentium 4, the AMD Athlon / Duron, the IBM Power series up to Power 3-II, Alpha’s 21164 and 21264, SGI’s R10000 and R12000 and the UltraSPARC family of processors • PCL is available for C, C++ and Java • PCL is an effort of Forschungszentrum Juelich GmbH and the University of Applied Sciences Bonn-Rhein-Sieg in Germany, and it is currently in its second version

  18. PCL • High Level API: • Similar to the PAPI High Level API, but the functions are different. • Event lists can be created in this API • Access to predefined events only • Recommended • Low Level API: • Allows direct access to the performance counters • Not recommended • Handle: • A single datum (usually an integer) that is used to uniquely identify a set of resources. • Used to provide a thread-specific link to the resources (the list of events)

  19. PCL: Steps
      #include <pcl.h>
      #include <stdio.h>

      int main(int argc, char **argv)
      {
          int counter_list[2], a = 0;
          int ncounter;
          unsigned int mode;
          PCL_CNT_TYPE i_result_list[2];
          PCL_FP_CNT_TYPE fp_result_list[2];
          PCL_DESCR_TYPE descr;

          /* Initialization of the PCL library */
          PCLinit(&descr);

          ncounter = 2;
          counter_list[0] = PCL_CYCLES;
          counter_list[1] = PCL_INSTR;
          mode = PCL_MODE_USER;

          /* Start the counters */
          PCLstart(descr, counter_list, ncounter, mode);

          /* Operate on the counters */
          a++;

          /* Stop the counters */
          PCLstop(descr, i_result_list, fp_result_list, ncounter);
          printf("%f instructions in %f cycles\n",
                 (double)i_result_list[1], (double)i_result_list[0]);

          /* De-allocate any resource that has been allocated */
          PCLexit(descr);
          return 0;
      }

  20. PCL: Differences with PAPI • Nested function calls enabled • Rates and ratios are function calls in the PAPI libraries • The Low Level API deals with native code as PAPI’s Low Level API does, but its use is not recommended in PCL

  21. Agenda • Introduction • W3H: The Why, The What, The When, and The How • Hardware Performance Libraries • Performance Application Programming Interface (PAPI) • Performance Counters Libraries (PCL) • Visualization Tools • TAU: An example of a data collector • KOJAK: Semi automatic instrumentation tool • VAMPIR: An example of a script language • PE: The All levels approach

  22. Visualization Tools • After the tools have gathered the information, how can it be presented to the user in the most efficient manner? • Visualization tools provide a good way to present trends in data across extensive data sets • Examples of visualization tools: • Tuning and Analysis Utilities • Kit for Objective Judgement and Knowledge-based Detection of Performance Bottlenecks • VAMPIR / VAMPIRTrace • Performance Evaluator

  23. Tuning and Analysis Utilities (TAU) • Program and performance analysis tool framework for high-performance parallel and distributed computing. • A suite of tools for static and dynamic analysis of programs written in C, C++, FORTRAN 77/90, Python, High Performance FORTRAN, and Java. • Instrumentation by functions • The concept of Inclusive and Exclusive • With Time • Exclusive time: the time spent in the function minus all the time spent in functions that are instrumented and called by this function • Inclusive time: total time of the function • With Performance Counter • The same as with time, using the properties of that performance counter • Supported extensions in C and FORTRAN: MPI and OpenMP • Hardware counters supported: PAPI and PCL

  24. TAU Infrastructure

  25. KOJAK • Kit for Objective Judgement and Knowledge-based Detection of Performance Bottlenecks • A complete infrastructure dedicated to finding performance bottlenecks and application properties • Consists of the following components • OpenMP Pragma And Region Instrumentor (OPARI) (redirects the OpenMP function calls and directives toward wrappers that contain instrumentation information (POMP)) and PMPI • TAU (function instrumentation) • Event Processing, Investigating and Logging (EPILOG) runtime library (event-oriented trace creation utility)

  26. KOJAK • Extensible Performance Tool (EXPERT) (trace file analyzer: searches trace files for low-performing sections and classifies them according to severity), which uses the Event Analysis and Recognition Library (EARL) • CUBE (KOJAK’s trace visualization tool) • Trace transformations to different formats (to VAMPIR trace format)

  27. KOJAK Infrastructure

  28. KOJAK Snapshots

  29. KOJAK Snapshots

  30. VAMPIR • A configurable trace visualization tool • Converts trace information into a variety of graphical views: • Process State Display • Statistics Display • Timeline Display • Communications Statistics • Configured by using • Pull-down menus • Configuration file • The displays can be related to the source code • Advanced feature: zoom in and zoom out • Defined trace format: VAMPIR-Trace (runtime library enhanced with trace creation calls)

  31. VAMPIR Infrastructure [Figure: block diagram with the components Source Code, Guide Compiler, Object Files, Linker, VAMPIRTrace Libraries, Guide Libraries, Executable, Config File, Trace File, and VAMPIR.]

  32. VAMPIR Snapshot

  33. Performance Evaluator • Java-based tool • All-level analysis of a program's behavior: • Application Software level analysis • Data / Algorithm Analysis • Operating System level analysis • Thread context switching • Thread scheduling • Hardware Level Analysis • Memory Hierarchy • Uses PMAPI performance counters (IBM proprietary)

  34. Performance Evaluator Infrastructure [Figure: block diagram with the components K42 Infrastructure, AIX OS, Others, Parser / Modifier, PE Trace Format, PE2 Trace Format, and PE2 Visualization Tool. Legend: 1 = Trace Format File, 2 = Map File, 3 = Meta File.]

  35. Performance Evaluator: A Run • Get hardware information from the infrastructures (the source has been instrumented and the OS is also collecting information) • Create: • Trace file(s): trace records of a program with shorthand versions of events • Map file: holds static information about functions, threads and other structures • Meta file(s): properties of a trace, record type definitions and map type definitions

  36. Performance Evaluator: A Run • Feed the files to the tool • Visualize the information with graphs • View the whole application behavior from beginning to end • Complete GUI with the Eclipse Workbench • Designed to work with several multithreaded packages in C and Java • OpenMP not supported

  37. Questions? Comments? Thanks so much for your time
