Intel Trace Collector and Trace Analyzer Evaluation Report


  1. Intel Trace Collector and Trace Analyzer Evaluation Report
     Hans Sherburne, Adam Leko
     UPC Group, HCS Research Laboratory, University of Florida
     Color encoding key: Blue: Information; Red: Negative note; Green: Positive note

  2. Basic Information
     • Name: Intel Trace Collector, Intel Trace Analyzer
     • Developer: Intel
     • Current versions:
       • Intel Trace Collector 5.0.1.0
       • Intel Trace Analyzer 4.0.3.1
     • Website: http://www.intel.com/software/products/cluster
     • Contact: http://premier.intel.com

  3. Intel Cluster Tools Overview
     • A toolkit for creating high-performance applications on Intel’s architectures (x86, IA64)
     • Intel MPI Library
       • Intel’s implementation of MPI
     • Intel Cluster Math Kernel Library
       • Contains several Intel-optimized math routines
       • Also has a version of ScaLAPACK
     • Intel Trace Collector & Trace Analyzer
       • Represent the performance analysis portion of Intel Cluster Tools
       • The two are used in conjunction to analyze performance of parallel applications (mostly MPI):
         • Trace Collector: Provides a method for instrumenting programs and recording performance data
         • Trace Analyzer: Provides graphical representation of trace data from an STF trace file
       • Formerly known as Vampirtrace & Vampir

  4. Trace Collector Overview
     • What can be traced:
       • MPI applications can be traced automatically by linking against the profiling library
         • Records of MPI routine calls
         • Data describing communication (point-to-point and collective)
         • Hardware counter data if available
         • Statistics: function calls, sent messages, collective operations (count, duration, bytes)
       • User-level code can be traced through manual instrumentation using the ITC API
         • User-defined states
         • User-defined counters
       • Non-MPI (distributed) applications can be traced
         • Use the same API calls as for instrumenting user code in MPI apps
       • Binary instrumentation without recompilation is possible
         • Use the itcinstrument tool
         • Must use MPI or must explicitly initialize/finalize Trace Collector
       • Java programs

  5. Trace Collector Libraries
     • ITC offers four different libraries for creating trace files, each with different operating characteristics
     • libVT
       • Contains wrapper functions for automatic logging of MPI calls
       • Offers extended functionality through an API for logging of user-defined data
     • libVTnull
       • Contains dummy versions of API calls
     • libVTfs
       • Same functionality as libVT
       • Trace file writing is done via TCP sockets
       • In case of failure, trace data is not lost
     • libVTcs
       • Similar to libVTfs in that it uses TCP sockets to write tracefiles
       • Does not automatically log MPI calls
       • Requires that a process be explicitly designated as the server that coordinates trace file creation

  6. Structured Trace File Format (STF)
     • Structured Trace File format is the default format for traces
     • Data is divided into logical frames, which helps to partition data for large-scale programs with large traces (possibly GBs)
       • Time axis
       • Location axis
       • Type of data (state, collective operations, point-to-point messages, counter values, MPI-IO)
     • Indexing allows for quick random access
     • Uses multiple files
       • File division does not necessarily reflect frame division
       • Allows for parallelism in reading and writing
       • Documentation does not detail the inner workings
     • Can be converted to single-file STF for ease of file handling and transmission
     • No documentation is provided on how to actually construct STF trace files without using Trace Collector

  7. STF Utilities
     • STF files can be manipulated using stftool and xstftool:
       • Extract various data
       • Manipulate frames and groupings
       • Convert STF files into AVT or XVT
     • AVT
       • Format used by previous versions of Vampir
       • Should be understood by Trace Analyzer
       • Created by other existing tools
     • XVT
       • Similar to AVT in syntax
       • Replaces integer descriptors with more easily understood titles
     • Combine all data in one file
     • Alternatives are human and script readable
     • No means is provided to facilitate importing the data into another tool

  8. Trace Collector API
     • Intel Trace Collector offers an API to
       • Trace user code in detail
       • Trace non-MPI distributed apps
     • Functions are defined to:
       • Record user-defined states in the trace
       • Record user-defined communication events in the trace
       • Record source code locations for correlation in Intel Trace Analyzer
       • Record user-defined counters in the trace
       • Define process groupings used in Trace Analyzer
       • Define frames (recommended to use config options instead)
       • Turn tracing on and off during execution
       • Enable tracing of multithreaded applications
       • Initialize and finalize Intel Trace Collector (needed for non-MPI applications)
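
The kinds of calls listed on this slide (user-defined states, user-defined counters, turning tracing on and off) can be illustrated with a small in-memory sketch. Everything here is hypothetical: the class `SketchTracer` and its method names are invented for illustration and are not the Intel Trace Collector API, which is a C/Fortran library documented in the ITC User's Guide.

```python
import time


class SketchTracer:
    """Hypothetical in-memory tracer; NOT the Intel Trace Collector API."""

    def __init__(self):
        self.enabled = True
        self.events = []  # (timestamp, kind, name, value)

    def trace_on(self):
        self.enabled = True

    def trace_off(self):
        self.enabled = False

    def _record(self, kind, name, value=None):
        # Events are dropped while tracing is switched off.
        if self.enabled:
            self.events.append((time.perf_counter(), kind, name, value))

    def begin_state(self, name):
        self._record("begin", name)

    def end_state(self, name):
        self._record("end", name)

    def counter(self, name, value):
        self._record("counter", name, value)


tracer = SketchTracer()

tracer.begin_state("setup")
tracer.end_state("setup")

tracer.trace_off()              # skip an uninteresting phase
tracer.begin_state("warmup")    # not recorded
tracer.end_state("warmup")      # not recorded
tracer.trace_on()

tracer.begin_state("solve")
tracer.counter("iterations", 42)  # user-defined counter sample
tracer.end_state("solve")

# Drop timestamps so the recorded sequence is easy to inspect.
recorded = [(kind, name, value) for _, kind, name, value in tracer.events]
```

Switching tracing off around the warm-up phase keeps it out of the event list, which is the same motivation as limiting what ends up in a large trace file.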

  9. Trace Collector Overhead
     • All programs executed correctly when instrumented
     • Benchmarks marked with a star had high variability in execution time
       • Readings with stars are probably not accurate
     • In most cases overhead was less than 8%
     • Wasn’t able to test the overhead of hardware counter instrumentation
     • However, trace file writing for class B LU with 32 processes took almost 20 minutes!
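
As a rough sketch of how an overhead figure like the ones on this slide can be computed, the snippet below compares mean run times with and without instrumentation and flags benchmarks whose timings vary too much to trust. The function names, the 10% variability threshold, and the sample timings are assumptions made for illustration; they are not taken from the report.

```python
from statistics import mean, stdev


def overhead_percent(baseline_runs, traced_runs):
    """Relative slowdown of the traced runs over the baseline, in percent."""
    base = mean(baseline_runs)
    if base <= 0:
        raise ValueError("baseline run time must be positive")
    return (mean(traced_runs) - base) / base * 100.0


def is_highly_variable(runs, threshold=0.10):
    """True if the coefficient of variation exceeds the (assumed) threshold."""
    return stdev(runs) / mean(runs) > threshold


# Hypothetical wall-clock times in seconds.
baseline = [100.0, 100.0, 100.0]
traced = [108.0, 108.0, 108.0]
print(f"overhead: {overhead_percent(baseline, traced):.1f}%")
print("noisy baseline:", is_highly_variable(baseline))
```

A benchmark for which `is_highly_variable` returns `True` corresponds to a starred reading: the computed overhead is then dominated by run-to-run noise.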

  10. Trace Analyzer
     • Intel Trace Analyzer (ITA) is a visualization program
       • Reads STF tracefiles
       • Tracefiles from previous versions should also work
     • ITA can display:
       • Event-based data (including messages)
       • Statistical data
       • Counter data if it is contained in the trace
     • Displays may represent a view of:
       • Multiple processes
       • Individual processes
       • Group of processes (depending on selected filtering options)
       • Single process
     • Possible to configure the views in various ways
       • Activities / Symbols
       • Absolute Time / Scaled (percentage of total) Time
       • Number of processes displayed at once
       • Colors used for activities

  11. Trace Analyzer (2)
     • Data from a large trace file can be viewed in increments
       • Select the appropriate frames from the STF file
     • Views may be linked to the visible portion of the zoomed timeline
     • Pre-computed statistical data can be viewed without loading trace data
     • General Notes on ITA Interface
       • Uses X-windows
       • Is quite stable
       • Provides good interface responsiveness
       • Interface is intuitive (for the most part)
     • ITA is not capable of automatic analysis of trace data.

  12. Trace Analyzer Views
     • Summary Chart Display
       • Allows the user to see how much work is spent in MPI calls
     • Timeline Display
       • Zoomable, scrollable timeline representation of program execution
     [Figures: Summary Chart, Timeline Display]

  13. Trace Analyzer Views (2)
     • Summary Timeline
       • Timeline/histogram representation showing the number of processes in each activity per time bin
     • Counter Timeline
       • Value-over-time representation (behavior depends on counter definition in trace)
     [Figures: Summary Timeline, Counter Timeline]
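
The Summary Timeline idea (number of processes in each activity per time bin) can be sketched as follows. This is a simplified illustration and not how Trace Analyzer computes the view: the tuple layout, the function name `summary_timeline`, and sampling each bin at its midpoint are all assumptions.

```python
from collections import Counter


def summary_timeline(intervals, t_end, n_bins):
    """Count, per time bin, how many processes are in each activity.

    intervals: iterable of (process, activity, start, end) tuples.
    Each bin is sampled at its midpoint (a simplifying assumption).
    """
    width = t_end / n_bins
    bins = []
    for i in range(n_bins):
        midpoint = (i + 0.5) * width
        counts = Counter()
        for _process, activity, start, end in intervals:
            if start <= midpoint < end:
                counts[activity] += 1
        bins.append(counts)
    return bins


# Hypothetical trace of two processes over four seconds.
trace = [
    (0, "MPI", 0.0, 2.0),
    (0, "Application", 2.0, 4.0),
    (1, "Application", 0.0, 3.0),
    (1, "MPI", 3.0, 4.0),
]
for i, counts in enumerate(summary_timeline(trace, t_end=4.0, n_bins=4)):
    print(i, dict(counts))
```

Plotting the per-bin counts as stacked bars gives a histogram-over-time picture of the kind the slide describes.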

  14. Trace Analyzer Views (3)
     • Message Statistics Display
       • Message data to/from each process (count, length, rate, duration)
     • Process Profile Display
       • Per-process data regarding activities
     [Figures: Message Statistics, Process Profile Display]
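
The per-pair figures named for the Message Statistics Display (count, length, rate, duration) amount to a simple aggregation over message records. The sketch below assumes a hypothetical record layout of (sender, receiver, bytes, send time, receive time); it is not the STF format, whose layout is undocumented.

```python
from collections import defaultdict


def message_statistics(messages):
    """Aggregate count, bytes, duration, and rate per (sender, receiver) pair."""
    stats = defaultdict(lambda: {"count": 0, "bytes": 0, "seconds": 0.0})
    for sender, receiver, nbytes, t_send, t_recv in messages:
        entry = stats[(sender, receiver)]
        entry["count"] += 1
        entry["bytes"] += nbytes
        entry["seconds"] += t_recv - t_send
    result = {}
    for pair, entry in stats.items():
        # Rate in bytes per second; zero when no transfer time was recorded.
        rate = entry["bytes"] / entry["seconds"] if entry["seconds"] > 0 else 0.0
        result[pair] = {**entry, "rate": rate}
    return result


# Hypothetical messages, all sent to process 0.
messages = [
    (1, 0, 100, 0.0, 0.5),
    (1, 0, 300, 1.0, 1.5),
    (2, 0, 50, 0.0, 0.25),
]
for pair, entry in sorted(message_statistics(messages).items()):
    print(pair, entry)
```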

  15. Trace Analyzer Views (4)
     • Statistics Display
       • Various statistics regarding activities in histogram, table, or text format
     • Call Tree Display
     [Figures: Statistics Display, Call Tree Display]

  16. Trace Analyzer Views (5)
     • Source View
       • Source code correlation with events in Timeline
     • Activity Chart
       • Per-process histograms of Application and MPI activity
     [Figures: Source View, Activity Chart]

  17. Trace Analyzer Views (6)
     • Process Timeline
       • Activity timeline and counter timeline for a single process
     • Process Activity Chart
       • Same type of information as Global Summary Chart
     • Process Call Tree
       • Same type of information as Global Call Tree
     [Figures: Process Timeline, Process Activity Chart & Call Tree]

  18. Bottleneck Identification Test Suite
     • Testing metric: what did trace visualization tell us (automatic instrumentation)?
     • CAMEL: PASSED
       • Identified large number of small messages at beginning of program execution
       • Easily see that MPI calls take up a small portion of run time (<3%)
     • NAS LU: PASSED
       • Showed communication bottlenecks very clearly
         • Large(!) number of small messages
         • Shows sensitivity to latency for processors waiting on data from other processors
       • “W” Class: 18 MB trace file
         • Loads quickly
       • “B” Class: 240 MB trace file
         • Loads slowly (2-3 min.), responsiveness of program is diminished
         • However, can be loaded in small pieces that load much faster
         • Some information is available without loading any frames
         • Took nearly 20 minutes to write trace after program completion!

  19. Bottleneck Identification Test Suite (2)
     • Big message: PASSED
       • Traces illustrated large amount of time spent in send and receive
     • Diffuse procedure: PASSED
       • Traces illustrated a lot of synchronization with each process executing user code in an exclusive, alternating manner
     • Hot procedure: TOSS-UP
       • Assuming hardware counters work, would be easy to see extra CPU utilization
       • Manually instrumenting code would improve accuracy of source code correlation
     • Intensive server: PASSED
       • Trace clearly shows that all processes communicate with a single process whose response time is delayed by user code
     • Ping pong: PASSED
       • Traces illustrated that most time is spent in MPI code sending and receiving messages, with little time spent in user code
     • Random barrier: PASSED
       • Traces show that there are many barriers, with each one held up by a random processor in user code
     • Small messages: PASSED
       • Traces illustrated a large number of messages being sent to node 0
     • System time: TOSS-UP
       • Hardware counter timeline might be able to indicate the bottleneck if the counters were working
     • Wrong way: PASSED
       • Trace shows that first receive takes a long time, but the rest of the messages sent during this time period are received quickly

  20. General Comments
     • Intel Trace Collector/Analyzer are very popular and effective tools for creating and displaying trace files.
     • These tools are proprietary and closed source.
     • Analyzing performance of MPI applications is the primary intended use.
     • Support for analyzing non-MPI applications is provided via an API and a special library (libVTcs, which allows for coordination of tracefile creation without MPI).
     • Performance analysis requires the user to have a good understanding of the types of problems likely to affect performance.
       • No automatic detection of bottlenecks

  21. Evaluation (1)
     • Available metrics: 4.5/5
       • Can use PAPI
       • Many metrics (event-based and counter-based) are available, but it is not possible to create custom metrics as in Paraver
     • Cost: 3/5
       • A single-user license costs ~$500
       • Multiple-user licenses are for a single cluster only
         • A 20-user license costs ~$5,000
         • A 100-user license costs ~$15,000
         • An unlimited-user license costs ~$30,000
     • Documentation quality: 4/5
       • Documentation covers most of the features in a clear and consistent fashion
       • Trace Analyzer documentation includes a section that walks a user through the process of analyzing a trace file for bottlenecks through a sample scenario
       • However, some parts of the documentation are confusing if the document is not read in its entirety
       • Doesn’t describe inner workings of trace collection/display
     *Note: evaluated IA-32 MPICH Linux version
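
To compare the license tiers quoted under Cost, the approximate prices can be divided by the number of users. The prices below are the rounded figures from the slide; the unlimited tier is left out because it has no fixed user count.

```python
# Approximate list prices from the slide, keyed by number of users.
license_prices = {1: 500, 20: 5000, 100: 15000}

cost_per_user = {users: price / users for users, price in license_prices.items()}

for users, per_user in sorted(cost_per_user.items()):
    print(f"{users:>3}-user license: ~${per_user:.0f} per user")
```

On these figures the per-user cost falls from roughly $500 to $250 to $150 as the tier grows.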

  22. Evaluation (2)
     • Extensibility: 0/5
       • Commercial (no source)
       • Trace file format is not documented
       • However, could possibly use distributed application tracing features to create traces
     • Filtering and aggregation: 4/5
       • Much of what is recorded in trace files can be controlled through a configuration file (or command line arguments)
       • Some post-mortem filtering and aggregation can be controlled from within Trace Analyzer, but it is not as customizable as other tools
     • Hardware support: 1/5
       • Supports only systems using Intel IA-32, Itanium 2, or Intel Extended Memory 64
     • Heterogeneity support: 5/5
       • Through the use of libVTcs one may manually instrument the code of distributed applications across heterogeneous platforms
       • No automatic event capturing for heterogeneous applications, however

  23. Evaluation (3)
     • Installation: 4.5/5
       • Install was very simple and worked immediately
       • However, I was never able to get hardware counters to function due to incompatibilities with installed PAPI and getrusage
     • Interoperability: 1/5
       • Trace Analyzer is capable of reading the older Vampirtrace trace file format, which can be output by some other tools
       • Tracefiles can be output in (or converted to) the older ASCII-based Vampirtrace trace file format
     • Learning curve: 4.5/5
       • The most important and useful views and features are intuitive and easy to understand
       • Some features seem a bit redundant or oddly named
     • Manual overhead: 3/5
       • MPI call tracing is done automatically by linking against the profiling library
       • Can also instrument all functions or a handful of functions using binary instrumentation
       • More detailed tracing information requires manually inserting API function calls
       • A null library is included so that binaries utilizing API function calls need not be altered

  24. Evaluation (4)
     • Measurement accuracy: 4/5
       • CAMEL overhead ~5%
       • Tracing overhead is negligible
       • However, sometimes Trace Analyzer finds reversed messages that shouldn’t be there
     • Multiple executions: 1/5
       • Multiple instances of Trace Analyzer can be opened at once, but comparing views must be done manually
       • Some support is offered for comparing statistics between two different tracefiles, but it is greatly limited (difference or quotient of histograms between two runs)
     • Multiple analyses & views: 4/5
       • A number of common, useful views are available
       • However, the values displayed are not as customizable as other tools
       • No automatic analysis is offered
       • Analysis can be performed by examining timelines, histograms, or textual representations
     • Performance bottleneck identification: 4.5/5
       • No automatic detection
       • Views provided should allow for manual detection of most common bottlenecks
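
The two-run comparison mentioned under Multiple executions (difference or quotient of histograms) can be sketched as below. The dictionaries of per-activity times, the function name `compare_histograms`, and the choice to report `None` where a quotient is undefined are assumptions for illustration, not a description of Trace Analyzer's implementation.

```python
def compare_histograms(run_a, run_b):
    """Return per-activity difference (b - a) and quotient (b / a) of two runs."""
    activities = sorted(set(run_a) | set(run_b))
    difference = {}
    quotient = {}
    for activity in activities:
        a = run_a.get(activity, 0.0)
        b = run_b.get(activity, 0.0)
        difference[activity] = b - a
        # The quotient is undefined when the activity is absent from run A.
        quotient[activity] = b / a if a else None
    return difference, quotient


# Hypothetical seconds spent per activity in two runs.
run_a = {"MPI_Send": 2.0, "MPI_Barrier": 4.0}
run_b = {"MPI_Send": 3.0, "MPI_Barrier": 2.0, "MPI_Recv": 1.0}

difference, quotient = compare_histograms(run_a, run_b)
print(difference)
print(quotient)
```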

  25. Evaluation (5)
     • Profiling/tracing support: 5/5
       • Both tracing (recording events and messages) and profiling (recording statistics) are supported and can be used independently of each other
     • Response time: 2/5
       • No data at all until after the run has completed and the tracefile has been opened
       • Some information is available without fully loading the tracefile
       • Large trace files can take a long time to write out and read back in
     • Searching: 0/5 (not supported)
     • Software support: 4.5/5
       • MPI profiling interface should permit use with many MPI implementations (support of Intel, LAM, and MPICH is explicitly offered)
       • Full support is available for C/C++ and Fortran, with some support for Java and OpenMP

  26. Evaluation (6)
     • Source code correlation: 4/5
       • All MPI calls on the timeline offer source code correlation with a click
       • User code correlation requires more manual effort
     • System stability: 4.5/5
       • Trace Analyzer crashed (segmentation fault) only once throughout the evaluation
       • Trace Collector never caused an application to fail
     • Technical support: 4/5
       • Quick initial response through support webpage (a few hours)
       • Subsequent responses required a few days

  27. References
     • Intel Trace Analyzer 4.0
       • User’s Guide 4.0.3.0
     • Intel Trace Collector - IA32-LIN-MPICH PRODUCT.5.0.1.0
       • User’s Guide PRODUCT 5.0.1.0
