Intel Trace Collector and Trace Analyzer Evaluation Report
Hans Sherburne, Adam Leko
UPC Group, HCS Research Laboratory, University of Florida

Color encoding key:
• Blue: Information
• Red: Negative note
• Green: Positive note
Basic Information
• Name: Intel Trace Collector, Intel Trace Analyzer
• Developer: Intel
• Current versions:
  • Intel Trace Collector 5.0.1.0
  • Intel Trace Analyzer 4.0.3.1
• Website: http://www.intel.com/software/products/cluster
• Contact: http://premier.intel.com
Intel Cluster Tools Overview
• A toolkit for creating high-performance applications on Intel architectures (x86, IA-64)
• Intel MPI Library
  • Intel's implementation of MPI
• Intel Cluster Math Kernel Library
  • Contains several Intel-optimized math routines
  • Also includes a version of ScaLAPACK
• Intel Trace Collector & Trace Analyzer
  • Represent the performance analysis portion of Intel Cluster Tools
  • The two are used in conjunction to analyze the performance of parallel applications (mostly MPI):
    • Trace Collector: provides a method for instrumenting programs and recording performance data
    • Trace Analyzer: provides graphical representations of trace data from STF trace files
  • Formerly known as Vampirtrace & Vampir
Trace Collector Overview
• What can be traced:
  • MPI applications can be traced automatically by linking against the profiling library (see the sketch after this slide)
    • Records of MPI routine calls
    • Data describing communication (point-to-point and collective)
    • Hardware counter data, if available
    • Statistics: function calls, sent messages, collective operations (count, duration, bytes)
  • User-level code can be traced through manual instrumentation using the ITC API
    • User-defined states
    • User-defined counters
  • Non-MPI (distributed) applications can be traced
    • Uses the same API calls as instrumenting user code in MPI apps
  • Binary instrumentation without recompilation is possible
    • Use the itcinstrument tool
    • Must use MPI or must explicitly initialize/finalize Trace Collector
  • Java programs
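A minimal sketch of the automatic tracing path is below: an ordinary MPI program needs no source changes, only a relink against the profiling library. The link line in the comment is an assumed example, not the exact command; the ITC User's Guide gives the platform-specific flags.

    /*
     * Minimal sketch: MPI tracing with no source changes, only a relink
     * against the ITC profiling library. Assumed (not verified) link line:
     *     mpicc -o ring ring.c -lVT   (plus any extra libraries ITC needs)
     */
    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        int rank, size, token = 0;

        MPI_Init(&argc, &argv);               /* wrapped by libVT: tracing starts */
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        /* Each send/receive below is recorded as a point-to-point event. */
        if (size > 1) {
            if (rank == 0) {
                MPI_Send(&token, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
                MPI_Recv(&token, 1, MPI_INT, size - 1, 0, MPI_COMM_WORLD,
                         MPI_STATUS_IGNORE);
            } else {
                MPI_Recv(&token, 1, MPI_INT, rank - 1, 0, MPI_COMM_WORLD,
                         MPI_STATUS_IGNORE);
                token++;
                MPI_Send(&token, 1, MPI_INT, (rank + 1) % size, 0, MPI_COMM_WORLD);
            }
        }
        printf("rank %d: token = %d\n", rank, token);

        MPI_Finalize();                       /* wrapped by libVT: STF trace written */
        return 0;
    }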
Trace Collector Libraries
• ITC offers four different libraries for creating trace files, each with different operating characteristics:
• libVT
  • Contains wrapper functions for automatic logging of MPI calls
  • Offers extended functionality through an API for logging user-defined data
• libVTnull
  • Contains dummy versions of the API calls
• libVTfs
  • Same functionality as libVT
  • Trace file writing is done via TCP sockets
  • In case of failure, trace data is not lost
• libVTcs
  • Similar to libVTfs in that it uses TCP sockets to write trace files
  • Does not automatically log MPI calls
  • Requires that a process be explicitly designated as the server that coordinates trace file creation
Structured Trace File Format (STF)
• Structured Trace File format is the default format for traces
• Data is divided into logical frames, which helps to partition data for large-scale programs with large traces (possibly GBs)
  • Time axis
  • Location axis
  • Type of data (states, collective operations, point-to-point messages, counter values, MPI-IO)
• Indexing allows for quick random access
• Uses multiple files
  • File division does not necessarily reflect frame division
  • Allows for parallelism in reading and writing
  • Documentation does not detail the inner workings
• Can be converted to single-file STF for ease of file handling and transmission
• No documentation is provided on how to actually construct STF trace files without using Trace Collector
STF Utilities
• STF files can be manipulated using stftool and xstftool:
  • Extract various data
  • Manipulate frames and groupings
  • Convert STF files into AVT or XVT
• AVT
  • Format used by previous versions of Vampir
  • Should be understood by Trace Analyzer
  • Created by other existing tools
• XVT
  • Similar to AVT in syntax
  • Replaces integer descriptors with more easily understood titles
  • Combines all data in one file
• The alternative formats are human- and script-readable
• No means is provided to facilitate importing the data into another tool
Trace Collector API
• Intel Trace Collector offers an API to:
  • Trace user code in detail
  • Trace non-MPI distributed apps
• Functions are defined to (see the sketch after this slide):
  • Record user-defined states in the trace
  • Record user-defined communication events in the trace
  • Record source code locations for correlation in Intel Trace Analyzer
  • Record user-defined counters in the trace
  • Define process groupings used in Trace Analyzer
  • Define frames (using config options instead is recommended)
  • Turn tracing on and off during execution
  • Enable tracing of multithreaded applications
  • Initialize and finalize Intel Trace Collector (needed for non-MPI applications)
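A minimal sketch of manual instrumentation with this API follows. The call names (VT_classdef, VT_funcdef, VT_begin, VT_end, VT_initialize, VT_finalize) match the API described in the Trace Collector User's Guide, but treat the exact signatures as assumptions and check VT.h for your version.

    /*
     * Minimal sketch of manual instrumentation with the ITC API; exact
     * signatures are assumptions -- verify against VT.h for your version.
     */
    #include <mpi.h>
    #include <VT.h>                /* Intel Trace Collector API header */

    static int solver_class;       /* handle for a user-defined activity class */
    static int solve_state;        /* handle for a user-defined state */

    static void solve(void)
    {
        VT_begin(solve_state);     /* enter the user-defined state */
        /* ... computational kernel being timed ... */
        VT_end(solve_state);       /* leave the user-defined state */
    }

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);    /* libVT initializes ITC here; a non-MPI
                                      program would call VT_initialize() */

        /* Define a class and a state once, before first use. */
        VT_classdef("Solver", &solver_class);
        VT_funcdef("solve", solver_class, &solve_state);

        solve();

        MPI_Finalize();            /* flushes and writes the STF trace; a
                                      non-MPI program calls VT_finalize() */
        return 0;
    }

Because libVTnull provides dummy versions of these calls, the same instrumented binary can be linked for production runs without any tracing overhead.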
Trace Collector Overhead
• All programs executed correctly when instrumented
• Benchmarks marked with a star had high variability in execution time
  • Readings with stars are probably not accurate
• In most cases overhead was less than 8%
• We were unable to test the overhead of hardware counter instrumentation
• However, trace file writing for class B LU with 32 processes took almost 20 minutes!
Trace Analyzer
• Intel Trace Analyzer (ITA) is a visualization program
  • Reads STF trace files
  • Trace files from previous versions should also work
• ITA can display:
  • Event-based data (including messages)
  • Statistical data
  • Counter data, if it is contained in the trace
• Displays may represent a view of:
  • Multiple processes
    • Individual processes
    • Group of processes (depending on selected filtering options)
  • Single process
• It is possible to configure the views in various ways:
  • Activities / Symbols
  • Absolute time / Scaled (percentage of total) time
  • Number of processes displayed at once
  • Colors used for activities
Trace Analyzer (2)
• Data from a large trace file can be viewed in increments
  • Select the appropriate frames from the STF file
• Views may be linked to the visible portion of a zoomed timeline
• Pre-computed statistical data can be viewed without loading trace data
• General notes on the ITA interface:
  • Uses X Windows
  • Is quite stable
  • Provides good interface responsiveness
  • Interface is intuitive (for the most part)
• ITA is not capable of automatic analysis of trace data
Trace Analyzer Views
• Summary Chart Display
  • Allows the user to see how much work is spent in MPI calls
• Timeline Display
  • Zoomable, scrollable timeline representation of program execution
[Screenshots: Summary Chart, Timeline Display]
Trace Analyzer Views (2)
• Summary Timeline
  • Timeline/histogram representation showing the number of processes in each activity per time bin
• Counter Timeline
  • Value-over-time representation (behavior depends on the counter definition in the trace)
[Screenshots: Summary Timeline, Counter Timeline]
Trace Analyzer Views (3)
• Message Statistics Display
  • Message data to/from each process (count, length, rate, duration)
• Process Profile Display
  • Per-process data regarding activities
[Screenshots: Message Statistics, Process Profile Display]
Trace Analyzer Views (4)
• Statistics Display
  • Various statistics regarding activities in histogram, table, or text format
• Call Tree Display
[Screenshots: Statistics Display, Call Tree Display]
Trace Analyzer Views (5)
• Source View
  • Source code correlation with events in the Timeline
• Activity Chart
  • Per-process histograms of Application and MPI activity
[Screenshots: Source View, Activity Chart]
Trace Analyzer Views (6)
• Process Timeline
  • Activity timeline and counter timeline for a single process
• Process Activity Chart
  • Same type of information as the global Summary Chart
• Process Call Tree
  • Same type of information as the global Call Tree
[Screenshots: Process Timeline, Process Activity Chart & Call Tree]
Bottleneck Identification Test Suite
• Testing metric: what did trace visualization tell us (automatic instrumentation)?
• CAMEL: PASSED
  • Identified a large number of small messages at the beginning of program execution
  • Easy to see that MPI calls take up a small portion of run time (<3%)
• NAS LU: PASSED
  • Showed communication bottlenecks very clearly
    • Large(!) number of small messages
    • Shows sensitivity to latency for processors waiting on data from other processors
  • "W" class: 18 MB trace file
    • Loads quickly
  • "B" class: 240 MB trace file
    • Loads slowly (2-3 min.); responsiveness of the program is diminished
    • However, it can be loaded in small pieces that load much faster
    • Some information is available without loading any frames
    • Took nearly 20 minutes to write the trace after program completion!
Bottleneck Identification Test Suite (2)
• Big message: PASSED
  • Traces illustrated the large amount of time spent in send and receive
• Diffuse procedure: PASSED
  • Traces illustrated a lot of synchronization, with each process executing user code in an exclusive, alternating manner
• Hot procedure: TOSS-UP
  • Assuming hardware counters work, it would be easy to see extra CPU utilization
  • Manually instrumenting the code would improve the accuracy of source code correlation
• Intensive server: PASSED
  • Trace clearly shows that all processes communicate with a single process whose response time is delayed by user code
• Ping pong: PASSED
  • Traces illustrated that most time is spent in MPI code sending and receiving messages, with little time spent in user code
• Random barrier: PASSED
  • Traces show that there are many barriers, with each one held up by a random processor in user code
• Small messages: PASSED
  • Traces illustrated a large number of messages being sent to node 0
• System time: TOSS-UP
  • The hardware counter timeline might be able to indicate this bottleneck if counters were working
• Wrong way: PASSED
  • Trace shows that the first receive takes a long time, but the rest of the messages sent during this time period are received quickly
General Comments
• Intel Trace Collector/Analyzer are very popular and effective tools for creating and displaying trace files
• These tools are proprietary and closed source
• Analyzing the performance of MPI applications is the primary intended use
• Support for analyzing non-MPI applications is provided via an API and a special library (libVTcs, which allows coordination of trace file creation without MPI)
• Performance analysis requires the user to have a good understanding of the types of problems likely to affect performance
  • No automatic detection of bottlenecks
Evaluation (1)
• Available metrics: 4.5/5
  • Can use PAPI
  • Many metrics (event-based and counter-based) are available, but it is not possible to create custom metrics as in Paraver
• Cost: 3/5
  • A single-user license costs ~$500
  • Multiple-user licenses are for a single cluster only
    • A 20-user license costs ~$5,000
    • A 100-user license costs ~$15,000
    • An unlimited-user license costs ~$30,000
• Documentation quality: 4/5
  • Documentation covers most of the features in a clear and consistent fashion
  • Trace Analyzer documentation includes a section that walks a user through the process of analyzing a trace file for bottlenecks via a sample scenario
  • However, some parts of the documentation are confusing if the document is not read in its entirety
  • Doesn't describe the inner workings of trace collection/display
*Note: evaluated the IA-32 MPICH Linux version
Evaluation (2)
• Extensibility: 0/5
  • Commercial (no source)
  • Trace file format is not documented
  • However, one could possibly use the distributed application tracing features to create traces
• Filtering and aggregation: 4/5
  • Much of what is recorded in trace files can be controlled through a configuration file or command line arguments (see the sketch after this slide)
  • Some post-mortem filtering and aggregation can be controlled from within Trace Analyzer, but it is not as customizable as in other tools
• Hardware support: 1/5
  • Supports only systems using Intel IA-32, Itanium 2, or Intel Extended Memory 64
• Heterogeneity support: 5/5
  • Through the use of libVTcs, one may manually instrument the code of distributed applications across heterogeneous platforms
  • No automatic event capturing for heterogeneous applications, however
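As an illustration of the configuration mechanism, below is a hedged sketch of a Trace Collector configuration file (selected via the VT_CONFIG environment variable). The directive names are recalled from the ITC User's Guide and should be verified against the installed version before use.

    # Hedged sketch of an ITC configuration file; point VT_CONFIG at it.
    # Directive names are recalled from the User's Guide -- verify them.
    LOGFILE-NAME    myrun.stf      # name of the resulting trace
    LOGFILE-FORMAT  STFSINGLE      # single-file STF for easier handling
    ACTIVITY MPI    ON             # record MPI activities...
    SYMBOL MPI_Barrier OFF         # ...but filter out one noisy symbol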
Evaluation (3)
• Installation: 4.5/5
  • Install was very simple and worked immediately
  • However, we were never able to get hardware counters to function, due to incompatibilities with the installed PAPI and getrusage
• Interoperability: 1/5
  • Trace Analyzer is capable of reading the older Vampirtrace trace file format, which can be output by some other tools
  • Trace files can be output in (or converted to) the older ASCII-based Vampirtrace trace file format
• Learning curve: 4.5/5
  • The most important and useful views and features are intuitive and easy to understand
  • Some features seem a bit redundant or oddly named
• Manual overhead: 3/5
  • MPI call tracing is done automatically by linking against the profiling library
  • Can also instrument all functions, or a handful of functions, using binary instrumentation
  • More detailed tracing information requires manually inserting API function calls
  • A null library is included so that binaries utilizing API function calls need not be altered
Evaluation (4)
• Measurement accuracy: 4/5
  • CAMEL overhead ~5%
  • Tracing overhead is negligible
  • However, Trace Analyzer sometimes finds reversed messages that shouldn't be there
• Multiple executions: 1/5
  • Multiple instances of Trace Analyzer can be opened at once, but comparing views must be done manually
  • Some support is offered for comparing statistics between two different trace files, but it is greatly limited (difference or quotient of histograms between two runs)
• Multiple analyses & views: 4/5
  • A number of common, useful views are available
  • However, the values displayed are not as customizable as in other tools
  • No automatic analysis is offered
  • Analysis can be performed by examining timelines, histograms, or textual representations
• Performance bottleneck identification: 4.5/5
  • No automatic detection
  • The views provided should allow for manual detection of most common bottlenecks
Evaluation (5)
• Profiling/tracing support: 5/5
  • Both tracing (recording events and messages) and profiling (recording statistics) are supported and can be used independently of each other
• Response time: 2/5
  • No data at all until after the run has completed and the trace file has been opened
  • Some information is available without fully loading the trace file
  • Large trace files can take a long time to write out and read back in
• Searching: 0/5 (not supported)
• Software support: 4.5/5
  • The MPI profiling interface should permit use with many MPI implementations (support for Intel, LAM, and MPICH is explicitly offered)
  • Full support is available for C/C++ and Fortran, with some support for Java and OpenMP
Evaluation (6)
• Source code correlation: 4/5
  • All MPI calls on the timeline offer click-through source code correlation
  • User code correlation requires more manual effort
• System stability: 4.5/5
  • Trace Analyzer crashed (segmentation fault) only once throughout the evaluation
  • Trace Collector never caused an application to fail
• Technical support: 4/5
  • Quick initial response through the support webpage (a few hours)
  • Subsequent responses required a few days
References
• Intel Trace Analyzer 4.0, User's Guide, version 4.0.3.0
• Intel Trace Collector (IA32-LIN-MPICH), User's Guide, version 5.0.1.0