MPE/Jumpshot Evaluation Report Adam Leko Hans Sherburne, UPC Group HCS Research Laboratory University of Florida Color encoding key: Blue: Information Red: Negative note Green: Positive note
Basic Information • Name: MPE/Jumpshot • Developer: Argonne National Laboratory • Current versions: • MPE 1.26 • Jumpshot-4 • Website: http://www-unix.mcs.anl.gov/perfvis/ • Contacts: • Anthony Chan (chan@mcs.anl.gov) • David Ashton (ashton@mcs.anl.gov) • Rusty Lusk (lusk@mcs.anl.gov) • William Gropp (gropp@mcs.anl.gov)
What Is MPE/Jumpshot? • The “quintessential” MPI logging and post-mortem visualization toolset • MPE: Multi-Processing Environment • A software package for MPI programmers • Has three main parts: • A tracing library that outputs all MPI calls to stdout • A shared-display parallel X graphics and animation library • A logging library for logging events • Note: MPE/Jumpshot “logging” = what we call tracing • Jumpshot • A visualization tool for logfiles created by the MPE package • Written in Java (cross-platform) • Provides a timeline (Gantt chart) view of MPI and program events • Also has basic search and summary (histogram) functionality
Logfiles: What’s In A Format? • Much thought has been put into logfile formats • “Traditional” tracing results in large trace files • Trace file format can play a large part in a visualization tool’s response time • ALOG: original format (Argonne LOGging format) • Text-based format • Visualization tool: Upshot • An X Window application using the Athena widget toolkit • Later rewritten in Tcl/Tk for easier coding • Turned out to be too slow • Parts rewritten in C (“Nupshot”), but the Tcl-to-C interface kept changing • BLOG: intermediary format • CLOG • Binary file format created to improve upon ALOG and BLOG • By default, MPE still outputs logfiles in CLOG • Low overhead • Can be easily converted to other formats as needed • Visualization tools • Jumpshot-1 • Complete rewrite of Upshot/Nupshot • Coded in Java/AWT for cross-platform portability • Poor performance, not widely used • Jumpshot-2 • Improved version using Java/Swing • Slightly better performance • SLOG: “scalable” format • State-based logging format • Visualization tool: Jumpshot-3 • Rewrite of Jumpshot-2 to use SLOG • Can scale to ~GB logfiles • SLOG-2: current logfile format • Next-generation SLOG file format • “Graphical” logfile format to speed logfile parsing • Visualization tool: Jumpshot-4
MPE Overview • Tracing capability • Automatic instrumentation: mpicc -mpitrace • Writes to stdout at every MPI call, e.g.: • [1] Starting MPI_Send with count = 28, dest = 0, tag = 0... • [1] Ending MPI_Send • Equivalent “manual” method: inserting printf calls • Very simple & intuitive • Parallel graphics ability • Automatic instrumentation: mpicc -mpianim -L/usr/X11R6/lib -lX11 -lm • Displays graphics on one machine • A circle for each process, arrows indicate sends/receives • Slows down execution considerably • Graphics are also available via library calls • Calls seem relatively easy to use: MPE_Draw_string, MPE_Draw_circle, MPE_Update, etc. • Probably not all that useful
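MPE’s automatic tracing works by wrapping each MPI call. As far as we know, these wrappers are built on the MPI profiling interface, where a library defines MPI_Send itself and forwards to PMPI_Send. The minimal sketch below shows that wrap-and-forward shape without needing an MPI installation: the real send is replaced by a hypothetical function pointer (send_fn, fake_send), and traced_send is an illustrative name, not an MPE function.

```c
#include <stdio.h>

/* Hypothetical stand-in for the real send routine. In an MPI profiling
 * library this role is played by PMPI_Send; here it only takes the
 * fields that the trace line prints. */
typedef int (*send_fn)(int count, int dest, int tag);

/* Hypothetical no-op "send" so the sketch runs without MPI. */
static int fake_send(int count, int dest, int tag)
{
    (void)count; (void)dest; (void)tag;
    return 0; /* pretend the send succeeded */
}

/* Trace wrapper: print a start line, forward the call, print an end line. */
int traced_send(send_fn real_send, int rank, int count, int dest, int tag,
                FILE *out)
{
    fprintf(out, "[%d] Starting MPI_Send with count = %d, dest = %d, tag = %d...\n",
            rank, count, dest, tag);
    int rc = real_send(count, dest, tag);
    fprintf(out, "[%d] Ending MPI_Send\n", rank);
    return rc;
}
```

Against a real MPI library, the same wrapper would be declared with MPI_Send’s signature and call PMPI_Send, and the tracing library would be linked ahead of the MPI library.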
MPE Overview (2) • Logging ability • Automatic instrumentation: mpicc -mpilog • Logs start and stop of events • Can overlap starting and stopping of events • Can add “custom” events • Easy to do using library calls • MPE_Log_get_event_number: create a new event • MPE_Describe_state: gives name and color to event • MPE_Log_event: records event in logfile, uses MPI_Wtime to get global time • Custom events show up in Jumpshot-4 just like events from automatic instrumentation • Conventions • Automatic instrumentation uses all caps (SEND, RECV) • Manual instrumentation uses mixed case
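The custom-event workflow above (allocate event numbers, describe a state, log start and end events) can be sketched without MPE as a small in-memory logger. Everything here is hypothetical: log_get_event_number, describe_state, log_event, and state_total_time only mimic the roles of the MPE calls and do not reproduce their signatures. The last function shows what a viewer does with such a log, pairing start and end events into state durations.

```c
#include <stddef.h>

#define MAX_EVENTS 64
#define MAX_STATES 8

/* Hypothetical in-memory logger mirroring the MPE workflow.
 * These are NOT the MPE functions; names and signatures are illustrative. */
typedef struct { int event; double t; } log_entry;
typedef struct { int start_ev, end_ev; const char *name, *color; } state_desc;

static log_entry  entries[MAX_EVENTS];
static size_t     n_entries;
static state_desc states[MAX_STATES];
static size_t     n_states;
static int        next_event = 1;

/* Role of MPE_Log_get_event_number: hand out a fresh event ID. */
int log_get_event_number(void) { return next_event++; }

/* Role of MPE_Describe_state: tie a start/end event pair to a name and color. */
int describe_state(int start_ev, int end_ev, const char *name, const char *color)
{
    if (n_states == MAX_STATES) return -1;
    states[n_states++] = (state_desc){ start_ev, end_ev, name, color };
    return 0;
}

/* Role of MPE_Log_event, except the caller passes the timestamp
 * (MPE itself takes it from MPI_Wtime). */
int log_event(int event, double timestamp)
{
    if (n_entries == MAX_EVENTS) return -1;
    entries[n_entries++] = (log_entry){ event, timestamp };
    return 0;
}

/* Total time spent in state number idx. Each end event is matched with
 * the most recent unmatched start event, so nested instances work. */
double state_total_time(size_t idx)
{
    double open[MAX_EVENTS];
    size_t depth = 0;
    double total = 0.0;
    if (idx >= n_states) return 0.0;
    for (size_t i = 0; i < n_entries; i++) {
        if (entries[i].event == states[idx].start_ev) {
            open[depth++] = entries[i].t;
        } else if (entries[i].event == states[idx].end_ev && depth > 0) {
            total += entries[i].t - open[--depth];
        }
    }
    return total;
}
```

Because each end is matched against the most recent open start, overlapping or nested instances of the same state still produce sensible totals.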
MPE Overhead • All programs executed correctly when instrumented • Expect about 5% overhead for “real-world” applications • Barrier recording mechanism has a lot of overhead • Most applications don’t use many barriers, though
MPE Overhead: Barriers • Programs with large measurement overhead are shown below • Tons of barriers! (shown in yellow) • [Figures: PPerfMark diffuse procedure; PPerfMark random barrier]
Jumpshot Overview • Jumpshot-4 supports two types of visualizations for metrics • Timeline (right, top) • Histogram (right, bottom) • Visualization depends on the SLOG-2 format and data model • Real drawables • State: single timeline ID, start/end timestamps • Arrow: pair of timeline IDs, start/end timestamps • Event: single timeline ID, single timestamp • Preview drawables • Amalgamation of real drawables • One corresponding type for each of the real drawables • Serve to optimize visualization performance • [Figures: Timeline view; Histogram view]
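The drawable data model above can be pictured as three record types plus a summarizing pass. The sketch below is only an illustration under our own assumptions: the struct and field names are not the slog2sdk classes, and preview_states is a simple stand-in for the idea behind preview drawables. It sums how much state time falls into each of a fixed number of time bins, so a zoomed-out view could draw one box per bin instead of every individual state.

```c
#include <stddef.h>

/* Illustrative versions of the three "real drawable" kinds;
 * field names are ours, not slog2sdk's. */
typedef struct { int timeline;      double start, end; } state_drawable;
typedef struct { int from_tl, to_tl; double start, end; } arrow_drawable;
typedef struct { int timeline;      double time;       } event_drawable;

/* Crude "preview": for each of nbins equal bins covering [t0, t1),
 * accumulate how much state time overlaps that bin. */
void preview_states(const state_drawable *s, size_t n,
                    double t0, double t1, double *bins, size_t nbins)
{
    if (nbins == 0 || t1 <= t0) return;
    double width = (t1 - t0) / (double)nbins;
    for (size_t b = 0; b < nbins; b++) bins[b] = 0.0;
    for (size_t i = 0; i < n; i++) {
        for (size_t b = 0; b < nbins; b++) {
            double lo = t0 + width * (double)b;
            double hi = lo + width;
            /* Clip the state interval to this bin. */
            double a = s[i].start > lo ? s[i].start : lo;
            double z = s[i].end   < hi ? s[i].end   : hi;
            if (z > a) bins[b] += z - a;
        }
    }
}
```

This brute-force version costs O(states × bins); a real implementation would presumably precompute such summaries when the file is written, which is where storing preview drawables in the logfile pays off.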
Jumpshot Overview (2) • Emphasis on providing useful profile analysis at both a • High-level (entire program execution) view • Low-level (individual events) view • Nice features • Intuitive interface • Automatically converts from CLOG to SLOG-2 • Very good support for zooming and scrolling • User manual very thorough • Things that could use improvement • Java application, so it uses a lot of memory (~70-100MB during typical runs) • Memory use seems to scale nicely with logfile size, though • No direct support for non-event-based data (running averages, time-varying histograms for cache miss numbers, etc.) • Documentation a little unclear/excessively technical in some places • [Figures: Timeline view; Histogram view]
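To make “non-event-based data” concrete: a time-varying metric such as a cache miss rate is usually derived from periodic samples of a cumulative hardware counter. The hypothetical helper below (counter_sample and counter_rates are our names) turns such samples into per-interval rates. Producing a series like this is straightforward; the gap is that Jumpshot-4 has no drawable type to display it directly. How the samples are collected (for example, through a hardware-counter library) is outside this sketch.

```c
#include <stddef.h>

/* Hypothetical periodic sample of a cumulative counter (e.g. L2 misses). */
typedef struct { double t; long long count; } counter_sample;

/* Convert n cumulative samples into n-1 rates (counts per second).
 * Intervals with non-positive duration get a rate of 0.
 * Returns the number of rates written. */
size_t counter_rates(const counter_sample *s, size_t n, double *rates)
{
    size_t k = 0;
    for (size_t i = 1; i < n; i++) {
        double dt = s[i].t - s[i - 1].t;
        rates[k++] = dt > 0.0 ? (double)(s[i].count - s[i - 1].count) / dt : 0.0;
    }
    return k;
}
```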
Bottleneck Identification Test Suite • Testing metric: what did trace visualization tell us (automatic instrumentation)? • CAMEL: PASSED • Identified large number of small messages at beginning of program execution • Also identified sequential parts of algorithm (sort on node 0, etc) • No other problems visible from trace • NAS LU (“W” workload): PASSED • Showed communication bottlenecks very clearly • Large(!) number of small messages • Illustrated time taken for repartitioning data • Shows sensitivity to latency for processors waiting on data from other processors
Bottleneck Identification Test Suite (2) • Big message: PASSED • Traces illustrated a large amount of time spent in send and receive • Diffuse procedure: PASSED • Traces illustrated a lot of synchronization with one process doing more work • Without source code correlation, it was hard to tell why the problem existed • Hot procedure: FAILED • CLOG trace file conversion failed (no communication events) • Even if the trace had loaded, there were no communication problems to see • Intensive server: PASSED • Traces showed that other nodes were waiting on node 0 • Ping pong: PASSED • Traces illustrated that the application was very latency-sensitive • Much time spent waiting for messages to arrive • Random barrier: PASSED • Traces showed that one node was doing more work than the others • Small messages: PASSED • Traces illustrated a large number of messages being sent to node 0 • System time: FAILED • CLOG trace file conversion failed (no communication events) • Even if the trace had loaded, there were no communication problems to see • Wrong way: PASSED • Trace showed the first receive waiting a long time for its message to arrive
NAS LU (Class W) Visualization • [Figure annotations: much time taken for data redistribution; large number of small messages]
General Comments • Good things • Jumpshot-4 is a well-written, scalable event-based tracefile viewer • Formats used by Jumpshot are well-defined • Low measurement overhead in MPICH • Mature GUI, few bugs, has been around for a long time in one form or another • To leverage it, we just need to write logfiles in a specific format • Things that could use improvement • Adding support for metrics other than events would require hacking the SLOG-2 format • E.g., how to support showing L2 miss rates as time increases? • Seems like it would be best used as part of our toolkit • Automatic instrumentation is really necessary to make the tool useful • Jumpshot-4 can fit in our toolkit as an event-based tracefile viewer if we can easily write to a format it understands
Adding UPC/SHMEM Support • At a minimum, need a mechanism to output CLOG trace files • CLOG library currently uses many MPI calls • E.g., MPI_Wtime for timing information • Therefore, we cannot just insert MPE logging calls and use the MPE library unmodified • However, the CLOG format is defined • Could (relatively) easily create a C implementation that uses UPC calls instead of MPI calls • Would need to come up with our own buffering scheme, though • Writing to files as data comes in is too slow • Should be able to steal a lot of code from the MPE source • Not necessarily a problem, since we will most likely have to come up with a buffering method anyway if we go the tracing route • Could also use slog2sdk to write SLOG-2 files directly, but • API in Java only • SLOG-2 may have larger creation overhead than simple event-based formats such as CLOG • Several examples (and example C code) given for converting logfiles of arbitrary format to SLOG-2 format using slog2sdk • Can use our own logfile format if needed! • Recommend going with CLOG, though, so we can steal existing code
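A UPC/SHMEM port would need the two pieces called out above: a timer that does not depend on MPI_Wtime, and an in-memory buffer that is flushed in large blocks. The following is a minimal sketch under stated assumptions: the record layout (log_rec) is hypothetical and is not the CLOG record format; log_wtime reads the POSIX monotonic clock, which is local to each node and not globally synchronized the way MPI_Wtime can be; and the thread field would hold MYTHREAD in UPC.

```c
#define _POSIX_C_SOURCE 199309L
#include <stdio.h>
#include <time.h>

#define LOG_BUF_RECORDS 4096

/* Hypothetical fixed-size record; NOT the actual CLOG layout. */
typedef struct { double t; int thread; int event; } log_rec;

typedef struct {
    FILE   *out;                  /* destination logfile */
    size_t  used;                 /* records currently buffered */
    log_rec buf[LOG_BUF_RECORDS];
} event_log;

/* Local replacement for MPI_Wtime (monotonic, per node, not global). */
double log_wtime(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (double)ts.tv_sec + (double)ts.tv_nsec * 1e-9;
}

/* Write all buffered records with a single fwrite; 0 on success. */
int log_flush(event_log *elog)
{
    if (elog->used == 0) return 0;
    size_t w = fwrite(elog->buf, sizeof elog->buf[0], elog->used, elog->out);
    int ok = (w == elog->used);
    elog->used = 0;
    return ok ? 0 : -1;
}

/* Record one event; the file is touched only when the buffer is full. */
int log_record(event_log *elog, int thread, int event)
{
    if (elog->used == LOG_BUF_RECORDS && log_flush(elog) != 0) return -1;
    elog->buf[elog->used++] = (log_rec){ log_wtime(), thread, event };
    return 0;
}
```

Each record costs a memory copy; the file is written once per LOG_BUF_RECORDS events plus a final log_flush at program exit. Clock synchronization across nodes would still have to be solved separately.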
Evaluation (1) • Available metrics: 1/5 • Only communication-based metrics (timeline + histograms) available • Restricted to recording event-based metrics • Cost: free 5/5 • Documentation quality: 3.5/5 • Jumpshot-4 has a very good but lengthy user’s manual • slog2sdk (SDK for reading/writing SLOG-2 files) is not very clear, although SLOG-2 is also described in a lengthy paper • Extensibility: 3.5/5 • Jumpshot-4 is written in Java (easy to find Java coders at UF) • Can easily add new events using MPE library calls • Adding time-varying metrics (histograms, etc.) would require writing code from scratch • Filtering and aggregation: 3/5 • Can restrict event types being displayed from trace • Preview drawables and histograms provide aggregation abilities • Does not filter or aggregate data directly when recording data • Hardware support: 4/5 • 64-bit Linux (Opteron, Itanium), Tru64 (AlphaServer), IRIX, IBM SP (AIX), Cray MPI, and many more • Can be used with any MPICH or LAM installation • Heterogeneity support: 0/5 (not supported)
Evaluation (2) • Installation: 5/5 • About as easy as you could expect • Zero effort if already using MPICH; compiling from source is also easy • Interoperability: 0.5/5 • No way provided to export SLOG-2 files to other viewers • Example code provided in slog2sdk on how to convert existing formats into SLOG-2 format • Learning curve: 4.5/5 • Easy to learn, well-written documentation • MPE really easy to use (mpicc -mpilog) • Manual overhead: 1/5 • All MPI calls automatically instrumented for you when linking against MPE • Adding other events requires manual work (not much, though) • No way to turn tracing on/off in places without recompilation • Measurement accuracy: 5/5 • CAMEL overhead < 1% • Correctness of programs not affected • Measurements seem accurate to the millisecond (relies on MPI_Wtime resolution, though) • Only large numbers of messages (10⁶ or more back-to-back) or frequent barriers seem to introduce any appreciable overhead
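Since timestamp accuracy depends on the underlying timer, it is worth checking that timer’s resolution. MPI exposes it as MPI_Wtick(); for a logger that cannot use MPI, a POSIX counterpart is sketched below, assuming a system that provides clock_getres and CLOCK_MONOTONIC. The function name timer_resolution is our own.

```c
#define _POSIX_C_SOURCE 199309L
#include <time.h>

/* Resolution of the POSIX monotonic clock in seconds, the local analogue
 * of what MPI_Wtick() reports for MPI_Wtime().
 * Returns -1.0 if the clock cannot be queried. */
double timer_resolution(void)
{
    struct timespec res;
    if (clock_getres(CLOCK_MONOTONIC, &res) != 0) return -1.0;
    return (double)res.tv_sec + (double)res.tv_nsec * 1e-9;
}
```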
Evaluation (3) • Multiple executions: 0/5 (not supported) • Multiple analyses & views: 2/5 • Only shows timeline and histograms (but does both very well) • Excellent zooming and scrolling features (scalable to GB logfiles) • Performance bottleneck identification: 4.5/5 • No automatic methods supported • Traces do very good job of showing communication and synchronization bottlenecks • Can also use custom events to indirectly determine some types of bottlenecks (e.g., load imbalance) • Profiling/tracing support: 3/5 • Only supports tracing • Trace format compact & scalable so viewer can comfortably show GB logfiles • Automatic tracing is either entirely on or entirely off • Turning on/off manual tracing requires code modification and recompilation
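One way to turn custom events into a load-imbalance number: total each process’s time in a custom “compute” state, then compare the slowest process with the average. The helper below uses the common definition (t_max - t_mean) / t_max; the function name and this choice of metric are ours, not something MPE or Jumpshot computes.

```c
#include <stddef.h>

/* Load imbalance = (max - mean) / max over per-process busy times,
 * e.g. totals of a custom "compute" state. 0 means perfectly balanced. */
double load_imbalance(const double *busy, size_t nprocs)
{
    if (nprocs == 0) return 0.0;
    double max = busy[0], sum = 0.0;
    for (size_t i = 0; i < nprocs; i++) {
        if (busy[i] > max) max = busy[i];
        sum += busy[i];
    }
    if (max <= 0.0) return 0.0;
    double mean = sum / (double)nprocs;
    return (max - mean) / max;
}
```

For per-process busy times of 4, 1, 1 and 2 seconds, the mean is 2 s and the result is 0.5: with perfect balance, the slowest process would have finished about twice as fast.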
Evaluation (4) • Response time: 2/5 • No results until after run • For an 850MB CLOG tracefile: • Converting to SLOG-2 took 5 minutes • Opening the 350MB SLOG-2 file took about 10 seconds • However, large trace files will be slower to work with than a method that incorporates more filtering and aggregation • This is a limitation of the tracing method, not of the tool’s implementation • Software support: 3/5 • Supports C & Fortran • Tied closely to MPI applications • Supports linking with any library supported by GCC/platform C compiler, but linked libraries will not be profiled unless they contain MPI calls • Source code correlation: 1/5 • Not directly supported • Can correlate indirectly by using custom events at function entry/exit points • Searching: 1.5/5 • Only a simple search function available
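The response-time figures imply rough throughput and size numbers, assuming the 350MB SLOG-2 file is the converted form of the 850MB CLOG file:

```latex
\frac{850\,\text{MB}}{5 \times 60\,\text{s}} \approx 2.8\,\text{MB/s},
\qquad
\frac{350\,\text{MB}}{850\,\text{MB}} \approx 0.41
```

So conversion ran at under 3 MB/s, and the SLOG-2 file was about 41% the size of the CLOG input.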
Evaluation (5) • System stability: 4.5/5 • MPE very stable (no problems observed) • Jumpshot-4 has very few bugs (small ones exist but do not get in the way) • Extremely good for a freely-downloadable research project • Technical support: 4/5 • Jumpshot-4 does give very good error messages • Developers responded within 24 hours • Developers willing to help point us in the right direction for writing SLOG-2 files using their APIs