CEPBA Tools (DiP) Evaluation Report
Adam Leko, Hans Sherburne, UPC Group
HCS Research Laboratory, University of Florida
Color encoding key: Blue: Information | Red: Negative note | Green: Positive note
Basic Information • Name: Dimemas, MPITrace, Paraver • Developer: European Center for Parallelism of Barcelona • Current versions: • MPITrace 1.1 • Paraver 3.3 • Dimemas 2.3 • Website: http://www.cepba.upc.es/tools_i.htm • Contact: • Judit Gimenez (judit@cepba.upc.edu)
DiP Overview • DiP = Dimemas, Paraver • Toolset used for improving performance of parallel programs • Created by CEPBA ca. 1992/93, still in development • Has three main components: • Trace collection • MPITrace for MPI programs • OMPTrace for OpenMP programs (not evaluated) • OMPITrace for hybrid OpenMP/MPI programs (not evaluated) • Trace visualization: Paraver • Trace simulation: Dimemas • Uses MPIDTrace for instrumentation • Workflow encouraged by DiP: a “measure-modify” cycle: write code → instrument (MPITrace) → examine tracefile (Paraver) → hypothesize about bottlenecks → verify via simulation (Dimemas) → fix bottlenecks → test new hypothesis
MPITrace Overview • Automatically profiles all MPI commands using MPI profiling interface • Compilation command: mpicc -L/path/to/mpitrace/libs \ -L/path/to/papi/libs -lmpitrace -lpapi \ <rest of compilation cmds> • Can record other information too • Hardware counters via PAPI (MPItrace_counters) • Custom events (MPItrace_event; sketched below) • Requires special runtime wrapper script to produce tracefile • Command: mpitrace mpirun <rest of regular cmds> • mpitrace requires license to run • mpitrace must be started from machine listed in license file
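As a minimal sketch of bracketing a program phase with custom events: the MPItrace_event prototype below is assumed from the API name above, and all event type/value numbers are hypothetical examples, not documented codes.

    /* Sketch: marking a program phase with MPITrace user events.
       Prototype assumed from the MPITrace API; codes are made up. */
    #include <mpi.h>

    extern void MPItrace_event(unsigned int type, unsigned int value);

    #define PHASE_TYPE  1000  /* hypothetical user event type */
    #define PHASE_SOLVE 1     /* value marking "solve" phase start */
    #define PHASE_END   0     /* value 0 marks the end of the phase */

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);
        MPItrace_event(PHASE_TYPE, PHASE_SOLVE);  /* phase begins */
        /* ... computation and MPI calls to be traced ... */
        MPItrace_event(PHASE_TYPE, PHASE_END);    /* phase ends */
        MPI_Finalize();
        return 0;
    }

The event names and colors assigned to such events then appear in the generated .pcf file (see next slide).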
MPITrace Overview (2) • After running mpitrace, several .mpit files are created (one per MPI process) • Collect them into a single tracefile with command: mpi2prv -syn *.mpit • -syn flag necessary to line up events correctly (not mentioned in docs [1]) • This command creates a single logfile (.prv) and Paraver config file (.pcf) • .pcf file also contains names and colors of custom events • Tracefile format • ASCII (plain text), well-documented (see [1]; illustrated below) • Can get to be quite large • .prv files can be converted to a faster-loading, platform-dependent, undocumented binary format via the prv2log command • Was never able to get hardware counters working • Took several tries to get any tracefile to be created • PAPI 3.0.7 installed with no problems on Kappas 1-8 • No errors, but no hardware counter events in tracefile! • Rest of review assumes that this can be fixed given enough time
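Since the .prv format is plain ASCII, records can be inspected in any text editor. A rough illustration of the record layout (field order paraphrased from memory of the format; all values made up — see [1] for the authoritative definition):

    1:1:1:1:1:0:15000:1      state record: thread (appl 1, task 1, thread 1) in state 1 from t=0 to t=15000
    2:1:1:1:1:15000:1000:1   event record: user event type 1000 with value 1 at t=15000
    3:...                    communication record: sender/receiver IDs, logical/physical send and receive times, message size, tag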
MPITrace Overhead • All programs executed correctly when instrumented • Benchmarks marked with a star had high variability in execution time • Readings with stars probably not accurate • Based on LU benchmark, expect ~30% tracing overhead • More communication means more overhead • Wasn't able to test overhead of hardware counter instrumentation
Paraver Overview • Four main pieces of Paraver (see right): • Filtering • Semantic module • Visualization • Graphical timeline • Text • Analysis (1D/2D) • Complex piece of software! • Had to review several documents to get a feel for how to use it [2, 3, 4, 5] • Tutorial short but not too clear • Reference manual is the best documentation, but lengthy Image courtesy [2]
Paraver: Process/Resource Models Process model (courtesy [3]) Resource model (courtesy [3])
Paraver: Graphical Timeline • Graphic display uses standard timeline view • Event view similar to Jumpshot, Upshot, etc. (right, top) • Can also display time-varying data like global CPU utilization (right, bottom) • Tool can display more than one trace file at a time • Uses “tape” metaphor instead of scrolling • Play, pause, rewind to beginning, fast forward to end • Cumbersome and nonintuitive • Breaks intuition of what scroll bars do (scroll bars do not scroll the window) • Moving the window creates animations, which is slower than regular scrolling • Interface is workable, but takes some getting used to • Zooming always brings up another window • Quickly results in many open windows • This complexity is mitigated by a save/restore function for open windows • Save/restore windows is a nice feature • Interface is generally snappy • Uses an ugly widget set by today's standards
Paraver: Text Views • Provide very detailed information about trace files • Textual listing of events • Which events happen when • Accessed by clicking on the graphical timeline
Paraver: 1D/2D Analysis • 1D Analysis (right, top) • Shows statistics about various types of events • Shown per thread as text or histogram • 2D Analysis (right, bottom) • Shows statistics for one event type between pairs of threads • Event type chosen by the semantic module • Uses color to encode information (high variance, max/min) • Analysis mode takes into account filter and semantic modules (described next) • Very complex and user-unfriendly, but • Allows complicated analyses to be performed; can easily reconstruct most “normal” profiling information
Paraver: Filter Module • Filter module allows filtering of events before they are • Shown in the timeline • Processed by the semantic module • Analyzed by the 1D/2D analyzers • Can filter events by communication parameters • Who sends/receives the message • Message tag (MPI tag) • Logical times (when send/receive functions are called) or physical times (when send/receive actually takes place) • Combination of ANDs/ORs from the above • Also by user events • Type and/or value • Interface for filtering events is straightforward
Paraver: Semantic Module • Interface between raw tracefile data and what the user sees • Sits above filter, below visualization modules • Makes heavy use of the runtime/process model • Uses 3 different methods for getting values • Work with the process model (see process/resource model slide) • Application, task, thread, and workload levels • Work with the available system resources (see process/resource model slide) • Node, CPU, and system levels • Combine different existing views • E.g., combine TLB misses with loads for average TLB miss ratios • In a few words: controls how trace file information is displayed • Flexible way of displaying disparate types of information (communication vs. hardware counters) • Can take a lot of work to get Paraver to show the information you're looking for • Saved window configurations can help greatly here (perform steps only once, use for all traces later on) • Easily the most confusing aspect of Paraver • Documentation doesn't necessarily help with this
Dimemas Overview • Uses generic “network of SMPs” model to perform trace-driven simulation • Outputs trace files that can be directly visualized by Paraver • Uses a different input tracefile format than Paraver • Was never able to get this to work • “dimemas” GUI crashed • Java version works, but other problems exist… • “Dimemas” complained about a missing license even though one was in $DIMEMAS_HOME/etc/license.dat • Need MPIDTrace? • Rest of evaluation based on available documentation [4, 5, 6]
Dimemas: Architectural/Process Model • Simulated architecture: network of SMPs • Parameters for interconnection network • Number of buses (models resource contention) • Bisection bandwidth of network • Full duplex/half duplex links (from node to bus) • Parameters for nodes • Bandwidth and latency for intra-node communication • Latency for inter-node communication • Processor speed (uses linear speedup model) • Parameters for existing systems are collected (manually) via microbenchmarks • Uses the same process model as Paraver • Application (Ptask), task, thread levels • Can model MPI, OpenMP, and hybrid applications with this scheme Image courtesy [5]
Dimemas: Communication Model • Figures to right illustrate timing information that is simulated • Point-to-point communication model • Shown right top • Straightforward model based on latencies, bandwidth, and contention (bus model; sketched below) • Collective communication model • Shown right bottom • Implicit barrier before all collective operations • Two phases: • Fan in • Fan out • Collective communication time represented 3 ways (selected by user) • Constant • Linear • Logarithmic • User specifies parameters • Located in special Dimemas “database” text files • Existing set covers IBM SP, SGI Origin 2000, and a few others Images courtesy [5]
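As a rough sketch of the models just described (my own notation, not Dimemas' own): with latency $L$, message size $S$, bandwidth $B$, and time $T_{\mathrm{bus}}$ spent waiting for one of the $b$ buses to become free, a point-to-point transfer is simulated as approximately

\[ T_{\mathrm{p2p}} \approx T_{\mathrm{bus}} + L + \frac{S}{B} \]

and a collective operation costs roughly the sum of its fan-in and fan-out phases, each modeled (per the user's selection) as constant, linear, or logarithmic in the process count $P$:

\[ T_{\mathrm{coll}} \approx T_{\mathrm{fan\text{-}in}} + T_{\mathrm{fan\text{-}out}}, \qquad T_{\mathrm{phase}} \in \{\, c,\; cP,\; c\log P \,\} \]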
Dimemas: Accuracy, Other Features • Accuracy • On trivial applications (ping-pong), expected error with correct parameters is less than 12% [4] • Collective communication model for MPI verified in [6] on NAS benchmark suite • Most applications within 30% accuracy (IS.A.8 jumped to over 150% error) • Other features • Critical path selection • Starts at end, shows dependency path back to beginning of critical path • Sensitivity analysis (factorial analysis, vary parameters within 10%) • “What-if” analysis • Can adjust the time taken for each function call to see what would happen if you could write a faster version • Can also answer questions like “what would happen if we double our bandwidth?” • Simulation time: unknown (not reported in any documentation) • Only communication events are simulated • Therefore, assume simulation time is proportional to amount of communication • Also, uses simple (coarse bus-based) contention model, so simulation times should be reasonable
Bottleneck Identification Test Suite • Testing metric: what did trace visualization tell us (automatic instrumentation)? • Assumed a fully-functional installation of Paraver and Dimemas • CAMEL: PASSED • Identified large number of small messages at beginning of program execution • Assuming hardware counters worked, could also identify sequential parts of algorithm (sort on node 0, etc) • NAS LU (“W” workload): PASSED • Showed communication bottlenecks very clearly • Large(!) number of small messages • Illustrated time taken for repartitioning data • Shows sensitivity to latency for processors waiting on data from other processors • Could use Dimemas to pinpoint latency problem by testing on ideal network with no/little latency • Moderately-sized trace file (62MB), loaded slowly (> 60 seconds) in Paraver
Bottleneck Identification Test Suite (2) • Big message: PASSED • Traces illustrated large amount of time spent in send and receive • Diffuse procedure: PASSED • Traces illustrated a lot of synchronization, with one process doing more work • Since there is no source code correlation, hard to tell why the problem existed • Hot procedure: TOSS-UP • Assuming hardware counters work, would be easy to see extra CPU utilization • Lack of source code correlation would make it difficult to pinpoint the problem • Intensive server: PASSED • Traces showed that other nodes were waiting on node 0 • Ping pong: PASSED • Traces illustrated that the application was very latency-sensitive • Much time being spent waiting for messages to arrive • Random barrier: PASSED • Traces showed that one process was doing more work than the others • Small messages: PASSED • Traces illustrated a large number of messages being sent to node 0 • Also illustrated overhead of instrumentation for writing tracefile information • System time: FAILED • No way to tell system time vs. user time • Wrong way: PASSED • Trace showed the first receive waiting a long time for its message to arrive
General Comments • Very large learning curve • Complex software with lots of concepts • Concepts must be thoroughly understood, or • The software doesn't make sense • The software seems like it has no functionality • Some “common” actions (e.g., view TLB cache misses) can be very difficult to do at first in Paraver • Stored window configurations help with this • Older tools • Seem to have grown and gained features as the need for them arose • Lots of “cruft” and strange ways of presenting things • User interface clunky by today's standards • User interface complicated by anyone's standards!
General Comments (2) • Trace-driven simulation: useful? • Can be useful for performing “what-if” studies and sensitivity analyses • But still limited in what you can explore without modifying the application • Can see what happens when a function runs twice as fast • Can't see effect of different algorithms without rerunning the application • Tools provide little guidance on what the user should do next • Heavily reliant on the skill of the user to make efficient use of the tools
Adding UPC/SHMEM Support • Commercial tool! • No way to explicitly add support into Dimemas or Paraver for UPC or SHMEM • However, tools written using modular design • Existing process and resource models can be used to model UPC and SHMEM applications • Paraver and Dimemas do not need to explicitly support UPC and SHMEM, just trace files • Assuming we have methods for instrumenting UPC and SHMEM code, all that is required is writing to the .prv file format (see the sketch below) • Documented! • Not sure about Dimemas' trace file format…
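To illustrate how small that task could be, here is a hedged sketch of helper routines a hypothetical UPC/SHMEM instrumentation layer might use to emit .prv records. The field order follows the illustrative layout shown earlier; all routine names and numeric codes are mine and should be checked against [1].

    /* Hypothetical sketch: emitting Paraver-style state and event
       records from a UPC/SHMEM instrumentation layer.  NOTE: a real
       .prv file also begins with a header line describing the run;
       omitted here for brevity. */
    #include <stdio.h>

    /* Write one state record: thread was in `state` from t_begin to t_end. */
    static void prv_state(FILE *f, int cpu, int appl, int task, int thread,
                          long long t_begin, long long t_end, int state)
    {
        fprintf(f, "1:%d:%d:%d:%d:%lld:%lld:%d\n",
                cpu, appl, task, thread, t_begin, t_end, state);
    }

    /* Write one event record: event `type` took `value` at time t. */
    static void prv_event(FILE *f, int cpu, int appl, int task, int thread,
                          long long t, int type, long long value)
    {
        fprintf(f, "2:%d:%d:%d:%d:%lld:%d:%lld\n",
                cpu, appl, task, thread, t, type, value);
    }

    int main(void)
    {
        FILE *f = fopen("upc_app.prv", "w");
        if (!f) return 1;
        /* Thread 1 of task 1 computes from t=0 to t=5000 (state 1),
           then records a made-up "upc_memget bytes" event. */
        prv_state(f, 1, 1, 1, 1, 0, 5000, 1);
        prv_event(f, 1, 1, 1, 1, 5000, 70000, 4096);
        fclose(f);
        return 0;
    }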
Evaluation (1) • Available metrics: 5/5 • Can use PAPI and existing hardware counters • Paraver can combine trace information and give you just about any metric you can think of • Cost: 1/5 • For Paraver, Dimemas, and MPITrace, 1 seat: 2000 Euros (~$2,600) • Documentation quality: 1/5 • MPITrace: Inadequate documentation for Linux • Dimemas: Only tutorial available unless you want to read through conference papers and PhD theses • Paraver: User manual very thorough but technical and unclear • Many grammar errors impair reading! • “temporal files” -> temporary files • Many more… *Note: evaluated Linux version
Evaluation (2) • Extensibility: 0/5 • Commercial (no source), but • Can add new functions to the semantic module for Paraver • Flexible design lets you support a wide variety of programming paradigms by using the documented trace file format • Filtering and aggregation: 5/5 • Paraver has powerful filtering & aggregation capability • Filtering & aggregation only post-mortem, however • Hardware support: 3/5 • AlphaServer (Tru64), 64-bit Linux (Opteron, Itanium), IBM SP (AIX), IRIX, HP-UX • Most everything supported: Linux, AIX, IRIX, HP-UX • No Cray support • Heterogeneity support: 0/5 (not supported)
Evaluation (3) • Installation: 1/5 • Linux installation riddled with errors and problems • PAPI dependency for hardware counters complicates things (needs kernel patch) • Have had the software over 2 months; still not working correctly • According to our contact this is not normal, but the other tools were nowhere near as hard to install • Interoperability: 1/5 • No export interoperability with other tools • Apparently tools exist to import SDDF and other formats (but I couldn't find them) • Can import UTE traces • Learning curve: 1/5 • All of the graphical tools have unintuitive interfaces • Software is complex, and tutorials do not lessen the learning curve very much • Manual overhead: 1/5 • MPITrace only records MPI events • Linux needs extra instructions in source code to get hardware counter information • Need to relink or recode to turn tracing on or off • Measurement accuracy: 4/5 • CAMEL overhead: ~8% • Tracing overhead not negligible, but within acceptable limits • Dimemas accuracy only decent, but good enough for what Dimemas is intended to do
Evaluation (4) • Multiple executions: 1/5 • Paraver supports displaying multiple tracefiles at the same time • This lets you relate different runs (with different parameters) to each other relatively easily • Multiple analyses & views: 4/5 • Semantic modules provide a convenient (if awkward) way of displaying different types of data • Semantic modules also allow the displaying of the same type of data in different ways • Analysis modules show statistical summary information over time ranges • Performance bottleneck identification: 4.5/5 • No automatic bottleneck identification • All the information you need to identify a bottleneck should be available between Paraver and Dimemas • However, much manual effort is needed to determine where bottlenecks are • Also, no information is related back to the source code level • Profiling/tracing support: 2/5 • Only supports tracing • Trace files can be quite large and can take some time to open • Response time: 3/5 • No data at all until after run has completed and tracefile has been opened • Dimemas requires simulation to fully finish and Paraver to open up the generated tracefile before information is shown to user
Evaluation (5) • Searching: 3/5 • Search features provided by Dimemas • Software support: 3.5/5 • MPI profiling library allows linking against any existing libraries • OpenMP, OpenMP+MPI programs also supported via add-on instrumentation libraries • Source code correlation: 0/5 • Not supported directly, can use user events to identify program phases • System stability: 3/5 • MPITrace stable (had no problems other than installation) • Paraver crashed relatively often (>= 1 time per hour) • Dimemas stability not tested • Technical support: 3/5 • Responses from contact within 24-48 hours • Some problems not resolved quickly, though
References
[1] “MPITrace tool version 1.1: User's guide,” November 2000. http://www.cepba.upc.es/paraver/docs/MPItrace.pdf
[2] “Paraver version 2.1: Tutorial,” November 2000. http://www.cepba.upc.es/paraver/docs/Paraver_TUTORIAL.pdf
[3] “Paraver version 3.1: Reference manual (DRAFT),” October 2001. http://www.cepba.upc.es/paraver/docs/Paraver_MANUAL.pdf
[4] Jesús Labarta et al., “DiP: A Parallel Program Development Environment,” in Proc. 2nd International Euro-Par Conference (Euro-Par '96), Lyon, France, August 1996.
References (2)
[5] Sergi Turell, “Performance Prediction and Evaluation Tools,” PhD thesis, Universitat Politècnica de Catalunya, March 2003.
[6] S. Girona et al., “Validation of Dimemas communication model for collective MPI communications,” in Proc. EuroPVM/MPI 2000, Balatonfüred, Lake Balaton, Hungary, September 2000.
[7] “Introduction to Dimemas” (tutorial). http://www.cepba.upc.edu/dimemas/docs/Dimemas_MANUAL.pdf