KOJAK Evaluation Report
Adam Leko, Hans Sherburne
UPC Group, HCS Research Laboratory, University of Florida
Color encoding key: Blue: Information, Red: Negative note, Green: Positive note
Basic Information • Name: KOJAK • Developer: Forschungszentrum Jülich, ICL @ UTK • Current versions: • Stable: KOJAK v2.0 • Development: KOJAK v2.1b1 • Websites: http://icl.cs.utk.edu/kojak/ and http://www.fz-juelich.de/zam/kojak/ • Contacts: • Felix Wolf (fwolf@cs.utk.edu) • Bernd Mohr (b.mohr@fz-juelich.de) • Generic email: kojak@cs.utk.edu
KOJAK Overview • A collection of tools for automated performance analysis • Instrumentation utilities: DUCTAPE, OPARI • Trace file format/library: EPILOG • High-level trace API: EARL • Pattern matching/performance knowledge representation: EXPERT • Visualization tool: CUBE • Can also export to Vampir’s VTF3 format • Acronym soup • KOJAK: Kit for Objective Judgement and Knowledge-based detection of performance bottlenecks • DUCTAPE: C++ program Database Utilities and Conversion Tools APplication Environment • EPILOG: Event Processing, Investigating and LOGging • EARL: Event Analysis and Recognition Library • EXPERT: Extensible Performance Tool • OPARI: OpenMP Pragma And Region Instrumentor • CUBE: CUBE Uniform Behavioral Encoding
Instrumentation Overview • Binary instrumentation (elg_dpcl) • Uses IBM’s DPCL library • Only available on AIX • OpenMP instrumentation (opari) • Accomplished via • Source-to-source transforms • Linking against POMP library • Only instruments OpenMP regions and constructs • Still need to manually instrument functions or other code regions • Automatic instrumentation (kinst) • Only available on a few platforms • Linux clusters, PGI compilers • Hitachi SR-8000 • Solaris, Sun Fortran90 compiler • NEC SX • Based on undocumented compiler features • Manual instrumentation • MPI profiling interface • Just need to link against the elg.mpi library • Only instruments MPI calls • EPILOG API • Place macros at start and end of every function • ELG_USER_START(“function-name”); • ELG_USER_END(“function-name”); • Compile with -DEPILOG Note: website mentions instrumentation via DUCTAPE and TAU, but these have not been integrated into the available versions of KOJAK as of 3/05
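To make the manual EPILOG instrumentation above concrete, here is a minimal sketch of a user function bracketed with the ELG_USER_START/ELG_USER_END macros. The header name elg_user.h and the function itself are assumptions for illustration; consult the EPILOG documentation for the exact include and build settings on your platform.

    #include "elg_user.h"   /* EPILOG user API header (name assumed) */

    void relax(double *grid, int n)
    {
        ELG_USER_START("relax");              /* record region entry in the trace */
        for (int i = 1; i < n - 1; ++i)
            grid[i] = 0.5 * (grid[i - 1] + grid[i + 1]);
        ELG_USER_END("relax");                /* record region exit */
    }

Compile with -DEPILOG so the macros are active, and link against the appropriate EPILOG library.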
Instrumentation Overhead: CAMEL • Performed manual instrumentation of CAMEL • Attempt to get a rough estimate of overhead • Instrumented all functions • Ran CAMEL with 1/64th problem size • Execution was slowed down by an order of magnitude • Trace file size: 919 MB • CAMEL contains several hundred thousand function calls in a given execution • Instrumented two functions within an inner loop • Execution time increased by a factor of 2.2 • Trace file size: 153 MB • Instrumented outside large loops • Execution time increased by a few percent • Trace file only 9.1 KB • Clearly the naïve approach of “instrument all functions” is too expensive for KOJAK • This behavior is common to any tracing approach, though
Instrumentation Overhead: Test Suite • Instrumentation performed using MPI profiling interface • Overall, instrumentation overhead very low (one of the lowest seen thus far) • Instrumentation with PAPI enabled (FLOPS, L1 data miss rate) has no measurable extra overhead • Ping-pong has highest reproducible overhead at 10% (worst case for MPI) • Note: Benchmarks marked with * have high variability in runtimes
EPILOG Overview • Binary trace file format used by KOJAK • Supports OpenMP, MPI, or hybrid applications • Fairly compact • NAS LU, W workload, 8 processors: 23MB • Roughly on par with size of SLOG-2 files • Documented • Complete spec available on website • Has an existing API (open source) for reading, writing EPILOG files • Can also add information from hardware counters • PAPI supported • Can be converted to VAMPIR format using elg2vtf • Requires vptmerge • Does not work with updated Intel version of Cluster Tools (vptmerge not included)
EARL Overview • Provides high-level access to trace events • Random access to trace events • Also provides links between related events • API documented, spec available on website • Existing implementation also available (open source) for C++ and Python • Machine model: clusters of SMPs
EXPERT Overview • Performs automatic analysis of EPILOG traces • Main feature of KOJAK suite • Matches collection of “performance problems” (bottleneck patterns) against trace file • Bottlenecks are specified using EARL • Users can add their own patterns using Python or C++ • New C++ patterns have to be compiled back into EXPERT • Detection method • Pattern objects register for certain types of trace events • Event trace reader performs callbacks when requested events are encountered • Pattern objects receive callback & update state information • If a pattern object matches its state to its performance problem, a bottleneck is reported (see the sketch below) • Output from EXPERT is a .cube file which can be visualized using the CUBE tool
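As a rough illustration of the callback-driven detection method described above, the following C++ sketch shows how pattern objects might register for event types and receive callbacks from a trace reader. All class names, event types, and the toy “late sender” heuristic are invented for illustration only and do not reflect EXPERT’s or EARL’s actual APIs.

    #include <iostream>
    #include <map>
    #include <string>
    #include <vector>

    struct Event { std::string type; double time; int location; };

    class Pattern {
    public:
        virtual ~Pattern() = default;
        virtual std::vector<std::string> subscribedTypes() const = 0;  // event types of interest
        virtual void onEvent(const Event& e) = 0;                      // callback: update internal state
    };

    /* Toy pattern: reports a suspected "late sender" whenever a send arrives
       noticeably after the last receive seen (real detection logic is far more involved). */
    class LateSender : public Pattern {
        double lastRecvTime = -1.0;
    public:
        std::vector<std::string> subscribedTypes() const override { return {"SEND", "RECV"}; }
        void onEvent(const Event& e) override {
            if (e.type == "RECV") {
                lastRecvTime = e.time;
            } else if (lastRecvTime >= 0.0 && e.time - lastRecvTime > 0.001) {
                std::cout << "Late sender suspected near t=" << e.time
                          << " at location " << e.location << "\n";
            }
        }
    };

    /* The trace reader walks the event stream once and dispatches each event
       to every pattern that registered for its type. */
    void analyze(const std::vector<Event>& trace, const std::vector<Pattern*>& patterns) {
        std::multimap<std::string, Pattern*> registry;
        for (Pattern* p : patterns)
            for (const std::string& t : p->subscribedTypes())
                registry.insert({t, p});
        for (const Event& e : trace) {
            auto range = registry.equal_range(e.type);
            for (auto it = range.first; it != range.second; ++it)
                it->second->onEvent(e);
        }
    }

In a single-pass, callback-driven structure like this, analysis cost grows roughly with the number of trace events, which is consistent with the scaling behavior noted on the analysis-times slide.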
EXPERT Bottleneck List Grey boxes (leaf nodes) are bottlenecks that can currently be detected
EXPERT Analysis Times • EXPERT scalability • Sequential tool; analysis time scales proportionally to trace file size • Balancing act • Try to detect too many/too complex bottlenecks: analysis time becomes intractable • Try to totally minimize analysis time: miss useful bottlenecks • Current analysis speed tractable for trace files up to a few hundred MB • Plans to parallelize the analysis phase, but no implementation available yet
CUBE Overview: Description • Generic visualization tool • Used by KOJAK to visualize EXPERT’s analyses • X-Windows application (uses wxWindows toolkit) • Buzzword description • Displays multidimensional data in a scalable fashion • Reduces all data to hierarchical display of 3 dimensions (“cube”) • Data is aggregated across dimensions as needed • Dimension space • Set of metrics (M) • Set of call paths (C) • Set of locations (L) • Each data point (m, c, l) is mapped onto a number (the “severity”) representing • the value of metric m • incurred while the program was executing call path c • at location l (see the sketch below) • Browsers for each dimension are linked together • User views one dimension with respect to another • Uses documented XML format to represent data
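A rough sketch of the (m, c, l) → severity mapping described above; this is purely illustrative (made-up numbers, no relation to CUBE’s actual XML schema or internal data structures):

    #include <iostream>
    #include <map>
    #include <string>
    #include <tuple>

    int main() {
        /* Key = (metric, call path, location); value = severity. */
        using Key = std::tuple<std::string, std::string, std::string>;
        std::map<Key, double> severity;

        /* e.g. 12.4 s of "Late Sender" time in MPI_Recv called from mainloop,
           measured on process 3 of node0 (values invented for illustration). */
        severity[std::make_tuple("Late Sender", "main/mainloop/MPI_Recv", "node0/proc3")] = 12.4;
        severity[std::make_tuple("Late Sender", "main/mainloop/MPI_Recv", "node0/proc2")] = 3.1;

        /* Aggregating over the location dimension yields the value CUBE would
           show when the system-tree pane is collapsed. */
        double total = 0.0;
        for (const auto& entry : severity)
            if (std::get<0>(entry.first) == "Late Sender")
                total += entry.second;
        std::cout << "Aggregated Late Sender time: " << total << " s\n";
    }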
CUBE Overview: Simple Description • Uses a 3-pane approach to display information • Metric pane • Module/calltree pane • Right-clicking brings up source code location • Location pane (system tree) • Each item is displayed along with a color to indicate severity of condition • Severity can be expressed in 4 ways • Absolute (time) • Percentage • Relative percentage (changes module & location pane) • Comparative percentage (differences between executions) • Despite the rather technical documentation, the interface is actually quite intuitive
CUBE Example: CAMEL After opening the .cube file (default metric shown = absolute time taken in seconds)
CUBE Example: CAMEL After expanding all 3 root nodes; color shown indicates metric “severity” (amount of time)
CUBE Example: CAMEL Selecting “Execution” shows execution time, broken down into part of code & machine
CUBE Example: CAMEL Selecting mainloop adjusts system tree to only show time spent in mainloop for each processor
CUBE Example: CAMEL Expanded nodes show exclusive metric (only time spent by node)
CUBE Example: CAMEL Collapsed nodes show inclusive metric (time spent by node and all children nodes)
CUBE Example: CAMEL Metric pane also shows detected bottlenecks; here it shows “Late Sender” time in MPI_Recv within main, spread across all nodes
Bottleneck Detection: Test Suite [1] • Testing metric: what did CUBE tell us after processing trace file with EXPERT? • Excluding what can be accomplished with VAMPIR export • Program correctness not affected by instrumentation • CAMEL: PASSED • Not many problems detected • “Late sender” attributed to a few places in code, due to CAMEL’s unique communication pattern • LU: TOSS-UP • No “too many small messages” bottleneck pattern • Late sender, messages in wrong order correctly identified though • PPerf: Big messages: PASSED • Showed most time being spent in MPI_Send/MPI_Recv • PPerf: Diffuse procedure: FAILED • Just showed lots of time being spent in barriers • PPerf: Hot procedure: FAILED • Time incorrectly attributed to MPI_Init
Bottleneck Detection: Test Suite [2] • PPerf: Intensive server: PASSED • Late sender bottleneck detected for overloaded server • PPerf: Ping-pong: PASSED • Late sender bottleneck detected • Indicates dependence of messages on each other • PPerf: Random barrier: PASSED • Detected “wait at barrier” bottleneck • PPerf: Small messages: TOSS-UP • Illustrated large time spent in point-to-point MPI routines • Bottleneck incorrectly attributed to late receiver • PPerf: System time: FAILED • Incorrectly attributed to MPI_Init time • PPerf: Wrong order: PASSED • Correctly identified messages received in wrong order
KOJAK General Comments [1] • Good things • Portable, automatic performance analysis • CUBE GUI uses novel way to present metrics • Source code correlation! • Bottlenecks are shown according to which parts of code they occur in and which machines see them • Data is presented in a form that keeps the user from becoming overwhelmed • Libraries are well-separated into APIs and documented • We have the opportunity to re-use their existing code! • Automatic instrumentation is available, although only for a limited number of platforms • Installation relatively easy • Code compiled pretty cleanly • Can still export data into VAMPIR format for more thorough user analysis • Tool very stable (no crashes, only a few bugs)
KOJAK General Comments [2] • Things that could use improvement • Only a few PAPI metrics shown in GUI • FLOPS & L1 data miss rates • No PAPI metrics used for bottleneck detection! • Could write new patterns in EARL, though • When using PAPI, trace file creation fails • Complains about out-of-sync files • Some time at beginning of application gets incorrectly recorded under MPI_Init • CUBE does not correlate with source code unless automatic/binary instrumentation is used • Call tree in second pane turns into flat structure when only MPI profiling library interface is used • Impossible to see specific communication patterns in CUBE • Exporting to VAMPIR trace format possible, but relies on hard-to-find tool vptmerge • Effectiveness of automatic analysis on a day-to-day basis still unknown • However, very powerful tool when combined with VAMPIR
KOJAK: Adding UPC & SHMEM • SHMEM • Not much extra work needed • Need to create a SHMEM profiling interface that writes to EPILOG (see the sketch below) • Add a few extra SHMEM-specific bottleneck patterns • UPC • Could potentially be difficult • If we solve the UPC instrumentation problem, then we just need to use EPILOG instead of (other trace format) • Could use manual instrumentation for everything but implicit communication • Add (many?) UPC-specific bottleneck patterns • In either case, if manual (or source-to-source) instrumentation is used, not much additional code has to be written • Also, since the formats are documented (and existing API implementations are readily available), it should be relatively easy to export to EPILOG traces
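A hedged sketch of what such a SHMEM profiling interface could look like for a single call, reusing the EPILOG user macros from the instrumentation slide. The wrapper name, the macro redirection, and the header name elg_user.h are assumptions for illustration; a real interface would cover every SHMEM routine and would emit proper one-sided communication records rather than generic user regions.

    #include <stddef.h>
    #include <shmem.h>
    #include "elg_user.h"   /* EPILOG user API header (name assumed) */

    /* Illustrative wrapper: bracket one SHMEM call with EPILOG user events. */
    void prof_shmem_long_put(long *dest, const long *src, size_t nelems, int pe)
    {
        ELG_USER_START("shmem_long_put");    /* record entry into the put */
        shmem_long_put(dest, src, nelems, pe);
        ELG_USER_END("shmem_long_put");      /* record exit from the put */
    }

    /* Application code could be redirected to the wrapper, e.g. with
       #define shmem_long_put prof_shmem_long_put
       in a header that application files include after <shmem.h>. */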
Evaluation (1) • Available metrics: 3/5 • Supports recording execution time (broken down into call trees) • Supports recording communication patterns • Supports a few PAPI metrics • Cost: 5/5 • Free! • Documentation quality: 4/5 • Excellent “USAGE” file describes how to use application • CUBE documentation overly technical in some areas • Extendibility: 4/5 • Can easily add new bottleneck patterns • Open source, uses documented APIs • Filtering and aggregation: 2/5 • Simple filtering & aggregation functionality in CUBE GUI • Not supported at the tracefile level, though • Cannot restrict analysis to only certain parts of trace
Evaluation (2) • Hardware support: 5/5 • Many platforms supported • Instrumentation, Measurement, and Analysis • Linux IA-32, IA-64, and Opteron clusters with GNU, PGI, or Intel compilers; IBM Power3 / Power4 based clusters; SGI Mips based clusters (O2k, O3k); SGI IA-64 based clusters (Altix); SUN Sparc based clusters; DEC/HP Alpha based clusters; Generic UNIX workstations (clusters) • Instrumentation and Measurement only • Cray T3E and X1; IBM BG/L; NEC SX; Hitachi SR-8000 • Heterogeneity support: 0/5 (not supported) • Installation: 4.5/5 • Comes in source form, but very easy to compile and install (no problems) • Interoperability: 1.5/5 • Can only export to VAMPIR trace files • Learning curve: 4/5 • MPI trace library easy to use, EXPERT very easy to use • CUBE has a small learning curve but is easy to use
Evaluation (3) • Manual overhead: 4/5 • Automatic instrumentation for a few platforms/compilers • MPI and OpenMP instrumentation support • Measurement accuracy: 3.5/5 • Binary instrumentation more accurate but only available on AIX • Tracing lots of function calls has high overhead • Very low overhead for instrumenting MPI calls only • Multiple analyses: 3/5 • CUBE GUI can show % differences between different runs of a program • Multiple executions: 1/5 • Can perform multiple executions manually • Multiple views: 1/5 • CUBE only has one way of looking at things • Can export to VAMPIR, however
Evaluation (4) • Performance bottleneck identification: 4/5 • Bottleneck rules work pretty well (could use more though) • Bottlenecks related to system and source code • Profiling/tracing support: 3/5 • Only performs tracing • Profiling data shown in CUBE extracted from trace data • Response time: 2/5 • Have to wait until after program finishes executing and EXPERT is done analyzing before you get any feedback • Software support: 4/5 • Supports OpenMP, MPI • Can support linking against any library • Will not instrument linked code, though (binary instrumentation on AIX may support this) • Source code correlation: 5/5 • Well-supported in CUBE, down to the source code line level
Evaluation (5) • System stability: 4.5/5 • No program crashes encountered • A few minor bugs discovered • Technical support: ?/5 • Unknown at this time • Have not personally contacted developers, but tool is still under development