
Paradyn Evaluation Report

Paradyn Evaluation Report . Adam Leko, UPC Group HCS Research Laboratory University of Florida. Color encoding key: Blue: Information Red: Negative note Green: Positive note. Basic Information. Name: Paradyn Developer: University of Wisconsin-Madison Current versions:


Presentation Transcript


  1. Paradyn Evaluation Report Adam Leko, UPC Group HCS Research Laboratory University of Florida Color encoding key: Blue: Information Red: Negative note Green: Positive note

  2. Basic Information • Name: Paradyn • Developer: University of Wisconsin-Madison • Current versions: • Paradyn: 4.1.1 • DynInst: 4.1.1 • KernInst: 2.0.1 • Website: http://www.paradyn.org/index.html • Contact: Matthew Legendre

  3. What is Paradyn? • A performance analysis tool (PAT) for sequential and parallel programs • Uses dynamic binary instrumentation to record program metrics (may use unmodified executables) • Visualizations • Metric-focus grids (right, top) • Rows: performance metrics • Columns: resources to collect a performance metric from • Metrics can be reported as current value, statistics (min/max/average), or time-histograms (right, bottom) • Performance consultant • Automated search to identify bottlenecks in program • Uses W3 model – where, when, why • A generic project that includes tools related to performance analysis • Paradyn PAT • DynInst: a dynamic binary instrumentation library • KernInst: a dynamic instrumentation library for instrumenting running operating system (OS) kernels • Not very useful for a PAT unless PAT needs to be applied to an OS kernel • MRNet: a high-performance communication library supporting master-slave software architectures • “Multicast/Reduction Network” • Not immediately useful for the design phase of our PAT Figures: example metric-focus grid; example time-histogram (bandwidth over time)

  4. General Paradyn Architecture • Four main components • User interface (green) • Visualization (red) • Performance consultant (purple) • Instrumentation (blue) • Thick circles represent running processes, dotted circles represent threads within a single process • Will present each using “bottom up” approach Image courtesy [1]

  5. Part 1: Instrumentation

  6. Instrumentation Overview • Paradyn terminology: points, primitives, and predicates • Points – places where instrumentation code can be placed • Supported points: procedure entry, procedure exit, individual call statements • Primitives – simple operations that change the value of a counter or timer • Predicates – boolean expressions that guard execution of primitives • Using predicates and primitives, points in a program may be instrumented • Predicates and primitives are controlled via PCL and MDL (discussed later) • Paradyn uses dynamic instrumentation to record performance data at points • Paradyn attaches its performance daemons to a running process or starts a new process using an unmodified binary • Instrumentation workflow: • User or performance consultant requests a metric-focus from Paradyn • Data manager in Paradyn uses Remote Procedure Call (RPC) to communicate with remote processes asking them to start instrumentation for a specific metric focus • RPC allows heterogeneity in runtime environment • Metric manager receives instrumentation request and turns that into an abstract, machine-independent request • Instrumentation manager inserts code into executable corresponding to machine-independent abstraction • Executable is stopped • Code is inserted • Executable resumes running • Instrumented data is periodically sampled by the metric manager and sent back to the data manager
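The points/primitives/predicates terminology can be illustrated with a minimal sketch. This is a toy model in Python, not Paradyn's implementation: the class names `Snippet` and `Point`, the `send:entry` point, and the 1024-byte threshold are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Snippet:
    """A primitive guarded by a predicate (hypothetical names)."""
    predicate: Callable[[dict], bool]   # boolean guard
    primitive: Callable[[dict], None]   # updates a counter or timer


@dataclass
class Point:
    """A place where instrumentation can be attached, e.g. procedure entry."""
    name: str
    snippets: List[Snippet] = field(default_factory=list)

    def reach(self, state: dict) -> None:
        # Called whenever execution reaches this point.
        for snippet in self.snippets:
            if snippet.predicate(state):
                snippet.primitive(state)


def count_large_sends(message_sizes, threshold=1024):
    """Count calls to a hypothetical send routine with messages >= threshold bytes."""
    state = {"msg_bytes": 0, "large_sends": 0}

    def is_large(s):          # predicate
        return s["msg_bytes"] >= threshold

    def increment(s):         # primitive: bump a counter
        s["large_sends"] += 1

    entry = Point("send:entry", [Snippet(is_large, increment)])
    for size in message_sizes:
        state["msg_bytes"] = size
        entry.reach(state)    # simulate one call reaching the entry point
    return state["large_sends"]
```

Calling `count_large_sends([10, 2048, 1024, 5])` counts the two messages that satisfy the predicate; the primitive never runs for the others.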

  7. Binary Instrumentation • Binary instrumentation accomplished by inserting base trampolines for each instrumentation point • Base trampolines handle storing current state of program so instrumentations do not affect execution • In some architectures, only registers that are used are saved (if can be inferred from machine calling convention) • Mini trampolines are the machine-specific realizations of predicates and primitives • One base trampoline may handle many mini-trampolines, but a base trampoline is needed for every instrumentation point • Basic flow of trampoline shown in right, top • Mini trampoline assembly code for SPARC machine shown in right, bottom • Binary instrumentation difficult! • Have to deal with • Compiler optimizations • Branch delay slots • Different sizes of instructions for x86 (may increase the number of instructions that have to be relocated) • Creating and inserting mini trampolines somewhere in program (at end?) • Limited-range jumps may complicate this • Luckily, DynInst library available separately for use in other applications • Paradyn’s instrumentation cost <= 80 clock cycles per base trampoline [2] Trampoline flow (courtesy [2]) Mini trampoline (courtesy [2])
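The division of labor between base and mini trampolines can be sketched as a toy model. Assumptions: registers are modeled as a Python dict, the relocated instruction is a plain callable, and `BaseTrampoline`, `insert`, `remove`, and `demo` are hypothetical names; real trampolines are machine code patched into the binary.

```python
class BaseTrampoline:
    """One per instrumentation point; hosts any number of mini trampolines."""

    def __init__(self, relocated_instruction):
        # The instruction displaced by the jump into the trampoline.
        self.relocated_instruction = relocated_instruction
        self.mini_trampolines = []

    def insert(self, mini):
        self.mini_trampolines.append(mini)

    def remove(self, mini):
        self.mini_trampolines.remove(mini)

    def execute(self, registers):
        saved = dict(registers)            # save program state
        for mini in self.mini_trampolines:
            mini(registers)                # instrumentation may clobber registers
        registers.clear()
        registers.update(saved)            # restore state before resuming
        return self.relocated_instruction(registers)


def demo():
    calls = {"count": 0}

    def count_call(regs):
        regs["r1"] = -1                    # scratch use that clobbers a register
        calls["count"] += 1

    tramp = BaseTrampoline(lambda regs: regs["r1"] + regs["r2"])
    tramp.insert(count_call)
    regs = {"r1": 2, "r2": 3}
    results = [tramp.execute(regs) for _ in range(3)]
    tramp.remove(count_call)               # instrumentation removed at run time
    results.append(tramp.execute(regs))
    return results, calls["count"], regs
```

In this model the program's result is the same with or without instrumentation, which is the property the state save and restore is meant to guarantee.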

  8. PCL & MDL • Paradyn provides a TCL-like language to configure and add metrics without recompiling or modifying Paradyn • Stored in paradyn.rc, user may use their own version (.paradynrc) • PCL – Paradyn Control Language • Controls available daemons (MPI, sequential, etc) • Can add processes automatically at startup (which programs to record performance data for) • Can customize Paradyn options (colors and other “tunable constants”) • Can add visualizations (described later) • Can add metrics via MDL • MDL – Metric Description Language • Sublanguage of PCL • Describe metrics • Types provided: counters and timers • Can specify constraints for each metric that limit how they can be used/what they can be used with • May be exclusive or inclusive (include a point’s calls to other procedures or just include a point’s cost by itself, excluding time spent in other procedures called from this point) • Language not Turing complete: no looping construct provided • Example counter shown right Counter MDL code (courtesy [2])
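The inclusive/exclusive distinction for timers can be made concrete with a small sketch. This is not MDL; it is a Python illustration that assumes a well-nested list of hypothetical `(kind, function, time)` enter/exit events and no recursion.

```python
def inclusive_exclusive(events):
    """Return (inclusive, exclusive) time per function from enter/exit events."""
    inclusive = {}
    exclusive = {}
    stack = []  # each entry: [function, entry_time, time_spent_in_callees]
    for kind, func, t in events:
        if kind == "enter":
            stack.append([func, t, 0])
        else:
            name, start, child_time = stack.pop()
            if name != func:
                raise ValueError("events are not well nested")
            total = t - start
            inclusive[name] = inclusive.get(name, 0) + total
            exclusive[name] = exclusive.get(name, 0) + total - child_time
            if stack:
                stack[-1][2] += total  # charge this call to the caller's callee time
    return inclusive, exclusive


# Example trace: main runs from t=0 to t=10 and calls f from t=2 to t=5.
TRACE = [("enter", "main", 0), ("enter", "f", 2), ("exit", "f", 5), ("exit", "main", 10)]
```

For `TRACE`, the inclusive time of `main` is 10 while its exclusive time is 7, since the 3 time units spent in `f` are excluded.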

  9. Paradyn Overhead • Instrumentation overhead very low for most test programs for 5 metrics on all functions • Communication metrics • Number of messages sent • Number of point-to-point messages • Number of collective messages • I/O bytes • CPU metric • CPU utilization • Instrumenting CAMEL’s main routine had 800% overhead • Instrumenting a function also instruments its call sites • main routine had many small function calls • Performance consultant (discussed later) added a large amount of overhead to most programs during searches
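A back-of-the-envelope calculation shows why many small function calls can produce very large relative overhead. Only the 80-cycle bound per base trampoline comes from the slides; the two instrumentation points per call and the function body sizes below are illustrative assumptions, not measurements of CAMEL.

```python
def instrumentation_overhead(body_cycles, points_per_call=2, cycles_per_point=80):
    """Relative overhead: added instrumentation cycles divided by useful cycles.

    Assumes each call passes through `points_per_call` instrumentation points
    (e.g. entry and exit), each costing at most `cycles_per_point` cycles.
    """
    return points_per_call * cycles_per_point / body_cycles


small = instrumentation_overhead(20)        # a 20-cycle helper function
large = instrumentation_overhead(100_000)   # a long-running routine
```

Under these assumptions a 20-cycle function sees 160 added cycles per call, a relative overhead of 8.0 (800%), while a 100,000-cycle routine sees well under 1%.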

  10. Part 2: Performance Consultant

  11. Performance Consultant (PC) Overview • PC performs an automated search on the program • Identifies bottlenecks in programs • Uses W3 search (described next slide) • Search is guided, based on program’s call graph [3] • Iterative method that tests hypothesis against sections of code • Starts with main and examines subroutine calls • “Drills down” and examines subroutines based on how frequently they are called • Call graph search method was successfully applied to several large programs containing thousands of lines of code • However, method can miss functions called by more than one parent function whose individual parent functions do not appear as “problem functions” • Call graph automatically generated from executable’s symbol table • Example PC run shown at top right, corresponding call graph for application shown at bottom right Example PC run Call graph used in PC run

  12. W3 Model: Why, Where, When • Paradyn’s goal: “… to assist the user in locating performance problems in a program; a performance problem is a part of the program that contributes a significant amount of time to its execution” • W3 model attempts to answer: • Why is the program performing poorly? • Where is the program performing poorly? • When is the application performing poorly? • Performance consultant shows why and where axes graphically to the user (see right) • Yellow line: why axis refinement • Purple line: where axis refinement W3 refinements (blue=true, pink=false)

  13. W3 Model: Why, Where, When (2) • Why axis • Paradyn applies hypotheses to code • ExcessiveSyncWaitingTime? • CPUBound? • ExcessiveIOBlockingTime? • TooManySmallIOOps? • Each hypothesis is represented by a tunable predicate • E.g., CPUBound := CPUTime > 20% • After a hypothesis is determined to be false, no more searching is done for that type of bottleneck • Where axis • Once a hypothesis is tested to be true (why refinement), • An automated search is started to determine where the problem lies • Each subroutine is examined to see if the hypothesis is also true (where refinement) • The program’s call graph is used to guide search of subroutines • Where axis is iteratively searched until the deepest node of the call graph is reached that the hypothesis tests true for • When axis • Indirectly supported through the use of “phases” • Phases are defined by the user • Phases represent specific time intervals in a program’s execution • When axis refinement relies on the user’s interaction • While axis refinements are made, performance consultant automatically requests instrumentation • Frequency of instrumentation and a limit on number of concurrent instrumentations can be set by the user W3 refinements (blue=true, pink=false)
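The where-axis refinement can be sketched as a threshold test applied while descending the call graph. This is a simplified illustration, not the Performance Consultant's actual algorithm: the call graphs, metric fractions, and function names are hypothetical, and only the 20% CPUBound threshold is taken from the slide.

```python
def refine_where(call_graph, metric, threshold, root="main"):
    """Return the deepest call-graph nodes for which the hypothesis tests true.

    call_graph maps a function to the functions it calls; metric maps a
    function to the fraction of run time attributed to it.
    """
    if metric.get(root, 0.0) <= threshold:
        return []                      # hypothesis false at the top: stop searching
    found, stack, visited = [], [root], set()
    while stack:
        node = stack.pop()
        if node in visited:
            continue
        visited.add(node)
        hot = [c for c in call_graph.get(node, []) if metric.get(c, 0.0) > threshold]
        if hot:
            stack.extend(hot)          # refine further down the call graph
        else:
            found.append(node)         # no child tests true: deepest true node
    return sorted(found)


CPU_BOUND = 0.20  # CPUBound := CPUTime > 20%

GRAPH = {"main": ["solve", "io"], "solve": ["kernel", "exchange"]}
CPU = {"main": 0.9, "solve": 0.8, "io": 0.05, "kernel": 0.7, "exchange": 0.1}

# A helper called from two parents that each stay under the threshold.
SHARED_GRAPH = {"main": ["a", "b"], "a": ["helper"], "b": ["helper"]}
SHARED_CPU = {"main": 0.5, "a": 0.15, "b": 0.15, "helper": 0.3}
```

With `GRAPH` and `CPU` the search narrows down to `kernel`. With `SHARED_GRAPH` it stops at `main` and never examines `helper`, which mirrors the limitation noted for functions called by more than one parent.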

  14. Bottleneck Identification Test Suite • Testing metric: what did Performance Consultant tell us? • Program correctness not affected by instrumentation • CAMEL: PASSED • Identified program as CPU-bound • However, Performance Consultant added much overhead and resulted in a misdetection on the where axis • LU: TOSS-UP • Identified as excessive sync time bottleneck • Not further resolved to too many small messages, only was able to track down to the ssor.f source code file • Big messages: PASSED • Identified excessive sync time @ Grecv_message function • Diffuse procedure: FAILED • Identified excessive sync time at MPI_Barrier, but did not localize to bottleneck procedure • Missed picking up on diffuse CPU-bound behavior

  15. Bottleneck Identification Test Suite (2) • Hot procedure: PASSED • Correctly identified CPU-bound bottleneck procedure • Due to excessive instrumentation, Performance Consultant overhead slightly misdiagnosed where location • Attributed to all nodes except one when all nodes exhibit the problem • Intensive server: TOSS-UP • Identified excessive sync waiting time on Grecv_message from main • However, due to lack of trace view, it would be difficult/impossible to see all threads waiting on the master thread • Ping-pong: PASSED • Identified excessive sync waiting time on Grecv_message • Random barrier: TOSS-UP • Identified excessive sync waiting time on barrier in main • No trace view means it would be nearly impossible to see randomness of which node was (inconsistently) taking more time

  16. Bottleneck Identification Test Suite (3) • Small messages: TOSS-UP • Identified excessive sync waiting time on Gsendmessage in main • Did not localize to a particular node, though • System time: FAILED • Performance Consultant failed to instrument code • Possibly due to OS being too busy with user code to handle dynamic binary instrumentation • Wrong order: TOSS-UP • Identified excessive sync waiting time on messages on main • Would best be seen by a trace, but classification here was different than other communication-based bottlenecks

  17. Part 3: Visualizations

  18. Visualizations Overview • Paradyn supports several types of built-in visualizations (visis) for metrics • Bar charts • Histograms (right, top) • Table (text representation, can show current/max/min values for each metric) • “Terrain” – 3D histogram (see right, bottom) • Axes are time, metric, location • Visualizations may handle multiple metrics at once • Visualizations are implemented as separate processes • Callback functions are used to provide continuous data to visualization programs • Users may add custom visualizations • Paradyn provides a simple library and RPC interface • Configured to show up in interface via PCL files (paradyn.rc, .paradynrc) Figures: histogram visualization (top); terrain visualization (bottom)

  19. Visualizations Overview (2) • When a user creates a visualization, • Paradyn automatically instruments running program accordingly • Visualization continues until user closes it • After closing, Paradyn automatically removes instrumented code • Histograms are stored using a fixed-size data structure • Metric values sorted into “buckets” • When buckets fill, data is reorganized and the time interval covered by each bucket doubles (keeping the structure at a fixed size) • As execution time increases, sampling rate decreases logarithmically to keep data sizes small Figures: histogram visualization (top); terrain visualization (bottom)
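A minimal sketch of a fixed-size time histogram, assuming that "reorganizing" means merging adjacent buckets so that each bucket covers twice the time interval. The class name `FoldingHistogram`, its methods, and the bucket counts are hypothetical; Paradyn's actual data structure may differ.

```python
class FoldingHistogram:
    """Fixed number of buckets; bucket width doubles whenever time runs past the end."""

    def __init__(self, num_buckets=8, bucket_width=1.0):
        if num_buckets % 2 != 0:
            raise ValueError("num_buckets must be even so pairs can be merged")
        self.buckets = [0.0] * num_buckets
        self.width = bucket_width

    def add(self, time, value):
        # Fold until the sample's timestamp fits in the covered time range.
        while time >= len(self.buckets) * self.width:
            self._fold()
        self.buckets[int(time // self.width)] += value

    def _fold(self):
        n = len(self.buckets)
        merged = [self.buckets[i] + self.buckets[i + 1] for i in range(0, n, 2)]
        self.buckets = merged + [0.0] * (n - len(merged))  # same size, half is free again
        self.width *= 2                                     # coarser time resolution


def demo_histogram():
    h = FoldingHistogram(num_buckets=4, bucket_width=1.0)
    for t, v in [(0, 1), (1, 2), (2, 3), (3, 4), (4, 5)]:
        h.add(t, v)
    return h.buckets, h.width
```

Memory use stays constant no matter how long the program runs, at the cost of time resolution: each fold halves the effective sampling granularity, which is consistent with the accuracy loss over time noted in the evaluation.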

  20. Part 4: User Interface

  21. User Interface Overview • Current interface uses Tcl/Tk for graphics (right) • Multiple windows for everything • Makes for a cluttered interface • Tcl/Tk provides a useable but crude-looking interface Example Paradyn session

  22. Paradyn Bugs • Can’t detect end of MPI program run (Paradyn will crash unless you start over from scratch) • Program crashes almost every time shortly after MPI program completes • Buggy startup code (starting a new process twice gives errors; program must be restarted) • Doesn’t work with code compiled with debugging information (gcc -g), see error dialog to right • “Can’t read .shstrtab section” • Pausing execution and adding a visualization crashes Paradyn (program continues execution while Paradyn thinks it is still paused) • Often leaves zombie children processes, even on error-free runs • Paradyn left unkillable processes hanging around after crashes on etas • killall -9 could not get rid of them

  23. Paradyn Complaints • Slow startup (~5 seconds for each MPI node on etas) • Performance consultant takes a while to identify bottlenecks • Although, search is entirely automated • However, only seems to pick up on code that exhibits obvious bottlenecks • Cluttered and confusing interface • Why are there separate windows for the callgraph and where axes? • Many bugs, although most are handled by displaying a nice dialog box • However, some bugs necessitate a Paradyn restart • Function list on “where axis” dialog box contains a huge number of functions for MPI programs (~100+, includes MPI functions in list which makes it hard to single out your application’s functions) • Phase function difficult to use • Should be easier to define phases, or base phases on subroutine entry/exit points • No “stop process” button!

  24. Paradyn General Comments • W3 search hypotheses and threshold functions seem overly simplistic (-) • Doesn’t seem to work well on code that alternates quickly between communication and computation • Small number of hypotheses, perhaps due to large cost of evaluating each one? • Cutoff values for hypotheses seem arbitrary • Are tunable, but is a fixed cutoff appropriate? • Performance consultant was not able to detect/classify a sleep(1) statement inserted for a single MPI process • Should have labeled the receiving node as ExcessiveSyncWaitingTime, but did not label the process at all • Quick changes between computation and communication may have fooled it; perhaps adjusting thresholds would have helped • How would you know which thresholds to change? • How useful is the information provided by the W3 search? • Seems to only be able to pick out obvious things • Says what is the problem, but does it offer insight on how to fix it? • Overhead introduced by dynamic instrumentation seems very tiny (+++) • < 1% for 16 metrics being collected on a 16-node MPI application • However, overhead can increase dramatically for functions that call other (lightweight) functions many times over

  25. Paradyn General Comments (2) • Platform support (--) • Paradyn: No support for 64-bit applications or Cray platforms! • DynInst: No support for 64-bit Opteron or Cray platforms! (Support for Itanium is provided though) • Dependence on DynInst combined with difficulty in porting DynInst to new platforms a potential problem • Adding and removing instrumentations is fast and works well (+++) • DynInst seems to be much more stable than Paradyn, minus the parsing bugs for executables compiled with gcc -g • Adding instrumentations to code usually takes one second or less • Helps reduce the measure stage of the “measure-modify” approach • However, time needed to start programs significantly increased, especially with many processes (-) • However, extra delays incurred during instrumentation affect the ability to gather traces of program execution • Is dynamic instrumentation necessary? • Things are greatly simplified when dynamic binary-level instrumentation is not implemented • Is it worth the added cost and complexity? • Fairly complex piece of software, takes a while to learn how to use effectively, even with tutorials (-) • This, along with its complicated installation procedure, may discourage its use • Though documentation is pretty good • PCL and MDL allow configuration and addition of user-defined metrics (+++)

  26. Feasibility for UPC & SHMEM • In order to add support for UPC & SHMEM: • Need to create Paradyn daemons for UPC and SHMEM codes • This may be very difficult, since Paradyn daemons need to handle instrumentation • For UPC, how should communication be handled? • Instrument runtime libraries? • Which runtimes should be supported? • Is it feasible to support all runtimes of interest? • What about proprietary UPC languages and runtimes? • This could be an insurmountable problem • Paradyn has been around for a long time • Is there a lot of crufty code in the source code that is left alone because no one understands it? • Is the current user interface (Tcl/Tk) acceptable? • Also: • Would need to port DynInst to targeted architectures • This may be problematic for architectures with no publicly available information on executable file formats/etc • Should include performance metrics as recorded by PAPI • MDL should help, but • Will MDL present too large of an overhead for the level of granularity needed by PAPI? • Is a lack of tracing ability acceptable? What if more details are needed?

  27. Evaluation (1) • Available metrics: 5/5 • Many built in • Number of CPUs, number of active threads, CPU and inclusive CPU time • Function calls to and by • Synchronization (# operations, wait time, inclusive wait time) • Overall communication (# messages, bytes sent and received), collective communication (# messages, bytes sent and received), point-to-point communication (# messages, bytes sent and received) • I/O (# operations, wait time, inclusive wait time, total bytes) • Can add more using MDL and PCL • Cost: free 5/5 • Documentation quality: 4/5 • Tutorial for using sequential and MPI programs with Paradyn • Well-written manuals • Programming guides included for DynInst, visualization library, and MDL • Extensibility: 2/5 • Creating a SHMEM daemon wouldn’t take a lot of work • Creating a UPC daemon will be problematic for proprietary runtimes • Depends on DynInst, and porting DynInst to a new platform may take an immense amount of work • Filtering and aggregation: 3/5 • Only supports rudimentary aggregation on metrics (min, max, averages) • Hardware support: 2/5 • No support for Opteron, Itanium, or Cray architectures in Paradyn • DynInst supports Itanium • Porting DynInst (which Paradyn depends on) would be very difficult • Heterogeneity support: 5/5 • Authors claim Paradyn supports heterogeneity due to use of RPC interfaces • Not directly supported by user interface for MPI programs, so cannot test

  28. Evaluation (2) • Installation: 2/5 • Binaries are easy to find • http://www.paradyn.org/html/paradyn4.1-software.html • Compiling from source extremely difficult and error-prone • Relies on specific versions of libdwarf (Linux only) and Tcl/Tk (all), which complicates the installation if your distribution or OS uses incompatible versions • Installation time: approximately 2-3 hours for a shared environment • Need to create scripts that set about 6 environment variables before program will run correctly • Interoperability: 1/5 • Paradyn can save output in simple, documented format, but usefulness of data unknown • No detailed, trace-like information can be provided as it is not collected • Dynamic instrumentation interferes with tracing due to timing perturbations • Learning curve: 2/5 • Difficult, complex program with many parts • Took approximately 1 week to get comfortable with the program • Manuals and tutorials are very helpful • Manual overhead: 5/5 • No modification needed for executables • Measurement accuracy: 4/5 • Dynamic instrumentation incurs very low overhead (~80 cycles for trampoline overhead) • Time-histogram loses accuracy as time goes on due to fixed size • Multiple execution: 0/5 (not supported)

  29. Evaluation (3) • Multiple views and analyses: 5/5 • Several visualization types are supported for all metrics • Histograms, bar charts, 3D histograms, tables, summary tables (min/max/average) • Users can add new visualization programs as desired using Paradyn’s RPC interface and visualization library • Call graphs and where axis give user a hierarchical view of their code • Default visualizations support zooming and panning • Performance bottleneck identification: 2.5/5 • Performance consultant can help identify “obvious” bottlenecks automatically • Due to limited search space (only 4 types of bottlenecks), bottleneck identification is limited • Tweaking thresholds used for search may be necessary to identify bottlenecks • Profiling/tracing support: 2/5 • Uses a “hybrid” approach of sampling and tracing • Detailed tracing information cannot be logged for later analysis • Cannot create a trace file of when MPI functions were called (e.g., what you’d need for Jumpshot) • Paradyn daemons report values back to main Paradyn process at time intervals • Main Paradyn process only has an approximation of “real” values for metrics at any given time • However, values recorded by main Paradyn process can be exported (uses a simple documented format) • Response time: 5/5 • Dynamic instrumentation allows arbitrarily turning on and off instrumentation without needing to restart or recompile application • Only takes a few seconds to start collecting metrics once they are requested

  30. Evaluation (4) • Software support: 3/5 • Supported languages: threaded C code, MPI code • Supported software platforms: Linux kernel version 2.4 & 2.6, AIX, Tru64, Windows 2000/XP, and IRIX • Source code correlation: 2/5 • Can correlate back to the function name level (reads executable symbol tables) • No line numbers or statement information available • Searching: 0/5 (not supported) • System stability: 2/5 • Many bugs in Linux version, but bugs seem to be limited to Paradyn GUI; DynInst seems very stable • Technical support: 4/5 • Helpful responses from our contact within 24 hours

  31. References [1] B. P. Miller et al., “The Paradyn parallel performance measurement tool,” IEEE Computer, November 1995, pp. 37-46. [2] J. K. Hollingsworth et al., “MDL: A Language and Compiler for Dynamic Program Instrumentation,” IEEE PACT, 1997, p. 201. [3] H. Cain, B. P. Miller, and B. J. N. Wylie, “A Callgraph-Based Search Strategy for Automated Performance Diagnosis,” European Conference on Parallel Computing (Euro-Par), Munich, Germany, August 2000, p. 108.
