Tool Visualizations, Metrics, and Profiled Entities Overview Adam Leko HCS Research Laboratory University of Florida
Summary • Give characteristics of existing tools to aid our design discussions • Metrics (what is recorded, any hardware counters, etc) • Profiled entities • Visualizations • Most information & some slides taken from tool evaluations • Tools overviewed • TAU • Paradyn • MPE/Jumpshot • Dimemas/Paraver/MPITrace • mpiP • Dynaprof • KOJAK • Intel Cluster Tools (old Vampir/VampirTrace) • Pablo • MPICL/Paragraph
TAU • Metrics recorded • Two modes: profile, trace • Profile mode • Inclusive/exclusive time spent in functions • Hardware counter information • PAPI/PCL: L1/2/3 cache reads/writes/misses, TLB misses, cycles, integer/floating point/load/store/stalls executed, wall clock time, virtual time • Other OS timers (gettimeofday, getrusage) • MPI message size sent • Trace mode • Same as profile (minus hardware counters?) • Message send time, message receive time, message size, message sender/recipient(?) • Profiled entities • Functions (automatic & dynamic), loops + regions (manual instrumentation)
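TAU – manual instrumentation sketch
A minimal sketch of timing a user region with TAU's C instrumentation macros, assuming a TAU installation and a TAU-instrumented build; the region name "compute" is illustrative.

#include <TAU.h>

int main(int argc, char **argv)
{
    /* Declare a timer for the region; TAU_USER is the default group. */
    TAU_PROFILE_TIMER(t, "compute", "", TAU_USER);
    TAU_PROFILE_INIT(argc, argv);
    TAU_PROFILE_SET_NODE(0);        /* single-process example */

    TAU_PROFILE_START(t);
    /* ... region of interest (loop or code block) ... */
    TAU_PROFILE_STOP(t);

    return 0;                       /* profile files written at exit */
}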
TAU • Visualizations • Profile mode • Text-based: pprof (example next slide), shows a summary of profile information • Graphical: racy (old), jracy a.k.a. paraprof • Trace mode • No built-in visualizations • Can export to CUBE (see KOJAK), Jumpshot (see MPE), and Vampir format (see Intel Cluster Tools)
TAU – pprof output

Reading Profile files in profile.*

NODE 0;CONTEXT 0;THREAD 0:
---------------------------------------------------------------------------------------
%Time    Exclusive    Inclusive       #Call      #Subrs  Inclusive Name
              msec   total msec                          usec/call
---------------------------------------------------------------------------------------
 100.0        0.207       20,011           1           2   20011689 main() (calls f1, f5)
  75.0        1,001       15,009           1           2   15009904 f1() (sleeps 1 sec, calls f2, f4)
  75.0        1,001       15,009           1           2   15009904 main() (calls f1, f5) => f1() (sleeps 1 sec, calls f2, f4)
  50.0        4,003       10,007           2           2    5003524 f2() (sleeps 2 sec, calls f3)
  45.0        4,001        9,005           1           1    9005230 f1() (sleeps 1 sec, calls f2, f4) => f4() (sleeps 4 sec, calls f2)
  45.0        4,001        9,005           1           1    9005230 f4() (sleeps 4 sec, calls f2)
  30.0        6,003        6,003           2           0    3001710 f2() (sleeps 2 sec, calls f3) => f3() (sleeps 3 sec)
  30.0        6,003        6,003           2           0    3001710 f3() (sleeps 3 sec)
  25.0        2,001        5,003           1           1    5003546 f4() (sleeps 4 sec, calls f2) => f2() (sleeps 2 sec, calls f3)
  25.0        2,001        5,003           1           1    5003502 f1() (sleeps 1 sec, calls f2, f4) => f2() (sleeps 2 sec, calls f3)
  25.0        5,001        5,001           1           0    5001578 f5() (sleeps 5 sec)
  25.0        5,001        5,001           1           0    5001578 main() (calls f1, f5) => f5() (sleeps 5 sec)
Paradyn • Metrics recorded • Number of CPUs, number of active threads, CPU and inclusive CPU time • Function calls to and by • Synchronization (# operations, wait time, inclusive wait time) • Overall communication (# messages, bytes sent and received), collective communication (# messages, bytes sent and received), point-to-point communication (# messages, bytes sent and received) • I/O (# operations, wait time, inclusive wait time, total bytes) • All metrics recorded as “time histograms” (fixed-size data structure) • Profiled entities • Functions only (but includes functions linked to in existing libraries)
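Paradyn – time histogram sketch
Paradyn stores every metric in a fixed-size "time histogram." A minimal sketch of one common way to keep such a structure fixed-size (fold adjacent buckets and double the bucket width when the array fills); this is illustrative, not Paradyn's actual code.

#include <string.h>

#define NBUCKETS 1000

typedef struct {
    double value[NBUCKETS];   /* metric value accumulated per time bucket */
    double width;             /* current bucket width in seconds          */
} time_hist;

/* Add a sample taken at time t (seconds since start) with value v. */
static void hist_add(time_hist *h, double t, double v)
{
    while (t >= NBUCKETS * h->width) {
        /* Array is full: fold pairs of buckets and double the width. */
        for (int i = 0; i < NBUCKETS / 2; i++)
            h->value[i] = h->value[2 * i] + h->value[2 * i + 1];
        memset(&h->value[NBUCKETS / 2], 0, (NBUCKETS / 2) * sizeof(double));
        h->width *= 2.0;
    }
    h->value[(int)(t / h->width)] += v;
}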
Paradyn • Visualizations • Time histograms • Tables • Barcharts • “Terrains” (3-D histograms)
MPE/Jumpshot • Metrics collected • MPI message send time, receive time, size, message sender/recipient • User-defined event entry & exit • Profiled entities • All MPI functions • Functions or regions via manual instrumentation and custom events • Visualization • Jumpshot: timeline view (space-time diagram overlaid on Gantt chart), histogram
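MPE/Jumpshot – custom event sketch
A minimal sketch of logging a user-defined state with the MPE logging API so it appears in Jumpshot; the state name, color, and log file prefix are illustrative, and the MPE logging library must be linked in.

#include <mpi.h>
#include <mpe.h>

int main(int argc, char **argv)
{
    int ev_begin, ev_end;

    MPI_Init(&argc, &argv);
    MPE_Init_log();

    ev_begin = MPE_Log_get_event_number();
    ev_end   = MPE_Log_get_event_number();
    MPE_Describe_state(ev_begin, ev_end, "compute", "red");

    MPE_Log_event(ev_begin, 0, NULL);
    /* ... user computation ... */
    MPE_Log_event(ev_end, 0, NULL);

    MPE_Finish_log("camel");   /* writes a CLOG file viewable in Jumpshot */
    MPI_Finalize();
    return 0;
}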
Dimemas/Paraver/MPITrace • Metrics recorded (MPITrace) • All MPI functions • Hardware counters (2 from the following two lists, uses PAPI) • Counter 1 • Cycles • Issued instructions, loads, stores, store conditionals • Failed store conditionals • Decoded branches • Quadwords written back from scache(?) • Correctible scache data array errors(?) • Primary/secondary I-cache misses • Instructions mispredicted from scache way prediction table(?) • External interventions (cache coherency?) • External invalidations (cache coherency?) • Graduated instructions • Counter 2 • Cycles • Graduated instructions, loads, stores, store conditionals, floating point instructions • TLB misses • Mispredicted branches • Primary/secondary data cache miss rates • Data mispredictions from scache way prediction table(?) • External intervention/invalidation (cache coherency?) • Store/prefetch exclusive to clean/shared block
Dimemas/Paraver/MPITrace • Profiled entities (MPITrace) • All MPI functions (message start time, message end time, message size, message recipient/sender) • User regions/functions via manual instrumentation • Visualization • Timeline display (like Jumpshot) • Shows Gantt chart and messages • Also can overlay hardware counter information • Clicking on timeline brings up a text listing of events near where you clicked • 1D/2D analysis modules
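Dimemas/Paraver/MPITrace – user event sketch
A minimal sketch of emitting a user event that Paraver can display on the timeline; the call name MPItrace_event(type, value) and its semantics are assumptions based on the MPITrace user API as I recall it, so check the tool's manual before relying on them.

/* Assumed prototype from MPITrace's user-event API (normally in its header). */
extern void MPItrace_event(int type, int value);

void solver_iteration(int iter)
{
    MPItrace_event(1000, iter + 1);   /* event type 1000 = iteration number */
    /* ... computation and MPI communication ... */
    MPItrace_event(1000, 0);          /* value 0 conventionally ends the event */
}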
mpiP • Metrics collected • Start time, end time, message size for each MPI call • Profiled entities • MPI function calls + PMPI wrapper • Visualization • Text-based output, with graphical browser that displays statistics in-line with source • Displayed information: • Overall time (%) for each MPI node • Top 20 callsites for time (MPI%, App%, variance) • Top 20 callsites for message size (MPI%, App%, variance) • Min/max/average/MPI%/App% time spent at each call site • Min/max/average/sum of message sizes at each call site • App time = wall clock time between MPI_Init and MPI_Finalize • MPI time = all time consumed by MPI functions • App% = % of metric in relation to overall app time • MPI% = % of metric in relation to overall MPI time
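mpiP – PMPI wrapper sketch
mpiP intercepts MPI calls through the standard PMPI profiling interface: each wrapper records a timestamp pair and the message size, then forwards to the PMPI_ entry point. A minimal sketch of that mechanism for MPI_Send; record_sample() is a hypothetical stand-in for mpiP's internal per-callsite statistics.

#include <mpi.h>

/* Hypothetical bookkeeping hook standing in for the tool's internals. */
extern void record_sample(const char *fn, double seconds, int bytes);

int MPI_Send(void *buf, int count, MPI_Datatype type,
             int dest, int tag, MPI_Comm comm)
{
    double t0 = MPI_Wtime();
    int size, rc;

    rc = PMPI_Send(buf, count, type, dest, tag, comm);   /* the real send */

    PMPI_Type_size(type, &size);
    record_sample("MPI_Send", MPI_Wtime() - t0, count * size);
    return rc;
}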
Dynaprof • Metrics collected • Wall clock time or PAPI metric for each profiled entity • Collects inclusive, exclusive, and 1-level call tree % information • Profiled entities • Functions (dynamic instrumentation) • Visualizations • Simple text-based • Simple GUI (shows same info as text-based)
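Dynaprof – PAPI metric sketch
A minimal sketch of reading one PAPI preset counter around a region, which is the kind of measurement Dynaprof attaches to each probed function; assumes the PAPI_FP_OPS preset is available on the platform.

#include <stdio.h>
#include <papi.h>

int main(void)
{
    int evset = PAPI_NULL;
    long long count[1];

    if (PAPI_library_init(PAPI_VER_CURRENT) != PAPI_VER_CURRENT)
        return 1;
    PAPI_create_eventset(&evset);
    PAPI_add_event(evset, PAPI_FP_OPS);   /* floating-point operations */

    PAPI_start(evset);
    /* ... region of interest ... */
    PAPI_stop(evset, count);

    printf("FP ops: %lld\n", count[0]);
    return 0;
}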
Dynaprof – output

[leko@eta-1 dynaprof]$ wallclockrpt lu-1.wallclock.16143

Exclusive Profile.
Name             Percent      Total        Calls
-------------    -------      -----        -------
TOTAL            100          1.436e+11    1
unknown          100          1.436e+11    1
main             3.837e-06    5511         1

Inclusive Profile.
Name             Percent      Total        SubCalls
-------------    -------      -----        -------
TOTAL            100          1.436e+11    0
main             100          1.436e+11    5

1-Level Inclusive Call Tree.
Parent/-Child    Percent      Total        Calls
-------------    -------      -----        --------
TOTAL            100          1.436e+11    1
main             100          1.436e+11    1
- f_setarg.0     1.414e-05    2.03e+04     1
- f_setsig.1     1.324e-05    1.902e+04    1
- f_init.2       2.569e-05    3.691e+04    1
- atexit.3       7.042e-06    1.012e+04    1
- MAIN__.4       0            0            1
KOJAK • Metrics collected • MPI: message start time, receive time, size, message sender/recipient • Manual instrumentation: start and stop times • 1 PAPI metric / run (only FLOPS and L1 data misses visualized) • Profiled entities • MPI calls (MPI wrapper library) • Function calls (automatic instrumentation, only available on a few platforms) • Regions and function calls via manual instrumentation • Visualizations • Can export traces to Vampir trace format (see ICT) • Shows profile and analyzed data via CUBE (described on next few slides)
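KOJAK – manual region instrumentation sketch
A minimal sketch of marking a user region for KOJAK; the POMP directive spelling (#pragma pomp inst begin/end, processed by OPARI) is an assumption from memory, so verify it against the KOJAK documentation. The function and its arguments are illustrative.

void relax(double *grid, int n)
{
    #pragma pomp inst begin(relax)
    /* ... stencil updates on grid ... */
    #pragma pomp inst end(relax)
}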
CUBE overview: simple description • Uses a 3-pane approach to display information • Metric pane • Module/calltree pane • Right-clicking brings up source code location • Location pane (system tree) • Each item is displayed along with a color indicating the severity of its condition • Severity can be expressed in 4 ways • Absolute (time) • Percentage • Relative percentage (changes module & location pane) • Comparative percentage (differences between executions) • Despite the documentation, the interface is actually quite intuitive
CUBE example: CAMEL After opening the .cube file (default metric shown = absolute time taken in seconds)
CUBE example: CAMEL After expanding all 3 root nodes; color shown indicates metric “severity” (amount of time)
CUBE example: CAMEL Selecting “Execution” shows execution time, broken down into part of code & machine
CUBE example: CAMEL Selecting mainloop adjusts the system tree to show only the time spent in mainloop on each processor
CUBE example: CAMEL Expanded nodes show exclusive metric (only time spent by node)
CUBE example: CAMEL Collapsed nodes show inclusive metric (time spent by node and all children nodes)
CUBE example: CAMEL Metric pane also shows detected bottlenecks; here, a “Late Sender” in MPI_Recv within main, spread across all nodes
Intel Cluster Tools (ICT) • Metrics collected • MPI functions: start time, end time, message size, message sender/recipient • User-defined events: counter, start & end times • Code location for source-code correlation • Instrumented entities • MPI functions via wrapper library • User functions via binary instrumentation(?) • User functions & regions via manual instrumentation • Visualizations • Different types: timelines, statistics & counter info • Described in next slides
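ICT – manual instrumentation sketch
A minimal sketch of defining and timing a user region with the VampirTrace/ITC API; the header name VT.h and the VT_classdef/VT_funcdef/VT_begin/VT_end calls are assumptions based on the older Pallas VampirTrace API, so treat them as unverified and check the ITC reference manual.

#include <mpi.h>
#include <VT.h>    /* assumed header for the tracing API */

static int class_h, func_h;

void traced_region(void)
{
    VT_begin(func_h);
    /* ... user code to appear as a region in the timeline ... */
    VT_end(func_h);
}

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);                    /* tracing starts with MPI_Init */
    VT_classdef("USER", &class_h);
    VT_funcdef("traced_region", class_h, &func_h);

    traced_region();

    MPI_Finalize();                            /* trace file written here */
    return 0;
}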
ICT visualizations – timelines & summaries • Summary Chart Display • Allows the user to see how much time is spent in MPI calls • Timeline Display • Zoomable, scrollable timeline representation of program execution Fig. 1 Summary Chart Fig. 2 Timeline Display
ICT visualizations – histogram & counters • Summary Timeline • Timeline/histogram representation showing the number of processes in each activity per time bin • Counter Timeline • Value-over-time representation (behavior depends on counter definition in trace) Fig. 3 Summary Timeline Fig. 4 Counter Timeline
ICT visualizations – message stats & process profiles • Message Statistics Display • Message data to/from each process (count, length, rate, duration) • Process Profile Display • Per-process data regarding activities Fig. 5 Message Statistics Fig. 6 Process Profile Display
ICT visualizations – general stats & call tree • Statistics Display • Various statistics regarding activities in histogram, table, or text format • Call Tree Display Fig. 7 Statistics Display Fig. 8 Call Tree Display
ICT visualizations – source & activity chart • Source View • Source code correlation with events in the Timeline • Activity Chart • Per-process histograms of application and MPI activity Fig. 9 Source View Fig. 10 Activity Chart
ICT visualizations – process timeline & activity chart • Process Timeline • Activity timeline and counter timeline for a single process • Process Activity Chart • Same type of information as Global Summary Chart • Process Call Tree • Same type of information as Global Call Tree Fig. 11 Process Timeline Fig. 12 Process Activity Chart & Call Tree
Pablo • Metrics collected • Time inclusive/exclusive of a function • Hardware counters via PAPI • Summary metrics computed from timing info • Min/max/avg/stdev/count • Profiled entities • Functions, function calls, and outer loops • All selected via GUI • Visualizations • Displays derived summary metrics color-coded and inline with source code • Shown on next slide
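Pablo – summary metric sketch
A minimal sketch of the kind of running min/max/avg/stdev/count summary Pablo derives from timing samples; this is illustrative bookkeeping, not Pablo's code.

#include <math.h>

typedef struct {
    long   count;
    double min, max, sum, sumsq;
} summary;

static void summary_add(summary *s, double x)
{
    if (s->count == 0 || x < s->min) s->min = x;
    if (s->count == 0 || x > s->max) s->max = x;
    s->count++;
    s->sum   += x;
    s->sumsq += x * x;
}

static double summary_avg(const summary *s)
{
    return s->count ? s->sum / s->count : 0.0;
}

static double summary_stdev(const summary *s)   /* sample standard deviation */
{
    if (s->count < 2) return 0.0;
    double mean = summary_avg(s);
    return sqrt((s->sumsq - s->count * mean * mean) / (s->count - 1));
}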
MPICL/Paragraph • Metrics collected • MPI functions: start time, end time, message size, message sender/recipient • Manual instrumentation: start time, end time, “work” done (up to user to pass this in) • Profiled entities • MPI function calls via PMPI interface • User functions/regions via manual instrumentation • Visualizations • Many, separated into 4 categories: utilization, communication, task, “other” • Described in following slides
ParaGraph visualizations • Utilization visualizations • Display a rough estimate of processor utilization • Utilization broken down into 3 states: • Idle – when the program is blocked waiting for a communication operation (or has stopped execution) • Overhead – when the program is performing communication but is not blocked (time spent within the MPI library) • Busy – when executing a part of the program other than communication • “Busy” doesn’t necessarily mean useful work is being done, since it assumes (not communication) := busy (see the classification sketch after this slide) • Communication visualizations • Display different aspects of communication • Frequency, volume, overall pattern, etc. • “Distance” computed by setting topology in options menu • Task visualizations • Display information about when processors start & stop tasks • Requires manually instrumented code to identify when processors start/stop tasks • Other visualizations • Miscellaneous things
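ParaGraph – utilization classification sketch
A minimal sketch of the busy/overhead/idle rule described above, applied per processor and per time interval; the function and its two flags are illustrative, not ParaGraph's internals.

typedef enum { BUSY, OVERHEAD, IDLE } util_state;

/* Classify one processor at one instant: anything outside the MPI library
   counts as "busy" (assumed useful work), time inside the library is
   "overhead" unless the call is blocked waiting, which counts as "idle". */
static util_state classify(int in_mpi_library, int blocked_waiting)
{
    if (!in_mpi_library)  return BUSY;
    if (blocked_waiting)  return IDLE;
    return OVERHEAD;
}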
Utilization visualizations – utilization count • Displays # of processors in each state at a given moment in time • Busy shown on bottom, overhead in middle, idle on top
Utilization visualizations – Gantt chart • Displays utilization state of each processor as a function of time
Utilization visualizations – Kiviat diagram • Shows our friend, the Kiviat diagram • Each spoke is a single processor • Dark green shows moving average, light green shows current high watermark • Timing parameters for each can be adjusted • Metric shown can be “busy” or “busy + overhead”
Utilization visualizations – streak • Shows “streak” of state • Similar to winning/losing streaks of baseball teams • Win = overhead or busy • Loss = idle • Not sure how useful this is