
Using Dyninst for Simulation Tracking and Code Coverage on Large Scientific Applications



  1. Using Dyninst for Simulation Tracking and Code Coverage on Large Scientific Applications David R. “Chip” Kent IV High Performance Computing Environments Group Los Alamos National Laboratory March 21, 2006

  2. Outline • Overview of LANL and computing at LANL • Code coverage in scientific applications • Tracking scientific simulations • Dyninst challenges at LANL

  3. Los Alamos National Laboratory • History • Birthplace of the atomic bomb • Current mission • Ensure the safety and reliability of US nuclear weapons • Prevent the spread of weapons of mass destruction • Protect the homeland from attack

  4. Computing at LANL • No nuclear testing by the US since 1992 • Nuclear arsenal is now well over a decade old • Simulations and laboratory experiments are now used in place of nuclear tests • Software correctness is extremely important • Simulation repeatability is extremely important • Simulation results must reproduce laboratory experiments and old nuclear tests • Requires huge computing resources • Application performance is important • Requires research into computing areas ranging from hardware to OS to physics simulations

  5. Simulation software at LANL • Applications are developed over decades • O(1M) source lines for large applications • Large applications contain a mixture of programming languages • Fortran 77/9x • C/C++ • Preprocessed variants of Fortran • Compilation done with multiple compilers • pgf90, pgcc • gcc, g++ • Some teams provide single-physics libraries and other teams merge the libraries into multi-physics simulations • Libraries are typically linked in statically (not always) • “100MB Binary of Death” -- Drew • Binaries are often at least 100MB • MPI is used for parallel simulations • Simulations can run for months

  6. Code coverage in scientific applications

  7. What is Javelina? • An advanced code coverage tool (what code got executed) • Can portably acquire data (any platform Dyninst supports) • x86/Linux • ia64/Linux* • x86_64/Linux* (any day now) • PowerPC/AIX 5.1* • MIPS/IRIX 6.5* • Alpha/Tru64 • x86/Windows 2000/XP* • Operates on the binary with no source or build changes • Acquires data with minimal overhead • Dynamic instrumentation (Dyninst) is used • Coverage instrumentation can be removed once it is executed • Coverage data can be analyzed using arbitrarily complex logic • Can find code executed by end users but not executed by tests • Can be incorporated into python scripts (* = untested)

  8. Using Javelina: Linux, etc. • Build your program: make flag • Run the program: mpirun javelina flag include inputs • Perform logic on code coverage data: python mylogic.py • View the resulting data: javelinagui mydata.xml • No code/build modifications

  9. Binary Analysis, Instrumentation, & Coverage Data Generation • Javelina analyzes and instruments binaries (no source or build modifications) • Binary instrumentation is used on Tru64 systems (Atom) • 2-3x uninstrumented runtime • A new binary is created which contains the coverage instrumentation • Dynamic instrumentation is used on Linux and other supported systems (Dyninst) • 1.06-3x uninstrumented runtime (working to improve this range) • Binary is instrumented when execution starts • Once a block is executed, its instrumentation will be removed • Coverage is measured at the instruction block level. • Instruction blocks are mapped to source lines using debugging information. • Supports C/C++, Fortran 77/90/95, and mixtures of these (anything the compilers support) • Supports parallel applications • Working to reduce the Dyninst overhead so that end-user runs can regularly be analyzed
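The block-to-line mapping described above can be pictured with a short sketch. The code below is illustrative only and assumes two hypothetical inputs: a block_to_lines map extracted from debugging information and a set of executed block IDs reported by the instrumentation; none of these names come from Javelina itself.

# Illustrative sketch (not Javelina code): turn block-level coverage into
# source-line coverage using a block -> source-line map from debug info.
# 'block_to_lines' and 'executed_blocks' are hypothetical inputs.

def line_coverage(block_to_lines, executed_blocks):
    """block_to_lines: {block_id: [(source_file, line_no), ...]}
    executed_blocks: set of block IDs whose (one-shot) instrumentation fired."""
    executed_lines = set()
    all_lines = set()
    for block, lines in block_to_lines.items():
        all_lines.update(lines)
        if block in executed_blocks:
            executed_lines.update(lines)
    return executed_lines, all_lines - executed_lines

# Example: two blocks in source.f, only block 0 was executed.
blocks = {0: [("source.f", 10), ("source.f", 11)], 1: [("source.f", 20)]}
hit, missed = line_coverage(blocks, {0})
print(sorted(hit))     # [('source.f', 10), ('source.f', 11)]
print(sorted(missed))  # [('source.f', 20)]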

  10. Dynamic Instrumentation: Linux, etc. [Diagram: source files (source.{f,c,cpp}) are compiled (f90, cc, c++) into myexe; running javelina myexe loads the binary into RAM, where instrumentation is inserted into and removed from instructions in memory; debug info provides the map between source lines and instrumentation.]

  11. Logical Operations • AND(self, other) • Performs a logical AND operation on the data in two objects and returns the result. A line will be marked as executed if both objects mark the line as having been executed. • NOT(self) • Performs a logical NOT operation on the data in this object and returns the result. A line will be marked as executed if it was not executed and vice versa. • OR(self, other) • Performs a logical OR operation on the data in two objects and returns the result. A line will be marked as executed if either object marks the line as having been executed. • SUBTRACT(self, other) • Extracts the lines of this object which have been executed, marks these lines as executed if they are executed in the other object, and returns the result. This operator is useful in determining which lines executed by a user were tested.
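To make the operator semantics above concrete, here is a minimal sketch that models a coverage object as a set of executed (file, line) pairs over a known universe of lines. The class name, representation, and exact SUBTRACT behavior are assumptions for illustration; this is not Javelina's actual API.

# Hypothetical model of the logical operations on coverage data (illustrative only).

class Coverage:
    def __init__(self, all_lines, executed):
        self.all_lines = set(all_lines)   # every instrumentable (file, line)
        self.executed = set(executed)     # lines marked as executed

    def AND(self, other):
        return Coverage(self.all_lines, self.executed & other.executed)

    def OR(self, other):
        return Coverage(self.all_lines, self.executed | other.executed)

    def NOT(self):
        return Coverage(self.all_lines, self.all_lines - self.executed)

    def SUBTRACT(self, other):
        # Restrict to the lines this object executed; of those, keep as
        # "executed" only the ones the other object also executed.  The
        # remainder is what gets highlighted as "used but not tested".
        return Coverage(self.executed, self.executed & other.executed)

lines = [("source.f", n) for n in range(1, 6)]
apps  = Coverage(lines, {("source.f", 1), ("source.f", 2), ("source.f", 3)})
tests = Coverage(lines, {("source.f", 2)})
untested = apps.SUBTRACT(tests)
print(untested.all_lines - untested.executed)  # lines 1 and 3 of source.f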

  12. Logical Operations: OR [Figure: coverage of Source.f executed by test 1 ORed with coverage executed by test 2 yields the lines executed by either.]

  13. Logical Operations: SUBTRACT [Figure: coverage of Source.f executed by tests combined with coverage executed by applications using SUBTRACT; the highlighted lines are used by applications, but not tested.]

  14. GUI: Large Application [Screenshot: files ranked by worst offenders; highlighted lines are used by applications, but not tested.]

  15. Tracking Scientific Applications

  16. Multiphysics simulations are complex • 10⁵+ lines of constantly changing code • Constantly changing libraries • Complex input files • Simulations and libraries read environment variables • Simulations use variable numbers of processors • HPC system changes • Compilers • Libraries • Operating system • Hardware (upgrades, repairs, new machines) • Etc.

  17. Example Physics Package [Diagram: provenance of a FLAG run, in which each tool logs the signatures of its inputs and returns an identifier for its output. The build script logs the FLAG CVS repository (B1), EOSPAC library (B2), compiler (B3), and UNIX environment (B4) and returns the FLAG executable (C3). A text editor turns an old input .flg file (A1) into the input .flg file (C1); the grid generator logs an older grid (E1) and script (E2) and returns the grid (C4). The FLAG startup subroutine logs C1, the EOS library (C2), C3, and C4 and returns the simulation (D1); the FLAG dump subroutines log D1 and return restart dumps (G1..Gn) and Ensight dumps (F1..Fn). The Ensight code logs a script (F) and the dumps (Fn) and returns Ensight pictures (H1..Hn), which end up in a PowerPoint presentation. Notes: FLAG may have to embed “C1” in the file; “Hn” is in the graphic itself.]

  18. Motivation • It is practically impossible for a human to precisely record everything that went into or came out of a simulation • E.g. shared libraries • Ability to reproduce simulations decreases with time since the simulation was run • Systems change • Humans didn’t precisely specify all aspects of a simulation • Etc. • Currently cannot specify all outputs impacted by a bug • Especially difficult if the bug was discovered long after the simulation • Currently, in many cases, cannot easily determine exactly how two simulations differ • These are critical V&V issues

  19. Alexandria In A Sentence Alexandria tracks the history and relationships of files and processes to each other

  20. Example Information Flow Graph [Diagram: an information flow graph of files F0-F6 connected by process executions: genmesh (mesh generation), myphysics (simulation), and ensight (visualization). Nodes are files and application executions (e.g. build, simulation, etc.).]

  21. File Signatures As Fundamental Identification Why use a signature? • It is a short-hand unique identifier for the file content. • It ensures the integrity of the file content through time. • The whole file does not have to be stored How a signature is generated • Many algorithms exist - the example uses the 160-bit SHA-1 algorithm. • Takes as input a file of arbitrary length and produces as output a 160-bit "fingerprint" or "message digest" of the input. Example: Wrapper around the mv command - generates signatures and tracks actions drkent% ./logging_mv file1 file2 IN: /Users/drkent/code/test/file1 41d7b77c8fe2634cfab042f54f5b6ae6c24d3a17 IN: /sw/bin/mv 389df9ea4ba8c266659165dd434d7ce33e97a936 ACTION: mv /Users/drkent/code/test/file1 /Users/drkent/code/test/file2 OUT: /Users/drkent/code/test/file2 41d7b77c8fe2634cfab042f54f5b6ae6c24d3a17 • Our signatures are really cryptographic hash functions • Checksums are simple examples of verifying file content
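The signature scheme is easy to reproduce. The sketch below is a hypothetical stand-in for the logging_mv wrapper shown above (not the actual tool): it computes a 160-bit SHA-1 signature for each file, logs the inputs and the action, runs mv, and logs the output.

# Hypothetical re-creation of the logging_mv idea (illustrative only).
import hashlib, shutil, subprocess, sys

def signature(path, chunk=1 << 20):
    """160-bit SHA-1 "fingerprint" of a file of arbitrary length."""
    h = hashlib.sha1()
    with open(path, "rb") as f:
        while True:
            data = f.read(chunk)
            if not data:
                break
            h.update(data)
    return h.hexdigest()

def logging_mv(src, dst):
    mv = shutil.which("mv")            # assumes a Unix-like system with mv
    print("IN: ", src, signature(src))
    print("IN: ", mv, signature(mv))
    print("ACTION: mv", src, dst)
    subprocess.run([mv, src, dst], check=True)
    print("OUT:", dst, signature(dst))

if __name__ == "__main__":
    logging_mv(sys.argv[1], sys.argv[2])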

  22. User Interface: HPC System Side • Data will be acquired by intercepting system calls (e.g. “open”) • int x = open(“/etc/hosts”, O_RDONLY); • File: /etc/hosts • I/O: Input (O_RDONLY) • int x = open(“/tmp/scratch.file”, O_WRONLY, 00640); • File: /tmp/scratch.file • I/O: Output (O_WRONLY) • A few possible methods for intercepting system calls • Currently using Dyninst • Does not involve modifying user code • Use on standard systems: • alexandria myexe inputs • On lightweight-kernel systems may involve relinking
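The bookkeeping implied by the open() examples above can be sketched in a few lines; the helper below is a hypothetical illustration of how an intercepted call maps to an input or output record (the real interception is done with Dyninst inside the running binary).

# Hypothetical classification of intercepted open() calls (illustrative only).
import os

def classify_open(path, flags):
    """Map an open() call to a file record and an I/O direction."""
    if flags & (os.O_WRONLY | os.O_RDWR):
        return {"file": path, "io": "Output"}
    return {"file": path, "io": "Input"}   # O_RDONLY is 0, so read-only lands here

print(classify_open("/etc/hosts", os.O_RDONLY))
# {'file': '/etc/hosts', 'io': 'Input'}
print(classify_open("/tmp/scratch.file", os.O_WRONLY))
# {'file': '/tmp/scratch.file', 'io': 'Output'}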

  23. Why System Call Interception? Minimal Effort • Build: untracked FC=f95, CC=cc, all: myexe … becomes tracked FC=alexandria f95, CC=alexandria cc, all: myexe … • Simulation run: untracked mpirun myexe input becomes tracked mpirun alexandria myexe input

  24. Alexandria Object Database • Storing everything necessary to exactly describe our simulations will generate a lot of data over time (think terabytes or more) • The data is highly interconnected • M inputs and N outputs for every process • each input/output can be an input/output for other processes • Data querying must be fast enough for a user to perform interactive analysis • Database must: • Be a robust commercial product • Data persists for decades • Need protection against corruption, etc. • Scale to very large datasets • Perform well with highly interconnected data • Require minimal administration costs • Minimize development time and effort • To meet these requirements, we are using the Objectivity/DB Object Database.

  25. Flow of Information: What Outputs Are Impacted By buggyfile.f? [Diagram: buggyfile.f, otherfile1.f, and otherfile2.f feed the build process (f95 *.f -o myexe), which produces myexe; myexe runs with myinput1 and with myinput2, producing myoutput1 and myoutput2. Inputs are to the left and outputs are to the right of a process (information flows left to right).]

  26. Flow of Information: How Did I Create bigexplosion.gif? [Diagram: myexe runs with mesh, input, libc.so, and liblapack.so, producing output1, output2, and output3; gnuplot with makeplot.gnp turns the output into bigexplosion.gif. Inputs are to the left and outputs are to the right of a process (information flows left to right).]

  27. User Interface: Analysis & Query • User interface to perform queries like: • Find the executable and all inputs used to generate a plot • Compare two simulations and identify differences • Locate a file with a given signature (e.g. in HPSS at location) • Determine the impact of problems in source files or libraries • Determine the genealogy of a given file • Find all simulations where a given input was used • Find all jobs run by a user during a time window • Etc.
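Most of the queries above reduce to reachability over the information flow graph. The sketch below is a hypothetical illustration using a plain dictionary as the graph (Alexandria stores the real graph in Objectivity/DB); the file names follow the buggyfile.f example from slide 25, and the run1/run2 process nodes are made-up placeholders.

# Hypothetical reachability query over an information flow graph (illustrative only).
from collections import deque

# Edges point from a node to the nodes it feeds (information flows left to right).
graph = {
    "buggyfile.f":       ["f95 *.f -o myexe"],
    "otherfile1.f":      ["f95 *.f -o myexe"],
    "f95 *.f -o myexe":  ["myexe"],
    "myexe":             ["run1", "run2"],
    "myinput1":          ["run1"],
    "myinput2":          ["run2"],
    "run1":              ["myoutput1"],
    "run2":              ["myoutput2"],
}

def impacted(graph, start):
    """Everything reachable downstream of `start`, i.e. outputs it impacts."""
    seen, queue = set(), deque([start])
    while queue:
        for nxt in graph.get(queue.popleft(), []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen

print(sorted(impacted(graph, "buggyfile.f")))
# ['f95 *.f -o myexe', 'myexe', 'myoutput1', 'myoutput2', 'run1', 'run2']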

  28. Alexandria CLI Example: Job Setup [Screenshot: set up a new job, print the unique job id, print the job’s current state, and run the calculation under the Alexandria interceptor.]

  29. Alexandria CLI Example: Printing A Job [Screenshot: output showing the unique job ID, process timing info, and input/output file info.]

  30. Alexandria CGI Example: Where was this file used/created?

  31. Alexandria CGI Example: Where was this file used/created?

  32. Alexandria CGI Example: Where was this file used/created?

  33. Alexandria + Code Usage/Coverage • Considering tracking code usage in Alexandria • Based on LANL code usage/coverage work (Javelina) • Can be done with little overhead using Dyninst • Alexandria would: • Record which functions executed during a simulation • Record which function a bug is in (in a particular source file) • Allow you to identify which simulations using a buggy source file executed the buggy function! • Allow you to identify which functions have not been executed over the last N years!
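Combining provenance with function-level coverage would make the "which simulations executed the buggy function?" query a simple filter. The sketch below is a hypothetical illustration of that join, with made-up simulation, file, and function names; it is not Alexandria code.

# Hypothetical join of provenance and coverage records (illustrative only).
# All names below (runs, sources, functions) are made up for the example.

simulations = {
    "run_a": {"sources": {"eos.f90", "hydro.f90"},
              "executed": {"eos_lookup", "advance_hydro"}},
    "run_b": {"sources": {"eos.f90", "hydro.f90"},
              "executed": {"advance_hydro"}},
}

def affected_simulations(simulations, buggy_source, buggy_function):
    """Simulations built from the buggy source that also ran the buggy function."""
    return [name for name, rec in simulations.items()
            if buggy_source in rec["sources"] and buggy_function in rec["executed"]]

print(affected_simulations(simulations, "eos.f90", "eos_lookup"))
# ['run_a']  -- run_b linked the buggy file but never executed the buggy function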

  34. Dyninst Challenges at LANL: Part 1 • We can’t give out any of our important binaries which break Dyninst • Dyninst is very difficult to debug • Dyninst startup overhead • Improved by parsing only a subset of the binary • Can take >30min on a 100MB binary • Some binaries take longer to parse than others (PGI takes ~10x longer than GCC) • Still slow • Dyninst runtime overhead • Traps are used too often on x86 • Getting better • Performance has been improved by ~1000x for Javelina • “read” and “write” seem to run slow when instrumented at exit • MPI + Dyninst can lead to problems • “mpirun mydyninstprog myexe arg1 arg2 …” does not work with all MPI implementations • Seems to be a conflict with MPI startup and Dyninst (e.g. problems with signals) • Open-MPI seems to work fine (Yea!)

  35. Dyninst Challenges at LANL: Part 2 • Dyninst is still brittle • A 100MB binary has stuff in it that Dyninst has never been tested against • Specific instruction sequences • Debug information • Robustness depends on the compiler/language • GCC-compiled applications have fewer problems than PGI-compiled applications • C/C++ applications have fewer problems than Fortran 9x applications • Robustness depends on the architecture/OS • Often have to debug Dyninst on each platform you intend your application to run on • Supercomputers are “flavor of the week” • Systems have a lifetime of 3-5 years • Poorly supported platforms (Alpha/Tru64) are bought for performance (price) reasons • Our Linux clusters are significantly modified from standard distributions • Makes Dyninst support difficult • LANL, LLNL, and SNL are working to improve the situation

  36. Final Note LANL is involved in the Open|SpeedShop effort, and Dyninst will soon be used to obtain performance data at LANL.

  37. Abstract LANL’s use of Dyninst in Alexandria and Javelina is discussed, including an overview of these projects and the problems LANL has encountered with Dyninst.
