Using Dyninst for Simulation Tracking and Code Coverage on Large Scientific Applications David R. “Chip” Kent IV High Performance Computing Environments Group Los Alamos National Laboratory March 21, 2006
Outline • Overview of LANL and computing at LANL • Code coverage in scientific applications • Tracking scientific simulations • Dyninst challenges at LANL
Los Alamos National Laboratory • History • Birthplace of the atomic bomb • Current mission • Ensure the safety and reliability of US nuclear weapons • Prevent the spread of weapons of mass destruction • Protect the homeland from attack
Computing at LANL • No nuclear testing by the US since 1992 • Nuclear arsenal is now well over a decade old • Simulations and laboratory experiments are now used in place of nuclear tests • Software correctness is extremely important • Simulation repeatability is extremely important • Simulation results must reproduce laboratory experiments and old nuclear tests • Requires huge computing resources • Application performance is important • Requires research into computing areas ranging from hardware to OS to physics simulations
Simulation software at LANL • Applications are developed over decades • O(1M) source lines for large applications • Large applications contain a mixture of programming languages • Fortran 77/9x • C/C++ • Preprocessed variants of Fortran • Compilation done with multiple compilers • pgf90, pgcc • gcc, g++ • Some teams provide single-physics libraries and other teams merge the libraries into multi-physics simulations • Libraries are typically linked in statically (not always) • “100MB Binary of Death” -- Drew • Binaries are often at least 100MB • MPI is used for parallel simulations • Simulations can run for months
What is Javelina? • An advanced code coverage tool (what code got executed) • Can portably acquire data (any platform Dyninst supports) • x86/Linux • ia64/Linux* • x86_64/Linux* (any day now) • PowerPC/AIX 5.1* • MIPS/IRIX 6.5* • Alpha/Tru64 • x86/Windows 2000/XP* • Operates on the binary with no source or build changes • Acquires data with minimal overhead • Dynamic instrumentation (Dyninst) is used • Coverage instrumentation can be removed once it is executed • Coverage data can be analyzed using arbitrarily complex logic • Can find code executed by end users but not executed by tests • Can be incorporated into python scripts *untested
Using Javelina: Linux, etc. • Build your program • make flag • Run the program • mpirun javelina flag include inputs • Perform logic on code coverage data • python mylogic.py • View the resulting data • javelinagui mydata.xml No Code/Build Modifications
Binary Analysis, Instrumentation, & Coverage Data Generation • Javelina analyzes and instruments binaries (no source or build modifications) • Binary instrumentation is used on Tru64 systems (Atom) • 2-3x uninstrumented runtime • A new binary is created which contains the coverage instrumentation • Dynamic instrumentation is used on Linux and other supported systems (Dyninst) • 1.06-3x uninstrumented runtime (working to improve this range) • Binary is instrumented when execution starts • Once a block is executed, its instrumentation will be removed • Coverage is measured at the instruction block level • Instruction blocks are mapped to source lines using debugging information • Supports C/C++, Fortran 77/90/95, and mixtures of these (anything the compilers support) • Supports parallel applications • Working to reduce the Dyninst overhead so that end-user runs can regularly be analyzed
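The "remove instrumentation once a block is executed" trick above is what keeps Javelina's overhead low. A minimal sketch of the same idea at the Python level, using `sys.settrace` in place of Dyninst (the function name and approach here are illustrative, not part of Javelina):

```python
import sys

def collect_first_hit_coverage(func, *args):
    """Run `func`, recording each (file, line) pair that executes.

    Mimics the spirit of Javelina's approach: once a line has been seen,
    re-executions add nothing new (here the set simply dedupes them;
    Dyninst actually patches the instrumentation out of the binary).
    """
    hits = set()

    def tracer(frame, event, arg):
        if event == "line":
            hits.add((frame.f_code.co_filename, frame.f_lineno))
        return tracer

    sys.settrace(tracer)
    try:
        func(*args)
    finally:
        sys.settrace(None)   # always remove the tracer, even on error
    return hits
```

Real block-level coverage on a 100MB stripped-down Fortran binary is, of course, exactly the harder problem Dyninst solves.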
Dynamic Instrumentation: Linux, etc. [Diagram: source files source.{f,c,cpp} are compiled (f90, cc, c++) into myexe; “javelina myexe” loads the binary into RAM, where instrumentation is inserted into and removed from instructions in memory; debug info provides the map between source lines and instrumentation.]
Logical Operations • AND(self, other) • Performs a logical AND operation on the data in two objects and returns the result. A line will be marked as executed if both objects mark the line as having been executed. • NOT(self) • Performs a logical NOT operation on the data in this object and returns the result. A line will be marked as executed if it was not executed and vice versa. • OR(self, other) • Performs a logical OR operation on the data in two objects and returns the result. A line will be marked as executed if either object marks the line as having been executed. • SUBTRACT(self, other) • Extracts the lines of this object which have been executed, marks these lines as executed if they are executed in the other object, and returns the result. This operator is useful in determining which lines executed by a user were tested.
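The four operations above are plain set logic over executed lines. A minimal sketch, assuming a hypothetical `CoverageData` class (this is not the actual Javelina API; SUBTRACT is interpreted here as set difference, one plausible reading of the slide):

```python
class CoverageData:
    """Illustrative coverage object: all coverable lines plus those that ran."""

    def __init__(self, all_lines, executed):
        self.all_lines = frozenset(all_lines)
        self.executed = frozenset(executed)

    def AND(self, other):
        # executed in both objects
        return CoverageData(self.all_lines, self.executed & other.executed)

    def OR(self, other):
        # executed in either object
        return CoverageData(self.all_lines, self.executed | other.executed)

    def NOT(self):
        # executed becomes not-executed and vice versa
        return CoverageData(self.all_lines, self.all_lines - self.executed)

    def SUBTRACT(self, other):
        # executed here (e.g. by users) but not in `other` (e.g. by tests)
        return CoverageData(self.all_lines, self.executed - other.executed)

user = CoverageData(range(1, 11), {1, 2, 3, 7, 8})
test = CoverageData(range(1, 11), {1, 2, 4})
untested = user.SUBTRACT(test)
print(sorted(untested.executed))   # → [3, 7, 8]
```

Because the operations return new objects, arbitrarily complex queries compose naturally, e.g. `user.SUBTRACT(test1.OR(test2))` for "executed by users but by no test".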
Logical Operations: OR [Diagram: three copies of Source.f — lines executed by test 1, lines executed by test 2, and their OR: lines executed by either test.]
Logical Operations: SUBTRACT [Diagram: Source.f as executed by tests SUBTRACTed from Source.f as executed by applications; the highlighted lines in the result were used by applications but not tested.]
GUI: Large Application [Screenshot: files ranked by worst offenders — lines used by applications, but not tested.]
Multiphysics simulations are complex • 10⁵+ lines of constantly changing code • Constantly changing libraries • Complex input files • Simulations and libraries read environment variables • Simulations use variable numbers of processors • HPC system changes • Compilers • Libraries • Operating system • Hardware (upgrades, repairs, new machines) • Etc.
Example Physics Package [Diagram: a FLAG workflow in which each step logs its inputs and outputs and returns a signature for what it produced — the build script combines the FLAG CVS repository (B1), the EOSPAC library (B2), the compiler (B3), and the UNIX environment (B4) into the FLAG executable (C3); a text editor produces the input .flg file (C1), and a grid generator (from an older grid E1 and script E2) produces the grid (C4); the FLAG simulation (D1) reads these plus the EOS library (C2) and writes Ensight dumps (F1…Fn) and restart dumps (G1…Gn); Ensight renders the dumps into pictures (H1…Hn), which end up in a PowerPoint presentation. Note: FLAG may have to embed “C1” in the file; “Hn” is in the graphic itself.]
Motivation • It is practically impossible for a human to precisely record everything that went into or came out of a simulation • E.g. shared libraries • Ability to reproduce simulations decreases with time since the simulation was run • Systems change • Humans didn’t precisely specify all aspects of a simulation • Etc. • Currently cannot specify all outputs impacted by a bug • Especially difficult if the bug was discovered long after the simulation • Currently, in many cases, cannot easily determine exactly how two simulations differ • These are critical V&V issues
Alexandria In A Sentence Alexandria tracks the history and relationships of files and processes to each other
Example Information Flow Graph [Diagram: files (F0–F6) alternate with processes — genmesh (mesh generation), myphysics (application execution, e.g. build, simulation, etc.), and ensight (simulation visualization) — each process consuming the files to its left and producing the files to its right.]
File Signatures As Fundamental Identification Why use a signature • It is a short-hand unique identifier for the file content. • It ensures the integrity of the file content through time. • The whole file does not have to be stored How the signature is generated • Many algorithms - example uses 160-bit SHA-1 algorithm. • Takes as input a file of arbitrary length and produces as output a 160-bit "fingerprint" or "message digest" of the input. Example: Wrapper around mv command - generates signatures and tracks actions drkent% ./logging_mv file1 file2 IN: /Users/drkent/code/test/file1 41d7b77c8fe2634cfab042f54f5b6ae6c24d3a17 IN: /sw/bin/mv 389df9ea4ba8c266659165dd434d7ce33e97a936 ACTION: mv /Users/drkent/code/test/file1 /Users/drkent/code/test/file2 OUT: /Users/drkent/code/test/file2 41d7b77c8fe2634cfab042f54f5b6ae6c24d3a17 • Our signatures are really cryptographic hash functions • Checksums are simple examples of verifying file content
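Computing the 160-bit SHA-1 fingerprint used above is a one-liner with any standard crypto library; a minimal sketch in Python (the function name is illustrative, not Alexandria's API):

```python
import hashlib

def file_signature(path, chunk_size=1 << 20):
    """Return the 160-bit SHA-1 fingerprint of a file's content, as hex.

    Streams the file in 1MB chunks, so arbitrarily large files (the talk's
    "100MB Binary of Death") can be signed without loading them into RAM.
    """
    h = hashlib.sha1()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()
```

Because the signature depends only on content, two files with the same signature (like file1 and file2 in the mv example) are the same bytes, wherever they live.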
User Interface: HPC System Side • Data will be acquired by intercepting system calls (e.g. “open”) • int x = open(“/etc/hosts”, O_RDONLY); • File: /etc/hosts • I/O: Input (O_RDONLY) • int x = open(“/tmp/scratch.file”, O_WRONLY, 00640); • File: /tmp/scratch.file • I/O: Output (O_WRONLY) • A few possible methods for intercepting system calls • Currently using Dyninst • Does not involve modifying user code • Use on standard systems: • alexandria myexe inputs • On lightweight-kernel systems may involve relinking
Why System Call Interception?: Minimal Effort [Diagram: tracking requires only trivial changes — Build: untracked “FC=f95 / CC=cc” becomes tracked “FC=alexandria f95 / CC=alexandria cc” in the makefile; Simulation run: untracked “mpirun myexe input” becomes tracked “mpirun alexandria myexe input”.]
Alexandria Object Database • Storing everything necessary to exactly describe our simulations will generate a lot of data over time (think terabytes or more) • The data is highly interconnected • M inputs and N outputs for every process • each input/output can be an input/output for other processes • Data querying must be fast enough for a user to perform interactive analysis • Database must: • Be a robust commercial product • Data persists for decades • Need protection against corruption, etc. • Scale to very large datasets • Perform well with highly interconnected data • Require minimal administration costs • Minimize development time and effort • To meet these requirements, we are using the Objectivity/DB Object Database.
Flow of Information — What Outputs Are Impacted By buggyfile.f? [Diagram: buggyfile.f, otherfile1.f, and otherfile2.f feed the build process “f95 *.f -o myexe”, producing myexe; “myexe myinput1” produces myoutput1 and “myexe myinput2” produces myoutput2. Inputs are to the left and outputs to the right of a process (information flows left to right).]
Flow of Information — How Did I Create bigexplosion.gif? [Diagram: tracing backward from bigexplosion.gif — runs of myexe with input and mesh produce output1 and output2, and “gnuplot makeplot.gnp” (with libc.so and liblapack.so) produces output3 and bigexplosion.gif. Inputs are to the left and outputs to the right of a process (information flows left to right).]
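Both example queries — "what outputs are impacted by buggyfile.f?" and "how did I create this file?" — are graph reachability over the information-flow graph, walked downstream or upstream. A minimal sketch over a toy version of the buggyfile.f graph (node names follow the slide; the two run nodes `run1`/`run2` are hypothetical labels for the two myexe executions):

```python
from collections import defaultdict, deque

# Edges run from inputs to the processes that read them, and from
# processes to the files they produce (information flows left to right).
edges = [
    ("buggyfile.f", "f95"), ("otherfile1.f", "f95"), ("otherfile2.f", "f95"),
    ("f95", "myexe"),
    ("myexe", "run1"), ("myinput1", "run1"), ("run1", "myoutput1"),
    ("myexe", "run2"), ("myinput2", "run2"), ("run2", "myoutput2"),
]

succ = defaultdict(set)   # downstream neighbors
pred = defaultdict(set)   # upstream neighbors
for a, b in edges:
    succ[a].add(b)
    pred[b].add(a)

def reach(start, neighbors):
    """Breadth-first search: every node reachable from `start`."""
    seen, queue = set(), deque([start])
    while queue:
        node = queue.popleft()
        for nxt in neighbors[node]:
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen

# "What outputs are impacted by buggyfile.f?"  -> walk downstream
impacted = {n for n in reach("buggyfile.f", succ) if n.startswith("myoutput")}
# "How did I create myoutput2?"                -> walk upstream
lineage = reach("myoutput2", pred)
```

At Alexandria's scale the same traversals run against the object database rather than an in-memory dictionary, which is why interconnection-heavy query performance drives the database choice.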
User Interface: Analysis & Query • User interface to perform queries like: • Find the executable and all inputs used to generate a plot • Compare two simulations and identify differences • Locate a file with a given signature (e.g. in HPSS at location) • Determine the impact of problems in source files or libraries • Determine the genealogy of a given file • Find all simulations where a given input was used • Find all jobs run by a user during a time window • Etc.
Alexandria CLI Example: Job Setup [Screenshot: set up a new job, print the unique job id, print the job’s current state, then run the calculation under the Alexandria interceptor.]
Alexandria CLI Example: Printing A Job [Screenshot: job listing showing the unique job ID, process timing info, and input/output file info.]
Alexandria + Code Usage/Coverage • Considering tracking code usage in Alexandria • Based on LANL code usage/coverage work (Javelina) • Can be done with little overhead using Dyninst • Alexandria would: • Record which functions executed during a simulation • Record which function a bug is in (in a particular source file) • Allow you to identify which simulations using a buggy source file executed the buggy function! • Allow you to identify which functions have not been executed over the last N years!
Dyninst Challenges at LANL: Part 1 • We can’t give out any of our important binaries which break Dyninst • Dyninst is very difficult to debug • Dyninst startup overhead • Improved by parsing only a subset of the binary • Can take >30min on a 100MB binary • Some binaries take longer to parse than others (PGI takes ~10x longer than GCC) • Still slow • Dyninst runtime overhead • Traps are used too often on x86 • Getting better • Performance has been improved by ~1000x for Javelina • “read” and “write” seem to run slow when instrumented at exit • MPI + Dyninst can lead to problems • “mpirun mydyninstprog myexe arg1 arg2 …” does not work with all MPI implementations • Seems to be a conflict with MPI startup and Dyninst (e.g. problems with signals) • Open-MPI seems to work fine (Yea!)
Dyninst Challenges at LANL: Part 2 • Dyninst is still brittle • A 100MB binary has stuff in it that Dyninst has never been tested against • Specific instruction sequences • Debug information • Robustness depends on the compiler/language • GCC-compiled applications have fewer problems than PGI-compiled applications • C/C++ applications have fewer problems than Fortran 9x applications • Robustness depends on the architecture/OS • Often have to debug Dyninst on each platform you intend your application to run on • Supercomputers are “flavor of the week” • Systems have a lifetime of 3-5 years • Poorly supported platforms (Alpha/Tru64) are bought for performance (price) reasons • Our Linux clusters are significantly modified from standard distributions • Makes Dyninst support difficult • LANL, LLNL, and SNL are working to improve the situation
Final Note LANL is involved in the Open|SpeedShop effort, and Dyninst will soon be used to obtain performance data at LANL.
Abstract LANL’s use of Dyninst in Alexandria and Javelina is discussed, including an overview of the two projects and the problems LANL has encountered with Dyninst.