Debugging Applications Mahidhar Tatineni Introduction to and Optimization for SDSC Systems Workshop October 17, 2006
Debugging Tools
• Some of the debugging tools available on the SDSC machines:
  • dbx (on DataStar)
  • pdbx (on DataStar)
  • Totalview
  • gdb (on TG IA-64 cluster)
• Totalview is available on DataStar and the TG IA-64 cluster.
• Totalview supports the following languages:
  • C
  • C++
  • Fortran77
  • Fortran90
  • Assembler
Tips to minimize problems (avoid debugging)!
• Use IMPLICIT NONE (in Fortran).
• Comment your code, and use spacing and indentation.
• Take care in choosing variable names. Try not to use characters that can be confused with others (l and 1, for example).
• Use make if possible. This lets you define compilers, compiler flags, and libraries consistently.
• Use version control software (RCS/CVS/SVN), particularly if there are multiple developers.
  • You can trace when a new bug was introduced.
  • It makes documenting changes easy.
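As a small illustration of the first tip, the sketch below shows the kind of typo that IMPLICIT NONE turns into a compile-time error. The program and its variable names are hypothetical and are not part of the workshop files.

```fortran
C     Hypothetical example: IMPLICIT NONE catches misspelled names
      program tips
      implicit none
      integer ncells, i
      real total

      ncells = 10
      total = 0.0
      do i = 1, ncells
         total = total + real(i)
      enddo
C     Without IMPLICIT NONE, a typo such as "totl = total / ncells"
C     would silently create a new REAL variable named totl.
C     With IMPLICIT NONE the compiler rejects the undeclared name.
      total = total / ncells
      write(*,*) 'mean = ', total
      end
```

Replacing the last assignment with the misspelled version shown in the comment should make the compile fail instead of producing a wrong answer at run time.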
Common MPI errors
• Fortran vs. C calls: Fortran calls to MPI routines usually take one more argument (ierr) than the corresponding C calls. Users sometimes forget ierr in the call, which causes errors.
• Incorrectly dimensioning the status variable used in some MPI routines: this variable should be declared as an integer array of length MPI_STATUS_SIZE, i.e. status(MPI_STATUS_SIZE).
• Non-blocking I/O: Some users mistakenly use MPI_WAIT with non-blocking I/O. The proper routine to call is MPIO_WAIT.
• Exceeding the limit on MPI tags: If you are using automatic tag generation in your MPI program, make sure that the tag is less than the tag limit set by the particular MPI implementation (2**32-1 on DataStar).
• Stopping all processes: If a `stop' statement is executed within a program segment restricted to one processor (for example within an `if(myid.eq.0) then' block), only that particular process will stop. All other processes may simply `hang' at the next communication statement involving the stopped process.
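The sketch below puts three of these points into one hypothetical program (not part of the workshop files): status is declared with length MPI_STATUS_SIZE, every Fortran call ends with ierr, and mpi_abort is used instead of a bare stop so that all tasks terminate. The tag value 7 and the buffer size are arbitrary choices for illustration.

```fortran
C     Hypothetical example illustrating three of the points above
      program pitfalls
      implicit none
      include "mpif.h"
C     status is an integer array of length MPI_STATUS_SIZE
      integer status(MPI_STATUS_SIZE)
      integer my_id, nproc, ierr, i, n
      parameter (n = 10)
      real buf(n)

      call mpi_init(ierr)
      call mpi_comm_rank(mpi_comm_world, my_id, ierr)
      call mpi_comm_size(mpi_comm_world, nproc, ierr)

      if (nproc .lt. 2) then
C        A bare STOP on one task can leave the others hanging;
C        mpi_abort terminates every task in the communicator.
         if (my_id .eq. 0) write(*,*) 'need at least 2 tasks'
         call mpi_abort(mpi_comm_world, 1, ierr)
      endif

      if (my_id .eq. 0) then
         do i = 1, n
            buf(i) = real(i)
         enddo
C        Fortran bindings take ierr as the final argument
         call mpi_send(buf, n, mpi_real, 1, 7,
     1                 mpi_comm_world, ierr)
      else if (my_id .eq. 1) then
         call mpi_recv(buf, n, mpi_real, 0, 7,
     1                 mpi_comm_world, status, ierr)
         write(*,*) 'task 1 received buf(n) =', buf(n)
      endif

      call mpi_finalize(ierr)
      end
```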
Debug Procedures
• Turn off all optimizations and turn on debugging flags (-O0 -g).
• Check memory references and array bounds: -qcheck (or -C).
• Check subroutine call sequences and mismatched common blocks: -qextchk.
• Check for floating point exceptions: -qflttrap.
• Check for uninitialized variables: -qinitauto.
• For more information on these flags, see the IBM compiler manuals online.
Simplify your problem!
• How small can you make the problem while the bug still occurs?
  • Minimize the number of MPI tasks.
  • Minimize the number of input parameters affecting the issue.
  • If you suspect an MPI problem, reduce the computation part as much as possible (this helps focus on the communication issues).
• It is OK to use PRINT statements to quickly localize the problem! If you do so in an MPI code, make sure to label the I/O so that you know which process has the problem:
  • bash: export MP_LABELIO=yes
  • csh: setenv MP_LABELIO yes
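Besides MP_LABELIO, the task id can be written into each debug message so that the output stays attributable on any system. The following is a minimal sketch; the program name, the variable x, and the message format are hypothetical.

```fortran
C     Hypothetical example: tag each debug line with the task id
      program labelio
      implicit none
      include "mpif.h"
      integer my_id, nproc, ierr
      real x

      call mpi_init(ierr)
      call mpi_comm_rank(mpi_comm_world, my_id, ierr)
      call mpi_comm_size(mpi_comm_world, nproc, ierr)

      x = 0.5 * my_id
C     The rank in the message identifies the task even when
C     MP_LABELIO is not set
      write(*,100) my_id, nproc, x
  100 format('DEBUG task ', i4, ' of ', i4, ': x = ', e14.6)

      call mpi_finalize(ierr)
      end
```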
Sample problem to illustrate debugging process
• The sample code (sample.f) is located in /gpfs/projects/workshop/debug
• Copy the file into your directory. You can use the submit script from day 1 of the workshop (/gpfs/projects/workshop/running_jobs/LLscript_mpi_p655).
• Compiling the code is simple:
    mpxlf sample.f
Sample problem to illustrate debugging process
• The sample code initializes variables on all processors, does a simple computation, and then sends data from proc 0 to the other processors.
• The code has two bugs:
  • an array bounds violation
  • an incorrect MPI call
• We will use bounds-checking compiler flags and Totalview to debug this code.
• Interesting note: With a simple compile on DataStar, the bounds problem does not cause the code to crash. Just because a code runs without error does not mean there are no bugs!
sample.f program

C
C PROGRAM sample: Example to illustrate debugging process
C
      Program sample
      implicit none
      include "mpif.h"
      integer status(MPI_STATUS_SIZE)
      integer my_id,nproc,ierr
      integer nmax, i, l
      real u(100), r(100)
      real delx
      delx = 0.1e0
      nmax = 100
C******************************************************************
C Initialize MPI Library Routines
C******************************************************************
      call mpi_init(ierr)
      call mpi_comm_rank(mpi_comm_world,my_id,ierr)
      call mpi_comm_size(mpi_comm_world,nproc,ierr)
C******************************************************************
C Initialize variables
C******************************************************************
      do i = 1 , nmax
         u(i) = delx * (nmax*my_id+i)
      enddo
C******************************************************************
C Computations < We make one array bounds mistake here >
C******************************************************************
      if (my_id.eq.0) then
         u(1) = 5e0
         r(1) = 1e0
         r(nmax)= 1e0
         do i = 2, nmax-1
            r(i) = u(i-2)-2*u(i)+u(i+1)
         enddo
      endif
C******************************************************************
C Processor 0 executes this code
C******************************************************************
      if (my_id.eq.0) then
         do i = 1, nproc - 1
C******* Send r(nmax) to all other processors *********************
C******* We made a mistake here. i in the mpi_send call below *****
C******* should be nmax **********************
            write(*,*)"Proc ",my_id," sending", r(5), "to proc", i
            call mpi_send(r,i,mpi_real,i,i,
     1           mpi_comm_world,ierr)
         enddo
      endif
C******************************************************************
C Other processors (not zero) execute this code
C******************************************************************
      if(my_id.ne.0) then
         do i = 1, nmax
            r(i) = 0.0
         enddo
C****** Receive r from processor 0
         call mpi_recv(r,nmax,mpi_real,0,my_id,
     1        mpi_comm_world,status,ierr)
         write(*,*)"Proc ",my_id, "received value", r(5)
      endif
C******************************************************************
      call mpi_finalize(ierr)
      end
Sample problem to illustrate debugging process
• We first compile the code with no debug/check flags:
    mpxlf sample.f
• One of the array values is printed on both the send and receive processors.
• Sample output:
    0: Proc 0 sending -0.9999996424E-01 to proc 1
    1: Proc 1 received value 0.0000000000E+00
• Something is wrong!
Sample problem to illustrate debugging process
• Now compile the code with the -qcheck -g flags:
    mpxlf -qcheck -g sample.f
• The code now dumps core and we see the following error:
    ERROR: 0031-250 task 0: Trace/BPT trap
• We have an array bounds problem on task 0. We can use dbx to locate it:
    ds100 % dbx a.out coredir.0/core
    Type 'help' for help.
    warning: The core file is truncated. You may need to increase the ulimit
    for file and coredump, or free some space on the filesystem.
    [using memory image in coredir.0/core]
    reading symbolic information ...

    Trace/BPT trap in sample at line 38 in file "sample.f"
       38   r(i) = u(i-2)-2*u(i)+u(i+1)
    (dbx)
Sample problem to illustrate debugging process
• We make the array bounds correction and recompile with the same check flags:
    mpxlf -qcheck -g sample-array-correct.f
• Sample output:
    0: Proc 0 sending 0.2980232239E-07 to proc 1
    1: Proc 1 received value 0.0000000000E+00
• We fixed the array bounds problem, but something is still wrong: the sent and received data do not match. Use Totalview to debug this further.
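The slides do not show the corrected line. Assuming the intended computation is the usual three-point second difference, the fix is to read u(i-1) instead of u(i-2), which keeps every index within 1..nmax. The sketch below is a hypothetical single-task version of the computation on task 0 (no MPI); with this change r(5) comes out close to zero, consistent with the sample output above.

```fortran
C     Sketch of the presumed array bounds fix (single task, no MPI)
      program stencil
      implicit none
      integer nmax, i
      parameter (nmax = 100)
      real u(nmax), r(nmax)
      real delx

      delx = 0.1e0
      do i = 1, nmax
         u(i) = delx * i
      enddo
      u(1) = 5e0
      r(1) = 1e0
      r(nmax) = 1e0
C     Original: r(i) = u(i-2)-2*u(i)+u(i+1) reads u(0) when i = 2.
C     Assumed fix: use u(i-1), so indices stay within 1..nmax.
      do i = 2, nmax-1
         r(i) = u(i-1) - 2*u(i) + u(i+1)
      enddo
      write(*,*) 'r(5) = ', r(5)
      end
```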
Using Totalview
• On DataStar, on the dspoe interactive node:
  • Compile and link your code with -g and turn off any optimizations:
    • mpxlf_r -g program.f
    • mpxlf90_r -g program.f
    • mpcc_r -g program.c
  • Use the tvpoe wrapper to run your job. For example, for a 4 processor job:
    • tvpoe /gpfs/projects/workshop/debug/a.out -nodes 1 -tasks_per_nodes 4 (always use the full path to the executable)
• On IA64:
  • The process is a little more involved. See: http://www.sdsc.edu/user_services/ia64/runjobs.html#interactive
Using Totalview: Root window
• The interface lists the processes.
• The status of each process is also displayed (running, breakpoint, hold, etc.).

Using Totalview: Process window
• Shows process-specific information.
• The variables are listed in the stack frame.
• The source code is displayed here. Breakpoints can be placed from this window.
• "Dive" to check values.

Using Totalview: Stepping through program
• Go: Start/resume execution.
• Halt: Stop execution.
• Kill: Terminate execution.
• Next: Run to the next line or instruction (function calls are stepped over).
• Step: Run to the next line or instruction (function calls are stepped into; execution stops within the function).
• Out: Execute to completion of the function and return to the instruction after the function call.
• Run To: Click on any source line and run to that point.

Using Totalview: Stepping through program
• Click on a line of source code to set a breakpoint.
• Click again to clear it.
• To follow function calls, double-click on the function name.
• You can also set watchpoints and actionpoints.
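Using Totalview: Sample program
[Totalview screenshots omitted: a dialog for the sample program, followed by the view obtained after answering yes.]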
Sample problem to illustrate debugging process
• We make the array bounds correction and the MPI call correction, then recompile:
    mpxlf -qcheck -g sample-correct.f
• Sample output:
    0: Proc 0 sending 0.2980232239E-07 to proc 1
    1: Proc 1 received value 0.2980232239E-07
• We now get the expected output!
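According to the comment in sample.f, the MPI correction is to pass nmax rather than i as the count in mpi_send. The sketch below is a hypothetical reduced program showing that corrected exchange. The call to mpi_get_count is an addition for illustration, not part of the workshop code: it reports how many elements actually arrived, which is one way to spot a wrong send count without a debugger.

```fortran
C     Sketch of the corrected exchange (hypothetical reduced program)
      program sendfix
      implicit none
      include "mpif.h"
      integer status(MPI_STATUS_SIZE)
      integer my_id, nproc, ierr, i, nrecv, nmax
      parameter (nmax = 100)
      real r(nmax)

      call mpi_init(ierr)
      call mpi_comm_rank(mpi_comm_world, my_id, ierr)
      call mpi_comm_size(mpi_comm_world, nproc, ierr)

      if (my_id .eq. 0) then
         do i = 1, nmax
            r(i) = real(i)
         enddo
         do i = 1, nproc - 1
C           count is nmax (the buggy call passed i here)
            call mpi_send(r, nmax, mpi_real, i, i,
     1                    mpi_comm_world, ierr)
         enddo
      else
         do i = 1, nmax
            r(i) = 0.0
         enddo
         call mpi_recv(r, nmax, mpi_real, 0, my_id,
     1                 mpi_comm_world, status, ierr)
C        mpi_get_count reports how many elements actually arrived
         call mpi_get_count(status, mpi_real, nrecv, ierr)
         write(*,*) 'Proc ', my_id, ' got ', nrecv,
     1              ' values, r(5) =', r(5)
      endif

      call mpi_finalize(ierr)
      end
```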
References
• LLNL Totalview tutorial: http://www.llnl.gov/computing/tutorials/totalview/index.html
• Etnus tutorial and Totalview guide: http://www.sdsc.edu/user_services/datastar/docs/totalview/wwhelp/wwhimpl/java/html/wwhelp.htm and http://www.etnus.com/
• NERSC debugging tutorial: http://www.nersc.gov/nusers/help/tutorials/debug/