NCCS User Forum, March 20, 2012
Breakout: Debugging and Performance Tuning on Discover
Chongxun (Doris) Pan, doris.pan@nasa.gov
Debugging on Discover • Compilers can do some debugging work very easily and effectively • Array bound checking • Uninitialized variables and arrays • Floating point exception catching • Tools are a great help for debugging • idb, gdb, DDD, TotalView
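With the Intel Fortran compiler, these checks map to flags like the following (a minimal sketch; the source and executable names are illustrative):

    ifort -g -O0 -traceback -check bounds -check uninit -fpe0 -o myprog myprog.f90

Here -check bounds enables array bound checking, -check uninit flags use of uninitialized variables, -fpe0 traps floating point exceptions at run time, and -traceback prints the failing source line on abort.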
Before you start debugging… • What tool to use? • Sequential code? – ddd/gdb, idb • Threaded code? – idb, totalview • MPI or MPI/OpenMP hybrid code? – totalview • Got a core dump? • setenv decfort_dump_flag Y (allows a core dump to be generated) • limit coredumpsize xxxx • -g has to be used for debugging. Using -g automatically adds -O0
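A typical core-dump workflow might look like this (a sketch; the size limit and program name are illustrative):

    setenv decfort_dump_flag Y
    limit coredumpsize unlimited
    ifort -g -O0 -traceback -o myprog myprog.f90
    ./myprog              # crashes and writes a core file
    gdb ./myprog core     # inspect the stack at the point of failure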
Debugging a threaded code with idb • setenv OMP_NUM_THREADS 4 • idb ./omp.exe
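The executable must first be built with debugging symbols and OpenMP support, e.g. (a sketch assuming the Intel compiler; file names are illustrative):

    ifort -g -O0 -openmp -o omp.exe omp.f90

idb can then list the OpenMP threads and switch among them while stepping.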
TotalView • An interactive tool that lets you debug serial, multi-threaded and multi-processor programs with support for Fortran and C/C++
TotalView • Details on how to configure TV for your run and use TV for serial or MPI/OpenMP jobs are in the NCCS Primer. • As announced in March 2012, the base TV solution now includes ReplayEngine and CUDA debugging support at no additional fee. • Major features: • Parallel debugging: MPI, Pthreads, OpenMP, CUDA • Advanced memory debugging with MemoryScape • Reverse debugging with ReplayEngine • Add-on ThreadSpotter for optimization
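A launch typically looks like the following (a sketch; the module name is an assumption, and the exact MPI launch syntax for Discover is in the NCCS Primer):

    module load totalview                  # module name assumed
    totalview ./serial.exe                 # serial or threaded code
    totalview mpirun -a -np 8 ./mpi.exe    # classic MPI launch syntax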
Performance Analysis – Typical Bottlenecks • Your Application • Synchronization, load balance, communication, memory usage, I/O usage • System Architecture • Memory hierarchy, network latency, processor architecture, I/O system setup • Software • Compiler options, libraries, runtime environment, communication protocols… • Tuning is an iterative cycle: Instrumentation → Measurement → Analysis → Optimization
General Tuning Considerations • Communication tuning • Balance communication • Communication/computation overlapping • Run-time environment • Memory and CPU tuning • Cache misses (stride-1 access, padding) • TLB misses (large pages, loop blocking) • Page faults • Loop optimization • Auto-parallelization • Through library calls • Through compiler options • I/O tuning • Run-time environment (e.g., TMPDIR) • Reduce the number of I/O requests • Use unformatted files • Use large record sizes
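Two of these knobs can be exercised directly from the shell (a sketch assuming the Intel compiler; the scratch path is hypothetical):

    # let the compiler auto-parallelize loops and report what it did
    ifort -O3 -parallel -par-report2 -o prog prog.f90
    # point temporary I/O at fast local scratch (path assumed)
    setenv TMPDIR /discover/nobackup/$USER/tmp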
Optimization with Compiler Options – Intel • Start with a reasonably good set of options and build up from there • Default is -O2 (-O = -O2) • -O3: recommended over -O2 only for codes with loops that heavily use FP calculations and process large data sets • Codes with many small function calls: -ip -ipo • Codes with many FP operations: -fp-model fast=2 • Be careful with -fast: -fast = -O3 -ipo -no-prec-div -static • -xSSE4.2 generates optimized code specialized for the latest Intel processors • -openmp for OpenMP code • Fall back to -fp-model precise if correctness is an issue
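Put together, a typical aggressive build line might look like this (a sketch; whether each flag helps is application-dependent, so verify numerical results against a -fp-model precise build):

    ifort -O3 -xSSE4.2 -ip -ipo -fp-model fast=2 -openmp -o prog prog.f90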
Optimization with Compiler Options (Cont’d) • The Intel compiler provides several reports to help identify performance issues • -opt-report 3 -opt-report-phase=hlo (also ipo, hpo, ecg_swp, …) • Vectorization report: -vec-report3 • Auto-parallelization report: -par-report3 • OpenMP report: -openmp-report2
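For example, to see what the high-level optimizer and vectorizer did with a source file (a sketch; phase names and levels follow the Intel 2012-era compiler documentation):

    ifort -O3 -opt-report 3 -opt-report-phase=hlo -vec-report3 -c compute.f90

The reports show which loops were vectorized, unrolled, or blocked, and why others were not.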
Run-time Environment Tuning • Select a proper process layout. Default is “group round-robin”. Set I_MPI_PERHOST to override the default layout: • I_MPI_PERHOST=n • I_MPI_PERHOST=allcores: maps to all cores on a node • Or use mpirun/mpiexec options • mpirun -perhost n … • For MPI/OpenMP hybrid jobs: • Set OMP_NUM_THREADS and use -perhost n • Set I_MPI_PIN_DOMAIN=omp (or auto) • Set KMP_AFFINITY=compact
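A hybrid launch might then look like this (a sketch; the process and thread counts are illustrative and should match your node geometry):

    setenv OMP_NUM_THREADS 6
    setenv I_MPI_PIN_DOMAIN omp
    setenv KMP_AFFINITY compact
    mpirun -perhost 2 -np 16 ./hybrid.exe   # 2 MPI ranks per node, 6 threads each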
Run-time Environment Tuning (Cont’d) • Use scalable DAPL progress for large jobs • Set I_MPI_DAPL_SCALABLE_PROGRESS to 1 to enable the scalable algorithm for the DAPL read progress engine. It offers a performance advantage for large (>64) numbers of processes. • Use Intel MPI lightweight statistics • Set I_MPI_STATS to a non-zero integer value to gather MPI communication statistics • Adjust I_MPI_STATS_SCOPE to increase the effectiveness of the analysis • I_MPI_STATS=3 • I_MPI_STATS_SCOPE=coll
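For example (a sketch; the rank count is illustrative):

    setenv I_MPI_DAPL_SCALABLE_PROGRESS 1
    setenv I_MPI_STATS 3
    setenv I_MPI_STATS_SCOPE coll     # restrict statistics to collectives
    mpirun -np 128 ./mpi.exe          # statistics are written to stats.txt by default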
Run-time Environment Tuning (Cont’d) • Adjust the eager/rendezvous protocol threshold • “Eager” sends data immediately, regardless of whether a receive request is available. • “Rendezvous” notifies the receiving side that data is pending and transfers it once the receive request is posted. • I_MPI_EAGER_THRESHOLD controls the protocol switchover point. • Shorter messages are sent using the eager protocol; larger ones are sent using the more memory-efficient rendezvous protocol.
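For example, to raise the switchover point to 256 KB (the value is illustrative; tune it against your own message-size profile):

    setenv I_MPI_EAGER_THRESHOLD 262144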
Performance Analysis Tools (Check Primer for usage!) • Interval timers • Elapsed time between two timer calls: time (shell), mpi_wtime, system_clock (subroutines) • “time” is simple to use • Profiling tools • Periodically sample the program counter: gprof • Intrusive: longer run time • Only profiles CPU usage; lacks I/O and communication info • Event tracers • Record complete sequences of events: I_MPI_STATS, mpiP, TAU • Event counters • Count the number of times hardware events occur: TAU • Provide hardware performance counter info (e.g., cache misses) and hardware metrics (e.g., loads per cache miss, number of page faults) • Detailed information about communication time • Good for analyzing scaling
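The two simplest entry points can be driven entirely from the shell (a sketch; the -pg profiling flag spelling is an assumption about the compiler in use, and file names are illustrative):

    time ./myprog                        # wall/user/system time in one line
    ifort -g -pg -o myprog myprog.f90    # build with profiling instrumentation
    ./myprog                             # writes gmon.out
    gprof ./myprog gmon.out > profile.txt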
Top 3 ways to avoid performance problems – 1 Never, ever write your own code unless you absolutely have to • Libraries, libraries, libraries! • MKL library • GNU Scientific Library (GSL), under /usr/local/other/SLES11/gsl • LAPACK, under /usr/local/other/SLES11/lapack • PETSc, Portable, Extensible Toolkit for Scientific Computation, under /usr/local/other/SLES11/petsc • Spend time doing research; chances are you will find a package that suits your needs
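Linking against one of these is usually a one-line change (a sketch; the -mkl shortcut follows Intel compiler documentation of that era, and the LAPACK lib subdirectory is an assumption):

    ifort -O2 -o solver solver.f90 -mkl=sequential
    # or, against the locally installed LAPACK (lib path assumed):
    ifort -O2 -o solver solver.f90 -L/usr/local/other/SLES11/lapack/lib -llapack -lblas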
Top 3 ways to avoid performance problems – 2 Let the compiler do the work • Modern compilers are much better now at optimizing most code and providing hints for your manual optimization • For example, “-xSSE4.2 -O3 -no-prec-div” is recommended for the latest Intel processors • Spend some time reading the “man” page and ask around
Top 3 ways to avoid performance problems – 3 Never use more data than absolutely necessary • Only use high precision when necessary • A reduction in the amount of data the CPU needs ALWAYS translates to an increase in performance • Always keep in mind that the memory subsystem and the network are the ultimate bottlenecks
Top 3 ways to avoid performance problems – 3.5 Finally, make friends with computer scientists! Learning even a little about modern computer architectures will result in much better code.