
NCCS User Forum March 20, 2012


Presentation Transcript


  1. NCCS User Forum, March 20, 2012

  2. Breakout: Debugging and Performance Tuning on Discover. Chongxun (Doris) Pan, doris.pan@nasa.gov

  3. Debugging on Discover • Compilers can do some debugging work very easily and effectively • Array bound checking • Uninitialized variables and arrays • Floating point exception catching • Tools are a great help for debugging • idb, gdb, DDD, TotalView

  4. Compiler options
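  The option table on this slide reads best as an example. As a sketch (these are standard Intel ifort and GNU gfortran options, not a transcription of the original table), typical debug builds look like:

     # Intel Fortran: array bound checking, uninitialized-variable checking,
     # floating point exception trapping, and a traceback on crashes
     ifort -g -O0 -check bounds -check uninit -fpe0 -traceback myprog.f90

     # GNU Fortran equivalents
     gfortran -g -O0 -fcheck=bounds -finit-real=snan \
              -ffpe-trap=invalid,zero,overflow -fbacktrace myprog.f90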

  5. Debugging tools on Discover

  6. Before you start debugging… • What tool to use? • Sequential code? – ddd/gdb, idb • Threaded code? – idb, TotalView • MPI or MPI/OpenMP hybrid code? – TotalView • Got a core dump? • setenv decfort_dump_flag y (allows a core dump to be generated; see the sketch below) • limit coredumpsize xxxx • -g has to be used for debugging; using -g automatically adds -O0
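  A minimal sketch of the core-dump workflow above (csh syntax; the executable name is hypothetical):

     setenv decfort_dump_flag y      # let the Intel Fortran runtime write a core file
     limit coredumpsize unlimited    # raise the shell's core-file size limit
     ./myprog.exe                    # reproduce the crash; a core file is written
     gdb ./myprog.exe core           # inspect it; 'bt' prints the backtrace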

  7. Debugging a threaded code with idb • setenv OMP_NUM_THREADS 4 • idb ./omp.exe
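  A complete session for the example above might look like this (the -openmp flag matches the compiler generation on slide 12; file names are hypothetical):

     ifort -g -O0 -openmp -o omp.exe omp.f90   # build with debug info and OpenMP enabled
     setenv OMP_NUM_THREADS 4                  # run with four threads
     idb ./omp.exe                             # launch the Intel debugger at startup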

  8. TotalView • An interactive tool that lets you debug serial, multi-threaded, and multi-process programs, with support for Fortran and C/C++

  9. TotalView • Details on how to configure TV for your run and use TV for serial or MPI/OpenMP jobs are in the NCCS Primer. • As announced in March 2012, the base TV solution now includes ReplayEngine and CUDA debugging support at no additional fee. • Major features: • Parallel debugging: MPI, Pthreads, OpenMP, CUDA • Advanced memory debugging with MemoryScape • Reverse debugging with ReplayEngine • Add-on ThreadSpotter for optimization
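  Site-specific setup lives in the NCCS Primer; as a generic sketch (the module name is hypothetical, and -tv is the Intel MPI option for launching under TotalView):

     module load totalview             # hypothetical module name; check the Primer
     totalview ./myprog.exe            # debug a serial executable
     mpirun -tv -np 8 ./myprog.exe     # launch an 8-rank MPI job under TotalView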

  10. Performance Analysis – Typical Bottlenecks • Your Application • Synchronization, load balance, communication, memory usage, I/O usage • System Architecture • Memory hierarchy, network latency, processor architecture, I/O system setup • Software • Compiler options, libraries, runtime environment, communication protocols… [Diagram: the tuning cycle Instrumentation → Measurement → Analysis → Optimization]

  11. General Tuning Considerations • Communication tuning • Balance communication • Communication/computation overlapping • Run-time environment • Memory and CPU tuning • Cache misses (stride-1 access, padding) • TLB misses (large pages, loop blocking) • Page faults • Loop optimization • Auto-parallelization • Through library calls • Through compiler options • I/O tuning • Use TMPDIR and tune the run-time environment (see the sketch below) • Reduce the number of I/O requests • Use unformatted files • Use large record sizes
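  For the TMPDIR point above, a sketch (the scratch path is illustrative; point it at your own nobackup space):

     setenv TMPDIR /discover/nobackup/$USER/tmp   # redirect scratch I/O away from the default
     mkdir -p $TMPDIR                             # the directory must exist before the run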

  12. Optimization with Compiler Options - Intel • Start with a reasonably good set of options and build up from there • The default is -O2 (-O is the same as -O2) • -O3: recommended over -O2 only for codes with loops that heavily use FP calculations and process large data sets • Codes with many small function calls: -ip -ipo • Codes with many FP operations: -fp-model fast=2 • Be careful with -fast: -fast = -O3 -ipo -no-prec-div -static • -xSSE4.2 generates optimized code specialized for the Intel processors • -openmp for OpenMP code • Fall back to -fp-model precise if correctness is an issue

  13. Optimization with Compiler Options (Cont’d) • The Intel compiler provides several reports to identify performance issues (see the sketch below) • -opt-report 3 -opt-report-phase=hlo (other phases: ipo, hpo, ecg_swp…) • Vectorization report: -vec-report3 • -par-report3 • -openmp-report2
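  For example (flag spellings follow this Intel compiler generation; the source file name is hypothetical):

     ifort -O3 -opt-report 3 -opt-report-phase=hlo myprog.f90   # high-level optimizer report
     ifort -O3 -vec-report3 myprog.f90                          # vectorization report
     ifort -parallel -par-report3 myprog.f90                    # auto-parallelization report
     ifort -openmp -openmp-report2 myprog.f90                   # OpenMP transformation report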

  14. Run-time Environment Tuning • Select a proper process layout. The default is “group round-robin”. Set I_MPI_PERHOST to override the default layout: • I_MPI_PERHOST=n • I_MPI_PERHOST=allcores : maps to all cores on a node • Or use mpirun/mpiexec options: • mpirun -perhost n … • For MPI/OpenMP hybrid jobs (see the sketch below): • Set OMP_NUM_THREADS and use -perhost n • Set I_MPI_PIN_DOMAIN to auto or omp • Set KMP_AFFINITY=compact
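  Putting the hybrid-job settings together (csh syntax; the rank/thread counts and executable name are illustrative):

     setenv OMP_NUM_THREADS 4                # four OpenMP threads per MPI rank
     setenv I_MPI_PIN_DOMAIN omp             # one pinning domain per rank, sized by OMP_NUM_THREADS
     setenv KMP_AFFINITY compact             # pack threads onto adjacent cores
     mpirun -perhost 3 -np 12 ./hybrid.exe   # 3 ranks per node across 4 nodes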

  15. Run-time Environment Tuning (Cont’d) • Use scalable DAPL progress for large jobs • Set the I_MPI_DAPL_SCALABLE_PROGRESS variable to 1 to enable the scalable algorithm for the DAPL read progress engine. It offers a performance advantage for large process counts (>64). • Use Intel MPI lightweight statistics • Set I_MPI_STATS to a non-zero integer value to gather MPI communication statistics • Manipulate I_MPI_STATS_SCOPE to increase the effectiveness of the analysis, e.g.: • I_MPI_STATS=3 • I_MPI_STATS_SCOPE=coll
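  For example, to enable scalable progress and gather collective-operation statistics (csh syntax; the rank count and executable name are illustrative):

     setenv I_MPI_DAPL_SCALABLE_PROGRESS 1   # scalable DAPL read progress, useful beyond 64 ranks
     setenv I_MPI_STATS 3                    # collect lightweight MPI statistics
     setenv I_MPI_STATS_SCOPE coll           # restrict collection to collective operations
     mpirun -np 128 ./myprog.exe             # statistics land in stats.txt by default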

  16. Run-time Environment Tuning (Cont’d) • Adjust the eager/rendezvous protocol threshold • “Eager” sends data immediately, regardless of receive-request availability. • “Rendezvous” notifies the receiving side that data is pending and transfers it once the receive request is posted. • I_MPI_EAGER_THRESHOLD controls the protocol switchover point. • Shorter messages are sent using the eager protocol; larger ones use the more memory-efficient rendezvous protocol.
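  As a sketch, setting the switchover point so that messages up to 256 KB go eager (the value is illustrative; check the Intel MPI reference for the default on your version):

     setenv I_MPI_EAGER_THRESHOLD 262144   # bytes; messages at or below this size use eager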

  17. Performance Analysis Tools (Check Primer for usage!) • Interval Timers – elapsed time between two timer calls: time (shell), mpi_wtime, system_clock (subroutines). “time” is simple to use. • Profiling Tools – periodically sample the program counter: gprof. Intrusive (longer run time) and only profiles CPU usage, with no I/O or communication info (see the sketch below). • Event Tracers – record complete sequences of events: I_MPI_STATS, MpiP, TAU. Give detailed information about communication time; good for analyzing scaling. • Event Counters – count the number of times hardware events occur: TAU. Provide hardware performance counter info (e.g., cache misses) and hardware metrics (e.g., loads per cache miss, number of page faults).
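  A minimal gprof session, for reference (standard gprof usage; file names are hypothetical):

     ifort -g -pg -o myprog.exe myprog.f90       # instrument the build for profiling
     ./myprog.exe                                # run; writes gmon.out in the working directory
     gprof ./myprog.exe gmon.out > profile.txt   # flat profile plus call graph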

  18. Top 3 ways to avoid performance problems - 1 Never, ever write your own code unless you absolutely have to • Libraries, libraries, libraries! • MKL library • GNU Scientific Library (GSL), under /usr/local/other/SLES11/gsl • LAPACK, under /usr/local/other/SLES11/lapack • PETSc, Portable, Extensible Toolkit for Scientific Computation, under /usr/local/other/SLES11/petsc • Spend time doing research; chances are you will find a package that suits your needs
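  For instance, MKL can be pulled in with the Intel compiler's -mkl convenience flag (the source file name is hypothetical):

     ifort -O2 -mkl myprog.f90   # links MKL's BLAS/LAPACK/FFT routines automatically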

  19. Top 3 ways to avoid performance problems - 2 Let the compiler do the work • Modern compilers are much better now at optimizing most code and providing hints for your manual optimization • For example, “-xSSE4.2 -O3 -no-prec-div” is recommended for the latest Intel processors (see the sketch below) • Spend some time reading the “man” page and asking around
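  Putting the recommended flags from this slide into a concrete command (the source file name is hypothetical):

     ifort -xSSE4.2 -O3 -no-prec-div myprog.f90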

  20. Top 3 ways to avoid performance problems - 3 Never use more data than absolutely necessary • Only use high precision when necessary. • A reduction in the amount of data the CPU needs ALWAYS translates to an increase in performance • Always keep in mind that the memory subsystem and the network are the ultimate bottlenecks

  21. Top 3 ways to avoid performance problems - 3.5 Finally, make friends with computer scientists! Learning even a little about modern computer architectures will result in much better code.
