NCCS User Forum, March 20, 2012
Breakout: Debugging and Performance Tuning on Discover
Chongxun (Doris) Pan, doris.pan@nasa.gov
Debugging on Discover • Compilers can do some debugging work very easily and effectively • Array bound checking • Uninitialized variables and arrays • Floating point exception catching • Tools are a great help for debugging • idb, gdb, DDD, TotalView
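With the Intel Fortran compiler, these checks map to flags like the following (a minimal sketch; the source and executable names are illustrative):

    ifort -g -O0 -traceback -check bounds -check uninit -fpe0 -o myprog myprog.f90

Here -check bounds enables array bound checking, -check uninit flags use of uninitialized variables, -fpe0 traps floating point exceptions at run time, and -traceback prints the failing source line on abort.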
Before you start debugging… • What tool to use? • Sequential code? – ddd/gdb, idb • Threaded code? – idb, totalview • MPI or MPI/OpenMP hybrid code? – totalview • Got a core dump? • setenv decfort_dump_flag Y (allows a core dump to be generated) • limit coredumpsize xxxx • -g has to be used for debugging. Using -g automatically adds -O0
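A typical core-dump workflow might look like this (a sketch; the size limit and program name are illustrative):

    setenv decfort_dump_flag Y
    limit coredumpsize unlimited
    ifort -g -O0 -traceback -o myprog myprog.f90
    ./myprog              # crashes and writes a core file
    gdb ./myprog core     # inspect the stack at the point of failure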
Debugging a threaded code with idb • setenv OMP_NUM_THREADS 4 • idb ./omp.exe
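The executable must first be built with debugging symbols and OpenMP support, e.g. (a sketch assuming the Intel compiler; file names are illustrative):

    ifort -g -O0 -openmp -o omp.exe omp.f90

idb can then list the OpenMP threads and switch among them while stepping.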
TotalView • An interactive tool that lets you debug serial, multi-threaded and multi-processor programs with support for Fortran and C/C++
TotalView • Details on how to configure TV for your run and use TV for serial or MPI/OpenMP jobs are in the NCCS Primer. • As announced in March 2012, the base TV solution now includes ReplayEngine and CUDA debugging support at no additional fee. • Major features: • Parallel debugging: MPI, Pthreads, OpenMP, CUDA • Advanced memory debugging with MemoryScape • Reverse debugging with ReplayEngine • Add-on ThreadSpotter for optimization
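A launch typically looks like the following (a sketch; the module name is an assumption, and the exact MPI launch syntax for Discover is in the NCCS Primer):

    module load totalview                  # module name assumed
    totalview ./serial.exe                 # serial or threaded code
    totalview mpirun -a -np 8 ./mpi.exe    # classic MPI launch syntax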
Performance Analysis – Typical Bottlenecks • Your Application • Synchronization, load balance, communication, memory usage, I/O usage • System Architecture • Memory hierarchy, network latency, processor architecture, I/O system setup • Software • Compiler options, libraries, runtime environment, communication protocols… • Tuning is an iterative cycle: Instrumentation → Measurement → Analysis → Optimization
General Tuning Considerations • Communication tuning • Balance communication • Communication/computation overlapping • Run-time environment • Memory and CPU tuning • Cache misses (stride-1 access, padding) • TLB misses (large pages, loop blocking) • Page faults • Loop optimization • Auto-parallelization • Through library calls • Through compiler options • I/O tuning • Run-time environment (e.g., TMPDIR) • Reduce the number of I/O requests • Use unformatted files • Use large record sizes
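Two of these knobs can be exercised directly from the shell (a sketch assuming the Intel compiler; the scratch path is hypothetical):

    # let the compiler auto-parallelize loops and report what it did
    ifort -O3 -parallel -par-report2 -o prog prog.f90
    # point temporary I/O at fast local scratch (path assumed)
    setenv TMPDIR /discover/nobackup/$USER/tmp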
Optimization with Compiler Options – Intel • Start with a reasonably good set of options and build up from there • Default is -O2 (-O = -O2) • -O3: recommended over -O2 only for codes with loops that heavily use FP calculations and process large data sets • Codes with many small function calls: -ip -ipo • Codes with many FP operations: -fp-model fast=2 • Be careful with -fast: -fast = -O3 -ipo -no-prec-div -static • -xSSE4.2 generates optimized code specialized for the latest Intel processors • -openmp for OpenMP code • Fall back to -fp-model precise if correctness is an issue
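Put together, a typical aggressive build line might look like this (a sketch; whether each flag helps is application-dependent, so verify numerical results against a -fp-model precise build):

    ifort -O3 -xSSE4.2 -ip -ipo -fp-model fast=2 -openmp -o prog prog.f90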
Optimization with Compiler Options (Cont’d) • The Intel compiler provides several reports to help identify performance issues • -opt-report 3 -opt-report-phase=hlo (also ipo, hpo, ecg_swp, …) • Vectorization report: -vec-report3 • Auto-parallelization report: -par-report3 • OpenMP report: -openmp-report2
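For example, to see what the high-level optimizer and vectorizer did with a source file (a sketch; phase names and levels follow the Intel 2012-era compiler documentation):

    ifort -O3 -opt-report 3 -opt-report-phase=hlo -vec-report3 -c compute.f90

The reports show which loops were vectorized, unrolled, or blocked, and why others were not.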
Run-time Environment Tuning • Select a proper process layout. Default is “group round-robin”. Set I_MPI_PERHOST to override the default layout: • I_MPI_PERHOST=n • I_MPI_PERHOST=allcores: maps to all cores on a node • Or use mpirun/mpiexec options • mpirun -perhost n … • For MPI/OpenMP hybrid jobs: • Set OMP_NUM_THREADS and use -perhost n • Set I_MPI_PIN_DOMAIN=omp (or auto) • Set KMP_AFFINITY=compact
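A hybrid launch might then look like this (a sketch; the process and thread counts are illustrative and should match your node geometry):

    setenv OMP_NUM_THREADS 6
    setenv I_MPI_PIN_DOMAIN omp
    setenv KMP_AFFINITY compact
    mpirun -perhost 2 -np 16 ./hybrid.exe   # 2 MPI ranks per node, 6 threads each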
Run-time Environment Tuning (Cont’d) • Use scalable DAPL progress for large jobs • Set I_MPI_DAPL_SCALABLE_PROGRESS to 1 to enable the scalable algorithm for the DAPL read progress engine. It offers a performance advantage for large (>64) numbers of processes. • Use Intel MPI lightweight statistics • Set I_MPI_STATS to a non-zero integer value to gather MPI communication statistics • Adjust I_MPI_STATS_SCOPE to increase the effectiveness of the analysis • I_MPI_STATS=3 • I_MPI_STATS_SCOPE=coll
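For example (a sketch; the rank count is illustrative):

    setenv I_MPI_DAPL_SCALABLE_PROGRESS 1
    setenv I_MPI_STATS 3
    setenv I_MPI_STATS_SCOPE coll     # restrict statistics to collectives
    mpirun -np 128 ./mpi.exe          # statistics are written to stats.txt by default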
Run-time Environment Tuning (Cont’d) • Adjust the eager/rendezvous protocol threshold • “Eager” sends data immediately, regardless of whether a receive request is available. • “Rendezvous” notifies the receiving side that data is pending and transfers it once the receive request is posted. • I_MPI_EAGER_THRESHOLD controls the protocol switchover point. • Shorter messages are sent using the eager protocol; larger ones are sent using the more memory-efficient rendezvous protocol.
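For example, to raise the switchover point to 256 KB (the value is illustrative; tune it against your own message-size profile):

    setenv I_MPI_EAGER_THRESHOLD 262144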
Performance Analysis Tools (Check Primer for usage!) • Interval timers • Elapsed time between two timer calls: time (shell), mpi_wtime, system_clock (subroutines) • “time” is simple to use • Profiling tools • Periodically sample the program counter: gprof • Intrusive: longer run time • Only profiles CPU usage; lacks I/O and communication info • Event tracers • Record complete sequences of events: I_MPI_STATS, mpiP, TAU • Event counters • Count the number of times hardware events occur: TAU • Provide hardware performance counter info (e.g., cache misses) and hardware metrics (e.g., loads per cache miss, number of page faults) • Detailed information about communication time • Good for analyzing scaling
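The two simplest entry points can be driven entirely from the shell (a sketch; the -pg profiling flag spelling is an assumption about the compiler in use, and file names are illustrative):

    time ./myprog                        # wall/user/system time in one line
    ifort -g -pg -o myprog myprog.f90    # build with profiling instrumentation
    ./myprog                             # writes gmon.out
    gprof ./myprog gmon.out > profile.txt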
Top 3 ways to avoid performance problems – 1 Never, ever write your own code unless you absolutely have to • Libraries, libraries, libraries! • MKL library • GNU Scientific Library (GSL), under /usr/local/other/SLES11/gsl • LAPACK, under /usr/local/other/SLES11/lapack • PETSc, Portable, Extensible Toolkit for Scientific Computation, under /usr/local/other/SLES11/petsc • Spend time doing research; chances are you will find a package that suits your needs
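Linking against one of these is usually a one-line change (a sketch; the -mkl shortcut follows Intel compiler documentation of that era, and the LAPACK lib subdirectory is an assumption):

    ifort -O2 -o solver solver.f90 -mkl=sequential
    # or, against the locally installed LAPACK (lib path assumed):
    ifort -O2 -o solver solver.f90 -L/usr/local/other/SLES11/lapack/lib -llapack -lblas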
Top 3 ways to avoid performance problems – 2 Let the compiler do the work • Modern compilers are much better now at optimizing most code and providing hints for your manual optimization • For example, “-xSSE4.2 -O3 -no-prec-div” is recommended for the latest Intel processors • Spend some time reading the “man” page and ask around
Top 3 ways to avoid performance problems – 3 Never use more data than absolutely necessary • Only use high precision when necessary • A reduction in the amount of data the CPU needs ALWAYS translates to an increase in performance • Always keep in mind that the memory subsystem and the network are the ultimate bottlenecks
Top 3 ways to avoid performance problems – 3.5 Finally, make friends with computer scientists! Learning even a little about modern computer architectures will result in much better code.