COM503 Parallel Computer Architecture & Programming. Lecture 8: Profiling for Performance Analysis. Prof. Taeweon Suh, Computer Science Education, Korea University.
Performance Analysis
• Assuming that the performance of an application is satisfactory in single-threaded mode, the most likely performance question is: “Why does my application not get the expected speed-up when running on multiple threads?”
• The performance of large-scale parallel applications depends on many factors, including:
  • Load imbalance
  • Parallelization overheads
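Speed-up and load imbalance can both be computed from measured times. The following is a minimal sketch rather than anything from the lecture: the function names and the sample timings are hypothetical, and the imbalance metric used here (slowest thread over average thread, minus one) is only one of several common definitions.

```python
def speedup(t_serial, t_parallel):
    """Speed-up of the parallel run relative to the single-threaded run."""
    return t_serial / t_parallel


def efficiency(t_serial, t_parallel, num_threads):
    """Fraction of the ideal (linear) speed-up that was actually achieved."""
    return speedup(t_serial, t_parallel) / num_threads


def load_imbalance(busy_times):
    """Relative excess of the slowest thread over the average thread.

    0.0 means perfectly balanced work; larger values mean the other
    threads sit idle while waiting for the slowest one.
    """
    mean = sum(busy_times) / len(busy_times)
    return max(busy_times) / mean - 1.0


# Hypothetical measurements in seconds, for illustration only.
t1 = 6.0                            # single-threaded run
per_thread = [1.0, 1.0, 1.0, 3.0]   # busy time of each of 4 threads
tp = max(per_thread)                # the region ends with the slowest thread

print(f"speed-up   = {speedup(t1, tp):.2f}")
print(f"efficiency = {efficiency(t1, tp, len(per_thread)):.2f}")
print(f"imbalance  = {load_imbalance(per_thread):.2f}")
```

In this made-up case four threads deliver only half of the ideal speed-up, and the whole shortfall comes from one thread doing three times the work of the others.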
Profiling
• Several approaches can be used to obtain performance data
• Sampling
  • Based on periodic OS interrupts (timer interrupts)
  • At each sampling point, performance data such as the program counter, call stack, and hardware counter values are collected and recorded
  • Less numerically accurate, but allows the target program to run at near full speed
  • Examples: Unix gprof, Sun Performance Analyzer, OProfile
• Code instrumentation
  • Calls to a tracing library are inserted into the code by the programmer, the compiler, or a tool
  • These library calls write performance data to a file during program execution
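The sampling approach can be illustrated with a small timer-driven profiler. This is a sketch under stated assumptions, not how gprof or the other listed tools are implemented: it assumes a Unix system, it must run in the main thread, and `sample_profile` and `hot_loop` are hypothetical names. The name of the interrupted Python function stands in for the program counter.

```python
import collections
import signal


def sample_profile(func, interval=0.005):
    """Run func() and count which function is executing at each timer tick.

    Assumes Unix and the main thread, because Python only installs and
    runs signal handlers there.
    """
    hits = collections.Counter()

    def on_tick(signum, frame):
        # 'frame' is the interrupted Python frame; its function name plays
        # the role of the program counter in a native profiler.
        if frame is not None:
            hits[frame.f_code.co_name] += 1

    old_handler = signal.signal(signal.SIGPROF, on_tick)
    # ITIMER_PROF measures CPU time consumed by the process and raises
    # SIGPROF every `interval` seconds of it.
    signal.setitimer(signal.ITIMER_PROF, interval, interval)
    try:
        func()
    finally:
        signal.setitimer(signal.ITIMER_PROF, 0, 0)  # disarm the timer
        signal.signal(signal.SIGPROF, old_handler)
    return hits


def hot_loop():
    # Deliberately CPU-bound so that many samples land here.
    total = 0
    for i in range(3_000_000):
        total += i * i
    return total


for name, count in sample_profile(hot_loop).most_common():
    print(f"{count:6d} samples  {name}")
```

The counts are statistical: a shorter `interval` gives finer resolution at the cost of more interruptions, which is the accuracy versus overhead trade-off described above.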
Pertinent Performance Data
• Time spent in user-level and system-level routines
• Time spent in serial parts and parallel regions
• Time spent in communication
• Number of invalidations and cache-to-cache transfers
• Hardware performance counter information such as CPU cycles and instruction-cache (I$) and data-cache (D$) misses
• The state of a thread at given times, such as waiting for work, synchronizing, forking, and joining
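The first item, time spent in user-level versus system-level code, is the easiest to observe directly. A minimal sketch, assuming the standard `os.times()` and `time.perf_counter()` calls; `measure` and `workload` are hypothetical names, and the workload is invented for illustration.

```python
import os
import time


def measure(func):
    """Return user CPU, system CPU, and wall-clock seconds spent in func()."""
    cpu_before, wall_before = os.times(), time.perf_counter()
    func()
    cpu_after, wall_after = os.times(), time.perf_counter()
    return {
        "user": cpu_after.user - cpu_before.user,        # user-level code
        "system": cpu_after.system - cpu_before.system,  # kernel, on our behalf
        "wall": wall_after - wall_before,                # elapsed real time
    }


def workload():
    # Hypothetical workload: arithmetic (user time) plus writes (system time).
    total = sum(i * i for i in range(500_000))
    with open(os.devnull, "w") as sink:
        for _ in range(2_000):
            sink.write("x" * 1024)
            sink.flush()
    return total


timings = measure(workload)
for kind, seconds in timings.items():
    print(f"{kind:>6}: {seconds:.3f} s")
```

`os.times()` reports process-wide CPU time, so in a program with several native threads the user and system figures are summed over all threads and can exceed the wall-clock time.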
gprof
• Use GNU gprof to get the profile information
  • Compile and link your code with the -pg option
  • Run your code; gmon.out is generated
  • Run gprof to interpret the information
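Once gprof has written its report, the flat profile at the top of it can also be read by a script. The following is a minimal sketch, assuming the default flat-profile layout that GNU gprof prints; `parse_flat_profile` is a hypothetical helper, and the sample report uses made-up numbers and function names rather than output from a real run.

```python
def parse_flat_profile(report):
    """Extract (percent_time, self_seconds, name) rows from a gprof flat profile."""
    rows = []
    in_table = False
    for line in report.splitlines():
        fields = line.split()
        if not in_table:
            # The table begins after the header line "time seconds seconds ...".
            in_table = fields[:3] == ["time", "seconds", "seconds"]
            continue
        if not fields:
            break  # a blank line ends the table
        try:
            percent, self_seconds = float(fields[0]), float(fields[2])
        except (ValueError, IndexError):
            break  # anything that is not a data row ends the table
        # Rows with call counts carry three extra numeric columns before the name.
        has_calls = len(fields) >= 7 and fields[3].isdigit()
        name = " ".join(fields[6:] if has_calls else fields[3:])
        rows.append((percent, self_seconds, name))
    return rows


# Illustrative report text only; the numbers and names are invented.
SAMPLE = """\
Flat profile:

Each sample counts as 0.01 seconds.
  %   cumulative   self              self     total
 time   seconds   seconds    calls  ms/call  ms/call  name
 60.00      0.06     0.06      100     0.60     0.90  solve
 30.00      0.09     0.03     1000     0.03     0.03  compute_rhs
 10.00      0.10     0.01                             mcount

 %         the percentage of the total running time of the
"""

for percent, seconds, name in parse_flat_profile(SAMPLE):
    print(f"{percent:6.2f}%  {seconds:5.2f} s  {name}")
```

Rows are already sorted by self time in gprof's output, so the first few entries are the natural starting point when looking for hot spots.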
Testrun Benchmarks
• Download a parallel benchmark from http://www.nas.nasa.gov/Resources/Software/npb.html
  • Download the OpenMP version of NPB (NPB 3)
• Compile the BT benchmark
  • Read README.install for information on how to compile the code
  • Edit ‘make.def’ under /config/
    • Change ‘f77’ to ‘gfortran’
    • Add the ‘-pg’ option to FFLAGS and FLINKFLAGS
      • FFLAGS = -O -fopenmp -pg
      • FLINKFLAGS = -O -fopenmp -pg
  • Compile BT with ‘make BT CLASS=A’
• Run the benchmark with ./bin/BT.A
  • It generates gmon.out by default in the directory where you run the program
• Use gprof to extract the profile information
  • gprof ./bin/BT.A > bt.txt
  • Open bt.txt with any text editor
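The make.def edit can be scripted when several benchmarks have to be rebuilt with profiling enabled. A minimal sketch, assuming simple `VAR = value` lines; `add_flag` is a hypothetical helper, and the sample text is a made-up excerpt rather than the real contents of make.def.

```python
import re


def add_flag(makedef_text, variables, flag="-pg"):
    """Append `flag` to each `VAR = ...` line named in `variables`, if missing."""
    out = []
    for line in makedef_text.splitlines():
        match = re.match(r"\s*(\w+)\s*=(.*)", line)
        if (match and match.group(1) in variables
                and flag not in match.group(2).split()):
            line = line.rstrip() + " " + flag
        out.append(line)
    return "\n".join(out) + "\n"


# Made-up excerpt for illustration; the real make.def has more settings.
SAMPLE = "F77 = gfortran\nFFLAGS = -O -fopenmp\nFLINKFLAGS = -O -fopenmp\n"

print(add_flag(SAMPLE, {"FFLAGS", "FLINKFLAGS"}), end="")
```

The function leaves a line alone if the flag is already present, so running it twice on the same file does not duplicate `-pg`.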
Testrun Benchmarks
• Compile the DC benchmark
  • Read README.install for information on how to compile the code
  • Edit ‘make.def’ under /config/
    • Change ‘cc’ to ‘gcc’
    • Add the ‘-pg’ option to CFLAGS and CLINK
      • CFLAGS = -O -fopenmp -pg
      • CLINK = $(CC) -fopenmp -pg
  • Compile DC with ‘make DC CLASS=A’
• Run the benchmark with ./bin/dc.A.x
  • It generates gmon.out by default in the directory where you run the program
• Use gprof to extract the profile information
  • gprof ./bin/dc.A.x > dc.txt
  • Open dc.txt with any text editor