Lecture 8. Profiling - for Performance Analysis -

COM503 Parallel Computer Architecture & Programming
Prof. Taeweon Suh
Computer Science Education, Korea University

Performance Analysis
• Assuming that the performance of an application is satisfactory in single-threaded mode, the most likely performance question is: “Why does my application not get the expected speed-up when running on multiple threads?”
• The performance of large-scale parallel applications depends on many factors, including
  • Load imbalance
  • Parallelization overheads
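The effect of the two factors above on speed-up can be sketched with a little arithmetic. This is a simplified model, not a measurement: it assumes a parallel region finishes only when its slowest thread finishes, and that parallelization overhead is a fixed additive cost. The function name `parallel_speedup` and all timing numbers are made up for illustration.

```python
def parallel_speedup(serial_time, thread_times, overhead=0.0):
    """Speed-up = serial time / parallel time.

    Simplifying assumption: the parallel run takes as long as its slowest
    thread, plus a fixed overhead (thread creation, synchronization, ...).
    """
    parallel_time = max(thread_times) + overhead
    return serial_time / parallel_time


# Made-up example: 8 seconds of serial work distributed over 4 threads.
balanced = parallel_speedup(8.0, [2.0, 2.0, 2.0, 2.0])
imbalanced = parallel_speedup(8.0, [1.0, 1.0, 2.0, 4.0])
with_overhead = parallel_speedup(8.0, [2.0, 2.0, 2.0, 2.0], overhead=0.5)

if __name__ == "__main__":
    print(f"balanced:      {balanced:.1f}x")       # 8 / 2   = 4.0
    print(f"imbalanced:    {imbalanced:.1f}x")     # 8 / 4   = 2.0
    print(f"with overhead: {with_overhead:.1f}x")  # 8 / 2.5 = 3.2
```

With the same total work, the imbalanced split halves the speed-up because three threads sit idle while the fourth finishes. Idle time of this kind is what a profile can help locate.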

Profiling
• Several approaches can be used to obtain performance data
• Sampling
  • Based on periodic OS interrupts (timer interrupts)
  • At each sampling point, performance data such as the program counter, call stack, and hardware counter values are collected and recorded
  • Less numerically accurate, but allows the target program to run at near full speed
  • Examples
    • Unix gprof
    • Sun Performance Analyzer
    • OProfile
• Code instrumentation
  • Calls to a tracing library are inserted in the code by the programmer, the compiler, or a tool
  • These library calls write performance data into a file during program execution
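The sampling approach can be illustrated with a toy profiler. This is a minimal sketch for Unix-like systems, not how gprof or any of the listed tools is implemented: it uses Python's `signal.setitimer` with `ITIMER_PROF` to request a periodic timer signal, and at each signal records only the name of the function that was executing. The names `profile`, `hot_loop`, and `workload`, and the 5 ms interval, are assumptions chosen for the example.

```python
import collections
import signal

samples = collections.Counter()


def _on_sample(signum, frame):
    # Sampling point: record which function was running when the timer fired.
    if frame is not None:
        samples[frame.f_code.co_name] += 1


def profile(func, interval=0.005):
    """Run func() while sampling it every `interval` seconds of CPU time."""
    samples.clear()
    old_handler = signal.signal(signal.SIGPROF, _on_sample)
    signal.setitimer(signal.ITIMER_PROF, interval, interval)
    try:
        func()
    finally:
        signal.setitimer(signal.ITIMER_PROF, 0.0)  # disarm the timer
        signal.signal(signal.SIGPROF, old_handler)
    return dict(samples)


def hot_loop():
    total = 0
    for i in range(3_000_000):
        total += i * i
    return total


def workload():
    hot_loop()


if __name__ == "__main__":
    for name, count in sorted(profile(workload).items(), key=lambda kv: -kv[1]):
        print(f"{name:12s} {count}")
```

The sample counts are only statistical estimates of where time goes, which is the accuracy trade-off noted above. The workload itself runs unmodified between samples.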

Pertinent Performance Data
• Time spent in user-level and system-level routines
• Time spent in serial parts and parallel regions
• Time spent in communication
• Number of invalidations and cache-to-cache transfers
• Hardware performance counter information such as CPU cycles and instruction-cache (I$) and data-cache (D$) misses
• The state of a thread at given times, such as waiting for work, synchronizing, forking, and joining
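Some of these quantities can be gathered by hand with simple instrumentation. The sketch below brackets named regions of a program and accumulates wall-clock, user, and system time for each, using `time.perf_counter` and `os.times`. The helper `region`, the region labels, and the workload in `demo` are hypothetical. Note that `os.times` reports process-wide CPU time, so this does not separate individual threads.

```python
import os
import time
from contextlib import contextmanager

region_totals = {}


@contextmanager
def region(name):
    """Accumulate wall, user, and system time spent inside the with-block."""
    cpu0, wall0 = os.times(), time.perf_counter()
    try:
        yield
    finally:
        cpu1, wall1 = os.times(), time.perf_counter()
        rec = region_totals.setdefault(
            name, {"wall": 0.0, "user": 0.0, "system": 0.0}
        )
        rec["wall"] += wall1 - wall0
        rec["user"] += cpu1.user - cpu0.user        # time in user-level code
        rec["system"] += cpu1.system - cpu0.system  # time in system calls


def demo():
    # Hypothetical program with a compute-heavy part and an I/O-heavy part.
    with region("compute"):
        sum(i * i for i in range(200_000))
    with region("io"):
        with open(os.devnull, "w") as sink:
            for _ in range(20_000):
                sink.write("x" * 64)
    return region_totals


if __name__ == "__main__":
    for name, rec in demo().items():
        print(name, {k: round(v, 4) for k, v in rec.items()})
```

The same bracketing idea applies to serial parts and parallel regions of a threaded code. Hardware counter data and cache-coherence events are not visible this way and need tool support.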

gprof
• Use GNU gprof to get the profile information
  • Compile and link your code with the -pg option
  • Run your code
    • gmon.out is generated
  • Run gprof to interpret the information
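The three steps above map onto three commands. The sketch below only builds and prints them, so it can be read without a compiler installed. It assumes gcc as the compiler and uses a hypothetical source file `hot.c` and executable `./hot`; neither comes from the lecture.

```python
import shlex


def gprof_steps(source="hot.c", exe="./hot"):
    """Return the gprof workflow as a list of argument lists (nothing is run)."""
    return [
        ["gcc", "-O", "-pg", source, "-o", exe],  # 1. compile and link with -pg
        [exe],                                    # 2. run; gmon.out is written to the current directory
        ["gprof", exe, "gmon.out"],               # 3. interpret gmon.out
    ]


if __name__ == "__main__":
    for step in gprof_steps():
        print(shlex.join(step))
```

Each argument list can be typed at a shell prompt as printed, or passed to `subprocess.run(step, check=True)`. gprof writes its report to standard output, so step 3 is usually redirected to a file.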

Testrun Benchmarks
• Download a parallel benchmark from
  • http://www.nas.nasa.gov/Resources/Software/npb.html
  • Download the OpenMP version of NPB (NPB 3)
• Compile the BT benchmark
  • Read README.install for information on how to compile the code
  • Edit ‘make.def’ under /config/
    • Change ‘f77’ to ‘gfortran’
    • Add the ‘-pg’ option to FFLAGS and FLINKFLAGS
      • FFLAGS = -O -fopenmp -pg
      • FLINKFLAGS = -O -fopenmp -pg
  • Compile BT with ‘make BT CLASS=A’
• Run the benchmark with ./bin/BT.A
  • It will generate gmon.out by default in the directory where you run the program
• Use gprof to extract the profile information
  • gprof ./bin/BT.A > bt.txt
  • Open bt.txt with any text editor

Testrun Benchmarks
• Compile the DC benchmark
  • Read README.install for information on how to compile the code
  • Edit ‘make.def’ under /config/
    • Change ‘cc’ to ‘gcc’
    • Add the ‘-pg’ option to CFLAGS and CLINK
      • CFLAGS = -O -fopenmp -pg
      • CLINK = $(CC) -fopenmp -pg
  • Compile DC with ‘make DC CLASS=A’
• Run the benchmark with ./bin/dc.A.x
  • It will generate gmon.out by default in the directory where you run the program
• Use gprof to extract the profile information
  • gprof ./bin/dc.A.x > dc.txt
  • Open dc.txt with any text editor
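Instead of reading the report in a text editor, the flat-profile table at the top of a gprof report can be pulled out with a short script. This is a rough sketch that assumes the usual GNU gprof layout: a header line beginning with "time" and ending with "name", then one row per function, then a blank line. The sample text and its function names are made up and are not results from BT or DC.

```python
def parse_flat_profile(text):
    """Return (percent_time, self_seconds, name) tuples from a gprof flat profile."""
    rows, in_table = [], False
    for line in text.splitlines():
        fields = line.split()
        if not in_table:
            # The second header line starts with "time" and ends with "name".
            in_table = bool(fields) and fields[0] == "time" and fields[-1] == "name"
            continue
        if not fields:
            break  # a blank line ends the table
        try:
            percent, _cumulative, self_seconds = (float(f) for f in fields[:3])
        except ValueError:
            break  # not a data row
        rest = fields[3:]
        # If call counts are present, three numeric columns precede the name.
        if len(rest) >= 4 and rest[0].isdigit():
            rest = rest[3:]
        rows.append((percent, self_seconds, " ".join(rest)))
    return rows


# Made-up report fragment, for illustration only.
SAMPLE = """\
Flat profile:

Each sample counts as 0.01 seconds.
  %   cumulative   self              self     total
 time   seconds   seconds    calls   s/call   s/call  name
 60.00      0.60     0.60      100     0.01     0.01  solve_
 30.00      0.90     0.30                             helper_
 10.00      1.00     0.10        1     0.10     1.00  main

 %         the percentage of the total running time of the
"""

if __name__ == "__main__":
    for percent, self_seconds, name in parse_flat_profile(SAMPLE):
        print(f"{percent:6.2f}%  {self_seconds:6.2f}s  {name}")
```

To apply it to a real report, read the file first, for example `parse_flat_profile(open("bt.txt").read())`.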
