Programming Multi-Core Processors based Embedded Systems: A Hands-On Experience on Cavium Octeon based Platforms
Lab Exercises: Lab 1 (Performance Measurement)
Lab 1: Parallel Programming and Performance Measurement using MPAC
Lab 1 – Goals
• Objective
• Use MPAC benchmarks to measure the performance of different subsystems of a multi-core based system
• Learn to develop parallel programs using MPAC
• Mechanism
• The MPAC CPU and memory benchmarks exercise the processor and memory unit by generating compute-intensive and memory-intensive workloads
What to Look for
• Observations
• Observe the throughput with an increasing number of threads for compute-intensive and memory-intensive workloads
• Identify performance bottlenecks
Measurement of Execution Time
• Measuring the elapsed time from the start of a task until its completion is a straightforward procedure for a sequential task.
• The procedure becomes complex when the same task is executed concurrently by n threads on n distinct processors or cores.
• The threads are not guaranteed to all start, or all complete, at the same time. The measurement is therefore imprecise due to the concurrent nature of the tasks.
Cont…
• Execution time can be measured either globally or locally.
• In global measurement, execution time equals the difference of the time stamps taken at the global fork and join instants.
• Alternatively, local times can be measured and recorded by each of the n threads.
• After the threads join, the maximum of these individual execution times provides an estimate of the overall execution time.
Definitions
• LETE: Local Execution Time Estimation
• GETE: Global Execution Time Estimation
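The LETE reduction described above can be sketched as a small helper: given the n per-thread local times recorded before the join, the overall estimate is their maximum (GETE, by contrast, is a single fork-to-join interval stamped in the main thread). The function name is illustrative, not MPAC's:

```c
/* LETE: estimate overall execution time as the maximum of the
   per-thread local times collected after all threads have joined. */
double lete_estimate(const double *local_times, int n) {
    double max = local_times[0];
    for (int i = 1; i < n; i++)
        if (local_times[i] > max)
            max = local_times[i];
    return max;
}
```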
Cont…
[Figure: timelines contrasting GETE (one global fork-to-join interval) and LETE (per-thread local intervals)]
The Problem
• Lack of precision
• Some tasks finish before others
• Synchronization issues with a large number of cores
• Results are not repeatable
Performance Measurement Methodologies
• For the sequential case: get the start time; repeat the task for N iterations; get the end time.
• For the multithreaded case: each thread (1), (2), (3), ..., (K) gets its start time at a barrier, repeats the task for N iterations, then gets its end time at a barrier.
Accurate LETE Measurement Methodology
• Threads (1), (2), (3), ..., (K) synchronize before each round using a barrier.
• Repeat for N rounds.
• Take the maximum elapsed time across threads for each round.
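The per-round barrier synchronization above can be sketched with POSIX barriers (this is an illustration in plain pthreads, not MPAC's actual infrastructure; `run_rounds` and the round counter are our own names):

```c
#define _POSIX_C_SOURCE 200112L
#include <pthread.h>

#define ROUNDS 3

static pthread_barrier_t bar;
static int rounds_done = 0;

static void *round_worker(void *arg) {
    (void)arg;
    for (int r = 0; r < ROUNDS; r++) {
        /* All threads start the round together; exactly one thread
           (the "serial" thread) counts the round. */
        if (pthread_barrier_wait(&bar) == PTHREAD_BARRIER_SERIAL_THREAD)
            rounds_done++;
        /* ... timed work for this round would go here ... */
        pthread_barrier_wait(&bar); /* all finish before the next round */
    }
    return NULL;
}

/* Fork nthreads workers, run ROUNDS synchronized rounds, join. */
int run_rounds(int nthreads) {
    pthread_t tid[nthreads];
    pthread_barrier_init(&bar, NULL, nthreads);
    for (int t = 0; t < nthreads; t++)
        pthread_create(&tid[t], NULL, round_worker, NULL);
    for (int t = 0; t < nthreads; t++)
        pthread_join(tid[t], NULL);
    pthread_barrier_destroy(&bar);
    return rounds_done;
}
```

In a real benchmark, each thread would stamp its clock right after the first barrier and right before the second, giving one local elapsed time per thread per round.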
Accurate MINMAX Approach
• Repeat for N iterations
• Store each thread's local execution time for every iteration
• For each iteration, store the largest execution time among the threads
• This yields N largest-execution-time values
• Choose the minimum of these N values as your execution time: the MINMAX value!
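The MINMAX reduction in the steps above can be expressed as a small function over the recorded times (a sketch with an illustrative name; the matrix is flattened row-by-row for portability):

```c
/* times is a flattened iters x nthreads matrix:
   times[k * nthreads + t] is thread t's local time in iteration k.
   Take the max across threads per iteration, then the min of those. */
double minmax_time(int iters, int nthreads, const double *times) {
    double minmax = 0.0;
    for (int k = 0; k < iters; k++) {
        double round_max = times[k * nthreads];   /* max over threads */
        for (int t = 1; t < nthreads; t++)
            if (times[k * nthreads + t] > round_max)
                round_max = times[k * nthreads + t];
        if (k == 0 || round_max < minmax)         /* min over iterations */
            minmax = round_max;
    }
    return minmax;
}
```

The inner max captures the slowest thread (the true completion time of a round); the outer min discards rounds inflated by transient noise such as OS scheduling.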
Compile and Run (Memory Benchmark)
• Memory Benchmark
• $ cd /<path-to-mpac>/mpac_1.2
• $ ./configure
• $ make clean
• $ make
• $ cd benchmarks/mem
• $ ./mpac_mem_bm -n <# of threads> -s <array size> -r <# of repetitions> -t <data type>
• For help
• $ ./mpac_mem_bm -h
Compile and Run (CPU Benchmark)
• CPU Benchmark
• $ cd /<path-to-mpac>/mpac_1.2
• $ ./configure
• $ make clean
• $ make
• $ cd benchmarks/cpu
• $ ./mpac_cpu_bm -n <# of threads> -r <# of iterations>
• For help
• $ ./mpac_cpu_bm -h
Performance Measurements (CPU)
• The integer unit (summation), floating-point unit (sine), and logical unit (string operations) of the processor are exercised.
• Intel Xeon, AMD Opteron (x86), and Cavium Octeon (MIPS64) machines are used as the systems under test (SUTs).
• Throughput scales linearly with the number of threads in all cases.
Performance Measurements (Memory)
• With concurrent symmetric threads, one expects memory-to-memory copy throughput to scale with the number of threads.
• With data sizes of 4 KB, 16 KB, and 1 MB, most memory accesses should hit the L2 cache rather than main memory.
• For these cases the throughput scales linearly.
Performance Measurements (Memory)
• Copying 16 MB requires extensive main-memory accesses.
• On the Intel SUT a shared bus is used, so throughput is lower than in the cases where accesses hit the L2 cache, and it saturates as the bus becomes a bottleneck.
• Memory copy throughput saturates at around 40 Gbps, roughly half of the available bus bandwidth (64 bits x 1333 MHz = 85.3 Gbps).
• For the AMD and Cavium based SUTs, throughput scales linearly even in the 16 MB case, because these processors use efficient low-latency memory controllers instead of a shared system bus.
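The bus-bandwidth arithmetic quoted above (64 bits x 1333 MHz = 85.3 Gbps) is just width times clock; a one-line helper makes the unit conversion explicit (function name is ours):

```c
/* Theoretical peak bus bandwidth in Gbps: bus width in bits times
   bus clock in MHz, converted from bit*MHz to Gbps (x 1e6 / 1e9). */
double bus_bandwidth_gbps(int bus_bits, double clock_mhz) {
    return (double)bus_bits * clock_mhz * 1e6 / 1e9;
}
```

For a 64-bit bus at 1333 MHz this gives about 85.3 Gbps, matching the figure on the slide; the observed ~40 Gbps copy throughput is roughly half of that peak.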
MPAC Fork and Join Infrastructure
• In MPAC based applications, initialization and argument handling are performed by the main thread.
• The tasks to be run in parallel are forked to worker threads.
• The worker threads join after completing their tasks.
• Final processing is done by the main thread.
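The fork-join pattern described above can be sketched in plain POSIX threads (this is not MPAC's actual API; `parallel_sum` and `work_t` are illustrative names). The main thread sets up per-worker arguments, forks, joins, and then does the final processing, here a reduction of partial sums of 1..100:

```c
#include <pthread.h>

#define N 100  /* sum the integers 1..N in parallel */

typedef struct { int lo, hi; long partial; } work_t;

/* Worker: sum its half-open range [lo, hi) into its own slot. */
static void *worker(void *arg) {
    work_t *w = arg;
    long s = 0;
    for (int i = w->lo; i < w->hi; i++)
        s += i;
    w->partial = s;
    return NULL;
}

/* Main-thread side: initialize, fork, join, final processing. */
long parallel_sum(int nthreads) {
    pthread_t tid[nthreads];
    work_t w[nthreads];
    int chunk = N / nthreads;
    for (int t = 0; t < nthreads; t++) {
        w[t].lo = t * chunk + 1;
        w[t].hi = (t == nthreads - 1) ? N + 1 : (t + 1) * chunk + 1;
        pthread_create(&tid[t], NULL, worker, &w[t]);   /* fork */
    }
    long total = 0;
    for (int t = 0; t < nthreads; t++) {
        pthread_join(tid[t], NULL);                     /* join */
        total += w[t].partial;                          /* final processing */
    }
    return total;
}
```

Each worker writes only to its own `work_t` slot, so no locking is needed; the main thread reads the slots only after the corresponding join.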
MPAC Code Structure
[Figure: fork and join code structure of an MPAC application]
MPAC Hello World
• Objective
• Write a simple "Hello World" program using MPAC
• Mechanism
• The user specifies the number of worker threads on the command line
• Each worker thread prints "Hello World" and exits
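A minimal version of the mechanism above, written with plain pthreads rather than MPAC's wrappers (the `run_hello` name and the per-worker id are our additions):

```c
#include <pthread.h>
#include <stdio.h>

/* Each worker prints "Hello World" tagged with its id and exits. */
static void *hello(void *arg) {
    long id = (long)arg;
    printf("Hello World from worker %ld\n", id);
    return NULL;
}

/* Fork nthreads workers, join them all, return how many ran. */
int run_hello(int nthreads) {
    pthread_t tid[nthreads];
    for (long t = 0; t < nthreads; t++)
        if (pthread_create(&tid[t], NULL, hello, (void *)t) != 0)
            return -1;
    for (int t = 0; t < nthreads; t++)
        pthread_join(tid[t], NULL);
    return nthreads;
}
```

Note that the print order is not deterministic: the scheduler decides which worker runs first, which is itself a small illustration of the repeatability issues discussed earlier.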
Compile and Run
• $ cd /<path-to-mpac>/mpac_1.2/apps/hello
• $ make clean
• $ make
• $ ./mpac_hello_app -n <# of threads>