Programming Multi-Core Processors based Embedded Systems: A Hands-On Experience on Cavium Octeon based Platforms
Lab Exercises: Lab 1 (Performance Measurement)
Lab 1: Parallel Programming and Performance Measurement using MPAC
Lab 1 – Goals
• Objective
• Use MPAC benchmarks to measure the performance of different subsystems of a multi-core based system
• Learn to develop parallel programs using MPAC
• Mechanism
• The MPAC CPU and memory benchmarks exercise the processor and memory unit by generating compute-intensive and memory-intensive workloads
What to Look for
• Observations
• Observe the throughput with an increasing number of threads for compute-intensive and memory-intensive workloads
• Identify performance bottlenecks
Measurement of Execution Time
• Measuring the elapsed time from the start of a task until its completion is a straightforward procedure for a sequential task.
• The procedure becomes complex when the same task is executed concurrently by n threads on n distinct processors or cores.
• The threads are not guaranteed to all start, or all complete, at the same time. The measurement is therefore imprecise due to the concurrent nature of the tasks.
Cont…
• Execution time can be measured either globally or locally.
• In global measurement, execution time equals the difference of the time stamps taken at the global fork and join instants.
• Alternatively, local times can be measured and recorded by each of the n threads.
• After the threads join, the maximum of these individual execution times provides an estimate of the overall execution time.
Definitions
• LETE: Local Execution Time Estimation
• GETE: Global Execution Time Estimation
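The LETE reduction described above can be sketched as a small helper: given the n per-thread local times recorded before the join, the overall estimate is their maximum (GETE, by contrast, is a single fork-to-join interval stamped in the main thread). The function name is illustrative, not MPAC's:

```c
/* LETE: estimate overall execution time as the maximum of the
   per-thread local times collected after all threads have joined. */
double lete_estimate(const double *local_times, int n) {
    double max = local_times[0];
    for (int i = 1; i < n; i++)
        if (local_times[i] > max)
            max = local_times[i];
    return max;
}
```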
Cont…
[Figure: timelines contrasting GETE (one global fork-to-join interval) and LETE (per-thread local intervals)]
The Problem
• Lack of precision
• Some tasks finish before others
• Synchronization issues with a large number of cores
• Results are not repeatable
Performance Measurement Methodologies
• For the sequential case: get the start time; repeat the task for N iterations; get the end time.
• For the multithreaded case: each thread (1), (2), (3), ..., (K) gets its start time at a barrier, repeats the task for N iterations, then gets its end time at a barrier.
Accurate LETE Measurement Methodology
• Threads (1), (2), (3), ..., (K) synchronize before each round using a barrier.
• Repeat for N rounds.
• Take the maximum elapsed time across threads for each round.
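The per-round barrier synchronization above can be sketched with POSIX barriers (this is an illustration in plain pthreads, not MPAC's actual infrastructure; `run_rounds` and the round counter are our own names):

```c
#define _POSIX_C_SOURCE 200112L
#include <pthread.h>

#define ROUNDS 3

static pthread_barrier_t bar;
static int rounds_done = 0;

static void *round_worker(void *arg) {
    (void)arg;
    for (int r = 0; r < ROUNDS; r++) {
        /* All threads start the round together; exactly one thread
           (the "serial" thread) counts the round. */
        if (pthread_barrier_wait(&bar) == PTHREAD_BARRIER_SERIAL_THREAD)
            rounds_done++;
        /* ... timed work for this round would go here ... */
        pthread_barrier_wait(&bar); /* all finish before the next round */
    }
    return NULL;
}

/* Fork nthreads workers, run ROUNDS synchronized rounds, join. */
int run_rounds(int nthreads) {
    pthread_t tid[nthreads];
    pthread_barrier_init(&bar, NULL, nthreads);
    for (int t = 0; t < nthreads; t++)
        pthread_create(&tid[t], NULL, round_worker, NULL);
    for (int t = 0; t < nthreads; t++)
        pthread_join(tid[t], NULL);
    pthread_barrier_destroy(&bar);
    return rounds_done;
}
```

In a real benchmark, each thread would stamp its clock right after the first barrier and right before the second, giving one local elapsed time per thread per round.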
Accurate MINMAX Approach
• Repeat for N iterations
• Store each thread's local execution time for every iteration
• For each iteration, store the largest execution time among the threads
• This yields N largest-execution-time values
• Choose the minimum of these N values as your execution time: the MINMAX value!
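The MINMAX reduction in the steps above can be expressed as a small function over the recorded times (a sketch with an illustrative name; the matrix is flattened row-by-row for portability):

```c
/* times is a flattened iters x nthreads matrix:
   times[k * nthreads + t] is thread t's local time in iteration k.
   Take the max across threads per iteration, then the min of those. */
double minmax_time(int iters, int nthreads, const double *times) {
    double minmax = 0.0;
    for (int k = 0; k < iters; k++) {
        double round_max = times[k * nthreads];   /* max over threads */
        for (int t = 1; t < nthreads; t++)
            if (times[k * nthreads + t] > round_max)
                round_max = times[k * nthreads + t];
        if (k == 0 || round_max < minmax)         /* min over iterations */
            minmax = round_max;
    }
    return minmax;
}
```

The inner max captures the slowest thread (the true completion time of a round); the outer min discards rounds inflated by transient noise such as OS scheduling.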
Compile and Run (Memory Benchmark)
• Memory Benchmark
• $ cd /<path-to-mpac>/mpac_1.2
• $ ./configure
• $ make clean
• $ make
• $ cd benchmarks/mem
• $ ./mpac_mem_bm -n <# of threads> -s <array size> -r <# of repetitions> -t <data type>
• For help
• $ ./mpac_mem_bm -h
Compile and Run (CPU Benchmark)
• CPU Benchmark
• $ cd /<path-to-mpac>/mpac_1.2
• $ ./configure
• $ make clean
• $ make
• $ cd benchmarks/cpu
• $ ./mpac_cpu_bm -n <# of threads> -r <# of iterations>
• For help
• $ ./mpac_cpu_bm -h
Performance Measurements (CPU)
• The integer unit (summation), floating-point unit (sine), and logical unit (string operations) of the processor are exercised.
• Intel Xeon, AMD Opteron (x86), and Cavium Octeon (MIPS64) machines are used as the systems under test (SUTs).
• Throughput scales linearly with the number of threads in all cases.
Performance Measurements (Memory)
• With concurrent symmetric threads, one expects memory-to-memory copy throughput to scale with the number of threads.
• With data sizes of 4 KB, 16 KB, and 1 MB, most memory accesses should hit the L2 cache rather than main memory.
• For these cases the throughput scales linearly.
Performance Measurements (Memory)
• Copying 16 MB requires extensive main-memory accesses.
• On the Intel SUT a shared bus is used, so throughput is lower than in the cases where accesses hit the L2 cache, and it saturates as the bus becomes a bottleneck.
• Memory copy throughput saturates at around 40 Gbps, roughly half of the available bus bandwidth (64 bits x 1333 MHz = 85.3 Gbps).
• For the AMD and Cavium based SUTs, throughput scales linearly even in the 16 MB case, because these processors use efficient low-latency memory controllers instead of a shared system bus.
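The bus-bandwidth arithmetic quoted above (64 bits x 1333 MHz = 85.3 Gbps) is just width times clock; a one-line helper makes the unit conversion explicit (function name is ours):

```c
/* Theoretical peak bus bandwidth in Gbps: bus width in bits times
   bus clock in MHz, converted from bit*MHz to Gbps (x 1e6 / 1e9). */
double bus_bandwidth_gbps(int bus_bits, double clock_mhz) {
    return (double)bus_bits * clock_mhz * 1e6 / 1e9;
}
```

For a 64-bit bus at 1333 MHz this gives about 85.3 Gbps, matching the figure on the slide; the observed ~40 Gbps copy throughput is roughly half of that peak.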
MPAC Fork and Join Infrastructure
• In MPAC based applications, initialization and argument handling are performed by the main thread.
• The tasks to be run in parallel are forked to worker threads.
• The worker threads join after completing their tasks.
• Final processing is done by the main thread.
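The fork-join pattern described above can be sketched in plain POSIX threads (this is not MPAC's actual API; `parallel_sum` and `work_t` are illustrative names). The main thread sets up per-worker arguments, forks, joins, and then does the final processing, here a reduction of partial sums of 1..100:

```c
#include <pthread.h>

#define N 100  /* sum the integers 1..N in parallel */

typedef struct { int lo, hi; long partial; } work_t;

/* Worker: sum its half-open range [lo, hi) into its own slot. */
static void *worker(void *arg) {
    work_t *w = arg;
    long s = 0;
    for (int i = w->lo; i < w->hi; i++)
        s += i;
    w->partial = s;
    return NULL;
}

/* Main-thread side: initialize, fork, join, final processing. */
long parallel_sum(int nthreads) {
    pthread_t tid[nthreads];
    work_t w[nthreads];
    int chunk = N / nthreads;
    for (int t = 0; t < nthreads; t++) {
        w[t].lo = t * chunk + 1;
        w[t].hi = (t == nthreads - 1) ? N + 1 : (t + 1) * chunk + 1;
        pthread_create(&tid[t], NULL, worker, &w[t]);   /* fork */
    }
    long total = 0;
    for (int t = 0; t < nthreads; t++) {
        pthread_join(tid[t], NULL);                     /* join */
        total += w[t].partial;                          /* final processing */
    }
    return total;
}
```

Each worker writes only to its own `work_t` slot, so no locking is needed; the main thread reads the slots only after the corresponding join.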
MPAC Code Structure
[Figure: fork and join code structure of an MPAC application]
MPAC Hello World
• Objective
• Write a simple "Hello World" program using MPAC
• Mechanism
• The user specifies the number of worker threads on the command line
• Each worker thread prints "Hello World" and exits
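A minimal version of the mechanism above, written with plain pthreads rather than MPAC's wrappers (the `run_hello` name and the per-worker id are our additions):

```c
#include <pthread.h>
#include <stdio.h>

/* Each worker prints "Hello World" tagged with its id and exits. */
static void *hello(void *arg) {
    long id = (long)arg;
    printf("Hello World from worker %ld\n", id);
    return NULL;
}

/* Fork nthreads workers, join them all, return how many ran. */
int run_hello(int nthreads) {
    pthread_t tid[nthreads];
    for (long t = 0; t < nthreads; t++)
        if (pthread_create(&tid[t], NULL, hello, (void *)t) != 0)
            return -1;
    for (int t = 0; t < nthreads; t++)
        pthread_join(tid[t], NULL);
    return nthreads;
}
```

Note that the print order is not deterministic: the scheduler decides which worker runs first, which is itself a small illustration of the repeatability issues discussed earlier.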
Compile and Run
• $ cd /<path-to-mpac>/mpac_1.2/apps/hello
• $ make clean
• $ make
• $ ./mpac_hello_app -n <# of threads>