
Lab Exercises: Lab 1 (Performance measurement)




Presentation Transcript


  1. Programming Multi-Core Processors based Embedded Systems: A Hands-On Experience on Cavium Octeon based Platforms. Lab Exercises: Lab 1 (Performance measurement)

  2. Lab #1: Parallel Programming and Performance Measurement using MPAC

  3. Lab 1 – Goals
     • Objective
       • Use MPAC benchmarks to measure the performance of different subsystems of multi-core based systems
       • Use MPAC to learn to develop parallel programs
     • Mechanism
       • MPAC CPU and memory benchmarks exercise the processor and memory unit by generating compute- and memory-intensive workloads

  4. What to Look for
     • Observations
       • Observe the throughput with an increasing number of threads for compute- and memory-intensive workloads
       • Identify performance bottlenecks

  5. Measurement of Execution Time
     • Measuring the elapsed time from the start of a task until its completion is straightforward for a sequential task.
     • The procedure becomes complex when the same task is executed concurrently by n threads on n distinct processors or cores.
     • There is no guarantee that all tasks start at the same time or complete at the same time. The measurement is therefore imprecise due to the concurrent nature of the tasks.

  6. Cont…
     • Execution time can be measured either globally or locally.
     • For global measurement, the execution time is the difference between the time stamps taken at the global fork and join instants.
     • Local times can be measured and recorded by each of the n threads.
     • After the threads join, the maximum of these individual execution times provides an estimate of the overall execution time.
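The two measurement styles described above can be sketched in a few lines. This is a minimal illustration that uses Python's standard `threading` module as a stand-in for MPAC; the names `measure` and `busy_work` are hypothetical and are not part of MPAC.

```python
import threading
import time


def measure(num_threads, work):
    """Run work() on num_threads threads and return (global, local) estimates in seconds."""
    local_times = [0.0] * num_threads

    def worker(idx):
        start = time.perf_counter()
        work()
        # Each thread records its own elapsed time (local measurement).
        local_times[idx] = time.perf_counter() - start

    threads = [threading.Thread(target=worker, args=(i,)) for i in range(num_threads)]

    global_start = time.perf_counter()  # time stamp at the global fork
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    global_elapsed = time.perf_counter() - global_start  # time stamp at the global join

    # The slowest thread gives the local estimate of the overall execution time.
    local_elapsed = max(local_times)
    return global_elapsed, local_elapsed


def busy_work():
    """Hypothetical compute-bound task used only for this illustration."""
    total = 0
    for i in range(100_000):
        total += i
    return total


gete, lete = measure(4, busy_work)
print(f"global estimate = {gete:.6f} s, local estimate = {lete:.6f} s")
```

The global figure also includes thread creation and join overhead, so it is never smaller than the local one in this sketch.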

  7. Definitions
     • LETE: Local Execution Time Estimation
     • GETE: Global Execution Time Estimation

  8. Cont… [Figure: GETE and LETE]

  9. The Problem
     • Lack of precision
     • Some tasks finish before others
     • Synchronization issues with a large number of cores
     • Results are not repeatable

  10. Performance Measurement Methodologies
     • For sequential case
       • Get start time
       • Repeat for N iterations
       • Get end time
     • For multithreaded case
       • Get start time at the barrier
       • Threads (1), (2), (3), ..., (K) each repeat for N iterations
       • Get end time at the barrier

  11. Accurate LETE Measurement Methodology
     • Threads (1), (2), (3), ..., (K) synchronize at a barrier before each round
     • Repeat for N rounds
     • Record the maximum elapsed time for each round

  12. Measurement Observations

  13. Accurate MINMAX Approach
     • Repeat for N iterations
     • Store the thread-local execution time of each thread for each iteration
     • For each individual iteration, store the largest execution time among the threads
     • This yields N largest-execution-time values
     • Choose the minimum of those N values as the execution time: the MINMAX value
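The MINMAX procedure, combined with barrier synchronization before each round, can be sketched as follows. This is an illustrative sketch built on Python's standard `threading.Barrier`, not MPAC's implementation; `measure_rounds` and `minmax` are hypothetical names.

```python
import threading
import time


def measure_rounds(num_threads, num_rounds, work):
    """Return times[r][t]: local execution time of thread t in round r."""
    times = [[0.0] * num_threads for _ in range(num_rounds)]
    barrier = threading.Barrier(num_threads)

    def worker(idx):
        for r in range(num_rounds):
            barrier.wait()  # all threads start the round together
            start = time.perf_counter()
            work()
            times[r][idx] = time.perf_counter() - start

    threads = [threading.Thread(target=worker, args=(i,)) for i in range(num_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return times


def minmax(times):
    """Take the largest time in each round, then the smallest of those maxima."""
    return min(max(round_times) for round_times in times)


times = measure_rounds(4, 5, lambda: sum(range(50_000)))
print(f"MINMAX execution time: {minmax(times):.6f} s")
```

Taking the per-round maximum accounts for the slowest thread, while taking the minimum across rounds discards rounds inflated by transient interference such as scheduling noise.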

  14. Compile and Run (Memory Benchmark)
     • Memory Benchmark
       • $ cd /<path-to-mpac>/mpac_1.2
       • $ ./configure
       • $ make clean
       • $ make
       • $ cd benchmarks/mem
       • $ ./mpac_mem_bm -n <# of Threads> -s <array size> -r <# of repetitions> -t <data type>
     • For Help
       • $ ./mpac_mem_bm -h

  15. Compile and Run (CPU Benchmark)
     • CPU Benchmark
       • $ cd /<path-to-mpac>/mpac_1.2
       • $ ./configure
       • $ make clean
       • $ make
       • $ cd benchmarks/cpu
       • $ ./mpac_cpu_bm -n <# of Threads> -r <# of Iterations>
     • For Help
       • $ ./mpac_cpu_bm -h

  16. Performance Measurements (CPU)
     • The Integer Unit (summation), Floating Point Unit (sine) and Logical Unit (string operation) of the processor are exercised.
     • Intel Xeon, AMD Opteron (x86) and Cavium Octeon (MIPS64) are used as the Systems under Test (SUT).
     • Throughput scales linearly with the number of threads in all cases.

  17. Performance Measurements (Memory)
     • With concurrent symmetric threads, one expects the memory-to-memory throughput to scale with the number of threads.
     • With data sizes of 4 KB, 16 KB and 1 MB, most memory accesses should hit the L2 caches rather than main memory.
     • For these cases the throughput scales linearly.

  18. Performance Measurements (Memory)
     • Copying 16 MB requires extensive main-memory accesses.
     • On the Intel SUT a shared bus is used. Throughput is therefore lower than in the cases where accesses hit the L2 caches, and it saturates as the bus becomes a bottleneck.
     • Memory copy throughput saturates at around 40 Gbps, which is roughly half of the available bus bandwidth (64 bits x 1333 MHz = 85.3 Gbps).
     • For the AMD and Cavium based SUTs, throughput scales linearly in the 16 MB case because they use more efficient, low-latency memory controllers rather than a shared system bus.
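The bus bandwidth figure quoted above can be checked with a short calculation. The sketch below assumes that the 1333 MHz figure means 1333 million 64-bit transfers per second; the function name is hypothetical.

```python
def peak_bus_bandwidth_gbps(bus_width_bits, transfers_per_second):
    """Peak bandwidth in Gbps = bus width (bits) x transfer rate (transfers/s) / 1e9."""
    return bus_width_bits * transfers_per_second / 1e9


peak = peak_bus_bandwidth_gbps(64, 1333e6)  # 64-bit bus at 1333 MHz
measured = 40.0                             # saturation throughput from the slide, in Gbps
utilization = measured / peak

print(f"peak = {peak:.1f} Gbps, utilization = {utilization:.0%}")
# prints "peak = 85.3 Gbps, utilization = 47%"
```

The observed 40 Gbps is therefore a little under half of the theoretical peak.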

  19. MPAC fork and join infrastructure
     • In MPAC based applications, initialization and argument handling are performed by the main thread.
     • The tasks to be run in parallel are forked to worker threads.
     • The worker threads join after completing their tasks.
     • Final processing is done by the main thread.
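The fork and join pattern described above can be illustrated without MPAC itself. The following sketch uses Python's standard `threading` module; `fork_join` and `partial_sum` are hypothetical names chosen for this example and do not reflect MPAC's API.

```python
import threading


def fork_join(num_threads, task):
    """Fork task(idx) onto num_threads worker threads, join them, return their results."""
    results = [None] * num_threads

    def worker(idx):
        results[idx] = task(idx)  # each worker writes only to its own slot

    threads = [threading.Thread(target=worker, args=(i,)) for i in range(num_threads)]
    for t in threads:  # fork: hand the parallel task to the workers
        t.start()
    for t in threads:  # join: wait until every worker has finished
        t.join()
    return results


def partial_sum(idx, chunk=1000):
    """Hypothetical task: sum one contiguous chunk of integers."""
    start = idx * chunk
    return sum(range(start, start + chunk))


# Main thread: initialization, fork/join, then final processing.
num_threads = 4
partials = fork_join(num_threads, partial_sum)
total = sum(partials)  # final processing is done by the main thread
print(total)
```

Giving each worker its own result slot avoids the need for a lock, because no two threads write to the same location.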

  20. MPAC code structure

  21. MPAC Hello World
     • Objective
       • To write a simple "Hello World" program using MPAC
     • Mechanism
       • The user specifies the number of worker threads on the command line
       • Each worker thread prints "Hello World" and exits
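The mechanism of the Hello World exercise can be mirrored in a few lines. This is only an analogue written with Python's standard `argparse` and `threading` modules, not the MPAC application; `parse_args` and `run_hello` are hypothetical names, and the thread index in the message is an addition made here so that the output lines can be told apart.

```python
import argparse
import threading


def parse_args(argv):
    """Parse a -n option giving the number of worker threads."""
    parser = argparse.ArgumentParser(description="Threaded Hello World sketch")
    parser.add_argument("-n", dest="num_threads", type=int, default=1,
                        help="number of worker threads")
    return parser.parse_args(argv)


def run_hello(num_threads):
    """Start num_threads workers; each prints one greeting and exits."""
    lock = threading.Lock()
    messages = []

    def worker(idx):
        message = f"Hello World from thread {idx}"
        with lock:  # keep printed lines from interleaving
            messages.append(message)
            print(message)

    threads = [threading.Thread(target=worker, args=(i,)) for i in range(num_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return messages


args = parse_args(["-n", "4"])  # equivalent to passing "-n 4" on the command line
run_hello(args.num_threads)
```

The order of the printed lines depends on thread scheduling and may differ between runs.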

  22. Compile and Run
     • $ cd /<path-to-mpac>/mpac_1.2/apps/hello
     • $ make clean
     • $ make
     • $ ./mpac_hello_app -n <# of Threads>
