
Parallel Computing



  1. Parallel Computing CS 6021/01 Advanced Computer Architecture Final Project Spring 2019 • Group 1 • Hu Longhua • Shweta Khandal • Vannel Zeufack

  2. Plan • Introduction • Concepts • Parallel Computing Memory Architectures • Parallel Programming Models • References

  3. Introduction • Def: the simultaneous use of multiple compute resources to solve a computing problem. • Necessary because of the power wall: single-processor performance is limited by heat dissipation • We can solve • Larger problems • Faster

  4. Amdahl’s law • 1 server -> 20 customers per hour • 2 servers -> 40 customers per hour • 3 servers -> 60 customers per hour • This linear scaling is only true if • Servers serve at the same speed • Servers do not share resources

  5. Amdahl’s Law
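For reference, the standard statement of Amdahl's law, with P the fraction of the program that can be parallelized and N the number of processors:

\[ S(N) = \frac{1}{(1 - P) + \frac{P}{N}} \]

As N grows, the speedup is bounded by 1 / (1 - P): the serial fraction dominates.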

  6. Amdahl’s Law (Examples)
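A representative worked example, under the assumption that 90% of a program is parallelizable (P = 0.9):

\[ S(4) = \frac{1}{0.1 + 0.9/4} \approx 3.1, \qquad \lim_{N \to \infty} S(N) = \frac{1}{0.1} = 10 \]

Even with unlimited processors, the 10% serial portion caps the speedup at 10x.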

  7. Parallelism vs Concurrency Concurrency: managing the execution of multiple tasks such that they seem to be occurring at the same time. Parallelism: running two tasks at the exact same time

  8. Parallelism vs Concurrency • Concurrency: tasks are interleaved, running one at a time, so only one processor is needed • Parallelism: tasks run at the same time, so multiple processors are needed

  9. Types of parallelism • Bit-Level • Based on increasing CPU word size (from 4-bit to 64-bit microprocessors) • Reduces the number of instructions the processor must execute to perform an operation on variables whose sizes are greater than the word length • Instruction-Level Parallelism • Based on simultaneous execution of many instructions • Can occur both at the hardware level (chips) and the software level (compilers) • Task-Level Parallelism • Running many different tasks at the same time on the same data • A task (process/thread) is a unit of execution and is made of many instructions • Data-Level Parallelism • Running the same task on different data at the same time (see the sketch after this slide)
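A minimal C sketch of the last two categories, using OpenMP directives (OpenMP is introduced later in the deck); the arrays and the task_a/task_b functions are illustrative assumptions:

```c
#include <omp.h>
#include <stdio.h>

#define N 1000

/* Two independent, illustrative tasks for the task-parallel part. */
static void task_a(void) { printf("task A on thread %d\n", omp_get_thread_num()); }
static void task_b(void) { printf("task B on thread %d\n", omp_get_thread_num()); }

int main(void) {
    double a[N], b[N], c[N];
    for (int i = 0; i < N; i++) { a[i] = i; b[i] = 2.0 * i; }

    /* Data-level parallelism: the same operation applied to different data. */
    #pragma omp parallel for
    for (int i = 0; i < N; i++)
        c[i] = a[i] + b[i];

    /* Task-level parallelism: different tasks running at the same time. */
    #pragma omp parallel sections
    {
        #pragma omp section
        task_a();
        #pragma omp section
        task_b();
    }

    printf("c[N-1] = %f\n", c[N - 1]);
    return 0;
}
```

Compile with an OpenMP flag such as gcc -fopenmp.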

  10. Parallel Computing Memory Architectures

  11. Memory Architectures • Shared Memory • Uniform Memory Access (UMA) • Non-Uniform Memory Access (NUMA) • Distributed Memory • Hybrid Architecture • Hybrid Architecture with Accelerators (co-processors) • GPGPU (General Purpose Graphics Processing Unit) • MIC (Many Integrated Core)

  12. Shared Memory • Multiple processors can operate independently but share the same memory resources. • Changes in a memory location made by one processor are visible to all other processors. • Classified as UMA (Uniform Memory Access)  and NUMA (Non-uniform Memory Access), based upon memory access times.

  13. Shared Memory: Uniform Memory Access (UMA) • Identical processors • Equal access times to memory • Sometimes called CC-UMA (Cache Coherent UMA). • Most commonly represented today by Symmetric Multiprocessor (SMP) machines

  14. Shared Memory: Non-Uniform Memory Access (NUMA) • Often made by physically linking two or more SMPs (Symmetric Multiprocessors) • One SMP can directly access memory of another SMP • Processors do not have equal access time to memories • Memory access across link is slower

  15. Shared Memory: Pros and Cons • Advantages • User-friendly programming perspective to memory • Data sharing between tasks is both fast and uniform due to the proximity of memory to CPUs • Disadvantages • Lack of scalability between memory and CPUs • Adding more CPUs can increase traffic on the shared memory-CPU path • For cache coherent systems, it can also increase traffic associated with cache/memory management.

  16. Distributed Memory • Memory is local to each processor • Data exchanged by message passing over a network • Because each processor has its own local memory, it operates independently. Hence, the concept of cache coherency does not apply • The network “fabric” used for data transfer varies widely, though it can be as simple as Ethernet.

  17. Distributed Memory: Pros and Cons • Advantages • Memory is scalable with the number of processors. Increase the number of processors and the size of memory increases proportionately. • Each processor can rapidly access its own memory. • Cost effectiveness. • Disadvantages • The programmer is responsible for many of the details associated with data communication between processors. • It may be difficult to map existing data structures, based on global memory, to this memory organization. • Non-uniform memory access times

  18. Hybrid Architecture • The largest and fastest computers in the world today employ both shared and distributed memory architectures. • The shared memory component can be a shared memory machine and/or graphics processing units (GPU) • Network communications are required to move data from one machine to another

  19. Hybrid Architecture with accelerators • Why we need accelerators or co-processors • Clock frequency is limited by power requirements and heat dissipation restrictions (an unmanageable problem). • The number of cores per chip keeps increasing. • In HPC, we need chips that provide higher computing performance at lower energy consumption.

  20. Hybrid Architecture with accelerators • How to solve it • The practical solution is a hybrid system containing both CPUs and “accelerators”, plus other forms of parallelism such as vector instruction support. • It is widely accepted that hybrid systems with accelerators deliver the highest-performance, most energy-efficient computing in HPC. • The most common accelerators are the MIC (Many Integrated Core) and the GPGPU (General Purpose Graphics Processing Unit).

  21. Accelerated (GPGPU and MIC) Systems • Accelerator (or co-processor): a computer processor used to supplement the functions of the primary processor (the CPU), allowing even greater parallelism. • GPGPU (General Purpose Graphics Processing Unit) • Derived from graphics hardware • Requires a new programming model and specific libraries and compilers (CUDA, OpenCL) • Newer GPUs support the IEEE 754-2008 floating point standard • Does not support flow control (handled by the host thread) • MIC (Many Integrated Core) • Derived from traditional CPU hardware • Based on the x86 instruction set • Supports multiple programming models (OpenMP, MPI, OpenCL) • Flow control can be handled on the accelerator

  22. CPU vs MIC vs GPU Architecture Comparison • CPU: general-purpose architecture • MIC: power-efficient multiprocessor x86 design • GPU: massively data parallel

  23. Hybrid Architecture with accelerators (GPGPU and MIC) • Calculations are made on both the CPU and the accelerator • Accelerators provide an abundance of low-cost flops • CPU and accelerator typically communicate over the PCI-e bus • Load balancing is critical for performance

  24. Parallel Programming Models

  25. Parallel Programming Models • Shared Memory Model without threads • Shared Memory Model with threads • Distributed Memory Model with Message Passing Interface • Hybrid Model

  26. Shared Memory Model (without threads) • Simplest parallel programming model • Processes/tasks share a common address space, which they read and write to asynchronously • Locks/semaphores are used to control access to the shared memory, resolve contentions and prevent race conditions and deadlocks. • Examples • POSIX standard provides an API to implement shared memory model • UNIX provides shared memory segments (shmget, shmat, shmctl, etc)
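A minimal C sketch of the UNIX shared memory segment calls named above (shmget, shmat, shmdt, shmctl); the parent/child layout and the value written are illustrative assumptions:

```c
#include <stdio.h>
#include <sys/ipc.h>
#include <sys/shm.h>
#include <sys/wait.h>
#include <unistd.h>

int main(void) {
    /* Create a private shared memory segment big enough for one int. */
    int shmid = shmget(IPC_PRIVATE, sizeof(int), IPC_CREAT | 0600);
    int *counter = shmat(shmid, NULL, 0);   /* attach it to our address space */
    *counter = 0;

    if (fork() == 0) {          /* child: write to the shared segment */
        *counter = 42;
        shmdt(counter);
        _exit(0);
    }

    wait(NULL);                 /* parent: wait for the child, then read */
    printf("child wrote %d\n", *counter);

    shmdt(counter);
    shmctl(shmid, IPC_RMID, NULL);   /* remove the segment */
    return 0;
}
```

Note that there is no locking here; with concurrent writers, semaphores or other synchronization would be needed, as the slide points out.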

  27. Shared Memory Model without threads • Advantages • No need to explicitly specify the communication of data between tasks. • Simplest model • Disadvantages • Data locality issues • Risk of deadlocks and race conditions

  28. Threads Model • Shared memory programming model but using threads •  Threads implementations commonly comprise: • A library of subroutines that are called from within parallel source code • A set of compiler directives embedded in either serial or parallel source code

  29. Types of Thread Model

  30. POSIX Threads • Specified by the IEEE POSIX 1003.1c standard (1995). C Language only. • Part of Unix/Linux operating systems • Library based • Commonly referred to as Pthreads. • Very explicit parallelism; requires significant programmer attention to detail.

  31. POSIX Threads

  32. Pthread • The subroutines which comprise the Pthreads API can be informally grouped into four major classes: • Thread management: routines that work directly on threads - creating, detaching, joining, etc. • Mutexes: routines that deal with synchronization. Mutex functions provide for creating, destroying, locking and unlocking mutexes. • Condition variables: routines that address communications between threads that share a mutex. • Synchronization: routines that manage read/write locks and barriers • A major disadvantage is the risk of deadlocks (see the sketch after this slide).
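A minimal Pthreads sketch in C covering two of the classes above, thread management and mutexes; the thread count, loop bound, and worker function are illustrative assumptions:

```c
#include <pthread.h>
#include <stdio.h>

#define NTHREADS 4

static long counter = 0;
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

/* Each thread increments the shared counter, taking the mutex around each update. */
static void *worker(void *arg) {
    (void)arg;
    for (int i = 0; i < 100000; i++) {
        pthread_mutex_lock(&lock);
        counter++;
        pthread_mutex_unlock(&lock);
    }
    return NULL;
}

int main(void) {
    pthread_t threads[NTHREADS];

    for (int i = 0; i < NTHREADS; i++)      /* thread management: create */
        pthread_create(&threads[i], NULL, worker, NULL);
    for (int i = 0; i < NTHREADS; i++)      /* thread management: join */
        pthread_join(threads[i], NULL);

    printf("counter = %ld\n", counter);     /* expect 400000 */
    return 0;
}
```

Compile with a flag such as gcc -pthread. Without the mutex, the increments would race and the final count would be unpredictable.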

  33. OpenMP • Industry standard, jointly defined and endorsed by a group of major computer hardware and software vendors, organizations and individuals. • Compiler directive based • Portable / multi-platform, including Unix and Windows platforms • Available in C/C++ and Fortran implementations • Can be very easy and simple to use - provides for "incremental parallelism". Can begin with serial code.
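A minimal OpenMP sketch in C illustrating the “incremental parallelism” point: the serial loop is parallelized by adding a single compiler directive (the array size and contents are illustrative assumptions):

```c
#include <omp.h>
#include <stdio.h>

#define N 1000000

static double a[N], b[N], c[N];

int main(void) {
    /* One directive turns the serial loop into a parallel one;
       iterations are divided among the team of threads. */
    #pragma omp parallel for
    for (int i = 0; i < N; i++) {
        a[i] = i;
        b[i] = 2.0 * i;
        c[i] = a[i] + b[i];
    }

    printf("c[N-1] = %f, max threads = %d\n", c[N - 1], omp_get_max_threads());
    return 0;
}
```

Compile with a flag such as gcc -fopenmp; removing the directive gives back the original serial program.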

  34. OpenMP Fork-Join Model

  35. Fork-Join Model (OpenMP-style fork-join) in Java • The slide shows three code variants: two with an incorrect ordering and one with the correct ordering

  36. Distributed Memory / Message Passing Model • Tasks use their own local memory • Tasks exchange data through communications by sending and receiving messages. • Data transfer usually requires cooperative operations to be performed by each process. • From a programming perspective, message passing implementations usually comprise a library of subroutines. • The programmer is responsible for determining all parallelism.

  37. Distributed Memory / Message Passing Model • Point-to-point communication • Thread safety • MPI is mainly used for portable parallel programs, parallel libraries, and irregular or dynamic data relationships that do not fit the data parallel model (see the sketch after this slide).
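A minimal MPI point-to-point sketch in C, assuming exactly two processes are launched (e.g. mpirun -np 2); the message value is an illustrative assumption:

```c
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        int value = 42;
        /* Rank 0 sends one integer to rank 1, message tag 0. */
        MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        int value;
        /* Rank 1 posts the matching receive: cooperative operations on both sides. */
        MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        printf("rank 1 received %d from rank 0\n", value);
    }

    MPI_Finalize();
    return 0;
}
```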

  38. Hybrid Model • A hybrid model combines more than one of the previously described programming models • Currently, a common example of a hybrid model is the combination of the message passing model (MPI) with the threads model (OpenMP). • Threads perform computationally intensive kernels using local, on-node data • Communications between processes on different nodes occur over the network using MPI (see the sketch after this slide)
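A minimal hybrid MPI + OpenMP sketch in C along the lines described above: OpenMP threads do the on-node work, and MPI combines the per-process results over the network; the sum being computed is an illustrative assumption:

```c
#include <mpi.h>
#include <omp.h>
#include <stdio.h>

int main(int argc, char **argv) {
    int provided, rank;

    /* Ask MPI for thread support so OpenMP threads can coexist with MPI calls. */
    MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    double local_sum = 0.0, global_sum = 0.0;

    /* On-node: OpenMP threads handle the compute-intensive kernel on local data. */
    #pragma omp parallel for reduction(+:local_sum)
    for (int i = 0; i < 1000000; i++)
        local_sum += (double)i;

    /* Across nodes: MPI sums the per-process results over the network. */
    MPI_Reduce(&local_sum, &global_sum, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);

    if (rank == 0)
        printf("global sum = %f\n", global_sum);

    MPI_Finalize();
    return 0;
}
```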

  39. Hybrid Model • MPI with CPU-GPU (Graphics Processing Unit) programming: another similar and increasingly popular example of a hybrid model • MPI tasks run on CPUs using local memory and communicating with each other over a network. • Computationally intensive kernels are off-loaded to GPUs on-node. • Data exchange between node-local memory and GPUs uses CUDA (or something equivalent).

  40. References • Introduction to parallel computing • https://en.wikipedia.org/wiki/Parallel_computing#Fine-grained,_coarse-grained,_and_embarrassing_parallelism • https://computing.llnl.gov/tutorials/parallel_comp/ • Parallel vs Concurrent Programming • https://www.youtube.com/watch?v=ltTQaMSk6ME • https://www.youtube.com/watch?v=FChZP09Ba4E • GPU vs ManyCore • https://www.greymatter.com/corporate/hardcopy-article/gpu-vs-manycore/ • MIC & GPU Architecture • https://www.lrz.de/services/compute/courses/x_lecturenotes/MIC_GPU_Workshop/MIC-AND-GPU-2015.pdf • Parallel Programming Models • http://apiacoa.org/teaching/big-data/smp.en.html • https://www.cs.uky.edu/~jzhang/CS621/chapter9.pdf • http://hpcg.purdue.edu/bbenes/classes/CGT%20581-I/lectures/CGT%20581-I-01-Introduction.pdf • file:///C:/Users/Lenovo/Downloads/BPTX_2013_2_11320_0_378526_0_153227.pdf • https://homes.cs.washington.edu/~djg/teachingMaterials/spac/grossmanSPAC_forkJoinFramework.html • https://www.mpi-forum.org/docs/

  41. Thanks for your attention! Questions?
