
Chapter 3


alina

Presentation Transcript


  1. Chapter 3 Parallel Programming Models

  2. Abstraction • Machine Level • Looks at hardware, OS, buffers • Architectural models • Looks at interconnection network, memory organization, synchronous & asynchronous • Computational Model • Cost models, algorithm complexity, RAM vs. PRAM • Programming Model • Uses programming language description of process

  3. Control Flows • Process • Address spaces differ – Distributed • Thread • Shares an address space – Shared Memory • Created statically (as in MPI-1) or dynamically at run time (MPI-2 allows this, as do pthreads).

  4. Parallelization of a Program • Decomposition of the computations • Can be done at many levels (e.g., pipelining). • Divide into tasks and identify dependencies between tasks. • Can be done statically (at compile time) or dynamically (at run time) • Number of tasks places an upper bound on the parallelism that can be used • Granularity: the computation time of a task

  5. Assignment of Tasks • The number of processes or threads does not need to be the same as the number of processors • Load Balancing: each process/thread having the same amount of work (computation, memory access, communication) • Have tasks that use the same memory execute on the same thread (good cache use) • Scheduling: assignment of tasks to threads/processes

  6. Assignment to Processors • 1-1: map a process/thread to a unique processor • many to 1: map several processes to a single processor. (Load balancing issues) • Done by the OS or by the programmer

  7. Scheduling • Precedence constraints • Dependencies between tasks • Capacity constraints • A fixed number of processors • Want to meet constraints and finish in minimum time

  8. Levels of Parallelism • Instruction Level • Data Level • Loop Level • Functional level

  9. Instruction Level • Executing multiple instructions in parallel. May have problems with dependencies • Flow dependency – if the next instruction needs a value computed by the previous instruction • Anti-dependency – if an instruction uses a value from a register or memory location and the next instruction stores a value into that place (cannot reverse the order of the instructions) • Output dependency – 2 instructions store into the same location
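
The three dependency kinds above can be told apart from the locations each instruction reads and writes. The following is a minimal sketch under that assumption; the function name `classify_dependencies` and the `(reads, writes)` representation are hypothetical and not taken from the slides.

```python
def classify_dependencies(first, second):
    """Return the kinds of dependency from `first` to the later `second`.

    Each instruction is modelled as a (reads, writes) pair of sets of
    register or memory location names (an illustrative representation).
    """
    reads1, writes1 = first
    reads2, writes2 = second
    kinds = set()
    if writes1 & reads2:
        kinds.add("flow")    # second needs a value computed by first
    if reads1 & writes2:
        kinds.add("anti")    # second overwrites a location first still reads
    if writes1 & writes2:
        kinds.add("output")  # both store into the same location
    return kinds


# r1 = r2 + r3, followed by r4 = r1 * 2
add = ({"r2", "r3"}, {"r1"})
mul = ({"r1"}, {"r4"})
```

An empty result means the two instructions could be reordered or issued together as far as these three rules are concerned.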

  10. Data Level • Same process applied to different elements of a large data structure • If these are independent, then the data can be distributed among the processors • One single control flow • SIMD

  11. Loop Level • If there are no dependencies between the iterations of a loop, then each iteration can be done independently, in parallel • Similar to data parallelism
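
A loop whose iterations are independent can be handed to a pool of workers. This sketch uses Python's `concurrent.futures.ThreadPoolExecutor`; the helper name `parallel_map` and the sample arrays are illustrative assumptions. In standard CPython builds the global interpreter lock keeps threads from speeding up pure-Python arithmetic, so the example shows the structure rather than a speedup.

```python
from concurrent.futures import ThreadPoolExecutor


def parallel_map(body, n, workers=4):
    """Run body(i) for i in range(n); the iterations must be independent."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map() returns the results in iteration order
        return list(pool.map(body, range(n)))


a = [1, 2, 3, 4, 5]
b = [10, 20, 30, 40, 50]
# Iteration i touches only a[i] and b[i], so no iteration depends on another.
c = parallel_map(lambda i: a[i] + b[i], len(a))
```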

  12. Functional Level • Look at the parts of a program and determine which parts can be done independently. • Use a dependency graph to find the dependencies/independencies • Static or Dynamic assignment of tasks to processors • Dynamic would use a task pool
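
One way to read independent parts off a dependency graph is to peel it into levels: every task in a level depends only on tasks in earlier levels, so the tasks within one level could run in parallel. The function name `parallel_levels` and the dictionary representation are assumptions made for this sketch.

```python
def parallel_levels(deps):
    """Group tasks into levels of mutually independent tasks.

    `deps` maps each task to the set of tasks it must wait for. Every
    prerequisite must itself appear as a key of `deps`.
    """
    remaining = {task: set(pre) for task, pre in deps.items()}
    levels = []
    while remaining:
        # Tasks with no unfinished prerequisites are ready to run now.
        ready = sorted(t for t, pre in remaining.items() if not pre)
        if not ready:
            raise ValueError("cyclic or unsatisfiable dependency")
        levels.append(ready)
        for t in ready:
            del remaining[t]
        for pre in remaining.values():
            pre.difference_update(ready)
    return levels


graph = {"a": set(), "b": set(), "c": {"a", "b"}, "d": {"c"}, "e": {"a"}}
```

With a static assignment, each level could be split among the processors ahead of time; a dynamic assignment would instead place ready tasks in a task pool.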

  13. Explicit/Implicit Parallelism Expression • Language dependent • Some languages hide the parallelism in the language • For some languages, you must explicitly state the parallelism

  14. Parallelizing Compilers • Takes a program in a sequential language and generates parallel code • Must analyze the dependencies and not violate them • Should provide good load balancing (difficult) • Minimize communication • Functional Programming Languages • Express computations as evaluation of a function with no side effects • Allows for parallel evaluation

  15. More explicit/implicit • Explicit parallelism/implicit distribution • The language explicitly states the parallelism in the algorithm, but allows the system to assign the tasks to processors. • Explicit assignment to processors – do not have to worry about communications • Explicit Communication and Synchronization • MPI – additionally must explicitly state communications and synchronization points

  16. Parallel Programming Patterns • Process/Thread Creation • Fork-Join • Parbegin-Parend • SPMD, SIMD • Master-Slave (worker) • Client-Server • Pipelining • Task Pools • Producer-Consumer

  17. Process/Thread Creation • Static or Dynamic • Threads, traditionally dynamic • Processes, traditionally static, but dynamic has become recently available

  18. Fork-Join • An existing thread or process can create a number of children with a fork. • The children work in parallel. • Join waits for all the forked children to terminate. • Spawn/exit is similar
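
A minimal fork-join sketch with Python threads: `start()` plays the role of fork and `join()` waits for each child. The helper name `fork_join` is an assumption for illustration.

```python
import threading


def fork_join(functions):
    """Fork one thread per function, join them all, return results in order."""
    results = [None] * len(functions)

    def run(index, fn):
        results[index] = fn()  # each child writes only its own slot

    threads = [threading.Thread(target=run, args=(i, fn))
               for i, fn in enumerate(functions)]
    for t in threads:
        t.start()  # fork: the child begins running in parallel
    for t in threads:
        t.join()   # join: wait until this child has terminated
    return results
```

Nothing after the second loop runs until every child has finished, which is the same guarantee a cobegin-coend block gives for the statements that follow it.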

  19. Parbegin-Parend • Also called cobegin-coend • Each statement (block/function call) in the cobegin-coend block is to be executed in parallel. • Statements after coend are not executed until all the parallel statements are complete.

  20. SPMD – SIMD • Single Program, Multiple Data vs. Single Instruction, Multiple Data • Both use a number of threads/processes which apply the same program to different data • SIMD executes the statements synchronously on different data • SPMD executes the statements asynchronously

  21. Master-Slave • One thread/process that controls all the others • With dynamic thread/process creation, the master is usually the one that creates the others. • Master would “assign” the work to the workers and the workers would send the results to the master

  22. Client-Server • Multiple clients connected to a server that responds to requests • Server could be satisfying requests in parallel (multiple requests being done in parallel or, if the request is involved, a parallel solution to the request) • The client would also do some work with the response from the server. • Very good model for heterogeneous systems

  23. Pipelining • Output of one thread is the input to another thread • A special type of functional decomposition • Another case where heterogeneous systems would be useful
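
A pipeline can be sketched as one thread per stage, with a queue between neighbouring stages. The names `stage`, `run_pipeline`, and the `DONE` sentinel are assumptions made for this example.

```python
import queue
import threading

DONE = object()  # sentinel marking the end of the input stream


def stage(fn, inbox, outbox):
    """One pipeline stage: apply fn to each incoming item and pass it on."""
    while True:
        item = inbox.get()
        if item is DONE:
            outbox.put(DONE)  # tell the next stage to stop as well
            return
        outbox.put(fn(item))


def run_pipeline(functions, items):
    """Push items through one thread per function, in the given order."""
    queues = [queue.Queue() for _ in range(len(functions) + 1)]
    threads = [threading.Thread(target=stage,
                                args=(fn, queues[i], queues[i + 1]))
               for i, fn in enumerate(functions)]
    for t in threads:
        t.start()
    for item in items:
        queues[0].put(item)
    queues[0].put(DONE)
    results = []
    while True:
        out = queues[-1].get()
        if out is DONE:
            break
        results.append(out)
    for t in threads:
        t.join()
    return results
```

While the second stage works on the first item, the first stage can already process the second item; that overlap is where the parallelism comes from.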

  24. Task Pools • Keep a collection of tasks to be done and the data they operate on • A thread/process can add newly generated tasks to the pool, and it obtains a new task from the pool when it is done with its current task
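
A task pool can be sketched with a thread-safe queue from which workers take tasks and to which they may add new ones. The names `run_task_pool` and `sum_range`, and the convention that a task handler returns `(result, new_tasks)`, are assumptions made for this example; a handler that raises an exception is not handled here.

```python
import queue
import threading


def run_task_pool(initial_tasks, process, workers=4):
    """Run tasks from a shared pool; process(task) returns (result, new_tasks)."""
    pool = queue.Queue()
    for task in initial_tasks:
        pool.put(task)
    results = []
    results_lock = threading.Lock()

    def worker():
        while True:
            task = pool.get()
            if task is None:        # shutdown marker, so tasks must not be None
                pool.task_done()
                return
            result, new_tasks = process(task)
            for t in new_tasks:
                pool.put(t)         # a running task may add more work
            with results_lock:
                results.append(result)
            pool.task_done()

    threads = [threading.Thread(target=worker) for _ in range(workers)]
    for t in threads:
        t.start()
    pool.join()                     # returns once every task is marked done
    for _ in threads:
        pool.put(None)
    for t in threads:
        t.join()
    return results


def sum_range(task):
    """Sum a small range directly; split a large one into two new tasks."""
    lo, hi = task
    if hi - lo <= 4:
        return sum(range(lo, hi)), []
    mid = (lo + hi) // 2
    return 0, [(lo, mid), (mid, hi)]
```

New tasks are added before the parent task is marked done, so the pool never looks empty while work is still being generated.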

  25. Producer Consumer • Producer threads create data used as input by the consumer threads • Data is stored in a common buffer that is accessed by producers and consumers • Producer cannot add if buffer is full • Consumer cannot remove if buffer is empty
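
Python's `queue.Queue` with a `maxsize` behaves like the bounded buffer described above: `put()` blocks while the buffer is full and `get()` blocks while it is empty. The function name `produce_and_consume` and the use of `None` as an end-of-stream marker are assumptions made for this sketch.

```python
import queue
import threading


def produce_and_consume(items, capacity=2):
    """Pass items through a bounded buffer; the consumer squares each one."""
    buffer = queue.Queue(maxsize=capacity)
    consumed = []

    def producer():
        for item in items:
            buffer.put(item)   # blocks while the buffer is full
        buffer.put(None)       # end-of-stream marker (illustrative convention)

    def consumer():
        while True:
            item = buffer.get()  # blocks while the buffer is empty
            if item is None:
                return
            consumed.append(item * item)

    threads = [threading.Thread(target=producer),
               threading.Thread(target=consumer)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return consumed
```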

  26. Array Data Distributions • 1-D • Blockwise • Each process gets ceil(n/p) elements of A, except for the last process which gets n-(p-1)*ceil(n/p) elements • Alternatively, the first n%p processes get ceil(n/p) elements while the rest get floor(n/p) elements. • Cyclic • Process i gets elements k*p+i (k = 0..ceil(n/p)-1, as long as k*p+i < n) • Block cyclic • Distribute blocks of size b to processes in a cyclic manner
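
The 1-D distributions can be expressed as small index calculations. This sketch uses 0-based element indices and process ranks, and follows the second blockwise variant (the first n%p processes get one extra element); the function names are illustrative assumptions.

```python
def block_sizes(n, p):
    """Block sizes when the first n % p processes get ceil(n/p) elements."""
    base, extra = divmod(n, p)
    return [base + 1 if rank < extra else base for rank in range(p)]


def cyclic_owner(index, p):
    """Rank owning element `index` under a cyclic distribution."""
    return index % p


def block_cyclic_owner(index, p, b):
    """Rank owning element `index` when blocks of size b are dealt cyclically."""
    return (index // b) % p
```

For n = 10 and p = 4, the block sizes are 3, 3, 2, 2, which always add up to n.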

  27. 2-D Array distribution • Blockwise distribution rows or columns • Cyclic distribution of rows or columns • Blockwise-cyclic distribution of rows or columns

  28. Checkerboard • Take an array of size n x m • Overlay a grid of size g x f • g<=n • f<=m • More easily seen if n is a multiple of g and m is a multiple of f • Blockwise Checkerboard • Assign each n/g x m/f submatrix to a processor

  29. Cyclic Checkerboard • Take each item in a n/g x m/f submatrix and assign it in a cyclic manner. • Block-Cyclic checkerboard • Take each n/g x m/f submatrix and assign all the data in the submatrix to a processor in a cyclic fashion
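
The checkerboard variants can likewise be written as owner functions that map an element (i, j) to a position in the g x f processor grid. This is a sketch with 0-based indices; the function names and the explicit block sizes `b1` and `b2` in the block-cyclic case are assumptions, not notation from the slides.

```python
def blockwise_owner(i, j, n, m, g, f):
    """Grid position owning (i, j); assumes g divides n and f divides m."""
    return (i // (n // g), j // (m // f))


def cyclic_owner_2d(i, j, g, f):
    """Grid position owning (i, j) when single elements are dealt cyclically."""
    return (i % g, j % f)


def block_cyclic_owner_2d(i, j, g, f, b1, b2):
    """Grid position owning (i, j) when b1 x b2 blocks are dealt cyclically."""
    return ((i // b1) % g, (j // b2) % f)
```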

  30. Information Exchange • Shared Variables • Used in shared memory • When thread T1 wants to share information with thread T2, then T1 writes the information into a variable that is shared with T2 • Must avoid two or more threads accessing the same variable at the same time when at least one of them writes (race condition) • A race condition leads to nondeterministic behavior.

  31. Critical Sections • Sections of code where there may be concurrent accesses to shared variables • Must make these sections mutually exclusive • Only one process can be executing this section at any one time • Lock mechanism is used to keep sections mutually exclusive • Process checks to see if section is “open” • If it is, then “lock” it and execute (unlock when done) • If not, wait until unlocked
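
A lock-protected critical section in Python might look like the following sketch. The function name `count_with_lock` is an assumption; `threading.Lock` used in a `with` statement acquires the lock on entry and releases it on exit.

```python
import threading


def count_with_lock(num_threads=4, increments=10_000):
    """Increment a shared counter from several threads without losing updates."""
    counter = 0
    lock = threading.Lock()

    def work():
        nonlocal counter
        for _ in range(increments):
            with lock:         # wait until the section is open, then lock it
                counter += 1   # read-modify-write on the shared variable
            # leaving the with-block unlocks the section

    threads = [threading.Thread(target=work) for _ in range(num_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return counter
```

The critical section contains only the update itself, which keeps the time spent holding the lock short.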

  32. Communication Operations • Single Transfer – Pi sends a message to Pj • Single Broadcast – one process sends the same data to all other processes • Single accumulation – Many values operated on to make a single value that is placed in root • Gather – Each process provides a block of data to a common single process • Scatter – root process sends a separate block of a large data structure to every other process

  33. More Communications • Multi-Broadcast – Every process sends data to every other process so every process has all the data that was spread out among the processes • Multi-Accumulate – accumulate, but every process gets the result • Total Exchange-Each process provides p-data blocks. The ith data block is sent to pi. Each processor receives the blocks and builds the structure with the data in i order.
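
The data movement of these operations can be illustrated without a message-passing library by modelling the state of p processes as a list indexed by rank. The sketch below is purely sequential and all function names are assumptions; in MPI the corresponding collectives are `MPI_Bcast`, `MPI_Reduce`, `MPI_Gather`, `MPI_Scatter`, `MPI_Allgather`, `MPI_Allreduce`, and `MPI_Alltoall`.

```python
from functools import reduce


def broadcast(value, p):
    """Single broadcast: every process ends up with the root's value."""
    return [value] * p


def scatter(data, p):
    """Root splits `data` into p equal blocks; assumes p divides len(data)."""
    size = len(data) // p
    return [data[r * size:(r + 1) * size] for r in range(p)]


def gather(local_blocks):
    """Root collects one block from every process, in rank order."""
    return [x for block in local_blocks for x in block]


def accumulate(local_values, op):
    """Single accumulation: combine all values into one value at the root."""
    return reduce(op, local_values)


def multi_broadcast(local_blocks):
    """Every process receives the blocks of all processes."""
    return [list(local_blocks) for _ in local_blocks]


def multi_accumulate(local_values, op):
    """Accumulate, but every process gets the result."""
    total = reduce(op, local_values)
    return [total] * len(local_values)


def total_exchange(send):
    """send[i][j] is the block that process i provides for process j."""
    p = len(send)
    return [[send[i][j] for i in range(p)] for j in range(p)]
```

In this model the total exchange is a transpose of the send matrix: process j receives block j from every process, ordered by sender rank.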

  34. Applications • Parallel Matrix-Vector Product • Ab=c where A is n x m, b has length m, and c has length n • Want A to be in contiguous memory • A single array, not an array of arrays • Have blocks of rows with all of b calculate a block of c • Used if A is stored row-wise • Have blocks of columns with a block of b compute columns that need to be summed. • Used if A is stored column-wise
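
The two decompositions can be compared in a small sequential simulation in which each loop iteration over `rank` stands for the work of one process. The function names are assumptions, and A is stored as a list of rows for brevity rather than in the contiguous layout the slide recommends.

```python
def matvec_by_rows(A, b, p):
    """Each process computes a block of c from its block of rows and all of b."""
    n = len(A)
    c = []
    for rank in range(p):
        lo, hi = rank * n // p, (rank + 1) * n // p
        c.extend(sum(a * x for a, x in zip(A[i], b)) for i in range(lo, hi))
    return c


def matvec_by_columns(A, b, p):
    """Each process forms a partial c from its columns; partials are summed."""
    n, m = len(A), len(b)
    c = [0] * n
    for rank in range(p):
        lo, hi = rank * m // p, (rank + 1) * m // p
        partial = [sum(A[i][j] * b[j] for j in range(lo, hi))
                   for i in range(n)]
        c = [x + y for x, y in zip(c, partial)]  # the accumulation step
    return c


A = [[1, 2, 3],
     [4, 5, 6]]
b = [1, 0, 2]
```

The row version needs all of b on every process but no final summation; the column version needs only a block of b per process but ends with an accumulation of the partial vectors.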

  35. Processes and Threads • Process – a program in execution • Includes code, program data on stack or heap, values of registers, PC. • Assigned to processor or core for execution • If there are more processes than resources (processors or memory) for all, execute in a round-robin time-shared method • Context switch – switching a processor from executing one process to executing another.

  36. Fork • The Unix fork system call • Creates a new process • Makes a copy of the program • Copy starts at statement after the fork • NOT shared memory model – Distributed memory model • Can take a while to execute

  37. Threads • Share a single address space • Best with physically shared memory • Easier to get started than a process – no copy of code space • Two types • Kernel threads – managed by the OS • User threads – managed by a thread library

  38. Thread Execution • If user threads are executed by a thread library/scheduler (no OS support for threads), then all the threads are part of one process that is scheduled by the OS • Only one thread executes at a time, even if there are multiple processors • If the OS has thread management, then threads can be scheduled by the OS and multiple threads can execute concurrently • Or, the thread scheduler can map user threads to kernel threads (several user threads may map to one kernel thread)

  39. Thread States • Newly generated • Executable • Running • Waiting • Finished • Threads transition from state to state based on events (start, interrupt, end, block, unblock, assign-to-processor)

  40. Synchronization • Locks • A process “locks” a shared variable at the beginning of a critical section • Lock allows process to proceed if shared variable is unlocked • Process blocked if variable is locked until variable is unlocked • Locking is an “atomic” operation.

  41. Semaphore • Usually a binary type but can be integer • wait(s) • Waits until the value of s is 1 (or greater) • When it is, decreases s by 1 and continues • signal(s) • Increments s by 1
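
Python's `threading.Semaphore` provides the two operations under different names: `acquire()` corresponds to wait(s) and `release()` to signal(s). The sketch below uses a semaphore with an integer value to cap how many threads are inside a region at once; the function name `peak_concurrency` and the bookkeeping variables are assumptions.

```python
import threading


def peak_concurrency(num_threads=8, permits=2):
    """Return the largest number of threads observed inside the region."""
    s = threading.Semaphore(permits)  # the semaphore value starts at `permits`
    guard = threading.Lock()          # protects the two bookkeeping counters
    active = 0
    peak = 0

    def work():
        nonlocal active, peak
        s.acquire()                   # wait(s): block until s >= 1, then s -= 1
        with guard:
            active += 1
            peak = max(peak, active)
        with guard:
            active -= 1
        s.release()                   # signal(s): s += 1

    threads = [threading.Thread(target=work) for _ in range(num_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return peak
```

With `permits=1` the semaphore is binary and behaves like a lock around the region.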

  42. Barrier Synchronization • A way to have every process wait until every process is at a certain point • Assures the state of every process before certain code is executed
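
A barrier between two phases can be sketched with `threading.Barrier`. The function name `two_phase` and the toy computation are assumptions made for this example.

```python
import threading


def two_phase(num_threads=4):
    """Phase 1 writes one slot per thread; phase 2 reads all the slots."""
    barrier = threading.Barrier(num_threads)
    phase1 = [None] * num_threads
    phase2 = [None] * num_threads

    def work(rank):
        phase1[rank] = rank * rank   # phase 1: write only the thread's own slot
        barrier.wait()               # no thread continues until all have arrived
        phase2[rank] = sum(phase1)   # phase 2: every slot is now filled in

    threads = [threading.Thread(target=work, args=(r,))
               for r in range(num_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return phase2
```

Without the barrier, a fast thread could read `phase1` while some slots still hold `None`.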

  43. Condition Synchronization • A thread is blocked until a given condition is established • If condition is not true, then put into blocked state • When condition true, moved from blocked to ready (not necessarily directly to a processor) • Since other threads may be executing, by the time this thread gets to a processor, the condition may no longer be true • So the thread must check the condition again once it is running
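
The need to check the condition again after waking up is the reason condition waits are usually written as a `while` loop rather than an `if`. This sketch uses `threading.Condition`; the class `Counter` and the helper `demo` are assumptions made for the example.

```python
import threading


class Counter:
    """A shared counter that threads can wait on."""

    def __init__(self):
        self.value = 0
        self.cond = threading.Condition()

    def increment(self):
        with self.cond:
            self.value += 1
            self.cond.notify_all()      # move waiting threads back to ready

    def wait_until_at_least(self, target):
        with self.cond:
            while self.value < target:  # check again after every wake-up
                self.cond.wait()        # blocked; the lock is released meanwhile
            return self.value


def demo(target=3):
    """One thread increments `target` times while the caller waits."""
    counter = Counter()
    producer = threading.Thread(
        target=lambda: [counter.increment() for _ in range(target)])
    producer.start()
    seen = counter.wait_until_at_least(target)
    producer.join()
    return seen
```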

  44. Efficient Thread Programs • Proper number of threads • Consider degree of parallelism in application • Number of processors • Size of shared cache • Avoid synchronization as much as possible • Make critical section as small as possible • Watch for deadlock conditions

  45. Memory Access • Must consider writing values to shared memory that is held in local caches • False sharing • Consider 2 processes writing to different memory locations • This SHOULD not be an issue, since the locations themselves are not shared between the two caches • HOWEVER, if the memory locations are close to each other, they may lie in the same cache line, so the same line ends up in both caches and has to be kept coherent on every write
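
Whether two locations can suffer from false sharing comes down to whether their addresses fall into the same cache line. The sketch below assumes a 64-byte line, a common size that is nevertheless hardware-dependent; the function names are illustrative.

```python
def same_cache_line(addr_a, addr_b, line_size=64):
    """True if two byte addresses fall into the same cache line."""
    return addr_a // line_size == addr_b // line_size


def padded_offsets(count, line_size=64):
    """Byte offsets that give each of `count` variables a line of its own."""
    return [k * line_size for k in range(count)]
```

Two 8-byte counters at offsets 0 and 8 share a line, while padding them to offsets 0 and 64 separates them at the cost of unused memory.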
