Computer Architecture II: Introduction
Plan for today
• Parallelization strategies
  • What to partition?
  • Embarrassingly Parallel Computations
  • Divide-and-Conquer
  • Pipelined Computations
• Application examples
• Parallelization steps
• 3 programming models
  • Data parallel
  • Shared memory
  • Message passing
Parallelization strategies: what is partitioned?
• Computation (always partitioned)
  • Uniform (all processors apply the same operation)
  • Functional (each processor applies a different type of operation)
  • Mixed
• Data
  • Not always necessary (e.g., computing pi)
  • When data is partitioned, in most cases the computation is partitioned along with it (computation accompanies the data partitioning)
  • Locality reasons: a processor works on the data assigned to it
Parallelization strategies (Barry Wilkinson)
• Embarrassingly Parallel Computations
• Divide-and-Conquer
• Pipelined Computations
Embarrassingly Parallel Computations
• A computation that can obviously be divided into a number of completely independent parts, each of which can be executed by a separate process
• No or very little communication between processes
• Each process can do its tasks without any interaction with other processes
(Figure: input data split across independent processes, each producing part of the output data)
Embarrassingly Parallel Computations: examples
• Image processing
• Monte Carlo methods
  • Use random selection for independent computation
  • Compute pi:
    • Choose points within the square at random
    • Count in parallel how many lie inside the circle (x² + y² < 1) and how many do not
    • π / 4 = (area of circle) / (area of square) ≈ (points in circle) / (points in square)
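A minimal sketch of the Monte Carlo estimate in C (not from the slides; the sample count and seed are arbitrary). Every iteration is independent, so the loop can be split across processes in an embarrassingly parallel way, combining only the final counts:

```c
/* Minimal sketch: Monte Carlo estimate of pi (sequential version).
   Each iteration is independent, so the loop can be split across
   processes/threads with only the final counts combined. */
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    const long samples = 10000000;
    long in_circle = 0;

    srand(42);                                  /* fixed seed, illustrative only */
    for (long i = 0; i < samples; i++) {
        double x = (double)rand() / RAND_MAX;   /* random point in the unit square */
        double y = (double)rand() / RAND_MAX;
        if (x * x + y * y < 1.0)                /* inside the quarter circle? */
            in_circle++;
    }
    printf("pi ~= %f\n", 4.0 * in_circle / samples);
    return 0;
}
```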
Divide-and-Conquer
• Solving a problem = dividing it into several parts, solving the parts, and combining the results
• Examples:
  • Reduce (MPI example)
  • Merge sort
  • N-body problem
Divide and conquer: merge sort
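A minimal sequential sketch of the merge-sort recursion (illustrative, not from the slides). In a divide-and-conquer parallelization, the two recursive calls are independent and can be handed to different processes; the merge is the combine step:

```c
/* Minimal sketch of merge sort.  In a divide-and-conquer
   parallelization the two recursive calls are independent and can be
   executed by different processes; the merge is the combine step. */
#include <string.h>

static void merge(int *a, int *tmp, int lo, int mid, int hi)
{
    int i = lo, j = mid, k = lo;
    while (i < mid && j < hi)                 /* merge the two sorted halves */
        tmp[k++] = (a[i] <= a[j]) ? a[i++] : a[j++];
    while (i < mid) tmp[k++] = a[i++];
    while (j < hi)  tmp[k++] = a[j++];
    memcpy(a + lo, tmp + lo, (hi - lo) * sizeof(int));
}

void merge_sort(int *a, int *tmp, int lo, int hi)   /* sorts a[lo..hi) */
{
    if (hi - lo < 2) return;                  /* base case: 0 or 1 element */
    int mid = lo + (hi - lo) / 2;
    merge_sort(a, tmp, lo, mid);              /* divide: sort left half   */
    merge_sort(a, tmp, mid, hi);              /* divide: sort right half  */
    merge(a, tmp, lo, mid, hi);               /* conquer: combine results */
}
```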
Pipelined computation
• Problem divided into a series of tasks that have to be completed one after the other
• Each task is executed by a separate process or processor
• Processes may perform different tasks
• A form of functional partitioning (like the pipeline of a microprocessor)
Pipelined computation example
• Sieve of Eratosthenes: generating a series of prime numbers (see the sketch below)
  • Candidates are streamed through the pipeline, counting up from 2
  • A prime number is kept by the first free processor
  • A non-prime is dropped
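A minimal sketch of the pipelined sieve using MPI (illustrative: the candidate bound of 100, one filter stage per process, and running with at least two processes are assumptions; a prime that reaches the last stage with no free process left is simply dropped):

```c
/* Hypothetical sketch: pipelined Sieve of Eratosthenes with MPI.
   Process 0 generates candidates 2..100; every other process keeps
   the first number it receives as "its" prime and forwards only the
   candidates that are not multiples of it.  0 marks end of stream. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, size, n, prime = 0;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    if (rank == 0) {                           /* generator stage */
        for (n = 2; n <= 100; n++)
            MPI_Send(&n, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
        n = 0;                                 /* end-of-stream marker */
        MPI_Send(&n, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
    } else {                                   /* filter stages */
        for (;;) {
            MPI_Recv(&n, 1, MPI_INT, rank - 1, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
            if (n == 0) break;                 /* end of stream */
            if (prime == 0) {                  /* first number seen is a prime */
                prime = n;
                printf("process %d holds prime %d\n", rank, prime);
            } else if (n % prime != 0 && rank + 1 < size) {
                MPI_Send(&n, 1, MPI_INT, rank + 1, 0, MPI_COMM_WORLD);
            }
        }
        if (rank + 1 < size)                   /* propagate termination */
            MPI_Send(&n, 1, MPI_INT, rank + 1, 0, MPI_COMM_WORLD);
    }
    MPI_Finalize();
    return 0;
}
```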
Application examples: Simulating Ocean Currents
• Model the ocean as several two-dimensional grids
• Discretized in space and time
  • Finer spatial and temporal resolution => greater accuracy
• One time step: set up and solve equations
• Concurrency
  • Intra-step and inter-step
  • Static and regular
(Figure: (a) cross sections; (b) spatial discretization of a cross section)
Simulating Galaxy Evolution (N-body problem)
• Simulate the interactions of many stars evolving over time
• Discretized in time
• Computing forces is expensive
  • O(n²) brute-force approach
  • Hierarchical methods take advantage of the force law: F = G · m1 · m2 / r²
• Many time-steps, plenty of concurrency across stars within one
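A minimal sketch of the O(n²) brute-force force computation for one time step (illustrative field names; softening and the time integration are omitted). The outer loop is independent across stars, which is the concurrency exploited within a time step:

```c
/* Minimal sketch of the O(n^2) brute-force force computation for one
   time step.  The outer loop is independent across stars i, which is
   the concurrency exploited within a time step. */
#include <math.h>

#define G 6.674e-11   /* gravitational constant */

typedef struct { double x, y, m, fx, fy; } Star;

void compute_forces(Star *s, int n)
{
    for (int i = 0; i < n; i++) {              /* parallelizable across stars */
        s[i].fx = s[i].fy = 0.0;
        for (int j = 0; j < n; j++) {
            if (j == i) continue;
            double dx = s[j].x - s[i].x;
            double dy = s[j].y - s[i].y;
            double r2 = dx * dx + dy * dy;
            double f  = G * s[i].m * s[j].m / r2;   /* F = G*m1*m2 / r^2 */
            double r  = sqrt(r2);
            s[i].fx += f * dx / r;             /* accumulate force components */
            s[i].fy += f * dy / r;
        }
    }
}
```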
Rendering Scenes by Ray Tracing
• Shoot rays into the scene through pixels in the image plane
• Follow their paths
  • They bounce around as they strike objects
  • They generate new rays: a ray tree per input ray
• Result is color, opacity, and brightness for 2D pixels
• Parallelism across rays
Definitions
• Task:
  • Arbitrary piece of work in a parallel computation
  • Executed sequentially; concurrency is only across tasks
  • E.g. a particle/cell in Barnes-Hut, a ray or ray group in Raytrace
  • Fine-grained versus coarse-grained tasks
• Process (thread):
  • Abstract entity that performs the tasks assigned to it
  • Processes communicate and synchronize to perform their tasks
• Processor:
  • Physical engine on which a process executes
  • Processes virtualize the machine to the programmer
    • Write the program in terms of processes, then map processes to processors
4 Steps in Creating a Parallel Program
• Decomposition of the computation into tasks
• Assignment of tasks to processes
• Orchestration of data access, communication, and synchronization
• Mapping of processes to processors
Decomposition
• Identify concurrency and decide the level at which to exploit it
• Break up the computation into tasks to be divided among processes
  • Tasks may become available dynamically
  • The number of available tasks may vary with time
• Goal: enough tasks to keep processes busy, but not too many
  • The number of tasks available at a time is an upper bound on achievable speedup
Assignment
• Specify the mechanism to divide work among processes
  • E.g. which process computes forces on which stars, or which rays
• Goals
  • Balance the workload
  • Reduce communication and synchronization
  • Low run-time management cost (e.g. management of scheduling queues)
• Structured programs can help
  • Code inspection
  • Identifying phases
  • Parallel loops
• Static versus dynamic assignment
• Decomposition & assignment are the first steps of parallel programming
  • Usually independent of architecture or programming model
  • But the cost and complexity of using system primitives (e.g. communication and synchronization costs) may affect decisions => may have to revisit them
Orchestration
• Naming data
• Synchronization
• Structuring communication
• Organizing data structures and scheduling tasks temporally
• Goals
  • Reduce the cost of communication and synchronization
  • Preserve locality of data reference
  • Schedule tasks to satisfy dependences early
Mapping
• Two aspects:
  • Which process runs on which particular processor?
    • Mapping to a network topology
  • Will multiple processes run on the same processor?
• Space-sharing
  • Machine divided into subsets of nodes; only one application runs at a time in a subset
  • Processes can be pinned to processors, or dynamically migrated by the OS
• Time-sharing
  • Several processes share the same processor
• We assume: 1 process = 1 processor
High-level goals
• Major goal of parallel programs: high performance (speedup)
• Low resource usage
• Low development effort (programming, testing, debugging)
• High-productivity computing: a trade-off among the three
Parallel Programming Models
Random access machine (RAM)
• Abstract machine with an unlimited number of registers
• Model used in sequential programming
• Bad programming practices it encourages:
  • Assume unlimited memory
  • Ignore the cache hierarchy
Parallel Random Access Machine (PRAM)
• Abstract machine for designing algorithms for parallel computers
• Multiple processors may communicate through a shared memory
• In a synchronous PRAM, all processors execute synchronously
• The programmer is not concerned with synchronization and communication, only with concurrency
• Variants, depending on how read and write accesses may overlap:
  • Exclusive Read Exclusive Write (EREW)
  • Concurrent Read Exclusive Write (CREW)
  • Exclusive Read Concurrent Write (ERCW)
  • Concurrent Read Concurrent Write (CRCW)
• Used for architecture-independent parallel algorithm design and analysis
• Very similar to the data-parallel model
Example: iterative equation solver
• Simplified version of a piece of the Ocean simulation
• Illustrates the program in a low-level parallel language
  • C-like pseudocode with simple extensions for parallelism
  • Exposes basic communication and synchronization primitives
  • State of most real parallel programming today
Grid Solver
• Gauss-Seidel (near-neighbor) sweeps to convergence
  • The interior n-by-n points of an (n+2)-by-(n+2) grid are updated in each sweep
  • Updates are done in place in the grid
  • The difference from the previous value is computed
  • Partial diffs are accumulated into a global diff at the end of every sweep
  • Check whether the solution has converged to within a tolerance parameter
Sequential Version
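The original slide shows the solver listing; as a stand-in, here is a minimal C sketch of the sequential sweep, assuming an (n+2)×(n+2) array A with fixed boundary values and an illustrative tolerance TOL:

```c
/* Minimal sketch of the sequential Gauss-Seidel sweep (illustrative).
   A is (n+2)x(n+2); only the interior n x n points are updated,
   in place, and partial differences are accumulated into diff. */
#include <math.h>

#define TOL 1.0e-3f

void solve(float **A, int n)
{
    int done = 0;
    while (!done) {
        float diff = 0.0f;
        for (int i = 1; i <= n; i++) {
            for (int j = 1; j <= n; j++) {
                float temp = A[i][j];
                /* in-place update: average of the point and its 4 neighbors */
                A[i][j] = 0.2f * (A[i][j] + A[i][j-1] + A[i-1][j]
                                          + A[i][j+1] + A[i+1][j]);
                diff += fabsf(A[i][j] - temp);   /* accumulate the change */
            }
        }
        if (diff / ((float)n * n) < TOL)         /* converged within tolerance? */
            done = 1;
    }
}
```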
Decomposition
• Identify concurrency by looking at loop iterations
  • Dependence analysis
  • If there is not enough concurrency, look for other methods
• Not much concurrency here at this level (all loops sequential)
• Examine fundamental dependences
• Solutions
  • Retain the loop structure and use point-to-point synchronization. Problem: too many synchronization operations
  • Restructure the loops: concurrency along anti-diagonals. Problem: load imbalance, global synchronization
  • Exploit application knowledge
Exploit Application Knowledge
• Reorder the grid traversal: red-black ordering
  • Different ordering of updates: may converge more quickly or more slowly
  • Red sweep and black sweep are each fully parallel (used in Ocean)
  • Global synchronization between them (conservative but convenient)
• We use a simpler, asynchronous version to illustrate
  • No red-black; simply ignore dependences within a sweep
  • The parallel program is nondeterministic (the result depends on the scheduling order)
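A minimal sketch of one red-black half-sweep (illustrative). Points whose (i+j) parity matches the current color do not depend on each other, so each half-sweep is fully parallel; the caller runs the red sweep, synchronizes, then runs the black sweep and tests convergence:

```c
/* Minimal sketch of one red-black half-sweep (illustrative).
   Points whose (i+j) parity matches `color` do not depend on each
   other, so each half-sweep is fully parallel. */
#include <math.h>

float sweep_color(float **A, int n, int color)   /* color: 0 = red, 1 = black */
{
    float diff = 0.0f;
    for (int i = 1; i <= n; i++)
        for (int j = 1; j <= n; j++)
            if ((i + j) % 2 == color) {
                float temp = A[i][j];
                A[i][j] = 0.2f * (A[i][j] + A[i][j-1] + A[i-1][j]
                                          + A[i][j+1] + A[i+1][j]);
                diff += fabsf(A[i][j] - temp);
            }
    return diff;   /* caller adds red + black diffs and tests convergence */
}
```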
Decomposition
• Decomposition into elements: degree of concurrency n²
• Implicit synchronization after the for_all construct
• Line 18 of the listing: what happens if the for_all loop is made a plain for?
Possible Assignments
• Static assignment: decomposition into rows
  • Block assignment of rows: row i is assigned to process ⌊i / (n/p)⌋, i.e. each of the p processes gets a contiguous block of n/p rows
  • Cyclic assignment of rows: process i is assigned rows i, i+p, ...
• Dynamic assignment
  • Get a row index, process the row, get a new row, ...
Orchestration
• Here we need to pin down the programming model
  • Data parallel solver
  • Shared address space solver
  • Message passing solver
Data parallel solver
• Characteristics
  • Data decomposition needed
  • All the processes execute the same code
  • Global variables are shared
  • Local variables are private
  • Implicit parallel processes (no need for a proc_id, no need to create threads)
  • Implicit synchronization (for instance at for_all)
  • Good for regular distributions
Data Parallel Solver
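The original slide shows the for_all pseudocode; as a runnable stand-in, here is a sketch that approximates the implicit data-parallel loop with OpenMP, using the asynchronous version that ignores dependences within a sweep (so the result is nondeterministic, as noted above):

```c
/* Sketch approximating the data-parallel solver with OpenMP
   (stand-in for the for_all pseudocode on the original slide).
   Asynchronous version: dependences within a sweep are ignored. */
#include <math.h>
#include <omp.h>

#define TOL 1.0e-3f

void solve_data_parallel(float **A, int n)
{
    int done = 0;
    while (!done) {
        float diff = 0.0f;
        /* implicit fork/join and synchronization at the end of the loop,
           like the implicit synchronization after for_all */
        #pragma omp parallel for reduction(+:diff) collapse(2)
        for (int i = 1; i <= n; i++)
            for (int j = 1; j <= n; j++) {
                float temp = A[i][j];
                A[i][j] = 0.2f * (A[i][j] + A[i][j-1] + A[i-1][j]
                                          + A[i][j+1] + A[i+1][j]);
                diff += fabsf(A[i][j] - temp);
            }
        if (diff / ((float)n * n) < TOL)
            done = 1;
    }
}
```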
Shared Address Space Solver
• Characteristics
  • Shared variables
  • Explicit threads/processes (proc_id)
  • Explicit synchronization
  • Explicit work assignment, controlled by values of variables computed from the proc_id
Shared Address Space Solver
• Assignment controlled by values of variables computed from the proc_id
Generating Threads
Assignment Mechanism
SAS Program
• SPMD: not lockstep like the data parallel model; processes do not necessarily execute the same instructions
• Assignment controlled by process-specific proc_id variables
• The done condition is evaluated redundantly by all processes
• The code that does the update is identical to the sequential program
  • Each process has a private mydiff variable
• The most interesting special operations are for synchronization
  • Accumulations into the shared diff have to be mutually exclusive
  • Barriers
• A sketch follows below
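The slide's listing is not reproduced here; as a stand-in, a minimal sketch using POSIX threads (the names gm, solve_sas, nprocs and the assumption that n is divisible by nprocs are illustrative), with block assignment of rows, a private mydiff, a mutex-protected accumulation into the shared diff, and barriers between the phases of each sweep:

```c
/* Minimal SAS sketch with POSIX threads (illustrative, not the
   original slide code).  Block assignment of rows per proc_id,
   private mydiff, mutex-protected global diff, barriers. */
#include <math.h>
#include <pthread.h>

#define TOL 1.0e-3f

struct shared {
    float **A;                    /* (n+2)x(n+2) grid, logically shared  */
    int     n, nprocs;
    float   diff;                 /* global accumulated difference       */
    int     done;
    pthread_mutex_t   diff_lock;
    pthread_barrier_t barrier;
} gm;

void *solve_sas(void *arg)
{
    int proc_id = (int)(long)arg;
    int rows    = gm.n / gm.nprocs;        /* assume n divisible by nprocs */
    int first   = 1 + proc_id * rows;      /* block assignment of rows     */
    int last    = first + rows - 1;

    while (!gm.done) {
        float mydiff = 0.0f;               /* private partial diff */
        if (proc_id == 0) gm.diff = 0.0f;
        pthread_barrier_wait(&gm.barrier); /* diff reset visible to all */

        for (int i = first; i <= last; i++)
            for (int j = 1; j <= gm.n; j++) {
                float temp = gm.A[i][j];   /* update identical to sequential code */
                gm.A[i][j] = 0.2f * (gm.A[i][j] + gm.A[i][j-1] + gm.A[i-1][j]
                                               + gm.A[i][j+1] + gm.A[i+1][j]);
                mydiff += fabsf(gm.A[i][j] - temp);
            }

        pthread_mutex_lock(&gm.diff_lock); /* mutually exclusive accumulation */
        gm.diff += mydiff;
        pthread_mutex_unlock(&gm.diff_lock);

        pthread_barrier_wait(&gm.barrier); /* all partial diffs accumulated */
        if (gm.diff / ((float)gm.n * gm.n) < TOL)
            gm.done = 1;                   /* done evaluated redundantly by all */
        pthread_barrier_wait(&gm.barrier); /* done is set before anyone re-tests */
    }
    return NULL;
}
```

Thread creation (one pthread_create per process, as on the "Generating Threads" slide) and initialization of the grid, mutex, and barrier are assumed to happen in a main routine.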
Mutual Exclusion
• Why is it needed?
• Provided by LOCK-UNLOCK around a critical section
  • The set of operations we want to execute atomically
• The implementation of LOCK/UNLOCK must guarantee mutual exclusion
Global Event Synchronization
• BARRIER(nprocs): wait here until nprocs processes get here
  • Built using lower-level primitives
• Often used to separate phases of computation; every process P_1 ... P_nprocs executes:
  set up eqn system
  Barrier (name, nprocs)
  solve eqn system
  Barrier (name, nprocs)
  apply results
  Barrier (name, nprocs)
• Conservative form of preserving dependences, but easy to use
• WAIT_FOR_END (nprocs-1)
Message Passing Grid Solver
• Cannot declare A to be a global shared array
  • Compose it logically from per-process private arrays
  • Usually allocated in accordance with the assignment of work
  • A process assigned a set of rows allocates them locally
• Transfers of entire rows between traversals
• Structurally similar to the SPMD SAS program
  • Orchestration is different
    • Data structures and data access/naming
    • Communication
    • Synchronization
• Ghost rows
Data Layout and Orchestration
(Figure: the grid partitioned by rows across processes P0 ... P4)
• Data partition allocated per processor
• Add ghost rows to hold boundary data
• Send edges to neighbors
• Receive into ghost rows
• Compute as in the sequential program
Notes on Message Passing Program
• Use of ghost rows
• Receive does not transfer data, send does
  • Unlike SAS, which is usually receiver-initiated (a load fetches the data)
• Communication is done at the beginning of each iteration, so there is no asynchrony
• Communication is done in whole rows, not an element at a time
• The core is similar, but indices/bounds are in local rather than global space
• Synchronization through sends and receives
  • Update of the global diff and event synchronization for the done condition
  • Could implement locks and barriers with messages
• Can use REDUCE and BROADCAST library calls to simplify the code
• A sketch follows below
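A minimal MPI sketch of one iteration (illustrative, not the slide's code), assuming each process owns myrows contiguous rows stored in a local array with two extra ghost rows (row 0 and row myrows+1); MPI_Sendrecv exchanges the boundary rows and MPI_Allreduce plays the role of REDUCE plus BROADCAST for the done test:

```c
/* Minimal MPI sketch of one solver iteration (illustrative).
   Each process owns `myrows` rows, stored locally with two ghost
   rows: row 0 and row myrows+1 hold copies of neighbors' edges. */
#include <math.h>
#include <mpi.h>

float iterate(float **A, int myrows, int n, int rank, int nprocs)
{
    int up = rank - 1, down = rank + 1;
    MPI_Status st;

    /* exchange boundary rows with neighbors (ghost-row update) */
    if (up >= 0)
        MPI_Sendrecv(A[1], n + 2, MPI_FLOAT, up, 0,
                     A[0], n + 2, MPI_FLOAT, up, 0,
                     MPI_COMM_WORLD, &st);
    if (down < nprocs)
        MPI_Sendrecv(A[myrows], n + 2, MPI_FLOAT, down, 0,
                     A[myrows + 1], n + 2, MPI_FLOAT, down, 0,
                     MPI_COMM_WORLD, &st);

    /* compute as in the sequential program, but with local indices */
    float mydiff = 0.0f, diff;
    for (int i = 1; i <= myrows; i++)
        for (int j = 1; j <= n; j++) {
            float temp = A[i][j];
            A[i][j] = 0.2f * (A[i][j] + A[i][j-1] + A[i-1][j]
                                      + A[i][j+1] + A[i+1][j]);
            mydiff += fabsf(A[i][j] - temp);
        }

    /* REDUCE + BROADCAST in one call: everyone gets the global diff */
    MPI_Allreduce(&mydiff, &diff, 1, MPI_FLOAT, MPI_SUM, MPI_COMM_WORLD);
    return diff;   /* caller tests diff / (n*n) against the tolerance */
}
```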
Send and Receive Alternatives
(Taxonomy: Send/Receive splits into Synchronous and Asynchronous; Asynchronous splits into Blocking asynchronous and Non-blocking asynchronous)
• Semantic flavors: based on when control is returned and when data structures or buffers can be reused at either end
• Synchronous:
  • Both send and recv return when the data has arrived
• Asynchronous blocking:
  • Send returns when the data has been copied out of the user buffer
  • Recv returns when the data has arrived, without acknowledging the sender
• Asynchronous non-blocking:
  • Send returns control immediately
  • Recv returns after posting the operation (completion has to be probed later)
• See the sketch below
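For illustration, a small MPI fragment contrasting the flavors (a sketch, not the slide's code): MPI_Ssend is synchronous, MPI_Send is blocking, and MPI_Isend/MPI_Irecv are non-blocking and must be completed with MPI_Wait:

```c
/* Sketch contrasting send/receive flavors in MPI (illustrative). */
#include <mpi.h>

void flavors(int rank, float *buf, int n)
{
    MPI_Request req;
    MPI_Status  st;

    if (rank == 0) {
        /* synchronous: returns only once the matching receive has started */
        MPI_Ssend(buf, n, MPI_FLOAT, 1, 0, MPI_COMM_WORLD);

        /* blocking: returns once buf may be reused (data copied out) */
        MPI_Send(buf, n, MPI_FLOAT, 1, 1, MPI_COMM_WORLD);

        /* non-blocking: returns immediately; buf must not be touched
           until MPI_Wait reports completion */
        MPI_Isend(buf, n, MPI_FLOAT, 1, 2, MPI_COMM_WORLD, &req);
        MPI_Wait(&req, &st);
    } else if (rank == 1) {
        MPI_Recv(buf, n, MPI_FLOAT, 0, 0, MPI_COMM_WORLD, &st);   /* blocking */
        MPI_Recv(buf, n, MPI_FLOAT, 0, 1, MPI_COMM_WORLD, &st);
        MPI_Irecv(buf, n, MPI_FLOAT, 0, 2, MPI_COMM_WORLD, &req); /* posted   */
        MPI_Wait(&req, &st);               /* ... completed (probed) later */
    }
}
```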
Orchestration: comparison of SAS and MP