Computer Architecture II: Introduction
Recap
• Parallelization strategies
  • What to partition?
  • Embarrassingly Parallel Computations
  • Divide-and-Conquer
  • Pipelined Computations
• Application examples
• Parallelization steps
• 3 programming models
  • Data parallel
  • Shared memory
  • Message passing
4 Steps in Creating a Parallel Program
• Decomposition of the computation into tasks
• Assignment of tasks to processes
• Orchestration of data access, communication, synchronization
• Mapping of processes to processors
Plan for Today: Programming for Performance
• Amdahl's law
• Partitioning for performance
• Addressing decomposition and assignment
• Orchestration for performance
Creating a Parallel Program
• Assumption: a sequential algorithm is given
  • Sometimes a very different algorithm is needed, but that is beyond our scope
• Pieces of the job:
  • Identify work that can be done in parallel
  • Partition work and perhaps data among processes
  • Manage data access, communication and synchronization
  • Note: work includes computation, data access and I/O
• Main goal: speedup (plus low programming effort and resource needs)
    Speedup(p) = Performance(p) / Performance(1)
• For a fixed problem:
    Speedup(p) = Time(1) / Time(p)
Amdahl's law
• Suppose a fraction f of your application is not parallelizable
• The remaining fraction 1-f is parallelizable on p processors
    Speedup(p) = T1 / Tp <= T1 / (f*T1 + (1-f)*T1/p) = 1 / (f + (1-f)/p) <= 1/f
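A minimal numeric sketch of this bound in C (the 5% serial fraction and the helper name are illustrative assumptions, not from the slides):

    /* Hedged sketch: evaluating the Amdahl bound 1 / (f + (1-f)/p) for an
       assumed 5% serial fraction (the numbers are illustrative only). */
    #include <stdio.h>

    double amdahl_speedup(double f, int p) {
        return 1.0 / (f + (1.0 - f) / p);
    }

    int main(void) {
        double f = 0.05;                       /* serial fraction (assumed) */
        int procs[] = {1, 4, 16, 64, 1024};
        for (int i = 0; i < 5; i++)
            printf("p = %4d  ->  speedup <= %.2f\n",
                   procs[i], amdahl_speedup(f, procs[i]));
        /* Even with 1024 processors the speedup stays below 1/f = 20. */
        return 0;
    }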
Amdahl's Law (for 1024 processors)
See: Gustafson, Montry, Benner, "Development of Parallel Methods for a 1024-Processor Hypercube", SIAM J. Sci. Stat. Comp. 9, No. 4, 1988, p. 609.
Amdahl's law
• But:
  • Many problems can be "embarrassingly" parallelized
    • Ex: image processing, differential equation solvers
  • In some cases the serial fraction does not increase with the problem size
  • Additional speedup can be achieved from additional resources (super-linear speedup due to more memory)
Speedup
[Figure: speedup vs. number of processors p, showing linear, superlinear, and sub-linear speedup curves]
Superlinear Speedup?
• Possible causes
  • Algorithm
    • e.g., with optimization problems, throwing many processors at the problem increases the chances that one will "get lucky" and find the optimum fast
  • Hardware
    • e.g., with many processors, it is possible that the entire application data resides in cache (vs. RAM) or in RAM (vs. disk)
Parallel Efficiency
• Eff_p = S_p / p
• Typically <= 1, unless there is superlinear speedup
• Used to measure how well the processors are utilized
• If increasing the number of processes by a factor of 10 increases the speedup only by a factor of 2, it is perhaps not worth it: efficiency drops by a factor of 5
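A small C sketch (the timing numbers are assumed for illustration) that computes speedup and efficiency from measured times and reproduces the factor-of-5 efficiency drop mentioned above:

    /* Hedged sketch: speedup and efficiency from measured wall-clock times.
       t1: sequential time, tp: parallel time on p processors.               */
    #include <stdio.h>

    double speedup(double t1, double tp)           { return t1 / tp; }
    double efficiency(double t1, double tp, int p) { return speedup(t1, tp) / p; }

    int main(void) {
        double t1 = 100.0;                         /* 100 s sequential run (assumed) */
        printf("p=10:  S=%.1f  Eff=%.2f\n", speedup(t1, 25.0),  efficiency(t1, 25.0, 10));
        printf("p=100: S=%.1f  Eff=%.2f\n", speedup(t1, 12.5),  efficiency(t1, 12.5, 100));
        /* Tenfold more processes, only 2x more speedup: efficiency drops 5x. */
        return 0;
    }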
Performance Goal => Speedup
• Architect's goal
  • observe how the program uses the machine and improve the design to enhance performance
• Programmer's goal
  • observe how the program uses the machine and improve the implementation to enhance performance
4 Steps in Creating a Parallel Program
• Decomposition of the computation into tasks
• Assignment of tasks to processes
• Orchestration of data access, communication, synchronization
• Mapping of processes to processors
Partitioning for Performance
• First two phases of the parallelization process: decomposition and assignment
• Goals
  • Balancing the workload and reducing wait time at synchronization points
  • Reducing inherent communication
  • Reducing extra work for determining and managing a good assignment (static versus dynamic)
• Tensions between the 3 goals
  • Maximize load balance => smaller tasks => increased communication
  • No communication (run on 1 processor) => extreme load imbalance (all others idle)
  • Load balance => extra work to compute or manage the partitioning (e.g. dynamic techniques)
1. Load Balance
    Speedup ≤ Sequential Work / Max Work on any Processor
• Work: data access, computation
• Not just equal work: processors must also be busy at the same time
• Ex: sequential work = 1000, max work on any processor = 400: Speedup ≤ 1000/400 = 2.5
1. Load Balance
• Identify enough concurrency
  • Data and functional parallelism (last class)
• Managing concurrency
  • Task granularity
• Reduce communication and synchronization
1.b Static versus Dynamic Assignment
• Static: it is clear before the program starts who does what
    #pragma omp parallel for schedule(static)
    for (i = 0; i < N; i++) { a[i] = a[i] + b[i]; }
• Dynamic
  • External scheduler
  • Self-scheduled: each process picks a chunk of loop iterations and executes them
    #pragma omp parallel for schedule(dynamic,4)
    for (i = 0; i < N; i++) { a[i] = a[i] + b[i]; }
  • Guided self-scheduling: processes take larger chunks first and then progressively reduce the chunk size (a complete example follows below)
    #pragma omp parallel for schedule(guided,4)
    for (i = 0; i < N; i++) { a[i] = a[i] + b[i]; }
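For concreteness, a complete OpenMP sketch (loop body, sizes and chunk sizes are assumed, not from the slides) with an intentionally imbalanced loop, where dynamic or guided scheduling typically balances the load better than static:

    /* Hedged sketch: an imbalanced loop where the cost of iteration i grows
       with i. schedule(static) gives the last thread far more work;
       schedule(dynamic) or schedule(guided) balances it better.
       Compile with: gcc -fopenmp schedule_demo.c                            */
    #include <stdio.h>
    #include <omp.h>

    #define N 10000

    double work(int i) {                   /* cost proportional to i */
        double s = 0.0;
        for (int k = 0; k < i; k++)
            s += 0.5 * k;
        return s;
    }

    int main(void) {
        static double a[N];
        double t0 = omp_get_wtime();

        /* Swap in schedule(static), schedule(dynamic,64) or schedule(guided,64)
           and compare the elapsed times. */
        #pragma omp parallel for schedule(guided, 64)
        for (int i = 0; i < N; i++)
            a[i] = work(i);

        printf("elapsed: %.3f s  (a[N-1] = %.1f)\n", omp_get_wtime() - t0, a[N - 1]);
        return 0;
    }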
Dynamic Tasking with Task Queues
• Centralized queue: simple protocol
  • Problems: communication, synchronization, contention
• Distributed queues: complicated protocol
  • Initial distribution of jobs
    • May cause load imbalance
    • Solution: task stealing (whom to steal from, how many tasks to steal, ...)
  • Termination detection
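A minimal sketch of a centralized task queue with POSIX threads (task count, worker count and the task body are assumed); it shows the simple protocol and also why the single lock becomes a point of contention:

    /* Hedged sketch: a centralized task queue protected by a single lock.
       The protocol is simple, but every dequeue serializes on that lock.
       Compile with: gcc -pthread queue_demo.c                              */
    #include <pthread.h>
    #include <stdio.h>

    #define NTASKS   1000
    #define NWORKERS 4

    static pthread_mutex_t queue_lock = PTHREAD_MUTEX_INITIALIZER;
    static int next_task = 0;                 /* head of the centralized queue */
    static double results[NTASKS];

    static double run_task(int t) {           /* placeholder for real work */
        return 0.5 * t;
    }

    static void *worker(void *arg) {
        (void)arg;
        for (;;) {
            pthread_mutex_lock(&queue_lock);  /* self-scheduling: grab next task */
            int t = (next_task < NTASKS) ? next_task++ : -1;
            pthread_mutex_unlock(&queue_lock);
            if (t < 0) break;                 /* queue empty: terminate */
            results[t] = run_task(t);
        }
        return NULL;
    }

    int main(void) {
        pthread_t tid[NWORKERS];
        for (int i = 0; i < NWORKERS; i++)
            pthread_create(&tid[i], NULL, worker, NULL);
        for (int i = 0; i < NWORKERS; i++)
            pthread_join(tid[i], NULL);
        printf("results[%d] = %.1f\n", NTASKS - 1, results[NTASKS - 1]);
        return 0;
    }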
1.c Task Granularity
• Task granularity: the amount of work associated with a task
• General rule:
  • Coarse-grained: often worse load balance
  • Fine-grained: better load balance, but more overhead, often more communication and contention
[Figure: the same work divided into coarse-grained vs. fine-grained tasks across Processor 1 and Processor 2]
1.d Reducing Serialization
    Speedup < Sequential Work / Max (Work + Synch Wait Time)
• Synchronization for task assignment may cause serialization (for instance, access to a task queue)
[Figure: timelines of Process 1-3 showing work, synchronization points, and synchronization wait time]
Reducing Serialization
• Event synchronization
  • Reduce the use of conservative synchronization
    • point-to-point synchronization instead of barriers
    • finer granularity of access may reduce the synchronization time
  • But fine-grained synchronization is more difficult to program and uses more synchronization operations
• Mutual exclusion
  • Separate locks for separate data (see the sketch after this list)
    • e.g. a lock per task in a task queue, not per queue
    • finer grain => less contention/serialization, more space, less reuse
  • Smaller, less frequent critical sections
    • don't do reading/testing in the critical section, only modification
    • e.g. search for the task to dequeue outside the critical section
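A hedged illustration of the mutual-exclusion advice (the histogram example is invented for illustration, not from the course): one lock per bin instead of one lock for the whole table, and only the modification itself inside the critical section:

    /* Hedged sketch: separate locks for separate data. Updates to different
       bins do not serialize, and the bin index is computed outside the lock. */
    #include <pthread.h>

    #define NBINS 256

    static int             bins[NBINS];
    static pthread_mutex_t bin_lock[NBINS];   /* one lock per bin, not per table */

    void histogram_init(void) {
        for (int i = 0; i < NBINS; i++) {
            bins[i] = 0;
            pthread_mutex_init(&bin_lock[i], NULL);
        }
    }

    void histogram_add(unsigned value) {
        int b = (int)(value % NBINS);         /* compute the bin OUTSIDE the lock */
        pthread_mutex_lock(&bin_lock[b]);     /* critical section: only the update */
        bins[b]++;
        pthread_mutex_unlock(&bin_lock[b]);
    }

    int main(void) {
        histogram_init();
        for (unsigned v = 0; v < 10000; v++)  /* in a real program, many threads call this */
            histogram_add(v);
        return 0;
    }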
2. Reducing Inherent Communication
    Speedup < Sequential Work / Max (Work + Synch Wait Time + Comm Cost)
• Communication is expensive!
• Measure: communication-to-computation ratio
• Inherent communication
  • Determined by the assignment of tasks to processes
  • Actual communication may be larger (artifactual)
• One principle: assign tasks that access the same data to the same process
[Figure: timelines of Process 1-3 showing work, communication, synchronization points, and synchronization wait time]
Domain Decomposition
• Ocean example: communicate with the neighbors, compute in the assigned domain
• Perimeter-to-area communication-to-computation ratio (area-to-volume in 3-d)
  • Depends on n, p: decreases with n, increases with p
Domain Decomposition
• Best domain decomposition depends on information requirements
• Block versus strip decomposition
  • Communication-to-computation ratio: 4*√p / n for block, 2*p / n for strip
  • Block is better
  • Application dependent: strip may be better in other cases
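A small C sketch (grid size and processor counts assumed) that evaluates the two ratios and shows that block decomposition wins as soon as p > 4:

    /* Hedged sketch: comparing the communication-to-computation ratios
       4*sqrt(p)/n (block) and 2*p/n (strip) for an n x n grid. Compile with -lm. */
    #include <stdio.h>
    #include <math.h>

    int main(void) {
        int n = 1024;                          /* n x n grid (assumed) */
        int procs[] = {4, 16, 64, 256};
        for (int i = 0; i < 4; i++) {
            int p = procs[i];
            double block = 4.0 * sqrt((double)p) / n;
            double strip = 2.0 * (double)p / n;
            printf("p = %3d  block = %.4f  strip = %.4f\n", p, block, strip);
        }
        /* Block is lower whenever 4*sqrt(p) < 2*p, i.e. whenever p > 4. */
        return 0;
    }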
Finding a Domain Decomposition
• Goals: load balance and low communication
• Static, by inspection
  • Must be predictable: Ocean
• Static, but not by inspection
  • Input-dependent, requires analyzing the input structure
  • E.g. sparse matrix computations
• Semi-static (periodic repartitioning)
  • Characteristics change, but slowly; e.g. Barnes-Hut
• Static or semi-static, with dynamic task stealing
  • Initial decomposition, but highly unpredictable; e.g. ray tracing
3. Reducing Extra Work
    Speedup < Sequential Work / Max (Work + Synch Wait Time + Comm Cost + Extra Work)
• Common sources of extra work:
  • Computing a good partition
    • e.g. partitioning in Barnes-Hut
  • Using redundant computation to avoid communication
  • Task, data and process management overhead
    • applications, languages, runtime systems, OS
  • Imposing structure on communication
    • coalescing small messages
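A hedged MPI sketch of message coalescing (buffer size and function names are assumptions, not code from the course): one large send replaces many small sends, paying the per-message startup cost only once:

    /* Hedged sketch: coalescing many small messages into one larger message. */
    #include <mpi.h>

    #define NVALS 1024

    void send_coalesced(double *vals, int dest) {
        /* Naive version: NVALS tiny messages, NVALS times the startup cost:
             for (int i = 0; i < NVALS; i++)
                 MPI_Send(&vals[i], 1, MPI_DOUBLE, dest, 0, MPI_COMM_WORLD);   */

        /* Coalesced version: one message carrying all the values. */
        MPI_Send(vals, NVALS, MPI_DOUBLE, dest, 0, MPI_COMM_WORLD);
    }

    int main(int argc, char **argv) {
        MPI_Init(&argc, &argv);
        int rank;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        double buf[NVALS];
        if (rank == 0) {
            for (int i = 0; i < NVALS; i++) buf[i] = i;
            send_coalesced(buf, 1);                    /* run with at least 2 ranks */
        } else if (rank == 1) {
            MPI_Recv(buf, NVALS, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        }
        MPI_Finalize();
        return 0;
    }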
PART II: Memory-aware Optimizations
• So far we have seen the parallel computer as a collection of communicating processors
  • Goals: balance load, reduce inherent communication and extra work
  • We have assumed unlimited memory
• In reality the parallel computer uses a multi-cache, multi-memory system
Memory-oriented View
• Multiprocessor as extended memory hierarchy
• Levels in the extended hierarchy:
  • Registers, caches, local memory, remote memory
  • Glued together by the communication architecture
  • Levels communicate at a certain granularity of data transfer
[Figure: two processors, each with its own cache, L2 and L3 caches, memories and potential interconnects; going down the hierarchy, granularity and access time increase, capacity increases, and cost per unit decreases]
Memory-oriented View
• Performance depends heavily on the memory hierarchy
• Time spent by a program (usually given in cycles):
    Time_prog = Time_compute + Time_access
• Data access time can be reduced by:
  • Optimizing the machine
    • larger caches
    • lower latency
    • larger bandwidth
  • Optimizing the program
    • temporal and spatial locality
Artifactual Communication in the Extended Hierarchy
• Poor allocation of data across distributed memories
  • Data accessed by a node is allocated in the memory of another node, even though it is accessed only by the remote node
• Unnecessary data in a transfer (e.g. the sender sends more than needed)
• Unnecessary transfers due to system granularities (e.g. cache coherence at block granularity)
• Redundant communication of data (e.g. update protocols: data sent at every modification, but needed only once)
• Finite replication capacity (in cache or main memory): replicated data has to be evicted from memory and brought in again later
Replication-induced Artifactual Communication
• Communication induced by finite capacity is the most fundamental artifact
  • Analogous to cache size and miss rate determining memory traffic in uniprocessors
• View as a three-level hierarchy for simplicity
  • Local cache, local memory, remote memory (ignore network topology)
• Classify "misses" in the "cache" at any level as for uniprocessors (the 4 "C"s)
  • compulsory or cold misses (no size effect)
  • capacity misses (size effect): something has to be evicted because of limited space
  • conflict or collision misses (size effect): data that maps onto the same cache blocks
  • communication or coherence misses (no size effect)
Working Set
• Working set: the size of the data that fits in a certain level of the memory hierarchy
• For the grid solver below:
  • 1. A few points for near-neighbor reuse
  • 2. Three sub-rows
  • 3. The whole matrix

    while (!done) do
      diff = 0;
      for i = 1 to n do
        for j = 1 to n do
          temp = A[i,j];
          A[i,j] = 0.2 * (A[i,j] + A[i,j-1] + A[i-1,j] + A[i,j+1] + A[i+1,j]);
          diff += abs(A[i,j] - temp);
        end for
      end for
      if (diff/(n*n) < TOL) then done = 1;
    end while
Working Set Perspective
[Figure: data traffic vs. replication capacity (cache size); the first and second working sets appear as knees in the curve, and traffic divides into cold-start (compulsory) traffic, capacity-generated traffic (including conflicts), inherent communication, and other capacity-independent communication]
• Hierarchy of working sets
• At the first-level cache (fully associative, one-word blocks), inherent to the algorithm
  • working-set curve for the program
• Traffic from any type of miss can be local or nonlocal (communication)
Orchestration for Performance
• Reducing the amount of communication
  • Inherent: change the partitioning (seen earlier)
  • Artifactual: exploit spatial and temporal locality in the memory hierarchy
    • Techniques often similar to those on uniprocessors
• Structuring communication to reduce cost
Reducing Artifactual Communication
• Message passing model
  • Communication and replication are both explicit
  • Even artifactual communication is in explicit messages
• Shared address space model
  • Occurs transparently due to interactions of program and system
  • Used here for explanation
Exploiting Temporal Locality
• Def: reuse of data elements already brought into the cache
• Structure the algorithm so working sets fit into the cache
  • Often the same techniques that reduce inherent communication:
    • assign tasks accessing the same elements to the same processor
    • schedule tasks for data reuse once assigned
• Ocean solver example: blocking
  • Each grid element is accessed 5 times
  • Brought into the cache the first time, then reused
  • Rewrite the loops (see the sketch below)
[Figure: (a) unblocked access pattern in a sweep; (b) blocked access pattern with B = 4]
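A hedged C sketch of the blocked rewrite (array size, block size and the single-sweep structure are assumptions, not the original Ocean code): each B x B block is swept completely so its elements are reused while still in the cache:

    /* Hedged sketch: blocking the solver sweep so a B x B block of the grid
       stays in cache while its elements are reused, instead of streaming
       through whole rows of a large array.                                  */
    #define N 1024
    #define B 64                              /* B*B doubles should fit in cache */

    static double A[N + 2][N + 2];            /* grid with a one-element halo */

    void sweep_blocked(void) {
        for (int ii = 1; ii <= N; ii += B)
            for (int jj = 1; jj <= N; jj += B)
                /* Sweep one B x B block completely before moving on. */
                for (int i = ii; i < ii + B && i <= N; i++)
                    for (int j = jj; j < jj + B && j <= N; j++)
                        A[i][j] = 0.2 * (A[i][j] + A[i][j-1] + A[i-1][j]
                                         + A[i][j+1] + A[i+1][j]);
    }

    int main(void) {
        sweep_blocked();                      /* one blocked sweep over the grid */
        return 0;
    }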
Exploiting Spatial Locality
• Def: when a data element is accessed, its neighbors are accessed as well
• Major spatial-locality-related causes of artifactual communication:
  • Conflict misses
  • Data distribution/layout (allocation granularity)
  • Fragmentation (communication granularity)
  • False sharing of data (coherence granularity)
• Avoiding artifactual communication: keep the data accessed by one processor contiguous
• Fix problems by modifying data structures, or layout/alignment
Spatial Locality Example
• Repeated sweeps over a 2-d grid, each time adding 1 to the elements
• 4-d array to achieve spatial locality (processor row x processor column x line index x column index); see the sketch below
[Figure: contiguity in memory layout for (a) a two-dimensional array, where pages and cache blocks straddle partition boundaries, making it difficult to distribute memory well, versus (b) a four-dimensional array, where each page and cache block lies within a single partition]
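A hedged C sketch of the 4-d layout (process-grid and block sizes are assumed): indexing by processor row, processor column, line and column keeps each partition contiguous in memory:

    /* Hedged sketch: 4-d array layout so each process's sub-block is contiguous. */
    #define PROWS 2              /* process grid: PROWS x PCOLS processes        */
    #define PCOLS 2
    #define BROWS 512            /* each process owns a BROWS x BCOLS sub-block  */
    #define BCOLS 512

    /* grid[pi][pj] is the contiguous block owned by process (pi, pj); pages and
       cache blocks therefore do not straddle partition boundaries.             */
    static double grid[PROWS][PCOLS][BROWS][BCOLS];

    /* One sweep by process (pi, pj): add 1 to every element of its own block. */
    void add_one(int pi, int pj) {
        for (int i = 0; i < BROWS; i++)
            for (int j = 0; j < BCOLS; j++)
                grid[pi][pj][i][j] += 1.0;
    }

    int main(void) {
        add_one(0, 0);           /* in the parallel program each process sweeps its own block */
        return 0;
    }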
Tradeoffs with Inherent Communication
• Partitioning the grid solver: blocks versus rows
  • Blocks have a spatial locality problem on remote data: when accessing the elements of neighboring processors, whole cache blocks are fetched at the column boundary
  • Row-wise partitioning can perform better despite a worse inherent communication-to-computation ratio
[Figure: block partitioning has good spatial locality on nonlocal accesses at row-oriented boundaries but poor spatial locality on nonlocal accesses at column-oriented boundaries]
Example: Performance Impact on the Origin2000
[Figure: performance of the Ocean application and the kernel solver on the SGI Origin2000]