CSE 8163 Parallel & Distributed Scientific Computing Dr. Ioana Banicescu
Parallel algorithms for unstructured & dynamically varying problems
Parallel algorithms for unstructured & dynamically varying problems
• Background and related work
  - motivation (application, model, performance)
  - steps towards the best solution
  - problem classification
  - discussion of various aspects
• Mapping algorithms onto architectures
  - grid-oriented problems (main issues)
  - structured grid-oriented problems
  - unstructured grid-oriented problems
• Conclusions and future work
Motivation
• Unstructured and dynamic problems
  - abstractions of various phenomena
  - astrophysics, molecular biology, chemistry
  - fluid dynamics, electromagnetics, …
  - large, computationally intensive
• Applications
  - solutions to boundary integral equations
    . wave propagation, fluid flow
  - transitions from 3-D to 2-D
  - prediction (Schrödinger's 1st, 2nd)
  - local density approximations
  - large least-squares problems
  - image processing, molecular biology, astrophysics
Motivation (continued)
• Goal: model real events
  - simulations, prediction, impact
  - accurate solutions
• Need high-performance solutions
  - performance evaluations
    . hard to compare (different parameters)
    . metrics: execution time, speedup, efficiency, isoefficiency, convergence rate
  - simple, accurate, fast, effective
Performance of Parallel Algorithms
• Theoretical order analysis (not enough)
  - best solutions restricted to specific applications
  - unrealistic assumptions about resources
    . PE number and speed; problem size
• Empirical testing (optimal solution difficult)
  - need to consider various factors
    . [FlattKen89]: for a given overhead and problem size, the parallel execution time has a minimum at a unique number of processors (see the sketch below)
    . studies conclude: convergence rate and parallel efficiency determine granularity and choice of algorithm
    . [BarrHick93]: graph (best solution versus time)
• Performance degradation (degree of loss)
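A minimal sketch (not from the slides) of the [FlattKen89] observation: with a fixed problem size and an overhead that grows with the number of processors, the modeled execution time T(p) = T_serial/p + overhead(p) has a unique minimizing processor count. The linear overhead model and all constants are assumptions for illustration.

```python
# Illustration of the [FlattKen89] observation: for fixed problem size and a
# per-processor overhead that grows with p, the modeled execution time
#   T(p) = T_serial / p + overhead(p)
# has a unique minimum over the processor count p.
# The linear overhead model (alpha + beta * p) is an assumption for illustration.

def parallel_time(p, t_serial=1000.0, alpha=0.5, beta=0.05):
    """Modeled execution time on p processors."""
    return t_serial / p + alpha + beta * p

def optimal_processor_count(max_p=1024, **model):
    """Exhaustively find the p that minimizes the modeled time."""
    return min(range(1, max_p + 1), key=lambda p: parallel_time(p, **model))

if __name__ == "__main__":
    p_star = optimal_processor_count()
    print(f"minimum modeled time at p = {p_star}: {parallel_time(p_star):.2f}")
```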
Steps towards the best performance solution
• Depends upon the best choice of
  - parallel algorithm (simple, accurate, fast)
  - parallel architecture (fast, effective)
  - mapping (detection, partitioning, scheduling)
• Mapping – goals (minimize factors)
  - computation time (numerical efficiency)
  - communication (communication-to-computation ratio)
  - load imbalance (effective use of resources)
  - overhead (synchronization, communication, scheduling)
    . [identify stochastic behavior – algorithm design]
Steps towards the best performance solution (continued)
• Best performance (choice of parameters)
  - find the dominant component
• Mapping – considerations
  - problem domain distribution (pattern, density, size)
  - interconnection topology, characteristics of each computation
Problem classification
• Pattern of data point distribution
  - structured (uniform)
    . synchronous
  - unstructured (nonuniform)
    . loosely synchronous
    . asynchronous
• Density of data points
  - dense (high density)
  - sparse (low density)
• Solutions: grids (mathematical description, data structure)
  - structured, unstructured, semistructured
Problem Classification – depending on the pattern of the data point distribution
Problem Classification – depending on the density of the data point distribution
Structured Problems
• Synchronous, regular in space & time
• Uniform distribution of data points
• Parallelism easy to detect, express, implement
• Naturally expressed (vector, matrix)
• Compiler: maps constructs, operations
• e.g. QCD simulations, chemistry, …
Unstructured Problems
• Loosely synchronous (irregular in space, regular in time)
• Asynchronous (irregular in space & time)
• Nonuniform distribution of data points
• Irregularities difficult to detect, express
• Irregularities hard to implement (develop)
• Need: flexible hardware, fast communication
Loosely Synchronous Problems
• Dominant methodology in irregular scientific simulation
  - irregular but static, data parallel over a sparse structure
• Irregularity not too hard to detect, express
  - hierarchical data structures, sparse matrices
  - successfully expressing the irregularity: performance gains
• Irregularity hard to implement:
  - high-level data structures, geometric decomposition
• Run better on MIMD than on SIMD
Loosely Synchronous Problems (continued)
• Different data points, distinct algorithms
• E.g. periodic interactions of heterogeneous objects
  - time-driven simulation, statistical physics, particle dynamics
  - adaptive meshes, biology, image processing
  - Monte Carlo, clustering algorithms: N-body problem
• Spatial structure may change dynamically
  - need synchronization at each iteration step
• Need: adaptive algorithms
Asynchronous Problems
• Irregularity hard to detect, express, implement
• Hard to parallelize (unless embarrassingly parallel)
• E.g. event-driven simulation, chess, market analysis
• Irregularities are dynamic (cannot be exploited)
• Cannot use simple mappings (communication, decomposition)
• Object-oriented approach (flexible communication)
• Statistical methods for load balancing
Grid-oriented problems
• Structured grids
  - simple (each node runs the same procedure)
  - low overhead
  - hard to create (complex domains)
• Unstructured grids
  - easy to create, adapt; effective
  - no need to propagate local features
  - large overhead, more storage
• Semistructured grids
  - domain: unstructured union of structured subdomains
  - use [GroppKeyes92], PAFMA [LeaBoa92]
Dense and sparse problems • Structured, unstructured - various approaches
Dense problems
• Matrix problems
  - well- or ill-determined solutions
  - indices not necessarily in linear order
    . Vandermonde, Toeplitz, orthogonal … structure
  - well conditioned: accurate regardless of computation method
  - ill conditioned: can be highly accurate if the algorithm computes small residuals; solution determined by the right-hand side
  - role of ill conditioning and condition estimators
    . [Edel93] paradox
  - improved: multiply, transpose, inverse
Sparse problems
• Undirected weighted graphs
• Compressed rows
  - a vector of rows, each a nested list of (column, value) pairs (see the sketch below)
• Different: multiply, transpose, inverse
• Determine eigenvalues, eigenvectors of a sparse matrix
  - using an O(n) matrix-vector multiply [Yau93]
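A hedged sketch of the compressed-row idea above: each row stores its (column, value) pairs, and the matrix-vector product touches only the stored nonzeros. This is a generic CSR-style multiply, not the [Yau93] algorithm; the example matrix is made up.

```python
# Sketch of a compressed-row sparse matrix: for each row, a list of
# (column, value) pairs.  The matrix-vector product touches only the
# stored nonzeros, so its cost is proportional to nnz rather than n^2.

def csr_matvec(rows, x):
    """rows: list of lists of (col, val) pairs; x: dense vector."""
    y = [0.0] * len(rows)
    for i, row in enumerate(rows):
        for j, a_ij in row:
            y[i] += a_ij * x[j]
    return y

# Example: a 3x3 tridiagonal sparse matrix.
A = [[(0, 2.0), (1, -1.0)],
     [(0, -1.0), (1, 2.0), (2, -1.0)],
     [(1, -1.0), (2, 2.0)]]
print(csr_matvec(A, [1.0, 1.0, 1.0]))   # -> [1.0, 0.0, 1.0]
```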
Dense approaches to sparse problems
• Sparse problems contain dense blocks
• Extract, process the regular structure
  - e.g. the FEBA algorithm for sparse matrix-vector multiplication [Agar92]
• Direct matrix factorization
  - decomposition into smaller dense blocks, loss of performance, communication router [Kratzer92]; mapping QR differs from mapping Cholesky
Sparse approaches to dense problems
• Recently successful, seem counterintuitive
• General methods [Edel93], [Freu92], [Reev93]
  - access the matrix only through matrix-vector multiplication (see the sketch below)
  - look for preconditioners
  - replace O(n²) matrix-vector multiplication with approximations (multipole, multigrid)
• PAFMA – [LeaBoa92]
  - nonuniform problem divided into uniform regions
  - takes advantage of regular communication patterns
  - processor assignment in subregions with density variance; when:
    . load imbalance – assign dense regions
    . communication – assign sparse regions
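A sketch of "access the matrix only through matrix-vector multiplication": the power iteration below sees only a matvec callable, which in the sparse-approach setting could be a multipole or multigrid approximation rather than a dense O(n²) product. The concrete smoothing operator is a hypothetical stand-in, not taken from the cited papers.

```python
import math

# The algorithm never reads matrix entries, only a callable matvec(x); the
# callable may itself be an approximate (multipole, multigrid) operator.
# The smoothing stencil below is an assumption chosen only for the demo.

def power_iteration(matvec, n, iters=200):
    """Estimate the dominant eigenvalue of an n x n operator via matvec only."""
    x = [1.0] * n
    lam = 0.0
    for _ in range(iters):
        y = matvec(x)
        norm = math.sqrt(sum(v * v for v in y))
        x = [v / norm for v in y]
        lam = sum(xi * yi for xi, yi in zip(x, matvec(x)))  # Rayleigh quotient
    return lam

def smoothing_matvec(x):
    """Matrix-free operator y_i = x_i + 0.5*(x_{i-1} + x_{i+1})."""
    n = len(x)
    return [x[i] + 0.5 * ((x[i - 1] if i > 0 else 0.0) +
                          (x[i + 1] if i < n - 1 else 0.0))
            for i in range(n)]

print(power_iteration(smoothing_matvec, 50))  # approx 1 + cos(pi/51), near 2
```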
Stochastic nature of parallel algorithms
• Variability of the algorithm's run-time behavior
  - solution unpredictable, multiple paths
  - path nondeterministic, optimal path chosen at run time; results diverge: number of pivots, solution time, alternate optima
• Race conditions (time-dependent decisions)
  - in the algorithm's design
  - same: problem, strategy, operating conditions
  - alternate optima; different: timing of events, incoming variables, choices, sequence of points traversed
• E.g. self-scheduled nonuniform problems:
  - highly efficient machine utilization
Stochastic nature of parallel algorithms (continued)
• Some examples
  - parallel network simplex: good load balance, variability in run-time behavior
  - branch-and-bound for integer programming
    . good bounds affect the portion of the search tree explored
  - loops without dependencies among their iterations but sensitive to other iterations, the OS, the application
    . Factoring [HumSch,Fly92] scheduling: good load balance, low overhead, scalable, resistant to iteration variance (see the sketch below)
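A simplified sketch of factoring self-scheduling [HumSch,Fly92] as referenced above: iterations are assigned in batches of P chunks whose size shrinks geometrically. The full scheme derives the batch factor from iterate-time variance; the fixed factor of 2 used here is a common simplification and an assumption of this sketch.

```python
import math

# Simplified factoring self-scheduling: iterations are handed out in batches
# of P chunks; every chunk in a batch gets roughly half of the remaining work
# divided by P, so chunks shrink geometrically.  The fixed factor of 2 is a
# simplification of the variance-based factor of the full scheme.

def factoring_chunks(n_iters, n_procs):
    """Return the list of chunk sizes for n_iters iterations on n_procs PEs."""
    chunks, remaining = [], n_iters
    while remaining > 0:
        batch_chunk = max(1, math.ceil(remaining / (2 * n_procs)))
        for _ in range(n_procs):
            if remaining == 0:
                break
            size = min(batch_chunk, remaining)
            chunks.append(size)
            remaining -= size
    return chunks

print(factoring_chunks(100, 4))
# -> [13, 13, 13, 13, 6, 6, 6, 6, 3, 3, 3, 3, 2, 2, 2, 2, 1, 1, 1, 1]
```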
Mapping algorithms onto architectures
• Parallelism detection
  - depends upon the algorithm and the nature of the problem
  - independent of the architecture
  - study data dependencies: explicit, implicit
• Partitioning (problem decomposition)
  - tasks into processes, identify shared objects
• Allocation (distribution of tasks to processors)
  - influenced by memory organization, interconnection
• Scheduling (ordering task execution on processors)
  - depends upon interconnection and PE characteristics
Mapping goals (partitioning, scheduling)
• Minimize communication:
  - exploit locality
• Minimize load imbalance:
  - adaptive refinement
Partitioning
• Granularity of processes coarse enough for the target machine without losing parallelism
• Partitioning technique [Sarkar89]
  - start from an initial fine granularity, use heuristics to merge processes until the coarsest partition is reached; use a cost function (depends upon critical path, overhead)
Scheduling
• Goal: spread the load evenly over the PEs (efficiency)
  - static versus dynamic allocation
    . low overhead, inflexible versus high overhead, flexible (see the sketch below)
  - centralized versus distributed scheduling
    . centralized mediation [SmithSchnabel92]
    . hierarchical mediation strategy
• Note: compiler, automatic tools for partitioning, scheduling
  - run time (flexible, large overhead)
  - compile time (low overhead, needs technology, good if estimates are easy)
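A minimal sketch contrasting the two allocation styles above: static assignment is decided up front (low overhead, inflexible), while a centralized dynamic queue hands the next task to whichever PE becomes idle (flexible, per-task dispatch overhead). The task costs are hypothetical.

```python
from collections import deque

# With nonuniform task costs the dynamic scheme yields a visibly better load
# balance than the static block assignment, at the price of per-task dispatch.

def static_schedule(costs, n_procs):
    """Block-assign tasks to PEs up front (low overhead, inflexible)."""
    loads = [0.0] * n_procs
    for i, c in enumerate(costs):
        loads[i * n_procs // len(costs)] += c
    return loads

def dynamic_schedule(costs, n_procs):
    """Centralized queue: each PE grabs the next task when it becomes idle."""
    queue, loads = deque(costs), [0.0] * n_procs
    while queue:
        idle = min(range(n_procs), key=loads.__getitem__)  # PE finishing first
        loads[idle] += queue.popleft()
    return loads

costs = [1.0] * 20 + [10.0] * 4   # mostly cheap tasks plus a few expensive ones
print(max(static_schedule(costs, 4)), max(dynamic_schedule(costs, 4)))  # 42.0 15.0
```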
The mapping problem
• Assigning tasks to PEs such that Texec is minimal
• Model [HammondSchreiber92]:
  - each PE is given equal work; G = (Vg, Eg) task graph (vertices = processes, edges = inter-process communication); H = (Vh, Eh) PE graph; d = shortest-path distance in H
  - find the surjection φ: Vg → Vh such that the communication load Σ(u,v)∈Eg d(φ(u), φ(v)) is minimized
  - good results need partitioning, allocation, scheduling together (if isolated: poor mappings, non-optimal time)
The mapping problem (continued)
• Mapping based on the communication pattern (efficient strategy) [GuptaSchenfeld93]
  - switching locality (sparse nature of communication graphs)
  - each process switches its communication among a small set of other processes – ICN: PEs grouped into small clusters, intercluster and intracluster connectivity
  - identify this partitioning problem with the bounded l-contraction of a graph [RamKri92] (partitioning of the vertex set into subsets such that no subset contains more than l vertices and every subset has at least as many vertices as the number of subsets it is connected to)
  - simulated annealing; good results (see the sketch below)
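A hedged sketch of the mapping formulation on the previous slide: tasks are assigned to PEs so as to minimize the sum of PE-graph shortest-path distances over task-graph edges, and simulated annealing over swaps keeps the per-PE task counts equal. The example graphs and the cooling schedule are assumptions, not the [GuptaSchenfeld93] setup.

```python
import math, random
from collections import deque

# Minimize the communication load sum over task-graph edges (u, v) of the
# PE-graph shortest-path distance d(phi(u), phi(v)); swapping two assignments
# preserves the balanced (equal work per PE) start.

def all_pairs_bfs(adj):
    """Hop distances d[u][v] of an unweighted PE graph given as adjacency lists."""
    n, dist = len(adj), []
    for s in range(n):
        d = [math.inf] * n
        d[s], q = 0, deque([s])
        while q:
            u = q.popleft()
            for v in adj[u]:
                if d[v] == math.inf:
                    d[v] = d[u] + 1
                    q.append(v)
        dist.append(d)
    return dist

def comm_cost(task_edges, mapping, d):
    return sum(d[mapping[u]][mapping[v]] for u, v in task_edges)

def anneal_mapping(task_edges, n_tasks, pe_adj, steps=20000, t0=2.0):
    d = all_pairs_bfs(pe_adj)
    n_pes = len(pe_adj)
    mapping = [i % n_pes for i in range(n_tasks)]        # balanced start
    cost = comm_cost(task_edges, mapping, d)
    for step in range(steps):
        t = t0 * (1.0 - step / steps) + 1e-9              # linear cooling
        i, j = random.sample(range(n_tasks), 2)
        mapping[i], mapping[j] = mapping[j], mapping[i]   # swap keeps balance
        new = comm_cost(task_edges, mapping, d)
        if new <= cost or random.random() < math.exp((cost - new) / t):
            cost = new
        else:
            mapping[i], mapping[j] = mapping[j], mapping[i]  # undo the swap
    return mapping, cost

# Tiny example: an 8-task ring mapped onto a 4-PE ring.
tasks = [(i, (i + 1) % 8) for i in range(8)]
ring4 = [[1, 3], [0, 2], [1, 3], [2, 0]]
print(anneal_mapping(tasks, 8, ring4))
```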
Mapping of grid-oriented problems
• Grid points geometrically adjacent
  - partition the grid into subdomains (subgrids)
  - assign them to PEs; each PE performs the computation associated with its subdomain
• Dependency & communication restricted to the perimeters of subdomains
• Model [RooseDries93] – structured grid:
  - time integration of a finite-difference / finite-volume discretization of a PDE on a structured grid
  - 2-D structured grid partitioned into subdomains of equal size
  - each PE performs the updates for its interior grid points locally
  - boundary grid points need neighbor information
Analysis (communication, load balance, numerical efficiency)
• Communication overhead depends upon:
  - size of subdomains (large subdomains, small overhead); 2-D perimeter-to-surface, 3-D surface-to-volume ratio
    . communication volume proportional to the subdomain perimeter: Tcomm = tstartup + n·tsend
    . for a fixed-size subdomain the perimeter is minimal if it is square
    . blockwise partitioning into square subregions leads to minimum communication volume (see the sketch below)
    . communication requirements are not always isotropic
  - machine characteristics
    . influence the communication-to-computation ratio
  - problem characteristics
    . dense problems: load imbalance predominates
    . sparse problems: communication predominates
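A small sketch of the communication model above for a blockwise partition of an N × N structured grid: each boundary exchange costs tstartup + n·tsend with n proportional to the subdomain edge length, so square subdomains minimize the exchanged volume. The constants tstartup and tsend are illustrative assumptions.

```python
# Per-iteration boundary-exchange time for one interior subdomain of an
# N x N grid partitioned into px x py blocks: each neighboring side costs one
# message (t_startup) plus the ghost cells along that side (t_send each).

def comm_time_per_step(N, px, py, t_startup=50.0, t_send=1.0):
    sub_x, sub_y = N / px, N / py                 # subdomain edge lengths
    sides_x = 2 if px > 1 else 0                  # left/right neighbors
    sides_y = 2 if py > 1 else 0                  # top/bottom neighbors
    messages = sides_x + sides_y
    volume = sides_x * sub_y + sides_y * sub_x    # ghost cells exchanged
    return messages * t_startup + volume * t_send

N = 1024                                          # 64 PEs in both layouts
print("strips (64 x 1):", comm_time_per_step(N, 64, 1))   # larger exchange
print("blocks (8 x 8): ", comm_time_per_step(N, 8, 8))    # square, smaller
```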
Analysis (continued)
• Load imbalance:
  - minimal when block partitioning into square subdomains
  - work per grid point may vary (cannot predict load imbalance)
  - mathematical models differ in various parts of the domain
  - boundary regions: work differs from interior cells; cells distributed almost equally among processors, achieved when subdomains are square
Analysis (continued)
• Numerical efficiency:
  - accuracy of results differs (depending on the algorithm, the nature of the problem)
  - numerical properties of inherently parallel algorithms are not affected by the partitioning strategy (e.g. Jacobi relaxation, as opposed to Gauss-Seidel, which has better convergence properties – see the sketch below)
  - Runge-Kutta: update overlapping regions only after a complete integration step (omitting some communication); high speedup, small convergence degradation; the number of blocks is determined by the number of PEs and not by the domain geometry
  - block tridiagonal systems (Thomas: sequential, pipelined); parallel solvers use Gaussian elimination, cyclic reduction: distribution of sequential parts over PEs at the expense of increased communication
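A brief sketch of why partitioning leaves Jacobi's numerics untouched while it matters for Gauss-Seidel: Jacobi reads only the previous iterate, so the subdomain update order is irrelevant, whereas Gauss-Seidel updates in place and its (better) convergence depends on the sweep order a partition imposes. The 1-D Poisson example and iteration count are illustrative.

```python
# Jacobi reads only the old iterate, so splitting the grid among PEs in any
# order reproduces the sequential sweep exactly; Gauss-Seidel updates in place
# and therefore depends on the sweep order that a partition imposes.

def jacobi_sweep(u, f, h):
    """One Jacobi sweep on a 1-D grid; reads only the old iterate u."""
    new = u[:]
    for i in range(1, len(u) - 1):
        new[i] = 0.5 * (u[i - 1] + u[i + 1] + h * h * f[i])
    return new

def gauss_seidel_sweep(u, f, h):
    """One Gauss-Seidel sweep; uses already-updated left neighbors in place."""
    u = u[:]
    for i in range(1, len(u) - 1):
        u[i] = 0.5 * (u[i - 1] + u[i + 1] + h * h * f[i])
    return u

n, h = 17, 1.0 / 16
f = [1.0] * n                       # -u'' = 1 with zero boundary values
u_j = u_gs = [0.0] * n
for _ in range(100):
    u_j, u_gs = jacobi_sweep(u_j, f, h), gauss_seidel_sweep(u_gs, f, h)
print(max(u_j), max(u_gs))          # Gauss-Seidel is closer to the solution
```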
Analysis (continued)
• Numerical efficiency (continued)
  - domain decomposition algorithms for PDEs (contain algorithmic overhead)
  - Schwarz domain decomposition: overlapping subdomains (iterative process, approximate boundaries)
  - Schur complement: non-overlapping subdomains (borders computed first)
Hierarchical nature of multigrid algorithms
• Obtaining acceptable performance levels requires optimization techniques that address the characteristics of each architectural class [MathesonTarjan93]
• Model study for each architectural class
• Conclusions:
  - fine-grain machines (high, variable communication cost)
    . optimize the domain-to-PE-topology mapping
  - medium-grain machines (high, fixed communication cost)
    . optimize the domain partition (well shaped, minimal perimeter, small number of neighbors)
• Parallel algorithms require accurate subspace decomposition and a large number of PEs to provide practical alternatives to standard algorithms
Optimal partitioning: structured grids
• Structured problems (use structured grids)
• Unstructured problems (structured on subdomains)
• Regular computation on a 2-D mesh, MIMD [Lee90]:
  - workload and communication pattern the same at each point
  - communication depends upon: total communication required, actual pattern of communication (number of communicating neighbors, underlying architecture)
• CCR (computation-to-communication ratio) an important factor in performance evaluation (given a stencil, generate the best shape)
  - CCR depends upon: stencil, partition shape, grid (e.g. diamond – 5-point, hexagon – 7-point, square – 9-point star)
  - maximum CCR does not guarantee minimum execution time; CCR is a better performance indicator in shared memory than in message passing; CCR proportional to the aspect ratio of the partition
Optimal partitioning: structured grids (continued)
• Optimal partitioning [ReedAdamsPatrick87]
  - formalize the relationship: stencil, shape, underlying architecture
  - isolated evaluation of the components yields suboptimal performance
  - message passing: small versus large packets give opposite orderings of results (stencil, shape)
  - type of interconnection network is important (the grid-to-network mapping must support the interpartition communication pattern – otherwise performance degrades)