Programming Paradigms and Algorithms W+A 3.1, 3.2, p. 178, 5.1, 5.3.3, Chapter 6, 9.2.8, 10.4.1, Kumar 12.1.3 1. Berman, F., Wolski, R., Figueira, S., Schopf, J. and Shao, G., "Application-Level Scheduling on Distributed Heterogeneous Networks," Proceedings of Supercomputing '96 (http://apples.ucsd.edu) CSE 160/Berman
Common Parallel Programming Paradigms • Embarrassingly parallel programs • Workqueue • Master/Slave programs • Monte Carlo methods • Regular, Iterative (Stencil) Computations • Pipelined Computations • Synchronous Computations CSE 160/Berman
Regular, Iterative Stencil Applications • Many scientific applications have the format:
Loop until some condition is true
  Perform a computation which involves communicating with the N, E, W, S neighbors of a point (5-point stencil)
  [Convergence test?]
CSE 160/Berman
Stencil Example: Jacobi2D • The Jacobi algorithm, also known as the method of simultaneous corrections, is an iterative method for approximating the solution to a system of linear equations. • Jacobi addresses the problem of solving n linear equations in n unknowns, Ax=b, where the ith equation is a_i1 x_1 + a_i2 x_2 + … + a_in x_n = b_i, or alternatively x_i = (b_i − Σ_{j≠i} a_ij x_j) / a_ii • The a's and b's are known; we want to solve for the x's CSE 160/Berman
Jacobi 2D Strategy • The Jacobi strategy iterates until the computation converges to the solution, i.e. at each iteration we solve x_i(k) = (b_i − Σ_{j≠i} a_ij x_j(k−1)) / a_ii where the values from the (k−1)st iteration are used to compute the values for the kth iteration • For important classes of problems, Jacobi converges to a "good" solution after O(log N) iterations [Leighton] • Typically, the solution is approximated to a desired error threshold CSE 160/Berman
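For concreteness, here is a minimal sketch of one Jacobi sweep for a dense n×n system, written in C (illustrative only; the function name jacobi_sweep and the dense-matrix layout are assumptions, not from the slides):

#include <stddef.h>

/* One Jacobi sweep for an n x n system Ax = b (C99).
 * x_old holds the (k-1)st iterate; x_new receives the kth iterate.
 * Assumes all diagonal entries A[i][i] are non-zero. */
void jacobi_sweep(size_t n, const double A[n][n], const double b[n],
                  const double x_old[n], double x_new[n])
{
    for (size_t i = 0; i < n; i++) {
        double sum = 0.0;
        for (size_t j = 0; j < n; j++)
            if (j != i)
                sum += A[i][j] * x_old[j];   /* off-diagonal contributions */
        x_new[i] = (b[i] - sum) / A[i][i];   /* new estimate for unknown i */
    }
}

In practice the caller repeats the sweep, swapping x_old and x_new, until the change between iterates falls below the desired error threshold.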
Jacobi 2D • The equation is most efficient to solve when most of the a's are 0 • When most of A's entries are non-zero, A is dense • When most of A's entries are 0, A is sparse • Sparse matrices are regularly found in many scientific applications. CSE 160/Berman
Laplace's Equation • The Jacobi strategy can be used effectively to solve sparse linear systems. • One such equation is Laplace's equation: ∂²f/∂x² + ∂²f/∂y² = 0 • f is solved over a 2D space having coordinates x and y • If the distance between points (D) is small enough, the second derivatives can be approximated by ∂²f/∂x² ≈ [f(x+D,y) − 2f(x,y) + f(x−D,y)] / D² and ∂²f/∂y² ≈ [f(x,y+D) − 2f(x,y) + f(x,y−D)] / D² • These equations reduce to f(x,y) = ¼ [f(x−D,y) + f(x+D,y) + f(x,y−D) + f(x,y+D)] CSE 160/Berman
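Spelling out the reduction step (standard finite-difference algebra):

\[
\frac{f(x+D,y) - 2f(x,y) + f(x-D,y)}{D^2} + \frac{f(x,y+D) - 2f(x,y) + f(x,y-D)}{D^2} = 0
\;\Longrightarrow\;
f(x,y) = \tfrac{1}{4}\bigl[\, f(x-D,y) + f(x+D,y) + f(x,y-D) + f(x,y+D) \,\bigr]
\]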
Laplace's Equation • Note the relationship between a point (x,y) and its neighbors (x−D,y), (x+D,y), (x,y−D), (x,y+D) • This forms a 4-point stencil • Any update will involve only local communication! CSE 160/Berman
Solving Laplace using the Jacobi strategy • Note that in the Laplace equation we want to solve for all f(x,y), which has 2 parameters • In Jacobi, we want to solve for x_i, which has only 1 index • How do we convert f(x,y) into x_i? • Associate the x_i's with the f(x,y)'s by distributing them in the f 2D matrix in row-major (natural) order • For an n×n matrix there are then n×n x_i's, so the A matrix will need to be (n×n) × (n×n) CSE 160/Berman
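A small sketch of the row-major mapping in C (illustrative only; the function names are not from the slides), assuming the n×n grid of unknowns is indexed from 0:

/* Row-major (natural) ordering of the n x n grid of unknowns. */
int grid_to_index(int row, int col, int n) { return row * n + col; }  /* f(row,col) -> x_i  */
int index_to_row(int i, int n)             { return i / n; }          /* x_i -> grid row    */
int index_to_col(int i, int n)             { return i % n; }          /* x_i -> grid column */

With this ordering, the N, S, W, E neighbors of x_i are x_{i−n}, x_{i+n}, x_{i−1}, and x_{i+1} (when they lie in the interior).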
Solving Laplace using the Jacobi strategy • When the x_i's are distributed in the f 2D matrix in row-major (natural) order, the update f(x,y) = ¼ [f(x−D,y) + f(x+D,y) + f(x,y−D) + f(x,y+D)] becomes x_i = ¼ (x_{i−n} + x_{i−1} + x_{i+1} + x_{i+n}) for interior points CSE 160/Berman
Working backward • Now we want to work backward to find out what the A matrix and b vector will be for Jacobi • Our solution to the Laplace equation gives us equations of the form x_i = ¼ (x_{i−n} + x_{i−1} + x_{i+1} + x_{i+n}) • Rewriting, we get 4x_i − x_{i−n} − x_{i−1} − x_{i+1} − x_{i+n} = 0 • So the b_i are 0; what is the A matrix? CSE 160/Berman
Finding the A matrix • Each row has at most 5 non-zero entries • All entries on the diagonal are 4 • N=9, n=3: CSE 160/Berman
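For N = 9 (n = 3), and assuming zero boundary values so that b = 0, the matrix is the standard block-tridiagonal 2D Laplacian, written out here for reference (the off-diagonal −1's pick out the W, E, N, S neighbors):

     4 -1  0 -1  0  0  0  0  0
    -1  4 -1  0 -1  0  0  0  0
     0 -1  4  0  0 -1  0  0  0
    -1  0  0  4 -1  0 -1  0  0
     0 -1  0 -1  4 -1  0 -1  0
     0  0 -1  0 -1  4  0  0 -1
     0  0  0 -1  0  0  4 -1  0
     0  0  0  0 -1  0 -1  4 -1
     0  0  0  0  0 -1  0 -1  4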
Jacobi Implementation Strategy • An initial guess is made for all the unknowns, typically x_i = b_i • New values for the x_i's are calculated using the iteration equations • The updated values are substituted into the iteration equations and the process repeats • The user provides a "termination condition" to end the iteration. • An example termination condition is error < threshold. CSE 160/Berman
Data Parallel Jacobi 2D Pseudo-code

[Initialize ghost regions]
for (i=1; i<=N; i++) {
    x[0][i]   = north[i];
    x[N+1][i] = south[i];
    x[i][0]   = west[i];
    x[i][N+1] = east[i];
}

[Initialize matrix]
for (i=1; i<=N; i++)
    for (j=1; j<=N; j++)
        x[i][j] = initvalue;

[Iterative refinement of x until values converge]
while (maxdiff > CONVERG) {

    [Update x array]
    for (i=1; i<=N; i++)
        for (j=1; j<=N; j++)
            newx[i][j] = ¼ (x[i-1][j] + x[i][j+1] + x[i+1][j] + x[i][j-1]);

    [Convergence test]
    maxdiff = 0;
    for (i=1; i<=N; i++)
        for (j=1; j<=N; j++) {
            maxdiff = max(maxdiff, |newx[i][j] - x[i][j]|);
            x[i][j] = newx[i][j];
        }
}
CSE 160/Berman
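In a distributed-memory implementation, the ghost regions above would be refreshed at every iteration by exchanging boundary rows with neighboring processors. A minimal MPI sketch for a strip (row-wise) decomposition, shown only as an illustration (the data layout and function name are assumptions, not course code):

#include <mpi.h>

/* Exchange ghost rows with north/south neighbors for a strip decomposition.
 * x is a (rows+2) x (N+2) array stored row-major in a flat buffer; rows 0 and
 * rows+1 are ghost rows.  Pass MPI_PROC_NULL for a missing neighbor. */
void exchange_ghost_rows(double *x, int rows, int N,
                         int north, int south, MPI_Comm comm)
{
    int rowlen = N + 2;
    /* Send first interior row north; receive south ghost row from the south neighbor. */
    MPI_Sendrecv(&x[1 * rowlen],          rowlen, MPI_DOUBLE, north, 0,
                 &x[(rows + 1) * rowlen], rowlen, MPI_DOUBLE, south, 0,
                 comm, MPI_STATUS_IGNORE);
    /* Send last interior row south; receive north ghost row from the north neighbor. */
    MPI_Sendrecv(&x[rows * rowlen],       rowlen, MPI_DOUBLE, south, 1,
                 &x[0],                   rowlen, MPI_DOUBLE, north, 1,
                 comm, MPI_STATUS_IGNORE);
}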
Jacobi2D Programming Issues • Synchronization • Should we synchronize between iterations? Between multiple iterations? • Should we tag information and let the application run asynchronously? (How bad can things get?) • How often should we test for convergence? • How important is it to know when we’re done? • How expensive is it? CSE 160/Berman
Jacobi2D Programming Issues • Block decomposition or strip decomposition? • How big should the blocks or strips be? • How should blocks/strips be allocated to processors? [Figures: Block, Uniform Strip, Non-uniform Strip] CSE 160/Berman
HPF-Style Data Decompositions • 1D (Processors P0 P1 P2 P3, tasks 0-15) • Block decomposition (Task i allocated to processor floor(i/b), where b = n/p is the block size; here b = p = 4) • Cyclic decomposition (Task i allocated to processor i mod p) • Block-cyclic decomposition (Block i allocated to processor i mod p) [Figures: Block, Cyclic, Block-cyclic] CSE 160/Berman
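A small sketch of the three 1D mappings in C (illustrative only; the function names are not from the slides), for n tasks, p processors, and block size b:

/* Owner of task i under the three HPF-style 1D decompositions. */
int block_owner(int i, int n, int p)        { return i / ((n + p - 1) / p); }  /* block size b = ceil(n/p) */
int cyclic_owner(int i, int p)              { return i % p; }                  /* round-robin over tasks   */
int block_cyclic_owner(int i, int b, int p) { return (i / b) % p; }            /* round-robin over blocks  */

For the example above (n = 16, p = 4), block_owner assigns tasks 0-3 to P0, 4-7 to P1, and so on, while cyclic_owner assigns task i to P(i mod 4).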
HPF-Style Data Decompositions • 2D • Each dimension partitioned by block, cyclic, block-cyclic or * (do nothing) • Useful set of uniform decompositions can be constructed [Block, Block] [Block, *] [* , Cyclic] CSE 160/Berman
Jacobi on a Cluster • If each partition of Jacobi is executed on a processor in a lab cluster, we can no longer assume we have dedicated processors and network • In particular, the performance exhibited by the cluster will vary over time and with load • How can we go about developing a performance-efficient implementation in a more dynamic environment? CSE 160/Berman
Jacobi AppLeS • We developed an AppLeS application scheduler • AppLeS = Application-Level Scheduler • AppLeS is a scheduling agent which integrates with the application to form a "Grid-aware", adaptive, self-scheduling application • We targeted the Jacobi AppLeS at a distributed, clustered environment CSE 160/Berman
How Does AppLeS Work? • AppLeS + application = self-scheduling application • The agent proceeds through resource discovery, resource selection, schedule planning and performance modeling, a decision model, and schedule deployment • Along the way it narrows accessible resources down to feasible resource sets, evaluates candidate schedules, and deploys the "best" schedule onto the resources • Resource information comes from the Grid infrastructure and the NWS
Network Weather Service (Wolski, U. Tenn.) [NWS architecture: sensor interface, reporting interface, and a forecaster drawing on multiple models] • The NWS provides dynamic resource information for AppLeS • The NWS is a stand-alone system • The NWS monitors the current system state and provides a best forecast of resource load from multiple models
Jacobi2D AppLeS Resource Selector • Feasible resources are determined according to an application-specific "distance" metric • Choose the fastest machine as the locus • Compute the distance D from the locus based on a unit-sized, application-specific benchmark: D[locus,X] = |comp[unit,locus] − comp[unit,X]| + comm[W,E columns] • Resources are sorted according to distance from the locus, forming a desirability list • Feasible resource sets are formed from initial subsets of the sorted desirability list • Next step: plan a schedule for each feasible resource set • The scheduler will choose the schedule with the best predicted execution time
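A small illustrative sketch of forming the desirability list in C (the data layout and names here are assumptions, not AppLeS code):

#include <stdlib.h>
#include <math.h>

typedef struct {
    const char *name;
    double comp_unit;   /* predicted time for the unit-sized benchmark on this machine */
    double comm_cols;   /* predicted time to exchange W/E ghost columns */
    double dist;        /* application-specific distance from the locus */
} Resource;

static int by_dist(const void *a, const void *b)
{
    double da = ((const Resource *)a)->dist, db = ((const Resource *)b)->dist;
    return (da > db) - (da < db);
}

/* Compute each machine's distance from the locus and sort; the sorted array
 * is the desirability list, whose prefixes form the feasible resource sets. */
void build_desirability_list(Resource r[], int nres, const Resource *locus)
{
    for (int i = 0; i < nres; i++)
        r[i].dist = fabs(locus->comp_unit - r[i].comp_unit) + r[i].comm_cols;
    qsort(r, nres, sizeof r[0], by_dist);
}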
Jacobi2D Performance Model and Schedule Planning • Execution time for the ith strip is modeled as computation time scaled by available CPU plus communication time, roughly T_i = comp_i / load_i + comm_i, where load = predicted percentage of CPU time available (NWS) and comm = time to send and receive messages, factored by predicted BW (NWS) • AppLeS uses time-balancing to determine the best partition on a given set of resources: solve for strip sizes such that T_1 = T_2 = … = T_p over the chosen processors (e.g. P1, P2, P3)
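A minimal sketch of the time-balancing idea in C (illustrative only; this ignores communication time and is not the AppLeS implementation):

/* Allocate N grid rows across p processors in proportion to their predicted
 * available CPU (load[i], e.g. from NWS forecasts), so that the predicted
 * per-processor compute time is roughly equal. */
void time_balance_rows(int N, int p, const double load[], int rows[])
{
    double total = 0.0;
    for (int i = 0; i < p; i++)
        total += load[i];

    int assigned = 0;
    for (int i = 0; i < p; i++) {
        rows[i] = (int)(N * load[i] / total);   /* proportional share of rows */
        assigned += rows[i];
    }
    rows[p - 1] += N - assigned;                /* fold any rounding remainder into the last strip */
}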
Jacobi2D Experiments • Experiments compare • Compile-time block [HPF] partitioning • Compile-time irregular strip partitioning [no NWS forecasts, no resource selection] • Run-time strip AppLeS partitioning • Runs for different partitioning methods performed back-to-back on production systems • Average execution time recorded • Distributed UCSD/SDSC platform: Sparcs, RS6000, Alpha Farm, SP-2
Jacobi2D AppLeS Experiments • Representative Jacobi 2D AppLeS experiment • Adaptive scheduling leverages the deliverable performance of a contended system • The spike occurs when a gateway between the PCL and SDSC goes down • Subsequent AppLeS experiments avoid the slow link