More on Adaptivity in Grids Sathish S. Vadhiyar Source/Credits: Figures from the referenced papers
Predicting the Cost and Benefit of Adapting Data Parallel Applications in Clusters – Jon Weissman • Deals with 3 adaptation techniques in response to resource or load changes • Migration • Involves remote process creation followed by transmission of the old worker's data to the new worker • Dynamic load balancing • Collecting load indices, determining the redistribution and initiating data transmission • Addition or removal of processors • Followed by data transmission to maintain load balance • A prediction framework to decide automatically which adaptation technique is best at a given time for given resource conditions
Terms for Prediction • N – problem size; P – number of processors • i – current iteration • Esize – size in bytes of a data element • DM(N, P, i) – amount of data that must be moved at iteration i to achieve load balance • Tcomm(N, P, j, i) – cost of the ith communication step for the jth worker; Tcomp – cost of the ith computation step; Texec = Tcomm + Tcomp • Tpc – cost of creating a remote process • Tdm(B) – cost of transferring B bytes of data • Tnn – cost of establishing new worker neighbors
Cost of adaptive methods • Migration from a processor • Involves process creation and data movement • Tmigrate(N,P,i) = Tpc + Tdm(Esize DM(N,P,i)) + Tnn • Addition of a processor • Each worker sends a fraction of its data to worker 0 • Worker 0 distributes the collected data to the new worker • Tadd(N,P,i) = max{Tdm(Esize DM(N,P,i)), Tpc} + Tdm(Esize DM(N,P+1,i)) + Tnn
Cost of adaptive methods • Removal of a processor • The leaving worker's data is sent to worker 0 • This data is distributed across the remaining workers • Tremove(N,P,i) = 2Tdm(Esize DM(N,P,i)) + Tnn • Dynamic load balancing – moving data from overloaded to underloaded workers • Load indices of each worker are collected by worker 0 • Worker 0 computes the redistribution and collects data from each overloaded worker • Worker 0 transmits data to each underloaded worker to achieve load balance • Tdlb(N,P,i) = 2Tdm(Esize DM(N,P,i))
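Putting the four formulas together, here is a minimal Python sketch of the cost model (an illustration, not the paper's code; the function and parameter names, and the idea of passing Tdm and DM in as callables, are assumptions):

```python
# Minimal sketch of the four adaptation cost formulas (times in ms).
# t_pc, t_nn are measured constants; t_dm(B) is the measured cost of moving B bytes;
# dm(N, P, i) is the data-movement term DM(N, P, i) defined on the next slide.

def t_migrate(N, P, i, esize, t_pc, t_nn, t_dm, dm):
    """Migration: remote process creation + moving the old worker's data."""
    return t_pc + t_dm(esize * dm(N, P, i)) + t_nn

def t_add(N, P, i, esize, t_pc, t_nn, t_dm, dm):
    """Addition: collection at worker 0 overlaps with process creation,
    then the collected data is sent to the new worker."""
    return max(t_dm(esize * dm(N, P, i)), t_pc) + t_dm(esize * dm(N, P + 1, i)) + t_nn

def t_remove(N, P, i, esize, t_nn, t_dm, dm):
    """Removal: send the leaving worker's data to worker 0, then redistribute it."""
    return 2 * t_dm(esize * dm(N, P, i)) + t_nn

def t_dlb(N, P, i, esize, t_dm, dm):
    """Dynamic load balancing: collect excess data at worker 0, then redistribute."""
    return 2 * t_dm(esize * dm(N, P, i))
```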
Contd… • All of the above costs depend on DM(N,P,i) • L(j,i) – load index of the jth worker at iteration i • D(j,i) – amount of data held by the jth worker at iteration i • DM(N,P,i) is computed from the L(j,i) and D(j,i) values as the total amount of data that must move from overloaded to underloaded workers to restore load balance
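The exact equation for DM is not reproduced on the slide; as a hedged sketch of the idea only, one way to compute it is to give each worker an ideal share inversely proportional to its load index and sum up the excess held by overloaded workers:

```python
# Hedged sketch of DM(N, P, i): the total data that must leave overloaded workers
# so that each worker's share matches its capacity. Weighting by 1 / L(j, i) is an
# illustrative assumption, not the paper's exact equation.

def dm(D, L):
    """D[j] = data held by worker j; L[j] = load index of worker j (higher = more loaded)."""
    total = sum(D)
    capacity = [1.0 / l for l in L]                  # relative speed under load
    cap_sum = sum(capacity)
    ideal = [total * c / cap_sum for c in capacity]  # load-balanced target per worker
    return sum(max(0.0, d - t) for d, t in zip(D, ideal))
```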
Experiment Setup • Applications • Iterative Jacobi solver • Gaussian elimination • Gene sequence comparison • Tnn = 0 (establishing new neighbors is inexpensive) • A few test programs were used to determine the cluster-specific cost constants Tpc and Tdm • Each application was run with 3 problem sizes on 3 processor configurations to determine the application-specific constants Tcomp, Tcomm, and Texec
STEN • 5-point stencil iterative Jacobi solver • 1D communication topology where the processor division is across rows • Size of each data element, Esize – 8 bytes • Tcomm(N,P,j,i) = 5 + 0.00219 x 8N ms • 5 ms latency, 0.00219 per-byte transfer cost, 8N-byte message size • Tcomp(N,P,j,i) = 0.000263 x 5N D(j,i) L(j,i) ms • 5 floating-point operations per element; 0.000263 ms to update a single element on an unloaded machine • Texec = Tcomp + Tcomm; Tpc = 330 ms • Tdm(M) = 1 + 0.00103M ms • 1 ms latency; 0.00103 per-byte transfer cost
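A small sketch of the STEN per-iteration model with the measured constants above plugged in (function names are illustrative; times in ms):

```python
# STEN per-iteration model with the slide's measured constants (times in ms).
ESIZE = 8  # bytes per data element

def sten_t_comm(N):
    return 5 + 0.00219 * ESIZE * N           # 5 ms latency + per-byte cost on an 8N-byte message

def sten_t_comp(N, D_ji, L_ji):
    return 0.000263 * 5 * N * D_ji * L_ji    # 5 flops per element, scaled by held data and load

def sten_t_exec(N, D_ji, L_ji):
    return sten_t_comm(N) + sten_t_comp(N, D_ji, L_ji)

def sten_t_dm(M):
    return 1 + 0.00103 * M                   # 1 ms latency + per-byte transfer cost
```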
GE with partial pivoting • Row-cyclic decomposition of the matrix • Master-slave broadcast topology for pivot exchange • Esize – 8 bytes, N iterations, Tnn = 0 • Tcomm(N,P,j,i) = 1.14 + 0.00114 x (N-i)P ms • Tcomp(N,P,j,i) = 0.000335(N-i)D(j,i)L(j,i) • N-i entries are modified • Texec = Tcomp + Tcomm ; Tpc = 55 ms • Tdm(M) – same as previous
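The corresponding sketch for GE, using the constants above (illustrative names; times in ms):

```python
# GE per-iteration model; iteration i modifies only the last N - i entries per row.
def ge_t_comm(N, P, i):
    return 1.14 + 0.00114 * (N - i) * P      # pivot exchange through the master-slave broadcast

def ge_t_comp(N, i, D_ji, L_ji):
    return 0.000335 * (N - i) * D_ji * L_ji  # update of the N - i remaining entries, scaled by load

def ge_t_exec(N, P, i, D_ji, L_ji):
    return ge_t_comm(N, P, i) + ge_t_comp(N, i, D_ji, L_ji)
```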
CL (CompLib) • Biology application that classifies protein sequences • Compares a source library of sequences to a target library of sequences (a string-matching problem) • Parallel implementation – the target library is decomposed across workers in a load-balanced fashion • In a single iteration, each worker compares all target sequences it is assigned to one source sequence • The amount of computation in each worker depends on the size of the source sequence and on the sizes and number of target sequences
CL (Contd…) • N – total number of target sequence blocks • D(j,i) – number of target sequence blocks stored in the jth worker • Each block – 5000 bytes • seq(i) – size of the source sequence • Data transferred – the master sends the source sequence to each worker, and results are sent back • Tcomm(N,P,j,i) = 1.14P + 0.00130(seq(i) + 180D(j,i)) • 1.14P – latency, 0.00130 – per-byte transfer cost, 180 bytes – comparison score for each target sequence • Tcomp(N,P,j,i) = 0.00000424 D(j,i) 5000 seq(i) L(j,i) • 0.00000424 – cost of an integer comparison • Tpc = 550 ms • Tdm(M) – same as above
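And the CL model as a sketch (illustrative names; seq_i is the source-sequence size at iteration i, D_ji the number of target blocks held by worker j, L_ji its load index; times assumed to be in ms as in the other models):

```python
# CL per-iteration model; each target block holds 5000 bytes of sequence data.
def cl_t_comm(P, seq_i, D_ji):
    # per-worker latency of 1.14 ms, plus the source sequence out and a
    # 180-byte comparison score back for each assigned target block
    return 1.14 * P + 0.00130 * (seq_i + 180 * D_ji)

def cl_t_comp(seq_i, D_ji, L_ji):
    return 0.00000424 * D_ji * 5000 * seq_i * L_ji   # per-comparison cost, scaled by load

def cl_t_exec(P, seq_i, D_ji, L_ji):
    return cl_t_comm(P, seq_i, D_ji) + cl_t_comp(seq_i, D_ji, L_ji)
```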
Sensitivity to Increasing Load • Differing loads added to a machine • Migration provides benefits with increasing load
Sensitivity to Load Introduction Times • Migration benefit decreases with increasing load injection times
Dynamically Choosing the Best Adaptive Method • Adaptive run-time system that chooses the best adaptive method for the current resource conditions • Automated adaptive method selection in response to two events • Addition of a new processor • Presence of external CPU load
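The framework compares predicted cost against predicted benefit; the decision rule below is only an illustrative sketch of that idea, with hypothetical names, not the paper's exact policy:

```python
# Illustrative selection loop: for each applicable adaptation, compare its one-time
# cost against the predicted reduction in remaining execution time, and apply the
# one with the largest net benefit.

def choose_adaptation(candidates):
    """candidates: list of (name, cost_ms, predicted_remaining_savings_ms)."""
    best, best_gain = None, 0.0
    for name, cost, savings in candidates:
        gain = savings - cost
        if gain > best_gain:
            best, best_gain = name, gain
    return best  # None means "do nothing"

# Example: migration costs 800 ms but saves 5 s of remaining time; adding a
# processor costs 1.2 s and saves 3 s.
print(choose_adaptation([("migrate", 800, 5000), ("add", 1200, 3000)]))  # -> migrate
```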
Addition of free nodes • * represents the number of nodes picked by the adaptive method
Adaptation due to load events • The prediction-based approach gives better results
Fault-Tolerance, Malleability and Migration for Divide-and-Conquer Applications on the Grid Wrzesinska et al.
Fault-Tolerance, Malleability and Migration for Divide-and-Conquer Applications on the Grid • 3 general classes of divisible applications • Master-worker paradigm – 1 level • Hierarchical master-worker grid system – 2 levels • Divide-and-conquer paradigm – allows the computation to be split up in a general way, e.g. search algorithms, ray tracing etc. • This work deals with mechanisms for handling processors that leave • Handling partial results from leaving processors • Handling orphan work • 2 cases of processors leaving • When processors leave gracefully (e.g. when a processor reservation comes to an end) • When processors crash • Restructuring the computation tree
Introduction • Divide-and-conquer • Recursive subdivision; after solving the subproblems, their results are recursively combined until the final solution is reached • Work is distributed across processors by work stealing • When a processor runs out of work, it picks another processor at random and steals a job from its work queue • After computing the job, the result is returned to the originating processor • Uses a work-stealing algorithm called CRS (Cluster-aware Random Stealing) that overlaps intra-cluster steals with inter-cluster steals
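As a rough illustration of the work-stealing idea (a simplified single-process sketch with assumed Job/Worker classes, not the actual Satin/CRS implementation, and without CRS's overlap of intra- and inter-cluster steals):

```python
# Simplified sketch of random work stealing (illustrative only).
import random
from collections import deque

class Job:
    def __init__(self, job_id):
        self.job_id = job_id
        self.owner = None        # processor the result must be returned to
        self.restarted = False   # set when the job is re-queued after a leave

class Worker:
    def __init__(self, wid):
        self.wid = wid
        self.queue = deque()     # own jobs; the owner works on the front
        self.stolen_jobs = []    # (job, thief id) pairs, consulted when processors leave

def steal(thief, victim):
    """Steal the oldest (largest) job from the back of the victim's queue."""
    if victim.queue:
        job = victim.queue.pop()
        job.owner = victim
        victim.stolen_jobs.append((job, thief.wid))
        return job
    return None

def find_work(me, workers):
    """An idle worker tries random victims until it finds a job."""
    victims = [w for w in workers if w is not me]
    random.shuffle(victims)
    for v in victims:
        job = steal(me, v)
        if job is not None:
            return job
    return None
```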
Malleability • Adding a new machine to a divide-and-conquer computation is simple • The new machine starts stealing jobs from other machines • When a processor leaves, the computation tree is restructured to reuse as many partial results as possible • How leaving processors are detected • The remaining processors are notified by the leaving processor (when processors leave gracefully) • The leave is detected by the communication layer (when processors leave unexpectedly)
Recomputing jobs stolen by leaving processors • Each processor maintains a list of the jobs stolen from it and the processor IDs of the thieves • When processors leave • Each of the remaining processors traverses its stolen-jobs list and searches for jobs stolen by the leaving processors • Such jobs are put back in the work queues of their owners, marked as "restarted" • Children of "restarted" jobs are also marked as "restarted" when they are spawned
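Continuing the sketch above (still an illustration under the same assumed Worker/Job classes), the re-queueing step could look like this:

```python
# Re-queue jobs stolen by processors that have left, marking them "restarted".
def handle_leaving(worker, left_ids):
    still_stolen = []
    for job, thief_id in worker.stolen_jobs:
        if thief_id in left_ids:
            job.restarted = True                 # children spawned later inherit this flag
            worker.queue.appendleft(job)         # back into the owner's work queue
        else:
            still_stolen.append((job, thief_id))
    worker.stolen_jobs = still_stolen
```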
Orphan Jobs • Jobs stolen from leaving processors • In existing approaches, a processor working on an orphan job must discard the result, since it does not know where to return it • To salvage an orphan, the processor needs to know the new address to which the result should be returned • Salvaging orphan jobs therefore requires re-creating the link between the orphan and its restarted parent
Orphan Jobs (Contd…) • For each finished orphan job • A small message containing the jobID of the orphan and the processorID of the processor that computed it is broadcast • Unfinished intermediate nodes of orphan subtrees are aborted • The (jobID, processorID) tuples are stored by each processor in a local orphan table
Orphan Jobs (Contd…) • When a processor is about to recompute a "restarted" job • It performs a lookup in its orphan table • If the jobIDs match, the processor removes the job from its work queue and puts it in its list of stolen jobs • A message is sent to the orphan owner requesting the result of the job • The orphan owner marks the job as stolen by the sender of the request • The link between the restarted parent and the orphaned child is thus restored • Reusing orphans improves the performance of the system
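A sketch of that lookup step, under the same assumptions as the previous snippets; orphan_table maps jobIDs to the processor that finished the orphan, and send_result_request stands in for the actual message send:

```python
# Before recomputing a "restarted" job, check the local orphan table (illustrative).
def try_reuse_orphan(worker, job, orphan_table, send_result_request):
    owner_id = orphan_table.get(job.job_id)
    if owner_id is None:
        return False                               # no saved result: recompute normally
    worker.queue.remove(job)                       # do not recompute the job
    worker.stolen_jobs.append((job, owner_id))     # treat it as stolen by the orphan's owner
    send_result_request(owner_id, job.job_id)      # ask the owner to send back the result
    return True
```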
Partial Results on Leaving Processors • If a processor knows it has to leave: • Chooses another processor randomly • Transfers all results of finished jobs to the other processor • The jobs are treated as orphan jobs • Processor receiving the finished jobs broadcasts a (jobID, processorID) tuple • Partial results linked to the restarted parents
Special Cases • Master leaving – a special case; the master owns the root job, which was not stolen from anyone • The remaining processors elect a new master, which respawns the root job • The new run reuses partial results of orphan jobs from the previous run • Adding processors • A new processor downloads an orphan table from one of the other processors • Orphan-table requests are piggybacked on steal requests • Message combining • One small broadcast message has to be sent for each orphan and for each finished job on a leaving processor • These messages are combined
Results • 3 types of experiments • Overhead when no processors are leaving • Comparison with the traditional approach that does not save orphans • Demonstration that the mechanism can be used for efficient migration of the computation • Testbeds • DAS-2 system – 5 clusters at five Dutch universities • European GridLab testbed – 24 processors at 4 sites in Europe • 8 in Leiden and 8 in Delft (DAS-2) • 4 in Berlin • 4 in Brno
Overhead during normal Execution • 4 applications on a system with and without their mechanisms • RayTracer, TSP, SAT solver, Knapsack problem • Overhead is negligible
Impact of Salvaging Partial Results • RayTracer application • 2 DAS-2 clusters with 16 processors each • One cluster was removed in the middle of the computation, i.e. after half of the time the run would take on 2 clusters without processors leaving • Comparison of • The traditional approach (without saving partial results) • Recomputing trees when processors leave unexpectedly • Recomputing trees when processors leave gracefully • Runtime on 1.5 clusters (16 processors in one cluster and 8 processors in the other) • The difference between the last two gives the overhead of transferring the partial results from the leaving processors and the work lost because of the leaving processors
Migration • Replaced one cluster with another • Raytracer application on 3 clusters • In the middle of the computation, one cluster was gracefully removed, and another identical cluster added • Comparison without migration • Overhead of migration – 2%
References • Jon B. Weissman, "Predicting the cost and benefit of adapting data parallel applications in clusters," Journal of Parallel and Distributed Computing, Volume 62, Issue 8 (August 2002), pp. 1248–1271 • G. Wrzesinska et al., "Fault-Tolerance, Malleability and Migration for Divide-and-Conquer Applications on the Grid," Proceedings of the 19th IEEE International Parallel and Distributed Processing Symposium (IPDPS 2005), 04–08 April 2005
Predicting the Cost and Benefit of Adapting Data Parallel Applications in Clusters – Jon Weissman • Library of adaptation techniques • Migration • Involves remote process creation followed by transmission of the old worker's data to the new worker • Dynamic load balancing • Collecting load indices, determining the redistribution and initiating data transmission • Addition or removal of processors • Followed by data transmission to maintain load balance • Library calls to detect and initiate adaptation actions within the applications • The adaptation event is sent from an external detector to all workers