1 / 25

Fault-Tolerant Mechanism for Hierarchical Branch and Bound Algorithm

Fault-Tolerant Mechanism for Hierarchical Branch and Bound Algorithm. N. Melab and E-G. Talbi {melab,talbi}@lifl.fr. Bendjoudi ahcene.bendjoudi @gmail.com. U niversité A / M ira de B éjaïa CE ntre de R echerche sur l’ I nformation S cientifique et T echnique CERIST.

hayley
Download Presentation

Fault-Tolerant Mechanism for Hierarchical Branch and Bound Algorithm

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. Fault-Tolerant Mechanism for Hierarchical Branch and Bound Algorithm N. Melab and E-G. Talbi {melab,talbi}@lifl.fr • Bendjoudi • ahcene.bendjoudi@gmail.com Université A/Mira de Béjaïa CEntre de Recherche sur l’Information Scientifique et Technique CERIST Laboratoire d’InformatiqueFondamentale de Lille

  2. Outline • Motivations • FTH-B&B: Fault Tolerant hierarchical Branch and Bound algorithm • Performance Evaluation • Conclusion and Perspectives

  3. Motivations • Combinatorial Optimization Problems are Np-Hard and CPU-Intensive • Their exact resolution requires a huge amount of computing resources • Branch and Bound B&B algorithms are the best known methods • Implicit enumeration of the search space • They use a pruning technique to reduce the search space  however they remain not sufficient for very large instances • High performance computing (Computational Grids) • Computational Grids offer large amount of computing resources  Issue: Scalability • Traditional Master/Worker-based B&B (MW-B&B) algorithms are inefficient • Master process becomes rapidly bottleneck  Solution: Hierarchical Master/Worker-based B&B (HMW-B&B) 8

  4. Branch and Bound Algorithm • Branching • Original problem is partitioned into smaller sub problems. • Bounding • Compute the estimated optimal solution (lower bound) of the considered problem and compare to the best known solution (upper bound). • Elimination • Identify nodes which don’t lead to The best solution and eliminate them • Selection • Exploration strategy (best first search, depth first search, best bound,…) P0 Problem to be solved Selection Branching Sub-problem Elimination 8 Solutions Optimal solution

  5. Motivations • Resources of computational grids are highly unreliable and volatile  Issue: Fault-tolerance  Fault tolerance at middleware level: ProActive, XtremWeb, Condor, …  They are costly in terms of execution time  Solution: Application level  fault-tolerant algorithm  Fault detection • Task recovery • Minimization of Redundant work • Maintenance of the hierarchy

  6. Outline • Motivations • FTH-B&B: Fault Tolerant Hierarchical Branch and Bound algorithm • Architecture • Work management with task recovery • Maintenance of the hierarchy • Distributed checkpointing • Performance Evaluation • Conclusion et Perspectives

  7. Architecture • Based on HMW paradigm dealing with Scalability & Fault tolerance • Several FT MW-B&Bs organized hierarchically into groups • Each FT MW-B&B is composed of: • Single Master (on the top and inner nodes of the hierarchy): parallel recursive branching • Workers (Leaves): parallel exploration or Masters (inner nodes): parallel branching • Each master owns a single work pool  Multiple work pools  Collegial strategy Multiple work pools Masters Group of masters and workers Fault Tolerant M/W-based B&Bs Workers

  8. Fault detection • A master heartbeats its workers • Workers heartbeat their master Master Heartbeats Workers

  9. Work management & task recovery • Each master manages: • A list of assigned sub-problems (mapping between the assigned sub-problem and the child process assigned to it) • A list of branched sub-problems LBS (sub-problems being explored by its children) • When a failure is detected  only the unexplored sub-problems of the lost root problem in the LBS are rescheduled to another safe process PB PC PD PA Work pool List of branched sub-problems LBS R PA B A PA2 PA1

  10. 3-phase communication mechanism • Phase 1 (between a master and its children): • A master assigns a problem to its children • It receives back the branched sub-problems • Phase 2 (between a master, its children, and its parent): • A master updates explored sub-problems in the LBS each time a child finishes the exploration of its sub-problem. •  the master knows at any time the unexplored parts of a given problem. • Phase 3 (between a master and a new free process): • When a process fails • The parent of the failed process detects its failure and saves the unexplored part of its sub-problem • When a new safe process connects, the parent reschedules it the unexplored part of the sub-problem PB PC PD PA Connection of a new worker R PA B A PA1 PA2

  11. Maintenance of the hierarchy • Motivations • Failures can cause a loss of data and/or computing power • Master’s failures can isolate some parts of processes from the rest of the hierarchy and cause orphan branches • Loss of processes can cause unsafe execution of the algorithm (non termination, redundant work, ...) • Failures can unbalance the hierarchy if only the parent of the failed processes is involved in the fault recovery process • New bottlenecks can be created on parts of the hierarchy • Proposed techniques • Simple connection to Ascendants SCA • Master Election ME • Balanced Hierarchy BH

  12. Simple connection to Ascendants (SCA) • Children of the failed master connect to the closest safe ascendant • Each process holds the list of its ascendants • No process reassignment • When a process receives a connection from an orphan process, it considers it as a new child Drawback: • Risk of creation of new bottlenecks • If f children fail, the safe master receives gxf connections from its grandchildren • If l levels of masters fail, it receives gl+1 connections • Convergence to MW-B&B

  13. Master Election (ME) • Orphan processes elect a new master among them using the bully algorithm • The new selected master considers the other electors as its children • It connects to its closest safe ascendant using SCA • It is informed by its new parent about its neighbors

  14. Balanced Hierarchy (BH) • Orphan processes migrate to a safe sub-B&B • The migration strategy respects the maximum group size threshold • Assigns orphan processes to a non-full sub-B&B • A master receiving orphan processes: • Holds them only if the group it manages is non-full • Otherwise, it dispatches them among its children • Avoids the convergence to a MW

  15. Distributed Checkpointing • Several levels of checkpointing • Each sub-MW-B&B performs distributed checkpointing independently • There is no global backup file and each master handles its own local backup file • A master saves unexplored sub-problems and updates its file each time a new sub-problem is explored. • A new considered master checks the buck-up file • Reconstitutes the unexplored sub-problems • Carries on the last consistent global state. Masters Clusters Workers Backup Files

  16. Performance Evaluation (Objectives) • Evaluate the ability of FTH-B&B to deal with Fault Tolerance issue • Measure the efficiency of FTH-B&B in terms of execution time • Measure the benefit of task recovery on FTH-B&B in terms of minimizing the redundant work • Measure the impact of failures on the hierarchy of FTH-B&B using (SCA, ME, and BH)

  17. Performance Evaluation (Environment) • FTH-B&B has been experimented on the Flowshop Scheduling Problem FSP (Taillard instances) • Application developed on top of the ProActive middleware • GRID’5000 : https://www.grid5000.fr • Between 1900 and 8900 processes have been deployed • To obtain deeper hierarchy, several processes have been deployed on a same host and the size of the sub-B&Bs is fixed to 10 • Failures are obtained killing processes chosen uniformly during their lifetime

  18. The Flow-Shop Scheduling Problem M1 M2 M3 • N jobs to be scheduled on M machines • Each machine can not be simultaneously assigned to two jobs (colors) • Jobs (colors) must be scheduled in the same order on all machines • One objective must be minimized • Cmax: Makespan (Total completion time) 4 jobs on 3 machines

  19. Efficiency of FTH-B&B • Efficiency = ratio between the exploration time and the total execution time including Idle times. • Idle time = communication time + 3-phase time + management time • 3-phase mechanism is not costly in terms of execution time • 3-phase consumes between 0,20% – 0,40% of the total execution time • The workers of FTH-B&B spend on average 98,92% of their time solving sub-problems

  20. Minimizing Redundant work • The masters of FTH-B&B reschedule less sub-problems when using 3-phase mechanism  less redundant work • On average when using the 3-phase mechanism the algorithm is 6,36 times more efficient.

  21. Maintenance of the hierarchy using SCA • The figure presents the four most loaded masters in the hierarchy • Degree of a process = number of children • x-axis = Time • y-axis = degree of masters • Degrees increase with the number of failed processes • The root-master process rapidly becomes a bottleneck (passes from 10 children at the beginning of the execution to 1368 children at the end)  Convergence to MW paradigm

  22. Maintenance of the hierarchy using ME • The masters are less loaded than compared with SCA • When a new master is designated, it receives the highest identifier  it will not be designated in the next election session. • Nevertheless, the average number of children has doubled after the failure of 25% of processes.

  23. Maintenance of the hierarchy using BH • The figure shows the average degree masters in each level of the hierarchy • Up to 8 masters are killed every 1 second • The whole masters resist to the failures and do not become bottlenecks • The root masters maintains the same degree during all the lifetime • The degrees decreases softly for the 3 first levels and more rapidly for the last level caused by the new converted masters that have only one worker as a children

  24. Conclusion • FTH-B&B: Application level Fault tolerant algorithm • Several FT-MW-based B&Bs are launched hierarchically • 3-phase communication mechanism (between a master, its children, and its grand-children) • Distribute, store, and recover work units in case of failures • Minimizing redundant work • Efficiency = 98% • 6,36 times more efficient than without using 3-Phase mechanism • 3 strategies to maintain the hierarchy • SCA, ME, and BH • BH has proved its efficiency in terms of balancing the hierarchy • Distributed checkpointing mechanism • Several checkpointing levels

  25. Perspectives • Take into account best effort  exploit the maximum number of computing resources • Improve the realism of FTH-B&B  use a realistic model of failures based on real statistics of grids

More Related