480 likes | 554 Views
Parallelization of Stochastic Metaheuristics to Achieve Linear Speed-ups while Maintaining Quality. Course Project Presentation : Mustafa Imran Ali Ali Mustafa Zaidi. Outline of Presentation. Brief Introduction Motivation for this problem Simulated Annealing (SA) Simulated Evolution (SimE)
E N D
Parallelization of Stochastic Metaheuristics to Achieve Linear Speed-ups while Maintaining Quality Course Project Presentation: Mustafa Imran Ali Ali Mustafa Zaidi
Outline of Presentation • Brief Introduction • Motivation for this problem • Simulated Annealing (SA) • Simulated Evolution (SimE) • Related Work • Classification of Parallel Strategies • Previous efforts • Our Efforts and Results • Low-Level Parallelization of Both Heuristics • Domain-Decomposition Parallelizations of SimE • Multithreaded-Search implementations of SA and SimE • Conclusions Drawn, Future Work
Motivation • Stochastic Heuristics are utilized to solve a wide variety of combinatorial optimization problems • VLSI CAD, Operations Research, Network Design etc. • These heuristics attempt to find near-optimal solutions by performing an ‘intelligent search’ of the ‘solution space’ • Each algorithm has a built in ‘intelligence’ that allows it to move towards successively better solutions. • But they usually require large execution times to achieve near optimal solutions.
Motivation • Parallelization of Heuristics aims to achieve one of two basic goals: • Achieve Speedup • Achieve Better Quality Solutions • Parallelization of Heuristics different from Data and Functional parallelism: • Different strategies alter the properties of basic algorithm in different ways • Thus, Quality of solution achievable is sensitive to parallelization strategy • Key is to select/develop parallelization strategy that has maximum positive impact on heuristic • With respect to the goals we are trying to achieve: • For better quality solutions, strategy should enhance algorithmic “intelligence”. • For better runtimes, strategy should focus on reducing work-done per processor.
Our Objectives • Our goal is to explore, implement and develop parallelization strategies that allow us to: • Achieve near-linear speedup (or best effort) • Without sacrificing quality (fixed constraint) • Speedup trends should be as scalable as possible
Approaches taken and Considerations • In order to effectively consider this problem, we must look at it in terms of several aspects: • The nature of the Heuristics themselves • The nature of the Parallel Environment • The nature of the Problem Instance and Cost Functions • All three factors influence both the • Runtime, and • Achievable Solution Quality of any parallelization strategy. • For our task, we must optimize all three factors.
Introduction to Simulated Annealing • Basic Simulated Annealing Algorithm • With Metropolis Loop
Introduction to Simulated Evolution • Basic Simulated Evolution Algorithm
Related Work • Classification of Parallelization Strategies for Metaheuristics [ref] • Low-Level Parallelization • Domain Decomposition • Parallel or Multithreaded Search • Previous work done for Parallel SA • Extensive – general as well as problem specific • All three approaches tried, • Type 3 most promising – still active area of research • Previous work done for Parallel SimE • Minimal – Type 2 parallelization strategy proposed, both by designers of SimE
Our Efforts • Starting Point for our work this semester: • Basic version of Type 3 Parallel SA • Basic version of Type 2 Parallel SimE • Parallelization of SA: • Several Enhanced versions of Type 3 • Implementation of Type 1 • Type 2 not implemented because… • Parallelization of SimE • Several Enhancements to Basic Type 2 • Implementation of Type 1 • Implementation of Type 3
Basic Type 3 Parallel SA • Basic Type 3 Parallel SA: • Based on Asynchronous Multiple Markov Scheme developed in [ref] • Best Type 3 scheme developed for SA to date. • Primarily intended for improving solution qualities achievable over serial version
Type 3 Parallel SA – Second Version • Strategy 2 - Speed-up Oriented Type 3 Parallel SA • From the above starting point, we saw that high-quality characteristics of basic Type 3 may be exploited to produce a speed-up oriented version • Expected to be Capable of achieving quality equivalent to serial version, but not better • While providing near-linear runtimes • Near-linear runtimes forced by dividing workload on each processor by number of processors • M/p metropolis iterations instead of M.
Speed-up Oriented Type 3 Parallel SA • Speedup oriented Type 3 Parallel SA • Results consistently show a 10% drop in achievable solution quality from the Serial version • Runtimes show near-linear speedups, as expected
Lessons Learned from Strategy 2 • We reasoned that • The 10% quality drop occurs due to negative impact on the ‘intelligence’ of the parallel SA. • To restore achievable quality, we must counteract this effect. • Intelligence of SA: • Lies in the “Cooling Schedule” [ref Dr. Sait’s book] • Division of workload directly tampers with the Cooling Schedule. • Proposed Solution: • Attempt to optimize the Cooling Schedule to take into account the parallel environment – a “Multi-Dimensional” Cooling Schedule. • To maintain Speedup, M/p remained unchanged, while other parameters varied across processors
Third Version of Type 3 Parallel SA • Strategy 3 – Varying Parameter settings across different processors • Expected that this would result in more effective search of the solution space. • Several sets of parameter settings were tried • Primarily by Varying and T across processors • Processors with higher T and lower would perform more random search • Processors with lower T and higher would be more greedy • Intermittent sharing of information should diversify search from same position. • Results obtained from these versions: • NO IMPROVEMENT OF QUALITY OVER STRATEGY 2 • Show results (hadeed_1)
Lessons Learned from Strategy 3 • Based on the last two attempts, we reasoned that: • M/p drastically reduces time spent searching for better solutions at a given temperature • Lower temperatures achieved quicker, • Adverse effect on hill-climbing property of Heuristic. • Simple division of M/p inadequate for sustaining quality. • What to do next? • Develop techniques that minimize adverse effect on algorithmic intelligence. • Two different tracks possible • Type 1 parallel SA • Further Study Runtime vs Quality trends of Type 3 Parallel SA
What to do next? • Type 1 Parallel SA • Pros: • Leaves algorithmic intelligence intact • Solution quality guaranteed. • Cons: • High communication frequency adversely affects runtime • Environmental factors come into play. • Further explore dynamics of Type 3 • Pros: • Low communication frequency • Suitable for cluster environment (show chart) • Cons: • Uncharted territory - progress not assured
Type 1 Parallel SA • Low-level parallelization: • Divide Cost Computation Function • Our cost function has 3 parts: • Wirelength, Power, Delay • First two are easy to divide • Division of Delay computation posed a significant challenge • Too many replicated computations across processors negated benefits of division. • Eventually had to be excluded
Results of Type 1 Parallel SA • Performance of Type 1 Parallel SA: • Abysmal!! • Reasons: • 2 collective communications per iteration • Amount of work divided is small compared to communication time • Communication delay increases with increasing processors • (Show Chart) • Found to be completely unsuitable for • MIMD-DM environment, and • Our problem instance.
Further Exploring Type 3 Parallel SA • To improve our achievable quality of our Type 3 parallel SA: • In depth study of the impact of parameter M on achievable solution quality. • All experiments first attempted for the Serial Version, then replicated to the parallel version • Based on what we know of Type 3 Parallelization schemes, see how any new lessons can be incorporated into it.
Observations • We see that initially, fastest improvement in quality appears with smallest M • However, Quality saturates earlier with smaller M. • Thus it might be beneficial to increase M as time progresses. • But by mow much? How to minimize runtime.
Learning From Experiments • Another observation helps: • Until saturation, rate of improvement nearly constant (per metropolis calls) • This applicable for all runs for a given circuit. • Thus best way to minimize time while sustaining quality improvement: • Set value of M adaptively such that the average rate of improvement remains constant. • New Enhancement to Serial SA. • Since Type 3 parallel SA improves faster than serial version, it is expected that some speedup will be observed • Experiments still under way – parameter tuning being done
Preliminary results • For 7 processors in parallel
Preliminary Results • Results are similar to the original implementation • Enhancement to serial SA mitigates the observed benefits of the parallel version • Although further parameter tuning/code refinements may improve parallel results even more.
Conclusions Drawn from Experiments • Type 3 parallelization of SA is most robust • Minimum susceptibility to environment and problem instance • Type 1 fails due to unsuitability to environment and to problem instance • For Parallel SA, direct trade off exists between achievable solution quality and speedup • Depending on quality desired, linear speedup possible. • For highest quality, speed-up diminishes to 1 in most cases. • Further experimentation needed to verify these points.
Base Work for SimE • Type II (Domain Decomposition) implementation • Problem: Poor Runtimes with Quality Degradation • Improvement Proposed: Use of Random Row Allocation over the Fixed Row Allocation used in previous work • Results: Improvement of Quality over Fixed Row Allocation but still short of Serial Quality
Type II Quality Issues • Observation: Parallel Quality will always lag behind serial quality • Reason: Division of Solution into Domains restricts optimum cell movement (worsens with more processors/partitions) • Focus on improving runtime!
Type II Runtime Improvements • How? • Reduce Communication time • Reduce Frequency of Communications • Reduce Amount of Data Communicated • Overlap Communication with Computation • Reduce Computations • Can Workload be still better divided? • How will workload division affect communication requirements?
Type II SimE Algorithm Structure Communication Cost Computations
Type II Communication Optimization • No potential for Computation Communication Overlap • Implicit Barrier synchronization at communication points • Possibility of Reducing Communication Frequency Over Multiple Iterations • Do multiple operations (E,S,A) on assigned partition before communicating • Impact on Solution Quality due to accumulated errors • Actual Impact on Solution Quality vs. runtime improvement not presently quantified (future work)
Type II Communication Optimization(2) • Reducing Communication Frequency • by combining gather & broadcast operation into 1 MPI call • Efficiency of collective call not too superior in MPI implementation used • Reduce Data Communicated per call • Significant impact on runtime due to barrier synchronization • Essentially compressing placement data! • Tradeoff: Added computations in data preprocessing and post processing step
Type II Computation Optimization • Can workload be better divided? • Evaluation, Selection and Allocation already localized (divided) • What about cost computation? • Goodness Evaluation needs costs computed • Dependencies across partitions • Delay computations over long paths • Spans over partitions • Wire length & Power Computation • Potential for cost computation division
Wirelength & Power Cost Division • More independent of other partitions than delay computations • Effect of computation division can be readily evaluated for wire length and power (within the limited time constraints) • Tradeoff: Added Computation • Partial Costs Need to be communicated to Master • Additional communication phase added per iteration! • Actual benefit over non-division: Results will tell!
Results & Observations • Effect of communication worsens speed-ups with increasing processors • communication time over shadows gain of computation division • Optimizing communication resulted in greater gains than cost computation division • Dependencies do not allow much gains
Targeting Better Solution Qualities • Type II parallelization has poor quality issues • How to maintain quality? • Type I Parallelization • Same Quality as Serial Implementation guaranteed • Speed-ups governed by gains resulting from division of cost-computation • Type III Parallelization • Can we benefit from parallel co-operating searches?
Type I Parallelization • Target as much computation division without dividing SimE algorithm’s core “intelligence” • Computations that don’t affect intelligence • Cost computations • Goodness evaluations • Computations that affect intelligence • Selection function • Allocation function
Type I Parallelization • Again, wire length & power easier to divide than delay (same reasoning as for Type II) • Workload division achieved by dividing the cells among PE • Each PE computes costs and goodness for a fraction of cells • Division of cells done at beginning & fixed • Slave PEs communicate goodness values to master PE which does Selection & Allocation
Type I Parallelization Results • Runtimes not drastically reduced • Allocation takes significant portion of overall computation time • Speed-ups again limited by communication • Not by data communicated but due to more overheads with increased participants
Type III Parallelization • Parallel Cooperating searches communicating best solutions through a master • Modeled after Parallel SA • A Metropolis loop can be equated with a SimE compound move • Intelligence of SimE can benefit from best solution produced among all • May lead to more rapid convergence to better quality solutions
Type III Parallelization Results • Results quite contrary to expectations • Highest quality inferior than serial algorithm • No speed-ups observed • Parallel Searches used as is doesn’t benefit • possible explanation: greedy behavior of parallel searches is counter productive to SimE intelligence (inability to escape local minima)
Improving Type III Parallelization • Proposed schemes for improved SimE parallel searches • varying frequency of communication with time • Initially exchanges can be frequent and with frequency decreasing to allow more diversification • Intelligently combining good elements in each solution to get a new starting solution • Best location for each cell (or a cluster) can be identified by examining/comparing goodness values among solutions received from peers and constructing a good solution to improve further