Robust Asynchronous Optimization for Volunteer Computing Grids. Travis Desell, Malik Magdon-Ismail, Boleslaw Szymanski , Carlos Varela , Heidi Newberg, Nathan Cole. Department of Computer Science Department of Physics, Applied Physics and Astronomy Rensselaer Polytechnic Institute
E-Science 2009, Oxford, UK, December 12, 2009
Overview Introduction Motivation Driving Scientific Application Asynchronous Genetic Search Why asynchronous? Methodology Recombination Particle Swarm Optimization Generic Optimization Framework Approach Architecture Results Convergence Rates Re-computation rates Conclusions & Future Work Questions?
Motivation Scientists need easily accessible distributed optimization tools Distribution is essential for scientific computing Scientific models are becoming increasingly complex Rates of data acquisition are far exceeding increases in computing power Traditional optimization strategies not well suited to large scale computing Lack scalability and fault tolerance
Astro-Informatics What is the structure and origin of the Milky Way galaxy? Observing from inside the Milky Way provides 3D data: the Sloan Digital Sky Survey has collected over 10 TB of data. We can determine its structure, which is not possible for other galaxies. Very expensive: evaluating a single model of the Milky Way with a single set of parameters can take hours or days on a typical high-end computer. Models determine where different star streams are in the Milky Way, which helps us better understand its structure and how it was formed.
Generic Optimization Framework Separation of Concerns • Distributed Computing • Optimization • Scientific Modeling “Plug-and-Play” • Simple & generic interfaces
Two Distribution Strategies Asynchronous evaluations Results may not be reported or reported late No processor dependencies Faults can be ignored Grids & Internet Single parallel evaluation Always uses most evolved population Can use traditional methods Faults require recalculation Grids require load balancing Supercomputers & Grids
Asynchronous Architecture Scientific Models Search Routines Data Initialisation Integral Function Integral Composition Likelihood Function Likelihood Composition Evolutionary Methods Genetic Search Particle Swarm Optimisation … Initial Parameters Optimised Parameters Work Request Work Results Work Request Work Results Evaluator (1) Evaluator (N) … Evaluator Creation BOINC (Internet) SALSA/Java (RPI Grid) Distributed Evaluation Framework
GMLE Architecture (Parallel-Asynchronous) Search Routines Communication Layer BOINC - HTTP Grid - TCP/IP Supercomputer - MPI Work Request Work Results Work Request Work Results Worker (1) Worker (Z) Combine Results Combine Results Distribute Parameters Distribute Parameters … MPI MPI Evaluator (1) Evaluator (2) Evaluator (N) Evaluator (1) Evaluator (2) Evaluator (M) … …
Issues With Traditional Optimization Traditional global optimization techniques are evolutionary, but they are iterative and each step depends on the previous one Current population is used to generate the next population Dependencies and iterations limit scalability and impact performance With volatile hosts, what if an individual in the next generation is lost? Redundancy is expensive Scalability limited by population size
Asynchronous Optimization Strategy Use an asynchronous methodology No dependencies on unknown results No iterations Continuously updated population N individuals are generated randomly for the initial population Fulfil work requests by applying recombination operators to the population Update population with reported results
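The strategy above can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the class name `AsyncSearch`, the `recombine` callback, the parameter bounds and the convention that higher fitness is better are all assumptions made for the example.

```python
import random


class AsyncSearch:
    """Hypothetical sketch of an asynchronous population-based search."""

    def __init__(self, pop_size, dims, recombine, lo=-1.0, hi=1.0, rng=None):
        self.pop_size = pop_size
        self.dims = dims
        self.recombine = recombine  # assumed callable(population, rng) -> params
        self.lo, self.hi = lo, hi   # assumed parameter bounds
        self.rng = rng or random.Random()
        self.population = []        # (fitness, params) pairs, best first
        # N random individuals seed the search; they still need evaluating.
        self.initial = [self._random_params() for _ in range(pop_size)]

    def _random_params(self):
        return [self.rng.uniform(self.lo, self.hi) for _ in range(self.dims)]

    def request_work(self):
        """Answer a work request without waiting on any outstanding result."""
        if self.initial:
            return self.initial.pop()
        if len(self.population) < 2:
            # Too few results reported so far to recombine; hand out a random point.
            return self._random_params()
        return self.recombine(self.population, self.rng)

    def report_result(self, params, fitness):
        """Insert a reported result and keep only the best pop_size members."""
        self.population.append((fitness, list(params)))
        self.population.sort(key=lambda member: member[0], reverse=True)
        del self.population[self.pop_size:]
```

Because `request_work` never blocks on a particular result, a lost or late result only means that one candidate is never inserted; nothing has to be recomputed.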
Asynchronous Search Strategy [Diagram] Workers: request work, receive work, report results and update the population. Population: Fitness (1..n) with Parameter Set (1..n). Work Queue: Unevaluated Parameter Set (1..m); new members are generated from the population, and more work is requested when the queue is low.
Asynchronous Genetic Search Operators (1) Average Simple operator for continuous problems Generated parameters are the average of two randomly selected parents Mutation Takes a parent and generates a mutation by randomly selecting a parameter and mutating it
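A minimal sketch of these two operators. The slide does not say how the selected parameter is mutated, so drawing a fresh uniform value inside per-parameter `bounds` is an assumption, as are the function names.

```python
import random


def average(parent_a, parent_b):
    """Child is the component-wise average of two parents."""
    return [(a + b) / 2.0 for a, b in zip(parent_a, parent_b)]


def mutate(parent, bounds, rng=random):
    """Copy the parent and re-draw one randomly selected parameter.

    Assumption: the mutated value is drawn uniformly from that
    parameter's (low, high) bounds.
    """
    child = list(parent)
    i = rng.randrange(len(child))
    low, high = bounds[i]
    child[i] = rng.uniform(low, high)
    return child
```

For example, `average([0.0, 2.0], [2.0, 4.0])` gives `[1.0, 3.0]`.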
Asynchronous Genetic Search Operators (2) Double Shot - two parents generate three children Average of the parents Outside the less fit parent, equidistant to parent and average Outside the more fit parent, equidistant to parent and average
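One possible reading of double shot, sketched below: each outside child lies beyond a parent, on the line through both parents, at the same distance from that parent as the average is. That geometric interpretation and the function name are assumptions.

```python
def double_shot(more_fit, less_fit):
    """Return (average, outside_less_fit, outside_more_fit).

    Assumed geometry: an outside child is the reflection of the
    average through the corresponding parent, i.e. 2 * parent - average.
    """
    avg = [(a + b) / 2.0 for a, b in zip(more_fit, less_fit)]
    outside_less = [2.0 * p - m for p, m in zip(less_fit, avg)]
    outside_more = [2.0 * p - m for p, m in zip(more_fit, avg)]
    return avg, outside_less, outside_more
```

With one-dimensional parents at 0.0 (more fit) and 2.0 (less fit), the children are 1.0, 3.0 and -1.0.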
Asynchronous Genetic Search Operators (3) Probabilistic Simplex N parents generate one or more children Points placed randomly along the line created by the worst parent, and the centroid (average) of the remaining parents
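A sketch of the probabilistic simplex operator under stated assumptions: higher fitness is better, and the random position along the line is `centroid + r * (centroid - worst)` with `r` drawn uniformly from `[r_min, r_max]`. The range defaults and all names are illustrative, not taken from the slides.

```python
import random


def probabilistic_simplex(parents, fitnesses, n_children=1,
                          r_min=-1.5, r_max=1.5, rng=random):
    """Place children at random points on the line through the worst
    parent and the centroid of the remaining parents.

    Requires at least two parents. r_min and r_max are assumed values.
    """
    worst_i = min(range(len(parents)), key=lambda i: fitnesses[i])
    worst = parents[worst_i]
    rest = [p for i, p in enumerate(parents) if i != worst_i]
    centroid = [sum(col) / len(rest) for col in zip(*rest)]
    children = []
    for _ in range(n_children):
        r = rng.uniform(r_min, r_max)
        children.append([c + r * (c - w) for c, w in zip(centroid, worst)])
    return children
```

Positive `r` moves past the centroid away from the worst parent (a reflection-like step); negative `r` moves back toward the worst parent (a contraction-like step).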
Particle Swarm Optimization Particles ‘fly’ around the search space. They move according to their previous velocity and are pulled towards the global best found position and their locally best found position. Analogies: cognitive intelligence (local best knowledge) social intelligence (global best knowledge)
Particle Swarm Optimization
PSO:
vi(t+1) = w * vi(t) + c1 * r1 * (li - pi(t)) + c2 * r2 * (g - pi(t))
pi(t+1) = pi(t) + vi(t+1)
w, c1, c2 = constants
r1, r2 = random floats between 0 and 1
vi(t) = velocity of particle i at iteration t
pi(t) = position of particle i at iteration t
li = best position found by particle i
g = global best position found by all particles
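The update rule above, written as a small function. The default values for `w`, `c1` and `c2` are placeholders chosen for illustration, and drawing `r1` and `r2` once per update (rather than once per dimension) is an assumption the slide leaves open.

```python
import random


def pso_step(position, velocity, local_best, global_best,
             w=0.5, c1=2.0, c2=2.0, rng=random):
    """Apply one PSO velocity and position update to a single particle."""
    r1, r2 = rng.random(), rng.random()  # random floats in [0, 1)
    new_velocity = [
        w * v + c1 * r1 * (l - p) + c2 * r2 * (g - p)
        for v, p, l, g in zip(velocity, position, local_best, global_best)
    ]
    new_position = [p + nv for p, nv in zip(position, new_velocity)]
    return new_position, new_velocity
```

When a particle already sits on both its local best and the global best, the two attraction terms vanish and only the inertia term `w * vi(t)` moves it.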
Asynchronous PSO Generating new positions does not necessarily require the fitness of the previous position 1. Generate new particle or individual positions to fill the work queue 2. Update local and global best as results are reported PSO: If a result improves the particle’s local best, update the local best and set the particle’s position and velocity to those of the result
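A rough sketch of how these two steps might fit together. The class and method names, the initial bounds, the constants and the higher-is-better fitness convention are all assumptions; particles are visited in round-robin order when generating work.

```python
import random


class AsyncPSO:
    """Hypothetical asynchronous PSO: work generation never waits on results."""

    def __init__(self, n_particles, dims, w=0.5, c1=2.0, c2=2.0, rng=None):
        self.rng = rng or random.Random()
        self.w, self.c1, self.c2 = w, c1, c2  # assumed constants
        self.particles = [
            {"pos": [self.rng.uniform(-1.0, 1.0) for _ in range(dims)],
             "vel": [0.0] * dims,
             "best_pos": None,
             "best_fit": float("-inf")}
            for _ in range(n_particles)
        ]
        self.global_pos, self.global_fit = None, float("-inf")
        self.next_particle = 0

    def generate_work(self):
        """Step 1: produce a candidate position from the next particle."""
        i = self.next_particle
        self.next_particle = (i + 1) % len(self.particles)
        p = self.particles[i]
        if p["best_pos"] is None or self.global_pos is None:
            # No results yet for this particle: evaluate its start position.
            return i, list(p["pos"]), list(p["vel"])
        r1, r2 = self.rng.random(), self.rng.random()
        vel = [self.w * v + self.c1 * r1 * (l - x) + self.c2 * r2 * (g - x)
               for v, x, l, g in zip(p["vel"], p["pos"],
                                     p["best_pos"], self.global_pos)]
        pos = [x + v for x, v in zip(p["pos"], vel)]
        return i, pos, vel

    def report_result(self, i, pos, vel, fitness):
        """Step 2: update local and global bests when a result arrives."""
        p = self.particles[i]
        if fitness > p["best_fit"]:
            p["best_fit"], p["best_pos"] = fitness, list(pos)
            # The particle adopts the position and velocity of the result.
            p["pos"], p["vel"] = list(pos), list(vel)
        if fitness > self.global_fit:
            self.global_fit, self.global_pos = fitness, list(pos)
```

Several candidates can be outstanding for the same particle at once; because `r1` and `r2` are re-drawn on every call, they differ even though the particle itself has not moved yet.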
Particle Swarm Optimization (Example) [Figure: a particle with previous position pi(t-1), current position pi(t) and velocity vi(t); the inertia term w * vi(t), the local best pull c1 * (li - pi(t)) and the global best pull c2 * (g - pi(t)) bound the region of possible new positions.]
Particle Swarm Optimization (Example) [Figure: before and after views of the same particle moving to a new position.] Particle finds a new local best position and the global best position
Particle Swarm Optimization (Example) [Figure: the same construction for a second particle, with its own local best, the shared global best and its region of possible new positions.] Another particle finds the global best position
Asynchronous PSO [Diagram] Workers (Fitness Evaluation): request work, receive work, report results and update the population. Population: Fitness (1..n) with Individual (1..n); local and global best are updated if a new individual has better fitness. Unevaluated Individuals: Unevaluated Individual (1..n); individuals are generated when the queue is low, and the individual used to generate each new one is selected in a round-robin manner.
Computing Environment: MilkyWay@Home http://milkyway.cs.rpi.edu Runs on BOINC, like Einstein@home, SETI@home, etc. More than 50,000 users; 80,000 CPUs; 600 teams; from 99 countries Second largest BOINC computation (among hundreds) About 500 Teraflops Donate your idle computer time to help perform our calculations.
MilkyWay@Home – Growth of Power
Computing Environments - BOINC MilkyWay@Home: http://milkyway.cs.rpi.edu/ Multiple Asynchronous Workers Approximately 10,000 – 30,000 volunteered computers engaged at a time Asynchronous architecture used Asynchronous Evaluation Volunteered computers can queue up to 20 pending individuals Population updated when results are reported Individuals may be reported slowly or not at all
User Participation Users do more than volunteer computing resources (Citizen Science): Open-source code gives users access to the MilkyWay@Home application Users have submitted many bug reports, fixes, and performance enhancements A user even created an ATI GPU capable version of the MilkyWay@Home application Forums provide opportunities for users to learn about astronomy and computer science
Malicious/Incorrect Result Verification With open-source application code, users can compile their own compiler-optimized versions, and many do. However, there is also the possibility of users returning malicious results. BOINC traditionally uses redundancy on every result to verify its correctness. This requires at least 2 results for every work unit! Asynchronous search doesn't require all work units to be verified, only those which improve the population. We reduce the redundancy by comparing a result against the current partial results.
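One way to picture verifying only the results that would improve the population is sketched below. It is an illustration under assumptions, not the MilkyWay@Home server logic: the function name, the `pending` dictionary, the tolerance and the rule that a second matching report confirms a result are all hypothetical.

```python
def handle_result(population, pending, params, fitness, pop_size, tol=1e-9):
    """Decide what to do with one reported result.

    population: list of (fitness, params), best first (higher is better).
    pending: dict mapping a parameter tuple to its first reported fitness.
    """
    full = len(population) >= pop_size
    worst = population[-1][0] if full else float("-inf")
    if fitness <= worst:
        # Would not enter the population, so it is never worth verifying.
        return "discarded"
    key = tuple(params)
    if key not in pending:
        # First improving report: hold it until a second evaluation arrives.
        pending[key] = fitness
        return "needs_verification"
    first = pending.pop(key)
    if abs(first - fitness) > tol:
        return "rejected"  # the two evaluations disagree
    population.append((fitness, list(params)))
    population.sort(key=lambda member: member[0], reverse=True)
    del population[pop_size:]
    return "accepted"
```

Only results that beat the current worst member ever trigger a second evaluation, so most work units are computed once.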
Limiting Redundancy (Genetic Search) 60% verification found best solutions Increased verification reduces reliability Reliability and convergence by number of parents seem dependent on verification rate
Limiting Redundancy (PSO) 30% verification found best solutions Increased verification reduces reliability Not as dramatically as AGS Lower inertia weights give better results
Optimization Method Comparison APSO found better solutions than AGS. APSO needed lower verification rates and was less affected by different verification rates.
Conclusions Asynchronous search is effective on large scale computing environments Fault tolerant without expensive redundancy Asynchronous evaluation on heterogeneous environment increases diversity BOINC converges almost as fast as the BlueGene, while offering more availability and computational power Even computers with slow result report rates are useful Particle Swarm and Simplex-Genetic Hybrid methods provide significant improvement in convergence
Future Work Optimization Use report times to determine how to generate individuals Simulate asynchrony for benchmarks Automate selection of parameters Distributed Computing Parallel asynchronous workers Handle Malicious “Volunteers” Continued Collaboration http://www.nasa.gov