
December 12, 2009

Robust Asynchronous Optimization for Volunteer Computing Grids. Travis Desell, Malik Magdon-Ismail, Boleslaw Szymanski, Carlos Varela, Heidi Newberg, Nathan Cole. Department of Computer Science; Department of Physics, Applied Physics and Astronomy; Rensselaer Polytechnic Institute.


Presentation Transcript


  1. Robust Asynchronous Optimization for Volunteer Computing Grids. Travis Desell, Malik Magdon-Ismail, Boleslaw Szymanski, Carlos Varela, Heidi Newberg, Nathan Cole. Department of Computer Science; Department of Physics, Applied Physics and Astronomy; Rensselaer Polytechnic Institute. E-Science 2009, December 12, Oxford, UK.

  2. Overview. Introduction: Motivation; Driving Scientific Application. Asynchronous Genetic Search: Why asynchronous?; Methodology; Recombination. Particle Swarm Optimization. Generic Optimization Framework: Approach; Architecture. Results: Convergence Rates; Re-computation Rates. Conclusions & Future Work. Questions?

  3. Motivation. Scientists need easily accessible distributed optimization tools. Distribution is essential for scientific computing: scientific models are becoming increasingly complex, and rates of data acquisition are far exceeding increases in computing power. Traditional optimization strategies are not well suited to large-scale computing: they lack scalability and fault tolerance.

  4. Astro-Informatics: What is the structure and origin of the Milky Way galaxy? Observing from inside the Milky Way provides 3D data: the Sloan Digital Sky Survey has collected over 10 TB of data. We can determine its structure, which is not possible for other galaxies. Very expensive: evaluating a single model of the Milky Way with a single set of parameters can take hours or days on a typical high-end computer. Models determine where different star streams are in the Milky Way, which helps us better understand its structure and how it was formed.

  5. Computed Paths of Sagittarius Stream

  6. Generic Optimization Framework. Separation of concerns: distributed computing; optimization; scientific modeling. “Plug-and-play”: simple and generic interfaces.

  7. Two Distribution Strategies. Asynchronous evaluations: results may not be reported, or may be reported late; no processor dependencies; faults can be ignored; suited to grids and the Internet. Single parallel evaluation: always uses the most evolved population; can use traditional methods; faults require recalculation; grids require load balancing; suited to supercomputers and grids.

  8. Asynchronous Architecture [diagram]. Scientific Models: data initialisation, integral function, integral composition, likelihood function, likelihood composition. Search Routines: evolutionary methods (genetic search, particle swarm optimisation, …), taking initial parameters and producing optimised parameters. Distributed Evaluation Framework: evaluator creation on BOINC (Internet) or SALSA/Java (RPI Grid); Evaluators (1) … (N) exchange work requests and work results with the search routines.

  9. GMLE Architecture (Parallel-Asynchronous) [diagram]. Search Routines sit above a Communication Layer (BOINC: HTTP; Grid: TCP/IP; Supercomputer: MPI). Workers (1) … (Z) exchange work requests and work results through that layer. Each worker distributes parameters over MPI to its own evaluators (Evaluator (1) … (N) for one worker, Evaluator (1) … (M) for another) and combines their results.

  10. Issues With Traditional Optimization. Traditional global optimization techniques are evolutionary, but they are iterative and depend on previous steps: the current population is used to generate the next population. Dependencies and iterations limit scalability and impact performance. With volatile hosts, what if an individual in the next generation is lost? Redundancy is expensive. Scalability is limited by population size.

  11. Asynchronous Optimization Strategy. Use an asynchronous methodology: no dependencies on unknown results; no iterations. Continuously updated population: N individuals are generated randomly for the initial population; work requests are fulfilled by applying recombination operators to the population; the population is updated with reported results.

  12. Asynchronous Search Strategy [diagram]. Workers request work, are sent work from the Work Queue (Unevaluated Parameter Sets (1) … (m)), and report results, which update the Population (Fitness (i) and Parameter Set (i), for i = 1 … n). Work is requested when the queue is low, and new members are generated from the population.
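The strategy on slides 11 and 12 can be sketched as a small server-side object. This is only an illustration, not the MilkyWay@Home code: the class name `AsyncSearch`, the `queue_low` threshold, the choice of the average as the recombination operator, and the convention that a higher fitness is better are all assumptions made for the example.

```python
import random
from collections import deque


class AsyncSearch:
    """Hypothetical server state: a ranked population plus a work queue."""

    def __init__(self, bounds, pop_size, queue_low=4, seed=0):
        self.bounds = bounds          # list of (low, high), one pair per parameter
        self.pop_size = pop_size
        self.queue_low = queue_low    # refill threshold (assumed value)
        self.rng = random.Random(seed)
        self.population = []          # (fitness, params) pairs, best first
        self.queue = deque()          # unevaluated parameter sets

    def _generate(self):
        # Until the population is full, hand out random individuals;
        # afterwards recombine two evaluated members (here: their average).
        if len(self.population) < self.pop_size:
            return [self.rng.uniform(lo, hi) for lo, hi in self.bounds]
        (_, a), (_, b) = self.rng.sample(self.population, 2)
        return [(x + y) / 2 for x, y in zip(a, b)]

    def request_work(self):
        # Refill when the queue runs low, then send one parameter set.
        while len(self.queue) < self.queue_low:
            self.queue.append(self._generate())
        return self.queue.popleft()

    def report_result(self, params, fitness):
        # Results may arrive late, out of order, or never. Each one that
        # does arrive is kept only if it ranks inside the population.
        self.population.append((fitness, params))
        self.population.sort(key=lambda member: member[0], reverse=True)
        del self.population[self.pop_size:]
```

Nothing in `request_work` waits for a particular result, so a lost work unit costs only that one evaluation. A worker loop would call `request_work()`, evaluate the model, and call `report_result()` whenever it finishes.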

  13. Asynchronous Genetic Search Operators (1). Average: a simple operator for continuous problems; the generated parameters are the average of two randomly selected parents. Mutation: takes a parent and generates a mutation by randomly selecting a parameter and mutating it.
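A minimal sketch of these two operators follows. The slide does not say how the selected parameter is mutated, so redrawing it uniformly within its bounds is an assumption, as are the function names and the `bounds` argument.

```python
import random


def average(parent_a, parent_b):
    """Child whose parameters are the mean of the two parents' parameters."""
    return [(a + b) / 2 for a, b in zip(parent_a, parent_b)]


def mutate(parent, bounds, rng=random):
    """Copy the parent and redraw one randomly chosen parameter.

    Assumption: the new value is uniform within that parameter's
    (low, high) bounds; the slide leaves the mutation rule open.
    """
    child = list(parent)
    i = rng.randrange(len(child))
    lo, hi = bounds[i]
    child[i] = rng.uniform(lo, hi)
    return child
```

For example, `average([0.0, 2.0], [2.0, 4.0])` gives `[1.0, 3.0]`.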

  14. Asynchronous Genetic Search Operators (2). Double Shot: two parents generate three children: the average of the parents; a point outside the less fit parent, equidistant to parent and average; a point outside the more fit parent, equidistant to parent and average.
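One way to read "outside the parent, equidistant to parent and average" is as the reflection of the average through that parent, so that the parent lies midway between the average and the new child. The sketch below uses that reading, which is an interpretation of the slide rather than a confirmed definition; the function name is hypothetical.

```python
def double_shot(more_fit, less_fit):
    """Return three children: the average and one point beyond each parent."""
    avg = [(a + b) / 2 for a, b in zip(more_fit, less_fit)]
    # Reflect the average through each parent: child = parent + (parent - avg).
    beyond_less = [2 * p - m for p, m in zip(less_fit, avg)]
    beyond_more = [2 * p - m for p, m in zip(more_fit, avg)]
    return avg, beyond_less, beyond_more
```

With parents `[0.0, 0.0]` and `[2.0, 4.0]`, the three children are `[1.0, 2.0]`, `[3.0, 6.0]` and `[-1.0, -2.0]`, all on the line through the two parents.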

  15. Asynchronous Genetic Search Operators (3). Probabilistic Simplex: N parents generate one or more children. Points are placed randomly along the line created by the worst parent and the centroid (average) of the remaining parents.
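The probabilistic simplex operator can be sketched as below. The sampling range `r_min` to `r_max` along the line is a made-up default, the function name is hypothetical, and a higher fitness is assumed to be better.

```python
import random


def simplex_child(parents, fitnesses, r_min=-1.5, r_max=1.5, rng=random):
    """Place one child at random on the line through the worst parent
    and the centroid of the remaining parents."""
    worst_i = min(range(len(parents)), key=lambda i: fitnesses[i])
    worst = parents[worst_i]
    rest = [p for i, p in enumerate(parents) if i != worst_i]
    centroid = [sum(column) / len(rest) for column in zip(*rest)]
    # r = 0 gives the centroid, r = -1 the worst parent,
    # r > 0 moves past the centroid, away from the worst parent.
    r = rng.uniform(r_min, r_max)
    return [c + r * (c - w) for c, w in zip(centroid, worst)]
```

Calling the function several times with the same parents yields several children on the same line, which matches the "one or more children" wording on the slide.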

  16. Particle Swarm Optimization. Particles ‘fly’ around the search space. They move according to their previous velocity and are pulled towards the global best found position and their locally best found position. Analogies: cognitive intelligence (local best knowledge) and social intelligence (global best knowledge).

  17. Particle Swarm Optimization. PSO update: $v_i(t+1) = w \, v_i(t) + c_1 r_1 \, (l_i - p_i(t)) + c_2 r_2 \, (g - p_i(t))$ and $p_i(t+1) = p_i(t) + v_i(t+1)$, where $w$, $c_1$, $c_2$ are constants; $r_1$, $r_2$ are random floats between 0 and 1; $v_i(t)$ is the velocity of particle $i$ at iteration $t$; $p_i(t)$ is the position of particle $i$ at iteration $t$; $l_i$ is the best position found by particle $i$; $g$ is the global best position found by all particles.
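The update rule on this slide translates almost directly into code. In the sketch below the default values of `w`, `c1` and `c2` are placeholders, and drawing a single `r1` and `r2` per step (rather than one per dimension) is an assumption, since the slide does not say which is meant.

```python
import random


def pso_step(position, velocity, local_best, global_best,
             w=0.5, c1=2.0, c2=2.0, rng=random):
    """One PSO update for a single particle; returns (new_position, new_velocity)."""
    r1, r2 = rng.random(), rng.random()   # random floats in [0, 1)
    new_velocity = [
        w * v + c1 * r1 * (l - p) + c2 * r2 * (g - p)
        for p, v, l, g in zip(position, velocity, local_best, global_best)
    ]
    new_position = [p + nv for p, nv in zip(position, new_velocity)]
    return new_position, new_velocity
```

When a particle already sits on both its local best and the global best, the two pull terms vanish and only the inertia term `w * v` remains.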

  18. Asynchronous PSO. Generating new positions does not necessarily require the fitness of the previous position. 1. Generate new particle or individual positions to fill the work queue. 2. Update local and global best on results. PSO: if a result improves the particle's local best, update the local best and set the particle's position and velocity to those of the result.
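A rough sketch of these two steps is given below, under several assumptions: the class and method names are invented, particles are picked round-robin, higher fitness is better, the coefficient defaults are placeholders, and a particle that has no global best yet falls back to its own best position.

```python
import random


class AsyncPSO:
    """Hypothetical asynchronous PSO state kept on the server."""

    def __init__(self, bounds, n_particles, w=0.5, c1=2.0, c2=2.0, seed=0):
        self.rng = random.Random(seed)
        self.w, self.c1, self.c2 = w, c1, c2
        self.particles = []
        for _ in range(n_particles):
            pos = [self.rng.uniform(lo, hi) for lo, hi in bounds]
            self.particles.append({"pos": pos, "vel": [0.0] * len(bounds),
                                   "best_pos": pos, "best_fit": float("-inf")})
        self.global_pos, self.global_fit = None, float("-inf")
        self.next_particle = 0

    def generate_work(self):
        # Step 1: build a new position without waiting for any pending result.
        i = self.next_particle
        self.next_particle = (i + 1) % len(self.particles)
        p = self.particles[i]
        g = self.global_pos if self.global_pos is not None else p["best_pos"]
        r1, r2 = self.rng.random(), self.rng.random()
        vel = [self.w * v + self.c1 * r1 * (l - x) + self.c2 * r2 * (gb - x)
               for x, v, l, gb in zip(p["pos"], p["vel"], p["best_pos"], g)]
        pos = [x + v for x, v in zip(p["pos"], vel)]
        return i, pos, vel

    def report_result(self, i, pos, vel, fitness):
        # Step 2: a result moves the particle only if it improves its local best.
        p = self.particles[i]
        if fitness > p["best_fit"]:
            p.update(pos=pos, vel=vel, best_pos=pos, best_fit=fitness)
        if fitness > self.global_fit:
            self.global_pos, self.global_fit = pos, fitness
```

Because a particle's stored position changes only when a reported result improves it, late or missing results leave the swarm in a consistent state. In this sketch the first work unit generated for each particle is its initial random position, since its velocity starts at zero.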

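  19. Particle Swarm Optimization (Example) [diagram]. A particle with previous position $p_i(t-1)$, current position $p_i(t)$ and velocity $v_i(t)$, shown with its local best and the global best. The vectors $w \, v_i(t)$, $c_1 (l_i - p_i(t))$ and $c_2 (g - p_i(t))$ bound the region of possible new positions.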

  20. Particle Swarm Optimization (Example) [diagram, two panels with the same labels: previous position $p_i(t-1)$, current position $p_i(t)$, velocity $v_i(t)$, local best, global best, $w \, v_i(t)$, $c_2 (g - p_i(t))$, new position, possible new positions]. Particle finds a new local best position and the global best position.

  21. Particle Swarm Optimization (Example) [diagram with labels: previous position $p_i(t-1)$, current position $p_i(t)$, velocity $v_i(t)$, local best, global best, $w \, v_i(t)$, $c_1 (l_i - p_i(t))$, $c_2 (g - p_i(t))$, possible new positions]. Another particle finds the global best position.

  22. Asynchronous PSO [diagram]. Workers (fitness evaluation) request work, are sent work from the queue of Unevaluated Individuals (1) … (n), and report results, which update the Population (Fitness (i) and Individual (i), for i = 1 … n). Local and global best are updated if a new individual has better fitness. Individuals are generated when the queue is low; the individual to generate a new individual from is selected in a round-robin manner.

  23. Computing Environment: MilkyWay@Home (http://milkyway.cs.rpi.edu). Built on BOINC, like Einstein@home, SETI@home, etc. More than 50,000 users; 80,000 CPUs; 600 teams; from 99 countries. Second largest BOINC computation (among hundreds). About 500 teraflops. Donate your idle computer time to help perform our calculations.

  24. MilkyWay@Home – Growth of Power

  25. Computing Environments - BOINC. MilkyWay@Home: http://milkyway.cs.rpi.edu/. Multiple asynchronous workers: approximately 10,000 – 30,000 volunteered computers engaged at a time; asynchronous architecture used. Asynchronous evaluation: volunteered computers can queue up to 20 pending individuals; the population is updated when results are reported; individuals may be reported slowly or not at all.

  26. Handling of Work Units by the BOINC Server

  27. User Participation. Users do more than volunteer computing resources (citizen science): open-source code gives users access to the MilkyWay@Home application; users have submitted many bug reports, fixes, and performance enhancements; a user even created an ATI GPU-capable version of the MilkyWay@Home application; forums provide opportunities for users to learn about astronomy and computer science.

  28. Malicious/Incorrect Result Verification. With open-source application code, users can compile their own compiler-optimized versions, and many do. However, there is also the possibility of users returning malicious results. BOINC traditionally uses redundancy on every result to verify its correctness, which requires at least 2 results for every work unit! Asynchronous search doesn't require all work units to be verified, only those which improve the population. We reduce the redundancy by comparing a result against the current partial results.
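One possible reading of "verify only results that improve the population" is sketched below. It is not the MilkyWay@Home validator: the class name, the returned status strings, the matching tolerance and the rule that one agreeing second result is enough are all assumptions made for illustration.

```python
class VerifyingPopulation:
    """Hypothetical population that re-checks only improving results."""

    def __init__(self, pop_size, tolerance=1e-9):
        self.pop_size = pop_size
        self.tolerance = tolerance   # allowed fitness difference (assumed)
        self.members = []            # verified (fitness, params), best first
        self.pending = {}            # params tuple -> first reported fitness

    def _would_improve(self, fitness):
        return (len(self.members) < self.pop_size
                or fitness > self.members[-1][0])

    def report(self, params, fitness):
        """Return 'discarded', 'needs_verification', 'verified' or 'rejected'."""
        key = tuple(params)
        if key in self.pending:
            # Second opinion on a candidate: accept only if the two agree.
            first = self.pending.pop(key)
            if abs(first - fitness) > self.tolerance:
                return "rejected"
            self.members.append((fitness, list(params)))
            self.members.sort(key=lambda member: member[0], reverse=True)
            del self.members[self.pop_size:]
            return "verified"
        if not self._would_improve(fitness):
            return "discarded"       # cannot change the population: no re-check
        self.pending[key] = fitness  # caller re-issues these params to another host
        return "needs_verification"
```

A result that would not enter the population is dropped without any redundant computation. A wrong or malicious value that looks like an improvement costs only one extra evaluation before it is rejected.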

  29. Limiting Redundancy (Genetic Search). 60% verification found the best solutions. Increased verification reduces reliability. Reliability and convergence by number of parents seem dependent on verification rate.

  30. Limiting Redundancy (PSO). 30% verification found the best solutions. Increased verification reduces reliability, though not as dramatically as with AGS. Lower inertia weights give better results.

  31. Optimization Method Comparison. APSO found better solutions than AGS. APSO needed lower verification rates and was less affected by different verification rates.

  32. Conclusions. Asynchronous search is effective on large-scale computing environments: it is fault tolerant without expensive redundancy, and asynchronous evaluation on a heterogeneous environment increases diversity. BOINC converges almost as fast as the BlueGene, while offering more availability and computational power. Even computers with slow result report rates are useful. Particle Swarm and Simplex-Genetic Hybrid methods provide significant improvement in convergence.

  33. Future Work. Optimization: use report times to determine how to generate individuals; simulate asynchrony for benchmarks; automate selection of parameters. Distributed computing: parallel asynchronous workers; handle malicious “volunteers”. Continued collaboration. http://www.nasa.gov

  34. Questions?
