1 / 29

Managing Your Workforce on a Computational Grid

Managing Your Workforce on a Computational Grid. Computational Grids will give us access to resources anywhere. How can we make them usable by anyone?. You do not have to be a “super” person in order to benefit from “super” computing power. Grid Computing. Obstacles to HTC *. (Sociology)

emmett
Download Presentation

Managing Your Workforce on a Computational Grid

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. Managing YourWorkforce on a Computational Grid

  2. Computational Grids will give us access to resourcesanywhere.How can we make them usable by anyone?

  3. You do not have to be a “super” person in order to benefit from “super” computing power

  4. Grid Computing Obstacles to HTC * (Sociology) (Education) (Robustness) (Portability) (Technology) • Ownership Distribution • Customer Awareness • Size and Uncertainties • Technology Evolution • Physical Distribution *(this slide has been in use for more than 5 years)

  5. Our Bottom-Up Approach Build and organize your computing capabilities from the bottom up: • Organize the resources on your desk • Organize the resources on your floor • Organize the resources in your building • Organize the resources on your campus • Organize the resources in your state • Organize the resources in your nation • Organize the resources in your world

  6. Here is what a group of empowered (and well organized) scientists can do with Grid resources …

  7. NUG28 - Solved!!!! We are pleased to announce the exact solution of the nug28 quadratic assignment problem (QAP). This problem was derived from the well known nug30 problem using the distance matrix from a 4 by 7 grid, and the flow matrix from nug30 with the last 2 facilities deleted. This is to our knowledge the largest instance from the nugxx series ever provably solved to optimality. The problem was solved using the branch-and-bound algorithm described in the paper "Solving quadratic assignment problems using convex quadratic programming relaxations," N.W. Brixius and K.M. Anstreicher. The computation was performed on a pool of workstations using the Condor high-throughput computing system in a total wall time of approximately 4 days, 8 hours. During this time the number of active worker machines averaged approximately 200. Machines from UW, UNM and (INFN) all participated in the computation.

  8. The NUG30 Personal Condor … For the NUG30 run we will be flocking to -- the main Condor pool at Wisconsin (600 processors) -- the Condor pool at Georgia Tech (190 Linux boxes) -- the Condor pool at UNM (40 processors) -- the Condor pool at Columbia (16 processors) -- the Condor pool at Northwestern (12 processors) -- the Condor pool at NCSA (65 processors) -- the Condor pool at INFN (200 processors) We will be using glide_in to access the Origin 2000 (through LSF ) at NCSA. We will use "hobble_in" to access the Chiba City Linux cluster and Origin 2000 at Argonne.

  9. It works!!! Date: Thu, 8 Jun 2000 22:41:00 -0500 (CDT) From: Jeff Linderoth <linderot@mcs.anl.gov> To: Miron Livny <miron@cs.wisc.edu> Subject: Re: Priority This has been a great day for metacomputing! Everything is going wonderfully. We've had over 900 machines (currently around 890), and all the pieces are working great… Date: Fri, 9 Jun 2000 11:41:11 -0500 (CDT) From: Jeff Linderoth <linderot@mcs.anl.gov> Still rolling along. Over three billion nodes in about 1 day!

  10. Up to a Point … Date: Fri, 9 Jun 2000 14:35:11 -0500 (CDT) From: Jeff Linderoth <linderot@mcs.anl.gov> Hi Gang, The glory days of metacomputing are over. Our job just crashed. I watched it happen right before my very eyes. It was what I was afraid of -- they just shut down denali, and losing all of those machines at once caused other connections to time out -- and the snowball effect had bad repercussions for the Schedd.

  11. Back in Business Date: Fri, 9 Jun 2000 18:55:59 -0500 (CDT) From: Jeff Linderoth <linderot@mcs.anl.gov> Hi Gang, We are back up and running. And, yes, it took me all afternoon to get it going again. There was a (brand new) bug in the QAP "read checkpoint" information that was making the master coredump. (Only with optimization level -O4). I was nearly reduced to tears, but with some supportive words from Jean-Pierre, I made it through.

  12. The First 600K seconds … Number of workers

  13. NUG30 - Solved!!! Sender: goux@dantec.ece.nwu.edu Subject: Re: Let the festivities begin. Hi dear Condor Team, you all have been amazing. NUG30 required 10.9 years of Condor Time. In just seven days ! More stats tomorrow !!! We are off celebrating ! condor rules ! cheers, JP.

  14. 11 CPU years in less than a week,How did they do it? Effective management of their workforce!(www.mcs.anl.gov/metaneos/nug30)

  15. I. Employ the MW Paradigm Map the problem to a dynamic Master-Worker parallel algorithm (branch and bond) with the following properties: • Large number of independent tasks (always have work ready to be sent out in case a worker shows up) • Workers may join or leave/quit at anytime • Mater has control over size of tasks • Master is fault tolerant

  16. Decisions, decisions, decisions … • How many workers? • What kind of workers? • How long to wait for a worker to finish a task? • How big should a task be? • Who should get which task? • When to checkpoint?

  17. II. Build a robust implementation Use our MW runtime support library to implement the MW algorithm (www.cs.wisc.edu/condor/MW). • Opportunistic compute resources • Use PVM or file I/O to move work to workers and results back to master • Heterogeneous compute and communication resources • Mater and workers can recover from a crash

  18. III. Build a Personal Computational Grid Use Condor-G (a grid enabled version of Condor) to manage the workforce of your MW application on Condor and Grid resources • Uses Globus technology and services for inter-domain resource access • Uses Condor capabilities for intra-domain resource and job management and scheduling

  19. How doesCondor-Gwork?(joint effort of the Condor and Globus projects)

  20. Glide-in Glide-in Glide-in Grid Problem Solving Environment/User Condor - G Globus PBS LSF Condor Condor App App App App

  21. Condor_status Condor_submit Job Manager Grid_Man Gate Keeper Sched_D Local Job Scheduler Condor-G

  22. Adding the G to Condor Support for the G Universe

  23. Capabilities • G jobs look and feel like a Condor job • Local and personalized job management services • Reliable and fault tolerant job management

  24. Open Issues • Provide “run once and only once” guarantees • Dealing with failures/exceptions • File staging • Remote I/O • Visibility of remote job and queue status information

  25. Adding the G to CondorCondor “Glide-ins”

  26. Capabilities • Turn allocated G resources into members of a Condor pool • Use GSIFTP services to move Condor executable to the executing environment • Provide safeguards for broken communication lines • Support SMPs and clusters for multi CPU allocations

  27. Open Issues • Manage inventory of glide-ins - how many? When to submit? Which G resource? Which queue? • Provide status and exception information to application • Package Condor software for automatic installation on all platforms • Make remote scheduler aware of opportunistic nature of resource requests

  28. Current Status • MW runtime library available • Proof of concept (used by the NUGxx group) version of Condor-G available (based on Globus Run) • New version of Condor-G will be released as a “Contrib Module” by the end of the month (based on GRAM API) • “Glide-in” package available

  29. C High Throughput Computing ondor Don’t be picky, be agile!!!

More Related