Grid Laboratory of Wisconsin (GLOW)
UW Madison's Campus Grid
Dan Bradley, Department of Physics & CS
Representing the GLOW + Condor Teams
http://www.cs.wisc.edu/condor/glow
2006 ESCC/Internet2 Joint Techs Workshop
The Premise
Many researchers have computationally intensive problems. Individual workflows rise and fall over the course of weeks and months. Computers and computing people are less volatile than a researcher's demand for them.
Grid Laboratory of Wisconsin
2003 initiative funded by NSF/UW. Six initial GLOW sites:
• Computational Genomics, Chemistry
• Amanda, IceCube, Physics/Space Science
• High Energy Physics/CMS, Physics
• Materials by Design, Chemical Engineering
• Radiation Therapy, Medical Physics
• Computer Science
Diverse users with different deadlines and usage patterns.
UW Madison Campus Grid
• Condor pools in various departments, made accessible via Condor 'flocking' (see the configuration sketch below).
• Users submit jobs to their own private or department Condor scheduler.
• Jobs are dynamically matched to available machines.
• Crosses multiple administrative domains:
  • No common uid-space across campus.
  • No cross-campus NFS for file access.
  • Users rely on Condor remote I/O, file staging, AFS, SRM, GridFTP, etc.
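To make the flocking arrangement concrete, here is a minimal sketch of the Condor configuration knobs involved; the host names are hypothetical, and a real deployment also needs matching authorization (ALLOW_WRITE / HOSTALLOW_WRITE) settings:

    # On a department submit machine: let jobs that find no local match
    # flock to the GLOW pool's central manager (hypothetical host name)
    FLOCK_TO = glow-cm.example.wisc.edu

    # On the GLOW central manager: accept flocked jobs from that
    # department's schedd (hypothetical host name)
    FLOCK_FROM = dept-submit.example.wisc.edu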
UW Campus Grid Machines
• GLOW Condor pool, distributed across the campus to provide locality with big users:
  • 1200 2.8 GHz Xeon CPUs
  • 200 1.8 GHz Opteron cores
  • 100 TB disk
• Computer Science Condor pool:
  • 1000 ~1 GHz CPUs
  • testbed for new Condor releases
• Other private pools:
  • job submission and execution
  • private storage space
  • excess jobs flock to the GLOW and CS pools
New GLOW Members
• Proposed minimum involvement:
  • One rack with about 50 CPUs
  • An identified system-support person who joins GLOW-tech (can be an existing member of GLOW-tech)
  • PI joins the GLOW executive committee
  • Adhere to current GLOW policies
  • Sponsored by existing GLOW members
• UW ATLAS and other physics groups were proposed by CMS and CS, and were accepted as new members.
• Expressions of interest from other groups.
Housing the Machines
• Condominium style:
  • centralized computing center
  • space, power, cooling, management
  • standardized packages
• Neighborhood-association style:
  • each group hosts its own machines
  • each contributes to the administrative effort
  • base standards (e.g. Linux & Condor) make resources easy to share
• GLOW has elements of both, but leans towards the neighborhood style.
What About "The Grid"?
Who needs a campus grid? Why not have each cluster join "The Grid" independently?
The Value of Campus Scale
• simplicity: the software stack is just Linux + Condor
• fluidity: a high common denominator makes sharing easier and provides a richer feature set
• collective buying power: we speak to vendors with one voice
• standardized administration: e.g. GLOW uses one centralized cfengine
• synergy: face-to-face technical meetings; a mailing list scales well at campus level
The Value of the Big G
• Our users want to collaborate outside the bounds of the campus (e.g. ATLAS and CMS are international).
• We also don't want to be limited to sharing resources with people who have made identical technological choices.
• The Open Science Grid gives us the opportunity to operate at both scales, which is ideal.
On the OSG Map
Any GLOW member is free to link their resources to other grids.
[OSG map entry - facility: WISC, site: UWMadisonCMS]
Submitting Jobs within the UW Campus Grid
[Diagram: a UW HEP user runs condor_submit; the schedd (job caretaker) matches jobs via the HEP matchmaker and flocks to the CS and GLOW matchmakers; jobs execute under a startd (job executor).]
Supports the full feature set of Condor:
• matchmaking
• remote system calls
• checkpointing
• MPI
• suspension VMs
• preemption policies
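For illustration, a job entering the campus grid this way needs nothing grid-specific in its submit file. A minimal vanilla-universe sketch (the executable and file names are hypothetical):

    universe   = vanilla
    executable = simulate_events          # hypothetical program
    arguments  = --events 1000
    output     = sim.$(Cluster).$(Process).out
    error      = sim.$(Cluster).$(Process).err
    log        = sim.log
    # no shared filesystem across campus, so let Condor stage the files
    should_transfer_files   = YES
    when_to_transfer_output = ON_EXIT
    queue 10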
Submitting Jobs through OSG to the UW Campus Grid
[Diagram: an Open Science Grid user runs condor_submit; their schedd hands the job to the Condor gridmanager, which contacts the Globus gatekeeper at UW; the gatekeeper submits to a local schedd, which matches via the HEP, GLOW, and CS matchmakers (with flocking) and runs the job under a startd.]
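From the OSG side, the same work is typically described as a Condor-G grid-universe job aimed at the site's Globus (GT2) gatekeeper. A rough sketch, with a hypothetical gatekeeper host name:

    universe      = grid
    grid_resource = gt2 cmsgrid.example.wisc.edu/jobmanager-condor
    executable    = simulate_events       # hypothetical program
    output        = sim.out
    error         = sim.err
    log           = sim.log
    queue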
Routing Jobs from the UW Campus Grid to OSG
[Diagram: a local user runs condor_submit; the schedd (job caretaker) matches via the HEP, CS, and GLOW matchmakers; the JobRouter transforms idle local jobs into grid jobs, which the Condor gridmanager submits through a Globus gatekeeper to the grid.]
Combining both worlds:
• simple, feature-rich local mode
• when possible, transform to a grid job for traveling globally
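As a rough sketch of how such a route might be expressed, a JobRouter entry names the target grid resource and limits how many transformed jobs it keeps in flight; the gatekeeper host name and limits below are hypothetical, not the actual GLOW routes:

    JOB_ROUTER_ENTRIES = \
      [ name = "OSG_Example_Site"; \
        GridResource = "gt2 osg-gw.example.edu/jobmanager-condor"; \
        MaxIdleJobs = 10; \
        MaxJobs = 200; ]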
GLOW Architecture in a Nutshell
One big Condor pool:
• A backup central manager runs at each site (Condor HAD service).
• Users submit jobs as members of a group (e.g. "CMS" or "MedPhysics").
• Computers at each site give highest priority to jobs from the same group (via machine RANK; see the sketch below).
• Jobs run preferentially at the "home" site, but may run anywhere when machines are available.
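A minimal sketch of the group/priority mechanics described above; the group name, user, and the custom GlowGroup attribute are hypothetical illustrations, not the actual GLOW configuration:

    # In a user's submit file: declare group membership
    +AccountingGroup = "group_cms.someuser"
    +GlowGroup       = "CMS"              # hypothetical custom attribute

    # In condor_config on execute machines at the CMS "home" site:
    # rank same-group jobs above everyone else's
    RANK = (TARGET.GlowGroup =?= "CMS")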
Accommodating Special Cases
• Members have the flexibility to make arrangements with each other when needed.
  • Example: granting 2nd priority
• Opportunistic access
  • Long-running jobs which can't easily be checkpointed can run as bottom feeders that are suspended, instead of killed, by higher-priority jobs (see the policy sketch below).
• Computing on Demand
  • Tasks requiring low latency (e.g. interactive analysis) may quickly suspend any other jobs while they run.
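The suspend-instead-of-kill idea can be sketched with standard startd policy expressions. This is a heavily simplified illustration, not the actual GLOW bottom-feeder policy (which suspends when a higher-priority job arrives rather than on raw machine load):

    # Suspend the low-priority job when non-Condor load on the machine rises,
    # resume when it drops, and never vacate or kill the job outright.
    WANT_SUSPEND = TRUE
    SUSPEND      = ((LoadAvg - CondorLoadAvg) > 0.5)
    CONTINUE     = ((LoadAvg - CondorLoadAvg) < 0.3)
    PREEMPT      = FALSE
    KILL         = FALSE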
Example Uses
• Chemical Engineering
  • Students do not know where the computing cycles are coming from - they just do it - largest user group
• ATLAS
  • Over 15 million proton collision events simulated, at 10 minutes each
• CMS
  • Over 70 million events simulated, reconstructed, and analyzed (total ~10 minutes per event) in the past year
• IceCube / Amanda
  • Data filtering used 12 CPU-years in one month
• Computational Genomics
  • Prof. Shwartz asserts that GLOW has opened up a new paradigm of work patterns in his group
  • They no longer think about how long a particular computational job will take - they just do it
Summary
• Researchers are demanding to be well connected to both local and global computing resources.
• The Grid Laboratory of Wisconsin is our attempt to meet that demand.
• We hope you too will find a solution!