Condor and the NGS
John Kewley, NGS Support Centre Manager
Outline
• What is High Throughput Computing?
• What is Condor?
• Condor and the NGS
NGS Innovation Forum, Manchester
HPC vs. HTC
• HPC (High Performance Computing): large amounts of [simultaneous] computing power for comparatively short periods of time
• HTC (High Throughput Computing): large amounts of computing over significantly longer periods, not necessarily all at the same time
Various job types
Parameter studies, OpenMP, master-worker, parallel, parameter search, serial, embarrassingly parallel, sequential, parameter sweep, Monte Carlo, MPI, PVM
Terminology
Parallel
• Tightly-coupled processes
• Need synchronisation
• Information sharing (message passing or shared memory)
• If 1 process fails, the whole job fails
• Single large homogeneous resource
• Processors used simultaneously
Independent
• Unordered (so not serial/sequential)
• Nothing embarrassing about it
• No communication once the job starts
• Might not need all results
• Could run on different machines with different operating systems
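To make the independent case concrete, here is a minimal local sketch in Python (not Condor code) of a parameter sweep. The function `simulate` and the failure of parameter 3 are hypothetical, and threads stand in for pool machines. Each task runs on its own, so one failure loses only that result, whereas in the parallel case one failed process brings down the whole job.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def simulate(param: int) -> float:
    """Hypothetical stand-in for one independent job in a parameter sweep."""
    if param == 3:
        # Simulate a single job failing (e.g. its machine being reclaimed).
        raise RuntimeError(f"job {param} failed")
    return param * 0.5


def run_sweep(params, max_workers: int = 4):
    """Run each parameter independently; a failure loses only that one result."""
    results, failures = {}, []
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(simulate, p): p for p in params}
        for fut in as_completed(futures):  # results arrive in no particular order
            p = futures[fut]
            try:
                results[p] = fut.result()
            except RuntimeError:
                failures.append(p)
    return results, sorted(failures)


if __name__ == "__main__":
    results, failures = run_sweep(range(6))
    print(sorted(results.items()), failures)
```

The sweep still returns five usable results; whether the missing one matters depends on the study ("might not need all results").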
What is Condor?
• A job submission framework that uses spare computing power
• Works within a heterogeneous computer network
• Desktop PCs, Linux workstations, servers, clusters and teaching lab resources can all be included in the Condor pool
• Uses matchmaking to connect jobs with resources
• Supports High Throughput Computing (HTC)
• Developed over the past 20 years at the University of Wisconsin-Madison
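Condor's matchmaking pairs job ClassAds with machine ClassAds: each side can state requirements on the other, and the job can rank the machines it finds acceptable. The sketch below is a simplified Python model of that idea only, assuming ads are plain dictionaries and requirements/rank are Python callables. Real Condor evaluates expressions written in the ClassAd language inside its negotiator; the machine names and attribute values here are illustrative.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional

Ad = Dict[str, object]  # simplified stand-in for a ClassAd


@dataclass
class Job:
    name: str
    ad: Ad
    requirements: Callable[[Ad], bool]  # evaluated against a machine ad
    rank: Callable[[Ad], float]         # higher value = more preferred machine


@dataclass
class Machine:
    name: str
    ad: Ad
    requirements: Callable[[Ad], bool]  # machine owner's policy, evaluated against a job ad


def match(job: Job, machines: List[Machine]) -> Optional[Machine]:
    """Return the best-ranked machine on which both sides' requirements hold."""
    candidates = [m for m in machines
                  if job.requirements(m.ad) and m.requirements(job.ad)]
    if not candidates:
        return None
    return max(candidates, key=lambda m: job.rank(m.ad))


if __name__ == "__main__":
    pool = [
        Machine("lab-pc-01", {"OpSys": "WINDOWS", "Memory": 512}, lambda j: True),
        Machine("lab-pc-02", {"OpSys": "WINDOWS", "Memory": 1024}, lambda j: True),
        Machine("linux-ws", {"OpSys": "LINUX", "Memory": 2048},
                lambda j: j["Owner"] != "guest"),
    ]
    job = Job("sweep", {"Owner": "alice"},
              requirements=lambda m: m["OpSys"] == "WINDOWS" and m["Memory"] >= 512,
              rank=lambda m: m["Memory"])
    print(match(job, pool).name)  # lab-pc-02
```

Two-sided requirements are what let desktop owners keep control of their machines while still contributing them to the pool.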
Useful Features
• Automatic resubmission when jobs fail
• Ability to cluster groups of jobs
• Checkpointing / migration
• DAGMan: Directed Acyclic Graph (workflow) manager
• Integration with Grid resources, especially through Condor-G
• Staging and retrieval of data
• Glide-in: dynamically add Grid worker nodes to your Condor pool
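DAGMan reads an input file that lists jobs (each with its own submit description file) and PARENT/CHILD dependencies; a RETRY line asks DAGMan to resubmit a failed node up to N times, which is one route to automatic resubmission. The sketch writes such a file for a hypothetical diamond-shaped workflow (node names and `.sub` filenames are made up) and uses Python's `graphlib` to show one order that respects the dependencies. DAGMan itself is free to run B and C at the same time.

```python
from graphlib import TopologicalSorter  # Python 3.9+
from typing import Dict, List, Tuple


def write_dag(jobs: Dict[str, str], edges: List[Tuple[str, str]],
              retries: Dict[str, int]) -> str:
    """Render a DAGMan input file: JOB, PARENT ... CHILD and RETRY lines."""
    lines = [f"JOB {name} {submit}" for name, submit in jobs.items()]
    lines += [f"PARENT {parent} CHILD {child}" for parent, child in edges]
    lines += [f"RETRY {name} {count}" for name, count in retries.items()]
    return "\n".join(lines) + "\n"


def execution_order(jobs: Dict[str, str], edges: List[Tuple[str, str]]) -> List[str]:
    """One valid order; raises graphlib.CycleError if the graph is not acyclic."""
    sorter = TopologicalSorter({name: set() for name in jobs})
    for parent, child in edges:
        sorter.add(child, parent)  # child depends on parent
    return list(sorter.static_order())


if __name__ == "__main__":
    jobs = {"A": "prepare.sub", "B": "run_b.sub", "C": "run_c.sub", "D": "collect.sub"}
    edges = [("A", "B"), ("A", "C"), ("B", "D"), ("C", "D")]
    print(write_dag(jobs, edges, retries={"B": 3, "C": 3}))
    print(execution_order(jobs, edges))
```

The generated file would then be handed to `condor_submit_dag`, which runs DAGMan as a Condor job that submits the nodes as their parents complete.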
[Diagram: a Condor pool with a Central Manager, Submit Nodes and Execute Nodes]
The NGS and Cardiff
• NGS Partner site since April 2005
• First resource was a 32-processor SGI cluster (April 2005)
• Second resource was the Condor pool (June 2007)
• Over 1000 Windows XP workstations
• Mixture of Pentium 4s (80%) and Core 2 Duos (20%)
• Capped at 200 jobs running concurrently
• Used by 10 different numbered accounts
• See www.cf.ac.uk/arcca
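Because of the cap, at most 200 jobs run at once even though the Cardiff pool has over 1000 workstations. As a back-of-the-envelope sketch (an idealisation assuming equal-length, independent jobs and no eviction, scheduling delay or P4/C2D speed differences; the job counts are hypothetical), wall-clock time is the number of "waves" of 200 jobs multiplied by the job length:

```python
import math


def makespan_hours(n_jobs: int, hours_per_job: float, cap: int = 200) -> float:
    """Idealised wall-clock time for n equal-length, independent jobs
    when at most `cap` of them may run concurrently."""
    waves = math.ceil(n_jobs / cap)  # each wave runs up to `cap` jobs side by side
    return waves * hours_per_job


if __name__ == "__main__":
    print(makespan_hours(1000, 2.0))  # 5 waves of 2 h -> 10.0
    print(makespan_hours(1001, 2.0))  # one extra job adds a sixth wave -> 12.0
```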
Other Condor on the NGS
• Bristol: ~50 Windows XP machines in a Condor pool fronted by a Linux server
• Reading: ~400 Linux machines (CoLinux under Windows XP)
What is Condor-G?
[Diagram: on a local submit node, condor_submit places jobs (Job 1, Job 2, …) in a queue; they are sent across the Internet, through a firewall, to a Globus head node at a remote site, whose batch system runs them on execute nodes]
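With Condor-G, the user writes an ordinary Condor submit description but selects the grid universe and names the remote Globus gatekeeper; Condor-G then handles submission to the remote batch system and tracks the job. The Python sketch below only generates such a file. The head-node hostname, the `jobmanager-pbs` job manager and the executable name are hypothetical, and a valid Globus proxy credential is assumed to exist before `condor_submit` is run.

```python
def condor_g_submit(gatekeeper: str, jobmanager: str,
                    executable: str, n_jobs: int = 1) -> str:
    """Build a Condor-G submit description targeting a pre-WS (GT2) Globus gatekeeper."""
    lines = [
        "universe = grid",
        f"grid_resource = gt2 {gatekeeper}/{jobmanager}",
        f"executable = {executable}",
        "output = job.$(Process).out",  # $(Process) is expanded by condor_submit per job
        "error = job.$(Process).err",
        "log = job.log",                 # Condor-G records job state changes here
        f"queue {n_jobs}",
    ]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    text = condor_g_submit("headnode.example.ac.uk", "jobmanager-pbs", "myjob", n_jobs=2)
    with open("myjob.sub", "w") as f:
        f.write(text)
    # Then, on the submit node:  condor_submit myjob.sub
```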
OxGrid: Overview
[Diagram: the Oxford e-Research Centre hosts storage (SRB), BDII, VOMS, SSO, CA … and a resource broker/login node (Condor); it connects departmental and college Condor pools and clusters, a Microsoft cluster, other universities/institutions, National Grid Service resources (including an NGS cluster) and a super-computing centre]
[Diagram: users log in to a Condor-G portal supported by a MyProxy server, a Condor-G central manager and a Condor-G submit host; jobs are matched using Condor ClassAds and dispatched, with Globus file staging, to the CSD-Physics cluster (ulgbc2), the CSD AMD cluster (ulgbc1), the NW-GRID cluster (ulgbc3) and the NW-GRID/POL cluster (ulgp4)]
University of Manchester, Research Computing Services
• 100 cores (an additional 400 in a second pool)
• Condor used as backfill for the SGE queues
• IP tunnelling used to let Condor connect to the NW-Grid back-end nodes (instead of the provided GCB, the Generic Connection Broker)
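The slide does not say how the Manchester backfill is configured. As a hypothetical toy model of the policy only (class and function names are illustrative, not Condor or SGE APIs): Condor jobs are placed on cores that SGE is not using, and when SGE needs a core back, the Condor job on it is evicted and returned to the front of the Condor queue.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Core:
    name: str
    sge_job: Optional[str] = None     # SGE work always has priority on this core
    condor_job: Optional[str] = None  # backfill work, only while SGE leaves it idle


def backfill(cores: List[Core], condor_queue: List[str]) -> None:
    """Place queued Condor jobs on cores that neither SGE nor Condor is using."""
    for core in cores:
        if not condor_queue:
            break
        if core.sge_job is None and core.condor_job is None:
            core.condor_job = condor_queue.pop(0)


def sge_claims(core: Core, job: str, condor_queue: List[str]) -> None:
    """SGE takes the core back: evict any Condor job and requeue it first."""
    if core.condor_job is not None:
        condor_queue.insert(0, core.condor_job)
        core.condor_job = None
    core.sge_job = job


if __name__ == "__main__":
    cores = [Core("n1", sge_job="sge-1"), Core("n2"), Core("n3")]
    queue = ["c1", "c2", "c3"]
    backfill(cores, queue)
    sge_claims(cores[1], "sge-2", queue)
    print(cores, queue)
```

Evicted backfill jobs lose their progress unless they checkpoint, which is why checkpointing/migration appears among Condor's useful features.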
Novel Architecture!?
• Condor itself is not that new
• Some NGS users request Windows resources, but most previous NGS nodes ran PBS, LSF or SGE on Linux
• Campus Grids are being developed to harness all available processing power (including teaching pools, servers and clusters)
• Condor can help the NGS provide access to Windows resources
Windows on the NGS
Many users are looking for Windows resources on which to run their computations. As well as the resources provided by Cardiff, Bristol and Reading, Southampton has made available a group of 100 processors running Windows Compute Cluster Server.
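On a Windows pool such as Cardiff's, a parameter sweep would typically be expressed as a vanilla-universe submit description that restricts matching to Windows machines and transfers files explicitly, since desktop PCs often share no file system. The sketch below generates one. `model.exe` and the job count are hypothetical, and the OpSys value is an assumption: Condor releases of that period advertised Windows XP machines as `WINNT51`, while newer HTCondor versions advertise `WINDOWS`, so check what `condor_status -l` reports for your pool.

```python
def windows_sweep_submit(executable: str, n_jobs: int, opsys: str = "WINNT51") -> str:
    """Vanilla-universe submit description for n independent runs on Windows machines."""
    lines = [
        "universe = vanilla",
        f"executable = {executable}",
        "arguments = $(Process)",                # each run receives its index 0..n-1
        f'requirements = (OpSys == "{opsys}")',  # only match machines advertising this OpSys
        "should_transfer_files = YES",           # ship files rather than rely on a shared disk
        "when_to_transfer_output = ON_EXIT",
        "output = run.$(Process).out",
        "error = run.$(Process).err",
        "log = sweep.log",
        f"queue {n_jobs}",
    ]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(windows_sweep_submit("model.exe", 50))
```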
Other work
• Jean-Alain Grunchec of the University of Edinburgh is trying Condor Glidein to add NGS resources to his Condor pool
• The e-Minerals project used a Condor submission mechanism to submit jobs to both local Condor pools and Grid resources such as the NGS and NW-Grid
• Both the EGEE resource broker (being trialled by the NGS) and the GridWay metascheduler are based on Condor technologies
• STFC Daresbury Laboratory (another NW-Grid site) is collaborating with the Cockcroft Centre to set up a Campus Grid using NW-Grid and Condor resources
Summary
• Condor pools can be part of the NGS
• Condor can be used in many ways with the NGS
• Condor is being combined with the NGS in many Campus Grids
• Condor can help the NGS provide access to Windows resources
Information on NGS resources can be found at http://www.grid-support.ac.uk/content/view/239/157/
Acknowledgements
• Some slides are based on material from the University of Wisconsin-Madison Condor team.
• Some of the slides describing the UK university Condor work are based on ones the universities produced themselves (I hope nothing was "lost in translation").