Use of Condor in our Campus Grid and the University
Dr. David Wallom, September 2004
Outline • The University of Bristol Grid (UoBGrid). • The UoBGrid Resource Broker. • Users & their environment. • Problems encountered. • Other Condor use within Bristol. • Summary.
The UoBGrid • Planned for ~1000+ CPUs from 1.2 to 3.2 GHz, arranged in 7 clusters & 3+ Condor pools located in 4 different departments. • Core services run on individual servers, e.g. the Resource Broker & MDS.
The UoBGrid, now • Currently 270 CPUs in 3 clusters and 1 Windows Condor pool. • Central services run on 2 beige boxes in my office. • The Windows Condor pool covers only a single student open-access area. • Currently only two departments (Physics, Chemistry) are fully engaged, though more are on their way. • The remaining large clusters are still on legacy operating system versions; a University-wide upgrade programme has started.
Middleware • Virtual Data Toolkit (VDT). • Chosen for stability. • Platform-independent installation method. • Widely used in other European production grid systems. • Contains the standard Globus Toolkit version 2.4 with several enhancements. • Also includes: • GSI-enhanced OpenSSH. • MyProxy client & server. • Has a defined support structure.
Condor-G Resource Broker • Uses the Condor-G matchmaking mechanism with Grid resources. • Set to run as soon as a job appears. • A custom script determines resource status & priority. • Integrates the Condor resource-description (ClassAd) mechanism with the Globus Monitoring and Discovery Service (MDS).
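As a minimal sketch, a user job submitted through this broker might look like the following Condor-G submit description; the executable name is hypothetical, and the $$(gatekeeper_url) substitution assumes the standard Condor-G matchmaking setup, with the broker filling in the matched resource at run time.

    # Sketch of a Condor-G submit description using the broker's matchmaking
    universe        = globus
    globusscheduler = $$(gatekeeper_url)   # taken from the matched resource ClassAd
    executable      = charge_sim           # hypothetical user program
    requirements    = (OpSys == "LINUX") && (Arch == "INTEL")
    output          = job.out
    error           = job.err
    log             = job.log
    queue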
Information Passed into the Resource ClassAd • MyType = "Machine", TargetType = "Job" • The Name and gatekeeper URL depend on the resource's host name and its installed scheduler, since a system may easily have more than one jobmanager installed: Name = "grendel.chm.bris.ac.uk-pbs", gatekeeper_url = "grendel.chm.bris.ac.uk/jobmanager-pbs" • Require the Globus universe and cap the number of jobs matched to a particular resource (the cap reflects the number of nodes in the cluster): Requirements = (CurMatches < 5) && (TARGET.JobUniverse == 9), WantAdRevaluate = True • Time the ClassAd was constructed: UpdateSequenceNumber = 1097580300 • Currently hard-coded in the ClassAd: CurMatches = 0 • System information retrieved from Globus MDS, for the head node only (not the workers): OpSys = "LINUX", Arch = "INTEL", Memory = 501 • Installed software, defined in a resource broker file for each resource: INTEL_COMPILER = True, GCC3 = True
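Putting these attributes together, the advertisement for this example resource would look roughly like the sketch below; the values are simply the ones listed above, and the condor_advertise publication step is an assumption about how the custom script pushes the ad to the collector.

    # Sketch of the assembled resource ClassAd for grendel (values from above)
    MyType               = "Machine"
    TargetType           = "Job"
    Name                 = "grendel.chm.bris.ac.uk-pbs"
    gatekeeper_url       = "grendel.chm.bris.ac.uk/jobmanager-pbs"
    Requirements         = (CurMatches < 5) && (TARGET.JobUniverse == 9)
    WantAdRevaluate      = True
    UpdateSequenceNumber = 1097580300
    CurMatches           = 0
    OpSys                = "LINUX"
    Arch                 = "INTEL"
    Memory               = 501
    INTEL_COMPILER       = True
    GCC3                 = True

    # Assumed publication step: send the ad to the collector as a machine ad
    #   condor_advertise UPDATE_STARTD_AD grendel.ad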
Possible extensions to Resource Information • Resource state information (LoadAvg, Claimed, etc.): how should this be defined for a cluster? Perhaps Condor-G could introduce new states such as "% full". • Number of CPUs and free disk space: how do you define these for a cluster? Is the number of CPUs reported per worker or for the whole system? The same question applies to disk space. • Cluster performance (MIPS, KFlops): this is not commonly measured for small clusters, so it would need to be hard-wired in, but it could be very useful for ranking resources.
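Purely as an illustration, an extended resource ad along these lines might carry attributes such as the ones below; all of the names, values and conventions here are hypothetical, not existing broker output.

    # Hypothetical extra attributes for a cluster-wide resource ad (sketch only)
    TotalCpus   = 96         # whole cluster rather than per worker (assumed convention)
    TotalDisk   = 120000000  # free disk space in KB on shared storage (assumed convention)
    PercentFull = 40         # a possible new "state" for a cluster resource
    KFlops      = 850000     # aggregate benchmark figure, hard-wired per resource
    # a job could then express a preference with, e.g.,  rank = KFlops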
Load Management • Only the raw numbers of jobs running, idle & held (for whatever reason) are defined. • There is little measure of the relative performance of nodes within the grid; it is currently based on: • the head node's processor type & memory; • the MDS nodeCount value for the jobmanager (this is not always the same as the real number of worker nodes). • Jobs currently go to only a single queue on each resource.
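A rough sketch of how such a relative-performance ranking could be written into a job ad, using the Memory attribute already advertised plus a hypothetical NodeCount attribute; the expression and weighting are illustrative, not the broker's actual logic.

    # Sketch only: prefer resources with more head-node memory and more workers
    rank = Memory + (1000 * NodeCount)   # NodeCount is a hypothetical attribute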
What is currently running and how do I find out? • A simple interface to condor_q. • Planning to use the Condor Job Monitor, when installed, because of scalability issues with condor_q.
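The simple interface is essentially a wrapper over a few standard condor_q queries, along these lines (the job id shown is a placeholder):

    condor_q                         # all jobs in the local Condor-G queue
    condor_q -globus                 # Globus-universe view: Globus status per job
    condor_q -long <cluster>.<proc>  # full ClassAd for a single job (placeholder id)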
Issues with Condor-G • The following is a list of small issues we have: • How do you write some of the resource definitions for clusters? • When using condor_q -globus, the actual hostname the job was matched to is not displayed. • No job exit codes. • Job exit codes will become more important as the number of users (and problems) increases. • Once a job has been allocated to a remote cluster, rescheduling it elsewhere is difficult.
The Users • BaBar: • One resource is the Bristol BaBar farm, so Monte Carlo event production runs in parallel with UoBGrid usage. • GENIE: • Installing software onto each connected system by agreement with the owners. • LHCb: • Windows-compiled Pythia event generation. • Earth Sciences: • River simulation. • Myself: • Undergraduate-written charge-distribution simulation code.
Usage • Current record: • ~10000 individual jobs in a week, • ~2500 in one day.
Windows Condor through Globus • Install a Linux machine as a Condor master only. • Configure this to flock to the Windows Condor pool. • Install a Globus gatekeeper. • Edit the jobmanager .pm file so that the architecture for submitted jobs is always WINNT51 (which matches all the workers in the pool). • The pool then appears in the Condor-G resource list as a WINNT51 resource.
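In effect, every job arriving through this gatekeeper is constrained to the Windows workers; a sketch of what the forced architecture amounts to in the resulting Condor job ad (the exact attributes the edited jobmanager sets are an assumption):

    # Sketch: the constraint the jobmanager edit imposes on each incoming job
    # (WINNT51 is Condor's OpSys value for the Windows XP workers)
    Requirements = (OpSys == "WINNT51") && (Arch == "INTEL")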
Windows Condor pools available through a Globus interface from a flocked Linux pool • There are currently three separate Windows Condor pools within three departments, totalling approximately 200 CPUs. • Planning to have the software installed on student teaching resources in as many departments as possible. • This will allow a significant increase in university processing power at little extra cost. • When a department gives the OK, its pool is added to the flocking list on the single Linux submission machine. • The main difficulty encountered with this setup is the lack of a Microsoft Installer (MSI) file. • This affects the ability to use the group-policy method of software delivery and installation, which in turn affects how some computer officers view installing it.
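Adding a departmental pool then comes down to a flocking entry on each side; a minimal sketch using the standard Condor flocking macros, with hypothetical host names:

    # condor_config on the single Linux submission machine (hypothetical hosts)
    FLOCK_TO = condor-cm.phy.bris.ac.uk, condor-cm.chm.bris.ac.uk

    # condor_config on each departmental pool's central manager:
    # allow the submission machine to flock in
    FLOCK_FROM = linux-submit.bris.ac.uk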
Evaluation of Condor against United Devices GridMP • The Computational Chemistry group has significant links with an industrial partner who is currently using U.D. GridMP. • It was suggested that the CC group also use GridMP, though after initial contact this appeared to be very costly. • The e-Science group suggested that Condor would be a better system for them to use. • Agreement from UD to do a published function & usage comparison between Condor & GridMP. • Due to start this autumn.