Sun Grid Engine
Grids • Grids are collections of resources made available to customers. • Compute grids make cycles available to customers from an access point; kind of like plugging into an electrical grid • • Cluster grids: resources in one room • • Campus grids: multiple clusters on one campus • • Global grids: Cross administrative domains
Grids • Potentially (ideally?) you could completely outsource your HPC needs by buying time on a commercial grid. Running a big data center is tricky and takes expensive people. If you are, say, a small computer animation group working on an animated short, it might not make sense to set up a data center for six months of work • On the other hand, if you’re Pixar or Lucas, this is a core competency
Sun Grid Engine • SGE is a piece of software that matches jobs to compute resources • BTW, SGE runs on OS X. This would be another fine project for someone to investigate
SGE • As we’ve seen, Sun Grid Engine can accept a batch job and give it to a compute node. • SGE (base level) is open source; see http://gridengine.sunsource.net/ • There are some other issues: • • Multiple queues • • Giving jobs only to nodes with the necessary resources • • Queue manipulation
SGE • Users submit jobs; they’re kept by SGE in a holding area until resources become available, then sent to an execution host. The results are reported back. • Types of hosts: master, execution, administration, and submit • Master runs the master daemon and scheduling daemon • Execution hosts are where jobs are run; admin hosts can manipulate the queues • There are a lot of knobs to twiddle on SGE
SGE • Imagine a bank that has five customers walk in. Four just want to deposit a check, and the fifth wants to set up a home loan. • If the home loan guy happens to be first, and there is only one queue, the four with short transactions wait for a long time. • What’s more, the home loan guy must have manager approval at some point in the process • So: set up two queues, one for long transactions, one an express lane. The home loan queue specifies that the manager must be available. • This reduces the median time spent in queue for the short transaction customers, and reduces the variance of the waiting time
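The bank analogy can be checked with a little arithmetic. The sketch below is only an illustration: the service times (60 minutes for the loan, 2 minutes per deposit) are made-up numbers, and it assumes one teller per line. It pools the deposit customers' waiting times over every possible arrival position of the loan customer, then compares a single shared line with a separate express lane.

```python
from statistics import median, pvariance

LOAN, DEPOSIT = 60, 2  # hypothetical service times in minutes


def waits(service_times):
    """Return how long each customer waits in one FIFO line with one teller."""
    elapsed, result = 0, []
    for t in service_times:
        result.append(elapsed)
        elapsed += t
    return result


def deposit_waits_single_queue(loan_position):
    """Waits of the four deposit customers when everyone shares one line."""
    line = [DEPOSIT] * 4
    line.insert(loan_position, LOAN)
    w = waits(line)
    return [w[i] for i in range(5) if i != loan_position]


# Pool the deposit customers' waits over every possible arrival order.
single = [w for pos in range(5) for w in deposit_waits_single_queue(pos)]
# With an express lane the deposit customers never queue behind the loan.
express = waits([DEPOSIT] * 4) * 5

print("median wait:", median(single), "vs", median(express))
print("variance:   ", pvariance(single), "vs", pvariance(express))
```

Both the median and the variance come out smaller for the express lane, which is the motivation for giving short jobs their own queue.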
SGE Queues • There may be more than one queue; jobs are associated with queues • qconf -sql shows the list of defined queues • Why multiple queues? Some types of jobs may be very long or require specific resources, so users may submit jobs to queues optimized for those types of jobs. [Diagram: SGE Master and SGE Scheduler, with execution hosts served by queues Q1 and Q2]
Scheduler • The scheduler (which assigns jobs to execution hosts) looks at several factors: • • Load parameters: how busy the execution hosts are by some measure • • Consumable resources: memory, disk space, licenses, etc. SGE keeps track of these and dispatches a job only if resources are available • • Attributes, such as 64-bit, G5, etc. These aren’t necessarily consumed, but may simply be a state • The scheduler may look at all these factors before assigning a job from the holding pool to an execution host
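A toy model may make the three factors concrete. This is a minimal sketch, not SGE's actual scheduling algorithm: the Host and Job classes, the field names, and the "least-loaded eligible host wins" rule are all assumptions made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Host:
    name: str
    load: float                                 # load parameter, e.g. load average
    free: dict                                  # consumables still available
    attrs: set = field(default_factory=set)     # static attributes, not consumed


@dataclass
class Job:
    name: str
    needs: dict                                 # consumables the job would use
    attrs: set = field(default_factory=set)     # attributes the job requires


def pick_host(job, hosts):
    """Return the least-loaded host that satisfies the job, or None."""
    eligible = [
        h for h in hosts
        if job.attrs <= h.attrs
        and all(h.free.get(res, 0) >= amount for res, amount in job.needs.items())
    ]
    return min(eligible, key=lambda h: h.load, default=None)


# Hypothetical cluster and job.
hosts = [
    Host("compute-0-0", 0.2, {"mem_gb": 4}, {"64bit"}),
    Host("compute-0-1", 0.9, {"mem_gb": 16, "license": 1}, {"64bit"}),
    Host("compute-0-2", 0.1, {"mem_gb": 16, "license": 1}),
]
job = Job("sim", {"mem_gb": 8, "license": 1}, {"64bit"})
chosen = pick_host(job, hosts)
print(chosen.name if chosen else "job stays in the holding area")
```

Here the idle hosts are rejected (one lacks memory, one lacks the 64-bit attribute), so the job goes to the busier host that actually meets its requirements.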
Consumable Resources • There are some finite resources in the cluster: CPU time, disk space, licenses, bandwidth • Available capacity for these is defined by the administrator; the scheduler examines available consumables when deciding what to run
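The bookkeeping behind consumables can be sketched in a few lines. This is an illustrative model only: the Consumables class and its method names are hypothetical, and the capacities are invented. The point is that a request is granted only when every requested amount fits within what the administrator defined, and that amounts are given back when a job finishes.

```python
class Consumables:
    """Track administrator-defined capacity and what running jobs hold."""

    def __init__(self, capacity):
        self.capacity = dict(capacity)
        self.in_use = {name: 0 for name in capacity}

    def available(self, name):
        return self.capacity.get(name, 0) - self.in_use.get(name, 0)

    def reserve(self, request):
        """Reserve everything in the request, or nothing at all."""
        if any(self.available(name) < amount for name, amount in request.items()):
            return False
        for name, amount in request.items():
            self.in_use[name] = self.in_use.get(name, 0) + amount
        return True

    def release(self, request):
        """Return a finished job's resources to the pool."""
        for name, amount in request.items():
            self.in_use[name] -= amount


# Hypothetical capacities: two licenses and 100 GB of scratch disk.
pool = Consumables({"license": 2, "disk_gb": 100})
job = {"license": 1, "disk_gb": 60}
print(pool.reserve(job))   # first job fits
print(pool.reserve(job))   # second job does not: only 40 GB of disk left
pool.release(job)
print(pool.reserve(job))   # fits again once the first job is done
```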
Requestable Attributes • On job submission you can request attributes or characteristics: at least X amount of memory, a license for software package Y, a 64-bit host, etc. • In a production environment licenses can be a big deal. Circuit design software may cost thousands per node, so not every node in the cluster may have a license. • The attributes can be related to the hosts or the queues • Attributes that are “requestable” can be mentioned in the qsub command, so jobs may require that attribute to run
SGE • You don’t need to submit a job to a specific queue; instead you can simply ask for certain resources, and SGE will pick a queue based on the requirement profile
Environment Variables • When a job runs on a host some environment variables are set: • ARC • SGE_ROOT • SGE_STDOUT_PATH • HOME
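A job script can inspect these variables at run time. The sketch below assumes the job is a Python script; the helper name sge_environment and the "<not set>" placeholder are made up for illustration. Outside of an SGE job most of these variables will simply be missing, which the code tolerates.

```python
import os

# The variables named on the slide.
SGE_VARS = ("ARC", "SGE_ROOT", "SGE_STDOUT_PATH", "HOME")


def sge_environment(env=os.environ):
    """Return the listed variables, marking any that are not set."""
    return {name: env.get(name, "<not set>") for name in SGE_VARS}


for name, value in sge_environment().items():
    print(f"{name}={value}")
```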
Dependencies • Suppose you divide up a task into several subtasks. This can require sequencing: some subtasks may need to finish before other subtasks can run. You can specify a list of jobs that must finish before this job runs.
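The sequencing idea can be sketched as a dependency graph. The subtask names below are hypothetical, and the code uses Python's standard graphlib module (Python 3.9+) rather than anything from SGE; in SGE itself the list of prerequisite jobs is given at submission time (qsub's -hold_jid option). The sketch only shows a valid order in which jobs become eligible to run.

```python
from graphlib import TopologicalSorter

# Hypothetical subtasks: each job maps to the jobs that must finish first.
depends_on = {
    "split": set(),
    "part_a": {"split"},
    "part_b": {"split"},
    "merge": {"part_a", "part_b"},
}


def run_order(deps):
    """Return one order in which every job runs after its prerequisites."""
    return list(TopologicalSorter(deps).static_order())


order = run_order(depends_on)
print(order)
```

The two middle jobs do not depend on each other, so a scheduler is free to run them at the same time on different hosts.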
Listing Attributes • qconf -scl lists “complexes” of attributes. Typically this includes a complex for the queues, and one for the hosts • qconf -sc host|queue lists attributes for a complex
#name     shortcut  type    value  relop  requestable  consumable  default
#--------------------------------------------------------------------------
arch      a         STRING  none   ==     YES          NO          none
num_proc  p         INT     1      ==     YES          NO          0
load_avg  la        DOUBLE  99.99  >=     NO           NO          0
Modifying Attributes • qconf -mc [complex name] opens up an editor that allows you to modify the complex settings
Attributes • Note that some attributes are “requestable”. This means that you can specify on the qsub command line that your job requires that attribute. • qsub -l arch=glinux says the job requires a “glinux” host to run • qconf -se compute-0-0 shows resources for a host
Priorities • By default jobs are handled in a FIFO manner. As they come in they are assigned to a compatible queue for processing by the scheduler. • qsub -p can provide a priority to the job that can override FIFO behavior. • Use qstat to find jobs in the holding area and qdel to delete them
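FIFO with a priority override can be modelled with a heap. This is a sketch under assumptions, not SGE's real dispatch logic: the HoldingArea class is hypothetical, a larger priority number is taken to mean "run sooner", and jobs with equal priority leave in arrival order.

```python
import heapq
from itertools import count


class HoldingArea:
    """Pending jobs: highest priority first, FIFO among equal priorities."""

    def __init__(self):
        self._heap = []
        self._arrival = count()   # tie-breaker that preserves submission order

    def submit(self, job, priority=0):
        # heapq is a min-heap, so negate the priority to pop the largest first.
        heapq.heappush(self._heap, (-priority, next(self._arrival), job))

    def next_job(self):
        return heapq.heappop(self._heap)[2]


pending = HoldingArea()
pending.submit("job_a")
pending.submit("job_b")
pending.submit("urgent", priority=10)
print([pending.next_job() for _ in range(3)])
```

The late but high-priority job jumps the line, while the two default-priority jobs keep their original order.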
Checkpointing • Sometimes on very long jobs it is worthwhile to be able to stop the job and restart it later. • What are the issues involved here? • Why use it? • Starter, suspend, resume, terminate methods
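One way to picture the idea is application-level checkpointing, where the program itself saves its progress. The sketch below is only that: a toy job that sums integers and writes its state to a JSON file after every step. The function name, file format, and the stop_after parameter (which simulates the job being stopped) are all assumptions; it does not use SGE's checkpointing interface.

```python
import json
import os
import tempfile


def run(total_steps, checkpoint_path, stop_after=None):
    """Sum 0..total_steps-1, saving progress so a restart can resume.

    Returns the final sum, or None if the run was stopped early.
    """
    state = {"step": 0, "total": 0}
    if os.path.exists(checkpoint_path):
        with open(checkpoint_path) as f:
            state = json.load(f)          # resume from the saved state

    done_this_run = 0
    while state["step"] < total_steps:
        if stop_after is not None and done_this_run >= stop_after:
            return None                   # simulated interruption
        state["total"] += state["step"]
        state["step"] += 1
        done_this_run += 1
        with open(checkpoint_path, "w") as f:
            json.dump(state, f)           # checkpoint after every step
    return state["total"]


path = os.path.join(tempfile.mkdtemp(), "job.ckpt")
print(run(10, path, stop_after=4))   # stopped part-way through
print(run(10, path))                 # restarted: picks up where it left off
```

Even this toy version hints at the issues the slide asks about: what state must be saved, how often to save it, and what happens if the job is stopped in the middle of writing the checkpoint.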
Hard & Soft Requirements • A hard requirement must be present before the job is scheduled
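The difference between the two kinds of requirement can be sketched as "filter, then rank". This sketch assumes the usual meaning of a soft requirement (preferred, but the job can still run without it); the host names, attribute names, and the choose_host function are hypothetical.

```python
def choose_host(hosts, hard, soft):
    """Pick a host that has every hard attribute, preferring soft matches.

    hosts maps a host name to its set of attributes. Returns None when no
    host meets the hard requirements, so the job would keep waiting.
    """
    eligible = [name for name, attrs in hosts.items() if hard <= attrs]
    if not eligible:
        return None
    # Among eligible hosts, prefer the one satisfying the most soft requests.
    return max(eligible, key=lambda name: len(soft & hosts[name]))


# Hypothetical hosts and attributes.
hosts = {
    "compute-0-0": {"64bit"},
    "compute-0-1": {"64bit", "fast_disk"},
    "compute-0-2": {"fast_disk"},
}
print(choose_host(hosts, hard={"64bit"}, soft={"fast_disk"}))
print(choose_host(hosts, hard={"64bit", "gpu"}, soft=set()))
```

A missing soft attribute never blocks the job; a missing hard attribute always does.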