Configuring Resources for the Grid Jerry Perez Senior Administrator Texas Tech University
Outline • What is a Job Manager? • Types of Job Managers • PBS Pro • SGE • LSF • Condor/Condor-DAGman • Rocks + Rolls (Quick overview)
What is a Job Manager? • A Job Management System is a software component that ensures: • Balanced use of cluster resources. • Fair allocation of these resources to users' jobs. • An orderly process for deciding which jobs to run, and when and where to run them.
Components of a Job Manager • Resource Management System • a process that maintains the current state of all the resources under its control, including the physical resources of the cluster and account information such as relative priorities and account balances. • Queuing System • a process that maintains the current state of jobs submitted but not completed. • Scheduler • a system that assigns jobs to resources.
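The three components above can be sketched in a few lines of Python. This is a toy model and not the design of any particular job manager: the class names, the node names, and the strict FIFO policy are all assumptions made for illustration, and the queue here only tracks jobs that have not started yet.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class Job:
    name: str
    cpus: int  # processors the job asks for


@dataclass
class ResourceManager:
    """Tracks the current state of resources (here: free CPUs per node)."""
    free_cpus: dict  # node name -> free CPU count

    def allocate(self, node: str, cpus: int) -> None:
        self.free_cpus[node] -= cpus


@dataclass
class JobQueue:
    """Holds submitted jobs that have not been started (a simplification)."""
    pending: deque = field(default_factory=deque)

    def submit(self, job: Job) -> None:
        self.pending.append(job)


def schedule(queue: JobQueue, resources: ResourceManager) -> list:
    """Assign queued jobs to nodes in FIFO order.

    Stops at the first job that does not fit anywhere (strict FIFO,
    an assumed policy chosen to keep the sketch short).
    """
    placements = []
    while queue.pending:
        job = queue.pending[0]
        node = next(
            (n for n, free in resources.free_cpus.items() if free >= job.cpus),
            None,
        )
        if node is None:
            break  # the head of the queue has to wait
        resources.allocate(node, job.cpus)
        placements.append((job.name, node))
        queue.pending.popleft()
    return placements


# Hypothetical two-node cluster and three jobs.
resources = ResourceManager(free_cpus={"node1": 4, "node2": 8})
queue = JobQueue()
queue.submit(Job("a", 4))
queue.submit(Job("b", 6))
queue.submit(Job("c", 8))
placements = schedule(queue, resources)
```

With these inputs, jobs "a" and "b" are placed and "c" stays in the queue because no node has 8 free CPUs left. Real schedulers layer priorities, fair-share accounting, and backfilling on top of this basic loop.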
Why do we need a Job Manager? A Job Management System should always be used for a cluster: • That is operated as a public resource. • That has a large number of users, or users who don't know each other. • That has a large number of nodes and processors. • That runs a large number of jobs. • Whose nodes are heterogeneous in terms of memory, speed, number of processors, software licenses, networking, and other features. Note: Most clusters are homogeneous with respect to hardware and software.
PBS Pro Components • PBS Pro is made up of a server and clients, such as user commands. • The server component manages a number of different objects, such as queues or jobs. • Each object consists of a number of data items or attributes. • Scheduling is policy based and operates in a FIFO round-robin fashion. • Specific queues can be configured for priority queuing. • Requires only minimal queue and scheduler configuration.
SGE – Sun Grid Engine • The SGE version 6 queue configuration allows a single queue to span more than one execution host. • Uses the concept of an SGE Master node controlling “pools” of compute clients. • Can manage up to 10,000 clients per SGE Master node. • SGE can provide Load Leveling on the fly. • Scheduling can be policy based or topologically based. • Addresses the “Backfill” problem. (More on that later.) • Queue optimization is not automatic. It requires “tuning”.
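As a rough illustration of load leveling, the sketch below sends each new job to the host that currently runs the fewest jobs. It is not SGE's actual algorithm, which works from configurable load values; the host names and the "running job count" metric are assumptions.

```python
def dispatch(jobs, running_by_host):
    """Place each job on the host with the fewest running jobs.

    running_by_host maps host name -> number of running jobs and is
    updated in place as jobs are placed. Ties go to the host that
    appears first in the mapping.
    """
    placement = {}
    for job in jobs:
        host = min(running_by_host, key=running_by_host.get)
        running_by_host[host] += 1
        placement[job] = host
    return placement


# Hypothetical pool of three execution hosts with uneven load.
running = {"n1": 2, "n2": 0, "n3": 1}
placement = dispatch(["j1", "j2", "j3"], running)
```

After the three jobs are placed, every host in this example runs two jobs, which is the evening-out effect that load leveling aims for.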
SGE - Basic Cluster Configuration • Configured to reflect site dependencies and to influence batch system behavior. • Site dependencies include valid paths for programs such as mail or xterm. • A global configuration is provided for the Master Host as well as for every host in the grid engine system pool. • Can configure the system to use a configuration local to each host to override particular entries in the global configuration.
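The override rule in the last bullet can be pictured as a dictionary merge in which host-local entries win over global ones. This is only a sketch of the lookup order: the parameter names and paths below are illustrative values, not a dump of a real SGE configuration.

```python
def effective_config(global_conf: dict, local_conf: dict) -> dict:
    """Return the global configuration with host-local entries taking precedence."""
    merged = dict(global_conf)  # copy, so the global settings stay untouched
    merged.update(local_conf)   # local entries override matching global ones
    return merged


# Illustrative values only (assumed paths for the mail and xterm programs).
global_conf = {"mailer": "/bin/mail", "xterm": "/usr/bin/xterm"}
local_conf_node7 = {"mailer": "/usr/bin/mailx"}

conf = effective_config(global_conf, local_conf_node7)
```

Here the hypothetical host keeps the global `xterm` path but uses its own `mailer`, while every other host continues to see the unchanged global configuration.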
LSF • Scheduling can be policy based or topologically based. • Queue optimization is not automatic. It requires “tuning”. • Topologically based scheduling can use load information to schedule jobs. • Addresses the “Backfill” problem. • Jobs in a backfill queue cannot be preempted, because a backfill job might be running in a job slot reserved for a big parallel job, and starting a new job in that slot might delay the start of the parallel job. As a result: • A backfill queue cannot be preemptable. • A preemptive queue whose priority is higher than the backfill queue cannot preempt the jobs in the backfill queue.
LSF - How backfilling works • LSF assumes that a job will run until its run limit expires. • Backfill scheduling works most efficiently when all the jobs in the cluster have a run limit. • Since jobs with a shorter run limit have more chance of being scheduled as backfill jobs, users who specify appropriate run limits in a backfill queue will be rewarded by improved turnaround time.
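The reasoning above comes down to a single comparison: a job may borrow a reserved slot only if its run limit guarantees it finishes before the reservation begins. The sketch below is a simplified model, not LSF's implementation; times are in minutes, the job names are made up, and treating a job without a run limit as ineligible is an assumption of this sketch.

```python
from typing import Optional


def can_backfill(now: int, run_limit: Optional[int], reserved_start: int) -> bool:
    """True if a job started now is sure to finish before the reservation starts."""
    if run_limit is None:
        return False  # assumption: without a run limit the end time is unknown
    return now + run_limit <= reserved_start


def backfill_candidates(now, reserved_start, jobs):
    """Return the names of jobs that could safely use the reserved slot.

    jobs is a list of (name, run_limit_in_minutes) pairs.
    """
    return [name for name, limit in jobs if can_backfill(now, limit, reserved_start)]


# A big parallel job holds a reservation that starts 60 minutes from now.
jobs = [("short", 30), ("long", 120), ("no_limit", None)]
eligible = backfill_candidates(now=0, reserved_start=60, jobs=jobs)
```

Only the 30-minute job qualifies here, which shows why users who specify short, realistic run limits tend to see better turnaround in a backfill queue.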
Condor • Provides a job queuing mechanism • Scheduling policy • Priority scheme • Resource monitoring • Resource management.
Users submit their serial or parallel jobs to Condor. • Condor places them into a queue. • Chooses when and where to run the jobs based upon a policy. • Carefully monitors their progress • Informs the user upon completion • Uses FIFO round-robin scheduling out of the box. • Can use attribute-based scheduling.
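Attribute-based scheduling can be thought of as testing each machine's advertised attributes against a job's requirements. The sketch below uses plain Python dictionaries and a predicate function; it is not Condor's ClassAd language, and the machine names and attribute names are hypothetical.

```python
def match(requirements, machines):
    """Return the name of the first machine that satisfies the job's requirements."""
    for machine in machines:
        if requirements(machine):
            return machine["name"]
    return None  # no machine matches, so the job stays queued


def needs_big_memory(machine):
    """Example requirement: at least 4 GB of memory on a 64-bit x86 machine."""
    return machine["memory_mb"] >= 4096 and machine["arch"] == "X86_64"


# Hypothetical pool of two machines and the attributes they advertise.
machines = [
    {"name": "m1", "memory_mb": 2048, "arch": "X86_64"},
    {"name": "m2", "memory_mb": 8192, "arch": "X86_64"},
]

chosen = match(needs_big_memory, machines)
```

In this example the job lands on "m2", the only machine with enough memory. A job whose requirements no machine meets simply waits in the queue.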
Condor can be used to build Grid-style computing environments that cross administrative boundaries. • Condor's "flocking" technology allows multiple Condor compute installations to work together. • Condor incorporates many of the emerging Grid-based computing methodologies and protocols. • For instance, Condor-G is fully interoperable with resources managed by Globus.
Condor-DAGMan • DAGMan (Directed Acyclic Graph Manager) is a meta-scheduler for Condor. It manages dependencies between jobs at a higher level than the Condor scheduler. • DAGMan is responsible for scheduling, recovery, and reporting for the set of programs submitted to Condor.
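The dependency handling described above amounts to releasing a job only after all of its parents have finished. A minimal sketch using `graphlib` from the Python standard library (Python 3.9 or later) is shown below; it illustrates the ordering idea only and is not how DAGMan itself is implemented. The job names A to D are hypothetical.

```python
from graphlib import TopologicalSorter

# job -> set of jobs that must finish first (a diamond-shaped workflow)
dependencies = {
    "B": {"A"},
    "C": {"A"},
    "D": {"B", "C"},
}


def run_in_dependency_order(dependencies):
    """Group jobs into waves; every job in a wave has all of its parents done."""
    sorter = TopologicalSorter(dependencies)
    sorter.prepare()  # raises graphlib.CycleError if the graph has a cycle
    waves = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # jobs whose parents are all done
        waves.append(ready)
        sorter.done(*ready)  # pretend the whole wave ran successfully
    return waves


waves = run_in_dependency_order(dependencies)
```

For the diamond above, "A" runs first, "B" and "C" become ready together, and "D" runs last. A real meta-scheduler also has to deal with failed jobs and restarts, which this sketch ignores.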
Rocks + Rolls The complexity of cluster management (e.g., determining if all nodes have a consistent set of software) often overwhelms part-time cluster administrators, who are usually domain application scientists. Rocks is a complete clustering solution with a goal to help deliver the computational power of clusters to a wide range of scientific users.
Rocks + Rolls • Before you install Rocks, be sure you have decided which Rolls you wish to include in your installation. • You may install whatever you like; however, remember that you can only choose one scheduler: LSF, SGE, PBS, or Condor. • Schedulers do not like being used together due to resource conflicts.
Rocks + Rolls • Required Rolls: • Base • Hpc • Kernel • Web-server
Rocks + Rolls • List of various rolls: • Area51 - System security related services and utilities • Ganglia - Cluster monitoring system from UCB • Grid - Globus 4.0.1 (GT4) • Condor - Condor Roll • Java - Sun Java SDK and JVM • Myrinet - Myricom’s Myrinet drivers and MPICH environments • Pbs - PBS job queueing system • Ninf - Ninf-G, a simple, yet powerful, client-server-based standard RPC mechanism • Sge - Sun Grid Engine job queueing system • Viz - Support for building visualization clusters • LSF - comes with Platform Rocks
Thank You. Questions?