Jefferson Lab and the Portable Batch System
Walt Akers
High Performance Computing Group
Jefferson Lab and PBS: Motivating Factors
• New Computing Cluster
  • Alpha Based Compute Nodes
    • 16 XP1000 Single Processor Nodes (LINPACK 5.61 GFlop/Sec)
    • 8 UP2000 Dual Processor Nodes (LINPACK 7.48 GFlop/Sec)
• Heterogeneous Job Mix
  • Combination of Parallel and Non-Parallel Jobs
  • Job execution times range from a few hours to weeks
  • Data requirements range from minimal to several gigabytes
• Modest Budget
  • Much of our funding was from internal sources
  • Initial hardware expense was relatively high
• Expandability
  • Can the product be expanded from a few nodes to hundreds?
Jefferson Lab and PBS: Alternative Systems
• PBS - Portable Batch System
  • Open Source Product Developed at NASA Ames Research Center
• DQS - Distributed Queuing System
  • Open Source Product Developed by SCRI at Florida State University
• LSF - Load Sharing Facility
  • Commercial Product from Platform Computing
  • Already Deployed by the Computer Center at Jefferson Lab
• Codine
  • Commercial Version of DQS from Gridware, Inc.
• Condor
  • A Restricted Source ‘Cycle Stealing’ Product From The University of Wisconsin
• Others Too Numerous To Mention
Jefferson Lab and PBS: Why We Chose PBS
• Portability
  • The PBS distribution compiled and ran immediately on both the 64 bit Alpha and 32 bit Intel platforms.
• Documentation
  • PBS comes with comprehensive documentation, including an Administrator's Guide, External Reference, and Internal Reference.
• Active Development Community
  • There is a large community worldwide that continues to improve and refine PBS.
• Modularity
  • PBS is a component oriented system.
  • A well defined API is provided to allow components to be replaced with locally defined modules.
• Open Source
  • The source code for the PBS system is available without restriction.
• Price
  • Hey, it's free…
Jefferson Lab and PBS: The PBS View Of The World
• PBS Server
  • Mastermind of the PBS System
  • Central Point of Contact
• PBS Scheduler
  • Prioritizes Jobs
  • Signals Server to Start Jobs
• Machine Oriented Mini-Server (MOM)
  • Executes Scripts on Compute Nodes
  • Performs User File Staging
Jefferson Lab and PBS: The PBS Server
• Routing Queues
  • Can move jobs between multiple PBS Servers
• Execution Queues
  • Define default characteristics for submitted jobs
  • Define a priority level for queued jobs
  • Hold jobs before, during and after execution
• Node Capabilities
  • The server maintains a table of nodes, their capabilities and their availability.
• Job Requirements
  • The server maintains a table of submitted jobs that is independent of the queues.
• Global Policy
  • The server maintains global policies and default job characteristics.
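
The layering of default job characteristics described above can be illustrated with a minimal Python sketch. The function name, the resource names, and the precedence order (server defaults, then queue defaults, then whatever the user requested) are assumptions made for illustration; they are not taken from the PBS source.

```python
def effective_resources(server_defaults, queue_defaults, requested):
    """Merge job characteristics, later layers overriding earlier ones.

    Assumed precedence (hypothetical): server < queue < user request.
    """
    merged = dict(server_defaults)   # global policy defaults
    merged.update(queue_defaults)    # execution queue defaults
    merged.update(requested)         # values the user specified explicitly
    return merged


if __name__ == "__main__":
    server = {"walltime_hours": 24, "nodes": 1}
    queue = {"walltime_hours": 72}
    job = {"nodes": 4}
    print(effective_resources(server, queue, job))
```

A job that requests only a node count picks up the queue's wall time limit; a job that requests nothing falls back to the server-wide values.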
Jefferson Lab and PBS: The PBS Scheduler
• Prioritizes Jobs
  • Called periodically by the PBS Server
  • Downloads job lists from the server and sorts them based on locally defined requirements.
• Tracks Node Availability
  • Examines executing jobs to determine projected availability time for nodes.
  • Using this data, the scheduler can calculate future deployments and determine when back-filling should be performed.
• Recommends Job Deployment
  • At the end of the scheduling cycle, the scheduler submits to the server a list of jobs that can be started immediately.
  • The PBS Server is responsible for verifying that the jobs can be started, and then deploying them.
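
One scheduling cycle with back-filling can be sketched as follows. This is a simplified, EASY-style backfill written for illustration, not the algorithm shipped with PBS; the `Job` class, the `schedule_cycle` function, and the use of user-estimated wall times are all assumptions.

```python
from dataclasses import dataclass


@dataclass
class Job:
    name: str
    nodes: int        # number of nodes requested
    walltime: float   # user-estimated run time, in hours


def schedule_cycle(queue, free_nodes, running):
    """Return names of jobs that could be started immediately.

    queue:      jobs in priority order, highest priority first
    free_nodes: number of idle nodes right now
    running:    list of (nodes_held, hours_remaining) for executing jobs
    """
    start_now = []
    shadow_time = None  # projected start time of the first blocked job
    spare = 0           # nodes still idle at shadow_time once that job starts
    for job in queue:
        if shadow_time is None:
            if job.nodes <= free_nodes:
                start_now.append(job.name)
                free_nodes -= job.nodes
                running = running + [(job.nodes, job.walltime)]
                continue
            # First job that does not fit: project when enough nodes free up.
            avail = free_nodes
            for held, remaining in sorted(running, key=lambda r: r[1]):
                avail += held
                if avail >= job.nodes:
                    shadow_time = remaining
                    spare = avail - job.nodes
                    break
            # If shadow_time is still None the job can never fit; skip it.
        elif job.nodes <= free_nodes:
            # Backfill only if the blocked job's projected start is not delayed.
            if job.walltime <= shadow_time:
                start_now.append(job.name)
                free_nodes -= job.nodes
            elif job.nodes <= spare:
                start_now.append(job.name)
                free_nodes -= job.nodes
                spare -= job.nodes
    return start_now


if __name__ == "__main__":
    queue = [Job("A", 4, 5.0), Job("B", 1, 2.0), Job("C", 2, 10.0), Job("D", 1, 1.0)]
    # Two nodes idle, two more held by a job with 3 hours remaining.
    print(schedule_cycle(queue, free_nodes=2, running=[(2, 3.0)]))
```

In the usage example, job A needs all four nodes and must wait three hours. Jobs B and D are short enough to finish before then, so they are back-filled; job C would still be running at that point and is held back.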
Jefferson Lab and PBS: Machine Oriented Mini-Server
• Executes Scripts
  • At the direction of the PBS Server, MOM executes the user provided scripts.
  • For parallel jobs, the primary MOM (Mother Superior) starts the jobs on itself and all other assigned nodes.
• Stages Data Files
  • Prior to script execution, the MOM is responsible for remotely copying user specified data files to Mother Superior.
  • Following execution, the resultant data files are remotely copied back to the user specified host.
• Tracks Resource Usage
  • MOM tracks the CPU time, wall time, memory and disk that have been used by the job.
• Kills Rogue Jobs
  • Kills jobs at the PBS Server’s request
Jefferson Lab and PBS: What We’ve Learned So Far
• PBS Is Reasonably Reliable, But Has Room For Improvement
  • PBS Server and PBS Scheduler components work well and behave predictably
  • PBS MOM works okay, but behaves bizarrely in certain situations
    • Disk full = chaos
    • Out of process slots = chaos
    • Improper file transfer or staging = chaos
  • Note: The first two can be avoided by conscientious system management; the last is the responsibility of the job submitter.
• Red Hat Linux 6.2
  • We’ve seen many problems associated with NFS. After upgrading to Kernel 2.2.16-3, many of these problems went away.
  • Klogd occasionally spins out of control and uses all available CPU cycles.
  • Sshd on SMP machines dies for no apparent reason.
  • Crontab works intermittently on SMP nodes.
  • We’re considering experimenting with Tru64 UNIX to see if these problems exist there.
• Writing a Scheduler Is Hard Work
  • We have developed two interim schedulers and are now working on the ‘final’ implementation.
Jefferson Lab and PBS: Ongoing Development
• Underlord Scheduling System
  • Built on the existing PBS Scheduler Framework
  • Plug-in replacement for the default scheduler
  • Uses an object oriented interface to the PBS Server
• Comprehensive match making scheme
  • Starts from an ordered list of jobs
  • Works with a collection of homogeneous or heterogeneous nodes
  • Locates the optimal node or combination of nodes where a job should be deployed
  • Uses user specified job parameters to project future job deployment
  • Uses future job scheduling in combination with backfilling to maximize system utilization.
• Multi-layered job sorting algorithm
  • Time in queue
  • Projected execution time
  • Number of processors requested
  • Queue priority
  • Progressive user share (similar to the LSF scheme)
• Generates a projection table
  • Allows users to determine when their job is projected to start
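
A multi-layered sort over the criteria listed above can be sketched in Python with a tuple key. The slides do not say in which order or direction the layers are applied, so everything here is an assumption: the layer order follows the bullet order, and longer waits, shorter jobs, fewer processors, higher queue priority and lighter recent usage are taken to sort first. The `QueuedJob` fields are hypothetical names.

```python
from dataclasses import dataclass


@dataclass
class QueuedJob:
    name: str
    hours_in_queue: float
    projected_hours: float   # user-supplied estimate of execution time
    processors: int
    queue_priority: int      # larger means more important (assumed)
    user_share_used: float   # fraction of recent cycles used by the owner


def sort_key(job):
    # Tuples compare element by element, so each field only breaks ties
    # left by the fields before it. Negation flips the direction.
    return (
        -job.hours_in_queue,
        job.projected_hours,
        job.processors,
        -job.queue_priority,
        job.user_share_used,
    )


def order_jobs(jobs):
    """Return the jobs in the order the scheduler would consider them."""
    return sorted(jobs, key=sort_key)


if __name__ == "__main__":
    jobs = [
        QueuedJob("a", 5.0, 10.0, 2, 1, 0.5),
        QueuedJob("b", 5.0, 2.0, 4, 1, 0.1),
        QueuedJob("c", 8.0, 20.0, 1, 0, 0.9),
    ]
    print([j.name for j in order_jobs(jobs)])
```

In the usage example, job c has waited longest and sorts first; jobs a and b have waited equally long, so the shorter projected run time puts b ahead of a.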
Jefferson Lab and PBS: Future Directions
• Data Grid Server
  • To give the Batch System greater flexibility and allow it to accommodate data provided through the proposed Data Grid system, a Data Grid Server will be added to the existing system components.
  • This module will have the following capabilities:
    • Will provide time projections for when data will be available
    • Will perform data migration to a script accessible host
    • Will provide mechanisms to transfer resultant data to a specified location
    • Will replace the existing staging capabilities of the PBS Server and PBS MOM.
• PBS Meta-Facility - The Overlord Scheduler
  • The Overlord Scheduler will be a centralized location where jobs are submitted and from which they can be forwarded to other PBS Clusters for execution. It will have the following capabilities:
    • Will prioritize and sort all jobs based on global Meta-Facility rules
    • Will consider job requirements, data location and network throughput, and will forward each job to the PBS Server where it will be scheduled earliest.
    • Will not forward a job to one of the ‘Underlord’ systems until it is eligible for immediate execution there.
• We don’t have all of this figured out yet… but we are confident.
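
The forwarding decision described for the Overlord Scheduler can be sketched as picking the cluster with the earliest estimated ready time. This is a minimal sketch of a design that is explicitly not finalized: the `Cluster` fields, the function names, and the simplification that data transfer overlaps with queue wait are all assumptions.

```python
from dataclasses import dataclass


@dataclass
class Cluster:
    name: str
    queue_wait_hours: float  # projected wait before the job could start
    has_data: bool           # True if the job's input data is already local
    gb_per_hour: float       # network throughput into this cluster


def ready_time(cluster, data_gb):
    """Hours until the job could start: the later of queue wait and transfer."""
    transfer = 0.0 if cluster.has_data else data_gb / cluster.gb_per_hour
    return max(cluster.queue_wait_hours, transfer)


def choose_cluster(clusters, data_gb):
    """Pick the cluster where the job is estimated to start earliest."""
    return min(clusters, key=lambda c: ready_time(c, data_gb))


if __name__ == "__main__":
    clusters = [
        Cluster("A", queue_wait_hours=4.0, has_data=True, gb_per_hour=10.0),
        Cluster("B", queue_wait_hours=0.0, has_data=False, gb_per_hour=5.0),
    ]
    print(choose_cluster(clusters, data_gb=10.0).name)
    print(choose_cluster(clusters, data_gb=40.0).name)
```

With 10 GB of input, the idle cluster B wins even though the data must be copied there (2 hours against a 4 hour queue wait on A). With 40 GB, the transfer takes 8 hours and the job is better off waiting on A, where the data already resides.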
Jefferson Lab and PBS: Places On The Web
• Jefferson Lab HPC Home Page
  • http://www.jlab.org/hpc
  • Currently we have most of the PBS documentation and some statistics about our cluster and its development.
• PBS Home Page
  • http://www.openpbs.org
  • Register and download PBS and all documentation from this site