Preparing for the Grid: Changes in Batch Systems at Fermilab. HEPiX Batch System Workshop, Karlsruhe, Germany. Ken Schumacher, Steven Timm. Introduction. All big experiments at Fermilab (CDF, D0, CMS) are moving to grid-based processing. This talk will cover the following:
Preparing for the Grid: Changes in Batch Systems at Fermilab
HEPiX Batch System Workshop, Karlsruhe, Germany
Ken Schumacher, Steven Timm
Batch systems@FNAL--Batch Workshop HEPiX Karlsruhe
Introduction
• All big experiments at Fermilab (CDF, D0, CMS) are moving to grid-based processing.
• This talk will cover the following:
  • Batch scheduling at Fermilab before the grid.
  • The move of the big Fermilab clusters to Condor, and why it happened.
  • Future requirements for batch scheduling at Fermilab.
Before Grid--FBSNG
• Fermilab had four main clusters: CDF Reconstruction Farm, D0 Reconstruction Farm, General Purpose Farm, and CMS.
• All used FBSNG (Farms Batch System Next Generation). http://www-isd.fnal.gov/fbsng
• Most early activities on these farms were reconstruction of experimental data and generation of Monte Carlo.
• All were referred to generically as “Reconstruction Farms”.
FBSNG scheduling in Reconstruction Farms
• Dedicated reconstruction farm (CDF, D0)
  • Large cluster dedicated to one experiment
  • Small team of experts submits all jobs
  • Scheduling is trivial
• Shared reconstruction farm (General Purpose)
  • Small cluster shared by 10 experiments, each with one or more queues
  • Each experiment has a maximum quota of CPUs it can use at once
  • Each experiment has a maximum share of the farm it can use when the farm is oversubscribed
  • Most queues do not have time limits. Priority is calculated taking into account the average time that jobs in the queue have been running
  • Special queues for I/O jobs that run on the head node and move data to and from mass storage
  • Guaranteed scheduling means that everything will eventually run
    • Other queues may be manually held to let a job run
    • May have to temporarily idle some nodes in order to let a large parallel job start up
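The quota-and-share rules on the shared farm can be sketched in a few lines of Python. This is only an illustration of the idea, not FBSNG's actual algorithm: the `Group` class, `pick_next_group`, and `fill_farm` are hypothetical names, and the sketch ignores the running-time term that FBSNG's priority calculation also takes into account.

```python
from dataclasses import dataclass


@dataclass
class Group:
    """One experiment on the shared farm (hypothetical model)."""
    name: str
    quota: int        # hard cap on CPUs the group may use at once
    share: float      # target fraction of the farm when oversubscribed
    waiting: int = 0  # queued single-CPU jobs
    running: int = 0  # CPUs currently in use


def pick_next_group(groups, total_cpus):
    """Return the group whose next job should start, or None."""
    eligible = [g for g in groups if g.waiting > 0 and g.running < g.quota]
    if not eligible:
        return None
    # Favor the group that is furthest below its share of the farm.
    return min(eligible, key=lambda g: g.running / total_cpus - g.share)


def fill_farm(groups, total_cpus):
    """Hand out free CPUs one at a time until none are left or no job fits."""
    free = total_cpus - sum(g.running for g in groups)
    while free > 0:
        group = pick_next_group(groups, total_cpus)
        if group is None:
            break
        group.running += 1
        group.waiting -= 1
        free -= 1
    return {g.name: g.running for g in groups}
```

With two groups of equal share, both with long queues, on a 10-CPU farm, `fill_farm` hands five CPUs to each; a group whose quota is lower than its share stops at its quota and the remainder goes to whoever still has jobs waiting.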
FBSNG Advantages and Disadvantages
• Advantages
  • Light resource consumption by batch system daemons
  • Simple design, based on resource counting rather than load measuring and balancing
  • Cost: no per-node license fee
  • Customized for Fermilab Strong Authentication requirements (Kerberos)
  • Quite reliable; the FBSNG software rarely if ever fails
• Disadvantages
  • Designed strictly for Fermilab Run II production
  • Lacks grid-friendly features such as x509 authentication, although they could be added
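To make "resource counting rather than load measuring" concrete, here is a minimal Python sketch. It assumes each node advertises fixed counts of named resources and each job declares what it needs; the function names and the resource names (`cpu`, `disk_gb`) are invented for illustration and are not taken from FBSNG.

```python
def try_start(node_free, job_request):
    """Start a job only if every requested resource count fits on the node.

    node_free and job_request are dicts of resource name -> count.
    No load is measured: the decision uses declared counts only.
    """
    if any(node_free.get(name, 0) < amount
           for name, amount in job_request.items()):
        return False
    for name, amount in job_request.items():
        node_free[name] -= amount
    return True


def release(node_free, job_request):
    """Give a finished job's resources back to the node."""
    for name, amount in job_request.items():
        node_free[name] += amount
```

The appeal of this design is that the scheduler never has to poll the nodes: it only adds and subtracts integers when jobs start and finish.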
Grid can use any batch system. Why Condor?
• Free software (but you can buy support).
• Supported by a large team at U. of Wisconsin (and not by Fermilab programmers) http://www.cs.wisc.edu/condor
• Widely deployed in multi-hundred node clusters.
• New versions of Condor allow Kerberos 5 and x509 authentication.
• Comes with Condor-G, which simplifies submission of grid jobs.
• Condor-C components allow for interoperation of independent Condor pools.
• Some of our grid-enabled users already take advantage of the extended Condor features, so it is the fastest way to get our users on the grid.
• USCMS production cluster at Fermilab has switched to Condor; the CDF reconstruction farm cluster is switching.
• General Purpose Farms, which are smaller, also plan to switch to Condor to be compatible with the two biggest compute resources on site.
Rise of Analysis Clusters
• Experiments now use multi-hundred node Linux clusters for analysis as well, replacing expensive central machines
  • CDF Central Analysis Facility (CAF) originally used FBSNG and has now switched to Condor.
  • D0 Central Analysis Backend (CAB) uses PBS/Torque.
  • USCMS User Analysis Facility (UAF) used FBSNG as a primitive load balancer for interactive shells; it will switch to a Cisco load balancer shortly.
• Heterogeneous job mix
• Many different users and groups have to be prioritized within the experiment
CAF software
• In CDF terms, CAF refers to both the cluster and the software that makes it go.
• CDF collaborators (UCSD+INFN) wrote a series of wrappers around FBSNG referred to as “CAF”.
• Wrappers allow connecting to a running job to debug it, tailing files of a running job, and many other things.
• Also added monitoring functions.
• Users are tracked by Kerberos principal and prioritized with different batch queues, but all jobs run under just a few user IDs, making management easy.
• dCAF is distributed CAF, the same setup replicated at dedicated CDF resources around the world.
• Info at http://cdfcaf.fnal.gov
CondorCAF in production
• CDF changed the batch system in its analysis facility to Condor
• Also rewrote monitoring software to work with Condor
  • http://cdfcaf.fnal.gov/condorcaf/history/user.html
• Condor “computing on demand” capability allows users to list files, tail files, and debug on batch nodes.
• Lots of work from the Condor team to get it working with Kerberos authentication and the large number of nodes (~700).
• Half of the CDF reconstruction farm is now also running Condor
• Rest of the CDF reconstruction farm will convert once validation is complete
• SAM is the data delivery and bookkeeping mechanism
  • Used to fetch data files, keep track of intermediate files, and store the results.
  • Replaces a user-written bookkeeping system that was high-maintenance
• Next step: GlideCAF, to make CAF work with Condor glide-ins across the grid on non-dedicated resources.
[Figure: screen from CondorCAF monitoring]
SAMGrid
• D0 is using SAMGrid for all remote generation of Monte Carlo and reprocessing at several sites worldwide.
• D0 Farms at FNAL are the biggest site.
• http://projects.fnal.gov/samgrid
• Special job managers written to do intelligent handling of production and Monte Carlo requests
• All job requests and data requests go through head nodes to the outside net. Significant scalability issues, but it is in production.
• D0 reconstruction farms at Fermilab will continue to use FBSNG.
Open Science Grid
• Continuation of efforts that were begun in Grid3.
• Integration testing has been ongoing since February
• Provisioning and deployment is occurring as we speak.
• At Fermilab, USCMS production cluster and General Purpose Farms will be the initial presence on OSG.
• 10 Virtual Organizations so far, mostly US-based:
  • USATLAS (ATLAS collaboration)
  • USCMS (CMS collaboration)
  • SDSS (Sloan Digital Sky Survey)
  • fMRI (functional Magnetic Resonance Imaging, based at Dartmouth)
  • GADU (Applied Genomics, based at Argonne)
  • GRASE (Engineering applications, based at SUNY Buffalo)
  • LIGO (Laser Interferometer Gravitational-Wave Observatory)
  • CDF (Collider Detector at Fermilab)
  • STAR (Solenoidal Tracker at RHIC, BNL)
  • iVDGL (International Virtual Data Grid Laboratory)
• http://www.opensciencegrid.org
Structure of General Purpose Farms OSG Compute Element
• One node runs the Globus Gatekeeper and does all communication with the grid
• Software comes from VDT (Virtual Data Toolkit, http://www.cs.wisc.edu/vdt/).
• In this configuration the gatekeeper is also the Condor master. Condor software is part of VDT.
• Will make a separate Condor head node later, once the software configuration is stable.
• All grid software is exported by NFS to the compute nodes. No change to the compute node install is necessary.
Fermigrid
• Fermigrid is an internal project at Fermilab to make the different Fermilab resources interoperate and be available to the Open Science Grid
• Fermilab will start with General Purpose Farms and CMS being available to OSG and to each other.
• All non-Fermi organizations will send jobs through a common site gatekeeper.
• Site gatekeeper will route jobs to the appropriate cluster, probably using Condor-C; details to be determined.
• Fermigrid provides a VOMS server to manage all the Fermilab-based Virtual Organizations
• Fermigrid provides a GUMS server to map grid Distinguished Names to Unix user IDs.
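The two jobs of the site gatekeeper described above, mapping a grid identity to a local account and routing the job to a cluster, can be sketched in Python as follows. This is a toy model under stated assumptions, not the GUMS or Condor-C interface: the function names, the table layout, the example DNs, and the account and cluster names are all hypothetical.

```python
def map_identity(dn, vo, account_map):
    """Return the local Unix account for a (DN, VO) pair, or None.

    account_map is a dict keyed by ("dn", <distinguished name>) for
    per-user entries and ("vo", <vo name>) for VO-wide group accounts.
    A per-user entry wins over the VO-wide account.
    """
    return account_map.get(("dn", dn)) or account_map.get(("vo", vo))


def route_job(vo, cluster_for_vo, default_cluster=None):
    """Pick the cluster that should receive a job from the given VO."""
    return cluster_for_vo.get(vo, default_cluster)


# Hypothetical example tables.
ACCOUNT_MAP = {
    ("vo", "uscms"): "uscms01",
    ("vo", "sdss"): "sdss",
    ("dn", "/DC=org/DC=example/OU=People/CN=Prod Manager"): "cmsprod",
}
CLUSTER_FOR_VO = {"uscms": "cms-production", "sdss": "gp-farms"}
```

A job whose credential maps to no account is refused at the gatekeeper, while a VO with no dedicated cluster falls through to `default_cluster`.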
Current Farms Configuration
[Diagram labels: FNSFO FBSNG head node; ENSTORE; ENCP; NFS RAID; FBS Submit; GP Farms FBSNG worker nodes (102 currently)]
Configuration with Grid
[Diagram labels: Job from OSG; Fermigrid1 site gatekeeper; FNGP-OSG gatekeeper; ENSTORE; FNPCSRV1 FBSNG head node; Job from Fermilab; NFS RAID; FBS Submit; Condor submit; new Condor worker nodes (40, coming this summer); Condor worker nodes (14 currently); GP Farms FBSNG worker nodes (102 currently)]
Requirements
• Scheduling
  • Current FBSNG installation in the General Purpose Farms has complicated shares and quotas
  • Have to find the best way to replicate this in Condor.
  • Hardest case to handle: low-priority long jobs come into the farm while it is idle and fill it up. Do we pre-empt? Suspend?
• Grid credentials and mass storage
  • Need to verify that we can use Storage Resource Manager and gridftp from the compute nodes, not just the head node.
• Grid credentials: authentication + authorization
  • Condor has Kerberos 5 and x509 authentication
  • Need a way to pass these credentials through the Globus GRAM bridge to the batch system
  • Otherwise local as well as grid jobs end up running non-authenticated and trusting the gatekeeper.
Requirements 2
• Accounting and auditing
  • Need features to track which groups and which users are using the resources
  • VOs need to know who within their VO is using resources
  • Site admins need to know who is crashing their batch system
• Extended VO Privilege
  • Should be able to set priorities in the batch system and mass storage system by virtual organization and role.
  • In other words, Production Manager should be able to jump ahead of Joe Graduate Student in the queue.
• Practical Sysadmin concerns
  • Some grid user mapping scenarios envision hundreds of pool user IDs per VO.
  • All of these accounts would have to be given quotas, home directories, etc.
  • Would be very nice to do as CondorCAF does and run with a few user IDs traceable back to a Kerberos principal or grid credential.
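The role-based priority and per-credential accounting asked for above might look like the following Python sketch. It is a thought experiment rather than a description of any batch system's feature: `RoleQueue`, `ROLE_RANK`, the role names, and the example principals are all hypothetical. The point is that jobs can be ordered by VO role while the submitting credential is recorded separately, so only a few local user IDs are needed.

```python
import heapq
from collections import Counter
from itertools import count

# Hypothetical role ranking: a lower number is scheduled first.
ROLE_RANK = {"production": 0, "user": 10}


class RoleQueue:
    """Queue ordered by VO role, remembering who submitted each job."""

    def __init__(self):
        self._heap = []
        self._order = count()  # tie-breaker: first in, first out within a role
        self.owner = {}        # job_id -> (vo, credential)

    def submit(self, job_id, vo, role, credential):
        # Unknown roles get the lowest priority defined in ROLE_RANK.
        rank = ROLE_RANK.get(role, max(ROLE_RANK.values()))
        heapq.heappush(self._heap, (rank, next(self._order), job_id))
        self.owner[job_id] = (vo, credential)

    def next_job(self):
        """Return the id of the next job to run, or None if the queue is empty."""
        if not self._heap:
            return None
        return heapq.heappop(self._heap)[2]

    def jobs_per_credential(self, vo):
        """Accounting view: how many jobs each credential in a VO has submitted."""
        return Counter(cred for v, cred in self.owner.values() if v == vo)
```

If a graduate student submits two jobs and the production manager submits one in between, `next_job` returns the production job first and then the student's jobs in submission order, while `jobs_per_credential` still attributes every job to the credential that submitted it.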