
Fermi General Purpose Farms OSG Deployment With suggestions for Run II Farms


Presentation Transcript


1. Fermi General Purpose Farms OSG Deployment, with suggestions for Run II Farms
Steven Timm, Fermilab CDF Grid Workshop, 3/11/05

2. Outline
• Fermi GP Farms presence on the OSG and SAMGrid
• Typical OSG-enabled cluster
• Expansion/conversion plans

3. OSG Integration Testbed 0.1.x; x→∞
• OSG Integration Testbed on the air this week: http://www.ivdgl.org/gridcat/index.php?whichmap=us
• About 8-10 sites, each a small subset of a cluster; includes Fermi GP Farms (14 nodes) and Fermi USCMS.
• Currently using VDT 1.3.2, with one new VDT release per week; we can reinstall the VDT in less than 1 hour.
• Default OSG configuration is similar to Grid3.
• My understanding is that the big Grid3 clusters will deploy OSG on a scale of a couple of months.

4. Existing Farms Grid presence
• Test site for SAMGrid Farms (samgfarm1 and samgfarm2, 31 old nodes) as part of the general purpose farms
• Production site for SAMGrid as part of the D0 reconstruction farms
• Part of the Open Science Grid integration testbed (fnpcg, 14 nodes)

5. Current Fermi OSG presence
• Node fnpcg as gatekeeper and Condor master, with 14 worker nodes as the Condor pool.
• Using Grid3-style grid-mapfiles at the moment (see the sketch below).
• Tested against a test VOMS/GUMS server; it worked.
• Passed all validation tests.
• Tested submitting jobs to FBSNG as well as Condor; this works as well.
• Default configuration uses NFS to share the VDT, application areas, and data areas to the worker nodes.
• An OSG-like readiness plan follows:
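As context for the grid-mapfile bullet, a Grid3-style grid-mapfile is just a text file of quoted certificate DNs mapped to local accounts. A minimal parsing sketch follows, assuming the conventional /etc/grid-security/grid-mapfile location; the path and the entry shown in the comment are the usual convention, not the actual GP Farms contents.

```python
# Minimal sketch: read a Grid3-style grid-mapfile into a DN -> local-account map.
# The default path is the conventional location; adjust for wherever the
# VDT/OSG installation keeps it on the gatekeeper.

def read_gridmap(path="/etc/grid-security/grid-mapfile"):
    """Return a dict mapping certificate DNs to local account names."""
    mapping = {}
    with open(path) as f:
        for line in f:
            line = line.strip()
            if not line or line.startswith("#"):
                continue  # skip blanks and comments
            # Typical entry: "/DC=org/DC=doegrids/OU=People/CN=Some User" localacct
            if line.startswith('"'):
                dn, _, rest = line[1:].partition('"')
            else:
                dn, _, rest = line.partition(" ")
            account = rest.strip().split(",")[0]  # first account if several are listed
            if dn and account:
                mapping[dn] = account
    return mapping

if __name__ == "__main__":
    for dn, acct in sorted(read_gridmap().items()):
        print(f"{acct:12s} {dn}")
```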

6. Batch system priorities for Grid users
• Most Fermi facilities, including the GP Farms, run at 90% utilization or better.
• Give existing users (KTeV, SDSS, MINOS, MiniBooNE, Auger, Astro, Patriot, E871, Theory) the same share of priority whether they come in from the grid or not.
• Will make small VOs for all of them and use voms-proxy-init to identify them appropriately (see the sketch below).
• OSG jobs, Fermi-based or otherwise, will run at lowest priority with a low (1-5) quota of simultaneous jobs.
• At the moment OSG jobs are confined to a Condor pool of 14 slow nodes that weren't otherwise getting used at all.
• Have a split/dual configuration for as short a time as necessary.
• Get all of our VOs recognized by the OSG.
• This is a tall order to implement in Condor; our people go to training next week to get it done.
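To make the per-experiment VO idea concrete, the sketch below wraps voms-proxy-init so a user authenticates with their experiment's VO attribute before submitting. It assumes the VDT's VOMS client tools are on the PATH, and the default VO name "minos" is only an illustrative choice.

```python
# Sketch: obtain a VOMS proxy carrying the user's VO attribute, so the batch
# system can map the job onto that VO's priority/quota. Assumes
# voms-proxy-init / voms-proxy-info (from the VDT) are on the PATH; the
# default VO name is only an example.
import subprocess
import sys

def make_voms_proxy(vo: str, hours: int = 24) -> None:
    # -voms selects the VO (and optional role); -valid sets the lifetime as HH:MM.
    result = subprocess.run(["voms-proxy-init", "-voms", vo, "-valid", f"{hours}:00"])
    if result.returncode != 0:
        sys.exit(f"voms-proxy-init failed for VO {vo!r}")
    # Print the attributes and remaining lifetime of the proxy we just made.
    subprocess.run(["voms-proxy-info", "-all"])

if __name__ == "__main__":
    make_voms_proxy(sys.argv[1] if len(sys.argv) > 1 else "minos")  # example VO
```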

7. Support and Documentation
• http://grid.fnal.gov/fermigrid
• http://www-oss.fnal.gov/scs/public/farms/grid/
• http://www.ivdgl.org/osg-int/
• http://plone.opensciencegrid.org/
• http://www.opensciencegrid.org/
• Contact for support during the integration phase: Steven Timm, timm@fnal.gov, (630) 840-8525

8. GP Farms architecture (diagram; labels only)
FNGP-OSG gatekeeper (entry from the OSG via gsiftp/srmcp); FBS submit; FBSNG head nodes FNSFO and FNPCSRV1; ENCP to STKEN; NFS RAID; Condor submit; GP Farms FBSNG worker nodes (102 currently); Condor worker nodes (14 currently); dccp.

9. Components needed for a grid-aware reconstruction cluster
• Worker nodes
• Global file name space: NFS now, possibly Panasas or Ibrix later
• Gatekeeper node
• Grid-aware interface to mass storage (SRM/dCache); see the sketch below
• Concatenator node / SAM station(s)
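For the SRM/dCache item, a worker-side transfer would go through srmcp rather than a direct dccp/encp. A rough sketch follows; the SRM endpoint and pnfs path are placeholders, not the actual STKEN/dCache locations, and a valid grid proxy is assumed to exist already.

```python
# Sketch: stage an input file through the SRM interface to dCache using srmcp.
# The endpoint and path are placeholders, not real STKEN/dCache locations;
# a valid grid proxy is assumed to be in place already.
import subprocess

def srm_fetch(srm_url: str, local_path: str) -> None:
    # srmcp copies source URL -> destination URL; local files use file:/// URLs.
    subprocess.run(["srmcp", srm_url, "file://" + local_path], check=True)

if __name__ == "__main__":
    srm_fetch(
        "srm://srm.example.fnal.gov:8443/pnfs/example/input.root",  # placeholder URL
        "/local/scratch/input.root",
    )
```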

10. Global file system server
• Fermilab farms have traditionally used NFS servers.
• The namespace doesn't have to be large; 1-2 TB is OK.
• Some global namespace is likely to be needed in OSG indefinitely for preloading of applications, etc.
• CMS Tier1@Fermi has used IBRIX with good results, but at a large human cost in implementation.
• Fermi CD/CSS is testing Panasas: good for a few TB but probably not cost-effective for larger volumes.
• Best to have files served from a machine that is not doing the I/O-intensive concatenation and merging activities.
• Any NFS host capable of serving the Condor-CAF should be OK for OSG as well.

11. Gatekeeper node
• Memory and reliability are of the essence.
• For the Fermigrid gateway we have bought 3 x Dell PowerEdge 2850: dual 3.6 GHz CPUs, 4 GB DDR-2 SDRAM.
• For the GP Farms gatekeeper we will use a Dell PowerEdge SC1425, also dual 3.6 GHz.
• The gatekeeper can also serve as Condor master, or can easily be configured to point to a pre-configured Condor master (such as a Condor-CAF node).

12. Concatenator / SAM station
• Both D0 and CDF concatenate their output using SAM projects.
• Intermediate files are stored on RAID disk ("SAM durable storage") so you don't lose several days of production to a hardware failure.
• A prototype station should have 1-2 TB of disk and fast dual processors; a fast dual machine with RAID drives can run 3-4 merging processes.
• 4 machines with 1 TB each are better than 1 machine with 4 TB, due to network bandwidth in and out (see the arithmetic sketch below).
• Would prefer SCSI-based disk but will probably use IDE RAID for budget reasons.
• There is no production SAM station on the GP Farms right now, but one will be added when we start taking SAMGrid jobs or if any of our local users need it.
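The 4 x 1 TB versus 1 x 4 TB point is just link-bandwidth arithmetic. A back-of-the-envelope check follows, assuming roughly 100 MB/s of usable throughput per gigabit NIC; that figure is an assumption, not a measurement on these machines.

```python
# Back-of-the-envelope: hours needed to drain a durable-storage buffer over
# gigabit Ethernet. The ~100 MB/s usable-per-NIC figure is an assumption,
# not a measured number for these machines.
MB_PER_TB = 1_000_000
LINK_MB_PER_S = 100  # assumed usable throughput of one gigabit NIC

def drain_hours(terabytes: float) -> float:
    return terabytes * MB_PER_TB / LINK_MB_PER_S / 3600

print(f"1 machine, 4 TB, one NIC : {drain_hours(4):.1f} h to move it all")
print(f"4 machines, 1 TB each    : {drain_hours(1):.1f} h each, running in parallel")
```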

13. Current Architecture
• All home directories, for grid users and regular users alike, are served off FNSFO and accessible on the Condor and FBSNG nodes.
• $app and $data are served off FNSFO via NFS, for now (see the mount-check sketch below).
• All VDT-related software is served off fnpcg, available only to the Condor nodes.
• Grid jobs come in directly to fnpcg (soon to be renamed fngp-osg), which is both the gatekeeper and the Condor head node.
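A simple operational check in this layout is confirming that a worker node actually sees the shared areas before it takes grid jobs. A sketch follows; the mount points are illustrative stand-ins for the real home, $app, $data, and VDT export paths, not the actual locations.

```python
# Sketch: verify the NFS-shared areas described above are visible and writable
# on a worker node. The paths are placeholders for the real home, $app, $data,
# and VDT export points on FNSFO and fnpcg, not the actual mount points.
import os
import sys

SHARED_AREAS = {
    "home": "/home",           # home directories served off FNSFO
    "app":  "/grid/app",       # $app area (placeholder path)
    "data": "/grid/data",      # $data area (placeholder path)
    "vdt":  "/usr/local/vdt",  # VDT software served off the gatekeeper (placeholder)
}

def check_mounts() -> bool:
    ok = True
    for name, path in SHARED_AREAS.items():
        if not os.path.isdir(path):
            print(f"MISSING  {name}: {path}")
            ok = False
        else:
            mode = "rw" if os.access(path, os.W_OK) else "ro"
            print(f"ok ({mode}) {name}: {path}")
    return ok

if __name__ == "__main__":
    sys.exit(0 if check_mounts() else 1)
```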

14. Next Steps
• fnpcg to be replaced with a 2-CPU Dell server as our production gatekeeper (early March)
• Will keep fnpcg as a development gatekeeper
• Fermigrid GUMS and VOMS to be deployed (late Feb./early March); will transition as soon as they are ready (already tested)
• Will add the SAZ callout as soon as our users are loaded
• FNSFO to be replaced with a 4-way Dell as NFS server, estimated mid-April

15. Fermigrid Interface
• It is well understood how to interface to the common site services (SAZ, GUMS, VOMS).
• The eventual goal is to have one site gatekeeper for all jobs from "visitor VOs".
• Study is needed on the best way to forward jobs from the site globus-gatekeeper to ours.
• The baseline design is Condor-G (see the sketch below).
• Condor-C is new but looks promising (it has features for matching jobs to several Condor pools).
• Fallback position: just authorize the Fermigrid gateway as a submit node to our Condor pool.
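For the Condor-G baseline, forwarding a job from the Fermigrid gateway to the GP Farms gatekeeper is essentially a grid-universe submit aimed at our jobmanager. A sketch follows; the gatekeeper contact string is a placeholder, and the universe/grid_resource syntax shown is the Condor 6.7-era grid-universe form, so check the version actually shipped in the local VDT.

```python
# Sketch: build and submit a Condor-G job that a gateway could use to forward
# work to the GP Farms gatekeeper. The contact string is a placeholder, and
# the submit-file syntax is the grid-universe form from Condor 6.7-era docs.
import subprocess
import tempfile

GATEKEEPER = "fngp-osg.fnal.gov/jobmanager-condor"  # placeholder contact string

SUBMIT_FILE = f"""\
universe      = grid
grid_resource = gt2 {GATEKEEPER}
executable    = /bin/hostname
output        = forward.out
error         = forward.err
log           = forward.log
queue
"""

def submit() -> None:
    # Write the submit description to a temp file and hand it to condor_submit.
    with tempfile.NamedTemporaryFile("w", suffix=".sub", delete=False) as f:
        f.write(SUBMIT_FILE)
        name = f.name
    subprocess.run(["condor_submit", name], check=True)

if __name__ == "__main__":
    submit()
```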

16. Fermigrid Interface 2
• Interoperability with other Fermi jobs, especially SAMGrid:
• In my opinion the best way to do this eventually is to get the SAMGrid job manager and associated software included in the OSG VDT.
• We can replicate the work done adding SAMGrid to the CMS farm if necessary.
• We do not anticipate adding a standalone production SAM station on this farm, but will consult with our users to see if one is needed.

17. Batch Systems
• Three potential options, all known to work:
1. Globus job manager submits jobs to the existing FBSNG farm (see the smoke-test sketch below)
2. Use Condor glide-ins into the FBSNG farm
3. Partition the farm: existing FBSNG nodes only for legacy usage, all new nodes join the Condor pool, transition to all Condor in fairly short order
• We favor option (3).
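Option 1 can be smoke-tested from any machine holding a valid proxy with a plain Globus submission into the FBSNG jobmanager. In the sketch below, both the contact string and the jobmanager name ("jobmanager-fbsng") are assumptions about the site configuration, not confirmed values.

```python
# Sketch: smoke-test option 1 by running a trivial command through the Globus
# gatekeeper into FBSNG. Assumes a valid grid proxy, and assumes the site's
# FBSNG jobmanager is called "jobmanager-fbsng" (not confirmed here).
import subprocess

CONTACT = "fngp-osg.fnal.gov/jobmanager-fbsng"  # placeholder contact string

# globus-job-run blocks until the remote job completes and prints its stdout.
subprocess.run(["globus-job-run", CONTACT, "/bin/hostname"], check=True)
```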

18. Batch Systems 2
• Reasons to go to Condor quickly:
• Condor has native handling of GSI/X.509 authentication, which we believe is necessary to access grid-based interfaces to storage elements.
• Many grid users have Condor-specific elements in their production and can't easily run on non-Condor resources.
• Condor is well supported by non-Fermi staff.
• We don't want to maintain a dual-configuration farm for a long period of time.

19. Batch Systems 3
• Issues that have to be resolved before full Condor migration:
• Verify that grid authentication is necessary and sufficient to access mass storage.
• Replicate the very complicated user and scheduling allocations of the farms that are currently in FBSNG.
• User training to help users with the transition.
• VOs established for the 10 or so major user groups on the GP Farms, and propagation of same to the OSG.

20. Longer Range Plans
• More VDT/Condor software localized to the worker nodes (already done by CMS).
• A different global file system to replace NFS; Panasas is a likely candidate.
• Clone the gatekeeper / head node / SAM station structure to the CDF and D0 farms.

21. Summary
• The GP Farms are on the OSG and have a procedure to stay on it.
• Significant documentation development and user training are needed to transition users to grid usage.
• Working with the Fermigrid site services will be a minor change on any given gatekeeper.
• The OSG installation is well understood and automated, and can easily be replicated on other clusters.
