Yet Another Grid Project: The Open Science Grid at SLAC • Matteo Melani, Booker Bense and Wei Yang, SLAC • HEPiX Conference, 10/13/05, SLAC, Menlo Park, CA, USA
July 22nd, 2005 • “The Open Science Grid Consortium today officially inaugurated the Open Science Grid, a national grid computing infrastructure for large scale science. The OSG is built and operated by teams from U.S. universities and national laboratories, and is open to small and large research groups nationwide from many different scientific disciplines.” (Science Grid This Week)
Outline • OSG in a nutshell • OSG at SLAC: “PROD_SLAC” site • Authentication and Authorization in OSG • LSF-OSG integration • Running applications: US CMS and US ATLAS • Final thought
Once upon a time there was… • Goal: build a shared Grid infrastructure supporting opportunistic use of resources for the stakeholders: the NSF- and DOE-sponsored Grid projects (PPDG, GriPhyN, iVDGL) and the US LHC software program. • A team of computer and domain scientists deployed (simple) services with a common infrastructure and interfaces across existing computing facilities. • Operated stably for over a year in support of computationally intensive applications (e.g. ATLAS DC2, CMS DC04). • Added communities without perturbation. • 30 sites, ~3600 CPUs
Vision (1) The Open Science Grid: A production quality national grid infrastructure for large scale science. • Robust and scalable • Fully managed • Interoperates with other Grids
What is the Open Science Grid? (Ian Foster) • Open • A new sort of multidisciplinary cyberinfrastructure community • An experiment in governance, incentives, architecture • Part of a larger whole, with TeraGrid, EGEE, LCG, etc. • Science • Driven by demanding scientific goals and projects who need results today (or yesterday) • Also a computer science experimental platform • Grid • Standardized protocols and interfaces • Software implementing infrastructure, services, applications • Physical infrastructure—computing, storage, networks • People who know & understand these things!
OSG Consortium Members of the OSG Consortium are those organizations that have made agreements to contribute to the Consortium. • DOE Labs: SLAC, BNL, FNAL • Universities: CCR, University at Buffalo • Grid Projects: iVDGL, PPDG, Grid3, GriPhyN • Experiments: LIGO, US CMS, US ATLAS, CDF Computing, D0 Computing, STAR, SDSS • Middleware Projects: Condor, Globus, SRM Collaboration, VDT Partners are those organizations with whom we are interfacing to work on interoperation of grid infrastructures and services. • LCG, EGEE, TeraGrid
Character of Open Science Grid (1) • Pragmatic approach: • Experiments/users drive requirements • “Keep it simple and make more reliable” • Guaranteed and opportunistic use of resources provided through Facility-VO contracts. • Validated, supported core services based on VDT and NMI middleware (currently GT3 based, moving soon to GT4). • Adiabatic evolution to increase scale and complexity. • Services and applications contributed from external projects; low threshold to contributions and new services.
Character of Open Science Grid (2) • Heterogeneous Infrastructure • All Linux but different versions of the Software Stack at different sites. • Site autonomy: • Distributed ownership of resources with diverse local policies, priorities, and capabilities. • “no” Grid software on compute nodes. • But users want direct access for diagnosis and monitoring: • Quote from physicist on CDF: “Experiments need to keep under control the progress of their application to take proper actions, helping the Grid to work by having it expose much of its status to the users”
Services • Computing Service: GRAM from GT3.2.1 + patches • Storage Service: SRM interface (v1.1) as the common interface to storage (DRM and dCache); most sites use NFS + GridFTP; we are looking into an SRM-xrootd solution • File Transfer Service: GridFTP (see the sketch below) • VO Management Service: INFN VOMS • AA: GUMS v1.0.1, PRIMA v0.3, gPlazma • Monitoring Service: MonALISA v1.2.34, MDS • Information Service: jClarens v0.5.3-2, GridCat • Accounting Service: partially provided by MonALISA
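To make the file-transfer service concrete, here is a minimal sketch of staging a file to a site storage element over GridFTP with globus-url-copy. It assumes a valid grid proxy already exists (e.g. from voms-proxy-init); the host name and paths are hypothetical placeholders, not the actual SLAC endpoints.

```python
# Minimal sketch of staging a file to a site storage element over GridFTP.
# Assumes a valid grid proxy already exists (e.g. via voms-proxy-init) and that
# globus-url-copy from the VDT/Globus Toolkit is on PATH. The host and paths
# below are hypothetical placeholders, not the actual PROD_SLAC endpoints.
import subprocess

def gridftp_copy(src_url, dst_url):
    """Copy a single file between local (file://) and GridFTP (gsiftp://) URLs."""
    subprocess.run(["globus-url-copy", src_url, dst_url], check=True)

if __name__ == "__main__":
    gridftp_copy(
        "file:///tmp/input.tar.gz",                                 # local source
        "gsiftp://osg-se.example.edu/osg/data/uscms/input.tar.gz",  # remote $DATA area (hypothetical)
    )
```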
Open Science Grid Release 0.2 • [Architecture diagram, courtesy of Ruth Pordes] • User side: user portal; submit host (Condor-G, Globus RSL); catalogs & displays (GridCat, ACDC, MonALISA); Virtual Organization management • Site side (across the WAN->LAN boundary): Compute Element with GT2 GRAM and Grid monitor; Storage Element with SRM v1.1 and GridFTP; worker nodes with $WN_TMP; common space across WNs: $DATA (local SE), $APP, $TMP • AuthN/AuthZ: identity and roles via X.509 certificates; authentication mapping with GUMS; PRIMA / gPlazma at the CE; batch-queue job priority • Monitoring & information: GridCat, ACDC, MonALISA, SiteVerify
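As an illustration of the submit-host side of this picture (Condor-G handing a job to the site's GT2 GRAM gatekeeper, which forwards it to the local batch system), here is a minimal, hypothetical sketch; the gatekeeper host name and jobmanager are placeholders, and a valid grid proxy is assumed.

```python
# Minimal sketch of submitting a job from a Condor-G submit host to a GT2 GRAM
# gatekeeper, as in the architecture above. The gatekeeper host and jobmanager
# name are hypothetical placeholders; a valid grid proxy is assumed to exist.
import subprocess
import textwrap

SUBMIT_FILE = textwrap.dedent("""\
    universe        = globus
    globusscheduler = osg-gate.example.edu/jobmanager-lsf
    executable      = /bin/hostname
    output          = job.out
    error           = job.err
    log             = job.log
    queue
    """)

with open("osg_job.sub", "w") as f:
    f.write(SUBMIT_FILE)

# Condor-G forwards the job to the site's GRAM gatekeeper, which submits it to
# the local batch system (LSF in the case of PROD_SLAC).
subprocess.run(["condor_submit", "osg_job.sub"], check=True)
```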
OSG 0.4 • [Architecture diagram, courtesy of Ruth Pordes; same overall layout as Release 0.2 with the additions below] • GT4 GRAM on the Compute Element alongside GT2 GRAM • Edge Service Framework (Xen): lifetime-managed VO services at the site boundary • Service discovery via GIP + BDII; accounting; full local SE • Job monitoring and exit-code reporting; bandwidth management at some sites • Unchanged from 0.2: submit host (Condor-G, Globus RSL), catalogs & displays (GridCat, ACDC, MonALISA), X.509 identity and roles, VO management, GUMS mapping, PRIMA / gPlazma, batch-queue job priority, SRM v1.1 / GridFTP Storage Element, common space across WNs ($DATA on the local SE, $APP, $TMP), $WN_TMP
Software distribution • Software is contributed by individual OSG members into collections we call “packages”. • OSG provides collections of software for common services, built on top of the VDT, to facilitate participation. • There is very little OSG-specific software, and we strive to use standards-based interfaces where possible. • OSG software packages are currently distributed as Pacman caches. • The latest release, on May 24th, is based on VDT 1.3.6
OSG’s deployed Grids The OSG Consortium operates two grids: • OSG, the production grid: • Stable; for sustained production • 14 VOs • 38 sites, ~5,000 CPUs • Support provided • http://osg-cat.grid.iu.edu/ • OSG-ITB, the integration test and development grid: • For testing new services, technologies, versions… • 29 sites, ~2,400 CPUs • http://osg-itb.ivdgl.org/gridcat/
Operations and support • VOs are responsible for 1st-level support • Distributed operations and support model from the outset: difficult to explain, but scalable, and it puts most support “locally”. • A key core component is a central ticketing system with automated routing and import/export capabilities to other ticketing systems and text-based information. • Grid Operations Center (iGOC) • Incident response framework, coordinated with EGEE.
Outline • OSG in a nutshell • OSG at SLAC: “PROD_SLAC” site • Authentication and Authorization in OSG • LSF-OSG integration • Running applications: US CMS and US ATLAS • Final thought
PROD_SLAC • 100 job slots available in TRUE resource sharing • 0.5 TB of disk space • osg-support@slac.stanford.edu • LSF 5.1 batch system • VO role-based authentication and authorization • VOs: BaBar, US ATLAS, US CMS, LIGO, iVDGL
PROD_SLAC • 4 Sun V20z dual-processor machines • Storage is provided via NFS: 3 directories, $APP, $DATA and $TMP • We do not run Ganglia or GRIS
Outline • OSG in a nutshell • OSG at SLAC: “PROD_SLAC” site • Authentication and Authorization in OSG • LSF-OSG integration • Running applications: US CMS and US ATLAS • Conclusions
UNIX account issue The Problem: • SLAC Unix accounts did not fit the OSG model: • Normal SLAC accounts have too many default privileges • Gatekeeper-AFS interaction is problematic The Solution: • Created a new class of Unix accounts just for the Grid • Created a new process for this new type of account • The new account type has minimal privileges: • no email, no login access • home directories on Grid-dedicated NFS, no write access beyond the Grid NFS server
DN-UID mapping • Each (DN, voGroup) pair is mapped to a unique UNIX account • No group mapping • Account name schema: osg + VOname + VOgroup + NNNNN (sketched below) Example: a DN in the USCMS VO (voGroup /uscms/) => osguscms00001; iVDGL VO, group mis (voGroup /ivdgl/mis) => osgivdglmis00001 • If revoked, the account name/UID is never reused (unlike ordinary UNIX accounts) • Grid UNIX accounts are tracked like ordinary UNIX user accounts (in RES); 1,000,000 < UID < 10,000,000 • All Grid UNIX accounts belong to a single UNIX group • Home directories are on Grid-dedicated NFS; shells are /bin/false
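The account-name schema above is simple enough to sketch. The following is an illustration only, not the actual GUMS/SLAC mapping code: it turns a (DN, voGroup) pair into an osg + VOname + VOgroup + NNNNN account name with a never-reused serial, and the dictionaries stand in for the real RES registry.

```python
# Minimal sketch of the account-name schema described above: osg + VOname + VOgroup + NNNNN.
# Illustration only, not the GUMS/SLAC mapping code; the dictionaries stand in for the real
# account registry (RES), which guarantees that names/UIDs are never reused once assigned.
_next_serial = {}   # base name -> next serial number to hand out
_assigned = {}      # (DN, voGroup) -> account name already assigned

def map_account(dn, vo_group):
    """Return the Grid Unix account name for a (DN, voGroup) pair, allocating one if needed."""
    key = (dn, vo_group)
    if key in _assigned:
        return _assigned[key]
    # "/uscms/" -> base "osguscms", "/ivdgl/mis" -> base "osgivdglmis"
    base = "osg" + "".join(part for part in vo_group.split("/") if part)
    serial = _next_serial.get(base, 1)      # serials are never reused, even if an account is revoked
    _next_serial[base] = serial + 1
    _assigned[key] = "%s%05d" % (base, serial)
    return _assigned[key]

print(map_account("/DC=org/DC=doegrids/OU=People/CN=Some User", "/uscms/"))      # osguscms00001
print(map_account("/DC=org/DC=doegrids/OU=People/CN=Other User", "/ivdgl/mis"))  # osgivdglmis00001
```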
Outline • OSG in a nutshell • OSG at SLAC: “PROD_SLAC” site • Authentication and Authorization in OSG • OSG-LSF integration • Running applications: US CMS and US ATLAS • Final thought
GRAM Issue The Problem: • The gatekeeper polls job status over-aggressively, which overloads the LSF scheduler • Race conditions: the LSF job manager is unable to distinguish between an error condition and a loaded system (we usually have more than 2K jobs running) • May be reduced in the next version of LSF The Solution: • Re-write part of the LSF job manager: lsf.pm • Looking into writing a custom bjobs with local caching (see the sketch below)
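The “custom bjobs with local caching” idea can be sketched as follows. This is only an illustration of the caching approach, not SLAC's actual tool (the real LSF job manager, lsf.pm, is Perl); the cache location and refresh interval are arbitrary choices, not production values.

```python
#!/usr/bin/env python
# Minimal sketch of a caching wrapper around LSF's bjobs, along the lines of the
# "custom bjobs with local caching" idea above. Illustration only; cache path and
# TTL are arbitrary choices.
import os
import sys
import time
import subprocess

CACHE_FILE = "/tmp/bjobs.cache"   # shared cache of one real bjobs query
CACHE_TTL = 60                    # seconds between real queries to the scheduler

def cached_bjobs():
    """Return bjobs output, hitting the real scheduler at most once per CACHE_TTL."""
    try:
        if time.time() - os.path.getmtime(CACHE_FILE) < CACHE_TTL:
            with open(CACHE_FILE) as f:
                return f.read()
    except OSError:
        pass  # no cache yet
    # Refresh: one real query serves every poller for the next CACHE_TTL seconds,
    # instead of each monitor/jobmanager running its own "bjobs -u all".
    output = subprocess.run(["bjobs", "-u", "all"],
                            capture_output=True, text=True, check=True).stdout
    tmp = CACHE_FILE + ".tmp"
    with open(tmp, "w") as f:
        f.write(output)
    os.replace(tmp, CACHE_FILE)   # atomic update so concurrent readers never see a partial file
    return output

if __name__ == "__main__":
    sys.stdout.write(cached_bjobs())
```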
The straw that broke the camel’s back • SLAC has more than 4,000 job slots being scheduled by a single machine • We operate in full production mode: operational disruption has to be avoided at all costs • Too many monitoring tools (ACDC, MonALISA, users’ monitoring tools…) can easily overload the LSF scheduler by running bjobs -u all • The implementation of monitoring is a concern!
Outline • OSG in a nutshell • OSG at SLAC: “PROD_SLAC” site • Authentication and Authorization in OSG • LSF-OSG integration • Running applications: US CMS and US ATLAS • Final thought
US CMS Application • Intentionally left blank! • We could run 10-100 jobs right away
US ATLAS Application • ATLAS reconstruction and analysis jobs require access to remote database servers at CERN, BNL, and elsewhere • SLAC batch nodes don't have internet access • The solution is to use a clone of the database within the SLAC network or to create a tunnel (sketched below)
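A minimal sketch of the “local clone” option, assuming jobs can be pointed at a replica host by configuration; every host name and the ON_SLAC_BATCH switch below are hypothetical placeholders, not the actual ATLAS or SLAC setup.

```python
# Minimal sketch of the "local clone" workaround: jobs running on SLAC batch nodes
# (which have no outbound internet access) are pointed at a database replica inside
# the SLAC network instead of the remote server. All host names, and the
# ON_SLAC_BATCH environment variable used as the switch, are hypothetical.
import os

REMOTE_DB_HOST = "conditions-db.example.cern.ch"                # unreachable from SLAC worker nodes
LOCAL_CLONE_HOST = "atlas-db-clone.example.slac.stanford.edu"   # replica inside the SLAC network

def conditions_db_host():
    """Return the database host an ATLAS job should use at this site."""
    if os.environ.get("ON_SLAC_BATCH") == "1":
        return LOCAL_CLONE_HOST
    return REMOTE_DB_HOST

print(conditions_db_host())
```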
Outline • OSG in a nutshell • OSG at SLAC: “PROD_SLAC” site • Authentication and Authorization in OSG • LSF-OSG integration • Running applications: US CMS and US ATLAS • Final thought
Final thought “PARVA SED APTA MIHI SED…” (“Small, but suited to me, but…”) - Ludovico Ariosto
Ticketing Routing Example • [Physical-view diagram: user, Support Centers (SCs) and Resource Providers (RPs), with steps 1-10 spanning the OSG infrastructure and the SCs' private infrastructure] • A user in VO1 notices a problem at RP3 and notifies their SC (1). SC-C opens a ticket (2) and assigns it to SC-F. SC-F gets automatic notice (3) and contacts RP3 (4). The admin at RP3 fixes the problem and replies to SC-F (5). SC-F notes the resolution in the ticket and marks it resolved (6). SC-C gets automatic notice of the update to the ticket (7). SC-C notifies the user of the resolution (8). The user can complain if dissatisfied, and SC-C can re-open the ticket (9, 10).