Grid Job and Information Management (JIM) for D0 and CDF
Gabriele Garzoglio for the JIM Team
Overview
• Introduction
• Grid-level Management
  • SAM-Grid = SAM + JIM
  • Job Management
  • Information Management
• Fabric-level Management
  • Running jobs on grid resources
  • Local sandbox management
  • The DZero Application Framework
• Running MC at UWisc
Context
• The D0 Grid project started in 2001-2002 to handle D0's expanded needs for globally distributed computing
• JIM complements the data handling system (SAM) with job and information management
• JIM is funded by PPDG (our team here) and GridPP (Rod Walker in the UK)
• Collaborative effort with the experiments
• CDF joined later, in 2002
History
• Delivered the JIM prototype for D0 on Oct 10, 2002:
  • Remote job submission
  • Brokering based on cached data
  • Web-based monitoring
• SC-2002 demo: 11 sites (D0, CDF), a big success
• May 2003: started deployment of V1
• Now: working on running MC in production on the Grid
Overview
• Introduction
• Grid-level Management
  • SAM-Grid = SAM + JIM
  • Job Management
  • Information Management
• Fabric-level Management
  • Running jobs on grid resources
  • Local sandbox management
  • The DZero Application Framework
• Running MC at UWisc
[Diagram: SAM-Grid Logistics. Shows the flow of data, meta-data, and jobs through the system: user interfaces and submission clients feed a global job queue; the resource selector (match making) consults the info gatherer/manager/collector and the global data handling services (SAM naming server, resource optimizer, replica and metadata catalogs, SAM DB server); each site runs a grid gateway, a SAM station (plus other services), local job handlers (CAF, D0MC, BS, ...), SAM stagers, cache, MSS, worker nodes, an XML DB server, a SAM log server, site configuration, bookkeeping, and grid monitoring info providers (MDS, JIM advertise).]
Job Management Highlights
• We distinguish grid-level (global) job scheduling (selecting a cluster to run on) from local scheduling (distributing the job within the cluster)
• We consider 3 types of jobs:
  • analysis: data intensive
  • monte carlo: CPU intensive
  • reconstruction: data and CPU intensive
Job Management: Distinct JIM Features
• Decision making is based on both:
  • information that exists irrespective of jobs (resource description)
  • functions of (job, resource) pairs
• Decision making is interfaced with the data handling middleware
• Decision making is done entirely within the Condor framework (no resource broker of our own): a strong promotion of standards and interoperability
• Brokering algorithms can be extended via plug-ins (see the sketch below)
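To make the brokering model concrete, here is a minimal Python sketch of ranking that combines resource-only information with functions of (job, resource) pairs, with per-job-type rank functions acting as plug-ins. It is an illustration only: the real system does this inside the Condor match making framework with ClassAds, and all names here (Cluster, Job, RANK_PLUGINS, select_cluster) and the weights are hypothetical.

```python
# Illustrative sketch only: JIM's actual brokering lives in the Condor
# match making framework (ClassAds); these names are hypothetical.
from dataclasses import dataclass, field

@dataclass
class Cluster:
    name: str
    free_cpus: int                      # resource description, job-independent
    cached_files: set = field(default_factory=set)

@dataclass
class Job:
    job_type: str                       # "analysis", "monte carlo", "reconstruction"
    input_files: set = field(default_factory=set)

def data_affinity(job, cluster):
    """A function of (job, resource): fraction of input data already cached."""
    if not job.input_files:
        return 0.0
    return len(job.input_files & cluster.cached_files) / len(job.input_files)

# Plug-in rank functions, one per job type (weights are made up):
# analysis favors cached data, monte carlo favors free CPUs.
RANK_PLUGINS = {
    "analysis":       lambda j, c: 10.0 * data_affinity(j, c) + 0.01 * c.free_cpus,
    "monte carlo":    lambda j, c: float(c.free_cpus),
    "reconstruction": lambda j, c: 5.0 * data_affinity(j, c) + 0.1 * c.free_cpus,
}

def select_cluster(job, clusters):
    """Grid-level (global) scheduling: pick the best cluster for the job."""
    rank = RANK_PLUGINS[job.job_type]
    return max(clusters, key=lambda c: rank(job, c))
```

Registering a new entry in the plug-in table is the analogue of extending the brokering algorithm without touching the matchmaker itself.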
[Diagram: Job Management. User interfaces and submission clients hand the job to the match making service (the broker), which consults the information collector and the data handling system; a queuing system dispatches the job to execution sites #1..#n, each providing a computing element, storage elements, grid sensors, and its own access to the data handling system.]
Information Management
• In JIM's view, this includes:
  • the configuration framework
  • resource description for job brokering
  • the infrastructure for monitoring
• Main features:
  • monitoring of sites (resources) and jobs
  • distributed knowledge about jobs, etc.
  • incremental knowledge building
  • GMA for current-state inquiries, logging for recent-history studies (see the sketch below)
  • all Web based
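As a rough illustration of the last two points (incremental knowledge building, plus GMA-style current-state inquiries versus log-based history), here is a small Python sketch; the class and method names are invented for this example and do not reflect JIM's actual interfaces.

```python
# Hypothetical collector illustrating the two complementary views:
# a "what is the state now?" query and a "what happened recently?" query.
import time
from collections import defaultdict

class InfoCollector:
    """Collects reports from grid sensors; knowledge builds incrementally."""
    def __init__(self):
        self.current = {}                 # latest state per (site, job)
        self.history = defaultdict(list)  # append-only log per (site, job)

    def report(self, site, job_id, state):
        record = (time.time(), state)
        self.current[(site, job_id)] = record
        self.history[(site, job_id)].append(record)

    def current_state(self, site, job_id):
        """GMA-style inquiry: the current state of a job at a site."""
        return self.current.get((site, job_id))

    def recent_history(self, site, job_id, since):
        """Logging-style inquiry: records newer than the given timestamp."""
        return [r for r in self.history[(site, job_id)] if r[0] >= since]
```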
[Diagram: Information Management via Site Configuration. A main site/cluster configuration, kept as XML in an XML DB and built from templates, is transformed by a set of XSLT stylesheets into the formats each consumer needs: resource advertisement (ClassAd), monitoring configuration (LDIF), and service instantiation (XML). A sketch of this pipeline follows.]
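A minimal sketch of the pipeline in the diagram, written with the lxml library as a modern stand-in for whatever XSLT processor the real system uses; the stylesheet and output file names are hypothetical.

```python
# Sketch: one site/cluster XML configuration, several XSLT stylesheets,
# one output per consumer. File names are hypothetical.
from lxml import etree

site_config = etree.parse("site_config.xml")     # main site/cluster config

for stylesheet, output in [
    ("to_classad.xsl", "advertise.classad"),     # resource advertisement
    ("to_ldif.xsl", "monitoring.ldif"),          # monitoring configuration
    ("to_service.xsl", "service_config.xml"),    # service instantiation
]:
    transform = etree.XSLT(etree.parse(stylesheet))
    result = transform(site_config)
    with open(output, "wb") as f:
        f.write(str(result).encode())
```

The design point is that the site description is authored once and every derived format is regenerated mechanically, so the configurations cannot drift apart.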
Overview
• Introduction
• Grid-level Management
  • SAM-Grid = SAM + JIM
  • Job Management
  • Information Management
• Fabric-level Management
  • Running jobs on grid resources
  • Local sandbox management
  • The DZero Application Framework
• Running MC at UWisc
Running jobs on Grid resources
• The trend: Grid resources are not dedicated to a single experiment
• Translation:
  • no daemons running on the worker nodes of a batch system
  • no experiment-specific software installed
Running jobs on Grid resources
• The situation today is in transition:
  • Worker nodes typically access the software via a shared file system: not scalable!
  • Generally, experiments can install specific services on a node close to the cluster
  • Local resource configurations are still too diverse to plug easily into the Grid
The JIM local sandbox management
• It keeps the job executable (from the Grid) at the head node and knows where its product dependencies are
• It transports and installs the software to the worker node
• It can instantiate services at the worker node
• It sets up the environment for the job to run
• It packages the output and hands it over to the Grid, so that it becomes available for download at the submission site (a sketch of this life cycle follows)
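A minimal Python sketch of that life cycle, assuming products ship as tarballs and the job locates its product area through an environment variable; all paths, helper names, and the PRODUCTS_DIR variable are hypothetical, not JIM's actual interface.

```python
# Hypothetical sandbox life cycle: stage, install, configure, run, package.
import os
import shutil
import subprocess
import tarfile
import tempfile

def run_in_sandbox(executable, product_tarballs, output_archive):
    workdir = tempfile.mkdtemp(prefix="jim_sandbox_")

    # 1. Transport and install the software on the worker node.
    shutil.copy(executable, workdir)
    products_dir = os.path.join(workdir, "products")
    for tarball in product_tarballs:          # product dependencies
        with tarfile.open(tarball) as t:
            t.extractall(products_dir)

    # 2. Set up the environment for the job to run.
    env = dict(os.environ)
    env["PRODUCTS_DIR"] = products_dir        # hypothetical convention

    # 3. Run the job inside the sandbox.
    subprocess.run([os.path.join(workdir, os.path.basename(executable))],
                   cwd=workdir, env=env, check=True)

    # 4. Package the output and hand it over to the Grid, so it becomes
    #    available for download at the submission site.
    with tarfile.open(output_archive, "w:gz") as out:
        out.add(workdir, arcname="job_output")
    shutil.rmtree(workdir)
```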
Running a DZero application
• We have the JIM sandbox: where is the problem now?
• The JIM sandbox could immediately use the DZero Run Time Environment (RTE), but:
  • not all DZero packages are RTE compliant
  • users have little experience with, or incentive for, using it today
Overview
• Introduction
• Grid-level Management
  • SAM-Grid = SAM + JIM
  • Job Management
  • Information Management
• Fabric-level Management
  • Running jobs on grid resources
  • Local sandbox management
  • The DZero Application Framework
• Running MC at UWisc
Running Monte Carlo at UWisc
• The University of Wisconsin has offered DZero the use of a 1000-node, non-dedicated Condor cluster
• We are concentrating on putting it to use to run MC with mc_runjob (in production by year end)
The challenges I
• The MC code is not RTE compliant today
• The chain has 3-5 stages; each binary is 50-200 MB and dynamically linked
• The binaries are compiled from ~40 packages (out of 621 total for D0); these packages are also needed at run time for RCP files
• Root, Motif, X11, and ACE libraries appear as dependencies (for the MC generators, ...)
• MC tarballs exist but are hand-crafted (and bug-prone) every time; the unpacked size is 2 GB (versus the 12-15 GB full D0 application tree). A sketch of automating this follows.
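The hand-crafted tarballs are a natural target for automation. A minimal sketch, assuming the release tree is laid out as one directory per package; the function name and the package names in the usage comment are only illustrative.

```python
# Hypothetical automation of the hand-crafted MC tarballs: bundle only the
# packages the MC chain needs (~2 GB) rather than the full 12-15 GB tree.
import os
import tarfile

def build_mc_tarball(release_root, packages, out_path):
    """Bundle the listed packages (binaries, libraries, RCP files) from a
    release tree into a single tarball for shipping to worker nodes."""
    with tarfile.open(out_path, "w:gz") as tar:
        for pkg in packages:
            pkg_dir = os.path.join(release_root, pkg)
            if not os.path.isdir(pkg_dir):
                raise FileNotFoundError(f"package not in release tree: {pkg}")
            tar.add(pkg_dir, arcname=pkg)

# Illustrative usage (paths and package names are made up):
# build_mc_tarball("/d0/releases/p14", ["d0gstar", "d0sim", "mc_runjob"],
#                  "mc_runtime.tar.gz")
```

Building from a declared dependency list, instead of by hand, removes the main source of bugs: packages silently left out of the tarball.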
The challenges II
• Nearly every advanced C++ feature, every libc library call, and every system call is used
• One can get different results on two Red Hat 7.2 systems
• The total release tree takes many hours (up to 20+) to build: not something easy to do dynamically at a remote site
Summary
• The SAM-Grid offers an extensible, working framework for Grid-level job/data/information management
• JIM provides fabric-level management tools for sandboxing
• The applications need to be improved to run on Grid resources