This report provides an overview of the LBNE/Daya Bay experiments and their motivations for utilizing the PanDA system. It also discusses the current status of the local Daya cluster managed by Panda and the expansion to other sites.
LBNE/Daya Bay utilization of Panda: project review and status report PAS Group Meeting November 12, 2010 Maxim Potekhin for BNL Physics Applications Software Group Brookhaven National Laboratory potekhin@bnl.gov
Overview • Intro: Daya Bay and LBNE • Motivations for PAS to support LBNE/Daya • LBNE/Daya: Pre-PanDA mode of operation • Pandification • Current status of local Daya cluster managed by Panda • Expansion to other sites • Conclusion
Intro Both Daya Bay and LBNE are experiments studying neutrino oscillations, in different energy domains: • Daya Bay is a short-baseline experiment in China, currently entering the period of data taking after years of construction. It utilizes two nuclear reactors as a source of neutrinos. • LBNE is a proposed complex Long-Baseline Neutrino Experiment that will utilize FNAL neutrino beams, with some of the detectors placed at the DUSEL deep underground facility in South Dakota, the deepest of its kind. These include a 300 kiloton water Cherenkov detector. Daya Bay personnel at BNL and a few other labs are also involved in the development of LBNE. Elements of software infrastructure are being inherited from Daya Bay and IceCube and will be utilized in LBNE. Maintenance, configuration and utilization of simulation software is one of the primary responsibilities of the BNL group. The focus is currently on Monte Carlo simulation of large-scale Cherenkov detectors.
Motivations PAS has a broad mandate to support research conducted by the Physics Department. The standing of the group depends on how successful we are in doing that. In addition, BNL (primarily through PAS) is a stakeholder in Open Science Grid (OSG) and is benefiting from this collaboration. OSG aims to promote science in a wide range of disciplines by providing researchers with access to high-throughput computing via an open platform. PAS is the de facto owner of the PanDA product, and must leverage it to accomplish the above goals. The LBNE/Daya Bay experiments are high-profile projects that are an ideal area of application for PanDA.
Pre-PanDA mode of operation The BNL Daya Bay/LBNE group is using proven software components, including: • Geant4 • Gaudi • SPADE (data movement mechanism developed by the IceCube collaboration, which includes automatic sinking of data into HPSS) The main platform for the configuration and steering component of the simulation software is a collection of Python modules. The framework thus developed is called NuWa. Prior to PAS involvement, there was no Workload Management System or monitoring facility with which to direct and coordinate workflow, either locally at BNL or across sites (PDSF, IIT, etc.). Job submission methods depended on the local interfaces to batch systems at each site, e.g. qsub would be used at PDSF and condor_submit at BNL.
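The site-dependent submission described on this slide can be pictured with a small sketch. Only the two site-to-command pairs (qsub at PDSF, condor_submit at BNL) come from the slide; the function name, the lookup table and the file names are hypothetical, and the commands are only assembled here, never executed.

```python
# Hypothetical sketch: choose the batch submission command by site.
# The site-to-command pairs follow the slide; all other names are
# illustrative assumptions, not part of NuWa or PanDA.

SUBMIT_COMMANDS = {
    "PDSF": "qsub",
    "BNL": "condor_submit",
}


def build_submit_command(site, job_file):
    """Return the argument list that would submit job_file at the given site."""
    try:
        executable = SUBMIT_COMMANDS[site]
    except KeyError:
        raise ValueError("no submission method known for site %r" % site)
    return [executable, job_file]


if __name__ == "__main__":
    # File names below are placeholders.
    for site, job_file in [("PDSF", "sim_job.sh"), ("BNL", "sim_job.sub")]:
        print(site, "->", " ".join(build_submit_command(site, job_file)))
```

Every additional site adds another branch of this kind on the user side, which is the bookkeeping a single workload management system is meant to absorb.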
Pandification The LBNE/Daya group operates a cluster at BNL of approximately 16 cores, which are included in a local Condor pool. This serves as a small-scale production and validation platform, and there is an intention to expand to facilities outside BNL. Requirements with regard to resources, while not well defined, have been qualitatively revised upward since the beginning of our collaboration, both in terms of the number of sites and the volume of simulation. After a series of meetings in 2009, the BNL Daya Bay/LBNE group realized the potential of using PanDA to manage the workflow of their simulation activity across the facilities. The advantages are: • One point of entry to monitor workflow across nodes and sites • Single point of access to log files and other diagnostics • Easy and reliable versioning of task definitions via hosting of the PanDA transformation at a single point
Pandification, cont’d (1) Some hurdles to overcome were: • No machine committed to the role of Daya/LBNE gatekeeper at BNL, making the standard Condor-G pilot submission from one of our established and managed hosts impossible • Segregation of disk mounts between the RCF and ATLAS parts of the RACF facility, combined with paths being hardcoded in a few places in the configuration software • Paths hardcoded in certain Panda setup scripts • No turn-key data movement mechanism in Panda (outside of ATLAS) and lack of a ready staging area for LBNE data at BNL, with a functioning data sink to NERSC (the final storage point) still being established • Introduction of a validation step into the workflow before the job is declared a success • On the wish-list: near-time access to log files – practical on Condor-C but impractical on Condor-G; odd behavior of some jobs needs to be debugged
Pandification, cont’d (2) Solutions: • Set up Condor-C pilot submission on one of the Daya machines • Set up a Web server for serving log files: • Can’t properly install Apache on Daya cluster machines due to a combination of technical and political reasons • Used a light-weight “nullhttp” server, then utilized an “official” OSG development server with Apache on it, which has BNL-approved conduits and proper disk mounts – the current solution • Data movement mechanism managed in the job wrapper, with data deposited into xrootd at BNL and the final movement and decision made at a later stage Additional issues: • Need to look at evicted jobs that can’t be restarted due to increased image size – less of an issue lately, but we should look out for it, as it can clog the pilot queue
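The validation step listed among the hurdles and the wrapper-managed data movement listed among the solutions can be sketched as follows. This is a minimal illustration under stated assumptions, not the actual Daya Bay job wrapper: the validation criterion (output file exists and is non-empty), the function names, and the use of a local directory copy as a stand-in for the deposit into xrootd are all hypothetical.

```python
import os
import shutil
import subprocess
import sys


def validate_output(path):
    """Hypothetical check: the output file exists and is not empty."""
    return os.path.isfile(path) and os.path.getsize(path) > 0


def run_job(command, output_path, staging_dir):
    """Run the payload, validate its output, then stage the output out.

    Returns True only if all three steps succeed, so a job is never
    declared a success before its output has passed validation.
    """
    result = subprocess.run(command)
    if result.returncode != 0:
        return False
    if not validate_output(output_path):
        return False
    os.makedirs(staging_dir, exist_ok=True)
    # Stand-in for the real transfer: a plain copy into a local directory.
    shutil.copy(output_path, staging_dir)
    return True


if __name__ == "__main__":
    # Toy payload that writes a small file; a real payload would be the
    # simulation job itself.
    payload = [sys.executable, "-c",
               "import sys; open(sys.argv[1], 'w').write('events')",
               "payload_output.dat"]
    print("job succeeded:", run_job(payload, "payload_output.dat", "staging"))
```

Keeping validation ahead of the transfer means nothing unvalidated reaches the staging area, which matches the order of steps described in the slides.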
Current status of Daya cluster • Pilot submission works • Web server for serving log files – works • Size of log files produced by Daya jobs, previously large, was reduced on our recommendation • Pilot code was modified by Jose to purge files already staged out, as well as leftover Python code in the working directory, to conserve disk space • Current “rolling” disk usage is about 1 GB, which our Daya collaborators consider acceptable • We stand ready to commence production for Daya managed by Panda on the BNL cluster, pending their finalizing of the job configuration, as they reported in the meeting on 11/10/2010
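The purge of already staged-out files mentioned on this slide can be illustrated with a short sketch. It is not the modified pilot code: the function name, its arguments and the idea of passing an explicit list of staged file names are assumptions made for illustration.

```python
import os


def purge_staged_files(work_dir, staged_names):
    """Delete files in work_dir whose names appear in staged_names.

    Returns the number of bytes freed. Names that do not correspond to an
    existing regular file are skipped silently.
    """
    freed = 0
    for name in staged_names:
        path = os.path.join(work_dir, name)
        if os.path.isfile(path):
            freed += os.path.getsize(path)
            os.remove(path)
    return freed


if __name__ == "__main__":
    import tempfile

    # Demonstration in a throwaway directory with placeholder file names.
    work_dir = tempfile.mkdtemp()
    for name, content in [("run1.dat", b"x" * 1024), ("keep.log", b"log")]:
        with open(os.path.join(work_dir, name), "wb") as handle:
            handle.write(content)
    print("bytes freed:", purge_staged_files(work_dir, ["run1.dat"]))
    print("remaining:", sorted(os.listdir(work_dir)))
```

Deleting only files that are explicitly recorded as staged out avoids removing output that has not yet been transferred.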
Expansion to other sites The Daya/LBNE collaboration has done preliminary setup at NERSC/PDSF. Points of interest: • NERSC HPSS will be the final point of storage of all data produced everywhere (including BNL) • Correct versions of ROOT, Python and other important components are provided by means of a “private” LBNE install, without reliance on facility-wide installs • The SPADE data mover is used to scoop up data from “dropboxes” and marshal it into HPSS (good!) • Management of groups and accounts at NERSC is different from what is now the “mainstream” OSG auth/auth mechanism – Virtual Organizations aren’t supported; instead, users are assigned to Unix groups and provide their DN on NERSC Information System (NIM) as a way to authenticate with their plain Grid certificate (no VOMS/GUMS etc.)
Expansion to other sites, cont’d PDSF: • Access issues (due to this non-standard setup) have been resolved in consultation with NERSC, and Condor-G submission has been tested – ready to commence pilot submission • SPADE will be set up shortly In a few weeks, the Daya/LBNE group expects a new cluster at Illinois Institute of Technology (IIT) to come online. • The SPADE system will be deployed to transport data to NERSC HPSS (no special code needed in the wrapper or the pilot) • We are in contact with personnel involved in the system setup • We’ll need to secure installation of at least a part of the OSG software stack in order to implement Condor-G submission of pilots to that facility – discussions under way
Conclusion • We see no outstanding obstacles to starting production at BNL once the Daya team finalizes the task configuration • Pilot submission to PDSF and testing will commence in a few days • We will negotiate with IIT regarding Condor software stack installation there when the cluster is online in a few weeks