The SAM-Grid / LCG Interoperability Test Bed Gabriele Garzoglio (garzogli@fnal.gov) Speaker: Pierre Girard (pierre.girard@in2p3.fr)
Overview • The Interoperability Test Bed • Motivations • Architecture • Status Report • Lessons learned / Problems encountered • Still discussing… • Conclusions
Motivations for the interoperability project • The SAM-Grid is a convenient meta-computing system for the RunII experiments because it offers… • …transparent access to the experiment data through SAM • …integrated application management (job environment preparation, application-sensitive policies, job aggregation) • But deployment is expensive… • The idea: DZero will increase its resource pool within the framework of LCG (EGEE), while relying on the SAM-Grid data and application management
Basic Architecture • [Diagram: SAM-Grid, the SAM-Grid / LCG Forwarding Node, LCG, and the SAM-Grid VO-Specific Services, linked by arrows labelled "Flow of Job Submission" and "Offers services to …"] • Main issues to track down: • Accessibility of the services • Usability of the resources • Scalability
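Service/Resource Multiplicity • [Diagram: SAM-Grid connected to multiple forwarding nodes (FW), multiple LCG clusters (C), and multiple VO-service (SAM) instances (S). Legend: Network Boundaries, Forwarding Node, LCG Cluster, VO-Service (SAM), Job Flow, Offers Service]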
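Current Test Bed Configuration • [Diagram: SAM-Grid with forwarding nodes (FW), LCG clusters (C), and VO-service (SAM) instances (S). Sites shown: Wuppertal, RAL, Clermont-Ferrand, Imperial College, Lancaster, CCIN2P3. Some elements are marked "Integration in Progress". Legend: Network Boundaries, Forwarding Node, LCG Cluster, VO-Service (SAM), Job Flow, Offers Service]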
Job Scheduling System Adaptation I • The SAM-Grid sees the FW node as another gateway • The SAM-Grid has developed a grid-to-fabric interface (job-manager) that interacts with multiple fabric services (SAM, Monitoring, Environment Preparation): the Batch System is one of them. • Batch system adaptation is done through a layer of abstraction and implemented via robust local scheduler handlers.
Job Scheduling System Adaptation II • This mechanism is flexible enough that it allowed SAM-Grid to be adapted to LCG • Job Management (submit, status poll, kill, output gathering, …) is implemented via an LCG “scheduler” handler • The handler uses the LCG UI to submit jobs to an LCG broker (logically part of the FW node; in practice it can be anywhere)
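The handler idea on the two slides above can be pictured as a small interface that the job-manager calls without knowing which batch system sits behind it. The following is only a minimal sketch under stated assumptions: the class names, method names, and state strings are hypothetical and are not taken from the SAM-Grid code, and the LCG handler is an in-memory stand-in where a real one would call the LCG UI tools.

```python
from abc import ABC, abstractmethod
from typing import Dict, List


class SchedulerHandler(ABC):
    """Hypothetical abstraction layer: one handler per local scheduler."""

    @abstractmethod
    def submit(self, job_description: str) -> str:
        """Submit a job and return its identifier."""

    @abstractmethod
    def status(self, job_id: str) -> str:
        """Return the current state of the job."""

    @abstractmethod
    def kill(self, job_id: str) -> None:
        """Cancel the job."""

    @abstractmethod
    def gather_output(self, job_id: str) -> List[str]:
        """Return the names of the retrieved output files."""


class InMemoryLCGHandler(SchedulerHandler):
    """Stand-in for an LCG handler; it only records state in a dict."""

    def __init__(self) -> None:
        self._jobs: Dict[str, str] = {}

    def submit(self, job_description: str) -> str:
        job_id = f"lcg-{len(self._jobs) + 1}"
        self._jobs[job_id] = "SUBMITTED"
        return job_id

    def status(self, job_id: str) -> str:
        return self._jobs[job_id]

    def kill(self, job_id: str) -> None:
        self._jobs[job_id] = "KILLED"

    def gather_output(self, job_id: str) -> List[str]:
        return [f"{job_id}.out"]
```

With this shape, supporting another batch system means writing one more `SchedulerHandler` subclass while the calling code stays unchanged.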
Status Report • We can submit real DZero data reprocessing and Monte Carlo jobs to LCG via SAM-Grid • Jobs land on the available LCG clusters • Jobs rely on the SAM station at CCIN2P3 to handle input (binaries and data) and output • …see the SAM-Grid monitoring
Problems/Lessons Learned I • Scratch management is the responsibility of the site OR the application. • DZero requirements on local scratch space • Cannot run on NFS because of intensive I/O • Need 4 GB of local space • SAM-Grid uses job wrappers to do “smart” scratch management (find the best scratch area to use) • These wrappers rely on the job managers to set up scratch variables ($TMP_DIR, …) • Under discussion: one criterion for considering a cluster DZero-certified should be that the scratch variables are defined
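The "smart" scratch selection described above might look roughly like the sketch below. It is an illustration, not the SAM-Grid wrapper: the function names are hypothetical, `TMPDIR` is an assumed extra variable next to the `$TMP_DIR` named on the slide, and the check that a directory is not NFS-mounted is deliberately left out because it is platform specific.

```python
import os
import shutil

REQUIRED_BYTES = 4 * 1024**3  # the 4 GB of local space quoted on the slide


def candidates_from_env(names=("TMP_DIR", "TMPDIR")):
    """Collect candidate scratch directories from environment variables."""
    return [os.environ[n] for n in names if os.environ.get(n)]


def pick_scratch(candidates, required=REQUIRED_BYTES, free_bytes=None):
    """Return the existing directory with the most free space.

    Returns None when no candidate offers at least `required` bytes.
    `free_bytes` can be replaced by a stub, which makes the policy testable.
    """
    if free_bytes is None:
        free_bytes = lambda path: shutil.disk_usage(path).free
    best, best_free = None, -1
    for path in candidates:
        if not path or not os.path.isdir(path):
            continue  # variable unset or pointing to a missing directory
        free = free_bytes(path)
        if free >= required and free > best_free:
            best, best_free = path, free
    return best
```

A wrapper could call `pick_scratch(candidates_from_env())` and fail early, with a clear message, when the result is `None`.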
Problems/Lessons Learned II • Use of the LCG brokers • Experienced problems with disk space for the input sandbox (input sandbox 4 MB, all the rest via SAM) • Needed administrative action to resolve the problem • Possibly mitigated since we can use multiple brokers (tested with the Wuppertal and CCIN2P3 brokers)
Problems/Lesson Learned III • Job Failure Analysis • In general, for a single SAM-Grid job, the forwarding node submits multiple LCG jobs (aggregation management). The output of all the jobs is bundled together in an output sandbox. • We observed problems retrieving the output of “aborted” LCG jobs • “Maradona” fails in handling the output • In this case, it is tough to understand what went wrong with the job
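Because one SAM-Grid job fans out into several LCG jobs, failure analysis starts by rolling the individual LCG states up into one view. The sketch below shows one possible roll-up; the function name, the overall labels, and the state strings are illustrative assumptions rather than the forwarding node's actual logic.

```python
def summarize(lcg_states):
    """Roll up the states of the LCG jobs behind one SAM-Grid job.

    `lcg_states` maps an LCG job identifier to its state string.
    Returns (overall, aborted_ids); the aborted jobs are the ones whose
    output may be missing from the bundled output sandbox.
    """
    if not lcg_states:
        return "empty", []
    aborted = sorted(j for j, s in lcg_states.items() if s == "Aborted")
    if aborted:
        overall = "failed"
    elif all(s == "Done" for s in lcg_states.values()):
        overall = "done"
    else:
        overall = "running"
    return overall, aborted
```

Listing the aborted identifiers explicitly gives the operator a starting point when the corresponding output could not be retrieved.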
Problems/Lessons Learned IV • Resubmission of non-reentrant jobs • Some jobs should not be resubmitted in case of failure. They will be recovered as a separate activity • We had problems overriding the job submission retries from the JDL and from the UI configuration • Is this a known bug? A configuration problem on our part?
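For reference, a job description that asks for no resubmission could be generated as in the sketch below. The helper `render_jdl` is hypothetical, and the example assumes the standard JDL attributes `Executable`, `InputSandbox`, and `RetryCount`; whether the broker actually honours the setting is exactly the open question raised on the slide.

```python
def render_jdl(executable, input_sandbox, retry_count=0):
    """Render a minimal JDL text; retry_count=0 requests no resubmission."""
    files = ", ".join(f'"{name}"' for name in input_sandbox)
    lines = [
        f'Executable = "{executable}";',
        f"InputSandbox = {{{files}}};",
        f"RetryCount = {retry_count};",
    ]
    return "\n".join(lines)


if __name__ == "__main__":
    # File names here are placeholders, not taken from the test bed.
    print(render_jdl("wrapper.sh", ["wrapper.sh"]))
```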
Problems/Lessons Learned V • Network configuration • Sites hosting SAM must allow incoming network traffic from the FW node and from all LCG clusters (worker nodes) so that data handling control and transport can work • SAM should be modified to provide port range control
Problems/Lesson Learned VI • SAM configuration • SAM can only use TCP-based communication (as expected, UDP does not work in practice on the WAN) • SAM had to be modified to allow service accessibility for jobs within private networks (pull-based vs call-back interfaces)
Still discussing... I • What does it mean to certify LCG for a certain DZero activity? • For reprocessing, all the SAM-Grid clusters have undergone an initial certification phase • The cluster processes a well-known dataset, then the results are compared with a reference result • What do we do for LCG? Should every individual cluster be certified? Should LCG as a whole be certified? • The answer probably depends on the type of activity (Reprocessing, Monte Carlo, Analysis, …)
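The certification step described above (process a known dataset, compare with a reference) reduces to a comparison like the one sketched here. This is only an illustration: the function name, the idea of representing results as named numeric quantities, and the tolerance are assumptions, not the DZero certification procedure.

```python
import math


def certification_mismatches(results, reference, rel_tol=1e-6):
    """Return the names of quantities that are missing or differ from the reference.

    Both arguments map a quantity name to a number. An empty list means
    the cluster reproduced the reference within the given tolerance.
    """
    mismatches = []
    for name, expected in reference.items():
        value = results.get(name)
        if value is None or not math.isclose(value, expected, rel_tol=rel_tol):
            mismatches.append(name)
    return mismatches
```

Running the same check per cluster or once for LCG as a whole corresponds to the two options raised in the bullets.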
Still discussing... II • Who operates the SAM-Grid / LCG interoperability system? • For the SAM-Grid DZero reprocessing, people at the facilities had an interest in having their resources utilized: people at each facility ran operations, submitting jobs to their own facilities • Running “operations” means being responsible for the production of the data (routine job submission/monitoring, troubleshooting, facility maintenance/upgrade, …) • How do we organize the people that operate the LCG interoperability system? Is one responsible person enough?
Still discussing... III • Support on LCG • In case something goes wrong on LCG, DZero has to learn the best channels to request support • What response can DZero expect now and in 2 years? • As the system becomes more complex, it becomes difficult for the operators to pinpoint the reasons for job failures. LCG will get reports for failures on the SAM-Grid side… and vice versa.
Conclusions / SAM • We are moving the test bed to “production” by • expanding the system • ramping up usage • We are discussing open issues in operating the interoperability system • LCG certification • Organizing the operations • Obtaining support for LCG problems • Our principal target production application is Monte Carlo for DZero
Conclusions / LCG • Grid batch job environment variables • Proposal for standardization made at the last HEPIX and the last Operations Workshop (Bologna) • http://edms.cern.ch/document/630962 • What is the next step? How to proceed with implementation? • Make MW error handling easier • By using a well-defined set of MW error codes? • Suitable for automatic handling
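To show why a well-defined set of middleware error codes would suit automatic handling, the sketch below maps codes to recovery actions. Every code name and action here is invented for illustration; no such set is defined in the slides or, as far as this sketch assumes, in the middleware.

```python
from enum import Enum


class MWError(Enum):
    """Hypothetical middleware error codes, for illustration only."""
    SANDBOX_DISK_FULL = 1
    OUTPUT_RETRIEVAL_FAILED = 2
    SCRATCH_UNDEFINED = 3


# Hypothetical policy: what an automated operator would do for each code.
ACTIONS = {
    MWError.SANDBOX_DISK_FULL: "switch-broker",
    MWError.OUTPUT_RETRIEVAL_FAILED: "flag-for-recovery",
    MWError.SCRATCH_UNDEFINED: "exclude-cluster",
}


def action_for(code):
    """Return the automatic action for a code, or escalate to a human."""
    return ACTIONS.get(code, "escalate")
```

With free-text error messages, the same dispatch would need fragile string matching, which is the difficulty the bullets point at.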
More info at… • http://www-d0.fnal.gov/computing/grid/doc/SAMGrid-LCG-integration.pdf • http://www-d0.fnal.gov/computing/grid/doc/SAMGrid-LCG-integration-Lyon-report.pdf • http://samgrid.fnal.gov:8080/ • http://www-d0.fnal.gov/computing/grid/ • http://d0db.fnal.gov/sam/