Experience with LCG in DC04

Presentation Transcript


  1. Experience with LCG in DC04 Ian Stokes-Rees

  2. Summary
     • Key Points
     • Job Breakdown
     • Site Distribution
     • Challenges to Using LCG
     • DC04 Issues with LCG
     • Future

  3. Key Points
     • 900 production jobs completed on LCG (12% of total)
     • Roberto Santinelli and Flavia Donno have provided invaluable LCG support
     • No major problems with LCG
     • Using GridFTP for all LCG data transfers
     • 3000+ attempted LCG job submissions
     • Very few jobs failing due to LCG problems

  4. Job Breakdown
     • 1616 jobs submitted to LCG in last week
         • These are not all LHCb production jobs
     • 475 successful LHCb production jobs
     • 534 exited with no work available
     • 421 failed due to a server crash at CERN
     • 66 failed due to expired proxy certificate*
     • 40 failed due to Gauss seg fault
     • 13 failed due to LCG problems (<1%)*
     • 67 are still running
     * May create problems for DIRAC auto-recovery and rescheduling
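
The figures on the Job Breakdown slide can be cross-checked with a few lines of arithmetic. The sketch below is illustrative only: the counts are copied from the slide, while the dictionary labels and the `share` helper are names chosen for this example. It confirms that the seven categories add up to the 1616 submitted jobs and that the LCG-related failures are indeed under 1%.

```python
# Job counts as listed on the "Job Breakdown" slide.
breakdown = {
    "successful LHCb production jobs": 475,
    "exited with no work available": 534,
    "failed: server crash at CERN": 421,
    "failed: expired proxy certificate": 66,
    "failed: Gauss seg fault": 40,
    "failed: LCG problems": 13,
    "still running": 67,
}
submitted = 1616  # total jobs submitted to LCG in the last week


def share(count, total):
    """Return count as a percentage of total, rounded to one decimal place."""
    return round(100.0 * count / total, 1)


total = sum(breakdown.values())
shares = {label: share(n, submitted) for label, n in breakdown.items()}

if __name__ == "__main__":
    print(f"categories sum to {total} of {submitted} submitted")
    for label, pct in shares.items():
        print(f"{label:36s} {breakdown[label]:5d}  {pct:5.1f}%")
```

On these numbers, jobs that found no work and jobs lost to the CERN server crash together outnumber every other category, which is consistent with the slide's point that LCG itself caused very few failures.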

  5. Site Distribution

  6. Challenges to Using LCG (I)
     • GridFTP: non-trivial installation required at all sites that want to access LCG data
         • RS and FD (Roberto Santinelli and Flavia Donno) provided both standalone RPMs and tarballs for GridFTP
         • Remarkable that Globus don’t provide GridFTP clients
     • Debugging problems is very time consuming and difficult
         • Sometimes it is impossible and requires LCG operators or site admins to intervene
         • LCG does not return much (if any) information on failed jobs
     • Normalised Queue Time Limit
         • Still looking for a good solution to this problem
     • User Quota
         • No way to figure out whether an LCG job will exceed quota

  7. Challenges (II)
     • Commands not designed for large numbers of jobs
         • Analysing results of LCG jobs (and command output) very time consuming
         • XML format output of all LCG commands would be invaluable
     • Some commands can hang indefinitely
         • Requires wrapping all LCG commands in a timeout watchdog
     • Some LCG commands return incomplete results
     • Input and output of commands not great for scripting
     • Most operations take a long time to execute
         • Tens of seconds to several minutes
     • Mystery Aborted jobs
         • Experienced by ALICE and ATLAS
         • Quite rare
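
The "timeout watchdog" mentioned on this slide could look roughly like the following. This is a minimal sketch using Python's standard `subprocess` module, not the wrapper actually used in DIRAC; the function name `run_with_watchdog` and the stand-in commands are assumptions made for illustration, and no real LCG command line is shown.

```python
import subprocess
import sys


def run_with_watchdog(argv, timeout_s):
    """Run argv; return (exit_code, output), or (None, "") if it timed out."""
    try:
        done = subprocess.run(
            argv,
            stdout=subprocess.PIPE,
            stderr=subprocess.STDOUT,  # merge stderr into the captured output
            universal_newlines=True,   # decode output as text
            timeout=timeout_s,
        )
    except subprocess.TimeoutExpired:
        # subprocess.run kills the child process before re-raising on timeout.
        return None, ""
    return done.returncode, done.stdout


if __name__ == "__main__":
    # Stand-in commands; a real caller would pass a grid command line instead.
    quick = [sys.executable, "-c", "print('ok')"]
    hung = [sys.executable, "-c", "import time; time.sleep(30)"]
    print(run_with_watchdog(quick, timeout_s=10))
    print(run_with_watchdog(hung, timeout_s=1))
```

Returning `None` as the exit code lets the caller tell a hung command apart from one that ran and failed, which matters when deciding whether to retry.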

  8. DC04 Issues Using LCG
     • VO_LHCB_SW_DIR: not writeable, not always configured properly
     • Extracting information from .BrokerInfo file
         • Reflection by job on WN to find out:
             • What CE it is running on
             • What the LCG Job ID is
         • File not always available (if job is targeted to site with -r)
     • File Transfers!
         • We have had problems transferring via BBFTP, SFTP, GridFTP
         • This has led to many failed jobs
     • Difficult to test job environment in advance, so download, installation, configuration, and running of DIRAC must be flawless first time without intervention

  9. Future
     • Ramp up loading of DC04 jobs onto LCG
     • Increased use of LCG Data Mgmt Tools
         • Replica Management
     • Customisation of LHCb RB and BDII
     • Automated LCG Agent
         • Monitor available production jobs and submit to LCG if there are available “slots”
     • Direct submission of production jobs to LCG
         • Currently running a “Fork” DIRAC Agent on WN
     • More direct interface via APIs
     • Evaluation of ARDA software
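
The "Automated LCG Agent" item describes a simple policy: submit only when production jobs are waiting and slots are free. The sketch below shows one way such a cycle might be structured. Every name in it (`jobs_to_submit`, `agent_cycle`, the three callables, and the `max_per_cycle` cap) is hypothetical; the slides do not describe the agent's actual design or any DIRAC or LCG API.

```python
def jobs_to_submit(waiting_jobs, free_slots, max_per_cycle=50):
    """Number of submissions for one agent cycle (hypothetical policy).

    Submits nothing if there is no waiting work or no free slot; otherwise
    submits the smaller of the two, capped per cycle (the cap is an assumption).
    """
    if waiting_jobs <= 0 or free_slots <= 0:
        return 0
    return min(waiting_jobs, free_slots, max_per_cycle)


def agent_cycle(count_waiting, count_free_slots, submit):
    """Run one cycle; all three callables are placeholders supplied by the caller."""
    n = jobs_to_submit(count_waiting(), count_free_slots())
    for _ in range(n):
        submit()
    return n


if __name__ == "__main__":
    submitted = []
    n = agent_cycle(
        count_waiting=lambda: 10,          # stand-in for a production queue query
        count_free_slots=lambda: 3,        # stand-in for a free-slot query
        submit=lambda: submitted.append("job"),
    )
    print(n, len(submitted))
```

Keeping the decision in a pure function separates the policy from the slow, failure-prone grid calls, so the policy can be tested without touching the grid.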
