Experience with LCG in DC04
Ian Stokes-Rees
Summary
• Key Points
• Job Breakdown
• Site Distribution
• Challenges to Using LCG
• DC04 Issues with LCG
• Future
Key Points
• 900 production jobs completed on LCG (12% of total)
• Roberto Santinelli and Flavia Donno have provided invaluable LCG support
• No major problems with LCG
• Using GridFTP for all LCG data transfers
• 3000+ attempted LCG job submissions
• Very few jobs failing due to LCG problems
Job Breakdown
• 1616 jobs submitted to LCG in the last week
  • These are not all LHCb production jobs
• 475 successful LHCb production jobs
• 534 exited with no work available
• 421 failed due to a server crash at CERN
• 66 failed due to expired proxy certificate*
• 40 failed due to Gauss seg fault
• 13 failed due to LCG problems (<1%)*
• 67 are still running
* May create problems for DIRAC auto-recovery and rescheduling
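As a quick cross-check of the figures on this slide, the short Python sketch below sums the seven categories and computes each share of the total. The counts are copied from the slide; the dictionary labels and the `share` helper are illustrative names only.

```python
# Counts copied from the "Job Breakdown" slide; labels are illustrative.
breakdown = {
    "successful LHCb production jobs": 475,
    "exited with no work available": 534,
    "failed: server crash at CERN": 421,
    "failed: expired proxy certificate": 66,
    "failed: Gauss seg fault": 40,
    "failed: LCG problems": 13,
    "still running": 67,
}

total = sum(breakdown.values())


def share(count, total):
    """Return count as a percentage of total."""
    return 100.0 * count / total


# Print one row per category, then the overall total.
for label, count in breakdown.items():
    print(f"{label:35s} {count:5d}  {share(count, total):5.1f}%")
print(f"{'total':35s} {total:5d}")
```

The categories add up to the 1616 submitted jobs, and the 13 failures attributed to LCG problems come to under 1% of that total, which matches the slide.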
Challenges to Using LCG (I)
• GridFTP – non-trivial installation required at all sites which want to access LCG data
  • RS and FD provided both standalone RPMs and tarballs for GridFTP
  • Remarkable that Globus don’t provide GridFTP clients
• Debugging problems is very time consuming and difficult
  • Sometimes it is impossible and requires LCG operators or site admins to intervene
  • LCG does not return much (if any) information on failed jobs
• Normalised Queue Time Limit
  • Still looking for a good solution for this problem
• User Quota
  • No way to figure out if the LCG job will exceed quota
Challenges (II)
• Commands not designed for large numbers of jobs
  • Analysing results of LCG jobs (and command output) very time consuming
  • XML format output of all LCG commands would be invaluable
• Some commands can hang indefinitely
  • Requires wrapping all LCG commands in a timeout watchdog
• Some LCG commands return incomplete results
• Input and output of commands not great for scripting
• Most operations take a long time to execute
  • Tens of seconds to several minutes
• Mystery aborted jobs
  • Experienced by ALICE and ATLAS
  • Quite rare
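The "timeout watchdog" mentioned above could look roughly like the following Python sketch. It is not the wrapper DIRAC actually used: the function name `run_with_timeout` and the default of 120 seconds are assumptions, and the demo uses the Python interpreter as a stand-in for an LCG command so that it runs anywhere.

```python
import subprocess
import sys


def run_with_timeout(argv, timeout_s=120):
    """Run a command; kill it if it has not finished after timeout_s seconds.

    Returns (returncode, stdout, stderr). On timeout the return code is None
    and both output strings are empty. Hypothetical helper, not a DIRAC API.
    """
    try:
        done = subprocess.run(
            argv, capture_output=True, text=True, timeout=timeout_s
        )
    except subprocess.TimeoutExpired:
        # subprocess.run has already killed and reaped the child here.
        return None, "", ""
    return done.returncode, done.stdout, done.stderr


# Stand-in for a command that hangs: sleeps far longer than the timeout.
hung = run_with_timeout(
    [sys.executable, "-c", "import time; time.sleep(5)"], timeout_s=0.5
)
# Stand-in for a command that returns promptly.
quick = run_with_timeout([sys.executable, "-c", "print('ok')"], timeout_s=30)
print("hung command return code:", hung[0])
print("quick command return code:", quick[0], "output:", quick[1].strip())
```

Note that `subprocess.run` only kills the direct child process; a command that spawns its own children may need process-group handling, which is left out of this sketch.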
DC04 Issues Using LCG
• VO_LHCB_SW_DIR – not writeable, not always configured properly
• Extracting information from .BrokerInfo file
  • Reflection by job on WN to find out:
    • What CE it is running on
    • What the LCG Job ID is
  • File not always available (if job is targeted to site with -r)
• File transfers! We have had problems transferring via BBFTP, SFTP, GridFTP
  • This has led to many failed jobs
• Difficult to test job environment in advance, so download, installation, configuration, and running of DIRAC must be flawless first time without intervention
Future
• Ramp up loading of DC04 jobs onto LCG
• Increased use of LCG Data Mgmt Tools
  • Replica Management
• Customisation of LHCb RB and BDII
• Automated LCG Agent
  • Monitor available production jobs and submit to LCG if there are available “slots”
• Direct submission of production jobs to LCG
  • Currently running a “Fork” DIRAC Agent on WN
• More direct interface via APIs
• Evaluation of ARDA software
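The "Automated LCG Agent" idea (monitor waiting production jobs and submit when slots are free) can be sketched as a single polling cycle. Everything here is hypothetical: `jobs_to_submit`, `agent_cycle`, the per-cycle cap, and the three callables are placeholders for whatever DIRAC and LCG queries a real agent would use, which the slides do not specify.

```python
def jobs_to_submit(waiting_jobs, free_slots, max_per_cycle=10):
    """Number of jobs to submit in one cycle.

    Bounded by the jobs waiting, the free slots, and a per-cycle cap
    (the cap is an assumption, added to avoid flooding the broker).
    """
    return max(0, min(waiting_jobs, free_slots, max_per_cycle))


def agent_cycle(count_waiting, count_free_slots, submit, max_per_cycle=10):
    """Run one agent cycle using caller-supplied (hypothetical) callables.

    count_waiting()    -> number of production jobs waiting
    count_free_slots() -> number of free LCG "slots"
    submit()           -> submits one job to LCG
    Returns the number of jobs submitted.
    """
    n = jobs_to_submit(count_waiting(), count_free_slots(), max_per_cycle)
    for _ in range(n):
        submit()
    return n


# Demo with in-memory stand-ins instead of real DIRAC/LCG queries.
submitted = []
n = agent_cycle(
    count_waiting=lambda: 25,
    count_free_slots=lambda: 4,
    submit=lambda: submitted.append("job"),
)
print("submitted this cycle:", n)
```

Passing the queries in as callables keeps the decision logic testable without a grid connection; a real agent would call `agent_cycle` in a loop with a sleep between cycles.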