ATLAS DC2 Pile-up Jobs on LCG
Atlas DC Meeting, February 2005
chudoba@fzu.cz
Pile-up tasks
• Jobs defined in 3 tasks:
  • 210 dc2.003002.lumi10.A2_z_mumu.task
  • 307 dc2.003026.lumi10.A0_top.task
  • 308 dc2.003004.lumi10.A3_z_tautau.task
• Input files with min. bias were distributed to selected sites using DQ (700 GB)
• Each job used 8 input files with min. bias (~250 MB each), downloaded from the close SE, and 1 input file with signal
• 1 GB RAM required per job
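The per-job min. bias volume follows directly from the numbers on this slide. The short sketch below only multiplies them out; the constant and function names are made up for illustration, and the signal file is left out because its size is not given.

```python
# Figures quoted on the slide; the ~250 MB file size is approximate.
MINBIAS_FILES_PER_JOB = 8
MINBIAS_FILE_MB = 250


def minbias_mb_per_job(n_files=MINBIAS_FILES_PER_JOB, file_mb=MINBIAS_FILE_MB):
    """Approximate min. bias volume one job downloads from the close SE, in MB."""
    return n_files * file_mb


print(minbias_mb_per_job(), "MB of min. bias input per job (approx.)")
```

So each job stages in roughly 2000 MB of min. bias data, in addition to its signal file.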
5 sites involved
golias25.farm.particle.cz:2119/jobmanager-lcgpbs-lcgatlasprod
lcg00125.grid.sinica.edu.tw:2119/jobmanager-lcgpbs-infinite
lcgce01.triumf.ca:2119/jobmanager-lcgpbs-atlas
lcgce02.ifae.es:2119/jobmanager-lcgpbs-atlas
t2-ce-01.roma1.infn.it:2119/jobmanager-lcgpbs-infinite
[Figure: Number of jobs per site]
Status
JOBSTATUS   NJOBS
failed       3702
finished     5703
pending       323
running        64
21 jobs have JOBSTATUS finished and CURRENTSTATE ABORTED, probably initial tests (ENDTIME = 23-SEP-04, 30-SEP-04 and 07-OCT-04)
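As a quick reading aid, the status counts can be turned into shares of all rows. This is a minimal sketch using only the four numbers from the table. The names `status_counts` and `status_shares` are illustrative, and the shares are fractions of table rows, not the per-site efficiencies quoted later in the talk.

```python
# Counts copied from the Status table.
status_counts = {"failed": 3702, "finished": 5703, "pending": 323, "running": 64}


def status_shares(counts):
    """Return each status as a fraction of the total number of rows."""
    total = sum(counts.values())
    return {status: n / total for status, n in counts.items()}


shares = status_shares(status_counts)
for status, share in shares.items():
    print(f"{status:9s} {status_counts[status]:5d} {share:6.1%}")
```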
Why such big differences in efficiency?
PRAGUE: 48%   TW: 70%
ATTEMPT   NJOBS
1          2442
2           466
3           244
4           291
5           130
6            71
7            66
8            52
9            48
10           26
11            7
ATTEMPT   NJOBS
1          2662
2           361
3           184
• Other differences:
  • RB on TW
  • lexor running on UI on TW
  • many signal files stored on SE on TW
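One way to compare the two ATTEMPT tables is the share of entries that sit at attempt 1. The sketch below computes that for both tables in slide order. The slide text does not say which table belongs to which site, so they are called `first_table` and `second_table` here. Treating NJOBS as a plain count per attempt number is an assumption.

```python
# ATTEMPT -> NJOBS, copied from the two tables in the order they appear.
first_table = {1: 2442, 2: 466, 3: 244, 4: 291, 5: 130, 6: 71,
               7: 66, 8: 52, 9: 48, 10: 26, 11: 7}
second_table = {1: 2662, 2: 361, 3: 184}


def first_attempt_share(table):
    """Fraction of all entries in the table that are at attempt 1."""
    return table[1] / sum(table.values())


for name, table in (("first table", first_table), ("second table", second_table)):
    print(f"{name}: {sum(table.values())} entries, "
          f"{first_attempt_share(table):.1%} at attempt 1, "
          f"max attempt {max(table)}")
```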
Failures
• Not easy to get cause of failure from proddb
• VALIDATIONDIAGNOSTIC quite difficult to parse by script:
  • <workernode>t2-wn-36.roma1.infn.it</workernode><retcode>1</retcode><time>0m2.360s</time><error>STAGE-IN failed: WARNING: No FILE or RFIO access for existing replicasWARNING: Replication of sfn://castorftp.cnaf.infn.it/castor/cnaf.infn.it/grid/lcg/atlas/datafiles/dc2/simul/dc2.003002.simul.A2_z_mumu/dc2.003002.simul.A2_z_mumu._01477.pool.root.1 to close SE failed: Error in replicating PFN sfn://castorftp.cnaf.infn.it/castor/cnaf.infn.it/grid/lcg/atlas/datafiles/dc2/simul/dc2.003002.simul.A2_z_mumu/dc2.003002.simul.A2_z_mumu._01477.pool.root.1 to t2-se-01.roma1.infn.it: lcg_aa: File existslcg_aa: File existsGiving up after attempting replication TWICE.WARNING: Could not stage input file sfn://castorftp.cnaf.infn.it/castor/cnaf.infn.it/grid/lcg/atlas/datafiles/dc2/simul/dc2.003002.simul.A2_z_mumu/dc2.003002.simul.A2_z_mumu._01477.pool.root.1: Gridftp copy failed from gsiftp://castorftp.cnaf.infn.it/castor/cnaf.infn.it/grid/lcg/atlas/datafiles/dc2/simul/dc2.003002.simul.A2_z_mumu/dc2.003002.simul.A2_z_mumu._01477.pool.root.1 to file:/home/atlassgm/globus-tmp.t2-wn-36.17931.0/WMS_t2-wn-36_018404_https_3a_2f_2flcg00124.grid.sinica.edu.tw_3a9000_2fKv9HpVIUkMLTBBe-Ia3xLA/dc2.003002.simul.A2_z_mumu._01477.pool.root: the server sent an error response: 550 550 /castor/cnaf.infn.it/grid/lcg/atlas/datafiles/dc2/simul/dc2.003002.simul.A2_z_mumu/dc2.003002.simul.A2_z_mumu._01477.pool.root.1: Invalid argument.EDGFileCatalog: level[Always] Disconnected</error><stageOut>No log for stageout phase</stageOut>
• mw failures:
  • <JobInfo>Job RetryCount (0) hit</JobInfo>
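A rough sketch of how such a VALIDATIONDIAGNOSTIC string could be handled by script is given below. It assumes the field is always a flat sequence of `<tag>...</tag>` pairs, as in the records shown in this talk, and pulls them out with a regular expression rather than an XML parser, since the string has no single root element. The function names and the category labels returned by `classify` are hypothetical and not part of proddb or lexor.

```python
import re

# Matches one flat <tag>content</tag> pair; nested tags are not handled.
TAG_RE = re.compile(r"<(\w+)>(.*?)</\1>", re.DOTALL)
TIME_RE = re.compile(r"(\d+)m([\d.]+)s")


def parse_diagnostic(text):
    """Return {tag: content} for the flat tags in a VALIDATIONDIAGNOSTIC string."""
    return dict(TAG_RE.findall(text))


def wall_seconds(time_field):
    """Convert a '66m58.650s' style value to seconds, or None if it does not match."""
    match = TIME_RE.fullmatch(time_field)
    if match is None:
        return None
    return int(match.group(1)) * 60 + float(match.group(2))


def classify(fields):
    """Map parsed fields to a coarse category (labels are illustrative only)."""
    error = fields.get("error", "")
    if "JobInfo" in fields:
        return "mw"
    if error.startswith("STAGE-IN failed"):
        return "stage-in"
    if "AthenaCrash" in error:
        return "athena-crash"
    if error.startswith("Transformation error"):
        return "transformation"
    if fields.get("retcode") == "0":
        return "ok"
    return "unknown"


# Record copied from attempt 1 of JOBDEFINITIONID=504139 in this talk.
SAMPLE = (
    "<workernode>t2-wn-48.roma1.infn.it</workernode><retcode>2</retcode>"
    "<time>66m58.650s</time><error>Transformation error: -------- Problem report "
    "-------[SOFTWARE]AthenaCrash================================</error>"
    "<stageOut>No log for stageout phase</stageOut>"
)

fields = parse_diagnostic(SAMPLE)
print(fields["workernode"], fields["retcode"],
      wall_seconds(fields["time"]), classify(fields))
```

The prefix checks in `classify` only cover the error texts that appear in these slides; a real study of the log files would need a longer list.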
Some Jobs with many Attempts
JOBDEFINITIONID=459795
• Attempt 1: 09-NOV-04
  • <workernode>t2-wn-42.roma1.infn.it</workernode><retcode>1</retcode><time>0m43.250s</time><error>Transformation error: -------- Problem report -------[Unknown Problem]AthenaPoolConve... ERROR (PersistencySvc) pool::PersistencySvc::UserDatabase::connectForRead: FID is not existing in the catalog================================-------- Problem report -------[Unknown Problem]PileUpEventLoopMgrWARNING Original event selector has no events================================</error><stageOut>No log for stageout phase</stageOut>
• ...
• Attempt 11: 15-DEC-04
  • <workernode>goliasx76.farm.particle.cz</workernode><retcode>1</retcode><time>0m41.460s</time><error>Transformation error: -------- Problem report -------[Unknown Problem]AthenaPoolConve... ERROR (PersistencySvc) pool::PersistencySvc::UserDatabase::connectForRead: FID is not existing in the catalog================================-------- Problem report -------[Unknown Problem]PileUpEventLoopMgrWARNING Original event selector has no events================================</error><stageOut>No log for stageout phase</stageOut>
JOBDEFINITIONID=456843
• Attempt 1:
  • <workernode>t2-wn-37.roma1.infn.it</workernode><retcode>1</retcode><time>0m2.830s</time><error>STAGE-IN failed: WARNING: No FILE or RFIO access for existing replicasWARNING: Replication of srm://lcgads01.gridpp.rl.ac.uk//datafiles/dc2/simul/dc2.003002.simul.A2_z_mumu/dc2.003002.simul.A2_z_mumu._02629.pool.root.6 to close SE failed: Error in replicating PFN srm://lcgads01.gridpp.rl.ac.uk//datafiles/dc2/simul/dc2.003002.simul.A2_z_mumu/dc2.003002.simul.A2_z_mumu._02629.pool.root.6 to t2-se-01.roma1.infn.it: lcg_aa: File existslcg_aa: File existsGiving up after attempting replication TWICE.WARNING: Could not stage input file srm://lcgads01.gridpp.rl.ac.uk//datafiles/dc2/simul/dc2.003002.simul.A2_z_mumu/dc2.003002.simul.A2_z_mumu._02629.pool.root.6: Get TURL failed: lcg_gt: Communication error on sendEDGFileCatalog: level[Always] Disconnected</error><stageOut>No log for stageout phase</stageOut>
• Attempt 2:
  • <workernode>lcg00172.grid.sinica.edu.tw</workernode><retcode>2</retcode><time>0m23.660s</time><error>Transformation error: -------- Problem report -------[SOFTWARE]AthenaCrash================================</error><stageOut>No log for stageout phase</stageOut>
• ...
• Attempt 9:
  • <workernode>goliasx44.farm.particle.cz</workernode><retcode>2</retcode><time>0m23.340s</time><error>Transformation error: -------- Problem report -------[SOFTWARE]AthenaCrash================================</error><stageOut>No log for stageout phase</stageOut>
JOBDEFINITIONID=504139
• Attempt 1:
  • <workernode>t2-wn-48.roma1.infn.it</workernode><retcode>2</retcode><time>66m58.650s</time><error>Transformation error: -------- Problem report -------[SOFTWARE]AthenaCrash================================</error><stageOut>No log for stageout phase</stageOut>
• Attempt 2:
  • <workernode>lcg00144.grid.sinica.edu.tw</workernode><retcode>2</retcode><time>66m56.800s</time><error>Transformation error: -------- Problem report -------[SOFTWARE]AthenaCrash================================</error><stageOut>No log for stageout phase</stageOut>
• the same up to attempt 5
• Attempt 6: mw failure
• Attempt 7:
  • <workernode>goliasx60.farm.particle.cz</workernode><retcode>0</retcode><time>152m53.780s</time>
• ???
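The attempt histories above suggest a simple check: flag a job when the same error shows up at more than one site, since a retry elsewhere is then less likely to help. The sketch below is one possible heuristic, not a procedure from the talk. The function name, the `(site, signature)` tuples, the signature labels and the `min_repeats` threshold are all assumptions, and the example uses only the attempts of JOBDEFINITIONID=456843 that are actually shown (the elided attempts are left out).

```python
from collections import Counter


def repeated_failure(attempts, min_repeats=2):
    """Return an error signature seen at least min_repeats times on 2+ sites.

    attempts: list of (site, signature) tuples in attempt order.
    Returns None if no signature meets both conditions.
    """
    counts = Counter(signature for _, signature in attempts)
    for signature, n in counts.most_common():
        sites = {site for site, sig in attempts if sig == signature}
        if n >= min_repeats and len(sites) >= 2:
            return signature
    return None


# Attempts 1, 2 and 9 of JOBDEFINITIONID=456843 as shown in the talk.
job_456843 = [
    ("roma1.infn.it", "stage-in"),
    ("grid.sinica.edu.tw", "athena-crash"),
    ("farm.particle.cz", "athena-crash"),
]

print(repeated_failure(job_456843))
```

Such a flag is only a hint: JOBDEFINITIONID=504139 crashed in the same way on its first attempts and still returned retcode 0 on attempt 7.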
Job properties
• No exact relation between a job in the Oracle DB and an entry in the PBS log file
• STARTTIME and ENDTIME are just hints
• Some jobs on golias:
  • 1232 finished jobs in December registered in proddb
  • 1299 jobs selected from PBS logs in December, with cuts on CPU time and virtual memory values
• Nodes: 3.06 GHz Xeon, 2 GB RAM
• Histograms based on information from PBS log files
Some jobs (6) ran successfully on a machine with only 1 GB RAM, but the wall time was 20 h, probably because of a lot of swapping
WN -> SE -> NFS server
• The WN has the same NFS mount; could it be used directly?
Conclusions
• No job name in the local batch system, so jobs are difficult to identify
• The version of the lexor executor should be recorded in the proddb
• proddb: very slow response; these queries were done on atlassg (which has a snapshot of proddb from Feb 8)
• A study of log files should be done before increasing MAXATTEMPT
• proddb should be cleaned