
ATLAS DC2 Pile-up Jobs on LCG


Presentation Transcript


1. ATLAS DC2 Pile-up Jobs on LCG
   ATLAS DC Meeting, February 2005
   chudoba@fzu.cz

2. Pile-up tasks
• Jobs defined in 3 tasks:
  • 210 dc2.003002.lumi10.A2_z_mumu.task
  • 307 dc2.003026.lumi10.A0_top.task
  • 308 dc2.003004.lumi10.A3_z_tautau.task
• Input files with minimum-bias events were distributed to selected sites using DQ, 700 GB in total
• Each job used 8 minimum-bias input files (~250 MB each), downloaded from the close SE, and 1 signal input file (stage-in sketched below)
• 1 GB RAM required per job
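
A minimal sketch of that stage-in step, assuming the lcg-cp client of the LCG data-management tools; the SE host and all file names below are hypothetical placeholders, not the actual DC2 transformation script:

    import os
    import subprocess

    # Hypothetical close SE and file names -- placeholders only.
    CLOSE_SE = "t2-se-01.example.org"
    minbias = ["sfn://%s/atlas/dc2/minbias/minbias._%05d.pool.root" % (CLOSE_SE, i)
               for i in range(1, 9)]   # 8 min. bias files, ~250 MB each
    signal = "sfn://%s/atlas/dc2/simul/signal._00001.pool.root" % CLOSE_SE

    for sfn in minbias + [signal]:
        dest = "file:" + os.path.join(os.getcwd(), sfn.rsplit("/", 1)[-1])
        # lcg-cp copies a grid file (here from the close SE) to a local
        # path; --vo selects the virtual organisation.
        subprocess.run(["lcg-cp", "--vo", "atlas", sfn, dest], check=True)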

3. 5 sites involved
• golias25.farm.particle.cz:2119/jobmanager-lcgpbs-lcgatlasprod
• lcg00125.grid.sinica.edu.tw:2119/jobmanager-lcgpbs-infinite
• lcgce01.triumf.ca:2119/jobmanager-lcgpbs-atlas
• lcgce02.ifae.es:2119/jobmanager-lcgpbs-atlas
• t2-ce-01.roma1.infn.it:2119/jobmanager-lcgpbs-infinite
[Chart: number of jobs per site]

4. [figure only; no recoverable content]

5. [figure only; no recoverable content]

6. [figure only; no recoverable content]

7. Status

   JOBSTATUS   NJOBS
   failed       3702
   finished     5703
   pending       323
   running        64

   21 jobs have JOBSTATUS finished but CURRENTSTATE ABORTED – probably initial tests (ENDTIME = 23-SEP-04, 30-SEP-04 and 07-OCT-04)
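
For reference, the counts above amount to a simple aggregate over the production database; a sketch assuming the cx_Oracle client and a hypothetical table name JOBEXECUTION (the real proddb schema is not given in the slides):

    import cx_Oracle

    # Connection string and table name are assumptions for illustration.
    conn = cx_Oracle.connect("reader/readerpw@proddb")
    cur = conn.cursor()

    # Count jobs per JOBSTATUS, as in the table on this slide.
    cur.execute("""SELECT jobstatus, COUNT(*) AS njobs
                     FROM jobexecution
                    GROUP BY jobstatus
                    ORDER BY njobs DESC""")
    for jobstatus, njobs in cur:
        print("%-10s %6d" % (jobstatus, njobs))
    conn.close()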

8. Why such big differences in efficiency? PRAGUE: 48%, TW: 70%

   ATTEMPT  NJOBS
      1      2442
      2       466
      3       244
      4       291
      5       130
      6        71
      7        66
      8        52
      9        48
     10        26
     11         7

   ATTEMPT  NJOBS
      1      2662
      2       361
      3       184

• Other differences:
  • RB on TW
  • lexor running on a UI on TW
  • many signal files stored on the SE on TW
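
One way to put a number on the retry tails above is the fraction of jobs that finished on their first attempt; a small sketch using the two ATTEMPT/NJOBS tables (the transcript does not state explicitly which table belongs to which site, so they stay unlabelled here):

    # ATTEMPT -> NJOBS, copied from the two tables on this slide.
    table1 = {1: 2442, 2: 466, 3: 244, 4: 291, 5: 130, 6: 71,
              7: 66, 8: 52, 9: 48, 10: 26, 11: 7}
    table2 = {1: 2662, 2: 361, 3: 184}

    def first_attempt_fraction(attempts):
        """Fraction of job definitions that needed only one attempt."""
        return float(attempts[1]) / sum(attempts.values())

    for name, dist in (("table 1", table1), ("table 2", table2)):
        print("%s: %.0f%% of jobs done on attempt 1"
              % (name, 100 * first_attempt_fraction(dist)))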

9. Failures
• Not easy to get the cause of a failure from proddb
• VALIDATIONDIAGNOSTIC is quite difficult to parse by script (a rough classifier is sketched below), e.g.:

  <workernode>t2-wn-36.roma1.infn.it</workernode><retcode>1</retcode><time>0m2.360s</time>
  <error>STAGE-IN failed:
  WARNING: No FILE or RFIO access for existing replicas
  WARNING: Replication of sfn://castorftp.cnaf.infn.it/castor/cnaf.infn.it/grid/lcg/atlas/datafiles/dc2/simul/dc2.003002.simul.A2_z_mumu/dc2.003002.simul.A2_z_mumu._01477.pool.root.1 to close SE failed: Error in replicating PFN sfn://castorftp.cnaf.infn.it/castor/cnaf.infn.it/grid/lcg/atlas/datafiles/dc2/simul/dc2.003002.simul.A2_z_mumu/dc2.003002.simul.A2_z_mumu._01477.pool.root.1 to t2-se-01.roma1.infn.it: lcg_aa: File exists
  lcg_aa: File exists
  Giving up after attempting replication TWICE.
  WARNING: Could not stage input file sfn://castorftp.cnaf.infn.it/castor/cnaf.infn.it/grid/lcg/atlas/datafiles/dc2/simul/dc2.003002.simul.A2_z_mumu/dc2.003002.simul.A2_z_mumu._01477.pool.root.1: Gridftp copy failed from gsiftp://castorftp.cnaf.infn.it/castor/cnaf.infn.it/grid/lcg/atlas/datafiles/dc2/simul/dc2.003002.simul.A2_z_mumu/dc2.003002.simul.A2_z_mumu._01477.pool.root.1 to file:/home/atlassgm/globus-tmp.t2-wn-36.17931.0/WMS_t2-wn-36_018404_https_3a_2f_2flcg00124.grid.sinica.edu.tw_3a9000_2fKv9HpVIUkMLTBBe-Ia3xLA/dc2.003002.simul.A2_z_mumu._01477.pool.root: the server sent an error response: 550 550 /castor/cnaf.infn.it/grid/lcg/atlas/datafiles/dc2/simul/dc2.003002.simul.A2_z_mumu/dc2.003002.simul.A2_z_mumu._01477.pool.root.1: Invalid argument.
  EDGFileCatalog: level[Always] Disconnected</error>
  <stageOut>No log for stageout phase</stageOut>

• middleware (mw) failures, e.g.:
  <JobInfo>Job RetryCount (0) hit</JobInfo>
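
A crude classifier along these lines can still bin most diagnostics; the tag names (<error>, <JobInfo>) come from the excerpts on these slides, while the categories and patterns are illustrative guesses:

    import re

    # Rough patterns for the failure classes seen in the excerpts above.
    CATEGORIES = [
        ("stage-in",   re.compile(r"STAGE-IN failed")),
        ("athena",     re.compile(r"AthenaCrash|PersistencySvc")),
        ("middleware", re.compile(r"Job RetryCount \(\d+\) hit")),
    ]

    def classify(diagnostic):
        """Pull the <error> payload out of a VALIDATIONDIAGNOSTIC string
        (falling back to the whole string) and map it onto a category."""
        m = re.search(r"<error>(.*?)</error>", diagnostic, re.DOTALL)
        text = m.group(1) if m else diagnostic
        for name, pattern in CATEGORIES:
            if pattern.search(text):
                return name
        return "unknown"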

10. Some jobs with many attempts
JOBDEFINITIONID=459795
• Attempt 1: 09-NOV-04
  <workernode>t2-wn-42.roma1.infn.it</workernode><retcode>1</retcode><time>0m43.250s</time>
  <error>Transformation error:
  -------- Problem report ------- [Unknown Problem]
  AthenaPoolConve... ERROR (PersistencySvc) pool::PersistencySvc::UserDatabase::connectForRead: FID is not existing in the catalog
  ================================
  -------- Problem report ------- [Unknown Problem]
  PileUpEventLoopMgr WARNING Original event selector has no events
  ================================</error>
  <stageOut>No log for stageout phase</stageOut>
• ...
• Attempt 11: 15-DEC-04
  <workernode>goliasx76.farm.particle.cz</workernode><retcode>1</retcode><time>0m41.460s</time>
  (same Transformation error and stage-out message as attempt 1)

11. JOBDEFINITIONID=456843
• Attempt 1:
  <workernode>t2-wn-37.roma1.infn.it</workernode><retcode>1</retcode><time>0m2.830s</time>
  <error>STAGE-IN failed:
  WARNING: No FILE or RFIO access for existing replicas
  WARNING: Replication of srm://lcgads01.gridpp.rl.ac.uk//datafiles/dc2/simul/dc2.003002.simul.A2_z_mumu/dc2.003002.simul.A2_z_mumu._02629.pool.root.6 to close SE failed: Error in replicating PFN srm://lcgads01.gridpp.rl.ac.uk//datafiles/dc2/simul/dc2.003002.simul.A2_z_mumu/dc2.003002.simul.A2_z_mumu._02629.pool.root.6 to t2-se-01.roma1.infn.it: lcg_aa: File exists
  lcg_aa: File exists
  Giving up after attempting replication TWICE.
  WARNING: Could not stage input file srm://lcgads01.gridpp.rl.ac.uk//datafiles/dc2/simul/dc2.003002.simul.A2_z_mumu/dc2.003002.simul.A2_z_mumu._02629.pool.root.6: Get TURL failed: lcg_gt: Communication error on send
  EDGFileCatalog: level[Always] Disconnected</error>
  <stageOut>No log for stageout phase</stageOut>
• Attempt 2:
  <workernode>lcg00172.grid.sinica.edu.tw</workernode><retcode>2</retcode><time>0m23.660s</time>
  <error>Transformation error:
  -------- Problem report ------- [SOFTWARE]
  AthenaCrash
  ================================</error>
  <stageOut>No log for stageout phase</stageOut>
• ...
• Attempt 9:
  <workernode>goliasx44.farm.particle.cz</workernode><retcode>2</retcode><time>0m23.340s</time>
  (same AthenaCrash Transformation error as attempt 2)

12. JOBDEFINITIONID=504139
• Attempt 1:
  <workernode>t2-wn-48.roma1.infn.it</workernode><retcode>2</retcode><time>66m58.650s</time>
  <error>Transformation error:
  -------- Problem report ------- [SOFTWARE]
  AthenaCrash
  ================================</error>
  <stageOut>No log for stageout phase</stageOut>
• Attempt 2:
  <workernode>lcg00144.grid.sinica.edu.tw</workernode><retcode>2</retcode><time>66m56.800s</time>
  (same AthenaCrash Transformation error as attempt 1)
• the same up to attempt 5
• Attempt 6: mw failure
• Attempt 7:
  <workernode>goliasx60.farm.particle.cz</workernode><retcode>0</retcode><time>152m53.780s</time>
  ??? (retcode 0 – the identical job finally succeeded)

13. Job properties
• no exact relation between a job in the Oracle DB and an entry in the PBS log file
• STARTTIME and ENDTIME are just hints
• Some jobs on golias:
  • 1232 finished jobs in December registered in proddb
  • 1299 jobs selected from the December PBS logs, with cuts on the CPU time and virtual memory values (see the sketch after this list)
  • Nodes: 3.06 GHz Xeon, 2 GB RAM
• Histograms based on information from the PBS log files
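
That PBS-side selection can be scripted from the accounting records; a sketch assuming the standard PBS accounting line format (datetime;record-type;jobid;key=value ...) and made-up cut values, since the slide does not state the actual cuts:

    import re

    KV = re.compile(r"(\S+)=(\S+)")   # key=value pairs in a PBS record

    def hms_to_seconds(hms):
        h, m, s = (int(x) for x in hms.split(":"))
        return 3600 * h + 60 * m + s

    def select_jobs(path, min_cput_s=3600, min_vmem_kb=500000):
        """Yield ids of completed jobs passing the CPU-time and virtual-
        memory cuts; the cut values here are illustrative only."""
        with open(path) as logfile:
            for line in logfile:
                date, rec, jobid, attrs = line.rstrip("\n").split(";", 3)
                if rec != "E":        # keep only job-end records
                    continue
                kv = dict(KV.findall(attrs))
                cput = hms_to_seconds(kv.get("resources_used.cput", "0:0:0"))
                vmem = int(kv.get("resources_used.vmem", "0kb").rstrip("kb"))
                if cput >= min_cput_s and vmem >= min_vmem_kb:
                    yield jobid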

14. Some jobs (6) successfully ran on a machine with only 1 GB RAM, but the wall time was ~20 h – probably a lot of swapping

15. [figure only; no recoverable content]

16. [figure only; no recoverable content]

17. WN -> SE -> NFS server
• The WN has the same NFS mount – could it be used directly?

18. [figure only; no recoverable content]

19. Conclusions
• no job name in the local batch system – jobs are difficult to identify
• the version of the lexor executor should be recorded in proddb
• proddb: very slow response; these queries were run on atlassg (which holds a snapshot of proddb from Feb 8)
• a study of the log files should be done before increasing MAXATTEMPT
• proddb should be cleaned
