ATLAS MC production errors per site. Overview: MC production monitoring reminder; 2008 statistics; 2008 errors; comparison with 2007. Panda server at CERN: please stop using the Panda monitor at BNL and use the CERN instance, http://panda.cern.ch/.
Overview • MC production monitoring reminder • 2008 statistics • 2008 errors • Comparison with 2007 Eric Lancon
Panda server at CERN • Please stop using the Panda monitor at BNL • Use the CERN instance: http://panda.cern.ch/
http://dashb-atlas-prodsys-test.cern.ch/dashboard/request.py/site-admin?cloud=&grouping=site • Enter GRIF, for example
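The dashboard address above carries two query parameters, `cloud` and `grouping`. A minimal sketch of building that address in Python follows. The helper name `site_admin_url` is hypothetical, and passing a cloud name such as `FR` in `cloud` is an assumption about how the dashboard filters, not something stated on the slide.

```python
from urllib.parse import urlencode

# Base address copied from the slide.
BASE_URL = "http://dashb-atlas-prodsys-test.cern.ch/dashboard/request.py/site-admin"


def site_admin_url(cloud="", grouping="site"):
    """Build the site-admin dashboard URL from its two query parameters.

    An empty `cloud` reproduces the address shown on the slide.
    Assumption: a non-empty value (for example "FR") restricts the view
    to that cloud.
    """
    query = urlencode({"cloud": cloud, "grouping": grouping})
    return f"{BASE_URL}?{query}"


default_url = site_admin_url()
fr_url = site_admin_url(cloud="FR")
```

With the default arguments the function returns exactly the address given on the slide.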
Click GRIF to get statistics per CE
Click + to get error messages
Click FR to get details of jobs running/allocated etc.
Pilots (3hrs): number of pilots on the site in the last 3 hours • Assigned: job assigned to the site, input not yet available at the site • Activated: input available, waiting for a pilot • Failed: failures in the last 12 hours
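The four columns defined above can be held in a small record per site. This is only a sketch: the class name, field names, and the `needs_attention` rule are hypothetical and do not come from the Panda monitor. The rule encodes one plausible reading of the columns: jobs are activated (input ready) but no pilot has reached the site in the last 3 hours.

```python
from dataclasses import dataclass


@dataclass
class SiteSummary:
    """One row of the per-site table, using the column meanings from the slide."""
    site: str
    pilots_3h: int    # pilots seen on the site in the last 3 hours
    assigned: int     # jobs whose input is not yet available at the site
    activated: int    # jobs with input available, waiting for a pilot
    failed_12h: int   # failures in the last 12 hours


def needs_attention(summary):
    """Hypothetical rule of thumb: work is waiting but no pilot has arrived."""
    return summary.activated > 0 and summary.pilots_3h == 0


# Illustrative numbers only, not taken from the presentation.
idle_site = SiteSummary("EXAMPLE-SITE-A", pilots_3h=0, assigned=5, activated=12, failed_12h=0)
busy_site = SiteSummary("EXAMPLE-SITE-B", pilots_3h=40, assigned=2, activated=12, failed_12h=3)
```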
ATLAS MC production - 2008 Eric Lançon
Statistics on FR-cloud sites • 1.636.410 jobs
Some variations between sites and within the year
Source of errors • ATLAS software errors • Should not happen at T2s • T1 only used for tests • Panda problems • Communication between pilots & database • Bugs in pilot code • Site problems • ATLAS software setup (although it has previously been checked) • Local storage problem • Archiving of results at T1 • Shipping and storing
ATTENTION – ACHTUNG • Internal ATLAS error types will be used • They have changed during the year • Example: • EXEPANDA_DQ2PUT_FILECOPYERROR • WRAPOSG_DQ2PUT_FILECOPYERROR • Same error (unable to register output file) but it will appear twice • Finally... I am not sure I understand all the errors
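Because the same failure shows up under two names, per-site error counts split it in two. A minimal sketch of merging renamed codes before counting is given below. Which of the two names is the older one is not stated on the slide, so the choice of canonical name here is arbitrary, and the `ERROR_ALIASES` table and `count_errors` helper are hypothetical.

```python
from collections import Counter

# Map each alternative name to one canonical name.
# Assumption: the canonical name is picked arbitrarily; the slide does not
# say which of the two codes came first.
ERROR_ALIASES = {
    "WRAPOSG_DQ2PUT_FILECOPYERROR": "EXEPANDA_DQ2PUT_FILECOPYERROR",
}


def count_errors(codes):
    """Count error codes, folding known aliases into their canonical name."""
    counts = Counter()
    for code in codes:
        counts[ERROR_ALIASES.get(code, code)] += 1
    return counts


# Illustrative input only; "SOME_OTHER_ERROR" is a made-up placeholder code.
sample = [
    "EXEPANDA_DQ2PUT_FILECOPYERROR",
    "WRAPOSG_DQ2PUT_FILECOPYERROR",
    "WRAPOSG_DQ2PUT_FILECOPYERROR",
    "SOME_OTHER_ERROR",
]
merged = count_errors(sample)
```

After merging, the output-registration failure is reported once with the combined count instead of appearing twice in the ranking.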
ATLAS software • Input file problem • Black hole site for a short period
Site / Site Analysis • T1 not considered • Serves as ATLAS software validation • Does reconstruction (N inputs -> 1 output), whereas T2s mainly do simulation (1 input -> 1 output)
Transfer timeout Tokyo -> T1
Pilot communication lost • Input storage • Output storage • Transfer timeout GRIF -> T1 • Killed by batch system
Output storage • LFC problem • Pilot communication lost • ATLAS software error • Killed by batch system
Missing ATLAS software • Output storage • LFC problem • Input storage • ATLAS software error
Killed by batch system • Input storage • ATLAS software error
Missing ATLAS software • Input storage • ATLAS software error
ATLAS software error • Output storage • Input storage • Pilot communication lost (/afs)
Errors per Quarter • Performed for some sites only... apologies • Some sites have almost the same errors (sometimes site independent) over the year • Some sites have different errors over the year (stability)
<ATLAS>: all ATLAS • CC-Lyon: T1 part • T2/T3: FR sites except the T1 • Job efficiency: shows all problems (configuration, inputs, output) • CPU efficiency: 80% = 20% wasted resources; shows mainly output problems • Updated 2008 version of the tables presented at the CP-LCG-France, Feb. 2008
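The slide does not define the two efficiencies, so the sketch below rests on assumptions: job efficiency is taken as successful jobs over all finished jobs, and CPU efficiency as CPU time used by successful jobs over CPU time used by all jobs. Under that reading, a job that fails at setup costs almost no CPU and only lowers job efficiency, while a job that fails when storing its output has already consumed its CPU and lowers both. The function names are hypothetical.

```python
def job_efficiency(n_success, n_total):
    """Assumed definition: fraction of finished jobs that succeeded."""
    if n_total == 0:
        raise ValueError("no finished jobs")
    return n_success / n_total


def cpu_efficiency(cpu_success, cpu_total):
    """Assumed definition: fraction of consumed CPU time spent in successful jobs."""
    if cpu_total == 0:
        raise ValueError("no CPU time consumed")
    return cpu_success / cpu_total


def wasted_fraction(cpu_eff):
    """CPU time spent in jobs that eventually failed."""
    return 1.0 - cpu_eff


# The slide's example: 80% CPU efficiency means 20% of the resources are wasted.
example_waste = wasted_fraction(cpu_efficiency(80.0, 100.0))
```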
Some conclusions • Errors are very much site dependent • Except output storage • Site errors can only be reduced by • Careful attention to ATLAS jobs on site • By someone at the site • Only big errors are spotted by ATLAS central operations and by FR-ATLAS • Others need site operations
FT tests • DDM • Weekly transfer tests • http://dashb-atlas-data-tier0.cern.ch/dashboard/request.py/site?name=&statsInterval=4&fromDate=2009-01-12%2012:40&toDate=2009-01-12%2016:40&activity=2 • MC production • Weekly, but still little used • http://dashb-atlas-prodsys-test.cern.ch/dashboard/request.py/overview?task-flag=functional%20test&period=last-9-days&grouping=cloud • Analysis • ST to be renewed regularly; frequency? • Stage-in