ATLAS DC2 seen from Prague Tier2 center - some remarks
ATLAS sw workshop, September 2004
chudoba@fzu.cz
Hardware in Prague available for ATLAS
• Golias:
• 32 dual-CPU nodes, PIII 1.13 GHz, 1 GB RAM
• upgraded since July: + 49 dual-CPU Xeon 3.06 GHz, 2 GB RAM (WN)
• 3 TB disk space reserved for ATLAS
• PBSPro batch system
• lcgatlasprod queue reserved for ATLAS VO members, high priority
• Skurut:
• 16 dual-CPU nodes, PIII 700 MHz, 1 GB RAM
• OpenPBS batch system
• queues: lcgpbs-short, long, infinite; used mainly by ATLAS
• 2 independent CEs in LCG2
Jobs waiting for input or output replication, sometimes hanging 'forever'. Example:

Job Id        Queue         User      Node       CPUTime   WallTime
34031.golias  lcgatlasprod  atlas001  golias30   03:09:28  43:30:39
34035.golias  lcgatlasprod  atlas002  golias03   04:17:38  43:19:18
34113.golias  lcgatlasprod  atlas002  golias10   03:00:41  41:52:11
34127.golias  lcgatlasprod  atlas001  golias11   04:19:11  41:21:46
34583.golias  lcgatlasprod  atlassgm  goliasx56  00:00:17  26:01:14
...

Not yet cured; running jobs on 20.9.2004:

Job Id        Queue         User      Node       CPUTime   WallTime
55162.golias  lcgatlasprod  atlassgm  goliasx42  00:00:03  102:19:45
58528.golias  lcgatlasprod  atlas001  golias02   11:22:40  11:33:13
58529.golias  lcgatlasprod  atlas001  golias03   00:00:16  11:33:49
...

Such long jobs are usually killed either by the administrator or by the PBS time limit.
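A minimal sketch of how such stuck jobs could be flagged automatically, assuming the qstat-style column layout shown above (illustrative only, not the script actually used at the site): long-running jobs whose accumulated CPU time is a negligible fraction of their wall time are reported as candidates to kill.

    def hms_to_seconds(hms):
        # Convert "HH:MM:SS" to seconds; hours may exceed 24 (e.g. 102:19:45).
        h, m, s = (int(x) for x in hms.split(":"))
        return h * 3600 + m * 60 + s

    def find_stuck_jobs(listing, max_cpu_fraction=0.05, min_wall_hours=12):
        # Yield (job_id, cpu_hours, wall_hours) for long-running jobs that
        # burn almost no CPU -- the signature of a job hung on replication.
        for line in listing.strip().splitlines()[1:]:   # [1:] skips the header
            job_id, queue, user, node, cpu, wall = line.split()
            cpu_s, wall_s = hms_to_seconds(cpu), hms_to_seconds(wall)
            if wall_s >= min_wall_hours * 3600 and cpu_s < max_cpu_fraction * wall_s:
                yield job_id, cpu_s / 3600.0, wall_s / 3600.0

    listing = """\
    Job_Id Queue User Node CPUTime WallTime
    34583.golias lcgatlasprod atlassgm goliasx56 00:00:17 26:01:14
    55162.golias lcgatlasprod atlassgm goliasx42 00:00:03 102:19:45
    58528.golias lcgatlasprod atlas001 golias02 11:22:40 11:33:13
    """
    for job_id, cpu_h, wall_h in find_stuck_jobs(listing):
        print("%s: %.1f h CPU over %.1f h wall time - candidate to kill"
              % (job_id, cpu_h, wall_h))

With the thresholds above, the first two jobs are flagged while 58528.golias (CPU time close to wall time) is left alone.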
July 1 – September 21
Number of jobs in DQ: 1349 done + 1231 failed = 2580 jobs
Number of jobs in DQ: 362 done + 572 failed = 934 jobs
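Read as done + failed = total, the two samples imply quite different failure rates; a quick check on the numbers above (the source does not say which sample belongs to which CE, so the labels here are generic):

    for label, done, failed in (("first sample", 1349, 1231),
                                ("second sample", 362, 572)):
        total = done + failed
        print("%s: %d jobs, %.0f%% failed" % (label, total, 100.0 * failed / total))
    # prints: first sample: 2580 jobs, 48% failed
    #         second sample: 934 jobs, 61% failed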
Job distribution
• almost always not enough ATLAS jobs on GOLIAS
• SKURUT usage much better
[Plot] Memory usage of ATLAS jobs on GOLIAS, July – September (part) 2004
[Plots] CPU time distributions (in hours) on Xeon 3.06 GHz, PIII 1.13 GHz and PIII 700 MHz nodes; queue limit 48 hours, later changed to 72 hours
Miscellaneous
• no job name in the local batch system – difficult to identify jobs
• no (?) documentation on where to look for log files and which logs are relevant
• jobs lost due to the CPU time limit – no warning
• jobs lost due to one misconfigured node – spotted from local logs and by Simone too
• some jobs loop forever – where should this information be sent?
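On the missing CPU-time-limit warning: a minimal sketch of what a site-local watchdog could do, assuming rows in the same (Job Id, Queue, User, Node, CPUTime, WallTime) layout as the listings above; how the rows are obtained from qstat is left aside, and the 48-hour limit is the one quoted earlier.

    QUEUE_CPU_LIMIT_H = 48.0    # lcgatlasprod limit quoted above (later 72 h)
    WARN_FRACTION = 0.9         # warn once a job has used 90% of the limit

    def hms_to_hours(hms):
        h, m, s = (int(x) for x in hms.split(":"))
        return h + m / 60.0 + s / 3600.0

    def warn_near_cpu_limit(rows):
        # rows: iterable of (job_id, queue, user, node, cputime, walltime)
        for job_id, queue, user, node, cpu, wall in rows:
            used = hms_to_hours(cpu)
            if used >= WARN_FRACTION * QUEUE_CPU_LIMIT_H:
                print("WARNING: %s (%s on %s) has used %.1f h of the %.0f h CPU limit"
                      % (job_id, user, node, used, QUEUE_CPU_LIMIT_H))

    # Hypothetical job close to the limit, for illustration only:
    warn_near_cpu_limit([("58530.golias", "lcgatlasprod", "atlas001",
                          "golias04", "44:10:00", "45:00:00")])

Run periodically (e.g. from cron), such a check would give users the warning the slides note was missing before PBS kills the job.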