
Operational Issues in Prague

This presentation covers the operational issues encountered in Prague during the data challenges, including the experiments and collaborating institutions, available hardware and its usage, local data challenge (DC) statistics, and network connections. Lessons learned and future plans are also discussed.


Presentation Transcript


  1. Operational Issues in Prague – Data Challenge Experience
  GDB - NIKHEF, M. Lokajicek

  2. Prague experience
  • Experiments and people
  • HW in Prague
  • Local DC statistics
  • Experience

  3. Experiments and people
  • Three institutions in Prague
    • Academy of Sciences of the Czech Republic
    • Charles University in Prague
    • Czech Technical University in Prague
  • Collaborate on experiments
    • CERN – ATLAS, ALICE, TOTEM, AUGER
    • FNAL – D0
    • BNL – STAR
    • DESY – H1
  • Collaborating community: 125 persons
    • 60 researchers
    • 43 students and PhD students
    • 22 engineers and 21 technicians
  • LCG computing staff – take care of GOLIAS and Skurut
    • Jiri Kosina – LCG SW installation, networking
    • Jiri Chudoba – ATLAS and ALICE SW and running
    • Jan Svec – HW, operating system, PBSPro, networking, D0 SW support (SAM, JIM)
    • Vlastimil Hynek – runs D0 simulations
    • Lukas Fiala – HW, networking, web

  4. Available HW in Prague – GOLIAS
  • Two independent farms in Prague
    • GOLIAS – Institute of Physics AS CR; LCG farm serving D0, ATLAS, ALICE
    • Skurut – CESNET, z.s.p.o.; EGEE preproduction farm, used for ATLAS DC
  • Sharing of resources D0:ATLAS:ALICE = 50:40:10 (see the sketch after this slide)
  • GOLIAS:
    • 80 dual-CPU nodes, 40 TB
    • 32 dual-CPU nodes PIII 1.13 GHz, 1 GB RAM
    • In July 04 + 49 dual-CPU Xeon 3.06 GHz, 2 GB RAM (WN)
    • 10 TB disk space; we use LVM to create 3 volumes of 3 TB, one per experiment, NFS-mounted on the SE
    • In July 04 + 30 TB disk space, now in tests
    • PBSPro batch system
    • 18 racks, more than half empty; 150 kW secured input electric power
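
For illustration only (not part of the original slides): a minimal sketch of how the nominal D0:ATLAS:ALICE = 50:40:10 share could be mapped onto the 80 worker nodes quoted for GOLIAS. The largest-remainder rounding and the function name are assumptions.

```python
# Hypothetical sketch: split worker nodes according to the nominal
# D0:ATLAS:ALICE = 50:40:10 fair-share policy mentioned on the slide.

def split_nodes(total_nodes: int, shares: dict[str, int]) -> dict[str, int]:
    """Distribute nodes proportionally to integer shares (largest-remainder rounding)."""
    total_share = sum(shares.values())
    exact = {exp: total_nodes * s / total_share for exp, s in shares.items()}
    alloc = {exp: int(v) for exp, v in exact.items()}
    # Hand out any remaining nodes to the largest fractional remainders.
    leftover = total_nodes - sum(alloc.values())
    for exp in sorted(exact, key=lambda e: exact[e] - alloc[e], reverse=True)[:leftover]:
        alloc[exp] += 1
    return alloc

if __name__ == "__main__":
    # 80 dual-CPU nodes on GOLIAS (from the slide); shares are the stated policy.
    print(split_nodes(80, {"D0": 50, "ATLAS": 40, "ALICE": 10}))
    # -> {'D0': 40, 'ATLAS': 32, 'ALICE': 8}
```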

  5. Available HW in Prague – Skurut
  • Located at CESNET
  • 16 dual-CPU nodes PIII 700 MHz, 1 GB RAM
  • OpenPBS batch system
  • Older, but stable: no upgrades, no development, no changes in PBS

  6. Network connection
  • General – GEANT connection
    • Gb infrastructure at GOLIAS, over 10 Gbps Metropolitan Prague backbone
    • CZ – GEANT 2.5 Gbps (over 10 Gbps HW)
    • USA 0.8 Gbps
  • Dedicated connection – provided by CESNET
    • Delivered by CESNET in collaboration with NetherLight and recently in the scope of the GLIF project
    • 1 Gbps (10 Gbps line) optical connection GOLIAS–CERN
    • Plan to provide the connection for other groups in Prague
    • Under consideration: connections to FERMILAB, RAL or Taipei
    • Independent optical connection between the collaborating institutes in Prague, to be finished by the end of 2004

  7. Local DC results

  8. ATLAS – July 1 – September 21
  • Number of jobs in DQ: 1349 done + 1231 failed = 2580 jobs, 52% done
  • Number of jobs in DQ: 362 done + 572 failed = 934 jobs, 38% done
  (the percentages are the fraction of done jobs; see the cross-check below)
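
The quoted percentages are the fraction of done jobs, done / (done + failed). A quick cross-check:

```python
# Cross-check of the success fractions quoted on the slide:
# fraction done = done / (done + failed).

def done_fraction(done: int, failed: int) -> float:
    return done / (done + failed)

print(f"{done_fraction(1349, 1231):.0%}")  # ~52%
print(f"{done_fraction(362, 572):.0%}")    # ~39% (quoted as 38% on the slide)
```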

  9. Local job distribution – GOLIAS
  • Not enough jobs
  [Plot: job distribution on GOLIAS, 2 Aug – 23 Aug, for ALICE, D0 and ATLAS]

  10. Local job distribution – Skurut
  • ATLAS jobs
  • Usage much better

  11. ATLAS – Memory usage
  [Plot: memory usage of ATLAS jobs on GOLIAS, July – September (part) 2004]

  12. ATLAS – CPU time
  [Plots: CPU time in hours on Xeon 3.06 GHz, PIII 1.13 GHz and PIII 700 MHz nodes]
  • Queue limit: 48 hours, later changed to 72 hours

  13. Statistics for 1.7.–6.10.2004
  [Plot: ATLAS – jobs distribution]

  14. ATLAS – Real and CPU time
  • Very long tail for real time – some jobs were hanging during I/O operations

  15. ATLAS – CPU and real time difference
  • No imposed time limit on ATLAS jobs, but some hanging jobs had to be killed (see the sketch below)
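
A minimal sketch, not part of the original talk, of how such hanging jobs could be flagged from batch accounting data: a job whose wall-clock (real) time greatly exceeds its CPU time is likely stuck in an I/O operation. The record layout, the 10x ratio and the one-hour floor are illustrative assumptions.

```python
# Hypothetical sketch: flag jobs that are probably hanging on I/O,
# i.e. jobs whose wall-clock (real) time is far larger than their CPU time.
# The job records, the 10x threshold and the one-hour floor are assumptions.

from dataclasses import dataclass

@dataclass
class JobRecord:
    job_id: str
    cpu_seconds: float
    real_seconds: float

def find_hanging(jobs: list[JobRecord], ratio_limit: float = 10.0,
                 min_real: float = 3600.0) -> list[JobRecord]:
    """Return jobs that ran at least `min_real` seconds of wall-clock time
    while using less than 1/ratio_limit of it as CPU time."""
    return [j for j in jobs
            if j.real_seconds >= min_real
            and j.cpu_seconds * ratio_limit < j.real_seconds]

if __name__ == "__main__":
    sample = [
        JobRecord("atlas.001", cpu_seconds=40_000, real_seconds=43_000),  # healthy
        JobRecord("atlas.002", cpu_seconds=1_200, real_seconds=90_000),   # likely hanging
    ]
    for j in find_hanging(sample):
        print(f"{j.job_id}: {j.real_seconds/3600:.1f} h real, "
              f"{j.cpu_seconds/3600:.1f} h CPU -> candidate for killing")
```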

  16. ATLAS total statistics
  • Total time used:
    • 1593 days of CPU time
    • 1829 days of real time
  • Mean usage over 90 days (see the arithmetic below):
    • 17.7 working CPUs/day
    • 20.3 used CPUs/day
  • Only jobs with CPU time > 100 s counted
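
The mean-usage figures follow from the totals by dividing by the length of the period, which gives the average number of continuously busy CPUs. A small sketch of that arithmetic, using the 90-day period and totals quoted above:

```python
# "Mean usage" is total time divided by the length of the period:
# an average number of continuously busy CPUs.

def mean_cpus(total_days: float, period_days: float) -> float:
    return total_days / period_days

print(f"{mean_cpus(1593, 90):.1f} working CPUs/day")  # 17.7 (from CPU time)
print(f"{mean_cpus(1829, 90):.1f} used CPUs/day")     # 20.3 (from real time)
```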

  17. ATLAS miscellaneous
  • No job name in the local batch system – difficult to identify jobs
  • No (?) documentation on where to look for log files and which logs are relevant
  • Lost jobs due to the CPU time limit – no warning
  • Lost jobs due to one misconfigured node – spotted from local logs and by Simone too
  • Some jobs loop forever

  18. ALICE jobs 1.7.–6.10.04

  19. ALICE

  20. ALICE

  21. ALICE total statistics
  • Total time used:
    • 2076 days of CPU time
    • 2409 days of real time
  • Mean usage over 100 days:
    • 20.7 working CPUs/day
    • 24 used CPUs/day
  • Only jobs with CPU time > 100 s counted

  22. Experience, lessons learned
  • LCG installation
    • On GOLIAS we use PBSPro; because of our modifications we install manually
    • Worker nodes – the first installation via LCFGng, then switched off
    • All other configurations and upgrades are done manually
    • In case of problems, manual installation helps to understand which intervention is needed (LCFGng is not transparent)
    • Currently installed LCG version: 2_2_0
  • Problems encountered
    • Earlier installation manuals were in PDF only; the new version is also in HTML, which enables useful copy/paste – OK
    • LCG 2_2_0 includes R-GMA – unfortunately the manual installation instructions are incomplete and not sufficient for manual configuration; the parts on Tomcat and Java security are missing

  23. Experience, lessons learned
  • PBS
    • Skurut – OpenPBS, simply configured, effectively used by one experiment only
    • GOLIAS – PBSPro
      • 3 experiments with defined proportions
      • We have problems setting the desired conditions; regular manual intervention is needed to set the number of nodes and priorities for the various queues (see the sketch after this slide)
      • We do not want nodes to sit idle if a higher-priority experiment does not send jobs
      • Already mentioned problem of pending I/O operations from which some jobs will not recover
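
For illustration only, a minimal sketch of the kind of rebalancing described above, not the configuration actually used on GOLIAS: start from the nominal D0:ATLAS:ALICE shares, cap each experiment's allocation by its actual demand, and hand otherwise idle nodes to experiments that still have jobs queued. The queued-job counts are hypothetical; in practice the demand would be read from the batch system (e.g. qstat output) and the result applied by reconfiguring the queues.

```python
# Hypothetical sketch of the manual rebalancing described on the slide:
# keep the nominal shares where there is demand, but give nodes that would
# otherwise sit idle to experiments that actually have jobs queued.

def rebalance(total_nodes: int, shares: dict[str, int],
              queued: dict[str, int]) -> dict[str, int]:
    total_share = sum(shares.values())
    # Nominal allocation from the fair-share policy, capped by actual demand.
    alloc = {exp: min(queued[exp], total_nodes * s // total_share)
             for exp, s in shares.items()}
    # Hand leftover nodes to experiments that still have a backlog.
    spare = total_nodes - sum(alloc.values())
    for exp in sorted(shares, key=shares.get, reverse=True):
        extra = min(spare, queued[exp] - alloc[exp])
        alloc[exp] += extra
        spare -= extra
    return alloc

if __name__ == "__main__":
    shares = {"D0": 50, "ATLAS": 40, "ALICE": 10}   # nominal policy from slide 4
    queued = {"D0": 0, "ATLAS": 200, "ALICE": 30}   # hypothetical backlog
    print(rebalance(80, shares, queued))
    # -> {'D0': 0, 'ATLAS': 72, 'ALICE': 8}
```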
