
Operational Issues in Prague

This presentation covers the operational issues encountered in Prague during the data challenges, including the experiments and collaborating institutions, available hardware and its usage, local data challenge (DC) statistics, and network connections. Lessons learned and future plans are also discussed.


Presentation Transcript


  1. Operational Issues in Prague – Data Challenge Experience
  GDB - NIKHEF, M. Lokajicek

  2. Prague experience
  • Experiments and people
  • HW in Prague
  • Local DC statistics
  • Experience

  3. Experiments and people
  • Three institutions in Prague
    • Academy of Sciences of the Czech Republic
    • Charles University in Prague
    • Czech Technical University in Prague
  • Collaborate on experiments
    • CERN – ATLAS, ALICE, TOTEM, AUGER
    • FNAL – D0
    • BNL – STAR
    • DESY – H1
  • Collaborating community: 125 persons
    • 60 researchers
    • 43 students and PhD students
    • 22 engineers and 21 technicians
  • LCG computing staff – take care of GOLIAS and Skurut
    • Jiri Kosina – LCG SW installation, networking
    • Jiri Chudoba – ATLAS and ALICE SW and running
    • Jan Svec – HW, operating system, PBSPro, networking, D0 SW support (SAM, JIM)
    • Vlastimil Hynek – runs D0 simulations
    • Lukas Fiala – HW, networking, web

  4. Available HW in Prague – GOLIAS
  • Two independent farms in Prague
    • GOLIAS – Institute of Physics AS CR; LCG farm serving D0, ATLAS, ALICE
    • Skurut – CESNET, z.s.p.o.; EGEE preproduction farm, used for ATLAS DC
  • Sharing of resources D0:ATLAS:ALICE = 50:40:10 (see the sketch after this slide)
  • GOLIAS:
    • 80 dual-CPU nodes, 40 TB
    • 32 dual-CPU nodes PIII 1.13 GHz, 1 GB RAM
    • In July 04 + 49 dual-CPU Xeon 3.06 GHz, 2 GB RAM (WN)
    • 10 TB disk space; we use LVM to create 3 volumes of 3 TB, one per experiment, NFS-mounted on the SE
    • In July 04 + 30 TB disk space, now in tests
    • PBSPro batch system
    • 18 racks, more than half empty; 150 kW secured input electric power
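
For illustration only (not part of the original slides): a minimal sketch of how the nominal D0:ATLAS:ALICE = 50:40:10 share could be mapped onto the 80 worker nodes quoted for GOLIAS. The largest-remainder rounding and the function name are assumptions.

```python
# Hypothetical sketch: split worker nodes according to the nominal
# D0:ATLAS:ALICE = 50:40:10 fair-share policy mentioned on the slide.

def split_nodes(total_nodes: int, shares: dict[str, int]) -> dict[str, int]:
    """Distribute nodes proportionally to integer shares (largest-remainder rounding)."""
    total_share = sum(shares.values())
    exact = {exp: total_nodes * s / total_share for exp, s in shares.items()}
    alloc = {exp: int(v) for exp, v in exact.items()}
    # Hand out any remaining nodes to the largest fractional remainders.
    leftover = total_nodes - sum(alloc.values())
    for exp in sorted(exact, key=lambda e: exact[e] - alloc[e], reverse=True)[:leftover]:
        alloc[exp] += 1
    return alloc

if __name__ == "__main__":
    # 80 dual-CPU nodes on GOLIAS (from the slide); shares are the stated policy.
    print(split_nodes(80, {"D0": 50, "ATLAS": 40, "ALICE": 10}))
    # -> {'D0': 40, 'ATLAS': 32, 'ALICE': 8}
```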

  5. Available HW in Prague – Skurut
  • Located at CESNET
  • 16 dual-CPU nodes PIII 700 MHz, 1 GB RAM
  • OpenPBS batch system
  • Older, but stable: no upgrades, no development, no changes in PBS

  6. Network connection
  • General – GEANT connection
    • Gb infrastructure at GOLIAS, over 10 Gbps Metropolitan Prague backbone
    • CZ – GEANT 2.5 Gbps (over 10 Gbps HW)
    • USA 0.8 Gbps
  • Dedicated connection – provided by CESNET
    • Delivered by CESNET in collaboration with NetherLight and recently in the scope of the GLIF project
    • 1 Gbps (10 Gbps line) optical connection GOLIAS–CERN
    • Plan to provide the connection for other groups in Prague
    • Under consideration: connections to FERMILAB, RAL or Taipei
    • Independent optical connection between the collaborating institutes in Prague, to be finished by the end of 2004

  7. Local DC results

  8. ATLAS – July 1 – September 21
  • Number of jobs in DQ: 1349 done + 1231 failed = 2580 jobs, 52% done
  • Number of jobs in DQ: 362 done + 572 failed = 934 jobs, 38% done
  (the percentages are the fraction of done jobs; see the cross-check below)
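
The quoted percentages are the fraction of done jobs, done / (done + failed). A quick cross-check:

```python
# Cross-check of the success fractions quoted on the slide:
# fraction done = done / (done + failed).

def done_fraction(done: int, failed: int) -> float:
    return done / (done + failed)

print(f"{done_fraction(1349, 1231):.0%}")  # ~52%
print(f"{done_fraction(362, 572):.0%}")    # ~39% (quoted as 38% on the slide)
```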

  9. Local job distribution – GOLIAS
  • Not enough jobs
  [Plot: job distribution on GOLIAS, 2 Aug – 23 Aug, for ALICE, D0 and ATLAS]

  10. Local job distribution – Skurut
  • ATLAS jobs
  • Usage much better

  11. ATLAS – Memory usage
  [Plot: memory usage of ATLAS jobs on GOLIAS, July – September (part) 2004]

  12. ATLAS – CPU time
  [Plots: CPU time in hours on Xeon 3.06 GHz, PIII 1.13 GHz and PIII 700 MHz nodes]
  • Queue limit: 48 hours, later changed to 72 hours

  13. Statistics for 1.7.–6.10.2004
  [Plot: ATLAS – jobs distribution]

  14. ATLAS – Real and CPU time
  • Very long tail for real time – some jobs were hanging during I/O operations

  15. ATLAS – CPU and real time difference
  • No imposed time limit on ATLAS jobs, but some hanging jobs had to be killed (see the sketch below)
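
A minimal sketch, not part of the original talk, of how such hanging jobs could be flagged from batch accounting data: a job whose wall-clock (real) time greatly exceeds its CPU time is likely stuck in an I/O operation. The record layout, the 10x ratio and the one-hour floor are illustrative assumptions.

```python
# Hypothetical sketch: flag jobs that are probably hanging on I/O,
# i.e. jobs whose wall-clock (real) time is far larger than their CPU time.
# The job records, the 10x threshold and the one-hour floor are assumptions.

from dataclasses import dataclass

@dataclass
class JobRecord:
    job_id: str
    cpu_seconds: float
    real_seconds: float

def find_hanging(jobs: list[JobRecord], ratio_limit: float = 10.0,
                 min_real: float = 3600.0) -> list[JobRecord]:
    """Return jobs that ran at least `min_real` seconds of wall-clock time
    while using less than 1/ratio_limit of it as CPU time."""
    return [j for j in jobs
            if j.real_seconds >= min_real
            and j.cpu_seconds * ratio_limit < j.real_seconds]

if __name__ == "__main__":
    sample = [
        JobRecord("atlas.001", cpu_seconds=40_000, real_seconds=43_000),  # healthy
        JobRecord("atlas.002", cpu_seconds=1_200, real_seconds=90_000),   # likely hanging
    ]
    for j in find_hanging(sample):
        print(f"{j.job_id}: {j.real_seconds/3600:.1f} h real, "
              f"{j.cpu_seconds/3600:.1f} h CPU -> candidate for killing")
```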

  16. ATLAS total statistics
  • Total time used:
    • 1593 days of CPU time
    • 1829 days of real time
  • Mean usage over 90 days (see the arithmetic below):
    • 17.7 working CPUs/day
    • 20.3 used CPUs/day
  • Only jobs with CPU time > 100 s counted
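
The mean-usage figures follow from the totals by dividing by the length of the period, which gives the average number of continuously busy CPUs. A small sketch of that arithmetic, using the 90-day period and totals quoted above:

```python
# "Mean usage" is total time divided by the length of the period:
# an average number of continuously busy CPUs.

def mean_cpus(total_days: float, period_days: float) -> float:
    return total_days / period_days

print(f"{mean_cpus(1593, 90):.1f} working CPUs/day")  # 17.7 (from CPU time)
print(f"{mean_cpus(1829, 90):.1f} used CPUs/day")     # 20.3 (from real time)
```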

  17. ATLAS miscellaneous
  • No job name in the local batch system – difficult to identify jobs
  • No (?) documentation on where to look for log files and which logs are relevant
  • Lost jobs due to the CPU time limit – no warning
  • Lost jobs due to one misconfigured node – spotted from local logs and by Simone too
  • Some jobs loop forever

  18. ALICE jobs 1.7.–6.10.04

  19. ALICE

  20. ALICE

  21. ALICE total statistics
  • Total time used:
    • 2076 days of CPU time
    • 2409 days of real time
  • Mean usage over 100 days:
    • 20.7 working CPUs/day
    • 24 used CPUs/day
  • Only jobs with CPU time > 100 s counted

  22. Experience, lessons learned
  • LCG installation
    • On GOLIAS we use PBSPro; because of our modifications we install manually
    • Worker nodes – the first installation via LCFGng, then switched off
    • All other configurations and upgrades are done manually
    • In case of problems, manual installation helps to understand which intervention is needed (LCFGng is not transparent)
    • Currently installed LCG version: 2_2_0
  • Problems encountered
    • Earlier installation manuals were in PDF only; the new version is also in HTML, which enables useful copy/paste – OK
    • LCG 2_2_0 includes R-GMA – unfortunately the manual installation instructions are incomplete and not sufficient for manual configuration; the parts on Tomcat and Java security are missing

  23. Experience, lessons learned
  • PBS
    • Skurut – OpenPBS, simply configured, effectively used by one experiment only
    • GOLIAS – PBSPro
      • 3 experiments with defined proportions
      • We have problems setting the desired conditions; regular manual intervention is needed to set the number of nodes and priorities for the various queues (see the sketch after this slide)
      • We do not want nodes to sit idle if a higher-priority experiment does not send jobs
      • Already mentioned problem of pending I/O operations from which some jobs will not recover
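
For illustration only, a minimal sketch of the kind of rebalancing described above, not the configuration actually used on GOLIAS: start from the nominal D0:ATLAS:ALICE shares, cap each experiment's allocation by its actual demand, and hand otherwise idle nodes to experiments that still have jobs queued. The queued-job counts are hypothetical; in practice the demand would be read from the batch system (e.g. qstat output) and the result applied by reconfiguring the queues.

```python
# Hypothetical sketch of the manual rebalancing described on the slide:
# keep the nominal shares where there is demand, but give nodes that would
# otherwise sit idle to experiments that actually have jobs queued.

def rebalance(total_nodes: int, shares: dict[str, int],
              queued: dict[str, int]) -> dict[str, int]:
    total_share = sum(shares.values())
    # Nominal allocation from the fair-share policy, capped by actual demand.
    alloc = {exp: min(queued[exp], total_nodes * s // total_share)
             for exp, s in shares.items()}
    # Hand leftover nodes to experiments that still have a backlog.
    spare = total_nodes - sum(alloc.values())
    for exp in sorted(shares, key=shares.get, reverse=True):
        extra = min(spare, queued[exp] - alloc[exp])
        alloc[exp] += extra
        spare -= extra
    return alloc

if __name__ == "__main__":
    shares = {"D0": 50, "ATLAS": 40, "ALICE": 10}   # nominal policy from slide 4
    queued = {"D0": 0, "ATLAS": 200, "ALICE": 30}   # hypothetical backlog
    print(rebalance(80, shares, queued))
    # -> {'D0': 0, 'ATLAS': 72, 'ALICE': 8}
```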
