This overview covers service planning and current status, capacities, networking, personnel, monitoring, hardware and software, middleware, and service at the Prague Tier-2, as presented at the GDB meeting.
Service planning and monitoring in T2 Prague • GDB meeting, BNL • Milos Lokajicek
Overview • Introduction • Service planning and current status • Capacities • Networking • Personnel • Monitoring • HW and SW • Middleware • Service • Remarks
Introduction • Czech Republic's LHC activities • ATLAS, target 3% of authors -> activities • ALICE, target 1% • TOTEM, a much smaller experiment, relatively higher target • (non-LHC – HERA/H1, TEVATRON/D0, AUGER) • Institutions (mentioning just the big groups) • Academy of Sciences of the Czech Republic • Institute of Physics • Nuclear Physics Institute • Charles University in Prague • Faculty of Mathematics and Physics • Czech Technical University in Prague • Faculty of Nuclear Sciences and Physical Engineering • HEP manpower (2005) • 145 people • 59 physicists • 22 engineers • 21 technicians • 43 undergraduate and PhD students
Service planning • Table based on the LCG MoU for ATLAS and ALICE and our anticipated share • Project proposals to various grant systems in the Czech Republic • Preparing a bigger project proposal for a CZ GRID together with CESNET • For the LHC needs • In 2010 add 3x more capacity for Czech non-HEP scientists, financed from state resources and EU structural funds • All proposals include new personnel (up to 10 new persons) • Today, regular financing, sufficient for D0 • today 250 cores, 150 kSI2k, 40 TB disk space, no tapes
Networking • Local connection of institutes in Prague • Optical 1 Gbps E2E lines • WAN • Optical E2E lines to Fermilab and Taipei, new to FZK (from 1 Sept 06) • Connection Prague – Amsterdam now through GN2 • Planning further lines to other T1s • Sima @ CEF Networks workshop, Prague, May 30th, 2006
Personnel • Now 4 persons to run the T2 • Jiri Kosina – middleware (leaving, looking for replacement), storage (FTS), monitoring • Tomas Kouba – middleware, monitoring • Jan Svec – basic HW, OS, storage, networking, monitoring • Lukas Fiala – basic HW, networking, web services • Jiri Chudoba – liaison to ATLAS and ALICE, running the jobs and reporting errors, service monitoring • Further information is based on their experience
Monitoring • HW and basic SW • installation and test of new hardware: normally choose proven HW, installation by the delivery firm, install the operating system and solve problems with the delivery firm, install middleware, test it for some time outside the production service • Nagios: worker node access via ping, disks – how full the partitions are, load average, whether the pbs_mom process is running, number of running processes, whether the ssh daemon is running, how full the swap is, … (a sketch of such checks follows below) • Limits for warning and error • Distribution of mails or SMS to admins – fixing problems remotely • Regular check of the Nagios web page for red dots • Regular automatic (cron) checks and restarts for some daemons
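A minimal Python sketch of such node checks (illustrative only; the thresholds, partition and daemon names are assumptions, not the actual Prague Nagios configuration):

  #!/usr/bin/env python
  # Sketch of Nagios-style node checks: disk fullness, load average and a
  # running-daemon test, with warning/error limits (values are assumptions).
  import os
  import shutil
  import subprocess

  WARN_DISK, CRIT_DISK = 0.80, 0.95      # assumed fill-level limits
  WARN_LOAD, CRIT_LOAD = 4.0, 8.0        # assumed 1-minute load limits

  def check_disk(path="/"):
      usage = shutil.disk_usage(path)
      frac = usage.used / usage.total
      return ("CRIT" if frac > CRIT_DISK else
              "WARN" if frac > WARN_DISK else "OK"), frac

  def check_load():
      load1 = os.getloadavg()[0]
      return ("CRIT" if load1 > CRIT_LOAD else
              "WARN" if load1 > WARN_LOAD else "OK"), load1

  def check_daemon(name):
      # pgrep exits with 0 if at least one matching process is running
      running = subprocess.call(["pgrep", "-x", name],
                                stdout=subprocess.DEVNULL) == 0
      return ("OK" if running else "CRIT"), running

  if __name__ == "__main__":
      for label, (status, value) in [("disk /", check_disk("/")),
                                     ("load", check_load()),
                                     ("pbs_mom", check_daemon("pbs_mom")),
                                     ("sshd", check_daemon("sshd"))]:
          print(f"{label}: {status} ({value})")

In practice the equivalent Nagios plugins report the same OK/WARNING/CRITICAL states, which is what drives the mail/SMS notifications mentioned above.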
Monitoring • PBS • job count (via RRD and mrtg) • Local tools for monitoring the number of jobs per machine per chosen period (sketch below) • APEL • not very useful, might be set up for more useful info • GridICE • ATLAS • Checks and statistics from the ATLAS database • ALICE – MonALISA – very useful • Monitoring of pool accounts and actual user certificates • Networking • Network traffic to FZK, SARA, CERN in a certain IP range • With the help of IP accounting (utility ipac-ng): http://golias100.farm.particle.cz/ipac/ • SFT – site functional tests – very useful
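As an illustration of such a local job-count tool, a short Python sketch (assuming the default column layout of the Torque/PBS qstat command; not the actual Prague scripts) that tallies jobs per queue and per state, the kind of numbers fed into the RRD/mrtg graphs:

  #!/usr/bin/env python
  # Sketch of a local PBS accounting helper: counts jobs per queue and per
  # state by parsing the default `qstat` listing (layout is an assumption).
  import subprocess
  from collections import Counter

  def pbs_job_counts():
      out = subprocess.run(["qstat"], capture_output=True, text=True).stdout
      per_queue, per_state = Counter(), Counter()
      for line in out.splitlines():
          fields = line.split()
          # data lines look like: Job-id  Name  User  TimeUse  S  Queue
          if len(fields) == 6 and fields[4] in ("R", "Q", "E", "H"):
              per_state[fields[4]] += 1
              per_queue[fields[5]] += 1
      return per_queue, per_state

  if __name__ == "__main__":
      queues, states = pbs_job_counts()
      print("running:", states["R"], "queued:", states["Q"])
      for queue, n in queues.items():
          print(queue, n)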
[MRTG traffic graphs] outgoing to fzk1: Max 37M, Average 6M, Total 129G • outgoing to internet: Max 61M, Average 8M, Total 164G
Updates and patches • YAIM + automated updates on all farm nodes using the simple BEX script toolkit (takes care of upgrading a node that was switched off at the deployment/upgrade phase ... keeps all nodes in sync automatically; see the sketch below) • ftp://atrey.karlin.mff.cuni.cz/pub/local/mj/linux/bex-2.0.tar.gz, info in the README file
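The idea behind keeping all nodes in sync can be sketched roughly as follows (purely illustrative Python, not BEX itself; the state-file path, node names and update command are hypothetical): remember which nodes already ran a given update and keep retrying the rest, so a machine that was switched off catches up automatically when it comes back.

  #!/usr/bin/env python
  # Illustrative sketch of the "keep all nodes in sync" idea (not BEX):
  # record which nodes have run an update, retry the pending ones over ssh.
  import json
  import subprocess

  STATE_FILE = "/var/lib/farm-updates/state.json"   # hypothetical path
  NODES = ["wn%03d" % i for i in range(1, 11)]       # hypothetical node names

  def load_state():
      try:
          with open(STATE_FILE) as f:
              return json.load(f)
      except FileNotFoundError:
          return {}

  def save_state(state):
      with open(STATE_FILE, "w") as f:
          json.dump(state, f)

  def apply_update(update_id, command):
      state = load_state()
      done = set(state.get(update_id, []))
      for node in NODES:
          if node in done:
              continue
          # run the update over ssh; a node that is down simply stays pending
          if subprocess.call(["ssh", node, command]) == 0:
              done.add(node)
      state[update_id] = sorted(done)
      save_state(state)
      return [n for n in NODES if n not in done]

  if __name__ == "__main__":
      pending = apply_update("example-update", "yum -y update")
      print("still pending:", pending)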
Service monitoring • Using the checks described above and their combinations • Rely on useful monitors supported centrally or by the experiments • We would appreciate an early warning if jobs on some site/worker nodes start to fail quickly after submission (a possible rule is sketched below) • Service requirements for T2s in “extended” working hours • No special plan today • Try to provide an architecture so that the responsible people can even travel and do as much as possible remotely (e.g. network console access) • Future computing capacities will probably require new arrangements
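One possible shape of such an early-warning rule, as a hedged Python sketch (the job-record format and the thresholds are assumptions, not an existing LCG or experiment tool): flag a worker node when most of its recent jobs fail within a few minutes of starting.

  #!/usr/bin/env python
  # Sketch of an early-warning rule for "jobs fail quickly after submission"
  # (job records as dicts and all thresholds are illustrative assumptions).
  from collections import defaultdict

  QUICK_FAIL_SECONDS = 300   # assumed: a failure under 5 minutes is "quick"
  MIN_JOBS = 10              # assumed: need enough jobs before alarming
  FAIL_FRACTION = 0.8        # assumed: alarm when 80% of them fail quickly

  def suspicious_nodes(jobs):
      """jobs: iterable of dicts with 'node', 'status', 'runtime' (seconds)."""
      total, quick_fail = defaultdict(int), defaultdict(int)
      for job in jobs:
          total[job["node"]] += 1
          if job["status"] == "failed" and job["runtime"] < QUICK_FAIL_SECONDS:
              quick_fail[job["node"]] += 1
      return [node for node, n in total.items()
              if n >= MIN_JOBS and quick_fail[node] / n >= FAIL_FRACTION]

  if __name__ == "__main__":
      # tiny synthetic example: one healthy node, one "black hole" node
      sample = ([{"node": "wn01", "status": "done", "runtime": 5000}] * 12 +
                [{"node": "wn02", "status": "failed", "runtime": 30}] * 12)
      print("warn on:", suspicious_nodes(sample))   # -> ['wn02']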