Experiment support at IN2P3 Artem Trunov CC-IN2P3 artem.trunov@cc.in2p3.fr
Experiment support at IN2P3 • Site introduction • System support • Grid support • Experiment support
Introduction: Centre de Calcul of IN2P3 • Serving 40+ experiments • HEP • Astrophysics • Biomedical • 20 years of history • 40+ employees http://cc.in2p3.fr/
CC-IN2P3 – Stats • ~900 batch workers – 2200 CPUs (cores) – 3400 jobs • Dual-CPU P-IV 2.4 GHz, 2 GB RAM • Dual-CPU Xeon 2.8 GHz, 2 GB RAM • Dual-CPU dual-core Opteron 2.2 GHz, 8 GB RAM • Mass storage – HPSS • 1.6 PB total volume stored • Daily average transfer volume ~10 TB • RFIO access to disk cache, xrootd, dCache • Network • 10 Gb link to CERN since January 2006 • 1 core router (Catalyst 6500 series) • 1 Gb uplink per 24 WN (upgradable to 2x1 Gb)
LCG Tier 1 Center in France • LCG MoU, Jan 31, 2006
Systems support • A guard on duty 24 hours • Smoke and other alarm sensors; service provider visits • Telecom equipment sends an SMS on failure • Storage robotic equipment sends signals to the service provider, who acts according to the service agreement • A Unix administrator is on shift during evening hours; he can call an expert at home, but the expert is not obliged to work odd hours • Weekly reviews of problems seen during shifts • Nights and weekends are not covered, overtime is not paid • But in case of major incidents like a power failure, the system group makes every effort to bring the site back as soon as possible • During the day, routine monitoring of • Slow or hung jobs, jobs that end too quickly, jobs exceeding requested resources – a message is sent to the submitter (see the sketch below) • Storage services, for problems • Some services to end users are delegated to experiment representatives • Password resets, manipulation of job resources
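A minimal sketch of the routine job checks described above: flag jobs that look hung, end too quickly, or exceed requested resources, and notify the submitter. This is an assumption-based illustration, not the actual CC-IN2P3 tooling; the record fields, thresholds and `notify()` backend are all hypothetical.

```python
from dataclasses import dataclass

@dataclass
class JobRecord:
    job_id: str
    user: str
    wall_time_s: int        # elapsed wall-clock time
    cpu_time_s: int         # consumed CPU time
    req_cpu_time_s: int     # CPU time requested at submission
    req_memory_mb: int      # memory requested at submission
    max_memory_mb: int      # peak memory actually used
    finished: bool

MIN_USEFUL_RUNTIME_S = 120      # shorter finished jobs look like "ended too quickly"
HANG_CPU_FRACTION = 0.05        # a running job using <5% CPU looks hung

def classify(job: JobRecord) -> list[str]:
    """Return a list of problem tags for one job record."""
    problems = []
    if job.finished and job.wall_time_s < MIN_USEFUL_RUNTIME_S:
        problems.append("ended too quickly")
    if (not job.finished and job.wall_time_s > 0
            and job.cpu_time_s / job.wall_time_s < HANG_CPU_FRACTION):
        problems.append("looks hung (very low CPU efficiency)")
    if job.cpu_time_s > job.req_cpu_time_s:
        problems.append("exceeded requested CPU time")
    if job.max_memory_mb > job.req_memory_mb:
        problems.append("exceeded requested memory")
    return problems

def notify(user: str, job_id: str, problems: list[str]) -> None:
    # Placeholder: the real workflow sends a message to the submitter.
    print(f"to {user}: job {job_id}: " + "; ".join(problems))

def scan(records: list[JobRecord]) -> None:
    """Run the daytime check over a batch of job records."""
    for job in records:
        problems = classify(job)
        if problems:
            notify(job.user, job.job_id, problems)
```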
Grid and admin support at IN2P3 • Dedicated Grid team • People have mixed responsibilities • 2.5 – all middleware: CEs, LFC, etc. + operations • 1.5 – dCache + FTS • 1.5 – HPSS administration • 1 – Grid jobs support • A few people doing CIC/GOC development and support • + many people not working on the grid • Unix, network, DB, web • Not actually providing 24/7 support • A few people do too much work • Weekly meetings with a "Tour of VOs" are very useful • [Diagram: Grid team, storage, production system]
Grid and admin support at IN2P3 • But Grid people alone are not enough to make the grid work • Wasn't the Grid meant to help experiments? – yes • Are experiments happy? – no • Are Grid people happy with experiments? – no • "Atlas is the most 'grid-compatible', but it gives us a lot of problems." • "Alice is the least 'grid-compatible', but it gives us a lot of problems." • One of the answers: it is necessary to work closely with experiments • Lyon: part of the user support group is dedicated to LHC experiment support
Experiment support at CC-IN2P3 • Experiment support group – 7 people • 1 – BaBar + astrophysics • 1 – all Biomed • 1 – Alice and CMS • 1 – Atlas • 1 – D0 • 1 – CDF • 1 – general support (software etc.) • All – general support, GGUS (near future) • The plan is to have one person for each LHC experiment • All are former physicists • The director of the CC is also a physicist
Having physicists on site helps • They usually have in-depth experience with at least one experiment • They understand experiments' computing models and requirements • They are proactive in problem solving and have a broad view of experiment computing • They work to make physicists happy • They bring mutual benefit to the site and the experiment • The current level of Grid and experiment middleware really requires a lot of effort • And the grid is not getting easier, but more complex
CMS Tier1 people – who are they? • A little survey on CMS computing support at sites • Spain (PIC) – has a dedicated person • Italy (CNAF) – has a dedicated person • France (IN2P3) – yes • Germany (FZK) – no; support is provided by the physics community • SARA/NIKHEF – no • Nordic countries – terra incognita • US (FNAL) – yes • UK (RAL) – yes, but virtually no experiment support at UK Tier2 sites
What are they doing at their sites? • Making sure the site setup works (see the sketch below) • Integration work • Optimization • Site admins usually can't check themselves whether their setup works and ask the VO to test it • Reducing the "round trip time" between the experiment and site admins • Talking is much better for understanding than an email exchange • Sometimes it is just necessary to sit together in order to resolve problems • Helping site admins understand the experiment's use of the site • Especially "exotic" cases, like Alice (asks for xrootd) • This also requires a lot of iterations • Testing new solutions • Then deploying and supporting them • Administration • At Lyon, all xrootd servers are managed by user support • Grid expertise, managing VO Boxes, services (FTS)
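A minimal sketch of a VO-side "does the site setup work?" test of the kind described above: write a small file through the site's copy-in tool, read it back through the VO's access path, and compare checksums. The structure and names are assumptions for illustration, not the tooling actually used at CC-IN2P3; the `write`/`read` callables stand in for whatever transfer commands the VO wraps.

```python
import hashlib
import time

def checksum(data: bytes) -> str:
    return hashlib.md5(data).hexdigest()

def check_storage_roundtrip(write, read) -> tuple[bool, str]:
    """write(path, bytes) / read(path) wrap the site's copy-in / copy-out tools."""
    payload = f"site-test {time.time()}".encode()
    write("vo-test/probe.txt", payload)
    returned = read("vo-test/probe.txt")
    ok = checksum(returned) == checksum(payload)
    return ok, "checksum match" if ok else "checksum mismatch"

def run_checks(checks: dict) -> bool:
    """Run named zero-argument checks, print a short report, return overall status."""
    all_ok = True
    for name, check in checks.items():
        try:
            ok, detail = check()
            print(f"[{'OK' if ok else 'FAIL'}] {name}: {detail}")
            all_ok = all_ok and ok
        except Exception as exc:          # a crash in a check is also a failure
            print(f"[FAIL] {name}: {exc}")
            all_ok = False
    return all_ok
```

In practice the `checks` dictionary would hold one entry per concern raised on this slide (storage round trip, software area visibility, data access from a worker node), so the same short report can be handed back to the site admins after each iteration.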
VO support scenarios • Better "range" when the experiment expert is on site • They can interact directly with more people and systems • Integration is the keyword • [Diagrams a) and b): experiment community, Grid, Grid admins, site admins, DB, systems, storage]
At Lyon • Some extra privileges that on-site people use • Installing and supporting xrootd servers (BaBar, Alice, CMS, other experiments) – root • Managing VO boxes – root • Installing all VO software at the site – special • Co-managing FTS channels – special • Developing software (Biomed) • Debugging transfers and jobs – root • Transfers and storage really need attention!
Grid support problems • SFT (Site Functional Tests) results are not objective • A Grid service depends on some external components • E.g. a foreign RB or BDII may be at fault • Too many components are unreliable • Impossible to keep a site up according to the MoU • GGUS support is not 24h either • Select some support areas and focus on them
Focus points • In short, what does an experiment want from a site? • Keep the data • To ingest a data stream and archive it on tape • Access the data • Grid jobs must be able to read the data back • When are experiments not happy? • When transfers are failing • When jobs are failing • What causes the failures? (see the triage sketch below) • The network is stable • The local batch system is stable • The remaining suspects: ► Grid middleware • ► Storage
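A hedged sketch of the triage this slide argues for: since the network and the local batch system are stable, a failed transfer or job is first checked against the Grid middleware and the storage system. The symptom strings below are illustrative placeholders, not real gLite or dCache error messages.

```python
# Map illustrative symptom fragments to the component to investigate first.
SYMPTOM_TO_COMPONENT = {
    "srm timeout":               "storage",
    "no pool available":         "storage",
    "stage request failed":      "storage",
    "proxy expired":             "grid middleware",
    "match making failed":       "grid middleware",
    "file not found in catalog": "grid middleware",
}

def triage(error_text: str) -> str:
    """Guess which component to look at first for a failed transfer or job."""
    text = error_text.lower()
    for symptom, component in SYMPTOM_TO_COMPONENT.items():
        if symptom in text:
            return component
    return "unknown (escalate to the VO expert)"

if __name__ == "__main__":
    print(triage("Transfer aborted: SRM timeout after 180s"))   # -> storage
```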
Debugging grid jobs • Difficult • Log files are not available even to the superuser – they are on a Resource Broker • Whether it is the user application failing or the site's infrastructure is not clear without VO expertise • At Lyon, production people monitor slow jobs, jobs that end too quickly, and killed jobs • Sending a message per job to users is not scalable (see the digest sketch below)
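Since per-job messages do not scale, one option (an assumption on my part, not the production workflow at Lyon) is to aggregate all flagged jobs into a single daily digest per submitter:

```python
from collections import defaultdict

def build_digests(problem_jobs):
    """problem_jobs: iterable of (user, job_id, reason) tuples."""
    per_user = defaultdict(list)
    for user, job_id, reason in problem_jobs:
        per_user[user].append(f"  job {job_id}: {reason}")
    digests = {}
    for user, lines in per_user.items():
        digests[user] = (
            f"{len(lines)} of your jobs were flagged today:\n" + "\n".join(lines)
        )
    return digests   # one message per user instead of one per job
```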
Storage failures • Storage is another key component (transfers, data access) • Transfers are made too complex • SRM • Non-interoperable storage solutions • Exotic use cases • Site storage experts can't always debug transfers themselves • They don't have VO certificates/roles • They don't have access to the VO middleware that initiates transfers • The Storage Classes WG (T1 storage admins) is trying to • reduce complexity, • bring the terminology to common ground, • make developers understand the real use cases, • make experiments aware of the real-life solutions, • make site storage experts fully comfortable with the experiments' demands • This is all in the mutual interest • File loss is inevitable • An exchange mechanism for site–VO interactions is being developed by the Storage Classes WG initiative • But detection of file loss is still a site's responsibility! (see the consistency-check sketch below)
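Since detecting file loss falls to the site, a minimal consistency-check sketch: compare the set of files the VO catalog believes the site holds with what the storage namespace actually reports, assuming both can be dumped to flat lists of logical paths. The dump mechanism itself is hypothetical and outside the sketch.

```python
def find_lost_files(catalog_dump: set[str], storage_dump: set[str]) -> set[str]:
    """Files the VO catalog lists at this site but the storage no longer has."""
    return catalog_dump - storage_dump

def find_dark_data(catalog_dump: set[str], storage_dump: set[str]) -> set[str]:
    """Files on storage that no catalog entry points to (wasted space)."""
    return storage_dump - catalog_dump

if __name__ == "__main__":
    catalog = {"/vo/data/run1/f1.root", "/vo/data/run1/f2.root"}
    storage = {"/vo/data/run1/f1.root"}
    print("lost:", find_lost_files(catalog, storage))   # -> f2.root
    print("dark:", find_dark_data(catalog, storage))
```

The lost-file list is exactly what the exchange mechanism mentioned above would carry from the site to the VO.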
Storage • Needs serious monitoring • Pools down • Staging fails • Servers overloaded • Make storage a part of 24h support? Train operators to debug storage problems and localize the damage to an experiment (e.g. by notifying the experiments) • Is it possible to develop monitoring and a support scenario where problems are fixed before users complain? (see the sketch below) • It is a shame when a user tells an admin "You have a problem"; it should be the other way around • At Lyon we will try to expand storage expertise • Experiment support people are already involved • The VO will obviously have to train the data-taking crew on shift to recognize storage problems at sites (since exporting data from CERN is part of the initial reconstruction workflow)
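A sketch of the kind of proactive storage monitoring argued for above: periodically probe pool status, staging and server load, and alert operators before users notice. The probe callables, thresholds and alert channel are placeholders, not a description of the dCache/HPSS monitoring actually deployed at Lyon.

```python
def check_pools(list_pools) -> list[str]:
    """list_pools() -> [(pool_name, is_online)]; alert on pools that are down."""
    return [f"pool {name} is down" for name, online in list_pools() if not online]

def check_staging(stage_test_file) -> list[str]:
    """stage_test_file() stages a small test file from tape and returns seconds taken."""
    try:
        elapsed = stage_test_file()
        return [] if elapsed < 3600 else [f"staging slow: {elapsed:.0f}s"]
    except Exception as exc:
        return [f"staging failed: {exc}"]

def check_load(server_loads, threshold=0.9) -> list[str]:
    """server_loads() -> {server: load in [0,1]}; alert on overloaded servers."""
    return [f"{srv} overloaded ({load:.0%})"
            for srv, load in server_loads().items() if load > threshold]

def monitor_once(probes, alert) -> None:
    """Run all probe groups once and push any alert messages to the operators."""
    for probe in probes:
        for message in probe():
            alert(message)
```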
A little summary of what should work in Grid support • Dedicated grid team • Deep interaction with experiments, regular meetings on progress and issues • Storage/transfer monitoring • Human factor • Everything will be fine!