Experiment support at IN2P3 Artem Trunov CC-IN2P3 artem.trunov@cc.in2p3.fr
Experiment support at IN2P3 • Site introduction • System support • Grid support • Experiment support
Introduction: Centre de Calcul of IN2P3 • Serving 40+ experiments • HEP • Astrophysics • Biomedical • 20 years of history • 40+ employees http://cc.in2p3.fr/
CC-IN2P3 – Stats • ~900 batch workers – 2200 CPUs (cores) – 3400 jobs • Dual-CPU P-IV 2.4 GHz, 2 GB RAM • Dual-CPU Xeon 2.8 GHz, 2 GB RAM • Dual-CPU dual-core Opteron 2.2 GHz, 8 GB RAM • Mass storage – HPSS • 1.6 PB total volume stored • Daily average transfer volume ~10 TB • RFIO access to disk cache, xrootd, dCache • Network • 10 Gb link to CERN since January 2006 • 1 core router (Catalyst 6500 series) • 1 Gb uplink per 24 WN (upgradable to 2x1 Gb)
LCG Tier 1 Center in France • LCG MoU, Jan 31, 2006
Systems support • A guard on duty 24 hours • Smoke and other alarm sensors; service provider visits • Telecom equipment sends an SMS on failure • Storage robotic equipment sends signals to the service provider, who acts according to the service agreement • A Unix administrator is on shift during evening hours; he can call an expert at home, but the expert is not obliged to work odd hours • Weekly reviews of problems seen during shifts • Nights and weekends are not covered, overtime is not paid • But in case of major incidents like a power failure, the system group makes every effort to bring the site back as soon as possible • During the day, routine monitoring of • Slow or hung jobs, jobs that end too quickly, jobs exceeding requested resources – a message is sent to the submitter (see the sketch below) • Storage services, for problems • Some services to end users are delegated to experiment representatives • Password resets, manipulation of job resources
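A minimal sketch of the routine job checks described above: flag jobs that look hung, end too quickly, or exceed requested resources, and notify the submitter. This is an assumption-based illustration, not the actual CC-IN2P3 tooling; the record fields, thresholds and `notify()` backend are all hypothetical.

```python
from dataclasses import dataclass

@dataclass
class JobRecord:
    job_id: str
    user: str
    wall_time_s: int        # elapsed wall-clock time
    cpu_time_s: int         # consumed CPU time
    req_cpu_time_s: int     # CPU time requested at submission
    req_memory_mb: int      # memory requested at submission
    max_memory_mb: int      # peak memory actually used
    finished: bool

MIN_USEFUL_RUNTIME_S = 120      # shorter finished jobs look like "ended too quickly"
HANG_CPU_FRACTION = 0.05        # a running job using <5% CPU looks hung

def classify(job: JobRecord) -> list[str]:
    """Return a list of problem tags for one job record."""
    problems = []
    if job.finished and job.wall_time_s < MIN_USEFUL_RUNTIME_S:
        problems.append("ended too quickly")
    if (not job.finished and job.wall_time_s > 0
            and job.cpu_time_s / job.wall_time_s < HANG_CPU_FRACTION):
        problems.append("looks hung (very low CPU efficiency)")
    if job.cpu_time_s > job.req_cpu_time_s:
        problems.append("exceeded requested CPU time")
    if job.max_memory_mb > job.req_memory_mb:
        problems.append("exceeded requested memory")
    return problems

def notify(user: str, job_id: str, problems: list[str]) -> None:
    # Placeholder: the real workflow sends a message to the submitter.
    print(f"to {user}: job {job_id}: " + "; ".join(problems))

def scan(records: list[JobRecord]) -> None:
    """Run the daytime check over a batch of job records."""
    for job in records:
        problems = classify(job)
        if problems:
            notify(job.user, job.job_id, problems)
```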
Grid and admin support at IN2P3 • Dedicated Grid team • People have mixed responsibilities • 2.5 – all middleware: CEs, LFC, etc. + operations • 1.5 – dCache + FTS • 1.5 – HPSS administration • 1 – Grid jobs support • A few people doing CIC/GOC development and support • + many people not working on the grid • Unix, network, DB, web • Not actually providing 24/7 support • A few people do too much work • Weekly meetings with a "Tour of VOs" are very useful • [Diagram: Grid team, storage, production system]
Grid and admin support at IN2P3 • But Grid people alone are not enough to make the grid work • Wasn't the Grid meant to help experiments? – yes • Are experiments happy? – no • Are Grid people happy with experiments? – no • "Atlas is the most 'grid-compatible', but it gives us a lot of problems." • "Alice is the least 'grid-compatible', but it gives us a lot of problems." • One of the answers: it is necessary to work closely with experiments • Lyon: part of the user support group is dedicated to LHC experiment support
Experiment support at CC-IN2P3 • Experiment support group – 7 people • 1 – BaBar + astrophysics • 1 – all Biomed • 1 – Alice and CMS • 1 – Atlas • 1 – D0 • 1 – CDF • 1 – general support (software etc.) • All – general support, GGUS (near future) • The plan is to have one person for each LHC experiment • All are former physicists • The director of the CC is also a physicist
Having physicists on site helps • They usually have in-depth experience with at least one experiment • They understand experiments' computing models and requirements • They are proactive in problem solving and have a broad view of experiment computing • They work to make physicists happy • They bring mutual benefit to the site and the experiment • The current level of Grid and experiment middleware really requires a lot of effort • And the grid is not getting easier, but more complex
CMS Tier1 people – who are they? • A little survey on CMS computing support at sites • Spain (PIC) – has a dedicated person • Italy (CNAF) – has a dedicated person • France (IN2P3) – yes • Germany (FZK) – no; support is provided by the physics community • SARA/NIKHEF – no • Nordic countries – terra incognita • US (FNAL) – yes • UK (RAL) – yes, but virtually no experiment support at UK Tier2 sites
What are they doing at their sites? • Making sure the site setup works (see the sketch below) • Integration work • Optimization • Site admins usually can't check themselves whether their setup works and ask the VO to test it • Reducing the "round trip time" between the experiment and site admins • Talking is much better for understanding than an email exchange • Sometimes it is just necessary to sit together in order to resolve problems • Helping site admins understand the experiment's use of the site • Especially "exotic" cases, like Alice (asks for xrootd) • This also requires a lot of iterations • Testing new solutions • Then deploying and supporting them • Administration • At Lyon, all xrootd servers are managed by user support • Grid expertise, managing VO Boxes, services (FTS)
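A minimal sketch of a VO-side "does the site setup work?" test of the kind described above: write a small file through the site's copy-in tool, read it back through the VO's access path, and compare checksums. The structure and names are assumptions for illustration, not the tooling actually used at CC-IN2P3; the `write`/`read` callables stand in for whatever transfer commands the VO wraps.

```python
import hashlib
import time

def checksum(data: bytes) -> str:
    return hashlib.md5(data).hexdigest()

def check_storage_roundtrip(write, read) -> tuple[bool, str]:
    """write(path, bytes) / read(path) wrap the site's copy-in / copy-out tools."""
    payload = f"site-test {time.time()}".encode()
    write("vo-test/probe.txt", payload)
    returned = read("vo-test/probe.txt")
    ok = checksum(returned) == checksum(payload)
    return ok, "checksum match" if ok else "checksum mismatch"

def run_checks(checks: dict) -> bool:
    """Run named zero-argument checks, print a short report, return overall status."""
    all_ok = True
    for name, check in checks.items():
        try:
            ok, detail = check()
            print(f"[{'OK' if ok else 'FAIL'}] {name}: {detail}")
            all_ok = all_ok and ok
        except Exception as exc:          # a crash in a check is also a failure
            print(f"[FAIL] {name}: {exc}")
            all_ok = False
    return all_ok
```

In practice the `checks` dictionary would hold one entry per concern raised on this slide (storage round trip, software area visibility, data access from a worker node), so the same short report can be handed back to the site admins after each iteration.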
VO support scenarios • Better "range" when the experiment expert is on site • They can interact directly with more people and systems • Integration is the keyword • [Diagrams a) and b): experiment community, Grid, Grid admins, site admins, DB, systems, storage]
At Lyon • Some extra privileges that on-site people use • Installing and supporting xrootd servers (BaBar, Alice, CMS, other experiments) – root • Managing VO boxes – root • Installing all VO software at the site – special • Co-managing FTS channels – special • Developing software (Biomed) • Debugging transfers and jobs – root • Transfers and storage really need attention!
Grid support problems • SFT (Site Functional Tests) results are not objective • A Grid service depends on some external components • E.g. a foreign RB or BDII may be at fault • Too many components are unreliable • Impossible to keep a site up according to the MoU • GGUS support is not 24h either • Select some support areas and focus on them
Focus points • In short, what does an experiment want from a site? • Keep the data • To ingest a data stream and archive it on tape • Access the data • Grid jobs must be able to read the data back • When are experiments not happy? • When transfers are failing • When jobs are failing • What causes the failures? (see the triage sketch below) • The network is stable • The local batch system is stable • The remaining suspects: ► Grid middleware • ► Storage
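A hedged sketch of the triage this slide argues for: since the network and the local batch system are stable, a failed transfer or job is first checked against the Grid middleware and the storage system. The symptom strings below are illustrative placeholders, not real gLite or dCache error messages.

```python
# Map illustrative symptom fragments to the component to investigate first.
SYMPTOM_TO_COMPONENT = {
    "srm timeout":               "storage",
    "no pool available":         "storage",
    "stage request failed":      "storage",
    "proxy expired":             "grid middleware",
    "match making failed":       "grid middleware",
    "file not found in catalog": "grid middleware",
}

def triage(error_text: str) -> str:
    """Guess which component to look at first for a failed transfer or job."""
    text = error_text.lower()
    for symptom, component in SYMPTOM_TO_COMPONENT.items():
        if symptom in text:
            return component
    return "unknown (escalate to the VO expert)"

if __name__ == "__main__":
    print(triage("Transfer aborted: SRM timeout after 180s"))   # -> storage
```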
Debugging grid jobs • Difficult • Log files are not available even to the superuser – they are on a Resource Broker • Whether it is the user application failing or the site's infrastructure is not clear without VO expertise • At Lyon, production people monitor slow jobs, jobs that end too quickly, and killed jobs • Sending a message per job to users is not scalable (see the digest sketch below)
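Since per-job messages do not scale, one option (an assumption on my part, not the production workflow at Lyon) is to aggregate all flagged jobs into a single daily digest per submitter:

```python
from collections import defaultdict

def build_digests(problem_jobs):
    """problem_jobs: iterable of (user, job_id, reason) tuples."""
    per_user = defaultdict(list)
    for user, job_id, reason in problem_jobs:
        per_user[user].append(f"  job {job_id}: {reason}")
    digests = {}
    for user, lines in per_user.items():
        digests[user] = (
            f"{len(lines)} of your jobs were flagged today:\n" + "\n".join(lines)
        )
    return digests   # one message per user instead of one per job
```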
Storage failures • Storage is another key component (transfers, data access) • Transfers are made too complex • SRM • Non-interoperable storage solutions • Exotic use cases • Site storage experts can't always debug transfers themselves • They don't have VO certificates/roles • They don't have access to the VO middleware that initiates transfers • The Storage Classes WG (T1 storage admins) is trying to • reduce complexity, • bring the terminology to common ground, • make developers understand the real use cases, • make experiments aware of the real-life solutions, • make site storage experts fully comfortable with the experiments' demands • This is all in the mutual interest • File loss is inevitable • An exchange mechanism for site–VO interactions is being developed by the Storage Classes WG initiative • But detection of file loss is still a site's responsibility! (see the consistency-check sketch below)
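Since detecting file loss falls to the site, a minimal consistency-check sketch: compare the set of files the VO catalog believes the site holds with what the storage namespace actually reports, assuming both can be dumped to flat lists of logical paths. The dump mechanism itself is hypothetical and outside the sketch.

```python
def find_lost_files(catalog_dump: set[str], storage_dump: set[str]) -> set[str]:
    """Files the VO catalog lists at this site but the storage no longer has."""
    return catalog_dump - storage_dump

def find_dark_data(catalog_dump: set[str], storage_dump: set[str]) -> set[str]:
    """Files on storage that no catalog entry points to (wasted space)."""
    return storage_dump - catalog_dump

if __name__ == "__main__":
    catalog = {"/vo/data/run1/f1.root", "/vo/data/run1/f2.root"}
    storage = {"/vo/data/run1/f1.root"}
    print("lost:", find_lost_files(catalog, storage))   # -> f2.root
    print("dark:", find_dark_data(catalog, storage))
```

The lost-file list is exactly what the exchange mechanism mentioned above would carry from the site to the VO.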
Storage • Needs serious monitoring • Pools down • Staging fails • Servers overloaded • Make storage a part of 24h support? Train operators to debug storage problems and localize the damage to an experiment (e.g. by notifying the experiments) • Is it possible to develop monitoring and a support scenario where problems are fixed before users complain? (see the sketch below) • It is a shame when a user tells an admin "You have a problem"; it should be the other way around • At Lyon we will try to expand storage expertise • Experiment support people are already involved • The VO will obviously have to train the data-taking crew on shift to recognize storage problems at sites (since exporting data from CERN is part of the initial reconstruction workflow)
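A sketch of the kind of proactive storage monitoring argued for above: periodically probe pool status, staging and server load, and alert operators before users notice. The probe callables, thresholds and alert channel are placeholders, not a description of the dCache/HPSS monitoring actually deployed at Lyon.

```python
def check_pools(list_pools) -> list[str]:
    """list_pools() -> [(pool_name, is_online)]; alert on pools that are down."""
    return [f"pool {name} is down" for name, online in list_pools() if not online]

def check_staging(stage_test_file) -> list[str]:
    """stage_test_file() stages a small test file from tape and returns seconds taken."""
    try:
        elapsed = stage_test_file()
        return [] if elapsed < 3600 else [f"staging slow: {elapsed:.0f}s"]
    except Exception as exc:
        return [f"staging failed: {exc}"]

def check_load(server_loads, threshold=0.9) -> list[str]:
    """server_loads() -> {server: load in [0,1]}; alert on overloaded servers."""
    return [f"{srv} overloaded ({load:.0%})"
            for srv, load in server_loads().items() if load > threshold]

def monitor_once(probes, alert) -> None:
    """Run all probe groups once and push any alert messages to the operators."""
    for probe in probes:
        for message in probe():
            alert(message)
```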
A little summary of what should work in Grid support • Dedicated grid team • Deep interaction with experiments, regular meetings on progress and issues • Storage/transfer monitoring • Human factor • Everything will be fine!