Analysis Operations Post-Mortem: Report from 2009 beam data and overall 2009 experience
Stefano Belforte, Frank Wuerthwein, James Letts, Sanjay Padhi, Dave L. Evans, Marco Calloni
Summary
• 1. Metrics: facts
• 2. Operational experience: subjective report
• 3. Concerns & lessons: our opinions
• 4. Conclusions: my very personal opinions
1. Metrics
Metrics report from James Letts (1)
Metrics report from James Letts (2)
Metrics report from James Letts (3)
Metrics report from James Letts (4)
Metrics report from James Letts (5)
Metrics: CAF (only batch work is monitored)
• Load from Lemon over the last year: 114 hosts, 8 cores each ≈ 900 batch slots
• Users from the Dashboard, Nov 1 to Jan 1
• CAF CRAB server, last 2 months: 160M CPU seconds (~30 slots busy per clock hour), 30K jobs, ~500 jobs/day (quick cross-check below)
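A quick cross-check of the CAF CRAB-server numbers above, assuming the two-month Dashboard window is Nov 1 to Jan 1 (~61 days); this is only a back-of-the-envelope sketch of how the quoted "~30 slots" and "~500 jobs/day" follow from the totals:

```python
# Back-of-the-envelope check of the CAF CRAB-server metrics quoted above.
# Assumes the "last 2 months" window is Nov 1 - Jan 1, i.e. ~61 days.
cpu_seconds = 160e6             # total CPU time used in the window
total_jobs  = 30e3              # jobs submitted through the CAF CRAB server
days        = 61
wall_clock  = days * 24 * 3600  # wall-clock seconds in the window

print("average busy slots: %.1f (out of ~900)" % (cpu_seconds / wall_clock))  # ~30
print("jobs per day:       %.0f" % (total_jobs / days))                       # ~500
```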
2. Operations Report
Operations: Support
• Little data?
  - Users' activity is not a visible load
  - Lessons in this area still come from OctX
• Much MC, largely a success in this area
  - MC samples felt to be of proper size and available as needed
  - sort of too many; difficult to know which one to use
  - meta-information in DBS does not appear adequate yet
  - people work out of dataset lists in twikis and mails
• CRAB support load still too high
  - Requires (good) expert support
  - No idea yet how effective new people can be
• Most questions are not solved by pointing to documentation
  - good: documentation and tutorials are good and people use them (a minimal configuration of the kind they cover is sketched below)
  - bad: few threads are closed quickly; we often need to reproduce, debug and pass to developers, and even within our group we need to consult each other and make mistakes
• Many questions fall in the category: a better message from the tool would prevent users from asking again
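For context, the kind of submission the tutorials cover is driven by a small crab.cfg; a minimal sketch follows, assuming CRAB2-era parameter names, with an illustrative dataset name and output file (not a recommendation of a specific sample or workflow):

```python
# Minimal sketch of the crab.cfg behind a typical tutorial-level analysis job.
# CRAB2-era parameter names; the dataset name and output file are illustrative only.
CRAB_CFG = """
[CRAB]
jobtype   = cmssw
scheduler = glite

[CMSSW]
datasetpath            = /MinimumBias/BeamCommissioning09-PromptReco-v2/RECO
pset                   = analysis_cfg.py
total_number_of_events = -1
events_per_job         = 50000
output_file            = histos.root

[USER]
return_data = 1
"""

with open('crab.cfg', 'w') as handle:
    handle.write(CRAB_CFG)
```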
Operations: CAF
• Users relied on the CAF in 2009
  - Little data: ntuples could be made on the CAF and replicated around by "all sorts of means". The CAF does not appear steadily overloaded.
  - in a pattern of one-day-beam, few-days-no-beam that may continue
  - may be painful to wean users from it in a rush under pressure
  - many people use the grid daily and we know it works, but it takes a bit of getting used to
  - use cases appeared for moving data from the CAF to T2's (MC so far; expect data and calibration next)
• The CAF is by design separated from the distributed infrastructure
  - But it is hard to explain to users that "data can't move between CERN disks and offsite T2 disks" and that jobs cannot be submitted to the CAF unless logged in on lxplus (political limitations, not technical; other experiments do these things)
  - difficult to defend policies that make little technical sense
  - grid access to the CAF would allow users to use the same /store/result instance (we only have ~1 FTE overall to operate the service)
Operations: CAF CRAB Server
• The CRAB server for general CAF users is not really "like the grid one, but with AFS"
  - requires dedicated testing/validation and support
  - operations deals mostly with failures, and they are different here
  - the dedicated development effort is not addressed here
• Double the work, add <10% of the resources
• It does not hide the "unreliable grid", but the highly reliable local batch system
  - CERN may go down, but it is unlikely that lxplus works while lxbatch does not; on the grid, N sites go up and down as a habit
  - On the grid the problem is to make the grid invisible; here it is to make the CRAB server invisible
  - On the grid we look forward to pilot jobs to overlay our own global batch scheduler over the N sites, making them look like a single batch system. The CAF has that already via LSF. What do we really gain with the server?
• De facto last priority for support now
  - Fill-up of the local disk was detected by ops, not users. Nobody complained when the server was turned off for a week to address it.
Operations: changes
• Transitions
  - changes in offline (SL5, Python 2.6, ...) are often felt to be done once CMSSW is out, but the transition in CRAB and in the overall infrastructure needs to be taken into account more fully
  - Last December we had to throw in a new CRAB release in a hurry to cope with SL5; grid integration issues were solved ~on the last day (with a recipe from operations) and led to "horror code". DPM sites are still not working with the CMS distribution and rely on an unofficial patch.
• The new version is always better, but we operate the current one
  - Backporting fixes to the production release is expensive for developers
  - Answering the same problem day after day is expensive for us
• Adding servers, falling back, resilience
  - Long lead time to get new servers in, especially at CERN
  - Need good advance planning, no sudden changes, and robust servers. CRAB server failures are not transparent and take ~half of the CMS workload with them. UPS?
3. Concerns
Concerns: Data Distribution
• Data distribution largely a success
  - but we have not dealt with cleanup and working with "full sites" yet
  - are space bookkeeping tools adequate?
• Users' output data stageout is still a major problem, and likely to remain so for a while (the settings involved are sketched below)
  - Effort on new versions, validation, support, documentation etc. delayed the deploy/test campaign on /store/temp/user
  - Have to catch up fast
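To make the stageout discussion concrete, below is a hedged sketch of the [USER] fragment a remote-stageout job carries, using CRAB2-era parameter names; the storage element and remote directory are placeholders, and the exact /store/temp/user layout was still being deployed and tested at the time:

```python
# Hedged sketch of the stageout-related crab.cfg fragment (CRAB2-era names).
# The storage element and remote directory are placeholders, not a recipe.
STAGEOUT_FRAGMENT = """
[USER]
return_data     = 0               # do not bring output back in the sandbox
copy_data       = 1               # stage out to a remote storage element
storage_element = T2_XX_Example   # placeholder site name
user_remote_dir = myanalysis09    # placeholder directory under the user area
"""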
Concerns: need to reduce the daily support load
• Software is 10% coding and 90% support
  - CRAB is very powerful and flexible; it touches everything from CMSSW to DBS to the grid, from MC production to data splitting and calibration access
  - We cannot offer expert advice and debugging on all currently implemented features, and cannot provide help on use cases we have not explored ourselves
  - We have already started to cut, e.g. multicrab
  - We will make a list of what we are comfortable in knowing how to use, and only support that
• Could use a better separation of the CMSSW/DBS/CRAB/grid parts (true for us, for users, and possibly for developers)
  - Clean segmentation allows problems to be attached to a specific area
• We need to become able to operate in the mode: "watch it while it runs and figure out overall patterns"
Concerns: CAF
• We don't see how we can give it the needed level of support with the planned resources
• Provocative proposals follow. So far the CAF server approach has been used for two very different things:
  - Incubator for workflows that will move to T0 (AlCa, express, ...): predicted/predictable load, few workflows, few people, more development/integration than operations → hand it over
  - General tool for the CAF community to do all kinds of CRAB work, enticing users to move from bsub to crab -submit: we would have to support all possible kinds of load in a time-critical way, with a continuous self-inflicted DoS risk → drop the server and only support direct CRAB-to-LSF submission (see the sketch below)
• (A piece of) the CAF should be integrated with the distributed infrastructure, usable with the same tools and running the same workflows:
  - submit with the same CRAB server
  - use the same /store/result instance
  - access to PhEDEx transfers
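For concreteness, the difference for a CAF user amounts to a couple of configuration lines; a hedged sketch, assuming the CRAB2-era "caf" scheduler and "use_server" switch behave as described here:

```python
# Hedged sketch: the same CAF job with and without the CRAB server in between.
# Assumes CRAB2-era parameter names ('scheduler', 'use_server').
WITH_SERVER = """
[CRAB]
jobtype    = cmssw
scheduler  = caf   # submit through LSF on the CAF
use_server = 1     # route the job through the CAF CRAB server
"""

DIRECT_TO_LSF = """
[CRAB]
jobtype    = cmssw
scheduler  = caf   # submit through LSF on the CAF
use_server = 0     # direct crab-to-LSF submission, no server
"""
```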
Concerns: too many features
• Each small "addition" generates more work forever
  - however useful and well motivated
• We would like to focus on what is needed, not on what is nice
  - uncomfortable making this selection ourselves, but of course ready to do so
• Feature requests (especially new ones) should be championed by some (group of) users providing clear use scenarios relevant in the short term, and willing to take on the education of the community
  - Dumping something in our lap that we in Analysis Ops are not familiar with is not likely to result in user satisfaction
  - We cannot be/become experts on all possible workflows, from user-produced MC to DQM
  - The simple "give me your script and I'll make it run on the grid in N copies" is still more work than we can deal with at the moment
Concerns: changes
• Changes in offline
  - Changes in offline do not stop at CMSSW: SL5, new Python, new ROOT, new ... all also affect CRAB and running at remote sites
  - Making CRAB work and making CMSSW work at all CMS Tier-2's should be a prerequisite for release validation, not a race by operations and CRAB developers after the fact
• Changes in CRAB
  - As seen in other projects, developers think in terms of the next (and next-to-next) release, while we need to operate with the current one (and are asked to validate the next)
  - A concern mostly because of the large looming transition to WMAgent, with timing, duration and functionality changes unknown (to us)
  - What features/fixes could/should be backported to the production release instead?
Concerns: user MC
• We see an extraordinary amount of user-made MC (a typical configuration is sketched below)
  - Support for large workflows is difficult; it is also difficult to predict and control side effects on other users
  - Sets a bad example/habit of "DataOps will take long, so let's do it ourselves"
• Why is this not going via DataOps?
  - Communication issue? Perception of bureaucratic overload? Private code? Unsupported generators?
• We are not prepared/equipped to offer good support for workflows that are not based on datasets located at T2's
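For reference, a private-production workflow of the kind we see looks roughly like the sketch below: no input dataset, events-based splitting, publication of the output. This is a hedged sketch with CRAB2-era parameter names; the generator pset, event counts and publication name are placeholders:

```python
# Hedged sketch of a typical "user MC" crab.cfg (CRAB2-era parameter names).
# The generator pset, event counts and publication name are placeholders.
USER_MC_CFG = """
[CMSSW]
datasetpath            = None        # no input dataset: private generation
pset                   = gen_cfg.py
total_number_of_events = 1000000
events_per_job         = 10000

[USER]
copy_data         = 1
publish_data      = 1
publish_data_name = MyPrivateSample_v1   # placeholder
"""
```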
Conclusions
• Learned a lot in 2009, built competence in the new group
  - MC exercised us at scale
  - Learned little from work on beam data
• Computing resources appear well usable and not saturated
  - The effort on the "analysis ops metrics" is particularly appreciated; we are starting to have the overall vision
  - Central placement of useful data at T2's is working well so far
• Ready to face more beam data
• We have listed our main concerns and desires:
  - Cut the scope of CRAB support to reduce the daily load
  - Reduce CRAB complexity and its coupling to all possible tools
  - Focus on stageout issues (50% of all failures)
  - Push large single workflows to DataOps
  - Add more access options to the CAF to better integrate it with the grid
  - Focus the CAF CRAB server on critical workload
  - More care and coordination in transitions; operations is a stakeholder in next-release planning