ADC Weekly Meeting, May 8 2012
Annecy 2012 Technical Interchange Meeting Highlights
Simone Campana – CERN IT/ES
Introduction
• Two-day meeting (Wed PM to Fri AM)
• 4 sessions: Data Management, Production System, Analysis, Networking
• Plus one session with invited speakers: Intel, EPFL (Miguel Branco)
• Many thanks to session conveners for the material
ADC Weekly, 8/5/2012
TIM April 2012 Data Management Session
Highlights and Action Items for ADC weekly, CERN, 8 May 2012
S. Campana, V. Garonne, I. Ueda
Data Management
• Storage Federations
• xrootd is the only realistic solution for the medium term
• Use case focuses on failover for data access
• More advanced use cases can be explored in future (“repairing” data, file-level caching)
• CMS experience in pre-production (failed access recovery)
• CMS spent a lot of time (and will spend more) on CMSSW I/O tuning (reducing #reads and increasing #hits in read-ahead). Key for success in WAN access.
• ATLAS experience in USATLAS R&D
• Automated tools for WAN tests on top of HC
• Integration of xroot federation with Panda is in progress
• Many open questions
• Security, monitoring, content publication
• MB recommended creating topical working groups
• ATLAS will try to expand the experience with xroot federations outside the US.
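The failover use case listed above can be pictured with a minimal sketch. Everything named here is an assumption made for illustration: `open_with_failover`, the `opener` callback, `fake_opener` and the example.org URLs are hypothetical and do not correspond to any real xrootd client API. The sketch only shows the ordering logic: try the local storage first, then fall back to a federation redirector.

```python
from typing import Callable, List, Sequence, Tuple


def open_with_failover(
    lfn: str,
    sources: Sequence[str],
    opener: Callable[[str], bytes],
) -> Tuple[str, bytes]:
    """Try each source prefix in order; return (url, data) from the first that works."""
    failures: List[str] = []
    for prefix in sources:
        url = prefix.rstrip("/") + "/" + lfn.lstrip("/")
        try:
            return url, opener(url)
        except OSError as exc:
            # Remember why this source failed, then move on to the next one.
            failures.append(f"{url}: {exc}")
    raise OSError("all sources failed: " + "; ".join(failures))


def fake_opener(url: str) -> bytes:
    """Hypothetical stand-in for a remote read: the local copy is 'missing'."""
    if "local-se" in url:
        raise OSError("file not found")
    return b"event data"


if __name__ == "__main__":
    sources = [
        "root://local-se.example.org//atlas",    # hypothetical local storage
        "root://redirector.example.org//atlas",  # hypothetical federation redirector
    ]
    url, data = open_with_failover("data12/file.root", sources, fake_opener)
    print("read", len(data), "bytes from", url)
```

Keeping the read operation behind a callback means the same ordering logic could wrap whichever client a site actually uses; that separation is a design choice of this sketch, not something stated at the meeting.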
Data Management
• Transfer Services
• FTS will remain the baseline Transfer Service
• FTS3 will cure known architectural issues
• Channel concept, plugin support for protocols
• FTS3 prototype in June, multi-VO testing
• Point2Point protocols
• gridFTP as baseline; new version and session reuse will help reduce overheads
• Xrootd is an alternative. Needs to be supported on all systems (see also discussion on federations)
• HTTP is a serious option. Needs more integration and testing
• SRM
• Functionalities will be slowly replaced
• Core set of functionalities will remain (access to MSS)
• Positive experience with BestMAN+gridftp+Lustre at OU SWT2
• Interesting analysis from DDM Tracer data. Further studies suggested.
Data Management
• Rucio
• Architecture and Prototype API now available
• Rucio Demo in June, prototype in October
• Case sensitivity
• Would like to move to Case Sensitive datasets and file names in DDM (UNIX like)
• No strong online and offline objections, will try to agree at June SW week
• Rucio scope
• Proposal presented, but possible issues for the usage of “Campaigns”
• Is being re-thought, DDM team will present a new proposal soon
• Naming convention for files at sites in Rucio
• Controversial discussion (less intuitive organization of files at sites for local access)
• Being re-iterated within ADC and with Data Prep and PhysCoord (ICB?)
TIM April 2012 ProdSys Session
Highlights and Action Items for ADC weekly, CERN, 8 May 2012
K. De, A. Filipcic, A. Klimentov, R. Walker and A. Vaniachine
Production System and Grid Data Processing
• Progress since TIM in Dubna
• APF status
• PY factory to be replaced
• Still manual config files
• Pending integration with AGIS
• Fair share policy implementation
• HLT Task Request
• Real time definition of tasks and jobs
• Multi-cloud production widely used
• Tier-2s usage
• Short term plans
• Job submission vs resources heterogeneity
• AKTR et al. overload
• Processing 10+k task requests with 90+k output datasets
• Previous overload happened about a year ago, at the time of TIM in Dubna
• Not clear why these rare events (overload and TIM) are correlated in time
• Monitoring and better integration with SSB
Alexei Klimentov – TIM Highlights
Dynamic Job Definition (JEDI)
• JEDI core foundations
• No predefined (and pre-assigned) jobs
• Task Request: database templates
• “Late” datasets registration
• Reassessment of PandaDB and ProdDB
• Understand benefits of redundancy
• Separation of concerns
• Task post processing
• If you do not like the name “JEDI” the alternative is “PDJD” …
• Panda Dynamic Job Definition
Dynamic Evolution for Tasks (DEfT)
• Rate of task requests grows exponentially
• Linear growth in users and support requests
• Growing list of requirements and use cases
• New use cases: HLT, FTK, user analysis tasks
• First ideas about new architecture and how JEDI and DEfT will be developed
• ProdSys technical meeting in Ljubljana (June 2012) to discuss JEDI and DEfT development
ProdSys session II
• Rucio/DDM and ProdSys/PanDA overlaps
• What we want to keep and what we want to drop
• Multi-core jobs
• Ready for full Grid Production in simple scenario
• glideinWMS studies
• Work in progress to find limits in various components
TIM April 2012 Distributed Analysis Session
Highlights and Action Items for ADC weekly, CERN, 8 May 2012
F. Barreiro, D. Benjamin, D. Van Der Ster
ATLAS & CMS Common Analysis Framework
• Initiative from CERN IT-ES, ATLAS and CMS
• Assess potential of using common analysis solution based on PanDA and GlideInWMS
• Currently at the end of Feasibility Study http://cern.ch/go/9mNC
• Compare and analyze experiments’ workflows and architectures
• Identify dependencies, what can be reused and potential show-stoppers
• Study and compare sub-components: Server sides, PanDA pilot and pilot factories, GlideInWMS
• Evaluate integration scenarios for PanDA and GlideInWMS ensuring no loss of functionality
• Prepare final document with conclusions and proposal for Proof-of-Concept
• To be validated by the experiments
• In case of green light, used as input for coming Functionality and Operations Studies
Improving Job Efficiency
• Server Side Retries
• Only 20% of failures are “retriable”
• Normally OK at 2nd attempt, 3rd attempt useless
• Non-retriable failures are mostly “athena”
• Well… something else, but masked by athena.
• Work will be done to account for those properly
• proot
• Main goal is to catch failures and categorize them properly (besides correctly setting the root env)
• This is difficult if you do not “own” the event loop
• So, now an EventLoop package and its grid driver have been developed
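The retry findings above (only some failures are retriable, the 2nd attempt normally succeeds, a 3rd is useless) suggest a simple rule, sketched here under stated assumptions. The category names in `RETRIABLE_CATEGORIES` and the functions `should_retry` and `run_with_retries` are hypothetical illustrations, not the PanDA server implementation; the only input taken from the slide is the cap of two attempts.

```python
from typing import Callable, Optional, Tuple

# Hypothetical error categories; a real list would come from failure accounting.
RETRIABLE_CATEGORIES = frozenset(
    {"stage_in_timeout", "stage_out_timeout", "lost_heartbeat"}
)

# The slide reports that a 3rd attempt is useless, so cap at two attempts.
MAX_ATTEMPTS = 2


def should_retry(category: str, attempts_so_far: int) -> bool:
    """Retry only retriable categories, and only while under the attempt cap."""
    return category in RETRIABLE_CATEGORIES and attempts_so_far < MAX_ATTEMPTS


def run_with_retries(job: Callable[[], Optional[str]]) -> Tuple[bool, int]:
    """Run `job` until it succeeds or the rule says stop.

    `job` returns None on success or an error category string on failure.
    Returns (succeeded, number_of_attempts).
    """
    attempts = 0
    while True:
        attempts += 1
        category = job()
        if category is None:
            return True, attempts
        if not should_retry(category, attempts):
            return False, attempts
```

A failure that is really something else "masked by athena" would be classified as non-retriable here and stop after one attempt, which is why proper failure categorization matters for this rule to be useful.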
Server Side Tasks
• Current issues
• Many actions today happen client side => slowness
• Data discovery, job splitting, DS registration, retry
• No task concept in Panda => complicated bookkeeping
• User interest is in task rather than subjobs
• Start moving client functionalities to server side
• Simplify client tools, centralize functionalities, improve bookkeeping
• Introduce Task concept in Panda (Task/Jobset table)
• Modify clients to submit tasks/Jobsets (instead of subjobs)
• Implement subjob definition server side
• Evolve Panda server to handle subjobs and task/jobdef synchronization in DB
• Change bookkeeping tools
• Interact with task/jobdef table directly
• Send retry commands to be executed by the server
• Move toward server-side task management
• Straightforward once job submission is moved server-side
• Missing piece is task chaining
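The task concept described above can be illustrated with a small data-structure sketch: the client submits one task, and splitting, bookkeeping and retries happen on the server. The `Task` and `SubJob` classes, their fields and the status strings are all hypothetical; they are not the Panda Task/Jobset table schema.

```python
from collections import Counter
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class SubJob:
    index: int
    input_files: List[str]
    status: str = "defined"  # hypothetical states: defined, finished, failed


@dataclass
class Task:
    task_id: int
    input_files: List[str]
    files_per_job: int
    subjobs: List[SubJob] = field(default_factory=list)

    def split(self) -> List[SubJob]:
        """Server-side job splitting: chunk the input files into subjobs."""
        if self.files_per_job < 1:
            raise ValueError("files_per_job must be at least 1")
        step = self.files_per_job
        self.subjobs = [
            SubJob(index=i, input_files=self.input_files[start:start + step])
            for i, start in enumerate(range(0, len(self.input_files), step))
        ]
        return self.subjobs

    def summary(self) -> Dict[str, int]:
        """Task-level bookkeeping: count subjobs per status."""
        return dict(Counter(job.status for job in self.subjobs))

    def retry_failed(self) -> int:
        """Reset failed subjobs so the server can run them again."""
        failed = [job for job in self.subjobs if job.status == "failed"]
        for job in failed:
            job.status = "defined"
        return len(failed)
```

With this shape, a bookkeeping tool only needs the task identifier to report progress or request a retry, which matches the observation that users care about the task rather than individual subjobs.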
Pilot Plans/Ideas
• Moving to “experiments” plugins
• Refactor/clean pilot code
• Provides a better platform for many contributors
• Job recovery simplified
• Could be used outside US (UK interest)
• Could be used for analysis (to be evaluated)
• StageIN/OUT
• StageOUT retry to the T1 (instead of local): under development
• StageIN retry from another source: leverage xrootd federation
• ErrorDiagnostic class in development and DEBUG mode for pilots
• Avoid “grepping” logfiles, modularize etc …
• Peeking capability
• Many others … help needed.
• Common solution initiative should bring in more contributors
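Two of the pilot ideas above, StageOUT retry to the T1 and structured error diagnostics instead of "grepping" logfiles, can be combined in one hedged sketch. The names `Diagnostic`, `StageOutError`, `stage_out` and the site labels are hypothetical and are not taken from the pilot code or its ErrorDiagnostic class.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Sequence, Tuple


@dataclass
class Diagnostic:
    """A structured error record that can be reported without parsing logs."""
    component: str
    code: str
    site: str
    message: str


class StageOutError(Exception):
    """Hypothetical exception raised by a copy tool wrapper."""

    def __init__(self, code: str, message: str) -> None:
        super().__init__(message)
        self.code = code


def stage_out(
    filename: str,
    destinations: Sequence[str],
    copy: Callable[[str, str], None],
) -> Tuple[Optional[str], List[Diagnostic]]:
    """Try each destination in order (e.g. local site first, then the T1).

    Returns the site that accepted the file (or None if all failed) together
    with one Diagnostic per failed attempt.
    """
    diagnostics: List[Diagnostic] = []
    for site in destinations:
        try:
            copy(filename, site)
            return site, diagnostics
        except StageOutError as err:
            diagnostics.append(Diagnostic("stageout", err.code, site, str(err)))
    return None, diagnostics
```

Because each failed attempt produces a record with a code and a site, the job report can say what went wrong and where, even when the transfer eventually succeeds at the fallback destination.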
Conclusions
• A very productive workshop
• Some subjects probably deserved a bit more time for discussion
• ADC software is by no means “frozen”
• Needs to keep up with the demand
• Strong focus on commonalities for long term sustainability
• Several ideas/plans will be followed up in the next months in ADCDev and ADCOps
• Plus dedicated workshops (e.g. Prodsys in Ljubljana)