
LCG-France Tier2 & AF

Summary of the discussions and announcements regarding Tier2 operations, expectations, experiences, and coordination at the GDB meeting on October 8, 2008. Topics include storage, monitoring, batch systems, and coordination with experiments.





Presentation Transcript


  1. LCG-France Tier2 & AF – Coordination Meeting – GDB, 8 October 2008 – Hélène CORDIER

  2. Contents / announcements
  http://indico.cern.ch/conferenceDisplay.py?confId=20234
  • Confirm pledges for 2009; 2010-2013 pledges due by October 20th.
  • The IB notes that data taking should be ready for May-June 2009; no changes to next year's plans unless the accelerator gives a different message.

  3. Tier2 issues addressed
  • Each VO's expectations on usage of Tier2s
  • LCG-FR and GRIF Tier2 experience
  • Accounting
  • Monitoring
  • Storage
  • Batch systems
  • M/w deployment and installation
  • CCRC'09

  4. Experiments and Tier2 specifics 1/2
  • LHCb / Ph. Charpentier: Tier2s are mainly for simulation; thinking about analysis – trial sites.
  • CMS / M. Kasemann: Tier2s handle centrally organised MC production and group analysis; individuals' analysis at their 'local' Tier2; 4 levels of storage (Experiment → User); CMS site commissioning tool soon in production.
  • CMS Tier2 coordinators: Giuseppe Bagliesi/INFN and Ken Bloom/Nebraska – do all T2s know (at least one of) them?
  • Savannah usability questioned by sites.
  • "Extra resources" for analysis without extra burden on central operations – "not that clear" (John Gordon).

  5. Experiments and Tier2 specifics 2/2
  • ATLAS / K. Bos: very hierarchical model; each T2 can do a full top analysis; specific functional tests; GGUS tickets to Tier2s; multi-user pilot jobs in Panda?
  • ALICE / Y. Schutz: MC processing and data storage; end-user analysis; T2 ~ T1 computationally; mandatory VOBox, shared FS for VO software, xrootd SE.
  • Coordination with central operations / regional coordination → T2 federations; vertical operations; mixed operations. (John Gordon)

  6. Tier2 experience from LCG-FR T2-T3 / FC
  • How to follow the experiments' activities at the regional level? Each experiment seems to have its own approach. Some answers and further information were given:
  – LHCb: centralized approach.
  – CMS: mesh, with direct contacts with all sites.
  – ATLAS: the cloud model fits well with the LCG-France regional approach, provided you have experts.
  – ALICE: efficiency of the ALICE Task Force.
  • Towards an LCG service…
  – We are now operating grid services with the EGEE operational model.
  – We have a generic operational model for baseline services.
  – Next step: transition to NGI.
  • Is there a need for a specific LCG operational model including all Tiers? What are the LCG specificities compared to EGEE operations?
  • Is there a need for an LCG-France operational model compatible with EGEE (a future NGI) and WLCG?

  7. Jamie Shiers' answer – EGEE'08
  • Whilst it is understood that this builds extensively on the operations infrastructure(s) of the underlying grids, there are additional layers that have proven valuable (at least so far…). These include:
  – Daily operations con-calls on weekdays at 15:00 Geneva time, with notes distributed the same business day and widely read by Tier1 (Tier2?) sites, experiments and WLCG management.
  – Weekly service summary to the WLCG Management Board; quarterly service summary to the Overview Board.
  – Additional follow-up of service problems (beyond SAM service availability at the MB): service issues that "violate" MoU target(s) trigger a post-mortem, which should normally be available by the time of the following MB.
  • The experiments also have extensive operations teams and infrastructures.
  • WLCG Collaboration workshops: 200-300 people; held jointly with other events, e.g. CHEP, EGEE'09(?), where possible.
  • ATLAS "jamborees": closer to 100…

  8. Jamie Shiers' answer – EGEE'08
  • The Experiment Dashboards and VO-specific SAM tests, together with other experiment-specific monitoring, really are used to "run the show".
  • CMS: 15 minutes daily to check the status of all sites!
  • IMHO there is no point in trying to optimize this (already great) result just yet – get more experience with real data taking!
  • Very close collaboration between the Grid Support team and the experiments / sites.
  • Very close collaboration with Grid & Service experts at CERN and elsewhere.

  9. Accounting / JG
  • EGEE now has a SAM test (APEL-pub) which becomes critical if a site does not publish for a month; the COD in EGEE will then raise tickets against the site.
  • APEL client configuration may not be straightforward due to local batch configurations: see the APEL wiki, or raise a GGUS ticket.
  • The Tier2 sites list is maintained by lcg.office@cern.ch.
  • CPU usage is compared with WLCG pledges; CPU capacity estimated by the megatable soon → to be checked.
  • Work underway on accounting of installed storage capacity; information providers are being released (dCache v1.9.1).
  • APEL publishes the UserDN and FQAN (VOMS proxy); legal issues raised by M. Jouvin & F. Chollet → LCG-FR accounting group.
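
  The pledge comparison mentioned above is essentially a ratio of delivered CPU time to pledged capacity over a period. As a rough, hypothetical illustration (not the actual APEL/megatable machinery, and with invented numbers):

```python
# Illustrative only: compares delivered CPU time against a monthly pledge.
# The kSI2K figures below are invented, not taken from APEL or the megatable.

MONTH_HOURS = 720  # approximate hours in one month


def pledge_utilisation(delivered_ksi2k_hours, pledged_ksi2k):
    """Fraction of the monthly pledge actually delivered."""
    pledged_ksi2k_hours = pledged_ksi2k * MONTH_HOURS
    return delivered_ksi2k_hours / pledged_ksi2k_hours


# Example: a T2 pledging 500 kSI2K that delivered 250,000 kSI2K-hours in a month
print(f"{pledge_utilisation(250_000, 500):.0%} of pledge used")  # ~69%
```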

  10. Monitoring for Tier2s / JC
  • Multi-level monitoring framework (James Casey): http://indico.cern.ch/getFile.py/access?contribId=344&sessionId=53&resId=4&materialId=slides&confId=32220
  • Introduction to Nagios (Ronald Starink): http://indico.cern.ch/materialDisplay.py?contribId=239&sessionId=54&materialId=slides&confId=32220
  • Grid Site Monitoring with Nagios (Emir Imamagic): http://indico.cern.ch/getFile.py/access?contribId=240&sessionId=54&resId=0&materialId=slides&confId=32220
  • And a great tutorial based on a live demo: Grid Site Monitoring Demo (Steve Traylen).
  • SAM as we know it today will morph into something that uses Nagios as the submission framework.
  • We are already testing probes that use the messaging system in Validation.
  • We are gaining Nagios experience.
  • Currently working on 'how to do alarming' and on storage of test results.
  • WLCG and EGEE publish monthly availability reports.
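
  Since SAM is expected to use Nagios as its submission framework, a minimal Nagios-style probe may help illustrate the model. The script below is a hypothetical example rather than one of the actual WLCG/SAM probes; it only checks that a service port answers and reports through the standard Nagios convention of a status line plus exit code.

```python
#!/usr/bin/env python3
"""Minimal Nagios-style probe: TCP check of a service endpoint.

Hypothetical illustration only. Real WLCG/SAM probes run grid-specific tests,
but they report to Nagios in the same way: one status line on stdout and an
exit code (0=OK, 1=WARNING, 2=CRITICAL, 3=UNKNOWN).
"""
import socket
import sys


def check_tcp(host, port, timeout=10):
    """Return (exit_code, message) following the Nagios plugin convention."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return 0, f"OK - {host}:{port} is reachable"
    except OSError as exc:
        return 2, f"CRITICAL - {host}:{port} unreachable ({exc})"


if __name__ == "__main__":
    # Hypothetical endpoint; a site would point this at its own SRM or CE.
    status, message = check_tcp("srm.example.org", 8443)
    print(message)
    sys.exit(status)
```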

  11. Storage in GridPP Tier2s / GC
  • Monitoring of storage: grid storage operations in GridPP.
  • Two types of interest: (1) GridPP operations (across all T2s); (2) site operations (within a T2).
  • Multiple sources of useful information: (1) the BDII (i.e. SRM endpoints, space information); (2) the SAM database of test results; (3) the SRMs themselves.
  • Developed some basic tools (or modified existing ones) to collect this information and visualise it for operations staff and sites, as sketched below.
  • Python scripts insert the information into an SQL DB.
  • GraphTool (B. Bockelman) is then used to query the DB and visualise the information.
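
  A minimal sketch of the collection pattern described on this slide (Python inserting storage information into an SQL database for later visualisation). The table layout, site name and space-token values are assumptions, not the actual GridPP schema or tools.

```python
#!/usr/bin/env python3
"""Sketch of the 'collect into SQL, visualise later' pattern described above.

The schema and the sample record are illustrative assumptions; real collectors
would query the BDII, the SAM results database and the SRMs themselves.
"""
import sqlite3
import time


def store_space_record(db_path, site, token, total_gb, free_gb):
    """Append one timestamped space-usage record to a local SQLite DB."""
    conn = sqlite3.connect(db_path)
    conn.execute(
        """CREATE TABLE IF NOT EXISTS space_usage (
               ts INTEGER, site TEXT, space_token TEXT,
               total_gb REAL, free_gb REAL)"""
    )
    conn.execute(
        "INSERT INTO space_usage VALUES (?, ?, ?, ?, ?)",
        (int(time.time()), site, token, total_gb, free_gb),
    )
    conn.commit()
    conn.close()


if __name__ == "__main__":
    # Hypothetical values; a real collector would parse BDII/SRM output.
    store_space_record("storage.db", "UKI-EXAMPLE-T2", "ATLASDATADISK", 50000, 12000)
    # GraphTool (or any plotting tool) can then query this DB to plot
    # space availability over time per site and space token.
```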

  12. Storage in GridPP Tier2s / GC
  • Grid storage operations in GridPP: deployed m/w versions, sites' space-token configuration, space availability over time, visualising SAM tests.
  • GridppDpmMonitor – site monitoring of storage: dCache billing graphs, per-DN transfer successes, client host transfer quality.

  13. Batch systems (ST)
  • Review of batch systems in use on WLCG/EGEE; WLCG + community support for batch systems; existing documents.
  • Recent changes in the EGEE deployment: individual queues for VOs are no longer needed, since VOViews are now available and supported by YAIM and the middleware (WMS).
  • YAIM and the middleware now support: mapping an FQAN to a particular uid pool / gid pair; submitting an FQAN to a particular queue (via the publication of that queue); publishing a VOView for that FQAN (see the sketch after this slide).
  • Missing documents and instructions.
  • Introduction of CREAM; job tracking and handling.
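
  To make the FQAN handling above concrete, here is a small conceptual sketch of mapping an FQAN to a pool-account prefix, gid and batch queue. The mapping table is invented for illustration; in practice YAIM generates the LCMAPS and batch-system configuration that implements this.

```python
#!/usr/bin/env python3
"""Conceptual sketch of FQAN-based mapping, as described on the slide above.

The table below is invented for illustration; it is not a site's actual
YAIM/LCMAPS configuration.
"""

# FQAN -> (pool account prefix, gid, batch queue) - hypothetical values
FQAN_MAP = {
    "/atlas/Role=production": ("atlasprd", "atlasprd", "atlas_prod"),
    "/atlas":                 ("atlas",    "atlas",    "atlas"),
    "/cms/Role=lcgadmin":     ("cmssgm",   "cmssgm",   "cms_ops"),
    "/cms":                   ("cms",      "cms",      "cms"),
}


def map_fqan(fqan):
    """Return (pool prefix, gid, queue) for the most specific matching FQAN."""
    for prefix in sorted(FQAN_MAP, key=len, reverse=True):
        if fqan == prefix or fqan.startswith(prefix + "/"):
            return FQAN_MAP[prefix]
    raise ValueError(f"no mapping for {fqan}")


if __name__ == "__main__":
    print(map_fqan("/atlas/Role=production"))  # ('atlasprd', 'atlasprd', 'atlas_prod')
    print(map_fqan("/cms"))                    # ('cms', 'cms', 'cms')
```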

  14. M/W deployment and installation / OK
  • FTS/SL4 – target: full deployment of FTS 2.1 on SL4.
  • CREAM for direct submission (no proxy renewal): CREAM is basically ready; WMS submission will come with ICE (timescale: months). Target: maximum availability, in parallel with the lcg-CE.
  • WMS status: a patched WMS (fixing bug #39641 & bug #32345) within 1 week (now?!). Target: this patch should be deployed.
  • ICE to submit to CREAM: not required for certification; ICE will be added in a subsequent update (but better before Feb. 2009).
  • WN/SL5 – target: available on the infrastructure in parallel to SL4; python 2.5 and alternative compiler stuff can be added subsequently.
  • Multiple parallel versions of middleware available on the WN – target: advisable to introduce this relatively shortly after the bare-bones WN on SL5.
  • glexec/SCAS – target: enabling of multi-user pilot jobs via glexec. This could conceivably be done via another means than SCAS, but that would have to be decided asap.
  • Glue2 – target: get the new schema deployed to the BDIIs, to obtain a working but unpopulated Glue2 infosys in parallel to GLUE 1.3; info providers upgraded gradually afterwards.
  • Heterogeneous CE publishing – target: a set of changes to rationalise the publishing of heterogeneous computing resources is envisaged. The first phase is the deployment of new tools, simply enabling the current situation; subsequent phases take advantage of the new tools.
  • gridftp2 patches: being back-ported to VDT 1.6; important for dCache and FTS.
  • Publishing of detailed service versions: small improved information providers are in certification.

  15. Main events
  • HEPiX Fall Meeting, Taipei, October 20-24
  • LHC Startup Jamboree, CERN, October 21
  • CCRC'09 Planning Workshop, CERN, November 13-14
  • OGF25 & EGEE User Forum, Catania, March 2-6, 2009
  • 4th WLCG Collaboration Workshop, Prague, March 21-22, 2009
  • CHEP09, Prague, March 22-27, 2009
  • ISGC, Taipei, April 21-23, 2009
  • EGEE09, Barcelona, September 21-25, 2009

  16. CCRC'09 agenda under construction / JS
  • Lessons learned / experience gained from 2008, together with specific proposals for things that must be tested as part of the 2009 preparation: this includes things not fully tested in 2008, plus scale tests that realistically reflect 2009 needs. Out of this must come some clear goals and metrics, to be measured and reported.
  • We should also include "area roadmaps", as has regularly been done, but beyond middleware & storage-ware, covering other major service components: e.g. the famous "Critical Service" follow-up, database services, analysis-related services, monitoring & dashboards, experiment requirements according to the agreed template…
  • Some key "WLCG operations" issues, including but not limited to:
  – 2009 review workshops: pre-CHEP (already booked), first discussions on pre-EGEE. Next year should be an exceptional year – the steady state must be more streamlined; maybe co-locating with other events is the answer anyway…
  – Post-mortem & escalation procedures.
  – "SMART MoU" targets.
  – "Computing Run Coordinator" roles – something more sustainable than Harry & I and "volunteers", plus understanding internal structures / procedures within the experiments.
