
GDB - February 2014 Summary



  1. GDB - February 2014 Summary. Jeremy’s notes. Agenda: http://indico.cern.ch/event/272618/

  2. Introduction (M Jouvin) • Please check 2014 meeting dates. • March 12th – CNAF Bologna (register) • WLCG workshop (1st/2nd week July). Barcelona. Possibly 8th-9th July. • GDB actions: https://twiki.cern.ch/twiki/bin/view/LCG/GDBActionInProgress • Future (pre-)GDB topics welcome • Upcoming: By introducing a pay‐per‐usage scheme as part of the funding model, the funding agencies will have the information to measure the level of usage of a service and whether it justifies their investments. In addition, if the pay‐per‐usage model is implemented by giving some of the financial control to the users, then they will favour those services which offer better value‐propositions. • Site Nagios testing – any feedback? • OSG Federation workshop: https://indico.fnal.gov/conferenceDisplay.py?confId=7207 • HEPiX 19th-23rd May, Annecy. EGI CF 19th-23rd May, Helsinki.

  3. HEP SW Collaboration (I Bird) • Performance now a limiting factor. • CPU technology trends. More transistors but not easy to use them. • Most s/w designed for sequential processing. Migrating to multi-threaded not easy. Target Geant and ROOT. • Concurrency Forum est. 2 yrs ago. Towards Open Scientific Software Initiative. • Components such as Geant and ROOT should be part of a modular infrastructure. • HEP S/W Collab: goal to build/maintain libraries… • Establish a formal collaboration to develop open scientific software packages guaranteed to work together (inc. frameworks to assemble apps). • Workshop 3rd-4th April 2014

  4. IPv6 Update (D Kelsey) • WG meeting 23/24 Jan 2014 (included CERN cloud and OpenStack.) • Progress in various areas. CERN campus wide deployment in March (some dhcpv6 issues): http://ipv6.web.cern.ch/ • PerfSONAR very useful… works IC. Run dual stack? • IPv6 file transfer test bed. Decayed a bit. • ATLAS testing (Alastair): AGIS. Simple tests then HC. Squid 2.8 not IPv6 compatible. • Plan to get mesh working again. Site deployments. Move to use SRM/FTS… • Define use-cases • Barrier to move for some sites if availability affected going to dual stack etc. • Software survey shows 15/66 ‘services’ known to be fully compliant. • Pre-release of dCache 2.8.0 has IPv6 fixes. • Want to survey sites – when will they run out of IPv4 and be capable of IPv6. pre-GDB meeting in June.

  5. Future of SLC (J Polok) • CentOS team joining Red Hat in open standards team. Not RHEL. • CentOS Linux platform is not changing • Impact for SL5/6: Source packages may have to be generated from git repositories. • No other changes – releases stay as now • SL(C) 7 options being discussed • May rebuild from source as for 5 and 6 OR create a Scientific CentOS variant OR adopt CentOS core. • Approaches: 1. Keep process: build from source with our actual tool chain. 2. Create SIG for our variant. 3. SL becomes an add-on repository to CentOS core. • CentOS 7 Beta in preparation. RHEL7 production due in summer. Source RPMs not guaranteed after summer. • Need to ensure risks for 5 and 6 covered.

  6. Ops coordination report (S. Campana) • Input based on pre-GDB Ops Coordination meeting. • gLexec: CMS SAM test not yet critical. 20 sites still have not deployed. • perfSONAR: It is a service. Sites without it or at an old release will feature in report(s) to MB. • Tracking tools evolution – Savannah to JIRA. JIRA still lacking some GGUS functionality • SHA-2 migration: progress with VOMS-admin but manual process needed. New host certs soon. • Machine/Job features: Prototype ready. Options for clouds being looked at. • Middleware Readiness: Model will rely on experiments & frameworks + sites deploying test instances + monitoring. MB will discuss process for ‘rewarding’ site participation.

  7. Ops Coordination - cont • Baseline enforcement: Looking at options to monitor and then automate for campaigns • WMS decommissioning: Shared/CMS instances end in April. SAM will use till June. • Multi-core deployment: ATLAS & CMS different usage. Trying prototypes. Torque/Maui a concern. • FTS3 deployment: FTS3 works well. Few instances needed – 3 or 4 for resilience. • Experiment Computing Commissioning: Experiment plans for 2014 discussed. Conclude no need for common commissioning exercise. • Conclusion – some deployment areas being escalated.

  8. High memory jobs (J Templon) • NIKHEF observations • Which high mem problem!? Virtual memory usage in GB. Pvmem 4096MB. User jobs and some prod jobs high usage. These don’t ‘ask’ for the memory. Link multi-core and high mem. • Pvmem – ulimit on process – allows handling of out-of-mem signal (not kill) • Different ways to ask for more memory in job… few work. Inconsistencies arise. • Situation being reviewed.
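The point about pvmem acting as a per-process ulimit can be illustrated with a minimal sketch. It assumes a Linux host where the address-space limit (RLIMIT_AS) is enforced; the helper name `run_with_address_space_limit` and the allocation sizes are hypothetical, and only the 4096MB figure comes from the notes. The idea shown is that a process hitting the limit gets a failed allocation it can handle, rather than being killed outright.

```python
import resource

# 4096 MB, the pvmem value quoted in the notes.
LIMIT_BYTES = 4096 * 1024 * 1024


def run_with_address_space_limit(limit_bytes, work):
    """Run work() under a soft address-space limit (hypothetical helper).

    Returns (True, result) on success, or (False, message) when an
    allocation is refused because the limit was reached.
    """
    soft, hard = resource.getrlimit(resource.RLIMIT_AS)
    # Never ask for more than the existing hard limit allows.
    if hard != resource.RLIM_INFINITY:
        limit_bytes = min(limit_bytes, hard)
    resource.setrlimit(resource.RLIMIT_AS, (limit_bytes, hard))
    try:
        return True, work()
    except MemoryError:
        # The job gets a chance to clean up and report instead of dying.
        return False, "allocation refused: address-space limit reached"
    finally:
        # Restore the original soft limit for the rest of the process.
        resource.setrlimit(resource.RLIMIT_AS, (soft, hard))
```

A job wrapper could call `run_with_address_space_limit(LIMIT_BYTES, payload)` and log the failure message; how a batch system actually applies pvmem is not covered by this sketch.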

  9. SAM test scheduling (Luca Magnoni) • SAM: framework to schedule checks (Nagios) via dedicated plug-in (probes = scripts/executables) • Categories: Public grid/cloud services (custom probes); job submission (via WMS); WNs (via job payloads). • Job submission – to include direct CREAM and Condor-G • Remote testing assumes deterministic execution. There are granularity issues (CE vs site) and not always agreement between site and experiment views. • Can test with different credentials. Jobs can time out when VO is out of share. Site availability determined by experiment critical profiles. • Most timeouts looked to be on WMS side! • New Condor-G and CREAM probes for job submission coming • Aim to provide web UI/API for users • Looking at options to replace Nagios for scheduler • Test submission via other frameworks (e.g. HC) being investigated – ATLAS want a hybrid approach, CMS do not support framework approach.
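Since the SAM probes above are described as scripts/executables run through Nagios, a minimal sketch of the general Nagios plug-in convention may help: a probe reports one status line and an exit code of 0 (OK), 1 (WARNING), 2 (CRITICAL) or 3 (UNKNOWN). The function `run_probe`, its thresholds and the message wording are hypothetical illustrations, not part of SAM.

```python
import time

# Standard Nagios plug-in exit codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def run_probe(check, warn_s=30.0, crit_s=120.0):
    """Run check() and map the outcome to a (status, message) pair.

    check is any callable that raises on failure; warn_s and crit_s are
    illustrative duration thresholds in seconds (assumed values).
    """
    start = time.monotonic()
    try:
        check()
    except Exception as exc:
        return CRITICAL, f"CRITICAL - check failed: {exc}"
    elapsed = time.monotonic() - start
    if elapsed >= crit_s:
        return CRITICAL, f"CRITICAL - took {elapsed:.2f}s"
    if elapsed >= warn_s:
        return WARNING, f"WARNING - took {elapsed:.2f}s"
    return OK, f"OK - completed in {elapsed:.2f}s"
```

A wrapper script would print the message and call `sys.exit(status)`, so that the scheduler reads the result from the exit code. A job that times out because the VO is out of share would surface here as an exception from `check`, which is one reason site and experiment views can disagree.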

  10. New transfer dashboard (A Beche) • Reviewed history of data transfer monitoring. Separate web API/UI for FTS, FAX, AAA. Added in ALICE and EOS. • Plan to federate. Data split into schemas: FTS, XRootD and high optimization. • Data retention policies differ – raw and statistics • Dashboard now aggregates over APIs • Plan for a map view

  11. WLCG monitoring coordination (Pablo Saiz) • Consolidation group: reduce complexity; modular design; simplify ops and support; common dev and core. • Need more site input. • Timeline – starting to deploy. • Survey & tasks. Tasks in JIRA: https://its.cern.ch/jira/browse/WLCGMON • 1. Application support (for jobs, transfers, infrastructure…) • 2. Running the services (moving to AI, Koji, SL6, puppet…) • 3. Merging applications (SSB+SAM; SSB+REBUS; HC+Nagios…). Idea is to reduce to make maintenance easier. Many infrastructure monitoring tools - schema copes with several use-cases. http://wlcg-sam-atlas.cern.ch/ • 4. Technology evaluation • Nagios plug-in for sites developed by PIC • SAM/SUM -> SAM3 (for SUM background see https://indico.cern.ch/event/285330/contribution/3/material/slides/1.pdf) • Next steps: https://its.cern.ch/jira/browse/WLCGMON

  12. Data Preservation Update (J Shiers) • Things are going well. • Workshop. • Increasing archive growth. • Annual cost of WLCG is 100M euro. • Need 4 staff: documentation; standards; .. • DPHEP portal core. Digital library. Sustainable software + virtualisation tech + validation frameworks. Sustainable funding. Open data.

  13. LHCOPN/LHCONE evolution workshop (E. Martelli) • Networking stable. Key. Growth with technology evolution ok. • New sites in areas where network under-developed. • ATLAS: Expect bursty traffic. US sites -> 40/100 • CMS: Mesh will increase traffic. • LHCb: no specific concerns. • More bandwidth needed at T2s. Connectivity needs to improve to Asia – capacity and RTTs. • Demands for better network monitoring & LHCONE operations. • P2P-link-on-demand (over provisioning vs complexity (L3VPN))

  14. perfSONAR (Shawn McKee) • Sites to use “mesh” configuration • Metrics will adjust over time • 85% of sites with PS have issues to resolve (firewalls, versions…). • Likely go with MaDDash (Monitoring and Debugging Dashboard) • Checking of primitive services – OMD (Open Monitoring Distribution) • For test instance…. WLC*** WLC*** • Context between all sites … • 3.3.4 release will mean only one machine needed • Alerting – high-priority but complicated
