Learn about existing efforts and policies for addressing security, accounting, and incident handling in global Grid operations, with emphasis on collaboration and common standards.
Grid INTER-Operations
Hélène Cordier, EGEE/WLCG Operations, IN2P3 Computing Centre, Lyon (France) - helene.cordier@in2p3.fr
Contents
• Existing common interests, so far mainly in solving two issues:
  • Security and accounting; monitoring and workflow efforts are diverse.
• Existing efforts at inter-project level involving:
  • Grid Interoperability Now (GIN, a working group of OGF) https://forge.ogf.org/sf/go/projects.gin/wiki
• Existing efforts at project level involving:
  • EGEE, WLCG and OSG
  • NDGF, PRAGMA, TERAGRID and NAREGI
• Existing efforts at IN2P3-CC:
  • IGTMD
• Concerns and Updates
Joint Security Policy Group
• Certification Authorities: EUGridPMA, IGTF and so on.
• Grid Acceptable Use Policy (AUP): a common, general and simple AUP for all VO members using many Grid infrastructures, e.g. EGEE, OSG, SEE-GRID, DEISA, national Grids…
• Incident Handling and Response:
  • defines basic communication paths
  • defines requirements (musts) for IR
  • is not meant to replace or interfere with local response plans
• [Diagram: Security & Policy document set. Labels: Security & Availability Policy; Incident Response; Certification Authorities; Audit Requirements; Usage Rules; VO Security; Application Development & Network Admin Guide; User Registration & VO Management]
• Grid Security Policy (v5.7): https://edms.cern.ch/document/428008/4
• Grid Site Operations Policy (v1.4): https://edms.cern.ch/document/819783/1
• Virtual Organisation Operations Policy (v1.0): https://edms.cern.ch/document/853968/1
Usage record working group
Mandate: In order for resources to be shared, sites must be able to exchange basic accounting and usage data in a common format. This working group proposes to define a common usage record based on those in current practice. The record format will be specific enough to facilitate information sharing among grid sites, yet general enough that the usage data can be used for a variety of purposes: traditional usage accounting, service usage monitoring, performance tuning, etc. This group will therefore concentrate on collecting and disseminating resource consumption data. We will not address how that data is to be collected by the resource sites, nor how it will be used by its recipients.
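To make the idea of a common usage record concrete, here is a minimal sketch of a record that two sites could exchange. The class name, the field names and the JSON encoding are all assumptions chosen for illustration; they are not taken from the working group's actual record definition.

```python
import json
from dataclasses import dataclass, asdict


@dataclass(frozen=True)
class UsageRecord:
    """Illustrative usage record; field names are assumptions, not the OGF schema."""
    record_id: str
    site: str
    vo: str
    user_dn: str
    cpu_seconds: int
    wall_seconds: int
    start_time: str  # ISO 8601 timestamp, UTC
    end_time: str    # ISO 8601 timestamp, UTC


def to_exchange_format(record: UsageRecord) -> str:
    """Serialize a record to JSON with sorted keys so every site emits the same layout."""
    return json.dumps(asdict(record), sort_keys=True)


def from_exchange_format(payload: str) -> UsageRecord:
    """Rebuild a record from a payload produced by to_exchange_format."""
    return UsageRecord(**json.loads(payload))
```

In line with the mandate, the sketch covers only the shape of the exchanged record. How a site gathers the numbers and what a recipient does with them are deliberately left out.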
Accounting
• Tools needed to collect and report information on resource utilization
  • Intended audience: site managers, virtual organization managers, grid operators, funding agencies, …
• Need to define common ways of measuring resource consumption
  • Including use of the same units
• LCG/EGEE
  • CPU usage information (per user or per VO) provided by each site and stored in a central repository: reports (charts and numeric data) available through a web interface
  • Next step: collect information on storage utilization.
  • Developed and operated by the Grid Operations Centre (UK) and CESGA (Spain).
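The need for common units can be illustrated with a small sketch that converts per-site CPU reports to a single unit before summing them per VO. The tuple layout, the unit names and the function name are hypothetical; this is not the code behind the LCG/EGEE accounting repository.

```python
from collections import defaultdict

# Seconds per reporting unit; the unit names are assumptions for this sketch.
SECONDS_PER_UNIT = {"seconds": 1, "minutes": 60, "hours": 3600}


def cpu_hours_per_vo(reports):
    """Sum CPU usage per VO after converting every report to CPU hours.

    `reports` is an iterable of (site, vo, amount, unit) tuples.
    """
    totals = defaultdict(float)
    for site, vo, amount, unit in reports:
        if unit not in SECONDS_PER_UNIT:
            raise ValueError(f"unknown unit {unit!r} reported by {site}")
        # Normalize to seconds first, then express the result in hours.
        totals[vo] += amount * SECONDS_PER_UNIT[unit] / 3600
    return dict(totals)
```

Rejecting unknown units, rather than guessing, keeps a mislabelled site report from silently distorting the central totals.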
Site monitoring High-Level Model
Site monitoring (cont’d)
• We can’t and won’t impose a solution on sites, as they might (and should) already have something in place.
• A specification-based approach lets our probes fit into any fabric monitoring system.
• A data exchange format lets higher-level services consume the data regardless of the fabric monitoring system: https://twiki.cern.ch/twiki/bin/view/LCG/GridMonitoringDataExchangeStandard
• WLCG Monitoring Working Groups since January 23rd 2007:
  • System Management Working Group – SMWG / J. Casey, I. Neilson https://twiki.cern.ch/twiki/bin/view/LCG/SystemManagement
  • Grid Service Monitoring Working Group – GSMWG / A. Forti, M. Jouvin https://twiki.cern.ch/twiki/bin/view/LCG/GridServiceMonitoringInfo
  • System Analysis Working Group – SAWG / J. Andreeva, P. Saiz https://twiki.cern.ch/twiki/bin/view/LCG/SystemAnalysisMonitoringInfo
[Rob Quick, Workshop on Grid services Monitoring HPDC’07 – June 27th 2007]
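As an illustration of how a shared data exchange format decouples probes from the fabric monitoring system, here is a minimal sketch in which a probe result is encoded as plain "key: value" lines that any consumer can parse. The field names, the status values and the line-based encoding are assumptions made for this sketch; they are not quoted from the Grid Monitoring Data Exchange Standard linked above.

```python
from dataclasses import dataclass, asdict

# Status names are an assumption for this sketch, not taken from the standard.
VALID_STATUSES = {"OK", "WARNING", "CRITICAL", "UNKNOWN"}


@dataclass(frozen=True)
class ProbeResult:
    host: str
    service: str
    status: str
    summary: str  # single-line description; newlines are not supported here


def encode(result: ProbeResult) -> str:
    """Render a probe result as 'key: value' lines, one field per line."""
    if result.status not in VALID_STATUSES:
        raise ValueError(f"invalid status {result.status!r}")
    return "\n".join(f"{key}: {value}" for key, value in asdict(result).items())


def decode(text: str) -> ProbeResult:
    """Parse the output of encode back into a ProbeResult."""
    fields = dict(line.split(": ", 1) for line in text.splitlines())
    return ProbeResult(**fields)
```

A higher-level service only needs `decode`; it never has to know which fabric monitoring system ran the probe.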
CIC Operations Portal
• Web portal integrating all the tools and sources of operations-related information in a single place
• Developed and operated by CC-IN2P3, with a failover instance at CNAF
• http://cic.gridops.org/
• Provides and maintains an integrated operations dashboard for the grid operator on duty
• Provides mechanisms for keeping the information needed for a proper handover between operators on duty
• Gives easy access to contact information for every actor involved in the operations of the grid
• Provides communication tools
Tracking incidents via GGUS
• Incident tracking model
  • Unique channel for opening tickets
    • End users: e.g. job submission failures, failed data transfers
    • Operators: e.g. job submission failures
  • Classification and first assignment done by the ticket process manager
  • Tickets are assigned to support units, one per domain of expertise
    • Grid operators, applications, federations, middleware (m/w) experts, ...
• OSG: ggus@tick.globalnoc.iu.edu
  • Automatic helpdesk / XML format exchange
  • 4 tickets created by CMS users since June 27th
• WLCG/EGEE
  • Central incident tracking tool: https://gus.fzk.de/
  • Same tool used by grid operators and end users, via e-mail and web interface
  • Sites failing the tests are assigned a ticket
  • Escalation procedure for solving site-related problems
    • Involves the regional operator and the site operator
  • Interface with ticket handling tools used by sites/federations (if needed)
  • Tools for collecting metrics on the responsiveness of support units
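The first-assignment step of the tracking model can be sketched as a lookup from a ticket's category to a support unit. The category keys, the unit identifiers and the `Ticket` fields below are hypothetical and are not taken from GGUS itself; the escalation procedure is omitted because the slide does not specify its order.

```python
from dataclasses import dataclass, field

# Hypothetical mapping from problem category to support unit (one unit per domain).
SUPPORT_UNITS = {
    "operations": "grid-operators",
    "application": "applications",
    "site": "federations",
    "middleware": "middleware-experts",
}


@dataclass
class Ticket:
    ticket_id: int
    category: str
    description: str
    assigned_to: str = "ticket-process-manager"  # every ticket starts here
    history: list = field(default_factory=list)  # previous assignees, oldest first


def assign(ticket: Ticket) -> Ticket:
    """First assignment: route the ticket to the support unit for its category."""
    unit = SUPPORT_UNITS.get(ticket.category)
    if unit is None:
        raise ValueError(f"no support unit for category {ticket.category!r}")
    ticket.history.append(ticket.assigned_to)
    ticket.assigned_to = unit
    return ticket
```

Keeping the previous assignee in `history` is one simple way to support the responsiveness metrics mentioned above, since each hand-over leaves a trace.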
The ENOC
• The EGEE Network Operations Centre (ENOC):
  • Single point of contact between EGEE and the NRENs
  • Where EGEE and the network can exchange operational information
  • Network support unit in GGUS
IGTMD: Grid Interoperability and Massive Data Transfer
• 3 years, started in Feb 2006
• Renater, ENS, CC-IN2P3, FNAL (unfunded)
Goals
• (1) Disk-to-disk bulk data transfer
• (2) Replication and referring mechanisms
• (3) Information System and job management interoperability
• (4) Grid control and monitoring
• (5) Usage of statistics and accounting data
IGTMD Roadmap
• Network: items 1 and 2
  • 2 × 1 Gb/s CC-IN2P3/FNAL on October 16th 2006 – LCG/EGEE
  • Tests on massive data transfer – CC-IN2P3/FNAL
• Interoperability: item 3
  • Access to grid resources through standard APIs – LCG/EGEE
  • State of the art, cf. JTR – October 17th
  • Roadmap at the IGTMD face-to-face meeting, May 4th
• Inter-operations: items 4 and 5
  • Test suite relevance to US sites – EGEE
  • Operations and daily monitoring of services – EGEE
  • Usage Records and accounting – OGF
Concerns and updates
• Achieve a real 24x7 production-quality service: failover mechanisms
• Increase automation of daily monitoring tools and alarm handling.
• OGF20 - GIN JOBS - EGEE/TERAGRID/OSG/NORDUGRID/DEISA
  • https://forge.ogf.org/sf/wiki/do/viewPage/projects.gin/wiki/WorkerNodeEnvironmentOGF20
  • 29/08/2005 http://edms.cern.ch/document/630962
  • 29/03/2007 mail from Laurence Field on GIN-JOB
• GIN-OPS: Savannah and Ninf-G
• GIN-IS: EGEE-NDGF and EGEE-OSG not updated since 17 August 2006
• GIN-data: same as GIN-IS
• GIN-auth: AUP for the gin.gg.org VO since 12/06.