Learn about planning for emergencies and security incidents in the LCG: incident classification, response procedures, and the essential requirements for handling an emergency.
Planning for LCG Emergencies
HEPiX, Fall 2005
SLAC, 13 October 2005
David Kelsey
CCLRC/RAL, UK
d.p.kelsey@rl.ac.uk
LHC Tier 0/1/2 Network Architecture David Kelsey, LCG Emergencies
Background
• Computing and networking are essential
• Tier 0 (CERN) and 12 Tier 1 sites are critical for data taking
• 10 Gbps optical private link to each T1
• The T1s collectively keep a second copy of the raw data
• The T1s play a vital role in (re)processing and providing access to derived data
• During data taking, can cope with a Tier 0 - Tier 1 link being down for 12 hours up to a few days. All T1s down: very bad!
• LCG MoU requires average T1 uptime during data taking: 99%
• LCG TDR says
  • “Special attention needs to be paid to the security aspects of the Tier-0, the Tier-1s and their network connections to maintain these essential services during or after an incident so as to reduce the effect on LHC data taking.”
• LCG is also essential for analysis
• Need to keep the Grid running at all times
• Therefore must deal quickly with incidents
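To get a feel for what a 99% average uptime requirement leaves as a downtime budget, here is a minimal arithmetic sketch. The 100-day data-taking period is a hypothetical figure chosen only for illustration; the slides do not state the length of a data-taking period.

```python
def allowed_downtime_hours(uptime_fraction: float, period_days: float) -> float:
    """Hours of downtime permitted over a period at a given average uptime."""
    if not 0.0 <= uptime_fraction <= 1.0:
        raise ValueError("uptime_fraction must be between 0 and 1")
    return (1.0 - uptime_fraction) * period_days * 24.0


# Hypothetical 100-day data-taking period (an assumption, not from the slides).
budget = allowed_downtime_hours(0.99, 100)
print(f"Downtime budget: {budget:.1f} hours")
```

Under that assumed period, a single outage of a day would already use up the whole budget, which is why incidents have to be dealt with quickly.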
Security Incident Response
• Joint (LCG/EGEE) Security Policy Group (JSPG) & EGEE Operational Security Coordination Team (OSCT)
• Based the Security Incident Response Policy and procedures on the work of Open Science Grid
• Agreement on Incident Response: see https://edms.cern.ch/document/428035/
• Sites must
  • Take local action to prevent disruption
  • Report to local security officers
  • Report to others via the Grid Incident Response mailing list
• “Volunteer” incident response team created when needed
Incident classification
• High: (team leader required)
  • The incident could lead to exploitation of the trust fabric, i.e. user and host identities, or the incident could lead to instability of the overall Grid, or a denial-of-service is in progress against all replicas of a given Grid service.
• Medium: (team leader required if widespread)
  • The incident affects an instance of a Grid service, but Grid stability is not at risk, or a denial-of-service affects one replica of a given Grid service, or a local attack compromised a privileged user account.
• Low: (team leader probably not required)
  • A local attack compromised individual user, non-privileged credentials, or a denial-of-service attack or compromise affects only local grid resources.
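The three classes above can be read as a simple decision rule. The following is a minimal sketch of that reading, not part of the agreed policy: the `Incident` fields, the `Severity` enum and both function names are hypothetical, introduced only to illustrate how the criteria combine.

```python
from dataclasses import dataclass
from enum import Enum


class Severity(Enum):
    HIGH = "high"
    MEDIUM = "medium"
    LOW = "low"


@dataclass
class Incident:
    # Hypothetical flags, one per criterion named on the slide.
    trust_fabric_at_risk: bool = False          # user/host identities exploitable
    grid_stability_at_risk: bool = False        # overall Grid could become unstable
    dos_on_all_replicas: bool = False           # DoS against every replica of a service
    service_instance_affected: bool = False     # one Grid service instance affected
    dos_on_one_replica: bool = False            # DoS against a single replica
    privileged_account_compromised: bool = False
    widespread: bool = False                    # affects many sites


def classify(incident: Incident) -> Severity:
    """Return the highest class whose criteria the incident meets."""
    if (incident.trust_fabric_at_risk
            or incident.grid_stability_at_risk
            or incident.dos_on_all_replicas):
        return Severity.HIGH
    if (incident.service_instance_affected
            or incident.dos_on_one_replica
            or incident.privileged_account_compromised):
        return Severity.MEDIUM
    # Everything else: non-privileged credentials or purely local resources.
    return Severity.LOW


def team_leader_required(incident: Incident) -> bool:
    """High: required. Medium: only if widespread. Low: probably not (treated as False)."""
    severity = classify(incident)
    if severity is Severity.HIGH:
        return True
    if severity is Severity.MEDIUM:
        return incident.widespread
    return False
```

Checking the High criteria first means an incident that matches several classes is always assigned the most severe one.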
Emergency procedures
• JSPG discussed this at its last meeting (Sep 2005)
  • Started from the point of view of security incidents
  • But quickly realised that other disasters are also likely, so should deal with these too
• Very early overview of the issues at this point
  • Certainly no plan yet
• Invite feedback from HEPiX
  • There must be lots of site-based plans
• JSPG will produce a draft emergency plan (and address policy issues)
• Grid Operations and OSCT will need to define the details
JSPG discussion topics
• What is the scope?
  • LCG vs EGEE?
  • Critical: Tier 0/1, data taking, data integrity
• Inter-site information flow
  • This is the critical point to be tackled
  • Users, Sys Admins and Managers
• External information
  • including interface(s) to the Press
• How do we keep the infrastructure operational?
  • Is this the aim?
• What do we take down?
  • And who decides?
• Can optical private networks remain up?
  • And are they sufficient for LCG data taking?
• How do we deal with Tier 2 problems?
LCG/EGEE Emergency Procedures
Denise Heagerty, CERN
When are emergency procedures required?
• Emergency procedures are required to cover the following cases:
  • Incident response plans cannot be followed: critical parts of the infrastructure are unavailable (e.g. mailing lists)
  • Incident response plans are inappropriate: e.g. large parts of the community beyond the security contacts need to be informed rapidly, or the incident communication channels are compromised
• Examples
  • Major power cut at Site A lasting several days
  • Cable cut removes network access to Site B
  • Major worm disrupts network access at Site C
  • Security incident blocks user access to accounts at Site D
  • Wide area exploit of the (homogeneous) security fabric
What is needed in an emergency?
• Out of band communication channels
  • Alternative service providers (Internet, telephony)
  • Alternative contact details (e-mail, chat, …)
  • Alternative technology
• Clear decision-making roles
  • There is no time for consensus during a crisis
  • Usual decision-making process needs to be bypassed
• Clear information flow and roles
  • For at least management, users, the press
  • Reduce the risk of miscommunication
• Disaster Recovery Plan
  • Definition of critical infrastructure to be kept running or repaired quickly
  • Dependencies and sequence must be clear for restoring services
  • Mailing lists (at CERN) are key to restoring communication
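The point about dependencies and sequence in the Disaster Recovery Plan can be made concrete with a small sketch: once the dependencies between services are written down, a restore order follows mechanically from a topological sort. The service names and the dependency map below are hypothetical examples, not an inventory of any real site; the sort uses `graphlib.TopologicalSorter` from the Python standard library (Python 3.9 or later).

```python
from graphlib import CycleError, TopologicalSorter

# Hypothetical map: each service lists the services it needs to be up first.
DEPENDENCIES = {
    "network": {"power"},
    "dns": {"network"},
    "mail_servers": {"dns"},
    "mailing_lists": {"mail_servers"},
    "web_servers": {"dns"},
}


def restore_order(dependencies: dict[str, set[str]]) -> list[str]:
    """Return services in an order where every dependency is restored first."""
    try:
        return list(TopologicalSorter(dependencies).static_order())
    except CycleError as exc:
        # A cycle means the written plan is inconsistent and must be fixed by hand.
        raise ValueError(f"circular dependency in recovery plan: {exc.args[1]}") from exc


if __name__ == "__main__":
    for step, service in enumerate(restore_order(DEPENDENCIES), start=1):
        print(f"{step}. restore {service}")
```

A cycle in the map is reported as an error rather than silently resolved, since a circular dependency is exactly the kind of problem a recovery plan should expose before an emergency.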
Some ideas to stimulate discussion
• Define an emergency advisory committee?
  • Members, mandate
  • Goal is to ensure rapid and appropriate decisions
• Assure information flow
  • E.g. update DNS servers to point to temporary (web) servers
  • Pre-record messages on telephone help services
• Prepare alternative communication channels
  • E.g. commercial conference call facilities
  • Alternative Internet providers (e-mail addresses, chat, phone, …)
• When do we return to normal Incident Response?
Final words
• LCG needs a written plan
  • Clear definition of roles
• Operations staff need to know what to do
  • Training
• The sites need to agree to policy and procedures
  • Recognise the powers of operations staff
• Sites already have their own internal plans
  • Now trying to extend to the Grid
• Feedback and advice are welcome!