Emergency Preparedness for LCG Network Security Incidents

Planning for LCG Emergencies
HEPiX, Fall 2005
SLAC, 13 October 2005
David Kelsey
CCLRC/RAL, UK
d.p.kelsey@rl.ac.uk

LHC Tier 0/1/2 Network Architecture David Kelsey, LCG Emergencies

Background
• Computing and networking are essential
• Tier 0 (CERN) and 12 Tier 1s are critical for data taking
  • 10 Gbps optical private link to each T1
  • The T1s collectively keep a second copy of the raw data
  • The T1s play a vital role in (re)processing and providing access to derived data
• During data taking, can cope with a Tier 0 - Tier 1 link down for 12 hours to < few days. All T1s down – very bad!
• LCG MoU requires average T1 uptime during data taking: 99%
• LCG TDR says
  • “Special attention needs to be paid to the security aspects of the Tier-0, the Tier-1s and their network connections to maintain these essential services during or after an incident so as to reduce the effect on LHC data taking.”
• LCG is also essential for analysis
  • Need to keep the Grid running at all times
• Therefore must deal quickly with incidents
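To put the 99% average uptime figure in perspective, the arithmetic can be sketched as below. This is a minimal illustration only: the 100-day period is an assumed value chosen for round numbers, not a figure from the MoU, and the function name is hypothetical.

```python
def downtime_budget_hours(period_days: float, uptime_target: float) -> float:
    """Hours of downtime permitted over a period at a given average uptime target."""
    if not 0.0 <= uptime_target <= 1.0:
        raise ValueError("uptime_target must be a fraction between 0 and 1")
    return period_days * 24.0 * (1.0 - uptime_target)


# Assumption: a 100-day data-taking period (illustrative, not from the MoU).
budget = downtime_budget_hours(period_days=100, uptime_target=0.99)
print(f"Downtime budget: {budget:.1f} hours")
```

Under that assumption the whole budget is about one day, so a single incident lasting a few days at one T1 would exceed it on its own.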

Security Incident Response
• Joint (LCG/EGEE) Security Policy Group & EGEE Operational Security Coordination Team
  • Based Security Incident Response Policy and procedures on the work of Open Science Grid
• Agreement on Incident Response
  • See https://edms.cern.ch/document/428035/
• Sites must
  • Take local action to prevent disruption
  • Report to local security officers
  • Report to others via Grid Incident Response mail list
• “Volunteer” incident response team created when needed

Incident classification
• High: (team leader required)
  • The incident could lead to exploitation of the trust fabric, i.e. user and host identities, or the incident could lead to instability of the overall Grid, or a denial-of-service is in progress against all replicas of a given Grid service.
• Medium: (team leader required if widespread)
  • The incident affects an instance of a Grid service, but Grid stability is not at risk, or a denial-of-service affects one replica of a given Grid service, or a local attack compromised a privileged user account.
• Low: (team leader probably not required)
  • A local attack compromised individual user, non-privileged credentials, or a denial-of-service attack or compromise affects only local grid resources.
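The three classes above amount to a small decision rule, which could be sketched as follows. This is an illustrative encoding only, not part of the agreed policy: the `Incident` fields, `classify` and `team_leader_required` are hypothetical names, each flag mirroring one criterion from the slide.

```python
from dataclasses import dataclass
from enum import Enum


class Severity(Enum):
    HIGH = "high"
    MEDIUM = "medium"
    LOW = "low"


@dataclass
class Incident:
    # Hypothetical flags, one per criterion named in the classification.
    trust_fabric_at_risk: bool = False        # user or host identities exploitable
    grid_stability_at_risk: bool = False      # instability of the overall Grid
    dos_all_replicas: bool = False            # DoS against all replicas of a service
    service_instance_affected: bool = False   # one instance of a Grid service hit
    dos_one_replica: bool = False             # DoS against a single replica
    privileged_account_compromised: bool = False
    widespread: bool = False                  # affects many sites


def classify(incident: Incident) -> Severity:
    """Return the highest class whose criteria the incident meets."""
    if (incident.trust_fabric_at_risk
            or incident.grid_stability_at_risk
            or incident.dos_all_replicas):
        return Severity.HIGH
    if (incident.service_instance_affected
            or incident.dos_one_replica
            or incident.privileged_account_compromised):
        return Severity.MEDIUM
    # Everything else: local, non-privileged compromise or local-only DoS.
    return Severity.LOW


def team_leader_required(incident: Incident) -> bool:
    """High: always. Medium: only if widespread. Low: 'probably not', so False here."""
    severity = classify(incident)
    if severity is Severity.HIGH:
        return True
    if severity is Severity.MEDIUM:
        return incident.widespread
    return False


example = Incident(dos_one_replica=True, widespread=True)
print(classify(example).value, team_leader_required(example))
```

Checking the High criteria first means an incident that matches several classes is always reported at the most severe one.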

Emergency procedures
• JSPG discussed this at last meeting (Sep 2005)
  • Started from point of view of Security incidents
  • But quickly realised that other disasters are also likely, so should deal with these too
• Very early overview of the issues at this point
  • Certainly no plan yet
• Invite feedback from HEPiX
  • There must be lots of site-based plans
• JSPG will produce a draft emergency plan (and address policy issues)
• Grid Operations and OSCT will need to define the details

JSPG discussion topics
• What is the scope?
  • LCG vs EGEE?
  • Critical: Tier 0/1, data taking, data integrity
• Inter-site information flow
  • This is the critical point to be tackled
  • Users, Sys Admins and Managers
• External information
  • including interface(s) to the Press
• How do we keep the infrastructure operational?
  • Is this the aim?
  • What do we take down?
  • And who decides?
• Can optical private networks remain up?
  • And are they sufficient for LCG data taking?
• How do we deal with Tier 2 problems?

LCG/EGEE Emergency Procedures Denise Heagerty CERN

When are emergency procedures required?
• Emergency procedures are required to cover the following cases:
  • Incident response plans cannot be followed: critical parts of the infrastructure are unavailable (e.g. mailing lists)
  • Incident response plans are inappropriate: e.g. need to rapidly inform large parts of the community beyond the security contacts, or incident communication channels are compromised
• Examples
  • Major power cut at Site A lasted several days
  • Cable cut network access to Site B
  • Major worm disrupted network access at Site C
  • Security incident blocks user access to accounts at Site D
  • Wide area exploit of the (homogeneous) security fabric

What is needed in an emergency?
• Out of band communication channels
  • Alternative service providers (Internet, telephony)
  • Alternative contact details (e-mail, chat, …)
  • Alternative technology
• Clear decision-making roles
  • There is no time for consensus during a crisis
  • Usual decision-making process needs to be bypassed
• Clear information flow and roles
  • For at least management, users, the press
  • Reduce the risk of miscommunication
• Disaster Recovery Plan
  • Definition of critical infrastructure to be kept running or repaired quickly
  • Dependencies and sequence must be clear for restoring services
  • Mailing lists (at CERN) are key to restoring communication
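The point about dependencies and restore sequence can be illustrated with a topological sort. The sketch below uses Python's standard `graphlib` module (Python 3.9+). The service names and their dependencies are assumptions made up for the example, apart from the mailing lists mentioned above; a real plan would take them from each site's own inventory.

```python
from graphlib import TopologicalSorter

# Hypothetical dependency map: each service lists what must be up before it.
dependencies = {
    "network": set(),
    "dns": {"network"},
    "mailing_lists": {"network", "dns"},
    "web_status_page": {"network", "dns"},
    "grid_services": {"network", "dns", "mailing_lists"},
}


def restore_order(deps: dict[str, set[str]]) -> list[str]:
    """Return a restore sequence in which every service follows its dependencies.

    Raises graphlib.CycleError if the dependencies are circular, which would
    itself be a finding worth fixing in the recovery plan.
    """
    return list(TopologicalSorter(deps).static_order())


order = restore_order(dependencies)
print(" -> ".join(order))
```

Several valid orders may exist; the sort only guarantees that no service is restored before something it depends on.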

Some ideas to stimulate discussion
• Define an emergency advisory committee?
  • Members, mandate
  • Goal is to ensure rapid and appropriate decisions
• Assure information flow
  • E.g. update DNS servers to point to temporary (web) servers
  • Pre-record messages on telephone help services
• Prepare alternative communication channels
  • E.g. commercial conference call facilities
  • Alternative Internet providers (e-mail addresses, chat, phone, …)
• When/do we return to normal Incident Response?

Final words
• LCG needs a written plan
  • Clear definition of roles
  • Operations staff need to know what to do
  • Training
• The sites need to agree to policy and procedures
  • Recognise the powers of operations staff
• Sites already have their own internal plans
  • Now trying to extend to the Grid
• Feedback and advice are welcome!
