LCG Incident Response
Ian Neilson, LCG Security Officer, Grid Deployment Group, CERN
GGF12 – 20 Sept 2004
Background
• LCG – Large Hadron Collider (LHC) Computing Grid
  • Computing environment for the 4 LHC experiments
    • ALICE, ATLAS, CMS, LHCb
  • LHC operation in 2007
  • Required: 12-14 PetaBytes/year of data, compute equivalent to 70,000 PCs
• LCG1/2003, LCG2/2003-4, EGEE
  • 70+ sites in Europe, USA, Asia, S. America …
  • 7000+ CPUs
  • 6000GB+ Storage
• Software certification, testing, deployment group
• Distributed GOCs
  • UK
    • http://goc.grid-support.ac.uk/gridsite/gocmain/monitoring/
  • Taiwan
    • http://goc.grid.sinica.edu.tw/goc/
www.cern.ch/lcg
Grid monitoring
EGEE - Enabling Grids for E-science in Europe
• 12 federations with 70 partner institutions
• 2 year + 2 project
• Operate a service grid facility for e-science
  • Initially built on LCG2 infrastructure
• Re-engineer a robust middleware layer
  • gLite
• Attract new users
  • Research and Industry
  • Broader focus than HEP: Biomedical, Earth Science …
www.cern.ch/egee
Policy – the Joint Security Group
• GOC Guides
• Incident Response
• Certification Authorities
• Audit Requirements
• Usage Rules
• Security & Availability Policy
• Application Development & Network Admin Guide
• User Registration
http://cern.ch/proj-lcg-security/documents.html
Incident Response Policy
• Agreement on Incident Response
  • June 2003 for LCG1
• What is an incident?
  • Security investigation causing service interruption
  • Suspected misuse of resources beyond site
  • “Reasonable possibility” of stolen credentials
    • Not to expire or be revoked within 3 days
• Classifications
  • Identity theft
  • Suspected / Probable / Confirmed
• Actions
  • Misuse / Enforcement / Restoration / Escalation
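The stolen-credential criterion on this slide can be read as a simple time check: a compromise counts as an incident only if the credential would still be valid more than 3 days from now. The following is a minimal sketch under that reading. The names `CredentialReport`, `Confidence`, and `credential_case_is_incident` are hypothetical and do not come from the LCG policy documents, and the policy itself may define the rule differently.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from enum import Enum
from typing import Optional


class Confidence(Enum):
    """Confidence levels named on the slide."""
    SUSPECTED = "suspected"
    PROBABLE = "probable"
    CONFIRMED = "confirmed"


# Assumption: "within 3 days" is measured from the time of the report.
GRACE = timedelta(days=3)


@dataclass
class CredentialReport:
    """Hypothetical record of a possibly stolen credential."""
    confidence: Confidence
    expires_at: datetime
    # Set only if a revocation is already scheduled.
    revocation_due_at: Optional[datetime] = None


def credential_case_is_incident(report: CredentialReport, now: datetime) -> bool:
    """Return True if the credential stays usable for more than 3 days."""
    end_of_validity = report.expires_at
    if report.revocation_due_at is not None:
        end_of_validity = min(end_of_validity, report.revocation_due_at)
    return end_of_validity - now > GRACE
```

A credential that expires, or is due to be revoked, within the 3-day window returns `False` here. Whether such a case still deserves local follow-up is a site decision that the sketch does not model.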
Incident Response - Communications
• Site enrolment collects 2 entries per site
  • Registration questionnaire
• Site Contacts mail list
  • Closed list of named individuals
  • email, telephone
• CSIRT mail list
  • List-of-lists (Open)
  • 1 entry per site
• Updated list circulated to contacts list as sites enrol
• Pointers to policy documents for responsibilities
• Channels
  • Users - local site contacts (& GOC)
  • Contacts - discussion and information exchange
  • CSIRT - incident notification, update
  • Roll-out - system administrators
Incident Response – management issues
• LCG “community” known at CERN, EGEE community is broader
• User enrolment is well controlled, site enrolment is not
  • Incomplete questionnaires
  • Personal instead of list
  • List instead of personal
  • Undeliverable addresses
  • Delayed delivery
  • Moderated delivery
  • Enrolment information not circulated
  • SPAM, SPAM, SPAM, SPAM
• Lists need active management!
• Can we “see” all the sites?
  • CERN/GOC view
  • VO “private” information systems
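Some of the enrolment problems listed on this slide can be caught mechanically when a site registers. The sketch below checks one site's entry (a named contact with email and telephone, plus a CSIRT list address) for missing fields, malformed addresses, and the same address being used for both roles. `SiteEnrolment` and `enrolment_problems` are hypothetical names for illustration, not part of any LCG or GOC tool, and the field layout is an assumption based on the two entries collected per site.

```python
import re
from dataclasses import dataclass
from typing import List

# Deliberately loose syntax check; it cannot prove an address is deliverable.
EMAIL_RE = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")


@dataclass
class SiteEnrolment:
    """Hypothetical record of the two contact entries collected per site."""
    site: str
    contact_email: str      # named individual
    contact_phone: str
    csirt_list_email: str   # team mail list


def enrolment_problems(entry: SiteEnrolment) -> List[str]:
    """Return human-readable problems found in one enrolment entry."""
    problems = []
    fields = {
        "contact_email": entry.contact_email,
        "contact_phone": entry.contact_phone,
        "csirt_list_email": entry.csirt_list_email,
    }
    # Incomplete questionnaire: any required field left blank.
    missing = [name for name, value in fields.items() if not value.strip()]
    if missing:
        problems.append("incomplete questionnaire: " + ", ".join(missing))
    # Addresses that cannot be valid are flagged before any mail is sent.
    for name in ("contact_email", "csirt_list_email"):
        value = fields[name].strip()
        if value and not EMAIL_RE.match(value):
            problems.append(f"malformed address in {name}")
    # One address for both roles suggests "personal instead of list" or the reverse.
    contact = entry.contact_email.strip().lower()
    csirt = entry.csirt_list_email.strip().lower()
    if contact and contact == csirt:
        problems.append("same address given for personal contact and CSIRT list")
    return problems
```

Checks like these cover only the form of an entry. Undeliverable, delayed, or moderated delivery can be found only by sending test messages, which is part of why the lists need active management.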
Incident response – operational issues
• Recognising and reporting
• What is a local CSIRT?
  • Scale of coverage
    • 24x7 site/campus network operations team
    • Department Security Officer
    • LCG system administrator
• Who is a security contact?
  • as above
• Intersection with local CSIRT procedures
  • Local quarantine and analysis
• Keeping emergency channels clear
  • Discussions, cross-postings
Incident response – near-term
• JSG, EGEE MWSG/JRA3, OSG, …
  • Site and VO registration policy and process
    • Control gathering, distribution and management of data
    • Sites need to understand requirements and responsibilities
    • Coverage, access, audit
    • Needs to be actively managed (? Self managed)
• Operational Security Co-ordination Team (OSCT)
  • Ownership of security incidents
    • From notification to resolution
    • Liaise with national/institute CERTs
  • Ownership of known problems
    • Liaise with development & deployment groups
  • Co-ordination of monitoring
  • Post-mortem analysis
  • Team of experts
Security Co-ordination
• How does OSCT map onto EGEE operations structures?
  • Resource Centres (lots)
  • Regional Operations Centres - ROC (~9)
  • Core Infrastructure Centres - CIC (~5)
  • Operations Management Centre - OMC (1)
• Co-ordination with Open Science Grid …
  • Adopt same co-ordinating model
2004 Security Service Challenges
• Objectives
  a) Evaluate the effectiveness of current procedures by simulating a small and well-defined set of security incidents.
  b) Use the experiences of a) in an iterative fashion (during the challenges) to update procedures.
  c) Formalise the understanding gained in a) & b) in updated incident response procedures.
  d) Provide feedback to middleware development and testing activities to inform the process of building security test components.
• Exercise response procedures in a controlled manner
  • Non-intrusive
    • Trace compute resource usage to owner
      • Run a job to send an email
    • Trace storage resource usage to owner
      • Run a job to store a file
  • Disruptive
    • Disrupt a service and map the effects on the service and grid
LCG/EGEE Incident Response
Thank You
Thank you to UK PPARC