Security in an Agile Infrastructure CERN Computer Security Team 2013/12/12
Why? In the “old” CC world, we had a good level of security (with some areas of “bricolage” and suboptimal configurations). Many have worked hard to keep this level up. This has spared us from major incidents in recent years. And from major consequences.
Well done so far. Still, we have: • On average, there was one root compromise per month in the WLCG during the last ten years. Passwords are harvested wholesale (“en gros”). These compromises periodically affected CERN users; CERN power users/admins are as exposed as anybody else.
The Consequences (I)
• Loss of reputation: Downtimes can have adverse effects (one minute of AFS downtime already stirred up noise within BE in 2011)
• Damage to data: Misconfigurations or successful attacks can lead to data deletion (as happened just recently with 15 TB of CMS EOS data lost)
• Manipulation of source code or configuration files: A compromise of Git or PuppetDB can compromise our whole software base. The cost of going through all recent commits and code changes is enormous (fortunately CERN has been spared so far, but Linux itself was hit hard in 2011)
• Reinstallation of services: Even in an Agile Infrastructure, reinstallation requires lots of due diligence, thorough testing and the involvement of the whole service team (LXPLUS/LXBATCH are regularly reinstalled after critical vulnerabilities have been published)
The Consequences (II)
• Change of passwords: Password exposure always requires re-initialisation of passwords and, as the many dependencies are not always understood, requires care (changing Oracle passwords is always a challenge for the DB owners and for BE/CO)
• Change of secrets: Every secret that leaks needs to be changed. Some are “easily” renewable, others are not (e.g. the LXPLUS SSHD private key)
• Forensics: Investigating the cause of an incident requires the involvement of the Security Team as well as of the affected service providers. This can take time and usually triggers other costs (like reinstallation and password changes)
Restart. The new Agile Infrastructure is a game changer. Lots of new tools, lots of fringe effects, lots of paradigm changes. Partially we gain control, partially we lose control. Are we back to square one?
New Security Needs Vincent (IT/CSO)
Some Working Areas
• Puppet “facts” executed as root: Puppet module owners (150+ from all over CERN) can run arbitrary commands as root on any host just by defining “facts” (see the sketch after this slide). JIRA AI-3081 by Gavin. Pursuing upstream; assessing the cost of fixing it ourselves in parallel.
• Secrets exposed by Git: ai-admins owning a VM (170+ from all over CERN) can access the secret keys and passwords used by any other host, as hiera-gpg is inherently broken. JIRA AI-3266, with Ben & Vincent going for “Teigi” at probably low cost.
• Puppet environment hijack: ai-admins can add an “override” to existing environments and become root. A git hook might help; under discussion.
Given that the usage of Puppet within CERN is bound to grow, the more systems and functionality are added, the more shared modules will be created and the more people will be added to these e-groups. For sure, these won’t only be members of the IT department. And for sure, some have already left their AI role again…
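To illustrate the first point: Facter executes every “external fact” it finds under /etc/facter/facts.d/ when the puppet agent collects facts, i.e. as root. A minimal, deliberately harmless sketch follows; the file name, fact name and touched path are invented for illustration and are not taken from AI-3081.

    #!/usr/bin/env python
    # Hypothetical external fact, e.g. shipped by a module into
    # /etc/facter/facts.d/site_role.py. Facter executes it whenever the
    # puppet agent runs, i.e. with root privileges.
    import subprocess

    # The visible part: print "key=value" lines that become facts.
    print("site_role=batch")

    # The problem: nothing prevents the same script from running an
    # arbitrary command first -- here a harmless marker file, but it could
    # just as well install an SSH key or read secrets.
    subprocess.call(["/bin/touch", "/root/created_by_a_fact"])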
The Threat
• Finger trouble by any ai-admin when modifying facts can impact a multitude of CC services (it would not be the first time that unintentional mistakes created downtime)
• Frustrated colleagues would have an open door to take revenge against the IT department, an experiment or CERN (this has already happened, e.g. in BE and GS)
• Malicious attackers can take over the CC. All they need is the password of any ai-admin and the patience to understand our environment (passwords are regularly lost to sophisticated attacks at CERN and other sites)
Given that it takes an average of six months before incidents are discovered, if any CERN account with admin rights is abused, a complete reinstallation of the CC will become mandatory to be sure that the incident is contained and resolved.
Reflect. How much Agile Security do we want? While we might all agree that having too much security is not optimal and probably too expensive, having too little security implies significant hidden costs to be borne by the IT department and CERN should a compromise occur.
Act! Maintain an Agile security footprint like the one we had for the “old” CC. A new infrastructure comes with new security needs (and with new costs). We must continue to invest in order to be spared from major incidents. And from major consequences.
In the Short Term
Meet regularly to discuss/decide on Agile Security measures
• Already started bottom-up, but needs to be widened in scope.
• See also the next slides.
Harden hosts under our control
• Default toggle “on” of tight iptables, RPM verify and syslog.
• Preferred “on” for netlog & Snoopy (but we have some digestion troubles ATM).
Identify who needs full default access to all nodes
• (PF or FP will panic if they learn that “most of IT” has access to their data)
• Procurement? 1st-line HW support? CC operators? CC admins?
• Align with the CC access policy and require a justified & permanent need.
• How is IPMI/root access done? P2P? SSH via [LX|LXVO|AI]ADM (as in the past)?
Deploy 2FA AuthN for Foreman/Judy and AIADMs (a minimal TOTP sketch follows after this slide)
• Required for everyone? Only super-users? How to deal with API calls?
• Both Intranet/Internet, or only Internet? Does that make sense?
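Regarding the 2FA item above, here is a minimal sketch of a time-based one-time-password (TOTP, RFC 6238) check, one possible building block for a second factor. The base32 secret, 6 digits, 30-second step and SHA-1 are the common defaults, assumed here for illustration; this is not a description of whatever is eventually deployed for Foreman/AIADM.

    #!/usr/bin/env python
    # Minimal TOTP (RFC 6238) sketch -- one possible building block for 2FA.
    import base64
    import hashlib
    import hmac
    import struct
    import time

    def totp(secret_b32, step=30, digits=6, now=None):
        """Return the one-time password for the current time window."""
        key = base64.b32decode(secret_b32, casefold=True)
        counter = int(now if now is not None else time.time()) // step
        msg = struct.pack(">Q", counter)                   # 8-byte big-endian counter
        digest = hmac.new(key, msg, hashlib.sha1).digest()
        offset = ord(digest[-1:]) & 0x0F                   # dynamic truncation (RFC 4226)
        code = struct.unpack(">I", digest[offset:offset + 4])[0] & 0x7FFFFFFF
        return str(code % (10 ** digits)).zfill(digits)

    def verify(secret_b32, candidate):
        """Accept a code only if it matches the current window."""
        return hmac.compare_digest(totp(secret_b32), candidate)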
In the Long Run: Agile Infrastructure
Review the security & perform an impact analysis of all Agile components & tools
• (5 remote code execution vulnerabilities for Puppet were reported in the 1st half of 2013!)
• Discuss findings in the Thursday meetings; turn valid ones into JIRA tickets.
Compartmentalize access to hostgroups & modules
• Reduce the wide rights of “ai-admins”.
• Move towards a system of least privilege where hostgroup admins can do everything to their hostgroups, but nothing outside…
• …and where modules can only be altered by their admins.
Have traceability on all(?) actions
Deploy a logging infrastructure for 100,000+ VMs (a minimal sketch follows after this slide)
• Elasticsearch and Storm to the rescue (?!).
• Can we really scale up to such a size?
• Pending with the Security Team, but help/ideas appreciated!
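For the logging item above, a hedged sketch of the simplest possible building block: indexing one audit event into Elasticsearch over its HTTP document API with the third-party “requests” library. The host name, index/type names and event fields are invented for illustration; whether such a pipeline scales to 100,000+ VMs is exactly the open question raised on the slide.

    #!/usr/bin/env python
    # Hedged sketch: push a single traceability event into Elasticsearch.
    # Endpoint, index name and field names are placeholders, not the real
    # CERN setup (which would sit behind a proper log-shipping pipeline).
    import datetime
    import json

    import requests

    event = {
        "timestamp": datetime.datetime.utcnow().isoformat() + "Z",
        "hostgroup": "bi/batch/worker",           # illustrative value
        "user": "jdoe",                           # who did it
        "action": "puppet_environment_override",  # what was done
        "source_ip": "188.184.0.1",
    }

    # POST to <index>/<type>/ lets Elasticsearch assign a document ID.
    resp = requests.post(
        "http://es.example.cern.ch:9200/ai-audit/event/",  # hypothetical endpoint
        data=json.dumps(event),
        headers={"Content-Type": "application/json"},
    )
    resp.raise_for_status()
    print(resp.json())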
In the Long Run: OpenStack
Review the security of CERN’s OpenStack deployment
• Perform a gap analysis against http://docs.openstack.org/sec/
• Prioritize, agree, deploy.
• Identify areas of improvement.
• Provide a Security Baseline for hypervisors and VMs (5 “important” CVEs for KVM in 2013!)
Keep traceability of user-controlled VMs (see the sketch after this slide)
• Log contact information, date of usage, assigned IP, …
• Have snapshots of VMs at start, end, in between (?!) and archive them.
Deploy VM segregation
• Separate critical services from user-controlled VMs (avoid having mail, EDH, DFS on the same hypervisor as VO boxes).
• Keep homogeneous services (like LXBATCH) together (in order to simplify outer-perimeter firewall & gate rules).
• LCG vs. GPN.
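For the VM traceability item above, a sketch of what recording owner, creation date and assigned addresses per VM could look like against the OpenStack Nova v2 REST API. The endpoint, the Keystone token handling and the output destination are placeholders, not the actual AI tooling; mapping user_id to contact information (e.g. via Keystone or e-groups) is not shown.

    #!/usr/bin/env python
    # Hedged sketch: dump per-VM traceability data (owner, creation date,
    # addresses) from the Nova v2 REST API. Endpoint and token are
    # placeholders.
    import json

    import requests

    NOVA_ENDPOINT = "https://openstack.example.cern.ch:8774/v2/TENANT_ID"  # placeholder
    TOKEN = "KEYSTONE_TOKEN"                                               # placeholder

    resp = requests.get(
        NOVA_ENDPOINT + "/servers/detail",
        headers={"X-Auth-Token": TOKEN, "Accept": "application/json"},
    )
    resp.raise_for_status()

    for server in resp.json()["servers"]:
        record = {
            "vm": server["name"],
            "owner": server["user_id"],
            "created": server["created"],
            "addresses": server["addresses"],
        }
        print(json.dumps(record))  # in reality: ship to the audit log / archive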
John, please answer me! Thanks!