ai-config-team report 28/08/2014
Puppet run incident
• What we know:
• Puppet runs start to fail when a puppetdb query in base starts timing out
• Puppetdb postgres backend maxes out CPU, with this one class of query responsible for the majority of the load
• Load balancers become overloaded with queued requests
• Spiral of death: LB stops responding to lbd, DNS entry removed, ENC not reachable, comes back, puppetdb replace_facts storm, PDB slows to a crawl, repeat
ai-pdb raw /v3/facts --query '["and", ["in", "certname", ["extract", "certname", ["select-facts", ["and", ["=", "name", "hostgroup_2"], ["=", "value", "adm"]]]]], ["in", "certname", ["extract", "certname", ["select-facts", ["and", ["=", "name", "hostgroup_0"], ["=", "value", "bi"]]]]], ["in", "certname", ["extract", "certname", ["select-facts", ["and", ["=", "name", "hostgroup_1"], ["=", "value", "inter"]]]]]]'
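The nesting in the query above is easier to see when it is built programmatically. The following is a minimal sketch, not the team's tooling: the helper names `fact_subquery` and `hostgroup_query` are hypothetical, and the code only assembles the JSON query string that the `ai-pdb` command above takes as its `--query` argument. It shows that each hostgroup condition contributes its own `select-facts` subquery, so a three-level hostgroup match means three subqueries against the facts data.

```python
import json


def fact_subquery(name, value):
    """Clause matching certnames that have fact `name` equal to `value`.

    Hypothetical helper: mirrors the structure of the v3 query shown above.
    """
    return [
        "in", "certname",
        ["extract", "certname",
         ["select-facts",
          ["and", ["=", "name", name], ["=", "value", value]]]],
    ]


def hostgroup_query(facts):
    """AND together one subquery per (fact name, value) pair."""
    return ["and"] + [fact_subquery(name, value) for name, value in facts]


# Same conditions, in the same order, as the query on the slide.
query = hostgroup_query([
    ("hostgroup_2", "adm"),
    ("hostgroup_0", "bi"),
    ("hostgroup_1", "inter"),
])
query_json = json.dumps(query)
print(query_json)
```

The printed string can be passed as the `--query` argument in place of the hand-written JSON.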
Actions and plans
• CRM-623: remove the "allow ssh from aiadm" rule, which included: $aiadm_nodes = query_nodes('hostgroup_0="bi" and hostgroup_1="inter" and hostgroup_2="adm"', ipaddress)
• Reduced number of fact names (thanks Dan), cleaned up foreman (thanks Nacho)
• Longer term: reduce amplification effect from load balancers
• Read-only puppetdb for API access
Things we don’t know
• What triggers the problem? Normally load on the db is minimal
• Perhaps updating new facts? New fact name across lots of plant this week. Looking at previous events
• Will engage upstream, but we are behind on puppetdb versions due to dependencies
Other activity
• Postgres dbod slave for puppetdb
  • so far no stable replication
• Updates to puppetdb & postgres modules to support r/o puppetdb
• Raising issues with upstream for foreman issues with hostgroup filtering in new version
• New teigi::secret::sub_file type: testers required
In QA
• CRM-401 Add an option to enable UDT for gridftp servers
• CRM-567 Smartd Puppet Module
• CRM-575 Add smartd to the base pluginsync whitelist
• CRM-576 Including the smartd module into the hardware module
• CRM-577 Deploy blockdevice driver monitoring in QA for EL5 and EL6
• CRM-611 Update of site.pp to support 10-deep hostgroup
• CRM-613 Drop alarmed fact from sapp_puppetmaster
• CRM-615 Removing megacli and adding storcli for vendors transtec and viglen
• CRM-620 New cern_hwcontract function to extract contractid from hwdb cache
• CRM-622 New 'ssds' facts
• CRM-623 Emergency backout of allow ssh from aiadm
QA-Prod
• CRM-591 Do not clobber ADFS-metadata.xml with puppet
• CRM-595 Enable buildMap="1" for new (3.5) shibboleths when memcache is enabled
• CRM-604 facter 1.7.4 -> 1.7.6 upgrade
• CRM-605 Upgrade mcollective filemgr, package and service plugins
• CRM-606 Add fact to expose the tenant name
• CRM-607 Drop active installation nrpe mcollective plugin
• CRM-608 New Redhat/7.yaml hiera file
• CRM-609 Add CentOS (7) support to osrepos
• CRM-610 CentOS as valid OS name
• CRM-612 Update of hiera config to support 10-deep hostgroups
• CRM-616 ai-tools 8.2-1
• CRM-617 Update module to upstream version 1.7.9
• CRM-618 RHEL5 repo fixes for osrepos