An overview of the in-depth monitoring system built for the OpenStack services of the Cancer Genome Collaboratory, using open-source tools and custom checks to ensure performance and reliability.
In-depth monitoring for OpenStack services George Mihaiescu, Senior Cloud Architect Jared Baker, Cloud Specialist
The infrastructure team George Mihaiescu • Cloud architect for the Cancer Genome Collaboratory • 7 years of OpenStack experience • First deployment - Cactus • First conference - Boston 2011 • OpenStack speaker at the Barcelona, Boston and Vancouver conferences Jared Baker • Cloud specialist for the Cancer Genome Collaboratory • 2 years of OpenStack experience • 10 years of MSP experience • First deployment - Liberty • First conference (and first time speaking) - Boston 2017
Ontario Institute for Cancer Research (OICR) • Largest cancer research institute in Canada, funded by the government of Ontario • Together with its collaborators and partners, supports more than 1,700 researchers, clinician scientists, research staff and trainees
Cancer Genome Collaboratory Project goals and motivation • Cloud computing environment built for biomedical research by OICR, and funded by government of Canada grants • Enables large-scale cancer research on the world's largest cancer genome dataset, currently produced by the International Cancer Genome Consortium (ICGC) • Built entirely with open-source software like OpenStack and Ceph • Compute infrastructure goal of 3,000 cores and 15 PB of storage • A system for cost recovery
No frills design • Use high density commodity hardware to reduce physical footprint & related overhead • Use open source software and tools • Prefer copper over fiber for network connectivity • Spend 100% of the hardware budget on the infrastructure that supports cancer research, not on licenses or “nice to have” features
OpenStack controllers • Three controllers in HA configuration (2 x 6-core CPUs, 128 GB RAM, 6 x 200 GB Intel S3700 SSD drives) • Separate partitions for OS, Ceph Mon and MySQL • HAProxy (SSL termination with ECC certs) and Keepalived • 4 x 10 GbE bonded interfaces, 802.3ad, layer 3+4 hash • Neutron + GRE, HA routers, no DVR
Networking • Ruckus ICX 7750-48C top-of-rack switches configured in a stack ring topology • 6 x 40Gb Twinax cables between the racks, providing 240 Gbps non-blocking redundant connectivity (2:1 oversubscription ratio)
Rally – end-to-end tests A Rally test runs every hour and does an end-to-end check: • Starts a VM • Assigns a floating IP • Connects over SSH • Pings an external host five times We alert if the check fails, takes too long to complete, or packet loss is greater than 40%. It sends runtime info to Graphite for long-term graphing, and Grafana alerts us if the average runtime is above a threshold.
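The packet-loss part of the hourly check can be sketched as below. This is a minimal illustration, not the Collaboratory's actual Rally plugin: it assumes the standard Linux `ping` summary line format, and the host name and 40% threshold mirror the slide.

```python
import re
import subprocess

LOSS_THRESHOLD = 40.0  # alert if packet loss exceeds 40%, as in the Rally check


def packet_loss_pct(ping_output: str) -> float:
    """Extract the packet-loss percentage from a Linux `ping` summary line."""
    match = re.search(r"([\d.]+)% packet loss", ping_output)
    if match is None:
        raise ValueError("no packet-loss summary found in ping output")
    return float(match.group(1))


def check_host(host: str, count: int = 5) -> bool:
    """Ping `host` `count` times and return True if loss is within the threshold."""
    result = subprocess.run(
        ["ping", "-c", str(count), host], capture_output=True, text=True
    )
    return packet_loss_pct(result.stdout) <= LOSS_THRESHOLD
```

In the real check this would run from inside the freshly booted VM (over the SSH session Rally opens), with the result pushed to Graphite alongside the runtime.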
Rally – RBD volume performance test Another Rally check monitors RBD volume (Ceph-based) write performance over time: • it boots an instance from a volume • it assigns a floating IP • it connects over SSH • it runs a script that writes a 10 GB file three times • it captures the average IO throughput at the end • it sends throughput info to Graphite for long-term graphing • it alerts if the average runtime is above the threshold
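The last two steps (averaging the three runs and shipping the result to Graphite) can be sketched as follows. The metric name and Graphite host are hypothetical; the plaintext protocol line format (`metric value timestamp`) is Graphite's standard ingest format on port 2003.

```python
import socket
import time

GRAPHITE_HOST = "graphite.example.org"  # hypothetical host, not the Collaboratory's
GRAPHITE_PORT = 2003  # Graphite plaintext-protocol port


def average_throughput(runs_mb_s):
    """Average write throughput (MB/s) over the three 10 GB write runs."""
    return sum(runs_mb_s) / len(runs_mb_s)


def graphite_line(metric, value, timestamp=None):
    """Format one sample in Graphite's plaintext protocol: 'metric value timestamp'."""
    if timestamp is None:
        timestamp = int(time.time())
    return f"{metric} {value} {timestamp}\n"


def send_to_graphite(metric, value):
    """Push a single sample to Graphite over TCP."""
    with socket.create_connection((GRAPHITE_HOST, GRAPHITE_PORT), timeout=5) as sock:
        sock.sendall(graphite_line(metric, value).encode())
```

A run would then look like `send_to_graphite("rally.rbd.write_mb_s", average_throughput([t1, t2, t3]))`, with Grafana alerting on the resulting series.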
Rally smoke tests & load tests
Dockerized monitoring stack We run a number of tools in containers: • sFlowTrend • Prometheus • Graphite • collectd • Grafana • ceph_exporter • Elasticsearch • Logstash • Kibana
Ceph Monitoring IOPS
Ceph Monitoring Performance & Integrity
Zabbix • 200+ hosts • 38,000+ items • 15,000+ triggers • Performant • Reliable • Customizable https://github.com/CancerCollaboratory/infrastructure
Zabbix The Zabbix Agent (client) • CPU • Disk I/O • Memory • Filesystem • Security • Services running • HW Raid card • Fans, temperature, power supply status • PDU power usage
Zabbix Custom checks • When security updates are available • When new cloud images are released • Number of IPs banned by fail2ban • Iptables rules across all controllers are in sync • Openvswitch ports tagged with VLAN 4095 (bad) • Number of Cinder volumes != number of RBD volumes • Aggregate memory use per process type (e.g. nova-api, radosgw, etc.) • Compute nodes have the "neutron-openvswi-sg-chain" chain
openstack volume list --all -f value -c ID > /tmp/rbdcindervolcompare
rbd -p volumes ls | sed "s/volume-//" >> /tmp/rbdcindervolcompare
sort /tmp/rbdcindervolcompare | uniq -u
Zabbix OpenStack APIs • Multiple checks per API: • Is the process running? • Is the port listening? • Internal checks (from each controller) • External checks (from the monitoring server) • Memory usage aggregated per process type • Response time, number and type of API calls
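The "is the port listening" and response-time checks above boil down to a timed TCP connect. A minimal sketch, assuming hypothetical controller endpoints (Zabbix would typically wrap something like this in a UserParameter or use its built-in `net.tcp.service` item):

```python
import socket
import time


def port_listening(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def response_time(host: str, port: int):
    """Time the TCP connect; returns (reachable, seconds_elapsed)."""
    start = time.monotonic()
    ok = port_listening(host, port)
    return ok, time.monotonic() - start
```

Running this both from each controller (internal check) and from the monitoring server (external check), as the slide describes, catches cases where an API is up locally but unreachable through the load balancer.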
Zabbix OpenStack services memory usage
Zabbix Neutron router traffic
Zabbix Capacity planning • Total/Used vCPU • Total/Used vRAM • # of Instances • Internet traffic
Zabbix alerting • Alerts to Slack, Email, etc • Very configurable • Multiple channels • Emojis! https://github.com/ericoc/zabbix-slack-alertscript
ELK & Filebeat Embracing the chaos of logs • Powerful search • Fast • Meaningful visualizations • Great documentation
ELK & Filebeat Dashboards to suit your needs
ELK & Filebeat Filebeat tags Tagging at the source:
- type: log
  paths:
    - /var/log/glance/*.log
  tags: ["glance", "openstack"]
- type: log
  paths:
    - /var/log/heat/*.log
  tags: ["heat", "openstack"]
  exclude_lines: ['DEBUG']
Kibana search by tags
ELK & Filebeat Monitoring the OpenStack dashboard
ELK & Filebeat Alerting Log entry:
[Mon Apr 23 14:18:34 2018] [pid 1736219] Login successful for user "admin", remote address 64.231.26.191
Logstash output plugin:
if [source] == "/var/log/apache2/access.log" and [login_status] =~ "successful" and [user] =~ "admin" and !([clientip] =~ "206.108.177.10") {
  email {
    from => "alert-server@domain.com"
    to => "administrators@domain.com"
    subject => "ALERT! Openstack Admin account was logged in from outside expected IP space"
    body => "ALERT! Openstack Admin account was logged in from outside expected IP space: %{message}"
    via => "smtp"
    address => "smtp.domain.com"
    port => "587"
    username => "username"
    password => "password"
    authentication => "plain"
    use_tls => true
  }
}
Monitoring for users https://github.com/CancerCollaboratory/webstatus-update
Lessons learned • If something needs to be running, test it • Be generous with your specs for the monitoring and control plane (more RAM and CPU than you think you will need) • Monitor RAM usage aggregated per process type • If your router should run in HA, verify there is ONLY ONE active agent and ONLY ONE standby • Have a check for the metadata NAT rule inside the router's namespace • It's possible to run a stable and performant OpenStack cluster with few but qualified people, as long as you carefully design it and choose the most stable (and genuinely needed) OpenStack projects and configurations
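The metadata NAT rule check mentioned above can be sketched as a simple scan of `iptables-save -t nat` output, run inside the router namespace (e.g. `ip netns exec qrouter-<id> iptables-save -t nat`). This is an illustrative sketch, not the actual Zabbix check from the Collaboratory repo; it assumes the typical neutron-l3-agent rule that redirects 169.254.169.254:80 to the metadata proxy on port 9697.

```python
import re

# Typical neutron-l3-agent PREROUTING rule in the router namespace:
#   -d 169.254.169.254/32 ... --dport 80 -j REDIRECT --to-ports 9697
# If this rule is missing, instances in that router lose metadata access.
METADATA_RULE = re.compile(
    r"-d 169\.254\.169\.254(?:/32)? .*--dport 80 .*REDIRECT --to-ports 9697"
)


def has_metadata_rule(iptables_save_output: str) -> bool:
    """Return True if the metadata NAT REDIRECT rule is present in the dump."""
    return any(
        METADATA_RULE.search(line) for line in iptables_save_output.splitlines()
    )
```

A Zabbix custom check would run the namespace dump per HA router and alert when `has_metadata_rule` returns False on the active agent.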
Future plans • Add Magnum and Octavia, possibly Trove • Slowly migrate to a container-based control plane for OpenStack services, mainly for ease of upgrade • Build a bioinformatics SaaS solution, making the infrastructure easier to use for less experienced cancer researchers
Thank you • Discovery Frontiers: Advancing Big Data Science in Genomics Research program (grant no. RGPGR/448167-2013, ‘The Cancer Genome Collaboratory’) • Natural Sciences and Engineering Research Council (NSERC) of Canada • the Canadian Institutes of Health Research (CIHR), Genome Canada • the Canada Foundation for Innovation (CFI) • Ontario Research Fund of the Ministry of Research, Innovation and Science.
Contact Questions? George Mihaiescu george.mihaiescu@oicr.on.ca Jared Baker jared.baker@oicr.on.ca Github repo: https://github.com/CancerCollaboratory/infrastructure.git