An overview of the in-depth monitoring system built for the OpenStack services of the Cancer Genome Collaboratory, using open-source tools and custom checks to ensure performance and reliability.
In-depth monitoring for OpenStack services George Mihaiescu, Senior Cloud Architect Jared Baker, Cloud Specialist
The infrastructure team George Mihaiescu • Cloud architect for the Cancer Genome Collaboratory • 7 years of OpenStack experience • First deployment - Cactus • First conference - Boston 2011 • OpenStack speaker at the Barcelona, Boston and Vancouver conferences Jared Baker • Cloud specialist for the Cancer Genome Collaboratory • 2 years of OpenStack experience • 10 years of MSP experience • First deployment - Liberty • First conference (and first time speaking) - Boston 2017
Ontario Institute for Cancer Research (OICR) • Largest cancer research institute in Canada, funded by the government of Ontario • Together with its collaborators and partners, supports more than 1,700 researchers, clinician scientists, research staff and trainees
Cancer Genome Collaboratory Project goals and motivation • Cloud computing environment built for biomedical research by OICR, and funded by government of Canada grants • Enables large-scale cancer research on the world's largest cancer genome dataset, currently produced by the International Cancer Genome Consortium (ICGC) • Built entirely with open-source software like OpenStack and Ceph • Compute infrastructure goal of 3,000 cores and 15 PB of storage • A system for cost recovery
No frills design • Use high density commodity hardware to reduce physical footprint & related overhead • Use open source software and tools • Prefer copper over fiber for network connectivity • Spend 100% of the hardware budget on the infrastructure that supports cancer research, not on licenses or “nice to have” features
OpenStack controllers • Three controllers in HA configuration (2 x 6-core CPUs, 128 GB RAM, 6 x 200 GB Intel S3700 SSD drives) • Separate partitions for OS, Ceph Mon and MySQL • HAProxy (SSL termination with ECC certs) and Keepalived • 4 x 10 GbE bonded interfaces, 802.3ad, layer 3+4 hash • Neutron + GRE, HA routers, no DVR
Networking • Ruckus ICX 7750-48C top-of-rack switches configured in a stack ring topology • 6 x 40Gb Twinax cables between the racks, providing 240 Gbps non-blocking redundant connectivity (2:1 oversubscription ratio)
Rally – end-to-end tests A Rally test runs every hour and does an end-to-end check: • Starts a VM • Assigns a floating IP • Connects over SSH • Pings an external host five times We alert if the check fails, takes too long to complete, or packet loss is greater than 40%. It sends runtime info to Graphite for long-term graphing, and Grafana alerts us if the average runtime is above a threshold.
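The packet-loss part of the hourly check can be sketched as below. This is a minimal illustration, not the Collaboratory's actual Rally plugin: it assumes the standard Linux `ping` summary line format, and the host name and 40% threshold mirror the slide.

```python
import re
import subprocess

LOSS_THRESHOLD = 40.0  # alert if packet loss exceeds 40%, as in the Rally check


def packet_loss_pct(ping_output: str) -> float:
    """Extract the packet-loss percentage from a Linux `ping` summary line."""
    match = re.search(r"([\d.]+)% packet loss", ping_output)
    if match is None:
        raise ValueError("no packet-loss summary found in ping output")
    return float(match.group(1))


def check_host(host: str, count: int = 5) -> bool:
    """Ping `host` `count` times and return True if loss is within the threshold."""
    result = subprocess.run(
        ["ping", "-c", str(count), host], capture_output=True, text=True
    )
    return packet_loss_pct(result.stdout) <= LOSS_THRESHOLD
```

In the real check this would run from inside the freshly booted VM (over the SSH session Rally opens), with the result pushed to Graphite alongside the runtime.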
Rally – RBD volume performance test Another Rally check monitors RBD volume (Ceph-based) write performance over time: • it boots an instance from a volume • it assigns a floating IP • it connects over SSH • it runs a script that writes a 10 GB file three times • it captures the average IO throughput at the end • it sends throughput info to Graphite for long-term graphing • it alerts if the average runtime is above the threshold
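The last two steps (averaging the three runs and shipping the result to Graphite) can be sketched as follows. The metric name and Graphite host are hypothetical; the plaintext protocol line format (`metric value timestamp`) is Graphite's standard ingest format on port 2003.

```python
import socket
import time

GRAPHITE_HOST = "graphite.example.org"  # hypothetical host, not the Collaboratory's
GRAPHITE_PORT = 2003  # Graphite plaintext-protocol port


def average_throughput(runs_mb_s):
    """Average write throughput (MB/s) over the three 10 GB write runs."""
    return sum(runs_mb_s) / len(runs_mb_s)


def graphite_line(metric, value, timestamp=None):
    """Format one sample in Graphite's plaintext protocol: 'metric value timestamp'."""
    if timestamp is None:
        timestamp = int(time.time())
    return f"{metric} {value} {timestamp}\n"


def send_to_graphite(metric, value):
    """Push a single sample to Graphite over TCP."""
    with socket.create_connection((GRAPHITE_HOST, GRAPHITE_PORT), timeout=5) as sock:
        sock.sendall(graphite_line(metric, value).encode())
```

A run would then look like `send_to_graphite("rally.rbd.write_mb_s", average_throughput([t1, t2, t3]))`, with Grafana alerting on the resulting series.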
Rally smoke tests & load tests
Dockerized monitoring stack We run a number of tools in containers: • sFlowTrend • Prometheus • Graphite • collectd • Grafana • ceph_exporter • Elasticsearch • Logstash • Kibana
Ceph Monitoring IOPS
Ceph Monitoring Performance & Integrity
Zabbix • 200+ hosts • 38,000+ items • 15,000+ triggers • Performant • Reliable • Customizable https://github.com/CancerCollaboratory/infrastructure
Zabbix The Zabbix Agent (client) • CPU • Disk I/O • Memory • Filesystem • Security • Services running • HW Raid card • Fans, temperature, power supply status • PDU power usage
Zabbix Custom checks • When security updates are available • When new cloud images are released • Number of IPs banned by fail2ban • Iptables rules across all controllers are in sync • Openvswitch ports tagged with VLAN 4095 (bad) • Number of Cinder volumes != number of RBD volumes • Aggregate memory use per process type (e.g. nova-api, radosgw, etc.) • Compute nodes have the "neutron-openvswi-sg-chain" chain
openstack volume list --all -f value -c ID > /tmp/rbdcindervolcompare
rbd -p volumes ls | sed "s/volume-//" >> /tmp/rbdcindervolcompare
sort /tmp/rbdcindervolcompare | uniq -u
Zabbix OpenStack APIs • Multiple checks per API: • Is the process running? • Is the port listening? • Internal checks (from each controller) • External checks (from the monitoring server) • Memory usage aggregated per process type • Response time, number and type of API calls
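The "is the port listening" and response-time checks above boil down to a timed TCP connect. A minimal sketch, assuming hypothetical controller endpoints (Zabbix would typically wrap something like this in a UserParameter or use its built-in `net.tcp.service` item):

```python
import socket
import time


def port_listening(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def response_time(host: str, port: int):
    """Time the TCP connect; returns (reachable, seconds_elapsed)."""
    start = time.monotonic()
    ok = port_listening(host, port)
    return ok, time.monotonic() - start
```

Running this both from each controller (internal check) and from the monitoring server (external check), as the slide describes, catches cases where an API is up locally but unreachable through the load balancer.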
Zabbix OpenStack services memory usage
Zabbix Neutron router traffic
Zabbix Capacity planning • Total/Used vCPU • Total/Used vRAM • # of Instances • Internet traffic
Zabbix alerting • Alerts to Slack, Email, etc • Very configurable • Multiple channels • Emojis! https://github.com/ericoc/zabbix-slack-alertscript
ELK & Filebeat Embracing the chaos of logs • Powerful search • Fast • Meaningful visualizations • Great documentation
ELK & Filebeat Dashboards to suit your needs
ELK & Filebeat Filebeat tags Tagging at the source:
- type: log
  paths:
    - /var/log/glance/*.log
  tags: ["glance", "openstack"]
- type: log
  paths:
    - /var/log/heat/*.log
  tags: ["heat", "openstack"]
  exclude_lines: ['DEBUG']
Kibana search by tags
ELK & Filebeat Monitoring the OpenStack dashboard
ELK & Filebeat Alerting Log entry:
[Mon Apr 23 14:18:34 2018] [pid 1736219] Login successful for user "admin", remote address 64.231.26.191
Logstash output plugin:
if [source] == "/var/log/apache2/access.log" and [login_status] =~ "successful" and [user] =~ "admin" and !([clientip] =~ "206.108.177.10") {
  email {
    from => "alert-server@domain.com"
    to => "administrators@domain.com"
    subject => "ALERT! Openstack Admin account was logged in from outside expected IP space"
    body => "ALERT! Openstack Admin account was logged in from outside expected IP space: %{message}"
    via => "smtp"
    address => "smtp.domain.com"
    port => "587"
    username => "username"
    password => "password"
    authentication => "plain"
    use_tls => true
  }
}
Monitoring for users https://github.com/CancerCollaboratory/webstatus-update
Lessons learned • If something needs to be running, test it • Be generous with your specs for the monitoring and control plane (more RAM and CPU than you think you will need) • Monitor RAM usage aggregated per process type • If your router should run in HA, verify there is ONLY ONE active agent and ONLY ONE standby • Have a check for the metadata NAT rule inside the router's namespace • It's possible to run a stable and performant OpenStack cluster with few but qualified people, as long as you carefully design it and choose the most stable (and genuinely needed) OpenStack projects and configurations
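The metadata NAT rule check mentioned above can be sketched as a simple scan of `iptables-save -t nat` output, run inside the router namespace (e.g. `ip netns exec qrouter-<id> iptables-save -t nat`). This is an illustrative sketch, not the actual Zabbix check from the Collaboratory repo; it assumes the typical neutron-l3-agent rule that redirects 169.254.169.254:80 to the metadata proxy on port 9697.

```python
import re

# Typical neutron-l3-agent PREROUTING rule in the router namespace:
#   -d 169.254.169.254/32 ... --dport 80 -j REDIRECT --to-ports 9697
# If this rule is missing, instances in that router lose metadata access.
METADATA_RULE = re.compile(
    r"-d 169\.254\.169\.254(?:/32)? .*--dport 80 .*REDIRECT --to-ports 9697"
)


def has_metadata_rule(iptables_save_output: str) -> bool:
    """Return True if the metadata NAT REDIRECT rule is present in the dump."""
    return any(
        METADATA_RULE.search(line) for line in iptables_save_output.splitlines()
    )
```

A Zabbix custom check would run the namespace dump per HA router and alert when `has_metadata_rule` returns False on the active agent.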
Future plans • Add Magnum and Octavia, possibly Trove • Slowly migrate to a container-based control plane for OpenStack services, mainly for ease of upgrade • Build a bioinformatics SaaS solution, making the infrastructure easier to use for less experienced cancer researchers
Thank you • Discovery Frontiers: Advancing Big Data Science in Genomics Research program (grant no. RGPGR/448167-2013, ‘The Cancer Genome Collaboratory’) • Natural Sciences and Engineering Research Council (NSERC) of Canada • the Canadian Institutes of Health Research (CIHR), Genome Canada • the Canada Foundation for Innovation (CFI) • Ontario Research Fund of the Ministry of Research, Innovation and Science.
Contact Questions? George Mihaiescu george.mihaiescu@oicr.on.ca Jared Baker jared.baker@oicr.on.ca Github repo: https://github.com/CancerCollaboratory/infrastructure.git