
Operational lessons from running Openstack and Ceph for cancer research at scale


Presentation Transcript


  1. Operational lessons from running Openstack and Ceph for cancer research at scale George Mihaiescu, Senior Cloud Architect Jared Baker, Cloud Specialist

  2. OICR • Largest cancer research institute in Canada, funded by the government of Ontario • Together with its collaborators and partners, OICR supports more than 1,700 researchers, clinician scientists, research staff and trainees • OICR hosts the ICGC's secretariat and its data coordination centre

  3. ICGC - International Cancer Genome Consortium

  4. Cancer Genome Collaboratory Project goals and motivation • Cloud computing environment built for biomedical research by OICR, funded by Government of Canada grants • Enables large-scale cancer research on the world’s largest cancer genome dataset, currently produced by the International Cancer Genome Consortium (ICGC) • Entirely built using open-source software such as Openstack and Ceph • Compute infrastructure goal: 3,000 cores and 15 PB of storage • A system for cost recovery

  5. Genomics

  6. Genomics workloads • Users first download large input files (150 - 300 GB), then run workflows that analyze the data for days, or even weeks • Resulting data can be as large as the input data (alignment), or much smaller (mutation calling, 5-10 GB) • Workloads should be independent, so one VM failure doesn’t affect multiple analyses • Newly designed workflows and algorithms are packaged as Docker containers for portability (see the sketch below)
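A minimal sketch of the "one containerized workflow per sample" pattern described above. The image name and its `--input`/`--output-dir` arguments are placeholders for illustration, not the actual ICGC workflow images:

```python
import subprocess
from pathlib import Path

def run_workflow(sample_bam: Path, output_dir: Path,
                 image: str = "icgc/example-workflow:latest") -> None:
    """Run one analysis in its own container, so a failure only affects one sample."""
    output_dir.mkdir(parents=True, exist_ok=True)
    cmd = [
        "docker", "run", "--rm",
        "-v", f"{sample_bam.parent}:/input:ro",   # 150-300 GB input BAM, read-only
        "-v", f"{output_dir}:/output",
        image,
        "--input", f"/input/{sample_bam.name}",   # hypothetical workflow arguments
        "--output-dir", "/output",
    ]
    subprocess.run(cmd, check=True)

if __name__ == "__main__":
    run_workflow(Path("/data/donor1234.bam"), Path("/results/donor1234"))
```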

  7. Genomics workloads

  8. Genomics workloads

  9. Genomics workloads

  10. Capacity vs. performance

  11. Wisely pick your battles

  12. No frills design • Use high density commodity servers to reduce physical footprint & related overhead • Use open source software and tools • Prefer copper over fiber for network connectivity • Spend 100% of the hardware budget on the infrastructure that supports cancer research, not on licenses or “nice to have” features

  13. Other design constraints • Limited datacenter space (12 racks) • Fixed hardware budget with high data storage requirements • There are no local backups for the large data sets, and re-importing the data, though possible, is not desirable (500+ TB takes a long time to re-import over the Internet)

  14. Hardware architecture Compute nodes

  15. Hardware architecture Ceph storage nodes

  16. Control plane • Three controllers in HA configuration (2 x 6-core CPUs, 128 GB RAM, 6 x 200 GB Intel S3700 SSD drives) • Operating system and Ceph mon on the first RAID 1 container • MariaDB/Galera on the second RAID 1 container • Ceilometer with MongoDB on the third RAID 1 container • HAProxy (SSL termination) and Keepalived • 4 x 10 GbE bonded interfaces, 802.3ad, layer 3+4 hash • Neutron + GRE, HA routers, no DVR
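With three controllers running MariaDB/Galera, quorum can be verified by polling the wsrep status variables. A minimal monitoring sketch, assuming placeholder credentials and that the check connects through the HAProxy VIP:

```python
import pymysql  # pip install pymysql

# Host and credentials below are placeholders for the database VIP behind HAProxy.
conn = pymysql.connect(host="10.0.0.10", user="monitor",
                       password="secret", database="mysql")
try:
    with conn.cursor() as cur:
        cur.execute("SHOW GLOBAL STATUS LIKE 'wsrep_cluster_size'")
        _, cluster_size = cur.fetchone()
        cur.execute("SHOW GLOBAL STATUS LIKE 'wsrep_local_state_comment'")
        _, state = cur.fetchone()
    # With three controllers we expect a cluster size of 3 and a 'Synced' state.
    if int(cluster_size) != 3 or state != "Synced":
        raise SystemExit(f"CRITICAL: Galera size={cluster_size}, state={state}")
    print(f"OK: Galera size={cluster_size}, state={state}")
finally:
    conn.close()
```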

  17. Networking • Brocade ICX 7750-48C top-of-rack switches configured in a stack ring topology • 6 x 40Gb Twinax cables between the racks, providing 240 Gbps non-blocking redundant connectivity (2:1 oversubscription ratio)
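The stated 2:1 oversubscription ratio can be sanity-checked with quick arithmetic, assuming each ICX 7750-48C contributes 48 x 10 GbE of server-facing copper capacity against the 6 x 40 Gb inter-rack links:

```python
# Back-of-the-envelope check of the stated 2:1 oversubscription ratio.
server_ports_gbps = 48 * 10   # 48 x 10 GbE copper ports per ToR switch
uplink_gbps = 6 * 40          # 6 x 40 Gb Twinax links between racks = 240 Gbps
print(server_ports_gbps / uplink_gbps)   # 480 / 240 = 2.0 -> 2:1
```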

  18. Software – entirely open source

  19. Custom object storage client developed at OICR • A client-server application for both uploading and downloading data using temporary pre-signed URLs from multiple object storage systems • Core features • Support for encrypted and authorized transfers • High-throughput: multi-part parallel upload/download • Resumable downloads/uploads • Download-specific features • Support for BAM slicing • Support for Filesystem in Userspace (FUSE) https://github.com/icgc-dcc/dcc-storage https://hub.docker.com/r/icgc/icgc-storage-client/
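The storage client linked above builds on temporary pre-signed URLs issued against the object store. A minimal sketch of that mechanism using boto3 against the S3-compatible API that Ceph radosgw exposes (endpoint, credentials, bucket and key below are placeholders):

```python
import boto3
from botocore.client import Config

# Placeholder endpoint and credentials; radosgw speaks the S3 protocol.
s3 = boto3.client(
    "s3",
    endpoint_url="https://object.example.org",
    aws_access_key_id="ACCESS_KEY",
    aws_secret_access_key="SECRET_KEY",
    config=Config(signature_version="s3v4"),
)

# Server side: hand out a temporary, pre-signed download URL (valid one hour)
# instead of long-lived credentials.
url = s3.generate_presigned_url(
    "get_object",
    Params={"Bucket": "icgc-data", "Key": "donor1234/sample.bam"},
    ExpiresIn=3600,
)
print(url)  # the client downloads from this URL, optionally in parallel ranges
```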

  20. Cloud usage • 57,000 instances started in the last 2 years • 6,800 in the last three months • 50 users in 16 research labs across three continents • More than 500 TB of data stored in Ceph (1.5 PB raw, including replication)

  21. In-house developed usage reporting app

  22. Openstack Upgrades Ubuntu 14.04

  23. Ceph Upgrades

  24. Security Updates

  25. ELK Ops dashboard

  26. ELK Ops dashboard

  27. Deployments • Evolving each deployment • Open to improvements • Avoid being tedious

  28. Operations details • On-site spares and technicians • Let Ceph heal itself • Monitor everything • Can you script that?
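On the "Can you script that?" point, a minimal health-check sketch that wraps the `ceph health detail` CLI and alerts when the cluster is not HEALTH_OK (it assumes it runs on a host with a keyring allowed to query the cluster):

```python
import subprocess

def check_ceph_health() -> int:
    """Run `ceph health detail` and return a Nagios-style exit code."""
    out = subprocess.run(["ceph", "health", "detail"],
                         capture_output=True, text=True, check=True).stdout
    lines = out.splitlines()
    first_line = lines[0] if lines else ""
    if first_line.startswith("HEALTH_OK"):
        print("OK: Ceph cluster is healthy")
        return 0
    # Include the detail lines so the alert shows which PGs/OSDs are affected.
    print(f"ALERT: {out.strip()}")
    return 2 if first_line.startswith("HEALTH_ERR") else 1

if __name__ == "__main__":
    raise SystemExit(check_ceph_health())
```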

  29. ARA- Ansible Run Analysis

  30. VLAN based networking

  31. Ceph Monitoring IOPS

  32. Ceph Monitoring Performance & Integrity

  33. Ceph Monitoring Radosgw throughput

  34. Ceph Monitoring Rebalancing - network traffic

  35. Ceph Monitoring Rebalancing - cpu

  36. Ceph Monitoring Rebalancing - memory

  37. Ceph Monitoring Rebalancing - iops

  38. Ceph Monitoring Rebalancing - disk

  39. Rally Smoke tests & Load tests

  40. Rally Grafana integration

  41. Capacity usage

  42. Lessons learned • If something needs to be running, test it • Simple tasks sometimes are not • Be generous with your specs for the monitoring and control plane (more RAM and CPU than you think you will need) • More RAM and CPU on the Ceph storage nodes allows you to run larger nodes without being affected by small memory leaks • Monitor RAM usage aggregated by process type (see the sketch below) • It’s possible to run a stable and performant Openstack cluster with a small but qualified team, as long as you carefully design it and choose the most stable (and absolutely needed) Openstack projects and configurations.
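For the "monitor RAM usage aggregated by process type" lesson, a minimal sketch using psutil that sums resident memory per process name, so that, for example, all ceph-osd daemons appear as one line and a slow leak shows up in the trend:

```python
from collections import defaultdict
import psutil  # pip install psutil

# Aggregate resident set size (RSS) by process name.
totals = defaultdict(int)
for proc in psutil.process_iter(["name", "memory_info"]):
    mem = proc.info.get("memory_info")
    if mem is not None:
        totals[proc.info["name"] or "unknown"] += mem.rss

# Print the ten heaviest process types, largest first.
for name, rss in sorted(totals.items(), key=lambda kv: kv[1], reverse=True)[:10]:
    print(f"{name:<20} {rss / 2**30:6.2f} GiB")
```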

  43. Future plans • Upgrade to Ubuntu 16.04 and Openstack Newton • Build a new and larger environment with a similar design, but with a leaf-spine network topology • Investigate the stability of a container-based control plane (Kolla)

  44. Thank you • Discovery Frontiers: Advancing Big Data Science in Genomics Research program (grant no. RGPGR/448167-2013, ‘The Cancer Genome Collaboratory’) • Natural Sciences and Engineering Research Council (NSERC) of Canada • Canadian Institutes of Health Research (CIHR) • Genome Canada • Canada Foundation for Innovation (CFI) • Ontario Research Fund of the Ministry of Research, Innovation and Science

  45. Contact Questions? George Mihaiescu george.mihaiescu@oicr.on.ca Jared Baker jared.baker@oicr.on.ca www.cancercollaboratory.org
