Operating OpenStack cloud for a highly reliable ISP platform


Presentation Transcript


  1. Operating OpenStack cloud for a highly reliable ISP platform
  November 6, 2017
  Kiyoshi Muranaka, Yukinori Sagara, Kazutaka Morita

  2. Introductions
  Kiyoshi Muranaka (NTT DATA): PoC planner of the CiRCUS/MAPS cloud. Coordinated the Infra and App teams. Now operating this system and starting to plan the PoC for the next releases.
  Kazutaka Morita (NTT Labs): Research engineer. Joined this project as an NTT DATA member for a few years. Did the detailed architecture design, quality improvement, and high-availability design, and developed a pretty fine installer.
  Yukinori Sagara (NTT DATA): Chief architect of the CiRCUS/MAPS cloud. Did the basic and detailed architecture design. Applying OpenStack to mission-critical enterprise systems.

  3. Agenda
  - Our Project
  - CiRCUS/MAPS Cloud Architecture
  - OpenStack Deployment
  - OpenStack Operations
  - Next Steps

  4. Our Project

  5. CiRCUS/MAPS
  - CiRCUS/MAPS is the backend of NTT docomo's "sp-mode/i-mode" ISP services.
  - A very well-known service in Japan: over one third of the population uses this infrastructure.
  - Achieved 99.99999% service availability (just 3 seconds of downtime in a year) [1].
  - NTT docomo has a few OpenStack cloud deployments; this one is different from the one presented at the previous summit [2].
  [1] http://www.bcm.co.jp/site/2007/02/ntt-docomo/0702-ntt-docomo.pdf
  [2] https://www.openstack.org/videos/barcelona-2016/expanding-and-deepening-ntt-docomos-private-cloud

  6. Project Background
  - In the CiRCUS/MAPS data centers, much of the equipment (servers, switches, storage, software) reaches EOL/EOS every 5 years.
  - There is a requirement to migrate a large number of servers.
  - Re-think the architecture to handle the increasing volume of network traffic and to use resources efficiently.
  - Our project migrates existing servers from the old infrastructure to a new infrastructure.
  - We need to respect the existing infrastructure policies during the migration phase.

  7. Our OpenStack Cloud
  - CiRCUS/MAPS serves typical "Pets" applications (Cattle vs. Pets) [1].
  - It is categorized as a typical Mode 1 system in Gartner's bimodal model [2]:
    Mode 1: emphasizing safety and accuracy
    Mode 2: emphasizing agility and speed (OpenStack matches this use case)
  - Our cloud is NOT a hyper-scale cloud, but very high reliability is demanded. That is our cloud's defining characteristic.
  - We have taken on the challenge of using OpenStack to build the required infrastructure.
  [1] http://cloudscaling.com/blog/cloud-computing/the-history-of-pets-vs-cattle/
  [2] https://www.gartner.com/it-glossary/bimodal/

  8. CiRCUS/MAPS OpenStack Cloud after launch
  - Commercial service has been running on it since autumn 2016.
  - No major troubles, except small ones which we will introduce later.
  - We now operate over 300 VMs running on over 250 compute nodes spread across 7 data centers in different geographical regions of Japan.
  - Last month, our cloud design won the Red Hat Innovation Award APAC 2017!! [1]
  [1] https://www.redhat.com/ja/about/press-releases/rhjapan-2017-red-hat-innovation-award-apac-japan

  9. CiRCUS/MAPS Cloud Architecture

  10. CiRCUS/MAPS Cloud Architecture
  We will explain the CiRCUS/MAPS cloud, focusing on a few characteristic points:
  - Overall summary
  - Network restrictions
  - VM allocation issue for a highly reliable service

  11. 1. Overall Summary
  - We are using RHOSP Kilo. Our basic design and PoC started in 2015, hence we are using a very old version :)
  - OpenStack components: Nova, Keystone, Glance, Neutron, Cinder. A small start is very important; we haven't used Heat or Ceilometer, and this version's Ceilometer in particular is difficult to use because of MongoDB.
  - All VMs boot from Cinder block storage only; no image boot at all. We use FC multipath, and the backend storage is EMC VNX.
  - Single-tenant use. Operators create new VMs on a scheduled plan, so we don't need to take care of unexpected operations we haven't tested.
  - Others: when we have options, we select the safest one (e.g. we selected Linux Bridge, not Open vSwitch).

  12. 2. Network Restrictions
  - The existing CiRCUS/MAPS network policies/requirements are very strict: OpenStack VMs must connect to external nodes directly on L2.
  - Server allocation:
    Request-processing servers (VMs): must be placed in front of the L3 switches.
    Management servers (controller nodes): must be placed behind the L3 switches.
  - Network routing over the L3 switches:
    Management NW: allowed.
    Service/Tenant NW: NOT ALLOWED (so we could not use DHCP relay or 169.254.169.254 metadata request routing).
  (Diagram: mobile network traffic passes through many switches, including L3 switches, firewalls, and security appliances, before reaching the VMs. VMs must communicate directly on L2, with no L3 virtual router. With DHCP and metadata unreachable on the tenant NW, how do we initialize the VMs?)

  13. Solution: Connect VMs on L2 directly using a provider network
  - We use a "provider network" and specify a fixed IP when we create the VM's port.
  - The VM can communicate on L2 directly with external nodes (L3 switches/load balancers) without a virtual router.
  (Diagram: the tenant NW is extended outside of OpenStack; the Linux bridge on the compute node adds the VLAN tag, and the VM's traffic goes out directly on L2 without passing through a virtual router on the controller/network node.)
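  For illustration, such a provider network could be created with the Kilo-era neutron CLI roughly as follows; the physical network name, VLAN ID, and subnet are placeholders, not our actual values:
    $ neutron net-create service-net \
          --provider:network_type vlan \
          --provider:physical_network physnet1 \
          --provider:segmentation_id 100
    $ neutron subnet-create service-net 10.0.1.0/24 \
          --name service-subnet --disable-dhcp
  The Linux bridge agent's physical_interface_mappings ties physnet1 to a host NIC, so VM traffic leaves the compute node with the VLAN tag and reaches the external L3 switches/load balancers directly.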

  14. Initialize VMs without DHCP and Metadata
  - In this project, the following requirements existed from the start:
    Use an external IPAM and avoid DHCP, in order to cooperate with the existing monitoring system.
    Set up the VM's NW I/F config files (and udev file) using scripts.
  - We can specify the MAC address and fixed IP address with the neutron port-create command:
    $ neutron port-create --fixed-ip ip_address=<ip-addr> --mac-address <mac> <network>
  - So we no longer need to think about how to reach DHCP, but we still need a way to pass the metadata without the tenant NW.
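  A concrete sketch of this flow (addresses, names, and IDs are placeholders): the pre-created port is then attached to the volume-booted VM.
    $ neutron port-create service-net --name vm01-port \
          --fixed-ip ip_address=10.0.1.20 \
          --mac-address fa:16:3e:00:00:20
    $ nova boot vm01 --flavor m1.large \
          --boot-volume <volume-id> \
          --nic port-id=<port-id-of-vm01-port>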

  15. Pass Metadata on the Management NW
  - Using ConfigDrive, we can pass the metadata over the management NW without using the tenant NW.
  - Inside the VM, the cloud-init service gets the metadata from the ConfigDrive.
  (Diagram: 169.254.169.254 metadata requests on the tenant NW are not routed by the L3 switch, so the metadata is obtained over the management NW instead and presented to the VM on the compute node as a ConfigDrive.)

  16. ConfigDrive and Live Migration Problem
  - In the Kilo release, live (block) migration has a problem when Cinder block storage and ephemeral storage are used at the same time [1][2]: block migration is executed, and in the worst case it causes data corruption.
  - All of our VMs are volume-boot instances, so they use Cinder block storage, while the ConfigDrive is treated as ephemeral storage.
  - However, live migration is a very important feature from the operational point of view.
  (Diagram: during block migration the ConfigDrive data is copied between compute nodes while both nodes access the same Cinder LUN.)
  [1] https://etherpad.openstack.org/p/cinder-live-instance-migration-with-volume
  [2] http://lists.openstack.org/pipermail/openstack-dev/2015-May/064427.html

  17. Solution: Boot the VM twice
  We solved this problem by dividing the VM boot phase into two parts:
  - Boot for initialization: boot the VM with a ConfigDrive and initialize the boot volume with the metadata. After that, shut down and delete the VM while keeping the initialized volume.
  - Boot for production: boot the VM without a ConfigDrive, using the already-initialized boot volume. The metadata is no longer needed (in our system, we don't need metadata after the VM launch).
  With these steps we can launch the VM without a ConfigDrive, and we can do live migration.
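  As a rough sketch with the nova CLI (names and IDs are placeholders; the exact flags depend on the deployment), the two phases could look like this:
    # Phase 1: initialization boot (ConfigDrive attached; cloud-init configures the boot volume)
    $ nova boot vm01-init --flavor m1.large --config-drive true \
          --boot-volume <volume-id> --nic port-id=<port-id>
    # ... wait until cloud-init has finished, then remove the temporary instance;
    #     the Cinder volume is kept because it was not created by nova
    $ nova stop vm01-init
    $ nova delete vm01-init
    # Phase 2: production boot (no ConfigDrive, hence no ephemeral disk; live migration is safe)
    $ nova boot vm01 --flavor m1.large \
          --boot-volume <volume-id> --nic port-id=<port-id>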

  18. 3. VM Allocation Issue for a Highly Reliable Service
  - Most of our services are load-balancer-scalable applications: we launch many VMs per service.
  - We need to avoid VMs being allocated on only a few compute nodes (if those nodes accidentally break, it could cause a service disruption).
  - ServerGroupAntiAffinityFilter is not enough, because it cannot deploy more VMs than the number of compute nodes in the target host aggregate.
  (Diagram: if all Service-A VMs land on Compute Node 1 and that node breaks, Service A becomes unavailable; and when a host aggregate has only 3 compute nodes, a 4th anti-affinity VM cannot be deployed.)

  19. Solution: Distributed VM Allocation
  - We would like to deploy VMs distributed equally, even beyond the number of compute nodes.
  - From Mitaka, this can be resolved with the soft-anti-affinity feature (ServerGroupSoftAntiAffinityWeigher) [1].
  - Our version is Kilo, so we use multiple server-group "layers" as a slightly ugly workaround, without any code modifications. We need to know the current VM allocation before launching a new VM, and create/specify the proper server group (see the sketch below).
  (Diagram: Service-A VMs are spread across the compute nodes using ServerGroup-A-1 and ServerGroup-A-2; Service-B VMs use ServerGroup-B-1.)
  [1] https://blueprints.launchpad.net/nova/+spec/soft-affinity-for-server-group
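  The workaround only needs the standard server-group API (names and UUIDs below are placeholders): once a group already holds one VM per compute node, the next "layer" is created and scheduled into.
    $ nova server-group-create service-a-group-1 anti-affinity
    $ nova boot service-a-vm-1 --flavor m1.large --boot-volume <volume-id> \
          --nic port-id=<port-id> --hint group=<uuid-of-service-a-group-1>
    # ... after one Service-A VM per compute node, start a second layer
    $ nova server-group-create service-a-group-2 anti-affinity
    $ nova boot service-a-vm-4 --flavor m1.large --boot-volume <volume-id> \
          --nic port-id=<port-id> --hint group=<uuid-of-service-a-group-2>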

  20. OpenStack Deployment

  21. Prior Examination
  - Hold an explanatory meeting with the app deployment teams:
    Flow of the P2V migration.
    How to operate VMs (create, delete, live migrate, evacuate).
    Identify new constraints caused by virtualization at an early stage.
  - Determine the VM requirements: which flavor? which network? how many VMs? If required, create additional flavors and networks.
  (Example exchange: Infra team: "Q. Does the network interruption during live migration cause any problems?" App team: "A. A network interruption of over X seconds may cause some application errors.")

  22. Task Planning
  - Organize the tasks step by step: to deploy our OpenStack cloud, each team (infrastructure teams, app development teams, etc.) is assigned multiple tasks, and the tasks are executed so that there is zero impact on the services.
  - Improve the efficiency of the OpenStack deployment: identify and separate tasks that must be executed in chronological order from tasks that can be executed simultaneously.
    Install controller nodes in chronological order (and safely).
    Install compute nodes simultaneously (and efficiently).
  (Diagram: steps per team: install controller nodes, install compute nodes, add all compute nodes into the system. In reality there are more teams and more detailed processes.)

  23. Executing and Ensuring Safety
  - Migrate our systems very safely.
    Reason: our systems are mission-critical (no service impact is allowed).
    Preparation: hold a strict rehearsal before we deploy any services.
    Execution: deploy to the live environments as a two-person cell.
    Periods: risky operations are not executed during the daytime.
  - Confirm that the OpenStack functions are working normally: run startup checks on each OpenStack system. After the startup checks complete, start the VMs on which the actual business applications are deployed (if the startup checks fail, the business VMs are not started).

  24. OpenStack Operations

  25. OpenStack Operations - Operators
  - Operators keep monitoring the system continuously, 24x7. We take corrective action immediately when an error message is detected in the system.
  - Enable OpenStack operation from the existing command-execution system. Because this is a very large project, changing the operation methods would be a heavy burden, so we prepared wrapper scripts (shell scripts) for the existing system. To improve maintainability, the scripts follow the existing mechanism. OpenStack operations are possible just like before!
  (Diagram: operators report to the Infra/App teams: "Everything is working OK. ... Oh, an error appeared! We will check it immediately.")
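  The actual wrapper scripts are not shown on the slides; a minimal, hypothetical sketch of the idea (the command name "vmctl" and the credentials path are assumptions) could look like this:
    #!/bin/bash
    # Hypothetical wrapper so operators can keep using the existing command style.
    set -eu
    source /path/to/keystonerc        # OpenStack credentials (path is an assumption)
    case "${1:-}" in
      start)  nova start "$2" ;;
      stop)   nova stop  "$2" ;;
      status) nova show  "$2" | grep -w status ;;
      *) echo "usage: $0 {start|stop|status} <vm-name>" >&2; exit 1 ;;
    esac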

  26. OpenStack Operations - Scale Out
  - By deploying extra VMs in advance, we can address scale-out needs swiftly: the IP assignment, NW routing, FC zoning, etc. are already done.
  - We get the benefit when we need to scale out on short notice due to an increase in demand. This achieved a significant reduction in scale-out response time!
  (Diagram: extra VMs and LUNs are created in advance on the compute nodes and the EMC VNX, with the FC zoning and the settings of the network equipment (L2/L3 switches, load balancers) also prepared in advance.)

  27. OpenStack Operations - Server Hardware Maintenance
  - If a software update requiring a reboot is needed, we live-migrate the VMs to a different compute node, so we can avoid service degradation during server maintenance.
  - If a hardware failure occurs, we evacuate the VMs to a different compute node. Conventionally, the service degradation after a hardware failure continued until the parts replacement was completed; now we can minimize the service degradation caused by a hardware failure.
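  Roughly, the corresponding nova commands (host and VM names are placeholders; the exact flags depend on the release and storage setup):
    # planned maintenance: move the running VM off the host before rebooting it
    $ nova live-migration vm01 compute02
    # hardware failure: rebuild the VM on another host from its Cinder boot volume
    $ nova evacuate vm01 compute02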

  28. OpenStack Operations - Active Solutions
  - Simulation of a power line failure: actually testing a power line failure may cause the machine to break down, so instead we send a SIGKILL to a VM's qemu process ($ kill -9 <qemu-pid>) and then test the recovery procedure using the nova evacuate command.
  - Collecting OpenStack logs: we created a script that collects OpenStack logs for troubleshooting. We specify the target logs and time periods, and can collect logs from many nodes quickly.
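  The collection script itself is not shown on the slides; a minimal sketch of the idea, with hypothetical node names and default RHOSP log paths, might be:
    #!/bin/bash
    # Gather OpenStack logs from several nodes into local tarballs (sketch only).
    NODES="controller01 compute01 compute02"     # hypothetical host names
    for node in $NODES; do
        ssh "$node" "tar czf - /var/log/nova /var/log/neutron /var/log/cinder 2>/dev/null" \
            > "oslogs-$node.tgz"
    done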

  29. Case Studies
  - Network connection handling
  - IP address assignment
  - NIC offload settings
  - Partition layout
  - Monitor metrics
  - Dead SCSI paths
  - Evacuation

  30. Case Study 1 - Network connection handling
  - Case: each VM accepts tens of thousands of network requests per second, so we tuned TCP kernel parameters such as tcp_tw_reuse on the "VMs".
  - Pitfall: the host machines also need iptables tuning, even though we don't need L3 networking at the physical layer. iptables connection tracking drops packets even for the L2 network (Linux bridge) when lots of connections flow from the load balancer to the VMs.
  - Antidote: increase nf_conntrack_max, decrease the nf_conntrack timeouts, and add NOTRACK rules to iptables.
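  Concretely, on the compute hosts this amounts to something like the following (the values and the VM subnet are illustrative, not our production settings):
    # enlarge the connection-tracking table and shorten the established-connection timeout
    $ sysctl -w net.netfilter.nf_conntrack_max=1048576
    $ sysctl -w net.netfilter.nf_conntrack_tcp_timeout_established=600
    # or skip connection tracking entirely for the VM traffic
    $ iptables -t raw -A PREROUTING -d 10.0.1.0/24 -j NOTRACK
    $ iptables -t raw -A OUTPUT     -d 10.0.1.0/24 -j NOTRACK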

  31. Case Study 2 - IP address assignment
  - Case: we assign IP addresses to neutron ports; we don't need the Neutron L3 features, though.
  - Pitfall: Neutron cannot assign multiple IPs to one port with the linuxbridge L2 agent, and if the security group feature is enabled, we cannot reach unassigned IPs on the VMs.
  - Antidote: disable the security group feature and create multiple ports instead of using IP aliases.
  (Diagram: a VM with eth0 (10.0.1.4) and eth1 (192.168.1.9) on the L2 network (Linux bridge); an additional address such as 10.0.1.5 cannot be added to the same port.)
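  A sketch of the two antidotes (the file path and addresses are illustrative; which config file carries the [securitygroup] section depends on the deployment):
    # /etc/neutron/plugins/ml2/ml2_conf.ini (linuxbridge agent side)
    [securitygroup]
    enable_security_group = False
    firewall_driver = neutron.agent.firewall.NoopFirewallDriver

    # add a second port instead of an IP alias
    $ neutron port-create service-net --name vm01-port2 \
          --fixed-ip ip_address=10.0.1.5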

  32. Case Study 3 - NIC offload settings
  - Case: we decided the NIC offload settings based on our tests, hoping not to meet any problems with our planned workloads.
  - Pitfall: Neutron disables LRO automatically when a VM boots, and this leads to a network interruption that the monitoring system detects as a network failure.
  - Antidote: disable LRO by default.
  (Diagram: before the VM boot the compute node's eth0 has GSO/TSO/LRO/GRO ON; after the boot, LRO has been switched OFF by Neutron.)
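  For example, with ethtool (eth0 is a placeholder for the physical interface; make the setting persistent in the interface configuration as well):
    $ ethtool -K eth0 lro off                        # turn LRO off up front
    $ ethtool -k eth0 | grep large-receive-offload   # verify: "large-receive-offload: off"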

  33. Case Study 4 - Partition layout
  - Case: we designed the partition layout based on our capacity planning, e.g. logs in /var/log, VM images in /var/lib/images, etc.
  - Pitfall: the temporary directory for VM image conversion, /var/lib/cinder/conversion, was not accounted for.
  - Antidote: prepare enough space for /var/lib/cinder/conversion, or set image_conversion_dir in cinder.conf to change the directory location.
  (Diagram: on the controller, Cinder creates a temporary file from the Glance image in /var/lib/images before creating the volume on the EMC VNX, and the local disk fills up. Can't we convert directly?)
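  The relevant cinder.conf setting, with an example target directory on a larger partition (the chosen path is an illustration, not our actual layout):
    # /etc/cinder/cinder.conf
    [DEFAULT]
    image_conversion_dir = /var/lib/images/conversion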

  34. Case Study 5 - Monitor metrics
  - Case: we designed the monitoring metrics based on the OpenStack manual: storage usage, the number of LUNs, etc.
  - Pitfall: the number of initiators was missing.
  - Antidote: learn the constraints of the storage devices from the storage specification. It'd be nice if the vendors shared this knowledge in the OpenStack documentation.
  (Diagram: each compute node's HBAs register as initiators on the EMC VNX ports, so the number of initiators must be watched in addition to the number of LUNs and the space used.)

  35. Case Study 6 - Dead SCSI paths
  - Case: create and delete several VMs simultaneously.
  - Pitfall: many dead SCSI paths remain after the VMs are deleted:
    $ ls /dev/sd* | wc -l
    117
  - Antidote: remove the dead SCSI paths with udev rules.
  (Diagram: 1) create many VMs simultaneously; 2) delete the VMs simultaneously; 3) SCSI device files such as sdc and sdd remain on the compute node even though they are no longer usable.)
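  Our udev rules are not shown on the slides; as a hedged, one-shot manual equivalent, a dead path can be dropped by writing to its delete node in sysfs:
    # remove every sd* device whose SCSI state is no longer "running"
    for dev in /sys/block/sd*; do
        state=$(cat "$dev/device/state" 2>/dev/null)
        [ "$state" != "running" ] && echo 1 > "$dev/device/delete"
    done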

  36. Case Study 7 - Evacuation
  - Case: evacuate the VMs, then restart the failed compute node and add it to OpenStack again.
  - Pitfall: invalid resources of the evacuated VMs remain visible on the failed node.
  - Antidote: remove the invalid VM resources manually; in any case, clean-install the failed compute node.
  (Diagram: after VM A and VM B are evacuated and the failed compute node is fixed and re-added, its old instance directories and SCSI devices for VM A and VM B are left behind as invalid resources.)
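  What "remove manually" looks like is deployment-specific; a sketch assuming default libvirt/nova paths and placeholder names (not our exact procedure):
    # on the repaired compute node, before putting it back into service
    $ virsh destroy instance-0000004a           # stop the stale libvirt domain, if still running
    $ virsh undefine instance-0000004a          # remove its definition
    $ rm -rf /var/lib/nova/instances/<instance-uuid>   # leftover instance directory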

  37. Next Steps

  38. OpenStack Upgrade
  - Upgrade from Kilo to Queens or later! That is across 6 releases or more, with many OpenStack systems and compute nodes, and no service impact is allowed.
  - The plan is under consideration: deploy separate new OpenStack systems and migrate the VMs from the old ones to the new ones?
  - Installation tools: we are using the Packstack installation utility now and would use Red Hat OpenStack Platform Director next. Can we deploy safely?

  39. Life Cycle
  - Red Hat OpenStack Platform life cycle: production support for 5 years max (depending on the version).
    https://access.redhat.com/support/policy/updates/openstack/platform
  - If we add compute nodes a few years after the OpenStack deployment… the end of life of RHOSP comes first, and we need to choose between upgrading and having no support. Either way, it's a difficult choice.
  - Under consideration: OpenStack upgrade intervals (N→N+x), and when to add compute nodes. Upgrading? No support?
  (Diagram: RHOSP release X timeline over 7 years: Phase 1 (18 months), Phase 2 (18 months), Extended Life Support (24 months).)

  40. Questions?
