Windows Azure Internals • Mark Russinovich • Technical Fellow • Windows Azure • Session 3-058
Agenda • Windows Azure Datacenter Architecture • Deploying Services • Inside IaaS VMs • Maintaining Service Health • The Leap Day Outage and Lessons Learned
Windows Azure Datacenters • Windows Azure currently has 8 regions • At least two per geo-political region • 100,000s of servers • Building out many more
The Fabric Controller (FC) • The “kernel” of the cloud operating system • Manages datacenter hardware • Manages Windows Azure services • Four main responsibilities: • Datacenter resource allocation • Datacenter resource provisioning • Service lifecycle management • Service health management • Inputs: • Description of the hardware and network resources it will control • Service model and binaries for cloud applications • [Diagram: an analogy in which processes (e.g. Word, SQL Server) are to the Windows kernel and server what services (e.g. Exchange Online, SQL Azure) are to the Fabric Controller and datacenter]
Datacenter Clusters • Datacenters are divided into “clusters” • Approximately 1,000 rack-mounted servers (we call them “nodes”) • Provides a unit of fault isolation • Each cluster is managed by a Fabric Controller (FC) • FC is responsible for: • Blade provisioning • Blade management • Service deployment and lifecycle • [Diagram: the datacenter network connecting Cluster 1, Cluster 2, … Cluster n, each managed by its own FC]
Inside a Cluster • FC is a distributed, stateful application running on nodes (servers) spread across fault domains • Top blades are reserved for FC • One FC instance is the primary and all others keep their view of the world in sync • Supports rolling upgrade, and services continue to run even if the FC fails entirely • [Diagram: racks of nodes behind top-of-rack (TOR) switches connected to a spine, with FC replicas FC1–FC5 spread across racks]
Datacenter Network Architecture • [Diagram comparing the old DLA architecture with the new Quantum10 architecture] • DLA (old): DC router → access routers → aggregation + load balancers → TOR switches → racks of ~40 nodes; about 120 Gbps of aggregate bandwidth • Quantum10 (new): DC routers → spines → aggregation + load balancers → TOR switches → racks of ~40 nodes; about 30,000 Gbps of aggregate bandwidth
Tip: Load Balancer Overhead • Going through the load balancer adds about 0.5 ms of latency • When possible, connect to systems via their DIP (dynamic IP address) • Instances in the same Cloud Service can access each other by DIP • You can use Virtual Network to make the DIPs of different cloud services visible to each other • [Diagram: Instance 0 (DIP 10.2.3.4) reaching Instance 1 (DIP 10.2.3.5) directly, versus paying the ~0.5 ms hop through the load balancer’s VIP 65.123.44.22]
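A minimal sketch of connecting by DIP from inside a role, assuming an internal endpoint named "InternalTcp" has been declared for a role named "WorkerRole" in the service definition (both names are hypothetical); the RoleEnvironment API exposes each sibling instance's DIP and port, so you can open a direct TCP connection and skip the load balancer:

// Sketch: connect to a sibling instance's DIP directly instead of going
// through the load balancer's VIP. "WorkerRole" and "InternalTcp" are
// hypothetical names from the service definition.
using System.Net.Sockets;
using Microsoft.WindowsAzure.ServiceRuntime;

public static class DirectConnect
{
    public static TcpClient ConnectToPeer()
    {
        foreach (var instance in RoleEnvironment.Roles["WorkerRole"].Instances)
        {
            // Skip ourselves; any other instance of the role will do.
            if (instance.Id == RoleEnvironment.CurrentRoleInstance.Id)
                continue;

            // IPEndpoint carries the instance's DIP and port -- no LB hop.
            var endpoint = instance.InstanceEndpoints["InternalTcp"].IPEndpoint;
            var client = new TcpClient();
            client.Connect(endpoint);
            return client;
        }
        return null;
    }
}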
Provisioning a Node • Power on node • PXE-boot the Maintenance OS • Agent formats the disk and downloads the Host OS via Windows Deployment Services (WDS) • Host OS boots, runs Sysprep /specialize, reboots • FC connects with the “Host Agent” • [Diagram: the Fabric Controller, Windows Deployment Server (PXE server), and image repository (Maintenance OS, Windows Azure OS, role images) provisioning a node, which ends up running the Windows Azure hypervisor, parent/host OS, and FC Host Agent]
Deploying a Service to the Cloud: The 10,000-Foot View • Package upload to portal • System Center App Controller provides the IT pro upload experience • PowerShell provides a scripting interface • The Windows Azure portal provides the developer upload experience • Service package passed to RDFE • RDFE sends the service to a Fabric Controller (FC) based on target region and affinity group • FC stores the image in its repository and deploys the service • [Diagram: the portal, App Controller, and Service Management REST APIs feeding RDFE, which hands the service to a Fabric Controller in the US-North Central datacenter]
RDFE • RDFE serves as the front end for all Windows Azure services • Subscription management • Billing • User access • Service management • RDFE is responsible for picking clusters to deploy services and storage accounts • First datacenter region • Then affinity group or cluster load • Normalized VIP and core utilization: A(h, g) = C(h, g) / …
FC Service Deployment Steps • Process service model files • Determine resource requirements • Create role images • Allocate compute and network resources • Prepare nodes • Place role images on nodes • Create virtual machines • Start virtual machines and roles • Configure networking • Dynamic IP addresses (DIPs) assigned to blades • Virtual IP addresses (VIPs) + ports allocated and mapped to sets of DIPs • Configure packet filter for VM to VM traffic • Program load balancers to allow traffic
Service Resource Allocation • Goal: allocate service components to available resources while satisfying all hard constraints • HW requirements: CPU, Memory, Storage, Network • Fault domains • Secondary goal: Satisfy soft constraints • Prefer allocations which will simplify servicing the host OS/hypervisor • Optimize network proximity: pack nodes • Service allocation produces the goal state for the resources assigned to the service components • Node and VM configuration (OS, hosting environment) • Images and configuration files to deploy • Processes to start • Assign and configure network resources such as LB and VIPs
Deploying a Service • [Diagram: an example deployment of www.mycloudapp.net behind the load balancer] • Role A: Web Role (front end), count 3, update domains 3, size Large, with instances at DIPs 10.100.0.185, 10.100.0.36, and 10.100.0.122 • Role B: Worker Role, count 2, update domains 2, size Medium
Deploying a Role Instance • FC pushes role files and configuration information to the target node’s host agent • Host agent creates VHDs • Host agent creates the VM, attaches VHDs, and starts the VM • Guest agent starts the role host, which calls the role entry point • Starts the health heartbeat to, and gets commands from, the host agent • The load balancer only routes to an external endpoint once it responds to a simple HTTP GET (the LB probe)
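A sketch of what answering that LB probe can look like from a worker role with a public HTTP endpoint, assuming an input endpoint named "HttpIn" in the service definition (a hypothetical name); returning 200 OK keeps the instance in the load balancer's rotation:

// Sketch: answer the load balancer's HTTP GET probe with 200 OK so traffic
// is routed to this instance. "HttpIn" is a hypothetical endpoint name.
using System.Net.Sockets;
using System.Text;
using Microsoft.WindowsAzure.ServiceRuntime;

public class WorkerRole : RoleEntryPoint
{
    public override void Run()
    {
        var endpoint = RoleEnvironment.CurrentRoleInstance
                                      .InstanceEndpoints["HttpIn"].IPEndpoint;
        var listener = new TcpListener(endpoint);
        listener.Start();

        byte[] ok = Encoding.ASCII.GetBytes(
            "HTTP/1.1 200 OK\r\nContent-Length: 0\r\nConnection: close\r\n\r\n");

        while (true)
        {
            // A real role would parse the request; for the probe, any 200 works.
            using (var client = listener.AcceptTcpClient())
            using (var stream = client.GetStream())
            {
                stream.Write(ok, 0, ok.Length);
            }
        }
    }
}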
Inside a Deployed Node • [Diagram: a physical node running several guest partitions, each containing a role instance and a guest agent, separated by a trust boundary from the host partition; the host partition runs the FC host agent and an image repository (OS VHDs, role ZIP files) and talks to the Fabric Controller primary and its replicas]
PaaS Role Instance VHDs • Differencing VHD for OS image (D:\) • Host agent injects the FC guest agent into the VHD for Web/Worker roles • Resource VHD for temporary files (C:\) • Role VHD for role files (first available drive letter, e.g. E:\ or F:\) • [Diagram: the role VM’s disks: C:\ resource disk (dynamic VHD), D:\ Windows differencing disk backed by the Windows VHD, and E:\ or F:\ role image differencing disk backed by the role VHD]
Inside a Role VM • [Diagram: the OS, resource, and role volumes inside the VM; the guest agent starts the role host, which calls the role entry point]
Tip: Keep It Small • Role files get copied up to four times in a deployment • Instead, put artifacts in blob storage • Break them into small pieces • Pull them on demand from your roles • [Diagram: the core package copied from the portal to RDFE to the FC and finally to the server, while auxiliary files are pulled from blob storage directly]
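A minimal sketch of pulling such artifacts on demand with the storage client library; the container name "artifacts", the blob name, and the "StorageConnectionString" setting are hypothetical:

// Sketch: download a large auxiliary file from blob storage at role startup
// instead of shipping it in the package. Container, blob, and setting names
// are hypothetical.
using System.IO;
using Microsoft.WindowsAzure.ServiceRuntime;
using Microsoft.WindowsAzure.Storage;
using Microsoft.WindowsAzure.Storage.Blob;

public static class ArtifactLoader
{
    public static void DownloadArtifact(string blobName, string localPath)
    {
        var account = CloudStorageAccount.Parse(
            RoleEnvironment.GetConfigurationSettingValue("StorageConnectionString"));

        CloudBlobClient client = account.CreateCloudBlobClient();
        CloudBlobContainer container = client.GetContainerReference("artifacts");
        CloudBlockBlob blob = container.GetBlockBlobReference(blobName);

        // Stream the blob to the local (resource) disk rather than packaging it.
        blob.DownloadToFile(localPath, FileMode.Create);
    }
}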
Virtual Machine (IaaS) Operation • No standard cached images for IaaS • OS is faulted in from blob storage during boot • Sysprep /specialize on first boot • Default cache policy: • OS disk: read+write cache • Data disks: no cache • [Diagram: the virtual disk driver serving the VM’s disk from the backing blob through a local RAM cache and a local on-disk cache on the node]
IaaS Role Instance VHDs • C:\ OS disk (blob-backed, served through the RAM and local disk caches) • D:\ resource disk (local dynamic VHD) • E:\, F:\, etc. data disks (blob-backed)
Tip: Optimize Disk Performance • Each IaaS disk type has different performance characteristics by default • OS: local read+write cache optimized for small-working-set I/O • Temporary disk: local disk spindles that can be shared • Data disk: great at random writes and large working sets • Striped data disk: even better • Unless it’s small, put your application’s data (e.g. a SQL database) on striped data disks
In-Place Update • Purpose: ensure the service stays up both while you update it and while Windows Azure updates the OS • The system considers update domains when upgrading a service • 1 / (number of update domains) = the fraction of the service that will be offline at a time • Default is 5 and max is 20; override with the upgradeDomainCount service definition property • The Windows Azure SLA is based on at least two update domains and two role instances in each role • [Diagram: Front-End-1/2 and Middle Tier-1/2/3 instances spread across Update Domains 1–3]
Tip: Config Updates vs Code Updates • Code updates: • Deploy a new role image • Create a new VHD • Shut down the old code and start the new code • Config updates: • Notification sent to the role via RoleEnvironment.Changing • Graceful role shutdown/restart (including startup tasks) if the role doesn’t respond • For fast updates: • Deploy settings as configuration • Respond to configuration updates
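A sketch of responding to a configuration update in place so the role keeps running, using the RoleEnvironment.Changing and RoleEnvironment.Changed events; the "LogLevel" setting is hypothetical:

// Sketch: accept configuration-setting changes without a restart, then
// re-read settings once the change is applied. "LogLevel" is hypothetical.
using System.Linq;
using Microsoft.WindowsAzure.ServiceRuntime;

public class WebRole : RoleEntryPoint
{
    public override bool OnStart()
    {
        RoleEnvironment.Changing += (sender, e) =>
        {
            // Setting e.Cancel = true would recycle the instance; leaving it
            // false accepts pure configuration-setting changes in place.
            if (e.Changes.All(c => c is RoleEnvironmentConfigurationSettingChange))
                e.Cancel = false;
        };

        RoleEnvironment.Changed += (sender, e) =>
        {
            // The new values are now visible; re-read what we care about.
            string logLevel =
                RoleEnvironment.GetConfigurationSettingValue("LogLevel");
        };

        return base.OnStart();
    }
}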
Node and Role Health Maintenance • FC maintains service availability by monitoring the software and hardware health • Based primarily on heartbeats • Automatically “heals” affected roles/VMs
Guest Agent and Role Instance Heartbeats and Timeouts • Guest agent heartbeat: every 5s; guest agent heartbeat timeout: 10 min • Guest agent connect timeout: 25 min • Role instance heartbeat: every 15s; role instance “unresponsive” timeout: 30s • Load balancer heartbeat: every 15s; load balancer timeout: 30s • Role instance launch: indefinite • Role instance ready (for updates only): 15 min
Fault Domains and Availability Sets • Avoid single points of physical failure • Unit of failure based on data center topology • E.g. a top-of-rack switch on a rack of machines • Windows Azure considers fault domains when allocating service roles • At least 2 fault domains per service • Will try to spread roles out across more • Availability SLA: 99.95% • [Diagram: front-end and middle-tier instances spread across Fault Domains 1–3]
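A small sketch of checking which update domain and fault domain a running instance was placed in; UpdateDomain and FaultDomain are properties exposed on RoleInstance by the service runtime API:

// Sketch: report the placement of the current instance.
using Microsoft.WindowsAzure.ServiceRuntime;

public static class PlacementInfo
{
    public static string Describe()
    {
        RoleInstance me = RoleEnvironment.CurrentRoleInstance;
        // UpdateDomain and FaultDomain are indexes assigned by the platform.
        return string.Format("Instance {0}: update domain {1}, fault domain {2}",
                             me.Id, me.UpdateDomain, me.FaultDomain);
    }
}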
Moving a Role Instance (Service Healing) • Moving a role instance is similar to a service update • On source node: • Role instances stopped • VMs stopped • Node reprovisioned • On destination node: • Same steps as initial role instance deployment • Warning: Resource VHD is not moved • Including for Persistent VM Role
Service Healing • [Diagram: the www.mycloudapp.net example, with Role A (V2 VM Role, front end, count 3, update domains 3, size Large) and Role B (Worker Role, count 2, update domains 2, size Medium) behind the load balancer; a front-end instance is healed onto a new node and gets a new DIP, 10.100.0.191, alongside 10.100.0.185, 10.100.0.36, and 10.100.0.122]
Tip: Three is Better than Two • Your availability is reduced when: • You are updating a role instance’s code • An instance is being service healed • The host OS is being serviced • The guest OS is being serviced • To avoid a complete outage when two of these are concurrent: deploy at least three instances • [Diagram: front-end and middle-tier instances spread across Fault Domains 1–3]
Tying it all Together: Leap Day • Outage on February 29 caused by this line of code: validToYear = currentDate.Year + 1; • The problem and its resolution highlight: • Network operations and monitoring • The DevOps “on call” model • Cluster fault isolation • Lessons we learned
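A sketch of why that line is dangerous, with illustrative variable names rather than the actual guest agent code: building a valid-to date by adding 1 to the year has no February 29 to land on in the following year, whereas DateTime.AddYears handles the leap day:

// Sketch of the class of bug behind the Leap Day outage. Names are
// illustrative, not the actual guest agent code.
using System;

static class TransportCertDates
{
    static DateTime BuggyValidTo(DateTime now)
    {
        // On Feb 29 this throws ArgumentOutOfRangeException: there is no
        // Feb 29 in the following (non-leap) year.
        return new DateTime(now.Year + 1, now.Month, now.Day);
    }

    static DateTime SafeValidTo(DateTime now)
    {
        // AddYears clamps Feb 29 to Feb 28 in a non-leap year.
        return now.AddYears(1);
    }
}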
On-Call • All developers take turns at third-tier support for live-site operations
Phase 1: Starting a Healthy VM • [Diagram: on startup, the guest agent inside the application VM creates a “transport cert” (a public/private key pair) used for secure communication with the host agent in the host OS, running above the hypervisor]
Phase 1: The Leap Day Bug • [Diagram: application VMs whose guest agents fail to create the transport cert] • After 25 minutes the host agent times out waiting for the guest agent; after 3 attempts it gives up • Existing healthy VMs continue to run (until migrated) • All new VMs fail to start (Service Management)
Phase 1: Cascading Impact • Leap day starts… deploying an infrastructure update or a customer VM makes new VMs fail, and failing VMs cause nodes to be marked as failed • Normal “service healing” (the same path as a “normal” hardware failure) migrates VMs to other nodes, where they fail again: the cascade is viral… • Cascade protection threshold hit (60 nodes): all healing and infra deployment stop!
Phase 1: Tenant Availability • Customer 1: complete availability loss • Customer 2: partial capacity loss • Customer 3: no availability loss
Overview of Phase 1 • Service Management started failing immediately in all regions • New VM creation, infrastructure deployments, and standard hardware recovery created a viral cascade • Service healing threshold tripped, with customers in different states of availability and capacity • Service Management deliberately de-activated everywhere
Recovery • Build and deploy a hotfix to the GA (guest agent) and the HA (host agent) • Clusters were in two different states: • Fully (or mostly) updated clusters (119 GA, 119 HA, 119 OS…) • Mostly non-updated clusters (118 GA, 118 HA, 118 OS…) • For updated clusters, we pushed the fix on the new version • For non-updated clusters, we rolled back and pushed the fix on the old version
Fixing the updated clusters… • [Diagram: the fixed 119 package (119 GA v2, 119 HA v2, 119 networking plugin, 119 OS) pushed to the updated clusters, whose nodes and VMs were running 119 GA v1, 119 HA v1, the 119 networking plugin, and the 119 OS]
Attempted fix for partially updated clusters… Phase 2 begins • [Diagram: the fixed 118 package contained the 118 HA v2 but was built with the 119 networking plugin; it was pushed onto clusters whose nodes were a mix of 118 and 119 (GA v1, HA v1, networking plugin, OS), mismatching plugin and agent versions]
Overview of Phase 2 • Most clusters were repaired completely in Phase 1 • 7 clusters were moved into an inconsistent state (119 Plugin/Config with 118 Agent) • Machines moved into a completely disconnected state
Recovery of Phase 2: Step 1 • [Diagram: the fixed 119 package (119 HA v2, 119 networking plugin, 119 OS) pushed onto the affected 118 clusters, replacing the inconsistent combination of 118 HA v2, 119 networking plugin, and 118 OS, while VMs 1–5 keep running their 118 GA v1]