1 / 45

Windows Azure Compute

Windows Azure Compute. Brad Calder General Manager Windows Azure. Large-Scale Workload Patterns. “Growing Fast“ Docs.com on Facebook . “On and Off “ RiskMetrics. Inactivity Period . Compute . Compute . Average Usage. Usage. Average. Time . Time .

tauret
Download Presentation

Windows Azure Compute

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. Windows Azure Compute Brad Calder General Manager Windows Azure

  2. Large-Scale Workload Patterns “Growing Fast“ Docs.com on Facebook “On and Off “ RiskMetrics Inactivity Period Compute Compute Average Usage Usage Average Time Time • On and off workloads (e.g. batch job) • Over provisioned capacity is wasted • Successful services needs to grow/scale • Keeping up w/growth is big IT challenge • Can be hard to predict growth “Unpredictable Bursting“ Twitter “Predictable Bursting“ Walmart Compute Compute Average Usage Average Usage Time Time • Services with micro seasonality trends • Peaks due to periodic increased demand • Wasted capacity off season • Unexpected/unplanned peak in demand • Sudden spike impacts performance • Hard/costly to over provision for extreme cases

  3. Computing Demand Daily Fluctuation BING SEARCHES – JAPAN VS. UK Query Volume Source: Microsoft

  4. Computing Demand Yearly Variability •target.com•walmart.com •toysrus.com•barnesandnoble.com •turbotax.com•taxcut.com •hrblock.com•taxact.com ~10x normal load (Tax season) ~4x normal load (Holiday shopping) Jan 2009 Jan 2010 Jan 2009 Jan 2010 Source: Alexa Source: Alexa

  5. What is a “Cloud”? • Cloud: on-demand, scalable, compute and storage resources Self Server Provisioning Cloud Provisioning Demand Demand Time Time Overprovisioned Underprovisioned

  6. What is Under the Covers of a Service Business logic Overprovision for peak traffic … Add compute/storage capacity on the fly Live upgrades and OS patches Service “glue” Metering and billing infrastructure Monitoring and alerting infrastructure Reliable/Secure storage and computation Datacenter (Power and Cooling) Respond to hardware failures Buy and provision hardware

  7. What is Windows Azure?An operating system for the cloud: …. …… Service 1 Service 2 Service 3 Service N

  8. Cloud Terminology • Infrastructure as a Service (IaaS): basic compute and storage resources • On-demand servers • Amazon EC2, VMWarevCloud, etc • Platform as a Service (PaaS): cloud application infrastructure • On-demand application-hosting environment • Google AppEngine, Salesforce.com, Windows Azure, etc • Software as a Service (SaaS): cloud applications • On-demand applications • Office 365, GMail, etc

  9. IaaS Developer/Ops 2) Choose image, then create and configure VM(s) for application 1) Choose image, then create VM for DBMS and configure DBMS 4) Install application 3) Provision database, then create tables and add data 5) Configure load balancer 6) Manage VMs and DBMS (e.g., deploying new OS images in VMs) Library VM Images Load Balancer Data Web Server Application DBMS Operating System Operating System Operating System VM VM

  10. PaaS Developer/Ops 1) Provision database, then create tables and add data 2) Deploy application w/ service model 3) Automated Service Management Web Server Load Balancer DBMS Operating System Operating System Operating System VM Data VM Application

  11. Windows Azure • Windows Azure is an OS for the data center • Handles resource management, provisioning, and monitoring • Manages application lifecycle • Allows developers to concentrate on business logic • Provides common building blocks for distributed applications • Reliable queuing • Simple unstructured and structured storage • SQL storage • Application services like access control, caching, and connectivity

  12. Windows Azure Platform Windows Azure Applications Windows Azure Middleware Services AppFabric Service Bus AppFabric Caching AppFabric Access Control Server Windows Azure Data Services SQL Azure Windows Azure Storage Windows Azure CDN Windows Azure Networking Fabric Controller Windows Azure Compute

  13. Fabric Controller • Owns all the hardware in the data center • Uses the inventory to host services • Similar to what a per machine operating system does with applications • Provisions the hardware as necessary • Maintains the health of the hardware • Deploys applications to free resources • Maintains the health of those applications

  14. Windows Azure Fabric Controller VM Fabric Agent VM VM WS08 Hypervisor Software control Highly-available Fabric Controller Hardware control Load-balancers Switches

  15. Scaling with the Fabric Controller Service Model

  16. Scaling • There are two basic scaling models: Scale Out Scale Up Compute Compute Compute Compute Compute Compute

  17. Scaling Lessons • Use few, well-defined scaling units • Define scaling boundaries • Scale out those units as needed

  18. Scale-Out Applications Network Load Balancer Scale Out Stateless Front End Stateless ‘Worker’ Scale Out Already Provided Scalable Storage Azure Queues Partitioned RDBMS (SQL Azure) Shared Filesystem (Azure Blobs) Key/ValueDatastore (Azure Tables)

  19. The Windows Azure Service Model • A Windows Azure application is called a “service” • Definition information • Configuration information • At least one “role” • A role is the scaling boundary withina service • Roles are like DLLs in the service “process” • Collection of code with an entry point that runs in its own virtual machine • Virtual machine is scale unit • Role code runs in a virtual machine • Role scales by instances of a virtual machine size Durable Store LB Front End Middle Tier

  20. Multi-Tier Cloud Application • A cloud application is typically made up of different components • Front end: e.g. load-balanced stateless web servers • Middle worker tier: e.g. order processing, encoding • Backend storage: e.g. Azure Blobs, Azure Tables, SQL Azure • Multiple instances of each for scalability and availability • Requires at least 2 instances of each to achieve the SLA Front-End Windows Azure Storage,SQL Azure Front-End Middle-Tier Load Balancer HTTP/HTTPS Cloud Application

  21. Service Model and Role Contents • Definition: • Role name • Role type • VM size (e.g. small, medium, etc.) • Network endpoints • Configuration: • Number of instances • Number of update and fault domains • Code: • Web/Worker Role: Hosted DLL and other executables • VM Role: VHD Service Model Role: Front-End Definition Type: Web VM Size: Small Endpoints: External-1 Configuration Instances: 2 Update Domains: 3 Fault Domains: 2 Role: Middle-Tier Definition Type: Worker VM Size: Large Endpoints: Internal-1 Configuration Instances: 3 Update Domains: 3 Fault Domains: 2

  22. Service Model Files • Service definition is in ServiceDefinition.csdef • Service configuration is in ServiceConfiguration.cscfg • CSPack program Zips service binaries and definition into Service Package File (service.cscfg)

  23. ServiceDefinition.csdef <?xmlversion="1.0"encoding="utf-8"?> <ServiceDefinitionname="Sample"xmlns="http://schemas.microsoft.com/ServiceHosting/2008/10/ServiceDefinition"upgradeDomainCount=“3"> <WorkerRolename="Middle-Tier"vmsize="Large"> <Endpoints> <InternalEndpointname="Internal-1"protocol="tcp" /> </Endpoints> </WorkerRole> <WebRolename="Front-End"vmsize="Small"> <Sites> <Sitename="Web"> <Bindings> <Bindingname="Endpoint1"endpointName="External-1" /> </Bindings> </Site> </Sites> <Endpoints> <InputEndpointname="External-1"protocol="http"port="80" /> </Endpoints> <Imports> <ImportmoduleName="Diagnostics" /> </Imports> </WebRole> </ServiceDefinition>

  24. ServiceConfiguration.cscfg <?xmlversion="1.0"encoding="utf-8"?> <ServiceConfigurationserviceName="Sample"xmlns="http://schemas.microsoft.com/ServiceHosting/2008/10/ServiceConfiguration"osFamily=“2"osVersion="*"> <Rolename="Middle-Tier"> <Instancescount="3" /> </Role> <Rolename="Front-End"> <Instancescount="2" /> <ConfigurationSettings> <Settingname="Microsoft.WindowsAzure.Plugins.Diagnostics.ConnectionString"value="DefaultEndpointsProtocol=https;AccountName=myaccount;AccountKey=key"/> </ConfigurationSettings> </Role> </ServiceConfiguration>

  25. Windows Azure Push-button Deployment Allocation across fault and update domains • Step 1: Allocate VMs/nodes, VDIPs/VIPs • Across fault domains • Across update domains • Step 2: Place role images on nodes • Step 3: Start roles in VM instances • Step 4: Configure load-balancers • Step 5: Maintain desired number of role instances • Failed roles automatically restarted • Node failure results in new VMs automatically allocated Load-balancers

  26. FC Automated Management • Windows Azure FC monitors the health of roles • FC detects if a role dies • Restart the role to bring it back to a healthy state • If a failed node can’t be recovered, FC migrates role instances to a new node • A suitable replacement location is found • Existing role instances are notified of the configuration change

  27. Availability andFault/Upgrade Domains

  28. Availability Service Level Agreements (SLA) • Windows Azure Platform SLAs: • Compute External Connectivity: 99.95% (2 or more instances) • Storage Availability: 99.9% • SQL Azure Availability: 99.9%

  29. Maintaining Availability: Assume Failure • Hardware fails • 3-5% of servers experience failures annually • Software fails • Inevitable in any evolving, complex system • Tolerating failure means: • Redundancy where possible • Need to build in retries and backoff • Fast recovery • Big red buttons

  30. Hardware Redundancy Datacenter Routers Aggregation Routers and Load Balancers Agg Agg Agg Agg Agg Agg LB LB LB LB LB LB LB LB LB LB LB LB Top of Rack Switches TOR TOR TOR TOR TOR TOR TOR TOR TOR TOR TOR TOR TOR TOR TOR … … … … … … Racks Nodes Nodes Nodes Nodes Nodes Nodes Nodes Nodes Nodes Nodes Nodes Nodes Nodes Nodes Nodes PDU PDU PDU PDU PDU PDU PDU PDU PDU PDU PDU PDU PDU PDU PDU Power Distribution Units Top of Rack Switch is a Single Point of Failure

  31. Maintaining Availability: Fault Domains • Avoid single points of failure • Unit of failure based on data center topology is a rack • E.g. top-of-rack switch on a rack of machines • Windows Azure considers fault domains when allocating service roles • At least 2 fault domains per service • Will try and spread roles out across more Front-End-2 Front-End-1 Front-End-2 Front-End-1 Middle Tier-2 Middle Tier-1 Middle Tier-2 Middle Tier-1 Middle Tier-3 Middle Tier-3 Fault Domain 1 Fault Domain 2 Fault Domain 3

  32. Update Domains • Update domains specifies what percentage of your service you will take offline for an upgrade • Specify the # of update domains for your service • Default is 5 and max is 20 • Roles are evenly assigned an update domain • Used to update only one domain at a time • Rolling update Fault domains Upgrade domains allocated across fault domains

  33. Service Deployment and Maintenance

  34. Containing Failure: Datacenter Clusters • Datacenters are divided into “clusters” • Approximately 1000 rack-mounted servers (we call them “nodes”) • Provides a unit of fault isolation • Each cluster is managed by a Fabric Controller (FC) Datacenter network FC FC FC … Cluster 1 Cluster 2 Cluster n

  35. The Fabric Controller (FC) • The “kernel” of the cloud operating system • Manages datacenter hardware • Manages Windows Azure services • Four main responsibilities: • Datacenter resource allocation • Datacenter resource provisioning • Service lifecycle management • Service health management • Inputs: • Description of the hardware and network resources it will control • Service model and binaries for cloud applications

  36. Service Deployment Steps • Process service model files • Determine resource requirements • Create role images • Allocate compute and network resources • Prepare nodes • Place role images on nodes • Create virtual machines • Start virtual machines and roles • Configure networking • Dynamic IP addresses (DIPs) assigned to nodes • Virtual IP addresses (VIPs) + ports allocated and mapped to sets of DIPs • Configure packet filter for VM to VM traffic within service • Program load balancers to allow traffic to external endpoints

  37. Service Resource Allocation • Goal: allocate service components to available resources while satisfying all hard constraints • Size of VM • HW requirements: CPU, Memory, Storage, Network • Upgrade domains • Fault domains • Secondary goal: Satisfy soft constraints • Optimize network proximity: pack different roles into same node

  38. Deploying a Service Role B Middle-Tier Role Count: 3 Update Domains: 3 Size: Large Role A Front-End Role (Front End) Count: 2 Update Domains: 3 Size: Medium www.mycloudapp.net www.mycloudapp.net Load Balancer 10.100.0.36 10.100.0.122 Upgrade domain Fault domain

  39. Inside a Deployed Node Physical Node Guest Partition Guest Partition Guest Partition Guest Partition Role Instance Role Instance Role Instance Role Instance Trust boundary Guest Agent Guest Agent Guest Agent Guest Agent Host Partition Image Repository (OS VHDs, role ZIP files) FC Host Agent Fabric Controller (Primary) Fabric Controller (Replica) Fabric Controller (Replica) …

  40. Detection: Load Balancer Operation • FC programs load balancers (LB) to “probe” guest agent (GA) every 15 seconds • If the guest misses two probes, the LB stops forwarding traffic • The role can report “busy” status to the GA • GA stops responding to probes • LB keeps an idle connection open for 60s • Use keep-alive commands if the connection needs to be open longer

  41. Recovery: Server and Role Health • FC maintains service availability by monitoring the software and hardware health • Based primarily on heartbeats • Automatically “heals” affected roles

  42. Updating Your Service • There are two update types: • In-place: used for large scale services and used to updated services with local state • VIP swap: for ease of testing and fail-back for smaller services • In-place (rolling) update: • Role instances updated one update domain at a time • Two modes: automatic and manual

  43. In-Place Update Front-End-1 Middle Tier-3 Middle Tier-1 Middle Tier-2 Front-End-2 • Purpose: Ensure service stays up while updating • Used by Windows Azure OS updates • System considers update domains when upgrading a service • 1/Update domains = percent of service that will be offline • Default is 5 and max is 20 • The Windows Azure SLA is based on at least two update domains and two role instances of each role Front-End-1 Front-End-2 Middle Tier-1 Middle Tier-2 Middle Tier-3 Update Domain 1 Update Domain 2 Update Domain 3

  44. Windows Azure Compute Summary • Platform as a Service is all about reducing management and operations overhead • The Windows Azure Fabric Controller is the foundation for Windows Azure’s PaaS • Provisions machines • Deploys services • Configures hardware for services • Monitors service and hardware health

More Related