Building Scalable and Reliable Applications with Windows Azure

SVC08. Building Scalable and Reliable Applications with Windows Azure. Brad Calder, Director/Architect, Microsoft Corporation. Challenges for Building Scalable Cloud Services: High Availability (application and hardware failures); Scalability (scale out to meet peak traffic demands).

Presentation Transcript


  1. SVC08 Building Scalable and Reliable Applications with Windows Azure Brad Calder Director/Architect Microsoft Corporation

  2. Challenges for Building Scalable Cloud Services • High Availability • Application and hardware failures • Scalability • Scale out to meet peak traffic demands • Lifecycle management • Upgrading, monitoring and debugging

  3. Agenda • Data Scalability • Scalable Computation and Workflow • Lifecycle Management – Upgrade and Versioning

  4. Data Building Blocks • Volatile Storage • Local storage • Caches (e.g., AppFabric Cache and memcached) • Persistent Storage • Windows Azure Storage • Blobs • Tables • Queues • Drives • SQL Azure • Relational DB

  5. Fundamental Storage Abstractions • Blobs – Provide a simple interface for storing named files along with metadata for the file • Tables– Provide structured storage. A Table is a set of entities, which contain a set of properties • Queues– Provide reliable storage and delivery of messages for an application

  6. Storage Account Performance at Commercial Availability • Capacity • 100 TB • Throughput • Up to hundreds of megabytes per second • Transactions • Up to thousands of requests per second • For high-throughput content use Windows Azure CDN for Blobs • 18 locations globally (US, Europe, Asia, Australia and South America)

  7. Scalable Storage: Partitioning and Load Balancing • We group your Blobs, Entities, and Messages into Partitions • Automatically load balance partitions across our servers • Monitor the usage patterns of partitions and servers • Adjust what objects are grouped together as needed to further split the load across servers

  8. Automatic Load Balancing - Assignment • [Diagram: VIP, front ends (FE), Master System, back-end servers BE1 to BE4, Distributed File System; legend: partition, server load; steps shown: Offload Partitions, Reassign Partitions] • Time between offload and reload is on the order of seconds • Time to decide to load balance is on the order of minutes • Goal is to reassign a partition only if the system has to

  9. Automatic Load Balancing - Split • [Diagram: VIP, front ends (FE), Master System, back-end servers BE1 to BE4, Distributed File System; legend: partition, server load; steps shown: Split and Offload, Assign Partition]

  10. Partitioning of Data Objects • Load balancing is an internal concept to Windows Azure Storage • Partitioning enables scalability • What matters to the application is the partitioning key used for objects • All objects with the same partition key value are always grouped into the same partition • Partition Key used • Blobs – Blob Name • Entities – Application defined Partition Key for Table • Messages – Queue Name

  11. Choosing a Table Partition Key • Granularity of Entity Group Transactions • Make the partition key only as big as you need it for atomic batch transactions • Spread out load across partitions • More partitions – makes it easier to automatically balance load • The two extremes • Store all entities with same Partition Key value • Every entity has a different Partition Key value See Jai Haridas talk, PDC09-SVC09 Table Deep Dive, for more details
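To make the two extremes on this slide concrete, here is a minimal sketch that groups a few entities under different partition-key choices. It does not use the Windows Azure Table API; the `group_by_partition` helper and the sample `orders` data are hypothetical and exist only for illustration.

```python
from collections import defaultdict


def group_by_partition(entities, key_fn):
    """Group entities by the partition key that key_fn derives for each one."""
    partitions = defaultdict(list)
    for entity in entities:
        partitions[key_fn(entity)].append(entity)
    return dict(partitions)


# Hypothetical sample entities.
orders = [
    {"customer": "alice", "order_id": 1},
    {"customer": "alice", "order_id": 2},
    {"customer": "bob", "order_id": 3},
]

# Extreme 1: every entity shares one key, so there is a single partition
# and the load cannot be spread.
single = group_by_partition(orders, lambda e: "all")

# Extreme 2: every entity has its own key, so load spreads freely but no
# two entities can take part in the same atomic batch.
unique = group_by_partition(orders, lambda e: str(e["order_id"]))

# Middle ground: key by customer, so one customer's orders can be batched
# together while different customers can live on different servers.
by_customer = group_by_partition(orders, lambda e: e["customer"])
```

In this sketch the per-customer key keeps the batch scope only as large as the application needs, which is the guidance the slide gives.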

  12. Per Object/Partition Performance at Commercial Availability • Throughput • Single Queue and Single Table Partition • Up to 500 transactions per second • Single Blob • Small reads/writes up to 30 MB/s • Large reads/writes up to 60 MB/s • Latency • Typically around 100ms (for small trans) • But can increase to a few seconds during heavy spikes while load balancing kicks in
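As a rough back-of-the-envelope aid, the per-partition figure above can be turned into a lower bound on how many partitions (or queues) a workload needs. This is only a sketch: it assumes load is spread evenly across partitions, and `min_partitions` is a hypothetical helper, not part of any SDK.

```python
import math

PER_PARTITION_TPS = 500  # per-partition figure quoted on the slide


def min_partitions(target_tps, per_partition_tps=PER_PARTITION_TPS):
    """Lower bound on partitions needed, assuming perfectly even load."""
    return math.ceil(target_tps / per_partition_tps)
```

For example, a target of 2,000 transactions per second would need at least 4 partitions under these assumptions; real workloads with skewed keys would need more headroom.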

  13. Improving Latency • Use a cache in your application layer to provide 10 ms latencies • Can be very beneficial for user interactive apps • Have the caching layer serve the dominant requests (e.g., AppFabric Cache, memcached) • You control the size and customize the cache • Fill cache misses from cloud storage
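The "fill cache misses from cloud storage" idea can be sketched as a small read-through cache. This is an in-process illustration only, not AppFabric Cache or memcached; the `ReadThroughCache` class, its `ttl_seconds` parameter, and the `load_from_storage` callback are all assumptions made for the example.

```python
import time


class ReadThroughCache:
    """Serve repeated reads from memory; fall back to storage on a miss."""

    def __init__(self, load_from_storage, ttl_seconds=60.0, clock=time.monotonic):
        self._load = load_from_storage  # hypothetical callback that reads cloud storage
        self._ttl = ttl_seconds
        self._clock = clock
        self._entries = {}  # key -> (value, expiry time)
        self.hits = 0
        self.misses = 0

    def get(self, key):
        entry = self._entries.get(key)
        now = self._clock()
        if entry is not None and entry[1] > now:
            self.hits += 1
            return entry[0]
        # Miss or expired entry: fetch from storage and remember the result.
        self.misses += 1
        value = self._load(key)
        self._entries[key] = (value, now + self._ttl)
        return value
```

A short expiry keeps stale data bounded; how long is acceptable depends on the application and is not something the slides specify.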

  14. Agenda • Data Scalability • Scaling Computation and Workflow • Lifecycle Management – Upgrade and Versioning

  15. Compute Service Model – What is Described? • The topology of your service • Types of roles and their binaries • How the roles are connected • Configuration of the service • How many instances of each role type • Application specific configuration settings • How many update domains you need • [Diagram: VIP, Web Role, Worker Role]

  16. Best Practices for Scaling Out Compute • Use multiple instances of each role type so availability is not affected by application/node failures or roles being upgraded • Scaling out means deploying more role instances as load increases • Each instance of a role type performs the same task and looks identical

  17. Web + Worker Role Service Model • [Diagram: VIP, Web Role, Worker Role, Windows Azure Storage (Blob, Table, Queue)]

  18. Web + Worker Role Service Model • [Diagram: the same model scaled out: VIP, two Web Roles, five Worker Roles, Windows Azure Storage (Blob, Table, Queue)]

  19. Basic Workflow Pattern • Break job into work items (optional “Map” step) • Feed the work items to the worker roles • Worker resolves the work item • Aggregate work item results (optional “Reduce” step)
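The four steps above can be sketched in a few lines. The job here (summing squares) and the function names are hypothetical; in a real service the work items would be handed to worker roles rather than processed in a local loop.

```python
def split_job(numbers, chunk_size):
    """Optional "Map" step: break the job into work items."""
    return [numbers[i:i + chunk_size] for i in range(0, len(numbers), chunk_size)]


def resolve_work_item(chunk):
    """What a single worker does with one work item."""
    return sum(n * n for n in chunk)


def aggregate(results):
    """Optional "Reduce" step: combine the per-item results."""
    return sum(results)


def run_job(numbers, chunk_size=3):
    work_items = split_job(numbers, chunk_size)
    # A local loop stands in for feeding the items to worker roles.
    results = [resolve_work_item(item) for item in work_items]
    return aggregate(results)
```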

  20. Loosely Coupled Work with Queues • Worker-Queue Model • Load work in a queue • Many workers consume the queue • [Diagram: Web Roles, Input Queue (Work Items) in an Azure Queue, Worker Roles]

  21. Queue Workflow Concepts • Windows Azure Queue Provides • Guaranteed delivery (two-step consumption) • Worker dequeues a message and marks it as invisible • Worker deletes the message when finished processing it • If the Worker role crashes, the message becomes visible again for another Worker to process • Doesn’t guarantee “only once” delivery • Doesn’t guarantee ordering • Best effort FIFO • Make work items idempotent • Work is repeatable and can be done multiple times • [Diagram: Web Roles, Input Queue (Work Items) in an Azure Queue, Worker Roles]
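Two-step consumption and idempotent work can be illustrated with a small in-memory stand-in. This is not the Windows Azure Queue API: `SketchQueue`, its `visibility_timeout` parameter, and the `process` helper are hypothetical, and time is passed in explicitly so the behavior is easy to follow.

```python
import itertools


class SketchQueue:
    """In-memory stand-in for a queue with get-then-delete consumption."""

    def __init__(self, visibility_timeout=30.0):
        self._timeout = visibility_timeout
        self._messages = {}  # message id -> [body, time it becomes visible again]
        self._ids = itertools.count(1)

    def put(self, body):
        self._messages[next(self._ids)] = [body, 0.0]

    def get(self, now):
        """Return (id, body) of a visible message and hide it, or None."""
        for msg_id, entry in self._messages.items():
            if entry[1] <= now:
                entry[1] = now + self._timeout  # step 1: mark invisible
                return msg_id, entry[0]
        return None

    def delete(self, msg_id):
        self._messages.pop(msg_id, None)  # step 2: remove once processed


def process(queue, now, done, crash=False):
    """One worker iteration; returns True if a message was fully processed."""
    got = queue.get(now)
    if got is None:
        return False
    msg_id, body = got
    if crash:
        return False  # simulated crash: the message is never deleted
    done.add(body)  # idempotent: repeating this leaves the same result
    queue.delete(msg_id)
    return True
```

Because a message can reappear after a crash, the work itself is written so that doing it twice has the same effect as doing it once (here, adding to a set).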

  22. Basic Workflow Pattern • [Diagram: Job Manager, Web Roles, Input Queues (Work Items) in Azure Queues, Worker Roles]

  23. Workflow Job Manager • Job Manager • Generating the Load • Divide the job into work items • Distributing the load • Send work items to Workers via a Queue • Monitor progress • Monitor the load distribution • Manage resources • Number of workers, queues, etc • Aggregate results • Take individual work item results and aggregate

  24. Job Manager Workflow Pattern • [Diagram: Large Job, Job Manager, Input Blob Store, Input Queue (Work Items), Worker Roles, Output Queues (Item done), Output Blob Store]

  25. RiskMetrics Case Study • Focused on financial risk management • Need to run daily financial and market simulations • They use the Job Manager Workflow model • Currently feed the work items to 2,000 Worker roles • Plan to run 10,000+ Worker roles next year • Results are queued back to the Job Manager, aggregated, and sent back to company • They needed higher throughput from a single queue, so they looked at two approaches

  26. Scaling Queue Throughput • Batch Work Items into Blobs • Group together many work items into a Blob • Queue up pointer to blob OR • Use Multiple Queues • Job Manager • Responsible for adding and removing queues • Workers • Determine what queues to use • Pick randomly via List Queues, or have queues assigned by the Job Manager
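The first approach (batch work items into a blob and queue a pointer to it) can be sketched as follows. A dict stands in for blob storage and a `deque` for the queue; the function names and the `batch-NNNNN` naming scheme are assumptions for illustration.

```python
import json
from collections import deque


def enqueue_batched(work_items, batch_size, blob_store, queue):
    """Write each batch to a blob and enqueue only the blob's name."""
    for start in range(0, len(work_items), batch_size):
        batch = work_items[start:start + batch_size]
        blob_name = f"batch-{start // batch_size:05d}"  # hypothetical naming scheme
        blob_store[blob_name] = json.dumps(batch)
        queue.append(blob_name)


def dequeue_batch(blob_store, queue):
    """Worker side: take one pointer and load the work items it refers to."""
    blob_name = queue.popleft()
    return json.loads(blob_store[blob_name])
```

With a batch size of 4, ten work items cost three queue messages instead of ten, which is where the throughput gain in this sketch comes from.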

  27. Continuation for Long Running Jobs • Want to continue on failover • High level approach • Break into smaller and repeatable steps • Record progress after each step • Query progress after failover • Resume from the failed step • [Diagram: Progress Table, Intermediate persistent state]

  28. Continuation for Long Running Jobs • Upon failover: read progress and resume • [Diagram: Progress Table, Intermediate persistent state]
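A minimal sketch of the record-and-resume approach, assuming a dict as the progress table and a list of callables as the steps. `run_with_checkpoints` and its `fail_at` parameter (used only to simulate a failure) are hypothetical.

```python
def run_with_checkpoints(job_id, steps, progress_table, fail_at=None):
    """Run steps in order, recording the index of the next step after each one.

    Returns the index the run started from, so callers can see whether it
    resumed. Steps must be repeatable: a crash between finishing a step and
    recording it means that step runs again after failover.
    """
    start = progress_table.get(job_id, 0)  # query progress after failover
    for index in range(start, len(steps)):
        if index == fail_at:
            raise RuntimeError(f"simulated failure in step {index}")
        steps[index]()
        progress_table[job_id] = index + 1  # record progress after each step
    return start
```

In a real deployment the progress table would live in persistent storage rather than in memory, so that a replacement worker can read it.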

  29. Agenda • Data Scalability • Scaling Computation and Workflow • Lifecycle Management – Upgrade and Versioning

  30. In-Place Rolling Upgrades • Upgrade domains • Break your roles evenly over a set of upgrade domains • Rolling Upgrade • Walk each upgrade domain one at a time • Upgrade just the roles in the current domain • Benefits • Minimizes availability loss • Only one domain of roles restarted at a time • Allows local state to persist across upgrade • Catches application upgrade issues early • Detect upgrade issues after first few domains • [Diagram: a service with 6 Web Role instances and 9 Worker instances spread over two upgrade domains: UD0 holds 3 web and 5 worker instances, UD1 holds 3 web and 4 worker instances]
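The walk over upgrade domains can be sketched as below. This is an illustration of the idea only, not how the Windows Azure fabric implements it; the round-robin assignment, the function names, and the `health_check` callback are assumptions.

```python
def assign_upgrade_domains(instance_names, domain_count):
    """Spread instances as evenly as possible over the upgrade domains."""
    domains = [[] for _ in range(domain_count)]
    for i, name in enumerate(instance_names):
        domains[i % domain_count].append(name)
    return domains


def rolling_upgrade(domains, upgrade_instance, health_check):
    """Upgrade one domain at a time; stop at the first domain that looks unhealthy.

    Returns the index of the failing domain, or None if every domain passed.
    """
    for index, domain in enumerate(domains):
        for name in domain:
            upgrade_instance(name)
        if not all(health_check(name) for name in domain):
            return index  # later domains keep running the old version
    return None
```

With 9 worker instances and 2 domains, this assignment yields groups of 5 and 4, matching the split shown on the slide. Stopping at the first unhealthy domain is what lets upgrade issues surface before the whole service is affected.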

  31. Versioning with Rolling Upgrades • Always assume you will have old and new running side by side in your service • Version everything • Protocols, Schemas, Messages, Data Objects, etc • Two common scenarios • Protocol change between two roles • Table schema change

  32. Protocol Change with Rolling Upgrade • Have 2 roles talking protocol V1 • Want to switch them over to protocol V2 without losing availability when using rolling upgrade • Two step process • Upgrade roles to understand new and old protocols • Once done all nodes know how to speak the old and new version. • All nodes still initiate contact sending old protocol version • But if they receive the new version they will respond with it • Then trigger the use of the new version, either: • Release an upgrade that starts speaking the new version OR • Send out a dynamic configuration change to start using new version

  33. Protocol Change via Rolling Upgrade • Step 1: Upgrade roles to understand both versions, and initiate only old version • Step 2: Trigger the use of the new version • [Diagram: Web Roles and Cache Roles spread over upgrade domains UD0, UD1 and UD2; binary versions go from Version 1 through Version 1.5 to Version 2, while the protocol version goes from Version 1 to Version 2]
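The two-step switch can be sketched with a node that understands both protocol versions but only initiates the old one until told otherwise. The `Node` class, the message layout, and the `initiate_version` setting are hypothetical; in practice the switch would arrive through a second upgrade or a dynamic configuration change, as the previous slide describes.

```python
SUPPORTED_VERSIONS = {1, 2}  # a "V1.5" binary understands both protocol versions


class Node:
    def __init__(self, initiate_version=1):
        # Step 1 ships with this set to 1; step 2 flips it to 2.
        self.initiate_version = initiate_version

    def make_request(self, payload):
        return {"version": self.initiate_version, "payload": payload}

    def handle_request(self, message):
        version = message["version"]
        if version not in SUPPORTED_VERSIONS:
            raise ValueError(f"unsupported protocol version {version}")
        # Reply in whichever version the caller used.
        return {"version": version, "payload": message["payload"].upper()}
```

Because every node can already answer version 2 after step 1, flipping `initiate_version` one upgrade domain at a time never leaves a caller talking to a node that cannot understand it.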

  34. Table Schema Change • Have a version property in each entity • Types of Schema Change • Add Non-key Properties • Perform two step upgrade process • Use “IgnoreMissingProperties” • Remove Non-key Properties • Perform two step upgrade process • Use “IgnoreMissingProperties” and “ReplaceOnUpdate” • Change in Partition Key or Row Key • Copy all entities to new primary key

  35. Adding Additional Property • Release a new version V1.5 of client • Use the new class with additional properties • Automatically populates the new property with default value on insert/update • [Diagram: Client V1, Client V1.5]

  36. Schema Change – Upgrade to V1.5 Client • V1.5 Client • Has class with new property in it • If Entity version is V1 • Store the default value in the new property • Do not upgrade the version of the entity • V1 Client • Ignores the new property, since it is using “IgnoreMissingProperties” • [Diagram: Client V1, Client V1.5; the new property holds the default value]

  37. Schema Change – Upgrade to V2 Client • V2 Client • Update all entities to V2 and start putting real values in new property • V1.5 Client • If Entity version is V1 • Store the default value in the new property, and don’t change version • If Entity version is V2 • Use the new value and update it • [Diagram: Client V1, Client V1.5, Client V2]

  38. Table Schema Rolling Upgrade Summary • Code V1 • Always uses version 1 • Code V1.5 • Creates version 1 • Processes an existing entity based on its current version 1 or 2, and doesn’t convert any entities • Inserts default value for property for version 1 • Code V2 • Converts to version 2 and always uses version 2
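The summary above can be sketched with plain dicts standing in for table entities. The `Rating` property, its default, and the function names are hypothetical, and the V1 reader only models the effect of ignoring unknown properties rather than the actual client library setting.

```python
DEFAULT_RATING = 0  # hypothetical default for the newly added property
V1_PROPERTIES = ("PartitionKey", "RowKey", "Version", "Name")


def v1_read(entity):
    """V1 code: only sees the properties it knows about."""
    return {k: v for k, v in entity.items() if k in V1_PROPERTIES}


def v1_5_save(entity, rating=None):
    """V1.5 code: acts on the entity's current version and never converts it."""
    updated = dict(entity)
    if updated.get("Version", 1) == 1:
        updated["Rating"] = DEFAULT_RATING  # default only, version stays 1
    elif rating is not None:
        updated["Rating"] = rating  # version 2 entity: use the real value
    return updated


def v2_save(entity, rating):
    """V2 code: converts to version 2 and stores real values."""
    updated = dict(entity)
    updated["Version"] = 2
    updated["Rating"] = rating
    return updated
```

The point of the intermediate V1.5 step is that no code writes real values into the new property until every running instance can at least tolerate it.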

  39. Takeaways • Data Performance • Leverage partitioning • Scaling Computation • Loosely coupled workflow with queues • Upgrade and Versioning • With in-place rolling upgrades, always assume old and new running side by side • Version everything and use the two step process

  40. Call To Action • Sign up for the Windows Azure CTP • Go to https://windows.azure.com • Redeem your CTP token • Visit the Windows Azure developer web site • Go to http://dev.windowsazure.com • Go to the Windows Azure lounge • Try out the Hands on Labs • Meet members of the Windows Azure team

  41. YOUR FEEDBACK IS IMPORTANT TO US! Please fill out session evaluation forms online at MicrosoftPDC.com

  42. Learn More On Channel 9 • Expand your PDC experience through Channel 9 • Explore videos, hands-on labs, sample code and demos through the new Channel 9 training courses channel9.msdn.com/learn Built by Developers for Developers….
