
Building Big: Lessons Learned from Windows Azure Customers


Presentation Transcript


  1. Building Big: Lessons Learned from Windows Azure Customers Mark Simms (@mabsimms), Christian Martinez Windows Azure Customer Advisory Team Building Big: Lessons Learned from Windows Azure Customers 4-554

  2. Setting the stage • This is meant to be an interactive discussion – if you don't ask questions, we will! • This session will be customer stories, patterns & code. • We will get deeply nerdy with .NET and Azure services. • Designing resilient large-scale services requires careful design and architecture choices. • In this session we will explore key scenarios extracted from customer engagements, and what happens at big scale. • Windows Azure Customer Advisory Team (CAT) • Works with internal and external customers to build out some of the largest applications on Azure • Gets its hands dirty on all aspects of delivery: design, implementation and, all too often, firefighting

  3. Story time with Christian

  4. A large web site, processing asynchronous work

  5. Connected device(s) service, asynchronous processing • 100k+ connected devices publishing activity reports • Target end-to-end latency (including the cellular link): 8 seconds • Target throughput: 5,000 messages/second

  6. Connected device(s) service, asynchronous processing • Batch receiving messages for throughput • Flag completion for individual messages
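The batched-receive, per-message-completion pattern on this slide might look like the minimal sketch below. It assumes the classic Microsoft.ServiceBus.Messaging SDK; the queue name, connection string, batch size and wait time are placeholders, not the code from the engagement.

```csharp
// Minimal sketch: batched receive with per-message completion over Service Bus.
// Queue name, connection string, batch size and wait time are placeholders.
using System;
using System.Linq;
using Microsoft.ServiceBus.Messaging;

class BatchReceiver
{
    static void Main()
    {
        var client = QueueClient.CreateFromConnectionString(
            "<service-bus-connection-string>", "device-activity");

        while (true)
        {
            // One round trip pulls up to 100 messages (waiting up to 5 seconds).
            var batch = client.ReceiveBatch(100, TimeSpan.FromSeconds(5))
                        ?? Enumerable.Empty<BrokeredMessage>();

            foreach (BrokeredMessage message in batch)
            {
                try
                {
                    Process(message.GetBody<string>());
                    message.Complete();   // flag completion for this individual message
                }
                catch (Exception)
                {
                    message.Abandon();    // make the message available for redelivery
                }
            }
        }
    }

    static void Process(string body)
    {
        // handler dispatch goes here
    }
}
```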

  7. Serialized processing – increasing latency • Batched receives for chunky communication – needed to meet throughput goals • Processing messages in sequence drives up latency

  8. Switch to parallel processing
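Unbounded parallelism is what gets this service into trouble on the next slides, so the sketch below shows one hedged way to go parallel while bounding concurrency; the limit of 16 and the handler delegate are illustrative assumptions.

```csharp
// Minimal sketch: process a received batch in parallel, but bound the number of
// concurrent handlers so one large batch cannot saturate every core.
// The limit of 16 and the handler delegate are illustrative.
using System;
using System.Collections.Generic;
using System.Threading;
using System.Threading.Tasks;

static class ParallelDispatch
{
    static readonly SemaphoreSlim Throttle = new SemaphoreSlim(16, 16);

    public static async Task ProcessBatchAsync(
        IEnumerable<string> messages, Func<string, Task> handler)
    {
        var inFlight = new List<Task>();
        foreach (var message in messages)
        {
            await Throttle.WaitAsync();                 // wait for a free slot
            inFlight.Add(Task.Run(async () =>
            {
                try { await handler(message); }
                finally { Throttle.Release(); }
            }));
        }
        await Task.WhenAll(inFlight);                   // whole batch done before completing
    }
}
```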

  9. Something isn’t right • Initial performance very smooth • App quickly spikes to 100% CPU on all cores • Execution time spikes to minutes!

  10. What does WinDbg say? • Most threads blocked inside Dictionary<TKey, TValue>.FindEntry • A plain Dictionary was being used to look up the message handlers
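A plain Dictionary is not safe for concurrent reads and writes; a corrupted bucket chain can leave readers spinning inside FindEntry at 100% CPU, which matches the symptom on the previous slide. One possible fix, sketched below with illustrative names, is a thread-safe (or build-once, read-only) handler table.

```csharp
// Minimal sketch: a thread-safe handler table. A plain Dictionary corrupted by
// concurrent writers can leave readers spinning inside FindEntry at 100% CPU.
// Message-type strings and handler shape are illustrative.
using System;
using System.Collections.Concurrent;

static class HandlerRegistry
{
    static readonly ConcurrentDictionary<string, Action<string>> Handlers =
        new ConcurrentDictionary<string, Action<string>>();

    public static void Register(string messageType, Action<string> handler)
    {
        Handlers[messageType] = handler;    // ideally done once at startup
    }

    public static void Dispatch(string messageType, string body)
    {
        Action<string> handler;
        if (Handlers.TryGetValue(messageType, out handler))
        {
            handler(body);
        }
        else
        {
            throw new InvalidOperationException("No handler registered for " + messageType);
        }
    }
}
```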

  11. Something still isn’t right • Large variations in avg/max latency • After time, processing rate drops to ~5 msg / second • CPU at ~ 0%

  12. What does PerfView have to say? System.Core!System.Dynamic.Utils.TypeExtensions.GetParametersCached http://channel9.msdn.com/Series/PerfView-Tutorial/Tutorial-12-Wall-Clock-Time-Investigation-Basics
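GetParametersCached showing up on a wall-clock trace is typical of code that rebuilds and compiles expression trees (or otherwise reflects over method parameters) on every dispatch. A hedged sketch of the usual remedy – compile once, cache the delegate – follows; the cache key and invoker shape are assumptions, and non-void instance methods are assumed for brevity.

```csharp
// Minimal sketch: compile an invoker once per MethodInfo and cache the delegate,
// instead of reflecting/compiling on every message. Assumes non-void instance
// methods; the cache key and invoker shape are illustrative.
using System;
using System.Collections.Concurrent;
using System.Linq.Expressions;
using System.Reflection;

static class CompiledInvokers
{
    static readonly ConcurrentDictionary<MethodInfo, Func<object, object[], object>> Cache =
        new ConcurrentDictionary<MethodInfo, Func<object, object[], object>>();

    public static object Invoke(MethodInfo method, object target, object[] args)
    {
        var invoker = Cache.GetOrAdd(method, m =>
        {
            // Build: (target, args) => (object)((TDeclaring)target).M((T0)args[0], ...)
            var targetParam = Expression.Parameter(typeof(object), "target");
            var argsParam = Expression.Parameter(typeof(object[]), "args");

            var parameters = m.GetParameters();
            var callArgs = new Expression[parameters.Length];
            for (int i = 0; i < parameters.Length; i++)
            {
                callArgs[i] = Expression.Convert(
                    Expression.ArrayIndex(argsParam, Expression.Constant(i)),
                    parameters[i].ParameterType);
            }

            var call = Expression.Call(
                Expression.Convert(targetParam, m.DeclaringType), m, callArgs);

            return Expression.Lambda<Func<object, object[], object>>(
                    Expression.Convert(call, typeof(object)), targetParam, argsParam)
                .Compile();   // pay the expression-compilation cost once per method
        });

        return invoker(target, args);
    }
}
```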

  13. Asynchronous & queue based processing • Looks simple enough… • Required messaging exchange patterns for queuing (pub/sub, competing consumer) • Partitioning and load balancing (affinity) for queue resources • Latency vs. throughput – batching • Resources vs. latency – bounding concurrency of task execution • Message dispatch – dynamic vs. fixed function tables • Poison messages, retries • Idempotent processing

  14. Large website, scale-out relational data storage • (Very) Large scale website, backed by 500 Azure SQL databases • Physically collapsed web/app tiers to reduce latency • What can happen during periods of extreme success?

  15. Large website, scale-out relational data storage • Each cloud service has a single public IP (VIP) • Each Azure SQL Database cluster also has a single public IP • 120 web role instances, 500 databases • Connection pool default size = 100 • What’s the limit?
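Working the slide's numbers: with the default pool size of 100, each web role instance can hold up to 500 × 100 = 50,000 pooled connections toward the single SQL VIP (approaching the ~64K ephemeral-port ceiling per source/destination pair), and across 120 instances that is up to 6,000,000 potential connections on the server side. One hedged mitigation is to cap the pool explicitly, as in the sketch below; server, database and credential names are placeholders and the Max Pool Size value is only illustrative.

```csharp
// Hedged sketch: cap the pool per (instance, database) pair explicitly rather than
// relying on the default of 100. Server, database and credentials are placeholders;
// Max Pool Size = 10 is an illustrative number that makes the math visible:
// 120 instances x 500 databases x 10 = 600,000 potential connections instead of 6,000,000.
using System.Data.SqlClient;

static class ConnectionFactory
{
    const string Template =
        "Server=tcp:<logical-server>.database.windows.net,1433;" +
        "Database={0};User ID=<user>;Password=<password>;" +
        "Encrypt=True;Max Pool Size=10;Connection Timeout=30;";

    public static SqlConnection Open(string databaseName)
    {
        var connection = new SqlConnection(string.Format(Template, databaseName));
        connection.Open();
        return connection;
    }
}
```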

  16. Large website, leveraging external services • (Very) Large scale website, leveraging an external service for content moderation • Protected the external service dependency with a retry policy • On average called in 0.5% of service calls

  17. Unintended consequences • Too much trust in downstream services and client proxies • Not bounding non-deterministic calls • Blocking synchronous operations • No load shedding
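One hedged sketch of addressing the last three bullets – bounding the non-deterministic call with a timeout, staying asynchronous, and shedding load once a concurrency cap is reached – follows; the HttpClient usage, the 2-second timeout and the cap of 50 are illustrative assumptions, not the customer's code.

```csharp
// Hedged sketch: bound the external moderation call with a timeout, stay async,
// and shed load once a concurrency cap is reached. The cap of 50, the 2-second
// timeout and the endpoint are illustrative assumptions.
using System;
using System.Net.Http;
using System.Threading;
using System.Threading.Tasks;

static class GuardedModerationClient
{
    static readonly SemaphoreSlim ConcurrencyCap = new SemaphoreSlim(50, 50);

    public static async Task<string> CheckContentAsync(
        HttpClient client, string uri, HttpContent body)
    {
        // Fail fast instead of queueing indefinitely when the dependency slows down.
        if (!await ConcurrencyCap.WaitAsync(TimeSpan.Zero))
            throw new InvalidOperationException("Moderation dependency saturated; shedding load.");

        try
        {
            using (var timeout = new CancellationTokenSource(TimeSpan.FromSeconds(2)))
            {
                // Bounded, non-blocking call; never .Result/.Wait() on the hot path.
                var response = await client.PostAsync(uri, body, timeout.Token);
                response.EnsureSuccessStatusCode();
                return await response.Content.ReadAsStringAsync();
            }
        }
        finally
        {
            ConcurrencyCap.Release();
        }
    }
}
```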

  18. Large website, asynchronous document processing • Rich clients (mobile and desktop) publishing documents for processing • Using Shared Access Signature (SAS) tokens for direct writes to storage • Looks like a good design…

  19. Large website, asynchronous document processing • Storage account URI is “hard coded” into the client application • Need to update all 100k+ client applications to change storage account
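One hedged way out, sketched below with the classic Microsoft.WindowsAzure.Storage SDK, is to have a small token service hand the client the complete upload URI (account + container + blob + SAS) so the storage account lives only in service-side configuration; the container, blob and policy values are placeholders.

```csharp
// Hedged sketch: a token service returns the complete upload URI (account +
// container + blob + SAS), so the storage account name never ships in the client.
// Uses the classic Microsoft.WindowsAzure.Storage SDK; names and lifetimes are placeholders.
using System;
using Microsoft.WindowsAzure.Storage;
using Microsoft.WindowsAzure.Storage.Blob;

static class UploadTokenService
{
    public static Uri IssueUploadUri(string connectionString, string containerName, string blobName)
    {
        var account = CloudStorageAccount.Parse(connectionString);
        var container = account.CreateCloudBlobClient().GetContainerReference(containerName);
        var blob = container.GetBlockBlobReference(blobName);

        var policy = new SharedAccessBlobPolicy
        {
            Permissions = SharedAccessBlobPermissions.Write,
            SharedAccessExpiryTime = DateTimeOffset.UtcNow.AddMinutes(30)
        };

        // Swapping storage accounts now only requires changing service-side configuration.
        return new Uri(blob.Uri + blob.GetSharedAccessSignature(policy));
    }
}
```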

  20. Design Choices & Challenges

  21. Exploration – Data Design • Optimize for the most stringent case • Simplicity is king • No one true solution • Devices and Services workload – connected embedded devices and applications streaming data to the cloud • 100k+ devices, growing 50k / month • Regional affinity (North America only)

  22. Option 1: Relational – Considerations and Challenges • Cannot fulfill with a single database • Exceeds transactional throughput limit • Data growth will exceed practical size limits • Insert heavy workload • Pressure on transaction log • Partitioning keys? • Device ID, User account? • Partitioning approach • Bucket, range, lookup?
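For illustration only, a bucket-style partitioning scheme over a numeric device ID might look like the sketch below; the shard count, database naming convention and the assumption of non-negative numeric IDs are all hypothetical.

```csharp
// Illustrative sketch of bucket partitioning by device ID. The shard count, database
// naming convention and the assumption of non-negative numeric IDs are hypothetical.
static class ShardMap
{
    const int ShardCount = 100;   // illustrative fan-out

    // Stable mapping: the same device always lands in the same shard.
    public static int ShardFor(long deviceId)
    {
        return (int)(deviceId % ShardCount);
    }

    public static string DatabaseNameFor(long deviceId)
    {
        return "telemetry_" + ShardFor(deviceId).ToString("D3");   // e.g. telemetry_042
    }
}
```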

  23. Option 1: Relational – Considerations and Challenges • Periodic query spike on bulk reporting • Impact to online operations (30M+ rows) • Rebalancing • Moving data between partitions / databases • Distribution of reference data (relational model) • Keeping in sync • Impact of noisy neighbors (Azure SQL DB) • Variable latency, pushback under heavy load • Cost of management (SQL IaaS) • Cost of automation for patching, maintenance

  24. Tackling the Insert Challenge • Inserting large volumes of streaming data into a data store • Data store is governed on number of operations (transactions) • Trade consistency for throughput – enqueue, batch and publish • Get: increased throughput, shift work to a "cheap" resource (app memory) • Give up: full durability (potential data loss)
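A minimal sketch of the enqueue/batch/publish trade-off follows: producers append to an in-memory buffer and a background loop flushes chunky batches to the store. The flush delegate, batch size and timeouts are illustrative, and anything buffered but not yet flushed is lost if the process dies – exactly the durability give-up named above.

```csharp
// Minimal sketch of enqueue/batch/publish: producers append to an in-memory buffer,
// a background loop flushes chunky batches. The flush delegate, batch size and
// timeouts are illustrative; unflushed records are lost if the process dies.
using System;
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.Threading;
using System.Threading.Tasks;

class BufferedWriter
{
    readonly BlockingCollection<string> _buffer = new BlockingCollection<string>(100000);

    public void Enqueue(string record)
    {
        _buffer.Add(record);   // cheap: app memory, no storage round trip
    }

    public async Task FlushLoopAsync(
        Func<IReadOnlyList<string>, Task> writeBatchToStore, CancellationToken token)
    {
        var batch = new List<string>(1000);
        while (!token.IsCancellationRequested)
        {
            string record;
            // Drain up to 1000 records, waiting briefly for stragglers.
            while (batch.Count < 1000 && _buffer.TryTake(out record, TimeSpan.FromMilliseconds(250)))
                batch.Add(record);

            if (batch.Count > 0)
            {
                await writeBatchToStore(batch);   // one chunky write instead of many small ones
                batch.Clear();
            }
        }
    }
}
```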

  25. Tackling the Insight Challenge • Challenge: know that your site is having issues before Twitter does • This is not a randomly chosen anecdote. • Instrument, collect, analyze – react • Best: buy your way to victory (AppDynamics, New Relic, etc.) • Also need to instrument the application effectively for "contextual" data (aka logging)

  26. Instrumenting Applications • Instrument for production logging • If you didn’t log & capture it, it didn’t happen • Implement inter-service monitoring and alerting • Nothing interesting happens on a single instance • Run-time configurable logging • Enable activation (capture or delivery) of additional channels at run-time • Getting logging right • All logging must be asynchronous • Buffer and filter before pushing to remote service or store

  27. Bringing down a production system with logging…

  28. Demo: Instrumenting Applications with EventSource
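A minimal EventSource sketch in the spirit of the demo, using System.Diagnostics.Tracing from .NET 4.5; the source name, event IDs and payloads are illustrative, not the demo code.

```csharp
// Minimal EventSource sketch in the spirit of the demo (System.Diagnostics.Tracing,
// .NET 4.5). Source name, event IDs and payloads are illustrative.
using System.Diagnostics.Tracing;

[EventSource(Name = "MyCompany-MyService")]
sealed class ServiceEventSource : EventSource
{
    public static readonly ServiceEventSource Log = new ServiceEventSource();

    [Event(1, Level = EventLevel.Informational)]
    public void RequestStarted(string requestId, string operation)
    {
        if (IsEnabled()) WriteEvent(1, requestId, operation);
    }

    [Event(2, Level = EventLevel.Error)]
    public void RequestFailed(string requestId, string error)
    {
        if (IsEnabled()) WriteEvent(2, requestId, error);
    }
}

// Usage: ServiceEventSource.Log.RequestStarted(correlationId, "ProcessDocument");
// ETW sessions (or an EventListener) decide what gets captured, which keeps the
// call cheap and lets additional channels be enabled at run time.
```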

  29. Option 2: Compositional Azure Storage • This isn't a relational workload – per-device insert and lookup, plus a periodic batch transfer • Per-device lookup – natural fit for table storage • Device ID = Pk, data type = Rk • Querying by device: by time – direct {Pk, Rk} lookup; by day – direct {Pk} query, max of 2,880 records per partition • Periodic batch transfer – natural fit for blob storage • Instance + timestamp = blob id; buffer and write into blocks; roll over on a time interval (10 min) • Batch transfer by time frame – parallel download of all blobs matching the timeframe pattern • Adding scale capacity – 20k operations/second per storage account
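The table-storage shape described above (Device ID = Pk, data type = Rk) might be modelled as in the sketch below, using the classic Microsoft.WindowsAzure.Storage.Table types; the entity, the RowKey suffixing and the property names are assumptions.

```csharp
// Sketch of the table-storage shape: PartitionKey = device ID, RowKey = data type
// plus a timestamp suffix for time-ordered lookups. Uses the classic
// Microsoft.WindowsAzure.Storage.Table types; entity and property names are assumptions.
using System;
using Microsoft.WindowsAzure.Storage.Table;

class DeviceReadingEntity : TableEntity
{
    public DeviceReadingEntity() { }   // required for deserialization

    public DeviceReadingEntity(string deviceId, string dataType, DateTimeOffset timestamp)
    {
        PartitionKey = deviceId;                                        // Device ID = Pk
        RowKey = dataType + "_" + timestamp.UtcTicks.ToString("D19");   // Data type (+ time) = Rk
    }

    public double Value { get; set; }
}

static class DeviceTable
{
    public static void Insert(CloudTable table, DeviceReadingEntity entity)
    {
        table.Execute(TableOperation.Insert(entity));   // direct {Pk, Rk} insert / lookup
    }
}
```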

  30. Azure Storage Account - Blob

  31. Azure Storage Account - Table

  32. Azure Storage Account - Queues

  33. User centric web application • Services site for mobile device applications • 1M+ users at launch, 1M+ users added per month • Fronted by Android, iOS and Windows Phone clients • Personalized information feeds and data sets • Examples: browsing history, shopping cart • Assuming up to 30% of the user base can be online at any point in time • Maximum response latency 250 ms @ 99th percentile

  34. Tearing apart the architecture • Where are the scalability bottlenecks? • Where are the availability and failure points? • Where are the key insight and instrumentation points?

  35. Demo: Implementing an information publishing site

  36. Recap • Know the numbers – platform scalability targets • Compute, storage, networking and platform services • Scalability == capacity * efficiency • Watch out for shared resources and contention points • At high load and concurrency “interesting” things happen • Default to asynchronous, bound all calls • Insight is power – measurement and observation of behavior • Without rich telemetry and instrumentation – down to the call level – apps are running blind • Buy your way to victory, leverage asynchronous and structured logging

  37. Resources • Failsafe: Building scalable, resilient cloud services • http://channel9.msdn.com/Series/FailSafe • Cloud Service Fundamentals - Reference code for Azure • http://code.msdn.microsoft.com/Cloud-Service-Fundamentals-4ca72649
