Building Big: Lessons Learned from Windows Azure Customers (4-554) Mark Simms (@mabsimms), Christian Martinez – Windows Azure Customer Advisory Team
Setting the stage • This is meant to be an interactive discussion – if you don’t ask questions, we will! • This session will be customer stories, patterns & code. • We will get deeply nerdy with .NET and Azure services. • Designing resilient large-scale services requires careful design and architecture choices • In this session we will explore key scenarios extracted from customer engagements, and what happens at big scale. • Windows Azure Customer Advisory Team (CAT) • Works with internal and external customers to build out some of the largest applications on Azure • Get our hands dirty on all aspects of delivery: design, implementation, and all too often firefighting
Connected device(s) service, asynchronous processing • 100k+ connected devices publishing activity reports • Target end-to-end latency (including the cellular link) – 8 seconds • Target throughput – 5,000 messages/second
Connected device(s) service, asynchronous processing • Batch receiving messages for throughput • Flag completion for individual messages
Serialized processing – increasing latency • Batching receive for chunky communication – needed to meet throughput goals • Processing messages in sequence drives up latency
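To make the latency point concrete, here is a minimal sketch of batch receive with per-message completion and concurrent processing. It uses the current Azure.Messaging.ServiceBus SDK purely for illustration (the original engagement used the older Windows Azure Service Bus API); the queue name "activity-reports" and the ProcessAsync handler are hypothetical.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Threading.Tasks;
using Azure.Messaging.ServiceBus;

class BatchProcessor
{
    public static async Task RunAsync(string connectionString)
    {
        await using var client = new ServiceBusClient(connectionString);
        ServiceBusReceiver receiver = client.CreateReceiver("activity-reports"); // hypothetical queue name

        while (true)
        {
            // Batch the receive to keep network traffic chunky (throughput goal)...
            IReadOnlyList<ServiceBusReceivedMessage> batch =
                await receiver.ReceiveMessagesAsync(maxMessages: 100, maxWaitTime: TimeSpan.FromSeconds(1));

            // ...but process and complete messages concurrently, not in sequence (latency goal).
            await Task.WhenAll(batch.Select(async message =>
            {
                await ProcessAsync(message);                   // hypothetical per-message handler
                await receiver.CompleteMessageAsync(message);  // flag completion per individual message
            }));
        }
    }

    static Task ProcessAsync(ServiceBusReceivedMessage message) => Task.CompletedTask;
}
```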
Something isn’t right • Initial performance very smooth • App quickly spikes to 100% CPU on all cores • Execution time spikes to minutes!
What does windbg say? • Most threads blocked in FindEntry of Dictionary • Using a Dictionary to look up the message handlers
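The 100% CPU spin is the classic symptom of a non-thread-safe Dictionary<TKey,TValue> being written by multiple threads: a race while the handler cache is populated can corrupt the bucket chains and leave FindEntry looping forever. A minimal sketch of the usual fix, with a hypothetical handler factory:

```csharp
using System;
using System.Collections.Concurrent;

class HandlerRegistry
{
    // Dictionary<K,V> tolerates concurrent reads but not concurrent writes;
    // ConcurrentDictionary (or a fully pre-built, read-only map) avoids the corruption.
    private static readonly ConcurrentDictionary<string, Action<string>> Handlers = new();

    public static void Dispatch(string messageType, string payload)
    {
        // GetOrAdd is thread-safe; CreateHandler is a hypothetical factory for this sketch.
        Action<string> handler = Handlers.GetOrAdd(messageType, CreateHandler);
        handler(payload);
    }

    private static Action<string> CreateHandler(string messageType) =>
        payload => Console.WriteLine($"{messageType}: {payload}");
}
```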
Something still isn’t right • Large variations in avg/max latency • After time, processing rate drops to ~5 msg / second • CPU at ~ 0%
What does PerfView have to say? System.Core!System.Dynamic.Utils.TypeExtensions.GetParametersCached http://channel9.msdn.com/Series/PerfView-Tutorial/Tutorial-12-Wall-Clock-Time-Investigation-Basics
Asynchronous & queue based processing • Looks simple enough… • Required messaging exchange patterns for queuing (pub/sub, competing consumer) • Partitioning and load balancing (affinity) for queue resources • Latency vs. throughput – batching • Resources vs. latency – bounding concurrency of task execution • Message dispatch – dynamic vs. fixed function tables • Poison messages, retries • Idempotent processing
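For the "bounding concurrency of task execution" item above, a minimal sketch using SemaphoreSlim; the limit of 64 is an illustrative value, not a recommendation from the talk:

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

class BoundedDispatcher
{
    // Bound the number of in-flight message-processing tasks so a burst from the queue
    // cannot exhaust threads, sockets, or memory.
    private static readonly SemaphoreSlim Throttle = new SemaphoreSlim(64);

    public static async Task DispatchAsync(Func<Task> processMessage)
    {
        await Throttle.WaitAsync();   // wait for a slot instead of starting unbounded work
        try
        {
            await processMessage();
        }
        finally
        {
            Throttle.Release();
        }
    }
}
```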
Large website, scale-out relational data storage • (Very) Large scale website, backed by 500 Azure SQL databases • Physically collapsed web/app tiers to reduce latency • What can happen during periods of extreme success?
Large website, scale-out relational data storage • Each cloud service has a single public IP (VIP) • Each Azure SQL Database cluster also has a single public IP • 120 web role instances, 500 databases • Connection pool default size = 100 • What’s the limit?
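A back-of-the-envelope view of where that limit comes from; the ~64K figure is the usual ephemeral-port ceiling between a single source IP and a single destination IP and port, stated here as an assumption rather than a number from the session:

```csharp
// Numbers taken from the slide above.
const int webRoleInstances = 120;
const int databases = 500;
const int defaultPoolSize = 100;   // ADO.NET default Max Pool Size

// Worst case if every pool fills: 120 * 500 * 100 = 6,000,000 pooled connections.
long worstCase = (long)webRoleInstances * databases * defaultPoolSize;
System.Console.WriteLine($"Worst-case pooled connections: {worstCase:N0}");

// All of that traffic flows between two single IPs (the cloud service VIP and the SQL Database
// gateway VIP), so the practical ceiling is roughly the ~64K ephemeral source ports available
// between one source IP and one destination IP:port - orders of magnitude below the worst case.
// One mitigation (an assumption, not from the talk): cap pools explicitly, e.g. "Max Pool Size=20"
// in the connection string, and spread load across more endpoints.
```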
Large website, leveraging external services • (Very) Large scale website, leveraging an external service for content moderation • Protected the external service dependency with a retry policy • On average called in 0.5% of service calls
Unintended consequences • Too much trust in downstream services and client proxies • Not bounding non-deterministic calls • Blocking synchronous operations • No load shedding
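A minimal sketch of bounding and shedding the external call with a timeout plus a bulkhead semaphore; the 50-call bound, 100 ms shed window, and 2 s timeout are illustrative values, and the proxy delegate stands in for the real moderation client:

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

class ModerationGateway
{
    // Bound how many calls can be outstanding against the external service;
    // anything beyond that is shed immediately instead of piling up blocked threads.
    private static readonly SemaphoreSlim Bulkhead = new SemaphoreSlim(50);

    public static async Task<bool?> TryModerateAsync(
        Func<CancellationToken, Task<bool>> callExternalService)   // hypothetical proxy delegate
    {
        // Load shedding: wait only briefly for a slot, then give up.
        if (!await Bulkhead.WaitAsync(TimeSpan.FromMilliseconds(100)))
            return null;                                           // caller decides: defer or use a default

        try
        {
            // Bound the non-deterministic call itself with a timeout.
            using var cts = new CancellationTokenSource(TimeSpan.FromSeconds(2));
            return await callExternalService(cts.Token);
        }
        catch (OperationCanceledException)
        {
            return null;                                           // timed out: fail fast, don't block
        }
        finally
        {
            Bulkhead.Release();
        }
    }
}
```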
Large website, asynchronous document processing • Rich clients (mobile and desktop) publishing documents for processing • Using Shared Access Signature (SAS) tokens for direct writes to storage • Looks like a good design…
Large website, asynchronous document processing • Storage account URI is “hard coded” into the client application • Need to update all 100k+ client applications to change storage account
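One way out, sketched below under assumptions: a hypothetical token endpoint on the service returns a complete, short-lived SAS URI (account, container, and permissions included), so the client never hard-codes a storage account. The upload side uses the current Azure.Storage.Blobs SDK purely for illustration.

```csharp
using System;
using System.IO;
using System.Net.Http;
using System.Threading.Tasks;
using Azure.Storage.Blobs;

class DocumentUploader
{
    private static readonly HttpClient Http = new HttpClient();

    public static async Task UploadAsync(Stream document)
    {
        // Hypothetical token endpoint: the service decides which storage account and container
        // to use and returns a short-lived, write-only SAS URI, so nothing is baked into the client.
        string sasUri = await Http.GetStringAsync("https://api.example.com/upload-token");

        // Write the document directly to blob storage using the SAS URI.
        var blob = new BlobClient(new Uri(sasUri));
        await blob.UploadAsync(document, overwrite: true);
    }
}
```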
Exploration – Data Design • Optimize for the most stringent case • Simplicity is king • No one true solution • Devices and Services workload – connected embedded devices and applications streaming data to the cloud • 100k+ devices, growing 50k / month • Regional affinity (North America only)
Option 1: Relational – Considerations and Challenges • Cannot fulfill with a single database • Exceeds transactional throughput limit • Data growth will exceed practical size limits • Insert heavy workload • Pressure on transaction log • Partitioning keys? • Device ID, User account? • Partitioning approach • Bucket, range, lookup?
Option 1: Relational – Considerations and Challenges • Periodic query spike on bulk reporting • Impact to online operations (30M+ rows) • Rebalancing • Moving data between partitions / databases • Distribution of reference data (relational model) • Keeping in sync • Impact of noisy neighbors (Azure SQL DB) • Variable latency, pushback under heavy load • Cost of management (SQL IaaS) • Cost of automation for patching, maintenance
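As a concrete illustration of the partitioning-key question, a minimal bucket-style sketch that hashes Device ID across the 500 databases; the MD5 hash and the connection-string pattern are illustrative choices, not the customer's actual shard map:

```csharp
using System;
using System.Security.Cryptography;
using System.Text;

static class ShardMap
{
    // Map a device ID to one of N databases with a stable hash.
    // string.GetHashCode is not stable across processes/runtimes, so hash the bytes instead.
    public static int GetShard(string deviceId, int shardCount = 500)
    {
        using var md5 = MD5.Create();
        byte[] hash = md5.ComputeHash(Encoding.UTF8.GetBytes(deviceId));
        int bucket = BitConverter.ToInt32(hash, 0) & int.MaxValue;   // force non-negative
        return bucket % shardCount;
    }

    // Hypothetical: resolve the shard number to a connection string from configuration.
    public static string GetServerName(string deviceId) =>
        $"shard{GetShard(deviceId):D3}.database.windows.net";
}
```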
Tackling the Insert Challenge • Inserting large volumes of streaming data into a data store • Data store is governed on number of operations (transactions) • Trade durability for throughput – enqueue, batch and publish • Get: increased throughput, shift work to a “cheap” resource (app memory) • Give up: full durability (potential data loss)
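A minimal sketch of the enqueue/batch/publish pattern described above; the Reading record, batch size, and flush interval are hypothetical, and the write delegate stands in for the governed data store:

```csharp
using System;
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.Threading.Tasks;

public record Reading(string DeviceId, DateTimeOffset Timestamp, double Value);

public class BatchingWriter
{
    private readonly ConcurrentQueue<Reading> _buffer = new();
    private readonly Func<IReadOnlyList<Reading>, Task> _writeBatchAsync;
    private readonly int _maxBatchSize;
    private readonly TimeSpan _flushInterval;

    public BatchingWriter(Func<IReadOnlyList<Reading>, Task> writeBatchAsync,
                          int maxBatchSize = 100, TimeSpan? flushInterval = null)
    {
        _writeBatchAsync = writeBatchAsync;
        _maxBatchSize = maxBatchSize;
        _flushInterval = flushInterval ?? TimeSpan.FromMilliseconds(500);
        _ = Task.Run(FlushLoopAsync);   // background publisher
    }

    // Callers trade durability for throughput: the reading lives only in memory until the next flush.
    public void Enqueue(Reading reading) => _buffer.Enqueue(reading);

    private async Task FlushLoopAsync()
    {
        while (true)
        {
            await Task.Delay(_flushInterval);
            var batch = new List<Reading>(_maxBatchSize);
            while (batch.Count < _maxBatchSize && _buffer.TryDequeue(out var reading))
                batch.Add(reading);
            if (batch.Count > 0)
                await _writeBatchAsync(batch);   // one governed operation for many readings
        }
    }
}
```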
Tackling the Insight Challenge • Challenge: know that your site is having issues before Twitter does • This is not a randomly chosen anecdote. • Instrument, collect, analyze – react • Best: buy your way to victory (AppDynamics, New Relic, etc.) • Also need to instrument the application effectively for “contextual” data (aka logging)
Instrumenting Applications • Instrument for production logging • If you didn’t log & capture it, it didn’t happen • Implement inter-service monitoring and alerting • Nothing interesting happens on a single instance • Run-time configurable logging • Enable activation (capture or delivery) of additional channels at run-time • Getting logging right • All logging must be asynchronous • Buffer and filter before pushing to remote service or store
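A minimal sketch of asynchronous, run-time filtered logging using System.Threading.Channels; the LogEvent shape, level filter, buffer size, and collector delegate are all hypothetical:

```csharp
using System;
using System.Threading.Channels;
using System.Threading.Tasks;

public record LogEvent(int Level, string Message);

public static class AsyncLog
{
    // Run-time configurable: raise or lower verbosity without a redeploy.
    public static int CurrentLevel = 2;

    // Bounded buffer so logging can never block or exhaust memory on the request path;
    // under pressure the oldest events are dropped rather than stalling callers.
    private static readonly Channel<LogEvent> Buffer =
        Channel.CreateBounded<LogEvent>(new BoundedChannelOptions(10_000)
        {
            FullMode = BoundedChannelFullMode.DropOldest
        });

    public static void Write(LogEvent evt)
    {
        if (evt.Level < CurrentLevel) return;   // filter before buffering
        Buffer.Writer.TryWrite(evt);            // asynchronous: never awaits on the hot path
    }

    // Background drain pushes events to the remote store or collector off the hot path.
    public static async Task DrainAsync(Func<LogEvent, Task> sendToCollectorAsync)
    {
        await foreach (var evt in Buffer.Reader.ReadAllAsync())
            await sendToCollectorAsync(evt);
    }
}
```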
Option 2: Compositional Azure Storage • This isn’t a relational workload – per-device insert and lookup, plus a periodic batch transfer • Per-device insert and lookup – natural fit for table storage: Device ID = Pk, Data type = Rk; querying by device is either by time (direct {Pk, Rk} lookup) or by day (direct {Pk} scan, max of 2880 records per partition) • Periodic batch transfer – natural fit for blob storage: Instance + Timestamp = blob id; buffer and write into blocks, roll over on a time interval (10 min); batch transfer by time frame is a parallel download of all blobs matching the timeframe pattern • Adding scale capacity – 20k operations per storage account
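A minimal sketch of that key scheme – Device ID as PartitionKey, data type (plus time) as RowKey, and instance-plus-timestamp blob names rolled on a 10-minute interval – using the current Azure.Data.Tables entity type for illustration; the property names and formats are assumptions:

```csharp
using System;
using Azure.Data.Tables;

static class TelemetryKeys
{
    // Per-device lookup in table storage: Device ID = PartitionKey, data type + time = RowKey,
    // so "by time" is a direct {Pk, Rk} point read and "by day" is a single-partition scan.
    public static TableEntity ToEntity(string deviceId, string dataType,
                                       DateTimeOffset timestamp, double value) =>
        new TableEntity(partitionKey: deviceId, rowKey: $"{dataType}_{timestamp:yyyyMMddHHmmss}")
        {
            ["Value"] = value
        };

    // Periodic batch transfer in blob storage: instance + timestamp = blob id, rolled over on a
    // 10-minute interval so a timeframe maps to a predictable, parallel-downloadable set of names.
    public static string ToBlobName(string instanceId, DateTimeOffset timestamp)
    {
        var rounded = new DateTimeOffset(
            timestamp.Year, timestamp.Month, timestamp.Day,
            timestamp.Hour, timestamp.Minute / 10 * 10, 0, timestamp.Offset);
        return $"{instanceId}/{rounded:yyyyMMdd-HHmm}.json";
    }
}
```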
User-centric web application • Services site for mobile device applications • 1M+ users at launch, 1M+ users added per month • Front-ended by Android, iOS, and Windows Phone clients • Personalized information feeds and data sets • Examples: browsing history, shopping cart • Assuming up to 30% of the user base can be online at any point in time • Maximum response latency 250 ms @ 99th percentile
Tearing apart the architecture • Where are the scalability bottlenecks? • Where are the availability and failure points? • Where are the key insight and instrumentation points?
Recap • Know the numbers – platform scalability targets • Compute, storage, networking and platform services • Scalability == capacity * efficiency • Watch out for shared resources and contention points • At high load and concurrency “interesting” things happen • Default to asynchronous, bound all calls • Insight is power – measuring and observation of behavior • Without rich telemetry and instrumentation – down to the call level – apps are running blind • Buy your way to victory, leverage asynchronous and structured logging
Resources • Failsafe: Building scalable, resilient cloud services • http://channel9.msdn.com/Series/FailSafe • Cloud Service Fundamentals - Reference code for Azure • http://code.msdn.microsoft.com/Cloud-Service-Fundamentals-4ca72649