Mark Simms (@mabsimms), Principal Program Manager, Windows Azure Customer Advisory Team. Resilient Cloud Applications.
Session Objectives
• Designing resilient large-scale services requires careful design and architecture choices
• This session explores key patterns and practices for highly available cloud services, illustrated with customer examples
• Interactivity rocks: please ask questions throughout!
Setting the stage
• Scalability
• Availability
• Insight
Setting the stage
• Maximize service availability for consumers: ensure customers (and client devices) can access and use the service
• Minimize the impact of failure on consumers: degrade gracefully, isolate faults, fall back to alternate delivery paths
• Maximize performance and capacity: services that are “live” but cannot handle the desired/required demand are not available
Musings on application design
• Traditional web service design (N-tier)
• Make “everything stateless”
• Separate logic from data (state)
• Leverage specialized external state services: cache, load balancer, relational database, document database, key/value store, etc.
Musings on application design
• No service is an island
• Dependencies on other internal and external services
• Trading time-to-market and agility for control
What’s in a workload?
• #1: Without the relational database, the application cannot fulfill any workload
• #2: The relational database is an external service, subject to partial availability
Decompose by Workload
• Applications are composed of one or more workloads; products like SharePoint and Windows Server are designed with this principle in mind
• Each workload has different profiles, requirements and boundaries: management, availability, operational, cost, health, security, capacity, etc.
• Decomposition allows for workload-specific optimization: technology selections, scalability and availability approaches, etc.
What are the “9”s?
Study the Windows Azure Platform SLAs:
• Compute External Connectivity: 99.95% (2 or more instances)
• Compute Instance Availability: 99.9% (2 or more instances)
• Storage Availability: 99.9%
• SQL Azure Availability: 99.9%
The Truth About 9s
[Diagram: composite services built on the Contoso, Fabrikam, Duwamish, TailSpin and Northwind APIs. One component carries a 99.95% SLA and six carry a 99.99% SLA; the composite SLA is the product (*) of the component SLAs.]
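A rough worked example of the composite arithmetic, using the SLA figures from the slide. It assumes the dependencies fail independently and that the composite needs every one of them; both are simplifying assumptions, and the function names are illustrative only.

```python
import math


def composite_availability(slas_percent):
    """Multiply component availabilities (given in percent) into one composite figure."""
    return 100 * math.prod(sla / 100 for sla in slas_percent)


def downtime_minutes_per_30_days(sla_percent):
    """Downtime budget implied by an availability figure over a 30-day month."""
    minutes_in_30_days = 30 * 24 * 60
    return (1 - sla_percent / 100) * minutes_in_30_days


# One 99.95% dependency and six 99.99% dependencies, as on the slide.
components = [99.95] + [99.99] * 6
composite = composite_availability(components)
print(f"composite availability: {composite:.3f}%")
print(f"downtime budget: {downtime_minutes_per_30_days(composite):.1f} min per 30 days")
```

Under these assumptions the composite lands at roughly 99.89%, which is lower than any single component's SLA.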
Define Your SLAs
[Diagram: Sports API, decomposed by workload]
• Sports API: 99.99%, all the time
• Live Scores + Commentary: 100% during games, 0% when no game
• Team, Player, League Stats: 99%, all the time
Design for Failure
• Given enough scale, time and pressure, all components or services will fail
• Your application will experience 1..N failures. How will your application behave?
  • Gracefully handle failure modes, continue to deliver value
  • Not so gracefully …
• Fault types:
  • Transient: temporary service interruptions, self-healing
  • Enduring: require intervention
Failure Scope
• Region: regions may become unavailable (connectivity issues, acts of nature)
• Service: entire services may fail (service dependencies, internal and external; configuration and code issues)
• Node: individual nodes may fail (connectivity issues such as transient failures, hardware failures)
Handling Transient and Enduring Failures
• Use fault-handling frameworks that recognize transient errors
• Make it part of the background “noise”
• Appropriate retry and backoff policies
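A minimal sketch of a retry-and-backoff policy for transient errors. `TransientError` and `retry_with_backoff` are hypothetical names, not part of any fault-handling framework; a real framework would classify transient errors itself. The backoff is exponential with random jitter, and anything not marked transient is raised immediately.

```python
import random
import time


class TransientError(Exception):
    """Hypothetical marker for errors that are expected to heal on their own."""


def retry_with_backoff(operation, max_attempts=4, base_delay_s=0.1,
                       max_delay_s=2.0, sleep=time.sleep):
    """Call operation(), retrying only on TransientError with capped exponential backoff."""
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except TransientError:
            if attempt == max_attempts:
                raise  # retries exhausted: treat as an enduring failure
            # Delay ceiling doubles per attempt; jitter spreads out retrying clients.
            ceiling = min(max_delay_s, base_delay_s * 2 ** (attempt - 1))
            sleep(random.uniform(0, ceiling))


if __name__ == "__main__":
    attempts = {"count": 0}

    def flaky_query():
        attempts["count"] += 1
        if attempts["count"] < 3:
            raise TransientError("connection reset")
        return "rows"

    print(retry_with_backoff(flaky_query), "after", attempts["count"], "attempts")
```

The `sleep` parameter is injectable only so the policy can be exercised without real waiting.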
Handling Transient and Enduring Failures
• At some point, your request is blocking the line
• Fail gracefully, and get out of the queue!
• Anti-patterns:
  • Too much trust in downstream services and client proxies
  • Not bounding non-deterministic calls
  • Blocking synchronous operations
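One way to bound a non-deterministic call is to run it on a small worker pool and wait for at most a fixed deadline. This is a sketch under assumptions: `call_with_deadline` and `slow_dependency` are hypothetical names, and in Python the timed-out work keeps running on its worker thread; only the caller stops waiting.

```python
import concurrent.futures
import time

# A small, fixed pool caps how many dependency calls can be in flight at once.
_pool = concurrent.futures.ThreadPoolExecutor(max_workers=4)


def call_with_deadline(operation, timeout_s, fallback):
    """Run operation() but wait at most timeout_s seconds before returning fallback."""
    future = _pool.submit(operation)
    try:
        return future.result(timeout=timeout_s)
    except concurrent.futures.TimeoutError:
        # cancel() only helps if the call has not started yet;
        # a call that is already running is not interrupted.
        future.cancel()
        return fallback


def slow_dependency():
    """Hypothetical downstream call that takes too long."""
    time.sleep(0.3)
    return "late answer"


if __name__ == "__main__":
    print(call_with_deadline(lambda: "fresh", 1.0, "stale"))
    print(call_with_deadline(slow_dependency, 0.05, "stale"))
```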
Circuit Breaker at Netflix
Criteria that count toward the error rate threshold, which switches the circuit on or off:
• A request to a remote service times out
• The thread pool and bounded task queue used to interact with a service dependency are at 100%
• The client library used to interact with a service dependency throws an exception
Circuit Breaker at Netflix - Fallbacks
• Custom fallback: the client library can provide an invokable callback method. Can also use locally available data on the API server (cookie or cache) to generate a fallback response
• Fail silent: return a null value. Useful if the data is optional
• Fail fast: when data is required and there’s no good fallback. Negative UX impact, but keeps the API healthy
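A simplified circuit breaker with a caller-supplied fallback, to make the fallback styles above concrete. This is not Netflix's implementation: it trips on a count of consecutive failures rather than an error rate, and the class name, parameters and defaults are all assumptions.

```python
import time


class CircuitBreaker:
    """Trips after N consecutive failures, then short-circuits calls for a cool-down period."""

    def __init__(self, failure_threshold=3, reset_after_s=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_after_s = reset_after_s
        self.clock = clock
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed (calls flow through)

    def call(self, operation, fallback):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after_s:
                return fallback()  # circuit open: skip the dependency entirely
            # Cool-down elapsed: allow one trial call; a single failure re-opens the circuit.
            self.opened_at = None
            self.failures = self.failure_threshold - 1
        try:
            result = operation()
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = self.clock()
            return fallback()
        self.failures = 0
        return result


if __name__ == "__main__":
    breaker = CircuitBreaker(failure_threshold=2)

    def ratings_service():
        raise TimeoutError("remote service timed out")

    # Custom fallback: answer from locally available data.
    print(breaker.call(ratings_service, lambda: {"rating": "cached"}))
    # Fail silent: return None when the data is optional.
    print(breaker.call(ratings_service, lambda: None))
```

For fail fast, the fallback can raise an exception instead of returning a value.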
Deployment Redundancy
• Within a data center
• Across data centers
• Across on-premises and cloud
• Across cloud providers
• Traffic management
Failure Points
Definition: design elements that can cause an outage. Focus on identifying design elements that are subject to external change. For example:
• Database connection
• Website connection
• Configuration file
• Registry key
Categories of common failure points:
• ACLs, database access, external website/service access, transactions, configuration, capacity, network
Failure Modes
Definition: a predictable root cause of the outage that occurs at a failure point.
Examples of failure modes:
• Configuration file is not in the correct location
• Too much traffic overusing resources
• Database reaches maximum capacity
The following would not be considered failure modes:
• Product bugs
• Symptoms of problems
• Informational occurrences
Failure Mode Example
Potential failure points:
• Database server
• Database
• Table
• Configuration file

    public int GetBusinessData(string[] parameters)
    {
        try
        {
            // Failure point: configuration file (missing or invalid)
            var config = Config.Open(_configPath);
            // Failure point: database server / database (not responding, offline, access denied, timeout)
            var conn = ConnectToDB(config.ConnectString);
            // Failure point: sproc / table (execute denied, table missing or corrupt)
            var data = conn.GetData(_sproc, parameters);
            return data;
        }
        catch (Exception e)
        {
            // Every failure mode above is logged as the same event (ID 100) before rethrowing
            WriteEventLogEvent(100, E_ExceptionInDal);
            throw;
        }
    }

Potential failure modes:
• DB server not responding
• DB offline
• DB access denied
• Sproc execute denied
• DB doesn’t exist
• DB timeout on connect
• Index corrupt
• Database corrupt
• Table doesn’t exist
• Table corrupt
• Config file missing or invalid
Capturing Insight
• Log all internal/external “transactions” (database, web services, etc.) with:
  • Application context (module/component)
  • Host context (server/role/instance/process)
  • Timing information (start/stop/duration)
  • Activity identifier
• Consolidate logs to a central system / dashboard for health monitoring and troubleshooting
Capturing Insight
• Capture timing and context information through helper delegates (background noise)
• Capture contextual errors (inner exceptions, etc.) on error
• Logging library is asynchronous (fire-and-forget) to avoid blocking
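A sketch of such a helper, using a context manager in place of a helper delegate. `logged_transaction` and the field layout are assumptions rather than the speaker's library. Python's standard `QueueHandler` and `QueueListener` stand in for the fire-and-forget logging path: the caller only enqueues a record, and a background thread writes it out.

```python
import logging
import logging.handlers
import os
import queue
import socket
import time
import uuid
from contextlib import contextmanager

# The caller's thread only puts records on the queue; the listener thread does the I/O.
_log_queue = queue.Queue()
logger = logging.getLogger("insight")
logger.setLevel(logging.INFO)
logger.addHandler(logging.handlers.QueueHandler(_log_queue))
listener = logging.handlers.QueueListener(_log_queue, logging.StreamHandler())
listener.start()


@contextmanager
def logged_transaction(component, operation, activity_id=None):
    """Time a database or web service call and log context, duration and outcome."""
    activity_id = activity_id or str(uuid.uuid4())
    start = time.perf_counter()
    outcome = "ok"
    try:
        yield activity_id
    except Exception as exc:
        outcome = f"error: {exc!r}"  # keep the error detail with the timing record
        raise
    finally:
        duration_ms = (time.perf_counter() - start) * 1000
        logger.info(
            "activity=%s host=%s pid=%d component=%s operation=%s duration_ms=%.1f outcome=%s",
            activity_id, socket.gethostname(), os.getpid(),
            component, operation, duration_ms, outcome,
        )


if __name__ == "__main__":
    with logged_transaction("orders", "db.query"):
        time.sleep(0.01)  # stand-in for the real call
    listener.stop()  # flush queued records before exit
```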
Many Options Windows Azure Diagnostics
Designing for Insight
• Instrument for production logging: if you didn’t capture it, it didn’t happen
• Implement inter-service monitoring and alerting: capture and quantify inter-service behavior and activity
• Run-time configurable logging: enable activation (capture or delivery) of additional channels at run-time
Updating Configuration
• For a production service, configuration == code
• Need a rigorous ALM process for rolling out (and rolling back) updates to both
Updating Services
• “We want global, simultaneous production rollouts of our new code.” Are you sure about that?
• Production rollouts:
  • Running N and N+1 concurrently
  • Rolling load over to N+1, with the ability to fall back
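One possible way to roll load from version N to N+1 while both run side by side is to route a configurable percentage of users to the new version. This is a sketch under assumptions: the hashing scheme and the names `bucket_for` and `pick_version` are illustrative, not something the talk prescribes.

```python
import hashlib


def bucket_for(user_id):
    """Map a user to a stable bucket in the range 0..99."""
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % 100


def pick_version(user_id, percent_on_new):
    """Send roughly percent_on_new percent of users to N+1 and the rest to N."""
    return "N+1" if bucket_for(user_id) < percent_on_new else "N"


if __name__ == "__main__":
    users = [f"user-{i}" for i in range(1000)]
    for percent in (0, 10, 50, 100):
        on_new = sum(pick_version(u, percent) == "N+1" for u in users)
        print(f"{percent:>3}% rollout: {on_new} of {len(users)} users on N+1")
```

Because each user's bucket is stable, raising the percentage only moves additional users to N+1, and setting it back to 0 is the fallback path.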
What is a health model?
Managed Entity
• Logical piece of an application
• A component that makes sense to an operator
• Each entity has a health state
• Entities can be external or internal
• Multiple instances of an entity may exist
Aspect
• Break down health state by functional team
• Must be mutually exclusive
• Group by organizational responsibility, e.g. security, performance, backup
• May be specific or non-technology, e.g. orders shipped
Operational Condition
• Defines the level of operation currently available
• Normal state is fully functional
• Well-designed applications may support partial operation, e.g. read only
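A small data-model sketch of these three concepts. The `DOWN` state and the worst-of roll-up rule are assumptions added for illustration; the slide only names the normal and partial conditions and does not say how health is aggregated.

```python
from dataclasses import dataclass, field
from enum import IntEnum


class Condition(IntEnum):
    """Operational condition, ordered from best to worst."""
    NORMAL = 0   # fully functional
    PARTIAL = 1  # degraded, e.g. read only
    DOWN = 2     # assumed extra state, not named on the slide


@dataclass
class ManagedEntity:
    """A logical piece of the application that makes sense to an operator."""
    name: str
    aspects: dict = field(default_factory=dict)  # aspect name -> Condition

    def health(self):
        # Assumed roll-up rule: an entity is as healthy as its worst aspect.
        return max(self.aspects.values(), default=Condition.NORMAL)


def rollup(entities):
    """Assumed roll-up rule: overall health is the worst entity health."""
    return max((entity.health() for entity in entities), default=Condition.NORMAL)


if __name__ == "__main__":
    orders_db = ManagedEntity("orders-db", {"security": Condition.NORMAL,
                                            "performance": Condition.PARTIAL})
    web_front_end = ManagedEntity("web-front-end", {"performance": Condition.NORMAL})
    print(orders_db.health().name, rollup([orders_db, web_front_end]).name)
```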
Troubleshooting Workflow
• Detection: is there a problem?
• Classification: what’s not working, and how bad is it?
• Diagnosis: why is there a problem?
• Recovery: what needs to be done to fix it?
• Verification: is the problem really gone?
Resources
• Failsafe: Guidance for Resilient Cloud Architectures (http://msdn.microsoft.com/en-us/library/jj853352.aspx)
• Best Practices for the Design of Large-Scale Services on Windows Azure Cloud Services (http://msdn.microsoft.com/en-us/library/windowsazure/jj717232.aspx)
• Designing and Deploying Internet Scale Services (https://www.usenix.org/events/lisa07/tech/full_papers/hamilton/hamilton.pdf)
Scale: Unit of Scale
• Workloads: messaging, collaboration, productivity
• Resources (*): 4 x web servers (8 CPU), 100 GB database, 10 GB blob storage
• Demands: 10K active users, 1K concurrent users, <2 second response time
(*) Other details such as operational demand, resources and workloads omitted for simplicity
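A back-of-the-envelope sketch of capacity planning in whole scale units, using the resource and demand figures above. The sizing rule (take whichever demand dimension needs more units) and the names `ScaleUnit`, `units_needed` and `resources_for` are assumptions for illustration.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class ScaleUnit:
    """One unit of scale: the resources it contains and the demand it can carry."""
    web_servers: int = 4
    database_gb: int = 100
    blob_storage_gb: int = 10
    active_users: int = 10_000
    concurrent_users: int = 1_000


def units_needed(active_users, concurrent_users, unit=ScaleUnit()):
    """Whole units required so that neither demand dimension exceeds capacity."""
    return max(
        math.ceil(active_users / unit.active_users),
        math.ceil(concurrent_users / unit.concurrent_users),
        1,
    )


def resources_for(units, unit=ScaleUnit()):
    """Total resources when deploying the given number of units."""
    return {
        "web_servers": units * unit.web_servers,
        "database_gb": units * unit.database_gb,
        "blob_storage_gb": units * unit.blob_storage_gb,
    }


if __name__ == "__main__":
    units = units_needed(active_users=400_000, concurrent_users=25_000)
    print(units, resources_for(units))
```

For 400,000 active users and 25,000 concurrent users, this rule gives 40 units.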
Scale by Units
[Chart: demand and resources over time, with levels of 100K and 400K marked on the vertical axis]
Example
[Chart: demand for Workload 1 and Workload 2 by month, January to December, annotated with bottom, ramp and peak phases]
Data Partitioning
• Decomposition and partitioning
• Understanding the 3 Vs
• Horizontal partitioning
• Vertical partitioning
• Hybrid partitioning
Understanding the 3 Vs
• Volume: how large is the data today?
• Velocity: how fast is it growing?
• Variety: what type(s) of data are involved?
Understanding Queryability
• What? What types of queries are done, and what data set(s) and transformations are required to deliver them?
• When? How often must the data be queried? In real time, or once a day, month, quarter, or year?