Lessons Learned: Building Scalable Applications with the Windows Azure Platform

SVC32. Lessons Learned: Building Scalable Applications with the Windows Azure Platform. Simon Davies, Windows Azure TSP, Microsoft Corporation. Agenda: objectives of this session; thoughts on scalability in the cloud; real-world lessons learned (Thuzi, RiskMetrics); summary.

Presentation Transcript


  1. SVC32 Lessons Learned: Building Scalable Applications with the Windows Azure Platform Simon Davies Windows Azure TSP Microsoft Corporation

  2. Agenda • Objectives of this session • Thoughts on scalability in the cloud • Real World Lessons Learned • Thuzi • RiskMetrics • Summary • Questions and Answers

  3. Scalability in the Cloud • Scalability == work / resources • Windows Azure makes adding AND REMOVING resources dynamic • This, along with the business model, changes things • Capacity planning becomes dynamic • Utilisation levels are important • Definition of scale is different depending on application type and workload arrival characteristics
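
To make the work/resources definition concrete, here is a minimal Python sketch (not Azure SDK code) that estimates how many instances a workload needs at a target utilisation. The function names, the single-core-per-instance assumption, and the default 70% target are illustrative assumptions, not figures from the talk.

```python
import math

def instances_needed(requests_per_sec: float, cpu_sec_per_request: float,
                     target_utilisation: float = 0.7) -> int:
    """Estimate the instance count that keeps each instance near the target utilisation.

    Offered load (CPU-seconds needed per second) divided by the usable
    fraction of one instance, rounded up. Assumes one core per instance.
    """
    if not 0 < target_utilisation <= 1:
        raise ValueError("target_utilisation must be in (0, 1]")
    offered_load = requests_per_sec * cpu_sec_per_request
    return max(1, math.ceil(offered_load / target_utilisation))

def utilisation(requests_per_sec: float, cpu_sec_per_request: float,
                instances: int) -> float:
    """Fraction of provisioned capacity that is doing useful work."""
    return requests_per_sec * cpu_sec_per_request / instances
```

Because instances can be removed as well as added, the same calculation can be re-run as the arrival rate changes, which is what makes capacity planning dynamic.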

  4. Scaling Facebook Apps in the Azure Cloud partner Jim Zimmerman CTO / Lead Developer Thuzi.com

  5. Who is Thuzi? • We develop customized viral marketing solutions, utilizing a variety of technologies that engage users and measure results. • We ensure maximum scalability through exploiting the latest virtual computing by using Microsoft's Azure Platform and Tools

  6. Facebook Viral Application Needs • Support for thousands of users virtually overnight … our models predicted geometric adoption • The success of one of our clients could not be the failure for others … requirement for distinct computing environments for each Thuzi customer • Our job is to turn social media data into real business information … must have a robust back end for reporting detailed analytics

  7. Facebook Viral Application Needs • Thuzi builds cool social media web apps and we don’t know much about running data centers … besides we didn’t want to purchase extra servers “just in case” • A consistent user experience was mandatory … social media users don’t like to wait

  8. Hosting Options • Our own data center – Too expensive and, with unpredictable growth, hard to plan for • Google – Didn’t have a familiar programming environment • Amazon – Could use Windows VMs, but did not have as many features as we wanted • Azure – Familiar Microsoft technologies

  9. Technology

  10. Outback DEMO

  11. The Results

  12. Fan Growth over Time

  13. Lessons Learned • Trace everything! • Errors, Debug Info • You will upgrade later as you start to ask questions about how your app is behaving • Track Perf Counters • CPU usage, Req/sec, memory usage • Use Worker roles to move data from Queues to table storage and SQL Azure • SQL is easier to report on • Table storage allows more scalability • Deployment • Upgrade Manually • When moving to production, use the VIP Swap feature
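
The "worker role moves data from queues to table storage and SQL Azure" lesson can be sketched as a simple drain loop. This is a minimal Python illustration with in-memory stand-ins: `queue.Queue` plays the Azure Queue, a dict plays table storage, and a list of tuples plays a SQL table. The message fields (`user_id`, `event_id`, `action`) are hypothetical.

```python
import queue

def drain_queue(source: queue.Queue, table_store: dict, sql_rows: list,
                max_messages: int = 32) -> int:
    """Move up to max_messages from the queue into both stores; return the count moved."""
    moved = 0
    while moved < max_messages:
        try:
            msg = source.get_nowait()
        except queue.Empty:
            break  # nothing left to move in this pass
        # Table storage stand-in: keyed by (partition, row) for scalable lookups.
        table_store[(msg["user_id"], msg["event_id"])] = msg
        # SQL stand-in: flat rows that are easy to aggregate for reports.
        sql_rows.append((msg["user_id"], msg["event_id"], msg["action"]))
        moved += 1
    return moved
```

Writing to both stores reflects the trade-off on the slide: the relational copy is easier to report on, while the key-value copy scales further.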

  14. Tracing • config.DiagnosticInfrastructureLogs.ScheduledTransferLogLevelFilter = LogLevel.Error; • config.DiagnosticInfrastructureLogs.ScheduledTransferPeriod = TimeSpan.FromMinutes(5); • config.Logs.ScheduledTransferLogLevelFilter = LogLevel.Error; • config.Logs.ScheduledTransferPeriod = TimeSpan.FromMinutes(5);

  15. Performance Monitoring • var cpuUsage = new PerformanceCounterConfiguration(); • cpuUsage.CounterSpecifier = @"\Processor(_Total)\% Processor Time"; • cpuUsage.SampleRate = TimeSpan.FromSeconds(5); • var pccMemory = new PerformanceCounterConfiguration(); • pccMemory.CounterSpecifier = @"\Memory\Available MBytes"; • pccMemory.SampleRate = TimeSpan.FromSeconds(5); • var requestsPerSec = new PerformanceCounterConfiguration(); • requestsPerSec.CounterSpecifier = @"\ASP.NET Applications(__Total__)\Requests/Sec"; • requestsPerSec.SampleRate = TimeSpan.FromSeconds(5); • config.PerformanceCounters.DataSources.Add(cpuUsage); • config.PerformanceCounters.DataSources.Add(pccMemory); • config.PerformanceCounters.DataSources.Add(requestsPerSec); • config.PerformanceCounters.ScheduledTransferPeriod = TimeSpan.FromMinutes(5);

  16. Deployment • Upload new package to staging • Wait for all roles to be ready • Use VIP Swap to upgrade and deploy to production • Rewrites the load balancer to swap staging with production • If anything is wrong, you can swap back
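
The staging/production swap described above can be modelled as exchanging two slot assignments, which is why swapping back is cheap. The following is a toy Python model only; the class and method names are hypothetical and do not correspond to any Azure management API.

```python
class HostedService:
    """Toy model of staging and production slots behind one virtual IP."""

    def __init__(self, production: str):
        self.slots = {"production": production, "staging": None}

    def upload_to_staging(self, package: str) -> None:
        """Deploy a new package to the staging slot."""
        self.slots["staging"] = package

    def vip_swap(self) -> None:
        """Exchange the two slots; calling it again rolls back."""
        if self.slots["staging"] is None:
            raise RuntimeError("nothing deployed to staging")
        self.slots["production"], self.slots["staging"] = (
            self.slots["staging"], self.slots["production"])
```

After a swap the previous production build sits in staging, so a second swap restores it if anything is wrong.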

  17. Tools Needed • Needed to be able to manage records in table storage for testing • Needed to be able to download logs from table storage for tracing and perf counters • Azure Storage Explorer ( Codeplex ) - Free • Cloud Storage Studio - Cost • Do linq queries against table storage to get specific info when needed.

  18. In Summary • Azure provides Thuzi a competitive advantage … so please don’t tell the other social media marketing companies and let us enjoy our 15 minute advantage

  19. Building Scalable Applications using Windows Azure: RiskMetrics RiskBurst™ partner Rob Fraser and Phil Jacob RiskMetrics Group www.riskmetrics.com

  20. RiskMetrics RiskBurst™ • RiskMetrics Group • Offers industry-leading products and services in the disciplines of risk management, corporate governance and financial research & analysis • Scaling on-premise computation to the Cloud • Integration of RiskMetrics' extensive on-premise capability with Windows Azure • We are running on 2,000 instances on Windows Azure • We have plans to use 10,000+ instances in 2010 • What are RiskMetrics doing with so much computing power? • Calculation of financial risk • Simulate scenarios for the movement of market factors over time & price financial assets in those scenarios • Notoriously complex – can involve Monte Carlo² for complex asset classes of the kind that triggered the 'credit crunch' • Results in very high computational loads for RiskMetrics • Daily risk analysis load equivalent to calculating risk on 4 trillion US Stocks • Computational loads are characterised by high demand peaks • Strong growth trend in calculation complexity
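
"Simulate scenarios for market factors and price assets in those scenarios" can be illustrated with a deliberately tiny Monte Carlo sketch in Python. It is not RiskMetrics' methodology: the independent normal returns, the linear "pricing" and the empirical quantile are simplifying assumptions, and it is a single-level simulation, not the nested Monte Carlo the slide mentions.

```python
import random

def simulate_pnl(positions: dict, vols: dict, n_scenarios: int, seed: int = 0) -> list:
    """Return one portfolio profit-and-loss figure per simulated scenario."""
    rng = random.Random(seed)
    pnl = []
    for _ in range(n_scenarios):
        # One scenario: an independent normal return for each market factor.
        shocks = {name: rng.gauss(0.0, vols[name]) for name in positions}
        # "Pricing" here is linear: value change = position value * return.
        pnl.append(sum(positions[name] * shocks[name] for name in positions))
    return pnl

def value_at_risk(pnl: list, confidence: float = 0.95) -> float:
    """Loss at the given confidence level, taken as an empirical quantile."""
    losses = sorted(-x for x in pnl)
    index = min(len(losses) - 1, int(confidence * len(losses)))
    return losses[index]
```

Each scenario is priced independently of the others, which is the property that lets the pricing work be split into self-contained packets.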

  21. Peak Load Characteristics

  22. Growth trend in calculation complexity • Risk problem complexity has doubled every 6 months • Processor power doubles every 2 years (Moore’s Law) • [Chart: Maximum Complexity of Risk Analysis Processing Request, in Relative Equity Equivalent Units (log scale), plotted against Moore’s Law]
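
The two doubling rates on this slide imply a widening gap that can be worked out directly. The short Python sketch below does the arithmetic; the function names are illustrative, and it assumes both trends are pure exponentials.

```python
def growth_factor(years: float, doubling_period_years: float) -> float:
    """Multiplicative growth after `years` for a quantity with the given doubling period."""
    return 2.0 ** (years / doubling_period_years)

def compute_gap(years: float) -> float:
    """How many times more processors are needed after `years`, relative to today."""
    complexity = growth_factor(years, 0.5)     # problem complexity doubles every 6 months
    per_processor = growth_factor(years, 2.0)  # processor power doubles every 2 years
    return complexity / per_processor
```

For example, after 2 years complexity has grown 16x while a single processor has grown 2x, so `compute_gap(2.0)` is 8: the shortfall has to be met by adding resources, not by waiting for faster hardware.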

  23. Analytics Architecture: Large-Scale Data Dependent Processing vs. Distributable Work Packets • Scenario Pricing: Work Packets are self-contained • Scenario Generation and Aggregation: These services are dependent on high-speed access to large-scale data stores and caches • [Diagram components: Market and Pricing Data, Load Balancer, RiskServers, Pricers, Velocity Scenario Cache]

  24. Work Packet Example: Pricing request for a Mortgage Backed Security • Compute Time: 150ms - 30s

  25. Analytics Architecture: Integration of Cloud Resources? • Scenario Pricing: Work Packets are self-contained • Scenario Generation and Aggregation: These services are dependent on high-speed access to large-scale data stores and caches • [Diagram components: Market and Pricing Data, Load Balancer, RiskServers, Velocity Scenario Cache, and an enlarged set of Pricers]

  26. RiskBurst™ Project Timeline

  27. RiskBurst™ An architectural pattern for large scale computational applications

  28. Architectural Pattern • Building large scale computation requires careful design • Problem: Need to avoid the Von Neumann Bottleneck • Keywords: Reason and Instrument • No changes to the application • Run on-premise on HPC Server or in cloud on Azure • Pattern has end-to-end decoupling • Horizontal scaling of decoupled components • [Diagram: repeated blocks labelled Workload Generation, Messaging & Storage, and Computational Resources & Application]

  29. RiskBurst™ Workflow: Windows Azure & HPC Server • [Diagram labels: Scenario Generator; RiskBurst™ Server (Workload Receiver, Batching and Sending, Output Monitoring, Outstanding Request Timeout Sweeper); WCF Request, WCF Response, WCF Error Response; Windows Azure Input Queue(s) carrying Input Messages; Worker; Windows Azure Output Queue(s) carrying Output Messages]

  30. Windows Azure Storage Component Usage • [Diagram labels: RiskBurst Server; Input Queues (To Do Jobs), several Azure Queues; Worker Role Instances with Local storage; Input Blob Storage; Support files in Blob Storage; Data; Output Queues (Job done), several Azure Queues; Output Blob Storage]

  31. Mapping to the Azure Environment • Visual Studio 2008 Azure development SDK mimics cloud • Mix code running in dev locally, with cloud resources such as Blob storage or queues • Good for features, does not assist with scale • Existing 32-bit .NET C++/CLI application with 3 third-party libraries • Initial idea - run directly in web-role – but 32-bit(!) • Run within worker role • Preserve WCF interface – no changes whatsoever to analytics app • Only changes to existing code base are: • Retrieve Cash-flow library support files from Blob storage on demand • Some diagnostic information added

  32. Getting to Cloud Resources: Bandwidth & Latency • Problem: Bandwidth to Azure gateway limited by Internet • Solution: pass by reference & blobs • Replace pass-by-value calls with pass-by-reference • Create key for scenario • Large, repeated objects (scenarios) pushed to blob storage • WCF call contains only key • Each of 1000 scenarios, used for all assets • Problem: Communications Latency • Within data centre, 20ms latency on WCF call through HPC SOA platform • Queues and Blob storage are off-device; engineering must respect this! • Work packet : 200ms computation • Solution: batch requests within input queues • But, more simultaneous work requests (threads outstanding on input)
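
The pass-by-reference idea (push each large scenario to blob storage once, send only its key in the call) can be sketched as follows. This is a minimal Python illustration: `BlobStore` is an in-memory stand-in for blob storage, and deriving the key from a SHA-256 hash of the content is an assumption, since the slide only says "create key for scenario".

```python
import hashlib
import json

class BlobStore:
    """In-memory stand-in for blob storage: key -> bytes."""

    def __init__(self):
        self._blobs = {}
        self.uploads = 0  # counts real uploads, to show the bandwidth saving

    def put_if_absent(self, key: str, data: bytes) -> None:
        if key not in self._blobs:
            self._blobs[key] = data
            self.uploads += 1

    def get(self, key: str) -> bytes:
        return self._blobs[key]

def scenario_key(scenario: dict) -> str:
    """Content-derived key, so the same scenario always maps to the same blob."""
    payload = json.dumps(scenario, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()

def make_request(asset_id: str, scenario: dict, store: BlobStore) -> dict:
    """Upload the scenario at most once; return a small message carrying only its key."""
    key = scenario_key(scenario)
    store.put_if_absent(key, json.dumps(scenario, sort_keys=True).encode("utf-8"))
    return {"asset_id": asset_id, "scenario_key": key}

def resolve_scenario(request: dict, store: BlobStore) -> dict:
    """Worker side: fetch the scenario referenced by the request."""
    return json.loads(store.get(request["scenario_key"]))
```

With 1000 scenarios shared across all assets, each scenario crosses the Internet link once instead of once per asset.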

  33. Utilizing Cloud Resources: Generating Load

  34. Utilizing Cloud Resources: Generating Load • Problem: Generating Load for Cloud Resources • Threading architecture • Workload originally generated by synchronous calls in client • Number of outstanding pricing requests = nodes x batch size • Implies large number of threads in wait states in scenario generators • Work request made asynchronous • RiskBurst™ Server Logic • Creates a balanced workload – uses a work item’s average run time • Made calls to RiskBurst™ Server asynchronous • Incoming calls create batch entry synchronously with request • Map created from message id to wait handlers • When batch full, sent on to Azure input queue • Sweeper thread gathers up output messages and uses map to associate with wait handlers • Scales well to over 1000 simultaneous requests per RiskBurst™ Server • Horizontal scale of RiskBurst™ Servers – each creates own input queue
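
The server logic described above (batch incoming requests, keep a map from message id to wait handler, let a sweeper match output messages back to callers) can be sketched in Python with `concurrent.futures.Future` playing the wait handler. The class and method names are hypothetical, `send_batch` stands in for writing to the Azure input queue, and the sketch is single-threaded; a real server would need locking around the shared state.

```python
import itertools
from concurrent.futures import Future

class BatchingServer:
    """Toy RiskBurst-style server: batch requests, match replies by message id."""

    def __init__(self, batch_size: int, send_batch):
        self.batch_size = batch_size
        self.send_batch = send_batch   # callable taking a list of (id, payload)
        self._ids = itertools.count(1)
        self._pending = {}             # message id -> Future (the "wait handler")
        self._batch = []

    def submit(self, payload) -> Future:
        """Add a request to the current batch; send the batch on when it is full."""
        msg_id = next(self._ids)
        future = Future()
        self._pending[msg_id] = future
        self._batch.append((msg_id, payload))
        if len(self._batch) >= self.batch_size:
            self.flush()
        return future

    def flush(self) -> None:
        """Send whatever is in the current batch, if anything."""
        if self._batch:
            batch, self._batch = self._batch, []
            self.send_batch(batch)

    def sweep(self, output_messages) -> int:
        """Complete the waiting future for each (id, result) output message."""
        completed = 0
        for msg_id, result in output_messages:
            future = self._pending.pop(msg_id, None)
            if future is not None:
                future.set_result(result)
                completed += 1
        return completed
```

Callers hold a future instead of a blocked thread, which is what removes the large number of threads in wait states.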

  35. Horizontal Scaling within the Cloud • Problem: Saturation behaviour of queues • Can create situation where queues are saturated, made worse by retry logic • Complexity due to varied processing time • Controller will move busy queues to independent hardware • Use exponential back-off algorithm • Batch work items for each queue read or write (using 10 work packets per queue item) • Amortizing the cost of IO against CPU time is key • Batch compute sizes need to be big enough both to occupy the CPU for long enough and not cause the swamping of the queues • Also, more items contained in queue item -> fewer queue hits • But, larger batches imply more simultaneous outstanding connections on client side • Variable run-time of assets, from 150ms to 30 seconds • Carry out processing concurrently with queue access • Pushing IO onto background threads is critical (the writes and the deletes are independent background tasks) • On-node caching within worker role to avoid queue reads
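
Two of the techniques on this slide, exponential back-off and packing several work packets into one queue item, are sketched below in Python. The base delay, multiplier, cap and jitter parameters are illustrative assumptions; only the figure of 10 work packets per queue item comes from the slide.

```python
import random

def backoff_delays(base: float = 0.1, factor: float = 2.0, cap: float = 5.0,
                   attempts: int = 6, jitter: float = 0.0, seed=None) -> list:
    """Delays (seconds) to wait before each retry of a throttled or empty queue read."""
    rng = random.Random(seed)
    delays = []
    for attempt in range(attempts):
        delay = min(cap, base * factor ** attempt)
        # Optional jitter spreads retries so clients do not hit the queue in lockstep.
        delay += rng.uniform(0.0, jitter * delay)
        delays.append(delay)
    return delays

def batch_work_packets(packets: list, per_item: int = 10) -> list:
    """Group work packets so each queue item carries up to `per_item` packets."""
    return [packets[i:i + per_item] for i in range(0, len(packets), per_item)]
```

Batching cuts the number of queue operations by roughly the batch size, and back-off keeps retry traffic from making a saturated queue worse.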

  36. Exception Management in Distributed Applications • Keep it simple • Large distributed system implies need to engineer robustness to failure • Distinguish between events that are random and unpredictable and poison-message kind of failures • Do not over-engineer efficient handling of occasional exceptions • Return exceptions to client application • Client can track number of attempts to process a work item • Distinguish poison messages and give up • Parallel handling on HPC Server SOA platform • Complexity from varying message processing times • Time-outs can be caused by several long-running pricings in same job • Re-try time-outs by sending all pricings in batch independently
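
A minimal Python sketch of the retry policy described above: when a batch times out, resend each pricing on its own, count attempts per work item on the client, and give up on items that keep failing (poison messages). `PricingTimeout`, `process_batch` and the `submit` callable are hypothetical names, and the limit of 3 attempts is an illustrative choice.

```python
class PricingTimeout(Exception):
    """Raised by the (hypothetical) submit function when a batch times out."""

def process_batch(batch: list, submit, max_attempts: int = 3) -> tuple:
    """Return (results, poison) for a batch of work items.

    `submit(items)` prices a list of items and returns {item: result}.
    A batch-level timeout triggers independent retries; any other
    batch-level exception propagates to the client unchanged.
    """
    try:
        return submit(batch), []
    except PricingTimeout:
        pass  # fall through and resend every pricing independently
    results, poison = {}, []
    for item in batch:
        for _ in range(max_attempts):
            try:
                results.update(submit([item]))
                break
            except Exception:
                continue  # counts as one failed attempt for this item
        else:
            poison.append(item)  # all attempts failed: treat as a poison message
    return results, poison
```

Keeping the policy this simple follows the slide's advice not to over-engineer the handling of occasional exceptions.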

  37. Diagnostics and Run-time Monitoring • A challenge for large scale applications, even more so for Cloud • Logging and monitoring must be switchable so as to reduce overhead • Variable level of diagnostics and logging • Requirement to filter information through decoupled architecture (on node; centralized in Azure; returned to client) • Key data for architectural pattern • Request and result queue; successful/unsuccessful read, write and delete; time taken for all operations • Empty request queue gets • Count of successful/unsuccessful work packets • % Processor Time performance counter • Cache misses • We utilized custom built solution during TAP • Nodes broadcast over service bus • Clients subscribe to trace messages • New diagnostic & monitoring package provides platform support
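
The key data listed above (successful and unsuccessful queue reads, writes and deletes, plus the time taken) can be collected with a small switchable recorder. This Python sketch is an illustration only, not the custom TAP solution or the Azure diagnostics package; the class name and operation labels are assumptions.

```python
import time
from collections import Counter

class QueueMetrics:
    """Counts outcomes and accumulates time per operation; can be switched off."""

    def __init__(self, enabled: bool = True):
        self.enabled = enabled
        self.counts = Counter()    # (operation, "ok" | "failed") -> count
        self.seconds = Counter()   # operation -> total elapsed seconds

    def record(self, operation: str, ok: bool, elapsed: float) -> None:
        if not self.enabled:
            return  # minimal overhead when diagnostics are switched off
        self.counts[(operation, "ok" if ok else "failed")] += 1
        self.seconds[operation] += elapsed

    def timed(self, operation: str, func, *args):
        """Run func(*args), recording success or failure and the elapsed time."""
        start = time.perf_counter()
        try:
            result = func(*args)
        except Exception:
            self.record(operation, False, time.perf_counter() - start)
            raise
        self.record(operation, True, time.perf_counter() - start)
        return result
```

Wrapping each queue read, write and delete in `timed` yields the per-operation counts and timings that the pattern treats as key data.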

  38. Final Comments Integrating on-premise and cloud applications

  39. Production Services across On-Premise and Cloud • Operational Integration • Fully integrate Windows Azure capabilities with RiskMetrics Operational Infrastructure • Provisioning plus diagnostic & monitoring packages • “Outside-In” Services • Control and visibility of the services on the cloud consistent with on-premise services. • Resource View • Nodes • Queues • Blob Stores • Process View • Throughput & Performance • Traceability • Problem identification • Process linkage (intra- & inter-cloud) • Binding SLA Commitments • Operational Support Escalation

  40. RiskBurst™ on Windows Azure • Effective architectural pattern delivers key business benefits • Elastic scaling • Enhanced services • Empowered innovation • High reliability • Improved agility • Windows Azure was an obvious choice of cloud platform • Minimize impedance mismatch between on-premise and off-premise • .NET/WCF/HPC SOA in data center extended to cloud • Configure to run in either environment • Familiar development environment • Massive scalability • View of Azure as extension of OS into Cloud • Undertake work with HPC Server Team in 2010 • Ability to target either Azure-hosting WCF services or HPC Server hosted WCF services in a seamless manner • Synchronization of on-premise Velocity instance with Azure instance

  41. Acknowledgements • Prototype Development: • Stuart Hartley (University of York, UK) • Simon Davies (TAP Programme) • Production Development Team: • Rich Bower (Team Lead) • Kelly Crawford (RiskBurst Server/Client) • Simon Davies (TAP Programme) • Jonathan Blair (Microsoft Consulting) • Supporting Cast: • Alistair Beagley (DPE / Azure) • Patrick Butler Monterde (TAP Programme) • Azure Product Group (Hoi Vo, Brad Calder, Tom Fahrig, Joe Chau) • Hunter Cadzow & Analytics Development at RiskMetrics • Tom Stockdale (RiskMetrics CTO) • www.riskmetrics.com

  42. More Information • SVC16 Developing Advanced Applications with Windows Azure • SVC09 Windows Azure Tables and Queues Deep Dive • SVC14 Windows Azure Blobs and Drives Deep Dive • SVC08 Patterns for Building Scalable and Reliable Windows Azure Applications • Windows Azure Platform lounge

  43. YOUR FEEDBACK IS IMPORTANT TO US! Please fill out session evaluation forms online at MicrosoftPDC.com

  44. Learn More On Channel 9 • Expand your PDC experience through Channel 9 • Explore videos, hands-on labs, sample code and demos through the new Channel 9 training courses channel9.msdn.com/learn Built by Developers for Developers….