
Building Resilient Web Services Benjamin Ravani




Presentation Transcript


  1. Building Resilient Web Services. Benjamin Ravani, General Manager, Global Foundation Services, Microsoft. April 29, 2008

  2. Agenda & Objectives • Web services operations • Size and scale • Challenges and opportunities • Case studies and best practices • Closing thoughts • Realizing opportunities during the design phase • Operationally friendly applications

  3. Global Foundation Services Mission • Enable and deliver winning services • To everyone, everywhere

  4. The Services Foundation • Across the company, all over the world, around the clock

  5. Microsoft Challenges • Growth is expected to continue over the next 5 years!

  6. Why Power Matters • In 2006, U.S. data centers consumed an estimated 61 billion kilowatt-hours (kWh) of energy, about 1.5% of the total electricity consumed in the U.S. that year, at a total cost of $4.5 billion. That is more than the electricity consumed by all color televisions in the country, and equivalent to the consumption of about 5.8 million average U.S. households. • Koomey, Jonathan. 2007. Estimating total power consumption by servers in the U.S. and the world. Oakland, CA: Analytics Press. February 15. http://enterprise.amd.com/Downloads/svrpwrusecompletefinal.pdf • Data centers' power and cooling infrastructure accounts for about half of that electricity consumption; IT equipment accounts for the other half. • If the status quo continues, by 2011 data centers will consume 100 billion kWh of energy, at a total annual cost of $7.4 billion. Those levels of power consumption would also necessitate the construction of 10 additional power plants.

  7. Data Center Economics Have Changed! • Cost of physical space was once the primary consideration in data center design • Cost of power and cooling has risen to prominence • Data center managers now must prioritize investment in efficient power and cooling systems to lower the total cost of ownership (TCO) of their facilities. • Belady, C., "In the Data Center, Power and Cooling Costs More than IT Equipment it Supports", Electronics Cooling Magazine (Feb 2007)

  8. Environmental Sustainability • Last year beans, this year a datacenter • Protecting our environment • Smart growth in data centers • Make every kW count! • Invest in innovation for energy efficiency • Examples: hydro power, power supply equipment, compute resource utilization, virtualization • Green Grid: http://www.microsoft.com/environment/our_commitment/articles/green_grid.aspx

  9. Data Center "PacMan" • Land - 2% • Core & Shell Costs – 9% • Architectural – 7% • Mechanical / Electrical – 82%

  10. Case study I: Capacity planning and internal security (Nov 2006) • Problem and impact • Signups experienced intermittent failures due to slow login • About 500K users experienced delays in creating/updating accounts for several hours • Root cause • An interdependent service's user batch migration job was run, degrading overall server performance • The batch job had bugs, including excessive round trips to look up user data • Solution • Test all batch jobs in a test environment first • Capacity planning across services and groups • Increase internal security
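The root cause above was a batch job making one round trip per user. A minimal sketch of the fix, under assumptions: batch the lookups so each chunk costs a single round trip. `fetch_users_bulk` and the batch size are illustrative stand-ins, not details from the talk.

```python
def chunked(ids, size):
    """Yield successive fixed-size chunks of the id list."""
    for i in range(0, len(ids), size):
        yield ids[i:i + size]

def fetch_users_bulk(id_chunk):
    # Hypothetical stand-in for one bulk query, e.g.
    # SELECT ... WHERE user_id IN (...). Here it just echoes the ids.
    return {uid: {"id": uid} for uid in id_chunk}

def migrate_users(ids, batch_size=500):
    """One round trip per batch instead of one per user."""
    users = {}
    round_trips = 0
    for chunk in chunked(ids, batch_size):
        users.update(fetch_users_bulk(chunk))
        round_trips += 1
    return users, round_trips
```

With 1,200 users and a batch size of 500, this makes 3 round trips instead of 1,200.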

  11. Case study II: Protection against an accidental partner error (March 2007) • Problem and impact • ~5 hours of login outage for 75% of users • Login failure for dependent services • Hard to isolate the source of severe unexpected load • Root cause • An internal service partner's bug caused latency in another dependent service, resulting in re-authentication requests that overloaded the login rate • Solution • Application architecture: internal partner application fix to reduce dependency • Improved monitoring: specific to partner dependencies, to identify issues before they become customer-impacting • Develop throttling: throttling by partner for more granular site control

  12. Throttling at all layers of the system • Control incoming requests to prevent total shutdown • Network • Protect against DDoS attacks • Front-end machines • Kernel throttling - for high connection queue • IIS connections - for high connections • Interface queue throttling - for high request queue • CPU throttling - CPU threshold based • TPS throttling - for high TPS per interface • Partner level throttling - for unexpected load increase from a partner • Back-end SQL connections • Throttling on number of database connections
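The partner-level throttling in the list above can be sketched with a per-partner token bucket, so one partner's load spike cannot consume the whole site's capacity. This is a generic illustration of the technique, not Microsoft's implementation; the rates and names are assumptions.

```python
import time

class TokenBucket:
    """Classic token bucket: refill at `rate` tokens/sec up to `capacity`."""

    def __init__(self, rate, capacity):
        self.rate = rate          # tokens added per second
        self.capacity = capacity  # maximum burst size
        self.tokens = capacity
        self.last = time.monotonic()

    def allow(self):
        """Consume one token if available; otherwise reject the request."""
        now = time.monotonic()
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False

buckets = {}

def partner_allowed(partner, rate=100, capacity=20):
    """Return True if this partner's request may proceed right now."""
    bucket = buckets.setdefault(partner, TokenBucket(rate, capacity))
    return bucket.allow()
```

The same bucket class could back the other layers listed above (per-interface TPS, back-end connection counts) with different rate parameters.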

  13. URL Reputation Service (URS) Internet Explorer 7 phishing filter

  14. URS Phishing reporting site

  15. URL Reputation Service (URS) overview • Service profile • Grown to billions of transactions daily • Capacity model: capable of sustaining a response time of <0.5 sec • Managed by 3 people • Architecture • Designed with a pod-oriented architecture (POA) • A pod consists of a couple of dozen servers and a couple of load balancers across multiple VIPs • Pods are distributed in multiple datacenters globally • Pods are globally load balanced by intelligent traffic control for reliability and performance
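The pod-oriented layout above can be sketched as a small data model: a pod bundles a few dozen servers behind its own VIPs, and pods replicate across datacenters. The counts and names here are illustrative assumptions, not URS's actual numbers.

```python
from dataclasses import dataclass

@dataclass
class Pod:
    name: str
    datacenter: str
    servers: int = 24   # "a couple of dozen servers" per pod
    vips: int = 2       # multiple VIPs per pod

def capacity(pods, datacenter=None):
    """Total server count, optionally filtered to one datacenter."""
    return sum(p.servers for p in pods
               if datacenter is None or p.datacenter == datacenter)
```

Because each pod is a self-contained replica, adding capacity means stamping out another pod in whichever datacenter has room, which is what makes the deployment datacenter-agnostic.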

  16. URL Reputation Service topology • [diagram: ITM routing traffic to pods in NA, EU, and Asia datacenters]

  17. Input model - Known phish business rules • [diagram: customer feedback and partner input flow through grading into the URS DB, with URF distribution to all pods] • Customer feedback loop • Grading filters • Partner input • URS DB on SQL cluster • URF distribution to all pods

  18. Performance and global load balancing • [diagram: ITM directing clients to the nearest of the NA, EU, and Asia pods] • Optimizing client traffic by geography reduces latency and error rates • Send customers to the closest datacenter based on source IP • Response time < 0.5 sec
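The "closest datacenter based on source IP" policy can be sketched with a static country-to-datacenter table standing in for a real GeoIP lookup. All mappings and the default are illustrative assumptions.

```python
# Hypothetical mapping; a production system would resolve the client's
# source IP to a country/region via a GeoIP database first.
NEAREST_DC = {"US": "NA", "CA": "NA",
              "DE": "EU", "FR": "EU",
              "JP": "Asia", "SG": "Asia"}

def nearest_datacenter(country_code, default="NA"):
    """Resolve a client's country to its closest pod region."""
    return NEAREST_DC.get(country_code, default)
```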

  19. Fault tolerance - Datacenter failover • [diagram: ITM re-routing traffic away from a failed datacenter to the surviving NA, EU, and Asia pods] • Intelligent Traffic Management: based on policy, re-route traffic from an unavailable datacenter (DC) to other DCs • No service downtime during a DC failure • Disaster recovery / business continuity is built in
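A minimal sketch of the failover policy described above, under assumptions: when health probes mark a DC down, its clients are spread across the remaining healthy DCs. The DC names and round-robin spread are illustrative choices.

```python
def reroute(assignments, healthy):
    """Map each client to its preferred DC, or to a healthy fallback.

    assignments: {client: preferred_dc}; healthy: set of DCs that are up.
    """
    fallbacks = sorted(healthy)
    if not fallbacks:
        raise RuntimeError("no healthy datacenter available")
    routed = {}
    for i, (client, dc) in enumerate(sorted(assignments.items())):
        # Keep the preferred DC if it is up; otherwise spread round-robin.
        routed[client] = dc if dc in healthy else fallbacks[i % len(fallbacks)]
    return routed
```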

  20. Rolling upgrade (roll forward/backward) • Multiple VIPs per DC • Reassign 1 pod to test VIPs for deploying new bits • Rolling upgrade: change validation process • Low risk of outage during deployment • Rollback: easy recovery • Lower cost of test labs • [diagram: ITM keeping the remaining NA, EU, and Asia pods in production while one pod is upgraded]
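The rolling-upgrade flow on this slide can be sketched as: pull one pod out of the production VIP, deploy and validate the new bits on it, then either roll it forward into production or roll it back. `deploy` and `validate` are illustrative hooks, not actual tooling from the talk.

```python
def rolling_upgrade(pods, deploy, validate):
    """Upgrade pods one at a time; roll a pod back if validation fails."""
    upgraded = []
    for pod in pods:
        old_version = pod["version"]
        deploy(pod)                       # move pod to test VIP, push new bits
        if validate(pod):
            upgraded.append(pod["name"])  # rejoin the production VIP
        else:
            pod["version"] = old_version  # easy recovery: restore old bits
    return upgraded
```

Because only one pod leaves production at a time, a bad build affects at most that pod's share of capacity, which is the "low risk of outage during deployment" point above.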

  21. Closing thoughts • Security • Security trumps features • Monitor services • SCOM (MOM) • Transaction monitoring • Operations framework • Automation, outsourcing, pre-racked deployment • Change management, environments, agility • Remote management • PowerShell • Partner testing, beta, production • Deployment automation, rollback

  22. Closing thoughts • Capacity planning, load management • Software control, not people control • Datacenter-agnostic deployment • Deploy servers where there is capacity • Standard SKUs • Fault tolerance • Throttle incoming traffic / limit retries • Back-end server load balancing/failover • Datacenter failover: services fail over across DCs
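The "throttle incoming traffic / limit retries" point above is commonly implemented as capped exponential backoff, so dependent services do not stampede a recovering back end. This is a generic sketch; the attempt limit and base delay are illustrative assumptions.

```python
import time

def call_with_retries(op, max_attempts=4, base_delay=0.1, sleep=time.sleep):
    """Run op(), retrying with exponential backoff.

    Returns (result, attempts); re-raises once retries are exhausted,
    so the failure surfaces instead of hammering the dependency forever.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return op(), attempt
        except Exception:
            if attempt == max_attempts:
                raise                        # limit retries: give up loudly
            sleep(base_delay * 2 ** (attempt - 1))
```

In a real deployment this pairs with server-side throttling: the server sheds load, and well-behaved clients back off rather than retrying in a tight loop.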

  23. Summary • Online services = a software opportunity • Minimize datacenter costs • Environmental sustainability • Resiliency is a competitive advantage
