
Building Resilient Web Services Benjamin Ravani




Presentation Transcript


  1. Building Resilient Web Services. Benjamin Ravani, General Manager, Global Foundation Services, Microsoft. April 29, 2008

  2. Agenda & Objectives • Web services operations • Size and scale • Challenges and opportunities • Case studies and best practices • Closing thoughts • Realizing opportunities during the design phase • Operationally friendly applications

  3. Global Foundation Services Mission • Enable and deliver winning services • To everyone, everywhere

  4. The Services Foundation • Across the company, all over the world, around the clock

  5. Microsoft Challenges • Growth is expected to continue over the next 5 years!

  6. Why Power Matters • In 2006, U.S. data centers consumed an estimated 61 billion kilowatt-hours (kWh) of energy, about 1.5% of the total electricity consumed in the U.S. that year, at a total cost of $4.5 billion. That is more than the electricity consumed by all color televisions in the country, and equivalent to the consumption of about 5.8 million average U.S. households. • Koomey, Jonathan. 2007. Estimating total power consumption by servers in the U.S. and the world. Oakland, CA: Analytics Press. February 15. http://enterprise.amd.com/Downloads/svrpwrusecompletefinal.pdf • Data centers' power and cooling infrastructure accounts for about half of that electricity consumption; IT equipment accounts for the other half. • If the status quo continues, by 2011 data centers will consume 100 billion kWh of energy, at a total annual cost of $7.4 billion. Those levels of power consumption would also necessitate the construction of 10 additional power plants.

  7. Data Center Economics Have Changed! • Cost of physical space was once the primary consideration in data center design • Cost of power and cooling has risen to prominence • Data center managers now must prioritize investment in efficient power and cooling systems to lower the total cost of ownership (TCO) of their facilities. • Belady, C., "In the Data Center, Power and Cooling Costs More than IT Equipment it Supports", Electronics Cooling Magazine (Feb 2007)

  8. Environmental Sustainability • Last year beans, this year a datacenter • Protecting our environment • Smart growth in data centers • Make every kW count! • Invest in innovation for energy efficiency • Examples: hydro power, power supply equipment, compute resource utilization, virtualization • Green Grid: http://www.microsoft.com/environment/our_commitment/articles/green_grid.aspx

  9. Data Center "PacMan" • Land - 2% • Core & Shell Costs – 9% • Architectural – 7% • Mechanical / Electrical – 82%

  10. Case study I: Capacity planning and internal security (Nov 2006) • Problem and impact • Signups experienced intermittent failures due to slow login • About 500K users experienced delays in creating/updating accounts for several hours • Root cause • An interdependent service's user batch migration job was run, degrading overall server performance • The batch job had bugs, including excessive round trips to look up user data • Solution • Test all batch jobs in a test environment first • Capacity planning across services and groups • Increase internal security
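The root cause above was a batch job making one round trip per user. A minimal sketch of the fix, under assumptions: batch the lookups so each chunk costs a single round trip. `fetch_users_bulk` and the batch size are illustrative stand-ins, not details from the talk.

```python
def chunked(ids, size):
    """Yield successive fixed-size chunks of the id list."""
    for i in range(0, len(ids), size):
        yield ids[i:i + size]

def fetch_users_bulk(id_chunk):
    # Hypothetical stand-in for one bulk query, e.g.
    # SELECT ... WHERE user_id IN (...). Here it just echoes the ids.
    return {uid: {"id": uid} for uid in id_chunk}

def migrate_users(ids, batch_size=500):
    """One round trip per batch instead of one per user."""
    users = {}
    round_trips = 0
    for chunk in chunked(ids, batch_size):
        users.update(fetch_users_bulk(chunk))
        round_trips += 1
    return users, round_trips
```

With 1,200 users and a batch size of 500, this makes 3 round trips instead of 1,200.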

  11. Case study II: Protection against an accidental partner error (March 2007) • Problem and impact • ~5 hours of login outage for 75% of users • Login failure for dependent services • Hard to isolate the source of severe unexpected load • Root cause • An internal service partner's bug caused latency in another dependent service, resulting in re-authentication requests that overloaded the login rate • Solution • Application architecture: internal partner application fix to reduce dependency • Improved monitoring: specific to partner dependencies, to identify issues before they become customer-impacting • Develop throttling: throttling by partner for more granular site control

  12. Throttling at all layers of the system • Control incoming requests to prevent total shutdown • Network • Protect against DDoS attacks • Front-end machines • Kernel throttling - for high connection queue • IIS connections - for high connections • Interface queue throttling - for high request queue • CPU throttling - CPU threshold based • TPS throttling - for high TPS per interface • Partner level throttling - for unexpected load increase from a partner • Back-end SQL connections • Throttling on number of database connections
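The partner-level throttling in the list above can be sketched with a per-partner token bucket, so one partner's load spike cannot consume the whole site's capacity. This is a generic illustration of the technique, not Microsoft's implementation; the rates and names are assumptions.

```python
import time

class TokenBucket:
    """Classic token bucket: refill at `rate` tokens/sec up to `capacity`."""

    def __init__(self, rate, capacity):
        self.rate = rate          # tokens added per second
        self.capacity = capacity  # maximum burst size
        self.tokens = capacity
        self.last = time.monotonic()

    def allow(self):
        """Consume one token if available; otherwise reject the request."""
        now = time.monotonic()
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False

buckets = {}

def partner_allowed(partner, rate=100, capacity=20):
    """Return True if this partner's request may proceed right now."""
    bucket = buckets.setdefault(partner, TokenBucket(rate, capacity))
    return bucket.allow()
```

The same bucket class could back the other layers listed above (per-interface TPS, back-end connection counts) with different rate parameters.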

  13. URL Reputation Service (URS) Internet Explorer 7 phishing filter

  14. URS Phishing reporting site

  15. URL Reputation Service (URS) overview • Service profile • Grown to billions of transactions daily • Capacity model: capable of sustaining a response time of <0.5 sec • Managed by 3 people • Architecture • Designed with a pod-oriented architecture (POA) • A pod consists of a couple of dozen servers and a couple of load balancers across multiple VIPs • Pods are distributed in multiple datacenters globally • Pods are globally load balanced by intelligent traffic control for reliability and performance
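The pod-oriented layout above can be sketched as a small data model: a pod bundles a few dozen servers behind its own VIPs, and pods replicate across datacenters. The counts and names here are illustrative assumptions, not URS's actual numbers.

```python
from dataclasses import dataclass

@dataclass
class Pod:
    name: str
    datacenter: str
    servers: int = 24   # "a couple of dozen servers" per pod
    vips: int = 2       # multiple VIPs per pod

def capacity(pods, datacenter=None):
    """Total server count, optionally filtered to one datacenter."""
    return sum(p.servers for p in pods
               if datacenter is None or p.datacenter == datacenter)
```

Because each pod is a self-contained replica, adding capacity means stamping out another pod in whichever datacenter has room, which is what makes the deployment datacenter-agnostic.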

  16. URL Reputation Service topology • [diagram: ITM routing traffic to pods in NA, EU, and Asia datacenters]

  17. Input model - Known phish business rules • [diagram: customer feedback and partner input flow through grading into the URS DB, with URF distribution to all pods] • Customer feedback loop • Grading filters • Partner input • URS DB on SQL cluster • URF distribution to all pods

  18. Performance and global load balancing • [diagram: ITM directing clients to the nearest of the NA, EU, and Asia pods] • Optimizing client traffic by geography reduces latency and error rates • Send customers to the closest datacenter based on source IP • Response time < 0.5 sec
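The "closest datacenter based on source IP" policy can be sketched with a static country-to-datacenter table standing in for a real GeoIP lookup. All mappings and the default are illustrative assumptions.

```python
# Hypothetical mapping; a production system would resolve the client's
# source IP to a country/region via a GeoIP database first.
NEAREST_DC = {"US": "NA", "CA": "NA",
              "DE": "EU", "FR": "EU",
              "JP": "Asia", "SG": "Asia"}

def nearest_datacenter(country_code, default="NA"):
    """Resolve a client's country to its closest pod region."""
    return NEAREST_DC.get(country_code, default)
```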

  19. Fault tolerance - Datacenter failover • [diagram: ITM re-routing traffic away from a failed datacenter to the surviving NA, EU, and Asia pods] • Intelligent Traffic Management: based on policy, re-route traffic from an unavailable datacenter (DC) to other DCs • No service downtime during a DC failure • Disaster recovery / business continuity is built in
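A minimal sketch of the failover policy described above, under assumptions: when health probes mark a DC down, its clients are spread across the remaining healthy DCs. The DC names and round-robin spread are illustrative choices.

```python
def reroute(assignments, healthy):
    """Map each client to its preferred DC, or to a healthy fallback.

    assignments: {client: preferred_dc}; healthy: set of DCs that are up.
    """
    fallbacks = sorted(healthy)
    if not fallbacks:
        raise RuntimeError("no healthy datacenter available")
    routed = {}
    for i, (client, dc) in enumerate(sorted(assignments.items())):
        # Keep the preferred DC if it is up; otherwise spread round-robin.
        routed[client] = dc if dc in healthy else fallbacks[i % len(fallbacks)]
    return routed
```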

  20. Rolling upgrade (roll forward/backward) • Multiple VIPs per DC • Reassign 1 pod to test VIPs for deploying new bits • Rolling upgrade: change validation process • Low risk of outage during deployment • Rollback: easy recovery • Lower cost of test labs • [diagram: ITM keeping the remaining NA, EU, and Asia pods in production while one pod is upgraded]
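The rolling-upgrade flow on this slide can be sketched as: pull one pod out of the production VIP, deploy and validate the new bits on it, then either roll it forward into production or roll it back. `deploy` and `validate` are illustrative hooks, not actual tooling from the talk.

```python
def rolling_upgrade(pods, deploy, validate):
    """Upgrade pods one at a time; roll a pod back if validation fails."""
    upgraded = []
    for pod in pods:
        old_version = pod["version"]
        deploy(pod)                       # move pod to test VIP, push new bits
        if validate(pod):
            upgraded.append(pod["name"])  # rejoin the production VIP
        else:
            pod["version"] = old_version  # easy recovery: restore old bits
    return upgraded
```

Because only one pod leaves production at a time, a bad build affects at most that pod's share of capacity, which is the "low risk of outage during deployment" point above.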

  21. Closing thoughts • Security • Security trumps features • Monitor services • SCOM (MOM) • Transaction monitoring • Operations framework • Automation, outsourcing, pre-racked deployment • Change management, environments, agility • Remote management • PowerShell • Partner testing, beta, production • Deployment automation, rollback

  22. Closing thoughts • Capacity planning, load management • Software control, not people control • Datacenter-agnostic deployment • Deploy servers where there is capacity • Standard SKUs • Fault tolerance • Throttle incoming traffic / limit retries • Back-end server load balancing/failover • Datacenter failover: services fail over across DCs
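The "throttle incoming traffic / limit retries" point above is commonly implemented as capped exponential backoff, so dependent services do not stampede a recovering back end. This is a generic sketch; the attempt limit and base delay are illustrative assumptions.

```python
import time

def call_with_retries(op, max_attempts=4, base_delay=0.1, sleep=time.sleep):
    """Run op(), retrying with exponential backoff.

    Returns (result, attempts); re-raises once retries are exhausted,
    so the failure surfaces instead of hammering the dependency forever.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return op(), attempt
        except Exception:
            if attempt == max_attempts:
                raise                        # limit retries: give up loudly
            sleep(base_delay * 2 ** (attempt - 1))
```

In a real deployment this pairs with server-side throttling: the server sheds load, and well-behaved clients back off rather than retrying in a tight loop.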

  23. Summary • Online services = a software opportunity • Minimize datacenter costs • Environmental sustainability • Resiliency is a competitive advantage
