Microsoft.com: Design for Resilience. The Infrastructure of www.microsoft.com, Microsoft Update, and the Download Center. Paul Wright, Technology Architect Manager, Microsoft.com Operations, pwright@microsoft.com. Sunjeev Pandey, Senior Director, Microsoft.com Operations, sunjeevp@microsoft.com.
Agenda • Microsoft.com Introduction • Size and Scale • Network and System Architecture • How Do We Do It? • Questions
A Brief History of Microsoft.com • Traffic growth: 30K users/day (1995), 4M unique users/day (2001), 6.5M unique users/day (2003), 17.1M unique users/day (2006) • Microsoft launches www.microsoft.com: information & support publishing; hosting • Microsoft combines Web platform, ops, and content teams: standardization effort begins; consolidation of hosted systems • Single MSCOM group formed: focus on MSCOM network programming and campaign-to-Web integration; brand, content, site standards, privacy, brand compliance • Enable an innovative customer experience online & in-product: Product Info, Support, Dev / ITPro Experience, Customer Intelligence, Profile Mgmt & Enterprise Downloads
Resiliency vs. Disaster Recovery (compared by characteristics and type of failover) • Disaster Recovery: Reactive, Static, Manual, Backup/Restore • Resiliency: Proactive, Dynamic, Automatic, Data Mirroring • Pros: Increased Availability, Improved Performance • Cons: Higher Initial Costs, More Complexity
Microsoft.com Corporate Reach • Reach Overview – June 06 • #6 overall site in U.S.; 55.7M UU for 36% reach* • #4 site worldwide; reaching 248.5M UU** • Avg 280M UU/month July 05 to Jun 06 • Reach Surpasses All Corporate Sites • Apple ranked #22: 17.8M UU, 11.5% reach • Netscape ranked #67: 9.6M UU, 6.2% reach • Sony ranked #217: 3.9M UU, 2.6% reach • Sun ranked #307: 3.1M UU, 2.0% reach • IBM ranked #485: 2.1M UU, 1.4% reach (U.S. data provided for relative comparison*) *Nielsen/NetRatings June 2006 (unique users in millions); **Worldwide data from comScore Media Metrix June 2006 (unique users in millions)
Microsoft.com – Quick Facts • Infrastructure and Application Footprint • 6 Internet Data Centers & 3 CDN Partnerships • 120+ Web Sites, 1000s of Apps and 2138 Databases • 120+ Gigabit/sec Bandwidth • Solutions at High Scale • www.Microsoft.com • 17.1M Unique Users/Day & 70M Page Views/Day • 10K Req/Sec, 300K Concurrent Connections on 80 Servers • 350 Vroots, 190 IIS Web Apps & 12 App Pools • Microsoft Update • 250M Unique Scans/Day, 18K ASP.NET Req/Sec, 1.1M Concurrent • 28.2 Billion Downloads for CY 2005 • Egress – MS, Akamai & Savvis (30-100+ Gbit/Sec)
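To give the www.Microsoft.com figures above a per-server scale, the totals can be divided across the 80 servers. This is a rough sketch only: it assumes load is spread evenly and steadily, which the slide does not claim, and the helper name `per_server` is illustrative.

```python
SECONDS_PER_DAY = 24 * 60 * 60

# Figures quoted on the slide for www.microsoft.com
requests_per_sec = 10_000
concurrent_connections = 300_000
servers = 80
page_views_per_day = 70_000_000

def per_server(total: float, server_count: int) -> float:
    """Average share of a cluster-wide total carried by one server (assumes even spread)."""
    return total / server_count

# Daily page views expressed as a flat per-second average (real traffic has peaks).
avg_page_views_per_sec = page_views_per_day / SECONDS_PER_DAY

print(f"requests/sec per server: {per_server(requests_per_sec, servers):.0f}")
print(f"concurrent connections per server: {per_server(concurrent_connections, servers):.0f}")
print(f"average page views/sec: {avg_page_views_per_sec:.0f}")
```

Under these assumptions each server averages 125 requests per second and 3,750 concurrent connections; peak load per server would be higher than these averages.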
Operations Workbench Plugin Keynote Web Site Availability • Externally Measured by Keynote Systems, Inc. • Benchmark Against Other Large Sites • Driving Cross-Team Maturity - Positive Trend in Availability: • 2003 – 99.70% • 2004 – 99.78% • 2005 – 99.83% • 2006 – 99.87% YTD
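To put the availability trend above in perspective, each percentage can be converted into the downtime it implies over a year. A minimal sketch, assuming a 365-day year and treating availability as a simple fraction of time; how Keynote actually samples and weights its measurements is not described on the slide, and the function name is illustrative.

```python
HOURS_PER_YEAR = 365 * 24  # assumption: non-leap year

def downtime_hours_per_year(availability_pct: float) -> float:
    """Hours of unavailability per year implied by an availability percentage."""
    return (100.0 - availability_pct) / 100.0 * HOURS_PER_YEAR

# Availability figures from the slide
availability_by_year = {2003: 99.70, 2004: 99.78, 2005: 99.83, 2006: 99.87}

for year, pct in availability_by_year.items():
    print(f"{year}: {pct:.2f}% -> {downtime_hours_per_year(pct):.1f} h/year")
```

Under this assumption the trend corresponds to a drop from roughly 26 hours of implied downtime per year in 2003 to roughly 11 hours at the 2006 year-to-date rate.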
Web Site Availability • Total Errors and Daily Availability of www.microsoft.com - ’06 YTD • Constantly monitored and analyzed • Corrective actions taken as needed • Total Errors ’06 YTD grouped per error type • Content errors - #1 hit on availability • Only 1.3% of the total errors due to server issues (Service unavailable; Server Error; Connection Reset)
Resilient Against What? • Provide Predictable Service: Cost, Availability, Performance • Infrastructure: Power / Cooling, ISP / Telco, Data Center, HW / SW Failure • Security: Virus, Unauthorized Access, DDoS Attack • Application: System/Data Corruption
Infrastructure Architecture Technologies • GLBS, DNS, Caching, WALB, DDoS, BGP, Broad Peering • HSRP, OSPF, Spanning Tree, Clustering, WLB
High Availability Architecture - Global Solutions & Networking • Global Solutions • Content Caching Partners: Akamai & Savvis • Global Load Balancing via DNS – Web Cluster Level Mgmt • Health Checking and Automatic Fail-over • Security Infrastructure • Cisco Guards – Anomaly Detection & DoS Filtering • Router ACLs Allow HTTP/S Only – Exceptions Require Review • Router Architecture – Cookie Cutter • Redundant Router and Switch Pairs with VLAN Segregation • Simple, Scalable, Manageable, Repeatable • Agility – Quickly Repurpose VLANs as Required
Enhanced DDoS Protection • Diagram labels: SYN flood, Valid Traffic, Filtered
High Availability Architecture - Web & Database Hosting • Standard Hosting Models • Agility - Quickly Reallocate from System to System • Efficiency - Less Staffing & Equipment Required • Consistent Configurations • Repeatable Infrastructure Architecture
High Availability Architecture - Web & Database Hosting • Server Configurations • Standard Server Hardware – Flexibility • Identical Baseline O/S, IIS, ASP.NET Configurations • Build Scripts for consistent site builds • Application Code & Content Unique per Site • File, Registry, Service, and Local Security Attributes Collected for Configuration Auditing and Reporting
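The configuration auditing mentioned above amounts to comparing what was collected from a server against an expected baseline. The following is a minimal sketch of that comparison, not the tooling the team used: the setting names and values are hypothetical, and a real collector would gather file, registry, service, and security attributes from each host.

```python
def audit(baseline: dict, observed: dict) -> dict:
    """Return settings that are missing, unexpected, or different from the baseline.

    Each entry maps a setting name to (expected, actual).
    """
    drift = {}
    for key in baseline.keys() | observed.keys():
        expected = baseline.get(key, "<missing>")
        actual = observed.get(key, "<missing>")
        if expected != actual:
            drift[key] = (expected, actual)
    return drift

# Hypothetical baseline and one server's collected attributes.
baseline = {
    "service.W3SVC.start_mode": "auto",
    "registry.example_key": "1",
    "file.web_config.sha256": "abc123",
}
observed = {
    "service.W3SVC.start_mode": "manual",   # drifted from baseline
    "registry.example_key": "1",
    "file.extra_dll.sha256": "ffee00",      # not in baseline
}

for setting, (expected, actual) in sorted(audit(baseline, observed).items()):
    print(f"{setting}: expected {expected}, found {actual}")
```

Because the baseline is identical across servers, any non-empty result points at a host that has drifted and can be flagged in reporting.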
High Availability Architecture - Web & Database Hosting • Network Load Balancing (NLB) Clusters • Main Load Balancing Solution Today • Server Cluster Sizes: 3 – 8 Servers/Cluster • Positives: • Easy Mgmt – Knowledge within Team • Free with Windows SKUs • Challenges: • Switch Overhead • Connection Affinity • Application Layer Switching
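Connection affinity, listed above as a challenge, means that requests from one client keep landing on the same server. The sketch below illustrates the idea with a hash of the client IP address; it is not the algorithm Windows NLB uses, and the server names and `server_for_client` helper are hypothetical.

```python
import hashlib

def server_for_client(client_ip: str, servers: list[str]) -> str:
    """Map a client IP to one server so repeat requests land on the same host."""
    if not servers:
        raise ValueError("cluster has no servers")
    digest = hashlib.sha256(client_ip.encode("utf-8")).digest()
    index = int.from_bytes(digest[:8], "big") % len(servers)
    return servers[index]

# Hypothetical 8-server cluster (the upper cluster size quoted on the slide).
cluster = [f"web{i:02d}" for i in range(1, 9)]

first = server_for_client("203.0.113.7", cluster)
second = server_for_client("203.0.113.7", cluster)
print(first, first == second)
```

One reason affinity is hard with this kind of scheme: when a server is added or removed, the modulo changes and many clients are remapped to a different host, losing any state held on the old one.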
High Availability Architecture - Web & Database Hosting • Hardware Load Balancing • Limited Use for App Layer Load Balancing • Future – Greater Adoption for Non-NLB Features • Positives: • App Layer Load Balancing • Connection Affinity • Challenges: • Added Complexity/Risks • Costs – Hardware & People
High Availability Architecture - Collecting, Monitoring, & Reporting • Diagram labels: SMTP MOM Tools Services Layer IMQ IIS Log Monitor GAL Cluster Sentinel Core SE Annotations Perf IAdmin Keynote AD Cisco Guard
High Availability Architecture - Remote Server Management • Integrated Lights Out (iLO) from HP • Cold Reboot • Power On/Off • Debugging Over iLO – No More Crash Cart • Imaging for Dog Food OS Builds • RDP Over iLO • Movement to “Lights Out” Datacenter
Global Load Balancing & Caching • Health Checking and Fail-over • Automated pulling of clusters to watermark • Removal on demand for maintenance • Load Shaping & Distribution • Control load percentages to specific clusters • Region specific traffic distribution • Distributing Patches/Files to 300M+ Clients • Partnership with 3 Providers • Akamai, Savvis, & MSN • Load Distributed via Load Balancing • Functions via DNS Resolution and Custom Logic from CDNs
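The health checking and load shaping described above can be sketched as a weighted choice over the clusters that are currently eligible for traffic. This is an illustration under assumptions, not Microsoft.com's implementation: the cluster names, weights, and `pick_cluster` helper are hypothetical, and a real global load balancer would answer DNS queries rather than return an object.

```python
import random
from dataclasses import dataclass

@dataclass
class Cluster:
    name: str                     # hypothetical cluster label
    weight: float                 # share of traffic the operator wants to send here
    healthy: bool = True          # result of the most recent health check
    in_maintenance: bool = False  # pulled on demand by operators

def eligible(clusters):
    """Clusters that may receive traffic: healthy, not in maintenance, weight > 0."""
    return [c for c in clusters if c.healthy and not c.in_maintenance and c.weight > 0]

def pick_cluster(clusters, rng=random):
    """Pick an eligible cluster with probability proportional to its weight."""
    candidates = eligible(clusters)
    if not candidates:
        raise RuntimeError("no healthy cluster available")
    weights = [c.weight for c in candidates]
    return rng.choices(candidates, weights=weights, k=1)[0]

clusters = [
    Cluster("dc-a", weight=50),
    Cluster("dc-b", weight=30),
    Cluster("cdn", weight=20),
]
clusters[1].healthy = False  # simulate a failed health check on dc-b
print(pick_cluster(clusters).name)  # dc-b is never chosen while unhealthy
```

Changing a weight shifts the percentage of new resolutions sent to a cluster, and setting `in_maintenance` removes it without touching the others.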
Global Load Balancing & Caching - Intelligent Load Balancing (diagram)
Global Load Balancing & Caching- Geo Targeting • Load Shaping Based on Client Resolver Location • Direct Traffic to Particular Clusters or Caching Provider as Appropriate • Customer Experience Enhanced due to Improved Local Proximity • Load Shaping Based on Client Location • CDN Provider Proxies Requests – Responds with File Based on Location of Client
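Geo targeting as described above can be pictured as a lookup from the resolver's region to a preferred serving location, with a fallback when that location is unavailable. A minimal sketch: the region codes, target names, and `target_for_resolver` helper are all hypothetical, and how resolver location is actually determined is outside what the slide covers.

```python
# Hypothetical mapping from resolver region to preferred serving location.
REGION_TO_TARGET = {
    "na": "us-cluster",
    "eu": "eu-cluster",
    "apac": "cdn-provider",
}
DEFAULT_TARGET = "us-cluster"  # assumption: one location acts as the global fallback

def target_for_resolver(region, healthy_targets):
    """Return the serving location for a resolver region, or None if nothing is healthy."""
    preferred = REGION_TO_TARGET.get(region, DEFAULT_TARGET)
    if preferred in healthy_targets:
        return preferred
    if DEFAULT_TARGET in healthy_targets:
        return DEFAULT_TARGET
    # Fall back to any remaining healthy target, chosen deterministically.
    return sorted(healthy_targets)[0] if healthy_targets else None

healthy = {"us-cluster", "cdn-provider"}
print(target_for_resolver("eu", healthy))    # eu-cluster is down, so the default is used
print(target_for_resolver("apac", healthy))
```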
SQL Server 2005 Peer-To-Peer Replication • Redundancy • Each server hosts a copy of the database • Availability • Individual servers can be patched/upgraded without causing database availability issues • Performance • Application calls are load balanced between nodes of the cluster for improved scale-out • Zero perceived App Downtime • Eliminate single point of failure for R/W Databases • Considerations: • Object names, object schema, and publication names should be identical • Publications must allow schema changes to be replicated • Updates for a given row should be made only at one database until it has synchronized with its peers
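One way to respect the last consideration above, that a given row is updated at only one database until peers have synchronized, is to route writes by row key while spreading reads across all peers. This is a sketch of that routing idea in application code, not a SQL Server feature: the peer names and both helper functions are hypothetical.

```python
import zlib

PEERS = ["sql-peer-1", "sql-peer-2", "sql-peer-3"]  # hypothetical node names

def write_peer_for(row_key: str, peers=PEERS) -> str:
    """Return the single peer that should accept writes for this row key."""
    return peers[zlib.crc32(row_key.encode("utf-8")) % len(peers)]

def read_peer_for(request_number: int, peers=PEERS) -> str:
    """Spread reads round-robin, since every peer hosts a full copy of the database."""
    return peers[request_number % len(peers)]

# Every write for the same key goes to the same peer, avoiding conflicting updates.
print(write_peer_for("customer:42") == write_peer_for("customer:42"))
print([read_peer_for(n) for n in range(4)])
```

If a peer is taken out for patching, its keys need a temporary new owner, and that handover should wait until the peers have synchronized.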
Scaling Out – Real World Implementation • Data Center and Geo redundancy • Scalable Units • Content Publishing • WAN Replication • End-to-end monitoring
Key Takeaways • Huge Gains due to 64-bit H/W & Windows Platforms • Seamless migration provided with WoW64 • Enabled www.Microsoft.com to leverage saved infrastructure to enable Data Center Redundancy • App Pool Recycles Eliminated – Enjoying the new 4GB VM address space running under WoW64 • Enabled more App Pools, driving further Isolation of Code & Content in shared hosting models • Chart: CPU Utilization Per Platform, Comparative Study: x86 vs. x64
Windows 32bit vs. 64bit Comparison • Comparative Study Results – Windows Update Download System • Perf Scenario: Stress generated by live HTTP traffic from Windows Update Downloads • 32bit Application Processes bottlenecked by 2GB Virtual Memory limit vs 4GB capabilities on 64bit operating system enabling Max Mbits/Sec • Improved compute times on 64bit increased Req/Sec while lowering Concurrent Connections (i.e., Improved HTTP Request Processing Times)
Windows 64bit Analysis • Comparative Study Results: www.Microsoft.com Perf • Objective: Stress a live production server to identify its maximum ability to serve HTTP traffic from www.Microsoft.com client requests
Resources • http://blogs.technet.com/mscom • http://blogs.msdn.com/mscomts
Appendix
R/O NLB SQL Cluster • Redundancy - Each server hosts a copy of the database • SQL1– Read/Write • SQL2 & SQL3 – Read/Only • Availability • Individual servers can be patched/upgraded without causing database availability issues • Performance • Application calls are load balanced between nodes of the cluster for improved scale-out
R/W NLB SQL Cluster • Redundancy - Each server hosts a copy of the database • SQL1 – Read/Write – Consolidator • SQL2 – Primary Read/Write (active) • SQL3 – Log Shipping Secondary (standby) • Availability • Single point of failure • Manual failover – takes minutes to complete • Performance • Application calls to a database are not load balanced between the nodes of the cluster
Mirroring (SQL 2005 SP1) • Mirroring: Highest Availability Writes • Log Shipping for DC Redundancy • Reduced failover downtime from 10 min avg to <1 min (planned) • Considerations: • Works on a per-database basis for DBs in full recovery model • Only one database is available for clients at any time • Supports two partners and an optional “witness” server for automated failover
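The role of the optional witness in automated failover can be illustrated with a simplified decision rule: the mirror takes over on its own only when the principal is unreachable, the mirror and witness can still see each other, and the mirror is synchronized. This is a deliberately reduced model of that behavior, not SQL Server's actual state machine; the class and function names are hypothetical.

```python
from dataclasses import dataclass

@dataclass
class MirroringState:
    principal_reachable: bool  # can the mirror and witness still reach the principal?
    mirror_reachable: bool     # is the mirror up and connected to the witness?
    witness_reachable: bool    # is a witness configured and reachable?
    synchronized: bool         # has the mirror received all committed changes?

def can_fail_over_automatically(state: MirroringState) -> bool:
    """Simplified rule: failover needs a lost principal plus a mirror/witness quorum."""
    if state.principal_reachable:
        return False  # nothing to fail over from
    return state.mirror_reachable and state.witness_reachable and state.synchronized

outage = MirroringState(principal_reachable=False, mirror_reachable=True,
                        witness_reachable=True, synchronized=True)
no_witness = MirroringState(principal_reachable=False, mirror_reachable=True,
                            witness_reachable=False, synchronized=True)
print(can_fail_over_automatically(outage))      # True
print(can_fail_over_automatically(no_witness))  # False: failover must be manual
```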
TCP Improvements – Client Testing • What Exactly Changed? • Compound TCP (CTCP) – controls TCP sending window size; interesting when LH (Longhorn) is the server • Receive Window Auto-Tuning – controls TCP receive window size; interesting when Vista is the client • Test Scenario • Clients: Dual boot client (XPSP2 & Vista 5308) • Test: Download (EN W2KSP4 ~135MB) from 4 locations (Tukwila, Bay, Florida & Frankfurt) • Results • Corporate network environment – direct Internet connectivity (high speed, low packet loss) • 5–7% relative speed gain in low latency scenarios (2-20 msec RTT) • >150% relative speed gain in mid to high latency scenarios (80-180 msec RTT) • Home network environment (Comcast cable modem) • ~40% relative speed gain (16-330 msec RTT)
TCP/IP Throughput Improvements • Server-to-server transfer over 20 ms RTT link • W2K3 to W2K3: 10-12 Mbps • Longhorn to Longhorn: > 300 Mbps • Vista client Internet download speeds at 160 ms RTT: > 2x
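Window size matters for these results because a single TCP connection can have at most one window of data in flight per round trip, so throughput is bounded by window size divided by RTT. The sketch below works through that bound. It assumes a 65,535-byte window, the largest that can be advertised without the TCP window scaling option; the slides do not state which window sizes were in effect during the tests, and the function names are illustrative.

```python
def max_throughput_mbps(window_bytes: int, rtt_ms: float) -> float:
    """Upper bound on single-connection throughput: one window per round trip."""
    return window_bytes * 8 / (rtt_ms / 1000.0) / 1e6

def window_needed_bytes(target_mbps: float, rtt_ms: float) -> float:
    """Window (bandwidth-delay product) required to sustain a target rate."""
    return target_mbps * 1e6 / 8 * (rtt_ms / 1000.0)

UNSCALED_WINDOW = 65_535  # assumption: no window scaling in use

for rtt in (20, 160):  # RTT values quoted on the slide
    print(f"{rtt} ms RTT: at most {max_throughput_mbps(UNSCALED_WINDOW, rtt):.1f} Mbps")

print(f"window for 300 Mbps at 20 ms: {window_needed_bytes(300, 20):,.0f} bytes")
```

Under these assumptions an unscaled window caps a 20 ms link at about 26 Mbps and a 160 ms link at about 3.3 Mbps, while sustaining 300 Mbps at 20 ms needs a window of about 750,000 bytes. This is consistent with larger gains being reported at higher latency.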