SESSION CODE: UNC305 Microsoft Exchange Server 2010 High Availability Design Considerations Ross Smith IV, Senior Program Manager, Exchange Server, Microsoft Corporation
Exchange 2010 High Availability and Site Resilience [Diagram: a client, a Client Access Server array, Mailbox Servers 1 through 6 hosting copies of DB1 through DB5, and the Dallas and San Jose datacenters]
Agenda • Discuss different design dimensions: • Infrastructure design • Database Availability Group design • Client experiences • Goal: To ensure you understand how to design DAGs properly!
Infrastructure Design Active Directory sites • Active Directory site assignment controls the association of CAS to Mailbox and Hub to Mailbox • CAS/HUB service local mailbox servers, “mostly” • Could be for multiple DAGs • DAGs can span subnets without special action • IP address for each MAPI subnet used by DAG • Configured on DAG object • Question: When would an AD site span datacenters? • Answer: When datacenters have LAN-quality communication • Follow Active Directory guidance for AD site definition
Infrastructure Design Network subnet recommendations • No single subnet requirement between or within datacenters • Block “cross network” communication to minimize heartbeat traffic • Complete redundancy is preferred but not required • Encryption and compression are controlled at subnet level • Watch server to subnet assignment in same datacenter [Diagram: Subnets 1 through 4, with communication paths marked Allowed or Blocked]
Infrastructure Design Cross-datacenter network configuration • For site resilience configurations use DHCP to assign addresses for replication network • Enables delivery of the typically required static routes • If using static IP addresses, use netsh instead of route for configuring static routes • In terms of latency requirements, Exchange 2010 was designed with a target round-trip latency of 250 ms or less • Remember, the higher the latency, the more impact to replication • Configure a DNS TTL on “service access connection records” that is consistent with your SLA • E.g. ~5 minutes for a one-hour RTO SLA • Direct association between this time and recovery • Remember the records might be in different zones!
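To make the TTL guidance above concrete, the arithmetic below shows how much of an RTO a DNS TTL can consume in the worst case, when a client cached the old record just before the change. The function name and the idea of expressing the result as a fraction are illustrative assumptions; only the 5-minute and one-hour figures come from the slide.

```python
def ttl_share_of_rto(ttl_minutes: float, rto_minutes: float) -> float:
    """Worst-case fraction of the RTO spent waiting for cached DNS records to expire."""
    if rto_minutes <= 0:
        raise ValueError("RTO must be positive")
    return ttl_minutes / rto_minutes


# Slide example: a ~5 minute TTL against a one-hour RTO.
share = ttl_share_of_rto(5, 60)
print(f"DNS caching can use up to {share:.1%} of the RTO")
```

A longer TTL leaves less of the RTO for the actual datacenter activation steps, which is the "direct association" the slide refers to.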
Infrastructure Design Site resilience models • There are three key models you have to take into account when designing site resilient solutions • Datacenter Model • Namespace Model • User Distribution Model • When planning for site resilience, each datacenter needs to be considered active • Enables cross-site single database *over events (failover or switchover) • Exchange Server 2010 site resilience requires “active” CAS, HUB, and UM in standby datacenter • Subtle difference from Exchange Server 2007 • Services used by databases mounted in standby datacenter after “single database *over”
Infrastructure Design Namespace planning • Each datacenter should be considered active when planning for namespaces • Each datacenter needs the following namespaces • OWA/OA/EWS/EAS namespace • POP/IMAP namespace • RPC Client Access namespace • SMTP namespace • In addition, one datacenter will maintain the Autodiscover namespace
Infrastructure Design Leverage split-brain DNS • Best Practice: Use “Split DNS” for Exchange hostnames used by clients • Goal: minimize number of hostnames • mail.contoso.com for Exchange connectivity on intranet and Internet • mail.contoso.com has different IP addresses in intranet/Internet DNS • Important – before moving down this path, be sure to map out all the host names (outside of Exchange) that you will want to create in the internal zone
Infrastructure Design What does the namespace design look like?
• Datacenter 1 (HT, CAS, AD, MBX) • External DNS: Mail.contoso.com, Pop.contoso.com, Imap.contoso.com, Autodiscover.contoso.com, Smtp.contoso.com • Internal DNS: Mail.contoso.com, Pop.contoso.com, Imap.contoso.com, Autodiscover.contoso.com, Smtp.contoso.com, Outlook.contoso.com • ExternalURL = mail.contoso.com • CAS Array = outlook.contoso.com • OA endpoint = mail.contoso.com
• Datacenter 2 (HT, CAS, AD, MBX) • External DNS: Mail.region.contoso.com, Pop.region.contoso.com, Imap.region.contoso.com, Smtp.region.contoso.com • Internal DNS: Mail.region.contoso.com, Pop.region.contoso.com, Imap.region.contoso.com, Smtp.region.contoso.com, Outlook.region.contoso.com • ExternalURL = mail.region.contoso.com • CAS Array = outlook.region.contoso.com • OA endpoint = mail.region.contoso.com
Infrastructure Design Certificate planning • Best practice: minimize the number of certificates • 1 certificate for all CAS servers + reverse proxy + Edge/Hub • Use “Subject Alternative Name” (SAN) certificate which can cover multiple hostnames • If leveraging a certificate per datacenter, then ensure that the Certificate Principal Name is the same on all certificates • Outlook Anywhere won’t connect if the Principal Name on the certificate does not match the value configured in msstd: (default matches OA RPC End Point) • Set-OutlookProvider EXPR -CertPrincipalName msstd:mail.contoso.com
Infrastructure Design User distribution models • The locality of the users will ultimately determine your site resilience architecture • Are users primarily located in one datacenter? • Are users located in multiple datacenters? • Is there a requirement to maintain user population in a particular datacenter? • Active/Passive user distribution model • Database copies deployed in the secondary datacenter, but no active mailboxes are hosted there • Active/Active user distribution model • User population dispersed across both datacenters with each datacenter being the primary datacenter for its specific user population
Infrastructure Design Client Access Arrays • 1 CAS array per AD site • Multiple DAGs within an AD site can use the same CAS array • FQDN of the CAS array needs to resolve to a load-balanced virtual IP address in DNS • Should only resolve in internal DNS structure • CAS Array does not provide any load balancing, so you need a load balancer! • Set the databases in the AD site to utilize the CAS array via the RPCClientAccessServer property of Set-MailboxDatabase • By default, new databases will have the RPCClientAccessServer value set on creation • If database was created prior to creating CAS array, then it is set to random CAS FQDN (or local machine if role co-location) • If database is created after creating CAS array, then it is set to the CAS array FQDN
DAG Design Database copies • Each DAG member can host 1 copy of each mailbox database • Maximum number of databases within a 16-member DAG, by copies per database: • 1 copy – 1600 databases • 2 copies – 800 databases • 3 copies – 533 databases • Two types of database copies • HA database copies • Lagged database copies
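The database counts on the slide above follow from simple division. The sketch below reproduces them, assuming a limit of 100 database copies per server; that limit is inferred from the slide's 1600 / 16 figure, and the function and constant names are illustrative.

```python
MEMBERS = 16
COPIES_PER_SERVER = 100  # assumption: inferred from 1600 databases / 16 members


def max_databases(copies_per_database: int,
                  members: int = MEMBERS,
                  per_server: int = COPIES_PER_SERVER) -> int:
    """Largest number of databases the DAG can hold at a given copy count."""
    if not 1 <= copies_per_database <= members:
        # Each member can host only one copy of a given database.
        raise ValueError("copy count must be between 1 and the member count")
    return (members * per_server) // copies_per_database


for copies in (1, 2, 3):
    print(copies, "copies:", max_databases(copies), "databases")
```

The integer division explains why three copies yield 533 rather than a round number.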
DAG Design Lagged database copies • Lagged copies are only for point-in-time protection • Logical corruption and/or mailbox deletion prevention scenarios • Provide a maximum of 14 days of protection • When should you deploy a lagged copy? • Useful only to mitigate a risk • Not needed if deploying a third-party backup solution (e.g. DPM 2010) • Lagged copies are not HA database copies • Lagged copies should never be activated! • Lagged copies have storage implications
DAG Design Controlling database copy activation • Various scenarios: • Don’t want to activate database copies on servers in standby because… • Want to preclude activation of copies on server X because of hardware issue or lagged copies… • Block activation of database copies on a server during upgrade • Two ways to block activation of copies • Set-MailboxServer <Server> -DatabaseCopyAutoActivationPolicy <Blocked,IntrasiteOnly,Unrestricted> • Suspend-MailboxDatabaseCopy <DB\Server> -ActivationOnly
DAG Design Sizing • Question: How many members should be in a DAG? • Answer: It depends (Greg would say 16) • The larger the DAG, the better the resiliency • Consider the implications of a three copy / six server DAG vs. two DAGs with three servers and three copies of each database • Larger DAGs continue to provide as much service as they can after more failures • The larger the DAG, the more efficient the use of the hardware • Distribute active load across all members • For server count, consider a multiple of the number of copies you are deploying • Also need to consider quorum implications, especially in site resilience scenarios
DAG Design Planning for quorum [Diagram captions] • Quorum Votes = 2 (No Majority) • Quorum Votes = 3 (Majority) • Quorum Votes = 5 (Majority)
DAG Design Planning for quorum [Diagram captions] • Quorum Votes = 7 (Majority) • Quorum Votes = 5 (Majority) • Quorum Votes = 4 (Majority)
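The captions above come from diagrams that are not reproduced here, so the exact layouts are unknown. As a general illustration of the majority rule behind them, here is a minimal sketch. It assumes the usual DAG quorum models: an odd member count uses node majority, and an even member count adds one vote from a file share witness. The function names are illustrative.

```python
def total_votes(members: int) -> int:
    """Votes in the cluster: each member votes; an even count adds a witness vote."""
    if members < 1:
        raise ValueError("a DAG needs at least one member")
    return members if members % 2 == 1 else members + 1


def votes_needed(members: int) -> int:
    """Smallest number of votes that forms a strict majority."""
    return total_votes(members) // 2 + 1


def has_quorum(members: int, surviving_votes: int) -> bool:
    """True if the surviving voters still form a majority."""
    return surviving_votes >= votes_needed(members)


# A 4-member DAG plus witness has 5 votes and needs 3 of them to keep quorum.
for surviving in (2, 3):
    print(surviving, "votes:", "majority" if has_quorum(4, surviving) else "no majority")
```

In a site resilience design, the question is which datacenter holds enough of these votes to keep a majority when the link between datacenters fails.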
DAG Design Sizing • Question: How many DAGs should I deploy? • Answer: It depends • Obviously you will need to deploy multiple DAGs if you need more than 16 servers • You may also need multiple DAGs depending on your site resilience architecture • If deploying an Active/Active user distribution architecture, then you should consider deploying 2+ DAGs – allows you to control locality and not perform a site activation in the event of a network failure between datacenters
DAG Design Failure model flexibility • Design for all database copies activated • Design for the worst case - server architecture handles 100 percent of all hosted database copies becoming active • Design for targeted failure scenarios • Design server architecture to handle the active mailbox load during the worst failure case you plan to handle • 1 member failure requires 2 or more HA copies and 2 or more servers • 2 member failure requires 3 or more HA copies and 4 or more servers • Requires Set-MailboxServer <Server> -MaximumActiveDatabases <Number>
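The slide does not say where its server minimums come from. One reading, offered here as an assumption, is that the copy count must exceed the number of failed members and the member count must be large enough to keep cluster quorum after those failures. The sketch below checks that reading against the slide's numbers, assuming an even member count gets one extra vote from a file share witness that stays reachable.

```python
def keeps_quorum(members: int, failures: int) -> bool:
    """True if the cluster retains a vote majority after `failures` members fail.

    Assumes the file share witness (present for even member counts) survives.
    """
    total = members if members % 2 == 1 else members + 1
    surviving = total - failures
    return surviving >= total // 2 + 1


def min_members(failures: int) -> int:
    """Smallest member count that still has quorum after `failures` member failures."""
    members = failures + 1
    while not keeps_quorum(members, failures):
        members += 1
    return members


def min_copies(failures: int) -> int:
    """At least one HA copy must remain after every failed member loses its copy."""
    return failures + 1


for failures in (1, 2):
    print(failures, "member failure(s):",
          min_copies(failures), "or more HA copies,",
          min_members(failures), "or more servers")
```

Under these assumptions the output matches the slide: 2 copies and 2 servers for one failure, 3 copies and 4 servers for two.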
DAG Design It’s all in the layout • Consider this scenario • 8 servers, 40 databases with 2 copies
DAG Design It’s all in the layout • If I have a single server failure… • Life is good
DAG Design It’s all in the layout • If I have a double server failure… • Life could be good
DAG Design It’s all in the layout • If I have a double server failure… • Life could be bad
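The layout diagrams for this scenario are not reproduced, so the sketch below uses a hypothetical layout to show why a double server failure can go either way with two copies. The layout (each database's second copy on the next server in a ring) and all names are assumptions for illustration, not the layout from the slides.

```python
from itertools import combinations

SERVERS = 8
DATABASES = 40


def build_layout(servers: int = SERVERS, databases: int = DATABASES) -> dict:
    """Hypothetical 2-copy layout: copy 1 on server i mod N, copy 2 on the next server."""
    return {f"DB{i + 1}": {i % servers, (i + 1) % servers} for i in range(databases)}


def lost_databases(layout: dict, failed) -> list:
    """Databases whose every copy sits on a failed server."""
    failed = set(failed)
    return [db for db, hosts in layout.items() if hosts <= failed]


layout = build_layout()
print("single failure:", len(lost_databases(layout, {0})), "databases offline")
print("servers 0 and 2 fail:", len(lost_databases(layout, {0, 2})), "databases offline")
print("servers 0 and 1 fail:", len(lost_databases(layout, {0, 1})), "databases offline")

# Count how many of the possible two-server failures take a database offline.
pairs = list(combinations(range(SERVERS), 2))
bad_pairs = [pair for pair in pairs if lost_databases(layout, pair)]
print(len(bad_pairs), "of", len(pairs), "double failures lose at least one database")
```

With this particular layout only adjacent server pairs share databases, so most double failures are survivable; a layout that spreads copy pairs differently changes which pairs are dangerous and how many databases each one affects.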
DAG Design It’s all in the layout • Now let’s consider this scenario • 4 servers, 12 databases with 3 copies • With a single server failure: • With a double server failure:
DAG Design It’s all in the layout – Over Subscription • If you plan to over subscribe the servers then: • Don’t plan to be perfect! • Set soft threshold for number of active databases per server • In some circumstances databases will fail to mount because of limit • Put processes in place for redistributing databases per server • After hardware maintenance • After software maintenance • Periodically – because of random failures • SP1 includes a script to provide automated load balancing (RedistributeActiveDatabases.ps1)
DAG Design It’s all in the layout – Over Subscription • If you plan to over subscribe the servers then: • Educate your operations team on implication of over subscription • Periodically validate you are not too over subscribed • Run in your worst case scenario for a period of time • Have a plan on how you handle being too over subscribed • Reminders: • Design storage subsystems to handle all database copy I/O and capacity • Design CPU to handle the max active database copies and the passive copies • Design memory to handle the max active database copies • Design network subsystem to handle the throughput required to sustain the active load, the number of target copies, and CI updates
DAG Design It’s all in the layout • Consider physical hardware situations where practical (JBOD in particular) • If servers in DAG are in multiple racks then spread copies across racks • If servers are in different rooms in datacenter then factor that into distribution • If servers reside on the same network switch/router, then a network failure can take out multiple servers • In summary, minimize possible single points of failure
DAG Design Storage architecture • Deployment on RAID or JBOD will be based on several factors • Cost • Hardware • Number of copies • Types of copies • Single or multi-datacenter
DAG Design Database copy selection concerns • Active Manager determines which copy to activate as follows: • Sorts relevant copies by lowest copy queue length • Breaks any ties based on activation preference • Selects a copy based on our 10-phase inspection (database state, CI health state, copy queue length, replay queue length) • Replication service determines if the database will mount based on AutoDatabaseMountDial • If we fail to mount, Active Manager selects the next best copy
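The ordering described above can be sketched as a two-key sort. This is a simplified illustration, not Active Manager's actual algorithm: the `Copy` class and its fields are hypothetical, and the single `healthy` flag stands in for the multi-phase inspection that the slide only summarizes. It assumes a lower activation preference number means more preferred.

```python
from dataclasses import dataclass


@dataclass
class Copy:
    server: str
    copy_queue_length: int
    activation_preference: int  # assumption: 1 is the most preferred
    healthy: bool = True        # stand-in for the inspection phases


def rank_copies(copies: list) -> list:
    """Order candidates by copy queue length, breaking ties by activation preference."""
    candidates = [c for c in copies if c.healthy]
    return sorted(candidates, key=lambda c: (c.copy_queue_length, c.activation_preference))


copies = [
    Copy("MBX2", copy_queue_length=3, activation_preference=1),
    Copy("MBX3", copy_queue_length=0, activation_preference=3),
    Copy("MBX4", copy_queue_length=0, activation_preference=2),
    Copy("MBX5", copy_queue_length=0, activation_preference=1, healthy=False),
]

# If mounting the first candidate fails, the next one in the list is tried.
for candidate in rank_copies(copies):
    print(candidate.server)
```

Note that MBX2 has the best activation preference but ranks last because its copy queue is longer; preference only breaks ties.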
DAG Design Replication concerns • Replication is always from source to target • Remember if you have multiple copies in a remote datacenter, you will have multiple log streams being shipped across the wire • Exchange 2010 offers compression for log shipping • Controllable setting for the DAG • Default is inter-subnet • MSIT sees 30% compression, but can vary for each customer based on message profile • SP1 adds Continuous Replication Block Mode • Reduces the exposure of data loss on failure by replicating all log writes to passive copies in parallel with their being persisted locally • Only active when replication is up-to-date in terms of copying complete logs
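A back-of-the-envelope estimate shows how the number of remote copies multiplies cross-datacenter log traffic. The sketch assumes 1 MB transaction log files and reads the slide's "30% compression" as a 30% reduction in bytes shipped; the log generation rate in the example is a made-up input, and the function name is illustrative.

```python
def wan_log_traffic_mb(logs_per_hour: float,
                       remote_copies: int,
                       log_size_mb: float = 1.0,          # assumption: 1 MB log files
                       compression_saving: float = 0.30   # assumption: 30% fewer bytes
                       ) -> float:
    """Approximate MB per hour of log shipping sent to a remote datacenter.

    Each remote copy receives its own log stream, so traffic scales with copy count.
    """
    if not 0.0 <= compression_saving < 1.0:
        raise ValueError("compression_saving must be in [0, 1)")
    return logs_per_hour * log_size_mb * remote_copies * (1.0 - compression_saving)


# Hypothetical database generating 600 logs per hour.
for remote_copies in (1, 2):
    traffic = wan_log_traffic_mb(600, remote_copies)
    print(remote_copies, "remote copies:", round(traffic), "MB/hour")
```

This covers log shipping only. Content index updates travel separately and are not compressed, so a sizing exercise should treat them as additional traffic.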
DAG Design Content Indexing concerns • Content index is required for large mailboxes to enable fast search of data • Content index is maintained on both active and passive, but… • The index for a passive copy is updated by getting changes from active copy’s index • This communication is not compressed • How do I size for replication and content indexing impact? • Use the Exchange 2010 Mailbox Server Role Requirements Calculator
DAG Design Replication Networks • Single network DAG members fully supported • Recommendation: have minimum of two networks on each member server • Initial DAG network configuration is based on the enumeration of cluster networks • Cluster enumerates networks based on subnet • One cluster network is created for each subnet / port • Recommendation: Collapse into single MAPI and Replication DAG networks • MAPI network may be replication disabled • Network will be utilized for replication if no other valid replication path exists • There is no preference order to replication networks – uses the least recently used network
DAG Design Small scale architectures • Small scale / branch office architectures that require high availability • 2-4 servers typically • Requires Windows Server Enterprise Edition • There are many different options: * Requires third machine to host File Share Witness
Client Experiences Typical Outlook behavior • All Outlook versions behave consistently in a single datacenter HA scenario • Profile points to Client Access Server array • Profile is unchanged by failovers or loss of CAS • All Outlook versions should behave consistently in a datacenter activation scenario • Primary datacenter Client Access Server DNS name is bound to IP address of standby datacenter’s Client Access Server • Autodiscover continues to hand out primary datacenter CAS name as Outlook RPC endpoint • Profile remains unchanged
Client Experiences Outlook behavior in a cross-site database failover event • In RTM, the default behavior is to perform a direct connect from the CAS array in the first datacenter to the mailbox hosting the active copy in the second datacenter • You can only get a redirect to occur by changing the RPCClientAccessServer property on the database • In SP1 • You can choose to enable or disable cross-site direct connect • You can also define an activation preference for a database which determines whether to perform a direct connect or a redirect • SP1 behavior is based on three properties • Home server property in Outlook • Preferred database site (i.e. the RPCClientAccessServer property) • Active database site
Client Experiences Outlook behavior in a cross-site database failover event (SP1 Direct Connect) [Diagram labels] • Home Server = CAS-PRI • Preferred Database Site = PDC (RPCClientAccessServer = CAS-PRI) • Preferred Database Site = PDC (RPCClientAccessServer = CAS-PRI) • Cross Site Connections = Allowed • Database copies marked Active and Passive
Client Experiences Outlook behavior in a cross-site database failover event (SP1 Redirect) [Diagram labels] • Autodiscover detects profile change and updates client (requires restart) • Home Server = CAS-PRI • Home Server = CAS-SEC • Preferred Database Site = PDC (RPCClientAccessServer = CAS-PRI) • Cross Site Connections = Not Allowed • Preferred Database Site = SDC (RPCClientAccessServer = CAS-SEC) • Database copies marked Active and Passive
Client Experiences Outlook behavior in a cross-site database failover event (Outlook Versions) [Diagram labels] • Outlook 2003 can’t update if source CAS is unavailable • Autodiscover detects profile change and updates client with Home Server = CAS-SEC (requires restart) • Outlook 2003 updates Home Server = CAS-SEC due to ecWrongServer (requires restart) • Autodiscover detects profile change and updates client with Home Server = CAS-SEC (requires restart) • Preferred Database Site = PDC (RPCClientAccessServer = CAS-PRI) • Cross Site Connections = Not Allowed • Database copies marked Active and Passive
Client Experiences Other clients • Other client behavior varies per technology and scenario:
Conclusion • There are many different design dimensions that have to be considered when designing for high availability and site resilience with Exchange 2010 • The choices you will make will determine the number of copies and hardware you deploy • Design choices should be based on customer requirements • Exchange 2010 allows you to take advantage of new options which can lower costs
Related Content • Breakout Sessions • UNC202 – How Microsoft IT Implemented Exchange 2010 – Wed, 1:30pm • UNC301 – Microsoft Exchange Server 2010: Sizing and Performance - Get It Right the First Time – Thurs, 5pm • UNC304 – Microsoft Exchange Server 2010: High Availability Deep Dive – Wed, 9:45am • UNC306 – Going Big! Deploying Large Mailboxes with Exchange 2010 without Breaking the Bank – Thurs, 3:15pm • Interactive Sessions • UNC01-INT – Real-World Database Availability Group (DAG) Design – Tues, 1:30pm • UNC02-INT – Busting Microsoft Exchange Server 2010 Storage Myths! – Tues, 3:15pm • UNC05-INT – Deploying the E2010 CAS Role: Load Balancing & Certificates – Thurs, 1:30pm • Hands-on Labs • UNC02-HOL – Microsoft Exchange Server 2010 High Availability and Storage Scenarios
Unified Communications Track Call to Action! Learn More! • View Related Unified Communications (UNC) Content at TechEd/after at TechEd Online • Visit microsoft.com/communicationsserver for more Communications Server “14” product information • Find additional Communications Server “14” content in the Technical Library, weekly technical articles at NextHop, and follow DrRez on Twitter • Check out Microsoft TechNet resources for Communications Server and Exchange Server • Visit additional Exchange 2010 IT Professional-focused content • Partner Link or Customer Link (Name: ExPro, Pword: EHLO!world) Try It Out! • Exchange 2010 SP1 Beta download is now available from the download center!