Exchange 2010 High Availability. Ideal audience for this workshop: Messaging SME, Network SME, Security SME. During this session, focus on the following: How will we leverage this functionality in our organization? What availability and service level requirements do we …
1. Exchange Deployment Planning Services
Key Message
The goal of this presentation is to provide the audience with a basic understanding of Exchange 2010 High Availability, updated with SP1.
2. Exchange 2010 High Availability Slide Objective: To show the ideal audience for this module.
Instructor Notes:
This is the recommended audience for the module. Do not be overly concerned if the group does not match the ideal audience. During your time onsite you can have conversations with different resources to get questions answered.
3. Exchange 2010 High Availability Slide Objective: To show the audience what to focus on during the module.
Instructor Notes:
Exchange 2010 high availability encompasses many subjects. Ask the audience to think about how their organization manages each aspect of the availability story and how Exchange 2010 could help meet their requirements.
4. Agenda Slide Objective: To explain the overall goals of the Exchange 2010 High Availability module.
Instructor Notes:
5. Exchange Server 2007 Single Copy Clustering
SCC out-of-box provides little high availability value
On Store failure, SCC restarts store on the same machine; no CMS failover
SCC does not automatically recover from storage failures
SCC does not protect your data, your most valuable asset
SCC does not protect against site failures
SCC redundant network is not leveraged by CMS
Conclusion
SCC only provides protection from server hardware failures and bluescreens, the relatively easy components to recover
Supports rolling upgrades without losing redundancy
6. Exchange Server 2007 Continuous Replication Slide Objective: Discuss Continuous Replication.
Instructor Notes:
Basic steps:
Create a target database by seeding the destination with a copy of the source database (this can be accomplished via several ways – log record, streaming copy of database to passive, or offline copy).
Monitor for new logs in the source log directory for copying by subscribing to Windows file system notification events.
Copy any new log files to the destination inspection log directory using SMB.
Inspect the copied log files.
Upon successful inspection, move the copied log file to the storage group copy’s log path and replay the copied log into the copy of the database.
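The copy, inspect, move, and replay steps above can be sketched as one polling pass in Python. This is a minimal sketch under stated assumptions rather than the Replication service's actual logic: seeding is not modeled, directory polling replaces Windows file system notifications, a local `shutil` copy replaces the SMB copy, inspection is reduced to a placeholder, and closed logs are assumed to be named `E` + two hex digits + an eight-hex-digit generation + `.log` (the current log, `E00.log`, is skipped). The names `ship_new_logs` and `inspect_log` are hypothetical.

```python
import os
import re
import shutil

# Assumed naming: closed logs are E<2 hex><8 hex>.log; the current log (E00.log) never matches.
CLOSED_LOG = re.compile(r"^E[0-9A-F]{2}[0-9A-F]{8}\.log$", re.IGNORECASE)


def inspect_log(path):
    """Hypothetical placeholder inspection: accept any non-empty file."""
    return os.path.getsize(path) > 0


def ship_new_logs(source_dir, inspect_dir, log_dir, shipped, replay):
    """One polling pass over the source log directory (hypothetical helper).

    shipped: set of file names already handled (mutated in place).
    replay:  callback invoked with the path of each log moved into log_dir.
    """
    for name in sorted(os.listdir(source_dir)):
        if name in shipped or not CLOSED_LOG.match(name):
            continue  # already shipped, or not a closed log file
        staged = os.path.join(inspect_dir, name)
        shutil.copy2(os.path.join(source_dir, name), staged)  # copy into the inspection directory
        if not inspect_log(staged):
            os.remove(staged)  # discard; the next pass will copy it again
            continue
        final = os.path.join(log_dir, name)
        shutil.move(staged, final)  # move into the copy's log path
        replay(final)               # replay into the database copy
        shipped.add(name)
```

Calling `ship_new_logs` repeatedly with the same `shipped` set approximates the continuous loop: each pass picks up only closed logs that have not been handled yet.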
7. Exchange Server 2007 HA Solution (CCR + SCR) Slide Objective: Discuss the Exchange Server 2007 HA/SR solution.
Instructor Notes:
Key things to call out
CMS cannot co-exist with other roles
Clustered Exchange requires both Exchange and Windows Failover Clustering knowledge.
Failovers occur at the server level.
Site resilience is outside of clustering and therefore requires manual activation.
8. Exchange 2010 High Availability Goals
Reduce complexity
Reduce cost
Native solution - no single point of failure
Improve recovery times
Support larger mailboxes
Support large-scale deployments
Slide Objective:
Instructor Notes:
9. Slide Objective: Discuss Exchange 2010 high availability architecture.
Instructor Notes:
NOTE: “You can't replicate outside the DAG.” (key difference from SCR)
Here is Harv’s current Exchange environment.
There are 5 servers <click> in the main datacenter that host mailboxes. These mailbox servers are grouped to provide automatic failover. Each mailbox database has 3 instances, which we’ll refer to as copies, <click> placed on separate servers to provide redundancy. At any given time, only one of the three database copies is active <click> and accessible to clients.
The Front-End Server <click> manages all communications between clients and databases. Outlook clients no longer connect directly to mailbox servers, as they did in previous versions of Exchange.
When a client such as Outlook connects <click> to Exchange, it first contacts the Front-End Server.
The Front-End Server determines <click> where the user’s active database is located, and forwards the request <click> to the appropriate server.
When the client sends an e-mail <click>, the active database is updated. Then, through log shipping <click>, the other 2 passive copies of the database are updated.
Let’s say that a disk fails <click>, affecting one of the databases on Mailbox Server 2. In previous versions of Exchange, the administrator would need to fail over all the databases on Mailbox Server 2 to recover from this failure, or else restore Database 2 from a tape backup. However, Exchange’s new architecture supports database-level failover, so Database 2 automatically fails over to Mailbox Server 1 <click> without affecting the other databases.
The Outlook client, having lost its connection to the database, automatically contacts <click> the Front-End Server to reconnect.
The Front-End Server determines which mailbox server has the active copy of the user’s database. It connects <click> the client to Mailbox Server 1.
When new mail is sent <click>, the active database on Mailbox Server 1 is updated. The second copy of the database <click> is also updated through log shipping. The end user is unaware that anything has happened, and Harv can replace the failed disk drive at his leisure.
The administrator can set up any number of copies per database to meet the Service Level Agreements for his users. For a special category of users, Harv keeps a 4th database copy on a mail server in a geographically remote location <click>. This server is located in a different Active Directory site, but is kept up-to-date over the Wide Area Network using the same replication technology as the other servers. If a hurricane, earthquake, or other catastrophe should shut down the main datacenter, this remote server can be manually activated and readied for client access in about 15 minutes.
Note: If a focus group participant asks “How is this different from Microsoft Clustering Services?”, here’s the answer: “The product team is taking technology from Microsoft Clustering Services and integrating it natively into Exchange Server (no separate management tools required). Would you find value in that? Why or why not?”
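Harv's scenario (a database-level failover that leaves the other databases on the same server alone, after which the Front-End Server sends the client to the new active copy) can be modeled in a few lines of Python. This is a toy sketch: the class `DagModel`, the server and database names, and the rule "activate the first surviving copy in preference order" are hypothetical simplifications, not Active Manager's actual selection logic.

```python
from dataclasses import dataclass, field


@dataclass
class DagModel:
    """Hypothetical toy model of database copies spread across DAG members."""
    copies: dict  # database -> servers hosting a copy, in (assumed) preference order
    active: dict  # database -> server currently hosting the active copy
    failed: set = field(default_factory=set)  # (database, server) copies that are down

    def locate(self, database):
        """What the Front-End Server asks: which server has the active copy?"""
        return self.active[database]

    def fail_copy(self, database, server):
        """Mark one copy failed; fail over only that database if it was active."""
        self.failed.add((database, server))
        if self.active[database] != server:
            return self.active[database]  # a passive copy failed; nothing moves
        for candidate in self.copies[database]:
            if (database, candidate) not in self.failed:
                self.active[database] = candidate
                return candidate
        raise RuntimeError(f"no surviving copy of {database}")


# Usage: a disk failure takes out Database 2's copy on Mailbox Server 2.
dag = DagModel(
    copies={"DB1": ["MBX1", "MBX2", "MBX3"],
            "DB2": ["MBX2", "MBX1", "MBX4"],
            "DB3": ["MBX2", "MBX3", "MBX5"]},
    active={"DB1": "MBX1", "DB2": "MBX2", "DB3": "MBX2"},
)
dag.fail_copy("DB2", "MBX2")
print(dag.locate("DB2"), dag.locate("DB3"))  # MBX1 MBX2
```

Only DB2 moves; DB3 stays active on MBX2, which is the point of failing over per database rather than per server.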
10. Exchange 2010 High Availability Fundamentals
Database Availability Group
Server
Database
Database Copy
Active Manager
RPC Client Access service
Slide Objective:
Instructor Notes:
11. Exchange 2010 High Availability Fundamentals: Database Availability Group
A group of up to 16 servers hosting a set of replicated databases
Wraps a Windows Failover Cluster
Manages servers’ membership in the group
Heartbeats servers, quorum, cluster database
Defines the boundary of database replication
Defines the boundary of failover/switchover (*over)
Defines boundary for DAG’s Active Manager
Slide Objective:
Instructor Notes:
Defines properties applicable to the DAG or all servers, e.g.:
File Share Witness
List of servers
Network compression
Network encryption
Supports multiple networks
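The membership rules on this slide can be sketched as a small Python configuration object. This is an illustrative sketch, not Exchange's configuration model: `DagConfig`, `add_server`, and the witness path format are hypothetical; the 16-server cap comes from the slide, and the "one DAG per server" rule reflects the single "Server’s Database Availability Group" property listed on the server slides.

```python
from dataclasses import dataclass, field

MAX_DAG_MEMBERS = 16  # "a group of up to 16 servers"


@dataclass
class DagConfig:
    """Hypothetical DAG-wide settings plus its member list."""
    name: str
    witness_share: str  # file share witness location (hypothetical value format)
    members: list = field(default_factory=list)

    def add_server(self, server, membership):
        """Add a Mailbox server to this DAG.

        membership: dict mapping server -> DAG name, shared across DAGs,
        because a server belongs to at most one DAG at a time.
        """
        if server in self.members:
            return
        if membership.get(server) not in (None, self.name):
            raise ValueError(f"{server} already belongs to {membership[server]}")
        if len(self.members) >= MAX_DAG_MEMBERS:
            raise ValueError(f"{self.name} is full ({MAX_DAG_MEMBERS} servers)")
        self.members.append(server)
        membership[server] = self.name
```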
12. Exchange 2010 High Availability Fundamentals: Server
Unit of membership for a DAG
Hosts the active and passive copies of multiple mailbox databases
Executes Information Store, CI, Assistants, etc., services on active mailbox database copies
Executes replication services on passive mailbox database copies
Slide Objective:
Instructor Notes:
13. Exchange 2010 High Availability Fundamentals: Server (Continued)
Provides connection point between Information Store and RPC Client Access
Very few server-level properties relevant to HA
Server’s Database Availability Group
Server’s Activation Policy
Slide Objective:
Instructor Notes:
14. Exchange 2010 High Availability Fundamentals: Mailbox Database
Unit of *over
A database has 1 active copy – active copy can be mounted or dismounted
Maximum # of passive copies == # servers in DAG – 1
Slide Objective:
Instructor Notes:
15. Exchange 2010 High Availability Fundamentals: Mailbox Database (Continued)
Database *overs take ~30 seconds
Server failover/switchover involves moving all active databases to one or more other servers
Database names are unique across a forest
Defines properties relevant at the database level
GUID: a Database’s unique ID
EdbFilePath: path at which copies are located
Servers: list of servers hosting copies
Slide Objective:
Instructor Notes:
Because database names must be unique across the forest, many customers have described naming schemes that make the transition relatively straightforward. For example, a name might include acronyms for the forest, AD site(s), DAGs, physical racks, disk slots, etc.
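A naming scheme like the one described above, plus a forest-wide uniqueness check, might look like the Python sketch below. The `FOREST-SITE-DAG-Rrr-Sss` pattern, the function names, and the case-insensitive comparison are all hypothetical choices for illustration, not a Microsoft-prescribed format.

```python
def database_name(forest, site, dag, rack, slot):
    """Hypothetical scheme: FOREST-SITE-DAG-Rrr-Sss, e.g. rack 3, slot 7 -> R03-S07."""
    return f"{forest}-{site}-{dag}-R{rack:02d}-S{slot:02d}".upper()


def register_database(name, forest_names):
    """Reserve a name forest-wide; the comparison is case-insensitive (an assumption)."""
    key = name.casefold()
    if key in forest_names:
        raise ValueError(f"database name {name!r} is already used in this forest")
    forest_names.add(key)
    return name
```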
16. Exchange 2010 High Availability Fundamentals: Active/Passive vs. Source/Target
Availability Terms
Active: Selected to provide email services to clients
Passive: Available to provide email services to clients if active fails
Replication Terms
Source: Provides data for copying to a separate location
Target: Receives data from the source
Slide Objective:
Instructor Notes:
17. Exchange 2010 High Availability Fundamentals: Mailbox Database Copy
Scope of replication
A copy is either source or target of replication at any given time
A copy is either active or passive at any given time
Only 1 copy of each database in a DAG is active at a time
A server may not host >1 copy of any database
Slide Objective:
Instructor Notes:
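The copy rules from this slide and the previous one can be expressed as a single validation function. This Python sketch is illustrative only; `check_database_copies` and the `(server, role)` representation are hypothetical. It also shows why "maximum # of passive copies == # servers in DAG – 1" holds: with at most one copy per DAG member and exactly one active copy, every other member can host at most one passive copy.

```python
def check_database_copies(database, copies, dag_members):
    """Validate the copy rules for one database and return its passive-copy count.

    copies: list of (server, role) pairs, role being "active" or "passive" (hypothetical format).
    dag_members: set of servers in the DAG, which is the replication boundary.
    """
    servers = [server for server, _ in copies]
    outside = set(servers) - set(dag_members)
    if outside:
        raise ValueError(f"{database}: copies outside the DAG on {sorted(outside)}")
    if len(servers) != len(set(servers)):
        raise ValueError(f"{database}: a server hosts more than one copy")
    active = [server for server, role in copies if role == "active"]
    if len(active) != 1:
        raise ValueError(f"{database}: expected exactly 1 active copy, found {len(active)}")
    passive = len(copies) - 1
    # Follows from the checks above: one copy per DAG member, exactly one of them active.
    assert passive <= len(dag_members) - 1
    return passive
```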
18. Exchange 2010 High Availability Fundamentals: Mailbox Database Copy
Defines properties applicable to an individual database copy
Copy status: Healthy, Initializing, Failed, Mounted, Dismounted, Disconnected, Suspended, FailedAndSuspended, Resynchronizing, Seeding
CopyQueueLength
ReplayQueueLength
Slide Objective:
Instructor Notes:
System tracks health of each copy
Active definition of health – Is Information Store capable of providing email service against it?
Passive definition of health – Is the Replication service able to copy and inspect logs and have them replayed into the passive copy?
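The two health definitions above can be sketched as a rough Python verdict function. CopyQueueLength counts logs not yet copied to a passive copy, and ReplayQueueLength counts copied logs not yet replayed into it. The function name and both queue thresholds are hypothetical; a real deployment (for example, one with intentionally lagged copies) would tune or ignore them.

```python
ACTIVE_HEALTHY = {"Mounted"}
PASSIVE_HEALTHY = {"Healthy"}


def copy_is_healthy(is_active, status, copy_queue_length, replay_queue_length,
                    max_copy_queue=10, max_replay_queue=100):
    """Rough health verdict for one database copy (thresholds are hypothetical)."""
    if is_active:
        # Active: can the Information Store provide email service against it?
        return status in ACTIVE_HEALTHY
    # Passive: are logs being copied, inspected and replayed without piling up?
    return (status in PASSIVE_HEALTHY
            and copy_queue_length <= max_copy_queue
            and replay_queue_length <= max_replay_queue)
```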
19. Exchange 2010 High Availability Fundamentals: Active Manager
Exchange-aware resource manager (high availability’s brain)
Runs on every server in the DAG
Manages which copies should be active and which should be passive
Definitive source of information on where a database is active or mounted
Provides this information to other Exchange components (e.g., RPC Client Access and Hub Transport)
Information stored in cluster database
Slide Objective:
Instructor Notes:
Active Directory is primary source for configuration information
Active Manager is primary source for changeable state information such as active and mounted
In Exchange 2010, the Microsoft Exchange Replication service periodically monitors the health of all mounted databases. It also monitors ESE for any I/O errors or failures. When the service detects a failure, it notifies Active Manager. Active Manager then determines which database copy should be mounted and what is required to mount that database. In addition, Active Manager tracks the active copy of a mailbox database (based on the last mounted copy of the database) and provides the tracking results to the RPC Client Access component on the Client Access server to which the client is connected.
When an administrator makes a database copy the active mailbox database, this process is known as a switchover. When a failure affecting a database occurs and a different database copy becomes active, this process is known as a failover. The term also covers a server failure, in which one or more servers bring online the databases that were previously active on the failed server. When either a switchover or a failover occurs, other Exchange 2010 server roles become aware of it almost immediately and redirect client and messaging traffic to the new active database.
For example, if an active database in a DAG fails because of an underlying storage failure, Active Manager will automatically recover by failing over to a database copy on another Mailbox server in the DAG. If the database falls outside the automatic mount criteria and cannot be mounted automatically, an administrator can perform a database failover manually.
PAM (Primary Active Manager): The Active Manager instance in the DAG that decides which copies will be active and passive. The PAM role moves to another server if the server hosting it is no longer able to host it. You need to move the PAM if you take the server hosting it offline for maintenance or upgrade.
SAM (Standby Active Manager): Provides information on which server hosts the active copy of a mailbox database to other components of Exchange, e.g., RPC Client Access service or Hub Transport.
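The switchover/failover distinction above can be illustrated with a simplified Python sketch of the decision Active Manager makes. This is not Exchange's actual best-copy selection, which weighs more factors; the `CopyState` fields, the function name, and the ordering (fewest logs still to copy, then shortest replay backlog, then activation preference, with 1 treated as most preferred) are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class CopyState:
    """Hypothetical snapshot of one candidate database copy."""
    server: str
    healthy: bool
    copy_queue_length: int      # logs not yet copied to this copy
    replay_queue_length: int    # logs copied but not yet replayed
    activation_preference: int  # 1 = most preferred (assumed convention)


def choose_copy_to_activate(candidates, target_server=None):
    """Pick the copy to activate, or None if no suitable copy exists.

    Switchover: an administrator names target_server.
    Failover: no target is given, so choose among healthy copies, preferring the
    fewest missing logs, then the shortest replay backlog, then activation preference.
    """
    healthy = [c for c in candidates if c.healthy]
    if target_server is not None:
        for c in healthy:
            if c.server == target_server:
                return c.server
        return None  # requested target is not a healthy copy
    if not healthy:
        return None  # nothing to mount automatically; an administrator must intervene
    best = min(healthy, key=lambda c: (c.copy_queue_length, c.replay_queue_length,
                                       c.activation_preference))
    return best.server
```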
20. Exchange 2010 High Availability Fundamentals: Active Manager
Active Directory is still primary source for configuration info
Active Manager is primary source for changeable state information (such as active and mounted)
Replication service monitors health of all mounted databases, and monitors ESE for I/O errors or failure
Slide Objective:
Instructor Notes:
21. Exchange 2010 High Availability Fundamentals: Continuous Replication
Continuous replication has the following basic steps:
Database copy seeding of target
Log copying from source to target
Log inspection at target
Log replay into database copy
Slide Objective:
Instructor Notes:
Slide Objective:
Instructor Notes:
22. Exchange 2010 High Availability Fundamentals: Database Seeding
There are several ways to seed the target instance:
Automatic Seeding
Update-MailboxDatabaseCopy cmdlet
Can be performed from active or passive copies
Manually copy the database
Backup and restore (VSS)
Slide Objective:
Instructor Notes:
Now that we have a basic understanding of the core components that are involved, we can discuss how log shipping works. Before log files can be shipped to a passive copy, the passive copy must first be seeded with a copy of the database. This can be accomplished in a few ways.
Automatic seeding: An automatic seed produces a copy of a database in the target location. Automatic seeding requires that all log files, including the very first log file created by the database (it contains the database creation log record), be available on the source. Automatic seeding only occurs during the creation of a new server or a new database (or if the first log still exists, i.e., log truncation hasn’t occurred).
Seeding using the Update-MailboxDatabaseCopy cmdlet: You can use the Update-MailboxDatabaseCopy cmdlet in the Exchange Management Shell to seed a database copy. This option uses the streaming copy backup API to copy the database from the active location to the target location.
Manually copying the offline database: This process dismounts the database and copies the database file to the same location on the passive node. If you use this method, there will be an interruption in service because the procedure requires you to dismount the database.
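The trade-offs between the seeding options above can be summarized as a small decision function. This Python sketch only restates the notes: the function name and the `can_stream_over_network` parameter are hypothetical, and treating the database's first log as generation 1 is an assumption.

```python
def choose_seeding_method(oldest_log_generation_on_source, can_stream_over_network=True):
    """Suggest a seeding method for a new passive copy (illustrative only).

    Automatic seeding needs every log back to the first one the database created
    (assumed here to be generation 1); otherwise the copy must be seeded another way.
    """
    if oldest_log_generation_on_source == 1:
        return "automatic seeding"
    if can_stream_over_network:
        # Streaming copy via the backup API; the source database stays mounted.
        return "Update-MailboxDatabaseCopy (streaming copy)"
    # Offline file copy: service is interrupted while the database is dismounted.
    return "manual offline copy (requires dismounting the database)"
```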
23. Exchange 2010 High Availability Fundamentals: Log Shipping
Log shipping in Exchange 2010 leverages TCP sockets
Supports encryption and compression
Administrator can set TCP port to be used
Replication service on target notifies the active instance of the next log file it expects
Based on last log file which it inspected
Replication service on source responds by sending the required log file(s)
Copied log files are placed in the target’s Inspector directory
Slide Objective:
Instructor Notes:
Exchange Server 2007 used SMB and Windows file system notifications to get logs. Exchange 2010 uses TCP sockets, and the target notifies the source about which logs it requires.
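The pull model described on this slide (the target asks for the generation after the last one it inspected, and the source answers with every log from there on) can be sketched in Python. The classes `LogSource` and `LogTarget` are hypothetical; the exchange happens in memory rather than over a TCP socket, encryption is omitted, and `zlib` merely stands in for whatever compression Exchange actually applies.

```python
import zlib


class LogSource:
    """Active copy side (hypothetical): holds closed logs keyed by generation."""

    def __init__(self, logs):
        self.logs = dict(logs)  # generation -> raw log bytes

    def serve(self, next_generation):
        """Return (generation, compressed bytes) for every log from next_generation on."""
        batch = []
        gen = next_generation
        while gen in self.logs:
            batch.append((gen, zlib.compress(self.logs[gen])))
            gen += 1
        return batch


class LogTarget:
    """Passive copy side (hypothetical): asks for the log after the last one it inspected."""

    def __init__(self):
        self.last_inspected = 0
        self.inspector_dir = {}  # generation -> raw log bytes awaiting inspection/replay

    def pull(self, source):
        received = source.serve(self.last_inspected + 1)
        for gen, blob in received:
            self.inspector_dir[gen] = zlib.decompress(blob)
            self.last_inspected = gen  # inspection itself is omitted in this sketch
        return [gen for gen, _ in received]
```

Because the target always names the next generation it needs, a restarted target simply resumes from its last inspected log instead of relying on file system notifications.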
24. Exchange 2010 High Availability Fundamentals: Log Inspection
The following actions are performed to verify the log file before replay:
Physical integrity inspection
Header inspection
Move any Exx.log files that exist on the target (if it was previously a source) to the ExxOutofDate folder
If inspection fails, the file will be recopied and inspected (up to 3 times)
If the log file passes inspection, it is moved into the database copy’s log directory
Slide Objective:
Instructor Notes:
The following actions are performed by LogInspector:
Physical integrity inspection: This validation utilizes ESEUTIL /K against the log file and validates that the checksum recorded in the log file matches the checksum generated in memory.
Header inspection: The Replication service validates the following aspects of the log file’s header:
The generation is not higher than the highest generation recorded for the database in question.
The generation recorded in the log header matches the generation recorded in the log filename.
The log file signature recorded in the log header matches that of the log file.
Removal of Exx.log: Before the inspected log file can be moved into the log folder, the Replication service needs to remove any Exx.log files. These files are moved into another subdirectory of the log directory, the ExxOutofDate directory. An Exx.log file would only exist on the target if it was previously running as a source. The Exx.log file needs to be removed before log replay occurs because it contains old data that has been superseded by a full log file with the same generation. If the closed log file is not a superset of the existing Exx.log file, an incremental or full reseed is required.
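The physical and header checks above, plus the recopy-and-retry rule, can be sketched in Python. Several details are assumptions: `LogFile`, `inspect_log`, and `copy_and_inspect` are hypothetical names; `zlib.crc32` stands in for the ESE checksum that ESEUTIL /K verifies; log names are assumed to be `E` + two hex digits + an eight-hex-digit generation + `.log`; "up to 3 times" is read as three attempts in total; and the signature check compares against an expected signature supplied by the caller.

```python
import re
import zlib
from dataclasses import dataclass

# Assumed log name format: E + 2-hex-digit prefix + 8-hex-digit generation + .log
LOG_NAME = re.compile(r"^E[0-9A-F]{2}([0-9A-F]{8})\.log$", re.IGNORECASE)


@dataclass
class LogFile:
    """Hypothetical in-memory view of a copied log file."""
    name: str
    header_generation: int
    header_signature: str
    payload: bytes
    stored_checksum: int  # zlib.crc32 stands in for the real ESE checksum


def inspect_log(log, highest_generation, expected_signature):
    """Return None if the log passes every check, else the reason it failed."""
    if zlib.crc32(log.payload) != log.stored_checksum:
        return "physical integrity: checksum mismatch"
    match = LOG_NAME.match(log.name)
    if match is None:
        return "unexpected file name"
    if log.header_generation > highest_generation:
        return "header: generation higher than expected"
    if int(match.group(1), 16) != log.header_generation:
        return "header: generation does not match file name"
    if log.header_signature != expected_signature:
        return "header: log signature mismatch"
    return None


def copy_and_inspect(fetch, highest_generation, expected_signature, attempts=3):
    """Recopy (via fetch) and re-inspect until a copy passes or attempts run out."""
    for _ in range(attempts):
        log = fetch()
        if inspect_log(log, highest_generation, expected_signature) is None:
            return log
    return None
```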
25. Exchange 2010 High Availability Fundamentals: Log Replay
Log replay has moved to the Information Store
The following validation tests are performed prior to log replay:
Recalculate the required log generations by inspecting the database header
Determine the highest generation that is present in the log directory to ensure that a log file exists
Compare the highest log generation that is present in the directory to the highest log file that is required
Make sure the logs form the correct sequence
Query the checkpoint file, if one exists
Replay the log file using a special recovery mode (undo phase is skipped)
Instructor Notes:
Log Replay
After the log files have been inspected, they are placed within the log directory so that they can be replayed in the database copy. Before the Replication service replays the log files, it performs a series of validation tests.
It recalculates the required log generations necessary for log replay to be successful. This check determines the highest and lowest log generations required by inspecting the database header.
The database header contains a log signature and this will be compared with the log file’s signature to ensure they match.
The database header will report one of two conditions: either the database is in a consistent state (Clean Shutdown) or an inconsistent state (Dirty Shutdown). If the database is consistent, the log position (LGPOS) of the next log file to be committed is used to derive the lowest and highest required log generations. If the database is in an inconsistent state, the “Required Log” stamp defines the lowest and highest required generations. In addition, the backup information stored in the database header is reviewed to determine the last generation that was backed up.
It determines the highest generation that is present in the log directory to ensure that a log file exists. If there are no log files, then replay cannot continue.
Compare the highest log generation that is present in the directory to the highest log generation that is required. As long as the highest generation present is equal to or higher than the highest required generation, replay can continue.
Make sure the logs form the correct sequence. Checking all the log files in the directory is a slow process, so only the required log files are validated. This is done to determine if recovery will fail immediately (the assumption being that if recovery fails on generation N, the database header will require generation N-1). In addition to validating that all the required log files are available, the following additional validation steps are performed:
Ensures the generation of a log file matches the sequence expectation.
Ensures the log file's signature is correct.
Ensures the creation time of the log file forms a sequence with previous log files.
Query the checkpoint file, if one exists.
If the checkpoint file is too far ahead (it defines a generation that is higher than the lowest required generation), the checkpoint file will be deleted.
If the checkpoint is too far behind and points to a non-existent file, then the checkpoint file will be deleted.
Once these validation checks have been completed, the Replication service will replay the log iteration. This is actually a special recovery mode, which is different from the replay performed by ESEUTIL /R. For more information, see Eseutil /R Recovery Mode. Typically when ESEUTIL /R is executed, it replays all available log files including the Exx.log file, thus ensuring all transactions are either committed to the database or rolled back. However, since in continuous replication the replication engine is always one log file behind (the active Exx.log file), the undo phase of recovery (where uncommitted transactions are rolled back) is skipped to ensure database divergence does not occur.
After the log file is replayed, the following additional steps are performed.
Validate that the required generations in the database header were updated. There are certain situations in which the database header will not get updated after a log replay event. This occurs when the log file that was replayed only contains termination and initialization records. In this scenario, however, the database should be in a consistent state.
Execute the LogTruncater phase to remove any unnecessary logs from the passive and the active path. This call reads the database header for full and incremental backup information, so replays are blocked for this duration.
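The pre-replay checks described in these notes can be sketched in Python. This is an illustrative model under stated assumptions, not Exchange code: the LogFile type and both helper functions are hypothetical, the required range (low_required, high_required) is taken as already read from the database header, and signatures are opaque strings.

```python
from dataclasses import dataclass


@dataclass
class LogFile:
    generation: int
    signature: str
    created: float  # creation time, e.g. a POSIX timestamp


def check_replay_preconditions(low_required: int, high_required: int,
                               db_log_signature: str, logs: list) -> tuple:
    """Return (ok, reason) for the pre-replay checks described in the notes."""
    if not logs:
        return False, "no log files present"
    by_generation = {log.generation: log for log in logs}
    if max(by_generation) < high_required:
        return False, "highest present generation is below highest required"
    previous = None
    # Only the required generations are validated, not every file in the folder.
    for generation in range(low_required, high_required + 1):
        log = by_generation.get(generation)
        if log is None:
            return False, f"required generation {generation} is missing"
        if log.signature != db_log_signature:
            return False, f"signature mismatch at generation {generation}"
        if previous is not None and log.created < previous.created:
            return False, f"creation time out of sequence at generation {generation}"
        previous = log
    return True, "ok"


def keep_checkpoint(checkpoint_generation: int, low_required: int,
                    present_generations: set) -> bool:
    """False means the checkpoint file would be deleted."""
    if checkpoint_generation > low_required:  # checkpoint is too far ahead
        return False
    # Too far behind: the generation it points to no longer exists.
    return checkpoint_generation in present_generations
```

Checking only the required range mirrors the note that validating every file in the directory would be too slow.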
26. Exchange 2010 High Availability Fundamentals: Lossy Failure Process
In the event of failure, the following steps will occur for the failed database:
Active Manager will determine the best copy to activate
The Replication service on the target server will attempt to copy missing log files from the source (ACLL)
If successful, then the database will mount with zero data loss
If unsuccessful (lossy failure), then the database will mount based on the AutoDatabaseMountDial setting
The mounted database will generate new log files (using the same log generation sequence)
Transport Dumpster requests will be initiated for the mounted database to recover lost messages
When original server or database recovers, it will run through divergence detection and perform an incremental reseed or require a full reseed
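In the lossy case, the mount decision comes down to comparing the number of logs that ACLL could not copy with the AutoDatabaseMountDial threshold. Below is a minimal Python sketch. It assumes the Exchange 2010 thresholds of 0 logs for Lossless, 6 for GoodAvailability and 12 for BestAvailability, and the function name is hypothetical.

```python
# Assumed log-count thresholds for the named AutoDatabaseMountDial settings.
MOUNT_DIAL_LOGS = {"Lossless": 0, "GoodAvailability": 6, "BestAvailability": 12}


def can_auto_mount(copy_queue_length: int, dial) -> bool:
    """Decide whether a copy may mount automatically after ACLL.

    copy_queue_length: logs still missing after Attempt Copy Last Logs.
    dial: a named setting or a custom integer number of logs.
    """
    limit = MOUNT_DIAL_LOGS[dial] if isinstance(dial, str) else int(dial)
    return copy_queue_length <= limit
```

If this returns False, the next best copy would be tried, as described in the Best Copy Selection slides.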
27. Exchange 2010 High Availability Fundamentals: Backups
Streaming backup APIs for public use have been removed; VSS must be used for backups
Back up from any copy of the database/logs
Always choose the passive (or active) copy
Back up an entire server
Designate a dedicated backup server for a given database
Restore from any of these backup scenarios
28. Multiple Database Copies Enable New Scenarios Exchange 2010 HA
E-mail archive
Extended/protected dumpster retention
29. Mailbox Database Copies Create up to 16 copies of each mailbox database
Each mailbox database must have a unique name within the organization
Mailbox database objects are global configuration objects
All mailbox database copies use the same GUID
No longer connected to specific Mailbox servers
30. Mailbox Database Copies Each DAG member can host only one copy of a given mailbox database
Database path and log folder path for copy must be identical on all members
Copies have settable properties
Activation Preference
RTM: Used as second sort key during best copy selection
SP1: Used for distributing active databases; used as primary sorting key when using Lossless mount dial
Replay Lag and Truncation Lag
Using these features affects your storage design
31. Lagged Database Copies A lagged copy is a passive database copy with a replay lag time greater than 0
Lagged copies are only for point-in-time protection, but they are not a replacement for point-in-time backups
Logical corruption and/or mailbox deletion prevention scenarios
Provide a maximum of 14 days of protection
When should you deploy a lagged copy?
Useful only to mitigate a risk
May not be needed if deploying a backup solution (e.g., DPM 2010)
Lagged copies are not HA database copies
Lagged copies should never be automatically activated by system
Steps for manual activation documented at http://technet.microsoft.com/en-us/library/dd979786.aspx
Lagged copies affect your storage design
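To make the replay lag concrete, here is a simplified Python model of which logs a lagged copy would replay. It assumes that the lag is measured from the time each log arrived on the passive copy and that logs replay strictly in order. The function name and inputs are hypothetical, and the only figure taken from the slide is the 14-day maximum.

```python
from datetime import datetime, timedelta

MAX_REPLAY_LAG = timedelta(days=14)  # maximum protection window stated on the slide


def replayable_generations(copied_at: dict, replay_lag: timedelta, now: datetime) -> list:
    """Return the generations old enough to replay on a lagged copy.

    copied_at maps log generation -> time the log arrived on the passive copy.
    Logs replay in order, so replay stops at the first log still inside the lag window.
    """
    if not timedelta(0) < replay_lag <= MAX_REPLAY_LAG:
        raise ValueError("a lagged copy needs 0 < replay lag <= 14 days")
    cutoff = now - replay_lag
    ready = []
    for generation in sorted(copied_at):
        if copied_at[generation] > cutoff:
            break
        ready.append(generation)
    return ready
```

Everything past the cutoff stays unreplayed on disk. This is why lagged copies affect the storage design.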
32. DAG Design: Two Failure Models
Design for all database copies activated
Design for the worst case - server architecture handles 100 percent of all hosted database copies becoming active
Design for targeted failure scenarios
Design server architecture to handle the active mailbox load during the worst failure case you plan to handle
1 member failure requires 2 or more HA copies and 2 or more servers
2 member failure requires 3 or more HA copies and 4 or more servers
Requires Set-MailboxServer <Server> -MaximumActiveDatabases <Number>
33. DAG Design: It’s all in the layout
Consider this scenario
8 servers, 40 databases with 2 copies
34. DAG Design: It’s all in the layout
If I have a single server failure
Life is good
35. DAG Design: It’s all in the layout
If I have a double server failure
Life could be good…
36. DAG Design: It’s all in the layout
If I have a double server failure
Life could be bad…
37. DAG Design: It’s all in the layout
Now let’s consider this scenario
4 servers, 12 databases with 3 copies
With a single server failure:
With a double server failure:
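The layout diagrams are not reproduced here. The following Python sketch therefore uses a small hypothetical four-database layout to show one way of checking a layout against failure scenarios. As a simplification, it assumes that each database activates on its most-preferred surviving copy; real best copy selection also weighs queue lengths and health. simulate_failure is a hypothetical helper.

```python
from collections import Counter


def simulate_failure(layout: dict, failed: set) -> tuple:
    """Return (active databases per surviving server, databases with no surviving copy).

    layout maps database -> servers hosting a copy, in activation-preference order.
    """
    active = Counter()
    offline = []
    for database, servers in layout.items():
        survivors = [server for server in servers if server not in failed]
        if survivors:
            active[survivors[0]] += 1  # most-preferred surviving copy becomes active
        else:
            offline.append(database)
    return dict(active), sorted(offline)


# Hypothetical two-copy layout on four servers.
LAYOUT = {
    "DB1": ["S1", "S2"],
    "DB2": ["S2", "S1"],
    "DB3": ["S3", "S4"],
    "DB4": ["S4", "S3"],
}
```

With this layout, losing S1 and S3 keeps every database online, while losing S1 and S2 takes DB1 and DB2 offline. This mirrors the "life could be good" and "life could be bad" slides. The per-server counts can also be compared against the planned MaximumActiveDatabases value.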
38. Deep Dive on Exchange 2010 High Availability Basics
Quorum
Witness
DAG Lifecycle
DAG Networks
Please note: the delivery consultant should select the appropriate deep dive topics for presentation based on the customer situation and their own technical depth.
39. Quorum
40. Quorum Used to ensure that only one subset of members is functioning at one time
A majority of members must be active and have communications with each other
Represents a shared view of members (voters and some resources)
Dual Usage
Data shared between the voters representing configuration, etc.
Number of voters required for the solution to stay running (majority); quorum is a consensus of voters
When a majority of voters can communicate with each other, the cluster has quorum
When a majority of voters cannot communicate with each other, the cluster does not have quorum
41. Quorum Quorum is not only necessary for cluster functions, but it is also necessary for DAG functions
In order for a DAG member to mount and activate databases, it must participate in quorum
Exchange 2010 uses only two of the four available cluster quorum models
Node Majority (DAGs with an odd number of members)
Node and File Share Majority (DAGs with an even number of members)
Quorum = floor(N/2) + 1 (N/2 is rounded down to a whole number)
6 members: (6/2) + 1 = 4 votes for quorum (can lose 3 voters, counting the witness)
9 members: floor(9/2) + 1 = 5 votes for quorum (can lose 4 voters)
13 members: floor(13/2) + 1 = 7 votes for quorum (can lose 6 voters)
15 members: floor(15/2) + 1 = 8 votes for quorum (can lose 7 voters)
N = number of nodes in cluster
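The quorum arithmetic above can be checked with a few lines of Python. The sketch assumes, as the slide implies, that even-sized DAGs use Node and File Share Majority, so the witness adds one vote. The function names are illustrative only.

```python
def total_votes(members: int) -> int:
    """Even DAGs use Node and File Share Majority, so the witness adds one vote."""
    return members + 1 if members % 2 == 0 else members


def votes_for_quorum(members: int) -> int:
    """Quorum = floor(N/2) + 1, where N is the number of DAG members."""
    return members // 2 + 1


def tolerable_losses(members: int) -> int:
    """How many voters (witness included) can be lost while keeping quorum."""
    return total_votes(members) - votes_for_quorum(members)


if __name__ == "__main__":
    for n in (6, 9, 13, 15):
        print(n, votes_for_quorum(n), tolerable_losses(n))
```

Running it prints 6 4 3, 9 5 4, 13 7 6 and 15 8 7, which reproduces the four rows above.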
42. Witness and Witness Server
43. Witness A witness is a share on a server that is external to the DAG that participates in quorum by providing a weighted vote for the DAG member that has a lock on the witness.log file
Used only by DAGs that have an even number of members
Witness server does not maintain a full copy of quorum data and is not a member of the DAG or cluster
44. Witness Represented by File Share Witness resource
File share witness cluster resource, directory, and share automatically created and removed as needed
Uses Cluster IsAlive check for availability
If witness is not available, cluster core resources are failed and moved to another DAG member
If other DAG member does not bring witness resource online, the resource will remain in a Failed state, with restart attempts every 60 minutes
See http://support.microsoft.com/kb/978790 for details on this behavior
45. Witness
If in a Failed state and needed for quorum, the cluster will try once to bring the File Share Witness resource online
If witness cannot be restarted, it is considered failed and quorum is lost
If witness can be restarted, it is considered successful and quorum is maintained
An SMB lock is placed on witness.log
Node PAXOS information is incremented and the updated PAXOS tag is written to witness.log
If in an Offline state and needed for quorum, cluster will not try to restart – quorum lost
46. Witness When witness is no longer needed to maintain quorum, lock on witness.log is released
Any member that locks the witness retains the weighted vote (“locking node”)
Members in contact with locking node are in majority and maintain quorum
Members not in contact with locking node are in minority and lose quorum
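For example, in a four-member DAG split two and two, only the half that includes the locking node keeps quorum. Below is a minimal Python sketch of that rule. It assumes that the witness contributes exactly one vote and only in even-sized DAGs, and the function name is hypothetical.

```python
def partition_has_quorum(dag_size: int, partition: set, locking_node=None) -> bool:
    """Decide whether a group of mutually reachable DAG members keeps quorum.

    locking_node is the member holding the lock on witness.log, if any; its
    weighted vote only exists in DAGs with an even number of members.
    """
    votes = len(partition)
    if dag_size % 2 == 0 and locking_node in partition:
        votes += 1  # the witness vote goes with the locking node
    return votes >= dag_size // 2 + 1
```

With dag_size=4, the partition {"A", "B"} keeps quorum when "A" holds the lock (3 of 5 votes), while {"C", "D"} does not.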
47. Witness Server No pre-configuration typically necessary
Exchange Trusted Subsystem must be a member of the local Administrators group on the Witness Server if the Witness Server is not running Exchange 2010
Cannot be a member of the DAG (present or future)
Must be in the same Active Directory forest as DAG
48. Witness Server Can be Windows Server 2003 or later
File and Printer Sharing for Microsoft Networks must be enabled
Replicating witness directory/share with DFS not supported
Not necessary to cluster Witness Server
If you do cluster the Witness Server, you must use Windows Server 2008
Single witness server can be used for multiple DAGs
Each DAG requires its own unique Witness Directory/Share
49. Database Availability Group Lifecycle
50. Database Availability Group Lifecycle
Create a DAG
New-DatabaseAvailabilityGroup -Name DAG1 -WitnessServer EXHUB1 -WitnessDirectory C:\DAG1FSW -DatabaseAvailabilityGroupIpAddresses 10.0.0.8
New-DatabaseAvailabilityGroup -Name DAG2 -DatabaseAvailabilityGroupIpAddresses 10.0.0.8,192.168.0.8
Add Mailbox Servers to DAG
Add-DatabaseAvailabilityGroupServer -Identity DAG1 -MailboxServer EXMBX1
Add-DatabaseAvailabilityGroupServer -Identity DAG1 -MailboxServer EXMBX2
Add a Mailbox Database Copy
Add-MailboxDatabaseCopy -Identity DB1 -MailboxServer EXMBX2
51. Database Availability Group Lifecycle DAG is created initially as empty object in Active Directory
Continuous replication or 3rd party replication using Third Party Replication mode
Once changed to Third Party Replication mode, the DAG cannot be changed back
DAG is given a unique name and configured for IP addresses (or configured to use DHCP)
52. Database Availability Group Lifecycle When the first Mailbox server is added to a DAG
A failover cluster is formed with the name of DAG using Node Majority quorum
The server is added to the DAG object in Active Directory
A cluster name object (CNO) for the DAG is created in the default Computers container using the security context of the Replication service
The name and IP address of the DAG are registered in DNS
The cluster database for the DAG is updated with info about local databases
53. Database Availability Group Lifecycle
When the second and subsequent Mailbox servers are added to a DAG
The server is joined to the cluster for the DAG
The quorum model is automatically adjusted
The server is added to the DAG object in Active Directory
The cluster database for the DAG is updated with info about local databases
54. Database Availability Group Lifecycle After servers have been added to a DAG
Configure the DAG
Network encryption
Network compression
Replication port
Configure DAG networks
Network subnets
Collapse DAG networks into a single network with multiple subnets
Enable/disable MAPI traffic/replication
Block network heartbeat cross-talk (Server1\MAPI !<-> Server2\Repl)
55. Database Availability Group Lifecycle After servers have been added to a DAG
Configure DAG member properties
Automatic database mount dial
BestAvailability, GoodAvailability, Lossless, custom value
Database copy automatic activation policy
Blocked, IntrasiteOnly, Unrestricted
Maximum active databases
Create mailbox database copies
Seeding is performed automatically, but you have options
Monitor health and status of database copies and perform switchovers as needed
56. Database Availability Group Lifecycle Before you can remove a server from a DAG, you must first remove all replicated databases from the server
When a server is removed from a DAG:
The server is evicted from the cluster
The cluster quorum is adjusted
The server is removed from the DAG object in Active Directory
Before you can remove a DAG, you must first remove all servers from the DAG
57. DAG Networks
58. DAG Networks A DAG network is a collection of subnets
All DAGs must have:
Exactly one MAPI network
MAPI network connects DAG members to network resources (Active Directory, other Exchange servers, etc.)
Zero or more Replication networks
Separate network on separate subnet(s)
Used for/by continuous replication only
LRU determines which replication network to use when multiple replication networks are configured
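The slide does not say how the LRU bookkeeping is done. As one possible reading, the hypothetical Python class below hands out the least recently used replication network among those currently available. It is an illustration only, not the Replication service's implementation.

```python
from collections import OrderedDict


class LruNetworkSelector:
    """Pick the least recently used replication network (hypothetical model)."""

    def __init__(self, networks):
        # Front of the OrderedDict = least recently used.
        self._order = OrderedDict((name, None) for name in networks)

    def pick(self, available=None):
        candidates = [name for name in self._order
                      if available is None or name in available]
        if not candidates:
            raise RuntimeError("no replication network available")
        chosen = candidates[0]
        self._order.move_to_end(chosen)  # now the most recently used
        return chosen
```

With two healthy replication networks, this model simply alternates between them. If one network is unavailable, all traffic goes to the other.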
59. DAG Networks
DAG networks are initially created based on enumeration of cluster networks
Cluster enumeration based on subnet
One cluster network is created for each subnet
60. DAG Networks
61. DAG Networks
62. DAG Networks To collapse subnets into two DAG networks and disable replication for the MAPI network:
63. DAG Networks To collapse subnets into two DAG networks and disable replication for the MAPI network:
64. DAG Networks
Automatic network detection occurs only when members are added to the DAG
If networks are added after member is added, you must perform discovery
Set-DatabaseAvailabilityGroup -DiscoverNetworks
DAG network configuration persisted in cluster registry
HKLM\Cluster\Exchange\DAG Network
DAG networks include built-in encryption and compression
Encryption: Kerberos SSP EncryptMessage/DecryptMessage APIs
Compression: Microsoft XPRESS, based on LZ77 algorithm
DAGs use a single TCP port for replication and seeding
Default is TCP port 64327
If you change the port and you use Windows Firewall, you must manually change the firewall rules
MSIT sees 30% compression, but the percentage will vary based on message profile
65. Deeper Dive on Exchange 2010 High Availability Advanced Features
Active Manager
Best Copy Selection
Datacenter Activation Coordination Mode
66. Active Manager
67. Active Manager Exchange component that manages *overs
Runs on every server in the DAG
Selects best available copy on failovers
Is the definitive source of information on where a database is active
Stores this information in cluster database
Provides this information to other Exchange components (e.g., RPC Client Access and Hub Transport)
68. Active Manager Active Manager roles
Standalone Active Manager
Primary Active Manager (PAM)
Standby Active Manager (SAM)
Active Manager client runs on CAS and Hub
69. Active Manager
Role state transitions are logged to the Microsoft-Exchange-HighAvailability/Operational event log (Crimson Channel)
70. Active Manager Primary Active Manager (PAM)
Runs on the node that owns the cluster core resources (cluster group)
Gets topology change notifications
Reacts to server failures
Selects the best database copy on *overs
Detects failures of local Information Store and local databases
71. Active Manager Standby Active Manager (SAM)
Runs on every other node in the DAG
Detects failures of local Information Store and local databases
Reacts to failures by asking PAM to initiate a failover
Responds to queries from CAS/Hub about which server hosts the active copy
Both roles are necessary for automatic recovery
If the Replication service is stopped, automatic recovery will not happen
72. Best Copy Selection
73. Best Copy Selection Process of finding the best copy to activate for an individual database given a list of status results of potential copies for activation
Active Manager selects the “best” copy to become the new active copy when the existing active copy fails
74. Best Copy Selection – RTM Sorts copies by copy queue length to minimize data loss, using activation preference as a secondary sorting key if necessary
Selects from the sorted list based on which set of criteria each copy meets
Attempt Copy Last Logs (ACLL) runs and attempts to copy missing log files from previous active copy
75. Best Copy Selection – SP1 Sorts copies by activation preference when auto database mount dial is set to Lossless
Otherwise, sorts copies based on copy queue length, with activation preference used as a secondary sorting key if necessary
Selects from the sorted list based on which set of criteria each copy meets
Attempt Copy Last Logs (ACLL) runs and attempts to copy missing log files from the previous active copy
This was checked into build 213.
76. Best Copy Selection Is database mountable? Is copy queue length <= AutoDatabaseMountDial?
If Yes, database is marked as current active and mount request is issued
If not, next best database tried (if one is available)
During best copy selection, any servers that are unreachable or “activation blocked” are ignored
77. Best Copy Selection
78. Best Copy Selection – RTM Four copies of DB1
DB1 currently active on Server1
79. Best Copy Selection – RTM
Sort list of available copies based on Copy Queue Length (using Activation Preference as a secondary sort key if necessary):
Server3\DB1
Server2\DB1
Server4\DB1
80. Best Copy Selection – RTM
Only two copies meet the first set of criteria for activation (CQL < 10; RQL < 50; CI = Healthy):
Server3\DB1
Server2\DB1
Server4\DB1
81. Best Copy Selection – SP1 Four copies of DB1
DB1 currently active on Server1
AutoDatabaseMountDial set to Lossless
82. Best Copy Selection – SP1
Sort list of available copies based on Activation Preference:
Server2\DB1
Server3\DB1
Server4\DB1
83. Best Copy Selection – SP1
Sort list of available copies based on Activation Preference:
Server2\DB1
Server3\DB1
Server4\DB1
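The RTM and SP1 sort orders can be contrasted in a short Python sketch. It is a simplified model. Only the first criteria set (CQL < 10, RQL < 50, content index Healthy) comes from the slides, and the other copies are simply appended in sorted order instead of being evaluated against the further criteria sets that Active Manager uses. The CopyStatus type and the helper names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class CopyStatus:
    server: str
    activation_preference: int
    copy_queue_length: int
    replay_queue_length: int
    content_index_healthy: bool


def meets_first_criteria(status: CopyStatus) -> bool:
    # First criteria set from the slide: CQL < 10, RQL < 50, content index Healthy.
    return (status.copy_queue_length < 10 and status.replay_queue_length < 50
            and status.content_index_healthy)


def candidate_order(copies, mount_dial: str, sp1: bool = True) -> list:
    """Order in which copies would be tried (simplified model)."""
    if sp1 and mount_dial == "Lossless":
        # SP1 with Lossless: activation preference is the primary sort key.
        ordered = sorted(copies, key=lambda c: c.activation_preference)
    else:
        # RTM, or SP1 with any other dial: copy queue length first.
        ordered = sorted(copies, key=lambda c: (c.copy_queue_length, c.activation_preference))
    # Simplification: copies meeting the first criteria set go first.
    return ([c for c in ordered if meets_first_criteria(c)]
            + [c for c in ordered if not meets_first_criteria(c)])
```

Take some hypothetical statuses, with copy queue lengths of 4, 2 and 60 for Server2, Server3 and Server4 and activation preferences of 2, 3 and 4. The RTM order is then Server3, Server2, Server4, and the SP1 Lossless order is Server2, Server3, Server4, as in the slides.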
84. Best Copy Selection After Active Manager determines the best copy to activate
The Replication service on the target server attempts to copy missing log files from the source (ACLL)
If successful, then the database will mount with zero data loss
If unsuccessful (lossy failure), then the database will mount based on the AutoDatabaseMountDial setting
If data loss is outside of dial setting, next copy will be tried
85. Best Copy Selection After Active Manager determines the best copy to activate
The mounted database will generate new log files (using the same log generation sequence)
Transport Dumpster requests will be initiated for the mounted database to recover lost messages
When original server or database recovers, it will run through divergence detection and either perform an incremental resync or require a full reseed
86. Datacenter Activation Coordination Mode
87. Datacenter Activation Coordination Mode DAC mode is a property of a DAG
Acts as an application-level form of quorum
Designed to prevent multiple copies of the same database from mounting on different members due to loss of network
88. Datacenter Activation Coordination Mode RTM: DAC Mode is only for DAGs with three or more members that are extended to two Active Directory sites
Don’t enable for two-member DAGs where each member is in different AD site or DAGs where all members are in the same AD site
DAC Mode also enables use of Site Resilience tasks
Stop-DatabaseAvailabilityGroup
Restore-DatabaseAvailabilityGroup
Start-DatabaseAvailabilityGroup
SP1: DAC Mode can be enabled for all DAGs
89. Datacenter Activation Coordination Mode Uses Datacenter Activation Coordination Protocol (DACP), which is a bit in memory set to either:
0 = can’t mount
1 = can mount
90. Datacenter Activation Coordination Mode Active Manager startup sequence
DACP is set to 0
DAG member communicates with other DAG members it can reach to determine the current value for their DACP bits
If the starting DAG member can communicate with all other members, DACP bit switches to 1
If other DACP bits are set to 0, starting DAG member DACP bit remains at 0
If another DACP bit is set to 1, starting DAG member DACP bit switches to 1
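The startup rules can be written as a small Python function. This is a sketch of the slide's logic only: the function name and inputs are hypothetical, and DACP is modeled as a plain 0/1 value.

```python
def dacp_bit_on_startup(other_members: set, reachable_bits: dict) -> int:
    """Return the DACP bit a starting DAG member would settle on.

    other_members: every other member of the DAG.
    reachable_bits: DACP bit (0 or 1) of each other member this node can reach.
    """
    if other_members <= set(reachable_bits):
        return 1  # can communicate with all other members
    if any(bit == 1 for bit in reachable_bits.values()):
        return 1  # some reachable member is already allowed to mount
    return 0  # stays at 0: this member will not mount databases
```

A member that restarts in an isolated datacenter, and can reach only peers whose bits are 0, stays at 0. This is the split-brain case that DAC mode is designed to prevent.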
91. Improvements in Service Pack 1 Replication and Copy Management enhancements in SP1
92. Improvements in Service Pack 1 Continuous replication changes
Enhanced to reduce data loss
Eliminates log drive as single point of failure
Automatically switches between modes:
File mode (original, log file shipping)
Block mode (enhanced log block shipping)
Switching process:
Initial mode is file mode
Block mode triggered when target needs Exx.log file (e.g., copy queue length = 0)
All healthy passives processed in parallel
File mode triggered when block mode falls too far behind (e.g., copy queue length > 0)
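The mode switching can be pictured as a two-state machine. The Python sketch below uses the slide's example triggers (copy queue length = 0 to enter block mode, > 0 to fall back to file mode). The function name is hypothetical, and the real thresholds may differ.

```python
def next_replication_mode(mode: str, copy_queue_length: int) -> str:
    """Switch between "file" and "block" mode using the slide's example triggers."""
    if mode == "file" and copy_queue_length == 0:
        return "block"  # target is caught up and waiting on the current Exx.log
    if mode == "block" and copy_queue_length > 0:
        return "file"   # block mode fell behind; fall back to shipping whole logs
    return mode
```

A copy starts in "file" mode, switches to "block" once it has caught up, and returns to "file" if it falls behind again.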
93. Improvements in Service Pack 1
94. Improvements in Service Pack 1 SP1 introduces RedistributeActiveDatabases.ps1 script (keep database copies balanced across DAG members)
Moves databases to the most preferred copy
If cross-site, tries to balance between sites
Targetless admin switchover altered for stronger activation preference affinity
First pass of best copy selection sorted by activation preference, not copy queue length
This trades a potentially longer activation time for a more even distribution of active databases: you might pick a copy with more logs to replay, but databases end up better distributed
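The core idea of redistribution, moving each database back to its most preferred available copy, can be sketched in Python. This is a hypothetical model, not the logic of RedistributeActiveDatabases.ps1. In particular, it does not model the cross-site balancing mentioned above.

```python
def redistribution_moves(active_on: dict, preferences: dict, available: set) -> dict:
    """Return {database: target server} moves toward the most preferred copy.

    active_on: database -> server currently hosting the active copy.
    preferences: database -> servers with a copy, in activation-preference order.
    available: servers that can currently accept an active database.
    """
    moves = {}
    for database, ordered_servers in preferences.items():
        target = next((s for s in ordered_servers if s in available), None)
        if target is not None and target != active_on[database]:
            moves[database] = target
    return moves
```

If DB1 prefers S1 but is active on S2, and S1 is available, the model proposes moving DB1 to S1. It proposes no moves when every database is already on its most preferred available copy.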
95. Improvements in Service Pack 1 *over Performance Improvements
In RTM, a *over immediately terminated replay on copy that was becoming active, and mount operation did necessary log recovery
In SP1, a *over drives the database to a clean shutdown by replaying all logs on the passive copy, so no recovery is required on the new active copy
96. Improvements in Service Pack 1 DAG Maintenance Scripts
StartDAGServerMaintenance.ps1
It runs Suspend-MailboxDatabaseCopy for each database copy hosted on the DAG member
It pauses the node in the cluster, which prevents it from being or becoming the PAM
It sets the DatabaseCopyAutoActivationPolicy parameter on the DAG member to Blocked
It moves all active databases currently hosted on the DAG member to other DAG members
If the DAG member currently owns the default cluster group, it moves the default cluster group (and therefore the PAM role) to another DAG member
97. Improvements in Service Pack 1 DAG Maintenance Scripts
StopDAGServerMaintenance.ps1
It runs Resume-MailboxDatabaseCopy for each database copy hosted on the DAG member
It resumes the node in the cluster, which enables full cluster functionality for the DAG member
It sets the DatabaseCopyAutoActivationPolicy parameter on the DAG member to Unrestricted
98. Improvements in Service Pack 1 CollectOverMetrics.ps1 and CollectReplicationMetrics.ps1 rewritten
99. Improvements in Service Pack 1 Exchange Management Console enhancements in SP1
Manage DAG IP addresses
Manage witness server/directory and alternate witness server/directory
100. Switchovers and Failovers (*overs)
101. Exchange 2010 *Overs Within a datacenter
Database *over
Server *over
Between datacenters
Single database *over
Server *over
Datacenter switchover
102. Single Database Cross-Datacenter *Over Database mounted in another datacenter and another Active Directory site
Serviced by “new” Hub Transport servers
“Different OwningServer” – for routing
Transport dumpster re-delivery now from both Active Directory sites
Serviced by “new” CAS
“Different CAS URL” – for protocol access
Outlook Web App now redirects the connection to the second CAS farm
Other protocols proxy or redirect (varies)
103. Datacenter Switchover Customers can evolve to site resilience
Standalone → local redundancy → site resilience
Consider name space design at first deployment
Keep extending the DAG!
Monitoring and many other concepts/skills just re-applied
Normal administration remains unchanged
Disaster recovery, not an HA event
104. Site Resilience
105. Agenda Understand the steps required to build and activate a standby site for Exchange 2010
Site Resilience Overview
Site Resilience Models
Planning and Design
Site activation steps
Client Behavior
106. Site Resilience Drivers Business requirements drive site resilience
When a risk assessment reveals a high-impact threat to meeting SLAs for data loss and loss of availability
Site resilience is required to mitigate the risk
Business requirements dictate low recovery point objective (RPO) and recovery time objective (RTO)
107. Site Resilience Overview Ensuring business continuity brings expense and complexity
A site switchover is a coordinated effort with many stakeholders that requires practice to ensure the real event is handled well
Exchange 2010 reduces cost and complexity
Low-impact testing can be performed with cross-site single database switchover
108. Exchange 2007 Site Resilience Choices
CCR+SCR and /recoverCMS
SCC+SCR and /recoverCMS
CCR stretched across datacenters
SCR and database portability
SCR and /m:RecoverServer
SCC stretched across datacenters with synchronous replication
109. Exchange 2010 makes it simpler
Database Availability Group (DAG) with members in different datacenters/sites
Supports automatic and manual cross-site database switchovers and failovers (*overs)
No stretched Active Directory site
No special networking needed
No /recoverCMS
110. Suitability of site resilience solutions
111. Site Resilience Models
Voter Placement and Infrastructure Design
112. Infrastructure Design
There are two key models you have to take into account when designing site-resilient solutions
Datacenter / Namespace Model
User Distribution Model
When planning for site resilience, each datacenter is considered active
Exchange Server 2010 site resilience requires active CAS, Hub Transport, and UM servers in the standby datacenter
These services are used by databases mounted in the standby datacenter after a single database *over
113. Infrastructure Design: User Distribution Models
The locality of the users will ultimately determine your site resilience architecture
Are users primarily located in one datacenter?
Are users located in multiple datacenters?
Is there a requirement to maintain user population in a particular datacenter?
Active/Passive user distribution model
Database copies deployed in the secondary datacenter, but no active mailboxes are hosted there
Active/Active user distribution model
User population dispersed across both datacenters with each datacenter being the primary datacenter for its specific user population
114. Infrastructure Design: Client Access Arrays
One CAS array per AD site
Multiple DAGs within an AD site can use the same CAS array
FQDN of the CAS array needs to resolve to a load-balanced virtual IP address in DNS, but only in internal DNS
You also need a load balancer for the CAS array
Configure the databases in the AD site to use the CAS array by setting their RPCClientAccessServer property with Set-MailboxDatabase -RPCClientAccessServer
By default, new databases will have the RPCClientAccessServer value set on creation
If database was created prior to creating CAS array, then it is set to random CAS FQDN (or local machine if role co-location)
If database is created after creating CAS array, then it is set to the CAS array FQDN
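The creation-order rule above can be sketched as a small Python model that finds the databases still pointing at an individual CAS. The creation "timestamps", the array FQDN `outlook.contoso.com`, and the fallback `cas01.contoso.com` are hypothetical; the real fix is still Set-MailboxDatabase -RPCClientAccessServer on each listed database.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Database:
    name: str
    created_at: int  # hypothetical creation order
    rpc_client_access_server: str = ""


def stamp_on_create(db: Database, array_fqdn: Optional[str],
                    array_created_at: Optional[int], fallback_cas: str) -> None:
    """Model the default RPCClientAccessServer value set at database creation."""
    array_exists = array_fqdn is not None and array_created_at is not None
    if array_exists and array_created_at < db.created_at:
        db.rpc_client_access_server = array_fqdn    # array already existed
    else:
        db.rpc_client_access_server = fallback_cas  # some CAS FQDN (or local server)


def needs_update(dbs: List[Database], array_fqdn: str) -> List[str]:
    """Databases to fix with Set-MailboxDatabase -RPCClientAccessServer."""
    return [db.name for db in dbs if db.rpc_client_access_server != array_fqdn]


array = "outlook.contoso.com"  # hypothetical CAS array FQDN
dbs = [Database("DB1", created_at=1), Database("DB2", created_at=5)]
for db in dbs:
    stamp_on_create(db, array, array_created_at=3, fallback_cas="cas01.contoso.com")
print(needs_update(dbs, array))  # ['DB1']
```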
115. Voter Placement
The majority of voters should be deployed in the primary datacenter
Primary = datacenter with majority of user population
If user population is spread across datacenters, deploy multiple DAGs to prevent WAN outage from taking one datacenter offline
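Why voter placement matters can be shown with a short Python sketch of the quorum arithmetic. It assumes the Exchange 2010 behavior of Node Majority for an odd number of DAG members and Node and File Share Majority for an even number, where the witness adds one vote; the site names are hypothetical.

```python
from collections import Counter
from typing import Dict


def voters(members_by_site: Dict[str, int], witness_site: str) -> Counter:
    """Votes per datacenter. An even member count adds the file share witness."""
    votes = Counter(members_by_site)
    if sum(members_by_site.values()) % 2 == 0:
        votes[witness_site] += 1
    return votes


def keeps_quorum_alone(votes: Counter, site: str) -> bool:
    """After a WAN split, a site stays up only with a strict majority of votes."""
    return votes[site] > sum(votes.values()) // 2


votes = voters({"Primary": 2, "Secondary": 2}, witness_site="Primary")
print(keeps_quorum_alone(votes, "Primary"))    # True  (3 of 5 votes)
print(keeps_quorum_alone(votes, "Secondary"))  # False (2 of 5 votes)
```

With users in both datacenters, a single DAG can only favor one side; a second DAG with its majority in the other datacenter keeps that population online during a WAN outage.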
116. Voter Placement
117. Site Resilience Namespace, Network and Certificate Planning
118. Planning for site resilience: Namespaces
Each datacenter is considered active and needs its own namespaces
Each datacenter needs the following namespaces
OWA/OA/EWS/EAS namespace
POP/IMAP namespace
RPC Client Access namespace
SMTP namespace
In addition, one of the datacenters will maintain the Autodiscover namespace
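The namespace list above can be sketched as a Python helper that builds a per-datacenter plan. The host-name patterns, datacenter names (`Redmond`, `Dublin`), and prefixes are all hypothetical; the point is that every datacenter gets its own four namespaces and exactly one also carries Autodiscover.

```python
from typing import Dict

# Hypothetical host-name patterns; {p} is a per-datacenter prefix.
PATTERNS = {
    "OWA/OA/EWS/EAS": "{p}.contoso.com",
    "POP/IMAP": "{p}-popimap.contoso.com",
    "RPC Client Access": "{p}-outlook.contoso.com",
    "SMTP": "{p}-smtp.contoso.com",
}


def namespace_plan(prefixes: Dict[str, str],
                   autodiscover_dc: str) -> Dict[str, Dict[str, str]]:
    """Give every datacenter its own namespaces; one also hosts Autodiscover."""
    plan = {dc: {svc: pat.format(p=p) for svc, pat in PATTERNS.items()}
            for dc, p in prefixes.items()}
    plan[autodiscover_dc]["Autodiscover"] = "autodiscover.contoso.com"
    return plan


plan = namespace_plan({"Redmond": "mail", "Dublin": "mail2"}, autodiscover_dc="Redmond")
print(plan["Dublin"]["RPC Client Access"])  # mail2-outlook.contoso.com
```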
119. Planning for site resilience: Namespaces
Best practice: Use split DNS for Exchange hostnames used by clients
Goal: minimize the number of hostnames
mail.contoso.com for Exchange connectivity on intranet and Internet
mail.contoso.com has different IP addresses in intranet/Internet DNS
Important: before moving down this path, be sure to map out all hostnames (outside of Exchange) that you want to create in the internal zone
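Split DNS can be sketched as two zone views answering the same name differently. The addresses are documentation/private examples and the `www.contoso.com` record is hypothetical. The helper shows why the mapping exercise matters: once the internal zone is authoritative for contoso.com, any public name not recreated there stops resolving for intranet clients.

```python
from typing import Dict, Iterable, List

# Hypothetical zone data: same name, different answer per DNS view.
ZONES: Dict[str, Dict[str, str]] = {
    "internal": {"mail.contoso.com": "10.0.0.10"},     # internal load-balanced VIP
    "external": {"mail.contoso.com": "203.0.113.10",   # published address
                 "www.contoso.com": "203.0.113.20"},
}


def resolve(name: str, client_on_intranet: bool) -> str:
    view = "internal" if client_on_intranet else "external"
    return ZONES[view][name]


def missing_internal_records(external_names: Iterable[str],
                             internal_zone: Dict[str, str]) -> List[str]:
    """Public names not recreated internally; intranet clients lose these."""
    return sorted(set(external_names) - set(internal_zone))


print(resolve("mail.contoso.com", client_on_intranet=True))   # 10.0.0.10
print(missing_internal_records(ZONES["external"], ZONES["internal"]))  # ['www.contoso.com']
```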
120. Planning for site resilience: Namespaces
121. Planning for site resilience: Network
Design High Availability for Dependencies
Active Directory
Network services (DNS, TCP/IP, etc.)
Telephony services (Unified Messaging)
Backup services
Infrastructure (power, cooling, etc.)
122. Planning for site resilience: Network
Latency
Must have less than 250 ms round trip
Network cross-talk must be blocked
Router ACLs should be used to block traffic between MAPI and replication networks
If DHCP is used for the replication network, DHCP can be used to deploy static routes
Lower TTL for all Exchange records to 5 minutes
OWA/EAS/EWS/OA, IMAP/POP, SMTP, RPC CAS
Both internal and external DNS zones
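The two numeric targets above (under 250 ms round trip, 5-minute TTLs) can be sketched as simple Python checks. How the RTT samples and TTLs are collected is left open and assumed; the check treats every sample as having to stay under the limit, which is a conservative reading of the slide.

```python
from typing import Dict, Iterable, List

MAX_RTT_MS = 250       # round-trip latency limit from the slide
TARGET_TTL_S = 5 * 60  # lower Exchange record TTLs to 5 minutes


def latency_ok(rtt_samples_ms: Iterable[float]) -> bool:
    """Conservative check: every measured round trip must stay under the limit."""
    return all(rtt < MAX_RTT_MS for rtt in rtt_samples_ms)


def records_to_lower(ttls: Dict[str, int]) -> List[str]:
    """Record names (hypothetical) whose TTL is still above 5 minutes."""
    return sorted(name for name, ttl in ttls.items() if ttl > TARGET_TTL_S)


print(latency_ok([80, 120, 249.9]))  # True
print(records_to_lower({"mail.contoso.com": 3600, "smtp.contoso.com": 300}))  # ['mail.contoso.com']
```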
123. Planning for site resilience: Certificates
124. Planning for site resilience: Certificates
Best practice: minimize the number of certificates
1 certificate for all CAS servers + reverse proxy + Edge/Hub
Use a Subject Alternative Name (SAN) certificate, which can cover multiple hostnames
1 additional certificate if using OCS
OCS requires certificates with <=1024-bit keys and the server name in the certificate principal name
If you use one certificate per datacenter, ensure the Certificate Principal Name is the same on all certificates
Outlook Anywhere won’t connect if the Principal Name on the certificate does not match the value configured in msstd: (by default, this matches the Outlook Anywhere RPC endpoint)
Set-OutlookProvider EXPR -CertPrincipalName msstd:mail.contoso.com
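The certificate rules above can be sketched as two Python checks: every client hostname must appear in the SAN list, and every per-datacenter certificate must carry the principal name that msstd: expects. The sketch treats the principal name as the certificate subject name and uses exact matching only (no wildcards); both are simplifying assumptions, and the hostnames are hypothetical.

```python
from typing import Iterable, List

MSSTD_PREFIX = "msstd:"


def uncovered(hostnames: Iterable[str], san_names: Iterable[str]) -> List[str]:
    """Hostnames that no SAN entry covers (exact, case-insensitive match)."""
    sans = {s.lower() for s in san_names}
    return sorted(h for h in hostnames if h.lower() not in sans)


def principal_names_consistent(subject_names: Iterable[str], msstd_value: str) -> bool:
    """True if every certificate's principal name matches the msstd: value."""
    expected = msstd_value
    if expected.startswith(MSSTD_PREFIX):
        expected = expected[len(MSSTD_PREFIX):]
    return all(name.lower() == expected.lower() for name in subject_names)


sans = ["mail.contoso.com", "autodiscover.contoso.com"]
print(uncovered(["mail.contoso.com", "mail2.contoso.com"], sans))  # ['mail2.contoso.com']
print(principal_names_consistent(["mail.contoso.com"] * 2, "msstd:mail.contoso.com"))  # True
```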
125. Datacenter Switchover
Switchover Tasks
126. Datacenter Switchover Process
Failure occurs
Activation decision
Terminate partially running primary datacenter
Activate secondary datacenter
Validate prerequisites
Activate mailbox servers
Activate other roles (in parallel with previous step)
Service is restored
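The switchover process above can be sketched as a dependency graph using Python's standard `graphlib` module (Python 3.9+). The edges are an interpretation of the slide: both activation steps are assumed to start after prerequisites are validated, and "Activate secondary datacenter" is treated as a grouping heading rather than a step. Steps that land in the same wave can run in parallel.

```python
from graphlib import TopologicalSorter

# Each step maps to the steps that must finish before it starts (assumed edges).
DEPENDENCIES = {
    "Activation decision": {"Failure occurs"},
    "Terminate partially running primary datacenter": {"Activation decision"},
    "Validate prerequisites": {"Terminate partially running primary datacenter"},
    "Activate mailbox servers": {"Validate prerequisites"},
    "Activate other roles": {"Validate prerequisites"},
    "Service is restored": {"Activate mailbox servers", "Activate other roles"},
}


def waves(deps):
    """Group steps into waves; steps in the same wave may run in parallel."""
    ts = TopologicalSorter(deps)
    ts.prepare()
    result = []
    while ts.is_active():
        ready = sorted(ts.get_ready())
        result.append(ready)
        ts.done(*ready)
    return result


for wave in waves(DEPENDENCIES):
    print(wave)
```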
127. Datacenter Switchovers
Primary site fails
Stop-DatabaseAvailabilityGroup <DAGName> -ActiveDirectorySite <PSiteName> -ConfigurationOnly (run this in both datacenters)
Stop-Service clussvc
Restore-DatabaseAvailabilityGroup <DAGName> -ActiveDirectorySite <SSiteName>
Databases mount (assuming no activation blocks)
Adjust DNS records for SMTP and HTTPS
Note: -ConfigurationOnly is used only when you have a partial datacenter outage, where all Exchange servers are down but AD is still up and running.
128. Datacenter Switchover Tasks
Stop-DatabaseAvailabilityGroup
Adds failed servers to stopped list
Removes servers from started list
Restore-DatabaseAvailabilityGroup
Force quorum
Evict stopped nodes
Start using alternate file share witness if necessary
Start-DatabaseAvailabilityGroup
Remove servers from stopped list
Join servers to cluster
Add joined servers to started list
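The bookkeeping performed by the three cmdlets can be sketched as a toy Python state model. It is not Exchange's implementation: forcing quorum is reduced to a flag, the alternate-witness rule is simplified to "even node count and primary witness unreachable", and the server names are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Iterable, Set


@dataclass
class DagModel:
    """Toy model of the started/stopped lists; not Exchange's implementation."""
    cluster_nodes: Set[str]
    started: Set[str] = field(default_factory=set)
    stopped: Set[str] = field(default_factory=set)
    quorum_forced: bool = False
    using_alternate_witness: bool = False

    def stop(self, servers: Iterable[str]) -> None:
        # Stop-DatabaseAvailabilityGroup: add to stopped list, remove from started list
        s = set(servers)
        self.stopped |= s
        self.started -= s

    def restore(self, primary_witness_reachable: bool) -> None:
        # Restore-DatabaseAvailabilityGroup: force quorum, evict stopped nodes
        self.quorum_forced = True
        self.cluster_nodes -= self.stopped
        # Simplified rule: an even node count needs a witness vote.
        if len(self.cluster_nodes) % 2 == 0 and not primary_witness_reachable:
            self.using_alternate_witness = True

    def start(self, servers: Iterable[str]) -> None:
        # Start-DatabaseAvailabilityGroup: unstop, join cluster, add to started list
        s = set(servers)
        self.stopped -= s
        self.cluster_nodes |= s
        self.started |= s


dag = DagModel({"MBX1", "MBX2", "MBX3", "MBX4"}, started={"MBX1", "MBX2", "MBX3", "MBX4"})
dag.stop(["MBX1", "MBX2"])                    # members in the failed primary datacenter
dag.restore(primary_witness_reachable=False)
print(sorted(dag.cluster_nodes), dag.using_alternate_witness)  # ['MBX3', 'MBX4'] True
dag.start(["MBX1", "MBX2"])                   # after the primary datacenter is repaired
print(sorted(dag.started))                    # ['MBX1', 'MBX2', 'MBX3', 'MBX4']
```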
129. Client Experiences: Typical Outlook Behavior
All Outlook versions behave consistently in a single-datacenter scenario
Profile points to RPC Client Access Server array
Profile is unchanged by failovers or loss of CAS
All Outlook versions should behave consistently in a datacenter switchover scenario
The primary datacenter’s Client Access Server DNS name is bound to the IP address of the standby datacenter’s Client Access Server
Autodiscover continues to hand out primary datacenter CAS name as Outlook RPC endpoint
Profile remains unchanged
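Why the Outlook profile survives a datacenter switchover can be sketched in a few lines of Python: Autodiscover keeps returning the primary CAS name, and only the DNS record behind that name changes. The name and both addresses are hypothetical.

```python
# Hypothetical names and addresses.
PROFILE_SERVER = "outlook.contoso.com"  # RPC endpoint stored in the Outlook profile
PRIMARY_CAS_VIP = "10.1.0.10"
STANDBY_CAS_VIP = "10.2.0.10"

dns = {PROFILE_SERVER: PRIMARY_CAS_VIP}


def autodiscover_rpc_endpoint() -> str:
    """Autodiscover keeps handing out the primary datacenter CAS name."""
    return PROFILE_SERVER


def datacenter_switchover(records: dict, name: str, standby_ip: str) -> dict:
    """Rebind the primary CAS name to the standby CAS address."""
    updated = dict(records)
    updated[name] = standby_ip
    return updated


after = datacenter_switchover(dns, PROFILE_SERVER, STANDBY_CAS_VIP)
endpoint = autodiscover_rpc_endpoint()
print(endpoint, "->", after[endpoint])  # outlook.contoso.com -> 10.2.0.10
```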
130. Client Experiences: Outlook Cross-Site DB Failover Experience
Outlook keeps connecting to the CAS array in the first datacenter, which connects directly to the Mailbox server hosting the active copy in the second datacenter
A redirect occurs only if you change the RPCClientAccessServer property on the database
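The cross-site failover behavior can be sketched as a Python model in which Outlook's endpoint comes from the database's RPCClientAccessServer value rather than from where the active copy lives. The database, server, and array names are hypothetical.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class MailboxDatabase:
    name: str
    rpc_client_access_server: str  # endpoint Outlook connects to
    active_server: str             # Mailbox server holding the active copy


def outlook_path(db: MailboxDatabase) -> tuple:
    """(endpoint Outlook uses, server the CAS array then connects to)."""
    return (db.rpc_client_access_server, db.active_server)


db = MailboxDatabase("DB1", "outlook-dc1.contoso.com", "MBX1-DC1")
failed_over = replace(db, active_server="MBX3-DC2")  # cross-site failover
print(outlook_path(failed_over))  # ('outlook-dc1.contoso.com', 'MBX3-DC2')
redirected = replace(failed_over, rpc_client_access_server="outlook-dc2.contoso.com")
print(outlook_path(redirected))   # ('outlook-dc2.contoso.com', 'MBX3-DC2')
```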
131. Client Experiences: Other Clients
Other client behavior varies based on protocol and scenario
132. End of Exchange 2010 High Availability Module
133. For More Information
Exchange Server Tech Center: http://technet.microsoft.com/en-us/exchange/default.aspx
Planning services: http://technet.microsoft.com/en-us/library/cc261834.aspx
Microsoft IT Showcase Webcasts: http://www.microsoft.com/howmicrosoftdoesitwebcasts
Microsoft TechNet: http://www.microsoft.com/technet/itshowcase