Exchange 2003 High Availability & Site Redundancy

Wil Westwick | Dedicated Supportability Services EMEA eXchange Center of Excellence (UK Competence Centre Lead)
MICROSOFT CORPORATION

Welcome to this TechNet Event
We would like to bring your attention to the key elements of the TechNet programme, the central information and community resource for IT professionals in the UK:
• FREE bi-weekly technical newsletter
• FREE regular technical events hosted across the UK
• FREE weekly UK & US led technical webcasts
• FREE comprehensive technical web site
• Monthly CD / DVD subscription with the latest technical tools & resources
• FREE quarterly technical magazine
To subscribe to the newsletter or just to find out more, please visit www.microsoft.com/uk/technet or speak to a Microsoft representative during the break.

Progression in Messaging
• History of Exchange (5.0, 5.5, 2000, 2003, E12)
• Mission Critical
• Email Evolution
• Cornerstone of many businesses/industries
• Primary professional communication mechanism
• Increased investment (?)
• Increased development (3rd party investments)

Provision of Service
• Greater demands placed upon service
  • High Availability
  • Business Continuity
• High Availability
  • ‘A highly available system is usable when the customer needs it.’
  • Planned and Unplanned.
• Business Continuity
  • ‘Providing the continuity or uninterrupted provision of operations and services – but more importantly the ability for BUSINESS RECOVERY!’
• Microsoft’s current/future investments.
= Alignment to business Service Level Agreements and a measurement of availability (% of uptime).
“What are the 4 9’s and how do I measure them?”
Microsoft Exchange 2003 and partner technologies help IT professionals provide solutions to these modern-day business requirements.
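The "4 9's" question can be made concrete with a little arithmetic. The sketch below is illustrative only: it assumes availability is measured as the percentage of uptime over a 365-day year, and the function names are hypothetical. Whether planned downtime counts against the target is a decision for the SLA, not something the code settles.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes in a non-leap year


def allowed_downtime_minutes(availability_pct: float,
                             period_minutes: float = MINUTES_PER_YEAR) -> float:
    """Downtime budget (in minutes) that still meets the availability target."""
    return period_minutes * (1 - availability_pct / 100)


def measured_availability_pct(downtime_minutes: float,
                              period_minutes: float = MINUTES_PER_YEAR) -> float:
    """Availability achieved over the period, as a percentage of uptime."""
    return 100 * (1 - downtime_minutes / period_minutes)


if __name__ == "__main__":
    # Two, three, four and five nines.
    for target in (99.0, 99.9, 99.99, 99.999):
        budget = allowed_downtime_minutes(target)
        print(f"{target}% uptime allows {budget:.1f} minutes of downtime per year")
```

Under these assumptions, four nines (99.99%) leaves a budget of roughly 52.6 minutes of downtime per year.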

Exchange 2003: Highly Available
• Clustering solutions
  • Shared Nothing
  • Models (A/A, A/P, Multi-Node, MNS)
  • Service Provision (end-to-end) (Application Dependencies)
  • ExRes.dll architectural changes (MSExchangeSA)
  • Interoperability with Storage Abstraction Layer (CLX, Geo-Span)

Exchange 2003: Highly Available…cont
• Operational Excellence (ITIL & MOF)
  • People, process, technology
• Non-Clustered Solutions
  • Outlook 2003 Cached Mode
  • Portable Databases (Replication Technology)
  • Clone Technology
  • Microsoft Windows 2003 VSS Framework

Exchange 2003: Business Continuity
• Solutions
  • Geographically Distributed Clusters
  • (Non) Geo-Clustered Site Resilient
    • Design Scenario (A)
    • Design Scenario (B)
    • Design Scenario (C)
Design Fundamental: Multi-Site Data Availability
- Replication
- Clones (VSS)

Multi-site Data Replication
• What is Multi-site Data Replication?
• Replication Mechanisms
  • Asynchronous Replication
    • Data Loss
    • Data Integrity
  • Synchronous Replication
    • Distance
    • Handling of replication link failure
• Solutions
  • Geographically Distributed Clusters
  • Others… (Standby Solutions)
• Exchange Data to Replicate
  • .edb, .stm, .chk, .log (Mandatory)
  • SMTP Queue Data & MTA Queue Data (Recommended)
  • Tracking Logs (Optional)
• Best Practices for Configuring Replication Mechanisms
  • Configure replication at the logical/mount point volume level.
  • Create many replication points.
  • Keep transaction logs on different logical volumes.
  • Use multiple replication links.
• Best Practices for Configuring Exchange for Synchronous Replication
  • Create the maximum number of storage groups per Exchange server.
  • Increase transaction log buffer size.
• Deployment Planning for Synchronous Replication
  • Jetstress (RPC Average Latency, Disk Latency Counters, Client Response Time)
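Distance matters for synchronous replication because a write is not acknowledged until the remote site has confirmed it. The rough estimate below shows the floor that distance alone puts under write latency. It is a sketch under stated assumptions: light travels through optical fibre at roughly 200,000 km/s, each write needs at least one round trip, and switch, protocol and storage array overheads are ignored. The constant and function names are illustrative, and real figures should come from Jetstress testing on the actual link.

```python
# Assumption: signal speed in optical fibre is about two thirds of the
# speed of light in a vacuum, i.e. roughly 200,000 km per second.
FIBRE_KM_PER_SECOND = 200_000.0


def min_round_trip_ms(distance_km: float, round_trips_per_write: int = 1) -> float:
    """Lower bound on the latency (ms) that distance adds to each synchronous write."""
    one_way_seconds = distance_km / FIBRE_KM_PER_SECOND
    return 2 * one_way_seconds * round_trips_per_write * 1000


if __name__ == "__main__":
    for km in (10, 50, 100, 500):
        print(f"{km:>4} km adds at least {min_round_trip_ms(km):.2f} ms per write")
```

This is only a lower bound. The cable route is usually longer than the straight-line distance, and some replication protocols need more than one round trip per write, which is what the `round_trips_per_write` parameter is for.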

Multi-site Data Replication
• Exchange Product Group Support Policy
  • In summary, Microsoft Exchange supports data that is replicated synchronously, whereas in an asynchronous replication environment the third-party vendors provide support for the replicated data.
Short Common Questions
• 1. Does Microsoft discourage customers from deploying an Exchange asynchronous replication solution?
  • Microsoft Exchange neither encourages nor discourages customers from deploying asynchronous data replication solutions.
• 2. What are the important tests that need to be covered before deployment?
  • Testing should be done in each of these categories:
    • Storage Reliability
    • Performance

Multi-site Data Replication
• Backup strategy – replication is not a backup solution.
• Disaster recovery plan: Replicating data is only the first step in a disaster recovery plan. It is necessary to have a disaster recovery plan that describes, step by step, how to bring the replicated data online within the time window defined by your SLA.
• What tools can I use for the testing?
  • Jetstress & Loadsim
• ‘Hot’ and ‘Cold’ Data
  • What is Microsoft’s support for replicating cold data?

Exchange 2003: Business Continuity
• Geographically Distributed Clusters
  • What is a ‘true’ stretch cluster?
  • Qualification
  • Storage Abstraction Layer (CLX, GEO-SPAN)
  • Connections
  • Latency
  • Multi-Node

Exchange 2003: Business Continuity…cont

The Alternate Designs (Others)
• Design Scenario (A) – Single Leg DR Clusters (Dial-Tone)
• Environment:
  • 20 Exchange 2003 A/P clusters in the production environment
  • 30K mailboxes (Outlook 97/98/2000/XP)
  • Disaster recovery site located 100 km away from the production datacenter
  • 100 Mbit link between the 2 sites
• Disaster recovery requirements for Exchange:
  • To provide email service continuity in case of a complete cluster failure OR in case of temporary unavailability of the production datacenter.
  • Data recovery is not required; users can work with empty mailboxes (dial-tone).
  • Geo-clustering cannot be considered because the storage replication infrastructure cannot be afforded.
• Solution (stand-by dial-tone clusters):
  • The network is configured so that the VLANs are extended to the DR site, giving the same IP subnets on the DR site.
  • DC/GC/DNS servers are installed on the DR site, are members of the same domain/site, and are online.
  • Public Folder servers are installed in the same AG/RG, are replicating data, and are online.
  • Bridgehead and connector servers are deployed on the DR site. Secondary connectors are created to the same destinations with higher costs and are online.
  • Mailbox servers: for each production A/P cluster, there is one single-node standby cluster on the DR site.

The Alternate Designs (Others)
• How the standby cluster is configured:
  • Installed on the same subnet as the production cluster.
  • Distinct computer name and IP address; the node is online.
  • Distinct cluster name and cluster IP address; the cluster is online.
  • Physical disk configuration that corresponds to the same drive letters as the production cluster, but with smaller disks since no space is needed for data restore.
  • Exchange 2003 binaries pre-installed, with service packs applied.
• How we switch from production to DR:
  • Let’s say EVS1 is running on CLUSTER1, which is composed of SRV1 and SRV2, located in the production datacenter.
  • The entire cluster hosting EVS1 goes down.
  • The standby cluster for CLUSTER1 is CLUSTER11, composed of SRV11 only, which is online.
  • On CLUSTER11, we create the Exchange IP Address and Exchange Network Name resources with the same values as on the production cluster (same IP and same name, EVS1).
  • Bring the resources online.
  • Create the Exchange System Attendant resource. That will bring EVS1 back online on CLUSTER11.
  • In Exchange System Manager, manually mount the mailbox stores, forcing the creation of empty databases.
  • Users are back online with empty mailboxes.
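The switch to the standby cluster only works if the steps run in the right order, since each resource depends on the one created before it. The sketch below records that ordering as a dependency graph and derives a runbook order from it. It is an illustration, not a tool from the presentation: the step names are hypothetical labels for the actions listed above, and the code only sequences them, it does not talk to a cluster.

```python
from graphlib import CycleError, TopologicalSorter

# Each step maps to the set of steps that must be completed before it.
# The labels are illustrative names for the actions on the slide.
SWITCH_TO_DR = {
    "create_ip_address_resource": set(),
    "create_network_name_resource": {"create_ip_address_resource"},
    "bring_ip_and_name_online": {"create_network_name_resource"},
    "create_system_attendant_resource": {"bring_ip_and_name_online"},
    "mount_empty_mailbox_stores": {"create_system_attendant_resource"},
}


def ordered_steps(dependencies: dict[str, set[str]]) -> list[str]:
    """Return the steps in an order that respects every dependency."""
    try:
        return list(TopologicalSorter(dependencies).static_order())
    except CycleError as exc:
        raise ValueError(f"runbook contains a circular dependency: {exc.args[1]}") from exc


if __name__ == "__main__":
    for number, step in enumerate(ordered_steps(SWITCH_TO_DR), start=1):
        print(f"{number}. {step}")
```

Keeping the dependencies as data makes it harder to reorder steps by accident when the runbook is edited under pressure during a real failover.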

The Alternate Designs (Others)
• How we switch back from DR to production:
  • Take all resources offline on CLUSTER11.
  • Restore CLUSTER1 to its original state (whatever the cause of the failure was).
  • Bring all the resources online on CLUSTER1.
  • EVS1 will be back online on CLUSTER1.
  • Users are back to the state they were in at the moment of the failure.
  • EXMERGE the data out of the standby cluster and EXMERGE it into the production mailboxes.
• Solutions such as this are in place today, provide a 5~10 min switch time from the production to the standby cluster, and meet customer requirements.
• For Exchange 2003 and Outlook 2003 in cached mode, we see the following behavior when switching back and forth between the production and standby clusters:
  • OL2003 users are working in cached mode against EVS1, and EVS1 goes down.
  • OL2003 is now in the “Disconnected” state and the user continues to work normally offline.
  • We switch EVS1 to the standby cluster and bring the empty databases online.
  • OL2003 users see a popup saying that there has been a change and Outlook needs to be restarted.
  • The user restarts OL2003 and sees a dialog saying that Exchange is currently running in recovery mode, with the choice to either Connect or Work Offline.
  • If you choose to Work Offline, you will see your regular cached mode OST, with all your data, and work offline as usual.
  • If you choose to Connect, you will see your empty mailbox and will begin to send and receive new mail in the new mailbox.
  • Switching back, we take EVS1 offline again and switch it back to the production cluster.
  • The user will be required to restart Outlook and, once it has restarted, will be prompted again to Connect or Work Offline.
  • If you choose Connect, OL2003 will reconnect to the production mailbox and re-establish cached mode functionality.
  • The next time OL2003 is restarted, it will be back in the original cached mode state and will synchronize the mailbox again.
  • After EXMERGE, the messages generated in the standby cluster mailbox are merged back into the production mailbox.

The Alternate Designs (Others)
• Design Scenario (B) – Single Leg DR Clusters (Data Available)
• Disaster recovery requirements for Exchange:
  • To provide email service continuity in case of a complete cluster failure OR in case of temporary unavailability of the production datacenter.
  • Data recovery IS required.
  • Geo-clustering cannot be considered because the storage replication infrastructure cannot be afforded/qualified.
• Solution
  • Introduction of Sync Replication & Clone based copies (VSS).
  • Async/Sync replicate Transaction Logs to DR Site.
  • Clone presentation to DR Site.
  • Log Shipping. Is this log shipping?

The Alternate Designs (Others)
• Design Scenario (C) – Non-Clustered (Dial Tone or Data Available)
• Current Environment:
  • 3 Exchange 2003 Servers in the production environment (Site A)
  • 3 Exchange 2003 Servers in the DR environment (Site B)
  • 15K mailboxes (Outlook 97/98/2000/XP)
  • Site (A) located 50M away from Site (B)
  • Dark-fiber link between the 2 sites
• Disaster recovery requirements for Exchange:
  • To provide email service continuity in case of a complete server failure OR in case of temporary unavailability of one of the datacenters.
  • Data recovery IS required (but dial tone is also possible).
  • Geo-clustering cannot be considered because of internal political issues and qualification difficulties.

The Alternate Designs (Others)
• Solution:
  • 3 Exchange Servers, all located in Site A.
  • Synchronous replication will replicate Exchange I/O to the remote data center (Site B), 40 miles apart.
  • Site B will provide Business Continuity in the event of a primary site failure by offering three additional Exchange 2003 Servers.
    • DB and Log File Paths
    • Org and Admin Group Membership
  • Upon failure of Site A, the replicated database and log volumes of each of the three production Exchange Servers will be presented to their corresponding ‘standby server’ in Site B. For example:
      EXC1 -> EXC04
      EXC2 -> EXC05
      EXC3 -> EXC06
  • The Exchange Servers in Site B will take ownership/responsibility of serving all corporate messaging requirements and provide users with access to all mailbox data with no data loss.
  - AD attributes such as HOMEMDB and HOMEMTA will become incorrect. The following process details the steps required:
SCENARIO (1:0)
1. Site A goes down.
2. Open ADU&C and use the multiple-select options to select each of the mail-enabled user objects that were homed on the failed Exchange Servers.
3. Right-click the combined group selection and choose Exchange Tasks. The Exchange Task Wizard will launch. Follow through the wizard to delete the mailbox of each selected mail-enabled user object. NOTE: Pre-defined LDAP queries (querying the AD for HOMEMDB and HOMEMTA) can be created and saved into the ADU&C MMC. This will facilitate the quick identification of users homed on the failed servers.
4. Move the DBs as planned. (Storage level: executed by the presentation of target LUNs.)
5. Mount the appropriate databases within the MAILBOX RECOVERY CENTER (MRC).
6. Using the MRC, select all non-connected mailboxes using either the Ctrl or Shift key.
7. Right-click and select FIND MATCH. The matching wizard will launch and establish GUID matches between each deleted mailbox and an active user account.
8. Right-click the group selection for a second time and choose the RECONNECT option.
9. The mailbox reconnection feature will reconnect each of the deleted mailboxes with its original user account, updating the HOMEMDB and HOMEMTA attributes with the name of the new server (the standby that replaces the ‘dead server’) where the mailbox will now be homed.
10. Change the MAPI profile via network script/PROFMAN tools.
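The pre-defined LDAP queries mentioned in the note to step 3 could look something like the filters built below. This is a minimal sketch with several assumptions: the filter shape (`objectCategory=person`, `objectClass=user`, an exact match on `homeMDB`) is one reasonable way to find mailbox users by store, the distinguished name in the usage example is invented, and the helper names are hypothetical. The standby mapping simply restates the EXC1 -> EXC04 example from the slide.

```python
# Production-to-standby pairing, as given in the example on the slide.
STANDBY_FOR = {"EXC1": "EXC04", "EXC2": "EXC05", "EXC3": "EXC06"}

# Characters that must be escaped inside an LDAP filter value (RFC 4515).
_FILTER_ESCAPES = {"\\": r"\5c", "*": r"\2a", "(": r"\28", ")": r"\29", "\0": r"\00"}


def escape_filter_value(value: str) -> str:
    """Escape a literal value so it is safe to embed in an LDAP search filter."""
    return "".join(_FILTER_ESCAPES.get(ch, ch) for ch in value)


def users_homed_on_store(store_dn: str) -> str:
    """Build a filter matching user objects whose homeMDB is the given mailbox store DN."""
    return (
        "(&(objectCategory=person)(objectClass=user)"
        f"(homeMDB={escape_filter_value(store_dn)}))"
    )


if __name__ == "__main__":
    # Hypothetical DN, shortened for readability; a real store DN is much longer.
    example_dn = "CN=Mailbox Store (EXC1),CN=First Storage Group,DC=example,DC=com"
    print(users_homed_on_store(example_dn))
    print(f"EXC1 fails over to {STANDBY_FOR['EXC1']}")
```

The escaping matters here because default mailbox store names contain parentheses, which would otherwise break the filter syntax.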

The Alternate Designs (Others)
• SCENARIO (2:0) – Fail Back
  • Fail-back (return messaging service and data to the primary site [Site A]).
  • The procedure is identical to fail-over, but in reverse.
  • Ensure each of the standby servers is shut down prior to beginning the fail-back procedure.
  • The process should occur at a time of managed/planned downtime.

Product Roadmap Futures
• E12
  • Improve cluster failover operation
  • Log Shipping Support
  • Out of the Box Local Replication
  • I/O Operations Management

Questions from the Audience

Recommended Links
• Multi-site data replication support for Exchange 2003 and Exchange 2000: http://support.microsoft.com/default.aspx?scid=kb;en-us;895847
• Deployment Guidelines for Exchange Server Multi-Site Data Replication: http://www.microsoft.com/technet/prodtechnol/exchange/guides/E2k3DataRepl/bedf62a9-dff7-49a8-bd27-b2f1c46d5651.mspx
• Jetstress Tool: http://go.microsoft.com/fwlink/?LinkId=27883
• Achieving High Availability with Exchange Server at Microsoft: http://www.microsoft.com/technet/itsolutions/msit/operations/exchhighavailTSB.mspx
• Windows Server Catalogue: Geographically Dispersed Cluster Solutions: http://www.microsoft.com/windows/catalog/server/default.aspx?subID=22&xslt=categoryProduct&pgn=b55095f4-71f3-4b26-98b1-05f3a9506d0d
