
Making Domino Clusters a Key Part of Your Enterprise Disaster Recovery Architecture



  1. Making Domino Clusters a Key Part of Your Enterprise Disaster Recovery Architecture Andy Pedisich Technotics

  2. In This Session … • Clustering is one of my favorite Domino features • It was introduced with Release 4.5 in December of 1996 • We have seen increased adoption of Domino server clustering over the past two years • Much of that has been for disaster recovery • Some enterprises have been using Domino clustering for disaster recovery for years • This session shares as much information as possible about the topic, based on what I’ve seen, heard about, and discovered creating Domino-based disaster recovery solutions • Most importantly, the session addresses the all-important issues of managing failovers

  3. What We’ll Cover … • Convincing management that DR clusters rock • Exploring the choices for a clustered DR architecture • Mastering cluster replication • Setting up a private LAN for cluster traffic • Managing cluster failover and load balancing • Understanding the role of iNotes in your DR solution • Reviewing the 7 important rules for configuring DR clusters • Wrap-up

  4. Setting a Baseline for Concepts • First, let’s make sure everyone knows that when you see the letters DR, we’re talking about disaster recovery • Disaster recovery is part of a larger concept referred to as business continuity • That’s what we plan to do to keep our businesses running while disruptive events occur • These events could be extremely local, like a power outage in a building • Or they could be regional, like an earthquake or tornado, or a disaster caused by humans doing their thing • Disaster recovery focuses on the technology that supports business operations

  5. Why Use Domino for Disaster Recovery • Domino clustering has been accepted by many organizations as an important part of their DR infrastructure • Although that’s not a total endorsement, as corporations do wacky things • Clustering works with just about all things Lotus Notes/Domino, such as • Email, calendaring, Traveler, BlackBerry, and roaming users • Clustering is included as part of your enterprise server licenses • Clusters clearly should be exploited in every enterprise • And they should have a solid role in any DR solution for messaging and collaboration

  6. Important Facts to Help You Sell Clustering for DR • Here are some facts to back you up when you start making plans to use Domino clusters for DR • Clustering is a shrink-wrapped solution • It’s not new; as a matter of fact, it’s been burned in for years • It works really well • It’s automatic • It’s easy to set up and maintain • It’s easy to test • The biggest drawback for most companies using clusters is the increase in storage requirements, since clustering doubles their data size • Domino Attachment and Object Services (DAOS) can reduce that size by 30% to 40%

  7. What We’ll Cover … • Convincing management that DR clusters rock • Exploring the choices for a clustered DR architecture • Mastering cluster replication • Setting up a private LAN for cluster traffic • Managing cluster failover and load balancing • Understanding the role of iNotes in your DR solution • Reviewing the 7 important rules for configuring DR clusters • Wrap-up

  8. Basic Clustering Requirements • All servers in the cluster must: • Run on Domino Enterprise or Domino Utility server • Be connected using a high-speed local area network (LAN) or a high-speed wide area network (WAN) • Use TCP/IP and be on the same Notes named network • Be in the same Domino domain • Share a common Domino Directory • Have plenty of CPU power and memory • It’s safe to say that clustered servers need more power and more disk resources than unclustered servers • A server can be a member of only one cluster at a time

  9. Technical Considerations for Disaster Recovery Clusters • A few guidelines regarding configuring clusters, especially for clusters used for disaster recovery • Servers in a cluster should not use the same power source • Nor should they use the same network subnet or routers • They should not use the same disk storage array • They should not be in the same building • Your clustered DR solution should ideally be in different cities • Never in the same room

  10. Keep Your Distance • Good DR cluster designs should take into account both local and regional problems • Consider a financial company that had clustered servers in two separate buildings across the street from each other in Manhattan • This firm now has primary servers in offices in New York City • And failover servers thousands of miles away • Another firm has primary servers in Chicago • With a failover server in the UK • A college has primary and failover servers separated by 200 miles • Another company we know is just starting out with DR and has servers 25 miles from each other • A good start, but they really want more distance

  11. Servers Accessed Over a Wide Area Network • If servers are as far apart as they are supposed to be, there might be some latency in the synchronization of the data • If this is the case, users might find that the failover server is not fully up to date with their primary server during a failover • Everyone must be aware that this is a possibility • Expectations must be set • Or management needs to provide budgets for better networking • Work out all of these details in advance with management • Get them written down and approved so there are no surprises

  12. Most Common DR Cluster Configuration • The most common DR cluster configuration is active/passive • Servers on site are active • Servers at the failover site are passive, waiting for failover • Sometimes domains use this failover server as a place to do backups or journaling • The number of servers in the cluster varies • There could be 1 active and 1 passive • Or 2 or 3 active and 1 passive

  13. The Active/Passive Clustered Servers • Active/Passive servers • The cluster has two servers: one active, the other generally idle or used only for backups • A very common disaster recovery setup • Each server holds replicas of all the files from the other server • During failover, all users flip to the other server

  14. The 3 Server Cluster for DR • Three or more servers in the cluster • There are two replicas of each cluster-supported application or mail file • Each primary server holds the mail files of the users assigned to that server • Replicas of the mail files from both primary servers are on the failover server

  15. 3 Server Cluster with One Primary Server Down • If a primary server goes down, users from that server go to the failover server • Easy to understand, and you save yourself a server • You’ll still need twice the total disk space of Mail1 and Mail2 • What happens when both Mail1 and Mail2 are unavailable?

  16. Both Primary Servers Are Down • If both primary servers are down, the last server in the cluster has to support everyone • Remember that, generally speaking, only about 60% to 70% of the users assigned to a server are on it concurrently • Still, that has to be a pretty strong server with fast disks • Some sites have remote servers as primary, and failover happens at the home office data center

  17. What We’ll Cover … • Convincing management that DR clusters rock • Exploring the choices for a clustered DR architecture • Mastering cluster replication • Setting up a private LAN for cluster traffic • Managing cluster failover and load balancing • Understanding the role of iNotes in your DR solution • Reviewing the 7 important rules for configuring DR clusters • Wrap-up

  18. Understanding Cluster Replication • Cluster replication is event driven; it doesn’t run on a schedule • The cluster replicator detects a change in a database and immediately pushes the change to other replicas in the cluster • If a server is down or there is significant network latency, the cluster replicator stores changes in memory and pushes them out when it can • If a change to the same application happens before a previous change has been sent, the cluster replicator (CLREPL) gathers the changes and sends them all together
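A quick health check you can run at any time: dump the cluster replicator’s statistics from the Domino console with the standard show stat command (wildcards are accepted). The two numbers worth watching, covered a few slides from now, are Replica.Cluster.SecondsOnQueue and Replica.Cluster.WorkQueueDepth:

    show stat Replica.Cluster.*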

  19. Streaming Cluster Replication • R8 introduced Streaming Cluster Replication (SCR) • This newer functionality reduces replicator overhead • Provides a reduction in cluster replicator latency • As changes occur to databases, they are captured and immediately queued to other replicas in the same cluster • This makes cluster replication more efficient

  20. SCR Only Works with R8 • If one server in the cluster is R8 and another is not, Domino will attempt SCR first • When that doesn’t work, it will fall back on standard cluster replication • If you’d like to turn off SCR entirely to ensure compatibility, use this NOTES.INI parameter • DEBUG_SCR_DISABLED=1 • This must be used on all cluster mates
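If you prefer pushing settings from the console rather than editing NOTES.INI by hand, the same parameter can be written with the standard set config command on each cluster mate; keep in mind that some DEBUG settings only take effect after a restart:

    set config DEBUG_SCR_DISABLED=1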

  21. Only One Cluster Replicator by Default • When a cluster is created, each server has only a single cluster replicator instance • If there have been a significant number of changes to many applications, a single cluster replicator can fall behind • Database synchronization won’t be up to date • If a server fails when database synchronization has fallen behind, users will think their mail file or app is “missing data” • They won’t understand why all the meetings they made this morning are not there • They think their information is gone forever! • Users need their cluster insurance!

  22. Condition Is Completely Manageable • Adding a cluster replicator will help fix this problem • You can load cluster replicators manually using the following console command • Load CLREPL • Note that a manually loaded cluster replicator will not survive a server restart • To add cluster replicators permanently to a server, use this parameter in the NOTES.INI • CLUSTER_REPLICATORS=# • I always use at least two cluster replicators
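As a minimal sketch of the permanent approach, the NOTES.INI line for two replicators (my usual minimum) looks like this; after a restart, the extra task shows up in the output of show tasks as an additional Cluster Replicator:

    CLUSTER_REPLICATORS=2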

  23. When to Add Cluster Replicators • But how do you tell if there’s a potential problem? • Do you let it fail and then wait for the phone to ring? • No! You look at the cluster stats and get the data you need to make an intelligent decision • Adding too many will have a negative effect on server performance • Here are some important statistics to watch

  24. Key Stats for Vital Information About Cluster Replication

  25. What to Do About Stats Over the Acceptable Limit • Replica.Cluster.SecondsOnQueue • The queue is checked every 15 seconds, so under light load it should be less than 15 • Under heavy load, if the number is larger than 30, another cluster replicator should be added • If the above statistic is low and Replica.Cluster.WorkQueueDepth is constantly higher than 10 … • Perhaps your network bandwidth is too low • Consider setting up a private LAN for cluster replication traffic
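If the numbers say you need another replicator right now, a common sequence is to write the parameter with set config so it survives restarts, then load one immediately (the value 3 here is only an example):

    set config CLUSTER_REPLICATORS=3
    load clrepl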

  26. Stats That Have Meaning but Have Gone Missing • There aren’t any views in the Lotus version of Statrep that let you see these important statistics • As a matter of fact, the Cluster view is pretty worthless

  27. The Documents Have More Information • The cluster documents have much better information • You can actually use the data in the docs • But the views still lack key stats, even though the stats are in each doc

  28. Stats That Have Meaning but Have Gone Missing • But there is a view like that in the Technotics R8.5 Statrep.NTF • It shows the key stats you need • To help track and adjust your clusters • It is included on the CD for this conference

  29. My Column Additions to Statrep

  30. Use a Scheduled Connection Document Also • Back up your cluster replication with a scheduled connection document between servers • Have it replicate at least once per hour • You’ll always be assured your servers are in sync, even if one has been down for a few days • And it replicates deletion stubs too!
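As a rough sketch of such a connection document, the relevant fields in the Domino Directory would look something like the following; the server names and schedule values here are invented for the example, so substitute your own:

    Connection type:     Local Area Network
    Source server:       Mail1/Servers/Domlab
    Destination server:  Mail2/Servers/Domlab
    Replication task:    Enabled
    Schedule:            ENABLED
    Connect at times:    12:00 AM - 11:59 PM
    Repeat interval of:  60 minutes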

  31. What We’ll Cover … • Convincing management that DR clusters rock • Exploring the choices for a clustered DR architecture • Mastering cluster replication • Setting up a private LAN for cluster traffic • Managing cluster failover and load balancing • Understanding the role of iNotes in your DR solution • Reviewing the 7 important rules for configuring DR clusters • Wrap-up

  32. Busy Clusters Might Require a Private LAN • A private LAN separates the network traffic the cluster creates for replication and server probes • And will probably leave more room on your primary LAN • Start by installing an additional network interface card (NIC) for each server in the cluster • Connect the NICs through a private hub or switch

  33. Setting Up the Private LAN • Assign a second IP address to the additional NIC • Assign host names to the addresses in the local HOSTS file on each server • Using DNS is a best practice • 10.200.100.1 mail1_clu.domlab.com • 10.200.100.2 mail2_clu.domlab.com • Test by pinging the new hosts from each server
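The ping test is simply done against each cluster mate’s private host name, using the example names above, from each server’s console or command prompt:

    ping mail1_clu.domlab.com
    ping mail2_clu.domlab.com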

  34. Modify Server Document For each server in the cluster, edit the server document to enable the new port

  35. Set Parameters so Servers Use Private LAN • Make your clusters use the private LAN for cluster traffic by establishing the ports in the server NOTES.INI with these parameters • CLUSTER=TCP,0,15,0 • PORTS=TCPIP,CLUSTER • CLUSTER_TCPIPADDRESS=0,10.200.100.2:1352 • You will use the address of your own NIC

  36. Parameters to Make the Cluster Use the Port • Use the following parameter to ensure Domino uses the port for cluster traffic • SERVER_CLUSTER_DEFAULT_PORT=CLUSTER • Use this parameter just in case the CLUSTER port you’ve configured isn’t available • SERVER_CLUSTER_AUXILIARY_PORTS=* • This allows clustering to use any port if the one you’ve defined isn’t available

  37. Keep Users Off the Private LAN • To keep users from grabbing on to the private LAN port, take the following steps • Create a group called ClusterServers • Add the servers in the cluster to this group • Add the following parameter to the NOTES.INI of both servers • It will keep users from connecting through the CLUSTER port • Allow_Access_Cluster=ClusterServers
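Pulling the last few slides together, the cluster-related section of one server’s NOTES.INI ends up looking roughly like this; the IP address shown is the example address for Mail2’s private NIC, so use your own:

    PORTS=TCPIP,CLUSTER
    CLUSTER=TCP,0,15,0
    CLUSTER_TCPIPADDRESS=0,10.200.100.2:1352
    SERVER_CLUSTER_DEFAULT_PORT=CLUSTER
    SERVER_CLUSTER_AUXILIARY_PORTS=*
    Allow_Access_Cluster=ClusterServers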

  38. What We’ll Cover … • Convincing management that DR clusters rock • Exploring the choices for a clustered DR architecture • Mastering cluster replication • Setting up a private LAN for cluster traffic • Managing cluster failover and load balancing • Understanding the role of iNotes in your DR solution • Reviewing the 7 important rules for configuring DR clusters • Wrap-up

  39. Respect the Users • Clustering provides outstanding service levels for users • But the process of failing over is sometimes hard on users • Failover is actually the most difficult moment for users • And sometimes errors in network configuration might prevent successful failover • For example, the common name of the server should be listed as an alias in DNS to ensure users can easily open their applications on the servers • If the server is not in DNS, the clients won’t know how to get to the failover servers

  40. Best Practice for Cluster Management • Best Practice: • Don’t take a clustered server down during working hours unless it is absolutely necessary • An unplanned server outage, such as a crash or power failure, is a legitimate reason to fail over • Resist the urge to take a server down just because you know it’s clustered • You could probably do it, but the risk of a hard failover will probably cause unwanted help desk calls

  41. Easiest Cluster Configuration to Manage • The Active/Passive model of clustering is by far the easiest to manage • Use parameters in the NOTES.INI file on the servers in the cluster that allow users on the primary server • But don’t allow them on the failover server

  42. Check for Server Availability • The parameters we use are thresholds that check the cluster statistic Server.AvailabilityIndex (SAI) • This statistic shows how busy the server is • 100 means it’s not busy at all • 0 means it’s crazy busy

  43. Adjusting the Threshold • Setting the parameter Server_Availability_Threshold controls whether users can access the server • 50 means that if the SAI falls below 50, users fail over to another server • A setting like this can be used for load balancing • 100 means the SAI must be 100, which means the server must be 100% available • This translates into “nobody is allowed on the server” • 0 means that load balancing and checking the SAI are turned off • These thresholds can come in handy

  44. Setting Up Active/Passive Servers in a Cluster • Let’s look at the following scenario • Mail1 is the active primary; Mail2 is the passive failover • To allow users to access their primary server, use this parameter in the NOTES.INI of Mail1 • Server_Availability_Threshold=0 • Or use this console command: • Set config Server_Availability_Threshold=0 • To prevent users from accessing the failover server Mail2, use this parameter • Server_Availability_Threshold=100 • Or use this console command: • Set config Server_Availability_Threshold=100 • Administrators will still be able to access this server

  45. Mail1 Is Crashing • If Mail1 crashes, the Notes client will disregard our setting of 100 on Mail2 and users will be permitted on • To help stabilize the system, use this parameter on Mail2 • Server_Availability_Threshold=0 • Let all users aboard • While Mail1 is down, enter this parameter into its NOTES.INI to prevent users from connecting: • Server_Restricted=2 • A setting of 2 persists after a server restart • Setting it to 1 also keeps users off, but it reverts to 0 when the server restarts

  46. Recovering After a Crash • When Mail1 is brought back up after the crash, no one will be permitted to access it except administrators • That’s because of the Server_Restricted=2 setting • Leave it that way until the end of the day • The ugliest part about failing over is the client part • Clients are working just fine on Mail2 • By the way, IMAP and POP3 users still have access to Mail1 • At the day’s end, switch the Server_Availability_Threshold back to 100 on Mail2 and 0 on Mail1 • Issue this console command on Mail2 • Drop all
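One way to sequence that end-of-day failback from the consoles, using only the parameters already discussed (including the Server_Restricted reset covered on the next slide):

    Mail1:  set config Server_Restricted=0
            set config Server_Availability_Threshold=0
    Mail2:  set config Server_Availability_Threshold=100
            drop all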

  47. Taking a Clustered Server Down Voluntarily • If you must take the clustered Mail1 server down, set Server_Restricted=2, then issue drop all at the console • Remember that POP3 and IMAP users won’t like you very much • Set Server_Availability_Threshold to 0 on Mail2 • Don’t forget to set Server_Restricted back to 0 when the crisis has passed • I know someone who forgot to do this a couple of times • This person could access the server and work because he was in an administrators group • However, nobody else could get on the server, and he made all the users very angry

  48. Triggering Failover • You can set the maximum number of concurrent NRPC users allowed to connect to a server • Server_MaxUsers NOTES.INI variable • Set the variable to a number determined in the planning stage • Set the variable using a console command • Set config Server_MaxUsers = desired maximum number of active concurrent users • Or use the NOTES.INI tab in the server configuration document • Additional users will fail over to the other members of the cluster
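For example, to cap a server at 1,200 concurrent NRPC sessions (the number itself is purely illustrative and should come out of your planning):

    set config Server_MaxUsers=1200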

  49. Load Balancing • If you’d like to load balance your servers, determine your comfort range for how busy your servers are and set the Server_Availability_Threshold accordingly • Perhaps start with a value of 60 • Users should fail over when the SAI goes below 60 • Pay close attention to the SAI in STATREP.NSF, where it is listed under the Av Inx column • Some hardware can produce inaccurate SAI readings and cause users to fail over when it’s not necessary
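While you tune the threshold, both the setting and the live index can be handled from the console; for instance, with the starting value of 60 suggested above:

    set config Server_Availability_Threshold=60
    show stat Server.AvailabilityIndex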

  50. SAI Is Unreliable in Some Cases • Note how in this case, the Server Availability Index never seems to get much above 50 consistently • Users would be failing over constantly • And if both servers had the issue, users would be bouncing back and forth between the clustered servers
