1. Using OpenVMS Clusters for Disaster Tolerance
Keith Parris
Systems/Software Engineer, HP Services – Multivendor Systems Engineering
Seminar P142
Sunday, 18 May 2003
3. High Availability (HA): Ability for application processing to continue with high probability in the face of common (mostly hardware) failures
Typical technologies:
Redundant power supplies and fans
RAID for disks
Clusters of servers
Multiple NICs, redundant routers
Facilities: Dual power feeds, n+1 air conditioning units, UPS, generator. Here we define High Availability.
4. Fault Tolerance (FT): The ability for a computer system to continue operating despite hardware and/or software failures
Typically requires:
Special hardware with full redundancy, error-checking, and hot-swap support
Special software
Provides the highest availability possible within a single datacenter. Fault Tolerant products go beyond the usual High Availability products. HP NonStop products fall into the Fault Tolerant category.
No matter how good a High Availability or Fault Tolerant solution you have, if the datacenter it resides in is destroyed, it can no longer support your business.
5. Disaster Recovery (DR): Disaster Recovery is the ability to resume operations after a disaster
Disaster could be destruction of the entire datacenter site and everything in it
Implies off-site data storage of some sort. Disaster Recovery is a fairly well-understood concept in the industry, and most companies plan for DR needs. Key here is storing a copy of business data in a safe location, off-site.
6. Disaster Recovery (DR): Typically,
There is some delay before operations can continue (many hours, possibly days), and
Some transaction data may have been lost from IT systems and must be re-entered. In a typical DR scenario, significant time elapses before one can resume IT functions, and some amount of data typically needs to be re-entered to bring the system data back up to date.
7. Disaster Recovery (DR): Success hinges on ability to restore, replace, or re-create:
Data (and external data feeds)
Facilities
Systems
Networks
User access. The key to a successful recovery is to be able to restore your data onto suitable computing hardware in a suitable facility, with access by the users.
8. DR Methods: Tape Backup. Data is copied to tape, with off-site storage at a remote site
Very common method. Inexpensive.
Data lost in a disaster is: all the changes since the last tape backup that is safely located off-site
There may be significant delay before data can actually be used. Tape backups are the most common method companies use for DR. Cost is low. Some care is needed to ensure the backup tapes actually get to an off-site storage location, and in a timely fashion.
Some amount of data will be lost in a disaster, so some data re-entry is required. This assumes that some other record (i.e. paper) of all transactions exists, from which the data can be re-entered.
It often takes multiple days to get back up and running after a disaster.
9. DR Methods: Vendor Recovery Site. Vendor provides datacenter space, compatible hardware, networking, and sometimes user work areas as well
When a disaster is declared, systems are configured and data is restored to them
Typically there are hours to days of delay before data can actually be used. Here, you contract with a vendor such as SunGard to have the right to use their facilities and equipment in the event that you suffer a disaster.
Their business is based on the assumption that not everyone will have a disaster at once, and they need only procure enough hardware to accommodate a few disasters at once. Out of that pool of hardware, they will start connecting together a suitable configuration immediately when you declare a disaster.
You must formally declare a disaster to invoke your right to use their facilities. At that point, they typically connect together the equipment you need, and give you access to use it.
These vendors also typically allow you to schedule a disaster recovery test at some interval, such as once a year.
10. DR Methods: Data Vaulting. Copy of data is saved at a remote site
Periodically or continuously, via network
Remote site may be own site or at a vendor location
Minimal or no data may be lost in a disaster
There is typically some delay before data can actually be used. You can copy data periodically or continuously to another site so that there’s always a copy at another site. This might involve copying data to tape drives at a remote site, or to disks.
The remote site might be at a second site that you own, or might be provided by a vendor.
In order to actually use that data, it would need to be restored to a suitable hardware configuration with network and user access and so forth.
Some computer vendors have a program that allows you to get an emergency order for replacement hardware shipped out quickly in the event of a disaster. DEC and Compaq had the Recover-All program, and HP has the Quick Ship program. This can help reduce the time needed to replace hardware destroyed by a disaster.
11. DR Methods: Hot Site. Company itself (or a vendor) provides pre-configured compatible hardware, networking, and datacenter space
Systems are pre-configured, ready to go
Data may already be resident at the Hot Site thanks to Data Vaulting
Typically there are minutes to hours of delay before data can be used. Because the time required to configure (or even procure) suitable system hardware to access the data can be so great, some businesses choose to have a hardware configuration in place and ready to take over in the event of a disaster. This may be at another site the business owns, or at a vendor’s location, and the equipment may either belong to the business, or be provided by the vendor under the terms of its contract.
12. Disaster Tolerance vs. Disaster Recovery: Disaster Recovery is the ability to resume operations after a disaster.
Disaster Tolerance is the ability to continue operations uninterrupted despite a disaster. The term Disaster Tolerance is one which is sometimes used incorrectly in the industry, particularly by vendors who can’t really achieve it. Disaster Recovery is a well-understood term, but Disaster Tolerance is a newer term and often misused.
Disaster Recovery is the art and science of preparing so that one is able to recover after a disaster and once again be able to operate the business.
Disaster Tolerance is much harder, as it involves designing things such that business can continue in the face of a disaster, without the end-user or customer noticing any adverse effect.
The effects of global business across timezones, 24x7x365 operations, e-Commerce, and the risks of the modern world are pressuring more and more businesses to achieve Disaster Tolerance in their business operations. We’ll be focusing mostly on Disaster Tolerance in the IT field, but the same principles apply to the business in general.
13. Disaster Tolerance: Ideally, Disaster Tolerance allows one to continue operations uninterrupted despite a disaster:
Without any appreciable delays
Without any lost transaction data. The ideal DT solution would result in no downtime and no lost data, even in a disaster. Such solutions do exist, but they cost more than solutions which have some amount of downtime or data loss associated with a disaster.
14. Disaster Tolerance: Businesses vary in their requirements with respect to:
Acceptable recovery time
Allowable data loss
Technologies also vary in their ability to achieve the ideals of no data loss and zero recovery time
OpenVMS Cluster technology today can achieve:
zero data loss
recovery times in the single-digit seconds range. Different businesses have different levels of risk with regard to loss of data and to potential downtime.
Different technical solutions also provide different levels of capability with respect to these business needs.
The ideal solution would have no downtime and allow no data to be lost. In my experience, a disaster-tolerant OpenVMS Cluster comes the closest to those ideals of any available solution, supporting zero loss of data in a datacenter failure, with recovery times as low as just a few seconds.
15. Measuring Disaster Tolerance and Disaster Recovery Needs: Determine requirements based on business needs first
Then find acceptable technologies to meet the needs of the business. It is important to understand the financial costs and risks to the business itself with regard to downtime and potential data loss. Once these are understood, the best available solution can be chosen.
16. Measuring Disaster Tolerance and Disaster Recovery Needs: Commonly-used metrics:
Recovery Point Objective (RPO):
Amount of data loss that is acceptable, if any
Recovery Time Objective (RTO):
Amount of downtime that is acceptable, if any. Recovery Point Objective and Recovery Time Objective are two metrics commonly used in the industry to measure the goals of a DR or DT solution.
17. Disaster Tolerance vs. Disaster Recovery: Definitions among vendors vary, but in general, Disaster Tolerance is the subset of possible solutions in the overall Disaster Recovery space that minimize RPO (data loss) and RTO (downtime).
18. Recovery Point Objective (RPO): Recovery Point Objective is measured in terms of time
RPO indicates the point in time to which one is able to recover the data after a failure, relative to the time of the failure itself
RPO effectively quantifies the amount of data loss permissible before the business is adversely affected. RPO is a metric used to quantify how much data loss is acceptable without adversely affecting the business.
19. Recovery Time Objective (RTO): Recovery Time Objective is also measured in terms of time
Measures downtime:
from time of disaster until business can continue
Downtime costs vary with the nature of the business, and with outage length. RTO is the metric used to quantify how much downtime can occur without adversely affecting the business.
20. Downtime Cost Varies with Outage Length: The potential cost of downtime actually varies with the length of the outage. A 2-hour outage will not necessarily cost twice what a 1-hour outage costs. This graph tries to show this non-linear relationship. For this particular hypothetical business, a 1-minute outage has no associated cost with it; its customers basically don’t even notice an outage of that short a duration. An outage of 1 hour results in some small amount of lost revenue, but the impact is still not severe. An outage of 1 day for this business, however, starts to result in significant losses, and for an outage of 1 week, the cost is so great (with many customers losing confidence and permanently moving their business to another vendor as a result) that the business itself would be at risk of totally failing.
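The non-linear shape described above can be sketched numerically. The following is a minimal illustration only; the cost figures and the growth exponent are hypothetical and merely mirror the qualitative curve described (negligible cost at one minute, business-threatening cost at a week). Real numbers must come from your own business analysis.

```python
# Illustrative sketch of a non-linear downtime cost curve.
# All figures are hypothetical; real numbers must come from business analysis.

def downtime_cost(outage_minutes: float) -> float:
    """Return an illustrative cost (in dollars) for an outage of a given length."""
    if outage_minutes <= 1:            # customers barely notice a one-minute blip
        return 0.0
    # Super-linear growth: each additional hour of outage hurts more than the last.
    return 100.0 * (outage_minutes ** 1.6)

for label, minutes in [("1 minute", 1), ("1 hour", 60), ("1 day", 1440), ("1 week", 10080)]:
    print(f"{label:>8}: ${downtime_cost(minutes):,.0f}")
```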
21. Examples of Business Requirements and RPO / RTO: Greeting card manufacturer
RPO zero; RTO 3 days
Online stock brokerage
RPO zero; RTO seconds
Lottery
RPO zero; RTO minutes. A greeting card manufacturer I once spoke with was very concerned that they never, ever lose an order from one of their stores. That would be too great a customer satisfaction problem. But since stores typically order only about once a week, it wouldn’t really hurt, they told me, if a store couldn’t place an order, even for several days, as long as the order was reliably delivered once it had been placed.
An online stock brokerage cannot ever lose an order in conjunction with an outage, because then they would be subject to massive risks of fraud, from customers who might claim, for example, that they had placed a large sell order just before the price on a stock dropped dramatically, or placed a large buy order just before the price jumped astronomically. Also, the brokerage has to cover the financial risk during downtime of orders placed before an outage, and that risk is related to the price movement of the stocks. Potential losses of millions of dollars per hour of downtime are not unusual in that industry.
A lottery absolutely cannot lose transactions for the same reason of potential fraud. Also, if tickets cannot be sold due to downtime, the potential revenue loss builds rapidly.
22. Examples of Business Requirements and RPO / RTO: ATM machine
RPO minutes; RTO minutes
Semiconductor fabrication plant
RPO zero; RTO minutes; but data protection by geographical separation not needed. Surprisingly, an ATM network can afford a small amount of data loss. This is because the financial transactions are relatively small (typically $500 tops), so the financial risk is relatively low, and the ATM machine itself keeps a log of transactions that can be used to resolve discrepancies. But if an ATM network is down for too long, revenue for transaction fees starts to be lost.
The computers that operate a semiconductor fabrication plant keep track of which processing steps have been applied to each silicon wafer. Without this knowledge, a wafer becomes worthless as sand. Excessive downtime keeps production from moving ahead. Interestingly, these businesses don’t worry about keeping a copy of data at a safe distance from the fabrication plant, because if a disaster affects the fabrication plant along with the datacenter, production can be shifted to another fab. Some separation is desirable to protect against things like building fires, but for a fabrication plant along an earthquake fault line, it does not help to protect the data on in-progress work by putting a copy a long distance away when the fabrication plant itself is likely to be damaged by the same earthquake that takes out the datacenter.
23. Recovery Point Objective (RPO): RPO examples, and technologies to meet them:
RPO of 24 hours: Backups at midnight every night to off-site tape drive, and recovery is to restore data from set of last backup tapes
RPO of 1 hour: Ship database logs hourly to remote site; recover database to point of last log shipment
RPO of zero: Mirror data strictly synchronously to remote site. Nightly tape backups can provide an RPO of a maximum of 24 hours, provided tapes are whisked off-site immediately after being written, or the tape drives are located in a different site from the data disks.
Database logs copied every hour to a remote site can provide an RPO of 1 hour at most.
To achieve zero data loss, data must be written to both sites simultaneously, and a transaction must not be allowed to complete until the data has safely landed at both sites (not awaiting transport between sites or still in transit).
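To make the zero-RPO rule concrete, here is a minimal sketch (not taken from the presentation, and not any particular product's replication engine) of a write path that acknowledges a transaction only after both sites confirm the write. The Site class and data values are hypothetical stand-ins.

```python
# Minimal sketch of the zero-RPO rule: a transaction is acknowledged only after
# the write is durable at BOTH sites. Site objects and data are hypothetical stand-ins.

class Site:
    def __init__(self, name: str):
        self.name = name
        self.storage: dict[int, bytes] = {}    # stand-in for stable (on-disk) storage

    def write(self, block: int, data: bytes) -> bool:
        # Assume this returns only once the data is on stable storage at this site.
        self.storage[block] = data
        return True

def synchronous_write(sites: list["Site"], block: int, data: bytes) -> None:
    """Complete only when every site confirms; otherwise the transaction must not commit."""
    if not all(site.write(block, data) for site in sites):
        raise IOError("write not durable at all sites; do not acknowledge the transaction")

sites = [Site("Site A"), Site("Site B")]
synchronous_write(sites, block=42, data=b"order #1001")
print("both sites confirmed; the transaction may now complete (RPO = 0 for this write)")
```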
24. Recovery Time Objective (RTO): RTO examples, and technologies to meet them:
RTO of 72 hours: Restore tapes to configure-to-order systems at vendor DR site
RTO of 12 hours: Restore tapes to system at hot site with systems already in place
RTO of 4 hours: Data vaulting to hot site with systems already in place
RTO of 1 hour: Disaster-tolerant cluster with controller-based cross-site disk mirroring
RTO of seconds: Disaster-tolerant cluster with bi-directional mirroring, CFS, and DLM allowing applications to run at both sites simultaneously. Restoration of backup tapes to systems assembled and configured by a vendor at their DR site after declaration of a disaster could meet an RTO requirement of 3 days (72 hours).
To meet an RTO of 12 hours, systems would probably have to already be in place at another site, perhaps a vendor hot site.
To meet an RTO requirement of 4 hours, one would not really have time to restore data from tapes. The data would need to be on disk already at another site and connected to servers. Failover of users to these in-place systems could most likely be accomplished in 4 hours, with proper pre-planning and appropriate network equipment and fail-over procedures in place and pre-tested.
To meet an RTO of 1 hour would most likely require a disaster-tolerant cluster. One could use controller-based mirroring instead of host-based mirroring, because failover from one site to another can be accomplished in somewhere between 15 minutes and 1 hour, depending on the solution.
To meet an RTO of mere seconds would require a full-blown disaster-tolerant cluster with the ability to continually run the applications on all nodes at both sites, so that one is not so much doing failover as simply redirecting clients to an already-running copy of the application at the surviving site.
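The two ladders above (RTO tiers and RPO tiers) can be restated as a small decision helper. The sketch below simply echoes the examples from these slides; the thresholds are illustrative round numbers, not a product selection matrix.

```python
# Small decision helper restating the RPO/RTO ladders from these two slides.
# Thresholds and wording are illustrative; they echo the slide examples only.

def suggest_technology(rto_hours: float, rpo_hours: float) -> str:
    if rto_hours < 0.02:                      # seconds-level RTO
        tech = "disaster-tolerant cluster, applications active at both sites"
    elif rto_hours <= 1:
        tech = "disaster-tolerant cluster with cross-site disk mirroring"
    elif rto_hours <= 4:
        tech = "data vaulting to a hot site with systems already in place"
    elif rto_hours <= 12:
        tech = "tape restore onto systems already in place at a hot site"
    else:
        tech = "tape restore onto configure-to-order systems at a vendor DR site"

    if rpo_hours == 0:
        tech += " + strictly synchronous data replication"
    elif rpo_hours <= 1:
        tech += " + hourly database log shipping"
    else:
        tech += " + nightly off-site tape backups"
    return tech

print(suggest_technology(rto_hours=0.01, rpo_hours=0))   # brokerage-style requirement
print(suggest_technology(rto_hours=72, rpo_hours=24))    # tape-backup-style requirement
```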
25. Technologies: Clustering
Inter-site links
Foundation and Core Requirements for Disaster Tolerance
Data replication schemes
Quorum schemes. Next, we’ll look at the underlying technologies used to implement a disaster-tolerant solution, including clustering, inter-site link(s), the fundamentals required for DT, data replication, and quorum schemes.
26. Clustering: Allows a set of individual computer systems to be used together in some coordinated fashion
27. Cluster types: Different types of clusters meet different needs:
Scalability Clusters allow multiple nodes to work on different portions of a sub-dividable problem
Workstation farms, compute clusters, Beowulf clusters
Availability Clusters allow one node to take over application processing if another node fails
Our interest here concerns Availability Clusters. For DT, we’re concerned with High Availability clusters rather than Scalability clusters.
28. Availability Clusters: Transparency of failover and degrees of resource sharing differ:
“Shared-Nothing” clusters
“Shared-Storage” clusters
“Shared-Everything” clusters. Cluster implementations vary in their degrees of resource sharing. Cluster solutions with greater ability to share (and coordinate access to) shared resources can typically provide higher availability and better disaster-tolerance.
29. “Shared-Nothing” Clusters: Data is partitioned among nodes
No coordination is needed between nodes. In Shared-Nothing clusters, data is typically divided up across separate nodes. The nodes have little need to coordinate their activity, since each has responsibility for different subsets of the overall database. But in strict Shared-Nothing clusters, if one node is down, its fraction of the overall data is unavailable.
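A minimal, product-neutral sketch of the shared-nothing idea follows: records are partitioned across nodes, and the partition owned by a down node is simply unreachable. Node names and the partitioning function are illustrative assumptions.

```python
# Shared-nothing sketch: each key is owned by exactly one node, with no
# cross-node coordination. Node names and the partitioning scheme are illustrative.
import zlib

NODES = ["NODE_A", "NODE_B", "NODE_C"]
DOWN = {"NODE_B"}                         # pretend this node has failed

def owner(key: str) -> str:
    """Deterministically map a key to the single node responsible for it."""
    return NODES[zlib.crc32(key.encode()) % len(NODES)]

def read(key: str) -> str:
    node = owner(key)
    if node in DOWN:
        # In a strict shared-nothing design, no other node can serve this partition.
        raise RuntimeError(f"{key!r} lives on {node}, which is down: data unavailable")
    return f"{key!r} served by {node}"

for k in ["customer-1001", "customer-1002", "customer-1003"]:
    try:
        print(read(k))
    except RuntimeError as err:
        print(err)
```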
30. “Shared-Storage” Clusters: In simple “Fail-over” clusters, one node runs an application and updates the data; another node stands idly by until needed, then takes over completely
In Shared-Storage clusters which are more advanced than simple “Fail-over” clusters, multiple nodes may access data, but typically one node at a time “serves” a file system to the rest of the nodes, and performs all coordination for that file system. For high availability to be provided, some shared access to the data disks is needed. In Shared-Storage clusters, at the low end, one gets the ability for one node to take over the storage (and applications) in the event another node fails. In higher-level solutions, simultaneous access to data by applications running on more than one node at a time is possible, but typically a single node at a time is responsible for coordinating all access to a given data disk, and serving that storage to the rest of the nodes. That single node can become a bottleneck for access to the data in busier configurations.
31. “Shared-Everything” Clusters: “Shared-Everything” clusters allow any application to run on any node or nodes
Disks are accessible to all nodes under a Cluster File System
File sharing and data updates are coordinated by a Lock Manager. In a Shared-Everything cluster, all nodes can access all the data disks simultaneously, without one node having to go through another node to get access to the data. Clusters which provide this offer a cluster-wide file system (CFS), so all the nodes view the file system(s) identically, and typically provide a Distributed Lock Manager (DLM) to allow the nodes to coordinate the sharing and updating of files, records, and databases.
Some application software, such as Oracle Parallel Server or Oracle 9i/RAC, provides the appearance of a shared-everything cluster for the users of its specific application (in this case, the Oracle database), even when running in a less-sophisticated Shared-Storage cluster, because it provides capabilities analogous to the CFS and DLM of a shared-everything cluster.
32. Cluster File System: Allows multiple nodes in a cluster to access data in a shared file system simultaneously
View of file system is the same from any node in the cluster. A Clusterwide File System provides the same view of disk data from every node in the cluster. This means the environment on each node can appear identical to both the users and the application programs, so it doesn’t matter on which node the application or user happens to be running at any given time.
HP TruClusters and OpenVMS Clusters provide a CFS, and HP-UX is slated to gain the CFS from TruClusters by 2004.
33. Lock Manager: Allows systems in a cluster to coordinate their access to shared resources:
Devices
File systems
Files
Database tables. The Distributed Lock Manager is used to coordinate access to shared resources in a cluster.
HP TruClusters and OpenVMS Clusters provide a DLM, and HP-UX is slated to gain the DLM from TruClusters by 2004.
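The coordination pattern a DLM enables can be sketched generically: a node takes a cluster-wide lock on a named resource before updating whatever that resource protects. The sketch below is a single-process Python stand-in (ordinary thread locks), not the OpenVMS lock manager interface, and the lock mode is simplified to exclusive-only; the resource name is a made-up placeholder.

```python
# Generic illustration of DLM-style coordination: take a lock on a resource name
# before updating the shared resource it protects. This is a single-process
# stand-in using threading locks, not a real distributed lock manager.
import threading
from contextlib import contextmanager

_lock_table: dict[str, threading.Lock] = {}      # resource name -> lock
_table_guard = threading.Lock()

@contextmanager
def exclusive_lock(resource: str):
    """Acquire an exclusive lock on a named resource; release it on exit."""
    with _table_guard:
        lock = _lock_table.setdefault(resource, threading.Lock())
    lock.acquire()
    try:
        yield
    finally:
        lock.release()

shared_balance = 0

def update_record(node: str, amount: int) -> None:
    global shared_balance
    # In a real cluster the resource name would identify a file, record, or table.
    with exclusive_lock("ACCOUNTS.DAT;record-42"):
        shared_balance += amount
        print(f"{node}: balance now {shared_balance}")

threads = [threading.Thread(target=update_record, args=(f"NODE{i}", 10)) for i in range(3)]
for t in threads: t.start()
for t in threads: t.join()
```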
34. Multi-Site Clusters: Consist of multiple sites in different locations, with one or more systems at each site
Systems at each site are all part of the same cluster, and may share resources
Sites are typically connected by bridges (or bridge-routers; pure routers don’t pass the special cluster protocol traffic required for many clusters)
e.g. SCS protocol for OpenVMS Clusters. Clusters can consist of nodes in multiple sites.
MC/Service Guard requires that the heartbeat between cluster nodes be passed within the same subnet, and cannot cross a router. OpenVMS Clusters use the SCS protocol for inter-node communications, which also cannot be routed. So bridges (or bridge-routers with bridging enabled) would typically be used to connect multiple-site clusters with these products.
35. Multi-Site Clusters: Inter-site Link(s). Sites linked by:
E3 (DS-3/T3 in USA) or ATM circuits from a telecommunications vendor
Microwave link: E3 or Ethernet bandwidths
Free-Space Optics link (short distance, low cost)
“Dark fiber” where available. ATM over SONET, or:
Ethernet over fiber (10 Mb, Fast, Gigabit)
FDDI (up to 100 km)
Fibre Channel
Fiber links between Memory Channel switches (up to 3 km)
Wave Division Multiplexing (WDM), in either Coarse or Dense Wave Division Multiplexing (DWDM) flavors
Any of the types of traffic that can run over a single fiber. Various technologies are available to link multi-site clusters.
Data circuits at various speeds and costs are available from telecommunications providers such as AT&T, Sprint, MCI, and the regional Bell Operating Companies.
If line-of-sight is available, a private microwave link may be possible. If the distance is short (say no more than a mile or so), then a free-space optical link, which uses lasers or LEDs, may be possible; it is very inexpensive and requires no FCC license like a microwave link would.
Dark fiber can be very cost-effective if it is available in your area. Vendors such as Metro Media Fiber Networks (mmfn.com) or Nortel can provide dark fiber. Some vendors also provide channels within WDM links as an alternative to entire dark fiber pairs.
WDM typically allows 16 channels over the same fiber, while DWDM can allow as many as 64 or 80 channels over a single fiber.
36. Bandwidth of Inter-Site Link(s): Link bandwidth:
E3: 34 Mb/sec (or DS-3/T3 at 45 Mb/sec)
ATM: Typically 155 or 622 Mb/sec
Ethernet: Fast (100 Mb/sec) or Gigabit (1 Gb/sec)
Fibre Channel: 1 or 2 Gb/sec
Memory Channel: 100 MB/sec
[D]WDM: Multiples of ATM, GbE, FC, etc. Link technologies vary in cost and bandwidth.
37. Bandwidth of Inter-Site Link(s): Inter-site link minimum standards are in the OpenVMS Cluster Software SPD:
10 megabits minimum data rate
This rules out E1 (2 Mb) links (and T1 at 1.5 Mb)
“Minimize” packet latency
Low SCS packet retransmit rate:
Less than 0.1% retransmitted. Implies:
Low packet-loss rate for bridges
Low bit-error rate for links
38. Bandwidth of Inter-Site Link: Bandwidth affects performance of:
Volume Shadowing full copy operations
Volume Shadowing merge operations
Link is typically only fully utilized during shadow copies
Size link(s) for acceptably-small shadowing Full Copy times
OpenVMS (PEDRIVER) can use multiple links in parallel quite effectively
Significant improvements in this area in OpenVMS 7.3
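As a rough way to size a link for acceptable Full Copy times, the arithmetic below can help. It is a back-of-the-envelope sketch with a hypothetical volume size and an assumed usable-bandwidth fraction, not a Volume Shadowing performance model.

```python
# Back-of-the-envelope estimate of shadow Full Copy time across an inter-site link.
# The 500 GB volume, 70% link efficiency, and candidate links are illustrative assumptions.

def full_copy_hours(volume_gb: float, link_mbps: float, efficiency: float = 0.7) -> float:
    """Hours to copy volume_gb over a link of link_mbps, at a given usable fraction."""
    bits_to_move = volume_gb * 8e9                 # GB -> bits (decimal units)
    usable_bps = link_mbps * 1e6 * efficiency      # protocol and contention overhead assumed
    return bits_to_move / usable_bps / 3600

for name, mbps in [("E3 (34 Mb/s)", 34), ("DS-3 (45 Mb/s)", 45),
                   ("Fast Ethernet", 100), ("Gigabit Ethernet", 1000)]:
    print(f"{name:18}: {full_copy_hours(500, mbps):6.1f} hours to full-copy a 500 GB shadowset")
```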
39. Inter-Site Link Choices: Service type choices
Telco-provided data circuit service, own microwave link, FSO link, dark fiber?
Dedicated bandwidth, or shared pipe?
Single or multiple (redundant) links? If multiple links, then:
Diverse paths?
Multiple vendors? Many questions must be answered in selecting an inter-site link.
Using a single vendor can cut costs, but a single failure in that vendor’s network could take out both links (there was a well-publicized failure of one vendor’s entire X.25 network not long ago after they upgraded the software on all their switches).
Even when multiple vendors are used, one must verify that redundant links actually follow separate paths. Both vendors might sub-contract at least a portion of the link (typically the links to the actual sites at both ends) to the same RBOC, potentially introducing a single point of failure.
40. Dark Fiber Availability Example: This map is the fiber-optic coverage from MetroMedia Fiber Networks (http://mmfn.com/) for Amsterdam.
41. Inter-Site Link: Network Gear. Bridge implementations must not drop small packets under heavy loads
SCS Hello packets are small packets
If two in a row get lost, a node without redundant LANs will see a Virtual Circuit closure; if failure lasts too long, node will do a CLUEXIT bugcheck
42. Inter-Site Links: It is desirable for the cluster to be able to survive a bridge/router reboot for a firmware upgrade or switch reboot
If only one inter-site link is available, cluster nodes will just have to wait during this time
Spanning Tree reconfiguration takes time
Default Spanning Tree protocol timers often cause delays longer than the default value for RECNXINTERVAL
Consider raising RECNXINTERVAL parameter
Default is 20 seconds
It’s a dynamic parameter
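To see why the 20-second default is often too small, the sketch below compares a worst-case classic 802.1D Spanning Tree reconvergence time (using the standard default timers) against candidate RECNXINTERVAL values. The comparison policy and margin are illustrative assumptions; actual values should come from measurements on your own network. Note that a larger RECNXINTERVAL also means the cluster waits longer before removing a node that really has failed, so the value is a trade-off rather than a free change.

```python
# Compare worst-case classic 802.1D Spanning Tree reconvergence against RECNXINTERVAL.
# Timer values are the standard 802.1D defaults; the margin policy is an assumption.

MAX_AGE = 20          # seconds (802.1D default)
FORWARD_DELAY = 15    # seconds (802.1D default; listening + learning = 2 x forward_delay)

def stp_worst_case_seconds(max_age: int = MAX_AGE, forward_delay: int = FORWARD_DELAY) -> int:
    return max_age + 2 * forward_delay    # roughly 50 s with default timers

def recnxinterval_ok(recnxinterval: int, margin: int = 10) -> bool:
    """True if the cluster reconnection interval outlasts STP reconvergence with some margin."""
    return recnxinterval >= stp_worst_case_seconds() + margin

for candidate in (20, 60, 90):
    verdict = "sufficient" if recnxinterval_ok(candidate) else "likely too small"
    print(f"RECNXINTERVAL={candidate:3d}s vs STP worst case {stp_worst_case_seconds()}s: {verdict}")
```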
43. Redundant Inter-Site Links: If multiple inter-site links are used, but they are joined together into one extended LAN, the Spanning Tree reconfiguration time is typically too long for the default value of RECNXINTERVAL also
One may want to carefully select bridge root priorities so that one of the (expensive) inter-site links is not turned off by the Spanning Tree algorithm
44. Inter-Site Links: Multiple inter-site links can instead be configured as isolated, independent LANs, with independent Spanning Trees
There is a very low probability of experiencing Spanning Tree Reconfigurations at once on multiple LANs when they are completely separate
Use multiple LAN adapters in each system, with one connected to each of the independent inter-site LANs
45. Inter-Site Link Monitoring: Where redundant LAN hardware is in place, use the LAVC$FAILURE_ANALYSIS tool from SYS$EXAMPLES:
It monitors and reports, via OPCOM messages, LAN component failures and repairs
More detail later
46. Disaster-Tolerant Clusters: Foundation. Goal: Survive loss of up to one entire datacenter
Foundation:
Two or more datacenters a “safe” distance apart
Cluster software for coordination
Inter-site link for cluster interconnect
Data replication of some sort for 2 or more identical copies of data, one at each site:
Volume Shadowing for OpenVMS, StorageWorks DRM or Continuous Access, database replication, etc.
47. Disaster-Tolerant Clusters: Foundation:
Management and monitoring tools
Remote system console access or KVM system
Failure detection and alerting, for things like:
Network (especially inter-site link) monitoring
Shadowset member loss
Node failure
Quorum recovery tool (especially for 2-site clusters). It may be necessary to boot or reboot systems at a remote site without being physically there.
When redundancy is in place, it is important to be notified when failures occur, so that timely repairs can be made before a second failure can cause an outage.
For 2-site clusters, some way of recovering from a loss of quorum will be needed. (We’ll talk in more depth about quorum schemes later.)
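For readers unfamiliar with why a two-site cluster needs a quorum recovery mechanism, here is a minimal sketch of OpenVMS-style quorum arithmetic (quorum is computed as (expected votes + 2) / 2, truncated). The node names and vote assignments are hypothetical; the point is that an even split of votes across two sites cannot keep quorum when one site is lost, which is why tools such as DECamds / Availability Manager (listed a few slides later) matter.

```python
# Sketch of quorum arithmetic for a two-site cluster with equal votes per site.
# Node names and vote assignments are hypothetical.

def quorum(expected_votes: int) -> int:
    # OpenVMS-style rule: quorum = (EXPECTED_VOTES + 2) / 2, truncated.
    return (expected_votes + 2) // 2

votes = {"SITE_A_NODE1": 1, "SITE_A_NODE2": 1, "SITE_B_NODE1": 1, "SITE_B_NODE2": 1}
expected = sum(votes.values())
q = quorum(expected)
print(f"Expected votes {expected}, quorum {q}")

# Lose all of Site B in a disaster:
surviving = {n: v for n, v in votes.items() if n.startswith("SITE_A")}
remaining_votes = sum(surviving.values())
print(f"Surviving votes {remaining_votes}:",
      "cluster continues" if remaining_votes >= q else
      "below quorum; processing pauses until quorum is restored or adjusted")
```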
48. Disaster-Tolerant Clusters: Foundation:
Configuration planning and implementation assistance, and staff training
HP recommends the Disaster Tolerant Cluster Services (DTCS) package. It is wise to get some consulting help in planning and implementing a disaster-tolerant cluster, and in training your people how to operate it and handle various recovery scenarios correctly.
HP offers consulting services for DT cluster solutions using MC/Service Guard, OpenVMS Clusters, TruClusters, and for DR or DT solutions using StorageWorks DRM, StorageWorks XP, and EMC controller-based mirroring solutions.
49. Disaster-Tolerant Clusters: Foundation:
History of packages available for Disaster-Tolerant Cluster configuration planning, implementation assistance, and training:
HP currently offers Disaster Tolerant Cluster Services (DTCS) package
Monitoring based on tools from Heroix
Formerly Business Recovery Server (BRS)
Monitoring based on Polycenter tools (Console Manager, System Watchdog, DECmcc) now owned by Computer Associates
and before that, Multi-Datacenter Facility (MDF)
50. Disaster-Tolerant Clusters: Management and monitoring toolset choices:
Remote system console access
Heroix RoboCentral; CA Unicenter Console Management for OpenVMS (formerly Command/IT, formerly Polycenter Console Manager); TECSys Development Inc. ConsoleWorks; Ki Networks Command Line Interface Manager (CLIM)
Failure detection and alerting
Heroix RoboMon; CA Unicenter System Watchdog for OpenVMS (formerly Watch/IT, formerly Polycenter System Watchdog); BMC Patrol
HP also has a software product called CockpitMgr designed specifically for disaster-tolerant OpenVMS Cluster monitoring and control. See http://www.hp.be/cockpitmgr and http://www.openvms.compaq.com/openvms/journal/v1/mgclus.pdf
51. Disaster-Tolerant Clusters: Management and monitoring toolset choices:
Network monitoring (especially inter-site links)
HP OpenView; Unicenter TNG; Tivoli; ClearViSN; CiscoWorks; etc.
Quorum recovery tool
DECamds / Availability Manager
DTCS or BRS integrated tools (which talk to the DECamds/AM RMDRIVER client on cluster nodes)
52. Disaster-Tolerant Clusters: Management and monitoring toolset choices:
Performance Management
HP ECP (CP/Collect & CP/Analyze)
Perfcap PAWZ, Analyzer, & Planner
Unicenter Performance Management for OpenVMS (formerly Polycenter Performance Solution Data Collector and Performance Analyzer, formerly SPM and VPA) from Computer Associates
Fortel SightLine/Viewpoint (formerly Datametrics)
BMC Patrol
etc.
53. Disaster-Tolerant Clusters: Foundation:
Carefully-planned procedures for:
Normal operations
Scheduled downtime and outages
Detailed diagnostic and recovery action plans for various failure scenarios
54. Disaster Tolerance: Core Requirements. Foundation:
Complete redundancy in facilities and hardware:
Second site with its own storage, networking, computing hardware, and user access mechanisms is put in place
No dependencies on the 1st site are allowed
Monitoring, management, and control mechanisms are in place to facilitate fail-over
Sufficient computing capacity is in place at the 2nd site to handle expected workloads by itself if the 1st site is destroyed. To achieve Disaster Tolerance, these core pieces must be put in place.
55. Disaster Tolerance: Core Requirements. Foundation:
Data Replication:
Data is constantly replicated to or copied to a 2nd site, so data is preserved in a disaster
Recovery Point Objective (RPO) determines which technologies are acceptable. To achieve Disaster Tolerance, these core pieces must be put in place.
56. Planning for Disaster Tolerance: Remembering that the goal is to continue operating despite loss of an entire datacenter
All the pieces must be in place to allow that:
User access to both sites
Network connections to both sites
Operations staff at both sites
Business can’t depend on anything that is only at either site
57. Disaster Tolerance: Core Requirements. If all these requirements are met, there may be as little as zero data lost and as little as seconds of delay after a disaster before the surviving copy of data can actually be used
58. Planning for Disaster Tolerance: Sites must be carefully selected to avoid hazards common to both, and loss of both datacenters at once as a result
Make them a “safe” distance apart
This must be a compromise. Factors:
Business needs
Risks
Interconnect costs
Performance (inter-site latency)
Ease of travel between sites
Politics, legal requirements (e.g. privacy laws)
59. Planning for Disaster Tolerance: What is a “Safe Distance”? Analyze likely hazards of proposed sites:
Fire (building, forest, gas leak, explosive materials)
Storms (Tornado, Hurricane, Lightning, Hail)
Flooding (excess rainfall, dam failure, storm surge, broken water pipe)
Earthquakes, Tsunamis
60. Planning for Disaster Tolerance: What is a “Safe Distance”? Analyze likely hazards of proposed sites:
Nearby transportation of hazardous materials (highway, rail, ship/barge)
Terrorist (or disgruntled customer) with a bomb or weapon
Enemy attack in war (nearby military or industrial targets)
Civil unrest (riots, vandalism)
61. Planning for Disaster Tolerance: Site Separation. Select site separation direction:
Not along same earthquake fault-line
Not along likely storm tracks
Not in same floodplain or downstream of same dam
Not on the same coastline
Not in line with prevailing winds (that might carry hazardous materials)
62. Planning for Disaster Tolerance: Site Separation. Select site separation distance (in a “safe” direction):
1 kilometer: protects against most building fires, gas leak, terrorist bombs, armed intruder
10 kilometers: protects against most tornadoes, floods, hazardous material spills, release of poisonous gas, non-nuclear military bombs
100 kilometers: protects against most hurricanes, earthquakes, tsunamis, forest fires, “dirty” bombs, biological weapons, and possibly military nuclear attacks
63. Planning for Disaster Tolerance: Providing Redundancy. Redundancy must be provided for:
Datacenter and facilities (A/C, power, user workspace, etc.)
Data
And data feeds, if any
Systems
Network
User access
64. Planning for Disaster Tolerance: Also plan for continued operation after a disaster
Surviving site will likely have to operate alone for a long period before the other site can be repaired or replaced
65. Planning for Disaster Tolerance: Plan for continued operation after a disaster
Provide redundancy within each site
Facilities: Power feeds, A/C
Mirroring or RAID to protect disks
Obvious solution for 2-site clusters would be 4-member shadowsets, but the limit is 3 members. Typical workarounds are:
Shadow 2-member controller-based mirrorsets at each site, or
Have 2 members at one site and a 2-member mirrorset as the single member at the other site
Have 3 sites, with one shadow member at each site
Clustering for servers
Network redundancy
66. Planning for Disaster Tolerance: Plan for continued operation after a disaster
Provide enough capacity within each site to run the business alone if the other site is lost
and handle normal workload growth rate
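A quick way to check whether a single site could carry the whole load after a disaster is a capacity calculation like the sketch below. The workload figures, growth rate, planning horizon, and headroom target are hypothetical placeholders; substitute your own measurements.

```python
# Capacity check: can one site alone carry the whole workload after a disaster,
# including expected growth over the planning horizon? All inputs are hypothetical.

def required_per_site_capacity(current_peak_load: float,
                               annual_growth: float,
                               years: int,
                               headroom: float = 0.2) -> float:
    """Capacity each site needs on its own, with some headroom above the projected peak."""
    projected_peak = current_peak_load * (1 + annual_growth) ** years
    return projected_peak * (1 + headroom)

site_capacity = 1200.0            # e.g. transactions/second one site can sustain
needed = required_per_site_capacity(current_peak_load=800.0, annual_growth=0.15, years=3)
print(f"Needed per site: {needed:.0f} tps; installed per site: {site_capacity:.0f} tps;",
      "OK" if site_capacity >= needed else "undersized for single-site operation")
```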
67. Planning for Disaster Tolerance: Plan for continued operation after a disaster
Having 3 sites is an option to seriously consider:
Leaves two redundant sites after a disaster
Leaves 2/3 of processing capacity instead of just 1/2 after a disaster
68. page 68 hp enterprise technical symposium filename.ppt Cross-site Data Replication Methods Hardware
Storage controller
Software
Host software Volume Shadowing, disk mirroring, or file-level mirroring
Database replication or log-shipping
Transaction-processing monitor or middleware with replication functionality
69. page 69 hp enterprise technical symposium filename.ppt Data Replication in Hardware HP StorageWorks Data Replication Manager (DRM) or Continuous Access (CA)
HP StorageWorks XP (formerly HP SureStore E Disk Array XP) with Continuous Access (CA) XP
EMC Symmetrix Remote Data Facility (SRDF)
70. page 70 hp enterprise technical symposium filename.ppt Data Replication in Software Host software volume shadowing or disk mirroring:
Volume Shadowing Software for OpenVMS
MirrorDisk/UX for HP-UX
Veritas VxVM with Volume Replicator extensions for Unix and Windows
Fault Tolerant (FT) Disk on Windows
Some other O/S platforms have software products which can provide file-level mirroring
71. page 71 hp enterprise technical symposium filename.ppt Data Replication in Software Database replication or log-shipping
Replication
e.g. Oracle DataGuard (formerly Oracle Standby Database)
Database backups plus “Log Shipping”
72. page 72 hp enterprise technical symposium filename.ppt Data Replication in Software TP Monitor/Transaction Router
e.g. HP Reliable Transaction Router (RTR) Software on OpenVMS, Unix, and Windows
73. page 73 hp enterprise technical symposium filename.ppt Data Replication in Hardware Data mirroring schemes
Synchronous
Slower, but less chance of data loss
Beware: Some hardware solutions can still lose the last write operation before a disaster
Asynchronous
Faster, and works for longer distances
but can lose minutes’ worth of data (more under high loads) in a site disaster
Most products offer you a choice of using either method
74. page 74 hp enterprise technical symposium filename.ppt Data Replication in Hardware Mirroring is of sectors on disk
So operating system / applications must flush data from memory to disk for the controller to be able to mirror it to the other site
On an operating system like Unix, where file system updates may remain in buffer cache in system memory for some time, there's no way a hardware disk replication scheme can replicate the data unless it is flushed out to disk first.
OpenVMS uses only write-through caching in the host, so the data is always written out to the disk before a transaction is considered to be completed, and doesn't hang around in memory, at risk. The Files-11 XQP and RMS also use a technique for file system metadata and internal RMS file structures called "careful write ordering", in which the design carefully chooses the order of operations so as to never leave the file system metadata or internal file structures in an unreadable state. With careful write ordering, one operation at a time is done and allowed to complete before the next metadata-altering disk write is done. This technique has a performance cost (or at least it did in the days before controller write-back cache made write latencies very low), but is very crash-safe.
75. page 75 hp enterprise technical symposium filename.ppt Data Replication in Hardware Resynchronization operations
May take significant time and bandwidth
May or may not preserve a consistent copy of data at the remote site until the copy operation has completed
May or may not preserve write ordering during the copy
76. page 76 hp enterprise technical symposium filename.ppt Data Replication:Write Ordering File systems and database software may make some assumptions on write ordering and disk behavior
For example, a database may write to a journal log, wait until that I/O is reported as being complete, then write to the main database storage area
During database recovery operations, its logic may depend on these write operations having been completed to disk in the expected order
77. page 77 hp enterprise technical symposium filename.ppt Data Replication:Write Ordering Some controller-based replication methods copy data on a track-by-track basis for efficiency instead of exactly duplicating individual write operations
This may change the effective ordering of write operations within the remote copy
78. page 78 hp enterprise technical symposium filename.ppt Data Replication:Write Ordering When data needs to be re-synchronized at a remote site, some replication methods (both controller-based and host-based) similarly copy data on a track-by-track basis for efficiency instead of exactly duplicating writes
This may change the effective ordering of write operations within the remote copy
The output volume may be inconsistent and unreadable until the resynchronization operation completes
79. page 79 hp enterprise technical symposium filename.ppt Data Replication:Write Ordering It may be advisable in this case to preserve an earlier (consistent) copy of the data, and perform the resynchronization to a different set of disks, so that if the source site is lost during the copy, at least one copy of the data (albeit out-of-date) is still present
80. page 80 hp enterprise technical symposium filename.ppt Data Replication in Hardware:Write Ordering Some products provide a guarantee of original write ordering on a disk (or even across a set of disks)
Some products can even preserve write ordering during resynchronization operations, so the remote copy is always consistent (as of some point in time) during the entire resynchronization operation
81. page 81 hp enterprise technical symposium filename.ppt Data Replication:Performance over a Long Distance Replication performance may be affected by latency due to the speed of light over the distance between sites
Greater (and thus safer) distances between sites imply greater latency
82. page 82 hp enterprise technical symposium filename.ppt Data Replication:Performance over a Long Distance With some solutions, it may be possible to synchronously replicate data to a nearby “short-haul” site, and asynchronously replicate from there to a more-distant site
This is sometimes called “cascaded” data replication
83. page 83 hp enterprise technical symposium filename.ppt Data Replication:Performance During Re-Synchronization Re-synchronization operations can generate a high data rate on inter-site links
Excessive re-synchronization time increases Mean Time To Repair (MTTR) after a site failure or outage
Acceptable re-synchronization times and link costs may be the major factors in selecting inter-site link(s)
84. page 84 hp enterprise technical symposium filename.ppt Data Replication in Hardware:Copy Direction Most hardware-based solutions can only replicate a given set of data in one direction or the other
Some can be configured to replicate some disks in one direction, and other disks in the opposite direction
This way, different applications might be run at each of the two sites
85. page 85 hp enterprise technical symposium filename.ppt Data Replication in Hardware:Disk Unit Access All access to a disk unit is typically from only one of the controllers at a time
Data cannot be accessed through the controller at the other site
Data might be accessible to systems at the other site via a Fibre Channel inter-site link, or by going through the MSCP Server on a VMS node
Read-only access may be possible at remote site with one product (“Productive Protection”)
Failover involves controller commands
Manual, or manually-initiated scripts
Minimum failover time is typically in the 15-minute to 1-hour range
86. page 86 hp enterprise technical symposium filename.ppt Data Replication in Hardware:Multiple Copies Some products allow replication to:
A second unit at the same site
Multiple remote units or sites at a time (“M x N“ configurations)
In contrast, OpenVMS Volume Shadowing allows up to 3 copies, spread across up to 3 sites
87. page 87 hp enterprise technical symposium filename.ppt Data Replication in Hardware:Copy Direction Few or no hardware solutions can replicate data between sites in both directions on the same shadowset/mirrorset
But Host-based OpenVMS Volume Shadowing can do this
If this could be done in a hardware solution, host software would still have to coordinate any disk updates to the same set of blocks from both sites
e.g. OpenVMS Cluster Software, or Oracle Parallel Server or 9i/RAC
This capability is required to allow the same application to be run on cluster nodes at both sites simultaneously
88. page 88 hp enterprise technical symposium filename.ppt Managing Replicated Data With copies of data at multiple sites, one must take care to ensure that:
Both copies are always equivalent, or, failing that,
Users always access the most up-to-date copy
89. page 89 hp enterprise technical symposium filename.ppt Managing Replicated Data If the inter-site link fails, both sites might conceivably continue to process transactions, and the copies of the data at each site would continue to diverge over time
This is called a “Partitioned Cluster”, or “Split-Brain Syndrome”
The most common solution to this potential problem is a Quorum-based scheme
Access and updates are only allowed to take place on one set of data
90. page 90 hp enterprise technical symposium filename.ppt Quorum Schemes Idea comes from familiar parliamentary procedures
Systems are given votes
Quorum is defined to be a simple majority (just over half) of the total votes
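As a worked illustration (using the standard OpenVMS calculation, in which quorum is computed as (EXPECTED_VOTES + 2) / 2 with integer division): three nodes with VOTES=1 each give EXPECTED_VOTES=3, so quorum is (3 + 2) / 2 = 2, and any two of the three nodes can keep running after a failure, while an isolated single node cannot.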
91. page 91 hp enterprise technical symposium filename.ppt Quorum Schemes In the event of a communications failure,
Systems in the minority voluntarily suspend or stop processing, while
Systems in the majority can continue to process transactions
92. page 92 hp enterprise technical symposium filename.ppt Quorum Scheme If a cluster member is not part of a cluster with quorum, OpenVMS keeps it from doing any harm by:
Putting all disks into Mount Verify state, thus stalling all disk I/O operations
Requiring that all processes have the QUORUM capability before they can run
Clearing the QUORUM capability bit on all CPUs in the system, thus preventing any process from being scheduled to run on a CPU and doing any work
OpenVMS many years ago looped at IPL 4 instead
For MC/Service Guard clusters, nodes in the minority perform a Transfer Of Control (TOC), which basically looks like a panic, and stops processing
93. page 93 hp enterprise technical symposium filename.ppt Quorum Schemes To handle cases where there are an even number of votes
For example, with only 2 systems,
Or half of the votes are at each of 2 sites
provision may be made for
a tie-breaking vote, or
human intervention
With only two nodes, if communication fails between the nodes, neither node has the number of votes necessary to have a majority, so both nodes consider themselves to be in the minority during the loss of communications, and neither one can process transactions.
Likewise, if there are two sites with equal numbers of votes at each site, neither site can process transactions after a failure of the inter-site link.
This problem is typically solved by allowing human intervention to decide which one (and it is important that only one) of the two sites will continue processing transactions, or else by making provision for a tie-breaking vote so that the determination of which site is to continue can be made automatically, without human intervention and the delay that would cause.
94. page 94 hp enterprise technical symposium filename.ppt Quorum Schemes:Tie-breaking vote This can be provided by a disk:
Quorum Disk for OpenVMS Clusters or TruClusters or MSCS
Cluster Lock Disk for MC/Service Guard
Or by a system with a vote, located at a 3rd site
Additional cluster member node for OpenVMS Clusters or TruClusters (called a “quorum node”) or MC/Service Guard clusters (called an “arbitrator node”)
Software running on a non-clustered node or a node in another cluster
e.g. Quorum Server for MC/Service Guard
95. page 95 hp enterprise technical symposium filename.ppt Quorum configurations inMulti-Site Clusters 3 sites, equal votes in 2 sites
Intuitively ideal; easiest to manage & operate
3rd site serves as tie-breaker
3rd site might contain only a "quorum node", "arbitrator node", or "quorum server"
In MC/Service Guard disaster-tolerant clusters, there must be an equal number of systems in the cluster at each of the two sites. Each system has one vote in the quorum scheme.
In OpenVMS Clusters, the number of nodes at each site can vary, because the number of votes each node is given may be anywhere from 0 to 127 votes, and the quorum disk can also be given from 0 to 127 votes. In a TruCluster, the quorum disk may only have 0 or 1 votes.
96. page 96 hp enterprise technical symposium filename.ppt Quorum configurations inMulti-Site Clusters 3 sites, equal votes in 2 sites
Hard to do in practice, due to cost of inter-site links beyond on-campus distances
Could use links to quorum site as backup for main inter-site link if links are high-bandwidth and connected together
Could use 2 less-expensive, lower-bandwidth links to quorum site, to lower cost
OpenVMS SPD requires a minimum of 10 megabits bandwidth for any link
97. page 97 hp enterprise technical symposium filename.ppt Quorum configurations in3-Site Clusters
98. page 98 hp enterprise technical symposium filename.ppt Quorum configurations inMulti-Site Clusters 2 sites:
Most common & most problematic:
How do you arrange votes? Balanced? Unbalanced?
If votes are balanced, how do you recover from the loss of quorum which will result when either site or the inter-site link fails?
In an MC/Service Guard cluster, the votes must always be balanced.
With OpenVMS Clusters or TruClusters, one faces a choice. One site may be given more votes, and it can continue with or without the other site, whether or not the inter-site link is functional. The site with fewer votes would pause due to quorum loss if the other site failed or the inter-site link failed, but manual action could be taken to restore quorum and resume processing. In the OpenVMS Cluster case, the DECamds or Availability Manager tools can be used to restore quorum. One would need to be careful to restore quorum at only one of the two sites in the event of an inter-site link failure, to avoid a partitioned cluster, sometimes called "split-brain syndrome".
99. page 99 hp enterprise technical symposium filename.ppt Quorum configurations inTwo-Site Clusters One solution: Unbalanced Votes
More votes at one site
Site with more votes can continue without human intervention in the event of loss of the other site or the inter-site link
Site with fewer votes pauses or stops on a failure and requires manual action to continue after loss of the other site
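As a hedged sketch (node names and vote values are hypothetical, not from this presentation), the unbalanced-votes arrangement might look like this in each node's MODPARAMS.DAT, with AUTOGEN run afterward to apply it:
! Primary-site nodes (e.g. ALPHA1, ALPHA2): one vote each
VOTES = 1
EXPECTED_VOTES = 2
! Standby-site node (e.g. BETA1): no vote
VOTES = 0
EXPECTED_VOTES = 2
With EXPECTED_VOTES=2, quorum is (2 + 2) / 2 = 2, so the Primary site can continue on its own after a site or link failure, while the Standby site alone cannot.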
100. page 100 hp enterprise technical symposium filename.ppt Quorum configurations inTwo-Site Clusters Unbalanced Votes
Very common in remote-shadowing-only clusters (not fully disaster-tolerant)
0 votes is a common choice for the remote site in this case
but that has its dangers
101. page 101 hp enterprise technical symposium filename.ppt Quorum configurations inTwo-Site Clusters Unbalanced Votes
Common mistake:
Give more votes to Primary site, and
Leave Standby site unmanned
Result: cluster can’t run without Primary site or human intervention at the (unmanned) Standby site
102. page 102 hp enterprise technical symposium filename.ppt Quorum configurations inTwo-Site Clusters Balanced Votes
Equal votes at each site
Manual action required to restore quorum and continue processing in the event of either:
Site failure, or
Inter-site link failure
103. page 103 hp enterprise technical symposium filename.ppt Quorum Recovery Methods Methods for human intervention to restore quorum
Software interrupt at IPL 12 from console
IPC> Q
DECamds or Availability Manager Console:
System Fix; Adjust Quorum
DTCS or BRS integrated tool, using same RMDRIVER (DECamds/AM client) interface
104. page 104 hp enterprise technical symposium filename.ppt Quorum configurations inTwo-Site Clusters Balanced Votes
Note: Using REMOVE_NODE option with SHUTDOWN.COM (post V6.2) when taking down a node effectively “unbalances” votes
105. page 105 hp enterprise technical symposium filename.ppt Optimal Sub-cluster Selection Connection Manager compares potential node subsets that could make up the surviving portion of the cluster
Picks sub-cluster with the most votes; or,
If vote counts are equal, picks sub-cluster with the most nodes; or,
If node counts are equal, arbitrarily picks a winner
based on comparing SCSSYSTEMID values within the set of nodes with the most-recent cluster software revision
106. page 106 hp enterprise technical symposium filename.ppt Optimal Sub-cluster Selection: ExamplesBoot nodes and satellites Most configurations with satellite nodes give votes to disk/boot servers and set VOTES=0 on all satellite nodes
If the sole LAN adapter on a disk/boot server fails, and it has a vote, ALL satellites will CLUEXIT!
107. page 107 hp enterprise technical symposium filename.ppt Optimal Sub-cluster Selection: ExamplesBoot nodes and satellites
108. page 108 hp enterprise technical symposium filename.ppt Optimal Sub-cluster Selection: Examples
109. page 109 hp enterprise technical symposium filename.ppt Optimal Sub-cluster Selection: Examples
110. page 110 hp enterprise technical symposium filename.ppt Optimal Sub-cluster Selection: Examples
111. page 111 hp enterprise technical symposium filename.ppt Optimal Sub-cluster Selection: ExamplesBoot nodes and satellites Advice: give at least as many votes to node(s) on the LAN as any single server has, or configure redundant LAN adapters
112. page 112 hp enterprise technical symposium filename.ppt Optimal Sub-cluster Selection: Examples
113. page 113 hp enterprise technical symposium filename.ppt Optimal Sub-cluster Selection: Examples
114. page 114 hp enterprise technical symposium filename.ppt Optimal Sub-cluster Selection: Examples Two-Site Cluster with Unbalanced Votes This is the “remote-shadowing” configuration we talked about earlier, with some number of votes at the Primary site (on the left) but no votes at the Backup site (on the right).
115. page 115 hp enterprise technical symposium filename.ppt Optimal Sub-cluster Selection: Examples Two-Site Cluster with Unbalanced Votes
116. page 116 hp enterprise technical symposium filename.ppt Optimal Sub-cluster Selection: Examples Two-Site Cluster with Unbalanced Votes
117. page 117 hp enterprise technical symposium filename.ppt Network Considerations Best network configuration for a disaster-tolerant cluster typically is:
All nodes in same DECnet area
All nodes in same IP Subnet
despite being at two separate sites
When all the nodes are in the same DECnet area, upon the first inquiry to a router about sending data to a node, DECnet sets the “On-LAN” bit to indicate that the destination node is on the same LAN, and subsequently the nodes can send traffic directly to each other over the LAN. If DECnet nodes are in different DECnet areas, all data must first be sent to the Area Router for the first DECnet area, which passes it along to the Area Router for the 2nd DECnet area, which sends it to the destination node. So data must traverse 3 hops between nodes instead of just 1, and takes up LAN bandwidth 3 times instead of 1.
118. page 118 hp enterprise technical symposium filename.ppt Shadowing Between Sites Shadow copies can generate a high data rate on inter-site links
Excessive shadow-copy time increases Mean Time To Repair (MTTR) after a site failure or outage
Acceptable shadow full-copy times and link costs will typically be the major factors in selecting inter-site link(s)
119. page 119 hp enterprise technical symposium filename.ppt Shadowing Between Sites Because:
Inter-site latency is typically much greater than intra-site latency, at least if there is any significant distance between sites, and
Direct operations are typically 1-2 ms lower in latency than MSCP-served operations, even when the inter-site distance is small,
It is most efficient to direct Read operations to the local disks, not remote disks
(All Write operations have to go to all disks in a shadowset, remote as well as local members, of course)
120. page 120 hp enterprise technical symposium filename.ppt Shadowing Between Sites:Local vs. Remote Reads Directing Shadowing Read operations to local disks, in favor of remote disks:
Bit 16 (%x10000) in SYSGEN parameter SHADOW_SYS_DISK can be set to force reads to local disks in favor of MSCP-served disks
OpenVMS 7.3 (or recent VOLSHAD ECO kits) allow you to tell OpenVMS at which site member disks are located, and the relative “cost” to read a given disk
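A hedged sketch of setting bit 16 (the value shown assumes bit 0 was already set and should be preserved). If SHADOW_SYS_DISK is dynamic on your version, WRITE ACTIVE applies the change immediately; otherwise place the value in MODPARAMS.DAT and run AUTOGEN with a reboot. In either case, add it to MODPARAMS.DAT so AUTOGEN preserves it:
$ RUN SYS$SYSTEM:SYSGEN
SYSGEN> USE ACTIVE
SYSGEN> SHOW SHADOW_SYS_DISK
SYSGEN> SET SHADOW_SYS_DISK %X10001
SYSGEN> WRITE ACTIVE
SYSGEN> EXIT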
121. page 121 hp enterprise technical symposium filename.ppt Shadowing Between Sites:Local vs. Remote Reads on Fibre Channel With an inter-site Fibre Channel link, VMS can’t tell which disks are local and which are remote, so we give it some help:
To tell VMS what site a given disk is at:
$ SET DEVICE/SITE=xxx $1$DGAnnn:
To tell VMS what site a given VMS node is at, do:
$ SET DEVICE/SITE=xxx DSAnn:
for each virtual unit, from each VMS node, specifying the appropriate SITE value for each VMS node. There is a new SYSGEN parameter present (but only latent) in 7.3-1 called SHADOW_SITE, which is intended to eventually be used to indicate which site a node is at, and thus save needing to do these commands on each of the virtual units from each node. But that isn’t active yet.
122. page 122 hp enterprise technical symposium filename.ppt Shadowing Between Sites:Local vs. Remote Reads on Fibre Channel Here's an example of setting /SITE values:
$ DEFINE/SYSTEM/EXEC ZKO 1
$ DEFINE/SYSTEM/EXEC LKG 2
$! Please note for this example that:
$! $1$DGA4: is physically located at site ZKO.
$! $1$DGA2: is physically located at site LKG.
$! The MOUNT command is the same at both sites:
$ MOUNT/SYSTEM DSA42/SHAD=($1$DGA4, $1$DGA2) VOLUME_LABEL
$! At the ZKO site ...
$ SET DEVICE/SITE=ZKO DSA42:
$! At the LKG site ...
$ SET DEVICE/SITE=LKG DSA42:
$! At both sites, the following commands would be used to
$! specify at which site the disks are located
$ SET DEVICE/SITE=ZKO $1$DGA4:
$ SET DEVICE/SITE=LKG $1$DGA2:
This example is adapted from the “Fibre Channel in a Disaster-Tolerant OpenVMS Cluster System” white paper at http://h71000.www7.hp.com/openvms/fibre/fc_hbvs_dtc_wp.pdf
123. page 123 hp enterprise technical symposium filename.ppt Shadowing Between Sites:Local vs. Remote Reads and Read Cost VMS assigns a default “read cost” for each shadowset member, with values in this order (lowest cost to highest cost):
DECram device
Directly-connected device in the same physical location
Directly-connected device in a remote location
DECram served device
All other served devices
VMS adds the Read Cost to the queue length and picks the member with the lowest value to do the read
124. page 124 hp enterprise technical symposium filename.ppt Shadowing Between Sites:Local vs. Remote Reads and Read Cost For even greater control, one can override the default “read cost” to a given disk from a given node:
$ SET DEVICE /READ_COST=nnn $1$DGAnnn:
For reads, VMS adds the Read Cost to the queue length and picks the member with the lowest value to do the read
To return all member disks of a shadowset to their default values after such changes have been done, one can use the command:
$ SET DEVICE /READ_COST = 1 DSAnnn:
125. page 125 hp enterprise technical symposium filename.ppt Shadowing Between Sites Mitigating Impact of Remote Writes
Impact of round-trip latency on remote writes:
Use write-back cache in controllers to minimize write I/O latency for target disks
Remote MSCP-served writes
Check SHOW CLUSTER/CONTINUOUS with CR_WAITS and/or AUTOGEN with FEEDBACK to ensure MSCP_CREDITS is high enough to avoid SCS credit waits
Use MONITOR MSCP, SHOW DEVICE/SERVED, and/or AUTOGEN with FEEDBACK to ensure MSCP_BUFFER is high enough to avoid segmenting transfers
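A hedged sketch of the checks described above (exact field, class, and qualifier names may vary slightly by OpenVMS version):
$ SHOW CLUSTER/CONTINUOUS
Command> ADD CONNECTIONS
Command> ADD REM_PROC_NAME
Command> ADD CR_WAITS
$! If CR_WAITS keeps increasing on the VMS$DISK_CL_DRVR / MSCP$DISK connections,
$! raise MSCP_CREDITS on the serving node.
$ MONITOR MSCP_SERVER
$ SHOW DEVICE/SERVED/COUNT
$! If served transfers are being fragmented, raise MSCP_BUFFER.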
126. page 126 hp enterprise technical symposium filename.ppt Shadowing Between Sites:Speeding Shadow Copies Host-Based Volume Shadowing Full-Copy algorithm is non-intuitive:
Read from source disk
Do Compare operation with target disk
If data is different, write to target disk, then go to Step 1.
127. page 127 hp enterprise technical symposium filename.ppt Speeding Shadow Copies Shadow_Server process does copy I/Os
Does one 127-block segment at a time, from the beginning of the disk to the end, with no double-buffering or other speed-up tricks
Odd algorithm is to ensure correct results in cases like:
With application writes occurring in parallel with the copy thread
application writes during a Full Copy go to the set of source disks first, then to the set of target disk(s)
On system disks with a VMS node booting or dumping while shadow copy is going on
128. page 128 hp enterprise technical symposium filename.ppt Speeding Shadow Copies Implications:
Shadow copy completes fastest if data is identical beforehand
Fortunately, this is the most-common case – re-adding a shadow member into shadowset again after it was a member before
129. page 129 hp enterprise technical symposium filename.ppt Speeding Shadow Copies If data is very different, empirical tests have shown that it is faster to:
Do BACKUP/PHYSICAL from source shadowset to /FOREIGN-mounted target disk
Then do shadow copy afterward
than to simply initiate the shadow copy with differing data.
But be sure to clobber SCB on target disk with an $INITIALIZE (or $MOUNT/OVERRIDE=SHADOW) command before adding new member to shadowset (fixed in 7.3-2)
130. page 130 hp enterprise technical symposium filename.ppt Speeding Shadow Copies For even more speed-up, perform the BACKUP/PHYSICAL on a node on the target side
Because remote (MSCP-served) writes take a minimum of 2 round trips, whereas remote reads take a minimum of only 1 round trip
131. page 131 hp enterprise technical symposium filename.ppt Speeding Shadow Copies Doing shadow copy work from a node at target site, not source site, is also most efficient, for the same reason. It also uses less inter-site bandwidth (only Reads go across the link).
But with limited inter-site link bandwidth, it may help to do some of the copies in the reverse direction and thus utilize the “reverse” direction bandwidth, too
132. page 132 hp enterprise technical symposium filename.ppt Speeding Shadow Copies To control which node does shadow copy:
1) Set dynamic SYSGEN parameter SHADOW_MAX_COPY to a large positive value on desired (e.g. target-site) node(s)
2) Set SHADOW_MAX_COPY to 0 on all other nodes
3) Do $MOUNT to add member to shadowset; wait briefly
4) Reset SHADOW_MAX_COPY parameter to original values on all nodes
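A hedged sketch of these steps, reusing the DSA42 example devices from earlier (device and label names are illustrative); SHADOW_MAX_COPY is dynamic, so SYSGEN’s WRITE ACTIVE takes effect immediately:
$! On the desired (e.g. target-site) node:
$ RUN SYS$SYSTEM:SYSGEN
SYSGEN> USE ACTIVE
SYSGEN> SET SHADOW_MAX_COPY 4
SYSGEN> WRITE ACTIVE
SYSGEN> EXIT
$! Set SHADOW_MAX_COPY to 0 the same way on all other nodes, then add the member:
$ MOUNT/SYSTEM DSA42: /SHADOW=($1$DGA4:, $1$DGA2:) VOLUME_LABEL
$! Wait briefly, then restore the original SHADOW_MAX_COPY values on all nodes.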
133. page 133 hp enterprise technical symposium filename.ppt Speeding Shadow Copies Determining which node is performing a shadow copy:
Using SDA:
From each cluster node, do:
SDA> SET PROCESS SHADOW_SERVER
SDA> SHOW PROCESS/CHANNELS
and look for Busy channel to disk of interest
Or look for node holding a lock in Exclusive mode on a resource of the form “$DSAnnnn$_COPIER”
134. page 134 hp enterprise technical symposium filename.ppt Speeding Shadow Copies Because of the sequential, non-pipelined nature of the shadow-copy algorithm, to speed shadow copies:
Rather than forming controller-based stripesets and shadowing those across sites, shadow individual disks in parallel, and combine them into RAID-0 arrays with host-based RAID software
Dividing a disk up into 4 partitions at the controller level, and shadow-copying all 4 in parallel, takes only 40% of the time to shadow-copy the entire disk as a whole
135. page 135 hp enterprise technical symposium filename.ppt Speeding Shadow Copies:Mini-Copy Operations If one knows ahead of time that one shadowset member will temporarily be removed, one can set things up to allow a Mini-Copy instead of a Full-Copy when the disk is returned to the shadowset
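A hedged sketch (hypothetical device and label names, reusing the DSA42 example), assuming an OpenVMS version with Mini-Copy support (V7.3 and later):
$! Remove the member ahead of time, creating a write bitmap to track changes:
$ DISMOUNT/POLICY=MINICOPY $1$DGA2:
$! ... planned work occurs while the member is out of the shadowset ...
$! Return the member; only the blocks recorded in the bitmap are copied:
$ MOUNT/SYSTEM DSA42: /SHADOW=$1$DGA2: /POLICY=MINICOPY VOLUME_LABEL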
136. page 136 hp enterprise technical symposium filename.ppt Speeding Shadow Copies Read-ahead caching on new controllers can speed shadow copies.
New command $SET DEVICE/COPY_SOURCE can be used to specify the source disk for a Full-Copy operation (e.g. to copy from a 2nd member at one site instead of a remote member)
137. page 137 hp enterprise technical symposium filename.ppt Avoiding Shadow Copies New $INITIALIZE/SHADOW command allows a shadowset to be formed wth multiple members from the start, instead of always having to do a Full-Copy from the initial member
But consider including the /ERASE qualifier or the first Full-Merge operation will be slow as it fixes up non-file-system areas.
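A hedged sketch of forming a shadowset without an initial Full-Copy (device and label names are illustrative):
$ INITIALIZE/SHADOW=($1$DGA4:, $1$DGA2:)/ERASE VOLUME_LABEL
$ MOUNT/SYSTEM DSA42: /SHADOW=($1$DGA4:, $1$DGA2:) VOLUME_LABEL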
138. page 138 hp enterprise technical symposium filename.ppt Speeding Merge Operations Logical names SHAD$MERGE_DELAY_FACTOR and SHAD$MERGE_DELAY_FACTOR_DSAnnn allow control of merge thread I/O rates in the face of application I/Os
Logical names checked every 1,000 I/Os
Default is 200; lower values slow merge thread I/Os; higher values increase merge thread I/Os
Default merge thread back-off threshold is now too low for modern controllers, so Shadowing is too cautious; use the logical names to speed merges (see the example after this list)
MSCP controllers allow Mini-Merge operations
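A hedged example of the merge-delay logical names mentioned above (the value 500 is purely illustrative; the appropriate value depends on your controllers and workload):
$! Cluster-wide default for all shadowsets:
$ DEFINE/SYSTEM/EXECUTIVE_MODE SHAD$MERGE_DELAY_FACTOR 500
$! Or for one specific shadowset (DSA42 here) only:
$ DEFINE/SYSTEM/EXECUTIVE_MODE SHAD$MERGE_DELAY_FACTOR_DSA42 500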
139. page 139 hp enterprise technical symposium filename.ppt Data Protection Scenarios Protection of the data is obviously extremely important in a disaster-tolerant cluster
We’ll look at one scenario that has happened in real life and resulted in data loss:
“Wrong-way shadow copy”
We’ll also look at two obscure but potentially dangerous scenarios that theoretically could occur and would result in data loss:
“Creeping Doom”
“Rolling Disaster”
140. page 140 hp enterprise technical symposium filename.ppt Protecting Shadowed Data Shadowing keeps a “Generation Number” in the SCB on shadow member disks
Shadowing “Bumps” the Generation number at the time of various shadowset events, such as mounting, or membership changes
141. page 141 hp enterprise technical symposium filename.ppt Protecting Shadowed Data Generation number is designed to constantly increase over time, never decrease
Implementation is based on OpenVMS timestamp value, and during a “Bump” operation it is increased to the current time value (or, if it’s already a future time for some reason, such as time skew among cluster member clocks, then it’s simply incremented). The new value is stored on all shadowset members at the time of the Bump.
142. page 142 hp enterprise technical symposium filename.ppt Protecting Shadowed Data Generation number in SCB on removed members will thus gradually fall farther and farther behind that of current members
In comparing two disks, a later generation number should always be on the more up-to-date member, under normal circumstances
143. page 143 hp enterprise technical symposium filename.ppt “Wrong-Way Shadow Copy” Scenario Shadow-copy “nightmare scenario”:
Shadow copy in “wrong” direction copies old data over new
Real-life example:
Inter-site link failure occurs
Due to unbalanced votes, Site A continues to run
Shadowing increases generation numbers on Site A disks after removing Site B members from shadowset
144. page 144 hp enterprise technical symposium filename.ppt Wrong-Way Shadow Copy
145. page 145 hp enterprise technical symposium filename.ppt Wrong-Way Shadow Copy Site B is brought up briefly by itself for whatever reason
Shadowing can’t see Site A disks. Shadowsets mount with Site B disks only. Shadowing bumps generation numbers on Site B disks. Generation number is now greater than on Site A disks.
146. page 146 hp enterprise technical symposium filename.ppt Wrong-Way Shadow Copy
147. page 147 hp enterprise technical symposium filename.ppt Wrong-Way Shadow Copy Link gets fixed. Both sites are taken down and rebooted at once.
Shadowing thinks Site B disks are more current, and copies them over Site A’s. Result: Data Loss.
148. page 148 hp enterprise technical symposium filename.ppt Wrong-Way Shadow Copy
149. page 149 hp enterprise technical symposium filename.ppt Protecting Shadowed Data If shadowing can’t “see” a later disk’s SCB (i.e. because the site or link to the site is down), it may use an older member and then update the Generation number to a current timestamp value
New /POLICY=REQUIRE_MEMBERS qualifier on MOUNT command prevents a mount unless all of the listed members are present for Shadowing to compare Generation numbers on
New /POLICY=VERIFY_LABEL on MOUNT means volume label on member must be SCRATCH_DISK, or it won’t be added to the shadowset as a full-copy target
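A hedged example of using these qualifiers, reusing the DSA42 example devices (exact /POLICY keyword combinations may vary by version):
$! Refuse to mount unless both listed members are present for the generation-number comparison:
$ MOUNT/SYSTEM DSA42: /SHADOW=($1$DGA4:, $1$DGA2:) /POLICY=REQUIRE_MEMBERS VOLUME_LABEL
$! Only allow a disk to become a full-copy target if its label is SCRATCH_DISK:
$ MOUNT/SYSTEM DSA42: /SHADOW=($1$DGA4:, $1$DGA2:) /POLICY=VERIFY_LABEL VOLUME_LABEL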
150. page 150 hp enterprise technical symposium filename.ppt Avoiding Untimely/Unwanted Shadow Copies After a site failure or inter-site link failure, rebooting the downed site after repairs can be disruptive to the surviving site
Many DT Cluster sites prevent systems from automatically rebooting without manual intervention
Easiest way to accomplish this is to set console boot flags for conversational boot
151. page 151 hp enterprise technical symposium filename.ppt Avoiding Untimely/Unwanted Shadow Copies If MOUNT commands are in SYSTARTUP_VMS.COM, shadow copies may start as soon as the first node at the downed site reboots
Recommendation is to not mount shadowsets automatically at startup; manually initiate shadow copies of application data disks at an opportune time
152. page 152 hp enterprise technical symposium filename.ppt Avoiding Untimely/Unwanted Shadow Copies In bringing a cluster with cross-site shadowsets completely down and back up, you need to preserve both shadowset members to avoid a full copy operation
Cross-site shadowsets must be dismounted while both members are still accessible
This implies keeping MSCP-serving OpenVMS systems up at each site until the shadowsets are dismounted
Easy way is to use the CLUSTER_SHUTDOWN option on SHUTDOWN.COM
153. page 153 hp enterprise technical symposium filename.ppt Avoiding Untimely/Unwanted Shadow Copies In bringing a cluster with cross-site shadowsets back up, you need to ensure both shadowset members are accessible at mount time, to avoid removing a member and thus needing to do a shadow full-copy afterward
If MOUNT commands are in SYSTARTUP_VMS.COM, the first node up at the first site up will form 1-member shadow sets and drop the other site’s shadow members
154. page 154 hp enterprise technical symposium filename.ppt Avoiding Untimely/Unwanted Shadow Copies Recommendation is to not mount cross-site shadowsets automatically in startup; wait until at least a couple of systems are up at each site, then manually initiate cross-site shadowset mounts
Since MSCP-serving is enabled before a node joins a cluster, booting systems at both sites simultaneously works most of the time
155. page 155 hp enterprise technical symposium filename.ppt Avoiding Untimely/Unwanted Shadow Copies New Shadowing capabilities help in this area:
$ MOUNT DSAnnn: label
without any other qualifiers will mount a shadowset on an additional node using the existing membership, without the chance of any shadow copies being initiated.
This allows you to start the application at the second site and run from the first site’s disks, and do the shadow copies later
156. page 156 hp enterprise technical symposium filename.ppt Avoiding Untimely/Unwanted Shadow Copies DCL code can be written to wait for both shadowset members before MOUNTing, using the /POLICY=REQUIRE_MEMBERS and /NOCOPY qualifiers as safeguards against undesired copies
The /VERIFY_LABEL qualifier to MOUNT prevents a shadow copy from starting to a disk unless its label is “SCRATCH_DISK”:
This means that before a member disk can be a target of a full-copy operation, it must be MOUNTed with /OVERRIDE=SHADOW and a $ SET VOLUME/LABEL=SCRATCH_DISK command executed to change the label
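A hedged sketch of preparing a disk as a full-copy target under /POLICY=VERIFY_LABEL (hypothetical device names):
$! Mount the former member privately, overriding its old label and shadow membership:
$ MOUNT/OVERRIDE=(IDENTIFICATION, SHADOW_MEMBERSHIP) $1$DGA2:
$ SET VOLUME/LABEL=SCRATCH_DISK $1$DGA2:
$ DISMOUNT $1$DGA2:
$! Now it can be added as a full-copy target:
$ MOUNT/SYSTEM DSA42: /SHADOW=$1$DGA2: /POLICY=VERIFY_LABEL VOLUME_LABEL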
157. page 157 hp enterprise technical symposium filename.ppt Avoiding Untimely/Unwanted Shadow Copies One of the USER* SYSGEN parameters (e.g. USERD1) may be used to as a flag to indicate to startup procedures the desired action:
Mount both members (normal case; both sites OK)
Mount only local member (other site is down)
Mount only remote member (other site survived; this site re-entering the cluster, but deferring shadow copies until later)
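A hedged sketch of testing such a flag in a startup procedure. The USERD1 values and their meanings are a site convention, not defined by OpenVMS, and this assumes F$GETSYI can return the parameter on your version (otherwise the value can be read via SYSGEN instead):
$ FLAG = F$GETSYI("USERD1")
$ IF FLAG .EQ. 0 THEN GOTO MOUNT_BOTH_MEMBERS    ! normal case: both sites OK
$ IF FLAG .EQ. 1 THEN GOTO MOUNT_LOCAL_ONLY      ! other site is down
$ IF FLAG .EQ. 2 THEN GOTO MOUNT_REMOTE_ONLY     ! rejoining; defer shadow copies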
158. page 158 hp enterprise technical symposium filename.ppt “Creeping Doom” Scenario Here we have two sites, with data mirrored between them.
159. page 159 hp enterprise technical symposium filename.ppt “Creeping Doom” Scenario A lightning strike hits the network room, taking out (all of) the inter-site link(s).
160. page 160 hp enterprise technical symposium filename.ppt “Creeping Doom” Scenario First symptom is failure of link(s) between two sites
Forces choice of which datacenter of the two will continue
Transactions then continue to be processed at the chosen datacenter, updating the data
In this case, for the sake of illustration, we’ll assume we happen to choose the left-hand site (the one where lightning struck) to continue processing transactions.
161. page 161 hp enterprise technical symposium filename.ppt “Creeping Doom” Scenario
162. page 162 hp enterprise technical symposium filename.ppt “Creeping Doom” Scenario In this scenario, the same failure which caused the inter-site link(s) to go down expands to destroy the entire datacenter
163. page 163 hp enterprise technical symposium filename.ppt “Creeping Doom” Scenario Fire spreads from the network room and engulfs the entire site.
164. page 164 hp enterprise technical symposium filename.ppt “Creeping Doom” Scenario Transactions processed after “wrong” datacenter choice are thus lost
Commitments implied to customers by those transactions are also lost
165. page 165 hp enterprise technical symposium filename.ppt “Creeping Doom” Scenario Techniques for avoiding data loss due to “Creeping Doom”:
Tie-breaker at 3rd site helps in many (but not all) cases
3rd copy of data at 3rd site
166. page 166 hp enterprise technical symposium filename.ppt “Rolling Disaster” Scenario Disaster or outage makes one site’s data out-of-date
While re-synchronizing data to the formerly-down site, a disaster takes out the primary site
167. page 167 hp enterprise technical symposium filename.ppt “Rolling Disaster” Scenario The right-hand site has been down for scheduled maintenance or something, or has recovered from an earlier disaster, and we are in the process of copying data from the disks at the left-hand site to the disks at the right-hand site, one track after the other.
168. page 168 hp enterprise technical symposium filename.ppt “Rolling Disaster” Scenario Before the data re-synchronization can be completed, a disaster strikes the left-hand site. Now we have no valid copy of the data left.
169. page 169 hp enterprise technical symposium filename.ppt “Rolling Disaster” Scenario Techniques for avoiding data loss due to “Rolling Disaster”:
Keep copy (backup, snapshot, clone) of out-of-date copy at target site instead of over-writing the only copy there, or
Use a hardware mirroring scheme which preserves write order during re-synch
In either case, the surviving copy will be out-of-date, but at least you’ll have some copy of the data
Keeping a 3rd copy of data at 3rd site is the only way to ensure there is no data lost
170. page 170 hp enterprise technical symposium filename.ppt Primary CPU Workload MSCP-serving in a disaster-tolerant cluster is typically handled in interrupt state on the Primary CPU
Interrupts from LAN Adapters come in on the Primary CPU
A multiprocessor system may have no more MSCP-serving capacity than a uniprocessor
Fast_Path may help
Lock mastership workload for remote lock requests can also be a heavy contributor to Primary CPU interrupt state usage
171. page 171 hp enterprise technical symposium filename.ppt Primary CPU interrupt-state saturation OpenVMS receives all interrupts on the Primary CPU (prior to 7.3-1)
If interrupt workload exceeds capacity of Primary CPU, odd symptoms can result
CLUEXIT bugchecks, performance anomalies
OpenVMS has no internal feedback mechanism to divert excess interrupt load
e.g. node may take on more trees to lock-master than it can later handle
Use MONITOR MODES/CPU=n/ALL to track primary CPU interrupt state usage and peaks (where “n” is the Primary CPU shown by $SHOW CPU)
172. page 172 hp enterprise technical symposium filename.ppt Interrupt-state/stack saturation FAST_PATH:
Can shift interrupt-state workload off primary CPU in SMP systems
IO_PREFER_CPUS value of an even number disables CPU 0 use
Consider limiting interrupts to a subset of non-primaries rather than all
FAST_PATH for CI since about 7.1
FAST_PATH for SCSI and FC is in 7.3 and above
FAST_PATH for LANs (e.g. FDDI & Ethernet) probably 7.3-2
FAST_PATH for Memory Channel probably “never”
Prior to 7.3-1, even with FAST_PATH enabled, CPU 0 still received the device interrupt, but handed it off immediately via an inter-processor interrupt
7.3-1 allows interrupts for FAST_PATH devices to bypass the Primary CPU entirely and go directly to a non-primary CPU
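A hedged MODPARAMS.DAT sketch of the settings discussed in this list (values are hypothetical, for a 4-CPU system on a version that supports Fast Path for the interconnects in use):
! Enable Fast Path
FAST_PATH = 1
! Bitmask of CPUs eligible for Fast Path port assignment:
! 14 = binary 1110, i.e. CPUs 1-3 only (an even value excludes CPU 0)
IO_PREFER_CPUS = 14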
173. page 173 hp enterprise technical symposium filename.ppt Making System Management of Disaster-Tolerant Clusters More Efficient Most disaster-tolerant clusters have multiple system disks
This tends to increase system manager workload for applying upgrades and patches for OpenVMS and layered products to each system disk
Techniques are available which minimize the effort involved
174. page 174 hp enterprise technical symposium filename.ppt Making System Management of Disaster-Tolerant Clusters More Efficient Create a “cluster-common” disk
Cross-site shadowset
Mount it in SYLOGICALS.COM
Put all cluster-common files there, and define logicals in SYLOGICALS.COM to point to them:
SYSUAF, RIGHTSLIST
Queue file, LMF database, etc.
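A hedged SYLOGICALS.COM sketch (device, member, label, and directory names are hypothetical):
$ MOUNT/SYSTEM/NOASSIST DSA10: /SHADOW=($1$DGA10:, $1$DGA20:) CLUSTER_COMMON
$ DEFINE/SYSTEM/EXEC CLUSTER_COMMON DSA10:[CLUSTER_COMMON]
$ DEFINE/SYSTEM/EXEC SYSUAF     CLUSTER_COMMON:SYSUAF.DAT
$ DEFINE/SYSTEM/EXEC RIGHTSLIST CLUSTER_COMMON:RIGHTSLIST.DAT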
175. page 175 hp enterprise technical symposium filename.ppt Making System Management of Disaster-Tolerant Clusters More Efficient Put startup files on cluster-common disk also; and replace startup files on all system disks with a pointer to the common one:
e.g. SYS$STARTUP:STARTUP_VMS.COM contains only:
$ @CLUSTER_COMMON:SYSTARTUP_VMS
To allow for differences between nodes, test for node name in common startup files, e.g.
$ NODE = F$GETSYI("NODENAME")
$ IF NODE .EQS. "GEORGE" THEN ...
176. page 176 hp enterprise technical symposium filename.ppt Making System Management of Disaster-Tolerant Clusters More Efficient Create a MODPARAMS_COMMON.DAT file on the cluster-common disk which contains system parameter settings common to all nodes
For multi-site or disaster-tolerant clusters, also create one of these for each site
Include an AGEN$INCLUDE_PARAMS line in each node-specific MODPARAMS.DAT to include the common parameter settings
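A hedged sketch of a node-specific MODPARAMS.DAT that pulls in the common files (the logical name and file names are hypothetical):
! Settings common to every node in the cluster:
AGEN$INCLUDE_PARAMS CLUSTER_COMMON:MODPARAMS_COMMON.DAT
! Settings common to this site only:
AGEN$INCLUDE_PARAMS CLUSTER_COMMON:MODPARAMS_COMMON_SITE_A.DAT
! Node-specific settings follow
SCSNODE = "ALPHA1"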
177. page 177 hp enterprise technical symposium filename.ppt Making System Management of Disaster-Tolerant Clusters More Efficient Use “Cloning” technique to replicate system disks and avoid doing “n” upgrades for “n” system disks
178. page 178 hp enterprise technical symposium filename.ppt System disk “Cloning” technique Create “Master” system disk with roots for all nodes. Use Backup to create Clone system disks.
To minimize disk space, move dump files off system disk for all nodes
Before an upgrade, save any important system-specific info from Clone system disks into the corresponding roots on the Master system disk
Basically anything that’s in SYS$SPECIFIC:[*]
Examples: ALPHAVMSSYS.PAR, MODPARAMS.DAT, AGEN$FEEDBACK.DAT
Perform upgrade on Master disk
Use Backup to copy Master to Clone disks again.
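A hedged sketch of the cloning step (device names are hypothetical; the Master disk should be idle, or mounted privately, while it is copied):
$! Target (Clone) disk is mounted /FOREIGN so BACKUP/IMAGE can overwrite it:
$ MOUNT/FOREIGN $1$DGA101:
$ BACKUP/IMAGE $1$DGA100: $1$DGA101:
$ DISMOUNT $1$DGA101: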
179. page 179 hp enterprise technical symposium filename.ppt Implementing LAVC$FAILURE_ANALYSIS Template program is found in SYS$EXAMPLES: and called LAVC$FAILURE_ANALYSIS.MAR
Written in Macro-32
but you don’t need to know Macro to use it
Documented in Appendix D of OpenVMS Cluster Systems Manual
Appendix E (subroutines the above program calls) and Appendix F (general info on troubleshooting LAVC LAN problems) are also very helpful
180. page 180 hp enterprise technical symposium filename.ppt Using LAVC$FAILURE_ANALYSIS To use, the program must be
Edited to insert site-specific information
Compiled (assembled on VAX)
Linked, and
Run at boot time on each node in the cluster
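A hedged sketch of the build-and-run steps (check Appendix D for the exact commands for your platform and version):
$! After copying the template from SYS$EXAMPLES: and editing in the site-specific data:
$! On VAX (assembled):
$ MACRO LAVC$FAILURE_ANALYSIS.MAR
$! On Alpha (compiled with the Macro-32 compiler):
$ MACRO/MIGRATION LAVC$FAILURE_ANALYSIS.MAR
$ LINK LAVC$FAILURE_ANALYSIS
$! Run at boot time on each node, e.g. from SYSTARTUP_VMS.COM:
$ RUN LAVC$FAILURE_ANALYSIS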
181. page 181 hp enterprise technical symposium filename.ppt Maintaining LAVC$FAILURE_ANALYSIS Program must be re-edited whenever:
The LAVC LAN is reconfigured
A node’s MAC address changes
e.g. Field Service replaces a LAN adapter without swapping MAC address ROMs
A node is added or removed (permanently) from the cluster
182. page 182 hp enterprise technical symposium filename.ppt How Failure Analysis is Done OpenVMS is told what the network configuration should be
From this info, OpenVMS infers which LAN adapters should be able to “hear” Hello packets from which other LAN adapters
By checking for receipt of Hello packets, OpenVMS can tell if a path is working or not
183. page 183 hp enterprise technical symposium filename.ppt How Failure Analysis is Done By analyzing Hello packet receipt patterns and correlating them with a mathematical graph of the network, OpenVMS can tell what nodes of the network are passing Hello packets and which appear to be blocking Hello packets
OpenVMS determines a Primary Suspect (and, if there is ambiguity as to exactly what has failed, an Alternate Suspect), and reports these via OPCOM messages with a “%LAVC” prefix
184. page 184 hp enterprise technical symposium filename.ppt Getting Failures Fixed Since notification is via OPCOM messages, someone or something needs to be scanning OPCOM output and taking action
ConsoleWorks, Console Manager, CLIM, or RoboMon can scan for %LAVC messages and take appropriate action (e-mail, pager, etc.)
185. page 185 hp enterprise technical symposium filename.ppt Gathering Info Data required:
Local Area Network configuration:
VMS Nodes
LAN adapters in each node
Bridges
Hubs
Links between all of the above
186. page 186 hp enterprise technical symposium filename.ppt Network Information OpenVMS considers LAN building blocks as being divided into 4 classes:
NODE: The OpenVMS systems
ADAPTER: LAN host-bus adapters in each OpenVMS system
COMPONENT: Hubs, bridges, bridge-routers
CLOUD: Combinations of components that can’t be diagnosed directly (more on this later)
187. page 187 hp enterprise technical symposium filename.ppt Network building blocks
188. page 188 hp enterprise technical symposium filename.ppt Network building blocks
189. page 189 hp enterprise technical symposium filename.ppt Network building blocks
190. page 190 hp enterprise technical symposium filename.ppt Network building blocks
191. page 191 hp enterprise technical symposium filename.ppt Handling Network Loops The algorithm used for LAVC$FAILURE_ANALYSIS can’t deal with loops in the network graph
Yet redundancy is often configured among LAN components
The bridges’ Spanning Tree algorithm shuts off backup links unless and until a failure occurs
Hello packets don’t get through these backup links, so OpenVMS can’t track them
For these cases, you replace the redundant portion of the network with a “network cloud” that includes all of the redundant components
Then OpenVMS can determine if the network “cloud” as a whole is functioning or not
192. page 192 hp enterprise technical symposium filename.ppt Handling Redundancy Multiple, completely separate LANs don’t count as “loops” and OpenVMS can track each one separately and simultaneously
193. page 193 hp enterprise technical symposium filename.ppt Gathering Info Data required (more detail):
Node names and descriptions
LAN adapter types and descriptions, and:
MAC address
e.g. 08-00-2B-xx-xx-xx, 00-00-F8-xx-xx-xx
plus DECnet-style MAC address for Phase IV
e.g. AA-00-04-00-yy-zz
194. page 194 hp enterprise technical symposium filename.ppt Getting MAC address info
195. page 195 hp enterprise technical symposium filename.ppt Editing the Program Once the data is gathered, you edit the program
There are 5 sections to edit, as follows:
196. page 196 hp enterprise technical symposium filename.ppt Edit 1 In Edit 1, you can give descriptive names to nodes, adapters, components, and clouds
These names become names of macros which you’ll create invocations of later in the code
197. page 197 hp enterprise technical symposium filename.ppt Edit 1
198. page 198 hp enterprise technical symposium filename.ppt Edit 2 In Edit 2, you create ASCII art to document the LAVC LAN configuration
This has no functional effect on the code, but helps you (and others who follow you) understand the information in the sections which follow
In the drawing, you choose brief abbreviated names for each network building block (Node, Adapter, Component, or Cloud)
These abbreviated names are only used within the program, and do not appear externally
199. page 199 hp enterprise technical symposium filename.ppt Edit 2
200. page 200 hp enterprise technical symposium filename.ppt Edit 3 In Edit 3, you name and provide a text description for each system and its LAN adapter(s), and the MAC address of each adapter
The name and text description will appear in OPCOM messages indicating when failure or repair has occurred
The MAC address is used to identify the origin of Hello messages
201. page 201 hp enterprise technical symposium filename.ppt Edit 3 For DECnet Phase IV, which changes the MAC address on all circuits it knows about from the default hardware address to a special DECnet address when it starts up, you provide both:
The hardware MAC address (e.g. 08-00-2B-nn-nn-nn) and
The DECnet-style MAC address which is derived from the DECnet address of the node (AA-00-04-00-yy-xx)
DECnet Phase V does not change the MAC address, so only the HW address is needed
202. page 202 hp enterprise technical symposium filename.ppt Edit 3
203. page 203 hp enterprise technical symposium filename.ppt Edit 4 In Edit 4, you name and provide a text description for each Component and each Cloud
The name and text description will appear in OPCOM messages indicating when failure or repair has occurred
204. page 204 hp enterprise technical symposium filename.ppt Edit 4
205. page 205 hp enterprise technical symposium filename.ppt Edit 5 In Edit 5, you indicate which network building blocks have connections to each other
This is a list of pairs of devices, indicating they are connected
206. page 206 hp enterprise technical symposium filename.ppt Edit 5
207. page 207 hp enterprise technical symposium filename.ppt Level of Detail There is a trade-off between level of detail in diagnostic info and the amount of work required to initially set up and to maintain the program over time
More detail means more work to setup, and more maintenance work, but can provide more-specific diagnostic info when failures occur
208. page 208 hp enterprise technical symposium filename.ppt Level of Detail Example
209. page 209 hp enterprise technical symposium filename.ppt EDIT_LAVC.COM Tool A DCL command procedure is available to gather the information and create an example LAVC$FAILURE_ANALYSIS.MAR program customized for a given cluster. See
http://encompasserve.org/~parris/edit_lavc.com and EDIT_LAVC_DOC.TXT at the same location
To turn off LAVC Failure Analysis, use the LAVC$FAILURE_OFF.MAR program found at the same location
210. page 210 hp enterprise technical symposium filename.ppt Long-Distance Clusters OpenVMS SPD supports distance of up to 250 km (150 miles) between sites
up to 833 km (500 miles) with DTCS or BRS
Why the limit?
Inter-site latency
211. page 211 hp enterprise technical symposium filename.ppt Long-distance Cluster Issues Latency due to speed of light becomes significant at higher distances. Rules of thumb:
About 1/2 millisecond per 100 km one-way, or
About 1 ms per 100 km, round-trip latency
Actual circuit path length can be longer than highway driving distance between sites
Latency affects I/O and locking
The Speed of Light is one speed limit you can’t break.
Inter-site latency will affect the performance of any I/O operations which must go to the remote site (such as write I/Os which update the mirrorset members at that site) as well as any distributed lock manager operations that go across the link between sites, in a cluster which has a DLM.
212. page 212 hp enterprise technical symposium filename.ppt Inter-site Round-Trip Lock Request Latencies
213. page 213 hp enterprise technical symposium filename.ppt Differentiate between latency and bandwidth Can’t get around the speed of light and its latency effects over long distances
Higher-bandwidth link doesn’t mean lower latency
Multiple links may help latency somewhat under heavy loading due to shorter queue lengths, but can’t outweigh speed-of-light issues
At long inter-site distances, a “higher-speed” (really higher-bandwidth) link likely won’t have lower latency, as the latency will be dominated by the effect of the speed of light over the extended distance.
214. page 214 hp enterprise technical symposium filename.ppt Latency of Inter-Site Link Latency affects performance of:
Lock operations that cross the inter-site link
Lock requests
Directory lookups, deadlock searches
Write I/Os to remote shadowset members, either:
Over SCS link through the OpenVMS MSCP Server on a node at the opposite site, or
Direct via Fibre Channel (with an inter-site FC link)
Both MSCP and the SCSI-3 protocol used over FC take a minimum of two round trips for writes
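As a rough worked example using the 1 ms per 100 km round-trip rule of thumb above: at the 250 km SPD limit, each inter-site round trip costs on the order of 2.5 ms, so a remote write that needs two round trips adds roughly 5 ms before any controller or disk service time is counted, compared with a small fraction of a millisecond for the equivalent local operation.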
215. page 215 hp enterprise technical symposium filename.ppt Mitigating Impact of Inter-Site Latency Locking
Try to avoid lock requests to master node at remote site
OpenVMS does move mastership of a resource tree to the node with the most activity
Lock directory lookups with directory node at remote site can only be avoided by setting LOCKDIRWT to zero on all nodes at the remote site
This is typically only satisfactory for Primary/Backup or remote-shadowing-only clusters
216. page 216 hp enterprise technical symposium filename.ppt Mitigating Impact of Inter-Site Latency Use SHOW CLUSTER/CONTINUOUS with ADD CONNECTIONS, ADD REM_PROC_NAME, and ADD CR_WAITS to check for SCS credit waits. If counts are present and increasing over time, increase the SCS credits at the remote end as follows:
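First, a minimal sketch of the credit-wait check itself (the exact interaction of the continuous display may vary; the remedies follow on the next slides):

$ SHOW CLUSTER/CONTINUOUS
Command> ADD CONNECTIONS
Command> ADD REM_PROC_NAME
Command> ADD CR_WAITS

Watch the CR_WAITS column for the connections of interest; a count that keeps climbing over time indicates credit waits on that connection.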
217. page 217 hp enterprise technical symposium filename.ppt Mitigating Impact of Inter-Site Latency For credit waits on VMS$VAXcluster SYSAP connections:
Increase CLUSTER_CREDITS parameter
Default is 10; maximum is 128
218. page 218 hp enterprise technical symposium filename.ppt Mitigating Impact of Inter-Site Latency For credit waits on VMS$DISK_CL_DRVR / MSCP$DISK connections:
For OpenVMS server node, increase MSCP_CREDITS parameter. Default is 8; maximum is 128.
For HSJ/HSD controller, lower MAXIMUM_HOSTS from default of 16 to actual number of OpenVMS systems on the CI/DSSI interconnect
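A hedged sketch of how the two OpenVMS parameters might be raised via MODPARAMS.DAT and AUTOGEN (values are illustrative only; the HSJ/HSD MAXIMUM_HOSTS change is made from the controller console and is not shown):

! Additions to SYS$SYSTEM:MODPARAMS.DAT on the affected node(s)
MIN_CLUSTER_CREDITS = 32     ! default 10, maximum 128
MIN_MSCP_CREDITS = 32        ! default 8, maximum 128 (MSCP-serving nodes)

followed by an AUTOGEN pass to apply them:

$ @SYS$UPDATE:AUTOGEN GETDATA SETPARAMS
$ ! A reboot is then needed for the new values to take effect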
219. page 219 hp enterprise technical symposium filename.ppt Local versus Remote operations Optimize local operations:
Read I/Os:
Set SHADOW_SYS_DISK bit %X10000 (bit 16) to select local-read optimization (to favor CI/DSSI disks over MSCP-served disks), if applicable, or
Specify SITE or READ_COST for member disks with OpenVMS 7.3 or recent VOLSHAD ECO kits
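A hedged sketch of the OpenVMS 7.3-style read optimization, with hypothetical site numbers and device names:

$ ! Site 1 nodes: mark the virtual unit with the local site, and each member with its site
$ SET DEVICE/SITE=1 DSA100:
$ SET DEVICE/SITE=1 $1$DGA100:         ! member at site 1
$ SET DEVICE/SITE=2 $2$DGA100:         ! member at site 2
$ ! Alternatively, bias reads away from a member explicitly
$ SET DEVICE/READ_COST=1000 $2$DGA100:

With the site values in place, shadowing directs read I/Os to the member(s) at the node’s own site.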
220. page 220 hp enterprise technical symposium filename.ppt Application Scheme 1:Hot Primary/Cold Standby All applications normally run at the primary site
Second site is idle, except for volume shadowing, until primary site fails, then it takes over processing
Performance will be good (all-local locking)
Fail-over time will be poor, and risk high (standby systems not active and thus not being tested)
Wastes computing capacity at the remote site
In an OpenVMS disaster-tolerant cluster, there are 3 main ways folks run applications. (OpenVMS Clusters are a shared-everything cluster. In a shared-storage cluster, only the first 2 may be possible.)
221. page 221 hp enterprise technical symposium filename.ppt Application Scheme 2:Hot/Hot but Alternate Workloads All applications normally run at one site or the other, but not both; data is shadowed between sites, and the opposite site takes over upon a failure
Performance will be good (all-local locking)
Fail-over time will be poor, and risk moderate (standby systems in use, but specific applications not active and thus not being tested from that site)
Second site’s computing capacity is actively used
222. page 222 hp enterprise technical symposium filename.ppt Application Scheme 3:Uniform Workload Across Sites All applications normally run at both sites simultaneously; surviving site takes all load upon failure
Performance may be impacted (some remote locking) if inter-site distance is large
Fail-over time will be excellent, and risk low (standby systems are already in use running the same applications, thus constantly being tested)
Both sites’ computing capacity is actively used
Doing this requires a shared-everything cluster implementation like OpenVMS Clusters or TruClusters.
223. page 223 hp enterprise technical symposium filename.ppt Capacity Considerations When running workload at both sites, be careful to watch utilization.
Utilization over 35% will result in utilization over 70% if one site is lost
Utilization over 50% will mean there is no possible way one surviving site can handle all the workload
When capacity at both sites can be used (Application Schemes 2 and 3), one must be careful to not over-subscribe the computing capacity.
224. page 224 hp enterprise technical symposium filename.ppt Response time vs. Utilization Queuing theory allows us to predict the impact on response time as a resource is increasingly busy. In this graph, response time for an idle resource is set at 1. At 50% utilization, response time has doubled, and at 90% utilization, response time has gone to 10 times. To prevent response times from becoming unacceptable, it is advisable to keep resource utilization below the “knee” of the curve, which starts to occur at about 70%.
225. page 225 hp enterprise technical symposium filename.ppt Response time vs. Utilization: Impact of losing 1 site If you take advantage of the capacity of both sites to run production workload, be careful not to fall into the trap of failing to provide additional capacity to keep up with increases in workload, because if one of the two sites fails, you will have only one-half the capacity, utilization will double, and response times may become intolerable. To keep utilization under the “knee of the curve”, or about 70%, after a site loss, one needs to procure additional computing capacity whenever utilization gets over about ½ that, or 35%, under normal conditions, when two sites are present.
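The shape of the curve described above is consistent with the simple single-server (M/M/1) queuing approximation, offered here only as a back-of-the-envelope model:

R(U) ≈ R(0) / (1 − U)

so R(0.5) = 2 × R(0) and R(0.9) = 10 × R(0), matching the graph. Losing one of two equally loaded sites roughly doubles U at the survivor, which is why 35% per-site utilization is the practical ceiling if you want to stay at or below the 70% knee after a failover.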
226. page 226 hp enterprise technical symposium filename.ppt Testing Separate test environment is very helpful, and highly recommended
Good practices require periodic testing of a simulated disaster. This allows you to:
Validate your procedures
Train your people
227. page 227 hp enterprise technical symposium filename.ppt Setup Steps for Creating a Disaster-Tolerant Cluster Let’s look at the steps involved in setting up a Disaster-Tolerant Cluster from the ground up.
Datacenter site preparation
Install the hardware and networking equipment
Ensure dual power supplies are plugged into separate power feeds
Select configuration parameters:
Choose an unused cluster group number; select a cluster password
Choose site allocation class(es)
228. page 228 hp enterprise technical symposium filename.ppt Steps for Creating a Disaster-Tolerant Cluster Configure storage (if HSx controllers)
Install OpenVMS on each system disk
Load licenses for OpenVMS Base, OpenVMS Users, Cluster, Volume Shadowing, and, for ease of access, your networking protocols (DECnet and/or TCP/IP)
229. page 229 hp enterprise technical symposium filename.ppt Setup Steps for Creating a Disaster-Tolerant Cluster Create a shadowset across sites for files which will be used in common by all nodes in the cluster. On it, place the following (a logical-name sketch follows this list):
SYSUAF and RIGHTSLIST files (copy from any system disk)
License database (LMF$LICENSE.LDB)
NETPROXY.DAT, NET$PROXY.DAT (DECnet proxy login files), if used; NETNODE_REMOTE.DAT, NETNODE_OBJECT.DAT
VMS$MAIL_PROFILE.DATA (OpenVMS Mail Profile file)
Security audit journal file
Password History and Password Dictionary files
Queue manager files
System login command procedure SYS$SYLOGIN:
LAVC$FAILURE_ANALYSIS program from the SYS$EXAMPLES: area, customized for the specific cluster interconnect configuration and LAN addresses of the installed systems
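A hedged sketch of the system-wide logical names that point running systems at those cluster-common copies (the CLUSTER_COMMON logical and exact file names are assumptions; extend the same pattern to the other files listed above):

$ DEFINE/SYSTEM/EXEC SYSUAF      CLUSTER_COMMON:SYSUAF.DAT
$ DEFINE/SYSTEM/EXEC RIGHTSLIST  CLUSTER_COMMON:RIGHTSLIST.DAT
$ DEFINE/SYSTEM/EXEC LMF$LICENSE CLUSTER_COMMON:LMF$LICENSE.LDB
$ DEFINE/SYSTEM/EXEC NETPROXY    CLUSTER_COMMON:NETPROXY.DAT
$ DEFINE/SYSTEM/EXEC NET$PROXY   CLUSTER_COMMON:NET$PROXY.DAT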
230. page 230 hp enterprise technical symposium filename.ppt Setup Steps for Creating a Disaster-Tolerant Cluster To create the license database (a sketch of the sequence follows these steps):
Copy initial file from any system disk
Leave shell LDBs on each system disk for booting purposes (we’ll map to the common one in SYLOGICALS.COM)
Use LICENSE ISSUE/PROCEDURE/OUT=xxx.COM (and LICENSE ENABLE afterward to re-enable the original license in the LDB on the system disk), then execute the procedure against the common database to put all licenses for all nodes into the common LDB file
Add all additional licenses to the cluster-common LDB file (i.e. layered products)
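A hedged sketch of that sequence for a single license (the product name is illustrative):

$ ! On a node booted with its local system-disk LDB
$ LICENSE ISSUE/PROCEDURE/OUT=OPENVMS_LICENSE.COM OPENVMS-ALPHA
$ LICENSE ENABLE OPENVMS-ALPHA              ! re-enable the original PAK in the local LDB
$ ! Point this process at the cluster-common database, then load the license into it
$ DEFINE/PROCESS LMF$LICENSE CLUSTER_COMMON:LMF$LICENSE.LDB
$ @OPENVMS_LICENSE.COM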
231. page 231 hp enterprise technical symposium filename.ppt Setup Steps for Creating a Disaster-Tolerant Cluster Create a minimal SYLOGICALS.COM that simply mounts the cluster-common shadowset, defines a logical name CLUSTER_COMMON to point to a common area for startup procedures, and then invokes @CLUSTER_COMMON:SYLOGICALS.COM
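A minimal sketch of such a SYLOGICALS.COM, with hypothetical device names, volume label, and directory:

$ ! SYS$MANAGER:SYLOGICALS.COM -- identical minimal version on every system disk
$ MOUNT/SYSTEM/NOASSIST DSA1: /SHADOW=($1$DGA1:,$2$DGA1:) CLUCOMMON
$ DEFINE/SYSTEM/EXEC CLUSTER_COMMON DSA1:[CLUSTER_COMMON]
$ @CLUSTER_COMMON:SYLOGICALS.COM
$ EXIT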
232. page 232 hp enterprise technical symposium filename.ppt Setup Steps for Creating a Disaster-Tolerant Cluster Create “shell” command scripts for each of the following files. The “shell” will contain only one command, to invoke the corresponding version of this startup file in the CLUSTER_COMMON area. For example, SYS$STARTUP:SYSTARTUP_VMS.COM on every system disk will contain the single line: $ @CLUSTER_COMMON:SYSTARTUP_VMS.COM. Do this for each of the following files:
SYCONFIG.COM
SYPAGSWPFILES.COM
SYSECURITY.COM
SYSTARTUP_VMS.COM
SYSHUTDWN.COM
Any command procedures that are called by these cluster-common startup procedures should also be placed in the cluster-common area
233. page 233 hp enterprise technical symposium filename.ppt Setup Steps for Creating a Disaster-Tolerant Cluster Create AUTOGEN “include” files to simplify the running of AUTOGEN on each node:
Create one for parameters common to systems at each site. This will contain settings for a given site for parameters such as:
ALLOCLASS
TAPE_ALLOCLASS
Possibly SHADOW_SYS_UNIT (if all systems at a site share a single system disk, this gives the unit number)
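A hedged sketch of one such site-specific include file (all values are purely illustrative):

! CLUSTER_COMMON:MODPARAMS_COMMON_SITE_A.DAT -- hypothetical Site A values
ALLOCLASS = 1             ! allocation class used by Site A systems
TAPE_ALLOCLASS = 1
SHADOW_SYS_UNIT = 100     ! only if all Site A systems boot from the DSA100 system disk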
234. page 234 hp enterprise technical symposium filename.ppt Setup Steps for Creating a Disaster-Tolerant Cluster Create one for parameters common to every system in the entire cluster. This will contain settings for things like the following (a sketch follows this list):
VAXCLUSTER
RECNXINTERVAL (based on inter-site link recovery times)
SHADOW_MBR_TMO (typically 10 seconds larger than RECNXINTERVAL)
EXPECTED_VOTES (total of all votes in the cluster when all nodes are up)
Possibly VOTES (i.e. if all nodes have 1 vote each)
DISK_QUORUM=" " (no quorum disk)
Probably LOCKDIRWT (i.e. if all nodes have equal values of 1)
SHADOWING=2 (enable host-based volume shadowing)
NISCS_LOAD_PEA0=1
NISCS_MAX_PKTSZ (to use larger FDDI packets, or this plus LAN_FLAGS to use larger Gigabit Ethernet packets)
Probably SHADOW_SYS_DISK (to set bit 16 to enable local shadowset read optimization if needed)
Minimum values for:
CLUSTER_CREDITS
MSCP_BUFFER
MSCP_CREDITS
MSCP_LOAD, MSCP_SERVE_ALL; TMSCP_LOAD, TMSCP_SERVE_ALL
Possibly TIMVCFAIL (if faster-than-standard failover times are required)
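Pulling the list above together, a hedged sketch of the cluster-common include file might look like the following (every value is illustrative and must be chosen for the actual configuration):

! CLUSTER_COMMON:MODPARAMS_CLUSTER_COMMON.DAT
VAXCLUSTER = 2            ! always join the cluster
RECNXINTERVAL = 60        ! based on inter-site link recovery times
SHADOW_MBR_TMO = 70       ! typically RECNXINTERVAL + 10
EXPECTED_VOTES = 4        ! total votes with all nodes up
VOTES = 1
DISK_QUORUM = " "         ! no quorum disk
LOCKDIRWT = 1
SHADOWING = 2             ! enable host-based volume shadowing
NISCS_LOAD_PEA0 = 1
MIN_CLUSTER_CREDITS = 32
MIN_MSCP_BUFFER = 16384
MIN_MSCP_CREDITS = 32
MSCP_LOAD = 1
MSCP_SERVE_ALL = 1
TMSCP_LOAD = 1
TMSCP_SERVE_ALL = 1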
235. page 235 hp enterprise technical symposium filename.ppt Setup Steps for Creating a Disaster-Tolerant Cluster Pare down the MODPARAMS.DAT file in each system root. It should contain basically only the parameter settings for the following (a sketch follows this list):
SCSNODE
SCSSYSTEMID
plus a few AGEN$INCLUDE_PARAMS lines pointing to the CLUSTER_COMMON: area for:
MODPARAMS_CLUSTER_COMMON.DAT (parameters which are the same across the entire cluster)
MODPARAMS_COMMON_SITE_xxx.DAT (parameters which are the same for all systems within a given site or lobe of the cluster)
Architecture-specific common parameter file (Alpha vs. VAX vs. Itanium), if needed (parameters which are common to all systems of that architecture)
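A hedged sketch of a pared-down, per-node MODPARAMS.DAT (node name, SCSSYSTEMID, and include-file names are hypothetical):

! SYS$SPECIFIC:[SYSEXE]MODPARAMS.DAT for node NODA01
AGEN$INCLUDE_PARAMS CLUSTER_COMMON:MODPARAMS_CLUSTER_COMMON.DAT
AGEN$INCLUDE_PARAMS CLUSTER_COMMON:MODPARAMS_COMMON_SITE_A.DAT
AGEN$INCLUDE_PARAMS CLUSTER_COMMON:MODPARAMS_COMMON_ALPHA.DAT
SCSNODE = "NODA01"
SCSSYSTEMID = 19577       ! e.g. (DECnet area * 1024) + node number
! Any node-specific overrides go here, after the include lines (see the next slides)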
236. page 236 hp enterprise technical symposium filename.ppt Setup Steps for Creating a Disaster-Tolerant Cluster Typically, all the other parameter values one tends to see in an individual stand-alone node’s MODPARAMS.DAT file will be better placed in one of the common parameter files. This helps ensure consistency of parameter values across the cluster, minimizes the system manager’s workload, and reduces the chance of error when a parameter value must be changed on multiple nodes.
237. page 237 hp enterprise technical symposium filename.ppt Setup Steps for Creating a Disaster-Tolerant Cluster Place the AGEN$INCLUDE_PARAMS lines at the beginning of the MODPARAMS.DAT file in each system root. The last definition of a given parameter value found by AUTOGEN is the one it uses, so by placing the “include” files in order from cluster-common to site-specific to node-specific, you can if necessary override the cluster-wide and/or site-wide settings on a given node simply by putting the desired parameter settings at the end of that node’s MODPARAMS.DAT file. This may be needed, for example, if you install and are testing a new version of OpenVMS on that node, and the new version requires some new SYSGEN parameter settings that don’t yet apply to the rest of the nodes in the cluster.
(Of course, an even more elegant way to handle this particular case would be to create a MODPARAMS_VERSION_xx.DAT file in the common area and include that file on any nodes running the new version of the operating system. Once all nodes have been upgraded to the new version, these parameter settings can be moved to the cluster-common MODPARAMS file.)
238. page 238 hp enterprise technical symposium filename.ppt Setup Steps for Creating a Disaster-Tolerant Cluster Create startup command procedures to mount cross-site shadowsets
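A hedged sketch of such a procedure for one cross-site shadowset (device names, label, and logical name are hypothetical):

$ ! CLUSTER_COMMON:MOUNT_SHADOWSETS.COM -- invoked from the cluster-common startup
$ MOUNT/SYSTEM/NOASSIST DSA101: /SHADOW=($1$DGA101:,$2$DGA101:) DATA1 DATA1_DISK
$ IF .NOT. $STATUS THEN WRITE SYS$OUTPUT "Mount of DSA101: failed -- check shadowset members"
$ EXIT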
239. page 239 hp enterprise technical symposium filename.ppt Recent Disaster-Tolerant Cluster Developments Fibre Channel storage
Data Replication Manager
240. page 240 hp enterprise technical symposium filename.ppt Fibre Channel and SCSI in Clusters Fibre Channel and SCSI are Storage-Only Interconnects
Provide access to storage devices and controllers
Storage can be shared between several nodes
SCSI Bus or SCSI Hub
FC Switch
Each node can access the storage directly
241. page 241 hp enterprise technical symposium filename.ppt Fibre Channel and SCSI in Clusters Fibre Channel and SCSI are Storage-Only Interconnects
Cannot carry SCS protocol
e.g. Connection Manager and Lock Manager traffic
Need SCS-capable Cluster Interconnect also
Memory Channel, CI, DSSI, FDDI, Ethernet, or Galaxy Shared Memory C.I. (SMCI)
242. page 242 hp enterprise technical symposium filename.ppt SCSI and FC Notes Fail-over between a direct path and an MSCP-served path is first supported in OpenVMS version 7.3-1
243. page 243 hp enterprise technical symposium filename.ppt Direct vs. MSCP-Served Paths
244. page 244 hp enterprise technical symposium filename.ppt Direct vs. MSCP-Served Paths
245. page 245 hp enterprise technical symposium filename.ppt Direct vs. MSCP-Served Paths
246. page 246 hp enterprise technical symposium filename.ppt Direct vs. MSCP-Served Paths
247. page 247 hp enterprise technical symposium filename.ppt Fibre Channel Notes After FC switch, cable, or disk virtual unit reconfiguration, the documentation recommends you do:
SYSMAN> IO SCSI_PATH_VERIFY
and
SYSMAN> IO AUTOCONFIGURE
This is unlike CI or DSSI, where path changes are detected and new devices appear automatically
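A hedged sketch of that sequence, run from one node (the SET ENVIRONMENT step is an assumption about how you would choose to scope it clusterwide):

$ RUN SYS$SYSTEM:SYSMAN
SYSMAN> SET ENVIRONMENT/CLUSTER
SYSMAN> IO SCSI_PATH_VERIFY
SYSMAN> IO AUTOCONFIGURE
SYSMAN> EXIT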
248. page 248 hp enterprise technical symposium filename.ppt Fibre Channel Notes Volume Shadowing Mini-Merges are not supported for SCSI or Fibre Channel
Full Merge will occur after any node crashes while it has a shadowset mounted
HSG80-specific Mini-Merge project was cancelled; general Host-Based Mini-Merge capability is now under development
249. page 249 hp enterprise technical symposium filename.ppt New Failure Scenario:SCS link OK but FC link broken
250. page 250 hp enterprise technical symposium filename.ppt New Failure Scenario:SCS link OK but FC link broken With inter-site FC link broken, Volume Shadowing doesn’t know which member to remove
Node on left wants to remove right-hand disk, and vice-versa
New DCL commands allow you to force removal of one member manually
See white paper on “Fibre Channel in a Disaster-Tolerant OpenVMS Cluster System” at http://h71000.www7.hp.com/openvms/fibre/fc_hbvs_dtc_wp.pdf
Assumes you are logged in already, or SYSUAF, etc. aren’t on a cross-site shadowed disk
Not easy to use in practice, in my experience
Problem solved by direct-to-MSCP-served failover in 7.3-1
251. page 251 hp enterprise technical symposium filename.ppt Cross-site Shadowed System Disk With only an SCS link between sites, it was impractical to have a shadowed system disk and boot nodes from it at multiple sites
With a Fibre Channel inter-site link, it becomes possible to do this, but it is probably still not a good idea (single point of failure for the cluster)
252. page 252 hp enterprise technical symposium filename.ppt Data Replication Manager
253. page 253 hp enterprise technical symposium filename.ppt Data Replication Manager Multi-site mirroring between HSG80 controllers
Host writes to one HSG80
That HSG80 sends data to the HSG80 at the opposite site
254. page 254 hp enterprise technical symposium filename.ppt Data Replication Manager
255. page 255 hp enterprise technical symposium filename.ppt Data Replication Manager Since the HSG80 controller coordinates writes to a mirrorset:
The mirrorset is visible on, and can only be accessed through, one HSG80 at a time
256. page 256 hp enterprise technical symposium filename.ppt Data Replication Manager Nodes at opposite site from controller in charge of mirrorset will have to do remote I/Os, both for writes and reads
Storage Engineering plans the ability to access data simultaneously from both sites in a future “DRM II” implementation
257. page 257 hp enterprise technical symposium filename.ppt Data Replication Manager
258. page 258 hp enterprise technical symposium filename.ppt Data Replication Manager
259. page 259 hp enterprise technical symposium filename.ppt Data Replication Manager Because data can be accessed only through one controller at a time, failover is messy and involves manual or, at best, scripted changes
260. page 260 hp enterprise technical symposium filename.ppt Achieving Very-High Availability In mission-critical environments, where OpenVMS Disaster-Tolerant Clusters are typically installed, there is usually a completely different mindset than at less-critical sites
Extra effort is typically taken to ensure very-high availability
This effort occurs across a broad range of areas
261. page 261 hp enterprise technical symposium filename.ppt Achieving Extremely High Availability Configure “extra” redundancy
2N redundancy instead of N+1, and 3N instead of 2N
i.e. shadowing instead of RAID-5
Layer redundancy when possible
e.g. shadowsets of mirrorsets
Recognize all potential single points of failure
e.g. dual-redundant controller pair tightly coupled
e.g. non-mirrored cache memory in controller
262. page 262 hp enterprise technical symposium filename.ppt Achieving Extremely High Availability Monitor closely
Fix broken hardware quickly, because a failed component reduces redundancy and a subsequent hardware failure could then cause an outage
Quicker mean-time-to-repair (MTTR) means better mean-time-between-failures (MTBF) in practice
263. page 263 hp enterprise technical symposium filename.ppt Achieving Extremely High Availability Configure reserve capacity
Avoid saturation, recovery scenarios, and infrequently-used code paths
264. page 264 hp enterprise technical symposium filename.ppt Achieving Extremely High Availability Avoid using very-new OpenVMS versions, ECO patch kits, firmware releases, and hardware products
Let new OpenVMS releases, ECO kits, and products “age”
This lets other sites find the bugs first, and helps avoid installing an ECO kit only to find shortly that it has been put on “Engineering Hold” due to a problem
265. page 265 hp enterprise technical symposium filename.ppt Achieving Extremely High Availability Allow fail-over mechanisms the freedom to work
Leave options open for fail-over in case they are needed:
e.g. Rather than permanently forcing preferred paths at the HSJ/HSD controller level, give OpenVMS the freedom to select the path, in anticipation of future failure conditions
e.g. Consider leaving 10-megabit or Fast Ethernet LAN paths enabled for SCS traffic even though Gigabit Ethernet (or FDDI) is present and a larger packet size is in use; use SCACP to set them to a lower priority if desired
e.g. Load MSCP server with MSCP_LOAD=2 even on a node which you normally don’t desire to take MSCP serving load
266. page 266 hp enterprise technical symposium filename.ppt Achieving Extremely High Availability Give time for hardware to mature
The second or third revisions of a product often have higher performance and fewer problems than the very first version (and often cost less)
However, one aspect of “availability” is acceptable performance, so
You may have to trade off hardware / software maturity in favor of higher performance to handle high workload growth rates
DE500 is in 3rd generation now, and has been more stable than the newer DE60x adapter and its driver
2nd-generation DEGXA handles small packets at higher rates than the 1st-generation DEGPA Gigabit Ethernet LAN adapter
267. page 267 hp enterprise technical symposium filename.ppt Achieving Extremely High Availability Test new code and hardware (and any changes) first in a test environment separate from the production environment
Ideally, have a full-scale test environment which exactly duplicates the production environment, to allow testing under full (simulated) loading
Introduce any changes into production cautiously, one node at a time
268. page 268 hp enterprise technical symposium filename.ppt Achieving Extremely High Availability Introduce new technology using “generational diversity”
Instead of using all new technology, use a mix of old (proven) and new (fast, but unproven) technology, backing each other up in redundant configurations
269. page 269 hp enterprise technical symposium filename.ppt Achieving Extremely High Availability Consider using multiple clusters
One approach:
Partition data among clusters
Provide in-bound user routing across clusters to where their data resides
Allow user data to be migrated between clusters on-the-fly, transparent to the user
Failure of any single cluster affects only a fraction of the users
Partial failure of any cluster can be mitigated by migrating its users to another cluster
Cleanest approach to multiple clusters:
Design with RTR in mind from the beginning
270. page 270 hp enterprise technical symposium filename.ppt Achieving Extremely High Availability Do cluster configuration and application design or re-design with availability in mind:
On-line backups
Disk defragmentation
Indexed file reorganization
271. page 271 hp enterprise technical symposium filename.ppt Business Continuity Although we’ve been talking about tolerating disasters in the IT area, true ability to survive a disaster involves more than using a disaster-tolerant cluster in the IT department
The goal of Business Continuity is the ability for the entire business, not just IT, to continue operating despite a disaster
272. page 272 hp enterprise technical symposium filename.ppt Business Continuity:Not just IT Not just computers and data:
People
Facilities
Communications
Networks
Telecommunications
Transportation
273. page 273 hp enterprise technical symposium filename.ppt Business Continuity Resources Disaster Recovery Journal:
http://www.drj.com
Contingency Planning & Management Magazine
http://www.contingencyplanning.com
Both are free to qualified subscribers
Both hold conferences as well as publishing high-quality journals
274. page 274 hp enterprise technical symposium filename.ppt Real-Life Examples Credit Lyonnais fire in Paris, May 1996
Data replication to a remote site saved the data
The fire occurred over a weekend, and the DR site plus quick procurement of replacement hardware allowed the bank to reopen on Monday
275. page 275 hp enterprise technical symposium filename.ppt Real-Life Examples:Online Stock Brokerage 02:00 h on 29 December, 1999, an active stock market trading day
A UPS audio alert alarmed a security guard on his first day on the job; he pressed the emergency power-off switch, taking down the entire datacenter
276. page 276 hp enterprise technical symposium filename.ppt Real-Life Examples:Online Stock Brokerage Disaster-tolerant cluster continued to run at opposite site; no disruption
Ran through that trading day on one site alone
Re-synchronized data in the evening after trading hours
Procured replacement security guard by the next day
277. page 277 hp enterprise technical symposium filename.ppt Real-Life Examples: Commerzbank on 9/11 Datacenter near WTC towers
Generators took over after power failure, but dust & debris eventually caused A/C units to fail
Data replicated to remote site 50 kilometers away
One server continued to run despite 40° C temperatures, running off the copy of the data at the opposite site after the local disk drives had succumbed to the heat
See http://h71000.www7.hp.com/openvms/brochures/commerzbank/
278. page 278 hp enterprise technical symposium filename.ppt Real-Life Examples: Online Brokerage Dual inter-site links
From completely different vendors
Both vendors sub-contracted to the same local RBOC for local connections at both sites
Result: One simultaneous failure of both links within 4 years’ time
279. page 279 hp enterprise technical symposium filename.ppt Real-Life Examples: Online Brokerage Dual inter-site links from different vendors
Both used fiber optic cables across the same highway bridge
El Niño caused a flood which washed out the bridge
Vendors’ SONET rings wrapped around the failure, but latency skyrocketed and cluster performance suffered
280. page 280 hp enterprise technical symposium filename.ppt Real-Life Examples: Online Brokerage Vendor provided redundant storage controller hardware
Despite redundancy, a controller pair failed, preventing access to the data behind the controllers
Host-based volume shadowing was in use, and the cluster continued to run using the copy of the data at the opposite site
281. page 281 hp enterprise technical symposium filename.ppt Real-Life Examples: Online Brokerage Dual inter-site links from different vendors
Both vendors’ links did fail sometimes
Redundancy and automatic failover masks failures
Monitoring is crucial
One outage lasted 6 days before discovery
282. page 282 hp enterprise technical symposium filename.ppt Speaker Contact Info:
Keith Parris
E-mail: parris@encompasserve.org
or keithparris@yahoo.com
or Keith.Parris@hp.com
Web: http://encompasserve.org/~parris/
and http://www.geocities.com/keithparris/