
Presentation Transcript


  1. Case Studies of Disaster Tolerance and Disaster Recovery with OpenVMS Keith Parris, HP

  2. Case Study: 3,000-mile Cluster

  3. Case Study: 3,000-mile Cluster • Unidentified OpenVMS customer • 3,000-mile site separation distance • Disaster-tolerant OpenVMS Cluster • Originally VAX-based and has been running for many years now, so performance is presumably acceptable

  4. Case Study: Global Cluster

  5. Case Study: Global Cluster • Internal HP test cluster • IP network as Cluster Interconnect • Sites in India, USA, Germany, Australia, etc. at various times • Inter-site distance (India-to-USA) about 8,000 miles • Round-trip latency of about 350 milliseconds • Estimated circuit path length about 22,000 miles
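
The latency figures above are roughly what speed-of-light propagation in fibre predicts. Below is a back-of-envelope sketch (not from the original slides), assuming signals travel through fibre at about two-thirds the speed of light, roughly 200,000 km/s, and ignoring equipment delays:

```python
# Rough round-trip latency from circuit path length, assuming propagation in
# fibre at about 200,000 km/s (~2/3 of c) and ignoring switching delays.
KM_PER_MILE = 1.609

def estimated_rtt_ms(circuit_miles: float, fibre_km_per_s: float = 200_000.0) -> float:
    one_way_s = circuit_miles * KM_PER_MILE / fibre_km_per_s
    return 2.0 * one_way_s * 1000.0

print(round(estimated_rtt_ms(22_000)))   # ~354 ms, close to the ~350 ms measured India-to-USA
print(round(estimated_rtt_ms(800)))      # ~13 ms, the proposed 600-mile cluster's circuit (later slides)
print(round(estimated_rtt_ms(50), 1))    # ~0.8 ms, the 20-mile DT cluster's circuit (later slides)
```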

  6. Global IPCI Cluster

  7. Global IPCI Cluster [Diagram: cluster members I64G05, NORSP (an HPVM guest), NODEG, TASHA, and I64MOZ at HP facilities in Germany, the USA, India (including Bangalore), and Australia, interconnected across the corporate network / Internet through L2/L3 switches (PERK, HUFFLE)]

  8. Case Study: 1,400-mile Volume Shadowing

  9. Case Study: 1,400-mile Shadowing • Healthcare customer • 1,400 mile site separation distance • 1,875 mile circuit length; 23-millisecond round-trip latency • Remote Vaulting (not clustering) across distance • Independent OpenVMS Clusters at each site (not cross-site) • Synchronous Replication using Volume Shadowing • SAN Extension over OC-3 links with Cisco MDS 9000 series FCIP boxes • Writes take 1 round trip (23 millisecond write latency)
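
One way to see what a 23-millisecond synchronous write costs (an illustrative calculation, not from the slides): a write that must be acknowledged by the remote member cannot complete in less than one round trip, so a single serialized write stream is capped at roughly 43 writes per second, and overall write throughput then depends on how many writes the application keeps in flight concurrently.

```python
# Illustrative only: synchronous remote writes are bounded by the inter-site RTT.
rtt_ms = 23.0                    # round-trip latency from the slide
serial_cap = 1000.0 / rtt_ms     # max writes/second for one serialized write stream
print(round(serial_cap))         # ~43
print(round(serial_cap * 8))     # ~348 with 8 concurrent write streams (assumed concurrency)
```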

  10. Case Study: 1,400-mile Shadowing • Did things right: tested first with the inter-site distance simulated by a network emulator box set to 30 milliseconds of latency; the 23-millisecond latency of the real-life configuration turned out to be even better • Caché database appears to be tolerant of high-latency writes and does not let those slow writes block reads

  11. Case Study: 1,400-mile Shadowing • Later replaced Host-Based Volume Shadowing over the 1,400-mile distance with Caché Shadowing asynchronous replication for DR purposes • Trade-off: Better performance but risk some transaction data loss in conjunction with a disaster • Still use HBVS within each site • Considering short-distance multi-site DT clusters at each end

  12. Case Study: Proposed 600-mile Disaster-Tolerant Cluster

  13. Case Study: Proposed 600-mile Cluster • Existing OpenVMS DT cluster with 1-mile distance • One of two existing datacenters is to be closed • Proposed moving one-half of the cluster to an existing datacenter 600 miles away • Round-trip latency 13 milliseconds • Estimated circuit path length about 800 miles

  14. Case Study: Proposed 600-mile Cluster • Month-end processing time is one of the most performance-critical tasks • Tested in OpenVMS Customer Lab using D4 • Performance impact too high: • Performance at 600 miles was unacceptable • From results of tests at shorter distances we were able to extrapolate that performance would be acceptable if the circuit path length were 150-200 miles or less
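
For scale, using the same fibre-propagation assumption as the earlier sketch (about 200,000 km/s; these numbers are illustrative estimates, not figures from the lab tests): a 150-200 mile circuit corresponds to roughly 2.4-3.2 milliseconds of round-trip latency, versus the roughly 13 milliseconds measured over the 800-mile circuit.

```python
# Estimated RTT for the extrapolated acceptable circuit lengths vs. the proposed one.
for miles in (150, 200, 800):
    rtt_ms = 2 * miles * 1.609 / 200_000 * 1000
    print(miles, round(rtt_ms, 1))   # 150 -> 2.4 ms, 200 -> 3.2 ms, 800 -> 12.9 ms
```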

  15. Case Study: Proposed 600-mile Cluster • Customer found a co-location site 30 miles away • Systems were moved from the datacenter being closed to the co-lo site • Network connections are via DWDM with dual 1-Gigabit Ethernet and dual Fibre Channel SAN links • Systems are clustered with the original site, with Volume Shadowing for data replication • This move was accomplished without any application downtime • Asynchronous replication (e.g. Continuous Access) to the site 600 miles away can be used for DR protection

  16. Case Study: New 20-mile Disaster-Tolerant Cluster

  17. Case Study: 20-mile DT Cluster • Existing OpenVMS Cluster • Needed protection against disasters • Implemented DT cluster to site 20 miles away • 0.8 millisecond round-trip latency • Estimated circuit path length about 50 miles

  18. Case Study: 20-mile DT Cluster: Pre-existing Performance Issues • Performance of night-time batch jobs had been problematic in the past • CPU saturation, disk fragmentation, directory files of 3K-5K blocks in size, and need for database optimization were potential factors

  19. Case Study: 20-mile DT Cluster: Post-implementation Performance Issues • After implementing the DT cluster (shadowing between sites), overnight batch jobs took hours longer to complete • Slower write latencies were identified as the major new factor • The pre-existing factors were still uncorrected • With default Read_Cost values, the customer was getting all the detriment of Volume Shadowing for writes but none of its potential benefit for reads

  20. Case Study: 20-mile DT Cluster: Write Latencies • MSCP-serving was used for access to disks at the remote site. Theory predicts writes take 2 round trips. • Write latency to local disk measured at 0.4 milliseconds • Write latency to remote disks calculated as: • 0.4 + ( twice 0.8 millisecond round-trip time ) = 2.0 milliseconds • Factor of 5X slower write latency

  21. Case Study: 20-mile DT Cluster: Write Latencies • FCIP-based SAN Extension with Cisco Write Acceleration or Brocade FastWrite would allow writes in one round trip instead of two • Write latency to remote disks calculated as: • 0.4 + ( once 0.8 millisecond round-trip time ) = 1.2 milliseconds • Factor of 3X slower write latency instead of 5X
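
The 2.0 ms and 1.2 ms figures on the two slides above come from the same simple model, sketched below (the 0.4 ms local service time and 0.8 ms round trip are the measured values quoted above; the model ignores controller and protocol overheads):

```python
# Remote write latency = local disk service time + (round trips required) x inter-site RTT.
def remote_write_ms(local_ms: float, rtt_ms: float, round_trips: int) -> float:
    return local_ms + round_trips * rtt_ms

LOCAL_MS, RTT_MS = 0.4, 0.8                        # measured values from the slides

mscp = remote_write_ms(LOCAL_MS, RTT_MS, 2)        # MSCP serving: 2 round trips
fcip = remote_write_ms(LOCAL_MS, RTT_MS, 1)        # FCIP write acceleration: 1 round trip
print(round(mscp, 1), round(mscp / LOCAL_MS, 1))   # 2.0 ms, 5.0x the local write latency
print(round(fcip, 1), round(fcip / LOCAL_MS, 1))   # 1.2 ms, 3.0x the local write latency
```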

  22. Case Study: 20-mile DT Cluster: Read Latencies • Shadowset member disks contain identical data, so Shadowing can read from any member disk • When selecting a shadowset member disk for a read, Volume Shadowing adds the local queue length to the Read_Cost value and selects the disk with the lowest total to send the read to

  23. Case Study: 20-mile DT Cluster: Read Latencies • Default OpenVMS Read_Cost values: • Local Fibre Channel disks = 2 • MSCP-served disks = 501 • Difference of 499 • Queue length at local site would have to reach 499 before sending any reads across to the remote site

  24. Case Study: 20-mile DT Cluster: Read Latencies • Question: How should Read_Cost values be set for optimal read performance with Volume Shadowing between two relatively-nearby sites? • Example: 50-mile circuit path length, 0.8 millisecond round-trip latency, average local read latency measured at 0.4 milliseconds • MSCP-served reads can be done in one round trip between sites instead of the two required for writes

  25. Case Study: 20-mile DT Cluster: Read Latencies • Read latency to remote disks calculated as: • 0.4 + ( one 0.8 millisecond round-trip time for MSCP-served reads ) = 1.2 milliseconds • 1.2 milliseconds divided by 0.4 milliseconds is 3 (3X worse performance for remote reads) • At a local queue length of 3 the local response time roughly equals that of a remote read • So at a local queue depth of 4 or more it could be beneficial to start sending some of the reads across to the remote site • Therefore, a difference in Read_Cost values of around 4 might work well
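
A minimal sketch of the selection rule described on these slides (illustrative only; the real Volume Shadowing algorithm considers more than this, but "Read_Cost plus current queue length, lowest total wins" is enough to show why a Read_Cost difference of about 4 spreads reads, while the default 2-versus-501 effectively never does):

```python
# Choose the shadow set member with the lowest (read_cost + current queue length).
def pick_member(members: dict) -> str:
    """members maps member name -> (read_cost, queue_length)."""
    return min(members, key=lambda m: members[m][0] + members[m][1])

# Default Read_Cost values: local Fibre Channel member = 2, MSCP-served member = 501.
# The local queue would have to reach 499 before the remote member could ever win.
print(pick_member({"local": (2, 10), "remote": (501, 0)}))   # -> local

# With a Read_Cost difference of about 4 (say local = 2, remote = 6), reads start
# spilling to the remote member once the local queue grows deeper than about 4.
print(pick_member({"local": (2, 5), "remote": (6, 0)}))      # -> remote
```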

  26. Case Study: 20-mile DT Cluster: Workaround for High Write Latencies • Workaround: remote shadowset members are removed each evening to get acceptable performance overnight, and put back in with Mini-Copy operations each morning • Recovery after a failure of the main site would include re-running the night-time work from the copy of data at the remote site • Business requirements in terms of RPO and RTO happen to be lenient enough to permit this strategy

  27. Case Study: New 3-site Disaster-Tolerant Cluster

  28. Case Study: 3-site Cluster: Network configuration • Major new OpenVMS customer, moving from HP-UX to OpenVMS • Dual DWDM links between sites • Gigabit Ethernet for SCS (cluster) traffic • Fibre Channel for storage • EVA storage shadowed between 2 sites • Quorum node at 3rd site
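
The quorum node at the third site matters because of the cluster quorum arithmetic. Below is a hedged sketch assuming one vote per node (the customer's actual VOTES settings are not given in the slides) and the usual OpenVMS quorum formula, quorum = (EXPECTED_VOTES + 2) divided by 2, truncated:

```python
# Illustrative quorum arithmetic for a 2 + 2 + 1 vote layout: two voting nodes at
# each main site plus a quorum node at the third site (assumed one vote each).
def quorum(expected_votes: int) -> int:
    return (expected_votes + 2) // 2    # OpenVMS quorum formula (integer division)

site_a, site_b, quorum_site = 2, 2, 1
expected = site_a + site_b + quorum_site    # 5 votes total
q = quorum(expected)                        # quorum = 3

print(q)                                    # 3
print(site_a + quorum_site >= q)            # True: either main site plus the quorum node keeps running
print(site_a >= q)                          # False: a main site cut off from the quorum node halts
```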

  29. Case Study: 3-site Cluster: Network configuration [Diagram: nodes on the LAN at Site A and nodes on the LAN at Site B, the two sites linked by dual DWDM connections, with a quorum node at a third site]

  30. Case Study: 3-site Cluster: Network configuration [Same diagram as slide 29]

  31. Case Study: 3-site Cluster: Shadowing configuration [Diagram: nodes at Site A and Site B with their SHADOW_SITE_ID values, each site's SAN and EVA array extended to the other site over dual DWDM links; device site values are assigned with $SET DEVICE/SITE=1 at Site A and $SET DEVICE/SITE=2 at Site B]

  32. Case Study: 3-site Cluster: Shadowing configuration [Same diagram as slide 31]

  33. Case Study: 3-site Cluster: Shadowing configuration [Same diagram, with read data paths overlaid]

  34. Case Study: 3-site Cluster: Shadowing configuration [Same diagram, with additional read data paths overlaid]

  35. Case Study: Proposed Multiple Disaster-Tolerant Clusters using HPVM

  36. Case Study: Proposed DT Clusters using HPVM • Educational customer, state-wide network • OpenVMS systems at 29 remote sites • Proposed using HPVM on Blade hardware and storage at a central site to provide a 2nd site and form disaster-tolerant clusters for all 29 remote sites simultaneously

  37. Case Study: Proposed DT Clusters using HPVM • Most of the time only Volume Shadowing would be done to central site • Upon failure of any of the 29 sites, the OpenVMS node/instance at the central site would take over processing for that site

  38. Questions?

  39. Speaker contact info: E-mail: Keith.Parris@hp.com or keithparris@yahoo.com Web: http://www2.openvms.org/kparris/
