Explore the importance of testing disaster recovery strategies, the frequency of updates, and the challenges faced in implementing robust testing practices to ensure business continuity. Learn how to overcome common obstacles and improve your DR testing efficiency.
DR Testing Programs: How Useful Are They? Presented by Jon William Toigo, CEO, Toigo Partners International; Chairman, Data Management Institute
When Last We Spoke, I Argued… • Testing changes a plan into a capability that can adapt to change • Business changes: recovery priorities and objectives must adapt • Technology changes: restoration strategies must adapt • Strategy changes: teams must adapt • Yet, only half of the companies with plans actually test them… and the majority do so rarely
Question • How frequently do you test your recovery strategy? • Don’t know • Never • Annually • Twice annually • More than twice per year • What recovery strategy? (We are just starting the planning process, I’m new to the job, etc.)
Testing Frequency Among The “DR Savvy” • Survey of 250 DR planners conducted by Forrester/DRJ, October 2008 • One would think that DR-savvy folks would get it…
Question 2 • How often do you update your plan? • Don’t know/Can’t recall • Can’t remember the last time • Every two years • Once a year • Twice a year • Quarterly • Continuously • Other
How Often Do We Update Plans? • 58 percent claim to update more than once per year -- less than half do so “continuously”… • Survey of 250 DR planners conducted by Forrester/DRJ, October 2008
Those Are The Savvy, But Are They The Norm? • Hot off the wire: AT&T survey of 100 Chicago firms with revenues of $10 million or more… “The survey showed 75% of IT executives consider business-continuity planning to be a priority, compared with 60% in 2007. Eighty-one percent of those surveyed said their companies have a business-continuity plan, up from 76% a year ago.” “But only 43% have fully tested their plans within the last 12 months, an improvement over the 37% that did so in 2007. Almost one-fifth -- 12% -- admitted they have never tested their business-continuity plans, up slightly from 10% in 2007. Businesses were much more likely to update their plans -- 54% had done so within the last year.”
Why? • Big reason: DR is currently “bolted on” rather than “built in” -- increasing the complexity and expense of testing • Were it the other way around: • Testing would be an integral part of: • Application development • Tech procurement and deployment • Systems maintenance • Corporate change management • Automated DR strategy monitoring and management could reduce the need for a stand-alone testing regime • Live tests would become simply periodic opportunities for continuity team training
Bolt-On Recovery Strategies Complicate Testing Processes • Traditional approach with four phases • Develop annual test strategy • Develop objectives-driven test plans for each discrete test event • Execute test event • Assess results and manage change [Diagram: traditional DR testing cycle]
Multiplicity Of Techniques Required For Defense In Depth • Necessitating a “divide and conquer” approach • Test regime organized by process or system • Each recovery strategy must be tested fully over the course of the year • But the approach to testing must be non-linear to prevent failed task tests from compromising the overall testing program • And you need to leave room for re-testing “abended” tasks at a later date [Diagram: test task list for Q1 test event]
Multiplicity Of Techniques Required For Defense In Depth (Continued) • Careful coordination and management are key to avoid looking like an idiot (there are no failed tests except those that were improperly planned) • In short, testing is hard work… (a minimal sketch of such a task list follows)
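As a rough illustration of that non-linear, divide-and-conquer test regime, here is a minimal sketch in Python (task names and the execute() hook are hypothetical, for illustration only): a quarterly task list in which an abended task is parked for later re-test instead of blocking the remaining, independent tasks.

from dataclasses import dataclass
from typing import Callable, List, Optional

@dataclass
class TestTask:
    name: str                      # e.g., "Restore payroll DB from mirror"
    system: str                    # process or system the task validates
    passed: Optional[bool] = None  # None until the task has been executed

def run_test_event(tasks: List[TestTask],
                   execute: Callable[[TestTask], bool]) -> List[TestTask]:
    """Run every task independently; collect abended tasks for later re-test."""
    retest_queue = []
    for task in tasks:
        task.passed = execute(task)      # execute() is supplied per test event
        if not task.passed:
            retest_queue.append(task)    # park it; do not block other tasks
    return retest_queue

# Hypothetical Q1 task list
q1_tasks = [
    TestTask("Restore payroll DB from mirror", system="payroll"),
    TestTask("Fail over email to shadow host", system="email"),
    TestTask("Rebuild web tier from images", system="web"),
]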
Add To The Litany Of Excuses • Other reasons, based on anecdotal evidence… • Lack of management support • Lack of resources, time and budget • Other operational priorities • Misapprehension of testing methods or skill requirements • Fear of results
Management May Have A Memory Gap • Why was a DRP/BCP project originally authorized? • For an audit checkmark? • In the wake of a disaster? • Because the company could afford it? • Perhaps a business value case was never made, or maybe it requires a refresh…
Do More With Less • Has the job of plan maintenance and testing gotten bigger, while available resources have gotten smaller? • As the scope of planning grows to encompass more and more business processes, the testing and maintenance task list increases: Has your staff? • Is DR “an additional responsibility” tossed on top of your regular workload?
Have Priorities Changed? • Perhaps other priorities have come to the fore… • “Recessionary pressures” • Defer CAPEX • Trim OPEX • Renewed focus on growing revenues • Change of perspective about continuity -- “a capability that in the best of circumstances never needs to be used”
Maybe Some Left-Over Angst From High School • You’d be surprised how many people fear the idea of testing… • Failed tests are not test failures: Identifying gaps is the point • “Real tests” do not require surprise and disruption • Test results are not a reflection of your competence
Maybe A Bit Of Backsliding? • It can happen once the plan project is complete and maintenance mode commences • Many causes • New technologies/services that claim to “change the rules” -- challenging your carefully crafted strategies • Threat-du-jour mentality (avian flu, for example) has an enthusiasm-numbing effect • Need to convince each new manager with budget oversight [Diagram: feedback loop between progress and backsliding]
Bottom Line • Testing isn’t being done, or is undertaken so infrequently as to yield limited value to the organization • Predictable Impact • Continuity capability falls out of step with the needs of the organization • Hard work invested in developing a continuity plan goes to waste • Meanwhile, your next disaster may be just a nickel away…
True Story • Rush to market for high-capacity drives • Saving five cents on a vibration sensor • The vibration sensor fails, S.M.A.R.T. warns and retries its self-tests, the drive vibrates, overheats, then fails • This occurs with every drive in the manufacturing run, in every system and array where installed • Failures are sequential, exhausting parity protection faster than the arrays can rebuild • RAID 5…and 6 jeopardized • It spoils your day
Is Traditional Testing Sufficient? • A valid question: Here are a few more… • Is traditional testing too expensive, time-consuming and resource-intensive to keep strategies synchronized with business/technology change? • Does non-linearity undermine “soup-to-nuts” validation of the recovery strategy? • Does the careful planning of test-day logistics (to ensure a worthwhile test regimen) cause us to miss subtle issues (problems in untested data replication processes, for example) that could wreak havoc in an actual disaster? • Do re-testing requirements overtake the change management process?
What Do We Test Today? • Typically, we focus tests on… • Ensuring we have replicated the right data and that it is recoverable within Time to Data standards set in plan strategy objectives • Ensuring that we have built a shadow infrastructure from which we can operate at minimum acceptable service levels in an emergency • Ensuring that the procedures for initializing and operating from the shadow infrastructure are correct and that folks know their jobs
What Do We Test Today? (Continued) The results of tests update the test plan (re-test requirements) and the DR strategy itself Given the broad range of business processes to protect in larger firms, that is a lot of work to perform within a limited amount of time!
Could We Divvy Up The Workload Better? • What if we separate out some of the tasks? • Traditional Testing: What are the logistics for recovery? Are all resources accounted for? Do we have the right personnel? Do they know what they need to do and have the skills to do it? • Infrastructure Management Processes: How are applications and hosting platforms changing? What elements have been added, modified or deleted? Is the “shadow infrastructure” synchronized with the production environment? • Data Management Processes: What data needs to be protected? Where is it? What replication services are provided? Are they consistent with data volume growth and time-to-data requirements?
Enter The “Aggregators” • Driven by the practical operational need to consolidate monitoring and reporting on an increasing number of data replication/protection processes… • Everyone wants in on the data protection game • Application software vendors adding their own backup functions • Database vendors adding CDP and backup • Virtualization vendors offering backups of what the VM sees • Hardware vendors adding mirroring and CDP as “value add” features • Not to mention traditional backup and mirroring ISVs
Enter The “Aggregators” (Continued) Chances are good that most companies have a combination of replication processes in play: Monitoring each one individually is a painful and time-consuming task
Latest Generation • Continuity Software RecoverGuard™ • Maps data replication processes back to host, VM, application, or business process • Provides “dashboard” display of current replication status • With some scripting, compares mirrored data to the original to verify the mirror • Cool feature: compares the volume of data being backed up or mirrored against the data restoration strategy and signals if the RTO threshold would be exceeded (a minimal sketch of that kind of check follows)
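A minimal sketch, in Python, of the kind of check that last feature implies, assuming a simple model in which estimated restore time is data volume divided by restore throughput (the function names, numbers, and the model itself are illustrative assumptions, not RecoverGuard's actual logic):

def check_rto(data_sets, rto_hours):
    """data_sets: iterable of (name, data_gb, restore_rate_gb_per_hour) tuples.
    Flag any data set whose estimated restore time exceeds the RTO threshold."""
    alerts = []
    for name, data_gb, rate in data_sets:
        estimated_hours = data_gb / rate
        if estimated_hours > rto_hours:
            alerts.append(f"{name}: estimated restore {estimated_hours:.1f} h "
                          f"exceeds RTO of {rto_hours} h")
    return alerts

# Hypothetical example: a 2 TB data set restoring at 300 GB/h against a 4-hour RTO
print(check_rto([("finance-db", 2000, 300)], rto_hours=4))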
Key Insights From Continuity Software • “Customers recognize that the environment changes daily but DR isn’t updated daily: they are vulnerable between tests” • “Three major coverage gaps” • Replication device management: data state mismatches • Platform management: LUN configuration on hosts in production and shadow environments • Databases: Configurations on file systems presented by volumes
Key Insights From Continuity Software (Continued) • Method of operation: discovery of devices, RAID schemes, LUN sizes, and relationships between local copies and DR consistency groups • Replication gaps are typical: “We find them in every environment” • Often finds “unauthorized hosts” in the DR environment that have access to volumes they shouldn’t • Generates graphical maps of topology for troubleshooting mismatches (a sketch of a simple LUN-map comparison follows)
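To illustrate the “platform management” gap (LUN configuration mismatches between production and shadow hosts), here is a minimal sketch of the kind of comparison an aggregator automates; the host data and LUN IDs are entirely hypothetical:

def compare_lun_maps(prod_luns, dr_luns):
    """Each argument maps lun_id -> size_gb for one host.
    Report LUNs that are missing from, or sized differently on, the DR host."""
    gaps = []
    for lun_id, size_gb in prod_luns.items():
        if lun_id not in dr_luns:
            gaps.append(f"LUN {lun_id} is not presented to the DR host")
        elif dr_luns[lun_id] != size_gb:
            gaps.append(f"LUN {lun_id} size mismatch: {size_gb} GB in production, "
                        f"{dr_luns[lun_id]} GB at the DR site")
    return gaps

# Hypothetical example: a 750 GB LUN added in production but never mapped at the DR site
print(compare_lun_maps({"0001": 500, "0002": 750}, {"0001": 500}))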
Limitations And Benefits • Visibility into hardware: Some hardware vendors limit monitors • Same problem that confronts SRM vendors: DR aggregators need to beg, buy or do end runs around visibility barriers erected by hardware purveyors • Some aggregators do not monitor tape backup to any degree of granularity • Database “hot backup” modes may interfere with monitoring accuracy • Infrastructure scanning and “training” of reports to eliminate “known issues” can take time
Limitations And Benefits (Continued) “Especially useful for monitoring apps that can’t be taken down for DR testing” “Provides greater confidence in recoverability than tabletop test exercises” “Deliverable as a service with a lot of engineering hand-holding”
Ultimately, DR Aggregators Deliver Value as… • A means to monitor data replication processes for solvency and storage-to-server configurations for undocumented changes • They also address a Heisenberg-like quandary of classic DR testing: meticulous pre-test planning can skew results
Ultimately, DR Aggregators Deliver Value as… (Continued) In the final analysis: • Aggregators are only as good as the number of processes and components they can monitor: selection criteria should include support for installed hardware and software • Aggregators inform about, but do not correct, gaps: they can provide useful input into DR change management, but they are strictly passive monitors (like traditional testing, but without the logistical resource requirements)
Building In Testing • The next evolution of DR testing parallels the next evolution in DR strategy: failover • Different meanings depending upon who you ask • For simplicity, the automated transition of processing workload between two hosting environments • A strategy once limited to companies with deep pockets and “always-on” application requirements
Building In Testing (Continued) • Prerequisites: • “Shared nothing” application architecture • Known end points for data mirroring (usually multi-hop) • A high-speed, high-volume network interconnect • A “failover engine” to switch workload based on a trigger (a minimal sketch follows this list) • Comparable (or identical) hosting platforms supported by manual processes for replicating changes
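For illustration only, a minimal sketch of the “failover engine” idea: poll the primary environment and, after a run of consecutive failed health checks, hand the workload to the shadow site. The probe and switchover actions, and the thresholds, are hypothetical placeholders rather than any vendor's implementation.

import time

def failover_engine(check_primary, redirect_to_shadow,
                    max_failures=3, poll_seconds=30):
    """check_primary: caller-supplied health probe returning True/False.
    redirect_to_shadow: caller-supplied switchover action (redirect WANs/users)."""
    consecutive_failures = 0
    while True:
        if check_primary():
            consecutive_failures = 0        # primary healthy; reset the counter
        else:
            consecutive_failures += 1
            if consecutive_failures >= max_failures:
                redirect_to_shadow()        # trigger condition met: fail over
                return
        time.sleep(poll_seconds)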
Failover As Built-In DR • Today, failover is increasingly affordable • Network costs have declined and IP nets are ubiquitous • 1-to-1 platform replacement is no longer required • Vendor-agnostic data mirroring technologies have begun to appear • Servers and networks -- but not necessarily storage -- have largely succumbed to commoditization • New failover engines have appeared from ISVs to facilitate failover for distributed systems that do not have native clustering capabilities • And the Internet has helped encourage “always-on” thinking among continuity planners (whether it is actually needed or not) [Diagram: primary and shadow environments]
A Couple Of Flavors • Active-Passive: asynchronous updates, data mirroring, and failover between a production environment and a backup environment. A trigger event starts the backup hosting environment and redirects WANs and user networks as needed. Expect some data deltas. • Active-Active: synchronous updates, copy-on-write, and load balancing with failover between production environments A and B. A trigger event re-balances load, shifting it to the remaining environment.
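To put “expect some data deltas” in concrete terms: with asynchronous replication that lags production by, say, up to five minutes (a purely illustrative figure), a failover can lose up to five minutes of committed updates, so the achievable recovery point is bounded by the replication lag; synchronous active-active designs avoid that delta at the cost of tighter distance and latency constraints.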
From A Testing Perspective • Failover strategies themselves are just another DR strategy to test…unless the failover technology used is designed with testing in mind.
From A Testing Perspective (Continued) • Enter “DR Wrapper” software • Designed initially to coordinate data replication between platforms as a precursor to clustered failover • Products in this category are increasing in number: Neverfail, EMC RepliStor, Double-Take, CA XOsoft, etc. • Of these, CA has gone a step further than most: delivering a robust failover scripting facility as well as a data replication aggregator (monitor) and a native data replication engine
From The Horse’s Mouth • Conceived as a data replication engine with support for cross-platform data replication • Integrated tape backup monitoring (ARCserve) and third-party hardware/software replication monitoring via script language • Scenario-based host environment failover scripting • Bandwidth benchmarking to assess suitability and scheduling of networks for replication activity
Customer Report • University of Texas at Brownsville, post-Hurricane Rita, searching for a reliable DR strategy for email hosting • Microsoft Exchange Mail configured in the production data center as a physical server cluster with data stored to an FC fabric • Remote target: ISP in Austin running VMware and using iSCSI-connected DAS (replicating the fabric deemed too expensive) • Existing VLAN provides 20 MB of bandwidth, deemed adequate to support replication of a daily change data volume of 50 to 100 MB via CA XOsoft CDP
Customer Report (Continued) UT simulates failover frequently and actually uses shadow infrastructure when performing maintenance on Windows server and FC fabric environments “Surprisingly, there is no significant reduction in Exchange Mail response times when operating from either environment.”
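As a rough sanity check on the figures in this report (reading the VLAN's “20 MB” as roughly 20 Mbps of bandwidth, an assumption on our part): 100 MB of daily change data is 800 megabits, which a 20 Mbps link can move in about 40 seconds at full rate, so even a small fraction of the link comfortably absorbs the CDP replication workload.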
Limitations And Benefits • “The only installation challenge involves customer security concerns around Domain Administration Rights, which CA XOsoft uses to expedite software pushes to servers. Workarounds exist.” • Out-of-the-box failover support is listed for software products, including Exchange, SQL Server, IIS, Oracle, BlackBerry Enterprise Server, etc., and specific operating systems, including Windows, UNIX and Linux (various distributions): Custom scripting is required to support other applications.
Limitations And Benefits (Continued) No hardware-based data replication processes are natively supported: Scripts must be written to include third-party replication in monitoring Typically, customers do not use the product to do system state replication or patching; a manual process is used instead
From A Testing Perspective • The Cool Stuff • Failover scenarios can be both simulated and executed from the console in a fully automatic or fully interactive mode • Failover between physical and virtual environments fully supported • Dual use value: Testing capability can also be leveraged to eliminate downtime during production system maintenance and patching.
Aggregators And Wrappers Can Trim The Testing Burden • With aggregated data replication process monitoring (a data management service) and built-in platform failover and recovery testing (an infrastructure management service), planners may be able to reduce the number of manual tests they need to perform in order to keep the continuity strategy up to date
But They Do Not Eliminate The Need For Traditional Testing • Some applications may not lend themselves to (or require) failover solutions • Moreover, certain tasks cannot be tested via simulated or actual failover • Emergency response and evacuation • Setting up crisis communications and emergency operations • Re-fitting user work areas • Transitioning to normal service levels following an event • Nor do aggregators and wrappers provide the actual change management services or documentation updates that will be needed for continuous strategy improvement and audit
Improved Test Tools, Not Test Substitutes • New technologies can reduce some of the testing burden, but they do not substitute for end-to-end testing at the business process recovery level • Plus, they do not provide the single most important benefit realized from traditional testing: rehearsal of those who will play a role in an actual recovery
In The Final Analysis All things are ready if our minds be so… William Shakespeare
Toigo Partners International and the Data Management Institute www.toigopartners.com www.datainstitute.org www.it-sense.org Email: jtoigo@toigopartners.com Storage Management.org www.storagemgt.org DR Planning.org www.drplanning.org DrunkenData.com www.drunkendata.com Questions? Thanks. jtoigo@toigopartners.com