Disaster Recovery with Tivoli Storage Manager
Agenda • People, Process, and Technology: Basic Dimensions of the Solution • Information Technology Infrastructure Management • Business Continuity Management and Disaster Recovery Planning • Service Level Management and Backup/Restore Planning • The Tiers of Disaster Recovery • Configuring Tivoli Storage Manager for Disaster Recovery • Architectural considerations • System Recovery from a Bare Machine • Tivoli Disaster Recovery Manager
People, Process, and Technology • Today we will focus on Process and Technology
Why the Emphasis on Process? • In disaster recovery, planning is everything • What machines have priority for post-disaster restoration? • How quickly do critical computing resources need to be back online? • What is the optimum backup strategy to support disaster recovery? • HINT: It is probably not the most efficient backup methodology • Server storage architecture affects the potential speed of restoration • Will tape backups support the Recovery Time Objective (RTO)? • What materials, tools, hardware, and procedures are needed? • Who will perform the restore? • Will the original server/storage team be available? • Planning is process-driven.
Your Tivoli Storage Manager architecture should be based on SLAs and capable of supporting the Business Continuity Plan. Information Technology Infrastructure Library (ITIL): a recognized international standard for IT management processes • Business Continuity Management (BCM) is concerned with managing risks to ensure that an organization can at all times continue operating to at least a predetermined minimum level. • IT Service Continuity Management (ITSCM) is the part of the BCM process that focuses on continuity of IT operations. • Disaster Recovery Planning (DR Planning) is a logical subset of the ITSCM process. • Service Level Management ensures that service targets are documented and agreed in Service Level Agreements (SLAs) • Basic planning data for backup and recovery infrastructure • Basis for RTOs
The Concept • DR solutions and strategies can be very complex. The seven “tiers” of DR are a concept developed by the SHARE user group. • Tier 0 - Do nothing, no off-site data • Tier 1 - Offsite vaulting • Tier 2 - Offsite vaulting with a hot site • Tier 3 - Electronic vaulting • Tier 4 - Electronic vaulting to hot site (active secondary site) • Tier 5 - Two-site two-phase commit • Tier 6 - Zero data loss • The choice of tier is based on: • The Recovery Time Objective • The Recovery Point Objective
Tier 0 • Characteristics • No saved information • No documentation • No backup hardware • No contingency plan • Capability • None
Tier 1 Offsite vaulting • Characteristics • Backups couriered to off-site storage • Some planning for recovery has been done • No backup hardware kept available • Capability • Limited, but recovery is possible given enough time (recovery times typically exceed one week) • Limitations • Recovery depends on the time needed to obtain hardware • Recovery Time Objectives may not be achievable
Tier 2 Offsite vaulting with a hotsite • Characteristics • Backups couriered to off-site storage • Some planning for recovery has been done • Hot Site provided • Capability • Good, but typical recovery times exceed one day. • Limitations • Aggressive Recovery Time Objectives may not be achievable
Tier 3 Electronic vaulting • Characteristics • Backups couriered to off-site storage • Disaster Recovery Plan • Hot site provided • Electronic vaulting of some critical data • Capability • Typical recovery time less than one day. • Limitations • May be very expensive since the hot site is kept running at all times. • Note: Electronic vaulting means that data is copied or mirrored to remote disk storage at frequent intervals.
Tier 4 Electronic vaulting to hot site (active secondary site) • Characteristics • Backups couriered to off-site storage • Disaster Recovery Plan • Active secondary data center • Electronic vaulting of some critical data • Capability • Typical recovery times of up to one day. • Can provide bi-directional recovery (each site backs up the other) • Limitations • Data centers MUST be physically separated.
Tier 5 Two-site, two-phase commit • Characteristics • Backups couriered to off-site storage • Disaster Recovery Plan • Active secondary data center • Electronic vaulting of some critical data • Selected data is maintained in image status • Both the primary and secondary platforms’ data must be updated before an update request is considered successful. • Capability • Typical recovery times less than 12 hours. • Can provide bi-directional recovery (each site backs up the other) • Limitations • Data centers MUST be physically separated. • Requires partially or fully dedicated hardware on the secondary platform, with the ability to automatically transfer the workload to the secondary platform. • Data “in transit” between sites at the time of a disaster may be compromised.
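The Tier 5 rule that an update only counts once both platforms hold it can be illustrated with a toy two-phase commit. This is a minimal in-memory sketch, not how Tivoli Storage Manager or any storage product implements it; the `Site` class and `two_phase_update` function are hypothetical names introduced for illustration.

```python
class Site:
    """Hypothetical stand-in for one data center's storage platform."""

    def __init__(self, name, available=True):
        self.name = name
        self.available = available
        self.data = {}      # committed records
        self.pending = {}   # records staged during phase one

    def prepare(self, key, value):
        # Phase one: stage the update and vote on whether it can be committed.
        if not self.available:
            return False
        self.pending[key] = value
        return True

    def commit(self, key):
        # Phase two: make the staged update permanent.
        self.data[key] = self.pending.pop(key)

    def rollback(self, key):
        # Discard a staged update because some site voted no.
        self.pending.pop(key, None)


def two_phase_update(sites, key, value):
    """Return True only if every site has applied the update."""
    prepared = []
    for site in sites:
        if site.prepare(key, value):
            prepared.append(site)
        else:
            # One site cannot accept the update: undo the staging everywhere.
            for staged in prepared:
                staged.rollback(key)
            return False
    for site in sites:
        site.commit(key)
    return True


primary = Site("primary")
secondary = Site("secondary")
ok = two_phase_update([primary, secondary], "order-1001", "paid")
print(ok, primary.data == secondary.data)
```

If the secondary site is unavailable, the update is rejected and the primary keeps no trace of it. The sketch does not model a failure between the two phases, which is the "in transit" exposure listed under Limitations.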
Tier 6 Zero data loss • Characteristics • Disaster Recovery Plan • Dual online storage is used, with full network switching capability. • The two systems are advanced coupled, allowing an automated switchover • Capability • Recovery time measured in minutes. • Can provide bi-directional recovery (each site backs up the other) • Limitations • The most expensive solution, as it requires coupling or clustering applications, additional hardware to support data replication, and high-bandwidth connections over extended distances.
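The typical recovery times quoted for Tiers 1 to 6 can be turned into a rough lookup that returns the lowest tier whose stated recovery time fits a given RTO. This is only a sketch: the function name is hypothetical, the one-hour ceiling for Tier 6 is an assumption (the slides say only "minutes"), and the Recovery Point Objective, which also drives the choice, is ignored here.

```python
# (tier, hours, exceeds): typical recovery time per tier as stated on the slides.
# exceeds=True  -> recovery typically takes longer than `hours`.
# exceeds=False -> recovery typically completes within `hours`.
TYPICAL_RECOVERY = [
    (1, 168.0, True),   # "exceed one week"
    (2, 24.0, True),    # "exceed one day"
    (3, 24.0, False),   # "less than one day"
    (4, 24.0, False),   # "up to one day"
    (5, 12.0, False),   # "less than 12 hours"
    (6, 1.0, False),    # "minutes"; the one-hour ceiling is an assumption
]


def lowest_tier_for_rto(rto_hours):
    """Return the lowest tier whose typical recovery time fits the RTO, or None."""
    for tier, hours, exceeds in TYPICAL_RECOVERY:
        fits = rto_hours > hours if exceeds else rto_hours >= hours
        if fits:
            return tier
    return None


for rto in (200, 48, 24, 12, 2):
    print(rto, "hours ->", lowest_tier_for_rto(rto))
```

Tier 0 is left out because it offers no recovery capability. A result of `None` means the RTO is tighter than anything the quoted figures support.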
Server Hardware Selection • Choose a TSM server platform where suitable skills exist in the company or are readily obtainable. • Consider ease of operating system recovery • Ensure that the server system chosen will be able to attach enough storage devices to meet your capacity requirements into the future. • Consider how the operating system supports clustering and high availability and any fault-tolerant technology built into the hardware.
Disk Storage Considerations • For DR purposes, we recommend hardware-implemented RAID storage • Another important criterion when selecting disk technology is the ability to create instant, local mirrors of the data • Helpful if you move to Tier 4 or above at a future date, essential if you need this now. • Consider SAN-attached storage. • High-speed, high-capacity network • Easy configuration of redundant access paths • Better than local disk in the event of server hardware failure
Tape Storage Considerations • Buy the biggest, fastest, and most automated tape library you can afford, because: • The quantity of data grows constantly (often much faster than forecast) and retention requirements change • As data volumes increase, backup windows are shortening • Automation helps reduce overall costs, increases reliability (less human error), and makes the system as a whole faster. • SAN-attached libraries can be accessed from both the primary and the hot site, making production of off-site copies much quicker
Server Database Considerations • To restore the TSM server after a disaster you will need these four files: dsmserv.dsk, dsmserv.opt, devcnfg.out, and volhist.out. • Copy them at least after each full database backup and keep the copies off-site. • You will need the last full database backup and any subsequent incrementals, unless you are using electronic vaulting (see below). • Do frequent full database backups or snapshot database backups rather than a long series of incremental backups. • This reduces TSM server recovery time • At Tiers 4-6 you can back up the database using a device class of type FILE on remote mirrored disk space. • If you have multiple TSM servers in your configuration, you could use server-to-server virtual volumes to make additional copies of your database backups.
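The advice to copy the four server recovery files after each full database backup can be sketched as a small staging step. This is an illustration only: the function name and the directory arguments are hypothetical, and the sketch copies to a local staging directory rather than performing any actual off-site transfer.

```python
import shutil
from pathlib import Path

# The four files named on the slide as required to restore the TSM server.
RECOVERY_FILES = ("dsmserv.dsk", "dsmserv.opt", "devcnfg.out", "volhist.out")


def stage_recovery_files(server_dir, staging_dir):
    """Copy the recovery files to a staging directory.

    Returns a list of missing file names. Nothing is copied unless
    all four files are present, so a partial set is never staged.
    """
    server_dir = Path(server_dir)
    staging_dir = Path(staging_dir)
    missing = [name for name in RECOVERY_FILES
               if not (server_dir / name).is_file()]
    if missing:
        return missing
    staging_dir.mkdir(parents=True, exist_ok=True)
    for name in RECOVERY_FILES:
        # copy2 preserves timestamps, which helps match a copy to a backup run.
        shutil.copy2(server_dir / name, staging_dir / name)
    return []
```

A scheduled job could call `stage_recovery_files` right after the full database backup completes and raise an alert whenever the returned list is not empty.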
Backup Strategy Considerations • Consider alternatives to a pure incremental backup strategy for critical systems: • Image backups • Backup sets • Restoration from tape may not be possible for large, critical servers with a short Recovery Time Objective • Periodic consolidation of tape storage (via move data or move nodedata) can improve restore times.
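Whether a tape restore can meet a short Recovery Time Objective comes down to simple arithmetic on data volume, drive throughput, and tape mounts. The sketch below is a deliberately crude model with hypothetical function names: it assumes sustained throughput scales linearly with the number of drives and that tape mounts happen one after another. Real restores are usually slower, for example when a client's data is scattered across many tapes.

```python
def estimated_restore_hours(data_gb, drive_mb_per_s, drives=1,
                            mounts=0, minutes_per_mount=0.0):
    """Estimate restore time in hours from volume, throughput, and mounts."""
    # Transfer time: total megabytes divided by aggregate drive throughput.
    transfer_seconds = (data_gb * 1024.0) / (drive_mb_per_s * drives)
    # Mount overhead: assumed to be serialized, one mount after another.
    mount_seconds = mounts * minutes_per_mount * 60.0
    return (transfer_seconds + mount_seconds) / 3600.0


def meets_rto(data_gb, drive_mb_per_s, rto_hours, **kwargs):
    """True if the estimated restore time fits within the RTO."""
    return estimated_restore_hours(data_gb, drive_mb_per_s, **kwargs) <= rto_hours


# Example with made-up figures: 2000 GB on one drive at 30 MB/s, 4-hour RTO.
print(meets_rto(2000, 30, rto_hours=4))
```

Fewer mounts shorten the estimate, which is one way to see why consolidating a node's data onto fewer tapes improves restore times.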
Definition • What is Bare Machine Recovery (BMR)? • Technology for system recovery • Methodology: • Save configuration, driver, and system settings • Ensure fast operating system recovery (faster than reinstalling and configuring from scratch). • Recover to a point that enables further application and data recovery • How does BMR relate to Disaster Recovery? • Bare Machine Recovery (or System Backup and Recovery) is an essential part of complete Disaster Recovery • Not a complete solution in itself • BMR + Disaster Recovery Management ~= Disaster Recovery
Issues/Risks • Open Files • OS or application files on the boot volume may be in an inconsistent state at the time of the backup. The hope is that the restore will produce a boot volume that is at least in a ‘crash consistent’ state, which an OS should be able to recover from. • Different Hardware • Many operating systems are sensitive to hardware changes • NT4 is famous for ‘blue screening’ when it is booted while configured for different hardware. • Backup Client Requires Libraries Not Available in Recovery Environment • By definition, minimal boot environments do not include the same capabilities as the normal production environment.
Bare Machine Recovery (BMR) • The integration of Tivoli Storage Manager with Bare Machine Recovery solutions has only recently become well defined. The solution varies by platform:
What Tivoli DRM does: Tivoli DRM helps you maintain business continuance by: • Managing server database and storage pool backup volumes • Establishing a thorough DRP for the TSM server • Tracking and reporting client systems destroyed, in the event of a disaster • Automating vital server recovery steps to bring your business back to normal operation • Prioritizing client system restores.
Sample extract from DRM-generated DR Plan
begin RECOVERY.SCRIPT.DISASTER.RECOVERY.MODE script
@echo off
rem Purpose: This script contains the steps required to recover the server
rem to the point where client restore requests can be satisfied
rem directly from available copy storage pool volumes.
rem Note: This script assumes that all volumes necessary for the restore have
rem been retrieved from the vault and are available. This script assumes
rem the recovery environment is compatible (essentially the same) as the
rem original. Any deviations require modification to this script and the
rem macros and scripts it runs. Alternatively, you can use this script
rem as a guide, and manually execute each step.
if not %1.==. if not %2.==. goto start
echo Specify the following positional parameters:
echo administrative client ID and password.
echo Script stopped.
goto end
:start
rem Set the server working directory.
pushd "C:\PROGRA~1\tivoli\tsm\server1\"
And so on…
Discussion Topic: What tier are you using? • Tier 0 - Do nothing, no off-site data • Tier 1 - Offsite vaulting • Tier 2 - Offsite vaulting with a hot site • Tier 3 - Electronic vaulting • Tier 4 - Electronic vaulting to hot site (active secondary site) • Tier 5 - Two-site two-phase commit • Tier 6 - Zero data loss