Performing Disaster Recovery with Tivoli Storage Manager. Disaster Recovery as Part of Business Continuance.
Disaster Recovery as Part of Business Continuance • Disaster Recovery: Disaster recovery is traditionally defined as the ability to recover from a catastrophic outage of IT systems and services. Protecting and maintaining IT data and processes is the primary goal. • Business Continuance: An enterprise-wide planning process that creates detailed procedures to be used in the case of an unplanned outage or disaster. Maintaining continuity of business processes is the overall objective. Disaster Recovery Planning (DR Planning) is a logical subset of the business continuance planning (BCP) process that focuses on continuity of IT operations.
The Chicago Flood – High Cost of Data Loss • Example: The Chicago Flood, April 13, 1992 • Affected: About 6 square miles of the Loop (downtown business district) • Lives lost: None • Business losses: $1 billion • Duration: About a week • Number of businesses shut down temporarily due to lack of power, phones, networking & data: More than 400 • Number of affected businesses that later shut their doors: About 150 • Estimated total of lost revenues & productivity, cleanup costs, etc.: $1 billion (SANS Institute, August 2001)
Seven Tiers of Disaster Recovery [Figure: the seven tiers of disaster recovery. Source: SHARE. PTAM = Pickup Truck Access Method]
Offsite Vaulting of Tape Cartridges • Requires a disaster recovery plan and careful management of offsite volumes. • Use Disaster Recovery Manager (DRM) to automate the TSM server recovery process and to manage offsite volumes. • Strategy must include vaulting of TSM database backup, volume history information, device configuration information, disaster recovery plan file and copy pools for storage at an offsite location. • Consider using tape encryption hardware • Ship the TSM database backup separately
DRM Overview • What is TSM’s Disaster Recovery Manager? • Manage the tape cartridge life-cycle (MOVE DRMEDIA) • Create an offsite disaster recovery plan (PREPARE) • Record client machine recovery information (DEFINE MACHINE) • Disaster Recovery Planning • Develop a documented plan • Be sure all operational procedures, especially tape movement, are documented, tested and working • Consider using Active Data Pools for offsite vaulting • Automation is important; use available tools to reduce human error as much as possible: O/S scripts, TSM scripts and macros, TSM DRM, scheduling tools, electronic transfer when possible • Test the recovery plan often
Configuring DRM Settings
QUERY DRMSTATUS
SET DRMDBBACKUPEXPIREDAYS days
tsm: SERVER1>q drmstatus
    Recovery Plan Prefix: /tsm/drm/
    Plan Instructions Prefix: /home/tsminst1/drm/DRM.planfile
    Replacement Volume Postfix: @
    Primary Storage Pools:
    Copy Storage Pools: DRPOOL
    Active-Data Storage Pools:
    Not Mountable Location Name: Room M156 Kevin Shelf
    Courier Name: Gburg Currier Service
    Vault Site Name: Gburg Vault
    DB Backup Series Expiration Days: 2 Day(s)
    Recovery Plan File Expiration Days: 2 Day(s)
    Check Label?: Yes
    Process FILE Device Type?: No
    Command File Name: /home/tsminst1/drm/drmcmd.sh
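The settings shown in the QUERY DRMSTATUS output can be kept in a macro so they can be reapplied or reviewed. The following is a minimal sketch, not the presenter's own procedure: the function name `emit_drm_settings` is hypothetical, the values are copied from the sample output, and the mapping of each output field to a SET command is an assumption to verify against the Administrator's Reference.

```shell
# Sketch: print a TSM macro that reproduces the DRM settings from the
# sample QUERY DRMSTATUS output. Values are illustrative; adjust per site.
emit_drm_settings() {
    cat <<'EOF'
SET DRMPLANPREFIX /tsm/drm/
SET DRMINSTRPREFIX /home/tsminst1/drm/DRM.planfile
SET DRMPLANVPOSTFIX @
SET DRMCOPYSTGPOOL DRPOOL
SET DRMNOTMOUNTABLENAME "Room M156 Kevin Shelf"
SET DRMCOURIERNAME "Gburg Currier Service"
SET DRMVAULTNAME "Gburg Vault"
SET DRMDBBACKUPEXPIREDAYS 2
SET DRMRPFEXPIREDAYS 2
SET DRMCHECKLABEL YES
SET DRMFILEPROCESS NO
SET DRMCMDFILENAME /home/tsminst1/drm/drmcmd.sh
EOF
}
```

Redirecting the output to a file (for example `emit_drm_settings > drmset.mac`) gives a macro that can be run from an administrative client session.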
The Daily Process Steps • TSM - backup storage pool • TSM – copy to active data pool • TSM - backup database • TSM - identify off-site cartridges • Library - eject cartridges • TSM – create the disaster recovery planfile • Operator - remove cartridges and turn over to courier • Courier - carry to vault • TSM - identify empty cartridges in the vault • Courier - carry empty cartridges back to on-site • Operator - insert empty (scratch) cartridges in library
Using a Maintenance Script to Automate Daily Tasks
PARALLEL
BACKUP STGPOOL DRM_PRIMDISK DRM_COPYTAPE WAIT=YES
BACKUP STGPOOL DRM_PRIMTAPE DRM_COPYTAPE WAIT=YES
SERIAL
PARALLEL
COPY ACTIVEDATA DRM_PRIMDISK DRM_ACTTAPE WAIT=YES
COPY ACTIVEDATA DRM_PRIMTAPE DRM_ACTTAPE WAIT=YES
SERIAL
BACKUP DB DEVCLASS=TS7650G_N34 WAIT=YES TYPE=FULL
MOVE DRMEDIA * WHERESTATE=MOUNTABLE TOSTATE=VAULT SOURCE=DBBACKUP WAIT=YES
PREPARE SOURCE=DBBACKUP WAIT=YES
MIGRATE STGPOOL DRM_PRIMDISK WAIT=YES LOWMIG=0
EXPIRE INVENTORY SKIPDIRS=NO WAIT=YES RESOURCE=4
PARALLEL
RECLAIM STGPOOL DRM_PRIMDISK WAIT=YES THRESHOLD=50
RECLAIM STGPOOL DRM_PRIMTAPE WAIT=YES THRESHOLD=50
/* RECLAIM STGPOOL DRM_COPYTAPE WAIT=YES THRESHOLD=50 */
SERIAL
Identify the Cartridges Ready for Off-Site
QUERY DRMEDIA * WHERESTATE=MOUNTABLE
• Generate a list of tapes that are ready to go off-site and give it to the operations staff
• One option is to email the list to tape operations. The email signals that it is time to remove tapes from the library, and tape operations can check off the tapes as they remove them from the tape library
tsm: SERVER1>q drmedia * wherestate=mountable
Volume Name       State              Last Update          Automated
                                     Date/Time            LibName
----------------  -----------------  -------------------  ----------------
ATS300L3          Mountable          10/28/2010 18:53:16  TS3584
ATS308L3          Mountable          03/24/2011 16:09:39  TS3584
ATS305L3          Mountable          03/24/2011 16:09:39  TS3584
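Turning the QUERY DRMEDIA output into a plain eject list is a small text-processing job. The sketch below is one way to do it, assuming the column layout of the sample output (volume name first, state immediately after); the function name `drmedia_volumes` is hypothetical.

```shell
# Read QUERY DRMEDIA output on stdin and print the volume names that are
# in the requested state, one per line. Header, separator and prompt lines
# are skipped because the text after their first word does not start with
# the state name.
drmedia_volumes() {
    awk -v want="$1" '
        NF >= 2 && $1 !~ /^-+$/ {
            rest = $0
            sub(/^[ \t]*[^ \t]+[ \t]+/, "", rest)   # drop the volume name column
            if (index(rest, want) == 1) print $1
        }'
}
```

Saved query output can be piped through it, for example `drmedia_volumes Mountable < drmedia.out > eject_list.txt`, and the resulting file emailed or printed for tape operations.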
Eject the Cartridges from the Library
MOVE DRMEDIA * WHERESTATE=MOUNTABLE TOSTATE=VAULT -
SOURCE=DBBACKUP
• This command will eject tapes from most libraries
• Has the effect of CHECKOUT LIBVOL REM=YES
• Tape operations staff must be trained to interact with the library to ensure that all cartridges have been removed
• Having an eject list is important here (created in previous step)
• Note: With TSM V6, the Q DRMEDIA and MOVE DRMEDIA commands will process the Active Data Pool volumes
Run the PREPARE Command PREPARE SOURCE=DBBACKUP WAIT=YES • Generates the TSM plan file based on the setup commands • Put the plan file somewhere where it can be retrieved at the disaster site • Send it to another site via NFS or CIFS mount • Email it • Put it on some media (flash drive, CD, etc.)
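Getting the newest plan file to a location reachable from the disaster site can be scripted. This is a minimal sketch under two assumptions: plan files end in a `.YYYYMMDD.HHMMSS` suffix (as in names like `drmtest.planfile.20091030.115549`), and all plan files in the directory share one prefix so that a plain sort is chronological. The function names and the destination (for example an NFS or CIFS mount point) are hypothetical.

```shell
# Print the name of the newest plan file in a directory.
latest_planfile() {
    ls "$1" | grep -E '\.[0-9]{8}\.[0-9]{6}$' | sort | tail -n 1
}

# Copy the newest plan file to a destination directory and verify the copy.
ship_planfile() {
    src_dir="$1"; dest_dir="$2"
    name=$(latest_planfile "$src_dir")
    [ -n "$name" ] || { echo "no plan file found in $src_dir" >&2; return 1; }
    cp "$src_dir/$name" "$dest_dir/" && cmp -s "$src_dir/$name" "$dest_dir/$name"
}
```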
Generate a List of Empty Tapes to Retrieve From the Vault
QUERY DRMEDIA * WHERESTATE=VAULTRETRIEVE
• Generate a list of tapes that are empty, have satisfied their reuse delay setting and are ready to return from the vault
• Package this list with the set of tapes going off-site
• Email this list to the off-site provider
• Use an electronic interface to the off-site provider
Save the output and use it as an audit trail.
tsm: SERVER1>q drmedia * wherestate=vaultretrieve
Volume Name       State              Last Update          Automated
                                     Date/Time            LibName
----------------  -----------------  -------------------  ----------------
ATS307L3          Vault Retrieve     11/24/2010 18:53:16  TS3584
ATS309L3          Vault Retrieve     02/19/2011 16:09:39  TS3584
Perform the State Change for Tapes that Have Returned
MOVE DRMEDIA * WHERESTATE=VAULTRETRIEVE -
TOSTATE=ONSITERETRIEVE
• This command will change the state for all VAULTRETRIEVE tapes
• This must be done before tapes can be re-inserted to the library
• Can be issued for individual tapes as they are received at the primary location:
MOVE DRMEDIA ATS300L3 TOSTATE=ONSITERETRIEVE
Check-in Empty Tapes to be Reused as Scratch CHECKIN LIBVOL TS3584 CHECKLABEL=BARCODE SEARCH=BULK STATUS=SCRATCH WAITT=0 • This command must be issued after the tapes have been placed in the Import/Export port of the tape library • Use a value greater than zero for ‘WAITT=‘ to cause the library to wait for a prompt (a TSM REPLY command) before attempting to import the cartridges • The tape operators should use the checklist from the previous step to avoid missed cartridges
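The checklist comparison can also be done mechanically before check-in. The following sketch assumes two plain text files with one volume name per line: the list requested from the vault and the list of cartridges actually received. The function name `reconcile_tapes` and the file layout are hypothetical.

```shell
# Compare expected volumes (file $1) with received volumes (file $2).
# Prints "MISSING <vol>" for tapes not returned and "UNEXPECTED <vol>"
# for tapes that were returned but not requested.
reconcile_tapes() {
    expected=$(mktemp); received=$(mktemp)
    sort -u "$1" > "$expected"
    sort -u "$2" > "$received"
    comm -23 "$expected" "$received" | sed 's/^/MISSING /'
    comm -13 "$expected" "$received" | sed 's/^/UNEXPECTED /'
    rm -f "$expected" "$received"
}
```

Empty output means the two lists agree; any line printed is worth resolving with the courier before the tapes are returned to scratch.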
The PREPARE Output: The Planfile Stanzas
• Informational Stanzas
• Custom Instruction Stanzas
• Recovery Scripts
• TSM Macros
• TSM Configuration Files
SERVER.REQUIREMENTS
RECOVERY.INSTRUCTIONS.GENERAL
RECOVERY.INSTRUCTIONS.OFFSITE
RECOVERY.INSTRUCTIONS.INSTALL
RECOVERY.INSTRUCTIONS.DATABASE
RECOVERY.INSTRUCTIONS.STGPOOL
RECOVERY.VOLUMES.REQUIRED
RECOVERY.DEVICES.REQUIRED
RECOVERY.SCRIPT.DISASTER.RECOVERY.MODE script
RECOVERY.SCRIPT.NORMAL.MODE script
DB.STORAGEPATHS
LICENSE.REGISTRATION macro
ACTIVEDATASTGPOOL.VOLUMES.AVAILABLE macro
ACTIVEDATASTGPOOL.VOLUMES.DESTROYED macro
COPYSTGPOOL.VOLUMES.AVAILABLE macro
COPYSTGPOOL.VOLUMES.DESTROYED macro
PRIMARY.VOLUMES.DESTROYED macro
PRIMARY.VOLUMES.REPLACEMENT macro
STGPOOLS.RESTORE macro
VOLUME.HISTORY.FILE
DEVICE.CONFIGURATION.FILE
DSMSERV.OPT.FILE
LICENSE.INFORMATION
Explode the Planfile
Custom Instruction Stanzas: RECOVERY.INSTRUCTIONS.XXXX • Some Ideas to include: • General information about the TSM server: hardware type, operating system level • Support staff contact names and numbers: TSM administrator(s), system administrator(s), management • Information about client systems: restore priority, unique restore requirements • Information about travel: airlines, hotels, etc. • Security information: access to the disaster recovery site, access to passwords • Hardware replacement procedures: vendor contacts, account numbers • Location of all vendor contracts – for verification in case of problems • Disaster declaration procedures: documentation procedures, responsibilities • Courier information: contacts, service level expectations • Telecommunications requirements: WAN provider contacts and procedures to institute alternate connections. Internet Service Provider contacts, TelCo contacts and alternate service plan • Plan for alternate office space and appropriate contacts • Modified Device Configuration file (pre-configured)
RECOVERY.SCRIPT.DISASTER.RECOVERY.MODE RECOVERY.SCRIPT.NORMAL.MODE script • These two scripts perform the recovery actions by issuing operating system commands and by driving a series of TSM administrative macros. The macros are all stanzas in the recovery plan • These scripts can sometimes be run directly, or they can be used as a template for running the commands manually
The DRM Commands • SET DRM…. • QUERY DRMEDIA • MOVE DRMEDIA • PREPARE • For complete syntax, consult the TSM Administrator’s Reference: http://publib.boulder.ibm.com/infocenter/tsminfo/v6r2/index.jsp
Electronic vaulting of TSM Recovery Data • Use DRM to automate the TSM server recovery process and to manage offsite volumes. • Vault TSM database backup, volume history information, device configuration information and disaster recovery plan file to an offsite location or to the hot-site. • For TSM database backup use virtual volumes or ftp the sequential file volumes • ftp the configuration files • Copy Storage Pool vaulting to an offsite location or to the hot-site • Use virtual volumes or use an extended SAN • Use a hardware mirroring technique for TSM database backups and storage pool volumes (use with care) • Use DB2 HADR to replicate a TSM V6 database to a remote location • Export / Import is another option • Bandwidth is an important concern for any electronic vaulting configuration
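To make the bandwidth concern concrete, a rough transfer-time estimate helps when sizing a link. This is a back-of-the-envelope sketch only: it assumes decimal units (1 GB = 8000 megabits), a link fully available to the transfer, and no protocol overhead or compression. The function name `transfer_hours` is hypothetical.

```shell
# Estimate the hours needed to move <gigabytes> over a link rated at
# <megabits per second>: hours = GB * 8000 / Mbps / 3600.
transfer_hours() {
    awk -v gb="$1" -v mbps="$2" \
        'BEGIN { printf "%.1f\n", (gb * 8 * 1000) / mbps / 3600 }'
}
```

For example, `transfer_hours 500 100` prints `11.1`: a 500 GB nightly copy pool update would occupy a 100 Mbit/s link for roughly eleven hours under these idealized assumptions.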
Electronic Vaulting with TSM Virtual Volumes [Diagram labels: Primary Site, Recovery Site, Virtual Volumes, IP, BACKUP DB, BACKUP STG, TSM Server (at each site)]
Electronic Vaulting with TSM Virtual Volumes • Considerations: • No distance limitations -- uses TCP/IP • No TSM synchronization issues • No need for specialized hardware or software – all support is built into TSM • Data transfer is not deduplicated • Space reclamation may require large amounts of data transfer • Must build a second TSM server at the recovery site to access the disaster recovery backups
Electronic Vaulting with a Remote Library [Diagram labels: Primary Site, Recovery Site, BACKUP DB, BACKUP STG, SAN Extender (at each site), IP or Dark Fiber, TSM Server]
Electronic Vaulting with NFS Volumes [Diagram labels: Primary Site, Recovery Site, BACKUP DB, BACKUP STG, NFS Vols, IP, TSM Server]
Electronic Vaulting with a Remote Library • Considerations: • Most closely matches the Tier 1 method of disaster recovery – may be able to reuse many processes • Does not require a TSM server at the recovery site • Requires specialized SAN extension hardware • Has distance limitations – usually metro distances work well – check with SAN vendor • Works best with physical tape or VTL device • Data transfer is not deduplicated • Space reclamation may require large amounts of data transfer
Electronic Vaulting with Disk Mirroring [Diagram labels: Primary Site, Recovery Site, Storage Pools, TSM Database, IP, TSM Server, TSM Server (Standby)]
TSM and External Disk Mirroring • There are certain requirements which must be met in order to ensure that TSM will function properly when data replication techniques are used for the TSM Database, Recovery Log and disk Storage Pools. • Write order must be strictly maintained between the source and target systems • The TSM database, recovery log and storage pools must be part of a consistency group and write order maintained within the group • Writes must be non-volatile—writes within a transaction group must all be either permanently resident on the disk or non-resident • The hardware must report a successful write and must do so if and only if the data is permanently resident on the device • These requirements apply to both hardware and software mirroring techniques • These requirements apply to both asynchronous and synchronous mirroring techniques
TSM and External Disk Mirroring • Many disk subsystems use write caching to improve performance and provide replication functions. If these devices use battery backups and can ensure that the writes are destaged to disk before complete loss of power, then they are acceptable • In a remote replication configuration, you should never start the TSM server at the target site unless the source site has failed or has been shut down for some reason. If you lose synchronization of the two sites, you should not attempt to start the target server and you should immediately resynchronize • If you start the target server for whatever reason, you must have a plan to return to the primary system. Many times the disk subsystem will provide a mechanism to fall back to the primary system after it has been completely synchronized • You should be sure your disk vendor understands these issues and has certified its hardware to satisfy these requirements.
Electronic Vaulting with Disk Mirroring [Diagram labels: Primary Site, Recovery Site, Storage Pools, BACKUP DB, IP, TSM Server, TSM Server (Standby)]
Electronic Vaulting with Disk Mirroring • Considerations: • Can provide a very favorable recovery point when the TSM database is mirrored • Does not require a storage pool backup • Data can be mirrored in a deduplicated state – fewer bytes need to be transferred (best with client-side dedup) • Fast recovery of the TSM server at the disaster site with no special pre-recovery procedures • Data is mirrored byte-by-byte – difficult to synchronize with TSM • Requires careful use of consistency groups when TSM database is mirrored • May be distance limitations – check with disk vendor
Electronic Vaulting with VTL Replication [Diagram labels: Primary Site, Recovery Site, Storage Pools, BACKUP DB, IP, TSM Server, TSM Server (Standby)]
Electronic Vaulting with VTL Replication • Considerations: • No distance limitations – uses TCP/IP • Does not require a storage pool backup • Data can be replicated in a deduplicated state – fewer bytes need to be transferred • Replication is done at the virtual cartridge level – easy to synchronize with TSM • Does not require use of consistency groups
Electronic Vaulting with DB2 HADR and VTL Replication [Diagram labels: Primary Site, Recovery Site, TSM Database, DB2 HADR, IP, Storage Pools, TSM V6 Server, Standby TSM V6 Server]
Electronic Vaulting with DB2 HADR and VTL Replication • Considerations: • Does not require a TSM database backup to replicate the TSM database • Can provide a very favorable recovery point • Requires a running standby system at the recovery site • DB2 High Availability-Disaster Recovery (HADR) is a “log-shipping” methodology to replicate a DB2 database from one DB2 system to another • TSM V6 is fully supported in HADR environments
Recovery Steps • Must haves: Planfile (includes the TSM server options file, device configuration file and volume history file), Database Backup, Copy Storage Pool Volumes, Active Data Pool Volumes • Rebuild the TSM server system from bare metal (if necessary) • Explode the Planfile – use the provided awk script (or VB script or REXX) • Replace the TSM options file, Device Configuration and Volume History files • Edit a customized Device Configuration file for TSM Database restore • Different tape hardware at the recovery site • Identify and place the database restore volumes in specific slots (manually) in the library for automatic operation. • Perform a manual library restore • Run the RECOVERY.SCRIPT.DISASTER.RECOVERY.MODE script to rebuild the TSM server, restore the database, mark all primary volumes destroyed, and mark all copy pool volumes available • Often it will be necessary to break up this script and run individual sections • Begin the client restore • Have a plan to perform a bare metal restore for each platform type • Know which clients need to be restored first • Optionally restore the primary storage pool(s)
Rebuild the TSM Server Platform • Perform a bare metal restore of the TSM server platform (if necessary) • Use an operating system backup (mksysb, etc.) OR • Reinstall the operating system and the TSM Server • Create the replacement TSM server instance (for TSM V6): • A bare metal restore from an operating system backup may restore all that is needed and it may not be necessary to recreate the TSM server instance • Run the dsmicfgx utility to configure the instance – this will also configure the TSM API for database backup/restore • To manually configure the instance – use the steps in the Install Guide • Delete the DB2 database for TSM (it will be recreated with the TSM database restore):
dsmserv removedb TSMDB1
Explode the Planfile
• Locate the sample explode program (awk script on Unix, Visual Basic script on Windows, REXX script on z/OS)
• Modify the script if necessary
• It will work without modification; however, not all stanzas are exploded, and it puts the exploded stanzas in the directory indicated by the ‘SET DRMPLANPREFIX’ command
> awk -f planexpl.awk drmtest.planfile.20091030.115549
Creating file /drm/shared/drmtest-planfiles/drmtest.planfile.RECOVERY.INSTRUCTIONS.GENERAL
Creating file /drm/shared/drmtest-planfiles/drmtest.planfile.RECOVERY.INSTRUCTIONS.OFFSITE
Creating file /drm/shared/drmtest-planfiles/drmtest.planfile.RECOVERY.INSTRUCTIONS.INSTALL
Creating file /drm/shared/drmtest-planfiles/drmtest.planfile.RECOVERY.INSTRUCTIONS.DATABASE
Creating file /drm/shared/drmtest-planfiles/drmtest.planfile.RECOVERY.INSTRUCTIONS.STGPOOL
Creating file /drm/shared/drmtest-planfiles/drmtest.planfile.RECOVERY.VOLUMES.REQUIRED
...
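Because the sample program does not explode every stanza, it can be useful to inspect the plan file directly. The sketch below is not the supplied planexpl.awk; it is a small illustration that assumes each stanza is delimited by lines of the form `begin STANZA.NAME` and `end STANZA.NAME` (check this against an actual plan file). The function names are hypothetical.

```shell
# List the stanza names found in a plan file, one per line.
list_stanzas() {
    awk '$1 == "begin" && NF >= 2 { print $2 }' "$1"
}

# Print the body of one stanza: the lines between its begin and end markers.
extract_stanza() {
    awk -v name="$2" '
        $1 == "end"   && $2 == name { inside = 0 }
        inside { print }
        $1 == "begin" && $2 == name { inside = 1 }
    ' "$1"
}
```

For instance, `extract_stanza planfile DEVICE.CONFIGURATION.FILE > devconfig.review` pulls out a single stanza for review without running the full explode step.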
RECOVERY.SCRIPT.DISASTER.RECOVERY.MODE
# Restore server options, volume history, device configuration files.
cp /drm/shared/drmtest-planfiles/drmtest.planfile.DSMSERV.OPT.FILE \
   /home/tsminst2/dsmserv.opt
cp /drm/shared/drmtest-planfiles/drmtest.planfile.VOLUME.HISTORY.FILE \
   /home/tsminst2/volhistory
cp /drm/shared/drmtest-planfiles/drmtest.planfile.DEVICE.CONFIGURATION.FILE \
   /home/tsminst2/devconfig
• Copy the server files to the appropriate directory
• The device configuration file will need to be modified
• It’s best to have a pre-prepared device configuration file available
• Edit the predefined device configuration file to specify the volume id and location of the database backup tape(s)
RECOVERY.SCRIPT.DISASTER.RECOVERY.MODE
# Make sure db storage paths exist.
mkdir /drm/tsminst2/db1
mkdir /drm/tsminst2/db2
• These directories should already exist if the operating system was restored from an operating system backup
RECOVERY.SCRIPT.DISASTER.RECOVERY.MODE
# Restore the server database to latest version backed up per the
# volume history file.
/opt/tivoli/tsm/server/bin/dsmserv -i /home/tsminst2 restore db todate=10/30/2009 totime=11:52:25 source=dbb
• Run this with the TSM server instance id (e.g. tsminst1)
• This will restore the TSM server database
• Put the modified device configuration file in place first
• Put the database restore tape volume(s) in the library, note their locations and edit the device configuration file accordingly
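Before starting the database restore, it is cheap to confirm that the three server files are in place in the instance directory. This is a minimal pre-flight sketch; the file names `dsmserv.opt`, `volhistory` and `devconfig` follow the example paths used in this deck and may differ at your site, and the function name is hypothetical.

```shell
# Return 0 only if the options, volume history and device configuration
# files exist and are non-empty in the given instance directory.
check_restore_files() {
    inst_dir="$1"; missing=0
    for f in dsmserv.opt volhistory devconfig; do
        if [ ! -s "$inst_dir/$f" ]; then
            echo "missing or empty: $inst_dir/$f" >&2
            missing=1
        fi
    done
    return "$missing"
}
```

A call such as `check_restore_files /home/tsminst2 || exit 1` placed ahead of the restore step stops the run early instead of failing partway through the restore.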
RECOVERY.SCRIPT.DISASTER.RECOVERY.MODE
dsmserv
• This will start the TSM server in the foreground
• This is not part of the recovery script, but it is generally a good idea in order to see whether the server will start properly and to note any errors
• You may see this message:
ANR0000I Update to server database configuration complete, ready for server restart.
Just restart the server after the message
• If the tape hardware is different, then the library may not start properly (wait for this message and note the retry message):
ANR8840E Unable to open device /dev/smc2 with file handle 11 and PVRRC 153.
ANR8441E Initialization failed for SCSI library TS7650G_N34.
RECOVERY.SCRIPT.DISASTER.RECOVERY.MODE
• Use a TSM admin client macro to fix up the library and drive paths:
/* Reset the Server Name */
set servername tsminst2-dr
/* Delete the Drive Paths and the Drives */
del path tsminst2-dr n34_dr00 srct=server destt=drive libr=ts7650g_n34
del path tsminst2-dr n34_dr01 srct=server destt=drive libr=ts7650g_n34
del dr ts7650g_n34 n34_dr00
del dr ts7650g_n34 n34_dr01
/* Delete and redefine the Library */
del path tsminst2-dr ts7650g_n34 srct=server destt=libr
del libr ts7650g_n34
def libr ts7650g_n34 libt=scsi
def path tsminst2-dr ts7650g_n34 srct=server destt=libr devi=/dev/smc7 online=yes
/* Redefine the Drives and Drive Paths */
def dr ts7650g_n34 n34_dr00
def dr ts7650g_n34 n34_dr01
def path tsminst2-dr n34_dr00 srct=server destt=drive libr=ts7650g_n34 device=/dev/rmt15
def path tsminst2-dr n34_dr01 srct=server destt=drive libr=ts7650g_n34 device=/dev/rmt28
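With more than a couple of drives, typing the path definitions by hand at the recovery site is error-prone. The sketch below generates the `def path` lines of such a macro from a simple "drive name, device" list; the function name and the two-column input format are assumptions made for illustration.

```shell
# Read "drive_name device_special_file" pairs on stdin and print one
# "def path" command per drive for the given server and library names.
gen_drive_paths() {
    awk -v s="$1" -v l="$2" 'NF == 2 {
        printf "def path %s %s srct=server destt=drive libr=%s device=%s\n", s, $1, l, $2
    }'
}
```

For example, `printf 'n34_dr00 /dev/rmt15\nn34_dr01 /dev/rmt28\n' | gen_drive_paths tsminst2-dr ts7650g_n34` reproduces the two drive path definitions used in this example environment.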
RECOVERY.SCRIPT.DISASTER.RECOVERY.MODE
# Start the server.
nohup /opt/tivoli/tsm/server/bin/dsmserv -i /home/tsminst2 &
print Please start new server console with command dsmadmc -CONSOLE.
print Press enter to continue recovery script execution.
read pause
• This will start the TSM server in the background
• Note the prompt to start a server console session
• Look in the Activity Log for these messages to indicate the server has started properly:
ANR8439I SCSI library TS7650G_N34 is ready for operations.
ANR8200I TCP/IP Version 4 driver ready for connection with clients on port 1600.
ANR0916I TIVOLI STORAGE MANAGER distributed by Tivoli is now ready for use.
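The startup check can be partly automated by scanning saved console or activity log text for the relevant message numbers. This is a sketch under stated assumptions: it treats ANR0916I as the "server ready" indicator and ANR8441E (SCSI library initialization failed) as a blocking error, and it leaves obtaining the log text to the operator. The function name is hypothetical.

```shell
# Read log text on stdin. Return 0 if the server reports ready,
# 2 if library initialization failed, 1 if neither message is present.
server_ready() {
    log=$(cat)
    if printf '%s\n' "$log" | grep -q 'ANR8441E'; then
        echo "library initialization failed" >&2
        return 2
    fi
    printf '%s\n' "$log" | grep -q 'ANR0916I'
}
```

A non-zero result is a cue to stop and fix the library and drive definitions before continuing with the rest of the recovery script.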
RECOVERY.SCRIPT.DISASTER.RECOVERY.MODE
# Register Server Licenses.
dsmadmc -id=$1 -pass=$2 -serv=$3 -ITEMCOMMIT \
  -OUTFILE=/drm/shared/drmtest-planfiles/drmtest.planfile.LICENSE.REGISTRATION.log \
  macro /drm/shared/drmtest-planfiles/drmtest.planfile.LICENSE.REGISTRATION
/* Purpose: Register the server licenses by specifying the names */
/* of the enrollment certificate files necessary to re-create the */
/* licenses that existed in the server. */
/* Recovery Administrator: Review licenses and add or delete licenses */
/* as necessary. */
register license file(tsmbasic.lic)
register license file(tsmee.lic)
• At this point the script drives a macro which resets the license registration
RECOVERY.SCRIPT.DISASTER.RECOVERY.MODE
# Tell the server these copy storage pool volumes are available for use.
# Recovery Administrator: Remove from macro any volumes not obtained from vault
dsmadmc -id=$1 -pass=$2 -serv=$3 -ITEMCOMMIT \
  -OUTFILE=/drm/shared/drmtest-planfiles/drmtest.planfile.COPYSTGPOOL.VOLUMES.AVAILABLE.log \
  macro /drm/shared/drmtest-planfiles/drmtest.planfile.COPYSTGPOOL.VOLUMES.AVAILABLE
/* Purpose: Mark copy storage pool volumes as available for use in recovery. */
/* Recovery Administrator: Remove any volumes that have not been obtained */
/* from the vault or are not available for any reason. */
/* Note: It is possible to use the mass update capability of the server */
/* UPDATE command instead of issuing an update for each volume. However, */
/* the 'update by volume' technique used here allows you to select */
/* a subset of volumes to be processed. */
upd vol NC0013L3 acc=READO wherestg=DRM_COPYTAPE
• Change the access mode of the Copy Storage Pool volumes from ‘OFFSITE’ to ‘READONLY’
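The instruction to remove volumes that were not obtained from the vault can be applied with a small filter rather than by hand-editing. The sketch below assumes the macro contains one `upd vol <volume> ...` line per volume, as in the example above, and that the on-hand volumes are listed one per line in a separate text file. The function name and the on-hand file are hypothetical.

```shell
# Print the macro ($1) keeping only "upd vol" lines whose volume appears
# in the on-hand list ($2). Comments and other lines pass through unchanged.
filter_available_macro() {
    awk '
        FILENAME == ARGV[1] { if (NF) have[toupper($1)] = 1; next }
        tolower($1) == "upd" && tolower($2) == "vol" {
            if (toupper($3) in have) print
            next
        }
        { print }
    ' "$2" "$1"
}
```

Writing the result to a new file and pointing the `macro` invocation at it keeps the original exploded stanza intact for reference.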