Anatomy of a disaster recovery

Anatomy of a disaster recovery Jim McBee ITCS Hawaii jim@somorita.com

Setting the stage “Approximately 80 percent of unplanned downtime is caused by people and process issues, while the remainder is caused by technology failures and disasters” Gartner Group study, March 16, 1999

Jim McBee – Shameless self promotion  • Consultant, Writer, MCSE, MVP, and MCT – Honolulu, Hawaii • Principal clients SAIC, Dell, and Microsoft • Author – Exchange 2003 24Seven (Sybex) • Contributor – Exchange and Outlook Administrator • Blog – Mostly Exchange – http://mostlyexchange.blogspot.com

Audience Assumptions • Level 200 session • You have at least a few months experience running Exchange 5.5, 2000, or 2003 • You have worked with Active Directory • You can install and configure Windows and Exchange

Session’s coverage • Presentation – About 60 minutes • Common causes of downtime • Case studies • Summary of common things that delay recovery • Some things that can speed recovery • Book giveaway – Drop off your business card or write your name on a slip of paper • Questions and answers – 15 – 20 minutes • Catch me afterwards also, I’m here all week

Getting us on some common ground • Disaster means different things to different people • The word “disaster” usually carries the connotation of data loss or possibly financial loss • In this presentation, disaster recovery is really restoration of service

Quick poll… • Over the last two years, how many of you have had unplanned downtime of: • 4 hours? • 8 hours? • Entire day? • More than a day? • Think about the positive and negative factors that contributed to the downtime and its recovery.

Common disaster causes • 80% of unplanned downtime - People and Processes • Infrastructure problems (DNS, DCs, GCs, LAN, WAN, storage)

Common Exchange failure reasons • 5 File-based A/V software corrupted the EDB • 4 Virus outbreaks requiring a shutdown • 4 SAN failures • 4 Shutdowns due to insufficient disk space • 3 OUs were deleted that contained user accounts • 1 Exceeded 16GB limit on Exchange standard • 1 Admin applied wrong security template • 1 Operator could not restore database – 5 days! • 1 Database corrupt, 1018 error (device driver) • 1 Database corrupt, operator plugged external SCSI subsystem in while live • 1 Loss of organization’s only global catalog • 1 Loss of organization’s only DNS server • 1 Administrator incorrectly configured directory replication – loss of GAL • 1 Server blue screening every few hours (service pack / firmware issue) • 1 Motherboard failure • 1 SCSI controller failure • 1 Power to the campus data center failed

What is the cost for delaying restoration of e-mail service? • User productivity • Missed contractual obligations • Missed sales or customer contact • Failure to respond to customers promptly • Loss of end user good will • Loss of credibility (the company’s and your own) • Loss of your job! 

Common causes of delays in restoration of service • People, processes, training • Lack of resources • Not asking for help soon enough

Let’s look at some real-world situations (I hope some of these are therapeutic)

Case Study 1 - Anything that can go wrong will go wrong • Exchange 2000 A/P cluster, with 800 mailboxes across multiple locations • Administrator deletes entire OU (approximately 600 users) • Some mailboxes still active, cannot restore from backup • Previous backup tape was overwritten accidentally, next most recent in off-site storage • Server administrator gets locked out of computer room and cannot get back in • Tape device had to be moved from production network to recovery network.

Case Study 2 - DNS Failure • Single E2K3 server, 700 mailboxes, and 2 W2K3 domain controllers • One of the two domain controllers failed; it was hosting the only functioning DNS • They were under the impression that they had redundancy • Lack of DNS troubleshooting skills delayed repair by 4 hours • DNS was never set up on the second domain controller even though it was defined on the clients and member servers as an alternate
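The redundancy gap in this case study could have been caught ahead of time by querying each configured DNS server directly. Below is a minimal sketch, assuming plain UDP DNS on port 53 and a test name that is known to exist; the function names and the example addresses are hypothetical, and the check only confirms that a server returns a NOERROR reply, not that its zone data is correct.

```python
import socket
import struct


def build_query(name, query_id=0x1234):
    """Build a minimal DNS query packet asking for the A record of name."""
    # Header: ID, flags (recursion desired), 1 question, 0 other records.
    header = struct.pack("!HHHHHH", query_id, 0x0100, 1, 0, 0, 0)
    labels = name.rstrip(".").split(".")
    qname = b"".join(bytes([len(l)]) + l.encode("ascii") for l in labels) + b"\x00"
    question = qname + struct.pack("!HH", 1, 1)  # QTYPE=A, QCLASS=IN
    return header + question


def dns_server_answers(server, name, timeout=2.0, port=53):
    """Return True if server sends back a NOERROR reply for name."""
    query = build_query(name)
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.settimeout(timeout)
        try:
            sock.sendto(query, (server, port))
            reply, _ = sock.recvfrom(512)
        except OSError:  # timeout, refused, unreachable
            return False
    if len(reply) < 12 or reply[:2] != query[:2]:
        return False
    is_response = bool(reply[2] & 0x80)   # QR bit set
    rcode = reply[3] & 0x0F               # 0 means NOERROR
    return is_response and rcode == 0


if __name__ == "__main__":
    # Hypothetical addresses: the primary and "alternate" DNS servers
    # handed out to clients and member servers.
    for server in ["192.0.2.10", "192.0.2.11"]:
        ok = dns_server_answers(server, "example.com")
        print(server, "answers" if ok else "DOES NOT ANSWER")
```

Running a check like this against every DNS server listed on clients and member servers, on a schedule, would show when an "alternate" is not actually serving DNS.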

Case Study 3 - Database corruption • Exchange server database corruption, database would not mount • “Reboot” mindset: hoping the problem will go away • Delayed calling for help for over a day. Affected 500+ mailboxes • Don’t be afraid to call for help

Case Study 4 - Operator ineptitude • Exchange 5.5 server with 350 mailboxes • User deletes important public folder • Inexperienced operator spends the next 4 business days trying to restore • Got an error each time they restored and tried to start the store • Boss did not want to front $245.00 for a PSS call • Error was due to GUID mismatch. Run ISINTEG -PATCH.

Case Study 5 - Generic problem - Server out of disk space • Server runs out of disk space - Very generic • Almost always due to transaction logs • Often a low-level Exchange admin may take hours to diagnose this problem. • Event logs are helpful here • My solution: Select some of the older log files and move them to another disk. Exchange usually does not have outstanding transactions in logs older than a few minutes. Pick something from hours or days ago.
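The "move older logs to another disk" step can be sketched as a small script. This is only an illustration under stated assumptions: the function name, the 24-hour age threshold, the `*.log` pattern, and the example paths are all hypothetical, and file age is the only safeguard against touching a log that is still needed. Files are moved, never deleted, so they can be put back.

```python
import shutil
import time
from pathlib import Path


def move_old_logs(log_dir, dest_dir, min_age_hours=24.0, pattern="*.log", now=None):
    """Move log files older than min_age_hours from log_dir to dest_dir.

    Returns the names of the moved files, oldest first.
    """
    now = time.time() if now is None else now
    cutoff = now - min_age_hours * 3600
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)

    # Select only files whose last-modified time is older than the cutoff.
    candidates = [p for p in Path(log_dir).glob(pattern)
                  if p.is_file() and p.stat().st_mtime < cutoff]
    candidates.sort(key=lambda p: p.stat().st_mtime)

    moved = []
    for path in candidates:
        shutil.move(str(path), str(dest / path.name))
        moved.append(path.name)
    return moved


if __name__ == "__main__":
    # Hypothetical paths: the transaction log directory and a disk with free space.
    print(move_old_logs(r"D:\exchsrvr\mdbdata", r"E:\logs-moved", min_age_hours=24))
```

Keeping the moved files until the next good full backup completes means nothing is lost if a log turns out to be needed after all.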

Case Study 6 - All your eggs in one basket • Exchange 2003 server with 300 mailboxes • Single 150GB mailbox store • Sales organization with a few key mailboxes used for customer communications • Dial-tone restore • Restore entire mailbox store to RSG • Could not / did not segment users that could be restored. • Server ran out of transaction log space during merge back into the store (while D/R team was at lunch) • Database files exceeded storage limits due to loss of single instance store

Case Study 7 - Recovery Server restoration • Restored 5000 mailboxes to a recovery server • Recovery server was on a test network that had a 10Mb/s connection to main network
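A quick estimate shows why the 10Mb/s link dominated this recovery. The sketch below is back-of-the-envelope arithmetic only; the function name, the efficiency factor, and the 250 GB total (5000 mailboxes at an assumed 50 MB each) are illustrative assumptions, not figures from the case study.

```python
def transfer_hours(data_gb, link_mbps, efficiency=0.7):
    """Estimate hours needed to copy data_gb gigabytes over a link_mbps link.

    efficiency is the assumed fraction of nominal bandwidth actually usable.
    """
    if link_mbps <= 0 or not 0 < efficiency <= 1:
        raise ValueError("link_mbps must be positive and efficiency in (0, 1]")
    megabits = data_gb * 8 * 1000          # decimal GB -> megabits
    seconds = megabits / (link_mbps * efficiency)
    return seconds / 3600


if __name__ == "__main__":
    total_gb = 5000 * 50 / 1000            # assumed 50 MB per mailbox -> 250 GB
    for mbps in (10, 100, 1000):
        print(f"{mbps:>5} Mb/s: {transfer_hours(total_gb, mbps):.1f} hours")
```

Even at full line rate, 250 GB over 10 Mb/s takes more than two days, so the network path of a recovery server deserves checking before the restore starts.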

Key Factors that Slowed Recovery – Human Factors • Indecision / no one managing the crisis • No clear escalation path / No SLA to guide recovery process • Timelines for escalation not established (at time X, call PSS, at time Y, ask for escalation) • Not calling for help in a timely fashion • Lack of training • Employee fatigue • Everyone tends to get caught up in the “fire” • Large scale interruptions of service may take 24+ hours to recover • Bad decisions tend to be made when everyone gets tired • Poor communications with users and management • Doing further harm (deleting database files or event logs) • Poor planning • Incorrect / unrealistic expectations with respect to time, restore rates, data restored • Blame-storming first
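An escalation timeline of the "at time X, call PSS, at time Y, ask for escalation" kind can be written down as data before the outage, so nobody has to decide under pressure. A minimal sketch follows; the 1-hour and 4-hour thresholds and all names are hypothetical placeholders to be replaced with whatever the organization's SLA dictates.

```python
from datetime import datetime, timedelta

# Hypothetical thresholds, measured from the start of the outage.
ESCALATION_STEPS = [
    (timedelta(hours=1), "Call vendor support (PSS)"),
    (timedelta(hours=4), "Ask for escalation"),
]


def due_actions(outage_start, now, steps=ESCALATION_STEPS):
    """Return every escalation action whose deadline has already passed."""
    elapsed = now - outage_start
    return [action for threshold, action in steps if elapsed >= threshold]


if __name__ == "__main__":
    start = datetime(2005, 6, 1, 9, 0)
    for hours in (0.5, 2, 5):
        now = start + timedelta(hours=hours)
        print(f"{hours} h into outage:", due_actions(start, now))
```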

Key factors that slowed recovery - Documentation / Infrastructure • Unknowns in your environment • Infrastructure • Time to restore, retrieve tapes, get decisions made • Service levels for infrastructure such as LAN, WAN, storage • Inadequate spare / replacement hardware • Lack of a good, recent backup or cannot locate tapes • No documentation on how to rebuild • Large environments often have separate backup personnel that must be available when restore operations need to take place • Lack of resources to do a disaster recovery (CD ROMs, license keys, documentation) • Server complexity (servers handling multiple roles)

Options that can speed recovery • Training • Practice, Practice, Practice • Written D/R plan and escalation procedures • Keep a written journal of everything you are doing to restore service no matter how mundane. • Dial-tone restoration • Rapidly available replacement hardware • Documentation • Creating a disaster recovery kit • Restore critical mailboxes first (either using Dial-tone and RSG, or segment users to different mailbox stores) • Reading the event logs

Disaster Recovery Kit (Crash cart) • Printed telephone list, operations procedures, and escalation procedures • Server hardware / Windows / Exchange documentation • Product keys / activation codes / key disks • Windows and Exchange product CDs • Don’t forget the service pack CDs • Current versions of all device drivers • Third party CDs (antivirus, gateway, fax servers, etc.) • Emergency repair disk for each server • Keep the kit up to date. • Do not loan this kit or the contents to anyone

Demonstrations • Exchange outage • Dial-tone restore • Use Recovery Storage Group

Recovery Storage Group and Dial-Tone Recovery • Mount “empty” database • Hereafter known as the “dial tone” database • Users can go back to work • Restore last backup to Recovery Storage Group • Hereafter known as the “RSG” database • Dismount “Dial-tone” database • Rename the physical file names • Dismount “RSG” database • Move the RSG database to production location and rename to production file names • Move Dial-tone database to RSG location and rename to RSG database name • Using ESEUTIL /Y is the fastest way to move/copy files if on separate disks • Mount the dial-tone and RSG databases • Use Merge tools to merge changes from the database that is NOW in the RSG to the database that is NOW in production • Production database is now up-to-date!
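The file swap in the middle of this procedure is easy to get wrong under pressure, so it can help to write out the moves before touching anything. The sketch below only computes an ordered list of moves; it does not dismount databases or move files. The function name, the temporary suffix, and the example paths are hypothetical, and both databases are assumed to be dismounted before any move is carried out.

```python
def swap_plan(production_files, rsg_files, temp_suffix=".swap"):
    """Return ordered (source, destination) moves that swap two file sets.

    production_files and rsg_files must pair up by position
    (for example .edb with .edb and .stm with .stm).
    """
    if len(production_files) != len(rsg_files):
        raise ValueError("file lists must pair up one to one")

    moves = []
    # 1. Rename the dial-tone files aside so the production names are free.
    for prod in production_files:
        moves.append((prod, prod + temp_suffix))
    # 2. Move the restored (RSG) files to the production location and names.
    for prod, rsg in zip(production_files, rsg_files):
        moves.append((rsg, prod))
    # 3. Move the dial-tone files to the RSG location and names.
    for prod, rsg in zip(production_files, rsg_files):
        moves.append((prod + temp_suffix, rsg))
    return moves


if __name__ == "__main__":
    # Hypothetical file locations for the production store and the RSG.
    plan = swap_plan(
        [r"D:\mdbdata\priv1.edb", r"D:\mdbdata\priv1.stm"],
        [r"E:\rsg\priv1.edb", r"E:\rsg\priv1.stm"],
    )
    for step, (src, dst) in enumerate(plan, start=1):
        print(f"{step}. move {src} -> {dst}")
```

Printing the plan and adding it to the written recovery journal gives a record of exactly which file ended up where.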

Why do the database swap? • Merging RSG database into dial-tone database destroys SIS • Dial-tone database will not get mailbox metadata such as rules and permissions

Neat 3rd party tools • OnTrack PowerControls • http://www.ontrack.com/powercontrols/ • Quest Recovery Manager for Exchange • http://wm.quest.com/products/Exchange/

Additional information • The Exchange 2003 Technical Library contains a number of documents on disaster recovery and preparedness. • http://tinyurl.com/2pua2 • The Definitive Guide to Exchange Disaster Recovery and Availability by Paul Robichaux • http://tinyurl.com/73ghr • Support Web Cast: Recovery Storage Groups and Disaster Recovery in Microsoft Exchange Server 2003 • KB 832436 • Understanding and analyzing -1018, -1019, and -1022 Exchange database errors • KB 314917

Book Giveaway • Has everyone given me something to draw from?

Questions? • You can always catch me this week if you don’t get your questions answered. • Thanks for attending! • My blog is Mostly Exchange – http://mostlyexchange.blogspot.com
