Techniques for Voyager Disaster Recovery and Business Continuity

Christopher Manly, Coordinator, Library Systems and Discovery Services, Cornell University Library Information Technology

Saving your data (and maybe your bacon) • Context • Business continuity concepts • Cornell’s approach • Finding the right solution

Context: Cornell Library • Private R1 university • Voyager supports 17 unit libraries • 8 million volumes • 125 librarians, 312 staff • 3.5 million visits per year • 880,000 circulations per year

Context: Cornell IT environment • CUL has a good relationship with central IT • Solid sysadmin support • Mature & extensive SAN environment • Multiple data centers on campus • Centralized tape backup service • Oh, and I used to be one of “them”, which helps.

Business Continuity & Disaster Recovery • What kinds of disaster do you anticipate? • Fat fingers • Hardware failures • Power outages • Disease outbreak • Natural disaster • More readiness = more cost

Business Continuity • Disasters happen • Just because it hasn’t happened to you yet, doesn’t mean it won’t. • No, really. It will. • When (not if) disaster strikes, • How long can you be down? • How much data can you lose?

Secret bonus • A well-built BC approach will make non-emergency maintenance and upgrades easier, too

BC Metrics • Recovery Time Objective (RTO) • The time it takes you to detect, respond, and get things up and running again • Recovery Point Objective (RPO) • How far back in time, measured from the event, your latest safe data capture lies; anything done after that capture is lost.
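As a rough illustration of the two metrics, the sketch below computes the recovery point and recovery time actually achieved in one incident from three timestamps, then compares them with targets. The function name, the timestamps, and the target values are all hypothetical, chosen only to make the arithmetic concrete.

```python
from datetime import datetime, timedelta


def recovery_metrics(last_safe_capture, failure, service_restored):
    """Return (recovery_point, recovery_time) for one incident.

    recovery_point: how far back the latest safe data capture was when
                    the failure hit (work done in this window is lost).
    recovery_time:  how long it took to detect, respond, and restore.
    """
    if not last_safe_capture <= failure <= service_restored:
        raise ValueError("expected capture <= failure <= restore")
    return failure - last_safe_capture, service_restored - failure


# Hypothetical incident, for illustration only.
point, time_down = recovery_metrics(
    last_safe_capture=datetime(2024, 1, 15, 8, 0),
    failure=datetime(2024, 1, 15, 9, 30),
    service_restored=datetime(2024, 1, 15, 12, 30),
)

# Hypothetical targets; each site has to pick its own.
RPO_TARGET = timedelta(hours=2)
RTO_TARGET = timedelta(hours=4)

print("data lost:", point, "within RPO:", point <= RPO_TARGET)
print("downtime: ", time_down, "within RTO:", time_down <= RTO_TARGET)
```

Running the same calculation against the timeline of a real or rehearsed outage is one way to see whether the objectives are being met in practice.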

Develop a Business Continuity Strategy • Determine your desired RTO and RPO • Develop an approach and assess costs • Re-architect or revise targets as needed • Lather, rinse, repeat

Cornell’s approach – original design • Triple mirror of Voyager database • Solaris, DiskSuite, UFS • Split one mirror weekly to make a snapshot for backups • Keep the third mirror separate • Back up archive logs in between

Cornell’s approach – current design • Mirrored across buildings on SAN • ZFS snapshot for backup • Test server in other datacenter • Second on-disk copy replicated on test server

What does this give us? • RPO of 2 hours • We back up the archive logs every 2 hours so the most data loss we can suffer is 2 hours’ activity • RTO of 4 hours for most scenarios • Unless we lose both buildings • We may choose to take a longer downtime in some circumstances

What can you do? • Backups • Work with IT • Explore technologies

Backups • Do you have them? • Are they automated? • Have you tested a restore lately? • How about an on-disk copy?
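The first two questions can be checked automatically. The following is a minimal sketch, assuming backups land as files in a single directory; the function names, the file name, and the two-hour threshold are hypothetical. It reports whether the newest file in that directory is recent enough.

```python
import os
import tempfile
import time
from pathlib import Path


def newest_backup_age(backup_dir, now=None):
    """Return the age in seconds of the most recently modified file in
    backup_dir, or None if the directory holds no files."""
    now = time.time() if now is None else now
    mtimes = [p.stat().st_mtime for p in Path(backup_dir).iterdir() if p.is_file()]
    if not mtimes:
        return None
    return now - max(mtimes)


def backups_are_fresh(backup_dir, max_age_seconds, now=None):
    """True only if a backup exists and is newer than max_age_seconds."""
    age = newest_backup_age(backup_dir, now)
    return age is not None and age <= max_age_seconds


# Self-contained demonstration using a throwaway directory.
with tempfile.TemporaryDirectory() as demo_dir:
    # No backups at all: the check should fail.
    empty_ok = backups_are_fresh(demo_dir, max_age_seconds=2 * 3600)

    dump = Path(demo_dir) / "example.dump"  # hypothetical file name
    dump.write_bytes(b"placeholder")
    three_hours_ago = time.time() - 3 * 3600
    os.utime(dump, (three_hours_ago, three_hours_ago))
    # Newest backup is three hours old: too stale for a two-hour limit.
    stale_ok = backups_are_fresh(demo_dir, max_age_seconds=2 * 3600)

    dump.touch()  # simulate a new backup landing just now
    fresh_ok = backups_are_fresh(demo_dir, max_age_seconds=2 * 3600)

print(empty_ok, stale_ok, fresh_ok)
```

A check like this only shows that backups exist and are recent. It says nothing about whether they can be restored, so it complements a periodic restore test rather than replacing one.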

Working with IT • Talk to them before the crisis • Build a relationship • Lunch is a good thing • Don’t specify technology • Do specify your parameters

Explore technologies • Vendors will sell you anything if you have money to throw at the problem • Free/cheap solutions abound • rsync • OS-level snapshots
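To give a flavor of the free options, here is a small sketch that builds (but does not run) an rsync mirror command and a ZFS snapshot command. The paths, host name, and dataset name are hypothetical placeholders, and the helper functions are illustrative rather than part of any existing tool.

```python
import shlex
from datetime import datetime


def rsync_command(source_dir, dest):
    """Build an rsync invocation that mirrors source_dir to dest.

    -a        archive mode (recursive, preserves permissions and times)
    --delete  remove files at dest that no longer exist in source_dir
    """
    # Trailing slash: copy the directory's contents, not the directory itself.
    return ["rsync", "-a", "--delete", source_dir.rstrip("/") + "/", dest]


def zfs_snapshot_command(dataset, when):
    """Build a 'zfs snapshot' invocation with a timestamped snapshot name."""
    return ["zfs", "snapshot", f"{dataset}@{when:%Y%m%d-%H%M}"]


# Hypothetical names, for illustration only.
mirror = rsync_command("/data/voyager", "standby.example.edu:/data/voyager-copy")
snap = zfs_snapshot_command("tank/voyager", datetime(2024, 1, 15, 2, 0))

print(shlex.join(mirror))
print(shlex.join(snap))
```

The resulting lists can be handed to `subprocess.run(cmd, check=True)` from a scheduled job. Files belonging to a running database are generally safer to copy from a snapshot than from the live filesystem, so the two techniques often work together.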

Go forth and recover… • Assess your needs • Make a plan • Put it together • Test it out
