No Fallen ANGELs! Redundancy, Backup, Recovery Andrea Chappell: University of Waterloo Adam Hauerwas: Providence College Ruomiao Wang & Jie Li: Kelley Direct, Indiana University Terry O'Heron & Crystal Foust: Penn State
Agenda • How do you back up and archive courses? • What policies and procedures guide your response to requests to recover a course, a file, an internal ANGEL page, a student upload file? • How do you protect your system from various failures, and how quickly do you “promise” to have it back online?
University of Waterloo (Andrea) • ANGEL has been the centrally supported LMS since summer 2004. • Core to university business. • Need to configure against various types of failures, e.g.: • Disaster (fire, flooding, etc.) • Partial system failure (ANGEL/IIS or SQL server systems, disks, etc.)
Constraints (what we can’t change) • Support coverage is not 24x7: Central IT (IST) provides extended support for critical systems but not 24x7 support. • Cannot survive lengthy power outages. • Cannot survive some network outages. • Network support is also not 24x7.
Backup Processes • System data backup • Database (dump of db file), transaction logs (cut once per day) and upload files backed up nightly by campus backup service. • Course archives • Long term: Archive courses at end of term. • Shorter term: Remove from system after 4 terms. (Note: to offer a course again, copy the course rather than reuse the same instance.)
Recovery Process • Recover data to dev system and copy lost data to production. • This can be very complex if the missing data is a quiz that was run, a bulletin board, etc.! • Currently no policies on what to recover, or promise of time to recovery. Requests considered on individual basis.
Protecting against failures • Current strategy: Buy robust equipment, configure to minimize points of failure. • Dual RAID disks • Dual power supply • 7x24, 4-hour hardware support (from vendor) • Housed in access-controlled machine room • Uninterruptible power supply (UPS) • [Diagram: Production Systems: ANGEL/IIS (Dell server), SQL Server (Dell server). Development System: ANGEL/IIS and SQL Server.]
Vulnerabilities in Current Strategy • Failure of the ANGEL/IIS or SQL Server hardware, e.g., system motherboard • Don’t have a ready backup machine. • Could temporarily use development system. • Likely a minimum of half a day of downtime. • Machine room “fire” • All hardware lost. • Up to one day of lost data (if 24 hours from last backup). • Days of downtime!
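The "up to one day of lost data" figure follows from the nightly backup schedule: anything written after the last completed backup is lost if the machine room is lost. A minimal sketch of that arithmetic in Python; the function name and the example timestamps are illustrative assumptions, not part of Waterloo's tooling.

```python
from datetime import datetime, timedelta


def data_loss_window(last_backup: datetime, failure: datetime) -> timedelta:
    """Return how much work is lost if only the last backup survives."""
    if failure < last_backup:
        raise ValueError("failure time precedes the last backup")
    return failure - last_backup


# Hypothetical times: nightly backup completed at 01:00,
# machine room lost at 23:30 the same day.
last_backup = datetime(2005, 3, 1, 1, 0)
failure = datetime(2005, 3, 1, 23, 30)

lost = data_loss_window(last_backup, failure)
print(f"Data at risk: {lost.total_seconds() / 3600:.1f} hours")
```

With a once-a-day schedule the window can approach 24 hours, which is why the configurations under investigation aim for less potential data loss.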
Configurations under Investigation Looking for faster recovery time, less potential data loss, through increased redundancy. • Config 1: Identical production and development systems, different locations. • Config 2: Identical production and dev systems, shared data (data filer), Load Balancer (Cisco), different locations.
Config 1 • Identical production and development systems, different locations. • Gains: • In system failure: • If possible, move disks to duplicate system: 4 working hours. • Or, recover data to duplicate systems: perhaps 8 working hours. • Issues: • Manual intervention still required. • Cost: • Two new systems. • [Diagram: ANGEL/IIS (Dell server), SQL Server (Dell server).]
Config 2 • Identical prod and dev systems, shared data, load balancer, different locations. • Gains: • Failure of one ANGEL/IIS system: instantaneous failover to the remaining system. • Failure of SQL Server: reconfigure dev system to point to data filer. • Issues: • Filer is a single point of failure unless clustered. • Greater complexity may cause downtime. • Cost: • 3 new systems, plus filer (~$30 USD) • [Diagram: Load Balancer, two ANGEL/IIS (Dell server) systems, Data Filer, SQL Server (Dell server).]
Providence College (Adam) • ANGEL has been our LMS since Fall 2001. • Like Waterloo, support coverage is not 24x7. • Cannot survive lengthy power outages or network outages.
PC Backup and Recovery • System data backup • Back up database and logs to files once per day. • Use Tivoli to back up both DB and file system nightly. • Creates “backup of a backup.” • Course archives • Short term: Archive courses 90 days after term end. • Long term: Store archives to DVD. • Recovery • Like Waterloo, recover Production database in Development environment.
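The 90-day short-term archive rule above is simple date arithmetic. A minimal sketch in Python, assuming a hypothetical mapping of course IDs to term-end dates; the course IDs, dates, and function names are made up for illustration and do not come from ANGEL or from Providence College's procedures.

```python
from datetime import date, timedelta

ARCHIVE_DELAY = timedelta(days=90)  # from the slide: archive 90 days after term end


def archive_due_date(term_end: date) -> date:
    """Date on which a course becomes eligible for archiving."""
    return term_end + ARCHIVE_DELAY


def courses_due(term_ends: dict[str, date], today: date) -> list[str]:
    """Return the IDs of courses whose archive date has arrived, sorted by ID."""
    return sorted(
        course for course, end in term_ends.items()
        if archive_due_date(end) <= today
    )


# Hypothetical courses and term-end dates.
term_ends = {
    "ENG101-SP05": date(2005, 5, 15),
    "BIO200-FA05": date(2005, 12, 20),
}
print(courses_due(term_ends, today=date(2005, 9, 1)))
```

A report like this only lists candidates; the archive and removal steps themselves would still be done through the LMS.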
PC’s Redundancy • Today: Robust Production Server • Multiple RAID disks (System, DB, Data) • Dual power supplies and NICs • Access-controlled machine room • UPS • [Diagram: Development System: ANGEL/IIS/SQL (Desktop). Production System: ANGEL/IIS/SQL (HP DL380).]
PC’s Future Architecture • This Summer: New Server and SAN • Purchase new server and install O/S and SQL Server on local RAID. • Store database and web files on SAN disk. • In the event of Production hardware failure, connect Production disk to Development server with little downtime. • [Diagram: IBM Storage Area Network. Development System: ANGEL/IIS/SQL (Old HP). Production System: ANGEL/IIS/SQL (New HP).]
Kelley Direct On-Line Programs, Indiana University (Ruomiao) • Road to ANGEL • Piloted ANGEL as LMS in Fall 2003 • Spring 2004: all courses delivered via ANGEL • Critical learning platform that connects KD to the students
Kelley Direct On-Line Programs, Indiana University • Current Data Protection Measures • Backup • System Backups: • Full backups once a week starting Friday night • Differential backups every night around 11 PM • Database Backups: • Full ANGEL SQL database backup every night at 10 PM. The database backup output files are then backed up by system tape backups for that night. • Transaction log backups every six hours. • The backup tapes are then taken to an offsite location.
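With a nightly full database backup plus transaction log backups every six hours, a restore needs the most recent full backup and every log backup taken after it. The sketch below, in Python, picks that chain for a given target time. It rests on assumptions that are not on the slide: log backups are taken at six-hour offsets from the 10 PM full backup (04:00, 10:00, 16:00), backups are treated as instantaneous, and the function name is hypothetical.

```python
from datetime import datetime, timedelta

FULL_BACKUP_HOUR = 22              # from the slide: full backup nightly at 10 PM
LOG_INTERVAL = timedelta(hours=6)  # from the slide: log backups every six hours


def restore_chain(target: datetime) -> tuple[datetime, list[datetime]]:
    """Return (full_backup_time, log_backup_times) needed to restore up to target.

    Assumes log backups run at 6-hour offsets from the nightly full backup.
    """
    # Latest 10 PM full backup at or before the target time.
    full = target.replace(hour=FULL_BACKUP_HOUR, minute=0, second=0, microsecond=0)
    if full > target:
        full -= timedelta(days=1)

    # Every log backup taken after that full backup, up to the target time.
    logs = []
    t = full + LOG_INTERVAL
    while t <= target:
        logs.append(t)
        t += LOG_INTERVAL
    return full, logs


# Hypothetical failure time: mid-afternoon on 2 March 2005.
target = datetime(2005, 3, 2, 15, 30)
full, logs = restore_chain(target)
last_point = logs[-1] if logs else full
print("Full backup:", full)
print("Log backups:", logs)
print("Unrecoverable work:", target - last_point)
```

Under these assumptions the unrecoverable window stays under six hours, compared with up to a day when only nightly backups exist.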
Kelley Direct On-Line Programs, Indiana University • Current System Protection Measures • Disk • Configured with RAID 5 with a spare disk • Dual power connections • UPS System connection (30 min.) • Spare Chassis • Test server has identical hardware and serves as a spare chassis
Kelley Direct On-Line Programs, Indiana University • Current Recovery Practices • File or Database Restore • Restore from disk, tape backups, or individual developers’ machines. • System Component Failure • Replace the faulty component(s) from the spare chassis (test server) or move the entire disk array from the production server to the test server. • Total System Failure or Disk Array Failure • Rebuild the entire system, possibly on alternate hardware. • All the ANGEL components will need to be either installed from scratch or restored from backup tapes. Some system components have to be reconfigured manually.
Kelley Direct On-Line Programs, Indiana University • Challenges for KD ANGEL Environment • Security • ANGEL web server resides on the same physical machine that hosts the ANGEL databases • Scalability • Limited capability to scale performance based on volume • Availability • No redundancy built in. Single server design. Any component failure means downtime • Shrinking Maintenance Window (or do we still have one?) • (continue on next slide)
Kelley Direct On-Line Programs, Indiana University • Challenges for KD ANGEL Environment • Storage Capacity • Limited expansion capability • Recoverability • Single copy of production data on disk. Tape restoration is time consuming and means data loss • Availability • No redundancy built in. Single server design. Any component failure means downtime • Growth • Significant enrollment growth is expected for the programs in the next three years • Development Environment • Developers are coding on own machines. Configurations differ from production environment. Less efficient.
Kelley Direct On-Line Programs, Indiana University • Some Questions • How can backend infrastructure better support the vision of the on-line programs? • How to plan system capacity when the program changes (such as enrollment growth)? • How to better protect student data? • What are the available options for long-term data retention? • How to better meet the requirements for less service interruption? • What should we do to ensure a faster ANGEL systems recovery?
Penn State Environment (Terry, Crystal) • Support coverage is 24x7 • Backup Power (generator) • Redundant network connectivity • Failover capability • Mirrored storage • Daily Backups/Off-site storage • Daily Maintenance (5-7 am) • Archive (courses, inactive groups)
Constraints • Backup • SQL: 3 hours • File: 3-4 days • Restoration • SQL: 1.5 hours • File: 2 min. - ??