
No Fallen ANGELs! Redundancy, Backup, Recovery

Andrea Chappell: University of Waterloo. Adam Hauerwas: Providence College. Ruomiao Wang & Jie Li: Kelley Direct, Indiana University. Terry O'Heron & Crystal Foust: Penn State.


Presentation Transcript


  1. No Fallen ANGELs! Redundancy, Backup, Recovery Andrea Chappell: University of Waterloo Adam Hauerwas: Providence College Ruomiao Wang & Jie Li: Kelley Direct, Indiana University Terry O'Heron & Crystal Foust: Penn State

  2. Agenda • How do you back up and archive courses? • What policies and procedures guide your response to requests to recover a course, a file, an internal ANGEL page, or a student's uploaded file? • How do you protect your system from various failures, and how quickly do you “promise” to have it back online?

  3. University of Waterloo (Andrea) • ANGEL has been the centrally supported LMS since summer 2004. • Core to university business. • Need to configure against various types of failures, e.g.: • Disaster (fire, flooding, etc.) • Partial system failure (ANGEL/IIS or SQL Server systems, disks, etc.)

  4. Constraints (what we can’t change) • Support coverage is not 24x7: Central IT (IST) provides extended support for critical systems but not 24x7 support. • Cannot survive lengthy power outages. • Cannot survive some network outages. • Network support is also not 24x7.

  5. Backup Processes • System data backup • Database (dump of db file), transaction logs (cut once per day) and upload files are backed up nightly by the campus backup service. • Course archives • Long term: Archive courses at end of term. • Shorter term: Remove from system after 4 terms. (Note: to offer a course again, copy the course rather than reuse the same instance.)
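
The archive and removal rule on slide 5 can be sketched as a small housekeeping check. This is only an illustration: the sequential term numbering, the function name `course_action`, and the reading of "after 4 terms" as "four or more terms have elapsed" are assumptions, not something the slides specify.

```python
def course_action(course_term: int, current_term: int, keep_terms: int = 4) -> str:
    """Return the housekeeping state for a course.

    Terms are numbered sequentially (a hypothetical scheme); keep_terms=4
    mirrors the "remove from system after 4 terms" rule on the slide.
    """
    elapsed = current_term - course_term
    if elapsed < 0:
        raise ValueError("course term is in the future")
    if elapsed == 0:
        return "active"                      # term still running, nothing to do
    if elapsed < keep_terms:
        return "archived, kept on system"    # archived at end of term
    return "archived, removed from system"   # past the retention window


if __name__ == "__main__":
    for term in (10, 12, 14):
        print(term, course_action(course_term=10, current_term=term))
```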

  6. Recovery Process • Recover data to dev system and copy lost data to production. • This can be very complex if the missing data is a quiz that was run, a bulletin board, etc.! • Currently no policies on what to recover, or promise of time to recovery. Requests considered on individual basis.

  7. Protecting against failures • Current strategy: Buy robust equipment, configure to minimize points of failure. • Dual RAID disks • Dual power supply • 7x24, 4-hour hardware support (from vendor) • Housed in access-controlled machine room • Uninterruptible power supply (UPS) [Diagram: Production Systems: ANGEL/IIS (Dell server) and SQL Server (Dell server); Development System: ANGEL/IIS and SQL Server]

  8. Vulnerabilities in Current Strategy • The ANGEL/IIS or SQL Server hardware, e.g., system motherboard failure • Don’t have ready back-up machine. • Could temporarily use development system. • Likely a minimum half day down-time. • Machine room “fire” • All hardware lost. • Up to one day of lost data (if 24 hours from last backup). • Days of down time!
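
The "up to one day of lost data" figure on slide 8 follows directly from a nightly backup: whatever was written after the last backup is gone. A minimal sketch of that arithmetic, with hypothetical timestamps and a hypothetical function name:

```python
from datetime import datetime, timedelta


def lost_data_window(last_backup: datetime, failure: datetime) -> timedelta:
    """Span of work that cannot be restored: everything written after the last backup."""
    if failure < last_backup:
        raise ValueError("failure precedes the last backup")
    return failure - last_backup


if __name__ == "__main__":
    # Hypothetical times: backup ran at 01:00, machine room lost at 00:30 the next night.
    window = lost_data_window(datetime(2005, 3, 9, 1, 0), datetime(2005, 3, 10, 0, 30))
    print(window)  # close to the 24-hour worst case
```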

  9. Configurations under Investigation Looking for faster recovery time, less potential data loss, through increased redundancy. • Config 1: Identical production and development systems, different locations. • Config 2: Identical production and dev systems, shared data (data filer), Load Balancer (Cisco), different locations.

  10. Config 1 • Identical production and development systems, different locations. • Gains: • In system failure: • If possible, move disks to duplicate system – 4 working hours. • Or, recover data to duplicate systems – perhaps 8 working hours. • Issues: • People intervention still required. • Cost: • Two new systems. [Diagram: ANGEL/IIS (Dell server), SQL Server (Dell server)]

  11. Config 2 • Identical prod and dev systems, shared data, load balancer, different locations. • Gains: • Failure of one ANGEL/IIS system - instantaneous failover to the remaining one. • Failure of SQL Server - reconfigure dev system to point to data filer. • Issues: • Single point of failure unless filer clustered. • Greater complexity may cause downtime. • Cost: • 3 new systems, plus filer (~$30 USD) [Diagram: Load Balancer; ANGEL/IIS (Dell server) x2; Data Filer; SQL Server (Dell server)]
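
The failover gain in Config 2 comes from the load balancer sending traffic only to web servers that pass a health check. The sketch below illustrates that idea in a few lines; it is not how the Cisco load balancer is configured, and the server names and the `pick_server` function are hypothetical.

```python
from typing import Dict, Optional


def pick_server(health: Dict[str, bool]) -> Optional[str]:
    """Return the first server whose health check passed, or None if all are down."""
    for name, healthy in health.items():
        if healthy:
            return name
    return None


if __name__ == "__main__":
    # Hypothetical pool: the first ANGEL/IIS box has failed, the second takes over.
    print(pick_server({"angel-iis-1": False, "angel-iis-2": True}))
```

With both web servers down the function returns `None`, which corresponds to the case the slide does not cover: the site is offline until someone intervenes.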

  12. Providence College (Adam) • As at Waterloo, ANGEL is our LMS; we have used it since Fall 2001. • Support coverage is not 24x7. • Cannot survive lengthy power outages or network outages.

  13. PC Backup and Recovery • System data backup • Back up database and logs to files once per day. • Use Tivoli to back up both DB and file system nightly. • Creates “backup of a backup.” • Course archives • Short term: Archive courses 90 days after term end. • Long term: Store archives to DVD. • Recovery • Like Waterloo, recover Production database in Development environment.

  14. PC’s Redundancy • Today: Robust Production Server • Multiple RAID disks (System, DB, Data) • Dual power supplies and NICs • Access-controlled machine room • UPS [Diagram: Development System: ANGEL/IIS/SQL (Desktop); Production System: ANGEL/IIS/SQL (HP DL380)]

  15. PC’s Future Architecture • This Summer: New Server and SAN • Purchase new server and install O/S and SQL Server on local RAID. • Store database and web files on SAN disk. • In the event of Production hardware failure, connect Production disk to Development server with little downtime. [Diagram: IBM Storage Area Network; Development System: ANGEL/IIS/SQL (Old HP); Production System: ANGEL/IIS/SQL (New HP)]

  16. Kelley Direct On-Line Programs, Indiana University (Ruomiao) • Road to ANGEL • Piloted ANGEL as LMS in Fall 2003 • Spring 2004: all courses delivered via ANGEL • Critical learning platform that connects KD to the students

  17. Kelley Direct On-Line Programs, Indiana University

  18. Kelley Direct On-Line Programs, Indiana University • Current Data Protection Measures • Backup System Backups • Full Backups once a week starting Friday night • Differential Backups every night around 11 PM Database Backups • Full ANGEL SQL database backup every night at 10PM. The database backup output files are then backed up by system tape backups for that night. • Transaction log backups every six hours. The backup tapes are then taken to an offsite location.
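
The database schedule on slide 18 (full backup at 10 PM, transaction log backups every six hours) determines which files a restore needs and how much work is lost. The sketch below works that out for a given failure time. The 10 PM full backup is from the slide; the log backup clock times (00:00, 06:00, 12:00, 18:00) and the function name `restore_plan` are assumptions made for illustration.

```python
from datetime import datetime, timedelta

FULL_HOUR = 22               # nightly full database backup at 10 PM (from the slide)
LOG_HOURS = (0, 6, 12, 18)   # assumed clock times for the six-hourly log backups


def restore_plan(failure: datetime):
    """Return (full backup time, log backups to replay, unrecoverable window)."""
    # Most recent 10 PM at or before the failure.
    full = failure.replace(hour=FULL_HOUR, minute=0, second=0, microsecond=0)
    if full > failure:
        full -= timedelta(days=1)

    # Every log backup taken after that full backup and no later than the failure.
    logs = []
    day = full.replace(hour=0)
    while day <= failure:
        for hour in LOG_HOURS:
            taken = day.replace(hour=hour)
            if full < taken <= failure:
                logs.append(taken)
        day += timedelta(days=1)

    last_point = logs[-1] if logs else full
    return full, logs, failure - last_point


if __name__ == "__main__":
    full, logs, lost = restore_plan(datetime(2005, 3, 10, 14, 30))  # hypothetical failure
    print("restore full backup from", full)
    for taken in logs:
        print("replay log backup from", taken)
    print("work lost:", lost)
```

Under these assumptions the worst-case loss is just under six hours, compared with up to a day when logs are cut only once per day.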

  19. Kelley Direct On-Line Programs, Indiana University • Current System Protection Measures • Disk • Configured with RAID 5 with a spare disk • Dual power connections • UPS System connection (30 min.) • Spare Chassis • Test server has identical hardware and server as a spare chassis

  20. Kelley Direct On-Line Programs, Indiana University • Current Recovery Practices • File or Database Restore • Restore from disk, tape backups, or individual developers’ machines. • System Component Failure • Replace the faulty component(s) from the spare chassis (test server) or move the entire disk array from the production server to the test server. • Total System Failure or Disk Array Failure • Rebuild the entire system, possibly on alternate hardware. • All the ANGEL components will need to be either installed from scratch or restored from backup tapes. Some system components have to be reconfigured manually.
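
The three recovery practices on slide 20 amount to a small runbook keyed by failure type. One way to write that down so it can be looked up or printed during an incident is sketched here; the enum and function names are hypothetical, and the step text is paraphrased from the slide.

```python
from enum import Enum
from typing import List


class Failure(Enum):
    FILE_OR_DATABASE = "file or database loss"
    COMPONENT = "system component failure"
    TOTAL = "total system or disk array failure"


# Steps paraphrased from the slide; the order within each list is illustrative.
RUNBOOK = {
    Failure.FILE_OR_DATABASE: [
        "restore from disk, tape backups, or a developer's machine",
    ],
    Failure.COMPONENT: [
        "replace the faulty component from the spare chassis (test server)",
        "or move the entire disk array from production to the test server",
    ],
    Failure.TOTAL: [
        "rebuild the system, possibly on alternate hardware",
        "install ANGEL components from scratch or restore them from tape",
        "reconfigure remaining system components manually",
    ],
}


def recovery_steps(failure: Failure) -> List[str]:
    """Look up the recovery steps for a failure type."""
    return RUNBOOK[failure]


if __name__ == "__main__":
    for step in recovery_steps(Failure.COMPONENT):
        print("-", step)
```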

  21. Kelley Direct On-Line Programs, Indiana University • Challenges for KD ANGEL Environment • Security • ANGEL web server resides on the same physical machine that hosts the ANGEL databases • Scalability • Limited capability to scale performance based on volume • Availability • No redundancy built in. Single server design. Any component failure means downtime • Shrinking Maintenance Window (or do we still have one?) • (continued on next slide)

  22. Kelley Direct On-Line Programs, Indiana University • Challenges for KD ANGEL Environment • Storage Capacity • Limited expansion capability • Recoverability • Single copy of production data on disk. Tape restoration is time consuming and means data loss • Availability • No redundancy built in. Single server design. Any component failure means downtime • Growth • Significant enrollment growth is expected for the programs in the next three years • Development Environment • Developers are coding on own machines. Configurations differ from production environment. Less efficient.

  23. Kelley Direct On-Line Programs, Indiana University • Some Questions • How can backend infrastructure better support the vision of the on-line programs? • How to plan system capacity when the program changes (such as enrollment growth)? • How to better protect student data? • What are the available options for long-term data retention? • How to better meet the requirements for less service interruption? • What should we do to ensure faster recovery of the ANGEL systems?

  24. Kelley Direct On-Line Programs, Indiana University

  25. Penn State Environment (Terry, Crystal) • Support coverage is 24x7 • Backup Power (generator) • Redundant network connectivity • Failover capability • Mirrored storage • Daily Backups/Off-site storage • Daily Maintenance (5-7 am) • Archive (courses, inactive groups)

  26. Constraints • Backup • SQL: 3 hours • File: 3-4 days • Restoration • SQL: 1.5 hours • File: 2 min. - ??
