Data Center Outage BRIEFING Information and educational technology January 10–11, 2014
Agenda • Review of Events • Cause Analysis and Current Efforts • Communications • Vulnerabilities • Mitigation Plans • Lessons Learned • Communication Improvements
Review of Events: • Summary: 3 incidents • Friday, Jan 10: Virtualization and uConnect firewall • Saturday, Jan 11: Virtualization • Virtualization outage affected most major systems on campus • Some mitigation lessened impact on Saturday • uConnect firewall outage • Extended email, authentication and DNS service outage for uConnect users (additional 4 hours)
Outage Timeline: 3 incidents [timeline figure; labels recovered from the slide, original layout and exact ordering not preserved] • Friday, January 10th • Incident 1, Virtualization Outage: VM hosts rebooted; VM guests started to restore services; CAS & Smartsite restored; email routing restored; most services restored except uConnect • Incident 2, uConnect Firewall Outage: firewall failover to secondary without success; hard power cycle restores firewall and uConnect services; all services restored • Saturday, January 11th • Incident 3, Virtualization Outage: VM hosts rebooted; VM guests started to restore services; most services restored • Additional label: Virtualization Degradation (critical services stable)
Services Impacted • Admissions • Banner • Central Authentication Services (CAS)* • Computing Accounts • Electronic Death Registry System • Data Center File Services • DaFIS • DavisMail • Data Center Virtualization • Final Grade Submission • Geckomail • Kuali Financial Services • Identity and Access Management • IET Web Sites • MyInfoVault • MyUCDavis • ServiceNow and SSC Case Management • Shibboleth • Smartsite* • Time Reporting System • Web Content Management System • uConnect Services • UC Davis Directory Listings • UC Davis Home Site * CAS was restored to physical hardware on Friday at 1:40pm, which restored dependent services such as departmental applications and Smartsite.
Communications • Regular outage communication channels were unavailable • Email • Websites (status page, www.ucdavis.edu) • Communications issued • Automated notices on IT-Express phone system (updated 3 times) • Twitter updates (8 on 01/10; 5 on 01/11) • Progress updates on Status web page (status.ucdavis.edu) starting Friday mid-afternoon • Email to 300+ campus technologists (01/11)
Vulnerabilities • Hardware is redundant, but many services are hosted in a single location on a single SAN • Critical uConnect directory services reside on a single network • The system status page is dependent on the local infrastructure • IET is not aware of all critical services that rely on our infrastructure
Mitigation Plans • SAN software upgrade completed • Implement diversification for critical services (Authentication, uConnect Directory Services, Status Page, WWW) • Integrate cloud services to improve diversity • Develop a process to identify critical campus services dependent on IET infrastructure
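One way to approach the last mitigation item is to record which services rely on which components and then walk that map outward from a component that might fail. The following is a minimal sketch, not IET's actual process: the service names are taken from the slides, but the dependency edges and the `affected_by` helper are illustrative assumptions.

```python
from collections import deque

# Hypothetical dependency map: service -> components it relies on directly.
# The names appear in the briefing; the edges are assumptions for illustration.
DEPENDS_ON = {
    "Smartsite": ["CAS"],
    "CAS": ["Data Center Virtualization"],
    "Status Page": ["Data Center Virtualization"],
    "Data Center Virtualization": ["SAN"],
    "uConnect Services": ["uConnect firewall"],
}


def affected_by(failed, depends_on):
    """Return every service that directly or transitively depends on `failed`."""
    # Invert the map: component -> services that rely on it directly.
    dependents = {}
    for service, deps in depends_on.items():
        for dep in deps:
            dependents.setdefault(dep, set()).add(service)

    # Breadth-first walk outward from the failed component.
    seen = set()
    queue = deque([failed])
    while queue:
        current = queue.popleft()
        for service in dependents.get(current, ()):
            if service not in seen:
                seen.add(service)
                queue.append(service)
    return seen


if __name__ == "__main__":
    for name in sorted(affected_by("SAN", DEPENDS_ON)):
        print(name)
```

Under these assumed edges, a SAN failure reaches the virtualization layer, CAS, the status page, and Smartsite, which mirrors the kind of single-location exposure listed under Vulnerabilities.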
Lessons Learned • Move from disaster recovery to business continuity • Normal communication channels were unavailable, and IET was not prepared for that • Communication and decision-making protocols are needed for when normal channels are unavailable
Communication Improvements • Review service outage communication protocols, contacts, and venues • Ensure multiple modes of communication (text, cell, email, web, phone, social media) are available; leverage new WarnMe system extension for non-emergency notifications • Closer collaboration with Emergency Manager and StratComm • Ensure broad awareness of outage communication channels • Launch cloud-based status page (StatusPage.io) • Leverage AggieFeed for broader communication
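The "multiple modes of communication" item can be sketched as a broadcast that attempts every channel and records which ones failed, so that an outage of email or the web does not silence the notice. This is a minimal sketch under stated assumptions: the channel names and the `broadcast` helper are hypothetical, and the send functions are stand-ins rather than real integrations with WarnMe, AggieFeed, or any other system.

```python
def broadcast(message, channels):
    """Try every channel; return (delivered, failed) lists of channel names."""
    delivered, failed = [], []
    for name, send in channels:
        try:
            send(message)
            delivered.append(name)
        except Exception:
            # One unavailable channel must not block the remaining ones.
            failed.append(name)
    return delivered, failed


# Stand-in senders (assumptions): email is down, the other channels work.
def send_email(message):
    raise ConnectionError("mail routing unavailable")


sent_log = []


def send_text(message):
    sent_log.append(("text", message))


def send_social(message):
    sent_log.append(("social", message))


if __name__ == "__main__":
    channels = [("email", send_email), ("text", send_text), ("social", send_social)]
    delivered, failed = broadcast("Data center outage: update at 2pm", channels)
    print("delivered:", delivered)
    print("failed:", failed)
```

Returning the failed list, rather than stopping at the first error, lets staff see at a glance which channels still need a manual follow-up.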