PostalOne! Outage 2/5/2010 MTAC Report
John Edgar, Vice President, Information Technology Solutions
PostalOne! Architecture (diagram)
• Eagan DMZ: NetScaler fronting www.uspspostalone.com, distributing traffic across four WebSphere cells, each with its own web servers and application servers
• Eagan PostalOne! Secure Enclave: RAC listeners and clustered database servers on the Eagan SAN
• Backup and recovery copies: daily BCV (Eagan) and DR database (San Mateo)
PostalOne! Details
• PostalOne! is a very large scale database with a high data change rate
  • Currently 10 TB of live data, with 25% of the data changing per week
• Redundant backup processes are intended to ensure complete data recovery in the event of a system outage
  • Daily disk image of the full database
  • Weekly tape archival of the full database
  • Multiple daily transaction-level logs archived
• The daily disk image is currently ~14 TB; recovery time from the image is ~14 hours. The image is released daily to create a new image. (A rough recovery-time calculation follows this list.)
• Tape images are kept for 30 days, which provides four images for recovery if needed. Recovery from tape takes about 5 times longer than from disk.
• Multiple daily transaction backups are taken and logged to support incremental recovery activities.
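As a rough illustration (not part of the original slide), the quoted figures imply the following order-of-magnitude recovery estimates. The 14 TB image size, the ~14-hour disk restore, and the "about 5 times longer" tape figure come from the bullets above; the derived throughput is an assumption.

```python
# Back-of-the-envelope figures implied by the slide above.
# The 14 TB image size, ~14-hour disk restore, and "about 5 times longer"
# tape figure come from the report; the derived throughput is illustrative.

IMAGE_SIZE_TB = 14        # size of the daily disk (BCV) image
DISK_RESTORE_HOURS = 14   # quoted recovery time from the disk image
TAPE_SLOWDOWN = 5         # tape recovery is "about 5 times longer" than disk

disk_rate_tb_per_hour = IMAGE_SIZE_TB / DISK_RESTORE_HOURS   # roughly 1 TB/hour
tape_restore_hours = DISK_RESTORE_HOURS * TAPE_SLOWDOWN      # roughly 70 hours

print(f"Disk image restore: ~{DISK_RESTORE_HOURS} h (~{disk_rate_tb_per_hour:.1f} TB/h)")
print(f"Tape restore:       ~{tape_restore_hours} h (~{tape_restore_hours / 24:.1f} days), "
      "plus replay of incremental transaction logs")
```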
PostalOne! Outage Timeline
• Friday 2/5/2010
  • 6:08 PM – End users reported system response problems with PostalOne!; technical teams began investigating.
  • 6:47 PM – Errors identified within the database; technical teams worked through multiple options to restore the system.
  • 8:00 PM – Scheduled daily disk image backup kicked off.
  • 9:45 PM – Attempted restore of the corrupted table from the daily disk image backup; the restore failed.
  • 10:57 PM – Began a full restore from the previous week's tape backup, followed by reapplication of incremental transactions.
• Wednesday 2/10/2010
  • 12:30 AM – PostalOne! system operational and transaction processing resumed. (Elapsed times are worked out below.)
• Friday 2/12/2010
  • COB – Based on counts of processed postage statements for the week, the majority of the backlog has been addressed.
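For context, the elapsed times implied by the timestamps above can be worked out directly; this calculation is added here as an illustration and does not appear in the original report.

```python
from datetime import datetime

# Timestamps copied from the timeline above (year 2010, local time).
first_reports = datetime(2010, 2, 5, 18, 8)   # Fri 2/5,  6:08 PM - users report problems
tape_restore  = datetime(2010, 2, 5, 22, 57)  # Fri 2/5, 10:57 PM - full tape restore begins
operational   = datetime(2010, 2, 10, 0, 30)  # Wed 2/10, 12:30 AM - processing resumes

print("Total outage:         ", operational - first_reports)  # about 4 days, 6 hours
print("Tape restore + replay:", operational - tape_restore)   # about 4 days, 1.5 hours
```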
Cause and Future Prevention
• Cause of outage: disk-level storage corruption
• Preventive actions to be taken:
  • Acquire and implement additional storage for a second BCV copy
  • Revise recovery procedures to keep one BCV as a database clone
  • Implement SNAP backups twice daily, in addition to the BCV, to provide further incremental copies of data, minimize future recovery effort, and enable more rapid recovery (see the illustrative sketch below)
  • Develop or acquire additional automated disk management and error-checking routines to enhance recovery capabilities
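The sketch below is a hypothetical illustration of why layering twice-daily SNAP copies on top of the daily BCV image speeds recovery: fewer hours of transaction logs need to be replayed after restoring the most recent copy. Only the copy frequencies reflect the slide; the log-replay rate is an assumed figure.

```python
# Hypothetical illustration of why more frequent point-in-time copies speed recovery.
# Only the copy frequencies (daily image vs. twice-daily SNAP) reflect the slide;
# the log-replay rate below is an assumed figure for illustration.

LOG_REPLAY_HOURS_PER_HOUR_OF_LOGS = 0.5   # assumption: 1 h of logs replays in ~30 min

schemes = {
    "daily disk image only":    24,  # worst case: a full day of logs to replay
    "image + twice-daily SNAP": 12,  # worst-case gap halved by the extra snapshots
}

for scheme, gap_hours in schemes.items():
    replay = gap_hours * LOG_REPLAY_HOURS_PER_HOUR_OF_LOGS
    print(f"{scheme:25} up to {gap_hours:2d} h of logs -> ~{replay:.0f} h of replay "
          "on top of the image restore")
```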
PostalOne! Outage Contingency Operations Review
Pritha Mehra, Vice President, Business Mail Entry and Payment Technologies
Contingency Operations
• Revised contingency operations to extend the 72-hour limit
• Requested manual postage statements and Summary Postage Statement Reports
• Utilized the manual log to record all transactions
• Communications to mailers and USPS
Contingency Operations
• Conducted extensive communications
  • DMM Advisory, MTAC and MTAC workgroups
  • PostalOne! users, PostalOne! User Group
  • RIBBS, Gateway
  • Officers
  • Stakeholders – Sales, BSN, BME
  • P&C Weekly
  • BME Newsletter
  • Webinars
  • Mailers
  • CPP Publishers
  • Area Marketing Managers & CSPAs
  • BSN and Sales
  • Business Mail Acceptance, DMU Clerks
Contingency Operations
• Lessons Learned
  • Communications worked very well
  • Revised contingency procedures worked
• Contingency Plan Revisions for Greater Clarity
  • Updated manual logs
  • Contingency time limits subject to the expected recovery timeframe
  • Checklist for HDQs on communications
  • Restoration procedures for eDoc/postage statements
  • Restoration for the Special Postage Payment System
  • Verification results recording upon restoration
  • Continuation of Operations Plan