Emergency Database Failover: Impacts & Recovery Plan
Aaron Smallwood – ERCOT IT
Joel Mickey – ERCOT Market Operations
Emergency Database Failover
• Summary:
  • ERCOT conducted an emergency database failover on April 21, 2008, following a hardware failure
  • ERCOT performs controlled database failovers monthly, but this one was different because of the nature of the hardware failure
    • Normally, the database is ‘stopped’ at one site and then ‘started’ at the other in a controlled manner
    • In this case, the database ‘hung’: it became unresponsive, and data could not be written to or read from the database
• The impacts:
  • Transactions were prevented from updating downstream databases
  • The missing transaction updates left a gap in the transactional records of the downstream databases (out of sync with production)
  • The affected extracts for April 21 through April 30 are listed in the market notices for the incident
• ERCOT considers this to be an isolated incident and not a systemic problem
Recovery Plan
• Goal:
  • Using a restored copy of the production database, recover the transactions that are missing from downstream databases and are needed to perform price adjustment calculations
• Plan:
  • Build an environment identical to the production environment
    • Servers, storage, applications
  • Restore data to its pre-crash state (4/21)
    • Over 20 TB of data to restore from tape (in progress)
  • Using the restored environment and data, extract the transactions missing from downstream databases and then roll forward all subsequent transactions
  • ERCOT Market Operations will then review the data for reasonableness and approve it for reporting and settlement
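The "extract missing transactions, then roll forward" step can be illustrated with a minimal Python sketch. This is not ERCOT's tooling: the `Transaction` record, its `txn_id` and `committed_at` fields, and the helper names `find_missing` and `roll_forward` are all hypothetical, and the downstream database is modeled as a plain dictionary keyed by transaction id. The sketch only shows the general idea of comparing a restored copy against a downstream store and replaying the gap in commit order.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class Transaction:
    # Hypothetical record shape; real schemas will differ.
    txn_id: int
    committed_at: datetime
    payload: str


def find_missing(restored, downstream_ids):
    """Return transactions present in the restored copy but absent
    downstream, ordered by commit time (ties broken by id)."""
    missing = [t for t in restored if t.txn_id not in downstream_ids]
    return sorted(missing, key=lambda t: (t.committed_at, t.txn_id))


def roll_forward(downstream, missing):
    """Apply missing transactions to the downstream store in the order
    given. Already-present ids are skipped, so a re-run is harmless.
    Returns the number of transactions actually applied."""
    applied = 0
    for t in missing:
        if t.txn_id not in downstream:
            downstream[t.txn_id] = t
            applied += 1
    return applied


if __name__ == "__main__":
    restored = [
        Transaction(3, datetime(2008, 4, 21, 10, 5), "c"),
        Transaction(1, datetime(2008, 4, 21, 9, 0), "a"),
        Transaction(2, datetime(2008, 4, 21, 9, 30), "b"),
    ]
    downstream = {1: restored[1]}  # only transaction 1 made it downstream
    gap = find_missing(restored, downstream.keys())
    print([t.txn_id for t in gap])        # ids missing downstream, oldest first
    print(roll_forward(downstream, gap))  # number of transactions applied
```

Skipping ids that already exist downstream keeps the replay idempotent, which matters if a recovery run is interrupted and restarted. The review-and-approve step by Market Operations is a manual check and is deliberately not modeled here.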
Questions
• Actions to prevent future occurrences:
  • Nodal market databases will be on newer hardware with more fault tolerance and redundancy
  • Potential re-architecture of the system integration between the databases
  • Lessons learned are being documented, but there is no plan yet
    • Resources are focused on the data recovery efforts
• Questions:
  • When will non-spinning reserve price adjustments for PRR 650 be completed?
    • When the transactional data has been restored, reviewed, and approved
  • What is the timeline?
    • The environment build is complete; we anticipate that the data restore from tape will be the longest task
    • We are estimating weeks, not months, to complete the plan
    • Unknowns include the amount of time needed to restore from tape and the quality of the data once it has been restored
• Market notices will continue to be sent to indicate status