Emergency Database Failover: Impacts & Recovery Plan
Trey Felton – ERCOT IT
Synopsis
ISM – Information Services Master Database
DB – Database
EDW – Electronic Data Warehouse
Synopsis: Failover
• Emergency DB failover on April 21st, 2008
• Market DB (which feeds ISM) became unresponsive
  • Data could not be written/read
• Synchronization issues caused a 24 hr gap in data
  • Propagated through to ISM
[Diagram label: Out of synch (24 hrs)]
Synopsis: Failover
• Physical Standby brought online
• ISM rebuilt through Source data to recover affected extracts
Impacts
• Impacts:
  • Market transactions were prevented from updating ISM through the Logical Standby
    • Market DB utilizes a standby to prevent outages / performance degradations
    • Logical Standby (RSS) became out of synch with the Physical Standby by 24 hrs
      • April 21 at 10:44 am through April 22 at 11:14 am
    • Other DBs feeding ISM continued normally (only Market DB was out of synch)
  • Priority of rebuild led to the Standby being rebuilt before the RSS
    • Market DB has to be kept up
    • This prolonged the outage to the EDW and affected extracts
  • Prices had to be recalculated and extracts restored from Source
    • Price adjustments for NSRS were completed June 5th
    • Missing extracts for April 21 - April 30 were completed on July 1st
• Why did recovery take so long?
  • ISM generates 25–35 GB of data per day
  • Data was restored from Source back to April 1st
  • 120 terabytes had to be restored in order to roll forward through the transaction gap
  • Archive log changes were applied during the 24-hour gap
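The figures on this slide can be checked with some simple arithmetic. The sketch below is illustrative only: it reads the two gap timestamps in chronological order (April 21 at 10:44 am to April 22 at 11:14 am), assumes the roll-forward ran from April 1 through April 22, and uses a purely hypothetical restore throughput of 500 MB/s, which the slides do not state. All function names are made up for this example.

```python
from datetime import date, datetime

# Gap window as read from the slide, taken in chronological order (assumption).
GAP_START = datetime(2008, 4, 21, 10, 44)
GAP_END = datetime(2008, 4, 22, 11, 14)

# Figures quoted on the slide.
ISM_GB_PER_DAY = (25, 35)      # daily ISM data volume, low and high end
RESTORE_FROM = date(2008, 4, 1)
RESTORED_TB = 120


def gap_hours(start, end):
    """Length of the out-of-synch window in hours."""
    return (end - start).total_seconds() / 3600


def ism_volume_gb(first_day, last_day, gb_per_day):
    """Low/high estimate of ISM data generated over an inclusive date range."""
    days = (last_day - first_day).days + 1
    low, high = gb_per_day
    return days * low, days * high


def restore_hours(terabytes, mb_per_second):
    """Hours needed to move `terabytes` at a sustained rate (decimal units).

    The throughput is a hypothetical input; the slides do not give one.
    """
    return terabytes * 1_000_000 / mb_per_second / 3600


if __name__ == "__main__":
    print("gap (hours):", gap_hours(GAP_START, GAP_END))
    print("ISM data, April 1-22 (GB):",
          ism_volume_gb(RESTORE_FROM, date(2008, 4, 22), ISM_GB_PER_DAY))
    # 500 MB/s is an assumed figure, chosen only to show the scale.
    print("restore time at 500 MB/s (hours):", restore_hours(RESTORED_TB, 500))
```

Under these assumptions the raw copy of 120 terabytes alone takes days before any archive logs are applied, which is consistent with the slide's point that the size of the restore, not the 24-hour gap itself, drove the recovery time.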
Emergency Database Failover
• All data was restored with 100% accuracy
• The affected market systems that caused the April failure:
  • Run the balancing energy and ancillary services markets
  • Are not used for wholesale batch or the retail markets
• ERCOT considers this to be an isolated incident and not a systemic problem
Going Forward
• Actions to prevent future occurrences:
  • Nodal market DBs will utilize newer hardware
    • More fault tolerance
    • Redundancy
  • Change of architecture in the replication process for Nodal
    • Proof of Concept recently introduced into the Nodal market systems
    • Testing underway
  • ERCOT is conducting a risk/cost analysis of several options for these Zonal systems
    • To be presented to TAC in August
  • New Backups / Recovery Procedures
    • Project initiated to stabilize our database backup procedures
    • Shorter recovery time
Data Recovery
NOTICE DATE: July 1, 2008
NOTICE TYPE: W-A042308-48 UPDATE Extracts - Wholesale
CLASSIFICATION: Public
SHORT DESCRIPTION: ERCOT has completed recovery of the missing data for April 21 through April 30, 2008.
INTENDED AUDIENCE: QSEs
DAY AFFECTED: April 21 through April 30, 2008
LONG DESCRIPTION: ERCOT conducted an emergency database failover on April 21, 2008 following a hardware failure. This database failover resulted in an out-of-synch data problem from April 21 through April 30. ERCOT developed a phased process to attempt to thoroughly recover the missing data. The missing data has been recovered for the following extracts. A market notice will be sent when the extracts are expected to be posted.
Act_Res_Output
Ancillary_Services_Daily
Bids_and_Schedules_Daily
Forecast_Data_Daily
Market_Information_Daily
Sched_and_Actual_Load
Self_Sch_Energy_Services
ASDEPLOYMENTS
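For a recipient of this notice, one way to confirm that nothing is still missing is to enumerate every extract and day in the affected window and compare that against what has been received. The sketch below is a hypothetical illustration, not an ERCOT tool: it assumes one file per extract per operating day (the notice does not say this), and the function names are invented for the example. Only the extract names and the date range come from the notice.

```python
from datetime import date, timedelta

# Extract names as listed in the market notice.
EXTRACTS = [
    "Act_Res_Output",
    "Ancillary_Services_Daily",
    "Bids_and_Schedules_Daily",
    "Forecast_Data_Daily",
    "Market_Information_Daily",
    "Sched_and_Actual_Load",
    "Self_Sch_Energy_Services",
    "ASDEPLOYMENTS",
]

# Affected window stated in the notice.
FIRST_DAY = date(2008, 4, 21)
LAST_DAY = date(2008, 4, 30)


def affected_days(first, last):
    """All calendar days from `first` to `last`, inclusive."""
    return [first + timedelta(days=i) for i in range((last - first).days + 1)]


def missing_extracts(received, first=FIRST_DAY, last=LAST_DAY):
    """Return (extract, day) pairs expected in the window but not in `received`.

    Assumes one file per extract per day, which is an assumption of this
    sketch rather than something the notice states.
    """
    expected = [(name, day)
                for day in affected_days(first, last)
                for name in EXTRACTS]
    return [pair for pair in expected if pair not in received]


if __name__ == "__main__":
    # Hypothetical state: only one file has been received so far.
    received = {("Act_Res_Output", date(2008, 4, 21))}
    outstanding = missing_extracts(received)
    print(len(outstanding), "extract/day combinations still outstanding")
```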