A detailed overview of the recent database outages, recovery efforts, and impacts on extract generation. Root causes, recovery timeline, and consequences are highlighted, along with future quality assurance measures.
Aug 4th Extracts Outage: Synopsis / Impacts
Trey Felton – ERCOT IT
Synopsis
• On Aug 4th, the Production Database used to generate extracts encountered file corruption, which persisted throughout the week
  • Service Requests were placed with hardware/software vendors
  • File corruption was limited to EMMS extracts only
• On Aug 8th, the Production Database used to generate extracts encountered file corruption again
  • The file corruption resulted from a set of disks allocated to both the Production Database and the Database Backup location on the same server
  • Database was temporarily taken offline to allow recovery efforts
  • File corruption was limited to EMMS extracts only
• On Aug 13th, a third incident occurred in which the Production Database used to generate extracts encountered File System corruption
  • The Production Database was taken offline to rebuild the entire disk and resolve the File System corruption issue
• Currently there is a 3-4 day lag on several extracts
• Root Cause:
  • The file corruption resulted from a set of disks allocated to both the Production Database and the Database Backup location on the same server
  • A check for this error is now included in the final quality assurance script for disk allocation and in the computing vendor's health check process
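The quality assurance script itself is not shown in these slides. As a rough sketch only, a check for the root cause described above (the same disks allocated to more than one purpose) could look like the following. The function name `find_shared_disks`, the role names, and the disk IDs are all hypothetical and are not taken from ERCOT's systems.

```python
from collections import defaultdict


def find_shared_disks(allocations):
    """Return {disk: sorted list of roles} for every disk assigned to more than one role.

    `allocations` maps a role name (hypothetical, e.g. "production_db")
    to the list of disk identifiers allocated to that role.
    """
    roles_by_disk = defaultdict(set)
    for role, disks in allocations.items():
        for disk in disks:
            roles_by_disk[disk].add(role)
    # A disk is a conflict when two or more roles claim it.
    return {disk: sorted(roles) for disk, roles in roles_by_disk.items() if len(roles) > 1}


# Hypothetical allocation map: disk03 is claimed by both roles.
allocations = {
    "production_db": ["disk01", "disk02", "disk03"],
    "db_backup": ["disk03", "disk04"],
}

conflicts = find_shared_disks(allocations)
for disk, roles in conflicts.items():
    print(f"{disk} is allocated to multiple roles: {', '.join(roles)}")
```

A check of this kind would be run after any storage allocation change, and a non-empty result would block the change from being signed off.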
Timeline
• Aug 4
  • Allocated additional storage capacity to the Production database
  • File corruption detected in Production database
  • Opened Service Requests with vendors
• Aug 5/6
  • System health checks completed by computing and storage vendors; all systems pass
  • Forecast_Data_Initial* extract for 8/5 failed; successfully rerun and posted @ 8/7 10:55 PM
  • Forecast_Data_Initial* extract for 8/6 failed; successfully rerun and posted @ 8/8 1:10 AM
  • Forecast_Data_Initial* extract for 8/7 failed; successfully rerun and posted @ 8/8 1:11 AM
  • Market Notice posted
  • Recovery of Production database began
• Aug 8
  • Recovery of Production database completed
  • Second file corruption incident detected in Production database in the afternoon
  • Emergency shutdown of Production database; servers recycled
  • Extract Scheduler on TML unavailable to the market
  • Ancillary_Services_Initial extract for 08/08 failed
  • Forecast_Data_Initial extract for 08/08 failed
  • Market_Information_Initial extract for 08/08 failed
  • Recovery of the Production database began
*Forecast_daily was posted as per SLA on 8/5 through 8/8. Forecast_daily data is used for Shadow settlement by MPs. Forecast Initial is simply a consolidation of the last 30 daily extracts.
Timeline (continued)
• Aug 9
  • Continued recovery of Production
• Aug 10
  • Recovery Status: Partial
  • Production database brought online and the following replication streams restarted: LODSTAR, SIEBEL, EIF
• Aug 12
  • Recovery Status: All files successfully recovered with no lost data
  • Replication restarted on Production database from the point the database was shut down on Friday afternoon
  • Roughly 4 days behind in replicating the data into the Production database
• Aug 13
  • The database detected a file system corruption with the same disk group, a known possibility with the prior recovery
  • Shutdown of Production database (back online at 2:45 PM)
  • Extract Scheduler on TML unavailable to the market (back online at 2:45 PM)
  • Self Scheduled Energy Services: posted on 8/13, removed and reposted on 8/14
  • Scheduled and Actual Load: posted on 8/13, removed and reposted on 8/14
  • Actual Resource Output: posted on 8/13, removed and reposted on 8/14
  • Ancillary Service Deployments: posted on 8/13, removed and reposted on 8/14
Timeline (continued)
• Aug 14
  • Replication, report and extract processing of the market system restarted shortly after midnight
  • Market Notice sent
Recovered Extracts before Aug 13 File System Corruption

Operating Day  Extract Name                       RUN_STATUS        POSTED TIME
Fri 8-Aug      Market_Information_Initial         POSTING COMPLETE  8/12/08 10:10:30 AM
               Forecast_Data_Initial              POSTING COMPLETE  8/12/08 1:40:30 PM
               Ancillary_Services_Initial         POSTING COMPLETE  8/12/08 10:10:30 AM
Sat 9-Aug      ASDEPLOYMENTS                      POSTING COMPLETE  8/12/08 1:10:30 PM
               Aggregated Bids Stacks             POSTING COMPLETE  8/12/08 2:30:30 PM
               act_res_output                     POSTING COMPLETE  8/12/08 1:10:30 PM
               Self_Sch_Energy_Services           POSTING COMPLETE  8/12/08 12:55:30 PM
               Sched_and_Actual_Load              POSTING COMPLETE  8/12/08 12:55:30 PM
               Market_Information_Daily           POSTING COMPLETE  8/12/08 12:55:30 PM
               Daily_Individual_Replacement_Bids  POSTING COMPLETE  8/12/08 12:41:30 PM
               Daily_Individual_BESBids           POSTING COMPLETE  8/12/08 12:41:30 PM
               Daily_Individual_AncSvc_Bids       POSTING COMPLETE  8/12/08 12:41:30 PM
               60day_ResourceplanDetails          POSTING COMPLETE  8/12/08 12:41:30 PM
               60day_ENERGY_SCHEDULES             POSTING COMPLETE  8/12/08 12:41:30 PM
               60day_ANCILLARYSERVICESCHEDULES    POSTING COMPLETE  8/12/08 12:41:30 PM
               Forecast_Data_Daily                POSTING COMPLETE  8/12/08 12:16:30 PM
               Bids_and_Schedules_Daily           RUNNING           anticipate posting in 30 mins
               Ancillary_Services_Daily           POSTING COMPLETE  8/12/08 1:27:30 PM
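One way to put a number on the posting delay is to compare each posted time listed above with its operating day. The sketch below measures from midnight at the start of the operating day, which is an assumption made here for illustration; the slides do not say how lag is measured. The function name `posting_lag_days` is hypothetical, and the year 2008 is inferred from the posted timestamps.

```python
from datetime import datetime


def posting_lag_days(operating_day, posted_time):
    """Days elapsed from the start of the operating day to the posting time."""
    start = datetime.strptime(operating_day, "%m/%d/%y")
    posted = datetime.strptime(posted_time, "%m/%d/%y %I:%M:%S %p")
    return (posted - start).total_seconds() / 86400


# Market_Information_Initial for operating day Fri 8-Aug
lag_fri = posting_lag_days("8/8/08", "8/12/08 10:10:30 AM")
# ASDEPLOYMENTS for operating day Sat 9-Aug
lag_sat = posting_lag_days("8/9/08", "8/12/08 1:10:30 PM")

print(f"Fri 8-Aug extract posted {lag_fri:.1f} days after start of operating day")
print(f"Sat 9-Aug extract posted {lag_sat:.1f} days after start of operating day")
```

Under this convention, the two sample extracts were posted between three and five days after the start of their operating days.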