
Aug 4th Extracts Outage: Synopsis / Impacts

A detailed overview of the recent database outages, recovery efforts, and impacts on extract generation. Root causes, recovery timeline, and consequences are highlighted, along with future quality assurance measures.

mlieb

Presentation Transcript


  1. Aug 4th Extracts Outage: Synopsis / Impacts
     Trey Felton – ERCOT IT

  2. Synopsis
     • On Aug 4th, the Production Database used to generate extracts encountered file corruption, which persisted throughout the week
       • Service Requests were placed with hardware/software vendors
       • File corruption was limited to EMMS extracts only
     • On Aug 8th, the Production Database used to generate extracts encountered file corruption again
       • The file corruption resulted from a set of disks allocated to both the Production Database and the Database Backup location on the same server
       • The database was temporarily taken offline to allow recovery efforts
       • File corruption was limited to EMMS extracts only
     • On Aug 13th, a third incident occurred in which the Production Database used to generate extracts encountered file system corruption
       • The Production Database was taken offline to rebuild the entire disk and resolve the file system corruption issue
       • Currently there is a 3-4 day lag on several extracts
     • Root Cause:
       • The file corruption resulted from a set of disks allocated to both the Production Database and the Database Backup location on the same server
       • A check for this error is now included in the final quality assurance script for disk allocation and in the computing vendor's health check process
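     The slides do not show the quality assurance script itself. As a rough illustration of the kind of check the root cause implies, the sketch below flags any disk that appears under more than one purpose. The allocation map, the disk names, and the function `find_shared_disks` are all hypothetical and are not taken from ERCOT's tooling.

```python
from collections import defaultdict


def find_shared_disks(allocations):
    """Return {disk: sorted purposes} for disks assigned to more than one purpose.

    `allocations` maps a purpose name to the list of disks assigned to it.
    """
    purposes_by_disk = defaultdict(set)
    for purpose, disks in allocations.items():
        for disk in disks:
            purposes_by_disk[disk].add(purpose)
    return {
        disk: sorted(purposes)
        for disk, purposes in purposes_by_disk.items()
        if len(purposes) > 1
    }


# Hypothetical allocation map: disk03 is assigned to both purposes.
allocations = {
    "production_database": ["disk01", "disk02", "disk03"],
    "database_backup": ["disk03", "disk04"],
}

conflicts = find_shared_disks(allocations)
for disk, purposes in sorted(conflicts.items()):
    print(f"{disk} is allocated to: {', '.join(purposes)}")
```

     A check like this would run after any storage change, such as the capacity addition on Aug 4, and block the change if the result is non-empty.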

  3. Timeline
     • Aug 4
       • Allocated additional storage capacity to the Production database
       • File corruption detected in Production database
       • Opened Service Requests with vendors
     • Aug 5/6
       • System health checks completed by computing and storage vendors; all systems pass
       • Forecast_Data_Initial* extract for 8/5 failed; successfully rerun and posted @ 8/7 10:55 PM
       • Forecast_Data_Initial* extract for 8/6 failed; successfully rerun and posted @ 8/8 1:10 AM
       • Forecast_Data_Initial* extract for 8/7 failed; successfully rerun and posted @ 8/8 1:11 AM
       • Market Notice posted
       • Recovery of Production database began
     • Aug 8
       • Recovery of Production database completed
       • Second file corruption incident detected in Production database in the afternoon
       • Emergency shutdown of Production database; servers recycled
       • Extract Scheduler on TML unavailable to the market
       • Ancillary_Services_Initial extract for 08/08 failed
       • Forecast_Data_Initial extract for 08/08 failed
       • Market_Information_Initial extract for 08/08 failed
       • Recovery of the Production database began
     * Forecast_daily was posted per SLA on 8/5 through 8/8. Forecast_daily data is used for shadow settlement by MPs. Forecast Initial is just a consolidation of the last 30 daily extracts.

  4. Timeline (continued)
     • Aug 9
       • Continued recovery of Production database
     • Aug 10
       • Recovery Status: Partial
       • Production database brought online and the following replication streams restarted: LODSTAR, SIEBEL, EIF
     • Aug 12
       • Recovery Status: All files successfully recovered with no lost data
       • Replication restarted on Production database from the point the database was shut down on Friday afternoon
         • Roughly 4 days behind in replicating the data into the Production database
     • Aug 13
       • The database detected file system corruption in the same disk group, a known possibility with the prior recovery
       • Shutdown of Production database (back online at 2:45 PM)
       • Extract Scheduler on TML unavailable to the market (back online at 2:45 PM)
       • Self Scheduled Energy Services: posted on 8/13, removed and reposted on 8/14
       • Scheduled and Actual Load: posted on 8/13, removed and reposted on 8/14
       • Actual Resource Output: posted on 8/13, removed and reposted on 8/14
       • Ancillary Service Deployments: posted on 8/13, removed and reposted on 8/14

  5. Timeline (continued)
     • Aug 14
       • Replication, report and extract processing of the market system restarted shortly after midnight
       • Market Notice sent

  6. Status of Extracts

  7. Backup Slides

  8. Recovered Extracts before Aug 13 File System Corruption

     Operating Day  Extract Name                       RUN_STATUS        POSTED TIME
     Fri 8-Aug      Market_Information_Initial         POSTING COMPLETE  8/12/08 10:10:30 AM
                    Forecast_Data_Initial              POSTING COMPLETE  8/12/08 1:40:30 PM
                    Ancillary_Services_Initial         POSTING COMPLETE  8/12/08 10:10:30 AM
     Sat 9-Aug      ASDEPLOYMENTS                      POSTING COMPLETE  8/12/08 1:10:30 PM
                    Aggregated Bids Stacks             POSTING COMPLETE  8/12/08 2:30:30 PM
                    act_res_output                     POSTING COMPLETE  8/12/08 1:10:30 PM
                    Self_Sch_Energy_Services           POSTING COMPLETE  8/12/08 12:55:30 PM
                    Sched_and_Actual_Load              POSTING COMPLETE  8/12/08 12:55:30 PM
                    Market_Information_Daily           POSTING COMPLETE  8/12/08 12:55:30 PM
                    Daily_Individual_Replacement_Bids  POSTING COMPLETE  8/12/08 12:41:30 PM
                    Daily_Individual_BESBids           POSTING COMPLETE  8/12/08 12:41:30 PM
                    Daily_Individual_AncSvc_Bids       POSTING COMPLETE  8/12/08 12:41:30 PM
                    60day_ResourceplanDetails          POSTING COMPLETE  8/12/08 12:41:30 PM
                    60day_ENERGY_SCHEDULES             POSTING COMPLETE  8/12/08 12:41:30 PM
                    60day_ANCILLARYSERVICESCHEDULES    POSTING COMPLETE  8/12/08 12:41:30 PM
                    Forecast_Data_Daily                POSTING COMPLETE  8/12/08 12:16:30 PM
                    Bids_and_Schedules_Daily           RUNNING           anticipate posting in 30 mins
                    Ancillary_Services_Daily           POSTING COMPLETE  8/12/08 1:27:30 PM