
NERC Lessons Learned and EAS Report

OPEAS Meeting 1/14-1/15 2014. Five Lessons Learned published since last OPS meeting; all are EMS related.


Presentation Transcript


  1. NERC Lessons Learned and EAS Report OPEAS Meeting 1/14-1/15 2014

  2. NERC Lessons Learned and EAS Report • Five Lessons Learned published since last OPS meeting • All are EMS related

  3. Inappropriate System Privileges Caused Loss of SCADA Monitoring • An entity experienced a loss of SCADA telemetry—specifically a loss of the channel status indicators—for 76 percent of its transmission system • This problem occurred during the implementation of a scheduled SCADA database update that caused one of the front-end processors to be in an abnormal state. • An incorrect command was used to remedy the situation, which resulted in the channel status indicators being set to a failed state. • Full SCADA functionality was restored 42 minutes later

  4. Inappropriate System Privileges Caused Loss of SCADA Monitoring • Lesson Learned • Entities need to ensure that they fully understand, and verify with their EMS vendors, the correct procedures and commands required in all situations. • Specifically, entities need to better understand how the system responds to various commands.

  5. Indistinguishable Screens during a Database Update Led to Loss of SCADA Monitoring and Control • The operations analyst had two remote desktop sessions open, one for the on-line SCADA server and the other for the backup SCADA server. • The analyst had a third desktop window that was integral to the maintenance workstation. • The operations analyst proceeded to change the maintenance workstation’s database mode from remote (online) to local (offline) to complete final database modifications before the failover.

  6. Indistinguishable Screens during a Database Update Led to Loss of SCADA Monitoring and Control • Due to the identical screens, the mode change and database modifications were inadvertently performed on the on-line SCADA server database instead of the maintenance workstation’s database. • The front-end processor (FEP) processes stopped polling remote terminal units (RTUs) and failover functionality was disabled • Loss of SCADA monitoring and control of BES facilities for 31 minutes and 44 seconds

  7. Indistinguishable Screens during a Database Update Led to Loss of SCADA Monitoring and Control • Lesson Learned • EMS operations analyst mistakenly switched the on-line SCADA server database to local (offline) mode from remote (online) mode, which caused the FEP processes to stop and disabled the failover functionality • The remote desktop screens to each SCADA server, which the operations analyst set up on the maintenance workstation, were identical in appearance. This made it difficult to differentiate between screens and contributed to the analyst mistakenly selecting the inappropriate screen.

  8. Indistinguishable Screens during a Database Update Led to Loss of SCADA Monitoring and Control • Another cause was the lack of a formal written process or checklist for implementing database changes. • The operations analyst was unable to follow proper protocol and failed to verify that he was working on the appropriate device prior to initiating the database modification. • Discussion with the EMS vendor indicated that switching between the local and remote database modes was not an intended feature for the SCADA servers. • The vendor indicated that a future release of EMS software would eliminate the ability to switch database modes on a server.
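The verification step missing from this event, confirming the target device before changing the database mode, could also be enforced in tooling rather than left to visual inspection of look-alike windows. The following is a minimal sketch under stated assumptions, not the entity's or the vendor's procedure: `confirm_target`, `set_database_mode`, and the host names are hypothetical, and the mode change itself is only simulated.

```python
import socket


class WrongTargetError(RuntimeError):
    """Raised when a change is about to run on an unintended host."""


def confirm_target(expected_host, actual_host=None):
    """Raise WrongTargetError unless the session is on the expected host.

    actual_host defaults to the local hostname; it is a parameter so the
    check can be exercised without touching a real server.
    """
    if actual_host is None:
        actual_host = socket.gethostname()
    if actual_host.strip().lower() != expected_host.strip().lower():
        raise WrongTargetError(
            f"expected {expected_host!r} but this session is on {actual_host!r}"
        )


def set_database_mode(mode, expected_host, actual_host=None):
    """Hypothetical wrapper: verify the host first, then change the mode."""
    if mode not in ("local", "remote"):
        raise ValueError(f"unknown database mode: {mode!r}")
    confirm_target(expected_host, actual_host)
    # The real mode change would happen here; this sketch only reports it.
    return f"database mode set to {mode} on {expected_host}"
```

With a guard of this kind, an attempt to switch the on-line SCADA server to local mode while intending to work on the maintenance workstation would stop with an error instead of halting the FEP processes.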

  9. Loss of EMS – IT Communications Disabled • Scheduled control center server maintenance was being performed, which required the local authentication server to be taken out of service. • By design, control center EMS application authentication should have rerouted automatically to a remote authentication server when the local server was taken out of service. • Contrary to expectations and design, the automatic rerouting of authentication traffic did not occur and the EMS application was impacted.

  10. Loss of EMS – IT Communications Disabled • As a result, maintenance on the local authentication server was curtailed and the server was brought back on-line. Once local authentication was re-established, full EMS functionality was available. • The root cause analysis determined that a specific firewall policy allowing authentication failover from the local authentication server to the remote authentication server was inadvertently deleted.

  11. Loss of EMS – IT Communications Disabled • Lessons Learned • EMS network design should, where possible, include a redundant local authentication server on the same internal network as the primary local authentication server. Having the primary and redundant local authentication servers on the same internal network (i.e., behind the same firewall) eliminates the dependency on a firewall rule for internal communications to both the primary and redundant local authentication servers. • Test plans need to be comprehensive and include regression-level testing.

  12. Loss of EMS – IT Communications Disabled • Lessons Learned • The IT Change Management process should consider applying the following principles: • Apply a thorough test process that is reviewed with the client for all changes that could affect EMS function. • Test the design redundancy or back-out plan prior to implementing a change. • Test plans need to be comprehensive and include regression-level testing.
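One way to picture regression-level testing for this event is a check, run after every firewall change, that the rules the failover design depends on are still present. The sketch below is illustrative only: the `Rule` fields, the zone names, and the idea of keeping a list of required rules are assumptions, not details from the report or any firewall product's API.

```python
from typing import NamedTuple


class Rule(NamedTuple):
    """Simplified, hypothetical firewall rule: who may reach what, on which service."""
    source: str
    destination: str
    service: str


def missing_required_rules(required, active):
    """Return the required rules that are absent from the active policy."""
    active_set = set(active)
    return [rule for rule in required if rule not in active_set]


# Rules the authentication failover design depends on (hypothetical names).
REQUIRED = [
    Rule("ems-app", "auth-local", "auth"),
    Rule("ems-app", "auth-remote", "auth"),
]

# Active policy after a change that inadvertently dropped the failover rule.
after_change = [
    Rule("ems-app", "auth-local", "auth"),
    Rule("ems-app", "historian", "sql"),
]

for rule in missing_required_rules(REQUIRED, after_change):
    print(f"MISSING: {rule.source} -> {rule.destination} ({rule.service})")
```

A check like this covers only the configuration; testing the design redundancy itself, by exercising the failover before the local server is taken out of service, remains a separate step.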

  13. SCADA Failure Resulting in Reduced Monitoring Functionality • An entity’s primary control center SCADA Management Platform (SMP) servers became unresponsive • A partial loss of monitoring and control functions for more than 30 minutes. • This was a result of a conflict between security software configuration changes and core operating system functions • A cybersecurity event was quickly ruled out

  14. SCADA Failure Resulting in Reduced Monitoring Functionality • The primary control center SMP servers ceased network functionality and were unresponsive to login attempts from the local console. • Physical reboots of the servers were only able to resolve the problem momentarily. Recovery plans were immediately activated. • Root cause stemmed from a planned change to the security policy configuration of the host-based intrusion detection (HIDS) and intrusion prevention (HIPS) software. • The HIDS/HIPS security software on the SMP server hosts began to block certain core operating system processes executed in a specific order.

  15. SCADA Failure Resulting in Reduced Monitoring Functionality • The block did not occur until several days after the change was implemented • The SMP servers performed the specific functions that triggered the conflict and caused the HIDS/HIPS security software to lock down the core operating system. Primary SMP servers were stabilized and used to operate only noncritical SCADA circuits until root cause was established and full remediation was completed. • The design of the overall Energy Management System allowed the entity’s primary control center SMP servers to operate in a mixed mode, combining capabilities at both primary and backup control centers.

  16. SCADA Failure Resulting in Reduced Monitoring Functionality • Lesson Learned • Security software configurations need careful analysis, design, testing, and implementation, as they may impact reliability in unpredictable ways. • Registered entities should consider a “multisite hosting” configuration. This configuration provides flexibility and convenience for rapid recovery capability of EMS and SCADA functions. • Frequent exercise of and training on recovery plans ensures that actual event responses go according to plan and promptly mitigate operational impacts.

  17. Failure of Energy Management System While Performing Database Update • While performing edits to the EMS database, the entity received alarms that indicated errors for the communications servers. • The entity tried to restore the database to its original state. • The standby communications server in the Primary Control Center (PCC) was manually restarted. This caused the reversal of the database edits to fail and create faulty data files that synchronized across the integrated system servers. • Although alarms were received for all communications servers, only the standby communications server in the PCC failed; the EMS remained fully operational.

  18. Failure of Energy Management System While Performing Database Update • The faulty data files were manually removed from all servers, and a SCADA server failover was completed. • The EMS group executed a system warm restart, but since the EMS is an integrated system, the system warm restart resulted in the faulty data in the database being loaded into the remaining two communications servers, whereby all three communications servers failed. • The EMS lost functionality and was operational on a sporadic basis. At no point was the EMS off-line for a period exceeding 30 minutes.

  19. Failure of Energy Management System While Performing Database Update • Lessons Learned • When the EMS was purchased, the vulnerability of an integrated system architecture was unknown. • To eliminate this now-exposed vulnerability, it is recommended that functional separation of the PCC from the Auxiliary Control Center (ACC) be implemented.

  20. EMS Task Force / Working Group • 90 Category 2b events (Oct 26, 2010 – Dec 28, 2013) reported • 73 events thoroughly analyzed and reviewed • 59 entities reporting, 22 entities experiencing multiple outages • Restoration time for partial outages: 18 to 411 min • Restoration time for complete outages: 12 to 253 min • Several noticeable themes

  21. EMS Task Force / Working Group • Top Root Causes • Information to determine cause LTA (AZ) • Software Failure (A2B6C07) • Testing of Design/Installation LTA (A1B4C02) • Inadequate risk assessment of change (A4B5C04) • Insufficient Job scoping (A4B3C08) • Design output scope LTA (A1B2C01) • Post modification testing LTA (A2B3C03)

  22. EMS Task Force / Working Group • Top Contributing Causes • Software Failure (A2B6C07) • Design output scope LTA (A1B2C01) • Inadequate vendor support of change (A4B5C03) • Defective or failed part (A2B6C01) • Testing of Design/Installation LTA (A1B4C02) • Undesirable operation of coordinated system (A2B7C04) • Post Modification Testing LTA (A2B3C03) • Inadequate risk assessment of change (A4B5C04) • System Interactions not considered (A4B5C05)

  23. EMS Task Force / Working Group • EMS availability has achieved “visibility” • EMSWG has been tasked to: • Sponsor Monitoring and Situational Awareness workshop(s) or technical conference(s) annually • First was October 2014 after NERC OC/PC meetings • Publish Lessons Learned • Share good industry practices • Provide updates to Event Analysis Subcommittee and Operating Committee
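The Category 2b counts reported by the task force are easier to compare as proportions. The short calculation below uses only the figures from the slides (90 events reported, 73 analyzed, 59 reporting entities, 22 with multiple outages); the variable names and the `share` helper are illustrative.

```python
# Figures taken from the EMS Task Force / Working Group slide.
reported_events = 90
analyzed_events = 73
reporting_entities = 59
entities_with_multiple_outages = 22


def share(part, whole):
    """Return part/whole as a percentage rounded to one decimal place."""
    return round(100 * part / whole, 1)


analyzed_share = share(analyzed_events, reported_events)
repeat_share = share(entities_with_multiple_outages, reporting_entities)
not_yet_analyzed = reported_events - analyzed_events

print(f"Events analyzed: {analyzed_share}% of reported")
print(f"Entities with multiple outages: {repeat_share}% of reporting entities")
print(f"Events not covered by the analysis: {not_yet_analyzed}")
```

Roughly four in five reported events were analyzed, and more than a third of reporting entities experienced more than one outage.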

  24. NERC Website – Event Analysis / Lessons Learned

  25. NERC Website – Event Analysis / Lessons Learned

  26. NERC EA Process and Compliance • ERO Compliance Monitoring and Enforcement Program • 2014 ERO CMEP Implementation Plan • Version 1.0 • Revised: November 1, 2013 • Page 24 “Compliance Assessments for Events and Disturbances” • The “voluntary” EA Process is referenced in this section • NERC and WECC compliance functions are notified of any EOP-004, OE-417, or EA reports • The CMEP ties event reporting and compliance functions together
