RMS Workshop: Retail Systems Disaster Recovery
ERCOT, May 6th, 2014
Addressing Retail DR Needs Mandy Bauld
Addressing Retail DR Needs
• Understand the Current State
• Define & Understand the Requirement
• Close the Gap (Next Steps)
ERCOT’s Primary Objective for Today
• Understand the Current State
• Define & Understand the Requirement
  • What is the RMS position on a Recovery Time Objective (RTO) for the retail systems in the event of retail system outages?
  • This information is required in order for ERCOT to effectively take any “next step” actions regarding DR for retail systems.
• Close the Gap (Next Steps)
RMS Feedback Summarized (April 1, 2014 RMS Meeting)
• Understand the Current State
  • Market needs education on what it takes for ERCOT to fail over, to level-set expectations of the current state
  • Market needs to understand the dependencies and decision-making process around a failover
  • Market needs information about ERCOT’s failover testing and planning for returning to the primary site
• Define & Understand the Requirement
  • Various opinions expressed regarding downtime (need a limit on the downtime; need full operation within 24 hours; 24 hours of downtime is unacceptable)
  • Even after systems were “back up” there were still configuration issues in the DR environment and still a backlog to work through… more than acceptable… it was not fully restored
  • Cleanup work was more than expected
  • The market’s manual work-around capability is limited, is not sustainable, and is not without risk
  • Incomplete or delayed communication has a direct impact on the market’s ability to make decisions and manage customer expectations through the event
• Close the Gap (Next Steps)
  • Identify options that improve ERCOT’s and the market’s ability to meet the defined requirements
Agenda
• March 11th Outage Timeline
• Overview – Retail DR Capability & Process
• System Outage Communications
• System Outage Requirements
• Work-Around Processes
• Planned Failover Communications
March 11th Outage Timeline Dave Pagliai
System Outage on March 11, 2014
Impacted Services:
• Retail Transaction Processing
• MarkeTrak
• eService (service requests and settlement disputes)
• Retail Data Access & Transparency through MIS and ercot.com
• Settlement & Billing processes
• ercot.com
• Retail Flight Testing (CERT)
• MPIM
• Texas Renewables website (REC)
• NDCRC through the MIS
System Outage on March 11, 2014
Infrastructure Failure – Tuesday 03/11/14 @ 9:27 AM
Restoration of Market Facing Systems
Tuesday 03/11/14
• Texas Renewables (REC) – www.texasrenewables.com – 03/11/14 @ 4:01 PM
• Settlement & Billing processes – 03/11/14 @ 4:36 PM
• ercot.com – 03/11/14 @ 5:20 PM
Wednesday 03/12/14
• Registration (Siebel) – 03/12/14 @ 1:15 PM
• MPIM – 03/12/14 @ 1:15 PM
• Retail Transaction Processing – 03/12/14 @ 2:10 PM
• MarkeTrak API – 03/12/14 @ 2:10 PM
• MarkeTrak GUI – 03/12/14 @ 2:10 PM
• MIS Retail Applications – 03/12/14 @ 2:10 PM
• NDCRC through the MIS – 03/12/14 @ 3:02 PM
• Retail Transaction Processing backlog complete – 03/12/14 @ 10:15 PM
System Outage on March 11, 2014
Restoration of Market Facing Systems (continued)
Thursday 03/13/14
• CERT – 03/13/14 @ 3:13 PM. This was actually unrelated to the infrastructure outage.
Friday 03/14/14
• eService (service requests and settlement disputes) – 03/14/14 @ 5:30 PM
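For context, here is a minimal Python sketch (not part of the original presentation) that computes how long each market-facing service was unavailable, using the timestamps above and the 24-hour recovery target discussed later in this deck. Services restored on 03/12 at 1:15 PM or later fall outside a 24-hour window measured from the 9:27 AM failure.

```python
from datetime import datetime

# Timestamps from the March 11, 2014 outage timeline above.
# CERT is omitted because its restoration was unrelated to this outage.
FAILURE = datetime(2014, 3, 11, 9, 27)

RESTORED = {
    "Texas Renewables (REC)": datetime(2014, 3, 11, 16, 1),
    "Settlement & Billing processes": datetime(2014, 3, 11, 16, 36),
    "ercot.com": datetime(2014, 3, 11, 17, 20),
    "Registration (Siebel)": datetime(2014, 3, 12, 13, 15),
    "MPIM": datetime(2014, 3, 12, 13, 15),
    "Retail Transaction Processing": datetime(2014, 3, 12, 14, 10),
    "MarkeTrak API": datetime(2014, 3, 12, 14, 10),
    "MarkeTrak GUI": datetime(2014, 3, 12, 14, 10),
    "MIS Retail Applications": datetime(2014, 3, 12, 14, 10),
    "NDCRC through the MIS": datetime(2014, 3, 12, 15, 2),
    "Retail Transaction Processing backlog": datetime(2014, 3, 12, 22, 15),
    "eService": datetime(2014, 3, 14, 17, 30),
}

for service, restored in RESTORED.items():
    # Elapsed downtime in hours, compared against a 24-hour recovery window.
    hours = (restored - FAILURE).total_seconds() / 3600
    status = "within 24h" if hours <= 24 else "exceeded 24h"
    print(f"{service:40s} {hours:5.1f} h  ({status})")
```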
Communications
• 1st Market Notice: Tuesday, 3/11 at 12:27 PM
• 1st Retail Market Conference Call: Tuesday, 3/11 at 1:00 PM
• Regular Market Notices and Retail Market Conference Calls through Friday, 3/14
• Final notice on Monday, 3/17
Challenges
• The outage impacted various internal ERCOT applications and tools required to communicate both internally and externally, and to access procedural documentation
• ERCOT’s registration system failed to initialize in the alternate data center, requiring application servers to be rebuilt and delaying the restoration of several other Retail systems/services
Overview – Retail DR Capability & Process Aaron Smallwood
Retail/Commercial Systems DR Background
• Pre-2011 – No DR environment or failover capability; strategy was to utilize the iTest environment if an extended outage occurred
• December 2010 – Retail/Commercial Systems DR environment delivered and tested by the data center project
• Evolving state of maturity
• Strategy: fail over in a major outage event when the primary system cannot be restored within the required timeframe
• Recovery Time Objective (RTO): generally, 24-hour recovery of core systems and services
• Recovery Point Objective (RPO): 0 data loss
• Testing Strategy: recovery site operability tested annually
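As an illustration only (not ERCOT tooling), the recovery objectives stated above can be captured in a small structure and checked against a realized recovery; the numbers come straight from the bullets on this slide, and the names are invented for the example.

```python
from dataclasses import dataclass


@dataclass
class RecoveryTarget:
    """Illustrative container for the DR objectives stated on this slide."""
    system: str
    rto_hours: float   # Recovery Time Objective: maximum time to restore service
    rpo_hours: float   # Recovery Point Objective: maximum tolerable data loss window
    test_cadence: str


RETAIL_COMMERCIAL = RecoveryTarget(
    system="Retail/Commercial Systems",
    rto_hours=24.0,    # "generally, 24 hour recovery of core systems and services"
    rpo_hours=0.0,     # "0 data loss"
    test_cadence="recovery site operability tested annually",
)


def met_objectives(target: RecoveryTarget, recovery_hours: float, data_loss_hours: float) -> bool:
    """Check a realized recovery against the stated RTO and RPO."""
    return recovery_hours <= target.rto_hours and data_loss_hours <= target.rpo_hours
```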
Retail/Commercial Systems – Current DR Capability
• Capable of operating out of the primary or alternate data center
• Capable of a 24-hour Recovery Time Objective (RTO)
  • Moving operations to an alternate data center within 24 hours
  • Historically accomplished in less than 24 hours
• Capable of a 0 data loss Recovery Point Objective (RPO)
• Historical use of DR capability:
  • December 2012 – Unplanned failover; first “real” use of the DR environment
  • June 2013 – Planned failover; first transition back to the primary environment
  • March 2014 – Unplanned failover
ERCOT Retail DR – Context
For security reasons, ERCOT cannot provide specific details regarding environments or business continuity plans.
• Outage events are like fingerprints… each is unique.
• Issues are initially handled as an IT Incident
  • First priority – find the issue
  • Identify the scope of impacted systems and functions
  • Identify options and limitations for restoration and recovery
  • Make recommendations and coordinate restoration
  • Engage Business SMEs and the Market Communications team
• Potential for an extended outage?
  • Mobilize the Disaster Management Team (DMT)
    • Executive management and Director-level leadership across the company
  • The DMT monitors the situation throughout the event to understand the scope of the problem and its impacts, and makes “the big decisions”
  • Examples: the 12/3/2012 and 3/11/2014 outages
ERCOT Retail DR – Process
• The systems are not currently capable of “real-time” failover
• Once the failover decision is made, IT follows procedures to:
  • Verify readiness of the alternate environment
  • Control data replication streams during the transition
  • Configure integrated systems to point to the alternate environment
• Business follows procedures to:
  • Prioritize recovery efforts
  • Work with IT to determine where processes left off and where they should start after recovery
  • Determine and mitigate the potential for data loss
  • Determine if/what work-arounds may be necessary upon recovery
  • Verify the ability to use systems in the alternate environment
  • Support Market Communications
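To make the ordering of the IT steps concrete, here is a minimal, purely hypothetical Python sketch of a failover runbook. The endpoints, checks, and function names are invented for illustration and are not ERCOT’s actual procedures or tooling; the point is only the sequence: verify the alternate site, control replication, then repoint integrated systems.

```python
# Hypothetical failover runbook sketch; names and endpoints are placeholders.
PRIMARY_DC = "dc-primary.example.internal"
ALTERNATE_DC = "dc-alternate.example.internal"


def verify_alternate_ready() -> bool:
    """Placeholder health check of the alternate environment (step 1)."""
    # In practice: application health probes, database consistency checks, etc.
    return True


def quiesce_replication() -> None:
    """Placeholder for controlling data replication streams during the transition (step 2)."""
    print(f"Pausing replication {PRIMARY_DC} -> {ALTERNATE_DC}")


def repoint_integrations(target: str) -> None:
    """Placeholder for reconfiguring integrated systems to the alternate environment (step 3)."""
    print(f"Updating integration endpoints to {target}")


def fail_over() -> None:
    """Run the three IT steps in order; business verification follows."""
    if not verify_alternate_ready():
        raise RuntimeError("Alternate environment not ready; abort failover")
    quiesce_replication()
    repoint_integrations(ALTERNATE_DC)
    print("Failover complete; business verification and market communications follow")


if __name__ == "__main__":
    fail_over()
```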
System Outage Communications Ted Hailu
System Outage – Communications
• Initial and subsequent communications (when/how)
• Contingency plans for email communication
• Communication outside the Market Notice process
System Outage Requirements Dave Michelsen
System Outage Requirements
• What is the market’s priority for recovery of retail services?
• What is the quantitative and/or qualitative impact of the unavailability of each service?
• What is the RTO for each? May need to consider:
  • Operating timelines
  • Importance of time of day (AM vs. PM) or day of week and time of day
  • Tolerance level relative to invoking safety-net procedures
  • Cost to support increases as the RTO decreases
Work-Around Processes Dave Michelsen
Workaround Processes
• Are the current processes sufficient?
  • Move-In(s)
  • Switch Hold Removals
  • Move-Out(s)
  • Others?
Planned Failover Communications Dave Michelsen
Planned Failover Communications
• Communication for a planned transition between sites
  • i.e., a transition from the alternate data center back to the primary data center
• Follow normal processes regarding outage/maintenance communications
• Transition would be performed in a Sunday maintenance window
Parking Lot Items
• TBD