Gemini OSU - UKLC Update
Annie Griffith
December 2007
Discussion Items
Focus of discussion will be on the system elements of the recent Gemini incident:
• Summary of findings to date
• Review of learning points
• Lessons learned applicable to UKLTR
• Open discussion
Overview of events
• 21st Oct – upgraded system implemented
  • API errors identified
• 22nd Oct – issue with shipper views of other shippers' data
  • Shipper access revoked
• 24th Oct – code fix for data view implemented
  • Internal National Grid access only
• 26th Oct – external on-line service restored
• 1st Nov – hardware changes implemented to external service
• 2nd Nov – API service restored
  • Further intermittent outage problems occurring on APIs
• 5th Nov – last outage on API service recorded at 13:00
  • Root cause analysis still underway
Summary - Causes
• Two problems identified:
• Application code construct – associated with high-volume, instantaneous, concurrent usage of the same transaction type (see the illustrative sketch below). Fix deployed 05:00 23/10/07.
• API error – associated with saturation usage, manifesting as "memory leakage" that builds up over time and eventually results in loss of service. Indications are that this is an error in a 3rd-party system software product. Investigations continuing.
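For illustration only, the following minimal Python sketch (hypothetical, not Gemini code) shows the general class of defect described above: an unsynchronised check-then-act on shared state that behaves correctly when a single user exercises the transaction, but fails when many users drive the same transaction type concurrently.

    # Hypothetical illustration only, not Gemini code. A check-then-act
    # defect on shared state: each call works correctly in isolation,
    # but concurrent callers of the same transaction type can both pass
    # the check and oversell the remaining capacity.
    import threading

    class NominationProcessor:
        def __init__(self, capacity=100):
            self.remaining_capacity = capacity
            self._lock = threading.Lock()

        def book_unsafe(self, qty):
            # Passes single-user functional testing, fails under load:
            # two threads can both see enough capacity and both deduct.
            if self.remaining_capacity >= qty:
                self.remaining_capacity -= qty
                return True
            return False

        def book_safe(self, qty):
            # Fix: make the check and the update a single atomic step.
            with self._lock:
                if self.remaining_capacity >= qty:
                    self.remaining_capacity -= qty
                    return True
                return False

    if __name__ == "__main__":
        proc = NominationProcessor()
        threads = [threading.Thread(target=proc.book_safe, args=(10,))
                   for _ in range(20)]
        for t in threads:
            t.start()
        for t in threads:
            t.join()
        print(proc.remaining_capacity)  # never negative with book_safe

The point of the sketch is that this kind of defect is invisible to discrete functional testing and only emerges under concurrent load, which is consistent with it surfacing after go-live rather than during UAT.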
Fixes since Go-live
• Since 4th November:
  • 10 application defects
  • All minor
  • All fixed
• No outstanding application errors
Gemini OSU Testing
• Extensive testing programme:
  • 2 months integration and system testing
  • 6 weeks OAT performance testing
    • Volume testing at 130% of current user load
  • 8 weeks UAT
  • 4 weeks shipper trials (voluntary)
    • 3 participants
  • 7 weeks dress rehearsal
• Focus was on the actions needed to complete the technical hardware upgrade across multiple servers and platforms
Testing Lessons Learnt
• UAT
  • Each functional area tested discretely
  • Issues around concurrent usage were unknown and therefore not specifically targeted for testing
  • "Field" testing of the system under fully loaded conditions may have highlighted the problem, but this is not certain
• OAT
  • Although volume and stress testing completed successfully, reliability/soak testing over a prolonged period was not undertaken (see the illustrative sketch below)
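Not part of the original material: a minimal sketch of what prolonged reliability/soak testing could look like, assuming a hypothetical health-check endpoint and the third-party requests library. Unlike a short volume or stress run, it keeps a steady load going for many hours and logs error rate and latency, so a slow resource leak shows up as gradual degradation before a full outage.

    # Illustrative soak-test driver; the endpoint, duration and pacing
    # are placeholders, not details from the Gemini programme.
    import time
    import requests

    ENDPOINT = "https://example.invalid/api/health"   # hypothetical URL
    DURATION_HOURS = 48                               # prolonged period
    PAUSE_SECONDS = 2                                 # steady, modest load

    end_time = time.time() + DURATION_HOURS * 3600
    calls = errors = 0

    while time.time() < end_time:
        started = time.time()
        try:
            ok = requests.get(ENDPOINT, timeout=30).status_code == 200
        except requests.RequestException:
            ok = False
        calls += 1
        errors += 0 if ok else 1
        print(f"{time.strftime('%H:%M:%S')} ok={ok} "
              f"latency={time.time() - started:.2f}s "
              f"error_rate={errors / calls:.2%}")
        time.sleep(PAUSE_SECONDS)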
Other Observations
• Communications during the incident
• Undersold the scale of the change
• Engagement – were the right individuals/forums involved?
• Planning for failure… as well as success
UKLTR – What’s different?
• The main workhorse of the system is batch processing:
  • Predictable transaction volumes
  • Far easier to replicate load and volume testing
  • Easy to verify outputs
• Shipper interaction is batch driven
• Low volume of on-line users
• Does not have the same level of real-time/instantaneous transaction criticality
• Ability to do more verification following cut-over, before releasing data from the upgraded system to the outside world (see the illustrative sketch below)
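Because the outputs are batch files with predictable volumes, they can be reconciled before anything is released externally. A minimal sketch, assuming hypothetical CSV extracts, file names and column names (none taken from the actual UKLTR design):

    # Hypothetical post-cut-over verification: reconcile a batch output
    # file from the upgraded system against the same run on the legacy
    # system before any data is released to the outside world.
    import csv

    def load_rows(path, key_field="record_id"):
        with open(path, newline="") as f:
            return {row[key_field]: row for row in csv.DictReader(f)}

    legacy = load_rows("legacy_batch_output.csv")
    upgraded = load_rows("upgraded_batch_output.csv")

    missing = set(legacy) - set(upgraded)
    extra = set(upgraded) - set(legacy)
    differing = [k for k in set(legacy) & set(upgraded)
                 if legacy[k] != upgraded[k]]

    print(f"records missing from upgraded run: {len(missing)}")
    print(f"unexpected extra records: {len(extra)}")
    print(f"records with differing values: {len(differing)}")

A check of this kind is only possible because the batch-driven model allows a pause between cut-over and external release, which is the contrast being drawn with Gemini's real-time transactions.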
UKLTR – Lessons to be applied
• Plan for failure
  • Differing levels: problems vs. incidents
  • Technical and resource planning
• Fully prepared incident management procedure established in advance and understood by all parties
  • Escalation routes
  • Communications mechanisms
• Status communications to be issued during the outage period
  • Milestone updates?
  • Who to?
• Fall-back options
  • The old system provides a straightforward option
  • However, once interface data has been propagated to other systems, we will be in a "fix-forward" situation