GOCDB failover status and plans
COD-19, 01/04/2009
G.Mathieu, A.Cavalli, C.Peter, P.Sologna
Assessment and progress
• Last week's outage at RAL
  • A good (!) use case for testing our procedures and listing improvements
• DNS aspect
  • New DNS machine at CNAF
Last RAL outage
• Timeline
  • 5:20 UTC - Power glitch at RAL
  • 8:00 - Start of failover process
  • 9:20 - DNS switch complete
  • 10:00 - Failover working properly
  • 13:25 - Reverse DNS switch
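
The intervals between the timeline events can be worked out with a few lines of Python. This is only an illustrative sketch: it assumes all five times are in UTC on the same day (the slide states UTC only for the first one), and the names `TIMELINE`, `minutes_between` and `elapsed_report` are made up for this example.

```python
from datetime import datetime

# Timeline entries copied from the slide; all assumed to be same-day UTC.
TIMELINE = [
    ("05:20", "power glitch at RAL"),
    ("08:00", "start of failover process"),
    ("09:20", "DNS switch complete"),
    ("10:00", "failover working properly"),
    ("13:25", "reverse DNS switch"),
]


def minutes_between(start, end):
    """Return the whole minutes elapsed between two HH:MM times on one day."""
    fmt = "%H:%M"
    delta = datetime.strptime(end, fmt) - datetime.strptime(start, fmt)
    return int(delta.total_seconds() // 60)


def elapsed_report(timeline):
    """Pair each event (after the first) with the minutes since the previous one."""
    return [
        (label, minutes_between(previous_time, current_time))
        for (previous_time, _), (current_time, label) in zip(timeline, timeline[1:])
    ]


if __name__ == "__main__":
    for label, minutes in elapsed_report(TIMELINE):
        print(f"{minutes:4d} min until {label}")
    # Start of failover (08:00) to failover working properly (10:00).
    print("failover process:", minutes_between("08:00", "10:00"), "min")
```

Under these assumptions, the failover process itself (8:00 to 10:00) comes to 120 minutes, and the service was back 280 minutes after the power glitch.
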
Post mortem
• Good things
  • Failover worked
  • DNS swap quick, efficient and transparent
  • Good synchronisation
  • CNAF IRC channel was useful
• Encountered problems
  • Problems with CNAF DB schema
  • DB connection from ITWM to RAL
  • SSL issues
  • The overall process to swap completely took a rather long time (2h)
Proposed improvements (1)
• Improve manual process
  • Reduce the number of people needed: any one person should be able to carry out the whole chain alone
  • Create scripts to reduce the number of actions
• Sort out CNAF schema issue
• Improve current synchronisation mechanism
• Contacts and documentation
  • Keep a list of phone contacts, or alternative mail addresses, to use in case the main mail system does not work
  • Document all processes
Proposed improvements (2)
• Regular tests
  • Test CNAF replica DB
  • ITWM web interface
  • All possible scenarios
• Configuration improvements
  • Simplify configuration file
  • Have the service itself publish the fact that it is in read-only mode
• Automation
  • Work with OAT monitoring group
  • Automate DB switch
  • Automate portal switch the same way
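
As one possible reading of "have the service publish the fact that it is in read-only mode", the service could expose a small status document that clients and monitoring can check. The sketch below is purely hypothetical: the field names (`instance`, `mode`) and the function `build_status` are assumptions made for illustration, not part of GOCDB.

```python
import json


def build_status(instance, read_only):
    """Build a JSON status document stating which instance answers and its mode.

    Both field names are hypothetical; a real service would pick its own schema.
    """
    status = {
        "instance": instance,
        "mode": "read-only" if read_only else "read-write",
    }
    return json.dumps(status, sort_keys=True)


if __name__ == "__main__":
    # Example: a replica serving requests while the master is unavailable.
    print(build_status("replica", read_only=True))
```
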
Actions list (1)
• Doc and processes
  • Gilles to draft process + test documentation
  • Christian to add goc@itwm tests to ITWM procedures
  • All: provide contacts (phone, alternate mail, etc.)
• Access to machines
  • Christian to give failover team access to gocdb@itwm
  • Gilles to give failover team access to gocdb@ral
• Scripting
  • Gilles to write scripts to change GOC portal conf
  • Peter/Ale to write DNS configuration scripts
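
A small helper that could sit alongside such scripts is a check of which site the service alias currently resolves to, for example after a DNS switch. The sketch below is an illustration only: the alias `gocdb.example.org`, the site addresses (taken from the documentation range 192.0.2.0/24) and the function `active_site` are all hypothetical. The resolver is passed in as a parameter so the logic can be exercised without touching the network.

```python
import socket


def active_site(alias, site_addresses, resolve=socket.gethostbyname):
    """Return the name of the site whose address the alias resolves to.

    site_addresses maps a site name to its IPv4 address as a string.
    Returns None when the alias points at an address that is not listed.
    """
    address = resolve(alias)
    for site, site_address in site_addresses.items():
        if site_address == address:
            return site
    return None


if __name__ == "__main__":
    # Hypothetical addresses; a fake resolver stands in for real DNS here.
    sites = {"master": "192.0.2.10", "replica": "192.0.2.20"}
    fake_dns = {"gocdb.example.org": "192.0.2.20"}
    print(active_site("gocdb.example.org", sites, resolve=fake_dns.__getitem__))
```

Calling `active_site` without the `resolve` argument uses the system resolver, so the result then depends on local DNS caching.
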
Actions list (2)
• Improvements on CNAF-RAL DB sync
  • Gilles to provide a dump to CNAF whenever the schema changes
  • Peter/Ale/Gilles to study encryption solution to secure the dump
  • Gilles to check the dump solution is valid
  • Peter/Ale to implement new procedures
  • Ale to do speed tests in different scenarios
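
One small part of handing a dump from one site to another is confirming that the file arrived intact. The sketch below shows a generic SHA-256 comparison using Python's standard `hashlib`. It is not the procedure chosen by the team, and it does not cover encryption; the function names and the idea of sending a checksum alongside the dump are assumptions made for this example.

```python
import hashlib


def sha256_of(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in chunks to limit memory use."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def dump_matches(path, expected_hex):
    """Compare a received dump against the checksum published by the sender."""
    return sha256_of(path) == expected_hex.strip().lower()
```

The sender would compute the digest before transfer and pass it over a separate channel; the receiver calls `dump_matches` before loading the dump into the replica.
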
Actions list (3)
• Test
• Test again
• Re-test
• Test
• Test
• Test (if there is some time left)