Improving Change Management Andreas Unterkircher
Post mortem of 2008 releases
• Releases of VOMS and BDII (related) rpms caused problems.
• Update 22: VOMS (short FQANs, FQAN ordering)
  • Reliance on unsupported functionality
• Update 30: GFAL (update incompatible with gLite 3.0 BDII)
  • Reliance on obsoleted services
• Update 33: various BDII related problems
  • 1) bdb backend ran out of available locks; fixed by adding a configuration parameter. #42727
    • Insufficient testing in production context
  • 2) chown bomb for Quattor site, caused by not setting a configuration parameter. #42799
    • Alternative Fabric Management method
  • 3) Recursive protection commented out, causing the information system to fail. No bug or GGUS ticket ('the security bug', fixed in patch #2649, release update 37)
    • Lack of regression testing
  • 4) Change in schema file causing WMS to not match when using the CESEBind. Bug #45278 / GGUS #44201
    • Complex service interaction: the bug was in WMS
Post mortem of 2008 releases
• The proper fast track fix would have been a rollback of the release.
  • We did not roll back but tried to produce updates quickly.
  • This stems from the fact that our release model currently does not support rollbacks properly.
• Some of the issues with updates could only have been spotted by running production workflows
  • E.g. user/framework relying on the order of FQANs
• For a quick fix we were able to produce and distribute updated rpms quickly (< 24h), but
  • Updates were not well documented on the release pages.
  • Broadcasts were sent by different people, potentially confusing the sites.
• Information on the release pages has to be improved (GGUS ticket complaining that the last release notes are not easy to understand)
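The slides do not show the client code that broke in Update 22, so the following is only a sketch of how a user framework could avoid depending on FQAN form and ordering when comparing attribute sets. It assumes that the long FQAN form carries `Role=NULL` and `Capability=NULL` fields that the short form omits; the function names and sample FQANs are hypothetical. It is not meant for cases where the position of the first FQAN carries meaning.

```python
from typing import Iterable


def normalize_fqan(fqan: str) -> str:
    """Return the short form of an FQAN by dropping NULL role/capability fields.

    Assumption: the long form appends "Role=NULL" and/or "Capability=NULL".
    """
    null_fields = ("Role=NULL", "Capability=NULL")
    parts = [p for p in fqan.split("/") if p and p not in null_fields]
    return "/" + "/".join(parts)


def same_attributes(a: Iterable[str], b: Iterable[str]) -> bool:
    """Compare two FQAN lists without relying on their form or their order."""
    return {normalize_fqan(f) for f in a} == {normalize_fqan(f) for f in b}


# Hypothetical attribute lists before and after an update.
old = ["/cms/Role=NULL/Capability=NULL", "/cms/Role=production/Capability=NULL"]
new = ["/cms/Role=production", "/cms"]

print(same_attributes(old, new))
```

A check of this kind in a certification test suite is one way an experiment use case could be captured before a release reaches production.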
Post mortem of 2008 releases
• “There is a tendency to bundle too many changes in a release” (CMS)
• More, smaller releases?
  • Much less efficient for the release process.
  • Would the sites follow?
• Less change in general?
  • Which changes do we drop?
Actions to take
• Implement well defined rollback procedure (SA3)
• Implement procedure for fast track releases (SA3 & SA1)
• Implement a managed rollout of updates in production (SA1)
• Improve and maintain quality of the release pages (SA3)
• Maximize representation of experiment use cases in certification (SA3)
Rollback
• Current status:
  • One shared repository for all node types
• Planned:
  • “current” repository for every node type (no longer shared) + repositories of the previous update
  • In case of rollback “current” can point to the previous update. This can be done per node. Prevents sites from picking up the bad update.
  • Rollbacks are per node type, not per individual rpm.
  • Sites which have already installed the bad update need to downgrade manually.
    • We can provide a recipe
• The release team needs to sort out how to achieve this from a technical point of view (has implications for our scripts and AFS space, we need to ensure consistency of the repositories, etc.)
• Timeline: end of February 09
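The planned layout can be illustrated with a small in-memory model. This is a sketch only, not the release team's implementation: the class, the node type names and the update names are hypothetical, and a real setup would manage repository directories rather than Python objects.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class NodeTypeRepo:
    """Update history for one node type; the last entry of `updates` is the newest."""

    updates: List[str] = field(default_factory=list)
    current: Optional[str] = None  # the update that "current" points to

    def publish(self, update: str) -> None:
        """Add a new update and point "current" at it."""
        self.updates.append(update)
        self.current = update

    def rollback(self) -> str:
        """Point "current" at the update preceding the one it points to now."""
        if self.current is None:
            raise RuntimeError("nothing has been published yet")
        index = self.updates.index(self.current)
        if index == 0:
            raise RuntimeError("no previous update to roll back to")
        self.current = self.updates[index - 1]
        return self.current


# Hypothetical node types and update names, for illustration only.
repos = {"BDII": NodeTypeRepo(), "WMS": NodeTypeRepo()}
for name in ("update-32", "update-33"):
    for repo in repos.values():
        repo.publish(name)

repos["BDII"].rollback()  # only the BDII node type is rolled back
print({node: repo.current for node, repo in repos.items()})
```

Because each node type has its own “current” pointer, rolling back one node type leaves the others on the newer update, and the rollback applies to the whole node type rather than to individual rpms.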
Procedure for fast track releases
• Will be defined together with SA1 (PPS team)
  • Roles, broadcasts, release page.
  • Need a consensus on which issues are treated this way.
• If a problem occurs, rollback should be the default action.
• Timeline: should be available when rollback is implemented.
Managed rollout
• Done by SA1.
• Identify sites wanting to be the first to install an update.
  • Node types will be exposed to production
  • Should capture production workflows.
  • Could use PPS manpower/resources to do this.
• This is already partly being done now
  • After a new release only a few sites upgrade immediately
  • PPS installed pilot services for certain services (WMS, CREAM)
• Timeline: to be determined by SA1
Near future
• No major releases until the rollback has been implemented.
  • Only dedicated, low risk or security updates
  • Among next update candidates: CREAM, VDT, yaim
  • Next release only for CREAM
• We try to get more experiment use cases and add them to our test base (contact person: Gianni Pucciani).