Persistency Framework News for ATLAS
Andrea Valassi (IT-ES), for the Persistency Framework team
ATLAS Database Meeting, 19th October 2011
Outline and summary
• Recent developments and releases
  • POOL, CORAL, COOL (releases since the July 2011 talk at the ATLAS software week)
• Support for ATLAS Tier0
• COOL validation on Oracle 11g
• Other issues
  • Ongoing enhancements in CORAL and COOL
  • Long-term plans for POOL: any news from ATLAS?
LCG 61 for LHCb (August 2011)
• Motivation: major upgrade in ROOT (to 5.30.00)
  • Also many other external upgrades (e.g. frontier_client 2.8.4)
  • Not sure if this was used by ATLAS (which requested 61a instead)
• POOL 2.9.15
  • Minor fixes for gcc 4.5 and C++0x
• CORAL 2.3.16a
  • Rebuild of previous CORAL 2.3.16 (for ATLAS in LCG 60c)
• COOL 2.8.10a
  • Rebuild of previous COOL 2.8.10 (for ATLAS in LCG 60c)
• For full details see the release notes on TWiki
LCG 60d for ATLAS (Sept. 2011)
• Motivation: changes in ROOT and frontier_client
  • ROOT 5.28.00g
  • frontier_client 2.8.4: long-awaited performance patch (bug #84067)
• POOL 2.9.16
  • Many fixes and enhancements for ATLAS in collection packages
• CORAL 2.3.17
  • Minor fix for Oracle 11g servers (remove warnings, bug #86406)
  • Port to boost 1.47 for CMS (bug #85896)
  • Minor fixes in monitoring and CORAL_SERVER packages
• COOL 2.8.10b
  • Rebuild of previous COOL 2.8.10 (for ATLAS in LCG 60c)
• For full details see the release notes on TWiki
LCG 59c for ATLAS (Sept. 2011)
• Motivation: patch upgrade in ROOT (to 5.26.00g)
  • No other external change, no rebuild of Persistency packages
• POOL 2.9.15
  • Same as in previous LCG 59b for ATLAS (no rebuild)
• CORAL 2.3.14
  • Same as in previous LCG 59b for ATLAS (no rebuild)
• COOL 2.8.8
  • Same as in previous LCG 59b for ATLAS (no rebuild)
• For full details see the release notes on TWiki
LCG 61a for ATLAS (October 2011)
• Motivation: patch upgrade in ROOT (to 5.30.02)
  • Also includes performance optimizations in CORAL
• POOL 2.9.17
  • Port to gcc 4.6 and SLC6 (disable LFC dependencies)
• CORAL 2.3.18
  • Port to SLC6 (disable LFC dependencies)
  • Remove unnecessary data dictionary queries (task #10844)
  • Fix memory leak in the Blob class (bug #87279)
  • Minor fixes in CORAL_SERVER sockets and monitoring
  • Major cleanup of CORAL_SERVER tests (allow SPI to execute them to complete the release validation; start port to ATLAS 17.0.2)
• COOL 2.8.11
  • Complete internal cleanup of transaction handling (task #3271)
  • Fix tests for memory leaks reported by valgrind (task #12670)
• For full details see the release notes (still in progress)
Support for ATLAS Tier0 issues
• Tier0 has observed higher Oracle load since August
  • Most of the follow-up is ongoing in ATLAS and IT-DB
• CORAL team is helping to analyze specific issues
  • 5-minute interval between retrial attempts (bug #87759)
    • ATLAS parameter settings (300s) change the CORAL defaults
  • Some jobs have been reported to fail with ORA-25408
    • Complex issue also seen in CMS since September (bug #87164)
• Caching solutions can help offload Oracle
  • Frontier has already been tested, CORAL server also discussed
  • Discrepancy in physics results observed between Oracle and Frontier (bug #87266), probably a muon software issue (bug #87963). Specific to “cacheAlign”, not to Frontier (is Oracle more ‘wrong’ than Frontier?).
  • Promised to help set up CORAL server for tests: never had time, sorry
• Problem is still pending
  • Must understand and fix the higher load on Oracle, and/or fix and deploy a caching solution as an alternative
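To make the effect of the retrial interval concrete, here is a minimal sketch of a retry loop driven by a retry period and an overall timeout. It is not the CORAL implementation: the function name, its parameters and the 900 s timeout are hypothetical, and only the 300 s period comes from the slide above.

```python
import time


def connect_with_retries(connect, retry_period_s=300.0, retry_timeout_s=900.0,
                         sleep=time.sleep, clock=time.monotonic):
    """Call connect() until it succeeds or the retry budget is used up.

    Returns (connection, attempts). Re-raises the last ConnectionError once
    another wait would exceed retry_timeout_s.
    All names are hypothetical; this is not the CORAL API.
    """
    start = clock()
    attempts = 0
    while True:
        attempts += 1
        try:
            return connect(), attempts
        except ConnectionError:
            # Give up if waiting one more period would exceed the timeout.
            if clock() - start + retry_period_s > retry_timeout_s:
                raise
            sleep(retry_period_s)
```

With a period of 300 s, a job that hits a short server glitch stays idle for five minutes before its next attempt, which is the kind of behaviour the bullet on retrial attempts refers to. The `sleep` and `clock` arguments are injectable only so the loop can be exercised without real waiting.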
Oracle server move to 11g
• IT-DB plans to upgrade all servers to 11g in January 2012
  • Need both functional and performance tests (using the int8r cluster)
  • Thanks to Gancho, Roman and IT-DB for their help
• COOL functional tests: ok so far
  • Minor fixes still needed in the tests for ORA-01466 (bug #87935)
  • CORAL tests pending (need a few accounts), but lower coverage
• COOL performance tests: unexpected execution plans
  • “Do not expect big surprises” (AV, July): apologies, I was wrong
  • COOL ad-hoc small-scale scalability tests (see next slide)
    • Query time increases linearly with IOV position (should be flat as in 10g)
    • Using the exact same SQL and hints as in 10g
    • Seems to improve if reverting to the 10g optimizer (work in progress)
  • To do: the 10g CBO exec plans are not yet understood (why do they work ok?!)
  • To do: is the 10g CBO acceptable or should we use better hints (better SQL)?
  • To do: only one single-version use case tested so far, need to analyze them all
  • To do: release changes in CORAL/COOL if necessary (or ‘alter system’?)
• Recommend larger-scale ATLAS performance tests too
COOL performance tests on 10g
• Is query time identical (flat) for retrieving first and last IOV?
  • Small-scale scalability test (a slope extrapolates to larger tables…)
  • Test 6 cases each time (good/bad/no stats; peek low/high IOVs)
• In Oracle 10g
  • Without hints: some cases (e.g. good statistics and peek high) look good out of the (Oracle) box; the 10g optimizer likes COOL’s SQL
  • Hints are only for exec plan stability: all six cases look good (e.g. even with bad statistics); the 10g optimizer likes COOL’s hints
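The test described above can be sketched as follows: enumerate the six combinations of statistics and peeked IOVs, then fit a straight line to query time versus IOV position and check whether the slope is negligible. This is an illustration only, not the actual COOL test code; the function names and the 5% flatness tolerance are assumptions.

```python
from itertools import product

STATS = ("good", "bad", "none")   # state of the table statistics
PEEK = ("low", "high")            # which IOVs are peeked (low or high)


def stats_peek_cases():
    """Return the six (statistics, peek) combinations tested each time."""
    return list(product(STATS, PEEK))


def slope(positions, times):
    """Least-squares slope of query time versus IOV position."""
    n = len(positions)
    mean_x = sum(positions) / n
    mean_y = sum(times) / n
    sxx = sum((x - mean_x) ** 2 for x in positions)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(positions, times))
    return sxy / sxx


def is_flat(positions, times, tolerance=0.05):
    """True if the fitted rise over the scanned range is at most
    `tolerance` times the mean query time (hypothetical criterion)."""
    rise = abs(slope(positions, times)) * (max(positions) - min(positions))
    return rise <= tolerance * (sum(times) / len(times))
```

A non-zero slope on a small table matters because it extrapolates: if retrieval time grows with IOV position, the last IOVs of a much larger production table become proportionally slower.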
COOL performance tests on 11g (1)
• In Oracle 11g (and out-of-the-box 11g optimizer)
  • Without hints: never get a decent performance (not even with good statistics and peek high) for the same SQL used in 10g
  • With hints: not much difference (even worse?)
• Execution plans were checked and are different from 10g
  • First attempts at adding hints to make them similar failed
  • The fact that performance is never good without hints signals a much bigger problem somewhere…
COOL performance tests on 11g (2)
• In Oracle 11g (forcing 10g optimizer in CORAL session)
  • Without hints: better, good at least if good statistics and peek high
  • With hints: seems to fix all issues (all 6 curves are flat again)
• Need to look again at execution plans
  • First attempt with 10g CBO (not used to make a plot) did not seem good enough to result in flat response...? Should be understood.
  • May try to develop new hints from tests above and avoid 10g CBO
• Is this trick of using 10g CBO acceptable for IT and ATLAS DBAs?
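The slides do not say how the 10g optimizer was forced. One mechanism Oracle provides for this is the `optimizer_features_enable` parameter, which can be set per session or system-wide. The sketch below only builds the statement text; the helper name and the version value `10.2.0.4` are assumptions, not taken from the talk.

```python
import re


def optimizer_statement(version, scope="SESSION"):
    """Build an ALTER SESSION/SYSTEM statement pinning the optimizer version.

    Hypothetical helper: the talk does not specify the mechanism or version.
    """
    if scope not in ("SESSION", "SYSTEM"):
        raise ValueError("scope must be SESSION or SYSTEM")
    # Accept only dotted numeric versions so nothing else reaches the SQL text.
    if not re.fullmatch(r"\d+(\.\d+){1,4}", version):
        raise ValueError(f"unexpected version string: {version!r}")
    return f"ALTER {scope} SET optimizer_features_enable = '{version}'"


# Example: statement a client could issue right after opening its session.
print(optimizer_statement("10.2.0.4"))
```

The session-level form keeps the change local to the client that issues it, while the `SYSTEM` form affects every application on the server, which is why the choice between the two is a question for the DBAs.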
Other issues and work in progress
• POOL long-term support
  • LHCb (like CMS previously) is essentially no longer using POOL
    • Replaced by direct ROOT; only Gaudi (not for long) still needs POOL
  • Any news from the ATLAS effort to build POOL in-house?
• More work on CORAL connection management
  • Along the lines of (not yet complete) network glitch handling
  • Analyzing more phase-space variables for timeouts and failovers (sqlnet and tnsnames parameters), e.g. ORA-25408 in CMS
• Bind variables in FrontierAccess may be useful for ATLAS
  • As mentioned in Roman’s talk
  • Not a priority at the moment (very limited manpower)
Conclusions
• Three very busy months!
  • A few releases
  • But especially a large support load
  • And the Oracle 11g migration to prepare
• The two main issues are still work in progress
  • The “Tier0 saga”, in many respects
    • Oracle-Frontier difference? Why did Oracle load increase (and what is ORA-25408)? How to use Frontier or CORAL server?
  • COOL validation on Oracle 11g, in many respects
    • Can we (must we) use the 10g CBO? In CORAL or in system? Are exec plans understood? Are all COOL use cases ok?
    • It will be hard to meet Luca’s early November deadline (work on this will only resume early November…)
Team effort of many in ATLAS and IT, thanks to all!