Summary of ORA-07445 COOL tests for ATLAS
Andrea Valassi (IT-ES), for the Persistency Framework team
ATLAS Database Meeting, 23rd August 2010
Thanks to the PDB team in IT and to the DB team in ATLAS!
Introduction
• Signature of the problem
  • Server process crash, “ORA-07445: exception encountered: core dump [ksxpmprp()+267] [SIGSEGV]” in trace files
• First observation of the problem
  • ATLAS and LHCb production databases on 2010 June 2-3
  • Observed primarily (only?) on COOL applications
  • Service degraded (high load spikes, connections refused…)
  • https://twiki.cern.ch/twiki/bin/view/PDBService/DBServicePostMortem#Database_issues_after_patching_J
• Problem observed after applying the Oracle June PSU
  • Rolled back the June PSU until the problem is better understood
• Subsequent efforts concentrated on reproducing the problem on a test database, in order to validate the possible patches
Analysis of COOL nightly tests
• Early tests in June (suspect: connection sharing)
  • Attempted to reproduce the problem with a simple OCI-only test using connection sharing: it failed to cause any issue
  • Explanation a posteriori: high load is needed to trigger it
• Analysis of COOL nightly tests
  • June PSU was not rolled back on test1 (lcg_cool_nightly)
  • From alert logs of test12: “ORA-07445 [ksxpmprp()+267]” happened 15 times in two months (May 30 to August 2)!
  • Always associated with “update sys.aud$” on disconnecting
  • Always the same test, “test_RelationalCool_RelationalFolder”
  • Client applications (COOL nightly tests) succeed: the pattern is a crash of the server-side process when clients disconnect
  • Did not look explicitly for spikes of high load or freezing…
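The alert-log count quoted above can be reproduced with a simple scan for the crash signature. The following is a minimal Python sketch, not the procedure actually used: it assumes each crash appears on a single line containing the message text quoted in the Introduction, and the sample lines are synthetic, written only to exercise the function.

```python
import re
from collections import Counter

# Assumption: one line per crash, carrying "ORA-07445 ... core dump [<function+offset>]".
CRASH = re.compile(r"ORA-07445.*?core dump \[([^\]]+)\]")


def count_crashes(lines):
    """Count ORA-07445 occurrences, grouped by the reported function+offset."""
    counts = Counter()
    for line in lines:
        match = CRASH.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts


# Synthetic sample lines (not taken from a real alert log).
sample = [
    "ORA-07445: exception encountered: core dump [ksxpmprp()+267] [SIGSEGV]",
    "an unrelated log line",
    "ORA-07445: exception encountered: core dump [ksxpmprp()+267] [SIGSEGV]",
]

for location, n in count_crashes(sample).items():
    print(f"{location}: {n} occurrence(s)")
```

Grouping by function and offset makes it easy to check that every crash shares the same ksxpmprp()+267 signature rather than being a mix of unrelated ORA-07445 errors.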
COOL-based nightly tests
• Based on COOL nightlies, developed a test script to maximize chances of reproducing ORA-07445
  • See https://savannah.cern.ch/task/?16836
• The script ora07445.csh internally executes several cycles of the test_RelationalCool_RelationalFolder executable
  • http://cool.cvs.cern.ch/cgi-bin/cool.cgi/cool/contrib/ExternalTests/OracleConnectionSharing/ora07445.csh?rev=1.5&content-type=text/vnd.viewcvs-markup
• Typically I ran several scripts in parallel
  • http://cool.cvs.cern.ch/cgi-bin/cool.cgi/cool/contrib/ExternalTests/OracleConnectionSharing/allOra07445.csh?rev=1.4&content-type=text/vnd.viewcvs-markup
• Using 30 scripts with 30 cycles each (~4 hours), I reproducibly observed ~90 occurrences of ORA-07445!
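The load pattern described above (several scripts in parallel, each running a fixed number of cycles of one executable) can be sketched as follows. This is a hypothetical Python driver, not a transcription of ora07445.csh or allOra07445.csh, whose contents are not shown here; the function names are illustrative and the command is a harmless placeholder to be replaced by the real test executable.

```python
import subprocess
import sys
from concurrent.futures import ThreadPoolExecutor


def run_cycles(command, cycles):
    """Run `command` sequentially `cycles` times and return the exit codes."""
    return [subprocess.run(command, check=False).returncode for _ in range(cycles)]


def run_parallel(command, workers, cycles):
    """Start `workers` concurrent loops, each executing `cycles` runs of `command`."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = [pool.submit(run_cycles, command, cycles) for _ in range(workers)]
        return [future.result() for future in futures]


if __name__ == "__main__":
    # Placeholder command (assumption): substitute the real test executable.
    placeholder = [sys.executable, "-c", "pass"]
    results = run_parallel(placeholder, workers=3, cycles=2)
    total = sum(len(codes) for codes in results)
    failures = sum(code != 0 for codes in results for code in codes)
    print(f"{total} runs, {failures} non-zero exit codes")
```

Since the client tests succeed even when the server process crashes, exit codes alone would not reveal the problem: the ORA-07445 occurrences have to be counted on the server side.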
Database ‘freezing’ in COOL tests
• Observed freezing of database server
• Observed freezing of client CPU at same time…
Which applications are hit?
• COOL applications with connection sharing
  • Clearly demonstrated using the COOL-based test script
  • No errors observed if connection sharing is disabled
  • But no attempt was made to understand this better qualitatively or quantitatively…
• Applications other than COOL can be hit too
  • The problem was also observed in other cases:
    • On int8r: ATLAS tags (CORAL-based POOL collections)
    • On other production databases? (e.g. LCGR?)
  • I do not know whether these use OCI connection sharing
Validation of 6196748 patch
• Several tests on int8r (thanks to Marcin!)
• No June PSU: OK…
  • No ORA-07445
• June PSU: NOT OK
  • ORA-07445 appeared
  • COOL test: ~90 errors in 900 cycles (30x30)
• June PSU and July PSU: NOT OK
  • ORA-07445 still there
• June PSU, July PSU and 6196748 patch: OK!
  • ORA-07445 disappeared
  • COOL test: 0 errors in 300 cycles (30x10), where ~30 would be expected at the unpatched rate
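The "expected ~30" figure follows from the unpatched error rate. As a rough check in Python, assuming (this is not stated in the slides) that cycles fail independently at a constant rate, one can also estimate how unlikely 0 errors in 300 cycles would be if the patch had no effect:

```python
def error_rate(errors, cycles):
    """Observed fraction of cycles that produced an ORA-07445."""
    return errors / cycles


def prob_zero_errors(rate, cycles):
    """Probability of no error in `cycles` independent cycles at a fixed rate."""
    return (1.0 - rate) ** cycles


# Figures from the slide: ~90 errors in 900 cycles with the June PSU applied.
baseline_rate = error_rate(90, 900)

# Expected number of errors in the 300-cycle run if the rate were unchanged.
expected_errors = baseline_rate * 300

# Chance of seeing 0 errors in 300 cycles under the independence assumption.
p_zero = prob_zero_errors(baseline_rate, 300)

print(f"baseline rate: {baseline_rate:.0%}")
print(f"expected errors in 300 cycles: {expected_errors:.0f}")
print(f"P(0 errors | unchanged rate): {p_zero:.1e}")
```

Under these assumptions the probability is vanishingly small, so a clean 300-cycle run is strong evidence that the patch removed the crash and not a statistical fluke.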
Reusing the test script?
• COOL test script can now be used by PDB team
  • Further tests of the ORA-07445 issue
  • Generate some COOL load with/without connection sharing
• Note however that this is not fully representative of production-like activities
  • Much more DDL (create/drop tables) than in production!