Summary of ORA-07445 COOL tests for ATLAS

Andrea Valassi (IT-ES), for the Persistency Framework team. ATLAS Database Meeting, 23rd August 2010. Thanks to the PDB team in IT and to the DB team in ATLAS!

Presentation Transcript


  1. Summary of ORA-07445 COOL tests for ATLAS
     Andrea Valassi (IT-ES), for the Persistency Framework team
     ATLAS Database Meeting, 23rd August 2010
     Thanks to the PDB team in IT and to the DB team in ATLAS!

  2. Introduction
     • Signature of the problem
       • Server process crash, “ORA-07445: exception encountered: core dump [ksxpmprp()+267] [SIGSEGV]” in trace files
     • First observation of the problem
       • ATLAS and LHCb production databases on 2010 June 2-3
       • Observed primarily (only?) on COOL applications
       • Service degraded (high load spikes, connections refused…)
       • https://twiki.cern.ch/twiki/bin/view/PDBService/DBServicePostMortem#Database_issues_after_patching_J
     • Problem observed after applying the Oracle June PSU
       • Rolled back the June PSU until the problem is better understood
       • Subsequent efforts concentrated on reproducing the problem on a test database, in order to validate the possible patches

  3. Analysis of COOL nightly tests
     • Early tests in June (suspect: connection sharing)
       • Attempted in June to reproduce the problem with a simple OCI-only test with connection sharing: failed to cause any issue
       • Explanation a posteriori: high load is needed to trigger it
     • Analysis of COOL nightly tests
       • June PSU was not rolled back on test1 (lcg_cool_nightly)
       • From alert logs of test12: “ORA-07445 [ksxpmprp()+267]” happened 15 times in two months (May 30 to August 2)!
       • Always associated with “update sys.aud$” on disconnecting
       • Always the same test, “test_RelationalCool_RelationalFolder”
       • Client applications (COOL nightly tests) succeed: the pattern is a crash of the server-side process when clients disconnect
       • Did not look explicitly for spikes of high load or freezing…
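
The crash count quoted from the alert logs could be obtained with a scan along these lines. This is only a minimal sketch, not the procedure actually used: the function name `count_ora07445` and the sample lines are hypothetical, and the only strings taken from the slides are the error code and the `ksxpmprp()+267` location.

```python
import re
from typing import Iterable

# Error code and crash location as quoted on the slides.
ORA_07445 = re.compile(r"ORA-07445.*\[ksxpmprp\(\)\+267\]")


def count_ora07445(lines: Iterable[str]) -> int:
    """Count lines that report ORA-07445 at ksxpmprp()+267."""
    return sum(1 for line in lines if ORA_07445.search(line))


# Illustrative input only; real alert log lines carry more context.
sample = [
    "ORA-07445: exception encountered: core dump [ksxpmprp()+267] [SIGSEGV]",
    "an unrelated log line",
    "ORA-07445: exception encountered: core dump [ksxpmprp()+267] [SIGSEGV]",
]
print(count_ora07445(sample))
```

For a real log, the same function could be fed an open file object, since it only needs an iterable of lines.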

  4. COOL-based nightly tests
     • Based on the COOL nightlies, developed a test script to maximize the chances of reproducing ORA-07445
       • See https://savannah.cern.ch/task/?16836
     • The script ora07445.csh internally executes several cycles of the test_RelationalCool_RelationalFolder executable
       • http://cool.cvs.cern.ch/cgi-bin/cool.cgi/cool/contrib/ExternalTests/OracleConnectionSharing/ora07445.csh?rev=1.5&content-type=text/vnd.viewcvs-markup
     • Typically I ran several scripts in parallel
       • http://cool.cvs.cern.ch/cgi-bin/cool.cgi/cool/contrib/ExternalTests/OracleConnectionSharing/allOra07445.csh?rev=1.4&content-type=text/vnd.viewcvs-markup
     • Using 30 scripts with 30 cycles each (~4 hours), I reproducibly observed ~90 occurrences of ORA-07445!
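
The "several scripts in parallel, several cycles each" pattern can be sketched as follows. This is not the content of ora07445.csh or allOra07445.csh (which are csh scripts and are not reproduced here); the function names `run_cycles` and `run_parallel` are hypothetical, and the demo uses a stand-in command in place of the real test executable.

```python
import subprocess
import sys
from concurrent.futures import ThreadPoolExecutor
from typing import List, Sequence


def run_cycles(command: Sequence[str], cycles: int) -> int:
    """Run `command` `cycles` times in sequence; return the number of non-zero exits."""
    failures = 0
    for _ in range(cycles):
        result = subprocess.run(
            command, stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL
        )
        if result.returncode != 0:
            failures += 1
    return failures


def run_parallel(command: Sequence[str], workers: int, cycles: int) -> List[int]:
    """Start `workers` concurrent loops of `cycles` runs each; return failures per worker."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = [pool.submit(run_cycles, command, cycles) for _ in range(workers)]
        return [f.result() for f in futures]


if __name__ == "__main__":
    # Stand-in command so the sketch runs anywhere; a real run would invoke
    # the test_RelationalCool_RelationalFolder executable instead.
    demo = [sys.executable, "-c", "pass"]
    print(run_parallel(demo, workers=3, cycles=2))
```

The configuration described on the slide would correspond to `workers=30, cycles=30`. Because the client tests succeed even when the server process crashes, the exit codes collected here only flag client-side failures; the ORA-07445 count still has to come from the server-side logs.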

  5. Database ‘freezing’ in COOL tests
     • Observed freezing of database server
     • Observed freezing of client CPU at same time…

  6. Which applications are hit?
     • COOL applications with connection sharing
       • Clearly demonstrated using the COOL-based test script
       • No errors observed if connection sharing is disabled
       • But no attempt was made to understand this better qualitatively or quantitatively…
     • Applications other than COOL can be hit too
       • The problem was also observed in other cases:
         • On int8r: ATLAS tags (CORAL-based POOL collections)
         • On other production databases? (e.g. LCGR?)
       • I do not know if these use OCI connection sharing or not

  7. Validation of 6196748 patch
     • Several tests on int8r (thanks to Marcin!)
     • No June PSU: OK…
       • No ORA-07445
     • June PSU: NOT OK
       • ORA-07445 appeared
       • COOL test: ~90 errors in 900 cycles (30x30)
     • June PSU and July PSU: NOT OK
       • ORA-07445 still there
     • June PSU, July PSU and 6196748 patch: OK!
       • ORA-07445 disappeared
       • COOL test: 0 errors in 300 cycles (30x10), expected ~30
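
The "expected ~30" figure follows from scaling the error rate seen with the June PSU alone to the shorter run. The sketch below redoes that arithmetic and, under the added assumption (not made on the slides) that cycles fail independently at that rate, estimates how unlikely 0 errors in 300 cycles would be by chance. The input of 90 errors is itself approximate (the slides quote ~90).

```python
# Numbers from the slides.
errors_before, cycles_before = 90, 900   # June PSU only (30x30), ~90 errors
cycles_after = 300                       # with patch 6196748 (30x10)

rate = errors_before / cycles_before     # observed errors per cycle
expected = rate * cycles_after           # expected errors if the patch had no effect

# Assumption: independent cycles, each failing with probability `rate`.
p_zero = (1 - rate) ** cycles_after

print(f"rate = {rate:.2f}, expected = {expected:.0f}, P(0 errors) = {p_zero:.1e}")
```

Under that independence assumption the chance of seeing no errors by accident is negligible, which supports reading the result as an effect of the patch rather than luck.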

  8. Reusing the test script?
     • COOL test script can now be used by PDB team
       • Further tests of the ORA-07445 issue
       • Generate some COOL load with/without connection sharing
     • Note however that this is not fully representative of production-like activities
       • Much more DDL (create/drop tables) than in production!
