360 likes | 572 Views
Ten Database Mysteries Chris Lawson Database Specialists, Inc. www.dbspecialists.com clawson@dbspecialists.com. Focus of Presentation. Explore some “strange” database problems that have baffled some DBAs
E N D
Ten Database MysteriesChris LawsonDatabase Specialists, Inc.www.dbspecialists.comclawson@dbspecialists.com
Focus of Presentation • Explore some “strange” database problems that have baffled some DBAs • Most of the mysteries occurred on critical production systems, although some were on development systems • ALL of the mysteries were eventually explained • Depending on your personal experience, some of these “mysteries” will seem trivial or commonplace; others will indeed seem mysterious • Most mysteries have a simple explanation • Most mysteries have a simple fix
Why Spend Time on These Database Mysteries? • Each DBA has a unique set of experiences and biases. What one DBA thinks is obvious, another will not. • An Oracle “detective” is part scientist, part artist. Many solutions require creativity, not just logic. • A superior DBA will look for ways to “stretch” and learn ways to handle difficult problems. • Without working out difficult problems, you will not advance as a DBA. • You will be the “hero” if you encounter a mystery and solve it; remember the solution--you may see it again!
A Word About Oracle Versions • This presentation was originally written in 1998 • Most of these mysteries involve Oracle 7 databases • Although some of the mysteries might not apply directly to Oracle 8i, they still offer insight into the problem-solving process
1 The Case of the Berserk Application Clue #1: “Big Phone Company” 1997 • Using HPUX, Oracle 7.3.2.3 • Help desk application (Vantive) that connects to Oracle database suddenly goes berserk, creating thousands of connections • Program had worked normally for many months • DBAs watch helplessly as CPU load driven from 1 to 50 • As DBAs kill extra processes, more take their place • Alert log and recent trace files show nothing unusual • DBAs are united in accusing the application as the culprit
Berserk Application(continued) Clue #2: “Big Publisher Ltd.” 1998 • Running Sun Solaris, Oracle 7.3.2.3 • Users complain that performance has degraded in recent months • Manager states that “something must be wrong with the network” • Application is CORIS, a document management/printing application • DBA investigates. Discovers that time to connect in SQL*Plus is 30-45 seconds, even though server load is low • Connect time is bad whether remote (PC) or directly on server • Server load (file I/O and CPU) is generally low
Berserk Application: Solution • OTRACE is the culprit. It is active by default on many 7.3 Oracle versions • Excerpt from Oracle Corporation Alert: “Problems described here can occur when Oracle Trace is not configured and is widely enabled.”
Berserk Application: Solution(continued) To Detect: • Check directory ORACLE_HOME/rdbms/otrace: As size of files process.dat and regid.dat approach 10mb, problems arise -rw-r--r-- 1 oracle dba 3161424 Jun 05 09:43 process.dat -rw-r--r-- 1 oracle dba 263808 Jun 05 09:43 regid.dat • To correct: simply remove these two files, then issue command otrccref
Berserk Application: Solution(continued) To Prevent: • Add line to listener.ora for each database (after ORACLE_HOME): ( ENVS=’EPC_DISABLED=TRUE’) • Set and export environment variable EPC_DISABLED=TRUE for all users. Put standard profile in /etc directory • Restart all databases and restart listener
2 The Case of the Reluctant Patch Background: • To correct several bugs, decision is made to upgrade from 7.3.2.2 to 7.3.2.3 (HPUX) • Patch is obtained from Oracle and applied to test server. DBA notes that patch ran very quickly and runs again “just to be sure” • Bug is now gone on test server
The Case of the Reluctant Patch(continued) Problem: • Patch is similarly applied to production server--same operating system and version. • Production application is tested, but bug is still there! • Another DBA reviews patch file, location, etc. All seem correct.
The Case of the Reluctant Patch: Solution • DBA happens to notice that upon SQL*Plus startup, database is 7.3.2.2! • The patch was really only applied on the second run. This is apparently a quirk in the patch readme file. • The command what oracle (then grep for patch#) can be used to determine which patches are applied
3 The Case of the Sleazy SQL • “Big Publisher Ltd.” runs an MRP system called “AVALON,” similar to Oracle Manufacturing. Database stores inventory, part information, vendors, etc. • Server is ATT3555, running NCR UNIX. Database is Oracle 7.1.6 • Issue: Users report that certain common operations are very slow
The Case of the Sleazy SQL(continued) • DBA investigates and queries v$sqlarea using: SELECT sql_text FROM v$sqlarea WHERE disk_reads/executions > 1000; • Query yields troublesome SQL statement, with these stats: DISK_READS PER EXECUTION =5,000 BUFFER_GETS PER EXECUTION =5,100
The Case of the Sleazy SQL(continued) • Statement has been accidentally designed to ensure worst possible performance by making index usage impossible: SELECT * FROM ABC WHERE NVL (COL_W) = NVL (:1) AND NVL (COL_X) = NVL (:2) AND NVL (COL_Y) = NVL (:3) AND NVL (COL_Z) = NVL (:4); • TABLE ABC SIZE = 3mb, about 25,000 rows • DBA requests developers to alter statement to eliminate NVL (COL_N) functions • DBA advised that no resources available to make change • Problem: If code can’t be changed, what can be done to improve performance?
The Case of the Sleazy SQL:Solution • CACHE the table! For example: alter table xyz cache; • Normally, blocks from full-table scans are designated for rapid age-out; otherwise, they would “wipe-out” the db cache. Cache of table causes blocks to be treated “normally.” • Caching table disables rapid age-out of this table • Logical reads will not be reduced, but disk reads approach zero! • Note: DB_BLOCK_BUFFERS was slightly increased to compensate for the cached table that now consumes a few megabytes of database cache
4 The Case of the Non-Optimal Optimizer • A large software company based in “Cedar Shores” has designed a large financials application. Program has been tuned for Rule Based Optimizer. • The application runs very well, is a mature product, which is used in thousands of companies around the world. • Some users clamor for new features: more horns and whistles
The Case of the Non-Optimal Optimizer(continued) • The new development team, afraid to become obsolete, wants to convert to Cost-Based Optimizer (CBO). They also wisely consider that Oracle recommends using CBO on new projects. • The older developers, now nearing peaceful retirement, predict disaster if the database is switched to CBO, because the execution plans will change. • Issue: How can Optimizer be selectively switched to CBO without changing the code?
The Case of the Non-Optimal Optimizer: Solution • Simply substitute a view having a “hint” for the table needing CBO For example: rename DEPT to DEPT_ORIG create view DEPT as select / * + ALL_ROWS */ * from DEPT_ORIG; • Now, application will use the VIEW when it looks for DEPT • All queries using DEPT will use CBO • Note: Upon renaming a table, the indexes and constraints will “move” with the table; however, synonyms and grants may need to be reset.
5 The Case of the Forgetful Memory • A new internet-transaction application, ECXpert, and its database have been installed on a Sun Ultra Enterprise Server • Sun Solaris 2.5.1, Oracle 7.2.3 • Application appears to run smoothly for several months, although it occasionally creates large dump files • Trace files appear occasionally with ORA-4030 “Out of Process Memory” and recommends “increase process memory quota” • Server seems to hang occasionally. Server reboot fixes • SysAdmin checks kernel parameters related to memory. All correct and match other servers. Not using any large stored procedures • Problem: What is causing memory/hang problems?
The Case of the Forgetful Memory: Solution • DBA checks /tmp (swap area on server) and notes 99% consumed • Investigation reveals that application occasionally goes berserk and consumes ENTIRE SWAP area with log files • Deletion of log files does not return disk space, since application is still “holding” the files • Reboot of server cleaned up /tmp area, thereby correcting problem • Suggestion: If memory-related error messages exist, check swap area first
6 The Reluctant Index Affair Background: • DBA asked to analyze and tune Australia manufacturing database. Database is running CBO. One particularly bothersome SQL statement is identified • The WHERE condition is perfect for a new index, because of its excellent selectivity • Index is quickly created. Table is also analyzed Problem: • Even though index is a “perfect” solution to the query, a full table scan is used instead
The Reluctant Index Affair:Solution • The values in the table are very lopsided. Optimizer, however, will assume uniform distribution, which is incorrect in many cases • Re-analyze and specify histogram: ANALYZE TABLE XYZ FOR ALL INDEXED COLUMNS SIZE 75 • This creates histogram of 75 “buckets” for each indexed column • With these statistics, optimizer will “know” how values are distributed, and will more often make right decision to use an index or not
7 Mystery of the Hanging Database • At random intervals, a 7.3.2.3 database hangs. No trace files, and nothing unusual in the alert log. • When problem occurs, no response to new connections requests; over 1200 existing connections “hang.” • Oracle Support is alerted to priority 1 problem; experts across the world investigate for weeks • DBA is using OEM Lock Manager tool and notices user who is blocking about 25 other users. The hang occurs soon after. • Oracle Australia recommends checking indexes. This suggestion led to the solution. Problem: • How did index problems hang database?
Mystery of the Hanging Database: Solution • Application design flaw • There are hundreds of foreign keys in the database; 99% had indexes. A few did not, violating good design practice. When batch program began updates, locking increased rapidly. • Without FK index, updates on parent table completely block updates on child (vice versa for 7.1.6) • Reference: Server Application Developers Guide • Although not admitted as database “bug,” database was overwhelmed by the locks • Once indexes on all FK’s created, problems disappeared
8 The Case of the Mysterious Package • Manufacturing application was installed on a Sun Ultra 3000 server. A small database was created for testing purposes. Oracle version 7.2.3. Shared pool size about 60mb. • At first, all went well. Then, seemingly randomly, when the users began to try new features, they would receive a “funny” error message and the application failed. • A trace file recommended increasing shared pool Problem: • How can application fail with such a sizable shared pool? • Aside from massive increase in shared pool, what can be done?
The Case of the Mysterious Package: Solution • The application uses about 20 massive PL/SQL packages. Some are 5x the SYS.STANDARD package. When a package load is attempted, it will not fit in the shared pool. • Memory-intensive packages should be “pinned” or “kept” in shared pool after database startup EXECUTE SYS.dbms_shared_pool.keep ('OBJECT_NAME'); • But first, must find the “big” packages (will also list SYS.STANDARD): SELECT owner, name, sharable_mem FROM v$db_object_cache WHERE sharable_mem > 100000;
The Case of the Mysterious Package: Solution(continued) • Example script to find “big” packages and generate SQL script to “pin” them in memory SELECT 'EXECUTE SYS.dbms_shared_pool.keep('''|| owner||'.'||name||''');' FROM v$db_object_cache WHERE sharable_mem > 100000 AND type NOT IN ('VIEW', 'SYNONYM', 'TABLE') AND name NOT LIKE '%SHARED_POOL%' AND owner IS NOT NULL;
9 The Case of the Uncooperative Rollback • In mid-afternoon, DBA (running “OEM Top Sessions”) notices many users “ACTIVE” but showing 0 file I/O. Lock Manager reveals one user performing big update blocking all. • Culprit tracked down--agrees to be terminated. DBA disconnects session. • Locks are not released, but user is “marked for kill.” • Very little file I/O activity. Alert log shows very slow switching of redo logs. • DBA performs shutdown abort then startup. Database starts up after 2 minutes. All is well. Problem: • Why did user not rollback and release locks?
The Case of the Uncooperative Rollback: Solution • If session is terminated, speed of rollback is proportional to init.ora parameter CLEANUP_ROLLBACK_ENTRIES • If default value (20) is used, rollback of killed session can take 50x time of original update. Alternatively, shutdown abort/startup cleans up database much faster. • Rationale: Parameter prevents rollback of one user from hogging all the resources on a busy system • Solution: Increase parameter to reduce rollback time (since shutdown abort is usually not an option)
10 The Singular Case of the Phantom Users • A manufacturing database in Sydney, Australia, needed performance tuning. SQL tuning on US databases had yielded good results. • The table v$sqlarea was queried to find resource-intensive SQL statements. Several commonly-run statements were isolated. Performance was improved through index additions. • Statistics were re-examined over the next 4 hours, in order to confirm improvements. • However, repeated looks at execution statistics showed no change. • DBA puzzles over enigma for several hours, then realizes that NOTHING is WRONG! What did he finally realize?
The Singular Case of the Phantom Users: Solution • Nothing is wrong because the users were still asleep. It was only 5:00am in Sydney!
11 The Case of the Slow Physician(Bonus Mystery) • Health application is experiencing slow run times. Analysis shows following SQL statement causing 3000 disk reads • COSIGN_VIEW is a join of 2 tables (DOCS + COSIGN), joined on patient_id (indexed) • Search criteria ‘DR. MCKENZIE’ is very selective; thus, nested loop IS expected choice for optimizer, with DOCS as Driving table. SELECT * from COSIGN_VIEW WHERE doctor_id = 'DR. MCKENZIE'
The Case of the Slow Physician(continued) • Even with index on DOCS(doctor_id), optimizer (CBO) insists on using hash-join, and refuses to ever use index on doctor_id! • Repeated analyze table commands do not correct • Substituting query not using a view yields expected NL result Problem: • Why does using the view cause optimizer to make the “wrong” choice?
The Case of the Slow Physician:Solution • Everything seemed to point to a problem with the view, because all worked normally as long as the view was excluded • Finally, DBA compared view definition (in OEM Schema Manager) to definition seen using “describe table” syntax. The columns did NOT match! • Examining the object-create script revealed that the view switched column names, so that column DOC_ID in the view did NOT match DOC_ID in the table! • Once the correct column was indexed, a Nested-Loop Join was selected by the optimizer
Contact Information Chris Lawson clawson@dbspecialists.com http://www.dbspecialists.com Database Specialists, Inc. 388 Market Street, Suite 400 San Francisco, CA 94111