450 likes | 621 Views
Root Cause and Other DBA Urban Legends. Brian Hitchcock OCP 10g DBA Sun Microsystems brian.hitchcock@sun.com brhora@aol.com www.brianhitchcock.net. SunFed DBA. Brian Hitchcock October 19, 2006. Page 1. For DBA Issues. I'm told that I must find root cause,
E N D
Root Cause and Other DBA Urban Legends Brian HitchcockOCP 10g DBA Sun Microsystems brian.hitchcock@sun.com brhora@aol.com www.brianhitchcock.net SunFed DBA Brian Hitchcock October 19, 2006 Page 1
For DBA Issues • I'm told that I must find root cause, • Can't resolve the issue without root cause • Must have testing environment that is • Similar enough to production to recreate the issue • And the fix • Need extensive expertise to solve performance issues • 10046 trace • Wait events • Extents size/number • Physical spindles • Rebuilding indexes • Undocumented init parameters
What is My Experience • Centralized DBA support team • 2000+ databases • Database users open cases with DBA team • Cases randomly assigned to DBAs • Not assigned based on experience or expertise • My cases are probably typical of all the cases
My Experience • Finding root cause costs money • Building and maintaining test system(s) costs money • Test system • Needs to be able to recreate production issue • Database, network, apps and web servers, load balancers • Users around the world • Root cause and test system(s) are only worthwhile • If having them is less expensive than not having them. • Perhaps these are Urban Legends? • Root cause • Test system(s) • Specific expertise required
My Experiences • Review major cases I worked on over the last 2 years • Cases are presented in chronological order, oldest first • For each case • What worked, i.e. what was the real-world solution? • Was root cause identified? • Was a test system available to verify root cause • and the solution? • What extensive expertise was required?
Keeping Score • For each case, look for 3 things • Root cause • Test system(s) • Specific expertise • Listed earlier (10046, wait events, extents, etc.) • Were any of the following useful in resolving the issue? • Open Mind (anything can and will happen) • Communication (it's difficult) • Simple solutions • Any active db users? • Reboot one or more components?
Case 1 – CRM Local Language • Moved to UTF8, add local language columns • SQL ran slower • 10046 trace, explain plan • Found the single step slowing execution • Existing index wasn’t updated to have new local language column • Recreated index • Performance returned to normal
Case 1 – Three Things • Root cause • Existing indexes not optimal for new schema • Corrected indexes fixed the problem • Test system identified the issue • Specific Expertise • 10046 • Explain plan • Indexes
Case 2 – Storefront Slow • App support team reports database slow • Check database, 1-3 active users at most • Active users gone in 1-2 seconds • App users report 30 second response time • Have App support person use app • Watch database • See single active user connect, complete, disconnect • 2 seconds maximum • Remainder of response time • Web servers, load balancers etc. • Application server was locked up – reboot fixed it
Case 2 – Three Things • Root cause • Not identified • Test Environment • Exists, but not even close to production • Complex production application environment • Accurate test environment • Expensive to build and maintain • Hard to find qualified testers • No one completely understands how production was built • Impact of layoffs, outsourcing • Hard to recreate what you don’t understand • No specific expertise required
Case 3 – Contracts Slow • App users report message • ‘Problem contacting the database: No available resource…’ • Sar shows CPU 90% idle • Statspack snapshot for last hour shows top wait events • 70% CPU time • 20% db file read • App server rebooted, performance returns to normal
Case 3 – Three Things • Root cause unknown • How to determine what caused the app server to hang? • Test system does not have all the components of prod • Don’t know why production locked up • Tough to reproduce in test system • Simple solutions – no specific expertise required • Reboot app server • Low-risk, cheap, quick • No special experience required • If it doesn’t work, time to put more resources into the issue
Case 4 – Storefront • Users report app hanging • Few or no active users in database • Active users taking much longer than normal • Isolate single SQL statement that is running slow • Explain plans show good/bad execution plans • Changes from good to bad at random times • Various attempts to isolate give conflicting results • We assume that the problem is stable, it isn’t • 10053 trace, watch optimizer choose execution plan • Optimizer changes from good to bad plan • one column of one table has 2 versus 3 distinct values
Case 4 – Three Things • Root cause • not sure… • App developers gone (outsourced) • No one knows how code really works • Why are these values appearing and disappearing? • App is inserting/deleting the rows with the 3rd distinct value • This change causes optimizer to choose bad plan • Bug or feature? • Fix? • Create rows so column always has 3 distinct values
Case 4 – Three Things (cont’d) • No test system • Test system has small percentage of production data • App system too complex to reproduce • Multiple strings, load balancers, app servers • Hardware, support, upgrades, patches • Difficult to get testing resources – people are expensive • Changes are tested for functionality • Changes aren’t tested for performance • Until released in production • Specific expertise • Explain plan • 10053 trace
Case 5 – Customer Demo • Users report slow performance • Database shows 4 inactive sessions • Inactive sessions holding locks • Kill these 4 sessions • Performance is fine
Case 5 – Three Things • Root cause – unknown • Why some sessions inactive and holding locks? • App developers gone • Easier to kill sessions once in a while • Real root cause would be expensive to find • Test system • Don’t have test system – app developed on the cheap • Don’t know how app works • No specific expertise required • Kill database sessions
Case 6 – Executive Dashboard • User reports database is sorting dates incorrectly • Phone on mute – laugh • Users have really good drugs today! • Connect to db • Verify that indeed dates are being sorted backwards • From NLS experience, look at actual bytes of dates • There are extra bytes that I can’t explain • Go back to the docs for DATE type • Expand DATE format to include all the possible fields • Suddenly it all makes sense!
Case 6 – What’s the Story? • The dates are being correctly sorted – because • They are from BC! • OK, now what? • This is sales data from Fred Flintstone? • Is Barney setting up his own IT department? • Check the basics • App code uses 10g JDBC • User says this was a requirement • But the database is 9i – it just gets better and better • User finds Metalink note • 10g JDBC issues with 9i database • Doesn’t describe our issue, but it’s a start…
Case 6 – What to Do? • Setup a 10g database and test the app code against it • No test system, in fact, no other system of any kind • Critical executive reporting system, can’t be down for long • Very limited disk space • Why not upgrade existing 9i database? • Upgrades can cause problems • If this isn’t a 10g to 9i issue, why risk the db upgrade? • Install 10g db, full export 9i db, import into 10g db • User tests app code against 10g database • No more dates from BC! • Remove 9i database, expand 10g database to match • User happy • Executive’s reports don’t show sales to the Flintstones
Case 6 – Three Things • Root cause • Not clear • Tried 10g database and issue went away • Did we really identify the root cause? • Was it a feature of 10g JDBC? A bug? • Test system • None • Expertise required • Minimal configuration control would have prevented this • Hardest part was accepting what Oracle was telling us • Dates from BC • Not at all the way things are supposed to be
Case 7 – Customer Demo • Users calls, application is slow • Blocking processes, restart database twice in one day • App still slow • Ask user to connect and start using app • I watch database for active users • Over 20 seconds before new db user appears • Db user is done and gone in less than a second • User reports 30 seconds before results appear in browser • I tried ping between the db server and the app server • 200ms with packet loss • Ask network group to investigate • Network switch in data center has failed
Case 7 – Three Things • Root cause • Network switch – no question • This was not a database problem • But we could have wasted a lot of time with SQL tracing etc. • Need to confirm that the database is the problem • Then work the issue as a db tuning issue • No test system available • How would you reproduce the data center network? • Expertise required • Look for active users in database • ping between db and app server
Case 8 – Data Warehouse • User reports application slow • User can’t truncate a table – command hangs • Loading data into warehouse is also hanging • Watch database while user starts load process • During load • User is using third-party app to do the load • No active users performing inserts • When no load is happening • Truncate table runs quickly • User contacts app vendor • Vendor “changes some parameters for the load process” • Data load runs normally, truncate table runs normally
Case 8 – Three Things • Root cause • Looking at the db showed no active inserts during data load • Needed to verify that db wasn’t the issue • Force vendor to perform • Test system • This was a dev database so there was a ‘test’ system • Data load issues identified, resolved in dev environment • Expertise required • Quickly check for database problems • Communicate with user to understand what is happening • Politics – dealing with vendor • Vendor really, really wants it to be a database problem
Case 9 – Software Registry • User trying to truncate table • Part of a data feed process • Other users in database • Performing deletes on same table • Blocking truncate • Kill delete processes • Truncate still blocked • Watch database processes • Delete process starts every hour on the hour • Cron job starts delete hourly • Part of data feed process • Why don’t the app owners know this?
Case 9 – Three Things • Root cause • Unknown • Why was cron in place that user didn’t know about? • Test system – exists • But doesn’t have the production data feeds • Expertise required • None – basic DBA skills • Not a database performance issue • Basic app configuration was the issue
Case 10 – CRM Reporting • User reports app is too slow • Specific selects are taking 10-15 seconds • STATSPACK snapshots are being taken • Check snapshots over same time for last week • Performance of the top SQL hasn’t changed • User agrees that performance hasn’t changed • Resources not available to make performance better • When is a performance problem not a problem? • When you don’t want to pay to fix it
Case 10 – Three Things • Root cause • User perception • Test system • None • Expertise required • Document that performance hasn’t changed • Offer to work the issue if user will get funding
Case 11 – Configurator • Users getting error • Can’t allocate memory in shared pool • What is causing this? • Spend some time looking at the database • Looks normal • Reboot database to clear shared pool • Watch database after restart • Shared pool doesn’t fill up
Case 11 – Three Things • Root cause • Don’t know why shared pool fills up • Test system • How to recreate problem in test system? • Expertise needed • Basic DBA skills • Flush shared pool • Problem didn’t reoccur • One-time or infrequent set of circumstances
Case 12 – Pricing Database • User reports slow performance • Check database server • CPU and iowait very high • Watch SQL executing • Explain plan for worst SQL shows full table scans • No indexes on tables being scanned • Test server • Same tables do have indexes • While recreating indexes • All the needed indexes reappear • Cron job for pricing data runs every two weeks • Rebuilds indexes and loads data
Case 12 – Three Things • Root cause • What caused indexes to be dropped? • Unknown • Test system • Used to verify that indexes were missing • Can’t reproduce whatever dropped the indexes • Expertise required • Basic DBA skills
Case 13 – Contracts Database • User reports error when selecting from table • Invalid row id • Some queries on same table don’t report error • Looking at table and row ids • Some queries use index • No errors • Other queries use table • Specific rows have invalid row ids • Rebuild table • Create table as select * from <table>
Case 13 – Three Things • Root cause • Why did row ids become invalid? • Unknown • Test system • Exists, can’t reproduce cause of invalid row ids • Expertise required • Basic DBA skills
Case 14 – Software Download • User reports application slow in Dev environment • SQL so slow, application server times out • Sar shows CPU near 0% idle • User executes problem SQL • Watch database • SQL completes, very slowly, database is working • Explain plan shows full table scans • Look at tables involved • Production – last analyzed NULL, indexes, runs fast • Test – recently analyzed, indexes, runs slow • Drop stats, rebuild indexes, reanalyze • Performance returns to normal
Case 14 – Three Things • Root cause • What happened to indexes, statistics in production? • Unknown • Test system • Exists • Doesn’t match production – which is correct? • Can’t reproduce dropped indexes and/or statistics • Expertise required • Basic DBA skills
Case 15 – BRIO • User reports SQL failing with error • Can’t allocate memory in shared pool • Watch database • Same SQL runs most of the time • Error occurs once in a while • Solutions • Reduce sort area size • Free up more memory for shared pool • Increase physical memory assigned to database • Error occurred so infrequently • Users decided not to change system
Case 15 – Three Things • Root cause • Unknown • Just too many users for a brief time? • Test system • Yes, but how to reproduce this issue (user load)? • Expertise required • Basic DBA skills
Case 16 – Revenue App • Automated alert • Report log switching hung • Database stopped for transactions • Examine database • Can’t find anything wrong • Restart database • Log switch works • No further alerts • No user problems
Case 16 – Three Things • Root Cause • Unknown • Test system • Exists, but no help • Expertise required • Basic DBA skills
Observations • How many times did we • Identify root cause? • Have a complete test system? • Need specific expertise? • Un-scientific results • Root cause – 2 out of 16 = 13% • Test system – 3 out of 16 = 19% • Specific expertise – indexes 1/16 = 6% • Simple solutions worked 94% of the time
Conclusions • Brian’s off his meds • Ignore him – this isn’t typical • Urban legends? • You don’t have to find root cause • You don’t have to have a complete test environment • You don’t need extensive expertise in specific DBA areas • You do need • DBA experience • Open mind – strange stuff happens all the time • Communication skills – you don’t know what’s happening • Looking at training resources • Why become more expert at things that are rarely needed? • Why not become familiar with things you know little about?
Opinion • Specific expertise can be obtained when needed • Based on the results shown • 94% of the time, don’t need expertise to be productive • 15 of 16 cases solved with general DBA skills • Some expertise can be automated or outsourced • Tracing, wait events • Focus on skills that can’t be put into a GUI • Communication • Ability to solve problems, not just database issues