480 likes | 825 Views
The DB2 Detective – The Case of the Diligent DBA. Bill Minor IBM. Sept. 2009. Highlights. No matter the extent of preparation, it seems that unexpected outages pop up when least expected. As a result, unplanned downtime can cause
E N D
The DB2 Detective – The Case of the Diligent DBA Bill MinorIBM Sept. 2009
Highlights No matter the extent of preparation, it seems that unexpected outages pop up when least expected. As a result, unplanned downtime can cause significant impact to business operations. When misfortune strikes, there are always two predominant questions, "What happened?" and "When will the system be available again?" Mysteries can be resolved by good DB2 Detective work - identifying and circumventing actions that can lead to problems. When unexpected problems do occur, it is vital to be able to act decisively and swiftly. The goal of this session is to introduce database inspection tools and provide insights that can be used to help monitor the integrity and maintain the consistency of your data. I will also provide guidance for salvage, repair, recovery, and mitigation operations that are vital when crisis strikes and time is of the essence.
Help with defining a problem Help with collecting and understanding information/diagnostics related to a problem Tools to help in the event of an outage Tools to help avoid outages in the first place Features to help avoid, control, and reduce the impact of an unexpected problem (Technology Preview or TP) RAS (Reliability, Availability and Serviceability) Agenda
In the Beginning …. • ‘There is an outage’ !!!!!! Where do we start? …… Take a deep breath … (DB2 911) • Keep it simple to begin; fundamentals are essential • Order of Investigation/ ‘Case’: • Problem Description • Problem Determination • Problem Resolution
‘Tools’ in the Detective Kit • The problem description • db2diag.log • db2trc • db2pd • db2dart and inspect • db2top • db2exfmt • admin functions • db2cos, db2support, db2fodc • Resiliency (mitigation/toleration/avoidance)
An ‘Outage’ • “Outage” – subjective terminology: • Data availability: • slow access • no access • Unexpected crashes • Notes: • There is always a trade-off between how much time can be spent investigating a problem and making system available again to users • Other factors: risk mitigation; establishing cause; development of a workaround (in absence of a ‘fix’ obviously don’t want to hit this again) An outage can be planned (i.e. maintenance) or unplanned (unexpected errors, terminations, slow or unavailable access to data, …). ‘Path to DBA diligence’ • Handling and avoiding unplanned outages via: investigation, problem definition, monitoring, prevention, education, …… (hopefully something from this presentation ) …
What Information Needs to be Collected? • Each problem has it own set of one or more signatures • Problems (even those with the same source – so ‘same problem’) can manifest themselves in different ways • Systems and environments are different: hardware, software, different users, different data (scope and volume), … • There is a set of fundamental or initial information/diagnostics that needs to be collected for every problem .. No matter what • Keep in mind that additional information or diagnostics could be required depending on the initial analysis investigation • Goal: To minimize the amount of effort required in collecting diagnostics
The Problem Description • VERY VERY VERY IMPORTANT (may seem obvious) • Avoid taking the wrong path • Invest in: • Establishing “What happened?” • Getting the correct information and diagnostics REQUIRED FUNDAMENTALS: • What is the current state/situation? • Describe what happened and what needs to happen • Include recent problem or change history • Client/Server configuration • What changed? No really, what changed? (Usually something has but just not apparent how, why or when) • Include specific actions known to be performed at the time of the incident • Product version and release number (FP? Special build?) • SQL error codes • Scope: System Wide? Instance Wide? Database? Application(s)?
Identification of such an issue requires a baseline – a before and after comparison Typically the issue is, ‘things are slow’ (really mean ‘slower’) Slow query After running RUNSTATS After applying a fixpack After migrating to a new release of DB2 After making a database configuration change Performance of triggers, stored procedures, UDFs with SQL, crashes/traps during query execution Differentiate between overall system performance and specific actions Date/time correlation? Other maintenance activities? Different systems? (db2pd –osinfo) Performance Issues – What? and When?
Problem ‘Scope’ • Scope and Cause are required in order develop and execute action plan; action plan may be ‘straightforward’ i.e. restore the database • Manifest itself in one of the following outage scenarios: • Instance trap – db2start required • Database trap • Application failure
DB2 Build Level Operating System type and level DB2 registry variable settings MPP? SMP? Single-processor? Database partitioned? Which partition? Syntax of command/statement associated with issue DB2 error received db2diag.log file which sufficient history Notification log DB2 Trap files (Call stacks), DB2 dump files History File db2 trace Messages file Core file, system error logs (Unix:errpt Windows: event logs) Log files and log file control structures … Diagnostics to Collect at the Outset Fundamental
Where is it? $HOME/sqllib/db2dump/ What is it? ‘Supplementary’ diagnostic logging facility How to read? Each entry: Time ordered entries Database specific; Database node/partition (i.e. DPF specific) Process/thread ID; Application specific identifiers/handles Function name where entry was generated Messaging and perhaps some data 2009-03-03-10.26.14.523824-300 I37414837E502 LEVEL: Error PID : 8083 TID : 54 PROC : db2sysc 20 INSTANCE: svtdbm3 NODE : 020 DB : MASUN1 APPHDL : 0-138 APPID: *N0.svtdbm3.090303150827 AUTHID : SVTDBM3 EDUID : 54 EDUNAME: db2agntp (MASUN1) 20 FUNCTION: DB2 UDB, buffer pool services, sqlbDMSCreatePool, probe:800 MESSAGE : ZRC=0x850F000C=-2062614516=SQLO_DISK "Disk full." DIA8312C Disk was full. Can be parsed with diag tool and/or admin function (examples in Backup Slides) db2diag.log file
Useful for problems/issues that can be recreated Generates a flow and format of the code paths taken by DB2 while the trace is active Traces all DB2 activity in an instance on a physical node (database partition). If many logical db2 nodes, all activity is traced Trace to memory buffer or file (slower) Trace to file for trace wraps Always include corresponding db2diag.log file with a db2 trace Examples: db2trc on –l 8M … trace to memory db2trc on –f mytrace.trc … trace to file db2trc inf …. Information returned when trace is active) db2trc flw mytrace.trc mytrace.flw …… db2trc fmt mystrace.trc mytrace.fmt …… Readable versions of trace data db2trc – Trace facility
Primarily to examine databases for architectural correctness Check validity of meta-data structures, page headers, row headers, … Runs against data as it exists on disk (primarily an offline tool) Critical for investigating problems associated with data ‘corruption’ Granularity: database /db, tablespace /ts, table /t EXAMPLES: db2dart SAMPLE /db db2dart SAMPLE /ts /tsi 4 db2dart SAMPLE /t /tsi 2 /oi 5 Marking problematic objects as invalid; recovering or extracting data Consistency: Index key data checking HWM – space tablespace space management db2dart (Database Analysis and Reporting Tool)
Example: db2dart SAMPLE /ddel … Table ID or name, tablespace ID, first page, num of pages: (suffic page number with 'p' for pool relative) 13 2 0 100 7 of 7 columns in the table will be dumped. Column numbers and datatypes of the columns dumped: 0 SMALLINT 1 VARCHAR() -VARIABLE LENGTH CHARACTER STRING 2 SMALLINT 3 CHAR() -FIXED LENGTH CHARACTER STRING 4 SMALLINT 5 DECIMAL 6 DECIMAL Default filename for output data file is TS2T13.DEL, do you wish to change filename used? y/n n Filename used for output data file is TS2T13.DEL. If existing file, data will be appended to it. Inaccessible Table – Salvaging Data
Example: db2dart testdb /mi /tsi 2 /oi 4 Time of index rebuild dictated by setting of INDEXREC database cfg variable Usually one of RESTART (default) or ACCESS RESTART – database restarted ACCESS – when index is first accessed Messages written to db2diag.log file indicating index rebuild Marking Indexes As Invalid – db2dart /MI
Database backup images represent one of your ultimate recovery options Hence, checking their integrity is prudent db2ckbkp Check the integrity of a database backup image Can extract pages from the backup image to a file (-e option) This may be part of a problem causal analysis or repair procedure What ‘state’ is the data in the backup image? Tivoli Storage Manager (TSM): db2adutil -verify Checking Database Backup Images
Records information specific to the following database actions: BACKUP, ROLLFORWARD, DROPPED TABLE, LOAD, CREATE TABLESPACE, RENAME TABLESPACE, REORG, ALTER TABLESPACE, ARCHIVE LOG Dropped table recovery CREATEor ALTER TABLESPACE option DROPPED TABLE RECOVERY Allows recovery of dropped table data using table space-level restore and rollforward operations Faster than database recovery and database available to users History File
Identify the table which was dropped Op Obj Timestamp+Sequence Type Dev Earliest Log Current Log Backup ID -- --- ------------------ ---- --- ------------ ------------ -------------- D T 20090223125627 000000000000693f00020013 ---------------------------------------------------------------------------- "BMINOR "."STAFF3" resides in 1 tablespace(s): 00001 USERSPACE1 ---------------------------------------------------------------------------- Comment: DROP TABLE Start Time: 20090223125627 End Time: 20090223125627 Status: A ---------------------------------------------------------------------------- EID: 9 DDL: CREATE TABLE "BMINOR "."STAFF3" ( "ID" SMALLINT NOT NULL , "NAME" VARCHAR(9) , "DEPT" SMALLINT , "JOB" CHAR(5) , "YEARS" SMALLINT , "SALARY" DECIMAL(7,2) , "COMM" DECIMAL(7,2) ) IN "USERSPACE1" ; ---------------------------------------------------------------------------- LIST HISTORY –Dropped Table Recovery
Restore a database- or table space-level backup image taken before the table was dropped Create an export directory to which files containing the table data are to be written Roll forward to a point in time after the table was dropped, using the RECOVER DROPPED TABLE option on the ROLLFORWARD DATABASE command (alternatively, roll forward to the end of the logs, so that updates to other tables in the table space or database are not lost) Re-create the table using the CREATE TABLE statement from the recovery history file. Import the table data that was exported during the rollforward operation into the table Dropped Table Recovery(continued)
Optimizer access plans change … could be legitimate, maybe not Compare plans, stats, configuration: before vs after Monitor Example: db2 –tvf $HOME/sqllib/misc/EXPLAIN.DDL db2 explain all for “select * from bminor.staff where id=10” db2exfmt -d sample -n % -e % -s % -w -l -# 0 -o explainPlan.out -g TIC Performance – Explain Plans
Access Plan: ----------- Total Cost: 11.5946 Query Degree: 1 Rows RETURN ( 1) Cost I/O | 6.28 FETCH ( 2) 11.5946 1 /---+---\ 6.28 157 IXSCAN TABLE: BMINOR ( 3) STAFF 3.2496 Q1 0 | 157 INDEX: BMINOR I1_STAFF Q1 Objects Used in Access Plan: --------------------------- Schema: BMINOR Name: I1_STAFF Type: Index Time of creation: 2009-03-02-15.52.41.836445 Last statistics update: Number of columns: 1 Number of rows: 157 Width of rows: -1 Number of buffer pool pages: 1 Distinct row values: No Tablespace name: USERSPACE1 Tablespace overhead: 7.500000 Tablespace transfer rate: 0.060000 Source for statistics: Single Node Prefetch page count: 32 Container extent page count: 32 … Schema: BMINOR Name: STAFF Type: Table Time of creation: 2009-03-02-15.50.29.727941 Last statistics update: Number of columns: 7 Number of rows: 157 Width of rows: 40 Number of buffer pool pages: 1 Number of data partitions: 1 Distinct row values: No Tablespace name: USERSPACE1 Tablespace overhead: 7.500000 Tablespace transfer rate: 0.060000 db2exfmt Output - Example
db2pd – Monitor and Troubleshoot DB2 • Indispensable for monitoring and troubleshooting • Non-intrusively capable of obtaining various statistics, snapshot information, internal meta-data, … from an instance • Relevant options (some of my favorites): • -osinfo, -mempools, -memblocks, -edus (more to come), -stack, -applications, -transactions, -bufferpools, -logs, -pages, -tablespaces, -tcbstats index, -apinfo, -fvp, …
DB2 has a very rich and diverse set of SQL routines that can be used probe and monitor one’s system For example: PD_GET_DISG_HIST PD_GET_LOG_MSGS PDLOGMSGS_LAST24HOURS DB_HISTORY LOG_UTILIZATION LONG_RUNNING_SQL LOCKS_HELD LOCKWAITS CONTAINER UTILIZATION TBSP_UTILIZATION ADMIN_GET_TAB_INFO, ADMIN_GET_INDEX_INFO APPL_PERFORMANCE BP_HITRATIO, BP_READ_IO, BP_WRITE_IO TOP_DYNAMIC_SQL MON_GET_BUFFERPOOL, MON_GET_TABLE, MON_GET_TABLESPACE, …. Achieving Understanding via SQL Routines
What is it? DB2 dynamic monitoring tool. But so are snapshots? Snapshot data is typically cumulative (point in time) db2top provides delta values in real time db2top integrates multiple types of snapshot data together Diagnose performance and resource problems, optimize Provides a unified, single-system view of a multi-partition database or single-partition database for the purposes of providing a dynamic real-time view of a running DB2 system Availability: V8 FP17, V9.1 FP6, V9.5 FP2 developerworks whitepaper: http://www.ibm.com/developerworks/data/library/techarticle/dm-0812wang/ db2top
Invoked automatically by Database invoked by default when the database manager cannot continue processing due to a panic, trap, segmentation violation or exception Each default db2cos script will invoke db2pd commands to collect information in an unlatched manner Order of events when trap/panic occurs: Trap file is created Signal handler is called db2cos script is called An entry is logged in the administration notification log An entry is logged in the db2diag log file Information is generated by db2pd (operating system, the Version and Service Level installed, the database manager and database configuration, as well as information about the state of the agents, memory pools, memory sets, memory blocks, applications, utilities, transactions, …) When Unexpected Problems Occur – db2cos :DB2 Call Out Script
Collects environment data about either a client or server machine and places the files containing system data into a compressed file archive Collects db and dbm cfg info, reg var settings, db2level, db2diag.log, … all into one file called db2support.zip Collects up diagnostics generated by db2cos* scripts db2supp_opt.zip within db2support.zip This file contains a host of optimizer related info: query text, exfmt output (explain of query), db2look and db2batch output, catalog table info, … To assist with Optimizer (plan) problems db2support . –db <dbname> -st “select * from bminor.staff” db2support – Diagnostic Data Collection
Basically a tool which can set or change flags which influence the database system behavior for problem determination purposes Control behaviors of cos and fodc (more to come on this one) Neat option: db2pdcfg –flushbp –db <dbname> Allow faster failover from primary to secondary in HADR setups Monitor and tune bufferpools for number of dirty pages (use inconjunction with –pages and –bufferpools) Sync in-memory tablespace meta-data structures with on-disk versions Execute before using db2dart /DHWM against active database db2pdcfg
A facility (V9.5) to capture diagnostic information when a problem first occurs Automatic mode: invoked by DB2 engine when outage or error condition is detected Traps, panics, corruptions Manual mode: user initiated – db2fodc Hangs, performance issues Current setting: db2pdcfg –fodc status db2pdcfg can be used to change defaults Information collected and placed in subdirectory under DIAGPATH FODC_<outagetype>_<timestamp> Run db2support after db2fodc to collect all the fodc information and package into one zip file that can be sent to Support First Occurrence Data Capture (FODC)
Example: db2fodc –hang –db SAMPLE 2009-03-03-11.33.19.657457-300 E1683E654 LEVEL: Warning PID : 619 TID : 47350926685104PROC : db2fodc INSTANCE: bminor NODE : 000 FUNCTION: DB2 UDB, RAS/PD component, pdDb2FODCMain, probe:30 MESSAGE : ADM14003W FODC has been invoked by the user from db2fodc tool for symptom "hang" and diagnostic information has been recorded in directory "/home/bminor/sqllib/db2dump/FODC_Hang_2009-03-03-11.33.19.657318". Please look in this directory for detailed evidence about what happened and contact IBM support if necessary to diagnose the problem. File Listing for ../sqllib/db2dump/FODC_Hang_2009-03-03-11.33.19.657318 6116.1.000.stack.txt 6116.17.000.stack.txt 6116.27.000.stack.txt 6116.31.000.stack.txt DB2CONFIG 6116.11.000.stack.txt 6116.18.000.stack.txt 6116.28.000.stack.txt 6116.32.000.apm.bin 6116.12.000.stack.txt 6116.19.000.stack.txt 6116.29.000.apm.bin 6116.32.000.dump.bin 6116.13.000.stack.txt 6116.20.000.stack.txt 6116.29.000.dump.bin 6116.32.000.stack.txt DB2PD 6116.14.000.stack.txt 6116.21.000.stack.txt 6116.29.000.stack.txt 6116.33.000.apm.bin DB2SNAPS 6116.15.000.dump.bin 6116.22.000.stack.txt 6116.30.000.apm.bin 6116.33.000.dump.bin DB2TRACE 6116.15.000.stack.txt 6116.23.000.stack.txt 6116.30.000.dump.bin 6116.33.000.stack.txt OSCONFIG 6116.16.000.apm.bin 6116.24.000.stack.txt 6116.30.000.stack.txt 6116.34.000.apm.bin OSSNAPS 6116.16.000.dump.bin 6116.25.000.stack.txt 6116.31.000.apm.bin 6116.34.000.dump.bin OSTRACE 6116.16.000.stack.txt 6116.26.000.stack.txt 6116.31.000.dump.bin 6116.34.000.stack.txt db2eventlog.000 db2fodc_hang.log db2fodc Example
Resiliency • Avoiding and dealing with ‘errors’ that can result in unplanned outages
Index to Data consistency problems can be categorized as: Orphaned index entry: An index entry points to a deleted or non-existent record slot Invalid index entry: An index entry points to a valid slot on a data page, but index key data does not match record data Missing key: A data row is not indexed in at least one of the indexes on the table Duplicate Record IDs (RIDs): A data row is indexed more than once These are classified as logical corruptions. Online Data Consistency Checking (V9.5)
New INDEXDATA cross-object checking clause to the INSPECT utility Example: db2 inspect check table name STAFF indexdata results keep insp.out NOTE: The INDEXDATA option only examines the logical inconsistency between index and data. Therefore, it is recommended that you first run INDEX and DATA checking separately, to rule out any physical corruption, before running INDEXDATA checking. Key-Data Checking via DB2 INSPECT (V9.5)
DB2 has always had a threaded engine model on Windows platforms In V9.5, DB2 Unix engine has now evolved from process-based to thread-based model All threads within a process can share the memory allocated by, or within, that process Several advantages: basically performance and memory management From a DBA perspective: Better administration and tuning of memory through administration Resiliency improvements (upcoming slides) Threaded Engine Model for Unix in V9.5
V9.1: bminor 6107356 4780132 0 13:00:53 - 0:00 db2sysc bminor 6148108 6717616 0 13:01:03 - 0:00 db2agent (SAMPLE) bminor 6180952 1458238 0 13:08:22 - 0:00 db2pclnr bminor 6193316 1458238 0 13:08:22 - 0:00 db2pclnr bminor 6660150 6107356 0 13:00:53 - 0:00 db2licc bminor 6717616 6107356 0 13:00:53 - 0:00 db2ipccm bminor 884882 1458238 0 13:08:22 - 0:00 db2loggw (SAMPLE) bminor 1237024 1458238 0 13:08:22 - 0:00 db2lfr (SAMPLE) bminor 1458238 6107356 0 13:00:53 - 0:00 db2gds bminor 1556708 6717616 0 13:01:07 - 0:14 db2stmm (SAMPLE) bminor 1757202 1458238 0 13:08:22 - 0:00 db2dlock (SAMPLE) bminor 1908750 1458238 0 13:08:22 - 0:00 db2pclnr bminor 2195682 1 0 13:08:21 - 0:00 /home4/bminor/sqllib/bin/db2bp V9.5: bminor 10364 10362 1 13:15 pts/19 00:00:00 db2sysc bminor 10377 10362 1 13:15 pts/19 00:00:00 db2acd,0,0,0,1,0,0,1,0,8a18 bminor 10385 1 0 13:15 pts/19 00:00:00/home/bminor/sqllib/bin/db2bp DB2 Processes vs. Threads: ps -ef
Database Partition 0 -- Active -- Up 2 days 17:59:14 List of all EDUs for database partition 0 db2sysc PID: 1187 db2wdog PID: 1185 db2acd PID: 1201 EDU ID TID Kernel TID EDU Name USR SYS ========================================================================================================= 65 47726691608896 20490 db2agntdp (SAMPLE ) 0.000000 0.000000 64 47726695803200 999 db2agent (SAMPLE) 0.240000 0.110000 63 47726708386112 998 db2agent (SAMPLE) 0.720000 0.160000 61 47726712580416 308 db2evmli (DB2DETAILDEADLOCK) 0.000000 0.000000 60 47726716774720 307 db2wlmd (SAMPLE) 0.000000 0.000000 59 47726720969024 306 db2pfchr (SAMPLE) 0.000000 0.000000 57 47726737746240 304 db2pfchr (SAMPLE) 0.000000 0.000000 55 47726733551936 302 db2pclnr (SAMPLE) 0.000000 0.000000 54 47726729357632 301 db2pclnr (SAMPLE) 0.000000 0.000000 53 47726746134848 300 db2dlock (SAMPLE) 0.000000 0.000000 52 47726762912064 32767 db2lfr (SAMPLE) 0.000000 0.000000 51 47726741940544 32766 db2loggw (SAMPLE) 0.020000 0.030000 50 47726758717760 32765 db2loggr (SAMPLE) 0.000000 0.030000 49 47726699997504 1554 db2taskd (SAMPLE) 0.000000 0.000000 25 47726754523456 1365 db2stmm (SAMPLE) 8.020000 0.460000 16 47726767106368 1265 db2agent (SAMPLE) 1.420000 0.810000 15 47726771300672 1199 db2resync 0.000000 0.000000 14 47726775494976 1196 db2ipccm 0.080000 0.030000 13 47726779689280 1195 db2licc 0.000000 0.000000 12 47726783883584 1194 db2thcln 0.000000 0.000000 11 47726788077888 1192 db2alarm 0.020000 0.020000 1 47726528031040 1191 db2sysc 5.420000 5.620000 Running Threads: db2pd -edus
DB2 is leveraging technology which is specific to the POWER6 platform and AIX (TL5300-06 or greater) Hardware feature called Storage Keys Allows for memory protection functionality DB2 V9.5 has the capability to detect erroneous access to critical regions of it’s own memory, primarily bufferpool memory This is pro-active protection: When such access is detected it is stopped before a corruption can result Offending UDF or Stored Procedure receives an error code indicating that an incorrect memory access was performed Cannot protect against all types of bugs but it does provide an added layer of protection Memory Protection via Storage Keys (V9.5)
Registry variable: DB2_MEMORY_PROTECT = [ NO | YES ] Error DBI1301E on attempted set if: Hardware, OS does not support storage keys Large page support (i.e. DB2_LARGE_PAGE_MEM) is enabled Enablement of Memory Protection viaStorage Keys (V9.5)
Thread Suspension: An additional level of protection that will increase availability of systems when abnormal termination of a DB2 Kernel Thread occurs DB2_THREAD_SUSPENSION = [ OFF | ON ] What is it? It allows you to control whether a DB2 instance sustains a trap by suspending a faulty DB2 kernel thread (a thread that has tried illegally to access memory protected with storage protection keys). Higher Availability with Storage Key Support
Goal: Reduce unplanned outages Error toleration via: Trap resilience Physical read error toleration Logical read error toleration Registry variable: DB2RESILIENCE=ON/TRUE or not defined (ie default behavior) activates error toleration support for the above DB2RESILIENCE=OFF/FALSE disables these capabilities. Error Toleration (TP)
Registry Variable: DB2RESILIENCE= [ ON | OFF ] When a DB2 server detects a trap the following actions will be taken by DB2: First Occurrence Data Capture (FODC) will be activated Execution stack is dumped If Storage Key support is active then process thread suspension decisions If Storage Key suspension decisions allow for thread suspension, suspend thread If Storage Key support is not active or fails decision points to suspend thread, then proceed with non Storage Key survival decisions (ie. Utilize this new capability) DB2 will determine if the current environment is safe to allow recovery Extended Trap Resilience (TP)
2008-11-17-09.56.27.542572-300 I75439E563 LEVEL: Severe PID : 28040 TID : 47643040409920PROC : db2sysc INSTANCE: marvin NODE : 000 DB : TESTDB APPHDL : 0-21 APPID: *LOCAL.marvin.081117145537 AUTHID : MARVIN EDUID : 29 EDUNAME: db2agent (TESTDB) FUNCTION: DB2 UDB, RAS/PD component, pdResilienceIsSafeToRollback,probe:800 DATA #1 : String, 37 bytes Trap Sustainability Criteria Checking DATA #2 : Hex integer, 8 bytes 0x00000000000A0400 DATA #3 : Boolean, 1 bytes true DATA #1 A generic description of the data following DATA #2 The hex dump of the trap sustainment criteria DATA #3 The decision on whether to attempt to sustain the trap (true/false) Messaging:
Physical read errors are detected when retrieving DB2 pages from the storage medium (for disk drive). This error could be caused by issues with the contents of the data on disk or by problems accessing the physical device on which the data resides. Currently, when DB2 detects either of these errors the DB2 instance is shutdown. This solution will provide enhanced toleration of physical read errors such that if such an error is encountered, rather than force the DB2 instance offline, the SQL statement or operation being executed will fail, a new SQLCODE will be returned to the application when appropriate, logging of appropriate diagnostics will occur, and the DB2 instance will remain active. Physical Read Error Toleration (TP)
When a DB2 server detects a physical read error there will be a maximum of two more attempts to read the page from disk and validate the DB2 page header. If at the end of the multiple attempts an error is still determined to exist then: If DB2RESILIENCE=ON Sqlcode -1655E will be returned to the application The SQL statement or DB2 operation fails Administration message ADM6006E will be placed in the DB2 Administration log at notify level URGENT Administration message ADM6006E will be placed in the DB2 diagnostic log at level URGENT First Occurrence Data Capture (FODC) will be activated when the error is detected and will be turned off when the sqlcode is returned to the application. This will allow collection of diagnostic data generated due to this failure detection for further analysis by DB2 Support. The database instance remains available Application connection remains available for continued operations If DB2RESILIENCE=OFF then panic the instance (that is, revert to old behavior) This algorithm will aid in reducing application errors which are a result of transient file system I/O errors. Toleration of Physical Read Errors
Inconsistency detected by DB2 when processing data contained within a DB2 page Instead of hard failure (database or instance shutdown), isolate the issue and allow overall database use to continue Problematic page is removed from the bufferpool, relevant diagnostics are generated, notification is returned (messages to admin and db2diag.log and new sqlcode to application) Leveraged within the engine where it makes sense Logical Read Error Toleration (TP)
SQL1656E An error occurred while processing data. The operation could not be completed, but the database remains accessible. Contact IBM Software Support. Explanation: The failure to complete the operation is due to a detected DB2 page inconsistency. The SQL statement has failed or the operation has been aborted. The database remains accessible. User response: Contact IBM Software Support for instructions on what diagnostic data to collect to assist in resolving this issue. The application can retry the operation, although it may continue to fail. sqlcode: -1656 sqlstate: 58004 ADM6007C DB2 detected an error while processing page page-number from table space tbspace-id for object object-id of object type object-type. Explanation DB2 was not able to complete the operation, but the database remains accessible. This may indicate the presence of a serious problem that could result in a future outage. User response Please contact IBM Software Support and they will guide you through the proper corrective actions. Messaging
Developerworks (Best Practices Papers, …) IBM Database Magazine IDUG Web Site and Solutions Journal DB2 LUW Chat with the Lab http://www-01.ibm.com/software/data/db2/9/labchats.html IBM DB2 Support home: www.ibm.com/software/data/db2/support/db2_9 DB2 Fixes: http://www-01.ibm.com/software/data/db2/support/db2_9/download.html Conferences, RDUGs, Customer Lab Advocate, Early Adoption Programs, … Resources To Help You
The DB2 Detective – The Case of the Diligent DBA Bill Minor IBM bminor@ca.ibm.com