Locating Errors
Liu Rui (刘睿)

Overview
• Error log categories
• Console file
• Symrecs file
• CSMT.out file
• Dump and Traceback files
• CICS Dump
• Traceback file
• Available system tools
• Combined analysis

Error log files
• CONSOLE file: CICS region startup, shutdown, transaction errors and failures …
• SYMRECS file: CICS error conditions, symptom records and stack traces.
• TRACEBACK file: stack information when an exception occurs in CICS (illegal address or illegal instruction).
• CSMT.out file: transaction error messages, communication errors …
• .env file
• Dump files
• XA log file
• CICS Trace
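
Console file
• console.nnnnnn is located in /var/cics_regions/<regionname>/
• The maximum size is governed by RD:MaxConsoleSize.
• Anything written to stderr ends up in the console file.
• Message categories:
  • Information message, e.g.: ERZ010054I/0205 2006-07-24 11:34:48.827332000 CICSNT01 3264/0001 : CICS self-consistency checking is complete
  • Warning message, e.g.: ERZ016050W/0234 2006-07-24 11:36:53.175833000 CICSNT01 2260/0001 : Logical unit of work for transaction 'CRTE' has been backed out; Distributed Transaction Service (TRAN) reason 'ENC-tra-1025: A client (not the transaction service) aborted'
  • Error message, e.g.: ERZ058009E/0020 2006-07-24 11:36:55.629361000 CICSNT01 2260/0001 : The data conversion routine between code page 'IBM-1381' and code page 'IBM-850' is not available; error number 2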

Console message examples
ERZ010054I/0205 2006-07-24 11:34:48.827332000 CICSNT01 3264/0001 : CICS self-consistency checking is complete
ERZ014040E/0107 2006-07-24 11:36:29.011086000 CICSNT01 2260/0001 AB34: First abend 'A28B' occurred for transaction 'CEMT' in program 'DFHCEMT'
ERZ014016E/0028 2006-07-24 11:36:29.071172000 CICSNT01 2260/0001 AB34: Transaction 'CEMT', Abend 'A147', at 'AB34'.
SERVICE_MESSAGE 2006-07-24 11:36:29.101216000 CICSNT01 2260/0001 : Abend 'A147' (first abend 'A28B') is reported as transaction 'CEMT' is force-purged.
ERZ058009E/0020 2006-07-24 11:36:53.075689000 CICSNT01 2260/0001 : The data conversion routine between code page 'IBM-1381' and code page 'IBM-850' is not available; error number 2
ERZ014010E/0012 2006-07-24 11:36:53.105732000 CICSNT01 2260/0001 : Unable to initialize communication with the remote system to run transaction 'CRTE'.
ERZ014016E/0036 2006-07-24 11:36:53.135776000 CICSNT01 2260/0001 : Transaction 'CRTE', Abend 'A28D', at '????'.
ERZ016050W/0234 2006-07-24 11:36:53.175833000 CICSNT01 2260/0001 : Logical unit of work for transaction 'CRTE' has been backed out; Distributed Transaction Service (TRAN) reason 'ENC-tra-1025: A client (not the transaction service) aborted'
ERZ058009E/0020 2006-07-24 11:36:55.629361000 CICSNT01 2260/0001 : The data conversion routine between code page 'IBM-1381' and code page 'IBM-850' is not available; error number 2
ERZ014010E/0012 2006-07-24 11:36:55.659404000 CICSNT01 2260/0001 : Unable to initialize communication with the remote system to run transaction 'CRTE'.
ERZ014016E/0036 2006-07-24 11:36:55.699462000 CICSNT01 2260/0001 : Transaction 'CRTE', Abend 'A28D', at '????'.
ERZ034114E/0731 2006-07-24 11:36:55.699505000 CICSNT01 3884/0001 : The attempt to uninstall entry '@1AN' from the runtime database failed. The entry is marked as in use.
ERZ016050W/0234 2006-07-24 11:36:55.729505000 CICSNT01 2260/0001 : Logical unit of work for transaction 'CRTE' has been backed out; Distributed Transaction Service (TRAN) reason 'ENC-tra-1025: A client (not the transaction service) aborted'
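
When a console file grows large, it can help to split each message into fields before filtering by severity or terminal. The following is a minimal Python sketch, not an official parser: the field layout is inferred only from the sample lines above, and the field names (`number`, `pid`, `tid`, `term`) are labels chosen here for illustration. The message text in the sample call is an English rendering of one of the example messages.

```python
import re
from typing import Optional

# Layout inferred from the sample console lines (an assumption, not a spec):
#   ERZnnnnnn<I|W|E>/<4 digits> <date> <time> <region> <n>/<n> [<term>]: <text>
# The meaning of the 4-digit suffix and of the <n>/<n> pair is assumed here.
CONSOLE_RE = re.compile(
    r"^(?P<msgid>ERZ\d{6}(?P<severity>[IWE]))/(?P<number>\d{4})\s+"
    r"(?P<date>\d{4}-\d{2}-\d{2})\s+(?P<time>\d{2}:\d{2}:\d{2}\.\d+)\s+"
    r"(?P<region>\S+)\s+(?P<pid>\d+)/(?P<tid>\d+)\s*"
    r"(?P<term>[^\s:]*)\s*:\s*(?P<text>.*)$"
)

# Severity letters as described on the slide: Information, Warning, Error.
SEVERITY_NAMES = {"I": "Information", "W": "Warning", "E": "Error"}


def parse_console_line(line: str) -> Optional[dict]:
    """Return the fields of one ERZ console message, or None if the line
    does not follow the ERZ layout (for example SERVICE_MESSAGE lines)."""
    match = CONSOLE_RE.match(line.strip())
    if match is None:
        return None
    fields = match.groupdict()
    fields["severity"] = SEVERITY_NAMES[fields["severity"]]
    return fields


if __name__ == "__main__":
    sample = ("ERZ014016E/0028 2006-07-24 11:36:29.071172000 CICSNT01 "
              "2260/0001 AB34: Transaction 'CEMT', Abend 'A147', at 'AB34'.")
    print(parse_console_line(sample))
```

Filtering the parsed records on `severity == "Error"` and grouping by `term` is one way to see which terminal or transaction an abend sequence belongs to.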

Symrecs file
• symrecs.nnnnnn is located in /var/cics_regions/<regionname>/
• Format of a symrecs record:
  SYMPTOMS = primary symptom data
  SECONDARY SYMPTOMS = secondary symptom data

Symrecs record example
SYMPTOMS = PIDS/5765E2820 LVLS/430 PTFS/ RIDS/TasLU_UpdateTidState LINE/-1 MS/016001 MSN/63 SRC/11 PRCS/2097152 AB/U1601 PID/40608 TID/1 TIME/030112052304 IST
SECONDARY SYMPTOMS = PostMortem (Error Path is offset x'594' in TasLU_UpdateTidState<TasLU_ISyncpoint <TasLU_Syncpoint<PinCA_Route<CICSAPIE) logging where error occurred
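
The record above packs its data into `KEY/value` tokens plus a call chain separated by `<`. As a rough illustration, the Python sketch below splits both parts. It assumes only what the sample shows: tokens are whitespace-separated, a token without a `/` (such as the trailing `IST`) belongs to the preceding value, and the error path sits between `in` and the closing parenthesis. The function names are hypothetical.

```python
import re


def parse_symptoms(line: str) -> dict:
    """Split a 'SYMPTOMS = KEY/value ...' line into a dictionary."""
    _, _, data = line.partition("=")
    fields = {}
    last_key = None
    for token in data.split():
        key, sep, value = token.partition("/")
        if sep:
            fields[key] = value
            last_key = key
        elif last_key is not None:
            # A token without '/' (e.g. a time zone) extends the previous value.
            fields[last_key] = (fields[last_key] + " " + token).strip()
    return fields


def error_path(secondary: str) -> list:
    """Return the function chain from a SECONDARY SYMPTOMS line,
    innermost function first, or an empty list if none is found."""
    match = re.search(r"\bin\s+([^)]*)\)", secondary)
    if match is None:
        return []
    return [name.strip() for name in match.group(1).split("<")]


if __name__ == "__main__":
    primary = ("SYMPTOMS = PIDS/5765E2820 LVLS/430 PTFS/ RIDS/TasLU_UpdateTidState "
               "LINE/-1 MS/016001 MSN/63 SRC/11 PRCS/2097152 AB/U1601 PID/40608 "
               "TID/1 TIME/030112052304 IST")
    secondary = ("SECONDARY SYMPTOMS = PostMortem (Error Path is offset x'594' in "
                 "TasLU_UpdateTidState<TasLU_ISyncpoint <TasLU_Syncpoint<PinCA_Route"
                 "<CICSAPIE) logging where error occurred")
    print(parse_symptoms(primary)["AB"])
    print(error_path(secondary))
```

Reading the chain left to right gives the failing function first and its callers after it, which mirrors how the traceback files list frames.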

CSMT.out file
• Location: /var/cics_regions/<regionname>/data
• Applications can also write their own messages to CSMT.out: EXEC CICS WRITEQ TD QUEUE("CSMT") FROM(debug-data)
• From TXSeries v6.2 onward, the maximum file size can be limited with TDD:CSMT:MaxSize.

CSMT.out message examples
ERZ015033E/0003 03/10/03 09:35:09 CICSMAIN UHHP: Transaction 'QA55' attempts to run program 'QA55SUB' which is not defined in the database
ERZ015033E/0003 03/10/03 09:36:52 CICSMAIN YHHP: Transaction 'QA55' attempts to run program 'QA55SUB' which is not defined in the database

CSMT.out message examples
• ERZ028001E/6212 01/28/03 15:19:45 CICSPROD PS12: The connection to the remote system 'HOST' cannot be started. Communications error 15a00002/15a00102
• ERZ042028I/0159 01/28/03 15:19:56 CICSPROD : Terminal 'PS12' with NETNAME 'PS12' has been uninstalled.
• ERZ042043I/0802 01/28/03 15:20:00 CICSPROD : Waiting for tasks to finish with CD entry '04PS' for remote system 'PS04'
• ERZ016050W/0234 01/28/03 15:23:26 CICSPROD : Logical unit of work for transaction 'TPSL' has been backed out; Distributed Transaction Service (TRAN) reason 'ENC-tra-1025: A client (not the transaction service) aborted'
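
Meaning of some Communications error codes
• 15a00002/15a00102
  • Cause 1: the remote region is down, or the network is unreachable
  • Cause 2: the local TD:timeout expired while the remote transaction was still queued
  • LINK returns SysIdErr
• 15a00007/a0000100
  • Connection failed
  • Cause 1: the local TD:timeout expired while the remote transaction was still running
  • LINK returns TermErr
  • Note: cics_eci.h documents every code whose primary code is 15a00007.
• 15a00007/84b6031
  • Transaction unavailable
  • Cause 1: the request was rejected because of RD:MaxTClassLim
  • LINK returns TermErr
• 15a00007/8640000
  • Remote Transaction Abend
  • The local transaction is certain to abend as a result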
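
The code pairs listed above can be kept in a small lookup table so that a CSMT.out line is annotated automatically. This is only a sketch in Python: the table holds exactly the four pairs from the slide, and the dictionary layout and the `explain_comm_error` function name are assumptions made for illustration, not part of any CICS tool.

```python
import re

# (primary, secondary) -> (meaning, possible causes, LINK response).
# Contents come only from the list above; None marks items the list leaves out.
COMM_ERRORS = {
    ("15a00002", "15a00102"): (
        None,
        ["remote region down or network unreachable",
         "local TD:timeout expired while the remote transaction was still queued"],
        "SysIdErr",
    ),
    ("15a00007", "a0000100"): (
        "Connection failed",
        ["local TD:timeout expired while the remote transaction was still running"],
        "TermErr",
    ),
    ("15a00007", "84b6031"): (
        "Transaction unavailable",
        ["rejected because of RD:MaxTClassLim"],
        "TermErr",
    ),
    ("15a00007", "8640000"): (
        "Remote Transaction Abend",
        ["the local transaction abends as a result"],
        None,
    ),
}

CODE_RE = re.compile(r"Communications error\s+([0-9a-fA-F]+)/([0-9a-fA-F]+)")


def explain_comm_error(message: str):
    """Return (code pair, table entry) for a message, or None if the message
    carries no communications error code. Unknown pairs give entry None."""
    match = CODE_RE.search(message)
    if match is None:
        return None
    pair = (match.group(1).lower(), match.group(2).lower())
    return pair, COMM_ERRORS.get(pair)


if __name__ == "__main__":
    line = ("ERZ028001E/6212 01/28/03 15:19:45 CICSPROD PS12: The connection to "
            "the remote system 'HOST' cannot be started. "
            "Communications error 15a00002/15a00102")
    print(explain_comm_error(line))
```

Pairs that are not in the table come back with an entry of `None`, which is the cue to look the code up in cics_eci.h or the product documentation.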
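
CICS Dump contents
• Transaction Dump
• Memory is written to the dump file
• System Dump
• Details of the last CICS command executed
• Execution details of every transaction
• The region configuration at the time of the dump
• State of non-transaction processes, for example the recovery server
• Encina client information
• All enabled CICS Trace information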
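
CICS Dump files
• Directory
  • RD.DumpName, default: dumps
  • RD.CoreDumpName, default: dir1
• File name
  • AAAANNNN.dmpmm
  • AAAA - ASRA, ASRB, SYSA, SHUT, SNAP, dump code, abend code
  • NNNN - sequence number
  • mm - file part number; a dump that is too large is split automatically into several files, marked 01, 02, and so on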
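
The AAAANNNN.dmpmm naming scheme can be decomposed mechanically, which is handy when sorting the parts of a split dump. A minimal Python sketch follows. It assumes the code is four alphanumeric characters and the sequence and part numbers are decimal digits (the slide does not state this), and it treats the part suffix as optional because console messages may quote the name without it. `parse_dump_name` is a hypothetical helper.

```python
import re
from typing import NamedTuple, Optional


class DumpName(NamedTuple):
    code: str             # AAAA: ASRA, ASRB, SYSA, SHUT, SNAP, dump or abend code
    sequence: int         # NNNN: sequence number
    part: Optional[int]   # mm: part number of a split dump, None if absent


# Assumed character classes: alphanumeric code, decimal sequence and part.
DUMP_RE = re.compile(r"^([A-Za-z0-9]{4})(\d{4})\.dmp(\d{2})?$")


def parse_dump_name(filename: str) -> Optional[DumpName]:
    """Split a dump file name into code, sequence number and part number."""
    match = DUMP_RE.match(filename)
    if match is None:
        return None
    code, sequence, part = match.groups()
    return DumpName(code, int(sequence), int(part) if part else None)


if __name__ == "__main__":
    names = ["ASRA0006.dmp02", "ASRA0006.dmp01", "SNAP0001.dmp01"]
    # Sorting the parsed tuples orders dumps by code, sequence, then part.
    for parsed in sorted(parse_dump_name(n) for n in names):
        print(parsed)
```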

CICS Dump settings
• Offline settings
  • TD.TransDump
  • RD.SysDump
  • RD.PCDump
  • RD.ABDump
• Online settings
  • CEMT INQUIRE/SET DUMP
  • CEMT INQUIRE/SET DUMPOPTIONs
• User Exit
  • Dump Request User Exit (UE052017)

Generating a CICS Dump
• Transaction Dump
  • EXEC CICS DUMP
  • EXEC CICS ABEND
  • CECI DUMP
  • CECI ABEND
  • Abnormal transaction termination, including ASRA and ASRB
• System Dump
  • Abnormal system termination
  • CICS shutdown
  • CEMT PERFORM SNAP
  • ASRA
  • ASRB

CICS Dump formatting tool
• cicsdfmt
  cicsdfmt -r <region name> <dump file name>

CICS Dump – Region Config – RD
**** DATABASE CLASS MODULE (RegDC) ****
DUMP START FOR CLASS RE
Runtime database for Class RD
Buffer Address = 0x38e153f4
......
RegRE IEntry:
......
Name of the default user identifier = CICSUSER
CICS Release Number = 0430
Region system identifier (short name) = ISC0
Region application identifier (long name) = CICSRGN
Minimum number of Application Servers to maintain = 1
Maximum number of Application Servers to maintain = 5
Region Pool Storage Size (bytes) = 2097152
Task-private Storage Size (bytes) = 1048576
Task Shared Pool Storage Size (bytes) = 1048576
Threshold for Region Pool short on storage (%age) = 90
Threshold for TSH Pool short on storage (%age) = 90
Number of Task Shared Pool Address Hash Buckets = 512
......

CICS Dump – Region Config – TD
**** DATABASE CLASS MODULE (RegDC) ****
DUMP START FOR CLASS TR
Runtime database for Class TD
Buffer Address = 0x38e15400
......
RegTR IEntry:
......
Resource key buffer = THLO
Group to which resource belongs = Samples
Activate the resource at cold start? = yes
Resource description = Transaction Definition
Type of RSL Checks = none
Type of TSL Checks = internal
First program name = PHELLO
......

CICS Dump – Region Config – PD
**** DATABASE CLASS MODULE (RegDC) ****
DUMP START FOR CLASS PR
Runtime database for Class PD
Buffer Address = 0x38e153f0
......
RegPR IEntry:
......
Resource key buffer = PHELLO
Group to which resource belongs = Samples
Activate resource at cold start? = yes
Resource description = Program Definition
Number of updates = 0
Protect resource from modifications? = FALSE
Program path name = E:\TXSeries\CICS_Samples\Exercise_BMS/Hello
Program type = program
......

CICS Dump – Region Config – UD
**** DATABASE CLASS MODULE (RegDC) ****
DUMP START FOR CLASS US
Runtime database for Class UD
Buffer Address = 0x38e1540c
......
RegUS IEntry:
Resource key buffer = CICSUSER
Transaction Level Security Key List = 0000000100000000
Resource Level Security Key List = 00000000
DCE principal of the user = CICSUSER
User priority = 0
Encrypted password =
......
RegUS IEntry:
Resource key buffer = TESTUSER
Transaction Level Security Key List = 0000000100000000
Resource Level Security Key List = 00000000
DCE principal of the user = TESTUSER
User priority = 0
Encrypted password = SgJAEYVDLJQ

CICS Dump – Transaction Scheduler
**** SCHEDULER MODULE (ConTS) ****
ConTS private RCA Data:
Class Table:
Class 0 tasks: waiting = 0 active = 1 max,lim = n/a,n/a
Class 1 tasks: waiting = 0 active = 0 max,lim = 1,0
Class 2 tasks: waiting = 0 active = 0 max,lim = 1,0
......
Application Servers: Min = 1, Current = 3, Max = 5
Server Idle Time (seconds) = 3600
Idle Application Server Queue: Anchor = 0x70000994, First = 0x700a62fc, Last = 0x700a62a0
Waiting Tasks Queue: Anchor = 0x70000828, First = 0x70000828, Last = 0x70000828
Running Tasks Queue: Anchor = 0x70000820, First = 0x700a21bc, Last = 0x700a21bc
List of idle application servers:
Server Number 102
......
Server Number 101
......
List of Running transactions:
Task No = 71, Tran Id = CEMT, User Id = , Device Id =
App Server = 105, State = Running, Class = 0, Priority = 255

CICS Dump – Task Control Area
Task Control Area Header:
AIX process ID = 2520
Application Server ID = 105
......
Task Control Area Task specific part:
User Name = CICSUSER
TCA Force Purge flag ? = FALSE
TCA Purge flag ? = FALSE
ReturnTran ? = FALSE
UserMode ? = FALSE
(If TRUE, the error is definitely in user code and the program needs further debugging)
(If FALSE, the error may be caused by the application calling the CICS API incorrectly)
......
EXEC Interface Block:
============ CICS EIB structure ==============
EIBTIME : Time task started = Ox0202308C
EIBDATE : Date task started = Ox0101224C
EIBTRNID : Transaction ID = CEMT
EIBRCODE : Response Code = Ox000000000000
......

CICS Dump – Control Module
Program Control Information:
Program Name = DFHCEMT
Value of Resident attribute = No
Program full path name = C:\opt\cics\bin\DFHCEMT.dll
Program Data:
Buffer Address = 0x3815e000
Offset 0 1 2 3 4 5 6 7 8 9 A B C D E F ASCII EBCDIC
------ ------------------------------------------------------------------------
0000 00000000 00000000 6300FFFF 00000000 [........c.......] [................]
0010 00000000 00000020 01000100 00000000 [....... ........] [................]
00000000 ... 4 lines of zeros suppressed
......
EXEC CICS command string:
Buffer Address = 0x127044
EXEC CICS GETMAIN SET (X'00128110') FLENGTH (9200) INITIMG (0) NOHANDLE

Finding application errors from a Dump (AIX only)
• Add compiler switches
  • cicstcl -s, or CCFLAGS = "-qlist"
  • A *.lst file is then generated at compile time
• Set the RD attributes
  • RD.ABDump = no, or: CEMT SET DUMPOPTIONS NOABABEND
  • RD.PCDump = no, or: CEMT SET DUMPOPTIONS NOPCABEND
• Set the TD attribute
  • TD.TransDump = yes
• A Transaction Dump is generated when the transaction fails
  • /var/cics_regions/region/dumps/dir1/AAAANNNN.dmpmm
  • /var/cics_regions/region/dumps/dir1/cicsASPID.traceback
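
Finding application errors from a Dump (AIX only)
• Format the dump file
  • cicsdfmt -r <region name> AAAANNNN.dmpmm > AAAANNNN.dmpmm.txt
  • Note: a dump file taken on one platform cannot necessarily be formatted on another
  • For example, a dump file from AIX cannot be formatted with cicsdfmt on NT
• Find the call stack in the txt file and locate the point of failure
• Check it against the *.lst file to find the failing source line number
• Check it against the *.c file to find the failing code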

Finding the error (AIX only) – Sample console.msg
ERZ052004I/0602 09/05/01 04:44:12 CICSRGN GPAC: Dump to 'ASRA0006.dmp' started.
ERZ052007I/0604 09/05/01 04:44:12 CICSRGN GPAC: Dump to 'ASRA0006.dmp' completed.
ERZ014016E/0028 09/05/01 04:44:12 CICSRGN GPAC: Transaction 'TDPU', Abend 'ASRA', at 'GPAC'.
ERZ015028W/0154 09/05/01 04:44:12 CICSRGN GPAC: Exception in user application code - exception string is: 'exc_e_illaddr'
ERZ016050W/0234 09/05/01 04:44:12 CICSRGN GPAC: Logical unit of work for transaction 'TDPU' has been backed out; Distributed Transaction Service (TRAN) reason 'ENC-tra-1025: A client (not the transaction service) aborted'

Finding the error (AIX only) – Sample ASRA0006.dmp01.txt
**** START OF TRANSACTION DUMP ****
Application Server id = 102
Transaction Id = TDPU
User Name = CICSUSER
Details of function being executed: 0x2ff1e8c0
Function Name = main
Service Level =
Offset of current instruction = 0x1e8
Called by function = PinCA_StartC from offset = 0x148
Called by function = TasPR_CallApplication from offset = 0x404
Called by function = TasPR_RunProgram from offset = 0x11d8
......

Finding the error (AIX only) – Sample TransactionDumpUser.lst
......
79 | CL.2:
84 | 0001CC lwz 8062000C 1 L4A gr3=._iob(gr2,0)
84 | 0001D0 addi 38630040 2 AI gr3=gr3,64
84 | 0001D4 addi 389F0064 1 AI gr4=gr31,100
84 | 0001D8 bl 4BFFFE29 0 CALL gr3=fprintf,2,gr3,gr4,fprintf",...
84 | 0001DC ori 60000000 1
86 | 0001E0 lwz 8061004C 0 L4A gr3=pStringBuffer(gr1,76)
86 | 0001E4 addi 389F0078 1 AI gr4=gr31,120
86 | 0001E8 lswi 7CA4F4AA 4 LSI gr5-gr12=+CONSTANT_AREA(gr4,0), 30
86 | 0001EC stswi 7CA3F5AA 4 STSI #MEMORY(gr3,0)=30,gr5-gr12,mq"
87 | 0001F0 lwz 80610048 1 L4A gr3=pCommArea(gr1,72)
87 | 0001F4 lwz 8081004C 0 L4A gr4=pStringBuffer(gr1,76)
87 | 0001F8 bl 4BFFFE09 0 CALLN gr3,#MEMORY=strcpy,...
87 | 0001FC ori 60000000 1
92 | 000200 lwz 83C20008 0 L4A gr30=.$STATIC_BSS(gr2,0)
92 | 000204 addis 3C602000 1 LIU gr3=8192

Finding the error (AIX only) – Sample TransactionDumpUser.c
21 void main ()
22 {
24 long lRespCode;
25 char * pCommArea;
26 char * pStringBuffer = 0;
29 EXEC CICS ADDRESS EIB (dfheiptr) RESP (lRespCode);
......
57 EXEC CICS ADDRESS COMMAREA (pCommArea) RESP (lRespCode);
......
84 fprintf (stderr, "CommArea = [%s]\n");
85
86 strcpy (pStringBuffer, "Transaction Dump in USER code");
87 strcpy (pCommArea, pStringBuffer);
90 EXEC CICS RETURN;
98 }
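
The walk from dump to source in this example is: the formatted dump reports `Offset of current instruction = 0x1e8` in `main`; the listing places offset 0001E8 on source line 86; line 86 copies a string into `pStringBuffer`, which was initialised to 0. The lookup in the listing can be automated with a few lines of Python. This is a sketch that assumes the listing rows look like the sample (`<source line> | <6-digit hex offset> <mnemonic> ...`); `offsets_to_lines` and `source_line_for_offset` are hypothetical helper names.

```python
import re

# One listing row: "<source line> | <hex offset> <mnemonic> ..."
# Rows without a 6-digit hex offset (labels such as "CL.2:") are skipped.
LISTING_RE = re.compile(r"^\s*(\d+)\s*\|\s*([0-9A-Fa-f]{6})\s+(\S+)", re.MULTILINE)


def offsets_to_lines(listing: str) -> dict:
    """Map each instruction offset to (source line number, mnemonic)."""
    return {
        int(offset, 16): (int(line), mnemonic)
        for line, offset, mnemonic in LISTING_RE.findall(listing)
    }


def source_line_for_offset(listing: str, offset: int):
    """Return the source line for an instruction offset, or None if absent."""
    entry = offsets_to_lines(listing).get(offset)
    return entry[0] if entry else None


LISTING = """\
79 | CL.2:
84 | 0001DC ori 60000000 1
86 | 0001E0 lwz 8061004C 0 L4A gr3=pStringBuffer(gr1,76)
86 | 0001E4 addi 389F0078 1 AI gr4=gr31,120
86 | 0001E8 lswi 7CA4F4AA 4 LSI gr5-gr12=+CONSTANT_AREA(gr4,0), 30
87 | 0001F0 lwz 80610048 1 L4A gr3=pCommArea(gr1,72)
"""

if __name__ == "__main__":
    # 0x1e8 is the "Offset of current instruction" from the formatted dump.
    print(source_line_for_offset(LISTING, 0x1E8))
```

The offset in the dump is relative to the start of the function, so the lookup is only meaningful against the listing of the same build that was running when the dump was taken.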

Traceback file
• CICS produces traceback files when an application or CICS internal code raises an illegal exception (SIGSEGV, SIGILL).
• Traceback files are generated under the directory /var/cics_regions/<regionname>/dumps/dir1
• The generated traceback file is named cics<pid>.traceback, where pid is the ID of the process that generated the traceback.

Traceback file example
-----------------Stack Traceback--------------------
PID = 32782, TID = 1
9 - Function strlen Offset = 0094
8 - Function main Offset = 0104
7 - Function PinCA_StartC Offset = 01D0
6 - Function TasPR_CallApplication Offset = 0508
5 - Function TasPR_RunProgram Offset = 14D8
4 - Function TasPR_IRun Offset = 1FEC
3 - Function TasTA_Exec Offset = 1CA0
2 - Function TasTA_Run Offset = 1C28
1 - Function main Offset = 0B68
0 - Function __start Offset = 0088
… …

Traceback file example
… …
16 - Function sqlall Offset = 1660
15 - Function sqlsel Offset = 0344
14 - Function sqlnst Offset = 0CCC
13 - Function sqlcmex Offset = 02B4
12 - Function sqlcxt Offset = 0074
11 - Function main Offset = 0558
10 - Function PinCA_StartC Offset = 0148
9 - Function TasPR_CallApplication Offset = 03D8
8 - Function TasPR_RunProgram Offset = 11D8
7 - Function TasPR_IRun Offset = 13B0
6 - Function TasPR_Run Offset = 0AB0
5 - Function PinCA_Route Offset = 06C4
4 - Function ComFS_APPCServ Offset = 0D50
3 - Function TasTA_Exec Offset = 1934
2 - Function TasTA_Run Offset = 1644
1 - Function main Offset = 09B4
0 - Function __start Offset = 0060
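
In both tracebacks the frames above `PinCA_StartC` belong to the application and the libraries it calls, while the frames below it are CICS internals, so the `main` directly above `PinCA_StartC` is the application's entry point. The Python sketch below separates the two groups. Treating `PinCA_StartC` as the boundary is an inference from these samples rather than documented behaviour, and the function names are hypothetical.

```python
import re

# One frame: "<index> - Function <name> Offset = <hex offset>"
FRAME_RE = re.compile(
    r"(\d+)\s*-\s*Function\s+(\S+)\s+Offset\s*=\s*([0-9A-Fa-f]+)"
)


def parse_frames(text: str) -> list:
    """Return (index, function name, offset) tuples in file order."""
    return [(int(i), name, int(off, 16)) for i, name, off in FRAME_RE.findall(text)]


def application_frames(frames: list, boundary: str = "PinCA_StartC") -> list:
    """Return the frames above the boundary frame, innermost first.
    An empty list means the boundary function is not on the stack."""
    for index, name, _ in frames:
        if name == boundary:
            return sorted((f for f in frames if f[0] > index), reverse=True)
    return []


TRACEBACK = """\
PID = 32782, TID = 1
9 - Function strlen Offset = 0094
8 - Function main Offset = 0104
7 - Function PinCA_StartC Offset = 01D0
6 - Function TasPR_CallApplication Offset = 0508
1 - Function main Offset = 0B68
0 - Function __start Offset = 0088
"""

if __name__ == "__main__":
    for index, name, offset in application_frames(parse_frames(TRACEBACK)):
        print(index, name, hex(offset))
```

For the first sample this leaves `strlen` called from the application's `main`, and the offset of that `main` frame is the value to look up in the compiler listing.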

Diagnosing with system tools
• showProcInfo (shipped with Encina)
• dbx
• dumpThreads
• AIX or other UNIX command-line tools

showProcInfo and dbx
• showProcInfo is an Encina-supplied debugging tool. It is based on dbx, the debugger supplied with the operating system (in this case AIX).
• dbx is available on Solaris but not on HP-UX; showProcInfo uses 'dde' on HP-UX. gdb can also be used as a debugger.
• On AIX, dbx is part of the bos.debug fileset, which must be installed explicitly. If bos.debug is not installed, showProcInfo will not run either.
• This tool is heavily used by CICS customers to collect debugging data.
• Debuggers cannot catch exits, so coding exits in application programs is not good practice.

Introduction to showProcInfo
• showProcInfo is run against CICS Application Servers and core.<TimeStamp> files to get the current state of the process and its threads.
• showProcInfo is typically most useful during a region hang, but it can also be used to collect a snapshot of the Application Server processes.
• showProcInfo is available on all platforms.

Introduction to dbx
• dbx can be used to debug a program.
  • Set breakpoints.
  • Check the stack trace, etc.
• A program can be run inside a dbx session, or dbx can be attached to an already running program.
• How to run a program under dbx:
  • dbx <program path name>
• How to attach dbx to an already running program:
  • Identify the process ID of the program using the ps -ef command.
  • Attach dbx using dbx -a <process id>.
• If the program was compiled with the -g option, dbx also shows the argument names; otherwise only the offsets can be seen.
• How to examine a core file with dbx:
  • dbx <path name for program> core
• On Solaris and HP-UX, the 'file' command can be used to find out which program dumped the core. On AIX, the following command does the same: lquerypv -h core 6b0

dbx commands (1)
• If the program is started under dbx, type the 'run' command to start it.
• If dbx is attached to an already running program, the program is put on hold as soon as dbx attaches. Use the 'continue' or 'cont' command to resume processing.
• Breakpoints can be set using the commands:
  • stop in <symbol name>
  • stop at <line number:file name>

dbx commands (2)
• clear - Clears all the breakpoints at a given line.
• step (step in) - Debugs instruction by instruction, and also steps into subroutines.
• step out - Leaves the subroutine and returns to the point in the calling program from which the function was called.
• next - Executes the next instruction without stepping into function calls.
• cont - Run the program until the next …
• help - Lists all dbx commands and their help information.
• status - Shows the breakpoints currently set.
• catch - Catches a signal in dbx itself before it is sent on to the program.
• list - Lists the source code lines. Effective only if the program was compiled with the -g option.

dbx commands (3)
• ignore - dbx ignores the specified signal (defaults to all signals).
• where - Shows the stack trace.
• rerun - Runs the program once again.
• dump - Dumps the active variables.
• delete - Deletes the breakpoints.
• sh - Passes the command line to the shell for execution.

Other AIX/UNIX commands for debugging
• nm -> Checks whether a symbol is defined in a library or a program. Usage: nm <path name of the program>
• ldd (on Solaris and HP-UX) -> Shows the dependencies of the objects. ldd -r shows the recursive dependency.
• dump -Hv -> Shows the object dependencies on AIX.
• ps eawww -> Dumps the environment of a process.
• pstack -> Solaris specific. Can be used like showProcInfo.
• ps va -> Lists processes with their memory usage.
• vmstat <time interval> <repeat count> -> Used to check the CPU usage. E.g. vmstat 4 10 runs vmstat every four seconds and repeats 10 times.
• errpt -a -> Displays the errors recorded by the OS.
• truss (AIX v5 and Solaris) -> Traces the system calls made by a command.
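
Obtaining the client IP address
• Set the environment variable "CICS_DEBUG_CPMI=1"; the relevant information is then printed to the console file.
• Implemented as of TXSeries v5.1 or TXSeries v5.0.0.7.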

Generating a core file on ASRA
• Set the environment variable: CICS_CORE_ON_ASRA=1

Checking for leaks of memory and other resources
# Example:
RD:ServerMemCheckInterval = 3600
RD:ServermemCheckLimit = 4
Manage memory growth checking
1. RD:ServerMemCheckInterval
   Time in seconds between memory growth checks
   Default is 3600 (0 is disabled)
2. RD:ServermemCheckLimit
   Number of consecutive checks before CICS reports growth
   Default is 4 (0 means disabled)
3. Messages written to console
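
One way to read the two attributes together: memory is sampled once per interval, and growth is reported only after it has been observed in the configured number of consecutive checks. The Python sketch below illustrates that reading. It is not CICS code; the comparison rule (a check counts as growth when the sample is larger than the previous one) and the function name are assumptions made for illustration.

```python
def first_growth_report(samples, limit=4):
    """Return the index of the sample at which growth would be reported,
    or None if it never is.

    samples: memory sizes, one per check interval (oldest first).
    limit:   consecutive growing checks required; 0 disables reporting.
    Assumption: a check "grows" when its sample exceeds the previous one.
    """
    if limit <= 0:
        return None
    streak = 0
    for i in range(1, len(samples)):
        if samples[i] > samples[i - 1]:
            streak += 1
        else:
            streak = 0  # any non-growing check resets the count
        if streak >= limit:
            return i
    return None


if __name__ == "__main__":
    # With the default interval of 3600 seconds, each sample is one hour apart.
    steady_leak = [100, 110, 120, 130, 140]
    with_dip = [100, 110, 105, 115, 125, 135, 145]
    print(first_growth_report(steady_leak))
    print(first_growth_report(with_dip))
```

Under this reading, the defaults (3600 seconds, 4 checks) mean a steadily growing server is reported after roughly four hours, and a single flat or shrinking sample restarts the count.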