Informix User Forum 2005 Moving Forward With Informix. Improved Scripting of IDS Alarms and Events. Thomas Horner Senior DBA/S1 Corporation. Atlanta, Georgia December 8-9, 2005. Overall Objectives.
Informix User Forum 2005: Moving Forward With Informix • Improved Scripting of IDS Alarms and Events • Thomas Horner, Senior DBA/S1 Corporation • Atlanta, Georgia • December 8-9, 2005
Overall Objectives • Enhance the supplied scripts • Help prevent unnecessary late-night pages and cell phone calls • Monitor dbspaces proactively • Use the same shells for 7.x, 9.x, and 10.x IDS engines
Presentation Overview • What does IBM/Informix supply? • Purpose of these custom shells • Overall design of the shells • Details of the alarm shell • Changes made to evidence shell • Details of the “LookatSpace” shell • Other shells I use for administration • Limitations of these shells
IBM Supplied Scripts • alarmprogram.sh, log_full.sh, no_log.sh, and evidence.sh supplied by IBM/Informix • IDS 9.4+ and 10.x alarm program is improved over the older versions • it gathers additional data for certain alarms • it sends email to and/or pages DBA • it recognizes the automatic log alarms • First two functions are in my alarm shell, but not the last one
IBM onconfig Parameters • ALARMPROGRAM onconfig parameter • set to appropriate value (full path name) • ALRM_ALL_EVENTS onconfig parameter • set to 1 • SYSALARMPROGRAM onconfig parameter • set to appropriate value (full path name) • DYNAMIC_LOGS onconfig parameter • this needs to be 1 or 0 for my alarm shell • all available space in log dbspace allocated up front • this is a design decision
Purpose of these Shells • Alarm Shell • combines functions of the “default” programs and adds features • Evidence Shell • match design of this program with the alarm program changes • LookatSpace Shell • gives DBA an “advance” notice of possible space issues
Purpose of these Shells • Other Shells used to monitor and administer the databases: • check database shell – quick check of engine status • onchecks shell – perform oncheck commands weekly • update statistics shell – perform scheduled update statistics • prune log shell – prune online log and other logs
Overall Design of Shells • Alarm and Evidence Shells • add functionality to supplied default programs • do not change how the shells are used by the Informix engine • LookatSpace Shell • run on a scheduled basis to check for low space that may not be obvious from simple onstat -d output • Other Shells • run on a daily or weekly schedule to perform other administrative functions
Overall Design of Shells • All Shells • can be used for multi-instance installations and multiple production databases in one instance • can be used across 7.x, 9.x, and 10.x engines
Installation • These are currently installed on four production servers and several test servers on the following versions: • IDS Version 7.24 on HP-UX 10.20 • IDS Version 9.21 on HP-UX 11.00 • Other installations are successfully using them (based on emails I have received) • Requires a means of notifying the DBA team and the Data Center
Alarm Program – Overview • Five parameters passed from instance: • Severity (severity) • ranges from 1 through 5 • Class_ID (class_id) • contains the message ID that caused the alarm • Message (class_msg) • contains the actual text of the alarm • Additional Text (specific_msg) • Event File (see_also)
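As a rough illustration of the five parameters above, an alarm shell might map its positional arguments onto the names shown on the slide and write them to a log. This is a sketch, not the author's code: `describe_alarm` is a hypothetical helper, and the output labels are copied from the alarm.log excerpt shown later in the deck.

```shell
#!/bin/sh
# Sketch only: describe_alarm is a hypothetical helper, not part of the original shell.
# It maps the five arguments passed by the engine onto the slide's variable names.
describe_alarm() {
    severity=$1       # 1 through 5
    class_id=$2       # class ID of the event that raised the alarm
    class_msg=$3      # text of the alarm
    specific_msg=$4   # additional text
    see_also=$5       # event file, often empty

    printf 'alarm.sh got event %s\n' "$class_id"
    printf 'severity : %s\n' "$severity"
    printf 'message : %s\n' "$class_msg"
    printf 'additional text: %s\n' "$specific_msg"
    printf 'reference file : %s\n' "$see_also"
}
```

Inside a real alarm program the call would be `describe_alarm "$@" >> "$ALARMLOG"`, where `ALARMLOG` is an assumed variable naming the alarm log file.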
Alarm Program – Functions Added • Set the proper level of notification based on alarm severity • Prevent overload of machine resources and email caused by duplicate or multiple alarms for the same issue • Reduce “false” alarms by using mutex files • Perform logical log backups using ontape • Option for “no notification” • Alarm log file used to record alarms and actions
Alarm Changes – Proper Notification Level • Severity 1 or 2 • no notification as recommended by IBM/Informix • Severity 3 • not critical – email is sent to the DBA team • no email if class 6, 15, 21, or 23 (more on why later) • Severity 4 or 5 • critical – data center is notified for action and an email is sent to the DBA team for our records • no notification if class 6, 15, or 21 (more on why later)
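The severity and class rules above can be expressed as a small decision function. The sketch below is only one way to write it: `notify_level` and its return words (`none`, `email`, `critical`) are hypothetical names, and treating an unknown severity as `none` is an assumption the slides do not cover.

```shell
#!/bin/sh
# Sketch only: notify_level is a hypothetical name.
# Prints "none", "email" (DBA team only) or "critical" (Data Center plus DBA email).
notify_level() {
    severity=$1
    class_id=$2
    case "$severity" in
        1|2) echo none ;;
        3)
            case "$class_id" in
                6|15|21|23) echo none ;;   # these classes have their own code section
                *) echo email ;;
            esac ;;
        4|5)
            case "$class_id" in
                6|15|21) echo none ;;      # these classes have their own code section
                *) echo critical ;;
            esac ;;
        *) echo none ;;                    # assumption: unknown severity is not notified
    esac
}
```

For example, `notify_level 3 13` prints `email`, while `notify_level 4 21` prints `none` because class 21 is handled by the duplicate-alarm logic instead.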
Stop Duplicate Alarms • Biggest design change I made from the default alarm programs • Classes 6, 15, and 21 can cause multiple alarms • class 6 is “non fatal” Internal Subsystem Failure • class 15 is Data Replication Failure • class 21 is Online Resource Overflow • Idea for this change came with my first encounter with multiple class 21 alarms • caused by process exceeding available number of locks (version 7.x engine) • hundreds of emails received within a minute – OOPS!
Stop Duplicate Alarms (cont’d) • Separate section of code to handle classes 6, 15, and 21 • Class 23 (logical log complete, so a log backup is needed) also has its own section of code to perform log backups • Shell uses distinctly named files in /tmp for classes 6, 15, and 21: • /tmp/event${ENV}${FILENO}.`date +%H` • Alarm is considered new if this file in /tmp does not exist or if that file is more than one hour old • One hour threshold was a design decision
Stop Duplicate Alarms (cont’d) • Steps used to handle classes 6, 15, and 21: • if the alarm severity is less than 3, ignore the alarm • if the file in /tmp exists and is less than one hour old: • consider this a duplicate alarm of this class • simply log it • if the file in /tmp does not exist, or is more than one hour old, this is the first alarm of this class: • follow notification protocol • create (or update) the /tmp file for this alarm
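The duplicate-alarm steps above could be sketched in shell roughly as follows. This is an illustration, not the author's code: `is_new_alarm` and `handle_repeatable_alarm` are hypothetical names, the flag-file path is passed in as an argument rather than built from `${ENV}` and `${FILENO}`, and `find -mmin` is a GNU/BSD extension that may not be available on older HP-UX systems.

```shell
#!/bin/sh
# Sketch only: function names are hypothetical.

# Succeeds (returns 0) when the flag file is missing or more than 60 minutes old.
is_new_alarm() {
    flag=$1
    if [ ! -f "$flag" ]; then
        return 0
    fi
    # find prints the file name only if it was last modified more than 60 minutes ago
    if [ -n "$(find "$flag" -mmin +60 2>/dev/null)" ]; then
        return 0
    fi
    return 1
}

# Prints the action to take: "ignore", "notify" or "log-only".
handle_repeatable_alarm() {
    severity=$1
    flag=$2
    if [ "$severity" -lt 3 ]; then
        echo ignore
        return 0
    fi
    if is_new_alarm "$flag"; then
        touch "$flag"        # create the flag file, or refresh its timestamp
        echo notify
    else
        echo log-only
    fi
}
```

The first severity 3 alarm for a class prints `notify` and creates the flag file; repeats within the hour print `log-only`, which is what keeps hundreds of emails from being sent for one lock-table overflow.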
Alarm – Real alarm.log output
Fri Jul 19 09:40:24 EDT 2002 alarm.sh got event 21
severity : 3
message : OnLine resource overflow: 'Locks'.
additional text: Lock table overflow - user id 106, session id 1133666
reference file :
Fri Jul 19 09:40:30 EDT 2002 alarm.sh got event 23
severity : 2
message : Logical Log 15362 Complete.
additional text: Logical Log 15362 Complete.
reference file :
Fri Jul 19 09:40:39 EDT 2002 alarm.sh got event 18
severity : 2
message : Log Backup completed: 15362.
additional text: Logical Log 15362 - Backup Completed
reference file :
Alarm – Real alarm.log output (cont’d)
Fri Jul 19 09:40:39 EDT 2002 Multiple alarms - class 21, severity 3.
Fri Jul 19 09:40:40 EDT 2002 Multiple alarms - class 21, severity 3.
Fri Jul 19 09:41:02 EDT 2002 Existing class 21 issue - no notification needed.
Fri Jul 19 09:41:03 EDT 2002 Multiple alarms - class 21, severity 3.
Fri Jul 19 09:41:03 EDT 2002 Multiple alarms - class 21, severity 3.
Fri Jul 19 09:41:05 EDT 2002 Multiple alarms - class 21, severity 3.
Fri Jul 19 09:41:05 EDT 2002 Existing class 21 issue - no notification needed.
Fri Jul 19 09:41:17 EDT 2002 alarm.sh got event 23
severity : 2
message : Logical Log 15363 Complete.
additional text: Logical Log 15363 Complete.
reference file :
Alarm – Perform Logical Log Backups • Make sure no other log backup is running: • check for /tmp/ontape.L${ENV}, a mutex file • if it exists, do not start another log backup; notify the DBA team via email • not considered critical because this can occur normally when logs turn over quickly • if it does not exist, create the /tmp/ontape.L${ENV} mutex file and continue • If the LTAPEDEV onconfig parameter is set to /dev/null, run ontape -a to free the log, then exit
Alarm – Perform Logical Log Backups (cont’d) • Make sure engine is up using “onstat -” command • if not, follow notification protocol (severity is critical) • Make sure log backup device is ready • if not, follow notification protocol (severity is critical) • Determine the numbers of the first and last logs that will be in this backup file using “onstat -l” command piped to a grep
Alarm – Perform Logical Log Backups (cont’d) • Note any “missing” log numbers in log file • Perform the actual log backup using “ontape -a” • If ontape command fails, follow notification protocol (severity is critical) • Move, rename, and compress the log backup file using gzip • Remove the mutex file so that the next log backup can run
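The mutex handling that brackets the log backup can be sketched as below. The function names are hypothetical, and the sketch swaps the slide's separate "check, then create" steps for the shell's `noclobber` option (`set -C`), which makes the test and the creation a single step so that two alarms arriving together cannot both acquire the file.

```shell
#!/bin/sh
# Sketch only: acquire_mutex, release_mutex and run_with_mutex are hypothetical names.

# Creates the mutex file; fails if it already exists.
acquire_mutex() {
    # With noclobber set, ">" refuses to overwrite an existing file.
    ( set -C; : > "$1" ) 2>/dev/null
}

release_mutex() {
    rm -f "$1"
}

# run_with_mutex MUTEXFILE COMMAND [ARGS...]
run_with_mutex() {
    mutex=$1
    shift
    if ! acquire_mutex "$mutex"; then
        echo "backup already running"   # the real shell emails the DBA team here
        return 1
    fi
    "$@"
    rc=$?
    release_mutex "$mutex"              # always remove it so the next backup can run
    return $rc
}
```

A call such as `run_with_mutex "/tmp/ontape.L${ENV}" ontape -a` would mirror the sequence on these slides; the checks on the engine and the backup device would go between acquiring the mutex and running the backup.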
Alarm – No Notification Option • At the beginning of the alarm program, it looks for a file named alarm.nomail in /usr/informix • MAILFLAG shell variable is set to “off” if the file exists, otherwise to “on” • Before every statement where notification is to be sent, the MAILFLAG variable is checked • If MAILFLAG is “off”, do not send email or notify Data Center • If MAILFLAG is “on”, send email and (if critical) notify Data Center • You can simply remove the alarm.nomail file to start having notifications sent
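A minimal sketch of the MAILFLAG switch follows. Only `MAILFLAG` and `alarm.nomail` come from the slides; `set_mailflag`, `notify`, `send_email`, and `page_data_center` are hypothetical names, and the last two are stubs that print instead of sending anything.

```shell
#!/bin/sh
# Sketch only: all function names here are hypothetical.

send_email()       { echo "EMAIL: $1"; }   # stub standing in for the real mail command
page_data_center() { echo "PAGE: $1"; }    # stub standing in for the real paging method

# set_mailflag DIR: the slides use /usr/informix as DIR.
set_mailflag() {
    if [ -f "$1/alarm.nomail" ]; then
        MAILFLAG=off
    else
        MAILFLAG=on
    fi
}

# notify LEVEL MESSAGE: LEVEL is "critical" or anything else.
notify() {
    [ "$MAILFLAG" = "on" ] || return 0     # notifications are switched off
    send_email "$2"
    if [ "$1" = "critical" ]; then
        page_data_center "$2"
    fi
}
```

Creating the file (`touch /usr/informix/alarm.nomail`) before planned maintenance silences notifications without editing the shell, and removing it turns them back on.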
Evidence Program – Overview • Default (supplied) program is called evidence.sh • Normally called by engine when an assert failure occurs to “gather evidence” for use by IBM/Informix support • Not supplied with 7.2x engines • SYSALARMPROGRAM configuration parameter • Twelve parameters are passed to program • IBM/Informix recommends not changing the functions of this more complex shell
Evidence Program – Issues Addressed • I did change the notification techniques to match those used in the alarm program • Added the use of MAILFLAG to stop notification • Added notification for warnings (email to DBA team) in addition to failures • Put in appropriate values for the environment variables at the beginning of the program • I do not email the assert failure file (which the default program does) because of its large size • Named the program evidence.${ENV} for use in multiple instances
LookatSpace Program – Purpose • You may think that you have plenty of free space in a particular dbspace • one table that requests a large next extent can use up all the remaining free dbspace • another table in the same dbspace that also needs additional space can be “out of luck” and a SQL error will be returned to the user • This shell looks for this type of situation and emails any issues found to the DBA team • DBA team then has time to add a chunk to the dbspace before it becomes critical • We run this once a week on a scheduled basis
LookatSpace – Program Design • Identify the production database by using sysmaster SQL to find the database that holds the largest table in the instance (assumes only one production database) • Obtain dbspace usage using sysmaster SQL • separate out those that contain blobs for use later • Obtain which non-fragmented tables are in what dbspace using SQL • Obtain which fragmented tables are in what dbspace using SQL
LookatSpace – Program Design (cont’d) • Two lists of dbspaces are created • we do not put non-fragmented and fragmented tables in the same dbspace • If dbspace contains no tables or blobs, and has less than 3% free space: • assume that this dbspace contains only indexes • send email to DBA team because it is low on space • If dbspace has non-fragmented tables: • obtain table space usage and future needs • uses sysmaster SQL
LookatSpace – Program Design (cont’d) • If dbspace has fragmented tables: • obtain table space usage and future needs • uses sysmaster SQL • If space is more than 80% used, and next extent is greater than free space remaining in the dbspace: • send an email to the DBA team • If space is more than 95% used, and next extent is greater than available dbspace: • add a warning message to that DBA team email
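The 80% and 95% rules for tables can be sketched as a single check. This is an assumption-laden illustration: `check_table_space` is a hypothetical name, the inputs are assumed to be page counts already pulled from sysmaster, and "percent used" is computed as (allocated - free) / allocated, a formula inferred from the figures in the sample email rather than stated on the slides.

```shell
#!/bin/sh
# Sketch only: check_table_space is a hypothetical name.
# Arguments (all in pages): ALLOCATED FREE_IN_TABLE NEXT_EXTENT FREE_IN_DBSPACE
# Prints "ok", "email" or "warning" followed by the percent used.
check_table_space() {
    awk -v alloc="$1" -v tfree="$2" -v nxt="$3" -v dfree="$4" 'BEGIN {
        if (alloc <= 0) { print "ok 0.00"; exit }
        pct = (alloc - tfree) * 100 / alloc
        if (pct > 80 && nxt > dfree) {
            # above 95 percent the email also carries a warning message
            level = (pct > 95) ? "warning" : "email"
        } else {
            level = "ok"
        }
        printf "%s %.2f\n", level, pct
    }'
}
```

With the numbers from the sample email, `check_table_space 1499947 231611 250000 99997` prints `email 84.56`: the table is more than 80 percent used and its next extent of 250000 pages exceeds the 99997 pages free in the dbspace.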
LookatSpace – Program Design (cont’d) • If dbspace contains blobs, check free space in dbspace and the number of blobs remaining • If space available is less than 3% and number of blobs remaining is less than 20000, send an email with warning to the DBA team • While the program goes through all these steps, a basic text report (space report) is created • If there are no issues to report, no email is sent, but the space report is always available for review
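The two 3% rules, one for dbspaces with no tables or blobs (assumed to hold only indexes) and one for dbspaces with blobs, can be sketched in the same style. `check_other_dbspace` and its `index`/`blob` keywords are hypothetical, and the inputs are assumed to have been computed already from sysmaster.

```shell
#!/bin/sh
# Sketch only: check_other_dbspace is a hypothetical name.
# Arguments: KIND ("index" or "blob") PERCENT_FREE [BLOBS_REMAINING]
# Prints "email", "warning" or "ok".
check_other_dbspace() {
    kind=$1
    pct_free=$2
    remaining=${3:-0}
    awk -v kind="$kind" -v pct="$pct_free" -v rem="$remaining" 'BEGIN {
        low = (pct < 3)                      # less than 3 percent free
        if (kind == "index" && low)
            print "email"
        else if (kind == "blob" && low && rem < 20000)
            print "warning"
        else
            print "ok"
    }'
}
```

The thresholds are hard-coded here as they are in the original shells, so they are the first thing to adjust for a different installation.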
LookatSpace – Program Design (cont’d) • The report is appended to each week, so a history of space utilization is available for analysis • A future enhancement could include looking at the index dbspaces • we have had these unexpectedly fill up when there is more than one large index in the same dbspace • Another enhancement can be to write code to analyze the space utilization reports and obtain trending information
LookatSpace – Sample Email Space is low in DBSpace dbs1 with tables on Tue Sep 27 05:31:00 EDT 2005 for host sf8pdb1, instance sfarm_shm. Table vfmtrnaudactvty next extent of 250000 pages will use all free 99997 pages in dbs1. Table has 1499947 pages allocated, 231611 pages free, and 84.56 percent used. Details are located in the /usr/informix/logs/checkspc.out file.
Other Shells I Use • Check Database Shell • checks to see if engine is up and active on a scheduled basis • performs log move if requested (uses onmode commands) • log move is run from another shell (to prevent issue in case of hung checkpoint) • log move option is used in our shop for disaster recovery purposes • Onchecks Shell • performs basic oncheck commands on a weekly basis
Other Shells I Use (cont’d) • Update Statistics Shell • can choose how update statistics is run via input parameters • temporarily changes certain Informix environment variables to improve performance while running update statistics • Prune Log Shell • archives various log files monthly • also archives the online.log
Limitations of these Shells • The shells (except alarm or evidence) are run on a scheduled basis, not on a demand basis • The LookatSpace shell requires that fragmented and non-fragmented tables not be in the same dbspace • The LookatSpace shell does not “predict” when index dbspaces will fill up • Certain thresholds are “hard-coded” in the shells and may need to be changed for your installation • Certain names of files and directories are coded in the shells and may need to be changed for your installation • Latest enhancements of data gathering features of 9.4+ supplied alarm program are not in the alarm shell
Review • Alarm program • combined the IBM/Informix “template” with my own ideas and those of others to make it more robust • handles multiple alarms and performs log backups • Evidence program • took the IBM/Informix “template” and made notification consistent with the alarm program • LookatSpace program • helps the DBA team identify space issues before they impact end users or become an “emergency” • Other shells we use to monitor the engines
Questions and Comments? • To get a copy of these shells, email me at thorner@s1.com. I can package the files and send them to you via email. • The objective here was to prevent the unnecessary page or phone call that may result in fixing something that is actually not broken. • Proactive monitoring of dbspaces using LookatSpace is better than a 3 am page requiring you to add a chunk. • Thank you all for your attention. I hope that these shells help you stay better informed about the status of your production systems.