This article covers the planning, installation, and maintenance issues to consider when preparing for high availability. Topics include budgeting, hardware choices, software inventory, Progress versions, database layout, after imaging, and personnel planning.
Preparing for High Availability Adam Backman adam@wss.com V.P. of Technology White Star Software
What We Will Cover • Planning considerations • Installation issues • Maintenance issues
Planning Phase - People • Who “owns” the data • Be inclusive • This is not solely an IT decision • Eliminate surprises
Planning Considerations • Budget – high availability is not free • Hardware – fault tolerant, redundancy, … • Software – Progress is good but how is your “other” software? • Knowledge – buy or rent • Time – schedule and outage time • Personnel constraints – Who is on call?
Goals During Outage • Do no additional damage • Shortest amount of time • Reduce/Eliminate impact to customer
The Cost of Downtime • Wages • Idle workers • Cost to replace data • Production • Lost production • Impact to the customer • Can’t click website • Can’t place order
How Much Downtime Can You Afford? • For maintenance • Application • Database • For failures • Hardware • Software • Natural disaster
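One way to put a number on the question is to turn an availability target into a yearly downtime budget. The sketch below is only illustrative: the function name downtime_budget is made up for this example, and it assumes a 365-day year and that awk is available.

```shell
#!/bin/sh
# downtime_budget PCT: print the minutes of downtime per year that an
# availability target of PCT percent allows (assumes a 365-day year).
downtime_budget() {
    awk -v pct="$1" 'BEGIN {
        minutes_per_year = 365 * 24 * 60      # 525600 minutes
        printf "%.1f\n", minutes_per_year * (100 - pct) / 100
    }'
}
```

For example, downtime_budget 99.9 prints 525.6, a little under nine hours a year to cover both planned maintenance and failures.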
Planning Phase - Budget • Less downtime = additional cost • Better disks (RAID, Mirrors, EMC, …) • Redundant system • Remote site • More money does not equal less downtime • Prioritize • Look for most likely scenarios • Look beyond cool
Planning Phase - Hardware • Disks – The only moving part • RAID – Redundant Array of Inexpensive Disks • Avoid software mirroring • Use multiple controllers • Try to stick with a single-vendor solution
What RAID Really Means RAID has many levels; these are the most common: • RAID 0: Striping • RAID 1: Mirroring • RAID 5: Striping with parity; poor write performance • RAID 10: Mirroring and striping combined. Often loosely called RAID 0+1, although strictly RAID 10 (1+0) stripes across mirrored pairs while RAID 0+1 mirrors two stripe sets
Planning Phase - Hardware • CPU Check with vendor to ensure fault tolerance • Memory Do not interleave memory • Vendor Choose a reliable vendor (IBM, HP, Sun, Compaq, …)
Planning Phase - Hardware • Other hardware • File servers • Network stuff (LAN & WAN) • Phone/Internet connections
Planning Phase - Software • Inventory all software (client and server) and make sure it is current and supported • Determine what software is needed all of the time (Production control – Yes, Reporting software – No)
Planning Phase - Progress • Version of Progress (look for patches) • Layout of database • Single database or Multi-database • Storage area layout (logical and physical layout) • Application issues • Client/Server, N-Tier or Host based • Where does the application code reside?
Planning Database Layout • Single database • Easy to maintain • Still have storage areas to spread data • Single point of failure • Multi-database • More to maintain • Allows application partitioning • Maintenance flexibility • Two phase commit
After Imaging • Before image files keep information about records giving you the ability to undo a transaction • After image files keep information about records that allows you to redo a transaction in the event of media failure • After imaging is only part of a high availability strategy
After Imaging • Every high availability system should have after imaging enabled • Multiple after image areas are required for high availability • Only enable after imaging after you have a comprehensive backup and recovery plan in place
How Does Journaling Work? Here is a logical oversimplification of how journaling works:
    FOR EACH customer:
        /* BI note written */
        UPDATE customer.
        /* AI note written */
    END.
Planning Phase - Knowledge • Own Our people have the knowledge to do the project • Buy We can train our people to do this project • Rent We will hire consultants to implement this for us (Insert shameless plug here)
Planning Phase - Time Schedule for project • Machine purchase and delivery • Software availability • Resource availability • Do we need a long weekend for implementation? Timings measured later may change items in the implementation schedule
Planning Phase - Personnel • 24 hr. Operators If you don’t have operators you will need to develop monitoring routines with paging ability • Database Administrator(s) • System Administrator(s) Develop an escalation plan with “on call” schedule for off hours issues
Installation Phase • All items should have been already developed and tested prior to this stage • All items should have been already developed and tested prior to this stage • All items should have been already developed and tested prior to this stage • Get the point?
Installation Steps • Develop a schedule with timings and leave room for error as there WILL be errors • Write scripts to do tasks where possible to eliminate the human factor • Have a master checklist with the person/ people responsible for each item
Maintenance Goals • Provide consistent performance • Allow for advance planning • Avoid unscheduled outages
Maintenance • Don’t design something you cannot support • Scripting should be flexible but bulletproof • Example: www.peg.com/utilities.html • Monitoring and trending are very important to maintain high availability systems
Monitoring Areas of concern for high availability • Progress • Database areas filling • BI not being reused • AI space depleted • Running out of licenses • System • Disk space • Resources (memory, CPU, tunables, …)
Monitoring Progress - DB
    /* Storage area fill rate program */
    DEF VAR percent-free AS DEC FORMAT ">>9.99".
    FOR EACH _AreaStatus:
        percent-free = 100 - (_AreaStatus-HiWater / _AreaStatus-TotBlocks * 100).
        DISPLAY _AreaStatus-AreaName "Percent Free:" percent-free.
    END.
Monitoring Progress - BI
    /* Last BI file growth program */
    DEF VAR t_filename AS CHARACTER FORMAT "x(40)".
    t_filename = PDBNAME(1) + ".b".
    FIND LAST _ActIOFile WHERE _IOFile-FileName BEGINS t_filename.
    IF _IOFile-Extends = 0 THEN
        DISPLAY "ALL IS WELL".
    ELSE
        DISPLAY "The Sky is Falling !!!".
Monitoring Progress - AI
    # Program: After image extent full checker
    FULL_EXT=`rfutil "$DB" -C aimage extent list | grep -i full | wc -l`
    if [ $FULL_EXT -lt 9 ]
    then
        echo "$DB has $FULL_EXT full extents STATUS - OK"
    else
        echo "WARNING - $DB has $FULL_EXT full extents"
    fi
Monitoring Progress - Users
    /* License count tester */
    DEF VAR remaining-licenses AS INT.
    FIND _License.
    remaining-licenses = _Lic-ValidUsers - _Lic-MaxActive.
    /* You may want to use _Lic-ActiveConns instead of _Lic-MaxActive */
    IF .10 > (remaining-licenses / _Lic-ValidUsers) THEN
        DISPLAY "Less than 10% of licenses remaining" WITH FRAME X.
    ELSE
        DISPLAY "At least 10% of licenses remaining" WITH FRAME Y.
System Monitoring • Disk Space • How much disk available for growth • Also look at throughput capacity (average wait) • Memory capacity • Free memory is not a good indicator • I focus on the scan rate • CPU Capacity • How much idle time
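The disk space check can be scripted in the same spirit as the Progress checks. The following is a minimal sketch, not a finished monitor: disk_over_threshold is a hypothetical helper name, it reads POSIX-format `df -P` output on standard input, and it assumes mount points contain no spaces.

```shell
#!/bin/sh
# disk_over_threshold LIMIT: read `df -P` output on stdin and print the mount
# point and use% of every filesystem whose usage is at or above LIMIT percent.
disk_over_threshold() {
    awk -v limit="$1" 'NR > 1 {          # skip the df header line
        use = $5                         # Capacity column, e.g. "95%"
        sub(/%/, "", use)
        if (use + 0 >= limit + 0) print $6, use "%"
    }'
}
```

A typical call would be `df -P | disk_over_threshold 90`, with any output passed on to whatever paging mechanism is in place. This only shows how full the disks are now; trending the same numbers over time is what reveals how much room is left for growth.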
Maintenance Tasks • Backup and restore • After imaging • Log based replication • Data maintenance
Backup and Restore • Progress online backup • Quiet point backup • Warm standby backup
Backup and Restore Why can’t I just back up the database and before image files while the database is at a slow point? Answer: While the database is up it consists of three portions: the database files, the before image file(s), and shared memory
Portions of an Active DB Shared memory holds the most volatile data The database contains older committed data The before image holds transaction information All three are needed for a complete backup Shared memory DB BI
Online Backup What happens during an online backup? • Grab a db latch • Do a pseudo-checkpoint (this synchs memory to disk) • Switch AI file (if necessary) • Backup the before image file • Release the db latch • Backup the database (starting at the end)
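From the administrator's side, all of the steps above are triggered by a single probkup call. The wrapper below is a sketch under assumptions: online_backup is a hypothetical function name, the database and target names are placeholders, and the only thing it adds is passing the probkup exit status on so that a scheduler or pager can react.

```shell
#!/bin/sh
# online_backup DB TARGET: run a Progress online backup of DB to TARGET and
# report failure through the exit status.
online_backup() {
    db=$1
    target=$2
    if probkup online "$db" "$target"; then
        echo "backup of $db to $target completed"
    else
        status=$?                        # exit status of probkup
        echo "backup of $db FAILED (probkup exit status $status)" >&2
        return "$status"
    fi
}
```

It might be called as `online_backup sports /backup/sports.bak` from cron, where both arguments are illustrative only.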
Quiet Points • Very little impact to system availability • Allows for integration with hardware utilities • Only way to get an online backup with an operating system utility without shutting down the broker
How Quiet Points Work • Get the database latch • Do a pseudo-checkpoint • Wait for the quiet point to be removed NOTE: All processing will wait for the quiet point to be removed
Quiet Point Backup How to do a quiet point backup • Enable the quiet point (This synchs memory to disk) • Synchronize your disk mirrors • Split your disk mirrors • Disable the quiet point • Mount the mirrors as different file systems • Backup your mounted mirrors with an OS utility (tar, cpio, fdump, …)
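Because all processing waits while the quiet point is enabled, the script that splits the mirrors should release the quiet point even when the split fails. The sketch below shows one way to arrange that. It is not a vendor-supplied procedure: quiet_point_split is a hypothetical name, the mirror split command is whatever your disk subsystem provides, and the `proquiet dbname enable` and `proquiet dbname disable` forms are assumed here, so check the exact syntax for your Progress version.

```shell
#!/bin/sh
# quiet_point_split DB SPLIT_CMD...: enable a quiet point on DB, run the
# site-specific mirror sync/split command, then always disable the quiet point.
quiet_point_split() {
    db=$1
    shift
    proquiet "$db" enable || return 1    # synchs memory to disk, updates wait
    "$@"                                 # hypothetical sync-and-split command
    split_status=$?
    proquiet "$db" disable || return 1   # let processing resume
    return "$split_status"               # report whether the split worked
}
```

Mounting the split mirrors and backing them up with an OS utility happens afterwards, outside the quiet point, so users wait only for the split itself.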
After Imaging • Every high availability system should have after imaging enabled • Only enable after imaging after you have a comprehensive backup and recovery plan in place • AI is sometimes referred to as the redo log
Multi-volume after image files • Not a backup but a journal of completed transactions • Can be used to keep a copy of the database up to date • Can be switched with no interruption to user processing • Should be part of every high availability environment
How to integrate after imaging • In conjunction with a backup site • To update a report server • As a means of backup
AI to update a backup site • Poor man’s replication • Allows for periodic update of a copy of the database • The copy can then be backed up with a conventional backup mechanism
Log Based Replication • Log based replication is another way to say applying AI files to a copy of your database • Excellent way to maintain a warm copy of your database for fail over • Can be used on the same machine or on a remote machine for additional protection
Log Based Replication Rules • The standby database can only be accessed read-only (-RO) which means no remote (client/server) connections to the standby data • You must have a multi-volume AI. This is a must for high availability in any case • The standby database can have a different structure than the primary data
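Applying AI files to the standby comes down to rolling each archived extent forward in the order it was filled. A minimal sketch, assuming the AI files have already been copied to the standby machine: apply_ai_files is a hypothetical helper, and the file names passed to it are whatever your archiving scheme produces.

```shell
#!/bin/sh
# apply_ai_files STANDBY_DB AI_FILE...: roll each archived AI file forward
# into the standby database, in the order given, stopping at the first error.
apply_ai_files() {
    standby=$1
    shift
    for ai_file in "$@"; do
        if ! rfutil "$standby" -C roll forward -a "$ai_file"; then
            echo "roll forward of $ai_file failed; stopping" >&2
            return 1
        fi
        echo "applied $ai_file"
    done
}
```

Stopping at the first failure matters because AI files must be applied in sequence; carrying on past a failed file would only produce further errors.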
AI as a Means of Backup • Not generally a good idea • Increased recovery time • Reduced reliability • Backup the database each weekend • Backup the AI file(s) each weeknight
Backup – Points to Remember • Simplicity and minimizing user interaction will increase backup reliability • You are only as good as your last tested backup • Archiving off site is essential
Database Maintenance • Data Stuff • Table move • Database analysis • Index Stuff • Index rebuild (offline) • Index Compress • Index Fix
Table Move • Pros • Simple • Bullet proof • Cons • Slow • Table is read only for the duration of the move • Uses tons of logging space
Table Move Syntax: proutil dbname -C tablemove tablename table-area [index-area] table-area = the target application data area into which the table is to be moved index-area = the name of the target index area; if not specified, the indexes are left in their existing location
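Since the table is read-only for the duration of the move, the command is worth wrapping so that a mistyped argument list is rejected before proutil ever runs. This is only a sketch; table_move is a hypothetical wrapper name, and the database, table, and area names in the usage line are placeholders.

```shell
#!/bin/sh
# table_move DB TABLE TABLE_AREA [INDEX_AREA]: move TABLE into TABLE_AREA;
# the indexes are moved as well only when INDEX_AREA is supplied.
table_move() {
    if [ "$#" -lt 3 ] || [ "$#" -gt 4 ]; then
        echo "usage: table_move db table table-area [index-area]" >&2
        return 2
    fi
    db=$1
    shift
    # "$@" now holds the table, the table area and the optional index area
    proutil "$db" -C tablemove "$@"
}
```

A call such as `table_move sports customer "Customer Data" "Customer Index"` moves both the table and its indexes; leaving off the last argument leaves the indexes where they are.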