Make sure you leave me a business card or a piece of paper with your name on it for the drawing at the end of the session. Book Drawing
Exchange High Availability Without Clustering • Jim McBee • ITCS Hawaii • jim@somorita.com
Setting the stage…. “Approximately 80 percent of unplanned downtime is caused by people and process issues, while the remainder is caused by technology failures and disasters” -Gartner Group study, March 16, 1999
Who is Jim McBee!!?? • Consultant, Writer, MCSE, MVP and MCT – Honolulu, Hawaii (Aloha!) • Principal clients • USPACOM J2 • USARPAC G6 • Author – Exchange 2003 24Seven (Sybex) • Contributor – Exchange and Outlook Administrator • Blog • http://mostlyexchange.blogspot.com • Free eBook • http://nexus.realtimepublishers.com/ttgsm.htm
This session’s coverage • Introduction to me and the topic • Presentation – About 60 minutes • Book give away – Drop off your business card or write your name on a slip of paper • Questions and answers – 10 - 15 minutes
Audience Assumptions • You have at least a few months’ experience running Exchange 5.5, 2000, or 2003. • You have worked with Active Directory • You can install and configure a Windows 2000 / 2003 server
Presentation coverage • Defining… • Availability, reliability, fault tolerance • Estimated costs of clustering • Common causes of downtime • Your friend, the SLA • Preventing disasters • Configuration recommendations • Minimizing the effects of downtime • Daily operations • Backup plans • This presentation will be posted to my blog after April 30, 2006 – http://mostlyexchange.blogspot.com
If you take nothing else from this session, take this: Formula for better availability • Get good training and have good reference material • Set yourself up for predictable operations • Monitor your system to ensure it stays within the boundaries you establish
High Availability - 101 • Determine the causes of unplanned downtime • Focus on preventing ‘disasters’ • Predictable daily operations • Catch problems before they affect the users
Myths of high availability • Failure to meet 24x7x365 is a technical problem • More hardware = better availability • Training is not necessary • Existing procedures and processes are good enough • High availability can be bought off the shelf • Can be achieved without ‘investment’
In search of 5 nines (99.999%) • The percentage of uptime you have during your scheduled hours of operation • Stated hours of operation 24x7x365? • 99% up time = 3.7 days of downtime per year • 99.7% up time = about 1 day • 99.9% up time = 8.8 hours • 99.99% up time = 52 minutes • 99.999% up time = 5.3 minutes • Hopefully you are not promising 24x7x365!
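The downtime figures on this slide can be reproduced with a few lines of arithmetic. This is a minimal sketch, assuming 24x7x365 scheduled operation (8,760 hours per year); the function name and the `scheduled_hours` parameter are illustrative, not from the presentation.

```python
# Downtime allowed per year for a given availability percentage.
# Assumption: 24x7x365 scheduled operation, i.e. 8,760 scheduled hours.
HOURS_PER_YEAR = 24 * 365


def allowed_downtime_hours(availability_percent, scheduled_hours=HOURS_PER_YEAR):
    """Return the hours of downtime permitted within the scheduled hours."""
    return scheduled_hours * (1 - availability_percent / 100)


if __name__ == "__main__":
    for pct in (99, 99.7, 99.9, 99.99, 99.999):
        hours = allowed_downtime_hours(pct)
        print(f"{pct}% uptime allows {hours:.2f} hours "
              f"({hours * 60:.1f} minutes) of downtime per year")
```

Passing a smaller `scheduled_hours` value (for example, a business-hours-only schedule) shows how much less downtime the same percentage allows when the stated hours of operation shrink.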
Availability and Reliability… • Availability… • The percent of time that Exchange is accessible to the user community within the stated schedule of operations • The proportion of time that a system can be used for productive work • Lets you keep your job • Reliability… • An application or service provides the same results under similar load • Provides consistent, correct results • Lets you sleep a little better at night
Availability and Reliability… • Don’t sacrifice reliability for availability!!! • Don’t put off service pack application or critical system maintenance (e.g. replacing a dead disk) just so your availability numbers look good • In general, 8 hours of scheduled, off-peak downtime or degraded service is more acceptable to users than 1 hour of unplanned downtime in the middle of the business day.
Fault Tolerance versus High Availability • Fault tolerance • Components that keep an application functioning in the event of a component failure • Disks (RAID 1, 5, 0+1) • Redundant Power Supplies • UPS • High Availability • Does not necessarily guarantee 100% availability, just higher availability • Moving an application to an alternate server
So, what are WE talking about today? • We are going to focus on: • Reliability • Fault tolerance • Preventing ‘disasters’ • Increasing availability through better reliability, fault tolerance, and procedures
What is an Exchange disaster? • Answers vary from organization to organization • Typically loss of data • Loss of messaging services for more than one or two hours during scheduled operations? • Loss of a single mailbox? • Failure of a specific service? • Microsoft measures downtime based on the number of users affected! • 1000 users on a server that is down for 5 minutes would be 5000 minutes of downtime! • That kind of downtime does NOT look good on a resume
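The user-based measure on this slide is simple multiplication. A minimal sketch, using an illustrative function name:

```python
def user_minutes_of_downtime(users_affected, outage_minutes):
    """Downtime weighted by the number of users who experienced it."""
    return users_affected * outage_minutes


if __name__ == "__main__":
    # The slide's example: 1000 users on a server that is down for 5 minutes.
    print(user_minutes_of_downtime(1000, 5))
```

Measured this way, a short outage on a heavily loaded server counts for more than a long outage on a lightly used one.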
Appraise the cost of downtime • User productivity • Missed contractual obligations • Missed sales or customer contact • Loss of customer confidence • Loss of end user good will • Loss of credibility • Loss of your job!
Clustering 101 • Provides higher availability • Clustering does exactly what it claims to do; it protects your organization against hardware failures. • Clustering gets a bad rap for a number of reasons: • Improper operations • Lofty expectations or assumptions • Allows the passive node to be shut down or rebooted for maintenance
Non-clustered configuration costs • Possible configuration: • Dell Dual Xeon 2.8GHz • 4GB RAM • 700GB disks • 160/320GB SDLT Tape • Windows 2003 Standard Server • Exchange 2003 Enterprise Edition • 1,500 Exchange CALs • Veritas Backup Exec w/Exchange Agent • Cost = approximately $91,000
Clustered configuration costs • Possible configuration: • 2 Dell Dual Xeon 2.8GHz • 4GB RAM • 700GB disks • 2 copies Windows 2003 Advanced Server • 1 copy Exchange 2003 Enterprise Edition • 1,500 Exchange CALs • Veritas Backup Exec w/Exchange Agent • Veritas SAN Option • Dell rack • Dell fiber-based SAN and SAN connected 160/320GB SDLT Tape Drive • Cost = approximately $190,000
To cluster or not to cluster…. • Price potentially doubles! • Complexity triples! • You must understand Windows / Active Directory / Exchange / Clustering / SANs • Layer 8 problems – The Political layer • Management expectations are higher! • Danger Will Robinson! Danger! • Layer 9 problems - The Bozone layer • Snuffy the Network Admin • Fail-over is NOT instantaneous (at best 2 – 3 minutes) • Still have single points of failure (the SAN, the network infrastructure)
To cluster or not to cluster… • If you don’t have 99.7% (1 day of downtime) availability right NOW, clustering won’t help. • People and procedures are the leading sources of failure. • “High availability starts from within, grasshopper”
Downtime Common Causes: 13 customers and 25 outages • 4 virus outbreaks requiring a shutdown • 4 SAN failures • 4 Shutdowns due to insufficient disk space • 1 Exceeded 16GB limit on Exchange standard • 1 File based A/V software corrupted EDB • 1 Admin applied wrong security template • 1 Operator could not restore database – 5 days! • 1 Database corrupt, 1018 error (device driver) • 1 Database corrupt, operator plugged external SCSI subsystem in while live • 1 Loss of organization’s only global catalog • 1 Loss of organization’s only DNS server • 1 Administrator incorrectly configured directory replication – loss of GAL • 1 Server blue screening every few hours (service pack / firmware issue) • 1 Motherboard failure • 1 SCSI controller failure • 1 Power to the campus data center failed
Ooops… • All but 3 of these outages could have been prevented with better procedures, training, and reliability preparedness. • Only 2 of these could have been prevented with clustering. • Many of these were prolonged or made worse due to insufficient training or procedures. • Exchange was not directly to blame
Change and Configuration Control • Never make changes without a process in place: • Document the changes to be made or patches to be applied. • Test the change in your lab • Responsible parties should review / approve • Notify affected parties • Schedule and give notice to the users • Implement • “Process” is going to become omnipresent for IT
Service Level Agreements (SLA) • Many types of SLAs • From vendor to customer • From IT Department to management/users • For an IT department, the SLA may provide: • Published hours of operation • Expected system responsiveness • Guidelines for operation and recovery • Sets expectations for the user community • Guideline for planning server hardware and configuration • May provide mechanism for reporting and accountability
SLA: Defining Recovery Time • SLA states that in the event of corruption, it takes 4 hours to get a mailbox store back online • Largest store size is 75GB • DLT tape restores at 10GB per hour • The BEST restore time you can expect for the largest store is 7.5 hours of tape transfer alone, and realistically > 8 hours! • It is time to re-think store sizes, backup / restore devices, the distribution of mailboxes, or the SLA! • Estimated recovery time may not accurately account for transaction log replay, either.
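The mismatch between this SLA and the hardware can be checked with the slide's own numbers. The sketch below covers tape transfer time only; it deliberately ignores transaction log replay and any other overhead, so real recovery will take longer. Function names are illustrative.

```python
def restore_hours(store_size_gb, restore_rate_gb_per_hour):
    """Best-case hours to stream a store back from tape (transfer only)."""
    return store_size_gb / restore_rate_gb_per_hour


def largest_store_for_sla(sla_hours, restore_rate_gb_per_hour):
    """Largest store (GB) that can be restored within the SLA window."""
    return sla_hours * restore_rate_gb_per_hour


if __name__ == "__main__":
    # Figures from the slide: 75GB store, 10GB per hour, 4 hour SLA.
    print("Transfer time:", restore_hours(75, 10), "hours")
    print("Largest store the SLA supports:", largest_store_for_sla(4, 10), "GB")
```

With these figures the tape transfer alone is nearly double the 4-hour promise, and the SLA can only be met for stores of 40GB or less at this restore rate.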
Sample SLAs and information • Intermedia • http://www.intermedia.net/legal/shared_sla • http://www.service-level-agreement.net • http://servicelevelbooks.com • http://www.oakland.edu/uts/helpdesk/docs/emailservicelevel.pdf
An ounce of prevention… • Eliminate single points of failure • Reliable servers / server configuration • UPS capacity - 30 minutes • Exchange configuration • Monitoring • Virus protection • Regular, reliable backups • Documentation
Where are your single points of failure? • DNS • Domain controllers • Global catalog servers • Front-end servers • Storage redundancy • Network infrastructure • Backbone • WAN links • Inbound / outbound SMTP mail
Server Configuration • Environment factors • Potential heat or water damage? • Physically secure • It should be really hard to hit the power button • Flash BIOS updates / firmware / device driver updates • Motherboard, disk controllers, tape devices, SANs • Check with your hardware vendor – The latest is not always the greatest • Use good quality cables for networking, fiber, and SCSI connections • Label and neatly tie-wrap them down! • Caching controllers • Use write caching only if battery backup exists; disable entirely otherwise • Budget for a ‘cold standby’ server with identical hardware
Server Configuration - Disks • SCSI disks provide better performance than IDE! • Disk redundancy • All disks should have redundancy (RAID 1, 5, 0+1) • On database disks, keep the disks less than 50% full • Improves restore performance • Provides capacity for unexpected growth • Allows for ESEUTIL repair • Don’t forget enough disk space for RSG • On transaction log disks, plan for at least a week of transaction logs • Never compress Exchange logs or databases!
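The 50% rule for database disks is easy to check automatically. This is a minimal sketch using Python's standard `shutil.disk_usage`; the function names and the idea of running it as a scheduled check are assumptions, not part of the presentation.

```python
import shutil


def percent_full(used_bytes, total_bytes):
    """Percentage of the volume currently in use."""
    return 100 * used_bytes / total_bytes


def database_disk_ok(used_bytes, total_bytes, max_percent_full=50):
    """True while the volume is still less than max_percent_full in use."""
    return percent_full(used_bytes, total_bytes) < max_percent_full


def check_volume(path, max_percent_full=50):
    """Apply the rule to a real volume, e.g. the drive holding the databases."""
    usage = shutil.disk_usage(path)  # named tuple: total, used, free (bytes)
    return database_disk_ok(usage.used, usage.total, max_percent_full)


if __name__ == "__main__":
    # "." is a placeholder; point this at your database volume.
    print("Database volume within limit:", check_volume("."))
```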
Server Configuration - Software • Latest service pack, critical fixes, and updates • Device drivers – consult manufacturer • Buggy disk device drivers and controller firmware are a common cause of corrupt databases • Monitor security fixes • Evaluate each security / critical update to see if it applies to you and how quickly it should be applied.
Server Configuration - Batteries Go Bad! • Consult manufacturer for recommended schedule to replace: • UPS batteries • Caching controller batteries
Server Configuration - Consistency • Organize Exchange servers into OUs • Use OU policy for • Auditing policy • Event log sizes and overwrite configuration • Security options • Disabled services • Custom registry settings • Information Store MAPI ports • System Attendant DS MAPI ports • W3SVC service dependencies • These can be included in the SCEREGVL.INF file – See KB 214752 • Avoid server-by-server registry changes if possible • Avoid security templates that overly restrict the local security settings or make file system permission changes.
Server Configuration – Gold Build • Get your servers, software, and configuration to a ‘gold build’ • Except for critical updates, don’t change the configuration frequently • Change is the enemy of availability, grasshopper!
Exchange Configuration • Necessary to limit Exchange usage to prevent out-of-control or unexpected growth, viruses spreading, as well as system abuse. • Limit: • Message sizes • Recipient limits • Mailbox sizes.
Exchange Configuration – Misc. • Configure deleted item recovery on all stores • Configure deleted mailbox recovery • Teach help desk how to recover ‘hard deleted’ items – KB 178630 • Direct Exchange databases to RAID 5 or RAID 0+1 volumes • Direct Exchange transaction logs to RAID 1 or RAID 0+1 volumes • Preferably on a separate disk controller from the databases • Do not rely on PSTs as primary mechanism for mail storage. • PST = BAD
Exchange Configuration: Role Segmentation • Dedicate Exchange servers to specific tasks: • Mailbox servers • Public folder servers • Routing group / Internet / X.400 bridgehead • Foreign mail system connectors (MS Mail, Notes) • Wireless, fax, SMS, and pager gateways • Front-end servers • Segmentation can: • Simplify your environment • Minimize impact of a server failure • Reduce recovery times • Often not practical in the ‘age of consolidation’ • If consolidating, keep mailbox servers separate from everything else
We can’t all be clairvoyant .. • …but we can monitor… • Implement some type of monitoring even if you can’t afford NetIQ, OmniAnalyzer, MOM, etc… - You will be glad you did! • Exchange System Manager’s Status and Notifications is free! Recommend monitoring: • Critical services • Disk space • Queue growth • CPU usage
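Even a home-grown check covers the four items on this slide. The sketch below is one possible shape for such a check, not a replacement for a real monitoring product: the threshold values are hypothetical, the `readings` dictionary stands in for whatever tool actually collects the numbers, and the service names in the example are illustrative.

```python
# Hypothetical thresholds; tune them to your own SLA and hardware.
THRESHOLDS = {
    "disk_percent_full": 50,
    "queue_length": 100,
    "cpu_percent": 80,
}


def find_alerts(readings, required_services, thresholds=THRESHOLDS):
    """Return a list of human-readable problems found in one set of readings."""
    alerts = []
    # Critical services: anything missing or stopped is a problem.
    for service in required_services:
        if not readings["services_running"].get(service, False):
            alerts.append(f"service not running: {service}")
    # Disk space, queue growth, CPU usage: compare against the limits.
    for metric, limit in thresholds.items():
        value = readings[metric]
        if value > limit:
            alerts.append(f"{metric} is {value}, above limit {limit}")
    return alerts


if __name__ == "__main__":
    sample = {
        "services_running": {"MSExchangeIS": True, "MSExchangeSA": True,
                             "SMTPSVC": False},
        "disk_percent_full": 62,
        "queue_length": 12,
        "cpu_percent": 35,
    }
    for alert in find_alerts(sample, ["MSExchangeIS", "MSExchangeSA", "SMTPSVC"]):
        print(alert)
```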
Operational Procedures • Follow standardized and documented procedures • Keep logs of all changes, updates, and problems with Exchange servers • Whenever possible, do not work at the Exchange server console. Do office administration and automation tasks at your desktop! • Never use beta software from any vendor • Never install an e-mail client on the Exchange server. • Perform complete backups before any changes • Do not apply service packs or updates immediately after release • Do not delete user accounts and mailboxes right away. Set account expiration to the day the user left and wait a month or two. • Never set file-based virus scanning software to scan the M:\ drive or any Exchange data or transaction log directories. • If enabled, never use backup software to back up the M:\ drive
I just gotta defrag! • Squash the urge to ‘over administer’ Exchange. • Rarely a reason to perform offline maintenance or offline defrags • Deleted or moved many mailboxes • Users have recently performed a ‘purge’ • If you need to get away from your kids/spouse and come in on weekends, use that time to test your restoration or disaster recovery procedures on a test network.
Daily operations • The Big 5 daily tasks • Perform and verify successful backups • Check available disk space • Update virus signatures / scanning engine • Check the SMTP and X.400 queues • Check the event logs
Events to watch for… • Anything that indicates a problem or error must be investigated. • Nightly successful backups • NTBackup # 8001 – SG backed up • ESE # 213 – SG backed up • ESE # 224 – Log files being purged for SG • Online maintenance (daily) • ESE # 701 – Completed online defrag • MSExchangeIS Mailbox # 1207 – Purged deleted items • MSExchangeIS Mailbox # 9535 – Purged deleted mailboxes • MSExchangeIS # 1221 – White space report • Performance suffers if online maintenance does not complete. • Make sure that online backups do not overlap online maintenance
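Because these events should appear every day, their absence is as important as an error. A minimal sketch of a "what did not happen" check, assuming you can export the last day's events as (source, event ID) pairs by whatever means you prefer; the export step and the function name are assumptions, while the sources and IDs are the ones listed on the slide.

```python
# Events the slide says should appear daily, keyed by (source, event ID).
EXPECTED_DAILY_EVENTS = {
    ("NTBackup", 8001): "storage group backed up",
    ("ESE", 213): "storage group backed up",
    ("ESE", 224): "log files purged for storage group",
    ("ESE", 701): "online defrag completed",
    ("MSExchangeIS Mailbox", 1207): "deleted items purged",
    ("MSExchangeIS Mailbox", 9535): "deleted mailboxes purged",
    ("MSExchangeIS", 1221): "white space report",
}


def missing_events(seen_events):
    """Return the expected (source, event ID) pairs that were not seen."""
    seen = set(seen_events)
    return sorted(key for key in EXPECTED_DAILY_EVENTS if key not in seen)


if __name__ == "__main__":
    # Example: everything was logged except the online defrag completion.
    seen_today = [key for key in EXPECTED_DAILY_EVENTS if key != ("ESE", 701)]
    for source, event_id in missing_events(seen_today):
        print(f"Missing: {source} # {event_id} "
              f"({EXPECTED_DAILY_EVENTS[(source, event_id)]})")
```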
Weekly or monthly operations • If enabled, purge the BADMAIL directory • Check the log file generation • Purge / archive the protocol logging directories • Archive event logs
Virus protection • Virus protection is mandatory in Exchange environments! • On the Exchange server, use a VSAPI 2.0 / 2.5 enabled virus scanner • Keep the signatures up-to-date – daily! • Client-side antivirus scanning is important, too • Publish a ‘forbidden attachment list’
Forbidden Attachment List – Minimal • EXE • COM • CMD • BAT • CHM • REG • SCR • VBS • VB • ASP • EML • HTM • PIF • HTML • JS • SHS • WSH • WSC
Other Forbidden Attachments • MPG • MPEG • MP3 • AVI • WAV • WMV • And other file types that are large and / or possibly unbusiness-like.
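The two lists above translate directly into a filename check. This is a minimal sketch of the idea, not how any particular gateway or virus scanner implements attachment blocking; the function name is illustrative.

```python
import os

# Extensions from the "minimal" slide and the "other" slide, lowercased.
MINIMAL_FORBIDDEN = {
    "exe", "com", "cmd", "bat", "chm", "reg", "scr", "vbs", "vb",
    "asp", "eml", "htm", "pif", "html", "js", "shs", "wsh", "wsc",
}
OTHER_FORBIDDEN = {"mpg", "mpeg", "mp3", "avi", "wav", "wmv"}


def is_forbidden(filename, forbidden=MINIMAL_FORBIDDEN | OTHER_FORBIDDEN):
    """True if the file's final extension is on the forbidden list."""
    _, ext = os.path.splitext(filename)
    return ext.lstrip(".").lower() in forbidden


if __name__ == "__main__":
    for name in ("invoice.pdf.exe", "song.MP3", "report.doc"):
        print(name, "blocked" if is_forbidden(name) else "allowed")
```

Only the final extension is tested, which is the one that matters for double-extension names such as `invoice.pdf.exe`.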