Building a Proactive Monitoring and Alerting System Using Native IBM Domino Tools Andy Pedisich Technotics
Why Do This Session … • Many Admins want to take advantage of native Notes monitoring solutions, they just don’t have the bandwidth to explore them • “Free time” is very rare these days • This jumpstart will show you: • How to collect stats • How to analyze stats • How to go behind the scenes • How to set up monitors, alerts • And how to capture just about any little event you are interested in • And finally, how to configure and work with DDM • Let’s get started
What We’ll Cover … • Looking at the big picture of server monitoring • Understanding statistic generation • Designing an efficient and sensible collection infrastructure • Pulling useful information from statistical data • Using cluster stats to keep clusters reliable • Understanding the essentials of event monitoring • Determining the best notification methods • DDM: Understanding how it fits into your environment • DDM: Crafting a perfect DDM data collection hierarchy • DDM: Looking at DDM events and probes • Wrap-up
Driving Your Domino Servers • You can learn a lot about the importance of monitoring from driving your car • Your car tells you a lot about what’s going on • And you know the indicators are important because you pay attention to them • You fill the gas tank when it’s low • Unless you are Rob Axelrod (ask Rob) • And (usually) pay attention to the speedometer so you won’t get a ticket • Or maybe you’re the driver who thinks that red light on the dash is just for ambience while you’re driving at night • Uh oh
Domino Servers Are Obsessed with Statistics • Domino servers are constantly spewing stats • Just like your car telling you how fast you’re going • Except with Domino there are literally several hundred statistics generated • Most of them are updated continuously • Many administrators don’t know which ones are important • Or how to tell the good readings from the bad ones • Or what to do about them when they are bad
The Truth About Monitoring • A good administrator shouldn’t have to look very hard • And you can be notified about most problems automatically • You can be proactive about fixing them • When you’re proactive, you put out fewer fires • Firefighting dilutes your effort • But being notified requires that you monitor your environment for events and issues • And events depend on statistics • And statistics need to be collected • And too many sites don’t collect stats correctly • Some don’t collect them at all
Perpetual Statistics • Domino servers constantly generate statistics • They track data on a surprising level • On almost every aspect of server operations • Agent manager • Mail and calendaring • The server’s platform • SMTP and Notes mail • LDAP • HTTP • Network • And lots more, too
Server Statistics Are Organized Hierarchically • Stats are gathered into major categories like these • And then each one has a multitude of subcategories
Subcategories of Statistics • Here’s a snapshot from the Administrator client showing some of the statistical hierarchy • This gives you a snapshot of the stats on your server • Use Refresh to get another snapshot
Statistics Come in Basic Types • The basic types of statistics are: • Stats that never change once the server is started • Snapshot stats – reflect what’s going on right now • Cumulative stats that grow from the moment the server is started • These stats are available to you for: • Your Domino servers • The platform your server is running on • Your network environment
Static Statistics • Statistics that don’t change usually represent the operating environment of the server • Server.Version.Notes = Release 8.5.3FP3 • Server.Version.OS = Windows NT 5.0 • Server.CPU.Type = Intel Pentium • Disk.D.Size = 71,847,784,448 • Mem.PhysicalRAM = 527,433,728
Amazing Detail, Yours Free! • This includes OS platform, Domino version, RAM • Lots of information about disks in use • Platform.LogicalDisk.TotalNumofDisks = 3 • Platform.LogicalDisk.2.AssignedName = E • Disk.C.Size = 80,023,715,840 • And even Network Interface Card (NIC) information • Platform.Network.1.AdapterName = Intel[R] PRO_1000 MT Server Adapter • Platform.Network.2.AdapterName = Broadcom NetXtreme Gigabit Ethernet _2 • Platform.Network.3.AdapterName = Broadcom NetXtreme Gigabit Ethernet
What Good Are These Static Stats? • Think these static stats aren’t helpful? • Guess again • They are extremely valuable • If you are collecting stats correctly from all your servers, you can take a pretty detailed server inventory • Without leaving your desk • From servers all around the world, just by looking at the data we’re going to collect in the Monitoring Results database • This database is also known by its filename: STATREP.NSF
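As a rough illustration of the inventory idea, here is a minimal Python sketch. It assumes the static stats for each server have already been exported from the Monitoring Results database into a plain dictionary; the export step, the server names, and the helper name `build_inventory` are hypothetical and not part of Domino.

```python
# Static stats worth pulling into an inventory (names taken from the slides above).
INVENTORY_FIELDS = [
    "Server.Version.Notes",
    "Server.Version.OS",
    "Server.CPU.Type",
    "Mem.PhysicalRAM",
]

def build_inventory(stats_by_server, fields=INVENTORY_FIELDS):
    """Return one row per server containing only the requested static stats."""
    rows = []
    for server in sorted(stats_by_server):
        stats = stats_by_server[server]
        row = {"Server": server}
        for field in fields:
            # A missing stat is left blank rather than guessed.
            row[field] = stats.get(field, "")
        rows.append(row)
    return rows

# Hypothetical export: server names are invented, stat values come from the slides.
collected = {
    "Mail02/Acme": {"Server.Version.Notes": "Release 8.5.3FP3"},
    "Hub01/Acme": {
        "Server.Version.Notes": "Release 8.5.3FP3",
        "Server.CPU.Type": "Intel Pentium",
        "Mem.PhysicalRAM": "527,433,728",
    },
}

for row in build_inventory(collected):
    print(row)
```

Blank cells in the result are useful in themselves: they point at servers whose stats are not being collected completely.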
Snapshot Statistics • Snapshot stats show what’s happening at the moment you ask for them • They are changing all the time • Disk.E.Free = 18,679,414,784 • Server.Users = 280 • Mem.Free = 433,614,848 • MAIL.Waiting = 250 • The best part about this is that you get lots of Domino-related stats you wouldn’t get by looking at the operating system’s performance tools
Cumulative Stats • Some stats are cumulative • They start counting from zero when you start the server • Server.Trans.Total = 31,915 • SMTP.MessagesProcessed = 966 • Stats, like averages and maximums, are calculated from the cumulative ones • Server.Users.Peak.Time = 02/21/2006 07:50:33 MST • Platform.Memory.PagesPerSec.Peak = 1,364.1
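Because cumulative stats only ever count upward, the useful number is usually the change between two collections. The following Python sketch shows one way to turn two samples into a per-minute rate. The function name and the sample values are illustrative assumptions, and treating a drop in the counter as a restart or reset is a simplification.

```python
def rate_per_minute(prev_value, curr_value, minutes_elapsed):
    """Average per-minute rate of a cumulative counter between two samples."""
    if minutes_elapsed <= 0:
        raise ValueError("minutes_elapsed must be positive")
    if curr_value < prev_value:
        # The counter went backwards, so assume the server restarted or the
        # stat was reset; the current value is then the count since that point.
        delta = curr_value
    else:
        delta = curr_value - prev_value
    return delta / minutes_elapsed

# Hypothetical Server.Trans.Total samples taken 60 minutes apart.
print(rate_per_minute(31915, 43915, 60))  # 200.0
```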
Resetting Statistics • Some of these cumulative stats can be reset using the following console command: • Set Statistics statisticname • You can’t use wildcards (*) with this argument! • Here’s an example of why you might want to reset a stat: • Set Stat Server.Trans.Total • Resets the Server.Trans.Total statistic to 0 • You might want to reset this stat if: • You are starting to benchmark a new application • You are debugging an agent and want to see if it is more efficient after changes to its design
Platform Stats, Too • Platform stats vary widely from OS to OS • Getting platform stats from within Notes has great value • Track Domino server performance on an OS level even if your servers run on a variety of operating systems • For example, it’s very common to have a mix of AIX and Wintel servers • In a few minutes, we’ll be discussing threshold tracking • You’ll be able to set notification thresholds universally from within Notes to track these platform stats
Getting to Platform Statistics • Domino releases 6, 7, and 8 track platform stats automatically • In earlier versions, they had to be explicitly enabled and many times were disabled due to problems with servers crashing • These problems are gone • To see all platform stats – enter this console command • Show stat platform
A Word About Platform Stats on Partitioned Servers • Domino collects platform stats that pertain to the whole system • Not to an individual partition • The only statistics that are specific to a partition are those that reflect tasks, such as process statistics • One partition might run 10 tasks, while another partition runs 15 tasks
Confirming Stats with Other Tools • Be careful when trying to confirm platform statistics using other performance monitoring tools • Because of the differences in sampling intervals, you cannot use native monitoring tools to confirm platform statistics • There will be discrepancies between platform statistics and those obtained … • Using Perfmon – for Windows 2000 • Or a system command, such as one of these UNIX commands: • iostat / vmstat / netstat
See Server Statistics • Quickest way to see all server stats is to enter console command: • Show stat • Any place you can get to a console, you can access stats that can tell you a lot about the current state of the server • A SHOW STAT command gives you every statistic the Domino server has • Several hundred of them! • That’s really too many to deal with at once
Can I See That in a Smaller Size? • Get a better view of the stats showing just what you’re looking for using the asterisk wildcard • You can ask directly for the top level of the hierarchy • Show stat server • That shows all of the stat hierarchy under “server”
You Might Want Only Part of the Data • To get a select list of just the stats under the top level requires the use of wildcards in your console commands • If you only want Server.Users hierarchy, use the global “*” • Show stat server.users.*
Pushing the Wildcards • If you want a closer look, like just grabbing particular sub-levels of stats, get clever with the wildcard • For example, use the following command to find out about mail waiting • Show stat mail.wait* • MAIL.Waiting = 1 • Mail.WaitingForDeliveryRetry = 1 • MAIL.WaitingForDIR = 0 • MAIL.WaitingForDNS = 0 • MAIL.WaitingRecipients = 1 • 5 statistics found
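The same wildcard filtering can be applied offline to console output you have saved to a file. This Python sketch assumes each statistic appears as a `Name = value` line, as in the output above, and matches names case-insensitively because the example shows `mail.wait*` returning `MAIL.Waiting`. The helper names are hypothetical and the code does not talk to a Domino server.

```python
from fnmatch import fnmatchcase

def parse_show_stat(text):
    """Parse 'Name = value' lines from saved console output into a dict."""
    stats = {}
    for line in text.splitlines():
        name, sep, value = line.partition(" = ")
        if sep:  # skip lines such as '5 statistics found'
            stats[name.strip()] = value.strip()
    return stats

def filter_stats(stats, pattern):
    """Return the stats whose names match a wildcard pattern, ignoring case."""
    pattern = pattern.lower()
    return {name: value for name, value in stats.items()
            if fnmatchcase(name.lower(), pattern)}

# Saved output, using values that appear in the slides.
sample = """\
  MAIL.Waiting = 1
  Mail.WaitingForDeliveryRetry = 1
  MAIL.WaitingForDIR = 0
  MAIL.WaitingForDNS = 0
  MAIL.WaitingRecipients = 1
  Server.Users = 280
"""

mail = filter_stats(parse_show_stat(sample), "mail.wait*")
print(len(mail))  # 5
```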
Take It to the Next Level • Now that we know where the statistics are, it’s time to kick it up a notch • Let’s set up a collection architecture • Some Notes shops do not collect server statistics at all! • How in the world can they: • Determine what is causing performance issues? • Plan for future growth? • Have a grip on whether their server platforms are configured correctly? • Do they just make the stuff up and go with it?
The Two Things Needed • There are two things that are needed for statistics collection to happen: • The Events4 database must have a Server Statistic Collection document • The Collect task must be running on the server that is designated to collect the statistics
Details, Details, Details • Events4, the Monitoring Configuration database, needs a Server Statistic Collection document for each server collecting stats • This database should replicate to every server in the domain • A server will know it is supposed to collect stats because of this document • But it won’t automatically load the Collect server task • We have to make sure that happens
Server Statistics Collection Docs • Use a Server Statistic Collection doc to indicate the server that will collect stats • And the servers you want the stats collected from
Set the Statistics Collection Interval • Use the collection report interval on the Options tab to set up how often statistics should be gathered • Generally, collecting once an hour is sufficient • If you are upgrading or changing the environment, it’s better to collect every 30 minutes • Or even every 15 minutes, if you are trying to fix problems
A Single Document Looks Like Many in the View • This single document, with a multi-value field containing all the servers, will look like it is multiple documents in the Events4 database • Make sure administrators know this, or they might delete everything by mistake • Guess how I know this?
Centralize Your Domain’s Statistic Collection • Ideally, use just a few key servers to do the collection • You might even be able to get away with just one! • Your network topology will have a profound effect on which servers you select • So will the load currently running on the servers • Avoid collecting stats over long, slow links • Be careful of WAN routes that are already packed with other network traffic
Configure Key Collect Points • If you have offices in London and Tokyo, then pick a collection server from each city • That server will collect stats from all servers in that region • Collect stats in a database created from the Monitoring Results template • The databases don’t have to be called Statrep • Voilà! Centralized data at your fingertips
Remember to Add the Collect Task • The Collect server task must be running on the servers you selected as collectors • Use LOAD COLLECT from the console to get it started • Add the Collect task to the ServerTasks= line in the selected servers’ Notes.ini to make it permanent • Remove Collect from ServerTasks= from all other servers! • Want the servers to start collecting stats immediately? • Use the following console command: • Tell Collector Collect • It will kick off a statistic collection of all the servers you specified
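To make the ServerTasks= change concrete, here is a small Python sketch that adds or removes Collect from that line in a copy of a Notes.ini file. The function name and the sample task list are assumptions for illustration. Run something like this only against a backup copy, and note that it normalizes the key to `ServerTasks`.

```python
def set_collect_task(ini_text, enabled):
    """Return ini_text with Collect added to or removed from ServerTasks=."""
    out = []
    for line in ini_text.splitlines():
        key, sep, value = line.partition("=")
        if sep and key.strip().lower() == "servertasks":
            tasks = [t.strip() for t in value.split(",") if t.strip()]
            # Drop any existing Collect entry so it never appears twice.
            tasks = [t for t in tasks if t.lower() != "collect"]
            if enabled:
                tasks.append("Collect")
            line = "ServerTasks=" + ",".join(tasks)
        out.append(line)
    return "\n".join(out)

# Hypothetical task list for a server chosen as a collector.
print(set_collect_task("ServerTasks=Update,Replica,Router", True))
# ServerTasks=Update,Replica,Router,Collect
```

Calling the same function with `enabled=False` covers the other half of the advice: stripping Collect from servers that should not be collectors.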
The Collect Task Should Not Run on Every Server • Stat collection can be set up so each server collects its own stats • And puts them into a local Statrep Monitoring Results database • This method has the following drawbacks: • You have to run the Collect task on every server • You must visit Statrep on each server to analyze statistics • This is a real pain in the neck • And it makes analysis harder • Statistics have the most value when collected into a central location where they can be easily analyzed
Let’s Start by Looking at Disk Stats • If I get a call about server performance, I check disk stats first • Bad disk utilization can seriously tank a server • One stat to track is Percent Utilization • A very busy disk can mean a very busy server • But it might mean something else is wrong • Perhaps a controller is beginning to fail or the drive cache is misconfigured • Disk stat names depend on the platform, but they have PctUtil in them • It could be Logical Disk or Physical Disk • Like Platform.LogicalDisk.1.PctUtil.Avg • This should rarely hit 60% on Wintel boxes • On AIX and iSeries, it depends on the disk subsystem’s configuration • They often can run 90%+ without issues
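A check like this is easy to automate once the stats are in hand. The Python sketch below flags average disk utilization at or above a threshold. It assumes the stats are already in a dictionary of strings, the sample values are invented, and the 60% default only mirrors the Wintel guideline above; pick a different value for AIX or iSeries.

```python
def to_number(value):
    """Convert a stat value such as '71,847,784,448' or '72.5' to a float."""
    return float(value.replace(",", ""))

def busy_disks(stats, threshold=60.0):
    """Return the average PctUtil disk stats that are at or above the threshold."""
    flagged = {}
    for name, value in stats.items():
        lower = name.lower()
        if "disk" in lower and "pctutil" in lower and lower.endswith(".avg"):
            pct = to_number(value)
            if pct >= threshold:
                flagged[name] = pct
    return flagged

# Hypothetical readings for one server.
sample = {
    "Platform.LogicalDisk.1.PctUtil.Avg": "72.5",
    "Platform.LogicalDisk.2.PctUtil.Avg": "12",
    "Platform.LogicalDisk.1.PctUtil.Peak": "98",
}
print(busy_disks(sample))  # {'Platform.LogicalDisk.1.PctUtil.Avg': 72.5}
```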
Average Disk Queue Length • This is a major statistic! • Platform.LogicalDisk.1.AvgQueueLen, .Avg and .Peak • Queue lengths of more than a couple of waiting requests mean your disks can’t really keep up with the action • You can hit high peaks occasionally without issues • But constant highs mean moving users or apps • Balance these disk stats against CPU/Memory stats • Because memory = virtual disk • And constant thrashing of disks might mean you need more RAM • Problem is, Statrep doesn’t have a view that shows these important statistics
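The difference between an occasional peak and a constant high can be expressed as the share of recent samples over a limit. This is a minimal Python sketch of that idea. The function name, the limit of 2.0, the 80% cut-off, and the sample readings are all placeholders, not recommended values.

```python
def sustained_high(samples, limit, min_fraction=0.8):
    """True when at least min_fraction of the samples are above the limit."""
    if not samples:
        return False
    over = sum(1 for s in samples if s > limit)
    return over / len(samples) >= min_fraction

# Hypothetical AvgQueueLen readings from five consecutive collections.
occasional_peak = [0.4, 0.3, 6.0, 0.5, 0.2]
constant_high = [3.1, 2.8, 4.0, 3.5, 1.9]

print(sustained_high(occasional_peak, limit=2.0))  # False
print(sustained_high(constant_high, limit=2.0))    # True
```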
There’s a Lot of Stuff That Isn’t There • Before we get any further, it’s important to point out something that is hidden • Statistical data in the Monitoring Results database • STATREP.NSF • Statrep’s views simply don’t show data that is as useful as it could be • The data is there, it’s just not in the views • However, it’s important to know that every document in the database contains every statistic you see when you issue a SHOW STAT command at the console • It’s just a matter of showing it in a view
Take Home This View • But now you have a version of Statrep with a view that does contain those important stats! • A specially-crafted version of the Statrep template with a view like the one below is available • You can download it from my blog • You’ll probably have to modify the columns based on the disk configurations of your own systems
Processor Statistics • Platform.Memory.RAM stats will disclose memory usage • Don’t just think you might need more memory: be certain by checking this out • On Wintel systems, this number should rarely hit 60% • But on iSeries and AIX, it can be much higher • On iSeries it can actually run quite nicely at 90%
CPU Stats Are There for Each Task • Platform.Process.ActiveDomino.TotalCpuUtil • Gives you the big picture of how Domino is using processors • There is a Platform.Process.$$$.PctCpuUtil stat for each task you run on your Domino servers, where $$$ is the task name • Platform.Process.Amgr.PctCpuUtil • Platform.Process.Router.PctCpuUtil • Platform.Process.Process.PctCpuUtil • … And so on
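With one stat per task, finding the biggest consumer is a sorting problem. The Python sketch below ranks tasks by their PctCpuUtil value. It assumes the stats are already in a dictionary of strings, and both the helper name and the readings are hypothetical.

```python
def top_cpu_tasks(stats, count=3):
    """Return (task, percent) pairs for the tasks using the most CPU."""
    prefix, suffix = "platform.process.", ".pctcpuutil"
    usage = []
    for name, value in stats.items():
        lower = name.lower()
        if lower.startswith(prefix) and lower.endswith(suffix):
            # The task name is whatever sits between the prefix and the suffix.
            task = name[len(prefix):len(name) - len(suffix)]
            usage.append((task, float(value.replace(",", ""))))
    usage.sort(key=lambda item: item[1], reverse=True)
    return usage[:count]

# Hypothetical readings for one server.
sample = {
    "Platform.Process.ActiveDomino.TotalCpuUtil": "55.0",
    "Platform.Process.Router.PctCpuUtil": "7.5",
    "Platform.Process.Amgr.PctCpuUtil": "41.2",
    "Platform.Process.Update.PctCpuUtil": "3.1",
}
print(top_cpu_tasks(sample, count=2))  # [('Amgr', 41.2), ('Router', 7.5)]
```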
Using These Stats • You might find that the Agent Manager is the biggest hog because of users’ personal agents! • You could move busy user agents to a different server • These stats don’t show in the Lotus version of Statrep • But they are in the Technotics85Statrep.NTF version • You can download it from my blog • www.andypedisich.com
Why Wouldn’t the Failover Replica Be Up to Date? • When primary server is down, users are directed to a replica on a failover server • But sometimes that replica is not up to date • Cluster replication keeps primary server in sync with failover • It’s an event-driven process – occurs automatically when a change is made to a database • Changes to a database are pushed to the replica on failover • Deletion stubs are not replicated • That’s why you also need a scheduled replication doc between servers in a cluster • It’s vital that these replicas are synchronized • But by default, clusters only have 1 cluster replicator task
Not Now … I’m Too Busy • Occasionally, there is too much data changing to be replicated efficiently by a single cluster replicator • If cluster replicators are too busy, replication is queued until more resources are available • Your databases get out of sync and go stale • Adding a cluster replicator will help fix this problem • Use this parameter in the Notes.ini • CLUSTER_REPLICATORS=# • But how do you tell if there’s a potential problem? • Adding too many cluster replicators will have a negative effect on server performance
What to Do About Stats Over the Limit • Acceptable values for Replica.Cluster.SecondsOnQueue • Queue is checked every 15 seconds • Under light load, should be less than 15 seconds • Under heavy load, if the number is larger than 30, another cluster replicator should be added • If the above statistic is low, and Replica.Cluster.WorkQueueDepth is constantly higher than 10 • Perhaps your network bandwidth is too low • Consider setting up a private LAN for cluster replication traffic
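The guidance above reduces to a small decision rule, sketched here in Python. The function name and return strings are hypothetical, and reading "low" as under 15 seconds is an assumption. The rule also looks at a single pair of readings, while the slide says "constantly", so in practice it should be applied to several collections in a row.

```python
def cluster_replication_advice(seconds_on_queue, work_queue_depth):
    """Suggest an action from the two cluster replication stats."""
    if seconds_on_queue > 30:
        # Changes are waiting too long: the replicators cannot keep up.
        return "add a cluster replicator"
    if seconds_on_queue < 15 and work_queue_depth > 10:
        # Short waits but a deep queue point at the network, not the replicator.
        return "check network bandwidth"
    return "ok"

# Hypothetical readings of Replica.Cluster.SecondsOnQueue and WorkQueueDepth.
print(cluster_replication_advice(45, 3))   # add a cluster replicator
print(cluster_replication_advice(5, 25))   # check network bandwidth
print(cluster_replication_advice(10, 2))   # ok
```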