MOM Essentials 5 – Advanced Configuration and Administration Gordon McKenna MOM – MVP Inframon Ltd.
Agenda • Management Pack tuning - Management Pack Architectural Overview - Management Pack Overview Demo
MOM Rule: Unit of Instruction/Policy
• Rule types
- Event Rules: Collection rules, Filtering rules, Missing event rules, Consolidation rules, Duplicate Alert Suppression
- Performance Rules: Measuring, Threshold
- Alert Rules
• Provider: NT event log, Perfmon data, WMI, SNMP, Log files, Syslog
• Criteria: Where source=DCOM and Event ID=1006
• Response: Alert, Script, SNMP trap, Pager, E-Mail, Task, Managed Code, File Transfer
• Knowledge
- Product Knowledge: Links to Vendor
- Company Knowledge: Links to Centralised Company knowledge
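As a rough illustration of the rule anatomy on this slide (provider, criteria, response, knowledge), the Python sketch below models a rule as plain data and checks an event against the slide's example criteria, source=DCOM and Event ID=1006. The Rule class, the evaluate function and the event dictionary keys are hypothetical names chosen for this example; they are not part of any MOM API.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class Rule:
    """Hypothetical model of a rule: provider + criteria + responses + knowledge."""
    name: str
    provider: str
    criteria: Callable[[Dict], bool]
    responses: List[str] = field(default_factory=list)
    knowledge: str = ""


def evaluate(rule: Rule, event: Dict) -> List[str]:
    """Return the responses to fire, or an empty list if the event does not match."""
    if event.get("provider") != rule.provider:
        return []
    return list(rule.responses) if rule.criteria(event) else []


# The criteria from the slide: source=DCOM and Event ID=1006.
dcom_rule = Rule(
    name="DCOM 1006 (illustrative)",
    provider="NT event log",
    criteria=lambda e: e.get("source") == "DCOM" and e.get("event_id") == 1006,
    responses=["Alert", "E-Mail"],
    knowledge="Links to Vendor",
)

matching = {"provider": "NT event log", "source": "DCOM", "event_id": 1006}
other = {"provider": "NT event log", "source": "DCOM", "event_id": 1007}
print(evaluate(dcom_rule, matching))
print(evaluate(dcom_rule, other))
```

The point of the model is only that a rule does nothing until its provider delivers data that passes the criteria, at which point every configured response fires.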
Management Packs • Management Pack imported via MOM Server • Discovery finds computers in need of a given Management Pack • MOM deploys appropriate Management Packs • No need to touch managed nodes to install Management Packs • Rules: Implement all MOM monitoring behavior • Watch for indicators of problems • Verify key elements of functionality • Management Packs provide a definition of server health
Basic rule of thumb when first deploying MPs • DO NOT deploy all your required Management Packs at the same time - Install one, tune it properly, then do the next • Be brutal: switch off rules that you do not need - Use Alert Overrides wherever possible (rather than just disabling) • Document your changes - XML, Excel, third-party tools • Put some form of change control in place to prevent random changes - Make sure that only the people who need MOM Admin or Author rights get them.
Other Dos and Don’ts • Do read each MP guide before you deploy • Don’t just disable a rule because it is giving you an error • Do use the community to search for solutions to issues • Don’t let “Alert Creep” set in – be proactive with your tuning • Do use reports to help you stay on top of alerts • Do have regular reviews of the environment • DON’T PANIC!
Common Alerts After MP Deployment: ADMP • Replication taking too long – need to modify the script for the worst-case scenario • Script failures (access denied) – usually permission related. Make sure Local System has enough rights; if unsure, create an AAA according to the MP guide • Script failures (failed to bind LDAP) – if AD is behind a firewall, the FQDN will be missing from the computer table • MOMLatencyMonitor not created – make sure your first AD agent is on the Infrastructure Master
Common Alerts After MP Deployment: Exchange • Mail flow scripts fail – permission related; check out “Top 3 issues affecting Exchange MP” in the product documentation on the MS MOM site • Disk Write Latency perf counter – some reconfiguration required (see the Exchange MP guide) • General perf counters – there will be a lot of baselining to do due to the nature of the application • OWA\OMA\AS – disable the rule group if this does not apply
Key Processes for alert tuning strategy • What to do with Alerts • Who “owns” the MOM alert • Who gets notified • Who takes action • How will it be recorded/closed • What to do with Reports • How to “publish” them • Who reviews them
Tips for Getting Application Owners Onboard • Sell each MP to the technology owner, i.e. SQL MP to the SQL Team • Demonstrate how MOM can help them • Scope them their own console • Make sure they explain fully how their environment works • Make them part of the deployment process • Make them the MP owner • Get them to own MP change documentation • Put a process in place so they can be notified when changes are made to MPs • Give them a sandpit environment.
Methods for Maintaining MPs • Create an Excel spreadsheet of all rules, documenting changes – can be time consuming • Use MOM resource kit tools to convert the .AKM file to .XML, then use an XML editor or MPDiff (res kit) – fiddly • Develop your own solution (web based\.NET) • Use a third-party tool such as Silect Software http://www.silect.com
Application Engineering Standards • Get App developers to think “Manageability” • “Design for Operations” • Write a good set of standards • Introduce new technologies if need be, like AVICode for .NET www.avicode.com • Provide value to them with performance and availability reporting • Make them understand how hard your job is ;-)
Ideas for Standards • Registry Keys – Info on App, override criteria, thresholds • Event Logs • Performance Counters • WMI – Process and Thread IDs, app data • Status Monitoring – Health Modelling • Scripting • .NET (AVICode) • Other Methods – MOM API, C++
MOM Operations: Guiding principles • Database: Manage alert, event and perf data volumes • Management Servers: Monitor the health of the management server queues • Agent administration: Watch the “Pending Actions” computer group closely and watch for agents that are not heartbeating (this principle is covered in the appendix)
Database Volume: View alert and event counts daily
select count(*) from eventall -- all events in the db
select count(*) from alertview -- all alerts
-- Date/Time the System Center DTS job last ran
select timedtslastran from reportingsettings
-- see appendix for perf data volume query
Tip: The DTS job for the System Center database logs a date/time when it completes successfully. The grooming job (MOMXGroomByDays) will not groom out any data newer than this date
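The tip on this slide implies that two dates limit what grooming can remove: the retention setting and the last successful DTS run. The Python sketch below shows that interaction under a stated assumption, namely that the effective cutoff is simply the earlier of the two dates. The function name and parameters are hypothetical; the real logic lives in the MOMXGroomByDays job and is not reproduced here.

```python
from datetime import datetime, timedelta


def effective_groom_cutoff(now: datetime, groom_days: int, dts_last_ran: datetime) -> datetime:
    """Data older than the returned date is eligible for grooming.

    Assumption: grooming never removes data newer than the last successful
    DTS run, so the cutoff is the earlier of the retention date and that run.
    """
    retention_cutoff = now - timedelta(days=groom_days)
    return min(retention_cutoff, dts_last_ran)


now = datetime(2005, 6, 10)

# DTS ran yesterday: the retention setting decides.
healthy = effective_groom_cutoff(now, 4, datetime(2005, 6, 9))

# DTS has not completed for a week: grooming stalls at the last DTS run,
# so the database keeps growing until the DTS job succeeds again.
stalled = effective_groom_cutoff(now, 4, datetime(2005, 6, 3))

print(healthy, stalled)
```

This is why a failing DTS job shows up as database growth: the retention setting stops having any effect until the job completes again.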
Database Grooming: Alert resolution criteria Tip: Perfmon, event and alert data is held in the OnePoint database for the number of days set under “Groom data older than the following number of days:”. The clock does not start ticking for an alert until the alert is resolved
Database Grooming: Auto alert resolution
Relevant code from stored procedure “AlertUpdateNewToResolved”:
SET ResolutionState = 255,
    TimeResolved = @LastModified,
    LastModified = @LastModified,
    LastModifiedBy = 'AutoResolved',
    ResolvedBy = 'AutoResolved'
WHERE ResolutionState = 0
  AND TimeOfLastEvent < @dGroomDate
  AND (ProblemState <> 3 OR ProblemState IS NULL)
Tip: Only alerts that are in a “New” resolution state AND do NOT have an “active” problem state AND whose last event is older than the number of days you specify (Global Settings/Database Grooming) get auto resolved
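To make the three conditions easier to reason about, here is a small Python restatement of the WHERE clause quoted on this slide. It is only a sketch for checking individual alerts by hand: the function and constant names are invented for the example, and reading ProblemState 3 as “active” is taken from the slide's tip rather than from any schema documentation.

```python
from datetime import datetime
from typing import Optional

NEW = 0              # ResolutionState tested by the WHERE clause
RESOLVED = 255       # ResolutionState written by the SET clause
ACTIVE_PROBLEM = 3   # ProblemState excluded by the WHERE clause


def should_auto_resolve(resolution_state: int,
                        time_of_last_event: datetime,
                        problem_state: Optional[int],
                        groom_date: datetime) -> bool:
    """Mirror of the stored procedure's WHERE clause for a single alert."""
    return (
        resolution_state == NEW
        and time_of_last_event < groom_date
        and (problem_state is None or problem_state != ACTIVE_PROBLEM)
    )


groom_date = datetime(2005, 6, 6)
old_event = datetime(2005, 6, 1)
recent_event = datetime(2005, 6, 8)

print(should_auto_resolve(NEW, old_event, None, groom_date))
print(should_auto_resolve(NEW, old_event, ACTIVE_PROBLEM, groom_date))
print(should_auto_resolve(NEW, recent_event, None, groom_date))
```

Alerts that someone has already moved out of the “New” state never qualify, which is one reason acknowledged but unresolved alerts can linger in the database.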
Management Servers: Incoming and outgoing queues
• Diagram: MOM agents, Management Server (incoming and outgoing queues), OnePoint database
• Outgoing queue events: Blocked = 22061, RTN = 22062
• Incoming queue events: Blocked = 21268, RTN = 21269
• No disruption of service is caused when the outgoing queue fills up
• If the incoming queue is full, agents begin caching data locally until they can find a management server to write to
Management Servers: Incoming and outgoing queues • During extreme alert, event, or perf data storms, the incoming queue may fill up. The management servers communicate this to their agents. Under this condition, agents will try “hopping” to their “failover” management server. They log event 21249 when they do so and 21250 when they come back • During this time period, failed heartbeat alerts (21284, 21209) are inaccurate and you get a lot of “agent stopped sending required config requests” events (22085) • This condition also defeats your agent load-balancing strategies for management servers • A soon-to-be-public QFE is available to throttle how soon agents try their “failover” management server
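Because heartbeat alerts are unreliable while agents are hopping, it can help to know exactly when each agent was away from its primary management server. The Python sketch below pairs the two event numbers named on this slide (21249 when an agent fails over, 21250 when it comes back) into time windows. The input format, a time-ordered list of (timestamp, computer, event number) tuples, is an assumption made for the example; how the events are exported from MOM is left open.

```python
from datetime import datetime

FAILOVER_EVENT = 21249   # agent hopped to its failover management server
FAILBACK_EVENT = 21250   # agent came back


def failover_windows(events):
    """Pair failover and failback events per computer.

    events: iterable of (timestamp, computer, event_number), sorted by time.
    Returns (closed_windows, still_failed_over).
    """
    open_since = {}
    windows = []
    for timestamp, computer, number in events:
        if number == FAILOVER_EVENT:
            # Keep the first failover time if the event repeats.
            open_since.setdefault(computer, timestamp)
        elif number == FAILBACK_EVENT and computer in open_since:
            windows.append((computer, open_since.pop(computer), timestamp))
    return windows, open_since


sample = [
    (datetime(2005, 1, 1, 10, 0), "SRV1", 21249),
    (datetime(2005, 1, 1, 10, 5), "SRV2", 21249),
    (datetime(2005, 1, 1, 10, 30), "SRV1", 21250),
]
closed, still_out = failover_windows(sample)
print(closed)
print(sorted(still_out))
```

Failed heartbeat alerts raised inside one of these windows are candidates for the inaccurate alerts the slide warns about.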
Server Queue Full Pointers
Perf counters to watch under the “MOM Server” Perfmon object:
• Db Perf Insert Simple Count
• Db Alert Insert Simple Count
• Db Event Insert Simple Count
• Queue Space Percent Used
Management Servers: Incoming and outgoing queues • The incoming rate of perf data from measuring or collection rules is fairly constant. Rules that run scripts that submit perf data can cause large spikes. Scripts to watch out for if you have tens of thousands of mailboxes on your Exchange servers: • Exchange 2003 - Collect Mailbox Statistics • Exchange 2003 - Collect number of mailboxes per server • Exchange 2003 - Collect Message Tracking Log Statistics • These scripts have parameters which you can use to tune how many mailboxes they report on. Some customers use them to report on only one mailbox, as they still deliver very useful global counters for the information stores
Management Servers: Incoming and outgoing queues • Tip: If the management server incoming queue is filling up, there are three options • Increase the size of the temporary storage on the management server: Global Settings/Management Servers/Temporary Storage (easiest) • Reduce the volume of perfmon, event, and alert data by tuning rules (best) • Increase throughput of SQL Server by increasing disk I/O, memory, etc. (usually least practical)
Appendix: Agent Admin Pending actions computer group • The only legitimate reasons for computers to be “pending action” are: • Computer rules have discovered a new computer, or more computers have been added to manualmc.txt • Computer rules have changed to be less inclusive, or computers have been removed from manualmc.txt Tip: When agents that match computer discovery rules or manualmc.txt are pending uninstallation, watch out for domain name changes and failed computer discoveries
Appendix: Agent Admin Pending actions - domain changes • If the agent computer changes domains: • You will get failed heartbeat and failed discovery alerts on the old computer account • The agent will be in a new domain still trying to contact its management server. It will log this MOM Alert Name: The MOM server received invalid or corrupt network data which may indicate a security problem. Description follows: The MOM Server rejected configuration or data package from computer domain\computername. Package failed to pass server security verification. Event Number=21289 Source=Microsoft Operations Manager
Agent Admin Pending actions - failed discoveries • Give MOM agents a good healthy uninstall delay: Global Settings/Management Servers/Properties/Automatic Management/Uninstall Delay. This ensures that discovery must fail more than once before a server is marked for uninstall • Disjoint AD namespace: Is the FQDN of your servers not the same as their primary DNS suffix? MOM discovery may have a few minor issues in this scenario. Mutual authentication with agents may not work either. We have almost 1800 agents in 12 domains. MOM has trouble discovering about 16 servers. Most of those are in two domains, but many other servers in these two domains are discovered successfully. We don’t know what causes this, but the problem can be remedied…
Agent Admin: Pending actions – failed discoveries • The manualmc.txt file can contain computer names in several different formats • Fully Qualified Domain Name (FQDN) • NetBIOS name • Domain\ComputerName • Usually either the NetBIOS name or Domain\ComputerName works, but I have found that you get fewer discovery errors using the NetBIOS name. Where discovery fails using the NetBIOS name, try the Domain\ComputerName format. Sometimes it works where NetBIOS fails IF you add the FQDN to the hosts file (yes, the hosts file)! • Computer discovery rules work somewhat better but are no guarantee of successful discovery
Agent Admin Agent management hints • Install all the nodes of a cluster on the same management server or you will not get accurate heartbeat alerts • Occasionally turn on the rule Microsoft Operations Manager\Operations Manager 2005\Agents on all MOM roles\Agent communication failure troubleshooting events. These events can be viewed from a public view called “Agent communication failure troubleshooting events” under “Microsoft Operations Manager/Operations Manager 2005/Agent Configuration and Connectivity”. You might be surprised at some of the warnings and errors your agents are logging • If you turn agent proxying on and you submit an alert on behalf of a computer that does not have a MOM agent in the same management group as the box that submitted the alert, then the computer you submitted the alert for will show up in the “Unmanaged Computers” group
Agent Admin Agent management hints • If a push installation (automated from the management server) fails on an agent, try installing the agent manually using momagent.msi. When the agent starts, check the NT event log for diagnostic messages. Some of them are pretty good • Script a process to check your server database each day and populate the manualmc.txt file. Manualmc.txt allows you to choose which servers to manage rather than getting all the computers discovered by computer rules and adding exclude rules for those you don’t want. Also, removing a computer from manualmc.txt can be automated whereas removing computer rules cannot
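The daily manualmc.txt job suggested above could look something like the following Python sketch. It assumes the file holds one computer name per line and that names compare case-insensitively; both are assumptions to verify against your own file. The function names are hypothetical, and fetching the desired list from your server database is deliberately left out.

```python
def normalise(names):
    """Strip blanks and upper-case names so comparisons are case-insensitive."""
    return {name.strip().upper() for name in names if name.strip()}


def plan_manualmc_update(current_names, desired_names):
    """Return (to_add, to_remove) so the change can be reviewed before writing."""
    current = normalise(current_names)
    desired = normalise(desired_names)
    return sorted(desired - current), sorted(current - desired)


def render_manualmc(names):
    """Render the file body: one name per line, sorted, no duplicates."""
    return "".join(name + "\n" for name in sorted(normalise(names)))


current_file = ["srv1", "SRV2"]
from_server_db = ["SRV2", "srv3"]

to_add, to_remove = plan_manualmc_update(current_file, from_server_db)
print("will become pending install:", to_add)
print("will become pending uninstall:", to_remove)
print(render_manualmc(from_server_db), end="")
```

Reviewing the add and remove lists before writing the file gives you a cheap safeguard: a bad export from the server database would otherwise put a large number of agents into pending uninstallation in one go.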
Agent Admin: Source and Logging computers
MOM agents can generate alerts on behalf of other computers if agent proxying is turned on for that agent (Agent-managed computers/computername/properties/Security/uncheck prevent agent proxying). Management servers do this all the time for failed heartbeats, discoveries, etc. This is what the “source” and “logging” computers in the Advanced criteria tab are all about
SELECT C.Name as 'Source', D.name as 'Logging', es.number, dateadd(hh,-6,e.timegenerated), e.message
FROM EventAll E
INNER JOIN Computer C ON (E.idGeneratedBy = C.idComputer)
INNER JOIN Computer D ON (E.idloggedOn = D.idComputer)
INNER JOIN EventSource ES ON (E.idEventSource = ES.idEventSource)
where e.idloggedon <> e.idgeneratedby
order by 2 desc
Agent Admin: Useful views in the MOM MP • The MOM management pack has some extremely useful alert and event views under public views/Microsoft Operations Manager/Operations Manager 2005. Here are my favorites: • Everything under “Computer Discovery.” Great place to look for what discoveries are failing and why. To this folder I would add an event view for event number 21185 which shows a summary of failed discoveries • Agent Configuration and Connectivity • Agent communication failure troubleshooting events • Agent communication failures • Agent Deployment • Agent Installation Failures – All • Agent Performance • MOM Host - %Processor Time • MOM Service - %Processor Time
Account Admin: Changing account passwords
Resetting an Action Account (domain user)
• SetActionAccount.exe <management group> [options]
Options:
-query //returns the current Action Account settings for the specified management group.
-set <domain> <username> //sets the Action Account for the specified management group.
Note - the tool will prompt you for the new password.
Note - the management group must be specified, even if the agent is not multihomed.
Account Admin: Changing account passwords • Changing the DAS Account password • Change the account settings on the Identity tab of the Microsoft Operations Manager Active Operations Data Access Service COM+ application on the Management Server. • If you are using a different account, you must also add that account as a SQL Server Security Login with “Permit” server access. • If you are using a different account, give that account db_owner access to the OnePoint database on the MOM database server. MOM setup grants the DAS account this access by default. • If you also have the MOM-to-MOM Product Connector installed, add the account to the MOM Service security group on the Management Server. • Note: You must restart the COM+ application to commit the changes.
Configuring the Web Console
• Configuring the Web Console As Read-Only
You can configure the Web console to be read-only, so that operations data can be seen, but tasks cannot be run and changes cannot be made. This setting does not affect the Operator console read/write access.
To enable or disable Read-Only access for the Web Console
• On the server hosting the Microsoft Operations Manager 2005 Web console application, open the %INSTALLDRIVE%\Program Files\Microsoft Operations Manager 2005\WebConsole\web.config file in a text editor.
• In the <appSettings> node, change the node “<!--add key="Readonly" value="true"/-->” to “<add key="Readonly" value="true"/>”.
• Restart the Microsoft Operations Manager 2005 Web console application in the Internet Information Services (IIS) snap-in.
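If you need to make this change on several Web console servers, the edit can be scripted. The Python sketch below works on the text of web.config and swaps between the commented and uncommented forms of the node exactly as quoted on this slide. Treating the disable step as the reverse swap is an assumption, the function name is hypothetical, and the sample fragment is illustrative rather than a copy of a real web.config. Restarting the Web console application in IIS is still a separate manual step.

```python
COMMENTED = '<!--add key="Readonly" value="true"/-->'
ENABLED = '<add key="Readonly" value="true"/>'


def set_web_console_read_only(config_text: str, read_only: bool) -> str:
    """Return config_text with the Readonly node uncommented or commented out.

    Plain string replacement is used so the rest of the file is left
    byte-for-byte unchanged. Calling it twice with the same flag is harmless.
    """
    if read_only:
        return config_text.replace(COMMENTED, ENABLED)
    return config_text.replace(ENABLED, COMMENTED)


# Illustrative fragment only; a real web.config contains more settings.
sample = '<appSettings>\n  <!--add key="Readonly" value="true"/-->\n</appSettings>'

read_only_config = set_web_console_read_only(sample, True)
restored_config = set_web_console_read_only(read_only_config, False)
print(read_only_config)
```

Read the file, pass its text through the function, and write the result back after taking a backup copy of the original.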
For Further Information….. • Read the MOM 2005 Operations guide • Use the microsoft.com\mom website • Attend the Windows Management Webcasts on MOM • ATTEND ALL OF THE TECHNET SESSIONS ON MOM
Troubleshooting MOM 2005 MOM 2005 Logs
Written by Developers for Developers • Original concept by Mission Critical Software • To assist developers with debugging their own code! • Needs familiarity with the code itself • Augmented by NetIQ • Further augmented by Microsoft • With a move away from mc8 to log files for easier troubleshooting by PSS
Logging Levels • HKLM\Software\Mission Critical Software\Tracelevel • Default = 0x1 • Service restart not required for Trace Level changes • Verbosity levels • 0xFFFFFFFF = Off • > 0 = Errors (Err:) • > 3 = Errors and Warnings (Wrn:) • > 6 = Errors, Warnings and Info (Inf:) • > 9 = Errors, Warnings, Info and Debug (Dbg:)
Log Locations • Service logs (*.mc8, *.log) • %Systemroot%\Temp\Microsoft Operations Manager • DAS log (DllHost.mc8) • Documents & Settings\<Das account>\Local Settings\Temp\Microsoft Operations Manager • MOM MMC Snap-in log • Documents & Settings\All Users\Local Settings\Temp\Microsoft Operations Manager
Service Logs • MOMService(Init).mc8 • Logged to for the first TraceInit seconds • Circular line logging commences after TraceInit seconds (default 60) • MOMService(A-B).mc8 • Circular line log files • Logs roll over after TraceCircularLines • TraceCircularLines default 50,000
Reading service logs • Notepad • Lacks translation of date/time • Useful for quick examination & search functions • MOMLogViewer • MOM ResKit utility • Displays pertinent information such as Date/Time • Realtime update • Doesn’t always read ALL lines! • Trace32.exe • From SMS support tools • Useful for real-time logging and highlighting • Lacks column translation
Which Trace Level? Which Trace Level was enabled for log file analysis • MOMLogViewer - Add Column Trace Level • Notepad / Trace32 - Fourth distinct column, or Err:, Wrn:, Inf:, Dbg: in the text column
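When working from a plain-text copy of a log, a quick tally of the Err:, Wrn:, Inf: and Dbg: tags shows which trace level was in effect: if only Err: lines appear, the log was probably captured at the default level. The Python sketch below counts the tags using a regular expression. It assumes each log line has already been read into a string and makes no attempt to parse the other columns, whose layout is not described here. The function name is hypothetical.

```python
import re
from collections import Counter

# Matches the first severity tag in a line, e.g. "Err:" or "Wrn:".
TAG_PATTERN = re.compile(r"\b(Err|Wrn|Inf|Dbg):")


def count_severities(lines):
    """Count log lines per severity tag; lines without a tag are ignored."""
    counts = Counter()
    for line in lines:
        match = TAG_PATTERN.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts


sample_lines = [
    "Err:Directory Service NT Event log not registered on this machine.",
    "Wrn:Connection failed with status 3",
    "Wrn:High water wait timed out.",
    "a line with no severity tag",
]
totals = count_severities(sample_lines)
print(dict(totals))
```

A log that contains Dbg: lines was captured at the most verbose setting, so remember to turn the trace level back down once the capture is complete.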
Why Trace? • On PSS / PG Advice • PSS or the Product Group ask for more data • Used by CPR or developers to isolate code causing a potential problem • Do not enable Trace Level >6 unless advised (it is resource intensive) • Obtaining more information • Default Trace Level 1 = Errors • Mc8 logs will contain information on error (although not always significant) • Review logs after crash or significant failure.
Err: (0-2) Tracing (Errors) • Expected errors • Err:Directory Service NT Event log not registered on this machine. Not processing Directory Service log. • i.e. This is not a domain controller • Unexpected errors • Err:Logged event -1073715815(Error) args = "momsrv.w2k3lab.com" "A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond." "1270" • i.e. Communication failure
Wrn: (3-5) Tracing (Warning) • Often produced prior to or during code failure • Agent/Server queue nearing full • e.g. Wrn:High water wait timed out. Waiting indefinitely for 957 bytes (exact space) • Wrn:Connection failed with status 3 (i.e. TCP connection failed unexpectedly) • Sometimes expected • e.g. Wrn:Cannot start eventlog reader for target:, log name:Directory Service