Avid System Monitor Ed Harper November 2010
Avid System Monitor delivers an enterprise-wide monitoring solution for Avid systems and infrastructure switches. Avid System Monitoring overview
What it is • A tool to increase system availability by identifying issues in real time • A tool to help identify potential problems in a system as they occur • A single tool for monitoring all necessary components of the “system”, including Avid gear, network infrastructure and 3rd party devices • A tool that collects performance data over time so that it can be graphed (and trends identified) • A tool that will continually evolve to identify known problems within a system (after knowledge of those problems has been learned during Code Blues, etc.) • A window into the specific state of the Avid and selected infrastructure system components at a given point in time. It also provides enough flexibility for customers to refine and fine-tune the tool’s outputs once the basic functions are mastered.
Overview • Avid Monitoring Gateway service installed on Avid Service Framework (ASF) enabled devices to provide visibility to Avid System Monitor via HTTP • Avid System Monitor delivers enterprise solution monitoring for Avid systems and infrastructure • Proactive system health and status monitoring • Statistics gathering, graphing and thresholds • Event logging, intelligent alarm processing and notification • Dashboard views showing outages and availability • Simple drill-down to isolate issues • Standards based: SNMP, HTTP & IP port status
Monitoring components • Monitored Node • Agents: Interplay Engine, Stream Server, Capture, Media Indexer, Interplay Lookup Service (LUS), ISIS 7000 System Director, ISIS 5000 • 3rd Party: Cisco switches, Foundry switches, Force10 • Agentless: AirSpeed, AirSpeed Multi Stream, Capture Manager, DNS, DHCP services • Time Sync: Avid Service Framework provides time sync • Monitoring Server • Recommended platform SR2500 • GUI, SNMP & HTTP collection • SQL Database • Java (JDK) Environment • Real Time Audit
Monitoring Environment • Monitored Avid Services & Devices • Detailed monitoring including status, statistics etc. • Avid Service Framework (ASF) • Media Indexer (MI) • ASF Lookup Service • Interplay Engine • Stream Server • Interplay Capture • ISIS 5000 & 7000: System Director • Real-time inventory • Device up/down status without detailed monitoring • Workflow Engine, iNews FTS, Workstation Service, Time Sync service, Multicast repeater, LowRes Encoder • 3rd Party Elements • Windows services: DNS, DHCP etc. • Network Switches: Cisco, Foundry, Force10
Dashboard • Single screen view with Intelligent grouping of devices & domains • High level status • Alarms • Notifications • Node Status • Resource Graphs • Click on any device group to automatically filter information for selected devices
Events & Alarms • Extensive Event Logging • Severity, source etc. • Acknowledgement • Search • Fine-grained event details • Correlating up/restore • Alarms • Flexible rules allow event aggregation in the alarm view, counting multiple occurrences of the same event • Severity • Last time of event • Count of occurrences • Link to event details • Option to auto-clean events • Operator instructions specific to alarm & device type
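The aggregation rule on this slide, where repeated occurrences of the same event collapse into one alarm that carries a count and a last-seen time, can be sketched roughly as follows. This is only an illustration, not how OpenNMS implements it: the `Event` and `Alarm` classes, their field names, and the choice of (node, event name) as the grouping key are all assumptions made for the example.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass
class Event:
    node: str        # hypothetical: name of the reporting node
    name: str        # hypothetical: event identifier
    severity: str
    time: datetime


@dataclass
class Alarm:
    node: str
    name: str
    severity: str
    first_time: datetime
    last_time: datetime
    count: int = 1


def aggregate(events):
    """Collapse events that share (node, name) into one alarm each."""
    alarms = {}
    for ev in sorted(events, key=lambda e: e.time):
        key = (ev.node, ev.name)   # assumed grouping key
        alarm = alarms.get(key)
        if alarm is None:
            alarms[key] = Alarm(ev.node, ev.name, ev.severity, ev.time, ev.time)
        else:
            alarm.count += 1
            alarm.last_time = ev.time
            alarm.severity = ev.severity   # keep severity of latest occurrence
    return list(alarms.values())
```

Each alarm then exposes the fields the slide lists: severity, last time of event and occurrence count.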
Notifications • Flexible notification to email • Individuals or groups • Automatic Escalation • Escalation to a higher-level group if a notification is not acknowledged within a certain time • Example: a Minor event is sent to the Ops team; if it is unacknowledged for 20 minutes, its priority is raised to Major and a notification is issued to the Management team • Notification logging, with timestamps including response time
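The escalation example on this slide can be expressed as a small rule. The sketch below is illustrative only: the function name, the argument names and the returned tuple are assumptions, and the real product configures escalation through its own notification settings rather than code like this.

```python
from datetime import datetime, timedelta

ESCALATE_AFTER = timedelta(minutes=20)   # interval taken from the slide's example


def next_notification(severity, sent_at, acknowledged, now):
    """Return (new_severity, team) if escalation is due, otherwise None."""
    if acknowledged:
        return None
    if severity == "Minor" and now - sent_at >= ESCALATE_AFTER:
        # Unacknowledged Minor event: raise to Major and notify Management.
        return ("Major", "Management")
    return None
```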
Statistics & Charts • Historical statistics gathering, trending, charts • Thresholds set to trigger events and notifications on ‘interesting’ conditions • Specifically tuned to Avid components, based on real world experience
Threshold Event Notification • Example chart: Media Indexer Media Files, with admin-configurable trigger levels • Flexible threshold engine • Configurable on any counter in the system • Extensive pre-programmed thresholds provided in the Avid monitoring package • Simple process to add customer-specific thresholds
Threshold Configuration • Custom configuration of Threshold Event • Any counter value collected by OpenNMS • Type: High, Low, Relative Change, Absolute Change • Datasource: Entity to collect counter data (graph properties) • Datasource Type: Node or interface • Datasource Label: String displayed in event • Value: Threshold value • Re-arm: Reset / cleared value • Trigger: Number of times the threshold must be broken to create an event
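How Value, Re-arm and Trigger interact for a High threshold can be sketched as a small state machine. This is a simplified model under stated assumptions, not the OpenNMS source: it assumes the threshold is "broken" when a sample is at or above Value, that Trigger counts consecutive broken samples, and that the threshold re-arms once a sample falls to or below Re-arm. The class and return values are hypothetical.

```python
class HighThreshold:
    """Simplified High threshold with a trigger count and a re-arm value."""

    def __init__(self, value, rearm, trigger):
        self.value = value       # threshold value
        self.rearm = rearm       # reset / cleared value
        self.trigger = trigger   # consecutive breaks needed for an event
        self.exceeded = 0        # consecutive samples at or above value
        self.armed = True

    def evaluate(self, sample):
        """Return "exceeded", "rearmed" or None for one collected sample."""
        if self.armed:
            if sample >= self.value:
                self.exceeded += 1
                if self.exceeded >= self.trigger:
                    self.armed = False
                    self.exceeded = 0
                    return "exceeded"
            else:
                self.exceeded = 0   # streak broken, start counting again
        elif sample <= self.rearm:
            self.armed = True
            return "rearmed"
        return None
```

With Value 90, Re-arm 75 and Trigger 3, a single spike above 90 produces no event; three consecutive samples at or above 90 produce one event, and no further event fires until a sample drops to 75 or below.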
Node View • Single screen dashboard per node • Current Status • Availability; system and individual services • Notifications, Recent Events, Recent Outages
Outages & Availability • Current Outages • Node or Service down • Grouped by Device / Service Type • Click to drill down • Calculated 30-Day Availability • Color Coded
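A 30-day availability figure of the kind shown on this slide can be derived from recorded outages. The following is a minimal sketch under assumptions: outages are given as (start, end) pairs that do not overlap, an ongoing outage has `end=None`, and availability is simply the fraction of the window not covered by outages. The function name and data layout are hypothetical, and the product's own calculation may differ.

```python
from datetime import datetime, timedelta


def availability(outages, now, window=timedelta(days=30)):
    """Percentage of the window during which the node or service was up.

    outages: list of (start, end) datetimes; end=None means still down.
    Assumes the outages do not overlap one another.
    """
    window_start = now - window
    down = timedelta()
    for start, end in outages:
        end = now if end is None else min(end, now)
        start = max(start, window_start)   # clip to the window
        if end > start:
            down += end - start
    return 100.0 * (1 - down / window)
```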
Surveillance View: Flexible Grouping • Grouping by • Service • Category • Simple customization • Current Outages by; • Device Type • Workgroup or location
Node Discovery • Configure OpenNMS to discover devices and services on a specific IP address or range • Automated capability query of generic IP, SNMP and Avid-specific services & device capabilities • Add device names to nodes for readability if desired • IP address and DNS names displayed by default • Automated capabilities scan every 24 hours
Network Switch Monitoring • SNMP • Link Up • Link Down • Network • Spanning Tree Topology Change • Bandwidth Utilization • Thermal • Max temp exceeded • System • Memory utilization • Processor utilization • Foundry • Startup config change • Running config change • Telnet login / logout • Cisco • Configuration change • SNMP monitoring and statistics gathering for Cisco, Foundry & Force10 infrastructure Zone 2 switches
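Bandwidth utilization is typically derived from two successive readings of an interface octet counter (for example the standard IF-MIB `ifInOctets`) and the interface speed (`ifSpeed`). The slides do not say how the tool computes it, so the sketch below is a generic illustration for one traffic direction; the function and parameter names are assumptions.

```python
def utilization_percent(octets_prev, octets_now, seconds, if_speed_bps,
                        counter_bits=32):
    """Link utilization (%) from two readings of an SNMP octet counter."""
    if seconds <= 0 or if_speed_bps <= 0:
        raise ValueError("seconds and if_speed_bps must be positive")
    # Modulo arithmetic absorbs a single counter wrap between the two polls.
    delta_octets = (octets_now - octets_prev) % (2 ** counter_bits)
    return 100.0 * delta_octets * 8 / (seconds * if_speed_bps)
```

For example, 375,000,000 octets received over a 300-second polling interval on a 1 Gbit/s port works out to 1% utilization. On fast links a 32-bit counter can wrap more than once per interval, which this sketch cannot detect; 64-bit counters avoid that.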
Maps • OpenNMS provides a mapping tool with device status • Multiple maps allow separate views for LAN, editors etc. • Link discovery finds node connectivity • Not all links are shown correctly: ISIS switches are not manageable, so devices appear connected to the adjacent switch
Proving its Value (a real field example) • Phased Roll-out • Monitoring SNMP switches (only) • Customer reported AirSpeed “Slow Down” • Avid CS / Systems Engineers queried OpenNMS remotely • Pulled switch bandwidth utilization • Switches operating correctly • Within a few minutes the troubleshooting team moved on to investigate specific devices • Without OpenNMS, proving switch operation required access and a labor-intensive process of monitoring scripts and driving traffic loads • Time consuming: ~1 day to prove switches • Faster resolution • Greater customer satisfaction
Example: Memory Utilization on Interplay Media Indexer • Charts show steady consumption of server RAM during load test • Performance impacted as memory maxed out • Thresholds provide notification when x% exceeded
Pricing, Availability etc • Delivery • Value-add offered to customers with Avid Uptime support • Software download • Phased roll-out at selected customer Production networks • Typically switch monitoring • Pricing • Avid System Monitor available to Avid Uptime support contract customers • PSG installation • PSG engagement required
Summary • Real-Time monitoring of devices, services, networks & infrastructure • Avid Customer Success • Customer IT / Admin • Statistics, thresholds, events and notifications • Broad Enterprise system support • Increasing breadth and depth • Pro-active warnings and notification of potential problems • Improved time to resolution
Avid Monitoring Solution [Diagram] • OpenNMS: GUI, SNMP data collection, trap receiver, service / IP monitoring • Monitored elements: ISIS client, Editor, AirSpeed, AirSpeedMS, LAN switches, Lookup Server, Media Indexer, Interplay Engine, Stream Server, Archive, System Director, ISIS Engine, ISIS ISB, ISIS switch • Paths shown: ICMP (ping), SNMP, HTTP/TCP, Avid TCP port monitoring, DNS, time sync, ASF Monitoring Gateway, ASF Health Monitor • Legend: ICMP only; SNMP full monitoring (events, statistics)
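The "Avid TCP Port monitoring" path in this diagram amounts to checking whether a service accepts connections on its port. A minimal sketch of such a probe is shown here; it is not the mechanism OpenNMS itself uses, and the function name and timeout are assumptions. It only shows that something is listening on the port, not which process owns it.

```python
import socket


def port_is_open(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port can be established."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Refused, unreachable, timed out or unresolvable host.
        return False
```

A call such as `port_is_open("lus-host", 4160)` (with a hypothetical host name) would probe the Lookup Service port mentioned in the failure-mode list.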
Failure Modes Monitored
• Avid System Monitor is tuned to identify specific failure modes
  • As found in field experience / Code Blue
• Media Indexer
  • MI in the HAG with a weight of "0": indicates an "election issue", which can cause major system slowdown.
  • Number of quarantined files growing: indicates a faulty ingest device creating bad files.
  • Different file count between the HAG MIs: indicates an issue with ISIS notifications. Some files will appear offline to some clients.
  • Different time on each of the machines in the WG: can be the cause of lost ISIS notifications (see above).
  • MI heap usage running dangerously high: indicates that the WG file count or client count is causing too much stress on that MI. Eventually, the MI will thrash.
  • Number of files added/updated on the last full resync is greater than 0. This value is displayed in the Health Monitor, under each storage pane of the MI.
• Interplay Engine
  • Time to perform login should be below 15 seconds; a higher value indicates engine slowness.
  • Number of journal files should be below 50; a higher value indicates journal integration is stuck or dead.
  • Number of deletes should be below 100 per 5-minute polling interval during normal production time; a higher value indicates deletion during production time.
  • Ratio of loaded objects to total objects should be above 30%; a lower value indicates engine cache warm-up causing slowness.
  • Backup running flag should be off during production time.
• Avid Service Framework Lookup Service (LUS)
  • The following can be checked today via the SNMP Gateway. These monitor points do not really contribute to most of the ASF-related problems we see, but they are the only data points available today.
  • Monitor Handle Count (either via gateway or MSFT agent): should be below some threshold (<5000)
  • Monitor Thread Count (either via gateway or MSFT agent): should be below some threshold (<500)
  • Monitor Events In Queue (via gateway): should be less than 50
  • Check that a process is bound to port 4160 on the box (don't know how to do that with OpenNMS): confirms that the LUS process is running
  • Monitor Memory Usage (either via gateway or MSFT agent): should be below some threshold (<200MB)
• ISIS
  • ISIS monitors a number of critical areas and sends an event to the Windows event log when a value reaches a defined threshold. You can configure ISIS to send an email when an error or warning event occurs. You can also configure the System Director to generate an SNMP trap when the event occurs. The top areas include the following:
  • Temperature and presence of components such as switches, storage elements, and power supplies.
  • Workspace usage thresholds. For example, an Admin can enable warning and error thresholds. If you set the workspace error threshold to 90%, ISIS generates an error event when a workspace reaches 90% full.
  • Disk health issues, such as a failed disk or degraded disk performance, based on continuous monitoring.
  • Server failover notifications. For example, on a failover system you are notified when the system fails over to the other node.
  • Metadata problems. For example, if there is a problem opening a metadata file or if the metadata in a file seems out of date.
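The numeric limits listed above for the Interplay Engine and the LUS can be gathered into a single table and checked against collected values. The sketch below is illustrative only: the metric names are hypothetical labels invented for the example (not real OpenNMS datasource names), and the limits are copied from the list above.

```python
# (comparison, limit) per metric; names are hypothetical labels.
LIMITS = {
    "engine_login_seconds": ("below", 15),
    "engine_journal_files": ("below", 50),
    "engine_deletes_per_5min": ("below", 100),
    "engine_loaded_object_ratio": ("above", 0.30),
    "lus_handle_count": ("below", 5000),
    "lus_thread_count": ("below", 500),
    "lus_events_in_queue": ("below", 50),
    "lus_memory_mb": ("below", 200),
}


def violations(sample):
    """Return the sorted names of metrics in sample that break their limit."""
    broken = []
    for name, value in sample.items():
        if name not in LIMITS:
            continue   # ignore metrics with no documented limit
        direction, limit = LIMITS[name]
        ok = value < limit if direction == "below" else value > limit
        if not ok:
            broken.append(name)
    return sorted(broken)
```

For instance, a sample with a 22-second login time and a 25% loaded-object ratio would be reported for both of those metrics, while a thread count of 499 would pass.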