Evaluation of Network Management Systems (NMS)
• Background
• Problem Statement
• Resolution
• Evaluation of NMS solutions
• Recommendation
• Tasks Accomplished
• Tasks Assigned
Rahul Datta, ISIS, 09/10/06
Graduate Student, Vanderbilt University

Background
The Fermi National Accelerator Laboratory (Fermilab) is conducting research in lattice QCD (quantum chromodynamics), for which it operates large clusters of computers. The goal is to understand the strong dynamics of quarks and gluons, which is beyond the reach of the traditional perturbative methods of quantum field theory. A central goal of the groups using the computers is to carry out the calculations required to extract the fundamental parameters of the Standard Model of particle physics from experiment.
Fermilab is focusing on building a Cluster Reliability Subsystem. The LQCD computer cluster will be very large and will need to be available 24 hours a day. The cluster should ensure that resources are used to the best possible extent and should attempt to complete started tasks in the presence of hardware and software failures (that is, be fault resilient).

Background (contd.)
Examples of things that can affect availability and performance include:
• Power outages, scheduled and unscheduled
• Job failures due to failing or failed hardware
• Scheduling of jobs on faulty nodes
• Decreased performance due to hardware deterioration
• Decreased performance due to external influences (e.g. air quality)
• Inability to diagnose problems (e.g. hardware, OS, batch tools)

Problem Statement
• To determine and specify the requirements placed on an NMS by LQCD-like systems.
• To survey available network management systems (NMS) and select a limited number capable of meeting the requirements to monitor and manage the computer cluster and the devices contained in that network.
• To measure the performance of the NMS.
• To ascertain the characteristics and features of the NMS.
• To prototype a limited-scale monitoring/adaptation system.
• To monitor the utilization and state of all processors and networks in the system.
• To experiment with the NMS and observe what kinds of plug-ins or modifications can be made to it.
• To consider a system where pluggable components hook into a message distribution system for routing and delivery to other pluggable components.

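The last item above, pluggable components hooked into a message distribution system, can be pictured as a small publish/subscribe bus. The following is a minimal sketch under stated assumptions, not a design from the evaluation: the class `MessageBus`, the topic name `node.temperature`, and the 70 degree limit are all hypothetical names and values chosen for illustration.

```python
from collections import defaultdict
from typing import Callable, Dict, List

# A handler is any pluggable component that accepts (topic, message).
Handler = Callable[[str, dict], None]


class MessageBus:
    """Routes each published message to every component subscribed to its topic."""

    def __init__(self) -> None:
        self._subscribers: Dict[str, List[Handler]] = defaultdict(list)

    def subscribe(self, topic: str, handler: Handler) -> None:
        self._subscribers[topic].append(handler)

    def publish(self, topic: str, message: dict) -> int:
        """Deliver the message and return how many handlers received it."""
        handlers = list(self._subscribers.get(topic, []))
        for handler in handlers:
            handler(topic, message)
        return len(handlers)


bus = MessageBus()
alerts: List[str] = []


def overheat_monitor(topic: str, message: dict) -> None:
    # Hypothetical monitor component: flags readings above an assumed 70 C limit.
    if message["celsius"] > 70:
        alerts.append(f"{message['node']} overheating")


bus.subscribe("node.temperature", overheat_monitor)
bus.publish("node.temperature", {"node": "node01", "celsius": 75})
```

In this arrangement a sensor component only needs to know the topic it publishes on, so monitors and effectors can be added or removed without changing the sensor.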
Resolution
• Open source and unrestricted (distribution, porting, licensing)
• Tools for the user interface
• Kinds of communication available
• Heavyweight or lightweight package (resource requirements: memory, processor, bandwidth)
• Synchronization and triggers; memory checks
• Plug-ins available, or modules that can be built (for example, sensor modules)
• Effectors, sensors, and monitors
• Documentation

Potential NMS solutions
• OpenNMS
• PIKT
• JFFNMS
• Nagios
• Aware
• Net-Policy
• SYSMON
Note: All the network management systems discussed here are open source. Given the scope of the research done so far, Net-Policy and SYSMON are not discussed in detail here.

OpenNMS (Open Network Management System)
• Platforms supported: Linux, Fermi Linux, CentOS, RHEL 3 and 4, Debian Sarge, SuSE, Red Hat Linux, Mandrake, Solaris, Mac OS X (Panther)
• Features:
  • Web-based graphical user interface (GUI)
  • Service polling:
    • OpenNMS provides real-time, event-driven monitoring. Events typically come from SNMP traps but can come from other sources such as syslog.
    • Event-driven monitoring has no polling interval as such: if a node goes down, the switch generates an SNMP trap immediately, which gives true real-time network monitoring.
    • OpenNMS can also poll services, including ICMP, NotesHTTP, DominoHTTP, Citrix, LDAP, SNMP, SNMPv2, and many more.
  • Network discovery
  • Availability reporting

OpenNMS (contd.)
• SNMP data collection
• SNMP trap receiver (over 5000 traps are pre-configured)
• Notification via e-mail, pager, XMPP, Growl, or anything that can be run on a command line
• Supported communications: alarms, sensors, effectors
• Thresholds (based on data collected via SNMP or on response time from a poller)
• Well documented
• Written in: Java

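To illustrate what a threshold on collected data involves, here is a generic sketch of a high threshold that alarms after a number of consecutive exceeding samples and re-arms once the value falls back. It is not OpenNMS code or configuration; the class `HighThreshold` and its fields `value`, `rearm`, and `trigger` are assumptions made for this example.

```python
from dataclasses import dataclass, field


@dataclass
class HighThreshold:
    value: float      # alarm when samples reach this level
    rearm: float      # re-arm once a sample falls back to this level
    trigger: int = 1  # consecutive exceeding samples required before alarming
    _count: int = field(default=0, init=False, repr=False)
    _armed: bool = field(default=True, init=False, repr=False)

    def evaluate(self, sample: float) -> str:
        """Return "alarm", "rearmed", or "none" for one collected sample."""
        if self._armed:
            if sample >= self.value:
                self._count += 1
                if self._count >= self.trigger:
                    # Stay silent until the value drops back to the rearm level.
                    self._armed = False
                    self._count = 0
                    return "alarm"
            else:
                # A sample below the threshold breaks the consecutive run.
                self._count = 0
        elif sample <= self.rearm:
            self._armed = True
            return "rearmed"
        return "none"
```

Separating the alarm level from the re-arm level keeps a value that hovers around the threshold from generating a stream of repeated alarms.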
PIKT (Problem Informant/Killer Tool)
• Platforms supported: GNU/Linux, Fermi Linux, AIX, FreeBSD, OpenBSD, Digital UNIX
• Features:
  • Lacks a proper GUI
  • Reporting a problem
  • Fixing a problem (killing idle user sessions, monitoring user activity, deleting junk files, disk management)
  • Scanning a log file (log file analysis)
  • Configuring a system (network configuration)
  • Auto-configuring a file (automated configuration management)

PIKT Features (contd.)
• Job scheduling (centrally directed scheduling daemon, cron alternative)
• Monitoring system security (checksum differences, change auditing)
• Enhancing the command line (command line macros, remote command execution)
• Lacks proper documentation

JFFNMS (Just For Fun Network Management System)
• Platforms supported: GNU/Linux, Fermi Linux, AIX, FreeBSD, OpenBSD, Digital UNIX
• Features:
  • Web GUI
  • Event console that shows events and alarms in the same time-ordered display
  • Distributed polling
  • Triggers/actions framework for e-mail and other clients
  • Map and sub-map support
  • Fully administered via the web
  • Sound alerts in the browser
  • Database abstraction framework
  • Object oriented
  • Sensors

JFFNMS (contd.)
• Reports:
  • Traffic bytes
  • Utilization %
  • Packets per second, errors per second, error rate
  • Round trip time and packet loss (Cisco and SmokePing)
  • Drops
  • TCP connections: incoming, outgoing, established, delay
  • Number of processors, number of users
  • Used memory and disks, with aggregation
  • Processor utilization and load average
  • Temperature
• Documentation available
• Written in: PHP

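Reports such as utilization % and error rate are typically derived from successive counter samples rather than read directly. The sketch below shows the usual arithmetic, assuming 32-bit SNMP octet counters (for example `ifInOctets`) that may wrap once between samples. How JFFNMS itself computes these figures is not stated in the slides, and the function names here are hypothetical.

```python
COUNTER32_MODULUS = 2 ** 32  # a 32-bit SNMP counter wraps back to 0 at this value


def counter_delta(previous: int, current: int, modulus: int = COUNTER32_MODULUS) -> int:
    """Difference between two counter samples, allowing for a single wrap."""
    return (current - previous) % modulus


def utilization_percent(prev_octets: int, curr_octets: int,
                        interval_seconds: float, link_bits_per_second: float) -> float:
    """Share of link capacity used during the sampling interval, in percent."""
    if interval_seconds <= 0 or link_bits_per_second <= 0:
        raise ValueError("interval and link speed must be positive")
    bits = counter_delta(prev_octets, curr_octets) * 8  # octets to bits
    return 100.0 * bits / (interval_seconds * link_bits_per_second)


def error_rate_percent(error_delta: int, packet_delta: int) -> float:
    """Errors as a percentage of packets seen; 0.0 when no packets were seen."""
    return 100.0 * error_delta / packet_delta if packet_delta else 0.0
```

For instance, 37,500,000 octets transferred in a 300 second interval on a 100 Mbit/s link is 300,000,000 bits out of a possible 30,000,000,000, so `utilization_percent(0, 37_500_000, 300, 100_000_000)` evaluates to 1.0.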
Nagios
• Platforms supported: Linux, Fermi Linux
• Features:
  • Monitoring of network services (SMTP, POP3, HTTP, etc.)
  • Ability to define a network host hierarchy, allowing detection of, and distinction between, hosts that are down and hosts that are unreachable
  • Notifications via e-mail, pager, or other user-defined method
  • Ability to define event handlers to be run during service or host events for proactive problem resolution
  • Ability to acknowledge problems via the web interface
• Supported Communications:
  • Simple plugin design allowing users to develop their own host and service checks

Nagios (contd.)
• Supported Communications (contd.):
  • Monitoring of host resources (processor load, disk and memory usage, running processes, log files, etc.)
  • Monitoring of environmental factors such as temperature
• Written in: C

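The plugin design is simple because a Nagios check is just an executable that prints one status line and reports its result through the exit code: 0 for OK, 1 for WARNING, 2 for CRITICAL, 3 for UNKNOWN. The following is a minimal sketch of a temperature check in that style. The name `check_temp`, its argument order, and the idea of passing the reading on the command line are assumptions for illustration; a real check would read the value from a sensor.

```python
from typing import List, Tuple

# Exit codes defined by the Nagios plugin convention.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3
STATE_NAMES = {OK: "OK", WARNING: "WARNING", CRITICAL: "CRITICAL", UNKNOWN: "UNKNOWN"}


def check_temperature(celsius: float, warn: float, crit: float) -> Tuple[int, str]:
    """Classify a reading and build the one-line plugin output."""
    if celsius >= crit:
        state = CRITICAL
    elif celsius >= warn:
        state = WARNING
    else:
        state = OK
    # Text after "|" is performance data: label=value;warn;crit
    line = (f"TEMPERATURE {STATE_NAMES[state]} - {celsius:.1f} C"
            f" | temp={celsius:.1f};{warn:g};{crit:g}")
    return state, line


def main(argv: List[str]) -> int:
    """Hypothetical entry point: check_temp CELSIUS WARN CRIT."""
    try:
        celsius, warn, crit = (float(arg) for arg in argv)
    except ValueError:
        # Wrong argument count or a non-numeric argument.
        print("TEMPERATURE UNKNOWN - usage: check_temp CELSIUS WARN CRIT")
        return UNKNOWN
    state, line = check_temperature(celsius, warn, crit)
    print(line)
    return state
```

Saved as a script, the plugin would end with `sys.exit(main(sys.argv[1:]))` so that Nagios receives the state as the process exit status.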
AWARE
• Platforms supported: Linux, Fermi Linux
• Features:
  • Open source implementation allows for a robust code base and customization
  • Common core engine implements a model of event processing
  • A "plug in" style mechanism allows dynamic addition of handlers
  • Agents are composed of a set of running event handlers
  • Agents can get their configuration from other agents (e.g., a centrally managed set of agent configurations)
  • Agents can communicate with other agents using connection-oriented, connectionless, and broadcast-based methods

AWARE (contd.)
• Features (contd.):
  • Supported Communications:
    • Sensors: a comprehensive set of sensors that gather relevant information
    • Analyzers: components that process data from the sensors and issue controller commands
    • Controllers: components that change system state (e.g., run programs, change system parameters, control devices)
  • Documentation available
• Written in: C

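The sensor, analyzer, and controller roles described for AWARE form a simple pipeline: sensors gather readings, analyzers turn readings into commands, and controllers act on those commands. The sketch below illustrates that division of labor only. It is not AWARE's API (AWARE is written in C); `run_cycle`, `load_sensor`, `overload_analyzer`, the `drain_node` command, and the load limit of 8.0 are all hypothetical.

```python
from typing import Callable, Dict, List, Optional

Reading = Dict[str, float]
Sensor = Callable[[], Reading]                 # gathers information
Analyzer = Callable[[Reading], Optional[str]]  # returns a command name or None
Controller = Callable[[Reading], None]         # changes system state


def run_cycle(sensors: List[Sensor], analyzers: List[Analyzer],
              controllers: Dict[str, Controller]) -> List[str]:
    """Run one sense/analyze/control pass and return the commands issued."""
    reading: Reading = {}
    for sensor in sensors:
        reading.update(sensor())
    commands = [cmd for cmd in (analyze(reading) for analyze in analyzers)
                if cmd is not None]
    for command in commands:
        controller = controllers.get(command)
        if controller is not None:
            controller(reading)
    return commands


actions: List[str] = []


def load_sensor() -> Reading:
    return {"load_average": 9.5}  # fixed value standing in for a real measurement


def overload_analyzer(reading: Reading) -> Optional[str]:
    # Assumed policy: request a drain when load average exceeds 8.0.
    return "drain_node" if reading.get("load_average", 0.0) > 8.0 else None


controllers: Dict[str, Controller] = {
    "drain_node": lambda reading: actions.append(
        f"drain requested at load {reading['load_average']}"),
}

issued = run_cycle([load_sensor], [overload_analyzer], controllers)
```

Because analyzers only name a command, the same analyzer can drive different controllers on different nodes.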
Recommendation
• Explore and experiment with the full features of at least two or three open source network management systems (NMS) before finalizing an NMS.
• Based on the feature comparison, OpenNMS has been chosen.

Tasks Accomplished
• OpenNMS was successfully installed on an offsite Fermi Linux machine at ISIS, Vanderbilt University.

Tasks Assigned
• Explore the features of OpenNMS, for example:
  • Find, build, and install a sensor; write our own sensors, alarms, and effectors.
  • Detect the temperature difference of the hard drive of at least one of the nodes using OpenNMS.

Useful Links / URLs
• http://www.openxtra.co.uk/resource-center/open_source_network_management_systems.php
• http://www.opennms.org/index.php/Main_Page
• http://jffnms.sourceforge.net/
• http://www.elegant-software.com/software/aware/doc/html/index.html