Brno University of Technology CESNET z.s.p.o University Campus Network Monitoring in Everyday Life. Tomáš Podermański , tpoder@cis.vutbr.cz. Brno University of Technology. http://www.vutbr.cz One of the largest universities in the Czech Republic
Brno University of Technology • CESNET z.s.p.o. • University Campus Network Monitoring in Everyday Life • Tomáš Podermański, tpoder@cis.vutbr.cz
Brno University of Technology • http://www.vutbr.cz • One of the largest universities in the Czech Republic • founded in 1899, 110th anniversary will be celebrated this year • 20,000 students and 2,000 employees • 9 faculties • 6 other organisation units • Student dormitory for 6,000 students
VUT FP, FEKT, Kolejní 4 VUT Koleje, Kolejní 2 VUT FCH, FEKT, Purkyňova 118 VUT Koleje, Mánesova 12 VUT FEKT, Technická 8 VUT FIT, Božetechova 2 VUT FSI, Technická 2 AV VFU, Palackého 1/3 VUT TI, Technická 4 VUT Koleje, Purk. MU CESNET , Botanická 68a AV ČR UPT MZLU, Tauferova VUT, Kounicova 67a VUT Koleje , Kounicova 46/48 AV ČR UFM VUT Rektorát, Antonínská 1 VUT FAST, Veveří 95 VUT FaVU, Údolní 19 VUT , Gorkého 13 VUT FEKT Údolní 53 VUT FA, Poříčí 5 MU, Vinařská 5 AV ČR, Rybářská 13 VUT FaVU, Rybářská 13
Physical Layer • 24 places connected to each other • Each place is connected from at least two directions (by separate cables) • Over 100 km of optical cables • Most of the cables are the property of the university • IPv4 layer • The network cores are based on Hewlett Packard devices • OSPF-based routing • PIM SM and DM are used for multicast • Most of the traffic is transported through this network • IPv6 layer • IPv6 functionality on HP devices is available as a beta release • Temporary solution based on 3Com devices or PC routers with XORP • A dedicated IPv6 switch/router sits alongside the main IPv4 switch/router • VLANs are used for connections between IPv6 routers • Temporary low-cost solution until the main devices have full IPv6 support
Basic monitoring, active vs. passive • Active monitoring • We send probe data and get a response • A probe of the device, network etc. • Passive monitoring • An observer of the device, network etc.
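A minimal sketch of an active probe, assuming the monitored service listens on a TCP port: the monitor opens a connection and records how long the handshake takes. The function name `tcp_probe` and the timeout value are illustrative choices, not something taken from the slides.

```python
import socket
import time
from typing import Optional


def tcp_probe(host: str, port: int, timeout: float = 2.0) -> Optional[float]:
    """Actively probe a TCP service.

    Returns the connect time in seconds, or None if the service
    did not answer (refused, unreachable, or timed out).
    """
    start = time.monotonic()
    try:
        # create_connection performs the full TCP handshake
        with socket.create_connection((host, port), timeout=timeout):
            return time.monotonic() - start
    except OSError:
        # covers connection refused, host unreachable and timeouts
        return None
```

A passive observer, by contrast, would send nothing and only read counters or traffic that the device produces anyway.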
Components in a Monitoring System [diagram: one manager and five agents]
Components in a Monitoring System • Agent and protocol • SNMP agent • Get, Set, Walk, Traps • NetFlow, sFlow, IPFIX probe • Accumulated statistics • Many systems use a specialized protocol tied to the main system • Role of a cache on the agent • Active monitoring • We use an appropriate protocol or data depending on the monitored service • Proxy service (view from another point)
Components in a Monitoring System • Manager & Frontend • The manager collects and processes data from agents • Store and archive in a datastore • SQL, RRD, … • User interface • Web, application • Reports, SLA, … • Configuration • Historical view • System of alerts • Email, SMS, phone call • The most popular systems • Zabbix, Nagios, OpenView, nfsen/nfdump, flow-tools, rrdtool, mrtg, cacti, munin, …
Quiz: What causes most of the trouble in IT? • Power supply of systems • Overloaded circuits • Unmanaged UPS • Mess in electrical installations • An improper power supply can be a booby trap • Cooling systems • Absence of preventive monitoring • Frozen units • Jammed by foliage • …
Physical infrastructure LAYER 0,1
Power Supply with 1 + 1 Redundancy PDU I PDU II UPS II ATS UPS I 2x 16A
Power Supply with 1 + 1 Redundancy PDU I PDU II Load, voltage Load, voltage on source 1, voltage on source 2, Selected source UPS II ATS UPS I Load, Input voltage, output voltage, battery status 2x 16A
power system with 1 + 1 redundancy ATS UPS 2x 16A
power system with 1 + 1 redundancy Load, current Input voltage, output voltage, battery status ATS UPS Load, current voltage on source 1, voltage on source 2, Selected source 2x 16A
Power system with 1 + 1 redundancy • Overloaded circuit: tripped circuit breaker [diagram: 2x 16A feeds, ATS, UPS]
Power system with 1 + 1 redundancy • When the power goes up again... in a few minutes the UPS is low • The second circuit is overloaded: tripped circuit breaker [diagram: 2x 16A feeds, ATS, UPS]
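The failure these slides walk through can be caught in advance with a simple arithmetic check on the monitored currents: 1 + 1 redundancy only holds if one 16 A feed can carry the whole load by itself. The sketch below assumes the monitoring system already reports the current on each feed; the 80 % safety margin is an assumption for illustration, not a value from the slides.

```python
BREAKER_RATING_A = 16.0  # per-feed breaker rating (2x 16A in the slides)
SAFETY_MARGIN = 0.8      # assumed headroom factor, not from the slides


def redundancy_ok(feed_a_amps: float, feed_b_amps: float,
                  rating: float = BREAKER_RATING_A,
                  margin: float = SAFETY_MARGIN) -> bool:
    """Return True if a single feed could carry the combined load.

    If one breaker trips, the ATS moves the whole load to the other feed,
    so the sum of both currents must fit under one breaker.
    """
    total = feed_a_amps + feed_b_amps
    return total <= rating * margin


# Each feed looks fine on its own (7 A and 6 A on 16 A breakers),
# but together they exceed the 12.8 A budget of a single feed.
print(redundancy_ok(7.0, 6.0))  # False
print(redundancy_ok(6.0, 5.0))  # True
```

An alert on this condition fires while both feeds are still healthy, which is the point of preventive monitoring.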
Cooling Systems • In many cases the cooling system is part of the building • The majority of cooling systems are difficult to monitor • Some devices support monitoring, but it costs a lot of money • In many cases the monitoring is more expensive than the cooling device • There is no standard interface (RS485 with a closed protocol) • Some devices have a binary output (via relay) that indicates both the error and the running state • Possible conversion to SNMP • Another, and the easiest, solution: monitoring the temperature in the communication room • Thermometer with an SNMP output • Diagram labels: LonWorks, Monitoring system, Unit status/SNMP, Temperature/SNMP
Monitoring in Data Center Rooms • More complex electrical installation • Having a UPS and an ATS in every rack is inefficient • Devices with 3-phase power • Circuits are divided into 3 groups (direct, genset, UPS) • More detailed information about the electricity distribution is very useful • It is necessary to monitor whether the phases are balanced • The genset could break down
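One possible way to turn "are the phases balanced?" into a number is the maximum deviation from the mean phase current, expressed as a percentage of the mean. This is one common definition among several; the function name and the alert threshold are assumptions for illustration.

```python
from typing import Sequence


def phase_imbalance_percent(currents: Sequence[float]) -> float:
    """Maximum deviation from the mean phase current, in percent of the mean."""
    mean = sum(currents) / len(currents)
    if mean == 0:
        return 0.0  # no load at all, nothing to balance
    return max(abs(c - mean) for c in currents) / mean * 100.0


IMBALANCE_ALERT_PERCENT = 15.0  # hypothetical threshold

readings = [12.0, 10.0, 8.0]  # amps on L1, L2, L3 (made-up values)
imbalance = phase_imbalance_percent(readings)
if imbalance > IMBALANCE_ALERT_PERCENT:
    print(f"phase imbalance {imbalance:.1f} % exceeds threshold")
```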
Power in Data Center Rooms Main power A Devices in racks V V ATS Genset A A Bypass HVAC A V UPS
Server Monitoring • Hardware • Manufacturers' software support is required (Dell OpenManage, HP InsightControl, …) • Chassis temperature • Fan condition • Power status • Operating system • CPU, load, memory, utilization, processes • Disk subsystem • External disk array with its own management port • RAID status • Disk condition (S.M.A.R.T.) • Diagram labels: SNMP, IPMI, Other, Monitoring system
Network Device Monitoring • Hardware • Chassis temperature • Fan condition • Power status • State of the operating system • CPU load • Memory • Diagram labels: SNMP, Monitoring system
Network Connection – L1 Monitoring • Port status • Link UP/DOWN • Speed • Errors on interfaces • Traffic on interfaces • Remote device status • LLDP + data from MIB • Remote interface, remote device, …
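Traffic and error figures on interfaces usually come from SNMP counters that only ever increase and wrap around at their maximum (for example the 32-bit `ifInOctets` in IF-MIB). A minimal sketch of turning two counter readings into a rate, assuming at most one wrap between polls; the helper names are illustrative.

```python
def counter_delta(prev: int, curr: int, bits: int = 32) -> int:
    """Difference between two readings of a wrapping counter.

    Assumes the counter wrapped at most once between the two polls.
    """
    return (curr - prev) % (1 << bits)


def rate_bits_per_second(prev_octets: int, curr_octets: int,
                         interval_s: float, bits: int = 32) -> float:
    """Average traffic rate in bit/s over the polling interval."""
    return counter_delta(prev_octets, curr_octets, bits) * 8 / interval_s


# A 32-bit counter that wrapped between two polls still yields a small delta.
print(counter_delta(4294967290, 10))            # 16
print(rate_bits_per_second(0, 1_250_000, 10))   # 1000000.0
```

On fast links a 32-bit octet counter can wrap more than once per polling interval, which is why the 64-bit `ifHCInOctets` variant (`bits=64` here) is generally preferred where the device supports it.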
Link LAYER 2
Network Connection – L2 Monitoring • L2 monitoring • An L2 ping could be very useful • We have to use information obtained from other layers (L1, L3) • Unfortunately, there is no simple way to check connectivity on a single VLAN • One option is to obtain some information from the MIB, but it is not sufficient • STP/MSTP information, root bridge • VLANs on interfaces
Network Connection – L3 monitoring • L3 monitoring • ICMP and ping are still the most important • The problem is how to monitor broken paths (the routing protocol usually covers up any problem) • Checking the routing protocol state • ICMP using source routing • Flow-based monitoring • Multicast monitoring • Diagram labels: 147.229.6.1, 147.229.6.2, Data
Network Connection – L3 monitoring • L3 monitoring • Checking that a router has the proper neighbors • OSPF-MIB (RFC 4750): ospfNbrRtrId • VRRP-MIB (RFC 2787): vrrpOperAdminState, vrrpOperState, vrrpOperMasterIpAddr • Diagram labels: Master, BDR, DR, Backup
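The neighbor check above boils down to comparing an expected set of router IDs with the `ospfNbrRtrId` values a router actually reports. The sketch below assumes the observed IDs have already been fetched (for example by an SNMP walk, not shown here); the function name and the addresses are hypothetical.

```python
from typing import Dict, Iterable, List


def check_neighbors(expected: Iterable[str],
                    observed: Iterable[str]) -> Dict[str, List[str]]:
    """Compare expected OSPF neighbor router IDs with the observed ones."""
    expected_set, observed_set = set(expected), set(observed)
    return {
        "missing": sorted(expected_set - observed_set),     # should be there, is not
        "unexpected": sorted(observed_set - expected_set),  # is there, should not be
    }


# Hypothetical router IDs from the documentation address range.
expected_ids = ["192.0.2.1", "192.0.2.2", "192.0.2.3"]
observed_ids = ["192.0.2.1", "192.0.2.3", "192.0.2.9"]
result = check_neighbors(expected_ids, observed_ids)
print(result)  # {'missing': ['192.0.2.2'], 'unexpected': ['192.0.2.9']}
```

A check like this catches the case where traffic still flows because the routing protocol has rerouted it, yet an adjacency that should exist is gone.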
Multicast Monitoring • Quite a demanding task • For each stream the <S,G> path has to be created • A continuously received and transmitted stream does not necessarily reveal a problem on the RP • Almost impossible to monitor the local infrastructure • The only known tool: Multicast Beacon • Written in Perl • Dead project • Last release 2006 • No VLAN support and no support for multiple interfaces on a single host • Homepage unavailable • Own solution: mcwatch
Multicast Agents Data is periodically sent to a server
Multicast Agent VLAN POSIX SOCKET APPLICATION Multicast Beacon
Multicast Agent VLAN POSIX SOCKET APPLICATION mcwatch
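The slides do not describe how mcwatch works internally, so the following is only a generic sketch of what a multicast agent on a host with several (VLAN) interfaces has to do: join the group on one specific interface and notice when beacon packets go missing. The idea that beacons carry sequence numbers, and all names below, are assumptions for illustration.

```python
import socket
from typing import Iterable, List


def build_mreq(group: str, iface_addr: str) -> bytes:
    """Pack an ip_mreq structure: group address followed by the local
    interface address that selects which interface joins the group."""
    return socket.inet_aton(group) + socket.inet_aton(iface_addr)


def open_receiver(group: str, port: int, iface_addr: str) -> socket.socket:
    """Open a UDP socket and join the multicast group on one interface."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM, socket.IPPROTO_UDP)
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    sock.bind(("", port))
    sock.setsockopt(socket.IPPROTO_IP, socket.IP_ADD_MEMBERSHIP,
                    build_mreq(group, iface_addr))
    return sock


def missing_sequences(received: Iterable[int]) -> List[int]:
    """Sequence numbers absent between the lowest and highest one seen."""
    seen = set(received)
    if not seen:
        return []
    return [n for n in range(min(seen), max(seen) + 1) if n not in seen]
```

Calling `open_receiver` once per local interface address gives one socket per VLAN, and the gaps found by `missing_sequences` are the kind of data an agent could periodically report to a server.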
NetFlow Monitoring • Two NetFlow probes watch both external connectivity lines • The NetFlow probes are connected directly to the optical fiber via a TAP • Wire-speed accelerated probes (FlowMon) • Diagram labels: CESNET PoP, CRS-1/16, University network, 10G Ethernet
Flow Processing [diagram: nfcapd, datastore, aggregated SQL; backbone administrator, all administrators]
Flow Processing • Data are stored on a storage server • Data are kept for 30 days • Analysis of security incidents, statistical purposes • The big issue: how to select useful data and provide it to the people who need it • Security concerns • Full data are accessible only to a small, trusted group of administrators • For other IT staff (faculty administrators, IT managers) summarised data are accessible via a web interface • Data are processed by common open source tools: • nfdump • A lot of trouble, but we don't have any better solution • We are trying to optimise the current implementations • Several theses on this topic are in progress • Commercial tools: the situation is no better • Usually plenty of nice charts and statistics • But performance is often terrible (sampling is required)
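The split between full flow data for a small group and summarised data for everyone else can be illustrated by aggregating per-flow byte counts up to network prefixes, so individual hosts no longer appear in the summary. This is a sketch under assumed inputs: the `(source address, bytes)` record shape and the /24 aggregation level are illustrative, not the format the actual system uses.

```python
import ipaddress
from collections import Counter
from typing import Dict, Iterable, Tuple


def summarise_by_prefix(flows: Iterable[Tuple[str, int]],
                        prefix_len: int = 24) -> Dict[str, int]:
    """Sum bytes per source network, hiding individual host addresses."""
    totals: Counter = Counter()
    for src, nbytes in flows:
        # strict=False lets a host address be mapped onto its network
        net = ipaddress.ip_network(f"{src}/{prefix_len}", strict=False)
        totals[str(net)] += nbytes
    return dict(totals)


# Made-up flow records using documentation address ranges.
sample_flows = [
    ("192.0.2.17", 100),
    ("192.0.2.200", 50),
    ("198.51.100.5", 7),
]
summary = summarise_by_prefix(sample_flows)
print(summary)  # {'192.0.2.0/24': 150, '198.51.100.0/24': 7}
```

Aggregated rows like these are small enough to keep in an SQL table and safe enough to show in a web interface, while the raw flows stay with the backbone administrators.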
Transport, application and the others LAYER 4-7
Layer 7 • Many in-house plugins • Eduroam/Radius monitoring • DNS • Database status • Backup server status • … • Data are collected and made available to administrators at different levels • Eduroam/Radius logs • Mail logs (DNSBL, spam classification, statistics) • WiFi/VPN connections • …
Components in the Monitoring System zabbix SNMP Zabbix Spinel SNMP xwho, xhis radius mysql icmp snmp xmon NetIs wifilogs maillogs radiuslogs honeypots incidents … aggflow nfdump netflow
Monitoring: Layers & Technology • Technologies: SNMP, zabbix, NetFlow, radius, ICMP, ICMPv6, Spinel, … • Physical: power, cooling systems, temperature; servers and disk arrays; network devices • Link: port statistics, link status, number of errors; LLDP neighbour • Internet: ICMP tests using the source routing option; OSPF, VRRP peers; multicast traffic monitoring • Application: Radius, DNS, other services • Tools shown in the diagram: zabbix, xwho, xhis, NetIs, nfdump
Current problems • SNMP protocol • No alternative • Many bugs in various implementations • Absence of an L2 testing tool • NetFlow • We have plenty of data, but nobody knows how to process it effectively • In some cases more detailed information is required than flow data provides • IPv6 brings some new problems and challenges