
OSPF Monitor Architecture, Design and Deployment Experience

Aman Shaikh and Albert Greenberg, AT&T Labs - Research. NSDI 2004. Objectives for OSPF Monitor: real-time analysis of OSPF behavior; trouble-shooting, alerting, and validation of maintenance; real-time snapshots of OSPF network topology.


Presentation Transcript


  1. OSPF Monitor Architecture, Design and Deployment Experience
     Aman Shaikh, Albert Greenberg
     AT&T Labs - Research
     NSDI 2004
     OSPF Monitor - NSDI 2004

  2. Objectives for OSPF Monitor
     • Real-time analysis of OSPF behavior
       • Trouble-shooting, alerting, validation of maintenance
       • Real-time snapshots of OSPF network topology
     • Off-line analysis
       • Post-mortem analysis of recurring problems
       • Generate statistics and reports about network performance
       • Identify anomaly signatures
       • Facilitate tuning of configurable parameters
       • Improve maintenance procedures
       • Analyze OSPF behavior in commercial networks

  3. OSPF Monitor in a Nutshell
     • Collect OSPF LSAs (Link State Advertisements) passively from network
       • Every router describes its local connectivity in an LSA
       • Router originates an LSA due to...
         • Change in network topology
         • Periodic soft-state refresh
       • LSA is flooded to other routers in the domain
         • Flooding is reliable and hop-by-hop
         • Flooding leads to duplicate copies of LSAs being received
       • Every router stores LSAs (self-originated + received) in link-state database (= topology graph)
     • Real-time analysis of LSA streams
     • Archive LSAs for off-line analysis

  4. Components
     • Data collection: LSA Reflector (LSAR)
       • Passively collects OSPF LSAs from network
       • “Reflects” streams of LSAs to LSAG
       • Archives LSAs for analysis by OSPFScan
     • Real-time analysis: LSA aGgregator (LSAG)
       • Monitors network for topology changes, LSA storms, node flaps and anomalies
     • Off-line analysis: OSPFScan
       • Supports queries on LSA archives
       • Allows playback and modeling of topology changes
       • Allows emulation of OSPF routing

  5. Example
     [Diagram: LSAR 1 and LSAR 2 collect LSAs from an OSPF network (Area 0, Area 1, Area 2). Each LSAR keeps an LSA archive and “reflects” LSAs over a TCP connection to LSAG (real-time monitoring). LSA archives are replicated for OSPFScan (off-line analysis).]

  6. How LSAR attaches to Network
     • Host mode
       • Join multicast group
       • Adv: completely passive
       • Disadv: not reliable, delayed initialization of LSDB
     • Full adjacency mode
       • Form full adjacency (= peering session) with a router
       • Adv: reliable, immediate initialization of LSDB
       • Disadv: LSAR’s instability can impact entire network
     • Partial adjacency mode
       • Keep adjacency in a state that allows LSAR to receive LSAs, but does not allow data forwarding over link
       • Adv: reliable, LSAR’s instability does not impact entire network, immediate initialization of LSDB
       • Disadv: can raise alarms on the router

  7. Partial Adjacency for LSAR
     [Diagram: LSAR and router R. Labels: “I have LSA L”, “Please send me LSA L” (repeated three times), “I need LSA L from LSAR”, “Partial state”.]
     • Router R does not advertise a link to LSAR
     • LSAR does not originate any LSAs
     • Routers (except R) not aware of LSAR’s presence
       • Does not trigger routing calculations in network
     • LSAR’s going up/down does not impact network
     • LSAR-R link is not used for data forwarding

  8. LSA aGgregator (LSAG)
     • Analyzes “reflected” LSAs from LSARs in real-time
     • Generates console messages:
       • Change in OSPF network topology
         • ADJACENCY COST CHANGE: rtr 10.0.0.1 (intf 10.0.0.2) -> rtr 10.0.0.5 old_cost 1000 new_cost 50000 area 0.0.0.0
       • Node flaps
         • RTR FLAP: rtr 10.0.0.12 no_flaps 7 flap_window 570 sec
       • LSA storms
         • LSA STORM: lstype 3 lsid 10.1.0.0 advrt 10.0.0.3 area 0.0.0.0 no_lsas 7 storm_window 470 sec
       • Anomalous behavior
         • TYPE-3 ROUTE FROM NON-BORDER RTR: ntw 10.3.0.0/24 rtr 10.0.0.6 area 0.0.0.0
     • Dumps snapshots of network topology
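The RTR FLAP message on this slide reports a flap count and the window over which the flaps were seen. The following is a minimal sketch of how such a sliding-window counter could work. The class name `FlapDetector`, its default window and threshold, and the exact trigger rule are assumptions made for illustration; the slides do not describe how LSAG implements this.

```python
from collections import defaultdict, deque


class FlapDetector:
    """Counts flaps per router inside a sliding time window (illustrative only)."""

    def __init__(self, window_sec=600.0, threshold=5):
        # Both defaults are hypothetical, not LSAG's actual settings.
        self.window_sec = window_sec
        self.threshold = threshold
        self._events = defaultdict(deque)  # router id -> flap timestamps

    def record(self, router_id, timestamp):
        """Record one flap; return a message once the threshold is reached."""
        events = self._events[router_id]
        events.append(timestamp)
        # Discard flaps that have fallen out of the window.
        while events and timestamp - events[0] > self.window_sec:
            events.popleft()
        if len(events) >= self.threshold:
            span = int(events[-1] - events[0])
            return (f"RTR FLAP: rtr {router_id} "
                    f"no_flaps {len(events)} flap_window {span} sec")
        return None
```

With `threshold=3`, flaps recorded for one router at 0, 100 and 250 seconds produce a message on the third call. A later flap at 1000 seconds returns `None`, because the earlier flaps have aged out of a 600-second window.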

  9. OSPFScan
     • Tools for off-line analysis of LSA archives
       • Parse, select (based on queries), and analyze
     • Functionality supported by OSPFScan
       • Classification of LSA traffic
         • Change LSAs, refresh LSAs, duplicate LSAs
       • Emulation of OSPF routing
         • How OSPF routing tables evolved in response to network changes
         • What an end-to-end path within the OSPF domain looked like at any instant
       • Modeling of topology changes
         • Vertex addition/deletion and link addition/deletion/change_cost
       • Playback of topology change events
       • Statistics and report generation
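The classification of LSA traffic into change, refresh and duplicate LSAs can be sketched as a comparison against the last stored instance of each LSA. This is a simplified illustration, not OSPFScan's code: the function name `classify_lsa`, the dictionary-based database, and the rule that any same-or-older sequence number counts as a duplicate are assumptions, and details such as LSA age, checksums and sequence-number wraparound are ignored.

```python
def classify_lsa(database, key, seq, body):
    """Classify an incoming LSA and update the database.

    database maps key -> (seq, body). The key stands for the
    (LS type, link-state ID, advertising router) triple.
    """
    stored = database.get(key)
    if stored is not None and seq <= stored[0]:
        # An instance we already hold, delivered again by flooding
        # (simplification: older instances are lumped in here too).
        return "duplicate"
    database[key] = (seq, body)
    if stored is not None and stored[1] == body:
        return "refresh"  # newer instance, identical contents
    return "change"       # first sighting, or contents differ
```

For a single key, the sequence (seq 1, cost 10), (seq 1, cost 10), (seq 2, cost 10), (seq 3, cost 50) classifies as change, duplicate, refresh, change.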

  10. Performance Evaluation
      • Performance of LSAR and LSAG through lab experiments
        • LSAR and LSAG are key to real-time monitoring
        • How performance scales with LSA-rate and network size

  11. Experimental Setup
      [Diagram labels: PC, SUT, Zebra (emulated topology), LSAR, LSAG, OSPF adjacency, TCP connection, LSA flows; “Measure LSA pass-through time for LSAR”; “Measure LSA processing time for LSAG”.]

  12. Methodology
      • Send a burst of LSAs from Zebra to LSAR
        • Vary number of LSAs (l) in a burst of 1 sec duration
      • Use of fully connected graph as the emulated topology
        • Vary number of nodes (n) in the topology
      • Performance measurements
        • LSAR performance: LSA “pass-through” time
          • Zebra measures time difference between sending and receiving an LSA from LSAR
        • LSAG performance: LSA processing time
          • Instrumentation of LSAG code

  13. LSAR Performance

  14. LSAG Performance

  15. Deployment
      • Tier-1 ISP network
        • Area 0, 100+ routers; point-to-point links
        • Deployed since January 2003
        • LSA archive size: 8 MB/day
        • LSAR connection: partial adjacency mode
      • Enterprise network
        • 15 areas, 500+ routers; Ethernet-based LANs
        • Deployed since February 2002
        • LSA archive size: 10 MB/day
        • LSAR connection: host mode

  16. LSAG in Day-to-day Operations
      • Generation of alarms by feeding messages into higher layer network management systems
        • Grouping of messages to reduce the number of alarms
        • Prioritization of messages
      • Validation of maintenance steps and monitoring the impact of these steps on network-wide OSPF behavior
        • Example:
          • Network operators use cost-out/cost-in of links to carry out maintenance
          • A “link-audit” web-page allows operators to keep track of link costs in real-time

  17. Problems Caught by LSAG
      • Equipment problem
        • Detected internal problems in a crucial router in enterprise network
        • Problem manifested as episodes of OSPF adjacency flapping
      • Configuration problem
        • Identified assignment of same router-id to two routers in enterprise network
      • OSPF implementation bug
        • Caught a bug in type-3 LSA generation code of a router vendor in ISP network
        • Faster refresh of LSAs than standards-mandated rate

  18. Long Term Analysis by OSPFScan
      • LSA traffic analysis
        • Identified excessive duplicate LSA traffic in some areas of Enterprise Network
          • Led to root-cause analysis and preventative steps
      • Statistics generation
        • Inter-arrival time of change LSAs in ISP network
          • Fine-tuning configurable timers related to route calculation (= SPF calculation)
        • Mean down-time and up-time for links and routers in ISP network
          • Assessment of reliability and availability
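The mean up-time and down-time statistics mentioned on this slide can be derived from a time-ordered list of state changes for one link or router. The sketch below assumes a hypothetical event format of `(timestamp_sec, state)` tuples; it is not the format of the LSA archive, and the function name `mean_up_down` is made up for this example.

```python
def mean_up_down(events):
    """Return (mean_up_time, mean_down_time) in seconds.

    events is a time-ordered list of (timestamp_sec, state) tuples,
    where state is "up" or "down". Only completed intervals count;
    the interval after the last event is still open and is skipped.
    A mean is None when no interval of that kind was observed.
    """
    durations = {"up": [], "down": []}
    for (start, state), (end, _) in zip(events, events[1:]):
        durations[state].append(end - start)

    def mean(values):
        return sum(values) / len(values) if values else None

    return mean(durations["up"]), mean(durations["down"])


# Hypothetical link history: up for 100 s, down for 30 s,
# up for 200 s, down for 10 s, then up again (still open).
history = [(0, "up"), (100, "down"), (130, "up"), (330, "down"), (340, "up")]
print(mean_up_down(history))  # (150.0, 20.0)
```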

  19. Lessons Learned through Deployment
      • New tools reveal new failure modes
      • Real-time alerting and off-line analysis are complementary
      • Distributed architecture helped a lot
      • OSPF exhibits significant activity in real networks
        • Maintenance and genuine problems
      • Add functionality incrementally and through interaction with users
      • Archive all LSAs
        • LSA volume is manageable
        • Don’t throw away refresh and duplicate LSAs

  20. Conclusion
      • Three-component architecture
        • LSAR: data collection
        • LSAG: real-time analysis
        • OSPFScan: off-line analysis
      • Performance analysis
        • LSAR and LSAG scale well as LSA rate and network size increase
      • Deployment
        • Deployed in Tier-1 ISP and Enterprise network
        • Has proved to be an extremely valuable tool for network management
          • “OSPF Monitor was a Lifesaver” (VP of Networking, Enterprise network)

  21. Future Work
      • Real-time analysis
        • Correlation with other fault and performance data for more meaningful alerting
        • Prioritization of alerts
      • Off-line analysis
        • Correlation with other data sources
          • Work already underway: BGP, fault, performance
        • Identification of problem signatures and feeding them into real-time component for problem prediction

  22. Backup Slides

  23. Overview of OSPF
      • OSPF is a link-state protocol
        • Every router learns entire network topology
        • Topology is represented as graph
          • Routers are vertices, links are edges
          • Every link is assigned weight through configuration
      • Every router uses Dijkstra’s single source shortest path algorithm to build its forwarding table
        • Router builds Shortest Path Tree (SPT) with itself as root
        • Shortest Path Calculation (SPF)
      • Packets are forwarded along shortest paths defined by link weights
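The shortest-path calculation described on this slide can be illustrated with a textbook version of Dijkstra's algorithm run from one router as root. This is a generic sketch, not code from any router or from the OSPF Monitor; the adjacency-dictionary graph format, the function name `shortest_path_tree`, and the example weights are assumptions.

```python
import heapq


def shortest_path_tree(graph, root):
    """Run Dijkstra's algorithm from root.

    graph maps router -> {neighbor: link_weight}, with non-negative weights.
    Returns (dist, parent): the cost from root to every reachable router,
    and each router's predecessor in the shortest path tree.
    """
    dist = {root: 0}
    parent = {root: None}
    heap = [(0, root)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist[u]:
            continue  # stale heap entry; a shorter path was already found
        for v, weight in graph.get(u, {}).items():
            candidate = d + weight
            if candidate < dist.get(v, float("inf")):
                dist[v] = candidate
                parent[v] = u
                heapq.heappush(heap, (candidate, v))
    return dist, parent


# Hypothetical three-router topology with symmetric link weights.
topology = {
    "R1": {"R2": 100, "R3": 200},
    "R2": {"R1": 100, "R3": 50},
    "R3": {"R1": 200, "R2": 50},
}
dist, parent = shortest_path_tree(topology, "R1")
print(dist["R3"], parent["R3"])  # 150 R2
```

In this example R1 reaches R3 through R2 at cost 150 rather than over the direct link at cost 200, so R1's forwarding table would point traffic for R3 at R2.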

  24. Areas in OSPF
      [Diagram: Area 0 connected to Area 1 and Area 2 through border routers.]
      • OSPF allows domain to be divided into areas for scalability
        • Areas are numbered 0, 1, 2 …
        • Hub-and-spoke with area 0 as hub
      • Every link is assigned to exactly one area
      • Routers with links in multiple areas are called border routers

  25. Summarization with Areas
      [Diagram: two panels, “OSPF domain” and “R1’s View”, showing routers R1, R2 and R3 in Area 0, border routers B1 and B2, and routers C1 and C2 with subnets 10.10.4.0/24 and 10.10.5.0/24 in Area 1. The assignment of link costs to links is not recoverable from the transcript.]
      • Each router learns
        • Entire topology of its attached areas
        • Information about subnets in remote areas and their distance from the border routers
          • Distance = sum of link costs from border router to subnet
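Because a router sees only the distance from each border router to a remote subnet, it reaches the subnet by adding its own cost to each border router and picking the smallest total. A minimal sketch of that selection follows; the function name `best_border_router` and all costs in the example are hypothetical and are not the values from the slide's diagram.

```python
def best_border_router(cost_to_border, advertised_distance):
    """Pick the border router with the lowest total cost to a remote subnet.

    cost_to_border: {border_router: cost from this router to it}
    advertised_distance: {border_router: distance it advertises to the subnet}
    Returns (border_router, total_cost).
    """
    totals = {
        border: cost_to_border[border] + distance
        for border, distance in advertised_distance.items()
        if border in cost_to_border  # skip border routers we cannot reach
    }
    best = min(totals, key=totals.get)
    return best, totals[best]


# Hypothetical costs: B1 is closer to the subnet, but B2 is closer to us.
print(best_border_router({"B1": 500, "B2": 400}, {"B1": 30, "B2": 80}))
# ('B2', 480)
```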

  26. Link State Advertisements (LSAs)
      • Every router describes its local connectivity in Link State Advertisements (LSAs)
      • Router originates an LSA due to…
        • Change in network topology
          • Example: link goes down or comes up
        • Periodic soft-state refresh
          • Recommended value of interval is 30 minutes
      • LSA is flooded to other routers in the domain
        • Flooding is reliable and hop-by-hop
        • Includes change and refresh LSAs
        • Flooding leads to duplicate copies of LSAs being received
      • Every router stores LSAs (self-originated + received) in link-state database (= topology graph)

  27. Adjacency
      • Neighbor routers (i.e., routers connected by a physical link) form an adjacency
      • The purpose is to make sure
        • Link is operational and routers can communicate with each other
        • Neighbor routers have consistent view of network topology
          • To avoid loops and black holes
      • Link gets used for data forwarding only after adjacency is established
      • Use of periodic Hellos to monitor the status of link and adjacency

  28. Equipment Problem at Enterprise Network
      • Internal errors in a router in area 0
        • Episodes where router would drop adjacencies with other routers
      • Problem manifested in LSAG as “ADJ UP” and “ADJ DOWN” messages
        • Not visible in other network management systems
      • Led to proactive maintenance

  29. LSA Traffic in Enterprise Network
      [Plots: LSA traffic by day for Area 0, Area 2, Area 3 and Area 4, broken down into refresh LSAs, change LSAs and duplicate LSAs. X-axis: days. Annotations: “Genuine Anomaly” (twice), “Artifact: 23 hr day (Apr 7)”.]

  30. Overhead: Duplicate LSAs
      • Why do some areas witness substantial duplicate LSA traffic, while other areas do not witness any?
        • OSPF flooding over LANs leads to control plane asymmetries and to imbalances in duplicate LSA traffic
      [Plot; x-axis: days.]
