1 / 24

A Case-study of OSPF Behavior in a Large Enterprise Network

A Case-study of OSPF Behavior in a Large Enterprise Network. Aman Shaikh, UCSC Chris Isett, Siemens Health Services Albert Greenberg, AT&T Labs-Research Matthew Roughan, AT&T Labs-Research Joel Gottlieb, AT&T Labs-Research IMW – November 07, 2002. Why Study OSPF Behavior?.

johana
Download Presentation

A Case-study of OSPF Behavior in a Large Enterprise Network

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. A Case-study of OSPF Behavior in a Large Enterprise Network Aman Shaikh, UCSC Chris Isett, Siemens Health Services Albert Greenberg, AT&T Labs-Research Matthew Roughan, AT&T Labs-Research Joel Gottlieb, AT&T Labs-Research IMW – November 07, 2002 IMW - 2002

  2. Why Study OSPF Behavior? • Any meaningful performance assurance depends on routing stability • An internal network change (OSPF event) can have major impact on services, flows and customers • Transients can degrade services significantly (e.g., VoIP) • Expectations for IP network management are higher • Improve OSPF performance, particularly reliable and fast detection of topology change, without introducing instabilities • Changes are needed • Parameter adjustment or more fundamental • Realistic workload model for simulations are needed • Testing scalability, convergence, reliability • However, the behavior and performance of OSPF in large ISPs and enterprise networks is not well understood IMW - 2002

  3. OSPF • OSPF is a Link-state routing protocol • All routers in the domain come to a consistent view of the topology by exchange of Link State Advertisements (LSAs) • Router describes its local connectivity (i.e., set of links) in an LSA • Set of LSAs (self-originated + received) at a router = topology • Hierarchical routing • OSPF domain can be divided into areas • Hub-and-spoke topology with area 0 as hub and other non-zero areas as spokes IMW - 2002

  4. OSPF Performance • OSPF processing impacts convergence, (in)stability • Load is increasing as networks grow • Bulk of OSPF processing is due to LSAs • Sending/receiving LSAs • LSAs can trigger Route calculation (Dijkstra’s algorithm) • Understanding dynamics of LSA traffic is key for a better understanding of OSPF IMW - 2002

  5. Methodology • Categorize and baseline LSA traffic • Detect, diagnose and act on anomalies • Propose changes to improve performance IMW - 2002

  6. Categorizing LSA Traffic • A router originates an LSA due to… • Change in network topology • Example: link goes down or comes up • Detection of anomalies and problems • Periodic soft-state refresh • Recommended value of interval is 30 minutes • Forms baseline LSA traffic • LSAs are disseminated using reliable flooding • Includes change and refresh LSAs • Flooding leads to duplicate copies of LSAs being received at a router • Overhead: wastes resources Change LSAs Refresh LSAs Duplicate LSAs IMW - 2002

  7. Highlights of the Results • Categorize, baseline and predict • Categories: Refresh, Change, Duplicate; External, Internal • Bulk of LSA traffic is due to refresh • Refresh LSA traffic is smooth: no evidence of refresh synchronization across network • Refresh LSA traffic is predictable from router configuration info • Detect, diagnose and act • Almost all LSAs arise from persistent yet partial failure modes • Internal LSA spikes • Indicate router hardware degradation • Carry out preventive maintenance • External LSA spikes • Indicate degradation in customer connectivity • Call customer before customer calls you • Propose Improvements • Simple configuration changes to reduce duplicate LSA traffic IMW - 2002

  8. Enterprise Network Case Study • The network provides customers with connectivity to applications and databases residing in the data center • OSPF network • 15 areas, 500 routers • This case study covers 8 areas, 250 routers • One month: April 2002 • Link-layer = Ethernet-based LANs • Customers are connected via leased lines • Customer routes are injected via EIGRP into OSPF • The routes are propagated via external LSAs • Quite reasonable for the enterprise network in question IMW - 2002

  9. External (EIGRP) Area A LAN1 LAN2 B1 B2 Monitor Border rtrs Area 0 Enterprise Network Topology Customer Customer Customer OSPF Domain Area A Area B Area 0 Area C Servers Database Applications Monitor is completely passive No adjacencies with any routers Receives LSAs on a multicast group IMW - 2002

  10. Area 0 Area 2 Genuine Anomaly Genuine Anomaly Days Days Artifact: 23 hr day (Apr 7) Days Days Area 3 Area 4 LSA Traffic in Different Areas Refresh LSAs Change LSAs Duplicate LSAs IMW - 2002

  11. Baseline LSA Traffic: Refresh LSAs • Refresh LSA traffic can be reliably predicted using information available in router configuration files • Important for workload modeling • See paper for details Days Days Area 2 Area 3 IMW - 2002

  12. Refresh process is not synchronized Negligible LSA clumping • No evidence of synchronization • Contrary to simulation-based study in [Basu01] • Reasons • Changes in the topology help break synchronization • LSA refresh at one router is not coupled with LSA refresh at other routers • Drift in the refresh interval of different routers IMW - 2002

  13. Anomaly Detection: Change LSAs Days • Internal to OSPF domain versus external • Change LSAs due to external events dominated • Not surprising due to large number of leased lines used to import customer routes into OSPF • Customer volatility  network volatility IMW - 2002

  14. Root Causes of Change LSAs • Persistent problem  flapping  numerous change LSAs • Internal LSA spikes  hardware router problems • OSPF monitor identified a problem (not visible to SNMP-based network mgt tools) early and led to preventive maintenance • External LSA spikes  customer route volatility • Overload of an external link to a customer between 8 pm – 4 am causes EIGRP session on that link to flap IMW - 2002

  15. Overhead: Duplicate LSAs • Why do some areas witness substantial duplicate LSA traffic, while other areas do not witness any? • OSPF flooding over LANs leads to control plane asymmetries and to imbalances in duplicate LSA traffic Days IMW - 2002

  16. LSA Flooding over Broadcast LANs • DR = Designated router, BDR = Backup Designated Router • Who becomes DR and BDR depends on configuration • Flooding on a LAN is a two-step process: • A router multicasts LSA to DR and BDR • DR or BDR multicasts LSA to other routers • LSA appears only twice on LAN instead of n – 1 times LAN DR BDR IMW - 2002

  17. Control Plane Asymmetry • Two LANs (LAN1 and LAN2) in each area • Monitor is on LAN1 • Routers B1 and B2 are connected to LAN1 and LAN2 • LSAs originated on LAN2 can get duplicated depending on which routers have become DR and BDR on LAN1 • Leads to control plane asymmetry • Four cases IMW - 2002

  18. Case 3 (R, B1) Case 4 (R, R’) Case 2 (B1, R) Case 1 (B1, B2) DR DR LAN1 LAN1 LAN1 LAN1 B1 (BDR) B1 (DR) B2 (BDR) B1 (DR) B1 B2 B2 B2 LAN2 LAN2 LAN2 LAN2 Four Cases IMW - 2002

  19. Eliminating Duplicate LSA Traffic IMW - 2002

  20. Summary • Categorize and baseline LSA traffic • Refresh LSAs: constitute bulk of overall LSA traffic • No evidence of synchronization between different routers • Refresh LSA traffic predictable from configuration information • Detect, diagnose and act on anomalies • Change LSAs: can indicate persistent yet partial failure modes • Internal LSA spikes  hardware router problems  preventive router maintenance • External LSA spikes  customer congestion problems  “preventive” customer care • Propose changes to improve performance • Duplicate LSAs: can arise from control plane asymmetries • Simple configuration changes can eliminate duplicate LSAs and improve performance IMW - 2002

  21. Future Work • Study OSPF behavior in other commercial networks • ISPs, enterprise networks • Longer term studies • Combine with other data sources • BGP: interaction with OSPF • Traffic: impact of routing on forwarding • Convergence • Better monitoring and management tools • Good simulation models • Combine with router-level measurements [Shaikh & Greenberg, IMW ‘01] IMW - 2002

  22. Backup IMW - 2002

  23. Questions • OSPF is a Link-state routing protocol • All routers in the domain come to a consistent view of the topology by exchange of Link State Advertisements (LSAs) • Three categories of LSAs: refresh, change, duplicate • Refresh • Is the refresh traffic predictable? Can it be baselined? • Is refresh traffic synchronized in real networks? • Change • What is the nature of change LSA traffic, arising from internal and external sources? • What do the failure modes look like? • Is it possible to use this traffic to trigger preventive maintenance traffic (e.g., just as measurements of bit error rates triggers preventive maintenance of the data plane) • Duplicate • Can duplicate LSAs be reduced? At what cost to reliability? IMW - 2002

  24. SPF Calculation LSA LSA Data packet LS Ack Data packet Router Model LSA Processing Route Processor (CPU) OSPF Process LSA Flooding Topology View SPF Calculation FIB Update FIB Forwarding Forwarding Switching Fabric Interface card Interface card IMW - 2002

More Related