This document outlines the architecture and test coverage required for achieving 99.999% availability in IP networks. It discusses the importance of reliability, the contributing factors for availability, and the tests needed to ensure high availability.
Core Router Testing for High Availability Scott Poretsky Avici Systems, Inc. June 3, 2002 Avici Company Confidential
Outline • IP Network Availability • Test Coverage for 99.999% Availability • Commercial Test Equipment Requirements Architecture for the 21st Century Network
IP Network Availability
High Reliability = More Revenue Reliability is the single biggest criterion in selecting an ISP, according to the Interactive Week/Telechoice ISP Customer Survey. [Figure: ISP Customer Survey, Relative Importance (axis from 4 to 4.8) of Reliability, Value, Performance, Customer Service, and Provisioning Speed] New IP services demand higher levels of network reliability.
High Reliability = More Profit Compensation for poor router reliability through redundancy and interconnects can increase network cost by up to 50%. [Figure: IP Backbone layers: Service Provider peering, Core Layer (Backbone Router), Aggregation Layer (Hub Router), Edge Layer, and Access Devices (DSLAM, CMTS, GGSN, L3/4 Switch, VOIP, Direct Connects)]
Definitions • Reliable • Capable of being dependable (Webster) • Availability • Measure of Reliability using router/switch Uptime • Mission Reliability • Mean Time Between Critical Failures (MTBCF), or the average time between hardware or software failures that interrupt service (the mission) • Maintenance Reliability • Mean Time Between Failures (MTBF), or the average time between hardware failures that require corrective maintenance actions • Defects Per Million (DPM) • Measure of downtime equal to (1 – Availability) × 10^6
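The DPM definition can be checked with a few lines of arithmetic. The sketch below is illustrative only: the function names and the 365-day year are assumptions made here, not part of the original slides.

```python
# Illustrative sketch: function names and the 365-day year are assumptions.

MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes, ignoring leap years


def defects_per_million(availability: float) -> float:
    """DPM = (1 - Availability) x 10^6, as defined on the slide."""
    return (1.0 - availability) * 1_000_000


def downtime_minutes_per_year(availability: float) -> float:
    """Expected downtime per year implied by an availability figure."""
    return (1.0 - availability) * MINUTES_PER_YEAR


if __name__ == "__main__":
    for availability in (0.999, 0.99999):
        dpm = defects_per_million(availability)
        downtime = downtime_minutes_per_year(availability)
        print(f"{availability:.3%}: {dpm:.0f} DPM, {downtime:.1f} min/year down")
```

Under these assumptions, 99.9% availability corresponds to about 1,000 DPM (roughly 526 minutes of downtime per year), while 99.999% corresponds to about 10 DPM (roughly 5.3 minutes per year).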
Contributing Factors for Availability • Mission Reliability: Total Time to Restore Router/Switch After a Software Failure • Timeline (not to scale): Software Failure Occurs → Crash Dump Time → Image Upgrade Time → Boot Time → Protocol Convergence Time → Full Operation Restored • Maintenance Reliability: Total Time to Restore a Module After a Hardware Failure • Timeline (not to scale): Hardware Failure Occurs → Maintainer Response Time → Removal and Replacement Time → Boot Time → Protocol Convergence Time → Full Operation Restored
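The restore-time phases can be tied back to availability with a short calculation. This is a minimal sketch under stated assumptions: every duration is a made-up placeholder, and Availability = MTBCF / (MTBCF + restore time) is the standard steady-state formula, which the slides do not state explicitly.

```python
# Illustrative sketch: all durations are hypothetical placeholders, and the
# availability formula is the standard steady-state one, not from the slides.

def total_restore_time(*phase_seconds: float) -> float:
    """Sum the sequential phases between failure and full restoration."""
    return sum(phase_seconds)


def availability(mtbcf_seconds: float, restore_seconds: float) -> float:
    """Steady-state availability from mean time between critical failures
    and mean time to restore service."""
    return mtbcf_seconds / (mtbcf_seconds + restore_seconds)


if __name__ == "__main__":
    # Hypothetical software-failure phases, in seconds.
    software_restore = total_restore_time(
        60,   # crash dump time
        0,    # image upgrade time (none needed in this example)
        180,  # boot time
        120,  # protocol convergence time
    )
    mtbcf = 30 * 24 * 3600  # hypothetical: one critical failure every 30 days
    print(f"restore time: {software_restore:.0f} s")
    print(f"availability: {availability(mtbcf, software_restore):.5%}")
```

In this model, shortening any restore phase or lengthening the time between critical failures both raise availability.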
The Availability Goal • The Goal – 99.999% Router Availability • The Reality – 99.9% Router Availability • Features to achieve 99.999% availability. • Non-Stop Routing • Graceful Restart • What if testing could improve Mission Reliability to achieve 99.999% Availability in the absence of new features? • What if the addition of these new features would then achieve 99.9999% Availability?
Test Coverage
Traditional Test Coverage • Isolated testing of protocols • Functionality • Conformance • Interoperability • Scaling • Forwarding Performance in the absence of protocols. • Disadvantages • Operational environment is not tested • Operational conditions are not tested • The router under test is not completely stressed. Deployed routers run multiple protocols simultaneously.
Test Program for 99.999% Availability • Stress Testing • Longevity Testing • Convergence Testing • Network-Specific Topology Testing • Automated Regression Testing
Stress Testing • Simultaneous configuration and scaling of multiple protocols. • BGP, IGP • MPLS-TE, LDP (optional) • MBGP, PIM-SM, MSDP (optional) • Traffic Forwarding • Line Rate Traffic Forwarding • Overutilize links • Enable QoS • Network Instability • Repeated Route Flaps • Link Loss • Tunnel Reroutes (optional) • Serviceability • Repeated SNMP Gets • Logging Enabled • Debug Enabled • Telnet with SHOW commands (stressful and invalid)
Stress Configuration [Figure: Router Under Test, two Neighbor Routers, an Optional Neighbor Router for Tunnel Reroutes, and three Test Equipment units]
Stress Execution Guidelines • Configure ECMP, Parallel Paths, and Composite Links between routers • Use Live BGP Feed for Route Table • Mix traffic types across links (IP Unicast, IP Multicast, MPLS) • One neighbor router should be from a different vendor to show interoperability under stress • Run Stress for many days (if the router lasts that long) The router should experience more in a couple of days than it likely would in its operational lifetime.
Typical Stress Metrics • Flap 1 million BGP routes per hour • Forward 10 Terabits of data per hour • Perform 100,000 SNMP Gets per hour • Simulate 100 fiber cuts per hour (use every remote interface) • Along with • Full BGP Table • Full IGP Table • Full Multicast Cache • Required MPLS-TE Tunnels (protection optional) • Required LDP FECs • Enable Logging and Protocol Debug
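To see what these hourly targets mean as sustained load, they can be converted to per-second rates. The sketch below is illustrative; the dictionary keys are labels chosen here, and only the hourly figures come from the slide.

```python
# Illustrative sketch: converts the slide's per-hour stress targets into
# sustained per-second rates. The dictionary keys are labels chosen here.

SECONDS_PER_HOUR = 3600

HOURLY_TARGETS = {
    "BGP route flaps": 1_000_000,
    "bits forwarded": 10 * 10**12,  # 10 Terabits
    "SNMP Gets": 100_000,
    "simulated fiber cuts": 100,
}


def per_second(hourly_count: float) -> float:
    """Average per-second rate needed to hit an hourly target."""
    return hourly_count / SECONDS_PER_HOUR


if __name__ == "__main__":
    for name, hourly in HOURLY_TARGETS.items():
        print(f"{name}: {per_second(hourly):,.2f} per second")
```

That works out to roughly 278 route flaps and 28 SNMP Gets per second, about 2.8 Gb/s of sustained traffic, and one simulated fiber cut every 36 seconds.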
Longevity Testing • Similar to Stress Testing, but with more operational (less stressful) conditions injected over many weeks. • Simultaneous configuration and scaling of multiple protocols • Traffic Forwarding • More realistic Network Instability • More typical Serviceability actions • Use Live Internet feed.
Convergence Terms • Network Convergence - The point in time at which all nodes in a network have updated their routing tables for a route entry change (new, withdrawal, or modification) • Protocol Convergence - The point in time at which a single node updates its routing table and advertises the route table change to its peer in a routing protocol advertisement (or update) message. • Route Convergence - The point in time at which a single node updates its routing table and reroutes traffic out the new interface. Route Convergence is the common Router Benchmark.
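The three terms differ only in which events must have completed. The sketch below shows one way to express that; `NodeEvents` and its fields are hypothetical names, and timestamps are assumed to be seconds measured from the moment the route change occurs.

```python
# Illustrative sketch: NodeEvents and its fields are hypothetical names, and
# timestamps are seconds measured from the moment the route change occurs.
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class NodeEvents:
    table_updated: float      # routing table updated
    update_advertised: float  # routing update sent to the peer
    traffic_rerouted: float   # traffic leaving via the new interface


def protocol_convergence(node: NodeEvents) -> float:
    """Table updated and the change advertised to the peer."""
    return max(node.table_updated, node.update_advertised)


def route_convergence(node: NodeEvents) -> float:
    """Table updated and traffic rerouted out the new interface."""
    return max(node.table_updated, node.traffic_rerouted)


def network_convergence(nodes: list[NodeEvents]) -> float:
    """Every node in the network has updated its routing table."""
    return max(node.table_updated for node in nodes)
```

For example, with `nodes = [NodeEvents(0.2, 0.3, 0.5), NodeEvents(0.8, 0.9, 1.0)]`, the first node's route convergence is 0.5 s while network convergence is 0.8 s, because the slowest node's table update sets the network-wide figure.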
Convergence Test Issues • Large number of Protocols in which Convergence is important. • Number of conditions that can impact results. • Technical difficulty in testing convergence of one protocol due to flap or instability of another protocol.
Convergence Test Conditions • Interface shutdown • on Local Interface • on Remote Interface • Fiber Pull • on Local Interface • on Remote Interface • Peer removal via CLI • on Local router • on Peer router • Peer node failure • Route Table changes • Route Withdrawal • Route Flap • Next-Hop Change • Metric Change • Dynamic Constraint Change • Policy Change All conditions must be tested because different results can be produced.
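Because every condition must be exercised, it helps to enumerate them mechanically rather than by hand. The sketch below is a minimal illustration; the grouping of sub-conditions is an assumption read off the flattened slide text, and the names are labels chosen here.

```python
# Illustrative sketch: the grouping of sub-conditions is an assumption read
# off the flattened slide text, and the names are labels chosen here.

CONDITIONS = {
    "Interface shutdown": ["Local Interface", "Remote Interface"],
    "Fiber Pull": ["Local Interface", "Remote Interface"],
    "Peer removal via CLI": ["Local router", "Peer router"],
    "Peer node failure": [None],  # no sub-variants listed
    "Route Table changes": [
        "Route Withdrawal", "Route Flap", "Next-Hop Change",
        "Metric Change", "Dynamic Constraint Change", "Policy Change",
    ],
}


def convergence_test_cases(conditions=CONDITIONS):
    """Expand each condition into one test case per variant."""
    cases = []
    for condition, variants in conditions.items():
        for variant in variants:
            if variant is None:
                cases.append(condition)
            else:
                cases.append(f"{condition}: {variant}")
    return cases


if __name__ == "__main__":
    for case in convergence_test_cases():
        print(case)
```

With this grouping the list expands to 13 test cases per protocol, before multiplying by the number of protocols whose convergence matters.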
Network-Specific Topology Testing • Large network with many routers (e.g. 10) • Use multiple vendors for interoperability/functionality testing. • Multiple protocols configured in deployment scenario • Run test cases to match deployment scenario
Automated Regression Testing • The addition of bug fixes and new features puts previously working features at risk. • Regression testing ensures that the previously working features still work. • As the number of releases with new features grows, it becomes more difficult to provide complete regression coverage through manual testing (increasingly labor intensive). • Automated regression testing enables more coverage in less time. • Automation is typically achieved using TCL scripts. • Configuration: [Figure: Router Under Test and Test Equipment]
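The core of a regression run is comparing current results against what passed in an earlier release. The sketch below illustrates that idea only: the test names and check functions are hypothetical stand-ins for scripts that would drive real test equipment, and Python is used purely for illustration since the slide notes that TCL is the typical choice.

```python
# Illustrative sketch: the test names and check functions are hypothetical
# stand-ins for real scripts that would drive the test equipment.
from typing import Callable, Dict, List


def run_regression(tests: Dict[str, Callable[[], bool]],
                   baseline: Dict[str, bool]) -> List[str]:
    """Run every test and return the names that passed in the baseline
    release but fail now (i.e. regressions)."""
    regressions = []
    for name, check in tests.items():
        try:
            passed = bool(check())
        except Exception:
            passed = False  # a crashing test counts as a failure
        if baseline.get(name, False) and not passed:
            regressions.append(name)
    return regressions


if __name__ == "__main__":
    tests = {
        "bgp_session_establish": lambda: True,
        "ospf_adjacency": lambda: False,  # simulated regression
    }
    baseline = {"bgp_session_establish": True, "ospf_adjacency": True}
    print(run_regression(tests, baseline))
```

A failing test with no baseline entry is treated as a new-feature failure rather than a regression, which keeps the report focused on previously working features.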
Commercial Test Equipment Requirements
The State of the Union • Test Equipment fails to meet today’s requirements for testing 99.999% Availability. • Router vendors have been forced to develop their own specialized test tools. • Carriers have been forced to use the router vendor test tools. Test Equipment vendors must respond to the challenge today.
Stress Testing Requirements • Maintain BGP Sessions and IGP Adjacencies • Flap BGP Routes • Signal and maintain RSVP-TE tunnels • Distribute LDP FECs • Signal and maintain Multicast Groups • Perform SNMP GETs and check validity • Forward Traffic (IP Unicast, IP Multicast, and MPLS) Make the network seem much bigger than it really is without having to obtain hundreds of routers.
Required Protocol Emulation/Conformance Suite Coverage • Routing Protocols • BGP • OSPF, ISIS • OSPF-TE, ISIS-TE • RSVP-TE • Fast Reroute • Standby Tunnels • Ingress, Mid-Point, Egress • LDP • RFC 2547 Layer 3 VPNs • Martini Layer 2 VPNs • P and PE • LDP over RSVP • Multicast • MBGP • PIM-SM • MSDP
Protocol Emulation Requirements • Run any protocols in combination on the same interface • Forward traffic for emulated protocols • Protocol Emulation on any interface type – GigE, 10GigE, and POS (including 192c). • Scaling • BGP Sessions >500/system, >100/interface • BGP Routes >3M/system, >500K/session • MPLS-TE Tunnels >10K - Ingress, Mid-Point, Egress • FECs >10K • Load external BGP table for advertisement • Controlled BGP Route Flapping
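The scaling thresholds lend themselves to a simple checklist when evaluating a tester. The sketch below is a hedged illustration: the key names and any capability figures passed in are hypothetical, and only the "greater than" thresholds come from the slide.

```python
# Illustrative sketch: the key names are hypothetical; the thresholds are
# the "greater than" values listed on the slide.

SCALING_MINIMUMS = {
    "bgp_sessions_per_system": 500,
    "bgp_sessions_per_interface": 100,
    "bgp_routes_per_system": 3_000_000,
    "bgp_routes_per_session": 500_000,
    "mpls_te_tunnels": 10_000,
    "fecs": 10_000,
}


def unmet_requirements(capabilities: dict) -> list:
    """Return the requirements a tester does not strictly exceed.
    A capability missing from the input is treated as zero."""
    return [
        name for name, minimum in SCALING_MINIMUMS.items()
        if capabilities.get(name, 0) <= minimum
    ]
```

For instance, a tester that exceeds every threshold except one, supporting exactly 10,000 FECs, would return `["fecs"]`, because the slide asks for more than 10K.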
Automated Regression Requirements • Commercial test equipment vendors offer protocol conformance TCL suites. • Test Case coverage must be improved within each suite • Interaction between protocols must be tested • Need each script to test multiple interfaces (4 or more) • Full Protocol Coverage • Multicast protocols have been the “forgotten son”
System Requirements • Multiple ports per chassis (>32) • Automated Convergence measurement • Automated reroute/failover measurement • Support for ECMP and Composite Links • System/Protocol Stability For Many Days • Ability to store GUI configuration for repeatability. • Ability to TCL script any GUI test case.