ABC Co. Network Implementation • High reliability is primary concern • near 100% uptime required • Customer SLA has stiff penalty clauses • Everything is designed in a redundant fashion • Network redundancy not integrated with system design or application design. • Application and system design not integrated • Management added last (to fix problems)
The challenge is always politics • Politics prevents different parts of the company from working together. • Networking, Systems, and Applications are three different groups. • The Systems group owns the management issues. • Some requirements get in the way: • e.g. the management station must keep its data on the database server.
Network design • “Dual Everything” is the design rule • Dual Routers/hubs (Cisco 5500’s) • Dual Ethernet • Dual attached systems
A simple picture • [Diagram labels: redundant net to customers; two Rtr/Hub units; dual-rail Ethernet; Server a through Server n; TNG; DNS; WINS]
More detail • No actual “Ethernet bus” • Systems connect to the 5500s via UTP • Each system connects to both 5500s • One connection is to the “primary” LAN, the other to the secondary LAN • Half have the “left” 5500 as primary, the others have the “right” as primary. • 5500s run OSPF and “router cluster” software
Problems... • The server operating systems (NT and Unix) do not switch off the primary interface if it fails; they keep trying to use it, so applications hang and connections time out. • DNS points to only one interface on each server. • No automatic failover is built into the applications.
Management software must: • Detect NIC failures • Continue to monitor system agents in presence of network failures • Correct server routing tables if primary interface fails (or the hub fails) • Update DNS • Notify operations as required.
Challenges • Get each system to report all status via both NICs. • Monitor system over both NICs. • Prevent duplicate notifications. • Fail over as fast as possible. • Show connectivity of each system to both networks.
What needs to be done to do this? • Modify the auto discovery scripts to add each system twice, as two independent systems. • This requires a private host file for name/address translation (cannot depend on access to DNS) • Invent a system to recognize which interface is “active” and block messages from the other NIC(s)
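The double registration and the private host file can be pictured with a short Python sketch. This is an illustration only, not the scripts used at ABC Co.: the function name, the input layout, and the addresses are assumptions, and the -p/-s suffixes follow the naming convention described under “Tricks used”.

```python
def build_private_hosts(servers):
    """Return hosts-file text with two entries per server.

    servers maps a server name to (primary_ip, secondary_ip).
    Hypothetical helper: the real discovery scripts are not shown in the slides.
    """
    lines = []
    for name in sorted(servers):
        primary_ip, secondary_ip = servers[name]
        # Each interface is registered as if it were an independent system.
        lines.append(f"{primary_ip}\t{name}-p")
        lines.append(f"{secondary_ip}\t{name}-s")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Documentation-range addresses, made up for the example.
    print(build_private_hosts({"srv1": ("192.0.2.10", "198.51.100.10")}), end="")
```

Keeping this table in a local file means name lookups keep working when the DNS server itself is unreachable.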
More work... • Duplicate any information in Object Repository that is needed to manage failover onto local system (cannot trust access to SQL server) • Store current connectivity state for all servers (added ILPs to class definitions).
Tricks used • Each system name in messages has a code added to the end to indicate the interface address (-p or -s) • Most of the work is done in event message processing. • Each “raw” message is suppressed and a script is invoked to process it. • Ping successes/failures are used to switch state • Agent messages are dropped based on state and the p/s flag
Basic set of flows • For each event (other than pings): • If the message comes from the interface that is not the current mode (mode P with a message from S, or mode S with a message from P; the mode is kept in the NT Registry), discard it. • Else, reformat the message with the real server name, improve the content (system class, etc.) and send it back to the event console as a new message
More Flow • For each Ping Success/Fail reported: • Remember that DSM has already done the retries • On a failure, check whether the other port fails, too. If the other port is also dead, declare the node down and reset the state to primary. • If the failed port is the primary, fail over to the secondary. If it is the secondary, fail over back to the primary. • Update DNS in all cases.
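The event flow above can be sketched in Python. This is a minimal illustration under stated assumptions, not the original REXX: the function names and the output message layout are invented for the example, and the mode is passed in as a parameter rather than read from the NT Registry.

```python
def split_interface(tagged_name):
    """Split 'srv1-p' into ('srv1', 'P'); return (name, None) if untagged."""
    for suffix, flag in (("-p", "P"), ("-s", "S")):
        if tagged_name.lower().endswith(suffix):
            return tagged_name[: -len(suffix)], flag
    return tagged_name, None


def process_event(tagged_name, text, mode, system_class="unknown"):
    """Return the reformatted message, or None if the event is discarded.

    mode is the current state for the server, "P" or "S".
    The output format is an assumption made for this sketch.
    """
    server, flag = split_interface(tagged_name)
    if flag is not None and flag != mode:
        # Message arrived via the interface that is not currently active.
        return None
    # Rebuild the message with the real server name and extra context.
    return f"{server} [{system_class}] {text}"


if __name__ == "__main__":
    print(process_event("srv1-p", "disk full", "P", "NT"))
    print(process_event("srv1-s", "disk full", "P", "NT"))
```

Because every agent reports over both NICs, dropping the copy from the inactive side is what prevents duplicate notifications.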
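The ping-failure rules amount to a small state transition, sketched here in Python. The function name and return shape are assumptions for illustration. The slides do not say what happens when the standby port fails while the active one is healthy; the sketch assumes the mode is left unchanged in that case.

```python
def on_ping_failure(mode, failed_port, other_port_alive):
    """Return (new_mode, node_up) after a ping failure on failed_port.

    mode and failed_port are "P" (primary) or "S" (secondary).
    The caller is expected to update DNS after every call, as the slides require.
    """
    if not other_port_alive:
        # Both ports are dead: node is down, state resets to primary.
        return "P", False
    if failed_port == mode:
        # The active port failed: switch to the other one.
        return ("S" if mode == "P" else "P"), True
    # Assumption: a failure on the standby port leaves the mode unchanged.
    return mode, True


if __name__ == "__main__":
    print(on_ping_failure("P", "P", True))
    print(on_ping_failure("S", "S", False))
```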
Router / Hub failure • If the router/hub fails, invoke the primary failover script for each node connected to the primary side, and the secondary failover script for each node connected to the secondary side. • This is effectively all the nodes, so we don’t have to wait for each to have a ping failure. The system will stabilize faster.
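One way to read the hub-failure rule is as a bulk dispatch over all nodes, sketched below in Python. The script names, the "left"/"right" labels as dictionary values, and the function itself are hypothetical; the sketch also assumes that a node whose primary LAN is on the failed hub gets the primary failover script and every other node gets the secondary one.

```python
def on_hub_failure(failed_hub, primary_hub_of):
    """Map each node to the failover script to run when failed_hub goes down.

    primary_hub_of maps a node name to the hub ("left" or "right")
    that carries its primary LAN. Script names are placeholders.
    """
    actions = {}
    for node, primary_hub in sorted(primary_hub_of.items()):
        if primary_hub == failed_hub:
            # The node lost its primary side.
            actions[node] = "primary_failover"
        else:
            # The node lost its secondary side.
            actions[node] = "secondary_failover"
    return actions


if __name__ == "__main__":
    print(on_hub_failure("left", {"srv1": "left", "srv2": "right"}))
```

Running the scripts for every node at once avoids waiting for each node's own ping failure to arrive.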
Does it work? • You bet! It required: • Some special REXX scripts for failover • A few Basic programs • A hack to the auto discovery scripts. • Some magic with Trix and a few more basic programs.