2-Node Clustering
Active-Standby Deployment
2-Node Deployment Topology

Active-Standby Requirements
Requirements
• Configuration of the Primary controller in the cluster (Must)
• Primary controller services the Northbound IP address; the Secondary takes over the NB IP upon failover (Must)
• Configuration of whether, on failover and recovery, the configured Primary controller reasserts leadership (Must)
• Configuration of the merge strategy on failover and recovery (Want)
• Primary controller is master of all devices and leader of all shards (Must)
• Single-node operation allowed (access to the datastore without quorum) (Want)
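These requirements can be collected into a single configuration object. The sketch below is a minimal illustration only; the class and field names (TwoNodeClusterConfig, reassertLeadership, and so on) are hypothetical, not part of any ODL API.

```java
// Hypothetical configuration holder for a 2-node active-standby cluster.
// All names are illustrative; ODL's actual clustering config differs.
public final class TwoNodeClusterConfig {
    private final String primaryNodeId;        // Must: configured Primary controller
    private final String northboundIp;         // Must: NB IP serviced by the Primary
    private final boolean reassertLeadership;  // Must: Primary reclaims leadership on recovery
    private final String mergeStrategy;        // Want: name of pluggable merge strategy
    private final boolean allowSingleNodeOps;  // Want: datastore access without quorum

    public TwoNodeClusterConfig(String primaryNodeId, String northboundIp,
                                boolean reassertLeadership, String mergeStrategy,
                                boolean allowSingleNodeOps) {
        this.primaryNodeId = primaryNodeId;
        this.northboundIp = northboundIp;
        this.reassertLeadership = reassertLeadership;
        this.mergeStrategy = mergeStrategy;
        this.allowSingleNodeOps = allowSingleNodeOps;
    }

    public String primaryNodeId() { return primaryNodeId; }
    public String northboundIp() { return northboundIp; }
    public boolean reassertLeadership() { return reassertLeadership; }
    public String mergeStrategy() { return mergeStrategy; }
    public boolean allowSingleNodeOps() { return allowSingleNodeOps; }
}
```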
Failure of Primary
Scenario 1: Primary Stays Offline

Failover Sequence
• Secondary controller becomes master of all devices and leader of all shards
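A minimal sketch of this failover step, assuming a hypothetical ClusterNode abstraction. Claiming the Northbound IP is shown as a Linux IP-alias command, one of the platform-specific options raised under Open Issues at the end of this deck; the interface name is a deployment-specific assumption.

```java
import java.io.IOException;

// Hypothetical failover handler: when the Primary is detected offline, the
// Secondary assumes device mastership, shard leadership, and the NB IP.
// All names here are illustrative, not actual ODL APIs.
public final class FailoverHandler {
    /** Minimal local-node abstraction, assumed for this sketch. */
    interface ClusterNode {
        void becomeMasterOfAllDevices(); // take the OpenFlow master role everywhere
        void becomeLeaderOfAllShards();  // take Raft leadership of every shard
    }

    private final ClusterNode self;
    private final String northboundIp;

    public FailoverHandler(ClusterNode self, String northboundIp) {
        this.self = self;
        this.northboundIp = northboundIp;
    }

    /** Invoked when the Primary is detected offline. */
    public void onPrimaryFailure() throws IOException {
        self.becomeMasterOfAllDevices();
        self.becomeLeaderOfAllShards();
        // Linux-only example of claiming the NB IP as an address alias;
        // "eth0" is an assumed interface name for illustration.
        new ProcessBuilder("ip", "addr", "add", northboundIp + "/32", "dev", "eth0")
                .inheritIO().start();
    }
}
```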
Failure of Primary
Scenario 2: Primary Comes Back Online

Recovery Sequence
• Controller A comes back online and its data is replaced by all of Controller B’s data
• For the Re-assert Leadership configuration:
  • (ON) Controller A becomes master of all devices and leader of all shards
  • (OFF) Controller B stays master of all devices and maintains leadership of all shards
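The recovery behavior reduces to a branch on the re-assert-leadership flag. A hedged sketch, with all names hypothetical:

```java
// Hypothetical recovery handler for "Primary comes back online".
// Controller A = configured Primary, Controller B = Secondary.
public final class RecoveryHandler {
    /** Minimal node abstraction, assumed for this sketch. */
    interface Node {
        void replaceDataFrom(Node other);
        void becomeMasterOfAllDevices();
        void becomeLeaderOfAllShards();
    }

    private final boolean reassertLeadership;

    public RecoveryHandler(boolean reassertLeadership) {
        this.reassertLeadership = reassertLeadership;
    }

    public void onPrimaryRecovered(Node primary, Node secondary) {
        // The recovered Primary's stale data is replaced by the Secondary's.
        primary.replaceDataFrom(secondary);
        if (reassertLeadership) {
            // (ON) Primary reclaims device mastership and shard leadership.
            primary.becomeMasterOfAllDevices();
            primary.becomeLeaderOfAllShards();
        }
        // (OFF) Secondary simply stays master/leader; nothing to do.
    }
}
```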
Network Partition
Scenario 1: During Network Partition

Failover Sequence
• Controller A becomes master of the devices in its network segment and leader of all shards
• Controller B becomes master of the devices in its network segment and leader of all shards
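The split-brain state can be made concrete: during the partition each controller claims mastership only of the devices it can still reach, yet both claim leadership of all shards, which is exactly the divergence the recovery merge must resolve. A sketch with hypothetical names:

```java
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical split-brain handler: on partition, a controller claims
// mastership only for devices still connected to it, but leads all shards.
public final class PartitionHandler {
    interface Device {}

    /** Minimal controller abstraction, assumed for this sketch. */
    interface Controller {
        boolean isConnectedTo(Device d);
        void becomeMasterOf(Device d);
        void becomeLeaderOfAllShards();
    }

    public void onPartitionDetected(Controller self, List<Device> allDevices) {
        List<Device> reachable = allDevices.stream()
                .filter(self::isConnectedTo)      // devices in this segment only
                .collect(Collectors.toList());
        reachable.forEach(self::becomeMasterOf);
        self.becomeLeaderOfAllShards();           // both sides do this -> divergence
    }
}
```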
Network Partition
Scenario 2: Network Partition Recovers

Recovery Sequence
• Merge data according to the pluggable merge strategy (default: Secondary’s data is replaced with the Primary’s data)
• For the Re-assert Leadership configuration:
  • (ON) Controller A becomes master of all devices and leader of all shards again
  • (OFF) Controller B becomes master of all devices and leader of all shards again
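The pluggable merge strategy maps naturally onto a single-method interface whose default matches the behavior above (the Secondary’s data replaced by the Primary’s). A hedged sketch; the names MergeStrategy, Snapshot, and PRIMARY_WINS are hypothetical:

```java
// Hypothetical pluggable merge strategy, mirroring the slide's default:
// on partition recovery the Secondary's data is replaced by the Primary's.
public interface MergeStrategy {
    /** Opaque stand-in for a shard's datastore contents. */
    interface Snapshot {}

    /** Produce the post-recovery contents for a shard. */
    Snapshot merge(Snapshot primaryData, Snapshot secondaryData);

    /** Default strategy: the Primary's data wins outright. */
    MergeStrategy PRIMARY_WINS = (primaryData, secondaryData) -> primaryData;
}
```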
No-Op Failures
Failures That Do Not Result in Any Role Changes

Scenarios
• Secondary controller failure
• Any single link failure
• Secondary controller loses network connectivity (but device connections to the Primary are maintained)
Cluster Configuration Options
Global & Granular Configuration

Global
• Cluster Leader (aka “Primary”)
  • Allow this to be changed on a live system, e.g. for maintenance
  • Assigned (2-node case), Elected (larger-cluster case)
• Cluster Leader Northbound IP
• Re-assert Leadership on Failover and Recovery
• Network Partition Detection Algorithm (pluggable)
• Global overrides of the Per Device/Group and Per Shard items below

Per Device / Group
• Master / Slave

Per Shard
• Shard Leader (Shard Placement Strategy, pluggable)
• Shard Data Merge (Shard Merge Strategy, pluggable)
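One way to realize the global-override rule is a layered lookup in which a global setting, when present, shadows the per-device and per-shard entries. The sketch below is illustrative only; none of these names exist in ODL.

```java
import java.util.Map;
import java.util.Optional;

// Hypothetical layered configuration: a global override, when present,
// takes precedence over per-device and per-shard settings.
public final class ClusterConfigRegistry {
    private final Map<String, String> globalOverrides;
    private final Map<String, String> perDeviceRole;   // deviceId -> "master"/"slave"
    private final Map<String, String> perShardLeader;  // shardName -> nodeId

    public ClusterConfigRegistry(Map<String, String> globalOverrides,
                                 Map<String, String> perDeviceRole,
                                 Map<String, String> perShardLeader) {
        this.globalOverrides = globalOverrides;
        this.perDeviceRole = perDeviceRole;
        this.perShardLeader = perShardLeader;
    }

    /** Role for a device: the global override wins, then the per-device entry. */
    public Optional<String> roleFor(String deviceId) {
        String global = globalOverrides.get("device-role");
        return Optional.ofNullable(global != null ? global : perDeviceRole.get(deviceId));
    }

    /** Leader for a shard: the global override wins, then the per-shard entry. */
    public Optional<String> leaderFor(String shardName) {
        String global = globalOverrides.get("shard-leader");
        return Optional.ofNullable(global != null ? global : perShardLeader.get(shardName));
    }
}
```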
HA Deployment Scenarios
Simplified Global HA Settings

Can we abstract configurations into admin-defined deployment scenarios?
• e.g. Admin configures 2-Node (Active-Standby):
  • This means the Primary controller is master of all devices and leader of all shards
  • Conflicting configurations are overridden by the deployment scenario
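A hedged sketch of the scenario abstraction: an admin-selected scenario expands into concrete settings that override any conflicting configuration. The enum and the Settings interface are hypothetical.

```java
// Hypothetical deployment-scenario abstraction: an admin picks a scenario
// and it expands into concrete settings, overriding conflicting config.
public enum DeploymentScenario {
    TWO_NODE_ACTIVE_STANDBY {
        @Override
        public void apply(Settings s) {
            // Primary is master of all devices and leader of all shards.
            s.set("device-role", "primary-masters-all");
            s.set("shard-leader", "primary");
        }
    };

    public abstract void apply(Settings s);

    /** Minimal settings sink, assumed for this sketch. */
    public interface Settings {
        void set(String key, String value); // overrides any conflicting value
    }
}
```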
Implementation Dependencies
Potential Changes to Other ODL Projects

Clustering:
• Refactoring of Raft Actor vs. 2-Node Raft Actor code
• Define Cluster Leader
• Define Northbound Cluster Leader IP alias

OpenFlow Plugin:
• OpenFlow Master/Slave roles
• Grouping of Master/Slave roles (aka “Regions”)

System:
• Be able to SUSPEND the Secondary controller to support Standby mode
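The SUSPEND dependency could be captured as a small lifecycle contract along these lines; this is a hypothetical interface, not an existing ODL one.

```java
// Hypothetical controller lifecycle supporting a SUSPEND state for the
// Secondary in active-standby mode.
public interface ControllerLifecycle {
    /** Stop servicing devices and northbound requests but keep replicating data. */
    void suspend();

    /** Resume full operation, e.g. when promoted to Primary. */
    void resume();

    boolean isSuspended();
}
```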
Open Issues
Follow-up Design Discussion Topics

TBD:
• Is the Master/Slave definition too tied to OpenFlow? (Generalize?)
• Should device ownership/mastership be implemented by the OpenFlow Plugin?
• How do we define the Northbound Cluster Leader IP in a platform-independent way? (Linux/Mac OS X: IP alias; Windows: possible)
• Gratuitous ARP on leader change
• When both controllers are active in the network partition scenario, which controller “owns” the Northbound Cluster Leader IP?
• Define controller-wide SUSPEND behavior (how?)
• On Primary failure a new Primary should be elected (in the 2-node case the Secondary is the only candidate)
• How, and whether, to detect management-plane failure? (Heartbeat timeout >> worst-case GC pause?)
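On the last point, the intent of “heartbeat timeout >> worst-case GC” is that the management-plane failure detector must not mistake a long stop-the-world garbage-collection pause for node death. A hedged sizing sketch, where the safety factor is an assumed value:

```java
import java.time.Duration;

// Hypothetical sizing helper: choose a heartbeat timeout much larger than
// the worst-case GC pause so a long pause is not read as node failure.
public final class HeartbeatTuning {
    private static final int SAFETY_FACTOR = 10; // the ">>" from the slide, assumed value

    public static Duration heartbeatTimeout(Duration worstCaseGcPause) {
        return worstCaseGcPause.multipliedBy(SAFETY_FACTOR);
    }

    public static void main(String[] args) {
        // Example: a 2 s worst-case GC pause yields a 20 s heartbeat timeout.
        System.out.println(heartbeatTimeout(Duration.ofSeconds(2)));
    }
}
```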