Improving Robustness in Distributed Systems. Per Bergqvist, per@synapse.se. Erlang User Conference 2001 (courtesy CellPoint Systems AB)
Design base • Cluster of cooperating hosts • Erlang and C • COTS hardware based • Unix based (i.e. Solaris or Linux) • 10/100/1000BASE-T backplane ("system area network")
Cluster • Shared, distributed system configuration • Each host has ONE cluster controller • Controllers dispatch and supervise worker tasks • Master cluster controller: holds the configuration database (persistent replica) • Slave cluster controller: gets its configuration from the master cluster controllers • The cluster is DOWN when all master cluster controllers are inaccessible
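The up/down rule for the cluster can be sketched in C, one of the design-base languages. This is a minimal illustration under assumptions: the `struct controller` type, its `reachable` flag and the functions `cluster_is_up` and `pick_config_source` are hypothetical names, and how reachability is actually detected is left out.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model: one controller per host, master or slave. */
enum controller_role { ROLE_MASTER, ROLE_SLAVE };

struct controller {
    const char *host;          /* host name (illustrative only) */
    enum controller_role role; /* masters hold a persistent config replica */
    bool reachable;            /* outcome of some liveness check (not shown) */
};

/* The cluster is DOWN only when every master controller is inaccessible.
 * Slaves do not count: they only receive configuration from masters. */
bool cluster_is_up(const struct controller *ctrls, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        if (ctrls[i].role == ROLE_MASTER && ctrls[i].reachable)
            return true;
    }
    return false;
}

/* A slave fetches configuration from the first reachable master.
 * Returns that master's index, or -1 when the cluster is DOWN. */
int pick_config_source(const struct controller *ctrls, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        if (ctrls[i].role == ROLE_MASTER && ctrls[i].reachable)
            return (int)i;
    }
    return -1;
}
```

A slave controller would call `pick_config_source` when it needs fresh configuration; a result of -1 corresponds to the cluster being DOWN.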
Typical system (figure labels: Traffic, Firewall, Switch, Control)
Cluster Key Benefits • Single system view • Enforces decoupling of parts of O&M from actual traffic processing
Implementing a cluster • Cluster->Host->Node->NodeData • Cluster-global parameters • Subscription mechanisms for configuration changes • Mnesia as the configuration database on master cluster controllers • Home-brewed configuration distribution to slave controllers (NOT using Mnesia) • (Worker) node supervision
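A subscription mechanism for configuration changes might look like the following minimal C sketch. It assumes a fixed-size, single-threaded subscriber table; `conf_subscribe`, `conf_notify` and the example subscriber `count_changes` are hypothetical names, not the system's actual interface.

```c
#include <stddef.h>
#include <string.h>

#define MAX_SUBSCRIBERS 16

/* Callback invoked with the parameter name and its new value. */
typedef void (*conf_callback)(const char *key, const char *value, void *ctx);

struct subscription {
    const char *key;   /* parameter to watch (must outlive the entry); NULL = all */
    conf_callback cb;
    void *ctx;         /* opaque pointer handed back to the callback */
};

/* Hypothetical single-threaded subscriber table. */
static struct subscription subs[MAX_SUBSCRIBERS];
static size_t n_subs;

/* Register interest in one parameter. Returns 0 on success, -1 if full. */
int conf_subscribe(const char *key, conf_callback cb, void *ctx)
{
    if (n_subs == MAX_SUBSCRIBERS)
        return -1;
    subs[n_subs++] = (struct subscription){ key, cb, ctx };
    return 0;
}

/* Called when a configuration change arrives (e.g. pushed from a master
 * controller). Notifies every matching subscriber and returns how many
 * callbacks were invoked. */
size_t conf_notify(const char *key, const char *value)
{
    size_t fired = 0;
    for (size_t i = 0; i < n_subs; i++) {
        if (subs[i].key == NULL || strcmp(subs[i].key, key) == 0) {
            subs[i].cb(key, value, subs[i].ctx);
            fired++;
        }
    }
    return fired;
}

/* Example subscriber: counts the notifications it receives in *ctx. */
static void count_changes(const char *key, const char *value, void *ctx)
{
    (void)key;
    (void)value;
    ++*(int *)ctx;
}
```

Keying subscriptions by parameter name keeps a worker from being woken by changes it does not care about; a NULL key gives a catch-all subscriber.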
Mnesia gotchas • First distributed node startup • Disallow writes when not all replicas are accessible • Use a timeout on table load, then force load
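In Mnesia itself the load policy maps onto `mnesia:wait_for_tables/2`, which returns `{timeout, BadTabs}` when the timeout expires, followed by `mnesia:force_load_table/1` for the tables still missing. The C sketch below only models the two decisions in a language-neutral way; `write_allowed`, `table_load_step` and the enum values are hypothetical names.

```c
#include <stdbool.h>
#include <stddef.h>

/* Write guard (hypothetical model): accept a configuration write only when
 * every replica is reachable, so that the two sides of a network partition
 * cannot diverge. An empty replica list never accepts writes. */
bool write_allowed(const bool *replica_reachable, size_t n_replicas)
{
    for (size_t i = 0; i < n_replicas; i++)
        if (!replica_reachable[i])
            return false;
    return n_replicas > 0;
}

enum load_action { LOAD_DONE, LOAD_WAIT, LOAD_FORCE };

/* Table-load policy: wait for the other replicas for at most timeout_ms,
 * then fall back to forcing the load from the local copy. */
enum load_action table_load_step(bool loaded, unsigned waited_ms,
                                 unsigned timeout_ms)
{
    if (loaded)
        return LOAD_DONE;
    return waited_ms < timeout_ms ? LOAD_WAIT : LOAD_FORCE;
}
```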
... BUT ... • TCP-based distribution • Network partitioning
Network parameters • Align TCP retransmission intervals w/ Erlang heartbeats • Align TCP and IP rerouting parameters
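To see what aligning the two timers means in numbers, the sketch below computes when TCP retransmits under exponential backoff and compares that with the earliest point at which Erlang may declare a silent peer down, approximated here as 0.75 times `net_ticktime` (check the kernel application documentation for your release). The backoff model (doubling, capped at a maximum RTO) is a simplification, and all function names and parameter values are hypothetical.

```c
#include <stdint.h>

/* Next retransmission timeout: exponential backoff, capped at rto_max. */
static uint64_t backoff(uint64_t rto, uint64_t rto_max)
{
    return rto * 2 > rto_max ? rto_max : rto * 2;
}

/* Time from the original send until TCP gives up after `retries`
 * unanswered retransmissions: the sum of retries + 1 timeouts. */
uint64_t abort_after_ms(uint64_t rto0, uint64_t rto_max, unsigned retries)
{
    uint64_t total = 0, rto = rto0 < rto_max ? rto0 : rto_max;
    for (unsigned i = 0; i <= retries; i++) {
        total += rto;
        rto = backoff(rto, rto_max);
    }
    return total;
}

/* Number of retransmissions sent before deadline_ms (measured from the
 * original send), i.e. how many chances a rerouted path gets. */
unsigned retransmits_before(uint64_t rto0, uint64_t rto_max,
                            uint64_t deadline_ms)
{
    unsigned count = 0;
    uint64_t t = 0, rto = rto0 < rto_max ? rto0 : rto_max;
    if (rto == 0)
        return 0;                 /* degenerate input: avoid looping forever */
    for (;;) {
        t += rto;                 /* instant of the next retransmission */
        if (t >= deadline_ms)
            return count;
        count++;
        rto = backoff(rto, rto_max);
    }
}

/* Approximate earliest point at which Erlang may declare a silent peer
 * down: 0.75 * net_ticktime, converted to milliseconds. */
uint64_t tick_min_detect_ms(unsigned net_ticktime_s)
{
    return (uint64_t)net_ticktime_s * 750;
}
```

With an initial RTO of 1 s and a 60 s cap, `retransmits_before(1000, 60000, tick_min_detect_ms(60))` gives 5: only five retransmissions (the last at 31 s) happen before Erlang may give up at 45 s. For comparison, `abort_after_ms(200, 120000, 15)` gives 924600 ms, the roughly 924.6 s that the Linux `ip-sysctl` documentation quotes for the default `tcp_retries2` of 15.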
Typical system II: Dual backplane (figure labels: Firewall, Switch, Traffic, Control)
Erlang multi-homing problem (figure: Host A, Host B, Host C)
Multi-home Erlang w/ TCP • Add an alias interface to the loopback interface • Patch the TCP distribution to bind to the alias • Publish the alias interface via (all wanted) real hardware interfaces • Method 1: Static routes and gratuitous/proxy ARP • Method 2: Use a new (routing) protocol
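The "bind to alias" step amounts to choosing the local address of a socket before it is used. In Erlang this change goes into the TCP distribution module; the C sketch below only illustrates the socket-level idea. The function name `socket_bound_to_alias` is hypothetical, and the address passed in stands for whatever address is configured on the loopback alias.

```c
#include <arpa/inet.h>
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

/* Create a TCP socket whose local address is pinned to alias_ip (the
 * address on the loopback alias interface). A later connect() or listen()
 * on it uses that address whichever physical interface carries the
 * packets. Returns the file descriptor, or -1 on error. */
int socket_bound_to_alias(const char *alias_ip, unsigned short port)
{
    struct sockaddr_in addr;
    memset(&addr, 0, sizeof addr);
    addr.sin_family = AF_INET;
    addr.sin_port = htons(port);   /* port 0 lets the kernel pick one */
    if (inet_pton(AF_INET, alias_ip, &addr.sin_addr) != 1)
        return -1;                 /* not a valid dotted-quad address */

    int fd = socket(AF_INET, SOCK_STREAM, 0);
    if (fd < 0)
        return -1;
    if (bind(fd, (struct sockaddr *)&addr, sizeof addr) < 0) {
        close(fd);
        return -1;
    }
    return fd;
}
```

A connection made from such a socket carries the alias as its source address, so the peer keeps seeing the same endpoint while the route underneath moves between interfaces.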
ARP method • Implement a utility to: - broadcast unsolicited ARP responses - respond to ARP requests for the alias interface address • Add static routes on all far-end systems • NOTE: all real interfaces need to be on the same IP subnet
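The frames the ARP utility handles can be sketched as follows. This assumes Ethernet and IPv4 and uses an unsolicited ARP reply whose sender and target protocol addresses are both the alias (one common gratuitous-ARP layout; other variants exist). The names `build_gratuitous_arp` and `is_request_for` are hypothetical, and actually sending and receiving frames (a raw link-layer socket) is OS specific and omitted.

```c
#include <stdint.h>
#include <string.h>

#define ARP_FRAME_LEN 42   /* 14-byte Ethernet header + 28-byte ARP body */

/* Fill frame with an unsolicited ARP reply announcing that alias_ip
 * (four bytes, network order) is reachable at mac. Hosts on the subnet
 * that see it can update their ARP cache entry for the alias. */
void build_gratuitous_arp(uint8_t frame[ARP_FRAME_LEN],
                          const uint8_t mac[6], const uint8_t alias_ip[4])
{
    static const uint8_t bcast[6] = {0xff, 0xff, 0xff, 0xff, 0xff, 0xff};
    uint8_t *p = frame;

    /* Ethernet header: broadcast destination, our MAC, EtherType ARP. */
    memcpy(p, bcast, 6);    p += 6;
    memcpy(p, mac, 6);      p += 6;
    *p++ = 0x08; *p++ = 0x06;

    /* ARP body for Ethernet/IPv4. */
    *p++ = 0x00; *p++ = 0x01;          /* hardware type: Ethernet */
    *p++ = 0x08; *p++ = 0x00;          /* protocol type: IPv4 */
    *p++ = 6;    *p++ = 4;             /* address lengths */
    *p++ = 0x00; *p++ = 0x02;          /* operation: reply */
    memcpy(p, mac, 6);      p += 6;    /* sender hardware address */
    memcpy(p, alias_ip, 4); p += 4;    /* sender protocol address */
    memcpy(p, bcast, 6);    p += 6;    /* target hardware address */
    memcpy(p, alias_ip, 4);            /* target protocol address */
}

/* True if frame is an ARP request asking for alias_ip; the utility would
 * then answer with a reply carrying our MAC. */
int is_request_for(const uint8_t *frame, size_t len,
                   const uint8_t alias_ip[4])
{
    return len >= ARP_FRAME_LEN
        && frame[12] == 0x08 && frame[13] == 0x06   /* EtherType ARP */
        && frame[20] == 0x00 && frame[21] == 0x01   /* operation: request */
        && memcmp(frame + 38, alias_ip, 4) == 0;    /* target IP = alias */
}
```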
New routing protocol • Broadcast (as Ethernet frames) what addresses you have, including interface priority • Let the far end select the path based on what it receives and when • The far end dynamically sets up host routes • Use short retransmission intervals
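On the receiving side, path selection could reduce to something like this C sketch. It assumes each received broadcast updates a per-peer table with the arrival interface, the advertised priority and the arrival time; `struct path`, `select_path` and the hold-time parameter are hypothetical, and installing the resulting host route is OS specific and omitted.

```c
#include <stddef.h>
#include <stdint.h>

/* One path to a peer, learned from its broadcasts (hypothetical model). */
struct path {
    int ifindex;          /* local interface the broadcast arrived on */
    int priority;         /* interface priority advertised by the peer */
    uint64_t last_seen;   /* arrival time of the latest broadcast, ms */
};

/* Choose the path to use for the host route: the highest-priority path
 * heard within the last hold_ms; ties go to the most recently heard.
 * Assumes last_seen <= now. Returns an index into paths, or -1 when
 * every path has gone silent (peer unreachable). */
int select_path(const struct path *paths, size_t n, uint64_t now,
                uint64_t hold_ms)
{
    int best = -1;
    for (size_t i = 0; i < n; i++) {
        if (now - paths[i].last_seen >= hold_ms)
            continue;                       /* too old: treat path as down */
        if (best < 0
            || paths[i].priority > paths[best].priority
            || (paths[i].priority == paths[best].priority
                && paths[i].last_seen > paths[best].last_seen))
            best = (int)i;
    }
    return best;
}
```

Because a path counts as down once nothing has arrived on it within `hold_ms`, the broadcast interval has to be a fraction of the hold time, which is why short retransmission intervals matter.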
Erlang multi-homing resolved? (figure: Host A, Host B, Host C)
Summing up • Erlang can support multi-homing with some additional work • By using a loopback alias interface, link failure becomes a routing problem (the peer-to-peer association is kept intact) • Solaris TCP/IP stack parameters are: - hard to find (documented only in out-of-date application notes) - hard to set "right" - host-global • A distribution mechanism with built-in support for multi-homing is preferred
Erlang Distribution over SCTP. Per Bergqvist et al., per@synapse.se. Erlang User Conference 2002