Self Healing Wide Area Network Services Bhavjit S Walha Ganesh Venkatesh
Layout • Introduction • Previous Work • Issues • Solution • Preliminary results • Problems & Future Extensions • Conclusion
Motivation • Companies may have servers distributed over a wide area network • Akamai's content distribution network • Distributed web servers • Manual monitoring may not be feasible • Centralized control may lead to problems in case of a network partition • Typical server applications • May crash due to software bugs • Little state is retained • A simple restart is thus sufficient
Motivation … • What if peers monitored each other's health? • When a crash is detected, try to restart the service • No central monitoring station involved • Loosely based on a worm • Resilient to sporadic failures • Spreads to uninfected nodes • But • No backdoor involved • May not always shift to new nodes
Introduction • Previous Work • Issues • Solution • Preliminary results • Problems & Future Extensions • Conclusion
Medusa • All nodes are part of a multicast group • Each node is thus in touch with all other nodes through Heartbeat messages • Nodes send regular updates to the multicast tree • All communication is through reliable multicast • In case a node goes down • Other nodes try to restart it • A request for the service is sent to the multicast group
Medusa Problems • Scalability • Assumes reliable packet delivery • State information is shared with all nodes • Reliable Multicast • Assumes reliable delivery of packets to all nodes • No explicit ACKs • Kill operations fail in case of a temporary break in the multicast tree • Security • No way of authenticating packets
Introduction • Previous Work • Issues • Solution • Preliminary results • Problems & Future Extensions • Conclusion
Proposed solution • Nodes form peering relationships with only a subset of other nodes • Exchange Hello packets • Scalable, since the degree is fixed • No central control • No dependence on reliable multicast • Distributed communication protocol • Explicit ACKs for packets • Some super-nodes are required to be up when the system is booted • Exploits the power of randomly-connected graphs
Design • Each node continually sends Hello packets to its peer nodes • Indicates everything is up and working • A timeout indicates something is wrong • Application crash • Network partition • Aimed at application crashes • Application should be stateless • No code transfer • Remotely restartable • SSH needed: a login account and pre-distributed keys • (A sketch of the Hello/timeout loop follows below)
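A minimal sketch of the peer-monitoring loop described on this slide, assuming Hello packets are plain UDP datagrams and using the interval and timeout values from the Performance slide; the addresses, port, and the idea of a remote_restart helper are illustrative, not taken from the actual implementation.

```python
import socket
import threading
import time

HELLO_INTERVAL = 5    # seconds between Hello packets (value from the GradLab test)
HELLO_TIMEOUT = 22    # seconds of silence before a peer is considered down

# peer address -> time the last Hello was received (addresses are placeholders)
peers = {("10.0.0.2", 9000): time.time(),
         ("10.0.0.3", 9000): time.time()}

sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
sock.bind(("0.0.0.0", 9000))
sock.settimeout(1.0)

def send_hellos():
    """Tell every peer, at a fixed interval, that this node is alive."""
    while True:
        for addr in peers:
            sock.sendto(b"HELLO", addr)
        time.sleep(HELLO_INTERVAL)

threading.Thread(target=send_hellos, daemon=True).start()

while True:
    try:
        data, addr = sock.recvfrom(1024)
        if data == b"HELLO" and addr in peers:
            peers[addr] = time.time()       # peer is alive: refresh its timestamp
    except socket.timeout:
        pass
    for addr, last_seen in peers.items():
        if time.time() - last_seen > HELLO_TIMEOUT:
            print("peer", addr, "timed out; attempting a remote restart")
            # a remote_restart helper would SSH in and relaunch the service
            # (see the Remote start slide)
            peers[addr] = time.time()       # back off so we do not retry every second
```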
Initialization • 3-5 super-nodes form a fully-connected graph • Expected to be up all the time • All nodes have information about their IPs • May be under manual supervision • May have information about the topology • Responsible for forwarding join requests to other nodes
Remote start • SSH to a remote node to restart the service • A remote (re)start is attempted after a Hello timeout • Current implementation requires keys to be distributed beforehand • Starts a small watchdog program which returns immediately • Checks whether another copy is already running • Current implementation uses ps • If the application fails to start, do nothing and wait for the next retry • Possible extension: allow the service to spread • (A restart sketch follows below)
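A sketch of the restart path under the assumptions on this slide: key-based SSH is already set up, and the duplicate check greps ps output. The host, user, and service command shown are placeholders.

```python
import subprocess

def another_copy_running(service_name: str) -> bool:
    """Duplicate check as described above: look for the service name in ps output."""
    out = subprocess.run(["ps", "-eo", "comm"], capture_output=True, text=True).stdout
    return service_name in out.split()

def remote_restart(host: str, user: str, start_cmd: str) -> bool:
    """SSH into the unresponsive node (pre-distributed keys assumed) and relaunch it.

    nohup plus '&' lets the ssh session return immediately, mirroring the small
    watchdog program mentioned on the slide.
    """
    cmd = ["ssh", "-o", "BatchMode=yes", f"{user}@{host}",
           f"nohup {start_cmd} >/dev/null 2>&1 &"]
    return subprocess.run(cmd).returncode == 0

# Usage with placeholder names:
# remote_restart("node2.gradlab.example", "selfheal", "/home/selfheal/bin/service")
```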
New node comes up… • Waits for other nodes to contact it • After a timeout: • Sends a JoinRequest to a super-node with the number of peers needed • The super-node forwards this request to other nodes • AddRequest • Some node may ask the new node to become its peer • Add it to neighbourList and send AddACK • Hello • Can add to neighbourList if an unsolicited Hello is received • Beneficial in case of a short temporary failure • After Request-timeout: • Contact another super-node with another JoinRequest • The timeout can be dynamically specified in the JoinRequestACK • (A join-side sketch follows below)
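A rough sketch of the new node's side of the join exchange. The message names (JoinRequest, AddRequest, AddACK, Hello) follow the slides, but the text-over-UDP wire format, addresses, and port are invented for illustration.

```python
import socket

SUPER_NODES = [("10.0.0.1", 9000), ("10.0.0.5", 9000)]   # placeholder addresses
WANTED_PEERS = 3          # initial degree request from the Performance slide
JOIN_TIMEOUT = 20.0       # JoinRequest timeout from the Performance slide

neighbour_list = set()
sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
sock.bind(("0.0.0.0", 9001))
sock.settimeout(JOIN_TIMEOUT)

current = 0
sock.sendto(f"JOINREQ {WANTED_PEERS}".encode(), SUPER_NODES[current])

while len(neighbour_list) < WANTED_PEERS:
    try:
        data, addr = sock.recvfrom(1024)
    except socket.timeout:
        current = (current + 1) % len(SUPER_NODES)     # try another super-node
        sock.sendto(f"JOINREQ {WANTED_PEERS}".encode(), SUPER_NODES[current])
        continue
    if data == b"ADDREQ":
        neighbour_list.add(addr)                       # accept the peering
        sock.sendto(b"ADDACK", addr)
    elif data == b"HELLO":
        neighbour_list.add(addr)                       # unsolicited Hello: adopt the sender
```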
New node comes up… Random Walk • The request is forwarded by the super-node to 3 random nodes on behalf of the new node • Each node forwards it to others • Decreasing the hop count by 1 each time • If hop count = 0, check whether this node can support more peers • Yes: • Send AddRequest to the new node • Add it to neighbourList on receiving AddACK • No: • Ignore the request • The new node may already have found neighbours • Due to a duplicate JoinRequest or repair of a network partition • In that case the new node replies to the AddRequest with a Die packet • (A forwarding sketch follows below)
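A sketch of how an existing node might handle one step of the random walk. The fan-out of 3 and the hop-count decrement come from the slides; MAX_DEGREE and the WALK message encoding are assumptions.

```python
import random
import socket

MAX_DEGREE = 5        # assumed cap on peers per node; the slides do not give a value
FANOUT = 3            # each hop forwards the request to 3 random nodes, as on the slide

neighbour_list = set()    # this node's current peers
known_nodes = set()       # other nodes this node knows about

sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)

def handle_walk(new_node, hop_count):
    """Forward the random-walk JoinRequest, or answer it when the walk ends here."""
    if hop_count > 0:
        # keep walking: forward to FANOUT random nodes with the hop count decreased by 1
        targets = random.sample(sorted(known_nodes), min(FANOUT, len(known_nodes)))
        for peer in targets:
            sock.sendto(f"WALK {new_node[0]} {new_node[1]} {hop_count - 1}".encode(), peer)
    elif len(neighbour_list) < MAX_DEGREE:
        # hop count reached 0 and there is room: offer a peering to the new node
        sock.sendto(b"ADDREQ", new_node)
    # otherwise the request is simply ignored, as on the slide
```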
Shutdown • Critical to ensure that all nodes go down • 3-way protocol • Send kill to the target node • Target node replies with die • Send dieACK to the target node • kill • Used when multiple copies are detected • Possibly to balance load • die • Reply to an unsolicited Hello • No perfect solution in case of a network partition • (A handshake sketch follows below)
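A minimal sketch of the 3-way kill handshake named above; the 5-second reply timeout is an assumption, since the slides only specify the three messages.

```python
import socket
import sys

def kill_peer(sock: socket.socket, target) -> bool:
    """Initiator side of the kill / die / dieACK handshake."""
    sock.settimeout(5.0)                        # reply timeout is an assumption
    sock.sendto(b"KILL", target)
    try:
        data, addr = sock.recvfrom(1024)
        if data == b"DIE" and addr == target:
            sock.sendto(b"DIEACK", target)      # target may now exit
            return True
    except socket.timeout:
        pass                                    # no reply; the caller may retry later
    return False

def handle_kill(sock: socket.socket, sender) -> None:
    """Target side: reply with die, then wait for the dieACK before exiting."""
    sock.sendto(b"DIE", sender)
    data, addr = sock.recvfrom(1024)
    if data == b"DIEACK" and addr == sender:
        sys.exit(0)                             # shut this copy down
```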
Global Shutdown… • Secret killAll packet • Sent by an external program for a complete system shutdown • Forwarded to all neighbours • A node does not die until it receives a killACK from every neighbour • Stops sending Hellos immediately • No further restart attempts • Replies only to die, kill and killAll • May send unnecessary traffic • Eventually times out on seeing zero neighbours • (A flooding sketch follows below)
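A sketch of the killAll flooding step. The duplicate-suppression flag and the final killACK sent back upstream are assumptions; the slides only say that a node forwards killAll to all neighbours and waits for a killACK from everyone before dying.

```python
import socket

def on_killall(sock: socket.socket, source, neighbour_list, already_shutting_down: bool):
    """Handle killAll: forward it, collect killACKs from all neighbours, then ACK upstream."""
    if already_shutting_down:
        sock.sendto(b"KILLACK", source)         # duplicate: just acknowledge (assumption)
        return
    pending = set(neighbour_list) - {source}
    for peer in pending:
        sock.sendto(b"KILLALL", peer)
    while pending:                              # do not die until everyone has acknowledged
        data, addr = sock.recvfrom(1024)
        if data == b"KILLACK" and addr in pending:
            pending.discard(addr)
    sock.sendto(b"KILLACK", source)             # report completion to the upstream node
    # from here on: stop sending Hellos, make no restart attempts, and exit
```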
Performance • Tested on 6 nodes in GradLab • Hello interval: 5 s • Hello timeout: 22 s • Wait before JoinRequest: 10 s • JoinRequest timeout: 20 s • Hop count: 2 • Initial degree request: 3 • Super-nodes: 3 • Preliminary tests on PlanetLab • (These settings are collected into a configuration sketch below)
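For reference, the same settings gathered into a single configuration block; the key names are chosen for this sketch, not taken from the implementation.

```python
# GradLab test settings collected in one place (key names chosen for this sketch)
CONFIG = {
    "hello_interval_s": 5,         # gap between Hello packets
    "hello_timeout_s": 22,         # silence before a peer is declared dead
    "join_wait_s": 10,             # wait before sending the first JoinRequest
    "join_request_timeout_s": 20,  # wait before retrying with another super-node
    "hop_count": 2,                # length of the random walk
    "initial_degree": 3,           # peers requested by a new node
    "super_node_count": 3,         # fully connected bootstrap nodes
}
```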
Results • LAN • No timeouts or packet losses observed • No duplicate copies • killAll works perfectly • Re-start latency: 22s • Decreases after a number of restarts • Join latency: 15s • PlanetLab • Re-start latency: 27s • Join latency: 21s
Introduction • Previous Work • Issues • Solution • Preliminary results • Problems and Future Extensions • Conclusion
Limitations • Security • Packets are not authenticated • Stray copies • After a killAll there may be stray copies • Harmless, as they do not try to spread • But they prevent another copy from running • No new nodes • Node discovery • Why should spare nodes be idle in the first place? • What to do when the original nodes come back up? • Solution • Send regular updates to the super-nodes • Extra servers can then be killed easily
Parameter tweaking • Hop count for the Random Walk • Connectivity • Minimum degree to ensure connectivity • Maximum degree to spread the failure probability • Timeouts • Request timeout • Depends on the hop count • Hello timeout • Different for WAN and LAN • Global timeout • In case of a network partition • Loss of Kill ACK packets
Conclusion • Maintaining High Availability does not always require central control • Achieving a global shutdown is problematic • Need to explore connectivity requirements to ensure a connected graph at all times.