State Machine Replication • Ido Zachevsky, Marat Radan • Supervisor: Ittay Eyal • Project Presentation, Winter Semester 2010
Goals • Learn and understand Paxos and Python. • Design a program for a fault-tolerant distributed system using the Paxos algorithm. • Test on a real Internet-scale system, Planet-Lab.
The Problem – Distributed Storage • Using Distributed Algorithms on a network has many advantages • It also has many problems • This project focuses on the Synchronization Problem
Synchronization • The task: Successfully run a state machine that spans all the computers of a network • All the computers need to be in sync regarding the Current State and the Next States. • All the computers need to know the transitions.
Problems? • Can any computer choose the next state? • What if a computer disconnects ungracefully? • What if a message is delayed due to congestion? • Other problems… • Solution: Use a dedicated algorithm
A Solution – Paxos • Meeting the Safety requirements ensures that the value chosen is one agreed upon by all computers • Meeting the Liveness requirements ensures that a value will eventually be chosen
Paxos - Background • Paxos Made Simple, Leslie Lamport, 01 Nov 2001 • Paxos Made Live
Principles • The system consists of three agent classes: • Proposers • Acceptors • Learners • Some agents are distinguished (a distinguished proposer and a distinguished learner) • Agents communicate via messages
Principles – continued • A single computer – a Leader – is in charge • Decision cycle in two phases: • A majority of acceptors must promise to commit to the Leader's most recent (highest-numbered) proposal. • Once a majority has committed, all computers are informed of the Decision.
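The two-phase decision cycle can be sketched as a single-process simulation. This is a minimal illustration, not the project's code: the `Acceptor` class, the `propose` function, and the use of `None` to model an unreachable node are all hypothetical names chosen for this example, and real acceptors would exchange messages over the network and persist their state.

```python
from dataclasses import dataclass
from typing import Any, List, Optional, Tuple


@dataclass
class Acceptor:
    """Hypothetical in-memory acceptor (assumption: no persistence, no network)."""
    promised: int = -1          # highest proposal number promised so far
    accepted_n: int = -1        # number of the last accepted proposal
    accepted_value: Any = None  # value of the last accepted proposal

    def prepare(self, n: int) -> Optional[Tuple[int, Any]]:
        """Phase 1: promise to ignore proposals numbered below n."""
        if n > self.promised:
            self.promised = n
            return (self.accepted_n, self.accepted_value)
        return None  # rejected: a higher number was already promised

    def accept(self, n: int, value: Any) -> bool:
        """Phase 2: accept unless a higher-numbered promise was made."""
        if n >= self.promised:
            self.promised = n
            self.accepted_n = n
            self.accepted_value = value
            return True
        return False


def propose(acceptors: List[Optional[Acceptor]], n: int, value: Any) -> Optional[Any]:
    """Run one decision cycle as the Leader.

    A None entry models a disconnected acceptor. Returns the chosen value,
    or None if no majority could be reached.
    """
    majority = len(acceptors) // 2 + 1
    live = [a for a in acceptors if a is not None]

    # Phase 1: collect promises from reachable acceptors.
    promises = []
    promised_by = []
    for acceptor in live:
        reply = acceptor.prepare(n)
        if reply is not None:
            promises.append(reply)
            promised_by.append(acceptor)
    if len(promises) < majority:
        return None

    # Safety: if any acceptor already accepted a value, adopt the one
    # with the highest proposal number instead of our own.
    prev_n, prev_value = max(promises, key=lambda p: p[0])
    if prev_n >= 0:
        value = prev_value

    # Phase 2: ask the promising acceptors to accept the value.
    acks = sum(1 for acceptor in promised_by if acceptor.accept(n, value))
    return value if acks >= majority else None
```

With five acceptors, a first call such as `propose(nodes, 1, "A")` chooses `"A"`, and a later Leader calling `propose(nodes, 2, "B")` still obtains `"A"`, which is how the "only a single value is chosen" requirement shows up in practice.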
Safety requirements • Only a value that has been proposed may be chosen, • Only a single value is chosen, and • A process never learns that a value has been chosen unless it actually has been.
Liveness requirements • Some proposed value is eventually chosen. • A process can eventually learn the value which has been chosen.
Implementing a State Machine • Collection of servers, each implementing a state machine. • The i-th state machine command in the sequence is the value chosen by the i-th instance of the Paxos consensus algorithm. • A pre-decided set of commands is necessary.
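As a rough illustration of how per-instance decisions drive a state machine, the sketch below applies decided commands strictly in slot order. The `Replica` class, the `COMMANDS` table, and the integer counter used as the state are assumptions made for this example; the slides do not say which commands the project used.

```python
from typing import Callable, Dict

# Hypothetical pre-decided command set: every replica knows these transitions.
COMMANDS: Dict[str, Callable[[int], int]] = {
    "inc": lambda state: state + 1,
    "dec": lambda state: state - 1,
    "reset": lambda state: 0,
}


class Replica:
    """One server's copy of the state machine (illustrative only)."""

    def __init__(self) -> None:
        self.state = 0                      # current state
        self.decided: Dict[int, str] = {}   # slot i -> command chosen by instance i
        self.next_slot = 0                  # first slot not yet applied

    def learn(self, slot: int, command: str) -> None:
        """Record the decision of Paxos instance `slot`, then apply what we can."""
        if command not in COMMANDS:
            raise ValueError(f"unknown command: {command!r}")
        self.decided[slot] = command
        # Apply only a gap-free prefix, so every replica executes the
        # same commands in the same order.
        while self.next_slot in self.decided:
            transition = COMMANDS[self.decided[self.next_slot]]
            self.state = transition(self.state)
            self.next_slot += 1
```

Two replicas that learn the same decisions in different arrival orders end in the same state, because a command for slot i is held back until slots 0 to i-1 have been applied.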
Planet-Lab • Planet-Lab is a global research network that supports the development of new network services. • Understanding the system is required • Monitoring is necessary • Generally, implemented via NSSL-lab.
Project Design • Chosen language for implementation: Python • Network framework: Twisted Matrix • Implementation stages: • Single Decision on NSSL • Multiple Decisions on NSSL • Single Decision on Planet-Lab • Multiple Decisions on Planet-Lab
Implementation • Use Cases • Acceptor disconnects? • Leader disconnects? • At which stage? • Acceptor message fails to deliver?
Implementation • Leader Election • In fact an inherent part of the algorithm • Output and monitoring • Actual output not visible in general • Only via monitoring
Flow • Register Nodes • Verify and install necessary files • Upload • Initiate Monitor • Run and wait for activity • Review results
Results • Everything works at the NSSL • In Real-Life, not necessarily • Communication phenomena – messages arriving unordered, in large chunks, etc. • Works well for up to 20-30 Nodes • Use cases tested in Lab
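Messages arriving "in large chunks" is typical of a byte-stream transport, which does not preserve message boundaries. One common remedy is a length prefix on each message. The sketch below is an assumption about how this could be handled, not a description of the project's code; `frame` and `FrameDecoder` are hypothetical names, and it addresses chunking only, not reordering.

```python
import struct
from typing import List

HEADER = struct.Struct("!I")  # 4-byte big-endian length prefix


def frame(payload: bytes) -> bytes:
    """Prefix a message with its length so the receiver can find its end."""
    return HEADER.pack(len(payload)) + payload


class FrameDecoder:
    """Reassembles whole messages from arbitrarily sized chunks."""

    def __init__(self) -> None:
        self._buffer = bytearray()

    def feed(self, chunk: bytes) -> List[bytes]:
        """Add received bytes; return every message completed so far."""
        self._buffer.extend(chunk)
        messages = []
        while len(self._buffer) >= HEADER.size:
            (length,) = HEADER.unpack_from(self._buffer)
            end = HEADER.size + length
            if len(self._buffer) < end:
                break  # message body not fully received yet
            messages.append(bytes(self._buffer[HEADER.size:end]))
            del self._buffer[:end]
        return messages
```

Feeding the decoder one large chunk containing several messages, or many tiny chunks containing fragments of one, yields the same sequence of complete messages. Twisted ships a comparable length-prefixed protocol, `Int32StringReceiver` in `twisted.protocols.basic`.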
Conclusions • Preliminary work needed to understand Twisted Matrix and Planet-Lab • Dealing with network problems • SSH Tunnel instead of “real” monitoring • Requirements fulfilled
Further work • Optimize the networking protocol • Improve the client-server interface • Inefficient startup – N(N-1) for N machines • Partition Decision processes • Only a few nodes decide each resolution
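To give a sense of scale for the N(N-1) startup cost, the short calculation below assumes the figure counts one directed connection from every machine to every other machine; that reading is an assumption, since the slide does not say what is being counted. The function names are hypothetical.

```python
def directed_connections(n: int) -> int:
    """Connections if each of n machines opens one to every other machine."""
    return n * (n - 1)


def shared_links(n: int) -> int:
    """Connections if each pair of machines shares a single link instead."""
    return n * (n - 1) // 2


for n in (5, 20, 30):
    print(n, directed_connections(n), shared_links(n))
```

Under this assumption, 30 machines need 870 directed connections at startup, and sharing one link per pair would halve that to 435.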