Practical Byzantine Fault Tolerance - by Sudha Elavarti
Introduction • Industry and government increasingly rely on online information services. • Successful malicious attacks are therefore becoming more serious. • Software errors are becoming more common as software grows in size and complexity. • Both attacks and software errors can cause faulty nodes to exhibit Byzantine (arbitrary) behavior. • The paper presents a practical algorithm for state machine replication that works in asynchronous systems like the Internet.
…continued • The paper makes the following contributions: • Describes a state machine replication protocol that survives Byzantine faults. • Describes a number of optimizations that allow the algorithm to perform well in real systems. • Describes the implementation of a Byzantine-fault-tolerant distributed file system. • Provides experimental results that quantify the cost of the replication technique.
System Models • Assumptions: • Asynchronous distributed system where nodes are connected by a network. • The network may fail to deliver messages, delay them, duplicate them, or deliver them out of order. • Byzantine failure model: faulty nodes may behave arbitrarily. • Node failures are independent. • The adversary cannot delay correct nodes indefinitely and cannot subvert the cryptographic techniques.
System model contd… • Cryptographic techniques • Public-key signatures. • Message authentication codes. • Message digest produced by collision-resistant hash functions.
Service properties • The algorithm can be used to implement any deterministic replicated service with a state and some operations. • The algorithm provides both safety and liveness, assuming no more than ⌊(n-1)/3⌋ replicas are faulty, where n is the number of replicas. • Safety holds regardless of how many faulty clients use the service. • Liveness means that clients eventually receive replies to their requests, provided at most ⌊(n-1)/3⌋ replicas are faulty.
Service properties contd.. • 3f+1 is the minimum number of replicas that allows an asynchronous system to provide safety and liveness, where f is the maximum number of faulty replicas. • It must be possible to proceed after communicating with n-f replicas, since f replicas might be faulty and not responding. • But the f replicas that did not respond may be non-faulty (merely slow), so up to f of the n-f replicas that did respond may be faulty. • The non-faulty responders must still outnumber the faulty ones: n-2f > f, therefore n > 3f. • The algorithm does not address fault-tolerant privacy: a faulty replica may leak information to an attacker.
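The replica-count argument above can be checked with a few lines of arithmetic. This is a minimal sketch; the function names are made up for illustration and do not come from the paper.

```python
def max_faulty(n: int) -> int:
    """Largest f that n replicas can tolerate, i.e. floor((n - 1) / 3)."""
    return (n - 1) // 3


def min_replicas(f: int) -> int:
    """Smallest n that tolerates f Byzantine replicas."""
    return 3 * f + 1


def quorum_size(f: int) -> int:
    """Replies a replica can wait for without depending on faulty nodes: n - f."""
    return min_replicas(f) - f


def correct_in_quorum(f: int) -> int:
    """Worst case: f of the n - f responders are faulty, leaving n - 2f correct."""
    return quorum_size(f) - f


for f in range(1, 4):
    # With n = 3f + 1 the correct responders (f + 1) always outnumber the faulty ones (f).
    print(f, min_replicas(f), quorum_size(f), correct_in_quorum(f))
```

With n = 3f+1 the quorum is 2f+1 replicas and at least f+1 of them are correct, which is exactly the n-2f > f condition on the slide.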
Algorithm • Algorithm works roughly as follows • A client sends a request to invoke a service operation to the primary • The primary multicasts the request to the backups • Replicas execute the request and send a reply to the client • The client waits for f+1 replies from different replicas with the same result; this is the result of the operation.
Set of replicas – R • Each replica is identified by an integer in {0, 1, …, |R|-1}. • |R| = 3f+1, where f is the maximum number of faulty replicas. • Replicas move through a succession of configurations called views. • In a view, one replica is the primary and the others are backups. Views are numbered consecutively. • The primary of a view is replica p such that p = v mod |R|, where v is the view number. • View changes are carried out when it appears that the primary has failed. • All non-faulty replicas agree on a total order for the execution of requests despite failures.
The Client • Client c requests the execution of state machine operation o by sending a <REQUEST,o,t,c> message to the primary. • Timestamp t is used to ensure exactly-once semantics. • Timestamps for c’s requests are totally ordered, so later requests have higher timestamps than earlier ones. • The primary atomically multicasts the request to all the backups. • Each replica sends the reply <REPLY,v,t,c,i,r> directly to the client, where v is the current view number, t is the timestamp of the corresponding request, i is the replica number, and r is the result of executing the requested operation. • The client waits for f+1 replies with valid signatures from different replicas, and with the same t and r, before accepting the result r.
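The primary-selection rule p = v mod |R| is easy to state in code. A minimal sketch, with a hypothetical function name:

```python
def primary_of_view(view: int, num_replicas: int) -> int:
    """Return the id of the primary for a view, using p = v mod |R|."""
    if num_replicas <= 0:
        raise ValueError("need at least one replica")
    return view % num_replicas


# With |R| = 4 (f = 1), the primary role rotates through replicas 0, 1, 2, 3, 0, ...
rotation = [primary_of_view(v, 4) for v in range(6)]
print(rotation)
```

Because each view change moves to the next view number, a faulty primary is replaced by the next replica in round-robin order.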
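The client's acceptance rule (f+1 matching replies from different replicas) could be sketched as follows. The `Reply` type and `accept_result` function are illustrative names, and signature checking is assumed to have happened before replies reach this function.

```python
from typing import Dict, Iterable, NamedTuple, Optional, Set


class Reply(NamedTuple):
    view: int
    timestamp: int
    client: str
    replica: int
    result: str


def accept_result(replies: Iterable[Reply], timestamp: int, f: int) -> Optional[str]:
    """Return a result once f + 1 distinct replicas report it for this timestamp."""
    voters: Dict[str, Set[int]] = {}  # result -> ids of replicas that reported it
    for reply in replies:
        if reply.timestamp != timestamp:
            continue  # reply to some other request
        voters.setdefault(reply.result, set()).add(reply.replica)
        if len(voters[reply.result]) >= f + 1:
            return reply.result
    return None  # not enough matching replies yet
```

Since at most f replicas are faulty, f+1 matching replies guarantee that at least one comes from a correct replica, so the result can be trusted.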
Client contd… • If the client does not receive replies soon enough, it broadcasts the request to all replicas. If the request has already been processed, the replicas simply re-send the reply; replicas remember the last reply message they sent to each client. • If the primary does not multicast the request to the group, it will eventually be suspected to be faulty by enough replicas to cause a view change.
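The "remember the last reply" behavior is what turns retransmissions into harmless duplicates. The sketch below is a simplification: it assumes the request has already been ordered by the protocol, and the `ReplyCache` class and its `handle` method are hypothetical names.

```python
from typing import Callable, Dict, Optional, Tuple


class ReplyCache:
    """Per-client record of the last executed timestamp and its result."""

    def __init__(self) -> None:
        self._last: Dict[str, Tuple[int, str]] = {}

    def handle(self, client: str, timestamp: int, execute: Callable[[], str]) -> Optional[str]:
        last = self._last.get(client)
        if last is not None:
            last_timestamp, last_result = last
            if timestamp == last_timestamp:
                return last_result  # retransmission: re-send the stored reply
            if timestamp < last_timestamp:
                return None  # stale request: ignore it
        result = execute()  # new request: run the operation exactly once
        self._last[client] = (timestamp, result)
        return result
```

A retransmitted request therefore never executes twice, which is the exactly-once guarantee the timestamps provide.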
Normal-Case Operation • The state of each replica includes the service state, a message log of accepted messages, and the replica’s current view number. • When primary p receives a client request m, it starts a three-phase protocol. • The three phases are pre-prepare, prepare, and commit. • The pre-prepare and prepare phases are used to totally order requests. • In the pre-prepare phase: • The primary assigns a sequence number n to the request. • It multicasts a pre-prepare message, with m piggybacked, to all backups and appends the message to its log. • The message has the form <<PRE-PREPARE,v,n,d>,m>, where d is the digest of m.
If backup i accepts the pre-prepare message, it enters the prepare phase by multicasting a <PREPARE,v,n,d,i> message to all other replicas and adds both messages to its log. Otherwise it does nothing. • A replica (including the primary) accepts prepare messages and adds them to its log, provided: • Their signatures are correct. • Their view number equals the replica’s current view number. • Their sequence number is between h and H (the low and high water marks). • The predicate prepared(m,v,n,i) is true iff replica i has inserted in its log the request m, a pre-prepare for m in view v with sequence number n, and 2f prepares from different backups that match the pre-prepare. • When prepared(m,v,n,i) becomes true, replica i multicasts a <COMMIT,v,n,D(m),i> message to the other replicas, where D(m) is the digest of m.
Replicas accept commit messages and insert them in their log, provided they are properly signed, their view number equals the replica’s current view, and their sequence number is between h and H. • The committed and committed-local predicates are defined as follows. • committed(m,v,n) is true iff prepared(m,v,n,i) is true for all i in some set of f+1 non-faulty replicas. • committed-local(m,v,n,i) is true iff prepared(m,v,n,i) is true and replica i has accepted 2f+1 commit messages (possibly including its own) from different replicas that match the pre-prepare for m. • Replica i executes the operation requested by m after committed-local(m,v,n,i) is true and i’s state reflects the sequential execution of all requests with lower sequence numbers. • This ensures that all non-faulty replicas execute requests in the same order, as required for the safety property. • The algorithm provides safety if all non-faulty replicas agree on the sequence numbers of requests that commit locally.
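A pre-prepare message and the checks a backup might run on it can be sketched as below. Several things here are assumptions for illustration: SHA-256 stands in for the digest function, messages are plain tuples rather than signed wire messages, the water-mark bounds are treated as h < n ≤ H, and the function names are invented. The check that no different digest was already accepted for the same view and sequence number follows the original PBFT paper as I understand it.

```python
import hashlib
from typing import Dict, Tuple

PrePrepare = Tuple[Tuple[str, int, int, str], bytes]  # ((tag, v, n, d), m)


def digest(request: bytes) -> str:
    """Digest of a request (SHA-256 is an assumption, not the paper's choice)."""
    return hashlib.sha256(request).hexdigest()


def make_pre_prepare(view: int, seq: int, request: bytes) -> PrePrepare:
    """Build <<PRE-PREPARE, v, n, d>, m> with the request piggybacked."""
    return (("PRE-PREPARE", view, seq, digest(request)), request)


def backup_accepts(msg: PrePrepare, current_view: int, h: int, H: int,
                   accepted: Dict[Tuple[int, int], str]) -> bool:
    """Checks a backup might apply; `accepted` maps (view, seq) to a digest."""
    (tag, view, seq, d), request = msg
    if tag != "PRE-PREPARE" or view != current_view:
        return False
    if d != digest(request):
        return False  # digest does not match the piggybacked request
    if not (h < seq <= H):
        return False  # outside the water marks
    previous = accepted.get((view, seq))
    return previous is None or previous == d  # no conflicting assignment
```

Sending only the digest d in later phases keeps prepare and commit messages small, since the full request travels once with the pre-prepare.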
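The prepared predicate amounts to a lookup in the replica's log. A minimal sketch, assuming messages have already passed the signature, view, and water-mark checks; the `PrepareLog` class is a hypothetical name.

```python
from typing import Dict, Set, Tuple


class PrepareLog:
    """Tracks pre-prepare and prepare messages accepted by one replica."""

    def __init__(self, f: int) -> None:
        self.f = f
        self.pre_prepares: Dict[Tuple[int, int], str] = {}       # (v, n) -> digest
        self.prepares: Dict[Tuple[int, int, str], Set[int]] = {}  # (v, n, d) -> backups

    def add_pre_prepare(self, view: int, seq: int, d: str) -> None:
        self.pre_prepares.setdefault((view, seq), d)

    def add_prepare(self, view: int, seq: int, d: str, replica: int) -> None:
        # A set ignores duplicate prepares from the same backup.
        self.prepares.setdefault((view, seq, d), set()).add(replica)

    def prepared(self, view: int, seq: int, d: str) -> bool:
        """True once a matching pre-prepare and 2f matching prepares are logged."""
        if self.pre_prepares.get((view, seq)) != d:
            return False
        return len(self.prepares.get((view, seq, d), set())) >= 2 * self.f
```

The pre-prepare plus 2f prepares form a quorum of 2f+1 replicas, and any two such quorums share at least one correct replica, so two different requests cannot both prepare with the same view and sequence number.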
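The execution rule (commit locally, then run strictly in sequence-number order) could look like the sketch below. It assumes sequence numbers start at 1 with no gaps, and that `on_prepared` is called only after the prepared predicate holds; the `Executor` class is an illustrative name.

```python
from typing import Dict, List, Set


class Executor:
    """Executes requests in sequence order once they have committed locally."""

    def __init__(self, f: int) -> None:
        self.f = f
        self.requests: Dict[int, str] = {}      # seq -> operation (prepared here)
        self.commits: Dict[int, Set[int]] = {}  # seq -> replicas that sent COMMIT
        self.last_executed = 0
        self.executed: List[str] = []

    def on_prepared(self, seq: int, operation: str) -> None:
        self.requests[seq] = operation
        self._execute_ready()

    def on_commit(self, seq: int, replica: int) -> None:
        self.commits.setdefault(seq, set()).add(replica)
        self._execute_ready()

    def committed_local(self, seq: int) -> bool:
        """Prepared locally and 2f + 1 matching commits from different replicas."""
        return seq in self.requests and len(self.commits.get(seq, ())) >= 2 * self.f + 1

    def _execute_ready(self) -> None:
        # Never skip ahead: request n + 1 waits until request n has executed.
        while self.committed_local(self.last_executed + 1):
            self.last_executed += 1
            self.executed.append(self.requests[self.last_executed])
```

A request that commits early is simply held until every lower sequence number has executed, which is what keeps all non-faulty replicas in the same order.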
Garbage Collection • Garbage collection is the mechanism used to discard messages from the log. • For the safety condition to hold, messages must be kept in a replica’s log until the replica knows that the requests they concern have been executed by at least f+1 non-faulty replicas. • This is achieved with checkpoints, which are taken when a request whose sequence number is divisible by some constant is executed. • When replica i produces a checkpoint, it multicasts a <CHECKPOINT,n,d,i> message to the other replicas, where n is the sequence number of the last request reflected in the state and d is the digest of the state. • Each replica collects checkpoint messages in its log until it has 2f+1 of them for sequence number n with the same digest d. • This makes the checkpoint stable, and the replica discards all pre-prepare, prepare, and commit messages with sequence numbers less than or equal to n. • The checkpoint protocol is also used to advance the low and high water marks: the low water mark h is the sequence number of the last stable checkpoint, and the high water mark is H = h+k, where k is large enough that replicas do not stall waiting for a checkpoint to become stable.
View Changes • The view-change protocol provides liveness by allowing the system to make progress when the primary fails. • View changes are triggered by timeouts that prevent backups from waiting indefinitely for requests to execute. • If the timer of backup i expires in view v, the backup starts a view change to move the system to view v+1. It stops accepting messages (other than checkpoint, view-change, and new-view messages) and multicasts a <VIEW-CHANGE,v+1,n,C,P,i> message to all replicas, where n is the sequence number of the last stable checkpoint known to i, C is the set of checkpoint messages proving that checkpoint, and P describes the requests that prepared at i with sequence numbers above n. • When the primary p of view v+1 receives 2f valid view-change messages for view v+1 from other replicas, it multicasts a <NEW-VIEW,v+1,V,O> message to all other replicas, where V is the set of valid view-change messages and O is a set of pre-prepare messages that carry requests over from earlier views.
Liveness • To provide liveness, replicas must move to a new view if they are unable to execute a request. • To avoid starting a view change too soon, a replica that multicasts a view-change message for view v+1 waits for 2f+1 view-change messages for view v+1 and then starts a timer that expires after some time T. • If the timer expires before the replica receives a new-view message, it starts the view change for view v+2, but this time it waits 2T before starting a view change from v+2 to v+3. • If a replica receives f+1 valid view-change messages from other replicas for views greater than its current view, it sends a view-change message for the smallest view in the set, even if its timer has not expired. • Faulty replicas cannot force a view change by sending view-change messages, because a view change happens only if at least f+1 replicas send view-change messages. • These three techniques guarantee liveness unless message delays grow faster than the timeout period indefinitely.
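Checkpoint collection and log truncation can be sketched as follows. The `CheckpointTracker` class, the representation of the log as a dict keyed by sequence number, and the value of k used in the example are all assumptions for illustration.

```python
from typing import Dict, Set, Tuple


class CheckpointTracker:
    """Collects CHECKPOINT messages and advances the water marks."""

    def __init__(self, f: int, k: int) -> None:
        self.f = f
        self.k = k
        self.low = 0  # h: sequence number of the last stable checkpoint
        self.votes: Dict[Tuple[int, str], Set[int]] = {}  # (n, d) -> replicas

    @property
    def high(self) -> int:
        """H = h + k."""
        return self.low + self.k

    def on_checkpoint(self, seq: int, d: str, replica: int, log: Dict[int, object]) -> bool:
        """Record one message; return True if the checkpoint just became stable."""
        if seq <= self.low:
            return False  # already covered by a stable checkpoint
        voters = self.votes.setdefault((seq, d), set())
        voters.add(replica)
        if len(voters) < 2 * self.f + 1:
            return False
        self.low = seq
        for old in [s for s in log if s <= seq]:
            del log[old]  # discard messages at or below the stable checkpoint
        self.votes = {key: v for key, v in self.votes.items() if key[0] > seq}
        return True
```

Requiring the same digest from 2f+1 replicas means a faulty replica cannot make a bogus state look stable.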
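Two of the liveness rules above are simple enough to express directly: the timeout that doubles with each failed view change, and the f+1 rule for joining a view change early. Function names and the dict-based input are illustrative assumptions.

```python
from typing import Dict, Optional


def view_change_timeout(base_timeout: float, failed_attempts: int) -> float:
    """Timeout after a number of consecutive failed view changes: T, 2T, 4T, ..."""
    return base_timeout * (2 ** failed_attempts)


def view_to_join(current_view: int, requested: Dict[int, int], f: int) -> Optional[int]:
    """Apply the f + 1 rule.

    `requested` maps a replica id to the view named in its view-change message.
    If at least f + 1 replicas ask for views above the current one, return the
    smallest such view; otherwise return None.
    """
    higher = [view for view in requested.values() if view > current_view]
    if len(higher) >= f + 1:
        return min(higher)
    return None
```

Doubling the timeout lets the system eventually outlast any bounded message delay, while the f+1 threshold ensures at least one correct replica wanted the view change.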
Optimizations: Reducing Communication • Three optimizations are used to reduce the cost of communication. • The first avoids sending most large replies: one designated replica sends the full result and the others send only its digest. • Reduces bandwidth consumption. • Reduces CPU overhead. • The second reduces the number of message delays for an operation invocation from 5 to 4 by letting replicas execute a request tentatively as soon as it is prepared. • The third improves the performance of read-only operations, which do not modify the service state, by letting replicas execute them immediately.
Cryptography • Digital signatures are used only for view-change and new-view messages. All other messages are authenticated using message authentication codes (MACs). • MACs can be computed three orders of magnitude faster than digital signatures. • Other public-key cryptosystems generate signatures faster but verify them more slowly, and in this algorithm each signature is verified many times. • Each node shares a 16-byte secret session key with each replica. • The digital signature in a reply message is replaced by a single MAC; signatures in all other messages are replaced by vectors of MACs called authenticators. • The time to verify an authenticator is constant; its size grows linearly with the number of replicas, but slowly.
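An authenticator is a vector with one MAC per receiving replica. The sketch below uses HMAC-SHA256 from the Python standard library as a stand-in MAC, which is an assumption and not the construction used in the paper; the function names are also illustrative.

```python
import hashlib
import hmac
import os
from typing import Dict


def mac(key: bytes, message: bytes) -> bytes:
    """One MAC over the message (HMAC-SHA256 is a stand-in choice)."""
    return hmac.new(key, message, hashlib.sha256).digest()


def make_authenticator(session_keys: Dict[int, bytes], message: bytes) -> Dict[int, bytes]:
    """Build the vector of MACs: one entry per replica the sender shares a key with."""
    return {replica: mac(key, message) for replica, key in session_keys.items()}


def verify_authenticator(authenticator: Dict[int, bytes], replica: int,
                         key: bytes, message: bytes) -> bool:
    """A replica checks only its own entry, so verification cost is constant."""
    tag = authenticator.get(replica)
    return tag is not None and hmac.compare_digest(tag, mac(key, message))


# Usage: a sender sharing 16-byte session keys with four replicas.
keys = {replica: os.urandom(16) for replica in range(4)}
auth = make_authenticator(keys, b"<PREPARE,0,1,d,2>")
print(verify_authenticator(auth, 3, keys[3], b"<PREPARE,0,1,d,2>"))
```

Generating the vector costs one MAC per replica, while each receiver verifies a single entry, which matches the constant verification time noted above.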
Implementation: The Replication Library • The client interface to the replication library consists of a single procedure, invoke, with one argument: an input buffer containing a request to invoke a state machine operation. • On the server side, the replication code makes a number of upcalls to procedures that the server part of the application must implement. • The procedures are execute, make_checkpoint, delete_checkpoint, get_digest, get_checkpoint, and set_checkpoint. • Point-to-point communication between nodes is implemented using UDP, and multicast to the group of replicas is implemented using UDP over IP multicast. • The algorithm tolerates out-of-order delivery and rejects duplicates.
Byzantine-Fault-tolerant File System • BFS is implemented using the replication library. • Application processes run unmodified and interact through the NFS client in the kernel. • User-level relay processes mediate communication between the standard NFS client and the replicas. • A relay receives NFS requests, calls the invoke procedure of the replication library, and sends the result back to the NFS client. • Each replica runs a user-level process with the replication library and an NFS V2 daemon, referred to as snfsd. • The replication library receives requests from the relay and interacts with snfsd by making upcalls.
Performance Evaluation • EXPERIMENTAL SETUP • Experiments measure normal-case behavior (no view changes). • All experiments ran with one client running two relay processes, and four replicas. Four replicas can tolerate one Byzantine fault. • A micro-benchmark provides a service-independent evaluation of the performance of the replication library. • The Andrew benchmark is used to compare BFS with two other file systems: • The NFS V2 implementation in Digital UNIX. • BFS without replication.
Conclusion • The algorithm works correctly in asynchronous systems like the Internet. • The performance of BFS is only 3% worse than the standard NFS implementation. • The good performance is due to replacing public-key signatures with message authentication codes, reducing the size and number of messages, and the incremental checkpoint management technique. • One reason Byzantine-fault-tolerant algorithms will be important in the future is that they allow systems to work correctly even when there are software errors. • Not all errors are survivable: the approach cannot mask a software error that occurs at all replicas. • It can mask errors that occur independently at different replicas, including non-deterministic software errors, which are the most problematic and persistent errors.