Byzantine Techniques II. Justin W. Hart CS 614 12/01/2005. Papers. BAR Fault Tolerance for Cooperative Services . Amitanand S. Aiyer, et. al. (SOSP 2005) Fault-scalable Byzantine Fault-Tolerant Services. Michael Abd-El-Malek et.al. SOSP 2005. BAR Fault Tolerance for Distributed Services.
Byzantine Techniques II Justin W. Hart CS 614 12/01/2005
Papers • BAR Fault Tolerance for Cooperative Services. Amitanand S. Aiyer, et al. (SOSP 2005) • Fault-scalable Byzantine Fault-Tolerant Services. Michael Abd-El-Malek, et al. (SOSP 2005)
BAR Fault Tolerance for Cooperative Services • BAR Model • General Three-Level Architecture • BAR-B
Motivation • “General approach to constructing cooperative services that span multiple administrative domains (MADs)”
Why is this difficult? • Nodes are under control of multiple administrators • Broken – Byzantine behaviors. • Misconfigured, or configured with malicious intent. • Selfish – Rational behaviors • Alter the protocol to increase local utility
Other models? • Byzantine Models – Account for Byzantine behavior, but do not handle rational behavior. • Rational Models – Account for rational behavior, but may break with Byzantine behavior.
BAR Model • Byzantine • Behaving arbitrarily or maliciously • Altruistic • Execute the proposed program, whether it benefits them or not • Rational • Deviate from the proposed program for purposes of local benefit
BART – BAR Tolerant • It’s a cruel world • At most (n-2)/3 nodes in the system are Byzantine • The rest are rational
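The bound on this slide can be checked with a little arithmetic. The sketch below is only an illustration: the function names are hypothetical, and it assumes the bound is read as "f Byzantine nodes require n ≥ 3f + 2", which matches the 3f+2 group size that appears later in the deck.

```python
def max_byzantine(n: int) -> int:
    """Largest f such that 3f + 2 <= n, i.e. floor((n - 2) / 3)."""
    if n < 2:
        raise ValueError("need at least 2 nodes")
    return (n - 2) // 3


def min_nodes(f: int) -> int:
    """Smallest group size that satisfies the bound for f Byzantine nodes."""
    if f < 0:
        raise ValueError("f must be non-negative")
    return 3 * f + 2


if __name__ == "__main__":
    for n in (5, 8, 100):
        print(f"n={n}: at most {max_byzantine(n)} Byzantine nodes")
```

Everything that is not Byzantine is treated as rational in this model, so the remaining n - f nodes get no benefit of the doubt.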
Two classes of protocols • Incentive-Compatible Byzantine Fault Tolerant (IC-BFT) • Guarantees a set of safety and liveness properties • It is in the best interest of rational nodes to follow the protocol exactly • Byzantine Altruistic Rational Tolerant (BART) • Guarantees a set of safety and liveness properties despite the presence of rational nodes • IC-BFT is a subset of BART
An important concept • It isn’t enough for a protocol to survive drills of a handful of attacks. It must provably provide its guarantees.
A flavor of things to come • Protocol builds on Practical Byzantine Fault Tolerance in order to combat Byzantine behavior • Protocol uses game theoretical concepts in order to combat rational behavior
…and the nodes are starving! • Nodes require access to a state machine in order to complete their objectives • Protocol contains methods for punishing rational nodes, including denying them access to the state machine
An expensive notion of identity • Identity is established through cryptographic keys assigned by a trusted authority • Prevents Sybil attacks • Bounds the number of Byzantine nodes • Gives rational nodes reason to consider long-term consequences of their actions • Gives real-world grounding to identity
Assumptions about rational nodes • “Receive long-term benefit from staying in the protocol” • “Conservative when computing the impact of Byzantine nodes on their utility” • “If the protocol provides a Nash equilibrium, then all rational nodes will follow it” • “Rational nodes do not collude…colluding nodes are classified as Byzantine”
Byzantine nodes • Byzantine fault model • Strong adversary • Adversary can coordinate collusion attacks
Important concepts • Promptness principle • Proof of Misbehavior (POM) • Cost balancing
Promptness principle • If a rational node gains no benefit from delaying a message, it will send it as soon as possible
Proof of Misbehavior (POM) • Self-contained, cryptographic proof of wrongdoing • Provides accountability to nodes for their actions
Example of POM • Node A requests that Node B store a chunk • Node B replies that it has stored the chunk • Later Node A requests that chunk back • Node B sends back random garbage (it hadn’t stored the chunk) and a signature • Because Node A stored a hash of the chunk, it can demonstrate misbehavior on the part of Node B
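The exchange on this slide can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the paper's mechanism: `KEYS`, `sign`, `verify`, `ProofOfMisbehavior` and `check_retrieval` are hypothetical names, and HMAC with a per-node key stands in for the public-key signatures a real deployment would need so that third parties can check the proof.

```python
import hashlib
import hmac
from dataclasses import dataclass
from typing import Optional

# Hypothetical signing keys. HMAC is only a stand-in for real signatures.
KEYS = {"B": b"node-b-secret"}


def sign(node: str, data: bytes) -> bytes:
    """Produce the stand-in signature of `node` over `data`."""
    return hmac.new(KEYS[node], data, hashlib.sha256).digest()


def verify(node: str, data: bytes, sig: bytes) -> bool:
    """Check the stand-in signature."""
    return hmac.compare_digest(sign(node, data), sig)


@dataclass(frozen=True)
class ProofOfMisbehavior:
    accused: str
    expected_hash: bytes   # hash A recorded when the chunk was stored
    returned_data: bytes   # what the accused sent back
    signature: bytes       # accused's signature over returned_data


def check_retrieval(node: str, expected_hash: bytes,
                    returned: bytes, sig: bytes) -> Optional[ProofOfMisbehavior]:
    """Return a POM if a validly signed reply does not match the stored hash."""
    if not verify(node, returned, sig):
        return None  # an unsigned reply proves nothing about `node`
    if hashlib.sha256(returned).digest() == expected_hash:
        return None  # correct chunk, no misbehavior
    return ProofOfMisbehavior(node, expected_hash, returned, sig)


if __name__ == "__main__":
    chunk = b"backup chunk"
    stored_hash = hashlib.sha256(chunk).digest()   # A keeps only this
    garbage = b"random garbage"                    # B never stored the chunk
    pom = check_retrieval("B", stored_hash, garbage, sign("B", garbage))
    print("POM generated:", pom is not None)
```

In this sketch the expected hash comes only from A's own records. To convince anyone else, the proof would presumably also need something B signed at store time; that part is left out here.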
…but it’s a bit more complicated than that! • This corresponds to a rather simple behavior to combat. “Aggressively Byzantine” behavior.
Passive-aggressive behaviors • Harder cases than “aggressively Byzantine” • A malicious Node A could merely lie about misbehavior on the part of Node B • A node could exploit non-determinism in order to shirk work
Cost Balancing • If two behaviors have the same cost, there is no reason to choose the wrong one
Level 1 • Unilaterally deny service to nodes that fail to deliver messages • “Tit-for-Tat” • Balance costs • No incentive to make the wrong choice • Penance • Unilaterally impose extra work on nodes with untimely responses
Level 2 • Failure to respond to a request by a state machine will generate a POM from a quorum of nodes in the state machine
Level 3 • Makes use of reliable work assignment • Needs only to provide sufficient information to identify valid request/response pairs
Nuts and Bolts • Level 1 • Level 2
Level 1 • Ensure long-term benefit to participants • The RSM rotates the leadership role to participants. • Participants want to stay in the system in order to control the RSM and complete their protocols • Limit non-determinism • Self-interested nodes could hide behind non-determinism to shirk work • Use Terminating Reliable Broadcast, rather than consensus. • In TRB, only the sender can propose a value • Other nodes can only adopt this value, or choose a default value
Level 1 • Mitigate the effects of residual non-determinism • Cost balancing • The protocol-preferred choice is no more expensive than any other • Encouraging timeliness • Nodes can inflict sanctions on untimely messages • Enforce predictable communication patterns • Nodes have to have participated at every step in order to have the opportunity to issue a command
3f+2 nodes, rather than 3f+1 • Suppose a sender “s” is slow • The same group of nodes now wants to determine that “s” is slow • A new leader is elected • Every node but “s” wants a timely conclusion to this, in order to get their turn to propose a value to the state machine • “s” is not allowed to participate in this quorum
TRB provides a few guarantees • They differ during periods of synchrony and periods of asynchrony
In synchrony • Termination • Every non-Byzantine process delivers exactly one message • Agreement • If one non-Byzantine process delivers a message m, then all non-Byzantine processes eventually deliver m
In asynchrony • Integrity • If a non-Byzantine process delivers m, then the sender sent m • Non-Triviality • If the sender is non-Byzantine and sends m, then the sender eventually delivers m
Message Queue • Enforces predictable communication patterns • Bubbles • A simple retaliation policy • Node A’s message queue is filled with messages that it intends to send to Node B • This message queue is interleaved with bubbles. • Bubbles contain predicates indicating messages expected from B • No message except the expected predicate from B can fill the bubble • No messages in A’s queue will go to B until B fills the bubble
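The bubble mechanism can be sketched as a queue that stops draining at the first unfilled bubble. This is a minimal sketch with hypothetical names (`Bubble`, `MessageQueue`, `drain`, `receive`); messages are plain strings and predicates are ordinary Python callables, both assumptions made for brevity.

```python
from collections import deque
from dataclasses import dataclass
from typing import Callable, Deque, List, Union


@dataclass
class Bubble:
    """Placeholder that only a matching message from B can fill."""
    expects: Callable[[str], bool]


class MessageQueue:
    """Node A's outgoing queue for a single peer B."""

    def __init__(self) -> None:
        self._items: Deque[Union[str, Bubble]] = deque()

    def enqueue(self, message: str) -> None:
        self._items.append(message)

    def expect(self, predicate: Callable[[str], bool]) -> None:
        """Interleave a bubble describing the message expected from B."""
        self._items.append(Bubble(predicate))

    def drain(self) -> List[str]:
        """Remove and return the messages ahead of the first unfilled bubble."""
        out: List[str] = []
        while self._items and not isinstance(self._items[0], Bubble):
            out.append(self._items.popleft())
        return out

    def receive(self, message: str) -> bool:
        """Fill the first bubble if `message` satisfies its predicate."""
        for i, item in enumerate(self._items):
            if isinstance(item, Bubble):
                if item.expects(message):
                    del self._items[i]
                    return True
                return False  # wrong message, the bubble stays in place
        return False  # no bubble is waiting


if __name__ == "__main__":
    q = MessageQueue()
    q.enqueue("m1")
    q.expect(lambda m: m == "ack:m1")
    q.enqueue("m2")
    print(q.drain())            # m1 goes out, m2 is held behind the bubble
    print(q.receive("hello"))   # does not satisfy the predicate
    print(q.receive("ack:m1"))  # fills the bubble
    print(q.drain())            # m2 is released
```

The retaliation is purely local: A does not need agreement from anyone else to stop sending to an unresponsive B.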
Balanced Messages • We’ve already discussed this quite a bit • We assure this at this level of the protocol • This is where we get our gigantic timeout message
Penance • Untimely vector • Tracks a node’s perception of the responsiveness of other nodes • When a node becomes a sender, it includes its untimely vector with the message
Penance • All nodes but the sender receive penance messages from each untimely node. • Because of bubbles, each untimely node must send a penance message back in order to continue using the system • This provides a penalty to those nodes • The sender is excluded from this process, because it may be motivated to lie in its penance vector, in order to avoid the work of transmitting penance messages
Timeouts and Garbage Collection • Set-turn timeout • Timeout to take leadership away from the sender • Initially 10 seconds in this implementation, in order to overcome all expected network delays • Can only be changed by the sender • Max_response_time • Time at which a node is removed from the system, its messages discarded and its resources garbage collected • Set to 1 week or 1 month in the prototypes
Global Punishment • Badlists • Transform local suspicion into POMs • Suspicion is recorded in a local node’s badlist • Sender includes its badlist with its message • If, over time, recipients see a node in f + 1 different senders’ badlists, then they, too, consider that node to be faulty
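The f + 1 rule can be sketched as a count of distinct accusers. The class and method names below are hypothetical; the reasoning behind the threshold, as I read it, is that with at most f Byzantine nodes, f + 1 distinct accusers must include at least one that is not Byzantine.

```python
from collections import defaultdict
from typing import Dict, Iterable, Set


class BadlistTracker:
    """Recipient-side view of the badlists carried in senders' messages."""

    def __init__(self, f: int) -> None:
        self.f = f
        # suspected node -> set of distinct senders that badlisted it
        self._accusers: Dict[str, Set[str]] = defaultdict(set)

    def record(self, sender: str, badlist: Iterable[str]) -> None:
        """Note the badlist attached to a message from `sender`."""
        for node in badlist:
            self._accusers[node].add(sender)

    def is_faulty(self, node: str) -> bool:
        """True once f + 1 different senders have badlisted `node`."""
        return len(self._accusers.get(node, ())) >= self.f + 1


if __name__ == "__main__":
    tracker = BadlistTracker(f=1)
    tracker.record("s1", ["x"])
    tracker.record("s1", ["x"])   # the same sender again does not count twice
    print(tracker.is_faulty("x"))
    tracker.record("s2", ["x"])
    print(tracker.is_faulty("x"))
```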
Proof • Real proofs do not appear in this paper; they appear in the technical report
…but here’s a bit • Theorem 1: The TRB protocol satisfies Termination, Agreement, Integrity and Non-Triviality
…and a bit more • Theorem 2: No node has a unilateral incentive to deviate from the protocol • Lemma 1: No rational node r benefits from delaying sending the “set-turn” message • Follows from penance • Lemma 2: No rational node r benefits from sending the “set-turn” message early • Sending early could cause a senderTO to be sent (this protocol uses synchronized clocks, and all messages are cryptographically signed)
…and the rest that’s mentioned in the paper • Lemma 3: No rational node r benefits from sending a malformed “set-turn” message. • The “set-turn” message only contains the turn number. Because of this, doing so reduces to either sending late (dealt with in Lemma 1) or sending early (dealt with in Lemma 2)
Level 2 • State machine replication is sufficient to support a backup service, but the overhead is unacceptable • 100 participants… 100 MB backed up… 10 GB of drive space • Assign work to individual nodes, using erasure codes to provide low-overhead fault-tolerant storage
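The overhead figure on this slide is simple multiplication, and the comparison with coded storage can be made concrete. The sketch below is illustrative only: it assumes 1 GB = 1000 MB, and it assumes a (k, n) coding scheme in which each backup is split into n fragments, any k of which reconstruct it, with fragments spread evenly across nodes. The values k = 10 and n = 20 are hypothetical and do not come from the slides.

```python
def replicated_per_node_mb(participants: int, data_mb: float) -> float:
    """Full replication: every node stores every participant's data."""
    return participants * data_mb


def coded_per_node_mb(data_mb: float, k: int, n: int) -> float:
    """Average per-node storage when each backup is coded at rate k/n.

    Each participant's data occupies data_mb * n / k across the system;
    spread evenly, the per-node average is the same figure.
    """
    if not 0 < k <= n:
        raise ValueError("need 0 < k <= n")
    return data_mb * n / k


if __name__ == "__main__":
    full = replicated_per_node_mb(participants=100, data_mb=100)
    coded = coded_per_node_mb(data_mb=100, k=10, n=20)
    print(f"replicated: {full / 1000:.1f} GB per node")
    print(f"coded (k=10, n=20): {coded / 1000:.1f} GB per node")
```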
Guaranteed Response • Direct communication is insufficient when nodes can behave rationally • We introduce a “witness” that overhears the conversation • This eliminates ambiguity • Messages are routed through this intermediary
Guaranteed Response • Node A sends a request to Node B through the witness • The witness stores the request, and enters RequestReceived state • Node B sends a response to Node A through the witness • The witness stores the response, and enters ResponseReceived
Guaranteed Response • Deviation from this protocol will cause the witness to notice either a timeout on the part of Node B or lying on the part of Node A
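The witness's role in the last three slides can be sketched as a small state machine. `RequestReceived` and `ResponseReceived` are the states named on the slide; the `IDLE` and `TIMED_OUT` states, the explicit `now` timestamps, and all method names are assumptions added for this illustration.

```python
from enum import Enum, auto
from typing import Optional


class State(Enum):
    IDLE = auto()               # hypothetical initial state
    REQUEST_RECEIVED = auto()
    RESPONSE_RECEIVED = auto()
    TIMED_OUT = auto()          # hypothetical: B missed the deadline


class Witness:
    """Intermediary that relays and records one request/response pair."""

    def __init__(self, timeout: float) -> None:
        self.timeout = timeout
        self.state = State.IDLE
        self.request: Optional[str] = None
        self.response: Optional[str] = None
        self._deadline = 0.0

    def on_request(self, request: str, now: float) -> None:
        """A's request arrives: store it and start the clock for B."""
        if self.state is not State.IDLE:
            raise RuntimeError("a request is already in progress")
        self.request = request
        self._deadline = now + self.timeout
        self.state = State.REQUEST_RECEIVED

    def tick(self, now: float) -> None:
        """Record a timeout if B has not answered by the deadline."""
        if self.state is State.REQUEST_RECEIVED and now > self._deadline:
            self.state = State.TIMED_OUT

    def on_response(self, response: str, now: float) -> bool:
        """B's response arrives: store it if it is still timely."""
        self.tick(now)
        if self.state is not State.REQUEST_RECEIVED:
            return False
        self.response = response
        self.state = State.RESPONSE_RECEIVED
        return True

    def contradicts(self, claimed_response: Optional[str]) -> bool:
        """True if A's claim about B's reply differs from the record."""
        return claimed_response != self.response


if __name__ == "__main__":
    w = Witness(timeout=5.0)
    w.on_request("get chunk 7", now=0.0)
    w.on_response("chunk-7-bytes", now=1.0)
    print(w.state.name)
    print(w.contradicts(None))  # A falsely claims B never answered
```

Because both messages pass through the witness, neither a silent B nor a lying A can leave the outcome ambiguous in this sketch.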