
Byzantine Techniques II

Byzantine Techniques II. Justin W. Hart, CS 614, 12/01/2005. Papers: BAR Fault Tolerance for Cooperative Services. Amitanand S. Aiyer et al. (SOSP 2005). Fault-scalable Byzantine Fault-Tolerant Services. Michael Abd-El-Malek et al. (SOSP 2005).


Presentation Transcript


  1. Byzantine Techniques II Justin W. Hart CS 614 12/01/2005

  2. Papers • BAR Fault Tolerance for Cooperative Services. Amitanand S. Aiyer et al. (SOSP 2005) • Fault-scalable Byzantine Fault-Tolerant Services. Michael Abd-El-Malek et al. (SOSP 2005)

  3. BAR Fault Tolerance for Cooperative Services • BAR Model • General Three-Level Architecture • BAR-B

  4. Motivation • “General approach to constructing cooperative services that span multiple administrative domains (MADs)”

  5. Why is this difficult? • Nodes are under control of multiple administrators • Broken – Byzantine behaviors. • Misconfigured, or configured with malicious intent. • Selfish – Rational behaviors • Alter the protocol to increase local utility

  6. Other models? • Byzantine Models – Account for Byzantine behavior, but do not handle rational behavior. • Rational Models – Account for rational behavior, but may break with Byzantine behavior.

  7. BAR Model • Byzantine • Behaving arbitrarily or maliciously • Altruistic • Execute the proposed program, whether it benefits them or not • Rational • Deviate from the proposed program for purposes of local benefit

  8. BART – BAR Tolerant • It’s a cruel world • At most (n-2)/3 nodes in the system are Byzantine • The rest are rational
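
The bound on this slide can be checked with a little arithmetic. The sketch below is only an illustration of the stated limit of (n-2)/3 Byzantine nodes; the function names are hypothetical, and rounding down to a whole number of nodes is an assumption.

```python
def max_byzantine(n: int) -> int:
    """Largest whole number of Byzantine nodes f with f <= (n - 2) / 3."""
    if n < 2:
        return 0
    return (n - 2) // 3  # assumption: round down to a whole node count


def min_nodes(f: int) -> int:
    """Smallest system size that tolerates f Byzantine nodes under this bound."""
    return 3 * f + 2


for n in (4, 5, 8, 100):
    print(n, "nodes tolerate at most", max_byzantine(n), "Byzantine nodes")
```

Read the other way round, tolerating f Byzantine nodes needs at least 3f + 2 nodes in total.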

  9. Two classes of protocols • Incentive-Compatible Byzantine Fault Tolerant (IC-BFT) • Guarantees a set of safety and liveness properties • It is in the best interest of rational nodes to follow the protocol exactly • Byzantine Altruistic Rational Tolerant • Guarantees a set of safety and liveness properties despite the presence of rational nodes • IC-BFT is a subset of BART

  10. An important concept • It isn’t enough for a protocol to survive drills of a handful of attacks. It must provably provide its guarantees.

  11. A flavor of things to come • Protocol builds on Practical Byzantine Fault Tolerance in order to combat Byzantine behavior • Protocol uses game theoretical concepts in order to combat rational behavior

  12. A taste of Nash Equilibrium

  13. …and the nodes are starving! • Nodes require access to a state machine in order to complete their objectives • Protocol contains methods for punishing rational nodes, including denying them access to the state machine

  14. An expensive notion of identity • Identity is established through cryptographic keys assigned through a trusted authority • Prevents Sybil attacks • Bounds the number of Byzantine nodes • Gives rational nodes reason to consider long-term consequences of their actions • Gives real world grounding to identity

  15. Assumptions about rational nodes • “Receive long-term benefit from staying in the protocol” • “Conservative when computing the impact of Byzantine nodes on their utility” • “If the protocol provides a Nash equilibrium, then all rational nodes will follow it” • “Rational nodes do not collude…colluding nodes are classified as Byzantine”

  16. Byzantine nodes • Byzantine fault model • Strong adversary • Adversary can coordinate collusion attacks

  17. Important concepts • Promptness principle • Proof of Misbehavior (POM) • Cost balancing

  18. Promptness principle • If a rational node gains no benefit from delaying a message, it will send it as soon as possible

  19. Proof of Misbehavior (POM) • Self-contained, cryptographic proof of wrongdoing • Provides accountability to nodes for their actions

  20. Example of POM • Node A requests that Node B store a chunk • Node B replies that it has stored the chunk • Later Node A requests that chunk back • Node B sends back random garbage (it hadn’t stored the chunk) and a signature • Because Node A stored a hash of the chunk, it can demonstrate misbehavior on part of Node B
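
The chunk-storage example can be sketched roughly as follows. This is a minimal illustration, not the paper's message format: the function names and message layout are hypothetical, and HMAC with a key for Node B is used only as a stand-in for a real public-key signature so that the sketch runs with the standard library. With a real signature scheme, any third party could check the proof without holding a secret.

```python
import hashlib
import hmac


def sign(key: bytes, message: bytes) -> bytes:
    # ASSUMPTION: HMAC stands in for a public-key signature by Node B.
    return hmac.new(key, message, hashlib.sha256).digest()


def digest(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()


def store_receipt(b_key: bytes, chunk_hash: bytes) -> bytes:
    """Node B's signed claim that it stored the chunk with this hash."""
    return sign(b_key, b"stored:" + chunk_hash)


def retrieve_response(b_key: bytes, data: bytes):
    """Node B's signed reply to a later retrieval request."""
    return data, sign(b_key, b"data:" + digest(data))


def is_proof_of_misbehavior(b_key, chunk_hash, receipt, data, data_sig) -> bool:
    """True when B signed both statements and the returned data does not
    match the hash that B acknowledged storing."""
    receipt_ok = hmac.compare_digest(receipt, sign(b_key, b"stored:" + chunk_hash))
    response_ok = hmac.compare_digest(data_sig, sign(b_key, b"data:" + digest(data)))
    return receipt_ok and response_ok and digest(data) != chunk_hash


b_key = b"node-b-key"          # hypothetical key material
chunk = b"backup chunk"
receipt = store_receipt(b_key, digest(chunk))
garbage, sig = retrieve_response(b_key, b"random garbage")
print(is_proof_of_misbehavior(b_key, digest(chunk), receipt, garbage, sig))
```

Both signed statements are needed: the receipt ties Node B to the hash, and the signed response ties Node B to the bad data, so Node A cannot fabricate the proof on its own.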

  21. …but it’s a bit more complicated than that! • This is a rather simple behavior to combat: “aggressively Byzantine” behavior.

  22. Passive-aggressive behaviors • Harder cases than “aggressively Byzantine” • A malicious Node A could merely lie about misbehavior on the part of Node B • A node could exploit non-determinism in order to shirk work

  23. Cost Balancing • If two behaviors have the same cost, there is no reason to choose the wrong one

  24. Three-Level Architecture

  25. Level 1 • Unilaterally deny service to nodes that fail to deliver messages • “Tit-for-Tat” • Balance costs • No incentive to make the wrong choice • Penance • Unilaterally impose extra work on nodes with untimely responses

  26. Level 2 • Failure to respond to a request by a state machine will generate a POM from a quorum of nodes in the state machine

  27. Level 3 • Makes use of reliable work assignment • Needs only to provide sufficient information to identify valid request/response pairs

  28. Nuts and Bolts • Level 1 • Level 2

  29. Level 1 • Ensure long-term benefit to participants • The RSM rotates the leadership role to participants. • Participants want to stay in the system in order to control the RSM and complete their protocols • Limit non-determinism • Self interested nodes could hide behind non-determinism to shirk work • Use Terminating Reliable Broadcast, rather than consensus. • In TRB, only the sender can propose a value • Other nodes can only adopt this value, or choose a default value

  30. Level 1 • Mitigate the effects of residual non-determinism • Cost balancing • The protocol preferred choice is no more expensive than any other • Encouraging timeliness • Nodes can inflict sanctions on untimely messages • Enforce predictable communication patterns • Nodes have to have participated at every step in order to have the opportunity to issue a command

  31. Terminating Reliable Broadcast

  32. 3f+2 nodes, rather than 3f+1 • Suppose a sender “s” is slow • The same group of nodes now want to determine that “s” is slow • A new leader is elected • Every node but “s” wants a timely conclusion to this, in order to get their turn to propose a value to the state machine • “s” is not allowed to participate in this quorum

  33. TRB provides a few guarantees • They differ during periods of synchrony and periods of asynchrony

  34. In synchrony • Termination • Every non-Byzantine process delivers exactly one message • Agreement • If one non-Byzantine process delivers a message m, then all non-Byzantine processes eventually deliver m

  35. In asynchrony • Integrity • If a non-Byzantine process delivers m, then the sender sent m • Non-Triviality • If the sender is non-Byzantine and sends m, then the sender eventually delivers m

  36. Message Queue • Enforces predictable communication patterns • Bubbles • A simple retaliation policy • Node A’s message queue is filled with messages that it intends to send to Node B • This message queue is interleaved with bubbles. • Bubbles contain predicates indicating messages expected from B • No message except the expected predicate from B can fill the bubble • No messages in A’s queue will go to B until B fills the bubble
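
One way to picture the bubble mechanism is the sketch below. It is a simplified model under stated assumptions: the class and method names are hypothetical, messages are plain strings, and a bubble's predicate is an ordinary Python callable.

```python
from collections import deque
from typing import Callable, Deque, List, Union


class Bubble:
    """Placeholder for a message that A expects from B."""

    def __init__(self, expected: Callable[[str], bool]):
        self.expected = expected


class MessageQueue:
    """Node A's outgoing queue for a single peer B (hypothetical model)."""

    def __init__(self) -> None:
        self._items: Deque[Union[str, Bubble]] = deque()

    def enqueue_message(self, msg: str) -> None:
        self._items.append(msg)

    def enqueue_bubble(self, expected: Callable[[str], bool]) -> None:
        self._items.append(Bubble(expected))

    def sendable(self) -> List[str]:
        """Pop and return messages up to the first unfilled bubble."""
        out: List[str] = []
        while self._items and not isinstance(self._items[0], Bubble):
            out.append(self._items.popleft())
        return out

    def on_receive(self, msg: str) -> bool:
        """Fill the bubble at the head of the queue if msg satisfies it."""
        head = self._items[0] if self._items else None
        if isinstance(head, Bubble) and head.expected(msg):
            self._items.popleft()
            return True
        return False


q = MessageQueue()
q.enqueue_message("request-1")
q.enqueue_bubble(lambda m: m == "reply-1")
q.enqueue_message("request-2")
print(q.sendable())            # only request-1 may go out
print(q.on_receive("reply-1")) # B fills the bubble
print(q.sendable())            # now request-2 may go out
```

Until B sends the expected message, everything queued behind the bubble stays put, which is the retaliation policy in miniature.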

  37. Balanced Messages • We’ve already discussed this quite a bit • We assure this at this level of the protocol • This is where we get our gigantic timeout message

  38. Penance • Untimely vector • Tracks a node’s perception of the responsiveness of other nodes • When a node becomes a sender, it includes its untimely vector with the message

  39. Penance • All nodes but the sender receive penance messages from each node. • Because of bubbles, each untimely node must send a penance message back in order to continue using the system • This provides a penalty to those nodes • The sender is excluded from this process, because it may be motivated to lie in its penance vector, in order to avoid the work of transmitting penance messages

  40. Timeouts and Garbage Collection • Set-turn timeout • Timeout to take leadership away from the sender • Initially 10 seconds in this implementation, in order to overcome all expected network delays • Can only be changed by the sender • Max_response_time • Time at which a node is removed from the system, its messages discarded and its resources garbage collected • Set to 1 week or 1 month in the prototypes

  41. Global Punishment • Badlists • Transform local suspicion into POMs • Suspicion is recorded in a local node’s badlist • Sender includes its badlist with its message • If, over time, recipients see a node in f + 1 different senders’ badlists, then they, too, consider that node to be faulty
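
The f + 1 rule can be sketched as a small tracker. The class and method names are hypothetical, and node identities are plain strings for illustration. The threshold works because at most f senders are Byzantine, so f + 1 distinct accusers must include at least one non-Byzantine node.

```python
from collections import defaultdict
from typing import Dict, Iterable, Set


class BadlistTracker:
    """Counts how many distinct senders have badlisted each node."""

    def __init__(self, f: int) -> None:
        self.f = f
        self._accusers: Dict[str, Set[str]] = defaultdict(set)

    def observe(self, sender: str, badlist: Iterable[str]) -> None:
        """Record the badlist that arrived with a sender's message."""
        for suspect in badlist:
            self._accusers[suspect].add(sender)

    def is_faulty(self, node: str) -> bool:
        """True once f + 1 different senders have listed the node."""
        return len(self._accusers.get(node, ())) >= self.f + 1


tracker = BadlistTracker(f=1)
tracker.observe("n1", ["n4"])
tracker.observe("n1", ["n4"])   # the same sender again does not count twice
print(tracker.is_faulty("n4"))
tracker.observe("n2", ["n4"])
print(tracker.is_faulty("n4"))
```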

  42. Proof • Real proofs do not appear in this paper, they appear in the technical report

  43. …but here’s a bit • Theorem 1: The TRB protocol satisfies Termination, Agreement, Integrity and Non-Triviality

  44. …and a bit more • Theorem 2: No node has a unilateral incentive to deviate from the protocol • Lemma 1: No rational node r benefits from delaying sending the “set-turn” message • Follows from penance • Lemma 2: No rational node r benefits from sending the “set-turn” message early • Sending early could cause a senderTO to be sent (this protocol uses synchronized clocks, and all messages are cryptographically signed)

  45. …and the rest that’s mentioned in the paper • Lemma 3: No rational node r benefits from sending a malformed “set-turn” message. • The “set-turn” message only contains the turn number. Because of this, doing so reduces to either sending late (dealt with in Lemma 1) or sending early (dealt with in Lemma 2)

  46. Level 2 • State machine replication is sufficient to support a backup service, but the overhead is unacceptable • 100 participants… 100 MB backed up… 10 GB of drive space • Assign work to individual nodes, using arithmetic codes to provide low-overhead fault-tolerant storage
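
The arithmetic behind the 10 GB figure, and the saving from coded storage, can be worked through as below. The full-replication number follows directly from the slide. The coded case is an assumption for illustration only: it supposes a k-of-n code in which any k of n fragments rebuild the data, with hypothetical parameters k = 10 and n = 20 and load spread evenly across nodes.

```python
def replicated_storage_mb(participants: int, backup_mb: float) -> float:
    """Per-node storage when every node stores every participant's backup."""
    return participants * backup_mb


def coded_storage_mb(backup_mb: float, k: int, n: int) -> float:
    """Per-node storage under a hypothetical k-of-n code.

    Each backup is expanded by a factor of n / k and, assuming an even
    spread, each node holds about that much data on behalf of others.
    """
    if not 0 < k <= n:
        raise ValueError("need 0 < k <= n")
    return backup_mb * n / k


print(replicated_storage_mb(100, 100))   # 10000 MB, i.e. 10 GB
print(coded_storage_mb(100, k=10, n=20)) # 200 MB with the assumed parameters
```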

  47. Guaranteed Response • Direct communication is insufficient when nodes can behave rationally • We introduce a “witness” that overhears the conversation • This eliminates ambiguity • Messages are routed through this intermediary

  48. Guaranteed Response

  49. Guaranteed Response • Node A sends a request to Node B through the witness • The witness stores the request, and enters RequestReceived state • Node B sends a response to Node A through the witness • The witness stores the response, and enters ResponseReceived

  50. Guaranteed Response • Any deviation from this protocol is visible to the witness, which will notice either a timeout on the part of Node B or a lie on the part of Node A
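
The witness's role can be sketched as a small state machine. The RequestReceived and ResponseReceived states come from the slides; everything else is an assumption made for illustration, including the idle state, the explicit `now` timestamps, the timeout parameter and the method names.

```python
from enum import Enum, auto


class WitnessState(Enum):
    IDLE = auto()               # hypothetical initial state
    REQUEST_RECEIVED = auto()
    RESPONSE_RECEIVED = auto()


class Witness:
    """Relays one request/response pair between Node A and Node B."""

    def __init__(self, timeout: float) -> None:
        self.timeout = timeout
        self.state = WitnessState.IDLE
        self.request = None
        self.response = None
        self.request_time = None

    def on_request(self, request, now: float) -> None:
        if self.state is not WitnessState.IDLE:
            raise ValueError("a request is already in progress")
        self.request, self.request_time = request, now
        self.state = WitnessState.REQUEST_RECEIVED

    def on_response(self, response, now: float) -> None:
        if self.state is not WitnessState.REQUEST_RECEIVED:
            raise ValueError("no request is awaiting a response")
        self.response = response
        self.state = WitnessState.RESPONSE_RECEIVED

    def responder_timed_out(self, now: float) -> bool:
        """Node B never answered within the timeout."""
        return (self.state is WitnessState.REQUEST_RECEIVED
                and now - self.request_time > self.timeout)

    def requester_lied_about(self, claimed_response) -> bool:
        """Node A claims a response that differs from the stored one."""
        return (self.state is WitnessState.RESPONSE_RECEIVED
                and claimed_response != self.response)


w = Witness(timeout=5.0)
w.on_request("get chunk 7", now=0.0)
print(w.responder_timed_out(now=6.0))
w.on_response("chunk 7 bytes", now=6.5)
print(w.requester_lied_about("never got it"))
```

Because both messages pass through the witness and are stored there, neither side can later dispute what was sent.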
