Brahms: Byzantine-Resilient Random Membership Sampling
Bortnikov, Gurevich, Keidar, Kliot, and Shraer
Edward (Eddie) Bortnikov Maxim (Max) Gurevich Idit Keidar Alexander (Alex) Shraer Gabriel (Gabi) Kliot
Why Random Node Sampling
• Gossip partners
  • Random choices make gossip protocols work
• Unstructured overlay networks
  • E.g., among super-peers
  • Random links provide robustness, expansion
• Gathering statistics
  • Probe random nodes
• Choosing cache locations
The Setting
• Many nodes – n
  • 10,000s, 100,000s, 1,000,000s, …
• Come and go
  • Churn
• Every joining node knows some others
  • Connectivity
• Full network
  • Like the Internet
• Byzantine failures
Byzantine Fault Tolerance (BFT)
• Faulty nodes (portion f)
  • Arbitrary behavior: bugs, intrusions, selfishness
  • Choose f ids arbitrarily
• No CA, but no panacea for Sybil attacks
• May want to bias samples
  • Isolate nodes, DoS nodes
  • Promote themselves, bias statistics
Previous Work
• Benign gossip membership
  • Small (logarithmic) views
  • Robust to churn and benign failures
  • Empirical study [Lpbcast, Scamp, Cyclon, PSS]
  • Analytical study [Allavena et al.]
  • Never proven uniform samples
  • Spatial correlation among neighbors’ views [PSS]
• Byzantine-resilient gossip
  • Full views [MMR, MS, Fireflies, Drum, BAR]
  • Small views, some resilience [SPSS]
  • We are not aware of any analytical work
Our Contributions
• Gossip-based BFT membership
  • Linear portion f of Byzantine failures
  • O(n^(1/3))-size partial views
  • Correct nodes remain connected
  • Mathematically analyzed, validated in simulations
• Random sampling
  • Novel memory-efficient approach
  • Converges to proven independent uniform samples
(callouts: “The view is not all bad”; “Better than benign gossip”)
Brahms
• Sampling – local component
• Gossip – distributed component
(diagram: gossip maintains the view; the view feeds the Sampler, which outputs the sample)
Sampler Building Block
• Input: data stream, one element at a time
  • Bias: some values appear more than others
  • Used with stream of gossiped ids
• Output: uniform random sample
  • Of unique elements seen thus far
  • Independent of other Samplers
  • One element at a time (converging)
(diagram: next → Sampler → sample)
Sampler Implementation
• Memory: stores one element at a time
• Use random hash function h
  • From a min-wise independent family [Broder et al.]
  • For each set X and every x ∈ X: Pr[h(x) = min h(X)] = 1/|X|
• init: choose a random hash function
• next: keep the id with the smallest hash seen so far
• sample: return that id
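The init/next/sample interface above can be sketched as follows. Salted SHA-256 stands in for a truly min-wise independent hash; that substitution, and all names here, are illustrative assumptions rather than the paper’s construction:

```python
import hashlib

class Sampler:
    """Keeps the stream element with the smallest hash value (sketch).
    init: draw a fresh salt (approximating an independent random hash);
    next: keep the id with the smallest hash seen so far;
    sample: return that id."""

    def __init__(self, salt):
        self.salt = salt          # stands in for choosing a random h
        self.best_id = None
        self.best_hash = None

    def h(self, node_id):
        # Salted SHA-256 as a stand-in for a min-wise independent hash.
        return hashlib.sha256((self.salt + node_id).encode()).digest()

    def next(self, node_id):
        hv = self.h(node_id)
        if self.best_hash is None or hv < self.best_hash:
            self.best_id, self.best_hash = node_id, hv

    def sample(self):
        return self.best_id
```

Because the retained id depends only on the set of unique ids seen, not on their multiplicities or order, an id that is pushed many times by an attacker gains no advantage in the sample.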
Component S: Sampling and Validation
• Id stream from gossip feeds an array of Samplers (init, next)
• Each Sampler’s current choice is checked by a Validator (using pings) before it is emitted as a sample
Gossip Process
• Provides the stream of ids for S
• Needs to ensure connectivity
• Use a bag of tricks to overcome attacks
Gossip-Based Membership Primer
• Small (sub-linear) local view V
• V constantly changes – essential due to churn
• Typically evolves in (unsynchronized) rounds
• Push: send my id to some node in V
  • Reinforce underrepresented nodes
• Pull: retrieve view from some node in V
  • Spread knowledge within the network
• [Allavena et al. ’05]: both are essential
  • Low probability for partitions and star topologies
Brahms Gossip Rounds
• Each round:
  • Send pushes, pulls to random nodes from V
  • Wait to receive pulls, pushes
  • Update S with all received ids
  • (Sometimes) re-compute V
• Tricky! Beware of adversary attacks
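A minimal, transport-free sketch of one round; `S` is assumed to expose the `next` interface of the sampling component, and message delivery, timeouts, and the view re-computation step are elided:

```python
import random

def gossip_round(view, received_pushes, pulled_views, S):
    """Skeleton of one Brahms round for a single node (sketch).
    Chooses random push/pull targets from the view and feeds every
    id observed this round into the sampling component S."""
    push_target = random.choice(view)   # send my id here
    pull_target = random.choice(view)   # ask this node for its view
    # Update S with all received ids: push senders and pulled views.
    for sender_id in received_pushes:
        S.next(sender_id)
    for v in pulled_views:
        for node_id in v:
            S.next(node_id)
    return push_target, pull_target
```

The essential point is that S sees the raw id stream every round, regardless of whether the view itself is updated.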
Problem 1: Push Drowning
(animation: correct pushes from Alice, Bob, Carol, Dana, and Ed are drowned out by faulty pushes from Mallory, M&M, and Malfoy)
Trick 1: Rate-Limit Pushes
• Use limited messages to bound faulty pushes system-wide
  • E.g., computational puzzles / virtual currency
• Faulty nodes can send portion p of them
• Views won’t be all bad
Problem 2: Quick Isolation
(animation: the adversary concentrates faulty pushes on one node – “Ha! She’s out! Now let’s move on to the next guy!”)
Trick 2: Detection & Recovery
• Do not re-compute V in rounds when too many pushes are received
  • “Hey! I’m swamped! I better ignore all of ’em pushes…”
• Slows down isolation; does not prevent it
Trick 3: Balance Pulls & Pushes
• Control contribution of push – α|V| ids – versus contribution of pull – β|V| ids
  • Parameters α, β
• Pull-only ⇒ eventually all faulty ids
  • Pull from faulty nodes – all faulty ids; from correct nodes – some faulty ids
• Push-only ⇒ quick isolation of attacked node
• Push ensures: system-wide, not all ids are bad
• Pull slows down (but does not prevent) isolation
Trick 4: History Samples
• Attacker influences both push and pull
• Feedback: γ|V| random ids from S
  • Parameters α + β + γ = 1
• Attacker loses control – samples are eventually perfectly uniform
  • “Yoo-hoo, is there any good process out there?”
View and Sample Maintenance
(diagram: α|V| pushed ids, β|V| pulled ids, and γ|V| ids fed back from S are combined into the new view V; all observed ids also flow into S, which outputs samples)
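The mixing rule in the diagram can be sketched as follows. The function name and the over-threshold guard are illustrative; in particular, `push_limit` is an assumed stand-in for trick 2’s exact detection condition:

```python
import random

def recompute_view(view_size, pushed_ids, pulled_ids, history_ids,
                   alpha, beta, gamma, push_limit):
    """Build the next view from alpha|V| pushed ids, beta|V| pulled ids,
    and gamma|V| history samples fed back from S (sketch).
    Returns None, meaning 'keep the old view', when too many pushes
    arrived this round (trick 2); push_limit is an assumed threshold."""
    if len(pushed_ids) > push_limit:
        return None  # likely under a push flood: skip the update

    def take(ids, portion):
        # Sample up to portion * |V| ids from one source.
        k = min(len(ids), int(portion * view_size))
        return random.sample(ids, k)

    return (take(pushed_ids, alpha)
            + take(pulled_ids, beta)
            + take(history_ids, gamma))
```

Skipping the update under a push flood is what keeps a targeted node’s view from being overwritten in a single bad round.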
Key Property
• Samples take time to help
  • Assume attack starts when samples are empty
• With appropriate parameters (e.g., …): time to isolation > time to convergence
  • Lower bound on isolation time proven using tricks 1, 2, 3 (not using samples yet)
  • Upper bound on convergence time: until some good sample persists forever
• Self-healing from partitions
History Samples: Rationale
• Judicious use essential
  • Bootstrap, avoid slow convergence
  • Deal with churn
• With a little bit of history samples (10%) we can cope with any adversary
  • Amplification!
Analysis
• Sampling – mathematical analysis
• Connectivity – analysis and simulation
• Full system simulation
Connectivity ⇒ Sampling
• Theorem: If the overlay remains connected indefinitely, samples are eventually uniform
Sampling ⇒ Connectivity Ever After
• Perfect sample of a Sampler with hash h: the id with the lowest h(id) system-wide
  • If correct, it sticks once the Sampler sees it
• Correct perfect sample ⇒ self-healing from partitions ever after
• We analyze PSP(t) – the probability of a perfect sample at time t
Convergence to 1st Perfect Sample • n = 1000 • f = 0.2 • 40% unique ids in stream
Scalability
• Analysis says: for scalability, we want small and constant convergence time
  • Independent of system size, e.g., when …
Connectivity Analysis 1: Balanced Attacks
• Attack all nodes the same
  • Maximizes faulty ids in views system-wide in any single round
• If repeated, the system converges to a fixed-point ratio of faulty ids in views, which is < 1 if
  • γ = 0 (no history) and p < 1/3, or
  • History samples are used, for any p
• There are always good ids in views!
Fixed Point Analysis: Push
(diagram: at time t, node 1’s correct push to node i is lost while a faulty push arrives; at time t+1, the faulty id is in node i’s view)
• x(t) – portion of faulty nodes in views at round t
• Portion of faulty pushes to correct nodes: p / (p + (1 − p)(1 − x(t)))
Fixed Point Analysis: Pull
(diagram: at time t, node i pulls; a pull answered by a faulty node yields a faulty id, and one answered by a correct node yields a faulty id with probability x(t))
• Pulled id is faulty with probability x(t) + (1 − x(t))·x(t)
• Combining the push, pull, and history contributions:
  E[x(t+1)] = α · p / (p + (1 − p)(1 − x(t))) + β · (x(t) + (1 − x(t))·x(t)) + γf
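The recurrence can be iterated numerically to find the fixed point. The sketch below assumes the weighted form with α, β, γ and treats the history contribution as the constant γf, as on the slide:

```python
def faulty_fraction_step(x, p, alpha, beta, gamma, f):
    """One application of the recurrence for x(t+1) (sketch)."""
    push = p / (p + (1 - p) * (1 - x))   # faulty portion of pushes
    pull = x + (1 - x) * x               # faulty portion of pulled ids
    return alpha * push + beta * pull + gamma * f

def fixed_point(p, alpha, beta, gamma, f, rounds=500):
    """Iterate the recurrence from an all-correct start (x = 0)."""
    x = 0.0
    for _ in range(rounds):
        x = faulty_fraction_step(x, p, alpha, beta, gamma, f)
    return x
```

With p = 0.2, α = β = 0.5, γ = 0 the iteration settles strictly below 1 (around 0.64), consistent with the p < 1/3 condition above; with any γ > 0 the right-hand side at x = 1 is α + β + γf < 1, so the fixed point stays below 1 for any p.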
Faulty Ids in Fixed Point
• History samples assumed perfect in the analysis; real history used in simulations
• With a few history samples, any portion of bad nodes can be tolerated
• Fixed points and convergence perfectly validated
Convergence to Fixed Point • n = 1000 • p = 0.2 • α=β=0.5 • γ=0
Connectivity Analysis 2: Targeted Attack – Roadmap
• Step 1: analysis without history samples
  • Isolation in logarithmic time
  • … but not too fast, thanks to tricks 1, 2, 3
• Step 2: analysis of history sample convergence
  • Time-to-perfect-sample < time-to-isolation
• Step 3: putting it all together
  • Empirical evaluation
  • No isolation happens
Targeted Attack – Step 1
• Q: How fast (lower bound) can an attacker isolate one node from the rest?
• Worst-case assumptions
  • No use of history samples (γ = 0)
  • Unrealistically strong adversary
    • Observes the exact number of correct pushes and complements it to α|V|
  • Attacked node not represented initially
  • Balanced attack on the rest of the system
Isolation w/out History Samples
• n = 1000, p = 0.2, α = β = 0.5, γ = 0
(graph callouts: isolation time for |V| = 60; isolation times depend on α, β, p)
Step 2: Sample Convergence
• n = 1000, p = 0.2, α = β = 0.5, γ = 0, 40% unique ids
(graph callouts: perfect sample in 2–3 rounds; empirically verified)
Step 3: Putting It All Together – No Isolation with History Samples
• n = 1000, p = 0.2, α = β = 0.45, γ = 0.1
(graph callout: works well despite small PSP)
Sample Convergence (Balanced)
• p = 0.2, α = β = 0.45, γ = 0.1
(graph callout: convergence twice as fast with …)
Summary
• O(n^(1/3))-size views
• Resists Byzantine failures of a linear portion of nodes
• Converges to proven uniform samples
• Precise analysis of the impact of failures