Brahms

Brahms Byzantine-Resilient Random Membership Sampling Bortnikov, Gurevich, Keidar, Kliot, and Shraer

Edward (Eddie) Bortnikov Maxim (Max) Gurevich Idit Keidar Alexander (Alex) Shraer Gabriel (Gabi) Kliot

Why Random Node Sampling • Gossip partners • Random choices make gossip protocols work • Unstructured overlay networks • E.g., among super-peers • Random links provide robustness, expansion • Gathering statistics • Probe random nodes • Choosing cache locations

The Setting • Many nodes – n • 10,000s, 100,000s, 1,000,000s, … • Come and go • Churn • Every joining node knows some others • Connectivity • Full network • Like the Internet • Byzantine failures

Byzantine Fault Tolerance (BFT) • Faulty nodes (portion f) • Arbitrary behavior: bugs, intrusions, selfishness • Choose f ids arbitrarily • No CA, but no panacea for Cybil attacks • May want to bias samples • Isolate nodes, DoS nodes • Promote themselves, bias statistics

Previous Work • Benign gossip membership • Small (logarithmic) views • Robust to churn and benign failures • Empirical study [Lpbcast,Scamp,Cyclon,PSS] • Analytical study [Allavena et al.] • Never proven uniform samples • Spatial correlation among neighbors’ views [PSS] • Byzantine-resilient gossip • Full views [MMR,MS,Fireflies,Drum,BAR] • Small views, some resilience [SPSS] • We are not aware of any analytical work

Our Contributions • Gossip-based BFT membership • Linear portion f of Byzantine failures • O(n1/3)-size partial views • Correct nodes remain connected • Mathematically analyzed, validated in simulations • Random sampling • Novel memory-efficient approach • Converges to proven independent uniform samples The view is not all bad Better than benign gossip

Brahms Sampling - local component Gossip - distributed component Gossip view Sampler sample

Sampler Building Block • Input: data stream, one element at a time • Bias: some values appear more than others • Used with stream of gossiped ids • Output: uniform random sample • of unique elements seen thus far • Independent of other Samplers • One element at a time (converging) next Sampler sample

Sampler Implementation • Memory: stores one element at a time • Use random hash function h • From min-wise independent family [Broder et al.] • For each set X, and all , next init Sampler Choose random hash function Keep id with smallest hash so far sample

Component S: Sampling and Validation id streamfrom gossip init next Sampler Sampler Sampler Sampler using pings sample Validator Validator Validator Validator S

Gossip Process • Provides the stream of ids for S • Needs to ensure connectivity • Use a bag of tricks to overcome attacks

Gossip-Based Membership Primer • Small (sub-linear) local view V • V constantly changes - essential due to churn • Typically, evolves in (unsynchronized) rounds • Push: send my id to some node in V • Reinforce underrepresented nodes • Pull: retrieve view from some node in V • Spread knowledge within the network • [Allavena et al. ‘05]: both are essential • Low probability for partitions and star topologies

Brahms Gossip Rounds • Each round: • Sendpushes, pulls to random nodes from V • Wait to receive pulls, pushes • Update S with all received ids • (Sometimes) re-compute V • Tricky! Beware of adversary attacks

Problem 1: Push Drowning Push Alice A E M M Push Bob Push Mallory Push Carol M B M Push Ed Push Dana Push M&M D Push Malfoy M

Trick 1: Rate-Limit Pushes • Use limited messages to bound faulty pushes system-wide • E.g., computational puzzles/virtual currency • Faulty nodes can send portion p of them • Views won’t be all bad

Problem 2: Quick Isolation Ha! She’s out! Now let’s move on to the next guy! Push Alice A E M Push Bob Push Carol Push Mallory Push Ed Push Dana Push M&M Push Malfoy C D

Trick 2: Detection & Recovery • Do not re-compute V in rounds when too many pushes are received • Slows down isolation;does not prevent it Push Bob Push Mallory Hey! I’m swamped! I better ignore all of ‘em pushes… Push M&M Push Malfoy

Trick 3: Balance Pulls & Pushes • Control contribution of push - α|V| ids versus contribution of pull - β|V| ids • Parameters α, β • Pull-only  eventually all faulty ids • Pull from faulty nodes – all faulty ids, from correct nodes – some faulty ids • Push-only  quick isolation of attacked node • Push ensures: system-wide not all bad ids • Pull slows down (does not prevent) isolation

Trick 4: History Samples • Attacker influences both push and pull • Feedback γ|V| random ids from S • Parameters α + β + γ = 1 • Attacker loses control - samples are eventually perfectly uniform Yoo-hoo, is there any good process out there?

View and Sample Maintenance Pushed ids Pulled ids S |V|  |V| |V| View V Sample

Key Property • Samples take time to help • Assume attack starts when samples are empty • With appropriate parameters • E.g., • Time to isolation > time toconvergence Prove lower bound using tricks 1,2,3(not using samples yet) Prove upper bound until some good sample persists forever Self-healing from partitions

History Samples: Rationale • Judicious use essential • Bootstrap, avoid slow convergence • Deal with churn • With a little bit of history samples (10%) we can cope with any adversary • Amplification!

Analysis Sampling - mathematical analysis Connectivity - analysis and simulation Full system simulation

Connectivity  Sampling • Theorem:If overlay remains connected indefinitely, samples are eventuallyuniform

Sampling  Connectivity Ever After • Perfect sample of a sampler with hash h: the id with the lowest h(id) system-wide • If correct, sticks once the sampler sees it • Correct perfect sample  self-healing from partitions ever after • We analyze PSP(t) – probability of perfect sample at time t

Convergence to 1st Perfect Sample • n = 1000 • f = 0.2 • 40% unique ids in stream

Scalability • Analysis says: • For scalability, want small and constant convergence time • independent of system size, e.g., when

Connectivity Analysis 1: Balanced Attacks • Attack all nodes the same • Maximizes faulty ids in views system-wide • in any single round • If repeated, system converges to fixed point ratio of faulty ids in views, which is < 1 if • γ=0 (no history) and p < 1/3 or • History samples are used, any p There are always good ids in views!

Fixed Point Analysis: Push Local view node i Local view node 1 i Time t: lost push push push from faulty node 1 Time t+1: • x(t) – portion of faulty nodes in views at round t; • portion of faulty pushes to correct nodes : • p / ( p + ( 1 − p )( 1 − x(t) ) )

Fixed Point Analysis: Pull Local view node i Local view node 1 i Time t: pull from i: faulty with probability x(t) pull from faulty Time t+1: E[x(t+1)] =  p / (p + (1 − p)(1 − x(t))) +  ( x(t) + (1-x(t))x(t) ) + γf

Faulty Ids in Fixed Point Assumed perfect in analysis, real history in simulations With a few history samples, any portion of bad nodes can be tolerated Perfectly validated fixed pointsand convergence

Convergence to Fixed Point • n = 1000 • p = 0.2 • α=β=0.5 • γ=0

Connectivity Analysis 2:Targeted Attack – Roadmap • Step 1: analysis without history samples • Isolation in logarithmic time • … but not too fast, thanks to tricks 1,2,3 • Step 2: analysis of history sample convergence • Time-to-perfect-sample < Time-to-Isolation • Step 3: putting it all together • Empirical evaluation • No isolation happens

Targeted Attack – Step 1 • Q: How fast (lower bound) can an attacker isolate one node from the rest? • Worst-case assumptions • No use of history samples ( = 0) • Unrealistically strong adversary • Observes the exact number of correct pushes and complements it to α|V| • Attacked node not represented initially • Balanced attack on the rest of the system

Isolation w/out History Samples • n = 1000 • p = 0.2 • α=β=0.5 • γ=0 Isolation time for |V|=60 Depend on α,β,p

Step 2: Sample Convergence • n = 1000 • p = 0.2 • α=β=0.5, γ=0 • 40% unique ids Perfect sample in 2-3 rounds Empirically verified

Step 3: Putting It All TogetherNo Isolation with History Samples • n = 1000 • p = 0.2 • α=β=0.45 • γ=0.1 Works well despite small PSP

Sample Convergence (Balanced) • p = 0.2 • α=β=0.45 • γ=0.1 Convergence twice as fast with

Summary O(n1/3)-size views Resist Byzantine failures of linear portion Converge to proven uniform samples Precise analysis of impact of failures

Brahms

Brahms

Presentation Transcript

Johannes Brahms

Brahms

Johannes Brahms

Johannes Brahms

Johannes Brahms

Johannes Brahms

BRAHMS 8

Johannes Brahms

BRAHMS@RHIC

Johannes Brahms

SSA in BRAHMS

SSA in BRAHMS

SSA in BRAHMS

Brahms TPC Digitization

BRAHMS Highlights

The Brahms Sound

Measurements with BRAHMS

BRAHMS overview

Brahms Project Proposal

Johannes Brahms

SSA in BRAHMS