Two Techniques for Proving Lower Bounds

Hagit Attiya, Technion

Goal of this Presentation • Describe two common techniques for proving lower bounds in distributed computing: • Information theory arguments • Covering • Variations • Applications

My always first slide… [Figure labels: problem; nicer system architecture; algorithm; implementation; real system architecture]

Part I Information Theory Arguments

Overview • Bound the flow of information among processes (and memory) • Show that information takes long to be acquired • Argue that solving a particular problem requires information about many processes • Usually applies to: • Shared memory systems • Synchronous executions (imply lower bounds also for asynchronous executions) • Details depend on the primitives used

Single-writer registers: Possible argument • Need to read from each process • The state of a process can be found only in its own register • Hence, first process must read n registers

Not really • When processes take steps together, the first process doubles its information in the 2nd step • But it can't do better than that

More Refined Argument • Consider synchronized executions • Processes take steps in rounds • All reads appear before all writes • INF(p_i,t-1): The set of inputs influencing process p_i at the start of round t • For t = 1, INF(p_i,t-1) = {p_i} • For t > 1, if p_i reads a value written by p_j, INF(p_i,t) = INF(p_i,t-1) ∪ INF(p_j,t-1) • For t > 1, if p_i writes, INF(p_i,t) = INF(p_i,t-1)

INF determines the state • Lemma: If the states of the processes in INF(p_i,t-1) are the same in configurations C and C', then p_i takes the same steps in a t-round execution from C and from C' • Proof by case analysis

Size of INF • I(t) = max over p_i of |INF(p_i,t)| • Lemma: I(0) = 1, and I(t) ≤ 2·I(t-1) ⇒ I(t) ≤ 2^t
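
The doubling bound can be illustrated with a small simulation. This is only a sketch under a simplified model: every process reads exactly one register per round, and a register exposes its owner's influence set from the end of the previous round. The name `simulate_influence` and the particular read schedule are illustrative assumptions, not part of the original argument.

```python
from math import ceil, log2

def simulate_influence(n, rounds):
    """Track INF sets when, in round t, p_i reads the register of p_((i + 2^(t-1)) mod n).

    Simplifying assumption: a register exposes its owner's influence set
    from the end of the previous round.
    """
    inf = [{i} for i in range(n)]          # INF(p_i, 0) = {p_i}
    sizes = [1]                            # I(0) = 1
    for t in range(1, rounds + 1):
        prev = inf                         # reads see values from round t-1
        inf = [prev[i] | prev[(i + 2 ** (t - 1)) % n] for i in range(n)]
        sizes.append(max(len(s) for s in inf))
    return sizes

n = 16
sizes = simulate_influence(n, ceil(log2(n)))
for t, size in enumerate(sizes):
    assert size <= 2 ** t                  # I(t) <= 2^t
print(sizes)
```

With this schedule the bound is tight: every influence set doubles in each round, so no process can be influenced by all 16 inputs before round 4.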

Simple application: Computing OR • Consider input configuration C0 = (0, 0, …, 0, …, 0) • The size of the influence set of a process is < n in all rounds < log n • Some process p_i is not in INF(p_1, log n − 1) • By the lemma, p_1 returns the same value in C0 and in C1 = (0, 0, …, 1, …, 0), where only p_i has input 1 • A contradiction, since the OR is 0 in C0 and 1 in C1

Application: Approximate agreement • For a small ε > 0 • Processes start with input in [0,1] • Must decide on an output in [0,1] such that • All outputs are within ε of each other (agreement) • If all inputs are v, the output is v (validity) • System is asynchronous and a process must decide even if it runs by itself (solo termination)

Application: Approximate agreement [Attiya, Shavit, Lynch] • Consider input configuration C0 = (0, 0, …, 0) • Run all processes to completion from C0; they must decide 0 • If the number of rounds T < log n • I(T) < n • ∃ process p_i ∉ INF(p_1,T)

Approximate agreement (cont.) • Consider two input configurations C0 = (0, …, 0) and C1 = (0, …, 1, …, 0), where only p_i has input 1 • Run p_i to completion, must decide 1 • p_i ∉ INF(p_1,T) • p_1 still decides 0 when running from this configuration, contradicting agreement • Theorem: Solo-terminating approximate agreement requires Ω(log n) rounds in a synchronous failure-free run • This is the overhead of solo-termination in “nice” runs: without it, a synchronous algorithm can solve the problem in one round

With multi-writer registers • Previous theorem does not hold • A wait-free approximate agreement algorithm that takes O(1) rounds in “nice” executions [Schenk] • Even simpler: An O(1) OR algorithm. Can you find it? • Only a few initial configurations to distinguish between • Overhead of single-writer registers: Separates single-writer and multi-writer registers
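
One possible answer to the OR exercise, sketched under the assumption that all processes share a single multi-writer register initialized to 0 and that reads in a round see the writes of the previous round. The function `or_multi_writer` is a hypothetical sequential simulation, not taken from the slides.

```python
def or_multi_writer(inputs):
    """Simulate a two-round OR computation over one multi-writer register."""
    register = 0                      # shared multi-writer register R, initially 0
    # Round 1: every process whose input is 1 writes 1 to R.
    for bit in inputs:
        if bit == 1:
            register = 1              # concurrent writers all write the same value
    # Round 2: every process reads R and outputs what it read.
    return [register for _ in inputs]

print(or_multi_writer([0, 0, 1, 0]))
print(or_multi_writer([0, 0, 0, 0]))
```

The number of rounds does not depend on n because a reader only has to tell "all inputs are 0" apart from "some input is 1"; it never learns which process wrote.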

Information flow with multi-writer registers • The previous argument does not hold • Instead, consider how learning more information allows processes to differentiate between input configurations • Capture this as a partitioning of process states and memory values [Beame] • Example input configurations: (0, …, 0), (0, …, 0, …, 1), (0, …, 1, …, 0), (1, …, 1, …, 0)

Multi-writer registers: Ordering events • Within each round: • Put all reads, then • Put all writes • Reads obtain the value written at the end of the previous round

Partitioning into equivalence classes • For process p and round t, two input configurations are in the same equivalence class of P(p,t) if p is in the same state after t rounds from both (in a synchronous failure-free execution) • P(t): the number of classes after t rounds (max over p) • V(R,t), V(t) defined similarly for locations R • Lemma: P(t) ≤ P(t-1)·V(t-1) and V(t) ≤ n·P(t-1) + V(t-1) ⇒ P(t), V(t) ≤ (4n+2)^(2^(t−2))

Application: The collect problem • update(v) stores v as latest value of a process • collect() returns a set of values (one per process) • When each process initially stores one of two values: • There are 2^n possible input configurations, each leading to a different output • Previous lemma implies (4n+2)^(2^(t−2)) ≥ P(t) ≥ 2^n • Must have Ω(log n) rounds
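
To see how the counting argument forces the number of rounds to grow with n, the recurrences of the partitioning lemma can be iterated directly. This is a sketch only: the base case P(0) = 2, V(0) = 1 is an assumption (each process initially knows only its own binary input, and memory starts in a fixed state), and the name `min_rounds_for_collect` is made up for illustration.

```python
def min_rounds_for_collect(n):
    """Smallest t at which the upper bound on P(t) reaches 2^n.

    Assumed base case (not stated on the slide): P(0) = 2 and V(0) = 1.
    """
    p, v = 2, 1
    t = 0
    while p < 2 ** n:
        # P(t) <= P(t-1) * V(t-1)  and  V(t) <= n * P(t-1) + V(t-1)
        p, v = p * v, n * p + v
        t += 1
    return t

for n in (16, 256, 4096):
    print(n, min_rounds_for_collect(n))
```

Any collect algorithm needs at least this many rounds under the stated assumptions, since fewer rounds leave a process with fewer than 2^n distinguishable states.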

Also for other primitives (CAS) • Reading CAS returns the old value (can be handled, but we won’t do that) • Can also extend to non-reading kCAS • Non-reading CAS:
    CAS(R,old,new){
      if R==old then
        R = new
        return success
      else return fail
    }

Careful with CAS • More information flow in a sequence of steps: initially, R == 0; then cas(R,0,1), cas(R,1,2), …, cas(R,n−1,n) all succeed • On the other hand, in the order cas(R,n−1,n), cas(R,n−2,n−1), …, cas(R,0,1), only the last one succeeds
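
The effect of the ordering can be checked with a short simulation of non-reading CAS on a single register. The helper `run_cas_sequence` is a hypothetical name and the choice n = 5 is arbitrary.

```python
def run_cas_sequence(ops, initial=0):
    """Apply non-reading CAS operations (old, new) in order; return success flags."""
    value = initial
    results = []
    for old, new in ops:
        if value == old:
            value = new
            results.append(True)
        else:
            results.append(False)
    return results

n = 5
ascending = [(i, i + 1) for i in range(n)]        # cas(R,0,1), ..., cas(R,n-1,n)
descending = list(reversed(ascending))            # cas(R,n-1,n), ..., cas(R,0,1)
print(run_cas_sequence(ascending))
print(run_cas_sequence(descending))
```

In the ascending order the success of the last CAS depends on every earlier one, so information from n processes flows through R within one sequence of steps; in the descending order almost nothing is learned.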

Ordering events within a round • Put all reads first • Put all writes last • For every register R whose current value is v, consider all CAS events: • Put all events with old ≠ v: all fail • Put all events with old == v: only the first succeeds (assumes operations are non-degenerate) • Allows proving a lemma analogous to the one for multi-writer registers (different constants)
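
The ordering rule for CAS events on one register can be sketched as follows. The function `schedule_round` and the tuple layout `(process, old, new)` are illustrative assumptions; events are taken to be non-degenerate (old differs from new).

```python
def schedule_round(current, events):
    """Order one round's CAS events on a register and apply them.

    events: list of (process, old, new) with old != new (non-degenerate).
    Returns (final value, list of processes whose CAS succeeded).
    """
    failing = [e for e in events if e[1] != current]    # old != v: scheduled first
    matching = [e for e in events if e[1] == current]   # old == v: scheduled after
    value, winners = current, []
    for proc, old, new in failing + matching:
        if value == old:
            value = new
            winners.append(proc)
    return value, winners

events = [("p1", 0, 1), ("p2", 1, 2), ("p3", 0, 7), ("p4", 2, 3)]
print(schedule_round(0, events))
```

Under this ordering at most one CAS per register succeeds in a round, which avoids the chain of successes that the ascending sequence cas(R,0,1), cas(R,1,2), … would produce.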

Information Flow with Bounded Fan-In • Arbitrary objects, but bounded contention • Not too many processes access the same base object simultaneously • Isolate processes in a Q-independent execution • Only processes in Q take steps • Access only objects not modified by other processes in Q • For a process p ∈ Q, a Q-independent execution is indistinguishable from a p-solo execution

Constructing independent executions • Lemma: For any algorithm using only objects with contention ≤ w and every t ≥ 0, there is a t-round Q_t-independent execution, with |Q_t| ≥ n/(w+2)^t • Proof by induction, with a trivial base case • Induction step: consider a Q_t-independent execution. Look at the next steps processes in Q_t are about to perform, and construct an undirected graph (V,E) • We use the following result from graph theory • Turán's theorem: Any graph (V,E) has an independent set of size at least |V|²/(|V|+2|E|)

Induction step: The graph • V = Q_t • E contains an edge {p_i, p_j} if • p_i and p_j access the same object, or • p_i is about to read an object modified by p_j, or • p_j is about to read an object modified by p_i • |E| ≤ |Q_t|(w+1)/2 • Turán's theorem and the inductive hypothesis ⇒ there is an independent set Q_{t+1} of size ≥ |Q_t|/(w+2) ≥ n/(w+2)^(t+1) • Omit all steps of Q_t − Q_{t+1} from the execution to get a Q_{t+1}-independent execution
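
One standard way to obtain an independent set meeting Turán's bound is the minimum-degree greedy procedure, sketched here on a made-up conflict graph. The function name and the 6-process cycle are illustrative assumptions; the proof itself only needs the existence of such a set.

```python
def greedy_independent_set(vertices, edges):
    """Min-degree greedy: repeatedly pick a vertex of minimum remaining degree
    and discard it together with its neighbours."""
    adj = {v: set() for v in vertices}
    for a, b in edges:
        adj[a].add(b)
        adj[b].add(a)
    remaining = set(vertices)
    chosen = []
    while remaining:
        v = min(sorted(remaining), key=lambda u: len(adj[u] & remaining))
        chosen.append(v)
        remaining -= adj[v] | {v}
    return chosen

# Hypothetical conflict graph: 6 processes, p_i conflicts with p_((i+1) mod 6).
vertices = list(range(6))
edges = [(i, (i + 1) % 6) for i in range(6)]
q_next = greedy_independent_set(vertices, edges)
bound = len(vertices) ** 2 / (len(vertices) + 2 * len(edges))
assert len(q_next) >= bound            # |V|^2 / (|V| + 2|E|)
print(q_next, bound)
```

The processes in the returned set play the role of Q_{t+1}: no two of them conflict in their next steps, so the steps of all other processes can be dropped.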

Application: Weak Test&Set • Weak test&set: Like test&set but at most one success • Take t such that (w+2)^t < n • Lemma gives a t-round {p_i,p_j}-independent execution • Each of p_i and p_j seems to be running solo • must succeed • Contradiction • Theorem: The solo step complexity of weak test&set is Ω(log n / log w)

Part II Covering

Covering: The basic idea • Several processes write to the same location • Writes by early processes are lost if there is no read in between • Must write to distinct locations • Other processes must read these locations
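
A toy illustration of why covered writes are lost, assuming a single register that several processes are poised to write. The names `run_covering`, `writers` and the register label "R" are hypothetical.

```python
def run_covering(writers, target="R"):
    """All writers are poised on the same register and then write one after another."""
    memory = {target: None}
    for name, value in writers:
        memory[target] = value        # each write overwrites the previous one
    return memory

writers = [("p1", 10), ("p2", 20), ("p3", 30)]
memory = run_covering(writers)
visible = {name for name, value in writers if value in memory.values()}
print(memory, visible)
```

Only the last writer leaves a trace, so a later reader cannot tell whether the earlier writers took steps at all; to be observed, they would have to write to distinct locations.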

Max Register • WriteMax(v,R) operation • ReadMax operation op returns the maximal value written by a WriteMax operation that • completed before op started, or • overlaps op • Special case of a linearizable object

Lower bound for ReadMax operation [Jayanti, Tan, Toueg] • Theorem: ReadMax must read n different registers • The proof is constructive

Construction for the lower bound • Proof by induction on k = 0, …, n: construct an execution α_k β_k γ_k • α_k: p_1 … p_k perform WriteMax operations • β_k: writes by p_1 … p_k to R_1 … R_k • γ_k: p_n performs ReadMax operation, reads R_1 … R_k • Base case is simple • Taking k = n yields the result

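Inductive Step • Execution for k: α_k (p_1 … p_k perform WriteMax operations), β_k (writes by p_1 … p_k to R_1 … R_k), γ_k (p_n performs ReadMax operation) • Extended execution: α_k, then p_{k+1} performs WriteMax operations, then β_k γ_k • If p_{k+1} writes only to R_1 … R_k, its writes are overwritten in β_k and p_n does not observe p_{k+1} • So p_{k+1} must write to some R ∉ {R_1, …, R_k}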

Inductive Step • Execution for k: α_k β_k γ_k (WriteMax operations by p_1 … p_k; their writes to R_1 … R_k; ReadMax operation by p_n) • Extended execution: α_k π_k β_k γ_k, where in π_k p_{k+1} performs WriteMax operations • p_{k+1} must write to some R ∉ {R_1, …, R_k} • p_n must read some R ∉ {R_1, …, R_k}

Inductive Step • Execution for k: α_k β_k γ_k (WriteMax operations by p_1 … p_k; their writes to R_1 … R_k; ReadMax operation by p_n) • π_k: p_{k+1} performs WriteMax operations • p_{k+1}: write to R_{k+1} • Claim follows with R_1 … R_k, R_{k+1} and α_{k+1} = α_k π_k

Swap objects • Theorem holds for other primitives and objects, e.g., (register-to-memory) swap • Need some care in constructing π_k, γ_k
    swap(R,v){
      tmp = R
      R = v
      return tmp
    }

Result holds also for other objects • E.g., counters • Constructed execution contains many increment operations • Better algorithms when • Few increment operations • Max register holds bounded values [Aspnes, Attiya, Censor-Hillel]

Counters with CAS • Counters can be implemented with a single location R, and a single CAS per operation: • To increment, simply: • read previous value from R • CAS +1 to R • To read the counter, simply read R ⇒ Lots of contention on R! ⇒ This is inevitable
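
A minimal sketch of such a counter. Python has no hardware CAS, so the class `CasRegister` simulates one with a lock; that class, the retry loop on a failed CAS, and the thread counts are all assumptions made for illustration.

```python
import threading

class CasRegister:
    """Simulated CAS location; a lock stands in for hardware atomicity."""
    def __init__(self, value=0):
        self._value = value
        self._lock = threading.Lock()

    def read(self):
        with self._lock:
            return self._value

    def cas(self, old, new):
        with self._lock:
            if self._value == old:
                self._value = new
                return True
            return False

def increment(reg):
    """Read R, then CAS R from the value read to value + 1; retry on failure."""
    while True:
        old = reg.read()
        if reg.cas(old, old + 1):
            return

def read_counter(reg):
    """Reading the counter is a single read of R."""
    return reg.read()

R = CasRegister()
threads = [threading.Thread(target=lambda: [increment(R) for _ in range(1000)])
           for _ in range(4)]
for th in threads:
    th.start()
for th in threads:
    th.join()
print(read_counter(R))
```

Every operation of every process touches the same location R, which is exactly the contention that the stalls measure is meant to charge for.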

The memory stalls measure [Dwork, Herlihy, Waarts] • If k processes access (or modify) the same location at the same configuration • The first process incurs one step, and no stalls • The second process incurs one step, and one stall • … • The k'th process incurs one step, and k-1 stalls
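
The accounting above can be written down directly. The function names are hypothetical; the sketch only covers the case of k processes hitting a single location at one configuration.

```python
def stalls_per_process(k):
    """Stalls incurred by each of k processes accessing one location together."""
    return list(range(k))            # the (i+1)-th process served waits for i others

def total_cost(k):
    """Total steps plus stalls over all k processes."""
    stalls = stalls_per_process(k)
    return k + sum(stalls)           # one step each, plus 0 + 1 + ... + (k-1) stalls

print(stalls_per_process(4), total_cost(4))
```

The last process served pays k-1 stalls plus one step, which is the quantity the lower bound for ReadCounter is stated in.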

Lower bound on number of stalls • Theorem: ReadCounter must incur n stalls + steps • Similar construction as in previous theorem: • p_1 … p_k perform Increment operations • p_1 … p_k poised on R_1 … R_m, m ≤ k • p_n performs ReadCounter operation, accesses R_1 … R_m, and incurs k stalls + steps

Wrap-up • There are many lower bound results, but fewer techniques… • Some results & techniques are relevant to questions asked in Transform • Material is based on monograph-in-writing with Faith Ellen • Let me know if you want to proof-read it!
