Ordering of events in Distributed Systems & Eventual Consistency

Ordering of events in Distributed Systems&Eventual Consistency Jinyang Li

What is consistency? • Consistency model: • A constraint on the system state observable by application operations • Examples: • X86 memory: • Database: write x=5 read x (should be 5) time x:=x+1; y:=y-1 assert(x+y==const) time

Consistency • No right or wrong consistency models • Tradeoff between ease of programmability and efficiency • Consistency is hard in (distributed) systems: • Data replication (caching) • Concurrency • Failures

Consistency challenges: example • Each node has a local copy of state • Read from local state • Send writes to the other node, but do not wait

Consistency challenges: example W(x)1 W(y)1 x=1 If y==0 critical section y=1 If x==0 critical section

Does this work? W(x)1 W(y)1 R(x)0 R(y)0 x=1 If y==0 critical section y=1 If x==0 critical section

Diff CPUs see different event orders! What went wrong? W(x)1 W(y)1 R(x)0 R(y)0 CPU1 sees: W(y)1 R(x)0 W(x)1 CPU0 sees: W(x)1 R(y)0 W(y)1

Strict consistency • Each operation is stamped with a global wall-clock time • Rules: • Each read gets the latest write value • All operations at one CPU have time-stamps in execution order

W must have timestamp later than R Contradicts rule 1: R must see W(x)1 Strict consistency gives “intuitive” results • No two CPUs in the critical section • Proof: suppose mutual exclusion is violated CPU0: W(x)1 R(y)0 CPU1: W(y)1 R(x)0 • Rule 1: read gets latest write CPU0: W(x)1 R(x)0 CPU1: W(y)1 R(x)0

Sequential consistency • Strict consistency is not practical • No global wall-clock available • Sequential consistency is the closest • Rules: There is a total order of ops s.t. • All CPUs see results according to total order (i.e. reads see most recent writes) • Each CPUs’ ops appear in order

Lamport clock gives a total order • Each CPU keeps a logical clock • Each CPU updates its logical clock between successive events • A sender includes its clock value in the message. • A receiver advances its clock be greater than the message’s clock value. • Lamport clocks define a total order. • Ties are broken based on CPU ids.

Fix the example W(x)1 ack W(y)1 R(x)1 R(y)0 ack CPU1 should see order W(x)1 W(y)1 CPU0 should see order W(x)1 W(y)1

Lamport clock: an example W(x)1 1,0 S: W(x)1 W(y)1 1,1 S: W(y)1 2,1R: W(x)1 2,0 R: W(y)1 3,1S: ack 3,0 S: ack 4,1 R: ack 4,0 R: ack 1,0 S W(x)1 1,1 S W(y)1 2,0 R W(y)1 2,1 R W(x)1 3,0 S ack 3,1 S ack 4,0 R ack 4,1 S ack Defines one possible total order: W(x)1 < W(y)1

1,0 S W(x)1 ????? ?????? 1,1 S: W(x)1 1,0 S W(x)1 1,0 S W(x)1 1,1 S W(y)1 2,1 R: W(x)1 3,1 S: ack 1,0 S W(x)1 1,1 S W(y)1 2,0 R W(y)1 3,0 S ack 1,0 S W(x)1 1,1 S W(y)1 Lamport clock: an example 1,0 S: W(x)1 W(x)1 1,1 S: W(y)1 W(y)1 2,1R: W(x)1 2,0 R: W(y)1 3,1S: ack 3,0 S: ack 4,1 R: ack 4,0 R: ack

Beyond Lamport clock • Typical system obtains a total order differently • Use a single node to order all reads/writes • E.g. the lock_server in Lab1 • Partition state over multiple nodes, each node orders reads/writes for its partition • Invariant: exactly one is in charge of ordering • The ordering node must be online

Weakly consistent systems • Sequential consistency • All read/writes are applied in total order • Reads must see most recent writes • Eventual consistency (Bayou) • Writes are eventually applied in total order • Reads might not see most recent writes in total order

Why (not) eventual consistency? • Support disconnected operations • Better to read a stale value than nothing • Better to save writes somewhere than nothing • Potentially anomalous application behavior • Stale reads and conflicting writes…

Bayou Write log 0:0 1:0 2:0 Version Vector N1 0:0 1:0 2:0 N0 0:0 1:0 2:0 N2

1:0 W(x) 2:0 W(y) 3:0 W(z) 0:3 1:0 2:0 Bayou propagation Write log 1:1 W(x) 0:0 1:1 2:0 Version Vector N1 1:0 W(x) 2:0 W(y) 3:0 W(z) 0:3 1:0 2:0 N0 0:0 1:0 2:0 N2

0:3 1:4 2:0 1:1 W(x) Bayou propagation Write log 1:0 W(x) 1:1 W(x) 2:0 W(y) 3:0 W(z) 0:3 1:4 2:0 Version Vector N1 1:0 W(x) 2:0 W(y) 3:0 W(z) 0:3 1:0 2:0 N0 0:0 1:0 2:0 N2

Which portion of The log is stable? Bayou propagation Write log 1:0 W(x) 1:1 W(x) 2:0 W(y) 3:0 W(z) 0:3 1:4 2:0 Version Vector N1 1:0 W(x) 1:1 W(x) 2:0 W(y) 3:0 W(z) 0:4 1:4 2:0 N0 0:0 1:0 2:0 N2

Bayou propagation Write log 1:0 W(x) 1:1 W(x) 2:0 W(y) 3:0 W(z) 0:3 1:4 2:0 Version Vector N1 1:0 W(x) 1:1 W(x) 2:0 W(y) 3:0 W(z) 0:4 1:4 2:0 N0 1:0 W(x) 1:1 W(x) 2:0 W(y) 3:0 W(z) 0:3 1:4 2:5 N2

Bayou propagation Write log 1:0 W(x) 1:1 W(x) 2:0 W(y) 3:0 W(z) 0:3 1:6 2:5 Version Vector N1 1:0 W(x) 1:1 W(x) 2:0 W(y) 3:0 W(z) 0:4 1:4 2:0 0:3 1:4 2:5 N0 1:0 W(x) 1:1 W(x) 2:0 W(y) 3:0 W(z) 0:4 1:4 2:5 N2

Bayou uses a primary to commit a total order • Why is it important to make log stable? • Stable writes can be committed • Stable portion of the log can be truncated • Problem: If any node is offline, the stable portion of all logs stops growing • Bayou’s solution: • A designated primary defines a total commit order • Primary assigns CSNs (commit-seq-no) • Any write with a known CSN is stable • All stable writes are ordered before tentative writes

∞:1:1 W(x) 0:0 1:1 2:0 Bayou propagation Write log ∞:1:1 W(x) 0:0 1:1 2:0 Version Vector N1 1:1:0 W(x) 2:2:0 W(y) 3:3:0 W(z) 0:3 1:0 2:0 N0 0:0 1:0 2:0 N2

1:1:0 W(x) 2:2:0 W(y) 3:3:0 W(z) 4:1:1 W(x) 0:4 1:1 2:0 Bayou propagation Write log ∞:1:1 W(x) 0:0 1:1 2:0 Version Vector N1 1:1:0 W(x) 2:2:0 W(y) 3:3:0 W(z) 0:4 1:1 2:0 N0 4:1:1 W(x) 0:0 1:0 2:0 N2

Bayou’s limitations • Primary cannot fail • Server creation & retirement makes nodeID grow arbitrarily long • Anomalous behaviors for apps? • Calendar app

Ordering of events in Distributed Systems & Eventual Consistency