a shared log design for flash clusters
Mahesh Balakrishnan, Dahlia Malkhi, Vijayan Prabhakaran, Ted Wobber, John D. Davis, Michael Wei
Microsoft Research Silicon Valley

"tape is dead, disk is tape, flash is disk, RAM locality is king" (Jim Gray, Dec 2006)
flash in the data center
• can flash clusters eliminate the trade-off between consistency and performance?
• what new abstractions are required to manage and access flash clusters?
the CORFU abstraction: a shared log
• the application calls append(value) and read(offset); CORFU serves them from a flash cluster
• append to tail, read from anywhere
• infrastructure applications: SMR, databases, key-value stores, filesystems, virtual disks
• example application: the Hyder database (Bernstein et al., CIDR 2011)
• rates labelled on the diagram: 20K/s, 200K/s, 500K/s
the CORFU hardware: network flash
• network-attached flash units
• low power: 15W per unit
• low latency
• low cost
cost + power usage of a 1 TB, 10 Gbps flash farm: [comparison table not recovered]
problem statement
how do we implement a scalable shared log over a cluster of network-attached flash units?
the CORFU design
CORFU API:
• O = append(V)
• V = read(O)
• trim(O)
the application links against the CORFU library; the mapping resides at the client
• append to tail, read from anywhere
• each logical entry (4KB) is mapped to a replica set of physical flash pages
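A minimal sketch of the client-side mapping, to make "each logical entry is mapped to a replica set of physical flash pages" concrete. The round-robin striping, the pairing of units into replica sets, and the names `PROJECTION` and `locate` are assumptions made for illustration; the slides do not spell out the mapping function.

```python
from typing import Sequence, Tuple

ReplicaSet = Tuple[str, ...]

# Hypothetical projection: the eight units from the slides, paired into
# four replica sets (D1/D2, D3/D4, D5/D6, D7/D8).
PROJECTION: Tuple[ReplicaSet, ...] = (
    ("D1", "D2"), ("D3", "D4"), ("D5", "D6"), ("D7", "D8"),
)


def locate(offset: int,
           projection: Sequence[ReplicaSet] = PROJECTION) -> Tuple[ReplicaSet, int]:
    """Map a logical log offset to (replica set, physical page number).

    Assumption: offsets are striped round-robin across the replica sets.
    """
    if offset < 0:
        raise ValueError("log offsets are non-negative")
    replica_set = projection[offset % len(projection)]
    page = offset // len(projection)
    return replica_set, page
```

Because the mapping is a pure function of the offset and the projection, any client holding the same projection can compute it locally, which is one way to read "mapping resides at the client".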
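the CORFU protocol: reads
• the application calls read(pos) on the client's CORFU library
• the library uses the Projection to turn pos into a replica set and page, then issues read(D1/D2, page#) to the CORFU cluster
• Projection: D1 D2 D3 D4 D5 D6 D7 D8 (drawn as pairs D1/D2, D3/D4, D5/D6, D7/D8)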
the CORFU protocol: appends
• the application calls append(val) on the client's CORFU library
• the library reserves the next position in the log (e.g., 100) from the sequencer (T0)
• the library then issues write(D1/D2, val) to the CORFU cluster
• CORFU append throughput: # of 64-bit tokens issued per second
• sequencer is only an optimization! clients can probe for tail or reconstruct it from flash units
• Projection: D1 D2 D3 D4 D5 D6 D7 D8
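The append path can be sketched in a few lines: take a token from the sequencer, then write at that position. This is an in-memory illustration only; `Sequencer`, `SharedLog`, and `reconstruct_tail` are hypothetical names, a plain dict stands in for the flash cluster, and replication is left out.

```python
import itertools
from typing import Dict


class Sequencer:
    """Hands out consecutive log positions; it stores no log data."""

    def __init__(self, start: int = 0) -> None:
        self._counter = itertools.count(start)

    def next_token(self) -> int:
        return next(self._counter)


class SharedLog:
    """Toy single-process log; `pages` stands in for the flash cluster."""

    def __init__(self) -> None:
        self.sequencer = Sequencer()
        self.pages: Dict[int, bytes] = {}

    def append(self, value: bytes) -> int:
        offset = self.sequencer.next_token()   # reserve next position
        if offset in self.pages:               # write-once: never overwrite
            raise RuntimeError(f"offset {offset} already written")
        self.pages[offset] = value
        return offset

    def read(self, offset: int) -> bytes:
        if offset not in self.pages:
            raise KeyError(f"offset {offset} is unwritten")
        return self.pages[offset]

    def reconstruct_tail(self) -> int:
        """Recover the tail from stored data alone (no sequencer needed)."""
        return max(self.pages, default=-1) + 1
```

If the sequencer is lost, a replacement can be seeded from the stored data, for example `log.sequencer = Sequencer(log.reconstruct_tail())`, which is the sense in which the sequencer is only an optimization.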
chain replication in CORFU
(diagram: clients C1, C2, C3 writing to a chain of flash units in order 1, 2)
• safety under contention: if multiple clients try to write to the same log position concurrently, only one wins (writes to already written pages => error)
• durability: data is only visible to reads if the entire chain has seen it (reads on unwritten pages => error)
• requires 'write-once' semantics from the flash unit
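A rough sketch of how write-once pages give both properties. The class and function names are hypothetical, and reading from the last unit in the chain is an assumption chosen to match "data is only visible to reads if the entire chain has seen it".

```python
from typing import Dict, List


class WrittenError(Exception):
    """Raised on a write to an already written page."""


class UnwrittenError(Exception):
    """Raised on a read of an unwritten page."""


class FlashUnit:
    """In-memory stand-in for a flash unit with write-once pages."""

    def __init__(self) -> None:
        self.pages: Dict[int, bytes] = {}

    def write(self, page: int, value: bytes) -> None:
        if page in self.pages:
            raise WrittenError(page)
        self.pages[page] = value

    def read(self, page: int) -> bytes:
        if page not in self.pages:
            raise UnwrittenError(page)
        return self.pages[page]


def chain_write(chain: List[FlashUnit], page: int, value: bytes) -> None:
    # Write head first, then down the chain. A competing writer fails at
    # the head with WrittenError, so only one client wins the position.
    for unit in chain:
        unit.write(page, value)


def chain_read(chain: List[FlashUnit], page: int) -> bytes:
    # Assumption: read from the last unit, so a value is visible only
    # after every unit in the chain has stored it.
    return chain[-1].read(page)
```

A half-finished write (head written, tail not) stays invisible to `chain_read`, which is the situation the fill operation later has to repair.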
handling failures: flash units
each Projection is a list of views (log-offset range, then flash units):
Projection 0:
  0 -    D1 D2 D3 D4 D5 D6 D7 D8
Projection 1:
  0 - 7  D1 [D2 failed] D3 D4 D5 D6 D7 D8
  8 -    D1 D9 D3 D4 D5 D6 D7 D8
Projection 2:
  0 - 7  D1 [D2 failed] D3 D4 D5 D6 D7 D8
  8 - 9  D1 D9 D3 D4 D5 D6 D7 D8
  9 -    D10 D11 D12 D13 D14 D15 D16 D17
reconfiguration steps:
• 'seal' current projection at flash units
• write new projection at auxiliary
latency for 32-drive cluster: tens of milliseconds
(diagram: flash units D1 to D17, drawn in pairs)
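One way to picture "a Projection is a list of views" is a list of offset ranges, each with its own set of flash units. The sketch below encodes Projection 1 from the slide under two assumptions: ranges are treated as half-open (start inclusive, end exclusive), and the marked slot in the first view is taken to be the failed unit D2, represented here as `None`. `View` and `find_view` are hypothetical names.

```python
from typing import List, NamedTuple, Optional, Tuple


class View(NamedTuple):
    start: int                       # first log offset covered (inclusive)
    end: Optional[int]               # exclusive end; None means open-ended
    units: Tuple[Optional[str], ...]  # None marks a slot with no live unit


# Projection 1: offsets 0..7 keep the old units minus the failed one,
# offsets from 8 onward use D9 in the vacated slot.
PROJECTION_1: List[View] = [
    View(0, 8, ("D1", None, "D3", "D4", "D5", "D6", "D7", "D8")),
    View(8, None, ("D1", "D9", "D3", "D4", "D5", "D6", "D7", "D8")),
]


def find_view(projection: List[View], offset: int) -> View:
    """Return the view whose offset range contains `offset`."""
    for view in projection:
        if view.start <= offset and (view.end is None or offset < view.end):
            return view
    raise LookupError(f"no view covers offset {offset}")
```

Under this reading, reconfiguration appends or replaces views rather than moving old data: entries written before the change are still found through the view that covered them.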
handling failures: clients
• client obtains token from sequencer and crashes: holes in the log
• solution: other clients can fill the hole
• fast CORFU fill operation (<1ms) 'walks the chain':
  • completes half-written entries
  • writes junk on unwritten entries (metadata operation, conserves flash cycles, bandwidth)
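The fill operation described above can be sketched as a single walk down the chain. Plain dicts stand in for flash units here, and `JUNK` and `fill` are hypothetical names; the real operation is a metadata write, which this toy version does not model.

```python
from typing import Dict, List

JUNK = "<junk>"  # hypothetical marker for a filled hole


def fill(chain: List[Dict[int, object]], page: int) -> object:
    """Walk the chain so that `page` ends up written on every unit.

    If the head already holds a value, the entry was half-written and is
    completed with that value; otherwise junk is written everywhere.
    `setdefault` keeps the write-once rule: existing pages are untouched.
    """
    value = chain[0].setdefault(page, JUNK)
    for unit in chain[1:]:
        unit.setdefault(page, value)
    return value
```

For example, with `chain = [{0: b"x"}, {}]`, calling `fill(chain, 0)` copies `b"x"` to the second unit, while `fill(chain, 1)` marks page 1 as junk on both.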
garbage collection: two models
• prefix trim(O): invalidate all entries before offset O
• entry trim(O): invalidate only entry at offset O
(diagrams: for each model, a log extending to ∞ with its invalid and valid entries marked)
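The difference between the two trim models shows up in the bookkeeping they need: a prefix trim is a single watermark, while entry trims need a set of individual offsets. A minimal sketch, with `TrimState` and its method names as illustrative assumptions:

```python
from typing import Set


class TrimState:
    """Tracks which log offsets have been invalidated by trims."""

    def __init__(self) -> None:
        self.prefix = 0               # every offset < prefix is invalid
        self.trimmed: Set[int] = set()  # individually trimmed offsets

    def prefix_trim(self, offset: int) -> None:
        """Invalidate all entries before `offset`."""
        self.prefix = max(self.prefix, offset)
        # Offsets now covered by the watermark need no separate record.
        self.trimmed = {o for o in self.trimmed if o >= self.prefix}

    def entry_trim(self, offset: int) -> None:
        """Invalidate only the entry at `offset`."""
        if offset >= self.prefix:
            self.trimmed.add(offset)

    def is_valid(self, offset: int) -> bool:
        return offset >= self.prefix and offset not in self.trimmed
```

Prefix trim keeps metadata constant-sized; entry trim lets an application reclaim scattered entries at the cost of tracking each one.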
CORFU throughput
(throughput plot; annotations on the chart:)
• sequencer bottleneck
• reads scale linearly
how far is CORFU from Paxos?
• Paxos-like protocols are IO-bound at leader…
• … so is a single CORFU chain
• the Projection 'stitches' together multiple chains: no I/O bottleneck!
(diagram: CORFU cluster with chains D1/D2, D3/D4, D5/D6, D7/D8)
conclusion
• CORFU is a scalable shared log: linearly scalable reads, 1M appends/s
• CORFU uses network-attached flash to construct inexpensive, power-efficient clusters