Clusters Of Raw Flash Units
Mahesh Balakrishnan, John Davis, Dahlia Malkhi, Vijayan Prabhakaran, Ted Wobber
Microsoft Research Silicon Valley
Thanks: Phil Bernstein, Ken Eguro, Jeremy Elson, Ed Nightingale, Colin Reid, Michael Wei, Ming Wu
tape is dead, disk is tape, flash is disk, RAM locality is king - Jim Gray, Dec 2006
flash in the data center
• can flash clusters eliminate the trade-off between consistency and performance?
• what new abstractions are required to manage and access flash clusters?
what is CORFU?
• applications: key-value stores, databases, SMR (state machine replication), filesystems, virtual disks
• application infrastructure: TOFU (transactions over CORFU)
• CORFU: a replicated, fault-tolerant shared log; append(value) to the tail, read(offset) from anywhere
• network-attached flash unit: 1 Gbps, 15W, 75 µs reads, 200 µs writes (4KB)
• flash cluster: 32 server-attached SSDs, 1.5 TB, 0.5M reads/s, 0.2M appends/s (4KB)
the case for CORFU
applications append and read data through a shared log. why a shared log interface?
• easy to build strongly consistent (transactional) in-memory applications
• effective way to pool flash:
  • SSDs already use logging internally to avoid write-in-place
  • random reads are fast
  • GC is feasible
CORFU is a distributed SSD with a shared log interface: a way to get 10 Gbps of random I/O from a 1 TB flash farm
the CORFU design
• CORFU API: append(V), read(O), trim(O)
• the mapping from log positions to flash pages resides at the client (in the CORFU library)
• read from anywhere, append to the tail (positions handed out by a sequencer)
• each 4KB logical log entry is mapped to a replica set of physical flash pages
• append throughput = the number of 64-bit tokens the sequencer issues per second
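As a rough illustration of the API on this slide, the client library could be pictured as the interface below. This is a minimal sketch; the names (CorfuLog, Position, the error values) are assumptions for illustration, not the actual implementation's identifiers.

```go
// Sketch of the client-side shared-log API described on this slide.
// Names (CorfuLog, Position, error values) are illustrative only.
package corfu

import "errors"

// Position is a 64-bit offset in the shared log.
type Position uint64

var (
	ErrTrimmed   = errors.New("position has been trimmed")
	ErrUnwritten = errors.New("position not yet written")
)

// CorfuLog is what the CORFU client library would expose to applications.
type CorfuLog interface {
	// Append writes a 4KB entry at the tail and returns the position it landed at.
	Append(value []byte) (Position, error)
	// Read returns the entry at the given position (readable from any replica).
	Read(pos Position) ([]byte, error)
	// Trim marks a position as garbage-collectable.
	Trim(pos Position) error
}
```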
CORFU throughput (server+SSD) [CORFU: A Shared Log Design for Flash Clusters, NSDI 2012]
(figure: read throughput scales linearly with the number of drives; append throughput is capped at the sequencer bottleneck)
the CORFU protocol: mapping
the application calls read(pos) on the CORFU library; the library uses its local Projection, which maps log positions onto the drives D1..D8 of the CORFU cluster, to translate pos into a replica set and physical page, and then issues read(D1/D2, page#) directly to a flash unit
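The mapping step can be made concrete with a small sketch: a Projection maps ranges of log positions onto replica chains of flash units, and a read goes straight to one of the mapped drives. The Segment/Projection types and the round-robin arithmetic below are assumptions for illustration, not the paper's exact data structures.

```go
// Sketch of client-side mapping: a Projection maps ranges of log positions
// to replica chains of flash units. Types and arithmetic are illustrative.
package main

import "fmt"

type Chain []string // e.g., {"D1", "D2"}: a replica set of flash units

// Segment maps the half-open range [Start, End) of log positions to chains.
type Segment struct {
	Start, End uint64 // End == ^uint64(0) means "unbounded"
	Chains     []Chain
}

type Projection struct {
	Epoch    uint64
	Segments []Segment
}

// Map translates a log position into (replica chain, physical page on each unit).
func (p *Projection) Map(pos uint64) (Chain, uint64, error) {
	for _, s := range p.Segments {
		if pos >= s.Start && pos < s.End {
			rel := pos - s.Start
			chain := s.Chains[rel%uint64(len(s.Chains))] // round-robin across chains
			page := rel / uint64(len(s.Chains))          // page index within that chain
			return chain, page, nil
		}
	}
	return nil, 0, fmt.Errorf("position %d not covered by projection %d", pos, p.Epoch)
}

func main() {
	p := &Projection{
		Epoch: 0,
		Segments: []Segment{{
			Start: 0, End: ^uint64(0),
			Chains: []Chain{{"D1", "D2"}, {"D3", "D4"}, {"D5", "D6"}, {"D7", "D8"}},
		}},
	}
	chain, page, _ := p.Map(100)
	fmt.Printf("log position 100 -> chain %v, page %d\n", chain, page)
}
```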
the CORFU protocol: tail-finding
to append(val), the CORFU library first asks the sequencer (T0) to reserve the next position in the log (e.g., 100); it then maps that position through the Projection and issues write(D1/D2, val) to the corresponding flash units. append throughput is therefore the rate at which the sequencer can issue 64-bit tokens
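Putting the two steps together, an append under this protocol looks roughly like the sketch below: reserve a token from the sequencer, map the position, then write down the replica chain. All the interfaces here are hypothetical stand-ins for the real client library and wire protocol.

```go
// Sketch of the append path: token from the sequencer, then a chained write
// through the replica set. All interfaces here are hypothetical stand-ins.
package corfu

import "fmt"

// Sequencer hands out the next log position (a 64-bit token); it does no I/O.
type Sequencer interface {
	NextToken() uint64
}

// FlashUnit is one network-attached drive; Write fails if the page is already written.
type FlashUnit interface {
	Name() string
	Write(page uint64, data []byte) error
}

// Mapper resolves a log position to its replica chain and per-unit page number.
type Mapper interface {
	Map(pos uint64) (chain []FlashUnit, page uint64, err error)
}

// Append reserves the tail position and writes the entry down the chain in order.
func Append(seq Sequencer, m Mapper, val []byte) (uint64, error) {
	pos := seq.NextToken() // e.g., 100: reserved, but nothing is durable yet
	chain, page, err := m.Map(pos)
	if err != nil {
		return 0, err
	}
	// Chain replication: write the head first and the tail last; the entry
	// becomes readable only after the last unit in the chain acknowledges.
	for _, unit := range chain {
		if err := unit.Write(page, val); err != nil {
			return 0, fmt.Errorf("write to %s failed: %w", unit.Name(), err)
		}
	}
	return pos, nil
}
```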
the CORFU protocol: (chain) replication
• safety under contention: if multiple clients (C1, C2, C3) try to write to the same log position concurrently, only one 'wins'
• durability: data is only visible to reads once the entire chain has seen it
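Both properties fall out of write-once pages plus a fixed chain order, which the toy sketch below tries to show with in-memory stand-ins for a two-unit chain (everything here is illustrative, not the real flash-unit code).

```go
// Sketch: contention and durability with write-once replicas in a fixed chain.
// In-memory stand-ins only; real flash units are separate network servers.
package main

import (
	"errors"
	"fmt"
)

var ErrAlreadyWritten = errors.New("page already written")

// unit is a write-once page store standing in for one flash unit.
type unit struct{ pages map[uint64][]byte }

func newUnit() *unit { return &unit{pages: make(map[uint64][]byte)} }

func (u *unit) write(page uint64, data []byte) error {
	if _, ok := u.pages[page]; ok {
		return ErrAlreadyWritten // write-once: the second writer loses
	}
	u.pages[page] = data
	return nil
}

func (u *unit) read(page uint64) ([]byte, bool) { d, ok := u.pages[page]; return d, ok }

// chainWrite goes head-to-tail and stops at the first unit that rejects the
// write, so contending clients are arbitrated by the head of the chain.
func chainWrite(chain []*unit, page uint64, data []byte) error {
	for _, u := range chain {
		if err := u.write(page, data); err != nil {
			return err
		}
	}
	return nil
}

// chainRead consults the tail: an entry is visible only if the whole chain has it.
func chainRead(chain []*unit, page uint64) ([]byte, bool) {
	return chain[len(chain)-1].read(page)
}

func main() {
	chain := []*unit{newUnit(), newUnit()}                    // D1 -> D2
	fmt.Println(chainWrite(chain, 100, []byte("client C1"))) // <nil>: C1 wins
	fmt.Println(chainWrite(chain, 100, []byte("client C2"))) // page already written
	v, ok := chainRead(chain, 100)
	fmt.Println(string(v), ok) // "client C1" true
}
```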
handling failures: clients
• if a client obtains a token from the sequencer and then crashes, it leaves a hole in the log
• solution: other clients can 'fill' the hole
• fast fill operation in the CORFU API (<1 ms):
  • completes half-written entries
  • writes junk on unwritten entries (a metadata operation, conserving flash cycles and bandwidth)
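A hedged sketch of how a fill might look from the client side, following the two cases on this slide: finish replicating a half-written entry, or mark the position as junk. The junk-marker convention and helper names are assumptions, not the actual protocol messages.

```go
// Sketch of hole-filling: complete a half-written entry if the head has data,
// otherwise write a junk marker so readers are never blocked on the hole.
// The junk-marker convention and interface names are illustrative assumptions.
package corfu

import "errors"

var errUnwritten = errors.New("page unwritten")

// junkMarker stands in for the cheap metadata-only "junk" entry from the slide.
var junkMarker = []byte{0}

type flashUnit interface {
	Read(page uint64) ([]byte, error)  // fails on an unwritten page
	Write(page uint64, d []byte) error // fails if already written (write-once)
}

// Fill repairs the given page of a replica chain.
func Fill(chain []flashUnit, page uint64) error {
	// Case 1: the head was written but the writer crashed mid-chain;
	// finish the replication with the head's value.
	if data, err := chain[0].Read(page); err == nil {
		for _, u := range chain[1:] {
			if err := u.Write(page, data); err != nil {
				// A concurrent fill or the original writer may have raced us;
				// surface the error and let the caller re-read the position.
				return err
			}
		}
		return nil
	}
	// Case 2: nothing was written anywhere; mark the position as junk,
	// head-to-tail, so no real data can sneak in later.
	for _, u := range chain {
		if err := u.Write(page, junkMarker); err != nil {
			return err
		}
	}
	return nil
}
```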
handling failures: flash units
each Projection is a list of views, mapping ranges of log positions to sets of drives. for example, Projection 0 maps the whole log onto D1..D8; when D2 fails, Projection 1 keeps the already-written range (positions 0-7) on the surviving drives and maps positions from 8 onward onto D1, D9, D3..D8; a later Projection 2 can in turn map subsequent positions onto a fresh set of drives D10..D17
handling failures: sequencer
the sequencer is an optimization:
• clients can instead probe for the tail of the log
the log tail is soft state:
• its value can be reconstructed from the flash units
• the sequencer's identity is stored in the Projection
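Because the tail is soft state, a recovering sequencer (or a client running without one) can rebuild it by asking the flash units, as the slide says. A minimal sketch of that reconstruction, with an assumed MaxWrittenPosition query rather than the real wire protocol, is below.

```go
// Sketch of rebuilding the log tail from the flash units themselves,
// reflecting the slide's point that the sequencer holds only soft state.
// MaxWrittenPosition is an assumed interface, not the real wire protocol.
package corfu

// tailReporter is any flash unit that can report the highest log position
// it has accepted a write for (false if it has accepted none).
type tailReporter interface {
	MaxWrittenPosition() (uint64, bool)
}

// RecoverTail returns the next free log position: one past the highest
// position any unit in the current projection has seen.
func RecoverTail(units []tailReporter) uint64 {
	var next uint64
	for _, u := range units {
		if pos, ok := u.MaxWrittenPosition(); ok && pos+1 > next {
			next = pos + 1
		}
	}
	return next
}
```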
from State Machine Replication to CORFU [From Paxos to CORFU: A Flash-Speed Shared Log, ACM SIGOPS, 2012]
Paxos (SMR):
• forms a sequence of agreement decisions
• leader-based solutions: the leader proposes each entry to a group of replicas; same leader, same replicas, for all entries
• throughput capped at the leader's I/O capacity
• scale-out through partitioning: multiple autonomous sequences
CORFU:
• decides on a sequence of log-entry values
• CORFU chain: the first unit in the chain proposes the value (thanks to the sequencer, there is no contention)
• each entry has a designated replica set
• throughput capped at the rate of sequencing 64-bit tokens
• scale-out by partitioning over time rather than over data
how far is CORFU from Paxos?
• CORFU partitions across time, not space
• each CORFU replica chain plays the role of a Paxos instance
TOFU: lock-free transactions with CORFU
the application issues read/update (key, version, value) operations against a versioned table store (BigTable, HBase, PacificA, FDS); TOFU must answer two questions: where is your data? and did you and I do all reads/updates atomically?
• the TOFU library looks up entries through an index and appends a 'commit' intention record (references, conflict zone, update intents) to the shared log via the CORFU library
• each client constructs and updates its index by 'playing' the log
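One way to picture this commit path: a transaction's read baseline, conflict zone, and intended updates are appended as a single intention record, and every client decides commit/abort deterministically while playing the log. The record fields and the conflict rule below are illustrative assumptions, not TOFU's or Hyder's exact format.

```go
// Sketch of a TOFU-style intention record and deterministic conflict check
// while "playing" the shared log. Field names and the conflict rule are
// illustrative assumptions rather than the actual on-log format.
package main

import "fmt"

// Intention is what a client appends to CORFU to try to commit a transaction.
type Intention struct {
	ReadPos uint64            // log position the transaction's reads were based on
	Writes  map[string]string // key -> new value
}

// Index is each client's local versioned view, rebuilt by playing the log.
type Index struct {
	values  map[string]string
	lastPos map[string]uint64 // log position of the last committed write per key
}

func NewIndex() *Index {
	return &Index{values: map[string]string{}, lastPos: map[string]uint64{}}
}

// Apply plays one log entry: commit iff no key touched by the intention was
// modified inside its conflict zone (after ReadPos, before this position).
func (ix *Index) Apply(pos uint64, in Intention) bool {
	for k := range in.Writes {
		if ix.lastPos[k] > in.ReadPos {
			return false // conflict: abort deterministically on every client
		}
	}
	for k, v := range in.Writes {
		ix.values[k] = v
		ix.lastPos[k] = pos
	}
	return true
}

func main() {
	ix := NewIndex()
	fmt.Println(ix.Apply(1, Intention{ReadPos: 0, Writes: map[string]string{"x": "1"}})) // true
	fmt.Println(ix.Apply(2, Intention{ReadPos: 0, Writes: map[string]string{"x": "2"}})) // false: x changed at pos 1
}
```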
CORFU applications
• key-value store: what goes in the log: data + commit records; why: multi-put/get, snapshots, mirroring
• database (Hyder, Bernstein et al.): what: speculative commit records; why: decide commit/abort
• virtual disk storage: what: volume sectors; why: multiplex flash cycles / IOPS / bytes
• state machine replication: what: proposed commands; why: order/persist commands
• others: metadata services, distributed STM, filesystems…
grieving for HD
• Denial: too expensive, vulnerable, wait for PCM
• Anger: SSD-FTL, use flash as a block device: random writes exhibit bursty performance, duplicated functionality
• Bargaining: use flash as a memory cache / memory extension
• Depression: throw money and power at it, PCI-e controllers
• Acceptance: a new design for farms of flash in a cluster
CORFU Beehive-based flash unit [A Design for Network Flash, NVMW 2012]
• Beehive architecture over BEE3 and XUPv5 FPGAs
• multiple slow (soft) cores
• token ring interconnect
• a vehicle for practical architecture research
Conclusion
• CORFU is a distributed SSD: 1 million write IOPS, linearly scalable read IOPS
• CORFU is a shared log: strong consistency at wire speed
• CORFU uses network-attached flash to construct inexpensive, power-efficient clusters
reconfiguration
to reconfigure, a client must:
• seal the current Projection on the drives
• determine the last log position modified in the current Projection
• propose a new Projection to the auxiliary as the next Projection
an auxiliary stores a sequence of Projections:
• propose the kth Projection; fails if it already exists
• read the kth Projection; fails if none exists
why an auxiliary?
• it can tolerate f failures with f+1 replicas
• it can be implemented via conventional Paxos
reconfiguration latency = tens of milliseconds
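The three reconfiguration steps might look roughly like this from the client's side. The Auxiliary interface mirrors the propose/read behaviour described above; the sealing RPCs, error handling, and type names are simplified assumptions.

```go
// Sketch of client-driven reconfiguration: seal, find the last written
// position, then propose the next Projection to the auxiliary.
// Interfaces are simplified stand-ins for the slide's description.
package corfu

import "errors"

var ErrProjectionExists = errors.New("projection already proposed at this index")

type Projection struct {
	Epoch int
	// mapping of log ranges to replica chains omitted in this sketch
}

// FlashUnit supports sealing: after Seal(epoch), it rejects requests tagged
// with older epochs.
type FlashUnit interface {
	Seal(epoch int) error
	MaxWrittenPosition() (uint64, bool)
}

// Auxiliary stores the sequence of Projections (e.g., backed by Paxos).
type Auxiliary interface {
	Propose(k int, p Projection) error // fails if the k-th projection already exists
	Read(k int) (Projection, error)    // fails if none exists
}

// Reconfigure installs `next` as projection number cur.Epoch+1 and returns
// the last log position written under the current projection.
func Reconfigure(units []FlashUnit, aux Auxiliary, cur, next Projection) (uint64, error) {
	// 1. Seal the current projection so no more writes land in it.
	for _, u := range units {
		if err := u.Seal(cur.Epoch); err != nil {
			return 0, err
		}
	}
	// 2. Determine the last log position modified under the current projection.
	var last uint64
	for _, u := range units {
		if pos, ok := u.MaxWrittenPosition(); ok && pos > last {
			last = pos
		}
	}
	// 3. Propose the new projection; losing this race is fine, since someone
	// else reconfigured first and their result can simply be read back.
	next.Epoch = cur.Epoch + 1
	if err := aux.Propose(next.Epoch, next); err != nil && !errors.Is(err, ErrProjectionExists) {
		return 0, err
	}
	return last, nil
}
```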
flash unit design
• the device exposes a sparse linear address space
• three message types: read, write, seal
• write-once semantics:
  • reads must fail on unwritten pages
  • writes must fail on already-written pages
• bad-block remapping
• three form factors:
  • FPGA+SSD: implemented, single unit
  • server+SSD: implemented, 32-unit deployment
  • FPGA+flash: under development
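A minimal in-memory model of the contract above: a sparse, write-once address space driven by read/write/seal. The epoch check and error names are assumptions for illustration, not the unit's actual message format.

```go
// In-memory model of the flash unit contract on this slide: a sparse,
// write-once address space with read/write/seal. Error names and the
// epoch check are illustrative assumptions.
package main

import (
	"errors"
	"fmt"
)

var (
	ErrUnwritten = errors.New("read of unwritten page")
	ErrWritten   = errors.New("write to already-written page")
	ErrSealed    = errors.New("request carries a stale epoch")
)

type FlashUnit struct {
	epoch int               // highest epoch this unit has been sealed at
	pages map[uint64][]byte // sparse linear address space
}

func NewFlashUnit() *FlashUnit { return &FlashUnit{pages: map[uint64][]byte{}} }

func (f *FlashUnit) Read(epoch int, page uint64) ([]byte, error) {
	if epoch < f.epoch {
		return nil, ErrSealed
	}
	d, ok := f.pages[page]
	if !ok {
		return nil, ErrUnwritten // reads must fail on unwritten pages
	}
	return d, nil
}

func (f *FlashUnit) Write(epoch int, page uint64, data []byte) error {
	if epoch < f.epoch {
		return ErrSealed
	}
	if _, ok := f.pages[page]; ok {
		return ErrWritten // writes must fail on already-written pages
	}
	f.pages[page] = data
	return nil
}

// Seal makes the unit reject all later requests tagged with older epochs.
func (f *FlashUnit) Seal(epoch int) {
	if epoch > f.epoch {
		f.epoch = epoch
	}
}

func main() {
	u := NewFlashUnit()
	fmt.Println(u.Write(0, 42, []byte("x"))) // <nil>
	fmt.Println(u.Write(0, 42, []byte("y"))) // write to already-written page
	u.Seal(1)
	fmt.Println(u.Write(0, 43, []byte("z"))) // request carries a stale epoch
}
```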
the FPGA+SSD flash unit
• Beehive architecture over BEE3 and XUPv5 FPGAs
• BEE3: 80 GB SSD, 8 GB RAM
• power: 15W
• throughput: 800 Mbps
• end-to-end latency:
  • reads: 400 µs
  • mirrored appends: 800 µs
garbage collection
• once appended, data does not move in the log
• the application is required to trim invalid positions
• the log address space can therefore become sparse, with valid and invalid entries interleaved all the way up to ∞
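A small sketch of the trim-driven model implied here: data never moves, the application invalidates positions, and the store simply tracks which entries in the sparse address space remain live. The bookkeeping below is one possible illustrative scheme, not the actual implementation.

```go
// Sketch of trim-based garbage collection: the application invalidates log
// positions; the store tracks live vs. reclaimable entries in a sparse
// address space. The bookkeeping is one possible illustrative scheme.
package main

import "fmt"

type SparseLog struct {
	live    map[uint64]struct{} // written and still valid positions
	trimmed map[uint64]struct{} // written, then trimmed (reclaimable)
}

func NewSparseLog() *SparseLog {
	return &SparseLog{live: map[uint64]struct{}{}, trimmed: map[uint64]struct{}{}}
}

func (l *SparseLog) Append(pos uint64) { l.live[pos] = struct{}{} }

// Trim marks a position invalid; the data never moves, the space is reclaimed.
func (l *SparseLog) Trim(pos uint64) {
	if _, ok := l.live[pos]; ok {
		delete(l.live, pos)
		l.trimmed[pos] = struct{}{}
	}
}

func main() {
	l := NewSparseLog()
	for pos := uint64(0); pos < 8; pos++ {
		l.Append(pos)
	}
	l.Trim(2)
	l.Trim(5)
	fmt.Printf("live entries: %d, reclaimable entries: %d\n", len(l.live), len(l.trimmed))
}
```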
the CORFU cluster (server+SSD)
• $3K of flash: 32 drives, 16 servers, 2 racks
• Intel X25-V (40 GB), 20,000 4KB read IOPS per drive
• each server limited by its 1 Gbps NIC to 30,000 4KB IOPS
• mirrored across racks
• max read throughput: 500K 4KB IOPS (= 16 Gbps = 2 GB/s)
• max write throughput: 250K 4KB IOPS
flash unit design
• message-passing ring of cores
• logical separation of functionality + specialized cores
• slow cores (100 MHz) = low power (15W)
flash primer
• page granularity (e.g., 4KB)
• fast random reads
• overwriting a 4KB page requires the surrounding 512KB block to be erased
  • erasures are slow
  • erasures cause wear-out (high BER)
• SSDs implement LBA-over-log:
  • no erasures in the critical path
  • even wear-out across the drive
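The LBA-over-log point can be made concrete with a tiny sketch: a logical block write goes to the tail of an internal log and a mapping table is updated, so no erase sits on the write path. The structure below is illustrative only; real FTLs are far more involved.

```go
// Tiny sketch of LBA-over-log: logical block addresses map to ever-advancing
// log positions, so overwrites never erase in the critical path.
// Illustrative only; real FTLs are far more involved.
package main

import "fmt"

type FTL struct {
	mapping map[uint64]uint64 // logical block address -> physical log position
	tail    uint64            // next free position in the internal log
}

func NewFTL() *FTL { return &FTL{mapping: map[uint64]uint64{}} }

// Write appends the new version at the log tail and remaps the LBA; the old
// physical page becomes garbage to be erased later, off the write path.
func (f *FTL) Write(lba uint64) {
	f.mapping[lba] = f.tail
	f.tail++
}

func main() {
	f := NewFTL()
	f.Write(7)
	f.Write(7) // overwrite: remap, don't erase now
	fmt.Printf("LBA 7 now lives at log position %d\n", f.mapping[7])
}
```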