Scaling a file system to many cores using an operation log Srivatsa S. Bhat, Rasha Eqbal, Austin T. Clements, M. Frans Kaashoek, Nickolai Zeldovich MIT CSAIL
Motivation: Current file systems don't scale well • Filesystem: Linux ext4 (4.9.21) • Benchmark: dbench [https://dbench.samba.org] • Experimental setup: 80 cores, 256 GB RAM • Backing store: "RAM" disk
Concurrent file creation in Linux ext4 creat(dirA/file2) creat(dirA/file1) CORE 2 CORE 1 DISK MEMORY ext4 Journal dirA’s block
Block contention limits scalability of file creation creat(dirA/file2) creat(dirA/file1) CORE 2 CORE 1 DISK MEMORY ext4 Journal dirA’s block Contends on the directory block! • Contention on blocks limits scalability on 80 cores • Even apps not limited by disk I/O don’t scale
Goal: Multicore scalability • Problem: Contention limits scalability • Contention involves cache-line conflicts • Goal: Multicore scalability = no cache-line conflicts • Even a single contended cache line can wreck scalability • Commutative operations can be implemented without cache-line conflicts [Scalable Commutativity Rule, Clements SOSP '13] • How do we scale all commutative operations in file systems?
ScaleFS approach: Two separate file systems • MEMORY: MemFS, designed for multicore scalability (directories as hash-tables) • DISK: DiskFS, designed for durability (journal, block cache) • fsync connects the two
Concurrent file creation scales in ScaleFS creat (dirA/file1) creat (dirA/file2) CORE 1 CORE 2 DISK MEMORY MemFS DiskFS Journal dirA Block cache No contention No cache-line conflicts Scalability!
Challenge: How to implement fsync? • DiskFS updates must be consistent with MemFS • fsync must preserve conflict-freedom for commutative ops fsync DISK MEMORY MemFS DiskFS Journal dirA Block cache
Contributions • ScaleFS, a file system that achieves excellent multicore scalability • Two separate file systems: MemFS and DiskFS • Design for fsync: • Per-core operation logs to scalably defer updates to DiskFS • Ordering operations using Time Stamp Counters • Evaluation: • Benchmarks on ScaleFS scale 35x-60x on 80 cores • Workload- and machine-independent analysis of cache-line conflicts • Suggests ScaleFS is a good fit for workloads not limited by disk I/O
ScaleFS design: Two separate file systems • MEMORY: MemFS, designed for multicore scalability; uses hash-tables, radix-trees, and seqlocks for lock-free reads • DISK: DiskFS, designed for durability; uses blocks, transactions, and journaling • fsync applies MemFS changes to DiskFS through per-core operation logs and the journal
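To make the per-core operation logs concrete, here is a minimal C sketch of one possible layout; the structure and the names (op_type, oplog_entry, oplog_append) are illustrative assumptions, not taken from the ScaleFS sources:

    #include <stdint.h>

    #define NCORES 80

    enum op_type { OP_CREATE, OP_UNLINK, OP_RENAME };

    struct oplog_entry {
        enum op_type        op;     /* which directory operation was performed */
        uint64_t            ts;     /* timestamp taken when the op ran */
        uint64_t            mnode;  /* in-memory inode the op applies to */
        struct oplog_entry *next;
    };

    /* Aligning each log to a cache line keeps one core's appends from
     * invalidating another core's log head (no false sharing). */
    struct oplog {
        struct oplog_entry *head;
    } __attribute__((aligned(64)));

    static struct oplog percore_log[NCORES];

    /* Called from MemFS after an operation completes: only core-local
     * cache lines are written, so concurrent operations on different
     * cores never conflict. */
    static void oplog_append(int core, struct oplog_entry *e)
    {
        e->next = percore_log[core].head;
        percore_log[core].head = e;
    }

Entries are prepended, so a single core's list is in reverse order; the timestamp recorded in ts lets fsync recover the true order, as the next slides show.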
Design challenges • How to order operations in the per-core operation logs? • How to operate MemFS and DiskFS independently: • How to allocate inodes in a scalable manner in MemFS? • . . .
Problem: Preserve ordering of non-commutative ops unlink (file1) CORE 1 DISK MemFS DiskFS Journal dirA Block cache Per-core Operation Logs
Problem: Preserve ordering of non-commutative ops creat (file1) unlink (file1) CORE 1 CORE 2 DISK MemFS DiskFS Journal dirA Block cache Per-core Operation Logs
Problem: Preserve ordering of non-commutative ops creat (file1) unlink (file1) fsync CORE 3 CORE 1 CORE 2 DISK MemFS DiskFS Journal dirA Block cache Per-core Operation Logs Order: How??
Solution: Use synchronized Time Stamp Counters [ RDTSCP does not incur cache-line conflicts ] creat (file1) unlink (file1) fsync CORE 3 CORE 1 CORE 2 DISK MemFS DiskFS Journal dirA Block cache Per-core Operation Logs Order: ts1 < ts2
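A sketch of how the timestamping and ordering might work on x86, assuming synchronized TSCs across cores (as on the paper's test machine); rdtscp reads a core-local counter, so taking a timestamp touches no shared cache lines. The helper names here are illustrative, not from the ScaleFS sources:

    #include <stdint.h>
    #include <stdlib.h>

    /* Read the core-local time stamp counter (x86 RDTSCP). */
    static inline uint64_t rdtscp(void)
    {
        uint32_t lo, hi, aux;
        __asm__ volatile("rdtscp" : "=a"(lo), "=d"(hi), "=c"(aux));
        return ((uint64_t)hi << 32) | lo;
    }

    struct logged_op {
        uint64_t ts;            /* stamped with rdtscp() at log time */
        /* ... the operation itself ... */
    };

    static int cmp_ts(const void *a, const void *b)
    {
        uint64_t ta = ((const struct logged_op *)a)->ts;
        uint64_t tb = ((const struct logged_op *)b)->ts;
        return (ta > tb) - (ta < tb);
    }

    /* At fsync: gather the entries from every per-core log into one
     * array and sort by timestamp, so non-commutative pairs such as
     * unlink(file1) followed by creat(file1) replay against DiskFS in
     * the order they actually happened. */
    static void order_ops(struct logged_op *ops, size_t n)
    {
        qsort(ops, n, sizeof(ops[0]), cmp_ts);
    }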
Problem: How to allocate inodes scalably in MemFS? creat(dirA/file1) CORE 1 DISK MemFS DiskFS Journal dirA Block cache Inode Allocator
Solution (1): Separate mnodes in MemFS from inodes in DiskFS creat(dirA/file1) CORE 1 DISK MemFS DiskFS Journal dirA Block cache Per-core Mnode Allocator Inode Allocator
Solution (2): Defer allocating inodes in DiskFS until an fsync DISK MemFS DiskFS Journal dirA Block cache mnode→inode table Per-core Mnode Allocator Inode Allocator
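A minimal C sketch of both ideas: a per-core mnode allocator for MemFS, and a mnode→inode table that allocates a DiskFS inode only when an fsync first makes the mnode durable. The names and the toy open-addressed table are assumptions for illustration, not the ScaleFS implementation:

    #include <stdint.h>

    #define NCORES     80
    #define TABLE_SIZE 4096

    /* Per-core mnode allocator: each core hands out numbers from its
     * own range (core id in the high bits), so creat() never contends
     * on a shared counter. */
    static uint64_t next_mnode[NCORES] __attribute__((aligned(64)));

    static uint64_t mnode_alloc(int core)
    {
        return ((uint64_t)core << 48) | ++next_mnode[core];
    }

    /* mnode -> inode table, filled in lazily at fsync time. */
    struct mapping { uint64_t mnode, inode; };
    static struct mapping table[TABLE_SIZE];
    static uint64_t next_inode = 1;   /* stand-in for DiskFS's inode allocator */

    static uint64_t mnode_to_inode(uint64_t mnode)
    {
        uint64_t slot = mnode % TABLE_SIZE;   /* toy open addressing */
        while (table[slot].mnode != 0 && table[slot].mnode != mnode)
            slot = (slot + 1) % TABLE_SIZE;
        if (table[slot].mnode == 0) {         /* first fsync of this mnode */
            table[slot].mnode = mnode;
            table[slot].inode = next_inode++;
        }
        return table[slot].inode;
    }

Files that are created and deleted before any fsync never get a DiskFS inode at all, which is what lets creat avoid the shared inode allocator entirely.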
Other design challenges • How to scale concurrent fsyncs? • How to order lock-free reads? • How to resolve dependencies affecting multiple inodes? • How to ensure internal consistency despite crashes?
Implementation • ScaleFS is implemented on the sv6 research operating system • Supported file-system system calls: • creat, open, openat, mkdir, mkdirat, mknod, dup, dup2, lseek, read, pread, write, pwrite, chdir, readdir, pipe, pipe2, stat, fstat, link, unlink, rename, fsync, sync, close
Evaluation • Does ScaleFS achieve good scalability? • Measure scalability on 80 cores • Observe conflict-freedom for commutative operations • Does ScaleFS achieve good disk throughput? • What memory overheads are introduced by ScaleFS’s split of MemFS and DiskFS?
Evaluation methodology • Machine configuration: • 80 cores, Intel E7-8870 2.4 GHz CPUs • 256 GB RAM • Backing store: "RAM" disk • Benchmarks: • mailbench: mail server workload • dbench: file server workload • largefile: creates a file, writes 100 MB, fsyncs, and deletes it • smallfile: creates, writes, fsyncs, and deletes many 1 KB files
ScaleFS scales 35x-60x on a RAM disk [ Single-core performance of ScaleFS is on par with Linux ext4. ]
Machine-independent methodology • Use Commuter [Clements SOSP '13] to observe conflict-freedom for commutative ops • Commuter: • Generates testcases for pairs of commutative ops • Reports observed cache-line conflicts
Conflict-freedom for commutative ops on Linux ext4: 65%
Conflict-freedom for commutative ops on ScaleFS: 99.2% • Why not 100% conflict-free? • Some operations trade off scalability for performance • Some conflicts are probabilistic
Evaluation summary • ScaleFS scales well on an 80-core machine • Commuter reports 99.2% conflict-freedom on ScaleFS • Workload/machine independent • Suggests scalability beyond our experimental setup and benchmarks
Related Work • Scalability studies: FxMark [USENIX '16], Linux Scalability [OSDI '10] • Scaling file systems using sharding: Hare [EuroSys '15], SpanFS [USENIX '15] • ScaleFS uses similar techniques: • Operation logging: OpLog [CSAIL TR '14] • Per-inode/per-core logs: NOVA [FAST '16], iJournaling [USENIX '17], Strata [SOSP '17] • Decoupling in-memory and on-disk representations: Linux dcache, ReconFS [FAST '14] • ScaleFS focus: achieve scalability by avoiding cache-line conflicts
Conclusion • ScaleFS: a novel file system design for multicore scalability • Two separate file systems: MemFS and DiskFS • Per-core operation logs • Ordering using Time Stamp Counters • ScaleFS scales 35x-60x on an 80-core machine • ScaleFS is conflict-free for 99.2% of testcases in Commuter • https://github.com/mit-pdos/scalefs