Explore how ScaleFS tackles the scalability limits of Linux ext4 on 80-core machines by using an operation log to avoid cache-line conflicts while preserving the ordering of file system operations. The design splits the file system into two, MemFS and DiskFS, connected by per-core operation logs ordered with synchronized Time Stamp Counters. ScaleFS achieves excellent multicore scalability, scaling file creation up to 60x on 80 cores.
Scaling a file system to many cores using an operation log
Srivatsa S. Bhat, Rasha Eqbal, Austin T. Clements, M. Frans Kaashoek, Nickolai Zeldovich
MIT CSAIL
Motivation: Current file systems don't scale well
• File system: Linux ext4 (4.9.21)
• Benchmark: dbench [https://dbench.samba.org]
• Experimental setup: 80 cores, 256 GB RAM, backing store: "RAM" disk
Concurrent file creation in Linux ext4
[Diagram: cores 1 and 2 call creat(dirA/file1) and creat(dirA/file2); both go through ext4 in memory, updating dirA's directory block and the journal before reaching disk]
Block contention limits scalability of file creation
[Diagram: both creat calls funnel into the same dirA directory block: "Contends on the directory block!"]
• Contention on blocks limits scalability on 80 cores
• Even apps not limited by disk I/O don't scale
Goal: Multicore scalability
• Problem: Contention limits scalability, and contention involves cache-line conflicts
• Goal: Multicore scalability = no cache-line conflicts
• Even a single contended cache line can wreck scalability (illustrated below)
• Commutative operations can be implemented without cache-line conflicts [Scalable Commutativity Rule, Clements SOSP '13]
• How do we scale all commutative operations in file systems?
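To make the cache-line point concrete, here is a small standalone illustration (ours, not from the talk; the thread count and the 64-byte line size are assumptions): several threads increment both a single shared counter, whose cache line ping-pongs between cores, and per-thread counters padded onto separate cache lines, which no other core ever writes.

```cpp
// Illustration only: a single contended cache line vs. conflict-free
// per-thread state. Assumes a 64-byte cache line.
#include <atomic>
#include <cstdio>
#include <thread>
#include <vector>

constexpr int kThreads = 8;
constexpr long kIters = 1'000'000;

std::atomic<long> shared_counter{0};   // one cache line written by every core

struct alignas(64) PaddedCounter {     // each counter sits on its own line
    std::atomic<long> v{0};
};
PaddedCounter per_thread[kThreads];

int main() {
    std::vector<std::thread> workers;
    for (int i = 0; i < kThreads; i++)
        workers.emplace_back([i] {
            for (long n = 0; n < kIters; n++) {
                // Contended: each increment invalidates the line in every other core.
                shared_counter.fetch_add(1, std::memory_order_relaxed);
                // Conflict-free: only thread i ever writes this line.
                per_thread[i].v.fetch_add(1, std::memory_order_relaxed);
            }
        });
    for (auto& w : workers) w.join();
    long sum = 0;
    for (auto& c : per_thread) sum += c.v.load();
    std::printf("shared=%ld, per-thread sum=%ld\n", shared_counter.load(), sum);
}
```

Timed separately on a multicore machine, the padded counters scale with the number of threads while the shared counter's throughput flattens; the same effect is what serializes concurrent creat calls on dirA's directory block in ext4.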
ScaleFS approach: Two separate file systems
[Diagram: MemFS in memory, designed for multicore scalability, holds directories as hash tables; DiskFS, designed for durability, manages the journal and block cache on disk; fsync connects the two]
Concurrent file creation scales in ScaleFS
[Diagram: cores 1 and 2 call creat(dirA/file1) and creat(dirA/file2); both update dirA's hash table in MemFS, while DiskFS's journal and block cache stay untouched]
No contention, no cache-line conflicts: scalability!
Challenge: How to implement fsync?
[Diagram: fsync must carry dirA's in-memory state from MemFS into DiskFS's journal and block cache]
• DiskFS updates must be consistent with MemFS
• fsync must preserve conflict-freedom for commutative ops
Contributions
• ScaleFS, a file system that achieves excellent multicore scalability
• Two separate file systems: MemFS and DiskFS
• Design for fsync:
  • Per-core operation logs to scalably defer updates to DiskFS
  • Ordering operations using Time Stamp Counters
• Evaluation:
  • Benchmarks on ScaleFS scale 35x-60x on 80 cores
  • Workload- and machine-independent analysis of cache conflicts suggests ScaleFS is a good fit for workloads not limited by disk I/O
ScaleFS design: Two separate file systems
[Diagram: MemFS (in memory, designed for multicore scalability) uses hash tables, radix trees, and seqlocks for lock-free reads; DiskFS (designed for durability) uses blocks, transactions, and journaling; fsync drains per-core operation logs from MemFS into DiskFS]
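As a rough sketch of how the MemFS half of an operation might look (all type and function names here are inventions for illustration, not ScaleFS's API, and the lock-free synchronization MemFS actually uses is omitted): creat updates an in-memory per-directory hash table and appends a record to the calling core's private operation log. DiskFS, the journal, and the block cache are never touched on this path.

```cpp
// Sketch with illustrative names; synchronization omitted. creat in
// MemFS touches only in-memory structures and the local core's log.
#include <cstdint>
#include <deque>
#include <string>
#include <unordered_map>

enum class OpType { Create, Unlink, Rename };

struct OpLogEntry {
    OpType type;
    uint64_t tsc;          // ordering timestamp (taken via RDTSCP; see below)
    uint64_t dir_mnode;    // directory the operation applies to
    uint64_t file_mnode;   // the file's in-memory identity
    std::string name;
};

struct Core {
    std::deque<OpLogEntry> oplog;  // private to this core until fsync drains it
};

// Each directory has its own hash table, so creats in different
// directories touch disjoint memory and disjoint cache lines.
using Directory = std::unordered_map<std::string, uint64_t>;

struct MemFS {
    std::unordered_map<uint64_t, Directory> dirs;

    void creat(Core& core, uint64_t dir, const std::string& name,
               uint64_t new_mnode, uint64_t now_tsc) {
        dirs[dir][name] = new_mnode;   // in-memory update only; no journal I/O
        core.oplog.push_back({OpType::Create, now_tsc, dir, new_mnode, name});
    }
};
```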
Design challenges
• How to order operations in the per-core operation logs?
• How to operate MemFS and DiskFS independently, e.g., how to allocate inodes in a scalable manner in MemFS?
• ...
Problem: Preserve ordering of non-commutative ops
[Diagram: core 1 runs unlink(file1) and appends it to its per-core operation log; core 2 then runs creat(file1) and appends to its own log; core 3 calls fsync. Order: how?? unlink and creat of the same name do not commute, so the merged log must replay them in execution order, yet they sit in different per-core logs]
Solution: Use synchronized Time Stamp Counters
[Diagram: cores 1 and 2 tag unlink(file1) and creat(file1) with timestamps as they append to their per-core logs; when core 3 calls fsync, the logs are merged in timestamp order: ts1 < ts2]
• RDTSCP reads the core-local TSC, so taking a timestamp does not incur cache-line conflicts (sketched below)
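A sketch of both halves of this mechanism, assuming an x86 machine with synchronized TSCs (names are illustrative): every logged operation records a timestamp with RDTSCP, which reads a core-local counter and so writes no shared cache line; at fsync, the per-core logs are drained and sorted by timestamp, recovering the execution order of non-commutative operations before they are replayed into DiskFS.

```cpp
// Sketch: timestamping with RDTSCP and merging per-core logs in
// timestamp order. Assumes x86 with synchronized TSCs.
#include <x86intrin.h>   // __rdtscp intrinsic
#include <algorithm>
#include <cstdint>
#include <vector>

// Core-local TSC read: no shared cache line is touched. RDTSCP also
// waits for prior instructions to complete, so the timestamp is taken
// after the logged operation's in-memory updates.
uint64_t op_timestamp() {
    unsigned aux;        // receives IA32_TSC_AUX (the core id); unused here
    return __rdtscp(&aux);
}

struct Entry {
    uint64_t tsc;
    // ... operation payload elided ...
};

// fsync path: impose a total order across cores by timestamp, so
// unlink(file1) at ts1 replays before creat(file1) at ts2 > ts1.
std::vector<Entry> merge_logs(std::vector<std::vector<Entry>>& per_core_logs) {
    std::vector<Entry> merged;
    for (auto& log : per_core_logs) {
        merged.insert(merged.end(), log.begin(), log.end());
        log.clear();
    }
    std::sort(merged.begin(), merged.end(),
              [](const Entry& a, const Entry& b) { return a.tsc < b.tsc; });
    return merged;   // replayed into DiskFS inside one journaled transaction
}
```

Sorting the concatenated logs is the simplest version; since each per-core log is already in timestamp order, a k-way merge would produce the same sequence more cheaply.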
Problem: How to allocate inodes scalably in MemFS?
[Diagram: creat(dirA/file1) on core 1 needs an identity for the new file, but the inode allocator lives in DiskFS, as shared on-disk state]
Solution (1): Separate mnodes in MemFS from inodes in DiskFS
[Diagram: creat(dirA/file1) now draws the new file's number from a per-core mnode allocator in MemFS; DiskFS's inode allocator is not involved]
Solution (2): Defer allocating inodes in DiskFS until an fsync
[Diagram: an mnode-to-inode table maps MemFS mnodes to DiskFS inodes; entries are filled in on fsync, the only time the DiskFS inode allocator runs (sketched below)]
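Putting solutions (1) and (2) together, a sketch with assumed names and an assumed partitioning scheme (carving the 64-bit mnode space by core id is one plausible way to make allocation conflict-free; it is not necessarily what ScaleFS does): mnode numbers come from per-core allocators, and the mnode-to-inode table is filled in lazily on the fsync path, the only place the shared on-disk inode allocator is consulted.

```cpp
// Sketch: per-core mnode allocation in MemFS, deferred inode allocation
// in DiskFS. Names and the partitioning scheme are assumptions.
#include <cstdint>
#include <functional>
#include <mutex>
#include <unordered_map>

struct MnodeAllocator {
    uint64_t next;
    // High bits hold the core id, so cores hand out disjoint mnode
    // numbers with no shared allocator state.
    explicit MnodeAllocator(uint64_t core_id) : next(core_id << 48) {}
    uint64_t alloc() { return next++; }   // core-local: conflict-free
};

struct MnodeInodeTable {
    std::mutex mu;   // taken only on the fsync path, off the fast path
    std::unordered_map<uint64_t, uint64_t> map;   // mnode -> DiskFS inode

    // Called during fsync: a real on-disk inode is allocated only when
    // the file is first forced to disk.
    uint64_t inode_for(uint64_t mnode, const std::function<uint64_t()>& alloc_inode) {
        std::lock_guard<std::mutex> g(mu);
        auto [it, inserted] = map.try_emplace(mnode, 0);
        if (inserted) it->second = alloc_inode();  // consult DiskFS's allocator
        return it->second;
    }
};
```

One consequence of deferring: a file that is created and unlinked without ever being fsynced never needs a disk inode at all, so short-lived files bypass the DiskFS allocator entirely.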
Other design challenges
• How to scale concurrent fsyncs?
• How to order lock-free reads?
• How to resolve dependencies affecting multiple inodes?
• How to ensure internal consistency despite crashes?
Implementation
• ScaleFS is implemented on the sv6 research operating system
• Supported file system calls: creat, open, openat, mkdir, mkdirat, mknod, dup, dup2, lseek, read, pread, write, pwrite, chdir, readdir, pipe, pipe2, stat, fstat, link, unlink, rename, fsync, sync, close
Evaluation
• Does ScaleFS achieve good scalability?
  • Measure scalability on 80 cores
  • Observe conflict-freedom for commutative operations
• Does ScaleFS achieve good disk throughput?
• What memory overheads are introduced by ScaleFS's split of MemFS and DiskFS?
Evaluation methodology
• Machine configuration: 80 cores (Intel E7-8870, 2.4 GHz), 256 GB RAM, backing store: "RAM" disk
• Benchmarks:
  • mailbench: mail server workload
  • dbench: file server workload
  • largefile: creates a file, writes 100 MB, fsyncs, and deletes it
  • smallfile: creates, writes, fsyncs, and deletes many 1 KB files
ScaleFS scales 35x-60x on a RAM disk
[Scalability graph. Single-core performance of ScaleFS is on par with Linux ext4.]
Machine-independent methodology
• Use Commuter [Clements SOSP '13] to observe conflict-freedom for commutative ops
• Commuter generates testcases for pairs of commutative ops and reports observed cache conflicts
Conflict-freedom for commutative ops on Linux ext4: 65%
Conflict-freedom for commutative ops on ScaleFS: 99.2%
• Why not 100% conflict-free?
  • Some cases trade off scalability for performance
  • Some conflicts are probabilistic (e.g., hash collisions)
Evaluation summary
• ScaleFS scales well on an 80-core machine
• Commuter reports 99.2% conflict-freedom on ScaleFS
  • Workload- and machine-independent
  • Suggests scalability beyond our experimental setup and benchmarks
Related Work
• Scalability studies: FxMark [USENIX '16], Linux Scalability [OSDI '10]
• Scaling file systems using sharding: Hare [EuroSys '15], SpanFS [USENIX '15]
• ScaleFS uses similar techniques:
  • Operation logging: OpLog [CSAIL TR '14]
  • Per-inode / per-core logs: NOVA [FAST '16], iJournaling [USENIX '17], Strata [SOSP '17]
  • Decoupling in-memory and on-disk representations: Linux dcache, ReconFS [FAST '14]
• ScaleFS focus: achieve scalability by avoiding cache-line conflicts
Conclusion
• ScaleFS: a novel file system design for multicore scalability
  • Two separate file systems: MemFS and DiskFS
  • Per-core operation logs
  • Ordering using Time Stamp Counters
• ScaleFS scales 35x-60x on an 80-core machine
• ScaleFS is conflict-free for 99.2% of testcases in Commuter
• https://github.com/mit-pdos/scalefs