Scaling a file system to many cores using an operation log Srivatsa S. Bhat, Rasha Eqbal, Austin T. Clements, M. Frans Kaashoek, Nickolai Zeldovich MIT CSAIL
Motivation: Current file systems don't scale well • Filesystem: Linux ext4 (4.9.21) • Benchmark: dbench [https://dbench.samba.org] • Experimental setup: 80 cores, 256 GB RAM • Backing store: "RAM" disk
Concurrent file creation in Linux ext4 creat(dirA/file2) creat(dirA/file1) CORE 2 CORE 1 DISK MEMORY ext4 Journal dirA’s block
Block contention limits scalability of file creation creat(dirA/file2) creat(dirA/file1) CORE 2 CORE 1 DISK MEMORY ext4 Journal dirA’s block Contends on the directory block! • Contention on blocks limits scalability on 80 cores • Even apps not limited by disk I/O don’t scale
Goal: Multicore scalability • Problem: Contention limits scalability • Contention involves cache-line conflicts • Goal: Multicore scalability = no cache-line conflicts • Even a single contended cache line can wreck scalability • Commutative operations can be implemented without cache-line conflicts [Scalable Commutativity Rule, Clements SOSP '13] • How do we scale all commutative operations in file systems?
ScaleFS approach: Two separate file systems • MEMORY: MemFS, designed for multicore scalability (directories as hash-tables) • DISK: DiskFS, designed for durability (journal, block cache) • fsync connects the two
Concurrent file creation scales in ScaleFS creat (dirA/file1) creat (dirA/file2) CORE 1 CORE 2 DISK MEMORY MemFS DiskFS Journal dirA Block cache No contention No cache-line conflicts Scalability!
Challenge: How to implement fsync? • DiskFS updates must be consistent with MemFS • fsync must preserve conflict-freedom for commutative ops fsync DISK MEMORY MemFS DiskFS Journal dirA Block cache
Contributions • ScaleFS, a file system that achieves excellent multicore scalability • Two separate file systems: MemFS and DiskFS • Design for fsync: • Per-core operation logs to scalably defer updates to DiskFS • Ordering operations using Time Stamp Counters • Evaluation: • Benchmarks on ScaleFS scale 35x-60x on 80 cores • Workload- and machine-independent analysis of cache-line conflicts • Suggests ScaleFS is a good fit for workloads not limited by disk I/O
ScaleFS design: Two separate file systems • MEMORY: MemFS, designed for multicore scalability; uses hash-tables, radix-trees, and seqlocks for lock-free reads • DISK: DiskFS, designed for durability; uses blocks, transactions, and journaling • fsync applies MemFS changes to DiskFS through per-core operation logs and the journal
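To make the per-core operation logs concrete, here is a minimal C sketch of one possible layout; the structure and the names (op_type, oplog_entry, oplog_append) are illustrative assumptions, not taken from the ScaleFS sources:

    #include <stdint.h>

    #define NCORES 80

    enum op_type { OP_CREATE, OP_UNLINK, OP_RENAME };

    struct oplog_entry {
        enum op_type        op;     /* which directory operation was performed */
        uint64_t            ts;     /* timestamp taken when the op ran */
        uint64_t            mnode;  /* in-memory inode the op applies to */
        struct oplog_entry *next;
    };

    /* Aligning each log to a cache line keeps one core's appends from
     * invalidating another core's log head (no false sharing). */
    struct oplog {
        struct oplog_entry *head;
    } __attribute__((aligned(64)));

    static struct oplog percore_log[NCORES];

    /* Called from MemFS after an operation completes: only core-local
     * cache lines are written, so concurrent operations on different
     * cores never conflict. */
    static void oplog_append(int core, struct oplog_entry *e)
    {
        e->next = percore_log[core].head;
        percore_log[core].head = e;
    }

Entries are prepended, so a single core's list is in reverse order; the timestamp recorded in ts lets fsync recover the true order, as the next slides show.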
Design challenges • How to order operations in the per-core operation logs? • How to operate MemFS and DiskFS independently: • How to allocate inodes in a scalable manner in MemFS? • . . .
Problem: Preserve ordering of non-commutative ops unlink (file1) CORE 1 DISK MemFS DiskFS Journal dirA Block cache Per-core Operation Logs
Problem: Preserve ordering of non-commutative ops creat (file1) unlink (file1) CORE 1 CORE 2 DISK MemFS DiskFS Journal dirA Block cache Per-core Operation Logs
Problem: Preserve ordering of non-commutative ops creat (file1) unlink (file1) fsync CORE 3 CORE 1 CORE 2 DISK MemFS DiskFS Journal dirA Block cache Per-core Operation Logs Order: How??
Solution: Use synchronized Time Stamp Counters [ RDTSCP does not incur cache-line conflicts ] creat (file1) unlink (file1) fsync CORE 3 CORE 1 CORE 2 DISK MemFS DiskFS Journal dirA Block cache Per-core Operation Logs Order: ts1 < ts2
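A sketch of how the timestamping and ordering might work on x86, assuming synchronized TSCs across cores (as on the paper's test machine); rdtscp reads a core-local counter, so taking a timestamp touches no shared cache lines. The helper names here are illustrative, not from the ScaleFS sources:

    #include <stdint.h>
    #include <stdlib.h>

    /* Read the core-local time stamp counter (x86 RDTSCP). */
    static inline uint64_t rdtscp(void)
    {
        uint32_t lo, hi, aux;
        __asm__ volatile("rdtscp" : "=a"(lo), "=d"(hi), "=c"(aux));
        return ((uint64_t)hi << 32) | lo;
    }

    struct logged_op {
        uint64_t ts;            /* stamped with rdtscp() at log time */
        /* ... the operation itself ... */
    };

    static int cmp_ts(const void *a, const void *b)
    {
        uint64_t ta = ((const struct logged_op *)a)->ts;
        uint64_t tb = ((const struct logged_op *)b)->ts;
        return (ta > tb) - (ta < tb);
    }

    /* At fsync: gather the entries from every per-core log into one
     * array and sort by timestamp, so non-commutative pairs such as
     * unlink(file1) followed by creat(file1) replay against DiskFS in
     * the order they actually happened. */
    static void order_ops(struct logged_op *ops, size_t n)
    {
        qsort(ops, n, sizeof(ops[0]), cmp_ts);
    }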
Problem: How to allocate inodes scalably in MemFS? creat(dirA/file1) CORE 1 DISK MemFS DiskFS Journal dirA Block cache Inode Allocator
Solution (1): Separate mnodes in MemFS from inodes in DiskFS creat(dirA/file1) CORE 1 DISK MemFS DiskFS Journal dirA Block cache Per-core Mnode Allocator Inode Allocator
Solution (2): Defer allocating inodes in DiskFS until an fsync DISK MemFS DiskFS Journal dirA Block cache mnode→inode table Per-core Mnode Allocator Inode Allocator
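A minimal C sketch of both ideas: a per-core mnode allocator for MemFS, and a mnode→inode table that allocates a DiskFS inode only when an fsync first makes the mnode durable. The names and the toy open-addressed table are assumptions for illustration, not the ScaleFS implementation:

    #include <stdint.h>

    #define NCORES     80
    #define TABLE_SIZE 4096

    /* Per-core mnode allocator: each core hands out numbers from its
     * own range (core id in the high bits), so creat() never contends
     * on a shared counter. */
    static uint64_t next_mnode[NCORES] __attribute__((aligned(64)));

    static uint64_t mnode_alloc(int core)
    {
        return ((uint64_t)core << 48) | ++next_mnode[core];
    }

    /* mnode -> inode table, filled in lazily at fsync time. */
    struct mapping { uint64_t mnode, inode; };
    static struct mapping table[TABLE_SIZE];
    static uint64_t next_inode = 1;   /* stand-in for DiskFS's inode allocator */

    static uint64_t mnode_to_inode(uint64_t mnode)
    {
        uint64_t slot = mnode % TABLE_SIZE;   /* toy open addressing */
        while (table[slot].mnode != 0 && table[slot].mnode != mnode)
            slot = (slot + 1) % TABLE_SIZE;
        if (table[slot].mnode == 0) {         /* first fsync of this mnode */
            table[slot].mnode = mnode;
            table[slot].inode = next_inode++;
        }
        return table[slot].inode;
    }

Files that are created and deleted before any fsync never get a DiskFS inode at all, which is what lets creat avoid the shared inode allocator entirely.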
Other design challenges • How to scale concurrent fsyncs? • How to order lock-free reads? • How to resolve dependencies affecting multiple inodes? • How to ensure internal consistency despite crashes?
Implementation • ScaleFS is implemented on the sv6 research operating system • Supported file-system system calls: • creat, open, openat, mkdir, mkdirat, mknod, dup, dup2, lseek, read, pread, write, pwrite, chdir, readdir, pipe, pipe2, stat, fstat, link, unlink, rename, fsync, sync, close
Evaluation • Does ScaleFS achieve good scalability? • Measure scalability on 80 cores • Observe conflict-freedom for commutative operations • Does ScaleFS achieve good disk throughput? • What memory overheads are introduced by ScaleFS’s split of MemFS and DiskFS?
Evaluation methodology • Machine configuration: • 80 cores, Intel E7-8870 2.4 GHz CPUs • 256 GB RAM • Backing store: "RAM" disk • Benchmarks: • mailbench: mail server workload • dbench: file server workload • largefile: creates a file, writes 100 MB, fsyncs, and deletes it • smallfile: creates, writes, fsyncs, and deletes many 1 KB files
ScaleFS scales 35x-60x on a RAM disk [ Single-core performance of ScaleFS is on par with Linux ext4. ]
Machine-independent methodology • Use Commuter [Clements SOSP '13] to observe conflict-freedom for commutative ops • Commuter: • Generates testcases for pairs of commutative ops • Reports observed cache-line conflicts
Conflict-freedom for commutative ops on Linux ext4: 65%
Conflict-freedom for commutative ops on ScaleFS: 99.2% • Why not 100% conflict-free? • Some operations trade off scalability for performance • Some conflicts are probabilistic
Evaluation summary • ScaleFS scales well on an 80-core machine • Commuter reports 99.2% conflict-freedom on ScaleFS • Workload/machine independent • Suggests scalability beyond our experimental setup and benchmarks
Related Work • Scalability studies: FxMark [USENIX '16], Linux Scalability [OSDI '10] • Scaling file systems using sharding: Hare [EuroSys '15], SpanFS [USENIX '15] • ScaleFS uses similar techniques: • Operation logging: OpLog [CSAIL TR '14] • Per-inode/per-core logs: NOVA [FAST '16], iJournaling [USENIX '17], Strata [SOSP '17] • Decoupling in-memory and on-disk representations: Linux dcache, ReconFS [FAST '14] • ScaleFS focus: achieve scalability by avoiding cache-line conflicts
Conclusion • ScaleFS: a novel file system design for multicore scalability • Two separate file systems: MemFS and DiskFS • Per-core operation logs • Ordering using Time Stamp Counters • ScaleFS scales 35x-60x on an 80-core machine • ScaleFS is conflict-free for 99.2% of testcases in Commuter • https://github.com/mit-pdos/scalefs