
Scaling a file system to many cores using an operation log


Presentation Transcript


  1. Scaling a file system to many cores using an operation log Srivatsa S. Bhat, Rasha Eqbal, Austin T. Clements, M. Frans Kaashoek, Nickolai Zeldovich MIT CSAIL

  2. Motivation: Current file systems don’t scale well • Filesystem: Linux ext4 (4.9.21) • Benchmark: dbench [https://dbench.samba.org] • Experimental setup: 80 cores, 256 GB RAM • Backing store: “RAM” disk

  3. Linux ext4 scales poorly on multicore machines

  4. Concurrent file creation in Linux ext4 [Diagram: CORE 1 and CORE 2 issue creat(dirA/file1) and creat(dirA/file2); in memory, both updates go through the ext4 journal and dirA’s block before reaching disk]

  5. Block contention limits scalability of file creation [Diagram: same as slide 4; both creats contend on the directory block!] • Contention on blocks limits scalability on 80 cores • Even apps not limited by disk I/O don’t scale

  6. Goal: Multicore scalability • Problem: Contention limits scalability • Contention involves cache-line conflicts • Goal: Multicore scalability = no cache-line conflicts • Even a single contended cache line can wreck scalability • Commutative operations can be implemented without cache-line conflicts [Scalable Commutativity Rule, Clements SOSP ’13] • How do we scale all commutative operations in file systems?
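
To see why a single contended cache line matters, here is a toy C++ sketch (illustrative only, not from the talk): cores incrementing one shared counter bounce its cache line between them on every add, while per-core counters, like the per-core logs introduced later in the talk, stay in each core's local cache.

```cpp
#include <atomic>
#include <cstdint>
#include <thread>
#include <vector>

std::atomic<uint64_t> shared_count{0};      // one cache line shared by all cores

struct alignas(64) PaddedCount {            // one full cache line per core
    std::atomic<uint64_t> value{0};
};
std::vector<PaddedCount> per_core(80);      // sized for an 80-core machine

void hot_loop(int core, bool contended) {
    for (int i = 0; i < 1'000'000; i++) {
        if (contended)
            shared_count.fetch_add(1);          // every add bounces the line
        else
            per_core[core].value.fetch_add(1);  // stays core-local
    }
}

int main() {
    std::vector<std::thread> threads;
    for (int c = 0; c < 4; c++)
        threads.emplace_back(hot_loop, c, /*contended=*/true);  // flip to false to compare
    for (auto& t : threads) t.join();
}
```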

  7. ScaleFS approach: Two separate file systems [Diagram: in memory, MemFS holds directories as hash-tables and is designed for multicore scalability; on disk, DiskFS with its journal and block cache is designed for durability; fsync connects the two]

  8–9. Concurrent file creation scales in ScaleFS [Diagram: CORE 1 and CORE 2 run creat(dirA/file1) and creat(dirA/file2) against the dirA hash-table in MemFS; DiskFS, its journal, and the block cache are untouched] No contention, no cache-line conflicts: scalability!

  10–11. Challenge: How to implement fsync? [Diagram: fsync must carry dirA from MemFS down into DiskFS’s journal and block cache on disk] • DiskFS updates must be consistent with MemFS • fsync must preserve conflict-freedom for commutative ops

  12. Contributions • ScaleFS, a file system that achieves excellent multicore scalability • Two separate file systems: MemFS and DiskFS • Design for fsync: • Per-core operation logs to scalably defer updates to DiskFS • Ordering operations using Time Stamp Counters • Evaluation: • Benchmarks on ScaleFS scale 35x-60x on 80 cores • Workload/machine-independent analysis of cache-conflicts • Suggests ScaleFS is a good fit for workloads not limited by disk I/O

  13. ScaleFS design: Two separate file systems [Diagram: MemFS in memory, designed for multicore scalability, connected by per-core operation logs and fsync to DiskFS on disk, designed for durability, with its journal] • MemFS uses: hash-tables, radix-trees, seqlocks for lock-free reads • DiskFS uses: blocks, transactions, journaling
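
To make the split concrete, here is a minimal C++ sketch of the MemFS side. All the type names (LogEntry, PerCoreLog, Directory) are inventions for this sketch, not ScaleFS's classes, and std::unordered_map stands in for the concurrent hash tables the real MemFS uses.

```cpp
#include <cstdint>
#include <mutex>
#include <string>
#include <unordered_map>
#include <vector>

// A logged logical operation: what happened, not which disk blocks changed.
struct LogEntry {
    uint64_t timestamp;                   // synchronized TSC value (slides 20-21)
    enum class Op { Create, Unlink } op;
    std::string name;                     // directory entry created or removed
    uint64_t mnode;                       // in-memory file identifier (slide 23)
};

// One log per core, padded to a cache line, so cores appending
// concurrently never share a line: commutative ops stay conflict-free.
struct alignas(64) PerCoreLog {
    std::vector<LogEntry> entries;
    std::mutex lock;                      // contended only when fsync drains the log
};

// A MemFS directory is an in-memory hash table, not an on-disk block, so
// creating two different names in the same directory need not conflict.
struct Directory {
    std::unordered_map<std::string, uint64_t> entries;  // name -> mnode
};
```

Under this sketch, creat() inserts (name → mnode) into the directory hash table and appends a Create entry to the current core's log; DiskFS and the journal are not touched until an fsync.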

  14. Design challenges • How to order operations in the per-core operation logs? • How to operate MemFS and DiskFS independently: • How to allocate inodes in a scalable manner in MemFS? • . . .

  15–16. Problem: Preserve ordering of non-commutative ops [Diagram: unlink(file1) on CORE 1 is appended to that core’s operation log in MemFS; DiskFS, the journal, and the block cache are untouched]

  17–18. Problem: Preserve ordering of non-commutative ops [Diagram: creat(file1) and unlink(file1) run on two different cores, each appended to its own per-core operation log]

  19. Problem: Preserve ordering of non-commutative ops [Diagram: creat(file1), unlink(file1), and fsync run on three different cores; the two ops sit in different per-core operation logs. Order: How??]

  20–21. Solution: Use synchronized Time Stamp Counters [RDTSCP does not incur cache-line conflicts] [Diagram: creat(file1) and unlink(file1) are stamped with TSC values ts1 and ts2 as they are appended to their cores’ logs; fsync on a third core recovers the order: ts1 < ts2]
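
A sketch of how such ordering could work, reusing the LogEntry and PerCoreLog types from the sketch after slide 13. __rdtscp is the actual x86 intrinsic; merge_logs and its drain-then-sort structure are assumptions for illustration, not ScaleFS's exact fsync path.

```cpp
#include <x86intrin.h>   // __rdtscp
#include <algorithm>

// Reading the core-local TSC writes no shared memory, so stamping an
// operation incurs no cache-line conflicts, unlike a global sequence counter.
inline uint64_t op_timestamp() {
    unsigned int aux;            // receives the core's IA32_TSC_AUX value
    return __rdtscp(&aux);
}

// At fsync: drain every core's log and sort by timestamp, so that
// non-commutative pairs such as creat(file1) then unlink(file1) are
// replayed against DiskFS in the order they happened (ts1 < ts2).
std::vector<LogEntry> merge_logs(std::vector<PerCoreLog>& logs) {
    std::vector<LogEntry> merged;
    for (auto& log : logs) {
        std::lock_guard<std::mutex> guard(log.lock);
        merged.insert(merged.end(), log.entries.begin(), log.entries.end());
        log.entries.clear();
    }
    std::sort(merged.begin(), merged.end(),
              [](const LogEntry& a, const LogEntry& b) {
                  return a.timestamp < b.timestamp;
              });
    return merged;               // applied to DiskFS inside one journal transaction
}
```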

  22. Problem: How to allocate inodes scalably in MemFS? [Diagram: creat(dirA/file1) on CORE 1 must reach through MemFS to a single shared Inode Allocator next to DiskFS]

  23–24. Solution (1): Separate mnodes in MemFS from inodes in DiskFS [Diagram: creat(dirA/file1) on CORE 1 now takes its number from a per-core Mnode Allocator in MemFS; DiskFS’s Inode Allocator is out of the fast path]

  25. Solution (2): Defer allocating inodes in DiskFS until an fsync [Diagram: an mnode-inode table maps MemFS mnodes to DiskFS inodes; the per-core Mnode Allocator serves MemFS, while the Inode Allocator is consulted only at fsync time]
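
A combined sketch of solutions (1) and (2). MnodeAllocator, MnodeInodeTable, alloc_disk_inode, and the per-core number ranges are all hypothetical names chosen for this sketch; ScaleFS's actual allocator may differ.

```cpp
#include <cstdint>
#include <unordered_map>

struct DiskFS;                          // stand-in for the on-disk file system
uint32_t alloc_disk_inode(DiskFS&);     // hypothetical: touches disk metadata

// Solution (1): each core hands out mnode numbers from its own disjoint
// range, so creat() never touches another core's allocator state.
struct MnodeAllocator {
    static constexpr uint64_t kRangePerCore = 1ull << 40;  // assumed split
    uint64_t next;
    explicit MnodeAllocator(unsigned core) : next(core * kRangePerCore) {}
    uint64_t alloc() { return next++; }  // purely core-local, no sharing
};

// Solution (2): the mnode-inode table assigns a real DiskFS inode only when
// an fsync first persists the file; until then it exists purely in MemFS.
struct MnodeInodeTable {
    std::unordered_map<uint64_t, uint32_t> map;  // mnode -> disk inode
    uint32_t inode_for(uint64_t mnode, DiskFS& fs) {
        auto it = map.find(mnode);
        if (it != map.end()) return it->second;
        uint32_t ino = alloc_disk_inode(fs);     // deferred until fsync
        map.emplace(mnode, ino);
        return ino;
    }
};
```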

  26. Other design challenges • How to scale concurrent fsyncs? • How to order lock-free reads? • How to resolve dependencies affecting multiple inodes? • How to ensure internal consistency despite crashes?
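
One of these, ordering lock-free reads, rests on the seqlocks mentioned on slide 13. This is the generic seqlock pattern (a sketch, not ScaleFS's code): readers retry instead of writing shared state, so read-only operations stay conflict-free. A production version also needs fences or atomic accesses for the protected data itself.

```cpp
#include <atomic>
#include <cstdint>

// A writer makes the sequence number odd while updating; readers retry
// if they observed an odd or changed number, and never write shared memory.
struct SeqLock {
    std::atomic<uint64_t> seq{0};

    template <typename Update>
    void write(Update update) {
        seq.fetch_add(1);               // odd: update in progress
        update();
        seq.fetch_add(1);               // even: update complete
    }

    template <typename Load>
    auto read(Load load) {
        for (;;) {
            uint64_t s = seq.load();
            if (s & 1) continue;        // writer active, retry
            auto value = load();
            if (seq.load() == s) return value;  // no writer interleaved
        }
    }
};
```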

  27. Implementation • ScaleFS is implemented on the sv6 research operating system • Supported filesystem system calls: • creat, open, openat, mkdir, mkdirat, mknod, dup, dup2, lseek, read, pread, write, pwrite, chdir, readdir, pipe, pipe2, stat, fstat, link, unlink, rename, fsync, sync, close

  28. Evaluation • Does ScaleFS achieve good scalability? • Measure scalability on 80 cores • Observe conflict-freedom for commutative operations • Does ScaleFS achieve good disk throughput? • What memory overheads are introduced by ScaleFS’s split of MemFS and DiskFS?

  29. Evaluation methodology • Machine configuration: • 80 cores, Intel E7-8870 2.4 GHz CPUs • 256 GB RAM • Backing store: “RAM” disk • Benchmarks: • mailbench: mail server workload • dbench: file server workload • largefile: creates a file, writes 100 MB, fsyncs and deletes it • smallfile: creates, writes, fsyncs and deletes lots of 1 KB files

  30. ScaleFS scales 35x-60x on a RAM disk [ Single-core performance of ScaleFS is on par with Linux ext4. ]

  31. Machine-independent methodology • Use Commuter [Clements SOSP ’13] to observe conflict-freedom for commutative ops • Commuter: • Generates test cases for pairs of commutative ops • Reports observed cache-conflicts

  32. Conflict-freedom for commutative ops on Linux ext4: 65%

  33. Conflict-freedom for commutative ops on ScaleFS: 99.2%

  34. Conflict-freedom for commutative ops on ScaleFS: 99.2% • Why not 100% conflict-free? • ScaleFS trades off scalability for performance in a few cases • Probabilistic conflicts

  35. Evaluation summary • ScaleFS scales well on an 80 core machine • Commuter reports 99.2% conflict-freedom on ScaleFS • Workload/machine independent • Suggests scalability beyond our experimental setup and benchmarks

  36. Related Work • Scalability studies: FxMark [USENIX ’16], Linux Scalability [OSDI ’10] • Scaling file systems using sharding: Hare [Eurosys ’15], SpanFS [USENIX ’15] • ScaleFS uses similar techniques: • Operation logging: OpLog [CSAIL TR ’14] • Per-inode/per-core logs: NOVA [FAST ’16], iJournaling [USENIX ’17], Strata [SOSP ’17] • Decoupling in-memory and on-disk representations: Linux dcache, ReconFS [FAST ’14] • ScaleFS focus: Achieve scalability by avoiding cache-line conflicts

  37. Conclusion • ScaleFS, a novel file system design for multicore scalability • Two separate file systems: MemFS and DiskFS • Per-core operation logs • Ordering using Time Stamp Counters • ScaleFS scales 35x-60x on an 80-core machine • ScaleFS is conflict-free for 99.2% of test cases in Commuter • https://github.com/mit-pdos/scalefs
