350 likes | 364 Views
Explore how to balance performance and durability concerns in file systems, comparing local and distributed scenarios with a user-centric view and innovative implementations.
E N D
Improving File System Synchrony CS 614 Lecture – Fall 2007 – Tuesday October 16 By Jonathan Winter
Introduction Tues. Oct 16, 2007 – CS 614 – Improving File System Synchrony
Motivation • File system I/O is a major performance bottleneck. • On chip computation and caches access are very fast. • Even main memory accesses are comparatively quick. • Distributed computing further exacerbates the problem. • Durability and fault-tolerance are also key concerns. • Files systems depend on mechanical disks. • Common source of crashes, data loss, and system incoherence. • Distributed file systems further complicate reliability issues. • Typically performance and durability must be traded off. • Ease-of-use is also important. • Synchronous I/O semantics make programming easier. Tues. Oct 16, 2007 – CS 614 – Improving File System Synchrony
Outline • Overview of Local File System I/O Scenario • Overview of Distributed File System Scenario • A User-Centric View of Synchronous I/O • Similarities and Differences in Problem Domains • Details of the Speculator Infrastructure • Implementation of External Synchrony • Benchmark Descriptions • Performance Results • Conclusions Tues. Oct 16, 2007 – CS 614 – Improving File System Synchrony
Local File System I/O • Traditional File Systems Come in Two Flavors. • Synchronous provide durability guarantees by blocking. • OS crashes and power failures will not cause data loss. • File modifications are ordered providing determinism. • Blocking and sequential execution for ordering reduces performance. • Asynchronous files systems don’t block on modifications. • Commit can occur long after completion. • Users can view output later invalidated by a crash. • Synchronization can be enforced through explicit commands (fsync). • fsync does not protect against data loss on a typical desktop OS. • Performance is higher through buffering and group commit. • ext3 is a standard journaling Linux local file system. • Can be configured in async, sync, and durable modes. Tues. Oct 16, 2007 – CS 614 – Improving File System Synchrony
Distributed File Systems • Distributed file systems typically use synchronous I/O. • Provides straightforward abstraction of single namespace. • Enables cache coherence and durability. • Synchronous messages have long latencies over network. • Weaker consistency used (close-to-open) for speed. • Common systems include AFS and NFS. • Earlier research by authors created the Blue File System. • Provides single-copy semantics. • Distributed nature of network is transparent. Tues. Oct 16, 2007 – CS 614 – Improving File System Synchrony
User-Centric View of Synchrony • Synchronous I/O assumes an application-centric view. • Durability of file system guaranteed for application. • System views application as external entity. • Application state must be kept consistent. • Application must not see uncommitted results. • Application must block on distributed file I/O. • User-centric view considers application state as internal. • Observable output must be synchronous. • Kernel and applications are both internal state. • Internal implementation can run asynchronously. • Only external output to the screen or network must be synchronous. • Execution of internal components can be speculative. • Results of speculative execution cannot be observed by outside. Tues. Oct 16, 2007 – CS 614 – Improving File System Synchrony
Similarities Between Scenarios • Both local and distributed file system solutions require buffering output to user and external environment. • Asynchrony of implementation hidden from user. • Durability must be preserved in the presence of faults. • Speculative execution must not be seen until commit. • Both speculation and external synchrony require dependence tracking for uncommitted process and kernel state. • Tracking allows speculative execution rollback misspeculations. • Tracking determines which data should not yet be user visible. • Asynchronous implementation allows computation, IPC, I/O messages, network communication, and disk writes to overlap. • Major source of systems’ performance improvement. Tues. Oct 16, 2007 – CS 614 – Improving File System Synchrony
Differences Between Scenarios • Speculative execution in distributed file systems requires checkpointing and recovery on misspeculation. • External synchrony does not speculate, it just allows internal state to run ahead of the output to user. • Speculative execution must block in some situations. • Checkpointing challenges limit the kinds of supported IPC. • Shared memory was not implemented in distributed setting. • External synchrony conservatively assumes all readers in shared memory inherit dependencies. Tues. Oct 16, 2007 – CS 614 – Improving File System Synchrony
Details of the Speculator Infrastructure • Major bottlenecks in original NFS are blocking of processes and serialization of network traffic. • Speculation allows for concurrency of computation and I/O. Tues. Oct 16, 2007 – CS 614 – Improving File System Synchrony
Conditions for Success of Speculations • File systems chosen as first target for Speculator because: • Results of speculative operations are highly predictable. • Clients cache data and concurrent updates are rare. • Speculating that cached data is valid is successful most of the time. • Network I/O much slower than checkpointing. • Checkpoint is low overhead and a lot speculative work can be completed in the time that the cached data is verified. • Computers have spare resources available for speculation. • Processors are idle significant portions of the time. • Extra memory is available for checkpoints. • Spare resources are available for use to speed up I/O throughput. Tues. Oct 16, 2007 – CS 614 – Improving File System Synchrony
Speculation Interface • Speculator requires modifications to system calls to allow for speculative distributed I/O and to propagate dependencies. • Interface is designed to encapsulate implementation details. • Speculator provides: • create_speculation • commit_speculation • fail_speculation • Speculator doesn’t worry about details of speculative hypotheses. • Distributed file system is oblivious to checkpointing and recovery. • Partitioning of responsibilities allows for easy modification of internal implementation and expansion of support for IPC. Tues. Oct 16, 2007 – CS 614 – Improving File System Synchrony
Speculation Implementation • Checkpointing performed by executing a copy-on-write fork. • Must save the state of open file descriptors and copy signals. • Forked child only run if speculation fails, discarded otherwise. • If speculation fails, child fork is given identify of original process. • Two data structures added to kernel to track speculative state. • Speculation structure created by create_speculation to track the set of kernel objects that depend on the new speculation. • The undo log is an ordered list of speculative operations with information to allow speculative operations to be undone. • Multiple speculations can be started for the same process, with multiple speculation structures and checkpoints. • If a previous speculation was read only, checkpoints are shared. • New checkpoints are required every 500ms to cap recovery time. Tues. Oct 16, 2007 – CS 614 – Improving File System Synchrony
Ensuring Correct Speculative Execution • Two invariants must hold for correct execution. • Speculative state should not be visible to the user or external devices. • Output to screen, network, and other interfaces must be buffered. • Processes cannot view speculative state unless they are registered as dependent upon that state. • Non-speculative processes must block or become speculative when viewing speculative state. • Blocking can always be used to ensure correctness. • System calls that do not modify state or modify only private state can be performed speculatively unmodified. • Speculation flags set in file system superblocks and for read and write system calls to indicate dependency relationships. Tues. Oct 16, 2007 – CS 614 – Improving File System Synchrony
Multi-Process Speculation • To extend the amount of possible speculative work, a speculative process can perform inter-process communication. • Dependencies must propagate from a process P to an object X when P modifies X and P depends on speculation that X does not. • Typically propagations are bi-directional between objects. • A commit_speculation will deleted the associated speculation structure and removed related undo log entries. • Fail_speculation will atomically perform rollback. • The undo log, undo entries, and speculations are generic. • Undo log entries point to type-specific state and functions to implement type-specific rollback for different forms of IPC. Tues. Oct 16, 2007 – CS 614 – Improving File System Synchrony
Causal Dependency Propagation Tues. Oct 16, 2007 – CS 614 – Improving File System Synchrony
Forms of Supported IPC • Distributed file system objects. • Cached copies used speculatively, deleted and retrieved if stale. • Local memory file system – RAMFS was modified. • Modified ext3 to allow speculation for local disk file system. • Speculative data never written to disk. • Calling fdatasync blocks the process. • Processes can observe speculative metadata in ext3 superblocks, bitmaps, and group descriptors. Metadata can be written to disk. • ext3 journal modified to separate speculative and non-speculative data in compound transactions. • Pipes and fifos handled like local file systems. Tues. Oct 16, 2007 – CS 614 – Improving File System Synchrony
Forms of Supported IPC (continued) • Unix sockets propagate dependencies bi-directionally • Signals challenging because exitting process cannot restart. • Signaling processes are checkpointed and managed with queue. • During fork, child inherits all dependencies of parent. • Exiting processes not deallocated until all dependencies are resolved. • Other forms of IPC not supported: • System V IPC, futexes, and shared memory. • Processes block to ensure proper behavior. Tues. Oct 16, 2007 – CS 614 – Improving File System Synchrony
Using Speculation in the File System • For read operations, cached version of file is required. • Speculation assumes file has not been modified, • RPCs changed from synchronous to asynchronous. • Server with full knowledge managing mutating operations. • Server permits other processes to see speculatively changed files only if the cached version matches the server version. • Server must process messages in same ordering as clients see. • Server never stores speculative data. • Clients group commit multiple operations with one disk write. • NFS modified to support Speculator (keeps close-to-open). • Blue File System modified to show speculation can enabled strong consistency and safety as well as good performance. Tues. Oct 16, 2007 – CS 614 – Improving File System Synchrony
External Synchrony • Goal: Provide the reliability and ease-of-use of synchronous I/O and the performance of asynchronous. • Implementation called xsyncfs build on top of ext3. • File system transactions are completed in non-blocking manner but output is not allowed to be externalized. • All output is buffered in the OS to be released when all disk transactions depended on commit. • Processes with commit dependencies propagate output restrictions when interacting with other processes through IPC. • Xsyncfs uses output-triggered commits to balance throughput and latency. Tues. Oct 16, 2007 – CS 614 – Improving File System Synchrony
Example of External Synchrony Tues. Oct 16, 2007 – CS 614 – Improving File System Synchrony
External Synchrony Design Overview • Synchrony defined by externally observable behavior. • I/O is externally synchronous if output cannot be distinguished from output that could be produced from synchronous I/O. • Requires values of external outputs to be the same. • Outputs must occur in same causal order as defined by Lamport’s happens before relation. • Disk commits are considered external output. • File system does all the same processing as for synchronous. • Need not commit the modification to disk before returning. • Two optimizations made to improve performance. • Group committing is used (commits are atomic). • External output is buffered and processes continue execution. • Output guaranteed to be committed every 5 seconds. Tues. Oct 16, 2007 – CS 614 – Improving File System Synchrony
External Synchrony Implementation • Xsyncfs leverages Speculator infrastructure for output buffering and dependency tracking for uncommitted state. • Checkpointing and rollback features unneeded and are disabled. • Speculator tracks commit dependencies between processes and uncommitted file system transactions. • Processes interacting with the dependent process are marked as dependent on the same set of uncommitted transactions. • Many-to-many relationships between objects tracked in undo logs. • ext3 operates in journaled mode. • Multiple modifications are grouped into compound transactions. • Single transaction active at any time and committed atomically. • Likewise only one transaction can be committing at a time. Tues. Oct 16, 2007 – CS 614 – Improving File System Synchrony
External Synchrony Data Structures Tues. Oct 16, 2007 – CS 614 – Improving File System Synchrony
Some Additional Issues • External synchrony must be augmented to support explicit synchronization operations such as sync and fdatasync. • A commit dependency is created between the calling process and active transaction, creating a visible event causing a commit. • Xsyncfs does not require application modification. • Programmers can write the same code as for synchronous. • Explicit synchronization is not needed. • Programmers don’t need to added group commit to the code. • Hand-tuned code can provide benefits in when programmers have specialized information. • However, xsyncfs has global information about external output which can be used to optimize commit throughput. Tues. Oct 16, 2007 – CS 614 – Improving File System Synchrony
Evaluation Methodology • All experiments run on Pentium 4 processors. • RedHat Enterprise Linux release 3 (kernel 2.4.2.1) used. • Speculative execution evaluated for two scenarios. • First has no delay and second assumes 30ms round trip. • Packets routed through NISTnet network emulator. Tues. Oct 16, 2007 – CS 614 – Improving File System Synchrony
Durability Experiment • Desired to obtain confirmation that ext3 does not guarantee durability. • Test consists of continuously writing to local file system and sending UDP messages after each write completes. • Power is cut during experiment and the file system state and log are compared. • ext3 did not provide durability when mounted asynchronously or synchronously and even when fsync commands where issued after writes. • The problem is that modifications are only written to the hard drive cache and not the platter unless write barriers are employed. Tues. Oct 16, 2007 – CS 614 – Improving File System Synchrony
Workload Descriptions • PostMark Benchmark • Performs hundreds or thousands of transactions consisting of file reads, writes, creates, and deletes, and then removes all the files. • Replicates small file workloads of electronic mail, netnews, and web-based commerce. • Good test of file system throughput since there is little output or computation. • Apache Build Benchmark • Benchmark untars Apache 2.0.48 source tree, runs configure in an object directory, runs make, and then removes all the files. • File system must balance throughput and latency since there is a lot of screen output interleaved with disk I/O and computation. Tues. Oct 16, 2007 – CS 614 – Improving File System Synchrony
Workload Descriptions (continued) • MySQL Benchmark • Runs OSDL TPC-C benchmark with MySQL 5.0.16 and the InnoDB storage engine. • Used to see how xsyncfs performs when application performs its own group commit strategy. • Both MySQL and TPC-C client are multi-threaded so this measures xsyncfs’s support for shared memory as well. • SPECweb99 Benchmark • Provides a network intensive application with 50 clients, saturating the server. • High level of network traffic challenges xsyncfs because the messages externalize state. Tues. Oct 16, 2007 – CS 614 – Improving File System Synchrony
PostMark File System Benchmark Results Tues. Oct 16, 2007 – CS 614 – Improving File System Synchrony
Apache Build Benchmark Results Tues. Oct 16, 2007 – CS 614 – Improving File System Synchrony
MySQL and SPECweb99 Results Tues. Oct 16, 2007 – CS 614 – Improving File System Synchrony
Average Latency of HTTP Requests • xsyncfs adds less than 33ms of delay to a request, less than the 50 ms commonly cited perception threshold. • xsyncfs performance significantly better on large request sizes. Tues. Oct 16, 2007 – CS 614 – Improving File System Synchrony
Benefit of Output-Triggered Commits • The goal is to assess the speedup of this lazy approach to commits. • Output-triggered commits allows grouping but can cost latency. • Output-triggered commits perform better on all benchmarks except SPECweb99 where there is so much traffic that both policies have similar behavior. Tues. Oct 16, 2007 – CS 614 – Improving File System Synchrony
Conclusions • Speed need not be sacrificed for durability and ease-of-use. • Both papers succeed in developing a system that achieves performance near that of an asynchronous implementation with the fault-tolerance and simplicity of the synchronous abstraction. • Key insight is user-centric view abstraction. • Speculator infrastructure provides powerful functionality through dependency tracking and checkpointing/rollback. • Papers focus on using system to speed up local and distributed file systems but many other applications are possible. • Amazing order-of-magnitude speedups are achieved. • Simple ideas that surprisingly took until 2005 to be developed. • Why didn’t I think of this for my research? Tues. Oct 16, 2007 – CS 614 – Improving File System Synchrony