Fast and Reliable Stream Storage through Differential Data Journaling Andromachi Hatzieleftheriou MSc Thesis Supervisor: Stergios Anastasiadis
Thesis Motivation • We study the real-time storage of massive stream data • real-time or retrospective processing • e.g. monitoring applications • continuous data received from sensors in real-time • video and audio streams of high quality at high rates • environmental measurements at much lower rates • Traditional file and database systems are insufficient • excessive resource requirements in case of high-volume streaming traffic • need for system facilities for the storage of heterogeneous streams • different rate and content characteristics • General-purpose file systems use journaling to synchronously move data or metadata from memory to disk at sequential throughput • data journaling incurs high disk overhead
Thesis Motivation • Data journaling should be enabled with random writes but disabled with large sequential writes • Need to efficiently and reliably store multiple concurrent streams • individual stream appends perfectly sequential • aggregate workload random-access • unclear what is the most appropriate way to handle the incoming data • We examine the possibility of employing data journaling techniques • combine sequential throughput with low latency during synchronous writes • We introduce differential data journaling in order to minimize the cost of data journaling • only the actually modified bytes are logged, not the entire corresponding blocks
Outline • Related Work • Ext3 • Architectural Definition • Prototype Implementation • Performance Evaluation • Conclusions & Future Work
Fast and Reliable Storage • File system operations can be: • data operations that update user data • metadata operations that modify the structure of the file system • Several techniques have been proposed to achieve high performance during data and metadata updates • Operating systems are susceptible to hardware and power failures that damage their efficiency and reliability • special utility needed during reboot to recover the file system • the system remains offline while the disk is scanned and repaired
Synchronous Writes & Soft Updates • Synchronous writes • pending writes must complete before the next ones can be submitted • significant performance loss • Soft updates • ordering between metadata writes • list of metadata dependencies per disk block • after a crash • system mounted and used immediately • remaining inconsistencies corrected in the background
Log-Structured File Systems • Data and metadata updates • initially buffered in the cache • then written sequentially to a continuous stream • Main features • disk treated as a segmented append-only log • indexing information needed for efficient reads • costly seeks are avoided, maximizing disk write throughput • After a crash • the system reconstructs its state from the last consistent point in the log • Log space needs to be constantly reclaimed • garbage collection
Journaling File Systems • Metadata updates written to a circular append-only journal before committed to the main file system • batching opportunities • synchronous writes complete faster • sequential throughput • Logging of data modifications also supported • performance improvement for synchronous writes • significant journal throughput required • full blocks are logged even for small writes, instead of only the modified parts • After a crash • replay the last updates from the journal
Outline • Related Work • Ext3 • Architectural Definition • Prototype Implementation • Performance Evaluation • Conclusions & Future Work
General Features • Each high-level change to the file system is performed in two steps: • the modified blocks are copied into the journal • the modified blocks are sent to their final disk location • Journal features: • treated as a circular buffer • file within the same file system or separate disk partition
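The two-step update above can be sketched as a minimal write-ahead scheme. This is an illustrative Python model only (a list stands in for the journal, a dict for the disk), not ext3's actual code:

```python
# Minimal sketch of the two-step journaled update: the modified block is
# first copied into the journal, then written to its final disk location.
def journaled_write(journal, disk, block_no, data):
    journal.append((block_no, data))  # step 1: copy the block into the journal
    disk[block_no] = data             # step 2: write it to its final location

journal, disk = [], {}
journaled_write(journal, disk, 7, b"record")
```

If a crash occurs between the two steps, the copy in the journal allows the update to be replayed during recovery.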
Journaling Modes • Data Mode • both data and metadata logged • data blocks written twice • strong consistency semantics • Ordered Mode • only metadata logged • data writes forced to the fixed location right before metadata is written to the journal • strong consistency semantics • Writeback Mode • only metadata logged • data blocks written directly to final location • no ordering between the journal and fixed-location data writes • weak consistency guarantees • [Figure: per-mode timelines showing the synchronous write, the journal commit, and the checkpoint to the final location]
Journal Structure • Journal superblock • tracks summary information for the journal • Journal descriptor block • marks the beginning of a transaction • describes the subsequent journaled blocks • Journal data and metadata blocks • Journal commit block • written at the end of a transaction • marks that data and metadata are safe on disk
Kernel Buffers • Page cache • keeps page copies from recently accessed disk files in memory • Block buffer • in-memory buffer of each disk block • allocated in units called buffer pages • Buffer head descriptor • specifies all the handling information required by the kernel to locate the corresponding block on disk
Flushing Dirty Buffers to Disk • Goal: Dirty pages that accumulate in memory need to be written to disk • pdflush kernel threads • systematically scan the page cache for dirty pages to flush every writeback period • ensure that no page remains dirty for too long (more than the expiration period) • kjournald kernel thread • commits the current state of the file system every commit interval period of time • flushes the dirty buffers of the committed transactions to their final location • checkpoint process • fsync system call • forces all data and metadata dirty buffers of a specified file descriptor to disk
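The fsync call mentioned above is the mechanism that forces a file's dirty buffers out of the page cache; a minimal usage example:

```python
import os
import tempfile

# fsync blocks until the file descriptor's dirty data and metadata buffers
# have been transferred to the disk; until then the write may exist only in
# the in-memory page cache.
fd, path = tempfile.mkstemp()
os.write(fd, b"stream record")
os.fsync(fd)   # returns only after the buffers reach the storage device
os.close(fd)
```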
Commit Policy • Process of writing to the journal the dirty buffers modified by a transaction • Commit is initiated when: • the commit interval expires • write updates need to be synchronously written to disk • For each journal block buffer: • a buffer head specifies the respective block number in the journal • points to the original copy of the block buffer • a journal head points to the corresponding transaction
Commit Process • A journal descriptor block is allocated • contains tags that map block buffers to their final location • When it fills up • it is written to the journal • the corresponding block buffers follow • a journal commit block is synchronously written to the journal • Additional journal descriptor blocks can be allocated
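The commit layout can be sketched as follows. This is an illustrative model (tuples in a list stand in for journal blocks), not the on-disk ext3 format:

```python
# Sketch of the commit layout: a descriptor record carrying the tags (the
# blocks' final locations), followed by the journaled block buffers, closed
# by a commit record written last to mark the transaction complete.
def commit_transaction(journal, dirty_blocks):
    tags = [block_no for block_no, _ in dirty_blocks]
    journal.append(("descriptor", tags))   # maps buffers to final locations
    for _, data in dirty_blocks:
        journal.append(("block", data))    # the journaled block buffers
    journal.append(("commit",))            # durable only once this is written

journal = []
commit_transaction(journal, [(3, b"aaa"), (9, b"bbb")])
```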
Recovery Policy • Recovery process • automatically started after an unclean shutdown • scans the log for complete transactions that need to be replayed • Three phases needed: • PASS_SCAN scans the end of the journal • PASS_REVOKE is used to prevent older journal records from being replayed on top of newer data using the same block • PASS_REPLAY writes to their final disk location the newest versions of all the blocks that need to be replayed • The system can crash before the recovery finishes • the same journal can be reused
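The replay logic can be sketched in a few lines. This is an illustrative model with an assumed tuple-based journal format; the revoke pass is omitted for brevity:

```python
# Sketch of recovery replay: only transactions closed by a commit record are
# applied to their final location; an uncommitted tail is ignored.
def replay(journal, disk):
    tags, bufs = [], []
    for rec in journal:
        if rec[0] == "descriptor":
            tags, bufs = list(rec[1]), []
        elif rec[0] == "block":
            bufs.append(rec[1])
        elif rec[0] == "commit":
            for block_no, data in zip(tags, bufs):
                disk[block_no] = data  # newest version to the final location
            tags, bufs = [], []
    # anything left over belongs to an incomplete transaction: it is dropped

journal = [("descriptor", [3]), ("block", b"ok"), ("commit",),
           ("descriptor", [5]), ("block", b"lost")]  # crash before commit
disk = {}
replay(journal, disk)
```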
Outline • Related Work • Ext3 • Architectural Definition • Prototype Implementation • Performance Evaluation • Conclusions & Future Work
Design Goals • We investigate the performance characteristics of data journaling in the context of synchronous writes • Data journaling features: • Synchronous writes complete faster • take advantage of the sequential journal throughput • Significant amount of traffic sent to the journal • high journal device throughput • Traffic changes sublinearly as a function of the write rate • Substantial overhead even with small write requests • due to the full-block logging scheme • Proposal: a new journaling mode • accumulation of multiple write modifications in a single journal block
Design Goals • Partial Block • new journal block type • accumulates the modifications from multiple writes • Commit Policy • only the modified part of individual data blocks should be journaled • for fully modified data or metadata blocks, entire blocks can be logged • Recovery Policy • whole blocks read from the journal and written back to their final location • for partially modified blocks • the original disk block should be first read from the final location • then written back, updated with the difference retrieved from the journal
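The commit-side idea can be sketched as follows. The record format and the 4 KiB block size are illustrative assumptions, not the thesis's implementation:

```python
# Sketch of differential logging: a write smaller than the block size logs
# only (block_no, offset, bytes), while a fully modified block is logged
# whole. BLOCK_SIZE here is an assumed value.
BLOCK_SIZE = 4096

def log_write(journal, block_no, offset, new_bytes):
    if offset == 0 and len(new_bytes) == BLOCK_SIZE:
        journal.append(("full", block_no, new_bytes))         # whole block
    else:
        journal.append(("partial", block_no, offset, new_bytes))  # diff only

journal = []
log_write(journal, 12, 100, b"tiny update")  # logs 11 bytes, not 4096
log_write(journal, 2, 0, b"\x00" * BLOCK_SIZE)
```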
Outline • Related Work • Ext3 • Architectural Definition • Prototype Implementation • Performance Evaluation • Conclusions & Future Work
Partial Blocks • Used to gather the partial updates of data blocks • Two different types of journal blocks used: • partial blocks, which store multiple data writes smaller than the default block size • non-partial blocks, which correspond to metadata or fully written data buffers
Journal Heads & Tags • Journal heads • we use them to prepare the blocks that are actually sent to the journal • we added two new fields for partial modifications • offset and length of the partially modified block • Tags • allocated during commit per block buffer • contain the following fields: • final disk location of the modified block • four flags for journal-specific block properties • a flag indicating whether the corresponding block is partially modified or not (added) • length of the new bytes (added) • starting offset in the data block of the final disk location (added)
Commit Process • A descriptor and a partial data block allocated • Partially modified data blocks • modifications copied consecutively in the partial data block • Metadata or fully written data blocks • the corresponding fullblocks are logged • When the descriptor fills up • it is written to the journal • all the corresponding block buffers follow • a journal commit block is written to the journal
Recovery Policy • Data modifications retrieved from the journal and applied to the final blocks • From each retrieved journal descriptor block the included tags are extracted • describe partial or full write or metadata modifications • Non-partial modification • next block retrieved from the journal and written to the final location • Partial modification • next partial data block retrieved from the journal • the original disk block is read into a kernel buffer • the modification is copied from the journal block buffer to the proper final buffer • starting offset and length tag fields used • when the end of the current partial block is exceeded, the next one is retrieved from the journal
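The read-modify-write step for a partial modification can be sketched as follows (illustrative model: a dict stands in for the disk):

```python
# Sketch of replaying a partially modified block: the original block is read
# from its final location, the journaled bytes are spliced in at the tagged
# offset and length, and the updated block is written back.
def replay_partial(disk, block_no, offset, diff):
    block = bytearray(disk[block_no])        # read the original disk block
    block[offset:offset + len(diff)] = diff  # apply the journaled difference
    disk[block_no] = bytes(block)            # write the updated block back

disk = {5: b"\x00" * 16}
replay_partial(disk, 5, 4, b"abcd")
```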
Outline • Related Work • Ext3 • Architectural Definition • Prototype Implementation • Performance Evaluation • Conclusions & Future Work
Performance Measurements • We examine the disk throughput requirements and the average latency of each write under streaming workloads • We measure performance in an environment of temporary small files • investigate the benefit of data journaling in applications other than streaming • We examine the possible overheads of our implementation • recovery time • CPU load
Streaming Workloads • Massive numbers of streams synchronously written on the same disk facility • We examine the performance characteristics of streams with different rates, while varying the degree of concurrency • data rate: the amount of data that is stored per unit of time • At each execution: • a sequence of write updates synchronously applied to the system for a specified amount of time • according to the rate, different record sizes are used • low rates: small request sizes • high rates: large request sizes
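The rate-to-record-size mapping can be sketched as below. The thresholds and record sizes here are hypothetical assumptions for illustration, not the values used in the thesis:

```python
# Hypothetical sketch of the workload driver's sizing rule: each stream
# appends records synchronously at a fixed data rate, with larger records
# chosen for higher rates.
def record_size(rate_bytes_per_s):
    if rate_bytes_per_s < 64 * 1024:
        return 512             # low-rate stream: small requests
    if rate_bytes_per_s < 1024 * 1024:
        return 4096            # medium-rate stream
    return 64 * 1024           # high-rate stream: large requests

def writes_per_second(rate_bytes_per_s):
    # number of synchronous write requests needed to sustain the rate
    return rate_bytes_per_s / record_size(rate_bytes_per_s)
```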
Flushing Policy • Manual tuning of the dirty page flush timers according to the rate and the number of the streams • Low-rate streams • accumulation of multiple write updates in memory for a long period • batching opportunities • long expiration interval and frequent writeback period • avoid filling up the journal and the memory • High-rate streams • high volumes of data fill up the journal and the memory rather soon • default or slightly reduced expiration and writeback periods • according to the generated amount of data
Journal Device Throughput • Low-rate streams • low for the writeback and ordered modes • much higher for the default data journaling • comparable to metadata-only modes for the differential data journaling • High-rate streams • significantly higher for both data journaling modes
Final Location Throughput • Low-rate streams • low for both data journaling modes • several times higher for the metadata-only journaling modes • High-rate streams • comparable across all four modes
Write Response Time • Much higher for the metadata-only journaling modes compared to the data journaling modes • Data journaling benefits from the journal’s sequential throughput • fast and reliable storage opportunities
CPU Utilization • Expected CPU overhead for differential data journaling • memory copy of the modified parts to the appropriate journal partial block • For both high and low rate streams • CPU load less than 10% • mostly idle • Insignificant extra CPU cost for differential data journaling
Postmark Benchmark • Postmark used to study the performance of small writes • typical for electronic mail, newsgroups and web-based commerce • Significant improvement of the supported transaction rate for data journaling modes • low write latency, so more transactions served per second
Recovery Time • Scan phase • high latency for default data journaling • low latency for metadata-only modes • latency of differential data journaling comparable to metadata-only modes • Revoke phase • equal across all four modes • Replay phase • comparable latency for the two data journaling modes • despite the extra block reads of differential data journaling • much lower latency for metadata-only modes
Experimental Results • Streaming workloads • differential data journaling reduces substantially the journal traffic of data journaling • especially for low-rate streams • significant reduction of the write latency for the data journaling modes with respect to metadata-only journaling • Typical small-write workload • substantial improvement in the supported transaction rate
Outline • Related Work • Ext3 • Architectural Definition • Prototype Implementation • Performance Evaluation • Conclusions & Future Work
Conclusions • Emerging need for the real-time storage of massive stream data • fresh look at file systems that support data journaling • New journaling mode: differential data journaling • accumulation of multiple updates into a single journal block • fast and reliable storage at relatively low disk throughput requirements
Future Work • Many directions for future work, mainly regarding the performance evaluation of our implementation • Investigation of the automatic tuning of system parameters related to the timing of dirty page flushes • Direct comparison with log-structured or other journaling file systems in order to demonstrate the benefits of our architecture • Further examination of the performance of differential data journaling under heterogeneous workloads • Examination of the behavior of differential data journaling under database workloads • Experimentation in a real streaming environment
References • [1] P. J. Desnoyers and P. Shenoy, Hyperion: High Volume Stream Archival for Retrospective Querying, USENIX Annual Technical Conference, June 2007. • [2] V. Prabhakaran et al., Analysis and Evolution of Journaling File Systems, USENIX Annual Technical Conference, April 2005, pp. 105-120. • [3] S. Tweedie, Journaling the Linux Ext2fs Filesystem, Fourth Annual Linux Expo, Durham, North Carolina, May 1998. • [4] D. Carney et al., Monitoring Streams – A New Class of Data Management Applications, VLDB Conference, August 2002, pp. 215-226.
Journaling Objects • Log record • corresponds to a low-level operation that updates a disk block • represented as full blocks • Atomic operation handle • corresponds to a high-level operation • multiple low-level operations • during recovery • either the whole high-level operation is applied • or none of its low-level operations • Transaction • consists of multiple atomic operation handles
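The all-or-nothing recovery semantics of an atomic operation handle can be sketched as follows (illustrative model; the dict-based disk and the argument names are assumptions):

```python
# Sketch of atomic operation handle recovery: the handle's low-level block
# updates are applied only if the whole handle committed before the crash;
# otherwise none of them are.
def recover_handle(disk, committed, ops):
    if not committed:
        return                 # none of the low-level operations applied
    for block_no, data in ops:
        disk[block_no] = data  # all of them applied

disk = {}
recover_handle(disk, False, [(1, b"x")])  # crash before commit: no effect
recover_handle(disk, True, [(2, b"y")])   # committed: fully applied
```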
Stream Archival Servers • Design can be based on two possible architectures: • a relational database • not designed for rapid and continuous loading of individual data items • ill-equipped to handle numerous continuous queries over data streams • insufficient for real-time requirements • a conventional file system • mainly cares to maintain its integrity across crashes without compromising performance • should not compromise the playback performance • should exploit the particular I/O characteristics of individual streams • e.g. StreamFS used for the storage of high-volume streams
Checkpoint Policy • Limited amount of journal space that needs to be reclaimed • Process of ensuring that a section of the log is committed fully to disk, so that this portion of the log can be reused • Checkpoint occurs when: • there is not enough journal space left • free space is between 1/4 and 1/2 of the journal size • the journal is being flushed to disk
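The space-based trigger can be sketched as a simple predicate; the exact threshold below is an assumption for illustration, placed at the upper edge of the 1/4-to-1/2 window mentioned above:

```python
# Sketch of the checkpoint trigger: reclaim journal space once free space
# falls to half the journal size or less (assumed threshold; the real policy
# fires somewhere in the 1/4..1/2 window).
def needs_checkpoint(free_blocks, journal_blocks):
    return free_blocks <= journal_blocks // 2

# Example: a 1024-block journal with 200 free blocks must be checkpointed.
```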
Enabling/Disabling Disk Write Cache • Synchronous write operations return as soon as the data reaches the on-disk write cache rather than the storage media • Disabling the write cache scales down the performance of the different modes • Significant advantage of data journaling with respect to the ordered mode • small writes