Fast and Reliable Stream Storage through Differential Data Journaling Andromachi Hatzieleftheriou MSc Thesis Supervisor: Stergios Anastasiadis
Thesis Motivation • We study the real-time storage of massive stream data • real-time or retrospective processing • e.g. monitoring applications • continuous data received from sensors in real-time • video and audio streams of high quality at high rates • environmental measurements at much lower rates • Traditional file and database systems are insufficient • excessive resource requirements in case of high-volume streaming traffic • need for system facilities for the storage of heterogeneous streams • different rate and content characteristics • General-purpose file systems use journaling to synchronously move data or metadata from memory to disk at sequential throughput • data journaling incurs high disk overhead
Thesis Motivation • Data journaling should be enabled with random writes but disabled with large sequential writes • Need to efficiently and reliably store multiple concurrent streams • individual stream appends perfectly sequential • aggregate workload random-access • unclear what is the most appropriate way to handle the incoming data • We examine the possibility of employing data journaling techniques • combine sequential throughput with low latency during synchronous writes • We introduce differential data journaling in order to minimize the cost of data journaling • only the actually modified bytes are logged, not the entire corresponding blocks
Outline • Related Work • Ext3 • Architectural Definition • Prototype Implementation • Performance Evaluation • Conclusions & Future Work
Fast and Reliable Storage • File system operations can be: • data operations that update user data • metadata operations that modify the structure of the file system • Several techniques have been proposed to achieve high performance during data and metadata updates • Operating systems are susceptible to hardware and power failures that damage their efficiency and reliability • special utility needed during reboot to recover the file system • the system remains offline while the disk is scanned and repaired
Synchronous Writes & Soft Updates • Synchronous writes • pending writes must complete before the next ones can be submitted • significant performance loss • Soft updates • ordering between metadata writes • list of metadata dependencies per disk block • after a crash • system mounted and used immediately • remaining inconsistencies corrected in the background
Log-Structured File Systems • Data and metadata updates • initially buffered in the cache • then written sequentially to a continuous stream • Main features • disk treated as a segmented append-only log • indexing information needed for efficient reads • costly seeks are avoided, maximizing disk write throughput • After a crash • the system reconstructs its state from the last consistent point in the log • Log space needs to be constantly reclaimed • garbage collection
Journaling File Systems • Metadata updates written to a circular append-only journal before committed to the main file system • batching opportunities • synchronous writes complete faster • sequential throughput • Logging of data modifications also supported • performance improvement for synchronous writes • significant journal throughput required • full blocks are logged even for small writes, instead of only the modified parts • After a crash • replay the last updates from the journal
Outline • Related Work • Ext3 • Architectural Definition • Prototype Implementation • Performance Evaluation • Conclusions & Future Work
General Features • Each high-level change to the file system is performed in two steps: • the modified blocks are copied into the journal • the modified blocks are sent to their final disk location • Journal features: • treated as a circular buffer • file within the same file system or separate disk partition
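The two-step update above can be sketched as a minimal write-ahead scheme. This is an illustrative Python model only (a list stands in for the journal, a dict for the disk), not ext3's actual code:

```python
# Minimal sketch of the two-step journaled update: the modified block is
# first copied into the journal, then written to its final disk location.
def journaled_write(journal, disk, block_no, data):
    journal.append((block_no, data))  # step 1: copy the block into the journal
    disk[block_no] = data             # step 2: write it to its final location

journal, disk = [], {}
journaled_write(journal, disk, 7, b"record")
```

If a crash occurs between the two steps, the copy in the journal allows the update to be replayed during recovery.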
Journaling Modes • Data Mode • both data and metadata logged • data blocks written twice • strong consistency semantics • Ordered Mode • only metadata logged • data writes forced to the fixed location right before metadata is written to the journal • strong consistency semantics • Writeback Mode • only metadata logged • data blocks written directly to final location • no ordering between the journal and fixed-location data writes • weak consistency guarantees • [Figure: per-mode timelines showing the synchronous write, the journal commit, and the checkpoint to the final location]
Journal Structure • Journal superblock • tracks summary information for the journal • Journal descriptor block • marks the beginning of a transaction • describes the subsequent journaled blocks • Journal data and metadata blocks • Journal commit block • written at the end of a transaction • marks that data and metadata are safe on disk
Kernel Buffers • Page cache • keeps page copies from recently accessed disk files in memory • Block buffer • in-memory buffer of each disk block • allocated in units called buffer pages • Buffer head descriptor • specifies all the handling information required by the kernel to locate the corresponding block on disk
Flushing Dirty Buffers to Disk • Goal: Dirty pages that accumulate in memory need to be written to disk • pdflush kernel threads • systematically scan the page cache for dirty pages to flush every writeback period • ensure that no page remains dirty for too long (more than the expiration period) • kjournald kernel thread • commits the current state of the file system every commit interval period of time • flushes the dirty buffers of the committed transactions to their final location • checkpoint process • fsync system call • forces all data and metadata dirty buffers of a specified file descriptor to disk
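The fsync call mentioned above is the mechanism that forces a file's dirty buffers out of the page cache; a minimal usage example:

```python
import os
import tempfile

# fsync blocks until the file descriptor's dirty data and metadata buffers
# have been transferred to the disk; until then the write may exist only in
# the in-memory page cache.
fd, path = tempfile.mkstemp()
os.write(fd, b"stream record")
os.fsync(fd)   # returns only after the buffers reach the storage device
os.close(fd)
```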
Commit Policy • Process of writing to the journal the dirty buffers modified by a transaction • Commit is initiated when: • the commit interval expires • write updates need to be synchronously written to disk • For each journal block buffer: • a buffer head specifies the respective block number in the journal • points to the original copy of the block buffer • a journal head points to the corresponding transaction
Commit Process • A journal descriptor block is allocated • contains tags that map block buffers to their final location • When it fills up • it is written to the journal • the corresponding block buffers follow • a journal commit block is synchronously written to the journal • Additional journal descriptor blocks can be allocated
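The commit layout can be sketched as follows. This is an illustrative model (tuples in a list stand in for journal blocks), not the on-disk ext3 format:

```python
# Sketch of the commit layout: a descriptor record carrying the tags (the
# blocks' final locations), followed by the journaled block buffers, closed
# by a commit record written last to mark the transaction complete.
def commit_transaction(journal, dirty_blocks):
    tags = [block_no for block_no, _ in dirty_blocks]
    journal.append(("descriptor", tags))   # maps buffers to final locations
    for _, data in dirty_blocks:
        journal.append(("block", data))    # the journaled block buffers
    journal.append(("commit",))            # durable only once this is written

journal = []
commit_transaction(journal, [(3, b"aaa"), (9, b"bbb")])
```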
Recovery Policy • Recovery process • automatically started after an unclean shutdown • scans the log for complete transactions that need to be replayed • Three phases needed: • PASS_SCAN scans the end of the journal • PASS_REVOKE is used to prevent older journal records from being replayed on top of newer data using the same block • PASS_REPLAY writes to their final disk location the newest versions of all the blocks that need to be replayed • The system can crash before the recovery finishes • the same journal can be reused
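The replay logic can be sketched in a few lines. This is an illustrative model with an assumed tuple-based journal format; the revoke pass is omitted for brevity:

```python
# Sketch of recovery replay: only transactions closed by a commit record are
# applied to their final location; an uncommitted tail is ignored.
def replay(journal, disk):
    tags, bufs = [], []
    for rec in journal:
        if rec[0] == "descriptor":
            tags, bufs = list(rec[1]), []
        elif rec[0] == "block":
            bufs.append(rec[1])
        elif rec[0] == "commit":
            for block_no, data in zip(tags, bufs):
                disk[block_no] = data  # newest version to the final location
            tags, bufs = [], []
    # anything left over belongs to an incomplete transaction: it is dropped

journal = [("descriptor", [3]), ("block", b"ok"), ("commit",),
           ("descriptor", [5]), ("block", b"lost")]  # crash before commit
disk = {}
replay(journal, disk)
```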
Outline • Related Work • Ext3 • Architectural Definition • Prototype Implementation • Performance Evaluation • Conclusions & Future Work
Design Goals • We investigate the performance characteristics of data journaling in the context of synchronous writes • Data journaling features: • Synchronous writes complete faster • take advantage of the sequential journal throughput • Significant amount of traffic sent to the journal • high journal device throughput • Traffic changes sublinearly as a function of the write rate • Substantial overhead even with small write requests • due to the full-block logging scheme • Proposal: a new journaling mode • accumulation of multiple write modifications in a single journal block
Design Goals • Partial Block • new journal block type • accumulates the modifications from multiple writes • Commit Policy • only the modified part of individual data blocks should be journaled • for fully modified data or metadata blocks, entire blocks can be logged • Recovery Policy • whole blocks read from the journal and written back to their final location • for partially modified blocks • the original disk block should be first read from the final location • then written back, updated with the difference retrieved from the journal
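The commit-side idea can be sketched as follows. The record format and the 4 KiB block size are illustrative assumptions, not the thesis's implementation:

```python
# Sketch of differential logging: a write smaller than the block size logs
# only (block_no, offset, bytes), while a fully modified block is logged
# whole. BLOCK_SIZE here is an assumed value.
BLOCK_SIZE = 4096

def log_write(journal, block_no, offset, new_bytes):
    if offset == 0 and len(new_bytes) == BLOCK_SIZE:
        journal.append(("full", block_no, new_bytes))         # whole block
    else:
        journal.append(("partial", block_no, offset, new_bytes))  # diff only

journal = []
log_write(journal, 12, 100, b"tiny update")  # logs 11 bytes, not 4096
log_write(journal, 2, 0, b"\x00" * BLOCK_SIZE)
```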
Outline • Related Work • Ext3 • Architectural Definition • Prototype Implementation • Performance Evaluation • Conclusions & Future Work
Partial Blocks • Used to gather the partial updates of data blocks • Two different types of journal blocks used: • partial blocks, which store multiple data writes smaller than the default block size • non-partial blocks, which correspond to metadata or fully written data buffers
Journal Heads & Tags • Journal heads • we use them to prepare the blocks that are actually sent to the journal • we added two new fields for partial modifications • offset and length of the partially modified block • Tags • allocated during commit per block buffer • contain the following fields: • final disk location of the modified block • four flags for journal-specific block properties • a flag indicating whether the corresponding block is partially modified or not (added) • length of the new bytes (added) • starting offset in the data block of the final disk location (added)
Commit Process • A descriptor and a partial data block allocated • Partially modified data blocks • modifications copied consecutively in the partial data block • Metadata or fully written data blocks • the corresponding fullblocks are logged • When the descriptor fills up • it is written to the journal • all the corresponding block buffers follow • a journal commit block is written to the journal
Recovery Policy • Data modifications retrieved from the journal and applied to the final blocks • From each retrieved journal descriptor block the included tags are extracted • describe partial or full write or metadata modifications • Non-partial modification • next block retrieved from the journal and written to the final location • Partial modification • next partial data block retrieved from the journal • the original disk block is read into a kernel buffer • the modification is copied from the journal block buffer to the proper final buffer • starting offset and length tag fields used • when the end of the current partial block is exceeded, the next one is retrieved from the journal
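The read-modify-write step for a partial modification can be sketched as follows (illustrative model: a dict stands in for the disk):

```python
# Sketch of replaying a partially modified block: the original block is read
# from its final location, the journaled bytes are spliced in at the tagged
# offset and length, and the updated block is written back.
def replay_partial(disk, block_no, offset, diff):
    block = bytearray(disk[block_no])        # read the original disk block
    block[offset:offset + len(diff)] = diff  # apply the journaled difference
    disk[block_no] = bytes(block)            # write the updated block back

disk = {5: b"\x00" * 16}
replay_partial(disk, 5, 4, b"abcd")
```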
Outline • Related Work • Ext3 • Architectural Definition • Prototype Implementation • Performance Evaluation • Conclusions & Future Work
Performance Measurements • We examine the disk throughput requirements and the average latency of each write under streaming workloads • We measure performance in an environment of temporary small files • investigate the benefit of data journaling in applications other than streaming • We examine the possible overheads of our implementation • recovery time • CPU load
Streaming Workloads • Massive numbers of streams synchronously written on the same disk facility • We examine the performance characteristics of streams with different rates, while varying the degree of concurrency • data rate: the amount of data that is stored per unit of time • At each execution: • a sequence of write updates synchronously applied to the system for a specified amount of time • according to the rate, different record sizes are used • low rates: small request sizes • high rates: large request sizes
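The rate-to-record-size mapping can be sketched as below. The thresholds and record sizes here are hypothetical assumptions for illustration, not the values used in the thesis:

```python
# Hypothetical sketch of the workload driver's sizing rule: each stream
# appends records synchronously at a fixed data rate, with larger records
# chosen for higher rates.
def record_size(rate_bytes_per_s):
    if rate_bytes_per_s < 64 * 1024:
        return 512             # low-rate stream: small requests
    if rate_bytes_per_s < 1024 * 1024:
        return 4096            # medium-rate stream
    return 64 * 1024           # high-rate stream: large requests

def writes_per_second(rate_bytes_per_s):
    # number of synchronous write requests needed to sustain the rate
    return rate_bytes_per_s / record_size(rate_bytes_per_s)
```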
Flushing Policy • Manual tuning of the dirty page flush timers according to the rate and the number of the streams • Low-rate streams • accumulation of multiple write updates in memory for a long period • batching opportunities • long expiration interval and frequent writeback period • avoid filling up the journal and the memory • High-rate streams • high volumes of data fill up the journal and the memory rather soon • default or slightly reduced expiration and writeback periods • according to the generated amount of data
Journal Device Throughput • Low-rate streams • low for the writeback and ordered modes • much higher for the default data journaling • comparable to metadata-only modes for the differential data journaling • High-rate streams • significantly higher for both data journaling modes
Final Location Throughput • Low-rate streams • low for both data journaling modes • several times higher for the metadata-only journaling modes • High-rate streams • comparable across all four modes
Write Response Time • Much higher for the metadata-only journaling modes compared to the data journaling modes • Data journaling benefits from the journal’s sequential throughput • fast and reliable storage opportunities
CPU Utilization • Expected CPU overhead for differential data journaling • memory copy of the modified parts to the appropriate journal partial block • For both high and low rate streams • CPU load less than 10% • mostly idle • Insignificant extra CPU cost for differential data journaling
Postmark Benchmark • Postmark used to study the performance of small writes • typical for electronic mail, newsgroups and web-based commerce • Significant improvement of the supported transaction rate for data journaling modes • low write latency, so more transactions served per second
Recovery Time • Scan phase • high latency for default data journaling • low latency for metadata-only modes • latency of differential data journaling comparable to metadata-only modes • Revoke phase • equal across all four modes • Replay phase • comparable latency for the two data journaling modes • despite the extra block reads of differential data journaling • much lower latency for metadata-only modes
Experimental Results • Streaming workloads • differential data journaling reduces substantially the journal traffic of data journaling • especially for low-rate streams • significant reduction of the write latency for the data journaling modes with respect to metadata-only journaling • Typical small-write workload • substantial improvement in the supported transaction rate
Outline • Related Work • Ext3 • Architectural Definition • Prototype Implementation • Performance Evaluation • Conclusions & Future Work
Conclusions • Emerging need for the real-time storage of massive stream data • fresh look at file systems that support data journaling • New journaling mode: differential data journaling • accumulation of multiple updates into a single journal block • fast and reliable storage at relatively low disk throughput requirements
Future Work • Many directions for future work, mainly regarding the performance evaluation of our implementation • Investigation of the automatic tuning of system parameters related to the timing of dirty page flushes • Direct comparison with log-structured or other journaling file systems in order to demonstrate the benefits of our architecture • Further examination of the performance of differential data journaling under heterogeneous workloads • Examination of the behavior of differential data journaling under database workloads • Experimentation in a real streaming environment
References • [1] P. J. Desnoyers and P. Shenoy, Hyperion: High Volume Stream Archival for Retrospective Querying, USENIX Annual Technical Conference, June 2007. • [2] V. Prabhakaran et al., Analysis and Evolution of Journaling File Systems, USENIX Annual Technical Conference, April 2005, pp. 105-120. • [3] S. Tweedie, Journaling the Linux Ext2fs Filesystem, Fourth Annual Linux Expo, Durham, North Carolina, May 1998. • [4] D. Carney et al., Monitoring Streams – A New Class of Data Management Applications, VLDB Conference, August 2002, pp. 215-226.
Journaling Objects • Log record • corresponds to a low-level operation that updates a disk block • represented as full blocks • Atomic operation handle • corresponds to a high-level operation • multiple low-level operations • during recovery • either the whole high-level operation is applied • or none of its low-level operations • Transaction • consists of multiple atomic operation handles
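The all-or-nothing recovery semantics of an atomic operation handle can be sketched as follows (illustrative model; the dict-based disk and the argument names are assumptions):

```python
# Sketch of atomic operation handle recovery: the handle's low-level block
# updates are applied only if the whole handle committed before the crash;
# otherwise none of them are.
def recover_handle(disk, committed, ops):
    if not committed:
        return                 # none of the low-level operations applied
    for block_no, data in ops:
        disk[block_no] = data  # all of them applied

disk = {}
recover_handle(disk, False, [(1, b"x")])  # crash before commit: no effect
recover_handle(disk, True, [(2, b"y")])   # committed: fully applied
```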
Stream Archival Servers • Design can be based on two possible architectures: • a relational database • not designed for rapid and continuous loading of individual data items • ill-equipped to handle numerous continuous queries over data streams • insufficient for real-time requirements • a conventional file system • mainly cares to maintain its integrity across crashes without compromising performance • should not compromise the playback performance • should exploit the particular I/O characteristics of individual streams • e.g. StreamFS used for the storage of high-volume streams
Checkpoint Policy • Limited amount of journal space that needs to be reclaimed • Process of ensuring that a section of the log is committed fully to disk, so that this portion of the log can be reused • Checkpoint occurs when: • there is not enough journal space left • free space is between 1/4 and 1/2 of the journal size • the journal is being flushed to disk
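The space-based trigger can be sketched as a simple predicate; the exact threshold below is an assumption for illustration, placed at the upper edge of the 1/4-to-1/2 window mentioned above:

```python
# Sketch of the checkpoint trigger: reclaim journal space once free space
# falls to half the journal size or less (assumed threshold; the real policy
# fires somewhere in the 1/4..1/2 window).
def needs_checkpoint(free_blocks, journal_blocks):
    return free_blocks <= journal_blocks // 2

# Example: a 1024-block journal with 200 free blocks must be checkpointed.
```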
Enabling/Disabling Disk Write Cache • Synchronous write operations return as soon as the data reaches the on-disk write cache rather than the storage media • Disabling the write cache scales down the performance of the different modes • Significant advantage of data journaling with respect to the ordered mode • small writes