Bigtable: A Distributed Storage System for Structured Data F. Chang, J. Dean, S. Ghemawat, W.C. Hsieh, D.A. Wallach, M. Burrows, T. Chandra, A. Fikes, R.E. Gruber Google, Inc. Tianyang HU
Outline • Introduction • Data Model • Building Blocks • Implementation • Refinements • Performance Evaluation • Future Work
Motivation • Scalability • worldwide applications & users • huge amounts of data & communication
Bigtable • Distributed storage system • petabytes of data, thousands of machines • simple data model with dynamic control over data layout & format • applicability, scalability, performance, availability • Used by more than 60 Google products and projects
Outline • Introduction • Data Model • Building Blocks • Implementation • Refinements • Performance Evaluation • Future Work
Data Model – Overview • Sparse, distributed, persistent, multidimensional sorted map • indexed by (row key, column key, timestamp); each value is an uninterpreted string
Data Model – Example • “Webtable” stores a copy of web pages & their related information. • row key: URL (with hostname reversed) • column key: attribute name • timestamp: time at which the page was fetched
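A minimal sketch of this map in plain Python (hypothetical values; the real store is distributed, not a dict): keys are (row key, column key, timestamp) triples, and the Webtable row key reverses the hostname so pages from the same domain sort next to each other.

```python
# Toy illustration of the Bigtable map; all values are made up.
webtable = {
    ("com.cnn.www", "contents:",        5): "<html>...",  # older fetch
    ("com.cnn.www", "contents:",        6): "<html>...",  # newer fetch
    ("com.cnn.www", "anchor:cnnsi.com", 5): "CNN",        # anchor text
}

def reverse_hostname(host: str) -> str:
    """'www.cnn.com' -> 'com.cnn.www', so one domain's pages are contiguous."""
    return ".".join(reversed(host.split(".")))

assert reverse_hostname("www.cnn.com") == "com.cnn.www"
```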
Data Model – Rows • Row key: arbitrary string (usually 10-100 bytes, max 64KB) • Every read or write of data under a single row key is atomic
Data Model – Rows • Sorted by row key in lexicographic order • Tablet: a contiguous range of rows • the unit of distribution & load balancing • good locality for data access: scans of short row ranges are efficient
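A toy sketch (hypothetical boundaries, plain Python) of how the sorted row space is split into tablets and how a row key is routed to its tablet by binary search over the range end keys:

```python
import bisect

# Hypothetical tablet boundaries: tablet i serves rows in
# [end_keys[i-1], end_keys[i]), compared lexicographically.
end_keys = ["com.cnn", "com.google", "org.wikipedia", "\xff"]

def tablet_for_row(row_key: str) -> int:
    """Index of the tablet whose row range contains row_key."""
    return bisect.bisect_right(end_keys, row_key)

assert tablet_for_row("com.cnn.www") == 1  # falls in ["com.cnn", "com.google")
```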
Data Model – Columns • Column families: groups of column keys (data in a family is usually of the same type) • the unit of access control • Column key: family:qualifier
Data Model – Timestamps • Timestamp: indexes multiple versions of the same data • not necessarily the “real time” (may be assigned by the client) • per-family settings for data clean-up & garbage collection
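A hedged sketch of the per-column-family garbage-collection settings the paper describes (keep only the last n versions, or only versions newer than some timestamp); the function and parameter names here are illustrative, not Bigtable's API:

```python
def gc_versions(versions, keep_last_n=3, min_timestamp=None):
    """versions: list of (timestamp, value). Returns the versions to retain,
    newest first: at most keep_last_n, optionally none older than min_timestamp."""
    newest_first = sorted(versions, key=lambda tv: tv[0], reverse=True)
    kept = newest_first[:keep_last_n]
    if min_timestamp is not None:
        kept = [tv for tv in kept if tv[0] >= min_timestamp]
    return kept

print(gc_versions([(5, "a"), (6, "b"), (7, "c"), (8, "d")]))
# -> [(8, 'd'), (7, 'c'), (6, 'b')]
```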
Outline • Introduction • Data Model • Building Blocks • Implementation • Refinements • Performance Evaluation • Future Work
Building Blocks • SSTable file format • persistent, ordered, immutable key-value (string-string) pairs • used internally to store Bigtable data
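A minimal, hypothetical stand-in for the SSTable idea: an immutable sequence of sorted key-value blocks plus a sparse index of first keys, so a point lookup scans at most one block (the paper's real blocks default to 64KB and live on disk, with the index held in memory):

```python
import bisect

class ToySSTable:
    """Immutable sorted key-value pairs, split into blocks with an index."""
    BLOCK = 2  # tiny block size for illustration

    def __init__(self, pairs):
        items = sorted(pairs.items())
        self.blocks = [items[i:i + self.BLOCK]
                       for i in range(0, len(items), self.BLOCK)]
        self.index = [blk[0][0] for blk in self.blocks]  # first key per block

    def get(self, key):
        i = bisect.bisect_right(self.index, key) - 1
        if i < 0:
            return None
        for k, v in self.blocks[i]:  # scan a single block only
            if k == key:
                return v
        return None

sst = ToySSTable({"b": "2", "a": "1", "c": "3"})
assert sst.get("c") == "3" and sst.get("z") is None
```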
Building Blocks • GFS • stores log & data files • scalability, reliability, performance, fault tolerance • Chubby • a highly-available and persistent distributed lock service • tracks live tablet servers; stores the root tablet location; ensures a single active master
Outline • Introduction • Data Model • Building Blocks • Implementation • Refinements • Performance Evaluation • Future Work
Bigtable Components • A library that is linked into every client • Many tablet servers • handle read & write requests from clients for the tablets they serve • One tablet master • assigns tablets to tablet servers • detects the addition & expiration of tablet servers • balances tablet-server load
Tablet Location • Three-level hierarchy • root tablet (only one; its location is stored in a Chubby file; stores addresses of all METADATA tablets) • METADATA tablets (store addresses of user tablets) • user tablets
Tablet Location • Client caches (multiple) tablet locations • if the cache is stale, query again
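A hedged sketch of the client-side lookup; every name and structure below is a stand-in for illustration, not the real client library:

```python
# Stand-in location data for the three levels.
CHUBBY_FILE = "root-tablet @ server-A"                   # level 1
ROOT_TABLET = {"METADATA-0": "server-B"}                 # level 2
METADATA    = {("webtable", "com.cnn.www"): "server-C"}  # level 3

location_cache = {}

def locate_tablet(table, row_key):
    """Return the tablet server for (table, row_key), caching the answer.
    On a miss (or after a stale entry is evicted) walk the hierarchy again."""
    key = (table, row_key)
    if key not in location_cache:
        _root = CHUBBY_FILE                  # 1. root location from Chubby
        _meta = ROOT_TABLET["METADATA-0"]    # 2. root tablet -> METADATA tablet
        location_cache[key] = METADATA[key]  # 3. METADATA tablet -> user tablet
    return location_cache[key]

assert locate_tablet("webtable", "com.cnn.www") == "server-C"
```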
Tablet Assignment • The tablet master uses Chubby to keep track of • live tablet servers • each live tablet server holds an exclusive lock on a uniquely named Chubby file • tablet assignment status • compares tablets registered in the METADATA table with the tablets actually served by tablet servers
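Chubby's API is not shown in the deck; as a loose, hypothetical analogy only, a tablet server's exclusive lock on its uniquely named file can be pictured as an advisory file lock (Unix-only sketch):

```python
import fcntl, os

def acquire_server_lock(lock_dir: str, server_id: str) -> int:
    """Analogy only: hold an exclusive, non-blocking lock on a per-server
    file, as a tablet server holds its Chubby file lock. Raises if the
    lock is already held; closing the descriptor releases it."""
    fd = os.open(os.path.join(lock_dir, server_id), os.O_CREAT | os.O_RDWR)
    fcntl.flock(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
    return fd
```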
Tablet Assignment • Case 1: some tablets are unassigned • master assigns them to tablet servers with sufficient room • Case 2: a tablet server stops serving • master detects this and reassigns its tablets to other servers • Case 3: too many small tablets • master initiates a merge • Case 4: a tablet grows too large • the serving tablet server initiates a split and notifies the master
Tablet Serving • A tablet's persistent state is stored as a sequence of SSTables in GFS • Tablet mutations are logged in a commit log • the commit log stores redo records • recently committed updates are kept in memory in a memtable • older updates are stored in SSTables in GFS
Tablet Serving • Recovering a tablet • 1. The tablet server fetches the tablet's metadata from the METADATA table, which contains the list of SSTables that comprise the tablet, plus redo points. • 2. The server reads the indices of the SSTables into memory. • 3. The server applies all mutations logged after the redo point.
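A toy version of step 3 (replaying the log past the redo point); the structures are illustrative stand-ins, not Bigtable's:

```python
# (sequence number, key, value) entries in a stand-in commit log.
commit_log = [(1, "c", "v1"), (2, "a", "v2"), (3, "d", "v3")]
redo_point = 1  # mutations with seq <= 1 are already in flushed SSTables

def recover_memtable(log, redo_point):
    """Rebuild the memtable from mutations logged after the redo point."""
    memtable = {}
    for seq, key, value in log:
        if seq > redo_point:
            memtable[key] = value
    return memtable

assert recover_memtable(commit_log, redo_point) == {"a": "v2", "d": "v3"}
```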
Tablet Serving • Write operation on a tablet • 1. The tablet server checks the validity of the operation. • 2. The mutation is written to the commit log. • 3. The mutation is committed. • 4. After commit, the contents of the write are inserted into the memtable.
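The same steps as an illustrative sketch (stand-in structures, not the real API); note the mutation reaches the memtable only after it is safely in the log:

```python
commit_log, memtable, _seq = [], {}, 0

def write(key, value):
    """Toy write path: validate, log, commit, then apply to the memtable."""
    global _seq
    if not isinstance(key, str) or not key:  # 1. well-formedness check
        raise ValueError("invalid row key")
    _seq += 1
    commit_log.append((_seq, key, value))    # 2-3. log & commit the mutation
    memtable[key] = value                    # 4. insert into the memtable

write("com.cnn.www", "<html>...")
```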
Tablet Serving • Read operation on a tablet • 1. The tablet server checks the validity of the operation. • 2. Execute the operation on a merged view of memtable & SSTables.
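A sketch of the merged view (stand-ins again): the memtable holds the newest data and shadows successively older SSTables:

```python
def read(key, memtable, sstables):
    """Toy read path over memtable + SSTables, ordered newest to oldest."""
    if key in memtable:
        return memtable[key]
    for sst in sstables:
        if key in sst:
            return sst[key]
    return None

assert read("a", {"a": "new"}, [{"a": "old"}, {"b": "x"}]) == "new"
```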
Compactions • Memtable grows as write operations execute • Two types of compactions • minor compaction • merging (major) compaction
Compactions • Minor compaction (when memtable size reaches a threshold) • 1. Freeze the memtable • 2. Create a new memtable • 3. Convert the frozen memtable to an SSTable and write it to GFS
Compactions • Merging compaction (periodically) • 1. Freeze the memtable • 2. Create a new memtable • 3. Merge a few SSTables & the frozen memtable into a new SSTable
Compactions • Major compaction • special case of merging compaction • merges all SSTables & the memtable into a single SSTable
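All three compactions in one toy sketch, reusing the dict stand-ins from the serving sketches (SSTable lists ordered newest to oldest); this is illustrative, not the real implementation:

```python
def minor_compaction(memtable, sstables):
    """Freeze the memtable, flush it as a new SSTable, start a fresh one."""
    sstables.insert(0, dict(memtable))  # frozen memtable -> newest SSTable
    return {}, sstables                 # fresh memtable for incoming writes

def merging_compaction(memtable, sstables, k=2):
    """Merge the memtable and the k newest SSTables into one SSTable."""
    merged = {}
    for src in reversed(sstables[:k]):  # apply oldest first...
        merged.update(src)
    merged.update(memtable)             # ...so newer values win
    return {}, [merged] + sstables[k:]

def major_compaction(memtable, sstables):
    """Merge everything into exactly one SSTable (deleted data can be dropped)."""
    return merging_compaction(memtable, sstables, k=len(sstables))
```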
Compactions • Why freeze & create a new memtable? • Incoming read and write operations can continue during compactions. • Advantages of compaction: • shrinks the memory usage of the tablet server • reduces the amount of data that must be read from the commit log during recovery if this tablet server dies
Outline • Introduction • Data Model • Building Blocks • Implementation • Refinements • Performance Evaluation • Future Work
Refinements • Locality groups • group column families that are typically accessed together • families in different locality groups are typically not accessed together • for each tablet, store each locality group in a separate SSTable • more efficient reads & writes
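As a small illustration (the groups below are hypothetical), a per-tablet layout where each locality group gets its own SSTable, so a scan over small metadata columns never reads page contents:

```python
# Hypothetical Webtable locality groups: one SSTable per group per tablet.
locality_groups = {
    "metadata": ["language:", "checksum:"],  # small, frequently scanned
    "content":  ["contents:"],               # large, rarely co-accessed
}

def group_for_column(column_key: str) -> str:
    """Route a column key (family:qualifier) to its locality group."""
    family = column_key.split(":", 1)[0] + ":"
    for group, families in locality_groups.items():
        if family in families:
            return group
    return "default"

assert group_for_column("language:en") == "metadata"
```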
Refinements • Compression • similar data in the same column, neighbouring rows, multiple versions • customized compression at the SSTable block level (the smallest unit) • two-pass compression scheme • 1. Bentley and McIlroy's scheme: compresses long common strings across a large window • 2. a fast compression algorithm: looks for repetitions in a small window • experimental compression ratio: 10% of original size (Gzip: 25-33%)
Refinements • Caching • two-level cache on the tablet server • Scan cache (high level): caches key-value pairs • case: reading the same data repeatedly • Block cache (low level): caches SSTable blocks • case: sequential reads
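A compact sketch of the two levels with a minimal LRU (capacities and key shapes are illustrative assumptions):

```python
from collections import OrderedDict

class LRU:
    """Minimal LRU cache used for both levels of the sketch."""
    def __init__(self, capacity):
        self.capacity, self.data = capacity, OrderedDict()
    def get(self, key):
        if key not in self.data:
            return None
        self.data.move_to_end(key)         # mark as most recently used
        return self.data[key]
    def put(self, key, value):
        self.data[key] = value
        self.data.move_to_end(key)
        if len(self.data) > self.capacity:
            self.data.popitem(last=False)  # evict least recently used

scan_cache  = LRU(10_000)  # high level: (row, column) -> value
block_cache = LRU(1_000)   # low level: (sstable, block index) -> raw block
```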
Refinements • Commit-log implementation • one commit log per tablet would incur a large # of disk seeks • instead, use a single commit log for all tablets on a tablet server • performs well during normal operation • complicates recovery • solution: first sort the commit log entries by (table, row name, log sequence number), so each tablet's mutations are contiguous
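The recovery-time sort keys entries by (table, row name, log sequence number); a one-line sketch over stand-in entries:

```python
# Stand-in shared-log entries: (table, row name, sequence number, value).
log = [
    ("webtable", "com.cnn.www", 3, "v3"),
    ("users",    "alice",       1, "v1"),
    ("webtable", "com.cnn.www", 2, "v2"),
]
log.sort(key=lambda e: (e[0], e[1], e[2]))  # tablets' mutations now contiguous
```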
Outline • Introduction • Data Model • Building Blocks • Implementation • Refinements • Performance Evaluation • Future Work
Performance Evaluation • [Figures: read/write rates per tablet server; aggregate read/write rates]
Outline • Introduction • Data Model • Building Blocks • Implementation • Refinements • Performance Evaluation • Future Work
Future Work • Resource sharing across different applications? • Hybrid with relational databases? • complex queries • security