Bigtable: A Distributed Storage System for Structured Data F. Chang, J. Dean, S. Ghemawat, W.C. Hsieh, D.A. Wallach, M. Burrows, T. Chandra, A. Fikes, R.E. Gruber Google, Inc. Tianyang HU
Outline • Introduction • Data Model • Building Blocks • Implementation • Refinements • Performance Evaluation • Future Work
Motivation • Scalability • worldwide applications & users • huge amounts of data & communication
Bigtable • Distributed storage system • petabytes of data, thousands of machines • simple data model with dynamic control over data layout & format • applicability, scalability, performance, availability • Used by more than 60 Google products and projects
Outline • Introduction • Data Model • Building Blocks • Implementation • Refinements • Performance Evaluation • Future Work
Data Model – Overview • Sparse, distributed, persistent, multidimensional sorted map • indexed by (row key, column key, timestamp); each value is an uninterpreted string
Data Model – Example • “Webtable” stores a copy of web pages & their related information. • row key: URL (with hostname reversed) • column key: attribute name • timestamp: time at which the page was fetched
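A minimal sketch of this map in plain Python (hypothetical values; the real store is distributed, not a dict): keys are (row key, column key, timestamp) triples, and the Webtable row key reverses the hostname so pages from the same domain sort next to each other.

```python
# Toy illustration of the Bigtable map; all values are made up.
webtable = {
    ("com.cnn.www", "contents:",        5): "<html>...",  # older fetch
    ("com.cnn.www", "contents:",        6): "<html>...",  # newer fetch
    ("com.cnn.www", "anchor:cnnsi.com", 5): "CNN",        # anchor text
}

def reverse_hostname(host: str) -> str:
    """'www.cnn.com' -> 'com.cnn.www', so one domain's pages are contiguous."""
    return ".".join(reversed(host.split(".")))

assert reverse_hostname("www.cnn.com") == "com.cnn.www"
```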
Data Model – Rows • Row key: arbitrary string (usually 10-100 bytes, max 64KB) • Every read or write of data under a single row key is atomic
Data Model – Rows • Sorted by row key in lexicographic order • Tablet: a contiguous range of rows • the unit of distribution & load balancing • good locality for data access: scans of short row ranges are efficient
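A toy sketch (hypothetical boundaries, plain Python) of how the sorted row space is split into tablets and how a row key is routed to its tablet by binary search over the range end keys:

```python
import bisect

# Hypothetical tablet boundaries: tablet i serves rows in
# [end_keys[i-1], end_keys[i]), compared lexicographically.
end_keys = ["com.cnn", "com.google", "org.wikipedia", "\xff"]

def tablet_for_row(row_key: str) -> int:
    """Index of the tablet whose row range contains row_key."""
    return bisect.bisect_right(end_keys, row_key)

assert tablet_for_row("com.cnn.www") == 1  # falls in ["com.cnn", "com.google")
```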
Data Model – Columns • Column families: groups of column keys (data in a family is usually of the same type) • the unit of access control • Column key: family:qualifier
Data Model – Timestamps • Timestamp: indexes multiple versions of the same data • not necessarily the “real time” (may be assigned by the client) • per-family settings for data clean-up & garbage collection
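A hedged sketch of the per-column-family garbage-collection settings the paper describes (keep only the last n versions, or only versions newer than some timestamp); the function and parameter names here are illustrative, not Bigtable's API:

```python
def gc_versions(versions, keep_last_n=3, min_timestamp=None):
    """versions: list of (timestamp, value). Returns the versions to retain,
    newest first: at most keep_last_n, optionally none older than min_timestamp."""
    newest_first = sorted(versions, key=lambda tv: tv[0], reverse=True)
    kept = newest_first[:keep_last_n]
    if min_timestamp is not None:
        kept = [tv for tv in kept if tv[0] >= min_timestamp]
    return kept

print(gc_versions([(5, "a"), (6, "b"), (7, "c"), (8, "d")]))
# -> [(8, 'd'), (7, 'c'), (6, 'b')]
```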
Outline • Introduction • Data Model • Building Blocks • Implementation • Refinements • Performance Evaluation • Future Work
Building Blocks • SSTable file format • persistent, ordered, immutable key-value (string-string) pairs • used internally to store Bigtable data
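A minimal, hypothetical stand-in for the SSTable idea: an immutable sequence of sorted key-value blocks plus a sparse index of first keys, so a point lookup scans at most one block (the paper's real blocks default to 64KB and live on disk, with the index held in memory):

```python
import bisect

class ToySSTable:
    """Immutable sorted key-value pairs, split into blocks with an index."""
    BLOCK = 2  # tiny block size for illustration

    def __init__(self, pairs):
        items = sorted(pairs.items())
        self.blocks = [items[i:i + self.BLOCK]
                       for i in range(0, len(items), self.BLOCK)]
        self.index = [blk[0][0] for blk in self.blocks]  # first key per block

    def get(self, key):
        i = bisect.bisect_right(self.index, key) - 1
        if i < 0:
            return None
        for k, v in self.blocks[i]:  # scan a single block only
            if k == key:
                return v
        return None

sst = ToySSTable({"b": "2", "a": "1", "c": "3"})
assert sst.get("c") == "3" and sst.get("z") is None
```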
Building Blocks • GFS • stores log & data files • scalability, reliability, performance, fault tolerance • Chubby • a highly-available and persistent distributed lock service • tracks live tablet servers; stores the root tablet location; ensures a single active master
Outline • Introduction • Data Model • Building Blocks • Implementation • Refinements • Performance Evaluation • Future Work
Bigtable Components • A library that is linked into every client • Many tablet servers • handle read & write requests from clients for the tablets they serve • One tablet master • assigns tablets to tablet servers • detects the addition & expiration of tablet servers • balances tablet-server load
Tablet Location • Three-level hierarchy • root tablet (only one; its location is stored in a Chubby file; stores addresses of all METADATA tablets) • METADATA tablets (store addresses of user tablets) • user tablets
Tablet Location • Client caches (multiple) tablet locations • if the cache is stale, query again
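A hedged sketch of the client-side lookup; every name and structure below is a stand-in for illustration, not the real client library:

```python
# Stand-in location data for the three levels.
CHUBBY_FILE = "root-tablet @ server-A"                   # level 1
ROOT_TABLET = {"METADATA-0": "server-B"}                 # level 2
METADATA    = {("webtable", "com.cnn.www"): "server-C"}  # level 3

location_cache = {}

def locate_tablet(table, row_key):
    """Return the tablet server for (table, row_key), caching the answer.
    On a miss (or after a stale entry is evicted) walk the hierarchy again."""
    key = (table, row_key)
    if key not in location_cache:
        _root = CHUBBY_FILE                  # 1. root location from Chubby
        _meta = ROOT_TABLET["METADATA-0"]    # 2. root tablet -> METADATA tablet
        location_cache[key] = METADATA[key]  # 3. METADATA tablet -> user tablet
    return location_cache[key]

assert locate_tablet("webtable", "com.cnn.www") == "server-C"
```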
Tablet Assignment • The tablet master uses Chubby to keep track of • live tablet servers • each live tablet server holds an exclusive lock on a uniquely named Chubby file • tablet assignment status • compares tablets registered in the METADATA table with the tablets actually served by tablet servers
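Chubby's API is not shown in the deck; as a loose, hypothetical analogy only, a tablet server's exclusive lock on its uniquely named file can be pictured as an advisory file lock (Unix-only sketch):

```python
import fcntl, os

def acquire_server_lock(lock_dir: str, server_id: str) -> int:
    """Analogy only: hold an exclusive, non-blocking lock on a per-server
    file, as a tablet server holds its Chubby file lock. Raises if the
    lock is already held; closing the descriptor releases it."""
    fd = os.open(os.path.join(lock_dir, server_id), os.O_CREAT | os.O_RDWR)
    fcntl.flock(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
    return fd
```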
Tablet Assignment • Case 1: some tablets are unassigned • master assigns them to tablet servers with sufficient room • Case 2: a tablet server stops serving • master detects this and reassigns its tablets to other servers • Case 3: too many small tablets • master initiates a merge • Case 4: a tablet grows too large • the serving tablet server initiates a split and notifies the master
Tablet Serving • A tablet's persistent state is stored as a sequence of SSTables in GFS • Tablet mutations are logged in a commit log • the commit log stores redo records • recently committed updates are kept in memory in a memtable • older updates are stored in SSTables in GFS
Tablet Serving • Recovering a tablet • 1. The tablet server fetches the tablet's metadata from the METADATA table, which contains the list of SSTables that comprise the tablet, plus redo points. • 2. The server reads the indices of the SSTables into memory. • 3. The server applies all mutations logged after the redo point.
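A toy version of step 3 (replaying the log past the redo point); the structures are illustrative stand-ins, not Bigtable's:

```python
# (sequence number, key, value) entries in a stand-in commit log.
commit_log = [(1, "c", "v1"), (2, "a", "v2"), (3, "d", "v3")]
redo_point = 1  # mutations with seq <= 1 are already in flushed SSTables

def recover_memtable(log, redo_point):
    """Rebuild the memtable from mutations logged after the redo point."""
    memtable = {}
    for seq, key, value in log:
        if seq > redo_point:
            memtable[key] = value
    return memtable

assert recover_memtable(commit_log, redo_point) == {"a": "v2", "d": "v3"}
```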
Tablet Serving • Write operation on a tablet • 1. The tablet server checks the validity of the operation. • 2. The mutation is written to the commit log. • 3. The mutation is committed. • 4. After commit, the contents of the write are inserted into the memtable.
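The same steps as an illustrative sketch (stand-in structures, not the real API); note the mutation reaches the memtable only after it is safely in the log:

```python
commit_log, memtable, _seq = [], {}, 0

def write(key, value):
    """Toy write path: validate, log, commit, then apply to the memtable."""
    global _seq
    if not isinstance(key, str) or not key:  # 1. well-formedness check
        raise ValueError("invalid row key")
    _seq += 1
    commit_log.append((_seq, key, value))    # 2-3. log & commit the mutation
    memtable[key] = value                    # 4. insert into the memtable

write("com.cnn.www", "<html>...")
```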
Tablet Serving • Read operation on a tablet • 1. The tablet server checks the validity of the operation. • 2. Execute the operation on a merged view of memtable & SSTables.
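A sketch of the merged view (stand-ins again): the memtable holds the newest data and shadows successively older SSTables:

```python
def read(key, memtable, sstables):
    """Toy read path over memtable + SSTables, ordered newest to oldest."""
    if key in memtable:
        return memtable[key]
    for sst in sstables:
        if key in sst:
            return sst[key]
    return None

assert read("a", {"a": "new"}, [{"a": "old"}, {"b": "x"}]) == "new"
```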
Compactions • Memtable grows as write operations execute • Two types of compactions • minor compaction • merging (major) compaction
Compactions • Minor compaction (when memtable size reaches a threshold) • 1. Freeze the memtable • 2. Create a new memtable • 3. Convert the frozen memtable to an SSTable and write it to GFS
Compactions • Merging compaction (periodically) • 1. Freeze the memtable • 2. Create a new memtable • 3. Merge a few SSTables & the frozen memtable into a new SSTable
Compactions • Major compaction • special case of merging compaction • merges all SSTables & the memtable into a single SSTable
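All three compactions in one toy sketch, reusing the dict stand-ins from the serving sketches (SSTable lists ordered newest to oldest); this is illustrative, not the real implementation:

```python
def minor_compaction(memtable, sstables):
    """Freeze the memtable, flush it as a new SSTable, start a fresh one."""
    sstables.insert(0, dict(memtable))  # frozen memtable -> newest SSTable
    return {}, sstables                 # fresh memtable for incoming writes

def merging_compaction(memtable, sstables, k=2):
    """Merge the memtable and the k newest SSTables into one SSTable."""
    merged = {}
    for src in reversed(sstables[:k]):  # apply oldest first...
        merged.update(src)
    merged.update(memtable)             # ...so newer values win
    return {}, [merged] + sstables[k:]

def major_compaction(memtable, sstables):
    """Merge everything into exactly one SSTable (deleted data can be dropped)."""
    return merging_compaction(memtable, sstables, k=len(sstables))
```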
Compactions • Why freeze & create a new memtable? • Incoming read and write operations can continue during compactions. • Advantages of compaction: • shrinks the memory usage of the tablet server • reduces the amount of data that must be read from the commit log during recovery if this tablet server dies
Outline • Introduction • Data Model • Building Blocks • Implementation • Refinements • Performance Evaluation • Future Work
Refinements • Locality groups • group column families that are typically accessed together • families in different locality groups are typically not accessed together • for each tablet, store each locality group in a separate SSTable • more efficient reads & writes
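As a small illustration (the groups below are hypothetical), a per-tablet layout where each locality group gets its own SSTable, so a scan over small metadata columns never reads page contents:

```python
# Hypothetical Webtable locality groups: one SSTable per group per tablet.
locality_groups = {
    "metadata": ["language:", "checksum:"],  # small, frequently scanned
    "content":  ["contents:"],               # large, rarely co-accessed
}

def group_for_column(column_key: str) -> str:
    """Route a column key (family:qualifier) to its locality group."""
    family = column_key.split(":", 1)[0] + ":"
    for group, families in locality_groups.items():
        if family in families:
            return group
    return "default"

assert group_for_column("language:en") == "metadata"
```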
Refinements • Compression • similar data in the same column, neighbouring rows, multiple versions • customized compression at the SSTable block level (the smallest unit) • two-pass compression scheme • 1. Bentley and McIlroy's scheme: compresses long common strings across a large window • 2. a fast compression algorithm: looks for repetitions in a small window • experimental compression ratio: 10% of original size (Gzip: 25-33%)
Refinements • Caching • two-level cache on the tablet server • Scan cache (high level): caches key-value pairs • case: reading the same data repeatedly • Block cache (low level): caches SSTable blocks • case: sequential reads
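A compact sketch of the two levels with a minimal LRU (capacities and key shapes are illustrative assumptions):

```python
from collections import OrderedDict

class LRU:
    """Minimal LRU cache used for both levels of the sketch."""
    def __init__(self, capacity):
        self.capacity, self.data = capacity, OrderedDict()
    def get(self, key):
        if key not in self.data:
            return None
        self.data.move_to_end(key)         # mark as most recently used
        return self.data[key]
    def put(self, key, value):
        self.data[key] = value
        self.data.move_to_end(key)
        if len(self.data) > self.capacity:
            self.data.popitem(last=False)  # evict least recently used

scan_cache  = LRU(10_000)  # high level: (row, column) -> value
block_cache = LRU(1_000)   # low level: (sstable, block index) -> raw block
```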
Refinements • Commit-log implementation • one commit log per tablet would incur a large # of disk seeks • instead, use a single commit log for all tablets on a tablet server • performs well during normal operation • complicates recovery • solution: first sort the commit log entries by (table, row name, log sequence number), so each tablet's mutations are contiguous
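The recovery-time sort keys entries by (table, row name, log sequence number); a one-line sketch over stand-in entries:

```python
# Stand-in shared-log entries: (table, row name, sequence number, value).
log = [
    ("webtable", "com.cnn.www", 3, "v3"),
    ("users",    "alice",       1, "v1"),
    ("webtable", "com.cnn.www", 2, "v2"),
]
log.sort(key=lambda e: (e[0], e[1], e[2]))  # tablets' mutations now contiguous
```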
Outline • Introduction • Data Model • Building Blocks • Implementation • Refinements • Performance Evaluation • Future Work
Performance Evaluation • [Figures: read/write rates per tablet server; aggregate read/write rates]
Outline • Introduction • Data Model • Building Blocks • Implementation • Refinements • Performance Evaluation • Future Work
Future Work • Resource sharing across different applications? • Hybrid with relational databases? • complex queries • security