Google Bigtable: A Distributed Storage System for Structured Data. Hadi Salimi, Distributed Systems Laboratory, School of Computer Engineering, Iran University of Science and Technology, hsalimi@iust.ac.ir
Introduction • Bigtable is a distributed storage system for managing structured data. • Scales to petabytes of data across thousands of machines. • In production use at Google since 2005; used by more than 60 Google products and projects.
Data Model • (row, column, time) => string • Row keys, column keys, and values are arbitrary strings. • Every read or write of data under a single row key is atomic (regardless of the number of different columns being read or written in the row). • Columns are added dynamically. • Timestamps identify different versions of a cell's data. • Assigned by Bigtable or by the client application. • Older versions are garbage-collected. • Example: the Webtable (web pages and the anchors that reference them), sketched below.
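The data model is easy to picture as a nested map. Below is a minimal, illustrative Python sketch (not Google's implementation; all names are made up) of a (row, column, timestamp) => value map with multiple timestamped versions per cell, following the Webtable example.

```python
# Minimal sketch of Bigtable's data model: a sparse map of
# (row key, column key, timestamp) -> value. Illustrative only.
import time

table = {}  # row key -> {column key -> {timestamp -> value}}

def put(row, column, value, ts=None):
    """Store one cell version; newer timestamps shadow older ones."""
    ts = ts if ts is not None else int(time.time() * 1e6)  # microseconds
    table.setdefault(row, {}).setdefault(column, {})[ts] = value

def get(row, column):
    """Return the most recent version of a cell, or None."""
    versions = table.get(row, {}).get(column, {})
    return versions[max(versions)] if versions else None

# Webtable-style example: page contents (two versions) and an anchor.
put("com.cnn.www", "contents:", "<html>...v1...</html>", ts=1)
put("com.cnn.www", "contents:", "<html>...v2...</html>", ts=2)
put("com.cnn.www", "anchor:cnnsi.com", "CNN")
print(get("com.cnn.www", "contents:"))   # newest version: "...v2..."
```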
Tablets • Rows are sorted lexicographically by row key. • Ranges of consecutive row keys are grouped together into "tablets". • This gives clients data locality: related rows end up on the same server. • Example: the rows com.google.maps/index.html and com.google.maps/foo.html are likely to be in the same tablet.
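As a rough illustration of how sorted row keys fall into contiguous tablets, here is a tiny Python sketch. The split size of two rows per tablet is arbitrary; real tablets are split by size (roughly 100-200 MB), not by row count.

```python
# Sorted row keys grouped into contiguous "tablets" (toy split of 2 rows each).
rows = sorted([
    "com.google.maps/index.html",
    "com.google.maps/foo.html",
    "com.cnn.www/index.html",
    "com.example/index.html",
])

tablets = [rows[i:i + 2] for i in range(0, len(rows), 2)]
for t in tablets:
    print("tablet covers", t[0], "...", t[-1])
# Because domains are reversed in the row key, the two maps.google.com
# pages sort next to each other and land in the same tablet.
```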
Column Families • Column keys are grouped into sets called "column families". • A column key is named using the syntax family:qualifier. • Access control and disk/memory accounting are performed at the column-family level. • Example: "anchor:cnnsi.com"
API • Data design • Creating and deleting tables and column families • Changing cluster, table, and column-family metadata, such as access control rights • Client interactions • Write/delete values • Read values • Scan row ranges • Single-row transactions (e.g., an atomic read-modify-write sequence on data under one row key) • MapReduce integration • Bigtable can serve both as an input source and as an output target for MapReduce jobs.
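To make the client-interaction bullets concrete, here is a toy in-memory table in Python. This is not Bigtable's actual client API (the real client library is C++); the class and method names are purely illustrative.

```python
# Toy model of the client operations listed above: write/delete, read,
# and a row-range scan over lexicographically sorted rows.
class ToyTable:
    def __init__(self):
        self.rows = {}              # row key -> {column key -> value}

    def write(self, row, column, value):
        self.rows.setdefault(row, {})[column] = value

    def delete(self, row, column):
        self.rows.get(row, {}).pop(column, None)

    def read(self, row):
        return self.rows.get(row, {})

    def scan(self, start_row, end_row):
        """Yield (row, columns) for rows in [start_row, end_row)."""
        for key in sorted(self.rows):
            if start_row <= key < end_row:
                yield key, self.rows[key]

t = ToyTable()
t.write("com.cnn.www", "anchor:www.c-span.org", "CNN")
t.write("com.cnn.www", "anchor:www.abc.com", "old anchor")
t.delete("com.cnn.www", "anchor:www.abc.com")
for row, cols in t.scan("com.a", "com.z"):
    print(row, cols)
```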
Building Blocks • SSTable file: the on-disk storage structure. • A persistent, ordered, immutable map from keys to values. • Ordered: enables efficient lookups and range scans. • Immutable: reads need no concurrency control, but deleted data must eventually be garbage-collected. • Stored in the Google File System (GFS), which replicates the data for redundancy; an SSTable can optionally be mapped into memory. • Chubby: a highly available distributed lock service. • Stores the location of the root tablet, schema information, and access control lists. • Used to elect a single master and to discover and track live tablet servers.
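A minimal sketch of the SSTable idea: an immutable, lexicographically sorted key/value map with binary-search lookups. Real SSTables are block-based files in GFS with a block index; this Python toy only mirrors the "sorted + immutable" properties.

```python
# Immutable, sorted key -> value map with binary-search lookups.
from bisect import bisect_left

class SSTable:
    def __init__(self, items):
        self._items = sorted(items)               # frozen at construction
        self._keys = [k for k, _ in self._items]

    def get(self, key):
        i = bisect_left(self._keys, key)
        if i < len(self._keys) and self._keys[i] == key:
            return self._items[i][1]
        return None

sst = SSTable([("com.cnn.www", "<html>...</html>"),
               ("com.example", "<html>hi</html>")])
print(sst.get("com.cnn.www"))
```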
Implementation Three components: • Client library • Master server (exactly one) • Assigns tablets to tablet servers. • Detects the addition and expiration of tablet servers. • Balances tablet-server load. • Garbage-collects files in GFS. • Handles schema changes such as table and column-family creations. • Tablet servers (many; added and removed dynamically) • Each handles read and write requests for the tablets it has loaded. • Splits tablets that have grown too large; each tablet is typically 100-200 MB.
Tablet Location • How does a client know which tablet server to send a request to? • A three-level hierarchy: • A file in Chubby holds the location of the root tablet. • The root tablet holds the locations of all METADATA tablets. • METADATA tablets hold the locations of the user tablets. • A METADATA row key encodes [tablet's table ID] + [end row]; the value is the tablet's location. • The client library caches tablet locations.
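The three-level lookup can be sketched as follows. The dictionaries stand in for the Chubby file, the root tablet, and the METADATA tablets; all server names are invented, and the client cache is simplified to cache per (table, row) rather than per tablet range.

```python
# Toy three-level location lookup: Chubby -> root tablet -> METADATA -> user tablet.
chubby_root_location = "tabletserver-1"           # from the Chubby file

root_tablet = {                                    # METADATA-tablet ranges
    ("webtable", "m"): "tabletserver-2",
}
metadata_tablets = {                               # user-tablet ranges
    "tabletserver-2": {("webtable", "com.google"): "tabletserver-7"},
}
location_cache = {}                                # client-side cache

def locate(table, row):
    """Return the tablet server holding (table, row), caching the result."""
    if (table, row) in location_cache:
        return location_cache[(table, row)]
    # Level 2: root tablet maps to the right METADATA tablet.
    meta_server = next(srv for (t, end), srv in root_tablet.items()
                       if t == table and row <= end)
    # Level 3: METADATA tablet maps to the user tablet's server.
    user_server = next(srv for (t, end), srv in metadata_tablets[meta_server].items()
                       if t == table and row <= end)
    location_cache[(table, row)] = user_server
    return user_server

print(locate("webtable", "com.cnn.www"))           # -> tabletserver-7
```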
Tablet Assignment • The master keeps track of tablet assignments and of live tablet servers. • Chubby: • Each tablet server creates and holds an exclusive lock on a unique file. • A tablet server stops serving if it loses its lock. • The master periodically asks each server about its lock; if the server is unreachable or has lost its lock, the master tries to acquire the lock itself, deletes the file, and reassigns the server's tablets. • A master failure does not change existing tablet assignments. • Master restart: the new master acquires the master lock in Chubby, scans the servers directory, asks each live tablet server which tablets it holds, and scans the METADATA table to find unassigned tablets.
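A toy sketch of the Chubby-lock liveness protocol described above: a tablet server counts as alive only while it holds the exclusive lock on its unique file, and if the master can steal that lock the server is treated as dead. All file and server names are illustrative.

```python
# Toy Chubby-style exclusive locks used for liveness detection.
chubby_locks = {}          # file name -> current holder (None if free)

def acquire(file, holder):
    if chubby_locks.get(file) is None:
        chubby_locks[file] = holder
        return True
    return False

def release(file, holder):
    if chubby_locks.get(file) == holder:
        chubby_locks[file] = None

# Tablet server registers itself by locking its unique file.
acquire("/bigtable/servers/ts-3", "ts-3")

# Later the server crashes; its Chubby session expires and the lock frees.
release("/bigtable/servers/ts-3", "ts-3")

# Master probes: if it can grab the lock itself, the server is dead.
if acquire("/bigtable/servers/ts-3", "master"):
    print("ts-3 is dead; master reassigns its tablets")
```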
Tablet Serving Write path: • Check that the request is well-formed. • Check authorization against a Chubby file. • Append the mutation to the "tablet log" (a commit log used for redo after a failure). • Apply the mutation to the memtable (in RAM). • Separately, a "compaction" moves memtable data into an SSTable and truncates the tablet log. Read path: • Check that the request is well-formed. • Check authorization against a Chubby file. • Merge the memtable and the SSTables to find the data. • Return the data.
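The write and read paths can be summarized in a few lines of Python. Plain dicts and lists stand in for the tablet log, memtable, and SSTables; this is a conceptual sketch, not the real serving code.

```python
# Writes append to the commit log and go into the memtable; reads merge
# the memtable with the SSTables, with the memtable taking precedence.
commit_log = []                        # "tablet log" for redo after a crash
memtable = {}                          # recent writes, held in RAM
sstables = [{"com.cnn.www": "v1"}]     # older data, already flushed

def write(row, value):
    commit_log.append(("set", row, value))   # 1. durably log the mutation
    memtable[row] = value                    # 2. apply it to the memtable

def read(row):
    if row in memtable:                      # newest data wins
        return memtable[row]
    for sst in reversed(sstables):           # then newer SSTables first
        if row in sst:
            return sst[row]
    return None

write("com.cnn.www", "v2")
print(read("com.cnn.www"))                   # -> "v2" (memtable shadows SSTable)
```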
Compaction To bound the sizes of the memtable, the tablet log, and the SSTable files, Bigtable runs "compactions": • Minor compaction: freeze the memtable, write it out as a new SSTable, and truncate the tablet log. • Merging compaction: merge a few SSTables and the memtable into a single new SSTable. • Major compaction: a merging compaction that rewrites everything into exactly one SSTable and removes deleted data.
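A small Python sketch of the compaction types, with dicts standing in for SSTables and a tombstone marker modeling deletions (only the major compaction actually drops it). It is only meant to show how data flows between the memtable and SSTables.

```python
# Minor compaction flushes the memtable to a new SSTable; major
# compaction merges everything into one SSTable and drops deletions.
TOMBSTONE = object()

memtable = {"rowA": "newA", "rowB": TOMBSTONE}
sstables = [{"rowA": "oldA"}, {"rowB": "oldB", "rowC": "C"}]
commit_log = ["...pending redo records..."]

def minor_compaction():
    """Freeze the memtable into a new SSTable and truncate the log."""
    global memtable, commit_log
    sstables.append(dict(memtable))
    memtable, commit_log = {}, []

def major_compaction():
    """Merge all SSTables into a single one and drop deleted cells."""
    global sstables
    merged = {}
    for sst in sstables:                 # older tables first, newer overwrite
        merged.update(sst)
    sstables = [{k: v for k, v in merged.items() if v is not TOMBSTONE}]

minor_compaction()
major_compaction()
print(sstables)    # -> [{'rowA': 'newA', 'rowC': 'C'}]  (rowB deleted)
```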
Refinements • Locality groups • The client can group multiple column families into a locality group; each locality group is stored as a separate SSTable, which makes reads that touch only one group more efficient. • Compression • The client can choose to compress SSTables at the locality-group level. • Two levels of caching in tablet servers: • Scan cache (key/value pairs returned by the SSTable interface) • Block cache (SSTable blocks read from GFS) • Bloom filters • Allow a cheap check of whether an SSTable might contain data for a given row/column pair, avoiding unnecessary disk reads. • Commit log implementation • Each tablet server keeps a single commit log (not one per tablet).
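As an example of the Bloom-filter refinement, here is a minimal self-contained filter in Python; the bit-array size and hash count are arbitrary, and the keys are invented.

```python
# Tiny Bloom filter: "might this SSTable contain (row, column)?" without a disk seek.
import hashlib

class BloomFilter:
    def __init__(self, size=1024, hashes=3):
        self.size, self.hashes = size, hashes
        self.bits = [False] * size

    def _positions(self, key):
        for i in range(self.hashes):
            h = hashlib.sha256(f"{i}:{key}".encode()).hexdigest()
            yield int(h, 16) % self.size

    def add(self, key):
        for p in self._positions(key):
            self.bits[p] = True

    def might_contain(self, key):
        return all(self.bits[p] for p in self._positions(key))

bf = BloomFilter()
bf.add("com.cnn.www/anchor:cnnsi.com")
print(bf.might_contain("com.cnn.www/anchor:cnnsi.com"))   # True
print(bf.might_contain("com.example/contents:"))           # almost surely False
```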
Performance Evaluation • Random reads are slowest: each read must fetch an SSTable block from disk (GFS). • Writes are faster than reads: the commit log is append-only, while reads must merge SSTables with the memtable. • Scans reduce the number of read operations needed per value.
Performance Evaluation: Scaling • Aggregate throughput does not scale linearly, but it holds up well up to 250 tablet servers. • Random reads scale worst: transferring SSTable blocks over the network saturates it.
Conclusions • Bigtable meets its goals of highly available, high-performance, massively scalable data storage. • Its API has been used successfully by more than 60 Google products. • Additional features in progress: • Secondary indexes • Cross-data-center replication • Deployment as a hosted service • Advantages of building a custom system: • Significant flexibility from designing their own data model. • Bottlenecks and inefficiencies can be removed as they arise.
Bigtable Family Tree Non-relational databases (HBase, Cassandra, MongoDB, etc.) borrow from Bigtable: • Column-oriented data model • Multi-level storage (commit log, in-RAM table, SSTables) • Tablet management (assignment, splitting, recovery, garbage collection, Bloom filters) Google technologies and their open-source equivalents: • GFS => Hadoop Distributed File System (HDFS) • Chubby => ZooKeeper • MapReduce => Apache Hadoop MapReduce