Bigtable: A Distributed Storage System for Structured Data
Fay Chang et al. (Google, Inc.)
Presenter: Kyungho Jeon, kyunghoj@buffalo.edu
Fall 2012: CSE 704 Web-scale Data Management

Motivation and Design Goal
• Distributed Storage System for Structured Data
• Scalability
  • Petabytes of data on thousands of (commodity) machines
• Wide Applicability
  • Throughput-oriented and latency-sensitive
• High Performance
• High Availability

Data Model

Data Model
• Not a full relational data model
• Provides a simple data model
• Supports dynamic control over data layout
• Allows clients to reason about the locality properties

Data Model – A Big Table
• A table in Bigtable is a:
  • Sparse
  • Distributed
  • Persistent
  • Multidimensional
  • Sorted map

Data Model
• Data is indexed using row and column names
• Data is treated as uninterpreted strings
  • (row:string, column:string, time:int64) → string
• Data locality can be controlled through careful choices of the schema

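The map signature on this slide can be pictured as a dictionary keyed by (row, column, timestamp). The sketch below is only a toy model of that idea; the names `table`, `put`, and `get_latest` and the sample keys are illustrative assumptions, not Bigtable's client API.

```python
# Toy model of the map (row, column, timestamp) -> uninterpreted bytes.
# All names here are illustrative; this is not Bigtable's client API.
table = {}

def put(row: str, column: str, timestamp: int, value: bytes) -> None:
    """Store one version of a cell."""
    table[(row, column, timestamp)] = value

def get_latest(row: str, column: str):
    """Return the value with the largest timestamp for (row, column), or None."""
    versions = [(ts, v) for (r, c, ts), v in table.items() if r == row and c == column]
    return max(versions)[1] if versions else None

put("com.cnn.www", "contents:", 3, b"<html>v3</html>")
put("com.cnn.www", "contents:", 5, b"<html>v5</html>")
put("com.cnn.www", "anchor:cnnsi.com", 9, b"CNN")

print(get_latest("com.cnn.www", "contents:"))  # b'<html>v5</html>'
```

A real table keeps these keys sorted and spread across machines; the dictionary only illustrates how the three key components address a value.
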
Data Model
• Rows
  • Data maintained in lexicographic order by row key
  • Tablet: rows with consecutive keys
    • Units of distribution and load balancing
• Columns
  • Column families
  • Family:qualifier
• Cells
  • Timestamps

Data Model – WebTable Example
• A large collection of web pages and related information
• Row key: Bigtable maintains data in lexicographic order by row key
• Tablet: a group of rows with consecutive keys; the unit of distribution

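Because rows are kept in lexicographic order and a tablet is a run of consecutive keys, finding the tablet for a row reduces to a binary search over tablet boundaries. The following is a minimal sketch under that assumption; the boundary keys and the function `tablet_for` are hypothetical.

```python
import bisect

# Hypothetical tablet boundaries: tablet i covers row keys up to and
# including tablet_end_keys[i]; keys beyond the last end key fall into
# one final, open-ended tablet.
tablet_end_keys = ["com.cnn.www", "com.google.www", "org.wikipedia.en"]

def tablet_for(row_key: str) -> int:
    """Index of the tablet whose key range contains row_key."""
    return bisect.bisect_left(tablet_end_keys, row_key)

rows = sorted(["org.apache.www", "com.cnn.www", "com.cnn.money", "com.google.maps"])
for r in rows:
    print(r, "-> tablet", tablet_for(r))
```

Note how the reversed-hostname keys place pages from the same domain next to each other, so they tend to land in the same tablet.
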
Data Model – WebTable Example
• Column family: the unit of access control
• Column: a column key is specified as “column family:qualifier”
• Once a column family has been created, columns can be added within it
• Cell: the storage referenced by a particular row key, column key, and timestamp
• Each cell can contain multiple versions of its data, indexed by timestamp

API

API
• Write or delete values in Bigtable
• Look up values from individual rows
• Iterate over a subset of the data in a table

API – Update a Row
• Open a table
• Declare the row to be mutated
• Store a new item under the column key “anchor:www.c-span.org”
• Delete an item under the column key “anchor:www.abc.com”
• Apply the changes as one atomic mutation

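The update steps listed above can be imitated with a small in-memory model. This is a sketch only: the `Table` and `RowMutation` classes below are hypothetical Python stand-ins, the row key is an illustrative value, and a single lock stands in for single-row atomicity.

```python
import threading

class Table:
    """In-memory stand-in for a table: row key -> {column key: value}."""
    def __init__(self):
        self.rows = {}
        self._lock = threading.Lock()

    def apply(self, mutation):
        # Buffered operations are applied together while holding a lock,
        # a toy stand-in for an atomic single-row mutation.
        with self._lock:
            row = self.rows.setdefault(mutation.row_key, {})
            for op, column, value in mutation.ops:
                if op == "set":
                    row[column] = value
                else:
                    row.pop(column, None)

class RowMutation:
    """Collects set/delete operations for one row before they are applied."""
    def __init__(self, row_key):
        self.row_key = row_key
        self.ops = []

    def set(self, column, value):
        self.ops.append(("set", column, value))

    def delete(self, column):
        self.ops.append(("delete", column, None))

t = Table()
t.rows["com.cnn.www"] = {"anchor:www.abc.com": "ABC"}

r1 = RowMutation("com.cnn.www")
r1.set("anchor:www.c-span.org", "CNN")
r1.delete("anchor:www.abc.com")
t.apply(r1)

print(t.rows["com.cnn.www"])  # {'anchor:www.c-span.org': 'CNN'}
```

Nothing changes in the table until `apply` runs, which mirrors the idea that the set and the delete take effect as one unit.
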
API – Iterate over a Table
• Create a Scanner instance
• Access the “anchor” column family
• Specify “return all versions”
• Specify a row key
• Iterate over rows

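The scan described above (one row key, one column family, all versions) can be approximated with a generator. This is a hedged sketch: `webtable`, `scan`, and the sample cell values are invented for illustration and do not reproduce the real Scanner interface.

```python
# Toy table: row key -> column key -> list of (timestamp, value) versions.
webtable = {
    "com.cnn.www": {
        "anchor:cnnsi.com": [(9, "CNN")],
        "anchor:my.look.ca": [(8, "CNN.com"), (4, "CNN")],
        "contents:": [(6, "<html>...")],
    },
}

def scan(table, row_key, family, all_versions=True):
    """Yield (row, column, timestamp, value) for one row and one column family."""
    columns = table.get(row_key, {})
    for column in sorted(columns):
        if not column.startswith(family + ":"):
            continue  # skip columns from other families
        versions = sorted(columns[column], reverse=True)  # newest first
        for ts, value in (versions if all_versions else versions[:1]):
            yield row_key, column, ts, value

for row, column, ts, value in scan(webtable, "com.cnn.www", "anchor"):
    print(row, column, ts, value)
```

Passing `all_versions=False` returns only the newest version of each column, which is the other common way to read a cell.
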
API – Other Features
• Single-row transactions
• Client-supplied scripts executed in the address space of the server
• Input source/output target for MapReduce jobs

A Typical Google Machine

A Google Cluster

Building Blocks
• Chubby
  • Highly available and persistent distributed lock service
• GFS
  • Stores logs and data files
• SSTable
  • Google’s immutable file format
  • A persistent, ordered, immutable map from keys to values
  • http://code.google.com/p/leveldb/

Chubby
• Highly available and persistent distributed lock service
• 5 replicas, one of which is elected as the master
• Paxos
• Provides a namespace that consists of directories and small files

Implementation
• Client library
• Master
  • One and only one!
• Tablet servers
  • Many

Implementation - Master
• Responsible for assigning tablets to tablet servers
• Detects the addition and removal of tablet servers
• Balances tablet-server load
• Garbage-collects files in GFS
• Handles schema changes
• Single-master system (as in GFS)

Tablet Server
• Manages a set of tablets
• Handles read and write requests to the tablets
• Splits tablets that have grown too large

How Does a Client Find a Tablet?

Tablet Assignment
• Each tablet is assigned to at most one tablet server at a time
• When a tablet is unassigned and a tablet server is available, the master assigns the tablet by sending a tablet load request to that server
• Bigtable uses Chubby to keep track of tablet servers

Tablet Assignment
• Detecting a tablet server that is no longer serving its tablets
  • The master periodically asks each tablet server for the status of its lock
  • If a tablet server reports that it has lost its lock, or if the master cannot reach a tablet server:
    • The master attempts to acquire an exclusive lock on the server’s file
    • If the lock acquisition succeeds, Chubby is alive, so the tablet server must have a problem
    • The master deletes the server’s file in Chubby to ensure that the tablet server can never serve again
    • The master then moves all the tablets that were previously assigned to that server into the set of unassigned tablets

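The detection steps above can be walked through with a toy lock service. This is a rough sketch, not the actual protocol implementation: `ToyChubby`, `check_server`, and the server and tablet names are all hypothetical, and the status poll is reduced to two boolean inputs.

```python
class ToyChubby:
    """Stand-in lock service: file name -> current lock holder (None if free)."""
    def __init__(self):
        self.files = {}

    def try_acquire(self, name, who):
        # Succeeds only if the file exists and nobody holds its lock.
        if name in self.files and self.files[name] is None:
            self.files[name] = who
            return True
        return False

    def delete(self, name):
        self.files.pop(name, None)


def check_server(chubby, server, assignments, unassigned, reachable, holds_lock):
    """One health-check round for one tablet server. Returns True if it was evicted."""
    if reachable and holds_lock:
        return False  # server looks healthy
    # Getting the lock means Chubby is alive, so the fault is on the server side.
    if chubby.try_acquire(server, "master"):
        chubby.delete(server)  # the server can never serve again
        unassigned.update(assignments.pop(server, set()))
        return True
    return False


chubby = ToyChubby()
chubby.files["ts1"] = None    # ts1 lost its lock
chubby.files["ts2"] = "ts2"   # ts2 still holds its lock
assignments = {"ts1": {"tablet-a", "tablet-b"}, "ts2": {"tablet-c"}}
unassigned = set()

check_server(chubby, "ts1", assignments, unassigned, reachable=False, holds_lock=False)
print(sorted(unassigned))  # ['tablet-a', 'tablet-b']
```

Deleting the file before reassigning is the important ordering: it guarantees the old server cannot resume serving while its tablets are being handed to someone else.
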
Tablet Assignment
• When a master is started, the master:
  • Grabs a unique master lock in Chubby
  • Scans the servers directory in Chubby to find the live servers
  • Communicates with every live tablet server to discover the current tablet assignment
  • Scans the METADATA table and adds unassigned tablets to the set of unassigned tablets

Tablet Serving
• Memtable
  • A sorted buffer
  • Maintains the updates on a row-by-row basis
  • Each row is copy-on-write to maintain row-level consistency
• Older updates are stored in a sequence of SSTables

Tablet Serving - Write
• Write operation
  • The server checks if the operation is valid
  • A valid mutation is written to the commit log
  • After the write has been committed, its contents are inserted into the memtable

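The ordering of the write steps (validate, log, then memtable) can be sketched as follows. The `Tablet` class is a hypothetical in-memory stand-in, and validity is reduced to a writer allow-list purely for illustration.

```python
class Tablet:
    """Toy tablet with a commit log and a memtable (both in memory here)."""
    def __init__(self, authorized_writers):
        self.authorized_writers = authorized_writers
        self.commit_log = []   # stand-in for the log stored in GFS
        self.memtable = {}     # (row, column) -> value

    def write(self, writer, row, column, value):
        # 1. Validate the operation (assumed here: a simple allow-list check).
        if writer not in self.authorized_writers:
            raise PermissionError(f"{writer} may not write")
        # 2. Record the mutation in the commit log first, for durability.
        self.commit_log.append((row, column, value))
        # 3. Only after the log write, make it visible in the memtable.
        self.memtable[(row, column)] = value


tablet = Tablet(authorized_writers={"crawler"})
tablet.write("crawler", "com.cnn.www", "contents:", "<html>...")
print(tablet.commit_log)
print(tablet.memtable)
```

Logging before updating the memtable is what makes recovery possible: anything visible in memory is guaranteed to be replayable from the log.
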
Tablet Serving - Read
• Read operation
  • The server checks if the operation is valid
  • A valid operation is executed on a merged view of the sequence of SSTables and the memtable
  • The merged view can be formed efficiently since SSTables and the memtable are lexicographically sorted data structures

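Since every source is already sorted, the merged view can be produced with a streaming k-way merge. The sketch below uses Python's `heapq.merge`; the function `merged_view`, the list-of-pairs layout, and the sample data are assumptions for illustration, and deletion markers are ignored.

```python
import heapq

def merged_view(memtable, sstables):
    """Merge sorted (key, value) runs; newer sources shadow older ones.

    memtable is the newest source; sstables are ordered newest to oldest.
    """
    sources = [memtable] + sstables
    # Tag each entry with its source age so that, for equal keys,
    # the entry from the newest source sorts first.
    tagged = [[(key, age, value) for key, value in src]
              for age, src in enumerate(sources)]
    last_key = object()  # sentinel that equals no real key
    for key, _age, value in heapq.merge(*tagged):
        if key != last_key:
            yield key, value
            last_key = key

memtable = [("a", "a2"), ("c", "c1")]
sstables = [[("a", "a1"), ("b", "b1")], [("b", "b0"), ("d", "d0")]]
print(list(merged_view(memtable, sstables)))
# [('a', 'a2'), ('b', 'b1'), ('c', 'c1'), ('d', 'd0')]
```

The merge never loads a whole source into a new structure; it only compares the current head of each run, which is why sortedness makes the read path cheap.
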
Tablet Serving - Recover
• Recover a tablet
  • A tablet server reads its metadata from the METADATA table
  • The metadata contains the list of SSTables that comprise a tablet and a set of redo points
  • The server reads the indices of the SSTables into memory and reconstructs the memtable by applying all of the updates that have committed since the redo points

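Memtable reconstruction amounts to replaying the tail of the commit log. In this minimal sketch the redo point is modeled as a log sequence number, which is an assumption; `recover_memtable` and the log layout are hypothetical.

```python
def recover_memtable(commit_log, redo_point):
    """Rebuild the memtable by replaying log entries at or after redo_point.

    commit_log is a list of (sequence_number, key, value) in commit order.
    Entries before redo_point are assumed to be persisted in SSTables already.
    """
    memtable = {}
    for seq, key, value in commit_log:
        if seq >= redo_point:
            memtable[key] = value  # later entries overwrite earlier ones
    return dict(sorted(memtable.items()))  # keep keys in sorted order

log = [(1, "a", "x"), (2, "b", "y"), (3, "a", "z"), (4, "c", "w")]
print(recover_memtable(log, redo_point=3))  # {'a': 'z', 'c': 'w'}
```
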
Compaction
• Minor compaction
  • When the memtable size reaches a threshold, the memtable is frozen, a new memtable is created, and the frozen memtable is converted to an SSTable
• Major compaction
  • Rewrites all SSTables into exactly one SSTable

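Both kinds of compaction can be sketched on an in-memory tablet. The `Tablet` class below is a hypothetical toy: the threshold counts entries rather than bytes, SSTables are plain sorted lists, and deletion markers are not modeled.

```python
class Tablet:
    """Toy tablet: one mutable memtable plus a list of immutable SSTables."""
    def __init__(self, threshold):
        self.threshold = threshold   # max memtable entries (hypothetical unit)
        self.memtable = {}
        self.sstables = []           # newest first; each a sorted list of (key, value)

    def write(self, key, value):
        self.memtable[key] = value
        if len(self.memtable) >= self.threshold:
            self.minor_compaction()

    def minor_compaction(self):
        # Freeze the current memtable, start a new one, and turn the
        # frozen memtable into a new SSTable.
        frozen, self.memtable = self.memtable, {}
        self.sstables.insert(0, sorted(frozen.items()))

    def major_compaction(self):
        # Rewrite every SSTable into a single one; newer values win.
        merged = {}
        for sstable in reversed(self.sstables):  # oldest first
            merged.update(sstable)
        self.sstables = [sorted(merged.items())]


t = Tablet(threshold=2)
for key, value in [("b", "1"), ("a", "1"), ("a", "2"), ("c", "1")]:
    t.write(key, value)
print(len(t.sstables))  # 2
t.major_compaction()
print(t.sstables)       # [[('a', '2'), ('b', '1'), ('c', '1')]]
```

Minor compactions keep memory use bounded, while the major compaction keeps the number of SSTables a read has to merge from growing without limit.
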
Compaction
[Figure: labels include Write Op, memtable (Memory), Commit Log and SSTables (GFS)]