HBase: A column-oriented database
Overview • An Apache project • Influenced by Google’s BigTable • Built on Hadoop • Uses HDFS, a distributed file system • Supports MapReduce • Goals • Scalability • Versioning • Compression • In-memory tables
Architectural issues • The general architecture is a cluster of nodes • A standalone mode runs everything on a single machine • There is a Java client API • There is a JRuby shell that wraps the Java API (connection sketch below)
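As a concrete illustration of the Java API, here is a minimal connection sketch against the HBase 2.x client. The table name "users" is a hypothetical example; the configuration is read from an hbase-site.xml on the classpath, which in standalone mode points at the local instance.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.TableName;
    import org.apache.hadoop.hbase.client.Connection;
    import org.apache.hadoop.hbase.client.ConnectionFactory;
    import org.apache.hadoop.hbase.client.Table;

    public class HBaseConnect {
        public static void main(String[] args) throws Exception {
            // Reads hbase-site.xml from the classpath; in standalone mode
            // this resolves to the local instance.
            Configuration conf = HBaseConfiguration.create();
            try (Connection connection = ConnectionFactory.createConnection(conf);
                 Table table = connection.getTable(TableName.valueOf("users"))) {
                System.out.println("Connected to table: " + table.getName());
            }
        }
    }

The same table can be inspected from the JRuby shell with list or describe 'users'.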
Modeling constructs • Table • Has a row key • A series of column families • Each family holds columns, each with a column name (qualifier) and a value • Operations • Create table • Insert a row with a “Put” command • The shell’s put writes one cell (column) at a time • Query a table with a “Get” command • (uses a table name and a row key; sketch below)
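A minimal sketch of these operations with the HBase 2.x Java client (where, unlike the shell, a single Put may carry several columns). The table "users", family "info", and values are hypothetical; 'connection' is an open Connection from the earlier sketch.

    import org.apache.hadoop.hbase.TableName;
    import org.apache.hadoop.hbase.client.*;
    import org.apache.hadoop.hbase.util.Bytes;

    // Create a table named "users" with one column family, "info".
    Admin admin = connection.getAdmin();
    admin.createTable(TableDescriptorBuilder.newBuilder(TableName.valueOf("users"))
            .setColumnFamily(ColumnFamilyDescriptorBuilder.of("info"))
            .build());

    Table table = connection.getTable(TableName.valueOf("users"));

    // Put: write a cell (family "info", qualifier "name") under row key "row1".
    Put put = new Put(Bytes.toBytes("row1"));
    put.addColumn(Bytes.toBytes("info"), Bytes.toBytes("name"), Bytes.toBytes("Alice"));
    table.put(put);

    // Get: retrieve the row by row key and read the cell back.
    Result result = table.get(new Get(Bytes.toBytes("row1")));
    byte[] name = result.getValue(Bytes.toBytes("info"), Bytes.toBytes("name"));
    System.out.println(Bytes.toString(name)); // prints "Alice"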
Filters • Scan • Retrieves a series of rows between a start and a stop row key • Can attach a filter on such things as column families, qualifiers, or timestamps • Filters are pushed to the servers, so data is pruned before it crosses the network (sketch below)
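A sketch of a range scan with a server-side filter, again with the 2.x Java client; the row keys and the qualifier "name" are hypothetical, and 'table' is the Table from the previous sketch.

    import org.apache.hadoop.hbase.CompareOperator;
    import org.apache.hadoop.hbase.client.*;
    import org.apache.hadoop.hbase.filter.BinaryComparator;
    import org.apache.hadoop.hbase.filter.QualifierFilter;
    import org.apache.hadoop.hbase.util.Bytes;

    // Scan the key range [row1, row9); rows outside it are never read.
    Scan scan = new Scan()
            .withStartRow(Bytes.toBytes("row1"))
            .withStopRow(Bytes.toBytes("row9"));

    // Server-side filter: only cells whose qualifier equals "name" are returned.
    scan.setFilter(new QualifierFilter(CompareOperator.EQUAL,
            new BinaryComparator(Bytes.toBytes("name"))));

    try (ResultScanner scanner = table.getScanner(scan)) {
        for (Result r : scanner) {
            System.out.println(Bytes.toString(r.getRow()));
        }
    }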
Updating • When a column value is written to the db, old values are kept and organized by timestamp • Each such versioned value is a cell • Timestamps can be assigned explicitly • Otherwise the current timestamp is used on insert • A Get returns the most recent version by default • Operations that alter column family structures are expensive (versioning sketch below)
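A sketch of explicit timestamps and version reads, assuming the family "info" was created to keep more than one version (the max-versions setting defaults to 1); the timestamps and values are hypothetical.

    import org.apache.hadoop.hbase.Cell;
    import org.apache.hadoop.hbase.CellUtil;
    import org.apache.hadoop.hbase.client.*;
    import org.apache.hadoop.hbase.util.Bytes;

    byte[] row  = Bytes.toBytes("row1");
    byte[] fam  = Bytes.toBytes("info");
    byte[] qual = Bytes.toBytes("name");

    // Write the same cell twice with explicit timestamps; both versions
    // are retained (up to the family's max-versions setting).
    table.put(new Put(row).addColumn(fam, qual, 1000L, Bytes.toBytes("old")));
    table.put(new Put(row).addColumn(fam, qual, 2000L, Bytes.toBytes("new")));

    // A plain Get would return only the latest version ("new");
    // readVersions(2) asks for the history as well.
    Result result = table.get(new Get(row).readVersions(2));
    for (Cell cell : result.getColumnCells(fam, qual)) {
        System.out.println(cell.getTimestamp() + " -> "
                + Bytes.toString(CellUtil.cloneValue(cell)));
    }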
Other characteristics • Text compression • Rows are stored in order by key value • A region is a contiguous range of rows • Each region is served by a single region server • Regions can be automatically merged and split • Uses write-ahead logging to prevent loss of data on node failures • This is called journaling in Unix file systems • Supports a master/slave replication strategy across clusters (compression sketch below)
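Compression (and the in-memory flag from the overview) is configured per column family. A sketch, assuming the 2.x client; the table "logs" and family "data" are hypothetical, and gzip is chosen only because it needs no native libraries.

    import org.apache.hadoop.hbase.TableName;
    import org.apache.hadoop.hbase.client.*;
    import org.apache.hadoop.hbase.io.compress.Compression;
    import org.apache.hadoop.hbase.util.Bytes;

    // Assumes an open Connection 'connection'. The "data" family compresses
    // its values with gzip and is flagged to be kept in the block cache.
    Admin admin = connection.getAdmin();
    admin.createTable(TableDescriptorBuilder.newBuilder(TableName.valueOf("logs"))
            .setColumnFamily(ColumnFamilyDescriptorBuilder
                    .newBuilder(Bytes.toBytes("data"))
                    .setCompressionType(Compression.Algorithm.GZ)
                    .setInMemory(true)
                    .build())
            .build());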
An HBase cluster (figure taken from: http://www.packtpub.com/article/hbase-basic-performance-tuning)
Tasks of components • The ZooKeeper cluster is a coordination service for the HBase cluster • Clients consult it to find the correct region server • It elects the active master • The master allocates regions and balances load • Region servers hold and serve the regions • Hadoop supplies HDFS and supports MapReduce (client configuration sketch below)
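Because ZooKeeper is the entry point, a remote client only needs the quorum address to locate region servers; it does not contact the master to read or write data. A sketch with placeholder host names:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.Connection;
    import org.apache.hadoop.hbase.client.ConnectionFactory;

    // Point the client at the cluster's ZooKeeper ensemble (placeholder hosts).
    Configuration conf = HBaseConfiguration.create();
    conf.set("hbase.zookeeper.quorum", "zk1.example.com,zk2.example.com,zk3.example.com");
    conf.set("hbase.zookeeper.property.clientPort", "2181");
    Connection connection = ConnectionFactory.createConnection(conf);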
Some key concepts • De-normalization • Fast random retrieval by row key • Use of a multi-component architecture (Hadoop, ZooKeeper) to leverage existing software tools • Controllable in-memory placement of column families