HBase • Presented by Brandon Thai • Professor Thanh Tran • CS157B, San Jose State University
Agenda • Overview • HBase Usage Scenarios • Installing HBase • HBase + Hive Integration
Overview • Modeled after Google’s Bigtable • Strongly consistent reads and writes • Automatic sharding: tables are distributed across the cluster as regions, and regions are automatically split and redistributed as your data grows • Automatic RegionServer failover • HBase supports HDFS out of the box as its distributed file system
Overview • MapReduce: HBase supports massively parallelized processing via MapReduce, using HBase as both source and sink • Java client API: an easy-to-use Java API for programmatic access • Thrift/REST API: HBase also supports Thrift and REST for non-Java front ends • Block cache and Bloom filters: HBase supports a block cache and Bloom filters for high-volume query optimization
Usage scenarios • Storing large amounts of data: hundreds of gigabytes up to petabytes • Situations requiring high write throughput: thousands of inserts, updates, or deletes per second • Rapid lookup of values by key • You don’t need the features that an RDBMS provides: typed columns, secondary indexes, transactions, advanced query languages, etc. • Make sure you have enough hardware (DataNodes)
HBase terminology • Region: a subset of a table’s rows, similar to a partition • HRegionServer: serves data for reads and writes • Master: responsible for coordinating HRegionServers; assigns regions, detects HRegionServer failures, and controls administrative functions
Data model in HBase • Table – tables are declared up front at schema definition time • Row – row keys are uninterpreted bytes. Rows are sorted lexicographically, with the lowest order appearing first in a table • Column family – columns in HBase are grouped into column families. All column members of a column family share the same prefix. For example, courses:history and courses:math are both members of the courses column family • Cell – a {row, column, version} tuple exactly specifies a cell in HBase
Data model operations • There are four primary data model operations in HBase • Get – returns attributes for a specified row • Put – either adds new rows to a table (if the key is new) or updates existing rows (if the key already exists) • Scan – allows iteration over multiple rows for specified attributes • Delete – removes a row from a table • All data model operations return data in sorted order: first by row, then by column family, followed by column qualifier, and finally timestamp (newest first) • HBase does not support joins directly • Two primary strategies for complex joins: • Denormalize the data when writing to HBase • Keep lookup tables and do the join between HBase tables in your application or MapReduce code
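As a rough sketch, the four operations map onto HBase shell commands like the ones below. Only the courses column family comes from the slides; the students table, the row key, and the cell values are illustrative assumptions. The script writes the commands to a temporary file and runs them only if an HBase launcher is found at ./bin/hbase.

```shell
#!/bin/sh
# Sketch: the four data model operations as HBase shell commands.
# Table name, row key, and values are illustrative assumptions.
OPS_DIR=$(mktemp -d)
OPS_SCRIPT="$OPS_DIR/data_model_ops.txt"

cat > "$OPS_SCRIPT" <<'EOF'
# Hypothetical table with one column family
create 'students', 'courses'
# Put: add new cells (or update them if the row key already exists)
put 'students', 'row1', 'courses:history', 'A'
put 'students', 'row1', 'courses:math', 'B'
# Get: return the attributes of one row
get 'students', 'row1'
# Scan: iterate over rows, restricted to one column
scan 'students', {COLUMNS => ['courses:math']}
# Delete: remove a single cell, then the whole row
delete 'students', 'row1', 'courses:math'
deleteall 'students', 'row1'
exit
EOF

cat "$OPS_SCRIPT"

# Run against a live instance only when started from an unpacked HBase directory.
if [ -x ./bin/hbase ]; then
  ./bin/hbase shell "$OPS_SCRIPT"
fi
```

In the shell, delete removes one cell while deleteall removes every cell in the row, which is the row-level delete described on the slide.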
HBase run modes • Standalone • This is the default mode • Uses the local filesystem instead of HDFS • Runs all HBase daemons and a local ZooKeeper in the same JVM • Distributed • Pseudo-distributed mode • Can run against the local filesystem or HDFS • Use this mode only for testing and prototyping on a single host • Fully distributed mode • Can only run on HDFS
Download HBase • http://www.apache.org/dyn/closer.cgi/hbase/ • Download from one of the mirror sites
Install HBase • Decompress the archive and change into the unpacked directory • $ tar xfz hbase-0.99.0-SNAPSHOT.tar.gz • $ cd hbase-0.99.0-SNAPSHOT • Edit conf/hbase-site.xml
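The slide does not show what goes into conf/hbase-site.xml. A minimal standalone configuration might look like the sketch below, which sets hbase.rootdir and hbase.zookeeper.property.dataDir. The /home/testuser paths are assumptions, and the file is written to a temporary directory rather than into a real installation.

```shell
#!/bin/sh
# Sketch: a minimal standalone hbase-site.xml.
# The data paths under /home/testuser are assumptions; adjust to your system.
SITE_DIR=$(mktemp -d)
SITE_FILE="$SITE_DIR/hbase-site.xml"

cat > "$SITE_FILE" <<'EOF'
<?xml version="1.0"?>
<configuration>
  <!-- Where HBase stores its data on the local filesystem -->
  <property>
    <name>hbase.rootdir</name>
    <value>file:///home/testuser/hbase</value>
  </property>
  <!-- Where the bundled ZooKeeper keeps its data -->
  <property>
    <name>hbase.zookeeper.property.dataDir</name>
    <value>/home/testuser/zookeeper</value>
  </property>
</configuration>
EOF

cat "$SITE_FILE"
# To use it, copy the file over conf/hbase-site.xml in the unpacked HBase directory.
```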
Start HBase • $ ./bin/start-hbase.sh • HBase logs can be found in the logs subdirectory if you encounter any problems • Connect to the running HBase via the shell • $ ./bin/hbase shell • Type help and then <RETURN> to see a listing of shell commands
Create a table • Connect to the running HBase via the shell • Create a table named test with a single column family named cf • Verify its creation by listing all tables • Insert some values • The first insert is at row1, column cf:a, with a value of value1
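The steps on this slide can be sketched as the HBase shell commands below. The table test, the family cf, and the first insert come from the slide; the second and third inserts (row2 and row3) are assumed values added for illustration. The commands are written to a temporary script file and executed only when ./bin/hbase is present.

```shell
#!/bin/sh
# Sketch: create the table, verify it, and insert values.
# row2/row3 and their values are assumptions; only row1 is given on the slide.
CREATE_DIR=$(mktemp -d)
CREATE_SCRIPT="$CREATE_DIR/create_test.txt"

cat > "$CREATE_SCRIPT" <<'EOF'
# Create table 'test' with a single column family 'cf'
create 'test', 'cf'
# Verify its creation
list 'test'
# Insert some values
put 'test', 'row1', 'cf:a', 'value1'
put 'test', 'row2', 'cf:b', 'value2'
put 'test', 'row3', 'cf:c', 'value3'
exit
EOF

cat "$CREATE_SCRIPT"

# Run only from an unpacked HBase directory with HBase already started.
if [ -x ./bin/hbase ]; then
  ./bin/hbase shell "$CREATE_SCRIPT"
fi
```

The same commands can be typed one by one at the interactive prompt; the trailing exit is only needed when they are run from a file.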
Verify the data • Run a scan of the table • Get a single row • Disable and drop the table • Exit the shell by typing exit • Stop the HBase instance by running the stop script: stop-hbase.sh
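A possible command sequence for this slide is sketched below, assuming the test table from the previous slide already exists. As before, the commands go to a temporary file (the file name is an assumption) and run only if ./bin/hbase is found.

```shell
#!/bin/sh
# Sketch: verify the data, then clean up.
# Assumes the 'test' table created on the previous slide exists.
VERIFY_DIR=$(mktemp -d)
VERIFY_SCRIPT="$VERIFY_DIR/verify_test.txt"

cat > "$VERIFY_SCRIPT" <<'EOF'
# Scan the whole table
scan 'test'
# Get a single row
get 'test', 'row1'
# A table must be disabled before it can be dropped
disable 'test'
drop 'test'
exit
EOF

cat "$VERIFY_SCRIPT"

# Run the commands, then stop the instance, only inside an unpacked HBase directory.
if [ -x ./bin/hbase ]; then
  ./bin/hbase shell "$VERIFY_SCRIPT"
  ./bin/stop-hbase.sh
fi
```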
Distributed HBase • Basic system requirements • HBase requires at least Java 6 • ssh must be installed and sshd must be running to use Hadoop’s scripts to manage remote Hadoop and HBase daemons • DNS • HBase uses the local hostname to self-report its IP address • If your machine has multiple interfaces, HBase will use the interface that the primary hostname resolves to • If that is not the right one, set hbase.regionserver.dns.interface to indicate the primary interface • Another option is to set hbase.regionserver.dns.nameserver to choose a different nameserver than the system default
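For illustration, the two DNS properties named above could be set with a fragment like the following. The interface name eth0 and the nameserver ns1.example.com are placeholder assumptions, and the fragment is written to a temporary file rather than into a real conf/hbase-site.xml.

```shell
#!/bin/sh
# Sketch: DNS overrides for a RegionServer.
# eth0 and ns1.example.com are placeholder assumptions.
DNS_DIR=$(mktemp -d)
DNS_FRAGMENT="$DNS_DIR/hbase-site-dns.xml"

cat > "$DNS_FRAGMENT" <<'EOF'
<!-- Place these inside the <configuration> element of conf/hbase-site.xml -->
<property>
  <name>hbase.regionserver.dns.interface</name>
  <value>eth0</value>
</property>
<property>
  <name>hbase.regionserver.dns.nameserver</name>
  <value>ns1.example.com</value>
</property>
EOF

cat "$DNS_FRAGMENT"
```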
Ulimit and nproc • Apache HBase is a database and uses a lot of files at the same time • The default per-user open file limit on most Linux systems is 1024 • Any significant amount of loading will lead to Java IOException errors • Raising the file descriptor and nproc limits for the user is an operating system configuration task • To raise the limits, add the following lines to the file /etc/security/limits.conf • <user running Hadoop> - nofile 32768 • <user running Hadoop> soft/hard nproc 3200 • In the file /etc/pam.d/common-session, add the following as the last line of the file • session required pam_limits.so • Restart your system for the changes to take effect
Replace Hadoop • HBase depends on Hadoop and bundles an instance of the Hadoop jar under its lib directory • The bundled jar is only for use in standalone mode • In distributed mode, it is critical that the version of Hadoop running on your cluster matches the version under HBase • Replace the Hadoop jar found in the HBase lib directory with the Hadoop jar you are running on your cluster to avoid version mismatch issues
Pseudo-distributed mode • Pseudo-distributed mode is simply fully distributed mode run on a single host • Use it for configuration testing and prototyping • Set up HDFS on your single host • Edit conf/hbase-site.xml • Tell HBase to run in (pseudo-)distributed mode rather than in the default standalone mode • With this configuration, HBase will start up an HBase Master process, a ZooKeeper server, and a RegionServer process
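The setting that switches HBase out of standalone mode is hbase.cluster.distributed. A minimal sketch of the edit follows; it writes the fragment to a temporary file (the file name is an assumption) instead of touching a real conf/hbase-site.xml.

```shell
#!/bin/sh
# Sketch: enable (pseudo-)distributed mode.
# The fragment belongs inside <configuration> in conf/hbase-site.xml.
PSEUDO_DIR=$(mktemp -d)
PSEUDO_FRAGMENT="$PSEUDO_DIR/hbase-site-distributed.xml"

cat > "$PSEUDO_FRAGMENT" <<'EOF'
<!-- true = distributed (separate daemons); false = standalone (single JVM) -->
<property>
  <name>hbase.cluster.distributed</name>
  <value>true</value>
</property>
EOF

cat "$PSEUDO_FRAGMENT"
```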
HBase and HDFS • By default, HBase uses the local filesystem and writes to the operating system’s temporary directory • The Hadoop local filesystem does not support sync, so unless the system is shut down properly, data will be lost • Using the operating system’s temporary directory can also cause data loss • For a more permanent setup, use an instance of HDFS; HBase data will then be written to the Hadoop distributed filesystem rather than to a local filesystem directory • Let HBase create the hbase.rootdir directory. If you don’t, you will get a warning saying HBase needs a migration run because the directory is missing files expected by HBase • Add the following lines to conf/hbase-site.xml • <name>hbase.rootdir</name> • <value>hdfs://<host where your local HDFS instance is homed>:8020/hbase</value>
Testing the installation • Make sure HDFS is running • Run start-dfs.sh from the HADOOP_HOME directory (under bin/ or sbin/, depending on the Hadoop version) • Start HBase • Run bin/start-hbase.sh in the HBASE_HOME directory • If you have trouble running HBase, check the HBase log files in the logs subdirectory • The HBase web UI lists vital attributes • By default it is deployed on the master host at port 16010 • http://<master host name>:16010
Hive + HBase motivation • Characteristics of Hive • Batch • Structured • Analysts • Characteristics of HBase • Online • Unstructured (schemaless) • Programmers • Hive data warehouses on HDFS have long ETL times • Hive does not have access to real-time data • Analyzing HBase data with MapReduce requires custom coding • Hive and SQL are already known by many analysts
Use case 1: HBase as ETL data sink • Use Hive to query an HDFS table and put the results into HBase • Use HBase for online queries
Use case 2: HBase as data source • Use data from both HBase and HDFS tables in the query
Use case 3: low-latency warehouse • Both tables have the same table definition • The HBase table gets continuous updates of data • The HDFS table gets periodic dumps of data
Example of Hive table • HBaseStorageHandler is the integration driver between Hive and HBase
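The table definition itself is not reproduced in the text, so here is a hedged sketch of what an HBase-backed Hive table can look like. The names hbase_table_1, xyz, cf1:val, and the source table pokes are illustrative assumptions; the storage handler class and the hbase.columns.mapping and hbase.table.name properties are the standard integration settings. The statements are written to a temporary .hql file and not executed.

```shell
#!/bin/sh
# Sketch: a Hive table stored in HBase via HBaseStorageHandler.
# Table, column, and family names are illustrative assumptions.
HQL_DIR=$(mktemp -d)
HQL_FILE="$HQL_DIR/hbase_table.hql"

cat > "$HQL_FILE" <<'EOF'
-- Map the Hive columns (key, value) to the HBase row key and column cf1:val
CREATE TABLE hbase_table_1(key int, value string)
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES ("hbase.columns.mapping" = ":key,cf1:val")
TBLPROPERTIES ("hbase.table.name" = "xyz");

-- Use case 1 (HBase as ETL data sink): load query results into HBase.
-- 'pokes' is a hypothetical HDFS-backed Hive table with columns (foo int, bar string).
INSERT OVERWRITE TABLE hbase_table_1 SELECT * FROM pokes WHERE foo = 98;
EOF

cat "$HQL_FILE"
# With Hive configured for HBase, the file could be run as: hive -f "$HQL_FILE"
```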
