
HBASE


Presentation Transcript


  1. HBASE Presented by Brandon Thai Professor Thanh Tran CS157B, SAN JOSE STATE UNIVERSITY

  2. Agenda • Overview • HBase Usage Scenarios • Installing HBase • HBase + Hive Integration

  3. Overview • Modeled after Google’s Bigtable • Strongly consistent reads/writes • Automatic sharding • Tables are distributed across the cluster via regions, and regions are automatically split and redistributed as your data grows • Automatic RegionServer failover • HBase supports HDFS out of the box as its distributed file system

  4. Overview • MapReduce • HBase supports massively parallelized processing via MapReduce, using HBase as both a source and a sink • Java Client API • Provides an easy-to-use Java API for programmatic access • Thrift/REST API • HBase also supports Thrift and REST for non-Java front ends • Block Cache and Bloom Filters • HBase supports a block cache and Bloom filters for high-volume query optimization
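  As a small illustration of the REST interface (a sketch only: it assumes the REST gateway has been started with bin/hbase rest start and is listening on its default port, 8080, and that the test table created later in this deck exists), a row can be fetched with curl and comes back with base64-encoded cell values:
  • $ curl -H "Accept: application/json" http://localhost:8080/test/row1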

  5. Usage scenarios • Storing large amounts of data • Hundreds of gigabytes up to petabytes • Situations requiring high write throughput • Thousands of inserts, updates, or deletes per second • Rapid lookup of values by key • No need for features that an RDBMS provides • Typed columns, secondary indexes, transactions, advanced query languages, etc. • Make sure to have enough hardware (DataNodes)

  6. HBase terminology • Region • A subset of a table’s rows, similar to a partition • HRegionServer • Serves data for reads and writes • Master • Responsible for coordinating the HRegionServers • Assigns regions, detects HRegionServer failures, and controls administrative functions

  7. Data model in HBase • Table – Tables are declared up front at schema definition time • Row – Row keys are uninterpreted bytes. Rows are lexicographically sorted, with the lowest order appearing first in a table • Column Family – Columns in HBase are grouped into column families. All column members of a column family have the same prefix. For example, courses:history and courses:math are both members of the courses column family • Cell – A {row, column, version} tuple exactly specifies a cell in HBase
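  A small shell sketch of the column-family idea, reusing the courses example from this slide (table name, row key, and values are illustrative only):
  • create 'students', 'courses'
  • put 'students', 'row1', 'courses:history', 'A'
  • put 'students', 'row1', 'courses:math', 'B+'
  • get 'students', 'row1'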

  8. Data model operations • There are four primary data model operations in HBase • Get – returns attributes for a specified row • Put – either adds new rows to a table (if the key is new) or updates existing rows (if the key already exists) • Scan – allows iteration over multiple rows for specified attributes • Delete – removes a row from a table • All data model operations return data in sorted order: first by row, then by column family, followed by column qualifier, and finally timestamp • HBase does not support joins directly • Two primary strategies for complex joins are: • Denormalizing the data upon writing to HBase • Using lookup tables and doing the join between HBase tables in your application or MapReduce code

  9. HBase run modes • Standalone • This is the default mode • Uses the local filesystem instead of HDFS • Runs all HBase daemons and a local ZooKeeper in the same JVM • Distributed • Pseudo-distributed mode • Can run against the local filesystem or HDFS • Use this mode only for testing and prototyping on a single host • Fully-distributed mode • Can only run on HDFS

  10. Download HBase • http://www.apache.org/dyn/closer.cgi/hbase/ • Download from one of the mirror sites

  11. Install HBase • Decompress and change into the unpacked directory • $ tar xfz hbase-0.99.0-SNAPSHOT.tar.gz • $ cd hbase-0.99.0-SNAPSHOT • Edit conf/hbase-site.xml (see the sketch below)
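  For standalone use, a minimal conf/hbase-site.xml can simply point hbase.rootdir and the ZooKeeper data directory at local paths (the paths below are placeholders, not requirements):
  • <configuration>
  • <property>
  • <name>hbase.rootdir</name>
  • <value>file:///home/testuser/hbase</value>
  • </property>
  • <property>
  • <name>hbase.zookeeper.property.dataDir</name>
  • <value>/home/testuser/zookeeper</value>
  • </property>
  • </configuration>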

  12. Start HBase • Start HBase • $ ./bin/start-hbase.sh • HBase logs can be found in the logs subdirectory if you encounter any problems • Connect to the running HBase instance via the shell, as sketched below • Type help and then <RETURN> to see a listing of shell commands
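  The connection step looks roughly like this (the prompt shown is illustrative):
  • $ ./bin/hbase shell
  • hbase(main):001:0> help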

  13. Create a Table • Create a table named test with a single column family named cf • Connect to the running HBase instance via the shell • Verify its creation by listing all tables • Insert some values • The first insert is at row1, column cf:a, with a value of value1 (sketched below)
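  A sketch of the shell session this slide describes, using the table name, column family, and value given above:
  • create 'test', 'cf'
  • list
  • put 'test', 'row1', 'cf:a', 'value1'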

  14. Verify the data • Run a scan of the table • Get a single row • Disable and drop the table • Exit the shell by typing exit • Stop the HBase instance by running the stop script: stop-hbase.sh
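  The verification and cleanup steps, sketched as shell commands:
  • scan 'test'
  • get 'test', 'row1'
  • disable 'test'
  • drop 'test'
  • exit
  • $ ./bin/stop-hbase.sh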

  15. Distributed HBase • Basic system requirements • HBase requires at least Java 6 • ssh must be installed and sshd must be running in order to use Hadoop’s scripts to manage remote Hadoop and HBase daemons • DNS • HBase uses the local hostname to self-report its IP address • If your machine has multiple interfaces, HBase will use the interface that the primary hostname resolves to • Another option is to set hbase.regionserver.dns.interface to indicate the primary interface • Another option is to set hbase.regionserver.dns.nameserver to choose a different nameserver than the system default
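  If you do need to pin the interface or nameserver, the two properties mentioned above go into conf/hbase-site.xml; for example (the interface name and nameserver address below are placeholders):
  • <property>
  • <name>hbase.regionserver.dns.interface</name>
  • <value>eth0</value>
  • </property>
  • <property>
  • <name>hbase.regionserver.dns.nameserver</name>
  • <value>192.168.1.1</value>
  • </property>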

  16. Ulimit and nproc • Apache HBase is a database and uses a lot of files at the same time • The default user file limit on Linux systems is 1024 • Any significant amount of loading will lead to Java IOException errors • Raising the file descriptor and nproc limits for the user is an operating-system configuration • To raise the file descriptor limit, add the following lines to the file /etc/security/limits.conf • <user running Hadoop> - nofile 32768 • <user running Hadoop> soft/hard nproc 3200 • In the file /etc/pam.d/common-session, add the following as the last line of the file • session required pam_limits.so • Restart your system for the changes to take effect

  17. Replace Hadoop • HBase depends on Hadoop and bundles an instance of the Hadoop jar under its lib directory • The bundled jar is only used in standalone mode • In distributed mode, it is critical that the version of Hadoop out on your cluster matches what is under HBase • Replace the Hadoop jar found in the HBase lib directory with the Hadoop jar you are running on your cluster to avoid version mismatch issues

  18. Pseudo-distributed Mode • Pseudo-distributed mode is simply a fully-distributed mode run on a single host • Use it for configuration testing and prototyping • Set up HDFS on your single host • Edit conf/hbase-site.xml (see the sketch below) • Tell HBase to run in (pseudo-)distributed mode rather than the default standalone mode • With this configuration, HBase will start an HBase Master process, a ZooKeeper server, and a RegionServer process
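  The key switch is hbase.cluster.distributed; setting it in conf/hbase-site.xml tells HBase to run distributed rather than standalone (hbase.rootdir, covered on the next slide, then points it at HDFS):
  • <property>
  • <name>hbase.cluster.distributed</name>
  • <value>true</value>
  • </property>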

  19. HBase and HDFS • By default, HBase uses the local filesystem, writing to the operating system’s temporary directory • The Hadoop local filesystem does not support sync, so unless the system is shut down properly, data will be lost • Using the operating system’s temporary directory can also lead to data loss, since it is typically cleared on reboot • For a more permanent setup, make use of an instance of HDFS; HBase data will then be written to the Hadoop distributed filesystem rather than to a local filesystem directory • Let HBase create the hbase.rootdir directory. If you don’t, you will get a warning saying HBase needs a migration run because the directory is missing files expected by HBase • Add the following lines to conf/hbase-site.xml (shown as a complete property element below) • <name>hbase.rootdir</name> • <value>hdfs://<local HDFS instance>:8020/hbase</value>
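  Wrapped in a complete property element, the setting above looks like this (the placeholder stands for wherever your local HDFS instance is running):
  • <property>
  • <name>hbase.rootdir</name>
  • <value>hdfs://<local HDFS instance>:8020/hbase</value>
  • </property>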

  20. Testing the Installation • Make sure HDFS is running • bin/start-dfs.sh in the HADOOP_HOME directory • Start HBase • bin/start-hbase.sh in the HBASE_HOME directory • If you have trouble running HBase, check the HBase log files in the logs subdirectory • The HBase web UI lists vital attributes • By default it is deployed on the master host at port 16010 • http://<host name>:16010

  21. Hive + HBase motivation • Characteristics of Hive • Batch • Structured • Analysts • Characteristics of HBase • Online • Unstructured (schemaless) • Programmers • Hive data warehouses on HDFS have long ETL times • Hive does not have access to real-time data • Analyzing HBase data with MapReduce requires custom coding • Hive and SQL are already known by many analysts

  22. Use case 1: HBase as an ETL data sink • Use Hive to query HDFS tables and put the results into HBase (sketched below) • Use HBase for online queries
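  A hedged HiveQL sketch of this pattern (table names are hypothetical; hbase_summary is assumed to be a Hive table backed by HBase via HBaseStorageHandler, as on slide 26):
  • INSERT OVERWRITE TABLE hbase_summary
  • SELECT user_id, COUNT(*) FROM hdfs_events GROUP BY user_id;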

  23. Use case 2: HBase as a data source • Use data from both HBase and HDFS tables in the same query (sketched below)
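  Because an HBase-backed Hive table can be referenced like any other table, such a query can join it directly with an HDFS table (names are hypothetical):
  • SELECT e.user_id, u.name, COUNT(*)
  • FROM hdfs_events e JOIN hbase_users u ON (e.user_id = u.user_id)
  • GROUP BY e.user_id, u.name;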

  24. Use case 3: Low-latency warehouse • Both tables have the same table definition • The HBase table gets continuous updates of data • The HDFS table gets periodic dumps of data
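  A sketch of one way to use this setup, assuming two identically defined tables named hdfs_events and hbase_events: expose them behind a single view so that queries see both the periodic HDFS dumps and the continuously updated HBase rows:
  • CREATE VIEW warehouse_events AS
  • SELECT * FROM (
  • SELECT * FROM hdfs_events
  • UNION ALL
  • SELECT * FROM hbase_events
  • ) combined;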

  25. Example of an HBase table
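  The original slide shows a concrete table; as a stand-in, here is a hypothetical HBase table created from the shell, which the Hive example on the next slide maps onto:
  • create 'users', 'info'
  • put 'users', 'u1', 'info:name', 'Alice'
  • put 'users', 'u1', 'info:email', 'alice@example.com'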

  26. Example of a Hive table • HBaseStorageHandler is the integration driver between Hive and HBase (a sketch follows)
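  A sketch of such a table definition, mapping the hypothetical users table from the previous slide into Hive (column names and the column mapping are illustrative; the handler class and SERDEPROPERTIES syntax follow the Hive HBaseIntegration wiki cited at the end):
  • CREATE EXTERNAL TABLE hive_users (user_id string, name string, email string)
  • STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
  • WITH SERDEPROPERTIES ("hbase.columns.mapping" = ":key,info:name,info:email")
  • TBLPROPERTIES ("hbase.table.name" = "users");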

  27. Hive and HBase Architecture

  28. Thank you! Questions?

  29. Works Cited • http://hbase.apache.org • https://cwiki.apache.org/confluence/display/Hive/HBaseIntegration • http://www.cloudera.com/content/cloudera/en/resources/library/hbasecon/hbasecon-2013--integration-of-apache-hive-and-hbase-video.html
