HBase By team MARS: Ankush Gupta, Mayank Gupta, Rajeev Ravikumar, Simranjit Singh Gill
HBase: Overview • HBase is a distributed, column-oriented data store built on top of HDFS • HBase is an Apache open source project whose goal is to provide storage for Hadoop distributed computing • Data is logically organized into tables, rows and columns • Tables are sorted by row key • The table schema only defines column families • A column family can have any number of columns • Each cell value has a timestamp • HBase is a distributed data store that can scale horizontally to 1,000s of commodity servers and petabytes of indexed storage
Bigtable • Definition • Bigtable is a sparse, distributed, persistent multi-dimensional sorted map. • Key Features • Distributed storage across a cluster of machines • Random, online read and write data access • Schemaless data model • Self-managed data partitions • Architecture • Tables consist of billions of rows, millions of columns • Records ordered by row key • Continuous sequences of rows partitioned into regions • Regions automatically split when they grow too large • Regions automatically distributed around the cluster
HBase is not: • A SQL database; no access or manipulation via SQL • No joins, no schema, no query engine, no data types • Programmatic access via Java, REST, or Thrift APIs • Denormalized data • HBase is good for: • Large datasets • Sparse datasets • Loosely coupled (denormalized) records • Lots of concurrent clients • Scales linearly and automatically with new nodes • Try to avoid: • Small datasets • Highly relational records • Schema designs requiring transactions
Limitations of Relational Databases • Data sets going into petabytes • RDBMSs don't scale out inherently • Scale up (vertical scaling) vs. scale out (horizontal scaling) • Hard to shard / partition • High read and write throughput at the same time is hard to achieve • Transactional vs. analytical databases • Specialized hardware is very expensive (e.g., Oracle clustering)
Where HBase makes life easy • Performing thousands of operations per second on multiple TB of data • Random read, random write, or both (but not neither) • Well-known and simple access patterns • Replication • Batch analysis • Scans and queries can select a subset of available columns • Where SQL makes life easy • Joins • Secondary indexing • Referential integrity (updates) • ACID transactions
HBase Features • Automatic partitioning of data • Transparent multi-node distribution of data • Known as 'scaling out' or 'horizontal scaling' • Ingest and retain more data, to petabyte scale and beyond • Store and access huge data volumes • Store data of any structure • No single point of failure • Use the entire Hadoop ecosystem to gain deep insight into your data • Multi-datacenter replication for load balancing, disaster recovery, and machine failure tolerance
HBase Use Cases • Big data, big number of users, big number of computers • Facebook handles 135 billion messages a month • Twitter stores 7 TB of data per day • Fast key-value access • Time series data • Real-time inserts, updates, and queries • Fraud detection by comparing transactions to known patterns in real time • Analytics – use MapReduce, Hive, or Pig to perform analytical queries • Managing large streams of non-transactional data: Apache logs, application logs, MySQL logs, etc. • Exposing machine learning models
Architecture: The Basics • Data in HBase is stored in tables, similar to an RDBMS • Tables are split into regions of roughly equal size • A region is a contiguous range of keys • As regions grow in size, they are split dynamically
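The region mechanics above can be illustrated with a small, simplified Python sketch (not HBase's real implementation): every row lives in the region whose key range covers it, and a region that grows past a threshold splits at its middle key. Region sizes here are row counts; real HBase splits on on-disk size.

```python
import bisect

MAX_ROWS_PER_REGION = 4

# start_key -> {row_key: value}; "" covers the lowest keys
regions = {"": {}}

def find_region(row_key):
    """Return the start key of the region whose range contains row_key."""
    starts = sorted(regions)
    return starts[bisect.bisect_right(starts, row_key) - 1]

def put(row_key, value):
    start = find_region(row_key)
    regions[start][row_key] = value
    if len(regions[start]) > MAX_ROWS_PER_REGION:  # region grew too large:
        rows = sorted(regions[start].items())      # split it at the middle key
        mid = len(rows) // 2
        regions[start] = dict(rows[:mid])
        regions[rows[mid][0]] = dict(rows[mid:])

for k in ["row1", "row2", "row3", "row4", "row5", "row6"]:
    put(k, "v")
print(sorted(regions))  # ['', 'row3'] -- two regions after the automatic split
```

Because regions are defined purely by sorted key ranges, a lookup never needs to scan every region: a binary search over the start keys is enough.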
Three Components of HBase • Region: • A subset of a table's rows • Multiple regions are present within an HRegionServer • RegionServer: • Acts as a slave • Similar to the DataNode in HDFS • Multiple HRegionServers possible • Master: • Responsible for coordinating the slaves • Similar to the NameNode in HDFS • One Master
Master • Manages and monitors the cluster • Assigns regions to Region Servers • Controls load balancing and failover of Region Servers • Master DOES NOT: • Handle any write requests • Get involved in the read/write path
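A minimal sketch of the Master's assignment role, with illustrative names (not HBase internals): regions are spread across the live region servers, and when a server fails only its regions are reassigned — the Master never touches reads or writes itself.

```python
from itertools import cycle

def assign(regions, servers):
    """Round-robin a list of regions across the live region servers."""
    rr = cycle(servers)
    return {region: next(rr) for region in regions}

assignment = assign(["r1", "r2", "r3", "r4"], ["rs-a", "rs-b"])
# rs-a serves r1, r3; rs-b serves r2, r4

def on_server_failure(assignment, dead, live):
    """Reassign only the regions the dead server held."""
    orphans = [r for r, s in assignment.items() if s == dead]
    assignment.update(assign(orphans, live))
    return assignment

assignment = on_server_failure(assignment, "rs-b", ["rs-a"])
print(assignment)  # every region now served by rs-a
```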
Region Server • Contains multiple regions • Contains one Write-Ahead Log • Provides atomicity & durability • Records all changes to the data • Helpful if the server crashes • Responsibilities: • Handles read/write requests • Splits the regions • Communicates with clients directly
Region • Identified by start and end key • Consists of: • MemStore • StoreFile (HFile) • MemStore: • Holds in-memory modifications to the data • Similar to RAM • Data is flushed to the Store upon reaching a threshold • HFile: • Contains data stored as column families
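The MemStore flush described above can be sketched in a few lines of Python (a simplified model, not the real code): writes land in a mutable in-memory buffer, and once it crosses a threshold the buffer is written out as an immutable, sorted file. The threshold here is a row count; real HBase flushes on memory size.

```python
FLUSH_THRESHOLD = 3

memstore = {}  # in-memory, mutable buffer of recent writes
hfiles = []    # each flush produces one immutable, sorted on-disk file

def put(row, value):
    memstore[row] = value
    if len(memstore) >= FLUSH_THRESHOLD:
        hfiles.append(sorted(memstore.items()))  # flush: sorted, immutable
        memstore.clear()

for i in range(7):
    put(f"row{i}", i)

print(len(hfiles), memstore)  # 2 flushed HFiles, 1 row still in the MemStore
```

Because each flush writes a fully sorted file, reads can later merge the MemStore and HFiles efficiently, which is the essence of this log-structured design.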
ZooKeeper • A highly available, scalable, distributed coordination kernel • Purpose: • Master selection in case of failover • Storage of Region Server addresses for fast lookup
HBase Storage Model • Column-oriented database • Tables are sorted by row key • The table schema only defines column families • Column families can have any number of columns • Each cell value has a timestamp
HBase vs. RDBMS • [Figure: an HBase table compared with a normal RDBMS table]
Storage Model Contd. • SortedMap( rowkey, List( SortedMap( column, List( value, timestamp ) ) ) ) • The first sorted map is the table, containing a list of column families • Each family contains another sorted map, which represents the columns and their values • These values are in the final list that holds the value and the timestamp it was set.
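The nested sorted-map structure above maps naturally onto nested dictionaries. A small illustrative sketch (names and the versioning policy are simplified assumptions, not HBase's API): each cell keeps a list of (timestamp, value) versions, newest first, and a read returns the most recent version.

```python
from collections import defaultdict

# row -> column family -> {column: [(timestamp, value), ...]}
table = defaultdict(lambda: defaultdict(dict))

def put(row, family, column, value, ts):
    cells = table[row][family].setdefault(column, [])
    cells.append((ts, value))
    cells.sort(reverse=True)  # keep the newest version first

def get(row, family, column):
    """Return the most recent value stored in a cell."""
    return table[row][family][column][0][1]

put("row1", "cf", "a", "old", ts=1)
put("row1", "cf", "a", "new", ts=2)
print(get("row1", "cf", "a"))  # "new" -- the later timestamp wins
```

This is why HBase can serve "time travel" reads: older versions are still in the list, and a real table would prune them according to the family's VERSIONS setting.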
Running HBase Configuration & data operations…
STEP 1: Download and unpack the HBase installation file • Download from here: • http://www.apache.org/dyn/closer.cgi/hbase/ • Download this file: • hbase-0.94.13.tar.gz • Decompress the file: • Open a Terminal window to run the following commands: • $ tar xfz hbase-0.94.13.tar.gz # decompress the archive • $ cd hbase-0.94.13 # change directory to the decompressed folder
STEP 2: Configuring the HBase environment • Edit conf/hbase-site.xml:
<configuration>
  <property>
    <!-- Set the directory HBase writes data to -->
    <name>hbase.rootdir</name>
    <value>file:///DIRECTORY/hbase</value>
  </property>
  <property>
    <!-- Set the directory ZooKeeper writes its data to -->
    <name>hbase.zookeeper.property.dataDir</name>
    <value>/DIRECTORY/zookeeper</value>
  </property>
</configuration>
STEP 2: Configuring the HBase environment, continued • Edit conf/hbase-env.sh • Add Java by adding the following line: export JAVA_HOME=/System/Library/Java/JavaVirtualMachines/1.6.0.jdk/Contents/Home/
STEP 3: Start HBase services • Run the following shell script to start the HBase services: $ ./bin/start-hbase.sh
STEP 4: Starting the HBase shell • Allows running HBase via the shell: $ ./bin/hbase shell • Ready to go: hbase(main):001:0>
Tasks and Data Operations • CREATE table: create 'test', 'cf' • Creates table test with a single column family cf • LIST all the tables in HBase: list 'test' • INSERT/UPDATE data into a table – the put operation: put 'test', 'row1', 'cf:a', 'value1' put 'test', 'row2', 'cf:b', 'value2' put 'test', 'row3', 'cf:c', 'value3' • Inserts rows with columns a, b and c respectively
Tasks and Data Operations continued • RETRIEVE data – the get operation: get 'test', 'row1' • To get specific columns: get 'test', 'row1', {COLUMN => 'cf:a'} get 'test', 'row2', {COLUMN => ['cf:a','cf:b']} • RETRIEVE a range of data – the scan operation: • scan 'test', {COLUMNS => ['cf:a','cf:b','cf:c'], LIMIT => 2, STARTROW => 'row2'}
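A rough Python analogue of that scan (a sketch of the semantics, not an HBase client): rows are kept sorted by key, and a scan is a bounded walk from STARTROW, optionally filtered to the requested columns and cut off at LIMIT rows.

```python
table = {
    "row1": {"cf:a": "value1"},
    "row2": {"cf:b": "value2"},
    "row3": {"cf:c": "value3"},
}

def scan(table, startrow="", limit=None, columns=None):
    """Walk rows in sorted key order, starting at startrow."""
    rows = []
    for key in sorted(table):
        if key < startrow:
            continue  # scans begin at STARTROW, not at the first row
        cells = table[key]
        if columns is not None:
            cells = {c: v for c, v in cells.items() if c in columns}
        rows.append((key, cells))
        if limit is not None and len(rows) >= limit:
            break  # LIMIT caps the number of rows returned
    return rows

print(scan(table, startrow="row2", limit=2, columns=["cf:b", "cf:c"]))
# [('row2', {'cf:b': 'value2'}), ('row3', {'cf:c': 'value3'})]
```

Because the table is sorted by row key, a range scan never has to examine rows before STARTROW — the same property that makes HBase scans efficient.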
Tasks and Data Operations continued • ALTER table: • alter 'test', NAME => 'cf', VERSIONS => 5 # keep a maximum of 5 cell versions • alter 'test', 'delete' => 'cf' • DELETE data from a table: • Delete a column: delete 'test', 'row1', 'cf:a' # deletes column cf:a from row1 of table test • Delete a whole row: deleteall 'test', 'row1'
Tasks and Data Operations continued • DISABLE table: disable 'test' • DROP table: drop 'test'
Closing out • Exit the HBase shell: exit • Stop the HBase instance: ./bin/stop-hbase.sh • Learn more about HBase and its commands at: • http://wiki.apache.org/hadoop/Hbase/Shell • http://hbase.apache.org/book/schema.html
REFERENCES • http://wiki.apache.org/hadoop/Hbase/Shell • http://hbase.apache.org/book/quickstart.html • The Apache HBase™ Reference Guide • Chicago Data Summit: Apache HBase: An Introduction • http://www.larsgeorge.com/2009/10/hbase-architecture-101-storage.html • http://blog.sematext.com/2012/07/ • http://netwovenblogs.com/2013/10/10/hbase-overview-of-architecture-and-data-model/ • http://wiki.apache.org/hadoop/Hbase/DataModel
Questions? Thank you! Have a wonderful Thanksgiving weekend!