
HBase



  1. HBase By team MARS: Ankush Gupta, Mayank Gupta, Rajeev Ravikumar, Simranjit Singh Gill

  2. HBase: Overview • HBase is a distributed, column-oriented data store built on top of HDFS • HBase is an Apache open-source project whose goal is to provide storage for Hadoop distributed computing • Data is logically organized into tables, rows, and columns • Tables are sorted by row key • A table schema only defines column families • A column family can have any number of columns • Each cell value has a timestamp • HBase is a distributed data store that can scale horizontally to thousands of commodity servers and petabytes of indexed storage

  3. Bigtable • Definition • Bigtable is a sparse, distributed, persistent, multi-dimensional sorted map • Key Features • Distributed storage across a cluster of machines • Random, online read and write data access • Schemaless data model • Self-managed data partitions • Architecture • Tables consist of billions of rows and millions of columns • Records are ordered by row key • Continuous sequences of rows are partitioned into regions • Regions automatically split when they grow too large • Regions are automatically distributed around the cluster

  4. HBase is not: • A SQL database; no access or manipulation via SQL • No joins, no schema, no query engine, no data types, no SQL • Programmatic access via Java, REST, or Thrift APIs (see the Java sketch below) • Denormalized data • HBase is good for: • Large datasets • Sparse datasets • Loosely coupled (denormalized) records • Lots of concurrent clients • Scaling linearly and automatically with new nodes • Try to avoid: • Small datasets • Highly relational records • Schema designs requiring transactions
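Since programmatic access goes through the Java, REST, or Thrift APIs, here is a minimal sketch of a put and a get using the 0.94-era Java client API (matching the HBase version installed later in this deck). The class name is illustrative; the row, column family, and values mirror the shell examples on slides 36-37:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.Get;
    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.util.Bytes;

    public class HBaseClientSketch {
        public static void main(String[] args) throws Exception {
            Configuration conf = HBaseConfiguration.create(); // reads hbase-site.xml
            HTable table = new HTable(conf, "test");
            // Write one cell: row "row1", column family "cf", qualifier "a"
            Put put = new Put(Bytes.toBytes("row1"));
            put.add(Bytes.toBytes("cf"), Bytes.toBytes("a"), Bytes.toBytes("value1"));
            table.put(put);
            // Read the same cell back
            Result result = table.get(new Get(Bytes.toBytes("row1")));
            System.out.println(Bytes.toString(
                result.getValue(Bytes.toBytes("cf"), Bytes.toBytes("a"))));
            table.close();
        }
    }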

  5. Limitations of Relational Databases • Data sets are growing into petabytes • RDBMSs don't scale inherently • Scale up (vertical scaling) vs. scale out (horizontal scaling) • Hard to shard / partition • High read and write throughput at the same time is not achievable • Forced choice between transactional and analytical databases • Specialized hardware is very expensive, e.g., Oracle clustering

  6. Where HBase makes life easy • Perform thousands of operations per second on multiple TB of data • Random read, random write, or both (if you need neither, plain HDFS batch processing may suffice) • Well-known and simple access patterns • Replication • Batch analysis • Scans and queries can select a subset of available columns • Where SQL makes life easy • Joins • Secondary indexing • Referential integrity (updates) • ACID

  7. HBase Features • Automatic partitioning of data • Transparent multi-node distribution of data • Known as 'scaling out' or 'horizontal scaling' • Ingest and retain more data, to petabyte scale and beyond • Store and access huge data volumes • Store data of any structure • No single point of failure • Use the entire Hadoop ecosystem to gain deep insight into your data • Multi-datacenter replication for load balancing, disaster recovery, and machine-failure tolerance

  8. HBase Use Cases • Big data, big number of users, big number of computers • Facebook handles 135 billion messages a month • Twitter stores 7 TB of data per day • Fast key-value access • Time-series data (see the row-key sketch below) • Real-time inserts, updates, and queries • Fraud detection by comparing transactions to known patterns in real time • Analytics: use MapReduce, Hive, or Pig to perform analytical queries • Managing large streams of non-transactional data: Apache logs, application logs, MySQL logs, etc. • Exposing machine learning models
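For the time-series use case, row keys are often designed so that the newest samples sort first. A minimal sketch, assuming a hypothetical metric name ("cpu.load") that does not come from the slides:

    import org.apache.hadoop.hbase.util.Bytes;

    public class TimeSeriesKeySketch {
        public static void main(String[] args) {
            // Reverse the timestamp so lexicographic row-key order puts the
            // newest sample first; prefix with the metric name to group a series.
            long reversedTs = Long.MAX_VALUE - System.currentTimeMillis();
            byte[] rowKey = Bytes.add(Bytes.toBytes("cpu.load."),
                                      Bytes.toBytes(reversedTs));
            System.out.println(Bytes.toStringBinary(rowKey));
        }
    }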

  9. HBase High-Level Architecture

  10. Architecture: The Basics • Data in HBase is stored in tables, similar to an RDBMS • Tables are split into roughly equal-sized regions • A region is a contiguous range of row keys • As regions grow in size, they are split dynamically

  11. Three Components of HBase • Region: • A subset of a table's rows • Multiple regions are served by each HRegionServer • RegionServer: • Acts as a slave • Similar to a DataNode in HDFS • Multiple HRegionServers are possible • Master: • Responsible for coordinating the slaves • Similar to the NameNode in HDFS • One Master

  12. Architecture Diagram

  13. Master • Manages and monitors the cluster • Assigns regions to Region Servers • Controls load balancing and failover of Region Servers • The Master DOES NOT: • Handle any write requests • Get involved in the read/write path

  14. Region Server • Contains multiple regions • Contains one Write-Ahead Log (WAL): • Provides atomicity & durability • Records all changes to the data • Helpful if the server crashes • Responsibilities: • Handles read/write requests • Splits regions • Communicates with clients directly

  15. Region • Identified by start and end key • Consists of: MemStore and StoreFile (HFile) • MemStore: • Holds in-memory modifications to the data (similar to RAM) • Data is flushed to the store upon reaching a threshold (see the configuration sketch below) • HFile: • Contains the data, stored as column families
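The flush threshold mentioned above is configurable. A minimal sketch of setting it programmatically; hbase.hregion.memstore.flush.size is a standard HBase property, and the 128 MB value here is purely illustrative:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;

    public class MemStoreFlushSketch {
        public static void main(String[] args) {
            Configuration conf = HBaseConfiguration.create();
            // Flush a region's MemStore to an HFile once it reaches 128 MB.
            conf.setLong("hbase.hregion.memstore.flush.size", 128L * 1024 * 1024);
            System.out.println(conf.get("hbase.hregion.memstore.flush.size"));
        }
    }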

  16. Architecture: The complete picture

  17. ZooKeeper • A highly available, scalable, distributed coordination kernel • Purpose: • Master selection in case of failover • Storage of Region Server addresses for fast lookup

  18. HBase and ZooKeeper

  19. Read/Write Operation

  20. HBase Storage Model • Column-oriented database • Tables are sorted by row key • A table schema only defines column families • Column families can have any number of columns • Each cell value has a timestamp

  21. HBase vs. RDBMS: an HBase table compared with a normal RDBMS table

  22. Storage Model Contd. • SortedMap( rowKey, List( SortedMap( column, List( value, timestamp )))) • The first sorted map is the table, containing a list of column families • Each family contains another sorted map representing the columns and their values • Those values live in the final list, which holds the value and the timestamp at which it was set (see the Java sketch below)
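A minimal Java sketch of this logical layout as plain nested sorted maps. The names and types are illustrative only; the HBase client exposes a similar nested-map view of a row via Result.getMap():

    import java.util.NavigableMap;
    import java.util.TreeMap;

    public class StorageModelSketch {
        public static void main(String[] args) {
            // row key -> ( column -> ( timestamp -> value ) ), every level sorted
            NavigableMap<String, NavigableMap<String, NavigableMap<Long, String>>> table =
                    new TreeMap<>();
            table.computeIfAbsent("row1", k -> new TreeMap<>())
                 .computeIfAbsent("cf:a", k -> new TreeMap<>())
                 .put(System.currentTimeMillis(), "value1");
            System.out.println(table); // {row1={cf:a={<timestamp>=value1}}}
        }
    }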

  23. Schema Design

  24. Schema Design

  25. Example of Schema Design: RDBMS

  26. Example of Schema Design: HBase

  27. Student-Subject Schema in HBase
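The slide's figure is not captured in this transcript, so the following is only a plausible illustration of how such a student-subject schema could be created with the 0.94 Java admin API; the table and column family names are assumptions, not from the slides:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.HColumnDescriptor;
    import org.apache.hadoop.hbase.HTableDescriptor;
    import org.apache.hadoop.hbase.client.HBaseAdmin;

    public class StudentSchemaSketch {
        public static void main(String[] args) throws Exception {
            Configuration conf = HBaseConfiguration.create();
            HBaseAdmin admin = new HBaseAdmin(conf);
            // One row per student; subjects denormalized into a wide column
            // family (e.g. column "subjects:math" -> grade) instead of a join table.
            HTableDescriptor desc = new HTableDescriptor("student");
            desc.addFamily(new HColumnDescriptor("info"));     // name, email, ...
            desc.addFamily(new HColumnDescriptor("subjects")); // one column per subject
            admin.createTable(desc);
            admin.close();
        }
    }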

  28. Column Family Attributes

  29. Region and Splitting

  30. Running HBase: configuration & data operations…

  31. STEP 1: Download and Unpack the HBase Installation File • Download from here: http://www.apache.org/dyn/closer.cgi/hbase/ • Download this file: hbase-0.94.13.tar.gz • Decompress the file by running the following commands in a Terminal window: • $ tar xfz hbase-0.94.13.tar.gz -> to decompress the archive • $ cd hbase-0.94.13 -> to change into the decompressed directory

  32. STEP 2: Configuring the HBase Environment • Edit conf/hbase-site.xml:

    <configuration>
      <property>
        <!-- Set the directory HBase writes data to -->
        <name>hbase.rootdir</name>
        <value>file:///DIRECTORY/hbase</value>
      </property>
      <property>
        <!-- Set the directory ZooKeeper writes its data to -->
        <name>hbase.zookeeper.property.dataDir</name>
        <value>/DIRECTORY/zookeeper</value>
      </property>
    </configuration>

  33. STEP 2: Configuring the HBase Environment (continued) • Edit conf/hbase-env.sh • Point it at your Java installation by adding the following line (this path is for a Mac OS X JDK 1.6 install; adjust it for your system): export JAVA_HOME=/System/Library/Java/JavaVirtualMachines/1.6.0.jdk/Contents/Home/

  34. STEP 3: Start HBase Services • Run the following shell script to start the HBase services: $ ./bin/start-hbase.sh

  35. STEP 4: Starting the HBase Shell • Start the interactive shell: $ ./bin/hbase shell • Ready to go: hbase(main):001:0>

  36. Tasks and Data Operations • CREATE a table: create 'test', 'cf' • Creates table test with a single column family cf • LIST tables in HBase: list 'test' • INSERT/UPDATE data in the table with the put operation: put 'test', 'row1', 'cf:a', 'value1' put 'test', 'row2', 'cf:b', 'value2' put 'test', 'row3', 'cf:c', 'value3' • Inserts rows row1, row2, and row3 with columns a, b, and c respectively

  37. Tasks and Data Operations (continued) • RETRIEVE data with the get operation: get 'test', 'row1' • To get specific columns: get 'test', 'row1', {COLUMN => 'cf:a'} or get 'test', 'row2', {COLUMN => ['cf:a','cf:b']} • RETRIEVE a range of data with the scan operation (see the Java equivalent below): scan 'test', {COLUMNS => ['cf:a','cf:b','cf:c'], LIMIT => 2, STARTROW => 'row2'}
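The same kind of range scan through the 0.94 Java client API, as a minimal sketch. It assumes the 'test' table created on the previous slides; the shell's LIMIT option has no one-line Scan equivalent, so it is approximated by stopping the loop early:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.client.ResultScanner;
    import org.apache.hadoop.hbase.client.Scan;
    import org.apache.hadoop.hbase.util.Bytes;

    public class ScanSketch {
        public static void main(String[] args) throws Exception {
            Configuration conf = HBaseConfiguration.create();
            HTable table = new HTable(conf, "test");
            Scan scan = new Scan(Bytes.toBytes("row2")); // STARTROW => 'row2'
            scan.addColumn(Bytes.toBytes("cf"), Bytes.toBytes("a"));
            scan.addColumn(Bytes.toBytes("cf"), Bytes.toBytes("b"));
            scan.addColumn(Bytes.toBytes("cf"), Bytes.toBytes("c"));
            ResultScanner scanner = table.getScanner(scan);
            int seen = 0;
            for (Result row : scanner) {
                System.out.println(row);
                if (++seen >= 2) break; // LIMIT => 2
            }
            scanner.close();
            table.close();
        }
    }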

  38. Tasks and Data Operations (continued) • ALTER a table: • alter 'test', NAME => 'cf', VERSIONS => 5 (keep at most 5 cell versions) • alter 'test', 'delete' => 'cf' (remove column family cf) • DELETE data from a table (see the Java sketch below): • Delete a column: delete 'test', 'row1', 'cf:a' (deletes column cf:a from row row1 of table test) • Delete a whole row: deleteall 'test', 'row1'
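For completeness, the same deletes via the 0.94 Java API, again assuming the 'test' table from the earlier slides:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.Delete;
    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.util.Bytes;

    public class DeleteSketch {
        public static void main(String[] args) throws Exception {
            Configuration conf = HBaseConfiguration.create();
            HTable table = new HTable(conf, "test");
            // Delete all versions of column cf:a in row1
            Delete d = new Delete(Bytes.toBytes("row1"));
            d.deleteColumns(Bytes.toBytes("cf"), Bytes.toBytes("a"));
            table.delete(d);
            // A Delete with no column specified removes the whole row,
            // like the shell's deleteall
            table.delete(new Delete(Bytes.toBytes("row1")));
            table.close();
        }
    }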

  39. Tasks and Data Operations (continued) • DISABLE a table: disable 'test' • DROP a table (it must be disabled first): drop 'test'

  40. Closing Out • Exit the HBase shell: exit • Stop the HBase instance: ./bin/stop-hbase.sh • Learn more about HBase and its commands at: • http://wiki.apache.org/hadoop/Hbase/Shell • http://hbase.apache.org/book/schema.html

  41. REFERENCES • http://wiki.apache.org/hadoop/Hbase/Shell • http://hbase.apache.org/book/quickstart.html • The Apache HBase™ Reference Guide • Chicago Data Summit: Apache HBase: An Introduction • http://www.larsgeorge.com/2009/10/hbase-architecture-101-storage.html • http://blog.sematext.com/2012/07/ • http://netwovenblogs.com/2013/10/10/hbase-overview-of-architecture-and-data-model/ • http://wiki.apache.org/hadoop/Hbase/DataModel

  42. Questions? Thank you! Have a wonderful Thanksgiving weekend!
