
CS525: Big Data Analytics

CS525: Big Data Analytics. HBase. Elke A. Rundensteiner, Fall 2013. HBase is an Apache open source project: a distributed, column-oriented data store built on top of HDFS that logically organizes data into tables.


Presentation Transcript


  1. CS525: Big Data Analytics. HBase. Elke A. Rundensteiner, Fall 2013

  2. HBase • HBase is an Apache open source project • HBase is a distributed, column-oriented data store on top of HDFS • HBase logically organizes data into tables

  3. HBase vs. HDFS
     • Both are distributed systems that scale to thousands of nodes
     • HDFS is good for batch processing (scans over big files):
       • Not good for record lookup
       • Not good for incremental addition of small batches
       • Not good for updates
     • HBase is designed for more tuple-level processing:
       • Faster record lookup
       • Support for record-level insertion
       • Support for updates (via new versions)

  4. HBase vs. HDFS (Cont’d) • If the application needs neither random reads nor random writes, then stick to HDFS

  5. HBase Logical Data Model

  6. HBase: Keys and Column Families • Each record is divided into column families • Each row has a key • Each column family consists of one or more columns • Based on Google’s Bigtable model (key-value pairs)

  7. HBase: Keys and Column Families
     • Key
       • Primary key for the table (byte array)
       • Indexed for fast lookup
     • Column Family
       • Has a name (string)
       • Contains one or more related columns
     • Column
       • Belongs to one column family
       • Included inside the row (familyName:columnName)
       • Column names are encoded inside cells
       • Different cells can have different columns
     • Version Number (for each record)
       • Unique within each key (by default, the system’s timestamp)
     • Value (Cell)
       • Byte array
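
The logical model on this slide can be pictured as a nested sorted map: row key, then familyName:columnName, then version, then value. The sketch below is a toy in-memory model under that reading, not the HBase client API; the class `LogicalModel` and its methods are hypothetical names, and row keys are plain strings for brevity even though HBase keys are byte arrays.

```java
import java.nio.charset.StandardCharsets;
import java.util.Comparator;
import java.util.NavigableMap;
import java.util.TreeMap;

// Toy model of the logical layout: row key -> "family:column" -> version -> value.
// Hypothetical class for illustration only; this is not the HBase client API.
final class LogicalModel {
    private final NavigableMap<String, NavigableMap<String, NavigableMap<Long, byte[]>>> rows =
            new TreeMap<>();

    // Writes one cell version; versions are kept newest-first inside each cell.
    public void put(String rowKey, String family, String column, long version, String value) {
        rows.computeIfAbsent(rowKey, k -> new TreeMap<String, NavigableMap<Long, byte[]>>())
            .computeIfAbsent(family + ":" + column,
                             k -> new TreeMap<Long, byte[]>(Comparator.reverseOrder()))
            .put(version, value.getBytes(StandardCharsets.UTF_8));
    }

    // Returns the newest value of a cell, or null if the cell was never written.
    public String latest(String rowKey, String family, String column) {
        NavigableMap<String, NavigableMap<Long, byte[]>> row = rows.get(rowKey);
        if (row == null) return null;
        NavigableMap<Long, byte[]> versions = row.get(family + ":" + column);
        if (versions == null) return null;
        return new String(versions.firstEntry().getValue(), StandardCharsets.UTF_8);
    }

    // Number of distinct columns stored in a row; rows may hold different columns.
    public int columnCount(String rowKey) {
        NavigableMap<String, NavigableMap<Long, byte[]>> row = rows.get(rowKey);
        return row == null ? 0 : row.size();
    }
}
```

Two rows in this model can carry entirely different columns, which matches the point that different cells can have different columns.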

  8. HBase Physical Data Model

  9. HBase Physical Model • Each column family is stored in a separate file (called HTables) • Key and version numbers are replicated with each column family • Multi-level index on values: <key, column family, column name, timestamp> • Each column family is configurable: compression, version retention, etc. • Empty cells are not stored
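
One way to read the multi-level index <key, column family, column name, timestamp> is as a sort order over stored cells. The sketch below assumes cells are ordered by row key, then family, then column, then newest timestamp first; `CellKey` is a hypothetical class for illustration and not an HBase type.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Toy cell key <row, family, column, timestamp>; hypothetical, not an HBase class.
final class CellKey {
    final String row;
    final String family;
    final String column;
    final long timestamp;

    CellKey(String row, String family, String column, long timestamp) {
        this.row = row;
        this.family = family;
        this.column = column;
        this.timestamp = timestamp;
    }

    // Assumed order: row, then family, then column, then newest timestamp first.
    static final Comparator<CellKey> ORDER = (a, b) -> {
        int c = a.row.compareTo(b.row);
        if (c == 0) c = a.family.compareTo(b.family);
        if (c == 0) c = a.column.compareTo(b.column);
        if (c == 0) c = Long.compare(b.timestamp, a.timestamp); // descending
        return c;
    };

    // Returns a sorted copy; only cells that were actually written appear,
    // so an empty cell costs no storage in this model.
    static List<CellKey> sorted(List<CellKey> cells) {
        List<CellKey> copy = new ArrayList<>(cells);
        copy.sort(ORDER);
        return copy;
    }
}
```

Because the row key and timestamp are part of every cell key, they are repeated for each stored cell, which mirrors the point that key and version numbers are replicated with each column family.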

  10. HBase Regions • Each HTable (column family) is partitioned horizontally into regions • Regions are the counterpart to HDFS blocks • Each will be one region
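
Horizontal partitioning means each region holds a contiguous range of row keys. A minimal sketch of locating the region for a key, assuming each region is identified by its start key and covers everything up to the next start key; `RegionDirectory` and the server names are hypothetical, not HBase classes.

```java
import java.util.Map;
import java.util.TreeMap;

// Toy region directory: each region covers [startKey, nextStartKey).
// Hypothetical class for illustration; not part of HBase.
final class RegionDirectory {
    private final TreeMap<String, String> startKeyToServer = new TreeMap<>();

    public void addRegion(String startKey, String server) {
        startKeyToServer.put(startKey, server);
    }

    // The region holding rowKey is the one with the greatest start key <= rowKey.
    public String locate(String rowKey) {
        Map.Entry<String, String> entry = startKeyToServer.floorEntry(rowKey);
        if (entry == null) {
            throw new IllegalStateException("no region covers " + rowKey);
        }
        return entry.getValue();
    }
}
```

Registering the first region with the empty string as its start key guarantees that every row key falls into some region.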

  11. HBase Details

  12. Creating a Table
      // Describe the two column families of the new table
      HBaseAdmin admin = new HBaseAdmin(config);
      HColumnDescriptor[] column = new HColumnDescriptor[2];
      column[0] = new HColumnDescriptor("columnFamily1:");
      column[1] = new HColumnDescriptor("columnFamily2:");
      // Attach the families to the table descriptor and create the table
      HTableDescriptor desc = new HTableDescriptor(Bytes.toBytes("MyTable"));
      desc.addFamily(column[0]);
      desc.addFamily(column[1]);
      admin.createTable(desc);

  13. Operations • Get() returns records for a certain key and/or version • Put() inserts a new record, or cells into an existing record • Delete() marks certain rows or regions as deleted • Scan() iterates over a certain region of tuples • But no high-level SQL is provided by HBase itself

  14. Logging Operations

  15. HBase vs. RDBMS

  16. HBase • A table-like data model with index support • Allows for tuple- and region-level random writes or reads • Yet supports high processing needs over huge data sets

  17. Backup: more details and examples on access support for HBase

  18. Operations On Regions: Get() • Given a key, return the corresponding record • For each value, return the highest version • Can control the number of versions you want
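
The version behaviour of Get() can be sketched for a single cell: by default only the highest version comes back, and the caller may ask for more. This is a toy model with hypothetical names (`VersionedCell`, `maxVersions`), not the HBase client API.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.TreeMap;

// Toy versioned cell: timestamp -> value, newest first. Hypothetical, not HBase.
final class VersionedCell {
    private final TreeMap<Long, String> versions = new TreeMap<>(Comparator.reverseOrder());

    public void write(long timestamp, String value) {
        versions.put(timestamp, value);
    }

    // Returns up to maxVersions values, newest first.
    // maxVersions = 1 corresponds to "return the highest version".
    public List<String> get(int maxVersions) {
        List<String> out = new ArrayList<>();
        for (String value : versions.values()) {
            if (out.size() >= maxVersions) break;
            out.add(value);
        }
        return out;
    }
}
```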

  19. Operations On Regions: Scan()

  20. Get(): Select value from table where key=‘com.apache.www’ AND label=‘anchor:apache.com’

  21. Scan(): Select value from table where anchor=‘cnnsi.com’
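
Where a Get() targets one key, a Scan() walks rows in key order and picks out the requested column. A minimal sketch, assuming a start row that is inclusive and a stop row that is exclusive; `ToyScan` and its methods are hypothetical and stand in for the real client classes.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Toy table supporting a range scan over sorted row keys. Hypothetical, not HBase.
final class ToyScan {
    private final TreeMap<String, Map<String, String>> rows = new TreeMap<>();

    public void put(String rowKey, String column, String value) {
        rows.computeIfAbsent(rowKey, k -> new HashMap<String, String>()).put(column, value);
    }

    // Walks rows in key order over [startRow, stopRow) and collects one column's values.
    public List<String> scan(String startRow, String stopRow, String column) {
        List<String> values = new ArrayList<>();
        for (Map<String, String> row : rows.subMap(startRow, true, stopRow, false).values()) {
            String value = row.get(column);
            if (value != null) {
                values.add(value); // rows that lack the column are skipped
            }
        }
        return values;
    }
}
```

In this model the query on the slide becomes `scan(start, stop, "anchor:cnnsi.com")` over whatever key range is of interest.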

  22. Operations On Regions: Put() • Insert a new record (with a new key), or • Insert a record for an existing key • Implicit version number (timestamp) • Explicit version number
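
The difference between an implicit and an explicit version number can be sketched for one cell. The toy class below assumes the implicit version is the current system time in milliseconds; `TimestampedPut` is a hypothetical name and not part of the HBase API.

```java
import java.util.TreeMap;

// Toy cell showing implicit versus explicit version numbers. Hypothetical, not HBase.
final class TimestampedPut {
    private final TreeMap<Long, String> versions = new TreeMap<>();

    // Explicit version: the caller supplies the timestamp.
    // Writing the same timestamp twice overwrites that version.
    public long put(long timestamp, String value) {
        versions.put(timestamp, value);
        return timestamp;
    }

    // Implicit version: fall back to the current system time in milliseconds.
    public long put(String value) {
        return put(System.currentTimeMillis(), value);
    }

    // Value with the highest version, or null if nothing was written.
    public String latest() {
        return versions.isEmpty() ? null : versions.lastEntry().getValue();
    }

    public int versionCount() {
        return versions.size();
    }
}
```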

  23. Operations On Regions: Delete()
      • Marking table cells as deleted
      • Multiple levels
        • Can mark an entire column family as deleted
        • Can mark all column families of a given row as deleted
      • All operations are logged by the RegionServers
      • The log is flushed periodically
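
Marking as deleted, rather than erasing, can be sketched with timestamped delete markers: a cell is hidden when a marker at or after its timestamp exists, while the stored data stays in place. This is a simplified toy model that keeps only the last write per cell; `TombstoneRow` and its methods are hypothetical names, not the HBase API.

```java
import java.util.HashMap;
import java.util.Map;

// Toy row with family-level delete markers. Hypothetical, not HBase.
final class TombstoneRow {
    private static final class Cell {
        final long timestamp;
        final String value;
        Cell(long timestamp, String value) { this.timestamp = timestamp; this.value = value; }
    }

    private final Map<String, Map<String, Cell>> families = new HashMap<>();
    private final Map<String, Long> familyMarkers = new HashMap<>(); // family -> delete timestamp

    // Simplification: only the last write per cell is kept.
    public void put(String family, String column, long timestamp, String value) {
        families.computeIfAbsent(family, f -> new HashMap<String, Cell>())
                .put(column, new Cell(timestamp, value));
    }

    // Marks one column family as deleted up to the given timestamp.
    public void deleteFamily(String family, long timestamp) {
        familyMarkers.merge(family, timestamp, Long::max);
    }

    // Marks every column family of this row as deleted.
    public void deleteRow(long timestamp) {
        for (String family : families.keySet()) deleteFamily(family, timestamp);
    }

    // A cell is visible only if no marker at or after its timestamp covers it.
    public String get(String family, String column) {
        Map<String, Cell> columns = families.get(family);
        Cell cell = columns == null ? null : columns.get(column);
        if (cell == null) return null;
        Long marker = familyMarkers.get(family);
        return (marker != null && cell.timestamp <= marker) ? null : cell.value;
    }

    // Cells physically held; delete markers do not reduce this count.
    public int storedCells() {
        int n = 0;
        for (Map<String, Cell> columns : families.values()) n += columns.size();
        return n;
    }
}
```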
