Agenda

  Lab time
    Work on Hadoop Problems (week 5)
      Due next week (May 13)
      Answer 15 questions to pass, more to learn a lot
    Ask questions as needed!
  Lecture on HBase

Last Time
  Wrap up Hadoop
  Introduce distributed key/value stores
    Memcache
  Introduce HBase

This Week
  HBase
    Pros and cons
    Architecture
    Schema
    Examples

Next Week
  Lab
    Hadoop Problems are due
    HBase Problems assigned
  Lecture
    HBase usage and examples
    Rest of the Hadoop ecosystem
      Cassandra
      Hive
      Pig
      Mahout, Katta, etc.
    Move into clouds
      Virtualization
      Amazon EC2

Review
  Hadoop
    Batch processing, no random access
    Not real-time
    Free form (no concept of a schema)
  Distributed key-value stores
    Map a key to a value
    Pairs are distributed across servers
  Distributed column-oriented databases
    Impose more structure than a distributed hash table (DHT)
    Allow more freedom than a relational database
    Organize/group data by column rather than by row
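To make "pairs are distributed across servers" concrete, here is a minimal sketch of hash partitioning in Python. It is an illustration only, not how Memcache or any particular store is implemented: the server names, `server_for`, and `TinyKVCluster` are all made up for this example, and real systems usually use consistent hashing so that adding a server moves fewer keys.

```python
import hashlib

# Hypothetical server names; a real cluster would hold network addresses.
SERVERS = ["node-0", "node-1", "node-2"]


def server_for(key: str) -> str:
    """Pick a server by hashing the key (simple modulo partitioning)."""
    digest = hashlib.md5(key.encode("utf-8")).digest()
    return SERVERS[int.from_bytes(digest[:4], "big") % len(SERVERS)]


class TinyKVCluster:
    """In-memory stand-in for a cluster: one dict per server."""

    def __init__(self):
        self.stores = {name: {} for name in SERVERS}

    def put(self, key: str, value: str) -> None:
        # The key alone decides which server holds the pair.
        self.stores[server_for(key)][key] = value

    def get(self, key: str):
        # Reads hash the key the same way, so they reach the same server.
        return self.stores[server_for(key)].get(key)


cluster = TinyKVCluster()
cluster.put("LAStation", "72F")
cluster.put("NYStation", "55F")
print(cluster.get("LAStation"))
```

Every client that hashes a key the same way finds the same server, so no central lookup table is needed for reads or writes.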

HBase: Key Features
  Distributed (fast and scalable)
  Column-oriented
  Versioned (multi-dimensional, with time as a dimension)
  Highly available (robust)
  Integration with Hadoop for performance
  Wide and sparsely populated tables
    NULLs cost nothing to store

HBase: Limitations
  Not SQL!
    No joins, no query language, no types
  Fairly new, unlike an RDBMS
  Secondary indexing is slow
  Transactions are not as robust
  No data types
    Not always a bad thing
  Consider the trade-offs against a relational database!

HBase Architecture
  A table is made up of any number of regions
  A region has a startKey and an endKey
    e.g. (WeatherTable, LAStation, JanuaryTemp) → (WeatherTable, NYStation, JanuaryTemp)
  Regions are distributed to different nodes
  Nodes store regions as one or more files in HDFS
    Each file is broken into blocks by HDFS
    HDFS replicates each block to other HDFS nodes
  Two types of nodes:
    Master
    Region Server
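The startKey/endKey idea can be sketched as a sorted lookup. The Python below is a toy model, not HBase client code: the split points, server names, and the functions `find_region` and `find_server` are hypothetical. It assumes each region covers the range from its start key (inclusive) up to the next region's start key (exclusive), with the first region starting at the empty key.

```python
from bisect import bisect_right

# Hypothetical split points for a weather table, in sorted order.
# Region i covers [REGION_START_KEYS[i], REGION_START_KEYS[i + 1]).
REGION_START_KEYS = [b"", b"LAStation", b"NYStation"]

# Hypothetical assignment of each region to a region server.
REGION_SERVERS = ["server-a", "server-b", "server-c"]


def find_region(row_key: bytes) -> int:
    """Return the index of the region whose key range contains row_key."""
    # bisect_right counts the start keys that are <= row_key;
    # the last of those start keys belongs to the owning region.
    return bisect_right(REGION_START_KEYS, row_key) - 1


def find_server(row_key: bytes) -> str:
    """Return the (made-up) server that hosts the region for row_key."""
    return REGION_SERVERS[find_region(row_key)]


for key in (b"ChicagoStation", b"LAStation", b"MiamiStation", b"NYStation"):
    print(key, "->", find_server(key))
```

Because the regions are contiguous and sorted, one binary search over the start keys is enough to route any row key.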

HBase Architecture
  Tables are sorted by row
    e.g. WeatherStation (LA, NY, etc.)
  The table schema defines column families
    e.g. temperature, humidity, precipitation
  A family consists of zero or more columns
    e.g. temperature:current, temperature:high, temperature:low, temperature:average
  Families are sorted and stored together for performance
    Readers tend to look at all 'attributes' of a group together
  Columns are versioned
    Changes are stored along a third dimension, the timestamp
    The version is designed to be a timestamp, but it does not have to be one
  Columns only exist when inserted, so NULLs are free
  Everything is a byte[]
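One consequence of "sorted by row" combined with "everything is a byte[]" is that sort order follows the raw bytes of the key, not its numeric meaning. The Python sketch below shows the effect with made-up numeric station IDs; `encode_int` is a hypothetical helper, and fixed-width big-endian encoding is one common way to keep numeric order, not the only one.

```python
import struct


def encode_int(n: int) -> bytes:
    """Encode a non-negative int as 4 big-endian bytes (fixed width)."""
    return struct.pack(">I", n)


def decode_int(b: bytes) -> int:
    """Inverse of encode_int."""
    return struct.unpack(">I", b)[0]


station_ids = [9, 10, 100]  # made-up IDs for illustration

# Keys stored as decimal text sort byte by byte: b"10" < b"100" < b"9".
as_text = sorted(str(n).encode("ascii") for n in station_ids)

# Fixed-width big-endian keys sort in the same order as the numbers.
as_fixed = sorted(encode_int(n) for n in station_ids)

print(as_text)
print([decode_int(b) for b in as_fixed])
```

Since the store has no data types, choosing an encoding that sorts the way scans need it is part of schema design.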

HBase Data Model
  (Table, Row, Family:Column, Timestamp) → Value
  SortedMap( RowKey, List( SortedMap( Column, List( Value, Timestamp ) ) ) )
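The mapping (Row, Family:Column, Timestamp) → Value can be modeled as nested maps for a single table. This is a minimal in-memory sketch in Python, not the HBase API: `TinyTable` and its methods are hypothetical, values are kept as bytes to echo "everything is a byte[]", and reads return the newest version at or before the requested timestamp.

```python
import time
from collections import defaultdict


class TinyTable:
    """Toy model: row key -> "family:column" -> {timestamp: value}."""

    def __init__(self):
        self.rows = defaultdict(lambda: defaultdict(dict))

    def put(self, row: bytes, column: str, value: bytes, ts=None) -> None:
        # A write adds a new version; older versions are kept.
        if ts is None:
            ts = int(time.time() * 1000)  # default: current time in ms
        self.rows[row][column][ts] = value

    def get(self, row: bytes, column: str, ts=None):
        # .get() avoids creating empty rows or columns on a read,
        # so cells that were never written take up no space.
        versions = self.rows.get(row, {}).get(column, {})
        candidates = [t for t in versions if ts is None or t <= ts]
        return versions[max(candidates)] if candidates else None

    def row_keys(self):
        # Rows are visited in sorted key order, as in a scan.
        return sorted(self.rows)


table = TinyTable()
table.put(b"NY", "temperature:current", b"70", ts=1)
table.put(b"NY", "temperature:current", b"72", ts=2)
table.put(b"LA", "temperature:high", b"90", ts=1)
print(table.get(b"NY", "temperature:current"))        # newest version
print(table.get(b"NY", "temperature:current", ts=1))  # version as of ts=1
print(table.row_keys())
```

A row only holds the columns that were actually written, which is why a wide table with mostly empty cells stays cheap.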
