Agenda
- Lab time
  - Work on Hadoop Problems (week 5)
  - Due next week (May 13)
  - Answer 15 questions to pass, more to learn a lot
  - Ask questions as needed!
- Lecture on HBase
Last Time
- Wrap up Hadoop
- Introduce distributed key/value stores
  - Memcache
- Introduce HBase
This Week
- HBase
  - Pros and cons
  - Architecture
  - Schema
  - Examples
Next Week
- Lab
  - Hadoop Problems are due
  - HBase Problems assigned
- Lecture
  - HBase usage and examples
  - Rest of the Hadoop ecosystem
    - Cassandra
    - Hive
    - Pig
    - Mahout, Katta, etc.
  - Move into clouds
    - Virtualization
    - Amazon EC2
Review
- Hadoop
  - Batch processing, no random access
  - Not real-time
  - Free-form (no concept of a schema)
- Distributed key-value stores
  - Map some value to some other value
  - Pairs are distributed across servers
- Distributed column-oriented databases
  - Impose more structure than a DHT
  - More freedom than a relational database
  - Organize/group data by column rather than row
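To make "pairs are distributed across servers" concrete, here is a minimal sketch that picks a server by hashing the key. The names `server_for` and `TinyKV` are made up for illustration, and the modulo placement is an assumption chosen for brevity; real clients often use consistent hashing instead so that adding a server moves fewer keys.

```python
import hashlib


def server_for(key, servers):
    """Pick one server for a key by hashing the key (hypothetical helper)."""
    digest = hashlib.sha1(key).digest()
    # Use the first 8 bytes of the digest as an integer, then wrap around
    # the number of servers. Same key always lands on the same server.
    return servers[int.from_bytes(digest[:8], "big") % len(servers)]


class TinyKV:
    """Toy in-memory stand-in for a distributed key-value store."""

    def __init__(self, server_names):
        self.server_names = list(server_names)
        # One dict per "server"; in a real system these are separate machines.
        self.data = {name: {} for name in self.server_names}

    def put(self, key, value):
        self.data[server_for(key, self.server_names)][key] = value

    def get(self, key):
        return self.data[server_for(key, self.server_names)].get(key)


kv = TinyKV(["server-a", "server-b", "server-c"])
kv.put(b"LAStation", b"70")
kv.put(b"NYStation", b"55")
print(kv.get(b"LAStation"))
```

Note that there is no way to scan keys in order here: hashing scatters neighbouring keys, which is one reason a store such as HBase keeps rows sorted instead.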
HBase: Key Features
- Distributed (fast and scalable)
- Column-oriented
- Versioned (multi-dimensional, with time)
- Highly available (robust)
- Integration with Hadoop for performance
- Wide and sparsely populated tables
  - NULLs are stored for free
HBase: Limitations
- Not SQL!
  - No joins, queries, types
- Fairly new, unlike an RDBMS
  - Secondary indexing is slow
  - Transactions are not as robust
- No data types
  - Not always a bad thing
- Consider the trade-offs against a relational database!
HBase Architecture
- A table is made up of any number of regions
- A region has a startKey and an endKey
  - (WeatherTable, LAStation, JanuaryTemp) → (WeatherTable, NYStation, JanuaryTemp)
- Regions are distributed to different nodes
- Nodes store regions as one or more files in HDFS
  - Each file is broken into blocks by HDFS
  - HDFS replicates each block to other HDFS nodes
- Two types of nodes:
  - Master
  - Region Server
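The region idea can be sketched as a lookup over sorted key ranges. This is an illustration only, not HBase's client code: `Region`, `find_region`, and the server names are hypothetical, and the keys are simplified to a bare station name instead of the (table, row, column) tuple shown on the slide. It assumes the start key is inclusive and the end key is exclusive.

```python
import bisect


class Region:
    """A contiguous slice of a table's sorted row keys (hypothetical type)."""

    def __init__(self, start_key, end_key, server):
        self.start_key = start_key  # inclusive; b"" means start of table
        self.end_key = end_key      # exclusive; None means end of table
        self.server = server        # node currently serving this region


def find_region(regions, row_key):
    """Return the region whose [start_key, end_key) range holds row_key.

    `regions` must be sorted by start_key.
    """
    starts = [r.start_key for r in regions]
    # Index of the last region whose start key is <= row_key.
    i = bisect.bisect_right(starts, row_key) - 1
    if i < 0:
        raise KeyError(row_key)
    region = regions[i]
    if region.end_key is not None and row_key >= region.end_key:
        raise KeyError(row_key)  # falls in a gap between regions
    return region


regions = [
    Region(b"", b"LAStation", "regionserver-1"),
    Region(b"LAStation", b"NYStation", "regionserver-2"),
    Region(b"NYStation", None, "regionserver-3"),
]
print(find_region(regions, b"MiamiStation").server)
```

Because rows are kept in sorted order, finding the right node is a binary search over region start keys rather than a hash of the key.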
HBase Architecture
- Tables are sorted by row
  - e.g. WeatherStation (LA, NY, etc.)
- Table schema defines column families
  - e.g. temperature, humidity, precipitation
- A family consists of zero or more columns
  - e.g. temperature:current, temperature:high, temperature:low, temperature:average
- Families are sorted and stored together for performance
  - Tend to look at all 'attributes' of a group together
- Columns are versioned
  - Changes are stored as a third dimension, which is the timestamp
  - Designed for timestamps, but the version does not really have to be one
- Columns only exist when inserted; NULLs are free
- Everything is a byte[]
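A toy in-memory table can show several of these points together: families are fixed by the schema, columns inside a family appear only when written, rows come back sorted, and every key and value is raw bytes. `TinyTable` and its methods are invented for this sketch and are not the HBase API; versioning is left out here to keep it short.

```python
class TinyTable:
    """Toy table with schema-defined families and sparse columns."""

    def __init__(self, families):
        self.families = set(families)  # fixed up front by the schema
        self.rows = {}                 # row key -> {b"family:column" -> value}

    def put(self, row, column, value):
        family, _, _ = column.partition(b":")
        if family not in self.families:
            raise KeyError("unknown column family: %r" % family)
        # The column comes into existence only now, for this row only.
        self.rows.setdefault(row, {})[column] = value

    def get(self, row, column):
        # A column that was never written simply is not there.
        return self.rows.get(row, {}).get(column)

    def scan(self, family=None):
        """Yield (row, column, value) sorted by row, then by column."""
        for row in sorted(self.rows):
            cells = self.rows[row]
            for column in sorted(cells):
                if family is None or column.startswith(family + b":"):
                    yield row, column, cells[column]

    def cell_count(self):
        return sum(len(cells) for cells in self.rows.values())


table = TinyTable([b"temperature", b"humidity"])
table.put(b"NY", b"temperature:high", b"85")
table.put(b"LA", b"temperature:current", b"70")
table.put(b"LA", b"humidity:current", b"40")
for cell in table.scan(family=b"temperature"):
    print(cell)
```

Only three cells are stored even though the two rows use different columns; the "missing" cells take no space, which is what "NULLs are free" means.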
HBase Data Model
- (Table, Row, Family:Column, Timestamp) → Value
- SortedMap( RowKey, List( SortedMap( Column, List( Value, Timestamp ) ) ) )
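The nested-map view can be written out for a single table with plain dictionaries. This is a sketch under assumptions: `put`, `get`, and `scan` are hypothetical helpers, sorting is done at read time because Python dicts are not sorted maps, and a read with a timestamp is taken to mean "newest version at or before that timestamp", which simplifies how a real store selects versions.

```python
def put(store, row, column, value, timestamp):
    """Record a new version of a cell, keeping versions newest first."""
    versions = store.setdefault(row, {}).setdefault(column, [])
    # Writing the same timestamp again replaces that version.
    versions[:] = [v for v in versions if v[1] != timestamp]
    versions.append((value, timestamp))
    versions.sort(key=lambda v: v[1], reverse=True)


def get(store, row, column, timestamp=None):
    """Return the newest value, or the newest at or before `timestamp`."""
    for value, ts in store.get(row, {}).get(column, []):
        if timestamp is None or ts <= timestamp:
            return value
    return None


def scan(store):
    """Yield (row, column, newest value) in sorted row and column order."""
    for row in sorted(store):
        for column in sorted(store[row]):
            yield row, column, store[row][column][0][0]


# store: row key -> column -> list of (value, timestamp), newest first
weather = {}
put(weather, b"LA", b"temperature:current", b"68", 1)
put(weather, b"LA", b"temperature:current", b"72", 2)
put(weather, b"NY", b"temperature:current", b"55", 1)
print(get(weather, b"LA", b"temperature:current"))
print(get(weather, b"LA", b"temperature:current", timestamp=1))
```

Old versions are kept rather than overwritten, so the timestamp acts as the extra dimension in the (Table, Row, Family:Column, Timestamp) → Value mapping.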