270 likes | 402 Views
Big Data – Distributed Database (HBase). Kalapriya Kannan IBM Research Labs July, 2013. No SQL Vs SQL, But wait…. What is SQL? Select- From- Where Statements Multirelational Queries Subqueries. Says “what to do” rather than “how to do”
E N D
Big Data – Distributed Database (HBase) Kalapriya Kannan IBM Research Labs July, 2013
No SQL Vs SQL, But wait…. • What is SQL? • Select- From- Where Statements • Multirelational Queries • Subqueries. • Says “what to do” rather than “how to do” • DBMS Engine figures out the `best’ way to execute query. • Called ‘query optimization’
What is RDBMS? • DBMS is Collection of –interrelated data and – set of programs to access the data • Convenient and efficient processing of data Database Application Software • Database: collection of entities and relationship among entities • RESPECTS – ACID • But only when its required • viz banking, finance, safety systems etc., • The kind of systems that people were building with computers decades ago…
RDBMS • Set of relation names: R • Set of attribute names: A • Relation schema: S=(r,{a1, ...,an}) • r relation name in R • {a1, ...,an} subset of A • e.g., (Dog,{Name, Age, Weight, Breed}) Relationship Name DOG
Structure of a SQL statement SELECT desired attributes FROM one or more tables WHERE condition about tuples of the tables
Example query Using University(name, degree) what degrees are offered by IITJodhpur? SELECT name FROM univeristy WHERE name = ’IITJodhpur’; Single relation query Begin with the relation in the “FROM” clause. Apply the selection indicated by the “WHERE” Clause Apply the extended projection indicated by the “SELECT” clause
Operational Semantics Include t.degree in the output • Think of a tuple variable visiting each tuple of the relation mentioned in FROM. • Check if the “current” tuple satisfies the WHERE clause. • If so, compute the attributes or expressions of the SELECT clause using the components of this tuple. Tuple variable ‘t’ loops over all tuples Checks if name is IIIT Jodhpur
Multi-relationship queries • Interesting queries often combine data from more than one relation. • We can address several relations in one query by listing them all in the FROM clause. • Distinguish attributes of the same name by “<relation>.<attribute>” . • Example joining two relationship • SELECT UNIVERSITY.name • FROM UNIVERSITY, DEPARMENTS • Where university.degree = departments.degree AND department.name = “computer science”
Operational Semantics • Almost the same as for single-relation queries: • Start with the product of all the relations in the FROM clause. • Apply the selection condition from the WHERE clause. • Project onto the list of attributes and expressions in the SELECT clause.
NoSQL: The Name • “SQL” = Traditional relational DBMS • Recognition over past decade or so: Not every data management/analysis problem is best solved using a traditional relational DBMS • “NoSQL” = “No SQL” = Not using traditional relational DBMS • “No SQL” Don’t use SQL language • “No SQL” = “Not ONLY SQL”
NoSQL systems • Alternative to traditional relational DBMS • Flexible schema • Quicker/cheaper to set up • Massive scalability • Relaxed consistency higher performance & availability • No declarative query language more programming • Relaxed consistency fewer guarantees • Several incarnations • MapReduce framework • Key-value stores • Document stores • Graph database systems
Different types of NoSQL • Column Store • Column is saved together, as opposed to row data • Super useful for data analytics • Hadoop, Cassandra, hypertable. • Key Value store • A key refers to a payload • Redis, Azure, HBase • Document/Object/XML Store • Key (and possibly other indexes) point at a serialized object • DB can operate against values in document. • Mango DB, Couch DB etc., • Graph Store • Nodes are stored independently, and the relationship between nodes (edges) are stored with data.
Key Value stores • Extremely simple interface • Data model: (key, value) pairs • Operations: Insert(key,value), Fetch(key), • Update(key), Delete(key) • Some allow (non-uniform) columns within value • Some allow Fetch on range of keys • Implementation: efficiency, scalability, fault-tolerance • Records distributed to nodes based on key • Replication • Single-record transactions, “eventual consistency” • Example systems • Google BigTable, Amazon Dynamo, Cassandra,Voldemort, HBase, …
Why NoSQL? • The opposite of ACID is • BASE • Basically Available – guaranteed availability • Soft state – the state of the system may change, even without a query (because of the node updates). • Eventually consistent - the system will eventually become consistent over time. • Eventual Consistency • Because of the distributed model, any server can answer any query • Servers communicate amongst themselves at their own pace (behind the scenes) • The server that answers your query might not have the latest data • Who cares if you see kim’s latest tweet.?
Distributed HBase • Distributed Hash table • Get, put, delete, scan and CaS • Denormalization is necessary • Not a parallel database, just distributed • Write ahead log/data durability • Master/slave replication • ACID compliance
Rows are sorted lexicographically based on row key contiguous set of sorted rows Region: HBase - Introduction • A sparse, multi-dimensional, sorted map • {row, column, timestamp} -> cell • Column = Column Family : Column qualifier • Rows are sorted lexicographically based on row key • Region: contiguous set of sorted rows • HBase: a large number of columns, a low number of column • families (2-3) HBase: a large number of columns, a low number of column families (2 - 3)
Column families • Different set of columns may have different properties and access patterns • Configuration by column family: • Compression (none, gzip,LZO) • Version retention policies • Cache priority • CFs stored seperately on disk: access one without wasting IO on the other. How to access HBase • Java API (thick client) • REST/HTTP • Apache Thrift (any language) • Hive/Pig for analytics.
HBase Basics • Operations are based on row keys • Single row operations: • Put • Get • Scan • Multi-row operations: • Scan • Put • No built-in joins (use MapReduce)
Terms and daemons • Region • A subset of a tables row’s, like a range partition • Automatically sharded • RegionServer • Serves data for reads and writes • Master • Responsible for co-ordinating the slaves • Assigns regions, detects failures of Region servers • Controls some admin functions