HBase and Bigtable Storage

Xiaoming Gao, Judy Qiu, Hui Li

Outline
• HBase and Bigtable Storage
• HBase Use Cases
• Load a CSV file into an HBase table with MapReduce
• Demo: Search Engine System with MapReduce Technologies (Hadoop/HDFS/HBase/Pig)

HBase Introduction
• HBase is an open-source, distributed, sorted map modeled after Google's Bigtable
• HBase is built on Hadoop, which provides:
  • Fault tolerance
  • Scalability
  • Batch processing with MapReduce
• HBase uses HDFS for storage

HBase Cluster Architecture
• Tables are split into regions, which are served by region servers
• Regions are divided vertically by column family into "stores"
• Stores are saved as files on HDFS
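As a rough illustration of how a sorted row-key space can be split into regions, the Python sketch below routes a row key to a region with a binary search over region start keys. The region boundaries and server names are invented for the example, and this is not how the HBase client itself is implemented.

```python
from bisect import bisect_right

# Hypothetical region layout: each region starts at its start key and runs
# up to (but not including) the next start key. "" marks the first region.
REGION_START_KEYS = ["", "g", "n", "t"]
REGION_SERVERS = ["rs1", "rs2", "rs3", "rs1"]  # made-up server names


def locate_region(row_key):
    """Return (region index, server name) for the region that holds row_key."""
    idx = bisect_right(REGION_START_KEYS, row_key) - 1
    return idx, REGION_SERVERS[idx]


if __name__ == "__main__":
    for key in ["apple", "grape", "zebra"]:
        print(key, locate_region(key))
```

Because the start keys are kept sorted, a lookup costs a logarithmic number of comparisons no matter how many regions the table has.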

Data Model: A Big Sorted Map
• Not a relational database; no SQL
• Tables consist of rows, each of which has a primary key (row key)
• Each row can have any number of columns:
    sortedMap<rowKey, List<sortedMap<Column, List<(Value, TimeStamp)>>>>
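The nested-map view of a table can be mimicked in a few lines of Python. This is a toy model under simplifying assumptions (plain strings for keys, a single level of columns, everything in memory); the class and method names are invented for illustration and are not the HBase API.

```python
from collections import defaultdict


class SortedMapTable:
    """Toy model: row key -> column -> list of (timestamp, value), newest first."""

    def __init__(self):
        self._rows = defaultdict(lambda: defaultdict(list))

    def put(self, row, column, value, ts):
        versions = self._rows[row][column]
        versions.append((ts, value))
        # Keep the newest version at the front of the list.
        versions.sort(key=lambda tv: tv[0], reverse=True)

    def get(self, row, column, max_versions=1):
        """Return up to max_versions values of one cell, newest first."""
        if row not in self._rows or column not in self._rows[row]:
            return []
        return [value for _, value in self._rows[row][column][:max_versions]]

    def scan(self, start_row="", stop_row=None):
        """Yield (row, column, newest value) in sorted row-key order."""
        for row in sorted(self._rows):
            if row < start_row or (stop_row is not None and row >= stop_row):
                continue
            for column in sorted(self._rows[row]):
                yield row, column, self._rows[row][column][0][1]
```

Rows come back from `scan` in key order regardless of insertion order, which is the property that makes range scans over row keys cheap.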

HBase VS. RDBMS

When to Use HBase
• Dataset scale
  • Indexing huge numbers of web pages or large genome data sets
  • Data mining of large social media data sets
• Read/write scale
  • Reads and writes are distributed because tables are distributed across nodes
  • Writes are extremely fast and require no index updates
• Batch analysis
  • Massive and convoluted SQL queries can be executed in parallel via MapReduce jobs

Use Cases
• Facebook Analytics
  • Real-time counters of URLs shared, preferred links
• Twitter
  • 25 TB of messages every month
• Mozilla
  • Stores crash reports, 2.5 million per day

Programming with HBase
• HBase shell
  • scan, list, create
• Native Java API
  • Get(byte[] row, byte[] column, long ts, int version)
• Non-Java clients
  • Thrift server (Ruby, C++, PHP)
  • REST server
• HBase MapReduce API
  • TableInputFormat/TableOutputFormat for MapReduce
• High-level interfaces
  • Pig, Hive

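Hands-on HBase MapReduce Programming
• HBase MapReduce API
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.client.Scan;
    import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
    import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;
    import org.apache.hadoop.hbase.mapreduce.TableMapper;
    import org.apache.hadoop.hbase.mapreduce.TableReducer;
    import org.apache.hadoop.hbase.util.Bytes;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.util.GenericOptionsParser;
    // Also used by the CSV loader on the following slides:
    import java.io.IOException;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;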

Hands-on: load a CSV file into an HBase table with MapReduce
• Main entry point of the program
    public static void main(String[] args) throws Exception {
      Configuration conf = HBaseConfiguration.create();
      // Strip generic Hadoop options; what remains are the program's own arguments
      String[] otherArgs = new GenericOptionsParser(conf, args).getRemainingArgs();
      if (otherArgs.length != 2) {
        System.err.println("Wrong number of arguments: " + otherArgs.length);
        System.err.println("Usage: <csv file> <hbase table name>");
        System.exit(-1);
      }
      Job job = configureJob(conf, otherArgs);
      System.exit(job.waitForCompletion(true) ? 0 : 1);
    }

Hands-on: load a CSV file into an HBase table with MapReduce
• Configure the HBase MapReduce job
    public static Job configureJob(Configuration conf, String[] args) throws IOException {
      Path inputPath = new Path(args[0]);
      String tableName = args[1];
      Job job = new Job(conf, NAME + "_" + tableName);
      job.setJarByClass(Uploader.class);
      FileInputFormat.setInputPaths(job, inputPath);
      // The input is a plain-text CSV file, so it is read line by line
      job.setInputFormatClass(TextInputFormat.class);
      job.setMapperClass(Uploader.class);
      // No reducer class is passed: this call only sets up the table output for tableName
      TableMapReduceUtil.initTableReducerJob(tableName, null, job);
      // Map-only job: the mapper's Puts are written directly to the table
      job.setNumReduceTasks(0);
      return job;
    }

Hands-on: load a CSV file into an HBase table with MapReduce
• The map function
      public void map(LongWritable key, Text line, Context context) throws IOException {
        // Input is a CSV file. Each map() call handles a single line;
        // the key identifies the line's position in the input.
        // Each line is comma-delimited: row,family,qualifier,value
        String[] values = line.toString().split(",");
        if (values.length != 4) {
          return; // skip malformed lines
        }
        byte[] row = Bytes.toBytes(values[0]);
        byte[] family = Bytes.toBytes(values[1]);
        byte[] qualifier = Bytes.toBytes(values[2]);
        byte[] value = Bytes.toBytes(values[3]);
        Put put = new Put(row);
        put.add(family, qualifier, value);
        try {
          context.write(new ImmutableBytesWritable(row), put);
        } catch (InterruptedException e) {
          e.printStackTrace();
        }
        // count and checkpoint are fields of the enclosing class (not shown on the slide)
        if (++count % checkpoint == 0) {
          context.setStatus("Emitting Put " + count);
        }
      }
    } // end of the mapper class
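The parsing rule used by the map function (four comma-separated fields, anything else skipped) can be tried outside Hadoop. The Python sketch below mirrors that rule; the function names are invented for illustration. One difference to keep in mind: Java's `split(",")` drops trailing empty fields, so a line such as `r1,cf,q,` is skipped by the Java mapper but accepted here with an empty value.

```python
def parse_csv_line(line):
    """Split 'row,family,qualifier,value' into four byte strings.

    Returns None for lines that do not have exactly four fields.
    """
    values = line.rstrip("\n").split(",")
    if len(values) != 4:
        return None
    return tuple(v.encode("utf-8") for v in values)


def csv_to_puts(lines):
    """Turn CSV lines into (row, family, qualifier, value) records, skipping bad lines."""
    puts = []
    for line in lines:
        parsed = parse_csv_line(line)
        if parsed is not None:
            puts.append(parsed)
    return puts


if __name__ == "__main__":
    sample = ["r1,cf,name,Ann\n", "broken line\n", "r2,cf,name,Bob\n"]
    for put in csv_to_puts(sample):
        print(put)
```

A value that itself contains a comma produces more than four fields and is dropped, so this format only suits data without embedded commas.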

Hands-on: load a CSV file into an HBase table with MapReduce
• Steps to run the program
  • Create an HBase table with the specified data schema
  • Compile the program with Ant
  • Run the program:
      /bin/hadoop org.apache.hadoop.hbase.mapreduce.CSV2HBase input.csv "test"
  • Check the inserted records in the HBase table

Extension: write output to an HBase table
    public static Job configureJob(Configuration conf, String[] args) throws IOException {
      // Assumed argument layout, inferred from the loop below:
      // <table name> <column family> <field> [<field> ...]
      String tableName = args[0];
      String columnFamily = args[1];
      conf.set(TableInputFormat.SCAN, TableMapReduceUtil.convertScanToString(new Scan()));
      conf.set(TableInputFormat.INPUT_TABLE, tableName);
      conf.set("index.tablename", tableName);
      conf.set("index.familyname", columnFamily);
      String[] fields = new String[args.length - 2];
      for (int i = 0; i < fields.length; i++) {
        fields[i] = args[i + 2];
      }
      conf.setStrings("index.fields", fields);
      // Note: this overrides the column family set from columnFamily above
      conf.set("index.familyname", "attributes");
      Job job = new Job(conf, tableName);
      job.setJarByClass(IndexBuilder.class);
      job.setMapperClass(Map.class);
      job.setNumReduceTasks(0); // map-only job
      job.setInputFormatClass(TableInputFormat.class);
      job.setOutputFormatClass(MultiTableOutputFormat.class);
      return job;
    }

Extension: write output to an HBase table
    public static class Map extends
        Mapper<ImmutableBytesWritable, Result, ImmutableBytesWritable, Writable> {
      // family and indexes are initialized elsewhere in the class (not shown on the slide),
      // as are the INDEX_COLUMN and INDEX_QUALIFIER constants
      private byte[] family;
      private HashMap<byte[], ImmutableBytesWritable> indexes;

      protected void map(ImmutableBytesWritable rowKey, Result result, Context context)
          throws IOException, InterruptedException {
        for (java.util.Map.Entry<byte[], ImmutableBytesWritable> index : indexes.entrySet()) {
          byte[] qualifier = index.getKey();
          ImmutableBytesWritable tableName = index.getValue();
          byte[] value = result.getValue(family, qualifier);
          if (value != null) {
            // The indexed value becomes the row key of the index table;
            // the original row key is stored as the cell value
            Put put = new Put(value);
            put.add(INDEX_COLUMN, INDEX_QUALIFIER, rowKey.get());
            context.write(tableName, put);
          }
        }
      }
    }
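The indexing idea in this mapper (the indexed value becomes the row key of an index table, and the original row key becomes the stored value) can be sketched in plain Python. The table and field names below are hypothetical, and dictionaries stand in for HBase rows.

```python
def build_index_puts(row_key, row_cells, indexed_fields):
    """Emit one (index_table, index_row_key, original_row_key) per indexed field.

    row_cells maps qualifier -> value for one source row.
    indexed_fields maps qualifier -> name of the index table for that field.
    Fields missing from the row are skipped, as in the mapper's null check.
    """
    puts = []
    for qualifier, index_table in indexed_fields.items():
        value = row_cells.get(qualifier)
        if value is not None:
            puts.append((index_table, value, row_key))
    return puts


if __name__ == "__main__":
    # Hypothetical index tables, one per indexed field.
    fields = {"email": "people-email", "phone": "people-phone"}
    row = {"email": "ann@example.org", "name": "Ann"}
    print(build_index_puts("row1", row, fields))
```

Looking up a source row by an indexed value then becomes a single get on the index table followed by a get on the original table.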

Big Data Challenge
• Peta: 10^15
• Tera: 10^12
• Giga: 10^9
• Mega: 10^6

Search Engine System with MapReduce Technologies
• Search Engine System for Summer School
• Gives an example of how to use MapReduce technologies to solve a big data challenge
• Uses Hadoop/HDFS/HBase/Pig
• Indexed 656K web pages (540 MB in size) selected from the ClueWeb09 data set
• Calculated ranking values for 2 million web sites

Architecture for SESSS
[Architecture diagram. Labels: Web UI; PHP script; Apache Server on Salsa Portal; Thrift client; Thrift server; HBase; HBase Tables (1. inverted index table, 2. page rank table); Inverted Indexing System (Apache Lucene); Ranking System; Pig script; Hive/Pig script; Hadoop Cluster on FutureGrid]

Demo: Search Engine System for Summer School
• build-index-demo.exe (build index with HBase)
• pagerank-demo.exe (compute page rank with Pig)
• http://salsahpc.indiana.edu/sesss/index.php

High Level Language: Pig Latin
Hui Li, Judy Qiu
Some material adapted from slides by Adam Kawa, the 3rd meeting of WHUG, June 21, 2012

What is Pig
• A framework for analyzing large unstructured and semi-structured data sets on top of Hadoop
• The Pig engine parses and compiles Pig Latin scripts into MapReduce jobs that run on Hadoop
• Pig Latin is a simple but powerful data flow language, similar to scripting languages

Motivation for Using Pig
• Faster development
  • Fewer lines of code (writing MapReduce becomes like writing SQL queries)
  • Code reuse (Pig library, Piggy Bank)
• One test: find the top 5 most frequent words
  • 10 lines of Pig Latin vs. 200 lines of Java
  • 15 minutes in Pig Latin vs. 4 hours in Java

Word Count using MapReduce

Word Count using Pig
    Lines = LOAD 'input/hadoop.log' AS (line: chararray);
    Words = FOREACH Lines GENERATE FLATTEN(TOKENIZE(line)) AS word;
    Groups = GROUP Words BY word;
    Counts = FOREACH Groups GENERATE group AS word, COUNT(Words) AS cnt;
    Results = ORDER Counts BY cnt DESC;
    Top5 = LIMIT Results 5;
    STORE Top5 INTO '/output/top5words';
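For readers who want to check the dataflow on a small input without a Pig installation, the same load, tokenize, group, count, order, and limit steps can be sketched in Python. This is only an approximation: `str.split()` breaks on whitespace, whereas Pig's TOKENIZE also splits on a few other delimiters, and the tie-breaking order chosen here is an assumption of the sketch.

```python
from collections import Counter


def top_words(lines, n=5):
    """Return the n most frequent words as (word, count), highest count first.

    Ties are broken alphabetically so the result is deterministic.
    """
    words = [word for line in lines for word in line.split()]  # tokenize and flatten
    counts = Counter(words)                                    # group and count
    ordered = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return ordered[:n]                                         # limit


if __name__ == "__main__":
    print(top_words(["a b a", "b a c"], n=2))
```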

Pig Tutorial
• Basic Pig knowledge (Word Count)
  • Pig data types
  • Pig operations
  • How to run Pig scripts
• Advanced Pig features (Kmeans Clustering)
  • Embedding Pig within Python
  • User-defined functions

Pig Data Types
• Concepts: fields, tuples, bags, relations
  • A field is a piece of data
  • A tuple is an ordered set of fields
  • A bag is a collection of tuples
  • A relation is a bag
• Simple types
  • int, long, float, double, boolean, null, chararray, bytearray
• Complex types
  • Tuple → row in a database
      (0002576169, Tome, 21, "Male")
  • Data bag → table or view in a database
      {(0002576169, Tome, 21, "Male"), (0002576170, Mike, 20, "Male"), (0002576171, Lucy, 20, "Female"), ...}

Pig Operations
• Loading data
  • LOAD loads input data
  • Lines = LOAD 'input/access.log' AS (line: chararray);
• Projection
  • FOREACH ... GENERATE ... (similar to SELECT)
  • Takes a set of expressions and applies them to every record
• Grouping
  • GROUP collects together records with the same key
• Dump/Store
  • DUMP displays results on the screen; STORE saves results to the file system
• Aggregation
  • AVG, COUNT, COUNT_STAR, MAX, MIN, SUM
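To make grouping and aggregation concrete, here is a small Python sketch of what GROUP followed by an AVG over each group computes. Tuples stand in for Pig records and a dictionary of lists stands in for the grouped bags; the helper names are invented for the example.

```python
from collections import defaultdict


def group_by(records, key_index):
    """GROUP: collect records that share the same key into one bag per key."""
    groups = defaultdict(list)
    for record in records:
        groups[record[key_index]].append(record)
    return dict(groups)


def avg_by(records, key_index, value_index):
    """GROUP, then FOREACH group GENERATE group, AVG(field)."""
    return {
        key: sum(rec[value_index] for rec in bag) / len(bag)
        for key, bag in group_by(records, key_index).items()
    }


if __name__ == "__main__":
    students = [("Tome", 21, 3.0), ("Mike", 20, 2.0), ("Lucy", 20, 4.0)]
    print(avg_by(students, key_index=1, value_index=2))
```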

How to run Pig Latin scripts
• Local mode
  • The local host and local file system are used
  • Neither Hadoop nor HDFS is required
  • Useful for prototyping and debugging
• MapReduce mode
  • Runs on a Hadoop cluster and HDFS
• Batch mode: run a script directly
    pig -x local my_pig_script.pig
    pig -x mapreduce my_pig_script.pig
• Interactive mode: use the Pig shell to run a script
    grunt> Lines = LOAD '/input/input.txt' AS (line:chararray);
    grunt> Unique = DISTINCT Lines;
    grunt> DUMP Unique;

Hands-on: Word Count using Pig Latin
    cd pigtutorial/pig-hands-on/
    tar -xf pig-wordcount.tar
    cd pig-wordcount
    pig -x local
    grunt> Lines = LOAD 'input.txt' AS (line: chararray);
    grunt> Words = FOREACH Lines GENERATE FLATTEN(TOKENIZE(line)) AS word;
    grunt> Groups = GROUP Words BY word;
    grunt> counts = FOREACH Groups GENERATE group, COUNT(Words);
    grunt> DUMP counts;

Sample: Kmeans using Pig Latin
A method of cluster analysis that aims to partition n observations into k clusters, in which each observation belongs to the cluster with the nearest mean.
• Assignment step: assign each observation to the cluster with the closest mean
• Update step: calculate the new means as the centroids of the observations in each cluster
Reference: http://en.wikipedia.org/wiki/K-means_clustering
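The two steps can be written down directly for one-dimensional data. The Python sketch below is a minimal illustration, not the Pig implementation used in this tutorial; keeping the old centroid when a cluster ends up empty is a choice made for the example.

```python
def assign(points, centroids):
    """Assignment step: for each point, the index of its nearest centroid."""
    return [
        min(range(len(centroids)), key=lambda j: abs(p - centroids[j]))
        for p in points
    ]


def update(points, labels, centroids):
    """Update step: the new centroid of each cluster is the mean of its points.

    An empty cluster keeps its previous centroid (an assumption of this sketch).
    """
    new_centroids = []
    for j, old in enumerate(centroids):
        members = [p for p, label in zip(points, labels) if label == j]
        new_centroids.append(sum(members) / len(members) if members else old)
    return new_centroids


if __name__ == "__main__":
    points = [1.0, 2.0, 9.0, 11.0]
    centroids = [0.0, 10.0]
    labels = assign(points, centroids)
    print(labels, update(points, labels, centroids))
```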

Kmeans Using Pig Latin
    PC = Pig.compile("""register udf.jar
        DEFINE find_centroid FindCentroid('$centroids');
        raw = load 'student.txt' as (name:chararray, age:int, gpa:double);
        centroided = foreach raw generate gpa, find_centroid(gpa) as centroid;
        grouped = group centroided by centroid;
        result = foreach grouped generate group, AVG(centroided.gpa);
        store result into 'output';
    """)

Kmeans Using Pig Latin
    while iter_num < MAX_ITERATION:
        PCB = PC.bind({'centroids': initial_centroids})
        results = PCB.runSingle()
        iter = results.result("result").iterator()
        centroids = [None] * v
        distance_move = 0.0
        # get the new centroids of this iteration and calculate
        # how far they moved since the last iteration
        for i in range(v):
            tuple = iter.next()
            centroids[i] = float(str(tuple.get(1)))
            distance_move = distance_move + fabs(last_centroids[i] - centroids[i])
        distance_move = distance_move / v
        if distance_move < tolerance:
            converged = True
            break
        ……
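The control flow of this driver loop (run one iteration, measure the average centroid movement, stop once it falls below a tolerance or the iteration limit is reached) can be exercised without Pig. In the Python sketch below, an in-memory one-dimensional k-means step replaces the bound Pig script; the function name and default values are assumptions made for illustration.

```python
def kmeans_1d(points, initial_centroids, tolerance=1e-6, max_iteration=100):
    """Iterate k-means on 1-D data until the centroids stop moving.

    Returns (centroids, iterations run, converged flag).
    """
    centroids = list(initial_centroids)
    for iteration in range(1, max_iteration + 1):
        # Assignment step: put each point into the cluster of its nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)), key=lambda j: abs(p - centroids[j]))
            clusters[nearest].append(p)
        # Update step: an empty cluster keeps its old centroid.
        new_centroids = [
            sum(c) / len(c) if c else old for c, old in zip(clusters, centroids)
        ]
        # Average distance moved, as in the driver loop.
        distance_move = sum(
            abs(a - b) for a, b in zip(centroids, new_centroids)
        ) / len(centroids)
        centroids = new_centroids
        if distance_move < tolerance:
            return centroids, iteration, True
    return centroids, max_iteration, False


if __name__ == "__main__":
    print(kmeans_1d([1.0, 2.0, 9.0, 11.0], [0.0, 10.0]))
```

Capping the number of iterations guards against inputs where the centroids keep shifting by more than the tolerance.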

Embedding Python Scripts with Pig Statements
• Pig does not support control-flow statements: if/else, while loops, for loops, etc.
• The Pig embedding API can leverage all language features provided by Python, including control flow:
  • Loops and exit criteria
  • Similar to a database embedding API
  • Easier parameter passing
• JavaScript is available as well
• The framework is extensible: any JVM implementation of a language could be integrated

User Defined Function
• What is a UDF
  • A way to perform an operation on a field or fields
  • Called from within a Pig script
  • Currently all done in Java
• Why use a UDF
  • You need to do more than grouping or filtering
  • Actually, filtering is a UDF
  • You may be more comfortable in Java land than in SQL/Pig Latin
    P = Pig.compile("""register udf.jar
        DEFINE find_centroid FindCentroid('$centroids');

Hands-on: Run Pig Latin Kmeans
    export PIG_CLASSPATH=/opt/pig/lib/jython-2.5.0.jar
    hadoop dfs -copyFromLocal input.txt ./input.txt
    pig -x mapreduce kmeans.py
    pig -x local kmeans.py

Hands-on: Run Pig Latin Kmeans
    2012-07-14 14:51:24,636 [main] INFO org.apache.pig.scripting.BoundScript - Query to run:
    register udf.jar
    DEFINE find_centroid FindCentroid('0.0:1.0:2.0:3.0');
    raw = load 'student.txt' as (name:chararray, age:int, gpa:double);
    centroided = foreach raw generate gpa, find_centroid(gpa) as centroid;
    grouped = group centroided by centroid;
    result = foreach grouped generate group, AVG(centroided.gpa);
    store result into 'output';
    Input(s): Successfully read 10000 records (219190 bytes) from: "hdfs://iw-ubuntu/user/developer/student.txt"
    Output(s): Successfully stored 4 records (134 bytes) in: "hdfs://iw-ubuntu/user/developer/output"
    last centroids: [0.371927835052,1.22406743491,2.24162171881,3.40173705722]

References:
• http://pig.apache.org (Pig official site)
• http://en.wikipedia.org/wiki/K-means_clustering
• Docs: http://pig.apache.org/docs/r0.9.0
• Papers: http://wiki.apache.org/pig/PigTalksPapers
• http://en.wikipedia.org/wiki/Pig_Latin
• Slides by Adam Kawa, the 3rd meeting of WHUG, June 21, 2012
Questions?

Acknowledgement
