Storage and Analysis of Tera-scale Data: 2 of 2

Storage and Analysis of Tera-scale Data: 2 of 2 • 415 Database Class • 11/24/09 • delip@jhu.edu

Previously … • (Traditional) Databases are not Swiss-Army knives • Large data problems require radically different solutions • Exploit the power of parallel I/O and computation • MapReduce as a framework for building reliable distributed data processing applications • Storing large data requires redesign from the ground up, i.e. filesystem (HDFS)

Previously … • HDFS : A reliable open source distributed file system • HBase : A sorted multi-dimensional map for record oriented data • Not Relational • No query language other than map semantics (Get and Put)

MapReduce is great but … Got to write all this for a WordCount!!!

MapReduce • Development cycles too long • Writing code • Packaging code • JOINs on large data are hard to implement in MapReduce • Today’s class: Keeping it Simple • Can we shield users from the details of MapReduce?

Pig • Started in Fall 2007 at Yahoo! • Simplify MapReduce by capturing common data processing patterns • Results in improved productivity • Lowers barrier to entry for large data processing • Today: Runs 40% of Yahoo!’s large data jobs • Who else: Twitter, LinkedIn, AOL, … • Similar efforts elsewhere: Sawzall (Google), Hive (Facebook)

Pig = Query Language + Interpreter • Language: Pig Latin • A data flow language • LOAD, STORE, FILTER, ORDER, GROUP, JOIN • Interpreter: Grunt • An execution environment to convert Pig Latin to MapReduce • Two modes • Local : JVM • Distributed: via Hadoop

Pig Latin Example from Pittsburgh Hadoop Users Group

Equivalent MapReduce code

Pig Latin from an Example (Example courtesy: Yahoo! Research) • Find users who visit “good” pages

Conceptual Dataflow

Pig Latin script
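The script itself is not reproduced in this transcript. As a rough illustration only, the "users who visit good pages" dataflow can be emulated in plain Python as join, group, average, then filter. The sample rows, the field names (user, url, pagerank), the function name good_users, and the 0.5 threshold are all assumptions made for this sketch, not values taken from the original example.

```python
from collections import defaultdict

# Hypothetical sample data; not taken from the original slides.
visits = [("amy", "cnn.com"), ("amy", "bbc.com"), ("fred", "spam.biz")]
pages = {"cnn.com": 0.9, "bbc.com": 0.7, "spam.biz": 0.1}  # url -> pagerank


def good_users(visits, pages, threshold=0.5):
    """Join visits with pages on url, group by user, average, then filter."""
    ranks_by_user = defaultdict(list)
    for user, url in visits:
        if url in pages:                            # JOIN on url (inner join)
            ranks_by_user[user].append(pages[url])  # GROUP BY user
    averages = {u: sum(r) / len(r) for u, r in ranks_by_user.items()}
    # FILTER: keep users whose average pagerank exceeds the threshold
    return sorted(u for u, avg in averages.items() if avg > threshold)


print(good_users(visits, pages))
```

Each Python step corresponds to one box in a dataflow of this kind, which is the property Pig Latin is designed to expose: one statement per transformation.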

Pig Latin: The Language • Structure • Collection of STATEMENTS • Statement has an OPERATOR and ends in ‘;’

Summary of Pig Latin Operators

LOAD/STORE and Schemas
grunt> records = LOAD 'input/sample.txt'
>> AS (year:int, temperature:int, quality:int);
grunt> records = LOAD 'input/sample.txt';
grunt> STORE records INTO 'output/sample.out';

FILTER
grunt> records = LOAD 'input/sample.txt'
>> AS (year:int, temperature:int, quality:int);
grunt> bad_records = FILTER records BY quality < 0;
grunt> bad_years = FOREACH bad_records GENERATE year;
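To make the semantics concrete, here is a minimal Python sketch of what FILTER and FOREACH ... GENERATE do to a relation. The sample rows and the helper names filter_by and generate are hypothetical; this emulates the behavior on an in-memory list and says nothing about how Pig executes it.

```python
# Hypothetical rows in the (year, temperature, quality) layout used on the slide.
records = [(1950, 0, 1), (1950, 22, -1), (1949, 111, 1), (1949, 78, -4)]


def filter_by(rows, predicate):
    """FILTER: keep only the rows for which the predicate holds."""
    return [row for row in rows if predicate(row)]


def generate(rows, *positions):
    """FOREACH ... GENERATE: project the given column positions from each row."""
    return [tuple(row[p] for p in positions) for row in rows]


bad_records = filter_by(records, lambda r: r[2] < 0)  # quality < 0
bad_years = generate(bad_records, 0)                  # year only
print(bad_years)
```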

STREAM
grunt> records = LOAD 'input/sample.txt'
>> AS (year:int, temperature:int, quality:int);
grunt> projected = FOREACH records GENERATE $0, $2;
grunt> projected = STREAM records THROUGH `cut -f 1,3`;

JOIN
grunt> records = LOAD 'input/sample.txt'
>> AS (year:int, temperature:int, quality:int);
grunt> sales = LOAD 'input/sales.txt'
>> AS (year:int, profit:float);
grunt> combined = JOIN records BY year, sales BY year;
grunt> profit_year = FOREACH combined GENERATE sales::profit, records::year;
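The following Python sketch illustrates an inner equi-join on year in the style of a reduce-side join: rows from both inputs are bucketed by key, then paired within each bucket. The sample data and the function name join_by_year are hypothetical, and this is only one way such a join can be laid out, not a description of Pig's actual plan.

```python
from collections import defaultdict

# Hypothetical sample data.
records = [(1949, 111, 1), (1950, 22, 1), (1951, 30, 4)]  # year, temperature, quality
sales = [(1949, 10.5), (1950, 7.25), (1952, 3.0)]         # year, profit


def join_by_year(left, right):
    """Inner equi-join on the first field of each row."""
    buckets = defaultdict(lambda: ([], []))
    for row in left:                  # "map": route each row by its join key
        buckets[row[0]][0].append(row)
    for row in right:
        buckets[row[0]][1].append(row)
    joined = []
    for year in sorted(buckets):      # "reduce": pair up rows that share a key
        left_rows, right_rows = buckets[year]
        joined.extend(l + r for l in left_rows for r in right_rows)
    return joined


combined = join_by_year(records, sales)
profit_year = [(row[4], row[0]) for row in combined]  # profit, year
print(profit_year)
```

Note that each joined row carries the year twice (once from each input), which is why a projection after a join has to say which copy it wants.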

GROUP
grunt> records = LOAD 'input/sample.txt'
>> AS (year:int, temperature:int, quality:int);
grunt> combined = GROUP records BY quality;
grunt> combined = GROUP records BY (quality < 0 ? 'bad' : 'good');
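GROUP does not aggregate by itself; it produces one output row per key, holding the key and a bag of all input rows with that key. A small Python sketch of that shape, using hypothetical sample rows and a hypothetical helper named group_by:

```python
from collections import defaultdict

# Hypothetical rows: year, temperature, quality.
records = [(1949, 111, 1), (1950, 22, 1), (1950, 0, 4), (1951, 30, 4)]


def group_by(rows, key):
    """GROUP: return (group, bag) pairs, where bag holds every row with that key."""
    bags = defaultdict(list)
    for row in rows:
        bags[key(row)].append(row)
    return sorted(bags.items())


combined = group_by(records, key=lambda r: r[2])  # BY quality
for group, bag in combined:
    print(group, bag)
```

Because the key is computed by a function, the same helper also covers grouping by an expression rather than a plain field.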

ORDER
grunt> records = LOAD 'input/sample.txt'
>> AS (year:int, temperature:int, quality:int);
grunt> combined = ORDER records BY year, quality DESC;
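In this ORDER statement DESC applies only to quality, so rows are sorted by year ascending and, within a year, by quality descending. A Python sketch of the same ordering on hypothetical sample rows:

```python
# Hypothetical rows: year, temperature, quality.
records = [(1950, 0, 1), (1949, 111, 4), (1950, 22, 5), (1949, 78, 1)]

# Year ascending, then quality descending.
# Negating quality works here only because it is numeric.
combined = sorted(records, key=lambda r: (r[0], -r[2]))
print(combined)
```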

Parallelism
grunt> records = LOAD 'input/sample.txt'
>> AS (year:int, temperature:int, quality:int);
grunt> combined = GROUP records BY quality PARALLEL 50;
PARALLEL sets the number of reducers; it can be added to any operator that starts a reduce phase (e.g. GROUP, JOIN, ORDER)

User Defined Functions • Unlike standard SQL, Pig Latin can invoke user-defined functions in a query • In SQL, only proprietary extensions like PL/SQL allow that
grunt> records = LOAD 'input/sample.txt'
>> AS (year:int, temperature:int, quality:int);
grunt> REGISTER mypackage.jar;
grunt> DEFINE MyFunc mypackage.MyFuncImpl.myFunc();
grunt> combined = GROUP records BY MyFunc(quality);

PIG LATIN Review

Revisiting WordCount
grunt> sentences = LOAD 'input/*.txt'
>> USING TextLoader() AS (sentence:chararray);
grunt> words = FOREACH sentences GENERATE FLATTEN(TOKENIZE(sentence)) AS word;
grunt> word_kinds = GROUP words BY word;
grunt> word_count = FOREACH word_kinds
>> GENERATE group, COUNT(words);
grunt> STORE word_count INTO 'output/wordcount';
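For comparison, the same WordCount dataflow can be emulated in a few lines of Python on an in-memory list. The sample sentences and the function name word_count are hypothetical, and str.split here splits on whitespace only, which is a simplification of TOKENIZE.

```python
from collections import Counter

# Hypothetical input lines standing in for the loaded text files.
sentences = ["the quick brown fox", "the lazy dog", "the fox"]


def word_count(lines):
    """Split lines into words, then count the occurrences of each word."""
    words = [w for line in lines for w in line.split()]  # TOKENIZE + FLATTEN
    return Counter(words)                                # GROUP BY word + COUNT


counts = word_count(sentences)
print(counts.most_common(2))
```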

No more this …

Related Project: Hive • Started at Facebook, now open source • Like Pig, but with a SQL-like query language (HiveQL) • Trend: Move towards in-database MapReduce • Allows existing DB applications to scale up • Makes MapReduce capabilities easily accessible • Business opportunity: www.vertica.com

Summary (this and last class) • MapReduce as a radically different solution to large data problems • Exploit the power of parallel I/O and computation • Need to think from the “ground up” • Filesystem: HDFS • Table store: HBase • Basic MapReduce is too complicated for DB end users

Summary (this and last class) • Efforts to simplify MapReduce-based data processing • Pig from Yahoo! • Pig Latin: a not-so-SQL-like language • A data flow language • LOAD, STORE, FILTER, ORDER, GROUP, JOIN • Facebook Hive supports a direct SQL-like interface • Emerging trend: Fusion of MapReduce and DB technologies

Happy Thanksgiving!
