Storage and Analysis of Tera-scale Data : 2 of 2 415 Database Class 11/24/09 delip@jhu.edu
Previously … • (Traditional) Databases are not Swiss-Army knives • Large data problems require radically different solutions • Exploit the power of parallel I/O and computation • MapReduce as a framework for building reliable distributed data processing applications • Storing large data requires redesign from the ground up, i.e. filesystem (HDFS)
Previously … • HDFS : A reliable open source distributed file system • HBase : A sorted multi-dimensional map for record oriented data • Not Relational • No query language other than map semantics (Get and Put)
MapReduce is great but … Got to write all this for a WordCount!!!
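The program this slide points at is not reproduced in the text. As a rough stand-in, the sketch below imitates the three phases a MapReduce WordCount goes through (map, shuffle, reduce) in plain single-process Python. The names `map_phase`, `shuffle`, `reduce_phase`, and `word_count` are made up for illustration and are not Hadoop APIs; a real Hadoop job would add job configuration, input/output handling, and packaging on top of this.

```python
from collections import defaultdict


def map_phase(line):
    """Map: emit a (word, 1) pair for every whitespace-separated word."""
    for word in line.split():
        yield word, 1


def shuffle(pairs):
    """Shuffle: collect all values that were emitted for the same key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(word, counts):
    """Reduce: sum the counts collected for one word."""
    return word, sum(counts)


def word_count(lines):
    """Run map, shuffle, and reduce in sequence over an iterable of lines."""
    pairs = (pair for line in lines for pair in map_phase(line))
    return dict(reduce_phase(w, c) for w, c in shuffle(pairs).items())


if __name__ == "__main__":
    # Hypothetical input, only for illustration.
    print(word_count(["the quick brown fox", "the lazy dog"]))
```

Even this toy version needs three separate pieces of logic for one aggregate, which is the overhead the following slides set out to remove.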
MapReduce • Development cycles too long • Writing code • Packaging code • JOINs on large data too hard to implement in MapReduce • Today’s class: Keeping it Simple • Can we hide MapReduce from users behind a higher-level abstraction?
Pig • Started in Fall 2007 at Yahoo! • Simplify MapReduce by capturing common data processing patterns • Results in improved productivity • Lowers barrier to entry for large data processing • Today: Runs 40% of Yahoo!’s large data jobs • Who else: Twitter, LinkedIn, AOL, … • Similar efforts elsewhere: Sawzall (Google), Hive (Facebook)
Pig = Query Language + Interpreter • Language: Pig Latin • A data flow language • LOAD, STORE, FILTER, ORDER, GROUP, JOIN • Interpreter: Grunt • An execution environment to convert Pig Latin to MapReduce • Two modes • Local : JVM • Distributed: via Hadoop
Pig Latin Example from Pittsburgh Hadoop Users Group
Pig Latin from an Example (Example courtesy: Yahoo! Research) • Find users who visit “good” pages
Pig Latin: The Language • Structure • Collection of STATEMENTS • Statement has an OPERATOR and ends in ‘;’
LOAD/STORE and Schemas
grunt> records = LOAD 'input/sample.txt'
>> AS (year:int, temperature:int, quality:int);
grunt> records = LOAD 'input/sample.txt';
grunt> STORE records INTO 'output/sample.out';
FILTER
grunt> records = LOAD 'input/sample.txt'
>> AS (year:int, temperature:int, quality:int);
grunt> bad_records = FILTER records BY quality < 0;
grunt> bad_years = FOREACH bad_records GENERATE year;
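STREAM
grunt> records = LOAD 'input/sample.txt'
>> AS (year:int, temperature:int, quality:int);
grunt> projected = FOREACH records GENERATE $0, $2;
grunt> projected = STREAM records THROUGH `cut -f1,3`;
Pig positions ($0, $2) are 0-based, while cut numbers fields from 1, so both statements select the first and third fields.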
JOIN
grunt> records = LOAD 'input/sample.txt'
>> AS (year:int, temperature:int, quality:int);
grunt> sales = LOAD 'input/sales.txt'
>> AS (year:int, profit:float);
grunt> combined = JOIN records BY year, sales BY year;
grunt> profit_year = FOREACH combined GENERATE profit, records::year;
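To see why a hand-written join is tedious in MapReduce, here is a minimal single-process sketch of a reduce-side inner join on `year`, which is one common way such a join can be done. The function name `reduce_side_join` and the sample tuples are hypothetical; the sketch does not claim to show how Pig itself plans a JOIN.

```python
from collections import defaultdict


def reduce_side_join(records, sales):
    """Inner join on year (field 0 of both inputs), reduce-side style."""
    # Map: tag each tuple with its source relation and key it by year.
    tagged = [(r[0], ("records", r)) for r in records]
    tagged += [(s[0], ("sales", s)) for s in sales]

    # Shuffle: bring all tuples that share a year together.
    by_year = defaultdict(lambda: {"records": [], "sales": []})
    for year, (source, row) in tagged:
        by_year[year][source].append(row)

    # Reduce: for each year, emit the cross product of the two sides.
    joined = []
    for year in sorted(by_year):
        for r in by_year[year]["records"]:
            for s in by_year[year]["sales"]:
                joined.append(r + s)
    return joined


if __name__ == "__main__":
    # Hypothetical rows: (year, temperature, quality) and (year, profit).
    records = [(1950, 0, 1), (1950, 22, 1), (1949, 111, 1)]
    sales = [(1950, 10.5), (1948, 3.0)]
    print(reduce_side_join(records, sales))
```

Each output tuple carries the fields of both inputs, including both copies of `year`, which mirrors the shape of the relation produced by the one-line Pig JOIN.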
GROUP
grunt> records = LOAD 'input/sample.txt'
>> AS (year:int, temperature:int, quality:int);
grunt> combined = GROUP records BY quality;
grunt> combined = GROUP records BY quality < AVG(quality);
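GROUP does not aggregate by itself: it produces one tuple per distinct key, holding the key (named `group` in Pig) and a bag of all input tuples that share it. The sketch below mimics that shape with plain Python lists; `group_by` and the sample rows are hypothetical names and data used only for illustration.

```python
from collections import defaultdict


def group_by(relation, key_index):
    """Return (group, bag) pairs; each bag keeps the full original tuples."""
    bags = defaultdict(list)
    for row in relation:
        bags[row[key_index]].append(row)
    # Sort by key only so the output order is deterministic for the example.
    return sorted(bags.items(), key=lambda item: item[0])


if __name__ == "__main__":
    # Hypothetical rows: (year, temperature, quality).
    records = [(1950, 0, 1), (1950, 22, 1), (1949, 111, 2)]
    for group, bag in group_by(records, 2):
        print(group, bag)
```

An aggregate such as COUNT or AVG is then applied to each bag in a later FOREACH step.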
ORDER
grunt> records = LOAD 'input/sample.txt'
>> AS (year:int, temperature:int, quality:int);
grunt> combined = ORDER records BY year, quality DESC;
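In the ORDER statement, DESC applies only to the field it follows, so the result is sorted by `year` ascending and, within a year, by `quality` descending. A small Python sketch of the same ordering, with the hypothetical helper `order_by_year_quality_desc` and made-up rows:

```python
def order_by_year_quality_desc(records):
    """Sort (year, temperature, quality) tuples like
    ORDER records BY year, quality DESC."""
    # year ascending; negate quality so larger values come first.
    return sorted(records, key=lambda r: (r[0], -r[2]))


if __name__ == "__main__":
    # Hypothetical rows: (year, temperature, quality).
    records = [(1950, 0, 1), (1949, 111, 2), (1950, 22, 4), (1949, 78, 1)]
    print(order_by_year_quality_desc(records))
```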
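Parallelism
grunt> records = LOAD 'input/sample.txt'
>> AS (year:int, temperature:int, quality:int);
grunt> combined = GROUP records BY quality PARALLEL 50;
The PARALLEL clause can be attached to any operator that starts a reduce phase (e.g. GROUP, JOIN, ORDER, DISTINCT); it sets the number of reducers.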
User Defined Functions
• Unlike standard SQL, Pig Latin can invoke user-defined functions directly in a query • Proprietary SQL extensions such as PL/SQL also allow this
grunt> records = LOAD 'input/sample.txt'
>> AS (year:int, temperature:int, quality:int);
grunt> REGISTER mypackage.jar;
grunt> DEFINE MyFunc mypackage.MyFuncImpl.myFunc();
grunt> combined = GROUP records BY MyFunc(quality);
Revisiting WordCount
grunt> sentences = LOAD 'input/*.txt'
>> USING TextLoader() AS (sentence:chararray);
grunt> words = FOREACH sentences GENERATE FLATTEN(TOKENIZE(sentence)) AS word;
grunt> word_kinds = GROUP words BY word;
grunt> word_count = FOREACH word_kinds
>> GENERATE group, COUNT(words);
grunt> STORE word_count INTO 'output/wordcount';
Related Project: Hive • Started at Facebook, now open source • Like Pig, but queried through a SQL-like language (HiveQL) • Trend: Move towards in-database MapReduce • Allows existing DB applications to scale up • Makes MapReduce capabilities easily accessible • Business opportunity: www.vertica.com
Summary (this and last class) • MapReduce as a radically different solution to large data problems • Exploit the power of parallel I/O and computation • Need to think from the “ground up” • Filesystem: HDFS • Table store: HBase • Basic MapReduce is too complicated for DB end users
Summary (this and last class) • Efforts to simplify MapReduce-based data processing • Pig from Yahoo! • Pig Latin: a not-so-SQL-like language • A data flow language • LOAD, STORE, FILTER, ORDER, GROUP, JOIN • Facebook's Hive supports a direct SQL interface • Emerging trend: Fusion of MapReduce and DB technologies