Open Source!
Jaql → pipes
Unix pipes for the JSON data model
Kevin Beyer, Vuk Ercegovac, Eugene Shekita, Jun Rao, Ning Li, Sandeep Tata
IBM Almaden Research Center
http://code.google.com/p/jaql/
http://jaql.org/
Goals for Jaql
• Provide a simple, yet powerful language to manipulate semi-structured data.
• Use JSON as a data model
  • Data is usually converted to/from JSON view
  • Most data has a natural JSON representation
• Easily extended using Java, Python, JavaScript, …
• Exploit massive parallelism using Hadoop
What is in the upcoming release?
• User feedback on previous release
  • Too XQuery-like (yuck factor)
  • Too complex
  • Too composable, too nested, too verbose
  • Unclear what is parallelized
• Next release (planned 10/30/2008)
  • Vastly simplified syntax
  • Inspired by Unix Pipes
A query is a pipeline
source → operator → operator → sink

    $people = file …;                    // declare files
    $greetings = file …;
    $people                              // read input (json array)
    -> filter $.type = 'friendly'        // find friendly people
    -> map { hello: $.name }             // keep just name
    -> write $greetings;                 // write output

• Operations are listed in natural order, rather than last operation first.
• Runs as one map job.
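To make the "natural order" point concrete, here is a minimal Python sketch of the same source → filter → map → sink flow. It is not Jaql and says nothing about how Jaql compiles to Hadoop; the `pipeline` helper and the sample records are assumptions made for illustration.

```python
from functools import reduce

# Hypothetical sample input; the slide elides the file contents.
people = [
    {"name": "Ann", "type": "friendly"},
    {"name": "Bob", "type": "grumpy"},
    {"name": "Cy", "type": "friendly"},
]

def pipeline(source, *stages):
    """Feed `source` through each stage in the order written."""
    return reduce(lambda stream, stage: stage(stream), stages, source)

greetings = pipeline(
    iter(people),                                         # read input
    lambda s: (p for p in s if p["type"] == "friendly"),  # find friendly people
    lambda s: ({"hello": p["name"]} for p in s),          # keep just name
    list,                                                 # sink: collect output
)
```

Each stage consumes a stream and produces a stream, so the stages read top to bottom in execution order. Written as nested function calls, the same query would list the last operation first.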
Aggregate
• Aggregate the input into a single value
• Uses a push-based, streaming, combining API for aggregate functions

    $people
    -> filter by $.birthdate < date('1990-01-01')
    -> aggregate count($);               // count the older people

• Runs as one map / combine / reduce job.
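The slide does not show the aggregate API itself, so the following Python sketch only illustrates what "push-based, streaming, combining" could look like. The `Count` class, its method names, and the sample data are all hypothetical.

```python
from datetime import date

class Count:
    """Hypothetical combining aggregate: items are pushed in one at a time."""
    def __init__(self):
        self.n = 0
    def add(self, item):        # push one item (streaming)
        self.n += 1
    def combine(self, other):   # merge a partial result from another split
        self.n += other.n
    def result(self):
        return self.n

def aggregate(splits, make_agg, predicate):
    partials = []
    for split in splits:        # map + combine side: one partial per split
        agg = make_agg()
        for item in split:
            if predicate(item):
                agg.add(item)
        partials.append(agg)
    total = make_agg()          # reduce side: merge the partials
    for partial in partials:
        total.combine(partial)
    return total.result()

# Hypothetical input, already divided into two splits.
splits = [
    [{"name": "Ann", "birthdate": date(1985, 3, 1)},
     {"name": "Bob", "birthdate": date(1995, 7, 9)}],
    [{"name": "Cy", "birthdate": date(1979, 12, 31)}],
]
older = aggregate(splits, Count, lambda p: p["birthdate"] < date(1990, 1, 1))
```

Because partial counts can be merged, each split can be reduced locally before anything is sent to the final step, which is the role a combiner plays in a map / combine / reduce job.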
Partition
• Partition one or more inputs
• Send each individual partition through a sub-pipe
• Merge the results

    $people
    -> filter by $.birthdate < date('1990-01-01')
    -> partition by $t = $.type                       // partition the older people by type
       |- aggregate { type: $t, n: count($) } -|;     // aggregate per partition

• Runs as one map / combine / reduce job.
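As a rough illustration of the partition / sub-pipe / merge steps, the Python sketch below groups the filtered records by type and counts each group. The `partition_by` helper and the sample records are assumptions; the sketch runs in a single process and does not model Hadoop execution.

```python
from collections import defaultdict
from datetime import date

def partition_by(items, key):
    """Group items by key value; each group plays the role of one partition."""
    groups = defaultdict(list)
    for item in items:
        groups[key(item)].append(item)
    return groups

# Hypothetical sample input.
people = [
    {"name": "Ann", "type": "friendly", "birthdate": date(1985, 3, 1)},
    {"name": "Bob", "type": "grumpy", "birthdate": date(1995, 7, 9)},
    {"name": "Cy", "type": "friendly", "birthdate": date(1979, 12, 31)},
    {"name": "Dee", "type": "grumpy", "birthdate": date(1988, 5, 20)},
]

older = [p for p in people if p["birthdate"] < date(1990, 1, 1)]

# Sub-pipe per partition: aggregate { type: $t, n: count($) }; merge = collect.
counts = [
    {"type": t, "n": len(group)}
    for t, group in partition_by(older, lambda p: p["type"]).items()
]
```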
User-defined operators
• Call user code
• Similar to calling user program / script in Unix
• Input and output are pipelined
• Like "Hadoop streaming"

    $people
    -> myBestMatches($, 3);              // pass "standard input" to external code

Not Parallel!
Per-partition sub-pipe
partition ("split") → sub-pipe → merge
• Partition one or more inputs on a key
• Send each partition through (duplicate) sub-pipe
• Merge the results

    $people
    -> partition by $.type               // partition people by type
       |- sort by $.rating               // sort partition by rating
       -> top 100                        // keep just the first 100 in partition
       -> myBestMatches($,3) -|;         // find best matches per partition

• Runs as one map / reduce job.
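The sub-pipe above can be sketched in Python as a function applied to every partition. Several details are assumptions: the sort is taken to be ascending, `best_matches` is a placeholder because the slide does not say what `myBestMatches` does, and the sample data is invented.

```python
from collections import defaultdict
from itertools import islice

def best_matches(items, k):
    """Placeholder for the user-defined operator; here it keeps the first k."""
    return list(islice(items, k))

def sub_pipe(partition, n, k):
    ranked = sorted(partition, key=lambda p: p["rating"])  # sort by rating (ascending assumed)
    top = islice(ranked, n)                                # keep the first n in the partition
    return best_matches(top, k)                            # user code, once per partition

# Hypothetical sample input.
people = [
    {"name": "Ann", "type": "friendly", "rating": 3},
    {"name": "Bob", "type": "grumpy", "rating": 1},
    {"name": "Cy", "type": "friendly", "rating": 1},
    {"name": "Dee", "type": "friendly", "rating": 2},
]

partitions = defaultdict(list)
for person in people:
    partitions[person["type"]].append(person)              # partition by type

# Merge: concatenate the output of every per-partition sub-pipe.
merged = [m for part in partitions.values() for m in sub_pipe(part, n=2, k=1)]
```

The user-defined function is called once per partition rather than once over the whole input, which is what lets the per-partition calls proceed independently.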
Partition by default
• Run sub-pipe on each partition of the input
• If input is a file, use its partition, else arbitrary
• Expresses parallelism of user-defined operator

    $file
    -> partition by default              // run per file partition
       |- buildPartialModel($) -|        // partial model built per partition
    -> unifyModels($);                   // unify all the partial models into one

• Runs as one map job + serial unify.
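A small Python sketch of the partial-model pattern follows. The slide does not say what the model is, so a running mean stands in for it; `build_partial_model`, `unify_models`, and the data are hypothetical.

```python
def build_partial_model(partition):
    """Hypothetical partial model: sum and count of the values in one partition."""
    values = list(partition)
    return {"sum": sum(values), "count": len(values)}

def unify_models(partials):
    """Serial step: fold every partial model into one final model."""
    total = sum(m["sum"] for m in partials)
    count = sum(m["count"] for m in partials)
    return {"mean": total / count}

# Hypothetical file contents, already divided into its partitions.
file_partitions = [[1.0, 2.0, 3.0], [4.0], [5.0, 6.0]]

# Each partial model depends on one partition only, so this step can run in parallel.
partials = [build_partial_model(p) for p in file_partitions]
model = unify_models(partials)
```

The pattern works when the partial results are small and cheap to merge, so the serial unify step stays a minor part of the total work.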
Join
People:   [ { id: 1, name: 'Jack' }, { id: 2, name: 'Jill' }, … ]
Children: [ { id: 3, name: 'Becky', father: 1, mother: 2 }, … ]

    $people = file …;
    $children = file …;
    join $people on $people.id,
         $children on $children.mother;

Result:

    [ { people: { id: 2, name: 'Jill' },
        children: { id: 3, name: 'Becky', father: 1, mother: 2 } },
      … ]

• Result is a record with inputs as values
• Joins on multiple inputs with multiple conditions
• Inner, left-, right-, full-outer joins
• Runs as one map / reduce job.
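The shape of the join result can be reproduced with a small in-memory hash join in Python. This is an illustrative sketch of an inner join only, not how Jaql implements joins; the `inner_join` helper is an assumption, while the data comes from the slide.

```python
from collections import defaultdict

people = [{"id": 1, "name": "Jack"}, {"id": 2, "name": "Jill"}]
children = [{"id": 3, "name": "Becky", "father": 1, "mother": 2}]

def inner_join(left, left_key, right, right_key, left_name, right_name):
    """Inner equi-join; each output record holds the matched inputs as values."""
    index = defaultdict(list)
    for l in left:                       # build side: index left items by key
        index[left_key(l)].append(l)
    return [
        {left_name: l, right_name: r}
        for r in right                   # probe side: look up each right item
        for l in index.get(right_key(r), [])
    ]

result = inner_join(
    people, lambda p: p["id"],
    children, lambda c: c["mother"],
    "people", "children",
)
```

Here `result` contains one record pairing Jill with Becky, since only Jill's id equals a child's `mother` value. Jack produces no output because this sketch is an inner join.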
Composite Operators
• Examples:
  • Join
    • Join two or more inputs on a key
    • Inner/outer/full
    • Multi-predicate, multi-way
  • Merge
    • Concatenate all inputs in any order
  • User-defined operator (function)
    • Union, Intersect, Difference…
• One input can come from the current pipe.
• Remaining inputs are pipe variables or nested pipes.
Composite sinks
• Tee
  • Send each input item to all output pipes

    $people -> tee
       |- filter $.gender == 'F' -> write $women
       |- map { $.name } -> write $names -|;

• Split
  • Send each input item to one pipe
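The difference between the two sinks can be sketched in Python. The slide does not say how split picks its output pipe, so first-matching-predicate routing is an assumption here, as are the helper names, the `{"name": ...}` output shape, and the sample data.

```python
def tee(items, sinks):
    """Send every item to all sinks."""
    for item in items:
        for sink in sinks:
            sink(item)

def split(items, routes):
    """Send every item to exactly one sink: the first whose predicate matches (assumed rule)."""
    for item in items:
        for predicate, sink in routes:
            if predicate(item):
                sink(item)
                break

# Hypothetical sample input.
people = [{"name": "Ann", "gender": "F"}, {"name": "Bob", "gender": "M"}]

women, names = [], []

def write_women(person):                 # filter $.gender == 'F' -> write $women
    if person["gender"] == "F":
        women.append(person)

def write_names(person):                 # map { $.name } -> write $names
    names.append({"name": person["name"]})

tee(people, [write_women, write_names])

females, others = [], []
split(people, [
    (lambda p: p["gender"] == "F", females.append),
    (lambda p: True, others.append),     # catch-all route
])
```

With tee, Ann reaches both outputs; with split, every person lands in exactly one of `females` or `others`.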
Rough Unix analogs of Jaql
• Unix: stream of bytes / lines
• Jaql: stream of JSON items (more structure / types)
Summary
• Unix pipes revolutionized scripting
• If you know Unix pipes, you understand Jaql
Questions? Comments?