
Jaql → pipes: Unix pipes for the JSON data model

Open Source! Jaql → pipes: Unix pipes for the JSON data model. Kevin Beyer, Vuk Ercegovac, Eugene Shekita, Jun Rao, Ning Li, Sandeep Tata. IBM Almaden Research Center. http://code.google.com/p/jaql/ http://jaql.org/


Presentation Transcript


  1. Open Source! Jaql → pipes: Unix pipes for the JSON data model
     Kevin Beyer, Vuk Ercegovac, Eugene Shekita, Jun Rao, Ning Li, Sandeep Tata
     IBM Almaden Research Center
     http://code.google.com/p/jaql/ http://jaql.org/

  2. Goals for Jaql
     • Provide a simple, yet powerful language to manipulate semi-structured data.
     • Use JSON as a data model
       • Data is usually converted to/from JSON view
       • Most data has a natural JSON representation
     • Easily extended using Java, Python, JavaScript, …
     • Exploit massive parallelism using Hadoop

  3. What is in the upcoming release?
     • User feedback on previous release
       • Too XQuery-like (yuck factor)
       • Too complex
       • Too composable, too nested, too verbose
       • Unclear what is parallelized
     • Next release (planned 10/30/2008)
       • Vastly simplified syntax
       • Inspired by Unix Pipes

  4. A query is a pipeline
     Diagram: source → operator → operator → sink

         $people = file …; $greetings = file …;   // declare files
         $people                                  // read input (json array)
         -> filter $.type = 'friendly'            // find friendly people
         -> map { hello: $.name }                 // keep just name
         -> write $greetings;                     // write output

     Operations listed in natural order vs last operation first
     one map job
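
The "natural order" point on this slide can be sketched outside Jaql. The following is a rough Python analog, not Jaql and not taken from the presentation: the sample records, the `pipe` helper, and the stage names are all invented for illustration.

```python
from functools import reduce

# Invented sample records standing in for the $people file.
people = [
    {"name": "Ann", "type": "friendly"},
    {"name": "Bob", "type": "grumpy"},
    {"name": "Cy", "type": "friendly"},
]

def filter_friendly(items):
    # Roughly: filter $.type = 'friendly'
    return (p for p in items if p["type"] == "friendly")

def to_greeting(items):
    # Roughly: map { hello: $.name }
    return ({"hello": p["name"]} for p in items)

def pipe(source, *stages):
    """Hypothetical helper: feed source through each stage in the order listed."""
    return reduce(lambda items, stage: stage(items), stages, source)

# Natural order: source, operator, operator, sink (list plays the sink here).
greetings = pipe(people, filter_friendly, to_greeting, list)

# The same query as nested calls reads "last operation first".
nested = list(to_greeting(filter_friendly(people)))
```

Both forms produce the same items; only the reading order differs, which is the contrast the slide draws.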

  5. Aggregate
     • Aggregate the input into a single value
     • Using push-based, streaming, combining API to aggregate functions

         $people
         -> filter by $.birthdate < date('1990-01-01')
         -> aggregate count($);   // count the older people

     one map / combine / reduce job
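
To make "push-based, streaming, combining" concrete, here is a minimal Python sketch of what such an aggregate could look like. It is an assumption about the general shape of a combining aggregate, not Jaql's actual API: the `Count` class, its method names, the sample data, and the two-way split are all hypothetical.

```python
from datetime import date

class Count:
    """Hypothetical combining aggregate: push items, merge partials, finish."""

    def __init__(self):
        self.n = 0

    def add(self, item):
        # Push-based and streaming: one item at a time, constant state.
        self.n += 1

    def combine(self, other):
        # Merge a partial result computed elsewhere (the combine/reduce side).
        self.n += other.n

    def result(self):
        return self.n

# Invented sample records.
people = [
    {"name": "Ann", "birthdate": date(1985, 3, 1)},
    {"name": "Bob", "birthdate": date(1992, 7, 9)},
    {"name": "Cy", "birthdate": date(1979, 1, 20)},
    {"name": "Di", "birthdate": date(1995, 5, 5)},
]
cutoff = date(1990, 1, 1)

def partial_count(split):
    # What one map task might do: filter, then aggregate locally.
    agg = Count()
    for person in split:
        if person["birthdate"] < cutoff:
            agg.add(person)
    return agg

splits = [people[:2], people[2:]]               # pretend these are input splits
partials = [partial_count(s) for s in splits]   # map + combine
total = Count()
for part in partials:                           # reduce
    total.combine(part)
older = total.result()
```

Because partial counts can be merged, the work fits the map / combine / reduce shape noted on the slide.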

  6. Partition
     • Partition one or more inputs
     • Send each individual partition through a sub-pipe
     • Merge the results

         $people
         -> filter by $.birthdate < date('1990-01-01')
         -> partition by $t = $.type                     // partition the older people by type
            |- aggregate { type: $t, n: count($) } -|;   // aggregate per partition

     one map / combine / reduce job
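
The partition, sub-pipe, merge sequence on this slide can be approximated in plain Python. This is a sketch under assumptions, not Jaql: the records are invented and the grouping is done in memory rather than by a Hadoop job.

```python
from collections import defaultdict
from datetime import date

# Invented sample records.
people = [
    {"name": "Ann", "type": "friendly", "birthdate": date(1985, 3, 1)},
    {"name": "Bob", "type": "grumpy", "birthdate": date(1992, 7, 9)},
    {"name": "Cy", "type": "friendly", "birthdate": date(1979, 1, 20)},
    {"name": "Di", "type": "grumpy", "birthdate": date(1988, 5, 5)},
    {"name": "Ed", "type": "friendly", "birthdate": date(1995, 2, 2)},
]
cutoff = date(1990, 1, 1)

# filter by $.birthdate < date('1990-01-01')
older = [p for p in people if p["birthdate"] < cutoff]

# partition by $t = $.type
partitions = defaultdict(list)
for person in older:
    partitions[person["type"]].append(person)

# Sub-pipe per partition: aggregate { type: $t, n: count($) }, then merge.
counts = [{"type": t, "n": len(items)} for t, items in partitions.items()]
```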

  7. User-defined operators
     • Call user code
     • Similar to calling user program / script in Unix
     • Input and output are pipelined
     • Like “Hadoop streaming”

         $people
         -> myBestMatches($, 3);   // pass “standard input” to external code

     Not Parallel!
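
The idea of handing a stream of items to external code on standard input can be illustrated with Python's `subprocess` module. This is only an analogy, not how Jaql invokes user code: the external "program" is a small inline script invented for the example, one JSON item per line is an assumed wire format, and `subprocess.run` buffers the whole input rather than truly pipelining it.

```python
import json
import subprocess
import sys

# Invented sample records.
people = [{"name": "Ann"}, {"name": "Bob"}, {"name": "Cy"}]

# Stand-in external program: reads JSON lines on stdin, echoes the first two.
external = (
    "import sys\n"
    "for line in sys.stdin.readlines()[:2]:\n"
    "    sys.stdout.write(line)\n"
)

# Serialize one JSON item per line, as a streaming tool might expect.
stdin_text = "".join(json.dumps(p) + "\n" for p in people)

done = subprocess.run(
    [sys.executable, "-c", external],
    input=stdin_text,
    capture_output=True,
    text=True,
    check=True,
)

# Parse the external program's standard output back into items.
matches = [json.loads(line) for line in done.stdout.splitlines()]
```

A single external process handles the whole input here, which mirrors the slide's "Not Parallel!" caveat.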

  8. Per partition sub-pipe
     Diagram: partition (“split”) → sub-pipe per partition → merge
     • Partition one or more inputs on a key
     • Send each partition through (duplicate) sub-pipe
     • Merge the results

         $people
         -> partition by $.type           // partition people by type
            |- sort by $.rating           // sort partition by rating
            -> top 100                    // keep just the first 100 in partition
            -> myBestMatches($,3) -|;     // find best matches per partition

     one map / reduce job
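
A per-partition sub-pipe that ends in user code might look like the following in Python. This is a sketch, not Jaql: the slide does not define `myBestMatches`, so `my_best_matches` below is a hypothetical stand-in that simply keeps the first `k` names, ascending sort order is assumed, and the small `top_n` value is chosen to fit the invented data.

```python
from itertools import groupby, islice
from operator import itemgetter

# Invented sample records.
people = [
    {"name": "Ann", "type": "friendly", "rating": 3},
    {"name": "Bob", "type": "grumpy", "rating": 1},
    {"name": "Cy", "type": "friendly", "rating": 1},
    {"name": "Di", "type": "friendly", "rating": 2},
]

def my_best_matches(items, k):
    # Hypothetical stand-in for the slide's user code: keep the first k names.
    return [p["name"] for p in islice(items, k)]

def sub_pipe(partition, top_n=2, k=1):
    # sort by $.rating (ascending assumed) -> top N -> user code
    ranked = sorted(partition, key=itemgetter("rating"))
    return my_best_matches(ranked[:top_n], k)

# groupby only groups adjacent items, so sort on the partition key first.
by_type = sorted(people, key=itemgetter("type"))

merged = []
for _, partition in groupby(by_type, key=itemgetter("type")):
    merged.extend(sub_pipe(partition))   # merge the per-partition results
```

Each partition runs its own copy of the sub-pipe, so the user code sees one partition at a time.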

  9. Partition by default
     • Run sub-pipe on each partition of the input
     • If input is a file, use its partition, else arbitrary
     • Expresses parallelism of user-defined operator

         $file
         -> partition by default          // run per file partition
            |- buildPartialModel($) -|    // partial model built per partition
         -> unifyModels($);               // unify all the partial models into one

     one map job + serial unify
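
The "partial model per partition, then one serial unify" pattern can be sketched in Python as follows. The slide does not say what the model is, so word frequencies are used purely as an assumed example; `build_partial_model`, `unify_models`, and the input partitions are hypothetical.

```python
from collections import Counter

def build_partial_model(partition):
    # Hypothetical model: word frequencies seen in this partition only.
    return Counter(word for line in partition for word in line.split())

def unify_models(models):
    # Serial step: fold every partial model into a single one.
    total = Counter()
    for model in models:
        total.update(model)
    return total

# Pretend each inner list is one partition of the input file.
file_partitions = [["a b a"], ["b c"], ["a"]]

# Each partial model depends on one partition only, so this step could run in parallel.
partials = [build_partial_model(p) for p in file_partitions]

model = unify_models(partials)
```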

  10. Join
      People: [ { id: 1, name: 'Jack' }, { id: 2, name: 'Jill' }, … ]
      Children: [ { id: 3, name: 'Becky', father: 1, mother: 2 }, … ]

          $people = file …; $children = file …;
          join $people on $people.id, $children on $children.mother;

      Result:

          [ { people: { id: 2, name: 'Jill' },
              children: { id: 3, name: 'Becky', father: 1, mother: 2 } }, … ]

      • result is record with inputs as values
      • joins on multiple inputs with multiple conditions
      • Inner, left-, right-, full-outer joins
      one map / reduce job
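
The shape of the join result, a record whose fields hold the matched input items, can be reproduced with a small in-memory hash join in Python. This is a sketch using the slide's sample data, not Jaql's implementation; the `inner_join` function and its parameters are invented, and only the inner-join case is covered.

```python
from collections import defaultdict

people = [{"id": 1, "name": "Jack"}, {"id": 2, "name": "Jill"}]
children = [{"id": 3, "name": "Becky", "father": 1, "mother": 2}]

def inner_join(left, left_key, right, right_key, left_name, right_name):
    """Hypothetical hash join: index the right input, then probe with the left."""
    index = defaultdict(list)
    for r in right:
        index[r[right_key]].append(r)
    return [
        {left_name: l, right_name: r}   # inputs become the values of the result record
        for l in left
        for r in index.get(l[left_key], [])
    ]

# join $people on $people.id, $children on $children.mother
joined = inner_join(people, "id", children, "mother", "people", "children")
```

Jack has no match on `mother`, so only the Jill/Becky pair appears, as in the slide's result.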

  11. Composite Operators
      Examples:
      • Join
        • Join two or more inputs on a key
        • Inner/outer/full
        • Multi-predicate, multi-way
      • Merge
        • Concatenate all inputs in any order
      • User-defined operator (function)
        • Union, Intersect, Difference…
      Diagram: composite operator
      One input can come from current pipe.
      Remaining inputs are pipe variables or nested pipes.

  12. Composite sinks
      • Tee
        • Send each input item to all output pipes

          $people -> tee
             |- filter $.gender == 'F' -> write $women
             |- map { $.name } -> write $names -|;

      • Split
        • Send each input item to one pipe
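
The difference between the two sinks can be shown with a short Python sketch. It is an analogy rather than Jaql: the slide does not say how Split chooses its pipe, so the routing function below is an assumption, and the `tee` and `split` helpers and the sample data are invented.

```python
# Invented sample records.
people = [
    {"name": "Ann", "gender": "F"},
    {"name": "Bob", "gender": "M"},
]

def tee(items, *sinks):
    # Tee: every item goes to every output pipe.
    for item in items:
        for sink in sinks:
            sink(item)

def split(items, route, sinks):
    # Split: every item goes to exactly one output pipe, chosen by route(item).
    for item in items:
        sinks[route(item)](item)

women, names = [], []
tee(
    people,
    lambda p: women.append(p) if p["gender"] == "F" else None,  # filter -> write $women
    lambda p: names.append(p["name"]),                          # map -> write $names
)

by_gender = {"F": [], "M": []}
split(
    people,
    lambda p: p["gender"],
    {key: bucket.append for key, bucket in by_gender.items()},
)
```

With Tee, Ann reaches both branches; with Split, each person lands in one bucket only.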

  13. Rough Unix analogs of Jaql
      • Unix: stream of bytes / lines
      • Jaql: stream of JSON items
      more structure / types

  14. Summary
      • Unix pipes revolutionized scripting
      • If you know Unix pipes, you understand Jaql

  15. Questions? Comments?
