190 likes | 394 Views
Progressive Merge Join : A Generic and Non-Blocking Sort-Based Join Algorithm. Jens-Peter Dittrich Bernhard Seeger David Scot Taylor Peter Widmayer. Outline. Motivation Genericity Invariants Applications Non-Blocking Run Generation Merge Phase Experiments Conclusions. 1. Motivation.
E N D
Progressive Merge Join:A Generic and Non-Blocking Sort-Based Join Algorithm Jens-Peter Dittrich Bernhard Seeger David Scot Taylor Peter Widmayer
Outline • Motivation • Genericity • Invariants • Applications • Non-Blocking • Run Generation • Merge Phase • Experiments • Conclusions
1. Motivation • Genericity • positive aspects • separating algorithm from the domain problem • reuse of code • increase flexibility • extensible database systems • Non-Blocking algorithms • production of results before the entire input is read • pipelined processing through iterators
Sort-Merge Join • Sort-Merge Join • sketch of the algorithm • sort phase • join phase by merging • frequent usage of sort-merge paradigm • equi joins • temporal joins • spatial joins • similarity joins • blocking operator #results runtime
2. Generictity Basic idea • based on the sweep-line paradigm known from CG • sweep area refers to the processing window of the join • insertions • queries • deletions • a sweep area for each input Notation • R and S input relations • R‘ and S‘ sorted input relations • Pjoin binary join predicate • Prm binary removal predicate
insert(r) Prm(?,r) Pjoin (?,r) Sweep-area R‘ sweep area of R r s ³ r S‘ s sweep area of S
Use-cases • equi join • order by attribute A • Pjoin (u,v) := u.A == v.A • Prm(u,v) := u.A < v.A • temporal join • data type I = [min, max] • order by I.min • Pjoin (u,v) := u.I intersects v.I • Prm(u,v) := u.I.max < v.I.min • spatial join • data type R = [I0,I1] • order by R[0].min • Pjoin (u,v) := u.R intersects v.R • Prm(u,v) := u.R[0].max < v.R[0].min
Conditions When is the method applicable? • "w ³ v: Prm(u,v) Þ¬Pjoin (u,w) When an element is removed from a sweep area, it does not contribute anymore to the result of the join. • " w ≤ v: Prm(u,v) Þ Prm(u,w) When an item removes an element from a sweep area, it also will remove all smaller elements.
R S S 3. Early results • Basic ideas • sort both inputs simoultaneously in main memory • interleave sorting and joining: join immediately after • initial runs are generated • a bunch of runs are merged
3. Early results • Basic ideas • sort both inputs simultaneously in main memory • Interleave sorting and joining: join immediately after • initial runs are generated • a bunch of runs are merged
Duplicates Problem • early results are produced again during merge Avoidance of duplicates • a sweep area is used for each input run and an item has to send a query to each of these sweep areas • fits to our generic framework • considerable CPU-time overhead • Each item has to check every sweep area • currently used in our implementation • each item contains the label of the run where it is coming from • requires a special sweep area • moderate CPU-time overhead • each item has to check only one sweep area • early results are retrieved, but identified
Correctness Theorem If the conditions • "w ³v:Prm(u,v) Þ¬Pjoin (u,w) • " w ≤ v: Prm(u,v) Þ Prm(u,w) are satisfied, Progressive Merge Join returns each result of the join at most once.
4. Experiments • Implementation of PMJ • in XXL (eXtendible and fleXible Library), but not yet in the most recent release • Examined methods • Methods • strict sort-merge join • semi-strict merge join: join during final merge • PMJ • Join Predicates • equi • spatial (Arge et al, VLDB 1998) • similarity (Dittrich & Seeger, KDD 2001)
Results (I/O performance) • Similarity Join • 16-dim. Fourier vectors from CAD data • about 650.000 vectors per set • # results as a function of #I/O • page size: 4K PMJ semi-strict SMJ strict SMJ #results #I/O
200.000 150.00 #results 100.000 50000 0 0 350 runtime (sec) Results (runtime) • Experimental setting • 700 MHz Athlon, 256 MB memory • Java HotSpot Server Virtual Machine • # results as a function of #I/O PMJ semi-strict SMJ strict SMJ
5. Related Work • Sort-merge join • Join during final merge: [NP 85] • Plane-sweep paradigm • [PS 85] • Symmetric Hash Join • [RS 86], [WA 91] • Xjoin • [UF 00] • Hash-Ripple Join • [HH 99] • Scalable version [LEHN 02]
5. Conclusions • PMJ • Generic join processing • is not limited to equi-predicate, but supports also others • necessary conditions for applying PMJ to a join predicate • Non-blocking • join during run generation • join during merge • PMJ is an efficient non-blocking similarity join • Extensions • multi-relation joins • adaptive memory assignment • large main memories • sorted order of output