600 likes | 616 Views
This paper presents novel algorithms and optimizations for computing full disjunctions in relational databases, extending natural joins to combine data from multiple relations, introducing efficient methods for full disjunction evaluation.
E N D
Sara Cohen Itzhak Fadida Yaron Kanza Technion Israel Technion Israel University of Toronto Canada Benny Kimelfeld Yehoshua Sagiv Hebrew University Israel Hebrew University Israel VLDB 2006 Seoul, Korea Full Disjunctions:Polynomial-Delay Iterators in Action
Computing Full Disjunctions • The full disjunction is a relational operator that maximally combines data from several relations • It extends the natural join by allowing incompleteness • It extends the binaryouterjoin to many relations • This paper presents algorithms and optimizations for computing full disjunctions • Theoretically, full disjunctions are more tractable than previously known • Practically, a significant improvement over the state-of-art, an iterator-like evaluation
Contents • Full Disjunctions • Complexity • Contributions • Algorithms • Algorithm NLOJ for Tree-Structured Schemes • Algorithm PDelayFD for General Schemes • Algorithm BiComNLOJ − Main Algorithm • Experimental Results • Conclusion
Contents • Full Disjunctions • Complexity • Contributions • Algorithms • Algorithm NLOJ for Tree-Structured Schemes • Algorithm PDelayFD for General Schemes • Algorithm BiComNLOJ − Main Algorithm • Experimental Results • Conclusion
The Natural Join Operator Climates Accommodations Sites ClimatesAccommodationsSites
The Natural Join Misses Information Climates Accommodations Sites Bahamas is not in Sites, so the natural join misses it ClimatesAccommodationsSites
The Natural Join Misses Information Empty space means nullvalue Climates Accommodations Bahamas is not in Sites, so the natural join misses it Mouth Logan is not in a city, hence missed ClimatesAccommodationsSites
The Natural Join Misses Information A looser notion of join is needed—one that enables joining tuples from some of the tables Climates Accommodations Bahamas is not in Sites, so the natural join misses it Mouth Logan is not in a city, hence missed ClimatesAccommodationsSites
The Natural Join Operator A tuple of the join corresponds to a set of tuples from the source relations Climates Accommodations Sites Join consistent Connected No Cartesian product Complete One tuple from each relation ClimatesAccommodationsSites
Join-Consistent Sets of Tuples A set T of tuples is join-consistent if every two tuples of T are join-consistent Two tuples t1 and t2 are join-consistent if for every common attribute A: 1. t1[A] and t2[A] are non-null 2.t1[A] = t2[A]
Connected Sets of Tuples A set of tuples is connected if its join graph is connected The join graph of a setT of tuples: • The nodes are the tuples of T • An edge between every two tuples with a common attribute
Natural Join (w/o Cartesian Product) T is join consistent 1. JCC 2. 3. T is connected No Cartesian product T is complete One tuple from each relation Each tuple of the result corresponds to a set T of tuples from the source relations
FullDisjunction (Galindo-Legaria 1994) JCC 2. 3. 3. T is connected No Cartesian product T is complete One tuple from each relation T is maximal Not properly contained in any JCC set Each tuple of the result corresponds to a set T of tuples from the source relations T is join consistent 1.
An Example of a Full Disjunction Climates Accommodations Sites R FD(R)
An Example of a Full Disjunction Climates Accommodations Sites R FD(R)
An Example of a Full Disjunction Climates Accommodations Sites R FD(R)
An Example of a Full Disjunction Climates Accommodations Sites R FD(R)
An Example of a Full Disjunction Climates Accommodations Sites R FD(R)
An Example of a Full Disjunction Climates Accommodations Sites R FD(R)
The Outerjoin Operator R1R2 The natural joinR1 R2 and, in addition, all dangling tuplespadded with nulls Theouterjoinof two relations R1andR2
Example of an Outerjoin Climates Accommodations Climates Accommodations
Combining Relations using Outerjoins The outerjoin operator is not associative For more than two relations, the result depends on the order in which the outerjoin is applied In general, outerjoins cannot maximally combine relations (no matter what order is used) Outerjoin is not suitable for combining more than two relations!
Contents • FullDisjunctions • Complexity • Contributions • Algorithms • Algorithm NLOJ for Tree-Structured Schemes • Algorithm PDelayFD for General Schemes • Algorithm BiComNLOJ − Main Algorithm • Experimental Results • Conclusion
Efficiency of Evaluation The full-disjunction operator (as well as other operators like the Cartesian product or the natural join) can generate an exponential(in the input size)number of tuples Polynomial running time is not a suitable yardstick The usual notion: Polynomial time in the combined size of the input and the output
History of Algorithms for Full Disjunctions Source Time Databases g-acyclic RU96 O(n+F2) KS03 O(n5N2F2) general O(n3NF2)“incremental polynomial” CS05 general This paper:linear dependence on F number of relations number of tuples in the DB number of tuples in the FD F is typically very large Can be exponential in the size of the database n: N: F:
Polynomial Delay time One way to obtain an evaluation with a running time linear in the output is to devise an algorithm that acts as an iterator with an efficient next() operator, that is, An enumeration algorithm that runs with polynomial delay An enumeration algorithm runs with polynomial delay if the time between every two successive answers is polynomial in the size of the input
Other Benefits of Polynomial Delay • Incremental evaluation • First tuples are generated quickly • Full disjunctions are large, yet the user need not wait for the whole result to be generated • Suitable for Web applications, where users expect to get the first few pages quickly • In addition, the user can decide anytime that enough information has been shown • Enable parallel query processing • While one processor generates the FD tuples, other processors apply further processing
Contents • Full Disjunctions • Complexity • Contributions • Algorithms • Algorithm NLOJ for Tree-Structured Schemes • Algorithm PDelayFD for General Schemes • Algorithm BiComNLOJ − Main Algorithm • Experimental Results • Conclusion
Main Contributions Substantial improvement over the state-of-art is proved theoretically and experimentally 1.First algorithm for computing full disjunctions withpolynomial delay 2. First algorithm for computing full disjunctions in time linear in the output 3. A general optimizationtechniquefor computing full disjunctions Division into biconnected components
Contents • FullDisjunctions • Complexity • Contributions • Algorithms • Algorithm NLOJ for Tree-Structured Schemes • Algorithm PDelayFD for General Schemes • Algorithm BiComNLOJ − Main Algorithm • Experimental Results • Conclusion
Our Algorithms Combine Algorithm NLOJ Tree Schemes Algorithm PDelayFD GeneralSchemes Division into Biconnected Components Optimization Algorithm BiComNLOJ Main Algorithm− GeneralSchemes
Contents • Full Disjunctions • Complexity • Contributions • Algorithms • Algorithm NLOJ for Tree-Structured Schemes • Algorithm PDelayFD for General Schemes • Algorithm BiComNLOJ − Main Algorithm • Experimental Results • Conclusion
Tree Schemes R1 R5 R2 R3 R6 R7 R4 Scheme graphs w/o cycles In the scheme graph, the relation schemes are the nodes and there is an edge between every two schemes with one or more common attributes
Left-Deep Sequence of Outerjoins AlgorithmNLOJ (Nested Loop OuterJoin) R: a set of relations with a tree scheme R1,…,Rn: a connected-prefix order of R Proposition: FD(R) = (…((R1R2) R3) …) Rn 1. Compute a connected-prefix order of R 2. Apply outerjoins in a left-deep order
Connected-Prefix Order of Relations Aconnected-prefixorder of relations: Each prefix forms a (connected) subtree R1 R5 R2 R3 R6 R7 R4 R1 R3 R2 R7 R4 R5 R6
Achieving Polynomial Delay R2 R3 Rn-1 Rn Already exponential size! AlgorithmNLOJ (Nested Loop OuterJoin) 1. Compute a connected-prefix order of R 2. Apply outerjoins in a left-deep order R1 … Problem: exp. delay Solution: use iterators
Iterators To obtain polynomial delay, we use iterators • Operate on top of an enumeration algorithm • Implement next() by controlling the execution Algorithm Iterator next()
Using Iterators for Outerjoins Rn-1 Iterator 1 Iterator 2 R2 R3 Rn Iterator n-1 Iterator n R1 …
Outerjoins are not Always Applicable It is not always possible to formulate a full disjunction as a left-deep sequence of outerjoins Rajaraman and Ullman[PODS 96]: Some full disjunctions cannot be formulated as expressions of outerjoins (i.e., with arbitrary placement of parentheses)
Contents • Full Disjunctions • Complexity • Contributions • Algorithms • AlgorithmNLOJfor Tree-Structured Schemes • Algorithm PDelayFDforGeneralSchemes • Algorithm BiComNLOJ − Main Algorithm • Experimental Results • Conclusion
About the Algorithm • Unlike NLOJ, the next algorithm, PDelayFD, is applicable to all schemes (and not just trees) • Algorithm PDelayFD has a polynomial delay, but the delay islargerthan that of NLOJ • Nevertheless, PDelayFD by itself is a significant improvement over the state-of-art
Shifting a Maximal JCC Tuple Set T t-shifting T: T 1.Add t to T 2.Extractmax. JCC subset containing t 3. Extend to a maximal JCC set t-shift of T t t t
Algorithm PDelayFD Theorem: Validate that the t-shift is not already in Q orC 1. Generate a max. JCC set T0 2.Insert T0 into Q PDelayFD(R) computes FD(R)with polynomial delay C Q Repeat until Q is empty: 1. Move some T from Q toC 2.Print the join ofT, padded with nulls 3.Insert into Qa t-shift of Tfor all tuples t in the database … Output:
Contents • Full Disjunctions • Complexity • Contributions • Algorithms • AlgorithmNLOJfor Tree-Structured Schemes • AlgorithmPDelayFD for General Schemes • AlgorithmBiComNLOJ− Main Algorithm • Experimental Results • Conclusion
NLOJ vs. PDelayFD R3 R3 R2 R7 R2 R7 R8 R10 R1 R8 R10 R1 R6 R4 R5 R9 R6 R4 R5 R9 R3 R2 R7 R8 R10 R1 R6 R4 R5 R9 ? PDelayFD NLOJ • Shorter delays • Less space • Simpler to impl. Our approach: divide and conquer
Biconnected Components R1 R2 R5 R4 R3 R6 R8 R7 R9 R1 R5 R2 R3 R8 R6 R7 R4 Biconnected component: A maximal subset Bof relations, s.t. the scheme graph has two (or more) disjoint paths between every two relations ofB
Left-Deep Sequence of Outerjoins R: a set of relations Theorem: There exists an (efficiently computable) order B1,…,Bk of the biconnected components ofR, s.t. FD(R) = (…((FD(B1) FD(B2)) …) FD(Bk) Optimized Algorithm: 1. Compute the biconnected components of R 2. Compute the full disjunction of each component 3. Apply outerjoins in a suitable order
BiComNLOJ: a Naïve Attempt Iterator Iterator Iterator Each FD(Bi) can be exponential in the input 1.DivideRinto biconnected components →B1,…Bk in a suitable order Non-polynomial delay! 2.ComputeFD(B1),…,FD(Bk) − using PDelayFD 3. Using NLOJ, compute (…((FD(B1) FD(B2)) …) FD(Bk) Solution:
Retaining Polynomial Delay: 1st Problem R2 R6 For simplification, assume only two components R1 R3 R5 R7 R4 R8 B1 B2 • After generating a tuple t of FD(B1), we need to generate all tuples of FD(B2) that can join t • Non-polynomial delay if all of FD(B2) is computed for finding these tuples! • Solution: • PDelayFD can be modified so that it generates only those tuples of FD(B2) that can join t Details in the proceedings…