290 likes | 413 Views
Proposed By, Ihab F. Ilyas Walid G. Aref Ahmed K. Elmagarmid. Supporting top-k queries in Relational databases. Proceedings of the 29th international conference on Very large databases, March 2004. Sowmya Muniraju. Presented By:. Outline. Introduction
E N D
Proposed By, Ihab F. Ilyas Walid G. Aref Ahmed K. Elmagarmid Supporting top-k queries in Relational databases.Proceedings of the 29th international conference on Very large databases, March 2004 Sowmya Muniraju Presented By:
Outline • Introduction • Existing join strategies • Contributions • Related Work • Introduction to New Rank join algorithm • Overview of Ripple Joins • New Rank join algorithm • Physical Rank Join Operators • HRJN • HRJN* • Performance Evaluation • Conclusion
Introduction • Need for support of ranking in Relational Databases. • Attributes in Relational databases spread across multiple relations, hence need for ranking on join queries. • User mostly interested in top few results. • Resultset should be ordered based on certain conditions (scoring functions).
Existing Join strategies • Sort-Merge join • Relations sorted on join columns. • Nested loop join • Tuples of outer relation are joined with tuples of the inner relation. • Hash join • 2 phases: Build, Probe • Build hash table for smaller of the two relations. • Probe this hash table with hash value for each tuple in the other relation.
Top-k using existing join strategies • Given a query, how do we get the top-k results? SELECT A.1, B.2 FROM A, B, C WHERE A.1 = B.1 AND B.2 = C.2 ORDER BY ( 0.3 * A.1 + 0.7 * B.2 ) STOP AFTER 5; • Problems ? • Sorting is a blocking operation. • Sorting is expensive and has been done thrice.
Order limitations on existing joins • Sort-merge join • Sorting is done on joining columns, NOT on columns that participate in scoring function. • Nested-loop join • Orders of only the outer loop is maintained. • Hash join • Orders on both inputs are lost after the join, when hash tables do not fit in memory. Common characteristic in these joins: Decouple join from sort.
Contributions • Proposed a new rank join algorithm. • Implemented this algorithm in practical pipelined rank-join operators based on ripple join. • Proposed a scoring guide function that reduces the number of tuples to be evaluated to get the desired resutls.
Desired Result SELECT A.1, B.2 FROM A, B, C WHERE A.1 = B.1 AND B.2 = C.2 ORDER BY ( 0.3 * A.1 + 0.7 * B.2 ) STOP AFTER 5; DESIRED: Using rank join Using existing join strategies
Related Work • This problem is closely related to top-k selection queries. • Here, scoring function is applied on multiple attributes m of the same relation. • Related algorithms: Threshold Algorithm(TA), No-Random Access Algorithm(NRA), J*, A*
Introduction: New Rank Join Algorithm • Tuples are retrieved in order to preserve ranking. • Produces first ranked join results as quickly as possible. • Uses a monotonicranking function. • Based on the idea of ripple join. • Integration with existing physical query engines. • Variations: HRJN, HRJN*
Overview of Ripple Joins • Previously unseen random tuple from one relation is joined with previously seen tuples from another relation. • Variations of Ripple Joins • Block • Hash
Example L.A = R.A Scoring Function: L.B+ R.B K = 2 L_top R_top L_bottom R_bottom L R K = 1 K = 2 K = 0 Threshold (T): 10 8 9 7 Right_threshold =f( R_top, L_bottom ) Left_threshold = f( L_top, R_bottom ) T = Max(Left_threshold, Right_threshold ) 10 10 10 8 8 8 7 7 7 9 9 9 [ (1,1,5) (2,1,4) ] = 9 [ (2,2,4) (3,2,3) ] = 7 [ (3,2,3) (3,2,3) ] = 6 [ (4,3,2) (1,3,5) ] = 7 [ (2,2,4) (4,2,2) ] = 6 LI, RI not a valid join L4,R1 | L2,R4 | L3,R4 are valid joins L3, R3 | L2, R3 are valid joins L1, R2 is a valid join. [ (3,2,3) (4,2,2) ] = 5
Hash Rank Join Operator (HRJN) • Variant of Symmetrical hash join algorithm. • Data Structures • Hash table for each input. • Priority Queue - holds valid join combinations along with their scores. • Methods implemented • Open: initializes its operator and prepares its internal state. • Get Next: returns next ranked join result upon each call. • Close: terminates the operator and performs the necessary clean up.
Open(L, R, C, f) L = Left Input R = Right Input C = Join condition f = Monotonic scoring function
GetNext() Output: Next ranked join result
Local Ranking Problem Solving • Unbalance retrieval rate of left and right inputs. • Use concept of Block Ripple Join.
Example 2 L.A = R.A Scoring Function: L.B+ R.B K = 2 Scoring Function: L.B+ R.B K = 2 L_top R_top L_bottom R_bottom L R Threshold (T): 10 Right_threshold =f( R_top, L_bottom ) Left_threshold = f( L_top, R_bottom ) T = Max(Left_threshold, Right_threshold ) 9 10 10 8 10 10 7 10 10 10 10 10 [ (4,3,2) (1,3,5) ] = 7 No valid joins. L4, R1 is a valid join
HRJN*: Score-Guided Join Strategy Retrieve tuple from input T1 = f( L_top, R_bottom) T2 = f( R_top, L_bottom) Yes If T1 > T2 No Input = L Input = R
Exploiting available indexes • Generalize Rank-join to use random access if available. • Two cases: • An index on join attribute(s) of one input. • An index on join attribute(s) for each input. • Problem: Duplicatescan be produced as indexes will contain all data seen and not yet seen.
Exploiting Indexes: On-the-fly duplicate elimination Scoring Function: L.B+ R.B Index available on R L R f( L_bottom, R_bottom) = 59 Any join result, not yet produced, cannot have a combined score greater than f( L_bottom, R_bottom) [ (1,1,100) (2,1,9) ] = 109 [ (2,2,5) (3,2,8) ] = 58 [ (2,2,50) (4,2,5) ] = 55
Exploiting Indexes: Faster Termination Previously, T = ( 109, 60 ) = 109 After reducing L_top, T = ( 59, 60 ) = 60 L.A = R.A Scoring Function: L.B+ R.B Index available on R L R L_top = L_bottom Reduce L_top to L_bottom, i.e
Performance Evaluation Top-k join operators M = 4 Selectivity = 0.2%
Effect of selectivity M = 4 K = 50
Effect of pipelining Selectivity = 0.2% K = 50
Conclusion • Supported top-k join queries using the new rank join algorithm. • Algorithm uses ranking on the input relations to produce ranked join results on a combined score. • The ranking is performed progressively during the join operation. • HRJN, HRJN* operators implement the new algorithm. • Generalization of this algorithm utilized available indexes for faster termination.
References • “Supporting Top-k Join Queries in Relational Databases.”, Ihab F. Ilyas, Walid G. Aref, Ahmed K. Elmagarmid, March 2004 • Jing Chen: DIBR Spring 2005, CSE - UT Arlington
THANK YOU