Supporting Top-K Keyword Search in XML Databases. ICDE 2010.
Outline • Introduction • Motivation • Preliminaries • Join-based Algorithm • Join-based Top-k Algorithm • Experiments • Conclusions
Introduction • LCA: Lowest Common Ancestor
Motivation • The naive LCA-based semantics is straightforward, but it leads to exponential computation and result size. • Two keywords, {XML} and {data}: with m nodes in the list for XML and n nodes in the list for data, the total number of LCAs is m*n. • Existing algorithms focus on efficiency and cannot provide effective support for top-k processing.
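The m*n growth can be illustrated with a minimal sketch. It assumes Dewey-style labels (a node's label is a prefix of every descendant's label), so the LCA of two nodes is the longest common prefix of their labels. The two node lists and the helper name `lca` are hypothetical and not taken from the slides.

```python
from itertools import product

def lca(a, b):
    """Longest common prefix of two Dewey-style labels (tuples of ints)."""
    prefix = []
    for x, y in zip(a, b):
        if x != y:
            break
        prefix.append(x)
    return tuple(prefix)

# Hypothetical node lists: m = 3 nodes contain "XML", n = 2 contain "data".
xml_nodes = [(1, 1, 2, 1), (1, 2), (1, 3, 4, 1)]
data_nodes = [(1, 1, 2, 2), (1, 3, 4, 2)]

# The naive semantics computes one LCA per pair, i.e. m * n LCA computations.
pair_lcas = [lca(a, b) for a, b in product(xml_nodes, data_nodes)]
distinct_lcas = sorted(set(pair_lcas))
```

With k keywords the same enumeration runs over k lists, so the number of combinations is the product of the list sizes. The number of distinct LCA nodes can be smaller than the number of combinations, as `distinct_lcas` shows.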
Preliminaries 1. Query Semantics • k-keyword query • For each keyword: the list of nodes that directly contain it • The LCA of nodes, one drawn from each list • ELCA semantics: the result is the set of nodes that contain at least one occurrence of all of the query keywords, either in their labels or in the labels of their descendant nodes, after excluding the occurrences of the keywords in subtrees that already contain at least one occurrence of all the query keywords
Cont. • SLCA: a subset of the LCAs such that no LCA in the subset is the ancestor of another LCA. • LCA: 1.1, 1.1.2, 1, 1.3.4, 1.3 • SLCA: 1.1.2, 1.3.4 • ELCA: 1.1.2, 1.3.4, 1
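The SLCA line of the example can be reproduced with a short sketch. It assumes the labels are Dewey-style, so that node a is an ancestor of node b exactly when a's label is a proper prefix of b's label. The function names are illustrative only.

```python
def parse(label):
    """Turn a label such as "1.3.4" into a tuple of ints."""
    return tuple(int(part) for part in label.split("."))

def is_proper_ancestor(a, b):
    """True if label a is a proper prefix of label b."""
    return len(a) < len(b) and b[:len(a)] == a

def slca(lcas):
    """Keep only the LCAs that are not an ancestor of any other LCA."""
    return [n for n in lcas
            if not any(is_proper_ancestor(n, m) for m in lcas)]

# LCA list from the slide.
lcas = [parse(s) for s in ["1.1", "1.1.2", "1", "1.3.4", "1.3"]]
slca_labels = [".".join(map(str, n)) for n in slca(lcas)]
```

Here 1.1 and 1 are dropped because 1.1.2 lies below them, and 1.3 is dropped because 1.3.4 lies below it. ELCA cannot be derived from the LCA list alone, since it depends on which keyword occurrences remain after excluding the subtrees that already contain all keywords.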
Cont. 2. Ranking Function
Cont. • d(·): a decreasing function
Join-based Algorithm 1. Node encoding
Join-based Algorithm 2. Algorithm • Two lists of nodes, one for each keyword
Cont. (2,3) join (1): no match
Cont. (3,5,6) join (1,2,4): no match
Cont. (2,3,4,5) join (1,2,4) => (2,4) matched. The nodes numbered 2 and 4 at level 3 are the lowest ELCAs => erased
Cont. (2,3) join (1): no match
Cont. (1,1) join (1): matched => the root is an ELCA. Number 1 corresponds to two nodes (1.2.3 and 1.3.5.6); output one of them
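The per-level matching step in the walkthrough above can be sketched as a merge join over two sorted lists of node numbers. This covers only the matching of one column; erasing matched nodes and moving between levels are left out. The function name `merge_join` is an assumption for illustration, not the authors' code.

```python
def merge_join(left, right):
    """Return the numbers present in both sorted lists, each reported once."""
    i = j = 0
    matched = []
    while i < len(left) and j < len(right):
        if left[i] == right[j]:
            value = left[i]
            matched.append(value)
            # Skip repeated numbers so a match is output only once.
            while i < len(left) and left[i] == value:
                i += 1
            while j < len(right) and right[j] == value:
                j += 1
        elif left[i] < right[j]:
            i += 1
        else:
            j += 1
    return matched

# Joins taken from the walkthrough.
level_a = merge_join([2, 3], [1])
level_b = merge_join([3, 5, 6], [1, 2, 4])
level_c = merge_join([2, 3, 4, 5], [1, 2, 4])
root = merge_join([1, 1], [1])
```

Because both inputs are sorted, each join runs in time linear in the combined length of the two lists.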
Cont. Score(1.3.4.5.3.1.1) is greater than Score(1.3.5.6). But in the 4th column, 0.5*d(3) may be greater than or equal to 0.44.
Cont. Assume d( ): Joining columns 5 and 4 gives no result.
Cont. Column 3: number 2 is matched. Its score is 0.73 + 0.41 = 1.14. The threshold for the unseen results in column 3 is max{0.7+0.3, 0.5+0.4} = 1.
Cont. Consider the unseen results in the other columns: columns 1 and 2 do not contain sequence s, so they are ignored. Consider column 2: the maximum scores are 0.7*0.9 and 0.5*0.9, so the threshold is 0.63 + 0.45 = 1.08 < 1.14. Therefore, node 2 at level 3 can be output.
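The arithmetic of this early-output check can be replayed in a few lines. All numbers come from the slides. Reading 0.9 as the value of the decreasing function d for a one-level gap, and using "greater than or equal" as the output test, are assumptions of this sketch.

```python
import math

# Score of the matched result (number 2 in column 3).
seen_score = 0.73 + 0.41

# Upper bound for unseen results in column 3.
threshold_col3 = max(0.7 + 0.3, 0.5 + 0.4)

# Upper bound for unseen results in column 2: the best per-keyword
# scores, each multiplied by 0.9 (assumed to be the damping value).
damping = 0.9
threshold_col2 = 0.7 * damping + 0.5 * damping

# The result is safe to output once no unseen result can beat it.
can_output = seen_score >= max(threshold_col3, threshold_col2)

# Floating-point sums are compared with a tolerance, not with ==.
matches_slide = (math.isclose(seen_score, 1.14)
                 and math.isclose(threshold_col3, 1.0)
                 and math.isclose(threshold_col2, 1.08))
```

The point of the check is that a result can be emitted before every column has been joined, as long as each remaining column's upper bound stays below the score already found.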
Conclusions • 1. The Join-based Algorithm performs well when keyword frequency is high. • 2. The Join-based Top-k Algorithm performs well when keyword correlation is high.