Effective Keyword Search for Valuable LCAs over XML Documents. Guoliang Li, Jianhua Feng, Jianyong Wang, Lizhu Zhou. Lin Shao, XML und Datenbanksysteme.
Content • Introduction • Background and Motivation • Valuable LCA • Meaningful Dewey Code (MDC) • The Stack-Based Algorithm • Experimental Study • Conclusion
Introduction • Existing proposals for keyword search over XML databases suffer from two problems • The meaningfulness and completeness of answers, and the scope of the search • The answer to a keyword search should not be limited to the LCAs of the keywords
Introduction • To solve these problems, the authors • propose the Valuable LCA (VLCA) • propose the Compact VLCA (CVLCA) • devise an efficient stack-based algorithm
Background and Motivation • Notations • u ≺ v: u is an ancestor of node v • u < v: u precedes v in the XML document • u ⪯ v: u ≺ v or u = v • For example • conf(2) ≺ paper(15) • paper(15) ⪯ author(17) • title(6) < author(17)
Background and Motivation • Example: false positive problem of LCA • Search for: {“IR”, “Tom”} • False answer: conf(2) • Solutions • Meaningful LCA (MLCA) • Smallest LCA (SLCA) • XRank
Background and Motivation • Example: false negative problem of SLCA • Search for: {“XML”, “Bob”} • paper(5) will not be in the SLCA set
Background and Motivation • Example: false positive problem of SLCA • Search for: {“XML”, “John”} • False answer: conf(2)
Valuable LCA • Based on the concept of homogenous / heterogenous nodes • Given two nodes u and v with w = LCA(u, v), uSet and vSet are the sets of nodes on the paths w → u and w → v, respectively • If u and v have the same elementary type, they are homogenous (denoted u ~ v)
Valuable LCA • Avoids the false positives and false negatives introduced by SLCA • Definition: given m nodes n_1, n_2, …, n_m, let v = LCA(n_1, n_2, …, n_m). Then VLCA(n_1, n_2, …, n_m) = v iff these m nodes are homogenous, that is, ∀ 1 ≤ i < j ≤ m, n_i ~ n_j.
Valuable LCA • Example: heterogenous / homogenous • Search for: {“XML”, “John”} • conf(2): heterogenous • paper(23): homogenous
Meaningful Dewey Code (MDC) • A novel numbering scheme • Inspired by the Dewey code • Numbers (encodes) the nodes based on the corresponding DTD • Ancestors and elementary types can be deduced from a node's code
Meaningful Dewey Code (MDC) • DTD:
<!ELEMENT bib (conf)*>
<!ELEMENT conf (name,year,paper*,chair)>
<!ELEMENT paper (title,author+,bib?)>
<!ELEMENT name (#PCDATA)>
<!ELEMENT year (#PCDATA)>
<!ELEMENT chair (#PCDATA)>
<!ELEMENT title (#PCDATA)>
<!ELEMENT author (#PCDATA)>
Meaningful Dewey Code (MDC) • ε: the root element • C_n: the MDC of node n • O_n: the ordered number of node n • To encode a node: • author(0.2.1)
Meaningful Dewey Code (MDC) • k: the k-th label • m: the number of children of parent(n) in the DTD
Meaningful Dewey Code (MDC) • MDC example • Given MDC = 0.6.1 • Level 0 (root) = bib, m = 1 • Level 1 = conf, m = 4 • Level 2 = paper, m = 3
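The example above can be replayed with a small decoding sketch. The encoding formula itself did not survive in the slide text, so the code assumes that each MDC component equals k + m·j, where k is the 0-based position of the label in the parent's DTD content model, m is the number of child labels of the parent, and j is the 0-based occurrence among same-label siblings. This assumption is consistent with the codes shown on the slides (for example author(0.2.1) and author(0.6.4)), but it is an inference, not the authors' stated definition. The names `DTD_CHILDREN` and `decode_mdc` are hypothetical.

```python
# Child labels per element, in DTD order (taken from the DTD on the slides).
DTD_CHILDREN = {
    "bib": ["conf"],
    "conf": ["name", "year", "paper", "chair"],
    "paper": ["title", "author", "bib"],
}


def decode_mdc(code, root="bib"):
    """Return [(label, occurrence), ...] for every level below the root.

    ASSUMPTION: component = k + m * j (see the lead-in); this is inferred
    from the slide examples, not quoted from the paper.
    """
    path = []
    parent = root
    for part in code.split("."):
        component = int(part)
        children = DTD_CHILDREN[parent]  # KeyError if parent is a #PCDATA leaf
        m = len(children)
        k, j = component % m, component // m
        path.append((children[k], j))
        parent = children[k]
    return path


print(decode_mdc("0.6.1"))
# [('conf', 0), ('paper', 1), ('author', 0)]
```

Under this reading, 0.6.1 is the first author of the second paper of the first conf, which matches the per-level values m = 1, 4, 3 listed on the slide.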
Meaningful Dewey Code (MDC) • Checking whether nodes are homogenous or heterogenous • Proof: if u and v have the same elementary type, then λ(u) = λ(v) and |{λ(u)} ∩ {λ(v)}| = 1 • Heterogenous: |wSet| − |{λ(u)} ∩ {λ(v)}| > |lSet|, where wSet = uSet ∪ vSet and lSet = {λ(x) | x ∈ wSet} • Check u(0.2.0) and v(0.6.4) • wSet = {conf(0), paper(0.2), title(0.2.0), paper(0.6), author(0.6.4)} • |wSet| = 5, |lSet| = 4, and |{λ(u)} ∩ {λ(v)}| = 0 • 5 − 0 > 4, so u and v are heterogenous
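The cardinality test on this slide can be sketched in a few lines. The code is illustrative only: it reuses the assumption that an MDC component equals k + m·j (k = 0-based label position in the parent's DTD content model, m = number of child labels), it includes the LCA node w in wSet because the slide's example lists conf(0), and all function names are hypothetical.

```python
# Child labels per element, in DTD order (from the DTD on the slides).
DTD_CHILDREN = {
    "bib": ["conf"],
    "conf": ["name", "year", "paper", "chair"],
    "paper": ["title", "author", "bib"],
}


def parse(code):
    """'0.2.0' -> (0, 2, 0)."""
    return tuple(int(p) for p in code.split("."))


def label(mdc, root="bib"):
    """Elementary type of a node. ASSUMPTION: component = k + m * j."""
    name = root
    for component in mdc:
        children = DTD_CHILDREN[name]
        name = children[component % len(children)]
    return name


def lca(u, v):
    """Longest common prefix of two codes (the empty tuple is the root)."""
    i = 0
    while i < min(len(u), len(v)) and u[i] == v[i]:
        i += 1
    return u[:i]


def is_heterogenous(u, v):
    w = lca(u, v)
    # wSet: nodes on the paths w -> u and w -> v (w included, as on the slide).
    w_set = {u[:i] for i in range(len(w), len(u) + 1)}
    w_set |= {v[:i] for i in range(len(w), len(v) + 1)}
    l_set = {label(n) for n in w_set}
    shared = 1 if label(u) == label(v) else 0  # |{λ(u)} ∩ {λ(v)}|
    return len(w_set) - shared > len(l_set)


print(is_heterogenous(parse("0.2.0"), parse("0.6.4")))  # True
print(is_heterogenous(parse("1.2.0"), parse("1.2.1")))  # False
```

The first call reproduces the slide's numbers (|wSet| = 5, |lSet| = 4); the second pair, a title and an author under the same paper, passes the test as homogenous.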
The Stack-Based Algorithm • VLCAStack improves the search efficiency • Related to the stack-based algorithms for structural joins and twig joins • Differs from the existing studies (CVLCA)
The Stack-Based Algorithm • Compact VLCA (CVLCA) • More compact than VLCA • Answers are more meaningful • An answer is a connected subtree rooted at a CVLCA • Idea behind the compact connected tree • If node v is in a compact connected tree, it will not appear in another, looser one that contains other irrelevant nodes
The Stack-Based Algorithm • Compact VLCA vs. SLCA • Example: false negative problem of SLCA • Search for: {“XML”, “Bob”} • SLCA set = {paper(12)} • CVLCA set = {paper(5), paper(12)}
The Stack-Based Algorithm • VLCAStack • Input elements are sorted by their MDCs • VLCAStack maintains another stack to preserve the current LCAs
The Stack-Based Algorithm • Example: Search for = {“XML”, “John”} • sVLCA is empty • nMin = 0.2.0
The Stack-Based Algorithm • Example: Search for = {“XML”, “John”} • sVLCA = 0.2.0 • nMin = 0.6.4
The Stack-Based Algorithm • Example: Search for = {“XML”, “John”} • sVLCA = 0.6.4 • nMin = 1.2.0
The Stack-Based Algorithm • Example: Search for = {“XML”, “John”} • sVLCA = 0 • nMin = 1.2.0
The Stack-Based Algorithm • Example: Search for = {“XML”, “John”} • sVLCA = 1.2.0 • nMin = 1.2.1
The Stack-Based Algorithm • Example: Search for = {“XML”, “John”} • sVLCA = 1.2 • nMin is empty
The Stack-Based Algorithm • Example: Search for = {“XML”, “John”} • Answer of the keyword query = {(paper(1.2);title:XML(1.2.0);author:John(1.2.1))}
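The answer of this walkthrough can be cross-checked with a brute-force enumeration. This is deliberately not the VLCAStack algorithm, whose details are not given in the slide text; it simply tries every combination of keyword nodes, keeps the pairwise homogenous ones, and reports their LCA. The assignment of the four codes to the two keywords is inferred from their labels (titles for “XML”, authors for “John”), the MDC decoding assumes component = k + m·j as suggested by the slide examples, and all names are hypothetical.

```python
from itertools import combinations, product

# Child labels per element, in DTD order (from the DTD on the slides).
DTD_CHILDREN = {
    "bib": ["conf"],
    "conf": ["name", "year", "paper", "chair"],
    "paper": ["title", "author", "bib"],
}


def parse(code):
    return tuple(int(p) for p in code.split("."))


def label(mdc, root="bib"):
    # ASSUMPTION: component = k + m * j, inferred from the slide examples.
    name = root
    for component in mdc:
        children = DTD_CHILDREN[name]
        name = children[component % len(children)]
    return name


def lca(*nodes):
    """Longest common prefix of all codes."""
    prefix = nodes[0]
    for n in nodes[1:]:
        i = 0
        while i < min(len(prefix), len(n)) and prefix[i] == n[i]:
            i += 1
        prefix = prefix[:i]
    return prefix


def homogenous(u, v):
    # Negation of the slides' test: |wSet| - |{λ(u)} ∩ {λ(v)}| > |lSet|.
    w = lca(u, v)
    w_set = {u[:i] for i in range(len(w), len(u) + 1)}
    w_set |= {v[:i] for i in range(len(w), len(v) + 1)}
    l_set = {label(n) for n in w_set}
    shared = 1 if label(u) == label(v) else 0
    return len(w_set) - shared <= len(l_set)


def brute_force_vlcas(keyword_lists):
    """Try every node combination; keep those that are pairwise homogenous."""
    answers = []
    for combo in product(*keyword_lists):
        if all(homogenous(a, b) for a, b in combinations(combo, 2)):
            answers.append((lca(*combo), combo))
    return answers


xml_nodes = [parse(c) for c in ("0.2.0", "1.2.0")]   # assumed hits for "XML"
john_nodes = [parse(c) for c in ("0.6.4", "1.2.1")]  # assumed hits for "John"
result = brute_force_vlcas([xml_nodes, john_nodes])
print(result)
# [((1, 2), ((1, 2, 0), (1, 2, 1)))]
```

Only the combination under paper(1.2) survives, which agrees with the answer on the slide. The enumeration is exponential in the number of keywords, which is the cost a stack-based single pass over the sorted input is meant to avoid.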
Experimental Study • Efficiency and Effectiveness Test • Datasets • Real Dataset: DBLP, SIGMOD Record, TreeBank • Synthetic Dataset: XMark • Tested Methods • Brute-Force • XSEarch • SLCA • GDMCT
Experimental Study • Efficiency
Experimental Study • Effectiveness • Precision • Recall • F-measure
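The three effectiveness measures can be written down as a short sketch. The slides do not say which F-measure variant was used, so the balanced F1 is assumed here. The usage example treats the CVLCA set from the earlier {“XML”, “Bob”} slide as the relevant answers, which is also an assumption made only for illustration; `precision_recall_f` is a hypothetical helper name.

```python
def precision_recall_f(returned, relevant):
    """Precision, recall and balanced F-measure (F1, an assumption)."""
    returned, relevant = set(returned), set(relevant)
    hits = len(returned & relevant)
    precision = hits / len(returned) if returned else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    total = precision + recall
    f_measure = 2 * precision * recall / total if total else 0.0
    return precision, recall, f_measure


# SLCA returns only paper(12); paper(5) is missed (a false negative).
p, r, f = precision_recall_f({"paper(12)"}, {"paper(5)", "paper(12)"})
print(p, r, round(f, 3))
# 1.0 0.5 0.667
```

A false negative such as the missed paper(5) lowers recall while precision stays at 1.0, which is why the slides report both measures alongside the combined F-measure.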
Conclusion • Demonstrated the problems of existing keyword search over XML documents • Proposed VLCA and CVLCA to obtain meaningful results for keyword queries • Presented an optimization technique to compute CVLCAs and devised an efficient stack-based algorithm to identify meaningful compact connected trees