Keyword Proximity Search on XML Graphs • Vagelis Hristidis • Yannis Papakonstantinou • Andrey Balmin University of California, San Diego
Motivation • Keyword Search is the dominant information discovery method in plain text documents • Increasing amount of data stored in XML databases
Motivation • Currently, information discovery in XML databases requires: • Knowledge of schema • Knowledge of a query language (eg: XQuery) • Knowledge of the role of the keywords • XKeyword eliminates these requirements
Keyword Query - Semantics • Keywords are: • in same XML text node. Eg: <title> Storage of XML databases </title> • in same element. Eg: <topics> <topic> XML survey </topic> <topic> Storage of database </topic> </topics> • connected through edges. Eg: <topic> XML survey </topic> linked by an idref edge to <topic> Storage of database </topic>
Result of Keyword Query • Result is a tree T of XML nodes where: • every keyword is contained in a node of T (total) • no node of T is redundant (minimal) • Score of result: • distance of keywords within a text node • distance between keywords in number of edges • weighted distance • PageRank-like methods
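The two conditions on a result tree can be checked mechanically. The following is a minimal sketch, assuming a tree is given as a node list plus an edge list and that each node maps to the set of keywords it contains. The function names and the sample nodes (`p1`, `l1`, `pr1`, `o1`) are hypothetical, and "score" here is only the edge-count option from the list above.

```python
from collections import defaultdict

def is_total(tree_nodes, node_keywords, query):
    """True if every query keyword is contained in some node of the tree."""
    covered = set()
    for n in tree_nodes:
        covered |= node_keywords.get(n, set())
    return set(query) <= covered

def is_minimal(tree_nodes, tree_edges, node_keywords, query):
    """True if no leaf can be dropped while still covering every keyword.

    Only leaves are tested: removing an inner node would disconnect the tree.
    """
    degree = defaultdict(int)
    for a, b in tree_edges:
        degree[a] += 1
        degree[b] += 1
    for leaf in [n for n in tree_nodes if degree[n] <= 1]:
        rest = [n for n in tree_nodes if n != leaf]
        if is_total(rest, node_keywords, query):
            return False
    return True

def size(tree_edges):
    """Score option 'distance in number of edges': the tree size."""
    return len(tree_edges)

# Hypothetical instance: person p1 contains "John", product pr1 contains "VCR".
keywords = {"p1": {"John"}, "pr1": {"VCR"}}
query = ["John", "VCR"]
good_nodes, good_edges = ["p1", "l1", "pr1"], [("p1", "l1"), ("l1", "pr1")]
bad_nodes, bad_edges = good_nodes + ["o1"], good_edges + [("p1", "o1")]
```

In this sketch the three-node tree is total and minimal with size 2, while the tree that also hangs `o1` off `p1` is total but not minimal, because `o1` is a leaf that contributes no keyword.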
Example - Schema TPCH-like schema
Example – Keyword Query Query: “John, VCR”
Example – Keyword Query Query: “John, VCR” Result trees: T1 size = 6, T2 size = 8
Example – Keyword Query Target Objects
Presentation • Number of results explodes due to MVDs (multivalued dependencies) • Example results: R1. p1-l1-pa3-pa1, R2. p1-l2-pa3-pa2, R3. p1-l2-pa3-pa1, R4. p1-l1-pa3-pa2 • R3 and R4 are implied by R1 and R2!
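Why R3 and R4 carry no new information can be shown with a short sketch. It assumes, as the MVD states, that the connections on either side of the shared node `pa3` are independent, so every left half can be combined with every right half. The function name `implied_results` and the `pivot` parameter are hypothetical.

```python
from itertools import product

def implied_results(results, pivot):
    """Split each result at position `pivot` and recombine all halves
    that share the same pivot node (the MVD assumption)."""
    lefts, rights = {}, {}
    for r in results:
        key = r[pivot]
        lefts.setdefault(key, set()).add(r[:pivot])
        rights.setdefault(key, set()).add(r[pivot + 1:])
    combined = set()
    for key in lefts:
        for left, right in product(lefts[key], rights[key]):
            combined.add(left + (key,) + right)
    return combined

# R1 and R2 from the slide, split around the shared part pa3.
seen = [("p1", "l1", "pa3", "pa1"), ("p1", "l2", "pa3", "pa2")]
all_results = implied_results(seen, pivot=2)
```

Starting from R1 and R2 alone, the recombination yields all four results, which is why listing R3 and R4 separately only adds clutter.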
Presentation • Create a Presentation Graph for each CN
Demo • Demo on DBLP dataset available at www.db.ucsd.edu/XKeyword
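Architecture (figure: a query flowing through the system) • Keyword query: “John, VCR” • Keyword locations: John: person.name; VCR: part.name, product.descr • Candidate networks: PersonJohn-Lineitem-ProductVCR, PersonJohn-Lineitem-PartVCR • Results as IDs: Person1-Lineitem1-Product1, Person1-Lineitem2-Part1 • Results as target objects: Person[“John”,“US”]-Lineitem[quant=6, Oct 14 2001]-Product[id=2005,descr=“Set of VCR and DVD”], …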
Candidate Networks Generator • Adaptation of CN generator of DISCOVER (Hristidis et al. VLDB 2002) to XML databases • Example: CNs of size ≤ 3 for query “John, VCR” (reconstructed from the figure):
PersonJohn -supplier- Lineitem - ProductVCR
PersonJohn -supplier- Lineitem - PartVCR
PersonJohn -supplier- Lineitem - Part -subpart- PartVCR
PersonJohn - Order - Lineitem - PartVCR
PersonJohn - Order - Lineitem - ProductVCR
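The enumeration of candidate networks can be illustrated with a simplified sketch. It is not the DISCOVER or XKeyword generator: it handles only two keywords, produces only path-shaped CNs, and applies no pruning. The schema edge list below is an assumption pieced together from the relation names used in these slides.

```python
# Assumed schema graph (undirected); the third field is an edge label.
SCHEMA_EDGES = [
    ("Person", "Order", ""),
    ("Order", "Lineitem", ""),
    ("Lineitem", "Person", "supplier"),
    ("Lineitem", "Part", ""),
    ("Lineitem", "Product", ""),
    ("Part", "Part", "subpart"),
]

def neighbours(node):
    """Yield (neighbour type, edge label) pairs of a schema node."""
    for a, b, label in SCHEMA_EDGES:
        if a == node:
            yield b, label
        if b == node and a != b:
            yield a, label

def path_cns(start, targets, max_size):
    """Enumerate path-shaped CNs from `start` to any type in `targets`
    using at most `max_size` edges. Inner nodes are keyword-free."""
    found = []

    def extend(path, size):
        last = path[-1]  # path alternates node, label, node, ...
        if size > 0 and last in targets:
            found.append(tuple(path))
        if size == max_size:
            return
        for nxt, label in neighbours(last):
            extend(path + [label, nxt], size + 1)

    extend([start], 0)
    return found

# "John" occurs in Person nodes, "VCR" in Part and Product nodes.
cns = path_cns("Person", {"Part", "Product"}, max_size=3)
```

Under these assumptions the sketch finds five paths for size ≤ 3, matching the number of CNs in the example. Note that the schema cycle on Part means the count keeps growing as `max_size` increases.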
Decomposer: Storing Data in XKeyword • Each target object is stored in a CLOB • Connections between target objects are stored in ID Relations
Minimal ID Relations:
Lineitem_Part: LPa (L_id, Pa_id)
Lineitem_Person_ref: LPref (L_id, P_id)
Part_Part: PaPa (Pa_id1, Pa_id2)
Example instance of LPa:
L_id  Pa_id
100   123
101   123
Decomposer • Create redundant ID Relations to improve performance • Examples: PLPa (P_id, L_id, Pa_id) = LPref ⋈ LPa, PaPaPa = PaPa ⋈ PaPa • (figure: a CN over PersonJohn, Lineitem, Part and PartVCR with supplier and subpart edges) • Then the CN is evaluated as PJohn ⋈ PLPa ⋈ PaPaPa ⋈ PaVCR instead of PJohn ⋈ LPref ⋈ LPa ⋈ PaPa ⋈ PaPa ⋈ PaVCR • Saves 2 joins!
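A small sketch of what materializing a redundant ID relation means, using the LPa and LPref rows shown elsewhere in these slides. The helper `joins_needed` is hypothetical; it just counts the binary joins in a chain of relations.

```python
# Minimal ID relations, with the sample rows from the slides.
LPref = [(100, 105), (101, 105)]   # (L_id, P_id)
LPa = [(100, 123), (101, 123)]     # (L_id, Pa_id)

# Redundant relation PLPa (P_id, L_id, Pa_id) = LPref joined with LPa on L_id.
PLPa = sorted(
    (p_id, l_id, pa_id)
    for (l_id, p_id) in LPref
    for (l_id2, pa_id) in LPa
    if l_id == l_id2
)

def joins_needed(chain):
    """A chain of n relations is evaluated with n - 1 binary joins."""
    return len(chain) - 1

minimal_plan = ["PJohn", "LPref", "LPa", "PaPa", "PaPa", "PaVCR"]
redundant_plan = ["PJohn", "PLPa", "PaPaPa", "PaVCR"]
saved = joins_needed(minimal_plan) - joins_needed(redundant_plan)
```

With the precomputed relations the chain shrinks from six operands to four, so `saved` is 2. The cost is extra storage for PLPa and PaPaPa.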
Decomposer - Rules • Create redundant ID Relations only when they are not MVD relations. Eg: Person Order Lineitem (POL) • Avoid MVD ID Relations. Eg: Order Person Lineitem (OPLref) (figure: the Order-Person and Person-Lineitem edges are both marked *, the latter being the supplier ref)
OPLref:
O_id  P_id  L_id
127   105   100
127   105   101
129   105   100
129   105   101
PO:
P_id  O_id
105   127
105   129
LPref:
L_id  P_id
100   105
101   105
• There is always a decomposition with maximum fragment size L = M/(J+1), such that any CN of size up to M is evaluated with at most J joins
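The sample rows make the problem with MVD relations concrete: OPLref can be rebuilt from PO and LPref, so storing it only multiplies rows. The sketch below checks this with the values from the slide and evaluates the fragment-size formula for one case. The choice M = 6, J = 2 is an assumption picked so the division is exact.

```python
# Rows taken from the slide.
PO = [(105, 127), (105, 129)]      # (P_id, O_id)
LPref = [(100, 105), (101, 105)]   # (L_id, P_id)
OPLref = [(127, 105, 100), (127, 105, 101), (129, 105, 100), (129, 105, 101)]

# Join PO and LPref on P_id to obtain (O_id, P_id, L_id) tuples.
rebuilt = sorted(
    (o_id, p_id, l_id)
    for (p_id, o_id) in PO
    for (l_id, p_id2) in LPref
    if p_id == p_id2
)

def fragment_size(max_cn_size, max_joins):
    """L = M / (J + 1), as stated on the slide."""
    return max_cn_size / (max_joins + 1)

example_L = fragment_size(6, 2)
```

Here two PO rows and two LPref rows expand to four OPLref rows, since each order of person 105 pairs with each lineitem supplied by that person. For M = 6 and J = 2 the formula gives fragments of size 2: three such fragments cover a CN of size 6 with two joins.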
Clustering • ID Relations are stored clustered • Eg: POL is clustered on P,O,L; LOP is clustered on L,O,P • Use the ID Relation whose clustering order matches the join direction • Eg: use POL and not LOP when evaluating CN Person-Order-Lineitem-Product from left to right
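The benefit of matching clustering order to join direction can be sketched with a sorted list standing in for a clustered relation. The rows below are made-up sample data, and `lookup_by_person` is a hypothetical helper; a real system would rely on the database's clustered storage rather than an in-memory list.

```python
from bisect import bisect_left

# POL (P_id, O_id, L_id) kept sorted on P, O, L. Rows are made up.
POL = sorted([
    (105, 127, 100),
    (105, 127, 101),
    (105, 129, 100),
    (106, 130, 102),
])

def lookup_by_person(rel, p_id):
    """Find all rows for one P_id with a binary search plus a short scan.

    This works only because the relation is sorted with P_id first.
    """
    start = bisect_left(rel, (p_id,))
    rows = []
    for row in rel[start:]:
        if row[0] != p_id:
            break
        rows.append(row)
    return rows

rows_for_105 = lookup_by_person(POL, 105)
```

When the join proceeds from Person to Order to Lineitem, all rows for a given person sit next to each other in POL. The same lookup on LOP, sorted with L_id first, would have to scan the whole relation.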
Execution Module • Get top-K results using nested loops join on each CN • Cache intermediate results. Eg: for the CN PartVCR -subpart- Part -subpart- PartTV, assume we have already evaluated p1-p2-x; if we then reach the partial result p3-p2-x, there is no need to join with PartTV again • Multithreaded execution: one thread for each CN
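The caching idea can be sketched for the three-node CN in the example. This is an illustration under assumptions, not the system's execution algorithm: `adjacent` is a hypothetical map from a part to its connected parts, and `probes` counts how often the middle node is actually joined with the PartTV set.

```python
def evaluate(vcr_parts, adjacent, tv_parts, k):
    """Nested-loops evaluation of PartVCR - Part - PartTV, stopping at k results.

    The join of each middle part with PartTV is cached, so a middle part
    reached from several VCR parts is probed only once.
    """
    cache = {}    # middle part -> TV parts connected to it
    probes = 0
    results = []
    for p in vcr_parts:
        for mid in adjacent.get(p, []):
            if mid not in cache:
                probes += 1
                cache[mid] = [t for t in adjacent.get(mid, []) if t in tv_parts]
            for t in cache[mid]:
                results.append((p, mid, t))
                if len(results) == k:
                    return results, probes
    return results, probes

# Slide scenario: p1 and p3 both reach p2, and p2 has no PartTV partner.
no_hits, probes_a = evaluate(["p1", "p3"], {"p1": ["p2"], "p3": ["p2"], "p2": []}, {"p9"}, k=10)

# Variant with a hypothetical TV part p9 attached to p2.
hits, probes_b = evaluate(["p1", "p3"], {"p1": ["p2"], "p3": ["p2"], "p2": ["p9"]}, {"p9"}, k=10)
```

In both runs `p2` is probed once even though it is reached twice, which is the saving the example describes.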
Execution Module • Execution is guided by navigation in Presentation Graph “XML storage”
Previous Work • DBXplorer (Agrawal et al. ICDE 2002), DISCOVER (Hristidis et al. VLDB 2002) • Work on Relational Databases • Execute an SQL statement for each CN • Drawbacks: • Redundancy in Presentation • No control on Storage of data • Keyword Search in Graph Databases (Goldman et al. VLDB 98) • Look for hub nodes • No schema
Experimentation: Evaluate decompositions • MinClust: Minimal, both directions of clustering • MinNClustIndx: Minimal, no clustering, indexed • Complete: all ID relations, MVD and non-MVD • XKeyword: all ID relations of size up to 2 (2 edges)
Experimentation: Evaluate decompositions • Maximum CN size = 6, top-K • 2 keywords • DBLP dataset
Evaluation of Optimized Execution Algorithm • 2 keywords • DBLP dataset
Conclusions • XKeyword is a system for plain keyword search in XML databases • Focus on: • Storage of XML in relations • Presentation • Future work • Investigate other relevance semantics. Eg: ranking based on link-structure.
Candidate Networks Generator • A keyword may appear in multiple nodes • The number of candidate networks can be very large (sometimes unbounded) • Adaptation of CN generator of DISCOVER (Hristidis et al. VLDB 2002) to XML databases
Candidate Network - Examples • CNs of size ≤ 3 for query “John, VCR” (reconstructed from the figure):
PersonJohn -supplier- Lineitem - ProductVCR
PersonJohn -supplier- Lineitem - PartVCR
PersonJohn -supplier- Lineitem - Part -subpart- PartVCR
PersonJohn - Order - Lineitem - PartVCR
PersonJohn - Order - Lineitem - ProductVCR
Candidate Networks Generator is Complete and Non-Redundant • Prove that the set of Candidate Networks generated is • Complete: every solution is generated by some CN • Non-redundant: for each CN there is a database instance where removing that CN loses a solution
Experimentation: Evaluate decompositions • Maximum CN size = 6, all results • 2 keywords • DBLP dataset