Keyword Proximity Search on XML Graphs

Vagelis Hristidis • Yannis Papakonstantinou • Andrey Balmin • University of California, San Diego

Motivation • Keyword Search is the dominant information discovery method in plain text documents • Increasing amount of data stored in XML databases

Motivation • Currently, information discovery in XML databases requires: • Knowledge of the schema • Knowledge of a query language (e.g., XQuery) • Knowledge of the role of the keywords • XKeyword eliminates these requirements

Keyword Query - Semantics • Keywords are: • in the same XML text node • in the same element • connected through edges • Example fragments from the slide: • <title>Storage of XML databases</title> • <topics> <topic>XML survey</topic> <topic>Storage of database</topic> </topics> • <topic>XML survey</topic> idref <topic>Storage of database</topic>

Result of Keyword Query • A result is a tree T of XML nodes where: • every keyword is contained in some node of T (total) • no node of T is redundant (minimal) • Score of a result: • distance of keywords within a text node • distance between keywords in number of edges • weighted distance • PageRank-like methods
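The two conditions on a result tree can be checked mechanically. The sketch below is one reading of the slide's definition, not the XKeyword implementation: it assumes a removable node is a leaf (removing an inner node would disconnect the tree) and uses the number of edges as the score. The node names (p1, l1, pr1, o1) and the function names are hypothetical.

```python
def is_total(node_keywords, query):
    """True if every query keyword appears in at least one node."""
    covered = set().union(*node_keywords.values())
    return set(query) <= covered


def leaves(node_keywords, edges):
    """Nodes with at most one incident edge."""
    degree = {node: 0 for node in node_keywords}
    for a, b in edges:
        degree[a] += 1
        degree[b] += 1
    return [node for node, d in degree.items() if d <= 1]


def is_minimal(node_keywords, edges, query):
    """True if the tree is total and dropping any leaf loses a keyword."""
    if not is_total(node_keywords, query):
        return False
    for leaf in leaves(node_keywords, edges):
        rest = {n: kws for n, kws in node_keywords.items() if n != leaf}
        if is_total(rest, query):
            return False  # this leaf is redundant
    return True


def size(edges):
    """Score used here: number of edges in the tree."""
    return len(edges)


# Hypothetical result tree: person p1 - lineitem l1 - product pr1
keywords = {"p1": {"John"}, "l1": set(), "pr1": {"VCR"}}
tree_edges = [("p1", "l1"), ("l1", "pr1")]
query = ["John", "VCR"]
```

Attaching an extra keyword-free leaf (for example an order o1 under p1) keeps the tree total but makes it non-minimal, which is the case the "no redundant node" condition rules out.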

Example - Schema • TPC-H-like schema

Example - Data

Example – Keyword Query • Query: “John, VCR” • Result trees: T1 (size = 6), T2 (size = 8)

Example – Keyword Query Target Objects

Presentation • The number of results explodes due to multivalued dependencies (MVDs) • Example results: R1. p1-l1-pa3-pa1; R2. p1-l2-pa3-pa2; R3. p1-l2-pa3-pa1; R4. p1-l1-pa3-pa2 • R3 and R4 are implied by R1 and R2!
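A small sketch of why R3 and R4 carry no new information. It assumes that each result lists (person, lineitem, part, subpart) and that, for a fixed person and part, lineitems and subparts vary independently (the MVD). The function names are hypothetical, and this is not the Presentation Graph algorithm itself.

```python
from itertools import product


def compact(results):
    """Group results by (person, part); keep the lineitem and subpart sets."""
    groups = {}
    for person, lineitem, part, subpart in results:
        lineitems, subparts = groups.setdefault((person, part), (set(), set()))
        lineitems.add(lineitem)
        subparts.add(subpart)
    return groups


def expand(groups):
    """Rebuild every result implied by the compact form (cross product)."""
    return {
        (person, lineitem, part, subpart)
        for (person, part), (lineitems, subparts) in groups.items()
        for lineitem, subpart in product(lineitems, subparts)
    }


R1 = ("p1", "l1", "pa3", "pa1")
R2 = ("p1", "l2", "pa3", "pa2")
R3 = ("p1", "l2", "pa3", "pa1")
R4 = ("p1", "l1", "pa3", "pa2")

implied = expand(compact([R1, R2]))
```

Under the stated assumption, `implied` contains all four results, so showing R1 and R2 in a compact form is enough.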

Presentation • Create a Presentation Graph for each candidate network (CN)

Demo • Demo on DBLP dataset available at www.db.ucsd.edu/XKeyword

Architecture • Query: “John, VCR” • John: person.name; VCR: part.name, product.descr • PersonJohn-Lineitem-ProductVCR, PersonJohn-Lineitem-PartVCR • Person1-Lineitem1-Product1, Person1-Lineitem2-Part1 • Person[“John”,“US”]-Lineitem[quant=6, Oct 14 2001]-Product[id=2005, descr=“Set of VCR and DVD”], …

Candidate Networks Generator • Adaptation of the CN generator of DISCOVER (Hristidis et al. VLDB 2002) to XML databases • Example: CNs of size ≤ 3 for query “John, VCR”: • PersonJohn-Lineitem-ProductVCR (supplier edge) • PersonJohn-Lineitem-PartVCR (supplier edge) • PersonJohn-Lineitem-Part-PartVCR (supplier and subpart edges) • PersonJohn-Order-Lineitem-PartVCR • PersonJohn-Order-Lineitem-ProductVCR
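A rough sketch of how path-shaped candidate networks could be enumerated. This is a simplification and not the DISCOVER or XKeyword generator, which also handles tree-shaped networks and pruning. The schema edges below are an assumed TPC-H-like schema chosen for illustration, and all names are hypothetical.

```python
from collections import defaultdict

# Assumed schema graph (undirected edges between schema nodes).
SCHEMA_EDGES = [
    ("Person", "Order"),
    ("Order", "Lineitem"),
    ("Lineitem", "Person"),   # supplier reference
    ("Lineitem", "Product"),
    ("Lineitem", "Part"),
    ("Part", "Part"),         # subpart
]


def build_adjacency(edges):
    """Map each schema node to the set of its neighbours."""
    adjacency = defaultdict(set)
    for a, b in edges:
        adjacency[a].add(b)
        adjacency[b].add(a)
    return adjacency


def path_candidate_networks(edges, start_nodes, end_nodes, max_size):
    """Enumerate schema paths of at most max_size edges that start at a
    schema node holding the first keyword and end at one holding the second.
    Schema nodes may repeat along a path (e.g. Part-Part via subpart)."""
    adjacency = build_adjacency(edges)
    found = set()
    frontier = [(start,) for start in start_nodes]
    for _ in range(max_size):
        next_frontier = []
        for path in frontier:
            for neighbour in adjacency[path[-1]]:
                extended = path + (neighbour,)
                if neighbour in end_nodes:
                    found.add(extended)
                next_frontier.append(extended)
        frontier = next_frontier
    return found


# "John" occurs under Person; "VCR" occurs under Product and Part.
cns = path_candidate_networks(SCHEMA_EDGES, {"Person"}, {"Product", "Part"}, 3)
```

With this assumed schema the enumeration yields five paths, which line up with the five CNs on the slide; a different schema would give a different set.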

Decomposer: Storing Data in XKeyword • Each target object is stored in a CLOB • Connections between target objects are stored in ID Relations • Minimal ID Relations: • Lineitem_Part: LPa (L_id, Pa_id) • Lineitem_Person_ref: LPref (L_id, P_id) • Part_Part: PaPa (Pa_id1, Pa_id2) • Example LPa tuples (L_id, Pa_id): (100, 123), (101, 123)

Decomposer • Create redundant ID Relations to improve performance • Examples: PLPa (P_id, L_id, Pa_id) = LPref ⋈ LPa; PaPaPa = PaPa ⋈ PaPa • Example CN (from the figure): PersonJohn, Lineitem, Part, PartVCR, with supplier and subpart edges • The CN is then evaluated as PJohn ⋈ PLPa ⋈ PaPaPa ⋈ PaVCR instead of PJohn ⋈ LPref ⋈ LPa ⋈ PaPa ⋈ PaPa ⋈ PaVCR • Saves 2 joins!
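A minimal sketch of how a redundant ID Relation such as PLPa could be materialized from the minimal ones. In-memory lists of tuples stand in for relational tables, the function name is hypothetical, and the sample tuples reuse the LPa and LPref values shown on the Decomposer slides.

```python
def build_plpa(lpref, lpa):
    """Join LPref (L_id, P_id) with LPa (L_id, Pa_id) on L_id and
    return PLPa tuples (P_id, L_id, Pa_id)."""
    parts_by_lineitem = {}
    for l_id, pa_id in lpa:
        parts_by_lineitem.setdefault(l_id, []).append(pa_id)
    return [
        (p_id, l_id, pa_id)
        for l_id, p_id in lpref
        for pa_id in parts_by_lineitem.get(l_id, [])
    ]


LPA = [(100, 123), (101, 123)]      # (L_id, Pa_id)
LPREF = [(100, 105), (101, 105)]    # (L_id, P_id)

PLPA = build_plpa(LPREF, LPA)
```

Storing PLPa ahead of time trades extra space for one fewer join at query time.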

Decomposer - Rules • Create redundant ID Relations when they do not involve MVDs, e.g., Person-Order-Lineitem (POL) • Avoid ID Relations that involve MVDs, e.g., Order-Person-Lineitem (OPLref): • OPLref (O_id, P_id, L_id): (127, 105, 100), (127, 105, 101), (129, 105, 100), (129, 105, 101) • PO (P_id, O_id): (105, 127), (105, 129) • LPref (L_id, P_id): (100, 105), (101, 105) • There is always a decomposition with maximum fragment size L = M/(J+1) such that any CN of size up to M is evaluated with at most J joins
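The sample tuples show why an MVD relation is wasteful: OPLref is exactly the join of PO and LPref on P_id, so it repeats every order/lineitem combination for the same person. The sketch below checks this on the slide's data. The second function evaluates the fragment-size bound; rounding up to an integer is an assumption here, since the slide states L = M/(J+1) without rounding. Function names are hypothetical.

```python
import math

PO = [(105, 127), (105, 129)]        # (P_id, O_id)
LPREF = [(100, 105), (101, 105)]     # (L_id, P_id)
OPLREF = {
    (127, 105, 100), (127, 105, 101),
    (129, 105, 100), (129, 105, 101),
}                                     # (O_id, P_id, L_id)


def join_po_lpref(po, lpref):
    """Join PO and LPref on P_id, returning (O_id, P_id, L_id) tuples."""
    return {
        (o_id, p_id, l_id)
        for p_id, o_id in po
        for l_id, lp_id in lpref
        if lp_id == p_id
    }


def max_fragment_size(max_cn_size, max_joins):
    """L = M / (J + 1), rounded up (the rounding is an assumption)."""
    return math.ceil(max_cn_size / (max_joins + 1))


rebuilt = join_po_lpref(PO, LPREF)
```

Here two PO tuples and two LPref tuples rebuild all four OPLref tuples, so storing OPLref adds rows without adding information.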

Clustering • ID Relations are stored clustered • E.g., POL is clustered on P, O, L; LOP is clustered on L, O, P • Use the ID Relation whose clustering order matches the join direction • E.g., use POL and not LOP when evaluating the CN Person-Order-Lineitem-Product from left to right

Execution Module • Get the top-K results using a nested-loops join on each CN • Cache intermediate results, e.g., for the CN PartVCR-Part-PartTV (both edges are subpart edges): if p1-p2-x has already been evaluated and we reach the partial result p3-p2-x, there is no need to join with PartTV again • Multithreaded execution: one thread for each CN
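The caching idea can be sketched for the three-node CN from the example. This is an illustration under assumptions, not the XKeyword execution module: dictionaries stand in for the ID Relations, the names are hypothetical, and `top_k` is assumed to be at least 1.

```python
def evaluate_with_cache(vcr_to_mid, mid_to_tv, top_k):
    """Nested-loops evaluation of PartVCR-Part-PartTV.

    vcr_to_mid: PartVCR id -> ids of connected middle Part nodes
    mid_to_tv:  middle Part id -> ids of connected PartTV nodes
    Returns (results, number of joins actually performed with PartTV).
    """
    cache = {}            # middle Part id -> PartTV ids already looked up
    joins_with_tv = 0
    results = []
    for p_vcr, mids in vcr_to_mid.items():
        for mid in mids:
            if mid not in cache:
                joins_with_tv += 1          # only the first visit joins
                cache[mid] = mid_to_tv.get(mid, [])
            for p_tv in cache[mid]:
                results.append((p_vcr, mid, p_tv))
                if len(results) == top_k:   # stop once K results are found
                    return results, joins_with_tv
    return results, joins_with_tv


# p1 and p3 both connect to p2, which connects to x.
vcr_to_mid = {"p1": ["p2"], "p3": ["p2"]}
mid_to_tv = {"p2": ["x"]}

results, joins = evaluate_with_cache(vcr_to_mid, mid_to_tv, top_k=10)
```

Both p1-p2-x and p3-p2-x are produced, but the join of p2 with PartTV runs only once; the second partial result reuses the cached answer.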

Execution Module • Execution is guided by navigation in Presentation Graph “XML storage”

Previous Work • DBXplorer (Agrawal et al. ICDE 2002), DISCOVER (Hristidis et al. VLDB 2002) • Work on relational databases • Execute an SQL statement for each CN • Drawbacks: • Redundancy in presentation • No control over storage of data • Keyword search in graph databases (Goldman et al. VLDB 1998) • Looks for hub nodes • No schema

Experimentation: Evaluate decompositions • MinClust: minimal, both directions of clustering • MinNClustIndx: minimal, no clustering, indexed • Complete: all ID relations, MVD and non-MVD • XKeyword: all ID relations of size up to 2 (2 edges)

Experimentation: Evaluate decompositions • Maximum CN size = 6, top-K • 2 keywords • DBLP dataset

Evaluation of Optimized Execution Algorithm • 2 keywords • DBLP dataset

Conclusions • XKeyword is a system for plain keyword search in XML databases • Focus on: • Storage of XML in relations • Presentation • Future work • Investigate other relevance semantics, e.g., ranking based on link structure

Questions?

Candidate Networks Generator • A keyword may appear in multiple nodes • The number of candidate networks can be very large (sometimes unbounded) • Adaptation of the CN generator of DISCOVER (Hristidis et al. VLDB 2002) to XML databases

Candidate Network - Examples • CNs of size ≤ 3 for query “John, VCR”: • PersonJohn-Lineitem-ProductVCR (supplier edge) • PersonJohn-Lineitem-PartVCR (supplier edge) • PersonJohn-Lineitem-Part-PartVCR (supplier and subpart edges) • PersonJohn-Order-Lineitem-PartVCR • PersonJohn-Order-Lineitem-ProductVCR

Candidate Networks Generator is Complete and Non-Redundant • We prove that the set of Candidate Networks generated is: • Complete: every solution is generated by some CN • Non-redundant: there is a database instance where removing a CN causes a solution to be lost

Experimentation: Evaluate decompositions • Maximum CN size = 6, all results • 2 keywords • DBLP dataset
