Keyword Proximity Search on XML Graphs

Vagelis Hristidis, Yannis Papakonstantinou, Andrey Balmin. University of California, San Diego. Keyword search is the dominant information discovery method in plain text documents, while an increasing amount of data is stored in XML databases.


Presentation Transcript


  1. Keyword Proximity Search on XML Graphs • Vagelis Hristidis • Yannis Papakonstantinou • Andrey Balmin University of California, San Diego

  2. Motivation • Keyword Search is the dominant information discovery method in plain text documents • Increasing amount of data stored in XML databases

  3. Motivation • Currently, information discovery in XML databases requires: • Knowledge of the schema • Knowledge of a query language (e.g., XQuery) • Knowledge of the role of the keywords • XKeyword eliminates these requirements

  4. Keyword Query - Semantics • Keywords are: • in the same XML text node, e.g. <title>Storage of XML databases</title> • in the same element, e.g. <topics> <topic>XML survey</topic> <topic>Storage of database</topic> </topics> • connected through edges, e.g. <topic>XML survey</topic> linked by an idref to <topic>Storage of database</topic>

  5. Result of Keyword Query • The result is a tree T of XML nodes where: • every keyword is contained in a node of T (total) • no node of T is redundant (minimal) • Score of a result: • distance of keywords within a text node • distance between keywords in number of edges • weighted distance • PageRank-like methods
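
The total and minimal conditions on this slide can be checked mechanically. The following is a minimal sketch, not XKeyword's implementation: the node ids (`p1`, `l1`, `pr1`, `o1`), their text, and the function names are hypothetical, and the score used is simply the number of edges. Only leaves are tested for redundancy, because dropping an internal node would disconnect the tree.

```python
def is_total(nodes, text, keywords):
    """True if every keyword occurs in the text of at least one node."""
    return all(
        any(kw.lower() in text[n].lower() for n in nodes) for kw in keywords
    )


def leaf_nodes(nodes, edges):
    """Nodes of the tree that touch at most one edge."""
    degree = {n: 0 for n in nodes}
    for a, b in edges:
        degree[a] += 1
        degree[b] += 1
    return [n for n in nodes if degree[n] <= 1]


def is_minimal(nodes, edges, text, keywords):
    """True if no leaf can be dropped while still covering all keywords."""
    for leaf in leaf_nodes(nodes, edges):
        rest = nodes - {leaf}
        if rest and is_total(rest, text, keywords):
            return False
    return True


# Hypothetical node contents, loosely modeled on the "John, VCR" example.
text = {
    "p1": "John",
    "l1": "quantity 6",
    "pr1": "Set of VCR and DVD",
    "o1": "order 127",
}
keywords = ["John", "VCR"]

tree_nodes = {"p1", "l1", "pr1"}
tree_edges = [("p1", "l1"), ("l1", "pr1")]
tree_size = len(tree_edges)  # score as distance in number of edges

# Same tree with an extra leaf that contributes no keyword.
padded_nodes = tree_nodes | {"o1"}
padded_edges = tree_edges + [("l1", "o1")]
```

Here `tree_nodes` is both total and minimal, while the padded tree is total but not minimal because `o1` can be removed.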

  6. Example - Schema TPCH-like schema

  7. Example - Data

  8. Example – Keyword Query Query: “John, VCR”

  9. Example – Keyword Query Query: “John, VCR” Result trees: T1 size = 6, T2 size = 8

  10. Example – Keyword Query Target Objects

  11. Presentation • Number of results explodes due to multivalued dependencies (MVDs) • Example results: R1: p1-l1-pa3-pa1; R2: p1-l2-pa3-pa2; R3: p1-l2-pa3-pa1; R4: p1-l1-pa3-pa2 • R3 and R4 are implied by R1 and R2!
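
To see why R3 and R4 carry no new information, one can check that the four results are exactly a cross product. This is an illustrative sketch only; the `factor` function and its grouping key are assumptions for this example, not the Presentation Graph construction used by XKeyword.

```python
from itertools import product

# The four results from the slide, as (person, lineitem, part, subpart).
results = [
    ("p1", "l1", "pa3", "pa1"),  # R1
    ("p1", "l2", "pa3", "pa2"),  # R2
    ("p1", "l2", "pa3", "pa1"),  # R3
    ("p1", "l1", "pa3", "pa2"),  # R4
]


def factor(rows):
    """Group rows by (person, part); keep groups that are full cross products."""
    groups = {}
    for person, lineitem, part, subpart in rows:
        items, subs, pairs = groups.setdefault(
            (person, part), (set(), set(), set())
        )
        items.add(lineitem)
        subs.add(subpart)
        pairs.add((lineitem, subpart))
    compact = {}
    for key, (items, subs, pairs) in groups.items():
        # An MVD holds in this group when every lineitem pairs with every subpart.
        if pairs == set(product(items, subs)):
            compact[key] = (sorted(items), sorted(subs))
    return compact


compact = factor(results)
```

The compact form stores two lineitems and two subparts once each instead of listing four result rows.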

  12. Presentation • Create a Presentation Graph for each CN

  13. Demo • Demo on DBLP dataset available at www.db.ucsd.edu/XKeyword

  14. Demo

  15. Demo

  16. Demo

  17. Architecture • Keyword query: “John, VCR” • Keyword locations: John: person.name; VCR: part.name, product.descr • Candidate networks: PersonJohn-Lineitem-ProductVCR, PersonJohn-Lineitem-PartVCR • Results as object ids: Person1-Lineitem1-Product1, Person1-Lineitem2-Part1 • Results as objects: Person[“John”, “US”]-Lineitem[quant=6, Oct 14 2001]-Product[id=2005, descr=“Set of VCR and DVD”], …

  18. Architecture

  19. Candidate Networks Generator • Adaptation of CN generator of DISCOVER (Hristidis et al. VLDB 2002) to XML databases • Example: CNs of size ≤ 3 for query “John, VCR”: • PersonJohn -(supplier)- Lineitem - ProductVCR • PersonJohn -(supplier)- Lineitem - PartVCR • PersonJohn -(supplier)- Lineitem - Part -(subpart)- PartVCR • PersonJohn - Order - Lineitem - PartVCR • PersonJohn - Order - Lineitem - ProductVCR

  20. Architecture

  21. Decomposer: Storing Data in XKeyword • Each target object is stored in a CLOB • Connections between target objects are stored in ID Relations • Minimal ID Relations: Lineitem_Part: LPa (L_id, Pa_id); Lineitem_Person_ref: LPref (L_id, P_id); Part_Part: PaPa (Pa_id1, Pa_id2) • Example LPa tuples (L_id, Pa_id): (100, 123), (101, 123)

  22. Decomposer • Create redundant ID Relations to improve performance • Examples: PLPa (P_id, L_id, Pa_id) = LPref ⋈ LPa; PaPaPa = PaPa ⋈ PaPa • Then the CN over PersonJohn, Lineitem, Part, and PartVCR (supplier and subpart edges) is evaluated as PJohn ⋈ PLPa ⋈ PaPaPa ⋈ PaVCR instead of PJohn ⋈ LPref ⋈ LPa ⋈ PaPa ⋈ PaPa ⋈ PaVCR • Saves 2 joins!
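
As a rough illustration of what a redundant ID Relation holds, the sketch below precomputes PLPa by joining LPref and LPa on L_id and counts the joins in each evaluation plan. The tuple values reuse the ids shown on the neighboring slides; the function name `join_on_lineitem` and the in-memory lists are assumptions for this example, since XKeyword stores these relations in a relational database.

```python
def join_on_lineitem(lpref, lpa):
    """Natural join of LPref (L_id, P_id) and LPa (L_id, Pa_id) on L_id."""
    parts_by_lineitem = {}
    for l_id, pa_id in lpa:
        parts_by_lineitem.setdefault(l_id, []).append(pa_id)
    return [
        (p_id, l_id, pa_id)
        for l_id, p_id in lpref
        for pa_id in parts_by_lineitem.get(l_id, [])
    ]


LPref = [(100, 105), (101, 105)]  # (L_id, P_id)
LPa = [(100, 123), (101, 123)]    # (L_id, Pa_id)

# Redundant ID Relation PLPa (P_id, L_id, Pa_id), computed once and stored.
PLPa = join_on_lineitem(LPref, LPa)

# A chain of n operands needs n - 1 joins.
minimal_plan = ["PJohn", "LPref", "LPa", "PaPa", "PaPa", "PaVCR"]
redundant_plan = ["PJohn", "PLPa", "PaPaPa", "PaVCR"]
joins_saved = (len(minimal_plan) - 1) - (len(redundant_plan) - 1)
```

The trade-off is extra storage for PLPa and PaPaPa in exchange for fewer joins at query time.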

  23. Decomposer - Rules • Create redundant ID Relations when they are not MVDs. E.g., Person-Order-Lineitem (POL) • Avoid MVD ID Relations. E.g., Order-Person-Lineitem (OPLref) • Example instance: OPLref (O_id, P_id, L_id): (127, 105, 100), (127, 105, 101), (129, 105, 100), (129, 105, 101); PO (P_id, O_id): (105, 127), (105, 129); LPref (L_id, P_id): (100, 105), (101, 105) • There is always a decomposition with fragments’ maximum size L = M/(J+1), s.t. any CN of size up to M is evaluated with at most J joins
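
The OPLref instance on this slide can be rebuilt entirely from PO and LPref, which is why storing it adds nothing. The check below is a small sketch using the tuples from the slide; the variable names are chosen for this example.

```python
PO = [(105, 127), (105, 129)]     # (P_id, O_id)
LPref = [(100, 105), (101, 105)]  # (L_id, P_id)

# OPLref (O_id, P_id, L_id) as listed on the slide.
OPLref = [
    (127, 105, 100),
    (127, 105, 101),
    (129, 105, 100),
    (129, 105, 101),
]

# Join PO and LPref on P_id: every order of a person pairs with every
# lineitem that references that person.
derived = sorted(
    (o_id, p_id, l_id)
    for p_id, o_id in PO
    for l_id, ref_p_id in LPref
    if p_id == ref_p_id
)

# 2 + 2 stored tuples reproduce all 4 OPLref tuples.
is_mvd = derived == sorted(OPLref)
```

With more orders and lineitems per person, OPLref grows as the product of the two counts while PO and LPref grow only as their sum.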

  24. Clustering • ID Relations are stored clustered • E.g., POL is clustered on P, O, L; LOP is clustered on L, O, P • Use the ID Relation whose clustering matches the join direction • E.g., use POL and not LOP when evaluating CN Person-Order-Lineitem-Product from left to right

  25. Architecture

  26. Execution Module • Get top-K results using nested loops join on each CN • Cache intermediate results. E.g., for CN PartVCR -(subpart)- Part -(subpart)- PartTV: assume we have evaluated p1-p2-x; if we reach partial result p3-p2-x, there is no need to join with PartTV again • Multithreaded execution: one thread for each CN
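
The caching idea on this slide can be sketched as a nested loops join that remembers, for each middle Part, the PartTV matches it already found. This is a simplified illustration under stated assumptions: the dictionaries `vcr_to_mid` and `mid_to_tv`, the part ids, and the function name are hypothetical, and top-K cutoff and threading are left out.

```python
def evaluate_cn(vcr_to_mid, mid_to_tv):
    """Nested loops join for PartVCR - Part - PartTV with a per-Part cache.

    vcr_to_mid: maps each PartVCR id to the middle Part ids it connects to.
    mid_to_tv:  maps each middle Part id to the PartTV ids it connects to.
    Returns the result triples and the number of lookups on the PartTV side.
    """
    cache = {}
    probes = 0
    results = []
    for vcr_part, mids in vcr_to_mid.items():
        for mid in mids:
            if mid not in cache:
                # First time this middle Part is reached: do the real join.
                probes += 1
                cache[mid] = mid_to_tv.get(mid, [])
            # Later partial results through the same Part reuse the cache.
            for tv_part in cache[mid]:
                results.append((vcr_part, mid, tv_part))
    return results, probes


# p1 and p3 both contain "VCR" and share the middle Part p2.
vcr_to_mid = {"p1": ["p2"], "p3": ["p2"]}
mid_to_tv = {"p2": ["p7"]}

results, probes = evaluate_cn(vcr_to_mid, mid_to_tv)
```

Both p1 and p3 reach p2, but the PartTV side is probed only once.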

  27. Execution Module • Execution is guided by navigation in Presentation Graph “XML storage”

  28. Previous Work • DBXplorer (Agrawal et al. ICDE 2002), DISCOVER (Hristidis et al. VLDB 2002) • Work on relational databases • Execute an SQL statement for each CN • Drawbacks: redundancy in presentation; no control over storage of data • Keyword Search in Graph Databases (Goldman et al. VLDB 98) • Looks for hub nodes • No schema

  29. Experimentation: Evaluate decompositions • MinClust: Minimal, both directions of clustering • MinNClustIndx: Minimal, no clustering, indexed • Complete: All ID relations, MVD and non-MVD • XKeyword: All ID relations of size up to 2 (2 edges)

  30. Experimentation: Evaluate decompositions • Maximum CN size = 6, top-K • 2 keywords • DBLP dataset

  31. Evaluation of Optimized Execution Algorithm • 2 keywords • DBLP dataset

  32. Conclusions • XKeyword is a system for plain keyword search in XML databases • Focus on: • Storage of XML in relations • Presentation • Future work • Investigate other relevance semantics, e.g., ranking based on link structure

  33. Questions?

  34. Candidate Networks Generator • A keyword may appear in multiple nodes • The number of candidate networks can be too large (sometimes unbounded) • Adaptation of CN generator of DISCOVER (Hristidis et al. VLDB 2002) to XML databases

  35. Candidate Network - Examples • CNs of size ≤ 3 for query “John, VCR”: • PersonJohn -(supplier)- Lineitem - ProductVCR • PersonJohn -(supplier)- Lineitem - PartVCR • PersonJohn -(supplier)- Lineitem - Part -(subpart)- PartVCR • PersonJohn - Order - Lineitem - PartVCR • PersonJohn - Order - Lineitem - ProductVCR

  36. Candidate Networks Generator is Complete and Non-Redundant • Prove that the set of Candidate Networks generated is: • Complete: every solution is generated by some CN • Non-redundant: for each CN there is a database instance where removing that CN loses a solution

  37. Experimentation: Evaluate decompositions • Maximum CN size = 6, all results • 2 keywords • DBLP dataset
