Keyword Proximity Search on XML Graphs

Vagelis Hristidis, Yannis Papakonstantinou, Andrey Balmin @UCSD
Presenter: Feng Shao

Outline
• Introduction
• Proximity Keyword Query Semantics
• Architecture
• XML Decompositions
• Execution
• Experiment
• Conclusion

Introduction
• Keyword search is easy to use
• No need to know the structure or the query language
• XML: a labeled graph representing semistructured, self-describing data
• Feb. 10: 5th birthday of XML (from www.w3c.org)

Problem: keyword proximity query
• Input: a set of keywords
• Results: trees of XML fragments (called target objects) that contain all the keywords, ranked by size
• Assumes that a schema exists; the schema facilitates the presentation of the results and is used to optimize the performance of the system

Name[John]personsupplierlineitemlinepartproductdescr[set of VCR and DVD] , size 6 Name[John]personsupplierlineitemlinepartpartsubpartpartname[VCR], size 8

Challenges
• Presentation of result graphs:
  • Semantically meaningful
  • Avoid a huge number of trivial results
• Providing fast response time
  • Efficient storage of data
  • On-demand execution, guided by the user's navigation

Semantics
• XML graph: a labeled graph
  • Node v: id(v), label λ(v), value val(v)
  • Edge: containment and reference edges
• Schema graph: a directed graph
  • Node vs: label λ(vs), content type type(vs) (all or choice)
  • Edge es: containment or reference, annotated with a maximum occurrence occ(es)
• An XML graph conforms to a schema graph

[Figure: schema graph and XML graph]

Query semantics
• Result: the set of all possible Minimal Total Target Object Networks (MTTONs)
• What is an MTTON?
  • Node network j: a cycle-free subgraph of the XML graph G, such that each edge in j is an edge in G
  • Total node network j of keywords {k1,…,km}: a node network in which every keyword is contained in at least one node n of j
  • Minimal Total Node Network (MTNN): a total node network j from which no node can be removed such that j is still a total node network. Score: number of edges
  • Target object of node n: a segment of the XML graph that is large enough to be meaningful and to semantically identify the node n, while being as small as possible
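The MTNN definition can be checked mechanically on a small example. The sketch below is only an illustration, not the XKeyword implementation: it assumes the network is given as a node set plus undirected edges that already form a tree, and the helper names (`is_connected`, `is_total`, `is_mtnn`, `score`) and the keyword assignment are hypothetical.

```python
def is_connected(nodes, edges):
    """True if the (non-empty) node set is connected using only edges inside it."""
    nodes = set(nodes)
    if not nodes:
        return False
    adj = {n: set() for n in nodes}
    for u, v in edges:
        if u in nodes and v in nodes:
            adj[u].add(v)
            adj[v].add(u)
    start = next(iter(nodes))
    seen, stack = {start}, [start]
    while stack:
        for m in adj[stack.pop()]:
            if m not in seen:
                seen.add(m)
                stack.append(m)
    return seen == nodes


def is_total(nodes, keywords_of, query):
    """True if every query keyword is contained in at least one node."""
    found = set()
    for n in nodes:
        found |= set(keywords_of.get(n, ()))
    return set(query) <= found


def is_mtnn(nodes, edges, keywords_of, query):
    """Total node network from which no single node can be removed."""
    nodes = set(nodes)
    if not (is_connected(nodes, edges) and is_total(nodes, keywords_of, query)):
        return False
    for n in nodes:
        rest = nodes - {n}
        if is_connected(rest, edges) and is_total(rest, keywords_of, query):
            return False  # a smaller total node network exists
    return True


def score(edges):
    """Score of a node network: its number of edges."""
    return len(edges)


# The size-6 result from the earlier slide, with an assumed keyword assignment.
edges = [("name", "person"), ("person", "supplier"), ("supplier", "lineitem"),
         ("lineitem", "linepart"), ("linepart", "product"), ("product", "descr")]
nodes = {n for e in edges for n in e}
keywords_of = {"name": {"John"}, "descr": {"VCR", "DVD"}}
query = {"John", "VCR"}

minimal = is_mtnn(nodes, edges, keywords_of, query)
# Attaching an extra keyword-free leaf makes the network non-minimal.
padded = is_mtnn(nodes | {"nation"}, edges + [("person", "nation")],
                 keywords_of, query)
```

Here `minimal` is true and `score(edges)` is 6, while the padded network is rejected because the `nation` leaf can be removed without losing a keyword.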

MTTON (cont.)
• Given an MTNN j with nodes v1, . . . , vn, there is a corresponding MTTON t, which is a tree whose nodes are a minimal set of target objects {t1, . . . , tm} such that for every node vk ∈ j there is a tl ∈ t with target(vk) = tl.
• There is an edge from a target object ti to a target object tj if there is an edge (or a path) from a node that belongs to ti to a node that belongs to tj.
• The score of an MTTON t is the score of its corresponding MTNN.
• MTNN: name - person - nation
• MTNN: name
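The mapping from an MTNN to its MTTON can be sketched as a collapse of nodes into their target objects. This is a simplified illustration that assumes every node of the MTNN has a target object; the function name `to_mtton` and the target object ids (`"person#1"`, `"lineitem#7"`) are hypothetical.

```python
def to_mtton(mtnn_nodes, mtnn_edges, target):
    """Collapse an MTNN into target objects.

    target maps each node to its target object id (assumed total here).
    Edges between nodes of the same target object disappear; the score
    stays the score of the MTNN, i.e. its number of edges.
    """
    to_nodes = {target[n] for n in mtnn_nodes}
    to_edges = {(target[u], target[v])
                for u, v in mtnn_edges
                if target[u] != target[v]}
    return to_nodes, to_edges, len(mtnn_edges)


# MTNN name - person - nation, all inside one assumed target object.
nodes = {"name", "person", "nation"}
edges = [("person", "name"), ("person", "nation")]
target = {"name": "person#1", "person": "person#1", "nation": "person#1"}
single = to_mtton(nodes, edges, target)

# A two-object case: the supplier node belongs to a different target object.
nodes2 = {"supplier", "person", "name"}
edges2 = [("supplier", "person"), ("person", "name")]
target2 = {"supplier": "lineitem#7", "person": "person#1", "name": "person#1"}
double = to_mtton(nodes2, edges2, target2)
```

In the first case the three-node MTNN collapses to a single target object with no edges but keeps score 2; in the second, one edge between the two target objects survives.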

MTNN & MTTON
Name[John] - person - supplier - lineitem - linepart - part - subpart - part - name[VCR]

Target object
• Defined by an administrator using the Target Schema Segment (TSS) graph
• TSS graph GTSS: built from a partial mapping of the nodes of G
  • A node tS is created in GTSS for each set S = {s1, . . . , sw} of nodes of G that are mapped to tS.
  • An edge (tS, tS') is created in GTSS if the schema graph has nodes s ∈ S and s' ∈ S' that are connected directly through an edge (s, s') or indirectly through a path of dummy schema nodes.
• Target decomposition: given the TSS graph, decompose the XML graph into target objects that are connected to each other
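The TSS graph construction can be sketched as follows. This is an illustration under stated assumptions rather than the system's code: schema edges are directed pairs, unmapped schema nodes are treated as the dummy nodes, connections that start and end in the same TSS node are dropped for simplicity, and the schema fragment and TSS names are hypothetical.

```python
def build_tss_graph(schema_edges, tss_of):
    """Derive TSS nodes and edges from a schema graph.

    schema_edges: iterable of directed (s, s2) schema edges.
    tss_of: partial mapping schema node -> TSS node; unmapped nodes are dummies.
    """
    succ = {}
    for s, s2 in schema_edges:
        succ.setdefault(s, set()).add(s2)

    tss_nodes = set(tss_of.values())
    tss_edges = set()
    for s in tss_of:
        # Walk forward from s, passing only through dummy schema nodes.
        stack, seen = list(succ.get(s, ())), set()
        while stack:
            n = stack.pop()
            if n in seen:
                continue
            seen.add(n)
            if n in tss_of:
                if tss_of[n] != tss_of[s]:
                    tss_edges.add((tss_of[s], tss_of[n]))
            else:
                stack.extend(succ.get(n, ()))
    return tss_nodes, tss_edges


# Hypothetical schema fragment: supplier is unmapped, so it acts as a dummy node.
schema_edges = [("lineitem", "supplier"), ("supplier", "person"),
                ("person", "name")]
tss_of = {"lineitem": "Lineitem", "person": "Person", "name": "Person"}
tss_nodes, tss_edges = build_tss_graph(schema_edges, tss_of)
```

The indirect connection lineitem - supplier - person becomes the single TSS edge (Lineitem, Person), while the person - name edge stays inside the Person segment.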

Example

Presentation Graph
• Naïve method: multiple threads evaluate various plans for producing MTTONs and output results as they come.
  • Pro: fast response time
  • Con: many trivial results
• Interactive interface: allows navigation and hides the trivial results

Presentation Graph

Architecture

Load Stage
• Keyword: <TO_id, node_id, schema_node>
• The number of nodes of each type, etc.
• A decomposition of the TSS graph into fragments, which correspond to connection relations that allow efficient retrieval of MTTONs
• Given an object id, instantly return the whole target object
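The keyword entries built at load time can be pictured as an inverted index from each keyword to <TO_id, node_id, schema_node> triplets. The sketch below is an assumption about the general shape only: whitespace tokenization, lowercasing, the function name `build_keyword_index`, and the sample ids are all hypothetical.

```python
from collections import defaultdict


def build_keyword_index(nodes):
    """Map each keyword to the (TO_id, node_id, schema_node) entries containing it.

    nodes: iterable of (to_id, node_id, schema_node, value) tuples.
    """
    index = defaultdict(list)
    for to_id, node_id, schema_node, value in nodes:
        # Assumed tokenization: lowercase, split on whitespace, one entry per word.
        for word in set(value.lower().split()):
            index[word].append((to_id, node_id, schema_node))
    return index


# Hypothetical nodes with made-up target object and node ids.
nodes = [
    ("product#3", 17, "descr", "set of VCR and DVD"),
    ("part#9", 42, "name", "VCR"),
    ("person#1", 5, "name", "John"),
]
index = build_keyword_index(nodes)
vcr_hits = index["vcr"]
```

A lookup for a query keyword such as "VCR" then returns both the `descr` node and the `name` node, each tagged with the target object it belongs to.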

Example of decomposition

Query processing
• Keywords: TV, VCR
• Keyword: <TO_id, node_id, schema_node>

Execution Plan
[Figure: Candidate Network (schema graph and TSS graph), Candidate TSS Network (connection relations schema), Execution Plan]

XML Decomposition
• Decompose the TSS graph into fragments
• Determines how the connections are stored in the database
• Can dramatically change the performance
• Example (figure)

Decomposition Tradeoff
• Number of fragments vs. performance
• Minimal decomposition
  • A fragment is built for each edge of the TSS graph
  • A candidate TSS network C of size S requires S-1 joins
• Maximal decomposition
  • A fragment F is built for every possible candidate TSS network C
  • C requires zero joins
  • Not feasible in practice
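The join counts at the two extremes follow from how many fragments are needed to cover a candidate TSS network. The sketch below is a simplified best-case estimate, assuming the network size counts edges and that fragments of the given maximum size can tile the network without overlap; `joins_needed` is a hypothetical helper, not part of the system.

```python
import math


def joins_needed(network_size, max_fragment_size):
    """Best-case number of joins to evaluate a candidate TSS network.

    network_size: number of edges in the network.
    max_fragment_size: largest number of edges a single fragment may cover.
    One connection relation per fragment, and k relations need k - 1 joins.
    """
    if network_size < 1 or max_fragment_size < 1:
        raise ValueError("sizes must be positive")
    fragments = math.ceil(network_size / max_fragment_size)
    return fragments - 1


minimal = joins_needed(6, 1)       # one fragment per edge: S - 1 joins
maximal = joins_needed(6, 6)       # one fragment covers the whole network
intermediate = joins_needed(6, 2)  # fragments of up to two edges
```

For a network of size 6 this gives 5 joins for the minimal decomposition, 0 for the maximal one, and 2 when fragments cover up to two edges.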

Tradeoff (cont.)
• Clustering and indexing are critical
  • Maximal decomposition: multi-attribute indices
  • Non-maximal decomposition: a connection relation R is clustered on the direction in which R is used
• Example
• Classify TSS graph fragments based on the storage redundancy in the corresponding connection relations
  • 4NF, inlined (non-MVD, non-4NF)
• Decomposition algorithm
  • See paper

Execution
• Goal: fast response time
• Web search engine-like presentation
• Use inlined decomposition
• Use a thread pool
• Use nested-loop joins
• Example: outermost loop over TSS partVCR,name
• Optimization: store partial results
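The effect of storing partial results in a nested-loop evaluation can be sketched with a small memoized join. This is not the system's execution algorithm: it assumes each connection relation is a dictionary from a source id to the ids it connects to, and the names `expand`, `nested_loop_join`, and the sample ids are hypothetical.

```python
def expand(relations, level, node_id, cache, stats):
    """Return all path suffixes reachable from node_id via relations[level:]."""
    if level == len(relations):
        return [()]
    key = (level, node_id)
    if cache is not None and key in cache:
        return cache[key]  # reuse a stored partial result
    stats["lookups"] += 1  # one probe of a connection relation
    suffixes = [
        (nxt,) + rest
        for nxt in relations[level].get(node_id, ())
        for rest in expand(relations, level + 1, nxt, cache, stats)
    ]
    if cache is not None:
        cache[key] = suffixes
    return suffixes


def nested_loop_join(relations, start_ids, use_cache=True):
    """Nested-loop join along a chain of relations; returns (paths, lookups)."""
    cache = {} if use_cache else None
    stats = {"lookups": 0}
    results = [
        (s,) + rest
        for s in start_ids  # outermost loop
        for rest in expand(relations, 0, s, cache, stats)
    ]
    return results, stats["lookups"]


# Hypothetical chain of three connection relations.
relations = [
    {"a1": ["p1", "p2"], "a2": ["p1"]},
    {"p1": ["x"], "p2": ["x"]},
    {"x": ["y1", "y2"]},
]
cached_results, cached_lookups = nested_loop_join(relations, ["a1", "a2"])
naive_results, naive_lookups = nested_loop_join(relations, ["a1", "a2"],
                                                use_cache=False)
```

Both runs produce the same six paths, but the cached run probes the relations 5 times instead of 8 because the expansions of `p1` and `x` are reused.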

Execution
• Presentation graphs (on-demand)
• Initially, the XKeyword decomposition is used to retrieve the top result of each CN.
• Then a combination of decompositions is used to find the minimal connection of the expanded nodes.

Experiments
• Measure various decompositions, for top-K and full results
• Evaluate the performance of the algorithm for the search engine-like presentation method and the on-demand expansion method
• Data: DBLP XML database, 2 keywords
• Maximum size of CTSSN: M = 6
• Maximum size of fragments: L = 2

Decompositions

Execution algorithm
• Speedup: optimized algorithm relative to the naïve, non-caching algorithm

Execution algorithm
• Keyword queries: the names of two authors, k1 and k2
• Candidate Network: Author[k1] - Paper - Author[k2]
• Time measured: average time to expand a Paper node

Conclusion
• XKeyword is built on a relational database and, hence, can accommodate very large graphs.
• Presents keyword proximity search semantics, extended to capture the novel result presentation method.
• Presents an architecture that allows choosing which connections will be precomputed.
• Addresses the on-demand performance requirement.
• Demo: http://www.db.ucsd.edu/Xkeyword
