LCA-Based Selection for XML Document Collections

Georgia Koloniari, joint work with Evaggelia Pitoura
Department of Computer Science, University of Ioannina, Greece
http://dmod.cs.uoi.gr

What is the topic of this talk?
Fundamental question: given a query and many available data sources with large volumes of data, select the sources most relevant to the query and filter out the irrelevant ones
DMOD Laboratory, University of Ioannina, HDMS 2010

What is the topic of this talk? More formally:
• Source/Database Selection: Problem Definition
• Given a query q and a set of data sources, rank the data sources according to the relevance (called goodness) of their data to q
• Evaluate q against the most relevant (best) data sources

Source Selection Problem: Previous research
Database selection for:
• relational databases: Sayyadian et al. [ICDE '07], Yu et al. [SIGMOD '07], Vu et al. [SIGMOD '08]
• textual document collections: Callan et al. [SIGIR '95], Gravano et al. [ACM Trans. Database Syst. '99]
However, many data sources hold XML documents

In this paper: the source selection problem for XML document collections
XML Selection Problem: Definition
Given a set of N distributed collections of XML documents and a query q, rank the collections based on their goodness (i.e., relevance) to q
Keyword queries, q = (w1, w2, …, wk)

OUTLINE
In the rest of this talk:
• What is different with XML?
• Our LCA-based approach
• Define goodness for a database of XML documents
• How to compute goodness for a given query
• Using pre-computed summaries
• Experimental evaluation

Keyword search for XML Documents: an example
Query: Atre RDF
[Figure: XML tree of a conference document (WWW 2010) with conf, cname, year, paper, demo, author, title and name elements]
• Search for nodes that contain the keywords (as their label, their content, or the label or value of one of their attributes)
• Result: the subtrees whose nodes contain all the keywords

Keyword search for XML Documents: an example
Query: Atre RDF
[Figure: XML tree of a conference document (HDMS 2010) with conf, cname, year, paper, demo, author, title and name elements]
The Lowest Common Ancestor (LCA) of a set of nodes V′ = {v1, …, vk} (V′ ⊆ V) is the deepest node v in a tree T that is an ancestor of all nodes in V′

Keyword search for XML Documents: LCA semantics
• Keyword query q = (w1, w2, …, wk)
• An unordered labeled XML tree T = (V, E) of an XML document d
• An element (node) v ∈ V contains a keyword wi: contains(v, wi)
• Si = {v | v ∈ V and contains(v, wi)}, 1 ≤ i ≤ k (the set of nodes that contain keyword wi)
• Basic LCA approach: Result(q) ⊆ lca(S1, …, Sk), where lca(S1, …, Sk) evaluates to the set of LCA nodes v such that v = lca(v1, …, vk) for some v1 ∈ S1, …, vk ∈ Sk (at least one occurrence of each keyword)
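
The basic LCA semantics defined on this slide can be illustrated with a minimal sketch. It assumes a tree stored as a child-to-parent dictionary and treats a node as an ancestor of itself; the node names and the helper functions `ancestors`, `lca` and `result` are hypothetical and not taken from the paper.

```python
from itertools import product


def ancestors(parent, v):
    """Return the path from v up to the root, with v itself first."""
    path = [v]
    while parent[v] is not None:
        v = parent[v]
        path.append(v)
    return path


def lca(parent, nodes):
    """Deepest node that is an ancestor (or self) of every node in `nodes`."""
    common = set(ancestors(parent, nodes[0]))
    for v in nodes[1:]:
        common &= set(ancestors(parent, v))
    # Walking upwards from one node, the first common ancestor is the deepest.
    for v in ancestors(parent, nodes[0]):
        if v in common:
            return v


def result(parent, keyword_sets):
    """lca(S1, ..., Sk): one LCA per choice of one node from each S_i."""
    return {lca(parent, list(choice)) for choice in product(*keyword_sets)}


# Toy document (assumed for illustration): one conference with two papers.
parent = {
    "conf": None,
    "paper1": "conf", "paper2": "conf",
    "author1": "paper1", "title1": "paper1",
    "author2": "paper2", "title2": "paper2",
}

print(result(parent, [["author1"], ["title1"]]))  # {'paper1'}
```

Enumerating every combination of keyword occurrences is exponential in the number of keywords, which is acceptable only for a toy example like this one.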

Keyword search for XML Documents: LCA semantics
Query: paper van Zwol
[Figure: conference tree (WWW 2010) with the SLCA node marked]

Keyword search for XML Documents: LCA semantics
Query: paper van Zwol
[Figure: conference tree (WWW 2010) with the ELCA nodes marked]

Lowest Common Ancestor
• Many variations:
• Only structural (Smallest LCA, Exclusive LCA, etc.)
• Based on the schema of the documents (Meaningful LCA, Valuable LCA, which also use node/element types)
• Additionally using IR-based statistics
• We do not propose yet another one; instead we use the basic LCA (the Result(q) set)
• Most others can be implemented by filtering our results (details in the paper)
• Experimental evaluation on ELCA

Keyword search for XML Documents: Ranking
Query: paper van Zwol
[Figure: conference tree (WWW 2010) with the ELCA nodes marked]
Structure is used to improve the quality of the result: rank results based on the distance of the keywords from their LCA

Keyword search for XML Documents: Ranking
The height of the LCA node v ∈ Result(q): the maximum distance of any of the keywords of q in the XML tree to their LCA node
Query: o, b
[Figure: example XML tree with LCA nodes of height 1 and height 2 marked]

Keyword search for XML Documents: Ranking
Query: paper van Zwol
[Figure: conference tree (WWW 2010) with ELCA nodes of height 4 and height 3 marked]

Keyword search for XML Documents: Relevance
Not all trees that contain the keywords are relevant
Query: demo RDF Pound
[Figure: XML tree of a conference document (WWW 2010) with conf, name, year, paper, demo, author and title elements]
Exclude some of the results as not relevant based on height

Database Selection: Document relevance
A user is interested in d as a result for q iff the distance (height) of a result in d is lower than or equal to a threshold λ
• Boolean Problem: d matches q if it has a result node v with h(v) ≤ λ
• Weighted Problem: F(h(v)), a function F of the height h of a result node v such that the similarity of d to q is greater when h(v) is small

Database Selection
A database D is ranked based on its goodness to q, obtained by aggregating the relevance of its documents
• The goodness measure ranks highly collections that:
• have a large number of documents with relatively small similarity scores
• have fewer documents but with higher similarity scores
• The threshold limits the tendency to favor large collections over more relevant ones

OUTLINE
In the rest of this talk:
• What is different with XML?
• Our LCA-based approach
• Define goodness for a database of XML documents
• How to compute goodness for a given query
• Using pre-computed summaries
• Experimental evaluation

Goodness Estimation
To estimate the goodness of a collection D for a keyword query q, the straightforward approach is:
• For each document d ∈ D:
• Evaluate q against d
• Find all the LCA nodes in d of the k keywords that appear in q (Result(q))
• Select the v ∈ Result(q) with the minimum height
• If h(v) ≤ λ, the Boolean model returns a match; the weighted model computes the similarity based on the function F
• Aggregate over all d ∈ D
Computing the LCA online for each query is expensive
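
The straightforward approach listed on this slide could be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: documents are (parent map, keyword-to-nodes map) pairs, the aggregate is a plain sum, the threshold is applied in both models, and the example weighting `F(h) = 1 / (1 + h)` is hypothetical since the slide's formulas did not survive in the transcript.

```python
from itertools import product


def path_to_root(parent, v):
    """Path from v up to the root, with v itself first."""
    path = [v]
    while parent[v] is not None:
        v = parent[v]
        path.append(v)
    return path


def lca_with_height(parent, nodes):
    """Return (lca, height); height is the largest distance from the LCA
    down to any of the given keyword nodes."""
    paths = [path_to_root(parent, v) for v in nodes]
    common = set(paths[0]).intersection(*map(set, paths[1:]))
    anc = next(v for v in paths[0] if v in common)
    return anc, max(p.index(anc) for p in paths)


def min_height(parent, keyword_sets):
    """Minimum height over all LCA nodes; None if a keyword is absent."""
    if any(not s for s in keyword_sets):
        return None
    return min(lca_with_height(parent, list(c))[1]
               for c in product(*keyword_sets))


def goodness(docs, query, lam, F=None):
    """Sum of per-document scores: 1 per match (Boolean) or F(h) (weighted)."""
    total = 0.0
    for parent, contains in docs:
        h = min_height(parent, [contains.get(w, []) for w in query])
        if h is None or h > lam:
            continue  # document is not relevant under threshold lam
        total += 1.0 if F is None else F(h)
    return total


# Two toy documents sharing one tree shape (assumed for illustration).
tree = {"conf": None, "paper1": "conf", "paper2": "conf",
        "author1": "paper1", "title1": "paper1",
        "author2": "paper2", "title2": "paper2"}
docs = [
    (tree, {"Atre": ["author1"], "RDF": ["title1"]}),  # LCA height 1
    (tree, {"Atre": ["author1"], "RDF": ["title2"]}),  # LCA height 2
]

print(goodness(docs, ("Atre", "RDF"), lam=1))  # 1.0
```

Every query re-enumerates keyword occurrences in every document, which is the online cost the slide calls expensive.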

Goodness Estimation
To avoid this cost at execution time:
• Pre-compute the LCA nodes for all possible combinations of keywords that appear in each d and maintain their heights
Number of computed LCA nodes for an XML document with n keywords:

Pair-Wise Goodness Estimation
OUR APPROACH: we maintain height information only for pairs of keywords and use it to estimate the height of the LCA for more than 2 keywords
• For each distinct pair of keywords (wi, wj) in a document d, we maintain:
• the height hmin(i, j) of the LCA node v ∈ lca(Si, Sj) with h(v) ≤ h(u), ∀u ∈ lca(Si, Sj) (the lowest LCA), and
• the height hmax(i, j) of the LCA node v ∈ lca(Si, Sj) with h(v) ≥ h(u), ∀u ∈ lca(Si, Sj) (the highest LCA)

Pairwise-based Height Estimation
Proposition. Let G(V, E) be an acyclic directed graph, and V′ = {v1, …, vM} any subset of M nodes in G, V′ ⊆ V. Then h(lca(v1, …, vM)) = max over vi, vj ∈ V′ of h(lca(vi, vj)).
If the keywords are distinct (just a single LCA), then it is easy to see that the height is equal to the maximum of the pairwise LCA heights
Else, we get estimations

Pair-based Height Estimation
• Hmin(d, q): the maximum value of the minimum LCA height values for any pair of keywords in q
• Hmax(d, q): the maximum value of the maximum LCA height values for any pair of keywords in q
Theorem. Given a keyword query q and a document d, the height of any v ∈ Result(q) is such that Hmin(d, q) ≤ h(v) ≤ Hmax(d, q)
Example. Query: o, b, a. Pairwise heights (hmin-hmax): (o, b) → 1-3, (o, a) → 2-3, (b, a) → 1-3, so Hmin(d, q) = 2 and Hmax(d, q) = 3
[Figure: the example XML tree]
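
The bounds Hmin(d, q) and Hmax(d, q) can be computed from the stored pairwise heights as in the sketch below. The table layout (a dictionary keyed by unordered keyword pairs) and the function name `height_bounds` are assumptions made for illustration; the numbers are the ones from the slide's example.

```python
from itertools import combinations


def height_bounds(pair_heights, query):
    """Return (Hmin, Hmax) for a query, or None if some keyword pair
    never co-occurs in the document.

    pair_heights maps frozenset({wi, wj}) -> (hmin, hmax).
    """
    if len(query) < 2:
        raise ValueError("pairwise estimation needs at least two keywords")
    lows, highs = [], []
    for wi, wj in combinations(query, 2):
        entry = pair_heights.get(frozenset((wi, wj)))
        if entry is None:
            return None
        lows.append(entry[0])
        highs.append(entry[1])
    # Both bounds take the maximum over all keyword pairs of the query.
    return max(lows), max(highs)


# Pairwise heights from the slide: (o, b) -> 1-3, (o, a) -> 2-3, (b, a) -> 1-3
pair_heights = {
    frozenset(("o", "b")): (1, 3),
    frozenset(("o", "a")): (2, 3),
    frozenset(("b", "a")): (1, 3),
}

print(height_bounds(pair_heights, ["o", "b", "a"]))  # (2, 3)
```

Only one table entry per keyword pair is needed, instead of one per keyword combination.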

Boolean Goodness Estimation
• If Hmin(d, q) > λ: not relevant (no false negatives)
• If Hmin(d, q) ≤ λ and Hmax(d, q) ≤ λ: relevant
• If Hmin(d, q) ≤ λ and Hmax(d, q) > λ: relevant, but false positives are possible
For the weighted problem and the goodness estimation bounds, details are in the paper
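
The three cases on this slide amount to a small decision function over the bounds Hmin(d, q) and Hmax(d, q) and the threshold λ. The function name `classify` and the returned labels are hypothetical, chosen only for this sketch.

```python
def classify(h_min, h_max, lam):
    """Boolean relevance decision from the pairwise height bounds."""
    if h_min > lam:
        # Every result is higher than the threshold: safe to discard.
        return "not relevant"
    if h_max <= lam:
        # Every result is within the threshold.
        return "relevant"
    # The threshold falls between the two bounds.
    return "relevant (possible false positive)"


# With the bounds Hmin = 2 and Hmax = 3 from the previous slide's example:
for lam in (1, 2, 3):
    print(lam, classify(2, 3, lam))
```

The first test is the one that guarantees no false negatives, because Hmin is a lower bound on the height of every result.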

Summarizing the matrices
Even with the optimizations, the information to be maintained may remain large, so we use summaries to reduce its size
Our summaries are based on Bloom filters

Bloom-based Summaries: Bloom Filters
• Compact data structures for a probabilistic representation of a set
• Used to answer membership queries
• A bit vector of m bits, initially all set to 0, and l hash functions, each mapping an element to a position in 0 … m - 1
• Insert x in the Bloom filter: apply the l hash functions and set the corresponding bits to 1
• Example with a bit vector v of m = 10 bits: h1(x) = 4, h2(x) = 2, h3(x) = 5, h4(x) = 8
• Test whether y is in the set (look up y): apply the same hash functions again and check the corresponding bits
• Tunable probability of a false positive: the probability of incorrectly identifying an element as a match
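
A Bloom filter as described on this slide fits in a few lines. The sketch below assumes salted SHA-256 digests as the hash functions, which is a choice made here for illustration; the slide does not say which hash functions the authors use.

```python
import hashlib


class BloomFilter:
    """Bit vector of m bits with num_hashes hash functions."""

    def __init__(self, m, num_hashes):
        self.m = m
        self.num_hashes = num_hashes
        self.bits = [0] * m  # all bits start at 0

    def _positions(self, key):
        # One salted digest per hash function, reduced to 0 .. m - 1.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{key}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.m

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos] = 1

    def __contains__(self, key):
        # True may be a false positive; False is always correct.
        return all(self.bits[pos] for pos in self._positions(key))


bf = BloomFilter(m=64, num_hashes=4)
bf.add("x")
print("x" in bf)  # True
```

Increasing m or tuning the number of hash functions lowers the false positive probability at the cost of space or lookup time.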

Bloom-based Summaries for the Boolean Problem
• For each d in D, maintain two Bloom filters:
• BFmin(d) for the hmin(i, j) values and
• BFmax(d) for the hmax(i, j) values
• of each distinct keyword pair (wi, wj) in d
• Given a similarity threshold λ, for all (wi, wj) in d:
• if hmin(i, j) ≤ λ, then (wi, wj) is hashed as one key and inserted into BFmin(d)
• if hmax(i, j) ≤ λ, then (wi, wj) is also inserted into BFmax(d)

Bloom-based Summaries for the Boolean Problem
• Similarity evaluation of d to q:
• every pair of keywords of q is looked up in BFmin(d); if one is not found, d is not relevant
• else, we also look them up in BFmax(d); if all are found, d is definitely relevant, else it is relevant but with a false positive probability
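
Building BFmin(d) and BFmax(d) and evaluating a query against them might look like the sketch below. The compact `Bloom` class, the `pair_key` encoding of a keyword pair as one key, and the filter sizes are all assumptions for illustration rather than the authors' design.

```python
import hashlib
from itertools import combinations


class Bloom:
    """Minimal Bloom filter using salted SHA-256 digests (an assumption)."""

    def __init__(self, m=256, num_hashes=4):
        self.m, self.num_hashes, self.bits = m, num_hashes, [0] * m

    def _positions(self, key):
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{key}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.m

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos] = 1

    def __contains__(self, key):
        return all(self.bits[pos] for pos in self._positions(key))


def pair_key(wi, wj):
    """Hash a keyword pair as one key, independent of keyword order."""
    return "|".join(sorted((wi, wj)))


def build_summaries(pair_heights, lam):
    """pair_heights maps (wi, wj) -> (hmin, hmax) for one document."""
    bf_min, bf_max = Bloom(), Bloom()
    for (wi, wj), (hmin, hmax) in pair_heights.items():
        if hmin <= lam:
            bf_min.add(pair_key(wi, wj))
        if hmax <= lam:
            bf_max.add(pair_key(wi, wj))
    return bf_min, bf_max


def evaluate(bf_min, bf_max, query):
    keys = [pair_key(wi, wj) for wi, wj in combinations(query, 2)]
    if not all(k in bf_min for k in keys):
        return "not relevant"
    if all(k in bf_max for k in keys):
        return "relevant"
    return "relevant (possible false positive)"


pair_heights = {("o", "b"): (1, 3), ("o", "a"): (2, 3), ("b", "a"): (1, 3)}
bf_min, bf_max = build_summaries(pair_heights, lam=3)
print(evaluate(bf_min, bf_max, ["o", "b", "a"]))  # relevant
```

Because the threshold λ is baked into the filters at build time, a different threshold requires rebuilding them. Bloom false positives can also turn a "not relevant" answer into a match, which is the extra estimation error reported for the Bloom-based approach.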

Bloom-based Summaries for the Weighted Problem
• Group the keyword pairs according to their hmin(i, j) (hmax(i, j)) value and use a separate Bloom filter for each such group (distance)
• Compute the similarity by applying F on the number of the highest level for which there was a hit for any of the keyword pairs of the query
[Figure: the example XML tree]

OUTLINE
In the rest of this talk:
• What is different with XML?
• Our LCA-based approach
• Define goodness for a database of XML documents
• How to compute goodness for a given query
• Using pre-computed summaries
• Experimental evaluation

Experimental Evaluation
• Goodness estimation of a single collection
• Accuracy of the ranking based on goodness
We consider four approaches for goodness evaluation:
• keyword: ignores structure, based solely on the appearance of the keywords
• tree: exact evaluation based on ELCA semantics
• pair: pairwise estimation
• bloom: pairwise + Bloom-based summaries
Experiments on both synthetic and real datasets

Goodness Estimation (Single Collection)
[Figures: Weighted and Boolean panels]
• Using Bloom filters increases the estimation error but also reduces the storage overhead to 8% of the pair-based one
• Due to false positives, Bloom filters derive more optimistic lower bounds

Similarity Threshold λ (Single Collection)
[Figures: Weighted and Boolean panels]
• For low threshold values, the goodness estimations and the lower bounds are more accurate, while their error grows as the threshold increases
• When the threshold value is close to the tree depth of the documents, the accuracy of the estimations improves again

Document & Query Structure (Single Collection)
[Figure: absolute estimation error (distance from ELCA)]
• Overall acceptable estimations (error below 20%)
• Our approaches behave worse for queries of "medium" length (4-5) and a small number of repeating elements

Achieved ranking
• Optimal Ranking (the ranking achieved through the actual ELCA computation) and Pair-wise Ranking (with and without Bloom filters)
• Spearman Footrule distance between two ranked lists: the sum of the absolute differences between the positions of each element in the two lists, normalized by dividing by 1/2(S), where S is the number of elements in the lists
• Mean Average Precision (MAP) for a set of different queries: the sum of the average precision values (precision being the percentage of relevant documents) attained for each query, divided by the number of queries
• Three different sets of collections (same size, different size, random)
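
Both ranking measures can be sketched in a few lines. The normalizer on the slide is partly garbled in the transcript, so this sketch assumes the textbook normalization by the maximum footrule distance, floor(S²/2), and the textbook definition of average precision; the function names are hypothetical and the evaluation code used in the paper may differ.

```python
def footrule(ranking_a, ranking_b):
    """Normalized Spearman footrule distance between two rankings
    of the same items (0 = identical, 1 = maximally different)."""
    pos_b = {item: i for i, item in enumerate(ranking_b)}
    total = sum(abs(i - pos_b[item]) for i, item in enumerate(ranking_a))
    s = len(ranking_a)
    # floor(S^2 / 2) is the largest possible sum of displacements.
    return total / (s * s // 2) if s > 1 else 0.0


def average_precision(ranked, relevant):
    """Mean of the precision values measured at each relevant item."""
    hits, total = 0, 0.0
    for rank, item in enumerate(ranked, start=1):
        if item in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant) if relevant else 0.0


def mean_average_precision(runs):
    """runs is a list of (ranked list, set of relevant items), one per query."""
    return sum(average_precision(r, rel) for r, rel in runs) / len(runs)


optimal = ["D1", "D2", "D3"]
estimated = ["D2", "D1", "D3"]
print(footrule(optimal, estimated))  # 0.5
```

In the setting of the talk, `optimal` would be the ranking of collections obtained from the exact ELCA computation and `estimated` the ranking from the pair-wise or Bloom-based goodness.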

Ranking (Spearman)
[Figures: Equal Size Collections, Different Size Collections, Random Collections]
• The keyword-based approach ignores the document structure and ranks the collections according to their size
• Our approaches behave well, with a maximum distance to the actual ranking of 0.3 in the worst case
• The Bloom-based approach sometimes outperforms the pair-based one due to its more optimistic estimations

Ranking (MAP)
[Figures: Equal Size Collections, Different Size Collections, Random Collections]
• Our approaches behave well, with a MAP around 0.75 to 0.85
• The Bloom-based approach is less precise because of the false positives

Real Data
We split the DBLP bibliographic data collection into two sets of collections, grouped by:
• year of publication (e.g., collections "2009", "2008", etc.)
• conference name (e.g., collections "WWW", "VLDB", etc.)
Queries with author names as keywords
With λ equal to 1, we retrieve publications co-written by two authors

Summary
We consider the problem of source selection for XML documents: given a set of XML databases and a keyword query, rank the databases based on their goodness
• (LCA-based ranking) Maintain information about the height of the LCA node between keywords
• Propose a pair-wise approach: the actual height for a combination of keywords is estimated using the pair-wise heights
• Introduce Bloom-based summaries for maintaining heights
• Both a Boolean and a Weighted version for document similarity
• Evaluation of the quality of the goodness estimation per collection and of the actual ranking, as well as of the usefulness for real data

Future Work
• Other definitions of document relevance (including schema-based and IR techniques)
• Alternative definitions of database goodness + user study for their evaluation
• Other types of summaries

Thank you

Related Work
For all the variations of the LCA, for any query q and document d, the set of the LCA nodes of the keywords in q (the basic LCA nodes) is a superset of any other type of LCA nodes, i.e., SLCA, ELCA, MLCA, VLCA
