XSEarch: A Semantic Search Engine for XML Sara Cohen, Jonathan Mamou, Yaron Kanza, Yehoshua Sagiv The Hebrew University of Jerusalem Presented by Deniz Kasap & Sarp Baran Özkan
XSEarch an XML Search Engine Goal: Find the “relevant” XML fragments, given tag names and keywords
Introduction • It is becoming increasingly popular to publish data on the Web in the form of XML documents. • Current search engines, which are an indispensable tool for finding HTML documents, have two main drawbacks when it comes to searching for XML documents. • It is not possible to pose queries that explicitly refer to XML tags. • Search engines return references (i.e. links) to documents and not specific fragments thereof. This is problematic, since large XML documents may contain thousands of elements storing many pieces of information that are not necessarily related to each other.
Excerpt from the XML Version of DBLP
<proceedings>
  <inproceedings>
    <author>Moshe Y. Vardi</author>
    <title>Querying Logical Databases</title>
  </inproceedings>
  <inproceedings>
    <author>Victor Vianu</author>
    <title>A Web Odyssey: From Codd to XML</title>
  </inproceedings>
</proceedings>
A Search Example Find papers by Vianu on the topic of “logical databases” How can we find such papers?
Attempt 1: Standard Search Engine A document containing some of the three query terms is considered as a result.
The document contains the three query terms. Hence, it is returned by a standard search engine. BUT it does not contain any paper on “logical databases” by Vianu: • The first fragment does not represent a paper by Vianu. • The second fragment does not represent a paper about logical databases. • The document is not relevant to the query. This does not work!
<proceedings>
  <inproceedings>
    <author>Moshe Y. Vardi</author>
    <title>Querying Logical Databases</title>
  </inproceedings>
  <inproceedings>
    <author>Victor Vianu</author>
    <title>A Web Odyssey: From Codd to XML</title>
  </inproceedings>
</proceedings>
Since a reference to a whole XML document is usually not a useful answer, the granularity of the search should be refined. • Instead of returning an entire document, an XML search engine should return fragments of XML documents.
A query language for XML, such as XQuery, can be used to extract data from XML documents. • However, such a query language is not an alternative to an XML search engine for several reasons. • The syntax of XQuery is more complicated than the syntax of a standard search query. Hence, it is not appropriate for a naive user. • Extensive knowledge of the document structure is required in order to correctly formulate a query. Thus, queries must be formulated on a per-document basis. • XQuery lacks any mechanism for ranking answers.
Attempt 2: XML Query Language
FOR $i IN document("bib.xml")//inproceedings
WHERE $i/author contains 'Vianu'
  AND $i/title contains 'Logical'
  AND $i/title contains 'Databases'
RETURN
  <result>
    <author> $i/author </author>
    <title> $i/title </title>
  </result>
This does work, BUT: • Complicated syntax • Extensive knowledge of the document structure required to write the query • No mechanism for ranking results
Our Requirements from the Search Tool • A simple syntax that can be used by naive users • Search results should include XML fragments and not necessarily full documents • The XML fragments in an answer should be semantically related • For example, a paper and an author should be in an answer only if the paper was written by this author • Search results should be ranked • Search results should be returned in “reasonable” time
The design and implementation of XSEarch involved several challenges. • A syntax that is suitable for a naive user had to be developed. • The theoretical results were adapted so that XSEarch always returns answers that are highly relevant to the keywords of the query. • A suitable ranking mechanism that takes into account both the degree of the semantic relationship and the relevance of the keywords has been developed. • Index structures and evaluation algorithms that allow the system to deal efficiently with large documents have been developed. • The implementation of XSEarch is extensible in the sense that it can easily accommodate different types of semantic relationships.
Query Syntax • The query language of a standard search engine is simply a list of keywords. • Keywords with a plus (+) sign must appear in a satisfying document, whereas keywords without a plus sign may or may not appear in a satisfying document (but the appearance of such keywords is desirable).
The query language of XSEarch is a simple extension of the language described above. In addition to keywords, the user can specify labels and keyword-label combinations that must or may appear in a satisfying document. • A search term may have a plus sign prepended, in which case it is a required term. Otherwise, it is an optional term. • We use t, t1, t2, etc., as an abstract notation for required and optional terms. • A query has the form Q(S) where S = t1,...,tm is a sequence of required and optional search terms.
Formally, a search term has one of the forms l:k, l: or :k, where l is a label and k is a keyword.
Example • Find papers by Vianu on the topic of “logical databases”: logical +database inproceedings: author:Vianu • logical: the appearance of the keyword logical in the fragment increases the rank of this fragment. • +database: the keyword database must appear in the fragment. • inproceedings: the appearance of the tag inproceedings in the fragment increases the rank of this fragment. • author:Vianu: the appearance of Vianu under the tag author in the fragment increases the rank of this fragment. • Note that the different document fragments matching these query terms must be “semantically related”.
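The query syntax above can be sketched as a small parser. This is a minimal illustration, not the XSEarch implementation: the names `SearchTerm` and `parse_query` are hypothetical, and treating a bare keyword such as `logical` as shorthand for the form `:k` is an assumption drawn from the example query.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class SearchTerm:
    label: Optional[str]    # l, or None for the form :k
    keyword: Optional[str]  # k, or None for the form l:
    required: bool          # True when the term was written with a leading '+'


def parse_query(query: str) -> List[SearchTerm]:
    """Split a whitespace-separated query into search terms."""
    terms = []
    for token in query.split():
        required = token.startswith("+")
        if required:
            token = token[1:]
        if ":" in token:
            label, keyword = token.split(":", 1)
        else:
            # Assumption: a bare keyword is read as the form ":k".
            label, keyword = "", token
        if not label and not keyword:
            raise ValueError("a search term needs a label, a keyword, or both")
        terms.append(SearchTerm(label or None, keyword or None, required))
    return terms


terms = parse_query("logical +database inproceedings: author:Vianu")
```

In this sketch only `database` ends up marked as required; the other three terms are optional and would only influence ranking.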
Query Semantics • This section presents the semantics of our queries. • In order to satisfy a query Q, each of the required terms in Q must be satisfied. • In addition, the elements satisfying Q must be meaningfully related.
XSEarch query: author:Vianu title:
Document:
<proceedings>
  <inproceedings>
    <author>Moshe Y. Vardi</author>
    <title>Querying Logical Databases</title>
  </inproceedings>
  <inproceedings>
    <author>Victor Vianu</author>
    <title>A Web Odyssey: From Codd to XML</title>
  </inproceedings>
</proceedings>
Result:
<author>Victor Vianu</author>
<title>A Web Odyssey: From Codd to XML</title>
Good Result! The title and author elements ARE semantically related.
XSEarch query: author:Vianu title:
Document:
<proceedings>
  <inproceedings>
    <author>Moshe Y. Vardi</author>
    <title>Querying Logical Databases</title>
  </inproceedings>
  <inproceedings>
    <author>Victor Vianu</author>
    <title>A Web Odyssey: From Codd to XML</title>
  </inproceedings>
</proceedings>
Result:
<title>Querying Logical Databases</title>
<author>Victor Vianu</author>
Bad Result! The title and author elements ARE NOT semantically related.
Satisfaction of a Search Term • XML documents are modeled as trees in the standard fashion. • Each interior node is associated with a label and each leaf node is associated with a sequence of keywords. • If k is a keyword in the sequence associated with a leaf node n, we say that n contains k. • Figure 1 shows a tree that represents a small portion of the Sigmod Record. • We will refer to this tree as Tsr.
Let n be an interior node in a tree T. • We say that n satisfies the search term: • l:k if n is labeled with l and has a descendant that contains the keyword k. • l: if n is labeled with l. • :k if n has a leaf child that contains the keyword k. • Example: • In the tree Tsr, • node number 14 satisfies :Kempster. • node number 9 satisfies authors:Kempster. • node 9 does not satisfy :Kempster, position: or :position.
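The three satisfaction rules can be written as one predicate over a simple tree model. This is a hedged sketch: the `Node` class, `leaves_below` and `satisfies` are illustrative names, keyword matching is assumed to be exact and case-sensitive, and the two-node tree below only mimics the authors/author portion of Tsr rather than reproducing Figure 1.

```python
from dataclasses import dataclass, field
from typing import Iterator, List, Optional


@dataclass
class Node:
    label: Optional[str] = None                         # set on interior nodes only
    keywords: List[str] = field(default_factory=list)   # set on leaf nodes only
    children: List["Node"] = field(default_factory=list)


def leaves_below(n: Node) -> Iterator[Node]:
    """Yield every leaf node that is a descendant of n."""
    for child in n.children:
        if child.label is None:
            yield child
        else:
            yield from leaves_below(child)


def satisfies(n: Node, label: Optional[str], keyword: Optional[str]) -> bool:
    """Check whether interior node n satisfies the term label:keyword."""
    if label is None and keyword is None:
        raise ValueError("a search term needs a label, a keyword, or both")
    if n.label is None:
        return False                      # only interior nodes satisfy terms
    if label is not None and n.label != label:
        return False
    if keyword is None:
        return True                       # form l:
    if label is None:                     # form :k, leaf child required
        return any(c.label is None and keyword in c.keywords for c in n.children)
    # form l:k, any descendant leaf may contain the keyword
    return any(keyword in leaf.keywords for leaf in leaves_below(n))


author = Node(label="author", children=[Node(keywords=["Kempster"])])
authors = Node(label="authors", children=[author])
```

With this toy tree, `author` plays the role of node 14 and `authors` the role of node 9: the keyword is a leaf child of the former but only a deeper descendant of the latter.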
Meaningfully Related Sets of Nodes • Let T be a tree and R be a binary, reflexive and symmetric relationship on the nodes in T. • We assume that R contains pairs of nodes that are meaningfully related. • We present two different ways to extend R to arbitrary sets of nodes.
A set of nodes N is all-pairs R-related if (n1,n2) is in R for every pair of nodes n1, n2 ∈ N. • This states that a set of nodes is meaningfully related if every pair of nodes in the set is meaningfully related. • N is star R-related if there is a node n* ∈ N such that the pair (n*,n) is in R for all nodes n ∈ N. • This states that the nodes of a set are meaningfully related if all these nodes are meaningfully related to a single node in the set. • Depending on the structure of the documents, either the all-pairs relationship or the star relationship may be more appropriate.
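The difference between the two extensions of R shows up on a three-node set. The sketch below assumes R is supplied as a symmetric predicate `related(a, b)`; the function names and the toy relation are illustrative only.

```python
from itertools import combinations
from typing import Callable, Hashable, Iterable

Related = Callable[[Hashable, Hashable], bool]


def all_pairs_related(nodes: Iterable[Hashable], related: Related) -> bool:
    """Every pair of nodes in the set must be in R."""
    return all(related(a, b) for a, b in combinations(list(nodes), 2))


def star_related(nodes: Iterable[Hashable], related: Related) -> bool:
    """Some node n* of the set must be in R with every other node."""
    nodes = list(nodes)
    return any(
        # R is reflexive, so the pair (n*, n*) is skipped rather than tested.
        all(related(center, n) for j, n in enumerate(nodes) if j != i)
        for i, center in enumerate(nodes)
    )


# Toy relation: node 1 is related to 2 and to 3, but 2 and 3 are unrelated.
pairs = {(1, 2), (1, 3)}


def toy_related(a: int, b: int) -> bool:
    return a == b or (a, b) in pairs or (b, a) in pairs
```

Here `{1, 2, 3}` is star R-related (with node 1 as the center) but not all-pairs R-related, which is why the star semantics can admit answers that the all-pairs semantics rejects.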
Query Answers • Let Q(t1,…,tm) be a query. • A sequence N = n1,…,nm of nodes and null values is an all-pairs R-answer for Q if the nodes in N are all-pairs R-related and for all 1 ≤ i ≤ m: • ni is not the null value if ti is a required term; • ni satisfies ti if it is not the null value. • Similarly, N is a star R-answer when the nodes in N are star R-related.
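The answer conditions can be checked mechanically once a satisfaction test and a relatedness test are available. The sketch below keeps both as parameters, so the same function covers all-pairs and star answers; the function name, the `(paper_id, label)` node encoding and the `same_paper` relation are hypothetical stand-ins, not part of XSEarch.

```python
from typing import Callable, Optional, Sequence, Tuple

Node = Tuple[int, str]   # hypothetical encoding: (paper id, label)


def is_answer(
    terms: Sequence[str],
    required: Sequence[bool],
    candidate: Sequence[Optional[Node]],
    satisfies: Callable[[Node, str], bool],
    related: Callable[[Sequence[Node]], bool],
) -> bool:
    """Check the answer conditions for one candidate sequence of nodes and nulls."""
    if not (len(terms) == len(required) == len(candidate)):
        return False
    for term, is_required, node in zip(terms, required, candidate):
        if node is None:
            if is_required:
                return False          # a required term may not map to null
        elif not satisfies(node, term):
            return False              # a non-null node must satisfy its term
    # Only the non-null entries need to be meaningfully related.
    return related([n for n in candidate if n is not None])


def label_matches(node: Node, term: str) -> bool:
    return node[1] == term


def same_paper(nodes: Sequence[Node]) -> bool:
    return len({paper for paper, _ in nodes}) <= 1


# Q(+title:, author:) encoded as labels plus required flags.
terms, required = ["title", "author"], [True, False]
```

Under these assumptions a title paired with a null author is an answer, while a null title is rejected because the title term is required.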
We use: • Ansa,R(Q) to denote the set of all-pairs R-answers for the query Q over a tree T, • Anss,R(Q) to denote the set of star R-answers for Q over T, and • MaxAnsa,R(Q) to denote the set of maximal answers in Ansa,R(Q).
The Interconnection Relationship • We present a relation which can be used to determine whether a pair of nodes is meaningfully related. • Let T be a tree and let n1 and n2 be nodes in T. • The shortest undirected path between n1 and n2 consists of the paths from the lowest common ancestor of n1 and n2 to n1 and n2.
We denote the tree consisting of these two paths as T|n1,n2. • This tree describes the relationship between the nodes n1 and n2. • For example in Tsr, the tree T|8,13 consists of the nodes 7, 8, 9, 12 and 13.
Relationship Trees [Figure: the relationship tree of n1, n2, …, nk, rooted at the lowest common ancestor of n1, n2, …, nk.]
Our “Semantic Relation”: Interconnection • n1,..., nk are interconnected if either • the relationship tree of n1,..., nk does not contain two nodes with the same label, or • the only nodes with the same label in the relationship tree of n1,..., nk are among n1,..., nk.
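The interconnection test can be sketched directly from this definition: build the relationship tree from the lowest common ancestor, then look for repeated labels. This is a naive illustration under stated assumptions, not the indexed algorithm XSEarch uses; the `Node` class and function names are hypothetical, only labeled (interior) nodes are modeled, and all given nodes are assumed to belong to one tree.

```python
from typing import Dict, List, Optional


class Node:
    def __init__(self, label: str, parent: Optional["Node"] = None):
        self.label = label
        self.parent = parent

    def add(self, label: str) -> "Node":
        return Node(label, parent=self)


def path_from_root(n: Node) -> List[Node]:
    path = []
    while n is not None:
        path.append(n)
        n = n.parent
    return path[::-1]


def relationship_tree(nodes: List[Node]) -> List[Node]:
    """Nodes on the paths from the lowest common ancestor to each given node."""
    if not nodes:
        raise ValueError("at least one node is required")
    paths = [path_from_root(n) for n in nodes]
    depth = 0   # length of the common prefix of all root paths
    while all(len(p) > depth and p[depth] is paths[0][depth] for p in paths):
        depth += 1
    tree, seen = [], set()
    for p in paths:
        for n in p[depth - 1:]:          # p[depth - 1] is the common ancestor
            if id(n) not in seen:
                seen.add(id(n))
                tree.append(n)
    return tree


def interconnected(nodes: List[Node]) -> bool:
    given = {id(n) for n in nodes}
    first_with_label: Dict[str, Node] = {}
    for n in relationship_tree(nodes):
        other = first_with_label.setdefault(n.label, n)
        # A repeated label is tolerated only when both nodes were given.
        if other is not n and (id(n) not in given or id(other) not in given):
            return False
    return True


proceedings = Node("proceedings")
paper1, paper2 = proceedings.add("inproceedings"), proceedings.add("inproceedings")
title1, author1 = paper1.add("title"), paper1.add("author")
title2, author2, author3 = paper2.add("title"), paper2.add("author"), paper2.add("author")
```

On this toy version of the DBLP excerpt, an author and a title under the same `inproceedings` node are interconnected, while an author and a title from different papers are not, because their relationship tree contains two `inproceedings` nodes.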
Example (1) [Figure: the DBLP tree (a proceedings node with two inproceedings children, each with title and author children), with the lowest common ancestor and the relationship tree of the circled nodes marked.] The circled nodes belong to different inproceedings entities. They ARE NOT interconnected!
Example (2) [Figure: the same DBLP tree, with the lowest common ancestor and the relationship tree of the circled nodes marked.] The circled nodes belong to the same inproceedings entity. They ARE interconnected!
Example (3) [Figure: a DBLP tree in which the second inproceedings node (“Queries and Computation on the Web”) has a title child and two author children (Victor Vianu and Serge Abiteboul), with the lowest common ancestor and the relationship tree of the circled nodes marked.] The circled nodes belong to the same inproceedings entity, but are labeled with the same tag. They ARE interconnected.
Example 1 of Query Semantics • Consider the query Q1 defined as Q1(+title:, author:). • The query Q1 finds pairs of titles and authors belonging to the same article. • Only tuples where the title is non-null will be returned. • The answers created for Tsr are (8,10), (8,12), (8,14), (17,18) and (25, null).
Example 2 of Query Semantics • The answers for Q1 over this document would consist of (6,3) and (6,4).
Query Processing • Document fragments are extracted using the interconnection index and other indices • Extracted fragments are returned ranked by the estimated relevance
[Figure: query processing architecture. Components: User, Query Processor, Ranker, Index Repository (Indices), Indexer, XML Files.]
Ranking Factors Several factors increase the rank of a result • Similarity between query and result • Weight of labels appearing in the result • Characteristics of result tree
Query and Result Similarity: TFILF • Extension of TFIDF, classical in IR • Term Frequency: number of occurrences of a query term in a fragment • Inverse Leaf Frequency: number of leaves in the corpus divided by the number of leaves containing a query term
TFILF • tf: term frequency of keyword k in a leaf node nl • ilf: inverse leaf frequency of keyword k • TFILF is the product of tf and ilf
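The exact tf and ilf formulas are not preserved in this transcript, so the following sketch fills them in with TFIDF-style choices that are assumptions: term frequency is normalized by the most frequent keyword in the leaf, and inverse leaf frequency is log-scaled. The function names and the four-leaf corpus are illustrative only.

```python
import math
from collections import Counter
from typing import Sequence


def tf(keyword: str, leaf: Sequence[str]) -> float:
    """Occurrences of keyword in the leaf, relative to the leaf's most frequent keyword."""
    counts = Counter(leaf)
    if not counts:
        return 0.0
    return counts[keyword] / max(counts.values())


def ilf(keyword: str, leaves: Sequence[Sequence[str]]) -> float:
    """Assumed form: log(1 + total leaves / leaves containing the keyword)."""
    containing = sum(1 for leaf in leaves if keyword in leaf)
    if containing == 0:
        return 0.0
    return math.log(1 + len(leaves) / containing)


def tfilf(keyword: str, leaf: Sequence[str], leaves: Sequence[Sequence[str]]) -> float:
    """TFILF is the product of tf and ilf."""
    return tf(keyword, leaf) * ilf(keyword, leaves)


# Leaves of the DBLP excerpt, lower-cased and split into keywords.
leaves = [
    ["moshe", "y", "vardi"],
    ["querying", "logical", "databases"],
    ["victor", "vianu"],
    ["a", "web", "odyssey", "from", "codd", "to", "xml"],
]
```

Under these assumptions a keyword that occurs in few leaves, such as `vianu`, receives a high ilf, and a keyword that is absent from a leaf contributes a weight of zero for that leaf.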
Weight of Labels • Some labels are considered more important than others • Text under an element labeled with title is more “important” than text under an element labeled with section • Label weights can be • system generated • user defined
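Relationship between Nodes • Size of the relationship tree: a small fragment indicates that its nodes are closer and thus, probably, “more related”. • Example query: article: title:XML [Figure: two fragments matching the query, one in which title is a child of article (2 nodes) and one in which a section node lies between article and title (3 nodes). The 2-node fragment will obtain a higher rank.]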
Relationship between Nodes • An ancestor-descendant relationship between a pair of nodes in a fragment indicates a “strong relation” between these nodes. • Example query: section: title:XML [Figure: two fragments matching the query. The fragment in which the section node is an ancestor of the title node will obtain a higher rank.]
Combining the Factors • Given a query Q and an answer N, we use the measures sim(Q,N), tsize(N) and anc-des(N) to determine the ranking of the answer. • We experimented with the following combination of factors by varying the values of α, β and γ:
$$\frac{\mathrm{sim}(Q,N)^{\alpha}}{\mathrm{tsize}(N)^{\beta}} \times \bigl(1 + \gamma \times \text{anc-des}(N)\bigr)$$
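The combination of factors is easy to evaluate numerically. In the sketch below the function name `rank`, the default values α = β = γ = 1 and the sample inputs are all assumptions for illustration; the slides do not state which parameter values worked best, and anc-des(N) is simply taken as a non-negative number.

```python
def rank(sim: float, tsize: int, anc_des: float,
         alpha: float = 1.0, beta: float = 1.0, gamma: float = 1.0) -> float:
    """sim^alpha / tsize^beta * (1 + gamma * anc_des)."""
    if tsize <= 0:
        raise ValueError("the relationship tree has at least one node")
    return (sim ** alpha) / (tsize ** beta) * (1 + gamma * anc_des)


# Same similarity, but the first answer has the smaller relationship tree.
small_tree = rank(sim=0.8, tsize=2, anc_des=0)
large_tree = rank(sim=0.8, tsize=3, anc_des=0)

# Same tree size, but one answer contains an ancestor-descendant pair.
with_ancestor = rank(sim=0.8, tsize=3, anc_des=1)
```

With these inputs the smaller relationship tree ranks higher, and an ancestor-descendant pair lifts an answer above an otherwise identical one, matching the two "Relationship between Nodes" slides. Setting β or γ to 0 switches the corresponding factor off.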
System Implementation The architecture of the XSEarch system is depicted in the following figure:
[Figure: XSEarch architecture. Components: User Interface, Query Processor, Ranker, Index Repository (Indices), Indexer, XML Files.]
The basic flow of information is as follows: • The user enters a query using a browser. • The Search-Query Processor parses the query into a list of search terms. • The Index Repository is used to find nodes that satisfy the search terms and to find whether pairs of nodes are interconnected. • It responds by checking the stored indices. • If these indices do not contain sufficient information, the Indexer is used to augment the current indices. • Once the relevant information is returned to the Search-Query Processor, it creates the answers, which are ranked, sorted and then returned. • The Indexer creates several different indices in the Index Repository based on a set of XML documents.
We focus on the most important and novel index structures: • The interconnection index • The path index • The interconnection index allows for rapid checking of the interconnection relationship. • The path index allows us to create answers with a higher estimated ranking first.
Dynamic Offline Interconnection Indexing • Checking for interconnection of nodes online is expensive. • Hence, it was decided to first create a node-interconnection index that stores information about the interconnection relationship between each pair of nodes. • This requires solving the following problem: • Given a document T, for all pairs of nodes n and n’ in T, determine whether n and n’ are interconnected. • The algorithm that solves this problem is based on the following Lemma:
Lemma (Interconnection Characterization) • Let T be a document and let n and n’ be nodes in T. • If n is an ancestor of n’, then n and n’ are interconnected if and only if the following hold: • The parent of n’ is strongly-interconnected with n; • The child of n on the path to n’ is strongly-interconnected with n’. • If n is not an ancestor of n’ and n’ is not an ancestor of n, then n and n’ are interconnected if and only if the following hold: • The parent of n’ is strongly-interconnected with n; • The parent of n is strongly-interconnected with n’.