XRANK

XRANK: Ranked Keyword Search over XML Documents
Ece AKSU, Gökay Burak AKKUŞ

This Paper...
• Describes the architecture, implementation and evaluation of the XRANK system
• The contributions of the paper are:
  (a) the problem definition and system architecture
  (b) an algorithm for computing the ranking of XML elements
  (c) new inverted list index structures and associated query processing algorithms
  (d) an experimental evaluation of XRANK

Overview
• Problem: efficiently producing ranked results for keyword search queries over hierarchical XML documents.
• New challenges
  • Results can be deeply nested XML elements.
  • Ranking is at the granularity of an XML element (not the document).
  • Keyword proximity is more complex.

Overview - 2
• This paper presents the XRANK system, which handles these features of XML keyword search.
• XRANK offers both space and performance benefits.
• XRANK generalizes a hyperlink-based HTML search engine such as Google.
• XRANK can be used to query both HTML and XML documents.

Keyword Search Querying - 1
• Advantage of keyword search querying: simplicity
  • Users do not have to learn a complex query language.
  • Users can issue queries without any prior knowledge about the structure of the underlying data.
• Consequence: the interface is flexible
  • Queries may not always be precise and can return a large number of results.

Keyword Search Querying - 2
• An important requirement for keyword search is to rank the query results so that the most relevant results appear first.
• Certain limitations of the HTML data model make HTML keyword search systems ineffective in many domains.
  • HTML is a presentation language.
  • HTML cannot capture much semantics.

Keyword Search Querying - 3
• The XML data model addresses this limitation by allowing for extensible element tags. (Example: Figure 1)


Querying XML Documents
• One approach is a sophisticated query language such as XQuery
  • Effective in some cases
  • Users have to learn a complex query language and understand the schema of the underlying XML
• An alternative approach is XRANK
  • Retains the simple keyword search query interface
  • Exploits XML's tagged and nested structure during query processing

New Challenges
• Keyword searching over XML introduces many new challenges.
1. The result of a keyword search query can be a deeply nested XML element.
  • Return the "deepest" node that matches.
2. Ranking is not solely based on hyperlinks.
  • The semantics of containment links (relating parent and child elements) is very different from that of hyperlinks (such as IDREFs and XLinks).

New Challenges - 2
3. The notion of proximity among keywords is more complex.
  • In HTML, proximity among keywords translates directly to the distance between keywords in a document.
  • For XML there is a two-dimensional proximity metric:
    • Keyword distance
    • Ancestor distance

XML Data Model
• XML is a hierarchical format for data representation and exchange.
• An XML document consists of a root element, nested sub-elements, attributes and values.
• XML supports intra-document and inter-document references.

XML Data Model - 2
• Intra-document references are represented using IDREFs.
• Inter-document references are represented using XLink.
• Both IDREFs and XLinks are referred to as hyperlinks.

Definitions
• A collection of hyperlinked XML documents can be defined as a directed graph G = (N, CE, HE)
  • N: the set of nodes, N = NE ∪ NV
  • NE: the set of elements
  • NV: the set of values
  • CE: the set of containment edges relating nodes
  • HE: the set of hyperlink edges relating nodes

Definitions - 2
• The edge (u, v) ∈ CE iff v is a value or nested sub-element of u.
• The edge (u, v) ∈ HE iff u contains a hyperlink reference to v.
• An element u is a sub-element of an element v if (v, u) ∈ CE.
• An element u is the parent of node v if (u, v) ∈ CE.
• The predicate contains*(v, k) is true if the node v directly or indirectly contains the keyword k.
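A minimal sketch of the graph model and the contains* predicate defined above may make the notation concrete. The class name `Graph`, its field names and the toy document are illustrative assumptions, not part of the paper.

```python
from dataclasses import dataclass, field


@dataclass
class Graph:
    """Toy version of G = (N, CE, HE); all names here are hypothetical."""
    elements: set = field(default_factory=set)   # NE
    values: dict = field(default_factory=dict)   # NV: value node -> its text
    ce: set = field(default_factory=set)         # containment edges (parent, child)
    he: set = field(default_factory=set)         # hyperlink edges (source, target)

    def children(self, u):
        return [v for (p, v) in self.ce if p == u]

    def contains_star(self, v, k):
        """True if node v directly or indirectly contains keyword k."""
        if v in self.values:
            return k in self.values[v].lower().split()
        return any(self.contains_star(c, k) for c in self.children(v))


g = Graph()
g.elements |= {"workshop", "paper", "title"}
g.values["t1"] = "XRANK ranked keyword search"
g.ce |= {("workshop", "paper"), ("paper", "title"), ("title", "t1")}
```

Here `g.contains_star("workshop", "xrank")` follows containment edges only; hyperlink edges are stored but never traversed, which mirrors the slides' statement that containment defines what an element contains.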

Keyword Query Results
• There are two possible semantics for keyword search queries:
  • Conjunctive keyword query semantics: elements that contain all of the query keywords are returned.
  • Disjunctive keyword query semantics: elements that contain at least one of the query keywords are returned.
• This paper focuses on conjunctive keyword query semantics.

Keyword Query Results - 2
• Q = {k1, …, kn}
• R0 = {v | v ∈ NE ∧ ∀k ∈ Q (contains*(v, k))}: the set of elements that directly or indirectly contain all of the query keywords.
• Result(Q) = {v | ∀k ∈ Q ∃c ∈ N ((v, c) ∈ CE ∧ c ∉ R0 ∧ contains*(c, k))}
  • The condition c ∉ R0 ensures that only the most specific results are returned.
  • The definition still returns an element that has multiple independent occurrences of the query keywords.
• Only containment edges (CE) are considered for the result set; hyperlink edges (HE) are considered for ranking.
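The two set definitions above can be evaluated directly on a small tree. The following is a brute-force sketch for illustration only; the toy document, the names `children`, `text` and `result`, and the whitespace tokenization are all assumptions.

```python
# Hypothetical toy document: element -> child nodes; value nodes carry text.
children = {
    "workshop": ["paper1", "paper2"],
    "paper1": ["title1", "body1"],
    "paper2": ["title2"],
    "title1": ["v1"],
    "body1": ["v2"],
    "title2": ["v3"],
}
text = {"v1": "xml search", "v2": "ranked search over xml", "v3": "xml ranking"}


def contains_star(node, k):
    """contains*(node, k): keyword k occurs at or below node."""
    if node in text:
        return k in text[node].split()
    return any(contains_star(c, k) for c in children.get(node, []))


def result(query):
    """Evaluate Result(Q) literally from its definition."""
    # R0: elements that contain every query keyword.
    r0 = {e for e in children if all(contains_star(e, k) for k in query)}
    # v qualifies if each keyword is reachable through a child outside R0.
    return {
        v
        for v in children
        if all(
            any(c not in r0 and contains_star(c, k) for c in children[v])
            for k in query
        )
    }
```

With this data, `result({"xml", "search"})` keeps only `title1` and `body1` and drops their ancestors, while `result({"ranked", "ranking"})` returns `workshop`, because the two keywords occur independently in different papers.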

Keyword Query Results - 3
• Returning XML elements provides more context information.
• It also poses interesting user-interface challenges.
  • One solution is to allow the user to navigate up to the ancestors of the query result.
  • Another solution is to predefine a set of "answer nodes" AN.
    • This may require knowledge of the domain and the underlying XML schema.
• XRANK supports both.

Ranking Keyword Query Results
• Desired properties of the ranking function:
1) Result specificity: more specific results should be ranked higher than less specific results. This is one dimension of result proximity.
2) Keyword proximity: another dimension of result proximity.
3) Hyperlink awareness: the ranking should take the hyperlinked structure of XML documents into account.

Ranking Function: Definition
• ElemRank is defined at the granularity of an element and takes the nested structure of XML into account.
  • Similar to Google's PageRank
• Q = (k1, k2, …, kn)
• R = Result(Q)
• A result element v1 ∈ R
• First define the ranking of v1 with respect to one query keyword ki, r(v1, ki), before defining the overall rank, rank(v1, Q).

Ranking with respect to one keyword
• There exists a sub-element or value node v2 of v1 such that v2 ∉ R0 and contains*(v2, ki).
• There is a sequence of containment edges in CE of the form (v1, v2), (v2, v3), …, (vt, vt+1) such that vt+1 is a value node that directly contains the keyword ki.

Ranking with respect to one keyword - 2
• r(v1, ki) does not depend on the ElemRank of the result node v1, except when v1 = vt, for 2 reasons:
1. Less specific results should indeed get lower ranks.
2. ElemRank(vt) is in fact related to ElemRank(v1) due to certain properties of containment edges.
• For multiple occurrences of ki in v1, the ranks of the individual occurrences are combined with an aggregation function f:
  • f = max

Overall Ranking
• The overall ranking is the sum of the ranks with respect to each query keyword, multiplied by a measure of keyword proximity p(v1, k1, k2, …, kn).
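The ranking slides above can be tied together in a short sketch. The slides only state that occurrences are combined with f = max and that the per-keyword ranks are summed and multiplied by the proximity measure. The per-occurrence form ElemRank(vt) × decay^(t−1), the `decay` parameter and its default value, and the function names are assumptions made here for illustration.

```python
def keyword_rank(elem_rank_vt, t, decay=0.75):
    """Rank of one keyword occurrence.

    Assumed form: the ElemRank of v_t (the element that directly contains
    the keyword) scaled by `decay` once per containment level between the
    result v_1 and v_t. The default 0.75 is an arbitrary placeholder.
    """
    return elem_rank_vt * decay ** (t - 1)


def overall_rank(occurrences, proximity=1.0, decay=0.75, combine=max):
    """occurrences maps each query keyword to a list of (ElemRank(v_t), t).

    Occurrences of one keyword are combined with `combine` (f = max on the
    slides); the per-keyword ranks are summed and scaled by the proximity
    measure p(v1, k1, ..., kn), passed in here as a plain number.
    """
    total = sum(
        combine(keyword_rank(e, t, decay) for e, t in occs)
        for occs in occurrences.values()
    )
    return total * proximity


score = overall_rank(
    {"xml": [(0.4, 1), (0.8, 3)], "search": [(0.2, 2)]},
    proximity=0.5,
    decay=0.5,
)
```

In this example the deeper occurrence of "xml" has a higher ElemRank but is scaled down twice, so the shallower occurrence wins under f = max.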

XRANK System Architecture

XRANK System Architecture - 2
• ElemRank Computation Module
  • Computes the ElemRanks of XML elements
  • ElemRanks are combined with ancestor information to generate an index structure called HDIL
• Query Evaluator Module
  • Evaluates queries using HDIL
  • Returns ranked results

ElemRank Computation Module
• ElemRank is a measure of the objective importance of an XML element and is based on the hyperlinked structure of XML documents.
• The PageRank function is the sum of 2 probabilities:
  • Visiting v at random (d = 0.85)
  • Visiting v by navigating
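For reference, the two-term PageRank formulation that ElemRank builds on can be sketched as a simple power iteration. This is the standard textbook form with damping factor d, not XRANK's ElemRank; the function name, the fixed iteration count and the assumption that every page has at least one outgoing link are choices made for this sketch.

```python
def pagerank(links, d=0.85, iterations=50):
    """Plain PageRank by power iteration.

    links: dict mapping each page to the list of pages it links to.
    Assumes every page appears as a key and has at least one out-link.
    """
    nodes = list(links)
    n = len(nodes)
    rank = {v: 1.0 / n for v in nodes}
    for _ in range(iterations):
        # First term: probability of reaching v by a random jump.
        new = {v: (1 - d) / n for v in nodes}
        # Second term: probability of reaching v by following a link.
        for u, outs in links.items():
            for v in outs:
                new[v] += d * rank[u] / len(outs)
        rank = new
    return rank


ranks = pagerank({"a": ["b", "c"], "b": ["c"], "c": ["a"]})
```

Page "c" receives links from both "a" and "b", so it ends up ranked above "b"; rank only ever flows in the direction of the links.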

ElemRank Computation Module - 2
• PageRank is unidirectional; ElemRank needs propagation in both directions along containment edges
• Forward ElemRank propagation
  • Paper → section
• Reverse ElemRank propagation
  • Paper → workshop

Refinements of PageRank
• Bi-directional transfer of ElemRanks
• Discrimination between containment and hyperlink edges
• Aggregate ElemRanks for reverse containment relationships

Bi-directional Transfer of ElemRanks
• A simple solution is to add reverse containment edges.
• Drawback: it does not distinguish between containment and hyperlink edges.

Discrimination between containment and hyperlink edges
• Drawback: it weights forward and reverse containment relationships similarly.

Aggregate ElemRanks for reverse containment relationships
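The formulas on these refinement slides did not survive in the transcript. As a hedged sketch only, one way the three refinements could combine into a single recurrence is shown below. Every symbol is an assumption introduced here: separate damping weights d_1, d_2, d_3 for hyperlink, forward containment and reverse containment edges, N_h(u) and N_c(u) for the number of outgoing hyperlinks and children of u, and N for the total number of elements. The paper's exact normalization of the random-jump term may differ.

```latex
e(v) = \frac{1 - d_1 - d_2 - d_3}{N}
     + d_1 \sum_{(u,v) \in HE} \frac{e(u)}{N_h(u)}
     + d_2 \sum_{(u,v) \in CE} \frac{e(u)}{N_c(u)}
     + d_3 \sum_{(v,u) \in CE} e(u)
```

The last sum is not divided by a fan-out count: a parent aggregates the ElemRanks of all its children, which is the point of the third refinement.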

XRANK System Efficiently Evaluating XML Keyword Search Queries

Efficiently Evaluating XML Keyword Search Queries
• Naïve Approach
• Dewey Inverted List (DIL)
• Ranked Dewey Inverted List (RDIL)
• Hybrid Dewey Inverted List (HDIL)

Naïve Approach
• Main difference between XML and HTML keyword search: the granularity of query results
  • XML keyword search returns elements
  • HTML keyword search returns documents
• One way to do XML keyword search: treat each element as a document

Problems of the Naïve Approach
• Space overhead
• Spurious query results
• Inaccurate ranking of results

Space Overhead
• An inverted list contains, for each keyword, the list of documents that contain the keyword.
  • For XML documents, this becomes the list of elements.
• This causes a large space overhead, because each inverted list contains:
  • every XML element that directly contains the keyword, and
  • all of that element's ancestors, redundantly.

Spurious Query Results
• The naïve approach ignores ancestor-descendant relationships.
  • All elements are treated as independent documents.
• Results will not correspond to the desired semantics for XML keyword search.

Inaccurate Ranking of Results
• Existing approaches do not take result specificity into account when ranking results.

Dewey Inverted List (DIL)
• The naïve approach has drawbacks because it decouples the representation of ancestors and descendants.
• Dewey encoding of element IDs jointly captures ancestor and descendant information.

DIL
• An interesting feature: the ID of an ancestor is a prefix of the ID of a descendant.
• Ancestor-descendant relationships are implicitly captured in the Dewey ID.
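The prefix property can be shown with two small helpers. Representing a Dewey ID as a Python tuple of integers, and the names `is_ancestor` and `common_prefix`, are choices made for this sketch.

```python
def is_ancestor(a, b):
    """True if Dewey ID `a` is a proper ancestor of Dewey ID `b`."""
    return len(a) < len(b) and b[: len(a)] == a


def common_prefix(a, b):
    """Dewey ID of the deepest common ancestor of `a` and `b` (may be empty)."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return a[:n]
```

For example, `is_ancestor((5, 0), (5, 0, 3, 1))` holds, and `common_prefix((5, 0, 3, 0), (5, 0, 3, 1, 2))` is `(5, 0, 3)`, the element under which both nodes sit. No tree traversal or join is needed to answer either question.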

DIL Data Structure
• The inverted list for a keyword k contains the Dewey IDs of all the XML elements that directly contain the keyword k.
• For multiple documents, the first component of each Dewey ID is the document ID.

DIL Data Structure - 2
• An entry in DIL also contains:
  • the ElemRank of the corresponding XML element
  • the list of all positions where the keyword k appears in that element
• Entries are sorted by Dewey ID.
• The size of DIL is smaller than that of the naïve approach.
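A DIL with the entry layout described above could be built as follows. This is a sketch under simplifying assumptions: the input format, the function name `build_dil`, whitespace tokenization, and positions counted within the element's own text are all illustrative choices.

```python
from collections import defaultdict


def build_dil(elements):
    """Build a toy Dewey Inverted List.

    elements: iterable of (dewey_id, elem_rank, text), one item per element
    that directly contains text. Returns {keyword: [(dewey_id, elem_rank,
    positions), ...]} with each list sorted by Dewey ID.
    """
    postings = defaultdict(list)
    for dewey_id, elem_rank, text in elements:
        positions = defaultdict(list)
        for pos, word in enumerate(text.lower().split()):
            positions[word].append(pos)
        for word, pos_list in positions.items():
            postings[word].append((dewey_id, elem_rank, pos_list))
    return {w: sorted(entries, key=lambda e: e[0]) for w, entries in postings.items()}


dil = build_dil([
    ((5, 0, 3, 1), 0.2, "XQL language"),
    ((5, 0, 3, 0), 0.3, "XQL and XML XQL"),
])
```

Only elements that directly contain a keyword get an entry; ancestors such as `(5, 0, 3)` are never stored, which is where the space saving over the naïve approach comes from.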

DIL Query Processing
• The algorithm works in a single pass over the query keyword inverted lists.
• The key idea:
  • Merge the query keyword inverted lists
  • Simultaneously compute the longest common prefix of the Dewey IDs in different lists
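The single-pass merge can be sketched with a stack that holds the current root-to-element path. This simplified version only decides which elements are results under the Result(Q) semantics; it omits ElemRanks, position lists and the ranking computation. The function name `dil_query`, the stack layout and the tuple representation of Dewey IDs are assumptions of this sketch.

```python
import heapq


def dil_query(lists):
    """lists: one list of Dewey IDs (tuples) per query keyword, each sorted.

    Returns the Dewey IDs of the most specific elements that contain all
    query keywords, found in one merged pass.
    """
    n = len(lists)
    merged = heapq.merge(*[[(d, i) for d in lst] for i, lst in enumerate(lists)])
    stack = []    # entries: [component, keyword flags, subtree has a result]
    results = []

    def pop():
        comp, flags, has_result = stack.pop()
        if all(flags):
            results.append(tuple(e[0] for e in stack) + (comp,))
            has_result = True
        if stack:
            parent = stack[-1]
            if has_result:
                # This subtree contains all keywords, so it must not
                # contribute keyword occurrences to its ancestors.
                parent[2] = True
            else:
                for i, f in enumerate(flags):
                    parent[1][i] = parent[1][i] or f

    for dewey, kw in merged:
        # Longest common prefix of the stack path and the next Dewey ID.
        lcp = 0
        while lcp < len(stack) and lcp < len(dewey) and stack[lcp][0] == dewey[lcp]:
            lcp += 1
        while len(stack) > lcp:
            pop()
        for comp in dewey[lcp:]:
            stack.append([comp, [False] * n, False])
        stack[-1][1][kw] = True
    while stack:
        pop()
    return results
```

Each inverted list entry is pushed and popped at most once per Dewey component, so the lists are scanned only once.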

Ranked Dewey Inverted List (RDIL)
• "If inverted lists are long (due to common keywords or large document collections) even the cost of a single scan of the inverted list can be expensive, especially if the users want only the top few results."

RDIL - 2
• One solution:
  • Order the inverted lists by ElemRank instead of by Dewey ID.
  • Higher ranked results will appear first in the inverted list.
  • Use the Threshold Algorithm to stop scanning early.

RDIL Data Structure
• RDIL is similar to DIL except that:
  • Inverted lists are ordered by ElemRank
  • Each inverted list has a B+-tree index on the Dewey ID field
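The role of the rank-ordered lists and the per-list index can be illustrated with the generic Threshold Algorithm. This is not XRANK's adaptation, which has to work with Dewey ID prefixes; here every item is an opaque identifier, a dictionary stands in for the B+-tree lookup, and scores are simply summed. The function name and input layout are assumptions.

```python
import heapq


def threshold_topk(lists, k):
    """Generic Threshold Algorithm over rank-ordered lists.

    lists: one list per keyword of (score, item), sorted by score descending.
    Returns the top-k (total_score, item) among items present in every list.
    """
    lookup = [{item: s for s, item in lst} for lst in lists]  # random access
    seen = set()
    top = []  # min-heap holding the best k candidates so far
    # A conjunctive match must occur in the shortest list, so scanning that
    # deep is enough to see every candidate.
    for depth in range(min(len(lst) for lst in lists)):
        last_scores = []
        for lst in lists:
            score, item = lst[depth]
            last_scores.append(score)
            if item in seen:
                continue
            seen.add(item)
            if all(item in lk for lk in lookup):
                total = sum(lk[item] for lk in lookup)
                heapq.heappush(top, (total, item))
                if len(top) > k:
                    heapq.heappop(top)
        # No unseen item can score higher than the sum of the last scores read.
        threshold = sum(last_scores)
        if len(top) == k and top[0][0] >= threshold:
            break
    return sorted(top, reverse=True)


best = threshold_topk(
    [[(0.9, "a"), (0.8, "b"), (0.1, "c")], [(0.7, "b"), (0.6, "c"), (0.2, "a")]],
    k=1,
)
```

In this example the scan stops after two entries per list, because the best candidate already scores at least as much as any unseen item could.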
