XML Retrieval Tarık Teksen Tutal 21.07.2011
Information Retrieval • XML (Extensible Markup Language) • XQuery • Text Centric vs Data Centric
XML • Ordered, Labeled Tree • XML Element • XML Attribute • XML DOM (Document Object Model): Standard for accessing and processing XML documents.
XML Structure • An Example:
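The example figure on this slide is not preserved in the text; as a stand-in, here is a minimal sketch of a small XML document (the element and attribute names are invented for illustration, loosely following the cookbook example used later in the deck):

```python
# A small illustrative XML document: an ordered, labeled tree with
# elements (book, chapter, title, text) and attributes (id).
DOC = """<book id="apple-desserts">
  <chapter id="1">
    <title>Apple Pie</title>
    <text>Peel the apples and roll out the dough.</text>
  </chapter>
  <chapter id="2">
    <title>Apple Strudel</title>
    <text>Slice the apples thinly.</text>
  </chapter>
</book>"""
```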
XML DOM Object • XML DOM Object of the Sample in the Previous Slide • Nodes in a Tree • Parse the Tree Top Down
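A hedged sketch of parsing the sample document into a DOM tree and walking it top down, using Python's standard xml.dom.minidom (DOC is the string defined in the previous sketch):

```python
from xml.dom.minidom import parseString

def walk(node, depth=0):
    """Visit the DOM tree top down, printing each element and its attributes."""
    if node.nodeType == node.ELEMENT_NODE:
        attrs = dict(node.attributes.items())
        print("  " * depth + node.tagName + (f" {attrs}" if attrs else ""))
    for child in node.childNodes:
        walk(child, depth + 1)

dom = parseString(DOC)       # DOM object of the sample document
walk(dom.documentElement)    # book -> chapter -> title / text ...
```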
XPath • Standard for enumerating paths in an XML document collection • Query language for selecting nodes from an XML document • Defined by the World Wide Web Consortium (W3C)
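A small sketch of XPath-style node selection, using the limited XPath subset supported by Python's xml.etree.ElementTree, again on the illustrative DOC from above:

```python
import xml.etree.ElementTree as ET

root = ET.fromstring(DOC)                  # the sample document from above

# Select nodes with simple XPath expressions (ElementTree's subset).
for title in root.findall("./chapter/title"):
    print(title.text)                      # Apple Pie, Apple Strudel

second = root.find(".//chapter[@id='2']")  # attribute predicate
print(second.find("title").text)           # Apple Strudel
```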
Schema • Puts Constraints on the Structure of Allowable XML • Two Standards for Schemas: • XML DTD • XML Schema
Structured Document Retrieval Principle • A system should always retrieve the most specific part of a document answering the query • In a «Cookbook» collection, if a user queries «Apple Pie», the system should return the relevant «Apple Pie» chapter of the book «Apple Desserts», not the entire book. • In the same example, however, if the user queries «Apple», the whole book should be returned instead of a single chapter.
Indexing Unit • Unstructured: • Files on PC, Pages on the Web, E-Mail Messages etc. • Structured: • Non-Overlapping Pseudodocuments • Top-Down • Bottom-Up • All
Indexing Unit • Non-Overlapping Pseudodocuments • Resulting pseudodocuments may not be coherent units that make sense to the user
Indexing Unit • Top-Down • Start with one of the largest units (e.g. the book element in a book collection) • Postprocess search results to find for each book the subelement that is the best hit • May fail to return the best element, since the relevance of a book is generally not a good predictor of the relevance of its subelements
Indexing Unit • Bottom-Up • Search all leaves, select the relevant ones • Extend them to larger units in postprocessing • May fail to return the best element, since the relevance of a subelement is generally not a good predictor of the relevance of larger units
Indexing Unit • Index All the Elements • Not Useful to Index Some Elements (e.g. an ISBN number) • Creates redundancy (the content of deeply nested elements is returned several times, once for each enclosing element)
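As a sketch of the "index all elements" strategy (illustrative, not the presenter's code): every element becomes its own indexing unit whose content is the concatenated text of its subtree, so deeply nested text is indexed once per enclosing element, which is exactly the redundancy noted above.

```python
import xml.etree.ElementTree as ET

def element_units(xml_string):
    """Yield one (path, text) indexing unit per element of the document."""
    root = ET.fromstring(xml_string)
    def visit(elem, path):
        here = path + "/" + elem.tag
        # Concatenate all text below this element; deep text therefore
        # reappears in every ancestor's unit (the redundancy above).
        text = " ".join(t.strip() for t in elem.itertext() if t.strip())
        yield here, text
        for child in elem:
            yield from visit(child, here)
    yield from visit(root, "")

for path, text in element_units(DOC):      # DOC from the earlier sketch
    print(path, "->", text)
```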
Nested Elements • To Get Rid of Redundancy: • Discard All Small Elements • Discard All Element Types that Users do not Look at (Working XML Retrieval System Logs) • Discard All Element Types that Assessors Generally do not Judge to be Relevant (If Relevance Assessments are Available) • Only Keep Element Types that a System Designer or Librarian has Deemed to be Useful Search Results
Nested Elements • Remove Nested Elements in a Postprocessing Step • Collapse Several Nested Elements in the Results List and then Highlight Results
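One possible way to sketch that postprocessing step (an assumption, not necessarily the presenter's method): walk the ranked list and drop any result whose element path is nested inside, or contains, an element that was already kept.

```python
def remove_nested(ranked_paths):
    """Keep a result only if it is neither an ancestor nor a descendant of a
    higher-ranked result (paths like '/book/chapter[1]/title')."""
    kept = []
    for path in ranked_paths:
        nested = any(path.startswith(k + "/") or k.startswith(path + "/")
                     for k in kept)
        if not nested:
            kept.append(path)
    return kept

ranked = ["/book/chapter[1]", "/book/chapter[1]/title", "/book/chapter[2]"]
print(remove_nested(ranked))   # ['/book/chapter[1]', '/book/chapter[2]']
```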
Lexicalized Subtrees • Goal: encode each word, together with its position within the XML tree, by a dimension of the vector space • Map XML documents to lexicalized subtrees • Take each text node (leaf) and break it into multiple nodes, one for each word, e.g. split Bill Gates into Bill and Gates • Define the dimensions of the vector space to be lexicalized subtrees of documents – subtrees that contain at least one vocabulary term
Lexicalized Subtrees • Queries and documents can be represented as vectors in this lexicalized subtree space • Matches can then be computed, for example, by using the vector space formalism • Vector Space Formalism -> Unstructured vs Structured • Dimensions: Vocabulary Terms vs Lexicalized Subtrees
Dimensions: Tradeoff • Dimensionality of the Space vs Accuracy of Results • Restrict Dimensions to Vocabulary Terms • Standard Vector Space Retrieval System • Does Not Match the Structure of the Query • Separate Lexicalized Dimension for Each Subtree • Dimensionality of the Space Becomes too Large
Dimensions: Compromise • Index All Paths that End with a Single Vocabulary Term (XML-Context Term Pairs) • Structural Term <c, t>: a pair of XML-context c and vocabulary term t
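A minimal sketch of extracting structural terms <c, t> from a document; the exact path encoding is an assumption for illustration, since the slides do not fix one:

```python
import xml.etree.ElementTree as ET

def structural_terms(xml_string):
    """Yield <c, t> pairs: c is the path of element tags from the root down
    to a text node, t is a single word of that text node."""
    root = ET.fromstring(xml_string)
    def visit(elem, path):
        context = path + "/" + elem.tag
        for word in (elem.text or "").split():
            yield context, word.lower()
        for child in elem:
            yield from visit(child, context)
    yield from visit(root, "")

for c, t in structural_terms(DOC):   # DOC from the earlier sketch
    print(c, t)                      # e.g. /book/chapter/title apple
```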
Context Resemblance • To measure the similarity between a path cq in a query and a path cd in a document: CR(cq, cd) = (1 + |cq|) / (1 + |cd|) if cq matches cd, and 0 otherwise • |cq| and |cd| are the number of nodes in the query path and document path respectively • cq matches cd if and only if we can transform cq into cd by inserting additional nodes
Context Resemblance • CR(cq4, cd2) = 3/4 = 0.75 • CR(cq4, cd3) = 3/5 = 0.6
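A small sketch of CR on paths represented as lists of tags; the matching test checks whether cq can be turned into cd by inserting nodes, i.e. whether cq is a subsequence of cd (the example paths below are invented, since the slide's figure is not reproduced here):

```python
def is_subsequence(cq, cd):
    """True if cq can be transformed into cd by inserting additional nodes."""
    it = iter(cd)
    return all(tag in it for tag in cq)

def context_resemblance(cq, cd):
    """CR(cq, cd) = (1 + |cq|) / (1 + |cd|) if cq matches cd, else 0."""
    return (1 + len(cq)) / (1 + len(cd)) if is_subsequence(cq, cd) else 0.0

print(context_resemblance(["book", "title"],
                          ["book", "chapter", "title"]))   # 3/4 = 0.75
```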
Document Similarity Measure • Final Score for a Document • Variant of the Cosine Measure • Also called «SimNoMerge» • Not a True Cosine Measure Since Its Value can be Larger than 1.0
Document Similarity Measure • V is the vocabulary of non-structural terms • B is the set of all XML contexts • weight(q, t, c) and weight(d, t, c) are the weights of term t in XML context c in query q and document d, respectively • Standard weighting, e.g. idf_t × wf_t,d, where idf_t depends on which elements we use to compute df_t
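The slide's formula itself is not preserved in the text; reconstructed here from the definitions above (this is the SimNoMerge scoring function as defined in the Introduction to Information Retrieval textbook that this material follows):

```latex
\mathrm{SimNoMerge}(q,d) =
  \sum_{c_k \in B} \sum_{c_l \in B} \mathrm{CR}(c_k, c_l)
  \sum_{t \in V} \mathrm{weight}(q, t, c_k)\,
  \frac{\mathrm{weight}(d, t, c_l)}
       {\sqrt{\sum_{c \in B,\; t \in V} \mathrm{weight}^{2}(d, t, c)}}
```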
SimNoMerge Algorithm ScoreDocumentsWithSimNoMerge(q, B, V, N, normalizer)
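The slide only names the procedure and its parameters; below is a hedged Python sketch of how such a routine could look, assuming an inverted index keyed by structural terms (the index layout and the context_resemblance helper from the earlier sketch are assumptions, not the presenter's implementation):

```python
def score_documents_with_sim_no_merge(q, B, V, N, normalizer, index):
    """Score all N documents against query q (SimNoMerge-style).

    q          : list of (context, term) pairs, i.e. structural terms of the query
    B          : set of XML contexts in the collection (tuples of tags)
    V          : vocabulary of non-structural terms (not used directly here)
    normalizer : per-document norms, the denominator of the formula above
    index      : assumed layout mapping a structural term (c, t) to postings
                 [(doc_id, weight), ...]
    """
    score = [0.0] * N
    for cq, t in q:
        w_q = 1.0                            # query term weight (simplified)
        for c in B:
            cr = context_resemblance(cq, c)  # from the earlier sketch
            if cr == 0.0:
                continue
            for doc_id, w_d in index.get((c, t), []):
                score[doc_id] += cr * w_q * w_d
    return [s / normalizer[n] for n, s in enumerate(score)]
```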
INEX • Initiative for the Evaluation of XML Retrieval • Yearly standard benchmark evaluation that has produced test collections (documents, sets of queries, and relevance judgments) • Based on an IEEE journal collection (since 2006 INEX uses the much larger English Wikipedia test collection) • The relevance of documents is judged by human assessors.
INEX Topics • Content Only (CO) • Regular Keyword Queries Like in Unstructured IR • Content and Structure (CAS) • Structured Constraints in Addition to Keywords • Relevance Assessments are More Complicated
INEX Relevance Assessments • INEX 2002 defined component coverage and topical relevance as orthogonal dimensions of relevance • Component Coverage: • Evaluates Whether the Element Retrieved is «Structurally» Correct • Topical Relevance
INEX Relevance Assessments • Component Coverage: • Exact coverage (E): The information sought is the main topic of the component and the component is a meaningful unit of information • Too small (S): The information sought is the main topic of the component, but the component is not a meaningful (self-contained) unit of information • Too large (L): The information sought is present in the component, but is not the main topic • No coverage (N): The information sought is not a topic of the component • Topical Relevance: • Highly Relevant (3), Fairly Relevant (2), Marginally Relevant (1) and Nonrelevant (0)
Combining The Relevance Dimensions • Not all combinations of the two dimensions are possible, e.g. 3N (highly relevant, but no coverage) • Quantization: maps each (relevance, coverage) combination onto a single relevance value for evaluation
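As a sketch, the simplest ("strict") quantization counts only the combination 3E as relevant; the generalized INEX quantization instead assigns graded partial credit to the other legal combinations (its exact values are not reproduced here):

```python
def q_strict(relevance, coverage):
    """Strict quantization: only exact coverage of a highly relevant
    component (combination 3E) counts as relevant."""
    return 1.0 if (relevance, coverage) == (3, "E") else 0.0

print(q_strict(3, "E"))  # 1.0
print(q_strict(2, "L"))  # 0.0
```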
INEX Evaluation Measures • Precision and Recall can be applied • Quantized relevance grades are summed instead of counting binary relevance • Overlap is not accounted for • Nested elements in the same search result • Recent INEX focus: • Develop algorithms and evaluation measures that return non-redundant results lists and evaluate them properly