XML Retrieval Tarık Teksen Tutal 21.07.2011
Information Retrieval • XML (Extensible Markup Language) • XQuery • Text Centric vs Data Centric
XML • Ordered, Labeled Tree • XML Element • XML Attribute • XML DOM (Document Object Model): Standard for accessing and processing XML documents.
XML Structure • An Example:
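The example figure on this slide is not preserved in the text; as a stand-in, here is a minimal sketch of a small XML document (the element and attribute names are invented for illustration, loosely following the cookbook example used later in the deck):

```python
# A small illustrative XML document: an ordered, labeled tree with
# elements (book, chapter, title, text) and attributes (id).
DOC = """<book id="apple-desserts">
  <chapter id="1">
    <title>Apple Pie</title>
    <text>Peel the apples and roll out the dough.</text>
  </chapter>
  <chapter id="2">
    <title>Apple Strudel</title>
    <text>Slice the apples thinly.</text>
  </chapter>
</book>"""
```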
XML DOM Object • XML DOM Object of the Sample in the Previous Slide • Nodes in a Tree • Parse the Tree Top Down
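A hedged sketch of parsing the sample document into a DOM tree and walking it top down, using Python's standard xml.dom.minidom (DOC is the string defined in the previous sketch):

```python
from xml.dom.minidom import parseString

def walk(node, depth=0):
    """Visit the DOM tree top down, printing each element and its attributes."""
    if node.nodeType == node.ELEMENT_NODE:
        attrs = dict(node.attributes.items())
        print("  " * depth + node.tagName + (f" {attrs}" if attrs else ""))
    for child in node.childNodes:
        walk(child, depth + 1)

dom = parseString(DOC)       # DOM object of the sample document
walk(dom.documentElement)    # book -> chapter -> title / text ...
```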
XPath • Standard for enumerating paths in an XML document collection • Query language for selecting nodes from an XML document • Defined by the World Wide Web Consortium (W3C)
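A small sketch of XPath-style node selection, using the limited XPath subset supported by Python's xml.etree.ElementTree, again on the illustrative DOC from above:

```python
import xml.etree.ElementTree as ET

root = ET.fromstring(DOC)                  # the sample document from above

# Select nodes with simple XPath expressions (ElementTree's subset).
for title in root.findall("./chapter/title"):
    print(title.text)                      # Apple Pie, Apple Strudel

second = root.find(".//chapter[@id='2']")  # attribute predicate
print(second.find("title").text)           # Apple Strudel
```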
Schema • Puts Constraints on the Structure of Allowable XML • Two Standards for Schemas: • XML DTD • XML Schema
Structured Document Retrieval Principle • A system should always retrieve the most specific part of a document answering the query • In a «Cookbook» collection, if a user queries «Apple Pie», the system should return the relevant «Apple Pie» chapter of the book «Apple Desserts», not the entire book. • In the same example, however, if the user queries «Apple», the whole book should be returned instead of a single chapter.
Indexing Unit • Unstructured: • Files on PC, Pages on the Web, E-Mail Messages etc. • Structured: • Non-Overlapping Pseudodocuments • Top-Down • Bottom-Up • All
Indexing Unit • Non-Overlapping Pseudodocuments • Resulting pseudodocuments may not be coherent units that make sense to the user
Indexing Unit • Top-Down • Start with one of the largest units (e.g. the book element in a book collection) • Postprocess search results to find for each book the subelement that is the best hit • May fail to return the best element, since the relevance of a book is generally not a good predictor of the relevance of its subelements
Indexing Unit • Bottom-Up • Search all leaves, select the relevant ones • Extend them to larger units in postprocessing • May fail to return the best element, since the relevance of a subelement is generally not a good predictor of the relevance of larger units
Indexing Unit • Index All the Elements • Not Useful to Index Some Elements (e.g. an ISBN number) • Creates redundancy (the content of deeply nested elements is returned several times, once for each enclosing element)
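As a sketch of the "index all elements" strategy (illustrative, not the presenter's code): every element becomes its own indexing unit whose content is the concatenated text of its subtree, so deeply nested text is indexed once per enclosing element, which is exactly the redundancy noted above.

```python
import xml.etree.ElementTree as ET

def element_units(xml_string):
    """Yield one (path, text) indexing unit per element of the document."""
    root = ET.fromstring(xml_string)
    def visit(elem, path):
        here = path + "/" + elem.tag
        # Concatenate all text below this element; deep text therefore
        # reappears in every ancestor's unit (the redundancy above).
        text = " ".join(t.strip() for t in elem.itertext() if t.strip())
        yield here, text
        for child in elem:
            yield from visit(child, here)
    yield from visit(root, "")

for path, text in element_units(DOC):      # DOC from the earlier sketch
    print(path, "->", text)
```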
Nested Elements • To Get Rid of Redundancy: • Discard All Small Elements • Discard All Element Types that Users do not Look at (Working XML Retrieval System Logs) • Discard All Element Types that Assessors Generally do not Judge to be Relevant (If Relevance Assessments are Available) • Only Keep Element Types that a System Designer or Librarian has Deemed to be Useful Search Results
Nested Elements • Remove Nested Elements in a Postprocessing Step • Collapse Several Nested Elements in the Results List and then Highlight Results
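One possible way to sketch that postprocessing step (an assumption, not necessarily the presenter's method): walk the ranked list and drop any result whose element path is nested inside, or contains, an element that was already kept.

```python
def remove_nested(ranked_paths):
    """Keep a result only if it is neither an ancestor nor a descendant of a
    higher-ranked result (paths like '/book/chapter[1]/title')."""
    kept = []
    for path in ranked_paths:
        nested = any(path.startswith(k + "/") or k.startswith(path + "/")
                     for k in kept)
        if not nested:
            kept.append(path)
    return kept

ranked = ["/book/chapter[1]", "/book/chapter[1]/title", "/book/chapter[2]"]
print(remove_nested(ranked))   # ['/book/chapter[1]', '/book/chapter[2]']
```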
Lexicalized Subtrees • Goal: encode each word, together with its position within the XML tree, by a dimension of the vector space • Map XML documents to lexicalized subtrees • Take each text node (leaf) and break it into multiple nodes, one for each word, e.g. split Bill Gates into Bill and Gates • Define the dimensions of the vector space to be lexicalized subtrees of documents – subtrees that contain at least one vocabulary term
Lexicalized Subtrees • Queries and documents can be represented as vectors in this lexicalized subtree space • Matches can then be computed, for example, by using the vector space formalism • Vector Space Formalism -> Unstructured vs Structured • Dimensions: Vocabulary Terms vs Lexicalized Subtrees
Dimensions: Tradeoff • Dimensionality of the Space vs Accuracy of Results • Restrict Dimensions to Vocabulary Terms • Standard Vector Space Retrieval System • Does Not Match the Structure of the Query • Separate Lexicalized Dimension for Each Subtree • Dimensionality of the Space Becomes too Large
Dimensions: Compromise • Index All Paths that End with a Single Vocabulary Term (XML-Context Term Pairs) • Structural Term <c, t>: a pair of XML-context c and vocabulary term t
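A minimal sketch of extracting structural terms <c, t> from a document; the exact path encoding is an assumption for illustration, since the slides do not fix one:

```python
import xml.etree.ElementTree as ET

def structural_terms(xml_string):
    """Yield <c, t> pairs: c is the path of element tags from the root down
    to a text node, t is a single word of that text node."""
    root = ET.fromstring(xml_string)
    def visit(elem, path):
        context = path + "/" + elem.tag
        for word in (elem.text or "").split():
            yield context, word.lower()
        for child in elem:
            yield from visit(child, context)
    yield from visit(root, "")

for c, t in structural_terms(DOC):   # DOC from the earlier sketch
    print(c, t)                      # e.g. /book/chapter/title apple
```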
Context Resemblance • To measure the similarity between a path cq in a query and a path cd in a document: CR(cq, cd) = (1 + |cq|) / (1 + |cd|) if cq matches cd, and 0 otherwise • |cq| and |cd| are the number of nodes in the query path and document path respectively • cq matches cd if and only if we can transform cq into cd by inserting additional nodes
Context Resemblance • CR(cq4, cd2) = 3/4 = 0.75 • CR(cq4, cd3) = 3/5 = 0.6
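A small sketch of CR on paths represented as lists of tags; the matching test checks whether cq can be turned into cd by inserting nodes, i.e. whether cq is a subsequence of cd (the example paths below are invented, since the slide's figure is not reproduced here):

```python
def is_subsequence(cq, cd):
    """True if cq can be transformed into cd by inserting additional nodes."""
    it = iter(cd)
    return all(tag in it for tag in cq)

def context_resemblance(cq, cd):
    """CR(cq, cd) = (1 + |cq|) / (1 + |cd|) if cq matches cd, else 0."""
    return (1 + len(cq)) / (1 + len(cd)) if is_subsequence(cq, cd) else 0.0

print(context_resemblance(["book", "title"],
                          ["book", "chapter", "title"]))   # 3/4 = 0.75
```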
Document Similarity Measure • Final Score for a Document • Variant of the Cosine Measure • Also called «SimNoMerge» • Not a True Cosine Measure Since Its Value can be Larger than 1.0
Document Similarity Measure • V is the vocabulary of non-structural terms • B is the set of all XML contexts • weight(q, t, c) and weight(d, t, c) are the weights of term t in XML context c in query q and document d, respectively • Standard weighting, e.g. idf_t × wf_t,d, where idf_t depends on which elements we use to compute df_t
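The slide's formula itself is not preserved in the text; reconstructed here from the definitions above (this is the SimNoMerge scoring function as defined in the Introduction to Information Retrieval textbook that this material follows):

```latex
\mathrm{SimNoMerge}(q,d) =
  \sum_{c_k \in B} \sum_{c_l \in B} \mathrm{CR}(c_k, c_l)
  \sum_{t \in V} \mathrm{weight}(q, t, c_k)\,
  \frac{\mathrm{weight}(d, t, c_l)}
       {\sqrt{\sum_{c \in B,\; t \in V} \mathrm{weight}^{2}(d, t, c)}}
```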
SimNoMerge Algorithm ScoreDocumentsWithSimNoMerge(q, B, V, N, normalizer)
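The slide only names the procedure and its parameters; below is a hedged Python sketch of how such a routine could look, assuming an inverted index keyed by structural terms (the index layout and the context_resemblance helper from the earlier sketch are assumptions, not the presenter's implementation):

```python
def score_documents_with_sim_no_merge(q, B, V, N, normalizer, index):
    """Score all N documents against query q (SimNoMerge-style).

    q          : list of (context, term) pairs, i.e. structural terms of the query
    B          : set of XML contexts in the collection (tuples of tags)
    V          : vocabulary of non-structural terms (not used directly here)
    normalizer : per-document norms, the denominator of the formula above
    index      : assumed layout mapping a structural term (c, t) to postings
                 [(doc_id, weight), ...]
    """
    score = [0.0] * N
    for cq, t in q:
        w_q = 1.0                            # query term weight (simplified)
        for c in B:
            cr = context_resemblance(cq, c)  # from the earlier sketch
            if cr == 0.0:
                continue
            for doc_id, w_d in index.get((c, t), []):
                score[doc_id] += cr * w_q * w_d
    return [s / normalizer[n] for n, s in enumerate(score)]
```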
INEX • Initiative for the Evaluation of XML Retrieval • Yearly standard benchmark evaluation that has produced test collections (documents, sets of queries, and relevance judgments) • Based on an IEEE journal collection (since 2006 INEX uses the much larger English Wikipedia test collection) • The relevance of documents is judged by human assessors.
INEX Topics • Content Only (CO) • Regular Keyword Queries Like in Unstructured IR • Content and Structure (CAS) • Structured Constraints in Addition to Keywords • Relevance Assessments are More Complicated
INEX Relevance Assessments • INEX 2002 defined component coverage and topical relevance as orthogonal dimensions of relevance • Component Coverage: • Evaluates Whether the Element Retrieved is «Structurally» Correct • Topical Relevance
INEX Relevance Assessments • Component Coverage: • Exact coverage (E): The information sought is the main topic of the component and the component is a meaningful unit of information • Too small (S): The information sought is the main topic of the component, but the component is not a meaningful (self-contained) unit of information • Too large (L): The information sought is present in the component, but is not the main topic • No coverage (N): The information sought is not a topic of the component • Topical Relevance: • Highly Relevant (3), Fairly Relevant (2), Marginally Relevant (1) and Nonrelevant (0)
Combining The Relevance Dimensions • Not all combinations of the two dimensions are possible, e.g. 3N (highly relevant, but no coverage) • Quantization: maps each (relevance, coverage) combination onto a single relevance value for evaluation
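As a sketch, the simplest ("strict") quantization counts only the combination 3E as relevant; the generalized INEX quantization instead assigns graded partial credit to the other legal combinations (its exact values are not reproduced here):

```python
def q_strict(relevance, coverage):
    """Strict quantization: only exact coverage of a highly relevant
    component (combination 3E) counts as relevant."""
    return 1.0 if (relevance, coverage) == (3, "E") else 0.0

print(q_strict(3, "E"))  # 1.0
print(q_strict(2, "L"))  # 0.0
```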
INEX Evaluation Measures • Precision and Recall can be applied • Quantized relevance grades are summed instead of counting binary relevance • Overlap is not accounted for • Nested elements in the same search result • Recent INEX focus: • Develop algorithms and evaluation measures that return non-redundant results lists and evaluate them properly