XML Information Retrieval

XML Information Retrieval. Hui Fang, Department of Computer Science, University of Illinois at Urbana-Champaign. Some slides are borrowed from Norbert Fuhr's XML Tutorial.

Outline • XML basics • Research Topics • XML IR • Tasks • Retrieval methods • Clustering XML documents

XML standards

Basic XML • Hierarchical document format for information exchange on the WWW • Self-describing data (tags) • Nested element structure with a single root • Element data can have • Attributes • Sub-elements (Slides from Jayavel Shanmugasundaram)

Example XML document (the slide labels an element and an attribute)
<?xml version="1.0" encoding="ISO-8859-1" ?>
<book>
  <title>Finding Out About: A Cognitive Perspective on Search Engine Technology and the WWW</title>
  <author id="rbelew">
    <name>
      <firstname>Richard</firstname>
      <lastname>Belew</lastname>
    </name>
    <address>
      <city>San Diego</city>
      <zip>92093</zip>
    </address>
  </author>
</book>

Tree structure of XML documents
book
  title: Finding…
  author (id="rbelew")
    name
      First name: Richard
      Last name: Belew
    address
      city: San Diego
      Zip code: 92093
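The tree view above can be reproduced programmatically. The following is a minimal sketch using Python's standard `xml.etree.ElementTree` parser; the helper name `outline` is hypothetical and only serves this illustration.

```python
import xml.etree.ElementTree as ET

# The example document from the slides (straight quotes, no XML declaration).
BOOK_XML = """<book>
  <title>Finding Out About: A Cognitive Perspective on Search Engine Technology and the WWW</title>
  <author id="rbelew">
    <name><firstname>Richard</firstname><lastname>Belew</lastname></name>
    <address><city>San Diego</city><zip>92093</zip></address>
  </author>
</book>"""


def outline(element, depth=0):
    """Return one indented line per element: tag, attributes, own text."""
    text = (element.text or "").strip()
    attrs = "".join(f' {key}="{value}"' for key, value in element.attrib.items())
    lines = ["  " * depth + element.tag + attrs + (f": {text}" if text else "")]
    for child in element:  # iterating an Element yields its direct children
        lines.extend(outline(child, depth + 1))
    return lines


root = ET.fromstring(BOOK_XML)
tree_lines = outline(root)
print("\n".join(tree_lines))
```

Each nesting level in the output corresponds to one parent-child edge of the document tree.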

The basic XML standard does not deal with … • Standardization of element names → XML namespaces • Structure of element content → XML DTDs • Data types of element content → XML Schema

XML namespace • Provide a method to avoid element name conflicts • Example: two <table> elements with different meanings
<table> <tr> <td>Apples</td> <td>Bananas</td> </tr> </table>
<table> <name>GPA Table</name> <width>80</width> <length>120</length> </table>

XML namespace (Cont.) • Provide a method to avoid element name conflicts • Example: prefixes bound to namespace URIs keep the two tables apart
<h:table xmlns:h="http://www.w3.org/TR/html4/"> <h:tr> <h:td>Apples</h:td> <h:td>Bananas</h:td> </h:tr> </h:table>
<f:table xmlns:f="http://www.w3schools.com/gpa"> <f:name>GPA Table</f:name> <f:width>80</f:width> <f:length>120</f:length> </f:table>
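To see how a parser keeps the two `table` elements apart, here is a small sketch with Python's `xml.etree.ElementTree`. The wrapping `<root>` element and the prefix map `NS` are assumptions added so that both fragments fit in one well-formed document.

```python
import xml.etree.ElementTree as ET

# Both tables from the slide, wrapped in a hypothetical <root> element.
DOC = """<root>
  <h:table xmlns:h="http://www.w3.org/TR/html4/">
    <h:tr><h:td>Apples</h:td><h:td>Bananas</h:td></h:tr>
  </h:table>
  <f:table xmlns:f="http://www.w3schools.com/gpa">
    <f:name>GPA Table</f:name><f:width>80</f:width><f:length>120</f:length>
  </f:table>
</root>"""

# Prefix-to-URI map used only for lookups; the prefixes here are arbitrary.
NS = {"h": "http://www.w3.org/TR/html4/", "f": "http://www.w3schools.com/gpa"}

root = ET.fromstring(DOC)
html_cells = [td.text for td in root.findall(".//h:td", NS)]
gpa_name = root.find("f:table/f:name", NS).text
# ElementTree stores each tag as "{namespace-uri}localname".
table_tags = [child.tag for child in root]

print(html_cells, gpa_name)
print(table_tags)
```

The two elements share the local name `table` but have different qualified names, so a query for one never matches the other.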

XML Document Type Definition • Define the document structure with a list of legal elements
DTD (note.dtd):
<!ELEMENT note (to,from,heading,body)>
<!ELEMENT to (#PCDATA)>
<!ELEMENT from (#PCDATA)>
<!ELEMENT heading (#PCDATA)>
<!ELEMENT body (#PCDATA)>
Document referencing the DTD:
<?xml version="1.0"?>
<!DOCTYPE note SYSTEM "note.dtd">
<note> <to>Tove</to> <from>Jani</from> <heading>Reminder</heading> <body>Have a rest!</body> </note>

Research Topics related to XML

Research Topics • IR areas: • Retrieval models • Query languages • … • DB areas: • Query languages • System architecture • Apply relational DB technology to XML data • Streaming XML • XML query processing • XML indexing and compression • …

XML IR

INEX: Initiative for the Evaluation of XML Retrieval • Documents: 12,107 articles in XML format • Queries: 30 content-only (CO); 30 content-and-structure (CAS) • Relevance assessments: by participating groups • Participants: 36 active groups in 2003

CO (content-only) search task • Document as hierarchical structure of nested elements • Type of elements is not considered • Query refers to content only • Query syntax as in standard text retrieval • Task: Find the smallest subtree (element) satisfying the query
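The CO task can be illustrated with a deliberately simplified Boolean sketch: an element "satisfies" the query if its subtree text contains every query term, and only elements with no satisfying child are returned. Real INEX systems rank elements rather than filter them; the function name and the sample document below are hypothetical.

```python
import xml.etree.ElementTree as ET


def smallest_matching_elements(root, query_terms):
    """Return elements whose subtree contains all terms but no child's does."""
    terms = {term.lower() for term in query_terms}
    results = []

    def visit(element):
        # Whitespace tokenization of all text below this element.
        tokens = " ".join(element.itertext()).lower().split()
        if not terms.issubset(tokens):
            return False
        # Visit every child (no short-circuit) so all minimal matches are found.
        child_matches = [visit(child) for child in element]
        if not any(child_matches):
            results.append(element)
        return True

    visit(root)
    return results


# Hypothetical two-section article for illustration.
DOC = """<article>
  <sec><p>augmented reality in games</p></sec>
  <sec><p>augmented reality supports medicine</p><p>surgery planning</p></sec>
</article>"""

hits = smallest_matching_elements(ET.fromstring(DOC), ["augmented", "medicine"])
print([(hit.tag, hit.text) for hit in hits])
```

The article and the second section also contain both terms, but they are skipped because a more specific element (the paragraph) already satisfies the query.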

Example of CO Topic
<INEX-Topic topic-id="45" query-type="CO" ct-no="056">
  <Title> <cw>augmented reality and medicine</cw> </Title>
  <Description> How virtual (or augmented) reality can contribute to improve the medical and surgical practice. </Description>
  <Narrative> In order to be considered relevant, a document/component must include considerations about applications of computer graphics and especially augmented (or virtual) reality to medicine (including surgery). </Narrative>
  <Keywords> Augmented virtual reality medicine surgery improve computer assisted aided image </Keywords>
</INEX-Topic>

CAS (content-and-structure) search task • Queries contain explicit references to the XML structure, by restricting • The context of interest • <te>: target element • The context of certain search concepts • (<cw>, <ce>) pairs

Example of CAS topic
<INEX-Topic topic-id="09" query-type="CAS" ct-no="048">
  <Title> <te>article</te> <cw>non-monotonic reasoning</cw><ce>bdy/sec</ce> <cw>1999 2000</cw><ce>hdr//yr</ce> <cw>-calendar</cw><ce>tig/atl</ce> <cw>belief revision</cw> </Title>
  <Narrative> Retrieve all articles from the years 1999-2000 that deal with works on non-monotonic reasoning. Do not retrieve CfPs/calendar entries. </Narrative>
  <Keywords> non-monotonic reasoning belief revision </Keywords>
</INEX-Topic>

XML Retrieval Methods • XIRQL • XML query languages with IR-related features • Language models • JuruXML

XIRQL (I) • CO approaches: • Split document text into disjoint nodes • Index nodes separately • Aggregate indexing weights for higher-level elements (subtrees)

Index nodes as units for term weighting • Application of known indexing functions (e.g. tf*idf) • [Figure: example document tree with index nodes numbered 1 to 5. Node labels: document class="H.3.3"; author: John Smith; title: XML Retrieval; two chapters and two sections with headings Introduction, XML Query Lang. XQL, Examples, Syntax; text fragments "This. . ." and "We describe syntax of XQL".]

Index nodes for relevance-oriented search • Q1: syntax ∧ example • Q2: XQL • [Figure: the example document tree (document class="H.3.3", author John Smith, title XML Retrieval, headings Introduction, XML Query Lang. XQL, Examples, Syntax) with index nodes numbered 1 to 5.]

Combining weights … by disjunction • Q1: syntax ∧ example • Q2: XQL • Indexing weights: chapter: 0.3 XQL; section1: 0.5 example, 0.8 XQL; section2: 0.7 syntax • Weights aggregated at chapter: XQL 0.8+0.3-0.8*0.3=0.86; example 0.5; syntax 0.7 • Q1 at chapter: 0.7*0.5=0.35 • Need to return most specific element satisfying the query!
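The arithmetic on this slide can be checked with a short sketch. It assumes that term weights are treated as probabilities of independent events, so that disjunction is a + b - a*b and conjunction is a product; the function names and the assignment of weights to section1 and section2 follow the slide layout as best it can be read.

```python
def prob_or(*weights):
    """Probabilistic disjunction, folding a + b - a*b over the weights."""
    result = 0.0
    for weight in weights:
        result = result + weight - result * weight
    return result


def prob_and(*weights):
    """Probabilistic conjunction of independent events (plain product)."""
    result = 1.0
    for weight in weights:
        result *= weight
    return result


def aggregate(own, children):
    """Combine a node's own term weights with those of its children by disjunction."""
    terms = set(own).union(*children)
    return {
        term: prob_or(own.get(term, 0.0), *(child.get(term, 0.0) for child in children))
        for term in terms
    }


# Indexing weights from the slide.
chapter_own = {"XQL": 0.3}
section1 = {"example": 0.5, "XQL": 0.8}
section2 = {"syntax": 0.7}

chapter = aggregate(chapter_own, [section1, section2])
q1_at_chapter = prob_and(chapter["syntax"], chapter["example"])
print(chapter, q1_at_chapter)
```

With plain disjunction the chapter scores at least as high as any of its sections for every term, which is why a further mechanism is needed to prefer the most specific element.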

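Combining weights … with augmentation weight • Q2: XQL • Indexing weights: chapter: 0.3 XQL; section1: 0.5 example, 0.8 XQL; section2: 0.7 syntax • Augmentation weight 0.6 applied when section weights are propagated up to chapter: XQL 0.6*0.8=0.48; example 0.6*0.5=0.30; syntax 0.6*0.7=0.42 • XQL at chapter: 0.48+0.3-0.48*0.3≈0.64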
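A sketch of the same propagation with an augmentation weight, again assuming independent probabilistic events. The constant and function names are hypothetical; the numbers are the ones on the slide.

```python
AUGMENTATION = 0.6  # down-weights evidence propagated from a child to its parent


def propagate(own, children, augmentation=AUGMENTATION):
    """Disjunction of a node's own weights with augmented child weights."""
    terms = set(own).union(*children)
    combined = {}
    for term in terms:
        result = own.get(term, 0.0)
        for child in children:
            weight = augmentation * child.get(term, 0.0)
            result = result + weight - result * weight
        combined[term] = result
    return combined


chapter_own = {"XQL": 0.3}
section1 = {"example": 0.5, "XQL": 0.8}
section2 = {"syntax": 0.7}

chapter = propagate(chapter_own, [section1, section2])

# Rank the candidate elements for Q2 ("XQL") by their weight for that term.
candidates = {"chapter": chapter, "section1": section1, "section2": section2}
ranking = sorted(
    ((name, weights.get("XQL", 0.0)) for name, weights in candidates.items()),
    key=lambda pair: pair[1],
    reverse=True,
)
print(ranking)
```

Because the chapter's aggregated weight for XQL (about 0.64) is now below section1's own weight (0.8), the more specific element is ranked first for Q2.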

XIRQL(II) • CAS approaches • Extension of XQL by • Weighting and ranking • Data types with vague predicates • Structural relativism

XQL Expressions • Path condition • search for single elements: heading • parent-child: chapter/heading • ancestor-descendant: chapter//section • document root: /book/* • Filter w.r.t. structure: //chapter[heading] • Filter w.r.t. content: /document[@class="H.3.3" $and$ author="John Smith"]
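XQL itself is not available in Python's standard library, but several of the path conditions above have close analogues in the limited XPath subset supported by `xml.etree.ElementTree`. The sketch below uses that subset as a stand-in, not as an XQL implementation; the sample document and the wrapping `<collection>` element are assumptions, and the `$and$` filter is approximated by two chained predicates.

```python
import xml.etree.ElementTree as ET

# Hypothetical document modelled on the node labels in the slides.
DOC = """<collection>
  <document class="H.3.3">
    <author>John Smith</author>
    <title>XML Retrieval</title>
    <chapter><heading>Introduction</heading></chapter>
    <chapter><heading>XML Query Lang. XQL</heading>
      <section><heading>Examples</heading></section>
      <section><heading>Syntax</heading></section>
    </chapter>
  </document>
</collection>"""

root = ET.fromstring(DOC)

# Single elements: every <heading>, at any depth.
all_headings = [h.text for h in root.iter("heading")]
# Parent-child: headings that are direct children of a chapter.
chapter_headings = [h.text for h in root.findall("document/chapter/heading")]
# Ancestor-descendant: sections anywhere below a chapter.
chapter_sections = root.findall("document/chapter//section")
# Filter on structure: chapters that have a heading child.
chapters_with_heading = root.findall(".//chapter[heading]")
# Filter on content: attribute test and child-text test, chained.
matching_docs = root.findall("document[@class='H.3.3'][author='John Smith']")

print(all_headings)
print(chapter_headings, len(chapter_sections), len(chapters_with_heading), len(matching_docs))
```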

Data types with vague predicates • Compares two values of a specific data type • E.g. near, broader, narrower • Returns (probabilistic) matching value • E.g. "Search for an artist named Ulbrich, living in Frankfurt, Germany about 100 years ago" • Candidate: Ernst Olbrich, Darmstadt, 1899 • P(Olbrich ≈ Ulbrich)=0.8 (phonetic similarity) • P(1899 ≈ 1903)=0.9 (numeric similarity) • P(Darmstadt ≈ Frankfurt)=0.7 (geographic distance)
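The slide gives one matching probability per predicate but does not say how they are combined into a score for the candidate. One plausible reading, assuming the three predicates are independent and joined by conjunction, is to multiply them; this is an assumption for illustration, not something stated in the slides.

```python
import math

# Matching probabilities quoted on the slide for the Olbrich candidate.
matches = {
    "name (phonetic similarity)": 0.8,
    "year (numeric similarity)": 0.9,
    "city (geographic distance)": 0.7,
}

# Assumption: independent predicates combined by conjunction, i.e. a product.
combined = math.prod(matches.values())
print(round(combined, 3))
```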

Structural relativism • Drop distinction attribute/element: ~author searches for attribute or element • Generalize to data types: #personname searches for attributes/elements of a specific data type

Language models • Generate language models for each node in the tree • Combine the children language models using linear interpolation • Use EM approach to train the linear interpolation parameters

Element-specific language models---CO Approaches

Higher-level nodes: mixture of language models • Query: dog and cat • [Figure: child language models combined with mixture weights 0.5 and 0.5.]
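A minimal sketch of the mixture idea, assuming maximum-likelihood unigram models at the leaves, fixed interpolation weights of 0.5 each (the slides mention training them with EM, which is not shown here), and a query likelihood computed as a product over the query terms "dog" and "cat". The leaf texts and all function names are hypothetical.

```python
from collections import Counter


def mle_model(text):
    """Maximum-likelihood unigram model: relative term frequencies."""
    counts = Counter(text.lower().split())
    total = sum(counts.values())
    return {term: count / total for term, count in counts.items()}


def mix(models, weights):
    """Linear interpolation of several language models."""
    mixed = {}
    for model, lam in zip(models, weights):
        for term, prob in model.items():
            mixed[term] = mixed.get(term, 0.0) + lam * prob
    return mixed


def query_likelihood(model, terms):
    """P(query | model) as a product of term probabilities (no smoothing)."""
    score = 1.0
    for term in terms:
        score *= model.get(term, 0.0)
    return score


# Hypothetical leaf texts for a title node and a body node.
title_model = mle_model("dog")
body_model = mle_model("cat cat dog bird")
parent_model = mix([title_model, body_model], [0.5, 0.5])

query = ["dog", "cat"]
scores = {
    "title": query_likelihood(title_model, query),
    "body": query_likelihood(body_model, query),
    "parent": query_likelihood(parent_model, query),
}
print(scores)
```

In this toy case the parent scores highest because it is the only node whose model gives substantial probability to both query terms; without smoothing, the title node scores zero since it never mentions "cat".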

Type-specific language models--- CAS approaches

• "Return components of type x where it has component y that contains the query term w" • e.g. return documents where the title contains the word "bird" • e.g. return documents where the body's first section contains the word "dog" • [Figure: type-specific language models combined with mixture weights of 0.5.]

JuruXML • Element-specific indexing + vector space model: • Transform query into set of (term, path) conditions • Vague matching of path conditions • Modified cosine similarity as retrieval function

JuruXML(1)---Transform Query

JuruXML(2)---Vague matching of path conditions

JuruXML(3)---Retrieval function • Standard cosine similarity • wQ(ti): query term weight of term ti • wD(ti): indexing weight of term ti in the document • Modified cosine similarity • wQ(ti, ciQ): query term weight of pair (ti, ciQ) • wD(ti, ciD): indexing weight of pair (ti, ciD) in the document
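The formulas themselves did not survive on this slide, so the following is only a sketch of how the two similarity functions could look given the weights defined above. The standard cosine is the usual normalized dot product. For the modified version, each pair of matching terms is additionally scaled by a path-matching score; the particular `context_resemblance` function used here (ratio of path lengths when the query path is a subsequence of the document path) is an assumption for illustration and is not taken from the slides.

```python
import math


def norm(weights):
    """Euclidean length of a weight vector stored as a dict."""
    return math.sqrt(sum(w * w for w in weights.values()))


def cosine(query, doc):
    """Standard cosine: query and doc map term -> weight."""
    denom = norm(query) * norm(doc)
    if denom == 0:
        return 0.0
    return sum(wq * doc.get(term, 0.0) for term, wq in query.items()) / denom


def context_resemblance(query_path, doc_path):
    """Assumed vague path match: partial credit if query_path is a subsequence."""
    remaining = iter(doc_path)
    if all(step in remaining for step in query_path):
        return (1 + len(query_path)) / (1 + len(doc_path))
    return 0.0


def modified_cosine(query, doc):
    """Modified cosine: query and doc map (term, path tuple) -> weight."""
    denom = norm(query) * norm(doc)
    if denom == 0:
        return 0.0
    total = 0.0
    for (q_term, q_path), wq in query.items():
        for (d_term, d_path), wd in doc.items():
            if q_term == d_term:
                total += wq * wd * context_resemblance(q_path, d_path)
    return total / denom


query = {("xql", ("chapter",)): 1.0}
exact_doc = {("xql", ("chapter",)): 1.0}
deeper_doc = {("xql", ("chapter", "section")): 1.0}
other_doc = {("xql", ("abstract",)): 1.0}

print(modified_cosine(query, exact_doc))
print(modified_cosine(query, deeper_doc))
print(modified_cosine(query, other_doc))
```

An exact path match behaves like the standard cosine, a longer document path that still contains the query path gets partial credit, and an unrelated path contributes nothing.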

JuruXML(4)---Alternative approach (Merging contexts) • For each query term (ti,ciQ) treat all matched document terms (ti,cjD) equally from the user perspective. • Define a weight function w(ciQ) • E.g.

Clustering XML documents

Document similarity • Document representation: document → N-dimensional vector • N = number of document features • Feature sets • Text only • Tags only • Text + tags • Feature weighting in the document vector • Similarity measure: vector similarity • E.g. cosine measure
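The three feature sets can be sketched as follows, using raw counts as feature weights (an assumption; the slide leaves the weighting scheme open) and the cosine measure for similarity. The two sample documents and the function names are hypothetical.

```python
import math
import xml.etree.ElementTree as ET
from collections import Counter


def features(xml_text, use_text=True, use_tags=True):
    """Build a sparse feature vector from tag names and/or text tokens."""
    root = ET.fromstring(xml_text)
    vector = Counter()
    if use_tags:
        vector.update("tag:" + element.tag for element in root.iter())
    if use_text:
        tokens = " ".join(root.itertext()).lower().split()
        vector.update("text:" + token for token in tokens)
    return vector


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(weight * b.get(feature, 0) for feature, weight in a.items())
    denom = math.sqrt(sum(w * w for w in a.values())) * math.sqrt(sum(w * w for w in b.values()))
    return dot / denom if denom else 0.0


# Two hypothetical documents with identical structure but different text.
DOC_A = "<book><title>xml retrieval</title></book>"
DOC_B = "<book><title>image clustering</title></book>"

tags_only = cosine(features(DOC_A, use_text=False), features(DOC_B, use_text=False))
text_only = cosine(features(DOC_A, use_tags=False), features(DOC_B, use_tags=False))
text_and_tags = cosine(features(DOC_A), features(DOC_B))
print(tags_only, text_only, text_and_tags)
```

The example shows how strongly the choice of feature set affects similarity: the same pair of documents looks identical by tags, unrelated by text, and halfway in between when both are mixed with equal weight.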

Clustering methods • Hierarchical clustering: • Main weakness: quadratic complexity • Partitional clustering: • K-means • Linear time complexity • Simplicity of its algorithm

K-Means clustering algorithm
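The body of this slide is not preserved, so here is a generic sketch of the textbook K-means loop rather than the exact variant presented. It assumes dense numeric vectors, squared Euclidean distance, and initial centroids taken from the first k points so that the run is deterministic; document clustering would more typically use cosine similarity and random initialization.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k, max_iterations=100):
    """Alternate assignment and centroid update until assignments stop changing."""
    centroids = [list(point) for point in points[:k]]  # assumed initialization
    assignment = None
    for _ in range(max_iterations):
        # Assignment step: each point goes to its nearest centroid.
        new_assignment = [
            min(range(k), key=lambda c: squared_distance(point, centroids[c]))
            for point in points
        ]
        if new_assignment == assignment:
            break  # converged
        assignment = new_assignment
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:
                centroids[c] = [sum(column) / len(members) for column in zip(*members)]
    return assignment, centroids


points = [(0, 0), (10, 10), (0, 1), (10, 11)]
assignment, centroids = kmeans(points, k=2)
print(assignment, centroids)
```

Each iteration costs time proportional to the number of points times k, which is the linear behaviour that makes K-means attractive compared with hierarchical clustering.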

Measuring clustering quality • External quality: comparison of clusters with external classification • Entropy: distribution of classes within clusters • Purity: largest class in a cluster / cluster size • Internal quality: calculate average inter- and intra-cluster similarities • Cohesiveness (overall similarity)
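The two external measures can be sketched as below. Each cluster is represented by the list of true class labels of its members; averaging per-cluster values weighted by cluster size is a common convention and is assumed here, since the slide does not specify it. The example clusters are hypothetical.

```python
import math
from collections import Counter


def purity(cluster_labels):
    """Size of the largest class in the cluster divided by the cluster size."""
    counts = Counter(cluster_labels)
    return max(counts.values()) / len(cluster_labels)


def entropy(cluster_labels):
    """Entropy (in bits) of the class distribution inside one cluster."""
    size = len(cluster_labels)
    return -sum(
        (count / size) * math.log2(count / size)
        for count in Counter(cluster_labels).values()
    )


def weighted_average(measure, clusters):
    """Average a per-cluster measure, weighting each cluster by its size."""
    total = sum(len(cluster) for cluster in clusters)
    return sum(len(cluster) / total * measure(cluster) for cluster in clusters)


# Hypothetical clustering: true class labels of the members of each cluster.
clusters = [["a", "a", "a", "b"], ["b", "b"]]

overall_purity = weighted_average(purity, clusters)
overall_entropy = weighted_average(entropy, clusters)
print(overall_purity, overall_entropy)
```

Higher purity and lower entropy both indicate clusters that agree better with the external classification.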

Discussion • Text alone gives the best results • Text + tags: problem with weighting of tags vs. terms

Conclusion • XML basics • XML Retrieval Tasks and methods • Clustering XML documents

Bayesian Networks

Context-dependent Retrieval • The score of an element is given by its RSV (Retrieval Status Value). • The RSV of a node depends on the RSVs of the nodes in its context (parent nodes). • Elements with the highest values are then presented to the user.

Bayesian Networks

Bayesian Networks(Cont.)
