Searching XML Documents via XML Fragments

Searching XML Documents via XML Fragments D. Camel, Y. S. Maarek, M. Mandelbrod, Y. Mass and A. Soffer Presented by Hui Fang

Database: IR: Schema: Papers (Title, Authors, Conf., Journal) An example document: <title> Title Authors Conf. Journal Intel: New chip, new price war . February 1, 2004: 6:32 PM EST. Intel Corp. on Sunday said it had refreshed its line of microchips for desktop computers with a new version of the Pentium 4 processor, designed to run increasingly power-hungry office and home entertainment software faster. In 1998,….. </title> </date> <date> XIRQL N.Fuhr, K. Grobjohann SIGIR ------ <content> Protein John Smith --- Bioinformatics DB+IR: </content> Semi-structured Data Background(1) --- Data Why is semi-structured data important? Well-structured Data Un-structured Data Lack of flexibility Lack of extensibility <paper> <title> XIRQL </title> <author> N.Fuhr </author> <author> K.Grobjohann </author> <conf> SIGIR </conf> </paper> Lack of the logical structure of a document.

Attribute content End tag Start tag … book element id=“25” title year author author publisher Readings in … 1997 Peter Willett Morgan Kaufmann Karen Sparck Jones XML in a nutshell <book id=“25”> <year>1997</year> <author> Karen Sparck Jones </author> <author>Peter Willett </author> <publisher>Morgan Kaufmann</publisher> <title> Readings in Information Retrieval </title> </book> … • Hierarchical data format • Nested element structure having a root • Self describing data (tags), schema is attached to the data itself.

IR: Ranked Query • Keywords: • paper SIGIR • Return the ranked documents according to the relevance. Database: Boolean Query • SQL (Structured Query Language): • SELECT title • FROM papers • WHERE conf=‘SIGIR’ • Return the unranked tuples satisfying the query. Background(2) --- Query How to query semi-structured data (e.g. XML data) ?

Related Work • DB-oriented approaches • E.g. XML-QL, XQL, XQUERY … WHERE <book> <title>Harry Potter </title> <author>$a</author>, <year> $y </year> </book> in “books.xml”, $y>2002 CONSTRUCT <result> <author>$t</author> </result> • DB+IR approaches • E.g. XIRQL • IR-oriented approaches • E.g. this paper

Problem Refinement---CAS Search • Document collection: • XML documents • Each document is a hierarchical structure of nested elements • Markup in the document mainly serves for exposing the logical structure of a document. • Query • content + explicit references to the XML structure • specifies the target element need to be returned An example: Retrieval all articles from the years 1999-2000 and deal with works on nonmonotonic reasoning. Do not retrieve articles that are calendar/call for papers.

Approach • Compare apple and apple • Recall vector space models • Both documents and queries are expressed in free text. • Compare unstructured data to unstructured data • This paper: • Search XML documents via XML fragments

XQuery <results> { for $t in document (“library.xml”//book/title) where contains ($t/text(), “search”) return $t } </results> Query---XML Fragments(1) More intuitive More flexible • Topic 1: Find all books about fishing <book> fishing </book> • Topic 2: Find all books having a title about search <book> <title> fishing </title> </book>

Query --- XML Fragment(2) • Limited expressiveness • E.g. “Finding figures that describe the Corba architecture and the paragraphs that refer to those figures. “ Requires a “join” operation between two elements “figures” and “paragraphs”

d q • Vector Space Model • Represent doc/query by a vector of terms • Relevance between doc and query distance between two vectors Recall: Text Retrieval Task • Give a query • According to the retrieval formula, compute the relevance score for each document; • Rank the documents according to relevance score.

Context resemblance measure Perfect match: ,when ; 0 ,otherwise. Partial match: ,when ci subsequence of ck; 0, otherwise Fuzzy match: Flat (ignore context): Extending the Vector Space Model(1) • Indexing unit: • E.g. (“Harry Potter ”, /book/title) • Can be matched with • (“Harry Potter ”,/book) • (“Harry Potter ”,/book/sec/title) • Retrieval Formula

,where “Merge-idf” variant: and ,where “Merge” variant: Extending the Vector Space Model(2) If c is rare, idf(t,c) would be high in spite of t being very common.

Evaluation • Runs • Partial-match • Partial-match. merge-idf • Partial-match.merge • Fuzzy-match.merge-idf • Flat (ignore context)

Result(1) • Result for “free-text-oriented” topics • An example topic : <yr>1995,1996,1997,1998,1999</yr> <bdy>XML Electronic commerce </bdy>

Result(2) • Result for “context-oriented” topics • An example topic: <atl> Content-Based retrieval of video databases</atl>

Summary • Using XML fragments with an extended vector space model is promising. • Use different solutions for different types of applications • Something wrong?

Another Problem --- CO Search • Document collection: • XML documents • Query: • a set of keywords • Task: Find smallest element satisfying the query Challenge: rank the components instead of document

Possible Solutions ,where Possible Method(1): treat each component as a document. Problem with this method: XML components are nested. <article> t1 <sec> <p> t2</p></sec> </article>

Impossible to differentiate between the rankings of the three sections Possible Solutions (Cont.) ,where Possible Method(2): counting TF at the component level; computing N & DF at the document level. <article> <sec>t1</sec> <sec>t1</sec> <sec>t2</sec> </article>

Proposed Solution • Create a index for each component type • Elements in each index are regarded as documents • Keep N, DF,TF for the specific component type • Can apply the regular vector space model on each index • Given a query • Run the query in parallel on each index • Return one ranked list of results, one from each index • Normalize the scores in each index into the range (0,1) • Achieved by computing • Merge the normalized results into a one ranked list of all components Assume the set of potential components to be returned must be known in advance. Assume no nesting of the same component.

Conclusion • Possible solutions to solve the following challenges. • Challenge 1 (Information/Doc Unit): What is an appropriate information unit? • Document may no longer be the most natural unit • Components in a document may be more appropriate • Challenge 2 (Query): What is an appropriate query language? • Keyword (free text) query is no longer the only choice • Constraints on the structures can be posed

References • Retrieving the most relevant XML components, by Y. Mass, M. Mandelbrod. INEX’03 workshop. • Searching XML Documents via XML fragments, by D. Carmel, Y. S.Maarek, M. Mandelbrod, Y. Mass and A. Soffer. SIGIR’03 • XIRQL: A Query Language for Information Retrieval in XML Documents by N. Fuhr, K. Großjohann. SIGIR’02

Searching XML Documents via XML Fragments