RMIT University at INEX 2004 Heterogeneous Track Experiments
Jovan Pehcevski
Email: jovanp@cs.rmit.edu.au
School of Computer Science and Information Technology, RMIT University, Melbourne, Australia
Overview
• Research questions
• Collection statistics
• Topics
• Retrieval systems
  • Zettair (using two similarity measures)
  • Hybrid (Zettair with eXist, using two retrieval heuristics)
• Runs: all automatic, title-only
  • #1: Zettair (Okapi BM25)
  • #2: Zettair (Pivoted Cosine)
  • #3: Hybrid (MpE heuristic)
  • #4: Hybrid (PME heuristic)
• Results
  • Efficiency
  • Effectiveness (for the IEEE collection)
• Final thoughts
Research Questions
• The goal of the Heterogeneous track at INEX 2004 is to set up a test collection (a heterogeneous XML document collection, suitable retrieval topics, and relevance assessments that correspond to these topics) and to explore new retrieval challenges
• Our group at RMIT focuses on answering the following questions:
  • For CO queries, what methods are feasible for determining elements that would be reasonable answers?
  • Should the data be organised (and indexed) as a single heterogeneous collection, or is it better to treat this collection as a set of homogeneous sub-collections?
• Methods that map structural criteria from one DTD to another are NOT considered in this work
Heterogeneous Collection
• The heterogeneous XML collection at INEX 2004 consists of the following sub-collections:
  • QMULDCSDBPub – publications database of the QMUL Department of Computer Science
  • BibDBPub – BibTeX records converted to XML by the IS group at the University of Duisburg-Essen
  • HCIBIB – Human-Computer Interaction Resources, a bibliography from www.hcibib.org
  • Berkeley – library catalogue records of books in the area of computer and information science from Berkeley
  • DBLP – records from the Digital Bibliography & Library Project in Trier
  • CompuScience – records from the Computer Science database of FIZ Karlsruhe
  • IEEE – IEEE Computer Society publications from the period 1995–2002
Collection Statistics
We analyse and pre-process each sub-collection to determine the concept of a Document
Topics
• Four types of retrieval topics are considered for the Heterogeneous track at INEX 2004:
  • CO (Content-Only) – plain queries, with no structural constraints and no target elements (10 topics)
    Example: XML information retrieval
  • BCAS (Basic Content-And-Structure) – queries using single structural and content-based constraints to enable synonym matches (1 topic)
    Example: //article[about(., XML information retrieval)]
  • CCAS (Complex Content-And-Structure) – queries using complex structural and content-based constraints to enable a wide range of path transformations and partial mappings (13 topics)
    Example: //article[about(.//sec, XML information retrieval)]
  • ECCAS (Extended Complex Content-And-Structure) – queries that attach a probability to each structural constraint (0 topics)
    Example: //article(0.8)[about(.//sec(0.5), XML information retrieval)]
CCAS Topic Example

<inex_topic topic_id="3" query_type="CCAS">
  <title>
    //article[about(.//abs, Web usage mining) or about(.//sec, "Web mining" traversal navigation patterns)]
  </title>
  <content_description>
    We are looking for documents that describe capturing and mining Web usage, in particular the traversal and navigation patterns; motivations include Web site redesign and maintenance.
  </content_description>
  <structure_description>
    Article is a tag identifying a document, which can also be represented as a book tag, an inproceedings (or incollection) tag, an entry tag, etc. Abs is a tag identifying the abstract of a document, which can be represented as an abstract tag, an abs tag, etc. Sec is a tag identifying an informative document component, such as a section or paragraph. It can also be represented as sec, ss1, ss2, p, ip1 or other similar tags.
  </structure_description>
  <narrative>
    To be relevant, a document must describe methods for capturing and analysing Web usage, in particular traversal and navigation patterns. The motivation is using Web usage mining for site reconfiguration and maintenance, as well as providing recommendations to the user. Methods that are not explicitly applied to the Web but could apply are still relevant. Capturing browsing actions for pre-fetching is not relevant.
  </narrative>
  <keywords>
    Web usage mining, Web log analysis, browsing pattern, navigation pattern, traversal pattern, Web statistics, Web design, Web maintenance, user recommendations
  </keywords>
</inex_topic>
Retrieval Systems
• Our runs use two systems:
  • Zettair – a compact and fast full-text search engine
  • Hybrid – a modular system that combines the best retrieval features of Zettair and eXist (a native XML database) with a top-up module that identifies the appropriate units of retrieval
• Each retrieval system uses unconstrained, plain-text queries: for each topic, the structural constraints and the target element are removed, and terms from the <title> are used to formulate the query (see the sketch after this slide)
• The two systems use different strategies to index the terms in the heterogeneous XML collection
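As an illustration of the query formulation step above, here is a minimal Python sketch that reduces a CO or CCAS <title> to a plain-text query by discarding the structural constraints and keeping only the content terms of each about() clause. The function name and the exact text normalisation are illustrative assumptions, not the code used for the official runs.

import re

def title_to_plain_query(title):
    # Keep only the content terms of each about(...) clause and discard
    # the structural constraints (hypothetical helper, for illustration)
    abouts = re.findall(r'about\(\s*[^,]+,(.*?)\)', title)
    if not abouts:
        abouts = [title]              # CO titles carry no structure at all
    terms = " ".join(abouts).replace('"', ' ')
    return " ".join(terms.split())

print(title_to_plain_query('//article[about(.//sec, XML information retrieval)]'))
# -> XML information retrieval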
Zettair
• From zetta (10^21) and IR
• A scalable, fast search engine server
• Supports ranked, simple Boolean, and phrase queries
• Indexes HTML, XML, plain text, and TREC-formatted documents
• Usable as a C and Python library
• Native support for TREC experiments (not yet for INEX)
• Documented, with easy-to-follow examples
• BSD license
• Emphasis on simplicity and efficiency
• One executable does everything
• Under continued development
• Ported to Mac OS X, FreeBSD, MS Windows, Linux, and Solaris
• Available from www.seg.rmit.edu.au/zettair
Zettair Indexing
• With Zettair, the seven homogeneous XML collections are indexed as a single heterogeneous XML collection
• Single-pass, sort-merge indexing scheme
• Document-ordered, word-position inverted indexes
• Efficient, variable-byte index compression (see the sketch after this slide)
• Indexed the HET collection (1.14 GB) in under 5 minutes on a single AUD$2000 Intel P4 machine; throughput: 230 MB/minute
• Fast, configurable parser that handles badly-formed HTML:
  • Validates each tag by matching < with > within a character
  • HTML comments are not indexed but are validated
  • Entity references are translated
• No support for internationalised text
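The slide above mentions variable-byte index compression. The sketch below shows the general technique on a list of document gaps (d-gaps); it is a generic illustration using one common byte-termination convention, not Zettair's actual implementation.

def vbyte_encode(n):
    # Variable-byte encode a non-negative integer: 7 data bits per byte,
    # high bit set on the final byte
    out = bytearray()
    while n >= 128:
        out.append(n & 0x7F)
        n >>= 7
    out.append(n | 0x80)
    return bytes(out)

def vbyte_decode(data):
    # Inverse of vbyte_encode for a single encoded integer
    n, shift = 0, 0
    for b in data:
        n |= (b & 0x7F) << shift
        if b & 0x80:          # terminator byte
            break
        shift += 7
    return n

# Postings are stored as d-gaps, so small numbers dominate and most
# gaps fit in a single byte
gaps = [1, 7, 200, 5]
encoded = b"".join(vbyte_encode(g) for g in gaps)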
Zettair Querying
• B-tree vocabulary, bulk-loaded at index construction time
• For the 1.14 GB collection, average query time is 10 milliseconds (without explicit caching or other optimisations)
• Single-threaded, blocking I/O, and relatively unoptimised
• Provides query-biased summaries of documents (see Tombros and Sanderson, "Advantages of query biased summaries in information retrieval", SIGIR 1998)
• Supports the Pivoted Cosine and Okapi BM25 similarity measures
  • Working on further measures
  • Measures can be manipulated externally
Zettair Querying…
The Pivoted Cosine similarity measure is:

$$\mathrm{Sim}(q, d) = \sum_{t \in q \cap d} \frac{w_{q,t} \cdot w_{d,t}}{(1 - s)\,W_{AL} + s\,W_d}$$

where:

$$w_{q,t} = \ln\!\left(1 + \frac{N}{f_t}\right)$$

and:

$$w_{d,t} = 1 + \ln f_{d,t}$$

with:
W_d = document length
W_AL = average document length
s = 0.25 (the slope)
N = number of documents in the collection
f_t = collection frequency (number of documents that t occurs in)
f_{d,t} = within-document frequency
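A small worked example of the pivoted cosine contribution of a single query term, using the symbols defined above; the function and the sample values are illustrative only, not Zettair code.

import math

def pivoted_cosine_term(f_dt, f_t, N, W_d, W_AL, s=0.25):
    # Contribution of one query term t to Sim(q, d), following the formula above
    w_qt = math.log(1.0 + N / f_t)       # query-term weight
    w_dt = 1.0 + math.log(f_dt)          # within-document weight
    return (w_qt * w_dt) / ((1.0 - s) * W_AL + s * W_d)

# e.g. a term occurring 3 times in a 400-word document, in 1,000 of 808,884 documents
score = pivoted_cosine_term(f_dt=3, f_t=1000, N=808884, W_d=400, W_AL=350)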
Zettair Querying…
The Okapi BM25 similarity measure is:

$$\mathrm{Sim}(q, d) = \sum_{t \in q \cap d} w_t \cdot \frac{(k_1 + 1)\,f_{d,t}}{K + f_{d,t}} \cdot \frac{(k_3 + 1)\,f_{q,t}}{k_3 + f_{q,t}}$$

where:

$$w_t = \ln\!\left(\frac{N - f_t + 0.5}{f_t + 0.5}\right)$$

and:

$$K = k_1\left((1 - b) + b \cdot \frac{W_d}{W_{AL}}\right)$$

with:
W_d = document length
W_AL = average document length
k_1 = 1.2
k_3 = 1000 (effectively infinite)
b = 0.75
N = number of documents in the collection
f_{q,t} = query-term frequency
f_{d,t} = within-document frequency
f_t = collection frequency (number of documents that t occurs in)
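And the corresponding per-term Okapi BM25 contribution, with the parameter values listed above; again a sketch of the formula rather than Zettair's code, with illustrative sample values.

import math

def okapi_bm25_term(f_dt, f_qt, f_t, N, W_d, W_AL, k1=1.2, k3=1000.0, b=0.75):
    # Contribution of one query term t to Sim(q, d), using the slide's parameter values
    w_t = math.log((N - f_t + 0.5) / (f_t + 0.5))   # inverse document frequency weight
    K = k1 * ((1.0 - b) + b * W_d / W_AL)           # document-length normalisation
    return w_t * ((k1 + 1.0) * f_dt) / (K + f_dt) * ((k3 + 1.0) * f_qt) / (k3 + f_qt)

score = okapi_bm25_term(f_dt=3, f_qt=1, f_t=1000, N=808884, W_d=400, W_AL=350)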
Hybrid
• Utilises the best features from Zettair and eXist
• With eXist, the seven homogeneous XML collections are indexed separately, but queries can span across the XML collections
• The Hybrid system uses a "fetch and browse" approach: heterogeneous Documents are first retrieved and ranked by Zettair (the fetch phase), and the most specific elements of the highly ranked Documents are then extracted by eXist (the browse phase)
• The system also uses a retrieval module that identifies and ranks Coherent Retrieval Elements (CREs) (more on the next slides)
Coherent Retrieval Elements
Definition: A Coherent Retrieval Element (CRE) is an element that contains at least two matching elements (extracted by eXist), or at least two other Coherent Retrieval Elements, or a combination of a matching element and a Coherent Retrieval Element.

In plain words: the list of matching elements extracted by eXist is a document-ordered list (see Table 1 on the next slide). The list is processed by considering pairs of elements, starting from the first element down to the last. In each step, a CRE is identified as the most specific ancestor of the two matching elements that constitute the pair.
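A minimal Python sketch of the pairwise step described above: each consecutive pair of matching-element paths yields its most specific common ancestor as a candidate CRE. The example paths and helper names are hypothetical, and the sketch omits the combination of CREs with other CREs that the full definition allows.

def common_ancestor(path_a, path_b):
    # Most specific (deepest) common ancestor of two absolute XPaths
    a, b = path_a.strip("/").split("/"), path_b.strip("/").split("/")
    prefix = []
    for x, y in zip(a, b):
        if x != y:
            break
        prefix.append(x)
    return "/" + "/".join(prefix)

def identify_cres(matching_elements):
    # Walk the document-ordered list pairwise; each pair contributes
    # its most specific ancestor as a CRE (illustration only)
    cres = []
    for left, right in zip(matching_elements, matching_elements[1:]):
        cre = common_ancestor(left, right)
        if cre not in cres:
            cres.append(cre)
    return cres

matches = ["/article/bdy/sec[1]/p[2]", "/article/bdy/sec[1]/p[5]", "/article/bdy/sec[3]/st"]
print(identify_cres(matches))
# -> ['/article/bdy/sec[1]', '/article/bdy']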
Matching Elements Table 1. eXist list of matching elements
Matching versus CREs Figure 1. Matching versus Coherent Retrieval Elements
Ranking the CREs
• To determine the final ranks of CREs, the retrieval module uses a combination of the following heuristics (see the sketch after this slide):
  • The number of times a CRE appears in the absolute path of each extracted element in the eXist list of matching elements – more matches (M) or fewer matches (m)
  • The length of the absolute path of the CRE, taken from the root element – longer path (P) or shorter path (p)
  • The ordering of the XPath sequence in the absolute path of the CRE – nearer to the beginning (B) or nearer to the end (E)
• On the INEX 2003 test set, MpE yields the best performance, although PME is more suitable for some metrics
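A sketch of how the three heuristics could be combined into the MpE and PME orderings, reading each letter as a sort criterion and direction (more matches, then shorter path, then nearer the end for MpE; longer path, then more matches, then nearer the end for PME). This reading of the letter codes and the tie-breaking details are assumptions for illustration, not the module's actual code.

def rank_cres(cres, matching_elements, heuristic="MpE"):
    # Score each CRE: how many matching-element paths contain it (M/m),
    # the length of its absolute path (P/p), and its position in the
    # document-ordered list (B/E); then sort by the chosen letter order
    scored = []
    for pos, cre in enumerate(cres):
        matches = sum(1 for m in matching_elements
                      if m == cre or m.startswith(cre + "/"))
        depth = cre.count("/")            # path length from the root element
        scored.append((cre, matches, depth, pos))
    if heuristic == "MpE":
        # more matches, then shorter path, then nearer the end of the list
        scored.sort(key=lambda c: (-c[1], c[2], -c[3]))
    else:  # "PME"
        # longer path, then more matches, then nearer the end of the list
        scored.sort(key=lambda c: (-c[2], -c[1], -c[3]))
    return [c[0] for c in scored]

matches = ["/article/bdy/sec[1]/p[2]", "/article/bdy/sec[1]/p[5]", "/article/bdy/sec[3]/st"]
print(rank_cres(["/article/bdy/sec[1]", "/article/bdy"], matches, "MpE"))
# -> ['/article/bdy', '/article/bdy/sec[1]']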
Ranking the CREs…
Table 2. Ranked list of Coherent Retrieval Elements (using the MpE heuristic)
Ranking the CREs…
Table 3. Ranked list of Coherent Retrieval Elements (using the PME heuristic)
Runs
• Four runs: automatic, title-only
  • Zettair_BM25, using Zettair with the Okapi BM25 similarity measure
  • Zettair_PCosine, using Zettair with the Pivoted Cosine similarity measure
  • Hybrid_MpE, using the Hybrid system with the MpE heuristic combination
  • Hybrid_PME, using the Hybrid system with the PME heuristic combination
• The two Hybrid runs use Zettair with the Pivoted Cosine similarity measure
• We use each of the above runs in each topic category (except ECCAS), resulting in 12 runs in total*

* Our official INEX 2004 submission had 9 runs, since Hybrid_MpE was not initially considered
Efficiency Results
• The following efficiency results apply to Zettair only
• HET collection indexed on a single $2000 Intel P4 machine
  • 808,884 documents, 1.14 GB of text
  • 5 minutes to index, at 230 MB/minute
  • 10 milliseconds per query to search (on average)
• No stopping or stemming
• Limited accumulators with the "continue" strategy
• Interesting statistics:
  • Full-text index size, with full word positions, was 38.4% of the collection size (438.5 MB)
  • Distinct terms: 1.94 million
  • Term occurrences: 1.06 billion
Efficiency Results… Detailed statistics (per collection):
Effectiveness Results
The following results consider the IEEE collection only
Effectiveness Results…
• Quantitative, rather than qualitative, analysis for the IEEE collection (we will perform a detailed qualitative, query-and-run-oriented analysis once the Het relevance assessments are ready)
• With P@10 for the IEEE collection, the Hybrid runs are (on average) NOT substantially better than the full-text runs
• CO topics
  • Okapi better than Pivoted Cosine
  • MpE heuristic better than PME heuristic
  • Hybrid_MpE is best, although with P@10 Zettair_BM25 is competitive
• CCAS topics
  • Pivoted Cosine better than Okapi
  • MpE heuristic (again) better than PME heuristic
  • Hybrid_MpE is best (with MAP), but Zettair_PCosine is best (with P@10)
• With P@10, for either the CO or CCAS topic type the best Zettair run is equal to or better than the best Hybrid run
Final Thoughts
• Four very different runs, exploring different similarity measures and retrieval heuristics (Okapi BM25 versus Pivoted Cosine, MpE heuristic versus PME heuristic)
• Surprises in the results
  • A plain full-text search engine is very competitive
  • More evaluation and follow-up after INEX 2004
• Research questions
  • For CO queries, what methods are feasible for determining elements that would be reasonable answers?
    • The MpE heuristic in the CRE module appears to be a feasible method
  • Should the data be organised (and indexed) as a single heterogeneous collection, or is it better to treat this collection as a set of homogeneous sub-collections?
    • Indexing the data as a single heterogeneous collection appears to be both an efficient and an effective choice