Overview of the INEX 2008 Efficiency Track
Martin Theobald, Ralf Schenkel
Max Planck Institute
General Idea
• We have a nice collection of readily assessed INEX Ad-Hoc topics from 2006-2008, so why not consider runtimes for a change?
• Attract more people from DB&IR to efficient XML-IR
• Investigate effectiveness/efficiency trade-offs for different retrieval modes and topic types: Article, Thorough, Focused, NEXI CO/CAS, XPath 2.0 Full-Text, high-dimensional content (query expansion/relevance feedback), deep structure, top-k, distribution, sequential vs. parallel execution
Test Collection
• Default INEX-Wikipedia collection 2007
  • 4.38 GB of XML sources
  • > 659,000 documents and > 115,000,000 elements
• Not very heterogeneous, but sometimes a rather awkward structure with many deeply nested paths
  • > 3,000 distinct tags (> 1,000 of which have content)
  • > 120,000 distinct root-to-leaf paths (a sketch for recomputing such statistics follows below)
  • No DTD available
• But: the large body of Ad-Hoc topics & assessments can be reused
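The collection statistics above can be recomputed with a single streaming pass over the corpus. The following is a minimal, unofficial sketch in Python; it assumes the collection has been unpacked into a directory of .xml files (the directory name wikipedia-xml is a placeholder).

  import os
  import xml.sax

  class PathStats(xml.sax.ContentHandler):
      def __init__(self):
          super().__init__()
          self.stack = []        # tag path from the root to the current element
          self.has_child = []    # per depth: did this element open any child?
          self.tags = set()
          self.leaf_paths = set()
          self.n_elements = 0

      def startElement(self, name, attrs):
          if self.has_child:
              self.has_child[-1] = True
          self.stack.append(name)
          self.has_child.append(False)
          self.tags.add(name)
          self.n_elements += 1

      def endElement(self, name):
          if not self.has_child.pop():                    # no children: a leaf element
              self.leaf_paths.add("/".join(self.stack))   # record its root-to-leaf path
          self.stack.pop()

  handler = PathStats()
  n_docs = 0
  for dirpath, _, files in os.walk("wikipedia-xml"):      # placeholder directory name
      for name in files:
          if name.endswith(".xml"):
              xml.sax.parse(os.path.join(dirpath, name), handler)
              n_docs += 1

  print(n_docs, "documents,", handler.n_elements, "elements")
  print(len(handler.tags), "distinct tags,", len(handler.leaf_paths), "distinct root-to-leaf paths")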
Topics
• 540 type (A) topics (nos. 289-828)
  • Previous and current Ad-Hoc topics taken from INEX 2006-2008
  • 308 topics have assessments
  • Topic titles in NEXI CO & CAS and XPath 2.0 Full-Text syntax
  • Full-text predicates: "", +, -
• 21 type (B) topics (nos. 829-849)
  • High-dimensional content with up to 112 keywords
  • Obtained from the 2006 Interactive Track feedback experiments by the Royal School of Library and Information Science, Denmark
  • Originally CO topics only, cast into CAS using //*[about(…)] (see the sketch below)
  • Mapped to the original Ad-Hoc topic ids, so the existing assessments can be reused
• 7 type (C) topics (nos. 850-856)
  • High-dimensional structure with multiple branches
  • Newly submitted by Efficiency Track participants
  • Assessments skipped due to their low expected impact on overall effectiveness results
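As a rough illustration of how a CO title can be cast into the other two syntaxes (the organizers' actual rewriting may differ in details; the function names below are ours, not part of the track software):

  def co_to_cas(co_title):
      # NEXI CAS: wrap all keywords into a single about() predicate on any element
      return '//*[about(., ' + co_title + ')]'

  def co_to_xpath_ft(co_title):
      # XPath 2.0 Full-Text: same unrestricted target element, using ftcontains
      return '//*[. ftcontains "' + co_title + '"]'

  print(co_to_cas('castle mound castles fortress'))
  # -> //*[about(., castle mound castles fortress)]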
Example Topic - Type (B)

<topic id="844" adhocid="517" type="B">
  <co_title>
    castle mound castles fortress defensive earthworks offensive herefordshire
    french fortification ditch hollingbury scrob type circular conquest defend
    siegecraft surrounded dry essential walls norman ages weapon skagerrak
    kattegat inseparably internees feature citadels halland bayeux connotes
    palisade zar segovia mota provide earth castel word motte richard middle
    twofold fitz moat inroads tapestry confessor shropshire country flattened
    article disambiguation examines perimeter include bordering angular denotes
    styled crest prehistoric discusses fortified maiden timber denote
    countryside occupy welsh summit hostile erected towers parish danish
    mainland siege depicted wait mechanism stronger restricted residence aspect
    familiar provinces knight subjects survive virtually medieval lay swedish
    estate enemies describes measure denmark structures architecture
    traditionally domestic techniques store permanent normally camp fort
  </co_title>
  <cas_title>
    //*[about(., castle mound castles fortress defensive earthworks offensive
    herefordshire french fortification ditch hollingbury scrob type circular
    conquest defend siegecraft surrounded dry essential walls norman ages
    weapon skagerrak kattegat inseparably internees feature citadels halland
    bayeux connotes palisade zar segovia mota provide earth castel word motte
    richard middle twofold fitz moat inroads tapestry confessor shropshire
    country flattened article disambiguation examines perimeter include
    bordering angular denotes styled crest prehistoric discusses fortified
    maiden timber denote countryside occupy welsh summit hostile erected towers
    parish danish mainland siege depicted wait mechanism stronger restricted
    residence aspect familiar provinces knight subjects survive virtually
    medieval lay swedish estate enemies describes measure denmark structures
    architecture traditionally domestic techniques store permanent normally
    camp fort)]
  </cas_title>
  <xpath_title>
    //*[. ftcontains "castle mound castles fortress defensive earthworks
    offensive herefordshire french fortification ditch hollingbury scrob type
    circular conquest defend siegecraft surrounded dry essential walls norman
    ages weapon skagerrak kattegat inseparably internees feature citadels
    halland bayeux connotes palisade zar segovia mota provide earth castel word
    motte richard middle twofold fitz moat inroads tapestry confessor
    shropshire country flattened article disambiguation examines perimeter
    include bordering angular denotes styled crest prehistoric discusses
    fortified maiden timber denote countryside occupy welsh summit hostile
    erected towers parish danish mainland siege depicted wait mechanism
    stronger restricted residence aspect familiar provinces knight subjects
    survive virtually medieval lay swedish estate enemies describes measure
    denmark structures architecture traditionally domestic techniques store
    permanent normally camp fort"
  </xpath_title>
</topic>

Conjunctive evaluation is not feasible for such topics! (see the sketch below)
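To see why conjunctive evaluation breaks down here, compare an OR-style with an AND-style evaluation over such a keyword set. The following is a simplified stand-in, not any participant's actual engine; postings and idf are hypothetical in-memory structures mapping terms to document-level term frequencies and term weights.

  from collections import defaultdict

  def score_disjunctive(query_terms, postings, idf):
      """OR semantics: any document matching a subset of the keywords receives
      a score (sum of simple tf*idf contributions), ranked in descending order."""
      scores = defaultdict(float)
      for term in query_terms:
          for doc_id, tf in postings.get(term, {}).items():
              scores[doc_id] += tf * idf.get(term, 0.0)
      return sorted(scores.items(), key=lambda item: -item[1])

  def match_conjunctive(query_terms, postings):
      """AND semantics: only documents containing *all* query terms survive."""
      doc_sets = [set(postings.get(term, {})) for term in query_terms]
      return set.intersection(*doc_sets) if doc_sets else set()

With 112 loosely related keywords, hardly any element contains all of them, so the conjunctive intersection is essentially empty and only the disjunctive interpretation returns useful results.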
Example Topic - Type (C)

<topic id="856" type="C">
  <co_title>
    State Parks Geology Geography +Canyon
  </co_title>
  <cas_title>
    //article//body[about(.//section//p, State Park) and
                    about(.//section//title, Geology) and
                    about(.//section//title, Geography)]
    //figure[about(.//caption, +Canyon)]
  </cas_title>
  <xpath_title>
    //article//body[.//section//p ftcontains "State Park" and
                    .//section//title ftcontains "Geology" and
                    .//section//title ftcontains "Geography"]
    //figure[.//caption ftcontains "Canyon"]
  </xpath_title>
  <description>
    I’m looking for state parks with sections describing their geology and/or
    geography, preferably with a figure of a canyon as target element.
  </description>
  <narrative>
    State park pages often follow the common pattern of having sections
    entitled "Geology" or "Geography". I’m particularly interested in those
    pages with a figure of a canyon, e.g., the Grand Canyon.
  </narrative>
</topic>
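The structural part of such a type (C) topic can be approximated with standard XPath 1.0 tooling, since few engines implement XPath 2.0 Full-Text. The sketch below uses lxml and replaces ftcontains with plain substring matching via contains(), so it ignores stemming, term weighting, and scoring; wiki_page.xml is a placeholder file name.

  from lxml import etree

  tree = etree.parse("wiki_page.xml")    # one Wikipedia article from the collection
  hits = tree.xpath(
      '//article//body[.//section//p[contains(., "State Park")]'
      ' and .//section//title[contains(., "Geology")]'
      ' and .//section//title[contains(., "Geography")]]'
      '//figure[.//caption[contains(., "Canyon")]]'
  )
  for fig in hits:
      print(tree.getpath(fig))           # element path of each matching target figure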
Sub-Tasks
• Article
  • Article-only runs, naturally overlap-free
  • In combination with CO queries, this resembles a classic IR setting with keyword queries and whole documents as results
• Thorough
  • Used at INEX 2003-2006, allows overlapping results
  • May be more efficient for some systems
• Focused
  • Current default mode at INEX, overlap-free at both passage and element level
  • Removing overlap may be an expensive post-processing step (see the sketch below)
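A minimal sketch of the overlap-removal step mentioned for the Focused task (not the official procedure of any system): results are assumed to be (file, element path, score) triples in rank order, with paths like /article[1]/body[1]/section[2].

  def remove_overlap(ranked_results):
      """Keep a result only if no ancestor or descendant element of the same
      file has already been kept at a better rank."""
      kept = []
      for file, path, score in ranked_results:
          overlapping = any(
              f == file and (p == path
                             or p.startswith(path + "/")
                             or path.startswith(p + "/"))
              for f, p, _ in kept
          )
          if not overlapping:
              kept.append((file, path, score))
      return kept

  # Example: the lower-ranked section is dropped because its ancestor is kept.
  run = [("12345.xml", "/article[1]", 2.1),
         ("12345.xml", "/article[1]/body[1]/section[2]", 1.7)]
  print(remove_overlap(run))

This naive variant is quadratic in the result-list length, which illustrates why overlap removal can dominate runtime for long result lists.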
Submissions

DTD for run submissions:

<!ELEMENT efficiency-submission (topic-fields, general_description, ranking_description, indexing_description, caching_description, topic+)>
<!ATTLIST efficiency-submission
  participant-id  CDATA #REQUIRED
  run-id          CDATA #REQUIRED
  task            (article|thorough|focused) #REQUIRED
  query           (automatic|manual) #REQUIRED
  sequential      (yes|no) #REQUIRED
  no_cpu          CDATA #IMPLIED
  ram             CDATA #IMPLIED
  no_nodes        CDATA #IMPLIED
  hardware_cost   CDATA #IMPLIED
  hardware_year   CDATA #IMPLIED
  topk            (15|150|1500) #IMPLIED >

<!ELEMENT topic-fields EMPTY>
<!ATTLIST topic-fields
  co_title        (yes|no) #REQUIRED
  cas_title       (yes|no) #REQUIRED
  xpath_title     (yes|no) #REQUIRED
  text_predicates (yes|no) #REQUIRED
  description     (yes|no) #REQUIRED
  narrative       (yes|no) #REQUIRED >

<!ELEMENT general_description (#PCDATA)>
<!ELEMENT ranking_description (#PCDATA)>
<!ELEMENT indexing_description (#PCDATA)>
<!ELEMENT caching_description (#PCDATA)>

<!ELEMENT topic (result*)>
<!ATTLIST topic
  topic-id        CDATA #REQUIRED
  total_time_ms   CDATA #REQUIRED
  cpu_time_ms     CDATA #IMPLIED
  io_time_ms      CDATA #IMPLIED >

<!ELEMENT result (file, path, rank?, rsv?)>
<!ELEMENT file (#PCDATA)>
<!ELEMENT path (#PCDATA)>
<!ELEMENT rank (#PCDATA)>
<!ELEMENT rsv (#PCDATA)>
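For illustration, a hypothetical run submission that validates against this DTD could look as follows (all identifiers, timings, descriptions, and hardware figures are made-up placeholders, not an actual submitted run):

<efficiency-submission participant-id="10" run-id="example-run"
                       task="focused" query="automatic" sequential="yes"
                       no_cpu="4" ram="16" topk="1500">
  <topic-fields co_title="yes" cas_title="no" xpath_title="no"
                text_predicates="no" description="no" narrative="no"/>
  <general_description>Example system description.</general_description>
  <ranking_description>BM25-style element ranking.</ranking_description>
  <indexing_description>Inverted element index.</indexing_description>
  <caching_description>No caching.</caching_description>
  <topic topic-id="289" total_time_ms="42" cpu_time_ms="30" io_time_ms="12">
    <result>
      <file>12345.xml</file>
      <path>/article[1]/body[1]/section[2]</path>
      <rank>1</rank>
      <rsv>3.14</rsv>
    </result>
  </topic>
</efficiency-submission>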
Metrics
• Interpolated Precision (iP) and Mean Average Interpolated Precision (MAiP) for the Focused & Article modes (illustrated below)
  • New INEX 2008 evaluation software (passage-based)
  • Qrels from 2006-2007 transformed into the 2008 format
• Classic precision/recall plots for the Thorough mode
  • INEX EvalJ 2006-2007 (element-based)
  • Qrels from 2008 transformed back into the 2006 format (made available for download on the track homepage)
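As a rough illustration of the Focused/Article measures: the official 2008 software works on highlighted passages and graded recall, so the binary, element-level version below is only an approximation, not the track's evaluation code.

  def interpolated_precision(ranked_rel, num_relevant):
      """ranked_rel: 0/1 relevance flags in rank order; num_relevant: total
      number of relevant elements for the topic (assumed > 0)."""
      recall_points = [i / 100.0 for i in range(101)]    # 0.00, 0.01, ..., 1.00
      hits, pr = 0, []
      for rank, rel in enumerate(ranked_rel, start=1):
          hits += rel
          pr.append((hits / num_relevant, hits / rank))  # (recall, precision) per rank
      ip = []
      for r in recall_points:
          reachable = [p for rec, p in pr if rec >= r]
          ip.append(max(reachable) if reachable else 0.0)
      return ip    # ip[1] corresponds to iP[0.01], ip[10] to iP[0.10], etc.

  def average_interpolated_precision(ranked_rel, num_relevant):
      ip = interpolated_precision(ranked_rel, num_relevant)
      return sum(ip) / len(ip)   # MAiP is the mean of this value over all topics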
Participants & Runs
• Max-Planck-Institut für Informatik [10], 8 runs
• University of Frankfurt [16], 5 runs
• University of Toronto [42], 2 runs
• University of Twente & CWI [53], 4 runs
• JustSystems Corporation [56], 1 run
Results Overview
• General setting (parameters taken from the submission headers)
Results Overview
• Effectiveness (iP, MAiP) vs. efficiency (wallclock runtime)
Conclusions
• Continue in 2009 with more new topics & subtasks
• Establish the track as a reference benchmark for XML-IR experiments for a broad DB&IR audience
• Make it available to non-INEX participants as well?