The Overlap Problem in Content-Oriented XML Retrieval Evaluation

Gabriella Kazai (1), Mounia Lalmas (1), Arjen de Vries (2) • (1) Queen Mary University of London, UK • (2) CWI, The Netherlands

Outline • What is overlap and why is it a problem • The INEX test collection • Problems with current INEX metrics • Proposed metrics • Conclusions and future work

Overlap in XML Retrieval • Overlapping (nested) result elements in output list • Overlapping (nested) reference elements in recall-base [Figure: an article tree (article, author, title, sec, subsec, p) shown with its assessments next to a ranked result list of p and sec elements]

Initiative for the Evaluation of XML retrieval (INEX) • Evaluates the effectiveness of content-oriented XML retrieval • Ad-hoc retrieval task • Content-only (CO) topics: no structural hints, so the XML engine has to identify the most appropriate level of granularity

INEX Evaluation Criterion • Two relevance dimensions! • Exhaustivity (E): how exhaustively a component discusses the topic of request • Specificity (S): how focused the component is on the topic of request (i.e. discusses no other, irrelevant topics) • Multiple grades! • highly (3), fairly (2), marginally (1), not (0) exhaustive/specific • Assessments as (e,s) pairs, e.g. (3,1), (3,2), (1,3), (2,3)

INEX Test Collection • Documents • 12,107 articles of IEEE CS 1995-2002, 8.2 million XML elements • Topics • 31 CO topics • Relevance assessments • Propagation effect of Exhaustivity! • ~26,000 relevant elements on ~14,000 relevant paths • Propagated assessments: ~45% • Increase in size of recall-base: ~182%

Current INEX Metrics • inex-2002 and inex-2003 • Based on recall/precision • Quantisation functions (e.g., generalised) • inex-2003 penalises overlap of results • Reduced score for components seen in full or in part
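A quantisation function maps an (e,s) assessment pair to a single relevance score. The slide does not reproduce the function definitions, so the sketch below is an illustration only: the strict function rewards just (3,3), and the grade table used for the generalised function is an assumption recalled from the INEX 2002 documentation, not something stated in this deck. The function names are hypothetical.

```python
def strict_quantisation(e: int, s: int) -> float:
    """Strict: only highly exhaustive AND highly specific elements count."""
    return 1.0 if (e, s) == (3, 3) else 0.0


# ASSUMPTION: grade table for the generalised function. The slide omits
# the definition; these values are not taken from the deck itself.
_GENERALISED_GRADES = {
    (3, 3): 1.0,
    (2, 3): 0.75, (3, 2): 0.75, (3, 1): 0.75,
    (1, 3): 0.5, (2, 2): 0.5, (2, 1): 0.5,
    (1, 1): 0.25, (1, 2): 0.25,
}


def generalised_quantisation(e: int, s: int) -> float:
    """Generalised: partial credit for less than perfect (e,s) pairs."""
    # Any pair not in the table, e.g. (0,0), scores zero.
    return _GENERALISED_GRADES.get((e, s), 0.0)


if __name__ == "__main__":
    for pair in [(3, 3), (3, 1), (1, 3), (0, 0)]:
        print(pair, strict_quantisation(*pair), generalised_quantisation(*pair))
```

Under the strict function only (3,3) elements enter the recall-base; under the generalised function every element with a non-zero grade does, which is why overlap of reference elements matters most there.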

Problem with Current INEX Metrics • Both metrics ignore overlap of reference elements! • 100% recall only if all reference elements returned including overlapping elements (contradicts task!) • Extent of problem: evaluation of an ideal run [Figure: recall/precision curves of an ideal run under the inex_2002 and inex_2003 metrics; panels: strict quantisation, generalised quantisation] • Precision is plotted against lower recall values than merited according to the task definition!

Proposed Metrics • Metrics not directly dependent on size of recall-base • Separation of ideal results vs. near misses • Metrics independent of user model • Extended Cumulated Gain (CG) based metrics • Relevance-value functions • Ideal Recall-base

Ideal Recall-base and Run • Ideal recall-base • Ideal results should be retrieved; near misses could be retrieved, but systems should not be penalised for not retrieving them • Derived based on user preferences • Ideal run • Ordering elements of the ideal recall-base by relevance score [Figure: nested elements assessed (3,1), (3,2), (3,3), (1,2), (1,3)]

Relevance-Value (RV) Functions • Model user behaviour • Result-list independent • Based only on (e,s) value pairs (~quantisation functions) • Result-list dependent • Considers overlap of result elements (~inex-2003) [Formula not recovered; its parameters are the ranked result list and a value reflecting the user's tolerance to redundant component parts]

Cumulated Gain • Gain vector (G) from ranked document list • Ideal gain vector (I) from documents in recall-base • Cumulated gain (CG) • Plot CGG of actual run against CGI of ideal ranking • nCGG = CGG / CGI • Example: L = <d4,d5,d2,d3,d1>, G = <3,0,1,3,2>, I = <3,3,2,1,0>, CGG = <3,3,4,7,9>, CGI = <3,6,8,9,9>
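The worked example on this slide can be reproduced in a few lines. This is a minimal sketch, not the authors' evaluation code; the function names are hypothetical, and the handling of a zero ideal value is an assumption added to avoid division by zero.

```python
from itertools import accumulate


def cumulated_gain(gains):
    """Running sum of a gain vector: CG[i] = gains[0] + ... + gains[i]."""
    return list(accumulate(gains))


def normalised_cumulated_gain(actual, ideal):
    """Rank-by-rank ratio nCG[i] = CG_G[i] / CG_I[i]."""
    cg_actual = cumulated_gain(actual)
    cg_ideal = cumulated_gain(ideal)
    # ASSUMPTION: a rank where the ideal cumulated gain is 0 scores 0.
    return [g / i if i else 0.0 for g, i in zip(cg_actual, cg_ideal)]


if __name__ == "__main__":
    G = [3, 0, 1, 3, 2]  # gains of the run L = <d4,d5,d2,d3,d1>
    I = [3, 3, 2, 1, 0]  # the same gains in ideal (descending) order
    print(cumulated_gain(G))  # [3, 3, 4, 7, 9]
    print(cumulated_gain(I))  # [3, 6, 8, 9, 9]
    print(normalised_cumulated_gain(G, I))
```

The normalised curve starts at 1 (the run puts a grade-3 document first), drops to 0.5 at ranks 2 and 3, and returns to 1 at rank 5 once every document in the recall-base has been returned.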

Cumulated Gain for XML • Ideal gain vector: I[i] = r(c_i), with r(c_i) taken from the ideal recall-base • Actual gain vector: G[i] = r(c_i), with r(c_i) taken from the full recall-base! [Figure: recall-base and ranked result list]

Cumulated Gain for XML • Multiple relevance dimensions • Overlap of result elements: result-list dependent RV function • Overlap of reference elements: I derived from ideal recall-base • Retrieval of ideal results is rewarded; near misses can be rewarded with a partial score, but systems are not penalised for not retrieving near misses!

Cumulated Gain for XML • However, consequences of ideal recall-base in CG • | G | > | I | • Max(CGG) > Max(CGI) • Example, for a (3,3) element containing a (3,1) element: G = <1,0.75,…>, I = <1> • Fix 1: extend ideal gain vector with irrelevant elements, I = <1,0,...> • Fix 2: force CGG to level after reaching Max(CGI)
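Both adjustments named on this slide can be sketched together: pad the ideal gain vector with zero-gain entries when the run returns more elements than the ideal recall-base contains, and cap the run's cumulated gain at the ideal maximum. The function name is hypothetical and the capping is implemented with a simple min(), which is one plausible reading of "force CGG to level", not necessarily the authors' exact procedure.

```python
from itertools import accumulate


def xml_cumulated_gain(actual, ideal):
    """Return (CG_G, CG_I) with both slide adjustments applied.

    1. Extend the ideal gain vector with irrelevant (zero-gain) elements
       so it is at least as long as the actual gain vector.
    2. Level off the actual cumulated gain once it reaches Max(CG_I).
    """
    padding = [0.0] * max(0, len(actual) - len(ideal))
    cg_ideal = list(accumulate(list(ideal) + padding))
    ceiling = max(cg_ideal, default=0.0)
    # ASSUMPTION: "force CGG to level" is modelled as clipping at the ceiling.
    cg_actual = [min(cg, ceiling) for cg in accumulate(actual)]
    return cg_actual, cg_ideal


if __name__ == "__main__":
    G = [1, 0.75]  # run returns the ideal element, then a near miss
    I = [1]        # the ideal recall-base holds only the ideal element
    cg_g, cg_i = xml_cumulated_gain(G, I)
    print(cg_g, cg_i)  # both curves level off at 1
```

Without the cap, the near miss would push the run's cumulated gain to 1.75, above anything the ideal run can reach; with it, the near miss can no longer lift a run above the ideal curve.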

Conclusions • Unsolved issues with recall/precision due to overlap of reference elements in recall-base • XML-CG with ideal recall-base provides a solution for overlap of result and reference elements • Still possible to reward partial success without the side-effect • “Plug-in” user models: RV function used as parameter of metrics • Limitation: Max(CGG) = Max(CGI)

Future Work • Metric to be used in INEX 2004 • Evaluation of metric: stability testing • RV functions based on user models in INEX 2004 Interactive track • General problem of overlap of result elements when no predefined unit of retrieval exists

Thank you

inex-2002 metric • Precall [Raghavan, Bollmann & Jung 1989] • Quantisation functions • Strict • Generalised • Does NOT consider overlap of result elements nor overlap of reference elements!

inex-2003 metric • E,S in ideal concept space [Gövert, Kazai, Fuhr & Lalmas 2003] • Quantisation functions • Strict • Generalised • Does NOT consider overlap of reference elements!
