Evaluation issues in anaphora resolution and beyond
Ruslan Mitkov, University of Wolverhampton
Faro, 27 June 2002
Evaluation • Evaluation is a driving force for every NLP task/approach/application • Evaluation is indicative of the performance of a specific approach/application and, no less importantly, shows where it stands in comparison with other approaches/applications • Growing research in evaluation, inspired by the availability of annotated corpora
Major impediments to fulfilling evaluation’s mission • Different approaches evaluated on different data • Different approaches evaluated in different modes • Results not independently confirmed • As a result, no comparison or objective evaluation possible
Anaphora resolution vs. coreference resolution • Anaphora resolution is concerned with tracking down the antecedent of an anaphor • Coreference resolution seeks to identify all coreference classes (chains)
Anaphora resolution • For nominal anaphora which involves coreference, it would be logical to regard each of the preceding noun phrases that are coreferential with the anaphor(s) as a legitimate antecedent: Computational Linguists from many different countries attended PorTAL. The participants enjoyed the presentations; they also took an active part in the discussions.
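To make the chain view concrete, here is a minimal Python sketch (purely illustrative, not part of the original slides) representing the example above as a coreference chain, where every preceding mention in the chain counts as a legitimate antecedent of a given anaphor:

```python
# The coreference chain from the PorTAL example, in textual order.
chain = [
    "Computational Linguists from many different countries",  # first mention
    "The participants",                                        # anaphoric NP
    "they",                                                    # pronominal anaphor
]

def legitimate_antecedents(chain, anaphor_index):
    """Every mention preceding the anaphor in the same chain
    is a legitimate antecedent."""
    return chain[:anaphor_index]

print(legitimate_antecedents(chain, 2))
# ['Computational Linguists from many different countries', 'The participants']
```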
Evaluation in anaphora resolution Two perspectives: • Evaluation of anaphora resolution algorithms • Evaluation of anaphora resolution systems
Recall and Precision • MUC introduced the measures recall and precision for coreference resolution. • These measures, as defined, are not satisfactory in terms of clarity and coverage (Mitkov 2001).
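For reference, the MUC scorer (Vilain et al. 1995) that this criticism targets defines recall model-theoretically over the key chains; a sketch of the standard formulation (added here for context, not taken from the slides):

$$R \;=\; \frac{\sum_i \bigl(|S_i| - |p(S_i)|\bigr)}{\sum_i \bigl(|S_i| - 1\bigr)}$$

where $S_i$ are the coreference chains in the key and $p(S_i)$ is the partition of $S_i$ induced by the response; precision is obtained by swapping the roles of key and response.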
Evaluation package for anaphora resolution algorithms (Mitkov 1998; 2000) The package comprises (i) performance measures, (ii) comparative evaluation tasks and (iii) component measures.
Performance measures • Success rate • Critical success rate Critical success rate applies only to those ‘tough’ anaphors which still have more than one candidate for antecedent after the gender and number filters have been applied
Example • Evaluation data: 100 anaphors • Number of anaphors correctly resolved: 80 • Number of anaphors resolved by the gender and number filters alone (only one candidate left): 30 Success rate: 80/100 = 80%; critical success rate: 50/70 = 71.4%
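A minimal sketch of how these two figures could be computed, assuming (for illustration) that each anaphor is recorded with whether it was correctly resolved and whether it remained ‘tough’ after the gender and number filters:

```python
def success_rates(anaphors):
    """anaphors: list of dicts with keys
       'correct' -- True if the anaphor was correctly resolved
       'tough'   -- True if >1 candidate remained after the
                    gender and number filters
    Returns (success rate, critical success rate)."""
    total = len(anaphors)
    correct = sum(a["correct"] for a in anaphors)
    tough = [a for a in anaphors if a["tough"]]
    tough_correct = sum(a["correct"] for a in tough)
    success = correct / total
    critical = tough_correct / len(tough) if tough else float("nan")
    return success, critical

# The example above: 100 anaphors, 80 correct, 30 resolved by the
# filters alone (all of them correct), hence 70 tough anaphors, 50 correct.
data = ([{"correct": True, "tough": False}] * 30 +
        [{"correct": True, "tough": True}] * 50 +
        [{"correct": False, "tough": True}] * 20)
print(success_rates(data))  # (0.8, 0.7142857142857143)
```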
Comparative evaluation tasks • Evaluation against baseline models • Comparison to similar approaches • Comparison with well-established approaches Approaches frequently used for comparison: Hobbs (1978), Brennan et al. (1987), Lappin and Leass (1994), Kennedy and Boguraev (1996), Baldwin (1997), Mitkov (1996; 1998)
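As an illustration of the kind of baseline model an algorithm might be compared against, here is a minimal sketch of a ‘most recent agreeing NP’ baseline (the data structures and names are hypothetical and not taken from any of the cited approaches):

```python
from collections import namedtuple

NP = namedtuple("NP", "text gender number")

def baseline_most_recent(anaphor, candidates):
    """Baseline: choose the most recent preceding NP that agrees
    with the anaphor in gender and number."""
    for cand in reversed(candidates):  # candidates in textual order
        if (cand.gender, cand.number) == (anaphor.gender, anaphor.number):
            return cand
    return None

candidates = [NP("the participants", "neutral", "plural"),
              NP("the presentations", "neutral", "plural"),
              NP("the discussion", "neutral", "singular")]
print(baseline_most_recent(NP("they", "neutral", "plural"), candidates).text)
# -> 'the presentations': the most recent agreeing candidate,
#    which is not necessarily the correct antecedent.
```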
Component measures • Relative importance • Decision power (Mitkov 2001)
Evaluation measures for anaphora resolution systems • Success rate • Critical success rate • Resolution etiquette (Mitkov et al. 2002)
Reliability of evaluation results Evaluation results can be regarded as reliable if the evaluation • covers all naturally occurring texts, or • employs appropriate sampling procedures
Relative vs. absolute results • Results may be relative with regard to a specific evaluation set or to another approach • More “absolute” figures could be obtained if there existed a measure which quantified the complexity of the anaphors to be resolved
Measures quantifying complexity in anaphora resolution Measures for complexity (Mitkov 2001): • Knowledge required for resolution • Distance between anaphor and antecedent (in NPs, clauses, sentences) • Number of competing candidates
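A minimal sketch of how the last two measures might be computed over an evaluation set, assuming each anaphor is annotated with the distance to its antecedent and with its candidate set (the field names are illustrative):

```python
def complexity_profile(anaphors):
    """anaphors: list of dicts with (illustrative) keys
       'sent_distance' -- sentences between anaphor and antecedent
       'candidates'    -- number of competing candidates
    Returns the average distance and average number of candidates."""
    n = len(anaphors)
    avg_distance = sum(a["sent_distance"] for a in anaphors) / n
    avg_candidates = sum(a["candidates"] for a in anaphors) / n
    return avg_distance, avg_candidates
```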
Fair evaluation Algorithms should be evaluated on the basis of the same • Evaluation data • Pre-processing tools
Evaluation workbench Evaluation workbench for anaphora resolution (Mitkov 2000; Barbu and Mitkov 2001) • Allows the comparison of approaches sharing common principles or similar pre-processing • Enables the ‘plugging in’ and testing of different anaphora resolution algorithms All algorithms implemented operate in a fully automatic mode
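A minimal sketch of what such a ‘plug-in’ arrangement might look like (hypothetical class and function names, not the actual workbench API): every algorithm implements the same interface and receives the same pre-processed input, so the resulting scores are directly comparable:

```python
from abc import ABC, abstractmethod

class AnaphoraResolver(ABC):
    """Interface each plugged-in algorithm implements. All resolvers
    receive the same pre-processed document, so differences in the
    scores reflect the algorithms rather than the pre-processing."""

    @abstractmethod
    def resolve(self, document):
        """Return a mapping anaphor_id -> chosen antecedent_id."""

def compare(resolvers, documents, gold):
    """Run every algorithm on the same data and report its success rate.
    'gold' maps document ids to {anaphor_id: correct antecedent_id}."""
    for resolver in resolvers:
        correct = total = 0
        for doc in documents:  # documents are pre-processed once, shared by all
            for anaphor_id, antecedent_id in resolver.resolve(doc).items():
                total += 1
                correct += (gold[doc.id][anaphor_id] == antecedent_id)
        print(type(resolver).__name__, correct / total if total else 0.0)
```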
The need for annotated corpora Annotated corpora are vital for training and evaluation Annotation should cover anaphoric or coreferential chains, not just anaphor-antecedent pairs
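For illustration, a MUC-style markup of the earlier example, encoding the whole chain through ID/REF links rather than a single anaphor-antecedent pair (the attribute inventory here only loosely follows the MUC-7 coreference task definition and is meant as a sketch):

```
<COREF ID="1">Computational Linguists from many different countries</COREF> attended PorTAL.
<COREF ID="2" TYPE="IDENT" REF="1">The participants</COREF> enjoyed the presentations;
<COREF ID="3" TYPE="IDENT" REF="2">they</COREF> also took an active part in the discussions.
```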
Scarce commodity • Lancaster Anaphoric Treebank (100 000 words) • MUC coreference task annotated data (65 000 words) • Part of the Penn Treebank (90 000 words)
Additional issues • Annotation scheme • Annotating tools • Annotation strategy Inter-annotator (dis)agreement is a major issue!
The Wolverhampton coreference annotation project A 500 000-word corpus annotated for anaphoric and coreferential links (identity-of-reference direct nominal anaphora) Less ambitious in terms of coverage, but much more consistent
Watch out for the traps! • Are all annotated data reliable? • Are all original documents reliable? • Are all results reported “honest”?
Morale and motivation important! If I may offer you my advice... • Do not despair if your first evaluation results are not as high as you wanted them to be • Be prepared to provide considerable input in exchange for a minor performance improvement • Work hard • Be transparent ... and you'll get there!
Anaphora resolution projects Ruslan Mitkov’s home page http://www.wlv.ac.uk/~le1825 Research Group in Computational Linguistics http://clg.wlv.ac.uk