Task 1: Intrinsic Evaluation
Vasile Rus, Wei Chen, Pascal Kuyten, Ron Artstein
Task definition
• Only interested in info-seeking questions
• Evaluation biased towards current technology
• Asking for the “trigger” text is problematic:
  • Future QG systems may not employ a trigger
  • Trigger less important for deep/holistic questions
• Need to define what counts as QG:
  • Would mining for questions be acceptable?
  • Require a generative component? (defined how?)
  • Internal representation? Structure?
Evaluation criteria
• Evaluate the question alone, or question + answer?
  • System provides the question
  • Evaluator decides if an answer is available
  • Separately, evaluate the system’s answer if given
• Answer = contiguous text?
  • Can this be relaxed?
• Additional criteria: conciseness?
Annotation guidelines
• Question type: needs a more detailed definition
• Yao et al. (submitted):
  • The “what” category includes (what|which) (NP|PP)
  • Question type identified mechanically with ad-hoc rules (a sketch follows below)
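A minimal sketch of what such mechanical, rule-based question typing can look like (Python; the patterns below are illustrative assumptions, not the actual rules of Yao et al., and a fuller version would verify that what/which is followed by an NP or PP, which requires a parser or chunker):

```python
import re

# Hypothetical rules in the spirit of the ad-hoc typing described above.
# The patterns are illustrative assumptions, not the rules from Yao et al.
RULES = [
    # "what" covers which-questions too, per the category definition above.
    ("what", re.compile(r"^(what|which)\b", re.I)),
    ("who", re.compile(r"^(who|whom|whose)\b", re.I)),
    ("when", re.compile(r"^when\b", re.I)),
    ("where", re.compile(r"^where\b", re.I)),
    ("why", re.compile(r"^why\b", re.I)),
    ("how", re.compile(r"^how\b", re.I)),
    ("yes/no", re.compile(
        r"^(is|are|was|were|do|does|did|can|could|will|would|has|have|had)\b",
        re.I)),
]

def question_type(question: str) -> str:
    """Return the first matching coarse type, or 'other'. In-situ questions
    (e.g. 'The codes are not what?') start with no wh-word, so an anchored
    first-word match lets them fall through -- one reason such rules need
    backing from explicit guidelines."""
    q = question.strip()
    for label, pattern in RULES:
        if pattern.match(q):
            return label
    return "other"

print(question_type("Which river flows through Memphis?"))  # -> what
print(question_type("The codes are not what?"))             # -> other
```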
Terminology
• For the QG from sentences task:
  • “Ambiguity” is really specificity or concreteness
  • “Relevance” is really answerability
Rating disagreements
• Many (most?) of the disagreements are between close ratings (e.g. 3 vs. 4)
  • Need a measure that takes magnitudes into account, such as Krippendorff’s α
  • Perhaps normalize ratings by rater? (see the sketch after this list)
• Specific disagreement on in-situ questions (e.g. “The codes are not what?”)
  • Needs to be addressed in the guidelines
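A minimal sketch of both ideas, assuming two raters who each rated every item on an interval scale; the function names and sample ratings are illustrative, not from any existing toolkit:

```python
from statistics import mean, pvariance, pstdev

def interval_alpha(r1, r2):
    """Krippendorff's alpha with the interval (squared-difference) metric,
    for two raters and no missing data: alpha = 1 - D_o / D_e, where D_o is
    the mean squared difference within items and D_e is the mean squared
    difference over all pairs of pooled ratings."""
    assert len(r1) == len(r2), "both raters must rate every item"
    pooled = list(r1) + list(r2)
    n = len(pooled)
    d_o = mean((a - b) ** 2 for a, b in zip(r1, r2))  # observed disagreement
    d_e = 2 * n / (n - 1) * pvariance(pooled)         # expected disagreement
    return 1 - d_o / d_e

def z_normalize(ratings):
    """Per-rater z-scores: one way to factor out systematically harsh or
    lenient raters before measuring agreement (assumes nonconstant ratings)."""
    mu, sigma = mean(ratings), pstdev(ratings)
    return [(r - mu) / sigma for r in ratings]

r1 = [3, 4, 2, 5, 4]   # illustrative 1-5 ratings from two raters
r2 = [4, 4, 3, 5, 3]
print(round(interval_alpha(r1, r2), 3))                            # 0.667
print(round(interval_alpha(z_normalize(r1), z_normalize(r2)), 3))  # after rater normalization
```

Because the interval metric squares the distance, a 3-vs-4 disagreement contributes 1 while a 1-vs-5 disagreement contributes 16, so the frequent close disagreements noted above depress α far less than a nominal agreement measure would suggest.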
New tasks
• Replace QG from sentences with QG from metadata
  • Evaluates only the generation component
  • Finding things to ask remains a component of the QG from paragraphs task
• Make all system results public for analysis
  • Required? Voluntary?
  • Use the data to learn from others’ problems