Question Generation: Proposed Challenge Tasks and Their Evaluation

Rodney D. Nielsen • Boulder Language Technologies, Boulder, CO • Center for Computational Language and Education Research, CU, Boulder

The Nature of Automatic QG • Application Dependent • Educational Assessment • Evaluate • Socratic Tutoring • Guide • Etc. • Gather information

Defining the QG Tasks • QG can be viewed as a 3-step process • Concept Selection • Question Type Determination • Question Construction

Key Concept Identification • Givens: • The full text document • The application track • Objective: • Identify key spans of text for which questions are likely to be generated.

Question Type Determination • Givens: • Source text snippets • The full text • The application track • Objective: • Identify the most likely types of questions to be generated

Question Construction • Application independent • Givens: • Source text snippets • A question type • The full text • Objective: • Construct a natural language question
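
The three task definitions above can be read as three function signatures chained together. The sketch below is only an illustration of that reading: the `Snippet` class, the type aliases, and `generate_questions` are hypothetical names, not part of the proposed challenge. It assumes that each step is a plain function and that the application track is a string. Following the slide, the construction step does not receive the track, since it is described as application independent.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Snippet:
    """A key span of the source text (hypothetical container)."""
    start: int
    end: int
    text: str


# Step 1: (full text, application track) -> key snippets
ConceptSelector = Callable[[str, str], List[Snippet]]
# Step 2: (snippet, full text, application track) -> likely question types
TypeDeterminer = Callable[[Snippet, str, str], List[str]]
# Step 3: (snippet, question type, full text) -> natural language question
QuestionConstructor = Callable[[Snippet, str, str], str]


def generate_questions(
    document: str,
    track: str,
    select: ConceptSelector,
    determine: TypeDeterminer,
    construct: QuestionConstructor,
) -> List[str]:
    """Chain the three steps; each step can be swapped or evaluated alone."""
    questions = []
    for snippet in select(document, track):
        for question_type in determine(snippet, document, track):
            questions.append(construct(snippet, question_type, document))
    return questions
```

Keeping the steps as separate callables mirrors the idea that each task has its own givens and can be evaluated in isolation.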

Evaluating Key Concept Identification • K experts annotate a set of documents • Tag spans of text regarding key concepts • Adjudicate and tag as vital or optional • Instance Recall for each vital snippet • Instance Precision based on all snippets • F-measure • Fully Automatic
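
A minimal sketch of how the instance-level scores above might be computed. The slide does not say how a system span is matched to an annotated snippet, so this sketch assumes spans are character offsets and that any overlap counts as a match; `concept_scores` and `overlaps` are hypothetical names. Recall is taken over vital snippets only, precision over all annotated snippets (vital and optional).

```python
from typing import List, Tuple

Span = Tuple[int, int]  # (start, end) character offsets, end exclusive


def overlaps(a: Span, b: Span) -> bool:
    """Assumed matching criterion: the two spans share at least one character."""
    return a[0] < b[1] and b[0] < a[1]


def concept_scores(
    system: List[Span], vital: List[Span], optional: List[Span]
) -> Tuple[float, float, float]:
    """Return (precision, recall, F-measure) for one document."""
    # Recall: proportion of vital snippets found by some system span.
    found = sum(any(overlaps(v, s) for s in system) for v in vital)
    recall = found / len(vital) if vital else 0.0

    # Precision: proportion of system spans matching any annotated snippet.
    gold = vital + optional
    correct = sum(any(overlaps(s, g) for g in gold) for s in system)
    precision = correct / len(system) if system else 0.0

    denom = precision + recall
    f_measure = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f_measure
```

Because optional snippets count toward precision but not recall, a system is not penalized for skipping them and not penalized for finding them.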

Evaluating Question Construction • Compare system question to K expert questions (similar to machine translation (MT) and automatic summarization (AS) evaluation) • Average question F-measure based on facet entailment • Use most similar expert question • Recall: proportion of facets in the expert question entailed by the system question • Precision: proportion of facets in the system question entailed by the expert question
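
The facet-based question F-measure above can be sketched as follows. Two simplifications are assumed here and are not from the slides: a facet is modeled as a (relation, governor, modifier) triple, and "entailed" is approximated by exact triple match, whereas a real evaluation would use a facet entailment classifier. "Most similar expert question" is read as the expert question that yields the highest F-measure. All function names are hypothetical.

```python
from typing import List, Set, Tuple

Facet = Tuple[str, str, str]  # (relation, governor, modifier), assumed form


def facet_prf(system: Set[Facet], expert: Set[Facet]) -> Tuple[float, float, float]:
    """Precision, recall and F-measure of one system question vs one expert question."""
    shared = len(system & expert)  # exact match stands in for entailment
    precision = shared / len(system) if system else 0.0
    recall = shared / len(expert) if expert else 0.0
    denom = precision + recall
    f_measure = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f_measure


def question_f(system: Set[Facet], experts: List[Set[Facet]]) -> float:
    """Score against the most similar (highest-F) expert question."""
    return max((facet_prf(system, e)[2] for e in experts), default=0.0)


def average_question_f(pairs: List[Tuple[Set[Facet], List[Set[Facet]]]]) -> float:
    """Average question F-measure over (system question, expert questions) pairs."""
    scores = [question_f(system, experts) for system, experts in pairs]
    return sum(scores) / len(scores) if scores else 0.0
```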

Facet Representation • Example sentence: "The brass ring would not stick to the nail because the ring is not iron." • Original Dependency Parse [figure: arcs over the sentence labeled det, vc, vmod, sbar, prd, nmod, sub, pmod] • Final Semantic Representation [figure: facets over the sentence labeled theme_not, cause_because, be_prd_not, nmod, destination_to_not]

Evaluating Question Construction • Prior work • Analysis of n-gram size effects (Soricut and Brill, 2004) • Dependency-based evaluation metrics (Owczarzak et al., 2007) • F-measure in similar evaluations (Turian et al., 2003) • N-gram inadequacy in entailment (Perez and Alfonseca, 2005) • Macro-average over nuggets (Lin and Demner-Fushman, 2005) • Facet entailment results (Nielsen et al., 2008)

Summary • QG can be viewed as a 3-step process • Concept Selection • Question Type Determination • Question Construction • The ultimate goal should be highly context-specific question generation • E.g., incorporating a learner model with the learner's goals and a history of interactions

Thanks! • Thanks to Wayne Ward, Steve Bethard, James Martin, Martha Palmer, Philipp Wetzler, the CU Computational Semantics Group and the anonymous reviewers for helpful feedback. • This work was partially funded by Award Numbers: • NSF 0551723, • IES R305B070434, and • NSF DRL-0733323.

Evaluating Question Construction • A Unified Framework for Automatic Evaluation using N-gram Co-Occurrence Statistics (Soricut and Brill, 2004) • MT: 4-grams to ensure fluency • AS: unigrams; little syntactic construction • QG: bigram-level; uses question stems and extraction of key phrases, but more syntactic composition than typical AS
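
For comparison with the facet-based measure, a bigram co-occurrence score between a system question and a reference question could look like the sketch below. This is a generic clipped n-gram overlap F-measure, not the exact formulation of Soricut and Brill (2004); whitespace tokenization, lowercasing, and the function names are assumptions made for illustration.

```python
from collections import Counter
from typing import List


def ngrams(tokens: List[str], n: int) -> Counter:
    """Multiset of n-grams in a token sequence."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def ngram_f(system: str, reference: str, n: int = 2) -> float:
    """F-measure over clipped n-gram matches (n=2 for the bigram level)."""
    sys_counts = ngrams(system.lower().split(), n)
    ref_counts = ngrams(reference.lower().split(), n)
    # Counter intersection keeps the minimum count of each shared n-gram.
    overlap = sum((sys_counts & ref_counts).values())
    precision = overlap / sum(sys_counts.values()) if sys_counts else 0.0
    recall = overlap / sum(ref_counts.values()) if ref_counts else 0.0
    denom = precision + recall
    return 2 * precision * recall / denom if denom else 0.0
```

Setting `n=4` or `n=1` gives the MT-style and AS-style variants mentioned on the slide.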
