The Question Generation Task Vasile Rus, Zhiqiang Cai, and Art Graesser
Outline • Shared task for NLG? • Why is question generation important? • Landscape of example questions • Definition of Question Generation • Subtasks • Evaluation Methodologies • Black-box vs. Glass-box • Manual vs. Automatic • Data sets
NLG: Shared Task(s) or Not? • Pros • Define common evaluation metrics • Compare approaches on the chosen task • Monitor progress on the task • Pool the community-wide effort needed to build resources and infrastructure • Bring the community together • Increase the visibility of NLG • Cons • Too much effort may be spent on the chosen task • May overshadow other basic research efforts
What Shared Task(s)? • Principle: given the inherent difficulty of language generation, choose a (relatively) simple task • Question Answering has avoided deep questions • Summarization focuses on extractive summaries • Textual Entailment = text understanding? • Full-fledged NLU evaluation?
Why is Question Generation Important? • Help systems and FAQ facilities need example questions to model for users • Information retrieval systems need to suggest revised queries • There is a need for automated systems that proactively ask and answer questions • Intelligent tutoring systems need automated hints and other question probes
Who may care about Question Generation? • Natural Language Generation community • Learning Technologies community • Intelligent Tutoring Systems • Subject testing (ETS) • Question Answering community
Landscape of Questions to Generate (Graesser and Person, 1994; Lehnert, 1978)
LEVEL 1: SIMPLE or SHALLOW
1. Verification: Is X true or false? Did an event occur?
2. Disjunctive: Is X, Y, or Z the case?
3. Concept completion: Who? What? When? Where?
4. Example: What is an example or instance of a category?
LEVEL 2: INTERMEDIATE
5. Feature specification: What qualitative properties does entity X have?
6. Quantification: What is the value of a quantitative variable? How much?
7. Definition: What does X mean?
8. Comparison: How is X similar to Y? How is X different from Y?
LEVEL 3: COMPLEX or DEEP
9. Interpretation: What concept/claim can be inferred from a pattern of data?
10. Causal antecedent: Why did an event occur?
11. Causal consequence: What are the consequences of an event or state?
12. Goal orientation: What are the motives or goals behind an agent's action?
13. Instrumental/procedural: What plan or instrument allows an agent to accomplish a goal?
14. Enablement: What object or resource allows an agent to accomplish a goal?
15. Expectation: Why did some expected event not occur?
16. Judgmental: What value does the answerer place on an idea or advice?
Question Generation • Input: one or more sentences • Output: set of questions related to the input text
Examples • AutoTutor • INPUT: There are no horizontal forces on the packet after release. • OUTPUT: What can you say about the horizontal forces on the packet? • NIST QA track • INPUT: But here is who will actually direct Dreamgirls -- none other than Frank Oz, the voice of Miss Piggy on the Muppets. • OUTPUT: Who is the voice of Miss Piggy?
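To make the task concrete, here is a minimal sketch (ours, not part of any proposed system) of rule-based generation for a Level 1 verification question, using naive subject-auxiliary inversion on a single sentence pattern; a real generator would need full syntactic analysis:

```python
# Toy rule-based generation of a verification (yes/no) question.
# Only handles declaratives that begin with "There are ..."; this is an
# illustrative assumption, not a description of AutoTutor's actual method.
def verification_question(sentence: str) -> str:
    body = sentence.rstrip(".")
    if body.startswith("There are "):
        # Subject-auxiliary inversion: "There are X" -> "Are there X?"
        return "Are there " + body[len("There are "):] + "?"
    raise ValueError("pattern not handled by this toy rule")

print(verification_question(
    "There are no horizontal forces on the packet after release."))
# -> Are there no horizontal forces on the packet after release?
```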
Subtasks - Input • INPUT • Input one sentence • Input one paragraph • Input specified in a formalism appropriate for Language Generation
Subtasks - Output • OUTPUT • Subtask 1: generate question containing only words from input • Subtask 2: generate questions containing only words from input, except for one word • Subtask 3: generate questions containing replaced phrases from input • Subtask 4: generate WHO questions, WHEN questions, etc. • Subtask 5: freely generate questions
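Subtask 4 can be pictured as answer-phrase substitution. The sketch below assumes the answer span and its semantic type are already identified (e.g., by a named-entity tagger), which is itself a hard subproblem; the naive fronting only works when the answer phrase is the grammatical subject:

```python
# Sketch of Subtask 4: replace a known answer phrase with the matching
# WH word. The mapping and the pre-identified answer span are assumptions
# made for illustration, not part of the task definition above.
WH_WORD = {"PERSON": "Who", "TIME": "When", "PLACE": "Where"}

def wh_question(sentence: str, answer: str, answer_type: str) -> str:
    # Remove the answer phrase, then front the WH word (works only when
    # the answer is the subject, as in the NIST example above).
    remainder = sentence.rstrip(".").replace(answer, "").strip(" ,")
    return f"{WH_WORD[answer_type]} {remainder}?"

print(wh_question("Frank Oz is the voice of Miss Piggy.",
                  "Frank Oz", "PERSON"))
# -> Who is the voice of Miss Piggy?
```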
Evaluation • Black-box • Simply judge the quality of the output • Glass-box • Some subtasks are designed to test particular components of language generation • Subtask 1 is suitable for testing syntactic variability and microplanning • Subtask 2 is suitable for testing lexical generation
Evaluation • Manual • Human experts judge the questions on quality and/or relevance • What is a good question? • Automatic • Suitable for some subtasks • Use automatic evaluation techniques from summarization (extractive summarization)
Evaluation - Metrics • Precision • Recall • Prepare a set of good questions for each input • Re-use existing data, e.g. NIST QA data • Use NIST method: • Collect all good questions from all submissions and use it as the pool of GOLD STANDARD questions • Ranking: MRR (mean reciprocal rank) • Confidence measure: confidence weighted measure
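A minimal sketch of how these metrics could be computed, assuming exact string match as a stand-in for the much harder judgment of whether a generated question matches a gold-standard one:

```python
# Precision/recall against a pooled gold standard, plus MRR over several
# inputs. Exact string equality is an illustrative simplification.
def precision_recall(generated: list[str], gold: set[str]) -> tuple[float, float]:
    hits = sum(1 for q in generated if q in gold)
    precision = hits / len(generated) if generated else 0.0
    recall = hits / len(gold) if gold else 0.0
    return precision, recall

def mean_reciprocal_rank(runs: list[list[str]], gold: set[str]) -> float:
    # For each input's ranked question list, credit 1/rank of the first
    # question found in the gold pool; average over all inputs.
    total = 0.0
    for ranked in runs:
        for rank, q in enumerate(ranked, start=1):
            if q in gold:
                total += 1.0 / rank
                break
    return total / len(runs) if runs else 0.0
```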
Data • AutoTutor • Hints and prompts to elicit physics principles • Expert-generated questions in curriculum scripts • NIST QA track • Thousands of Question-Answer pairs • Manipulate existing data • New data
Pros and Cons • Pros: • Textual input could help with wide adoption • Suitable for both glass-box and black-box evaluation • Automatic evaluation is possible • Data sets already available or nearly available • Cons: • Does not exercise discourse planning • Alternative: generate a set of related questions in which anaphora and other discourse phenomena are present • Pre-posed context clause • Fundamental issue: • What is a good question?
Summary • Simple and attractive • Automatic evaluation possible • Data sets available