The Question Generation Task Vasile Rus, Zhiqiang Cai, and Art Graesser
Outline • Shared task for NLG? • Why is question generation important? • Landscape of example questions • Definition of Question Generation • Subtasks • Evaluation Methodologies • Black-box vs. Glass-box • Manual vs. Automatic • Data sets
NLG: Shared Task(s) or Not? • Pros • Define common evaluation metrics • Compare approaches on the chosen task • Monitor progress on the task • Pool the community-wide effort needed to build resources and infrastructure • Bring the community together • Increase the visibility of NLG • Cons • Too much effort may be spent on the chosen task • May overshadow other basic research efforts
What Shared Task(s)? • Principle: given the inherent difficulty of language generation, choose a (relatively) simple task • Question Answering has avoided deep questions • Summarization focuses on extractive summaries • Textual Entailment = text understanding? • Full-fledged NLU evaluation?
Why is Question Generation Important? • Help systems and FAQ facilities need example questions to model for users • Information retrieval systems need to suggest revised queries • There is a need for automated systems that proactively ask and answer questions • Intelligent tutoring systems need automated hints and other question probes
Who may care about Question Generation? • Natural Language Generation community • Learning Technologies community • Intelligent Tutoring Systems • Subject testing (ETS) • Question Answering community
Landscape of Questions to Generate (Graesser and Person, 1994; Lehnert, 1978)
LEVEL 1: SIMPLE or SHALLOW
1. Verification: Is X true or false? Did an event occur?
2. Disjunctive: Is X, Y, or Z the case?
3. Concept completion: Who? What? When? Where?
4. Example: What is an example or instance of a category?
LEVEL 2: INTERMEDIATE
5. Feature specification: What qualitative properties does entity X have?
6. Quantification: What is the value of a quantitative variable? How much?
7. Definition: What does X mean?
8. Comparison: How is X similar to Y? How is X different from Y?
LEVEL 3: COMPLEX or DEEP
9. Interpretation: What concept/claim can be inferred from a pattern of data?
10. Causal antecedent: Why did an event occur?
11. Causal consequence: What are the consequences of an event or state?
12. Goal orientation: What are the motives or goals behind an agent's action?
13. Instrumental/procedural: What plan or instrument allows an agent to accomplish a goal?
14. Enablement: What object or resource allows an agent to accomplish a goal?
15. Expectation: Why did some expected event not occur?
16. Judgmental: What value does the answerer place on an idea or advice?
Question Generation • Input: one or more sentences • Output: set of questions related to the input text
Examples • AutoTutor • INPUT: There are no horizontal forces on the packet after release. • OUTPUT: What can you say about the horizontal forces on the packet? • NIST QA track • INPUT: But here is who will actually direct Dreamgirls -- none other than Frank Oz, the voice of Miss Piggy on the Muppets. • OUTPUT: Who is the voice of Miss Piggy?
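To make the task concrete, here is a minimal sketch (ours, not part of any proposed system) of rule-based generation for a Level 1 verification question, using naive subject-auxiliary inversion on a single sentence pattern; a real generator would need full syntactic analysis:

```python
# Toy rule-based generation of a verification (yes/no) question.
# Only handles declaratives that begin with "There are ..."; this is an
# illustrative assumption, not a description of AutoTutor's actual method.
def verification_question(sentence: str) -> str:
    body = sentence.rstrip(".")
    if body.startswith("There are "):
        # Subject-auxiliary inversion: "There are X" -> "Are there X?"
        return "Are there " + body[len("There are "):] + "?"
    raise ValueError("pattern not handled by this toy rule")

print(verification_question(
    "There are no horizontal forces on the packet after release."))
# -> Are there no horizontal forces on the packet after release?
```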
Subtasks - Input • INPUT • Input one sentence • Input one paragraph • Input specified in a formalism appropriate for Language Generation
Subtasks - Output • OUTPUT • Subtask 1: generate question containing only words from input • Subtask 2: generate questions containing only words from input, except for one word • Subtask 3: generate questions containing replaced phrases from input • Subtask 4: generate WHO questions, WHEN questions, etc. • Subtask 5: freely generate questions
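Subtask 4 can be pictured as answer-phrase substitution. The sketch below assumes the answer span and its semantic type are already identified (e.g., by a named-entity tagger), which is itself a hard subproblem; the naive fronting only works when the answer phrase is the grammatical subject:

```python
# Sketch of Subtask 4: replace a known answer phrase with the matching
# WH word. The mapping and the pre-identified answer span are assumptions
# made for illustration, not part of the task definition above.
WH_WORD = {"PERSON": "Who", "TIME": "When", "PLACE": "Where"}

def wh_question(sentence: str, answer: str, answer_type: str) -> str:
    # Remove the answer phrase, then front the WH word (works only when
    # the answer is the subject, as in the NIST example above).
    remainder = sentence.rstrip(".").replace(answer, "").strip(" ,")
    return f"{WH_WORD[answer_type]} {remainder}?"

print(wh_question("Frank Oz is the voice of Miss Piggy.",
                  "Frank Oz", "PERSON"))
# -> Who is the voice of Miss Piggy?
```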
Evaluation • Black-box • Simply judge the quality of the output • Glass-box • Some subtasks are designed to test particular components of language generation • Subtask 1 is suitable for testing syntactic variability and microplanning • Subtask 2 is suitable for testing lexical generation
Evaluation • Manual • Human experts judge the questions on quality and/or relevance • What is a good question? • Automatic • Suitable for some subtasks • Use automatic evaluation techniques from summarization (extractive summarization)
Evaluation - Metrics • Precision • Recall • Prepare a set of good questions for each input • Re-use existing data, e.g. NIST QA data • Use NIST method: • Collect all good questions from all submissions and use it as the pool of GOLD STANDARD questions • Ranking: MRR (mean reciprocal rank) • Confidence measure: confidence weighted measure
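A minimal sketch of how these metrics could be computed, assuming exact string match as a stand-in for the much harder judgment of whether a generated question matches a gold-standard one:

```python
# Precision/recall against a pooled gold standard, plus MRR over several
# inputs. Exact string equality is an illustrative simplification.
def precision_recall(generated: list[str], gold: set[str]) -> tuple[float, float]:
    hits = sum(1 for q in generated if q in gold)
    precision = hits / len(generated) if generated else 0.0
    recall = hits / len(gold) if gold else 0.0
    return precision, recall

def mean_reciprocal_rank(runs: list[list[str]], gold: set[str]) -> float:
    # For each input's ranked question list, credit 1/rank of the first
    # question found in the gold pool; average over all inputs.
    total = 0.0
    for ranked in runs:
        for rank, q in enumerate(ranked, start=1):
            if q in gold:
                total += 1.0 / rank
                break
    return total / len(runs) if runs else 0.0
```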
Data • AutoTutor • Hints and prompts to elicit physics principles • Expert-generated questions in curriculum scripts • NIST QA track • Thousands of Question-Answer pairs • Manipulate existing data • New data
Pros and Cons • Pros: • Textual input could help with wide adoption • Suitable for both glass-box and black-box evaluation • Automatic evaluation is possible • Data sets already available or nearly available • Cons: • Does not exercise discourse planning • Alternative: generate a set of related questions in which anaphora and other discourse phenomena are present • Pre-posed context clause • Fundamental issue: • What is a good question?
Summary • Simple and attractive • Automatic evaluation possible • Data sets available