Share and Share Alike: Resources for Language Generation

Prof. Marilyn Walker
University of Sheffield
NSF - 20 April 2007

What type of resource is needed for generation?
• What type of scientific problem is generation?
• An essential difference between language generation and language interpretation problems (parsing, WSD, relation extraction, coreference) is that there is no single right answer for language generation.
• Language Productivity Assumption: An optimal generation resource will represent multiple outputs for each input, with a human-generated quality metric associated with each output.

Dialogue vs. generation?
• Dialogue is like generation in that there is no single right answer for how to do a task in dialogue.
• Information gathering and information presentation in dialogue systems are generation problems.
• DARPA evaluation for dialogue systems:
• Fixed domain “TRAVEL PLANNING”
• First: ATIS evaluations compared dialogue system behaviour against human behaviour in a corpus of human-wizard dialogues (Hirschman 2000).
• This setup left no room for “mixed initiative”, different dialogue strategies, divergence of context, or user modeling.

Dialogue vs. generation?
• Second: define the context, then evaluate the system response to a user utterance in that particular context.
• Much more like generation: the context is defined, and the system ‘communicative goal’ is defined.
• Form: How is ‘the same response’ defined? Some forms for identical content may be better than others.
• Content: User models, definitions of context. Also, a dialogue system should be able to decide on its communicative goal.

Dialogue vs. generation?
• Third: Communicator evaluation: given a user task (NYC to LHR, Continental, April 22nd, 2007), collect metrics (time to completion, ASR error, utterance output quality, concept understanding, user satisfaction).
• Corpus semi-automatically labelled with dialogue acts (quality/strategy metrics) for system utterances (8 or more different instantiations from different systems for particular communicative goals).
• Try to understand which metrics contribute to user satisfaction (PARADISE).
• User utterances labelled subsequently, used in RL experiments comparing dialogue strategies.
• Hard to compare particular scientific techniques for particular modules in systems; plug and play never worked.

Dialogue vs. generation: Conclusions?
• Just having a fixed task (TRAVEL) by itself does not necessarily lead to scientific progress.
• Want to compare particular scientific techniques for particular modules in systems.
• Plug and play is the only way to do this.
• BUT: very hard to define for a whole community what the interfaces between modules should be.

Position
• What type of resources would be useful for scientific advancement in language generation?
• Almost anything!
• “If you build it they will come”; “If it's useful people will use it”
• Can we leverage what we already have in our own research groups, share it, and make it better?

What is needed to incentivize data sharing
• Many different domains/problems/modules => NEED LOTS OF DIFFERENT RESOURCES
• Resources costly (developing group not ‘finished’ yet) => FINANCIAL INCENTIVE; SCIENTIFIC INCENTIVE; CITATION INCENTIVE
• Costs too much to support resource preparation, maintenance, distribution and re-use => NSF/LDC FINANCIAL/SUPPORT
• NOTE: MANY LDC RESOURCES ARE “FOUND DATA” (not explicitly commissioned)

A proposal for one shared resource

Information presentation of one or more database entities
• Natural Language Interfaces/SDS (McKeown 85, McCoy 89, Cooperative Response literature, Carenini & Moore 01, Polifroni et al. 03, COGENTEX w/ active buyers website, Walker et al. 04, Demberg & Moore 06, etc.)
• Different communicative goals: Summarize, Recommend, Compare, Describe (DB entities)
• Representation not controversial (attributes and values for DB entities, relations between entity and attribute)
• Application not dependent on NLU

We could make available a resource of:
• INPUT-1: Speech act, set of DB entities
• SUMMARIZE(SET), DESCRIBE(ENTITY), RECOMMEND(ENTITY, SET), COMPARE(SET)
• INPUT-2: user model, discourse/dialogue context, style parameters, etc.
• OUTPUT-1: a set of alternative outputs, possibly with TTS markup
• OUTPUT-2: human-generated ratings or rankings for the outputs, oriented to the criteria specified by INPUT-2
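
One way to picture a single entry in such a resource is as a record holding both inputs and both outputs. The sketch below is only an illustration: the class names, field names, the example sentences and the ratings are all hypothetical, and no rating scale is implied by the proposal itself.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Dict, List


class SpeechAct(Enum):
    """The four communicative goals listed under INPUT-1."""
    SUMMARIZE = "summarize"
    DESCRIBE = "describe"
    RECOMMEND = "recommend"
    COMPARE = "compare"


@dataclass
class RatedOutput:
    text: str                    # OUTPUT-1: one alternative realization
    tts_markup: str = ""         # OUTPUT-1: optional TTS markup (format left open)
    ratings: List[int] = field(default_factory=list)  # OUTPUT-2: human ratings


@dataclass
class ResourceItem:
    speech_act: SpeechAct              # INPUT-1
    entities: List[Dict[str, str]]     # INPUT-1: DB entities as attribute/value maps
    context: Dict[str, str]            # INPUT-2: user model, dialogue context, style
    outputs: List[RatedOutput] = field(default_factory=list)

    def unrated(self) -> List[RatedOutput]:
        """Outputs that still need human judgments."""
        return [o for o in self.outputs if not o.ratings]


# Hypothetical entry: entity attributes, sentences and ratings are made up.
item = ResourceItem(
    speech_act=SpeechAct.RECOMMEND,
    entities=[{"name": "Babbo", "foodquality": "superb", "decor": "excellent"}],
    context={"user_model": "values food quality most"},
    outputs=[
        RatedOutput("Babbo is the best. It has superb food quality.", ratings=[5, 4]),
        RatedOutput("Babbo has excellent decor and superb food quality."),
    ],
)
```

Keeping the ratings attached to each alternative output, rather than in a separate file, makes the "multiple outputs per input, each with a quality metric" assumption explicit in the data itself.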

A Content Plan for a Recommend
• strategy: recommend
• relations: justify(nuc:1; sat:2); justify(nuc:1; sat:3); justify(nuc:1; sat:4)
• content:
  1. assert(best(Babbo))
  2. assert(has-att(Babbo, foodquality(superb)))
  3. assert(has-att(Babbo, decor(excellent)))
  4. assert(has-att(Babbo, service(excellent)))
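
The content plan above maps naturally onto a small data structure: numbered assertions plus justify relations that point at them by number. The following is a minimal sketch under that reading; the class and method names are hypothetical, and the assertions are stored as plain strings rather than parsed.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Justify:
    """A justify relation: the satellite assertion supports the nucleus."""
    nucleus: int
    satellite: int


@dataclass
class ContentPlan:
    strategy: str
    content: Dict[int, str]     # assertion number -> assertion
    relations: List[Justify]

    def dangling(self) -> List[Justify]:
        """Relations that refer to an assertion number not in the content."""
        return [r for r in self.relations
                if r.nucleus not in self.content or r.satellite not in self.content]

    def satellites_of(self, nucleus: int) -> List[int]:
        """Assertion numbers that justify the given nucleus, in plan order."""
        return [r.satellite for r in self.relations if r.nucleus == nucleus]


plan = ContentPlan(
    strategy="recommend",
    content={
        1: "assert(best(Babbo))",
        2: "assert(has-att(Babbo, foodquality(superb)))",
        3: "assert(has-att(Babbo, decor(excellent)))",
        4: "assert(has-att(Babbo, service(excellent)))",
    },
    relations=[Justify(1, 2), Justify(1, 3), Justify(1, 4)],
)

print(plan.satellites_of(1))
print(plan.dangling())
```

A check like `dangling()` is the kind of validation a shared resource would likely need, since plans contributed by different groups could otherwise reference assertions that are missing.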

Human Feedback for Ranking
• The ratings can represent any metric associated with the possible response, e.g. coherence, information quality, social appropriateness, personality.
• Informational Coherence
• SPARKY, a generator for MATCH
• SPOT, a generator for AT&T COMMUNICATOR
• Users are shown response variants, then told:
• For each variant, please rate to what extent you agree with this statement:
• The utterance is easy to understand, well-formed and appropriate to the dialogue context.
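
Ratings like these can be turned into ranking data in several ways. One possibility, shown here purely as an assumption and not as the procedure used for SPARKY or SPOT, is to average each variant's ratings and emit an ordered (better, worse) pair whenever two variants of the same input differ. The variant identifiers and rating values are hypothetical.

```python
from itertools import combinations
from statistics import mean
from typing import Dict, List, Tuple


def preference_pairs(ratings: Dict[str, List[int]]) -> List[Tuple[str, str]]:
    """Return (better, worse) pairs of variant ids; ties produce no pair."""
    avg = {variant: mean(scores) for variant, scores in ratings.items()}
    pairs = []
    for a, b in combinations(sorted(avg), 2):
        if avg[a] > avg[b]:
            pairs.append((a, b))
        elif avg[b] > avg[a]:
            pairs.append((b, a))
    return pairs


# Hypothetical ratings from two judges for three variants of one input.
ratings = {"v1": [5, 4], "v2": [2, 3], "v3": [4, 5]}
pairs = preference_pairs(ratings)
print(pairs)
```

Here `v1` and `v3` have the same mean, so no preference is recorded between them; only pairs with a clear ordering are kept.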

Examples: Learned Rules applied to test fold

Individual Differences (Sentence Planning Preferences)

Human Feedback for Ranking (2)
• Ten Item Personality Inventory questionnaire (Gosling 2003)
• PERSONAGE
• Users are shown response variants, then told:
• For each variant, rate on a scale of 1 to 7 whether:
• The speaker is quiet, reserved;
• The speaker is enthusiastic;
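
The two statements shown correspond to the extraversion pair of the Ten Item Personality Inventory, where the "quiet, reserved" item is reverse-scored before averaging. A minimal sketch of that scoring on the 1 to 7 scale follows; the function names and the judgments are hypothetical, and averaging across judges is an assumption rather than something the slide specifies.

```python
from statistics import mean

SCALE_MIN, SCALE_MAX = 1, 7


def reverse(score: int) -> int:
    """Reverse-score an item on the 1..7 scale (1 <-> 7, 2 <-> 6, ...)."""
    return SCALE_MIN + SCALE_MAX - score


def extraversion(enthusiastic: int, quiet_reserved: int) -> float:
    """Average the 'enthusiastic' item with the reversed 'quiet, reserved' item."""
    for score in (enthusiastic, quiet_reserved):
        if not SCALE_MIN <= score <= SCALE_MAX:
            raise ValueError(f"rating {score} is outside {SCALE_MIN}..{SCALE_MAX}")
    return (enthusiastic + reverse(quiet_reserved)) / 2


# Hypothetical judgments for one utterance: (enthusiastic, quiet_reserved) per judge.
judgments = [(6, 2), (7, 1), (5, 3)]
per_judge = [extraversion(e, q) for e, q in judgments]
utterance_score = mean(per_judge)
print(per_judge, utterance_score)
```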

Personality judgments: ‘Recommend Le Marais’

What else is out there?
• Coconut corpus: referring expression generation, but add alternatives and ratings?
• Boston directions corpus (NSF funded, early 1990s)
• Communicator corpus (8 different system outputs for dialogue contexts that can be characterized)
• Tools: Halogen, Penman, FUF-SURGE, RealPro
• Library of text plans, content plans, sentence planners?
