
NLG Shared Tasks: Let's try it and see what happens


Presentation Transcript


  1. NLG Shared Tasks: Let's try it and see what happens Ehud Reiter (Univ of Aberdeen) http://www.csd.abdn.ac.uk/~ereiter

  2. Contents • General Comments • Geneval proposal

  3. Good points of Shared Task • Compare different approaches • Encourage people to interact more • Reduce NLG “barriers to entry” • Better understanding of evaluation

  4. Bad Points • May narrow focus of community • IR ignored web search because of TREC? • May encourage incremental research instead of new ideas

  5. My opinion • Let's give it a try • But I suspect one-off exercises are better than a series • Many people think MUC, DUC, etc. were very useful initially but became less scientifically exciting over time

  6. Practical Issues • Domain/task? • Need something which several (6?) groups are interested in • Evaluation technique • Avoid techniques that are biased • Eg, some automatic metrics may favour statistical systems

  7. Geneval • Proposal to evaluate NLG evaluation • Core idea is to evaluate in many ways a set of systems with similar input/output functionality, and see how well different evaluation techniques correlate • Anja Belz and Ehud Reiter • Hope to submit to EPSRC (roughly similar to NSF in US) soon

  8. NLG Evaluation • Many types • Task-based, human ratings, BLEU-like metrics, etc • Little consensus on best technique • Ie, most appropriate for a context • Poorly understood

  9. Some open questions • How well do different evaluation types correlate? • Eg, does BLEU predict human ratings? • Are there biases? • Eg, are statistical NLG systems over/under-rated by some techniques? • What is the best design? • Number of subjects, subject expertise, number (quality) of reference texts, etc
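The first open question above, whether a metric like BLEU predicts human ratings, is usually checked by correlating system-level scores. Below is a minimal sketch (not part of the slides) of such a check in Python using scipy; all the numbers are invented placeholders purely for illustration, not data from any real study.

# Minimal sketch: correlate system-level BLEU scores with mean human ratings.
# All numbers below are invented placeholders, not data from any real study.
from scipy.stats import pearsonr, spearmanr

# One hypothetical BLEU score and mean human rating (1-5 Likert) per system.
bleu_scores   = [0.42, 0.55, 0.38, 0.61, 0.47]
human_ratings = [3.1, 3.8, 2.9, 4.2, 3.5]

r, r_p = pearsonr(bleu_scores, human_ratings)        # linear correlation
rho, rho_p = spearmanr(bleu_scores, human_ratings)   # rank correlation

print(f"Pearson r = {r:.2f} (p = {r_p:.3f})")
print(f"Spearman rho = {rho:.2f} (p = {rho_p:.3f})")

Rank correlation (Spearman) is often reported alongside Pearson because it ignores differences in scale between a metric and a rating instrument and only asks whether the two rank the systems the same way.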

  10. Belz and Reiter (2006) • Evaluated several systems for generating wind statements in weather forecasts, using both human judgements and BLEU-like metrics • Found OK (not wonderful) correlation, but also some biases • Geneval: do this on a much larger scale • More domains, more systems, more evaluation techniques (including new ones), etc

  11. Geneval: Possible Domains • Weather forecasts (not wind statements) • Use SumTime corpus • Referring expressions • Use Prodigy-Grec or Tuna corpus • Medical summaries • Use Babytalk corpus • Statistical summaries • Use Atlas corpus

  12. Geneval: Evaluation techniques • Human task-based • Eg, referential success • Human ratings • Likert vs pref; expert vs non-expert • Automatic metrics based on ref texts • BLEU, ROUGE, METEOR, etc • Automatic metrics without ref texts • MT T and X scores, length
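As an illustration of the reference-based automatic metrics listed above, here is a hedged sketch of computing corpus-level BLEU with NLTK; the tokenised texts are made-up weather-style examples, not drawn from any of the corpora named on these slides.

# Sketch: corpus-level BLEU of generated texts against human reference texts.
# The example texts are invented for illustration only.
from nltk.translate.bleu_score import corpus_bleu, SmoothingFunction

# Each hypothesis may have several references; here each has just one.
references = [
    [["south", "westerly", "winds", "increasing", "to", "gale", "force"]],
    [["winds", "easing", "overnight"]],
]
hypotheses = [
    ["southwest", "winds", "increasing", "to", "gale", "force"],
    ["winds", "easing", "during", "the", "night"],
]

smooth = SmoothingFunction().method1  # avoid zero scores on very short texts
print("corpus BLEU:", corpus_bleu(references, hypotheses, smoothing_function=smooth))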

  13. Geneval: new techniques • Would also like to explore and develop new evaluation techniques • Post-edit based human evaluations? • Automatic metrics which look at semantic features? • Open to suggestions for other ideas!

  14. Would like systems contributed • Study would be better if other people would contribute systems • We supply data sets and corpora, and carry out evaluations • So you can focus 100% on your great new algorithmic ideas!

  15. Geneval from STEC perspective • Sort of like a STEC??? • If people contribute systems based on our data sets and corpora • But results will be anonymised • Only the developer of system X knows how well X did • One-off exercises, not repeated • Multiple evaluation techniques • Hope data sets will reduce barriers to entry

  16. Geneval • Please let Anja or me know if • You have general comments, and/or • You have a suggestion for an additional evaluation technique • You might be interested in contributing a system
