Geneval proposes to evaluate NLG systems with similar input/output functionality using various evaluation techniques, including human task-based assessments and automatic metrics. The goal is to explore correlations between different evaluation techniques and improve the understanding of NLG evaluation. This effort aims to reduce barriers to entry in NLG research and encourage interaction among researchers.
NLG Shared Tasks: Let's try it and see what happens Ehud Reiter (Univ of Aberdeen) http://www.csd.abdn.ac.uk/~ereiter
Contents • General Comments • Geneval proposal
Good Points of Shared Tasks • Compare different approaches • Encourage people to interact more • Reduce NLG “barriers to entry” • Better understanding of evaluation
Bad Points • May narrow focus of community • IR ignored web search because of TREC? • May encourage incremental research instead of new ideas
My opinion • Let's give it a try • But I suspect one-off exercises are better than a series • Many people think MUC, DUC, etc were very useful initially but became less scientifically exciting over time
Practical Issues • Domain/task? • Need something that several (6?) groups are interested in • Evaluation technique • Avoid techniques that are biased • Eg, some automatic metrics may favour statistical systems
Geneval • Proposal to evaluate NLG evaluation • Core idea: evaluate a set of systems with similar input/output functionality in many different ways, and see how well the evaluation techniques correlate • Anja Belz and Ehud Reiter • Hope to submit to EPSRC (roughly similar to the NSF in the US) soon
NLG Evaluation • Many types • Task-based, human ratings, BLEU-like metrics, etc • Little consensus on best technique • Ie, which technique is most appropriate in a given context • Poorly understood
Some open questions • How well do different types correlate? • Eg, does BLEU predict human ratings? • Are there biases? • Eg, are statistical NLG systems over/under rated by some techniques? • What is the best design? • Number of subjects, subject expertise, number (quality) of reference texts, etc
Belz and Reiter (2006) • Evaluated several systems for generating wind statements in weather forecasts, using both human judgements and BLEU-like metrics • Found OK (not wonderful) correlation, but also some biases • Geneval: do this on a much larger scale • More domains, more systems, more evaluation techniques (including new ones), etc
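To make the correlation question concrete, here is a minimal sketch in Python of the kind of system-level analysis involved: given one automatic score and one mean human rating per system, compute how well the two agree. All numbers and system labels below are invented placeholders, not results from the 2006 study.

```python
# Sketch only: correlating an automatic metric with human ratings at the
# system level. Scores and system names are invented, not real results.
from scipy.stats import pearsonr, spearmanr

systems       = ["system A", "system B", "system C", "system D"]
bleu_scores   = [0.45, 0.52, 0.50, 0.30]   # corpus BLEU per system (made up)
human_ratings = [4.1, 3.8, 3.9, 2.6]       # mean human judgement per system (made up)

# Pearson measures linear agreement; Spearman compares rankings only.
r, r_p     = pearsonr(bleu_scores, human_ratings)
rho, rho_p = spearmanr(bleu_scores, human_ratings)

print(f"Pearson r    = {r:.2f} (p = {r_p:.3f})")
print(f"Spearman rho = {rho:.2f} (p = {rho_p:.3f})")
```

A high Spearman correlation would mean the metric ranks systems the same way the human judges do, even if the absolute scales differ; a bias would show up as particular kinds of systems being consistently over- or under-ranked by one technique.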
Geneval: Possible Domains • Weather forecasts (not wind statements) • Use SumTime corpus • Referring expressions • Use Prodigy-Grec or Tuna corpus • Medical summaries • Use Babytalk corpus • Statistical summaries • Use Atlas corpus
Geneval: Evaluation techniques • Human task-based • Eg, referential success • Human ratings • Likert scales vs preference judgements; expert vs non-expert subjects • Automatic metrics based on ref texts • BLEU, ROUGE, METEOR, etc • Automatic metrics without ref texts • MT T and X scores, length
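As an illustration of the reference-based metrics listed above, here is a minimal corpus BLEU computation with NLTK. The forecast-style texts are invented examples, not SumTime corpus data, and the smoothing choice is just one reasonable option.

```python
# Sketch only: scoring generated texts against reference texts with corpus
# BLEU. The wind-statement-style texts are invented examples.
from nltk.translate.bleu_score import corpus_bleu, SmoothingFunction

references = [
    ["SSW 10-15 increasing 20-25 by evening".split()],  # reference(s) for output 1
    ["W 8-12 backing SW by midday".split()],            # reference(s) for output 2
]
hypotheses = [
    "SSW 10-15 increasing 20-25 by early evening".split(),
    "W 8-12 veering SW by midday".split(),
]

# Smoothing avoids zero scores when some higher-order n-gram never matches.
smooth = SmoothingFunction().method1
print(f"corpus BLEU = {corpus_bleu(references, hypotheses, smoothing_function=smooth):.3f}")
```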
Geneval: new techniques • Would also like to explore and develop new evaluation techniques • Post-edit based human evaluations? • Automatic metrics which look at semantic features? • Open to suggestions for other ideas!
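One way a post-edit based evaluation might be quantified is by how much editing an expert has to do before a generated text is acceptable, in the spirit of edit-distance measures used in MT evaluation. The sketch below uses word-level Levenshtein distance normalised by the length of the post-edited text; that normalisation is an assumption for illustration, not a settled design.

```python
# Sketch only: post-edit effort as word-level edit distance between a system
# output and its human post-edited version. Normalising by the post-edited
# length is one possible choice, not an established standard.

def word_edit_distance(a: list[str], b: list[str]) -> int:
    """Levenshtein distance over word tokens (insertions, deletions, substitutions)."""
    prev = list(range(len(b) + 1))
    for i, wa in enumerate(a, start=1):
        curr = [i]
        for j, wb in enumerate(b, start=1):
            cost = 0 if wa == wb else 1
            curr.append(min(prev[j] + 1,          # delete from a
                            curr[j - 1] + 1,      # insert into a
                            prev[j - 1] + cost))  # substitute
        prev = curr
    return prev[-1]

def post_edit_score(system_output: str, post_edited: str) -> float:
    """Fraction of the post-edited text's words that had to be changed (lower is better)."""
    sys_toks, edit_toks = system_output.split(), post_edited.split()
    return word_edit_distance(sys_toks, edit_toks) / max(len(edit_toks), 1)

# Invented example texts, purely for illustration.
print(post_edit_score("SSW 10-15 increasing 20-25 by evening",
                      "SSW 10-15 increasing 20-25 by early evening"))
```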
Would like systems contributed • Study would be better if other people would contribute systems • We supply data sets and corpora, and carry out evaluations • So you can focus 100% on your great new algorithmic ideas!
Geneval from a STEC perspective • Sort of like a STEC? • If people contribute systems based on our data sets and corpora • But results will be anonymised • Only the developer of system X knows how well X did • One-off exercises, not repeated • Multiple evaluation techniques • Hope data sets will reduce barriers to entry
Geneval • Please let Anja or me know if • You have general comments, and/or • You have a suggestion for an additional evaluation technique • You might be interested in contributing a system