NLG Shared Tasks: Let's try it and see what happens
Ehud Reiter (Univ of Aberdeen)
http://www.csd.abdn.ac.uk/~ereiter
Contents
• General Comments
• Geneval proposal
Good Points of Shared Tasks
• Compare different approaches
• Encourage people to interact more
• Reduce NLG "barriers to entry"
• Better understanding of evaluation
Bad Points
• May narrow the focus of the community
  • Did IR ignore web search because of TREC?
• May encourage incremental research instead of new ideas
My Opinion
• Let's give it a try
• But I suspect one-off exercises are better than a series
  • Many people think MUC, DUC, etc. were very useful initially but became less scientifically exciting over time
Practical Issues
• Domain/task?
  • Need something which several (6?) groups are interested in
• Evaluation technique
  • Avoid techniques that are biased
  • E.g., some automatic metrics may favour statistical systems
Geneval
• Proposal to evaluate NLG evaluation
• Core idea: evaluate a set of systems with similar input/output functionality in many different ways, and see how well the different evaluation techniques correlate
• Anja Belz and Ehud Reiter
• Hope to submit to EPSRC (roughly similar to NSF in the US) soon
NLG Evaluation
• Many types
  • Task-based, human ratings, BLEU-like metrics, etc.
• Little consensus on the best technique
  • I.e., which is most appropriate for a given context
• Poorly understood
Some Open Questions
• How well do different types correlate?
  • E.g., does BLEU predict human ratings? (see the sketch below)
• Are there biases?
  • E.g., are statistical NLG systems over/under-rated by some techniques?
• What is the best design?
  • Number of subjects, subject expertise, number (and quality) of reference texts, etc.
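One way to make the first question concrete is to compute correlations between per-system metric scores and per-system mean human ratings. A minimal sketch in Python using SciPy; all scores below are invented placeholders, not real evaluation data:

```python
# Hedged sketch: do automatic metric scores track human ratings across
# systems? All numbers here are invented for illustration only.
from scipy.stats import pearsonr, spearmanr

# Hypothetical per-system scores (one entry per NLG system evaluated).
bleu_scores = [0.42, 0.38, 0.51, 0.29, 0.45]  # automatic metric scores
human_ratings = [3.9, 3.5, 4.2, 2.8, 3.6]     # mean Likert ratings

r, p_r = pearsonr(bleu_scores, human_ratings)       # linear agreement
rho, p_rho = spearmanr(bleu_scores, human_ratings)  # rank agreement

print(f"Pearson r = {r:.2f} (p = {p_r:.3f})")
print(f"Spearman rho = {rho:.2f} (p = {p_rho:.3f})")
```

Rank correlation (Spearman) is often the more relevant number here, since what usually matters is whether a metric ranks systems the same way humans do, not whether the scores are linearly related.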
Belz and Reiter (2006)
• Evaluated several systems for generating wind statements in weather forecasts, using both human judgements and BLEU-like metrics
• Found reasonable, but not strong, correlation, and also some biases
• Geneval: do this on a much larger scale
  • More domains, more systems, more evaluation techniques (including new ones), etc.
Geneval: Possible Domains
• Weather forecasts (not wind statements)
  • Use SumTime corpus
• Referring expressions
  • Use Prodigy-GREC or TUNA corpus
• Medical summaries
  • Use Babytalk corpus
• Statistical summaries
  • Use Atlas corpus
Geneval: Evaluation Techniques
• Human task-based
  • E.g., referential success
• Human ratings
  • Likert scales vs. preference judgements; expert vs. non-expert subjects
• Automatic metrics based on reference texts
  • BLEU, ROUGE, METEOR, etc. (see the sketch below)
• Automatic metrics without reference texts
  • MT T and X scores, length
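As an illustration of the reference-based metrics listed above, here is a minimal sketch of corpus-level BLEU using NLTK. The forecast texts are invented placeholders; a real study would use held-out corpus texts (e.g. SumTime forecasts) and ideally multiple references per output:

```python
# Hedged sketch: corpus BLEU over tokenised system outputs against
# reference texts. Texts are invented placeholders for illustration.
from nltk.translate.bleu_score import corpus_bleu, SmoothingFunction

# One list of reference tokenisations per system output.
references = [
    [["gusts", "up", "to", "40", "knots", "by", "midday"]],
    [["winds", "easing", "overnight"]],
]
hypotheses = [
    ["gusts", "reaching", "40", "knots", "around", "midday"],
    ["winds", "easing", "during", "the", "night"],
]

# Smoothing avoids zero scores when a higher-order n-gram never matches,
# which is common with short texts and a single reference.
smooth = SmoothingFunction().method1
print("BLEU:", corpus_bleu(references, hypotheses, smoothing_function=smooth))
```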
Geneval: New Techniques
• Would also like to explore and develop new evaluation techniques
  • Post-edit-based human evaluations?
  • Automatic metrics which look at semantic features?
• Open to suggestions for other ideas!
Would Like Systems Contributed
• The study would be better if other people contributed systems
• We supply data sets and corpora, and carry out evaluations
  • So you can focus 100% on your great new algorithmic ideas!
Geneval from a STEC Perspective
• Sort of like a STEC???
  • If people contribute systems based on our data sets and corpora
• But results will be anonymised
  • Only the developer of system X knows how well X did
• One-off exercise, not repeated
• Multiple evaluation techniques
• Hope data sets will reduce barriers to entry
Geneval
Please let Anja or me know if
• You have general comments, and/or
• You have a suggestion for an additional evaluation technique, and/or
• You might be interested in contributing a system