Putting development and evaluation of core technology first
Anja Belz
Natural Language Technology Group (NLTG), University of Brighton, UK
Overview
• NLG needs comparative evaluation
• Core technology first, applications second
• Towards common subtasks, corpora and evaluation techniques
• What kind of STEC event for NLG?

Belz: Putting development and evaluation of core technology first
NLG needs comparative evaluation
• NLG has strong evaluation traditions
• But there has been almost no comparative evaluation, only a handful of results, e.g.:
  • regenerating the Wall Street Journal Corpus
  • SumTime wind forecast generation
• At present, we don't really know which NLG techniques generally work better
• For consolidation of results and collective progress, we need the ability to evaluate comparatively
Core technology first, applications second
• Biggest challenge: identifying sharable tasks
• A shared application would be potentially divisive:
  • NLG is a varied field with many applications
  • hard to select one with enough agreement
  • evaluation results would be application-specific
• Instead, choose tasks that can unify NLG:
  • tasks that are relevant to all of NLG
  • core technology that is potentially useful to all of NLG
  • utilise commonalities and agreement that have already emerged: GRE (generation of referring expressions), lexicalisation, content ordering
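To make the GRE subtask concrete, here is a minimal sketch of the classic Incremental Algorithm for referring-expression generation (Dale & Reiter): select attributes, in a fixed preference order, that rule out distractor entities until the target is uniquely identified. The entities and the preference order below are illustrative, not from the talk.

```python
def generate_re(target, distractors, preferred_attrs):
    """Select attributes that distinguish `target` from `distractors`."""
    description = {}
    remaining = list(distractors)
    for attr in preferred_attrs:
        value = target.get(attr)
        if value is None:
            continue
        # keep the attribute only if it rules out at least one distractor
        ruled_out = [d for d in remaining if d.get(attr) != value]
        if ruled_out:
            description[attr] = value
            remaining = [d for d in remaining if d.get(attr) == value]
        if not remaining:
            break  # target is uniquely identified
    return description

# Illustrative domain: pick out the small red ball among other objects
target = {"type": "ball", "colour": "red", "size": "small"}
distractors = [
    {"type": "ball", "colour": "red", "size": "large"},
    {"type": "cube", "colour": "red", "size": "small"},
]
print(generate_re(target, distractors, ["type", "colour", "size"]))
# → {'type': 'ball', 'size': 'small'}, i.e. "the small ball"
```

Note that "colour" is skipped because all remaining distractors are also red; a shared task can then evaluate such output descriptions against human-produced referring expressions.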
Towards common subtasks, corpora and evaluation techniques
• Standardising subtasks and input/output requirements
• Building data resources for building and evaluating systems
• Creating NLG-specific evaluation techniques
  • ISO quality characteristics: functionality, reliability, usability, efficiency, maintainability, portability
  • need to focus on evaluation of the quality of outputs
• (New) GENEVAL: test existing and new evaluation techniques
  • that assess different evaluation criteria
  • and have a range of associated cost/time requirements
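As an example of the kind of low-cost automatic output-quality metric a GENEVAL-style exercise would test, here is a hedged sketch of modified n-gram precision against a single reference text (BLEU-style, simplified: one reference, no brevity penalty). The example strings are invented, loosely echoing the wind-forecast domain.

```python
from collections import Counter

def ngram_precision(candidate, reference, n=2):
    """Clipped n-gram precision of a candidate against one reference."""
    cand_tokens = candidate.split()
    ref_tokens = reference.split()
    cand_ngrams = Counter(tuple(cand_tokens[i:i + n])
                          for i in range(len(cand_tokens) - n + 1))
    ref_ngrams = Counter(tuple(ref_tokens[i:i + n])
                         for i in range(len(ref_tokens) - n + 1))
    if not cand_ngrams:
        return 0.0
    # clip each candidate n-gram count by its count in the reference
    overlap = sum(min(count, ref_ngrams[gram])
                  for gram, count in cand_ngrams.items())
    return overlap / sum(cand_ngrams.values())

print(ngram_precision("gale force winds from the west",
                      "gale force winds expected from the west"))
# → 0.8 (4 of 5 candidate bigrams also occur in the reference)
```

Such metrics are cheap to run but must themselves be validated against human judgements of output quality, which is exactly the point of testing evaluation techniques with different cost/time profiles.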
What kind of STEC?
• Don't have an NLG STEC at application level (yet)
• Don't invest millions (yet)
• Don't have a large organisation run it (yet)
• Because:
  • NLG technology isn't ready
  • participation would involve a large investment of money and time
  • not many groups would be able to make that investment
  • we would have to decide on an application, which is potentially divisive
What kind of STEC?
• Do encourage many different shared tasks and subtasks (at least initially)
• Involve many NLG researchers in organising STECs
• Involve SIGGEN; have a steering committee
• Because:
  • diversity in tasks reflects the diversity of the field (NLG just isn't one thing)
  • it's inclusive and representative
  • control stays with the international academic community
Stakeholder STECs
• Similar to SemEval 2007 (Senseval 4)
• As opposed to shareholder STECs like DUC and MT-Eval
• Annual STEC event attached to INLG and ENLG
• Call for task proposals
• Proposers organise and run their own STEC tasks
• A ready test bed for new tasks: popular tasks grow, less popular ones disappear