Cécile Paris, Nathalie Colineau and Ross Wilkinson, CSIRO ICT Centre, Sydney, Australia

NLG Systems Evaluation: Establishing the Big Picture Cécile Paris, Nathalie Colineau and Ross Wilkinson CSIRO ICT Centre Sydney, Australia

What have we learnt about the shared task approach from our siblings (e.g., IR)?
• Advantages
  • Some algorithm & system comparability tests (e.g., inverse frequency works, length normalisation does not)
  • Some shared resources
• (Recognised) Disadvantages
  • It will only tell you some of the things one needs to know: important elements will be missed
  • It does not allow the community to answer some important theoretical and practical questions
  • Too narrow
Note:
• No belief that there is a perfect system
• There are some standards, but not a gold standard

Some beliefs?
• Subtasks and input/output requirements need to be standardised to make core technologies truly comparable.
  • What needs to be evaluated is an approach with its characteristics and its application context.
• To evaluate systems/approaches we need to compare them in a shared task.
  • Comparison is not a requirement for evaluation, nor is a shared task a requirement for comparison.
• Quality of systems equates to quality of output (i.e., English text).
  • A system cannot be reduced to its output; there are other attributes.
• There has to be a gold standard.
  • One measure cannot account for everything (even if we were to look at the quality of the output only).
• One NLG technique works better than another.
  • Vive la différence! There is no such thing as “one size fits all”; typically there are pros and cons.

We can compare apples and pears
• We do it every day for many things (e.g., usefulness of comparisons as found in consumer reports)
• Comparison:
  • Does not require exact similarity
  • But focuses on a set of characteristics/attributes
• Depending on situations and needs, different characteristics are required or favoured over others: fitness for purpose
• We propose a framework in which to describe characteristics of NLG systems, modules or approaches

Example: Buying a car
Hard constraint: must be between $30,000 and $40,000
Set of attributes that characterise a car:
• General
• Safety
• Dimensions
• Engine
• Etc.
Car at RRP $31,990: Manual, Convertible, 2 doors, 4 seats, 82 kW, 1.60 L, Origin: Spain, 2004
• General: Convertible
• Safety: Front side airbag; Brake assist system; Convertible rollover protection; Rain sensor windshield wipers
• Dimensions: 277 mm lower; 600 mm smaller curb to curb turning circle
• Engine: Clutchless manual gearbox
Car at RRP $36,490: Manual, 4WD, 5 doors, 7 seats, 145 kW, 3.50 L, Origin: Korea, 2007
• General: 4 wheel drive; Larger seating capacity
• Safety: Rear window wipers
• Dimensions: 766 mm longer & 160 mm wider
• Engine: 1.9 l larger engine; 2.1 faster acceleration 0-100 km/h; 28 l larger fuel tank

How can we compare (and choose)?
Depends on the criteria of a person (or of a situation)
• Robert’s priorities:
  • Sports car
  • Size: wants a smaller car
  • Safety important
• Bill’s priorities:
  • Needs 4 wheel drive for camping trips
  • Seating capacity: large
Does this mean that one car is better than another? No.
Comparison and evaluation in the abstract are not necessarily meaningful.
What is required is a way, a framework, to describe characteristics.
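The car example can be read as a two-step procedure: a hard constraint filters the candidates, and a person's priorities then decide which remaining car fits best. The Python sketch below is only an illustration of that reasoning. The dictionary keys, the `choose` helper, the "count the satisfied priorities" rule, and the use of "convertible" and "at most 4 seats" as stand-ins for Robert's "sports car" and "smaller car" are all assumptions made here, not part of the slides.

```python
# Illustrative sketch only: keys, labels and the scoring rule are assumptions.
# Prices, seats and body/drive type are taken from the two cars on the slide.
cars = [
    {"label": "car at $31,990", "price": 31990, "convertible": True,
     "four_wheel_drive": False, "seats": 4},
    {"label": "car at $36,490", "price": 36490, "convertible": False,
     "four_wheel_drive": True, "seats": 7},
]


def meets_hard_constraint(car, low=30000, high=40000):
    """Hard constraint from the slide: price between $30,000 and $40,000."""
    return low <= car["price"] <= high


def choose(cars, priorities):
    """Return the car satisfying the most priorities (hypothetical scoring rule)."""
    candidates = [car for car in cars if meets_hard_constraint(car)]
    return max(candidates,
               key=lambda car: sum(1 for wanted in priorities if wanted(car)))


# Priorities encoded as predicates over the attribute description (assumed proxies).
robert = [lambda car: car["convertible"], lambda car: car["seats"] <= 4]
bill = [lambda car: car["four_wheel_drive"], lambda car: car["seats"] >= 7]

print(choose(cars, robert)["label"])
print(choose(cars, bill)["label"])
```

Neither car wins in the abstract: the same attribute descriptions yield different choices once a person's priorities are supplied.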

Can we apply these ideas to generation systems (or modules)?
Example: Generating Referring Expressions (GRE)
• Input: Type (e.g., numerical, semantic)
• Output: Type (e.g., English, logical form); “Quality”; Number of expressions generated
• Fitness into other modules: Place into overall NLG architecture (e.g., requires a text planning or a grammar component)
• Configuration: Availability of parameters to fine-tune (e.g., user model, domain model)
• General: Execution time

An example: Comparison of GRE components
Hard constraint: need referring expressions in English
Candidates: System X GRE module (English language; origin: Lab X) and System Y GRE module (English language; origin: University Y)
Characteristics of one module:
• Input: Type: numerical
• Output: Type: English; Quality: has been shown to allow people to select specific objects in a landscape
• Fitness into other modules: Place into overall NLG architecture: requires a text planning component; no additional lexico-grammatical component needed
• Configuration: Parameters to fine-tune: yes, user model; Requirements: creation of user model
Characteristics of the other module:
• Input: Type: knowledge base
• Output: Type: logical form; Quality: produces appropriate input to a functional grammar
• Fitness into other modules: Place into overall NLG architecture: requires a text planning component; requires a functional grammar for realisation
• Configuration: Parameters to fine-tune: no

Possible situations/criteria (System X, System Y)
• My situation:
  • My input is numerical data
  • I need parameters to fine-tune
• Your situation:
  • You have a domain model available
  • You already have a grammar component
  • You need a GRE to “plug in”
Different systems/approaches will be appropriate.
(A similar debate has taken place for template vs planning: there is no “best” method; it depends on what one needs to do.)
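The same idea can be sketched for GRE modules: each module is described by its characteristics, and a situation is checked against that description. The Python sketch below is a minimal illustration under stated assumptions. The `GREModule` class, its field names, the `fits` function and the labels `module_1` and `module_2` are hypothetical; the extracted slides do not make clear which description belongs to System X and which to System Y, so neutral labels are used. Treating "a domain model is available" as matching a knowledge-base input is also an assumption.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GREModule:
    """Hypothetical description of a GRE module by a few characteristics."""
    name: str
    input_type: str       # e.g. "numerical", "knowledge base"
    output_type: str      # e.g. "English", "logical form"
    needs_grammar: bool   # requires a separate grammar for realisation
    tunable: bool         # exposes parameters to fine-tune (e.g. user model)


# Values follow the two module descriptions on the slide; names are placeholders.
modules = [
    GREModule("module_1", "numerical", "English", False, True),
    GREModule("module_2", "knowledge base", "logical form", True, False),
]


def fits(module, *, input_type, need_tuning, have_grammar):
    """Check whether a module's characteristics match a situation (assumed rule)."""
    if module.input_type != input_type:
        return False
    if need_tuning and not module.tunable:
        return False
    if module.needs_grammar and not have_grammar:
        return False
    return True


# "My situation": numerical input, parameters to fine-tune are needed.
mine = [m.name for m in modules
        if fits(m, input_type="numerical", need_tuning=True, have_grammar=False)]
# "Your situation": domain model available, grammar component already in place.
yours = [m.name for m in modules
         if fits(m, input_type="knowledge base", need_tuning=False, have_grammar=True)]

print(mine)
print(yours)
```

Under these assumptions each situation selects a different module, which is the point of the slide: appropriateness depends on the requirements, not on a global ranking.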

What we need to develop/agree upon
• Comprehensive set of characteristics that describe and specify NLG components and systems
• How to measure them? (when they need to be measured)
  • Might be qualitative or quantitative
  • There might not be a gold standard
  • Might depend on the characteristics (e.g., different measures for fluency, task effectiveness, user satisfaction or cost/ease of building a system)

A framework for evaluation
• Inspired by other work, looking beyond our immediate “siblings”, e.g.,
  • Information systems
    • DeLone and McLean (1992)
    • Cornford et al. (1994)
  • ISO 9126
  • UM (effectiveness)

Need for a more general framework for evaluation
• Enlarge the view of evaluation
• Ensure we have a big picture (avoid dangers of local view)
• Organise the possible criteria/ways to think about the questions to ask
• Guides the experimental work
• Consider NLG in its context and that of its stakeholders
• Consider costs and benefits
• Allows one to choose system/module best fit for purpose
• Allows for specific evaluation tasks, placing them and their results into a larger context

A proposed framework

Refining the characteristics (with our work)

Using the framework to define characteristics -- GRE
New attributes, guided by the framework
What does this allow?
• Choice: Given a requirement, choose the system with characteristics that fit the environment
• Comparison & Evaluation: Given a system/module for specific requirements, evaluation with other systems can be done for a specific characteristic (e.g., user satisfaction, task completion, ease of building required input)

Impact of such a framework
• Way to describe a system (component, approach), giving a better understanding of strengths and weaknesses
• Useful for evaluations and comparisons
• But also in general:
  • Someone needing a component can choose an appropriate one
  • Someone outside the NLG community can choose a module for their own purposes, without knowing much about it, which increases the visibility of the field in other communities
• Way to compare systems (modules, approaches) without the need to standardise
• Fit-for-purpose vs generic: not an issue
• Researchers constrained to work on a specific domain/application can still describe their work and be part of this activity: no exclusion

(Almost final) Remarks
• Big picture
• Funding
• Fine-tuning a system for a specific task: no longer an issue
• Attention to important theoretical problems
• Understanding of weaknesses & strengths of systems (modules, approaches)
Orthogonal issues
• Finding balance between talking and doing
• Shared resources vs shared tasks N/A

Moving forward as a community
• What should we do?
  • Define set of characteristics to:
    • Understand position and specificity of an approach (module, system)
    • Allow descriptions and comparisons
• How?
  • Reflect on our own work and characterise it in terms of its strengths (and weaknesses!), e.g., think about different stakeholders involved in construction, maintenance, funding, etc.
  • Use framework as guidance
    • To understand an approach (module, system) from a variety of perspectives (e.g., not just the output)
    • To know what to evaluate depending on the situation
    • To ensure we see the big picture

References
• Cornford, T., Doukidis, G. I. & Forster, D. (1994). Experience with a structure, process and outcome framework for evaluating an information system. Omega, International Journal of Management Science, 22(5), 491-504.
• DeLone, W. H. & McLean, E. R. (1992). Information Systems Success: The Quest for the Dependent Variable. Information Systems Research, 3(1), 60-96.

Outline
• Misconceptions: what we commonly think is true
• Can we compare apples and pears to get rid of the lemons?
• How does this apply to NLG?
• Enlarging the view of evaluation: “The Big Picture”
• Remarks
• Moving forward
