
Evaluating NLG Systems: A Framework for Comparison

Explore the advantages and disadvantages of shared task approaches in NLG system evaluation. Learn how to compare different systems based on key attributes and criteria for better decision-making.




  1. NLG Systems Evaluation: Establishing the Big Picture
     Cécile Paris, Nathalie Colineau and Ross Wilkinson
     CSIRO ICT Centre, Sydney, Australia

  2. What have we learnt from the shared task approach of our siblings (e.g., IR)?
     • Advantages
       • Some algorithm & system comparability tests (e.g., inverse frequency works, length normalisation does not)
       • Some shared resources
     • (Recognised) Disadvantages
       • It will only tell you some of the things one needs to know; important elements will be missed
       • It does not allow the community to answer some important theoretical and practical questions
       • Too narrow
     Note:
     • No belief that there is a perfect system
     • There are some standards, but not a gold standard

  3. Some beliefs? (each belief is followed by a response)
     • Subtasks and input/output requirements need to be standardised to make core technologies truly comparable.
       • What needs to be evaluated is an approach with its characteristics and its application context.
     • To evaluate systems/approaches we need to compare them in a shared task.
       • Comparison is not a requirement for evaluation, nor is a shared task a requirement for comparison.
     • Quality of systems equates to quality of output (i.e., English text).
       • A system cannot be reduced to its output; there are other attributes.
     • There has to be a gold standard.
       • One measure cannot account for everything (even if we were to look at the quality of the output only).
     • One NLG technique works better than another.
       • Vive la différence! There is no such thing as “one size fits all”; typically there are pros and cons.

  4. We can compare apples and pears
     • We do it every day for many things (e.g., the useful comparisons found in consumer reports)
     • Comparison:
       • Does not require exact similarity
       • But focuses on a set of characteristics/attributes
     • Depending on situations and needs, different characteristics are required or favoured over others: fitness for purpose
     • We propose a framework in which to describe characteristics of NLG systems, modules or approaches

  5. Example: Buying a car
     Hard constraint: must be between $30,000 and $40,000
     Set of attributes that characterise a car: General, Safety, Dimensions, Engine, etc.
     First car: RRP $31,990. Manual, convertible, 2 doors, 4 seats, 82 kW, 1.60 L. Origin: Spain, 2004.
       • General: convertible
       • Safety: front side airbag; brake assist system; convertible rollover protection; rain sensor windshield wipers
       • Dimensions: 277 mm lower; 600 mm smaller curb-to-curb turning circle
       • Engine: clutchless manual gearbox
     Second car: RRP $36,490. Manual, 4WD, 5 doors, 7 seats, 145 kW, 3.50 L. Origin: Korea, 2007.
       • General: 4 wheel drive; larger seating capacity
       • Safety: rear window wipers
       • Dimensions: 766 mm longer & 160 mm wider
       • Engine: 1.9 l larger engine; 2.1 faster acceleration 0-100 km/h; 28 l larger fuel tank

  6. How can we compare (and choose)?
     Depends on the criteria of a person (or of a situation)
     • Robert’s priorities:
       • Sports car
       • Size: wants a smaller car
       • Safety important
     • Bill’s priorities:
       • Needs 4 wheel drive for camping trips
       • Seating capacity: large
     Does this mean that one car is better than another? No.
     Comparison and evaluation in the abstract are not necessarily meaningful.
     What is required is a way (a framework) to describe characteristics.
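The car example can be sketched in a few lines of Python. This is only an illustration, not something from the slides: the numeric weights standing in for Robert's and Bill's priorities are hypothetical, the short car names are made up for readability, and Robert's safety priority is left out to keep the sketch small. The hard constraint (the price band) filters candidates first; the priorities then rank whatever remains.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Car:
    name: str
    price: int              # RRP in dollars
    seats: int
    convertible: bool
    four_wheel_drive: bool


# The two cars from the example; the short names are illustrative.
CARS = [
    Car("convertible", 31_990, seats=4, convertible=True, four_wheel_drive=False),
    Car("4wd", 36_490, seats=7, convertible=False, four_wheel_drive=True),
]

# Hypothetical weights standing in for each buyer's priorities.
ROBERT = {"convertible": 2.0, "seats": -0.1}      # sporty, smaller car
BILL = {"four_wheel_drive": 2.0, "seats": 0.1}    # 4WD, large seating capacity


def within_budget(car, low=30_000, high=40_000):
    """Hard constraint: the price must fall inside the budget."""
    return low <= car.price <= high


def score(car, weights):
    """Weighted sum over the attributes a buyer cares about."""
    return sum(w * float(getattr(car, attr)) for attr, w in weights.items())


def choose(cars, weights):
    """Filter on the hard constraint, then pick the best fit for purpose."""
    candidates = [c for c in cars if within_budget(c)]
    return max(candidates, key=lambda c: score(c, weights))


print(choose(CARS, ROBERT).name)  # convertible
print(choose(CARS, BILL).name)    # 4wd
```

The same two cars come out in a different order for each buyer, which is the point of the slide: neither car is better in the abstract.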

  7. Can we apply these ideas to generation systems (or modules)?
     Example: Generating Referring Expressions (GRE)
     • Input: type (e.g., numerical, semantic)
     • Output: type (e.g., English, logical form); “quality”; number of expressions generated
     • Fitness into other modules: place in the overall NLG architecture (e.g., requires a text planning or a grammar component)
     • Configuration: availability of parameters to fine-tune (e.g., user model, domain model)
     • General: execution time

  8. An example: Comparison of GRE components
     Hard constraint: need referring expressions in English
     Modules compared: System X GRE module (English language; origin: Lab X) and System Y GRE module (English language; origin: University Y)
     Characteristics of one module:
       • Input: type: numerical
       • Output: type: English; quality: has been shown to allow people to select specific objects in a landscape
       • Fitness into other modules (place in the overall NLG architecture): requires a text planning component; no additional lexico-grammatical component needed
       • Configuration: parameters to fine-tune: yes, user model; requirements: creation of user model
     Characteristics of the other module:
       • Input: type: knowledge base
       • Output: type: logical form; quality: produces appropriate input to a functional grammar
       • Fitness into other modules (place in the overall NLG architecture): requires a text planning component; requires a functional grammar for realisation
       • Configuration: parameters to fine-tune: no

  9. Possible situations/criteria (System X, System Y)
     • My situation:
       • My input is numerical data
       • I need parameters to fine-tune
     • Your situation:
       • You have a domain model available
       • You already have a grammar component
       • You need a GRE to “plug in”
     Different systems/approaches will be appropriate.
     (A similar debate has taken place for template vs planning: no “best” method; it depends on what one needs to do.)
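One way to make this kind of matching concrete is to record each module's characteristics as data and test them against a situation's requirements. The sketch below is a hypothetical illustration: `GREModule`, `Situation`, `fits` and `select` are invented names, and `module_a` and `module_b` are placeholders for the two characteristic sets on the comparison slide, since the transcript does not make clear which set belongs to System X and which to System Y. It also assumes that a text planner is available in both situations, that "my" situation can supply a user model, and that "your" domain model counts as knowledge-base input.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GREModule:
    name: str
    input_type: str
    output_type: str
    requires: frozenset     # other components the module depends on
    tunable: bool           # exposes parameters to fine-tune


@dataclass(frozen=True)
class Situation:
    input_type: str
    available: frozenset    # components already in place (or buildable)
    needs_tuning: bool = False


def fits(module, situation):
    """A module fits if its input, dependencies and tunability all match."""
    return (
        module.input_type == situation.input_type
        and module.requires <= situation.available
        and (module.tunable or not situation.needs_tuning)
    )


def select(modules, situation):
    """Names of all modules that fit the given situation."""
    return [m.name for m in modules if fits(m, situation)]


# Placeholder descriptions of the two modules being compared.
MODULES = [
    GREModule("module_a", "numerical", "English",
              frozenset({"text planner", "user model"}), tunable=True),
    GREModule("module_b", "knowledge base", "logical form",
              frozenset({"text planner", "functional grammar"}), tunable=False),
]

# Assumed encodings of the two situations.
mine = Situation("numerical",
                 frozenset({"text planner", "user model"}), needs_tuning=True)
yours = Situation("knowledge base",
                  frozenset({"text planner", "functional grammar"}))

print(select(MODULES, mine))   # ['module_a']
print(select(MODULES, yours))  # ['module_b']
```

Requirement matching is used here rather than a single score because the slide frames the choice as fitness for a situation, not as a ranking of the modules.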

  10. What we need to develop/agree upon
      • Comprehensive set of characteristics that describe and specify NLG components and systems
      • How to measure them? (when they need to be measured)
        • Might be qualitative or quantitative
        • Might not be a gold standard
        • Might depend on the characteristics (e.g., different measures for fluency, task effectiveness, user satisfaction or cost/ease of building a system)

  11. A framework for evaluation
      • Inspired by other work, looking beyond our immediate “siblings”, e.g.,
        • Information systems
          • DeLone and McLean 92
          • Cornford et al. 94
        • ISO 9126
        • UM (effectiveness)

  12. Need for a more general framework for evaluation
      • Enlarge the view of evaluation
      • Ensure we have a big picture (avoid dangers of local view)
      • Organise the possible criteria/ways to think about the questions to ask
      • Guides the experimental work
      • Consider NLG in its context and that of its stakeholders
      • Consider costs and benefits
      • Allows one to choose system/module best fit for purpose
      • Allows for specific evaluation tasks, placing them and their results into a larger context

  13. A proposed framework

  14. Refining the characteristics (with our work)

  15. Using the framework to define characteristics: GRE
      New attributes, guided by the framework
      What does this allow?
      • Choice: given a requirement, choose the system with characteristics that fit the environment
      • Comparison & evaluation: given a system/module for specific requirements, evaluation against other systems can be done for a specific characteristic (e.g., user satisfaction, task completion, ease of building required input)

  16. Impact of such a framework
      • Way to describe a system (component, approach): better understanding of strengths and weaknesses
      • Useful for evaluations and comparisons
      • But also in general:
      • Someone needing a component can choose an appropriate one
      • Someone outside the NLG community can choose a module for their own purposes without knowing much about it: increased visibility of the field in other communities
      • Way to compare systems (modules, approaches) without the need to standardise
      • Fit-for-purpose vs generic: not an issue
      • Researchers constrained to work on a specific domain/application can still describe their work and be part of this activity: no exclusion

  17. (Almost final) Remarks
      • Big picture
      • Funding
      • Fine-tuning a system for a specific task: no longer an issue
      • Attention to important theoretical problems
      • Understanding of weaknesses & strengths of systems (modules, approaches)
      Orthogonal issues
      • Finding balance between talking and doing
      • Shared resources vs shared tasks
      N/A

  18. Moving forward as a community
      • What should we do?
        • Define a set of characteristics to:
          • Understand position and specificity of an approach (module, system)
          • Allow descriptions and comparisons
      • How?
        • Reflect on our own work and characterise it in terms of its strengths (and weaknesses!), e.g., think about different stakeholders involved in construction, maintenance, funding, etc.
        • Use the framework as guidance
          • To understand an approach (module, system) from a variety of perspectives (e.g., not just the output)
          • To know what to evaluate depending on the situation
          • To ensure we see the big picture

  19. References
      • Cornford, T., Doukidis, G. I. & Forster, D. (1994). Experience with a structure, process and outcome framework for evaluating an information system. Omega, International Journal of Management Science, 22(5), 491-504.
      • DeLone, W. H. & McLean, E. R. (1992). Information Systems Success: The Quest for the Dependent Variable. Information Systems Research, 3(1), 60-96.

  20. Outline
      • Misconceptions: what we commonly think is true
      • Can we compare apples and pears to get rid of the lemons?
      • How does this apply to NLG?
      • Enlarging the view of evaluation: “The Big Picture”
      • Remarks
      • Moving forward
