DQR test suites for spoken dialogue system evaluation: A paradigm for a qualitative evaluation

Jean-Yves Antoine, VALORIA, U. Bretagne Sud, Vannes, France
Jérôme Zeiliger, INRS-Telecom, Quebec, Canada
Jean Caelen, CLIPS, Institut IMAG, Grenoble, France

Quantitative evaluation
• Overall performance of the system
• Accuracy rates: system outputs / predefined references
• Advantages
  • Objective evaluation
  • Overall improvements over time
• Drawbacks
  • Lack of predictive power
  • Lack of genericity

Predictability: some questions
• Overall accuracy rate of the system
  • How does it depend on the performance of its components?
• Overall accuracy rate of a specific component
  • How does it depend on the testing data?
  • How does it depend on the application?
  • How can it guide future improvements?

Predictability: a solution
• Quantitative evaluation: assessment of the overall improvements of the technology
• Qualitative evaluation: appropriateness to a specific task / application
Evaluation of the system’s behaviour on EVERY specific phenomenon: PREDICTABILITY

DQR methodology
• Qualitative evaluation in NLP: TSNLP, FRACAS, AUPELF-UREF
• DQR test suites
  • Declaration D: the utterance the system should understand. D concerns a specific phenomenon
    Peter is attending a meeting. He is to chair it.
  • Question Q: assesses the understanding of D
    Is Peter to chair a meeting?
  • Reply R: [Yes] / [No]
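A D/Q/R triple can be held in a small record. The following is a minimal sketch in Python, not something taken from the slides: the class name `DQRItem`, its field names, the phenomenon label and the expected reply for the example are all illustrative assumptions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DQRItem:
    """One DQR test item (hypothetical container; all names are assumptions)."""
    phenomenon: str   # the specific phenomenon that D is meant to probe
    declaration: str  # D: the utterance the system should understand
    question: str     # Q: assesses the understanding of D
    reply: bool       # R: expected answer, True for [Yes], False for [No]

    def reply_label(self) -> str:
        """Render the expected reply in the slides' [Yes] / [No] notation."""
        return "[Yes]" if self.reply else "[No]"


item = DQRItem(
    phenomenon="anaphora",  # assumed label; the slide does not name the phenomenon
    declaration="Peter is attending a meeting. He is to chair it.",
    question="Is Peter to chair a meeting?",
    reply=True,             # assumed expected reply; the slide only lists [Yes] / [No]
)
print(item.reply_label())
```

Keeping the phenomenon on each item is what later allows an accuracy rate to be reported per phenomenon rather than as one overall figure.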

DQR evaluation and speech: extensions of the DQR methodology
• Specificity of the spoken language interaction
• Specificity of the speech technologies
• Structural analysis: spontaneous, unexpected structures
• Dialogue strategy
• Practical adaptation of the DQR test suites

Multi-level evaluation
• Speech understanding
  • Literal understanding (structural analysis)
  • Implicit understanding (anaphora, ellipses)
  • Inference
    • common-sense reasoning (logical inferences)
    • pragmatic reasoning
    • multiple-turn inferences
• Dialogue
  • Speech act interpretation (intention in action)
  • Speaker’s intention recognition (preliminary intention)
  • Relevance
    • reply of the system
    • dialogue strategy

Practical achievement
• Simplicity of the question Q
  (D) I need to go to Granada tomorrow morning
  (Q) Go to Granada
  (R) [Yes]
• Simplicity of the evaluation
  • Computation of the answer: mere unification, R_system = UNIF(D, Q)
  • Accuracy rate: specific to each phenomenon
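The computation R_system = UNIF(D, Q) can be sketched as follows, assuming the system's internal representations of D and Q are nested attribute-value structures. The slides do not specify any representation, so the dictionary encoding, the attribute names, and the functions `unify` and `system_reply` are hypothetical.

```python
from typing import Any, Optional


def unify(a: Any, b: Any) -> Optional[Any]:
    """Unify two feature structures; return the merged structure, or None on a clash."""
    if isinstance(a, dict) and isinstance(b, dict):
        result = dict(a)
        for key, b_value in b.items():
            if key in result:
                merged = unify(result[key], b_value)
                if merged is None:
                    return None  # conflicting values under the same attribute
                result[key] = merged
            else:
                result[key] = b_value
        return result
    # Atomic values (or a dict against an atom) unify only if they are equal.
    return a if a == b else None


def system_reply(d_repr: dict, q_repr: dict) -> str:
    """R_system = UNIF(D, Q): [Yes] if the two representations unify, else [No]."""
    return "[Yes]" if unify(d_repr, q_repr) is not None else "[No]"


# Assumed representation of (D) "I need to go to Granada tomorrow morning"
d = {"act": "go", "destination": "Granada", "date": "tomorrow", "period": "morning"}
# Assumed representation of (Q) "Go to Granada"
q = {"act": "go", "destination": "Granada"}
# A made-up clashing question, for illustration only
q_clash = {"act": "go", "destination": "Madrid"}

print(system_reply(d, q))
print(system_reply(d, q_clash))
```

Plain unification only detects conflicting values. Whether a question mentioning an attribute absent from D should also yield [No] is a design decision this sketch leaves open.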

Genericity
• Unification of the intrinsic representations of the system
• No predefined references
• No common representations
• Complete independence

Predictability: literal understanding
• Key information retrieval
  (D) I need to go to Granada tomorrow morning
  (Q) Go to Granada
  (R) [Yes]
• Sharper understanding
  (D) Turn on right after the building with the red shutters
  (Q) Red shutters
  (R) [Yes]
  (Q) Building with shutters
  (R) [Yes]

Predictability: negative tests
• Positive tests: tracking the errors
• Negative tests: explaining the errors
• Example: literal understanding
  (D) Turn on right after the building with the red shutters
  (Q) Red building
  (R) [No]
  (D) Move the circle and the triangle on the right
  (Q) Move the right triangle
  (R) [No]
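Positive and negative tests can be tallied separately for each phenomenon, which keeps the accuracy rate specific to the phenomenon. The sketch below is one possible bookkeeping: the function `accuracy_by_phenomenon` and the tuple layout are assumptions, and the system replies in the example are made up purely to show the computation.

```python
from collections import defaultdict


def accuracy_by_phenomenon(results):
    """Compute accuracy per phenomenon, split into positive and negative tests.

    results: iterable of (phenomenon, expected_reply, system_reply), where the
    replies are booleans (True for [Yes], False for [No]).
    A rate is None when no test of that polarity exists for the phenomenon.
    """
    counts = defaultdict(lambda: {"positive": [0, 0], "negative": [0, 0]})
    for phenomenon, expected, got in results:
        polarity = "positive" if expected else "negative"
        bucket = counts[phenomenon][polarity]
        bucket[0] += int(got == expected)  # correct replies
        bucket[1] += 1                     # tests seen
    return {
        phenomenon: {
            polarity: (correct / total if total else None)
            for polarity, (correct, total) in by_polarity.items()
        }
        for phenomenon, by_polarity in counts.items()
    }


# Expected replies follow the slides; the system replies are invented.
results = [
    ("literal understanding", True, True),    # (Q) Red shutters
    ("literal understanding", True, True),    # (Q) Building with shutters
    ("literal understanding", False, True),   # (Q) Red building, wrongly accepted
    ("literal understanding", False, False),  # (Q) Move the right triangle
]
rates = accuracy_by_phenomenon(results)
print(rates)
```

With these invented replies, a perfect positive rate next to a lower negative rate would suggest that the system retrieves key words but does not check how they attach to each other.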

Predictability: spoken constructions
• Repetitions, self-corrections
  (D) I want to leave tomorrow evening … no sorry … morning
  (Q) Tomorrow morning
  (R) [Yes]
• Word-order alterations
  (D) On the right of the circle, draw a red triangle
  (Q) Draw a circle
  (R) [No]

Conclusion
• A predictive and generic paradigm of evaluation
  • Already in use in NLP (FRACAS, 1996)
• Adaptable to spoken language understanding
  • AUPELF-UREF French-speaking evaluation
• Adaptable to spoken dialogue?
  • Lack of interactive abilities of the present systems
