Is Text-to-Speech Synthesis Ready for use in CALL?

Zöe Handley, Learning Sciences Research Institute (LSRI), University of Nottingham, and Marie-Josée Hamel, Department of French, Dalhousie University • CALL 2008, Antwerp, Belgium

Plan • TTS synthesis in CALL • Evaluation • Requirements analysis • Readiness of TTS synthesis for CALL • Conclusions

TTS synthesis • What is TTS synthesis? • Speech synthesis “systems … allow the generation of novel messages, either from scratch (i.e. entirely by rule) or by re-combining shorter pre-stored units” (van Bezooijen and van Heuven, 1997: 709) • Text-to-Speech Synthesis systems allow the automatic generation of speech from text • Why use TTS in CALL? • There is a general need in language learning and teaching for “self-paced interactive learning environments” which provide “controlled interactive speaking practice outside the classroom” (Ehsani and Knodt, 1998: 45) • Demo: http://www.acapela-group.com/text-to-speech-interactive-demo.html (Graham, Lucy)

CALL Applications • Reading machine • Talking dictionaries, texts (de Pijper, 1997; Hamel, 2003), word processors and conjugators; dictations (Santiago-Oriola, 1999; Mercier et al., 1999); grapheme-phoneme exercises • Pronunciation model • Auditory discrimination; repetition (Hamel, 1998; Mercier et al., 2000) • Conversational partner • In combination with automatic speech recognition and speech understanding, the generative power of TTS synthesis can be harnessed to provide learners with interactive speaking practice, i.e. a dialogue partner (Raux and Eskenazi, 2004; Seneff et al., 2004) • [Screenshot: Oxford Hachette 4 French Dictionary on CD-ROM]

Benefits of TTS synthesis • Improvements on other media • Easy creation and editing of speech samples • Simultaneous presentation of text and speech • Low storage requirements • Non-human and therefore perceived as non-judgemental • Adds value • Generation of examples on demand (Sherwood, 1981) and therefore the automatic generation of feedback, conversational turns, and exercises with speech models

Why evaluation? • Few CALL applications integrating TTS synthesis are available on the market • Few evaluations of TTS synthesis for the purposes of CALL have been conducted • Since the failure of the language laboratory, teachers have been sceptical about unevaluated technologies • TTS synthesis is being used in CALL in roles in which it has not been used outside CALL; the most common, perhaps only, role that TTS synthesis assumes outside CALL is that of a reading machine

Framework for the evaluation of TTS synthesis for use in CALL (Handley and Hamel, 2005) • Basic research evaluation of TTS synthesis for use in CALL • Viability and potential benefits of the use of TTS synthesis in CALL • Technology evaluation of TTS synthesis for use in CALL • Adequacy of TTS synthesis for use in CALL • Judgemental evaluation of the CALL application • Potential of the CALL program to provide ideal conditions for SLA • Judgemental evaluation of the teacher-planned activity • Potential of the planned activity to provide ideal conditions for SLA • Usage evaluation of the teacher-planned activity • Learner’s performance in the planned activity • This is a combination of the levels of evaluation recommended by Chapelle (2001) for the evaluation of CALL activities and by ELSE (1999) for the evaluation of Speech and Language Technologies (SALT).

Evaluations of TTS Synthesis for CALL • Technology evaluations of TTS synthesis for use in CALL • Stratil et al. (1987) • Evaluated the quality of a Spanish TTS chip for the presentation of grammar exercises in a language laboratory. • Usage evaluation of the teacher-planned activity • Outcome-oriented • Santiago-Oriola (1999) • Evaluated the use of a French TTS synthesiser for the presentation of dictation exercises. • Hincks (2002) • Evaluated the use of a Swedish TTS synthesiser in combination with a speech editor (re-synthesis) for teaching the lexical stress of English to Swedophones. • Process-oriented • Cohen (1993) • Evaluated the use of a talking word processor to support literacy activities, namely writing stories, for young learners of French as a second language.

The evaluation process • ISO (1999) and EAGLES (1999) guidelines • Establish the evaluation requirements: establish the purpose of the evaluation; identify the types of products to be evaluated; specify the quality model • Specify the evaluation: select metrics; establish rating levels for metrics; establish criteria for assessment • Design the evaluation • Execute the evaluation • CALL requirements: “When the language competence of the system begins to outstrip that of some of the better second language users, such systems become useful adjunct tools” (Keller and Zellner-Keller, 2000) • Requirements analysis
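The "specify the evaluation" step (select metrics, establish rating levels, establish criteria for assessment) can be illustrated with a minimal Python sketch. The threshold values, level names, and function names below are hypothetical and are not taken from the ISO or EAGLES guidelines or from this study; they only show how the three sub-steps fit together.

```python
from statistics import mean

# Hypothetical rating levels for a 5-point scale (assumed, not from the slides).
RATING_LEVELS = [
    (4.0, "satisfactory"),
    (3.0, "marginal"),
    (0.0, "unsatisfactory"),
]


def rating_level(score):
    """Map a mean score onto the first rating level whose threshold it reaches."""
    for threshold, label in RATING_LEVELS:
        if score >= threshold:
            return label
    raise ValueError("score below the lowest rating level")


def assess(metric_scores, required=("comprehensibility",)):
    """Return the rating level per metric and whether the criterion is met.

    metric_scores maps a metric name to a list of individual ratings.
    The assumed criterion: every required metric must be 'satisfactory'.
    """
    levels = {name: rating_level(mean(scores))
              for name, scores in metric_scores.items()}
    accepted = all(levels[name] == "satisfactory" for name in required)
    return levels, accepted


levels, accepted = assess({
    "comprehensibility": [5, 4, 4],
    "naturalness": [3, 3, 2],
})
```

Making `required` a parameter reflects the idea that different CALL roles may impose different criteria on the same set of metrics.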

CALL requirements analysis • Ideal conditions for Second Language Acquisition (SLA) (Chapelle, 2001) • Language learning potential • Goals of SLA • Communicative competence • Quality of the output • Primary requirement: Comprehensibility/intelligibility • Secondary requirements: Accuracy and naturalness • At both the level of individual speech sounds and the prosodic level • Focus on form • Flexibility • Speech rate, pitch

Explorative investigation (Handley and Hamel, 2005) • Research questions • Do the different roles identified impose different requirements on the quality of speech synthesis? • Does comprehensibility account for acceptability for use in CALL? • Method • 17 French teachers • One research TTS system, FIPSvox from the University of Geneva • 3 roles: (1) reading machine, (2) pronunciation model, and (3) conversational partner • Likert scales: (1) comprehensibility, (2) acceptability, and (3) appropriateness • Word pointing paradigm (van Santen, 1993) • Results • Most suitable as a dialogue partner. Least suitable as a pronunciation model. • Comprehensibility is not the only requirement. Accuracy and naturalness matter as do register and expressiveness.

Is TTS synthesis ready for use in CALL? • Research questions • Do the different roles identified impose different requirements on the quality of speech synthesis? • Is TTS synthesis ready for use in CALL? • Design • Within subjects, N = 17, French Teachers • Dependent variables • Quality of the speech output • Acceptability • Adequacy • Independent variables • Role of TTS in CALL: (1) Reading Machine (RM), (2) Pronunciation Model (PM) at the (a) segmental level and (b) suprasegmental level, and (3) Conversational Partner (CP) • TTS synthesis system

Systems evaluated • http://www.research.att.com/~ttsweb/tts/demo.php#top (French, English) • http://212.8.184.250/tts/demo_login.jsp (French, English) • http://www.multitel.be/TTS/layout.php?page=eLite_demo (French, English) • http://www.acapela-group.com/text-to-speech-interactive-demo.html (French, English)

Questionnaire • MOS-CALL • ITU-T Overall Quality Test • MOS-X (Polkosky and Lewis, 2003) • [Screenshot: on-line presentation of questionnaire]

Is TTS synthesis ready for use in CALL? • [Chart: mean ratings of adequacy] • Different TTS synthesis systems are most suitable for use in different roles • Reinforces the need to evaluate every TTS synthesis system • System 4 is ready for use in all applications where TTS synthesis adds value • [Chart: mean ratings of acceptability]
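As a rough illustration of how mean ratings per system and role could be tabulated from questionnaire responses, here is a minimal Python sketch. The tuple layout, the function names, and the sample ratings are assumptions made for illustration; they are not the study's data or analysis code. The role labels RM and CP follow the abbreviations used in the design slide.

```python
from collections import defaultdict
from statistics import mean


def mean_ratings(responses):
    """Average ratings per (system, role) pair.

    responses: iterable of (system, role, rating) tuples, one per rater judgement.
    """
    grouped = defaultdict(list)
    for system, role, rating in responses:
        grouped[(system, role)].append(rating)
    return {key: mean(values) for key, values in grouped.items()}


def best_system_per_role(means):
    """For each role, keep the system with the highest mean rating."""
    best = {}
    for (system, role), score in means.items():
        if role not in best or score > best[role][1]:
            best[role] = (system, score)
    return best


# Hypothetical ratings, for illustration only.
sample = [
    ("System A", "RM", 4), ("System A", "RM", 2),
    ("System B", "RM", 5), ("System B", "RM", 4),
    ("System A", "CP", 5), ("System B", "CP", 3),
]
print(best_system_per_role(mean_ratings(sample)))
```

Keeping the role in the grouping key is what lets a different system come out on top for each role, which is the pattern the slide reports.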

System 1: AT&T Next-Gen (Alain) • [Chart: mean ratings of quality of output]

System 2: Nuance Vocalizer (Julie) • [Chart: mean ratings of quality of output]

System 3: eLite (Vincent) • [Chart: mean ratings of quality of output]

Do the different roles have different requirements? • [Chart: mean ratings of adequacy] • Differences in adequacy were statistically significant for systems 2 and 4 (χ²r = 8.010, df = 3, p = 0.046; χ²r = 8.063, df = 3, p = 0.045, respectively) • But not for systems 1 and 3 (χ²r = 2.352, df = 3, p = 0.503; χ²r = 3.467, df = 3, p = 0.325, respectively) • Differences in acceptability were not significant (system 1: χ²r = 6.616, df = 3, p = 0.085; system 2: χ²r = 6.303, df = 3, p = 0.098; system 3: χ²r = 3.194, df = 3, p = 0.363; system 4: χ²r = 5.547, df = 3, p = 0.163) • [Chart: mean ratings of acceptability]
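The χ²r notation and the within-subjects design suggest a Friedman test across the four role conditions (hence df = 3), although the slides do not name the test. Under that assumption, the following Python sketch shows how such a statistic and its chi-square p-value could be computed. It uses the textbook formula without a tie correction, which statistical packages usually add for Likert data, so it is an illustration rather than a reproduction of the authors' analysis. All function names are hypothetical.

```python
import math


def average_ranks(values):
    """Rank values from 1..k, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for pos in range(i, j + 1):
            ranks[order[pos]] = shared
        i = j + 1
    return ranks


def friedman_statistic(ratings):
    """Friedman chi-square (no tie correction).

    ratings: one row per rater, one column per condition (e.g. role).
    """
    n = len(ratings)
    k = len(ratings[0])
    rank_sums = [0.0] * k
    for row in ratings:
        for col, rank in enumerate(average_ranks(row)):
            rank_sums[col] += rank
    return 12.0 / (n * k * (k + 1)) * sum(s * s for s in rank_sums) - 3.0 * n * (k + 1)


def chi2_sf_df3(x):
    """Upper-tail probability of a chi-square variable with 3 degrees of freedom."""
    return math.erfc(math.sqrt(x / 2)) + math.sqrt(2 * x / math.pi) * math.exp(-x / 2)


# p-values for the adequacy statistics reported above (df = 3).
for system, stat in [(1, 2.352), (2, 8.010), (3, 3.467), (4, 8.063)]:
    print(system, round(chi2_sf_df3(stat), 3))
```

The closed-form tail probability only holds for 3 degrees of freedom; a design with a different number of conditions would need a general chi-square routine.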

Conclusions • Some French TTS synthesis systems are reaching readiness for use in CALL in applications where TTS synthesis adds value • To fully meet the requirements of CALL, more attention needs to be paid to accuracy and naturalness, in particular at the prosodic level, and to expressiveness • Expressive speech synthesis is the focus of much current research (Campbell et al., 2006) • This may not be the case for all languages; different languages pose different problems for TTS • It will not be long before learners are able to benefit from the support of an untiring, non-judgemental substitute native speaker 24/7 in CALL applications.
