Non-Native Users in the Let’s Go!! Spoken Dialogue System: Dealing with Linguistic Mismatch Antoine Raux & Maxine Eskenazi, Language Technologies Institute, Carnegie Mellon University
Background • Speech-enabled systems use models of the user’s language • Such models are tailored for native speech • Large performance loss for non-native users who do not follow typical native patterns
Previous Work on Non-Native Speech Recognition • Assumes knowledge about/data from a specific non-native population • Often based on read speech • Focuses on acoustic mismatch: • Acoustic adaptation • Multilingual acoustic models
Linguistic Particularities of Non-Native Speakers • Non-native speakers might use different lexical and syntactic constructs • Non-native speakers are in a dynamic process of L2 acquisition
Outline of the Talk • Baseline system and data collection • Study of non-native/native mismatch and effect of additional non-native data • Adaptive lexical entrainment
The CMU Let’s Go!! System: Bus Schedule Information for the Pittsburgh Area • ASR: Sphinx II • Parsing: Phoenix • Hub: Galaxy • Dialogue Management: RavenClaw • Speech Synthesis: Festival • NLG: Rosetta
Data Collection • Baseline system accessible since February 2003 • Experiments with scenarios • Publicized the phone number inside CMU in Fall 2003
Data • Directed experiments: 134 calls • 17 non-native speakers (5 from India, 7 from Japan, 5 others) • Spontaneous: 30 calls • Total: 1768 utterances • Evaluation Data: • Non-Native: 449 utterances • Native: 452 utterances
Speech Recognition Baseline • Acoustic Models: • semi-continuous HMMs (codebook size: 256) • 4000 tied states • trained on CMU Communicator data • Language Model: • class-based backoff 3-gram • trained on 3074 utterances from native calls
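The class-based backoff n-gram described above can be sketched roughly as follows. This is a toy stand-in, not the system’s actual model: the class inventory, the stupid-backoff-style weighting, and the add-one unigram smoothing are all simplifying assumptions.

```python
import math
from collections import Counter

# Hypothetical word->class map; the real system's classes (stops, routes,
# times, etc.) are assumptions here.
CLASSES = {"airport": "[PLACE]", "downtown": "[PLACE]", "61c": "[ROUTE]"}

def to_classes(sent):
    return [CLASSES.get(w.lower(), w.lower()) for w in sent.split()]

class BackoffTrigramLM:
    def __init__(self, sents, alpha=0.4):
        self.alpha = alpha  # backoff weight (a simplification of true Katz backoff)
        self.uni, self.bi, self.tri = Counter(), Counter(), Counter()
        for s in sents:
            toks = ["<s>", "<s>"] + to_classes(s) + ["</s>"]
            for i in range(len(toks)):
                self.uni[toks[i]] += 1
                if i >= 1:
                    self.bi[(toks[i-1], toks[i])] += 1
                if i >= 2:
                    self.tri[(toks[i-2], toks[i-1], toks[i])] += 1
        self.total = sum(self.uni.values())

    def prob(self, w2, w1, w):
        # Trigram if seen, else back off to bigram, else smoothed unigram.
        if self.tri[(w2, w1, w)]:
            return self.tri[(w2, w1, w)] / self.bi[(w2, w1)]
        if self.bi[(w1, w)]:
            return self.alpha * self.bi[(w1, w)] / self.uni[w1]
        return self.alpha ** 2 * (self.uni[w] + 1) / (self.total + len(self.uni))

    def perplexity(self, sent):
        toks = ["<s>", "<s>"] + to_classes(sent) + ["</s>"]
        logp = sum(math.log(self.prob(toks[i-2], toks[i-1], toks[i]))
                   for i in range(2, len(toks)))
        return math.exp(-logp / (len(toks) - 2))
```

Class-based models like this let the LM generalize across place and route names it has rarely seen in context, which matters with only ~3000 training utterances.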
Speech Recognition Results • Word Error Rate (figures shown in a chart, not transcribed) • Causes of discrepancy: • Acoustic mismatch (accent) • Linguistic mismatch (word choice, syntax)
Language Model Performance • Evaluation on transcripts • Initial model: 3074 native utterances
Language Model Performance • Adding non-native data: 3074 native + 1308 non-native utterances (chart compares the initial native model with the mixed model)
Natural Language Understanding • Grammar manually written incrementally, as the system was being developed • Initially built with native speakers in mind • Phoenix: robust parser (less sensitive to non-standard expressions)
Grammar Coverage • Initial grammar: manually written for native utterances
Grammar Coverage • Grammar designed to accept some non-native patterns: • “reach” = “arrive” • “What is the next bus?” = “When is the next bus?”
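Robust slot-based parsing of the kind Phoenix performs can be illustrated with a toy stand-in: each slot has a pattern, and words that match no slot are simply skipped instead of causing a parse failure. The slot names and regex patterns below are illustrative assumptions, not the system’s actual grammar; note how “reach” is accepted alongside “arrive”.

```python
import re

# Illustrative slot patterns, including the non-native equivalences
# mentioned above ("reach" = "arrive", "what is" = "when is").
SLOTS = {
    "query_arrival": r"\b(arrive|reach)\b",
    "query_next_bus": r"\b(what|when) is the next bus\b",
    "place": r"\b(airport|downtown|forbes avenue|craig street)\b",
}

def robust_parse(utterance):
    """Extract whatever slots match; ignore everything else (fillers,
    fragments, unknown words) rather than rejecting the utterance."""
    text = utterance.lower()
    parsed = {}
    for slot, pattern in SLOTS.items():
        m = re.search(pattern, text)
        if m:
            parsed[slot] = m.group(0)
    return parsed
```

Because unmatched material is ignored, a disfluent utterance like “uh when is the next bus to the airport please” still yields usable slots, which is why a robust parser is less sensitive to non-standard (including non-native) expressions.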
Adaptive Lexical Entrainment • “If you can’t adapt the system, adapt the user” • System should use the same expressions it expects from the user • But non-native speakers might not master all target expressions • Use expressions that are close to the non-native speaker’s language • Use prosody to stress incorrect words
Adaptive Lexical Entrainment: Example • User: “I want to go the airport” • System: “Did you mean: I want to go TO the airport?”
Adaptive Lexical Entrainment: Algorithm • The ASR hypothesis (“I want to go the airport”) is aligned against a set of target prompts (e.g. “I’d like to go to the airport”, “I want to go to the airport”) using DP-based alignment • Prompt selection picks the closest target (“I want to go to the airport”) • Emphasis is placed on the differing words, yielding the confirmation prompt: “Did you mean: I want to go to the airport?”
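The DP-based alignment, prompt selection, and emphasis steps can be sketched as below. This is a minimal reconstruction from the slides, not the authors’ implementation: word-level edit distance stands in for the alignment, and uppercase marks the words that would receive prosodic emphasis.

```python
def align(hyp, tgt):
    """Edit-distance DP over word sequences; returns the cost and the set
    of target-word positions that differ from the hypothesis."""
    n, m = len(hyp), len(tgt)
    D = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        D[i][0] = i
    for j in range(1, m + 1):
        D[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = D[i-1][j-1] + (hyp[i-1] != tgt[j-1])
            D[i][j] = min(sub, D[i-1][j] + 1, D[i][j-1] + 1)
    # Backtrace: collect target positions that were substituted or inserted.
    i, j, diff = n, m, set()
    while i > 0 or j > 0:
        if i > 0 and j > 0 and D[i][j] == D[i-1][j-1] + (hyp[i-1] != tgt[j-1]):
            if hyp[i-1] != tgt[j-1]:
                diff.add(j - 1)
            i, j = i - 1, j - 1
        elif j > 0 and D[i][j] == D[i][j-1] + 1:
            diff.add(j - 1)  # word present in target but not hypothesis
            j -= 1
        else:
            i -= 1
    return D[n][m], diff

def entrainment_prompt(asr_hyp, targets):
    """Select the closest target prompt; emphasize (uppercase) the words
    that differ from the ASR hypothesis. Returns None on an exact match."""
    hyp = asr_hyp.lower().split()
    best = min(targets, key=lambda t: align(hyp, t.lower().split())[0])
    cost, diff = align(hyp, best.lower().split())
    if cost == 0:
        return None  # no mismatch, no entrainment prompt needed
    words = [w.upper() if k in diff else w for k, w in enumerate(best.split())]
    return "Did you mean: " + " ".join(words) + "?"
```

On the slide’s example, the hypothesis “I want to go the airport” selects the target “I want to go to the airport” and emphasizes the missing “to”, reproducing the confirmation prompt shown above.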
Adaptive Lexical Entrainment: Study • 24 dialogues, 12.4 turns/dialogue, 2.5 entrainment prompts/dialogue • Entrainment triggered whenever the best target does not perfectly match the ASR hypothesis • Too simple? 40.7% false positives • 64.4% of prompts are triggered by misrecognitions
Conclusion • LM and grammar perform significantly worse on non-native speech • Including non-native data reduces the gap in linguistic metrics but not WER • Future work: • Improve speech recognition • More data for the LM • Acoustic/lexical adaptation • Integrate adaptive lexical entrainment into the dialogue management framework
Example Dialogue S: Welcome to the Let’s Go!! bus information system. What can I do for you? U: Excuse me. I’m in the place which uh where the Forbes Avenue and Murray Avenue are crossing and I want to go to the place mm Forbes where the Forbes Avenue and Craig Street are crossing. S: I heard “SIX”. Is that correct? U: Which bus uh should I take? Uh sorry uh maybe 61C bus I have… What time the next 61C bus will come? S: The time is currently 5:37 PM. Where are you leaving from? …
Example Dialogue S: Welcome to the Let’s Go!! bus information system. What can I do for you? U: I want to go to downtown now I’m at Fifth and Bigelow. S: Sorry, did you say “I want to go to downtown”? U: Which number bus can I take and when will the next bus come here? S: ... What can I do for you? …