Dagstuhl 2000 Pervasive Speech and Language Technology. Wolfgang Wahlster. German Research Center for Artificial Intelligence, DFKI GmbH Stuhlsatzenhausweg 3 66123 Saarbruecken, Germany phone: (+49 681) 302-5252/4162 fax: (+49 681) 302-5341 e-mail: wahlster@dfki.de
WWW: http://www.dfki.de/~wahlster
Pervasive Speech and Language Technology
• Speech-controlled coffee machine: "A cappuccino in 10 minutes, please!"
• Speech-based car navigation: "Let's go to Baker Street in Berkeley!"
• Speech-enabled music selection: "I would like to hear Mozart's piano concerto No. 3!"
• Dictation: "Send the following email to Mark Maybury: Hi Mark, please forward the following agenda to your project partners!"
Pervasive Speech and Language Technology
• Information on demand: "Show me all CNN news of the last 3 months that feature Bill Clinton discussing health care!"
• Audio Mining: "What has Jim Hendler said about DAML during our recent Dagstuhl seminar?"
• Speech-to-Speech Translation: "I would like to make an appointment with Dr. Kurematsu in Kyoto next week!"
Three Levels of Language Processing
Speech Input
• Speech Recognition (Acoustic Language Models, Word Lists): What has the speaker said? 100 alternatives
• Speech Analysis (Grammar, Lexical Meaning): What has the speaker meant? 10 alternatives
• Speech Understanding (Discourse Context, Knowledge about Domain of Discourse): What does the speaker want? Unambiguous understanding in the dialog context
Reduction of Uncertainty
Challenges for Language Engineering
Increasing complexity along four dimensions:
• Input Conditions: Close-Speaking Microphone/Headset, Push-to-talk -> Telephone, Pause-based Segmentation -> Open Microphone, GSM Quality
• Naturalness: Isolated Words -> Read Continuous Speech -> Spontaneous Speech
• Adaptability: Speaker Dependent -> Speaker Independent -> Speaker adaptive
• Dialog Capabilities: Monolog Dictation -> Information-seeking Dialog -> Multiparty Negotiation
Verbmobil: Open Microphone with GSM Quality, Spontaneous Speech, Speaker adaptive, Multiparty Negotiation
Context-Sensitive Speech-to-Speech Translation
• "Wann fährt der nächste Zug nach Hamburg ab?" -> "When does the next train to Hamburg depart?"
• "Wo befindet sich das nächste Hotel?" -> "Where is the nearest hotel?"
Verbmobil Server
Mobile Speech-to-Speech Translation of Spontaneous Dialogs
As the name Verbmobil suggests, the system supports verbal communication with foreign dialog partners in mobile situations:
1 face-to-face conversations
2 telecommunication
Mobile Speech-to-Speech Translation of Spontaneous Dialogs
Solution: Conference Call. The Verbmobil Speech Translation Server is accessed by GSM mobile phones.
General Speech Recognition Task
Audio Signal -> Recognizers (German, English, Japanese) -> Word Hypotheses Graph
Word Hypotheses Graphs (WHGs)
WHGs realize the interface between acoustic and linguistic processing.
Figure labels: Edge = Word, Acoustic Score, Best Hypothesis
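To make the idea of a best hypothesis in a word hypotheses graph concrete, here is a minimal sketch. It assumes a toy graph whose nodes are integer time points, whose edges each carry one word and an integer cost (lower meaning a better acoustic score), and it uses a plain shortest-path pass over the time-ordered nodes. The function name `best_path`, the edge layout and the costs are illustrative assumptions, not the Verbmobil data format.

```python
from collections import defaultdict

def best_path(edges, start, end):
    """Return (total_cost, words) for the cheapest path from start to end.

    edges: list of (from_node, to_node, word, cost) with from_node < to_node.
    Nodes are integer time points; a lower cost stands for a better score.
    Returns None if end cannot be reached.
    """
    outgoing = defaultdict(list)
    for src, dst, word, cost in edges:
        outgoing[src].append((dst, word, cost))

    # best[node] holds the cheapest (cost, words) found so far for that node.
    best = {start: (0, [])}
    # Edges only point forward in time, so visiting nodes in sorted order
    # guarantees a node is final before its outgoing edges are relaxed.
    for node in sorted({e[0] for e in edges} | {start}):
        if node not in best:
            continue
        cost_so_far, words = best[node]
        for dst, word, cost in outgoing[node]:
            candidate = (cost_so_far + cost, words + [word])
            if dst not in best or candidate[0] < best[dst][0]:
                best[dst] = candidate
    return best.get(end)

# Hypothetical toy graph with competing word hypotheses.
edges = [
    (0, 1, "I", 10),
    (0, 1, "eye", 25),
    (1, 2, "need", 12),
    (1, 3, "needle", 40),
    (2, 3, "a", 8),
]
print(best_path(edges, 0, 3))
```

The alternatives that lose here ("eye", "needle") stay in the graph, which is the point of passing a graph rather than a single string to linguistic processing: later stages can still pick them if grammar or context favors them.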
Massive Data Collection Efforts
3,200 dialogs (182 hours) with 1,658 speakers; 79,562 turns distributed on 56 CDs, 21.5 GB
The so-called Partitur (German word for musical score) orchestrates fifteen strata of annotations:
• Transliteration Variant 1
• Transliteration Variant 2
• Lexical Orthography
• Canonical Pronunciation
• Manual Phonological Segmentation
• Automatic Phonological Segmentation
• Word Segmentation
• Prosodic Segmentation
• Dialog Acts
• Noises
• Superimposed Speech
• Syntactic Category
• Word Category
• Syntactic Function
• Prosodic Boundaries
Extracting Statistical Properties from Large Corpora
Corpora:
• Transcribed Speech Data
• Segmented Speech with Prosodic Labels
• Treebanks & Predicate-Argument Structures
• Annotated Dialogs with Dialog Acts
• Aligned Bilingual Corpora
Machine Learning for the Integration of Statistical Properties into Symbolic Models for Speech Recognition, Parsing, Dialog Processing, Translation
Models:
• Hidden Markov Models
• Neural Nets, Multilayered Perceptrons
• Probabilistic Automata
• Probabilistic Grammars
• Probabilistic Transfer Rules
From a Multi-Agent Architecture to a Multi-Blackboard Architecture
Multi-Agent Architecture (modules M1 to M6 connected directly):
• Each module must know which module produces what data
• Direct communication between modules
• Each module has only one instance
• Heavy data traffic for moving copies around
• Multiparty and telecooperation applications are impossible
• Software: ICE and ICE Master
• Basic Platform: PVM
Multi-Blackboard Architecture (modules M1 to M6 connected through blackboards BB 1 to BB 3):
• All modules can register for each blackboard dynamically
• No direct communication between modules
• Each module can have several instances
• No copies of representation structures (word lattice, VIT chart)
• Multiparty and telecooperation applications are possible
• Software: PCA and Module Manager
• Basic Platform: PVM
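The contrast above can be sketched in a few lines. This is an illustrative toy, not the PCA or Module Manager interface: the class `Blackboard` and its methods `register` and `post` are hypothetical names. It shows modules registering dynamically, several module instances reading the same blackboard, and no module addressing another module directly.

```python
class Blackboard:
    """A shared data pool that notifies every registered module of new items."""

    def __init__(self, name):
        self.name = name
        self.entries = []      # items posted so far
        self.subscribers = []  # callbacks of registered module instances

    def register(self, callback):
        # Modules can register at any time; they never see each other.
        self.subscribers.append(callback)

    def post(self, item):
        # The same object is handed to all subscribers; no copies are made.
        self.entries.append(item)
        for callback in list(self.subscribers):
            callback(item)

whg_board = Blackboard("word_hypotheses")
vit_board = Blackboard("partial_vits")

def make_parser(label):
    """Create a toy parser instance that reads one blackboard and writes another."""
    def on_word_hypotheses(words):
        vit_board.post((label, " ".join(words)))
    return on_word_hypotheses

# Two parser instances register for the same blackboard.
whg_board.register(make_parser("chunk_parser"))
whg_board.register(make_parser("statistical_parser"))

# A downstream consumer registers for the results blackboard.
collected = []
vit_board.register(collected.append)

# A recognizer would post its output here; the parsers react on their own.
whg_board.post(["we", "meet", "monday"])
print(collected)
```

Adding a third parser is one more `register` call and requires no change to the recognizer or to the consumer, which is the decoupling the multi-blackboard design is after.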
A Multi-Blackboard Architecture for the Combination of Results from Deep and Shallow Processing Modules
• Blackboards: Audio Data; Word Hypotheses Graph with Prosodic Labels; VITs (Underspecified Discourse Representations)
• Modules: Command Recognizer, Channel/Speaker Adaptation, Spontaneous Speech Recognizer, Prosodic Analysis, Statistical Parser, Chunk Parser, Dialog Act Recognition, HPSG Parser, Semantic Construction, Semantic Transfer, Robust Dialog Semantics, Generation
The Use of Prosodic Information at All Processing Stages
Speech Signal and Word Hypotheses Graph -> Multilingual Prosody Module
Prosodic features:
• duration
• pitch
• energy
• pause
Information delivered: Boundary Information, Sentence Mood, Accented Words, Prosodic Feature Vector
Uses: Dialog Act Segmentation and Recognition, Search Space Restriction, Lexical Choice, Speaker Adaptation, Constraints for Transfer
Processing stages served: Dialog Understanding, Parsing, Translation, Generation, Speech Synthesis
Competing Strategies for Robust Speech Translation
Concurrent processing modules combine deep semantic translation with shallow surface-oriented translation methods.
Input: Word Lattice
Expensive, but precise translation (time out?):
• Principled and compositional syntactic and semantic analysis
• Semantic-based transfer of Verbmobil Interface Terms (VITs) as set of underspecified DRS
Cheap, but approximate translation:
• Case-based translation
• Dialog-act based translation
• Statistical translation
Both deliver results with confidence values.
Selection of best result -> Acceptable Translation Rate
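The interplay of concurrent translators, a time-out and confidence-based selection can be sketched as follows. This is a toy under stated assumptions: the two translator functions, their output strings, their confidence values and the deadline are invented for illustration, and Python threads stand in for the concurrent modules.

```python
import time
from concurrent.futures import ThreadPoolExecutor, wait

def deep_translation(text):
    """Hypothetical stand-in for the expensive, precise path."""
    time.sleep(1.0)  # simulates costly analysis and transfer
    return ("semantic_transfer", "precise:" + text, 0.9)

def shallow_translation(text):
    """Hypothetical stand-in for a cheap, approximate path."""
    return ("dialog_act_based", "approximate:" + text, 0.6)

def translate(text, translators, deadline_s):
    """Run all translators concurrently and pick the most confident result
    among those that finished before the deadline."""
    executor = ThreadPoolExecutor(max_workers=len(translators))
    futures = [executor.submit(t, text) for t in translators]
    done, _pending = wait(futures, timeout=deadline_s)
    executor.shutdown(wait=False)  # do not block on translators that timed out
    results = [f.result() for f in done]
    # Each result is (method, translation, confidence); choose by confidence.
    return max(results, key=lambda r: r[2]) if results else None

translators = [deep_translation, shallow_translation]
print(translate("hello", translators, deadline_s=0.2))  # tight deadline
print(translate("hello", translators, deadline_s=3.0))  # generous deadline
```

With the tight deadline only the shallow result is available, so it wins by default; with the generous deadline the deep result arrives in time and is preferred because of its higher confidence value.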
Integrating Shallow and Deep Analysis Components in a Multi-Blackboard Architecture
Augmented Word Hypotheses Graph
-> Statistical Parser, Chunk Parser, HPSG Parser (each delivers partial VITs)
-> Chart with a combination of partial VITs
-> Robust Dialog Semantics: combination and knowledge-based reconstruction of complete VITs
-> Complete and Spanning VITs
VHG: A Packed Chart Representation of Partial Semantic Representations
Parsers feeding Semantic Construction:
• Chart Parser using cascaded finite-state transducers
• Statistical LR parser trained on treebank
• Very fast HPSG parser
Chart processing:
• Incremental chart construction and anytime processing
• Rule-based combination and transformation of partial UDRS coded as VITs
• Selection of a spanning analysis using a bigram model for VITs (trained on a treebank of 24 k VITs)
The Understanding of Spontaneous Speech Repairs
Original Utterance: "I need a car next Tuesday oops Monday"
• Reparandum: "Tuesday"
• Hesitation (Editing Phase): "oops"
• Reparans (Repair Phase): "Monday"
Recognition of Substitutions and Transformation of the Word Hypothesis Graph -> "I need a car next Monday"
Verbmobil Technology: understands speech repairs and extracts the intended meaning.
Dictation systems like ViaVoice, VoiceXpress, FreeSpeech, Naturally Speaking cannot deal with spontaneous speech and transcribe the corrupted utterances.
Automatic Understanding and Correction of Speech Repairs in Spontaneous Telephone Dialogs
German: "Wir treffen uns in Mannheim, äh, in Saarbrücken." (We are meeting in Mannheim, oops, in Saarbruecken.)
English: "We are meeting in Saarbruecken."
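The reparandum / editing term / reparans pattern in the two repair examples can be illustrated with a deliberately simple heuristic. This is an assumption-laden toy and not the Verbmobil method, which works on the word hypotheses graph: it operates on a flat token list with punctuation already stripped, knows only a small hypothetical set of editing terms, and guesses the reparandum either by finding the restart word in the last few tokens or by dropping a single word.

```python
# Hypothetical list of editing terms; a real system would not rely on this.
EDITING_TERMS = {"oops", "uh", "äh"}

def repair(tokens, window=4):
    """Remove reparandum and editing term, keeping the reparans.

    Heuristic: if the word after the editing term also occurs among the last
    `window` tokens, the speaker restarted from there, so cut back to it.
    Otherwise assume a one-word substitution and drop the preceding word.
    """
    out = []
    i = 0
    while i < len(tokens):
        token = tokens[i]
        if token.lower() in EDITING_TERMS and out and i + 1 < len(tokens):
            restart = tokens[i + 1].lower()
            recent = [t.lower() for t in out[-window:]]
            if restart in recent:
                # Position of the last occurrence of the restart word.
                back = len(recent) - 1 - recent[::-1].index(restart)
                del out[len(out) - len(recent) + back:]
            else:
                out.pop()  # one-word reparandum
            i += 1  # skip the editing term itself
            continue
        out.append(token)
        i += 1
    return out

print(" ".join(repair("I need a car next Tuesday oops Monday".split())))
print(" ".join(repair("Wir treffen uns in Mannheim äh in Saarbrücken".split())))
```

The first call treats "Tuesday" as a one-word reparandum; the second detects that "in" restarts the phrase and removes "in Mannheim" as a whole.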
Robust Dialog Semantics: Combining and Completing Partial Representations
Utterance: "Let us meet (in) the late afternoon to catch the train to Frankfurt"
Fragments: "Let us" | "meet" | "the late afternoon" | "to catch" | "the train to Frankfurt"
The preposition "in" is missing in all paths through the word hypotheses graph.
A temporal NP is transformed into a temporal modifier using an underspecified temporal relation:
[temporal_np(V1)] -> [typeraise_to_mod(V1, V2)] & V2
The modifier is applied to a proposition:
[type(V1, prop), type(V2, mod)] -> [apply(V2, V1, V3)] & V3
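The two rules can be mimicked with a small sketch. It assumes that partial results are plain tagged tuples rather than real VITs; the class `Fragment`, the placeholder `"?"` for the underspecified temporal relation and the function names are hypothetical and only mirror the rule names on the slide.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Fragment:
    kind: str   # "prop", "mod" or "temporal_np"
    sem: tuple  # nested tuple standing in for a partial semantic representation

def typeraise_to_mod(np):
    """Rule 1: turn a temporal NP into a modifier. The relation stays
    underspecified ("?") because the preposition was never recognized."""
    return Fragment("mod", ("temp_rel", "?", np.sem))

def apply_mod(mod, prop):
    """Rule 2: apply a modifier to a proposition, yielding a new proposition."""
    return Fragment("prop", ("modified", prop.sem, mod.sem))

def combine(fragments):
    """Type-raise every temporal NP, then attach all modifiers to the first
    proposition. Returns None if there is no proposition to attach to."""
    raised = [typeraise_to_mod(f) if f.kind == "temporal_np" else f
              for f in fragments]
    props = [f for f in raised if f.kind == "prop"]
    mods = [f for f in raised if f.kind == "mod"]
    if not props:
        return None
    result = props[0]
    for mod in mods:
        result = apply_mod(mod, result)
    return result

fragments = [
    Fragment("prop", ("meet", "us")),
    Fragment("temporal_np", ("late_afternoon",)),
]
print(combine(fragments))
```

The resulting proposition carries the temporal information even though no path through the graph contained the preposition, with the exact temporal relation left open for later stages.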
Integrating Deep and Shallow Processing: Combining Results from Concurrent Translation Threads
Input: Segment 1 "If you prefer another hotel," Segment 2 "please let me know."
Concurrent translation threads: Statistical Translation, Case-Based Translation, Dialog-Act Based Translation, Semantic Transfer
-> Alternative Translations with Confidence Values
-> Selection Module
Result: Segment 1 translated by Semantic Transfer; Segment 2 translated by Case-Based Translation
Unit Selection Algorithm
[Figure: token graph for the sentence to synthesize, "I have time on monday.", with start (S) and end (E) nodes, tokens, and edge direction]
Linguatronic: Spoken Dialogs with Mercedes-Benz
Example utterances:
• "Please call Doris Wahlster."
• "Open the left window in the back."
• "I want to hear the weather channel."
• "When will I reach the next gas station?"
• "Where is the next parking lot?"
Hardware: Microphone, Push-to-talk Switch
• Speech control of: cellular phone, radio, windows / AC, route guidance system
• Option for S-, C-, and E-Class of Mercedes and BMW
• Speaker-independent, garbage models for non-speech (blinker, AC, wheels)
Multilingual and Mobile Communication Assistants
• Dialog Translation: Call Centers, ECommerce, Mobile Travel Assistance, Telephone Translations
• Multilingual Indexing and Annotation of Videos: Video Archives, News Archives
• Multilingual Audio Retrieval and Audio Mining: Discussions, Lecture Notes, Organizers
• Speech-based Web Access to Multilingual Web pages: Multimodal Interfaces, WAP Phones, WebTV
SmartKom, Verbmobil: Spontaneous Speech, Robust Processing and Translation, Semantic and Pragmatic Understanding
International Research Trends in Multilingual Systems
Multilingual Language Technology: Speech Recognition, Language Understanding, Language Generation, and Speech Synthesis
Conclusion I
• Real-world problems in language technology, such as the understanding of spoken dialogs, speech-to-speech translation and multimodal dialog systems, can only be cracked by the combined muscle of deep and shallow processing approaches.
• In a multi-blackboard architecture based on packed representations on all processing levels (speech recognition, parsing, semantic processing, translation, generation), using charts with underspecified representations (e.g. UDRS), the results of concurrent processing threads can be combined in an incremental fashion.
Conclusion II
• All results of concurrent processing modules should come with a confidence value, so that a selection module can choose the most promising result at each processing stage.
• Packed representations together with formalisms for underspecification capture the uncertainties in each processing phase, so that the uncertainties can be reduced by linguistic, discourse and domain constraints as soon as they become applicable.
Conclusion III
• Deep processing can be used for merging, completing and repairing the results of shallow processing strategies.
• Shallow methods can be used to guide the search in deep processing.
• Statistical methods must be augmented by symbolic models (e.g. class-based language modelling, word order normalization as part of statistical translation).
• Statistical methods can be used to learn operators or selection strategies for symbolic processes.
It is much more than a balancing act... (see Klavans and Resnik 1996)
Open Problems for the Next Decade
• Problems with current machine learning approaches
  • Expensive data collection
  • Cognitively unrealistic training data
  • Data sparseness
• Problems with current hand-crafted knowledge sources
  • Brittleness
  • Domain dependence
  • Limited scalability
A Speculative Conclusion (+50 years)
Oral Society -> Textual Society -> Oral Society
• -500 years, Oral Society: News and knowledge are passed orally. No mass storage, no automatic processing, no automatic retrieval.
• TODAY, Textual Society: News and knowledge are passed textually. Mass storage of texts, Text Processing, Text Retrieval.
• +50 years, Oral Society: News and knowledge are passed orally. Mass storage of speech, Speech Processing, Audio Retrieval.