Explore the mobile speech-to-speech translation system, Verbmobil, that supports verbal communication in mobile situations. This system provides robust real-time translation for face-to-face conversations and telecommunication.
Seventeenth International Joint Conference on Artificial Intelligence (IJCAI-01), Seattle, Wednesday, 8 August 2001. Robust Translation of Spontaneous Speech: A Multi-Engine Approach. Wolfgang Wahlster, German Research Center for Artificial Intelligence, DFKI GmbH, www.dfki.de/~wahlster
Mobile Speech-to-Speech Translation of Spontaneous Dialogs
As the name Verbmobil suggests, the system supports verbal communication with foreign dialog partners in mobile situations:
1. face-to-face conversations
2. telecommunication
Mobile Speech-to-Speech Translation of Spontaneous Dialogs
Conference call: the Verbmobil Speech Translation Server connects GSM cell phone users.
Robust Real-Time Translation with Verbmobil
At a German airport: an American businessman calls the secretary of a German business partner.
Outline
- Verbmobil's Multi-Blackboard and Multi-Engine Architecture
- Exploiting Underspecification in a Multi-Stratal Semantic Representation Language
- Combining Deep and Shallow Processing Strategies for Robust Dialog Translation
- Evaluation and Technology Transfer
- Lessons Learned and Conclusions
Telephone-based Dialog Translation
- ISDN conference call (3 participants): German speaker, Verbmobil, American speaker
- Speech-based set-up of the conference call
- Hardware shown: Verbmobil server cluster (LINUX Server, Sun Server 450, Sun ULTRA 60/80) and a Bianca/Brick XS BinTec ISDN-LAN router
[Figure: the German dialog partner and the American dialog partner are connected through the Verbmobil server cluster, with speech translated from German to English and from English to German.]
Verbmobil: The First Speech-Only Dialog Translation System
Devices shown: mobile GSM phone, mobile DECT phone
- American speaker: "Verbmobil" (voice dialing)
- Connect to the Verbmobil Speech-to-Speech Translation Server: +49 631 3111911
- Verbmobil: "Welcome to the Verbmobil Translation System. Please speak the telephone number of your partner."
- American speaker: "0177555"
- The foreign participant is placed into the conference call
- Verbmobil to the German participant: "Verbmobil hat eine neue Verbindung aufgebaut. Bitte sprechen Sie jetzt." (Verbmobil has set up a new connection. Please speak now.)
- Verbmobil to the American participant: "Welcome to the Verbmobil server. Please start your input after the beep."
Verbmobil is a Multilingual System
It supports bidirectional translation between:
- English (American) and German
- Japanese and German
- Chinese (Mandarin) and German
Verbmobil Partners (Phase 2)
TU Braunschweig, DaimlerChrysler, Rheinische Friedrich-Wilhelms-Universität Bonn, Ludwig-Maximilians-Universität München, Universität Bielefeld, Universität des Saarlandes, Technische Universität München, Universität Hamburg, Friedrich-Alexander-Universität Erlangen-Nürnberg, Ruhr-Universität Bochum, Eberhard-Karls-Universität Tübingen, Universität Stuttgart, Universität Karlsruhe
W. Wahlster, DFKI
Three Levels of Language Processing (reduction of uncertainty from speech telephone input to understanding)
- Speech Recognition: What has the caller said? Knowledge sources: acoustic language models, word lists. Result: 100 alternatives.
- Speech Analysis: What has the caller meant? Knowledge sources: grammar, lexical meaning. Result: 10 alternatives.
- Speech Understanding: What does the caller want? Knowledge sources: discourse context, knowledge about the domain of discourse. Result: unambiguous understanding in the dialog context.
Challenges for Language Engineering
[Figure: four dimensions of increasing complexity, with Verbmobil placed at the most demanding end of each.]
- Input Conditions: close-speaking microphone/headset, push-to-talk; telephone, pause-based segmentation; open microphone, GSM quality
- Naturalness: isolated words; read continuous speech; spontaneous speech
- Adaptability: speaker dependent; speaker independent; speaker adaptive
- Dialog Capabilities: monolog dictation; information-seeking dialog; multiparty negotiation
Verbmobil II: Three Domains of Discourse
- Scenario 1, Appointment Scheduling: When? Focus on temporal expressions. Vocabulary size: 6000
- Scenario 2, Travel Planning & Hotel Reservation: When? Where? How? Focus on temporal and spatial expressions. Vocabulary size: 10000
- Scenario 3, PC-Maintenance Hotline: What? When? Where? How? Integration of special sublanguage lexica. Vocabulary size: 30000
Context-Sensitive Speech-to-Speech Translation (Verbmobil Server)
- "Wann fährt der nächste Zug nach Hamburg ab?" becomes "When does the next train to Hamburg depart?"
- "Wo befindet sich das nächste Hotel?" becomes "Where is the nearest hotel?"
Verbmobil's Massive Data Collection Effort
3,200 dialogs (182 hours) with 1,658 speakers; 79,562 turns distributed on 56 CDs, 21.5 GB
The so-called Partitur (German word for musical score) orchestrates fifteen strata of annotations:
- Transliteration Variant 1
- Transliteration Variant 2
- Lexical Orthography
- Canonical Pronunciation
- Manual Phonological Segmentation
- Automatic Phonological Segmentation
- Word Segmentation
- Prosodic Segmentation
- Dialog Acts
- Noises
- Superimposed Speech
- Syntactic Category
- Word Category
- Syntactic Function
- Prosodic Boundaries
Extracting Statistical Properties from Large Corpora
- Corpora: Segmented Speech with Prosodic Labels; Treebanks & Predicate-Argument Structures; Annotated Dialogs with Dialog Acts; Aligned Bilingual Corpora; Transcribed Speech Data
- Machine Learning for the Integration of Statistical Properties into Symbolic Models for Speech Recognition, Parsing, Dialog Processing, Translation
- Models: Neural Nets, Multilayered Perceptrons; Probabilistic Transfer Rules; Hidden Markov Models; Probabilistic Automata; Probabilistic Grammars
Multilinguality
[Figure: word accuracy [%] (axis from 50 to 100) of the Japanese, German, and English recognizers; time-axis labels VM1, '97, '98, '99.1, '99.2, '99.3, 2000.]
Multilinguality: Language Identification (LID)
[Figure: speech is passed to an independent LID module and to the German, English, and Japanese recognizers; the output is a word sequence w1 … wn.]
From a Multi-Agent Architecture to a Multi-Blackboard Architecture
Verbmobil I: Multi-Agent Architecture (modules M1 to M6)
- Each module must know which module produces what data
- Direct communication between modules
- Heavy data traffic for moving copies around
Verbmobil II: Multi-Blackboard Architecture (modules M1 to M6, blackboards BB 1 to BB 3)
- All modules can register for each blackboard dynamically
- No direct communication between modules
- No copies of representation structures (word lattice, VIT chart)
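The contrast between the two architectures can be sketched in a few lines of Python. This is only an illustration of the registration idea named on the slide, not Verbmobil's actual communication layer; the `Blackboard` class, its `register` and `post` methods, and the two toy modules are all hypothetical names.

```python
from typing import Any, Callable, List


class Blackboard:
    """One shared store; modules register callbacks instead of addressing each other."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.entries: List[Any] = []
        self._subscribers: List[Callable[[Any], None]] = []

    def register(self, callback: Callable[[Any], None]) -> None:
        # Modules can register at any time (dynamic registration).
        self._subscribers.append(callback)

    def post(self, item: Any) -> None:
        # The item is stored once; subscribers receive a reference, not a copy.
        self.entries.append(item)
        for callback in list(self._subscribers):
            callback(item)


# Hypothetical blackboards and modules, for illustration only.
word_lattice = Blackboard("word lattice")
dialog_acts = Blackboard("dialog acts")
log = []


def chunk_parser(hypothesis: str) -> None:
    log.append(("chunk_parser", hypothesis))


def dialog_act_recognizer(hypothesis: str) -> None:
    log.append(("dialog_act_recognizer", hypothesis))
    dialog_acts.post("inform")  # result goes to another blackboard, not to a module


word_lattice.register(chunk_parser)
word_lattice.register(dialog_act_recognizer)
word_lattice.post("he is coming")
```

Neither toy module knows that the other exists: each only names the blackboard it reads from or writes to, which is the property the slide contrasts with direct module-to-module communication.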
Multi-Blackboard/Multi-Engine Architecture
- Blackboard 1: Preprocessed Speech Signal
- Blackboard 2: Word Lattice
- Blackboard 3: Syntactic Representation: Parsing Results
- Blackboard 4: Semantic Representation: Lambda DRS
- Blackboard 5: Dialog Acts
[Figure: module groups 1.1, 1.2, ... through 6.1, 6.2, ... are arranged around the blackboards.]
A Multi-Blackboard Architecture for the Combination of Results from Deep and Shallow Processing Modules
- Blackboards: Audio Data; Word Hypotheses Graph with Prosodic Labels; VITs (Underspecified Discourse Representations)
- Modules: Command Recognizer; Channel/Speaker Adaptation; Spontaneous Speech Recognizer; Prosodic Analysis; Statistical Parser; Chunk Parser; Dialog Act Recognition; HPSG Parser; Semantic Construction; Semantic Transfer; Robust Dialog Semantics; Generation
VIT (Verbmobil Interface Terms) as a Multi-Stratal Representation Language
- Used as a common representation scheme for information exchange between all components and processing threads
- Design inspired by underspecified discourse representation structures (UDRS, Reyle/Kamp 1993)
- Compact representation of lexical and structural ambiguities and of scope underspecifications of quantifiers, negations and adverbs
- Variable-free sets of non-recursive terms: [beginning(35, i37), arg3(35, i37, i38), come(27, i35), arg1(27, i35, i36), decl(37, h43), pron(26, i36), at(36, i35, i37), mofy(34, i38, aug), def(28, i37, h42, h41), udef(31, i38, h45, h44)]
- Streams of literals as flat multi-stratal representations that are very efficient for incremental processing
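Because the terms are non-recursive, a flat list of literals can be read with a single regular expression and indexed in one pass. The Python sketch below illustrates this with the condition list from the slide; the helper names `parse_literals` and `index_by_label` are made up for the example, and the code is not part of any Verbmobil tool.

```python
import re
from collections import defaultdict

# Condition literals copied from the slide (whitespace normalized).
CONDITIONS = (
    "beginning(35,i37), arg3(35,i37,i38), come(27,i35), arg1(27,i35,i36), "
    "decl(37,h43), pron(26,i36), at(36,i35,i37), mofy(34,i38,aug), "
    "def(28,i37,h42,h41), udef(31,i38,h45,h44)"
)

# Non-recursive terms never nest, so one flat pattern is enough.
LITERAL = re.compile(r"(\w+)\s*\(([^()]*)\)")


def parse_literals(text):
    """Turn 'pred(a,b,...)' literals into (predicate, args) tuples."""
    return [
        (m.group(1), tuple(arg.strip() for arg in m.group(2).split(",")))
        for m in LITERAL.finditer(text)
    ]


def index_by_label(literals):
    """Group literals by their first argument (the label), one literal at a time."""
    index = defaultdict(list)
    for predicate, args in literals:
        index[args[0]].append((predicate, args[1:]))
    return index


literals = parse_literals(CONDITIONS)
by_label = index_by_label(literals)
print(by_label["35"])
```

Each literal is processed independently of the ones that follow it, which is what makes a stream of such literals convenient for incremental consumers.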
VIT for 'He is coming at the beginning of August'
vit(vitID(sid(104, a, en, 10, 80, 1, en, y, semantics), % Segment Identifier
          [word(he, 1, [26]), word(is, 2, []), word(coming, 3, [27]), word(at, 4, [36]),
           word(the, 5, [28]), word(beginning, 6, [35]), word(of, 7, [35]), word('August', 8, [34])]), % WHG String
    index(38, 25, i35), % Index
    [beginning(35, i37), arg3(35, i37, i38), come(27, i35), arg1(27, i35, i36), decl(37, h43),
     pron(26, i36), at(36, i35, i37), mofy(34, i38, aug), def(28, i37, h42, h41), udef(31, i38, h45, h44)], % Conditions
    [in_g(26, 25), in_g(37, 38), in_g(27, 25), in_g(28, 30), in_g(31, 33), in_g(34, 32), in_g(35, 29), in_g(36, 25),
     leq(25, h41), leq(25, h43), leq(29, h42), leq(29, h44), leq(30, h43), leq(32, h45), leq(33, h43)], % Scope and Grouping Constraints
    [s_sort(i35, situation), s_sort(i37, time), s_sort(i38, time)], % Sortal Specifications for Instance Variables
    [dialog_act(25, inform), dir(36, no), prontype(i36, third, std)], % Discourse and Pragmatics
    [cas(i36, nom), gend(i36, masc), num(i36, sg), num(i37, sg), num(i38, sg), pcase(l135, i38, of)], % Syntax
    [ta_aspect(i35, progr), ta_mood(i35, ind), ta_perf(i35, nonperf), ta_tense(i35, pres)], % Tense and Aspect
    [pros_accent(35)]) % Prosody
Information between Layers is Linked Together Using Constant Symbols
Instances are constants interpreted as skolemized variables.
[word(he, 1, [26]), word(is, 2, []), word(coming, 3, [27]), word(at, 4, [36]), word(the, 5, [28]), word(beginning, 6, [35]), word(of, 7, [35]), word('August', 8, [34])] % WHG String
[beginning(35, i37), arg3(35, i37, i38), come(27, i35), arg1(27, i35, i36), decl(37, h43), pron(26, i36), at(36, i35, i37), mofy(34, i38, aug), def(28, i37, h42, h41), udef(31, i38, h45, h44)] % Conditions
[s_sort(i35, situation), s_sort(i37, time), s_sort(i38, time)] % Sorts
[cas(i36, nom), gend(i36, masc), num(i36, sg), num(i37, sg)] % Syntax
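The linking can be traced by following shared constants from one layer to the next: a word points to labels, a label selects conditions, and the instance constants in those conditions select sort and syntax entries. The following Python sketch does this for the four layers shown on the slide; the function `describe` and the tuple encoding are illustrative assumptions, not the VIT access library.

```python
# The four layers from the slide, re-encoded as Python tuples (an assumed encoding).
WORDS = [("he", 1, [26]), ("is", 2, []), ("coming", 3, [27]), ("at", 4, [36]),
         ("the", 5, [28]), ("beginning", 6, [35]), ("of", 7, [35]), ("August", 8, [34])]
CONDITIONS = [("beginning", 35, "i37"), ("arg3", 35, "i37", "i38"), ("come", 27, "i35"),
              ("arg1", 27, "i35", "i36"), ("decl", 37, "h43"), ("pron", 26, "i36"),
              ("at", 36, "i35", "i37"), ("mofy", 34, "i38", "aug"),
              ("def", 28, "i37", "h42", "h41"), ("udef", 31, "i38", "h45", "h44")]
SORTS = {"i35": "situation", "i37": "time", "i38": "time"}
SYNTAX = [("cas", "i36", "nom"), ("gend", "i36", "masc"),
          ("num", "i36", "sg"), ("num", "i37", "sg")]


def describe(word):
    """Collect everything the layers say about one word by following shared constants."""
    labels = next(lbls for w, _position, lbls in WORDS if w == word)
    conditions = [c for c in CONDITIONS if c[1] in labels]
    # Instance constants start with 'i' (hole constants start with 'h').
    instances = sorted({a for c in conditions for a in c[2:] if a.startswith("i")})
    return {
        "labels": labels,
        "conditions": conditions,
        "sorts": {i: SORTS[i] for i in instances if i in SORTS},
        "syntax": [s for s in SYNTAX if s[1] in instances],
    }


print(describe("beginning"))
print(describe("he"))
```

For "beginning", label 35 leads to instances i37 and i38, both of sort time; for "he", label 26 leads to instance i36, which carries the nominative, masculine, singular entries of the syntax layer.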
The Use of Underspecified Representations
- Two readings in the source language: "Wir telephonierten mit Freunden aus Schweden."
- Underspecified semantic representation: a compact representation of scope ambiguities in a logical language without using disjunctions
- Ambiguity-preserving translation
- Two readings in the target language: "We called friends from Sweden."
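A toy Python sketch of ambiguity-preserving transfer follows. It makes a simplifying assumption: the two readings of the example are modeled as an open choice of what the phrase "aus Schweden" modifies, stored as one set of candidates rather than as two separate analyses. The `Underspecified` class, the `LEXICON` table, and both functions are hypothetical and far simpler than a real VIT.

```python
from dataclasses import dataclass
from typing import FrozenSet, List


@dataclass(frozen=True)
class Underspecified:
    head: str                   # the verb
    obj: str                    # the object noun
    modifier: str               # the prepositional phrase
    attach_to: FrozenSet[str]   # unresolved candidates: field names "head" and/or "obj"


# Hypothetical transfer lexicon for the single example sentence.
LEXICON = {"telephonieren": "call", "Freunde": "friends", "aus Schweden": "from Sweden"}


def transfer(src: Underspecified) -> Underspecified:
    """Translate the parts; pass the open attachment choice through untouched."""
    return Underspecified(LEXICON[src.head], LEXICON[src.obj],
                          LEXICON[src.modifier], src.attach_to)


def readings(u: Underspecified) -> List[str]:
    """Enumerate readings only on demand; transfer itself never needs them."""
    return [f"'{u.modifier}' modifies '{getattr(u, site)}'" for site in sorted(u.attach_to)]


source = Underspecified("telephonieren", "Freunde", "aus Schweden",
                        frozenset({"head", "obj"}))
target = transfer(source)
print(readings(target))
```

The transfer step handles one structure instead of one per reading, and the target structure still admits both readings, which is the point of the slide's example pair.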
Verbmobil is the First Dialog Translation System that Uses Prosodic Information Systematically at All Processing Stages
- Input to the Multilingual Prosody Module: speech signal, word hypotheses graph
- Prosodic features: duration, pitch, energy, pause
- Information provided: boundary information, sentence mood, accented words, prosodic feature vector
- Uses: dialog act segmentation and recognition, search space restriction, lexical choice, speaker adaptation, constraints for transfer
- Processing stages served: speech synthesis, dialog understanding, translation, parsing, generation
Using Syntactic-Prosodic Boundaries to Speed Up the Parsing Process
Example with boundary labels: yes S1 | no problem S4 | Mister Mueller S4 | when would you like to go to Hannover S4
- Without boundaries: 1256 chart edges, runtime 1.31 secs
- With boundaries: 632 chart edges, runtime 0.62 secs
- Speed-up: 53%
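The 53% figure can be checked against the runtimes on the slide, assuming it denotes the relative reduction in runtime. The short Python calculation below applies the same formula to the chart-edge counts; the function name `reduction` is just a label for this check.

```python
def reduction(before: float, after: float) -> float:
    """Relative saving: the share of the original cost that was removed."""
    return (before - after) / before


runtime_saving = reduction(1.31, 0.62)   # seconds, from the slide
edge_saving = reduction(1256, 632)       # chart edges, from the slide

print(f"runtime reduced by {runtime_saving:.1%}")
print(f"chart edges reduced by {edge_saving:.1%}")
```

Under this reading the runtime saving rounds to the 53% stated on the slide, while the number of chart edges drops by roughly half.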