Human-Machine Dialogue: Espere and Reality
Dr. Zhang Sen, zhangsen@gscas.ac.cn
Chinese Academy of Sciences, Beijing, CHINA
2014/8/15
Overview Core Technologies Speech-to-Text Text-to-Speech Natural Language Processing Dialogue Management Middlewares & Protocols Conclusion OUTLINE
Motivation and Goal State of the art Why so difficult? Application Areas My works Overview
Machines are tools invented by humans. The Industrial Revolution freed humans from manual labor; can the Information Revolution free humans from mental labor? Fundamental functions are required. Espere and goal (Bill Gates): talk with machines freely via speech/NL; machines can understand and imitate human activities. Machine intelligence: the Turing test, classical and extended. Motivation and Goal
Alan M. Turing “Computing Machinery and Intelligence”, (Mind, 1950 - Vol. 59, No. 236, pp. 433-460) I propose to consider the question, “Can machines think?” This should begin with definitions of the meaning of the terms “machine” and “think”. To answer this question, Turing proposed the “Imitation Game” later named the “Turing Test” Turing’s Question
Turing Test: simple, operative, objective, and convincing. (Figure: an observer converses with Subject #1 and Subject #2 and must decide which subject is the machine.)
Conditions classical Turing test: assumed communications would be via typed text (keyboard) extended Turing test: assumed communications would be via speech input/output assumed communications would be unrestricted (as to subject, etc) The ability to communicate is equal to “thinking” and “intelligence” (Turing) Conditions and Answers
Today, despite great advances in HW/SW (a computer can even defeat the greatest chess player), machines are still unable to fool an interrogator on unrestricted subjects. Turing predicted that the (classical) test would be passed within 50 years; strictly speaking, it has neither been passed nor failed yet. The extended Turing test is harder and still has a long way to go. The same holds for some AI experts' predictions in the 50s and 60s. Human-machine dialogue has become possible, and can provide useful functions: travel reservations, stock brokerages, banking, etc. Turing Test - Today
Though the Turing test has not been passed, it has promoted and boosted great advances in many areas: Computer Science, AI, Cognitive Science, Natural Language Processing (NLU, NLG, ...), MT, Robotics, Speech-to-Text, Text-to-Speech, Computer Vision, etc. Impact and Influence
DARPA projects, twice, in the 80s and 90s: ATIS (996 words, connected speech), Communicator (>5000 words, continuous speech). MIT projects: Galaxy. CMU, OGI: the JANUS project. Bell Labs, IBM, Microsoft: VUI, VoiceXML, SALT. Verbmobil, DFKI (Germany); SUNDIAL. Grenoble, INRIA (France): MIAMM, OZONE. ATR, JSPS projects (Japan). CSTAR-I, II, III, the S2S project, etc. Projects
Subject-restricted, small-vocabulary dialogue is possible, but far from satisfactory. Metrics for the evaluation of H-M dialogue systems, CU Communicator 2002 (values are means): task completion (70%), time to completion (260 s), total turns to completion (37), response latency (2 s), user words to task end (39), system words to task end (332), number of reprompts (3), WER (22%-30%). The DARPA Communicator project proposed a set of metrics including more than 18 items. State of the Art
Overview Architecture (Figure: applications middleware, Speech I/O middleware, and NLU middleware connecting the KB, DB, and DM components.)
Galaxy Hub Architecture (MIT, CU): the MIT Galaxy hub architecture with the CU Communicator. (Figure: a central Hub connecting the audio server, ASR, NL parser, NL generator, confidence server, DM, TTS, database, and WWW.)
Natural language variation: ambiguity at the word and sentence levels; NL is an open, changing set; can it be treated numerically? Speech variation and communication channel distortion: non-stationary; rate, power, and timbre vary; what is the fundamental feature of speech? Computing power limitations: the requirements of optimal search algorithms. Current computer architecture limitations: weak at handling analog, fuzzy values. Limited knowledge of human intelligence: the learning mechanism of human beings. Why So Difficult?
Can ASR hear everything? Can NLP understand everything heard? Can DM deal with multiple strands? Does TTS sound natural? In my opinion, problems such as ASR, NLP, TTS, and MT share common characteristics: if one is solved, the others can be too. Open Issues
Statistical approach: training problems, false-sample problems. Rule-based approach: rule selection and conflicts. DP-based search algorithms: Viterbi, forward-backward search, beam search. Mathematical modeling: time-series finite-state transition models. Main Methodologies
Improve Existing Applications Scheduling - Airlines, Hotels Financial - Banks, Brokerages Enabling New Applications Complex Travel Planning Voice Web search and browsing speech-to-speech MT Catalogue Order Many applications require Text-to-Speech role games speaking toys Application Areas
Project “Research on human-machine dialog through spoken language”, JSPS sponsored, 1998-2000 · Improved DTW approach with regard to prominent acoustic features, Proceedings of the ASJ, 1999 · Re-estimation of LP coefficients in the sense of L∞ criterion, IEEE ICSLP, 2000, Beijing, China · Visual approach for Automatic Pitch Period Estimation, IEEE ICASSP, 2000, Istanbul, Turkey · Automatic Labeling Initials and Finals in Chinese Speech Corpus, IEEE ICSLP 2000, Beijing, China A speech coding approach based on human hearing model, Proceedings of the ASJ, 2000 Works in Waseda
Project “CU Communicator”, DARPA sponsored and NSF supported, 2000-2001. N-gram LM smoothing based on word-class information. Dynamic pronunciation modeling for ASR adaptation: Amdahl's law, the 50 most common words; what kinds of pronunciation variations are hard for tri-phones to model? IEEE ICASSP 2001, Salt Lake City, USA. Works in CSLR, CU
Project “Multidimensional Information Access using Multiple Modalities”, EU IST sponsored, 2002-2003. Middleware between the ASR engine and the DM, XML. Domain-specific N-gram LM generation based on a set of French language rules, PERL. HMM-based acoustic modeling improvement. Some issues on speech signal re-sampling at arbitrary rate, IEEE ISSPA, 2003, Paris, FRANCE. An Effective Combination of Different Order N-Grams, The 17th Pacific Asia Conference on Language, Information and Computation, 2003, Singapore. Comparison of speech signal resampling approaches, Proc. of ASJ, 2003, Tokyo, Japan. Text-to-Pinyin conversion based on context knowledge and d-tree for Mandarin, IEEE NLP-KE, 2003, Beijing, China. Works in INRIA-LORIA
Finished in 2003; the speech signal analysis module was integrated into Snorri, LORIA. Functions: speech signal analysis, speech-to-text, text-to-speech, text-to-grapheme. Spoken Language Toolkit
Based on the requirements analysis of human-machine communication, at least the following technologies should be included: Speech-to-Text Text-to-Speech Natural Language Processing Dialogue Management Middlewares & Protocols Core Technologies
The Speech-to-Text Problem: find the most likely word sequence Ŵ among all possible sequences W given the acoustic evidence A:
Ŵ = argmax_W P(W | A)
A tractable reformulation via Bayes' rule is:
Ŵ = argmax_W P(A | W) P(W)
where P(A | W) is the acoustic model and P(W) is the language model; evaluating the argmax is a daunting search task.
Speech Recognition Architecture (Figure: analog speech enters the recognition front end, which produces an observation sequence; the decoder combines the acoustic model, dictionary, and language model to output the best word sequence.)
Front-End Processing: feature extraction and dynamic features (figure after K.F. Lee).
Overlapping Sample Windows: the speech signal is non-stationary; under the short-term approximation, each windowed frame can be viewed as a stationary signal.
Cepstrum Computation: the cepstrum is the inverse Fourier transform of the log spectrum. In computation, the IDFT takes the form of a weighted DCT (see HTK).
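As a concrete illustration of this definition, here is a minimal numpy sketch of the real cepstrum as the inverse DFT of the log-magnitude spectrum (the 100 Hz test tone, 16 kHz rate, and 25 ms frame are hypothetical choices, not from the slides):

```python
import numpy as np

# Real cepstrum: inverse DFT of the log-magnitude spectrum.
def real_cepstrum(frame):
    spectrum = np.fft.fft(frame)
    log_mag = np.log(np.abs(spectrum) + 1e-10)  # floor avoids log(0)
    return np.fft.ifft(log_mag).real

# A 25 ms frame of a synthetic 100 Hz "voiced" tone at 16 kHz.
fs = 16000
t = np.arange(int(0.025 * fs)) / fs
frame = np.sin(2 * np.pi * 100 * t)
c = real_cepstrum(frame)
print(c.shape)  # one cepstral value per sample in the frame
```

For a periodic signal, the cepstrum shows a peak near the quefrency of the pitch period, which is why it is also used for pitch estimation.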
Mel Cepstral Coefficients: construct the mel-frequency domain using triangularly-shaped weighting functions applied to mel-transformed log-magnitude spectral samples. The filter bank is linear below 1 kHz and logarithmic above 1 kHz, motivated by human auditory response characteristics. MFCCs are the most common feature set for recognizers.
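The triangular mel filter bank and final DCT step described above can be sketched as follows; the filter count, frame length, and 440 Hz toy input are illustrative assumptions, not the HTK implementation:

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_filters, n_fft, fs):
    """Triangular filters spaced linearly on the mel scale."""
    mel_pts = np.linspace(hz_to_mel(0.0), hz_to_mel(fs / 2.0), n_filters + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / fs).astype(int)
    fbank = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(1, n_filters + 1):
        l, c, r = bins[i - 1], bins[i], bins[i + 1]
        for k in range(l, c):          # rising edge of the triangle
            fbank[i - 1, k] = (k - l) / max(c - l, 1)
        for k in range(c, r):          # falling edge of the triangle
            fbank[i - 1, k] = (r - k) / max(r - c, 1)
    return fbank

def mfcc(frame, fs, n_filters=26, n_ceps=13):
    n_fft = len(frame)
    power = np.abs(np.fft.rfft(frame * np.hamming(n_fft))) ** 2
    fb_energies = mel_filterbank(n_filters, n_fft, fs) @ power
    log_e = np.log(fb_energies + 1e-10)
    # DCT-II of the log filter-bank energies; keep the first n_ceps.
    n = np.arange(n_filters)
    dct = np.cos(np.pi * np.outer(np.arange(n_ceps), n + 0.5) / n_filters)
    return dct @ log_e

feats = mfcc(np.sin(2 * np.pi * 440 * np.arange(512) / 16000), 16000)
print(feats.shape)  # 13 cepstral coefficients per frame
```

In a full front end, these static coefficients would be augmented with the dynamic (delta and delta-delta) features mentioned earlier.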
LPC: linear predictive coefficients. PLP: perceptual linear prediction. Though MFCC has been used successfully, what is the truly robust speech feature? Features Used in ASR
Template-based AMs, used in DTW, are obsolete. Acoustic states are represented by Hidden Markov Models (HMMs): probabilistic state machines in which the state sequence is unknown and only the feature-vector outputs are observed. Each state has an output symbol distribution and a transition probability distribution. Issues: what topology is proper? How many states in a model? How many mixtures in a state? Acoustic Models (figure: normal, silence, and connected model topologies)
HMMs assume the state duration follows an exponential (geometric) distribution. The transition probability depends only on the origin and destination states. All observation frames depend only on the state that generated them, not on the neighboring observation frames (the observation-independence assumption). Paper: “Transition control in acoustic modeling and Viterbi search”. Limitations of HMM
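The duration limitation can be checked numerically: a state with self-loop probability a is occupied for exactly d frames with probability a^(d-1)·(1-a), the geometric (discrete exponential) distribution. A tiny sketch, with a = 0.8 as an arbitrary example value:

```python
import numpy as np

# Geometric state-duration distribution implied by an HMM self-loop.
a = 0.8                       # self-loop probability (toy value)
d = np.arange(1, 50)          # possible durations in frames
p = a ** (d - 1) * (1 - a)    # P(stay exactly d frames)
print(p.sum())                # ~1 (probabilities sum to one)
print((d * p).sum())          # mean duration, close to 1/(1-a) = 5 frames
```

The monotonically decaying shape is a poor fit for real phone durations, which is one motivation for duration-modeling extensions to the basic HMM.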
Create a set of HMMs representing the basic sounds (phones) of a language. English has about 40 distinct phonemes; Chinese has about 22 Initials + 37 Finals. A “lexicon” is needed for pronunciations, with letter-to-sound rules for unusual words. Co-articulation effects must be modeled: tri-phones, each phone modified by its onset and trailing context phones (1k-2k used in English), e.g. pl-c+pr. Basic Speech Unit Models
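The expansion of a dictionary pronunciation into context-dependent tri-phones can be sketched as below. The "left-center+right" notation, the toy lexicon entry (the phones for "what's" from a later slide), and the `sil` boundary convention are illustrative assumptions:

```python
# Expand a word's phone sequence into word-internal tri-phones,
# written in "left-center+right" notation with a silence boundary.
def to_triphones(phones):
    out = []
    for i, p in enumerate(phones):
        left = phones[i - 1] if i > 0 else "sil"
        right = phones[i + 1] if i < len(phones) - 1 else "sil"
        out.append(f"{left}-{p}+{right}")
    return out

lexicon = {"what's": ["w", "ah", "ts"]}  # toy lexicon entry
print(to_triphones(lexicon["what's"]))
# → ['sil-w+ah', 'w-ah+ts', 'ah-ts+sil']
```

Each resulting tri-phone gets its own HMM (usually with state tying), which is how the 1k-2k context-dependent units mentioned above arise.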
What is a language model? A quantitative ordering of the likelihood of word sequences (statistical viewpoint); a set of rules specifying how to create word sequences or sentences (grammar viewpoint). Why use language models? Not all word sequences are equally likely; search space optimization (*); improved accuracy via multiple passes (word lattice to n-best). Language Models
Write a grammar of possible sentence patterns. Advantages: long history/context; no need for a large text database (rapid prototyping); integrated syntactic parsing. Problems: the work of writing grammars; word sequences not covered by the grammar cannot be recognized. Used in small-vocabulary ASR, not for LVCSR. Examples: “the next page”, “show me any picture”, “display the last text file”. Finite-State Language Model
Statistical Language Models
• Predict the next word based on the current word and the history
• The probability of the next word is given by:
• Trigram: P(wi | wi-1, wi-2)
• Bigram: P(wi | wi-1)
• Unigram: P(wi)
• Advantages:
• Trainable on large text databases
• ‘Soft’ prediction (probabilities)
• Can be directly combined with the AM in decoding
• Problems:
• Need a large text database for each domain
• Sparseness problems, requiring smoothing: backoff approaches, word-class approaches
• Used in LVCSR
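A minimal bigram model with add-one (Laplace) smoothing, one of the simplest smoothing approaches alluded to above, might look like this; the three-sentence corpus, built loosely from the finite-state example phrases, is purely illustrative:

```python
from collections import Counter

# Toy training corpus (illustrative only).
corpus = [["show", "me", "the", "file"],
          ["show", "me", "any", "picture"],
          ["display", "the", "file"]]

unigrams = Counter(w for s in corpus for w in s)
bigrams = Counter((s[i], s[i + 1]) for s in corpus for i in range(len(s) - 1))
vocab = len(unigrams)

def p_bigram(w, prev):
    # Add-one smoothing: every unseen bigram keeps non-zero probability.
    return (bigrams[(prev, w)] + 1) / (unigrams[prev] + vocab)

print(p_bigram("me", "show"))       # seen bigram: relatively high probability
print(p_bigram("display", "show"))  # unseen bigram: small but non-zero
```

Real LVCSR systems train on far larger corpora and use better smoothing (backoff, interpolation, word classes), but the 'soft' non-zero prediction for unseen sequences is the same idea.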
(Figure: decoding proceeds at several levels: acoustic-model states; phonemes such as /w/ /ah/ /ts/ and /th/ /ax/; dictionary entries mapping phone sequences to words, e.g. /w/ -> /ah/ -> /ts/ = “what's”, /th/ -> /ax/ = “the”, plus words such as “location”, “willamette's”, “kirk's”, “longitude”; and the language model scoring sentences such as “display sterett's latitude”.) ASR Decoding Levels
Given the observations, how do we determine the most probable utterance/word sequence? (DTW for template-based matching.) The Dynamic Programming (DP) algorithm was proposed by Bellman in the 50s for multistep decision processes; its “principle of optimality” is a form of divide and conquer. DP-based search algorithms are used in speech recognition decoders to return the n-best paths or a word lattice through the acoustic model and the language model. A complete search is usually impossible since the search space is too large, so beam search is required to prune less probable paths and save computation. Issues: numerical underflow, and the balance between the LM and AM scores. Decoding Algorithms
Uses Viterbi decoding: takes the MAX, not the SUM (Viterbi vs. Forward). Finds the optimal state sequence, not the optimal word sequence. Computational load: O(T*N^2). Time-synchronous: extends all paths at each time step, so all paths have the same length (no need to normalize to compare scores, whereas A* decoding does). Viterbi Search
Viterbi Search Algorithm
function Viterbi(observations of length T, state-graph) returns best-path
  num-states <- number-of-states(state-graph)
  create path probability matrix viterbi[num-states+2, T+2]
  viterbi[0,0] <- 1.0
  for each time step t from 0 to T do
    for each state s from 0 to num-states do
      for each transition s' from s in state-graph do
        new-score <- viterbi[s,t] * a[s,s'] * b_s'(o_t)
        if (viterbi[s',t+1] = 0) or (viterbi[s',t+1] < new-score) then
          viterbi[s',t+1] <- new-score
          back-pointer[s',t+1] <- s
  backtrace from the highest-probability state in the final column of viterbi[] and return the path
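A runnable counterpart of this pseudocode, for a small discrete-observation HMM, might look like the following sketch (the two-state parameters and observation alphabet are toy values, not from the slides):

```python
import numpy as np

def viterbi(obs, pi, A, B):
    """Viterbi decoding: obs = observation indices, pi = initial state
    probabilities, A[i, j] = P(j | i), B[s, o] = P(o | s)."""
    n_states, T = A.shape[0], len(obs)
    score = np.zeros((n_states, T))          # best path probability so far
    back = np.zeros((n_states, T), dtype=int)  # back-pointers
    score[:, 0] = pi * B[:, obs[0]]
    for t in range(1, T):
        for s in range(n_states):
            cand = score[:, t - 1] * A[:, s]   # extend every path into s
            back[s, t] = int(np.argmax(cand))  # keep MAX, not SUM
            score[s, t] = cand[back[s, t]] * B[s, obs[t]]
    # Backtrace from the highest-probability state in the final column.
    path = [int(np.argmax(score[:, -1]))]
    for t in range(T - 1, 0, -1):
        path.append(int(back[path[-1], t]))
    return path[::-1]

pi = np.array([0.6, 0.4])
A = np.array([[0.7, 0.3], [0.4, 0.6]])
B = np.array([[0.5, 0.4, 0.1], [0.1, 0.3, 0.6]])
print(viterbi([0, 1, 2], pi, A, B))  # → [0, 0, 1]
```

In a real decoder the products would be log-probabilities (to avoid the underflow mentioned earlier), and beam pruning would discard low-scoring states at each time step.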
(Figure: trellis of the states of words W1 and W2 over time steps t = 0..3.) Viterbi Search Trellis
(Figure: between time t and time t+1, within-word transitions update scores as OldProb(S1) * OutProb * TransProb, while cross-word transitions from Word 1 to Word 2 apply the language model score OldProb(S3) * P(W2 | W1); each trellis cell for states S1-S3 stores a score, a backptr, and a parmptr.) Viterbi Search Insight
Find the best association between the words and the signal; compose words from phones using the dictionary. Backtracking finds the best state sequence. (Figure: phones /th/ /e/ aligned against the signal from t1 to tn.) Backtracking