Learn about spoken dialog systems (SDS) for HRI, focusing on language processing rather than signal processing. This tutorial covers various aspects of SDS, including automatic speech recognition, spoken language understanding, dialog management, and more.
SPOKEN DIALOG SYSTEM FOR INTELLIGENT SERVICE ROBOTS Intelligent Software Lab. POSTECH Prof. Gary Geunbae Lee
This Tutorial • Introduction to Spoken Dialog System (SDS) for Human-Robot Interaction (HRI) • Brief introduction to SDS • Language processing oriented • But not signal processing oriented • Mainly based on papers at • ACL, NAACL, HLT, ICASSP, INTERSPEECH, ASRU, SLT, SIGDIAL, CSL, SPECOM, IEEE TASLP
OUTLINES • INTRODUCTION • AUTOMATIC SPEECH RECOGNITION • SPOKEN LANGUAGE UNDERSTANDING • DIALOG MANAGEMENT • CHALLENGES & ISSUES • MULTI-MODAL DIALOG SYSTEM • DIALOG SIMULATOR • DEMOS • REFERENCES
What is HRI? • Wikipedia (http://en.wikipedia.org/wiki/Human_robot_interaction) Human-robot interaction (HRI) is the study of interactions between people and robots. HRI is multidisciplinary with contributions from the fields of human-computer interaction, artificial intelligence, robotics, natural language understanding, and social science. The basic goal of HRI is to develop principles and algorithms to allow more natural and effective communication and interaction between humans and robots.
Area of HRI • Vision • Learning • Emotion • Speech (Signal Processing, Speech Recognition, Speech Understanding, Dialog Management, Speech Synthesis) • Haptics
SDS APPLICATIONS • Home networking • Car navigation • Tele-service • Robot interface
SCIENCE FICTION • Eagle Eye (2008, D.J. Caruso)
AUTOMATIC SPEECH RECOGNITION • A process by which an acoustic speech signal is converted into a set of words [Rabiner et al., 1993] • Input x: Speech • Output y: Words • A learning algorithm is trained on examples (x, y)
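NOISY CHANNEL MODEL • GOAL • Find the most likely sequence W of “words” in language L given the sequence of acoustic observation vectors O • Treat acoustic input O as a sequence of individual observations • O = o1, o2, o3, …, ot • Define a sentence as a sequence of words • W = w1, w2, w3, …, wn • Bayes rule: $\hat{W} = \arg\max_{W \in L} P(W \mid O) = \arg\max_{W \in L} \frac{P(O \mid W)\,P(W)}{P(O)}$ • Golden rule: $\hat{W} = \arg\max_{W \in L} P(O \mid W)\,P(W)$, since P(O) does not depend on W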
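TRADITIONAL ARCHITECTURE • Recognition path: Speech Signals → Feature Extraction → Decoding → Word Sequence • Example: the spoken utterance 버스 정류장이 어디에 있나요? ("Where is the bus stop?") is decoded into the same word sequence • Model training: Speech DB → HMM Estimation → Acoustic Model • G2P → Pronunciation Model • Text Corpora → LM Estimation → Language Model • Network Construction combines the three models into the network used for Decoding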
FEATURE EXTRACTION • Mel-Frequency Cepstral Coefficients (MFCC) are a popular choice [Paliwal, 1992] • Frame size: 25 ms / Frame shift: 10 ms (one feature vector a1, a2, a3, … every 10 ms) • 39 features per 10 ms frame • Absolute: Log Frame Energy (1) and MFCCs (12) • Delta: First-order derivatives of the 13 absolute coefficients • Delta-Delta: Second-order derivatives of the 13 absolute coefficients • Pipeline: X(n) → Preemphasis / Hamming Window → FFT (Fast Fourier Transform) → Mel-scale filter bank → log|.| → DCT (Discrete Cosine Transform) → MFCC (12-Dimension)
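The frame arithmetic above can be sketched in a few lines of Python. This is only an illustration: the 16 kHz sample rate, the function names, and the simple central-difference delta are assumptions made for the example, not the exact regression formula a toolkit would use.

```python
def frame_count(num_samples, sample_rate, frame_ms=25, shift_ms=10):
    """Number of full analysis frames that fit in the signal."""
    frame_len = sample_rate * frame_ms // 1000
    shift = sample_rate * shift_ms // 1000
    if num_samples < frame_len:
        return 0
    return 1 + (num_samples - frame_len) // shift


def deltas(frames):
    """Central differences between neighboring frames (edges are clamped)."""
    last = len(frames) - 1
    out = []
    for t in range(len(frames)):
        prev, nxt = frames[max(t - 1, 0)], frames[min(t + 1, last)]
        out.append([(n - p) / 2.0 for p, n in zip(prev, nxt)])
    return out


def stack_features(static):
    """Concatenate static, delta, and delta-delta coefficients per frame."""
    d1 = deltas(static)
    d2 = deltas(d1)
    return [s + a + b for s, a, b in zip(static, d1, d2)]


# One second of audio at an assumed 16 kHz sample rate.
n_frames = frame_count(16000, 16000)
# Dummy 13-dimensional static vectors (log energy + 12 MFCCs) per frame.
static = [[float(t)] * 13 for t in range(n_frames)]
features = stack_features(static)
print(n_frames, len(features[0]))
```

With 13 static coefficients per frame, stacking the delta and delta-delta terms gives the 39-dimensional vector mentioned on the slide.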
ACOUSTIC MODEL • Provide P(O|Q) = P(features|phone) • Modeling Units [Bahl et al., 1986] • Context-independent: Phoneme • Context-dependent: Diphone, Triphone, Quinphone • pL-p+pR: left-right context triphone • Typical acoustic model [Juang et al., 1986] • Continuous-density Hidden Markov Model • Distribution: Gaussian Mixture • HMM Topology: 3-state left-to-right model for each phone, 1-state for silence or pause • Figure labels: state output distribution bj(x), codebook
PRONUNCIATION MODEL • Provide P(Q|W) = P(phone|word) • Word Lexicon [Hazen et al., 2002] • Map legal phone sequences into words according to phonotactic rules • G2P (Grapheme to phoneme): Generate a word lexicon automatically • A word may have multiple pronunciations • Example • Tomato • P([towmeytow]|tomato) = P([towmaatow]|tomato) = 0.1 • P([tahmeytow]|tomato) = P([tahmaatow]|tomato) = 0.4 • Pronunciation network: [t] → [ow] (0.2) or [ah] (0.8) → [m] (1.0) → [ey] (0.5) or [aa] (0.5) → [t] (1.0) → [ow] (1.0)
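As a rough check of the tomato example, the probability of a pronunciation is the product of the branch probabilities along its path. The sketch below stores the network as one dictionary per phone position; this layout and the names `TOMATO` and `pronunciation_prob` are assumptions for illustration, and the 0.2/0.8 split is inferred from the four listed probabilities.

```python
# One dict per phone position: phone -> branch probability.
TOMATO = [
    {"t": 1.0},
    {"ow": 0.2, "ah": 0.8},
    {"m": 1.0},
    {"ey": 0.5, "aa": 0.5},
    {"t": 1.0},
    {"ow": 1.0},
]


def pronunciation_prob(phones, network=TOMATO):
    """P(phone sequence | word) as a product of branch probabilities."""
    if len(phones) != len(network):
        return 0.0
    prob = 1.0
    for phone, choices in zip(phones, network):
        prob *= choices.get(phone, 0.0)
    return prob


for variant in (["t", "ow", "m", "ey", "t", "ow"],
                ["t", "ah", "m", "aa", "t", "ow"]):
    print("".join(variant), pronunciation_prob(variant))
```

The four variants together account for all of the probability mass, which is a quick sanity check on a hand-written lexicon entry.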
LANGUAGE MODEL • Provide P(W), the probability of the sentence [Beaujard et al., 1999] • It is also used in the decoding process as the probability of transitioning from one word to another • Word sequence: W = w1, w2, w3, …, wn • Chain rule: $P(W) = \prod_{i=1}^{n} P(w_i \mid w_1, \dots, w_{i-1})$ • The problem is that we cannot reliably estimate the conditional word probabilities for all words and all sequence lengths in a given language • n-gram Language Model • n-gram language models use the previous n-1 words to represent the history, e.g. for a bigram $P(w_i \mid w_1, \dots, w_{i-1}) \approx P(w_i \mid w_{i-1})$ • Bigrams are easily incorporated in a Viterbi search
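LANGUAGE MODEL • Example (Korean vocabulary: 세시 "three o'clock", 네시 "four o'clock", 서울 Seoul, 부산 Busan, 대구 Daegu, 대전 Daejeon, 기차 "train", 버스 "bus", 에서 "from", 출발 "departing", 도착 "arriving", 하는 modifier ending "that is ...") • Finite State Network (FSN): a word graph over the vocabulary above • Context Free Grammar (CFG): $time = 세시|네시; $city = 서울|부산|대구|대전; $trans = 기차|버스; $sent = $city (에서 $time 출발 | 출발 $city 도착) 하는 $trans • Bigram: P(에서|서울)=0.2 P(세시|에서)=0.5 P(출발|세시)=1.0 P(하는|출발)=0.5 P(출발|서울)=0.5 P(도착|대구)=0.9 …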
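A bigram model scores a word sequence by multiplying the listed conditional probabilities. The sketch below uses only the values shown in the example; the table is incomplete (the slide elides the rest), so unseen pairs are treated as probability zero, and the function name is an assumption for illustration.

```python
import math

# Bigram probabilities P(current | previous) copied from the example.
BIGRAM = {
    ("서울", "에서"): 0.2,  # Seoul -> from
    ("에서", "세시"): 0.5,  # from -> three o'clock
    ("세시", "출발"): 1.0,  # three o'clock -> departing
    ("출발", "하는"): 0.5,  # departing -> (modifier ending)
    ("서울", "출발"): 0.5,  # Seoul -> departing
    ("대구", "도착"): 0.9,  # Daegu -> arriving
}


def bigram_log_prob(words, bigram=BIGRAM):
    """Log P(w2..wn | w1) under the bigram table; -inf if a pair is unseen."""
    total = 0.0
    for prev, cur in zip(words, words[1:]):
        p = bigram.get((prev, cur), 0.0)
        if p == 0.0:
            return float("-inf")
        total += math.log(p)
    return total


sentence = ["서울", "에서", "세시", "출발", "하는"]
print(math.exp(bigram_log_prob(sentence)))
```

Working in log space avoids numeric underflow on long sentences, which is the usual reason decoders add log probabilities instead of multiplying raw ones.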
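NETWORK CONSTRUCTION • Expanding every word to state level, we get a search network [Demuynck et al., 1997] • Acoustic Model + Pronunciation Model + Language Model → Search Network • Example vocabulary (Korean digits): 일 "one" = I L, 이 "two" = I, 삼 "three" = S A M, 사 "four" = S A • Intra-word transitions connect the phone states inside a word • Between-word (word) transitions connect the end of one word to the start of the next; this is where the LM is applied, with P(일|x), P(이|x), P(삼|x), P(사|x) • The network runs from a start node to an end node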
DECODING • Find $\hat{W} = \arg\max_{W} P(O \mid W)\,P(W)$ • Viterbi Search: Dynamic Programming • Token Passing Algorithm [Young et al., 1989] • Initialize all states with a token with a null history and the likelihood that it’s a start state • For each frame ak • For each token t in state s with probability P(t), history H • For each state r • Add new token to r with probability P(t) · Ps,r · Pr(ak), and history s.H • Keep only the best token in each state
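The token passing loop can be sketched as follows. This is a simplified illustration, not HTK's implementation: it scores the first frame during initialization instead of starting from a null history, it works with raw probabilities instead of log likelihoods, and the two-state HMM in the usage lines is invented for the example.

```python
def token_passing(observations, states, start_p, trans_p, emit_p):
    """Viterbi search via token passing: each state keeps its best token.

    A token is a (probability, history) pair, where history is the tuple of
    states visited so far.
    """
    # Initialize every state with a token scored on the first frame.
    tokens = {
        s: (start_p.get(s, 0.0) * emit_p[s].get(observations[0], 0.0), (s,))
        for s in states
    }
    for frame in observations[1:]:
        new_tokens = {r: (0.0, ()) for r in states}
        for s, (prob, history) in tokens.items():
            for r in states:
                score = prob * trans_p[s].get(r, 0.0) * emit_p[r].get(frame, 0.0)
                # Keep only the best token arriving in state r.
                if score > new_tokens[r][0]:
                    new_tokens[r] = (score, history + (r,))
        tokens = new_tokens
    return max(tokens.values(), key=lambda tok: tok[0])


# Hypothetical two-state left-to-right HMM with discrete observations.
states = ["A", "B"]
start_p = {"A": 1.0, "B": 0.0}
trans_p = {"A": {"A": 0.5, "B": 0.5}, "B": {"B": 1.0}}
emit_p = {"A": {"x": 0.9, "y": 0.1}, "B": {"x": 0.2, "y": 0.8}}

best_prob, best_path = token_passing(["x", "x", "y"], states, start_p, trans_p, emit_p)
print(best_path, best_prob)
```

A practical decoder would also prune low-scoring tokens (beam search) and record word boundaries in the history instead of every state.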
HTK • Hidden Markov Model Toolkit (HTK) • A portable toolkit for building and manipulating hidden Markov models [Young et al., 1996] • HShell: User I/O & interaction with OS • HLabel: Label files • HLM: Language model • HNet: Network and lattices • HDict: Dictionaries • HVQ: VQ codebooks • HModel: HMM definitions • HMem: Memory management • HGraf: Graphics • HAdapt: Adaptation • HRec: Main recognition processing functions
SUMMARY • Speech (x) → Decoding → Words (y) • Decoding searches the network produced by Search Network Construction from the Acoustic Model, Pronunciation Model, and Language Model • The models are estimated by a learning algorithm from training examples (x, y)
SPEECH UNDERSTANDING (in SDS) • A process by which natural language speech is mapped to a frame structure encoding its meaning [Mori et al., 2008] • Input x: Speech or Words • Output y: Intentions • A learning algorithm is trained on examples (x, y)
LANGUAGE UNDERSTANDING • What is the difference between NLU and SLU? • Robustness: noisy and ungrammatical spoken language • Domain dependence: deeper, domain-specific semantics (e.g. Person vs. Cast) • Dialog: depends on dialog history; analysis proceeds utterance by utterance • Traditional approach: natural language to SQL conversion • A typical ATIS system (from [Wang et al., 2005]): Speech → ASR → Text → SLU → Semantic Frame → SQL Generate → SQL → Database → Response
REPRESENTATION • Semantic frame (slot/value structure) [Gildea and Jurafsky, 2002] • An intermediate semantic representation that serves as the interface between the user and the dialog system • Each frame contains several typed components called slots. The type of a slot specifies what kind of fillers it expects. • Example: “Show me flights from Seattle to Boston” • XML format:
<frame name='ShowFlight' type='void'>
  <slot type='Subject'>FLIGHT</slot>
  <slot type='Flight'>
    <slot type='DCity'>SEA</slot>
    <slot type='ACity'>BOS</slot>
  </slot>
</frame>
• Hierarchical representation: ShowFlight → Subject (FLIGHT) and Flight → Departure_City (SEA), Arrival_City (BOS) • Semantic representation on ATIS task [Wang et al., 2005]
SEMANTIC FRAME • Meaning Representations for Spoken Dialog System • Slot type 1: Intent, Subject Goal, Dialog Act (DA) • The meaning (intention) of an utterance at the discourse level • Slot type 2: Component Slot, Named Entity (NE) • The identifier of an entity such as a person, location, organization, or time. In SLU, it represents the domain-specific meaning of a word (or word group). • Ex) Find Korean restaurants in Daeyidong, Pohang
<frame domain='RestaurantGuide'>
  <slot type='DA' name='SEARCH_RESTAURANT'/>
  <slot type='NE' name='CITY'>Pohang</slot>
  <slot type='NE' name='ADDRESS'>Daeyidong</slot>
  <slot type='NE' name='FOOD_TYPE'>Korean</slot>
</frame>
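A downstream component such as the dialog manager needs the frame as a data structure, not as XML text. A minimal sketch using Python's standard `xml.etree.ElementTree` is shown below; the output dictionary layout and the name `parse_frame` are assumptions for illustration.

```python
import xml.etree.ElementTree as ET

FRAME_XML = """
<frame domain='RestaurantGuide'>
  <slot type='DA' name='SEARCH_RESTAURANT'/>
  <slot type='NE' name='CITY'>Pohang</slot>
  <slot type='NE' name='ADDRESS'>Daeyidong</slot>
  <slot type='NE' name='FOOD_TYPE'>Korean</slot>
</frame>
"""


def parse_frame(xml_text):
    """Turn a semantic-frame XML string into a plain dictionary."""
    root = ET.fromstring(xml_text.strip())
    frame = {"domain": root.get("domain"), "dialog_act": None, "entities": {}}
    for slot in root.findall("slot"):
        if slot.get("type") == "DA":
            frame["dialog_act"] = slot.get("name")
        elif slot.get("type") == "NE":
            frame["entities"][slot.get("name")] = (slot.text or "").strip()
    return frame


frame = parse_frame(FRAME_XML)
print(frame["dialog_act"], frame["entities"])
```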
HOW TO SOLVE • Two Classification Problems • Dialog Act Identification • Input: Find Korean restaurants in Daeyidong, Pohang • Output: SEARCH_RESTAURANT • Named Entity Recognition • Input: Find Korean restaurants in Daeyidong, Pohang • Output: Korean = FOOD_TYPE, Daeyidong = ADDRESS, Pohang = CITY
PROBLEM FORMALIZATION • Encoding: • x is an input (word), y is an output (NE), and z is another output (DA). • Vector x = {x1, x2, x3, …, xT} • Vector y = {y1, y2, y3, …, yT} • Scalar z • Goal: modeling the functions y=f(x) and z=g(x)
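For the restaurant example, the encoding could look like the sketch below. The tokenization and the `O` label for words outside any entity are assumptions (the slides do not specify a tagging scheme), and `to_frame` is a hypothetical helper.

```python
# x: input words, y: one NE label per word, z: a single dialog act.
x = ["Find", "Korean", "restaurants", "in", "Daeyidong", ",", "Pohang"]
y = ["O", "FOOD_TYPE", "O", "O", "ADDRESS", "O", "CITY"]
z = "SEARCH_RESTAURANT"


def to_frame(x, y, z, outside="O"):
    """Collect labeled words into slots; words sharing a label are joined."""
    if len(x) != len(y):
        raise ValueError("x and y must have the same length T")
    entities = {}
    for word, label in zip(x, y):
        if label != outside:
            entities[label] = (entities.get(label, "") + " " + word).strip()
    return {"dialog_act": z, "entities": entities}


print(to_frame(x, y, z))
```

The length check reflects the formalization: y is a sequence aligned with x, while z is one label for the whole utterance.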
CASCADE APPROACH I • Named Entity → Dialog Act
CASCADE APPROACH II • Dialog Act → Named Entity • Improves NE, but not DA.
JOINT APPROACH • Named Entity ↔ Dialog Act [Jeong and Lee, 2006]
MACHINE LEARNING FOR SLU • Relational Learning (RL) or Structured Prediction (SP) [Dietterich, 2002; Lafferty et al., 2004; Sutton and McCallum, 2006] • Structured or relational patterns are important because they can be exploited to improve the prediction accuracy of our classifier • Argmax search (e.g. Sum-Max, Belief propagation, Viterbi, etc.) • Basically, RL for language processing uses a left-to-right structure (a.k.a. linear-chain or sequence structure) • Algorithms: CRFs, Max-Margin Markov Net (M3N), SVM for Independent and Structured Output (SVM-ISO), Structured Perceptron, etc.
MACHINE LEARNING FOR SLU • Background: Maximum Entropy (a.k.a. logistic regression) • Conditional and discriminative manner • Unstructured! (no dependency in y) • Dialog act classification problem • Conditional Random Fields [Lafferty et al. 2001] • Structured versions of MaxEnt (argmax search in inference) • Undirected graphical models • Popular in language and text processing • Linear-chain structure for practical implementation • Named entity recognition problem • Figure: graphical model over the inputs xt-1, xt, xt+1, the labels yt-1, yt, yt+1, and the dialog act z, with feature functions hk, fk, gk
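The unstructured MaxEnt case can be sketched as a softmax over summed feature weights. The weights, the second label `CLOSING`, and the bag-of-words features below are all invented for illustration; a real model would learn the weights from training examples.

```python
import math

# Hypothetical feature weights: (word, dialog act) -> weight.
WEIGHTS = {
    ("find", "SEARCH_RESTAURANT"): 1.5,
    ("restaurants", "SEARCH_RESTAURANT"): 2.0,
    ("thanks", "CLOSING"): 2.5,
}
LABELS = ["SEARCH_RESTAURANT", "CLOSING"]


def maxent_posterior(words, weights=WEIGHTS, labels=LABELS):
    """P(z | x) proportional to exp(sum of weights of active features)."""
    scores = {
        z: sum(weights.get((w.lower(), z), 0.0) for w in words) for z in labels
    }
    top = max(scores.values())  # subtract the max for numerical stability
    exps = {z: math.exp(s - top) for z, s in scores.items()}
    total = sum(exps.values())
    return {z: e / total for z, e in exps.items()}


posterior = maxent_posterior("Find Korean restaurants in Daeyidong , Pohang".split())
print(max(posterior, key=posterior.get), posterior)
```

A linear-chain CRF uses the same exponential form but normalizes over whole label sequences, which is why inference needs an argmax search such as Viterbi instead of a per-label softmax.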
SUMMARY • Dialog Act Identification • Input: Find Korean restaurants in Daeyidong, Pohang • Output: SEARCH_RESTAURANT • Solve with an isolated (or independent) classifier such as Naïve Bayes or MaxEnt • Named Entity Recognition • Input: Find Korean restaurants in Daeyidong, Pohang • Output: Korean = FOOD_TYPE, Daeyidong = ADDRESS, Pohang = CITY • Solve with a structured (or relational) classifier such as HMM or CRFs
DIALOG MANAGEMENT • A central component of a dialog system to produce system responses with external knowledge sources [McTear, 2004] • Input x: Words or Intentions • Output y: System Response • A learning algorithm is trained on examples (x, y)
DIALOG MANAGEMENT • GOAL • Answer the user's query (e.g., a question or an order) given the task domain • It includes: • Provide query results • Ask for further slot information • Confirm user utterance • Notify invalid query • Suggest an alternative • Related to dialog complexity and task complexity • In practice • Find the best system action a given the dialog state s
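Choosing a system action a from the dialog state s can be sketched as a small hand-written policy covering the five action types listed above. The state layout, the required slots, and the action names are assumptions for illustration; a deployed system might learn this mapping instead.

```python
REQUIRED_SLOTS = ("CITY", "FOOD_TYPE")  # hypothetical task definition


def select_action(state):
    """Map a dialog state to one (action, argument) pair."""
    slots = state.get("slots", {})
    for name in REQUIRED_SLOTS:
        if not slots.get(name):
            return ("ASK_SLOT", name)  # ask for further slot information
    if not state.get("confirmed", False):
        return ("CONFIRM", dict(slots))  # confirm user utterance
    results = state.get("results")
    if results is None:
        return ("NOTIFY_INVALID_QUERY", None)  # query could not be run
    if not results:
        return ("SUGGEST_ALTERNATIVE", state.get("alternatives", []))
    return ("PROVIDE_RESULTS", results)


state = {
    "slots": {"CITY": "Seoul", "FOOD_TYPE": "African"},
    "confirmed": True,
    "results": [],
    "alternatives": ["Korean", "Japanese"],
}
print(select_action(state))
```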
DESIGN ISSUES • Task complexity • How hard is the task? • How much domain knowledge does the system have? • Example tasks placed on a Simple-to-Complex scale in the figure: Weather Information, Automatic Banking, Conversational English Tutoring, Call Routing, Collaborative Planning
DESIGN ISSUES • Dialog complexity • Which dialog phenomena are allowed • Initiative strategies • e.g., system-initiative vs. user-initiative vs. mixed-initiative • Meta-dialogs: dialog about the dialog itself • e.g., Could you hold on for a minute? • Subdialogs: clarification/confirmation • e.g., You selected KE airlines, is that right? • Multiple dialog threads • e.g., domain switching
DIALOG EXAMPLES • Example 1
S: Where are you going?
U: Gangnam, Seoul
S: What kind of food would you like in Gangnam, Seoul?
U: Please African food
S: Sorry, there are no African restaurants. Please select another food.
• System-initiative • Implicit confirmation • Merely informs user of failed query • Rigid dialog flow
DIALOG EXAMPLES • Example 2
S: May I help you?
U: I’d like to have African food in Gangnam, Seoul
S: Sorry, there are no African restaurants.
S: There are Korean, Japanese, Indian, and American restaurants in Gangnam.
U: Please Korean restaurants.
S: Would you like to find Korean restaurants?
U: Yes, please
• Mixed-initiative • Implicit/Explicit confirmation • Suggests the alternative when query fails • More natural dialog flow
DIALOG EXAMPLES • Example 3
U: I’d like to have African food in Gangnam, Seoul
S: Sorry, there are no African restaurants.
S: How about American restaurants in Gangnam, Seoul?
U: No I don’t like it.
S: What is your favorite food?
U: I like grilled and seasoned beef
S: So, how about Korean restaurants?
U: Good.
• Mixed-initiative • Implicit/Explicit confirmation • Recommends the alternative when query fails • Most natural dialog flow
DIALOG CONTROL • Finite-state based approach • Input: Single word or phrase • State transition network (or graph) • It allows only the legal dialog flows that are predefined in the state diagram • Frame-based approach • Input: Natural language with concept spotting • Form-filling tasks to access an information source • But the questions do not have to be asked in a predetermined sequence • Plan-based approach • Input: Unrestricted natural language • Models dialog as collaboration between intelligent agents to solve some problem or task • For more complex tasks, such as negotiation and problem solving
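The finite-state approach can be sketched as a transition table keyed by (state, input type). The state names and input types below are invented, loosely following Example 1; an unexpected input leaves the state unchanged, which corresponds to the system repeating its prompt.

```python
# Hypothetical state transition network: (state, input type) -> next state.
TRANSITIONS = {
    ("ASK_LOCATION", "location"): "ASK_FOOD",
    ("ASK_FOOD", "food"): "SHOW_RESULTS",
}


def run_dialog(inputs, start="ASK_LOCATION"):
    """Follow the predefined transitions; unexpected input keeps the state."""
    state = start
    for kind in inputs:
        state = TRANSITIONS.get((state, kind), state)
    return state


print(run_dialog(["location", "food"]))  # follows the scripted order
print(run_dialog(["food"]))              # out-of-order input is not accepted
```

The second call shows the rigidity of this design: a frame-based manager would accept the food type first and ask for the location afterwards, because it tracks which slots are filled, not a position in a fixed graph.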