Spoken Language Understanding for Conversational Dialog Systems. Michael McTear University of Ulster. IEEE/ACL 2006 Workshop on Spoken Language Technology Aruba, December 10-13, 2006. Overview. Introductory definitions Task-based and conversational dialog systems
Overview
• Introductory definitions
  • Task-based and conversational dialog systems
  • Spoken language understanding
• Issues for spoken language understanding
  • Coverage
  • Robustness
• Overview of spoken language understanding
  • Hand-crafted approaches
  • Data-driven methods
• Conclusions
Basic dialog system architecture
[Figure: Audio → Speech Recognition (HMM acoustic model, n-gram language model) → Words → Spoken Language Understanding → Semantic representation → Dialogue Manager (connected to Back end) → Concepts → Language Generation → Words → Text to Speech Synthesis → Audio]
Task-based Dialog Systems
• Mainly interact with databases to get information or support transactions
• SLU module creates a database query from the user’s spoken input by extracting relevant concepts
• System initiative: constrains user input
  • Keyword / keyphrase extraction
• User initiative: less constrained input
  • Call routing: call classification with named entity extraction
  • Question answering
Conversational Dialog
• AI (agent-based systems), e.g. TRIPS
  • User can take initiative, e.g. raise new topic, ask for clarification (TRIPS)
  • More complex interactions involving recognition of the user’s intentions, goals, beliefs or plans
  • Deep understanding of the user’s utterance, taking into account contextual information
  • Information State Theory, Planning Theory, User Modelling, Belief Modelling…
• Simulated conversation, e.g. CONVERSE
  • Conversational companions, chatbots, help desk
  • Does not require deep understanding
  • SLU involves identifying system utterance type and determining a suitable response
Defining Spoken Language Understanding
• Extracting the meaning from speech utterances
• A transduction of the recognition result to an interpretable representation
• Meaning (in human–computer interactive systems): a representation that can be executed by an interpreter in order to change the state of the system
(Bangalore et al. 2006)
SLU for task-based systems
Example utterances:
• a flight from Belfast to Malaga
• uh I’d like uh um could you uh is there a flight from Bel- uh Belfast to um Gran- I mean Malaga
• I would like to find a flight from Pittsburgh to Boston on Wednesday and I have to be in Boston by one so I would like a flight out of here no later than 11 a.m.
Semantic representation of the first two utterances:
  Topic: Flight
  Origin: BFS
  Destination: AGP
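The mapping from a disfluent utterance to the frame on this slide can be illustrated with a minimal concept-spotting sketch. It is not the method of any system named in the talk: the function name `extract_flight_frame`, the lookup table `AIRPORT_CODES` and the "from"/"to" trigger words are assumptions made for illustration.

```python
import re

# Hypothetical city-to-airport-code table; only these two codes appear on the slide.
AIRPORT_CODES = {"belfast": "BFS", "malaga": "AGP"}

def extract_flight_frame(utterance: str) -> dict:
    """Spot 'from <city>' and 'to <city>' concepts and ignore everything else."""
    # Keep alphabetic runs only, so a fragment such as "Bel-" becomes "bel"
    # and cannot be mistaken for a city name.
    words = re.findall(r"[a-z]+", utterance.lower())
    frame = {}
    if "flight" in words:
        frame["Topic"] = "Flight"
    slot = None
    for word in words:
        if word == "from":
            slot = "Origin"
        elif word == "to":
            slot = "Destination"
        elif word in AIRPORT_CODES and slot is not None:
            # The slot stays open, so a later city overwrites an earlier one.
            frame[slot] = AIRPORT_CODES[word]
    return frame

clean = extract_flight_frame("a flight from Belfast to Malaga")
noisy = extract_flight_frame(
    "uh I'd like uh um could you uh is there a flight "
    "from Bel- uh Belfast to um Gran- I mean Malaga"
)
print(clean == noisy, noisy)
```

Filled pauses and word fragments are simply skipped, which is why both utterances yield the same frame. The third utterance on the slide (arrival and departure time constraints) would need more than this kind of spotting.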
SLU for advanced conversational systems (TRIPS)
• Interpretation requires intention recognition
• Example: “can we use a helicopter to get the people from …”
  • Abyss (request to modify plan)
  • Barnacle (include sub-goal and suggest solution)
  • Delta (extension of a solution)
• Six possible interpretations with only change of city name
• Requires reasoning about task and current context to identify most plausible interpretation
• Requires more than concept spotting to identify structure and meaning of utterance as basis for reasoning
Role of syntax in deep understanding
• List all employees of the companies who/which are based in the city centre
  • “who”: employees based in city centre
  • “which”: companies based in city centre
• I would like to know where to mail the check.
  • I mail check to ?where
• I would like you to mail the check to me
  • You mail check to me
• Stop word removal will transform both strings to “mail check”
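The stop-word point can be checked with a short sketch. The stop-word list below is an assumption chosen for this illustration, not one taken from the talk.

```python
# Hypothetical stop-word list chosen for this sketch.
STOP_WORDS = {"i", "you", "me", "to", "the", "would", "like", "know", "where"}

def content_words(sentence: str) -> list:
    """Lower-case the sentence, drop a final full stop, and remove stop words."""
    return [w for w in sentence.lower().rstrip(".").split() if w not in STOP_WORDS]

a = content_words("I would like to know where to mail the check.")
b = content_words("I would like you to mail the check to me")
print(a, b, a == b)
```

Both sentences reduce to the same two content words, so who mails the check to whom can only be recovered from the syntactic structure.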
SLU for simulated dialog
C: Are you still a friend of XXX?
H: I am not sure if he has any real friends. He has achieved a lot and has left a lot of people behind.
C: You really don’t like the guy - well, at least that’s definite - I thought most people were just indifferent or numb by now.
Coverage
• Possible requirement: “The system should be able to understand everything the user might say”
• Predicting user input
  • Analysis of corpora and iterative design of hand-crafted grammars
  • Use of carefully designed prompts to constrain user input
  • Learning grammar from data
Robustness
• Characteristics of spontaneous spoken language
  • Disfluencies and filled pauses: not just errors, they reflect cognitive aspects of speech production and interaction management
• Output from speech recognition component
  • Words and word boundaries not known with certainty
  • Recognition errors
• Approaches
  • Use of semantic grammars and robust parsing for concept spotting
  • Data-driven approaches: learn mappings between input strings and output structures
Developing the SLU component
• Hand-crafted approaches
  • Grammar development
  • Parsing
• Data-driven approaches
  • Learning from data
  • Statistical models rather than grammars
  • Efficient decoding
Hand-crafting grammars
[Figure: processing pipeline: ASR (n-best list, word lattice, …) → Parsing (parse tree) → Frame Generation (semantic frame) → Discourse Processing (frame in context) → DB Query (SQL query)]
• Traditional software engineering approach of design and iterative refinement
• Decisions about type of grammar required
  • Chomsky hierarchy
  • Flat v hierarchical representations
• Processing issues (parsing)
  • Dealing with ambiguity
  • Efficiency
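The last stage of the pipeline on this slide, turning a semantic frame into an SQL query, might look like the following sketch. The table `flights`, its columns and its rows are hypothetical, as is the helper `frame_to_query`. The slide only names the stages.

```python
import sqlite3

# Hypothetical schema: the frame slots are assumed to map directly to column names.
ALLOWED_COLUMNS = {"origin", "destination"}

def frame_to_query(frame: dict):
    """Turn a semantic frame into a parameterised SQL query and its arguments."""
    slots = sorted(k for k in frame if k in ALLOWED_COLUMNS)
    where = " AND ".join(f"{k} = ?" for k in slots)
    sql = "SELECT flight_no FROM flights" + (f" WHERE {where}" if where else "")
    return sql, [frame[k] for k in slots]

# Toy in-memory database with made-up rows.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE flights (flight_no TEXT, origin TEXT, destination TEXT)")
conn.executemany(
    "INSERT INTO flights VALUES (?, ?, ?)",
    [("X1", "BFS", "AGP"), ("X2", "BFS", "LHR")],
)

sql, params = frame_to_query({"origin": "BFS", "destination": "AGP"})
rows = conn.execute(sql, params).fetchall()
print(sql, params, rows)
```

Slot names are checked against a whitelist and slot values are passed as query parameters, so recognised text is never spliced into the SQL string.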
Semantic Grammar and Robust Parsing: PHOENIX (CMU/CU)
[Figure: ASR → word string → Semantic Parser → meaning representation]
The Phoenix parser maps input word strings on to a sequence of semantic frames.
• A frame is a named set of slots, where the slots represent related pieces of information
• Each slot has an associated context-free grammar that specifies word string patterns that match the slot
• Chart parsing with path pruning, e.g. a path that accounts for fewer words is pruned
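The pruning idea, preferring the path that accounts for more words, can be sketched as follows. This is not Phoenix itself: fixed phrase lists stand in for the per-slot context-free grammars, and a small dynamic programme stands in for the chart parser. The slot names and phrases in `SLOT_PHRASES` are assumptions.

```python
# Hypothetical slot "grammars": each slot lists the word sequences that match it.
SLOT_PHRASES = {
    "origin": [("from", "pittsburgh")],
    "destination": [("to", "boston")],
    "day": [("on", "wednesday"), ("wednesday",)],
}

def parse(words):
    """Return (words covered, slot matches) for the best-covering path."""
    n = len(words)
    # best[i] holds the best result for the suffix words[i:].
    best = [(0, [])] * (n + 1)
    for i in range(n - 1, -1, -1):
        best[i] = best[i + 1]  # option 1: skip word i (robustness to unknown words)
        for slot, phrases in SLOT_PHRASES.items():
            for phrase in phrases:
                j = i + len(phrase)
                if tuple(words[i:j]) == phrase:
                    covered = len(phrase) + best[j][0]
                    # Keep only the path that accounts for more words.
                    if covered > best[i][0]:
                        best[i] = (covered, [(slot, " ".join(phrase))] + best[j][1])
    return best[0]

words = "i would like a flight from pittsburgh to boston on wednesday".split()
covered, slots = parse(words)
print(covered, slots)
```

Here "on wednesday" wins over the bare "wednesday" match because it covers two words instead of one, while the unmatched words at the start are skipped.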
Deriving Meaning directly from ASR output: VoiceXML
[Figure: ASR → meaning representation]
Uses finite state grammars as language models for recognition and semantic tags in the grammars for semantic parsing
Example input: I would like a coca cola and three large pizzas with pepperoni and mushrooms
Example output:
  {
    drink: "coke",
    pizza: {
      number: "3",
      size: "large",
      topping: [ "pepperoni", "mushrooms" ]
    }
  }
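How a finite-state grammar with semantic tags can yield the structure shown on this slide is sketched below. It does not use VoiceXML or its grammar syntax: a Python regular expression plays the role of the finite-state grammar, and its named groups play the role of the semantic tags. The pattern, the lookup tables and the function `interpret` are assumptions.

```python
import re

# Hypothetical normalisation tables (surface form -> semantic value).
DRINKS = {"coca cola": "coke", "coke": "coke"}
NUMBERS = {"one": "1", "two": "2", "three": "3"}

# A regular expression is a finite-state grammar; named groups act as semantic tags.
ORDER = re.compile(
    r"(?P<drink>coca cola|coke)"
    r".*?(?P<number>one|two|three) (?P<size>small|medium|large) pizzas?"
    r" with (?P<toppings>\w+(?: and \w+)*)"
)

def interpret(utterance: str):
    """Return a nested order frame, or None if the grammar does not match."""
    match = ORDER.search(utterance.lower())
    if match is None:
        return None
    return {
        "drink": DRINKS[match.group("drink")],
        "pizza": {
            "number": NUMBERS[match.group("number")],
            "size": match.group("size"),
            "topping": match.group("toppings").split(" and "),
        },
    }

order = interpret(
    "I would like a coca cola and three large pizzas with pepperoni and mushrooms"
)
print(order)
```

Input outside the grammar returns `None`, which reflects the limited coverage of tightly constrained grammars.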
Deep understanding
• Requirements for deep understanding
  • Advanced grammatical formalisms
  • Syntax-semantics issues
  • Parsing technologies
• Example: TRIPS
  • Uses feature-based augmented CFG with agenda-driven best-first chart parser
  • Combined strategy: combining shallow and deep parsing (Swift et al.)
Combined strategies: TINA (MIT)
Grammar rules include a mix of syntactic and semantic categories
• Context-free grammar using probabilities trained from user utterances to estimate likelihood of a parse
• Parse tree converted to a semantic frame that encapsulates the meaning
Robust parsing strategy
• Sentences that fail to parse are parsed using fragments that are combined into a full semantic frame
• When all else fails, word spotting is used
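The back-off order described on this slide (full parse, then fragments, then word spotting) can be sketched as a cascade. This is only an illustration of the control flow, not TINA's grammar or parser: the patterns, the city list and the function `understand` are assumptions, and regular expressions stand in for the probabilistic grammar.

```python
import re

CITIES = ("boston", "denver")  # hypothetical vocabulary
CITY = "(" + "|".join(CITIES) + ")"

# Stage 1: a pattern that must account for the whole sentence.
FULL = re.compile(rf"^i want to fly from {CITY} to {CITY}$")
# Stage 2: fragment patterns that may match anywhere in the sentence.
FRAGMENTS = {
    "origin": re.compile(rf"\bfrom {CITY}\b"),
    "destination": re.compile(rf"\bto {CITY}\b"),
}

def understand(utterance: str):
    """Return (strategy used, frame), trying the most precise strategy first."""
    text = utterance.lower()
    match = FULL.match(text)
    if match:
        return "full_parse", {"origin": match.group(1), "destination": match.group(2)}
    frame = {}
    for slot, pattern in FRAGMENTS.items():
        found = pattern.search(text)
        if found:
            frame[slot] = found.group(1)
    if frame:
        return "fragments", frame
    spotted = [w for w in text.split() if w in CITIES]
    if spotted:
        return "word_spotting", {"city": spotted}
    return "failed", {}

print(understand("I want to fly from Boston to Denver"))
print(understand("uh from Boston please to Denver"))
print(understand("Denver"))
```

Each later stage recovers less structure: word spotting finds a city but can no longer tell whether it is the origin or the destination.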
Problems with hand-crafted approaches
Hand-crafted grammars
• are not robust to spoken language input
• require linguistic and engineering expertise to develop if the grammar is to have good coverage and optimised performance
• are time consuming to develop
• are error prone
• are subject to designer bias
• are difficult to maintain
Statistical modelling for SLU
• SLU as a pattern matching problem
• Given word sequence W, find the semantic representation of meaning M that has maximum a posteriori probability P(M|W)
• P(M): semantic prior model, which assigns a probability to the underlying semantic structure
• P(W|M): lexicalisation model, which assigns a probability to the word sequence W given the semantic structure
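The slide names P(M|W), P(M) and P(W|M) without showing how they connect. The standard link is Bayes' rule, written out here with the slide's own symbols. The hat notation for the selected meaning is an addition for this note.

```latex
\hat{M} = \arg\max_{M} P(M \mid W)
        = \arg\max_{M} \frac{P(W \mid M)\, P(M)}{P(W)}
        = \arg\max_{M} P(W \mid M)\, P(M)
```

The denominator P(W) can be dropped because it does not depend on M, which leaves the product of the lexicalisation model and the semantic prior model.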
Early Examples
• CHRONUS (AT&T: Pieraccini et al, 1992; Levin & Pieraccini, 1995)
  • Finite state semantic tagger
  • ‘Flat-concept’ model: simple to train but does not represent hierarchical structure
• HUM (Hidden Understanding Model) (BBN: Miller et al, 1995)
  • Probabilistic CFG using tree structured meaning representations
  • Grammatical constraints represented in networks rather than rules
  • Ordering of constituents unconstrained, which increases robustness
  • Transition probabilities constrain over-generation
  • Requires fully annotated treebank data for training
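A flat-concept tagger of the kind described for CHRONUS can be illustrated with a toy hidden Markov model and Viterbi decoding. This is a sketch under assumptions: the concept labels and all probabilities are invented for the example, whereas a real system would estimate them from annotated data.

```python
import math

CONCEPTS = ["DUMMY", "FROMLOC", "TOLOC"]
# Hypothetical hand-set probabilities.
START = {"DUMMY": 0.8, "FROMLOC": 0.1, "TOLOC": 0.1}
TRANS = {
    "DUMMY": {"DUMMY": 0.6, "FROMLOC": 0.2, "TOLOC": 0.2},
    "FROMLOC": {"DUMMY": 0.2, "FROMLOC": 0.5, "TOLOC": 0.3},
    "TOLOC": {"DUMMY": 0.3, "FROMLOC": 0.2, "TOLOC": 0.5},
}
EMIT = {
    "DUMMY": {"flights": 0.5},
    "FROMLOC": {"from": 0.5, "denver": 0.3, "burbank": 0.2},
    "TOLOC": {"to": 0.5, "burbank": 0.3, "denver": 0.2},
}
FLOOR = 1e-6  # probability given to words a concept was never seen to emit

def viterbi(words):
    """Return the most probable concept label for each word."""
    def emit(concept, word):
        return math.log(EMIT[concept].get(word, FLOOR))

    scores = {c: math.log(START[c]) + emit(c, words[0]) for c in CONCEPTS}
    backpointers = []
    for word in words[1:]:
        new_scores, pointers = {}, {}
        for c in CONCEPTS:
            prev = max(CONCEPTS, key=lambda p: scores[p] + math.log(TRANS[p][c]))
            new_scores[c] = scores[prev] + math.log(TRANS[prev][c]) + emit(c, word)
            pointers[c] = prev
        backpointers.append(pointers)
        scores = new_scores
    # Trace back from the best final state.
    state = max(CONCEPTS, key=scores.get)
    path = [state]
    for pointers in reversed(backpointers):
        state = pointers[state]
        path.append(state)
    return path[::-1]

tags = viterbi("flights from denver to burbank".split())
print(tags)
```

Each word receives exactly one concept label and there is no nesting of labels, which is the sense in which the model is flat.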
Using Hidden State Vectors (He & Young)
• Extends ‘flat-concept’ HMM model
• Represents hierarchical structure (right-branching) using hidden state vectors
• Each state expanded to encode the stack of a push-down automaton
• Avoids computational tractability issues associated with hierarchical HMMs
• Can be trained using lightly annotated data
• Comparison with FST model and with hand-crafted SLU systems using ATIS test sets and reference parse results
Example: Which flights arrive in Burbank from Denver on Saturday?
• Problem with long-distance dependency between ‘Saturday’ and ‘arrive’
• Flat model: ‘Saturday’ associated with ‘FROMLOC’
• Hierarchical model allows ‘Saturday’ to be associated with ‘ARRIVE’
• Also: more expressive, allows sharing of sub-structures
SLU Evaluation: Performance
• Statistical models competitive with approaches based on hand-crafted rules
• Hand-crafted grammars better for full understanding and for users familiar with the system’s coverage; statistical models better for shallow and more robust understanding for naïve users
• Statistical systems more robust to noise and more portable
SLU Evaluation: Software Development
• “Cost of producing training data should be less than cost of hand-crafting a semantic grammar” (Young, 2002)
• Issues
  • Availability of training data
  • Maintainability
  • Portability
  • Objective metrics? e.g. time, resources, lines of code, …
  • Subjective issues, e.g. designer bias, designer control over system
• Few concrete results, except …
  • HVS model (He & Young) can be robustly trained from only minimally annotated corpus data
  • Model is robust to noise and portable to other domains
Additional technologies
• Named entity extraction
  • Rule-based methods, e.g. using grammars in the form of regular expressions compiled into finite state acceptors (AT&T SLU system): higher precision
  • Statistical methods, e.g. HMIHY, learn mappings between strings and NEs: higher recall as more robust
• Call routing
• Question Answering
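The rule-based route to named entity extraction can be sketched with a few regular expressions. The entity types and patterns are assumptions for illustration and are not taken from the AT&T system. Python's `re` module is a backtracking matcher rather than a compiled finite state acceptor, so this shows the rule format only.

```python
import re

# Hypothetical entity grammars written as regular expressions.
NE_PATTERNS = {
    "phone": re.compile(r"\b\d{3}-\d{3}-\d{4}\b"),
    "amount": re.compile(r"\$\d+(?:\.\d{2})?"),
}

def extract_entities(text: str) -> list:
    """Return (entity type, matched string) pairs, grouped by entity type."""
    return [
        (label, match.group(0))
        for label, pattern in NE_PATTERNS.items()
        for match in pattern.finditer(text)
    ]

entities = extract_entities("I was charged $25.50 on the bill for 555-123-4567")
print(entities)
```

Narrow patterns like these rarely fire on the wrong string, but they miss any variant the rule writer did not anticipate, such as an amount spoken as "twenty five fifty". That is the precision and recall trade-off noted on the slide.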
Additional Issues 1
• ASR/SLU coupling
  • Post-processing results from ASR
  • Noisy channel model of ASR errors (Ringger & Allen)
• Combining shallow and deep parsing
  • Major gains in speed, slight gains in accuracy (Swift et al.)
• Use of context, discourse history, prosodic information
  • Re-ordering n-best hypotheses
  • Determining dialog act based on combinations of features at various levels: ASR and parse probabilities, semantic and contextual features (Purver et al, Lemon)
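Re-ordering an n-best list with features from several levels can be sketched as a weighted sum. The feature names, weights and scores below are invented for the example. The cited work may combine features differently, for instance with a trained classifier.

```python
# Hypothetical feature weights; in practice these would be tuned or learned.
WEIGHTS = {"asr": 1.0, "parse": 0.5, "context": 2.0}

def combined_score(hypothesis: dict) -> float:
    """Weighted sum of the hypothesis features."""
    return sum(WEIGHTS[name] * hypothesis["features"][name] for name in WEIGHTS)

def rerank(n_best: list) -> list:
    """Return the hypotheses sorted from best to worst combined score."""
    return sorted(n_best, key=combined_score, reverse=True)

# Toy n-best list in ASR order; the context feature rewards a city that was
# already mentioned earlier in the dialog.
n_best = [
    {"text": "a flight to Austin", "features": {"asr": -1.0, "parse": -0.5, "context": 0.0}},
    {"text": "a flight to Boston", "features": {"asr": -1.2, "parse": -0.4, "context": 1.0}},
]
reranked = rerank(n_best)
print([h["text"] for h in reranked])
```

The recogniser's top hypothesis is demoted here because the contextual evidence outweighs its small acoustic advantage.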
Additional Issues 2
• Methods for learning from sparse data or without annotation
  • e.g. AT&T system uses ‘active learning’ (Tur et al, 2005) to reduce effort of human data labelling: uses only those data items that improve classifier performance the most
• Development tools, e.g. SGStudio (Wang & Acero): build semantic grammar with little linguistic knowledge
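One common way to approximate "the items that improve the classifier the most" is to send the utterances the current classifier is least confident about for labelling. The sketch below assumes that selection criterion. The function names, the toy utterances and the scores are hypothetical and do not describe the AT&T system.

```python
def select_for_labelling(pool, predict_proba, budget):
    """Pick the `budget` utterances whose top class probability is lowest."""
    def confidence(utterance):
        return max(predict_proba(utterance).values())
    return sorted(pool, key=confidence)[:budget]

# Toy stand-in for a trained call-type classifier: a lookup table of class probabilities.
TOY_SCORES = {
    "I want to pay my bill": {"billing": 0.95, "repair": 0.05},
    "it is about the thing from last week": {"billing": 0.55, "repair": 0.45},
    "my line is dead": {"billing": 0.10, "repair": 0.90},
}

chosen = select_for_labelling(list(TOY_SCORES), TOY_SCORES.get, budget=1)
print(chosen)
```

Only the ambiguous utterance is passed to the human labeller. The two confidently classified utterances are left unlabelled, which is where the saving in annotation effort comes from.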
Additional Issues 3
Some issues addressed in the poster session
• Using SLU for:
  • Dialog act tagging
  • Prosody labelling
  • User satisfaction analysis
  • Topic segmentation and labelling
  • Emotion prediction
Conclusions 1
SLU approach is determined by
• Type of application
  • Finite state dialog with single word recognition
  • Frame based dialog with topic classification and named entity extraction
  • Advanced dialog requiring deep understanding
  • Simulated conversation, …
Conclusions 2
SLU approach is determined by
• Type of output required
  • Syntactic / semantic parse trees
  • Semantic frames
  • Speech / dialog acts, …
  • Intentions, beliefs, emotions, …
Conclusions 3
SLU approach is determined by
• Deployment and usability issues
  • Applications requiring accurate extraction of information
  • Applications involving complex processing of content
  • Applications involving shallow processing of content (e.g. conversational companions, interactive games)
Selected References
• Bangalore, S., Hakkani-Tür, D., Tur, G. (eds) (2006) Special Issue on Spoken Language Understanding in Conversational Systems. Speech Communication 48.
• Gupta, N., Tur, G., Hakkani-Tür, D., Bangalore, S., Riccardi, G., Gilbert, M. (2006) The AT&T Spoken Language Understanding System. IEEE Transactions on Speech and Audio Processing 14(1), 213-222.
• Allen, J.F., Byron, D.K., Dzikovska, O., Ferguson, G., Galescu, L., Stent, A. (2001) Towards conversational human-computer interaction. AI Magazine 22(4), 27–35.
• Jurafsky, D., Martin, J. (2000) Speech and Language Processing. Prentice-Hall.
• Huang, X., Acero, A., Hon, H.-W. (2001) Spoken Language Processing: A Guide to Theory, Algorithm and System Development. Prentice-Hall.