Spoken Language Understanding: Intelligent Robot Lecture Note
Error Correction and Adaptation
[Figure] System overview: the ASR engine (acoustic model + language model) passes its outputs to a post-processor (ECLM with a semantic LM) that produces corrected transcripts; the dialog system consumes the ASR output after adaptation or re-ranking.
Error Correction and Adaptation
[Figure] SLM adaptation overview (from [Bellegarda, 2004]): an initial model P_B(w_1…w_n) is estimated from a background corpus B, and information extracted from an adaptation corpus A (task-specific knowledge) yields an adapted model P(w_1…w_n).
• 'Error correction' automatically recovers ASR errors by post-processing [Ringger and Allen, 1996].
• 'Adaptation' is the process of fitting an ASR system to a new domain, speaker, or speech style [Bellegarda, 2004].
Error Correction and Adaptation
• Error correction
  • Rule-based approach: pattern mining / transformation-rule learning
  • Statistical model: noisy-channel framework, as in statistical machine translation (channel model = translation model, language model = source model)
• Adaptation
  • Re-ranking with richer linguistic information (e.g. POS tags, phrase chunks, parse trees, topics, WordNet) [Bellegarda, 2004; Rosenfeld, 1996]
  • Interpolated LM: combining a background LM with a domain-specific LM
• The Error Corrective Language Model (ECLM) is a marriage of the two methods: domain-specific / user-oriented adaptation combined with error correction.
• ECLM is a statistical re-ranking method that interactively adapts the speech recognizer's language model [Jeong et al., 2005].
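A minimal sketch of the linear-interpolation idea behind an interpolated LM. The mixture weight and the toy probability tables below are hypothetical placeholders, not the lecture's actual models:

```python
# P_adapted(w | h) = lam * P_domain(w | h) + (1 - lam) * P_background(w | h)

def interpolate(p_domain, p_background, lam=0.5):
    """Return an interpolated conditional probability function."""
    def p(word, history):
        return lam * p_domain(word, history) + (1.0 - lam) * p_background(word, history)
    return p

# Usage with two toy models (placeholders for real background/domain LMs):
p_bg = lambda w, h: {"train": 0.01, "ticket": 0.005}.get(w, 1e-6)
p_dm = lambda w, h: {"train": 0.05, "ticket": 0.04}.get(w, 1e-6)
p_adapted = interpolate(p_dm, p_bg, lam=0.5)
print(p_adapted("ticket", ()))   # 0.5*0.04 + 0.5*0.005 = 0.0225
```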
Error Corrective Adaptation
• Two-stage re-ranking
  • CORRECTOR: generates corrected candidates with the translation (channel) model
  • RERANKER: re-ranks the candidates with semantic information
[Figure] An example of error-corrective re-ranking
Error Correction
• Correcting the ASR hypotheses as a MAP process (choose the corrected word sequence w maximizing p(o | w) p(w) for the observed ASR output o)
• Noisy-channel correction model
  • Word mapping: modeling word-to-word translation pairs
    • e.g. p(TAKE A | TICKET), p(TRAIN | TRAIN), p(FROM | FROM), p(CHICAGO | CHICAGO), p(TO | TO), p(TOLEDO | TO LEAVE)
  • Fertility model (IBM Model 4 [Brown et al., 1993])
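A toy sketch of that MAP search over candidate corrections. The channel table, the placeholder unigram LM, and the candidate list are illustrative only; the real system uses the IBM Model 4 channel and a full language model:

```python
import math

# Toy channel model p(observed | true) for a few word pairs.
channel = {("TAKE A", "TICKET"): 0.3, ("TICKET", "TICKET"): 0.6,
           ("TOLEDO", "TO LEAVE"): 0.4, ("TO LEAVE", "TO LEAVE"): 0.5}

def channel_logprob(observed, truth):
    return math.log(channel.get((observed, truth), 1e-6))

def lm_logprob(words):
    # Placeholder unigram LM; a real system would use an n-gram or semantic LM.
    freq = {"TICKET": 0.02, "TO LEAVE": 0.001}
    return sum(math.log(freq.get(w, 1e-6)) for w in words)

def correct(observed_seq, candidates):
    """MAP correction: argmax over candidate true sequences of
    sum_i log p(o_i | w_i) + log p(w_1 .. w_n)."""
    def score(truth_seq):
        ch = sum(channel_logprob(o, w) for o, w in zip(observed_seq, truth_seq))
        return ch + lm_logprob(truth_seq)
    return max(candidates, key=score)

print(correct(["TAKE A"], [["TICKET"], ["TO LEAVE"]]))  # -> ['TICKET']
```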
Estimating the Channel Model
• Maximum likelihood estimates
  • Counting the pairs (o_i, w_i) in the adaptation training data (corpus A)
• Discounting methods
  • To reduce over-fitting
  • Applying well-known language-model smoothing techniques: absolute discounting (abs), Good-Turing (GT), and modified Kneser-Ney (KN) [Chen and Goodman, 1998]
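A compact sketch of the count-and-normalize step with absolute discounting. The discount value and the aligned-pair data are placeholders; the lecture's actual smoothing options follow [Chen and Goodman, 1998]:

```python
from collections import Counter, defaultdict

def estimate_channel(aligned_pairs, discount=0.5):
    """MLE channel model p(o | w) from aligned (observed, true) word pairs,
    with absolute discounting; the held-out mass is reserved for unseen pairs."""
    pair_counts = Counter(aligned_pairs)                  # c(o, w)
    true_counts = Counter(w for _, w in aligned_pairs)    # c(w)
    model = defaultdict(dict)
    backoff_mass = {}
    for (o, w), c in pair_counts.items():
        model[w][o] = max(c - discount, 0.0) / true_counts[w]
    for w in true_counts:
        types_seen = sum(1 for (o2, w2) in pair_counts if w2 == w)
        backoff_mass[w] = discount * types_seen / true_counts[w]
    return model, backoff_mass

pairs = [("TAKE A", "TICKET"), ("TICKET", "TICKET"), ("TICKET", "TICKET")]
model, backoff = estimate_channel(pairs)
print(model["TICKET"])    # {'TAKE A': ~0.167, 'TICKET': 0.5}
print(backoff["TICKET"])  # ~0.333 reserved for unseen observations of TICKET
```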
Re-ranking with Language Understanding
• Re-ranking the n-best hypotheses or the word lattice
  • P_LM can be any type of language model trained on background and domain-specific text data.
  • Two baselines: trigram and class-based n-gram (CLM)
• Exploiting semantic information
  • Semantic class-based language model (SCLM)
  • s = {s_1, …, s_n}: semantic classes, roughly (domain-specific) named-entity categories
  • Estimated in the same way as a class-based n-gram [Brown et al., 1992]
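For reference, the standard class-based factorization of [Brown et al., 1992], which the SCLM follows with semantic classes s_i in place of word classes (written as a bigram for brevity; the lecture's baselines use trigrams):

```latex
P(w_1 \dots w_n) \approx \prod_{i=1}^{n} P(w_i \mid s_i)\, P(s_i \mid s_{i-1})
```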
Re-ranking with Language Understanding
• Exploiting semantic information
  • Another way to estimate the SCLM: a MaxEnt exponential model
  • x_i = {x_i1, x_i2, x_i3, …}: lexical, syntactic, and semantic features
[Table] Defined semantic classes for the two domains
Reading List
• J. R. Bellegarda. 2004. Statistical language model adaptation: review and perspectives. Speech Communication, 42(1):93-108.
• S. F. Chen and J. Goodman. 1998. An empirical study of smoothing techniques for language modeling. Technical Report TR-10-98, Harvard University.
• M. Jeong, J. Eun, S. Jung, and G. G. Lee. 2005. An error-corrective language-model adaptation for automatic speech recognition. Interspeech 2005-Eurospeech, Lisbon.
• M. Jeong and G. G. Lee. 2006. Experiments on error-corrective language model adaptation. WESPAC-9, Seoul.
• E. K. Ringger and J. F. Allen. 1996. Error correction via a post-processor for continuous speech recognition. In ICASSP.
• R. Rosenfeld. 1996. A Maximum Entropy approach to adaptive statistical language modeling. Computer, Speech and Language, 10:187-228.
Spoken Language Understanding
Spoken Language Understanding
[Figure] A typical ATIS system (from [Wang et al., 2005]): speech → ASR → text → SLU → semantic frame → SQL generation → database → response
• Spoken language understanding maps natural language speech to a frame structure encoding its meaning.
• What is the difference between NLU and SLU?
  • Robustness: noisy and ungrammatical spoken language
  • Domain dependence: deeper domain-level semantics (e.g. Person vs. Cast)
  • Dialog: dialog-history-dependent, utterance-by-utterance analysis
• Traditional approaches: natural language to SQL conversion
Semantic Representation
• Semantic frame (frame and slot/value structure) [Gildea and Jurafsky, 2002]
  • An intermediate semantic representation that serves as the interface between the user and the dialog system
  • Each frame contains several typed components called slots; the type of a slot specifies what kind of fillers it expects.
• Example: "Show me flights from Seattle to Boston" on the ATIS task; XML format (left) and hierarchical representation (right) [Wang et al., 2005]
  <frame name='ShowFlight' type='void'>
    <slot type='Subject'>FLIGHT</slot>
    <slot type='Flight'>
      <slot type='DCity'>SEA</slot>
      <slot type='ACity'>BOS</slot>
    </slot>
  </frame>
  (Hierarchically: ShowFlight → Subject: FLIGHT; Flight → Departure_City: SEA, Arrival_City: BOS)
Semantic Representation
• Two common components in a semantic frame
  • Dialog acts (DA): the meaning of an utterance at the discourse level; approximately equivalent to the intent or subject slot in practice.
  • Named entities (NE): identifiers of entities such as persons, locations, organizations, or times. In SLU, an NE represents the domain-specific meaning of a word (or word group).
• Example (ATIS and EPG domains, simplified representation)
Semantic Frame Extraction
• Semantic frame extraction (~ an information extraction approach)
  1) Dialog act / main action identification ~ classification
  2) Frame-slot object extraction ~ named entity recognition
  3) Object-attribute attachment ~ relation extraction
  • 1) + 2) + 3) ~ unification (see the sketch below)
[Figure] Examples of semantic frame structure; overall architecture of the semantic analyzer: information source → feature extraction/selection → dialog act identification + frame-slot extraction + relation extraction → unification
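A structural sketch of that three-step pipeline and the unification step. The toy classifier, gazetteer tagger, and relation attacher below are stand-ins for the statistical models discussed in later slides (MaxEnt for dialog acts, CRFs for slots):

```python
# Hypothetical stand-ins for the three statistical components.
def classify_dialog_act(words):
    return "search_program" if "find" in words else "statement"

def tag_slots(words):
    # Toy gazetteer lookup in place of CRF-based named entity recognition.
    gazetteer = {"cnn": "channel", "news": "genre"}
    return [(w, gazetteer[w]) for w in words if w in gazetteer]

def attach_relations(slots):
    # Attach every slot to the main action (placeholder for relation extraction).
    return [("main_action", slot_type, value) for value, slot_type in slots]

def extract_semantic_frame(utterance):
    words = utterance.lower().split()
    dialog_act = classify_dialog_act(words)   # 1) dialog act identification
    slots = tag_slots(words)                  # 2) frame-slot object extraction
    relations = attach_relations(slots)       # 3) object-attribute attachment
    return {"dialog_act": dialog_act, "slots": slots, "relations": relations}

print(extract_semantic_frame("Find news on CNN"))
```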
Knowledge-based Systems
• Knowledge-based systems:
  • Developers write a syntactic/semantic grammar
  • A robust parser analyzes the input text with the grammar
  • No large amount of training data is needed
• Previous work
  • MIT: TINA (natural language understanding) [Seneff, 1992]
  • CMU: PHOENIX [Pellom et al., 1999]
  • SRI: GEMINI [Dowding et al., 1993]
• Disadvantages
  • Grammar development is an error-prone process
  • It takes multiple rounds of fine-tuning to get a good grammar
  • Combined linguistic and engineering expertise is required to construct a grammar with good coverage and optimized performance
  • Such a grammar is difficult and expensive to maintain
Statistical Systems
• Statistical SLU approaches:
  • The system automatically learns from example sentences paired with their semantics
  • The annotations are much easier to create and do not require specialized knowledge
• Previous work
  • Microsoft: HMM/CFG composite model [Wang et al., 2005]
  • AT&T: CHRONUS (finite-state transducers) [Levin and Pieraccini, 1995]
  • Cambridge Univ.: hidden vector state model [He and Young, 2005]
  • Postech: semantic frame extraction using statistical classifiers [Eun et al., 2004; Eun et al., 2005; Jeong and Lee, 2006]
• Disadvantages
  • Data-sparseness problem: the system requires a large corpus
  • Lack of domain knowledge
Machine Learning for SLU
• Relational learning (RL) or structured prediction (SP) [Dietterich, 2002; Sutton and McCallum, 2004]
  • Structured or relational patterns are important because they can be exploited to improve the prediction accuracy of our classifier
  • Argmax search (e.g. max-sum, belief propagation, Viterbi)
• For language processing, RL typically uses a left-to-right structure (a.k.a. linear-chain or sequence structure)
• Algorithms: CRFs, Max-Margin Markov Networks (M3N), SVMs for Interdependent and Structured Outputs (SVM-ISO), structured perceptron, etc.
Machine Learning for SLU
[Figure] Graphical structures: MaxEnt (a single label z over the observation x) vs. a linear-chain CRF (labels y_{t-1}, y_t, y_{t+1} over observations x_{t-1}, x_t, x_{t+1}, with feature functions h_k, f_k, g_k)
• Background: Maximum Entropy (a.k.a. logistic regression)
  • Conditional and discriminative
  • Unstructured (no dependencies among the y's)
  • Used for the dialog act classification problem
• Conditional Random Fields [Lafferty et al., 2001]
  • A structured version of MaxEnt (argmax search at inference time)
  • Undirected graphical model
  • Popular in language and text processing
  • Linear-chain structure for practical implementation
  • Used for the named entity recognition problem
Semantic NER as Sequence Labeling
• Relational learning for language processing
  • Left-to-right n-th order Markov model (linear chain or sequence)
  • E.g. part-of-speech tagging, noun phrase chunking, information extraction, speech recognition, etc.
  • Very large feature space (e.g. state-of-the-art NP chunking uses > 1M features)
  • An open problem is how to reduce the training cost (even for a 1st-order Markov model)
• Transformation to the BIO representation [Ramshaw and Marcus, 1995]
  • B = beginning of an entity, I = inside an entity, O = outside any entity
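A small sketch of the BIO transformation for span-annotated entities. The sentence, spans, and label names below are made-up examples in the ATIS style, not the lecture's actual annotation scheme:

```python
def to_bio(tokens, spans):
    """Convert entity spans (start, end_exclusive, label) into per-token B-/I-/O tags."""
    tags = ["O"] * len(tokens)
    for start, end, label in spans:
        tags[start] = "B-" + label
        for i in range(start + 1, end):
            tags[i] = "I-" + label
    return tags

tokens = ["fly", "from", "denver", "to", "chicago", "on", "dec.", "10th"]
spans = [(2, 3, "DEPART.CITY"), (4, 5, "ARRIVE.CITY"), (6, 8, "DEPART.DATE")]
print(list(zip(tokens, to_bio(tokens, spans))))
# [('fly','O'), ('from','O'), ('denver','B-DEPART.CITY'), ('to','O'),
#  ('chicago','B-ARRIVE.CITY'), ('on','O'), ('dec.','B-DEPART.DATE'), ('10th','I-DEPART.DATE')]
```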
Theoretical Background for SLU
Maximum Entropy
• (Conditional) Maximum Entropy = log-linear model, exponential model, multinomial logistic regression
• The criterion is based on information theory
  • We want a distribution that is as uniform as possible subject to the constraints, i.e. has maximum entropy
• Given the parametric form, fitting a Maximum Entropy model to a collection of training data means finding values of the parameter vector Λ that maximize the log-likelihood of the data, or equivalently that match the empirical feature expectations
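The standard conditional MaxEnt form that the lost equations refer to, written here in conventional notation (the slide's own symbols may differ slightly):

```latex
p_\Lambda(y \mid x) = \frac{1}{Z_\Lambda(x)} \exp\Big(\sum_k \lambda_k f_k(x, y)\Big),
\qquad Z_\Lambda(x) = \sum_{y'} \exp\Big(\sum_k \lambda_k f_k(x, y')\Big)

\hat{\Lambda} = \arg\max_\Lambda \sum_{i=1}^{N} \log p_\Lambda\big(y^{(i)} \mid x^{(i)}\big)
\quad\Longleftrightarrow\quad
\mathbb{E}_{\tilde{p}}[f_k] = \mathbb{E}_{p_\Lambda}[f_k] \ \text{ for all } k
```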
MaxEnt - Derivation
• Introduce the Lagrangian for the constrained entropy-maximization problem
• Differentiate the Lagrangian to obtain the parametric (exponential) form
• Dual form: maximizing the dual is equivalent to maximum likelihood estimation of the exponential model
MaxEnt - Parametric Form
• Parametric form
  • The log-likelihood function is concave → global maximum
  • Constrained optimization → unconstrained optimization
  • Regularization (= MAP estimation with a Gaussian prior on Λ)
• Why a Maximum Entropy classifier?
  • Efficient computation (an inner product of the parameter vector and a feature vector)
  • e.g. SVM learning (1 week) vs. MaxEnt (1 hour)
  • Handles overlapping features
  • Supports automatic feature induction (for exponential models)
MaxEnt - Parameter Estimation
• Generalized Iterative Scaling (GIS)
  • One popular method for iteratively refining the parameters
  • From Darroch and Ratcliff (1972), Deming and Stephan (1940)
  • Requires only the expected feature values, but converges very slowly
• Conjugate Gradient (Fletcher-Reeves and Polak-Ribiere-Positive)
  • The expected values required by GIS are essentially the gradient, so we can use the gradient information directly
• Limited-memory quasi-Newton (L-BFGS)
  • For large-scale problems, evaluating the Hessian matrix is computationally impractical → quasi-Newton methods
  • Furthermore, the limited-memory version of the BFGS algorithm saves storage on large-scale problems
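A minimal sketch of fitting a MaxEnt classifier with L-BFGS via scipy, assuming dense features and a Gaussian prior with variance sigma2 (a real SLU system would use sparse features and many more classes):

```python
import numpy as np
from scipy.optimize import minimize

def fit_maxent(X, y, n_classes, sigma2=10.0):
    """Train p(y|x) proportional to exp(w_y . x) by maximizing the penalized
    log-likelihood with L-BFGS (Gaussian prior = L2 penalty with variance sigma2)."""
    n, d = X.shape

    def neg_ll_and_grad(w_flat):
        W = w_flat.reshape(n_classes, d)
        scores = X @ W.T                                # (n, n_classes)
        scores -= scores.max(axis=1, keepdims=True)     # numerical stability
        probs = np.exp(scores)
        probs /= probs.sum(axis=1, keepdims=True)
        ll = np.log(probs[np.arange(n), y]).sum() - (W ** 2).sum() / (2 * sigma2)
        # Gradient = empirical feature counts - model expectations - prior term.
        emp = np.zeros_like(W)
        np.add.at(emp, y, X)
        grad = emp - probs.T @ X - W / sigma2
        return -ll, -grad.ravel()

    res = minimize(neg_ll_and_grad, np.zeros(n_classes * d), jac=True, method="L-BFGS-B")
    return res.x.reshape(n_classes, d)

# Toy usage: 2 features, 2 classes.
X = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]])
y = np.array([0, 0, 1, 1])
print(fit_maxent(X, y, n_classes=2).shape)  # (2, 2)
```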
Conditional Random Fields
[Figure] Linear-chain CRF: labels y_{t-1}, y_t, y_{t+1} over observations x_{t-1}, x_t, x_{t+1}
• An undirected graphical model [Lafferty et al., 2001]
• Linear-chain structure
• The label bias problem (which globally normalized CRFs avoid)
  • States whose following-state distributions have low entropy are preferred
CRFs - Parameter Estimation
• A 2-norm penalized (conditional) log-likelihood
• Setting its gradient to zero matches the empirical feature distribution to the model expectation (computed from the marginal distributions)
• Optimization techniques:
  • GIS or IIS, conjugate gradient, or L-BFGS
  • Voted perceptron, stochastic meta-descent
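The penalized objective and its gradient in standard linear-chain CRF notation; this is the usual textbook form that the slide's annotations "empirical distribution", "expectation", and "marginal distribution" point to:

```latex
\mathcal{L}(\Lambda) = \sum_{i=1}^{N} \log p_\Lambda\big(\mathbf{y}^{(i)} \mid \mathbf{x}^{(i)}\big)
 \;-\; \sum_k \frac{\lambda_k^2}{2\sigma^2}

\frac{\partial \mathcal{L}}{\partial \lambda_k}
 = \underbrace{\sum_{i,t} f_k\big(y_{t-1}^{(i)}, y_t^{(i)}, \mathbf{x}^{(i)}\big)}_{\text{empirical count}}
 \;-\; \underbrace{\sum_{i,t} \sum_{y', y} p_\Lambda\big(y_{t-1} = y', y_t = y \mid \mathbf{x}^{(i)}\big)\, f_k\big(y', y, \mathbf{x}^{(i)}\big)}_{\text{model expectation via marginals}}
 \;-\; \frac{\lambda_k}{\sigma^2}
```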
CRFs - Inference
• In training
  • Compute the marginal distributions p(y_{t-1} = y', y_t = y | x^(i))
  • Done with the forward-backward recursion
• In testing
  • Viterbi recursion
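A compact Viterbi decoder over per-position log-scores, as a sketch of the test-time recursion; the emission and transition score matrices here are toy placeholders rather than trained CRF potentials:

```python
import numpy as np

def viterbi(emission, transition):
    """emission: (T, K) log-scores of each label at each position;
    transition: (K, K) log-scores of moving from label i to label j.
    Returns the highest-scoring label index sequence."""
    T, K = emission.shape
    delta = emission[0].copy()             # best score ending in each label
    backptr = np.zeros((T, K), dtype=int)
    for t in range(1, T):
        scores = delta[:, None] + transition + emission[t][None, :]  # (K, K)
        backptr[t] = scores.argmax(axis=0)
        delta = scores.max(axis=0)
    best = [int(delta.argmax())]
    for t in range(T - 1, 0, -1):          # follow back-pointers
        best.append(int(backptr[t, best[-1]]))
    return best[::-1]

emission = np.log(np.array([[0.7, 0.3], [0.4, 0.6], [0.5, 0.5]]))
transition = np.log(np.array([[0.8, 0.2], [0.3, 0.7]]))
print(viterbi(emission, transition))   # -> [0, 0, 0]
```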
Exploiting Non-Local Information for SLU
Long-distance Dependency Problem
• Example: "fly from denver to chicago on dec. 10th 1999" → dec. = DEPART.MONTH, vs. "return from denver to chicago on dec. 10th 1999" → dec. = RETURN.MONTH
• Most practical NLP models employ a local feature set
  • Local context features (sliding window)
  • E.g. for "dec.": current = dec., prev-2 = on, prev-1 = chicago, next+1 = 10th, next+2 = 1999, POS-tag = NN, chunk = PP
• However, the two occurrences of "dec." have exactly the same local feature set even though their labels differ
• Non-local features or higher-order dependencies should be considered
Using Non-local Information
• Syntactic-parser-based approach
  • From the fields of semantic role labeling and relation extraction
  • Parse tree path, governing category, or head word
  • Advantage: captures the global structure of language
  • Disadvantage: fragile on informal language (speech, email)
• Data-driven approach
  • From the fields of information retrieval and language modeling
  • Identical words, lexical co-occurrence, or triggering
  • Advantage: easy to extract, fits ungrammatical input
  • Disadvantage: depends on the data set
Hidden Vector State Model
• CFG-like model [He and Young, 2005]
• Extends the 'flat-concept' HMM model
• Represents hierarchical (right-branching) structure using hidden state vectors
• Each state is expanded to encode the stack of a push-down automaton
Previous Slot Context
• CRF-based model [Wang et al., 2005]
  • Conditional model for an ATIS-domain SLU system
  • Efficient heuristic functions that encode non-local information
  • But it requires a domain-specific lexicon and knowledge
• More general alternatives: skip-chain CRF [Sutton and McCallum, 2006] and a two-step approach [Jeong and Lee, 2007]
Tree Path and Head Word
• Using a syntactic parse tree [Gildea and Jurafsky, 2002]
  • Motivated by semantic role labeling
  • Full structural information of language
• Disadvantages
  • Reliable only for text processing, not for speech
  • Heavy computation to parse a sentence
Trigger Feature Selection
• Using co-occurrence information [Jeong and Lee, 2006]
  • Finding non-local and long-distance dependencies
  • Based on a feature selection method for exponential models
• Definition 1 (Trigger Pair). Let a ∈ A and b ∈ B, where A is a set of history features and B is a set of target features in the training examples. If a feature element a is significantly correlated with another element b, then a → b is considered a trigger, with a the triggering element and b the triggered element.
Trigger Features
• Trigger word pairs [Rosenfeld, 1992]
  • The basic element for extracting information from the long-distance document history
  • (w_i → w_j): words w_i and w_j are significantly correlated
  • When w_i occurs in the document, it triggers w_j, causing its probability estimate to change
• Selecting the triggers (see the sketch below)
  • E.g. dec. → return: 0.001179; dec. → fly: < 0.0001 (rejected)
• Is it enough to extract trigger features from the training data alone? No!
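Trigger pairs are typically scored by the average mutual information between the two word events; a small sketch of that selection step. The threshold and the toy corpus are placeholders, and the exact scoring in [Jeong and Lee, 2006] may differ:

```python
import math
from collections import Counter
from itertools import combinations

def mutual_information(p_ab, p_a, p_b):
    """Average mutual information I(a; b) over the four joint outcomes."""
    def term(pxy, px, py):
        return pxy * math.log(pxy / (px * py)) if pxy > 0 else 0.0
    return (term(p_ab, p_a, p_b)
            + term(p_a - p_ab, p_a, 1 - p_b)
            + term(p_b - p_ab, 1 - p_a, p_b)
            + term(1 - p_a - p_b + p_ab, 1 - p_a, 1 - p_b))

def select_triggers(sentences, threshold=1e-3):
    n = len(sentences)
    uni, pair = Counter(), Counter()
    for sent in sentences:
        words = set(sent)
        uni.update(words)
        pair.update(combinations(sorted(words), 2))
    triggers = []
    for (a, b), c in pair.items():
        mi = mutual_information(c / n, uni[a] / n, uni[b] / n)
        if mi >= threshold:
            triggers.append((a, b, mi))
    return sorted(triggers, key=lambda t: -t[2])

corpus = [["return", "from", "denver", "on", "dec."],
          ["return", "to", "chicago", "on", "dec."],
          ["fly", "from", "denver", "to", "chicago"]]
print(select_triggers(corpus)[:3])
```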
Inducing (Trigger) Features
• Basic idea
  • Incrementally add only the bundles of (trigger) features that increase the log-likelihood of the exponential model [Della Pietra et al., 1997]
• Feature gain
  • A measure for evaluating the (trigger) features using Kullback-Leibler divergence [Della Pietra et al., 1997]
  • Current model (CRFs) + a new feature [McCallum, 2003]
  • Mean-field approximation
Inducing (Trigger) Features
• Feature gain (final form)
• No closed-form solution → Newton's method
• MI vs. FI? (mutual information vs. feature-induction gain as the selection criterion)
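Because the gain of a candidate feature has no closed-form maximizer in its weight, it is usually found with a one-dimensional Newton iteration. A generic sketch of that update (the gain shape used below, with empirical count c and model probability p of a binary feature, is only an illustrative stand-in for the actual [McCallum, 2003] gain):

```python
def newton_maximize(g_prime, g_double_prime, mu0=0.0, tol=1e-8, max_iter=50):
    """Find the weight mu maximizing a 1-D gain G(mu), given its first two derivatives."""
    mu = mu0
    for _ in range(max_iter):
        step = g_prime(mu) / g_double_prime(mu)
        mu -= step                      # Newton update: mu <- mu - G'(mu)/G''(mu)
        if abs(step) < tol:
            break
    return mu

import math
# Illustrative gain G(mu) = c*mu - log(1 - p + p*exp(mu)) for a binary feature.
c, p = 0.4, 0.25
g1 = lambda mu: c - (p * math.exp(mu)) / (1 - p + p * math.exp(mu))
g2 = lambda mu: -(p * math.exp(mu) * (1 - p)) / (1 - p + p * math.exp(mu)) ** 2
print(newton_maximize(g1, g2))   # ~0.693 = ln 2 for these toy values
```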
Efficient Feature Induction Algorithm
• How to reduce the time cost of feature induction?
• Local feature set as the base model
  • Problem: the original algorithm iteratively adds features from the full candidate set, starting from an empty model
  • Solution: we only consider inducing the non-local trigger features
• Maximum entropy inducer
  • Problem: CRFs are expensive to train!
  • Solution: we extract non-local trigger features with a lightweight MaxEnt model, and train the CRF only once at the end
[Figure] Outline of the feature induction algorithm
Effect of Trigger Features
Joint Prediction for SLU
Motivation: Independent System
• The DA and NE tasks are treated as totally independent.
• The system produces the DA (z) and NEs (y) given the input words (x), and passes them to the dialog manager (DM).
• MaxEnt for DA classification, CRFs for NE recognition
• Preliminary results in the previous section
Motivation: Cascaded System
• Current state-of-the-art system design [Gupta et al., 2006, AT&T system]
• Train the NE module and use its prediction as a feature for the DA module (or vice versa)
• Significant drawback: the first module cannot take advantage of information from the DA task (or vice versa)
• A cascaded system can improve only a single task's performance rather than both.
Motivation: Joint System
• Joint prediction of DA and NE [Jeong and Lee, 2006]
• DA and NE are mutually dependent
• An integration of the DA and NE models that encodes their inter-dependency
• The GOAL is to improve the performance of both the DA and NE tasks.
Triangular-chain CRFs
[Figure] Triangular-chain structure: the DA variable z is connected to every NE label y_{t-1}, y_t, y_{t+1} and to the observations x_{t-1}, x_t, x_{t+1}, with factors h_k (DA-observation), f1_k / f2_k, and g_k (NE-observation)
• Modeling the inter-dependency (y ↔ z)
• Factorizing the potentials into edge-transition (y,y dependency), NE-observation, DA-observation, and y,z-dependency terms
• In general, f_k can be a function of triangulated cliques. However, we assume that the NE state transition is independent of the DA, i.e., the DA operates as an observation feature for identifying NE labels.
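Under those assumptions, the joint distribution has roughly the following factored form. This is a reconstruction in standard notation from the factor names on the slide; the exact grouping of feature functions in the original model may differ:

```latex
p_\Lambda(\mathbf{y}, z \mid \mathbf{x}) = \frac{1}{Z(\mathbf{x})}
\exp\Big(
  \sum_{t}\sum_{k} \lambda^{1}_k f^{1}_k(y_{t-1}, y_t, \mathbf{x})
+ \sum_{t}\sum_{k} \gamma_k\, g_k(y_t, \mathbf{x})
+ \sum_{t}\sum_{k} \lambda^{2}_k f^{2}_k(y_t, z, \mathbf{x})
+ \sum_{k} \mu_k\, h_k(z, \mathbf{x})
\Big)
```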
CRF Family
[Figure] Graphical illustrations of the linear-chain CRF (left), the factorial CRF [Sutton et al., 2004] (middle), and the triangular-chain CRF (right)
Joint Inference in Tri-CRFs
• Performed by multiple exact inferences (of linear-chain CRFs)
  • Imagine |Z| planes of linear-chain CRFs, one per DA value z.
  • The forward-backward recursion's final alpha yields the total mass of all state sequences in each plane.
• Beam search (over the z variable)
  • Truncate the z planes whose (normalized) mass p falls below ε (= 0.001).
Parameter Estimation of Tri-CRFs
• Log-likelihood (for joint task optimization) given D = {x, y, z}_{i=1,…,N}
• Derivatives: empirical feature counts minus model expectations, as in standard CRFs
• Numerical optimization: L-BFGS
• Gaussian regularization (σ² = 20)
Reducing the Human Effort