The Hidden Vector State Language Model Vidura Senevitratne, Steve Young Cambridge University Engineering Department
Reference • Young, S. J., “The Hidden Vector State Language Model”, Tech. Report CUED/F-INFENG/TR.467, Cambridge University Engineering Department, 2003. • He, Y. and Young, S. J., “Hidden Vector State Model for Hierarchical Semantic Parsing”, in Proc. ICASSP, Hong Kong, 2003. • Fine, S., Singer, Y., and Tishby, N., “The Hierarchical Hidden Markov Model: Analysis and Applications”, Machine Learning 32(1): 41-62, 1998.
Outline • Introduction • HVS Model • Experiments • Conclusion
Introduction • Language models: conventional n-gram models suffer from data sparseness and cannot capture long-distance dependencies or the nested structural information in language. • Class-based language models exploit POS tag information. • Structured language models exploit syntactic information.
Hierarchical Hidden Markov Model • The HHMM is a structured, multi-level stochastic process. • Each state is itself an HHMM. • Internal states: hidden states that do not emit observable symbols directly. • Production states: leaf states that emit the observable symbols. • The states of an ordinary HMM correspond to the production states of an HHMM.
HHMM (cont.) • The parameters of an HHMM fall into three groups:
HHMM (cont.) • Transition probabilities: horizontal transitions between states at the same level. • Initial (activation) probabilities: vertical transitions from a parent state to one of its children. • Observation probabilities: symbol emission at the production states.
HHMM (cont.) • The generative process (sketched in code below): • If the current node is the root, choose a child according to the initial probability. • If the child is a production state: produce an observation, then transit within the same level; on reaching the end state, return control to the parent of that level. • If the child is an internal state: choose one of its children and wait until control returns from the children, then transit within the same level; on reaching the end state, return control to the parent of that level.
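To make the control flow above concrete, here is a minimal Python sketch of HHMM generation. It is an illustration only, not the implementation from the cited papers; the `model` dictionary layout and the keys `pi`, `A`, `B`, `type` and the `END` marker are assumptions introduced for this sketch.

```python
import random

# Minimal sketch of HHMM generation. The model is assumed to be a dict that
# maps each state name to its type and its local distributions:
#   model[s]["type"] : "internal" or "production"
#   model[s]["pi"]   : vertical activation distribution over children of s
#   model[s]["A"]    : horizontal transition distributions, one per child
#   model[s]["B"]    : observation distribution (production states only)

def draw(dist):
    """Sample a key from a {item: probability} dictionary."""
    items = list(dist)
    return random.choices(items, weights=[dist[i] for i in items])[0]

def generate(state, model, output):
    """Recursively generate observations below internal state `state`."""
    child = draw(model[state]["pi"])                 # vertical step: activate a child
    while child != "END":
        if model[child]["type"] == "production":
            output.append(draw(model[child]["B"]))   # leaf: emit a symbol
        else:
            generate(child, model, output)           # internal: recurse into sub-HHMM
        child = draw(model[state]["A"][child])       # horizontal step at this level
    return output                                    # END reached: control returns to parent
```

Calling `generate("root", model, [])` with a suitably populated `model` walks the hierarchy top-down; only production states append to the output, matching the description above.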
HHMM (cont.) • Another application: modelling stock price trends (IDEAL 2004).
Hidden Vector State Model • The semantic information relating to any single word can be stored as a vector of semantic tag names.
Hidden Vector State Model (cont.) • If state transitions were unconstrained, the model would be equivalent to a full HHMM. • Instead, transitions between states are factored into a stack shift with two stages: a pop followed by a push. • The stack depth is limited and the number of new concepts pushed at each step is limited to one, which makes the model far more efficient.
Hidden Vector State Model (cont.) • The joint probability is decomposed by the chain rule as P(N, C, W) = ∏_t P(n_t | past) · P(c_t[1] | past, n_t) · P(w_t | past, c_t), where “past” denotes everything generated before that point.
Hidden Vector State Model (cont.) • Approximation (assumption): each factor depends only on the current vector state, i.e. P(n_t | past) ≈ P(n_t | c_{t-1}), P(c_t[1] | past, n_t) ≈ P(c_t[1] | c_t[2..D_t]), P(w_t | past, c_t) ≈ P(w_t | c_t). • So P(N, C, W) ≈ ∏_t P(n_t | c_{t-1}) · P(c_t[1] | c_t[2..D_t]) · P(w_t | c_t).
Hidden Vector State Model (cont.) • The generative process associated with this constrained version of the HVS model consists of three steps for each word position t (see the sketch below): 1. choose a value for n_t, the number of concepts to pop off the stack; 2. select the preterminal concept tag c_t[1] to push; 3. select a word w_t.
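The three steps can be sketched directly in code. This is a minimal illustration of the constrained generative process, not the authors' implementation; the helper functions `P_pop`, `P_push`, `P_word` and the start symbol `SS` are assumptions introduced here.

```python
import random

# Sketch of the three-step HVS generative process. P_pop, P_push and P_word
# are assumed to be functions that return a probability dictionary for a given
# vector state (a tuple); they stand in for the three component distributions.

def draw(dist):
    items = list(dist)
    return random.choices(items, weights=[dist[i] for i in items])[0]

def hvs_generate(T, P_pop, P_push, P_word):
    """Generate T words; the vector state is a tuple with the newest label first."""
    state = ("SS",)                        # assumed sentence-start concept
    words = []
    for _ in range(T):
        # 1. choose n_t: how many concepts to pop off the previous stack
        n_t = draw(P_pop(state))
        remaining = state[n_t:]
        # 2. push exactly one new preterminal concept tag c_t[1]
        c_top = draw(P_push(remaining))
        state = (c_top,) + remaining
        # 3. emit a word conditioned on the full vector state c_t
        words.append(draw(P_word(state)))
    return words
```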
Hidden Vector State Model (cont.) • It is reasonable to ask an application designer to provide example utterances for each type of semantic schema. • It is not reasonable to require utterances with manually transcribed parse trees. • The model therefore assumes only abstract semantic annotations and the availability of a set of domain-specific lexical classes.
Hidden Vector State Model (cont.) Abstract semantic annotations: • Show me flights arriving in X at T. • List flights arriving around T in X. • Which flight reaches X before T. • All three map to FLIGHT(TOLOC(CITY(X), TIME_RELATIVE(TIME(T)))). Class set: CITY = {Boston, New York, Denver, …}
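A small sketch of the kind of lexical-class preprocessing this implies: members of a domain-specific class are replaced by the class label so that utterances can be aligned with the abstract annotations. The `LEXICAL_CLASSES` table and `tag_classes` helper are illustrative assumptions, not part of the original system.

```python
# Illustrative fragments of a class set, not the full ATIS classes.
LEXICAL_CLASSES = {
    "CITY": {"boston", "denver", "dallas"},
    "TIME": {"noon", "midnight"},
}

def tag_classes(tokens):
    """Replace any token found in a lexical class with its class label."""
    word_to_class = {w: c for c, words in LEXICAL_CLASSES.items() for w in words}
    return [word_to_class.get(tok.lower(), tok) for tok in tokens]

# tag_classes("list flights arriving in Denver around noon".split())
# -> ['list', 'flights', 'arriving', 'in', 'CITY', 'around', 'TIME']
```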
Experiments Experimental setup • Training set: ATIS-2 and ATIS-3 • Test sets: ATIS-3 NOV93 and DEC94 • Baseline: FST (finite state semantic tagger) • Smoothing: Good-Turing (GT) for the FST, Witten-Bell for the HVS model • Example — “Show me flights from Boston to New York”: Goal: FLIGHT; Slots: FROMLOC.CITY = Boston, TOLOC.CITY = New York
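For reference, a minimal sketch of a slot-level F-measure computed from goal/slot outputs like the example above, assuming slots are scored as unordered (name, value) pairs; the actual ATIS scoring script may differ.

```python
# Slot F-measure: harmonic mean of slot precision and recall, with slots
# compared as unordered (slot_name, value) pairs. Illustrative only.

def slot_f_measure(reference, hypothesis):
    ref, hyp = set(reference), set(hypothesis)
    correct = len(ref & hyp)
    if correct == 0:
        return 0.0
    precision = correct / len(hyp)
    recall = correct / len(ref)
    return 2 * precision * recall / (precision + recall)

# slot_f_measure({("FROMLOC.CITY", "Boston"), ("TOLOC.CITY", "New York")},
#                {("FROMLOC.CITY", "Boston")})        # -> 0.666...
```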
Experiments • (Results figure) Dashed line: goal detection accuracy; solid line: F-measure.
Conclusion • The key features of the HVS model are: • its ability to represent hierarchical information in a constrained way; • its capability of being trained directly from target semantics without explicit word-level annotation.
HVS Language Model • The basic HVS model is a regular HMM in which each state encodes the history in a fixed-dimension, stack-like structure. • Each state consists of a stack whose elements are labels drawn from a finite set of cardinality M+1: C = {c1, …, cM, c#}. • A depth-D HVS model state can therefore be characterized by a vector of dimension D, with the most recently pushed element at index 1 and the oldest at index D.
HVS Language Model (cont.) • Each HVS model state transition is restricted: (i) exactly n_t class labels are popped off the stack, and (ii) exactly one new class label is pushed onto the stack, becoming c_t[1]. • The number of elements to pop, n_t, and the choice of the new class label to push are determined by two separate distributions: a pop probability conditioned on the previous vector state, P(n_t | c_{t-1}), and a push probability conditioned on the labels that remain after the pop, P(c_t[1] | c_t[2..D]).
HVS Language Model (cont.) • n_t is conditioned on all the class labels in the stack at t−1 (the full vector state c_{t-1}), whereas c_t[1] is conditioned only on the class labels that remain on the stack after the pop operation (c_t[2..D]). • The former distribution can encode embedded structure, whereas the latter focuses on modelling long-range dependencies. A small sketch of this stack-shift transition follows.
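A minimal sketch of the restricted stack-shift transition, assuming the vector state is held as a tuple with the newest label first (mirroring index 1 on the slide); `shift` and its argument names are introduced here for illustration only.

```python
# Every transition pops n_t labels off the previous vector state and then
# pushes exactly one new label on top, keeping at most `depth` labels.

def shift(state, n_t, c_new, depth=4):
    """Apply one pop/push transition to a vector state (a tuple of labels)."""
    if not 0 <= n_t <= len(state):
        raise ValueError("cannot pop more labels than the stack holds")
    remaining = state[n_t:]                  # labels that survive the pop; the new
    return ((c_new,) + remaining)[:depth]    # label is conditioned on these

# shift(("CITY", "TOLOC", "FLIGHT", "SS"), n_t=2, c_new="FROMLOC")
# -> ('FROMLOC', 'FLIGHT', 'SS')
```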
HVS Language Model (cont.) • Joint probability (chain rule): P(W, C, N) = ∏_t P(n_t | past) · P(c_t[1] | past, n_t) · P(w_t | past, c_t). • Assumption: each factor depends only on the local vector state, giving P(W, C, N) ≈ ∏_t P(n_t | c_{t-1}) · P(c_t[1] | c_t[2..D]) · P(w_t | c_t).
HVS Language Model (cont.) • Training: EM algorithm. • C and N are latent data; W is the observed data. • E-step: compute the posterior P(C, N | W, λ') of the latent variables under the current parameter estimate λ' (in practice, the expected counts of the pop, push and word events).
HVS Language Model (cont.) • M-step: maximize the auxiliary Q function Q(λ, λ') = Σ_{C,N} P(C, N | W, λ') log P(W, C, N | λ). • Substituting the factored form of P(W, C, N | λ) splits Q into three independent terms, one per component distribution.
HVS Language Model (cont.) • Each of the three probability distributions can therefore be re-estimated separately, as sketched below.
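A minimal sketch of that separate re-estimation, assuming the E-step has already produced expected counts for the pop, push and word events; the count containers and the `normalise` helper are illustrative assumptions, not the report's code.

```python
# Expected counts are assumed to be stored as nested dictionaries of the form
# {context: {event: expected_count}}; each distribution is re-estimated
# independently by normalising its own counts.

def normalise(counts):
    """Turn expected counts into conditional probability distributions."""
    probs = {}
    for context, events in counts.items():
        total = sum(events.values())
        probs[context] = {event: c / total for event, c in events.items()}
    return probs

# P_pop  = normalise(pop_counts)    # pop_counts[c_{t-1}][n_t]
# P_push = normalise(push_counts)   # push_counts[c_t[2..D]][c_t[1]]
# P_word = normalise(word_counts)   # word_counts[c_t][w_t]
```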
HVS Language Model (cont.) • If the state space S were fully populated, it would contain |S| = M^D states; for M = 100+ and D = 3 to 4 this is far too large to estimate directly. • Due to data sparseness, backoff is needed.
HVS Language Model (cont.) • Backoff weights redistribute probability mass to events unseen in a given context. • The discounting uses a modified version of absolute discounting (a generic sketch of absolute-discounting backoff follows).
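For illustration, a generic sketch of backoff with absolute discounting; it shows the general idea of discounting seen events and redistributing the freed mass via a backoff weight, not the exact modified scheme in the technical report. `d` and the function name are assumptions.

```python
# Smooth one context's {event: count} table against a lower-order backoff
# distribution using absolute discounting: subtract a constant d from every
# seen count and give the freed mass to unseen events via weight alpha.

def discounted_distribution(counts, backoff_dist, d=0.5):
    total = sum(counts.values())
    seen = set(counts)
    freed_mass = d * len(seen) / total                 # mass released by discounting
    unseen_mass = sum(p for e, p in backoff_dist.items() if e not in seen)
    alpha = freed_mass / unseen_mass if unseen_mass > 0 else 0.0   # backoff weight

    probs = {e: (c - d) / total for e, c in counts.items()}
    for e, p in backoff_dist.items():
        if e not in seen:
            probs[e] = alpha * p                       # redistribute freed mass
    return probs
```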
Experiments • Training set: ATIS-3 (276K words, 23K sentences). • Development set: ATIS-3 NOV93. • Test set: ATIS-3 DEC94 (10K words, 1K sentences). • OOV words were removed. • k = 850.
Conclusion • The HVS language model is able to make better use of context than standard class n-gram models. • The HVS model is trainable using EM.