Natural Language Processing - Lecture 5: Hidden Markov Models Oren Glickman, Department of Computer Science, Bar-Ilan University 88-680
Stochastic POS Tagging • POS tagging: For a given sentence W = w1…wn, find the matching POS tags T = t1…tn • In a statistical framework: T' = arg max_T P(T|W)
Bayes' Rule • By Bayes' rule: T' = arg max_T P(T|W) = arg max_T P(W|T)P(T)/P(W) = arg max_T P(W|T)P(T) • Assuming words are independent of each other and a word's presence depends only on its tag: P(W|T) ≈ Πi P(wi|ti) • Markovian assumptions on the tag sequence: P(T) ≈ Πi P(ti|ti-1)
The Markovian assumptions • Limited horizon: P(Xi+1 = tk | X1,…,Xi) = P(Xi+1 = tk | Xi) • Time invariant: P(Xi+1 = tk | Xi) = P(Xj+1 = tk | Xj)
Maximum Likelihood Estimation • In order to estimate P(wi|ti), P(ti|ti-1) we can use the maximum likelihood estimation: • P(wi|ti) = c(wi,ti) / c(ti) • P(ti|ti-1) = c(ti-1,ti) / c(ti-1)
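A minimal Python sketch of these counts, assuming a hypothetical `tagged_sentences` list in which each sentence is a list of (word, tag) pairs from an annotated corpus:

```python
from collections import defaultdict

def mle_estimates(tagged_sentences):
    """Relative-frequency estimates of P(w|t) and P(t|t_prev)."""
    tag_count = defaultdict(int)          # c(t)
    word_tag_count = defaultdict(int)     # c(w, t)
    tag_bigram_count = defaultdict(int)   # c(t_prev, t)

    for sentence in tagged_sentences:
        prev = "<s>"                      # start-of-sentence pseudo-tag
        tag_count[prev] += 1
        for word, tag in sentence:
            tag_count[tag] += 1
            word_tag_count[(word, tag)] += 1
            tag_bigram_count[(prev, tag)] += 1
            prev = tag

    emission = {(t, w): c / tag_count[t] for (w, t), c in word_tag_count.items()}
    transition = {(p, t): c / tag_count[p] for (p, t), c in tag_bigram_count.items()}
    return emission, transition
```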
Viterbi • Finding the most probable tag sequence can be done with the Viterbi algorithm. • No need to calculate every single possible tag sequence (!)
HMMs • Assume a state machine with: • Nodes that correspond to tags • A start and an end state • Arcs corresponding to transition probabilities - P(ti|ti-1) • A set of observation likelihoods for each state - P(wi|ti)
[Figure: an example HMM fragment with tag states such as AT, NN, NNS, VB, VBZ, RB, transition probabilities on the arcs (e.g. 0.6 and 0.4), and per-state emission probabilities, e.g. P(the)=0.4, P(a)=0.3, P(an)=0.2 for the article state and P(likes)=0.3, P(flies)=0.1, P(eats)=0.5 / P(like)=0.2, P(fly)=0.3, P(eat)=0.36 for verb states]
HMMs • An HMM is similar to an automaton augmented with probabilities • Note that the states in an HMM do not correspond to the input symbols. • The input symbols don't uniquely determine the next state.
HMM definition • HMM = (S, K, A, B) • Set of states S = {s1,…,sn} • Output alphabet K = {k1,…,kn} • State transition probabilities A = {aij}, i,j ∈ S • Symbol emission probabilities B = b(i,k), i ∈ S, k ∈ K • Start and end states (non-emitting) • Note: for a given i, Σj aij = 1 and Σk b(i,k) = 1
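As an illustration only (this representation is not from the lecture), such a model can be held in plain Python dictionaries; the probabilities below are made up for a one-sentence toy example:

```python
# Assumed toy HMM = (S, K, A, B) for the sentence "the dog barks".
hmm = {
    "states": ["AT", "NN", "VBZ"],            # S (start/end handled separately)
    "alphabet": ["the", "dog", "barks"],      # K
    "trans": {                                # A: P(t_i | t_{i-1}), rows sum to 1
        ("<s>", "AT"): 1.0, ("AT", "NN"): 1.0,
        ("NN", "VBZ"): 1.0, ("VBZ", "</s>"): 1.0,
    },
    "emit": {                                 # B: P(w | t), rows sum to 1
        ("AT", "the"): 1.0, ("NN", "dog"): 1.0, ("VBZ", "barks"): 1.0,
    },
}
```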
Why Hidden? • Because we only observe the input (the words) - the underlying states (the tags) are hidden
Decoding • The problem of part-of-speech tagging can be viewed as a decoding problem: given an observation sequence W = w1,…,wn, find a state sequence T = t1,…,tn that best explains the observation.
Viterbi • A dynamic programming algorithm: • For every state j in the HMM, δj(i) = the probability of the best path that leads to node j given observations o1,…,oi • For every state j in the HMM, ψj(i) = back-pointers…
Viterbi… • Initialization: δj(0) = πj • Induction: δj(i+1) = maxk δk(i) akj b(j,oi+1) • Set ψj(i+1) accordingly (to the maximizing k) • Termination: backtrace from the end state using ψ
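A Python sketch of this recursion in log-space, reusing the assumed `emission`/`transition` dictionaries from the estimation sketch above (all names are illustrative, not from the lecture):

```python
import math

def viterbi(words, states, trans, emit, start="<s>", end="</s>"):
    """Most probable tag sequence for `words` under the bigram HMM.

    `trans[(t_prev, t)]` and `emit[(t, w)]` are assumed to be the MLE
    probabilities from the estimation sketch; missing entries count as 0.
    """
    NEG_INF = float("-inf")

    def logp(p):
        return math.log(p) if p > 0 else NEG_INF

    # delta[i][t] = best log-probability of any path ending in tag t after
    # emitting words[0..i]; psi[i][t] = back-pointer to its predecessor tag.
    delta = [{t: logp(trans.get((start, t), 0.0)) + logp(emit.get((t, words[0]), 0.0))
              for t in states}]
    psi = [{}]

    for i in range(1, len(words)):
        delta.append({})
        psi.append({})
        for t in states:
            best_prev, best_score = None, NEG_INF
            for p in states:
                score = delta[i - 1][p] + logp(trans.get((p, t), 0.0))
                if score > best_score:
                    best_prev, best_score = p, score
            delta[i][t] = best_score + logp(emit.get((t, words[i]), 0.0))
            psi[i][t] = best_prev

    # Termination: take the transition into the end state, then backtrace.
    last = max(states, key=lambda t: delta[-1][t] + logp(trans.get((t, end), 0.0)))
    tags = [last]
    for i in range(len(words) - 1, 0, -1):
        tags.append(psi[i][tags[-1]])
    return list(reversed(tags))
```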
A*, N-best decoding • Sometimes one wants not just the best state sequence for a given input but rather the top n best sequences, e.g. as input for a different model • A* / stack decoding is an alternative to Viterbi.
Finding the probability of an observation • Given an HMM, how do we efficiently compute how likely a certain observation is? • Why would we want this? • For speech decoding, language modeling • Not trivial, because the observation can result from different paths.
Naïve approach • Sum over all possible state sequences: P(W) = ΣT P(W|T)P(T) • The number of sequences is exponential in the length of the observation.
The forward algorithm • A dynamic programming algorithm similar to Viterbi can be applied to efficiently calculate the probability of a given observation. • The algorithm can work forward from the beginning of the observation or backward from its end.
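A sketch of the forward pass under the same assumed dictionaries as the Viterbi sketch; replacing the max over previous states with a sum is the only structural change (probabilities are left unscaled here, so a real implementation would use scaling or log-space sums):

```python
def forward_probability(words, states, trans, emit, start="<s>", end="</s>"):
    """Forward algorithm sketch: P(words), summed over all state sequences."""
    # alpha[t] = probability of emitting the words seen so far and ending in state t
    alpha = {t: trans.get((start, t), 0.0) * emit.get((t, words[0]), 0.0)
             for t in states}
    for word in words[1:]:
        alpha = {t: emit.get((t, word), 0.0)
                    * sum(alpha[p] * trans.get((p, t), 0.0) for p in states)
                 for t in states}
    return sum(alpha[t] * trans.get((t, end), 0.0) for t in states)
```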
Up from bigrams • The POS tagging model we described used a history of just the previous tag: P(ti|t1,…,ti-1) = P(ti|ti-1), i.e. a first-order Markovian assumption • In this case each state in the HMM corresponds to a POS tag • One can build an HMM for POS trigrams: P(ti|t1,…,ti-1) = P(ti|ti-2,ti-1)
POS Trigram HMM Model • More accurate than a bigram model, e.g.: • He clearly marked • is clearly marked • In such a model the HMM states do NOT correspond to POS tags. • Why not 4-grams? • Too many states, not enough data!
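A toy illustration (not from the lecture) of why the trigram model's states are tag pairs rather than single tags:

```python
from itertools import product

# In a trigram HMM each state remembers the previous TWO tags, so the state
# set is (roughly) the set of tag pairs rather than the tag set itself.
tags = ["AT", "NN", "NNS", "VB", "VBZ", "RB"]          # toy tag set
trigram_states = list(product(tags, repeat=2))         # e.g. ("AT", "NN")

# A transition ("AT", "NN") -> ("NN", "VBZ") carries probability P(VBZ | AT, NN).
# With a realistic tag set of roughly 45 tags there are about 2,000 such states;
# moving to 4-grams would mean around 90,000 states - too many to estimate
# reliably from limited training data.
print(len(trigram_states))                             # 36 for this toy set
```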
Question • Is HMM-based tagging a supervised algorithm? • Yes, because we need a tagged corpus to estimate the transition and emission probabilities (!) • What do we do if we don't have an annotated corpus but: • Have a dictionary • Have an annotated corpus from a different domain and an un-annotated corpus in the desired domain.
Baum-Welch Algorithm • Also known as the forward-backward algorithm • An EM algorithm for HMMs. • Maximization by iterative hill climbing • The algorithm iteratively improves the model parameters based on un-annotated training data.
Baum-Welch Algorithm… • Start off with parameters based on the dictionary: • P(w|t) = 1 if t is a possible tag for w • P(w|t) = 0 otherwise • Uniform distribution on state transitions • This is enough to bootstrap from. • Could also be used to tune a system to a new domain.
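A minimal sketch of this dictionary-based initialization, assuming a hypothetical `dictionary` mapping each word to its set of possible tags (the EM re-estimation itself is not shown):

```python
def init_from_dictionary(dictionary, tags):
    """Bootstrap emission/transition tables from a tag dictionary alone."""
    # Following the slide: emission mass only on (word, tag) pairs the
    # dictionary allows, and uniform transitions between all tags.
    emit = {(t, w): (1.0 if t in allowed else 0.0)
            for w, allowed in dictionary.items() for t in tags}
    uniform = 1.0 / len(tags)
    trans = {(p, t): uniform for p in ["<s>"] + tags for t in tags}
    return emit, trans
```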
Unknown Words • Many words will not appear in the training corpus. • Unknown words are a major problem for taggers (!) • Solutions: • Incorporate morphological analysis • Consider words appearing once in the training data as UNKNOWNs
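A sketch of the second solution, assuming the same hypothetical `tagged_sentences` format as before: words seen exactly once in training are mapped to a single UNKNOWN token, so the tagger learns emission probabilities it can later apply to unseen words:

```python
from collections import Counter

def replace_hapax_with_unknown(tagged_sentences, unk="<UNKNOWN>"):
    """Replace words that occur only once in training with an UNKNOWN token."""
    counts = Counter(w for sent in tagged_sentences for w, _ in sent)
    return [[(w if counts[w] > 1 else unk, t) for w, t in sent]
            for sent in tagged_sentences]
```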
Completely unsupervised • What if there is no dictionary and no annotated corpus?
Evaluation
Homework