Learning, Uncertainty, and Information Big Ideas November 8, 2004
Roadmap • Turing, Intelligence, and Learning • Noisy-channel model • Uncertainty, Bayes’ Rule, and Applications • Hidden Markov Models • The Model • Decoding the best sequence • Training the model (EM) • N-gram models: Modeling sequences • Shannon, Information Theory, and Perplexity • Conclusion
Turing & Intelligence • Turing (1950): • Computing Machinery and Intelligence • “Imitation Game” (aka the Turing test) • Functional definition of intelligence: behavior indistinguishable from a human's • Key question raised: learning • Can a system be intelligent if it knows only what it was programmed with? • Learning necessary for intelligence: • 1) Programmed knowledge • 2) Learning mechanism • Knowledge, reasoning, learning, communication
Noisy-Channel Model • Original message not directly observable • Passed through some channel between sender and receiver, plus noise • Examples: telephone signals (Shannon), word sequence vs. acoustics (Jelinek), genome sequence vs. base reads (C, A, T, G), object vs. image • Derive the most likely original input from the observations
Bayesian Inference • P(W|O) difficult to compute directly • W – input, O – observations • Bayes' Rule: P(W|O) = P(O|W)P(W) / P(O) • Pick the W maximizing P(O|W)P(W): a generative (channel) model P(O|W) and a sequence model P(W)
Applications • AI: Speech recognition!, POS tagging, sense tagging, dialogue, image understanding, information retrieval • Non-AI: • Bioinformatics: gene sequencing • Security: intrusion detection • Cryptography
Probabilistic Reasoning over Time • Issue: Discrete models • Many processes continuously changing • How do we make observations? States? • Solution: Discretize • “Time slices”: Make time discrete • Observations, States associated with time: Ot, Qt • Observations can be discrete or continuous • Here focus on discrete for clarity
Modelling Processes over Time • Infer the underlying state sequence from observations • Issue: New state depends on preceding states • Analyzing sequences • Problem 1: Possibly unbounded # of probability tables • Observation+State+Time • Solution 1: Assume a stationary process • Rules governing the process are the same at all times • Problem 2: Possibly unbounded # of parents • Markov assumption: Only consider a finite history • Common: first- or second-order Markov: state depends only on the last one or two states
Hidden Markov Models (HMMs) • An HMM is: • 1) A set of states: Q = {q1, q2, …, qN} • 2) A set of transition probabilities: A = {aij} • Where aij is the probability of transition qi -> qj • 3) Observation probabilities: B = {bi(ot)} • The probability of observing ot in state i • 4) An initial probability distribution over states: π = {πi} • The probability of starting in state i • 5) A set of accepting (final) states
Three Problems for HMMs • Find the probability of an observation sequence given a model • Forward algorithm • Find the most likely path through a model given an observed sequence • Viterbi algorithm (decoding) • Find the most likely model (parameters) given an observed sequence • Baum-Welch (EM) algorithm
Bins and Balls Example • Assume there are two bins filled with red and blue balls. Behind a curtain, someone selects a bin and then draws a ball from it (and replaces it). They then select either the same bin or the other one and then select another ball… • (Example due to J. Martin)
Bins and Balls Example • Transition diagram (from the slide figure): Bin 1 -> Bin 1: 0.6, Bin 1 -> Bin 2: 0.4; Bin 2 -> Bin 2: 0.7, Bin 2 -> Bin 1: 0.3
Bins and Balls • Π — Bin 1: 0.9; Bin 2: 0.1 • A — the transition table (shown in the diagram) • B — the emission table (per-bin probabilities of Red and Blue)
Bins and Balls • Assume the observation sequence: • Blue Blue Red (BBR) • Both bins have Red and Blue • Any state sequence could produce observations • However, NOT equally likely • Big difference in start probabilities • Observation depends on state • State depends on prior state
Bins and Balls • Observation sequence: Blue Blue Red
Answers and Issues • Here, to compute the probability of the observed sequence • Just add up all the state-sequence probabilities • To find the most likely state sequence • Just pick the sequence with the highest value • Problem: Computing all paths is expensive • O(2T·N^T): N^T sequences, ~2T multiplications each • Solution: Dynamic Programming • Sweep across all states at each time step • Summing (Problem 1) or Maximizing (Problem 2)
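The brute-force approach above can be sketched directly: enumerate all N^T state sequences for the bins-and-balls model, sum their joint probabilities for Problem 1, and take the max for Problem 2. The π and transition values follow the slides; the emission table B is an assumed placeholder, since the slides do not list its values.

```python
from itertools import product

# Bins-and-balls HMM: states 0 = Bin 1, 1 = Bin 2.
pi = [0.9, 0.1]                   # initial distribution (from the slides)
A = [[0.6, 0.4],                  # transition probabilities (from the slides)
     [0.3, 0.7]]
B = [{"Red": 0.7, "Blue": 0.3},   # ASSUMED Bin 1 emissions (not in the slides)
     {"Red": 0.2, "Blue": 0.8}]   # ASSUMED Bin 2 emissions (not in the slides)

obs = ["Blue", "Blue", "Red"]

def path_prob(path):
    """Joint probability of one state sequence and the observations."""
    p = pi[path[0]] * B[path[0]][obs[0]]
    for t in range(1, len(obs)):
        p *= A[path[t - 1]][path[t]] * B[path[t]][obs[t]]
    return p

paths = list(product([0, 1], repeat=len(obs)))       # all N^T = 8 sequences
total = sum(path_prob(p) for p in paths)             # Problem 1: P(O)
best = max(paths, key=path_prob)                     # Problem 2: best sequence
print(total, best)
```

With only 8 sequences this is cheap, but the count grows as N^T, which is exactly why the dynamic-programming sweeps below replace full enumeration.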
Forward Probability • α1(j) = πj bj(o1) • αt+1(j) = [ Σi=1..N αt(i) aij ] bj(ot+1) • P(O) = Σi=1..N αT(i) • Where α is the forward probability, t is the time in the utterance, i, j are states in the HMM, aij is the transition probability, bj(ot) is the probability of observing ot in state j, N is the number of states, T is the last time step
Pronunciation Example • Observations: 0/1
Acoustic Model • 3-state phone model for [m]: Onset -> Mid -> End -> Final • Use a Hidden Markov Model (HMM) • Probability of a sequence: sum of the probabilities of its paths • Transition probabilities (from the slide figure): Onset self-loop 0.3, Onset->Mid 0.7; Mid self-loop 0.9, Mid->End 0.1; End self-loop 0.4, End->Final 0.6 • Observation probabilities (from the slide figure): Onset: C1 0.5, C2 0.2, C3 0.3; Mid: C3 0.2, C4 0.7, C5 0.1; End: C4 0.1, C6 0.4, C6 0.5
Forward Algorithm • Idea: a matrix where each cell forward[t, j] represents the probability of being in state j after seeing the first t observations. • Each cell expresses the probability: forward[t, j] = P(o1, o2, ..., ot, qt = j | w) • qt = j means "the tth state in the sequence of states is state j." • Compute the probability by summing over extensions of all paths leading to the current cell. • An extension of a path from a state i at time t-1 to state j at t is computed by multiplying together: i. the previous path probability from the previous cell, forward[t-1, i], ii. the transition probability aij from previous state i to current state j, iii. the observation likelihood bj(ot) that current state j matches observation ot.
Forward Algorithm Function Forward(observations of length T, state-graph) returns observation probability Num-states <- num-of-states(state-graph) Create path prob matrix forward[num-states+2, T+2] forward[0,0] <- 1.0 For each time step t from 0 to T do for each state s from 0 to num-states do for each transition s’ from s in state-graph new-score <- forward[s,t] * a[s,s’] * bs’(ot) forward[s’,t+1] <- forward[s’,t+1] + new-score Return the sum of the final column of forward[]
Viterbi Code Function Viterbi(observations of length T, state-graph) returns best-path Num-states <- num-of-states(state-graph) Create path prob matrix viterbi[num-states+2, T+2] viterbi[0,0] <- 1.0 For each time step t from 0 to T do for each state s from 0 to num-states do for each transition s’ from s in state-graph new-score <- viterbi[s,t] * a[s,s’] * bs’(ot) if ((viterbi[s’,t+1] == 0) || (viterbi[s’,t+1] < new-score)) then viterbi[s’,t+1] <- new-score back-pointer[s’,t+1] <- s Backtrace from the highest-probability state in the final column of viterbi[] and return the path
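The same sweep with max in place of sum, plus back-pointers, gives a Viterbi sketch. As before, the bins-and-balls emission table B is an assumed placeholder.

```python
def viterbi(obs, pi, A, B):
    """Viterbi decoding: returns (best path probability, best state path)."""
    n = len(pi)
    delta = [pi[j] * B[j][obs[0]] for j in range(n)]  # best prob ending in j
    back = []                                         # back-pointers per step
    for t in range(1, len(obs)):
        prev, delta, ptrs = delta, [], []
        for j in range(n):
            # best predecessor i (emission b_j(ot) is common to all i)
            i_best = max(range(n), key=lambda i: prev[i] * A[i][j])
            delta.append(prev[i_best] * A[i_best][j] * B[j][obs[t]])
            ptrs.append(i_best)
        back.append(ptrs)
    # backtrace from the highest-probability final state
    last = max(range(n), key=lambda j: delta[j])
    path = [last]
    for ptrs in reversed(back):
        path.append(ptrs[path[-1]])
    path.reverse()
    return delta[last], path

# Toy bins-and-balls model; B is ASSUMED (not given in the slides).
pi = [0.9, 0.1]
A = [[0.6, 0.4], [0.3, 0.7]]
B = [{"Red": 0.7, "Blue": 0.3}, {"Red": 0.2, "Blue": 0.8}]
print(viterbi(["Blue", "Blue", "Red"], pi, A, B))
```

Note the only change from the forward algorithm is replacing the sum over predecessors with a max and remembering which predecessor won.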
Modeling Sequences, Redux • Discrete observation values • Simple, but inadequate • Many observations are highly variable • Gaussian pdfs over continuous values • Assume normally distributed observations • Typically a weighted sum over multiple shared Gaussians • “Gaussian mixture models” (GMMs) • Trained jointly with the HMM
Learning HMMs • Issue: Where do the probabilities come from? • Solution: Learn them from data • Trains the transition (aij) and emission (bj) probabilities • Typically assume the model structure is fixed • Baum-Welch, aka the forward-backward algorithm • Iteratively estimate expected counts of transitions and emissions • Get estimated probabilities via the forward (and backward) computation • Divide probability mass over the contributing paths
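One Baum-Welch iteration for a discrete-emission HMM can be sketched as follows: compute forward (α) and backward (β) probabilities, form expected state occupancies (γ) and expected transitions (ξ), then re-estimate π, A, and B as normalized expected counts. This is a toy single-sequence sketch, not a production implementation; numerically stable versions work in log space or with per-step scaling.

```python
def baum_welch_step(obs, pi, A, B):
    """One Baum-Welch (EM) iteration: returns re-estimated (pi, A, B)."""
    n, T = len(pi), len(obs)
    # E-step: forward probabilities alpha[t][j] = P(o1..ot, qt = j)
    alpha = [[0.0] * n for _ in range(T)]
    for j in range(n):
        alpha[0][j] = pi[j] * B[j][obs[0]]
    for t in range(1, T):
        for j in range(n):
            alpha[t][j] = sum(alpha[t - 1][i] * A[i][j]
                              for i in range(n)) * B[j][obs[t]]
    # backward probabilities beta[t][i] = P(o_{t+1}..oT | qt = i)
    beta = [[1.0 if t == T - 1 else 0.0] * n for t in range(T)]
    for t in range(T - 2, -1, -1):
        for i in range(n):
            beta[t][i] = sum(A[i][j] * B[j][obs[t + 1]] * beta[t + 1][j]
                             for j in range(n))
    p_obs = sum(alpha[T - 1])
    # gamma[t][i]: expected occupancy; xi[t][i][j]: expected transitions
    gamma = [[alpha[t][i] * beta[t][i] / p_obs for i in range(n)]
             for t in range(T)]
    xi = [[[alpha[t][i] * A[i][j] * B[j][obs[t + 1]] * beta[t + 1][j] / p_obs
            for j in range(n)] for i in range(n)] for t in range(T - 1)]
    # M-step: divide expected counts by total occupancy
    new_pi = gamma[0][:]
    new_A = [[sum(xi[t][i][j] for t in range(T - 1)) /
              sum(gamma[t][i] for t in range(T - 1))
              for j in range(n)] for i in range(n)]
    new_B = [{o: sum(g[i] for t, g in enumerate(gamma) if obs[t] == o) /
                 sum(g[i] for g in gamma)
              for o in B[i]} for i in range(n)]
    return new_pi, new_A, new_B
```

Iterating this step on the bins-and-balls model (with an assumed starting emission table, since the slides give none) monotonically increases the likelihood of the training sequence until convergence to a local optimum.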