Lecture 25: CS573 Advanced Artificial Intelligence
Milind Tambe
Computer Science Dept and Information Sciences Institute
University of Southern California
Tambe@usc.edu
Surprise Quiz II: Part I
[Figure: small Bayesian network over nodes A, B, C with P(A) = 0.05]
Questions: Surprise
Dynamic Belief Nets
[Figure: chain of slices Xt → Xt+1 → Xt+2, each state emitting evidence Et, Et+1, Et+2]
• In each time slice:
• Xt = Unobservable (hidden) state variables
• Et = Observable evidence variables
Types of Inference
• Filtering or monitoring: P(Xt | e1, e2…et)
• Keep track of the probability distribution over the current state
• Like a POMDP belief state
• P(@ISI | c1,c2…ct) and P(N@ISI | c1,c2…ct)
• Prediction: P(Xt+k | e1,e2…et) for some k > 0
• P(@ISI 3 hours from now | c1,c2…ct)
• Smoothing or hindsight: P(Xk | e1, e2…et) for 0 <= k < t
• What was the state of the user at 11 AM, given observations at 9 AM, 10 AM, 11 AM, 1 PM, 2 PM?
• Most likely explanation: given a sequence of observations, find the sequence of states that is most likely to have generated the observations (speech recognition)
• Argmax over x1:t of P(x1:t | e1:t)
Filtering: P(Xt+1 | e1,e2…et+1)
RECURSION:
P(Xt+1 | e1:t+1) = f1:t+1 = Norm * P(et+1 | Xt+1) * Σxt [ P(Xt+1 | xt) * P(xt | e1:t) ]
• e1:t+1 = e1, e2…et+1
• P(xt | e1:t) = f1:t
• f1:t+1 = Norm-const * FORWARD (f1:t, et+1)
Computing Forward f1:t+1
• For our example of tracking user location:
• f1:t+1 = Norm-const * FORWARD (f1:t, ct+1)
• Note that f1:t+1 is a vector, not a single quantity
• f1:2 = P(L2 | c1, c2) means computing both components < P(L2 = @ISI | c1, c2), P(L2 = N@ISI | c1, c2) >, then normalizing
• Hope you tried out all the computations from the last lecture at home!
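To make the recursion concrete, here is a minimal Python sketch of the forward update for the user-location example. The 0.7/0.3 transition and 0.9/0.2 sensor numbers are the ones used in the worked examples in this lecture; the uniform prior over L0 is an assumption, chosen because it reproduces f1:1 = <0.818, 0.182>.

```python
# Forward filtering sketch for the @ISI / N@ISI example.
# States: 0 = @ISI, 1 = N@ISI.  Evidence ct = "computer activity observed".
T = [[0.7, 0.3],   # row i: P(L_{t+1} = j | L_t = i)
     [0.3, 0.7]]
P_c = [0.9, 0.2]   # P(c_t = true | L_t = @ISI), P(c_t = true | L_t = N@ISI)

def forward(f, c_observed):
    """One step of f_{1:t+1} = Norm * P(e_{t+1}|X_{t+1}) * sum_xt P(X_{t+1}|x_t) * f_{1:t}(x_t)."""
    predicted = [sum(T[i][j] * f[i] for i in range(2)) for j in range(2)]
    sensor = [P_c[j] if c_observed else 1 - P_c[j] for j in range(2)]
    unnorm = [sensor[j] * predicted[j] for j in range(2)]
    norm = sum(unnorm)
    return [u / norm for u in unnorm]

f = [0.5, 0.5]            # assumed uniform prior over L0
f = forward(f, True)      # observe c1 -> approximately [0.818, 0.182]
print(f)
f = forward(f, True)      # observe c2 -> approximately [0.883, 0.117]
print(f)
```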
Robotic Perception
[Figure: DBN with action nodes At-1, At, At+1 feeding the state nodes Xt, Xt+1, Xt+2, each emitting Et, Et+1, Et+2]
• At = action at time t (observed evidence)
• Xt = state of the environment at time t
• Et = observation at time t (observed evidence)
Robotic Perception
• Similar to the filtering task seen earlier
• Differences:
• Must take into account the action evidence:
P(Xt+1 | e1:t+1, a1:t) = Norm * P(et+1 | Xt+1) * Σxt [ P(Xt+1 | xt, at) * P(xt | e1:t, a1:t-1) ]
(compare with the POMDP belief update)
• Must note that the variables are continuous, so the sum becomes an integral:
P(Xt+1 | e1:t+1, a1:t) = Norm * P(et+1 | Xt+1) * ∫ P(Xt+1 | xt, at) * P(xt | e1:t, a1:t-1) dxt
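For the discrete case, the action-conditioned update looks like the plain filtering step with an extra argument to the transition model. The sketch below is illustrative only: the two-room state space, the "stay"/"move" actions, and all the probabilities are made-up placeholders, not part of the lecture.

```python
# Filtering with action evidence (discrete sketch):
# P(X_{t+1} | e_{1:t+1}, a_{1:t}) ∝ P(e_{t+1}|X_{t+1}) * sum_xt P(X_{t+1}|x_t, a_t) * P(x_t | e_{1:t}, a_{1:t-1})
STATES = ["room_a", "room_b"]            # placeholder state space

def trans(next_s, s, action):            # placeholder P(X_{t+1} = next_s | X_t = s, A_t = action)
    stay_prob = 0.8 if action == "stay" else 0.3
    return stay_prob if next_s == s else 1 - stay_prob

def sensor(e, s):                        # placeholder P(E_{t+1} = e | X_{t+1} = s)
    return 0.9 if e == s else 0.1

def filter_step(belief, action, evidence):
    unnorm = {}
    for ns in STATES:
        predicted = sum(trans(ns, s, action) * belief[s] for s in STATES)
        unnorm[ns] = sensor(evidence, ns) * predicted
    z = sum(unnorm.values())
    return {s: p / z for s, p in unnorm.items()}

belief = {"room_a": 0.5, "room_b": 0.5}
belief = filter_step(belief, "stay", "room_a")   # acted "stay", then sensed "room_a"
print(belief)                                    # roughly {room_a: 0.9, room_b: 0.1}
```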
Prediction
• Filtering without incorporating new evidence
• P(Xt+k | e1,e2…et) for some k > 0
• E.g., P(L3 | c1) = ΣL2 P(L3 | L2) * P(L2 | c1)    [both factors computed in the last lecture]
P(L3=@ISI | c1) = P(L3=@ISI | L2=@ISI) * P(L2=@ISI | c1) + P(L3=@ISI | L2=N@ISI) * P(L2=N@ISI | c1)
= 0.7 * 0.6272 + 0.3 * 0.3728 = 0.43904 + 0.11184 = 0.55
• P(L4=@ISI | c1) = ΣL3 P(L4=@ISI | L3) * P(L3 | c1) = 0.7 * 0.55 + 0.3 * 0.45 = 0.52
Prediction
• P(L5=@ISI | c1) = 0.7 * 0.52 + 0.3 * 0.48 = 0.508
• P(L6=@ISI | c1) ≈ 0.7 * 0.5 + 0.3 * 0.5 = 0.5 … (converging to 0.5)
• The predicted distribution of the user's location converges to a fixed point
• This is the stationary distribution of the Markov process
• Mixing time: the time taken to reach the fixed point
• Prediction is useful only if k << mixing time
• The more uncertainty there is in the transition model, the shorter the mixing time, and the harder it is to predict far into the future
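The convergence can be checked by iterating the prediction step with no new evidence, as in the minimal sketch below. It starts from P(L2 | c1) = <0.6272, 0.3728> (from the last lecture) and carries full precision, so the values differ slightly from the rounded ones on the slide.

```python
# Iterating prediction P(L_{t+k} | c1): the distribution mixes toward the
# stationary distribution <0.5, 0.5> of the Markov chain.
T = [[0.7, 0.3],
     [0.3, 0.7]]

p = [0.6272, 0.3728]          # P(L2 | c1) from the previous lecture's filtering step
for k in range(3, 9):
    p = [sum(T[i][j] * p[i] for i in range(2)) for j in range(2)]
    print(f"P(L{k} = @ISI | c1) = {p[0]:.4f}")
# Prints roughly 0.5509, 0.5204, 0.5081, 0.5033, 0.5013, 0.5005 -> mixing toward 0.5
```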
Smoothing
• P(Xk | e1, e2…et) for 0 <= k < t
• P(Lk | c1,c2…ct) = Norm * P(Lk | c1,c2..ck) * P(ck+1..ct | Lk)
= Norm * f1:k * bk+1:t
• bk+1:t is a backward message, analogous to our earlier forward message
• Hence the algorithm is called the forward-backward algorithm
bk+1:t backward message
[Figure: slices Xk → Xk+1 → Xk+2 with evidence Ek, Ek+1, Ek+2]
bk+1:t = P(ek+1:t | Xk)
= P(ek+1, ek+2 … et | Xk)
= Σxk+1 P(ek+1, ek+2 … et | Xk, xk+1) * P(xk+1 | Xk)
bk+1:t backward message
• bk+1:t = P(ek+1:t | Xk)
= Σxk+1 P(ek+1, ek+2 … et | Xk, xk+1) * P(xk+1 | Xk)
= Σxk+1 P(ek+1, ek+2 … et | xk+1) * P(xk+1 | Xk)    [ek+1:t independent of Xk given xk+1]
= Σxk+1 P(ek+1 | xk+1) * P(ek+2:t | xk+1) * P(xk+1 | Xk)
bk+1:t backward message
bk+1:t = P(ek+1:t | Xk) = Σxk+1 P(ek+1 | xk+1) * P(ek+2:t | xk+1) * P(xk+1 | Xk)
• i.e., bk+1:t = BACKWARD (bk+2:t, ek+1:t)
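A minimal Python sketch of this backward recursion for the running example (same 0.7/0.3 transition and 0.9/0.2 sensor numbers); it reproduces the <0.69, 0.41> message computed on the next slides.

```python
# Backward message: b_{k+1:t}(X_k) = sum_{x_{k+1}} P(e_{k+1}|x_{k+1}) * b_{k+2:t}(x_{k+1}) * P(x_{k+1}|X_k)
T = [[0.7, 0.3],
     [0.3, 0.7]]
P_c = [0.9, 0.2]

def backward(b_next, c_observed):
    sensor = [P_c[j] if c_observed else 1 - P_c[j] for j in range(2)]
    return [sum(sensor[j] * b_next[j] * T[i][j] for j in range(2)) for i in range(2)]

b = [1.0, 1.0]          # b_{t+1:t} is all ones (empty evidence sequence)
b = backward(b, True)   # b_{2:2} = P(c2 | L1) -> approximately [0.69, 0.41]
print(b)
```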
Example of Smoothing
• P(L1 = @ISI | c1, c2) = Norm * P(L1 | c1) * P(c2 | L1) = Norm * 0.818 * P(c2 | L1)
• Using bk+1:t = Σxk+1 P(ek+1 | xk+1) * P(ek+2:t | xk+1) * P(xk+1 | Xk):
P(c2 | L1 = @ISI) = ΣL2 P(c2 | L2) * P(c3:2 | L2) * P(L2 | L1 = @ISI)
= (0.9 * 1 * 0.7) + (0.2 * 1 * 0.3) = 0.69
Example of Smoothing
P(c2 | L1 = @ISI) = ΣL2 P(c2 | L2) * P(L2 | L1 = @ISI) = (0.9 * 0.7) + (0.2 * 0.3) = 0.69
• P(L1 = @ISI | c1, c2) = Norm * 0.818 * 0.69 = Norm * 0.56442
• Similarly P(c2 | L1 = N@ISI) = (0.9 * 0.3) + (0.2 * 0.7) = 0.41, so
P(L1 = N@ISI | c1, c2) = Norm * 0.182 * 0.41 = Norm * 0.0746
• After normalization: P(L1 = @ISI | c1, c2) = 0.883
• The smoothed estimate 0.883 > the filtered estimate P(L1 = @ISI | c1) = 0.818!
• WHY?
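Combining the two messages in code reproduces the smoothed estimate; a small sketch using the forward value f1:1 = <0.818, 0.182> from the previous lecture and the backward value b2:2 = <0.69, 0.41> computed above.

```python
# Smoothing: P(L1 | c1, c2) = Norm * f_{1:1} * b_{2:2}
f_1 = [0.818, 0.182]      # P(L1 | c1), the forward (filtered) estimate
b_2 = [0.69, 0.41]        # P(c2 | L1), the backward message

unnorm = [f * b for f, b in zip(f_1, b_2)]   # [0.564, 0.075] before normalization
z = sum(unnorm)
smoothed = [u / z for u in unnorm]
print(smoothed)           # approximately [0.883, 0.117]
```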
HMM
• Hidden Markov Models
• Speech recognition is perhaps the most popular application
• Any speech recognition researchers in class?
• Waibel and Lee
• HMMs have dominated speech recognition since the 1980s
• Under ideal isolated-word conditions, reported accuracy is around 99%
• Accuracy drops with noise and multiple speakers
• HMMs find applications everywhere: just try putting "HMM" into Google
• First we gave the Bellman update to AI (and other sciences)
• Now we make our second huge contribution to AI: the Viterbi algorithm!
HMM
• The simple structure of an HMM allows simple and elegant algorithms
• Transition model P(Xt+1 | Xt) for all values of Xt
• Represented as an |S| x |S| matrix; for our example, matrix "T"
• Tij = P(Xt = j | Xt-1 = i)
• The sensor model is also represented as a matrix, a diagonal one
• The diagonal entries give P(et | Xt = i), where et is the evidence, e.g., ct = true
• Matrix Ot
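In code, these matrices for the running example might look as follows; a minimal numpy sketch, where the 0.7/0.3 and 0.9/0.2 entries are the values used throughout the lecture and the c = false matrix is simply their complement.

```python
import numpy as np

# Transition matrix: T[i, j] = P(X_t = j | X_{t-1} = i), with states 0 = @ISI, 1 = N@ISI
T = np.array([[0.7, 0.3],
              [0.3, 0.7]])

# Sensor (observation) matrices: diagonal entries are P(e_t | X_t = i)
O_c_true = np.diag([0.9, 0.2])    # evidence c_t = true (computer activity observed)
O_c_false = np.diag([0.1, 0.8])   # evidence c_t = false (complementary probabilities)
```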
HMM
• f1:t+1 = Norm-const * FORWARD (f1:t, ct+1)
= Norm-const * P(ct+1 | Lt+1) * ΣLt P(Lt+1 | Lt) * P(Lt | c1,c2…ct)
= Norm-const * Ot+1 * T^T * f1:t
• f1:2 = P(L2 | c1, c2) = Norm-const * O2 * T^T * f1:1
= Norm-const * [0.9 0; 0 0.2] * [0.7 0.3; 0.3 0.7] * <0.818, 0.182>    (rows separated by ;)
HMM
• f1:2 = P(L2 | c1, c2) = Norm-const * O2 * T^T * f1:1
= Norm-const * [0.9 0; 0 0.2] * [0.7 0.3; 0.3 0.7] * <0.818, 0.182>
= Norm-const * [0.63 0.27; 0.06 0.14] * <0.818, 0.182>
= Norm * <(0.63*0.818 + 0.27*0.182), (0.06*0.818 + 0.14*0.182)>
= Norm * <0.564, 0.074>
• After normalization: <0.883, 0.117>
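The same matrix computation as a short numpy sketch, reproducing <0.883, 0.117>; f1:1 = <0.818, 0.182> is the filtered estimate from the previous lecture.

```python
import numpy as np

T = np.array([[0.7, 0.3],
              [0.3, 0.7]])
O2 = np.diag([0.9, 0.2])          # observation matrix for c2 = true
f_1 = np.array([0.818, 0.182])    # f_{1:1} = P(L1 | c1)

f_2 = O2 @ T.T @ f_1              # f_{1:2} = Norm-const * O_2 * T^T * f_{1:1}
f_2 = f_2 / f_2.sum()             # normalize
print(f_2)                        # approximately [0.883, 0.117]
```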
Backward in HMM
• bk+1:t = P(ek+1:t | Xk) = Σxk+1 P(ek+1 | xk+1) * P(ek+2:t | xk+1) * P(xk+1 | Xk)
= T * Ok+1 * bk+2:t
• b2:2 = < P(c2 | L1 = @ISI), P(c2 | L1 = N@ISI) > = T * O2 * b3:2
Backward
• bk+1:t = T * Ok+1 * bk+2:t
• b2:2 = T * O2 * b3:2, with b3:2 = <1, 1>
= [0.7 0.3; 0.3 0.7] * [0.9 0; 0 0.2] * <1, 1>
= <0.69, 0.41>
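The backward message in matrix form, as a numpy sketch reproducing <0.69, 0.41>; the all-ones b3:2 corresponds to the empty evidence sequence c3:2.

```python
import numpy as np

T = np.array([[0.7, 0.3],
              [0.3, 0.7]])
O2 = np.diag([0.9, 0.2])       # observation matrix for c2 = true
b_3 = np.array([1.0, 1.0])     # b_{3:2}: empty evidence sequence -> all ones

b_2 = T @ O2 @ b_3             # b_{2:2} = T * O_2 * b_{3:2}
print(b_2)                     # approximately [0.69, 0.41]
```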
Key Results for HMMs
• f1:t+1 = Norm-const * Ot+1 * T^T * f1:t
• bk+1:t = T * Ok+1 * bk+2:t
Inference in DBN
[Figure: DBN unrolled over slices Xt, Xt+1, Xt+2, Xt+3 with evidence Et, Et+1, Et+2, Et+3]
• How to do inference in a DBN in general?
• Could unroll the loop forever…
• Slices added beyond the last observation have no effect on inference
• WHY?
• So only keep slices within the observation period
Inference in DBN
[Figure: DBN slices alongside the burglary network, with Alarm as parent of JohnCalls and MaryCalls]
• Slices added beyond the last observation have no effect on inference
• WHY?
• Analogy: P(Alarm | JohnCalls) is independent of the unobserved MaryCalls node
Complexity of inference in DBN
• Keep at most two slices in memory
• Start with slice 0
• Add slice 1
• "Sum out" slice 0 (get a probability distribution over the slice 1 state; don't need to go back to slice 0 anymore – like POMDPs)
• Add slice 2, sum out slice 1…
• Constant time and space per update
• Unfortunately, the update is exponential in the number of state variables
• Need approximate inference algorithms
Solving DBNs in General
• Exact methods:
• Compute intensive
• Variable elimination from Chapter 14
• Approximate methods:
• Particle filtering has gained popularity (see the sketch below)
• Run N samples together through the slices of the DBN
• All N samples together constitute the forward message
• Highly efficient
• Hard to provide theoretical guarantees
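A minimal particle filtering sketch for the two-state location example. This is one common variant (propagate, weight by the sensor model, resample); the uniform initial sample and the specific resampling scheme are assumptions, not something specified on the slide.

```python
import random

# Particle filtering for the @ISI / N@ISI example: N particles are pushed through the
# transition model, weighted by the sensor model, and resampled; together the N particles
# approximate the forward message.
STATES = ["@ISI", "N@ISI"]
P_NEXT_ISI = {"@ISI": 0.7, "N@ISI": 0.3}   # P(L_{t+1} = @ISI | L_t)
P_C_TRUE = {"@ISI": 0.9, "N@ISI": 0.2}     # P(c_t = true | L_t)

def particle_filter_step(particles, c_observed):
    # 1. Propagate each particle through the transition model
    moved = ["@ISI" if random.random() < P_NEXT_ISI[p] else "N@ISI" for p in particles]
    # 2. Weight each particle by the sensor model
    weights = [P_C_TRUE[s] if c_observed else 1 - P_C_TRUE[s] for s in moved]
    # 3. Resample N particles in proportion to the weights
    return random.choices(moved, weights=weights, k=len(moved))

particles = [random.choice(STATES) for _ in range(5000)]   # assumed uniform prior
particles = particle_filter_step(particles, True)          # observe c1
particles = particle_filter_step(particles, True)          # observe c2
print(particles.count("@ISI") / len(particles))            # close to the exact 0.883
```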
Next Lecture • Continue with Chapter 15
Surprise Quiz II: Part II
[Figure: DBN slices Xt → Xt+1 with evidence Et, Et+1 and an additional evidence variable E’t+1]
Question: E’t+1
Most Likely Path
• Given a sequence of observations, find the sequence of states that is most likely to have generated those observations
• E.g., in the E-elves example, suppose the observations are [activity, activity, no-activity, activity, activity]
• What is the most likely explanation of the user's presence at ISI over the course of the day?
• Did the user step out at time = 3?
• Or was the user present all the time, but in a meeting at time 3?
• Argmax over x1:t of P(x1:t | e1:t)
Not so simple…
• One idea: use smoothing to find the posterior distribution at each time step
• E.g., compute P(L1=@ISI | c1:5) vs P(L1=N@ISI | c1:5) and take the max
• Do the same for P(L2=@ISI | c1:5) vs P(L2=N@ISI | c1:5), take the max
• Find the "maximum" sequence this way
• Why might this be different from computing what we want (the most likely sequence)?
• Instead, compute argmax over x1:t+1 of P(x1:t+1 | e1:t+1) via the Viterbi algorithm:
maxx1:t P(x1:t, Xt+1 | e1:t+1) = Norm * P(et+1 | Xt+1) * maxxt [ P(Xt+1 | xt) * maxx1:t-1 P(x1, …, xt-1, xt | e1:t) ]
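A minimal Viterbi sketch for the running example, tracking the max-probability message and back-pointers. The 0.7/0.3 and 0.9/0.2 numbers are the lecture's; the uniform prior is an assumption, and the observation sequence [true, true, false, true, true] mirrors the E-elves activity sequence above.

```python
# Viterbi: m_{1:t+1}(X_{t+1}) = P(e_{t+1}|X_{t+1}) * max_xt [ P(X_{t+1}|x_t) * m_{1:t}(x_t) ]
# States: 0 = @ISI, 1 = N@ISI.
T = [[0.7, 0.3],
     [0.3, 0.7]]
P_c = [0.9, 0.2]

def viterbi(observations):
    # m[s] holds max over x_{1:t-1} of P(x_{1:t-1}, L_t = s, e_{1:t}); uniform prior assumed
    m = [0.5 * (P_c[s] if observations[0] else 1 - P_c[s]) for s in range(2)]
    backptrs = []
    for obs in observations[1:]:
        sensor = [P_c[s] if obs else 1 - P_c[s] for s in range(2)]
        new_m, ptrs = [], []
        for j in range(2):
            best_i = max(range(2), key=lambda i: T[i][j] * m[i])
            new_m.append(sensor[j] * T[best_i][j] * m[best_i])
            ptrs.append(best_i)
        m = new_m
        backptrs.append(ptrs)
    # Recover the most likely sequence by following back-pointers from the best final state
    state = max(range(2), key=lambda s: m[s])
    path = [state]
    for ptrs in reversed(backptrs):
        state = ptrs[state]
        path.append(state)
    return ["@ISI" if s == 0 else "N@ISI" for s in reversed(path)]

# For these (assumed) numbers the most likely explanation has the user away at time 3:
print(viterbi([True, True, False, True, True]))
```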