
Lecture 25: CS573 Advanced Artificial Intelligence



  1. Lecture 25: CS573 Advanced Artificial Intelligence. Milind Tambe, Computer Science Dept and Information Sciences Institute, University of Southern California. Tambe@usc.edu

  2. Surprise Quiz II: Part I [quiz slide: a small Bayes net over nodes A, B, and C, with P(A) = 0.05. Questions: Surprise]

  3. Markov

  4. Markov

  5. Dynamic Belief Nets
[figure: state chain Xt → Xt+1 → Xt+2, each state with an evidence child Et, Et+1, Et+2]
• In each time slice:
• Xt = unobservable state variables
• Et = observable evidence variables

  6. Types of Inference
• Filtering or monitoring: P(Xt | e1, e2, …, et)
• Keep track of the probability distribution over the current state
• Like a POMDP belief state
• P(@ISI | c1, c2, …, ct) and P(N@ISI | c1, c2, …, ct)
• Prediction: P(Xt+k | e1, e2, …, et) for some k > 0
• P(@ISI 3 hours from now | c1, c2, …, ct)
• Smoothing or hindsight: P(Xk | e1, e2, …, et) for 0 <= k < t
• What was the state of the user at 11 AM, given observations at 9 AM, 10 AM, 11 AM, 1 PM, 2 PM?
• Most likely explanation: given a sequence of observations, find the sequence of states that is most likely to have generated the observations (speech recognition)
• argmax_{x1:t} P(x1:t | e1:t)

  7. Filtering: P(Xt+1 | e1, e2, …, et+1)
RECURSION:
f1:t+1 = P(Xt+1 | e1:t+1) = Norm * P(et+1 | Xt+1) * Σ_xt P(Xt+1 | xt) * P(xt | e1:t)
• e1:t+1 = e1, e2, …, et+1
• P(xt | e1:t) = f1:t
• f1:t+1 = Norm-const * FORWARD(f1:t, et+1)
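The recursion above can be sketched directly for the two-state user-location example (states @ISI / N@ISI, P(stay) = 0.7, P(activity | @ISI) = 0.9, P(activity | N@ISI) = 0.2, uniform prior); the helper names are illustrative, not from the lecture:

```python
# Forward (filtering) recursion on the lecture's running example.
STATES = ("@ISI", "N@ISI")
TRANS = {"@ISI": {"@ISI": 0.7, "N@ISI": 0.3},
         "N@ISI": {"@ISI": 0.3, "N@ISI": 0.7}}
SENSOR = {"@ISI": 0.9, "N@ISI": 0.2}      # P(computer activity | state)

def normalize(f):
    z = sum(f.values())
    return {s: p / z for s, p in f.items()}

def forward(f_prev, evidence):
    """f_{1:t+1} = Norm * P(e_{t+1}|X_{t+1}) * sum_xt P(X_{t+1}|xt) * f_{1:t}(xt)."""
    f = {}
    for s in STATES:
        pred = sum(TRANS[sp][s] * f_prev[sp] for sp in STATES)   # one-step prediction
        likelihood = SENSOR[s] if evidence else 1.0 - SENSOR[s]  # evidence weighting
        f[s] = likelihood * pred
    return normalize(f)

prior = {"@ISI": 0.5, "N@ISI": 0.5}       # uniform prior over L0
f1 = forward(prior, True)                 # observe c1 = activity
f2 = forward(f1, True)                    # observe c2 = activity
print(f1["@ISI"], f2["@ISI"])             # ~0.818, ~0.883 as in the lecture
```

This reproduces P(L1 = @ISI | c1) = 0.818 from the last lecture and P(L2 = @ISI | c1, c2) = 0.883 computed below.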

  8. Computing Forward f1:t+1
• For our example of tracking user location:
• f1:t+1 = Norm-const * FORWARD(f1:t, ct+1)
• Note it is a vector, not a single quantity:
• f1:2 = P(L2 | c1, c2) means computing both components < P(L2 = @ISI | c1, c2), P(L2 = N@ISI | c1, c2) >, then normalizing
Hope you tried out all the computations from the last lecture at home!

  9. Robotic Perception
[figure: DBN with action nodes At-1, At, At+1 feeding the state chain Xt → Xt+1 → Xt+2, each state with an evidence child Et, Et+1, Et+2]
• At = action at time t (observed evidence)
• Xt = state of the environment at time t
• Et = observation at time t (observed evidence)

  10. Robotic Perception
• Similar to the filtering task seen earlier
• Differences:
• Must take into account the action evidence:
P(Xt+1 | e1:t+1, a1:t) = Norm * P(et+1 | Xt+1) * Σ_xt P(Xt+1 | xt, at) * P(xt | e1:t, a1:t-1)
POMDP belief update?
• Must note that the variables are continuous, so the sum becomes an integral:
P(Xt+1 | e1:t+1, a1:t) = Norm * P(et+1 | Xt+1) * ∫ P(Xt+1 | xt, at) * P(xt | e1:t, a1:t-1) dxt

  11. Prediction
• Filtering without incorporating new evidence
• P(Xt+k | e1, e2, …, et) for some k > 0
• E.g., P(L3 | c1) = Σ_L2 P(L3 | L2) * P(L2 | c1)    (P(L2 | c1) was computed in the last lecture)
= P(L3=@ISI | L2=@ISI) * P(L2=@ISI | c1) + P(L3=@ISI | L2=N@ISI) * P(L2=N@ISI | c1)
= 0.7 * 0.6272 + 0.3 * 0.3728 = 0.43904 + 0.11184 = 0.55088 ≈ 0.55
• P(L4 | c1) = Σ_L3 P(L4 | L3) * P(L3 | c1) = 0.7 * 0.55 + 0.3 * 0.45 = 0.52

  12. Prediction
• P(L5 | c1) = 0.7 * 0.52 + 0.3 * 0.48 = 0.508
• P(L6 | c1) ≈ 0.7 * 0.5 + 0.3 * 0.5 = 0.5 … (converging to 0.5)
• The predicted distribution of the user's location converges to a fixed point
• The stationary distribution of the Markov process
• Mixing time: time taken to reach the fixed point
• Prediction is useful only if k << mixing time
• The more uncertainty there is in the transition model, the shorter the mixing time, and the more difficult it is to predict far into the future
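This convergence can be sketched by iterating the one-step prediction from the filtered value P(L1 = @ISI | c1) = 0.818 (the function name is illustrative):

```python
# k-step prediction: push the filtered distribution through the transition
# model with no new evidence; P(stay) = 0.7 as in the lecture's example.
def predict_step(p_isi, p_stay=0.7):
    # P(L_{k+1}=@ISI) = P(stay) * P(Lk=@ISI) + P(switch) * P(Lk=N@ISI)
    return p_stay * p_isi + (1.0 - p_stay) * (1.0 - p_isi)

p = 0.818                      # filtered estimate P(L1 = @ISI | c1)
for k in range(2, 7):
    p = predict_step(p)
    print(f"P(L{k}=@ISI | c1) = {p:.4f}")
# The sequence 0.6272, 0.5509, 0.5204, 0.5081, 0.5033 approaches the
# stationary distribution <0.5, 0.5> of this Markov chain.
```

With P(stay) closer to 0.5 (more transition uncertainty) the same loop reaches 0.5 in fewer steps, illustrating the shorter mixing time.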

  13. Smoothing
• P(Xk | e1, e2, …, et) for 0 <= k < t
• P(Lk | c1, c2, …, ct) = Norm * P(Lk | c1, c2, …, ck) * P(ck+1, …, ct | Lk)
= Norm * f1:k * bk+1:t
• bk+1:t is a backward message, analogous to our earlier forward message
• Hence the algorithm is called the forward-backward algorithm

  14. bk+1:t backward message
[figure: slices Xk → Xk+1 → Xk+2 with evidence children Ek, Ek+1, Ek+2]
bk+1:t = P(ek+1:t | Xk) = P(ek+1, ek+2, …, et | Xk)
= Σ_xk+1 P(ek+1, ek+2, …, et | Xk, xk+1) * P(xk+1 | Xk)

  15. bk+1:t backward message
• bk+1:t = P(ek+1:t | Xk) = P(ek+1, ek+2, …, et | Xk)
= Σ_xk+1 P(ek+1, ek+2, …, et | Xk, xk+1) * P(xk+1 | Xk)
= Σ_xk+1 P(ek+1, ek+2, …, et | xk+1) * P(xk+1 | Xk)    (conditional independence)
= Σ_xk+1 P(ek+1 | xk+1) * P(ek+2:t | xk+1) * P(xk+1 | Xk)

  16. bk+1:t backward message
bk+1:t = P(ek+1:t | Xk) = Σ_xk+1 P(ek+1 | xk+1) * P(ek+2:t | xk+1) * P(xk+1 | Xk)
• Note P(ek+2:t | xk+1) = bk+2:t, so this is a recursion:
• bk+1:t = BACKWARD(bk+2:t, ek+1)

  17. Example of Smoothing
• P(L1 = @ISI | c1, c2) = Norm * P(L1 | c1) * P(c2 | L1)
= Norm * 0.818 * P(c2 | L1)
• Using bk+1:t = Σ_xk+1 P(ek+1 | xk+1) * P(ek+2:t | xk+1) * P(xk+1 | Xk):
P(c2 | L1 = @ISI) = Σ_L2 P(c2 | L2) * P(c3:2 | L2) * P(L2 | L1)
= (0.9 * 1 * 0.7) + (0.2 * 1 * 0.3) = 0.69
(P(c3:2 | L2) = 1 because the evidence sequence beyond t = 2 is empty)

  18. Example of Smoothing
P(c2 | L1 = @ISI) = Σ_L2 P(c2 | L2) * P(L2 | L1 = @ISI) = (0.9 * 0.7) + (0.2 * 0.3) = 0.69
Similarly, P(c2 | L1 = N@ISI) = (0.9 * 0.3) + (0.2 * 0.7) = 0.41
• P(L1 = @ISI | c1, c2) = Norm * 0.818 * 0.69 = Norm * 0.56442
• P(L1 = N@ISI | c1, c2) = Norm * 0.182 * 0.41 = Norm * 0.0746
• After normalization: P(L1 = @ISI | c1, c2) = 0.883
• The smoothed estimate 0.883 is greater than the filtered estimate P(L1 = @ISI | c1) = 0.818! WHY?
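The two-slice smoothing computation above can be reproduced in a short sketch (variable names are illustrative; the model numbers are the lecture's):

```python
# Forward-backward smoothing: P(L1 | c1, c2) = Norm * f_{1:1} * b_{2:2}.
STATES = ("@ISI", "N@ISI")
TRANS = {"@ISI": {"@ISI": 0.7, "N@ISI": 0.3},
         "N@ISI": {"@ISI": 0.3, "N@ISI": 0.7}}
SENSOR = {"@ISI": 0.9, "N@ISI": 0.2}      # P(activity | state)

f11 = {"@ISI": 0.818, "N@ISI": 0.182}     # filtered P(L1 | c1) from last lecture

# b_{2:2}(L1) = sum_{L2} P(c2 | L2) * b_{3:2}(L2) * P(L2 | L1)
b32 = {"@ISI": 1.0, "N@ISI": 1.0}         # no evidence beyond t = 2
b22 = {s: sum(SENSOR[s2] * b32[s2] * TRANS[s][s2] for s2 in STATES)
       for s in STATES}
print(b22)                                 # ~{'@ISI': 0.69, 'N@ISI': 0.41}

unnorm = {s: f11[s] * b22[s] for s in STATES}
z = sum(unnorm.values())
smoothed = {s: v / z for s, v in unnorm.items()}
print(smoothed["@ISI"])                    # ~0.883
```

The smoothed estimate exceeds the filtered one because the later observation c2 lends extra support to L1 = @ISI.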

  19. HMM

  20. HMM
• Hidden Markov Models
• Speech recognition is perhaps the most popular application
• Any speech recognition researchers in class?
• Waibel and Lee
• HMMs have dominated speech recognition since the 1980s
• For ideal, isolated-word conditions, accuracies of around 99% are reported
• Accuracy drops with noise and multiple speakers
• They find applications everywhere; just try putting “HMM” into Google
• First we gave the Bellman update to AI (and other sciences)
• Now we make our second huge contribution to AI: the Viterbi algorithm!

  21. HMM
• The simple nature of HMMs allows simple and elegant algorithms
• Transition model P(Xt+1 | Xt) for all values of Xt
• Represented as an |S| × |S| matrix
• For our example: matrix “T”
• Tij = P(Xt = j | Xt-1 = i)
• The sensor model is also represented as a matrix: a diagonal matrix Ot
• Diagonal entries give P(et | Xt = i)
• et is the evidence, e.g., ct = true

  22. HMM
• f1:t+1 = Norm-const * FORWARD(f1:t, ct+1)
= Norm-const * P(ct+1 | Lt+1) * Σ_Lt P(Lt+1 | Lt) * P(Lt | c1, c2, …, ct)
= Norm-const * Ot+1 * T^T * f1:t
f1:2 = P(L2 | c1, c2) = Norm-const * O2 * T^T * f1:1
= Norm-const * [0.9 0; 0 0.2] * [0.7 0.3; 0.3 0.7]^T * <0.818, 0.182>

  23. Transpose

  24. HMM
• f1:2 = P(L2 | c1, c2) = Norm-const * O2 * T^T * f1:1
= Norm-const * [0.9 0; 0 0.2] * [0.7 0.3; 0.3 0.7]^T * <0.818, 0.182>
= Norm-const * [0.63 0.27; 0.06 0.14] * <0.818, 0.182>
= Norm-const * <(0.63 * 0.818 + 0.27 * 0.182), (0.06 * 0.818 + 0.14 * 0.182)>
= Norm-const * <0.564, 0.074>
after normalization = <0.883, 0.117>

  25. Backward in HMM
P(ek+1:t | Xk) = bk+1:t = Σ_xk+1 P(ek+1 | xk+1) * P(ek+2:t | xk+1) * P(xk+1 | Xk)
= T * Ok+1 * bk+2:t
P(c2 | L1 = @ISI) is the first component of b2:2 = T * O2 * b3:2

  26. Backward
• bk+1:t = T * Ok+1 * bk+2:t
• b2:2 = T * O2 * b3:2, with b3:2 = <1, 1>
= [0.7 0.3; 0.3 0.7] * [0.9 0; 0 0.2] * <1, 1>
= <0.69, 0.41>

  27. Key Results for HMMs
• f1:t+1 = Norm-const * Ot+1 * T^T * f1:t
• bk+1:t = T * Ok+1 * bk+2:t
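These two matrix recursions can be checked with a small sketch using plain lists as matrices (helper names are illustrative):

```python
# Matrix form of the HMM recursions on the lecture's example:
#   f_{1:t+1} = Norm * O_{t+1} * T^T * f_{1:t}
#   b_{k+1:t} = T * O_{k+1} * b_{k+2:t}
def mat_vec(m, v):
    return [sum(m[i][j] * v[j] for j in range(len(v))) for i in range(len(m))]

def transpose(m):
    return [list(row) for row in zip(*m)]

T = [[0.7, 0.3],          # T[i][j] = P(X_t = j | X_{t-1} = i)
     [0.3, 0.7]]
O = [[0.9, 0.0],          # diagonal sensor matrix for c = activity
     [0.0, 0.2]]

# forward update: f_{1:2} = Norm * O2 * T^T * f_{1:1}
f11 = [0.818, 0.182]
f = mat_vec(O, mat_vec(transpose(T), f11))
z = sum(f)
f12 = [x / z for x in f]
print(f12)                # ~[0.883, 0.117]

# backward update: b_{2:2} = T * O2 * b_{3:2}, with b_{3:2} = <1, 1>
b32 = [1.0, 1.0]
b22 = mat_vec(T, mat_vec(O, b32))
print(b22)                # ~[0.69, 0.41]
```

Here T happens to be symmetric, so T^T = T; the transpose is kept to match the general formula.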

  28. Inference in DBN
[figure: a DBN unrolled over several slices Xt, Xt+1, Xt+2, Xt+3 with evidence nodes Et, Et+1, Et+2, Et+3]
• How do we do inference in a DBN in general?
• Could unroll the network forever…
• Slices added beyond the last observation have no effect on inference
• WHY?
• So only keep slices within the observation period

  29. Inference in DBN
[figure: the same unrolled DBN, annotated with the burglary-network analogy: Alarm with children JohnCalls and MaryCalls]
• Slices added beyond the last observation have no effect on inference
• WHY?
• P(Alarm | JohnCalls) does not depend on the unobserved MaryCalls

  30. Complexity of inference in DBN
• Keep at most two slices in memory:
• Start with slice 0
• Add slice 1
• “Sum out” slice 0 (get a probability distribution over the slice-1 state; we don't need to go back to slice 0 anymore – like POMDPs)
• Add slice 2, sum out slice 1, …
• Constant time and space per update
• Unfortunately, the update is exponential in the number of state variables
• Need approximate inference algorithms

  31. Solving DBNs in General
• Exact methods:
• Compute-intensive
• Variable elimination from Chapter 14
• Approximate methods:
• Particle filtering is popular
• Run N samples together through the slices of the DBN
• All N samples together constitute the forward message
• Highly efficient
• Hard to provide theoretical guarantees
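A minimal particle-filtering sketch for the same two-state model, assuming the standard propagate-weight-resample cycle (an illustration, not the book's exact pseudocode):

```python
# Particle filtering: approximate the forward message with N samples.
import random

TRANS = {"@ISI": 0.7, "N@ISI": 0.3}       # P(next = @ISI | current state)
SENSOR = {"@ISI": 0.9, "N@ISI": 0.2}      # P(activity | state)

def particle_filter_step(particles, evidence, rng):
    # 1. propagate each sample through the transition model
    moved = ["@ISI" if rng.random() < TRANS[p] else "N@ISI" for p in particles]
    # 2. weight each sample by the evidence likelihood
    weights = [SENSOR[p] if evidence else 1.0 - SENSOR[p] for p in moved]
    # 3. resample N particles in proportion to the weights
    return rng.choices(moved, weights=weights, k=len(particles))

rng = random.Random(0)
particles = [rng.choice(["@ISI", "N@ISI"]) for _ in range(10000)]
for _ in range(2):                         # observe activity at t = 1 and t = 2
    particles = particle_filter_step(particles, True, rng)
estimate = particles.count("@ISI") / len(particles)
print(estimate)                            # close to the exact filtered 0.883
```

With 10,000 particles the estimate lands near the exact answer; the sampling error shrinks as N grows, at the cost of more work per slice.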

  32. Next Lecture • Continue with Chapter 15

  33. Student Evaluations

  34. Surprise Quiz II: Part II [quiz slide: a two-slice DBN with states Xt, Xt+1 and evidence Et, Et+1; the question concerns an additional evidence node E’t+1]

  35. Most Likely Path
• Given a sequence of observations, find the sequence of states that most likely generated those observations
• E.g., in the E-elves example, suppose we observe [activity, activity, no-activity, activity, activity]
• What is the most likely explanation of the presence of the user at ISI over the course of the day?
• Did the user step out at time = 3?
• Or was the user present all the time, but in a meeting at time 3?
• argmax_{x1:t} P(x1:t | e1:t)

  36. Not so simple…
• Could we use smoothing to find the posterior distribution at each time step?
• E.g., compute P(L1=@ISI | c1:5) vs. P(L1=N@ISI | c1:5) and take the max
• Do the same for P(L2=@ISI | c1:5) vs. P(L2=N@ISI | c1:5), take the max
• Build up a "maximum" sequence this way
• Why might this be different from computing what we want (the most likely sequence)?
• Smoothing gives the most likely state at each step individually; the sequence of individually most likely states need not be the jointly most likely sequence
• Instead compute max_{x1:t} P(x1:t, Xt+1 | e1:t+1) via the Viterbi algorithm:
= Norm * P(et+1 | Xt+1) * max_xt ( P(Xt+1 | xt) * max_{x1…xt-1} P(x1, …, xt-1, xt | e1:t) )
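The Viterbi recursion above can be sketched for this query; with the lecture's model and the observation sequence [activity, activity, no-activity, activity, activity] it picks out the "user stepped out at time 3" explanation (function and variable names are illustrative):

```python
# Viterbi: most likely state sequence given the observations.
STATES = ("@ISI", "N@ISI")
TRANS = {"@ISI": {"@ISI": 0.7, "N@ISI": 0.3},
         "N@ISI": {"@ISI": 0.3, "N@ISI": 0.7}}
SENSOR = {"@ISI": 0.9, "N@ISI": 0.2}      # P(activity | state)
PRIOR = {"@ISI": 0.5, "N@ISI": 0.5}

def viterbi(observations):
    def lik(s, e):
        return SENSOR[s] if e else 1.0 - SENSOR[s]
    # m[s] = probability of the best path ending in state s
    m = {s: PRIOR[s] * lik(s, observations[0]) for s in STATES}
    back = []                              # back-pointers, one dict per step
    for e in observations[1:]:
        best = {s: max(STATES, key=lambda sp: m[sp] * TRANS[sp][s])
                for s in STATES}
        m = {s: lik(s, e) * m[best[s]] * TRANS[best[s]][s] for s in STATES}
        back.append(best)
    # follow the back-pointers from the best final state
    last = max(STATES, key=lambda s: m[s])
    path = [last]
    for best in reversed(back):
        path.append(best[path[-1]])
    return list(reversed(path))

print(viterbi([True, True, False, True, True]))
# -> ['@ISI', '@ISI', 'N@ISI', '@ISI', '@ISI']
```

Note the max over predecessors replaces the sum in the filtering recursion; everything else (propagate, then weight by the evidence) is the same shape.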
