Lecture 25: CS573 Advanced Artificial Intelligence
Milind Tambe
Computer Science Dept and Information Sciences Institute
University of Southern California
Tambe@usc.edu
Surprise Quiz II: Part I
[Figure: small Bayesian network over nodes A, B, C with P(A) = 0.05]
Questions: Surprise
Dynamic Belief Nets
[Figure: chain of slices Xt → Xt+1 → Xt+2, each state emitting evidence Et, Et+1, Et+2]
• In each time slice:
• Xt = Unobservable (hidden) state variables
• Et = Observable evidence variables
Types of Inference
• Filtering or monitoring: P(Xt | e1, e2…et)
• Keep track of the probability distribution over the current state
• Like a POMDP belief state
• P(@ISI | c1,c2…ct) and P(N@ISI | c1,c2…ct)
• Prediction: P(Xt+k | e1,e2…et) for some k > 0
• P(@ISI 3 hours from now | c1,c2…ct)
• Smoothing or hindsight: P(Xk | e1, e2…et) for 0 <= k < t
• What was the state of the user at 11 AM, given observations at 9 AM, 10 AM, 11 AM, 1 PM, 2 PM?
• Most likely explanation: given a sequence of observations, find the sequence of states that is most likely to have generated the observations (speech recognition)
• Argmax over x1:t of P(x1:t | e1:t)
Filtering: P(Xt+1 | e1,e2…et+1)
RECURSION:
P(Xt+1 | e1:t+1) = f1:t+1 = Norm * P(et+1 | Xt+1) * Σxt [ P(Xt+1 | xt) * P(xt | e1:t) ]
• e1:t+1 = e1, e2…et+1
• P(xt | e1:t) = f1:t
• f1:t+1 = Norm-const * FORWARD (f1:t, et+1)
Computing Forward f1:t+1
• For our example of tracking user location:
• f1:t+1 = Norm-const * FORWARD (f1:t, ct+1)
• Note that f1:t+1 is a vector, not a single quantity
• f1:2 = P(L2 | c1, c2) means computing both components < P(L2 = @ISI | c1, c2), P(L2 = N@ISI | c1, c2) >, then normalizing
• Hope you tried out all the computations from the last lecture at home!
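To make the recursion concrete, here is a minimal Python sketch of the forward update for the user-location example. The 0.7/0.3 transition and 0.9/0.2 sensor numbers are the ones used in the worked examples in this lecture; the uniform prior over L0 is an assumption, chosen because it reproduces f1:1 = <0.818, 0.182>.

```python
# Forward filtering sketch for the @ISI / N@ISI example.
# States: 0 = @ISI, 1 = N@ISI.  Evidence ct = "computer activity observed".
T = [[0.7, 0.3],   # row i: P(L_{t+1} = j | L_t = i)
     [0.3, 0.7]]
P_c = [0.9, 0.2]   # P(c_t = true | L_t = @ISI), P(c_t = true | L_t = N@ISI)

def forward(f, c_observed):
    """One step of f_{1:t+1} = Norm * P(e_{t+1}|X_{t+1}) * sum_xt P(X_{t+1}|x_t) * f_{1:t}(x_t)."""
    predicted = [sum(T[i][j] * f[i] for i in range(2)) for j in range(2)]
    sensor = [P_c[j] if c_observed else 1 - P_c[j] for j in range(2)]
    unnorm = [sensor[j] * predicted[j] for j in range(2)]
    norm = sum(unnorm)
    return [u / norm for u in unnorm]

f = [0.5, 0.5]            # assumed uniform prior over L0
f = forward(f, True)      # observe c1 -> approximately [0.818, 0.182]
print(f)
f = forward(f, True)      # observe c2 -> approximately [0.883, 0.117]
print(f)
```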
Robotic Perception
[Figure: DBN with action nodes At-1, At, At+1 feeding the state nodes Xt, Xt+1, Xt+2, each emitting Et, Et+1, Et+2]
• At = action at time t (observed evidence)
• Xt = state of the environment at time t
• Et = observation at time t (observed evidence)
Robotic Perception
• Similar to the filtering task seen earlier
• Differences:
• Must take into account the action evidence:
P(Xt+1 | e1:t+1, a1:t) = Norm * P(et+1 | Xt+1) * Σxt [ P(Xt+1 | xt, at) * P(xt | e1:t, a1:t-1) ]
(compare with the POMDP belief update)
• Must note that the variables are continuous, so the sum becomes an integral:
P(Xt+1 | e1:t+1, a1:t) = Norm * P(et+1 | Xt+1) * ∫ P(Xt+1 | xt, at) * P(xt | e1:t, a1:t-1) dxt
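For the discrete case, the action-conditioned update looks like the plain filtering step with an extra argument to the transition model. The sketch below is illustrative only: the two-room state space, the "stay"/"move" actions, and all the probabilities are made-up placeholders, not part of the lecture.

```python
# Filtering with action evidence (discrete sketch):
# P(X_{t+1} | e_{1:t+1}, a_{1:t}) ∝ P(e_{t+1}|X_{t+1}) * sum_xt P(X_{t+1}|x_t, a_t) * P(x_t | e_{1:t}, a_{1:t-1})
STATES = ["room_a", "room_b"]            # placeholder state space

def trans(next_s, s, action):            # placeholder P(X_{t+1} = next_s | X_t = s, A_t = action)
    stay_prob = 0.8 if action == "stay" else 0.3
    return stay_prob if next_s == s else 1 - stay_prob

def sensor(e, s):                        # placeholder P(E_{t+1} = e | X_{t+1} = s)
    return 0.9 if e == s else 0.1

def filter_step(belief, action, evidence):
    unnorm = {}
    for ns in STATES:
        predicted = sum(trans(ns, s, action) * belief[s] for s in STATES)
        unnorm[ns] = sensor(evidence, ns) * predicted
    z = sum(unnorm.values())
    return {s: p / z for s, p in unnorm.items()}

belief = {"room_a": 0.5, "room_b": 0.5}
belief = filter_step(belief, "stay", "room_a")   # acted "stay", then sensed "room_a"
print(belief)                                    # roughly {room_a: 0.9, room_b: 0.1}
```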
Prediction
• Filtering without incorporating new evidence
• P(Xt+k | e1,e2…et) for some k > 0
• E.g., P(L3 | c1) = ΣL2 P(L3 | L2) * P(L2 | c1)    [both factors computed in the last lecture]
P(L3=@ISI | c1) = P(L3=@ISI | L2=@ISI) * P(L2=@ISI | c1) + P(L3=@ISI | L2=N@ISI) * P(L2=N@ISI | c1)
= 0.7 * 0.6272 + 0.3 * 0.3728 = 0.43904 + 0.11184 = 0.55
• P(L4=@ISI | c1) = ΣL3 P(L4=@ISI | L3) * P(L3 | c1) = 0.7 * 0.55 + 0.3 * 0.45 = 0.52
Prediction
• P(L5=@ISI | c1) = 0.7 * 0.52 + 0.3 * 0.48 = 0.508
• P(L6=@ISI | c1) ≈ 0.7 * 0.5 + 0.3 * 0.5 = 0.5 … (converging to 0.5)
• The predicted distribution of the user's location converges to a fixed point
• This is the stationary distribution of the Markov process
• Mixing time: the time taken to reach the fixed point
• Prediction is useful only if k << mixing time
• The more uncertainty there is in the transition model, the shorter the mixing time, and the harder it is to predict far into the future
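The convergence can be checked by iterating the prediction step with no new evidence, as in the minimal sketch below. It starts from P(L2 | c1) = <0.6272, 0.3728> (from the last lecture) and carries full precision, so the values differ slightly from the rounded ones on the slide.

```python
# Iterating prediction P(L_{t+k} | c1): the distribution mixes toward the
# stationary distribution <0.5, 0.5> of the Markov chain.
T = [[0.7, 0.3],
     [0.3, 0.7]]

p = [0.6272, 0.3728]          # P(L2 | c1) from the previous lecture's filtering step
for k in range(3, 9):
    p = [sum(T[i][j] * p[i] for i in range(2)) for j in range(2)]
    print(f"P(L{k} = @ISI | c1) = {p[0]:.4f}")
# Prints roughly 0.5509, 0.5204, 0.5081, 0.5033, 0.5013, 0.5005 -> mixing toward 0.5
```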
Smoothing
• P(Xk | e1, e2…et) for 0 <= k < t
• P(Lk | c1,c2…ct) = Norm * P(Lk | c1,c2..ck) * P(ck+1..ct | Lk)
= Norm * f1:k * bk+1:t
• bk+1:t is a backward message, analogous to our earlier forward message
• Hence the algorithm is called the forward-backward algorithm
bk+1:t backward message
[Figure: slices Xk → Xk+1 → Xk+2 with evidence Ek, Ek+1, Ek+2]
bk+1:t = P(ek+1:t | Xk)
= P(ek+1, ek+2 … et | Xk)
= Σxk+1 P(ek+1, ek+2 … et | Xk, xk+1) * P(xk+1 | Xk)
bk+1:t backward message
• bk+1:t = P(ek+1:t | Xk)
= Σxk+1 P(ek+1, ek+2 … et | Xk, xk+1) * P(xk+1 | Xk)
= Σxk+1 P(ek+1, ek+2 … et | xk+1) * P(xk+1 | Xk)    [ek+1:t independent of Xk given xk+1]
= Σxk+1 P(ek+1 | xk+1) * P(ek+2:t | xk+1) * P(xk+1 | Xk)
bk+1:t backward message
bk+1:t = P(ek+1:t | Xk) = Σxk+1 P(ek+1 | xk+1) * P(ek+2:t | xk+1) * P(xk+1 | Xk)
• i.e., bk+1:t = BACKWARD (bk+2:t, ek+1:t)
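A minimal Python sketch of this backward recursion for the running example (same 0.7/0.3 transition and 0.9/0.2 sensor numbers); it reproduces the <0.69, 0.41> message computed on the next slides.

```python
# Backward message: b_{k+1:t}(X_k) = sum_{x_{k+1}} P(e_{k+1}|x_{k+1}) * b_{k+2:t}(x_{k+1}) * P(x_{k+1}|X_k)
T = [[0.7, 0.3],
     [0.3, 0.7]]
P_c = [0.9, 0.2]

def backward(b_next, c_observed):
    sensor = [P_c[j] if c_observed else 1 - P_c[j] for j in range(2)]
    return [sum(sensor[j] * b_next[j] * T[i][j] for j in range(2)) for i in range(2)]

b = [1.0, 1.0]          # b_{t+1:t} is all ones (empty evidence sequence)
b = backward(b, True)   # b_{2:2} = P(c2 | L1) -> approximately [0.69, 0.41]
print(b)
```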
Example of Smoothing
• P(L1 = @ISI | c1, c2) = Norm * P(L1 | c1) * P(c2 | L1) = Norm * 0.818 * P(c2 | L1)
• Using bk+1:t = Σxk+1 P(ek+1 | xk+1) * P(ek+2:t | xk+1) * P(xk+1 | Xk):
P(c2 | L1 = @ISI) = ΣL2 P(c2 | L2) * P(c3:2 | L2) * P(L2 | L1 = @ISI)
= (0.9 * 1 * 0.7) + (0.2 * 1 * 0.3) = 0.69
Example of Smoothing
P(c2 | L1 = @ISI) = ΣL2 P(c2 | L2) * P(L2 | L1 = @ISI) = (0.9 * 0.7) + (0.2 * 0.3) = 0.69
• P(L1 = @ISI | c1, c2) = Norm * 0.818 * 0.69 = Norm * 0.56442
• Similarly P(c2 | L1 = N@ISI) = (0.9 * 0.3) + (0.2 * 0.7) = 0.41, so
P(L1 = N@ISI | c1, c2) = Norm * 0.182 * 0.41 = Norm * 0.0746
• After normalization: P(L1 = @ISI | c1, c2) = 0.883
• The smoothed estimate 0.883 > the filtered estimate P(L1 = @ISI | c1) = 0.818!
• WHY?
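Combining the two messages in code reproduces the smoothed estimate; a small sketch using the forward value f1:1 = <0.818, 0.182> from the previous lecture and the backward value b2:2 = <0.69, 0.41> computed above.

```python
# Smoothing: P(L1 | c1, c2) = Norm * f_{1:1} * b_{2:2}
f_1 = [0.818, 0.182]      # P(L1 | c1), the forward (filtered) estimate
b_2 = [0.69, 0.41]        # P(c2 | L1), the backward message

unnorm = [f * b for f, b in zip(f_1, b_2)]   # [0.564, 0.075] before normalization
z = sum(unnorm)
smoothed = [u / z for u in unnorm]
print(smoothed)           # approximately [0.883, 0.117]
```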
HMM
• Hidden Markov Models
• Speech recognition is perhaps the most popular application
• Any speech recognition researchers in class?
• Waibel and Lee
• HMMs have dominated speech recognition since the 1980s
• Under ideal isolated-word conditions, reported accuracy is around 99%
• Accuracy drops with noise and multiple speakers
• HMMs find applications everywhere: just try putting "HMM" into Google
• First we gave the Bellman update to AI (and other sciences)
• Now we make our second huge contribution to AI: the Viterbi algorithm!
HMM
• The simple structure of an HMM allows simple and elegant algorithms
• Transition model P(Xt+1 | Xt) for all values of Xt
• Represented as an |S| x |S| matrix; for our example, matrix "T"
• Tij = P(Xt = j | Xt-1 = i)
• The sensor model is also represented as a matrix, a diagonal one
• The diagonal entries give P(et | Xt = i), where et is the evidence, e.g., ct = true
• Matrix Ot
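In code, these matrices for the running example might look as follows; a minimal numpy sketch, where the 0.7/0.3 and 0.9/0.2 entries are the values used throughout the lecture and the c = false matrix is simply their complement.

```python
import numpy as np

# Transition matrix: T[i, j] = P(X_t = j | X_{t-1} = i), with states 0 = @ISI, 1 = N@ISI
T = np.array([[0.7, 0.3],
              [0.3, 0.7]])

# Sensor (observation) matrices: diagonal entries are P(e_t | X_t = i)
O_c_true = np.diag([0.9, 0.2])    # evidence c_t = true (computer activity observed)
O_c_false = np.diag([0.1, 0.8])   # evidence c_t = false (complementary probabilities)
```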
HMM
• f1:t+1 = Norm-const * FORWARD (f1:t, ct+1)
= Norm-const * P(ct+1 | Lt+1) * ΣLt P(Lt+1 | Lt) * P(Lt | c1,c2…ct)
= Norm-const * Ot+1 * T^T * f1:t
• f1:2 = P(L2 | c1, c2) = Norm-const * O2 * T^T * f1:1
= Norm-const * [0.9 0; 0 0.2] * [0.7 0.3; 0.3 0.7] * <0.818, 0.182>    (rows separated by ;)
HMM
• f1:2 = P(L2 | c1, c2) = Norm-const * O2 * T^T * f1:1
= Norm-const * [0.9 0; 0 0.2] * [0.7 0.3; 0.3 0.7] * <0.818, 0.182>
= Norm-const * [0.63 0.27; 0.06 0.14] * <0.818, 0.182>
= Norm * <(0.63*0.818 + 0.27*0.182), (0.06*0.818 + 0.14*0.182)>
= Norm * <0.564, 0.074>
• After normalization: <0.883, 0.117>
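The same matrix computation as a short numpy sketch, reproducing <0.883, 0.117>; f1:1 = <0.818, 0.182> is the filtered estimate from the previous lecture.

```python
import numpy as np

T = np.array([[0.7, 0.3],
              [0.3, 0.7]])
O2 = np.diag([0.9, 0.2])          # observation matrix for c2 = true
f_1 = np.array([0.818, 0.182])    # f_{1:1} = P(L1 | c1)

f_2 = O2 @ T.T @ f_1              # f_{1:2} = Norm-const * O_2 * T^T * f_{1:1}
f_2 = f_2 / f_2.sum()             # normalize
print(f_2)                        # approximately [0.883, 0.117]
```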
Backward in HMM
• bk+1:t = P(ek+1:t | Xk) = Σxk+1 P(ek+1 | xk+1) * P(ek+2:t | xk+1) * P(xk+1 | Xk)
= T * Ok+1 * bk+2:t
• b2:2 = < P(c2 | L1 = @ISI), P(c2 | L1 = N@ISI) > = T * O2 * b3:2
Backward
• bk+1:t = T * Ok+1 * bk+2:t
• b2:2 = T * O2 * b3:2, with b3:2 = <1, 1>
= [0.7 0.3; 0.3 0.7] * [0.9 0; 0 0.2] * <1, 1>
= <0.69, 0.41>
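The backward message in matrix form, as a numpy sketch reproducing <0.69, 0.41>; the all-ones b3:2 corresponds to the empty evidence sequence c3:2.

```python
import numpy as np

T = np.array([[0.7, 0.3],
              [0.3, 0.7]])
O2 = np.diag([0.9, 0.2])       # observation matrix for c2 = true
b_3 = np.array([1.0, 1.0])     # b_{3:2}: empty evidence sequence -> all ones

b_2 = T @ O2 @ b_3             # b_{2:2} = T * O_2 * b_{3:2}
print(b_2)                     # approximately [0.69, 0.41]
```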
Key Results for HMMs
• f1:t+1 = Norm-const * Ot+1 * T^T * f1:t
• bk+1:t = T * Ok+1 * bk+2:t
Inference in DBN
[Figure: DBN unrolled over slices Xt, Xt+1, Xt+2, Xt+3 with evidence Et, Et+1, Et+2, Et+3]
• How to do inference in a DBN in general?
• Could unroll the loop forever…
• Slices added beyond the last observation have no effect on inference
• WHY?
• So only keep slices within the observation period
Inference in DBN
[Figure: DBN slices alongside the burglary network, with Alarm as parent of JohnCalls and MaryCalls]
• Slices added beyond the last observation have no effect on inference
• WHY?
• Analogy: P(Alarm | JohnCalls) is independent of the unobserved MaryCalls node
Complexity of inference in DBN
• Keep at most two slices in memory
• Start with slice 0
• Add slice 1
• "Sum out" slice 0 (get a probability distribution over the slice 1 state; don't need to go back to slice 0 anymore – like POMDPs)
• Add slice 2, sum out slice 1…
• Constant time and space per update
• Unfortunately, the update is exponential in the number of state variables
• Need approximate inference algorithms
Solving DBNs in General
• Exact methods:
• Compute intensive
• Variable elimination from Chapter 14
• Approximate methods:
• Particle filtering has gained popularity (see the sketch below)
• Run N samples together through the slices of the DBN
• All N samples together constitute the forward message
• Highly efficient
• Hard to provide theoretical guarantees
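A minimal particle filtering sketch for the two-state location example. This is one common variant (propagate, weight by the sensor model, resample); the uniform initial sample and the specific resampling scheme are assumptions, not something specified on the slide.

```python
import random

# Particle filtering for the @ISI / N@ISI example: N particles are pushed through the
# transition model, weighted by the sensor model, and resampled; together the N particles
# approximate the forward message.
STATES = ["@ISI", "N@ISI"]
P_NEXT_ISI = {"@ISI": 0.7, "N@ISI": 0.3}   # P(L_{t+1} = @ISI | L_t)
P_C_TRUE = {"@ISI": 0.9, "N@ISI": 0.2}     # P(c_t = true | L_t)

def particle_filter_step(particles, c_observed):
    # 1. Propagate each particle through the transition model
    moved = ["@ISI" if random.random() < P_NEXT_ISI[p] else "N@ISI" for p in particles]
    # 2. Weight each particle by the sensor model
    weights = [P_C_TRUE[s] if c_observed else 1 - P_C_TRUE[s] for s in moved]
    # 3. Resample N particles in proportion to the weights
    return random.choices(moved, weights=weights, k=len(moved))

particles = [random.choice(STATES) for _ in range(5000)]   # assumed uniform prior
particles = particle_filter_step(particles, True)          # observe c1
particles = particle_filter_step(particles, True)          # observe c2
print(particles.count("@ISI") / len(particles))            # close to the exact 0.883
```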
Next Lecture • Continue with Chapter 15
Surprise Quiz II: Part II
[Figure: DBN slices Xt → Xt+1 with evidence Et, Et+1 and an additional evidence variable E’t+1]
Question: E’t+1
Most Likely Path
• Given a sequence of observations, find the sequence of states that is most likely to have generated those observations
• E.g., in the E-elves example, suppose the observations are [activity, activity, no-activity, activity, activity]
• What is the most likely explanation of the user's presence at ISI over the course of the day?
• Did the user step out at time = 3?
• Or was the user present all the time, but in a meeting at time 3?
• Argmax over x1:t of P(x1:t | e1:t)
Not so simple…
• One idea: use smoothing to find the posterior distribution at each time step
• E.g., compute P(L1=@ISI | c1:5) vs P(L1=N@ISI | c1:5) and take the max
• Do the same for P(L2=@ISI | c1:5) vs P(L2=N@ISI | c1:5), take the max
• Find the "maximum" sequence this way
• Why might this be different from computing what we want (the most likely sequence)?
• Instead, compute argmax over x1:t+1 of P(x1:t+1 | e1:t+1) via the Viterbi algorithm:
maxx1:t P(x1:t, Xt+1 | e1:t+1) = Norm * P(et+1 | Xt+1) * maxxt [ P(Xt+1 | xt) * maxx1:t-1 P(x1, …, xt-1, xt | e1:t) ]
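A minimal Viterbi sketch for the running example, tracking the max-probability message and back-pointers. The 0.7/0.3 and 0.9/0.2 numbers are the lecture's; the uniform prior is an assumption, and the observation sequence [true, true, false, true, true] mirrors the E-elves activity sequence above.

```python
# Viterbi: m_{1:t+1}(X_{t+1}) = P(e_{t+1}|X_{t+1}) * max_xt [ P(X_{t+1}|x_t) * m_{1:t}(x_t) ]
# States: 0 = @ISI, 1 = N@ISI.
T = [[0.7, 0.3],
     [0.3, 0.7]]
P_c = [0.9, 0.2]

def viterbi(observations):
    # m[s] holds max over x_{1:t-1} of P(x_{1:t-1}, L_t = s, e_{1:t}); uniform prior assumed
    m = [0.5 * (P_c[s] if observations[0] else 1 - P_c[s]) for s in range(2)]
    backptrs = []
    for obs in observations[1:]:
        sensor = [P_c[s] if obs else 1 - P_c[s] for s in range(2)]
        new_m, ptrs = [], []
        for j in range(2):
            best_i = max(range(2), key=lambda i: T[i][j] * m[i])
            new_m.append(sensor[j] * T[best_i][j] * m[best_i])
            ptrs.append(best_i)
        m = new_m
        backptrs.append(ptrs)
    # Recover the most likely sequence by following back-pointers from the best final state
    state = max(range(2), key=lambda s: m[s])
    path = [state]
    for ptrs in reversed(backptrs):
        state = ptrs[state]
        path.append(state)
    return ["@ISI" if s == 0 else "N@ISI" for s in reversed(path)]

# For these (assumed) numbers the most likely explanation has the user away at time 3:
print(viterbi([True, True, False, True, True]))
```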