ITCS 3153 Artificial Intelligence Lecture 24 Statistical Learning Chapter 20
AI: Creating rational agents • The pursuit of autonomous, rational agents • It’s all about search • Varying amounts of model information • tree searching (informed/uninformed) • simulated annealing • value/policy iteration • Searching for an explanation of observations • Used to develop a model
Searching for an explanation of observations • If I can explain observations… • can I predict the future? • Can I explain why ten coin tosses came up 6 heads and 4 tails? • Can I predict the 11th coin toss?
Running example: Candy • Surprise Candy • Comes in two flavors • cherry (yum) • lime (yuk) • All candy is wrapped in the same opaque wrapper • Candy is packaged in large bags, each containing one of five possible allocations of cherry and lime
Statistics • Given a bag of candy, what distribution of flavors will it have? • Let H be the random variable corresponding to your hypothesis • H1 = all cherry, H2 = all lime, H3 = 50/50 cherry/lime • As you open pieces of candy, let each observation of data: D1, D2, D3, … be either cherry or lime • D1 = cherry, D2 = cherry, D3 = lime, … • Predict the flavor of the next piece of candy • If the data caused you to believe H1 was correct, you’d pick cherry
Bayesian Learning • Use available data to calculate the probability of each hypothesis and make a prediction • Because each hypothesis has an independent likelihood, we use all their relative likelihoods when making a prediction • Probabilistic inference using Bayes’ rule: • P(hi | d) = αP(d | hi) P(hi) • The probability of hypothesis hi being active given you observed sequence d equals the likelihood of seeing data sequence d generated by hypothesis hi, multiplied by the hypothesis prior P(hi); α is a normalizing constant
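A minimal Python sketch of this update (not from the lecture; the hypothesis names h1–h5, their cherry proportions, and the priors follow the example slides below, while the function names are my own):

```python
# Bayes' rule over candy-bag hypotheses: P(h | d) = alpha * P(d | h) * P(h)

cherry_prob = {"h1": 1.0, "h2": 0.75, "h3": 0.5, "h4": 0.25, "h5": 0.0}  # P(cherry | h)
prior       = {"h1": 0.1, "h2": 0.2,  "h3": 0.4, "h4": 0.2,  "h5": 0.1}  # P(h)

def likelihood(h, observations):
    """P(d | h): product of per-candy probabilities (i.i.d., see below)."""
    p = 1.0
    for candy in observations:
        p *= cherry_prob[h] if candy == "cherry" else 1.0 - cherry_prob[h]
    return p

def posterior(observations):
    """P(h | d) for every hypothesis; alpha normalizes the sum to 1."""
    unnorm = {h: likelihood(h, observations) * prior[h] for h in prior}
    alpha = 1.0 / sum(unnorm.values())
    return {h: alpha * p for h, p in unnorm.items()}

print(posterior(["lime", "lime", "lime"]))  # mass shifts toward h4 and h5
```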
Prediction of an unknown quantity X • P(X | d) = Σi P(X | hi) P(hi | d) • The probability of X given the observations d is a weighted sum of each hypothesis’s prediction of X, weighted by how probable that hypothesis is given d • Even if a hypothesis strongly predicts X, its vote is discounted when the hypothesis itself is unlikely to be true given the observation of d
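A sketch of that weighted sum (the posterior numbers here are made up for illustration; cherry_prob matches the example hypotheses):

```python
# Bayesian prediction: P(X | d) = sum_i P(X | h_i) * P(h_i | d)
# Every hypothesis votes on X, weighted by its posterior probability.

cherry_prob = {"h1": 1.0, "h2": 0.75, "h3": 0.5, "h4": 0.25, "h5": 0.0}

def predict_next_is_lime(posterior):
    """posterior: dict mapping hypothesis -> P(h | d)."""
    return sum((1.0 - cherry_prob[h]) * p for h, p in posterior.items())

# Illustrative posterior after a few lime candies (made-up numbers):
post = {"h1": 0.0, "h2": 0.1, "h3": 0.4, "h4": 0.3, "h5": 0.2}
print(predict_next_is_lime(post))  # 0.65
```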
Details of Bayes’ rule • All observations within d are • independent • identically distributed • Under this i.i.d. assumption, the probability of a hypothesis explaining a series of observations d • is the product of explaining each component: P(d | hi) = Πj P(dj | hi)
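The factorization in one line of Python (the example numbers assume h2, the 75% cherry bag, explaining the sequence cherry, cherry, lime):

```python
from math import prod  # Python 3.8+

# i.i.d. assumption: P(d | h) = product over j of P(d_j | h)
per_observation = [0.75, 0.75, 0.25]  # h2 explaining [cherry, cherry, lime]
print(prod(per_observation))          # 0.140625
```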
Example • Prior distribution across hypotheses • h1 = 100% cherry = 0.1 • h2 = 75/25 cherry/lime = 0.2 • h3 = 50/50 cherry/lime = 0.4 • h4 = 25/75 cherry/lime = 0.2 • h5 = 100% lime = 0.1 • Prediction: the probability that h3 generates 10 lime candies in a row • P(d|h3) = (0.5)^10
Example • The probability of each hypothesis starts at its prior value <.1, .2, .4, .2, .1> • Probability of hypothesis h3 as 10 lime candies are observed • P(d|h3) * P(h3) = (0.5)^10 * (0.4)
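A short sketch that traces these posteriors as lime candies accumulate (structure mine; the cherry proportions and priors are the slide’s):

```python
# Posterior of each hypothesis after n all-lime observations.
# For h3 the unnormalized value is (0.5)^n * 0.4, as above.

cherry_prob = [1.0, 0.75, 0.5, 0.25, 0.0]   # h1..h5
prior       = [0.1, 0.2,  0.4, 0.2,  0.1]

for n in range(11):
    unnorm = [(1.0 - q) ** n * p for q, p in zip(cherry_prob, prior)]
    alpha = 1.0 / sum(unnorm)
    print(n, [round(alpha * u, 3) for u in unnorm])
# h5 (all lime) overtakes the other hypotheses as n grows.
```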
Prediction of the 11th candy • If we’ve observed 10 lime candies, is the 11th lime? • Build a weighted sum of each hypothesis’s prediction • The weighted sum can become expensive to compute • Instead, use the most probable hypothesis and ignore the others • MAP: the maximum a posteriori hypothesis given the observations
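A sketch contrasting MAP with the full weighted sum (function names are mine):

```python
# MAP prediction: commit to h_MAP = argmax_h P(h | d) and use
# P(X | h_MAP) alone, instead of summing over all hypotheses.

cherry_prob = {"h1": 1.0, "h2": 0.75, "h3": 0.5, "h4": 0.25, "h5": 0.0}
prior       = {"h1": 0.1, "h2": 0.2,  "h3": 0.4, "h4": 0.2,  "h5": 0.1}

def map_predict_lime(n_limes):
    """P(next is lime) using only the MAP hypothesis after n limes."""
    unnorm = {h: (1.0 - cherry_prob[h]) ** n_limes * prior[h] for h in prior}
    h_map = max(unnorm, key=unnorm.get)   # argmax needs no normalization
    return h_map, 1.0 - cherry_prob[h_map]

print(map_predict_lime(3))  # ('h5', 1.0): after 3 limes h5 already dominates
```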
Overfitting • Remember overfitting from NN discussion? • The number of hypotheses influences predictions • Too many hypotheses can lead to overfitting
Overfitting Example • Say we’ve observed 3 cherry and 7 lime candies • Consider our 5 hypotheses from before • the prediction is a weighted average over the 5 • Now consider having 11 hypotheses, one for each possible cherry proportion 0/10, 1/10, …, 10/10 • Under MAP, the 3/10-cherry hypothesis gets weight 1 and all the others get 0, so the model fits the observed sample exactly (see the sketch below)
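A tiny sketch of the effect (assuming a uniform prior over the 11 hypotheses, which the slide does not state):

```python
# 11 hypotheses: cherry fraction 0/10, 1/10, ..., 10/10.
thetas = [i / 10 for i in range(11)]

def likelihood(theta, n_cherry=3, n_lime=7):
    return theta ** n_cherry * (1 - theta) ** n_lime

# With a uniform prior, MAP reduces to maximum likelihood:
print(max(thetas, key=likelihood))  # 0.3 -- fits the 3/7 sample exactly
```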
Learning with Data • First we talk about parameter learning • Let’s create a hypothesis hθ for candies that says the probability a cherry is drawn is θ • If we unwrap N candies and c are cherry, what is θ? • The likelihood is P(d | hθ) = θ^c (1−θ)^(N−c) • The log-likelihood is L = c log θ + (N−c) log(1−θ)
Learning with Data • We want the θ that maximizes the log-likelihood • differentiate L with respect to θ and set it to 0: dL/dθ = c/θ − (N−c)/(1−θ) = 0, giving θ = c/N • In general this maximization may not have a closed-form solution, and iterative numerical methods may be needed
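A sketch of both routes (the closed form follows from the derivative above; the grid search stands in for the iterative methods mentioned):

```python
import math

# L(theta) = c*log(theta) + (N - c)*log(1 - theta)
# dL/dtheta = c/theta - (N - c)/(1 - theta) = 0  =>  theta = c/N
def log_likelihood(theta, c, N):
    return c * math.log(theta) + (N - c) * math.log(1 - theta)

c, N = 3, 10
print(c / N)  # closed-form ML estimate: 0.3

# Numeric stand-in for when no closed form exists:
grid = [i / 1000 for i in range(1, 1000)]  # avoid theta = 0 and 1
print(max(grid, key=lambda t: log_likelihood(t, c, N)))  # 0.3
```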