Machine Learning and Review Reading: Chapter 18
Bayesian Approach • Each observed training example can incrementally decrease or increase the probability of a hypothesis, rather than eliminating the hypothesis outright • Prior knowledge can be combined with observed data to determine a hypothesis • Bayesian methods can accommodate hypotheses that make probabilistic predictions • New instances can be classified by combining the predictions of multiple hypotheses, weighted by their probabilities
Applying Bayes Theorem • Best hypothesis = most probable hypothesis • Maximum a posteriori (MAP) hypothesis • Variables • h = hypothesis • D = data • Prior probability of hypothesis h: P(h) • Prior probability that training data D is observed: P(D) • P(D|h) = probability of observing data D in a world where hypothesis h holds • Bayes theorem: P(h|D) = P(D|h) P(h) / P(D)
Defining the MAP hypothesis • h_MAP = argmax_{h ∈ H} P(h|D) • h_MAP = argmax_{h ∈ H} P(D|h) P(h) / P(D) (using Bayes theorem) • h_MAP = argmax_{h ∈ H} P(D|h) P(h) (P(D) is a constant independent of h) • h_MAP = argmax_{h ∈ H} P(D|h) (when we can assume each hypothesis h is equally probable; this is the maximum likelihood hypothesis)
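A minimal sketch of the MAP computation in Python; the hypotheses, priors, and likelihoods below are made-up numbers for illustration:

```python
# Illustrative priors P(h) and likelihoods P(D|h) for three hypotheses.
priors = {"h1": 0.5, "h2": 0.3, "h3": 0.2}        # P(h)
likelihoods = {"h1": 0.1, "h2": 0.4, "h3": 0.3}   # P(D|h)

# h_MAP = argmax_h P(D|h) * P(h); P(D) is dropped, being constant in h.
h_map = max(priors, key=lambda h: likelihoods[h] * priors[h])
print(h_map)  # -> "h2" (0.4 * 0.3 = 0.12 beats 0.05 and 0.06)
```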
Bayes Optimal Classifier • The most probable classification of a new instance is obtained by combining the predictions of all hypotheses, weighted by their posterior probabilities • Possible classifications: vj ∈ V • argmax_{vj ∈ V} Σ_{hi ∈ H} P(vj|hi) P(hi|D)
Example • V = {p, n} • P(h1|D) = .4, P(p|h1) = 0, P(n|h1) = 1 • P(h2|D) = .3, P(p|h2) = 1, P(n|h2) = 0 • P(h3|D) = .3, P(p|h3) = 1, P(n|h3) = 0 • Σ_{hi ∈ H} P(n|hi) P(hi|D) = .4 • Σ_{hi ∈ H} P(p|hi) P(hi|D) = .6 • argmax_{vj ∈ {p,n}} Σ_{hi ∈ H} P(vj|hi) P(hi|D) = p
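The example translates directly into code; note that the single most probable hypothesis (h1) predicts n, yet the weighted vote over all hypotheses classifies the new instance as p. A sketch:

```python
# Posteriors P(hi|D) and per-hypothesis predictions P(vj|hi) from the example.
posteriors = {"h1": 0.4, "h2": 0.3, "h3": 0.3}
predictions = {"h1": {"p": 0.0, "n": 1.0},
               "h2": {"p": 1.0, "n": 0.0},
               "h3": {"p": 1.0, "n": 0.0}}

def bayes_optimal(classes):
    # argmax over vj of sum_i P(vj|hi) * P(hi|D)
    return max(classes,
               key=lambda v: sum(predictions[h][v] * posteriors[h]
                                 for h in posteriors))

print(bayes_optimal(["p", "n"]))  # -> "p" (0.6 vs 0.4)
```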
Properties of Bayesian Approach • Bayesian learning is optimal (the Bayes optimal classifier maximizes the probability of correct classification) • Easy to estimate P(h) by counting in the training data • Estimating P(D|h) is not feasible • Why? (D is a conjunction of many attribute values, and we would need to see every combination many times to estimate its probability reliably)
Naïve Bayes • Assume independence of the attributes given the class • D = a1, a2, …, an • P(a1, a2, …, an | vj) = ∏_i P(ai|vj) • Substitute into the v_MAP formula • v_NB = argmax_{vj ∈ V} P(vj) ∏_i P(ai|vj)
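A minimal sketch of naive Bayes training and classification over discrete attributes; the data layout and function names are illustrative, not from the slides:

```python
from collections import Counter, defaultdict

# examples: a list of (attribute_tuple, class_label) pairs.
def train(examples):
    class_counts = Counter(c for _, c in examples)
    attr_counts = defaultdict(Counter)   # (position, class) -> value counts
    for attrs, c in examples:
        for i, a in enumerate(attrs):
            attr_counts[(i, c)][a] += 1
    n = len(examples)

    def classify(attrs):
        def score(c):
            p = class_counts[c] / n                    # P(vj)
            for i, a in enumerate(attrs):              # prod_i P(ai|vj)
                # An unseen value yields probability 0 here; the m-estimate
                # on the next slides addresses exactly this problem.
                p *= attr_counts[(i, c)][a] / class_counts[c]
            return p
        return max(class_counts, key=score)

    return classify

classify = train([(("rain", "cool"), "yes"), (("sunny", "hot"), "no")])
print(classify(("rain", "cool")))  # -> "yes"
```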
Estimating Probabilities • What happens when the number of training examples is small? • Suppose the true P(S-length=high | virginica) = .05 • There are only 2 instances with C = virginica • We estimate the probability by n_c/n, i.e., #(S-length=high and virginica) / #(virginica) • With only 2 virginica instances, the count n_c will almost certainly be 0 • Then, instead of .05, we use an estimated probability of 0 • Two problems • A biased underestimate of the probability • This zero term will dominate the naive Bayes product
Instead • Use priors as well • (n_c + m·p) / (n + m) • where p = the prior estimate of the probability • m is a constant called the equivalent sample size • m determines how heavily to weight p relative to the observed data • Typical method: assume a uniform prior (p = 1/k for an attribute with k possible values)
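The m-estimate as a function; the virginica case below assumes a binary attribute, so the uniform prior is p = 1/2 (m = 1 is an arbitrary illustrative choice):

```python
def m_estimate(n_c, n, p, m):
    # n_c: examples with both the attribute value and the class
    # n:   examples with the class
    # p:   prior estimate of the probability (uniform: 1/k for k values)
    # m:   equivalent sample size, weighting p against the observed data
    return (n_c + m * p) / (n + m)

# With n_c = 0 and n = 2 (the virginica case), the prior keeps the
# estimate away from the fatal zero:
print(m_estimate(0, 2, p=0.5, m=1))  # ~0.167 rather than 0.0
```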
Benefits of Naïve Bayes • Practical • As effective as, and in some cases more effective than, other machine learning methods
Review for Midterm • Concepts you should know • Search algorithms • Depth-first, breadth-first, iterative deepening, A*, greedy, hill-climbing, beam • Constraint propagation • Game playing • Bayesian Nets • A little on machine learning
Midterm format • Multiple choice • Short answer questions • Problem solving • Essay • An example midterm will be posted under links
Concepts • Any words in yellow, light blue, or pink on the slides
Uninformed Search • Depth-first • Breadth-first • Iterative Deepening
Formulating Problems as Search Given an initial state and a goal, find the sequence of actions leading through a sequence of states to the final goal state. Terms: • Successor function: given a state, returns the set of {action, successor state} pairs • State space: the set of all states reachable from the initial state • Path: a sequence of states connected by actions • Goal test: is a given state the goal state? • Path cost: a function assigning a numeric cost to each path • Solution: a path from the initial state to a goal state
Breadth first • OPEN = start node; CLOSED = empty • While OPEN is not empty do • Remove leftmost state from OPEN, call it X • If X = goal state, return success • Put X on CLOSED • SUCCESSORS = Successor function (X) • Remove any successors on OPEN or CLOSED • Put remaining successors on right end of OPEN • End while
Depth-first • OPEN = start node; CLOSED = empty • While OPEN is not empty do • Remove leftmost state from OPEN, call it X • If X = goal state, return success • Put X on CLOSED • SUCCESSORS = Successor function (X) • Remove any successors on OPEN or CLOSED • Put remaining successors on left end of OPEN • End while
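Since the two procedures differ only in which end of OPEN receives the successors, one Python sketch covers both; it assumes states are hashable and the graph is given as an adjacency dict:

```python
from collections import deque

def search(graph, start, goal, depth_first=False):
    open_list, closed = deque([start]), set()
    while open_list:
        x = open_list.popleft()                 # remove leftmost state
        if x == goal:
            return True
        closed.add(x)
        successors = [s for s in graph.get(x, [])
                      if s not in closed and s not in open_list]
        if depth_first:
            open_list.extendleft(reversed(successors))  # left end of OPEN
        else:
            open_list.extend(successors)                # right end of OPEN
    return False
```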
Can we combine benefits of both? • Depth-limited search • Select some depth limit and explore the problem using DFS down to that limit • How do we select the limit? • Iterative deepening • Run DFS with depth limit 1, then depth limit 2, and so on, up to depth d
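A sketch of iterative deepening; it assumes a tree-structured space (no repeated-state checking) and an arbitrary max_depth cutoff:

```python
def depth_limited(graph, state, goal, limit):
    if state == goal:
        return True
    if limit == 0:
        return False
    return any(depth_limited(graph, s, goal, limit - 1)
               for s in graph.get(state, []))

def iterative_deepening(graph, start, goal, max_depth=50):
    # DFS with depth 1, then depth 2, ... up to max_depth
    return any(depth_limited(graph, start, goal, d)
               for d in range(1, max_depth + 1))
```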
Complexity Analysis • Completeness: is the algorithm guaranteed to find a solution when there is one? • Optimality: Does the strategy find the optimal solution? • Time: How long does it take to find a solution? • Space: How much memory is needed to perform the search? Is this notion of completeness the same as completeness in logic?
Cost variables • Time: number of nodes generated • Space: maximum number of nodes stored in memory • Branching factor: b • Maximum number of successors of any node • Depth: d • Depth of shallowest goal node • Path length: m • Maximum length of any path in the state space
Informed Search • Best-first • A* • Greedy • Hill climbing • Variants: randomness, simulated annealing, local beam search • Online search will not be on the midterm
Greedy Search • OPEN = start node; CLOSED = empty • While OPEN is not empty do • Remove leftmost state from OPEN, call it X • If X = goal state, return success • Put X on CLOSED • SUCCESSORS = Successor function (X) • Remove any successors on OPEN or CLOSED • Compute heuristic function for each node • Put remaining successors on either end of OPEN • Sort nodes on OPEN by value of heuristic function • End while
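Rather than re-sorting OPEN on each iteration, an implementation would typically keep OPEN as a priority queue ordered by the heuristic; a sketch, assuming `h` maps states to heuristic values and states are comparable (e.g., strings):

```python
import heapq

def greedy_search(graph, start, goal, h):
    open_heap, closed = [(h[start], start)], set()
    while open_heap:
        _, x = heapq.heappop(open_heap)   # state with the best h value
        if x == goal:
            return True
        if x in closed:
            continue
        closed.add(x)
        for s in graph.get(x, []):
            if s not in closed:
                heapq.heappush(open_heap, (h[s], s))
    return False
```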
A* Search • Try to expand the node on the least-cost path to the goal • Evaluation function f(n) = g(n) + h(n) • g(n) is the cost from the initial state to node n • h(n) is the heuristic function: estimated cost from n to the goal • f(n) is the estimated cost of the cheapest solution that passes through n • If h(n) is an underestimate of the true cost to the goal: • A* is complete • A* is optimal • A* is optimally efficient: no other algorithm using h(n) is guaranteed to expand fewer states
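A sketch of A* under the same assumptions, with `graph` mapping each state to a dict of {successor: step cost}:

```python
import heapq, itertools

def a_star(graph, start, goal, h):
    tie = itertools.count()                      # tie-breaker for equal f
    open_heap = [(h[start], next(tie), 0, start, [start])]
    best_g = {start: 0}
    while open_heap:
        f, _, g, x, path = heapq.heappop(open_heap)
        if x == goal:
            return path, g
        for s, cost in graph.get(x, {}).items():
            g2 = g + cost                        # g(n') = g(n) + step cost
            if g2 < best_g.get(s, float("inf")):
                best_g[s] = g2
                heapq.heappush(open_heap,
                               (g2 + h[s], next(tie), g2, s, path + [s]))
    return None
```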
Admissible heuristics • A heuristic that never overestimates the cost to the goal • h1 and h2 are admissible heuristics • Consistency: the estimated cost of reaching the goal from n is no greater than the step cost of getting to n′ plus the estimated cost to the goal from n′ • h(n) ≤ c(n, a, n′) + h(n′)
Local Search Algorithms • Operate using a single current state • Move only to neighbors of the state • Paths followed by search are not retained • Iterative improvement • Keep a single current state and try to improve it
Problems for hill climbing When higher heuristic values are better, we seek maxima (objective functions); when lower values are better, we seek minima (cost functions) • Local maxima: a local maximum is a peak that is higher than each of its neighboring states but lower than the global maximum • Ridges: a sequence of local maxima • Plateaux: an area of the state-space landscape where the evaluation function is flat
Some solutions • Stochastic hill climbing • Choose at random from among the uphill moves • First-choice hill climbing • Generates successors randomly until one is generated that is better than the current state • Random-restart hill climbing • Keep restarting from randomly generated initial states, stopping when the goal is found • Simulated annealing • Generate a random move; accept it if it is an improvement, otherwise accept it with a continually decreasing probability • Local beam search • Keep track of k states rather than just 1
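A sketch of simulated annealing; `value` (higher is better), `random_neighbor`, and the cooling parameters are all illustrative assumptions:

```python
import math, random

def simulated_annealing(state, value, random_neighbor,
                        t0=1.0, cooling=0.995, steps=10000):
    t = t0
    for _ in range(steps):
        nxt = random_neighbor(state)
        delta = value(nxt) - value(state)
        # Accept improvements outright; accept worsening moves with a
        # probability e^(delta/T) that shrinks as T cools.
        if delta > 0 or random.random() < math.exp(delta / t):
            state = nxt
        t *= cooling
    return state
```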
CSP algorithm Depth-first search often used • Initial state: the empty assignment {}; all variables are unassigned • Successor fn: assign a value to any variable, provided no conflicts w/constraints • All CSP search algorithms generate successors by considering possible assignments for only a single variable at each node in the search tree • Goal test: the current assignment is complete • Path cost: a constant cost for every step
Local search • Complete-state formulation • Every state is a complete assignment that might or might not satisfy the constraints • Hill-climbing methods are appropriate
General purpose methods for efficient implementation • Which variable should be assigned next? • In what order should its values be tried? • Can we detect inevitable failure early? • Can we take advantage of problem structure?
Order • Choose the most constrained variable first • The variable with the fewest remaining values • Minimum Remaining Values (MRV) heuristic • What if there are >1? • Tie breaker: Most constraining variable • Choose the variable with the most constraints on remaining variables
Order on value choice • Given a variable, choose the least constraining value • The value that rules out the fewest values in the remaining variables
Forward Checking • Keep track of remaining legal values for unassigned variables • Terminate search when any variable has no legal values
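These pieces combine into a backtracking sketch with MRV ordering and forward checking; `consistent(var, val, assignment)` stands in for the problem's constraint test and is an assumption here:

```python
def backtrack(assignment, domains, consistent):
    if len(assignment) == len(domains):
        return assignment                    # goal test: assignment complete
    # MRV: the unassigned variable with the fewest remaining values
    var = min((v for v in domains if v not in assignment),
              key=lambda v: len(domains[v]))
    for val in domains[var]:
        if not consistent(var, val, assignment):
            continue
        assignment[var] = val
        # Forward checking: drop now-illegal values from unassigned domains
        pruned = {v: (domains[v] if v in assignment else
                      [x for x in domains[v] if consistent(v, x, assignment)])
                  for v in domains}
        # Terminate this branch early if any variable has no legal values
        if all(pruned[v] for v in pruned if v not in assignment):
            result = backtrack(assignment, pruned, consistent)
            if result is not None:
                return result
        del assignment[var]
    return None
```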
Game Playing • Minimax • Alpha-beta pruning • Evaluation function (what is the difference between a cost function, a utility function, a heuristic function, and an evaluation function?)
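A sketch of minimax with alpha-beta pruning; `successors` and `evaluate` are assumed problem-specific functions (the evaluation function scores states at the depth cutoff):

```python
def alphabeta(state, depth, alpha, beta, maximizing, successors, evaluate):
    children = list(successors(state))
    if depth == 0 or not children:
        return evaluate(state)
    if maximizing:
        value = float("-inf")
        for child in children:
            value = max(value, alphabeta(child, depth - 1, alpha, beta,
                                         False, successors, evaluate))
            alpha = max(alpha, value)
            if alpha >= beta:
                break                 # prune: MIN will never allow this line
        return value
    value = float("inf")
    for child in children:
        value = min(value, alphabeta(child, depth - 1, alpha, beta,
                                     True, successors, evaluate))
        beta = min(beta, value)
        if alpha >= beta:
            break                     # prune: MAX already has a better option
    return value
```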
Bayesian nets • Example problem