
Statistical Learning Methods


Presentation Transcript


  1. Statistical Learning Methods Russell and Norvig: Chapter 20 (20.1, 20.2, 20.4, 20.5) CMSC 421 – Fall 2006

  2. Statistical Approaches • Statistical Learning (20.1) • Naïve Bayes (20.2) • Instance-based Learning (20.4) • Neural Networks (20.5)

  3. Statistical Learning (20.1)

  4. Example: Candy Bags • Candy comes in two flavors: cherry and lime • Candy is wrapped, so you can’t tell the flavor until it is opened • There are 5 kinds of bags of candy: • H1 = 100% cherry • H2 = 75% cherry, 25% lime • H3 = 50% cherry, 50% lime • H4 = 25% cherry, 75% lime • H5 = 100% lime • Given a new bag of candy, predict H • Observations: D1, D2, D3, …

  5. Bayesian Learning • Calculate the probability of each hypothesis given the data, and make predictions weighted by these probabilities (i.e. use all the hypotheses, not just the single best one) • To predict some unknown quantity X, average the per-hypothesis predictions, weighted by the posterior probability of each hypothesis

  6. Bayesian Learning cont. • Calculating P(h|d): by Bayes’ rule, P(h|d) = α P(d|h) P(h), where P(d|h) is the likelihood and P(h) is the prior • Assume the observations are i.i.d. (independent and identically distributed), so the likelihood factors as P(d|h) = ∏j P(dj|h)

  7. Example: • Hypothesis prior over h1, …, h5 is {0.1, 0.2, 0.4, 0.2, 0.1} • Data: a sequence of opened candies d1, d2, … • Q1: After seeing d1, what is P(hi|d1)? • Q2: After seeing d1, what is the predictive probability P(d2|d1)? (see the sketch below)
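
A minimal sketch of Q1 and Q2 in Python, assuming for illustration that the observed candy d1 is lime (the slide shows the observation pictorially); the likelihoods P(lime | hi) follow directly from the bag definitions:

```python
# Bayesian updating for the candy-bag example (sketch; assumes d1 = lime).
priors = [0.1, 0.2, 0.4, 0.2, 0.1]      # P(h1) ... P(h5)
p_lime = [0.0, 0.25, 0.5, 0.75, 1.0]    # P(lime | hi) for each bag type

# Q1: posterior P(hi | d1 = lime) is proportional to P(d1 | hi) * P(hi)
unnormalized = [p * prior for p, prior in zip(p_lime, priors)]
z = sum(unnormalized)
posterior = [u / z for u in unnormalized]
print(posterior)                         # [0.0, 0.1, 0.4, 0.3, 0.2]

# Q2: predictive P(d2 = lime | d1) = sum_i P(lime | hi) * P(hi | d1)
print(sum(p * q for p, q in zip(p_lime, posterior)))   # 0.65
```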

  8. Making Statistical Inferences • Bayesian – predictions made using all hypotheses, weighted by their posterior probabilities • MAP – maximum a posteriori – uses the single most probable hypothesis to make predictions • often much easier than Bayesian; as we get more and more data, it gets closer to the Bayesian-optimal prediction • ML – maximum likelihood – assumes a uniform prior over H, i.e. it coincides with MAP when no hypothesis is preferred a priori
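
In formulas, using the chapter's notation, the three prediction styles are:

```latex
% Bayesian (full) prediction: average over all hypotheses
P(X \mid \mathbf{d}) = \sum_i P(X \mid h_i)\, P(h_i \mid \mathbf{d})

% MAP: use the single most probable hypothesis
h_{\mathrm{MAP}} = \arg\max_{h} P(h \mid \mathbf{d}) = \arg\max_{h} P(\mathbf{d} \mid h)\, P(h),
\qquad P(X \mid \mathbf{d}) \approx P(X \mid h_{\mathrm{MAP}})

% ML: ignore the prior (equivalent to MAP with a uniform prior over H)
h_{\mathrm{ML}} = \arg\max_{h} P(\mathbf{d} \mid h)
```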

  9. Naïve Bayes (20.2)

  10. Naïve Bayes • aka Idiot Bayes • particularly simple BN • makes overly strong independence assumptions • but works surprisingly well in practice…

  11. Bayesian Diagnosis • suppose we want to make a diagnosis D and there are n possible mutually exclusive diagnoses d1, …, dn • suppose there are m Boolean symptoms, E1, …, Em • how do we make a diagnosis? we need the posterior P(di | e1, …, em) for each candidate diagnosis

  12. Naïve Bayes Assumption • Assume each piece of evidence (symptom) is independent given the diagnosis • then what is the structure of the corresponding BN? (see the factorization sketched below)
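
Under this assumption the diagnosis posterior from the previous slide factors into one term per symptom:

```latex
P(d_i \mid e_1, \ldots, e_m) \;=\; \alpha\, P(d_i) \prod_{k=1}^{m} P(e_k \mid d_i)
\qquad \text{where } \alpha \text{ normalizes over the diagnoses } d_1, \ldots, d_n
```

Structurally, the corresponding BN is a single diagnosis node D with one arc pointing to each symptom node Ek.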

  13. Naïve Bayes Example • possible diagnoses: Allergy, Cold, and OK • possible symptoms: Sneeze, Cough, and Fever • my symptoms are sneeze & cough: what is the diagnosis?
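
A small Python sketch of that query using the factorization above; the prior and conditional probability values here are made-up placeholders, not the numbers from the slide:

```python
# Naive Bayes diagnosis sketch. The probability tables below are
# illustrative placeholders, not the actual values from the slide.
priors = {"Allergy": 0.05, "Cold": 0.05, "OK": 0.90}
p_symptom = {            # P(symptom = true | diagnosis)
    "Allergy": {"Sneeze": 0.9, "Cough": 0.7, "Fever": 0.4},
    "Cold":    {"Sneeze": 0.9, "Cough": 0.8, "Fever": 0.7},
    "OK":      {"Sneeze": 0.1, "Cough": 0.1, "Fever": 0.01},
}
observed = {"Sneeze": True, "Cough": True}   # Fever is unobserved, so it is ignored

scores = {}
for d, prior in priors.items():
    score = prior
    for symptom, present in observed.items():
        p = p_symptom[d][symptom]
        score *= p if present else (1 - p)
    scores[d] = score

z = sum(scores.values())
posterior = {d: s / z for d, s in scores.items()}
print(max(posterior, key=posterior.get), posterior)   # most probable diagnosis
```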

  14. Learning the Probabilities • aka parameter estimation • we need • P(di) – prior • P(ek|di) – conditional probability • use training data to estimate

  15. Maximum Likelihood Estimate (MLE) • use frequencies in the training set to estimate the probabilities (see below), where nx is shorthand for the count of event x in the training set
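
A standard form of the frequency estimates this bullet refers to, writing N for the total number of training examples (a symbol added here for clarity):

```latex
\hat{P}(d_i) = \frac{n_{d_i}}{N},
\qquad
\hat{P}(e_k \mid d_i) = \frac{n_{e_k \wedge d_i}}{n_{d_i}}
```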

  16. Example: what is P(Allergy)? P(Sneeze | Allergy)? P(Cough | Allergy)?

  17. Laplace Estimate (smoothing) • use smoothing to eliminate zero counts (see below), where n is the number of possible values for d and each e is assumed to have 2 possible values • many other smoothing schemes exist…
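
One common add-one form of the Laplace estimate, consistent with the slide's conventions (N total training examples, n possible diagnosis values, binary symptoms):

```latex
\hat{P}(d_i) = \frac{n_{d_i} + 1}{N + n},
\qquad
\hat{P}(e_k \mid d_i) = \frac{n_{e_k \wedge d_i} + 1}{n_{d_i} + 2}
```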

  18. Comments • Generally works well despite the blanket assumption of independence • Experiments show it is competitive with decision trees on some well-known test sets (UCI) • handles noisy data

  19. Learning more complex Bayesian networks • Two subproblems: • learning structure: combinatorial search over the space of networks • learning parameter values: easy if all of the variables are observed in the training set; harder if there are ‘hidden variables’

  20. Instance-based Learning (20.4)

  21. Instance/Memory-based Learning • Non-parametric: hypothesis complexity grows with the data • Memory-based learning: construct hypotheses directly from the training data itself

  22. Nearest Neighbor Methods • To classify a new input vector x, examine the k closest training data points to x and assign the object to the most frequently occurring class among them • [figure: the same query point x classified with k = 1 and with k = 6]

  23. Issues • Distance measure • Most common: Euclidean • Better distance measures: normalize each variable by its standard deviation • For discrete data, can use Hamming distance • Choosing k • Increasing k reduces variance but increases bias • In high-dimensional spaces, the nearest neighbor may not be very close at all! • Memory-based technique: must make a pass through the data for each classification, which can be prohibitive for large data sets • Indexing the data can help, for example with KD-trees (see the sketch below)
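
A minimal brute-force k-nearest-neighbor sketch in Python (Euclidean distance, majority vote); a practical implementation would also normalize each variable and index the data, e.g. with a KD-tree, as noted above:

```python
from collections import Counter
import math

def knn_classify(query, data, k=5):
    """data: list of (feature_vector, label) pairs; query: a feature vector."""
    # Brute-force pass over the training set (one pass per classification).
    dists = sorted((math.dist(query, x), label) for x, label in data)
    # Majority vote among the k closest training points.
    votes = Counter(label for _, label in dists[:k])
    return votes.most_common(1)[0][0]

# Tiny usage example with made-up 2-D points
train = [((0.0, 0.0), "A"), ((0.1, 0.2), "A"), ((1.0, 1.1), "B"), ((0.9, 1.0), "B")]
print(knn_classify((0.2, 0.1), train, k=3))   # -> "A"
```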

  24. Neural Networks (20.5)

  25. Neural function • Brain function (thought) occurs as the result of the firing of neurons • Neurons connect to each other through synapses, which propagate action potential (electrical impulses) by releasing neurotransmitters • Synapses can be excitatory (potential-increasing) or inhibitory (potential-decreasing), and have varying activation thresholds • Learning occurs as a result of the synapses’ plasticity: they exhibit long-term changes in connection strength • There are about 10^11 neurons and about 10^14 synapses in the human brain

  26. Biology of a neuron

  27. Brain structure • Different areas of the brain have different functions • Some areas seem to have the same function in all humans (e.g., Broca’s region); the overall layout is generally consistent • Some areas are more plastic, and vary in their function; also, the lower-level structure and function vary greatly • We don’t know how different functions are “assigned” or acquired • Partly the result of the physical layout / connection to inputs (sensors) and outputs (effectors) • Partly the result of experience (learning) • We really don’t understand how this neural structure leads to what we perceive as “consciousness” or “thought” • Our neural networks are not nearly as complex or intricate as the actual brain structure

  28. Comparison of computing power • Computers are way faster than neurons… • But there are a lot more neurons than we can reasonably model in modern digital computers, and they all fire in parallel • Neural networks are designed to be massively parallel • The brain is effectively a billion times faster

  29. Neural networks • Neural networks are made up of nodes or units, connected by links • Each link has an associated weight and activation level • Each node has an input function (typically summing over weighted inputs), an activation function, and an output

  30. Neural unit

  31. Linear Threshold Unit (LTU)

  32. Sigmoid Unit
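
A sketch of the two unit types: each computes a weighted sum of its inputs (with a bias weight w0 on a constant input of 1) and they differ only in the activation function, a hard threshold for the LTU versus the smooth sigmoid 1/(1 + e^-x) for the sigmoid unit. The AND weights at the end anticipate the next slide's question and are one valid choice, not the only one:

```python
import math

def weighted_sum(weights, inputs):
    # in = w0 * 1 + w1*x1 + ... + wn*xn  (w0 acts as the bias/threshold weight)
    return weights[0] + sum(w * x for w, x in zip(weights[1:], inputs))

def ltu(weights, inputs):
    """Linear threshold unit: output 1 if the weighted sum is >= 0, else 0."""
    return 1 if weighted_sum(weights, inputs) >= 0 else 0

def sigmoid_unit(weights, inputs):
    """Sigmoid unit: smooth, differentiable squashing of the weighted sum."""
    return 1.0 / (1.0 + math.exp(-weighted_sum(weights, inputs)))

# An LTU computing logical AND of two binary inputs:
and_weights = [-1.5, 1.0, 1.0]
print([ltu(and_weights, (a, b)) for a, b in [(0, 0), (0, 1), (1, 0), (1, 1)]])  # [0, 0, 0, 1]
```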

  33. Neural Computation • McCulloch and Pitts (1943) showed how LTUs can be used to compute logical functions • AND? • OR? • NOT? • Two layers of LTUs can represent any Boolean function

  34. Learning Rules • Rosenblatt (1959) suggested that if a target output value is provided for a single neuron with fixed inputs, the weights can be incrementally changed to learn to produce these outputs using the perceptron learning rule • assumes binary-valued inputs/outputs • assumes a single linear threshold unit

  35. Perceptron Learning Rule • If the target output for unit j is tj, adjust each weight by wj,i ← wj,i + η (tj − oj) xi • Equivalent to the intuitive rules: • If the output is correct, don’t change the weights • If the output is low (oj=0, tj=1), increment the weights for all inputs which are 1 • If the output is high (oj=1, tj=0), decrement the weights for all inputs which are 1 • Must also adjust the threshold, or equivalently assume there is a weight wj0 for an extra input unit that has o0=1

  36. Perceptron Learning Algorithm • Repeatedly iterate through the examples, adjusting the weights according to the perceptron learning rule until all outputs are correct • Initialize the weights to all zero (or random values) • Until the outputs for all training examples are correct: • for each training example e do • compute the current output oj • compare it to the target tj and update the weights • each execution of the outer loop is an epoch • for multiple-category problems, learn a separate perceptron for each category and assign to the class whose perceptron most exceeds its threshold • Q: when will the algorithm terminate? (a sketch of the algorithm follows below)
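
A sketch of the algorithm in Python; the learning rate and epoch limit are illustrative choices, and the constant always-1 input implements the threshold weight mentioned on the previous slide:

```python
def train_perceptron(examples, n_inputs, eta=0.1, max_epochs=100):
    """examples: list of (inputs, target) pairs with binary inputs/targets.
    Returns weights [w0, w1, ..., wn]; w0 multiplies a constant input of 1."""
    w = [0.0] * (n_inputs + 1)                     # initialize weights to zero
    for _ in range(max_epochs):                    # each pass is one epoch
        all_correct = True
        for x, t in examples:
            x_ext = (1,) + tuple(x)                # prepend the constant input o0 = 1
            o = 1 if sum(wi * xi for wi, xi in zip(w, x_ext)) >= 0 else 0
            if o != t:                             # perceptron rule: w_i += eta*(t - o)*x_i
                all_correct = False
                for i, xi in enumerate(x_ext):
                    w[i] += eta * (t - o) * xi
        if all_correct:                            # stops early only if the data is separable
            break
    return w

# Learning OR of two inputs (linearly separable, so this converges):
examples = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 1)]
print(train_perceptron(examples, n_inputs=2))
```

On the termination question: the early exit happens only once every training example is classified correctly, which the convergence theorem (two slides ahead) guarantees only when the data is linearly separable; otherwise the loop runs until the epoch limit.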

  37. Perceptron Video

  38. Representation Limitations of a Perceptron • Perceptrons can only represent linear threshold functions and can therefore only learn functions which linearly separate the data, i.e. where the positive and negative examples are separable by a hyperplane in n-dimensional space

  39. Perceptron Learnability • Perceptron Convergence Theorem: if there is a set of weights consistent with the training data (i.e. the data is linearly separable), the perceptron learning algorithm will converge (Minsky & Papert, 1969) • Unfortunately, many functions (like parity) cannot be represented by an LTU

  40. Layered feed-forward network • [figure: input units → hidden units → output units]

  41. Backpropagation Algorithm
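
The slide's equations are not reproduced in this transcript; as a sketch, the standard backpropagation updates for a sigmoid network with learning rate α are:

```latex
% Output unit k: error term from the target y_k and activation a_k
\Delta_k = g'(in_k)\,(y_k - a_k)

% Hidden unit j: back-propagate the output error terms through the weights
\Delta_j = g'(in_j) \sum_k w_{j,k}\, \Delta_k

% Weight update for the link from unit i to unit j
w_{i,j} \leftarrow w_{i,j} + \alpha\, a_i\, \Delta_j
```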

  42. “Executing” neural networks • Input units are set by some exterior function (think of these as sensors), which causes their output links to be activated at the specified level • Working forward through the network, the input function of each unit is applied to compute the input value • Usually this is just the weighted sum of the activation on the links feeding into this node • The activation function transforms this input function into a final value • Typically this is a nonlinear function, often a sigmoid function corresponding to the “threshold” of that node
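
A minimal Python sketch of this forward ("execution") pass for a fully connected network with sigmoid activations; the layer sizes and weight values are illustrative only:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def layer_forward(inputs, weights):
    """weights: one row per unit in the layer; row = [bias, w_1, ..., w_n].
    Each unit sums its weighted inputs, then applies the sigmoid activation."""
    return [sigmoid(row[0] + sum(w * x for w, x in zip(row[1:], inputs)))
            for row in weights]

def network_forward(inputs, layers):
    """Work forward through the network, layer by layer."""
    activation = inputs
    for weights in layers:
        activation = layer_forward(activation, weights)
    return activation

# Illustrative 2-input, 2-hidden-unit, 1-output network with made-up weights:
layers = [
    [[0.1, 0.4, -0.6], [-0.3, 0.8, 0.2]],   # hidden layer (2 units)
    [[0.0, 1.0, -1.0]],                      # output layer (1 unit)
]
print(network_forward([1.0, 0.0], layers))
```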

  43. Neural Nets for Face Recognition • 90% accurate at learning head pose and at recognizing 1-of-20 faces

  44. Summary: Statistical Learning Methods • Statistical Inference • use the likelihood of the data and the probability of the hypothesis to predict the value for the next instance • Bayesian • MAP • ML • Naïve Bayes • Nearest Neighbor • Neural Networks
