Statistical Learning Methods Russell and Norvig: Chapter 20 (20.1,20.2,20.4,20.5) CMSC 421 – Fall 2006
Statistical Approaches • Statistical Learning (20.1) • Naïve Bayes (20.2) • Instance-based Learning (20.4) • Neural Networks (20.5)
Example: Candy Bags • Candy comes in two flavors: cherry and lime • Candy is wrapped, so you can't tell the flavor until it is opened • There are 5 kinds of bags of candy: • H1 = 100% cherry • H2 = 75% cherry, 25% lime • H3 = 50% cherry, 50% lime • H4 = 25% cherry, 75% lime • H5 = 100% lime • Given a new bag of candy, predict H • Observations: D1, D2, D3, …
Bayesian Learning • Calculate the probability of each hypothesis given the data, and make predictions weighted by these probabilities (i.e., use all the hypotheses, not just the single best one) • To predict some unknown quantity X: P(X|d) = Σi P(X|hi) P(hi|d)
Bayesian Learning cont. • Calculating P(hi|d): by Bayes' rule, P(hi|d) = α P(d|hi) P(hi), where P(d|hi) is the likelihood and P(hi) is the prior • Assume the observations are i.i.d. (independent and identically distributed), so P(d|hi) = Πj P(dj|hi)
Example: • Hypothesis prior over h1, …, h5 is {0.1, 0.2, 0.4, 0.2, 0.1} • Data: a sequence of lime candies d1, d2, d3, … • Q1: After seeing d1, what is P(hi|d1)? • Q2: After seeing d1, what is P(d2 = lime | d1)?
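To make the update concrete, here is a minimal Python sketch of the Bayesian update for this candy example; the variable names and the all-lime observation sequence are assumptions for illustration, not taken from the slides.

```python
# Bayesian update over the five candy-bag hypotheses (illustrative sketch).
priors = {"h1": 0.1, "h2": 0.2, "h3": 0.4, "h4": 0.2, "h5": 0.1}
# P(lime | h_i) for each bag type
p_lime = {"h1": 0.0, "h2": 0.25, "h3": 0.5, "h4": 0.75, "h5": 1.0}

def posterior(observations, priors, p_lime):
    """P(h_i | d) after a sequence of 'lime'/'cherry' observations (i.i.d.)."""
    post = dict(priors)
    for obs in observations:
        for h in post:
            likelihood = p_lime[h] if obs == "lime" else 1.0 - p_lime[h]
            post[h] *= likelihood
        z = sum(post.values())                      # normalization constant (alpha)
        post = {h: p / z for h, p in post.items()}
    return post

post = posterior(["lime"], priors, p_lime)            # Q1: P(h_i | d1)
p_next_lime = sum(post[h] * p_lime[h] for h in post)  # Q2: P(d2 = lime | d1)
print(post, p_next_lime)
```

With these priors, one lime observation gives P(d2 = lime | d1) = 0.65; the MAP and ML predictions discussed next can be read off the same posterior.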
Making Statistical Inferences • Bayesian • predictions made using all hypotheses, weighted by their probabilities • MAP – maximum a posteriori • uses the single most probable hypothesis to make the prediction • often much easier than Bayesian; as we get more and more data, it approaches the Bayesian-optimal prediction • ML – maximum likelihood • assumes a uniform prior over H • when the prior is uniform, ML and MAP coincide
Naïve Bayes • aka Idiot Bayes • particularly simple BN • makes overly strong independence assumptions • but works surprisingly well in practice…
Bayesian Diagnosis • suppose we want to make a diagnosis D and there are n possible mutually exclusive diagnoses d1, …, dn • suppose there are m boolean symptoms, E1, …, Em • how do we make the diagnosis? we need P(di | e1, …, em) ∝ P(e1, …, em | di) P(di)
Naïve Bayes Assumption • Assume each piece of evidence (symptom) is independent given the diagnosis • then P(e1, …, em | di) = Πk P(ek | di) • what is the structure of the corresponding BN?
Naïve Bayes Example • possible diagnoses: Allergy, Cold, and OK • possible symptoms: Sneeze, Cough, and Fever • my symptoms are sneeze & cough; what is the diagnosis?
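A small sketch of how the naïve Bayes computation could go for this example; the probability values and function names below are illustrative assumptions, not the numbers from the slides.

```python
# Naive Bayes diagnosis sketch. All probability values are made-up placeholders.
priors = {"Allergy": 0.05, "Cold": 0.05, "OK": 0.90}
# P(symptom = true | diagnosis)
p_symptom = {
    "Allergy": {"sneeze": 0.9, "cough": 0.7, "fever": 0.1},
    "Cold":    {"sneeze": 0.9, "cough": 0.8, "fever": 0.7},
    "OK":      {"sneeze": 0.1, "cough": 0.1, "fever": 0.01},
}

def diagnose(evidence):
    """evidence: dict symptom -> True/False. Returns P(d | evidence)."""
    scores = {}
    for d, prior in priors.items():
        score = prior
        for symptom, present in evidence.items():
            p = p_symptom[d][symptom]
            score *= p if present else 1.0 - p   # independence given the diagnosis
        scores[d] = score
    z = sum(scores.values())                      # normalize over diagnoses
    return {d: s / z for d, s in scores.items()}

print(diagnose({"sneeze": True, "cough": True, "fever": False}))
```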
Learning the Probabilities • aka parameter estimation • we need • P(di) – prior • P(ek|di) – conditional probability • use training data to estimate
Maximum Likelihood Estimate (MLE) • use frequencies in the training set to estimate: P(di) = ndi / N and P(ek|di) = ndi∧ek / ndi, where N is the number of training examples and nx is shorthand for the count of event x in the training set
Example: what is: P(Allergy)? P(Sneeze| Allergy)? P(Cough| Allergy)?
Laplace Estimate (smoothing) • use smoothing to eliminate zero counts: P(di) = (ndi + 1) / (N + n) and P(ek|di) = (ndi∧ek + 1) / (ndi + 2), where n is the number of possible values for d and e is assumed to have 2 possible values • many other smoothing schemes exist…
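A possible sketch of estimating these parameters from counts, with the Laplace smoothing described above; the tiny training set and helper names are made up for illustration.

```python
# Parameter-estimation sketch: frequency counts with optional Laplace smoothing.
from collections import Counter

# each record: (diagnosis, {symptom: bool}); purely illustrative data
data = [
    ("Allergy", {"sneeze": True,  "cough": False}),
    ("Cold",    {"sneeze": True,  "cough": True}),
    ("OK",      {"sneeze": False, "cough": False}),
    ("OK",      {"sneeze": False, "cough": True}),
]
diagnoses = ["Allergy", "Cold", "OK"]

N = len(data)
n_d = Counter(d for d, _ in data)

def p_prior(d, smooth=True):
    # Laplace: add 1 to the count, add the number of diagnosis values to the total
    return (n_d[d] + 1) / (N + len(diagnoses)) if smooth else n_d[d] / N

def p_cond(symptom, d, smooth=True):
    # P(symptom = true | d); the symptom is boolean, hence "+ 2" in the denominator
    n_both = sum(1 for di, e in data if di == d and e[symptom])
    return (n_both + 1) / (n_d[d] + 2) if smooth else n_both / n_d[d]

print(p_prior("Allergy"), p_cond("sneeze", "Allergy"))
```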
Comments • Generally works well despite the blanket assumption of independence • Experiments show it is competitive with decision trees on some well-known test sets (UCI) • handles noisy data
Learning more complex Bayesian networks • Two subproblems: • learning structure: combinatorial search over the space of networks • learning parameter values: easy if all of the variables are observed in the training set; harder if there are 'hidden' variables
Instance/Memory-based Learning • Non-parametric • hypothesis complexity grows with the data • Memory-based learning • Construct hypotheses directly from the training data itself
Nearest Neighbor Methods • To classify a new input vector x, examine the k closest training data points to x and assign the object to the most frequently occurring class • [Figure: neighborhoods of a query point x for k = 1 and k = 6]
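A minimal k-nearest-neighbor sketch along these lines; the toy data points and function name are illustrative.

```python
# k-nearest-neighbor classification with Euclidean distance (illustrative sketch).
import math
from collections import Counter

def knn_classify(x, training_data, k):
    """training_data: list of (vector, label). Returns the majority label
    among the k training points closest to x."""
    dists = sorted((math.dist(x, xi), label) for xi, label in training_data)
    votes = Counter(label for _, label in dists[:k])
    return votes.most_common(1)[0][0]

train = [((0.0, 0.0), "A"), ((0.1, 0.2), "A"), ((1.0, 1.1), "B"), ((0.9, 1.0), "B")]
print(knn_classify((0.2, 0.1), train, k=3))   # -> "A"
```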
Issues • Distance measure • Most common: Euclidean • Better distance measures: normalize each variable by its standard deviation (a z-scoring sketch follows this list) • For discrete data, can use Hamming distance • Choosing k • Increasing k reduces variance but increases bias • In high-dimensional spaces, the nearest neighbor may not be very close at all! • Memory-based technique: must make a pass through the data for each classification, which can be prohibitive for large data sets • Indexing the data can help, e.g., k-d trees
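For the normalization point above, a small sketch of z-scoring each variable before computing distances, so that no single feature dominates; the data and names are illustrative.

```python
# Standardize each column (subtract mean, divide by standard deviation)
# before running nearest-neighbor distance computations.
import statistics

def standardize(vectors):
    """z-score each column of a list of equal-length tuples."""
    cols = list(zip(*vectors))
    means = [statistics.mean(c) for c in cols]
    stds = [statistics.pstdev(c) or 1.0 for c in cols]  # guard against zero std
    return [tuple((v - m) / s for v, m, s in zip(vec, means, stds))
            for vec in vectors]

print(standardize([(1.0, 100.0), (2.0, 300.0), (3.0, 200.0)]))
```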
Neural function • Brain function (thought) occurs as the result of the firing of neurons • Neurons connect to each other through synapses, which propagate action potentials (electrical impulses) by releasing neurotransmitters • Synapses can be excitatory (potential-increasing) or inhibitory (potential-decreasing), and have varying activation thresholds • Learning occurs as a result of the synapses' plasticity: they exhibit long-term changes in connection strength • There are about 10^11 neurons and about 10^14 synapses in the human brain
Brain structure • Different areas of the brain have different functions • Some areas seem to have the same function in all humans (e.g., Broca’s region); the overall layout is generally consistent • Some areas are more plastic, and vary in their function; also, the lower-level structure and function vary greatly • We don’t know how different functions are “assigned” or acquired • Partly the result of the physical layout / connection to inputs (sensors) and outputs (effectors) • Partly the result of experience (learning) • We really don’t understand how this neural structure leads to what we perceive as “consciousness” or “thought” • Our neural networks are not nearly as complex or intricate as the actual brain structure
Comparison of computing power • Computers are way faster than neurons… • But there are a lot more neurons than we can reasonably model in modern digital computers, and they all fire in parallel • Neural networks are designed to be massively parallel • The brain is effectively a billion times faster
Neural networks • Neural networks are made up of nodes or units, connected by links • Each link has an associated weight and activation level • Each node has an input function (typically summing over weighted inputs), an activation function, and an output
Neural Computation • McCulloch and Pitts (1943) showed how a linear threshold unit (LTU) can be used to compute logical functions • AND? • OR? • NOT? • Two layers of LTUs can represent any boolean function
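One possible choice of weights and thresholds (an illustrative assumption, not taken from the slides) showing a single LTU computing AND, OR, and NOT:

```python
# A linear threshold unit fires (outputs 1) when the weighted sum of its
# 0/1 inputs reaches the threshold.
def ltu(weights, threshold):
    return lambda *inputs: int(sum(w * x for w, x in zip(weights, inputs)) >= threshold)

AND = ltu([1, 1], threshold=2)
OR  = ltu([1, 1], threshold=1)
NOT = ltu([-1],   threshold=0)

for a in (0, 1):
    print("NOT", a, "=", NOT(a))
    for b in (0, 1):
        print(a, "AND", b, "=", AND(a, b), "|", a, "OR", b, "=", OR(a, b))
```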
Learning Rules • Rosenblatt (1959) suggested that if a target output value is provided for a single neuron with fixed inputs, the weights can be incrementally changed to learn to produce that output using the perceptron learning rule • assumes binary-valued inputs/outputs • assumes a single linear threshold unit
Perceptron Learning rule • If the target output for unit j is tj and its current output is oj, update each weight by wji ← wji + η (tj − oj) oi • Equivalent to the intuitive rules: • If the output is correct, don't change the weights • If the output is low (oj = 0, tj = 1), increment the weights for all inputs that are 1 • If the output is high (oj = 1, tj = 0), decrement the weights for all inputs that are 1 • Must also adjust the threshold, or equivalently assume there is a weight wj0 for an extra input unit with o0 = 1
Perceptron Learning Algorithm • Repeatedly iterate through the examples, adjusting weights according to the perceptron learning rule, until all outputs are correct • Initialize the weights to all zero (or random) • Until outputs for all training examples are correct • for each training example e do • compute the current output oj • compare it to the target tj and update the weights • each execution of the outer loop is an epoch • for multiple-category problems, learn a separate perceptron for each category and assign the input to the class whose perceptron most exceeds its threshold • Q: when will the algorithm terminate?
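A runnable sketch of the algorithm above, using the update rule wi ← wi + η(t − o)xi with an always-on input x0 = 1 in place of an explicit threshold; the learning rate and the OR training set are illustrative choices.

```python
# Perceptron training sketch (single linear threshold unit, 0/1 inputs and targets).
def perceptron_train(examples, n_inputs, eta=0.1, max_epochs=100):
    """examples: list of (inputs, target). Returns the learned weights."""
    w = [0.0] * (n_inputs + 1)                 # w[0] plays the role of the threshold
    for _ in range(max_epochs):                # each pass over the data is one epoch
        all_correct = True
        for x, t in examples:
            xb = [1] + list(x)                 # prepend the always-on input x0 = 1
            o = int(sum(wi * xi for wi, xi in zip(w, xb)) >= 0)
            if o != t:
                all_correct = False
                w = [wi + eta * (t - o) * xi for wi, xi in zip(w, xb)]
        if all_correct:                        # terminates only if the data is
            break                              # linearly separable
    return w

# Learn OR, which is linearly separable:
or_data = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 1)]
print(perceptron_train(or_data, n_inputs=2))
```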
Representation Limitations of a Perceptron • Perceptrons can only represent linear threshold functions and can therefore only learn functions that linearly separate the data, i.e., the positive and negative examples are separable by a hyperplane in n-dimensional space
Perceptron Learnability • Perceptron Convergence Theorem: if there is a set of weights consistent with the training data (i.e., the data is linearly separable), the perceptron learning algorithm will converge (Minsky & Papert, 1969) • Unfortunately, many functions (like parity) cannot be represented by a single LTU
Layered feed-forward network • [Figure: input units → hidden units → output units]
“Executing” neural networks • Input units are set by some exterior function (think of these as sensors), which causes their output links to be activated at the specified level • Working forward through the network, the input function of each unit is applied to compute the input value • Usually this is just the weighted sum of the activation on the links feeding into this node • The activation function transforms this input function into a final value • Typically this is a nonlinear function, often a sigmoid function corresponding to the “threshold” of that node
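A forward-pass sketch for a small layered feed-forward network with sigmoid activations; the 2-2-1 architecture and the weight values are arbitrary assumptions for illustration.

```python
# Forward pass through one hidden layer and one output layer.
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def layer_forward(inputs, weights, biases):
    """Each unit: weighted sum of incoming activations (input function),
    then a sigmoid (activation function)."""
    return [sigmoid(sum(w * a for w, a in zip(ws, inputs)) + b)
            for ws, b in zip(weights, biases)]

# hidden layer of 2 units, one output unit; weights chosen arbitrarily
hidden_w, hidden_b = [[1.0, -1.0], [0.5, 0.5]], [0.0, -0.5]
out_w, out_b = [[2.0, -1.5]], [0.1]

x = [0.7, 0.2]                                  # "sensor" values set externally
hidden = layer_forward(x, hidden_w, hidden_b)   # work forward through the network
output = layer_forward(hidden, out_w, out_b)
print(hidden, output)
```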
Neural Nets for Face Recognition • 90% accurate at learning head pose and recognizing 1 of 20 faces
Summary: Statistical Learning Methods • Statistical inference • use the likelihood of the data and the probability of each hypothesis to predict the value for the next instance • Bayesian • MAP • ML • Naïve Bayes • Nearest Neighbor • Neural Networks