Bayesian Learning. Pt 2. 6.7–6.12. Machine Learning. Promethea Pythaitha.
Bayes Optimal Classifier. • Gibbs Algorithm. • Naïve Bayes Classifier. • Bayesian Belief Networks. • EM Algorithm.
Bayesian Optimal Classifier. • So far we have asked: Which is the most likely-to-be-correct hypothesis? • Which h is the M.A.P. hypothesis? • Recall MAP = Maximum a-posteriori hypothesis. • Which hypothesis has the highest likelihood of correctness given the sample data we have seen.
What we usually want is the classification of a specific instance x not in the training data D. • One way: Find hMAP and return its prediction for x. • Or decide what is the most probable classification for x.
Boolean classification: • Suppose we have hypotheses h1 through h6 with known posterior probabilities, • and h1 (the MAP hypothesis) classifies x as −, while the rest classify it as +. • If the net support for − is 25% and for + is 75%, • then the “Bayes Optimal Classification” is +, even though hMAP says −.
Bayes Optimal Classifier. • Classifies an instance by taking a weighted vote of all the hypotheses’ predictions, each weighted by the credibility (posterior probability) of that hypothesis. • Eqn 6.18 (written out below).
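Written out, the Bayes optimal classification of an instance x (Eqn 6.18 in Mitchell) is

\arg\max_{v_j \in V} \sum_{h_i \in H} P(v_j \mid h_i)\, P(h_i \mid D)

i.e. each hypothesis h_i votes P(v_j | h_i) for each class v_j, and its vote is weighted by P(h_i | D).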
Any system that classifies instances using this rule is a “Bayes Optimal Classifier”. • No other classification method, given the same hypothesis space and prior knowledge, can outperform this method – on average!!! • A particularly interesting result is that the predictions made by a BOC can correspond to hypotheses that are not even in its hypothesis space. • This helps deal with the limitations imposed by overly restricted hypothesis spaces.
Best performance!!! ----- At what cost? • Bayes Optimal Classification is the best – on average – but it can be very computationally costly. • First it has to learn all the posterior probabilities for the hypotheses in H. • Then it has to poll the hypotheses to find what each one predicts for x’s classification. • Then it has to compute the big weighted sum (Eqn 6.18). • But remember, hypothesis spaces get very large…. • Recall why we used the specific and general boundaries in the Candidate-Elimination (CELA) algorithm. • (These steps are sketched in the code below.)
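A minimal sketch of those three steps in Python; the names here (hypotheses as a list of callables h(x), posteriors as a dict mapping each hypothesis to P(h|D)) are illustrative assumptions, not from the text:

    def bayes_optimal_classify(x, hypotheses, posteriors, labels):
        """Return the label with the largest posterior-weighted support (Eqn 6.18)."""
        support = {v: 0.0 for v in labels}
        for h in hypotheses:
            support[h(x)] += posteriors[h]        # poll h, weight its vote by P(h|D)
        return max(support, key=support.get)

    # Toy run mirroring the earlier example: h1 votes '-', five others vote '+'.
    h1 = lambda x: '-'
    others = [(lambda x: '+') for _ in range(5)]
    hyps = [h1] + others
    post = {h1: 0.25, **{h: 0.15 for h in others}}    # posteriors sum to 1
    print(bayes_optimal_classify(None, hyps, post, ['+', '-']))   # -> '+'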
Gibbs Algorithm. • One way to avoid some of this computation is: • 1: Select h from H at random according to the posterior-probability distribution, • so more “credible” hypotheses are selected with higher probability than others. • 2: Return h(x). • This saves the time of computing the results hi(x) for all hi in H, and doing the big weighted sum…. but it is less than optimal. • (A sketch follows.)
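A minimal sketch under the same illustrative assumptions as above (a list of callable hypotheses and a dict of posteriors); random.choices does the posterior-weighted draw:

    import random

    def gibbs_classify(x, hypotheses, posteriors):
        """Draw one hypothesis according to P(h|D) and return its prediction."""
        weights = [posteriors[h] for h in hypotheses]
        chosen = random.choices(hypotheses, weights=weights, k=1)[0]
        return chosen(x)    # only one hypothesis is polled; no sum over all of H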
How well can it do? • If we compute the expected misclassification error of the Gibbs algorithm over target concepts drawn at random according to the prior probability distribution assumed by the learner, • then this error is at most twice that of the B.O.C. • ** On average.
Naïve Bayes Classifier [NBC]. • A highly practical Bayesian learner. • Can, under the right conditions, perform as well as neural nets or decision trees. • Applies to any learning task where each instance x is described by a conjunction of attributes (or attribute-value pairs) and where the target function f(x) can take any value in a finite set V. • We are given training data D, and asked to classify a new instance x = <a1, a2, …, an>.
Bayesian approach: • Classify a new instance by assigning the most probable target value: vMAP, given the attribute values of the instance. • Eqn 6.19.
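Written out, the rule referenced above as Eqn 6.19 is

v_{MAP} = \arg\max_{v_j \in V} P(v_j \mid a_1, a_2, \ldots, a_n) = \arg\max_{v_j \in V} P(a_1, a_2, \ldots, a_n \mid v_j)\, P(v_j)

where the second form follows from Bayes’ theorem, since the denominator P(a_1, …, a_n) is the same for every v_j.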
To get the classification we need to find the vj that maximizes P(a1, a2, …, an| vj)*P(vj). • Second term: EASY. • It is simply the # of instances with classification vj over the total # of instances. • First term: Hard!!! • We would need a HUGE set of training data. • Suppose we have 10 attributes (with 2 possible values each) and 15 classifications: that is 2^10 * 15 = 15,360 distinct terms to estimate, • and we have, say, 200 instances with known classifications. • Cannot get a reliable estimate!!!
The Naïve assumption. • One way to ‘fix’ the problem is to assume the attributes are conditionally independent given the target value. • Assume P(a1, a2, …, an| vj) = Πi P(ai| vj). • Then the Naïve Bayes Classifier uses this for the prediction: • Eqn 6.20 (written out below).
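Written out, Eqn 6.20 is

v_{NB} = \arg\max_{v_j \in V} P(v_j) \prod_i P(a_i \mid v_j)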
Naïve Bayes Algorithm. • 1: Learn the P(vj) and P(ai| vj) for all a’s and v’s (based on the training data). • In our example this is 10*2*15 = 300 conditional probabilities (plus the 15 class priors). • A sample size of 200 is plenty! • ** This set of numbers is the learned hypothesis. • 2: Use this hypothesis to find vNB. • IF our “naïve assumption” (conditional independence) is true, then vNB = vMAP. • (Both steps are sketched in the code below.)
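A minimal sketch of both steps for discrete attributes; the data representation (training examples as (attribute_tuple, label) pairs) and the function names are illustrative assumptions:

    from collections import Counter, defaultdict

    def train_nb(data):
        """Step 1: estimate P(v) and P(a_i | v) by counting (the learned hypothesis)."""
        class_counts = Counter(label for _, label in data)
        cond_counts = defaultdict(Counter)          # (attribute index, label) -> value counts
        for attrs, label in data:
            for i, a in enumerate(attrs):
                cond_counts[(i, label)][a] += 1
        priors = {v: n / len(data) for v, n in class_counts.items()}

        def cond_prob(i, a, v):                     # P(a_i = a | v)
            return cond_counts[(i, v)][a] / class_counts[v]
        return priors, cond_prob

    def classify_nb(x, priors, cond_prob):
        """Step 2: return v_NB = argmax_v P(v) * prod_i P(a_i | v)  (Eqn 6.20)."""
        best_v, best_score = None, -1.0
        for v, prior in priors.items():
            score = prior
            for i, a in enumerate(x):
                score *= cond_prob(i, a, v)
            if score > best_score:
                best_v, best_score = v, score
        return best_v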
Bayesian Learning vs. Other Machine Learning methods. • In Bayesian learning, there is no explicit search through a hypothesis space. • It does not produce an inference rule (D-tree) or a weight vector (NN). • Instead, it forms the hypothesis by observing the frequencies of various data combinations.
Example. • There are four possible end-states of stellar evolution: • 1: White-Dwarf. • 2: Neutron Star. • 3: Black-hole. • 4: Brown dwarf.
White Dwarf. • About the size of the Earth, with the mass of our Sun [up to 1.44 solar masses]. • (The little white dot in the center of each image is the white dwarf.)
Neutron Stars. • About the size of the Gallatin Valley. • About twice the mass of our Sun (up to 2.9 solar masses). • Don’t go too close! Their gravitational and electromagnetic fields are strong enough to stretch you into spaghetti, rip out every metal atom in your body, and finally spread you across the whole surface!! • They form in Type II supernovae. (In the image, the neutron star is a tiny speck at the center of that cloud.)
Black-Holes. (3 to 50 Solar masses) • The ultimate cosmic sink-hole, even devours light!! • Time-dilation, etc. come into effect near the event-horizon.
Brown-Dwarfs. • Stars that never got hot enough to start fusion (<.1 Solar masses)
Classification: • Because it is hard to get data and impossible to observe these objects from close up, we need an accurate way of identifying these remnants. • Two ways: • 1: Computer model of stellar structure. • Create a program that models a star and estimates the equations of state governing the more bizarre remnants (such as neutron stars), since they involve super-nuclear densities and are not well predicted by quantum mechanics. • Learning algorithms (such as NNs) are sometimes used to tune the model based on known stellar remnants. • 2: Group an unclassified remnant with others having similar attributes.
The latter is more like a Bayesian Classifier. • Define (for masses of progenitor stars) • .1 to 10 Solar Masses = Average. • 10 to 40 Solar Masses = Giant • 40 to 150 Solar Masses = Supergiant • 0 to .1 Solar Masses = Tiny. • Define (for masses of remnants) • < 1.44 Solar masses = Small • 1.44 to 2.9 Solar masses = Medium • > 2.9 Solar masses = Large. • Define classifications: • WD = White Dwarf. • NS = Neutron Star. • BH = Black hole • BD = Brown Dwarf.
If we find a new stellar remnant with attributes <Average, Medium>, we could certainly put its mass into a stellar model that has been fine-tuned by our Neural Net, or we could simply use a Bayesian classification: • Either would give the same result: • Comparing with the data we have, and matching attributes, this has to be a Neutron star. • Similarly we can predict <Tiny, Small> → Brown Dwarf, • <Supergiant, Large> → Black Hole.
Quantitative example. • See Table 3.2, pg. 59. • Possible target values = {no, yes} • P(yes) = 9/14 = .64 • P(no) = 1 − P(yes) = 5/14 = .36 • Want to know: PlayTennis? If <sunny, cool, high, strong> • Need P(sunny|no), P(sunny|yes), etc… • P(sunny|no) = # sunny’s in the no category / # no’s = 3/5. • P(sunny|yes) = 2/9, etc… • NB classification: NO. • Support for no = P(no)*P(sunny|no)*P(cool|no)*P(high|no)*P(strong|no) = .0206. • Support for yes = … = .0053. • (The full products are checked in the snippet below.)
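A quick check of those two products; the conditional counts (3/5, 1/5, 4/5, 3/5 for the “no” class and 2/9, 3/9, 3/9, 3/9 for the “yes” class) are taken from the PlayTennis table:

    # Counts from the PlayTennis table (Mitchell, Table 3.2): 9 "yes" days, 5 "no" days.
    p_no, p_yes = 5/14, 9/14

    # P(sunny|.), P(cool|.), P(high|.), P(strong|.) for each class:
    support_no  = p_no  * (3/5) * (1/5) * (4/5) * (3/5)
    support_yes = p_yes * (2/9) * (3/9) * (3/9) * (3/9)

    print(round(support_no, 4), round(support_yes, 4))   # 0.0206 0.0053 -> classify as "no"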
Estimating probabilities. • Usually • P(event) = (# times event occurred)/(total # trials) • Fair coin: 50/50 heads/tails. • So out of two tosses, we expect 1 head, 1 tail. • Don’t bet your life on it!!! • What about 10 tosses? • P(all tails) = ½^10 = 1/1024. • More likely than winning the lottery, which DOES happen!
Small sample bias. • Using the simple ratio induces a bias: • ex: after 10 tosses with no heads, the estimate is P(heads) = 0/10 = 0. • NOT TRUE!! • The sample is not representative of the population, • and the zero estimate will dominate the NB classifier, • since it multiplies the whole product by 0.
M-estimate. • P(event e) = [# times e occurred + m*p] / [# trials + m] • m = “equivalent sample size.” • p = prior estimate of P(e). • Essentially we assume m virtual trials distributed according to the prior p (usually uniform), in addition to the real ones. • This reduces the small-sample bias. • (A small sketch follows.)
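A minimal sketch; the choices m = 2 and p = 0.5 below are illustrative (two virtual coin flips with a uniform prior):

    def m_estimate(count, trials, m=2, p=0.5):
        """Smoothed estimate (count + m*p) / (trials + m); m=2, p=0.5 are illustrative."""
        return (count + m * p) / (trials + m)

    print(m_estimate(0, 10))   # ~0.083 instead of a hard 0 for "no heads in 10 tosses"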
Text Classification using Naïve Bayes. • Used a Naïve Bayes approach to decide “like” or “dislike” based on the words in the text. • Simplifying assumptions: • 1: The position of a word did not matter. • 2: The 100 most common words were removed from consideration. • Overall performance = 89% accuracy, • versus 5% for random guessing (the reported experiment classified documents into 20 newsgroups, so chance is 1/20).
Bayesian Belief network. • The assumption of conditional independence may not be true!! • EX: • v = lives in community “k”. • Suppose we know that a survey has been done indicating 90% of the people there are young-earth creationists. • h1 = Is a young-earth creationist. • h2 = Discredits Darwinian evolution. • h3 = Likes carrots. • Clearly h1 and h2 are not conditionally independent given v, but h3 is unaffected.
Reality. • In any set of attributes, some will be conditionally independent. • And some will not.
Bayesian Belief Networks. • Allow conditional independence rules to be stated for certain subsets of the attributes. • Best of both worlds: • More realistic than assuming all attributes are conditionally independent. • Computationally cheaper than if we ignore the possibility of independent attributes.
Formally, a Bayesian belief network describes a probability distribution over a set of variables. • If we have variables Y1, …, Yn, • where Yk has domain Vk, • then the Bayesian belief network describes the joint probability distribution over V1 × V2 × … × Vn.
Representation. • Each variable in the instance is a node in the BBN. • Every node has: • 1: Network arcs, which assert that a variable is conditionally independent of its non-descendants given its parents. • 2: A conditional probability table, which defines the distribution of the variable given the values of its parents.
Strongly reminiscent of a Neural Net structure. Here we have conditional probabilities instead of weights. • See fig 6.3. pg 186.
Storm affects the probability of someone lighting a campfire – not the other way around. • BBNs allow us to state causality rules!! • Once we have learned the BBN, we can calculate the probability distribution of any variable. • pg 186. • (A small sketch of one node follows.)
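A minimal sketch of how one node of Fig. 6.3 (Campfire, whose parents are Storm and BusTourGroup) might be represented as a conditional probability table; the probabilities below are illustrative placeholders, not the values printed in the textbook figure:

    campfire_cpt = {
        # (storm, bus_tour_group) -> P(Campfire = True | parents); illustrative numbers
        (True,  True):  0.4,
        (True,  False): 0.1,
        (False, True):  0.8,
        (False, False): 0.2,
    }

    def p_campfire(value, storm, bus_tour_group):
        """Look up P(Campfire = value | Storm, BusTourGroup) in the table."""
        p_true = campfire_cpt[(storm, bus_tour_group)]
        return p_true if value else 1.0 - p_true

    print(p_campfire(True, storm=False, bus_tour_group=True))   # 0.8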
Learning the Network: • If the structure is known and all variables are observable, then learn the probabilities as in the NB classifier. • If the structure is known but some variables are hidden, we have hidden values to estimate: • train using NN-style gradient ascent, • or the EM algorithm. • If the structure is unknown… various methods exist.
Gradient ascent training. • Maximize P(D|h) by moving in the direction of steepest ascent: the gradient of ln P(D|h). • Define the ‘weight’ wijk as the conditional probability that Yi takes the value yij when its parents take the configuration uik. • General form: • Eqn 6.25 (written out below).
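For reference, the gradient referred to there (Eqn 6.25 in Mitchell) has the form

\frac{\partial \ln P(D \mid h)}{\partial w_{ijk}} = \sum_{d \in D} \frac{P_h(Y_i = y_{ij},\ U_i = u_{ik} \mid d)}{w_{ijk}}

so each weight is nudged by w_{ijk} \leftarrow w_{ijk} + \eta \sum_{d \in D} P_h(y_{ij}, u_{ik} \mid d) / w_{ijk} before the normalization described next.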
Weight update: • Since, for given i and k, the wijk’s must sum to 1 over j, and the gradient step can break this constraint, we update all the wijk’s and then normalize: • wijk ← wijk / (Σj wijk). • (A sketch of one update step follows.)
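A minimal sketch of one such step, assuming w[i][j][k] holds the CPT entries and that a routine expected_counts(i, j, k, d) (assumed, not shown) returns P_h(y_ij, u_ik | d) from inference on the current network:

    def update_weights(w, data, expected_counts, eta=0.01):
        """One gradient-ascent step on w[i][j][k], followed by renormalization."""
        for i in range(len(w)):
            for k in range(len(w[i][0])):
                # Gradient step for every value j of variable i under parent config k.
                for j in range(len(w[i])):
                    grad = sum(expected_counts(i, j, k, d) / w[i][j][k] for d in data)
                    w[i][j][k] += eta * grad
                # Renormalize so the w_ijk for this i and k sum to 1 over j.
                total = sum(w[i][j][k] for j in range(len(w[i])))
                for j in range(len(w[i])):
                    w[i][j][k] /= total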
Works very well in practice, though it can get stuck on local optima, much like Gradient descent for NN’s!!
EM algorithm. • An alternative way of learning hidden variables given training data. • In general, use the mean (expected) value for an attribute when its value is not known. • “The EM algorithm can be used even for variables whose value is never directly observed, provided the general form of the probability distribution governing those variables is known.”
Learning the means of several unknown variables (normal distribution). • Guess h = <μ1, μ2>. • The variance must be known. • Estimate the probability of data point i coming from each distribution, assuming h is correct. • Update h.
What complicates the situation is that we have more than one unknown variable. • The general idea is: • Pick your initial estimates of the means at random. • Assuming they are correct, and using the standard deviation (which must be known and equal for all distributions), figure out which data points are most likely to have come from which distribution; those points will have the largest effect on the revision of that distribution’s mean.
Then redefine each approximate mean as a weighted sample mean, where all data points are considered, but each point’s effect on the kth distribution’s mean is weighted by how likely it was to have come from that distribution. • Loop until the means converge. • (This loop is sketched in the code below.)
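A minimal sketch of this loop for two Gaussians with a known, shared sigma; the data and starting guesses below are illustrative:

    import math, random

    def em_two_means(xs, mu, sigma=1.0, iters=50):
        """EM for the means of two Gaussians with known, shared sigma.
        xs: observed points; mu: [mu1, mu2] initial guesses (updated in place)."""
        for _ in range(iters):
            # E-step: responsibility of each distribution j for each point x,
            # proportional to exp(-(x - mu_j)^2 / (2 sigma^2)).
            resp = []
            for x in xs:
                w = [math.exp(-(x - m) ** 2 / (2 * sigma ** 2)) for m in mu]
                s = sum(w)
                resp.append([wj / s for wj in w])
            # M-step: each mean becomes a responsibility-weighted sample mean.
            for j in range(2):
                mu[j] = sum(r[j] * x for r, x in zip(resp, xs)) / sum(r[j] for r in resp)
        return mu

    # Toy data drawn around true means 0 and 5; the starting guesses are random.
    random.seed(0)
    xs = [random.gauss(0, 1) for _ in range(100)] + [random.gauss(5, 1) for _ in range(100)]
    print(em_two_means(xs, mu=[random.uniform(0, 5), random.uniform(0, 5)]))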
In general, we use a quality test to measure the fit between the sample data and the learned distribution means. • For normally distributed variables, we could use a hypothesis test: • how certain can we be that the true mean equals the estimated value, given the sample data? • If the quality function has only one maximum, this method will find it. • Otherwise, it can get stuck on a local maximum, but in practice it works quite well.
Example: Estimating the means of 2 Gaussians. • Suppose we have two unknown variables, and we want their means μ1 and μ2 (true mean 1, true mean 2). • All we have is a few sample data points from these distributions (the blue and green dots in the original figure); we cannot see the actual distributions. • We know the standard deviation beforehand.
Review of the normal distribution: • The inflection points are at one standard deviation (σ) from the mean. • The probability of drawing a point at most 1 σ from the mean is about 68%, • so the probability of drawing a point to the right of μ + 1σ is about .5(1 − .68) = 16%. • The probability of a point at most 2 σ from the mean is about 95%, • so the probability of a point to the right of μ + 2σ is about .5(1 − .95) = 2.5%. • The probability of a point at most 3 σ from the mean is about 99.7%, • so the probability of a point to the right of μ + 3σ is about .5(1 − .997) = 0.15%.
In the above, the “to the right” probability is the same as the corresponding “to the left” probability, but I am doing one-sided tails for a reason. • The important part is to note how quickly the “tails” of the normal distribution fall off: the probability of drawing a data point drops drastically as we move away from the mean.
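A quick numerical check of those one-sided tail figures using the exact normal CDF (via math.erf); the exact values are close to the 16%, 2.5%, and 0.15% obtained above from the rounded 68/95/99.7 figures:

    import math

    # One-sided tail probabilities of the normal distribution, via the exact CDF.
    for k in (1, 2, 3):
        within = math.erf(k / math.sqrt(2))      # P(|X - mu| <= k*sigma)
        tail = 0.5 * (1 - within)                # P(X > mu + k*sigma)
        print(k, round(within, 4), round(tail, 4))
    # prints: 1 0.6827 0.1587 / 2 0.9545 0.0228 / 3 0.9973 0.0013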