Quiz 3: Mean: 9.2, Median: 9.75. Go over problem 1. Go over AdaBoost examples. Fix to C4.5 data formatting problem? Quiz 4. Alternative simple (but effective) discretization method (Yang & Webb, 2001).
Alternative simple (but effective) discretization method (Yang & Webb, 2001)
Let n = number of training examples.
For each attribute Ai, create √n bins.
Sort the values of Ai in ascending order, and put √n of them in each bin.
Add-one smoothing of probabilities is not needed.
This gives a good balance between discretization bias and variance.
Example: Humidity: 25, 38, 50, 80, 93, 98, 98, 99
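A minimal sketch of this binning step, assuming that both the number of bins and the number of values per bin are about √n (as in Yang & Webb's proportional discretization). The function name `sqrt_n_bins` and the rounding of √n are illustrative choices, not part of the slides.

```python
import math

def sqrt_n_bins(values):
    """Sort the values and split them into bins of about sqrt(n) values each."""
    ordered = sorted(values)
    n = len(ordered)
    if n == 0:
        return []
    # Values per bin; rounding sqrt(n) to the nearest integer is an assumption.
    size = max(1, round(math.sqrt(n)))
    return [ordered[i:i + size] for i in range(0, n, size)]

humidity = [25, 38, 50, 80, 93, 98, 98, 99]
for bin_values in sqrt_n_bins(humidity):
    print(bin_values)
```

With the eight Humidity values, √8 rounds to 3, so the sketch produces bins of three, three, and two values. It does not keep equal values (the two 98s) in the same bin, which a careful implementation would probably want to do.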
Beyond Independence: Conditions for the Optimality of the Simple Bayesian Classifier (P. Domingos and M. Pazzani)
The naive Bayes classifier is called “naive” because it assumes attributes are independent of one another given the class.
This paper asks: why does the naive (“simple”) Bayes classifier, SBC, do so well in domains with clearly dependent attributes?
Experiments
• Compare five classification methods on 30 data sets from the UCI ML database:
SBC = Simple Bayesian Classifier
Default = “Choose class with most representatives in data”
C4.5 = Quinlan’s decision tree induction system
PEBLS = An instance-based learning system
CN2 = A rule-induction system
For SBC, numeric values were discretized into ten equal-length intervals.
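For comparison with the √n method, equal-length discretization can be sketched as follows. The function name `equal_width_bin` and the choice to clamp out-of-range values into the first or last interval are assumptions for illustration; the slides only state that ten equal-length intervals were used.

```python
def equal_width_bin(value, lo, hi, k=10):
    """Return the index (0..k-1) of the equal-length interval containing value.

    lo and hi are the smallest and largest values seen for the attribute.
    """
    if hi <= lo:
        raise ValueError("hi must be greater than lo")
    idx = int((value - lo) / (hi - lo) * k)
    # Clamp so that value == hi (or anything outside the range) gets a valid bin.
    return min(max(idx, 0), k - 1)

# Example with the Humidity range from the earlier slide (25 to 99).
print(equal_width_bin(25, 25, 99), equal_width_bin(99, 25, 99))
```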
[Results table; values not recovered. Rows:]
Row 1: Number of domains in which SBC was more accurate versus less accurate than the corresponding classifier
Row 2: Same as row 1, but significant at 95% confidence
Row 3: Average rank over all domains (1 is best in each domain)
Measuring Attribute Dependence
They used a simple, pairwise mutual information measure. For attributes Am and An, dependence is defined as
D(Am, An | C) = H(Am | C) + H(An | C) − H(AmAn | C),
where AmAn is a “derived attribute” whose values consist of the possible combinations of values of Am and An, and H(· | C) is the entropy of an attribute given the class.
Note: If Am and An are independent, then D(Am, An | C) = 0.
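A rough sketch of such a pairwise measure, assuming D(Am, An | C) = H(Am | C) + H(An | C) − H(AmAn | C) with the standard conditional entropy estimated from counts. The paper's exact entropy estimate may differ, and the names `cond_entropy` and `dependence` are hypothetical.

```python
import math
from collections import Counter

def cond_entropy(xs, cs):
    """Estimate H(X | C) in bits from paired samples xs (values) and cs (classes)."""
    n = len(xs)
    joint_counts = Counter(zip(xs, cs))
    class_counts = Counter(cs)
    # H(X | C) = - sum over (x, c) of P(x, c) * log2 P(x | c)
    return -sum((cnt / n) * math.log2(cnt / class_counts[c])
                for (x, c), cnt in joint_counts.items())

def dependence(am, an, cs):
    """D(Am, An | C) using the derived attribute whose values are pairs (am, an)."""
    derived = list(zip(am, an))
    return cond_entropy(am, cs) + cond_entropy(an, cs) - cond_entropy(derived, cs)

# Two attributes that are independent within each class give D = 0;
# two identical binary attributes give D = 1 bit.
classes = ["+"] * 4 + ["-"] * 4
print(dependence([0, 0, 1, 1] * 2, [0, 1, 0, 1] * 2, classes))
print(dependence([0, 1, 0, 1], [0, 1, 0, 1], ["+"] * 4))
```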
Results: (1) SBC is more successful than more complex methods, even when there is substantial dependence among attributes. (2) No correlation between degree of attribute dependence and SBC’s rank. But why????
An Example
• Let C = {+, −}, and attributes = {A, B, C}.
• Let P(+) = P(−) = 1/2.
• Suppose A and C are completely independent, and A and B are completely dependent (e.g., A = B).
• Optimal classification procedure:
This leads to the following conditions:
• Optimal classifier: If P(A|+) P(C|+) > P(A|−) P(C|−) then class = +, else class = −
• SBC: If P(A|+)^2 P(C|+) > P(A|−)^2 P(C|−) then class = +, else class = −
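In the paper, the authors use Bayes’ theorem to rewrite these conditions in terms of p = P(+ | A) and q = P(+ | C), and plot the “decision boundaries” for the optimal classifier and for the SBC.
[Figure: decision boundaries of the optimal classifier and the SBC in the (p, q) plane, separating the + region from the − region.]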
Even though A and B are completely dependent, and the SBC assumes they are completely independent, the SBC gives the optimal classification in a very large part of the problem space! But why?
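One way to check this claim numerically is to compare the two decision rules on a grid. This sketch assumes P(+) = P(−) = 1/2 and writes p = P(+ | A), q = P(+ | C); by Bayes' theorem, P(A|+) is then proportional to p and P(A|−) to 1 − p (likewise for C and q), so the common factors cancel. The function names and the grid resolution are illustrative choices.

```python
def optimal_is_plus(p, q):
    # P(A|+) P(C|+) > P(A|-) P(C|-)  becomes  p q > (1 - p)(1 - q)
    return p * q > (1 - p) * (1 - q)

def sbc_is_plus(p, q):
    # P(A|+)^2 P(C|+) > P(A|-)^2 P(C|-)  becomes  p^2 q > (1 - p)^2 (1 - q)
    return p ** 2 * q > (1 - p) ** 2 * (1 - q)

def agreement(steps=200):
    """Fraction of grid points in the unit square where both rules pick the same class."""
    same = 0
    for i in range(steps):
        for j in range(steps):
            p = (i + 0.5) / steps
            q = (j + 0.5) / steps
            same += optimal_is_plus(p, q) == sbc_is_plus(p, q)
    return same / (steps * steps)

print(agreement())
```

On a uniform grid the two rules agree on roughly 90% of the (p, q) square. Whether a uniform grid is the right way to weight the problem space is itself an assumption.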
Explanation: Suppose C = {+, −} are the possible classes. Let x be a new example with attributes <a1, a2, ..., an>. The naive Bayes classifier calculates two probabilities, one for each class, and returns the class that has the maximum probability given x.
• The probability calculations are correct only if the independence assumption is correct.
• However, the classification is correct in all cases in which the relative ranking of the two probabilities, as calculated by the SBC, is correct!
• The latter covers a lot more cases than the former.
• Thus, the SBC is effective in many cases in which the independence assumption does not hold.
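A small numeric illustration of the ranking argument, using the earlier setup in which attribute B duplicates A and P(+) = P(−) = 1/2. The likelihood values below are hypothetical, chosen only to show a wrong probability estimate that still yields the right class.

```python
def posterior_plus(like_plus, like_minus, prior_plus=0.5):
    """P(+ | x) from the two class-conditional likelihoods of x."""
    num = prior_plus * like_plus
    return num / (num + (1 - prior_plus) * like_minus)

# Hypothetical likelihoods for the observed values of A and C.
p_a = {"+": 0.8, "-": 0.4}
p_c = {"+": 0.5, "-": 0.5}

# Correct posterior: B carries no extra information, so A is counted once.
true_post = posterior_plus(p_a["+"] * p_c["+"], p_a["-"] * p_c["-"])

# SBC posterior: A is effectively counted twice (once as A, once as B).
sbc_post = posterior_plus(p_a["+"] ** 2 * p_c["+"], p_a["-"] ** 2 * p_c["-"])

print(true_post, sbc_post)  # about 0.667 versus 0.8
```

The SBC overstates the probability of +, but both values lie above 1/2, so both rules predict +.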
From http://eecs.oregonstate.edu/~tgd/talks/BV.ppt
Variance
From http://eecs.oregonstate.edu/~tgd/talks/BV.ppt
Sources of Bias and Variance
• Bias arises when the classifier cannot represent the true function; that is, the classifier underfits the data
• Variance arises when the classifier overfits the data
• There is often a tradeoff between bias and variance
From knight.cis.temple.edu/~yates/cis8538/.../intro-text-classification.ppt
Bias-Variance Tradeoff
As a general rule, the more biased a learning machine, the less variance it has, and the more variance it has, the less biased it is.
From: http://www.ire.pw.edu.pl/~rsulej/NetMaker/index.php?pg=e06
Bias-Variance Tradeoff (continued)
Why does more bias generally go with less variance, and more variance with less bias?
From http://eecs.oregonstate.edu/~tgd/talks/BV.ppt
SVM Bias and Variance
• Bias-variance tradeoff controlled by σ
• Biased classifier (linear SVM) gives better results than a classifier that can represent the true decision boundary!
From http://eecs.oregonstate.edu/~tgd/talks/BV.ppt
Effect of Boosting
• In the early iterations, boosting is primarily a bias-reducing method
• In later iterations, it appears to be primarily a variance-reducing method
Bayesian Networks
Reading: S. Wooldridge, Bayesian belief networks (linked from class website)
A patient comes into a doctor’s office with a fever and a bad cough.
Hypothesis space H:
h1: patient has flu
h2: patient does not have flu
Data D: coughing = true, fever = true, smokes = true
Naive Bayes
[Diagram: flu is the cause; smokes, cough, and fever are its effects.]
Full joint probability distribution
[Table of joint probabilities; values not recovered.]
The sum of all boxes is 1. In principle, the full joint distribution can be used to answer any question about probabilities of these combined variables. However, the size of the full joint distribution scales exponentially with the number of variables, so it is expensive to store and to compute with.
Bayesian networks (“graphical models”)
• Idea is to represent dependencies (or causal relations) for all the variables so that space and computation-time requirements are minimized.
[Diagram: network over smokes, flu, cough, and fever.]
Conditional probability tables for each node
[Tables for flu, smoke, cough (given flu and smoke), and fever (given flu); values not recovered.]
Semantics of Bayesian networks
• If the network is correct, the full joint probability distribution can be calculated from the network:
P(x1, ..., xn) = Π (i = 1 to n) P(xi | parents(Xi)),
where parents(Xi) denotes specific values of the parents of Xi.
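The product rule can be sketched for the flu example as follows. The structure used here (flu and smokes as root nodes, cough depending on both, fever depending on flu) is an assumption about the slide's diagram, and every number in the tables is hypothetical since the original values were not recovered.

```python
from itertools import product

# Hypothetical conditional probability tables (each gives P(variable = True | parents)).
P_FLU = 0.1
P_SMOKES = 0.2
P_COUGH = {(True, True): 0.95, (True, False): 0.8,
           (False, True): 0.6, (False, False): 0.05}   # keyed by (flu, smokes)
P_FEVER = {True: 0.9, False: 0.05}                      # keyed by flu

def bern(p_true, value):
    """Probability that a Boolean variable takes `value` when P(True) = p_true."""
    return p_true if value else 1.0 - p_true

def joint(flu, smokes, cough, fever):
    """P(flu, smokes, cough, fever) as a product of one factor per node."""
    return (bern(P_FLU, flu) * bern(P_SMOKES, smokes)
            * bern(P_COUGH[(flu, smokes)], cough)
            * bern(P_FEVER[flu], fever))

def p_flu_given(smokes, cough, fever):
    """P(flu = True | smokes, cough, fever), by normalizing the joint."""
    num = joint(True, smokes, cough, fever)
    return num / (num + joint(False, smokes, cough, fever))

# The 16 joint entries sum to 1, as a full joint distribution must.
total = sum(joint(*values) for values in product([True, False], repeat=4))
print(total, p_flu_given(smokes=True, cough=True, fever=True))
```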
Example • Calculate
Another (famous, though weird) Example
[Network: Rain → Wet grass]
Question: If you observe that the grass is wet, what is the probability it rained?
[Network: Sprinkler → Wet grass ← Rain]
Question: If you observe that the sprinkler is on, what is the probability that the grass is wet? (Predictive inference.)
Question: If you observe that the grass is wet, what is the probability that the sprinkler is on? (Diagnostic inference.) Note that P(S) = 0.2. So, knowing that grass is wet increased probability that sprinkler is on.
Now assume the grass is wet and it rained. What is the probability that the sprinkler was on? Knowing that it rained decreases the probability that the sprinkler was on, given that the grass is wet.
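The three sprinkler questions can be answered by brute-force enumeration over the joint distribution. In this sketch only P(S) = 0.2 comes from the slides; P(R) and the wet-grass table are hypothetical values, and Sprinkler and Rain are assumed to be independent root nodes.

```python
from itertools import product

P_SPRINKLER = 0.2   # P(S) = 0.2, as stated in the slides
P_RAIN = 0.4        # hypothetical
P_WET = {(True, True): 0.95, (True, False): 0.9,
         (False, True): 0.9, (False, False): 0.1}   # hypothetical, keyed by (sprinkler, rain)

def bern(p_true, value):
    return p_true if value else 1.0 - p_true

def joint(s, r, w):
    """P(S = s, R = r, W = w) from the network factorization."""
    return bern(P_SPRINKLER, s) * bern(P_RAIN, r) * bern(P_WET[(s, r)], w)

def conditional(query, evidence):
    """P(query | evidence); both are dicts over the variable names 's', 'r', 'w'."""
    num = den = 0.0
    for s, r, w in product([True, False], repeat=3):
        world = {"s": s, "r": r, "w": w}
        if all(world[k] == v for k, v in evidence.items()):
            p = joint(s, r, w)
            den += p
            if all(world[k] == v for k, v in query.items()):
                num += p
    return num / den

p_w_given_s = conditional({"w": True}, {"s": True})               # predictive
p_s_given_w = conditional({"s": True}, {"w": True})               # diagnostic
p_s_given_wr = conditional({"s": True}, {"w": True, "r": True})   # explaining away
print(p_w_given_s, p_s_given_w, p_s_given_wr)
```

With these assumed numbers, observing wet grass raises the probability of the sprinkler above 0.2, and additionally learning that it rained lowers it again, matching the qualitative pattern described in the slides.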
[Network: Cloudy → Sprinkler, Cloudy → Rain, Sprinkler → Wet grass, Rain → Wet grass]
Question: Given that it is cloudy, what is the probability that the grass is wet?
In general...
• If the network is correct, the full joint probability distribution can be calculated from the network:
P(x1, ..., xn) = Π (i = 1 to n) P(xi | parents(Xi)),
where parents(Xi) denotes specific values of the parents of Xi.
• But efficient algorithms are needed to do this (e.g., “belief propagation”, “Markov Chain Monte Carlo”).
Complexity of Bayesian Networks
For n random Boolean variables:
• Full joint probability distribution: 2^n entries
• Bayesian network with at most k parents per node:
• Each conditional probability table: at most 2^k entries
• Entire network: at most n · 2^k entries
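To get a feel for the difference, the two counts can be compared directly. The function names and the choice of n = 20, k = 3 are illustrative.

```python
def full_joint_entries(n):
    """Entries in the full joint table over n Boolean variables."""
    return 2 ** n

def network_entries(n, k):
    """Upper bound on table entries for a network with at most k parents per node."""
    return n * 2 ** k

n, k = 20, 3
print(full_joint_entries(n), network_entries(n, k))  # 1048576 versus 160
```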