Statistical Inference (By Michael Jordan) • Bayesian perspective • conditional perspective—inferences should be made conditional on the current data • natural in the setting of a long-term project with a domain expert • the optimist: let's make the best possible use of our sophisticated inferential tool • Frequentist perspective • unconditional perspective—inferential methods should give good answers in repeated use • natural in the setting of writing software that will be used by many people with many data sets • the pessimist: let's protect ourselves against bad decisions given that our inferential procedure is inevitably based on a simplification of reality
Bayes Classifier • A probabilistic framework for solving classification problems • Conditional probability: P(C | A) = P(A, C) / P(A) and P(A | C) = P(A, C) / P(C) • Bayes theorem: P(C | A) = P(A | C) P(C) / P(A)
Example of Bayes Theorem • Given: • A doctor knows that C causes A 50% of the time • Prior probability of any patient having C is 1/50,000 • Prior probability of any patient having A is 1/20 • If a patient has A, what’s the probability he/she has C?
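Plugging the stated quantities into Bayes theorem gives the answer directly:

```latex
P(C \mid A) = \frac{P(A \mid C)\, P(C)}{P(A)}
            = \frac{0.5 \times 1/50{,}000}{1/20}
            = 0.0002
```

So even though C causes A half the time, a patient with A has only a 0.02% chance of having C, because C itself is so rare.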
Bayesian Classifiers • Consider each attribute and class label as random variables • Given a record with attributes (A1, A2,…,An) • Goal is to predict class C • Specifically, we want to find the value of C that maximizes P(C| A1, A2,…,An ) • Can we estimate P(C| A1, A2,…,An ) directly from data?
Bayesian Classifiers • Approach: • compute the posterior probability P(C | A1, A2, …, An) for all values of C using the Bayes theorem • Choose value of C that maximizes P(C | A1, A2, …, An) • Equivalent to choosing value of C that maximizes P(A1, A2, …, An|C) P(C) • How to estimate P(A1, A2, …, An | C )?
Naïve Bayes Classifier • Assume independence among attributes Ai when the class is given: P(A1, A2, …, An | Cj) = P(A1 | Cj) P(A2 | Cj) … P(An | Cj) • P(Ai | Cj) can be estimated for all Ai and Cj from the training data • A new point is classified as Cj if P(Cj) ∏ P(Ai | Cj) is maximal (see the sketch below)
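As a concrete illustration of the bullets above, here is a minimal sketch of estimating P(Cj) and P(Ai | Cj) from categorical records and picking the maximizing class; the record format and function names are my own, not from the slides:

```python
from collections import Counter, defaultdict

def train_naive_bayes(records, class_attr):
    """Estimate P(Cj) and P(Ai = v | Cj) from categorical records (list of dicts)."""
    class_counts = Counter(r[class_attr] for r in records)
    value_counts = defaultdict(Counter)          # (attribute, class) -> value counts
    for r in records:
        c = r[class_attr]
        for attr, value in r.items():
            if attr != class_attr:
                value_counts[(attr, c)][value] += 1

    priors = {c: n / len(records) for c, n in class_counts.items()}

    def predict(x):
        # Classify x as the class Cj that maximizes P(Cj) * prod_i P(Ai | Cj)
        def score(c):
            s = priors[c]
            for attr, value in x.items():
                s *= value_counts[(attr, c)][value] / class_counts[c]
            return s
        return max(priors, key=score)

    return predict
```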
Training dataset • Class: C1: buys_computer = "yes", C2: buys_computer = "no" • Data sample X = (age <= 30, income = medium, student = yes, credit_rating = fair) • [Table of 14 training records not shown: 9 with buys_computer = "yes", 5 with buys_computer = "no"]
Naïve Bayesian Classifier: Example • Compute P(X|Ci) for each class:
P(age = "<=30" | buys_computer = "yes") = 2/9 = 0.222
P(age = "<=30" | buys_computer = "no") = 3/5 = 0.6
P(income = "medium" | buys_computer = "yes") = 4/9 = 0.444
P(income = "medium" | buys_computer = "no") = 2/5 = 0.4
P(student = "yes" | buys_computer = "yes") = 6/9 = 0.667
P(student = "yes" | buys_computer = "no") = 1/5 = 0.2
P(credit_rating = "fair" | buys_computer = "yes") = 6/9 = 0.667
P(credit_rating = "fair" | buys_computer = "no") = 2/5 = 0.4
• For X = (age <= 30, income = medium, student = yes, credit_rating = fair):
P(X | buys_computer = "yes") = 0.222 × 0.444 × 0.667 × 0.667 = 0.044
P(X | buys_computer = "no") = 0.6 × 0.4 × 0.2 × 0.4 = 0.019
• Class priors: P(buys_computer = "yes") = 9/14 = 0.643, P(buys_computer = "no") = 5/14 = 0.357
• P(X|Ci) × P(Ci):
P(X | buys_computer = "yes") × P(buys_computer = "yes") = 0.028
P(X | buys_computer = "no") × P(buys_computer = "no") = 0.007
• Therefore X belongs to class buys_computer = "yes"
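The arithmetic in this example can be checked in a few lines; the class priors 9/14 and 5/14 follow from the counts behind the conditional probabilities:

```python
# Conditional probabilities of X's attribute values, copied from the slide
p_x_given_yes = 0.222 * 0.444 * 0.667 * 0.667   # ~0.044
p_x_given_no  = 0.600 * 0.400 * 0.200 * 0.400   # ~0.019

# Class priors implied by the 9 "yes" and 5 "no" training records
p_yes, p_no = 9 / 14, 5 / 14

print(p_x_given_yes * p_yes)   # ~0.028  -> maximal, so predict buys_computer = "yes"
print(p_x_given_no * p_no)     # ~0.007
```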
Naïve Bayes Classifier • If one of the conditional probabilities is zero, then the entire expression becomes zero. • Probability estimation:
Original: P(Ai | C) = N_ic / N_c
Laplace: P(Ai | C) = (N_ic + 1) / (N_c + c)
m-estimate: P(Ai | C) = (N_ic + m·p) / (N_c + m)
where c: number of classes, p: prior probability, m: parameter (equivalent sample size), N_c: number of training records of class C, N_ic: number of class-C records with attribute value Ai
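A small sketch of these corrections in code (function names are illustrative); with either correction, a zero count no longer wipes out the whole product:

```python
def laplace_estimate(n_ic, n_c, c):
    """Laplace correction: P(Ai | C) = (N_ic + 1) / (N_c + c)."""
    return (n_ic + 1) / (n_c + c)

def m_estimate(n_ic, n_c, p, m):
    """m-estimate: P(Ai | C) = (N_ic + m*p) / (N_c + m)."""
    return (n_ic + m * p) / (n_c + m)

# An attribute value never seen with a class still gets a small non-zero probability:
print(laplace_estimate(0, 9, c=3))        # 0.083... instead of 0
print(m_estimate(0, 9, p=1/3, m=3))       # 0.083... instead of 0
```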
Naïve Bayes (Summary) • Robust to isolated noise points, because such points are averaged out • Handles missing values by ignoring the instance during probability estimation • Robust to irrelevant attributes: if Ai is irrelevant, P(Ai | Y) becomes almost uniformly distributed • The independence assumption may not hold for some attributes; in that case use other techniques such as Bayesian Belief Networks (BBN)
Instance-Based Classifiers • Store the training records • Use training records to predict the class label of unseen cases
Nearest Neighbor Classifiers • Basic idea: if it walks like a duck and quacks like a duck, then it's probably a duck • [Figure: compute the distance from the test record to the training records, then choose k of the "nearest" records]
Nearest-Neighbor Classifiers • Requires three things • The set of stored records • Distance Metric to compute distance between records • The value of k, the number of nearest neighbors to retrieve • To classify an unknown record: • Compute distance to other training records • Identify k nearest neighbors • Use class labels of nearest neighbors to determine the class label of unknown record (e.g., by taking majority vote)
Nearest Neighbor Classification • Compute the distance between two points, e.g. Euclidean distance: d(p, q) = sqrt(Σi (pi − qi)²) • Determine the class from the nearest neighbor list: take the majority vote of class labels among the k nearest neighbors, or weigh each vote according to distance with weight factor w = 1/d²
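A minimal k-nearest-neighbor sketch combining Euclidean distance, majority voting, and the optional 1/d² weighting; all names, the toy data, and the small epsilon guard against zero distance are my own additions:

```python
import math
from collections import defaultdict

def knn_predict(train, query, k, weighted=False):
    """Classify `query` given labelled points `train` = [(features, label), ...]."""
    def euclidean(p, q):
        return math.sqrt(sum((pi - qi) ** 2 for pi, qi in zip(p, q)))

    # The k training records nearest to the query point
    neighbors = sorted(train, key=lambda rec: euclidean(rec[0], query))[:k]

    votes = defaultdict(float)
    for features, label in neighbors:
        d = euclidean(features, query)
        # Either a plain majority vote or a vote weighted by w = 1/d^2
        votes[label] += 1.0 / (d ** 2 + 1e-12) if weighted else 1.0
    return max(votes, key=votes.get)

# Example: the query sits next to two "A" points, so both voting schemes return "A"
train = [((1.0, 1.0), "A"), ((1.2, 0.9), "A"), ((5.0, 5.0), "B")]
print(knn_predict(train, (1.1, 1.0), k=3, weighted=True))
```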
Definition of Nearest Neighbor • The k-nearest neighbors of a record x are the data points that have the k smallest distances to x
Nearest Neighbor Classification… • Choosing the value of k: • If k is too small, sensitive to noise points • If k is too large, neighborhood may include points from other classes
Distance functions • Dsum(A,B) = Dgender(A,B) + Dage(A,B) + Dsalary(A,B) • Dnorm(A,B) = Dsum(A,B) / max(Dsum) • Deuclid(A,B) = sqrt(Dgender(A,B)² + Dage(A,B)² + Dsalary(A,B)²)
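A sketch of how such per-attribute distances could be combined, assuming a 0/1 distance for gender and range-scaled absolute differences for age and salary (the attribute ranges here are arbitrary placeholders, not values from the slides):

```python
import math

def d_gender(a, b):
    # Overlap distance for a categorical attribute: 0 if equal, 1 otherwise
    return 0.0 if a["gender"] == b["gender"] else 1.0

def d_age(a, b, age_range=100.0):
    # Absolute difference scaled into [0, 1] by an assumed maximum range
    return abs(a["age"] - b["age"]) / age_range

def d_salary(a, b, salary_range=150_000.0):
    return abs(a["salary"] - b["salary"]) / salary_range

def d_sum(a, b):
    return d_gender(a, b) + d_age(a, b) + d_salary(a, b)

def d_norm(a, b, max_sum=3.0):
    # Each component lies in [0, 1], so Dsum is at most 3
    return d_sum(a, b) / max_sum

def d_euclid(a, b):
    return math.sqrt(d_gender(a, b) ** 2 + d_age(a, b) ** 2 + d_salary(a, b) ** 2)

a = {"gender": "F", "age": 30, "salary": 60_000}
b = {"gender": "M", "age": 40, "salary": 90_000}
print(d_sum(a, b), d_norm(a, b), d_euclid(a, b))
```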
Remarks on Lazy vs. Eager Learning • Instance-based learning: lazy evaluation • Decision-tree and Bayesian classification: eager evaluation • Key differences • A lazy method may consider the query instance xq when deciding how to generalize beyond the training data D • An eager method cannot, since it has already committed to a global approximation before seeing the query • Efficiency: lazy methods spend less time training but more time predicting • Accuracy • A lazy method effectively uses a richer hypothesis space, since it uses many local linear functions to form its implicit global approximation to the target function • An eager method must commit to a single hypothesis that covers the entire instance space