
Statistical Inference (By Michael Jordan)

Presentation Transcript


  1. Statistical Inference (By Michael Jordan) • Bayesian perspective • conditional perspective—inferences should be made conditional on the current data • natural in the setting of a long-term project with a domain expert • the optimist: let’s make the best possible use of our sophisticated inferential tool • Frequentist perspective • unconditional perspective—inferential methods should give good answers in repeated use • natural in the setting of writing software that will be used by many people with many data sets • the pessimist: let’s protect ourselves against bad decisions given that our inferential procedure is inevitably based on a simplification of reality

  2. Bayes Classifier • A probabilistic framework for solving classification problems • Conditional probability: P(C|A) = P(A, C) / P(A) and P(A|C) = P(A, C) / P(C) • Bayes theorem: P(C|A) = P(A|C) P(C) / P(A)

  3. Example of Bayes Theorem • Given: • A doctor knows that C causes A 50% of the time • Prior probability of any patient having C is 1/50,000 • Prior probability of any patient having A is 1/20 • If a patient has A, what’s the probability he/she has C?
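The question on slide 3 can be answered directly with Bayes’ theorem. A minimal sketch of the arithmetic in Python, using only the figures given on the slide (the variable names are ours):

```python
# Bayes' theorem: P(C|A) = P(A|C) * P(C) / P(A), with the slide's figures
p_a_given_c = 0.5        # C causes A 50% of the time
p_c = 1 / 50_000         # prior probability of a patient having C
p_a = 1 / 20             # prior probability of a patient having A

p_c_given_a = p_a_given_c * p_c / p_a
print(f"P(C|A) = {p_c_given_a:.4f}")   # P(C|A) = 0.0002
```

So even after observing A, the probability of C is only 0.02%, because C is so rare a priori.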

  4. Bayesian Classifiers • Consider each attribute and class label as random variables • Given a record with attributes (A1, A2,…,An) • Goal is to predict class C • Specifically, we want to find the value of C that maximizes P(C| A1, A2,…,An ) • Can we estimate P(C| A1, A2,…,An ) directly from data?

  5. Bayesian Classifiers • Approach: • compute the posterior probability P(C | A1, A2, …, An) for all values of C using Bayes’ theorem • Choose the value of C that maximizes P(C | A1, A2, …, An) • Equivalent to choosing the value of C that maximizes P(A1, A2, …, An | C) P(C), since the denominator P(A1, A2, …, An) is the same for every class • How to estimate P(A1, A2, …, An | C)?

  6. Naïve Bayes Classifier • Assume independence among attributes Ai when the class is given: • P(A1, A2, …, An | Cj) = P(A1 | Cj) P(A2 | Cj) … P(An | Cj) • Can estimate P(Ai | Cj) for all Ai and Cj. • A new point is classified to Cj if P(Cj) ∏i P(Ai | Cj) is maximal.
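A minimal sketch of this decision rule under the naïve independence assumption; the function name and data structures below are our own illustration, not part of the slides:

```python
import math

def naive_bayes_predict(priors, cond_probs, x):
    """Return the class c maximizing P(c) * prod_i P(x_i | c).

    priors:     dict mapping class -> P(c)
    cond_probs: dict mapping class -> attribute -> value -> P(value | c)
    x:          dict mapping attribute -> observed value
    """
    best_class, best_score = None, -math.inf
    for c, prior in priors.items():
        score = prior
        for attr, value in x.items():
            score *= cond_probs[c][attr][value]   # naive independence assumption
        if score > best_score:
            best_class, best_score = c, score
    return best_class, best_score
```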

  7. Training dataset Class: C1: buys_computer = ‘yes’ C2: buys_computer = ‘no’ Data sample X = (age<=30, Income=medium, Student=yes, Credit_rating=Fair)

  8. Naïve Bayesian Classifier: Example
  • Compute the priors P(Ci): P(buys_computer=“yes”) = 9/14 = 0.643, P(buys_computer=“no”) = 5/14 = 0.357
  • Compute P(X|Ci) for each class:
  P(age=“<=30” | buys_computer=“yes”) = 2/9 = 0.222   P(age=“<=30” | buys_computer=“no”) = 3/5 = 0.6
  P(income=“medium” | buys_computer=“yes”) = 4/9 = 0.444   P(income=“medium” | buys_computer=“no”) = 2/5 = 0.4
  P(student=“yes” | buys_computer=“yes”) = 6/9 = 0.667   P(student=“yes” | buys_computer=“no”) = 1/5 = 0.2
  P(credit_rating=“fair” | buys_computer=“yes”) = 6/9 = 0.667   P(credit_rating=“fair” | buys_computer=“no”) = 2/5 = 0.4
  • X = (age<=30, income=medium, student=yes, credit_rating=fair)
  P(X|buys_computer=“yes”) = 0.222 x 0.444 x 0.667 x 0.667 = 0.044
  P(X|buys_computer=“no”) = 0.6 x 0.4 x 0.2 x 0.4 = 0.019
  • P(X|Ci) P(Ci): P(X|buys_computer=“yes”) P(buys_computer=“yes”) = 0.028   P(X|buys_computer=“no”) P(buys_computer=“no”) = 0.007
  • X belongs to class “buys_computer=yes”
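Feeding the slide’s estimates into the naive_bayes_predict sketch from slide 6 reproduces the result; the 9/14 and 5/14 priors are the ones implied by the 0.028 and 0.007 products above:

```python
priors = {"yes": 9 / 14, "no": 5 / 14}   # class priors implied by the slide's final products
cond_probs = {
    "yes": {"age": {"<=30": 2 / 9}, "income": {"medium": 4 / 9},
            "student": {"yes": 6 / 9}, "credit_rating": {"fair": 6 / 9}},
    "no":  {"age": {"<=30": 3 / 5}, "income": {"medium": 2 / 5},
            "student": {"yes": 1 / 5}, "credit_rating": {"fair": 2 / 5}},
}
x = {"age": "<=30", "income": "medium", "student": "yes", "credit_rating": "fair"}

label, score = naive_bayes_predict(priors, cond_probs, x)
print(label, round(score, 3))   # yes 0.028
```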

  9. Naïve Bayes Classifier • If one of the conditional probabilities is zero, then the entire expression becomes zero. • Probability estimation: original: P(Ai|C) = Nic / Nc, Laplace: P(Ai|C) = (Nic + 1) / (Nc + c), m-estimate: P(Ai|C) = (Nic + m p) / (Nc + m), where Nic = number of class-C records with value Ai, Nc = number of class-C records, c: number of classes, p: prior probability, m: parameter (equivalent sample size)
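A minimal sketch of the two corrections as reconstructed above (the counts and parameter names follow the slide’s legend; the function itself is our illustration):

```python
def estimate_cond_prob(n_ic, n_c, num_classes=None, m=None, p=None):
    """Estimate P(Ai | C) from counts.

    n_ic: number of class-C training records with attribute value Ai
    n_c:  number of class-C training records
    Uses the m-estimate if m and p are given, the Laplace correction if
    num_classes is given, and the raw relative frequency otherwise.
    """
    if m is not None and p is not None:      # m-estimate: (Nic + m*p) / (Nc + m)
        return (n_ic + m * p) / (n_c + m)
    if num_classes is not None:              # Laplace: (Nic + 1) / (Nc + c)
        return (n_ic + 1) / (n_c + num_classes)
    return n_ic / n_c                        # original estimate (can be zero)

print(estimate_cond_prob(0, 5, num_classes=2))   # 1/7 instead of 0
```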

  10. Naïve Bayes (Summary) • Robust to isolated noise points because such points are averaged out. • Handle missing values by ignoring the instance during probability estimate calculations • Robust to irrelevant attributes. If Ai is irrelevant, then P(Ai | Y) becomes almost uniformly distributed. • Independence assumption may not hold for some attributes • Use other techniques such as Bayesian Belief Networks (BBN)

  11. Instance-Based Classifiers • Store the training records • Use training records to predict the class label of unseen cases

  12. Nearest Neighbor Classifiers • Basic idea: if it walks like a duck and quacks like a duck, then it’s probably a duck • [Figure: training records and a test record; compute the distance from the test record to the training records, then choose k of the “nearest” records]

  13. Nearest-Neighbor Classifiers • Requires three things • The set of stored records • Distance Metric to compute distance between records • The value of k, the number of nearest neighbors to retrieve • To classify an unknown record: • Compute distance to other training records • Identify k nearest neighbors • Use class labels of nearest neighbors to determine the class label of unknown record (e.g., by taking majority vote)

  14. Nearest Neighbor Classification • Compute distance between two points: • Euclidean distance: d(p, q) = sqrt(Σi (pi - qi)²) • Determine the class from the nearest neighbor list • take the majority vote of class labels among the k nearest neighbors • Weigh the vote according to distance • weight factor, w = 1/d²
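A minimal sketch of the classifier described on slides 13 and 14, with Euclidean distance and an optional 1/d² vote weight; the training-data layout and names are our own:

```python
import math
from collections import defaultdict

def euclidean(p, q):
    """Euclidean distance between two numeric feature vectors."""
    return math.sqrt(sum((pi - qi) ** 2 for pi, qi in zip(p, q)))

def knn_predict(training, x, k=3, weighted=False):
    """Classify x from a list of (features, label) training records.

    Takes the majority vote over the k nearest records; if weighted,
    each vote counts 1/d^2 as on slide 14.
    """
    neighbors = sorted(training, key=lambda rec: euclidean(rec[0], x))[:k]
    votes = defaultdict(float)
    for features, label in neighbors:
        d = euclidean(features, x)
        votes[label] += 1.0 / (d ** 2 + 1e-12) if weighted else 1.0
    return max(votes, key=votes.get)

train = [((1.0, 1.0), "A"), ((1.2, 0.8), "A"), ((5.0, 5.0), "B")]
print(knn_predict(train, (1.1, 1.0), k=3, weighted=True))   # A
```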

  15. Definition of Nearest Neighbor • The k-nearest neighbors of a record x are the data points that have the k smallest distances to x

  16. Nearest Neighbor Classification… • Choosing the value of k: • If k is too small, sensitive to noise points • If k is too large, neighborhood may include points from other classes

  17. K-nearest neighbors

  18. Distance functions • Dsum(A,B) = Dgender(A,B) + Dage(A,B) + Dsalary(A,B) • Dnorm(A,B) = Dsum(A,B) / max(Dsum) • Deuclid(A,B) = sqrt(Dgender(A,B)² + Dage(A,B)² + Dsalary(A,B)²)
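A minimal sketch of how these three combined distances could be computed; the per-attribute distances Dgender, Dage, and Dsalary are not defined on the slide, so the inputs below are hypothetical values assumed to be already computed (e.g. a 0/1 mismatch for gender, normalized differences for the numeric attributes):

```python
import math

def combined_distances(d_gender, d_age, d_salary, d_sum_max):
    """Combine per-attribute distances as on slide 18.

    d_sum_max is the largest Dsum observed over the data set,
    used to normalize Dsum into Dnorm.
    """
    d_sum = d_gender + d_age + d_salary
    d_norm = d_sum / d_sum_max
    d_euclid = math.sqrt(d_gender ** 2 + d_age ** 2 + d_salary ** 2)
    return d_sum, d_norm, d_euclid

# Hypothetical per-attribute distances between two records A and B
print(combined_distances(d_gender=1.0, d_age=0.2, d_salary=0.35, d_sum_max=3.0))
```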

  19. Remarks on Lazy vs. Eager Learning • Instance-based learning: lazy evaluation • Decision-tree and Bayesian classification: eager evaluation • Key differences • Lazy methods may consider the query instance xq when deciding how to generalize beyond the training data D • Eager methods cannot, since they have already committed to a global approximation before seeing the query • Efficiency: lazy methods spend less time training but more time predicting • Accuracy • Lazy methods effectively use a richer hypothesis space, since they use many local linear functions to form an implicit global approximation to the target function • Eager methods must commit to a single hypothesis that covers the entire instance space
