Decision Theory, Naïve Bayes, ROC Curves
Generative vs Discriminative Methods
• Logistic regression: h: x → y.
• When we only learn a mapping x → y it is called a discriminative method.
• Generative methods learn p(x,y) = p(x|y) p(y), i.e. for every class we learn a model over the input distribution.
• Advantage: leads to regularization for small datasets (but when N is large, discriminative methods tend to work better).
• We can easily combine various sources of information: say we have learned a model for attribute I and now receive additional information about attribute II, then p(y|x_I, x_II) ∝ p(x_II|y) p(x_I|y) p(y) ∝ p(x_II|y) p(y|x_I).
• Disadvantage: you model more than necessary for making decisions, and the input space (x-space) can be very high dimensional.
• If we assume the attributes are independent given the class, p(x|y) = ∏_k p(x_k|y); this is called "conditional independence of x|y" (a numeric sketch follows below).
• The corresponding classifier is called the "Naïve Bayes classifier".
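As an illustration of the factorization above, here is a minimal Python sketch (not from the slides; the class prior and the conditional tables are made-up numbers) of a generative model over two binary attributes under the conditional-independence assumption.

```python
# Toy generative model with two binary attributes x1, x2 and a binary class y.
# All numbers below are made up for illustration.
p_y = {0: 0.7, 1: 0.3}                          # class prior p(y)
p_x1_given_y = {0: [0.9, 0.1], 1: [0.4, 0.6]}   # p(x1=i | y), indexed by i
p_x2_given_y = {0: [0.8, 0.2], 1: [0.3, 0.7]}   # p(x2=i | y)

def joint(x1, x2, y):
    """p(x, y) = p(x1|y) p(x2|y) p(y) under conditional independence of x|y."""
    return p_x1_given_y[y][x1] * p_x2_given_y[y][x2] * p_y[y]

# The joint sums to 1 over all (x1, x2, y), as any proper distribution must.
total = sum(joint(x1, x2, y) for x1 in (0, 1) for x2 in (0, 1) for y in (0, 1))
print(total)  # 1.0 (up to floating point)
```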
Naïve Bayes: decisions
• Bayes' rule gives p(y|x) = p(y) ∏_k p(x_k|y) / p(x).
• This is the "posterior distribution" and it can be used to make a decision on what label to assign to a new data-case.
• Note that to make a decision you do not need the denominator p(x): pick the class y that maximizes p(y) ∏_k p(x_k|y).
• If we computed the posterior p(y|x_I) first, we can use it as a new prior for the new information x_II (prove this at home): p(y|x_I, x_II) ∝ p(x_II|y) p(y|x_I).
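A small sketch of the decision rule and of the "new prior" claim, reusing the same kind of made-up tables as above: the one-shot posterior p(y|x_I, x_II) and the sequential update via p(y|x_I) give the same answer, and the argmax can be taken on the unnormalized scores.

```python
# Posterior and decision for a toy Naive Bayes model with two binary attributes.
# All probability tables are made up for illustration.
p_y = [0.7, 0.3]                 # prior p(y) for y = 0, 1
p_x1 = [[0.9, 0.1], [0.4, 0.6]]  # p(x1=i | y): rows indexed by y, columns by i
p_x2 = [[0.8, 0.2], [0.3, 0.7]]  # p(x2=i | y)

def normalize(scores):
    """Turn unnormalized scores into a distribution over y."""
    z = sum(scores)
    return [s / z for s in scores]

x1, x2 = 1, 1  # observed attribute values

# One-shot posterior: p(y | x1, x2) ∝ p(x1|y) p(x2|y) p(y); the denominator p(x)
# is not needed to pick the most probable class.
scores = [p_x1[y][x1] * p_x2[y][x2] * p_y[y] for y in (0, 1)]
post = normalize(scores)
decision = max((0, 1), key=lambda y: scores[y])

# Sequential update: use p(y | x1) as the new prior when x2 arrives.
post_after_x1 = normalize([p_x1[y][x1] * p_y[y] for y in (0, 1)])
post_sequential = normalize([p_x2[y][x2] * post_after_x1[y] for y in (0, 1)])

print(post, post_sequential, decision)  # the two posteriors coincide
```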
Naïve Bayes: learning
• What do we need to learn from data?
  • p(y)
  • p(x_k|y) for all k
• A very simple rule is to look at the frequencies in the data (assuming discrete states):
  • p(y) = [nr. data-cases with label y] / [total nr. data-cases]
  • p(x_k=i|y) = [nr. data-cases in state x_k=i with label y] / [nr. data-cases with label y]
• To regularize we imagine that each state i has a small fractional number c of data-cases to begin with (K = total nr. of states of x_k):
  • p(x_k=i|y) = [c + nr. data-cases in state x_k=i with label y] / [Kc + nr. data-cases with label y]
• What difficulties do you expect if we do not assume conditional independence?
• Does NB over-estimate or under-estimate the uncertainty of its predictions?
• Practical guideline: work in the log-domain, i.e. compare log p(y) + Σ_k log p(x_k|y) across classes (see the sketch below).
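A minimal sketch of the counting estimates with the fractional pseudo-count c, followed by prediction in the log-domain. The data set, the attribute encodings, and the function names (train_nb, predict_nb) are made up for illustration.

```python
import numpy as np

def train_nb(X, y, n_states, n_classes, c=1.0):
    """Estimate p(y) and p(x_k=i | y) by counting, with c fractional
    pseudo-counts per state (K = number of states of attribute k)."""
    N, D = X.shape
    prior = np.array([(y == cls).sum() / N for cls in range(n_classes)])
    cond = []  # cond[k][cls, i] = p(x_k = i | y = cls)
    for k in range(D):
        K = n_states[k]
        table = np.empty((n_classes, K))
        for cls in range(n_classes):
            Xk = X[y == cls, k]
            counts = np.array([c + (Xk == i).sum() for i in range(K)])
            table[cls] = counts / (K * c + len(Xk))
        cond.append(table)
    return prior, cond

def predict_nb(x, prior, cond):
    """Work in the log-domain: log p(y) + sum_k log p(x_k | y), then argmax."""
    log_scores = np.log(prior)
    for k, xk in enumerate(x):
        log_scores = log_scores + np.log(cond[k][:, xk])
    return int(np.argmax(log_scores))

# Made-up data: 6 cases, one ternary and one binary attribute, 2 classes.
X = np.array([[0, 1], [1, 1], [2, 0], [0, 0], [1, 1], [2, 1]])
y = np.array([0, 0, 1, 0, 1, 1])
prior, cond = train_nb(X, y, n_states=[3, 2], n_classes=2)
print(predict_nb(np.array([2, 1]), prior, cond))
```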
Loss functions
• What if it is much more costly to make an error on predicting y=1 vs y=0?
• Example: y=1 is "patient has cancer", y=0 means "patient is healthy".
• Introduce the "expected loss function" (a numeric instance follows below):
  E[L] = Σ_k Σ_j ∫_{R_j} L_kj p(x, y=k) dx
• Each term is the probability-weighted cost of predicting class j while the true class is k with probability p(y=k, x).
• R_j is the region of x-space where an example is assigned to class j.
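A small numeric instance of the expected-loss sum, assuming a made-up discrete joint p(x, y), a made-up loss matrix for the cancer example, and an arbitrary region assignment.

```python
import numpy as np

# Made-up discrete example: x takes 3 values, y is binary (y=1 means "cancer").
# p_joint[k, x] = p(x, y=k); the numbers sum to 1.
p_joint = np.array([[0.30, 0.40, 0.20],   # k = 0 (healthy)
                    [0.01, 0.03, 0.06]])  # k = 1 (cancer)

# Loss matrix L[k, j]: cost of predicting j when the true class is k.
# Calling a cancer patient healthy (k=1, j=0) is made far more costly.
L = np.array([[0.0,   1.0],
              [100.0, 0.0]])

# A region assignment: R_j is the set of x values predicted as class j.
# Here x = 2 is assigned to class 1, the rest to class 0 (an arbitrary choice).
assign = np.array([0, 0, 1])

# E[L] = sum_k sum_j sum_{x in R_j} L[k, j] p(x, y=k)
expected_loss = sum(L[k, assign[x]] * p_joint[k, x]
                    for k in (0, 1) for x in (0, 1, 2))
print(expected_loss)
```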
Decision surface
• How shall we choose the regions R_j?
• Solution: minimize E[L] over {R_j}.
• Take an arbitrary point x.
• Compute Σ_k L_kj p(y=k|x) for all j and assign x to the class j that minimizes it (see the sketch below).
• Since we minimize for every x separately, the total integral is minimal.
• Places where the decision switches belong to the "decision surface".
• What matrix L corresponds to the decision rule on slide 2 using the posterior?
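A sketch of the pointwise rule, assuming a made-up loss matrix and a made-up posterior at some point x: the risk of each candidate class j is Σ_k L_kj p(y=k|x), and x goes to the class with the smallest risk.

```python
import numpy as np

# Loss matrix L[k, j]: cost of predicting j when the true class is k (made up).
L = np.array([[0.0,  1.0],
              [10.0, 0.0]])

def decide(posterior, L):
    """Assign x to the class j minimizing sum_k L[k, j] p(y=k | x)."""
    risk = L.T @ posterior        # risk[j] = sum_k L[k, j] p(y=k | x)
    return int(np.argmin(risk)), risk

# At a point x with posterior p(y|x) = [0.85, 0.15] class 1 still wins,
# because missing it is 10 times more costly than a false alarm.
print(decide(np.array([0.85, 0.15]), L))
```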
ROC Curve
• Assume 2 classes and 1 attribute.
• Plot the class-conditional densities p(x|y).
• Shift the decision boundary from right to left.
• As you move it the loss will change, so you want to find the point where it is minimized.
• If L = [0 1; 1 0], where is the expected loss minimal?
• As you shift the boundary, the true positive rate (TP) and the false positive rate (FP) change.
• By plotting the entire curve you can see the tradeoffs.
• Easily generalized to more attributes if you can find a decision threshold to vary.
[Figure: class-conditional densities for y=1 and y=0 along x, with a moving decision boundary]
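A sketch of shifting the boundary and tracking the expected loss, assuming made-up 1-D Gaussian class-conditionals, made-up priors, and the symmetric loss L = [0 1; 1 0] from the slide.

```python
import numpy as np

def gauss(x, mu, sigma):
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

# Made-up 1-D joints p(x, y=k) = p(x | y=k) p(y=k), evaluated on a grid.
xs = np.linspace(-5, 7, 2001)
dx = xs[1] - xs[0]
p_joint = np.stack([0.6 * gauss(xs, 0.0, 1.0),    # p(x, y=0)
                    0.4 * gauss(xs, 2.5, 1.0)])   # p(x, y=1)
L = np.array([[0.0, 1.0],
              [1.0, 0.0]])                        # symmetric 0/1 loss

def expected_loss(threshold):
    """Decide y=1 to the right of the threshold, y=0 to the left."""
    decide = (xs > threshold).astype(int)
    return sum(L[k, decide] * p_joint[k] for k in (0, 1)).sum() * dx

thresholds = np.linspace(-2, 5, 141)
losses = [expected_loss(t) for t in thresholds]
best = thresholds[int(np.argmin(losses))]
print(best)  # threshold minimizing E[L] on this made-up example
```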
Evaluation: ROC curves
[Figure: score distributions for class 1 (positives) and class 0 (negatives) with a moving threshold]
• TP = true positive rate = # positives classified as positive / # positives
• FP = false positive rate = # negatives classified as positive / # negatives
• TN = true negative rate = # negatives classified as negative / # negatives
• FN = false negative rate = # positives classified as negative / # positives
• Identify a threshold in your classifier that you can shift, and plot the ROC curve while you shift that parameter.
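A sketch of computing ROC points by sweeping a threshold over classifier scores; the scores, labels, threshold grid, and the helper name roc_points are made up for illustration.

```python
import numpy as np

def roc_points(scores, labels, thresholds):
    """TP and FP rates as the decision threshold on a 1-D score is shifted.
    A case is classified positive when its score exceeds the threshold."""
    scores, labels = np.asarray(scores), np.asarray(labels)
    n_pos, n_neg = (labels == 1).sum(), (labels == 0).sum()
    points = []
    for t in thresholds:
        pred = scores > t
        tp_rate = (pred & (labels == 1)).sum() / n_pos   # positives called positive
        fp_rate = (pred & (labels == 0)).sum() / n_neg   # negatives called positive
        points.append((fp_rate, tp_rate))
    return points

# Made-up scores: higher scores should indicate the positive class.
scores = [0.1, 0.3, 0.35, 0.4, 0.55, 0.6, 0.8, 0.9]
labels = [0,   0,   1,    0,   1,    0,   1,   1  ]
for fp, tp in roc_points(scores, labels, thresholds=np.linspace(0, 1, 11)):
    print(f"FP = {fp:.2f}  TP = {tp:.2f}")
```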