
Classification

Explore the principles of Bayes classification, decision theory, and pattern recognition in the context of digital image processing. Understand the importance of feature extraction, classification, and verification. Learn how to apply Occam's razor and overfitting prevention techniques in image analysis. Gain insights into discriminative vs. generative modeling for pattern classification challenges in computer vision.


Presentation Transcript


  1. Classification. ECE 847: Digital Image Processing. Stan Birchfield, Clemson University.

  2. Classification problems
  • Detection – search a set, find all instances of a class
  • Recognition – given an instance, label its identity
  • Verification – given an instance and a hypothesized identity, verify whether it is correct
  • Tracking – like detection, but with a local search and a fixed identity

  3. Classification issues
  • Feature extraction – needed for practical reasons; the distinction is somewhat arbitrary:
    • perfect feature extraction → classification is trivial
    • perfect classifier → no need for feature extraction
  • occlusion (missing features)
  • mereology – the study of part/whole relationships (e.g., POLOPONY; BEATS, not BE EATS)
  • segmentation – how can we classify before segmenting? how can we segment before classifying?
  • context
  • computational complexity: a 20x20 binary input has 2^400 ≈ 10^120 possible patterns!

  4. Mereology example What does this say?

  5. Decision theory
  • Decision theory – goal is to make a decision (i.e., set a decision boundary) so as to minimize cost
  • Pattern classification is perhaps the most important subfield of decision theory
  • Supervised learning: features, data sets, algorithm
  [Figure: scatter of 'o' and 'x' samples separated by a decision boundary]

  6. Overfitting
  • Could separate the training data perfectly using nearest neighbors
  • But poor generalization (overfitting) – will not work well on new data
  [Figure: the same 'o'/'x' scatter with an overly complex decision boundary]
  • Occam's razor – the simplest explanation is the best (a philosophical principle based upon the orderliness of the creation)

  7. Bayes decision theory
  Problem: Given a feature x, determine the most likely class: w1 or w2.
  [Figure: the two class-conditional pdfs p(x|w1) and p(x|w2); these are easy to measure with enough examples]
  http://www.cc.gatech.edu/~rehg/Classes/Computer_Vision_4495_7495.htm

  8. Bayes’ rule
  posterior = likelihood × prior / evidence:
  P(wi|x) = p(x|wi) P(wi) / p(x)
  where p(x|wi) is the likelihood (class-conditional pdf), P(wi) is the prior, and p(x) is the evidence (normalization factor).
  http://www.cc.gatech.edu/~rehg/Classes/Computer_Vision_4495_7495.htm

  9. What is P(w1|x)?
  • P(w1|x) is the probability of class 1 given the data x
  • Note that P(w1|x) + P(w2|x) = 1
  [Plot: P(w1|x) and P(w2|x) versus x, each ranging between 0.0 and 1.0; note that the area under each curve is not 1]
  http://www.cc.gatech.edu/~rehg/Classes/Computer_Vision_4495_7495.htm

  10. Bayes classifier
  • Classifier: select the class wi with the largest posterior P(wi|x)
  • Decision boundaries occur where the posteriors are equal, P(w1|x) = P(w2|x) (a sketch follows below)
  [Plot: P(w1|x) and P(w2|x) versus x; the x-axis splits into regions "select w2", "select w1", "select w2"]
  http://www.cc.gatech.edu/~rehg/Classes/Computer_Vision_4495_7495.htm
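
A minimal Python sketch of this decision rule, assuming made-up Gaussian class-conditional densities and priors; none of the specific numbers come from the slides.

```python
import numpy as np
from scipy.stats import norm

# Illustrative class-conditional densities p(x|w1), p(x|w2) and priors P(w1), P(w2)
# (means, spreads, and priors are made up for this sketch)
p_x_given_w1 = norm(loc=2.0, scale=1.0).pdf
p_x_given_w2 = norm(loc=5.0, scale=1.5).pdf
prior_w1, prior_w2 = 0.6, 0.4

def map_classify(x):
    """MAP rule: choose w1 if p(x|w1) P(w1) > p(x|w2) P(w2), else w2."""
    return "w1" if p_x_given_w1(x) * prior_w1 > p_x_given_w2(x) * prior_w2 else "w2"

print([map_classify(x) for x in (1.0, 3.0, 4.5, 6.0)])
```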

  11. Bayes risk
  The total risk is the expected loss when using the classifier:
  R = ∫ P(error | x) p(x) dx, where P(error | x) = min[ P(w1|x), P(w2|x) ]
  (We're assuming the loss is constant here.)
  [Plot: P(w1|x) and P(w2|x) versus x; the shaded area under the smaller posterior is called the Bayes risk]
  http://www.cc.gatech.edu/~rehg/Classes/Computer_Vision_4495_7495.htm

  12. Discriminative vs. Generative Finding a decision boundary is not the same as modeling a conditional density. Note: Bug in Forsyth-Ponce book: P(1|x)+P(2|x) != 1 http://www.cc.gatech.edu/~rehg/Classes/Computer_Vision_4495_7495.htm

  13. Bayes’ rule
  [Venn diagram: sample space W containing events A and B with intersection A∩B]
  P(A|B) = P(A∩B) / P(B)
  P(B|A) = P(A∩B) / P(A)
  so P(A|B) P(B) = P(B|A) P(A)
  hence P(A|B) = P(B|A) P(A) / P(B)
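
As a quick sanity check of these identities, here is a tiny numeric example in Python with made-up probabilities (not from the slides).

```python
# Made-up probabilities for two events A and B inside a sample space W
p_a_and_b, p_a, p_b = 0.12, 0.3, 0.4

p_a_given_b = p_a_and_b / p_b            # P(A|B) = P(A∩B) / P(B)
p_b_given_a = p_a_and_b / p_a            # P(B|A) = P(A∩B) / P(A)

# Bayes' rule follows: P(A|B) = P(B|A) P(A) / P(B)
assert abs(p_a_given_b - p_b_given_a * p_a / p_b) < 1e-12
print(p_a_given_b, p_b_given_a)          # 0.3 and 0.4
```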

  14. Bayes’ rule applied to classification
  [Venn diagram: sample space W partitioned into categories w1, w2, w3, with x marking an instance]
  Each instance must be in one of the categories: Σi P(x ∈ wi) = 1

  15. Bayes’ rule applied to classification
  [Venn diagram: sample space W with categories w1, w2, w3 and the measurement set Z overlapping them]
  Now suppose we take a measurement, and Z is the set of all instances that produce this measurement. Which is the most likely category?
  From Bayes’ rule: P(x ∈ wi | x ∈ Z) = P(x ∈ Z | x ∈ wi) P(x ∈ wi) / P(x ∈ Z)
  In simpler notation: P(wi|z) = P(z|wi) P(wi) / P(z)

  16. Bayes’ rule applied to classification
  Maximum a posteriori (MAP) estimation: choose arg max_i P(wi|z) = arg max_i P(z|wi) P(wi) / P(z) = arg max_i P(z|wi) P(wi), since P(z) is the same for all i.
  If the prior is uniform, this reduces to maximum likelihood (ML) estimation: choose arg max_i P(z|wi), because P(wi) is the same for all i.

  17. Bayes’ rule applied to two-category classification
  Is P(w1|z) > P(w2|z)?
  Equivalently, is P(z|w1) P(w1) / P(z) > P(z|w2) P(w2) / P(z)?
  Equivalently, is P(z|w1) P(w1) > P(z|w2) P(w2)?
  MAP rule: choose w1 if P(z|w1) P(w1) > P(z|w2) P(w2), and w2 otherwise.

  18. Overview
  [Block diagram: in the WORLD, a switch selects class w1 or w2, which produces a measurement z; in INFERENCE, the classifier maps z back to a decision w1 or w2]

  19. Estimating density: histogram

  20. Estimating density: Gaussian, Parzen window

  21. Estimating density: histogram with partial voting

  22. Class-conditional densities and posterior probabilities (figures from Duda, Hart, Stork, Pattern Classification, 2nd ed., 2000)

  23. Multiple categories from Duda, Hart, Stork, Pattern Classification, 2nd ed., 2000

  24. Example (assuming equal prior probabilities) Set g1(x)=-g2(x) from Duda, Hart, Stork, Pattern Classification, 2nd ed., 2000

  25. Overfitting from Duda, Hart, Stork, Pattern Classification, 2nd ed., 2000

  26. Parzen windows [figure from Duda, Hart, Stork, Pattern Classification, 2nd ed., 2000: estimates for varying window width and number of samples; convergence regardless of window width]

  27. Parzen windows [a second figure from Duda, Hart, Stork, Pattern Classification, 2nd ed., 2000: estimates for varying window width and number of samples; convergence regardless of window width]

  28. Histograms
  • One way to compute class-conditional pdfs is to collect a bunch of examples and store a histogram
  • Then normalize (a sketch follows below)
  http://www.cc.gatech.edu/~rehg/Classes/Computer_Vision_4495_7495.htm
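
A minimal sketch of this in Python, using a made-up 1-D intensity feature; numpy's density=True option does the normalization.

```python
import numpy as np

# Made-up 1-D feature samples for one class (e.g., a pixel intensity)
rng = np.random.default_rng(0)
samples = rng.normal(loc=128, scale=20, size=1000)

# Histogram with density=True normalizes so the estimate integrates to 1
bins = np.linspace(0, 255, 33)              # 32 bins over the intensity range
pdf, edges = np.histogram(samples, bins=bins, density=True)

# pdf[i] approximates p(x | class) for x in [edges[i], edges[i+1])
print(pdf.sum() * np.diff(edges)[0])        # ≈ 1.0
```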

  29. Application: skin histograms
  • Skin has a very small range of (intensity-independent) colours, and little texture
  • Compute a colour measure, check if the colour is in this range, check if there is little texture (median filter)
  • See this as a classifier – we can set up the tests by hand, or learn them
  • Get class-conditional densities (histograms) and priors from data (by counting)
  • The classifier then labels a pixel as skin when its posterior (or likelihood ratio) exceeds a threshold (a sketch follows below)
  http://www.cc.gatech.edu/~rehg/Classes/Computer_Vision_4495_7495.htm
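
A hedged Python sketch of such a histogram-based skin classifier; the array names (skin_pixels, nonskin_pixels), the bin count, and the prior are placeholders, not values from the Jones and Rehg paper.

```python
import numpy as np

def rgb_histogram(pixels, bins=32):
    """3-D colour histogram over RGB, normalized into a class-conditional distribution."""
    hist, _ = np.histogramdd(pixels, bins=(bins, bins, bins),
                             range=((0, 256), (0, 256), (0, 256)))
    return hist / hist.sum()

def is_skin(rgb, h_skin, h_nonskin, prior_skin=0.1, bins=32):
    """MAP-style test: skin if P(rgb|skin) P(skin) > P(rgb|non-skin) P(non-skin)."""
    idx = tuple(int(c) * bins // 256 for c in rgb)
    return h_skin[idx] * prior_skin > h_nonskin[idx] * (1.0 - prior_skin)

# Usage sketch (skin_pixels / nonskin_pixels are placeholder (N, 3) arrays of RGB values):
# h_skin, h_nonskin = rgb_histogram(skin_pixels), rgb_histogram(nonskin_pixels)
# print(is_skin((210, 150, 130), h_skin, h_nonskin))
```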

  30. Finding skin color: 3D histogram in RGB space. M. J. Jones and J. M. Rehg, "Statistical Color Models with Application to Skin Detection," Int. J. of Computer Vision, 46(1):81-96, Jan. 2002.

  31. Histograms: skin vs. non-skin [figure: the learned skin and non-skin colour histograms]

  32. Results. Note: we have assumed that all pixels are independent! Context is ignored.
  http://www.cc.gatech.edu/~rehg/Classes/Computer_Vision_4495_7495.htm

  33. Confusion matrix (actual vs. predicted)
  • true positive = hit; false negative = miss = false dismissal = Type II error
  • false positive = false alarm = false detection = Type I error
  • sensitivity = true positive rate = hit rate = recall: TPR = TP / (TP + FN)
  • false negative rate: FNR = FN / (TP + FN)
  • false positive rate = false alarm rate = fallout: FPR = FP / (FP + TN)
  • specificity: SPC = TN / (FP + TN)
  • Note that TPR + FNR = 1 and FPR + SPC = 1

  34. Precision and recall
  • precision = TP / (TP + FP) (the positive predictive value)
  • recall = TP / (TP + FN) = sensitivity = TPR
  • The F-measure combines precision and recall: F = 2 · precision · recall / (precision + recall)
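
A small Python helper that computes these rates from raw confusion-matrix counts; the example counts are made up.

```python
def rates(tp, fp, fn, tn):
    """Compute the quantities above from raw confusion-matrix counts."""
    tpr = tp / (tp + fn)                    # sensitivity = recall = hit rate
    fpr = fp / (fp + tn)                    # false alarm rate
    spc = tn / (fp + tn)                    # specificity
    precision = tp / (tp + fp)              # positive predictive value
    f_measure = 2 * precision * tpr / (precision + tpr)
    return tpr, fpr, spc, precision, f_measure

print(rates(tp=40, fp=10, fn=5, tn=45))
```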

  35. Receiver operating characteristic (ROC) curve
  [Plot: TPR versus FPR for an image classifier, with the equal error rate (EER) = 88% marked; the corresponding confusion matrix is shown alongside]
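
One common way to trace an ROC curve is to sweep the decision threshold over the classifier's scores; a minimal sketch with made-up scores and labels.

```python
import numpy as np

# Made-up classifier scores and ground-truth labels (1 = positive, 0 = negative)
scores = np.array([0.9, 0.8, 0.7, 0.55, 0.5, 0.4, 0.3, 0.2])
labels = np.array([1, 1, 0, 1, 0, 0, 1, 0])

# Sweep the decision threshold to trace out (FPR, TPR) points on the ROC curve
for thresh in sorted(set(scores), reverse=True):
    pred = scores >= thresh
    tpr = (pred & (labels == 1)).sum() / (labels == 1).sum()
    fpr = (pred & (labels == 0)).sum() / (labels == 0).sum()
    print(f"threshold={thresh:.2f}  TPR={tpr:.2f}  FPR={fpr:.2f}")
```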

  36. Precision-recall graph http://nlp.stanford.edu/IR-book/roc.html

  37. Cross-validation
  • How well will the results generalize to a different dataset?
  • Partition the sample set into a training set and a test set
  • Repeat the procedure for multiple partitions and average the result (a sketch follows below)
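
A minimal k-fold sketch of this procedure in Python; train_and_score is a placeholder for whatever classifier routine is being evaluated.

```python
import numpy as np

def cross_validate(features, labels, train_and_score, k=5, seed=0):
    """Average test score over k random train/test partitions.
    train_and_score(train_X, train_y, test_X, test_y) -> score is supplied by the caller."""
    idx = np.random.default_rng(seed).permutation(len(labels))
    folds = np.array_split(idx, k)
    scores = []
    for i in range(k):
        test = folds[i]
        train = np.concatenate([folds[j] for j in range(k) if j != i])
        scores.append(train_and_score(features[train], labels[train],
                                      features[test], labels[test]))
    return float(np.mean(scores))
```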

  38. Naïve Bayes
  • Quantize image patches, then compute a histogram of patch types within a face
  • But histograms suffer from the curse of dimensionality: a histogram in N dimensions is intractable for N > 5
  • To solve this, assume independence among the pixels; the features are the patch types:
    P(image | face) = P(label 1 at (x1,y1) | face) · ... · P(label k at (xk,yk) | face)
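
A hedged sketch of that factored likelihood in Python; the table name p_label_given_class is a placeholder for per-position histograms learned from training faces.

```python
import numpy as np

def log_p_image_given_class(patch_labels, p_label_given_class):
    """Independence assumption: P(image|class) = prod_i P(label_i at position i | class).
    patch_labels[i] is the quantized patch type observed at position i;
    p_label_given_class[i, t] is the learned table P(type t at position i | class)."""
    probs = p_label_given_class[np.arange(len(patch_labels)), patch_labels]
    return float(np.sum(np.log(probs + 1e-12)))   # sum of logs avoids underflow
```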

  39. Histograms applied to faces and cars H. Schneiderman, T. Kanade. "A Statistical Method for 3D Object Detection Applied to Faces and Cars". IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2000)

  40. Density estimation
  [Plot: binomial distributions of k/n for increasing n; they peak at k/n = P = 0.7 for large n]

  41. Use k/n to estimate the mean (i.e., P)
  If we assume p(x) is continuous and the region R is small, so that p does not vary much inside it, then P = ∫_R p(x') dx' ≈ p(x) V, where V is the volume of R. Putting these together with k/n ≈ P gives the estimate p(x) ≈ (k/n) / V.
  • A finite number of samples means we cannot shrink V to zero
  • Two solutions:
    • specify the volume Vn as a function of n (e.g., Vn = V1/√n) and show that the resulting estimate converges – this is the Parzen window approach
    • specify kn as a function of n (e.g., kn = √n) – this is kNN

  42. Alternative: kernel density estimation (Parzen windows)
  k/n is the fraction of samples that fall into the volume V
  http://www.cc.gatech.edu/~rehg/Classes/Computer_Vision_4495_7495.htm

  43. Parzen windows
  • Non-parametric technique
  • Center a kernel at each data point, sum the results (and normalize) to get the pdf (a sketch follows below)
  http://www.cc.gatech.edu/~rehg/Classes/Computer_Vision_4495_7495.htm
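
A minimal 1-D sketch of this idea with a Gaussian kernel; the data and kernel width are made up.

```python
import numpy as np

def parzen_pdf(x, samples, h=1.0):
    """Gaussian Parzen estimate: average of kernels of width h centred at each sample."""
    x = np.atleast_1d(np.asarray(x, dtype=float))[:, None]   # query points as a column
    kernels = np.exp(-0.5 * ((x - samples) / h) ** 2)         # one Gaussian per data point
    return kernels.sum(axis=1) / (len(samples) * h * np.sqrt(2.0 * np.pi))

samples = np.array([1.0, 1.5, 2.0, 5.0, 5.5])                 # made-up 1-D data
print(parzen_pdf([1.0, 3.0, 5.0], samples, h=0.5))
```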

  44. Parzen windows http://www.cc.gatech.edu/~rehg/Classes/Computer_Vision_4495_7495.htm

  45. Gaussian Parzen Windows http://www.cc.gatech.edu/~rehg/Classes/Computer_Vision_4495_7495.htm

  46. Parzen Window Density Estimation http://www.cc.gatech.edu/~rehg/Classes/Computer_Vision_4495_7495.htm

  47. Comparison
  Histograms:
  • non-parametric
  • smoothing parameter = number of bins
  • can discard the data afterwards
  • discontinuous
  • boundaries arbitrary
  • d dimensions → M^d bins (curse of dimensionality)
  Parzen windows:
  • non-parametric
  • smoothing parameter = size of kernel
  • need the data always
  • discontinuous (box) or continuous (Gaussian)
  • boundaries data-driven (box) or no boundaries (Gaussian)
  • dimensionality not as much of a curse

  48. Another alternative: Locally Weighted Averaging (LWA)
  • Keep an instance database
  • At each query point, form a locally weighted average, with f(i) = 1 for positive examples and 0 for negative examples
  • Equivalent to Parzen windows
  • Memory-based, lazy learning, applicable to any kernel, can be slow (a sketch follows below)
  http://www.cc.gatech.edu/~rehg/Classes/Computer_Vision_4495_7495.htm
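
A minimal Python sketch of the locally weighted average with a Gaussian kernel; X, f, and the kernel width h are placeholders (X is the instance database, f the 0/1 labels).

```python
import numpy as np

def lwa_posterior(query, X, f, h=1.0):
    """Locally weighted average at a query point: weight each stored instance by a
    Gaussian kernel of its distance, with f = 1 for positives and 0 for negatives."""
    d2 = np.sum((np.asarray(X) - np.asarray(query)) ** 2, axis=1)   # squared distances
    w = np.exp(-0.5 * d2 / h ** 2)                                  # kernel weights
    return float(np.sum(w * f) / np.sum(w))                         # ≈ P(positive | query)
```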

  49. LWA classifier with a circular kernel [figure panels: all data (2 classes), LWA posterior, kernel weights]
  http://www.cc.gatech.edu/~rehg/Classes/Computer_Vision_4495_7495.htm

  50. K-Nearest Neighbors
  Classification = majority vote of the K nearest neighbors (a sketch follows below)
  http://www.cc.gatech.edu/~rehg/Classes/Computer_Vision_4495_7495.htm
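
A minimal k-NN sketch in Python; X and labels are placeholder arrays of stored instances and their classes.

```python
import numpy as np
from collections import Counter

def knn_classify(query, X, labels, k=3):
    """Majority vote among the k stored examples nearest to the query."""
    dist = np.sum((np.asarray(X) - np.asarray(query)) ** 2, axis=1)   # squared distances
    nearest = np.argsort(dist)[:k]                                    # indices of k closest
    return Counter(labels[i] for i in nearest).most_common(1)[0][0]
```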
