
Principles of Stochastic Discrimination and Ensemble Learning. ICPR 2006. Tin Kam Ho, Bell Laboratories, Lucent Technologies


Presentation Transcript


  1. Principles of Stochastic Discrimination and Ensemble Learning. ICPR 2006. Tin Kam Ho, Bell Laboratories, Lucent Technologies

  2. Supervised Classification Discrimination, Anomaly Detection

  3. Training Data in a Feature Space • Given to learn the class boundary

  4. Modeling the Classes Stochastic Discrimination

  5. Stochastic Discrimination • Make random guesses of the class models • Use the training set to evaluate them • Select and combine them to build a classifier … “decision fusion” But, does it work? Will any set of random guesses do it? Under what conditions does it work?

  6. History • Mathematical theory [Kleinberg 1990 AMAI, 1996 AoS, 2000 MCS] • Development of theory [Berlind 1994 Thesis, Chen 1997 Thesis] • Algorithms, experimentation, variants: [Kleinberg, Ho, Berlind, Bowen, Chen, Favata, Shekhawat, 1993 -] • Outlines of algorithm [Kleinberg 2000 PAMI] • Papers available at http://kappa.math.buffalo.edu/sd

  7. Part I. The SD Principles.

  8. Key Concepts and Tools in SD • Set-theoretic abstraction • Symmetry of probabilities in model space and feature space • Enrichment / Uniformity / Projectability • Convergence of discriminant by the law of large numbers

  9. Set-Theoretic Abstraction • Study classifiers by their decision regions • Ignore all algorithmic details • Two classifiers are equivalent if their decision regions are the same

  10. The Combinatorics: Set Covering

  11. Covering 5 points {A,B,C,D,E} using size-3 models {m1,…}

  12. Promoting uniformity of coverage

  13. Promoting uniformity of coverage

  14. A Uniform Cover of 5 points {A,B,C,D,E} using 10 models {m1,…,m10}

  15. Uniformity Implies Symmetry: The Counting Argument • Count the number of pairs (q,m) such that "model m covers point q"; call this number N • If each point is covered by the same number λ of models (the collection is a uniform cover), then N = 5 points in space × λ covering models for each point = 3 points in each model × μ models in the collection • So 5λ = 3μ => λ/μ = 3/5 (here λ = 6, μ = 10)
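The counting argument can be checked with a short simulation. This is a sketch under one assumption not fixed by the slide: the uniform cover is taken to be all C(5,3) = 10 size-3 subsets of the 5 points, which is one concrete uniform cover.

```python
from itertools import combinations

# Assumed uniform cover for illustration: every size-3 subset of the 5 points.
points = ["A", "B", "C", "D", "E"]
models = list(combinations(points, 3))          # mu = 10 models

# Count the pairs (q, m) with "model m covers point q" two ways.
N = sum(len(m) for m in models)                 # 3 points per model x mu models
coverage = {q: sum(q in m for m in models) for q in points}

lam = coverage["A"]                             # lambda: covering models per point
assert all(c == lam for c in coverage.values()) # uniform cover: same count everywhere
assert 5 * lam == 3 * len(models)               # 5 * lambda = 3 * mu
print(N, lam, len(models), lam / len(models))   # 30 6 10 0.6  (= 3/5)
```

The two sides of the equation count the same set of (point, model) incidences, which is why uniformity pins down the ratio λ/μ.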

  16. Example • Given a feature space F containing a set A with 10 points q0, …, q9 • Consider all subsets m of F that cover exactly 5 points of A, e.g., m = {q1, q2, q6, q8, q9} • Each model m has captured 5/10 = 0.5 of A: ProbF(q ∈ m | q ∈ A) = 0.5 • Call this set of models M0.5,A

  17. Some Members of M0.5,A

  18. Example members of M0.5,A, listed by the indices i of the points qi they cover: 01234, 01235, 01236, 23456, 12678, 13579

  19. There are C(10,5) = 252 models in M0.5,A • Permute this set randomly to give m1, m2, …, m252

  20. First 10 items, listed by the indices i of qi

  21. Make collections of increasing size: M1 = {m1}, M2 = {m1, m2}, …, M252 = {m1, m2, …, m252} • For each point q in A, count how many members of each Mt cover q • Normalize the count by the size of Mt to obtain Y(q,Mt) = (1/t) Σk=1..t Cmk(q) = ProbM(q ∈ m | m ∈ Mt), where Cm(q) = 1 if q ∈ m, and 0 otherwise

  22. The Y table continues … As t goes to 252, Y values become …

  23. Trace the value of Y(q, Mt) for each q as t increases • Values of Y converge to 0.5 • They are very close to 0.5 well before t = 252
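The convergence of Y to 0.5 can be reproduced directly by enumerating all 252 models of M0.5,A and tracking Y(q, Mt) for growing t. A minimal sketch; the random seed and shuffle are illustrative choices, not part of the slides:

```python
import random
from itertools import combinations

random.seed(0)                                  # illustrative seed
A = list(range(10))                             # points q0..q9, by index
models = [set(m) for m in combinations(A, 5)]   # all C(10,5) = 252 models
random.shuffle(models)                          # random permutation m1..m252

def Y(q, t):
    """Fraction of the first t models covering q: ProbM(q in m | m in Mt)."""
    return sum(q in m for m in models[:t]) / t

# Exact symmetry at t = 252: each point lies in C(9,4) = 126 of the 252 models.
assert all(Y(q, 252) == 0.5 for q in A)

# Y is already close to 0.5 well before t = 252.
print([round(Y(q, 50), 2) for q in A])
```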

  24. When t is large, we have Y(q,Mt) = ProbM(q ∈ m | m ∈ Mt) = 0.5 = ProbF(q ∈ m | q ∈ A) • We have a symmetry of probabilities in two different spaces: M and F • This is due to the uniform coverage of Mt on A, i.e., any two points in A are covered by the same number of models in Mt

  25. Two-class discrimination • Label points in A with 2 classes: TR1 = {q0, q1, q2, q7, q8}, TR2 = {q3, q4, q5, q6, q9} • Calculate a rating of each model m for each class: r1 = ProbF(q ∈ m | q ∈ TR1), r2 = ProbF(q ∈ m | q ∈ TR2)

  26. Enriched Models • Ratings r1 and r2 describe how well m captures classes c1 and c2 as observed with TR1 and TR2: r1(m) = ProbF(q ∈ m | q ∈ TR1), r2(m) = ProbF(q ∈ m | q ∈ TR2) • E.g., for m = {q1, q2, q6, q8, q9}: r1(m) = 3/5, r2(m) = 2/5, and the enrichment degree is d12(m) = r1(m) − r2(m) = 0.2
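The ratings of the example model follow directly from set intersections with the two training sets. A minimal sketch using the labels from the slide:

```python
# Training sets and example model from the slide, by point index.
TR1 = {0, 1, 2, 7, 8}                 # class c1
TR2 = {3, 4, 5, 6, 9}                 # class c2
m = {1, 2, 6, 8, 9}                   # example model

r1 = len(m & TR1) / len(TR1)          # ProbF(q in m | q in TR1) = 3/5
r2 = len(m & TR2) / len(TR2)          # ProbF(q in m | q in TR2) = 2/5
d12 = r1 - r2                         # enrichment degree, about 0.2
print(r1, r2, round(d12, 3))          # 0.6 0.4 0.2
```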

  27. The Discriminant • Recall Cm(q) = 1 if q ∈ m, and 0 otherwise • Define X12(q,m) = (Cm(q) − r2(m)) / (r1(m) − r2(m)) • Define a discriminant Y12(q,Mt) = (1/t) Σk=1..t X12(q,mk)

  28. The Y table continues … As t goes to 252, Y values become …

  29. Trace the value of Y(q, Mt) for each q as t increases • Values of Y converge to 1 or 0 (1 for TR1, 0 for TR2) • They are very close to 1 or 0 well before t = 252

  30. Why? X12(q,m) = (Cm(q) − r2(m)) / (r1(m) − r2(m)), and Y12(q,Mt) = (1/t) Σk=1..t X12(q,mk)
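Putting the pieces together, the discriminant can be evaluated over the full collection. This sketch (same illustrative seed and permutation as before) shows Y12 reaching 1 on TR1 and 0 on TR2 at t = 252, up to floating-point rounding:

```python
import random
from itertools import combinations

random.seed(0)                                   # illustrative seed
A = list(range(10))
TR1, TR2 = {0, 1, 2, 7, 8}, {3, 4, 5, 6, 9}
models = [set(m) for m in combinations(A, 5)]    # M0.5,A: all 252 models
random.shuffle(models)

def X12(q, m):
    r1 = len(m & TR1) / 5                        # rating of m for class 1
    r2 = len(m & TR2) / 5                        # rating of m for class 2
    # For 5-point models over this 5/5 split, r1 != r2 always holds,
    # since r1 - r2 = (2k - 5)/5 for k points of m in TR1.
    return ((q in m) - r2) / (r1 - r2)

def Y12(q, t):
    return sum(X12(q, m) for m in models[:t]) / t

# Uniform coverage makes the discriminant exact at t = 252.
assert all(abs(Y12(q, 252) - 1.0) < 1e-9 for q in TR1)
assert all(abs(Y12(q, 252)) < 1e-9 for q in TR2)

# Well before t = 252, values already approach 1 (TR1) and 0 (TR2).
print([round(Y12(q, 60), 2) for q in A])
```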

  31. Profile of Coverage • For a fixed point q, find the fraction of models of each rating that cover it: fMt,r1,TR1(q) and fMt,r2,TR2(q) • Since Mt is expanded in a uniform way, as t increases, for all x, fMt,x,TRi(q) → x
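The limit fMt,x,TRi(q) → x can be verified exactly over the complete collection. A sketch grouping the 252 models by their r1 rating for a fixed point q0 in TR1:

```python
from collections import defaultdict
from itertools import combinations

A = list(range(10))
TR1 = {0, 1, 2, 7, 8}                          # class 1 training points
models = [set(m) for m in combinations(A, 5)]  # full collection, t = 252

q = 0                                          # fixed point q0, in TR1
total = defaultdict(int)                       # models per rating r1
cover = defaultdict(int)                       # ...that also cover q

for m in models:
    r1 = len(m & TR1) / 5
    total[r1] += 1
    cover[r1] += q in m

# For every rating r, the fraction of rating-r models covering q equals r.
for r in sorted(total):
    assert abs(cover[r] / total[r] - r) < 1e-9
print({r: round(cover[r] / total[r], 2) for r in sorted(total)})
```

The six keys of the dictionary correspond to the six model "types" (r1 = 0/5 through 5/5) mentioned on slide 32.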

  32. Ratings of m in Mt We have models of 6 different “types”

  33. Profile of Coverage of q0 at t = 10 (figure: the covering models grouped by rating, e.g., m3, m5, m10 and m2, m8)

  34. Ratings of m (repeated for reference)

  35. Profile of Coverage for a fixed point q in TRi: fMt,r,TRi(q) → r as t grows (figure: f(q) plotted against r)

  36. Profile of coverage as a function of r1: fMt,r1,TR1(q) • Profile of coverage as a function of r2: fMt,r2,TR2(q)

  37. Profile of coverage as a function of r1: fMt,r1,TR1(q) • Profile of coverage as a function of r2: fMt,r2,TR2(q)

  38. Decomposition of Y • Duality due to uniformity • Can be shown to be 0 for q ∈ TR2 in a similar way

  39. Projectability of Models • If F contains more than the training points q, and the models m are larger, including not only the q points but also their neighboring points p, then the same discriminant Y12 can be used to classify the p points • The points p and q are Mt-indiscernible

  40. Example Definition of a Model • Points within m are m-indiscernible

  41. m-indiscernibility • Two points cannot be distinguished w.r.t. the definition of m • Within the descriptive power of m, the two points are considered the same • Berlind's hierarchy of indiscernibility • Points A, B within m are m-indiscernible

  42. m-indiscernibility: similarity w.r.t. a specific criterion • Give me a book similar to this one. • A clone? • A photocopy on loose paper? • A translation? • Another adventure story? • Another paperback book? • …

  43. Simulated vs. Real-World Data • Model-based anomaly detection • How much of the real world do we need to reproduce in the simulator? • What properties is our discriminator sensitive to by construction? • Which ones are don't-cares? • How do we measure success? • "Transductive learning": no estimation of underlying distributions; good as long as you can make correct predictions

  44. In the SD method: Model Size and Projectability • Points within the same model share the same interpretation • Larger models -> ratings more stable w.r.t. sampling differences -> more similar classification between TR and TE • Tradeoff with the ease of achieving uniformity

  45. Some Possible Forms of Weak Models Choice of model form can be based on domain knowledge: What type of local generalization is expected?

  46. Enrichment and Convergence • Larger enrichment degree -> smaller variance in X -> Y converges faster • But, models with large enrichment degree are more difficult to obtain • Thus it is more difficult to achieve uniformity

  47. The 3-way Tension: Enrichment, Uniformity, Projectability

  48. The 3-way Tension, in more familiar terms • Enrichment: Discriminating Power • Uniformity: Complementary Information • Projectability: Generalization Power
