
Kernel Embedding for Nonlinear Discrimination: FLD, GLR, and MD in HDLSS Space

Explore the concept of kernel embedding for nonlinear discrimination, including its advantages and drawbacks when combined with classical discrimination methods such as Fisher Linear Discrimination (FLD), Gaussian Likelihood Ratio (GLR) discrimination, and the Mean Difference (MD) method. Visualizations of toy examples demonstrate the effects of polynomial embedding and the challenges posed by certain data distributions. Radial basis functions and support vector machines are also introduced as generalizations of the kernel embedding idea, along with Distance Weighted Discrimination (DWD) for HDLSS data.

Presentation Transcript


  1. Object Orie’d Data Analysis, Last Time • Classical Discrimination (aka Classification) • FLD & GLR very attractive • MD never better, sometimes worse • HDLSS Discrimination • FLD & GLR fall apart • MD much better • Maximal Data Piling • HDLSS space is a strange place

  2. Kernel Embedding Aizerman, Braverman and Rozoner (1964) • Motivating idea: Extend scope of linear discrimination, By adding nonlinear components to data (embedding in a higher dim’al space) • Better use of name: nonlinear discrimination?
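
A minimal added illustration of this idea (not from the slides): in one dimension no linear rule can separate an interval from its complement, but a quadratic embedding makes that separation linear.

  \[ x \;\longmapsto\; (x,\, x^2) \in \mathbb{R}^2 \]
  The rule "Class 1 when |x| < 1" is not of the form a x + b > 0, but in the
  embedded space it is the linear rule 1 - x^2 > 0, i.e. a half-plane bounded
  by the hyperplane x_2 = 1.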

  3. Kernel Embedding Stronger effects for higher order polynomial embedding: E.g. for cubic, linear separation can give 4 parts (or fewer)
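
Why a cubic embedding gives at most 4 parts (an added one-line argument): a linear rule in the cubic-embedded space has the form

  \[ a_0 + a_1 x + a_2 x^2 + a_3 x^3 > 0, \]

and a cubic polynomial has at most 3 real roots, so its sign can change at most 3 times, cutting the line into at most 4 intervals.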

  4. Kernel Embedding General View: for the original data matrix, add rows: i.e. embed in a Higher Dimensional Space, then slice with a hyperplane (see the sketch below)
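
A short Python sketch of the "add rows" view (an added illustration; the quadratic choice of added terms and all names here are assumptions, not the slide's exact display):

  import numpy as np

  def quadratic_embed(X):
      """Naive quadratic embedding of a d x n data matrix X (columns are data
      vectors): append rows of squares and pairwise products, so that a
      hyperplane in the embedded space corresponds to a quadratic boundary in
      the original space."""
      d, n = X.shape
      rows = [X, X ** 2]                              # original and squared coordinates
      for j in range(d):
          for k in range(j + 1, d):
              rows.append((X[j] * X[k])[None, :])     # cross-product terms
      return np.vstack(rows)

  # Example: a 2 x 5 data matrix becomes a (2 + 2 + 1) x 5 embedded matrix
  X = np.random.randn(2, 5)
  print(quadratic_embed(X).shape)                     # (5, 5)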

  5. Kernel Embedding Embedded Fisher Linear Discrimination: Choose Class 1, for a given data vector, when the FLD rule (computed in the embedded space) assigns it to Class 1 (a standard form is sketched below) • image of class boundaries in original space is nonlinear • allows more complicated class regions • Can also do Gaussian Lik. Rat. (or others) • Compute image by classifying points from original space
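
The missing display is presumably the usual FLD rule applied to embedded data; a standard form (a hedged reconstruction, not the slide's exact formula) is

  \[ \left( \Phi(x) - \tfrac{1}{2}(\bar{\Phi}_1 + \bar{\Phi}_2) \right)^{\top}
     \hat{\Sigma}_w^{-1} \left( \bar{\Phi}_1 - \bar{\Phi}_2 \right) > 0, \]

where \Phi is the embedding map, \bar{\Phi}_1, \bar{\Phi}_2 are the class means and \hat{\Sigma}_w is the pooled within-class covariance, all computed in the embedded space.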

  6. Kernel Embedding Visualization for Toy Examples: • Have Linear Disc. In Embedded Space • Study Effect in Original Data Space • Via Implied Nonlinear Regions Approach: • Use Test Set in Original Space (dense equally spaced grid) • Apply embedded discrimination Rule • Color Using the Result
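
A Python sketch of this recipe (an added illustration; embed and classify are hypothetical callables standing in for whatever embedding and discrimination rule are in use):

  import numpy as np
  import matplotlib.pyplot as plt

  def color_regions(embed, classify, xlim=(-3, 3), ylim=(-3, 3), m=200):
      """Evaluate an embedded discrimination rule on a dense, equally spaced
      grid in the original 2-d space and color each grid point by its class."""
      xs = np.linspace(xlim[0], xlim[1], m)
      ys = np.linspace(ylim[0], ylim[1], m)
      gx, gy = np.meshgrid(xs, ys)
      grid = np.vstack([gx.ravel(), gy.ravel()])      # 2 x (m*m), columns = test points
      labels = np.asarray(classify(embed(grid)))      # rule was fit in embedded space
      plt.pcolormesh(gx, gy, labels.reshape(m, m), shading="auto")
      plt.show()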

  7. Kernel Embedding Polynomial Embedding, Toy Example 1: Parallel Clouds

  8. Kernel Embedding Polynomial Embedding, Toy Example 1: Parallel Clouds • PC 1: • always bad • finds only “embedded greatest var.” • FLD: • stays good • GLR: • OK discrimination at data • but overfitting problems

  9. Kernel Embedding Polynomial Embedding, Toy Example 2: Split X

  10. Kernel Embedding Polynomial Embedding, Toy Example 2: Split X • FLD: • Rapidly improves with higher degree • GLR: • Always good • but never ellipse around blues…

  11. Kernel Embedding Polynomial Embedding, Toy Example 3: Donut

  12. Kernel Embedding Polynomial Embedding, Toy Example 3: Donut • FLD: • Poor fit for low degree • then good • no overfit • GLR: • Best with No Embed, • Square shape for overfitting?

  13. Kernel Embedding Drawbacks to polynomial embedding: • too many extra terms create spurious structure • i.e. have “overfitting” • HDLSS problems typically get worse

  14. Kernel Embedding Hot Topic Variation: “Kernel Machines” Idea: replace polynomials by other nonlinear functions e.g. 1: sigmoid functions from neural nets e.g. 2: radial basis functions (Gaussian kernels) Related to “kernel density estimation” (recall: smoothed histogram)
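
Typical choices for the replacement functions (standard forms, added here for concreteness):

  \[ \text{sigmoid: } s(t) = \frac{1}{1 + e^{-t}}, \qquad
     \text{Gaussian radial basis: } \varphi_c(x) =
     \exp\!\left( -\frac{\lVert x - c \rVert^2}{2\sigma^2} \right), \]

the latter being the same Gaussian kernel used in kernel density estimation, with \sigma playing the role of the bandwidth.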

  15. Kernel Embedding Radial Basis Functions: Note: there are several ways to embed: • Naïve Embedding (equally spaced grid) • Explicit Embedding (evaluate at data) • Implicit Embedding (inner prod. based) (everybody currently does the latter)

  16. Kernel Embedding Naïve Embedding, Radial basis functions: At some “grid points”, for a “bandwidth” (i.e. standard dev’n), consider the corresponding (higher dim’al) basis functions, and replace the data matrix with their values at the data (a sketch follows):
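
The missing displays are presumably along these lines (a hedged reconstruction; the notation is chosen here, not copied from the slide): at grid points g_1, ..., g_k with bandwidth \sigma, take

  \[ \varphi_j(x) = \exp\!\left( -\frac{\lVert x - g_j \rVert^2}{2\sigma^2} \right),
     \qquad j = 1, \dots, k, \]

and replace the d x n data matrix X = [x_1 ... x_n] by the k x n matrix with entries \varphi_j(x_i).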

  17. Kernel Embedding Naïve Embedding, Radial basis functions: For discrimination: work in radial basis space, with each new data vector represented by its vector of basis-function values:
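
With the same (assumed) notation, a new data vector x_0 is represented in radial basis space by

  \[ x_0 \;\longmapsto\; \bigl( \varphi_1(x_0), \dots, \varphi_k(x_0) \bigr)^{\top}, \]

and the chosen linear rule (FLD, GLR, ...) is applied to this vector.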

  18. Kernel Embedding Naïve Embedd’g, Toy E.g. 1: Parallel Clouds • Good at data • Poor outside

  19. Kernel Embedding Naïve Embedd’g, Toy E.g. 2: Split X • OK at data • Strange outside

  20. Kernel Embedding Naïve Embedd’g, Toy E.g. 3: Donut • Mostly good • Slight mistake for one kernel

  21. Kernel Embedding Naïve Embedding, Radial basis functions: Toy Example, Main lessons: • Generally good in regions with data, • Unpredictable where data are sparse

  22. Kernel Embedding Toy Example 4: Checkerboard Very Challenging! Linear Method? Polynomial Embedding?

  23. Kernel Embedding Toy Example 4: Checkerboard Polynomial Embedding: • Very poor for linear • Slightly better for higher degrees • Overall very poor • Polynomials don’t have needed flexibility

  24. Kernel Embedding Toy Example 4: Checkerboard Radial Basis Embedding + FLD Is Excellent!

  25. Kernel Embedding Drawbacks to naïve embedding: • Equally spaced grid too big in high d • Not computationally tractable (g^d grid points, for g points per dimension) Approach: • Evaluate only at data points • Not on full grid • But where data live

  26. Kernel Embedding Other types of embedding: • Explicit • Implicit Will be studied soon, after introduction to Support Vector Machines…

  27. Kernel Embedding There are many generalizations of this idea to other types of analysis, plus some clever computational ideas. E.g. “Kernel based, nonlinear Principal Components Analysis” Ref: Schölkopf, Smola and Müller (1998)

  28. Support Vector Machines Motivation: • Find a linear method that “works well” for embedded data • Note: Embedded data are very non-Gaussian • Suggests value of really new approach

  29. Support Vector Machines Classical References: • Vapnik (1982) • Boser, Guyon & Vapnik (1992) • Vapnik (1995) Excellent Web Resource: • http://www.kernel-machines.org/

  30. Support Vector Machines Recommended tutorial: • Burges (1998) Recommended Monographs: • Cristianini & Shawe-Taylor (2000) • Schölkopf & Smola (2002)

  31. Support Vector Machines Graphical View, using Toy Example: • Find separating plane • To maximize distances from data to plane • In particular the smallest distance • Data points closest to the plane are called support vectors • The gap between them is called the margin

  32. SVMs, Optimization Viewpoint Formulate Optimization problem, based on: • Data (feature) vectors • Class Labels • Normal Vector • Location (determines intercept) • Residuals (right side) • Residuals (wrong side) • Solve (convex problem) by quadratic programming
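
The standard soft-margin formulation matching these ingredients (a standard reference form, not necessarily the slide's exact display): with feature vectors x_i, labels y_i in {-1, +1}, normal vector w, intercept b and slack variables \xi_i for points on the wrong side,

  \[ \min_{w, b, \xi} \; \tfrac{1}{2}\lVert w \rVert^2 + C \sum_{i=1}^n \xi_i
     \quad \text{subject to} \quad
     y_i ( w^{\top} x_i + b ) \ge 1 - \xi_i, \;\; \xi_i \ge 0, \]

a convex problem solvable by quadratic programming.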

  33. SVMs, Optimization Viewpoint Lagrange Multipliers, primal formulation (separable case): • Minimize the primal Lagrangian, where the multipliers enforce the margin constraints • Dual Lagrangian version: • Maximize the dual objective • Get classification function (standard forms are sketched below)
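
Standard forms of the missing displays (a hedged reconstruction, separable case): primal Lagrangian, minimized over w, b with multipliers \alpha_i \ge 0,

  \[ L_P = \tfrac{1}{2}\lVert w \rVert^2
           - \sum_i \alpha_i \bigl[ y_i ( w^{\top} x_i + b ) - 1 \bigr]; \]

dual Lagrangian, maximized over \alpha_i \ge 0 subject to \sum_i \alpha_i y_i = 0,

  \[ L_D = \sum_i \alpha_i
           - \tfrac{1}{2} \sum_i \sum_j \alpha_i \alpha_j y_i y_j \, x_i^{\top} x_j; \]

classification function

  \[ f(x) = \operatorname{sign}\Bigl( \sum_i \alpha_i y_i \, x_i^{\top} x + b \Bigr). \]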

  34. SVMs, Computation Major Computational Point: • Classifier only depends on data through inner products! • Thus enough to only store inner products • Creates big savings in optimization • Especially for HDLSS data • But also creates variations in kernel embedding (interpretation?!?) • This is almost always done in practice

  35. SVMs, Comput’n & Embedding For an “Embedding Map”, e.g. Explicit Embedding: maximize the dual with embedded data and get the corresponding classification function (sketched below) • Straightforward application of embedding • But loses inner product advantage
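
With an embedding map \Phi, the explicit version is presumably (a hedged reconstruction): maximize

  \[ \sum_i \alpha_i - \tfrac{1}{2} \sum_i \sum_j \alpha_i \alpha_j y_i y_j \,
     \Phi(x_i)^{\top} \Phi(x_j), \qquad
     f(x) = \operatorname{sign}\Bigl( \sum_i \alpha_i y_i \,
     \Phi(x_i)^{\top} \Phi(x) + b \Bigr), \]

i.e. the embedded coordinates \Phi(x_i) are computed and stored explicitly.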

  36. SVMs, Comput’n & Embedding Implicit Embedding: maximize the kernelized dual and get the corresponding classification function (sketched below) • Still defined only via inner products • Retains optimization advantage • Thus used very commonly • Comparison to explicit embedding? • Which is “better”???
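
The implicit version replaces each inner product by a kernel function k(·,·), e.g. a Gaussian kernel, without ever computing \Phi (a hedged reconstruction): maximize

  \[ \sum_i \alpha_i - \tfrac{1}{2} \sum_i \sum_j \alpha_i \alpha_j y_i y_j \, k(x_i, x_j),
     \qquad
     f(x) = \operatorname{sign}\Bigl( \sum_i \alpha_i y_i \, k(x_i, x) + b \Bigr). \]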

  37. SVMs & Robustness Usually not severely affected by outliers, But a possible weakness: Can have very influential points Toy E.g., only 2 points drive SVM Notes: • Huge range of chosen hyperplanes • But all are “pretty good discriminators” • Only happens when whole range is OK??? • Good or bad?

  38. SVMs & Robustness Effect of violators (toy example): • Depends on distance to plane • Weak for violators nearby • Strong as they move away • Can have major impact on plane • Also depends on tuning parameter C

  39. SVMs, Computation Caution: available algorithms are not created equal Toy Example: • Gunn’s Matlab code • Todd’s Matlab code Serious errors in Gunn’s version, does not find real optimum…

  40. SVMs, Tuning Parameter Recall Regularization Parameter C: • Controls penalty for violation • I.e. lying on wrong side of plane • Appears in slack variables • Affects performance of SVM Toy Example: d = 50, Spherical Gaussian data

  41. SVMs, Tuning Parameter Toy Example: d = 50, Spherical Gaussian data X-axis: Opt. Dir’n, Other axis: SVM Dir’n • Small C: • Where is the margin? • Small angle to optimal (generalizable) • Large C: • More data piling • Larger angle (less generalizable) • Bigger gap (but maybe not better???) • Between: Very small range

  42. SVMs, Tuning Parameter Toy Example: d = 50, Spherical Gaussian data Careful look at small C: Put MD on horizontal axis • Shows SVM and MD same for C small • Mathematics behind this? • Separates for large C • No data piling for MD

  43. Distance Weighted Discrim’n Improvement of SVM for HDLSS Data Toy e.g. (similar to earlier movie)

  44. Distance Weighted Discrim’n Toy e.g.: Maximal Data Piling Direction - Perfect Separation - Gross Overfitting - Large Angle - Poor Gen’ability

  45. Distance Weighted Discrim’n Toy e.g.: Support Vector Machine Direction - Bigger Gap - Smaller Angle - Better Gen’ability - Feels support vectors too strongly??? - Ugly subpops? - Improvement?

  46. Distance Weighted Discrim’n Toy e.g.: Distance Weighted Discrimination - Addresses these issues - Smaller Angle - Better Gen’ability - Nice subpops - Replaces min dist. by avg. dist.

  47. Distance Weighted Discrim’n Based on Optimization Problem (sketched below): More precisely: work with an appropriate penalty for violations Optimization Method: Second Order Cone Programming • “Still convex” gen’n of quad’c program’g • Allows fast greedy solution • Can use available fast software (SDPT3, Michael Todd, et al)
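
The DWD optimization problem, as published by Marron, Todd & Ahn (stated here as a hedged reconstruction of the missing display): with residuals r_i = y_i ( w^{\top} x_i + b ) + \xi_i,

  \[ \min_{w, b, \xi} \; \sum_{i=1}^n \left( \frac{1}{r_i} + C\, \xi_i \right)
     \quad \text{subject to} \quad
     r_i \ge 0, \;\; \xi_i \ge 0, \;\; \lVert w \rVert \le 1, \]

so every point's (perturbed) distance to the plane enters through a sum of reciprocals, rather than only the minimum distance as in the SVM.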

  48. Distance Weighted Discrim’n 2-d Visualization: Pushes Plane Away From Data All Points Have Some Influence

  49. DWD Batch and Source Adjustment • Recall from Class Meeting, 9/6/05: • For Perou’s Stanford Breast Cancer Data • Analysis in Benito, et al (2004) Bioinformatics • https://genome.unc.edu/pubsup/dwd/ • Use DWD as useful direction vector to: • Adjust for Source Effects • Different sources of mRNA • Adjust for Batch Effects • Arrays fabricated at different times

  50. DWD Adj: Biological Class Colors & Symbols
