
Support Vector Clustering - Asa Ben-Hur, David Horn, Hava T. Siegelmann, Vladimir Vapnik



Presentation Transcript


  1. Support Vector Clustering - Asa Ben-Hur, David Horn, Hava T. Siegelmann, Vladimir Vapnik Data Mining – II Yiyu Chen Siddharth Pal Ritendra Datta Ashish Parulekar Shiva Kasiviswanathan

  2. Overview • Clustering and Classification • Support Vector Machines – a brief idea • Support Vector Clustering • Results • Conclusion

  3. Clustering and Classification • Classification – The task of assigning instances to pre-defined classes. • E.g. Deciding whether a particular patient record can be associated with a specific disease. • Clustering – The task of grouping related data points together without labeling them. • E.g. Grouping patient records with similar symptoms without knowing what the symptoms indicate.

  4. Supervised vs. Unsupervised Learning • Unsupervised Learning – No labeled training data is involved. • E.g. Clustering. • Supervised Learning – Training is done using labeled data. • E.g. Classification.

  5. Support Vector Machines • Classification/clustering by exploiting complex patterns in the data. • An instance of a class of algorithms called Kernel Machines. • Exploits information about the inner products between data points in some (possibly very complex) feature space.

  6. Kernel-based learning • Composed of two parts: • General-purpose learning machine. • Application-specific kernel function. • Simple case – Classification: • Decision function = Hyper-plane in input space. • Rosenblatt’s Perceptron Algorithm. • SVM draws inspiration from this linear case.

  7. SVM Features • Maps the data into a feature space where they are more easily (possibly linearly) separable. • Duality – The main optimization problem can be re-written in a dual form in which the data appear only through inner products. • The kernel K is a function that returns the inner product of the images of two data points: K(X1, X2) = <Φ(X1), Φ(X2)> • Only K is needed; the actual mapping Φ(X) may not even be known explicitly.
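As an illustration of the kernel idea (a minimal Python sketch added to this transcript, not code from the talk; the function name and the use of numpy are assumptions), the Gaussian kernel used later in the presentation returns the inner product <Φ(X1), Φ(X2)> without ever forming Φ explicitly:

    import numpy as np

    def gaussian_kernel(x1, x2, q=1.0):
        # K(x1, x2) = exp(-q * ||x1 - x2||^2) = <Phi(x1), Phi(x2)>
        # for an implicit (infinite-dimensional) feature map Phi.
        diff = np.asarray(x1, dtype=float) - np.asarray(x2, dtype=float)
        return np.exp(-q * np.dot(diff, diff))

    # Example: gaussian_kernel([0.0, 0.0], [1.0, 0.0], q=2.0) equals exp(-2).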

  8. SVM Features • Dimensionality – High dimensionality in the feature space can cause over-fitting of the data. • Regularities in the training set may not appear in the test set. • Convexity – The optimization problem is a convex quadratic programming problem • No local minima • Solvable in polynomial time

  9. Support Vector Clustering • Vapnik (1995): Support Vector Machines • Tax & Duin (1999): SVs to characterize high dimensional distribution • Scholkopf et al. (2001): SVs to compute set of contours enclosing data points • Ben-Hur et al. (2001): SVs to systematically search for clustering solutions

  10. Highlights of this Approach • Data points are mapped to a high-dimensional feature space using a Gaussian kernel • In feature space, look for the smallest sphere enclosing the image of the data • Map the sphere back to data space to form a set of contours enclosing the data points • Points enclosed by the same contour are associated with the same cluster

  11. Highlights of this Approach (cont.) • Decreasing the width of Gaussian kernel increases number of contours • Searching for contours = searching for valleys in the probability distribution • Use a soft margin to deal with outliers • Handle overlapping clusters with large values of the soft margin

  12. Major Steps • Sphere Analysis • Cluster Analysis

  13. Sphere Analysis

  14. Sphere Analysis

  15. Sphere Analysis
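As a hedged sketch of the sphere analysis (an illustrative implementation under assumed names, using numpy and scipy rather than anything shown in the talk): the smallest enclosing sphere in feature space is obtained from the dual problem of minimizing β'Kβ subject to Σj βj = 1 and 0 ≤ βj ≤ C, which a general-purpose solver can handle because the problem is convex.

    import numpy as np
    from scipy.optimize import minimize

    def gaussian_kernel_matrix(X, Y, q):
        # K[i, j] = exp(-q * ||X[i] - Y[j]||^2)
        d2 = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(axis=-1)
        return np.exp(-q * d2)

    def svc_sphere(X, q=1.0, C=1.0):
        # Dual of the smallest-enclosing-sphere problem in feature space:
        #   minimize   beta^T K beta
        #   subject to sum_j beta_j = 1,  0 <= beta_j <= C
        # (with a Gaussian kernel K(x, x) = 1, so the linear term is constant).
        X = np.asarray(X, dtype=float)
        N = len(X)
        K = gaussian_kernel_matrix(X, X, q)
        res = minimize(
            fun=lambda b: b @ K @ b,
            x0=np.full(N, 1.0 / N),
            jac=lambda b: 2.0 * (K @ b),
            method="SLSQP",
            bounds=[(0.0, C)] * N,
            constraints=[{"type": "eq", "fun": lambda b: b.sum() - 1.0}],
        )
        beta = res.x
        # 0 < beta_j < C: support vector (on the sphere); beta_j = C: bounded SV (outlier).
        return beta, K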

  16. The Sphere

  17. Data Space
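Mapping the sphere back to data space amounts to evaluating the distance of Φ(x) from the sphere centre. A continuation of the sketch above (same assumed helper gaussian_kernel_matrix; not code from the talk):

    import numpy as np

    def radius_fn(X, beta, q, C=1.0, tol=1e-7):
        # Squared distance of Phi(x) from the sphere centre a = sum_j beta_j Phi(x_j):
        #   R^2(x) = K(x, x) - 2 sum_j beta_j K(x_j, x) + sum_ij beta_i beta_j K(x_i, x_j)
        X = np.asarray(X, dtype=float)
        K = gaussian_kernel_matrix(X, X, q)
        const = float(beta @ K @ beta)

        def r2(x):
            kx = gaussian_kernel_matrix(np.atleast_2d(np.asarray(x, dtype=float)), X, q).ravel()
            return 1.0 - 2.0 * float(beta @ kx) + const   # K(x, x) = 1 for the Gaussian kernel

        # The sphere radius is R(x_i) for any unbounded support vector (0 < beta_i < C).
        sv = np.where((beta > tol) & (beta < C - tol))[0]
        R2 = float(np.mean([r2(X[i]) for i in sv]))
        # The data-space contours enclosing the clusters are the level set {x : r2(x) = R2}.
        return r2, R2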

  18. Major Steps • Sphere Analysis • Cluster Analysis

  19. Cluster analysis: adjacency matrix

  20. Cluster analysis: connected components • An unweighted graph • Connected components = clusters • BSVs remain unclassified
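A sketch of this cluster-assignment step (an assumed implementation of the adjacency rule described in the paper, reusing the r2 and R2 produced by the sketch above): two points are connected if every sampled point on the segment between them maps inside the sphere, clusters are the connected components of the resulting unweighted graph, and BSVs are left unlabeled.

    import numpy as np
    from scipy.sparse.csgraph import connected_components

    def svc_labels(X, r2, R2, bsv_mask, n_segment=10):
        # Adjacency: A[i, j] = 1 if every sampled point y on the segment from x_i to x_j
        # satisfies R^2(y) <= R^2 (i.e. the segment stays inside the sphere image).
        X = np.asarray(X, dtype=float)
        N = len(X)
        A = np.zeros((N, N), dtype=int)
        ts = np.linspace(0.0, 1.0, n_segment)
        for i in range(N):
            for j in range(i + 1, N):
                if bsv_mask[i] or bsv_mask[j]:
                    continue                      # BSVs are not connected to any point
                inside = all(r2(X[i] + t * (X[j] - X[i])) <= R2 for t in ts)
                A[i, j] = A[j, i] = int(inside)
        # Clusters are the connected components of this unweighted graph.
        _, labels = connected_components(A, directed=False)
        labels = labels.astype(int)
        labels[bsv_mask] = -1                     # BSVs are left unclassified
        return labels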

  21. Parameters • The shape of the enclosing contours in data space is governed by two parameters • The scale parameter of the Gaussian kernel, q • The soft margin constant, C

  22. Examples without BSV

  23. Examples without BSV • As the scale parameter of the Gaussian kernel, q, is increased, the shape of the boundary in data space varies. • As q increases, the boundary fits the data more tightly. • At several q values the enclosing contour splits, forming an increasing number of clusters. • As q increases, the number of support vectors nsv increases.

  24. Examples with BSV • In real data, clusters are usually not as well separated as in the previous examples. Thus, in order to observe splitting of contours, we must allow for BSVs. • The number of outliers is controlled by the parameter C: nbsv < 1/C, where nbsv is the number of BSVs and C is the soft margin constant. • Equivalently, p = 1/(NC) is an upper bound on the fraction of BSVs. • When distinct clusters are present but some outliers (due to noise) prevent contour separation, it is useful to allow BSVs.
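A small worked example of these relations (numbers chosen for illustration only, not taken from the slides):

    # Choosing the soft margin constant C from a target BSV fraction p.
    N = 150                      # number of data points
    p = 0.3                      # desired upper bound on the fraction of BSVs
    C = 1.0 / (p * N)            # soft margin constant; here C is about 0.0222
    # The constraints sum(beta) = 1 and beta_j <= C then allow at most
    # 1/C = p*N = 45 points to become BSVs.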

  25. Support vectors The difference between data that are contour-separable without BSVs and data that require use of BSVs is illustrated in Figure 4. A small overlap between the two probability distributions that generate the data is enough to prevent separation if there are no BSVs.

  26. Example

  27. Example • SVC is used iteratively • Start with a low value of q, where there is a single cluster, and increase it to observe the formation of an increasing number of clusters (as the Gaussian kernel describes the data with increasing precision). • If, however, the number of SVs is excessive, p is increased to allow these points to turn into outliers, thus facilitating contour separation.

  28. Example (cont.) • SVC is used iteratively • As p is increased, not only does the number of BSVs increase, but their influence on the shape of the cluster contour decreases. • The number of support vectors depends on both q and p. For fixed q, as p is increased, the number of SVs decreases, since some of them turn into BSVs and the contours become smoother.

  29. Strong overlapping clusters What if the overlap is very strong? Use the high-BSV regime and reinterpret the sphere in feature space as representing cluster cores, rather than the envelope of the data.

  30. Strong overlapping clusters The contour in data space that is the image of the sphere can be expressed as {x : Σj βj K(xj, x) = ρ}, where ρ is determined by the value of this sum at the support vectors. The set of points enclosed by the contour is {x : Σj βj K(xj, x) > ρ}. In the extreme case, when almost all points are BSVs (p → 1), each βj approaches 1/N, so the sum approaches Pw(x) = (1/N) Σj K(xj, x), where Pw is the Parzen window estimator of the density. Cluster centers are defined as maxima of the Parzen window estimator Pw(x).
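As a sketch of this high-BSV limit (illustrative code, not from the talk): when nearly every βj equals C = 1/(Np), the weighted sum over the data approaches the Parzen window density estimate, and cluster cores can be read off as regions where that estimate exceeds a threshold.

    import numpy as np

    def parzen_window(X, q):
        # P_w(x) = (1/N) * sum_j exp(-q * ||x - x_j||^2): a Parzen window (kernel density)
        # estimate built from Gaussian windows of width about 1/sqrt(q).
        X = np.asarray(X, dtype=float)

        def pw(x):
            d2 = ((X - np.asarray(x, dtype=float)) ** 2).sum(axis=1)
            return float(np.exp(-q * d2).mean())

        return pw

    # In the high-BSV regime, cluster cores are the regions {x : pw(x) > threshold},
    # and cluster centres are the local maxima of pw.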

  31. Strong overlapping clusters (cont.) • Advantages • Instead of solving a problem with many local maxima, they identify core boundaries by an SV method with a globally optimal solution. • The conceptual advantage of the method is that they define a region, rather than just a peak, as the core of the cluster.

  32. Iris Data The data set consists of 150 instances, each composed of four measurements of an iris flower. There are three types of flowers, represented by 50 instances each.

  33. Iris Data One of the clusters is linearly separable from the other two by a clear gap in the probability distribution. The remaining two clusters have significant overlap; they were separated at q = 6 and p = 0.6, with 2 misclassifications between them.

  34. Iris Data Analysis • Adding the third principal component, three clusters were obtained at q = 7, p = 0.7, with four misclassifications. • With the fourth principal component the number of misclassifications increased to 14 (using q = 9, p = 0.75). • In addition, the number of support vectors increased with increasing dimensionality (18 in 2 dimensions, 23 in 3 dimensions and 34 in 4 dimensions). • The improved performance in 2 or 3 dimensions can be attributed to the noise reduction effect of PCA. • Compared to other methods, the SVC algorithm performed better: • The approach of Tishby and Slonim (2001) leads to 5 misclassifications. • The SPC algorithm of Blatt et al. (1997), when applied to the dataset in the original data space, has 15 misclassifications.
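A hedged usage sketch for reproducing this kind of experiment (scikit-learn is assumed for the data and PCA; svc_sphere, radius_fn and svc_labels are the helpers sketched earlier, and the q, p values are the ones quoted on the earlier slide):

    import numpy as np
    from sklearn.datasets import load_iris
    from sklearn.decomposition import PCA

    X = load_iris().data
    X2 = PCA(n_components=2).fit_transform(X)     # first two principal components

    q, p = 6.0, 0.6                               # values quoted on the earlier slide
    C = 1.0 / (p * len(X2))
    beta, K = svc_sphere(X2, q=q, C=C)            # sketched earlier
    r2, R2 = radius_fn(X2, beta, q, C=C)          # sketched earlier
    bsv = beta >= C - 1e-7                        # bounded support vectors
    labels = svc_labels(X2, r2, R2, bsv)          # sketched earlier; -1 marks BSVs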

  35. Varying ‘p’ and ‘q’ Start with an initial low value of q; at this scale all pairs of points produce a sizeable kernel value, resulting in a single cluster. At this value no outliers are needed, so we choose C = 1. If, as q is increased, clusters of single or few points break off, or cluster boundaries become very rough, p should be increased in order to investigate what happens when BSVs are allowed.

  36. Varying ‘p’ and ‘q’ A low number of SVs guarantees smooth boundaries. As q increases, this number increases. If the number of SVs becomes excessive, p should be increased, whereby many SVs are turned into BSVs and smooth cluster (or core) boundaries emerge, as in Figure 3b. Systematically increase q and p along a direction that guarantees a minimal number of SVs.
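A sketch of this q/p schedule (an assumed implementation built on the svc_sphere helper sketched earlier; the specific threshold and increment are illustrative choices, not values from the talk):

    import numpy as np

    def sweep(X, q_values, p=0.0, max_sv_fraction=0.5, tol=1e-7):
        # Increase q step by step; if the number of SVs becomes excessive,
        # increase p (i.e. lower C) so that many SVs turn into BSVs.
        X = np.asarray(X, dtype=float)
        N = len(X)
        for q in q_values:
            C = 1.0 if p == 0.0 else 1.0 / (p * N)
            beta, _ = svc_sphere(X, q=q, C=C)          # sketched earlier
            n_sv = int(((beta > tol) & (beta < C - tol)).sum())
            if n_sv > max_sv_fraction * N:
                p = min(p + 0.1, 1.0)                  # allow more BSVs at the next step
            yield q, p, beta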

  37. Our Results for Iris

  38. Dimension 1 & 4

  39. Dimension 1 & 4

  40. Dimension 2 & 3

  41. Dimension 2 & 3

  42. Dimension 2 & 4

  43. Dimension 2 & 4

  44. Dimension 3 & 4

  45. Dimension 3 & 4

  46. Results on our synthesized data q=1

  47. Results on our synthesized data q=2

  48. Results on our synthesized data q=5

  49. Results on our synthesized data q=10

  50. Results on our synthesized data q=20
