Support Vector Clustering – Asa Ben-Hur, David Horn, Hava T. Siegelmann, Vladimir Vapnik Data Mining – II Yiyu Chen Siddharth Pal Ritendra Datta Ashish Parulekar Shiva Kasiviswanathan
Overview • Clustering and Classification • Support Vector Machines – a brief idea • Support Vector Clustering • Results • Conclusion
Clustering and Classification • Classification – The task of assigning instances to pre-defined classes. • E.g. Deciding whether a particular patient record can be associated with a specific disease. • Clustering – The task of grouping related data points together without labeling them. • E.g. Grouping patient records with similar symptoms without knowing what the symptoms indicate.
Supervised vs. Unsupervised Learning • Unsupervised Learning – There is no training process involved. • E.g. Clustering. • Supervised Learning – Training is done using labeled data. • E.g. Classification.
Support Vector Machines • Classification/clustering by exploiting complex patterns in the data. • An instance of a class of algorithms called Kernel Machines. • Exploits information about the inner product between data points in some (possibly very complex) feature space.
Kernel-based learning • Composed of two parts: • General-purpose learning machine. • Application-specific kernel function. • Simple case – Classification: • Decision function = Hyper-plane in input space. • Rosenblatt’s Perceptron Algorithm. • SVM draws inspiration from this linear case.
SVM Features • Maps data into a feature space where it is more easily (possibly linearly) separable. • Duality – The main optimization problem can be re-written in a dual form where the data appears only through inner products. • A kernel K is a function that returns the inner product of the images of two data points. • K(X1, X2) = <Φ(X1), Φ(X2)> • Only K is needed; the actual mapping Φ(X) may not even be known.
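To make the kernel idea concrete, here is a minimal sketch (not from the paper) of a Gaussian kernel of the form used later, K(X1, X2) = exp(-q ||X1 - X2||²); the function names and the width parameter q are chosen for illustration only.

    import numpy as np

    def gaussian_kernel(x1, x2, q=1.0):
        # K(x1, x2) = exp(-q * ||x1 - x2||^2) = <phi(x1), phi(x2)>
        # The feature map phi itself is never computed explicitly.
        return np.exp(-q * np.sum((np.asarray(x1) - np.asarray(x2)) ** 2))

    def kernel_matrix(X, q=1.0):
        # Pairwise kernel values for a data matrix X of shape (n_points, n_features).
        sq_dists = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
        return np.exp(-q * sq_dists)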
SVM Features • Dimensionality – High dimensionality in the feature space can cause over-fitting of the data. • Regularities in the training set may not appear in the test set. • Convexity – The optimization problem is a convex quadratic programming problem • No local minima • Solvable in polynomial time
Support Vector Clustering • Vapnik (1995): Support Vector Machines • Tax & Duin (1999): SVs to characterize high dimensional distribution • Scholkopf et al. (2001): SVs to compute set of contours enclosing data points • Ben-Hur et al. (2001): SVs to systematically search for clustering solutions
Highlights of this Approach • Data points are mapped to a high-dimensional feature space using a Gaussian kernel • In feature space, look for the smallest sphere enclosing the image of the data • Map the sphere back to data space to form a set of contours enclosing the data points • Points enclosed by the same contour are associated with the same cluster
Highlights of this Approach (cont.) • Decreasing the width of Gaussian kernel increases number of contours • Searching for contours = searching for valleys in the probability distribution • Use a soft margin to deal with outliers • Handle overlapping clusters with large values of the soft margin
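The smallest enclosing sphere is found by solving a convex quadratic program in coefficients βj. The sketch below solves its Wolfe dual with SciPy's SLSQP solver; the function name, starting point, and solver choice are assumptions for illustration, not the authors' implementation.

    import numpy as np
    from scipy.optimize import minimize

    def svdd_dual(K, C):
        # Wolfe dual of the smallest-enclosing-sphere problem:
        #   maximize   sum_j beta_j K_jj - sum_ij beta_i beta_j K_ij
        #   subject to 0 <= beta_j <= C,  sum_j beta_j = 1
        n = K.shape[0]
        diag_k = np.diag(K)
        neg_dual = lambda b: b @ K @ b - diag_k @ b        # negate to minimize
        grad = lambda b: 2.0 * (K @ b) - diag_k
        res = minimize(neg_dual, np.full(n, 1.0 / n), jac=grad,
                       bounds=[(0.0, C)] * n,
                       constraints=[{"type": "eq", "fun": lambda b: b.sum() - 1.0}],
                       method="SLSQP")
        return res.x

Points with 0 < βj < C lie on the sphere (support vectors), while points with βj = C are the bounded support vectors (BSVs), i.e. the outliers handled by the soft margin.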
Major Steps • Sphere Analysis • Cluster Analysis
Cluster analysis: connected components • Build an unweighted graph over the data points: two points are connected if the line segment joining them stays inside the sphere in feature space (see the sketch below) • Connected components = clusters • BSVs remain unclassified
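A rough sketch of this labeling step, assuming the rule that two points are connected when every sampled point on the segment between them maps inside the sphere; the helper names, sampling resolution, and tolerances are illustrative assumptions.

    import numpy as np
    from scipy.sparse.csgraph import connected_components

    def radius_sq(y, X, beta, q, offset):
        # Squared feature-space distance from the image of y to the sphere centre:
        # R^2(y) = K(y,y) - 2 sum_j beta_j K(x_j, y) + sum_ij beta_i beta_j K(x_i, x_j)
        k_y = np.exp(-q * np.sum((X - y) ** 2, axis=1))
        return 1.0 - 2.0 * beta @ k_y + offset

    def cluster_labels(X, beta, q, C, n_segment_samples=10, tol=1e-7):
        K = np.exp(-q * np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1))
        offset = beta @ K @ beta
        # Sphere radius R: distance of any unbounded support vector (0 < beta_k < C).
        sv = np.where((beta > tol) & (beta < C - tol))[0]
        r_sq = np.mean([radius_sq(X[k], X, beta, q, offset) for k in sv])
        n = len(X)
        adjacency = np.zeros((n, n), dtype=bool)
        ts = np.linspace(0.0, 1.0, n_segment_samples)
        for i in range(n):
            for j in range(i + 1, n):
                # Connect i and j if the whole sampled segment stays inside the sphere.
                adjacency[i, j] = adjacency[j, i] = all(
                    radius_sq(X[i] + t * (X[j] - X[i]), X, beta, q, offset) <= r_sq
                    for t in ts)
        _, labels = connected_components(adjacency, directed=False)
        return labels    # BSVs (beta_j at the bound C) end up in singleton components, i.e. unclassified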
Parameters • The shape of the enclosing contours in data space is governed by two parameters • The scale parameter of the Gaussian kernel "q" • The soft margin constant "C"
Examples without BSV • As the scale parameter of the Gaussian kernel, q, is increased, the shape of the boundary in data space varies. • As q increases, the boundary fits the data more tightly • At several q values the enclosing contour splits, forming an increasing number of clusters • As q increases, the number of support vectors nsv increases.
Examples with BSV In real data, clusters are usually not as well separated as in the previous examples. Thus, in order to observe splitting of contours, we must allow for BSVs. The number of outliers is controlled by the parameter C: nbsv < 1/C, where nbsv is the number of BSVs and C is the soft margin constant. Equivalently, p = 1/(NC) is an upper bound on the fraction of BSVs. When distinct clusters are present, but some outliers (due to noise) prevent contour separation, it is useful to use BSVs.
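A one-line helper (hypothetical name), included only to make the relation between p, N, and C explicit:

    def soft_margin_constant(n_points, p):
        # p = 1/(N*C) bounds the fraction of BSVs, so choose C = 1/(N*p).
        return 1.0 / (n_points * p)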
Support vectors The difference between data that are contour-separable without BSVs and data that require the use of BSVs is illustrated in Figure 4. A small overlap between the two probability distributions that generate the data is enough to prevent separation if there are no BSVs.
Example • SVC is used iteratively • Starting with a low value of q where there is a single cluster, and increasing it, to observe the formation of an increasing number of clusters (as the Gaussian kernel describes the data with increasing precision). • If, however, the number of SVs is excessive, p is increased to allow these points to turn into outliers, thus facilitating contour separation.
Example (cont.) • SVC is used iteratively • As p is increased, not only does the number of BSVs increase, but their influence on the shape of the cluster contour decreases. • The number of support vectors depends on both q and p. For fixed q, as p is increased, the number of SVs decreases, since some of them turn into BSVs and the contours become smoother.
Strong overlapping clusters What if the overlap is very strong? Use the high BSV regime: reinterpret the sphere in feature space as representing cluster cores, rather than the envelope of the data.
Strong overlapping clusters The sphere, mapped back to data space, can be expressed as the set { x : Σj βj K(xj, x) ≥ ρ }, where the threshold ρ is determined by the value of the sum on the support vectors. With the Gaussian kernel, the set of points enclosed by the contour is { x : Σj βj exp(−q ||x − xj||²) ≥ ρ }. In the extreme case, when almost all points are BSVs (p → 1), the sum in this expression approaches Pw(x) = (1/N) Σj exp(−q ||x − xj||²), where Pw is the Parzen window estimator of the density. Cluster centers are defined as maxima of the Parzen window estimator Pw(x).
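A minimal sketch of the Parzen window estimate that the kernel sum approaches in the high-BSV limit; the threshold-based core extraction is an illustrative assumption, not the authors' exact procedure.

    import numpy as np

    def parzen_window(x, X, q):
        # P_w(x) = (1/N) * sum_j exp(-q * ||x - x_j||^2)
        return np.mean(np.exp(-q * np.sum((X - x) ** 2, axis=1)))

    def core_points(X, q, rho):
        # In the p -> 1 limit beta_j -> 1/N, so the contour criterion reduces to
        # P_w(x) >= rho: points above the threshold belong to the cluster cores.
        return np.array([parzen_window(x, X, q) for x in X]) >= rho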
Strong overlapping clusters (cont.) • Advantages • Instead of solving a problem with many local maxima, core boundaries are identified by an SV method with a globally optimal solution. • The conceptual advantage of the method is that it defines a region, rather than just a peak, as the core of the cluster.
Iris Data The data set consists of 150 instances, each composed of four measurements of an iris flower. There are three types of flowers, represented by 50 instances each.
Iris Data One of the clusters is linearly separable from the other two by a clear gap in the probability distribution. The remaining two clusters have significant overlap and were separated at q = 6 and p = 0.6. When these two clusters are considered together, the result is 2 misclassifications.
Iris Data Analysis • Adding the third principal component, three clusters were obtained at q = 7, p = 0.7, with four misclassifications. • With the fourth principal component the number of misclassifications increased to 14 (using q = 9, p = 0.75). • In addition, the number of support vectors increased with increasing dimensionality (18 in 2 dimensions, 23 in 3 dimensions and 34 in 4 dimensions). • The improved performance in 2 or 3 dimensions can be attributed to the noise-reduction effect of PCA. • Compared to other methods, the SVC algorithm performed better: • The approach of Tishby and Slonim (2001) leads to 5 misclassifications • The SPC algorithm of Blatt et al. (1997), when applied to the dataset in the original data space, has 15 misclassifications.
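For reference, a sketch of how the same preprocessing could be set up with scikit-learn; the q and p values are the ones quoted above for the two-dimensional case, and the clustering calls reuse the hypothetical helpers from the earlier sketches.

    import numpy as np
    from sklearn.datasets import load_iris
    from sklearn.decomposition import PCA

    X, y = load_iris(return_X_y=True)            # 150 samples, 4 measurements, 3 classes
    X2 = PCA(n_components=2).fit_transform(X)    # keep the first two principal components

    q, p = 6.0, 0.6                              # values quoted above for the 2-D case
    C = 1.0 / (len(X2) * p)
    K = np.exp(-q * np.sum((X2[:, None, :] - X2[None, :, :]) ** 2, axis=-1))
    beta = svdd_dual(K, C)                       # from the earlier sketch
    labels = cluster_labels(X2, beta, q, C)      # from the earlier sketch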
Varying ‘p’ and ‘q’ Start with an initial value of q; at this scale all pairs of points produce a sizeable kernel value, resulting in a single cluster. At this value no outliers are needed, so we choose C = 1. As q is increased, if clusters of single or few points break off, or cluster boundaries become very rough, p should be increased in order to investigate what happens when BSVs are allowed.
Varying ‘p’ and ‘q’ A low number of SVs guarantees smooth boundaries. As q increases, this number increases. If the number of SVs is excessive, p should be increased, whereby many SVs may be turned into BSVs and smooth cluster (or core) boundaries emerge, as in Figure 3b. Systematically increase q and p along a direction that guarantees a minimal number of SVs.
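A rough sketch of such a scan; the step sizes, the notion of "excessive", and the bookkeeping are assumptions made for illustration, and svdd_dual comes from the earlier sketch.

    import numpy as np

    def scan_q_and_p(X, q_values, p=0.0, p_step=0.05, max_sv_fraction=0.5):
        # Increase q gradually; when the number of SVs becomes excessive,
        # raise p so that some SVs turn into BSVs and the boundaries smooth out.
        history = []
        n = len(X)
        for q in q_values:
            C = 1.0 if p == 0.0 else 1.0 / (n * p)
            K = np.exp(-q * np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1))
            beta = svdd_dual(K, C)
            n_sv = int(np.sum(beta > 1e-7))
            if n_sv > max_sv_fraction * n:
                p = min(p + p_step, 1.0)         # allow more BSVs at the next q
            history.append((q, p, n_sv))
        return history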