Clustering Methods: Part 6 – Dimensionality
Ilja Sidoroff, Pasi Fränti
Speech and Image Processing Unit, Department of Computer Science, University of Joensuu, FINLAND
Dimensionality of data • Dimensionality of a data set = the minimum number of free variables needed to represent the data without information loss • A d-attribute data set has an intrinsic dimensionality (ID) of M if its elements lie entirely within an M-dimensional subspace of R^d (M &lt; d)
Dimensionality of data • The use of more dimensions than necessary leads to problems: • greater storage requirements • algorithms run more slowly • finding clusters and building good classifiers becomes more difficult (curse of dimensionality)
Curse of dimensionality • When the dimensionality of the space increases, distance measures become less useful • all points are more or less equidistant • most of the volume of a sphere is concentrated in a thin layer near its surface (see next slide)
V(r) – volume of a sphere with radius r; D – dimension of the sphere
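A standard way to see this: the volume of a D-dimensional ball scales as V(r) ∝ r^D, so the fraction of the volume lying in a thin shell of relative thickness ε near the surface is

\frac{V(r) - V((1-\varepsilon)r)}{V(r)} = 1 - (1-\varepsilon)^{D} \;\longrightarrow\; 1 \quad \text{as } D \to \infty

For example, with ε = 0.05 and D = 100 the shell already contains about 99.4% of the volume.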
Two approaches • Estimation of dimensionality • knowing the ID of a data set can help in tuning classification or clustering performance • Dimensionality reduction • projecting the data onto some subspace • e.g. 2D/3D visualisation of a multi-dimensional data set • may result in information loss if the subspace dimension is smaller than the ID
Goodness of the projection Can be estimated by two measures: • Trustworthiness: data points that are not neighbours in the input space are not mapped as neighbours in the output space. • Continuity: data points that are close in the input space are not mapped far apart in the output space [11].
Trustworthiness • N – number of feature vectors • r(i,j) – the rank of data sample j in the ordering according to the distance from i in the original data space • U_k(i) – set of feature vectors that are in the neighbourhood of size k of sample i in the projection space but not in the original space • A(k) – scales the measure between 0 and 1
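Written in these symbols, trustworthiness in the formulation of Venna and Kaski [11] is commonly given as:

M_1(k) = 1 - A(k) \sum_{i=1}^{N} \sum_{j \in U_k(i)} \big( r(i,j) - k \big), \qquad A(k) = \frac{2}{N k (2N - 3k - 1)}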
Continuity • r'(i,j) – the rank of data sample j in the ordering according to the distance from i in the projection space • V_k(i) – set of feature vectors that are in the neighbourhood of size k of sample i in the original space but not in the projection space
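Correspondingly, continuity can be written as:

M_2(k) = 1 - A(k) \sum_{i=1}^{N} \sum_{j \in V_k(i)} \big( r'(i,j) - k \big)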
Example data sets • Swiss roll: 20000 3D points • 2D manifold in 3D space • http://isomap.stanford.edu
Example data sets • 64 × 64 pixel images of hands in different positions • Each image can be considered as a 4096-dimensional data element • Could also be interpreted in terms of finger extension – wrist rotation (2D)
Example data sets http://isomap.stanford.edu
Synthetic data sets [11]: sphere, S-shaped manifold, six clusters
Principal component analysis (PCA) • Idea: find the directions of maximal variance and align the coordinate axes with them. • If the variance along a dimension is zero, that dimension is not needed. • Drawback: works well only with linear data [1]
PCA method (1/2) • Center the data so that each attribute has zero mean • Calculate the covariance matrix of the data • Calculate the eigenvalues and eigenvectors of the covariance matrix • Arrange the eigenvectors in decreasing order of their eigenvalues • For dimensionality reduction, choose the desired number of eigenvectors (2 or 3 for visualization)
PCA method (2/2) • Intrinsic dimensionality = number of non-zero eigenvalues • Dimensionality reduction by projection: y_i = A x_i • Here x_i is the input vector, y_i the output vector, and A is the matrix whose rows are the eigenvectors corresponding to the largest eigenvalues. • For visualization, typically 2 or 3 components are preserved.
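As a minimal sketch of the steps above (assuming the data is given as an N × d NumPy array X; the function name is illustrative):

import numpy as np

def pca_project(X, n_components=2):
    # Center the data so that each attribute has zero mean
    Xc = X - X.mean(axis=0)
    # Covariance matrix of the centered data
    C = np.cov(Xc, rowvar=False)
    # Eigenvalues and eigenvectors of the (symmetric) covariance matrix
    eigvals, eigvecs = np.linalg.eigh(C)
    # Arrange in decreasing order of the eigenvalues
    order = np.argsort(eigvals)[::-1]
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    # A holds the chosen eigenvectors as rows; the projection is y_i = A x_i
    A = eigvecs[:, :n_components].T
    Y = Xc @ A.T
    return Y, eigvals

The number of eigenvalues clearly above zero then gives the PCA estimate of the intrinsic dimensionality.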
Example of PCA • The distances between points change in the projection. • Test set c: • the two clusters are projected onto one cluster • the s-shaped cluster is projected nicely
Another example of PCA [10] • Data set: points lying on a circle (x² + y² = 1); only one free variable (the angle), so ID = 1 • PCA nevertheless yields two non-null eigenvalues • u, v – principal components
Limitations of PCA • Since the eigenvectors are orthogonal, PCA works well only with linear data • Tends to overestimate the ID • Kernel PCA uses the so-called kernel trick to apply PCA also to non-linear data: • map the data non-linearly into a higher-dimensional space and perform the PCA analysis in that space (see the sketch below)
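A hedged illustration of the kernel trick using scikit-learn's KernelPCA on a synthetic two-circles data set (the library, data set and parameter values are illustrative choices, not part of the original material):

import numpy as np
from sklearn.datasets import make_circles
from sklearn.decomposition import KernelPCA

# Two concentric circles: a non-linear structure that plain PCA cannot unfold
X, labels = make_circles(n_samples=400, factor=0.3, noise=0.05, random_state=0)

# An RBF kernel corresponds to an implicit non-linear mapping into a
# higher-dimensional feature space, where ordinary PCA is then performed
kpca = KernelPCA(n_components=2, kernel='rbf', gamma=10.0)
Y = kpca.fit_transform(X)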
Multidimensional scaling (MDS) • Project the data into a new space while trying to preserve the distances between data points • Define a stress E (the difference between pairwise distances in the original and projection spaces) • E is minimized using some optimization algorithm • With certain stress functions (e.g. Kruskal's), a perfect projection exists when E is 0 • The ID of the data is the smallest projection dimension for which a perfect projection exists
Metric MDS • The simplest stress function [2], raw stress: • d(x_i, x_j) – distance in the original space • d(y_i, y_j) – distance in the projection space • y_i, y_j – representations of x_i, x_j in the output space
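In these symbols, the raw stress is commonly written as:

E = \sum_{i &lt; j} \big( d(x_i, x_j) - d(y_i, y_j) \big)^2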
Sammon's Mapping • Sammon's mapping gives small distances a larger weight [5]:
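Sammon's stress, in the same notation (the normalisation term weights each pair by the inverse of its original distance):

E = \frac{1}{\sum_{i&lt;j} d(x_i, x_j)} \sum_{i&lt;j} \frac{\big( d(x_i, x_j) - d(y_i, y_j) \big)^2}{d(x_i, x_j)}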
Kruskal's stress • Ranking the point distances accounts for the overall shrinking of distances in lower-dimensional projections:
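A common form is Kruskal's stress-1, where the disparities \hat{d}_{ij} are obtained by monotone regression on the rank order of the original distances:

E = \sqrt{ \frac{ \sum_{i&lt;j} \big( d(y_i, y_j) - \hat{d}_{ij} \big)^2 }{ \sum_{i&lt;j} d(y_i, y_j)^2 } }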
MDS example • Separates clusters better than PCA • Local structures are not always preserved (leftmost test set)
Other MDS approaches • ISOMAP [12] • Curvilinear component analysis CCA [13]
Local methods • The previous methods are global in the sense that all input data is considered at once. • Local methods consider only some neighbourhood of the data points and may be computationally less demanding • They try to estimate the topological dimension of the data manifold
Fukunaga-Olsen algorithm [6] • Assume that the data can be divided into small regions, i.e. clustered • Each cluster (Voronoi set) of the data lies on an approximately linear surface =&gt; the PCA method can be applied to each cluster • Eigenvalues are normalized by dividing by the largest eigenvalue
Fukunaga-Olsen algorithm • The ID is defined as the number of normalized eigenvalues that are larger than a threshold T • Defining a good threshold is a problem in itself (see the sketch below)
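A minimal sketch of the idea (the clustering method, the threshold value, and the averaging of per-cluster estimates are illustrative choices, not prescribed by [6]):

import numpy as np
from sklearn.cluster import KMeans

def fukunaga_olsen_id(X, n_clusters=10, threshold=0.05):
    # Partition the data into small, approximately linear regions
    labels = KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit_predict(X)
    estimates = []
    for c in range(n_clusters):
        cluster = X[labels == c]
        if len(cluster) &lt;= X.shape[1]:
            continue  # too few points for a reliable covariance estimate
        C = np.cov(cluster - cluster.mean(axis=0), rowvar=False)
        eigvals = np.sort(np.linalg.eigvalsh(C))[::-1]
        normalized = eigvals / eigvals[0]               # divide by the largest eigenvalue
        estimates.append(np.sum(normalized &gt; threshold))  # count eigenvalues above T
    # Combine per-cluster estimates (here: a simple average)
    return int(np.round(np.mean(estimates)))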
Near neighbour algorithm • Trunk's method [7]: • An initial value for an integer parameter k is chosen (usually k = 1). • The k nearest neighbours of each data vector are identified. • For each data vector i, the subspace spanned by the vectors from i to each of its k nearest neighbours is constructed.
Near neighbour algorithm • The angle between the vector to the (k+1)th nearest neighbour and its projection onto this subspace is calculated for each data vector • If the average of these angles is below a threshold, the ID is k; otherwise k is increased and the process repeated
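A rough sketch of this procedure (the neighbour search, the QR-based subspace construction, and the threshold value are illustrative choices):

import numpy as np
from scipy.spatial import cKDTree

def trunk_id(X, angle_threshold_deg=20.0, k_max=10):
    tree = cKDTree(X)
    for k in range(1, k_max + 1):
        # k+2 neighbours because the query returns the point itself first
        _, idx = tree.query(X, k + 2)
        angles = []
        for i in range(len(X)):
            neighbours = X[idx[i, 1:k + 1]] - X[i]   # vectors to the k nearest neighbours
            extra = X[idx[i, k + 1]] - X[i]          # vector to the (k+1)th neighbour
            # Orthonormal basis of the subspace spanned by the k neighbour vectors
            Q, _ = np.linalg.qr(neighbours.T)
            proj = Q @ (Q.T @ extra)                 # projection onto the subspace
            cos_a = np.dot(extra, proj) / (np.linalg.norm(extra) * np.linalg.norm(proj) + 1e-12)
            angles.append(np.degrees(np.arccos(np.clip(cos_a, -1.0, 1.0))))
        if np.mean(angles) &lt; angle_threshold_deg:
            return k
    return k_max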
Near neighbour algorithm • It is not clear how to select a suitable value for the threshold • Improvements to Trunk's method: • Pettis et al. [8] • Verveer-Duin [9]
Fractal methods • Global methods, but with a different definition of dimensionality • Basic idea: • count the observations inside a ball of radius r, giving f(r) • analyse the growth rate of f(r) • if f grows as r^k, the dimensionality of the data can be considered to be k
Fractal methods • Dimensionality can be fractional, e.g. 1.5 • Therefore fractal methods do not provide projections into a lower-dimensional space (what would R^1.5 be, anyway?) • Fractal dimensionality estimates can be used in time-series analysis etc. [10]
Fractal methods • Different definitions of fractal dimension [10]: • Hausdorff dimension • Box-counting dimension • Correlation dimension • In order to get an accurate estimate of a dimension D, the data set cardinality must be at least 10^(D/2)
Hausdorff dimension • The data set is covered by cells s_i with variable diameters r_i, all r_i &lt; r • in other words, we look for the collection of covering sets s_i with diameters less than or equal to r which minimizes the sum below • d-dimensional Hausdorff measure:
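With the covering defined above, the d-dimensional Hausdorff measure is commonly written as:

\Gamma_H^d(r) = \inf_{\{s_i\},\, r_i \le r} \sum_i r_i^{\,d}, \qquad \Gamma_H^d = \lim_{r \to 0} \Gamma_H^d(r)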
Hausdorff dimension • For every data set, the measure Γ_H^d is infinite if d is less than some critical value D_H, and 0 if d is greater than D_H • The critical value D_H is the Hausdorff dimension of the data set
Box-counting dimension • The Hausdorff dimension is not easy to calculate • The box-counting dimension D_B is an upper bound of the Hausdorff dimension and does not usually differ from it: • v(r) – the number of boxes of size r needed to cover the data set
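In these terms, the box-counting dimension is defined as:

D_B = \lim_{r \to 0} \frac{\ln v(r)}{\ln (1/r)}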
Box-counting dimension • Although the box-counting dimension is easier to calculate than the Hausdorff dimension, the algorithmic complexity grows exponentially with the set dimensionality =&gt; it can be used only for low-dimensional data sets • The correlation dimension is a computationally more feasible fractal dimension measure • The correlation dimension is a lower bound of the box-counting dimension
Correlation dimension • Let x_1, x_2, x_3, ..., x_N be the data points • The correlation integral can be defined as: • I(x) is the indicator function: I(x) = 1 iff x is true, I(x) = 0 otherwise.
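In this notation, the correlation integral is commonly written as:

C(r) = \lim_{N \to \infty} \frac{2}{N(N-1)} \sum_{i &lt; j} I\big( \lVert x_i - x_j \rVert \le r \big)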
Correlation dimension • The correlation dimension D_C is defined as the limit of ln C(r) / ln r as r → 0 • In practice, D_C is estimated as the slope of ln C(r) plotted against ln r over a range of small radii
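A minimal numerical sketch of this estimate (r_values is assumed to be a 1D NumPy array of radii; fitting a single line over all radii is an illustrative simplification, in practice the slope is read off a linear region of the log-log plot):

import numpy as np
from scipy.spatial.distance import pdist

def correlation_dimension(X, r_values):
    # Pairwise distances between all data points
    dists = pdist(X)
    n_pairs = len(dists)
    # Correlation integral C(r): fraction of pairs closer than r
    C = np.array([np.sum(dists &lt;= r) / n_pairs for r in r_values])
    # Estimate D_C as the slope of ln C(r) versus ln r (least-squares fit)
    mask = C &gt; 0
    slope, _ = np.polyfit(np.log(r_values[mask]), np.log(C[mask]), 1)
    return slope

# Example usage: r_values = np.logspace(-2, 0, 20)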
Literature • M. Kirby, Geometric Data Analysis: An Empirical Approach to Dimensionality Reduction and the Study of Patterns, John Wiley and Sons, 2001. • J. B. Kruskal, Multidimensional scaling by optimizing goodness of fit to a nonmetric hypothesis, Psychometrika 29 (1964) 1–27. • R. N. Shepard, The analysis of proximities: Multidimensional scaling with an unknown distance function, Psychometrika 27 (1962) 125–140. • R. S. Bennett, The intrinsic dimensionality of signal collections, IEEE Transactions on Information Theory 15 (1969) 517–525. • J. W. J. Sammon, A nonlinear mapping for data structure analysis, IEEE Transactions on Computers C-18 (1969) 401–409. • K. Fukunaga, D. R. Olsen, An algorithm for finding intrinsic dimensionality of data, IEEE Transactions on Computers 20 (2) (1976) 165–171. • G. V. Trunk, Statistical estimation of the intrinsic dimensionality of a noisy signal collection, IEEE Transactions on Computers 25 (1976) 165–171.
Literature • K. Pettis, T. Bailey, T. Jain, R. Dubes, An intrinsic dimensionality estimator from near-neighbor information, IEEE Transactions on Pattern Analysis and Machine Intelligence 1 (1) (1979) 25–37. • P. J. Verveer, R. Duin, An evaluation of intrinsic dimensionality estimators, IEEE Transactions on Pattern Analysis and Machine Intelligence 17 (1) (1995) 81–86. • F. Camastra, Data dimensionality estimation methods: a survey, Pattern Recognition 36 (2003) 2945–2954. • J. Venna, Dimensionality reduction for visual exploration of similarity structures (2007), PhD thesis manuscript (submitted). • J. B. Tenenbaum, V. de Silva, J. C. Langford, A global geometric framework for nonlinear dimensionality reduction, Science 290 (12) (2000) 2319–2323. • P. Demartines, J. Herault, Curvilinear component analysis: A self-organizing neural network for nonlinear mapping in cluster analysis, IEEE Transactions on Neural Networks 8 (1) (1997) 148–154.