Multi-view Clustering via Canonical Correlation Analysis. Kamalika Chaudhuri et al., ICML 2009. Presented by Wanchen Lu, 2/25/2013
Assumption in Multi-View problems • The input variable (a real vector) can be partitioned into two different views, where it is assumed that either view of the input is sufficient to make accurate predictions --- essentially the co-training assumption. • e.g. • Identity recognition with one view being a video stream and the other an audio stream; • Web page classification where one view is the text and the other is the hyperlink structure; • Object recognition with pictures from different camera angles; • A bilingual parallel corpus, with each view presented in one language.
Intuition in Multi-View problems • Many multi-view learning algorithms force agreement between the predictors based on the two views (usually by forcing the predictor based on view 1 to equal the predictor based on view 2). • The complexity of the learning problem is reduced by eliminating hypotheses from each view that do not agree with each other.
Canonical correlation analysis • CCA is a way of measuring the linear relationship between two multidimensional variables. • Find two basis vectors, one for x and one for y, such that the correlations between the projections of the variables onto these basis vectors are maximized.
Calculating canonical correlations • Consider the total covariance matrix of zero-mean random variables x and y: C = [C_xx  C_xy; C_yx  C_yy]. • The canonical correlations between x and y can be found by solving the eigenvalue equations C_xx^{-1} C_xy C_yy^{-1} C_yx w_x = rho^2 w_x and C_yy^{-1} C_yx C_xx^{-1} C_xy w_y = rho^2 w_y, where rho is the canonical correlation and w_x, w_y are the corresponding basis vectors.
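A minimal numpy sketch of this computation, solving the eigenvalue equation above for one view and recovering the other view's basis from it (function and variable names are illustrative, not from the paper):

```python
import numpy as np

def canonical_correlations(X, Y, reg=1e-6):
    """Canonical correlations and basis vectors for zero-mean X (n, d_x) and Y (n, d_y).

    A small ridge term `reg` keeps the within-view covariances invertible.
    """
    n = X.shape[0]
    Cxx = X.T @ X / n + reg * np.eye(X.shape[1])
    Cyy = Y.T @ Y / n + reg * np.eye(Y.shape[1])
    Cxy = X.T @ Y / n

    # Solve C_xx^{-1} C_xy C_yy^{-1} C_yx w_x = rho^2 w_x
    M = np.linalg.solve(Cxx, Cxy) @ np.linalg.solve(Cyy, Cxy.T)
    eigvals, Wx = np.linalg.eig(M)           # M is not symmetric, so results may be complex
    order = np.argsort(-eigvals.real)
    rhos = np.sqrt(np.clip(eigvals.real[order], 0.0, 1.0))  # canonical correlations
    Wx = Wx[:, order].real

    # The y-side basis follows from w_y proportional to C_yy^{-1} C_yx w_x
    Wy = np.linalg.solve(Cyy, Cxy.T) @ Wx
    Wy /= np.linalg.norm(Wy, axis=0, keepdims=True)
    return rhos, Wx, Wy
```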
Relation to other linear subspace methods • PCA, PLS, and CCA can all be formulated as a single (generalized) eigenvalue equation, differing only in which covariance blocks appear.
Principal component analysis • The principal components are the eigenvectors of the covariance matrix. • The projection of data onto the principal components is an orthogonal transformation that diagonalizes the covariance matrix.
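For comparison, a small numpy sketch of PCA exactly as stated above: eigenvectors of the covariance matrix, with the projected data having (approximately) diagonal covariance. Names are illustrative.

```python
import numpy as np

def pca_project(X, k):
    """Project zero-mean data X (n_samples, d) onto its top-k principal components."""
    C = np.cov(X, rowvar=False)              # d x d covariance matrix
    eigvals, eigvecs = np.linalg.eigh(C)     # eigh: covariance matrix is symmetric
    top = np.argsort(-eigvals)[:k]           # largest eigenvalues first
    W = eigvecs[:, top]                      # principal components as columns
    Z = X @ W                                # projected data; cov(Z) is diagonal
    return Z, W
```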
Partial least squares • PLS is essentially the singular value decomposition (SVD) of the between-sets covariance matrix. • In PLS regression, the singular vectors corresponding to the largest singular values are used as a basis; a regression of y onto x is then performed in this basis.
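A sketch of that view of PLS, assuming zero-mean data: take the SVD of the between-sets covariance and regress y on x restricted to the top-k directions. This is a simplified PLS-SVD-style sketch (no deflation as in full NIPALS), and the names are my own.

```python
import numpy as np

def pls_basis_regression(X, Y, k):
    """SVD of the between-sets covariance, then regress Y on X in the top-k basis."""
    n = X.shape[0]
    Cxy = X.T @ Y / n                          # between-sets covariance matrix
    U, s, Vt = np.linalg.svd(Cxy, full_matrices=False)
    Uk = U[:, :k]                              # basis vectors for x (largest singular values)
    T = X @ Uk                                 # scores: projection of X onto the basis
    B, *_ = np.linalg.lstsq(T, Y, rcond=None)  # regression of Y onto the projected X
    return Uk, B
```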
The basic idea • Use CCA to project the data down to the subspace spanned by the means to get an easier clustering problem, then apply standard clustering algorithms in this space. • When the data in at least one of the views is well separated, this algorithm clusters correctly with high probability.
Algorithm • Input: a set of samples S and the number of clusters k. • Randomly partition S into two subsets A and B of equal size. • Let C_12(A) be the covariance matrix between views 1 and 2, computed from the set A. Compute the top k-1 left singular vectors of C_12(A), and project the samples in B onto the subspace spanned by these vectors. • Apply a clustering algorithm (single-linkage clustering or k-means) to the projected examples in view 1.
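A rough numpy/scikit-learn sketch of these steps. The function name, the centering choice, and the use of k-means for the final step are my own; this is not the authors' code.

```python
import numpy as np
from sklearn.cluster import KMeans

def multiview_cca_cluster(X1, X2, k, seed=0):
    """Cluster samples given two views X1, X2 (each of shape n_samples x d_view)."""
    rng = np.random.default_rng(seed)
    n = X1.shape[0]
    perm = rng.permutation(n)
    A, B = perm[: n // 2], perm[n // 2:]      # random partition into equal-size subsets

    # Covariance between views 1 and 2, computed from subset A (centered on A).
    X1A = X1[A] - X1[A].mean(axis=0)
    X2A = X2[A] - X2[A].mean(axis=0)
    C12 = X1A.T @ X2A / len(A)

    # Top k-1 left singular vectors of C_12(A).
    U, _, _ = np.linalg.svd(C12, full_matrices=False)
    U = U[:, : k - 1]

    # Project the held-out samples in B (view 1) onto this subspace and cluster.
    Z = (X1[B] - X1[A].mean(axis=0)) @ U
    labels_B = KMeans(n_clusters=k, n_init=10, random_state=seed).fit_predict(Z)
    return B, labels_B
```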
Speaker identification • Dataset: 41 speakers, speaking 10 sentences each • Audio features: 1584 dimensions • Video features: 2394 dimensions • Method 1: use PCA to project into 40 dimensions • Method 2: use CCA (after PCA to 100 dimensions for images and 1000 dimensions for audio) • Cluster into 82 clusters (2 per speaker) using k-means
Speaker identification • Evaluation: conditional perplexity = the mean number of speakers corresponding to each cluster
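One common reading of this metric, which I assume here, is the perplexity of the speaker distribution conditioned on the cluster, 2^H(speaker | cluster); roughly, the average number of speakers a cluster could plausibly correspond to (lower is better). A small sketch of that computation (my own implementation, not the authors' code):

```python
import numpy as np

def conditional_perplexity(cluster_ids, speaker_ids):
    """Perplexity 2**H(speaker | cluster), with clusters weighted by their size."""
    n = len(cluster_ids)
    cond_entropy = 0.0
    for c in np.unique(cluster_ids):
        speakers = speaker_ids[cluster_ids == c]
        _, counts = np.unique(speakers, return_counts=True)
        p = counts / counts.sum()
        cond_entropy += (len(speakers) / n) * -(p * np.log2(p)).sum()
    return 2.0 ** cond_entropy
```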
Clustering Wikipedia articles • Dataset: 128 K Wikipedia articles, evaluated on the 73 K articles that belong to the 500 most frequent categories. • Link structure feature L is a concatenation of "to" and "from" vectors, where L(i) is the number of times the current article links to / is linked from article i. • Text feature is a bag-of-words vector. • Methods: compared PCA and CCA. • Used a hierarchical clustering procedure: iteratively pick the largest cluster, reduce its dimensionality using PCA or CCA, and use k-means to break it into smaller clusters, until the total desired number of clusters is reached.
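A rough sketch of that iterative procedure, using PCA for the per-cluster dimensionality reduction (the CCA variant would instead project with the singular vectors of the between-view covariance, as in the algorithm above). All names and parameter choices here are illustrative.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA

def hierarchical_split(X, n_clusters_total, dim=10, branch=2, seed=0):
    """Iteratively split the largest cluster until n_clusters_total clusters exist."""
    clusters = [np.arange(len(X))]            # start with one cluster holding everything
    while len(clusters) < n_clusters_total:
        clusters.sort(key=len)
        largest = clusters.pop()              # pick the largest cluster
        if len(largest) < branch:             # too small to split further
            clusters.append(largest)
            break
        Z = PCA(n_components=min(dim, len(largest) - 1)).fit_transform(X[largest])
        labels = KMeans(n_clusters=branch, n_init=10, random_state=seed).fit_predict(Z)
        parts = [largest[labels == b] for b in range(branch) if np.any(labels == b)]
        if len(parts) < 2:                    # split failed to separate anything; stop
            clusters.append(largest)
            break
        clusters.extend(parts)
    return clusters
```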
APPENDIX: A note on correlation • The correlation between x_i and x_j is the covariance normalized by the geometric mean of the variances of x_i and x_j: corr(x_i, x_j) = cov(x_i, x_j) / sqrt(var(x_i) var(x_j))
Affine transformations • An affine transformation is a map of the form f(x) = Ax + b, where A is a linear transformation (a matrix) and b is a translation vector.