Sufficient Dimensionality Reduction with Irrelevance Statistics
Amir Globerson¹, Gal Chechik², Naftali Tishby¹
¹ Center for Neural Computation and School of Computer Science and Engineering, The Hebrew University
² Robotics Lab, CS Department, Stanford University
[Diagram: X → ? → Y]
• Goal: Find a simple representation of X which captures its relation to Y
• Q1: Clustering is not always the most appropriate description (e.g. PCA vs. clustering)
• Q2: Data may contain many structures, and not all of them are relevant for a given task
Talk layout
• Continuous feature extraction using SDR
• Using irrelevance data in unsupervised learning
• SDR with irrelevance data
• Applications
Continuous features
[Figure: term–paper co-occurrence matrix, with papers grouped into Applications vs. Theory]
• What would be a simple representation of the papers?
• Clustering is not a good solution; the content varies on a continuous scale
• Look at the mean number of occurrences of words in the following groups:
  – figure, performance, improvement, empirical
  – equation, inequality, integral
• Better to look at weighted means (e.g. "figure" is only loosely related to results)
• The means give a continuous index reflecting the content of the document
Information in Expectations
• Represent p(X|y) via the expected value of some function φ(X)
• Look at ⟨φ(X)⟩_p(X|y)
• A set of |Y| values, representing p(X|y)
[Diagram: the conditional distribution p(x₁|y), …, p(xₙ|y) summarized by the expectations ⟨φ₁(X)⟩_p(X|y), ⟨φ₂(X)⟩_p(X|y)]
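To make the expectation representation concrete, here is a minimal Python sketch (not from the talk) that computes ⟨φ(X)⟩_p(X|y) for two hand-picked word-weight features from a toy term–document count matrix; the vocabulary, counts, and feature weights are all invented for illustration.

```python
import numpy as np

# Toy term-document counts: rows = words (X), columns = document classes (Y).
# Vocabulary and counts are invented purely for illustration.
vocab = ["figure", "performance", "empirical", "equation", "integral"]
counts = np.array([
    [30.,  2.],   # figure
    [25.,  1.],   # performance
    [20.,  3.],   # empirical
    [ 2., 28.],   # equation
    [ 1., 22.],   # integral
])  # columns: y = "applications", y = "theory"

# Conditional distributions p(x|y): normalize each column.
p_x_given_y = counts / counts.sum(axis=0, keepdims=True)

# Two hand-picked features phi_1, phi_2 over the vocabulary
# (the weights are assumptions, roughly "applications-flavoured"
# vs. "theory-flavoured" words).
phi = np.array([
    [1.0, 0.8, 0.9, 0.0, 0.1],   # phi_1
    [0.0, 0.1, 0.2, 1.0, 0.9],   # phi_2
])

# <phi_d(X)>_{p(X|y)}: one expected value per feature and per value of y,
# i.e. |Y| numbers per feature summarizing each conditional p(X|y).
expectations = phi @ p_x_given_y
print(expectations)   # shape (2 features, 2 values of y)
```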
Examples
• A weighted sum of word counts in a document can be informative about its content
• Weighted grey levels in specific image areas may reveal its identity
• Mean expression level can reveal tissue identity
• But what are the best features to use?
• Need a measure of the information in expected values
Quantifying information in expectations
• Possible measures? I(φ(X);Y)? It equals I(X;Y) for any 1-1 φ(x), and goes to H(Y) as n grows
• We want to extract only the information related to the expected values
• Consider all distributions which have the given expected values, and choose the least informative one
Quantifying information in expectations
• Define the set of distributions which agree with p on the expected values of φ(x) and on the marginals:
  P(φ, p) = { p̃(x,y) : ⟨φ(x)⟩_p̃(x|y) = ⟨φ(x)⟩_p(x|y) for all y, p̃(x) = p(x), p̃(y) = p(y) }
• We define the information in measuring φ(x) on p(x,y) as
  Ĩ[φ] = min over p̃ ∈ P(φ, p) of I_p̃(X;Y)
Sufficient Dimensionality Reduction (SDR)
• Find φ(x) which maximizes Ĩ[φ]
• Equivalent to finding the maximum likelihood parameters for an exponential model in which φ(x) is a sufficient statistic, of the form p(x,y) ∝ exp(φ(x)ψ(y)) up to marginal terms (a toy ML-fitting sketch follows this slide)
• Can be done using an iterative projection algorithm (GT, JMLR 03)
• Produces useful features for document analysis
• But what if p(x,y) contains many structures?
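The sketch below is not the paper's iterative projection algorithm; it is only a plain gradient-ascent maximum-likelihood fit of a single-feature exponential model of the assumed form p(x,y) ∝ exp(φ(x)ψ(y) + a(x) + b(y)) to a random toy joint distribution, meant to make the "maximum likelihood in an exponential family" reading concrete. The per-x and per-y terms a(x), b(y) are an assumption of this sketch.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy empirical joint distribution p_hat(x, y) over small finite alphabets.
p_hat = rng.random((6, 4))
p_hat /= p_hat.sum()

nx, ny = p_hat.shape
phi = rng.normal(size=nx)   # feature values phi(x)
psi = rng.normal(size=ny)   # feature values psi(y)
a = np.zeros(nx)            # per-x term (assumed marginal-matching term)
b = np.zeros(ny)            # per-y term

def model(phi, psi, a, b):
    """p(x, y) proportional to exp(phi(x) * psi(y) + a(x) + b(y))."""
    logits = np.outer(phi, psi) + a[:, None] + b[None, :]
    p = np.exp(logits - logits.max())
    return p / p.sum()

lr = 0.2
for _ in range(5000):
    p = model(phi, psi, a, b)
    # Exponential-family gradient of the log-likelihood: empirical expectation
    # of each sufficient statistic minus its expectation under the model.
    g_phi = (p_hat - p) @ psi          # statistic paired with phi(x) is psi(y)
    g_psi = (p_hat - p).T @ phi        # statistic paired with psi(y) is phi(x)
    g_a = (p_hat - p).sum(axis=1)      # matches the x marginal
    g_b = (p_hat - p).sum(axis=0)      # matches the y marginal
    phi += lr * g_phi
    psi += lr * g_psi
    a += lr * g_a
    b += lr * g_b

p = model(phi, psi, a, b)
print("KL(p_hat || model):", np.sum(p_hat * np.log(p_hat / p)))
```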
Talk layout
• Feature extraction using SDR
• Irrelevance data in unsupervised learning
• Enhancement of SDR with irrelevance data
• Applications
Relevant and Irrelevant Structures
• Data may contain structures we don't want to learn
• For example:
  – Face recognition: face geometry is important, illumination is not
  – Speech recognition: the spectral envelope is important, pitch is not (in English)
  – Document classification: content is important, style is not
  – Gene classification: a given gene may be involved in pathological as well as normal pathways
• Relevance is not absolute; it is task dependent
Irrelevance Data
• Data sets which contain only irrelevant structures are often available (Chechik and Tishby, NIPS 2002):
  – Images of one person under different illumination conditions
  – Recordings of one word uttered in different intonations
  – Documents of similar content but different styles
  – Gene expression patterns from healthy tissues
• Find features which avoid the irrelevant structures
Learning with Irrelevance Data
• Main data D+, irrelevance data D−
• Given a model of the data f, Q(f, D) is some quantifier of the goodness of the feature f on the dataset D (e.g. likelihood, information)
• We want to find max_f Q(f, D+) − Q(f, D−) (illustrated with a toy variance-based Q after this slide)
• This has been demonstrated successfully (CT, 2002) for the case where
  – f = p(T|X), a soft clustering
  – Q(f, D) = I(T;Y)
• The principle is general and can be applied to any modeling scheme
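The subtraction principle max_f Q(f, D+) − Q(f, D−) does not depend on the particular quantifier. The sketch below illustrates it with the simplest possible choice, Q = variance of a 1-D linear projection, for which the optimum over unit-norm f is just the top eigenvector of C+ − γC−. This choice of Q, the weight γ, and the toy data are assumptions of the sketch; it is not the information-theoretic Q used in the talk, nor necessarily the CPCA baseline mentioned later.

```python
import numpy as np

rng = np.random.default_rng(1)

def contrastive_direction(D_plus, D_minus, gamma=1.0):
    """Maximize Q(f, D+) - gamma * Q(f, D-) with Q = variance of the projection on f.
    For this Q, the unit-norm optimum is the top eigenvector of C+ - gamma * C-."""
    C_plus = np.cov(D_plus, rowvar=False)
    C_minus = np.cov(D_minus, rowvar=False)
    vals, vecs = np.linalg.eigh(C_plus - gamma * C_minus)
    return vecs[:, -1]   # eigenvector of the largest eigenvalue

# Toy data: both sets share a high-variance "irrelevant" direction (axis 0);
# only D+ also varies along the "relevant" direction (axis 1).
irrelevant = rng.normal(scale=5.0, size=(500, 1))
relevant = rng.normal(scale=1.0, size=(500, 1))
D_plus = np.hstack([irrelevant, relevant]) + rng.normal(scale=0.1, size=(500, 2))
D_minus = np.hstack([rng.normal(scale=5.0, size=(500, 1)),
                     np.zeros((500, 1))]) + rng.normal(scale=0.1, size=(500, 2))

print("plain PCA direction  :",
      np.linalg.eigh(np.cov(D_plus, rowvar=False))[1][:, -1])
print("contrastive direction:", contrastive_direction(D_plus, D_minus))
# Plain PCA picks the high-variance irrelevant axis; the contrastive
# objective cancels it and recovers the axis present only in D+.
```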
Talk layout
• Information in expectations (SDR)
• Irrelevance data in unsupervised learning
• Enhancement of SDR with irrelevance data
• Applications
Adding Irrelevance Statistics to SDR
• Using Ĩ[φ] as our goodness-of-feature quantifier, we can use two distributions: a relevant p+(x,y) and an irrelevant p−(x,y)
• The optimal feature is then φ* = argmax_φ L(φ), with L(φ) = Ĩ[φ; p+] − γ Ĩ[φ; p−]
• For γ = 0 we have SDR
Calculating φ*(x)
• When γ = 0, an iterative algorithm can be devised (Globerson and Tishby 02)
• Otherwise, the gradient of L(φ) can be calculated and ascended
Synthetic Example
[Figure: synthetic example with relevant data D+ and irrelevant data D−]
Phase Transitions
[Figure: the extracted feature φ(x), exhibiting phase transitions]
Talk layout
• Feature extraction using SDR
• Irrelevance data in unsupervised learning
• SDR with irrelevance data
• Applications
Extracting a single feature
• The AR dataset consists of images of 50 men and women at different illuminations and postures
• We took the following distributions:
  – Relevant: 50 men at two illumination conditions (right and left)
  – Irrelevant: 50 women at the same illumination conditions
• Expected features: discriminate between the men, but not between the illuminations
Face clustering task
• Took 5 men with 26 different postures
• The task is to cluster the images according to their identity
• Took 26 images of another man as irrelevance data
• Performed dimensionality reduction using several methods (PCA, OPCA, CPCA and SDR-IS) and measured precision on the reduced data
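A minimal sketch of this kind of evaluation pipeline (reduce the dimensionality, cluster, score the clusters against the true identities) is given below. It uses scikit-learn on random stand-in data, PCA as the only reduction, and cluster purity as the precision measure; the actual AR face images, the other reduction methods (OPCA, CPCA, SDR-IS), and the exact precision measure used in the talk are not reproduced here.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

rng = np.random.default_rng(2)

# Stand-in data: 5 "identities" x 26 "images", each a 200-dimensional vector.
# (Random blobs; the real experiment used face images from the AR dataset.)
n_ids, n_imgs, dim = 5, 26, 200
centers = rng.normal(scale=3.0, size=(n_ids, dim))
X = np.vstack([c + rng.normal(size=(n_imgs, dim)) for c in centers])
labels = np.repeat(np.arange(n_ids), n_imgs)

def cluster_purity(true, pred):
    """Fraction of samples whose cluster's majority label matches their own label."""
    correct = 0
    for c in np.unique(pred):
        members = true[pred == c]
        correct += np.bincount(members).max()
    return correct / len(true)

# Reduce with PCA (one possible reduction), then cluster by identity with k-means.
Z = PCA(n_components=4).fit_transform(X)
pred = KMeans(n_clusters=n_ids, n_init=10, random_state=0).fit_predict(Z)
print("cluster purity:", cluster_purity(labels, pred))
```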
Conclusions
• Presented a method for feature extraction based on expected values of features of X
• Showed how it can be augmented to avoid irrelevant structures
• Future work:
  – Eliminate the dependence on the dimension of Y via compression constraints
  – Extend to the multivariate case (graphical models)