Techniques For Exploiting Unlabeled Data Thesis Defense September 8, 2008 Mugizi Rwebangira Committee: Avrim Blum, CMU (Co-Chair) John Lafferty, CMU (Co-Chair) William Cohen, CMU Xiaojin (Jerry) Zhu, Wisconsin
Motivation Supervised Machine Learning: Labeled Examples {(x_i, y_i)} → (induction) → Model x → y. Problems: Document classification, image classification, protein sequence determination. Algorithms: SVM, Neural Nets, Decision Trees, etc.
Motivation In recent years, there has been growing interest in techniques for using unlabeled data: More data is being collected than ever before. Labeling examples can be expensive and/or require human intervention.
Examples Images: abundantly available (digital cameras), but labeling requires humans (captchas). Proteins: the sequence can be easily determined, but structure determination is a hard problem. Web Pages: can be easily crawled on the web, but labeling requires human intervention.
Motivation Semi-Supervised Machine Learning: Labeled Examples {(x_i, y_i)} + Unlabeled Examples {x_i} → Model x → y
Motivation [Figure: a few labeled + and − examples among many unlabeled points]
However… Semi-supervised techniques are not as well developed as supervised techniques: there are fewer techniques for adapting supervised algorithms to the semi-supervised setting, and fewer established best practices for using unlabeled data.
Outline Motivation Randomized Graph Mincut Local Linear Semi-supervised Regression Learning with Similarity Functions Conclusion and Questions
Mincut Add auxiliary “super-nodes”, one connected to the + labeled examples and one connected to the − labeled examples.
Mincut Obtain the s-t mincut between the two super-nodes.
Mincut Classification: the side of the cut an unlabeled example falls on gives its label.
Problem Plain mincut can give very unbalanced cuts.
Solution 1. Add random weights to the edges. 2. Run plain mincut and obtain a classification. 3. Repeat the above process several times. 4. For each unlabeled example, take a majority vote.
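To make the procedure concrete, here is a minimal sketch (not the thesis code). It assumes the similarity graph is a networkx Graph whose edge weights are stored in a "capacity" attribute; the number of runs and the noise scale are illustrative choices.

```python
# A minimal sketch of randomized mincut: perturb edge weights, take the s-t
# mincut between the auxiliary super-nodes, and majority-vote over runs.
import random
import networkx as nx

def randomized_mincut(G, pos_nodes, neg_nodes, runs=20, noise=0.5, seed=0):
    rng = random.Random(seed)
    H = G.copy()
    # Attach the auxiliary super-nodes with (effectively) infinite capacity.
    for v in pos_nodes:
        H.add_edge("+", v, capacity=float("inf"))
    for v in neg_nodes:
        H.add_edge("-", v, capacity=float("inf"))

    votes = {v: 0 for v in G.nodes()}
    for _ in range(runs):
        R = H.copy()
        # Perturb each original edge weight with random noise.
        for u, v, data in R.edges(data=True):
            if data["capacity"] != float("inf"):
                data["capacity"] += noise * rng.random()
        # Plain s-t mincut: nodes on the "+" side get a positive vote.
        _, (pos_side, _) = nx.minimum_cut(R, "+", "-")
        for v in pos_side:
            if v in votes:
                votes[v] += 1
    # Majority vote over all runs.
    return {v: (+1 if c > runs / 2 else -1) for v, c in votes.items()}
```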
Mincut [Figure: the cut obtained before adding random weights]
Mincut [Figure: the cut obtained after adding random weights]
PAC-Bayes • PAC-Bayes bounds suggest that when the graph has many small cuts consistent with the labeling, randomization should improve generalization performance. • In this case each distinct cut corresponds to a different hypothesis. • Hence the average of these cuts will be less likely to overfit than any single cut.
Markov Random Fields • Ideally we would like to assign a weight to each cut in the graph (a higher weight to small cuts) and then take a weighted vote over all the cuts in the graph. • This corresponds to a Markov Random Field model. • We don’t know how to do this efficiently, but we can view randomized mincuts as an approximation.
How to construct the graph? • k-NN • Graph may not have small balanced cuts. • How to learn k? • Connect all points within distance δ • Can have disconnected components. • How to learn δ? • Minimum Spanning Tree • No parameters to learn. • Gives connected, sparse graph. • Seems to work well on most datasets.
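As an illustration of the minimum-spanning-tree option above, here is a small sketch assuming the examples are rows of a NumPy array and Euclidean distance is used; the resulting adjacency holds distances, which would then be converted to similarity weights.

```python
# A minimal sketch of the MST graph construction over the examples.
import numpy as np
from scipy.spatial.distance import squareform, pdist
from scipy.sparse.csgraph import minimum_spanning_tree

def mst_graph(X):
    D = squareform(pdist(X))           # pairwise Euclidean distances
    T = minimum_spanning_tree(D)       # sparse matrix holding the MST edges
    T = T.toarray()
    A = np.maximum(T, T.T)             # symmetrize: undirected adjacency (distances)
    return A                           # n-1 edges: connected and sparse
```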
Experiments • ONE vs. TWO: 1128 examples (8 × 8 array of integers, Euclidean distance). • ODD vs. EVEN: 4000 examples (16 × 16 array of integers, Euclidean distance). • PC vs. MAC: 1943 examples (20 newsgroups dataset, TFIDF distance).
Summary Randomization helps plain mincut achieve performance comparable to Gaussian Fields. We can apply a PAC-Bayes sample complexity analysis and interpret it in terms of Markov Random Fields. There is an intuitive interpretation of the confidence of a prediction in terms of the “margin” of the vote. • “Semi-supervised Learning Using Randomized Mincuts”, A. Blum, J. Lafferty, M.R. Rwebangira, R. Reddy, ICML 2004
Outline Motivation Randomized Graph Mincut Local Linear Semi-supervised Regression Learning with Similarity Functions Conclusion and Questions
(Supervised) Linear Regression [Figure: a regression line fit to a few labeled points (*)]
Semi-Supervised Regression [Figure: a few labeled points (*) together with many unlabeled points (+) along the x-axis]
Gaussian Fields (Zhu, Ghahramani & Lafferty) Smoothness assumption: things that are close together should have similar values. One way of doing this: minimize ξ(f) = ∑_ij w_ij (f_i − f_j)², where w_ij is the similarity between examples i and j, and f_i and f_j are the predictions for examples i and j.
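For reference, with the labeled values held fixed the minimizer of this objective has a closed form, the harmonic solution of Zhu, Ghahramani & Lafferty. A minimal sketch, assuming W is a symmetric similarity matrix ordered so that the labeled examples come first:

```python
# Harmonic solution: minimize fᵀLf with the labeled entries of f clamped.
import numpy as np

def harmonic_solution(W, f_l):
    n, l = W.shape[0], len(f_l)
    D = np.diag(W.sum(axis=1))     # degree matrix
    L = D - W                      # graph Laplacian; ξ(f) is proportional to fᵀLf
    L_uu = L[l:, l:]               # unlabeled-unlabeled block
    W_ul = W[l:, :l]               # unlabeled-labeled block
    f_u = np.linalg.solve(L_uu, W_ul @ np.asarray(f_l))
    return f_u                     # predictions for the unlabeled points
```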
Local Constancy The predictions made by Gaussian Fields are locally constant. More formally: m(u + Δ) ≈ m(u).
Local Linearity For many regression tasks we would prefer predictions to be locally linear. More formally: m(u + Δ) ≈ m(u) + m′(u)·Δ.
Problem Develop a version of Gaussian Fields which is locally linear, i.e. a semi-supervised version of linear regression: Local Linear Semi-supervised Regression.
Local Linear Semi-supervised Regression By analogy with ∑_ij w_ij (f_i − f_j)²: fit a local linear model β_j (with intercept β_j0) at each example x_j, and penalize the disagreement (β_i0 − X_jiᵀ β_j)², where β_i0 is the estimate at x_i and X_jiᵀ β_j is the prediction that x_j's local model makes at x_i.
Local Linear Semi-supervised Regression So we find β to minimize the following objective function: ξ(β) = ∑_ij w_ij (β_i0 − X_jiᵀ β_j)², where w_ij is the similarity between x_i and x_j.
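A small sketch of evaluating this objective, under the assumption (my notation, not spelled out on the slide) that X_ji = [1; x_i − x_j], so that X_jiᵀβ_j is the value that x_j's local linear model predicts at x_i:

```python
# Sketch of the LLSR objective ξ(β) = Σ_ij w_ij (β_i0 − X_jiᵀβ_j)².
import numpy as np

def llsr_objective(beta, X, W):
    # beta: (n, d+1) array; column 0 is the intercept β_i0, columns 1: are the slopes.
    # X:    (n, d) array of examples.  W: (n, n) symmetric similarity matrix.
    n = X.shape[0]
    total = 0.0
    for i in range(n):
        for j in range(n):
            pred_ji = beta[j, 0] + beta[j, 1:] @ (X[i] - X[j])  # X_jiᵀ β_j (assumed form)
            total += W[i, j] * (beta[i, 0] - pred_ji) ** 2
    return total
```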
Synthetic Data: GONG The GONG function: y = (1/x) sin(15/x), with noise σ² = 0.1.
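For concreteness, a tiny sketch of generating such data; the range of x is an assumption, since the slide only gives the formula and the noise level:

```python
# Generate noisy samples of the GONG function y = (1/x) sin(15/x), σ² = 0.1.
import numpy as np

def gong_data(n, x_min=0.1, x_max=3.0, seed=0):
    rng = np.random.default_rng(seed)
    x = rng.uniform(x_min, x_max, size=n)
    y = (1.0 / x) * np.sin(15.0 / x) + rng.normal(0.0, np.sqrt(0.1), size=n)
    return x, y
```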
Experimental Results: GONG Weighted Kernel Regression, MSE=25.7
Experimental Results: GONG Local Linear Regression, MSE=14.4
Experimental Results: GONG LLSR, MSE=7.99
PROBLEM: RUNNING TIME If we have n examples and dimension d, then to compute a closed-form solution we have to invert an n(d+1) × n(d+1) matrix. This is prohibitively expensive, especially if d is large. For example, if n = 1500 and d = 199 then we have to invert a matrix that occupies 720 GB in Matlab’s double-precision format.
SOLUTION: ITERATION It turns out that, because of the form of the equation, we can start from an arbitrary initial guess and do an iterative computation that provably converges to the desired solution. In the case of n = 1500 and d = 199, instead of dealing with a 720 GB matrix we only have to store 2.4 MB in memory, which makes the algorithm much more practical.
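The slide does not spell out the update rule, but the memory argument can be illustrated with any matrix-free iterative solver: only vectors of length n(d+1) are ever stored and the system matrix is applied implicitly. A sketch using conjugate gradient, with apply_A as a hypothetical placeholder for the operator defined by the LLSR objective (assumed symmetric positive semi-definite, as a quadratic objective yields):

```python
# Solve Aβ = b without ever forming the n(d+1) × n(d+1) matrix A explicitly.
import numpy as np
from scipy.sparse.linalg import LinearOperator, cg

def solve_iteratively(apply_A, b):
    m = b.shape[0]                      # m = n(d+1); only length-m vectors are stored
    A = LinearOperator((m, m), matvec=apply_A)
    beta, info = cg(A, b)               # conjugate gradient from an arbitrary start
    return beta

# For n = 1500, d = 199 this stores O(n(d+1)) = 300,000 doubles ≈ 2.4 MB,
# rather than a 300,000 × 300,000 dense matrix ≈ 720 GB.
```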
Experiments on Real Data We do model selection using leave-one-out cross-validation. We compare: Weighted Kernel Regression (WKR), a purely supervised method; Local Linear Regression (LLR), another purely supervised method; Local Learning Regularization (LL-Reg), an up-to-date semi-supervised method; and Local Linear Semi-Supervised Regression (LLSR). For each algorithm and dataset we give: 1. the mean and standard deviation of 10 runs, and 2. the results of an OPTIMAL choice of parameters.
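As an illustration of leave-one-out model selection, here is a sketch that picks the bandwidth of a Nadaraya-Watson style weighted kernel regressor (my reading of WKR); the candidate grid is illustrative:

```python
# Leave-one-out selection of the kernel bandwidth for weighted kernel regression.
import numpy as np

def wkr_predict(X_train, y_train, x, h):
    # Gaussian-weighted average of the training targets around x.
    w = np.exp(-np.sum((X_train - x) ** 2, axis=1) / (2 * h ** 2))
    return np.dot(w, y_train) / (np.sum(w) + 1e-12)

def loocv_bandwidth(X, y, candidates=(0.01, 0.03, 0.1, 0.3, 1.0)):
    best_h, best_mse = None, np.inf
    for h in candidates:
        errs = []
        for i in range(len(y)):
            mask = np.arange(len(y)) != i          # hold out example i
            pred = wkr_predict(X[mask], y[mask], X[i], h)
            errs.append((pred - y[i]) ** 2)
        mse = np.mean(errs)
        if mse < best_mse:
            best_h, best_mse = h, mse
    return best_h
```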
Summary LLSR is a natural semi-supervised generalization of linear regression. While the analysis is not as clear as with semi-supervised classification, semi-supervised regression can perform better than supervised regression when the target function is smooth over the data, as with the GONG function. FUTURE WORK: carefully analyzing the assumptions under which unlabeled data can be useful in regression.
Outline Motivation Randomized Graph Mincut Local Linear Semi-supervised Regression Learning with Similarity Functions Conclusion and Questions
Kernels K(x,y): informally considered as a measure of similarity between x and y. Kernel trick: K(x,y) = Φ(x)∙Φ(y) (Mercer’s theorem). This allows us to implicitly project non-linearly separable data into a high-dimensional space where a linear separator can be found. A kernel must satisfy a strict mathematical definition: 1. Continuous 2. Symmetric 3. Positive semi-definite
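A quick sanity check of the kernel trick for the quadratic kernel K(x,y) = (x·y)², whose explicit feature map in R² is Φ(x) = (x1², x2², √2·x1·x2):

```python
# Verify that the quadratic kernel equals the dot product of explicit features.
import numpy as np

def K(x, y):
    return np.dot(x, y) ** 2

def phi(x):
    return np.array([x[0] ** 2, x[1] ** 2, np.sqrt(2) * x[0] * x[1]])

x, y = np.array([1.0, 2.0]), np.array([3.0, -1.0])
assert np.isclose(K(x, y), np.dot(phi(x), phi(y)))   # both equal (1*3 + 2*(-1))² = 1
```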
Problems with Kernels There is a conceptual disconnect between the notion of kernels as similarity functions and the notion of finding max-margin separators in possibly infinite-dimensional Hilbert spaces. The properties required of kernels, such as positive semi-definiteness, are rather restrictive; in particular, similarity functions used in certain domains, such as the Smith-Waterman score in molecular biology, do not fit in this framework. WANTED: A method for using similarity functions that is both easy and general.
The Balcan-Blum approach An approach fitting these requirements was recently proposed by Balcan and Blum. They gave a general definition of a good similarity function for learning, showed that kernels are a special case of their definition, and gave an algorithm for learning with good similarity functions.
The Balcan-Blum approach Suppose S(x,y) ∈ [−1,+1] is our similarity function. Then: 1. Draw d examples {x1, x2, x3, …, xd} uniformly at random from the data set. 2. For each example x compute the mapping x → {S(x,x1), S(x,x2), S(x,x3), …, S(x,xd)}. KEY POINT: This method can make use of UNLABELED DATA.
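A minimal sketch of this mapping, assuming S is any function returning similarities in [−1, +1] and that the landmarks are drawn from a NumPy array of unlabeled examples:

```python
# Map each example to its vector of similarities to d random landmarks.
import numpy as np

def similarity_features(X, unlabeled, S, d, seed=0):
    rng = np.random.default_rng(seed)
    idx = rng.choice(len(unlabeled), size=d, replace=False)  # d random landmarks
    landmarks = unlabeled[idx]
    # Each example x is mapped to (S(x, x1), ..., S(x, xd)).
    return np.array([[S(x, l) for l in landmarks] for x in X])
```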
Combining Feature-Based and Graph-Based Methods Feature-based methods operate directly on the native features, e.g. Decision Trees, MaxEnt, Winnow, Perceptron. Graph-based methods operate on the graph of similarities between examples, e.g. kernel methods, Gaussian Fields, graph mincut and most semi-supervised learning methods. These two kinds of methods work well on different datasets; we want to find a way to COMBINE the two approaches into one algorithm.
SOLUTION: Similarity Functions + Winnow Use the Balcan-Blum approach to generate extra features. Append the extra features to the original features: x → {x, S(x,x1), S(x,x2), S(x,x3), …, S(x,xd)}. Run the Winnow algorithm on the combined features (Winnow is known to be resistant to irrelevant features).
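A sketch (not the thesis implementation) of the combined approach: append the similarity features to the native ones and train a simple balanced Winnow with multiplicative promotion/demotion updates. The feature scaling, learning rate, and threshold are illustrative choices.

```python
# Combine native and similarity features, then run a basic balanced Winnow.
import numpy as np

def combined_features(X, sim_features):
    # x -> (x, S(x,x1), ..., S(x,xd)); sim_features comes from the Balcan-Blum mapping.
    return np.hstack([X, sim_features])

def train_balanced_winnow(F, y, epochs=10, eta=1.1):
    # Scale features to [0, 1]; classic Winnow assumes non-negative features.
    F = (F - F.min(axis=0)) / (np.ptp(F, axis=0) + 1e-12)
    n, m = F.shape
    w_pos, w_neg = np.ones(m), np.ones(m)    # positive and negative weight vectors
    theta = m / 2.0                          # decision threshold
    for _ in range(epochs):
        for i in range(n):
            pred = 1 if (w_pos - w_neg) @ F[i] >= theta else -1
            if pred != y[i]:                 # mistake-driven multiplicative update
                w_pos *= eta ** (y[i] * F[i])
                w_neg *= eta ** (-y[i] * F[i])
    return (w_pos - w_neg), theta
```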