Techniques For Exploiting Unlabeled Data. Thesis Proposal, May 11, 2007. Mugizi Rwebangira. Committee: Avrim Blum, CMU (Co-Chair); John Lafferty, CMU (Co-Chair); William Cohen, CMU; Xiaojin (Jerry) Zhu, Wisconsin.
Motivation
Supervised machine learning: labeled examples $\{(x_i, y_i)\}$ → induction → model $x \to y$
Problems: document classification, image classification, protein sequence determination.
Algorithms: SVMs, neural nets, decision trees, etc.
Motivation
In recent years there has been growing interest in techniques for using unlabeled data:
• More data is being collected than ever before.
• Labeling examples can be expensive and/or require human intervention.
Examples
• Images: abundantly available (digital cameras), but labeling requires humans (captchas).
• Proteins: the sequence can be easily determined, but structure determination is a hard problem.
• Web pages: can be easily crawled on the web, but labeling requires human intervention.
Motivation
Semi-supervised machine learning: labeled examples $\{(x_i, y_i)\}$ together with unlabeled examples $\{x_i\}$ → model $x \to y$
Motivation [Figure: examples labeled + and -]
However…
Semi-supervised techniques are not as well developed as supervised techniques. What is needed:
• Techniques for adapting supervised algorithms to the semi-supervised setting
• Best practices for using unlabeled data
Outline: Motivation; Randomized Graph Mincut; Local Linear Semi-supervised Regression; Learning with Similarity Functions; Proposed Work and Time Line
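Mincut: Add auxiliary “super-nodes”, one joined to the + examples and one to the - examples. [Figure: graph with + and - examples]
Mincut: Obtain the s-t mincut. [Figure: graph with + and - examples]
Mincut: Classification. [Figure: graph with + and - examples]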
Problem: Plain mincut can give very unbalanced cuts.
Solution
1. Add random weights to the edges.
2. Run plain mincut and obtain a classification.
3. Repeat the above process several times.
4. For each unlabeled example, take a majority vote.
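
The randomized procedure can be sketched in plain Python as follows. This is an illustrative sketch, not the authors' implementation: the uniform edge noise, the `rounds` and `noise` parameters, the `BIG` super-node capacity, and the function names are all assumptions, and the min cut is found with a textbook Edmonds-Karp max-flow.

```python
import random
from collections import deque

BIG = 1e9  # assumed "infinite" capacity for super-node edges

def min_cut_source_side(cap, s, t):
    """Edmonds-Karp max flow; returns the nodes on the source side of a minimum s-t cut."""
    res = {u: dict(nbrs) for u, nbrs in cap.items()}  # residual capacities
    while True:
        # Breadth-first search for a shortest augmenting path.
        parent = {s: None}
        queue = deque([s])
        while queue and t not in parent:
            u = queue.popleft()
            for v, c in res[u].items():
                if c > 1e-12 and v not in parent:
                    parent[v] = u
                    queue.append(v)
        if t not in parent:
            return set(parent)  # everything still reachable from s
        path, v = [], t
        while parent[v] is not None:
            path.append((parent[v], v))
            v = parent[v]
        push = min(res[u][v] for u, v in path)
        for u, v in path:
            res[u][v] -= push
            res[v][u] = res[v].get(u, 0.0) + push

def randomized_mincut(n, edges, labels, rounds=25, noise=0.5, seed=0):
    """edges: (u, v, weight) triples; labels: {node: +1 or -1}. Returns a +1/-1 label per node."""
    rng = random.Random(seed)
    votes = [0] * n
    for _ in range(rounds):
        cap = {u: {} for u in list(range(n)) + ["s", "t"]}
        for u, v, w in edges:
            w_r = w + rng.uniform(0.0, noise)  # perturb each edge weight
            cap[u][v] = w_r
            cap[v][u] = w_r
        for node, y in labels.items():
            hub = "s" if y > 0 else "t"  # super-node for this label
            cap[hub][node] = BIG
            cap[node][hub] = BIG
        side = min_cut_source_side(cap, "s", "t")
        for u in range(n):
            votes[u] += 1 if u in side else -1
    return [1 if v > 0 else -1 for v in votes]

# Path graph with one weak edge between nodes 2 and 3.
edges = [(0, 1, 5.0), (1, 2, 5.0), (2, 3, 1.0), (3, 4, 5.0), (4, 5, 5.0)]
print(randomized_mincut(6, edges, {0: +1, 5: -1}))
```

The absolute value of each entry of `votes` would play the role of the vote "margin"; the sketch returns only the sign.
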
Mincut, before adding random weights [Figure: graph with + and - examples]
Mincut, after adding random weights [Figure: graph with + and - examples]
PAC-Bayes
• PAC-Bayes bounds suggest that when the graph has many small cuts consistent with the labeling, randomization should improve generalization performance.
• In this case each distinct cut corresponds to a different hypothesis.
• Hence the average of these cuts will be less likely to overfit than any single cut.
Markov Random Fields
• Ideally we would like to assign a weight to each cut in the graph (a higher weight to small cuts) and then take a weighted vote over all the cuts in the graph.
• This corresponds to a Markov Random Field model.
• We don’t know how to do this efficiently, but we can view randomized mincuts as an approximation.
How to construct the graph?
• k-NN
  - Graph may not have small balanced cuts.
  - How to learn k?
• Connect all points within distance δ
  - Can have disconnected components.
  - How to learn δ?
• Minimum Spanning Tree
  - No parameters to learn.
  - Gives a connected, sparse graph.
  - Seems to work well on most datasets.
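
As an illustration of the parameter-free option, here is a minimal sketch that builds a minimum spanning tree over the examples with Prim's algorithm and Euclidean distance. The function name `mst_edges` and the choice of Prim's algorithm are assumptions made for this example; the slides do not say how the tree was computed.

```python
import math

def mst_edges(points):
    """Prim's algorithm on the complete Euclidean graph; returns n-1 edges (i, j, distance)."""
    n = len(points)
    if n == 0:
        return []
    in_tree = [False] * n
    best = [math.inf] * n  # cheapest known distance from the tree to each node
    link = [-1] * n        # tree node that achieves that distance
    best[0] = 0.0
    edges = []
    for _ in range(n):
        # Pick the closest node that is not yet in the tree.
        u = min((i for i in range(n) if not in_tree[i]), key=lambda i: best[i])
        in_tree[u] = True
        if link[u] >= 0:
            edges.append((link[u], u, best[u]))
        for v in range(n):
            if not in_tree[v]:
                d = math.dist(points[u], points[v])
                if d < best[v]:
                    best[v] = d
                    link[v] = u
    return edges

points = [(0.0, 0.0), (1.0, 0.0), (5.0, 0.0), (6.0, 0.0)]
for i, j, d in mst_edges(points):
    print(i, j, d)
```

The result is connected and has exactly n-1 edges, which matches the "connected, sparse graph" property listed above.
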
Experiments
• ONE vs. TWO: 1128 examples (8 × 8 array of integers, Euclidean distance).
• ODD vs. EVEN: 4000 examples (16 × 16 array of integers, Euclidean distance).
• PC vs. MAC: 1943 examples (20 Newsgroups dataset, TF-IDF distance).
Summary
• Randomization helps plain mincut achieve performance comparable to Gaussian Fields.
• We can apply PAC sample complexity analysis and interpret the method in terms of Markov Random Fields.
• There is an intuitive interpretation of the confidence of a prediction in terms of the “margin” of the vote.
Reference: “Semi-supervised Learning Using Randomized Mincuts”, A. Blum, J. Lafferty, M. R. Rwebangira, R. Reddy, ICML 2004.
Outline: Motivation; Randomized Graph Mincut; Local Linear Semi-supervised Regression; Learning with Similarity Functions; Proposed Work and Time Line
Gaussian Fields (Zhu, Ghahramani & Lafferty)
This algorithm minimizes the following functional:
$\xi(f) = \sum_{i,j} w_{ij} (f_i - f_j)^2$
where $w_{ij}$ is the similarity between examples $i$ and $j$, and $f_i$ and $f_j$ are the predictions for examples $i$ and $j$.
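
To make the functional concrete, the sketch below evaluates it on a small graph and minimizes it over the unlabeled nodes by repeatedly replacing each unlabeled prediction with the weighted average of its neighbours (a Gauss-Seidel sweep). This is an illustrative solver, not the closed-form solution used in the original work; the function names and the `sweeps` parameter are assumptions.

```python
def energy(w, f):
    """Sum of w_ij * (f_i - f_j)^2 over the undirected edges in w."""
    return sum(w_ij * (f[i] - f[j]) ** 2 for (i, j), w_ij in w.items())

def harmonic_solution(n, w, labeled, sweeps=500):
    """w: {(i, j): weight}, each undirected pair listed once; labeled: {node: value}."""
    nbrs = {i: [] for i in range(n)}
    for (i, j), w_ij in w.items():
        nbrs[i].append((j, w_ij))
        nbrs[j].append((i, w_ij))
    f = [labeled.get(i, 0.0) for i in range(n)]
    for _ in range(sweeps):
        for i in range(n):
            if i not in labeled and nbrs[i]:
                total = sum(wt for _j, wt in nbrs[i])
                # Setting the derivative of the functional w.r.t. f_i to zero
                # gives the weighted average of the neighbouring predictions.
                f[i] = sum(wt * f[j] for j, wt in nbrs[i]) / total
    return f

# Chain 0 - 1 - 2 - 3 with the two end points labeled.
w = {(0, 1): 1.0, (1, 2): 1.0, (2, 3): 1.0}
f = harmonic_solution(4, w, {0: 0.0, 3: 1.0})
print(f, energy(w, f))
```

On this chain the predictions interpolate linearly between the two labeled end points.
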
Locally Constant (Kernel regression) [Figure: data points (*) plotted against x and y axes]
Locally Linear [Figure: data points (*) plotted against x and y axes]
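Local Linear Regression
This algorithm minimizes the following functional:
$\xi(\beta) = \sum_i w_i (y_i - \beta^T X_{xi})^2$
where $w_i$ is the similarity between example $x_i$ and the query point $x$, and $\beta$ is the coefficient vector of the local linear fit at $x$. Here $X_{xi}$ is taken to be the usual local linear design vector $(1, x_i - x)$, so the first component of $\beta$ is the prediction at $x$.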
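
A minimal one-dimensional sketch of the two local fits follows. It assumes a Gaussian similarity $w_i = \exp(-(x_i - x)^2 / 2h^2)$ with a hypothetical bandwidth `h`, and solves the weighted least-squares problem for $(\beta_0, \beta_1)$ from its 2 × 2 normal equations; neither the kernel nor the function names come from the slides.

```python
import math

def _weights(xs, x, h):
    """Gaussian similarities between each x_i and the query point x (assumed kernel)."""
    return [math.exp(-((xi - x) ** 2) / (2.0 * h * h)) for xi in xs]

def kernel_regression(xs, ys, x, h):
    """Locally constant fit: similarity-weighted average of the y_i."""
    ws = _weights(xs, x, h)
    return sum(w * y for w, y in zip(ws, ys)) / sum(ws)

def local_linear(xs, ys, x, h):
    """Locally linear fit: minimizes sum_i w_i * (y_i - b0 - b1 * (x_i - x))^2."""
    s0 = s1 = s2 = t0 = t1 = 0.0
    for xi, yi, w in zip(xs, ys, _weights(xs, x, h)):
        u = xi - x
        s0 += w
        s1 += w * u
        s2 += w * u * u
        t0 += w * yi
        t1 += w * u * yi
    det = s0 * s2 - s1 * s1
    b0 = (s2 * t0 - s1 * t1) / det  # prediction at x
    b1 = (s0 * t1 - s1 * t0) / det  # local slope at x
    return b0, b1

xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [2.0 * x + 1.0 for x in xs]  # exactly linear data
print(kernel_regression(xs, ys, 0.0, 1.0))
print(local_linear(xs, ys, 0.0, 1.0))
```

On exactly linear data the local linear fit recovers the line at every query point, while the locally constant fit is pulled toward the interior at the boundary, which is the usual argument for preferring the linear version.
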
Problem: Develop a local linear version of Gaussian Fields, or a semi-supervised version of Local Linear Regression: Local Linear Semi-supervised Regression (LLSR).
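Local Linear Semi-supervised Regression [Figure: local linear fits $\beta_i$ and $\beta_j$ at points $x_i$ and $x_j$, with intercepts $\beta_{i0}$ and $\beta_{j0}$; the bracketed gap between $\beta_{i0}$ and $X_{ji}^T \beta_j$ is the penalized quantity $(\beta_{i0} - X_{ji}^T \beta_j)^2$]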
Local Linear Semi-supervised Regression
This algorithm minimizes the following functional:
$\xi(\beta) = \sum_{i,j} w_{ij} (\beta_{i0} - X_{ji}^T \beta_j)^2$
where $w_{ij}$ is the similarity between $x_i$ and $x_j$.
Synthetic Data: Doppler
Doppler function: $y = (1/x) \sin(15/x)$, with noise variance $\sigma^2 = 0.1$.
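
A small sketch of how such a noisy sample could be generated. The function and the noise variance are from the slide; the sampling interval `[lo, hi]`, the uniform choice of inputs, the sample size, and the function names are hypothetical, since the slides do not state them.

```python
import math
import random

def doppler(x):
    """Noise-free Doppler function y = (1/x) * sin(15/x)."""
    return (1.0 / x) * math.sin(15.0 / x)

def sample_doppler(n, lo=0.5, hi=2.0, noise_var=0.1, seed=0):
    """Draw n inputs uniformly from [lo, hi] (assumed range) and add Gaussian noise."""
    rng = random.Random(seed)
    xs = sorted(rng.uniform(lo, hi) for _ in range(n))
    sd = math.sqrt(noise_var)  # the slide gives the variance, so take the square root
    ys = [doppler(x) + rng.gauss(0.0, sd) for x in xs]
    return xs, ys

xs, ys = sample_doppler(5)
for x, y in zip(xs, ys):
    print(round(x, 3), round(y, 3))
```
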
Experimental Results: DOPPLER. Weighted Kernel Regression: LOOCV MSE = 6.54, MSE = 25.7
Experimental Results: DOPPLER. Local Linear Regression: LOOCV MSE = 80.8, MSE = 14.4
Experimental Results: DOPPLER. LLSR: LOOCV MSE = 2.00, MSE = 7.99
PROBLEM: RUNNING TIME
If the number of examples is n and the dimension of the examples is d, then we have to invert an n(d+1) × n(d+1) matrix. This is prohibitively expensive, especially if d is large.
PROPOSED WORK: Improving Running Time
• Sparsification: ignore examples that are far away so as to get a sparser matrix to invert.
• Iterative methods for solving linear systems: for a matrix equation $Ax = b$, we can obtain successive approximations $x_1, x_2, \dots, x_k$. This can be significantly faster if the matrix $A$ is sparse.
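
As one concrete instance of an iterative solver, here is a sketch of the Jacobi method on a sparse row representation. Jacobi is chosen only because it is short; the slides do not name a specific method, and a practical implementation would more likely use something like conjugate gradient. The sparse format (`rows[i]` maps column index to value), the `x0` starting point, and the iteration count are assumptions.

```python
def jacobi(rows, b, x0=None, iters=100):
    """Jacobi iteration for A x = b; rows[i] is a dict {column: value} of nonzeros in row i.

    Converges when A is strictly diagonally dominant. Each sweep costs time
    proportional to the number of nonzeros, which is where sparsity pays off.
    """
    n = len(b)
    x = list(x0) if x0 is not None else [0.0] * n
    for _ in range(iters):
        # Every component of the new iterate is computed from the previous iterate.
        x = [
            (b[i] - sum(a * x[j] for j, a in rows[i].items() if j != i)) / rows[i][i]
            for i in range(n)
        ]
    return x

# A = [[4, 1], [1, 3]], b = [1, 2]; the exact solution is [1/11, 7/11].
rows = [{0: 4.0, 1: 1.0}, {0: 1.0, 1: 3.0}]
print(jacobi(rows, [1.0, 2.0]))
```

Passing a good initial guess through `x0` reduces the number of sweeps needed for a given accuracy.
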
PROPOSED WORK: Improving Running Time
Power series: use the identity $(I - A)^{-1} = I + A + A^2 + A^3 + \dots$
$y' = (Q + \gamma\Delta)^{-1} P y = Q^{-1} P y + (-\gamma Q^{-1}\Delta) Q^{-1} P y + (-\gamma Q^{-1}\Delta)^2 Q^{-1} P y + \dots$
• A few terms may be sufficient to get a good approximation.
• Compute the supervised answer first, then “smooth” it to get the semi-supervised solution.
• This can be combined with iterative methods, since we can use the supervised solution as the starting point for our iterative algorithm.
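
The expansion follows from the identity in one step. The derivation below is a sketch that assumes $Q$ is invertible and that the spectral radius of $\gamma Q^{-1}\Delta$ is less than 1, which is the standard condition for the series to converge; the truncation index $K$ is a hypothetical name.

```latex
\begin{align*}
Q + \gamma\Delta &= Q\,(I + \gamma Q^{-1}\Delta) = Q\,(I - A),
  \qquad A = -\gamma Q^{-1}\Delta, \\
(Q + \gamma\Delta)^{-1} &= (I - A)^{-1} Q^{-1}
  = \sum_{k=0}^{\infty} A^{k} Q^{-1}, \\
y' = (Q + \gamma\Delta)^{-1} P y
  &= \sum_{k=0}^{\infty} (-\gamma Q^{-1}\Delta)^{k}\, Q^{-1} P y
  \approx \sum_{k=0}^{K} (-\gamma Q^{-1}\Delta)^{k}\, Q^{-1} P y .
\end{align*}
```

The $k = 0$ term, $Q^{-1} P y$, does not involve $\Delta$ at all, and each further term applies $-\gamma Q^{-1}\Delta$ once more to the previous one.
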
PROPOSED WORK: Experimental Evaluation
• Comparison against other proposed semi-supervised regression algorithms.
• Evaluation on a large variety of data sets, especially high-dimensional ones.
Outline: Motivation; Randomized Graph Mincut; Local Linear Semi-supervised Regression; Learning with Similarity Functions; Proposed Work and Time Line
Kernels
$K(x, y) = \Phi(x) \cdot \Phi(y)$
A kernel allows us to implicitly project data that is not linearly separable into a high-dimensional space where a linear separator can be found.
A kernel must satisfy strict mathematical conditions:
1. Continuous
2. Symmetric
3. Positive semi-definite
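
A small worked example of the identity $K(x, y) = \Phi(x) \cdot \Phi(y)$, using the quadratic kernel $K(x, y) = (x \cdot y)^2$ on two-dimensional inputs. The kernel is chosen only for illustration and does not come from the slides; its explicit feature map is $\Phi(x) = (x_1^2, \sqrt{2}\,x_1 x_2, x_2^2)$.

```python
import math

def quad_kernel(x, y):
    """K(x, y) = (x . y)^2, computed directly in the original 2-D space."""
    return (x[0] * y[0] + x[1] * y[1]) ** 2

def phi(x):
    """Explicit 3-D feature map whose dot product reproduces quad_kernel."""
    return (x[0] ** 2, math.sqrt(2.0) * x[0] * x[1], x[1] ** 2)

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

x, y = (1.0, 2.0), (3.0, -1.0)
print(quad_kernel(x, y), dot(phi(x), phi(y)))
```

The two printed values agree up to floating-point rounding, so the kernel evaluates the inner product in the three-dimensional feature space without ever constructing the features.
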
Generic Similarity Functions
What if the best similarity function in a given domain does not satisfy the properties of a kernel? Two options:
1. Use a kernel with inferior performance.
2. Try to “coerce” the similarity function into a kernel by building a kernel that has similar behavior.
There is another way …
The Balcan-Blum approach Recently Balcan and Blum initiated the theory of learning with generic similarity functions. They gave a general definition of a good similarity function for learning and showed that the popular large margin kernels are a special case of their definition. They also gave an algorithm for learning with good similarity functions. Their approach makes use of unlabeled data…
The Balcan-Blum approach
The algorithm is very simple. Suppose $S(x, y)$ is our similarity function. Then:
1. Draw d examples $\{x_1, x_2, x_3, \dots, x_d\}$ uniformly at random from the data set.
2. For each example $x$, compute the mapping $x \to \{S(x, x_1), S(x, x_2), S(x, x_3), \dots, S(x, x_d)\}$
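
The two steps can be sketched as below. The function name `make_similarity_map`, sampling without replacement, the `seed` parameter, and the toy similarity function are assumptions made for illustration; any similarity function `S`, kernel or not, can be plugged in.

```python
import random

def make_similarity_map(data, S, d, seed=0):
    """Pick d landmark examples at random and return (landmarks, phi).

    phi(x) is the d-dimensional vector of similarities between x and the landmarks.
    """
    rng = random.Random(seed)
    landmarks = rng.sample(data, d)  # step 1: draw d examples uniformly at random

    def phi(x):
        # Step 2: map x to its similarities with each landmark.
        return [S(x, lm) for lm in landmarks]

    return landmarks, phi

def toy_similarity(x, y):
    """Hypothetical similarity on real numbers; it need not be a valid kernel."""
    return -abs(x - y)

data = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
landmarks, phi = make_similarity_map(data, toy_similarity, d=3)
print(landmarks)
print(phi(2.5))
```

The landmarks need no labels, which is where unlabeled data enters; in Balcan and Blum's framework a linear separator is then learned on the mapped labeled examples.
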
PROPOSED WORK
Overall goal: Investigate the practical applicability of this theory and find out what is needed to make it work on real problems.
Two main application areas:
1. Domains that have expert-defined similarity functions that are not kernels (protein homology).
2. Domains that have many irrelevant features and in which the data may not be linearly separable in the original features (text classification).
PROPOSED WORK: Protein Homology
• The Smith-Waterman (SW) score is the best-performing measure of similarity, but it does not satisfy the kernel properties.
• Machine learning applications have either used other similarity functions or tried to force the SW score into a kernel.
• Can we achieve better performance by using the SW score directly?