Techniques For Exploiting Unlabeled Data

Techniques For Exploiting Unlabeled Data. Thesis Proposal. May 11, 2007. Mugizi Rwebangira. Committee: Avrim Blum, CMU (Co-Chair); John Lafferty, CMU (Co-Chair); William Cohen, CMU; Xiaojin (Jerry) Zhu, Wisconsin.


Presentation Transcript


  1. Techniques For Exploiting Unlabeled Data. Thesis Proposal, May 11, 2007. Mugizi Rwebangira. Committee: Avrim Blum, CMU (Co-Chair); John Lafferty, CMU (Co-Chair); William Cohen, CMU; Xiaojin (Jerry) Zhu, Wisconsin.

  2. Motivation. Supervised Machine Learning: induction from labeled examples {(x_i, y_i)} to a model x → y. Problems: document classification, image classification, protein sequence determination. Algorithms: SVM, neural nets, decision trees, etc.

  3. Motivation. In recent years, there has been growing interest in techniques for using unlabeled data: more data is being collected than ever before, and labeling examples can be expensive and/or require human intervention.

  4. Examples. Images: abundantly available (digital cameras), but labeling requires humans (captchas). Proteins: the sequence can be easily determined, but structure determination is a hard problem. Web pages: can be easily crawled on the web, but labeling requires human intervention.

  5. Motivation. Semi-Supervised Machine Learning: labeled examples {(x_i, y_i)} together with unlabeled examples {x_i} are used to learn a model x → y.

  6. Motivation [figure: examples labeled + and -]

  7. However… Techniques are not as well developed as supervised techniques: (1) techniques for adapting supervised algorithms to the semi-supervised setting; (2) best practices for using unlabeled data.

  8. Outline: Motivation; Randomized Graph Mincut; Local Linear Semi-supervised Regression; Learning with Similarity Functions; Proposed Work and Time Line.

  9. Graph Mincut (Blum & Chawla, 2001)

  10. Construct an (unweighted) Graph

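  11. Add auxiliary “super-nodes” [figure: a + super-node and a - super-node attached to the graph]

  12. Obtain s-t mincut [figure: mincut between the + and - super-nodes]

  13. Classification [figure: the mincut separates the + side from the - side]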

  14. Problem: Plain mincut can give very unbalanced cuts.

  15. Solution: (1) Add random weights to the edges. (2) Run plain mincut and obtain a classification. (3) Repeat the above process several times. (4) For each unlabeled example take a majority vote.
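
The four steps on this slide can be sketched in plain Python. This is a minimal illustration, not the authors' implementation: the function names, the uniform noise distribution, the `noise` scale, and the Edmonds-Karp max-flow routine used to find each cut are all assumptions made for the example.

```python
import random
from collections import deque

SOURCE, SINK = ("super", "+"), ("super", "-")  # auxiliary super-nodes

def source_side_of_min_cut(cap, s, t):
    """Edmonds-Karp max-flow on residual capacities `cap` (mutated in place).
    Returns the nodes still reachable from s once no augmenting path is left."""
    while True:
        parent = {s: None}
        queue = deque([s])
        while queue and t not in parent:
            u = queue.popleft()
            for v, c in cap[u].items():
                if c > 1e-12 and v not in parent:
                    parent[v] = u
                    queue.append(v)
        if t not in parent:
            return set(parent)
        # Find the bottleneck on the augmenting path, then push flow along it.
        bottleneck, v = float("inf"), t
        while parent[v] is not None:
            bottleneck = min(bottleneck, cap[parent[v]][v])
            v = parent[v]
        v = t
        while parent[v] is not None:
            u = parent[v]
            cap[u][v] -= bottleneck
            cap[v][u] = cap[v].get(u, 0.0) + bottleneck
            v = u

def randomized_mincut(edges, positives, negatives, rounds=25, noise=0.5, seed=0):
    """Majority vote over mincuts of randomly re-weighted copies of the graph."""
    rng = random.Random(seed)
    nodes = {u for e in edges for u in e} | set(positives) | set(negatives)
    votes = {u: 0 for u in nodes}
    big = (len(edges) + 1) * (1.0 + noise)  # exceeds the total edge weight
    for _ in range(rounds):
        cap = {u: {} for u in nodes}
        cap[SOURCE], cap[SINK] = {}, {}
        for u, v in edges:  # step 1: unit weight plus random noise (assumed uniform)
            w = 1.0 + rng.uniform(0.0, noise)
            cap[u][v] = w
            cap[v][u] = w
        for p in positives:
            cap[SOURCE][p] = big
            cap[p][SOURCE] = 0.0
        for q in negatives:
            cap[q][SINK] = big
            cap[SINK][q] = 0.0
        side = source_side_of_min_cut(cap, SOURCE, SINK)  # step 2
        for u in nodes:  # steps 3 and 4: accumulate votes across rounds
            votes[u] += 1 if u in side else -1
    return {u: "+" if votes[u] > 0 else "-" for u in nodes}
```

The size of the vote total for a node gives a rough confidence for its predicted label. Using an odd number of rounds avoids ties.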

  16. Before adding random weights [figure: mincut on the graph with + and - super-nodes]

  17. After adding random weights [figure: mincut on the graph with + and - super-nodes]

  18. PAC-Bayes • PAC-Bayes bounds suggest that when the graph has many small cuts consistent with the labeling, randomization should improve generalization performance. • In this case each distinct cut corresponds to a different hypothesis. • Hence the average of these cuts will be less likely to overfit than any single cut.

  19. Markov Random Fields • Ideally we would like to assign a weight to each cut in the graph (a higher weight to small cuts) and then take a weighted vote over all the cuts in the graph. • This corresponds to a Markov Random Field model. • We don’t know how to do this efficiently, but we can view randomized mincuts as an approximation.

  20. How to construct the graph? • k-NN: the graph may not have small balanced cuts; how to learn k? • Connect all points within distance δ: can have disconnected components; how to learn δ? • Minimum Spanning Tree: no parameters to learn; gives a connected, sparse graph; seems to work well on most datasets.

  21. Experiments • ONE vs. TWO: 1128 examples (8 × 8 array of integers, Euclidean distance). • ODD vs. EVEN: 4000 examples (16 × 16 array of integers, Euclidean distance). • PC vs. MAC: 1943 examples (20 newsgroup dataset, TFIDF distance).
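
As an illustration of the parameter-free option, the sketch below builds a minimum spanning tree over a set of points with Prim's algorithm and Euclidean distance. The function name and the choice of Prim's algorithm are assumptions for the example; the slides do not say how the tree was computed.

```python
import math

def minimum_spanning_tree(points):
    """Prim's algorithm on the complete Euclidean graph over `points`.
    Returns n - 1 edges as (i, j) index pairs; runs in O(n^2) time."""
    n = len(points)
    if n == 0:
        return []
    in_tree = [False] * n
    best_dist = [math.inf] * n   # cheapest known connection to the tree
    best_from = [-1] * n         # tree node providing that connection
    best_dist[0] = 0.0
    edges = []
    for _ in range(n):
        # Pick the closest node that is not yet in the tree.
        u = min((i for i in range(n) if not in_tree[i]), key=lambda i: best_dist[i])
        in_tree[u] = True
        if best_from[u] >= 0:
            edges.append((best_from[u], u))
        # Update the cheapest connection for every remaining node.
        for v in range(n):
            if not in_tree[v]:
                d = math.dist(points[u], points[v])
                if d < best_dist[v]:
                    best_dist[v] = d
                    best_from[v] = u
    return edges
```

The result is always connected and has exactly n - 1 edges, which matches the "connected, sparse graph" property noted on the slide.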

  22. ONE vs. TWO

  23. ODD vs. EVEN

  24. PC vs. MAC

  25. Summary. Randomization helps plain mincut achieve performance comparable to Gaussian Fields. We can apply PAC sample complexity analysis and interpret it in terms of Markov Random Fields. There is an intuitive interpretation for the confidence of a prediction in terms of the “margin” of the vote. Reference: “Semi-supervised Learning Using Randomized Mincuts”, A. Blum, J. Lafferty, M.R. Rwebangira, R. Reddy, ICML 2004.

  26. Outline: Motivation; Randomized Graph Mincut; Local Linear Semi-supervised Regression; Learning with Similarity Functions; Proposed Work and Time Line.

  27. Gaussian Fields (Zhu, Ghahramani & Lafferty). This algorithm minimizes the following functional: $\xi(f) = \sum_{i,j} w_{ij}(f_i - f_j)^2$, where $w_{ij}$ is the similarity between examples i and j, and $f_i$ and $f_j$ are the predictions for examples i and j.
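
The functional can be evaluated and minimized on a small graph as follows. This is a sketch under two assumptions not stated on the slide: predictions on labeled examples are held fixed at their labels (as in the cited Gaussian Fields work), and the minimizer is found by repeatedly replacing each unlabeled prediction with the weighted average of its neighbors. The function names are illustrative.

```python
def energy(w, f):
    """xi(f) = sum over i, j of w[i][j] * (f[i] - f[j])**2."""
    n = len(f)
    return sum(w[i][j] * (f[i] - f[j]) ** 2 for i in range(n) for j in range(n))

def harmonic_solution(w, labels, sweeps=500):
    """Minimize the energy with labeled predictions clamped.
    `labels` maps example index -> label value. Every unlabeled example
    must have at least one neighbor with positive weight."""
    n = len(w)
    f = [labels.get(i, 0.5) for i in range(n)]
    for _ in range(sweeps):
        for i in range(n):
            if i in labels:
                continue  # labeled predictions stay fixed
            total = sum(w[i][j] for j in range(n) if j != i)
            f[i] = sum(w[i][j] * f[j] for j in range(n) if j != i) / total
    return f

# Chain graph 0 - 1 - 2 - 3 with unit similarities; the two ends are labeled.
W = [[0, 1, 0, 0],
     [1, 0, 1, 0],
     [0, 1, 0, 1],
     [0, 0, 1, 0]]
F = harmonic_solution(W, {0: 1.0, 3: 0.0})
```

On the chain, the predictions interpolate linearly between the two labeled ends, which is the smoothest assignment under this functional.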

  28. Locally Constant (Kernel regression) [figure: data points plotted as y against x]

  29. Locally Linear [figure: data points plotted as y against x]

  30. Local Linear Regression. This algorithm minimizes the following functional: $\xi(\beta) = \sum_i w_i (y_i - \beta^T X_{xi})^2$, where $w_i$ is the similarity between examples i and x, and $\beta$ is the coefficient of the local linear fit at x.
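
A one-dimensional sketch contrasting the locally constant and locally linear fits. It assumes a Gaussian similarity with a hypothetical `bandwidth` parameter and takes the regressor at each example to be (1, x_i - x), so the intercept of the fit is the prediction at x. None of these choices are specified on the slides.

```python
import math

def gaussian_weight(xi, x0, bandwidth):
    """Similarity between example xi and query point x0 (assumed Gaussian)."""
    return math.exp(-((xi - x0) ** 2) / (2.0 * bandwidth ** 2))

def kernel_regression(xs, ys, x0, bandwidth):
    """Locally constant fit: a similarity-weighted average of the labels."""
    ws = [gaussian_weight(xi, x0, bandwidth) for xi in xs]
    return sum(w * y for w, y in zip(ws, ys)) / sum(ws)

def local_linear_fit(xs, ys, x0, bandwidth):
    """Minimize sum_i w_i * (y_i - b0 - b1 * (x_i - x0))**2.
    Returns (b0, b1); b0 is the prediction at x0."""
    s0 = s1 = s2 = t0 = t1 = 0.0
    for xi, yi in zip(xs, ys):
        d = xi - x0
        w = gaussian_weight(xi, x0, bandwidth)
        s0 += w
        s1 += w * d
        s2 += w * d * d
        t0 += w * yi
        t1 += w * d * yi
    # Solve the 2x2 weighted normal equations by Cramer's rule.
    det = s0 * s2 - s1 * s1
    b0 = (s2 * t0 - s1 * t1) / det
    b1 = (s0 * t1 - s1 * t0) / det
    return b0, b1

xs = [0.0, 1.0, 2.0, 3.0]
ys = [2.0 * x + 1.0 for x in xs]  # noise-free line y = 2x + 1
```

On exactly linear data the local linear fit recovers the line at the boundary point x = 0, while the locally constant fit is pulled toward the interior labels.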

  31. Problem Develop Local Linear version of Gaussian Fields Or semi-supervised version of Local Linear Regression Local Linear Semi-supervised Regression

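  32. Local Linear Semi-supervised Regression [figure: local linear fits $\beta_i$ at $x_i$ and $\beta_j$ at $x_j$, with labels $\beta_{i0}$, $\beta_{j0}$, $X_{ji}^T\beta_j$, and the squared difference $(\beta_{i0} - X_{ji}^T\beta_j)^2$]

  33. Local Linear Semi-supervised Regression. This algorithm minimizes the following functional: $\xi(\beta) = \sum_{i,j} w_{ij}(\beta_{i0} - X_{ji}^T\beta_j)^2$, where $w_{ij}$ is the similarity between $x_i$ and $x_j$.

  34. Synthetic Data: Doppler. Doppler function $y = (1/x)\sin(15/x)$, with noise variance $\sigma^2 = 0.1$.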
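
A sketch of how such a data set could be generated. The sampling interval for x, the Gaussian noise distribution, and the function names are assumptions; the slide gives only the function and the noise variance.

```python
import math
import random

def doppler(x):
    """Noise-free Doppler function from the slide: y = (1/x) * sin(15/x)."""
    return (1.0 / x) * math.sin(15.0 / x)

def sample_doppler(n, noise_variance=0.1, low=0.1, high=1.0, seed=0):
    """Draw n points with x uniform on [low, high] (assumed range) and
    additive Gaussian noise of the given variance (assumed distribution)."""
    rng = random.Random(seed)
    sigma = math.sqrt(noise_variance)
    xs = sorted(rng.uniform(low, high) for _ in range(n))
    ys = [doppler(x) + rng.gauss(0.0, sigma) for x in xs]
    return xs, ys
```

The function oscillates faster and with larger amplitude as x approaches 0, so the lower end of the sampling range controls how hard the regression problem is.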

  35. Experimental Results: DOPPLER. Weighted Kernel Regression: LOOCV MSE = 6.54, MSE = 25.7

  36. Experimental Results: DOPPLER. Local Linear Regression: LOOCV MSE = 80.8, MSE = 14.4

  37. Experimental Results: DOPPLER. LLSR: LOOCV MSE = 2.00, MSE = 7.99

  38. PROBLEM: RUNNING TIME. If the number of examples is n and the dimension of the examples is d, then we have to invert an n(d+1) × n(d+1) matrix. This is prohibitively expensive, especially if d is large.

  39. PROPOSED WORK: Improving Running Time. Sparsification: ignore examples which are far away so as to get a sparser matrix to invert. Iterative methods for solving linear systems: for a matrix equation Ax = b, we can obtain successive approximations x1, x2, …, xk. This can be significantly faster if the matrix A is sparse.
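
One example of such an iterative method is conjugate gradient, sketched below for a sparse symmetric positive definite system stored as one dictionary per row. The slide does not name a specific method, so the choice of conjugate gradient, the storage format, and the function name are assumptions.

```python
def conjugate_gradient(rows, b, x0=None, tol=1e-10, max_iter=1000):
    """Solve A x = b for symmetric positive definite A.
    rows[i] maps column index -> nonzero entry A[i][j], so each iteration
    costs time proportional to the number of nonzeros in A."""
    n = len(b)
    x = list(x0) if x0 is not None else [0.0] * n

    def matvec(v):
        return [sum(a * v[j] for j, a in rows[i].items()) for i in range(n)]

    r = [bi - ai for bi, ai in zip(b, matvec(x))]  # residual b - A x
    p = r[:]                                       # search direction
    rs = sum(ri * ri for ri in r)
    for _ in range(max_iter):
        if rs ** 0.5 < tol:
            break
        ap = matvec(p)
        alpha = rs / sum(pi * api for pi, api in zip(p, ap))
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * api for ri, api in zip(r, ap)]
        rs_new = sum(ri * ri for ri in r)
        p = [ri + (rs_new / rs) * pi for ri, pi in zip(r, p)]
        rs = rs_new
    return x

# Small example: A = [[4, 1], [1, 3]], b = [1, 2].
A_ROWS = [{0: 4.0, 1: 1.0}, {0: 1.0, 1: 3.0}]
X = conjugate_gradient(A_ROWS, [1.0, 2.0])
```

The optional `x0` argument allows a warm start from an existing approximate solution, which can reduce the number of iterations needed.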

  40. PROPOSED WORK: Improving Running Time. Power series: use the identity $(I - A)^{-1} = I + A + A^2 + A^3 + \dots$ to write $y' = (Q + \gamma\Delta)^{-1} P y = Q^{-1}Py + (-\gamma Q^{-1}\Delta)Q^{-1}Py + (-\gamma Q^{-1}\Delta)^2 Q^{-1}Py + \dots$ A few terms may be sufficient to get a good approximation. Compute the supervised answer first, then “smooth” the answer to get the semi-supervised solution. This can be combined with iterative methods, as we can use the supervised solution as the starting point for our iterative algorithm.
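
The truncated series can be sketched as follows. The callables `apply_q_inv` and `apply_delta` are hypothetical stand-ins for multiplication by $Q^{-1}$ and by $\Delta$, and the 2 × 2 example (identity $Q$, a two-node graph Laplacian for $\Delta$, $\gamma = 0.1$) is made up for illustration. The series only converges when the spectral radius of $\gamma Q^{-1}\Delta$ is below 1.

```python
def neumann_solve(apply_q_inv, apply_delta, gamma, b, terms):
    """Approximate (Q + gamma * Delta)^{-1} b with the first `terms` terms of
    sum_k (-gamma * Q^{-1} * Delta)^k * Q^{-1} * b."""
    term = apply_q_inv(b)  # k = 0 term: the supervised answer Q^{-1} b
    total = list(term)
    for _ in range(terms - 1):
        # Each further term applies one more factor of -gamma * Q^{-1} * Delta.
        term = [-gamma * t for t in apply_q_inv(apply_delta(term))]
        total = [a + t for a, t in zip(total, term)]
    return total

def identity(v):
    """Q = I in this toy example, so applying Q^{-1} changes nothing."""
    return list(v)

def laplacian_2(v):
    """Delta = [[1, -1], [-1, 1]], the Laplacian of a single edge."""
    return [v[0] - v[1], v[1] - v[0]]

ONE_TERM = neumann_solve(identity, laplacian_2, 0.1, [1.0, 0.0], terms=1)
MANY_TERMS = neumann_solve(identity, laplacian_2, 0.1, [1.0, 0.0], terms=30)
```

With one term the result is the unsmoothed supervised answer; adding terms moves it toward the exact solution of the 2 × 2 system, which is (1.1/1.2, 0.1/1.2).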

  41. PROPOSED WORK: Experimental Evaluation Comparison against other proposed semi-supervised regression algorithms. Evaluation on a large variety of data sets, especially high dimensional ones.

  42. Outline: Motivation; Randomized Graph Mincut; Local Linear Semi-supervised Regression; Learning with Similarity Functions; Proposed Work and Time Line.

  43. Kernels. K(x,y) = Φ(x)∙Φ(y). This allows us to implicitly project data that is not linearly separable into a high-dimensional space where a linear separator can be found. A kernel must satisfy strict mathematical conditions: 1. Continuous 2. Symmetric 3. Positive semi-definite

  44. Generic Similarity Functions. What if the best similarity function in a given domain does not satisfy the properties of a kernel? Two options: 1. Use a kernel with inferior performance. 2. Try to “coerce” the similarity function into a kernel by building a kernel that has similar behavior. There is another way …
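
A small worked instance of K(x,y) = Φ(x)∙Φ(y), using the homogeneous quadratic kernel on two-dimensional inputs. This particular kernel is chosen only for illustration and does not come from the slides.

```python
import math

def quadratic_kernel(x, y):
    """K(x, y) = (x . y)^2, computed without leaving the original 2-D space."""
    return (x[0] * y[0] + x[1] * y[1]) ** 2

def quadratic_feature_map(x):
    """Explicit map Phi(x) = (x1^2, sqrt(2) * x1 * x2, x2^2) into 3-D space."""
    return (x[0] ** 2, math.sqrt(2.0) * x[0] * x[1], x[1] ** 2)

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

P1, P2 = (1.0, 2.0), (3.0, -1.0)
IMPLICIT = quadratic_kernel(P1, P2)
EXPLICIT = dot(quadratic_feature_map(P1), quadratic_feature_map(P2))
```

Both routes give the same value, but the kernel never constructs the higher-dimensional vectors, which is what makes the implicit projection cheap.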

  45. The Balcan-Blum approach Recently Balcan and Blum initiated the theory of learning with generic similarity functions. They gave a general definition of a good similarity function for learning and showed that the popular large margin kernels are a special case of their definition. They also gave an algorithm for learning with good similarity functions. Their approach makes use of unlabeled data…

  46. The Balcan-Blum approach. The algorithm is very simple. Suppose S(x,y) is our similarity function. Then: 1. Draw d examples {x1, x2, x3, … xd} uniformly at random from the data set. 2. For each example x compute the mapping x → {S(x,x1), S(x,x2), S(x,x3), … S(x,xd)}
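
The two steps can be sketched as below. The function name, the choice to sample without replacement, and the toy similarity in the usage lines are assumptions made for the example; the slide specifies only uniform random draws and the mapping itself.

```python
import random

def similarity_map(data, similarity, d, seed=0):
    """Step 1: draw d landmark examples uniformly at random from `data`
    (here without replacement, which is an assumption).
    Step 2: map each example x to (S(x, x1), ..., S(x, xd))."""
    rng = random.Random(seed)
    landmarks = rng.sample(data, d)
    features = [[similarity(x, lm) for lm in landmarks] for x in data]
    return landmarks, features

def negative_distance(a, b):
    """Hypothetical similarity that is symmetric but not positive semi-definite."""
    return -abs(a - b)

DATA = [0.0, 0.5, 1.0, 4.0, 4.5, 5.0]
LANDMARKS, FEATURES = similarity_map(DATA, negative_distance, d=3)
```

The landmarks need no labels, so they can be drawn from unlabeled data. Each example becomes an ordinary d-dimensional vector that a standard supervised learner can consume.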

  47. Synthetic Data: Circle

  48. Experimental Results: Circle

  49. PROPOSED WORK. Overall goal: investigate the practical applicability of this theory and find out what is needed to make it work on real problems. Two main application areas: 1. Domains which have expert-defined similarity functions that are not kernels (protein homology). 2. Domains which have many irrelevant features and in which the data may not be linearly separable in the original features (text classification).

  50. PROPOSED WORK: Protein Homology. The Smith-Waterman (SW) score is the best-performing measure of similarity, but it does not satisfy the kernel properties. Machine learning applications have either used other similarity functions or tried to force the SW score into a kernel. Can we achieve better performance by using the SW score directly?
