
Semi-supervised Learning



  1. Semi-supervised Learning

  2. Overview • Introduction to SSL Problem • SSL Algorithms

  3. Why SSL? • Data labeling is expensive and difficult • Labeling is often unreliable • Unlabeled examples • Easy to obtain in large numbers • e.g. webpage classification, bioinformatics, image classification

  4. Notations (classification) • input instance x, label y • goal: estimate p(y|x), or equivalently a classifier f: x ↦ y • labeled data (Xl, Yl) = {(x1:l, y1:l)} • unlabeled data Xu = {xl+1:n}, available during training (an additional source of information about p(x)) • usually l ≪ n − l • test data xtest, not available during training

  5. SSL vs. Transductive Learning • Semi-supervised learning is ultimately applied to separate test data (inductive): the learned classifier must generalize beyond the given sample. • Transductive learning is only concerned with predicting labels for the given unlabeled data; no separate test set is assumed.

  6. Glossary • supervised learning (classification, regression) • {(x1:n, y1:n)} • semi-supervised classification/regression • {(x1:l, y1:l), xl+1:n, xtest} • transductive classification/regression • {(x1:l, y1:l), xl+1:n} • semi-supervised clustering • {x1:n, must-links, cannot-links} • unsupervised learning (clustering) • {x1:n}

  7. Are unlabeled samples useful? • In general yes, but not always (discussed later) • Under suitable assumptions, classification error decreases • Exponentially fast with the number of labeled examples • Only linearly with the number of unlabeled examples

  8. SSL Algorithms • Self-Training • Generative Models • S3VMs • Graph-Based Algorithms • Co-training • Multiview algorithms

  9. Self-Training • Assumption: • One's own high-confidence predictions are correct. • Self-training algorithm: • Train f on the labeled data {(x1:l, y1:l)} • Predict on the unlabeled data x ∈ Xu • Add (x, f(x)) to the labeled data • Variants: add all pairs; add only the few most confident pairs; add all pairs, each weighted by its confidence • Repeat (a minimal sketch follows below)
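A minimal self-training sketch in Python, assuming a scikit-learn-style base classifier with fit/predict_proba; the names base_clf, confidence_threshold, and max_iter are illustrative and not from the slides:

import numpy as np

def self_training(base_clf, X_l, y_l, X_u, confidence_threshold=0.9, max_iter=10):
    """Self-training wrapper: train, pseudo-label the most confident
    unlabeled points, move them into the labeled set, and retrain."""
    X_l, y_l, X_u = np.asarray(X_l), np.asarray(y_l), np.asarray(X_u)
    for _ in range(max_iter):
        if len(X_u) == 0:
            break
        base_clf.fit(X_l, y_l)
        proba = base_clf.predict_proba(X_u)          # class probabilities on Xu
        conf = proba.max(axis=1)                     # confidence of each prediction
        pseudo = proba.argmax(axis=1)                # predicted class index
        keep = conf >= confidence_threshold          # only high-confidence predictions
        if not keep.any():
            break
        # move the confidently pseudo-labeled points into the labeled set
        X_l = np.vstack([X_l, X_u[keep]])
        y_l = np.concatenate([y_l, base_clf.classes_[pseudo[keep]]])
        X_u = X_u[~keep]
    return base_clf.fit(X_l, y_l)

For example, self_training(LogisticRegression(), X_l, y_l, X_u) would wrap logistic regression, illustrating why self-training is called a wrapper method.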

  10. Advantages of Self-Training • The simplest semi-supervised learning method. • A wrapper method, applies to existing classifiers. • Often used in real tasks like natural language processing.

  11. Disadvantages of Self-Training • Early mistakes can reinforce themselves. • Heuristic remedies exist, e.g. weighting the added pairs or keeping only the most confident ones. • Little can be said about convergence in general.

  12. SSL Algorithms • Self-Training • Generative Models • S3VMs • Graph-Based Algorithms • Co-training • Multiview algorithms

  13. Generative Models • Assuming each class has a Gaussian distribution, what is the decision boundary?

  14. Decision boundary

  15. Adding unlabeled data

  16. The new decision boundary

  17. They are different because…

  18. Basic idea • If we have the full generative model p(X, Y|θ): • quantity of interest: p(y|x, θ) ∝ p(x, y|θ) = p(y|θ) p(x|y, θ) • find the maximum likelihood estimate (MLE) of θ, the maximum a posteriori (MAP) estimate, or be Bayesian

  19. Some generative models • Mixture of Gaussian distributions (GMM) • image classification • the EM algorithm • Mixture of multinomial distributions • text categorization • the EM algorithm • Hidden Markov Models (HMM) • speech recognition • Baum-Welch algorithm

  20. Example: GMM • For simplicity, consider binary classification with a GMM fit by MLE. • Model parameters: θ = {w1, w2, μ1, μ2, Σ1, Σ2} • So: p(x, y|θ) = p(y|θ) p(x|y, θ) = wy N(x; μy, Σy) • To estimate θ, we maximize the labeled-data log-likelihood: log p(Xl, Yl|θ) = Σi=1..l log( wyi N(xi; μyi, Σyi) ) • Then the MLE is given by the class frequencies, the per-class sample means, and the per-class sample covariances

  21. Continue… • Now that we have θ, predict y by maximum a posteriori: y* = argmaxy p(y|x, θ) = argmaxy wy N(x; μy, Σy) (sketched below)
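As a hedged illustration of slides 20–21, the following Python sketch fits the class-conditional Gaussians by MLE on the labeled data only and then predicts by MAP; the function names are hypothetical:

import numpy as np
from scipy.stats import multivariate_normal

def fit_gmm_supervised(X_l, y_l):
    """MLE of theta = {w_y, mu_y, Sigma_y} from labeled data only."""
    theta = {}
    for c in np.unique(y_l):
        Xc = X_l[y_l == c]
        theta[c] = (len(Xc) / len(X_l),           # class prior w_c
                    Xc.mean(axis=0),              # class mean mu_c
                    np.cov(Xc, rowvar=False))     # class covariance Sigma_c
    return theta

def predict_map(theta, x):
    """Predict y by MAP: argmax_y w_y N(x; mu_y, Sigma_y)."""
    scores = {c: w * multivariate_normal.pdf(x, mean=mu, cov=cov)
              for c, (w, mu, cov) in theta.items()}
    return max(scores, key=scores.get)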

  22. What about SSGMM (the semi-supervised GMM)? • To estimate θ, we now maximize the joint likelihood of labeled and unlabeled data: log p(Xl, Yl, Xu|θ) = Σi=1..l log( wyi N(xi; μyi, Σyi) ) + Σi=l+1..n log( Σy wy N(xi; μy, Σy) ) • More complicated? Yes: each unlabeled term is a mixture of two normal distributions, so the log no longer separates

  23. A more complicated case • For simplicity, consider a mixture of two normal distributions. • Model parameters: θ = {π, μ0, σ0², μ1, σ1²} • So the density is: p(x|θ) = (1 − π) N(x; μ0, σ0²) + π N(x; μ1, σ1²)

  24. A more complicated case • Then the log-likelihood is: ℓ(θ) = Σi log[ (1 − π) N(xi; μ0, σ0²) + π N(xi; μ1, σ1²) ] • Because of the sum inside the logarithm, direct MLE is difficult numerically.

  25. The EM for GMM • We introduce unobserved latent variables Δi: • If Δi = 0, then (xi, yi) comes from model 0 • Else Δi = 1, then (xi, yi) comes from model 1 • Suppose we knew the values of the Δi's; then each point would belong to exactly one component, and the MLE would reduce to simple per-component means, variances, and mixing proportions.

  26. The EM for GMM • The values of the Δi's are actually unknown. • EM's idea: proceed in an iterative fashion, substituting for each Δi its expected value under the current parameters.

  27. Another version of EM for GMM • Start from the MLE θ = {w1, w2, μ1, μ2, Σ1, Σ2} on (Xl, Yl), then repeat: • The E-step: compute the expected label p(y|x, θ) for all x ∈ Xu • treat a p(y=1|x, θ) fraction of each x as class 1 • and a p(y=2|x, θ) fraction as class 2

  28. Another version of EM for GMM • The M-step: update the MLE θ using Xl together with the now softly labeled Xu (see the sketch below)
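A rough Python sketch of the EM procedure described in slides 27–28, with one Gaussian per class; weighted_mle and ssl_gmm_em are illustrative names, not from the slides:

import numpy as np
from scipy.stats import multivariate_normal

def weighted_mle(X, R):
    """Weighted MLE of {w_c, mu_c, Sigma_c} given responsibilities R (n x K)."""
    w, mu, cov = [], [], []
    for c in range(R.shape[1]):
        r = R[:, c]
        w.append(r.sum() / len(X))
        m = (r[:, None] * X).sum(axis=0) / r.sum()
        d = X - m
        mu.append(m)
        cov.append((r[:, None] * d).T @ d / r.sum())
    return w, mu, cov

def ssl_gmm_em(X_l, y_l, X_u, n_iter=50):
    """Semi-supervised GMM via EM: labeled points keep their labels, unlabeled
    points get soft labels p(y=c | x, theta) recomputed at every E-step."""
    classes = np.unique(y_l)
    R_l = (y_l[:, None] == classes[None, :]).astype(float)   # one-hot hard labels
    w, mu, cov = weighted_mle(X_l, R_l)                       # init: supervised MLE on (Xl, Yl)
    X = np.vstack([X_l, X_u])
    for _ in range(n_iter):
        # E-step: responsibilities p(y=c | x, theta) for the unlabeled data
        dens = np.column_stack([w[c] * multivariate_normal.pdf(X_u, mu[c], cov[c])
                                for c in range(len(classes))])
        R_u = dens / dens.sum(axis=1, keepdims=True)
        # M-step: weighted MLE over labeled (hard) + unlabeled (soft) data
        w, mu, cov = weighted_mle(X, np.vstack([R_l, R_u]))
    return classes, w, mu, cov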

  29. The EM algorithm in general • Set up: • observed data D = (Xl, Yl, Xu) • hidden data Yu • Goal: find θ to maximize p(D|θ) • Properties: • starts from an arbitrary θ0 (or an estimate on (Xl, Yl)) • The E-step: estimate p(Yu|Xu, θ0) • The M-step: maximize the expected complete-data log-likelihood E_Yu[ log p(Xl, Yl, Xu, Yu|θ) ] • iteratively improves p(D|θ) • converges to a local maximum of the likelihood

  30. Beyond EM • Key is to maximize p(Xl, Yl, Xu|θ). • EM is just one way to maximize it. • Other ways to find parameters are possible too, e.g. variational approximation, or direct optimization.

  31. Advantages of generative models • Clear, well-studied probabilistic framework • Can be extremely effective, if the model is close to correct

  32. Disadvantages of generative models • Often difficult to verify the correctness of the model • Model identifiability: two different models can explain the unlabeled data equally well • Model A: p(y=1)=0.2, p(x|y=1)=unif(0, 0.2), p(x|y=−1)=unif(0.2, 1) • Model B: p(y=1)=0.6, p(x|y=1)=unif(0, 0.6), p(x|y=−1)=unif(0.6, 1) • Both induce the same marginal p(x)=unif(0, 1), yet they disagree on the prediction at x=0.5 (see the check below) • EM local optima • Unlabeled data may hurt if the generative model is wrong
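A quick numerical check of the identifiability example above, using a hypothetical helper posterior_positive; both models share the marginal p(x) = unif(0, 1) but make opposite predictions at x = 0.5:

def posterior_positive(x, p_y1, cut):
    """p(y=1 | x) when p(x|y=1)=unif(0, cut) and p(x|y=-1)=unif(cut, 1)."""
    like_pos = (1.0 / cut) if 0 <= x < cut else 0.0
    like_neg = (1.0 / (1 - cut)) if cut <= x <= 1 else 0.0
    joint_pos = p_y1 * like_pos
    joint_neg = (1 - p_y1) * like_neg
    return joint_pos / (joint_pos + joint_neg)

print(posterior_positive(0.5, p_y1=0.2, cut=0.2))  # 0.0 -> Model A predicts y = -1
print(posterior_positive(0.5, p_y1=0.6, cut=0.6))  # 1.0 -> Model B predicts y = +1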

  33. Unlabeled data may hurt SSL

  34. Heuristics to lessen the danger • Carefully construct the generative model to reflect the task • e.g. multiple Gaussian distributions per class, instead of a single one • Down-weight the unlabeled data (a weight λ < 1 on the unlabeled term of the log-likelihood)

  35. Related method: cluster-and-label • Instead of probabilistic generative models, any clustering algorithm can be used for semi-supervised classification: • Run your favorite clustering algorithm on Xl ∪ Xu. • Label all points within a cluster by the majority label of the labeled points in that cluster (sketched below). • Pro: Yet another simple method that reuses existing algorithms. • Con: Can be difficult to analyze due to its algorithmic nature.
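A possible cluster-and-label sketch using scikit-learn's KMeans as the "favorite clustering algorithm"; it assumes integer class labels and falls back to the global majority label for clusters that contain no labeled points:

import numpy as np
from sklearn.cluster import KMeans

def cluster_and_label(X_l, y_l, X_u, n_clusters=2):
    """Cluster Xl and Xu together, then label every unlabeled point with the
    majority label of the labeled points in its cluster."""
    X = np.vstack([X_l, X_u])
    assign = KMeans(n_clusters=n_clusters, n_init=10).fit_predict(X)
    labels_u = np.empty(len(X_u), dtype=y_l.dtype)
    for c in range(n_clusters):
        in_cluster = assign == c
        y_in = y_l[in_cluster[:len(X_l)]]          # labeled points in this cluster
        if len(y_in) == 0:
            majority = np.bincount(y_l).argmax()   # no labeled points: global majority
        else:
            majority = np.bincount(y_in).argmax()
        labels_u[in_cluster[len(X_l):]] = majority
    return labels_u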

  36. SSL Algorithms • Self-Training • Generative Models • S3VMs • Graph-Based Algorithms • Co-training • Multiview algorithms

  37. Semi-supervised SVMs • Semi-supervised SVMs (S3VMs) • Transductive SVMs (TSVMs)

  38. SVM with hinge loss • The hinge loss: c(x, y, f(x)) = max(1 − y f(x), 0), for labels y ∈ {−1, +1} • The optimization problem (objective function): minf Σi=1..l max(1 − yi f(xi), 0) + λ‖w‖² (a sketch follows below)
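A small NumPy sketch of the hinge loss and the regularized objective, assuming a linear classifier f(x) = w·x + b (this parameterization is an assumption, not stated on the slide):

import numpy as np

def hinge_loss(y, fx):
    """Hinge loss max(1 - y * f(x), 0) for labels y in {-1, +1}."""
    return np.maximum(1 - y * fx, 0)

def svm_objective(w, b, X_l, y_l, lam=1.0):
    """Regularized hinge-loss objective: sum_i max(1 - y_i f(x_i), 0) + lam * ||w||^2."""
    fx = X_l @ w + b
    return hinge_loss(y_l, fx).sum() + lam * np.dot(w, w)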

  39. S3VMs • Assumption: • Unlabeled data from different classes are separated by a large margin. • Basic idea: • Enumerate all 2^u possible labelings of Xu • Build one standard SVM for each labeling (together with Xl) • Pick the SVM with the largest margin • NP-hard!

  40. A smart trick • How to incorporate unlabeled points? • Assign the putative label sign(f(x)) to each x ∈ Xu, so that by construction every unlabeled point is "classified correctly." • Is this equivalent to our basic idea? (Yes) • The hinge loss on unlabeled points then becomes max(1 − sign(f(x)) f(x), 0) = max(1 − |f(x)|, 0), the so-called hat loss

  41. S3VM objective function • S3VM objective: minf Σi=1..l max(1 − yi f(xi), 0) + λ1‖w‖² + λ2 Σj=l+1..n max(1 − |f(xj)|, 0) • the decision boundary f = 0 wants to be placed so that few unlabeled points lie near it

  42. The class balancing constraint • Directly optimizing the S3VM objective often produces an unbalanced classification • most points fall in one class. • Heuristic class balance: constrain the predicted class ratio on the unlabeled data to match the labeled ratio, (1/u) Σj=l+1..n yj = (1/l) Σi=1..l yi • Relaxed class balancing constraint: (1/u) Σj=l+1..n f(xj) = (1/l) Σi=1..l yi

  43. S3VM algorithm • The optimization problem: minimize the S3VM objective above, subject to the (relaxed) class balancing constraint • Classify a new test point x by sign(f(x)) (a sketch of the objective follows below)
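One way to write the (non-convex) S3VM objective from slides 41–43 in NumPy, again assuming a linear f(x) = w·x + b; the relaxed class-balance constraint would be enforced by whatever optimizer is used and is only noted in a comment here:

import numpy as np

def s3vm_objective(w, b, X_l, y_l, X_u, lam1=1.0, lam2=0.5):
    """S3VM objective: hinge loss on labeled points, hat loss
    max(1 - |f(x)|, 0) on unlabeled points, plus the margin regularizer."""
    f_l = X_l @ w + b
    f_u = X_u @ w + b
    labeled = np.maximum(1 - y_l * f_l, 0).sum()
    unlabeled = np.maximum(1 - np.abs(f_u), 0).sum()     # "hat" loss: non-convex in w
    # relaxed class balance would add the constraint mean(f_u) == mean(y_l)
    return labeled + lam1 * np.dot(w, w) + lam2 * unlabeled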

  44. The S3VM optimization challenge • The SVM objective is convex. • The S3VM objective is non-convex. • Finding a good solution for the semi-supervised SVM is therefore difficult, and has been the focus of S3VM research. • Different approaches: SVMlight, ∇S3VM, continuation S3VM, deterministic annealing, CCCP, Branch and Bound, SDP convex relaxation, etc.

  45. Advantages of S3VMs • Applicable wherever SVMs are applicable, i.e. almost everywhere • Clear mathematical framework • More modest assumptions than generative models or graph-based methods

  46. Disadvantages of S3VMs • Optimization difficult • Can be trapped in bad local optima

  47. SSL Algorithms • Self-Training • Generative Models • S3VMs • Graph-Based Algorithms • Co-training • Multiview algorithms

  48. Graph-Based Algorithms • Assumption: • A graph is given over the labeled and unlabeled data. Instances connected by heavy edges tend to have the same label. • The optimization problem: minf Σi=1..l (f(xi) − yi)² + λ Σi,j wij (f(xi) − f(xj))² • Some algorithms (a harmonic-function sketch follows below) • mincut • harmonic • local and global consistency • manifold regularization
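A sketch of the harmonic-function solution, one of the graph-based algorithms listed above: given a symmetric edge-weight matrix W (labeled points first) and labels, the scores on the unlabeled points solve a linear system in the graph Laplacian. The function name and the 0/1 label convention are assumptions:

import numpy as np

def harmonic_labels(W, y_l):
    """Harmonic-function solution: f_u = (D_uu - W_uu)^{-1} W_ul y_l,
    i.e. each unlabeled score is the weighted average of its neighbors."""
    l = len(y_l)
    D = np.diag(W.sum(axis=1))
    L = D - W                                   # graph Laplacian
    L_uu = L[l:, l:]                            # unlabeled-unlabeled block
    W_ul = W[l:, :l]                            # unlabeled-labeled block
    f_u = np.linalg.solve(L_uu, W_ul @ np.asarray(y_l, float))
    return f_u                                  # threshold (e.g. at 0.5) to get labels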

  49. SSL Algorithms • Self-Training • Generative Models • S3VMs • Graph-Based Algorithms • Co-training • Multiview algorithms

  50. Co-training • Two views of an item: image and HTML text
