Semi-supervised Learning
Overview • Introduction to SSL Problem • SSL Algorithms
Why SSL? • Data labeling is expensive and difficult • Labeling is often unreliable • Unlabeled examples • Easy to obtain in large numbers • e.g. webpage classification, bioinformatics, image classification
Notation (classification) • input instance x, label y • goal: estimate p(y|x) • labeled data (x1:l, y1:l) • unlabeled data xl+1:n, available during training (an additional source of information about p(x)) • usually l ≪ n • test data xtest, not available during training
SSL vs. Transductive Learning • Semi-supervised learning is ultimately applied to the test data (inductive). • Transductive learning is only concerned with the unlabeled data.
Glossary • supervised learning (classification, regression) • {(x1:n, y1:n)} • semi-supervised classification/regression • {(x1:l, y1:l), xl+1:n, xtest} • transductive classification/regression • {(x1:l, y1:l), xl+1:n} • semi-supervised clustering • {x1:n, must-links, cannot-links} • unsupervised learning (clustering) • {x1:n}
Are unlabeled samples useful? • In general yes, but not always (discussed later) • Classification error decreases • exponentially with the number of labeled examples • linearly with the number of unlabeled examples
SSL Algorithms • Self-Training • Generative Models • S3VMs • Graph-Based Algorithms • Co-training • Multiview algorithms
Self-Training • Assumption: • One’s own high-confidence predictions are correct. • Self-training algorithm (a sketch follows below): • Train f on the labeled data {(x1:l, y1:l)} • Predict on the unlabeled x ∈ Xu • Add (x, f(x)) to the labeled data • Variants: add all pairs, add only the few most confident pairs, or add each pair with a confidence weight • Repeat
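A minimal sketch of this loop, assuming scikit-learn is available; the base classifier (LogisticRegression), the confidence threshold, and the function name self_train are illustrative choices, not part of the original slides.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def self_train(X_l, y_l, X_u, threshold=0.95, max_iter=10):
    """Self-training: repeatedly fit on the labeled set, then move
    high-confidence predictions on unlabeled points into the labeled set."""
    clf = LogisticRegression()
    for _ in range(max_iter):
        clf.fit(X_l, y_l)
        if len(X_u) == 0:
            break
        proba = clf.predict_proba(X_u)
        conf = proba.max(axis=1)                     # confidence of each prediction
        pseudo = clf.classes_[proba.argmax(axis=1)]  # pseudo-labels
        pick = conf >= threshold                     # keep only confident predictions
        if not pick.any():
            break
        X_l = np.vstack([X_l, X_u[pick]])            # grow the labeled set
        y_l = np.concatenate([y_l, pseudo[pick]])
        X_u = X_u[~pick]                             # shrink the unlabeled set
    return clf
```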
Advantages of Self-Training • The simplest semi-supervised learning method. • A wrapper method, applies to existing classifiers. • Often used in real tasks like natural language processing.
Disadvantages of Self-Training • Early mistakes can reinforce themselves. • Fixes are heuristic, e.g. weight the added pairs by confidence or add only the most confident ones. • Little can be said about convergence in general.
SSL Algorithms • Self-Training • Generative Models • S3VMs • Graph-Based Algorithms • Co-training • Multiview algorithms
Generative Models • Assuming each class has a Gaussian distribution, what is the decision boundary?
Basic idea • If we have the full generative model p(X, Y|θ): • quantity of interest: p(y|x, θ) ∝ p(y|θ) p(x|y, θ) • find the maximum likelihood estimate (MLE) of θ, the maximum a posteriori (MAP) estimate, or be Bayesian
Some generative models • Mixture of Gaussian distributions (GMM) • image classification • the EM algorithm • Mixture of multinomial distributions • text categorization • the EM algorithm • Hidden Markov Models (HMM) • speech recognition • Baum-Welch algorithm
Example: GMM • For simplicity, consider binary classification with a GMM fit by MLE. • Model parameters: θ = {w1, w2, μ1, μ2, ∑1, ∑2} • So: p(x, y|θ) = p(y|θ) p(x|y, θ) = wy N(x; μy, ∑y) • To estimate θ, we maximize the labeled-data log likelihood: log p(Xl, Yl|θ) = ∑i=1..l log wyi N(xi; μyi, ∑yi) • Then the MLE is available in closed form: the class frequencies, class sample means, and class sample covariances.
Continued • Having estimated θ, predict y by maximum a posteriori: ŷ = argmaxy p(y|x, θ) = argmaxy wy N(x; μy, ∑y)
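A minimal sketch of this labeled-data MLE and MAP prediction, using NumPy/SciPy; the function names fit_gmm_supervised and predict_map are illustrative.

```python
import numpy as np
from scipy.stats import multivariate_normal

def fit_gmm_supervised(X_l, y_l):
    """Closed-form MLE of a class-conditional Gaussian model from labeled
    data only: class weight w_c, mean mu_c, covariance Sigma_c per class."""
    theta = {}
    for c in np.unique(y_l):
        Xc = X_l[y_l == c]
        theta[c] = (len(Xc) / len(X_l),         # w_c: class frequency
                    Xc.mean(axis=0),            # mu_c: class sample mean
                    np.cov(Xc, rowvar=False))   # Sigma_c: class sample covariance
    return theta

def predict_map(theta, x):
    """MAP prediction: argmax_y  w_y * N(x; mu_y, Sigma_y)."""
    scores = {c: w * multivariate_normal.pdf(x, mean=mu, cov=cov)
              for c, (w, mu, cov) in theta.items()}
    return max(scores, key=scores.get)
```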
What about a semi-supervised GMM? • To estimate θ, we now maximize the log likelihood of labeled and unlabeled data together: log p(Xl, Yl|θ) + λ log p(Xu|θ), where p(x|θ) = ∑y p(y|θ) p(x|y, θ) • Is this more complicated? (Consider a mixture of two normal distributions.)
A more complicated case • For simplicity, consider a mixture of two normal distributions. • Model parameters: θ = {w, μ1, σ1², μ2, σ2²} • So: p(x|θ) = (1 − w) N(x; μ1, σ1²) + w N(x; μ2, σ2²)
A more complicated case • Then the log likelihood is: ∑i log[(1 − w) N(xi; μ1, σ1²) + w N(xi; μ2, σ2²)] • Because the sum over components sits inside the logarithm, direct MLE is difficult numerically.
The EM for GMM • We introduce unobserved latent variables Δi: • If Δi = 0, then (xi, yi) comes from model 0 • Else Δi = 1, then (xi, yi) comes from model 1 • Suppose we knew the values of the Δi’s; then the complete-data log likelihood ∑i [(1 − Δi) log N(xi; μ1, σ1²) + Δi log N(xi; μ2, σ2²)] + ∑i [(1 − Δi) log(1 − w) + Δi log w] is maximized in closed form.
The EM for GMM • The values of the Δi's are actually unknown. • EM’s idea: proceed in an iterative fashion, substituting for each Δi its expected value under the current parameters.
Another version of EM for GMM • Start from the MLE θ = {w1, w2, μ1, μ2, ∑1, ∑2} on (Xl, Yl), then repeat: • The E-step: compute the expected labels p(y|x, θ) for all x ∈ Xu • each unlabeled x contributes a p(y=1|x, θ) fraction to class 1 • and a p(y=2|x, θ) fraction to class 2
Another version of EM for GMM • The M-step: update the MLE θ using Xl together with the now fractionally labeled Xu (each unlabeled point weighted by its class posteriors), as sketched below.
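A minimal sketch of this E-step/M-step loop for a two-class semi-supervised GMM, assuming integer labels in {0, 1}; the fixed iteration count and the function name em_ssl_gmm are illustrative.

```python
import numpy as np
from scipy.stats import multivariate_normal

def em_ssl_gmm(X_l, y_l, X_u, n_iter=50):
    """Semi-supervised GMM via EM for two classes {0, 1}.  Labeled points
    keep hard labels; unlabeled points get soft labels from the posteriors."""
    X_all = np.vstack([X_l, X_u])
    l = len(X_l)
    R = np.zeros((len(X_all), 2))        # responsibilities (soft labels)
    R[np.arange(l), y_l] = 1.0           # labeled rows: hard labels
    R[l:] = 0.5                          # unlabeled rows: start uniform
    for _ in range(n_iter):
        # M-step: weighted MLE of class weights, means, covariances.
        params = []
        for c in (0, 1):
            r = R[:, c]
            w = r.sum() / len(X_all)
            mu = (r[:, None] * X_all).sum(axis=0) / r.sum()
            diff = X_all - mu
            cov = (r[:, None, None] * diff[:, :, None] * diff[:, None, :]).sum(axis=0) / r.sum()
            params.append((w, mu, cov))
        # E-step: recompute posteriors; only the unlabeled rows change.
        dens = np.column_stack([w * multivariate_normal.pdf(X_all, mean=mu, cov=cov)
                                for (w, mu, cov) in params])
        R[l:] = (dens / dens.sum(axis=1, keepdims=True))[l:]
    return params
```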
The EM algorithm in general • Setup: • observed data D = (Xl, Yl, Xu) • hidden data Yu • Goal: find θ to maximize p(D|θ) • Properties: • starts from an arbitrary θ0 (or an estimate on (Xl, Yl)) • The E-step: estimate p(Yu|Xu, θ0) • The M-step: maximize the expected complete-data log likelihood EYu[log p(Xl, Yl, Xu, Yu|θ)] • iteratively improves p(D|θ) • converges to a local maximum of p(D|θ)
Beyond EM • Key is to maximize p(Xl, Yl, Xu|θ). • EM is just one way to maximize it. • Other ways to find parameters are possible too, e.g. variational approximation, or direct optimization.
Advantages of generative models • Clear, well-studied probabilistic framework • Can be extremely effective, if the model is close to correct
Disadvantages of generative models • Often difficult to verify the correctness of the model • Model identifiability: the two models below induce the same marginal p(x) = unif(0, 1), so unlabeled data cannot distinguish them • p(y=1)=0.2, p(x|y=1)=unif(0, 0.2), p(x|y=−1)=unif(0.2, 1) • p(y=1)=0.6, p(x|y=1)=unif(0, 0.6), p(x|y=−1)=unif(0.6, 1) • Can we predict on x=0.5? The first model says y=−1, the second says y=+1. • EM local optima • Unlabeled data may hurt if the generative model is wrong
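A tiny numeric check of the identifiability example, assuming the uniform densities stated above; it only verifies that both models agree on p(x) everywhere but give opposite posteriors at x = 0.5.

```python
def posterior_pos(x, p1, cut):
    """p(y=1 | x) for the model: p(y=1)=p1, p(x|y=1)=unif(0,cut), p(x|y=-1)=unif(cut,1)."""
    dens_pos = (1.0 / cut) if 0 <= x < cut else 0.0
    dens_neg = (1.0 / (1 - cut)) if cut <= x <= 1 else 0.0
    joint_pos, joint_neg = p1 * dens_pos, (1 - p1) * dens_neg
    return joint_pos / (joint_pos + joint_neg)

# Both models have marginal density p(x) = 1 on [0, 1]:
# p1/cut = 0.2/0.2 = 1 and (1-p1)/(1-cut) = 0.8/0.8 = 1 (likewise for 0.6),
# so unlabeled data alone cannot tell them apart, yet at x = 0.5:
print(posterior_pos(0.5, p1=0.2, cut=0.2))  # 0.0 -> predict y = -1
print(posterior_pos(0.5, p1=0.6, cut=0.6))  # 1.0 -> predict y = +1
```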
Heuristics to lessen the danger • Carefully construct the generative model to reflect the task • e.g. multiple Gaussian distributions per class, instead of a single one • Down-weight the unlabeled data (λ<1)
Related method: cluster-and-label • Instead of probabilistic generative models, any clustering algorithm can be used for semi-supervised classification too: • Run your favorite clustering algorithm on Xl,Xu. • Label all points within a cluster by the majority of labeled points in that cluster. • Pro: Yet another simple method using existing algorithms. • Con: Can be difficult to analyze due to their algorithmic nature.
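A minimal sketch of cluster-and-label, assuming scikit-learn's KMeans as the favorite clustering algorithm; the number of clusters and the use of -1 for clusters containing no labeled points are illustrative choices.

```python
import numpy as np
from sklearn.cluster import KMeans

def cluster_and_label(X_l, y_l, X_u, n_clusters=10):
    """Cluster labeled + unlabeled data together, then label every point in a
    cluster by the majority label of the labeled points it contains.
    Assumes labels y_l are nonnegative integers; -1 marks clusters with no labeled points."""
    X_all = np.vstack([X_l, X_u])
    assign = KMeans(n_clusters=n_clusters, n_init=10).fit_predict(X_all)
    y_u = np.full(len(X_u), -1, dtype=int)
    for c in range(n_clusters):
        in_cluster = assign == c
        labeled_in_c = y_l[in_cluster[:len(X_l)]]
        if len(labeled_in_c) > 0:
            y_u[in_cluster[len(X_l):]] = np.bincount(labeled_in_c).argmax()
    return y_u
```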
SSL Algorithms • Self-Training • Generative Models • S3VMs • Graph-Based Algorithms • Co-training • Multiview algorithms
Semi-supervised SVMs • Semi-supervised SVMs (S3VMs) • Also known as Transductive SVMs (TSVMs)
SVM with hinge loss • The hinge loss: (1 − yi f(xi))+ = max(1 − yi f(xi), 0) • The optimization problem (objective function): minf ∑i=1..l (1 − yi f(xi))+ + λ ‖f‖²
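A small sketch of the hinge loss and the regularized objective above for a linear classifier f(x) = w·x + b; this is only meant to make the pieces of the formula concrete, and the names are illustrative.

```python
import numpy as np

def hinge_loss(w, b, X, y):
    """Hinge loss (1 - y_i f(x_i))_+ for a linear f(x) = w.x + b, labels y in {-1, +1}."""
    margins = y * (X @ w + b)
    return np.maximum(1.0 - margins, 0.0)

def svm_objective(w, b, X_l, y_l, lam=1.0):
    """Regularized objective: sum_i (1 - y_i f(x_i))_+  +  lam * ||w||^2."""
    return hinge_loss(w, b, X_l, y_l).sum() + lam * np.dot(w, w)
```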
S3VMs • Assumption: • Unlabeled data from different classes are separated by a large margin. • Basic idea: • Enumerate all 2^u possible labelings of Xu • Build one standard SVM for each labeling (together with Xl) • Pick the SVM with the largest margin • NP-hard!
A smart trick • How to incorporate unlabeled points? • Assign the label sign(f(x)) to each x ∈ Xu, i.e. treat every unlabeled point as correctly classified. • Is this equivalent to the basic idea? (Yes) • The hinge loss on unlabeled points then becomes the “hat loss”: (1 − sign(f(x)) f(x))+ = (1 − |f(x)|)+
S3VM objective function • S3VM objective: minf ∑i=1..l (1 − yi f(xi))+ + λ1 ‖f‖² + λ2 ∑x∈Xu (1 − |f(x)|)+ • The hat loss on unlabeled points means the decision boundary f = 0 prefers to lie where there are few unlabeled data near it.
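Extending the linear SVM sketch above with the hat loss on unlabeled points gives the S3VM objective value; this computes only the objective, not any of the optimization methods discussed later, and the weights lam1, lam2 are illustrative.

```python
import numpy as np

def s3vm_objective(w, b, X_l, y_l, X_u, lam1=1.0, lam2=1.0):
    """S3VM objective for a linear f(x) = w.x + b:
       sum_i (1 - y_i f(x_i))_+        hinge loss on labeled points
       + lam1 * ||w||^2                regularizer
       + lam2 * sum_u (1 - |f(x)|)_+   hat loss on unlabeled points"""
    f_l = X_l @ w + b
    f_u = X_u @ w + b
    hinge = np.maximum(1.0 - y_l * f_l, 0.0).sum()
    hat = np.maximum(1.0 - np.abs(f_u), 0.0).sum()
    return hinge + lam1 * np.dot(w, w) + lam2 * hat
```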
The class balancing constraint • Directly optimizing the S3VM objective often produces an unbalanced classification: • most points fall in one class. • Heuristic class balance: constrain the predicted class proportions on Xu to match the class proportions observed in the labeled data. • Relaxed class balancing constraint: (1/u) ∑x∈Xu f(x) = (1/l) ∑i=1..l yi
S3VM algorithm • The optimization problem: minimize the S3VM objective above, subject to the class balancing constraint. • Classify a new test point x by sign(f(x)).
The S3VM optimization challenge • The SVM objective is convex. • The S3VM objective is non-convex. • Finding a good solution to the semi-supervised SVM problem is difficult, and this has been the focus of S3VM research. • Different approaches: SVMlight, ∇S3VM, continuation S3VM, deterministic annealing, CCCP, Branch and Bound, SDP convex relaxation, etc.
Advantages of S3VMs • Applicable wherever SVMs are applicable, i.e. almost everywhere • Clear mathematical framework • More modest assumptions than generative models or graph-based methods
Disadvantages of S3VMs • Optimization difficult • Can be trapped in bad local optima
SSL Algorithms • Self-Training • Generative Models • S3VMs • Graph-Based Algorithms • Co-training • Multiview algorithms
Graph-Based Algorithms • Assumption: • A graph is given over the labeled and unlabeled data. Instances connected by a heavy edge tend to have the same label. • The optimization problem: find f that fits the labeled data and is smooth on the graph, e.g. minimize ∑i,j wij (f(xi) − f(xj))² subject to f(xi) = yi on the labeled points. • Some algorithms • mincut • harmonic • local and global consistency • manifold regularization
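A minimal sketch of the harmonic-function solution to the graph objective above, assuming a precomputed symmetric weight matrix W whose first l rows/columns are the labeled points and binary labels in {0, 1}; the closed-form linear solve is one standard way to minimize ∑ wij (f(xi) − f(xj))² with f fixed on the labeled nodes.

```python
import numpy as np

def harmonic_labels(W, y_l):
    """Harmonic solution on a graph with weight matrix W (n x n); the first
    l nodes are labeled with y_l in {0, 1}.  Minimizes sum_ij w_ij (f_i - f_j)^2
    subject to f_i = y_i on the labeled nodes."""
    l = len(y_l)
    D = np.diag(W.sum(axis=1))
    L = D - W                                  # graph Laplacian
    L_uu = L[l:, l:]                           # unlabeled-unlabeled block
    L_ul = L[l:, :l]                           # unlabeled-labeled block
    f_u = np.linalg.solve(L_uu, -L_ul @ y_l)   # soft labels on unlabeled nodes
    return (f_u > 0.5).astype(int)             # threshold to hard labels
```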
SSL Algorithms • Self-Training • Generative Models • S3VMs • Graph-Based Algorithms • Co-training • Multiview algorithms
Co-training • Two views of an item: image and HTML text