Semi-supervised Learning
Overview • Introduction to SSL Problem • SSL Algorithms
Why SSL? • Data labeling is expensive and difficult • Labeling is often unreliable • Unlabeled examples • Easy to obtain in large numbers • e.g. webpage classification, bioinformatics, image classification
Notation (classification) • input instance x, label y • goal: estimate p(y|x) • labeled data (x1:l, y1:l) • unlabeled data xl+1:n, available during training (an additional source of information about p(x)) • usually l ≪ n • test data xtest, not available during training
SSL vs. Transductive Learning • Semi-supervised learning is ultimately applied to the test data (inductive). • Transductive learning is only concerned with the unlabeled data.
Glossary • supervised learning (classification, regression) • {(x1:n, y1:n)} • semi-supervised classification/regression • {(x1:l, y1:l), xl+1:n, xtest} • transductive classification/regression • {(x1:l, y1:l), xl+1:n} • semi-supervised clustering • {x1:n, must-links, cannot-links} • unsupervised learning (clustering) • {x1:n}
Are unlabeled samples useful? • In general yes, but not always (discussed later) • Classification error decreases • exponentially with the number of labeled examples • linearly with the number of unlabeled examples
SSL Algorithms • Self-Training • Generative Models • S3VMs • Graph-Based Algorithms • Co-training • Multiview algorithms
Self-Training • Assumption: • One’s own high-confidence predictions are correct. • Self-training algorithm (a sketch follows below): • Train f on the labeled data {(x1:l, y1:l)} • Predict on the unlabeled x ∈ Xu • Add (x, f(x)) to the labeled data • Variants: add all pairs, add only the few most confident pairs, or add each pair with a confidence weight • Repeat
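A minimal sketch of this loop, assuming scikit-learn is available; the base classifier (LogisticRegression), the confidence threshold, and the function name self_train are illustrative choices, not part of the original slides.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def self_train(X_l, y_l, X_u, threshold=0.95, max_iter=10):
    """Self-training: repeatedly fit on the labeled set, then move
    high-confidence predictions on unlabeled points into the labeled set."""
    clf = LogisticRegression()
    for _ in range(max_iter):
        clf.fit(X_l, y_l)
        if len(X_u) == 0:
            break
        proba = clf.predict_proba(X_u)
        conf = proba.max(axis=1)                     # confidence of each prediction
        pseudo = clf.classes_[proba.argmax(axis=1)]  # pseudo-labels
        pick = conf >= threshold                     # keep only confident predictions
        if not pick.any():
            break
        X_l = np.vstack([X_l, X_u[pick]])            # grow the labeled set
        y_l = np.concatenate([y_l, pseudo[pick]])
        X_u = X_u[~pick]                             # shrink the unlabeled set
    return clf
```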
Advantages of Self-Training • The simplest semi-supervised learning method. • A wrapper method, applies to existing classifiers. • Often used in real tasks like natural language processing.
Disadvantages of Self-Training • Early mistakes can reinforce themselves. • Fixes are heuristic, e.g. weight the added pairs by confidence or add only the most confident ones. • Little can be said about convergence in general.
SSL Algorithms • Self-Training • Generative Models • S3VMs • Graph-Based Algorithms • Co-training • Multiview algorithms
Generative Models • Assuming each class has a Gaussian distribution, what is the decision boundary?
Basic idea • If we have the full generative model p(X, Y|θ): • quantity of interest: p(y|x, θ) ∝ p(y|θ) p(x|y, θ) • find the maximum likelihood estimate (MLE) of θ, the maximum a posteriori (MAP) estimate, or be Bayesian
Some generative models • Mixture of Gaussian distributions (GMM) • image classification • the EM algorithm • Mixture of multinomial distributions • text categorization • the EM algorithm • Hidden Markov Models (HMM) • speech recognition • Baum-Welch algorithm
Example: GMM • For simplicity, consider binary classification with a GMM fit by MLE. • Model parameters: θ = {w1, w2, μ1, μ2, ∑1, ∑2} • So: p(x, y|θ) = p(y|θ) p(x|y, θ) = wy N(x; μy, ∑y) • To estimate θ, we maximize the labeled-data log likelihood: log p(Xl, Yl|θ) = ∑i=1..l log wyi N(xi; μyi, ∑yi) • Then the MLE is available in closed form: the class frequencies, class sample means, and class sample covariances.
Continued • Having estimated θ, predict y by maximum a posteriori: ŷ = argmaxy p(y|x, θ) = argmaxy wy N(x; μy, ∑y)
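A minimal sketch of this labeled-data MLE and MAP prediction, using NumPy/SciPy; the function names fit_gmm_supervised and predict_map are illustrative.

```python
import numpy as np
from scipy.stats import multivariate_normal

def fit_gmm_supervised(X_l, y_l):
    """Closed-form MLE of a class-conditional Gaussian model from labeled
    data only: class weight w_c, mean mu_c, covariance Sigma_c per class."""
    theta = {}
    for c in np.unique(y_l):
        Xc = X_l[y_l == c]
        theta[c] = (len(Xc) / len(X_l),         # w_c: class frequency
                    Xc.mean(axis=0),            # mu_c: class sample mean
                    np.cov(Xc, rowvar=False))   # Sigma_c: class sample covariance
    return theta

def predict_map(theta, x):
    """MAP prediction: argmax_y  w_y * N(x; mu_y, Sigma_y)."""
    scores = {c: w * multivariate_normal.pdf(x, mean=mu, cov=cov)
              for c, (w, mu, cov) in theta.items()}
    return max(scores, key=scores.get)
```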
What about a semi-supervised GMM? • To estimate θ, we now maximize the log likelihood of labeled and unlabeled data together: log p(Xl, Yl|θ) + λ log p(Xu|θ), where p(x|θ) = ∑y p(y|θ) p(x|y, θ) • Is this more complicated? (Consider a mixture of two normal distributions.)
A more complicated case • For simplicity, consider a mixture of two normal distributions. • Model parameters: θ = {w, μ1, σ1², μ2, σ2²} • So: p(x|θ) = (1 − w) N(x; μ1, σ1²) + w N(x; μ2, σ2²)
A more complicated case • Then the log likelihood is: ∑i log[(1 − w) N(xi; μ1, σ1²) + w N(xi; μ2, σ2²)] • Because the sum over components sits inside the logarithm, direct MLE is difficult numerically.
The EM for GMM • We introduce unobserved latent variables Δi: • If Δi = 0, then (xi, yi) comes from model 0 • Else Δi = 1, then (xi, yi) comes from model 1 • Suppose we knew the values of the Δi’s; then the complete-data log likelihood ∑i [(1 − Δi) log N(xi; μ1, σ1²) + Δi log N(xi; μ2, σ2²)] + ∑i [(1 − Δi) log(1 − w) + Δi log w] is maximized in closed form.
The EM for GMM • The values of the Δi's are actually unknown. • EM’s idea: proceed in an iterative fashion, substituting for each Δi its expected value under the current parameters.
Another version of EM for GMM • Start from the MLE θ = {w1, w2, μ1, μ2, ∑1, ∑2} on (Xl, Yl), then repeat: • The E-step: compute the expected labels p(y|x, θ) for all x ∈ Xu • each unlabeled x contributes a p(y=1|x, θ) fraction to class 1 • and a p(y=2|x, θ) fraction to class 2
Another version of EM for GMM • The M-step: update the MLE θ using Xl together with the now fractionally labeled Xu (each unlabeled point weighted by its class posteriors), as sketched below.
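A minimal sketch of this E-step/M-step loop for a two-class semi-supervised GMM, assuming integer labels in {0, 1}; the fixed iteration count and the function name em_ssl_gmm are illustrative.

```python
import numpy as np
from scipy.stats import multivariate_normal

def em_ssl_gmm(X_l, y_l, X_u, n_iter=50):
    """Semi-supervised GMM via EM for two classes {0, 1}.  Labeled points
    keep hard labels; unlabeled points get soft labels from the posteriors."""
    X_all = np.vstack([X_l, X_u])
    l = len(X_l)
    R = np.zeros((len(X_all), 2))        # responsibilities (soft labels)
    R[np.arange(l), y_l] = 1.0           # labeled rows: hard labels
    R[l:] = 0.5                          # unlabeled rows: start uniform
    for _ in range(n_iter):
        # M-step: weighted MLE of class weights, means, covariances.
        params = []
        for c in (0, 1):
            r = R[:, c]
            w = r.sum() / len(X_all)
            mu = (r[:, None] * X_all).sum(axis=0) / r.sum()
            diff = X_all - mu
            cov = (r[:, None, None] * diff[:, :, None] * diff[:, None, :]).sum(axis=0) / r.sum()
            params.append((w, mu, cov))
        # E-step: recompute posteriors; only the unlabeled rows change.
        dens = np.column_stack([w * multivariate_normal.pdf(X_all, mean=mu, cov=cov)
                                for (w, mu, cov) in params])
        R[l:] = (dens / dens.sum(axis=1, keepdims=True))[l:]
    return params
```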
The EM algorithm in general • Setup: • observed data D = (Xl, Yl, Xu) • hidden data Yu • Goal: find θ to maximize p(D|θ) • Properties: • starts from an arbitrary θ0 (or an estimate on (Xl, Yl)) • The E-step: estimate p(Yu|Xu, θ0) • The M-step: maximize the expected complete-data log likelihood EYu[log p(Xl, Yl, Xu, Yu|θ)] • iteratively improves p(D|θ) • converges to a local maximum of p(D|θ)
Beyond EM • Key is to maximize p(Xl, Yl, Xu|θ). • EM is just one way to maximize it. • Other ways to find parameters are possible too, e.g. variational approximation, or direct optimization.
Advantages of generative models • Clear, well-studied probabilistic framework • Can be extremely effective, if the model is close to correct
Disadvantages of generative models • Often difficult to verify the correctness of the model • Model identifiability: the two models below induce the same marginal p(x) = unif(0, 1), so unlabeled data cannot distinguish them • p(y=1)=0.2, p(x|y=1)=unif(0, 0.2), p(x|y=−1)=unif(0.2, 1) • p(y=1)=0.6, p(x|y=1)=unif(0, 0.6), p(x|y=−1)=unif(0.6, 1) • Can we predict on x=0.5? The first model says y=−1, the second says y=+1. • EM local optima • Unlabeled data may hurt if the generative model is wrong
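A tiny numeric check of the identifiability example, assuming the uniform densities stated above; it only verifies that both models agree on p(x) everywhere but give opposite posteriors at x = 0.5.

```python
def posterior_pos(x, p1, cut):
    """p(y=1 | x) for the model: p(y=1)=p1, p(x|y=1)=unif(0,cut), p(x|y=-1)=unif(cut,1)."""
    dens_pos = (1.0 / cut) if 0 <= x < cut else 0.0
    dens_neg = (1.0 / (1 - cut)) if cut <= x <= 1 else 0.0
    joint_pos, joint_neg = p1 * dens_pos, (1 - p1) * dens_neg
    return joint_pos / (joint_pos + joint_neg)

# Both models have marginal density p(x) = 1 on [0, 1]:
# p1/cut = 0.2/0.2 = 1 and (1-p1)/(1-cut) = 0.8/0.8 = 1 (likewise for 0.6),
# so unlabeled data alone cannot tell them apart, yet at x = 0.5:
print(posterior_pos(0.5, p1=0.2, cut=0.2))  # 0.0 -> predict y = -1
print(posterior_pos(0.5, p1=0.6, cut=0.6))  # 1.0 -> predict y = +1
```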
Heuristics to lessen the danger • Carefully construct the generative model to reflect the task • e.g. multiple Gaussian distributions per class, instead of a single one • Down-weight the unlabeled data (λ<1)
Related method: cluster-and-label • Instead of probabilistic generative models, any clustering algorithm can be used for semi-supervised classification too: • Run your favorite clustering algorithm on Xl,Xu. • Label all points within a cluster by the majority of labeled points in that cluster. • Pro: Yet another simple method using existing algorithms. • Con: Can be difficult to analyze due to their algorithmic nature.
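A minimal sketch of cluster-and-label, assuming scikit-learn's KMeans as the favorite clustering algorithm; the number of clusters and the use of -1 for clusters containing no labeled points are illustrative choices.

```python
import numpy as np
from sklearn.cluster import KMeans

def cluster_and_label(X_l, y_l, X_u, n_clusters=10):
    """Cluster labeled + unlabeled data together, then label every point in a
    cluster by the majority label of the labeled points it contains.
    Assumes labels y_l are nonnegative integers; -1 marks clusters with no labeled points."""
    X_all = np.vstack([X_l, X_u])
    assign = KMeans(n_clusters=n_clusters, n_init=10).fit_predict(X_all)
    y_u = np.full(len(X_u), -1, dtype=int)
    for c in range(n_clusters):
        in_cluster = assign == c
        labeled_in_c = y_l[in_cluster[:len(X_l)]]
        if len(labeled_in_c) > 0:
            y_u[in_cluster[len(X_l):]] = np.bincount(labeled_in_c).argmax()
    return y_u
```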
SSL Algorithms • Self-Training • Generative Models • S3VMs • Graph-Based Algorithms • Co-training • Multiview algorithms
Semi-supervised SVMs • Semi-supervised SVMs (S3VMs) • Also known as Transductive SVMs (TSVMs)
SVM with hinge loss • The hinge loss: (1 − yi f(xi))+ = max(1 − yi f(xi), 0) • The optimization problem (objective function): minf ∑i=1..l (1 − yi f(xi))+ + λ ‖f‖²
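A small sketch of the hinge loss and the regularized objective above for a linear classifier f(x) = w·x + b; this is only meant to make the pieces of the formula concrete, and the names are illustrative.

```python
import numpy as np

def hinge_loss(w, b, X, y):
    """Hinge loss (1 - y_i f(x_i))_+ for a linear f(x) = w.x + b, labels y in {-1, +1}."""
    margins = y * (X @ w + b)
    return np.maximum(1.0 - margins, 0.0)

def svm_objective(w, b, X_l, y_l, lam=1.0):
    """Regularized objective: sum_i (1 - y_i f(x_i))_+  +  lam * ||w||^2."""
    return hinge_loss(w, b, X_l, y_l).sum() + lam * np.dot(w, w)
```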
S3VMs • Assumption: • Unlabeled data from different classes are separated by a large margin. • Basic idea: • Enumerate all 2^u possible labelings of Xu • Build one standard SVM for each labeling (together with Xl) • Pick the SVM with the largest margin • NP-hard!
A smart trick • How to incorporate unlabeled points? • Assign the label sign(f(x)) to each x ∈ Xu, i.e. treat every unlabeled point as correctly classified. • Is this equivalent to the basic idea? (Yes) • The hinge loss on unlabeled points then becomes the “hat loss”: (1 − sign(f(x)) f(x))+ = (1 − |f(x)|)+
S3VM objective function • S3VM objective: minf ∑i=1..l (1 − yi f(xi))+ + λ1 ‖f‖² + λ2 ∑x∈Xu (1 − |f(x)|)+ • The hat loss on unlabeled points means the decision boundary f = 0 prefers to lie where there are few unlabeled data near it.
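Extending the linear SVM sketch above with the hat loss on unlabeled points gives the S3VM objective value; this computes only the objective, not any of the optimization methods discussed later, and the weights lam1, lam2 are illustrative.

```python
import numpy as np

def s3vm_objective(w, b, X_l, y_l, X_u, lam1=1.0, lam2=1.0):
    """S3VM objective for a linear f(x) = w.x + b:
       sum_i (1 - y_i f(x_i))_+        hinge loss on labeled points
       + lam1 * ||w||^2                regularizer
       + lam2 * sum_u (1 - |f(x)|)_+   hat loss on unlabeled points"""
    f_l = X_l @ w + b
    f_u = X_u @ w + b
    hinge = np.maximum(1.0 - y_l * f_l, 0.0).sum()
    hat = np.maximum(1.0 - np.abs(f_u), 0.0).sum()
    return hinge + lam1 * np.dot(w, w) + lam2 * hat
```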
The class balancing constraint • Directly optimizing the S3VM objective often produces an unbalanced classification: • most points fall in one class. • Heuristic class balance: constrain the predicted class proportions on Xu to match the class proportions observed in the labeled data. • Relaxed class balancing constraint: (1/u) ∑x∈Xu f(x) = (1/l) ∑i=1..l yi
S3VM algorithm • The optimization problem: minimize the S3VM objective above, subject to the class balancing constraint. • Classify a new test point x by sign(f(x)).
The S3VM optimization challenge • The SVM objective is convex. • The S3VM objective is non-convex. • Finding a good solution to the semi-supervised SVM problem is difficult, and this has been the focus of S3VM research. • Different approaches: SVMlight, ∇S3VM, continuation S3VM, deterministic annealing, CCCP, Branch and Bound, SDP convex relaxation, etc.
Advantages of S3VMs • Applicable wherever SVMs are applicable, i.e. almost everywhere • Clear mathematical framework • More modest assumptions than generative models or graph-based methods
Disadvantages of S3VMs • Optimization difficult • Can be trapped in bad local optima
SSL Algorithms • Self-Training • Generative Models • S3VMs • Graph-Based Algorithms • Co-training • Multiview algorithms
Graph-Based Algorithms • Assumption: • A graph is given over the labeled and unlabeled data. Instances connected by a heavy edge tend to have the same label. • The optimization problem: find f that fits the labeled data and is smooth on the graph, e.g. minimize ∑i,j wij (f(xi) − f(xj))² subject to f(xi) = yi on the labeled points. • Some algorithms • mincut • harmonic • local and global consistency • manifold regularization
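A minimal sketch of the harmonic-function solution to the graph objective above, assuming a precomputed symmetric weight matrix W whose first l rows/columns are the labeled points and binary labels in {0, 1}; the closed-form linear solve is one standard way to minimize ∑ wij (f(xi) − f(xj))² with f fixed on the labeled nodes.

```python
import numpy as np

def harmonic_labels(W, y_l):
    """Harmonic solution on a graph with weight matrix W (n x n); the first
    l nodes are labeled with y_l in {0, 1}.  Minimizes sum_ij w_ij (f_i - f_j)^2
    subject to f_i = y_i on the labeled nodes."""
    l = len(y_l)
    D = np.diag(W.sum(axis=1))
    L = D - W                                  # graph Laplacian
    L_uu = L[l:, l:]                           # unlabeled-unlabeled block
    L_ul = L[l:, :l]                           # unlabeled-labeled block
    f_u = np.linalg.solve(L_uu, -L_ul @ y_l)   # soft labels on unlabeled nodes
    return (f_u > 0.5).astype(int)             # threshold to hard labels
```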
SSL Algorithms • Self-Training • Generative Models • S3VMs • Graph-Based Algorithms • Co-training • Multiview algorithms
Co-training • Two views of an item: image and HTML text