An Overview on Semi-Supervised Learning Methods
Matthias Seeger
MPI for Biological Cybernetics, Tuebingen, Germany
Overview
• The SSL Problem
• Paradigms for SSL. Examples
• The Importance of Input-dependent Regularization
Note: Citations omitted here (given in my literature review)
Semi-Supervised Learning
SSL is Supervised Learning...
Goal: Estimate P(y|x) from labeled data D_l = {(x_i, y_i)}
But: An additional source tells about P(x) (e.g., unlabeled data D_u = {x_j})
The Interesting Case: [Figure: graphical model with nodes μ, y, x, θ]
Obvious Baseline Methods
• Do not use info about P(x): Supervised Learning
• Fit a mixture model using unsupervised learning, then “label up” components using {y_i}
The Goal of SSL is To Do Better
Not: Uniformly and always (No Free Lunch; and yes, of course: unlabeled data can hurt)
But (as always): If our modelling and algorithmic efforts reflect true problem characteristics
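The second baseline (cluster first, then "label up") can be sketched in a few lines. This is only an illustration under stated assumptions: plain k-means on scalar inputs stands in as a hard-assignment substitute for a fitted mixture model, and the function names (`kmeans_1d`, `label_up`, `predict`) and the toy data are hypothetical, not taken from the slides.

```python
from collections import Counter

def nearest_center(x, centers):
    """Index of the center closest to x."""
    return min(range(len(centers)), key=lambda c: abs(x - centers[c]))

def kmeans_1d(xs, k=2, iters=50):
    """Plain k-means on scalars (a hard-assignment stand-in for a mixture fit)."""
    lo, hi = min(xs), max(xs)
    # Deterministic start: centers spread evenly over the data range.
    centers = [lo + (hi - lo) * (c + 0.5) / k for c in range(k)]
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for x in xs:
            groups[nearest_center(x, centers)].append(x)
        # Keep the old center if a cluster happens to be empty.
        centers = [sum(g) / len(g) if g else centers[c]
                   for c, g in enumerate(groups)]
    return centers

def label_up(centers, labeled):
    """Give each cluster the majority label of the labeled points it contains."""
    votes = [Counter() for _ in centers]
    for x, y in labeled:
        votes[nearest_center(x, centers)][y] += 1
    return [v.most_common(1)[0][0] if v else None for v in votes]

def predict(x, centers, cluster_labels):
    return cluster_labels[nearest_center(x, centers)]

# Toy data (assumed): two labeled points, eight unlabeled ones.
labeled = [(0.1, "a"), (4.9, "b")]
unlabeled = [-0.3, 0.2, 0.4, 0.0, 4.6, 5.1, 5.3, 4.8]

# Clustering uses all inputs; the labels are only used to name the clusters.
centers = kmeans_1d([x for x, _ in labeled] + unlabeled)
cluster_labels = label_up(centers, labeled)
```

Note that the labels play no role in where the clusters end up, which is exactly why this baseline can fail when clusters and classes disagree.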
The Generative Paradigm
• Model class distributions P(x|y,θ) and P(y|π)
• Implies model for P(y|x) and for P(x)
[Figure: graphical model with nodes y, θ, π, x]
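The "implies" step is Bayes' rule. As a sketch, assume class distributions P(x|y,θ) and class priors P(y|π); the symbols θ and π are read off the slide's graphical model, so the exact notation is an assumption. The implied models are then:

```latex
P(x \mid \theta, \pi) = \sum_{y} P(y \mid \pi)\, P(x \mid y, \theta),
\qquad
P(y \mid x, \theta, \pi) =
  \frac{P(y \mid \pi)\, P(x \mid y, \theta)}
       {\sum_{y'} P(y' \mid \pi)\, P(x \mid y', \theta)}.
```

Both quantities share the same parameters, which is why information about P(x) constrains the predictor P(y|x) in this paradigm.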
The Joint Likelihood
Natural criterion in this context: the joint likelihood of labeled and unlabeled data
• Maximize using EM (idea as old as EM)
• Early and recent theoretical work on asymptotic variance
• Advantage: Easy to implement for standard mixture model setups
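To make "easy to implement" concrete, here is a minimal EM sketch for one common form of the criterion. The transcript does not preserve the slide's formula, so the following is an assumption: one 1-D Gaussian per class, and the objective sum_i log P(x_i, y_i) + lam * sum_j log P(x_j), where `lam` is a hypothetical name for a weight on the unlabeled term. All function names and the toy data are illustrative.

```python
import math

def gauss(x, mean, var):
    """Density of a 1-D Gaussian."""
    return math.exp(-(x - mean) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

def joint_loglik(labeled, unlabeled, prior, mean, var, lam):
    """Labeled log-likelihood plus lam times the unlabeled (marginal) one."""
    ll = sum(math.log(prior[y] * gauss(x, mean[y], var[y])) for x, y in labeled)
    ll += lam * sum(
        math.log(sum(prior[c] * gauss(x, mean[c], var[c]) for c in prior))
        for x in unlabeled
    )
    return ll

def em_step(labeled, unlabeled, prior, mean, var, lam):
    """One EM iteration; assumes every class has at least one labeled point."""
    classes = list(prior)
    # E-step: labeled points keep their known class (weight 1);
    # unlabeled points get posterior class weights, scaled by lam.
    weighted = [(x, {c: 1.0 if c == y else 0.0 for c in classes})
                for x, y in labeled]
    for x in unlabeled:
        joint = {c: prior[c] * gauss(x, mean[c], var[c]) for c in classes}
        total = sum(joint.values())
        weighted.append((x, {c: lam * joint[c] / total for c in classes}))
    # M-step: weighted maximum-likelihood updates.
    mass = {c: sum(r[c] for _, r in weighted) for c in classes}
    total_mass = sum(mass.values())
    new_prior = {c: mass[c] / total_mass for c in classes}
    new_mean = {c: sum(r[c] * x for x, r in weighted) / mass[c] for c in classes}
    new_var = {
        # Small floor guards against a collapsing component.
        c: max(sum(r[c] * (x - new_mean[c]) ** 2 for x, r in weighted) / mass[c],
               1e-3)
        for c in classes
    }
    return new_prior, new_mean, new_var

# Toy data (assumed): two labeled points, six unlabeled ones.
labeled = [(0.0, 0), (5.0, 1)]
unlabeled = [-0.5, 0.3, 0.6, 4.4, 5.2, 5.5]
prior, mean, var = {0: 0.5, 1: 0.5}, {0: 1.0, 1: 4.0}, {0: 1.0, 1: 1.0}

history = []
for _ in range(20):
    prior, mean, var = em_step(labeled, unlabeled, prior, mean, var, lam=0.5)
    history.append(joint_loglik(labeled, unlabeled, prior, mean, var, lam=0.5))
```

Setting `lam=0` recovers purely supervised fitting, while large values let the unlabeled data dominate, which is where the choice of weighting becomes delicate.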
Drawbacks of Generative SSL
• Choice of source weighting λ crucial
  • Cross-validation fails for small n
  • Homotopy continuation (Corduneanu et al.)
• Just like in supervised learning:
  • Model for P(y|x) specified indirectly
  • Fitting not primarily concerned with P(y|x). Also: Have to represent P(x) generally well, not just the aspects which help with P(y|x).
The Diagnostic Paradigm
• Model P(y|x,θ) and P(x|μ) directly
• But: Since θ, μ are independent a priori, θ does not depend on μ, given data
Knowledge of μ does not influence P(y|x) prediction in a probabilistic setup!
[Figure: graphical model with nodes θ, y, μ, x]
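The independence claim can be checked by writing out the joint posterior. In this sketch, θ parameterizes P(y|x,θ), μ parameterizes P(x|μ), and the prior factorizes as P(θ)P(μ); the index notation over D_l and D_u is an assumption made for illustration:

```latex
P(\theta, \mu \mid D_l, D_u)
  \;\propto\; P(\theta)\, P(\mu)
     \prod_{i \in D_l} P(y_i \mid x_i, \theta)\, P(x_i \mid \mu)
     \prod_{j \in D_u} P(x_j \mid \mu)
  \;=\; \Big[ P(\theta) \prod_{i \in D_l} P(y_i \mid x_i, \theta) \Big]
        \Big[ P(\mu) \prod_{x \in D_l \cup D_u} P(x \mid \mu) \Big].
```

The posterior factorizes into a θ part and a μ part, so P(θ | D_l, D_u) = P(θ | D_l): the unlabeled inputs only sharpen μ and leave the predictor untouched.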
What To Do About It
• Non-probabilistic diagnostic techniques
  • Replace expected loss by [expression missing in source] (Tong, Koller; Chapelle et al.). Very limited effect if n small
  • Some old work (e.g., Anderson)
• Drop the prior independence of θ, μ: Input-dependent Regularization
Input-Dependent Regularization
• Conditional priors P(θ|μ) make P(y|x) estimation dependent on P(x)
• Now, unlabeled data can really help...
• And can hurt for the same reason!
[Figure: graphical model with nodes y, μ, x, θ]
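A short derivation shows where the dependence enters. As a sketch, assume the prior factorizes as P(θ|μ)P(μ), with θ parameterizing P(y|x,θ) and μ parameterizing P(x|μ); the notation is an assumption for illustration. The posterior over θ then becomes a mixture over μ:

```latex
P(\theta \mid D_l, D_u)
  = \int P(\theta \mid \mu, D_l)\, P(\mu \mid D_l, D_u)\, d\mu,
\qquad
P(\theta \mid \mu, D_l)
  \;\propto\; P(\theta \mid \mu) \prod_{i \in D_l} P(y_i \mid x_i, \theta).
```

Unlabeled data sharpens P(μ | D_l, D_u), and the conditional prior P(θ|μ) passes that information on to θ. If the conditional prior encodes a wrong assumption about how P(x) relates to P(y|x), the same channel transmits the error.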
The Cluster Assumption (CA)
• Empirical observation: Clustering of data {x_j} w.r.t. a “sensible” distance / features is often fairly compatible with class regions
• Weaker: Class regions do not tend to cut high-volume regions of P(x)
• Why? Ask philosophers! My guess: Selection bias for features/distance
No matter why: Many SSL methods implement the CA and work fine in practice
Examples For IDR Using CA
• Label Propagation, Gaussian Random Fields: Regularization depends on graph structure which is built from all {x_j}. More smoothness in regions of high connectivity / affinity flows
• Cluster kernels for SVM (Chapelle et al.)
• Information Regularization (Corduneanu, Jaakkola)
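The first example can be illustrated with a small label-propagation sketch. It is not the exact algorithm of any cited paper. It assumes scalar inputs, a fully connected graph with Gaussian affinities, and repeated neighbor averaging with the labeled nodes clamped, which converges to the harmonic solution on the graph. The names `affinity`, `propagate`, and `width` are hypothetical.

```python
import math

def affinity(a, b, width=1.0):
    """Gaussian affinity between two scalar inputs (assumed graph weight)."""
    return math.exp(-((a - b) ** 2) / (2 * width ** 2))

def propagate(xs, labels, iters=200, width=1.0):
    """Propagate binary labels over the affinity graph built from all inputs.

    labels maps an index in xs to 0 or 1 for the labeled points.
    Returns one score in [0, 1] per point; labeled points stay clamped.
    """
    n = len(xs)
    # Graph weights use every input, labeled or not; no self-loops.
    w = [[affinity(xs[i], xs[j], width) if i != j else 0.0 for j in range(n)]
         for i in range(n)]
    f = [float(labels.get(i, 0.5)) for i in range(n)]
    for _ in range(iters):
        for i in range(n):
            if i in labels:
                continue  # labeled nodes are held fixed
            # Each unlabeled node takes the affinity-weighted mean of its neighbors.
            f[i] = sum(w[i][j] * f[j] for j in range(n)) / sum(w[i])
    return f

# Toy data (assumed): two clusters, one labeled point at each far end.
xs = [0.0, 0.4, 0.8, 1.2, 4.0, 4.4, 4.8, 5.2]
labels = {0: 0, 7: 1}
scores = propagate(xs, labels)
predictions = [1 if s > 0.5 else 0 for s in scores]
```

The scores stay nearly constant within each cluster and change mostly across the low-density gap, which is the cluster assumption expressed as a graph regularizer.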
More Examples for IDR
Some methods do IDR, but implement the CA only in special cases:
• Fisher Kernels (Jaakkola et al.): Kernel from Fisher features. Automatic feature induction from P(x) model
• Co-Training (Blum, Mitchell): Consistency across different views (features)
Is SSL Always Generative?
Wait: We have to model P(x) somehow. Is this not always generative then? ... No!
• Generative: Model P(x|y) fairly directly; P(y|x) model and effect of P(x) are implicit
• Diagnostic IDR:
  • Direct model for P(y|x), more flexibility
  • Influence of P(x) knowledge on P(y|x) prediction directly controlled, e.g. through CA
  Model for P(x) can be much less elaborate
Conclusions
• Given taxonomy for probabilistic approaches to SSL
• Illustrated paradigms by examples from literature
• Tried to clarify some points which have led to confusion in the past