
An Overview on Semi-Supervised Learning Methods

Matthias Seeger, MPI for Biological Cybernetics, Tuebingen, Germany.


Presentation Transcript


  1. An Overview on Semi-Supervised Learning Methods
     Matthias Seeger, MPI for Biological Cybernetics, Tuebingen, Germany

  2. Overview
     • The SSL Problem
     • Paradigms for SSL. Examples
     • The Importance of Input-dependent Regularization
     Note: Citations omitted here (given in my literature review)

  3. Semi-Supervised Learning
     [Figure: graphical model over the variables μ, x, θ, y]
     SSL is Supervised Learning...
     Goal: Estimate P(y|x) from Labeled Data D_l = {(x_i, y_i)}
     But: Additional Source tells about P(x) (e.g., Unlabeled Data D_u = {x_j})
     The Interesting Case: [condition not preserved in the transcript]

  4. Obvious Baseline Methods
     • Do not use info about P(x): plain Supervised Learning
     • Fit a Mixture Model using Unsupervised Learning, then "label up" components using {y_i}
     The Goal of SSL is To Do Better
     Not: Uniformly and always (No Free Lunch; and yes, of course: unlabeled data can hurt)
     But (as always): If our modelling and algorithmic efforts reflect true problem characteristics
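The second baseline can be sketched in a few lines. This is only an illustration under stated assumptions: 1-D inputs, a plain k-means loop standing in for a fitted mixture model, and majority voting as the "label up" step. The function names (`fit_centers`, `label_up`, `predict`) and the toy data are hypothetical and not taken from the talk.

```python
from collections import Counter

def nearest(centers, x):
    """Index of the component whose center is closest to x."""
    return min(range(len(centers)), key=lambda j: abs(x - centers[j]))

def fit_centers(xs, k, iters=20):
    """Unsupervised step: 1-D k-means as a stand-in for mixture fitting."""
    lo, hi = min(xs), max(xs)
    # Deterministic initialisation: centers spread evenly over the data range.
    centers = [lo + (hi - lo) * (i + 0.5) / k for i in range(k)]
    for _ in range(iters):
        members = [[] for _ in range(k)]
        for x in xs:
            members[nearest(centers, x)].append(x)
        # Empty components keep their previous center.
        centers = [sum(m) / len(m) if m else c for m, c in zip(members, centers)]
    return centers

def label_up(centers, labeled):
    """Assign each component the majority label of the labeled points it owns."""
    votes = [Counter() for _ in centers]
    for x, y in labeled:
        votes[nearest(centers, x)][y] += 1
    return [v.most_common(1)[0][0] if v else None for v in votes]

def predict(centers, component_labels, x):
    """Predict the label attached to the nearest component (None if unlabeled)."""
    return component_labels[nearest(centers, x)]

# Toy data (assumed): two well-separated groups, one labeled point in each.
labeled = [(0.25, "a"), (5.05, "b")]
unlabeled = [0.0, 0.2, 0.4, 0.1, 0.3, 5.0, 5.2, 4.8, 5.1, 4.9]

# The unsupervised step sees all inputs, labeled and unlabeled alike.
centers = fit_centers([x for x, _ in labeled] + unlabeled, k=2)
component_labels = label_up(centers, labeled)
```

With this toy data, `predict(centers, component_labels, 0.35)` falls into the component labeled "a". The sketch also shows why the baseline is fragile: a component that owns no labeled point stays unlabeled, and a clustering that disagrees with the class regions cannot be corrected by the labels.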

  5. The Generative Paradigm
     [Figure: graphical model over the variables y, θ, π, x]
     • Model Class Distributions P(x|y,θ) and Class Priors P(y|π)
     • Implies model for P(y|x) and for P(x)
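How the generative model implies both P(x) and P(y|x) follows from marginalisation and Bayes' rule. The notation is an assumption made here for illustration: θ stands for the parameters of the class distributions and π_y for the prior probability of class y.

```latex
P(x \mid \theta, \pi) = \sum_{y} \pi_y \, P(x \mid y, \theta),
\qquad
P(y \mid x, \theta, \pi)
  = \frac{\pi_y \, P(x \mid y, \theta)}{\sum_{y'} \pi_{y'} \, P(x \mid y', \theta)}
```

Both quantities share the same parameters, which is why information about P(x) from unlabeled data changes the predictor P(y|x) in this paradigm.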

  6. The Joint Likelihood
     Natural Criterion in this context: the joint likelihood of labeled and unlabeled data [formula not preserved in the transcript]
     • Maximize using EM (idea as old as EM)
     • Early and recent theoretical work on asymptotic variance
     • Advantage: Easy to implement for standard mixture model setups
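A minimal sketch of EM on a joint likelihood, under assumptions that are not from the talk: 1-D inputs, one Gaussian per class, and an objective of the common form sum_i log[pi_{y_i} N(x_i)] + lam * sum_j log[sum_c pi_c N(x_j)], where `lam` weights the unlabeled source. The names `ssl_em` and `posterior` are hypothetical.

```python
import math

def gauss(x, mu, var):
    """Density of a 1-D Gaussian N(mu, var) at x."""
    return math.exp(-0.5 * (x - mu) ** 2 / var) / math.sqrt(2 * math.pi * var)

def ssl_em(labeled, unlabeled, n_classes, lam=1.0, iters=50):
    """EM for a labeled + lam-weighted unlabeled joint log likelihood.

    Assumes every class has at least one labeled point (used for initialisation).
    """
    pi = [1.0 / n_classes] * n_classes
    mu = []
    for c in range(n_classes):
        xs_c = [x for x, y in labeled if y == c]
        mu.append(sum(xs_c) / len(xs_c))
    var = [1.0] * n_classes
    for _ in range(iters):
        # E-step: class responsibilities for unlabeled points only;
        # labeled points keep their known class with weight 1.
        resp = []
        for x in unlabeled:
            p = [pi[c] * gauss(x, mu[c], var[c]) for c in range(n_classes)]
            s = sum(p)
            resp.append([v / s for v in p])
        # M-step: weighted maximum likelihood updates.
        totals = []
        for c in range(n_classes):
            w = [(1.0, x) for x, y in labeled if y == c]
            w += [(lam * r[c], x) for r, x in zip(resp, unlabeled)]
            tot = sum(wt for wt, _ in w)
            mu[c] = sum(wt * x for wt, x in w) / tot
            # Variance floor avoids degenerate components.
            var[c] = max(sum(wt * (x - mu[c]) ** 2 for wt, x in w) / tot, 1e-3)
            totals.append(tot)
        pi = [t / sum(totals) for t in totals]
    return pi, mu, var

def posterior(x, pi, mu, var):
    """Implied P(y|x) via Bayes' rule."""
    p = [pi[c] * gauss(x, mu[c], var[c]) for c in range(len(pi))]
    s = sum(p)
    return [v / s for v in p]

# Toy data (assumed): one labeled point per class, ten unlabeled points.
labeled = [(0.0, 0), (4.0, 1)]
unlabeled = [-0.5, 0.3, 0.1, -0.2, 0.4, 3.6, 4.2, 4.5, 3.9, 4.1]
pi, mu, var = ssl_em(labeled, unlabeled, n_classes=2, lam=1.0)
```

Setting `lam=0` recovers purely supervised fitting, while large values let the unlabeled data dominate, which is one way to see why the choice of weighting matters.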

  7. Drawbacks of Generative SSL
     • Choice of source weighting λ crucial
       • Cross-Validation fails for small n
       • Homotopy Continuation (Corduneanu et al.)
     • Just like in Supervised Learning:
       • Model for P(y|x) specified indirectly
       • Fitting not primarily concerned with P(y|x). Also: Have to represent P(x) generally well, not just the aspects which help with P(y|x).

  8. The Diagnostic Paradigm
     [Figure: graphical model over the variables θ, y, μ, x]
     • Model P(y|x,θ) and P(x|μ) directly
     • But: Since θ, μ are independent a priori, θ does not depend on μ, given data
     Knowledge of μ does not influence P(y|x) prediction in a probabilistic setup!
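The independence claim can be checked by writing out the posterior. The symbols are assumptions made for this sketch: θ parameterises P(y|x), μ parameterises P(x), n is the number of labeled points, and k runs over all labeled and unlabeled inputs.

```latex
P(\theta, \mu \mid D_l, D_u)
  \;\propto\;
  \Big[ P(\theta) \prod_{i=1}^{n} P(y_i \mid x_i, \theta) \Big]
  \Big[ P(\mu) \prod_{k} P(x_k \mid \mu) \Big]
\quad\Longrightarrow\quad
P(\theta \mid D_l, D_u) = P(\theta \mid D_l)
  \;\propto\; P(\theta) \prod_{i=1}^{n} P(y_i \mid x_i, \theta)
```

The posterior factorises into a θ part and a μ part, so the unlabeled inputs only sharpen the belief about μ and leave the belief about θ unchanged.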

  9. What To Do About It
     • Non-probabilistic diagnostic techniques
       • Replace expected loss by [expression not preserved in the transcript] (Tong, Koller; Chapelle et al.). Very limited effect if n small
       • Some old work (e.g., Anderson)
     • Drop the prior independence of θ, μ: Input-dependent Regularization

  10. Input-Dependent Regularization
     [Figure: graphical model over the variables y, μ, x, θ]
     • Conditional priors P(θ|μ) make P(y|x) estimation dependent on P(x)
     • Now, unlabeled data can really help...
     • And can hurt for the same reason!
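With a conditional prior, the posterior over the predictor's parameters no longer ignores the inputs. The following is a sketch under assumed notation: θ parameterises P(y|x), μ parameterises P(x), n is the number of labeled points, and x_1, ..., x_N are all labeled and unlabeled inputs.

```latex
P(\theta \mid D_l, D_u)
  \;\propto\;
  \Big[ \prod_{i=1}^{n} P(y_i \mid x_i, \theta) \Big]
  \int P(\theta \mid \mu) \, P(\mu \mid x_1, \dots, x_N) \, d\mu
```

The integral acts as an effective prior on θ that is shaped by what the inputs reveal about μ. If that conditional prior encodes a wrong assumption about how P(x) relates to P(y|x), the same mechanism pulls θ in the wrong direction.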

  11. The Cluster Assumption (CA)
     • Empirical Observation: Clustering of data {x_j} w.r.t. "sensible" distance / features often fairly compatible with class regions
     • Weaker: Class regions do not tend to cut high-volume regions of P(x)
     • Why? Ask Philosophers! My guess: Selection bias for features/distance
     No Matter Why: Many SSL Methods implement the CA and work fine in practice

  12. Examples For IDR Using CA
     • Label Propagation, Gaussian Random Fields: Regularization depends on graph structure which is built from all {x_j}. More smoothness in regions of high connectivity / affinity flows
     • Cluster kernels for SVM (Chapelle et al.)
     • Information Regularization (Corduneanu, Jaakkola)
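The first example can be illustrated with a small label propagation loop. This is a sketch under assumptions, not the method of any cited paper: 1-D inputs, a fully connected graph with Gaussian affinities of width `sigma`, labels coded as +1 / -1, and simple repeated neighbour averaging with the labeled nodes clamped. The function name and toy data are hypothetical.

```python
import math

def label_propagation(xs, labels, sigma=0.5, iters=200):
    """Propagate +1/-1 labels over an affinity graph built from all inputs.

    xs:     list of 1-D inputs (labeled and unlabeled)
    labels: dict mapping index -> +1.0 or -1.0 for the labeled inputs
    Returns one real-valued score per input; its sign is the predicted class.
    """
    n = len(xs)
    # Affinity matrix from all inputs; no self-loops.
    w = [[math.exp(-(xs[i] - xs[j]) ** 2 / (2 * sigma ** 2)) if i != j else 0.0
          for j in range(n)] for i in range(n)]
    f = [float(labels.get(i, 0.0)) for i in range(n)]
    for _ in range(iters):
        new_f = []
        for i in range(n):
            if i in labels:
                # Labeled nodes are clamped to their observed label.
                new_f.append(float(labels[i]))
            else:
                # Unlabeled nodes take the affinity-weighted mean of their neighbours.
                deg = sum(w[i])
                new_f.append(sum(w[i][j] * f[j] for j in range(n)) / deg)
        f = new_f
    return f

# Toy data (assumed): two clusters, one labeled point at the far end of each.
xs = [0.0, 0.3, 0.6, 0.9, 3.0, 3.3, 3.6, 3.9]
labels = {0: 1.0, 7: -1.0}
scores = label_propagation(xs, labels)
```

Scores stay nearly constant inside each densely connected cluster and change sign only across the sparse gap between them, which is the graph-based form of the cluster assumption.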

  13. More Examples for IDR
     Some methods do IDR, but implement the CA only in special cases:
     • Fisher Kernels (Jaakkola et al.): Kernel from Fisher features. Automatic feature induction from P(x) model
     • Co-Training (Blum, Mitchell): Consistency across different views (features)

  14. Is SSL Always Generative?
     Wait: We have to model P(x) somehow. Is this not always generative then? ... No!
     • Generative: Model P(x|y) fairly directly, P(y|x) model and effect of P(x) are implicit
     • Diagnostic IDR:
       • Direct model for P(y|x), more flexibility
       • Influence of P(x) knowledge on P(y|x) prediction directly controlled, e.g. through CA
       Model for P(x) can be much less elaborate

  15. Conclusions
     • Gave a taxonomy for probabilistic approaches to SSL
     • Illustrated paradigms by examples from literature
     • Tried to clarify some points which have led to confusion in the past
