Self-taught Learning: Transfer Learning from Unlabeled Data
Rajat Raina, Honglak Lee, Roger Grosse, Alexis Battle, Chaitanya Ekanadham, Helen Kwong, Benjamin Packer, Narut Sereewattanawoot, Andrew Y. Ng
Stanford University
The “one learning algorithm” hypothesis
• There is some evidence that the human brain uses essentially the same algorithm to understand many different input modalities.
• Example: ferret experiments, in which the “input” for vision was plugged into the auditory part of the brain, and the auditory cortex learns to “see.” (Roe et al., 1992; Hawkins & Blakeslee, 2004)
• If we could find this one learning algorithm, we would be done. (Finally!)
Finding a deep learning algorithm
• If the brain really is one learning algorithm, it would suffice to:
• Find a learning algorithm for a single layer, and
• Show that it can build a small number of layers.
• We evaluate our algorithms:
• Against biology, e.g. sparse RBMs for V2 (poster yesterday, Lee et al.).
• On applications (this talk).
Supervised learning
• Train on labeled examples (cars vs. motorcycles), then test on new examples of the same classes.
• Supervised learning algorithms may not work well with limited labeled data.
Learning in humans
• Your brain has 10^14 synapses (connections).
• You will live for about 10^9 seconds.
• If each synapse requires 1 bit to parameterize, you need to “learn” 10^14 bits in 10^9 seconds, or 10^5 bits per second.
• Human learning is largely unsupervised, and uses readily available unlabeled data.
(Geoffrey Hinton, personal communication)
“Brain-like” learning
• Supervised learning uses only the labeled train and test sets (cars vs. motorcycles).
• “Brain-like” learning additionally uses unlabeled images, randomly downloaded from the Internet.
“Brain-like” learning, i.e. “self-taught learning”
• Labeled digits + unlabeled English characters = ?
• Labeled webpages + unlabeled newspaper articles = ?
• Labeled Russian speech + unlabeled English speech = ?
Recent history of machine learning
• 20 years ago: supervised learning (labeled cars and motorcycles only).
• 10 years ago: semi-supervised learning (plus unlabeled cars and motorcycles).
• 10 years ago: transfer learning (plus labeled data from related classes: bus, tractor, aircraft, helicopter, motorcycle, car).
• Next: self-taught learning? (plus unlabeled natural scenes)
Self-taught Learning
• Labeled examples: {(x_l^(1), y^(1)), …, (x_l^(m), y^(m))}.
• Unlabeled examples: {x_u^(1), …, x_u^(k)}.
• The unlabeled and labeled data:
• Need not share the labels y.
• Need not share a generative distribution.
• Advantage: such unlabeled data is often easy to obtain.
A self-taught learning algorithm
Overview: represent each labeled or unlabeled input x as a sparse linear combination of “basis vectors” b_j, e.g.
x = 0.8 * b_87 + 0.3 * b_376 + 0.5 * b_411
(On the slide, x and each basis vector are visualized as small image patches.)
A self-taught learning algorithm
Key steps (sketched in code below):
• Learn good bases b using the unlabeled data x_u.
• Use these learnt bases to construct “higher-level” features for the labeled data, e.g. x_l = 0.8 * b_87 + 0.3 * b_376 + 0.5 * b_411 gives the feature values 0.8, 0.3, 0.5.
• Apply a standard supervised learning algorithm to these features.
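The three steps can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code: scikit-learn's dictionary learning stands in for the efficient sparse-coding solver of Lee et al. (2006), and the array names X_unlabeled, X_labeled, y_labeled are hypothetical placeholders.

```python
# Minimal sketch of the self-taught learning pipeline (assumed inputs:
# X_unlabeled, X_labeled are (n_examples, n_dims) arrays, y_labeled the labels).
from sklearn.decomposition import DictionaryLearning
from sklearn.svm import LinearSVC

def self_taught_learning(X_unlabeled, X_labeled, y_labeled,
                         n_bases=128, beta=1.0):
    # 1. Learn bases b from the unlabeled data (sparse coding).
    dico = DictionaryLearning(n_components=n_bases, alpha=beta,
                              transform_algorithm='lasso_lars',
                              transform_alpha=beta)
    dico.fit(X_unlabeled)

    # 2. Compute sparse activations a(x_l) as higher-level features
    #    for the labeled examples.
    features = dico.transform(X_labeled)

    # 3. Train a standard supervised classifier (here a linear SVM).
    clf = LinearSVC().fit(features, y_labeled)
    return dico, clf
```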
Learning the bases: sparse coding
Given only unlabeled data, we find good bases b using sparse coding:
minimize over b, a:   Σ_i || x_u^(i) − Σ_j a_j^(i) b_j ||²  (reconstruction error)  +  β Σ_i || a^(i) ||_1  (sparsity penalty)
(Efficient algorithms: Lee et al., NIPS 2006)
[Details: an extra normalization constraint on the bases b_j is required.]
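For intuition, the objective can be optimized (slowly) by alternating between the activations and the bases. The toy sketch below does exactly that with scikit-learn's Lasso; it is not the efficient solver of Lee et al. (2006), and all names and hyperparameters are illustrative.

```python
# Toy alternating minimization of the sparse coding objective above.
# X_u: (k, d) array of unlabeled examples; returns bases B: (n_bases, d).
import numpy as np
from sklearn.linear_model import Lasso

def sparse_coding(X_u, n_bases=64, beta=0.1, n_iter=20, seed=0):
    rng = np.random.default_rng(seed)
    B = rng.normal(size=(n_bases, X_u.shape[1]))
    B /= np.linalg.norm(B, axis=1, keepdims=True)          # ||b_j|| <= 1
    for _ in range(n_iter):
        # Fix B: L1-penalized least squares for the activations a^(i).
        A = Lasso(alpha=beta, fit_intercept=False,
                  max_iter=5000).fit(B.T, X_u.T).coef_     # shape (k, n_bases)
        # Fix A: least squares for the bases, then re-enforce the
        # normalization constraint on each basis vector.
        B, *_ = np.linalg.lstsq(A, X_u, rcond=None)
        B /= np.maximum(np.linalg.norm(B, axis=1, keepdims=True), 1.0)
    return B
```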
Example bases
• Bases learnt on natural images look like “edges”.
• Bases learnt on handwritten characters look like “strokes”.
Constructing features
• Using the learnt bases b, compute features a(x_l) for each example x_l from the classification task by solving:
a(x_l) = argmin_a || x_l − Σ_j a_j b_j ||²  (reconstruction error)  +  β || a ||_1  (sparsity penalty)
• For example, x_l = 0.8 * b_87 + 0.3 * b_376 + 0.5 * b_411 yields a feature vector with 0.8, 0.3 and 0.5 in positions 87, 376 and 411, and zeros elsewhere.
• Finally, learn a classifier using a standard supervised learning algorithm (e.g., an SVM) over these features.
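Concretely, each feature vector is the minimizer of the L1-penalized reconstruction above. A small sketch, assuming B is the basis matrix learnt from the unlabeled data (the function name and scaling are illustrative, not the authors'):

```python
# Compute the sparse activations a(x_l) for one labeled example x_l,
# given bases B of shape (n_bases, d). Note: sklearn's Lasso scales the
# squared error by 1/(2*n), so its alpha matches the slide's beta only
# up to a constant factor.
import numpy as np
from sklearn.linear_model import Lasso

def features(x_l, B, beta=0.1):
    lasso = Lasso(alpha=beta, fit_intercept=False, max_iter=5000)
    lasso.fit(B.T, x_l)       # columns of B.T are the basis vectors b_j
    return lasso.coef_        # mostly zeros; these activations are the features
```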
Image classification
A large image (platypus, Caltech101 dataset) and a visualization of the corresponding sparse-coding features.
Image classification (Caltech101, 15 labeled images per class)
Other reported results (classification accuracy):
• Fei-Fei et al., 2004: 16%
• Berg et al., 2005: 17%
• Holub et al., 2005: 40%
• Serre et al., 2005: 35%
• Berg et al., 2005: 48%
• Zhang et al., 2006: 59%
• Lazebnik et al., 2006: 56%
Self-taught learning: 36.0% error reduction.
Character recognition
• Handwritten English classification (20 labeled images per handwritten character), bases learnt on digits: 8.2% error reduction.
• English font classification (20 labeled images per font character), bases learnt on handwritten English: 2.8% error reduction.
Text classification
• Webpage classification (2 labeled documents per class), bases learnt on Reuters newswire: 4.0% error reduction.
• UseNet classification (2 labeled documents per class), bases learnt on Reuters newswire: 6.5% error reduction.
Shift-invariant sparse coding
• The input is reconstructed as a sum of basis functions convolved with sparse activation maps, so each basis function may appear at any shift in the signal.
• (Figure: sparse features, basis functions, reconstruction.)
(Algorithms: Grosse et al., UAI 2007)
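A small sketch of the shift-invariant reconstruction: each basis function is convolved with a sparse activation map, so it may contribute at any offset. This is my illustration of the idea, not code from Grosse et al. (2007); names and shapes are assumptions.

```python
import numpy as np

def reconstruct(activations, bases):
    """activations: (n_bases, T) sparse maps; bases: (n_bases, L) filters.
    Returns the reconstructed 1-D signal of length T + L - 1."""
    T, L = activations.shape[1], bases.shape[1]
    signal = np.zeros(T + L - 1)
    for a_j, b_j in zip(activations, bases):
        # Each basis contributes wherever its activation map is nonzero.
        signal += np.convolve(a_j, b_j)
    return signal
```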
Audio classification
• Speaker identification (5 labels, TIMIT corpus, 1 sentence per speaker), bases learnt on different dialects: 8.7% error reduction.
• Musical genre classification (5 labels, 18 seconds per genre), bases learnt on different genres and songs: 5.7% error reduction.
(Details: Grosse et al., UAI 2007)
Sparse deep belief networks (new)
• Building block: a sparse RBM with a visible layer v, a hidden layer h, and parameters W, b, c.
(Details: Lee et al., NIPS 2007. Poster yesterday.)
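For concreteness, one contrastive-divergence update of a sparse RBM might look roughly like the sketch below, where a regularizer nudges each hidden unit's average activation toward a small target p. The exact penalty and hyperparameters are assumptions in the spirit of Lee et al., not the paper's formulation.

```python
# One CD-1 update for a sparse RBM (sketch). v0: (n, d_v) batch of
# visible vectors; W: (d_v, d_h); b: (d_v,) visible bias; c: (d_h,) hidden bias.
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sparse_rbm_update(v0, W, b, c, lr=0.01, p=0.02, sparsity_cost=0.1, rng=None):
    if rng is None:
        rng = np.random.default_rng(0)
    # Positive phase: hidden probabilities and a binary sample.
    h0 = sigmoid(v0 @ W + c)
    h0_sample = (rng.random(h0.shape) < h0).astype(float)
    # Negative phase: one Gibbs step (CD-1).
    v1 = sigmoid(h0_sample @ W.T + b)
    h1 = sigmoid(v1 @ W + c)
    # Contrastive-divergence gradients.
    dW = v0.T @ h0 - v1.T @ h1
    db = (v0 - v1).sum(axis=0)
    dc = (h0 - h1).sum(axis=0)
    # Sparsity term: push each hidden unit's mean activation toward p.
    dc += sparsity_cost * (p - h0.mean(axis=0)) * len(v0)
    n = len(v0)
    return W + lr * dW / n, b + lr * db / n, c + lr * dc / n
```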
Sparse deep belief networks
• Image classification (Caltech101 dataset): 3.2% error reduction.
(Details: Lee et al., NIPS 2007. Poster yesterday.)
Summary
• Self-taught learning: the unlabeled data (e.g., images randomly downloaded from the Internet) does not need to share the labels of the classification task (e.g., cars vs. motorcycles).
• Use unlabeled data to discover features.
• Use sparse coding to construct an easy-to-classify, “higher-level” representation: x = 0.8 * b_87 + 0.3 * b_376 + 0.5 * b_411.
Related Work
• Weston et al., ICML 2006: make stronger assumptions on the unlabeled data.
• Ando & Zhang, JMLR 2005: for natural language tasks and character recognition, use heuristics to construct a transfer learning task from unlabeled data.