Self-taught Learning Transfer Learning from Unlabeled Data. Rajat Raina Honglak Lee, Roger Grosse Alexis Battle, Chaitanya Ekanadham, Helen Kwong, Benjamin Packer, Narut Sereewattanawoot Andrew Y. Ng Stanford University. The “one learning algorithm” hypothesis.
Self-taught Learning: Transfer Learning from Unlabeled Data
Rajat Raina, Honglak Lee, Roger Grosse, Alexis Battle, Chaitanya Ekanadham, Helen Kwong, Benjamin Packer, Narut Sereewattanawoot, Andrew Y. Ng
Stanford University
The “one learning algorithm” hypothesis
• There is some evidence that the human brain uses essentially the same algorithm to understand many different input modalities.
• Example: ferret experiments, in which the “input” for vision was plugged into the auditory part of the brain, and the auditory cortex learned to “see.” (Roe et al., 1992; Hawkins & Blakeslee, 2004)
• If we could find this one learning algorithm, we would be done. (Finally!)
Finding a deep learning algorithm
• If the brain really is one learning algorithm, it would suffice to just:
  • Find a learning algorithm for a single layer, and
  • Show that it can build a small number of layers.
• We evaluate our algorithms:
  • Against biology, e.g., sparse RBMs for V2: poster yesterday (Lee et al.).
  • On applications: this talk.
Supervised learning
[Figure: labeled training and test images of cars and motorcycles.]
Supervised learning algorithms may not work well with limited labeled data.
Learning in humans
• Your brain has 10^14 synapses (connections).
• You will live for 10^9 seconds.
• If each synapse requires 1 bit to parameterize, you need to “learn” 10^14 bits in 10^9 seconds.
• Or, 10^5 bits per second.
Human learning is largely unsupervised, and uses readily available unlabeled data. (Geoffrey Hinton, personal communication)
“Brain-like” Learning
[Figure: labeled training and test images of cars and motorcycles, plus unlabeled images randomly downloaded from the Internet.]
“Brain-like” Learning
• Labeled digits + unlabeled English characters = ?
• Labeled webpages + unlabeled newspaper articles = ?
• Labeled Russian speech + unlabeled English speech = ?
“Self-taught Learning”
• Labeled digits + unlabeled English characters = ?
• Labeled webpages + unlabeled newspaper articles = ?
• Labeled Russian speech + unlabeled English speech = ?
Recent history of machine learning
• 20 years ago: Supervised learning.
• 10 years ago: Semi-supervised learning.
• 10 years ago: Transfer learning.
• Next: Self-taught learning?
[Figure: example images for each setting, labeled “Cars” and “Motorcycles”; “Bus,” “Tractor,” “Aircraft,” “Helicopter,” “Motorcycle,” “Car”; and “Natural scenes.”]
Self-taught Learning
• Inputs: a set of labeled examples (inputs x_l with labels y) and a set of unlabeled examples (inputs only).
• The unlabeled and labeled data:
  • Need not share labels y.
  • Need not share a generative distribution.
• Advantage: such unlabeled data is often easy to obtain.
A self-taught learning algorithm
Overview: Represent each labeled or unlabeled input as a sparse linear combination of “basis vectors” b.
x = 0.8 * b_87 + 0.3 * b_376 + 0.5 * b_411
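To make the overview concrete, here is a minimal numpy sketch of an input written as a sparse linear combination of basis vectors. The sizes (512 bases, 196-dimensional inputs) and the random bases are assumptions chosen for illustration; only the three weights and basis indices come from the slide.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes: 512 basis vectors, each a flattened 14x14 patch.
n_bases, dim = 512, 196
B = rng.standard_normal((dim, n_bases))
B /= np.linalg.norm(B, axis=0)  # columns b_1 ... b_512, scaled to unit norm

# Sparse activation vector: only three of the 512 entries are non-zero.
# The slide's indices are 1-based (b_87, b_376, b_411); numpy is 0-based.
a = np.zeros(n_bases)
a[[86, 375, 410]] = [0.8, 0.3, 0.5]

# x = 0.8 * b_87 + 0.3 * b_376 + 0.5 * b_411
x = B @ a
```

The activation vector `a`, not the raw input `x`, is what later serves as the “higher-level” feature representation.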
A self-taught learning algorithm
Key steps:
• Learn good bases b using unlabeled data.
• Use these learnt bases to construct “higher-level” features for the labeled data.
• Apply a standard supervised learning algorithm on these features.
x = 0.8 * b_87 + 0.3 * b_376 + 0.5 * b_411
Learning the bases: Sparse coding
Given only unlabeled data, we find good bases b using sparse coding, by minimizing an objective with two terms: a reconstruction error and a sparsity penalty.
(Efficient algorithms: Lee et al., NIPS 2006)
[Details: An extra normalization constraint on b is required.]
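The objective itself did not survive in the slide text. As a sketch, the standard L1-penalized sparse coding objective matches the two named terms and the normalization constraint. The notation is assumed rather than taken from the slide: x_u^(i) for the i-th unlabeled input, a^(i) for its activations, and β for the sparsity weight.

```latex
\min_{b,\,a} \;
\underbrace{\sum_{i} \Bigl\| x_u^{(i)} - \sum_{j} a_j^{(i)} b_j \Bigr\|_2^2}_{\text{reconstruction error}}
\;+\;
\underbrace{\beta \sum_{i} \bigl\| a^{(i)} \bigr\|_1}_{\text{sparsity penalty}}
\qquad \text{s.t.} \quad \| b_j \|_2 \le 1 \;\; \text{for all } j
```

Without the norm constraint, the penalty could be made arbitrarily small by scaling the bases up and the activations down, which is presumably why the slide notes that a normalization constraint is required.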
Example bases
• Natural images: the learnt bases are “edges.”
• Handwritten characters: the learnt bases are “strokes.”
Constructing features
• Using the learnt bases b, compute features for the examples x_l from the classification task by solving a minimization over the activations that trades off reconstruction error against a sparsity penalty.
• Finally, learn a classifier using a standard supervised learning algorithm (e.g., SVM) over these features.
x_l = 0.8 * b_87 + 0.3 * b_376 + 0.5 * b_411
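The feature-construction step can be sketched as follows, assuming the sparsity penalty is an L1 norm with weight `beta`. The solver is plain ISTA (iterative soft-thresholding), swapped in for brevity; it is not the efficient algorithm the talk cites. The function names, sizes, and random bases are all hypothetical.

```python
import numpy as np

def soft_threshold(z, t):
    """Proximal operator of t * ||.||_1: shrink every entry toward zero by t."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def sparse_features(x, B, beta=0.1, n_iter=200):
    """Approximately minimise ||x - B a||^2 + beta * ||a||_1 over a, with B fixed."""
    # The gradient of ||x - B a||^2 is Lipschitz with constant 2 * sigma_max(B)^2.
    lipschitz = 2.0 * np.linalg.norm(B, 2) ** 2
    a = np.zeros(B.shape[1])
    for _ in range(n_iter):
        grad = 2.0 * B.T @ (B @ a - x)  # gradient of the reconstruction error
        a = soft_threshold(a - grad / lipschitz, beta / lipschitz)
    return a

# Toy usage with random unit-norm bases (assumed sizes: 20-dim inputs, 50 bases).
rng = np.random.default_rng(0)
B = rng.standard_normal((20, 50))
B /= np.linalg.norm(B, axis=0)
a_true = np.zeros(50)
a_true[[3, 17, 41]] = [0.8, 0.3, 0.5]
x_l = B @ a_true
features = sparse_features(x_l, B, beta=0.05)
```

The vector `features` would then replace `x_l` as the input to any standard classifier, such as the SVM mentioned on the slide.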
Image classification
[Figure: large platypus image from the Caltech101 dataset, with visualizations of its features.]
Image classification
(15 labeled images per class)
• 36.0% error reduction.
Other reported results:
• Fei-Fei et al., 2004: 16%
• Berg et al., 2005: 17%
• Holub et al., 2005: 40%
• Serre et al., 2005: 35%
• Berg et al., 2005: 48%
• Zhang et al., 2006: 59%
• Lazebnik et al., 2006: 56%
Character recognition
Domains: digits, handwritten English, English font.
• Handwritten English classification (20 labeled images per handwritten character). Bases learnt on digits.
• English font classification (20 labeled images per font character). Bases learnt on handwritten English.
• Error reductions: 8.2%, 2.8%.
Text classification
Domains: Reuters newswire, UseNet articles, webpages.
• Webpage classification (2 labeled documents per class). Bases learnt on Reuters newswire.
• UseNet classification (2 labeled documents per class). Bases learnt on Reuters newswire.
• Error reductions: 4.0%, 6.5%.
Shift-invariant sparse coding
[Figure: sparse features combined with basis functions to give a reconstruction.]
(Algorithms: Grosse et al., UAI 2007)
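The slide names only the three components. A common way to combine them, assumed here, is to convolve each basis function with its own sparse activation signal and sum the results, so a basis can appear at any time offset. The toy bases and activations below are hypothetical.

```python
import numpy as np

def reconstruct(activations, bases):
    """Sum, over bases, of each sparse activation signal convolved with its basis."""
    return sum(np.convolve(s, b, mode="full") for s, b in zip(activations, bases))

# Hypothetical toy setup: two basis functions of length 3 and
# one 10-sample activation signal per basis.
bases = [np.array([1.0, 0.0, -1.0]), np.array([0.5, 1.0, 0.5])]
activations = [np.zeros(10), np.zeros(10)]
activations[0][2] = 2.0   # basis 0 occurs once, starting at sample 2, weight 2
activations[1][6] = -1.0  # basis 1 occurs once, starting at sample 6, weight -1

signal = reconstruct(activations, bases)  # length 10 + 3 - 1 = 12
```

Each non-zero activation places one scaled copy of a basis function at that position, so the same basis can explain a pattern wherever it occurs in the signal.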
Audio classification
• Speaker identification (5 labels, TIMIT corpus, 1 sentence per speaker). Bases learnt on different dialects.
• Musical genre classification (5 labels, 18 seconds per genre). Bases learnt on different genres, songs.
• Error reductions: 8.7%, 5.7%.
(Details: Grosse et al., UAI 2007)
Sparse deep belief networks
[Figure: a sparse RBM, marked “New,” with visible layer v, hidden layer h, and parameters W, b, c.]
(Details: Lee et al., NIPS 2007. Poster yesterday.)
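As a rough sketch of what “sparse” could mean for an RBM layer, the code below computes hidden-unit activation probabilities and a penalty that pushes each hidden unit's average activation toward a small target. Several things are assumptions not confirmed by the slide: binary hidden units, `b` as the hidden bias (leaving `c` for the visible layer), the squared-gap form of the penalty, and the target value 0.02.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def hidden_probs(V, W, b):
    """P(h_j = 1 | v) for a batch of visible vectors V (one example per row)."""
    return sigmoid(V @ W + b)

def sparsity_penalty(V, W, b, target=0.02):
    """Sum over hidden units of (target - mean activation over the batch)^2."""
    mean_activation = hidden_probs(V, W, b).mean(axis=0)
    return np.sum((target - mean_activation) ** 2)

# Toy usage with assumed sizes: 5 examples, 6 visible units, 4 hidden units.
rng = np.random.default_rng(0)
V = rng.standard_normal((5, 6))
W = 0.1 * rng.standard_normal((6, 4))
b = np.zeros(4)
penalty = sparsity_penalty(V, W, b)
```

In training, a term like this would be added to the usual RBM objective so that only a few hidden units respond strongly to any given input.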
Sparse deep belief networks
• Image classification (Caltech101 dataset): 3.2% error reduction.
(Details: Lee et al., NIPS 2007. Poster yesterday.)
Summary
• Self-taught learning: unlabeled data does not share the labels of the classification task.
• Use unlabeled data to discover features.
• Use sparse coding to construct an easy-to-classify, “higher-level” representation.
[Figure: labeled cars and motorcycles, unlabeled images, and an input written as 0.8, 0.3, and 0.5 times three learnt bases.]
Related Work
• Weston et al., ICML 2006
  • Make stronger assumptions on the unlabeled data.
• Ando & Zhang, JMLR 2005
  • For natural language tasks and character recognition, use heuristics to construct a transfer learning task using unlabeled data.