Explore the importance of unsupervised learning in Pattern Recognition and Machine Learning for future advancements, including applications like transfer learning and zero-shot learning.
Pattern Recognition and Machine Learning: Deep Alternative Architectures
Dipartimento di Ingegneria «Enzo Ferrari», Università di Modena e Reggio Emilia
Motivation
• Most impressive results in deep learning have been obtained with purely supervised learning methods (see previous talk)
• In vision, typically classification (e.g. object recognition)
• Though progress has been slower, it is likely that unsupervised learning will be important to future advances in DL
Image: Krizhevsky (2012) - AlexNet, the "hammer" of DL
Why Unsupervised Learning?
Reason 1: We can exploit unlabelled data, which is much more readily available and often free.
Why Unsupervised Learning?
Reason 2: We can capture enough information about the observed variables to ask new questions about them; questions that were not anticipated at training time.
Image: features from layers 1-5 of a convolutional net (Zeiler and Fergus, 2013)
Why Unsupervised Learning?
Reason 3: Unsupervised learning has been shown to be a good regularizer for supervised learning; it helps generalize.
This advantage shows up in practical applications:
• transfer learning, domain adaptation
• unbalanced classes
• zero-shot, one-shot learning
Image: ISOMAP embedding of functions represented by 50 networks with and without pre-training (Erhan et al., 2010)
Why Unsupervised Learning?
Reason 4: There is evidence that unsupervised learning can be achieved mainly through a level-local training signal; compare this to supervised learning, where the only signal driving parameter updates is available at the output and gets backpropagated.
Diagram: propagating credit in supervised learning vs. local learning
Why Unsupervised Learning?
Reason 5: A recent trend in machine learning is to consider problems where the output is high-dimensional and has a complex, possibly multi-modal joint distribution. Unsupervised learning can be used in these "structured output" problems.
Examples: attribute prediction (animal, pet, furry, striped, ...), segmentation
Learning Representations
• "Concepts" or "abstractions" that help us make sense of the variability in data
• Often hand-designed to have desirable properties: e.g. sensitive to the variables we want to predict, less sensitive to other factors explaining variability
• DL has leveraged the ability to learn representations
- these can be task-specific or task-agnostic
Supervised Learning of Representations
• Learn a representation with the objective of selecting the one that is best suited for predicting targets given the input
Diagram: input → f() → prediction, compared against the target to produce an error signal
Image: features from a convolutional net (Zeiler and Fergus, 2013): (a) input image, (b) layer 5 strongest feature map, (c) layer 5 strongest feature map projections
Unsupervised Learning of Representations
Diagram: input → f() → prediction → error, but measured against what?
Unsupervised learning of representations
What is the objective?
- reconstruction error? (input → code → reconstruction)
- maximum likelihood?
- disentangling factors of variation? e.g. learning to map input images to separate identity-manifold and pose-manifold coordinates (fixed ID vs. fixed pose)
Image: Lee et al. 2014
Principal Components Analysis
• PCA works well when the data lies near a linear manifold in high-dimensional space
• Project the data onto the subspace spanned by the principal components (a minimal sketch follows below)
- the direction of the first principal component is the direction of greatest variance
• In dimensions orthogonal to the subspace, the data has low variance
Credit: Geoff Hinton
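To make the projection step concrete, here is a minimal PCA sketch in NumPy. It is my own illustrative code, not part of the original deck; the synthetic data, dimensions, and function name are arbitrary choices.

```python
# A minimal PCA sketch using NumPy (illustrative only).
import numpy as np

def pca(X, n_components):
    """Project data onto the subspace spanned by the top principal components.

    X: array of shape (n_samples, n_features).
    Returns the projected coordinates and the principal directions.
    """
    X_centered = X - X.mean(axis=0)                      # remove the mean
    # SVD of the centered data gives the principal directions as rows of Vt.
    U, S, Vt = np.linalg.svd(X_centered, full_matrices=False)
    components = Vt[:n_components]                       # directions of greatest variance
    projected = X_centered @ components.T                # coordinates in the subspace
    return projected, components

# Example: 500 points near a 2-D linear manifold embedded in 10-D space.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 2)) @ rng.normal(size=(2, 10)) + 0.01 * rng.normal(size=(500, 10))
Z, W = pca(X, n_components=2)
print(Z.shape, W.shape)   # (500, 2) (2, 10)
```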
An inefficient way to fit PCA
• Train a neural network with a "bottleneck" hidden layer: input → code (bottleneck) → output (reconstruction); see the sketch below
• If the hidden and output layers are linear, and we minimize squared reconstruction error:
- try to make the output the same as the input
- the M hidden units will span the same space as the first M principal components
- but their weight vectors will not be orthogonal
- and they will have approximately equal variance
Credit: Geoff Hinton
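As a hedged illustration of this "inefficient PCA", the following PyTorch sketch trains a purely linear autoencoder with an M-unit bottleneck under squared reconstruction error. The sizes, optimizer, and synthetic data are my own assumptions.

```python
# A linear autoencoder with an M-unit bottleneck, trained with squared error.
import torch
import torch.nn as nn

D, M = 10, 2                       # input dimension, bottleneck size (arbitrary)
model = nn.Sequential(
    nn.Linear(D, M, bias=False),   # linear "encoder" to the bottleneck code
    nn.Linear(M, D, bias=False),   # linear "decoder" back to the input space
)
opt = torch.optim.Adam(model.parameters(), lr=1e-2)

X = torch.randn(500, 2) @ torch.randn(2, D)   # data near a 2-D linear manifold

for step in range(2000):
    recon = model(X)
    loss = ((recon - X) ** 2).mean()          # squared reconstruction error
    opt.zero_grad()
    loss.backward()
    opt.step()

# The M hidden units now span (approximately) the first M principal components,
# but their weight vectors are generally not orthogonal.
```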
Why fit PCA inefficiently?
Architecture: input → encoder → code h(x) → decoder → reconstruction x̂(h(x)) → error
• With nonlinear layers before and after the code, it should be possible to represent data that lies on or near a nonlinear manifold
- the encoder maps from data space to coordinates on the manifold
- the decoder does the inverse transformation
• The encoder/decoder can be rich, multi-layer functions
Auto-encoder
Architecture: input → encoder → code h(x) → decoder → reconstruction x̂(h(x)) → error
• Feed-forward architecture (see the sketch below)
• Trained to minimize reconstruction error
- a bottleneck or regularization is essential
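A minimal nonlinear auto-encoder sketch, assuming PyTorch and arbitrary layer sizes; it is meant only to show the encoder/decoder split and the reconstruction-error objective described above, not any specific model from the tutorial.

```python
# Nonlinear auto-encoder: encoder h(x), decoder x̂(h(x)), reconstruction loss.
import torch
import torch.nn as nn

class AutoEncoder(nn.Module):
    def __init__(self, d_in=784, d_code=32):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(d_in, 256), nn.ReLU(),
                                     nn.Linear(256, d_code))
        self.decoder = nn.Sequential(nn.Linear(d_code, 256), nn.ReLU(),
                                     nn.Linear(256, d_in))

    def forward(self, x):
        h = self.encoder(x)          # code: coordinates on (or near) the manifold
        return self.decoder(h)       # reconstruction x̂(h(x))

model = AutoEncoder()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)

x = torch.rand(64, 784)              # a stand-in batch; real data would go here
recon = model(x)
loss = nn.functional.mse_loss(recon, x)   # reconstruction error
opt.zero_grad()
loss.backward()
opt.step()
```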
Regularized Auto-encoders
Architecture: input → encoder → code h(x) → decoder → reconstruction x̂(h(x)) → error
• Permit the code to be higher-dimensional than the input
• Capture the structure of the training distribution due to the predictive opposition between the reconstruction distribution and the regularizer
• The regularizer tries to make the encoder/decoder as simple as possible
Simple?
• Reconstruct the input from the code and make the code compact (PCA, auto-encoder with bottleneck)
• Reconstruct the input from the code and make the code sparse (sparse auto-encoders)
• Add noise to the input or code and reconstruct the cleaned-up version (denoising auto-encoders)
• Reconstruct the input from the code and make the code insensitive to small changes in the input (contractive auto-encoders)
Sparse Auto-encoders
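The slide gives no formula, so the snippet below shows one common (assumed) formulation of a sparse auto-encoder: an over-complete code regularized by an L1 penalty on its activations. The architecture and penalty weight are my own choices.

```python
# Sparse auto-encoder sketch: reconstruction loss + L1 penalty on the code.
import torch
import torch.nn as nn

encoder = nn.Sequential(nn.Linear(784, 512), nn.ReLU())   # over-complete code
decoder = nn.Linear(512, 784)
opt = torch.optim.Adam(list(encoder.parameters()) + list(decoder.parameters()), lr=1e-3)
sparsity_weight = 1e-3                                     # hypothetical setting

x = torch.rand(64, 784)
h = encoder(x)                                             # code
recon = decoder(h)
loss = nn.functional.mse_loss(recon, x) + sparsity_weight * h.abs().mean()
opt.zero_grad()
loss.backward()
opt.step()
```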
Deconvolutional Networks
• Deep convolutional sparse coding
• Trained to reconstruct the input from any layer
• Fast approximate inference
• Recently used to visualize features learned by convolutional nets (Zeiler and Fergus 2013)
Image: features from layers 1-4
Denoising Auto-encoders (Vincent et al. 2008)
Architecture: input → noise → noisy input x̃(x) → encoder → code h(x̃) → decoder → reconstruction x̂(h(x̃)) → error (see the sketch below)
• The code can be viewed as a lossy compression of the input
• Learning drives it to be a good compressor for training examples (and hopefully others as well) but not for arbitrary inputs
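A sketch of the denoising objective, assuming additive Gaussian corruption and an arbitrary small architecture; the key point is that the reconstruction error is measured against the clean input, not the corrupted one.

```python
# Denoising auto-encoder sketch: corrupt the input, reconstruct the clean version.
import torch
import torch.nn as nn

encoder = nn.Sequential(nn.Linear(784, 256), nn.ReLU())
decoder = nn.Linear(256, 784)
opt = torch.optim.Adam(list(encoder.parameters()) + list(decoder.parameters()), lr=1e-3)

x = torch.rand(64, 784)                     # clean input
x_noisy = x + 0.3 * torch.randn_like(x)     # corrupted input x̃(x); noise level assumed
recon = decoder(encoder(x_noisy))           # x̂(h(x̃))
loss = nn.functional.mse_loss(recon, x)     # error measured against the CLEAN input
opt.zero_grad()
loss.backward()
opt.step()
```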
Contractive Auto-encoders (Rifai et al. 2011)
Architecture: input → encoder → code h(x) → decoder → reconstruction x̂(h(x)) → error (see the sketch below)
• Learn good models of high-dimensional data (Bengio et al. 2013)
• Can obtain good representations for classification
• Can produce good-quality samples by a random walk near the manifold of high density (Rifai et al. 2012)
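A sketch of the contractive penalty for a single sigmoid encoder layer, where the Jacobian of the code with respect to the input can be written in closed form; the penalty weight and layer sizes are my assumptions.

```python
# Contractive auto-encoder sketch: penalize the squared Frobenius norm of ∂h/∂x.
import torch
import torch.nn as nn

W = nn.Parameter(torch.randn(256, 784) * 0.01)   # encoder weights
b = nn.Parameter(torch.zeros(256))               # encoder bias
decoder = nn.Linear(256, 784)
opt = torch.optim.Adam([W, b] + list(decoder.parameters()), lr=1e-3)
lam = 0.1                                        # penalty weight (assumed)

x = torch.rand(64, 784)
h = torch.sigmoid(x @ W.t() + b)                 # code
recon = decoder(h)

# For a sigmoid unit, dh_j/dx = h_j(1 - h_j) * W_j, so the squared Jacobian
# Frobenius norm is sum_j (h_j(1 - h_j))^2 * ||W_j||^2.
jac_penalty = (((h * (1 - h)) ** 2) @ (W ** 2).sum(dim=1)).mean()
loss = nn.functional.mse_loss(recon, x) + lam * jac_penalty
opt.zero_grad()
loss.backward()
opt.step()
```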
Resources
Online courses
- Andrew Ng's Machine Learning (Coursera)
- Geoff Hinton's Neural Networks (Coursera)
Websites
- deeplearning.net
- http://deeplearning.stanford.edu/wiki/index.php/UFLDL_Tutorial
Surveys and Reviews
• Y. Bengio, A. Courville, and P. Vincent. Representation learning: A review and new perspectives. IEEE Transactions on Pattern Analysis and Machine Intelligence, 35(8):1798-1828, Aug 2013.
• Y. Bengio. Deep learning of representations: Looking forward. In Statistical Language and Speech Processing, pages 1-37. Springer, 2013.
• Y. Bengio, I. Goodfellow, and A. Courville. Deep Learning. 2014. Draft available at http://www.iro.umontreal.ca/~bengioy/dlbook/
• J. Schmidhuber. Deep learning in neural networks: An overview. arXiv preprint arXiv:1404.7828, 2014.
• Y. Bengio. Learning deep architectures for AI. Foundations and Trends in Machine Learning, 2(1):1-127, 2009.
Sequence modelling
• When applying machine learning to sequences, we often want to turn an input sequence into an output sequence that lives in a different domain.
- e.g. turn a sequence of sound pressures into a sequence of word identities.
• When there is no separate target sequence, we can get a teaching signal by trying to predict the next term in the input sequence (illustrated in the sketch below).
- The target output sequence is the input sequence with an advance of 1 step.
- This seems much more natural than trying to predict one pixel in an image from the other pixels, or one patch of an image from the rest of the image.
• For temporal sequences there is a natural order for the predictions.
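The "advance of 1 step" target construction is easy to show directly; the sequence values below are arbitrary.

```python
# Next-step prediction: the targets are the inputs shifted forward by one step.
import torch

seq = torch.tensor([3, 1, 4, 1, 5, 9, 2, 6])   # any 1-D sequence
inputs = seq[:-1]     # x_1 ... x_{T-1}
targets = seq[1:]     # x_2 ... x_T  (input advanced by 1 step)
print(inputs.tolist())    # [3, 1, 4, 1, 5, 9, 2]
print(targets.tolist())   # [1, 4, 1, 5, 9, 2, 6]
```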
Memoryless models for sequences
• Autoregressive models: predict the next term from a fixed number of previous terms
• Feed-forward networks: generalize autoregressive models by adding one or more layers of hidden units
Memory and Hidden State
• If we give our generative model some hidden state, and if we give this hidden state its own internal dynamics, we get a much more interesting kind of model.
- It can store information in its hidden state for a long time.
- If the dynamics is noisy and the way it generates outputs from its hidden state is noisy, we can never know its exact hidden state.
• The best we can do is to infer a probability distribution over the space of hidden state vectors.
RNN
RNNs are very powerful, because they combine two properties:
• Distributed hidden state that allows them to store a lot of information about the past efficiently.
• Non-linear dynamics that allow them to update their hidden state in complicated ways.
With enough neurons and time, RNNs can compute anything that can be computed by your computer. (A minimal update rule is sketched below.)
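A minimal sketch of the recurrence behind these properties, written in PyTorch: the same weights are reused at every time step, and a tanh nonlinearity updates the hidden state (an Elman-style recurrence, matching the next slide). Sizes and the input sequence are my own choices.

```python
# Minimal RNN cell: the hidden state is a nonlinear function of the previous
# state and the current input, with weights shared across time steps.
import torch
import torch.nn as nn

d_in, d_hidden, d_out = 8, 32, 8
Wxh = nn.Linear(d_in, d_hidden)       # input -> hidden
Whh = nn.Linear(d_hidden, d_hidden)   # hidden -> hidden (shared at every step)
Why = nn.Linear(d_hidden, d_out)      # hidden -> output

x = torch.randn(10, d_in)             # a sequence of 10 input vectors
h = torch.zeros(d_hidden)             # initial hidden state
outputs = []
for t in range(x.shape[0]):
    h = torch.tanh(Wxh(x[t]) + Whh(h))    # non-linear dynamics on the hidden state
    outputs.append(Why(h))                # output produced from the hidden state
```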
RNN Structure and Weight Sharing
Diagrams: Jordan network (the output is fed back as context) vs. Elman network (the hidden state is fed back as context)