(Infinitely) Deep Learning in Vision Max Welling (UCI) collaborators: Ian Porteous (UCI), Evgeniy Bart (UCI/Caltech), Pietro Perona (Caltech)
Outline • Nonparametric Bayesian Taxonomy models for object categorization • Hierarchical representations from networks of HDPs
Motivation
• Building systems that learn for a lifetime, from "construction to destruction"
• E.g. unsupervised learning of object category taxonomies (with E. Bart, I. Porteous and P. Perona)
• Hierarchical models can help to:
  • act as a prior to transfer information to new categories
  • enable fast recognition
  • classify at the appropriate level of abstraction (Fido → dog → mammal)
  • define a similarity measure (kernel)
• The nonparametric Bayesian framework allows models to grow their complexity without bound (with growing dataset size)
Nonparametric Model for Visual Taxonomy
• Each image/scene is assigned a path through a taxonomy tree; each node on the path carries a topic, i.e. a word distribution for topic k over visual word detections.
• The prior over trees is the nested CRP (Blei et al. '04): a path is more popular if it has been traveled a lot.
[Figure: taxonomy tree with topics 1, 2, ..., k at the nodes; an image's visual word detections are generated along its path with proportions such as 0.7 / 0.26 / 0.04.]
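To make the nested-CRP prior concrete, here is a minimal Python sketch, not the authors' code: the tree is assumed to be a nested dict of branch counts, and `gamma` is a hypothetical concentration parameter controlling how often a new branch is opened. It illustrates the "more traveled paths are more likely" behaviour stated above.

```python
import numpy as np

def ncrp_sample_path(tree, depth, gamma=1.0, rng=np.random):
    """Sample one root-to-leaf path of length `depth` from a nested CRP prior.

    `tree` is a nested dict: child_id -> [count, subtree]. Branches that
    earlier images traveled have higher counts and are therefore more likely
    to be chosen again; with weight `gamma` a brand-new branch is opened.
    """
    path, node = [], tree
    for _ in range(depth):
        children = list(node.keys())
        weights = np.array([node[c][0] for c in children] + [gamma], dtype=float)
        choice = rng.choice(len(weights), p=weights / weights.sum())
        if choice == len(children):              # open a new branch
            child = (max(children) + 1) if children else 0
            node[child] = [0, {}]
        else:
            child = children[choice]
        node[child][0] += 1                      # this path is now more traveled
        path.append(child)
        node = node[child][1]
    return path

# Example: three images sharing (parts of) a 3-level taxonomy.
tree = {}
paths = [ncrp_sample_path(tree, depth=3) for _ in range(3)]
```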
300 images from the Corel database (experiments and figures by E. Bart).
Beyond Trees? • Deep belief nets are more powerful alternatives to taxonomies (in a modeling sense). • Nodes in the hierarchy represent overlapping and increasingly abstract categories • More sharing of statistical strength • Proposal: stack LDA models
LDA (Blei, Ng, Jordan '02)
• w_ij (observed): token i in image j was assigned to visual word w.
• z_ij (hidden): token i in image j was assigned to topic k.
• θ_j: image-specific distribution over topics.
• φ_k: topic-specific distribution over words.
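As a reference point for this notation, a minimal generative sketch of LDA in Python (the function name and the hyperparameters `alpha`, `beta` are illustrative choices, not taken from the slides); images play the role of documents and visual words the role of tokens.

```python
import numpy as np

def lda_generate(n_images, n_tokens, n_topics, vocab_size,
                 alpha=0.5, beta=0.1, rng=np.random):
    """Toy LDA generative process: phi[k] is the topic-specific word
    distribution, theta_j the image-specific topic distribution, z the hidden
    topic of each token and w the observed visual word."""
    phi = rng.dirichlet(beta * np.ones(vocab_size), size=n_topics)
    corpus = []
    for _ in range(n_images):
        theta_j = rng.dirichlet(alpha * np.ones(n_topics))
        z = rng.choice(n_topics, size=n_tokens, p=theta_j)     # hidden topics
        w = np.array([rng.choice(vocab_size, p=phi[k]) for k in z])  # observed words
        corpus.append((z, w))
    return phi, corpus

phi, corpus = lda_generate(n_images=5, n_tokens=50, n_topics=3, vocab_size=20)
```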
Stage-wise LDA
• Use Z1, the topic assignments of the first LDA layer, as pseudo-data for the next layer.
• After the second LDA model is fit, we have two distributions over Z1.
• We combine these distributions by taking their mixture.
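A hedged sketch of the stage-wise stacking idea: fit one LDA layer, then feed its topic assignments Z1 to the next layer as pseudo-data. `stack_lda` and `fit_lda` are placeholder names for illustration (any LDA inference routine could be plugged in), and the sketch omits the mixture-combination step mentioned above.

```python
def stack_lda(words, topics_per_layer, fit_lda):
    """Fit a stack of LDA layers: the topic assignments Z of one layer become
    the pseudo-data (the "words") of the layer above.

    `fit_lda(data, n_topics)` stands in for any LDA inference routine
    (e.g. collapsed Gibbs), returning (model, z) with z[j][i] the topic
    assigned to token i in image j.
    """
    layers, data = [], words        # layer 0 input: observed visual words
    for n_topics in topics_per_layer:
        model, z = fit_lda(data, n_topics)
        layers.append(model)
        data = z                    # Z of this layer is pseudo-data for the next
    return layers
```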
Special Words Layer
• At the bottom layer we have an image-specific distribution over words.
• It filters out image idiosyncrasies that are not modeled well by topics.
• Special-words topic model (Chemudugunta, Steyvers, Smyth '06).
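A minimal sketch of the special-words mechanism, under assumed names (`generate_token`, `lam_j`, `pi_special` are illustrative, and the original special-words model also includes a background distribution, which this two-way simplification drops): per token, a switch decides whether the word comes from the topic mixture or from an image-specific distribution.

```python
import numpy as np

def generate_token(theta_j, phi, lam_j, pi_special=0.2, rng=np.random):
    """Draw one visual word for image j: with probability `pi_special` it
    comes from the image-specific distribution lam_j (absorbing image
    idiosyncrasies), otherwise from the topic mixture theta_j / phi."""
    if rng.random() < pi_special:
        return rng.choice(len(lam_j), p=lam_j)    # image-specific "special" word
    k = rng.choice(len(theta_j), p=theta_j)       # pick a topic
    return rng.choice(phi.shape[1], p=phi[k])     # emit a word from that topic
```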
Model
• At every level a switching variable picks either that level's own distribution or defers to the level above.
• The lowest level at which the switch picks its own level disconnects the upstream variables.
[Figure: the stacked model, annotated with the last layer that has any data assigned to it, and with a level at which a switching variable has fired so that all layers above are disconnected.]
Collapsed Gibbs Sampling
• Marginalize out the parameters (θ, φ).
• Given X, perform an upward pass to compute posterior probabilities for each level.
• Sample a level.
• From that level, sample all downstream Z-variables (ignore upstream Z-variables).
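The control flow of one sampling step can be sketched as follows. The `counts` object and its accessors `level_weight` / `topic_weights` are hypothetical stand-ins for the collapsed count statistics; the point is the structure: an upward pass over levels, sampling one level, then resampling only the Z-variables downstream of it.

```python
import numpy as np

def resample_token(j, i, counts, n_levels, rng=np.random):
    """One collapsed-Gibbs step for token i of image j (hypothetical interface)."""
    # Upward pass: unnormalized posterior weight of switching at each level.
    w = np.array([counts.level_weight(j, i, l) for l in range(n_levels)], dtype=float)
    level = rng.choice(n_levels, p=w / w.sum())

    # Downward pass: resample Z at the chosen level and every level below it;
    # Z-variables upstream of `level` are disconnected and left untouched.
    z = {}
    for l in range(level, -1, -1):
        p = np.asarray(counts.topic_weights(j, i, l), dtype=float)
        z[l] = rng.choice(len(p), p=p / p.sum())
    return level, z
```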
The Digits ... (I deeply believe in) All experiments done by I. Porteous (and finished 2 hours ago).
This level filters out image idiosyncrasies. No information from this level is "transferred" to test data.
[Figures: level-1 topic distributions and level-2 topic distributions.]
Assignment to Levels [Figure: brightness = average level assignment.]
Properties
• Properties that are specific to an image/document are explained at the lower levels of the hierarchy; they act as data filters for the higher layers.
• Higher levels become increasingly abstract, with larger "receptive fields" and higher variance (complex cell property). Limitation?
• Higher levels therefore "own" less data, and hence have larger plasticity.
• The more data, the more levels become populated; we infer the number of layers.
• By marginalizing out the parameters (θ, φ), all variables become coupled.
Conclusion
• Nonparametric Bayesian models are good candidates for "lifelong learning"
  • need to improve computational efficiency & memory requirements
• Algorithm for growing object taxonomies as a function of observed data
• Proposal for a deep belief net based on stacking LDA modules
  • more flexible representation & more sharing of statistical strength than a taxonomy
• Infinite extension:
  • LDA → HDP
  • mixture over levels → Dirichlet process
  • number of hidden variables per layer and number of layers inferred
• demo?