Unsupervised discovery of visual object class hierarchies
Josef Sivic (INRIA / ENS), Bryan Russell (MIT), Andrew Zisserman (Oxford), Alyosha Efros (CMU) and Bill Freeman (MIT)
Levels of supervision for training object category models
• Object label + segmentation [Agarwal & Roth, Leibe & Schiele, Torralba et al., Shotton et al., Viola & Jones]
• Object label only [Csurka et al., Dorko & Schmid, Fergus et al., Opelt et al., Barnard et al.]
• None? Images only. Can we learn about objects just by looking at images?
Goal: Given a collection of unlabelled images, discover a hierarchy of visual object categories
• Which images contain the same object(s)?
• Where is the object in the image?
• Organize objects into a visual hierarchy (tree)
Review: Object discovery in the visual domain [Sivic, Russell, Efros, Freeman, Zisserman, ICCV’05]
I. Represent an image as a bag of visual words
II. Apply topic discovery methods to find objects in the corpus of images
• Hofmann: probabilistic latent semantic analysis
• Blei et al.: latent Dirichlet allocation
Decompose the image collection into objects (topics) common to all images and mixture coefficients specific to each image.
Topic discovery models: probabilistic latent semantic analysis (pLSA) [Hofmann’99]
M documents, N words per document
d … documents (images), w … visual words, z … topics (‘objects’)
‘Flat’ topic structure – all topics are ‘available’ to all documents.
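The flat pLSA decomposition can be sketched in a few lines of NumPy. This is a minimal EM implementation, not the authors' code; the function and variable names are illustrative:

```python
import numpy as np

def plsa(counts, n_topics, n_iter=100, seed=0):
    """Minimal pLSA via EM on a (documents x words) count matrix.

    Decomposes the corpus into topic-word distributions P(w|z)
    (objects shared by all images) and document-topic mixing
    weights P(z|d) (specific to each image).
    """
    rng = np.random.default_rng(seed)
    n_docs, n_words = counts.shape
    p_w_z = rng.random((n_topics, n_words))
    p_w_z /= p_w_z.sum(1, keepdims=True)
    p_z_d = rng.random((n_docs, n_topics))
    p_z_d /= p_z_d.sum(1, keepdims=True)
    for _ in range(n_iter):
        # E-step: responsibilities P(z | d, w), shape (docs, topics, words)
        joint = p_z_d[:, :, None] * p_w_z[None, :, :]
        resp = joint / joint.sum(1, keepdims=True).clip(1e-12)
        # M-step: re-estimate both distributions from expected counts
        expected = counts[:, None, :] * resp
        p_w_z = expected.sum(0)
        p_w_z /= p_w_z.sum(1, keepdims=True).clip(1e-12)
        p_z_d = expected.sum(2)
        p_z_d /= p_z_d.sum(1, keepdims=True).clip(1e-12)
    return p_w_z, p_z_d
```

On a toy corpus where two groups of documents use disjoint word ranges, the recovered P(z|d) separates the groups into different topics.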
Hierarchical topic models [Hofmann’99, Blei et al.’04, Barnard et al.’01]
z … levels, c … paths
• Topics are organized in a tree
• A document is a superposition of topics along a single path
• Topics at internal nodes are shared by two or more paths
• The hope is that more specialized topics emerge as we descend the tree
Hierarchical topic models [Hofmann’99, Blei et al.’04, Barnard et al.’01]
d … documents (images), w … words, z … levels of the tree, c … paths in the tree
For each document, introduce a hidden variable c indicating the path in the tree.
Hierarchical latent Dirichlet allocation (hLDA) [Blei et al.’04]
d … documents (images), w … words, z … levels of the tree, c … paths in the tree
Treat P(z|d) and P(w|z,c) as random variables sampled from Dirichlet priors.
Hierarchical latent Dirichlet allocation (hLDA) [Blei et al.’04]
The tree structure is not fixed: assignments of documents to paths, c_j, are sampled from the nested Chinese restaurant process (nCRP) prior.
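Concretely, the two Dirichlet draws can be sampled as below. This is a sketch: the hyperparameter values alpha and eta are illustrative, not taken from the paper; the depth and vocabulary size match the experiments later in the talk.

```python
import numpy as np

rng = np.random.default_rng(0)

L = 4        # tree depth (levels), as in the 4-level hierarchy used later
V = 3546     # combined vocabulary size used in the experiments

# Per-document distribution over levels: P(z | d) ~ Dirichlet(alpha)
alpha = 1.0                       # illustrative hyperparameter
theta_d = rng.dirichlet([alpha] * L)

# Per-node word distribution: P(w | z, c) ~ Dirichlet(eta), one draw per tree node
eta = 0.1                         # illustrative hyperparameter
phi_node = rng.dirichlet([eta] * V)
```

A small eta encourages sparse, specialized word distributions at each node.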
Nested Chinese restaurant process (nCRP) [Blei et al.’04]
• CRP: customers sit in a restaurant with an unlimited number of tables; each arriving customer joins an occupied table with probability proportional to its occupancy, or opens a new table with probability proportional to the concentration parameter
• Nested CRP: extension of the CRP to tree structures – a prior on assignments of documents to paths in a tree of fixed depth L
• Each internal node corresponds to a CRP; each table points to a child node
Example (figure): a tree of depth 3 built by 4 documents; a path for the 5th document is sampled.
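The nCRP path sampling described above can be sketched as follows, assuming a simple dict-based tree representation (all names are illustrative):

```python
import random

def sample_ncrp_path(tree, depth, gamma=1.0, rng=random):
    """Sample a root-to-leaf path of the given depth from the nested CRP.

    `tree` maps a path prefix (tuple of child indices) to the list of
    customer counts of its existing children. At each level the next
    child is drawn by the CRP rule: an existing child k with n_k
    customers is chosen with probability n_k / (n + gamma), a brand
    new child with probability gamma / (n + gamma).
    """
    path = ()
    for _ in range(depth):
        counts = tree.setdefault(path, [])
        total = sum(counts) + gamma
        r = rng.random() * total
        for k, n_k in enumerate(counts):
            r -= n_k
            if r <= 0:
                counts[k] += 1              # join an existing table
                path = path + (k,)
                break
        else:
            counts.append(1)                # open a new table / child node
            path = path + (len(counts) - 1,)
    return path
```

Sampling several documents through the same `tree` dict grows a shared hierarchy, exactly as in the depth-3, 4-document example above.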
hLDA model fitting
Use a Gibbs sampler to generate samples from P(z, c, T | w). For a given document j:
• sample z_j while keeping c_j fixed (LDA along one path)
• sample c_j while keeping z_j fixed (can delete/create branches)
Image representation – ‘dense’ visual words
• Extract circular regions on a regular grid, at multiple scales
• Represent each region by a SIFT descriptor
Cf. [Agarwal and Triggs’05, Bosch and Zisserman’06]
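A minimal sketch of the dense sampling step, generating the grid of circular regions to be described; the step and scale values are assumptions, not the paper's settings, and the SIFT computation itself is omitted:

```python
import numpy as np

def dense_keypoints(h, w, step=8, scales=(4, 8, 12)):
    """Centers and radii of circular regions on a regular grid,
    at multiple scales, for dense feature extraction.

    Each (x, y, r) triple would then be described by a SIFT
    descriptor (not implemented here).
    """
    xs = np.arange(step // 2, w, step)
    ys = np.arange(step // 2, h, step)
    return [(x, y, r) for r in scales for y in ys for x in xs]
```

For an image of height h and width w this yields roughly (h/step) * (w/step) regions per scale.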
Build a visual vocabulary
• Quantize the descriptors using k-means (K = 10 + 1 and K = 100 + 1)
• Visualization by ‘average’ words from the training set (single scale)
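The vocabulary-building step amounts to running k-means on the descriptors and assigning each descriptor to its nearest center. A minimal sketch using plain Lloyd's iterations (naive first-k initialization; a real pipeline would use a better init):

```python
import numpy as np

def build_vocabulary(descriptors, k, n_iter=20):
    """Cluster descriptors into k visual words with plain Lloyd's
    k-means (a stand-in for the paper's k-means step).
    Naive initialization: the first k descriptors."""
    centers = descriptors[:k].astype(float)
    for _ in range(n_iter):
        d2 = ((descriptors[:, None, :] - centers[None]) ** 2).sum(-1)
        labels = d2.argmin(1)
        for j in range(k):
            members = descriptors[labels == j]
            if len(members):
                centers[j] = members.mean(0)
    return centers

def quantize(descriptors, centers):
    """Map each descriptor to the index of its nearest visual word."""
    d2 = ((descriptors[:, None, :] - centers[None]) ** 2).sum(-1)
    return d2.argmin(1)
```

In practice the descriptors are 128-D SIFT vectors rather than the 2-D toy points used for testing.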
Vocabulary with varying degrees of spatial and appearance granularity
• Appearance granularity: bag of words V1 (K1 = 11) and bag of words V2 (K2 = 101)
• Spatial granularity: V3 – 3x3 grid with K3 = 101 words per cell (909 words); V4 – 5x5 grid with K4 = 101 words per cell (2,525 words)
• Combined vocabulary: K = 11 + 101 + 909 + 2,525 = 3,546 visual words
Cf. Fergus et al.’05, Lazebnik et al.’06
Example I. – cropped LabelMe images
• 125 images, 5 object classes: cars side, cars rear, switches, traffic lights, computer screens
• Images cropped to contain mostly the object, and normalized for scale
Example I. – cropped LabelMe images
• Learn a 4-level tree hierarchy over the vocabularies V1 (K1 = 11), V2 (K2 = 101), V3 (3x3 grid, K3 = 101), V4 (5x5 grid, K4 = 101)
• Initialization: c with a random tree over the 125 documents, sampled from the nCRP (= 1); z based on vocabulary granularity
Example I. – cropped LabelMe images: learnt object hierarchy
• Nodes visualized by average images
• Example images assigned to different paths
Quality of the tree?
For each node t and class i, measure the classification score
score(i, t) = |A_t ∩ G_i| / |A_t ∪ G_i|
where A_t is the set of images assigned to a path passing through t, and G_i is the set of ground-truth images of class i (intersection over union).
A good score requires:
• All images of class i assigned to node t (high recall)
• No images of other classes assigned to t (high precision)
Examples: traffic lights at node 2; switches at node 9.
Overall score: combine the per-class scores over all classes.
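The intersection-over-union score above can be written directly as a set computation (function name is illustrative):

```python
def node_score(path_images, class_images):
    """Classification score for a tree node and an object class:
    intersection over union of A_t (images assigned to any path
    through node t) and G_i (ground-truth images of class i).

    High recall (all class images reach the node) and high precision
    (few images of other classes) both push the score toward 1.
    """
    a, g = set(path_images), set(class_images)
    union = a | g
    return len(a & g) / len(union) if union else 0.0
```

For example, a node collecting images {1, 2, 3} scored against ground truth {2, 3, 4} gives 2/4 = 0.5.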
Example II. – MSRC b1 dataset
240 images, 9 object classes, pixel-wise labelled: cars, airplanes, cows, buildings, faces, grass, trees, bicycles, sky
Example II. – MSRC b1 dataset
Experiment 1: known object mask (manual), unknown class labels
• More objects and images than in Example I
• Measure classification performance
• Compare with the standard ‘flat’ LDA
Experiment 2: both segmentation and class labels unknown (just images)
• ‘Unsupervised discovery’ scenario
• Employ the ‘multiple segmentations’ framework of [Russell et al.’06]
• Measure segmentation accuracy
MSRC b1 dataset – known object mask
Learnt tree visualized by average images; node size indicates the number of images. Some nodes visualized by their top 3 images (sorted by KL divergence).
MSRC b1 dataset – known object mask
Classification performance: comparison with ‘flat’ LDA.
Flat LDA baseline: estimate the mixing weights P(z = i | d) for each topic i, then assign each image to a single topic, i* = argmax_i P(z = i | d).
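The flat-LDA baseline assignment is a one-liner over the estimated mixing weights (function name is illustrative):

```python
import numpy as np

def flat_lda_classify(p_z_d):
    """'Flat' LDA baseline: given per-image mixing weights P(z | d)
    as a (documents x topics) array, assign each image to its single
    most probable topic."""
    return np.asarray(p_z_d).argmax(axis=1)
```

Unlike the hierarchy, this baseline cannot share structure between related topics, which is the comparison the slide makes.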
Multiple segmentation approach [Russell et al.’06] (review)
1) Produce multiple segmentations of each image
2) Discover clusters of similar segments
3) Score segments by how well they fit an object cluster (here using hLDA)
Example discovered clusters: cars, buildings.
Conclusions
• Investigated learning visual object hierarchies using hLDA
• The number of topics/objects and the structure of the tree are estimated automatically from the data
• A topic/object hierarchy may improve classification performance