Finding Things: Image Parsing with Regions and Per-Exemplar Detectors

Finding Things: Image Parsing with Regions and Per-Exemplar Detectors (CVPR 2013, Oral)

Outline • Introduction • Approach • Experiments • Conclusions

Introduction • Image parsing is the problem of labeling each pixel in an image with its semantic category. • Our goal is broad coverage: the ability to recognize hundreds or thousands of object classes that commonly occur in everyday street scenes and indoor environments.

Introduction • A major challenge is the non-uniform statistics of these classes in realistic scene images. • Two main categories: • Stuff: a small number of classes, mainly ones associated with large regions, such as road, sky, trees, and buildings. • Things: classes such as people, cars, dogs, mailboxes, vases, and stop signs, which occupy a small percentage of image pixels and have relatively few instances each.

Introduction • We present an image parsing system that integrates region-based cues with the promising novel framework of per-exemplar detectors. • It is the first to transfer masks using per-exemplar detectors. • It outputs a dense many-category labeling.

Approach • Region-Based Parsing • Detector-Based Parsing • SVM Combination and MRF Smoothing

Approach

Parsing pipeline • Obtain a retrieval set of globally similar training images. • Compute the region-based data term (E_R) using the SuperParsing system. • Compute the detector-based data term (E_D): • Run per-exemplar detectors for the exemplars in the retrieval set. • Transfer masks from all detections above a set detection threshold to the test image. • Compute the detector data term as the sum of these masks, scaled by their detection scores. • Combine the two data terms by training an SVM on the concatenation of E_D and E_R. • Smooth the SVM output (E_SVM) using an MRF.

Region-Based Parsing [27] J. Tighe and S. Lazebnik. SuperParsing: Scalable nonparametric image parsing with superpixels. IJCV, 101(2):329–349, Jan 2013.

Region-Based Parsing • Find a retrieval set of images similar to the query image. • Segment the query image into superpixels and compute feature vectors for each superpixel. • For each superpixel and each feature type, find the nearest-neighbor superpixels in the retrieval set, and compute a likelihood score for each class based on the superpixel matches. • Use the computed likelihoods together with pairwise co-occurrence energies in a Markov Random Field (MRF) framework to compute a global labeling of the image.

Region-Based Parsing • The matches are used to produce a log-likelihood ratio score for label c at region s_i. • This score defines the region-based data term E_R(p, c) for each pixel p and class c: each pixel takes the score of the region that contains it.
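The region-based data term described above can be sketched in a few lines of plain Python. This is a minimal illustration, not the authors' implementation: the function names, the list-of-lists image layout, the `eps` smoothing constant, and the choice to sum the log ratios over feature types are all assumptions made for the example.

```python
import math


def log_likelihood_ratio(p_given_c, p_given_not_c, eps=1e-9):
    """Score one region for one class.

    p_given_c[k] and p_given_not_c[k] are likelihood estimates for feature
    type k given the class and given its complement. Summing the per-feature
    log ratios is an assumption of this sketch; eps avoids log(0).
    """
    return sum(
        math.log((pc + eps) / (pn + eps))
        for pc, pn in zip(p_given_c, p_given_not_c)
    )


def region_data_term(superpixel_map, region_scores):
    """Build E_R by giving every pixel the scores of its superpixel.

    superpixel_map: 2-D list holding one region index per pixel.
    region_scores:  region_scores[i][c] is the score of class c at region i.
    Returns a 2-D list whose entries are per-class score lists.
    """
    return [[list(region_scores[r]) for r in row] for row in superpixel_map]


# Hypothetical 2x3 image with two regions and two classes.
superpixel_map = [[0, 0, 1],
                  [0, 1, 1]]
region_scores = [[1.5, -0.5],
                 [-2.0, 0.75]]
E_R = region_data_term(superpixel_map, region_scores)
```

A positive score means the region's matches favor class c over the other classes; a negative score means the opposite.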

Detector-Based Parsing [19] T. Malisiewicz, A. Gupta, and A. A. Efros. Ensemble of Exemplar-SVMs for object detection and beyond. In ICCV, 2011.

Detector-Based Parsing

Detector-Based Parsing • The detector-based data term E_D(p, c) is defined for a class c and pixel p. • It is simply the sum of all detection masks from that class, weighted by their detection scores.
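The weighted mask sum can be illustrated with a short Python sketch for a single class. The function name, the `(score, mask)` tuple layout, and the default threshold are assumptions for the example; masks are assumed to be already transferred into test-image coordinates, and each mask is weighted by its raw detection score as the slide wording suggests.

```python
def detector_data_term(height, width, detections, threshold=0.0):
    """Accumulate score-weighted detection masks for one class.

    detections: list of (score, mask) pairs, where mask is a height x width
                2-D list of 0/1 values already aligned with the test image.
    threshold:  detections scoring at or below this value are ignored.
    Returns E_D for this class as a height x width 2-D list of floats.
    """
    e_d = [[0.0] * width for _ in range(height)]
    for score, mask in detections:
        if score <= threshold:
            continue  # only detections above the threshold transfer masks
        for y in range(height):
            for x in range(width):
                e_d[y][x] += score * mask[y][x]
    return e_d


# Hypothetical detections on a 2x2 image; the third falls below the threshold.
detections = [
    (2.0, [[1, 0], [1, 1]]),
    (0.5, [[0, 1], [0, 1]]),
    (-1.0, [[1, 1], [1, 1]]),
]
E_D = detector_data_term(2, 2, detections)
```

Pixels covered by several confident detections accumulate a large response, while pixels that no mask covers stay at zero.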

SVM Combination and MRF Smoothing • For each pixel p and each class c of a test image, there are two data terms: E_R(p, c) and E_D(p, c). • Training data for each SVM is generated by running region- and detector-based parsing on the entire training set. • The labels are then smoothed with an MRF energy function.
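The combination and smoothing steps can be sketched as follows. Everything here is illustrative and rests on assumptions not stated on the slides: the classifiers are taken to be linear with pre-trained weights, the data cost is the negative SVM score, the smoothing penalty is a plain Potts term on 4-connected neighbors, and the greedy iterated conditional modes (ICM) update is a simple stand-in for whatever MRF inference the system actually uses.

```python
def combine_terms(e_r, e_d):
    """Concatenate the per-class E_R and E_D responses at every pixel."""
    return [[r + d for r, d in zip(row_r, row_d)]
            for row_r, row_d in zip(e_r, e_d)]


def svm_scores(features, weights, biases):
    """Apply one linear classifier per class to every pixel's feature vector."""
    return [[[sum(w * f for w, f in zip(w_c, x)) + b_c
              for w_c, b_c in zip(weights, biases)]
             for x in row]
            for row in features]


def mrf_energy(labels, scores, smoothness=1.0):
    """Negative score of each chosen label plus a Potts penalty per unequal neighbor pair."""
    h, w = len(labels), len(labels[0])
    energy = 0.0
    for y in range(h):
        for x in range(w):
            energy -= scores[y][x][labels[y][x]]
            if x + 1 < w and labels[y][x] != labels[y][x + 1]:
                energy += smoothness
            if y + 1 < h and labels[y][x] != labels[y + 1][x]:
                energy += smoothness
    return energy


def icm_smooth(scores, smoothness=1.0, sweeps=5):
    """Greedily relabel pixels to lower the energy, starting from the per-pixel argmax."""
    h, w = len(scores), len(scores[0])
    n_classes = len(scores[0][0])
    labels = [[max(range(n_classes), key=lambda c: scores[y][x][c])
               for x in range(w)] for y in range(h)]
    for _ in range(sweeps):
        changed = False
        for y in range(h):
            for x in range(w):
                neighbors = [labels[ny][nx]
                             for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1))
                             if 0 <= ny < h and 0 <= nx < w]

                def cost(c):
                    return -scores[y][x][c] + smoothness * sum(c != n for n in neighbors)

                best = min(range(n_classes), key=cost)
                if best != labels[y][x]:
                    labels[y][x] = best
                    changed = True
        if not changed:
            break
    return labels


# Hypothetical 1x3 image, two classes: the weakly scored middle pixel gets smoothed.
scores = [[[2.0, 0.0], [0.4, 0.5], [2.0, 0.0]]]
smoothed = icm_smooth(scores)
```

In this toy case the middle pixel prefers class 1 by a small margin on its own, but disagreeing with both neighbors costs more than the margin is worth, so smoothing assigns it class 0.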

Experiments • Three challenging datasets: • SIFT Flow [18] • LM+SUN [27] • CamVid [5]

Experiments

Experiments • SIFT Flow

Experiments • LM+SUN

Experiments • CamVid

Experiments

Conclusions • We propose an image parsing system that integrates region-based cues with the promising novel framework of per-exemplar detectors. • Our current system achieves very promising results, but at a considerable computational cost. • Reducing this cost is an important future research direction.
