Probabilistic Models for Parsing Images

Xiaofeng Ren, University of California, Berkeley

Parsing Images [Figure: an outdoor wildlife image parsed into labeled regions (water, grass, sand, shadow, tiger) and tiger parts (head, eye, back, legs, tail, mouse)]

A Classical View of Visual Processing • Low-level: Image Processing (Pixels & Pixel Features) • Mid-level: Perceptual Organization (Contours & Regions) • High-level: Recognition (Objects & Scenes)

Models for Parsing Images: a unified framework incorporating all levels of abstraction • Low-level: Image Processing (Pixels) • Mid-level: Perceptual Organization (Contours & Regions) • High-level: Recognition (Objects & Scenes)

Probabilistic Models for Images • Markov Random Fields [Geman & Geman 84]: labels defined on pixels • Applications: image restoration, edge detection, texture synthesis, segmentation, super-resolution, contour completion, … • Empirical evidence against pixel-based MRF [Ren & Malik 02]: very limited representational power
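To make the pixel-based MRF idea concrete, here is a minimal sketch of binary image restoration with an Ising-style prior, solved by iterated conditional modes (ICM). It is an illustration only: the energy form, the function name `icm_denoise`, and the weights are assumptions, not the model from the cited work.

```python
def icm_denoise(noisy, data_weight=1.0, smooth_weight=1.0, sweeps=5):
    """Iterated conditional modes on a 4-connected Ising MRF (illustrative).

    noisy: list of rows with entries in {-1, +1}.
    Each pixel label is set to the sign of
        data_weight * observed_pixel + smooth_weight * sum(neighbor labels),
    which greedily lowers the (assumed) energy
        E(x) = -data_weight * sum_i x_i y_i - smooth_weight * sum_{i~j} x_i x_j.
    """
    rows, cols = len(noisy), len(noisy[0])
    x = [row[:] for row in noisy]  # start from the observation
    for _ in range(sweeps):
        changed = False
        for r in range(rows):
            for c in range(cols):
                neighbor_sum = 0
                for dr, dc in ((-1, 0), (1, 0), (0, -1), (0, 1)):
                    rr, cc = r + dr, c + dc
                    if 0 <= rr < rows and 0 <= cc < cols:
                        neighbor_sum += x[rr][cc]
                field = data_weight * noisy[r][c] + smooth_weight * neighbor_sum
                # Keep the current label on ties.
                new_label = x[r][c] if field == 0 else (1 if field > 0 else -1)
                if new_label != x[r][c]:
                    x[r][c] = new_label
                    changed = True
        if not changed:
            break
    return x


# Toy input: a uniform image with one flipped pixel in the middle.
noisy = [[1] * 5 for _ in range(5)]
noisy[2][2] = -1
restored = icm_denoise(noisy)
```

Every interaction here is between adjacent pixels, which is the limitation the slide points at: a model of this kind has no variable that stands for a contour or a region.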

Where is Structure? Our perception of structure is disrupted. We cannot efficiently reason about structure if we cannot represent it.

Outline • Parsing Images • Building a Mid-level Representation • Probabilistic Models for Mid-level Vision • Contour Completion • Figure/Ground Organization • Combining Mid- and High-level Vision • Object Segmentation • Finding People • Conclusion & Future Work

Local Edge Detection • Use the Pb (probability of boundary) edge detector, which combines local brightness, texture and color contrasts.
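As a rough sketch of how local cues can be combined into a boundary probability, the snippet below passes three contrast measurements through a logistic function. The function name `pb_from_cues`, the weights, and the bias are hypothetical placeholders; in practice such weights would be fit to human-annotated boundaries.

```python
import math


def pb_from_cues(brightness_grad, color_grad, texture_grad,
                 weights=(1.0, 1.0, 1.0), bias=-2.0):
    """Logistic combination of local contrast cues (illustrative weights).

    Each *_grad argument is a non-negative local contrast measurement at one
    pixel and orientation. Returns a value in (0, 1) read as P(boundary).
    """
    score = (bias
             + weights[0] * brightness_grad
             + weights[1] * color_grad
             + weights[2] * texture_grad)
    return 1.0 / (1.0 + math.exp(-score))


weak = pb_from_cues(0.1, 0.1, 0.1)    # low contrast on all cues
strong = pb_from_cues(0.9, 0.9, 0.9)  # high contrast on all cues
```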

Piece-wise Linear Approximation • Recursively split the boundaries (using angles) until each piece is approximately straight
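The recursive splitting step can be sketched as follows. This assumes one plausible reading of "using angles": split a contour at the interior point with the largest turning angle, and stop once that angle falls below a threshold. The function names and the threshold are illustrative.

```python
import math


def turning_angle(a, b, c):
    """Angle in radians between directions a->b and b->c (0 means straight)."""
    v1 = (b[0] - a[0], b[1] - a[1])
    v2 = (c[0] - b[0], c[1] - b[1])
    if v1 == (0, 0) or v2 == (0, 0):
        return 0.0
    cross = v1[0] * v2[1] - v1[1] * v2[0]
    dot = v1[0] * v2[0] + v1[1] * v2[1]
    return abs(math.atan2(cross, dot))


def split_contour(points, lo, hi, max_angle):
    """Return sorted breakpoint indices so each piece is roughly straight.

    points: ordered list of (x, y) pixels along one contour.
    The piece points[lo..hi] is split at the interior point whose turning
    angle (measured against the two endpoints) is largest, as long as that
    angle exceeds max_angle; both halves are then processed recursively.
    """
    best_k, best_angle = None, 0.0
    for k in range(lo + 1, hi):
        angle = turning_angle(points[lo], points[k], points[hi])
        if angle > best_angle:
            best_k, best_angle = k, angle
    if best_k is None or best_angle <= max_angle:
        return [lo, hi]
    left = split_contour(points, lo, best_k, max_angle)
    right = split_contour(points, best_k, hi, max_angle)
    return left + right[1:]  # drop the duplicated shared breakpoint


# An L-shaped contour: one break is expected at the corner (index 2).
contour = [(0, 0), (1, 0), (2, 0), (2, 1), (2, 2)]
breaks = split_contour(contour, 0, len(contour) - 1, max_angle=math.radians(20))
```

The resulting straight pieces are the kind of edges that a constrained triangulation can then be asked to preserve.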

Constrained Delaunay Triangulation (CDT) • A variant of the standard Delaunay Triangulation • Keeps a given set of edges in the triangulation • Widely used in geometric modeling and finite elements.

Scale Invariance of CDT

The CDT Graph: Summary • From millions of pixels to about 1000 edges • Fast to compute • Scale-invariant • Completes gaps • Little loss of structure • Longer ranges of interaction • Principle of Uniform Connectedness: use homogeneous regions as entry-level units in perceptual organization [Palmer and Rock 94] • Pixels, Superpixels, CDT: [Ren & Malik; ICCV 2003] [Ren, Fowlkes & Malik; ICCV 2005]

Analogy with Natural Language Parsing • Pixels ↔ Letters • Superpixels ↔ Words • Contours & Regions ↔ Phrases • Objects & Scenes ↔ Sentences & Paragraphs


Mid-level Vision • It is not low-level vision (which can be computed independently in a local neighborhood). • It is not high-level vision (which assumes knowledge of particular object categories & scenes). • Problems in mid-level vision: figure/ground organization, curvilinear grouping, region segmentation

Mid-level Vision • Problems in mid-level vision: figure/ground organization, curvilinear grouping, region segmentation

Curvilinear Grouping • Boundaries are smooth in nature! • A number of associated visual phenomena: good continuation, visual completion, illusory contours

Beyond Local Edge Detection • There is psychophysical evidence that we are approaching the limit of local edge detection • Smoothness of boundaries in natural images provides an important contextual cue.

Inference on the CDT Graph • One binary variable per CDT edge: $X_e \in \{0,1\}$ (1: boundary, 0: non-boundary) • Random field: defines a joint probability distribution on all $\{X_e\}$ • Estimate the marginal $P(X_e)$

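Conditional Random Fields (CRF) [Pietra, Pietra & Lafferty 97] [Lafferty, McCallum & Pereira 01] • Undirected graphical model with potential functions in the exponential family • $X = \{X_1, X_2, \ldots, X_m\}$, with $P(X) \propto \prod_i \exp(\alpha_i \phi_i) \prod_j \exp(\beta_j \psi_j)$ • Edge potentials: $\exp(\alpha_i \phi_i)$ • Junction potentials: $\exp(\beta_j \psi_j)$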
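A log-linear model of this kind can be checked on a toy problem by brute-force enumeration. The sketch below is illustrative only: the feature functions, weights, and contrast values are made up, and exhaustive enumeration is practical only for a handful of variables (a real CDT graph needs approximate inference).

```python
import itertools
import math


def crf_marginals(num_vars, features, weights):
    """Exact marginals P(X_i = 1) for a tiny log-linear model.

    features: functions mapping a full 0/1 assignment tuple to a float.
    The unnormalized probability is exp(sum_k weights[k] * features[k](x)).
    """
    z = 0.0
    on_mass = [0.0] * num_vars
    for x in itertools.product((0, 1), repeat=num_vars):
        p = math.exp(sum(w * f(x) for w, f in zip(weights, features)))
        z += p
        for i, xi in enumerate(x):
            if xi:
                on_mass[i] += p
    return [m / z for m in on_mass]


# Hypothetical toy: three edges meeting at one junction.
contrast = [0.9, 0.8, 0.1]
features = [
    lambda x: sum(c * xe for c, xe in zip(contrast, x)),  # contrast of "on" edges
    lambda x: -float(sum(x)),                             # cost per "on" edge
    lambda x: 1.0 if sum(x) == 2 else 0.0,                # bonus for a continuation
]
weights = [4.0, 2.0, 1.0]

independent = crf_marginals(3, features[:2], weights[:2])  # edge terms only
coupled = crf_marginals(3, features, weights)              # plus junction term
```

With only the edge terms the variables are independent and each marginal is a logistic function of that edge's contrast; adding the junction term couples the edges, so the weak third edge is suppressed further when the two strong edges already form a continuation.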

Edge Potential: Local Contrast • Edge potentials: $\exp(\alpha_i \phi_i)$ • $\phi$ = average contrast on each edge $e$

Junction Potential: Degree • The degree of a junction depends on the assignments of $\{X_e\}$: deg=0 (no lines), deg=1 (line ending), deg=2 (continuation), deg=3 (T-junction) • Junction potentials: $\exp(\beta_j \psi_j)$, where $\psi_j = \mathbf{1}(\deg = j)$
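The degree feature is simple to compute once edge labels are given. The following sketch uses hypothetical names and weights: it counts the edges switched on at a junction and builds one indicator per degree.

```python
import math


def junction_degree(incident_edges, assignment):
    """Number of incident edges labeled 1 (boundary) at a junction."""
    return sum(assignment[e] for e in incident_edges)


def degree_features(incident_edges, assignment, max_degree=3):
    """Indicator vector [1(deg=0), 1(deg=1), ..., 1(deg=max_degree)].

    Degrees above max_degree yield an all-zero vector in this sketch.
    """
    deg = junction_degree(incident_edges, assignment)
    return [1.0 if deg == j else 0.0 for j in range(max_degree + 1)]


def junction_potential(incident_edges, assignment, weights):
    """exp(sum_j weights[j] * 1(deg=j)); the weights here are hypothetical."""
    feats = degree_features(incident_edges, assignment, len(weights) - 1)
    return math.exp(sum(w * f for w, f in zip(weights, feats)))


edges = ["a", "b", "c"]
labels = {"a": 1, "b": 1, "c": 0}         # two edges on: a continuation
feats = degree_features(edges, labels)
potential = junction_potential(edges, labels, weights=[0.0, -1.0, 1.0, -0.5])
```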

Junction Potential: Continuity • For a degree-2 junction (continuation) with turning angle $\theta$: $\psi = g(\theta) \cdot \mathbf{1}(\deg = 2)$
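A continuity feature of this shape can be sketched as below. The slide does not specify g, so a Gaussian-shaped function of the turning angle is assumed here purely for illustration; `sigma` and the function names are likewise hypothetical.

```python
import math


def turning_angle(p_in, vertex, p_out):
    """Angle in radians between directions p_in->vertex and vertex->p_out."""
    v1 = (vertex[0] - p_in[0], vertex[1] - p_in[1])
    v2 = (p_out[0] - vertex[0], p_out[1] - vertex[1])
    cross = v1[0] * v2[1] - v1[1] * v2[0]
    dot = v1[0] * v2[0] + v1[1] * v2[1]
    return abs(math.atan2(cross, dot))


def continuity_feature(vertex, neighbors, assignment, sigma=0.5):
    """g(theta) * 1(deg = 2), with an ASSUMED g(theta) = exp(-theta^2 / (2 sigma^2)).

    neighbors: the far endpoints of the edges incident to the vertex.
    assignment: 0/1 label for each of those edges, in the same order.
    """
    on = [p for p, x in zip(neighbors, assignment) if x == 1]
    if len(on) != 2:
        return 0.0  # the feature only fires for continuations
    theta = turning_angle(on[0], vertex, on[1])
    return math.exp(-theta ** 2 / (2.0 * sigma ** 2))


vertex = (0.0, 0.0)
neighbors = [(-1.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
straight = continuity_feature(vertex, neighbors, [1, 1, 0])  # collinear pair
corner = continuity_feature(vertex, neighbors, [1, 0, 1])    # right-angle turn
```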

Learning the Parameters • Learned values shown: 2.46, 0.87, 1.14, 0.01 • Compare to [Geman and Geman 84] • Mid-level representation + probabilistic framework + large annotated datasets

Evaluation: Precision vs Recall • Match detections to groundtruth • Precision = matched pairs / total detections • Recall = matched pairs / total groundtruth • High threshold: few detections • Low threshold: lots of detections
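The two ratios can be computed as in this small sketch. It simplifies matching to exact identity between detected and groundtruth elements, which is an assumption made for brevity; boundary benchmarks typically match detections to groundtruth with some spatial tolerance.

```python
def precision_recall(scores, groundtruth, threshold):
    """Precision and recall of thresholded detections.

    scores: dict mapping an element id to its boundary score.
    groundtruth: set of element ids that are true boundaries.
    Matching is by identical id (a simplification).
    """
    detected = {k for k, s in scores.items() if s >= threshold}
    matched = detected & groundtruth
    precision = len(matched) / len(detected) if detected else 1.0
    recall = len(matched) / len(groundtruth) if groundtruth else 1.0
    return precision, recall


scores = {"e1": 0.9, "e2": 0.7, "e3": 0.4, "e4": 0.2}
groundtruth = {"e1", "e3"}

high = precision_recall(scores, groundtruth, threshold=0.8)  # few detections
low = precision_recall(scores, groundtruth, threshold=0.3)   # many detections
```

Sweeping the threshold from high to low traces out the precision-recall curve: precision tends to fall as recall rises.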

Horse dataset of [Borenstein and Ullman 02]: 175 training images, 175 test images • Curvilinear grouping improves boundary detection at both low recall and high recall • “Mid-level vision is useful” [Ren, Fowlkes & Malik; ICCV 2005]

[Figure: example results; columns show Image, Pb, CRF]


Figure/Ground Organization • A contour belongs to one of the two (but not both) abutting regions. • Important for the perception of shape [Figure: face/goblet image; when the face is figure the other region is shapeless ground, and when the goblet is figure the face region is shapeless ground]

Inference on the CDT Graph • $X_e \in \{-1, 1\}$ (1: left is figure, -1: right is figure) • Local model: convexity, parallelism, … • Global model: consistency at T-junctions

Results Using human segmentations [Ren, Fowlkes & Malik; ECCV 2006]

Models for Contour Labeling • Labels {Xe} on contours, modeled with a CRF • Curvilinear Grouping • Figure/Ground Assignment

Line Labeling [Clowes 1971, Huffman 1971; Waltz 1972] • Labels: > contour direction, + convex edge, - concave edge • Possible junctions act as constraints (a CSP) • Reviving the old tradition with modern technologies, for more realistic applications

Parsing Images • Add region-based variables and cues • Joint contour and region inference • Add high-level knowledge (objects)

Object Segmentation • Object-specific cues: shape, region support, color/texture, …

Inference on the CDT Graph • Contour variables {Xe} • Region variables {Yt} • Object variable Z, encoding location, scale, pose, etc. • Integrating {Xe}, {Yt} and Z: low/mid/high-level cues

Grouping Cues • Low-level Cues: L1(Xe|I), edge energy along edge e; L2(Ys,Yt|I), brightness/texture similarity between two regions s and t • Mid-level Cues: M1(XV|I), edge collinearity and junction frequency at vertex V; M2(Xe,Ys,Yt), consistency between edge e and two adjoining regions s and t • High-level Cues: H1(Yt|I), texture similarity of region t to exemplars; H2(Yt,Z|I), compatibility of region support with pose; H3(Xe,Z|I), compatibility of local edge shape with pose

Cue Integration in CRF Estimate the marginal posteriors of X, Y and Z
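One way to write the integration, assuming a log-linear combination of the seven cue terms L1, L2, M1, M2, H1, H2, H3 named on the Grouping Cues slide, is sketched below. The weights $w$ and the normalizer $\mathcal{Z}(I)$ (written in calligraphic type to keep it distinct from the object variable $Z$) are notation introduced here for illustration.

```latex
P(X, Y, Z \mid I) = \frac{1}{\mathcal{Z}(I)} \exp\Big(
    w_{L_1} \sum_{e} L_1(X_e \mid I)
  + w_{L_2} \sum_{s,t} L_2(Y_s, Y_t \mid I)
  + w_{M_1} \sum_{V} M_1(X_V \mid I)
  + w_{M_2} \sum_{e,s,t} M_2(X_e, Y_s, Y_t)
  + w_{H_1} \sum_{t} H_1(Y_t \mid I)
  + w_{H_2} \sum_{t} H_2(Y_t, Z \mid I)
  + w_{H_3} \sum_{e} H_3(X_e, Z \mid I)
\Big)
```

Under this form, the marginal posterior of a single contour variable is obtained by summing the joint over all other variables, $P(X_e \mid I) = \sum_{X \setminus X_e,\, Y,\, Z} P(X, Y, Z \mid I)$, and similarly for each $Y_t$ and for $Z$.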

Results • Object knowledge helps a lot • Mid-level cues are still useful [Ren, Fowlkes & Malik; NIPS 2005]

[Figure: example results; columns show Input, Input Pb, Output Contour, Output Figure]

Finding People The challenges: • Pose articulation + self-occlusion • Clothing • Lighting • Clutter ……

Finding People: Top-Down • Top-down approaches • 3D model-based: fails most of the time • 2D template-based: needs lots of training data

Finding People: Bottom-Up [Figure: levels shown are Pixels, Superpixels, Contours & Regions, Objects & Scenes]

[Ren, Berg & Malik; ICCV 2005]
