Pattern Theory: Mathematics of Perception - Prof. David Mumford, Brown University

Pattern Theory: the Mathematics of Perception
Prof. David Mumford
Division of Applied Mathematics, Brown University
International Congress of Mathematicians, Beijing, 2002

Outline of talk
I. Background: history, motivation, basic definitions
II. A basic example – Hidden Markov Models and speech; and extensions
III. The “natural degree of generality” – Markov Random Fields; and vision applications
IV. Continuous models: image processing via PDE’s, self-similarity of images and random diffeomorphisms
URL: www.dam.brown.edu/people/mumford/Papers/ICM02powerpoint.pdf or /ICM02proceedings.pdf

Some History
• Is there a mathematical theory underlying intelligence?
• 40’s – Control theory (Wiener-Pontrjagin), the output side: driving a motor with noisy feedback in a noisy world to achieve a given state
• 70’s – ARPA speech recognition program
• 60’s-80’s – AI, esp. medical expert systems, modal, temporal, default and fuzzy logics and finally statistics
• 80’s-90’s – Computer vision, autonomous land vehicle

Statistics vs. Logic
• Plato: “If Theodorus, or any other geometer, were prepared to rely on plausibility when he was doing geometry, he'd be worth absolutely nothing.”
• Gauss – Gaussian distributions, least squares → relocating lost Ceres from noisy incomplete data
• Control theory – the Kalman-Wiener-Bucy filter
• AI – Enhanced logics < Bayesian belief networks
• Vision – Boolean combinations of features < Markov random fields
• Graunt – counting corpses in medieval London

What you perceive is not what you hear:
ACTUAL SOUND → PERCEIVED WORDS
The ?eel is on the shoe → The heel is on the shoe
The ?eel is on the car → The wheel is on the car
The ?eel is on the table → The meal is on the table
The ?eel is on the orange → The peel is on the orange
(Warren & Warren, 1970)
Statistical inference is being used!

Why is this old man recognizable from a cursory glance? His outline is lost in clutter, shadows and wrinkles; except for one ear, his face is invisible. No known algorithm will find him.

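The Bayesian Setup, I
VARIABLES: x_o = observed variables, x_h = hidden variables, θ = model parameters
MODEL: Pr(x_o, x_h | θ)
ML PARAMETER ESTIMATION: choose θ to maximize the probability of the observed data, Σ_{x_h} Pr(x_o, x_h | θ)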

The Bayesian Setup, II
BAYES’S RULE: Pr(x_h | x_o, θ) = Pr(x_o, x_h | θ) / Σ_{x_h} Pr(x_o, x_h | θ), with θ the model parameters
• This is called the “posterior” distribution on x_h
• Sampling Pr(x_o, x_h | θ), “synthesis”, is the acid test of the model
• The central problem of statistical learning theory: the complexity of the model and the bias-variance dilemma
  • Minimum Description Length (MDL)
  • Vapnik’s VC dimension

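A basic example: HMM’s and speech recognition
I. Setup
• s_k = sound in window around time kΔt are the observables
• x_k = part of phoneme being spoken at time kΔt are the hidden vars.
• Graph: a chain of hidden variables x_{k-1}, x_k, x_{k+1}, each x_k attached to its observable s_k
• θ’s = logs of the table of values of p1, p2, so the model is an exponential model in the θ’s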

A basic example: HMM’s and speech recognition
II. Inference by dynamic programming
(c) Optimizing the parameters θ is done by the “EM” algorithm, valid for any exponential model
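A minimal sketch of the dynamic-programming inference step (the Viterbi recursion) for the chain model of the setup slide, in Python with NumPy. It assumes finite tables p1 (transitions between hidden states) and p2 (observation probabilities), plus an initial distribution that the slides do not specify; the function name `viterbi` and the toy numbers are illustrative only.

```python
import numpy as np

def viterbi(log_p1, log_p2, log_init, obs):
    """Most probable hidden path x_1..x_n given observations s_1..s_n.

    log_p1[i, j] = log p1(x_k = j | x_{k-1} = i)   (transition table)
    log_p2[j, s] = log p2(s_k = s | x_k = j)       (observation table)
    log_init[j]  = log Pr(x_1 = j)                 (assumed initial distribution)
    """
    n, m = len(obs), log_p1.shape[0]
    score = np.empty((n, m))             # best log-probability of a path ending in state j
    back = np.zeros((n, m), dtype=int)   # best predecessor of state j at step k
    score[0] = log_init + log_p2[:, obs[0]]
    for k in range(1, n):
        cand = score[k - 1][:, None] + log_p1   # cand[i, j]: arrive in j from i
        back[k] = np.argmax(cand, axis=0)
        score[k] = cand[back[k], np.arange(m)] + log_p2[:, obs[k]]
    # Trace the optimal path backwards from the best final state.
    path = [int(np.argmax(score[-1]))]
    for k in range(n - 1, 0, -1):
        path.append(int(back[k, path[-1]]))
    return path[::-1]

# Toy example: two "sticky" hidden states, two observation symbols.
p1 = np.array([[0.9, 0.1], [0.1, 0.9]])
p2 = np.array([[0.8, 0.2], [0.2, 0.8]])
init = np.array([0.5, 0.5])
obs = [0, 0, 1, 0, 0, 1, 1, 1]
path = viterbi(np.log(p1), np.log(p2), np.log(init), obs)
```

The cost is linear in the number of time steps and quadratic in the number of hidden states, which is what makes exact inference on a chain practical.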

Continuous and discrete variables in perception
• Perception locks on to discrete labels, and the world is made up of discrete objects/events
• Make an empirical histogram of changes Δx and compute the kurtosis: κ = E[(Δx − mean(Δx))^4] / σ^4, where σ² is the variance of Δx
• High kurtosis is nature’s universal signal of discrete events/objects in space-time.
• A stochastic process with i.i.d. increments has jumps iff the kurtosis κ of its increments is > 3.
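As an illustration of the kurtosis test, the sketch below compares the increments of a purely Gaussian random walk with increments that also contain rare large jumps. The mixture parameters (jump probability 0.01, jump scale 5) and the helper name `kurtosis` are assumptions chosen for the example, not values from the talk.

```python
import numpy as np

def kurtosis(x):
    """Fourth central moment divided by the squared variance (equals 3 for a Gaussian)."""
    x = np.asarray(x, dtype=float)
    d = x - x.mean()
    return float(np.mean(d ** 4) / np.mean(d ** 2) ** 2)

rng = np.random.default_rng(0)
n = 200_000

# Increments of a process without jumps: plain Gaussian steps.
gauss_steps = rng.normal(0.0, 1.0, n)

# Increments of a process with jumps: small diffusion plus rare large jumps.
is_jump = rng.random(n) < 0.01
jump_steps = rng.normal(0.0, 0.1, n) + is_jump * rng.normal(0.0, 5.0, n)

k_gauss = kurtosis(gauss_steps)   # close to 3
k_jump = kurtosis(jump_steps)     # far above 3
```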

A typical stochastic process with jumps
X_t a stochastic process with independent increments, then

Ex.: daily log-price changes in a sample of stocks. Note the fat power-law tails. N.B. the vertical axis is the log of probability.

Particle filtering
• Compiling full conditional probability tables is usually impractical.
• Use a weighted sample:
• Bootstrap particle filtering:
  (a) Sample with replacement
  (b) Diffuse via p1
  (c) Reweight via p2
• Tracking application (from A. Blake, M. Isard)
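The three bootstrap steps can be sketched on a toy one-dimensional tracking problem. Everything specific here is an assumption for illustration: p1 is taken to be a Gaussian random walk with step size `q`, p2 a Gaussian observation model with noise `r`, and the function name and particle count are arbitrary.

```python
import numpy as np

def bootstrap_particle_filter(obs, n_particles=2000, q=0.5, r=1.0, seed=0):
    """Return the posterior-mean estimate of the hidden state after each observation."""
    rng = np.random.default_rng(seed)
    particles = rng.normal(0.0, 1.0, n_particles)      # assumed prior on the initial state
    weights = np.full(n_particles, 1.0 / n_particles)
    means = []
    for s in obs:
        # (a) Sample with replacement, according to the current weights.
        idx = rng.choice(n_particles, size=n_particles, p=weights)
        particles = particles[idx]
        # (b) Diffuse via p1: here a Gaussian random-walk step.
        particles = particles + rng.normal(0.0, q, n_particles)
        # (c) Reweight via p2: here a Gaussian likelihood of the observation.
        log_w = -0.5 * ((s - particles) / r) ** 2
        weights = np.exp(log_w - log_w.max())          # subtract max for numerical stability
        weights /= weights.sum()
        means.append(float(np.sum(weights * particles)))
    return np.array(means)

# Synthetic data drawn from the same assumed model.
data_rng = np.random.default_rng(1)
true_x = np.cumsum(data_rng.normal(0.0, 0.5, 100))
obs = true_x + data_rng.normal(0.0, 1.0, 100)
est = bootstrap_particle_filter(obs)
```

The weighted sample stands in for the full conditional probability table: only the particles and their weights are stored, whatever the size of the state space.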

Estimating the posterior distribution on optical flow in a movie (from M. Black). Panels: horizontal flow (follow window in red).

No process is truly Markov
• Speech has longer range patterns than phonemes: triphones, words, sentences, speech acts, …
• PCFG’s = “probabilistic context free grammars” = almost surely finite, labeled, random branching processes: a forest of random trees T_n, labels x_v on vertices, leaves in 1:1 correspondence with observations s_m, prob. p1(x_{v_k} | x_v) on children, p2(s_m | x_m) on observations.
• Unfortunate fact: nature is not so obliging; longer range constraints force context-sensitive grammars. But how to make these stochastic??

Grammar in the parsed speech of Helen, a 2 ½ year old

Grammar in images (G. Kanizsa): contour completion

Markov Random Fields: the natural degree of generality
• Time → linear structure of dependencies
• space/space-time/abstract situations → general graphical structure of dependencies
The Markov property: x_v, x_w are conditionally independent, given x_S, if S separates v, w in G.
Hammersley-Clifford: the converse.
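For reference, the two directions can be stated in standard notation as follows. The symbols Z (normalizing constant) and E_C (energy attached to a clique C) are the usual textbook names and are assumptions here, not notation taken from the slide.

```latex
% Gibbs form: one energy term E_C per clique C of G, normalized by Z
\Pr(x) \;=\; \frac{1}{Z}\,\exp\Bigl(-\sum_{C \,\in\, \mathrm{cliques}(G)} E_C(x_C)\Bigr)

% Gibbs form  =>  Markov property:
%   x_v and x_w are conditionally independent given x_S whenever S separates v, w in G.
% Hammersley-Clifford (the converse, assuming \Pr(x) > 0 for every x):
%   Markov property  =>  \Pr has the Gibbs form above.
```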

A simple MRF: the Ising model
Figure: a lattice of hidden variables x_{k,l}, each with its own observable s_{k,l}; the 3×3 neighborhood of site (k,l) is shown.

The Ising model and image segmentation

A state-of-the-art image segmentation algorithm (S.-C. Zhu)
Panels: Input; Segmentation; Synthesis from model, I ~ p(I | W*)
Hidden variables describe segments and their texture, allowing both slow and abrupt intensity and texture changes. (See also Shi-Malik.)

Texture synthesis via MRF’s
On the left: a cheetah hide; in the middle: a sample from the Gaussian model with identical second order statistics; on the right: a sample from the exponential model reproducing 7 filter marginals using:

Monte Carlo Markov Chains
Basic idea: use artificial thermal dynamics to find minimum energy (= maximum probability) states
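One way to make the "artificial thermal dynamics" concrete is Gibbs sampling with a slowly lowered temperature (simulated annealing), applied here to an Ising-type energy for denoising a binary image. The energy form, the coupling constants `J` and `h`, the cooling schedule and the test image are all assumptions made for this sketch.

```python
import numpy as np

def energy(x, s, J=1.0, h=1.0):
    """Ising energy: -J * (sum of x_i x_j over neighbor pairs) - h * (sum of x_i s_i)."""
    pairs = np.sum(x[:-1, :] * x[1:, :]) + np.sum(x[:, :-1] * x[:, 1:])
    return float(-J * pairs - h * np.sum(x * s))

def anneal(s, J=1.0, h=1.0, sweeps=30, t_start=2.0, t_end=0.1, seed=0):
    """Gibbs sampling of x given observed s, with temperature lowered geometrically."""
    rng = np.random.default_rng(seed)
    x = s.copy()
    rows, cols = x.shape
    for T in np.geomspace(t_start, t_end, sweeps):
        for i in range(rows):
            for j in range(cols):
                nb = 0   # sum of the (up to four) neighboring spins
                if i > 0:
                    nb += x[i - 1, j]
                if i < rows - 1:
                    nb += x[i + 1, j]
                if j > 0:
                    nb += x[i, j - 1]
                if j < cols - 1:
                    nb += x[i, j + 1]
                field = J * nb + h * s[i, j]
                # Conditional probability of x[i, j] = +1 at temperature T.
                p_plus = 1.0 / (1.0 + np.exp(-2.0 * field / T))
                x[i, j] = 1 if rng.random() < p_plus else -1
    return x

# Two-region test image with 10% of the pixels flipped.
truth = -np.ones((16, 16), dtype=int)
truth[:, 8:] = 1
flip = np.random.default_rng(1).random(truth.shape) < 0.1
noisy = np.where(flip, -truth, truth)
restored = anneal(noisy)
```

At high temperature the sampler explores freely; as T falls it concentrates on low-energy states, which are the high-probability states of the model.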

Bayesian belief propagation and the Bethe approximation
• Can find modes of MRF’s on trees using dynamic programming
• ‘Bayesian belief propagation’ = finding the modes of the Bethe approximation with dynamic programming
• When the graph G is not a tree, use its universal covering graph G̃
• The Bethe approximation = the π1(G)-invariant MRF on G̃ ‘closest’ to the given MRF, where ‘closest’ = minimizing Kullback-Leibler information distance

Continuous models I: deblurring and denoising
• Observe noisy, blurred image I; seek to remove noise and enhance edges simultaneously!
• Variational models: p=2 is the Mumford-Shah model; p=1, c3=0 is the Osher-Rudin ‘TV’ model
• Evolution equations (PDE’s): p=2 is the Perona-Malik equation; c=0, p=1 is TV-gradient descent

An example: Bela Bartok enhanced via the Nitzberg-Shiota filter

Continuous models II: images and scaling
• The statistics of images of ‘natural scenes’ appear to be a fixed point under block-averaging renormalization, i.e.
• Assume N×N images of natural scenes have a certain probability distribution; form N/2×N/2 images by a window or by 2×2 averages – get the same marginal distribution!
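The block-averaging step itself is easy to write down. The sketch below applies it to white noise, chosen only because it needs no image data: white noise is not a fixed point, since the variance of its first differences drops by about a factor of 4 at each step, whereas the claim above is that natural-scene statistics stay the same. The function name and the choice of statistic are assumptions for illustration.

```python
import numpy as np

def block_average(img):
    """Replace each 2x2 block by its mean: an N x N image becomes N/2 x N/2."""
    n, m = img.shape
    assert n % 2 == 0 and m % 2 == 0, "image sides must be even"
    return img.reshape(n // 2, 2, m // 2, 2).mean(axis=(1, 3))

def diff_variance(img):
    """Variance of the horizontal first difference of pixel values."""
    return float(np.var(img[:, 1:] - img[:, :-1]))

rng = np.random.default_rng(0)
noise = rng.normal(0.0, 1.0, (256, 256))
coarse = block_average(noise)

# For white noise this ratio is about 0.25; scale invariance would require about 1.
ratio = diff_variance(coarse) / diff_variance(noise)
```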

Scale invariance has many implications:
• Power law for the spectrum:
• In the continuous limit, images are not locally integrable functions but generalized functions in:
• Intuitively, this is what we call ‘clutter’ – the mathematical explanation of why vision is hard

Three axioms for natural images
1. Scale invariance
2. For all zero-mean filters F, the scalar random variables given by the filter responses have kurtosis > 3 (D. Field, J. Huang).
3. Local image patches are dominated by ‘preferred geometries’: edges, bars, blobs as well as ‘blue sky’ blank patches (D. Marr, B. Julesz, A. Lee).
It is not known if these axioms can be exactly satisfied!

Empirical data on image filter responses
Probability distributions of 1 and 2 filters, estimated from natural image data.
a) Top plot is for values of the horizontal first difference of pixel values; middle plot is for random 0-mean 8x8 filters. Vertical axis in the top 2 plots is log(prob. density).
b) Bottom plot shows level curves of the joint prob. density of vertical differences at two horizontally adjacent pixels.
All are highly non-Gaussian!

Mathematical models for random images
Panels: (Random wavelet model); (“Dead leaves” model)

Continuous models III: random diffeomorphisms
• The patterns of the world include shapes, structures which recur with distortions: e.g. alphanumeric characters, faces, anatomy
• Thus the hidden variables must include (i) clusters of similar shapes, (ii) warpings between shapes in a cluster
• Mathematically: need a metric on (i) the space of diffeomorphisms G_k of ℝ^k, or (ii) the space of “shapes” S_k in ℝ^k (open subsets with smooth boundary)
• Can use diffusion to define a probability measure on G_k.

Metrics on G_k, I
V. Arnold: Introduce a Riemannian metric in SG_k, the group of volume preserving diffeos. For any path {φ_t}, let v_t be its velocity field and measure the length of the path by the L² norm of v_t.
Then he proved geodesics are solutions of Euler’s equation.
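A sketch of Arnold's construction in standard notation. The symbols φ_t (the path of diffeomorphisms), v_t (its velocity field) and p (pressure) follow common usage and are assumptions here, not necessarily the slide's own notation.

```latex
% Velocity field of the path \varphi_t and the L^2 length it defines
v_t = \frac{\partial \varphi_t}{\partial t} \circ \varphi_t^{-1}, \qquad
\mathrm{length}(\{\varphi_t\}) = \int_0^1 \Bigl( \int_{\mathbb{R}^k} \|v_t(x)\|^2 \, dx \Bigr)^{1/2} dt

% Euler's equation for an incompressible fluid (p = pressure)
\frac{\partial v_t}{\partial t} + (v_t \cdot \nabla)\, v_t = -\nabla p, \qquad
\operatorname{div} v_t = 0
```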

Metrics on G_k, II
Christensen, Rabbitt, Miller – on G_k, use a stronger metric: v_t = velocity, u_t = Lv_t = momentum in this metric.
Geodesics now are solutions to a regularized compressible form of Euler’s equation:
Note: linear in u, so u can be a generalized function!
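For orientation, the geodesic equation for such a metric is usually written in the following form in the later literature, where it is often called EPDiff. That this matches the slide's equation term by term is an assumption; it does share the property noted above of being linear in u.

```latex
% u_t = L v_t is the momentum; (D v_t)^T is the transposed Jacobian of v_t
\frac{\partial u_t}{\partial t} + (v_t \cdot \nabla)\, u_t + (D v_t)^{T} u_t
  + (\operatorname{div} v_t)\, u_t = 0
```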

Geodesics in the quotient space S_2
• Get geodesics {H_t} on S_k by taking singular momentum:
• (A geodesic, F. Beg)
• S_2 has remarkable structure:
  • ‘Weak’ Hilbert manifold
  • ‘Medial axis’ gives it a cell decomposition
  • Geometric heat equation defines a deformation retraction
  • Diffusion defines probability measure (Dupuis-Grenander-Miller, Yip)

Geodesics in the quotient space of ‘landmark points’ give a classical mechanical system (Younes)
• Consider geodesics whose momentum has finite support:
• This gives the ODE in which particles traveling in the same (resp. opposite) direction attract (resp. repel):
• K is the Green’s function of L, a Bessel function.

Outlook for Pattern Theory
• Finding a rich class of stochastic models adequate for duplicating human perception yet tractable (vision remains a major challenge)
• Finding algorithms fast enough to make inferences with these models (Monte Carlo? BBP? competing hypothesis particles?)
• Underpinnings for a better biological theory of neural functioning, e.g. incorporating particle filtering? grammar? warping? feedback?
URL: www.dam.brown.edu/people/mumford/Papers/ICM02powerpoint.pdf or /ICM02proceedings.pdf

A sample of Graunt’s data
