Conditional Random Fields

Presented by Shira Kritchman & Lena Gorelick, May 13, 2007. Advanced Topics in Computer and Human Vision, Spring 2007.

Outline • Introduction • Statistical Modeling • Generative vs. Discriminative Models • Naïve Bayes vs. Logistic Regression • Sequence Modeling: HMM • CRF • Sequence Modeling: Linear Chain CRF • Learning (Parameter Estimation) • Improved Iterative Scaling (IIS) • S algorithm • General CRF • Applications

Problem Formulation – Label Assignment • Classification • Segmentation [Figure: example images labeled Horse, Horse, Non-horse]

Problem Formulation – Label Assignment • Observation and Labels [Figure: images labeled Horse, Horse, Non-horse]

Problem Formulation – Label Assignment • Classification [Figure labels: Biology, Math]

Problem Formulation – Label Assignment • Parsing: TAKE/verb THE/article GREEN/adjective APPLE/noun FROM/preposition THE/article BOX/noun THEN/conjunction HIT/verb IT/pronoun WITH/preposition MY/adjective SWORD/noun

Problem Formulation • Finding the mechanism = LEARNING • Inputs: Data and Prior Knowledge (Horses are rarely blue. Horses have 4 legs. … Neighboring pixels have similar labels.)

Generative Modeling • Define a joint probability distribution $p(x, y)$ over observation and label pairs • Label Assignment: by Bayes' rule, $y^* = \arg\max_y p(y \mid x) = \arg\max_y p(x, y) / p(x)$

What does it generate? • Assumption: the “outputs” probabilistically generate the “inputs” • So we use $p(x, y) = p(y)\, p(x \mid y)$
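As a minimal sketch of generative label assignment, the snippet below applies Bayes' rule to a toy problem. The probability tables and the single "color" attribute are hypothetical numbers chosen for illustration; they do not come from the slides.

```python
# Hypothetical toy numbers: a prior p(y) and a class-conditional p(x | y)
# for a single observed attribute x (the color of the object).
prior = {"horse": 0.5, "non-horse": 0.5}
likelihood = {
    "horse": {"brown": 0.9, "blue": 0.1},
    "non-horse": {"brown": 0.4, "blue": 0.6},
}


def posterior(x):
    """Bayes' rule: p(y | x) = p(y) p(x | y) / sum_y' p(y') p(x | y')."""
    joint = {y: prior[y] * likelihood[y][x] for y in prior}
    evidence = sum(joint.values())  # p(x)
    return {y: joint[y] / evidence for y in joint}


def assign_label(x):
    """Pick the label with the highest posterior probability."""
    post = posterior(x)
    return max(post, key=post.get)


print(assign_label("brown"), assign_label("blue"))
```

Note that the evidence p(x) only rescales the posterior, so the chosen label would be the same if the joint were compared directly.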

Generative Modeling • What are the candidate distributions for $p(x, y)$? • Too simple: underfitting • Too sparse: overfitting

Generative Modeling – Model Family • Define a model family for $p(x, y)$ • Inputs: Data and Prior Knowledge (Horses are rarely blue. Horses have 4 legs. … Neighboring pixels have similar labels.)

Generative Modeling – Likelihood • We look for the parameters that maximize the likelihood of the training data

Generative Model – i.i.d. Framework • The training pairs are assumed i.i.d., so the likelihood of the data factorizes as $\prod_i p(x^{(i)}, y^{(i)})$

Discriminative Modeling • Directly defines a conditional probability distribution $p(y \mid x)$ over the labels given the observation • Label Assignment: $y^* = \arg\max_y p(y \mid x)$

Discriminative Modeling • Does not include a model of $p(x)$ • Which is not needed for label assignment anyway!

Discriminative Modeling – Model Family • Define a model family for $p(y \mid x)$

Discriminative Modeling – Likelihood • We look for the parameters that maximize the conditional likelihood of the training data

Discriminative Model – i.i.d. Framework • The training pairs are assumed i.i.d., so the conditional likelihood of the data factorizes as $\prod_i p(y^{(i)} \mid x^{(i)})$

Generative vs. Discriminative

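What is Conditional Likelihood? • The joint likelihood factors as $p(x, y) = p(y \mid x)\, p(x)$ • The factor $p(x)$ is not required for the labeling task • The remaining factor $p(y \mid x)$ is the conditional likelihood!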

Generative Model – Example • Naïve Bayes: $p(x, y) = p(y) \prod_k p(x_k \mid y)$ [Figure: example images labeled Horse, Horse, Non-horse]
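A minimal sketch of Naïve Bayes as a generative model follows. It assumes binary features, Laplace smoothing, and a hypothetical toy dataset (`has_four_legs`, `is_blue`); none of these specifics are from the slides.

```python
import math
from collections import Counter

# Hypothetical toy data: x = (has_four_legs, is_blue), y = label.
DATA = [
    ((1, 0), "horse"), ((1, 0), "horse"), ((1, 0), "horse"),
    ((0, 1), "non-horse"), ((0, 0), "non-horse"), ((1, 1), "non-horse"),
]


def train_naive_bayes(data, alpha=1.0):
    """Estimate p(y) and Laplace-smoothed p(x_k = 1 | y) by counting."""
    label_counts = Counter(y for _, y in data)
    n_features = len(data[0][0])
    on_counts = {y: [0] * n_features for y in label_counts}
    for x, y in data:
        for k, value in enumerate(x):
            on_counts[y][k] += value
    prior = {y: c / len(data) for y, c in label_counts.items()}
    p_on = {
        y: [(on_counts[y][k] + alpha) / (label_counts[y] + 2 * alpha)
            for k in range(n_features)]
        for y in label_counts
    }
    return prior, p_on


def log_joint(x, y, prior, p_on):
    """log p(x, y) = log p(y) + sum_k log p(x_k | y)."""
    total = math.log(prior[y])
    for k, value in enumerate(x):
        p = p_on[y][k]
        total += math.log(p if value else 1.0 - p)
    return total


def predict(x, prior, p_on):
    """Label assignment: argmax_y p(x, y)."""
    return max(prior, key=lambda y: log_joint(x, y, prior, p_on))


PRIOR, P_ON = train_naive_bayes(DATA)
print(predict((1, 0), PRIOR, P_ON), predict((0, 1), PRIOR, P_ON))
```

The product over features in `log_joint` is exactly the independence assumption: each feature contributes its own factor given the label.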

Discriminative Model – Example • Logistic Regression: $p(y \mid x) = \frac{1}{Z(x)} \exp\{\lambda_y + \sum_j \lambda_{y,j} x_j\}$
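A minimal sketch of binary logistic regression trained by gradient ascent on the conditional log-likelihood. The toy data, learning rate, and step count are assumptions for illustration, and plain gradient ascent is used here only because it is the simplest optimizer to show.

```python
import math

# Hypothetical toy data: x = (has_four_legs, is_blue), y = 1 for horse, 0 otherwise.
DATA = [((1, 0), 1), ((1, 0), 1), ((1, 0), 1), ((0, 1), 0), ((0, 0), 0), ((1, 1), 0)]


def predict_proba(w, b, x):
    """p(y = 1 | x) for binary logistic regression."""
    z = b + sum(wj * xj for wj, xj in zip(w, x))
    return 1.0 / (1.0 + math.exp(-z))


def cond_log_likelihood(w, b, data):
    """sum_i log p(y_i | x_i); note that p(x) is never modeled."""
    total = 0.0
    for x, y in data:
        p = predict_proba(w, b, x)
        total += math.log(p if y == 1 else 1.0 - p)
    return total


def train(data, lr=0.5, steps=200):
    """Gradient ascent on the average conditional log-likelihood."""
    w = [0.0] * len(data[0][0])
    b = 0.0
    for _ in range(steps):
        grad_w = [0.0] * len(w)
        grad_b = 0.0
        for x, y in data:
            err = y - predict_proba(w, b, x)  # observed minus predicted
            grad_b += err
            for j, xj in enumerate(x):
                grad_w[j] += err * xj
        b += lr * grad_b / len(data)
        w = [wj + lr * g / len(data) for wj, g in zip(w, grad_w)]
    return w, b


W, B = train(DATA)
print(cond_log_likelihood([0.0, 0.0], 0.0, DATA), cond_log_likelihood(W, B, DATA))
```

The gradient has the "observed minus predicted" shape that reappears later in CRF training as empirical counts minus model expectations.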

Generative vs. Discriminative • Naïve Bayes: the features $x_k$ are independent given $y$ • Logistic Regression: can have complex dependencies among the features $x_k$ • A discriminative model is better suited to contain rich, overlapping features

Generative vs. Discriminative • Model the relation between $x$ = (age, weight, blood pressure) and $y$ = will suffer from a heart attack soon (binary) • Natural to model $p(y \mid x)$ • Unnatural to model $p(x \mid y)$

Generative vs. Discriminative • Generative: models $p(x, y)$; strict independence assumptions on the observations • Discriminative: models $p(y \mid x)$; allows arbitrary, inter-dependent features on the observation; does not spend effort on modeling $p(x)$

Classifiers and Graphical Models • Naïve Bayes and logistic regression predict a single variable • What about predicting many variables that are interdependent? • Use a graphical model

Sequence Models – HMM • A simple graphical model [Figure: a chain of hidden states, each emitting an observable variable]

Sequence Models – HMM • Parsing: TAKE/verb THE/article GREEN/adjective APPLE/noun FROM/preposition THE/article BOX/noun THEN/conjunction HIT/verb IT/pronoun WITH/preposition MY/adjective SWORD/noun
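A minimal sketch of an HMM tagger, assuming the usual factorization of the joint into transition and emission probabilities. The three-tag vocabulary and all probability values are hypothetical, and decoding is done by brute-force enumeration, which is only feasible for very short sentences.

```python
import itertools

STATES = ["article", "noun", "verb"]

# Hypothetical parameters: p(y_1), p(y_t | y_{t-1}) and p(x_t | y_t).
START = {"article": 0.4, "noun": 0.2, "verb": 0.4}
TRANS = {
    "article": {"article": 0.05, "noun": 0.9, "verb": 0.05},
    "noun": {"article": 0.2, "noun": 0.2, "verb": 0.6},
    "verb": {"article": 0.6, "noun": 0.3, "verb": 0.1},
}
EMIT = {
    "article": {"the": 0.9, "apple": 0.05, "take": 0.05},
    "noun": {"the": 0.05, "apple": 0.8, "take": 0.15},
    "verb": {"the": 0.05, "apple": 0.15, "take": 0.8},
}


def joint_prob(tags, words):
    """p(y, x) = p(y_1) p(x_1 | y_1) * prod_{t>1} p(y_t | y_{t-1}) p(x_t | y_t)."""
    p = START[tags[0]] * EMIT[tags[0]][words[0]]
    for t in range(1, len(words)):
        p *= TRANS[tags[t - 1]][tags[t]] * EMIT[tags[t]][words[t]]
    return p


def decode_brute_force(words):
    """Label assignment: the tag sequence maximizing the joint p(y, x)."""
    candidates = itertools.product(STATES, repeat=len(words))
    return max(candidates, key=lambda tags: joint_prob(tags, words))


print(decode_brute_force(["take", "the", "apple"]))
```

In practice the enumeration would be replaced by dynamic programming, but the quantity being maximized is the same.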

HMM – Exponential Form • Rewrite the HMM in terms of features [Figure: feature examples pairing noun with verb (transition) and noun with apple (observation)]
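One standard way to write this exponential form is sketched below. It assumes the usual HMM factorization into transition and emission probabilities with a fixed start state $y_0$; the weight symbols $\lambda_{ij}$ and $\mu_{oi}$ are notation assumed here, not recovered from the slide.

```latex
p(y, x) = \prod_{t} p(y_t \mid y_{t-1})\, p(x_t \mid y_t)
        = \exp\Big\{ \sum_t \sum_{i,j} \lambda_{ij}\, \mathbf{1}\{y_t = i\}\, \mathbf{1}\{y_{t-1} = j\}
                   + \sum_t \sum_{i,o} \mu_{oi}\, \mathbf{1}\{y_t = i\}\, \mathbf{1}\{x_t = o\} \Big\},
\qquad
\lambda_{ij} = \log p(y_t = i \mid y_{t-1} = j), \quad
\mu_{oi} = \log p(x_t = o \mid y_t = i).
```

Each indicator product is a feature, so the log of the joint is a weighted sum of feature counts.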

From HMM to CRF • The underlying conditional distribution: $p(y \mid x) = p(y, x) / \sum_{y'} p(y', x)$ • The denominator $Z(x)$ is a partition function computed per observation

From HMM to CRF • We can now use richer features of the observation for the same price!

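Linear-Chain CRF – Definition • $X, Y$ random vectors • $\Lambda = \{\lambda_k\}$ a parameter vector • $\{f_k(y, y', x_t)\}$ real-valued functions • Linear-chain CRF: $p(y \mid x) = \frac{1}{Z(x)} \exp\{\sum_t \sum_k \lambda_k f_k(y_t, y_{t-1}, x_t)\}$ • The conditional distribution of an HMM is a special case of a linear-chain CRF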
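A minimal sketch of the linear-chain CRF distribution, assuming indicator features of the form f_k(y_t, y_{t-1}, x_t). The feature templates (`trans`, `word`, `capitalized`) and the weights are hypothetical. The partition function Z(x) is computed twice, by enumeration and by the forward recursion, to show that the two agree.

```python
import itertools
import math

LABELS = ["noun", "verb"]


def local_score(weights, y_prev, y, x, t):
    """sum_k lambda_k * f_k(y_t, y_{t-1}, x_t) for hypothetical indicator features."""
    active = [("trans", y_prev, y), ("word", x[t], y)]
    if x[t][0].isupper():
        active.append(("capitalized", y))  # an overlapping feature of the observation
    return sum(weights.get(k, 0.0) for k in active)


def score(weights, y, x):
    """Unnormalized log-score of a full label sequence."""
    return sum(
        local_score(weights, y[t - 1] if t > 0 else None, y[t], x, t)
        for t in range(len(x))
    )


def log_z_brute_force(weights, x):
    """log Z(x) by summing over every label sequence (exponential cost)."""
    sequences = itertools.product(LABELS, repeat=len(x))
    return math.log(sum(math.exp(score(weights, y, x)) for y in sequences))


def log_z_forward(weights, x):
    """log Z(x) by the forward recursion (linear in the sequence length)."""
    alpha = {y: math.exp(local_score(weights, None, y, x, 0)) for y in LABELS}
    for t in range(1, len(x)):
        alpha = {
            y: sum(alpha[yp] * math.exp(local_score(weights, yp, y, x, t)) for yp in LABELS)
            for y in LABELS
        }
    return math.log(sum(alpha.values()))


def cond_prob(weights, y, x):
    """p(y | x) = exp(score) / Z(x)."""
    return math.exp(score(weights, y, x) - log_z_forward(weights, x))


WEIGHTS = {("word", "Hit", "verb"): 1.2, ("word", "apple", "noun"): 2.0,
           ("trans", "verb", "noun"): 0.8, ("capitalized", "verb"): 0.5}
X = ["Hit", "the", "apple"]
print(cond_prob(WEIGHTS, ("verb", "noun", "noun"), X))
```

The `capitalized` feature looks at the same word as the `word` feature; adding it costs nothing extra because Z(x) is computed per observation and p(x) is never modeled.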

Parameter Estimation – Maximum Likelihood • Maximize the conditional log likelihood: $\ell(\Lambda) = \sum_i \log p(y^{(i)} \mid x^{(i)})$ • Concave, so any local maximum is a global maximum!

Parameter Estimation – Maximum Likelihood • Take partial derivatives w.r.t. $\lambda_k$: each is the empirical mean of $f_k$ minus the model expectation of $f_k$ • There is no closed-form solution, since the $\lambda_k$ are coupled. Any alternatives? (Detour)
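The "empirical minus model expectation" structure of the gradient can be checked numerically on a tiny chain. The sketch below assumes hypothetical indicator features (`obs`, `trans`) and computes expectations by enumerating all label sequences, which is only practical for toy inputs; a finite-difference helper is included as a sanity check.

```python
import itertools
import math

LABELS = [0, 1]
# Hypothetical training set of (observation sequence, label sequence) pairs.
DATA = [(("a", "b", "a"), (0, 1, 0)), (("b", "b"), (1, 1))]
WEIGHTS = {("obs", "a", 0): 0.5, ("trans", 0, 1): -0.3}


def feature_counts(y, x):
    """Global counts F_k(y, x) = sum_t f_k(y_t, y_{t-1}, x_t)."""
    counts = {}
    for t in range(len(x)):
        keys = [("obs", x[t], y[t])]
        if t > 0:
            keys.append(("trans", y[t - 1], y[t]))
        for k in keys:
            counts[k] = counts.get(k, 0.0) + 1.0
    return counts


def score(weights, y, x):
    return sum(weights.get(k, 0.0) * v for k, v in feature_counts(y, x).items())


def log_likelihood(weights, data):
    """Conditional log-likelihood sum_i log p(y_i | x_i)."""
    total = 0.0
    for x, y in data:
        all_y = itertools.product(LABELS, repeat=len(x))
        log_z = math.log(sum(math.exp(score(weights, yy, x)) for yy in all_y))
        total += score(weights, y, x) - log_z
    return total


def gradient(weights, data):
    """d l / d lambda_k = empirical count of f_k minus its model expectation."""
    grad = {}
    for x, y in data:
        for k, v in feature_counts(y, x).items():
            grad[k] = grad.get(k, 0.0) + v  # empirical term
        all_y = list(itertools.product(LABELS, repeat=len(x)))
        z = sum(math.exp(score(weights, yy, x)) for yy in all_y)
        for yy in all_y:
            p = math.exp(score(weights, yy, x)) / z
            for k, v in feature_counts(yy, x).items():
                grad[k] = grad.get(k, 0.0) - p * v  # model expectation term
    return grad


def finite_difference(weights, data, key, eps=1e-5):
    """Central-difference estimate of the same partial derivative."""
    up, down = dict(weights), dict(weights)
    up[key] = up.get(key, 0.0) + eps
    down[key] = down.get(key, 0.0) - eps
    return (log_likelihood(up, data) - log_likelihood(down, data)) / (2 * eps)


GRAD = gradient(WEIGHTS, DATA)
print(GRAD[("obs", "a", 0)], finite_difference(WEIGHTS, DATA, ("obs", "a", 0)))
```

Every weight appears inside Z(x), which is why setting the gradient to zero does not give one independent equation per weight.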

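Maximum Likelihood – Maximum Entropy • Maximum likelihood: we assumed the exponential form, we maximized the conditional likelihood, and we got model expectation = empirical mean for every feature • Maximum entropy: we assume model expectation = empirical mean for every feature, we maximize the conditional entropy, and we get the exponential form • We get the same distribution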

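Parameter Estimation – Finding $\Lambda$ • Given the current parameter estimate $\Lambda$ • Find a new set of parameters $\Lambda + \delta$ s.t. the gain in likelihood $\ell(\Lambda + \delta) - \ell(\Lambda)$ is non-negative • Repeat until convergence

Parameter Estimation – Finding $\Lambda$ • Bound the gain in likelihood from below with an auxiliary function $A(\delta)$ • Maximize $A(\delta)$ w.r.t. $\delta$ • Update $\lambda_k \leftarrow \lambda_k + \delta_k$

Parameter Estimation – Finding $\Lambda$ • Improved Iterative Scaling Algorithm (IIS): • Start with some (arbitrary) value for each $\lambda_k$ • Repeat until convergence: • Solve $\partial A / \partial \delta_k = 0$ for $\delta_k$ • Set $\lambda_k \leftarrow \lambda_k + \delta_k$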

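Parameter Estimation: IIS – S algorithm • Differentiating the auxiliary function w.r.t. $\delta_k$ gives an equation that involves the total feature count $T(x, y) = \sum_t \sum_k f_k(y_t, y_{t-1}, x_t)$ • Note that if the total feature count is constant, the equation can be solved in closed form

Parameter Estimation: IIS – S algorithm • Define a new slack feature $s(x, y) = S - T(x, y)$, so that the total feature count including the slack is the constant $S$ • And we have an additional constraint for $S$: it must be large enough that $s(x, y) \ge 0$

Parameter Estimation: IIS – S algorithm • For each training sequence we need to compute marginals at every iteration! • The features are local in the labels, so only marginals over neighboring label pairs are needed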

Computing Marginals with BP • Computing marginals in a linear-chain CRF: efficient and exact with BP • Cost for the IIS algorithm: BP is run (# optimization steps) × (# sequences in the training set) times

Parameter Estimation: IIS(S) – Summary • Closed-form solution • Converges to the global maximum • $S$ is proportional to the length of the longest training sequence • Small optimization steps for large $S$ • Alternative: T algorithm
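On a chain, BP reduces to the forward-backward recursions. The sketch below computes single-node marginals from positive potentials and compares them with brute-force enumeration; the potential values are hypothetical, and the pairwise marginals needed for transition features would be obtained from the same `alpha` and `beta` tables.

```python
import itertools


def forward_backward(node_pot, edge_pot):
    """Return p(y_t = y | x) for every position t on a chain.

    node_pot[t][y] and edge_pot[y_prev][y] are positive potentials.
    """
    T, K = len(node_pot), len(node_pot[0])
    alpha = [[0.0] * K for _ in range(T)]
    beta = [[1.0] * K for _ in range(T)]
    alpha[0] = list(node_pot[0])
    for t in range(1, T):  # forward messages
        for y in range(K):
            incoming = sum(alpha[t - 1][yp] * edge_pot[yp][y] for yp in range(K))
            alpha[t][y] = node_pot[t][y] * incoming
    for t in range(T - 2, -1, -1):  # backward messages
        for y in range(K):
            beta[t][y] = sum(
                edge_pot[y][yn] * node_pot[t + 1][yn] * beta[t + 1][yn] for yn in range(K)
            )
    z = sum(alpha[T - 1])  # partition function
    return [[alpha[t][y] * beta[t][y] / z for y in range(K)] for t in range(T)]


def brute_force_marginals(node_pot, edge_pot):
    """Same marginals by enumerating all K**T label sequences."""
    T, K = len(node_pot), len(node_pot[0])
    marg = [[0.0] * K for _ in range(T)]
    z = 0.0
    for ys in itertools.product(range(K), repeat=T):
        w = node_pot[0][ys[0]]
        for t in range(1, T):
            w *= edge_pot[ys[t - 1]][ys[t]] * node_pot[t][ys[t]]
        z += w
        for t, y in enumerate(ys):
            marg[t][y] += w
    return [[m / z for m in row] for row in marg]


# Hypothetical potentials for a chain of length 3 with 2 labels.
NODE_POT = [[1.0, 2.0], [0.5, 1.5], [3.0, 1.0]]
EDGE_POT = [[2.0, 1.0], [1.0, 3.0]]
MARGINALS = forward_backward(NODE_POT, EDGE_POT)
print(MARGINALS)
```

The recursions cost on the order of T·K² operations, compared with K^T for enumeration, which is what makes exact training of linear-chain CRFs feasible.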
