Discover how Conditional Random Fields (CRF), a special case of Markov Random Fields, are used to model dependencies and independencies efficiently in joint probability distributions, simplifying computations for sequence labeling tasks like Part-of-Speech Tagging.
Conditional Random Fields: Probabilistic Models for Segmenting and Labeling Sequence Data Yingbo Max Wang, Christian Warloe, Yolanda Xiao, Wenlong Xiong
Overview • Joint Probability with Markov Random Fields (MRF) • Conditional Random Fields (CRF), a special case of MRF • Inference for CRF • Parameter Estimation for CRF • Experimental Results
Modeling Joint Probability • How do we model the joint probability distribution for a group of random variables? • With no independence assumption, the number of values to specify is exponential • The full table for P(x_1 ... x_n) has | # of outcomes in each random variable | ^ (# of random variables) entries • With a complete independence assumption, the number of values is linear, but this is an oversimplification in most cases (variables are actually correlated) • P(x_1 ... x_n) = P(x_1) ... P(x_n) • This needs only | # of outcomes in each random variable | * (# of random variables) entries • We need some middle ground • Model dependence and independence between random variables efficiently
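The gap between the two extremes is easy to see numerically. The sketch below is only an illustration: the function names and the choice of 20 binary variables are assumptions, not part of the slides.

```python
def joint_table_size(num_outcomes: int, num_variables: int) -> int:
    """Entries in a full joint table with no independence assumption."""
    return num_outcomes ** num_variables


def independent_table_size(num_outcomes: int, num_variables: int) -> int:
    """Entries needed when every variable is modeled independently."""
    return num_outcomes * num_variables


# Hypothetical setting: 20 binary random variables.
print(joint_table_size(2, 20))        # exponential in the number of variables
print(independent_table_size(2, 20))  # linear in the number of variables
```

An MRF sits between these two counts: its cost grows with the size of the largest clique rather than with the total number of variables.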
Markov Random Fields (MRF) • MRF Definition: • Undirected graph G = (V, E) • Set of random variables X = (X_v), indexed by the nodes v in V • Edges represent correlations between random variables • X is an MRF with respect to G if X satisfies the local Markov property • Local Markov Property • Each variable is conditionally independent of all other variables, given its neighbors in the graph G: $P(X_v \mid X_{V \setminus \{v\}}) = P(X_v \mid X_{N(v)})$ • N(v) are the neighbors of v (nodes that are directly connected to v by a single edge)
Markov Random Fields (MRF) • What is the significance of an MRF? • Compact, graphical representation of dependencies between variables • Each variable only depends on its immediate neighbors • These conditional independencies imply that we can factorize the joint probability • Factorization simplifies computations and reduces the amount of calculation needed • Remember how the independence assumption simplified calculation? • How do we factorize the joint probability? • Factorize into functions on cliques • The Hammersley-Clifford theorem proves this is valid
Cliques • Clique Definition • A clique is a complete subgraph of G • A complete subgraph is a subset of vertices such that every 2 distinct vertices in the subset are adjacent • Example (graph on nodes A, B, C, D): • Red groups are cliques of 1 node • Orange groups are also cliques because: • A, B, C are all adjacent (but A is not adjacent to D) • C, D are adjacent (but D is not adjacent to A or B)
Hammersley-Clifford Theorem • Called the fundamental theorem of random fields • Statement: a strictly positive distribution is a Markov Random Field if and only if its joint probability can be written as: $P(X) = \frac{1}{Z} \prod_{c \in C} F_c(x_c)$ • C is the set of all cliques, x_c is the set of random variables in clique c • F_c is some "potential function" that acts on clique c (is strictly positive) • Z is the partition function (normalizing constant that makes the probabilities sum to 1): $Z = \sum_{x} \prod_{c \in C} F_c(x_c)$ • P(X) is the joint probability of the set of random variables • The joint probability can be factorized into the product of "clique potentials"
Factorization Example • Cliques are: • (A), (B), (C), (D), (AB), (AC), (BC), (CD), (ABC) • Maximal cliques (not a subset of another clique): • (ABC), (CD) • Therefore, if we only consider maximal cliques: $P(A, B, C, D) = \frac{1}{Z} F_{ABC}(A, B, C)\, F_{CD}(C, D)$
Clique Potentials • Clique potentials are usually written as an exponential function: $F_c(x_c) = \exp\Big( \sum_k w_k f_k(x_c) \Big)$ • { f_k } are the local features on x_c • w_k are weights for each feature f_k • The exponential keeps the clique potential strictly positive • The clique potential is parameterized using user-defined local features • This allows the joint probability to be written as: $P(X) = \frac{1}{Z} \exp\Big( \sum_{c \in C} \sum_k w_k f_k(x_c) \Big)$
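To make the log-linear form concrete, here is a small sketch that computes the joint probability by brute force. It assumes four binary variables with the maximal cliques (ABC) and (CD); the "all values agree" feature and the weights are hypothetical choices for illustration.

```python
import math
from itertools import product

VARIABLES = ["A", "B", "C", "D"]
CLIQUES = [("A", "B", "C"), ("C", "D")]

# Hypothetical weights, one per clique.
WEIGHTS = {("A", "B", "C"): 1.5, ("C", "D"): 0.5}


def all_equal(values):
    """Hypothetical local feature: 1.0 when all variables in the clique agree."""
    return 1.0 if len(set(values)) == 1 else 0.0


def score(assignment):
    """Sum of w_k * f_k(x_c) over cliques, i.e. the exponent of the joint."""
    return sum(
        WEIGHTS[c] * all_equal(tuple(assignment[v] for v in c)) for c in CLIQUES
    )


def partition_function():
    """Z: sum of exp(score) over every joint assignment (exponential cost)."""
    return sum(
        math.exp(score(dict(zip(VARIABLES, values))))
        for values in product([0, 1], repeat=len(VARIABLES))
    )


def joint_probability(assignment):
    return math.exp(score(assignment)) / partition_function()


print(joint_probability({"A": 0, "B": 0, "C": 0, "D": 0}))
```

Computing Z this way enumerates every assignment, which is exactly the cost that the chain-structured algorithms later in the deck avoid.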
Recap: MRF • Set of variables, some are dependent, some are independent • MRF lets us compactly model a joint distribution, with some independence assumptions • Hammersley-Clifford theorem lets us factorize the joint probability into clique potentials • Clique potentials can be parameterized using local features and weights • TLDR: $P(X) = \frac{1}{Z} \exp\Big( \sum_{c \in C} \sum_k w_k f_k(x_c) \Big)$
Part-of-Speech Tagging • How would we use an MRF? • Part-of-Speech Tagging Problem: • Model 2 sequences of random variables (length N each) • X - input - sequence of words / a sentence (observations) • Y - output - sequence of labels / tags (hidden states)
X: [bob ] [made] [her ] [happy    ] [the    ] [other    ] [day ]
Y: [noun] [verb] [noun] [adjective] [article] [adjective] [noun]
Discriminative vs Generative Models • But MRF and HMM are both generative models • Uses a joint distribution P(X, Y) • We don't want to have to model P(X) explicitly, if we only observe a subset of it • Modeling P(X) requires making a lot of assumptions • Discriminative model • Uses conditional probability P(Y | X) • Doesn't model P(X), is just conditioned on it instead • Conditional Random Fields are a special case of MRF that are discriminative
Conditional Random Fields (CRF) • We have a graph on a set of random variables {X, Y}, but then fix the observed variables {X} • If the random variables {Y}, conditioned on {X}, obey the Markov property with respect to the graph, then {X, Y} is a CRF
Conditional Random Fields • We can define a conditional probability instead of a joint probability for CRFs: $P(y \mid x) = \frac{1}{Z(x)} \prod_{c \in C} F_c(y_c, x)$ • Z(x) is a normalization constant for x: $Z(x) = \sum_{y'} \prod_{c \in C} F_c(y'_c, x)$ • The conditional probability factorizes into functions on cliques, just like an MRF
Linear Chain CRF • Same graph as linear-chain MRF • Hidden states (labels) form a sequence, and are conditioned on observations (words) • We observe sequence X (white nodes) • We don’t make any assumption on the relationship between Xs • Cliques are Nodes and Edges • The CRF paper splits features into edge features and vertex features
Defining the CRF Model • Conditional Probability P(y | x) • y is a sequence of hidden states, each of which can take on one of a finite set of label values • x is a sequence of observations, each of which can take on one of a finite set of observation values
Defining the CRF Model • Conditional Probability: $P(y \mid x) = \frac{1}{Z(x)} \exp\Big( \sum_i \sum_k \lambda_k f_k(y_{i-1}, y_i, x, i) + \sum_i \sum_k \mu_k g_k(y_i, x, i) \Big)$ • Features are given and fixed • f_k are features on "hidden state edges" (ex: Y_{i-1} is a noun and Y_i is a verb, given X) • g_k are features on "hidden state vertices" (ex: Y_i is a noun, given X) • lambda_k and mu_k are parameters for each feature • Z(x) is the normalization constant, which depends on all the observations x
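Defining the CRF Model • Since the CRF is a linear chain, we can define "transition weights" from one hidden state in the sequence to the next hidden state: $\Lambda_i(y', y \mid x) = \sum_k \lambda_k f_k(y', y, x, i) + \sum_k \mu_k g_k(y, x, i)$, with the features evaluated at position i • hidden state ( i ) takes on value y • hidden state ( i - 1 ) takes on value y'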
Defining the CRF Model • Define a matrix M_i, that represents every transition from hidden state ( i - 1 ) to hidden state ( i ) • Let’s look at an example first
Conditional Probability Example • We have a hidden state sequence • With added start and end states • We want to find the probability of a sequence of states, given X
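Y_S, Y_1, Y_2, Y_3, Y_E: hidden states • A, B, Start (S), End (E): values that the hidden states can take • Edges in the graph: • Looking at all the edges between two hidden states Y_{i-1} and Y_i gives a 4 x 4 table: rows are the previous value y' in {S, A, B, E}, columns are the current value y in {S, A, B, E}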
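Defining the CRF Model • Define a matrix M_i that represents every transition from hidden state ( i - 1 ) to hidden state ( i ): $M_i(y', y \mid x) = \exp \Lambda_i(y', y \mid x)$, where $\Lambda_i$ is the transition weight for moving from value y' to value y at position i • We can use this matrix to define Z and P(Y|X) • For n hidden states, with y_0 = start and y_{n+1} = end: $Z(x) = \big( M_1(x)\, M_2(x) \cdots M_{n+1}(x) \big)_{\text{start},\,\text{end}}$ • $P(y \mid x) = \frac{\prod_{i=1}^{n+1} M_i(y_{i-1}, y_i \mid x)}{Z(x)}$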
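The matrix form can be sanity-checked against brute-force enumeration. The sketch below assumes the four values S, A, B, E from the example and fills the matrices M_1..M_{n+1} with random positive numbers as stand-ins for exp(transition weight); none of these numbers come from the slides.

```python
import math
import random
from itertools import product

VALUES = ["S", "A", "B", "E"]  # start, two labels, end
K = len(VALUES)
START, END = VALUES.index("S"), VALUES.index("E")


def random_matrices(n, seed=0):
    """n + 1 hypothetical matrices; entry [y_prev][y] plays the role of M_i(y', y | x)."""
    rng = random.Random(seed)
    return [
        [[math.exp(rng.uniform(-1, 1)) for _ in range(K)] for _ in range(K)]
        for _ in range(n + 1)
    ]


def matmul(a, b):
    return [[sum(a[r][k] * b[k][c] for k in range(K)) for c in range(K)] for r in range(K)]


def partition(matrices):
    """Z(x): the (start, end) entry of the product M_1 M_2 ... M_{n+1}."""
    result = matrices[0]
    for m in matrices[1:]:
        result = matmul(result, m)
    return result[START][END]


def path_score(matrices, labels):
    """Unnormalized score of labels y_1..y_n, padded with start and end."""
    path = [START] + list(labels) + [END]
    score = 1.0
    for i, m in enumerate(matrices):
        score *= m[path[i]][path[i + 1]]
    return score


def conditional_probability(matrices, labels):
    return path_score(matrices, labels) / partition(matrices)


def brute_force_partition(matrices):
    """Sum the score of every possible label sequence (exponential cost)."""
    n = len(matrices) - 1
    return sum(path_score(matrices, labels) for labels in product(range(K), repeat=n))


matrices = random_matrices(3)
print(partition(matrices), brute_force_partition(matrices))
```

The matrix product needs a number of multiplications linear in the sequence length, while the brute-force sum grows exponentially with it.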
Recap: CRF • Conditional Random Fields follow from MRF • Discriminative model instead of Generative • All the advantages of MRF (compactly models dependence assumptions) • Conditional Random Fields factor the conditional probability into: • features that act on cliques • weights for each feature • cliques are edges and nodes in graph • Questions: • How to perform inference? • How to train (parameter estimation)?
Inference • How do we perform inference if we know the model parameters? • How do we find the most likely hidden state sequence y? • To predict the label sequence, we maximize the conditional probability: $y^* = \arg\max_y P(y \mid x)$ • We use the Viterbi Algorithm
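Viterbi Algorithm • Given the model, find the most likely sequence of hidden states • Approach: Recursion + Dynamic Programming (same as HMM) • Update for HMM: $\delta_t(j) = \max_{i \in S} \delta_{t-1}(i)\, P(y_t = j \mid y_{t-1} = i)\, P(x_t \mid y_t = j)$ • Update for CRF: $\delta_t(j) = \max_{i \in S} \delta_{t-1}(i)\, M_t(i, j \mid x)$ • S is the set of values y can take on. i, j are values in S • delta_t ( j ) is the maximum "probability" (an unnormalized score in the CRF case) of the most likely path ending at y_t = j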
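A minimal sketch of the CRF version of the recursion is shown below. It assumes the model is already given as a list of transition-score matrices (one per step, including the steps from the start value and into the end value); the random matrices in the demo are placeholders, not trained parameters.

```python
import math
import random
from itertools import product


def viterbi(matrices, start, end):
    """Most likely labels y_1..y_n (n >= 1) under matrices M_1..M_{n+1}.

    matrices[t][i][j] is the unnormalized score of moving from value i to
    value j at step t + 1. The path is pinned to `start` before the first
    label and to `end` after the last one.
    """
    num_values = len(matrices[0])
    n = len(matrices) - 1
    # delta[j]: best score of any partial path ending in value j.
    delta = [matrices[0][start][j] for j in range(num_values)]
    backpointers = []
    for t in range(1, n):
        new_delta, pointers = [], []
        for j in range(num_values):
            best_i = max(range(num_values), key=lambda i: delta[i] * matrices[t][i][j])
            pointers.append(best_i)
            new_delta.append(delta[best_i] * matrices[t][best_i][j])
        delta = new_delta
        backpointers.append(pointers)
    # Close the path with the transition into the end value.
    last = max(range(num_values), key=lambda i: delta[i] * matrices[n][i][end])
    best_score = delta[last] * matrices[n][last][end]
    # Follow the backpointers from the last label to the first.
    path = [last]
    for pointers in reversed(backpointers):
        path.append(pointers[path[-1]])
    path.reverse()
    return path, best_score


# Hypothetical demo: 3 values (0 = start, 2 = end), 4 labeled positions.
rng = random.Random(1)
K, N = 3, 4
demo = [[[math.exp(rng.uniform(-1, 1)) for _ in range(K)] for _ in range(K)] for _ in range(N + 1)]
print(viterbi(demo, start=0, end=2))
```

In practice the products are usually replaced by sums of log scores to avoid underflow on long sequences; the plain products are kept here to mirror the update on the slide.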
Calculating Marginal Probabilities • How do we calculate the most likely label for a specific state in the sequence? (or most likely transition for a pair of states?) • Use the Forward-Backward algorithm to calculate marginal probabilities • Probability of an edge/vertex is the normalized sum of all paths through that edge / vertex • Use forward and backward vectors to cache these sums
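Calculating Marginal Probabilities • Forward and backward vectors: $\alpha_i(x) = \alpha_{i-1}(x)\, M_i(x)$ with $\alpha_0$ the indicator of the start value, and $\beta_i(x)^\top = M_{i+1}(x)\, \beta_{i+1}(x)$ with $\beta_{n+1}$ the indicator of the end value • To calculate the probability of an edge being in a path: $P(Y_{i-1} = y', Y_i = y \mid x) = \frac{\alpha_{i-1}(y' \mid x)\, M_i(y', y \mid x)\, \beta_i(y \mid x)}{Z(x)}$ • To calculate the probability of a vertex being in a path: $P(Y_i = y \mid x) = \frac{\alpha_i(y \mid x)\, \beta_i(y \mid x)}{Z(x)}$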
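The edge and vertex marginals can be sketched with forward and backward vectors as follows. This assumes the same matrix representation of a linear-chain CRF (one transition-score matrix per step, with fixed start and end values); the random matrices are placeholders for illustration.

```python
import math
import random
from itertools import product


def forward_backward(matrices, start, end):
    """Return forward vectors, backward vectors and Z for M_1..M_{n+1}."""
    k = len(matrices[0])
    # alpha[i][y]: summed score of all partial paths from start to y at position i.
    alpha = [[1.0 if y == start else 0.0 for y in range(k)]]
    for m in matrices:
        prev = alpha[-1]
        alpha.append([sum(prev[a] * m[a][b] for a in range(k)) for b in range(k)])
    # beta[i][y]: summed score of all partial paths from y at position i to end.
    beta = [[1.0 if y == end else 0.0 for y in range(k)]]
    for m in reversed(matrices):
        nxt = beta[0]
        beta.insert(0, [sum(m[a][b] * nxt[b] for b in range(k)) for a in range(k)])
    z = alpha[-1][end]
    return alpha, beta, z


def vertex_marginal(alpha, beta, z, i, y):
    """P(Y_i = y | x) for a position i in 1..n."""
    return alpha[i][y] * beta[i][y] / z


def edge_marginal(matrices, alpha, beta, z, i, y_prev, y):
    """P(Y_{i-1} = y_prev, Y_i = y | x) for the i-th transition."""
    return alpha[i - 1][y_prev] * matrices[i - 1][y_prev][y] * beta[i][y] / z


# Hypothetical demo: 3 values (0 = start, 2 = end), 3 labeled positions.
rng = random.Random(0)
K, N = 3, 3
matrices = [[[math.exp(rng.uniform(-1, 1)) for _ in range(K)] for _ in range(K)] for _ in range(N + 1)]
alpha, beta, z = forward_backward(matrices, start=0, end=2)
print([round(vertex_marginal(alpha, beta, z, 2, y), 3) for y in range(K)])
```

Each vector is computed once and reused for every edge and vertex, which is what "caching the sums" buys over re-enumerating paths.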
Parameter Estimation for CRF • We want to find the best values for μ,λ
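Objective Function • How do we define which parameters are best? • Normalized Log Likelihood Function: $O(\theta) = \sum_{j=1}^{N} \log p_\theta(y^{(j)} \mid x^{(j)}) \propto \sum_{x, y} \tilde{p}(x, y) \log p_\theta(y \mid x)$ • θ = (λ, μ) are the parameters, (x^{(j)}, y^{(j)}) are the N training pairs, and $\tilde{p}$ is the empirical distribution of the training data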
Improved Iterative Scaling Algorithm • We want to change the parameters (λ_k ← λ_k + δλ_k, μ_k ← μ_k + δμ_k) in a way that increases the log likelihood • Trying to maximize this directly results in a set of highly coupled equations • We instead maximize a simpler lower bound
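Improved Iterative Scaling Algorithm • Take the derivative and set to zero to find the parameter change that maximizes the increase in likelihood • For an edge feature f_k, the update δλ_k solves: $\tilde{E}[f_k] = \sum_{x, y} \tilde{p}(x)\, p(y \mid x) \sum_{i=1}^{n+1} f_k(y_{i-1}, y_i, x, i)\, e^{\delta\lambda_k T(x, y)}$ • $T(x, y) = \sum_{i, k} f_k(y_{i-1}, y_i, x, i) + \sum_{i, k} g_k(y_i, x, i)$ is the total feature count • $\tilde{E}[f_k]$ is the expected value of f_k under the empirical distribution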
Algorithm S • How do we sum over a varying T(x,y) (the total feature count, which differs for every pair of sequences)? • How do we sum over all y (an exponential number of combinations)?
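Algorithm S • Idea 1: Use a constant upper bound S instead of T(x,y) • Add a slack feature $s(x, y) = S - \sum_{i, k} f_k(y_{i-1}, y_i, x, i) - \sum_{i, k} g_k(y_i, x, i)$ • S is chosen so that s(x, y) ≥ 0 for every training observation sequence x and every y, which makes the total feature count equal to S everywhere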
Algorithm S • Idea 2: Since each feature only depends on a single edge or vertex, sum over all possible edges/vertices instead of sequences (using marginal probabilities)
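Final Update Equations • $\delta\lambda_k = \frac{1}{S} \log \frac{\tilde{E}[f_k]}{E[f_k]}$ • $\tilde{E}[f_k]$ is the expected value of feature f_k under the empirical (training) distribution • $E[f_k] = \sum_x \tilde{p}(x) \sum_{i=1}^{n+1} \sum_{y', y} f_k(y', y, x, i)\, \frac{\alpha_{i-1}(y' \mid x)\, M_i(y', y \mid x)\, \beta_i(y \mid x)}{Z(x)}$ is its expected value under the model, computed from edge marginals • Define the update equation for μ_k similarly, using the marginal probability of a vertex instead of an edge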
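As a rough illustration of a single parameter update, the sketch below assumes the Algorithm S update has the closed form δλ_k = (1/S) log(Ẽ[f_k] / E[f_k]), as in the CRF paper. The function name and the example numbers are hypothetical.

```python
import math


def algorithm_s_update(empirical_expectation, model_expectation, slack_constant):
    """Assumed closed-form update: (1 / S) * log(E~[f_k] / E[f_k])."""
    return math.log(empirical_expectation / model_expectation) / slack_constant


# Hypothetical numbers: the feature fires more often in the data than the
# current model predicts, so its weight should go up.
print(algorithm_s_update(empirical_expectation=0.4, model_expectation=0.2, slack_constant=10.0))
```

The 1/S factor shows directly why a large slack constant makes each step small: doubling S halves every update.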
Improving on Algorithm S • S is usually very large (proportional to the length of the longest training sequence) • Dataset has sequences of varying length • Large S causes parameter updates to be very small • Long time to convergence • Can we use a better approximation of T(x,y)?
Algorithm T • Instead of taking a global upper bound on T(x,y), take the upper bound given x (per-sequence S calculation): $T(x) = \max_y T(x, y)$
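Algorithm T • Group sums by the values of T(x): $a_{k,t} = \sum_x \tilde{p}(x) \sum_{i, y', y} P(Y_{i-1} = y', Y_i = y \mid x)\, f_k(y', y, x, i)\, \delta(t, T(x))$, where δ(t, T(x)) is 1 if T(x) = t and 0 otherwise • The update for λ_k then solves the polynomial equation $\tilde{E}[f_k] = \sum_{t=0}^{T_{\max}} a_{k,t}\, \beta_k^{\,t}$, with $\beta_k = \exp(\delta\lambda_k)$ (not to be confused with the backward vectors) and $T_{\max} = \max_x T(x)$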
Algorithm T • We can use Newton’s method to find the root of the resulting polynomials
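The root-finding step might look like the sketch below. It assumes the equation to solve has the form sum_t a_t * beta^t = target with nonnegative coefficients a_t, and that the parameter update is the log of the positive root; the coefficient values in the demo are made up.

```python
import math


def solve_beta(coefficients, target, beta=1.0, tolerance=1e-12, max_iterations=100):
    """Newton's method for sum_t coefficients[t] * beta**t == target, beta > 0.

    Assumes nonnegative coefficients with at least one positive entry for
    t >= 1, so the polynomial is increasing and convex for beta > 0.
    """
    for _ in range(max_iterations):
        value = sum(a * beta ** t for t, a in enumerate(coefficients)) - target
        slope = sum(t * a * beta ** (t - 1) for t, a in enumerate(coefficients) if t > 0)
        step = value / slope
        beta -= step
        if abs(step) < tolerance:
            break
    return beta


# Hypothetical coefficients a_0, a_1, a_2 and empirical expectation 0.6:
# solves 0.2 * beta + 0.1 * beta**2 = 0.6.
beta_k = solve_beta([0.0, 0.2, 0.1], target=0.6)
delta_lambda_k = math.log(beta_k)
print(beta_k, delta_lambda_k)
```

Starting from beta = 1 corresponds to a zero update; because the polynomial is convex and increasing on the positive axis, the iterates approach the root from the right after at most one step.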
Experimental Results • Experiments • Modeling Mixed-Order Sources • Part-of-Speech (POS) Tagging • Models Tested • Hidden Markov Model (HMM): Generative • Conditional Random Field (CRF): Discriminative • Maximum-Entropy Markov Model (MEMM): Discriminative • Conditions each transition locally on the current hidden state and normalizes per state, rather than normalizing probabilities globally over the whole sequence • Suffers from the Label Bias Problem
Modeling Mixed-Order Sources • Data Generation • Synthetic data generated from randomly chosen HMMs, each a mixture of a first-order and a second-order model • State transition probability: $p_\alpha(y_i \mid y_{i-1}, y_{i-2}) = \alpha\, p_2(y_i \mid y_{i-1}, y_{i-2}) + (1 - \alpha)\, p_1(y_i \mid y_{i-1})$ • Emission probability: $p_\alpha(x_i \mid y_i, x_{i-1}) = \alpha\, p_2(x_i \mid y_i, x_{i-1}) + (1 - \alpha)\, p_1(x_i \mid y_i)$ • Training and testing data: 1000 sequences of length 25 • Training and Testing • CRFs are trained with Algorithm S; the Viterbi Algorithm is used to label a test set • MEMMs and CRFs do not use overlapping features of the observations
Modeling Mixed-Order Sources • Results • Error rates increase for all models when data become "more second order" • CRF typically outperforms MEMM, except for a few cases with small error rate (α < 0.01) • Maybe an insufficient number of CRF training iterations • HMM almost always outperforms MEMM • CRF typically outperforms HMM when data are mostly second-order (α > 1/2) • (Figure legend: α < 1/2, α > 1/2, α < 0.01)
Part-of-Speech (POS) Tagging • Dataset • Penn Treebank part-of-speech tagset • 45 syntactic tags • 50% training data, 50% testing data • Experiment #1 • First-order HMM, MEMM, CRF • Results • Accuracy: CRF > HMM > MEMM • MEMM is hurt by the Label Bias Problem