First-Order Probabilistic Models for Coreference Resolution Aron Culotta Computer Science Department University of Massachusetts Amherst Joint work with Andrew McCallum [advisor], Michael Wick, Robert Hall
Previous work: Conditional Random Fields for Coreference
A Pairwise Conditional Random Field for Coreference (PW-CRF) [McCallum & Wellner, 2003, ICML] [Figure: three mentions x1 "Mr Powell", x2 "Powell", x3 "she", connected by binary coreference variables y (e.g., Coreferent(x2,x3)?) with pairwise compatibility scores such as 45, -30, and 11. The pairwise compatibility scores are learned from training data; hard transitivity constraints are enforced by the prediction algorithm.]
Prediction in PW-CRFs = Graph Partitioning [Boykov, Veksler, Zabih, 1999], [Kolmogorov & Zabih, 2002], [Yu, Cross, Shi, 2002] [Figure: the same three mentions with pairwise scores 45, -30, and 11; the displayed partition attains a total score of 64. Prediction is often approximated with agglomerative clustering.]
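Below is a minimal sketch of that agglomerative approximation, assuming a learned pairwise scorer score(a, b) that returns the edge weight between two mentions (helper names are hypothetical, not the authors' code):

def agglomerative_coref(mentions, score):
    # start with singleton clusters and greedily apply the best-scoring merge
    clusters = [[m] for m in mentions]
    while True:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                s = sum(score(a, b) for a in clusters[i] for b in clusters[j])
                if best is None or s > best[0]:
                    best = (s, i, j)
        if best is None or best[0] <= 0:   # stop when no merge has positive score
            break
        _, i, j = best
        clusters[i].extend(clusters.pop(j))
    return clusters

On the slide's example scores, x1 and x2 merge (45 > 0), while x3 stays in its own cluster because its links to the merged cluster sum to -19.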
Parameter Estimation in PW-CRFs • Given labeled documents, generate all pairs of mentions • Optionally prune distant mention pairs [Soon, Ng, Lim 2001] • Learn binary classifier to predict coreference • Edge weights proportional to classifier output
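A sketch of how this pairwise training set might be built; the Mention fields, the features function, and the choice of scikit-learn classifier are assumptions for illustration, not the authors' setup:

from itertools import combinations
from sklearn.linear_model import LogisticRegression

def pairwise_examples(documents, features, max_separation=10):
    X, y = [], []
    for doc in documents:                     # doc.mentions assumed in textual order
        for a, b in combinations(doc.mentions, 2):
            if b.position - a.position > max_separation:
                continue                      # optional pruning of distant pairs
            X.append(features(a, b))
            y.append(1 if a.entity_id == b.entity_id else 0)
    return X, y

# clf = LogisticRegression().fit(*pairwise_examples(train_docs, features))
# edge weight for (a, b) is then proportional to clf.decision_function([features(a, b)])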
Sometimes pairwise comparisons are insufficient • Entities have multiple attributes (name, email, institution, location); we need to measure "compatibility" among them • Having 2 "given names" is common, but not 4 • e.g., Howard M. Dean / Martin, Dean / Howard Martin • We need measures over whole clusters of mentions • Is there a pair of name strings whose edit distance exceeds 0.5? • Maximum distance between mentions in the document • Does an entity contain only pronoun mentions? We need measures on hypothesized "entities": we need first-order logic
First-Order Logic CRFs for Coreference (FOL-CRF) [Figure: mentions x1 "Mr Powell", x2 "Powell", x3 "she" joined by a single factor Coreferent(x1,x2,x3)? with score -56. This clusterwise compatibility score is learned from training data; features are arbitrary first-order-logic predicates over a set of mentions. As in the PW-CRF, prediction can be approximated with agglomerative clustering.]
Learning Parameters of FOL-CRFs • Generate classification examples where the input is a set of mentions • Unlike the pairwise CRF, we cannot generate all possible examples from the training data
Learning Parameters of FOL-CRFs: Combinatorial Explosion! [Figure: over mentions such as She, He, Powell, Rice, he, Secretary, the candidate factors grow from Coreferent(x1,x2) through Coreferent(x1,x2,x3), Coreferent(x1,x2,x3,x4), Coreferent(x1,x2,x3,x4,x5), up to Coreferent(x1,x2,x3,x4,x5,x6) and beyond.]
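A quick way to see the explosion: the number of candidate mention sets of size at least 2 is 2^n - n - 1, so enumeration is hopeless even for modest documents. A small illustrative snippet:

from math import comb

def num_candidate_sets(n_mentions):
    # all subsets of size >= 2, i.e. 2**n - n - 1
    return sum(comb(n_mentions, k) for k in range(2, n_mentions + 1))

print(num_candidate_sets(6))    # 57 for the six mentions above
print(num_candidate_sets(50))   # roughly 1.1e15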
This space complexity is common in probabilistic first-order logic [Gaifman 1964; Halpern 1990; Paskin 2002; Poole 2003; Richardson & Domingos 2006]
Training in Probabilistic FOL: Parameter estimation; weight learning • Input • First-order formulae • ∀x S(x) ⇒ T(x) • Labeled data • Constants a, b, c with observed atoms S(a), T(a), S(b), T(b), S(c) • Output • A weight for each formula • ∀x S(x) ⇒ T(x) [0.67] • ∀x,y Coreferent(x,y) ⇒ Pronoun(x) [-2.3]
Training in Probabilistic FOL: Previous Work • Maximum likelihood • Requires an intractable normalization constant • Pseudo-likelihood [Richardson, Domingos 2006] • Ignores uncertainty of relational information • E-M [Kersting, De Raedt 2001; Koller, Pfeffer 1997] • Sampling [Paskin 2002] • Perceptron [Singla, Domingos 2005] • Can be inefficient when prediction is expensive • Piecewise training [Sutton, McCallum 2005] • Train "pieces" of the world in isolation • Performance is sensitive to which pieces are chosen
Training in Probabilistic FOL: Parameter estimation; weight learning • Most methods require "unrolling" [grounding] • Unrolling has exponential space complexity • E.g., ∀x,y,z S(x,y,z) ⇒ T(x,y,z) • For constants {a, b, c, d, e, f, g, h} we must examine all 8³ = 512 triples • Sampling can be inefficient due to the large sample space • Proposal: let prediction errors guide sampling
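To make the grounding cost concrete, a tiny sketch: unrolling one formula with k free variables over a constant set C yields |C|^k ground instances.

from itertools import product

constants = list("abcdefgh")
groundings = list(product(constants, repeat=3))   # one grounding per (x, y, z) triple
print(len(groundings))                            # 8**3 = 512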
Error-driven Training • Input • Observed data O // Input mentions • True labeling P // True clustering • Prediction algorithm A // Clustering algorithm • Initial weights W, prediction Q // Initial clustering • Iterate until convergence • Q' ← A(Q, W, O) // Merge clusters • If Q' introduces an error • UpdateWeights(Q, Q', P, O, W) • Else Q ← Q'
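A minimal sketch of this loop under stated assumptions: the clustering algorithm A proposes one merge at a time and returns None when finished, true_entity maps each mention to its gold entity, and restarting the clustering after each weight update is a simplification of the slide's "iterate until convergence".

def introduces_error(Q, true_entity):
    # error: some cluster mixes mentions that belong to different true entities
    return any(len({true_entity[m] for m in cluster}) > 1 for cluster in Q)

def error_driven_train(mentions, true_entity, A, update_weights, W, epochs=10):
    for _ in range(epochs):
        Q = [[m] for m in mentions]          # initial prediction: singleton clusters
        while True:
            Q_next = A(Q, W, mentions)       # propose the next merge
            if Q_next is None:               # no merges left: clustering finished
                break
            if introduces_error(Q_next, true_entity):
                update_weights(Q, Q_next, true_entity, mentions, W)
                break                        # simplification: restart after an update
            Q = Q_next                       # accept the error-free merge
    return W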
UpdateWeights(Q, Q', P, O, W): Learning to Rank Pairs of Predictions • Using the truth P, generate a new Q'' that is a better modification of Q than Q' • Update W s.t. Q'' = A(Q, W, O) • Update the parameters so that Q'' is ranked higher than Q'
Ranking vs. Classification Training • Instead of training [Powell, Mr. Powell, he] --> YES and [Powell, Mr. Powell, she] --> NO • ...rather... [Powell, Mr. Powell, he] > [Powell, Mr. Powell, she] • In general, the higher-ranked example may contain errors: [Powell, Mr. Powell, George, he] > [Powell, Mr. Powell, George, she]
Ranking Parameter Update • In our experiments, we use a large-margin update based on MIRA [Crammer, Singer 2003] • W_{t+1} = argmin_W ||W_t − W|| • s.t. Score(Q'', W) − Score(Q', W) ≥ 1
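Assuming the score is linear in a feature vector, Score(Q, W) = W · φ(Q), this single-constraint update has a simple closed form; a sketch, not the authors' exact implementation:

import numpy as np

def mira_update(W, phi_better, phi_worse, margin=1.0):
    # move W the minimum distance needed so that the better clustering
    # outscores the worse one by at least the margin
    delta = phi_better - phi_worse
    loss = margin - W.dot(delta)
    if loss <= 0 or not delta.any():
        return W                      # constraint already satisfied
    tau = loss / delta.dot(delta)     # smallest step length that satisfies it
    return W + tau * delta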
Advantages • Never need to unroll the entire network • Only explore partial solutions that the prediction algorithm is likely to produce • Weights are tuned for the prediction algorithm • Adaptable to different prediction algorithms • beam search, simulated annealing, etc. • Adaptable to different loss functions Related: • Incremental Perceptron [Collins, Roark 2004] • LaSO [Daume, Marcu 2005] Extended here for FOL, ranking, and a max-margin loss; ranks partial, possibly mistaken predictions.
Disadvantages • Difficult to analyze exactly what global objective function is being optimized • Convergence issues • Average weight updates
Experiments • ACE 2004 coreference • 443 newswire documents • Standard feature set [Soon, Ng, Lim 2001; Ng & Cardie 2002] • Text match, gender, number, context, WordNet • Additional first-order features • Min/Max/Average/Majority of pairwise features • E.g., average string edit distance, max document distance • Existential/Universal quantifications of pairwise features • E.g., there exists gender disagreement • Prediction: greedy agglomerative clustering
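A sketch of how such first-order features can be computed by aggregating pairwise features over a hypothesized cluster; the feature names and the pairwise_features interface are assumptions for illustration:

from itertools import combinations

def cluster_features(cluster, pairwise_features):
    # lift pairwise features to cluster level via min/max/average aggregates
    # and existential/universal quantification
    pairs = [pairwise_features(a, b) for a, b in combinations(cluster, 2)]
    if not pairs:
        return {}
    feats = {}
    for name in pairs[0]:
        values = [p[name] for p in pairs]
        feats["min_" + name] = min(values)
        feats["max_" + name] = max(values)
        feats["avg_" + name] = sum(values) / len(values)
        feats["exists_" + name] = float(any(values))   # e.g. "there exists gender disagreement"
        feats["forall_" + name] = float(all(values))
    return feats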
Experiments: B-Cubed F1 Score on ACE 2004 Noun Coreference [Results chart comparing "better representation" (pairwise vs. first-order features) against "better training" (classification vs. error-driven ranking). To our knowledge, the best previously reported result is ~69% (Ng, 2005).]
Conclusions Combining logical and probabilistic approaches to AI can improve the state of the art in NLP. Simple approximations can make these approaches practical for real-world problems.
Future Work • Fancier features • Over entire clusterings • Less greedy inference • Metropolis-Hastings sampling • Analysis of training • Which positive/negative examples to select when updating • Loss function sensitive to local minima of prediction • Analyze theoretical/empirical convergence