Explore first-order logic models for coreference resolution, focusing on Probabilistic First-Order Logic and First-Order Logic CRFs. Learn methods for training and parameter estimation.
Probabilistic First-Order Logic for Coreference Resolution
Aron Culotta, Computer Science Department, University of Massachusetts Amherst
Joint work with Andrew McCallum [advisor], Michael Wick, and Robert Hall
Previous work: Conditional Random Fields for Coreference
A Pairwise Conditional Random Field for Coreference (PW-CRF) [McCallum & Wellner, 2003, ICML]
[Figure: three mentions, x1 ". . . Mr Powell . . .", x2 ". . . Powell . . .", x3 ". . . she . . .", linked by pairwise coreference variables y (e.g., Coreferent(x2, x3)?) with compatibility scores such as 45, -30, and 11.]
• Pairwise compatibility scores learned from training data
• Hard transitivity constraints enforced by the prediction algorithm
Prediction in PW-CRFs = Graph Partitioning [Boykov, Veksler, Zabih, 1999], [Kolmogorov & Zabih, 2002], [Yu, Cross, Shi, 2002]
[Figure: partitioning the mention graph over x1 ". . . Mr Powell . . .", x2 ". . . Powell . . .", x3 ". . . she . . ." with edge weights 45, -30, and 11; the partition shown scores 64.]
• Often approximated with agglomerative clustering
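To make the partition score concrete, here is a minimal Python sketch, assuming the common convention that a partition's score is the sum of the pairwise compatibility scores of its within-cluster pairs; the edge-weight assignment below is illustrative, not taken from the slide's figure.

```python
from itertools import combinations

def partition_score(clusters, pair_score):
    """clusters: list of sets of mention ids; pair_score: dict {(i, j): weight} with i < j."""
    total = 0.0
    for cluster in clusters:
        for i, j in combinations(sorted(cluster), 2):
            total += pair_score[(i, j)]   # only within-cluster edges contribute
    return total

# Toy weights loosely based on the figure (x1 = "Mr Powell", x2 = "Powell", x3 = "she");
# the exact edge-to-score mapping here is an assumption for illustration.
scores = {(1, 2): 45.0, (1, 3): -30.0, (2, 3): 11.0}
print(partition_score([{1, 2}, {3}], scores))   # only the (x1, x2) edge is within a cluster: 45.0
```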
Parameter Estimation in PW-CRFs • Given labeled documents, generate all pairs of mentions • Optionally prune distant mention pairs [Soon, Ng, Lim 2001] • Learn binary classifier to predict coreference • Edge weights proportional to classifier output
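The pairwise training recipe above can be sketched in a few lines of Python; the feature function and the distance-based pruning window are illustrative placeholders, and any off-the-shelf binary classifier (here scikit-learn's logistic regression) stands in for the learned compatibility function.

```python
from itertools import combinations
from sklearn.linear_model import LogisticRegression

def pairwise_examples(mentions, entity_ids, pair_features, max_distance=10):
    """mentions: list in document order; entity_ids: parallel list of gold entity labels."""
    X, y = [], []
    for i, j in combinations(range(len(mentions)), 2):
        if j - i > max_distance:                 # optionally prune distant mention pairs
            continue
        X.append(pair_features(mentions[i], mentions[j]))
        y.append(int(entity_ids[i] == entity_ids[j]))   # 1 if the pair is coreferent
    return X, y

# X, y = pairwise_examples(mentions, entity_ids, my_pair_features)
# clf = LogisticRegression(max_iter=1000).fit(X, y)
# The classifier's decision_function output can then serve as the edge weight.
```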
Sometimes pairwise comparisons are insufficient
• Entities have multiple attributes (name, email, institution, location); we need to measure "compatibility" among all of them.
  • Having 2 "given names" is common, but not 4.
  • E.g., Howard M. Dean / Martin, Dean / Howard Martin
• We need to measure properties of whole clusters of mentions:
  • Is there a pair of name strings whose edit distance exceeds 0.5?
  • What is the maximum distance between mentions in the document?
  • Does an entity contain only pronoun mentions?
• We need measures on hypothesized "entities"; we need first-order logic.
First-Order Logic CRFs for Coreference (FOL-CRF)
[Figure: mentions x1 ". . . Mr Powell . . .", x2 ". . . Powell . . .", x3 ". . . she . . ." connected to a single clusterwise factor, Coreferent(x1, x2, x3)?, with score -56.]
• Clusterwise compatibility score learned from training data
• Features are arbitrary FOL predicates over a set of mentions
• As in the PW-CRF, prediction can be approximated with agglomerative clustering (see the sketch below)
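Here is a minimal sketch (not the authors' code) of greedy agglomerative clustering driven by a clusterwise compatibility score: repeatedly merge the pair of clusters whose union scores highest, stopping when no merge scores above zero.

```python
from itertools import combinations

def greedy_cluster(mentions, cluster_score):
    """cluster_score(frozenset_of_mentions) -> compatibility of treating them as one entity."""
    clusters = [frozenset([m]) for m in mentions]   # start with singleton clusters
    while True:
        best, best_score = None, 0.0
        for a, b in combinations(range(len(clusters)), 2):
            s = cluster_score(clusters[a] | clusters[b])
            if s > best_score:
                best, best_score = (a, b), s
        if best is None:                            # no merge improves compatibility: stop
            return clusters
        a, b = best
        merged = clusters[a] | clusters[b]
        clusters = [c for k, c in enumerate(clusters) if k not in (a, b)] + [merged]
```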
Learning Parameters of FOL-CRFs
• Generate classification examples whose input is a set of mentions
• Unlike the pairwise CRF, we cannot enumerate all possible examples from the training data
Learning Parameters of FOL-CRFs: Combinatorial Explosion!
[Figure: for the mentions {She, He, Powell, Rice, he, Secretary}, candidate factors range over Coreferent(x1, x2), Coreferent(x1, x2, x3), Coreferent(x1, x2, x3, x4), Coreferent(x1, x2, x3, x4, x5), ..., Coreferent(x1, x2, x3, x4, x5, x6).]
This space complexity is common in probabilistic first-order logic [Gaifman 1964; Halpern 1990; Paskin 2002; Poole 2003; Richardson & Domingos 2006].
Training in Probabilistic FOL: Parameter Estimation (Weight Learning)
• Input
  • First-order formulae, e.g. ∀x S(x) ⇒ T(x)
  • Labeled data: constants a, b, c with observed facts S(a), T(a), S(b), T(b), S(c)
• Output
  • A weight for each formula, e.g.
    ∀x S(x) ⇒ T(x) [0.67]
    ∀x ∀y Coreferent(x, y) ⇒ Pronoun(x) [-2.3]
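As a rough illustration of how weighted first-order formulae score a labeled world (in the Markov-logic style): each formula contributes its weight times the number of groundings it satisfies. The representation of formulae below is an assumed simplification; the toy formula and weight come from the example above.

```python
from itertools import product

def score_world(constants, facts, weighted_formulae):
    """facts: set of ground atoms, e.g. {('S', 'a')};
    weighted_formulae: list of (arity, satisfied_fn, weight) triples."""
    total = 0.0
    for arity, satisfied, weight in weighted_formulae:
        for grounding in product(constants, repeat=arity):
            if satisfied(grounding, facts):
                total += weight
    return total

# The toy formula "forall x: S(x) => T(x)" with weight 0.67:
s_implies_t = (1, lambda g, f: ('S', g[0]) not in f or ('T', g[0]) in f, 0.67)
facts = {('S', 'a'), ('T', 'a'), ('S', 'b'), ('T', 'b'), ('S', 'c')}
print(score_world(['a', 'b', 'c'], facts, [s_implies_t]))   # 2 satisfied groundings -> 1.34
```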
Training in Probabilistic FOL: Previous Work
• Maximum likelihood
  • Requires an intractable normalization constant
• Pseudo-likelihood [Richardson & Domingos 2006]
  • Ignores uncertainty of relational information
• EM [Kersting & De Raedt 2001; Koller & Pfeffer 1997]
• Sampling [Paskin 2002]
• Perceptron [Singla & Domingos 2005]
  • Can be inefficient when prediction is expensive
• Piecewise training [Sutton & McCallum 2005]
  • Trains "pieces" of the world in isolation
  • Performance is sensitive to which pieces are chosen
Training in Probabilistic FOL: Parameter Estimation (Weight Learning)
• Most methods require "unrolling" [grounding] the model
• Unrolling has exponential space complexity
  • E.g., for ∀x ∀y ∀z S(x, y, z) ⇒ T(x, y, z) over the constants {a, b, c, d, e, f, g, h}, we must examine all triples (see below)
• Sampling can be inefficient due to the large sample space
• Proposal: let prediction errors guide sampling
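A quick illustration of the unrolling cost mentioned above, treating "all triples" as all assignments of the three variables to the eight constants:

```python
from itertools import product

constants = list('abcdefgh')
groundings = list(product(constants, repeat=3))   # every assignment of (x, y, z)
print(len(groundings))                            # 8**3 = 512 groundings for one formula
```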
Error-driven Training
• Input
  • Observed data O // input mentions
  • True labeling P // true clustering
  • Prediction algorithm A // clustering algorithm
  • Initial weights W and initial prediction Q // initial clustering
• Iterate until convergence:
  • Q' ← A(Q, W, O) // merge clusters
  • If Q' introduces an error: UpdateWeights(Q, Q', P, O, W)
  • Else: Q ← Q'
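A schematic, hedged rendering of the error-driven loop above in Python; `predict_step`, `introduces_error`, `better_neighbor`, and `update_weights` are assumed hooks standing in for A, the error check, the truth-guided modification, and UpdateWeights, not the authors' API.

```python
def error_driven_train(O, P, predict_step, introduces_error,
                       better_neighbor, update_weights, W, max_iters=100):
    Q = [frozenset([m]) for m in O]               # initial clustering: singletons
    for _ in range(max_iters):
        Q_next = predict_step(Q, W, O)            # e.g. one greedy cluster merge
        if Q_next is None:                        # prediction algorithm has converged
            break
        if introduces_error(Q_next, P):
            Q_better = better_neighbor(Q, P)      # a better modification of Q, guided by the truth P
            W = update_weights(Q, Q_next, Q_better, O, W)   # rank Q_better above Q_next
        else:
            Q = Q_next
    return W
```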
UpdateWeights(Q, Q', P, O, W): Learning to Rank Pairs of Predictions
• Using the truth P, generate a new Q'' that is a better modification of Q than Q'
• Update W so that Q'' = A(Q, W, O)
  • I.e., update the parameters so that Q'' is ranked higher than Q'
Ranking vs Classification Training
• Instead of training:
  [Powell, Mr. Powell, he] --> YES
  [Powell, Mr. Powell, she] --> NO
• ...rather:
  [Powell, Mr. Powell, he] > [Powell, Mr. Powell, she]
• In general, the higher-ranked example may contain errors:
  [Powell, Mr. Powell, George, he] > [Powell, Mr. Powell, George, she]
Ranking Parameter Update
• In our experiments, we use a large-margin update based on MIRA [Crammer & Singer 2003]:
  W_{t+1} = argmin_W ||W_t - W||
  s.t. Score(Q'', W) - Score(Q', W) ≥ 1
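A minimal sketch of a MIRA-style single-constraint update under these assumptions: move the weights as little as possible while making the preferred clustering Q'' outscore Q' by a margin of 1. Feature vectors are represented here as plain dicts of feature values.

```python
def mira_update(W, feats_good, feats_bad, margin=1.0):
    """W, feats_good (for Q''), feats_bad (for Q'): dicts mapping feature -> value."""
    keys = set(feats_good) | set(feats_bad)
    diff = {k: feats_good.get(k, 0.0) - feats_bad.get(k, 0.0) for k in keys}
    score_gap = sum(W.get(k, 0.0) * v for k, v in diff.items())
    loss = max(0.0, margin - score_gap)           # how far the margin constraint is violated
    norm_sq = sum(v * v for v in diff.values())
    if loss > 0.0 and norm_sq > 0.0:
        tau = loss / norm_sq                      # closed-form step size for this single constraint
        for k, v in diff.items():
            W[k] = W.get(k, 0.0) + tau * v
    return W
```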
Advantages
• Never need to unroll the entire network
  • Only explore the partial solutions the prediction algorithm is likely to produce
• Weights are tuned for the prediction algorithm
• Adaptable to different prediction algorithms
  • Beam search, simulated annealing, etc.
• Adaptable to different loss functions
Related work:
• Incremental Perceptron [Collins & Roark 2004]
• LaSO [Daume & Marcu 2005]
• Extended here for FOL, ranking, and a max-margin loss; ranks partial, possibly mistaken predictions
Disadvantages
• Difficult to analyze exactly what global objective function is being optimized
• Convergence issues
  • Mitigated by averaging the weight updates
Experiments
• ACE 2004 coreference
  • 443 newswire documents
• Standard feature set [Soon, Ng, Lim 2001; Ng & Cardie 2002]
  • Text match, gender, number, context, WordNet
• Additional first-order features (see the sketch below)
  • Min/Max/Average/Majority of pairwise features
    • E.g., average string edit distance, max document distance
  • Existential/universal quantifications of pairwise features
    • E.g., there exists a gender disagreement
• Prediction: greedy agglomerative clustering
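As a sketch of how the first-order features above can be computed, the following lifts pairwise features to cluster-level features by aggregation and quantification; `pair_features` returning a dict of numeric values is an assumed placeholder.

```python
from itertools import combinations

def cluster_features(mentions, pair_features):
    """Lift pairwise features to features over a hypothesized entity (a set of mentions)."""
    if len(mentions) < 2:
        return {}
    pair_vecs = [pair_features(a, b) for a, b in combinations(mentions, 2)]
    feats = {}
    for name in pair_vecs[0]:
        values = [vec[name] for vec in pair_vecs]
        feats['min_' + name] = min(values)
        feats['max_' + name] = max(values)
        feats['avg_' + name] = sum(values) / len(values)
        feats['exists_' + name] = float(any(values))   # existential quantification
        feats['forall_' + name] = float(all(values))   # universal quantification
    return feats
```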
Experiments
[Chart: B-Cubed F1 score on ACE 2004 noun coreference, showing gains from both better training and better representation.]
• To our knowledge, the best previously reported result is ~69% (Ng, 2005).
Conclusions
• Combining logical and probabilistic approaches to AI can improve the state of the art in NLP.
• Simple approximations can make these approaches practical for real-world problems.
Future Work
• Fancier features
  • Over entire clusterings
• Less greedy inference
  • Metropolis-Hastings sampling
• Analysis of training
  • Which positive/negative examples to select when updating
  • Loss functions sensitive to local minima of prediction
  • Analyze theoretical/empirical convergence