
First-Order Probabilistic Models for Coreference Resolution

Aron Culotta, Computer Science Department, University of Massachusetts Amherst. Joint work with Andrew McCallum [advisor], Michael Wick, and Robert Hall.


Presentation Transcript


  1. First-Order Probabilistic Models for Coreference Resolution. Aron Culotta, Computer Science Department, University of Massachusetts Amherst. Joint work with Andrew McCallum [advisor], Michael Wick, Robert Hall

  2. Probabilistic First-Order Logic for Coreference Resolution. Aron Culotta, Computer Science Department, University of Massachusetts, Amherst. Joint work with Andrew McCallum [advisor], Michael Wick, Robert Hall

  3. Previous work: Conditional Random Fields for Coreference

  4. A Pairwise Conditional Random Field for Coreference (PW-CRF) [McCallum & Wellner, 2003, ICML]. [Figure: three mentions, x1 ". . . Mr Powell . . .", x2 ". . . Powell . . .", x3 ". . . she . . .", connected pairwise by binary variables y, each asking Coreferent(xi, xj)?]

  5. A Pairwise Conditional Random Field for Coreference (PW-CRF) [McCallum & Wellner, 2003, ICML]. [Figure: the same graph of mentions x1, x2, x3 with a pairwise variable y on each edge.]

  6. A Pairwise Conditional Random Field for Coreference (PW-CRF) [McCallum & Wellner, 2003, ICML]. [Figure: the same graph with learned edge scores: (x1, x2) = 45, (x1, x3) = -30, (x2, x3) = 11.] Pairwise compatibility scores are learned from training data.

  7. A Pairwise Conditional Random Field for Coreference (PW-CRF) [McCallum & Wellner, 2003, ICML]. [Figure: the same graph with edge scores 45, -30, 11.] Pairwise compatibility scores are learned from training data; hard transitivity constraints are enforced by the prediction algorithm.

  8. Prediction in PW-CRFs = Graph Partitioning [Boykov, Veksler, Zabih, 1999], [Kolmogorov & Zabih, 2002], [Yu, Cross, Shi, 2002]. [Figure: partitioning the mention graph with edge scores (x1, x2) = 45, (x1, x3) = -30, (x2, x3) = 11; the partition {Mr Powell, Powell} | {she} has score 45 - (-30 + 11) = 64.] Often approximated with agglomerative clustering.
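The agglomerative approximation to graph partitioning mentioned on this slide can be sketched as follows. This is a minimal illustration, not the authors' implementation; `pair_score` and the toy edge scores from slide 6 are stand-ins:

```python
import itertools

def greedy_agglomerative(mentions, pair_score):
    """Approximate graph partitioning for PW-CRF prediction:
    repeatedly merge the two clusters with the highest total
    inter-cluster pairwise score, stopping when no merge has
    positive score."""
    clusters = [{m} for m in mentions]
    while len(clusters) > 1:
        best = None
        for a, b in itertools.combinations(range(len(clusters)), 2):
            score = sum(pair_score(x, y)
                        for x in clusters[a] for y in clusters[b])
            if best is None or score > best[0]:
                best = (score, a, b)
        score, a, b = best
        if score <= 0:  # no merge improves the partition score
            break
        clusters[b] |= clusters[a]
        del clusters[a]
    return clusters

# Toy scores from slide 6: Mr Powell/Powell = 45,
# Mr Powell/she = -30, Powell/she = 11.
scores = {("x1", "x2"): 45, ("x1", "x3"): -30, ("x2", "x3"): 11}
def pair_score(x, y):
    return scores.get((x, y), scores.get((y, x), 0))

clusters = greedy_agglomerative(["x1", "x2", "x3"], pair_score)
```

On the slide-8 toy graph this recovers the partition {Mr Powell, Powell} | {she}.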

  9. Parameter Estimation in PW-CRFs • Given labeled documents, generate all pairs of mentions • Optionally prune distant mention pairs [Soon, Ng, Lim 2001] • Learn binary classifier to predict coreference • Edge weights proportional to classifier output
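The pairwise training-data generation described above can be sketched like this; the function name and the toy gold clustering are hypothetical, and the pruning heuristic is the mention-distance cutoff in the spirit of [Soon, Ng, Lim 2001]:

```python
import itertools

def generate_pairs(mentions, gold_cluster_of, max_dist=None):
    """Build pairwise training examples: every pair of mentions,
    labeled coreferent iff they share a gold cluster. Optionally
    prune pairs more than max_dist mention positions apart."""
    examples = []
    for (i, mi), (j, mj) in itertools.combinations(enumerate(mentions), 2):
        if max_dist is not None and j - i > max_dist:
            continue
        label = gold_cluster_of[mi] == gold_cluster_of[mj]
        examples.append(((mi, mj), label))
    return examples

mentions = ["Powell", "Mr Powell", "she", "he"]
gold = {"Powell": 0, "Mr Powell": 0, "she": 1, "he": 0}
pairs = generate_pairs(mentions, gold)  # 6 labeled pairs
```

A binary classifier trained on such pairs supplies the edge weights for the partitioning step.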

  10. Sometimes pairwise comparisons are insufficient • Entities have multiple attributes (name, email, institution, location); we need to measure "compatibility" among all of them • Having 2 "given names" is common, but not 4 • E.g. Howard M. Dean / Martin, Dean / Howard Martin • We need to measure the size of clusters of mentions • ∃ a pair of name strings whose edit distance > 0.5? • Maximum distance between mentions in the document • Does an entity contain only pronoun mentions? • We need measures on hypothesized "entities"; we need first-order logic
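Two of the entity-level measures above might be computed like this. A hedged sketch: the function, the pronoun list, and the convention that a given name is the first token of a multi-token mention are all illustrative assumptions, not the paper's feature definitions:

```python
PRONOUNS = {"he", "she", "it", "they", "him", "her"}

def cluster_features(mentions):
    """First-order features computed on a hypothesized entity
    (a set of mention strings) rather than on a single pair."""
    # Assumption: treat the first token of a multi-token,
    # non-pronoun mention as its given name.
    given_names = {m.split()[0] for m in mentions
                   if m.lower() not in PRONOUNS and len(m.split()) > 1}
    return {
        # 2 distinct given names is plausible; 4 is suspicious
        "num_given_names": len(given_names),
        # an entity made entirely of pronouns is suspicious
        "all_pronouns": all(m.lower() in PRONOUNS for m in mentions),
    }
```

For example, `cluster_features(["Howard Dean", "Martin Dean", "he"])` reports two distinct given names, a hint that the cluster conflates two people.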

  11. First-Order Logic CRFs for Coreference

  12. First-Order Logic CRFs for Coreference (FOL-CRF). [Figure: mentions x1 ". . . Mr Powell . . .", x2 ". . . Powell . . .", x3 ". . . she . . ." share a single variable y with score -56 asking Coreferent(x1, x2, x3)?]

  13. First-Order Logic CRFs for Coreference (FOL-CRF). [Figure: the same graph; the cluster variable y scores Coreferent(x1, x2, x3) at -56.] Clusterwise compatibility scores are learned from training data; features are arbitrary first-order-logic predicates over a set of mentions.

  14. First-Order Logic CRFs for Coreference (FOL-CRF). [Figure: the same cluster-level graph.] As in the PW-CRF, prediction can be approximated with agglomerative clustering.

  15. Learning Parameters of FOL-CRFs • Generate classification examples whose input is a set of mentions • Unlike the pairwise CRF, we cannot generate all possible examples from the training data

  16. Learning Parameters of FOL-CRFs: Combinatorial Explosion! [Figure: six mentions (She, He, Powell, Rice, he, Secretary) give rise to candidate examples Coreferent(x1,x2), Coreferent(x1,x2,x3), Coreferent(x1,x2,x3,x4), Coreferent(x1,x2,x3,x4,x5), Coreferent(x1,x2,x3,x4,x5,x6), and so on for every subset.]
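The explosion is easy to quantify: every subset of at least two mentions is a candidate training example, so n mentions yield 2^n - n - 1 candidates (all subsets minus the singletons and the empty set). A small illustration:

```python
def num_candidate_clusters(n):
    """Number of candidate Coreferent(...) examples over n mentions:
    every subset of size >= 2, i.e. 2^n - (n singletons) - (empty set)."""
    return 2 ** n - n - 1

# The six mentions on slide 16 (She, He, Powell, Rice, he, Secretary)
# already give 2^6 - 6 - 1 = 57 candidate examples; 20 mentions give
# over a million.
six = num_candidate_clusters(6)
```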

  17. This space complexity is common in probabilistic first-order logic: Gaifman 1964; Halpern 1990; Paskin 2002; Poole 2003; Richardson & Domingos 2006

  18. Training in Probabilistic FOL: Parameter Estimation (Weight Learning) • Input • First-order formulae, e.g. ∀x S(x) ⇒ T(x) • Labeled data, e.g. constants a, b, c with observed atoms S(a), T(a), S(b), T(b), S(c) • Output • A weight for each formula • ∀x S(x) ⇒ T(x) [0.67] • ∀x ∀y Coreferent(x,y) ⇒ Pronoun(x) [-2.3]
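The weighted-formula semantics on this slide can be made concrete with the log-linear scoring used in Markov logic [Richardson & Domingos 2006]: a world's score is the sum, over formulas, of weight times the number of satisfied groundings. A minimal sketch, with the slide's Coreferent/Pronoun formula encoded as a Python predicate (the atom-tuple representation is an assumption of this example):

```python
import itertools

def mln_score(constants, formulas, world):
    """Score a world (a set of true ground atoms) as
    sum over formulas of weight * #satisfied groundings.
    Each formula is (weight, arity, satisfied_fn)."""
    total = 0.0
    for weight, arity, satisfied in formulas:
        for args in itertools.product(constants, repeat=arity):
            if satisfied(world, *args):
                total += weight
    return total

# Slide 18's formula: Coreferent(x,y) => Pronoun(x), weight -2.3.
formulas = [(-2.3, 2,
             lambda w, x, y: ("Coref", x, y) not in w
                             or ("Pron", x) in w)]
world = {("Coref", "a", "b"), ("Pron", "a")}  # all 4 groundings satisfied
score = mln_score(["a", "b"], formulas, world)
```

A world violating the implication (Coref without Pron) satisfies one fewer grounding and therefore scores higher here, since the weight is negative; this is exactly the soft penalty the weight encodes.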

  19. Training in Probabilistic FOL: Previous Work • Maximum likelihood • Requires an intractable normalization constant • Pseudo-likelihood [Richardson, Domingos 2006] • Ignores uncertainty of relational information • EM [Kersting, De Raedt 2001; Koller, Pfeffer 1997] • Sampling [Paskin 2002] • Perceptron [Singla, Domingos 2005] • Can be inefficient when prediction is expensive • Piecewise training [Sutton, McCallum 2005] • Trains "pieces" of the world in isolation • Performance is sensitive to which pieces are chosen

  20. Training in Probabilistic FOL: Parameter Estimation (Weight Learning) • Most methods require "unrolling" (grounding) • Unrolling has exponential space complexity • E.g., for ∀x ∀y ∀z S(x,y,z) ⇒ T(x,y,z) over constants {a, b, c, d, e, f, g, h}, we must examine all 8³ = 512 ground triples • Sampling can be inefficient due to the large sample space • Proposal: let prediction errors guide sampling

  21. Error-driven Training • Input • Observed data O // input mentions • True labeling P // true clustering • Prediction algorithm A // clustering algorithm • Initial weights W, initial prediction Q // initial clustering • Iterate until convergence • Q' ← A(Q, W, O) // merge clusters • If Q' introduces an error • UpdateWeights(Q, Q', P, O, W) • Else Q ← Q'
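The loop on this slide can be sketched concretely for greedy agglomerative clustering. This is an illustrative simplification, not the paper's algorithm: it uses a plain additive (perceptron-style) update rather than MIRA, restarts clustering after each error, and the feature function and toy data in the usage example are invented for the demo:

```python
def error_driven_train(mentions, gold_cluster_of, features, weights,
                       epochs=5, lr=1.0):
    """Error-driven training sketch: run the clustering algorithm;
    when the best-scoring merge Q' is wrong, demote it and promote
    a correct alternative merge Q'' (if one exists), then restart.
    The full network is never unrolled."""
    def score(cluster):
        return sum(weights.get(f, 0.0) * v
                   for f, v in features(cluster).items())

    for _ in range(epochs):
        clusters = [{m} for m in mentions]
        while len(clusters) > 1:
            merges = [(score(clusters[a] | clusters[b]), a, b)
                      for a in range(len(clusters))
                      for b in range(a + 1, len(clusters))]
            s, a, b = max(merges)
            if s <= 0:
                break                       # prediction would stop here
            merged = clusters[a] | clusters[b]
            if len({gold_cluster_of[m] for m in merged}) == 1:
                clusters[b] = merged        # correct merge: accept Q'
                del clusters[a]
            else:                           # Q' introduces an error
                for f, v in features(merged).items():
                    weights[f] = weights.get(f, 0.0) - lr * v
                good = [clusters[x] | clusters[y] for _, x, y in merges
                        if len({gold_cluster_of[m]
                                for m in clusters[x] | clusters[y]}) == 1]
                if good:                    # promote a correct merge Q''
                    for f, v in features(good[0]).items():
                        weights[f] = weights.get(f, 0.0) + lr * v
                break
    return weights

# Toy run: mentions sharing a first letter are coreferent in gold.
def feats(cluster):
    return {"same_first": float(len({m[0] for m in cluster}) == 1),
            "bias": 1.0}

w = error_driven_train(["Aa", "Ab", "Ba"], {"Aa": 0, "Ab": 0, "Ba": 1},
                       feats, {"same_first": 0.0, "bias": 1.0})
```

After training, the "same_first" feature carries positive weight and the indiscriminate bias has been driven down, so the greedy clusterer makes only correct merges on this toy data.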

  22. UpdateWeights(Q, Q', P, O, W): Learning to Rank Pairs of Predictions • Using the truth P, generate a new Q'' that is a better modification of Q than Q' • Update W s.t. Q'' = A(Q, W, O) • I.e., update the parameters so that Q'' is ranked higher than Q'

  23. Ranking vs. Classification Training • Instead of training [Powell, Mr. Powell, he] → YES and [Powell, Mr. Powell, she] → NO ... • ... we rather train [Powell, Mr. Powell, he] > [Powell, Mr. Powell, she] • In general, the higher-ranked example may itself contain errors: [Powell, Mr. Powell, George, he] > [Powell, Mr. Powell, George, she]

  24. Ranking Parameter Update • In our experiments, we use a large-margin update based on MIRA [Crammer, Singer 2003] • W_{t+1} = argmin_W ||W_t - W|| • s.t. Score(Q'', W) - Score(Q', W) ≥ 1
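For a single ranking constraint with linear scores, this constrained minimization has a closed form: move W along the feature difference just far enough to satisfy the margin. A sketch assuming sparse feature vectors as dicts (the dict representation and function name are this example's conventions, not the paper's):

```python
def mira_update(w, phi_better, phi_worse):
    """Single-constraint MIRA step: minimally change w so that
    w . phi_better - w . phi_worse >= 1. With one linear constraint
    the solution is w + tau * delta, where delta is the feature
    difference and tau = max(0, (1 - w . delta) / ||delta||^2)."""
    delta = {f: phi_better.get(f, 0.0) - phi_worse.get(f, 0.0)
             for f in set(phi_better) | set(phi_worse)}
    margin = sum(w.get(f, 0.0) * v for f, v in delta.items())
    norm_sq = sum(v * v for v in delta.values())
    if norm_sq == 0.0:
        return dict(w)          # identical features: nothing to separate
    tau = max(0.0, (1.0 - margin) / norm_sq)
    return {f: w.get(f, 0.0) + tau * delta.get(f, 0.0)
            for f in set(w) | set(delta)}

w2 = mira_update({"a": 0.0}, {"a": 1.0}, {"a": 0.0})
```

If the margin is already at least 1, tau is 0 and the weights are left untouched, which is the "minimal change" property that distinguishes MIRA from a fixed-step perceptron.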

  25. Advantages • Never need to unroll the entire network • Only explore the partial solutions the prediction algorithm is likely to produce • Weights are tuned for the prediction algorithm • Adaptable to different prediction algorithms (beam search, simulated annealing, etc.) • Adaptable to different loss functions • Related: Incremental Perceptron [Collins, Roark 2004], LaSO [Daume, Marcu 2005]; extended here to FOL, ranking, and a max-margin loss, ranking partial, possibly mistaken predictions

  26. Disadvantages • Difficult to analyze exactly what global objective function is being optimized • Convergence issues • Average weight updates

  27. Experiments • ACE 2004 coreference • 443 newswire documents • Standard feature set [Soon, Ng, Lim 2001; Ng & Cardie 2002] • Text match, gender, number, context, WordNet • Additional first-order features • Min/Max/Average/Majority of pairwise features • E.g., average string edit distance, max document distance • Existential/universal quantifications of pairwise features • E.g., there exists a gender disagreement • Prediction: greedy agglomerative clustering
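The first-order features above are aggregations of pairwise features over a cluster, and can be sketched generically. A hedged illustration: the lifting function, the tiny gender lexicon, and the mismatch feature are stand-ins for the paper's actual feature set:

```python
import itertools

def lift_pairwise(cluster, pair_feature):
    """Lift a pairwise feature to first-order cluster features by
    aggregation: Min/Max/Average plus an existential quantifier
    (does any pair fire the feature?). Requires |cluster| >= 2."""
    vals = [pair_feature(a, b)
            for a, b in itertools.combinations(sorted(cluster), 2)]
    return {"min": min(vals), "max": max(vals),
            "avg": sum(vals) / len(vals),
            "exists": any(v > 0 for v in vals)}

# Illustrative pairwise feature: gender disagreement indicator.
GENDER = {"he": "m", "she": "f", "Powell": "m"}
def gender_mismatch(a, b):
    ga, gb = GENDER.get(a), GENDER.get(b)
    return float(ga is not None and gb is not None and ga != gb)

f = lift_pairwise({"Powell", "he", "she"}, gender_mismatch)
# f["exists"] captures "there exists a gender disagreement".
```

The same lifting works for any pairwise score, e.g. string edit distance or document distance, which is how one pairwise feature set yields many cluster-level features.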

  28. Experiments [Figure: B-Cubed F1 scores on ACE 2004 noun coreference, improving along two axes: better training and better representation. To our knowledge, the best previously reported result is ~69% (Ng, 2005).]

  29. Conclusions • Combining logical and probabilistic approaches to AI can improve the state of the art in NLP • Simple approximations can make these approaches practical for real-world problems

  30. Future Work • Fancier features • Over entire clusterings • Less greedy inference • Metropolis-Hastings sampling • Analysis of training • Which positive/negative examples to select when updating • Loss function sensitive to local minima of prediction • Analyze theoretical/empirical convergence

  31. Thank you
