Robust Reading: Identification and Tracing of Ambiguous Names. Xin Li, Paul Morie, Dan Roth University of Illinois at Urbana-Champaign.
Document 1: The Justice Department has officially ended its inquiry into the assassinations of John F. Kennedy and Martin Luther King Jr., finding "no persuasive evidence" to support conspiracy theories, according to department documents. The House Assassinations Committee concluded in 1978 that Kennedy was "probably" assassinated as the result of a conspiracy involving a second gunman, a finding that broke from the Warren Commission's belief that Lee Harvey Oswald acted alone in Dallas on Nov. 22, 1963.
Document 2: In 1953, Massachusetts Sen. John F. Kennedy married Jacqueline Lee Bouvier in Newport, R.I. In 1960, Democratic presidential candidate John F. Kennedy confronted the issue of his Roman Catholic faith by telling a Protestant group in Houston, "I do not speak for my church on public matters, and the church does not speak for me."
Document 3: David Kennedy was born in Leicester, England in 1959. … Kennedy co-edited The New Poetry (Bloodaxe Books 1993), and is the author of New Relations: The Refashioning Of British Poetry 1980-1994 (Seren 1996).
Presented by Xin Li, UIUC
Why is the Robust Reading problem important?
• Most of the work in NLP is done at the level of mentions. We would like to start moving from the mention level to the concept level.
• Our solution:
  • We develop a global probabilistic view of how documents are generated and how names are "sprinkled" into them.
  • We formulate the problem as learning the model parameters and performing inference with the learned model.
  • Our experimental study shows promising results on New York Times news articles.
Outline
• A generative model of document generation
  • Three model relaxations
• Learning the models in a completely unsupervised setting
• Experimental results
• Conclusion and future directions
Generate Document d
The Justice Department has officially ended its inquiry into the assassinations of President John F. Kennedy and Martin Luther King Jr., finding "no persuasive evidence" to support conspiracy theories, according to department documents. The House Assassinations Committee concluded in 1978 that Kennedy was "probably" assassinated as the result of a conspiracy involving a second gunman, a finding that broke from the Warren Commission's belief that Lee Harvey Oswald acted alone in Dallas on Nov. 22, 1963. … President Kennedy … JFK …
At the beginning, we have a set of entities in mind.
A set of entities E, for example: The Justice Department; Dallas; The House Assassinations Committee; David Kennedy.
First Step: Select a subset of entities for d.
Underlying probability distribution: $P(E^d)$.
$E^d$: entities in d, for example: The Justice Department; Dallas; The House Assassinations Committee.
Second Step: For each entity e, select a representative r (for example, President John F. Kennedy).
Underlying probability distributions: $P(r \mid e)$ and $P(R^d \mid E^d) = \prod_i P(r_i^d \mid e_i^d)$.
$R^d$: representatives in d.
The Justice Department has officially ended its inquiry into the assassinations of President John F. Kennedy and Martin Luther King Jr., finding "no persuasive evidence" to support conspiracy theories, according to department documents. The House Assassinations Committee concluded in 1978 that Kennedy was "probably" assassinated as the result of a conspiracy involving a second gunman, a finding that broke from the Warren Commission's belief that Lee Harvey Oswald acted alone in Dallas on Nov. 22, 1963.
Third Step: For each representative r, select a set of mentions (for example, the representative President John F. Kennedy yields the mentions President Kennedy, Kennedy, and JFK).
Underlying probability distributions: $P(m \mid r)$ and $P(M^d \mid R^d) \sim \prod_i \prod_{m \in M_i^d} P(m \mid r_i^d)$.
$M^d$: actual mentions in d.
The Justice Department has officially ended its inquiry into the assassinations of President John F. Kennedy and Martin Luther King Jr., finding "no persuasive evidence" to support conspiracy theories, according to department documents. The House Assassinations Committee concluded in 1978 that Kennedy was "probably" assassinated as the result of a conspiracy involving a second gunman, a finding that broke from the Warren Commission's belief that Lee Harvey Oswald acted alone in Dallas on Nov. 22, 1963. … President Kennedy … JFK …
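[Figure: the three-step generative process for a document d.]
Step 1, $P(E^d)$: from the entity set E (elements e), select the document's entities $E^d$ (elements $e_i^d$).
Step 2, $P(r \mid e)$: for each entity, select a representative, giving $R^d$ (elements $r_i^d$). Examples: President John F. Kennedy; House of Representatives.
Step 3, $P(m \mid r)$: for each representative, select a set of mentions, giving $M^d$ (sets $M_i^d$). Examples: {President Kennedy, Kennedy, JFK}; {House of Representatives, The House}.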
Robust Reading
Assuming we have the model, the fundamental problem is to decide which entities are mentioned in a given document and which entity each mention most likely refers to.
$(E^d, R^d)^* = \arg\max_{(E^d, R^d)} P(E^d, R^d \mid M^d, \Theta) = \arg\max_{(E^d, R^d)} P(E^d, R^d, M^d \mid \Theta)$
[Figure: Model I, Model II, and Model III shown against the generative diagram with levels E (e), d, $E^d$ ($e_i^d$), $R^d$ ($r_i^d$), and $M^d$ ($M_i^d$).]
Model I (the simplest)
[Figure: entities e1, e2, e3, … paired with mentions m1, m2, m3, …, for example Kennedy and President Kennedy.]
$P(D) = P(\{(e_i, m_i)\}) = \prod_i P(e_i)\, P(m_i \mid e_i)$
The most likely entity $e^*$ for mention m:
$e^* = \arg\max_{e \in E} P(e \mid m, \Theta) = \arg\max_{e \in E} P(e)\, P(m \mid e)$.
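The Model I decision rule above can be sketched in a few lines of Python. This is only an illustration: the entity names, the probability tables, and the function name `most_likely_entity` are hypothetical, and the numbers are made up rather than taken from the talk.

```python
# Hypothetical toy parameters; none of these numbers come from the talk.
PRIOR = {"John F. Kennedy": 0.7, "David Kennedy": 0.3}  # P(e)
APPEARANCE = {  # P(m | e)
    "John F. Kennedy": {"Kennedy": 0.5, "JFK": 0.3, "President Kennedy": 0.2},
    "David Kennedy": {"Kennedy": 0.6, "David Kennedy": 0.4},
}


def most_likely_entity(mention, prior=PRIOR, appearance=APPEARANCE):
    """Return argmax_e P(e) * P(mention | e), or None if every score is zero."""
    best_entity, best_score = None, 0.0
    for entity, p_e in prior.items():
        score = p_e * appearance[entity].get(mention, 0.0)
        if score > best_score:
            best_entity, best_score = entity, score
    return best_entity
```

With these toy numbers the ambiguous mention "Kennedy" resolves to the more frequent entity, because 0.7 × 0.5 exceeds 0.3 × 0.6. Context plays no role here, which is why Model I is the simplest of the three.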
Model II
[Figure: levels E (e) and $E^d$ ($e_i^d$); example mentions: President John F. Kennedy, JFK, Kennedy, President Jonh F. Kennedy.]
$P(E^d) = \prod_i P(e_i^d)$
Model III (least relaxation)
[Figure: levels E (e) and $E^d$ ($e_i^d$).]
$P(E^d) = P(e_0) \prod_i P(e_i^d \mid e_{i-1}^d)$
• Mentions are independently selected, while entities are selected according to a Markov chain.
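As a rough illustration of the Markov-chain entity prior, the sketch below scores an ordered list of entities in log space. It assumes that $e_0$ is simply the first entity in the list; the function name and the table layout (`start`, `transition`) are hypothetical choices for this example.

```python
import math


def log_prob_entity_chain(entities, start, transition):
    """Log of P(e_0) * prod_i P(e_i | e_{i-1}) for an ordered entity list.

    start[e] is an assumed initial distribution and transition[prev][cur]
    an assumed first-order transition table; both are toy inputs.
    """
    if not entities:
        return 0.0
    logp = math.log(start[entities[0]])
    for prev, cur in zip(entities, entities[1:]):
        logp += math.log(transition[prev][cur])
    return logp
```

Working in log space avoids underflow when a document contains many entities.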
Learning the Models
• Unsupervised learning: assuming only that we know $M^d$ for each d in D, we aim to learn Θ from D and, in the process, to label D.
• Truncated EM algorithm: a greedy search algorithm.
  • Initialization: perform local clustering of mentions to find $\langle E^d, R^d \rangle$ inside each document, based on a simple similarity metric between names.
  • M-step: $\Theta^* = \arg\max_{\Theta} P(D = \{\langle E^d, R^d, M^d \rangle\} \mid \Theta)$.
  • E-step: $(E^d, R^d)^* = \arg\max_{(E^d, R^d)} P(D = \{\langle E^d, R^d, M^d \rangle\} \mid \Theta)$.
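The alternation between the M-step and the E-step can be sketched as a hard-assignment EM loop. The sketch below is a drastic simplification and not the procedure from the talk: it uses only the Model I factorization P(e) P(m | e), takes the initial labels as given instead of clustering with a name-similarity metric, and ignores representatives. All function names are hypothetical.

```python
from collections import Counter


def m_step(mentions, labels):
    """Maximum likelihood estimates of P(e) and P(m | e) from hard labels."""
    entity_counts = Counter(labels)
    pair_counts = Counter(zip(mentions, labels))
    prior = {e: c / len(labels) for e, c in entity_counts.items()}
    emit = {(m, e): c / entity_counts[e] for (m, e), c in pair_counts.items()}
    return prior, emit


def e_step(mentions, prior, emit):
    """Relabel every mention with argmax_e P(e) * P(m | e)."""
    return [
        max(prior, key=lambda e: prior[e] * emit.get((m, e), 0.0))
        for m in mentions
    ]


def truncated_em(mentions, init_labels, max_iters=10):
    """Alternate M-step and E-step until the labels stop changing."""
    labels = list(init_labels)
    for _ in range(max_iters):
        prior, emit = m_step(mentions, labels)
        new_labels = e_step(mentions, prior, emit)
        if new_labels == labels:
            break
        labels = new_labels
    return labels
```

Because each E-step keeps only the single best labeling rather than a full posterior, the loop behaves as a greedy search and can stop in a local optimum that depends on the initialization.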
Parameter Estimation
• In the learning process, assuming we have obtained labeled documents $D = \{(E^d, R^d, M^d)\}$ from the previous initialization step or E-step, we perform maximum likelihood estimation of the model parameters in each M-step.
• The parameters are $P(e)$, $P(e_2 \mid e_1)$, and the appearance probability $P_{W|W}$ (for example, $P(m \mid r)$).
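For the entity-level parameters, maximum likelihood estimation reduces to relative-frequency counting over the labeled documents. The sketch below assumes each labeled document is represented as an ordered list of entity labels; that representation and the function names are assumptions made for illustration, and no smoothing is applied.

```python
from collections import Counter, defaultdict


def estimate_prior(labeled_docs):
    """MLE of P(e): relative frequency of each entity over all documents."""
    counts = Counter(e for doc in labeled_docs for e in doc)
    total = sum(counts.values())
    return {e: c / total for e, c in counts.items()}


def estimate_transitions(labeled_docs):
    """MLE of P(e2 | e1) from adjacent entity pairs within each document."""
    pair_counts = defaultdict(Counter)
    for doc in labeled_docs:
        for e1, e2 in zip(doc, doc[1:]):
            pair_counts[e1][e2] += 1
    return {
        e1: {e2: c / sum(following.values()) for e2, c in following.items()}
        for e1, following in pair_counts.items()
    }
```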
The Appearance Probability $P(m \mid r)$
[Figure: the mentions Kennedy, JFK, and President Kennedy derived from the representative President John F. Kennedy.]
• Appearance probability: the probability of one name being transformed from another: $P(\text{President Kennedy} \mid \text{President John F. Kennedy}) = \prod_k P(v'_k \mid v_k)$.
• Attributes: FirstName, LastName, Title, Suffix, Gender.
• $P(v'_k \mid v_k)$ is modelled relationally as a multinomial distribution over a set of predefined values:
  • Identical Writing, Typical Transformation, Non-typical Transformation, Missing Value.
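The attribute-wise product can be illustrated with a small sketch. Everything concrete here is an assumption: the probabilities in `TRANSFORM_PROB` are invented, a single multinomial is shared across all attributes, and the `TYPICAL` table standing in for "typical transformations" is a toy placeholder.

```python
from math import prod

# Hypothetical multinomial over the four transformation types (made-up values).
TRANSFORM_PROB = {
    "identical": 0.60,
    "typical": 0.25,
    "non_typical": 0.05,
    "missing": 0.10,
}
# Assumed toy table of typical transformations (representative -> mention).
TYPICAL = {("John", "Jack"), ("John", "J.")}


def transformation_type(rep_value, mention_value):
    """Classify how one attribute value of r appears in the mention m."""
    if mention_value is None:
        return "missing"
    if mention_value == rep_value:
        return "identical"
    if (rep_value, mention_value) in TYPICAL:
        return "typical"
    return "non_typical"


def appearance_prob(rep, mention):
    """P(m | r) as the product over the attributes k of P(v'_k | v_k)."""
    return prod(
        TRANSFORM_PROB[transformation_type(value, mention.get(attr))]
        for attr, value in rep.items()
    )
```

For example, with the representative `{"Title": "President", "FirstName": "John", "LastName": "Kennedy"}` and the mention `{"Title": "President", "LastName": "Kennedy"}`, the title and last name are identical and the first name is missing, so the product is 0.6 × 0.1 × 0.6.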
Experimental Setting
• Data: 300 TREC documents (New York Times), 8000 mentions, 2000 entities.
• Processed with a named entity recognizer: People, Location, and Organization.
• Each pair of names is a test example; 130,000 of them are positive examples.
• Evaluation: Precision, Recall, and F1.
Performance
• Baseline: predict (m1, m2) as positive iff they have identical writings.
• Discriminative: cluster based on the SoftTFIDF entity similarity metric.
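The pairwise evaluation and the identical-writing baseline can be sketched as follows. The function names and the representation of gold labels as a parallel list of entity ids are assumptions for illustration; the SoftTFIDF system is not reproduced here.

```python
from itertools import combinations


def identical_writing(m1, m2):
    """Baseline predictor: two mentions corefer iff their strings match."""
    return m1 == m2


def pairwise_prf(mentions, gold_ids, predict_same):
    """Precision, recall, and F1 over all unordered pairs of mentions."""
    tp = fp = fn = 0
    for i, j in combinations(range(len(mentions)), 2):
        truth = gold_ids[i] == gold_ids[j]
        guess = predict_same(mentions[i], mentions[j])
        if truth and guess:
            tp += 1
        elif guess:
            fp += 1
        elif truth:
            fn += 1
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1
```

The baseline fails in both directions on ambiguous names: it links two different people who are both written "Kennedy" and misses the link between "Kennedy" and "JFK".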
Conclusions
• We presented an unsupervised learning approach to the “Robust Reading” problem.
• We designed a generative model that describes the natural generation process of a document and how names are “sprinkled” into it.
• Our model can achieve promising results (89% F1) on news articles.
Future Directions
• Integrate with more contextual information.
• Integrate with general coreference resolution.
• Integrate with other NLP tasks.
Thank You! Presented by Xin Li, xli1@uiuc.edu. A demo of this work is at http://l2r.cs.uiuc.edu/~cogcomp/eoh/index.html
The Basic Model
• A global probabilistic view of how documents are generated:
  • a joint distribution over entities, $P(E^d)$;
  • an “author” model that makes sure that at least one mention of an entity is easily identifiable, $P(r \mid e)$;
  • an appearance model governing how mentions are transformed from the “representative” mention, $P(m \mid r)$.
$P(d) = P(E^d, R^d, M^d) = P(E^d)\, P(R^d \mid E^d)\, P(M^d \mid R^d)$
$P(D) = \prod_{d \in D} P(d)$
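The factorization of P(d) can be made concrete with a short sketch that sums the three log terms. The document representation (a list of entity, representative, mentions triples), the function name, and the pluggable `log_p_entities` callback are all assumptions for this example; the callback stands in for whichever entity model is used.

```python
import math


def log_prob_document(doc, log_p_entities, p_rep, p_mention):
    """Log P(d) = log P(E^d) + log P(R^d | E^d) + log P(M^d | R^d).

    doc is a list of (entity, representative, mentions) triples;
    p_rep[e][r] and p_mention[r][m] are assumed toy probability tables.
    """
    logp = log_p_entities([entity for entity, _, _ in doc])
    for entity, rep, mentions in doc:
        logp += math.log(p_rep[entity][rep])
        logp += sum(math.log(p_mention[rep][m]) for m in mentions)
    return logp
```

Summing `log_prob_document` over all documents gives log P(D), since the corpus probability is the product of the per-document probabilities.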