10 likes | 159 Views
Exploring Unsupervised and Knowledge-Rich Approaches to Coreference Resolution. Nathan Gilbert & Ellen Riloff University of Utah. Introduction. The current state of the art. Unsupervised Learning.
E N D
Exploring Unsupervised and Knowledge-Rich Approaches to Coreference Resolution. Nathan Gilbert & Ellen Riloff University of Utah Introduction The current state of the art Unsupervised Learning Coreference resolution is the task of identifying coreferent expressions in text and is of great interest to the field of Natural Language Processing. Currently, the best Machine Learning approaches to this task have involved using supervised learning algorithms, which requires annotated corpora. The goal of this research is two-fold: to develop unsupervised methods for coreference resolution and to incorporate domain-specific knowledge to improve resolution. The initial stage of this project will be to implement a state of the art supervised learning based coreference system. The general flow of this system is presented next, notice that for supervised learning systems, the classifier must be trained on annotated text. This will be used as a baseline comparison for the next stage, which is to build a coreference system that uses only un-annotated text for classification. This work is still on going at the University of Utah, the following is an overview of previous work and our motivation for this project. In recent years, a lot of work has gone into building coreference systems that relied on supervised learning to classify textual entities as coreferent. One such system that has garnered a lot of attention is the work of Soon et al. [1] The following figure displays the system pipeline from free text to “Markables”, which was the name they gave to possible coreferential entities. As mentioned previously, supervised learning requires an annotated copora. The process of “hand-crafting” this corpora can be quite time consuming, this fact alone is impetus for pushing the field of coreference resolution towards unsupervised techniques. Previous work in unsupervised coreference resolution has been focused solely on the idea that coreference can be viewed as clusterings of noun phrases. [2] While this approach performed respectively on the MUC-6 and MUC-7 data sets, more work is needed before unsupervised methods will be at the forefront of coreference research. Our research hopes continue building upon the momentum of the unsupervised approach to coreference that was started by the Cardie and Wagstaff group and to add our own spin in the form of incorporating knowledge-rich features. Knowledge-Rich features From here, the markables would be passed to a Decision Tree based classifier that had been previously trained on annotated text. The results of this system on the Message Understanding Conference 6 & 7 (MUC) data sets are below: • Improving coreference through “domain-specific” information. • Current coreference systems primarily consist of what can be considered “knowledge-poor” feature sets. What this means is that majority of characteristics that each classification is based do not require “real-world” information. • For instance, in the feature set from the Soon group, the majority of the features are based off of basic syntactic or lexical based information. • Our research is attempting to incorporate such “real-world” or “domain-specific” information into the classification step in order to improve resolution. The specific form of this information comes in terms of “relationships” between textual entities. • Using relationship information to improve coreference. • “Investor Albert Kahn said in a statement that a group he heads increased its stake in Trans-Lux Corp to 8.9 pct from 8.1 pct on a fully diluted basis. Theinvestor is also considering seeking representation on the Trans-Lux board and starting a proxy contest in connection with the upcoming annual meeting.” Given the relationship: Albert Kahn WORK_AS investor, can we improve the chances that our system chooses “Albert Kahn” as the antecedent for “The investor”? This is one of the questions we are trying to answer through this research. Aside from relationships, our system will also incorporate information about contextual role knowledge to evaluate possible coreferential entities. In the work of Bean & Riloff, a system was developed to determine if the context in which the antecedent and anaphor appeared were compatible. The intuition being, if two entities are coreferential, they should appear in the same role(s). [4] Given the sentences: • Jose Maria Martinez, Roberto Lisandy, and Dino Rossy, who were staying at a Tecun Uman hotel, were kidnapped by armed men who took them to an unknown place. • After they were kidnapped... • After they blindfolded the men... Clearly, in (b) “they” refers to the victims and to the kidnappers in (c). Features for Coreference Resolution What is coreference? • Good Features for coreference resolution need to be strong enough determine if two textual entities corefer, but flexible enough to perform against examples that have not be encountered before. • The following list were used by the Soon et al system, many of which will appear in the current baseline system being developed. • Distance – The distance, in sentences, between anaphor and antecedent. • Pronoun – Are either the antecedent or anaphor a pronoun? • String Match – Are the anaphor and antecedent the same string? • Definite NP – Checks that the anaphor is a definite noun phrase. • Demonstrative NP – Essentially checks if the anaphor is a demonstrative noun phrase. • Number Agreement – Do both the anaphor and antecedent agree in number? • Semantic Class Agreement – Uses WordNet, an online lexical database, to determine the semantic class of the entities in question. • Gender Agreement – Use name lists, and titles to determine gender of the antecedent and anaphor. • Apposition – Do the antecedent and anaphor exist in an appositive relation? Two textual entities that refer to the same object in “the real world.” [3] More Formal Definition: • α1 and α2 corefer if and only if Referent(α1) = Referent(α2). • Referent(α) is 'the entity referred to by α.' Coreference is therefore an equivalence relation for the following three properties hold: reflexive, symmetric and transitive. Often α1 and α2 are referred to as the antecedent and anaphor respectively. Specific examples of definite noun phrases are always coreferential: David Beckham is the L.A. Galaxy midfielder. The proper name David Beckham and the definite noun phrase the L.A. Galaxy midfielder are coreferential by the above definition. Appositives are also good examples of coreference: Mary, vice president of public affairs, ... References Wee Meng Soon, Hwee Tou Ng, and Daniel Chung Yong Lim. A machine learning approach to coreference resolution of noun phrases. HLT/NAACL, 2004. Claire Cardie and Kiri Wagstaff. Noun phrase coreference as clustering. Proceedings of the Joint SIGDAT Conference on EML, 199. Ruslan Mitkov. Anaphora Resolution. Longman, 2002 David Bean and Ellen Riloff. Unsupervised learning of contextual role knowledge for coreference resolution. HLT/NAACL, 2004 Ellen Riloff: riloff@cs.utah.edu Nathan Gilbert: ngilbert@cs.utah.edu This work is a collaboration between the University of Utah, Cornell University and Lawrence Livermore National Labs.