Coreference Resolution Seminar by Satpreet Arora (07D05003)
What is Coreference? • In linguistics, coreference occurs when multiple expressions in a sentence or document refer to the same entity. • Example: Aditya went to Videorec to buy a DVD for himself. He had frequented the store for many years now. Here, Aditya, himself and He are coreferent; Videorec and the store are also coreferent.
Coreference Resolution • Determine which expressions in a document/discourse have the same referent. • In NLP, we usually deal with coreference resolution of NPs. • The coreference system has to form equivalence classes of NPs that have the same real-world entity as their referent. • The coreference relation is both transitive and symmetric.
A Coreference System • Input of a coreference system: John Simon, Chief Financial Officer of Prime Corp. since 1986, saw his pay jump 20%, to $1.3 million, as the 37-year-old also became the financial-service company’s president. • Output of a coreference system: [JS John Simon], [JS Chief Financial Officer] of [PC Prime Corp.] since 1986, saw [JS his] pay jump 20%, to $1.3 million, as [JS the 37-year-old] also became the [PC financial-service company]’s [JS president]. • Equivalence classes: JS: {John Simon, Chief Financial Officer, his, the 37-year-old, president} PC: {Prime Corp., financial-service company}
Anaphora Resolution • Anaphora refers to the linguistic phenomenon of having a noun phrase refer to a previously mentioned entity in a text for its semantic interpretation. • In other words, a pair of NPs <np_i, np_j> constitutes an anaphoric relationship if i < j and np_j depends on np_i for its interpretation, where np_k denotes the kth NP in a document. • For instance, in the sentence “Queen Elizabeth set about transforming her husband”, the NP pair <Queen Elizabeth, her> forms an anaphoric relationship.
Coreference Resolution - an Important Problem Applications include: • Question answering systems and Information Retrieval: Consider the query “Where was Mozart born?”. A question answering system may first retrieve the sentence “He was born in Salzburg” from a document about Mozart. The system will return the correct answer only if it can determine that the pronoun He is coreferent with Mozart. • Machine Translation: Anaphora resolution comes into play when discrepancies exist between the two languages with respect to anaphor selection. For example, a pronoun in the Malay language is often translated directly by its antecedent (Mitkov, 1999).
Coreference Resolution - an Important Problem (cont’d) • Text Summarization: Summarization tools that use coreference resolution not only include in the summary those sentences that contain a term appearing in the query; they also incorporate sentences containing a noun phrase that is coreferent with a term occurring in a sentence already selected by the system. • Cross-document coreference is particularly useful for summarization systems that need to identify and merge the same piece of information about an entity mentioned in different documents, in order to avoid repetition.
Coreference Resolution - a Hard Problem The difficulty of the problem lies in its dependence on sophisticated semantic and world knowledge. • The policemen refused the women a permit for the demonstration because they feared violence. • The policemen refused the women a permit for the demonstration because they advocated violence. • Observe how they refers to two different entities in the two sentences depending on the context (the policemen in the first, the women in the second). It’s easy for humans but difficult for machines.
Coreference Resolution - a Hard Problem (cont’d) • Many sources of information play a role: • Lexical information such as head noun matches (as in Lord Spencer and Mr. Spencer) is an indicator of coreference, although it is not an absolute indicator (e.g. Lord Spencer and Diana Spencer are not coreferent). • Knowledge sources such as gender and number, semantic class, discourse focus, and world knowledge also play a role in determining whether two discourse entities are coreferent.
Coreference Resolution - a Hard Problem (cont’d) • No single source of knowledge is a completely reliable indicator: • For example, two semantically compatible NPs are potentially coreferent (e.g. Diana Spencer and the princess), but whether the NPs are actually coreferent depends on other factors (such as contextual information). • Linguistic constraints indicating (non-)coreference, such as number (dis)agreement, are not absolutely hard (e.g. the singular NP assassination (of her bodyguards) can be coreferent with the plural NP these murders).
Coreference Resolution - a Hard Problem (cont’d) • Coreference strategies differ depending on the type of NP: • Definite NPs are more likely to be anaphoric than their non-definite counterparts (e.g. the article immediately preceding photographer in the sentence “Diana saw the/a photographer following her secretly” determines whether the NP has an existential or definite reading). • Pronoun resolution is difficult because resolution strategies differ for each type of pronoun (e.g. reflexives versus possessives) and also because some pronouns, such as pleonastic pronouns, are semantically empty (e.g. the pronoun it in the sentence “Camilla went outside and it was raining” is pleonastic).
The Algorithm Many different algorithms use different approaches to solve the problem. However, they all share some basic components (for non-machine-learning-based algorithms): Step 1: Identification of discourse entities (NPs); Step 2: Representation of NPs (as a set of features); Step 3: Calculating distances between NPs using a distance metric; Step 4: Creating equivalence classes using a clustering algorithm or other classification tools.
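Below is a minimal, runnable sketch of this four-step pipeline. Every function in it is a deliberately trivial stub, shown only to make the data flow concrete; the following slides describe what real implementations of each step look like.

```python
# A minimal, runnable skeleton of the four-step pipeline. All four
# functions are trivial stubs for illustration only; real systems
# replace each one with the components described on later slides.

def identify_noun_phrases(document):
    # Step 1 (stub): treat capitalized tokens as the discourse entities.
    return [tok for tok in document.split() if tok[0].isupper()]

def extract_features(np):
    # Step 2 (stub): a single feature, the lowercased head word.
    return {"head": np.lower()}

def distance(v1, v2):
    # Step 3 (stub): 0 if head words match, 1 otherwise.
    return 0.0 if v1["head"] == v2["head"] else 1.0

def cluster(nps, features, radius=0.5):
    # Step 4 (stub): greedily merge each NP into the first compatible class.
    classes = []
    for np, feats in zip(nps, features):
        for cls in classes:
            if all(distance(feats, f) <= radius for _, f in cls):
                cls.append((np, feats))
                break
        else:
            classes.append([(np, feats)])
    return [[np for np, _ in cls] for cls in classes]

def resolve(document):
    nps = identify_noun_phrases(document)
    features = [extract_features(np) for np in nps]
    return cluster(nps, features)

print(resolve("Aditya went to Videorec . Aditya bought a DVD ."))
# -> [['Aditya', 'Aditya'], ['Videorec'], ['DVD']]
```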
Identification of Discourse Entities For coreference resolution algorithms, the first task is to identify all of the noun phrases in the text. These textual elements can be definite noun phrases, demonstrative noun phrases, proper names, appositives, sub-noun phrases that act as modifiers, pronouns, and so on. The basic structure of the identification is as follows:
Identification of Discourse Entities (cont’d) • Tokenization and Morphological Processing: Splitting the text into sentences and stemming words to their root forms. • POS tagging: A Hidden Markov Model based statistical POS tagger. • Noun Phrase Identification: Determines noun phrase boundaries based on the POS tags. • Named Entity Recognition: May also be HMM based; learns from a tagged corpus of named entities. If named entities overlap with identified noun phrases, the boundaries are adjusted. • Nested Noun Phrase Extraction: Accepts noun phrases and determines the nested phrases (if any).
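As a hedged illustration (the components above are described as HMM-based; spaCy’s models are not), the same identification stages can be reproduced with the spaCy library, assuming the en_core_web_sm model is installed:

```python
# A rough modern stand-in for the identification pipeline, using spaCy
# (assumes: pip install spacy && python -m spacy download en_core_web_sm).
# spaCy bundles the same stages: tokenization, tagging, NP chunking, NER.
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("John Simon, Chief Financial Officer of Prime Corp. since 1986, "
          "saw his pay jump 20%, to $1.3 million.")

# Noun phrase identification: base NP boundaries from the parser.
for np in doc.noun_chunks:
    print("NP:", np.text)

# Named entity recognition: entity mentions that may overlap with the NPs.
for ent in doc.ents:
    print("NE:", ent.text, ent.label_)
```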
Representation of NPs Each NP is represented as a set of features, which are used to construct its feature vector: • individual words in the NP; • head noun: last word of the NP; • position in the document; • pronoun type: nominative, accusative, possessive, ambiguous; • article: indefinite, definite, none; • appositive: based on heuristics (commas, etc.); • number: plural, singular; • proper name: based on heuristics (capitalization, etc.); • semantic class: based on WordNet; • gender: masculine, feminine, either, neuter; • animacy: based on semantic class.
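A sketch of how these features might be gathered into a record is shown below; the field names mirror the list above, but the extraction heuristics are simplified assumptions, not the heuristics of any cited system.

```python
# Sketch of the NP feature vector. The fields mirror the slide; the
# extraction heuristics in simple_features() are deliberately crude
# assumptions made for illustration.
from dataclasses import dataclass

@dataclass
class NPFeatures:
    words: tuple          # individual words in the NP
    head: str             # head noun: last word of the NP
    position: int         # index of the NP in the document
    pronoun_type: str     # nominative / accusative / possessive / ambiguous / none
    article: str          # indefinite / definite / none
    appositive: bool      # comma-based heuristic
    number: str           # singular / plural
    proper_name: bool     # capitalization heuristic
    semantic_class: str   # e.g. from WordNet hypernyms
    gender: str           # masculine / feminine / either / neuter
    animacy: bool         # derived from the semantic class

def simple_features(np_text, position):
    words = tuple(np_text.split())
    first, last = words[0].lower(), words[-1].lower()
    return NPFeatures(
        words=words,
        head=last,
        position=position,
        pronoun_type="none",
        article="definite" if first == "the" else
                "indefinite" if first in ("a", "an") else "none",
        appositive=False,
        number="plural" if last.endswith("s") else "singular",
        proper_name=all(w[0].isupper() for w in words),
        semantic_class="unknown",
        gender="either",
        animacy=False,
    )

print(simple_features("the financial-service company", 7))
```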
Calculating distances between NPs • Different algorithms use different distance metrics. We present here the one from Cardie and Wagstaff (1999), “Noun phrase coreference as clustering”, together with the corresponding clustering algorithm. • The distance between noun phrases NP1 and NP2 is defined as: • dist(NP1, NP2) = Σ_{f ∈ F} w_f · incompatibility_f(NP1, NP2) • F: the set of features • w_f: the weight of feature f • incompatibility_f(NP1, NP2): the degree of incompatibility of NP1 and NP2 with respect to feature f
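A sketch of this metric in code, under assumed weights: only two illustrative features are shown (the paper uses a larger feature set), with an infinite weight encoding a hard incompatibility, as in the original:

```python
# Sketch of the weighted-incompatibility distance. The two features and
# the weights below are illustrative assumptions, not the paper's full
# feature set; math.inf encodes a hard (non-mergeable) incompatibility.
import math

def head_mismatch(np1, np2):
    return 0.0 if np1["head"] == np2["head"] else 1.0

def number_mismatch(np1, np2):
    return 0.0 if np1["number"] == np2["number"] else 1.0

WEIGHTED_FEATURES = [
    (head_mismatch, 1.0),          # soft preference
    (number_mismatch, math.inf),   # number disagreement forbids merging
]

def dist(np1, np2):
    total = 0.0
    for f, w in WEIGHTED_FEATURES:
        incompat = f(np1, np2)
        if incompat:               # skip zero terms (avoids inf * 0 = nan)
            total += w * incompat
    return total

a = {"head": "spencer", "number": "singular"}
b = {"head": "spencer", "number": "plural"}
print(dist(a, b))   # inf -> these two NPs can never be merged
```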
Clustering Properties of the clustering algorithm: • start from the end of the document and work backwards; • if the distance between two NPs is less than r, then their equivalence classes are considered for merging; • classes can be merged unless they contain incompatible NPs; • the algorithm automatically computes the transitive closure of the coreference relation; • two NPs can be coreferent even if dist(NP1, NP2) > r, as long as dist(NP1, NP2) ≠ ∞; • r is a free parameter of the algorithm.
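A sketch of that clustering loop, reusing dist() from the previous snippet; the backward traversal and the merge test follow the property list above:

```python
# Sketch of the greedy backward clustering, reusing dist() from the
# previous snippet. nps is a list of feature dicts in document order.
import math

def cluster(nps, r):
    # class_of[k] is the id of the equivalence class containing NP k
    class_of = list(range(len(nps)))

    for j in range(len(nps) - 1, 0, -1):        # from the end, backwards
        for i in range(j - 1, -1, -1):
            if dist(nps[i], nps[j]) > r or class_of[i] == class_of[j]:
                continue
            ci = [k for k in range(len(nps)) if class_of[k] == class_of[i]]
            cj = [k for k in range(len(nps)) if class_of[k] == class_of[j]]
            # merge only if no cross-class pair is hard-incompatible
            if all(dist(nps[a], nps[b]) != math.inf for a in ci for b in cj):
                for k in cj:
                    class_of[k] = class_of[i]

    groups = {}
    for k, c in enumerate(class_of):
        groups.setdefault(c, []).append(k)
    return list(groups.values())
```

Because merging operates on whole equivalence classes, the transitive closure falls out automatically, and two NPs farther apart than r can still end up coreferent through intermediate mentions, exactly as the property list states.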
Machine Learning algorithms • In the algorithm we just saw, the weight of each feature is fixed manually. • Unlike manual approaches, machine learning approaches to coreference resolution automatically induce, from annotated data, a model that determines the probability that two NPs are coreferent. • They can be characterized in terms of the knowledge sources employed, the method of training data creation, and the learning and clustering algorithms chosen.
Machine Learning algorithms • Training Data Creation: • Positive instances: Two different methods are used to create positive training instances (Aone et al., 1995): 1) Transitive: an instance is formed between an NP and each of its preceding NPs in the same anaphoric chain; 2) Non-transitive: an instance is formed between an NP and its closest preceding NP in the same anaphoric chain. • Negative instances: 1) Negative instances are generated by pairing an NP with each preceding NP that does not have an anaphoric relationship with it (Aone et al., 1995). 2) To reduce the ratio of negative to positive instances, a negative instance is created by pairing an anaphoric NP, np_j, with each NP appearing between np_j and its closest preceding antecedent (Soon et al., 2001).
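The instance-creation schemes can be sketched as follows; the chain representation (ordered lists of NP indices, one list per entity) is an assumption made for illustration. The sketch combines the non-transitive positive scheme with the Soon et al. negative scheme:

```python
# Sketch of Soon et al. (2001)-style training instance creation. Each
# chain is an ordered list of NP indices for one entity; this input
# representation is an assumption made for illustration.

def soon_instances(chains):
    antecedent = {}                       # NP index -> closest prior mention
    for chain in chains:
        for prev, cur in zip(chain, chain[1:]):
            antecedent[cur] = prev

    positives, negatives = [], []
    for j, i in antecedent.items():
        positives.append((i, j))          # non-transitive positive instance
        for k in range(i + 1, j):         # NPs between the pair: negatives
            negatives.append((k, j))
    return positives, negatives

# Example: six NPs, with {0, 2, 5} one chain and {1, 4} another.
pos, neg = soon_instances([[0, 2, 5], [1, 4]])
print(pos)   # [(0, 2), (2, 5), (1, 4)]
print(neg)   # [(1, 2), (3, 5), (4, 5), (2, 4), (3, 4)]
```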
Learning Algorithm • A lot of recent research involving machine learning techniques uses decision trees for classifying NP pairs. Soon et al. (2001) used the C5 decision tree learner.
Learning Algorithm (cont’d) • Each pair of NPs is presented to the classifier and the classifier returns a probability or certainty value of the pair being coreferent. • All pairs which receive a probability value greater than a threshold value are considered as being coreferent. • The algorithm then constructs the transitive closure of all the pairs and thus achieves partitioning of all the NPs into coreferent classes.
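A sketch of this link-then-partition step; scikit-learn’s DecisionTreeClassifier stands in for the C5 learner, and the toy pairwise feature vectors (e.g. head match, number agreement) are assumptions for illustration:

```python
# Sketch of the pairwise classify -> threshold -> transitive closure
# step. sklearn's DecisionTreeClassifier is an assumed stand-in for C5;
# the toy features [head_match, number_agree] are illustrative only.
from sklearn.tree import DecisionTreeClassifier

# Train the stand-in tree on assumed pairwise vectors with binary labels.
X_train = [[1, 1], [1, 0], [0, 1], [0, 0]]
y_train = [1, 1, 0, 0]
clf = DecisionTreeClassifier().fit(X_train, y_train)

def coreference_partition(pairs, pair_features, num_nps, threshold=0.5):
    # Keep NP pairs whose predicted coreference probability exceeds the
    # threshold (column 1 = probability of the positive class).
    probs = clf.predict_proba(pair_features)[:, 1]
    links = [p for p, prob in zip(pairs, probs) if prob > threshold]

    # Transitive closure of the accepted links via union-find.
    parent = list(range(num_nps))
    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x
    for i, j in links:
        parent[find(i)] = find(j)

    classes = {}
    for k in range(num_nps):
        classes.setdefault(find(k), []).append(k)
    return list(classes.values())

# Pairs (i, j) of NP indices with their assumed feature vectors.
print(coreference_partition([(0, 1), (1, 2), (0, 3)],
                            [[1, 1], [1, 0], [0, 0]], num_nps=4))
# -> [[0, 1, 2], [3]]
```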
• Clustering algorithms are also used in machine learning approaches. • The relative importance of all the factors discussed previously is learnt from the training corpus instead of being fixed manually. This allows a larger number of factors to be considered. • In principle, learning-based systems are more robust than knowledge-based systems in the face of noisy input (i.e. input with exceptions to the rules). Machine learning algorithms also adapt more easily to different topics.
Conclusion • Machine learning approaches to coreference resolution have been shown to be a promising way to build robust coreference resolution systems. • Despite the successful application of machine learning techniques to coreference resolution, the problem is far from being solved. • Linguistics combined with machine learning techniques can prove effective in solving the coreference problem. • Coreference resolution is one of the most difficult problems in language understanding. Given that NLP is often said to be “AI-complete”, we might conclude that coreference resolution is among the hardest of the hardest problems in artificial intelligence.
References • Ng, Vincent (2002). Machine Learning for Coreference Resolution: Recent Successes and Future Challenges. • Cardie, Claire and Kiri Wagstaff (1999). Noun Phrase Coreference as Clustering. • Byron, D. and J. Tetreault (1999). A Flexible Architecture for Reference Resolution. In Proc. of the 9th EACL. • Ng, Vincent (2008). Unsupervised Models for Coreference Resolution. • Soon, Wee Meng, Daniel Chung Yong Lim, and Hwee Tou Ng (2001). A Machine Learning Approach to Coreference Resolution of Noun Phrases. DSO National Laboratories. • http://www.cs.tau.ac.il/~nachumd/NLP/2010/Anaphora.pdf • http://www.inf.ed.ac.uk/teaching/courses/nlu/lectures/nlu_l16.pdf • http://www.dfki.de/~loeckelt/ss2010/presentations/coreference_resolution.pdf • Wikipedia