Ch. 8 Lexical Acquisition (10-05-2009)
Introduction
• In this chapter, we look at the acquisition of more complex syntactic and semantic properties of words.
• Main areas covered in this chapter:
• Verb subcategorization
• Attachment ambiguity
• Selectional preferences
• Semantic categorization
Introduction (1)
• The general goal of lexical acquisition:
• Develop algorithms and statistical techniques for filling the holes in existing machine-readable dictionaries by looking at the occurrence patterns of words in large text corpora.
• There are many lexical acquisition problems besides collocations:
• Selectional preferences
• Subcategorization
• Semantic categorization
Introduction (2)
• Lexicon
• The part of the grammar of a language that includes the lexical entries for all the words and/or morphemes in the language, and that may also include various other information, depending on the particular theory of grammar.
• (8.1)
(a) The children ate the cake with their hands.
(b) The children ate the cake with blue icing.
8.1 Evaluation Measures (1)
• Evaluation in IR makes frequent use of the notions of precision and recall, and their use has crossed over into work on evaluating statistical NLP models.
• Precision & recall, defined over the selected set (what the system returns) and the target set (what it should return): tp = true positives, fp = false positives, fn = false negatives, tn = true negatives.
• Precision = tp / (tp + fp)
• Recall = tp / (tp + fn)
8.1 Evaluation Measures (2)
• F measure
• Combines precision and recall in a single measure of overall performance:
• F = 1 / (α/P + (1 − α)/R)
• P: precision, R: recall, α: a factor that determines the weighting of P and R.
• α = 0.5 is often chosen for equal weighting; in that case F is the harmonic mean of P and R, F = 2PR / (P + R).
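The measures above can be sketched in a few lines of Python; the function name and the example counts are illustrative, not from the chapter.

```python
def f_measure(tp, fp, fn, alpha=0.5):
    """Precision, recall, and the F measure F = 1 / (alpha/P + (1-alpha)/R)."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f = 1.0 / (alpha / precision + (1 - alpha) / recall)
    return precision, recall, f

# Hypothetical counts: 8 true positives, 2 false positives, 8 false negatives.
# With alpha = 0.5 the F measure reduces to the harmonic mean 2PR / (P + R).
p, r, f = f_measure(tp=8, fp=2, fn=8)
```

Note that with α = 0.5 a system cannot compensate for poor recall with high precision alone, since the harmonic mean is dominated by the smaller of the two values.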
8.2 Verb Subcategorization (1)
• Verb subcategorization: verbs subcategorize for different syntactic categories, i.e., they appear with different combinations of arguments.
• Verb subcategorization frame: a particular set of syntactic categories that a verb can appear with is called a subcategorization frame.
8.2 Verb Subcategorization (1, cont'd)
• Each category has several subcategories that express their semantic arguments using different syntactic means.
• The class of verbs with semantic arguments "theme" and "recipient" has a subcategory that expresses these arguments with an object and a prepositional phrase, and another subcategory that in addition permits a double-object construction.
• donate: object (theme) + prepositional phrase (recipient)
• give: additionally permits the double-object construction
He donated a large sum of money to a church.
He gave the church a large sum of money.
8.2 Verb Subcategorization (2)
• Knowing the possible subcategorization frames for verbs is important for parsing:
• a. She told [the man] [where Peter grew up].
• b. She found [the place [where Peter grew up]].
• This information is often not stored in dictionaries.
• Up to 50% of parse failures can be due to missing subcategorization frames.
• A simple and effective algorithm (Lerner) was proposed by Brent in 1993.
8.2 Verb Subcategorization (3)
• Lerner (Brent, 1993)
• The algorithm has two steps:
• Cues: define a regular pattern of words and syntactic categories which indicates the presence of the frame with high certainty.
• Certainty is formalized as a probability of error: for each cue c_j we define an error rate ε_j.
• Hypothesis testing is done by contradiction:
• We assume that the frame is not appropriate for the verb and call this H0 (the null hypothesis).
• We reject H0 if cue c_j indicates with high probability that H0 is wrong.
8.2 Verb Subcategorization (4)
• Cue for frame "NP NP" (transitive verb):
• (OBJ | SUBJ_OBJ | CAP) (PUNC | CC): a pronoun or capitalized word followed by punctuation or a conjunction.
• Example:
• […] greet-V Peter-CAP ,-PUNC […]
• The cue can also misfire: in I came Thursday, before the storm started, the pattern matches after came even though come does not take frame "NP NP" — this is why cues have an error rate.
• When the cue occurs often enough for a verb, we reject H0 and assign frame "NP NP".
8.2 Verb Subcategorization (5)
• Hypothesis testing
• H0: v_i(f_j) = 0, i.e. verb v_i does not permit frame f_j.
• p_E = Σ_{r=m}^{n} C(n, r) ε_j^r (1 − ε_j)^{n−r}
(the probability of seeing cue c_j at least m times in n occurrences of v_i purely by error)
• If p_E < α, we reject H0.
• Precision: close to 100% (when α = 0.02)
• Recall: 47–100%
• n: number of times v_i occurs in the corpus
• m = C(v_i, c_j): number of times v_i occurs with cue c_j
• v_i(f_j) = 0: verb v_i does not permit frame f_j
• ε_j: error rate for cue c_j
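The binomial test above can be sketched as follows; the function name and the example counts (100 verb occurrences, 5 cue firings, a 1% cue error rate) are hypothetical, chosen only to illustrate the decision rule.

```python
from math import comb

def brent_test(n, m, eps, alpha=0.02):
    """Brent-style cue test for a subcategorization frame.

    Null hypothesis H0: the verb does not permit the frame, so each of its
    n corpus occurrences shows the cue only by error, with probability eps.
    p_E = P(cue fires >= m times out of n under H0); reject H0 if p_E < alpha.
    """
    p_error = sum(comb(n, r) * eps**r * (1 - eps)**(n - r)
                  for r in range(m, n + 1))
    return p_error < alpha, p_error

# Hypothetical verb: seen 100 times, cue fired 5 times, cue error rate 1%.
reject, p_err = brent_test(n=100, m=5, eps=0.01)
```

Because the test only rejects H0 when the cue count is far above what the error rate explains, precision stays high even with noisy cues; recall depends on how often reliable cues happen to occur.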
8.2 Verb Subcategorization (6)
• Manning's addition
• Use a part-of-speech tagger and run the cue detection on the output of the tagger.
• Allowing low-reliability cues and additional cues based on tagger output increases the number of cues significantly.
• The cues are more error-prone, but much more abundant. Examples:
She compared the results with earlier findings.
He relies on relatives.
8.2 Verb Subcategorization (7)
• Table 8.3: Learned subcategorization frames

Verb     Correct  Incorrect  OALD
bridge      1         1        1
burden      2         -        2
depict      2         -        3
emanate     1         -        1
leak        1         -        5
occupy      1         -        3
remark      1         1        4
retire      2         1        5
shed        1         -        2
troop       0         -        3

• Two of the errors are prepositional phrases (PPs): to bridge between and to retire in; the frame NP in-PP is not included in the OALD.
• "And here we are 10 years later with the same problems," Mr. Smith remarked.
8.3 Attachment Ambiguity (1)
• (8.14) The children ate the cake with a spoon.
• I saw the man with a telescope. → syntactic ambiguity
• The (log) likelihood ratio is a common and good way of comparing two exclusive alternatives (here, attachment to the verb vs. the noun).
• Problem: it ignores the preference for attaching phrases "low" in the parse tree.
8.3 Attachment Ambiguity (2)
• Chrysler confirmed that it would end its troubled venture with Maserati.
(with Maserati attaches to the nearer noun venture, not the verb end — an instance of the low-attachment preference)
8.3.1 Hindle and Rooth
• Event space: all V NP PP sequences.
• How likely is it for a preposition to attach to the verb or to the noun?
• VA_p: is there a PP headed by p which attaches to v?
• NA_p: is there a PP headed by p which attaches to n?
• Both can be 1:
• He put the book on World War II on the table.
• She sent him into the nursery to gather up his toys.
8.3.1 Hindle and Rooth (cont'd)
• The two attachments are compared with a log-likelihood ratio:
• λ(v, n, p) = log2 [ P(VA_p = 1 | v) · P(NA_p = 0 | n) / P(NA_p = 1 | n) ]
• λ > 0 favours verb attachment, λ < 0 favours noun attachment; the noun term is not discounted because the noun, being closer, gets the first chance to attach the PP.
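A minimal sketch of the Hindle-Rooth score, estimating the attachment probabilities by maximum likelihood from co-occurrence counts; the function name and all counts in the usage line are hypothetical, not from the paper.

```python
from math import log2

def hindle_rooth_lambda(c_v_p, c_v, c_n_p, c_n):
    """Log-likelihood ratio for PP attachment (Hindle & Rooth style).

    c_v_p: times verb v occurred with a PP headed by p attached to it
    c_v:   total occurrences of v
    c_n_p, c_n: the same counts for noun n
    Returns lambda(v, n, p); > 0 favours verb attachment, < 0 noun attachment.
    """
    p_va = c_v_p / c_v            # MLE of P(VA_p = 1 | v)
    p_na = c_n_p / c_n            # MLE of P(NA_p = 1 | n)
    return log2(p_va * (1 - p_na) / p_na)

# Hypothetical counts for (send, him/nursery, into)-style triples:
lam = hindle_rooth_lambda(c_v_p=50, c_v=100, c_n_p=10, c_n=100)
```

In practice the counts come from unambiguous cases in a parsed or heuristically bracketed corpus, and smoothing is needed when a preposition never co-occurs with a given noun.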
8.3.2 General remarks on PP attachment (1)
• The model's limitations:
• It considers only the identity of the preposition, the noun, and the verb.
• It considers only the most basic case: a PP immediately after an NP object, modifying either the immediately preceding n or v. Real sentences can stack many PPs:
The board approved [its acquisition] [by Royal Trustco Ltd.] [of Toronto] [for $27 a share] [at its monthly meeting].
• Other attachment issues:
• Attachment ambiguity in noun compounds:
• [[Door bell] manufacturer]: left-branching
• [Woman [aid worker]]: right-branching
8.4 Selectional Preferences (1)
• Selectional preferences (or selectional restrictions)
• Most verbs prefer arguments of a particular type.
• These are preferences, not hard rules: eat can take a non-food argument, as in eating one's words.
8.4 Selectional Preferences (2)
• The acquisition of selectional preferences is important in statistical NLP for a number of reasons:
• If a word such as durian is missing from the dictionary, we can infer part of its meaning from selectional restrictions (as the object of eat, it is probably a food).
• Another important use is ranking the parses of a sentence:
• Give high scores to parses in which the verb has natural arguments.
8.4 Selectional Preferences (3)
• Resnik's model (1993, 1996)
• Selectional preference strength: how strongly the verb constrains its direct object:
• S(v) = D( P(C|v) || P(C) ) = Σ_c P(c|v) log2 [ P(c|v) / P(c) ]
• Two assumptions:
• Take only the head noun of the object.
• Work with classes of nouns rather than individual nouns.
• P(C): overall probability distribution of noun classes
• P(C|v): probability distribution of noun classes in the direct object position of v
8.4 Selectional Preferences (4)
• Table 8.5: Selectional preference strength
8.4 Selectional Preferences (5)
• The notions of Resnik's model (cont'd)
2. Selectional association between a verb v and a class c: the proportion of S(v) contributed by c:
• A(v, c) = P(c|v) log2 [ P(c|v) / P(c) ] / S(v)
• A rule for assigning association strength to nouns: A(v, n) = max of A(v, c) over the classes c that n belongs to.
Ex) (8.31) Susan interrupted the chair. (the 'person' sense of chair yields the highest association with interrupt)
8.4 Selectional Preferences (6)
• Estimate P(c|v) = P(v, c) / P(v), with
• P(v, c) ≈ (1/N) Σ_{n ∈ words(c)} C(v, n) / |classes(n)|
• N: total number of verb-object pairs in the corpus
• words(c): set of all nouns in class c
• |classes(n)|: number of noun classes that contain n as a member
• C(v, n): number of verb-object pairs with v as the verb and n as the head of the object NP
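Given the two distributions P(C) and P(C|v), the preference strength and association measures can be sketched as below. The class names and probabilities are invented toy values for an eat-like verb, not Resnik's data; in practice they would be estimated from verb-object counts as in the formula above.

```python
from math import log2

def sps(p_c, p_c_given_v):
    """Selectional preference strength S(v) = KL( P(C|v) || P(C) )."""
    return sum(pv * log2(pv / p_c[c])
               for c, pv in p_c_given_v.items() if pv > 0)

def association(c, p_c, p_c_given_v):
    """Selectional association A(v, c): class c's share of S(v)."""
    pv = p_c_given_v[c]
    return pv * log2(pv / p_c[c]) / sps(p_c, p_c_given_v)

# Hypothetical noun-class distributions for an 'eat'-like verb:
p_c = {"food": 0.1, "people": 0.3, "other": 0.6}          # P(C)
p_c_given_eat = {"food": 0.8, "people": 0.1, "other": 0.1}  # P(C|eat)
```

Classes that are more probable as objects of the verb than overall (here "food") get positive association; classes the verb disprefers get negative values, which is what drives the disambiguation behaviour noted below for A(v, n).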
8.4 Selectional Preferences (7)
• Resnik's experiments on the Brown corpus (1996): Table 8.6
• Left half: typical objects; right half: atypical objects.
• For most verbs, association strength predicts which object is typical.
• Most errors the model makes are due to the fact that it performs a form of disambiguation, choosing the highest A(v, c) as A(v, n).
• Implicit object alternation:
• Mike ate the cake.
• Mike ate.
• The more constraints a verb puts on its object, the more likely it is to permit the implicit-object construction.
• Selectional preference strength (SPS) can be seen as the more basic phenomenon, explaining both the occurrence of implicit objects and association strength.
8.5 Semantic Similarity (1)
• Lexical acquisition as the acquisition of meaning
• Semantic similarity
• Automatically acquiring a relative measure of how similar a new word is to known words is much easier than determining what the meaning actually is.
• Most often used for generalization, under the assumption that semantically similar words behave similarly. Ex) Susan had never eaten a fresh durian before.
• Similarity-based generalization vs. class-based generalization:
• Similarity-based generalization: consider the closest neighbors.
• Class-based generalization: consider the whole class.
• Uses of semantic similarity:
• Query expansion: astronaut → cosmonaut
• k-nearest-neighbors classification
8.5 Semantic Similarity (2)
• Notions of semantic similarity
• An extension of synonymy, referring to cases of near-synonymy like the pair dwelling/abode.
• Two words being from the same domain or topic. Ex) doctor, nurse, fever, intravenous
• Judgements of semantic similarity can be explained by the degree of contextual interchangeability (Miller and Charles, 1991).
• Ambiguity presents a problem for all notions of semantic similarity:
• When applied to ambiguous words, "semantically similar" usually means 'similar to the appropriate sense'. Ex) litigation ≈ suit (the lawsuit sense, not clothes)
• Similarity measures:
• Vector space measures
• Probabilistic measures
8.5.1 Vector space measures (1)
• The two words whose semantic similarity we want to compute are represented as vectors in a multi-dimensional space.
• A document-by-word matrix A (Figure 8.3): the entry contains the number of times word j occurs in document i.
• A word-by-word matrix B (Figure 8.4): the entry contains the number of times word j co-occurs with word i.
• A modifier-by-head matrix C (Figure 8.5): the entry contains the number of times that head j is modified by modifier i.
• Different spaces capture different types of semantic similarity:
• The document-word and word-word spaces capture topical similarity.
• The modifier-head space captures more fine-grained similarity.
8.5.1 Vector space measures (2)
• Similarity measures for binary vectors (Table 8.7):
• The matching coefficient simply counts the number of dimensions on which both vectors are non-zero.
• The Dice coefficient normalizes for the length of the vectors and the total number of non-zero entries.
• The Jaccard (or Tanimoto) coefficient penalizes a small number of shared entries more than the Dice coefficient does.
8.5.1 Vector space measures (3)
• Similarity measures for binary vectors (cont'd):
• The overlap coefficient has a value of 1.0 if every dimension with a non-zero value for the first vector is also non-zero for the second vector.
• The cosine penalizes less in cases where the number of non-zero entries is very different.
• Real-valued vector spaces:
• A more powerful representation for linguistic objects.
• The length of a vector: |x| = sqrt( Σ_i x_i² )
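Treating each binary vector as the set of its non-zero dimensions, the four binary measures above can be sketched as set operations. The example vectors are toy co-occurrence sets in the spirit of the chapter's cosmonaut/astronaut figures, not its actual data.

```python
def matching(x, y):
    """Matching coefficient: count of shared non-zero dimensions."""
    return len(x & y)

def dice(x, y):
    """Dice coefficient: shared dimensions, normalized by total non-zero entries."""
    return 2 * len(x & y) / (len(x) + len(y))

def jaccard(x, y):
    """Jaccard (Tanimoto) coefficient: shared dimensions over the union."""
    return len(x & y) / len(x | y)

def overlap(x, y):
    """Overlap coefficient: 1.0 when one set of non-zero dims contains the other."""
    return len(x & y) / min(len(x), len(y))

# Each word is the set of dimensions (co-occurring words) with a non-zero entry:
cosmonaut = {"Soviet", "spacewalking"}
astronaut = {"American", "spacewalking"}
```

For one shared entry out of two per vector, Dice gives 0.5 while Jaccard gives 1/3, illustrating how Jaccard penalizes a small number of shared entries more heavily.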
8.5.1 Vector space measures (4)
• Real-valued vector spaces (cont'd):
• The dot product between two vectors: x · y = Σ_i x_i y_i
• The cosine measure: cos(x, y) = (x · y) / (|x| |y|)
• The Euclidean distance: |x − y| = sqrt( Σ_i (x_i − y_i)² )
• Advantages of vector spaces as a representational medium:
• Simplicity
• Computational efficiency
• Disadvantages of vector spaces:
• All of the measures except the cosine operate only on binary data.
• The cosine has its own problem: it assumes a Euclidean space, which is not a well-motivated choice if the vectors we are dealing with are vectors of probabilities or counts.
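The real-valued measures above are a few lines each; this is a plain sketch with illustrative vectors rather than any corpus data.

```python
from math import sqrt

def dot(x, y):
    """Dot product x . y = sum_i x_i * y_i."""
    return sum(a * b for a, b in zip(x, y))

def cosine(x, y):
    """Cosine measure: dot product normalized by the two vector lengths."""
    return dot(x, y) / (sqrt(dot(x, x)) * sqrt(dot(y, y)))

def euclidean(x, y):
    """Euclidean distance |x - y|."""
    return sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))
```

The cosine depends only on vector direction, so proportional count vectors such as (1, 2) and (2, 4) come out maximally similar even though their Euclidean distance is non-zero; this is exactly the length-normalization property that makes it usable on raw counts.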
8.5.2 Probabilistic measures (1)
• Transform semantic similarity into the similarity of two probability distributions:
• Transform the matrices of counts in Figures 8.3, 8.4 and 8.5 into matrices of conditional probabilities.
• Ex) (American, astronaut): P(American | astronaut) = 1/2 = 0.5
• Measures of (dis-)similarity between probability distributions (Table 8.9):
• Three measures of dissimilarity between probability distributions were investigated by Dagan et al. (1997):
1. KL divergence: D(p || q) = Σ_i p_i log2 ( p_i / q_i )
• Measures how much information is lost if we assume distribution q when the true distribution is p.
• Two problems for practical applications:
• It becomes infinite when q_i = 0 and p_i ≠ 0.
• It is asymmetric: D(p || q) ≠ D(q || p).
8.5.2 Probabilistic measures (2)
• Measures of similarity between probability distributions (cont'd):
2. Information radius (IRad): IRad(p, q) = D( p || (p+q)/2 ) + D( q || (p+q)/2 )
• Symmetric, and no problem with infinite values.
• Measures how much information is lost if we describe the two words that correspond to p and q with their average distribution.
3. L1 (Manhattan) norm: L1(p, q) = Σ_i |p_i − q_i|
• A measure of the expected proportion of events that are going to be different between the distributions p and q.
8.5.2 Probabilistic measures (3)
• Measures of similarity between probability distributions (cont'd):
• L1 norm example (from Figure 8.5):
p1 = P(Soviet | cosmonaut) = 0.5, p2 = 0, p3 = P(spacewalking | cosmonaut) = 0.5
q1 = 0, q2 = P(American | astronaut) = 0.5, q3 = P(spacewalking | astronaut) = 0.5
L1(p, q) = |0.5 − 0| + |0 − 0.5| + |0.5 − 0.5| = 1.0
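The three dissimilarity measures can be sketched directly from their definitions; the distributions below are the cosmonaut/astronaut vectors from the worked example above.

```python
from math import log2

def kl(p, q):
    """KL divergence D(p || q); undefined (infinite) if q_i = 0 where p_i > 0."""
    return sum(pi * log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def irad(p, q):
    """Information radius: D(p || m) + D(q || m), m the average distribution.

    Always finite, since m_i = 0 only where both p_i and q_i are 0.
    """
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return kl(p, m) + kl(q, m)

def l1(p, q):
    """L1 (Manhattan) norm: expected proportion of differing events."""
    return sum(abs(pi - qi) for pi, qi in zip(p, q))

# The distributions from the Figure 8.5 example:
p = [0.5, 0.0, 0.5]   # P(. | cosmonaut) over (Soviet, American, spacewalking)
q = [0.0, 0.5, 0.5]   # P(. | astronaut)
```

Note that kl(p, q) cannot be evaluated for this pair (q has a zero where p does not), which is precisely the practical problem that motivates IRad and the L1 norm.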
8.6 The Role of Lexical Acquisition in Statistical NLP (1)
• Lexical acquisition plays a key role in statistical NLP because available lexical resources are always lacking in some way:
• Building lexical resources manually is costly, so the quantitative part of lexical acquisition almost always has to be done automatically.
• Many lexical resources were designed for human consumption.
• The main reason resources are always incomplete: the inherent productivity of language.
• The best solution: the augmentation of a manual resource by automatic means.
What does the future hold for lexical acquisition?
• Look harder for sources of prior knowledge that can constrain the process of lexical acquisition.
• Much of the hard work of lexical acquisition will be in building interfaces that admit easy specification of prior knowledge and easy correction of mistakes made in automatic learning.
• Linguistic theory, an important source of prior knowledge, has been surprisingly underutilized in statistical NLP.
• Dictionaries are only one source of information that can be important in lexical acquisition, in addition to text corpora. Other sources: encyclopedias, thesauri, gazetteers, collections of technical vocabulary, etc.
• If we succeed in emulating human acquisition of language by tapping into this rich source of information, then a breakthrough in the effectiveness of lexical acquisition can be expected.