Ch. 8 Lexical Acquisition (10-05-2009)
Introduction
• In this chapter, we look at the acquisition of more complex syntactic and semantic properties of words.
• Main areas covered in this chapter:
• Verb subcategorization
• Attachment ambiguity
• Selectional preferences
• Semantic categorization
Introduction (1)
• The general goal of lexical acquisition:
• Develop algorithms and statistical techniques for filling the holes in existing machine-readable dictionaries by looking at the occurrence patterns of words in large text corpora.
• There are many lexical acquisition problems besides collocations:
• Selectional preferences
• Subcategorization
• Semantic categorization
Introduction (2)
• Lexicon
• The part of the grammar of a language that includes the lexical entries for all the words and/or morphemes in the language, and that may also include various other information, depending on the particular theory of grammar.
• (8.1)
(a) The children ate the cake with their hands.
(b) The children ate the cake with blue icing.
8.1 Evaluation Measures (1)
• Evaluation in IR makes frequent use of the notions of precision and recall, and their use has crossed over into work on evaluating statistical NLP models.
• Precision & recall, defined over the selected set (what the system returns) and the target set (what it should return): tp = true positives, fp = false positives, fn = false negatives, tn = true negatives.
• Precision = tp / (tp + fp)
• Recall = tp / (tp + fn)
8.1 Evaluation Measures (2)
• F measure
• Combines precision and recall in a single measure of overall performance:
• F = 1 / (α/P + (1 − α)/R)
• P: precision, R: recall, α: a factor that determines the weighting of P and R.
• α = 0.5 is often chosen for equal weighting; in that case F is the harmonic mean of P and R, F = 2PR / (P + R).
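The measures above can be sketched in a few lines of Python; the function name and the example counts are illustrative, not from the chapter.

```python
def f_measure(tp, fp, fn, alpha=0.5):
    """Precision, recall, and the F measure F = 1 / (alpha/P + (1-alpha)/R)."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f = 1.0 / (alpha / precision + (1 - alpha) / recall)
    return precision, recall, f

# Hypothetical counts: 8 true positives, 2 false positives, 8 false negatives.
# With alpha = 0.5 the F measure reduces to the harmonic mean 2PR / (P + R).
p, r, f = f_measure(tp=8, fp=2, fn=8)
```

Note that with α = 0.5 a system cannot compensate for poor recall with high precision alone, since the harmonic mean is dominated by the smaller of the two values.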
8.2 Verb Subcategorization (1)
• Verb subcategorization: verbs subcategorize for different syntactic categories, i.e., they appear with different combinations of arguments.
• Verb subcategorization frame: a particular set of syntactic categories that a verb can appear with is called a subcategorization frame.
8.2 Verb Subcategorization (1, cont'd)
• Each category has several subcategories that express their semantic arguments using different syntactic means.
• The class of verbs with semantic arguments "theme" and "recipient" has a subcategory that expresses these arguments with an object and a prepositional phrase, and another subcategory that in addition permits a double-object construction.
• donate: object (theme) + prepositional phrase (recipient)
• give: additionally permits the double-object construction
He donated a large sum of money to a church.
He gave the church a large sum of money.
8.2 Verb Subcategorization (2)
• Knowing the possible subcategorization frames for verbs is important for parsing:
• a. She told [the man] [where Peter grew up].
• b. She found [the place [where Peter grew up]].
• This information is often not stored in dictionaries.
• Up to 50% of parse failures can be due to missing subcategorization frames.
• A simple and effective algorithm (Lerner) was proposed by Brent in 1993.
8.2 Verb Subcategorization (3)
• Lerner (Brent, 1993)
• The algorithm has two steps:
• Cues: define a regular pattern of words and syntactic categories which indicates the presence of the frame with high certainty.
• Certainty is formalized as a probability of error: for each cue c_j we define an error rate ε_j.
• Hypothesis testing is done by contradiction:
• We assume that the frame is not appropriate for the verb and call this H0 (the null hypothesis).
• We reject H0 if cue c_j indicates with high probability that H0 is wrong.
8.2 Verb Subcategorization (4)
• Cue for frame "NP NP" (transitive verb):
• (OBJ | SUBJ_OBJ | CAP) (PUNC | CC): a pronoun or capitalized word followed by punctuation or a conjunction.
• Example:
• […] greet-V Peter-CAP ,-PUNC […]
• The cue can also misfire: in I came Thursday, before the storm started, the pattern matches after came even though come does not take frame "NP NP" — this is why cues have an error rate.
• When the cue occurs often enough for a verb, we reject H0 and assign frame "NP NP".
8.2 Verb Subcategorization (5)
• Hypothesis testing
• H0: v_i(f_j) = 0, i.e. verb v_i does not permit frame f_j.
• p_E = Σ_{r=m}^{n} C(n, r) ε_j^r (1 − ε_j)^{n−r}
(the probability of seeing cue c_j at least m times in n occurrences of v_i purely by error)
• If p_E < α, we reject H0.
• Precision: close to 100% (when α = 0.02)
• Recall: 47–100%
• n: number of times v_i occurs in the corpus
• m = C(v_i, c_j): number of times v_i occurs with cue c_j
• v_i(f_j) = 0: verb v_i does not permit frame f_j
• ε_j: error rate for cue c_j
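The binomial test above can be sketched as follows; the function name and the example counts (100 verb occurrences, 5 cue firings, a 1% cue error rate) are hypothetical, chosen only to illustrate the decision rule.

```python
from math import comb

def brent_test(n, m, eps, alpha=0.02):
    """Brent-style cue test for a subcategorization frame.

    Null hypothesis H0: the verb does not permit the frame, so each of its
    n corpus occurrences shows the cue only by error, with probability eps.
    p_E = P(cue fires >= m times out of n under H0); reject H0 if p_E < alpha.
    """
    p_error = sum(comb(n, r) * eps**r * (1 - eps)**(n - r)
                  for r in range(m, n + 1))
    return p_error < alpha, p_error

# Hypothetical verb: seen 100 times, cue fired 5 times, cue error rate 1%.
reject, p_err = brent_test(n=100, m=5, eps=0.01)
```

Because the test only rejects H0 when the cue count is far above what the error rate explains, precision stays high even with noisy cues; recall depends on how often reliable cues happen to occur.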
8.2 Verb Subcategorization (6)
• Manning's addition
• Use a part-of-speech tagger and run the cue detection on the output of the tagger.
• Allowing low-reliability cues and additional cues based on tagger output increases the number of cues significantly.
• The cues are more error-prone, but much more abundant. Examples:
She compared the results with earlier findings.
He relies on relatives.
8.2 Verb Subcategorization (7)
• Table 8.3: Learned subcategorization frames

Verb     Correct  Incorrect  OALD
bridge      1         1        1
burden      2         -        2
depict      2         -        3
emanate     1         -        1
leak        1         -        5
occupy      1         -        3
remark      1         1        4
retire      2         1        5
shed        1         -        2
troop       0         -        3

• Two of the errors are prepositional phrases (PPs): to bridge between and to retire in; the frame NP in-PP is not included in the OALD.
• "And here we are 10 years later with the same problems," Mr. Smith remarked.
8.3 Attachment Ambiguity (1)
• (8.14) The children ate the cake with a spoon.
• I saw the man with a telescope. → syntactic ambiguity
• The (log) likelihood ratio is a common and good way of comparing two exclusive alternatives (here, attachment to the verb vs. the noun).
• Problem: it ignores the preference for attaching phrases "low" in the parse tree.
8.3 Attachment Ambiguity (2)
• Chrysler confirmed that it would end its troubled venture with Maserati.
(with Maserati attaches to the nearer noun venture, not the verb end — an instance of the low-attachment preference)
8.3.1 Hindle and Rooth
• Event space: all V NP PP sequences.
• How likely is it for a preposition to attach to the verb or to the noun?
• VA_p: is there a PP headed by p which attaches to v?
• NA_p: is there a PP headed by p which attaches to n?
• Both can be 1:
• He put the book on World War II on the table.
• She sent him into the nursery to gather up his toys.
8.3.1 Hindle and Rooth (cont'd)
• The two attachments are compared with a log-likelihood ratio:
• λ(v, n, p) = log2 [ P(VA_p = 1 | v) · P(NA_p = 0 | n) / P(NA_p = 1 | n) ]
• λ > 0 favours verb attachment, λ < 0 favours noun attachment; the noun term is not discounted because the noun, being closer, gets the first chance to attach the PP.
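A minimal sketch of the Hindle-Rooth score, estimating the attachment probabilities by maximum likelihood from co-occurrence counts; the function name and all counts in the usage line are hypothetical, not from the paper.

```python
from math import log2

def hindle_rooth_lambda(c_v_p, c_v, c_n_p, c_n):
    """Log-likelihood ratio for PP attachment (Hindle & Rooth style).

    c_v_p: times verb v occurred with a PP headed by p attached to it
    c_v:   total occurrences of v
    c_n_p, c_n: the same counts for noun n
    Returns lambda(v, n, p); > 0 favours verb attachment, < 0 noun attachment.
    """
    p_va = c_v_p / c_v            # MLE of P(VA_p = 1 | v)
    p_na = c_n_p / c_n            # MLE of P(NA_p = 1 | n)
    return log2(p_va * (1 - p_na) / p_na)

# Hypothetical counts for (send, him/nursery, into)-style triples:
lam = hindle_rooth_lambda(c_v_p=50, c_v=100, c_n_p=10, c_n=100)
```

In practice the counts come from unambiguous cases in a parsed or heuristically bracketed corpus, and smoothing is needed when a preposition never co-occurs with a given noun.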
8.3.2 General remarks on PP attachment (1)
• The model's limitations:
• It considers only the identity of the preposition, the noun, and the verb.
• It considers only the most basic case: a PP immediately after an NP object, modifying either the immediately preceding n or v. Real sentences can stack many PPs:
The board approved [its acquisition] [by Royal Trustco Ltd.] [of Toronto] [for $27 a share] [at its monthly meeting].
• Other attachment issues:
• Attachment ambiguity in noun compounds:
• [[Door bell] manufacturer]: left-branching
• [Woman [aid worker]]: right-branching
8.4 Selectional Preferences (1)
• Selectional preferences (or selectional restrictions)
• Most verbs prefer arguments of a particular type.
• These are preferences, not hard rules: eat can take a non-food argument, as in eating one's words.
8.4 Selectional Preferences (2)
• The acquisition of selectional preferences is important in statistical NLP for a number of reasons:
• If a word such as durian is missing from the dictionary, we can infer part of its meaning from selectional restrictions (as the object of eat, it is probably a food).
• Another important use is ranking the parses of a sentence:
• Give high scores to parses in which the verb has natural arguments.
8.4 Selectional Preferences (3)
• Resnik's model (1993, 1996)
• Selectional preference strength: how strongly the verb constrains its direct object:
• S(v) = D( P(C|v) || P(C) ) = Σ_c P(c|v) log2 [ P(c|v) / P(c) ]
• Two assumptions:
• Take only the head noun of the object.
• Work with classes of nouns rather than individual nouns.
• P(C): overall probability distribution of noun classes
• P(C|v): probability distribution of noun classes in the direct object position of v
8.4 Selectional Preferences (4)
• Table 8.5: Selectional preference strength
8.4 Selectional Preferences (5)
• The notions of Resnik's model (cont'd)
2. Selectional association between a verb v and a class c: the proportion of S(v) contributed by c:
• A(v, c) = P(c|v) log2 [ P(c|v) / P(c) ] / S(v)
• A rule for assigning association strength to nouns: A(v, n) = max of A(v, c) over the classes c that n belongs to.
Ex) (8.31) Susan interrupted the chair. (the 'person' sense of chair yields the highest association with interrupt)
8.4 Selectional Preferences (6)
• Estimate P(c|v) = P(v, c) / P(v), with
• P(v, c) ≈ (1/N) Σ_{n ∈ words(c)} C(v, n) / |classes(n)|
• N: total number of verb-object pairs in the corpus
• words(c): set of all nouns in class c
• |classes(n)|: number of noun classes that contain n as a member
• C(v, n): number of verb-object pairs with v as the verb and n as the head of the object NP
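Given the two distributions P(C) and P(C|v), the preference strength and association measures can be sketched as below. The class names and probabilities are invented toy values for an eat-like verb, not Resnik's data; in practice they would be estimated from verb-object counts as in the formula above.

```python
from math import log2

def sps(p_c, p_c_given_v):
    """Selectional preference strength S(v) = KL( P(C|v) || P(C) )."""
    return sum(pv * log2(pv / p_c[c])
               for c, pv in p_c_given_v.items() if pv > 0)

def association(c, p_c, p_c_given_v):
    """Selectional association A(v, c): class c's share of S(v)."""
    pv = p_c_given_v[c]
    return pv * log2(pv / p_c[c]) / sps(p_c, p_c_given_v)

# Hypothetical noun-class distributions for an 'eat'-like verb:
p_c = {"food": 0.1, "people": 0.3, "other": 0.6}          # P(C)
p_c_given_eat = {"food": 0.8, "people": 0.1, "other": 0.1}  # P(C|eat)
```

Classes that are more probable as objects of the verb than overall (here "food") get positive association; classes the verb disprefers get negative values, which is what drives the disambiguation behaviour noted below for A(v, n).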
8.4 Selectional Preferences (7)
• Resnik's experiments on the Brown corpus (1996): Table 8.6
• Left half: typical objects; right half: atypical objects.
• For most verbs, association strength predicts which object is typical.
• Most errors the model makes are due to the fact that it performs a form of disambiguation, choosing the highest A(v, c) as A(v, n).
• Implicit object alternation:
• Mike ate the cake.
• Mike ate.
• The more constraints a verb puts on its object, the more likely it is to permit the implicit-object construction.
• Selectional preference strength (SPS) can be seen as the more basic phenomenon, explaining both the occurrence of implicit objects and association strength.
8.5 Semantic Similarity (1)
• Lexical acquisition as the acquisition of meaning
• Semantic similarity
• Automatically acquiring a relative measure of how similar a new word is to known words is much easier than determining what the meaning actually is.
• Most often used for generalization, under the assumption that semantically similar words behave similarly. Ex) Susan had never eaten a fresh durian before.
• Similarity-based generalization vs. class-based generalization:
• Similarity-based generalization: consider the closest neighbors.
• Class-based generalization: consider the whole class.
• Uses of semantic similarity:
• Query expansion: astronaut → cosmonaut
• k-nearest-neighbors classification
8.5 Semantic Similarity (2)
• Notions of semantic similarity
• An extension of synonymy, referring to cases of near-synonymy like the pair dwelling/abode.
• Two words being from the same domain or topic. Ex) doctor, nurse, fever, intravenous
• Judgements of semantic similarity can be explained by the degree of contextual interchangeability (Miller and Charles, 1991).
• Ambiguity presents a problem for all notions of semantic similarity:
• When applied to ambiguous words, "semantically similar" usually means 'similar to the appropriate sense'. Ex) litigation ≈ suit (the lawsuit sense, not clothes)
• Similarity measures:
• Vector space measures
• Probabilistic measures
8.5.1 Vector space measures (1)
• The two words whose semantic similarity we want to compute are represented as vectors in a multi-dimensional space.
• A document-by-word matrix A (Figure 8.3): the entry contains the number of times word j occurs in document i.
• A word-by-word matrix B (Figure 8.4): the entry contains the number of times word j co-occurs with word i.
• A modifier-by-head matrix C (Figure 8.5): the entry contains the number of times that head j is modified by modifier i.
• Different spaces capture different types of semantic similarity:
• The document-word and word-word spaces capture topical similarity.
• The modifier-head space captures more fine-grained similarity.
8.5.1 Vector space measures (2)
• Similarity measures for binary vectors (Table 8.7):
• The matching coefficient simply counts the number of dimensions on which both vectors are non-zero.
• The Dice coefficient normalizes for the length of the vectors and the total number of non-zero entries.
• The Jaccard (or Tanimoto) coefficient penalizes a small number of shared entries more than the Dice coefficient does.
8.5.1 Vector space measures (3)
• Similarity measures for binary vectors (cont'd):
• The overlap coefficient has a value of 1.0 if every dimension with a non-zero value for the first vector is also non-zero for the second vector.
• The cosine penalizes less in cases where the number of non-zero entries is very different.
• Real-valued vector spaces:
• A more powerful representation for linguistic objects.
• The length of a vector: |x| = sqrt( Σ_i x_i² )
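Treating each binary vector as the set of its non-zero dimensions, the four binary measures above can be sketched as set operations. The example vectors are toy co-occurrence sets in the spirit of the chapter's cosmonaut/astronaut figures, not its actual data.

```python
def matching(x, y):
    """Matching coefficient: count of shared non-zero dimensions."""
    return len(x & y)

def dice(x, y):
    """Dice coefficient: shared dimensions, normalized by total non-zero entries."""
    return 2 * len(x & y) / (len(x) + len(y))

def jaccard(x, y):
    """Jaccard (Tanimoto) coefficient: shared dimensions over the union."""
    return len(x & y) / len(x | y)

def overlap(x, y):
    """Overlap coefficient: 1.0 when one set of non-zero dims contains the other."""
    return len(x & y) / min(len(x), len(y))

# Each word is the set of dimensions (co-occurring words) with a non-zero entry:
cosmonaut = {"Soviet", "spacewalking"}
astronaut = {"American", "spacewalking"}
```

For one shared entry out of two per vector, Dice gives 0.5 while Jaccard gives 1/3, illustrating how Jaccard penalizes a small number of shared entries more heavily.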
8.5.1 Vector space measures (4)
• Real-valued vector spaces (cont'd):
• The dot product between two vectors: x · y = Σ_i x_i y_i
• The cosine measure: cos(x, y) = (x · y) / (|x| |y|)
• The Euclidean distance: |x − y| = sqrt( Σ_i (x_i − y_i)² )
• Advantages of vector spaces as a representational medium:
• Simplicity
• Computational efficiency
• Disadvantages of vector spaces:
• All of the measures except the cosine operate only on binary data.
• The cosine has its own problem: it assumes a Euclidean space, which is not a well-motivated choice if the vectors we are dealing with are vectors of probabilities or counts.
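The real-valued measures above are a few lines each; this is a plain sketch with illustrative vectors rather than any corpus data.

```python
from math import sqrt

def dot(x, y):
    """Dot product x . y = sum_i x_i * y_i."""
    return sum(a * b for a, b in zip(x, y))

def cosine(x, y):
    """Cosine measure: dot product normalized by the two vector lengths."""
    return dot(x, y) / (sqrt(dot(x, x)) * sqrt(dot(y, y)))

def euclidean(x, y):
    """Euclidean distance |x - y|."""
    return sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))
```

The cosine depends only on vector direction, so proportional count vectors such as (1, 2) and (2, 4) come out maximally similar even though their Euclidean distance is non-zero; this is exactly the length-normalization property that makes it usable on raw counts.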
8.5.2 Probabilistic measures (1)
• Transform semantic similarity into the similarity of two probability distributions:
• Transform the matrices of counts in Figures 8.3, 8.4 and 8.5 into matrices of conditional probabilities.
• Ex) (American, astronaut): P(American | astronaut) = 1/2 = 0.5
• Measures of (dis-)similarity between probability distributions (Table 8.9):
• Three measures of dissimilarity between probability distributions were investigated by Dagan et al. (1997):
1. KL divergence: D(p || q) = Σ_i p_i log2 ( p_i / q_i )
• Measures how much information is lost if we assume distribution q when the true distribution is p.
• Two problems for practical applications:
• It becomes infinite when q_i = 0 and p_i ≠ 0.
• It is asymmetric: D(p || q) ≠ D(q || p).
8.5.2 Probabilistic measures (2)
• Measures of similarity between probability distributions (cont'd):
2. Information radius (IRad): IRad(p, q) = D( p || (p+q)/2 ) + D( q || (p+q)/2 )
• Symmetric, and no problem with infinite values.
• Measures how much information is lost if we describe the two words that correspond to p and q with their average distribution.
3. L1 (Manhattan) norm: L1(p, q) = Σ_i |p_i − q_i|
• A measure of the expected proportion of events that are going to be different between the distributions p and q.
8.5.2 Probabilistic measures (3)
• Measures of similarity between probability distributions (cont'd):
• L1 norm example (from Figure 8.5):
p1 = P(Soviet | cosmonaut) = 0.5, p2 = 0, p3 = P(spacewalking | cosmonaut) = 0.5
q1 = 0, q2 = P(American | astronaut) = 0.5, q3 = P(spacewalking | astronaut) = 0.5
L1(p, q) = |0.5 − 0| + |0 − 0.5| + |0.5 − 0.5| = 1.0
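The three dissimilarity measures can be sketched directly from their definitions; the distributions below are the cosmonaut/astronaut vectors from the worked example above.

```python
from math import log2

def kl(p, q):
    """KL divergence D(p || q); undefined (infinite) if q_i = 0 where p_i > 0."""
    return sum(pi * log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def irad(p, q):
    """Information radius: D(p || m) + D(q || m), m the average distribution.

    Always finite, since m_i = 0 only where both p_i and q_i are 0.
    """
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return kl(p, m) + kl(q, m)

def l1(p, q):
    """L1 (Manhattan) norm: expected proportion of differing events."""
    return sum(abs(pi - qi) for pi, qi in zip(p, q))

# The distributions from the Figure 8.5 example:
p = [0.5, 0.0, 0.5]   # P(. | cosmonaut) over (Soviet, American, spacewalking)
q = [0.0, 0.5, 0.5]   # P(. | astronaut)
```

Note that kl(p, q) cannot be evaluated for this pair (q has a zero where p does not), which is precisely the practical problem that motivates IRad and the L1 norm.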
8.6 The Role of Lexical Acquisition in Statistical NLP (1)
• Lexical acquisition plays a key role in statistical NLP because available lexical resources are always lacking in some way:
• Building lexical resources manually is costly, so the quantitative part of lexical acquisition almost always has to be done automatically.
• Many lexical resources were designed for human consumption.
• The main reason resources are always incomplete: the inherent productivity of language.
• The best solution: the augmentation of a manual resource by automatic means.
What does the future hold for lexical acquisition?
• Look harder for sources of prior knowledge that can constrain the process of lexical acquisition.
• Much of the hard work of lexical acquisition will be in building interfaces that admit easy specification of prior knowledge and easy correction of mistakes made in automatic learning.
• Linguistic theory, an important source of prior knowledge, has been surprisingly underutilized in statistical NLP.
• Dictionaries are only one source of information that can be important in lexical acquisition, in addition to text corpora. Other sources: encyclopedias, thesauri, gazetteers, collections of technical vocabulary, etc.
• If we succeed in emulating human acquisition of language by tapping into this rich source of information, then a breakthrough in the effectiveness of lexical acquisition can be expected.