Lexical Acquisition Extending our information about words, particularly quantitative information
Why lexical acquisition? • “one cannot learn a new language by reading a bilingual dictionary” -- Mercer • Parsing ‘postmen’ requires context • quantitative information is difficult to collect by hand • e.g., priors on word senses • productivity of language • Lexicons need to be updated for new words and usages
Machine-readable Lexicons contain... • Lexical vs syntactic information • Word senses • Classifications, subclassifications • Collocations • Arguments, preferences • Synonyms, antonyms • Quantitative information
Gray area between lexical and syntactic • The rules of grammar are syntactic: • S ::= NP V NP • S ::= NP [V NP PP] • But which one to use, when? That depends on the particular words: • The children ate the cake with their hands. (the PP modifies the verb) • The children ate the cake with blue icing. (the PP modifies the cake)
Outline of chapter • verb subcategorization • Which arguments (e.g., infinitive, direct object) does a particular verb admit? • attachment ambiguity • What does the modifier refer to? • selectional preferences • Does a verb tend to restrict its object to a certain class? • semantic similarity between words • Which existing words is this new word most like?
Verb subcategorization frames • Assign to each verb the subcategorization frames (SFs) that are legal for it (see diagram). • Crucial for parsing: • She told the man where Peter grew up. • (NP NP S -- the where-clause is a complement of 'told') • She found the place where Peter grew up. • (NP NP -- the where-clause is a relative clause modifying 'the place', not an argument of 'found')
Brent's method (1993) • Learn subcategorizations given a corpus, a lexical analyzer, and cues. • A cue is a pair <L, SF>: • L is a star-free regular expression over lexemes, e.g. • (OBJ | SUBJ-OBJ | CAP) (PUNC | CC) • SF is a subcategorization frame, e.g. • NP NP • Strategy: find verb SFs for which the cues provide strong evidence.
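A minimal sketch of the regular-expression side of such a cue, assuming toy word lists for the lexical categories OBJ, SUBJ-OBJ, CAP, PUNC, and CC; the function name and the word lists are illustrative, not Brent's actual implementation:

```python
# Hypothetical word lists standing in for the cue's lexical categories.
OBJ_PRONOUNS = {"me", "him", "her", "us", "them"}   # OBJ: object-only pronouns
SUBJ_OBJ_PRONOUNS = {"you", "it"}                   # SUBJ-OBJ: subject-or-object pronouns
CONJUNCTIONS = {"and", "or", "but"}                 # CC
PUNCTUATION = {".", ",", ";", ":", "!", "?"}        # PUNC

def cue_matches(tok1, tok2):
    """True if the two tokens following a verb match the pattern
    (OBJ | SUBJ-OBJ | CAP) (PUNC | CC)."""
    first_ok = (tok1.lower() in OBJ_PRONOUNS
                or tok1.lower() in SUBJ_OBJ_PRONOUNS
                or tok1[:1].isupper())              # CAP: capitalized word
    second_ok = tok2 in PUNCTUATION or tok2.lower() in CONJUNCTIONS
    return first_ok and second_ok

print(cue_matches("Peter", "."))    # True: "... met Peter ." triggers the cue
print(cue_matches("quickly", "."))  # False
```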
Brent's method (cont'd) • Compute the error rate of the cue, E = Pr(false positive): the probability that the cue occurs with a verb that does not actually admit SF. • For each verb v and cue c = <L, SF>: • Test the hypothesis H0 that verb v does not admit SF. • pE = P(seeing the cue m or more times in n occurrences of v if all were false positives) = Σ_{r=m..n} C(n,r) E^r (1-E)^(n-r), where n = # occurrences of v and m = # occurrences of v with cue c • If pE < a threshold, reject H0 and record SF for v.
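A short sketch of this hypothesis test, using the binomial form of pE given above; the function names and the 0.02 threshold are illustrative choices, not fixed by the method:

```python
from math import comb

def brent_p_value(n, m, error_rate):
    """pE: probability of seeing the cue m or more times in n occurrences of the
    verb if the verb did NOT admit the frame, i.e. if every cue occurrence were
    a false positive with probability error_rate."""
    return sum(comb(n, r) * error_rate**r * (1 - error_rate)**(n - r)
               for r in range(m, n + 1))

def admits_frame(n, m, error_rate, threshold=0.02):
    """Reject H0 (and so assign the frame to the verb) when the false-positive
    explanation of the cue occurrences is too unlikely."""
    return brent_p_value(n, m, error_rate) < threshold

# e.g. verb seen n=80 times, cue fired m=6 times, cue error rate 2%
print(admits_frame(80, 6, 0.02))   # True: 6 false positives out of 80 is unlikely
```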
Subcategorization Frames: Ideas • Hypothesis testing gives high precision but low recall. • Less reliable cues are necessary and helpful (under an independence assumption). • Find SFs for verb classes, rather than individual verbs, using an error-prone tagger. • As long as the cues' error rates are incorporated into pE, this still works well. • Manning did this and improved recall.
Attachment Ambiguity: PPs • NP V NP PP -- does the PP modify V or NP? • Assumption: there is only one meaningful parse for each sentence: • The children ate the cake with a spoon. • Bush sent 100,000 soldiers into Kuwait. • Brazil honored their deal with the IMF. • Straw man: compare co-occurrence counts between the pairs <send, into> and <soldiers, into>.
Bias defeats simple counting • Prob(into | send) > Prob(into | soldiers), so counting works there. • But sometimes there is a strong association between the PP and both V and NP: • Ford ended its venture with Fiat. • In this case, there is a bias toward "low attachment" -- attaching the PP to the nearer referent, the NP.
Hindle and Rooth (1993) • An elegant (?) method of quantifying the low-attachment bias • Express P(first PP after the object attaches to the object) and P(first PP after the object attaches to the verb) as functions of • P(NAp) = P(there is a PP headed by p following the object that attaches to the object noun) • P(VAp) = P(there is a PP headed by p following the object that attaches to the verb) • Estimate P(NAp) and P(VAp) by counting
Estimating P(NAp) and P(VAp) • <v, n, p> are a particular verb, noun, and preposition • P(VAp | v) = (# times p attaches to v) / (# occurrences of v) • P(NAp | n) = (# times p attaches to n) / (# occurrences of n) • The two are treated as independent!
Attachment of first PP • P(Attach(p,n) | v,n) = P(NAp | n) • Whenever some PP attaches to the noun, the first PP after the object attaches to the noun (attachments cannot cross). • P(Attach(p,v) | v,n) = P(not NAp | n) P(VAp | v) • The verb gets the first PP only when no PP attaches to the noun AND some PP attaches to the verb. • I (put the [book on the table) on WW2] -- the crossing brackets mark an impossible attachment
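A sketch of the whole pipeline with assumed count dictionaries: the maximum-likelihood estimates and the attachment probabilities follow the formulas above, while the variable names, the toy counts, and the simple "compare the two probabilities" decision rule are illustrative:

```python
def p_verb_attach(p, v, va_counts, v_counts):
    """P(VAp | v): fraction of occurrences of verb v with a p-PP attached to it."""
    return va_counts.get((v, p), 0) / v_counts[v]

def p_noun_attach(p, n, na_counts, n_counts):
    """P(NAp | n): fraction of occurrences of noun n with a p-PP attached to it."""
    return na_counts.get((n, p), 0) / n_counts[n]

def first_pp_attachment(v, n, p, va_counts, v_counts, na_counts, n_counts):
    """Decide where the first PP headed by p attaches after 'v ... n',
    treating noun attachment and verb attachment as independent events."""
    p_na = p_noun_attach(p, n, na_counts, n_counts)
    p_va = p_verb_attach(p, v, va_counts, v_counts)
    attach_noun = p_na                 # if the noun has any p-PP, it takes the first PP
    attach_verb = (1 - p_na) * p_va    # the verb gets the first PP only if the noun has none
    return "noun" if attach_noun >= attach_verb else "verb"

# toy counts: 'into' attaches to 'send' far more often than to 'soldiers'
va_counts = {("send", "into"): 50};    v_counts = {"send": 200}
na_counts = {("soldiers", "into"): 1}; n_counts = {"soldiers": 300}
print(first_pp_attachment("send", "soldiers", "into",
                          va_counts, v_counts, na_counts, n_counts))   # -> verb
```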
Selectional Preferences • Verbs prefer classes of subjects, objects: • Objects of ‘eat’ tend to be food items • Subjects of ‘think’ tend to be people • Subjects of ‘bark’ tend to be dogs • Used to • disambiguate word sense • infer class of new words • rank multiple parses
Disambiguate the class (Resnik) • She interrupted the chair. • A(nc, v) = P(nc | v) log(P(nc | v) / P(nc)) -- the contribution of class nc to the relative entropy (Kullback-Leibler divergence) D(P(C | v) || P(C)) • e.g. A(furniture, interrupted) = P(furniture | interrupted) log(P(furniture | interrupted) / P(furniture))
Estimating P(nc | v) • P(nc | v) = P(nc, v) / P(v) • P(v) is estimated as the proportion of occurrences of v among all verb occurrences • P(nc, v) is estimated as • 1/N Σ(n in nc) C(v,n) / |classes(n)|, where N is the total number of verb-object pairs and classes(n) is the set of classes containing n • Then take the class nc with the highest A(nc, v) as the preferred word sense.
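A sketch of this Resnik-style selectional association using the estimates above; the toy verb-object counts and the class membership table are made up for illustration (note 'chair' belongs to both person and furniture, as in the example):

```python
from math import log

def selectional_association(verb, noun_class, cooc, classes_of, total_pairs):
    """A(nc, v) = P(nc | v) * log(P(nc | v) / P(nc)), with P(nc, v) estimated
    by spreading each verb-object count evenly over the object's classes."""

    def p_class_verb(v, nc):
        # P(nc, v) = 1/N * sum over n in nc of C(v, n) / |classes(n)|
        return sum(count / len(classes_of[n])
                   for (vv, n), count in cooc.items()
                   if vv == v and nc in classes_of[n]) / total_pairs

    p_v = sum(count for (vv, _), count in cooc.items() if vv == verb) / total_pairs
    p_nc_given_v = p_class_verb(verb, noun_class) / p_v
    p_nc = sum(p_class_verb(vv, noun_class) for vv in {vv for (vv, _) in cooc})
    if p_nc_given_v == 0:
        return 0.0
    return p_nc_given_v * log(p_nc_given_v / p_nc)

# toy verb-object counts and a toy class inventory ('chair' is ambiguous)
cooc = {("interrupt", "speaker"): 8, ("interrupt", "chair"): 2,
        ("eat", "apple"): 10}
classes_of = {"speaker": {"person"}, "chair": {"person", "furniture"},
              "apple": {"food"}}
N = sum(cooc.values())
print(selectional_association("interrupt", "person", cooc, classes_of, N))     # high
print(selectional_association("interrupt", "furniture", cooc, classes_of, N))  # low
```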
Semantic similarity • Uses: • classifying a new word • expanding queries in IR • Are two words similar... • When they are used together? • IMF and Brazil • When they are on the same topic? • astronaut and spacewalking • When they function interchangeably? • Soviet and American • When they are synonymous? • astronaut and cosmonaut
Cosine is no panacea • For length-normalized vectors, cosine corresponds to Euclidean distance between points. • But should document-space vectors be treated as points? • Alternative: treat them as probability distributions (after normalizing to sum to 1). • Then there is no particular reason to use cosine -- why not try an information-theoretic approach?
Alternative distance metrics to cosine • Cosine of square roots (Goldszmidt) • L1 norm -- Manhattan distance • sum of the absolute values of the componentwise differences • KL distance • D(p || q) • Mutual information (why not?) • D(p ^ q || pq), i.e. the KL divergence between the joint distribution and the product of the marginals • Information radius -- the information lost by describing both p and q by their midpoint m = (p + q)/2 • IRad(p, q) = D(p || m) + D(q || m)
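A sketch of these measures on probability distributions represented as aligned lists; KL divergence is asymmetric and infinite when q has zeros where p does not, which is one reason the information radius (with m the average of p and q) is often preferred:

```python
from math import log, sqrt

def cosine(p, q):
    """Cosine of the angle between two vectors."""
    dot = sum(pi * qi for pi, qi in zip(p, q))
    return dot / (sqrt(sum(pi * pi for pi in p)) * sqrt(sum(qi * qi for qi in q)))

def l1(p, q):
    """Manhattan (L1) distance: sum of absolute componentwise differences."""
    return sum(abs(pi - qi) for pi, qi in zip(p, q))

def kl(p, q):
    """D(p || q); asymmetric, and infinite if q is zero where p is not."""
    return sum(pi * log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def information_radius(p, q):
    """IRad(p, q) = D(p || m) + D(q || m), with m the midpoint of p and q;
    symmetric and always finite."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return kl(p, m) + kl(q, m)

# two normalized word-context distributions
p = [0.5, 0.3, 0.2, 0.0]
q = [0.4, 0.4, 0.1, 0.1]
print(cosine(p, q), l1(p, q), information_radius(p, q))
```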