Search Engine Statistics Beyond the n-gram: Application to Noun Compound Bracketing

Preslav Nakov and Marti Hearst, Computer Science Division and SIMS, University of California, Berkeley. Supported by NSF DBI-0317510 and a gift from Genentech

Overview • Unsupervised algorithm • Applied here to noun compound bracketing, but promising for structural ambiguity generally • Features • n-grams, χ², MI • Beyond the n-gram • surface features • paraphrases • State-of-the-art accuracy

Noun Compound Bracketing (a) [ [ liver cell ] antibody ] (left bracketing) (b) [ liver [ cell line ] ] (right bracketing) • In (a), the antibody targets the liver cell. • In (b), the cell line is derived from the liver.

Related Work • Marcus (1980), Pustejovsky & al. (1993), Resnik (1993) • adjacency model: Pr(w1|w2) vs. Pr(w2|w3) (Pr(w1|w2): probability that w1 precedes w2) • Lauer (1995) • dependency model: Pr(w1|w2) vs. Pr(w1|w3) • Keller & Lapata (2004) • use the Web • unigrams and bigrams • Girju & al. (2005) • supervised model • bracketing in context • requires WordNet senses to be given • This work: • χ² • Web • n-grams • paraphrases • surface features

Adjacency & Dependency (1) • right bracketing: [ w1 [ w2 w3 ] ] • w2 w3 is a compound (modified by w1) • home health care • w1 and w2 independently modify w3 • adult male rat • left bracketing: [ [ w1 w2 ] w3 ] • only 1 modificational choice possible • law enforcement officer

Adjacency & Dependency (2) • right bracketing: [ w1 [ w2 w3 ] ] • w2 w3 is a compound (modified by w1) • w1 and w2 independently modify w3 • adjacency model • Is w2 w3 a compound? • (vs. w1 w2 being a compound) • dependency model • Does w1 modify w3? • (vs. w1 modifying w2)

Frequencies • Adjacency model • Compare #(w1,w2) to #(w2,w3) • Dependency model • Compare #(w1,w2) to #(w1,w3) • #(w1,w2) is the frequency of “w1 w2” • In both models the first count supports left bracketing and the second supports right bracketing

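Probabilities • Adjacency model • Compare Pr(w1w2|w2) to Pr(w2w3|w3) • Dependency model • Compare Pr(w1w2|w2) to Pr(w1w3|w3) • Pr(w1w2|w2) is the probability that w1 modifies w2 • In both models the first probability supports left bracketing and the second supports right bracketing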

Probabilities: Estimation • Using page hits as a proxy for n-gram counts • Pr(w1w2|w2) = #(w1,w2) / #(w2) • #(w2) word frequency; query for “w2” • #(w1,w2) bigram frequency; query for “w1 w2” • smoothed by 0.5
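
The estimation and the two models can be sketched in a few lines. This is only an illustration under stated assumptions: `hits` is a hypothetical callable that returns a page-hit count for a query string, the counts in `TOY_HITS` are invented, and adding 0.5 to every raw count is one possible reading of "smoothed by 0.5".

```python
from typing import Callable

SMOOTHING = 0.5  # assumption: 0.5 is added to every raw count


def cond_prob(pair_hits: float, word_hits: float) -> float:
    """Estimate Pr(wi wj | wj) as #(wi,wj) / #(wj), with both counts smoothed."""
    return (pair_hits + SMOOTHING) / (word_hits + SMOOTHING)


def bracket(w1: str, w2: str, w3: str,
            hits: Callable[[str], float], model: str = "dependency") -> str:
    """Return 'left', 'right', or 'undecided' for the compound w1 w2 w3."""
    # Both models use Pr(w1 w2 | w2) as the evidence for left bracketing.
    left = cond_prob(hits(f"{w1} {w2}"), hits(w2))
    if model == "adjacency":
        right = cond_prob(hits(f"{w2} {w3}"), hits(w3))
    elif model == "dependency":
        right = cond_prob(hits(f"{w1} {w3}"), hits(w3))
    else:
        raise ValueError(f"unknown model: {model}")
    if left > right:
        return "left"
    if right > left:
        return "right"
    return "undecided"


# Invented page-hit counts, for illustration only.
TOY_HITS = {
    "liver cell": 1000, "cell antibody": 50, "liver antibody": 10,
    "cell": 100000, "antibody": 20000,
}


def toy_hits(query: str) -> float:
    """Stand-in for a search engine: look the query up in TOY_HITS."""
    return TOY_HITS.get(query, 0)


print(bracket("liver", "cell", "antibody", toy_hits, "adjacency"))
print(bracket("liver", "cell", "antibody", toy_hits, "dependency"))
```

Passing the count source as a function keeps the comparison logic independent of any particular search engine.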

Probabilities: Why? (1) • Why should we use: • (a) Pr(w1w2|w2), rather than • (b) Pr(w2w1|w1)? • Keller & Lapata (2004) report the accuracy of each: • AltaVista queries: • (a): 70.49% • (b): 68.85% • British National Corpus: • (a): 63.11% • (b): 65.57%

Probabilities: Why? (2) • Why should we use: • (a) Pr(w1w2|w2), rather than • (b) Pr(w2w1|w1)? • Maybe to introduce a bracketing prior. • Just like Lauer (1995) did. • But otherwise, no reason to prefer either one. • Do we need probabilities? (association is OK) • Do we need a directed model? (symmetry is OK)

Association Models: χ² (Chi Squared) • A = #(wi,wj) • B = #(wi) – #(wi,wj) • C = #(wj) – #(wi,wj) • D = N – (A+B+C) • N = 8 trillion (= A+B+C+D), estimated as 8 billion Web pages × 1,000 words
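
The slide defines the contingency cells A to D, but the statistic itself does not survive in the text. The sketch below therefore assumes the standard Pearson χ² for a 2×2 table, χ² = N(AD − BC)² / ((A+B)(C+D)(A+C)(B+D)); the function name and arguments are illustrative.

```python
N_TOTAL = 8_000_000_000 * 1_000  # 8 billion pages x 1,000 words, as on the slide


def chi_squared(pair_hits: int, wi_hits: int, wj_hits: int,
                n_total: int = N_TOTAL) -> float:
    """Pearson chi-squared association score between wi and wj (2x2 table)."""
    a = pair_hits               # wi and wj together
    b = wi_hits - pair_hits     # wi without wj
    c = wj_hits - pair_hits     # wj without wi
    d = n_total - (a + b + c)   # neither
    denominator = (a + b) * (c + d) * (a + c) * (b + d)
    if denominator == 0:
        return 0.0              # degenerate table: no evidence either way
    return n_total * (a * d - b * c) ** 2 / denominator


# Small worked table: A=10, B=10, C=20, D=60, N=100.
print(chi_squared(10, 20, 30, 100))
```

For bracketing, the same comparisons as before apply: the adjacency model would compare the score for (w1,w2) with the score for (w2,w3), and the dependency model (w1,w2) with (w1,w3).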

Web-derived Surface Features • Authors often disambiguate noun compounds using surface markers, e.g.: • amino-acid sequence → left • brain stem’s cell → left • brain’s stem cell → right • The enormous size of the Web makes them frequent enough to be useful.

Web-derived Surface Features: Dash (hyphen) • Left dash • cell-cycle analysis → left • Right dash • donor T-cell → right • fiber optics-system → right (but should be left) • Double dash • T-cell-depletion → unusable

Web-derived Surface Features: Possessive Marker • Attached to the first word • brain’s stem cell → right • Attached to the second word • brain stem’s cell → left • Combined features • brain’s stem-cell → right

Web-derived Surface Features: Capitalization • don’t-care – lowercase – uppercase • Plasmodium vivax Malaria → left • plasmodium vivax Malaria → left • lowercase – uppercase – don’t-care • brain Stem cell → right • brain Stem Cell → right • Disabled on: • Roman digits • Single-letter words: e.g. vitamin D deficiency

Web-derived Surface Features: Embedded Slash • Left embedded slash • leukemia/lymphoma cell → right

Web-derived Surface Features: Parentheses • Single-word • growth factor (beta) → left • (brain) stem cell → right • Two-word • (growth factor) beta → left • brain (stem cell) → right

Web-derived Surface Features: Comma, dot, colon, semicolon • Following the first word • home. health care → right • adult, male rat → right • Following the second word • health care, provider → left • lung cancer: patients → left

Web-derived Surface Features: Dash to External Word • External word to the left • mouse-brain stem cell → right • External word to the right • tumor necrosis factor-alpha → left

Web-derived Surface Features: Problems & Solutions • Problem: search engines ignore punctuation • “brain-stem cell” does not work • Solution: • query for “brain stem cell” • obtain 1,000 document summaries • look for the features in these summaries
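
A minimal sketch of the last step, looking for surface features in already-downloaded summaries. It covers only three of the features from the preceding slides (dash, possessive marker, two-word parentheses); the regular expressions, the function name, and the sample summaries are illustrative assumptions.

```python
import re
from typing import Dict, Iterable


def surface_votes(w1: str, w2: str, w3: str,
                  summaries: Iterable[str]) -> Dict[str, int]:
    """Count occurrences of left- and right-indicating surface features."""
    e1, e2, e3 = (re.escape(w) for w in (w1, w2, w3))
    patterns = {
        "left": [
            rf"\b{e1}-{e2}\s+{e3}\b",         # left dash: cell-cycle analysis
            rf"\b{e1}\s+{e2}['’]s\s+{e3}\b",  # possessive on the second word
            rf"\({e1}\s+{e2}\)\s+{e3}\b",     # (growth factor) beta
        ],
        "right": [
            rf"\b{e1}\s+{e2}-{e3}\b",         # right dash: donor T-cell
            rf"\b{e1}['’]s\s+{e2}\s+{e3}\b",  # possessive on the first word
            rf"\b{e1}\s+\({e2}\s+{e3}\)",     # brain (stem cell)
        ],
    }
    votes = {"left": 0, "right": 0}
    for text in summaries:
        for side, side_patterns in patterns.items():
            for pattern in side_patterns:
                votes[side] += len(re.findall(pattern, text, re.IGNORECASE))
    return votes


# Invented summaries, for illustration only.
sample = [
    "The brain-stem cell population was studied.",
    "Advances in the brain's stem cell therapy.",
    "We derived a brain stem-cell line.",
    "Brain (stem cell) research continues.",
]
print(surface_votes("brain", "stem", "cell", sample))
```

A double dash such as "T-cell-depletion" matches neither dash pattern, which agrees with the slide marking that case as unusable.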

Other Web-derived Features: Abbreviation • After the second word • tumor necrosis (TN) factor → left • After the third word • tumor necrosis factor (NF) → right • We query for e.g. “tumor necrosis tn factor” • Problems: • Roman digits: IV, VI • States: CA • Short words: me

Other Web-derived Features: Concatenation • Consider health care reform • healthcare: 79,500,000 • carereform: 269 • healthreform: 812 • Adjacency model • healthcare vs. carereform • Dependency model • healthcare vs. healthreform • Triples • “healthcare reform” vs. “health carereform”
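
The concatenation comparison can be worked through with the three counts from the slide. The function name and the dictionary are illustrative; a real run would obtain the counts from a search engine.

```python
from typing import Callable

# Page-hit counts quoted on the slide for "health care reform".
CONCAT_HITS = {"healthcare": 79_500_000, "carereform": 269, "healthreform": 812}


def concat_vote(w1: str, w2: str, w3: str,
                hits: Callable[[str], int], model: str) -> str:
    """Compare concatenated word pairs under the adjacency or dependency model."""
    left = hits(w1 + w2)
    if model == "adjacency":
        right = hits(w2 + w3)
    elif model == "dependency":
        right = hits(w1 + w3)
    else:
        raise ValueError(f"unknown model: {model}")
    if left > right:
        return "left"
    if right > left:
        return "right"
    return "undecided"


def slide_hits(token: str) -> int:
    """Look up a concatenated form in the slide's counts (0 if absent)."""
    return CONCAT_HITS.get(token, 0)


for model in ("adjacency", "dependency"):
    print(model, concat_vote("health", "care", "reform", slide_hits, model))
```

With these counts both models favor left bracketing, since "healthcare" is far more frequent than either "carereform" or "healthreform".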

Other Web-derived Features: Using Google’s * • Each * allows a one-word wildcard • Single star • “health care * reform” → left • “health * care reform” → right • More stars and/or reverse order • “care reform * * health” → right • Adjacency model
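
A small sketch of how such wildcard queries might be generated. Only the three query shapes shown on the slide are produced; applying each shape with every star count up to `max_stars` is an assumption, as are the function and parameter names.

```python
from typing import List, Tuple


def star_queries(w1: str, w2: str, w3: str,
                 max_stars: int = 2) -> List[Tuple[str, str]]:
    """Return (exact-phrase query, predicted bracketing) pairs."""
    queries = []
    for k in range(1, max_stars + 1):
        stars = " ".join("*" * k)  # k one-word wildcards: "*", "* *", ...
        # w1 w2 stay adjacent, w3 is pushed away: evidence for left.
        queries.append((f'"{w1} {w2} {stars} {w3}"', "left"))
        # w2 w3 stay adjacent, w1 is pushed away: evidence for right.
        queries.append((f'"{w1} {stars} {w2} {w3}"', "right"))
        # Reverse order with w2 w3 kept together: evidence for right.
        queries.append((f'"{w2} {w3} {stars} {w1}"', "right"))
    return queries


for query, prediction in star_queries("health", "care", "reform"):
    print(query, prediction)
```

The hit counts of the queries on each side would then be summed and compared, in the same way as the other features.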

Other Web-derived Features: Reorder • Reorders for “health care reform” • “care reform health” → right • “reform health care” → left

Other Web-derived Features: Internal Inflection Variability • First word • ??? • Second word • tyrosine kinase activation • tyrosine kinases activation

Other Web-derived Features: Switch The First Two Words • Predict right, if we can reorder • adult male rat as • male adult rat

Paraphrases (1) • The semantics of a noun compound is often made overt by a paraphrase (Warren, 1978) • Prepositional • stem cells in the brain → right • cells from the brain stem → left • Verbal • virus causing human immunodeficiency → left • pain associated with arthritis migraine → left • Copula • office building that is a skyscraper → right

Paraphrases (2) • Lauer (1995), Keller & Lapata (2003), Girju & al. (2005) predict NC semantics by choosing the most likely preposition: • of, for, in, at, on, from, with, about, (like) • This could be problematic when more than one preposition is possible • In contrast: • we try to predict syntax, not semantics • we do not disambiguate, just add up all counts • cells in (the) bone marrow → left • cells from (the) bone marrow → left

Paraphrases (3) • prepositional paraphrases: • We use: ~150 prepositions • verbal paraphrases: • We use: associated with, caused by, contained in, derived from, focusing on, found in, involved in, located at/in, made of, performed by, preventing, related to and used by/in/for. • copula paraphrases: • We use: is/was and that/which/who • optional elements: • articles: a, an, the • quantifiers: some, every, etc. • pronouns: this, these, etc.
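
A rough sketch of prepositional paraphrase generation, following the "cells in (the) bone marrow → left" pattern. The preposition list is a small illustrative subset of the roughly 150 used, only the article "the" is tried as an optional element, and inflection (e.g. cell vs. cells) is ignored; all names are hypothetical.

```python
from itertools import product
from typing import List, Sequence, Tuple

PREPOSITIONS = ["in", "from", "of", "for"]  # illustrative subset only
OPTIONAL = ["", "the"]                      # no article, or "the"


def paraphrase_queries(w1: str, w2: str, w3: str,
                       linkers: Sequence[str] = PREPOSITIONS
                       ) -> List[Tuple[str, str]]:
    """Return (exact-phrase query, predicted bracketing) pairs for w1 w2 w3."""
    queries = []
    for linker, article in product(linkers, OPTIONAL):
        link = f"{linker} {article}".strip()
        # w1 w2 stay together after the head w3: evidence for left.
        queries.append((f'"{w3} {link} {w1} {w2}"', "left"))
        # w2 w3 stay together before w1: evidence for right.
        queries.append((f'"{w2} {w3} {link} {w1}"', "right"))
    return queries


for query, prediction in paraphrase_queries("bone", "marrow", "cells")[:4]:
    print(query, prediction)
```

Because the counts for all prepositions are simply added up per side, no single preposition has to be chosen.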

Evaluation: Datasets • Lauer Set • 244 noun compounds (NCs) • from Grolier’s encyclopedia • inter-annotator agreement: 81.5% • Biomedical Set • 430 NCs • from MEDLINE • inter-annotator agreement: 88% (κ = .606)

Evaluation: Experiments • Exact phrase queries • Limited to English • Inflections: • Lauer Set: Carroll’s morphological tools • Biomedical Set: UMLS Specialist Lexicon

Results: Lauer (1) [chart; legend: correct, wrong, N/A]

Results: Lauer (2) [chart; legend: correct, wrong, N/A]

Results: Lauer (3)

Results: Bio (1) [chart; legend: correct, wrong, N/A]

Results: Bio (2) [chart; legend: correct, wrong, N/A]

Individual Surface Features Performance: Bio

Paraphrase and Surface Features Performance • Lauer Set • Biomedical Set

Discussion (Lauer and Bio sets) • Adjacency vs. Dependency • χ² vs. frequencies vs. probabilities

Conclusion • Introduced search engine statistics that go beyond the n-gram (applicable to other tasks) • surface features • paraphrases • Obtained new state-of-the-art results on NC bracketing • more robust than Lauer (1995) • more accurate than Keller&Lapata (2004)

Future Work • Recognize ambiguous cases • Bracket more than 3 nouns • Not just bracketing but dependencies: • e.g. growth factor alpha • Bracket NPs in general (other POS) • augment Penn Treebank with NP-internal dependencies • Application to other structural ambiguity problems: • Prepositional phrase attachment • Noun phrase coordination

The End Thank you!

Web Counts: Problems • Page hits are inaccurate • This may be ok (Keller&Lapata,2003) • The Web lacks linguistic annotation • Pr(health|care) = #(“health care”) / #(care) • health: noun • care: both verb and noun • can be adjacent by chance • can come from different sentences • Cannot find: • stem cells VERB PREPOSITION brain • protein synthesis’ inhibition
