This work goes beyond n-grams, using surface features and paraphrases drawn from Web search results to bracket noun compounds accurately. It explores adjacency and dependency models, comparing estimated probabilities for the competing structural choices.
Search Engine Statistics Beyond the n-gram: Application to Noun Compound Bracketing
Preslav Nakov and Marti Hearst
Computer Science Division and SIMS, University of California, Berkeley
Supported by NSF DBI-0317510 and a gift from Genentech
Overview
• Unsupervised algorithm
  • Applied here to noun compound bracketing, but promising for structural ambiguity generally
• Features
  • n-grams, χ², MI
• Beyond the n-gram
  • surface features
  • paraphrases
• State-of-the-art accuracy
Noun Compound Bracketing
(a) [ [ liver cell ] antibody ] (left bracketing)
(b) [ liver [ cell line ] ] (right bracketing)
• In (a), the antibody targets the liver cell.
• In (b), the cell line is derived from the liver.
Related Work
• Marcus (1980), Pustejovsky et al. (1993), Resnik (1993)
  • adjacency model: Pr(w1|w2) vs. Pr(w2|w3)
• Lauer (1995)
  • dependency model: Pr(w1|w2) vs. Pr(w1|w3)
  • (Pr(wi|wj) is the probability that wi precedes wj.)
• Keller & Lapata (2004)
  • use the Web
  • unigrams and bigrams
• Girju et al. (2005)
  • supervised model
  • bracketing in context
  • requires WordNet senses to be given
• This work: χ², the Web, n-grams, paraphrases, surface features
Adjacency & Dependency (1)
• right bracketing: [ w1 [ w2 w3 ] ]
  • w2 w3 is a compound (modified by w1): home health care
  • or w1 and w2 independently modify w3: adult male rat
• left bracketing: [ [ w1 w2 ] w3 ]
  • only one modificational choice possible: law enforcement officer
Adjacency & Dependency (2)
• right bracketing: [ w1 [ w2 w3 ] ]
  • w2 w3 is a compound (modified by w1)
  • or w1 and w2 independently modify w3
• adjacency model
  • Is w2 w3 a compound? (vs. w1 w2 being a compound)
• dependency model
  • Does w1 modify w3? (vs. w1 modifying w2)
Frequencies
• Adjacency model
  • Compare #(w1,w2) to #(w2,w3)
• Dependency model
  • Compare #(w1,w2) to #(w1,w3)
• #(wi,wj) is the frequency of the bigram wi wj; the larger count decides left vs. right.
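As a concrete illustration of these two comparisons, here is a minimal sketch in Python; the `counts` dictionary is a made-up stand-in for Web-derived bigram frequencies, not data from the talk.

```python
def bracket_by_frequency(counts, w1, w2, w3, model="dependency"):
    """Return 'left' or 'right' for the compound w1 w2 w3.

    Adjacency model:  compare #(w1,w2) to #(w2,w3).
    Dependency model: compare #(w1,w2) to #(w1,w3).
    """
    left = counts.get((w1, w2), 0)
    if model == "adjacency":
        right = counts.get((w2, w3), 0)
    else:
        right = counts.get((w1, w3), 0)
    return "left" if left > right else "right"

# Made-up counts, for illustration only:
counts = {
    ("law", "enforcement"): 11000,
    ("enforcement", "officer"): 2100,
    ("law", "officer"): 400,
}
print(bracket_by_frequency(counts, "law", "enforcement", "officer"))  # left
```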
Probabilities
• Adjacency model
  • Compare Pr(w1→w2 | w2) to Pr(w2→w3 | w3)
• Dependency model
  • Compare Pr(w1→w2 | w2) to Pr(w1→w3 | w3)
• Pr(wi→wj | wj) is the probability that wi modifies wj; the larger probability decides left vs. right.
Probabilities: Dependency
• Dependency model
  • Pr(left) ≈ Pr(w1→w2 | w2) × Pr(w2→w3 | w3)
  • Pr(right) ≈ Pr(w1→w3 | w3) × Pr(w2→w3 | w3)
• The common factor Pr(w2→w3 | w3) cancels, so we compare Pr(w1→w2 | w2) to Pr(w1→w3 | w3).
• BUT! There is no such cancellation in Lauer's model.
Probabilities: Estimation
• Use page hits as a proxy for n-gram counts
• Pr(w1→w2 | w2) = #(w1,w2) / #(w2)
  • #(w2): word frequency; query for "w2"
  • #(w1,w2): bigram frequency; query for "w1 w2"
  • counts smoothed by adding 0.5
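A minimal sketch of this estimate, assuming a hypothetical `page_hits(query)` helper that returns a search engine's hit count for an exact-phrase query (the later sketches reuse this stub); the add-0.5 smoothing follows the slide:

```python
def page_hits(query):
    """Hypothetical stub: return the number of pages matching an
    exact-phrase query. A real implementation would call a search API."""
    raise NotImplementedError

def pr_modifies(w1, w2):
    # Pr(w1 -> w2 | w2) = #(w1,w2) / #(w2), with counts smoothed by 0.5.
    bigram = page_hits(f'"{w1} {w2}"') + 0.5
    unigram = page_hits(f'"{w2}"') + 0.5
    return bigram / unigram
```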
Probabilities: Why? (1)
• Why should we use (a) Pr(w1→w2 | w2) rather than (b) Pr(w2→w1 | w1)?
• Keller & Lapata (2004) calculate:
  • AltaVista queries: (a) 70.49%, (b) 68.85%
  • British National Corpus: (a) 63.11%, (b) 65.57%
Probabilities: Why? (2)
• Why should we use (a) Pr(w1→w2 | w2) rather than (b) Pr(w2→w1 | w1)?
• Perhaps to introduce a bracketing prior, as Lauer (1995) did.
• But otherwise there is no reason to prefer either one.
• Do we need probabilities at all? (an association score is OK)
• Do we need a directed model? (a symmetric one is OK)
Association Models: χ² (Chi-Squared)
• A = #(wi,wj)
• B = #(wi) − #(wi,wj)
• C = #(wj) − #(wi,wj)
• D = N − (A + B + C)
• N = A + B + C + D = 8 trillion (8 billion Web pages × 1,000 words each)
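The slide gives the contingency cells but not the score itself; below is a minimal sketch using the standard closed-form χ² for a 2×2 table, with the cells built exactly as defined above:

```python
def chi_squared(n_ij, n_i, n_j, N=8e12):
    """Chi-squared association score for the bigram (wi, wj).

    n_ij = #(wi,wj), n_i = #(wi), n_j = #(wj); N is the estimated
    total number of words on the Web (8 trillion, per the slide).
    """
    A = n_ij                 # both words together
    B = n_i - n_ij           # wi without wj
    C = n_j - n_ij           # wj without wi
    D = N - (A + B + C)      # neither word
    # Standard closed form for a 2x2 contingency table:
    return N * (A * D - B * C) ** 2 / ((A + B) * (A + C) * (B + D) * (C + D))
```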
Web-derived Surface Features
• Authors often disambiguate noun compounds using surface markers, e.g.:
  • law-enforcement officer → left
  • brain stem's cell → left
  • brain's stem cell → right
• The enormous size of the Web makes these markers frequent enough to be useful.
Web-derived Surface Features: Dash (Hyphen)
• Left dash
  • cell-cycle analysis → left
• Right dash
  • donor T-cell → right
  • but fiber optics-system should be left…
• Double dash
  • T-cell-depletion → unusable
Web-derived Surface Features: Possessive Marker
• Attached to the first word
  • brain's stem cell → right
• Attached to the second word
  • brain stem's cell → left
• Combined features
  • brain's stem-cell → right
Web-derived Surface Features: Capitalization
• don't-care – lowercase – uppercase
  • Plasmodium vivax Malaria → left
  • plasmodium vivax Malaria → left
• lowercase – uppercase – don't-care
  • brain Stem cell → right
  • brain Stem Cell → right
• Disabled on:
  • Roman numerals
  • single-letter words, e.g., vitamin D deficiency
Web-derived Surface Features: Embedded Slash
• Left embedded slash
  • leukemia/lymphoma cell → right
Web-derived Surface Features: Parentheses
• Single-word
  • growth factor (beta) → left
  • (brain) stem cell → right
• Two-word
  • (growth factor) beta → left
  • brain (stem cell) → right
Web-derived Surface Features: Comma, Period, Colon, Semicolon, …
• Following the first word
  • home. health care → right
  • adult, male rat → right
• Following the second word
  • health care, provider → left
  • lung cancer: patients → left
Web-derived Surface Features: Dash to External Word
• Dash to an external word on the left
  • mouse-brain stem cell → right
• Dash to an external word on the right
  • tumor necrosis factor-alpha → left
Web-derived Surface Features: Problems & Solutions
• Problem: search engines ignore punctuation
  • querying for "brain-stem cell" does not work
• Solution (sketched below):
  • query for "brain stem cell"
  • obtain up to 1,000 document summaries
  • look for the features in these summaries
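A minimal sketch of this workaround, assuming a hypothetical `fetch_summaries(query, n)` helper that returns up to n result snippets; for brevity, only the left-dash and right-dash features are checked:

```python
import re

def fetch_summaries(query, n=1000):
    """Hypothetical stub: return up to n result-page snippets for an
    exact-phrase query. A real implementation would use a search API."""
    raise NotImplementedError

def dash_votes(w1, w2, w3):
    """Count left/right votes from hyphen features found in snippets."""
    votes = {"left": 0, "right": 0}
    for snippet in fetch_summaries(f'"{w1} {w2} {w3}"'):
        text = snippet.lower()
        if re.search(rf"\b{w1}-{w2}\s+{w3}\b", text):   # e.g., cell-cycle analysis
            votes["left"] += 1
        if re.search(rf"\b{w1}\s+{w2}-{w3}\b", text):   # e.g., donor T-cell
            votes["right"] += 1
    return votes
```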
Other Web-derived Features: Abbreviation
• After the second word
  • tumor necrosis (TN) factor → left
• After the third word
  • tumor necrosis factor (NF) → right
• We query for, e.g., "tumor necrosis tn factor"
• Problems:
  • Roman numerals: IV, vii
  • state abbreviations: CA
  • short words: me
Other Web-derived Features: Concatenation
• Consider health care reform:
  • healthcare: 79,500,000
  • carereform: 269
  • healthreform: 812
• Adjacency model
  • healthcare vs. carereform
• Dependency model
  • healthcare vs. healthreform (see the sketch below)
• Triples
  • "healthcarereform" vs. "health carereform"
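As referenced above, a minimal sketch of the concatenation comparison, reusing the hypothetical `page_hits` stub from the estimation sketch; the docstring numbers are the slide's own example:

```python
def concatenation_vote(w1, w2, w3, model="dependency"):
    """Compare concatenation frequencies to vote 'left' or 'right'.

    For "health care reform": #(healthcare) = 79,500,000 vs.
    #(healthreform) = 812, so the dependency comparison votes left.
    Reuses the hypothetical page_hits() stub defined earlier.
    """
    left = page_hits(f'"{w1}{w2}"')         # e.g., healthcare
    if model == "adjacency":
        right = page_hits(f'"{w2}{w3}"')    # e.g., carereform
    else:
        right = page_hits(f'"{w1}{w3}"')    # e.g., healthreform
    if left == right:
        return None  # no vote
    return "left" if left > right else "right"
```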
Other Web-derived Features: Using Google's *
• Each * allows a one-word wildcard
• Single star
  • "health care * reform" → left
  • "health * care reform" → right
• More stars and/or reverse order
  • "care reform * * health" → right
  • "reform * * * health care" → left
• (These follow the adjacency model.)
Other Web-derived Features: Reorder
• Reorders for "health care reform"
  • "care reform health" → right
  • "reform health care" → left
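A minimal sketch of the reorder feature under the same assumptions (hypothetical `page_hits` stub from earlier):

```python
def reorder_vote(w1, w2, w3):
    """Vote via reordered exact phrases, as in the slide's examples.

    "w2 w3 w1" attested -> right (e.g., "care reform health")
    "w3 w1 w2" attested -> left  (e.g., "reform health care")
    """
    right = page_hits(f'"{w2} {w3} {w1}"')
    left = page_hits(f'"{w3} {w1} {w2}"')
    if left == right:
        return None  # no vote
    return "left" if left > right else "right"
```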
Other Web-derived Features: Internal Inflection Variability
• First word
  • bone mineral density vs. bones mineral density → right
• Second word
  • bone mineral density vs. bone minerals density → left
Other Web-derived Features: Switch the First Two Words
• Predict right if we can reorder adult male rat as male adult rat
Paraphrases (1)
• The semantics of a noun compound is often made overt by a paraphrase (Warren, 1978)
• Prepositional
  • stem cells in the brain → right
  • stem cells from the brain → right
  • cells from the brain stem → left
• Verbal
  • virus causing human immunodeficiency → left
  • pain associated with arthritis migraine → left
• Copula
  • office building that is a skyscraper → right
Paraphrases (2)
• Lauer (1995), Keller & Lapata (2003), and Girju et al. (2005) try to choose the best prepositional paraphrase as a proxy for the semantic interpretation of NCs
  • They use: of, for, in, at, on, from, with, about (, like)
  • This can be problematic when more than one preposition is possible.
• In contrast:
  • we try to predict syntax, not semantics
  • we do not need to disambiguate, just add up all counts:
    • cells in (the) bone marrow → left (61,700)
    • cells from (the) bone marrow → left (16,500)
    • marrow cells from (the) bone → right (12)
Paraphrases (3)
• Prepositional paraphrases
  • We use ~150 prepositions.
• Verbal paraphrases
  • We use: associated with, caused by, contained in, derived from, focusing on, found in, involved in, located at/in, made of, performed by, preventing, related to, and used by/in/for.
• Copula paraphrases
  • We use: that/which/who and is/was.
• Optional elements
  • articles: a, an, the
  • quantifiers: some, every, etc.
  • pronouns: this, these, etc.
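To make the count summation concrete, here is a minimal sketch of the prepositional-paraphrase vote; it reuses the hypothetical `page_hits` stub, a trimmed preposition list (the talk uses ~150), and only the article "the" from the optional elements:

```python
PREPOSITIONS = ["of", "for", "in", "at", "on", "from", "with", "about"]  # trimmed

def paraphrase_vote(w1, w2, w3):
    """Sum hits over prepositional paraphrases of each bracketing.

    Right [w1 [w2 w3]]: "w2 w3 PREP the w1"
        e.g., "stem cells in the brain" for brain stem cells.
    Left [[w1 w2] w3]:  "w3 PREP the w1 w2"
        e.g., "cells from the brain stem".
    No disambiguation needed: counts for all prepositions are added up.
    """
    left = right = 0
    for prep in PREPOSITIONS:
        right += page_hits(f'"{w2} {w3} {prep} the {w1}"')
        left += page_hits(f'"{w3} {prep} the {w1} {w2}"')
    return "left" if left > right else "right"
```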
Evaluation: Datasets
• Lauer set
  • 244 noun compounds (NCs)
  • from Grolier's encyclopedia
  • inter-annotator agreement: 81.5%
• Biomedical set
  • 430 NCs
  • from MEDLINE
  • inter-annotator agreement: 88% (κ = 0.606)
Evaluation: Experiments
• Exact-phrase queries
• Limited to English
• Inflections:
  • Lauer set: Carroll's morphological tools
  • Biomedical set: UMLS Specialist Lexicon
Results: Lauer (1) [chart; legend: correct / N/A / wrong]
Results: Lauer (2) [chart; legend: correct / N/A / wrong]
Results: Bio (1) [chart; legend: correct / N/A / wrong]
Results: Bio (2) [chart; legend: correct / N/A / wrong]
Conclusion
• Obtained new state-of-the-art results on NC bracketing
  • more robust than Lauer (1995)
  • more accurate than Keller & Lapata (2004)
• Introduced search engine statistics that go beyond the n-gram
  • surface features
  • paraphrases
• Works well for other structural ambiguity problems:
  • prepositional phrase attachment
  • noun phrase coordination
Future Work
• Recognize ambiguous cases
  • 3-way classification
• Bracket more than 3 nouns
• Predict not just bracketing but dependencies
  • e.g., growth factor alpha
• Bracket NPs in general (other parts of speech)
  • augment the Penn Treebank with NP-internal dependencies
The End Thank you!
Web Counts: Problems
• Page hits are inaccurate
  • This may be OK (Keller & Lapata, 2003)
• The Web lacks linguistic annotation
  • Pr(health→care | care) = #("health care") / #(care)
    • health: a noun
    • care: both a verb and a noun
  • words can be adjacent by chance
  • words can come from different sentences
• Cannot search for:
  • stem cells VERB PREPOSITION brain
  • protein synthesis' inhibition
Inter-annotator Agreement: Lauer Set
• Lauer: 6 judges
  • average agreement: 81.50%
  • best pair of annotators: 84.40%
  • worst pair of annotators: 73.00%
• 308 examples in total
  • 244 used
  • the rest: indeterminate or extraction errors
• Problem:
  • gold standard: Lauer, in context
  • human judges: no context!