
Search Engine Statistics Beyond the n-gram: Application to Noun Compound Bracketing

This study goes beyond n-grams, using surface features and paraphrases from search engine results to bracket noun compounds accurately. It examines the adjacency and dependency models, which estimate and compare probabilities for the competing bracketing choices.


  1. Search Engine Statistics Beyond the n-gram: Application to Noun Compound Bracketing • Preslav Nakov and Marti Hearst • Computer Science Division and SIMS, University of California, Berkeley • Supported by NSF DBI-0317510 and a gift from Genentech

  2. Overview • Unsupervised algorithm • Applied here to noun compound bracketing, but promising for structural ambiguity generally • Features • n-grams, χ², MI • Beyond the n-gram • surface features • paraphrases • State-of-the-art accuracy

  3. Noun Compound Bracketing • (a) [ [ liver cell ] antibody ] (left bracketing) • (b) [ liver [ cell line ] ] (right bracketing) • In (a), the antibody targets the liver cell. • In (b), the cell line is derived from the liver.

  4. Related Work • Marcus (1980), Pustejovsky & al. (1993), Resnik (1993) • adjacency model: Pr(w1|w2) vs. Pr(w2|w3), where Pr(w1|w2) is the probability that w1 precedes w2 • Lauer (1995) • dependency model: Pr(w1|w2) vs. Pr(w1|w3) • Keller & Lapata (2004): • use the Web • unigrams and bigrams • Girju & al. (2005) • supervised model • bracketing in context • requires WordNet senses to be given • This work: • χ² • Web • n-grams • paraphrases • surface features

  5. Adjacency & Dependency (1) • right bracketing: [ w1 [ w2 w3 ] ] • w2 w3 is a compound (modified by w1) • e.g., home health care • w1 and w2 independently modify w3 • e.g., adult male rat • left bracketing: [ [ w1 w2 ] w3 ] • only 1 modificational choice possible • e.g., law enforcement officer

  6. Adjacency & Dependency (2) • right bracketing: [ w1 [ w2 w3 ] ] • w2 w3 is a compound (modified by w1) • w1 and w2 independently modify w3 • adjacency model • Is w2 w3 a compound? • (vs. w1 w2 being a compound) • dependency model • Does w1 modify w3? • (vs. w1 modifying w2)

  7. Frequencies • #(w1,w2) is the frequency of “w1 w2” • Adjacency model • Compare #(w1,w2) (left) to #(w2,w3) (right) • Dependency model • Compare #(w1,w2) (left) to #(w1,w3) (right)
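
The comparison on this slide can be sketched in a few lines of Python. This is only an illustration, not the authors' implementation: the function name, the `"undecided"` outcome for ties, and the toy counts are all assumptions introduced here.

```python
from typing import Dict, Tuple

Counts = Dict[Tuple[str, str], int]


def bracket_by_frequency(counts: Counts, w1: str, w2: str, w3: str,
                         model: str = "dependency") -> str:
    """Compare raw bigram counts as on the Frequencies slide.

    Both models use #(w1,w2) as evidence for left bracketing. The adjacency
    model compares it to #(w2,w3), the dependency model to #(w1,w3).
    """
    left_evidence = counts.get((w1, w2), 0)
    if model == "adjacency":
        right_evidence = counts.get((w2, w3), 0)
    elif model == "dependency":
        right_evidence = counts.get((w1, w3), 0)
    else:
        raise ValueError(f"unknown model: {model!r}")

    if left_evidence > right_evidence:
        return "left"
    if right_evidence > left_evidence:
        return "right"
    return "undecided"  # assumption: the slide does not say how ties are handled


# Hypothetical counts, invented for illustration only.
toy_counts: Counts = {
    ("liver", "cell"): 900,
    ("cell", "antibody"): 40,
    ("liver", "antibody"): 15,
}
print(bracket_by_frequency(toy_counts, "liver", "cell", "antibody", "adjacency"))
print(bracket_by_frequency(toy_counts, "liver", "cell", "antibody", "dependency"))
```

The two models differ only in which bigram competes with #(w1,w2), so a single function with a `model` switch keeps that difference visible.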

  8. Probabilities • Pr(w1→w2|w2) is the probability that w1 modifies w2 • Adjacency model • Compare Pr(w1→w2|w2) (left) to Pr(w2→w3|w3) (right) • Dependency model • Compare Pr(w1→w2|w2) (left) to Pr(w1→w3|w3) (right)

  9. Probabilities: Dependency • Dependency model • Pr(left) = Pr(w1→w2|w2) Pr(w2→w3|w3) • Pr(right) = Pr(w1→w3|w3) Pr(w2→w3|w3) • The common factor Pr(w2→w3|w3) cancels, so we compare Pr(w1→w2|w2) to Pr(w1→w3|w3) • BUT! No cancellation in Lauer’s model

  10. Probabilities: Estimation • Using page hits as a proxy for n-gram counts • Pr(w1→w2|w2) = #(w1,w2) / #(w2) • #(w2): word frequency; query for “w2” • #(w1,w2): bigram frequency; query for “w1 w2” • smoothed by 0.5
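
A minimal Python sketch of this estimate follows. The slide says only "smoothed by 0.5"; the sketch assumes this means adding 0.5 to every raw count, which is a guess on our part. The function names, the `hits` dictionary that stands in for search engine page hits, and the numbers in it are all hypothetical.

```python
from typing import Dict

SMOOTHING = 0.5  # assumption: added to every raw count before dividing


def modification_probability(hits: Dict[str, int], modifier: str, head: str) -> float:
    """Estimate #(modifier, head) / #(head) from page hits, with smoothing."""
    bigram = hits.get(f"{modifier} {head}", 0) + SMOOTHING
    unigram = hits.get(head, 0) + SMOOTHING
    return bigram / unigram


def dependency_bracketing(hits: Dict[str, int], w1: str, w2: str, w3: str) -> str:
    """Dependency model: does w1 modify w2 (left) or w3 (right)?"""
    p_left = modification_probability(hits, w1, w2)
    p_right = modification_probability(hits, w1, w3)
    if p_left > p_right:
        return "left"
    if p_right > p_left:
        return "right"
    return "undecided"  # assumption: tie handling is not specified on the slide


# Hypothetical page hits keyed by query string, invented for illustration.
toy_hits = {"cell": 1000, "antibody": 500, "liver cell": 200, "liver antibody": 5}
print(dependency_bracketing(toy_hits, "liver", "cell", "antibody"))
```

In practice `hits` would be filled from exact-phrase queries; a plain dictionary keeps the sketch independent of any particular search API.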

  11. Probabilities: Why? (1) • Why should we use: • (a) Pr(w1→w2|w2), rather than • (b) Pr(w2→w1|w1)? • Keller & Lapata (2004) calculate: • AltaVista queries: • (a): 70.49% • (b): 68.85% • British National Corpus: • (a): 63.11% • (b): 65.57%

  12. Probabilities: Why? (2) • Why should we use: • (a) Pr(w1→w2|w2), rather than • (b) Pr(w2→w1|w1)? • Maybe to introduce a bracketing prior. • Just like Lauer (1995) did. • But otherwise, no reason to prefer either one. • Do we need probabilities? (association is OK) • Do we need a directed model? (symmetry is OK)

  13. Association Models: χ² (Chi Squared) • A = #(wi,wj) • B = #(wi) – #(wi,wj) • C = #(wj) – #(wi,wj) • D = N – (A+B+C) • N = 8 trillion (= A+B+C+D): 8 billion Web pages × 1,000 words
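
The slide's formula did not survive extraction, so the sketch below uses the standard χ² statistic for a 2×2 contingency table, N(AD − BC)² / ((A+C)(B+D)(A+B)(C+D)), with A, B, C, D and N defined as above. Treat it as an assumption that this is the exact form the authors used. The function and parameter names are made up for illustration.

```python
N_WORDS = 8_000_000_000_000  # 8 billion Web pages x 1,000 words, as on the slide


def chi_squared(count_ij: int, count_i: int, count_j: int, n: int = N_WORDS) -> float:
    """Standard 2x2 chi-squared score for the word pair (wi, wj).

    count_ij = #(wi,wj), count_i = #(wi), count_j = #(wj).
    """
    a = count_ij
    b = count_i - count_ij
    c = count_j - count_ij
    d = n - (a + b + c)
    denominator = (a + c) * (b + d) * (a + b) * (c + d)
    if denominator == 0:
        return 0.0  # assumption: treat a degenerate table as "no association"
    # Python integers are arbitrary precision, so the large products do not overflow.
    return n * (a * d - b * c) ** 2 / denominator


# Small hand-checkable table (n = 100): A=10, B=10, C=20, D=60.
print(chi_squared(10, 20, 30, n=100))
```

For bracketing, the score would presumably replace the raw counts in the comparisons above, for example χ²(w1,w2) against χ²(w2,w3) under the adjacency model.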

  14. Web-derived Surface Features • Authors often disambiguate noun compounds using surface markers, e.g.: • law-enforcement officer → left • brain stem’s cell → left • brain’s stem cell → right • The enormous size of the Web makes them frequent enough to be useful.

  15. Web-derived Surface Features: Dash (hyphen) • Left dash • cell-cycle analysis → left • Right dash • donor T-cell → right • fiber optics-system → predicted right, but should be left • Double dash • T-cell-depletion → unusable

  16. Web-derived Surface Features: Possessive Marker • Attached to the first word • brain’s stem cell → right • Attached to the second word • brain stem’s cell → left • Combined features • brain’s stem-cell → right

  17. Web-derived Surface Features: Capitalization • don’t-care – lowercase – uppercase • Plasmodium vivax Malaria → left • plasmodium vivax Malaria → left • lowercase – uppercase – don’t-care • brain Stem cell → right • brain Stem Cell → right • Disabled on: • Roman numerals • Single-letter words, e.g., vitamin D deficiency

  18. Web-derived Surface Features: Embedded Slash • Left embedded slash • leukemia/lymphoma cell → right

  19. Web-derived Surface Features: Parentheses • Single-word • growth factor (beta) → left • (brain) stem cell → right • Two-word • (growth factor) beta → left • brain (stem cell) → right

  20. Web-derived Surface Features: Comma, Period, Colon, Semicolon, … • Following the first word • home. health care → right • adult, male rat → right • Following the second word • health care, provider → left • lung cancer: patients → left

  21. Web-derived Surface Features: Dash to External Word • Dash to an external word to the left • mouse-brain stem cell → right • Dash to an external word to the right • tumor necrosis factor-alpha → left

  22. Web-derived Surface Features: Problems & Solutions • Problem: search engines ignore punctuation • querying for “brain-stem cell” does not work • Solution: • query for “brain stem cell” • obtain 1,000 document summaries • look for the features in these summaries
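
The last step, scanning the returned summaries for surface features, might look like the sketch below for the two dash features. The regular expressions, the function name, and the example snippets are assumptions made for illustration. Retrieving the summaries is not shown because it depends on the search engine.

```python
import re
from collections import Counter
from typing import Iterable


def count_dash_votes(summaries: Iterable[str], w1: str, w2: str, w3: str) -> Counter:
    """Count left-dash ("w1-w2 w3") and right-dash ("w1 w2-w3") occurrences."""
    e1, e2, e3 = (re.escape(w) for w in (w1, w2, w3))
    # Requiring whitespace at the other boundary skips double-dash forms
    # such as "w1-w2-w3", which the slides call unusable.
    left_dash = re.compile(rf"\b{e1}-{e2}\s+{e3}\b", re.IGNORECASE)
    right_dash = re.compile(rf"\b{e1}\s+{e2}-{e3}\b", re.IGNORECASE)

    votes: Counter = Counter()
    for text in summaries:
        votes["left"] += len(left_dash.findall(text))
        votes["right"] += len(right_dash.findall(text))
    return votes


# Hypothetical document summaries, invented for illustration.
snippets = [
    "The brain-stem cell population was isolated.",
    "We established a brain stem-cell line.",
    "Brain-stem cell recordings in mice.",
    "Analysis of brain-stem-cell markers.",
]
votes = count_dash_votes(snippets, "brain", "stem", "cell")
print(votes["left"], votes["right"])
```

The other features (possessives, capitalization, parentheses, punctuation) could be handled with further patterns of the same kind, each adding to the left or right tally.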

  23. Other Web-derived Features: Abbreviation • After the second word • tumor necrosis (TN) factor → left • After the third word • tumor necrosis factor (NF) → right • We query for, e.g., “tumor necrosis tn factor” • Problems: • Roman numerals: IV, vii • States: CA • Short words: me

  24. Other Web-derived Features: Concatenation • Consider health care reform • healthcare: 79,500,000 • carereform: 269 • healthreform: 812 • Adjacency model • healthcare vs. carereform • Dependency model • healthcare vs. healthreform • Triples • “healthcarereform” vs. “health carereform”
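
The concatenation comparison can be sketched with the three counts quoted on this slide. The function name and return format are assumptions; the counts themselves are the slide's.

```python
from typing import Dict


def concatenation_predictions(hits: Dict[str, int], w1: str, w2: str, w3: str) -> Dict[str, str]:
    """Compare hits for concatenated word pairs under both models."""
    left = hits.get(w1 + w2, 0)              # e.g. "healthcare"
    adjacency_right = hits.get(w2 + w3, 0)   # e.g. "carereform"
    dependency_right = hits.get(w1 + w3, 0)  # e.g. "healthreform"

    def decide(left_hits: int, right_hits: int) -> str:
        if left_hits > right_hits:
            return "left"
        if right_hits > left_hits:
            return "right"
        return "undecided"  # assumption: tie handling is not given on the slide

    return {
        "adjacency": decide(left, adjacency_right),
        "dependency": decide(left, dependency_right),
    }


# Counts as reported on the slide for "health care reform".
slide_hits = {"healthcare": 79_500_000, "carereform": 269, "healthreform": 812}
print(concatenation_predictions(slide_hits, "health", "care", "reform"))
```

With these counts both models favor left bracketing, [ [ health care ] reform ], because "healthcare" is far more frequent than either alternative.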

  25. Other Web-derived Features: Using Google’s * • Each * is a one-word wildcard • Single star • “health care * reform” → left • “health * care reform” → right • More stars and/or reverse order • “care reform * * health” → right • “reform * * * health care” → left • Adjacency model
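
As a rough sketch, the wildcard queries above can be generated from four templates. The function name and the `max_stars` parameter are hypothetical. The slide does not say how many stars were tried, so three is used here only because it covers the examples shown.

```python
from typing import List, Tuple


def wildcard_queries(w1: str, w2: str, w3: str, max_stars: int = 3) -> List[Tuple[str, str]]:
    """Build (exact-phrase query, predicted bracketing) pairs with * wildcards."""
    queries: List[Tuple[str, str]] = []
    for n in range(1, max_stars + 1):
        stars = " ".join("*" * n)  # "*", "* *", "* * *", ...
        # Original order: the gap separates the unit from the remaining word.
        queries.append((f'"{w1} {w2} {stars} {w3}"', "left"))
        queries.append((f'"{w1} {stars} {w2} {w3}"', "right"))
        # Reverse order: the unit that stays together reveals the bracketing.
        queries.append((f'"{w3} {stars} {w1} {w2}"', "left"))
        queries.append((f'"{w2} {w3} {stars} {w1}"', "right"))
    return queries


for query, prediction in wildcard_queries("health", "care", "reform", max_stars=1):
    print(query, "->", prediction)
```

The page hits for each query would then be added to the tally for its predicted bracketing.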

  26. Other Web-derived Features: Reorder • Reorders for “health care reform” • “care reform health” → right • “reform health care” → left

  27. Other Web-derived Features: Internal Inflection Variability • First word • bone mineral density • bones mineral density → right • Second word • bone mineral density • bone minerals density → left

  28. Other Web-derived Features: Switch the First Two Words • Predict right, if we can reorder • adult male rat as • male adult rat

  29. Paraphrases (1) • The semantics of a noun compound is often made overt by a paraphrase (Warren, 1978) • Prepositional • stem cells in the brain → right • stem cells from the brain → right • cells from the brain stem → left • Verbal • virus causing human immunodeficiency → left • pain associated with arthritis migraine → left • Copula • office building that is a skyscraper → right

  30. Paraphrases (2) • Lauer (1995), Keller & Lapata (2003), Girju & al. (2005) try to choose the best prepositional paraphrase as a proxy for the semantic interpretation of noun compounds (NCs) • They use: of, for, in, at, on, from, with, about, (like) • This can be problematic when more than one preposition is possible. • In contrast: • we try to predict syntax, not semantics • we do not need to disambiguate, just add up all counts • cells in (the) bone marrow → left (61,700) • cells from (the) bone marrow → left (16,500) • marrow cells from (the) bone → right (12)

  31. Paraphrases (3) • prepositional paraphrases: • We use: ~150 prepositions • verbal paraphrases: • We use: associated with, caused by, contained in, derived from, focusing on, found in, involved in, located at/in, made of, performed by, preventing, related to and used by/in/for. • copula paraphrases: • We use: that/which/who and is/was • optional elements: • articles: a, an, the • quantifiers: some, every, etc. • pronouns: this, these, etc.
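
The "just add up all counts" idea from the Paraphrases (2) slide can be sketched as follows, using the three bone marrow cells counts quoted there. The function name and data layout are assumptions. A fuller version would generate one pattern per preposition, verb, and optional element listed above.

```python
from typing import Dict, Iterable, Tuple

# (paraphrase pattern, predicted bracketing, page hits) as quoted on the slide.
PatternHit = Tuple[str, str, int]


def bracket_by_paraphrases(pattern_hits: Iterable[PatternHit]) -> Tuple[str, Dict[str, int]]:
    """Sum hits per predicted bracketing instead of picking one best paraphrase."""
    totals = {"left": 0, "right": 0}
    for _pattern, prediction, hits in pattern_hits:
        totals[prediction] += hits

    if totals["left"] > totals["right"]:
        decision = "left"
    elif totals["right"] > totals["left"]:
        decision = "right"
    else:
        decision = "undecided"  # assumption: tie handling is not given on the slides
    return decision, totals


bone_marrow_cells = [
    ("cells in (the) bone marrow", "left", 61_700),
    ("cells from (the) bone marrow", "left", 16_500),
    ("marrow cells from (the) bone", "right", 12),
]
decision, totals = bracket_by_paraphrases(bone_marrow_cells)
print(decision, totals)
```

Because only the syntactic attachment matters, hits for different prepositions that point to the same bracketing are simply pooled.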

  32. Evaluation: Datasets • Lauer Set • 244 noun compounds (NCs) • from Grolier’s encyclopedia • inter-annotator agreement: 81.5% • Biomedical Set • 430 NCs • from MEDLINE • inter-annotator agreement: 88% (κ = .606)

  33. Evaluation: Experiments • Exact phrase queries • Limited to English • Inflections: • Lauer Set: Carroll’s morphological tools • Biomedical Set: UMLS Specialist Lexicon

  34. Results: Lauer (1) [chart; legend: wrong, N/A, correct]

  35. Results: Lauer (2) [chart; legend: wrong, N/A, correct]

  36. Results: Lauer (3)

  37. Results: Bio (1) [chart; legend: wrong, N/A, correct]

  38. Results: Bio (2) [chart; legend: wrong, N/A, correct]

  39. Individual Surface Features Performance: Bio

  40. Conclusion • Obtained new state-of-the-art results on NC bracketing • more robust than Lauer (1995) • more accurate than Keller & Lapata (2004) • Introduced search engine statistics that go beyond the n-gram • surface features • paraphrases • Works well for other structural ambiguity problems: • Prepositional phrase attachment • Noun phrase coordination

  41. Future Work • Recognize ambiguous cases • 3-way classification • Bracket more than 3 nouns • Not just bracketing but dependencies: • e.g., growth factor alpha • Bracket NPs in general (other POS) • augment Penn Treebank with NP-internal dependencies

  42. The End Thank you!

  43. Web Counts: Problems • Page hits are inaccurate • This may be OK (Keller & Lapata, 2003) • The Web lacks linguistic annotation • Pr(health|care) = #(“health care”) / #(care) • health: noun • care: both verb and noun • can be adjacent by chance • can come from different sentences • Cannot find: • stem cells VERB PREPOSITION brain • protein synthesis’ inhibition

  44. Inter-annotator Agreement: Lauer Set • Lauer: 6 judges • Average: 81.50% • Best pair of annotators: 84.40% • Worst pair of annotators: 73.00% • Total of 308 examples. • 244 used • the rest: indeterminate or extraction errors • Problem: • Gold standard: Lauer, in context. • Human judges: no context!

  45. Search Engines: 5/7/2005, any language, inflections

  46. MSN over time: any language, inflections

  47. Google over time: any language, inflections
