
Using the Web for Natural Language Processing Problems


Presentation Transcript


  1. Using the Web for Natural Language Processing Problems • Marti Hearst, School of Information, UC Berkeley • Salesforce.com, July 19, 2006 • This research supported in part by NSF DBI-0317510

  2. Natural Language Processing • The ultimate goal: write programs that read and understand stories and conversations. • This is too hard! Instead we tackle sub-problems. • There have been notable successes lately: • Machine translation is vastly improved • Decent speech recognition in limited circumstances • Text categorization works with some accuracy

  3. Automatic Help Desk Translation at MS

  4. Why is text analysis difficult? • One reason: enormous vocabulary size. • The average English speaker’s vocabulary is around 50,000 words, • Many of these can be combined with many others, • And they mean different things when they do!

  5. How can a machine understand these? • Decorate the cake with the frosting. • Decorate the cake with the kids. • Throw out the cake with the frosting. • Get the sock from the cat with the gloves. • Get the glove from the cat with the socks. • It’s in the plastic water bottle. • It’s in the plastic bag dispenser.

  6. How to tackle this problem? • The field was stuck for quite some time. • CYC: hand-enter all semantic concepts and relations • A new approach started around 1990 • How to do it: • Get large text collections • Compute statistics over the words in those collections • Many different algorithms for doing this.

  7. Size Matters • Recent realization: bigger is better than smarter! • Banko and Brill ’01: “Scaling to Very, Very Large Corpora for Natural Language Disambiguation”, ACL

  8. Example Problem • Grammar checker example: which word to use, principal or principle? • Solution: look at which words surround each use: • I am in my third year as the principal of Anamosa High School. • School-principal transfers caused some upset. • This is a simple formulation of the quantum mechanical uncertainty principle. • Power without principle is barren, but principle without power is futile. (Tony Blair)

  9. Using Very, Very Large Corpora • Keep track of which words are the neighbors of each spelling in well-edited text, e.g.: • Principal: “high school” • Principle: “rule” • At grammar-check time, choose the spelling best predicted by the surrounding words. • Surprising results: • Log-linear improvement even to a billion words! • Getting more data is better than fine-tuning algorithms!
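To make slide 9 concrete, here is a minimal Python sketch of the neighbor-count idea; the NEIGHBORS table and the bag-of-words scoring are invented stand-ins for statistics that would actually be collected from a very large, well-edited corpus.

```python
# Minimal sketch of the "neighboring words" idea for the principal/principle
# confusion pair. The toy counts below are invented; in the real setting they
# would come from a very large collection of well-edited text.
from collections import Counter

NEIGHBORS = {
    "principal": Counter({"high": 40, "school": 55, "vice": 12}),
    "principle": Counter({"uncertainty": 18, "moral": 9, "rule": 22}),
}

def choose_spelling(context_words):
    """Pick the spelling whose recorded neighbors best match the context."""
    def score(spelling):
        return sum(NEIGHBORS[spelling][w] for w in context_words)
    return max(NEIGHBORS, key=score)

print(choose_spelling(["third", "year", "high", "school"]))  # -> principal
print(choose_spelling(["quantum", "uncertainty"]))           # -> principle
```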

  10. The Effects of LARGE Datasets • From Banko & Brill ‘01

  11. How to Extend this Idea? • This is an exciting result … • BUT relies on having huge amounts of text that has been appropriately annotated!

  12. How to Avoid Labeling? • “Web as a baseline” (Lapata & Keller 2004, 2005) • Main idea: apply web-determined counts to every problem imaginable. • Example: for each t in {principal, principle} • Compute f(w1, t, w2) • The largest count wins (see the sketch below)
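A hedged sketch of the "largest count wins" recipe: page_hits() is a hypothetical stand-in for an exact-phrase search-engine count lookup, and the canned numbers are invented for the demo.

```python
# "Web as a baseline": for each candidate t, count the pattern (w1, t, w2)
# on the web and keep the most frequent candidate.
def page_hits(phrase):
    # Hypothetical stand-in for an exact-phrase web query; canned demo counts.
    fake_counts = {
        "school principal transfers": 1200,
        "school principle transfers": 3,
    }
    return fake_counts.get(phrase, 0)

def pick_candidate(w1, candidates, w2):
    """Return the t in candidates with the largest count for "w1 t w2"."""
    return max(candidates, key=lambda t: page_hits(f"{w1} {t} {w2}"))

print(pick_candidate("school", ["principal", "principle"], "transfers"))  # -> principal
```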

  13. Web as a Baseline • Works very well in some cases: • machine translation candidate selection • article generation • noun compound interpretation • noun compound bracketing • adjective ordering • But lacking in others: • spelling correction • countability detection • prepositional phrase attachment • (The slide’s legend marked each task as either “significantly better than the best supervised algorithm” or “not significantly different from the best supervised.”) • How to push this idea further?

  14. Using Unambiguous Cases • The trick: look for unambiguous cases to start • Use these to improve the results beyond what co-occurrence statistics indicate. • An early example: • Hindle and Rooth, “Structural Ambiguity and Lexical Relations”, ACL ’90, Comp. Ling. ’93 • Problem: Prepositional Phrase attachment • I eat/v spaghetti/n1 with/p a fork/n2. • I eat/v spaghetti/n1 with/p sauce/n2. • quadruple: (v, n1, p, n2) • Question: does n2 attach to v or to n1?

  15. Using Unambiguous Cases • How to do this with unlabeled data? • First try: • Parse some text into phrase structure • Then compute certain co-occurrences: f(v, n1, p), f(n1, p), f(v, n1) • Problem: results not accurate enough • The trick: look for unambiguous cases: • Spaghetti with sauce is delicious. (pre-verbal, so the PP must attach to the noun) • I eat it with a fork. (a PP can’t attach to a pronoun object, so it must attach to the verb) • Use these to improve the results beyond what co-occurrence statistics indicate (see the sketch below).
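The sketch below shows, under simplifying assumptions, how counts harvested from such (nearly) unambiguous configurations could drive the attachment decision; the toy counts and the 0.5 smoothing constant are illustrative choices, not Hindle and Rooth's exact procedure.

```python
# PP-attachment sketch: compare how strongly the verb and the object noun
# each "license" the preposition, using counts gathered only from (nearly)
# unambiguous configurations (pre-verbal NPs, pronoun objects, etc.).
from collections import defaultdict

f_v, f_n = defaultdict(int), defaultdict(int)      # f(v), f(n1)
f_vp, f_np = defaultdict(int), defaultdict(int)    # f(v, p), f(n1, p)

# Invented counts standing in for what would be harvested from parsed text.
f_v["eat"], f_vp[("eat", "with")] = 200, 50              # "... eat it with a fork"
f_n["spaghetti"], f_np[("spaghetti", "with")] = 80, 10   # "spaghetti with sauce is ..."

def attach(v, n1, p, smooth=0.5):
    """Attach the PP to the verb or to n1, whichever predicts p more strongly."""
    p_verb = (f_vp[(v, p)] + smooth) / (f_v[v] + smooth)
    p_noun = (f_np[(n1, p)] + smooth) / (f_n[n1] + smooth)
    return "verb" if p_verb >= p_noun else "noun"

print(attach("eat", "spaghetti", "with"))  # -> "verb" with these toy counts
```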

  16. Unambiguous + Unlimited = Unsupervised • Apply the Unambiguous Case idea to the Very, Very Large Corpora idea • The potential of these approaches is not fully realized • Our work: • Structural Ambiguity Decisions (work with Preslav Nakov) • PP-attachment • Noun compound bracketing • Coordination grouping • Semantic Relation Acquisition • Hypernym (ISA) relations • Verbal relations between nouns

  17. Structural Ambiguity Problems • Apply the U + U = U idea to structural ambiguity • Noun compound bracketing • Prepositional Phrase attachment • Noun Phrase coordination • Motivation: BioText project • In eukaryotes, the key to transcriptional regulation of the Heat Shock Response is the Heat Shock Transcription Factor (HSF). • Open-labeled long-term study of the subcutaneous sumatriptan efficacy and tolerability in acute migraine treatment. • BimL protein interact with Bcl-2 or Bcl-XL, or Bcl-w proteins (Immuno-precipitation (anti-Bcl-2 OR Bcl-XL or Bcl-w)) followed by Western blot (anti-EEtag) using extracts human 293T cells co-transfected with EE-tagged BimL and (bcl-2 or bcl-XL or bcl-w) plasmids)

  18. Applying U + U = U to Structural Ambiguity • We introduce the use of (nearly) unambiguous features: • surface features • paraphrases • Combined with very, very large corpora • Achieve state-of-the-art results without labeled examples.

  19. Noun Compound Bracketing (a) [ [ liver cell ] antibody ] (left bracketing) (b) [ liver [ cell line ] ] (right bracketing) In (a), the antibody targets the liver cell. In (b), the cell line is derived from the liver.

  20. Dependency Model • right bracketing: [ w1 [ w2 w3 ] ] • w2 w3 is a compound (modified by w1): home health care • or w1 and w2 independently modify w3: adult male rat • left bracketing: [ [ w1 w2 ] w3 ] • only one modificational choice possible: law enforcement officer

  21. Related Work • Marcus (1980), Pustejovsky et al. (1993), Resnik (1993) • adjacency model: Pr(w1|w2) vs. Pr(w2|w3) • Lauer (1995) • dependency model: Pr(w1|w2) vs. Pr(w1|w3) • Keller & Lapata (2004): • use the Web • unigrams and bigrams • Girju et al. (2005) • supervised model • bracketing in context • requires WordNet senses to be given • Our approach: • Web as data • χ², n-grams • paraphrases • surface features

  22. Computing Bigram Statistics • Dependency Model, Frequencies • Compare #(w1,w2) to #(w1,w3) • Dependency Model, Probabilities • Pr(left) = Pr(w1→w2|w2) · Pr(w2→w3|w3) • Pr(right) = Pr(w1→w3|w3) · Pr(w2→w3|w3) • So we compare Pr(w1→w2|w2) to Pr(w1→w3|w3)

  23. Probabilities: Estimation • Using page hits as a proxy for n-gram counts • Pr(w1→w2|w2) = #(w1,w2) / #(w2) • #(w2): word frequency; query for “w2” • #(w1,w2): bigram frequency; query for “w1 w2” • smoothed by 0.5 (see the sketch below)
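Putting slides 22 and 23 together, here is a minimal sketch of the dependency-model comparison driven by page hits, with the 0.5 smoothing mentioned above; page_hits() is a hypothetical stand-in and its canned counts are invented.

```python
# Noun compound bracketing with the dependency model, using page hits as a
# proxy for n-gram counts (smoothed by 0.5, as on the slide).
def page_hits(phrase):
    # Hypothetical stand-in for an exact-phrase web query; invented counts.
    fake_counts = {
        "cell": 40_000_000, "antibody": 3_000_000,
        "liver cell": 300_000, "liver antibody": 2_000,
    }
    return fake_counts.get(phrase, 0)

def cond_prob(w_left, w_right, smooth=0.5):
    """Estimate Pr(w_left -> w_right | w_right) = #(w_left, w_right) / #(w_right)."""
    return (page_hits(f"{w_left} {w_right}") + smooth) / (page_hits(w_right) + smooth)

def bracket(w1, w2, w3):
    left_score = cond_prob(w1, w2)    # Pr(w1 -> w2 | w2)
    right_score = cond_prob(w1, w3)   # Pr(w1 -> w3 | w3)
    return "left" if left_score >= right_score else "right"

print(bracket("liver", "cell", "antibody"))  # -> "left" with these toy counts
```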

  24. Association Models: 2 (Chi Squared) • A = #(wi,wj) • B = #(wi) – #(wi,wj) • C = #(wj) – #(wi,wj) • D = N – (A+B+C) • N = 8 trillion (= A+B+C+D) 8 billion Web pages x 1,000 words
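A small sketch of the χ² score built from the A, B, C, D cells above; the standard 2×2 formula is assumed, N follows the slide's 8-trillion estimate, and the example counts fed in are invented (in practice they would be web hit counts).

```python
# Chi-squared association between two words, from web-scale counts.
def chi_squared(count_wi, count_wj, count_wi_wj, n=8e12):
    a = count_wi_wj                 # A = #(wi, wj)
    b = count_wi - count_wi_wj      # B = #(wi) - #(wi, wj)
    c = count_wj - count_wi_wj      # C = #(wj) - #(wi, wj)
    d = n - (a + b + c)             # D = N - (A + B + C)
    # Standard 2x2 chi-squared formula.
    return n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))

print(chi_squared(count_wi=5_000_000, count_wj=40_000_000, count_wi_wj=300_000))
```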

  25. Web-derived Surface Features • Authors often disambiguate noun compounds using surface markers, e.g.: • amino-acid sequence → left • brain stem’s cell → left • brain’s stem cell → right • The enormous size of the Web makes these frequent enough to be useful.

  26. Web-derived Surface Features: Dash (hyphen) • Left dash • cell-cycle analysis → left • Right dash • donor T-cell → right • fiber optics-system → suggests right, but should be left… • Double dash • T-cell-depletion → unusable…

  27. Web-derived Surface Features: Possessive Marker • Attached to the first word • brain’s stem cell → right • Attached to the second word • brain stem’s cell → left • Combined features • brain’s stem-cell → right

  28. Web-derived Surface Features: Capitalization • don’t-care – lowercase – uppercase • Plasmodium vivax Malaria → left • plasmodium vivax Malaria → left • lowercase – uppercase – don’t-care • brain Stem cell → right • brain Stem Cell → right • Disable this on: • Roman numerals • single-letter words: e.g. vitamin D deficiency

  29. Web-derived Surface Features: Embedded Slash • Left embedded slash • leukemia/lymphoma cell → right

  30. Web-derived Surface Features: Parentheses • Single-word • growth factor (beta) → left • (brain) stem cell → right • Two-word • (growth factor) beta → left • brain (stem cell) → right

  31. Web-derived Surface Features: Comma, Dot, Semi-colon • Following the first word • home. health care → right • adult, male rat → right • Following the second word • health care, provider → left • lung cancer: patients → left

  32. Web-derived Surface Features: Dash to External Word • External word to the left • mouse-brain stem cell → right • External word to the right • tumor necrosis factor-alpha → left

  33. Web-derived Surface Features: Problems & Solutions • Problem: search engines ignore punctuation in queries • “brain-stem cell” does not work • Solution: • query for “brain stem cell” • obtain 1,000 document summaries • scan for the features in these summaries (see the sketch below)
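A hedged sketch of this snippet-scanning workaround: fetch_snippets() is a hypothetical stand-in for pulling up to 1,000 result summaries from a search engine, and only the dash and possessive features are checked, with deliberately simplified regexes.

```python
# Scan search-result summaries for surface features that the query itself
# cannot express (the search engine ignores punctuation in queries).
import re

def fetch_snippets(query, limit=1000):
    # Hypothetical stand-in for retrieving up to `limit` result summaries.
    return ["Analysis of the brain-stem cell population ...",
            "... markers of the brain's stem cell niche ..."]

def surface_votes(w1, w2, w3):
    votes = {"left": 0, "right": 0}
    for s in fetch_snippets(f"{w1} {w2} {w3}"):
        s = s.lower()
        if re.search(rf"\b{w1}-{w2}\s+{w3}\b", s):       # left dash -> left
            votes["left"] += 1
        if re.search(rf"\b{w1}\s+{w2}-{w3}\b", s):       # right dash -> right
            votes["right"] += 1
        if re.search(rf"\b{w1}'s\s+{w2}\s+{w3}\b", s):   # possessive on w1 -> right
            votes["right"] += 1
    return votes

print(surface_votes("brain", "stem", "cell"))  # -> {'left': 1, 'right': 1}
```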

  34. Other Web-derived Features: Abbreviation • After the second word • tumor necrosis (TN) factor → left • After the third word • tumor necrosis factor (NF) → right • We query for, e.g., “tumor necrosis tn factor” • Problems: • Roman numerals: IV, VI • US state abbreviations: CA • Short words: me

  35. Other Web-derived Features: Concatenation • Consider health care reform • healthcare: 79,500,000 • carereform: 269 • healthreform: 812 • Adjacency model • healthcare vs. carereform • Dependency model • healthcare vs. healthreform • Triples • “healthcare reform” vs. “health carereform” • (see the sketch below)
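A small sketch of the concatenation comparisons, reusing the counts quoted above; page_hits() is again a hypothetical stand-in for a web count lookup.

```python
# Concatenation features for the compound w1 w2 w3 ("health care reform").
def page_hits(phrase):
    # Hypothetical count lookup, seeded with the figures from the slide.
    return {"healthcare": 79_500_000, "carereform": 269, "healthreform": 812}.get(phrase, 0)

def concat_adjacency(w1, w2, w3):
    """Adjacency model: compare the concatenations w1w2 vs. w2w3."""
    return "left" if page_hits(w1 + w2) >= page_hits(w2 + w3) else "right"

def concat_dependency(w1, w2, w3):
    """Dependency model: compare the concatenations w1w2 vs. w1w3."""
    return "left" if page_hits(w1 + w2) >= page_hits(w1 + w3) else "right"

print(concat_adjacency("health", "care", "reform"))   # -> left
print(concat_dependency("health", "care", "reform"))  # -> left
```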

  36. Other Web-derived Features: Reorder • Reorderings of “health care reform” • “care reform health” → right • “reform health care” → left

  37. Other Web-derived Features: Internal Inflection Variability • Vary the inflection of the second word • tyrosine kinase activation • tyrosine kinases activation

  38. Other Web-derived Features: Switch the First Two Words • Predict right if we can reorder “adult male rat” as “male adult rat” (query-generation sketch below)
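The reorder, inflection, and switch probes from the last three slides can be generated mechanically; in the sketch below each phrase would be issued as an exact-phrase web query, and the naive "+s" pluralization is a simplification of the morphological tools mentioned on slide 42.

```python
# Build the probe phrases for the reorder / inflection / switch features.
def probes(w1, w2, w3):
    return {
        "reorder -> right":      f"{w2} {w3} {w1}",   # e.g. "care reform health"
        "reorder -> left":       f"{w3} {w1} {w2}",   # e.g. "reform health care"
        "inflect w2":            f"{w1} {w2}s {w3}",  # e.g. "tyrosine kinases activation"
        "switch w1/w2 -> right": f"{w2} {w1} {w3}",   # e.g. "male adult rat"
    }

for name, phrase in probes("health", "care", "reform").items():
    print(f"{name:24s} {phrase!r}")
```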

  39. Paraphrases • The semantics of a noun compound is often made overt by a paraphrase (Warren, 1978) • Prepositional • stem cells in the brain → right • cells from the brain stem → left • Verbal • virus causing human immunodeficiency → left • Copula • office building that is a skyscraper → right

  40. Paraphrases • prepositional paraphrases: • We use: ~150 prepositions • verbal paraphrases: • We use: associated with, caused by, contained in, derived from, focusing on, found in, involved in, located at/in, made of, performed by, preventing, related to and used by/in/for. • copula paraphrases: • We use: is/was and that/which/who • optional elements: • articles: a, an, the • quantifiers: some, every, etc. • pronouns: this, these, etc.
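A sketch of how such paraphrase probes could be generated for a compound w1 w2 w3; the preposition and verb lists below are a tiny, illustrative subset of the ones on this slide, and the left/right assignments follow the pattern of the previous slide.

```python
# Generate prepositional and verbal paraphrase probes for w1 w2 w3, each
# paired with the bracketing it would support if found on the web.
PREPS = ["of", "for", "in", "from"]               # tiny subset of ~150 prepositions
VERBS = ["derived from", "found in", "caused by"]

def paraphrase_probes(w1, w2, w3):
    probes = []
    for p in PREPS:
        probes.append((f"{w2} {w3} {p} {w1}", "right"))  # "stem cells in (the) brain"
        probes.append((f"{w3} {p} {w1} {w2}", "left"))   # "cells from (the) brain stem"
    for v in VERBS:
        probes.append((f"{w2} {w3} {v} {w1}", "right"))
        probes.append((f"{w3} {v} {w1} {w2}", "left"))
    return probes

for phrase, direction in paraphrase_probes("brain", "stem", "cells")[:4]:
    print(direction, "<-", repr(phrase))
```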

  41. Evaluation: Datasets • Lauer Set • 244 noun compounds (NCs) • from Grolier’s encyclopedia • inter-annotator agreement: 81.5% • Biomedical Set • 430 NCs • from MEDLINE • inter-annotator agreement: 88% (κ = 0.606)

  42. Evaluation: Experiments • Exact phrase queries • Limited to English • Inflections: • Lauer Set: Carroll’s morphological tools • Biomedical Set: UMLS Specialist Lexicon

  43. Co-occurrence Statistics • Lauer set • Bio set

  44. Paraphrase and Surface Features Performance • Lauer Set • Biomedical Set

  45. Individual Surface Features Performance: Bio

  46. Individual Surface Features Performance: Bio

  47. Results: Lauer Set

  48. Results: Comparing with Others

  49. Results: Biomedical Set

  50. Results for Noun Compound Bracketing • Introduced search-engine statistics that go beyond the n-gram (applicable to other tasks) • surface features • paraphrases • Obtained new state-of-the-art results on NC bracketing • more robust than Lauer (1995) • more accurate than Keller & Lapata (2004)
