Using the Web for Natural Language Processing Problems
Marti Hearst, School of Information, UC Berkeley
Salesforce.com, July 19, 2006
This research supported in part by NSF DBI-0317510
Natural Language Processing • The ultimate goal: write programs that read and understand stories and conversations. • This is too hard! Instead we tackle sub-problems. • There have been notable successes lately: • Machine translation is vastly improved • Decent speech recognition in limited circumstances • Text categorization works with some accuracy
Why is text analysis difficult? • One reason: enormous vocabulary size. • The average English speaker’s vocabulary is around 50,000 words, • Many of these can be combined with many others, • And they mean different things when they do!
How can a machine understand these? • Decorate the cake with the frosting. • Decorate the cake with the kids. • Throw out the cake with the frosting. • Get the sock from the cat with the gloves. • Get the glove from the cat with the socks. • It’s in the plastic water bottle. • It’s in the plastic bag dispenser.
How to tackle this problem? • The field was stuck for quite some time. • CYC: hand-enter all semantic concepts and relations • A new approach started around 1990 • How to do it: • Get large text collections • Compute statistics over the words in those collections • Many different algorithms for doing this.
Size Matters • Recent realization: bigger is better than smarter! • Banko and Brill ’01: “Scaling to Very, Very Large Corpora for Natural Language Disambiguation”, ACL
Example Problem • Grammar checker example: which word to use, principal or principle? • Solution: look at which words surround each use: • I am in my third year as the principal of Anamosa High School. • School-principal transfers caused some upset. • This is a simple formulation of the quantum mechanical uncertainty principle. • Power without principle is barren, but principle without power is futile. (Tony Blair)
Using Very, Very Large Corpora • Keep track of which words are the neighbors of each spelling in well-edited text, e.g.: • Principal: “high school” • Principle: “rule” • At grammar-check time, choose the spelling best predicted by the surrounding words. • Surprising results: • Log-linear improvement even to a billion words! • Getting more data is better than fine-tuning algorithms!
The Effects of LARGE Datasets • From Banko & Brill '01 [learning-curve figure omitted: accuracy keeps improving log-linearly as training data grows toward a billion words]
How to Extend this Idea? • This is an exciting result … • BUT relies on having huge amounts of text that has been appropriately annotated!
How to Avoid Labeling? • "Web as a baseline" (Lapata & Keller 04, 05) • Main idea: apply web-determined counts to every problem imaginable. • Example: for t in {principal, principle} • Compute f(w1, t, w2), the Web frequency of each candidate in its context • The largest count wins
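To make the recipe concrete, here is a minimal sketch in Python; get_hits is a hypothetical stand-in for whatever exact-phrase hit-count API the search engine exposes, and the function names are illustrative:

    def get_hits(phrase):
        """Hypothetical stand-in for a search engine API that returns
        the page-hit count for an exact (quoted) phrase query."""
        raise NotImplementedError

    def choose_spelling(w1, w2, candidates):
        # Pick the candidate t that maximizes the Web count f(w1, t, w2),
        # i.e. the frequency of the full trigram in its context.
        return max(candidates, key=lambda t: get_hits(f'"{w1} {t} {w2}"'))

    # Usage: choose_spelling("the", "of", ["principal", "principle"])
    # picks whichever spelling is better attested in the context "the _ of".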
Web as a Baseline • Works very well in some cases (significantly better than, or not significantly different from, the best supervised algorithm): • machine translation candidate selection • article generation • noun compound interpretation • noun compound bracketing • adjective ordering • But lacking in others: • spelling correction • countability detection • prepositional phrase attachment • How to push this idea further?
Using Unambiguous Cases • The trick: look for unambiguous cases to start • Use these to improve the results beyond what co-occurrence statistics indicate. • An Early Example: • Hindle and Rooth, "Structural Ambiguity and Lexical Relations", ACL '90, Computational Linguistics '93 • Problem: Prepositional Phrase attachment • I eat/v spaghetti/n1 with/p a fork/n2. • I eat/v spaghetti/n1 with/p sauce/n2. • quadruple: (v, n1, p, n2) • Question: does n2 attach to v or to n1?
Using Unambiguous Cases • How to do this with unlabeled data? • First try: • Parse some text into phrase structure • Then compute certain co-occurrences: f(v, n1, p), f(n1, p), f(v, n1) • Problem: results not accurate enough • The trick: look for unambiguous cases: • Spaghetti with sauce is delicious. (pre-verbal: no verb available, so the PP must attach to the noun) • I eat it with a fork. (a PP can't attach to a pronoun, so it must attach to the verb) • Use these to improve the results beyond what co-occurrence statistics indicate.
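A sketch of how such counts can be turned into an attachment decision; this is a simplified log-ratio variant of Hindle and Rooth's lexical association score (their paper uses a t-score), with illustrative names and add-0.5 smoothing assumed:

    from math import log

    def pp_attachment_score(f_v_p, f_v, f_n1_p, f_n1):
        """How strongly does the preposition associate with the verb
        versus with the object noun?  Counts are assumed to come from
        unambiguous cases in parsed text; smoothing avoids zeros."""
        p_given_v = (f_v_p + 0.5) / (f_v + 1.0)
        p_given_n1 = (f_n1_p + 0.5) / (f_n1 + 1.0)
        return log(p_given_v / p_given_n1, 2)

    # score > 0 -> attach the PP to the verb; score < 0 -> to the noun.
    # E.g., comparing f("eat", "with") against f("spaghetti", "with")
    # should attach "with a fork" to "eat" and "with sauce" to "spaghetti".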
Unambiguous + Unlimited = Unsupervised • Apply the Unambiguous Case idea to the Very, Very Large Corpora idea • The potential of these approaches is not yet fully realized • Our work: • Structural Ambiguity Decisions (work with Preslav Nakov) • PP-attachment • Noun compound bracketing • Coordination grouping • Semantic Relation Acquisition • Hypernym (ISA) relations • Verbal relations between nouns
Structural Ambiguity Problems • Apply the U + U = U idea to structural ambiguity • Noun compound bracketing • Prepositional Phrase attachment • Noun Phrase coordination • Motivation: BioText project • In eukaryotes, the key to transcriptional regulation of the Heat Shock Response is the Heat Shock Transcription Factor (HSF). • Open-labeled long-term study of the subcutaneous sumatriptan efficacy and tolerability in acute migraine treatment. • BimL protein interact with Bcl-2 or Bcl-XL, or Bcl-w proteins (Immuno-precipitation (anti-Bcl-2 OR Bcl-XL or Bcl-w)) followed by Western blot (anti-EEtag) using extracts human 293T cells co-transfected with EE-tagged BimL and (bcl-2 or bcl-XL or bcl-w) plasmids)
Applying U + U = U to Structural Ambiguity • We introduce the use of (nearly) unambiguous features: • surface features • paraphrases • Combined with very, very large corpora • Achieve state-of-the-art results without labeled examples.
Noun Compound Bracketing (a) [ [ liver cell ] antibody ] (left bracketing) (b) [ liver [ cell line ] ] (right bracketing) In (a), the antibody targets the liver cell. In (b), the cell line is derived from the liver.
Dependency Model • right bracketing: [ w1 [ w2 w3 ] ] • w2 w3 is a compound (modified by w1) • home health care • or w1 and w2 independently modify w3 • adult male rat • left bracketing: [ [ w1 w2 ] w3 ] • only one modificational choice possible • law enforcement officer
Related Work • Marcus (1980), Pustejovsky et al. (1993), Resnik (1993) • adjacency model: Pr(w1|w2) vs. Pr(w2|w3) • Lauer (1995) • dependency model: Pr(w1|w2) vs. Pr(w1|w3) • Keller & Lapata (2004): • use the Web • unigrams and bigrams • Girju et al. (2005) • supervised model • bracketing in context • requires WordNet senses to be given • Our approach: • Web as data • χ², n-grams • paraphrases • surface features
Computing Bigram Statistics • Dependency model, frequencies: • Compare #(w1,w2) to #(w1,w3) • Dependency model, probabilities: • Pr(left) = Pr(w1→w2|w2) · Pr(w2→w3|w3) • Pr(right) = Pr(w1→w3|w3) · Pr(w2→w3|w3) • Pr(w2→w3|w3) appears in both, so we compare Pr(w1→w2|w2) to Pr(w1→w3|w3)
Probabilities: Estimation • Using page hits as a proxy for n-gram counts • Pr(w1→w2|w2) = #(w1,w2) / #(w2) • #(w2): word frequency; query for "w2" • #(w1,w2): bigram frequency; query for "w1 w2" • counts smoothed by adding 0.5 to avoid zeros
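Putting the estimation together, a minimal sketch of the dependency-model comparison; hits() is again a hypothetical exact-phrase page-hit count:

    def hits(phrase):
        """Hypothetical: page-hit count for an exact (quoted) phrase query."""
        raise NotImplementedError

    def bracket_by_probability(w1, w2, w3):
        # Dependency model: compare Pr(w1->w2 | w2) with Pr(w1->w3 | w3),
        # each estimated from page hits and smoothed by 0.5.
        pr_left = (hits(f'"{w1} {w2}"') + 0.5) / (hits(f'"{w2}"') + 0.5)
        pr_right = (hits(f'"{w1} {w3}"') + 0.5) / (hits(f'"{w3}"') + 0.5)
        return "left" if pr_left >= pr_right else "right"

    # bracket_by_probability("liver", "cell", "antibody") -> "left"
    # if "liver cell" is a better-attested bigram than "liver antibody".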
Association Models: χ² (Chi Squared) • A = #(wi,wj) • B = #(wi) – #(wi,wj) • C = #(wj) – #(wi,wj) • D = N – (A+B+C) • N = A+B+C+D ≈ 8 trillion (8 billion Web pages × ~1,000 words per page)
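Plugging these cells into the standard χ² formula for a 2×2 contingency table gives an association score for a word pair; a minimal sketch (function and argument names are illustrative):

    def chi_squared(f_ij, f_i, f_j, n=8e12):
        # 2x2 contingency table for words wi, wj:
        a = f_ij               # both words together
        b = f_i - f_ij         # wi without wj
        c = f_j - f_ij         # wj without wi
        d = n - (a + b + c)    # neither word
        return n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))

A higher χ² for (w1, w2) than for (w1, w3) supports left bracketing, mirroring the probability comparison above.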
Web-derived Surface Features • Authors often disambiguate noun compounds using surface markers, e.g.: • amino-acid sequence left • brain stem’s cell left • brain’s stem cell right • The enormous size of the Web makes these frequent enough to be useful.
Web-derived Surface Features: Dash (hyphen) • Left dash • cell-cycle analysis left • Right dash • donor T-cell right • but misleading on: fiber optics-system, which should be left… • Double dash • T-cell-depletion unusable…
Web-derived Surface Features: Possessive Marker • Attached to the first word • brain's stem cell right • Attached to the second word • brain stem's cell left • Combined features • brain's stem-cell right
Web-derived Surface Features: Capitalization • don't-care – lowercase – uppercase • Plasmodium vivax Malaria left • plasmodium vivax Malaria left • lowercase – uppercase – don't-care • brain Stem cell right • brain Stem Cell right • Disable this feature for: • Roman numerals • Single-letter words: e.g. vitamin D deficiency
Web-derived Surface Features: Embedded Slash • Left embedded slash • leukemia/lymphoma cell right
Web-derived Surface Features: Parentheses • Single-word • growth factor (beta) left • (brain) stem cell right • Two-word • (growth factor) beta left • brain (stem cell) right
Web-derived Surface Features: Comma, dot, semi-colon • Following the first word • home. health care right • adult, male rat right • Following the second word • health care, provider left • lung cancer: patients left
Web-derived Surface Features: Dash to External Word • External word to the left • mouse-brain stem cell right • External word to the right • tumor necrosis factor-alpha left
Web-derived Surface Features: Problems & Solutions • Problem: search engines ignore punctuation in queries • "brain-stem cell" does not work • Solution: • query for "brain stem cell" • obtain 1,000 document summaries • scan for the features in these summaries
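A sketch of that workaround; fetch_summaries is a hypothetical snippet-retrieval call, and the pattern shown is just the left-dash feature as an example:

    import re

    def fetch_summaries(query, n=1000):
        """Hypothetical: return up to n result-page summaries (snippets)
        for the given query from a search engine API."""
        raise NotImplementedError

    def count_surface_feature(w1, w2, w3, pattern):
        # Query without punctuation, then look for the punctuated
        # variant in the returned snippets, where punctuation survives.
        snippets = fetch_summaries(f'"{w1} {w2} {w3}"')
        return sum(1 for s in snippets if pattern.search(s))

    # Example: the left-dash feature for "brain stem cell":
    # left_dash = re.compile(r"\bbrain-stem\s+cell\b", re.I)
    # count_surface_feature("brain", "stem", "cell", left_dash)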
Other Web-derived Features: Abbreviation • After the third word • tumor necrosis factor (NF) right • After the second word • tumor necrosis (TN) factor left • We query for, e.g., "tumor necrosis tn factor" • Problems: • Roman numerals: IV, VI • US state abbreviations: CA • Short words: me
Other Web-derived Features: Concatenation • Consider health care reform • healthcare: 79,500,000 • carereform: 269 • healthreform: 812 • Adjacency model • healthcare vs. carereform • Dependency model • healthcare vs. healthreform • Triples • "healthcare reform" vs. "health carereform"
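A sketch of the dependency-model version of this test; hits is passed in, the same hypothetical exact-phrase counter as in the earlier sketches:

    def bracket_by_concatenation(w1, w2, w3, hits):
        # If the concatenation w1+w2 ("healthcare") is better attested
        # on the Web than w1+w3 ("healthreform"), then w1 modifies w2,
        # i.e. left bracketing; otherwise guess right.
        return "left" if hits(f'"{w1}{w2}"') > hits(f'"{w1}{w3}"') else "right"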
Other Web-derived Features: Reorder • Reorders for "health care reform" • "care reform health" right • "reform health care" left
Other Web-derived Features: Internal Inflection Variability • Vary inflection of second word • tyrosine kinase activation • tyrosine kinases activation
Other Web-derived Features: Switch the First Two Words • Predict right, if we can reorder • adult male rat as • male adult rat
Paraphrases • The semantics of a noun compound is often made overt by a paraphrase (Warren, 1978) • Prepositional • stem cells in the brain right • cells from the brain stem left • Verbal • virus causing human immunodeficiency left • Copula • office building that is a skyscraper right
Paraphrases • prepositional paraphrases: • We use: ~150 prepositions • verbal paraphrases: • We use: associated with, caused by, contained in, derived from, focusing on, found in, involved in, located at/in, made of, performed by, preventing, related to and used by/in/for. • copula paraphrases: • We use: is/was and that/which/who • optional elements: • articles: a, an, the • quantifiers: some, every, etc. • pronouns: this, these, etc.
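A sketch of how prepositional paraphrase queries might be generated and tallied; the preposition list is a small illustrative subset of the ~150 used, and hits is the same hypothetical hit-count function as before:

    # Small illustrative subset; the full system uses ~150 prepositions.
    PREPOSITIONS = ["of", "for", "in", "from", "at", "on", "with"]

    def paraphrase_votes(w1, w2, w3, hits):
        # Paraphrases that keep w2 w3 together support right bracketing
        # ("stem cells in the brain"); those that keep w1 w2 together
        # support left ("cells from the brain stem").
        right = sum(hits(f'"{w2} {w3} {p} the {w1}"') for p in PREPOSITIONS)
        left = sum(hits(f'"{w3} {p} the {w1} {w2}"') for p in PREPOSITIONS)
        return "left" if left > right else "right"

The full system also varies the optional elements listed above (articles, quantifiers, pronouns) when generating the query strings.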
Evaluation: Datasets • Lauer Set • 244 noun compounds (NCs) • from Grolier's encyclopedia • inter-annotator agreement: 81.5% • Biomedical Set • 430 NCs • from MEDLINE • inter-annotator agreement: 88% (κ=.606)
Evaluation:Experiments • Exact phrase queries • Limited to English • Inflections: • Lauer Set: Carroll’s morphological tools • Biomedical Set: UMLS Specialist Lexicon
Co-occurrence Statistics • Results on the Lauer set and the Bio set [charts omitted]
Paraphrase and Surface Features Performance • Results on the Lauer Set and the Biomedical Set [charts omitted]
Results for Noun Compound Bracketing • Introduced search engine statistics that go beyond the n-gram (applicable to other tasks) • surface features • paraphrases • Obtained new state-of-the-art results on NC bracketing • more robust than Lauer (1995) • more accurate than Keller & Lapata (2004)