Marti Hearst, School of Information, UC Berkeley. UCB Neyman Seminar, October 25, 2006. Unambiguous + Unlimited = Unsupervised: Using the Web for Natural Language Processing Problems. This research supported in part by NSF DBI-0317510.
Natural Language Processing • The ultimate goal: write programs that read and understand stories and conversations. • This is too hard! Instead we tackle sub-problems. • There have been notable successes lately: • Machine translation is vastly improved • Speech recognition is decent in limited circumstances • Text categorization works with some accuracy
How can a machine understand these differences? Get the cat with the gloves.
How can a machine understand these differences? Get the sock from the cat with the gloves. Get the glove from the cat with the socks.
How can a machine understand these differences? • Decorate the cake with the frosting. • Decorate the cake with the kids. • Throw out the cake with the frosting. • Throw out the cake with the kids.
Why is this difficult? • Same syntactic structure, different meanings. • Natural language processing algorithms have to deal with the specifics of individual words. • Enormous vocabulary sizes. • The average English speaker’s vocabulary is around 50,000 words, • Many of these can be combined with many others, • And they mean different things when they do!
How to tackle this problem? • The field was stuck for quite some time, trying to hand-enter all semantic concepts and relations. • A new approach started around 1990: • Get large text collections • Compute statistics over the words in those collections • There are many different algorithms.
Size Matters Recent realization: bigger is better than smarter! Banko and Brill ’01: “Scaling to Very, Very Large Corpora for Natural Language Disambiguation”, ACL
Example Problem • Grammar checker example: Which word to use? <principal> or <principle>? • Solution: use well-edited text and look at which words surround each use: • I am in my third year as the principal of Anamosa High School. • School-principal transfers caused some upset. • This is a simple formulation of the quantum mechanical uncertainty principle. • Power without principle is barren, but principle without power is futile. (Tony Blair)
Using Very, Very Large Corpora • Keep track of which words are the neighbors of each spelling in well-edited text, e.g.: • Principal: “high school” • Principle: “rule” • At grammar-check time, choose the spelling best predicted by the surrounding words. • Surprising results: • Log-linear improvement even to a billion words! • Getting more data is better than fine-tuning algorithms!
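The neighbor-based spelling choice described above can be sketched in a few lines. This is an illustrative toy, not the actual grammar-checker code: the neighbor lists and counts are hypothetical stand-ins for statistics gathered from well-edited text.

```python
from collections import Counter

# Hypothetical neighbor counts observed next to each spelling
# in well-edited text (toy values for illustration).
NEIGHBORS = {
    "principal": Counter({"school": 120, "high": 80, "vice": 40}),
    "principle": Counter({"uncertainty": 90, "rule": 60, "moral": 30}),
}

def choose_spelling(context_words, candidates=("principal", "principle")):
    """Pick the candidate whose neighbor profile best matches the context."""
    def score(cand):
        # Counter returns 0 for unseen words, so unknown context is ignored.
        return sum(NEIGHBORS[cand][w] for w in context_words)
    return max(candidates, key=score)

print(choose_spelling(["high", "school"]))          # -> principal
print(choose_spelling(["uncertainty", "quantum"]))  # -> principle
```

The Banko & Brill result suggests the payoff comes less from a cleverer `score` function than from gathering vastly more neighbor counts.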
The Effects of LARGE Datasets • From Banko & Brill ‘01
How to Extend this Idea? • This is an exciting result … • BUT relies on having huge amounts of text that has been appropriately annotated!
How to Avoid Manual Labeling? • “Web as a baseline” (Lapata & Keller 04,05) • Main idea: apply web-determined counts to every problem imaginable. • Example: for t in {<principal><principle>} • Compute f(w-1, t, w+1) • The largest count wins
Web as a Baseline • Works very well in some cases (significantly better than the best supervised algorithm): • machine translation candidate selection • article generation • noun compound interpretation • noun compound bracketing • adjective ordering • But lacking in others (not significantly different from the best supervised): • spelling correction • countability detection • prepositional phrase attachment • How to push this idea further?
Using Unambiguous Cases • The trick: look for unambiguous cases to start • Use these to improve the results beyond what co-occurrence statistics indicate. • An Early Example: • Hindle and Rooth, “Structural Ambiguity and Lexical Relations”, ACL ’90, Comp Ling ’93 • Problem: Prepositional Phrase attachment • I eat/v spaghetti/n1 with/p a fork/n2. • I eat/v spaghetti/n1 with/p sauce/n2. • Question: does n2 attach to v or to n1?
Using Unambiguous Cases • How to do this with unlabeled data? • First try: • Parse some text into phrase structure • Then compute certain co-occurrences f(v, n1, p) f(n1, p) f(v, n1) • Problem: results not accurate enough • The trick: look for unambiguous cases: • Spaghetti with sauce is delicious. (pre-verbal) • I eat with a fork. (no direct object) • Use these to improve the results beyond what co-occurrence statistics indicate.
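A rough sketch of the bootstrapping idea above: harvest attachment counts only from the unambiguous configurations, then use them to decide ambiguous cases. Hindle & Rooth actually compare lexical association scores (a smoothed log ratio); the raw-count comparison and the mini-corpus counts here are simplifications for illustration.

```python
from collections import Counter

verb_prep = Counter()   # f(v, p): prep seen with verb lacking a direct object
noun_prep = Counter()   # f(n, p): prep seen with a pre-verbal noun

# Unambiguous training cases (hypothetical mini-corpus):
# "I eat with a fork."            -> no object: "with" attaches to "eat"
# "Spaghetti with sauce is good." -> pre-verbal: "with" attaches to "spaghetti"
verb_prep[("eat", "with")] += 3
noun_prep[("spaghetti", "with")] += 5

def attach(v, n1, p):
    """Decide whether p attaches to the verb or to the first noun."""
    return "verb" if verb_prep[(v, p)] > noun_prep[(n1, p)] else "noun"

print(attach("eat", "spaghetti", "with"))  # toy counts favor noun attachment
```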
Unambiguous + Unlimited = Unsupervised • Apply the Unambiguous Case Idea to the Very, Very Large Corpora idea • The potential of these approaches is not fully realized • Our work (with Preslav Nakov): • Structural Ambiguity Decisions • PP-attachment • Noun compound bracketing • Coordination grouping • Semantic Relation Acquisition • Hypernym (ISA) relations • Verbal relations between nouns • SAT Analogy problems
Applying U + U = U to Structural Ambiguity • We introduce the use of (nearly) unambiguous features: • Surface features • Paraphrases • Combined with ngrams • All drawn from very, very large corpora • Achieve state-of-the-art results without labeled examples.
Noun Compound Bracketing (a) [ [ liver cell ] antibody ] (left bracketing) (b) [ liver [ cell line ] ] (right bracketing) In (a), the antibody targets the liver cell. In (b), the cell line is derived from the liver.
Dependency Model • right bracketing: [ w1 [ w2 w3 ] ] • w2 w3 is a compound (modified by w1) • home health care • w1 and w2 independently modify w3 • adult male rat • left bracketing: [ [ w1 w2 ] w3 ] • only 1 modificational choice possible • law enforcement officer
Our U + U + U Algorithm • Compute bigram estimates • Compute estimates from surface features • Compute estimates from paraphrases • Combine these scores with a voting algorithm to choose left or right bracketing. • We use the same general approach for two other structural ambiguity problems.
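The final combination step can be sketched as a simple majority vote over the individual estimators. Each estimator emits "left", "right", or abstains with None; the tie-break toward "right" here is an illustrative choice, not necessarily the exact rule used in the work.

```python
def vote(predictions):
    """Majority vote over 'left' / 'right' / None predictions."""
    left = sum(1 for p in predictions if p == "left")
    right = sum(1 for p in predictions if p == "right")
    # Abstentions (None) simply carry no weight.
    return "left" if left > right else "right"

print(vote(["left", "left", "right", None]))  # -> left
```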
Computing Bigram Statistics • Dependency Model, Frequencies • Compare #(w1,w2) to #(w1,w3) • Dependency Model, Probabilities • Pr(left) = Pr(w1w2|w2) Pr(w2w3|w3) • Pr(right) = Pr(w1w3|w3) Pr(w2w3|w3) • Since Pr(w2w3|w3) appears in both, we compare Pr(w1w2|w2) to Pr(w1w3|w3)
Using ngrams to estimate probabilities • Using page hits as a proxy for n-gram counts • Pr(w1w2|w2) = #(w1,w2) / #(w2) • #(w2) word frequency; query for “w2” • #(w1,w2) bigram frequency; query for “w1 w2” • smoothed by 0.5 • Use χ2 to determine if w1 is associated with w2 (thus indicating left bracketing), and same for w1 with w3
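The probability comparison above can be sketched directly from (hypothetical) page-hit counts. How the 0.5 smoothing is applied here is an assumption; the counts are toy values chosen so that the liver/cell pair dominates.

```python
def cond_prob(bigram_count, unigram_count):
    """Pr(w_i w_j | w_j) estimated from counts, smoothed by 0.5 (assumed add-0.5)."""
    return (bigram_count + 0.5) / (unigram_count + 0.5)

def bracket(c_w1w2, c_w1w3, c_w2, c_w3):
    """Compare Pr(w1 w2 | w2) with Pr(w1 w3 | w3) to choose a bracketing."""
    left = cond_prob(c_w1w2, c_w2)
    right = cond_prob(c_w1w3, c_w3)
    return "left" if left > right else "right"

# "liver cell antibody": toy counts in which (liver, cell) is the
# stronger pair, yielding left bracketing.
print(bracket(c_w1w2=900, c_w1w3=40, c_w2=10_000, c_w3=10_000))  # -> left
```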
Our U + U + U Algorithm • Compute bigram estimates • Compute estimates from surface features • Compute estimates from paraphrases • Combine these scores with a voting algorithm to choose left or right bracketing.
Web-derived Surface Features • Authors often disambiguate noun compounds using surface markers, e.g.: • amino-acid sequence left • brain stem’s cell left • brain’s stem cell right • The enormous size of the Web makes these frequent enough to be useful.
Web-derived Surface Features: Dash (hyphen) • Left dash • cell-cycle analysis left • Right dash • donor T-cell right • Double dash • T-cell-depletion unusable…
Web-derived Surface Features: Possessive Marker • Attached to the first word • brain’s stem cell right • Attached to the second word • brain stem’s cell left • Combined features • brain’s stem-cell right
Web-derived Surface Features: Capitalization • anycase – lowercase – uppercase • Plasmodium vivax Malaria left • plasmodium vivax Malaria left • lowercase – uppercase – anycase • brain Stem cell right • brain Stem Cell right • Disable this on: • Roman digits • Single-letter words: e.g. vitamin D deficiency
Web-derived Surface Features: Embedded Slash • Left embedded slash • leukemia/lymphoma cell right
Web-derived Surface Features: Parentheses • Single-word • growth factor (beta) left • (brain) stem cell right • Two-word • (growth factor) beta left • brain (stem cell) right
Web-derived Surface Features: Comma, dot, semi-colon • Following the first word • home. health care right • adult, male rat right • Following the second word • health care, provider left • lung cancer: patients left
Web-derived Surface Features: Dash to External Word • External word to the left • mouse-brain stem cell right • External word to the right • tumor necrosis factor-alpha left
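The surface markers above turn into bracketing votes by comparing web hit counts for the marked variants. This sketch covers just the dash and possessive features; `get_hits` is a hypothetical stand-in for a search-engine count lookup, and the toy counts are invented so that "brain stem cell" comes out right-bracketed.

```python
def surface_votes(w1, w2, w3, get_hits):
    """Collect left/right bracketing votes from dash and possessive variants."""
    votes = []
    # Left dash "w1-w2 w3" (cf. "cell-cycle analysis") suggests left;
    # right dash "w1 w2-w3" (cf. "donor T-cell") suggests right.
    if get_hits(f"{w1}-{w2} {w3}") > get_hits(f"{w1} {w2}-{w3}"):
        votes.append("left")
    else:
        votes.append("right")
    # Possessive on the second word ("brain stem's cell") suggests left;
    # on the first word ("brain's stem cell") suggests right.
    if get_hits(f"{w1} {w2}'s {w3}") > get_hits(f"{w1}'s {w2} {w3}"):
        votes.append("left")
    else:
        votes.append("right")
    return votes

toy = {"brain-stem cell": 25, "brain stem-cell": 61,
       "brain stem's cell": 18, "brain's stem cell": 45}
print(surface_votes("brain", "stem", "cell", lambda q: toy.get(q, 0)))
```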
Other Web-derived Features: Abbreviation • After the second word • tumor necrosis (TN) factor left • After the third word • tumor necrosis factor (NF) right • We query for, e.g., “tumor necrosis tn factor” • Problems: • Roman digits: IV, VI • States: CA • Short words: me
Other Web-derived Features: Concatenation • Consider health care reform • healthcare : 79,500,000 • carereform : 269 • healthreform : 812 • Adjacency model • healthcare vs. carereform • Dependency model • healthcare vs. healthreform • Triples • “healthcarereform” vs. “health carereform”
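Under the dependency model, the concatenation counts above feed a direct comparison of #(w1w2) against #(w1w3). Using the slide's "health care reform" figures, "healthcare" overwhelms "healthreform", predicting left bracketing:

```python
def concat_bracket(count_w1w2, count_w1w3):
    """Dependency-model vote from concatenation counts."""
    return "left" if count_w1w2 > count_w1w3 else "right"

# healthcare (79,500,000) vs. healthreform (812), counts from the slide.
print(concat_bracket(79_500_000, 812))  # -> left
```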
Other Web-derived Features: Reorder • Reorders for “health care reform” • “care reform health” right • “reform health care” left
Other Web-derived Features: Internal Inflection Variability • Vary inflection of second word • tyrosine kinase activation • tyrosine kinases activation
Other Web-derived Features: Switch The First Two Words • Predict right, if we can reorder • adult male rat as • male adult rat
Our U + U + U Algorithm • Compute bigram estimates • Compute estimates from surface features • Compute estimates from paraphrases • Combine these scores with a voting algorithm to choose left or right bracketing.
Paraphrases • The semantics of a noun compound is often made overt by a paraphrase (Warren, 1978) • Prepositional • stem cells in the brain right • cells from the brain stem right • Verbal • virus causing human immunodeficiency left • Copula • office building that is a skyscraper right
Paraphrases • prepositional paraphrases: • We use: ~150 prepositions • verbal paraphrases: • We use: associated with, caused by, contained in, derived from, focusing on, found in, involved in, located at/in, made of, performed by, preventing, related to and used by/in/for. • copula paraphrases: • We use: is/was and that/which/who • optional elements: • articles: a, an, the • quantifiers: some, every, etc. • pronouns: this, these, etc.
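Paraphrase evidence can be sketched as query generation: for a compound (w1, w2, w3), a hit for "w2 w3 PREP w1" supports right bracketing (w2 w3 form a unit), while "w3 PREP w1 w2" supports left. The preposition list here is a small illustrative subset of the ~150 used, and the query templates are a simplification (omitting the optional articles, quantifiers, and pronouns mentioned above).

```python
# A small illustrative subset of the ~150 prepositions used in the work.
PREPOSITIONS = ["of", "for", "in", "from", "with"]

def prepositional_paraphrases(w1, w2, w3):
    """Queries whose web counts would vote for right vs. left bracketing."""
    right = [f"{w2} {w3} {p} {w1}" for p in PREPOSITIONS]  # [w1 [w2 w3]]
    left = [f"{w3} {p} {w1} {w2}" for p in PREPOSITIONS]   # [[w1 w2] w3]
    return {"right": right, "left": left}

q = prepositional_paraphrases("brain", "stem", "cell")
print(q["right"][0])  # stem cell of brain   (supports right bracketing)
print(q["left"][0])   # cell of brain stem   (supports left bracketing)
```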
Our U + U + U Algorithm • Compute bigram estimates • Compute estimates from surface features • Compute estimates from paraphrases • Combine these scores with a voting algorithm to choose left or right bracketing.
Evaluation: Datasets • Lauer Set • 244 noun compounds (NCs) • from Grolier’s encyclopedia • inter-annotator agreement: 81.5% • Biomedical Set • 430 NCs • from MEDLINE • inter-annotator agreement: 88% (κ = 0.606)
Co-occurrence Statistics • Results charts for the Lauer set and the Bio set (figures not reproduced in this transcript)
Paraphrase and Surface Features Performance • Results charts for the Lauer Set and the Biomedical Set (figures not reproduced in this transcript)