Search Engine Statistics Beyond the n-gram: Application to Noun Compound Bracketing CoNLL-2005. Preslav Nakov EECS, Computer Science Division University of California, Berkeley Marti Hearst SIMS University of California, Berkeley. Outline. Introduction Related Work Models and Features.
Search Engine Statistics Beyond the n-gram: Application to Noun Compound Bracketing • CoNLL-2005 • Preslav Nakov, EECS, Computer Science Division, University of California, Berkeley • Marti Hearst, SIMS, University of California, Berkeley
Outline • Introduction • Related Work • Models and Features
Introduction • Noun compound bracketing -> noun compound interpretation • liver cell antibody • [[liver cell] antibody] • liver cell line • [liver [cell line]] • Same POS sequence, different syntactic trees
This Paper • A highly accurate unsupervised method for making bracketing decisions for noun compounds (NCs) • Current approaches: use bigram estimates to compute adjacency and dependency scores • Improvements • χ² measure • a new set of surface features for querying Web search engines • Evaluated on two domains: encyclopedia and bioscience
Related Work • NC syntax and semantics • Still active -> Journal of Computer Speech and Language, Special Issue on Multiword Expressions • Adjacency model • Probabilistic dependency model, Lauer (1995) • Data sparseness (use categories instead of words) • 244 NCs from an encyclopedia • Inter-annotator agreement 81.5% • Baseline 66.8% -> 77.5% • Adding POS -> state-of-the-art result of 80.7%
2003-2005 • Keller and Lapata (2003) • Use Web search engines to obtain frequencies for unseen bigrams • Lapata and Keller (2004) apply this to six NLP tasks, including disambiguation of NCs • Simpler version (frequencies only): 78.68% • Girju et al. (2005): supervised (decision tree with 5 WordNet semantic features) • 83.1%
Models and Features • Adjacency and dependency models • w1 w2 w3 takes right bracketing, [w1 [w2 w3]], for one of two reasons • 1. w2 w3 is a compound (modified by w1) • home health care • the adjacency model checks reason 1 • 2. w1 and w2 independently modify w3 • adult male rat • the (better) dependency model checks reason 2 • Left bracketing, [[w1 w2] w3], has only one reading: w1 w2 is a compound modifying w3 • [law enforcement] agent
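The two models differ only in which word pair competes with (w1, w2). A minimal Python sketch of that comparison follows. The function name `bracket`, the toy counts, and the choice to break ties in favor of left bracketing are illustrative assumptions, not details taken from the slides; `assoc` can be any association score, such as a raw bigram count or a χ² value.

```python
from typing import Callable

def bracket(w1: str, w2: str, w3: str,
            assoc: Callable[[str, str], float],
            model: str = "dependency") -> str:
    """Return 'left' for [[w1 w2] w3] or 'right' for [w1 [w2 w3]]."""
    left_score = assoc(w1, w2)
    if model == "adjacency":
        # Adjacency: is w2 more tightly bound to w1 or to w3?
        right_score = assoc(w2, w3)
    elif model == "dependency":
        # Dependency: does w1 attach to w2 or to w3?
        right_score = assoc(w1, w3)
    else:
        raise ValueError(f"unknown model: {model}")
    # Assumption: ties go to left bracketing.
    return "left" if left_score >= right_score else "right"

# Hypothetical bigram counts, for illustration only.
TOY_COUNTS = {
    ("law", "enforcement"): 900, ("enforcement", "agent"): 300,
    ("law", "agent"): 50,
    ("adult", "male"): 200, ("male", "rat"): 400, ("adult", "rat"): 500,
}

def toy_assoc(x: str, y: str) -> float:
    return float(TOY_COUNTS.get((x, y), 0))

if __name__ == "__main__":
    print(bracket("law", "enforcement", "agent", toy_assoc))
    print(bracket("adult", "male", "rat", toy_assoc))
```

Passing the scorer in as a function keeps the bracketing decision separate from where the statistics come from (page hits, χ², or anything else).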
Computing Probabilities • Alternative • Calculations
χ² measure • A = #(wi, wj) • B = #(wi) − A • C = #(wj) − A • D = N − A − B − C • N = 8 trillion = 8 billion Google pages × 1,000 words/page • χ² works better than MI (Yang and Pedersen, 1997)
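The formula itself did not survive on this slide. The sketch below uses the standard χ² statistic for a 2×2 contingency table, which fits the A, B, C, D definitions above; treat it as an assumption about what the slide showed. The function name `chi_squared` and its parameter names are hypothetical.

```python
def chi_squared(count_wi: int, count_wj: int, count_wi_wj: int,
                total: int = 8 * 10**12) -> float:
    """Standard 2x2 chi-squared association score for a word pair.

    count_wi, count_wj: unigram counts; count_wi_wj: bigram count;
    total: N, here 8 billion pages x 1,000 words per page.
    """
    a = count_wi_wj                 # wi and wj together
    b = count_wi - a                # wi without wj
    c = count_wj - a                # wj without wi
    d = total - a - b - c           # neither word
    denominator = (a + c) * (b + d) * (a + b) * (c + d)
    if denominator == 0:
        return 0.0                  # degenerate table: no evidence
    return total * (a * d - b * c) ** 2 / denominator

if __name__ == "__main__":
    # Made-up counts: the more often the pair co-occurs, the higher the score.
    print(chi_squared(100, 100, 20, total=1000))
    print(chi_squared(100, 100, 50, total=1000))
```

The score is zero when the pair co-occurs exactly as often as independence predicts, and grows as the observed bigram count departs from that expectation.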
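Example: 蛋包飯 (omelette rice) • 蛋 (egg): 2,067,593 • 蛋包: 2,217 • 包 (wrap): 10,207,448 • 包飯: 3,398 • 飯 (rice): 1,672,224 • χ²(包飯) = 750.34 > χ²(蛋包) = 67.32, so the adjacency comparison favors right bracketing: [蛋 [包飯]]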
Web-Derived Surface Features (1/2) • Authors sometimes (consciously or not) disambiguate the words they write by using surface-level markers to suggest the correct meaning. • Dash (hyphen) • left bracketing: cell cycle analysis -> cell-cycle analysis • right bracketing is less reliable: donor T-cell, fiber optics-system, t-cell-depletion • Possessive marker • brain’s stem cells (right), brain stem’s cells (left), brain’s stem-cells (right) • Internal capitalization • Plasmodium vivax Malaria (left), brain Stem cells (right) • disabled for Roman numerals and single-letter words • vitamin D deficiency
Web-Derived Surface Features (2/2) • Embedded slashes • leukemia/lymphoma cell (right) • Parentheses • growth factor (beta) or (growth factor) beta (left) • (brain) stem cells (right) • A comma, a dot or a colon • “health care, provider” or “lung cancer: patients” (weak indicator) • A dash to an external word: mouse-brain stem cells (weak indicator) • Unfortunately, Web search engines ignore punctuation characters: hyphens, brackets, apostrophes, etc. • These features are collected indirectly by post-processing the result summaries (up to 1,000 results) • Some of the above features are clearly more reliable than others, but we do not try to weight them • Feature verification • counts returned by the search engine: page hits as a proxy for n-gram frequencies • counts from the 1,000 summaries
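Since search engines drop punctuation, the surface markers have to be counted in the returned summaries. The Python sketch below shows one way to do that with regular expressions. The function name `surface_votes`, the exact patterns, and the unweighted vote counting are illustrative assumptions; it covers only hyphens, possessives, parentheses and slashes, and because matching is case-insensitive it does not handle the internal-capitalization feature.

```python
import re
from collections import Counter

def surface_votes(w1: str, w2: str, w3: str, snippets: list[str]) -> Counter:
    """Count left/right bracketing evidence for 'w1 w2 w3' in result snippets."""
    a, b, c = (re.escape(w) for w in (w1, w2, w3))
    apos = "['\u2019]"  # straight or curly apostrophe
    patterns = {
        "left": [
            rf"\b{a}-{b}\s+{c}\b",            # cell-cycle analysis
            rf"\b{a}\s+{b}{apos}s\s+{c}\b",   # brain stem's cells
            rf"\({a}\s+{b}\)\s*{c}\b",        # (growth factor) beta
        ],
        "right": [
            rf"\b{a}\s+{b}-{c}\b",            # donor T-cell
            rf"\b{a}{apos}s\s+{b}\s+{c}\b",   # brain's stem cells
            rf"\({a}\)\s*{b}\s+{c}\b",        # (brain) stem cells
            rf"\b{a}/{b}\s+{c}\b",            # leukemia/lymphoma cell
        ],
    }
    votes = Counter()
    for side, side_patterns in patterns.items():
        for pattern in side_patterns:
            for snippet in snippets:
                votes[side] += len(re.findall(pattern, snippet, flags=re.IGNORECASE))
    return votes

if __name__ == "__main__":
    demo = ["Cell-cycle analysis of yeast", "plain cell cycle analysis"]
    print(surface_votes("cell", "cycle", "analysis", demo))
```

Every pattern counts as one unweighted vote, in line with the slide's note that the features are not weighted.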
Other Web-Derived Features • Abbreviations • tumor necrosis factor (NF) (right) • tumor necrosis (TN) factor (left) • Concatenation • health care reform -> healthcare (left), carereform (right) • Wildcard (*) • “health care * reform” (left) <-> “health * care reform” (right) • Reorder • reform health care (left) <-> care reform health (right) • myosin heavy chain -> heavy chain myosin (right) • Internal inflection variability • tyrosine kinase activation -> tyrosine kinases activation (left) • Switching the first two words • for a right-bracketed “adult male rat”, we would also expect to see “male adult rat”
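Several of these features amount to issuing rewritten exact-phrase queries and comparing the hit counts for the left-leaning and right-leaning variants. The Python sketch below builds the concatenation, wildcard and reorder queries for a three-word compound. The function name `query_variants` and the quoting convention are assumptions, and no search engine API is called.

```python
def query_variants(w1: str, w2: str, w3: str) -> dict[str, list[str]]:
    """Build exact-phrase queries whose hits support left or right bracketing."""
    return {
        "left": [
            f'"{w1}{w2} {w3}"',      # concatenation: healthcare reform
            f'"{w1} {w2} * {w3}"',   # wildcard keeps w1 w2 together
            f'"{w3} {w1} {w2}"',     # reorder: reform health care
        ],
        "right": [
            f'"{w1} {w2}{w3}"',      # concatenation: health carereform
            f'"{w1} * {w2} {w3}"',   # wildcard keeps w2 w3 together
            f'"{w2} {w3} {w1}"',     # reorder: care reform health
        ],
    }

if __name__ == "__main__":
    for side, queries in query_variants("health", "care", "reform").items():
        print(side, queries)
```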
Paraphrases • Warren (1978) proposes paraphrases as a test • stem cells in the brain (right) • cells from the brain stem (left) • Copula paraphrase • office building that/which is a skyscraper • pain associated with arthritis migraine • Search engines lack linguistic annotations, so a small set of hand-chosen paraphrase patterns is used • associated with, caused by, contained in, derived from, focusing on, found in, involved in, located at/in, made of, performed by, preventing, related to and used by/in/for
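The hand-chosen patterns can be turned into exact-phrase queries mechanically. The Python sketch below is one hedged reading of the slide: a query of the form "w3 PATTERN w1 w2" supports left bracketing (cells from the brain stem), while "w2 w3 PATTERN w1" supports right bracketing (stem cells in the brain). The names `PATTERNS` and `paraphrase_queries` are hypothetical, and articles and inflections are left out for brevity.

```python
# Patterns from the slide, with "at/in" and "by/in/for" expanded.
PATTERNS = [
    "associated with", "caused by", "contained in", "derived from",
    "focusing on", "found in", "involved in", "located at", "located in",
    "made of", "performed by", "preventing", "related to",
    "used by", "used in", "used for",
]

def paraphrase_queries(w1: str, w2: str, w3: str,
                       patterns: list[str] = PATTERNS) -> dict[str, list[str]]:
    """Build paraphrase queries that support left or right bracketing."""
    return {
        # [[w1 w2] w3]: the head w3 is related to the compound "w1 w2".
        "left": [f'"{w3} {p} {w1} {w2}"' for p in patterns],
        # [w1 [w2 w3]]: the compound "w2 w3" is related to w1.
        "right": [f'"{w2} {w3} {p} {w1}"' for p in patterns],
    }

if __name__ == "__main__":
    queries = paraphrase_queries("brain", "stem", "cells")
    print(queries["left"][:3])
    print(queries["right"][:3])
```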
Evaluation • Lauer’s dataset (1995) • 244 unambiguous 3-noun NCs • Biomedical dataset (Nakov et al., 2005, SIG BioLink) • Open NLP tools • sentence-split, tokenized, POS-tagged and shallow-parsed a set of 1.4 million MEDLINE abstracts (citations between 1994 and 2003) • 500 NCs: 361 left, 69 right, 70 ambiguous
Experiments • Used MSN Search statistics for the n-grams and the paraphrases (unless the pattern contained a “*”) • MSN always returned exact numbers • Used Google for the surface features • Google and Yahoo rounded their page hits, which generally leads to lower accuracy (Yahoo was better than Google for these estimates)
Tools Mentioned • UMLS Specialist lexicon • used to obtain spelling variants of biomedical terms • http://www.nlm.nih.gov/pubs/factsheets/umlslex.html • Carroll’s morphological tools • http://www.cogs.susx.ac.uk/lab/nlp/carroll/morph.html
UMLS Lexicon • {base=AAA entry=E0000049 cat=noun variants=metareg variants=uncount acronym_of=abdominal aortic aneurysmectomy|E0429482 acronym_of=acne-associated arthritis|E0429483 acronym_of=acquired aplastic anemia|E0429484 acronym_of=acute anxiety attack|E0429485 acronym_of=androgenic anabolic agent|E0429486 acronym_of=aneurysm of ascending aorta acronym_of=aromatic amino acid|E0356310 acronym_of=acute apical abscess|E0356309 abbreviation_of=abdominal aortic aneurysm|E0006446} • {base=AAMD spelling_variant=A.A.M.D. entry=E0000050 cat=noun variants=groupuncount acronym_of=American Association on Mental Deficiency|E0000277}
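A record like the ones above can be read into a key-to-values mapping. The Python sketch below handles only the flattened one-line form shown on this slide, with fields separated by whitespace. It is not the official lexicon file format or any NLM tool, and the function name `parse_record` is hypothetical. Repeated keys such as `acronym_of` are collected into lists.

```python
import re

def parse_record(text: str) -> dict[str, list[str]]:
    """Parse a flattened '{key=value key=value ...}' lexicon record."""
    body = text.strip().strip("{}").strip()
    # Split only at whitespace that precedes a new "key=" token, so values
    # containing spaces ("abdominal aortic aneurysm|E0006446") stay whole.
    fields = re.split(r"\s+(?=[a-z_]+=)", body)
    record: dict[str, list[str]] = {}
    for field in fields:
        key, _, value = field.partition("=")
        record.setdefault(key, []).append(value)
    return record

if __name__ == "__main__":
    sample = ("{base=AAMD spelling_variant=A.A.M.D. entry=E0000050 cat=noun "
              "variants=groupuncount "
              "acronym_of=American Association on Mental Deficiency|E0000277}")
    for key, values in parse_record(sample).items():
        print(key, values)
```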
Conclusions and Future Work • Improved upon the state-of-the-art approaches to NC bracketing • Future work includes • testing on NCs longer than 3 words • recognizing the ambiguous cases • including determiners and modifiers • applying the approach to other NLP problems • refining parser output • parsers typically assume right bracketing