Category-Based Pseudowords Preslav Nakov & Marti Hearst University of California at Berkeley EECS & SIMS Supported by Genentech and ARDA Aquaint HLT/NAACL'03
Word sense disambiguation
• WSD task: determine the sense of a particular instance of a multi-sense word given its context
• classic ambiguous example: bank
  • homography
    • river bank
    • financial institution
  • polysemy
    • financial institution
    • building
Evaluation
• Ideally: using a sense-tagged corpus
  • general purpose, e.g. the SENSEVAL corpus
  • specific domain, e.g. biomedical
    • the National Library of Medicine test collection contains instances of 50 highly frequent ambiguous concepts from the UMLS Metathesaurus
• Moving to a new domain
  • a sense-tagged corpus may be unavailable
  • even when available, it may be unsuitable
    • What if we use a different sense distinction, e.g. MeSH instead of the UMLS Metathesaurus?
    • What if we are also interested in less frequent words, e.g. we need to evaluate an all-words system?
Pseudowords
• building a sense-tagged corpus is very expensive, so create an artificial one
• pseudoword: a composite of two or more words chosen at random (Gale et al.’92), (Schuetze’92):
  • e.g. banana and door → banana_door
• accuracy on pseudowords is accepted as an upper bound on the system’s true accuracy
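A minimal sketch of how a pseudoword turns raw text into an artificially sense-tagged corpus: every occurrence of either constituent is replaced by the composite token, and the original word is kept as the gold label. The function name and the toy sentence are illustrative assumptions, not part of the original slides.

```python
def make_pseudoword_corpus(tokens, w1, w2):
    """Replace every occurrence of w1 or w2 with the pseudoword w1_w2.

    Returns the rewritten token list and a dict mapping each replaced
    position to the original word, which acts as the gold "sense".
    """
    pseudo = f"{w1}_{w2}"
    rewritten, gold = [], {}
    for i, tok in enumerate(tokens):
        if tok in (w1, w2):
            rewritten.append(pseudo)
            gold[i] = tok  # the hidden original word is the label
        else:
            rewritten.append(tok)
    return rewritten, gold


# Toy sentence (made up for illustration).
tokens = "she ate a banana and then closed the door".split()
new_tokens, gold = make_pseudoword_corpus(tokens, "banana", "door")
print(new_tokens)
print(gold)
```

A disambiguation system is then asked to recover the original word at each `banana_door` position from context alone.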
Problems
Chosen entirely at random, and thus:
• difficult to characterize in terms of the type of ambiguity being modeled
• optimistic in their estimations (Gaustad’01)
• highly likely to combine semantically distinct words
• real ambiguous words have senses similar in meaning and difficult to distinguish
The solution
Use lexical category membership
MeSH and Medline
• we use MeSH (Medical Subject Headings)
• example: Eye has the following codes
  • A01.456.505.420 (child of Face)
  • A09.371 (child of Sense Organs)
• average number of senses: 2.12
• we truncate each code at the first period to allow generalization (e.g. A01 and A09)
  • 71.18% - single class, 22.14% - two classes
  • the ambiguity drops to 1.39
• Medline abstracts - 180,226
  • training: 120,150
  • testing: 60,076
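The truncation step can be illustrated with a short sketch: each MeSH code is cut at its first period and duplicates are merged, so a term's ambiguity can only stay the same or drop. The codes for Eye come from the slide; the second term and its codes are hypothetical and exist only to show two codes collapsing into one category.

```python
def top_level_categories(tree_numbers):
    """Truncate each MeSH code at the first period and deduplicate."""
    return sorted({code.split(".", 1)[0] for code in tree_numbers})


# "Eye" is from the slide; "hypothetical_term" and its codes are made up.
term_codes = {
    "Eye": ["A01.456.505.420", "A09.371"],
    "hypothetical_term": ["C04.100", "C04.200.300"],
}

categories = {t: top_level_categories(c) for t, c in term_codes.items()}

# Average number of senses before and after truncation (toy data only).
mean_before = sum(len(c) for c in term_codes.values()) / len(term_codes)
mean_after = sum(len(c) for c in categories.values()) / len(categories)
print(categories, mean_before, mean_after)
```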
Pseudowords generation (1)
Build a list C of the category couples and their frequencies in the training corpus
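One way the list C might be built is sketched below. The slides do not say exactly how couples are counted, so this assumes that each corpus occurrence of a term mapped to exactly two top-level categories adds one count to that unordered pair. The function name, the term-to-category mapping, and the toy occurrences are all hypothetical.

```python
from collections import Counter


def count_category_pairs(term_occurrences, term_categories):
    """Count unordered category pairs over occurrences of two-category terms.

    Assumption: only terms with exactly two categories contribute.
    """
    pairs = Counter()
    for term in term_occurrences:
        cats = term_categories.get(term, frozenset())
        if len(cats) == 2:
            pairs[tuple(sorted(cats))] += 1
    return pairs


# Hypothetical mapping and occurrences, for illustration only.
term_categories = {
    "eye": frozenset({"A01", "A09"}),
    "face": frozenset({"A01"}),
}
occurrences = ["eye", "face", "eye", "unknown"]
C = count_category_pairs(occurrences, term_categories)
print(C)
```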
Pseudowords generation (2)
Generate pseudowords with the following characteristics:
• represent a real ambiguity class pair (one that occurs in the training corpus)
• the number of pseudowords drawn from a particular class pair is proportional to the pair’s frequency
• only unambiguous words are used as pseudoword constituents
• multi-word concepts are allowed as elements, e.g. general systems theory + glutathione S-transferase
Pseudowords generation (3)
Pseudowords for the lower bound
• in real texts, the more frequent sense for a two-sense distinction occurs around 92% of the time (Sanderson & van Rijsbergen’99)
• evenly distributed senses are harder
• so we build a balanced list W of pairs:
  • we calculate the mean corpus word frequency E and then find the words with frequency in [E/2; 3E/2]
  • in the particular experiment: E = 45.21, which gave a list of 64,596 pairs
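The frequency filter behind the balanced list can be sketched as follows: compute the mean frequency E and keep the words whose frequency lies in [E/2; 3E/2]. The frequencies below are made up, and the later step of combining the kept words into the pairs of W is not shown because the slides do not spell it out.

```python
def balanced_words(freq):
    """Return the mean frequency E and the words with frequency in [E/2, 3E/2]."""
    mean = sum(freq.values()) / len(freq)
    low, high = mean / 2, 3 * mean / 2
    kept = sorted(w for w, f in freq.items() if low <= f <= high)
    return mean, kept


# Hypothetical word frequencies, for illustration only.
freq = {"a": 10, "b": 40, "c": 50, "d": 100}
E, kept = balanced_words(freq)
print(E, kept)
```

Words far from the mean are dropped, so the two constituents of any resulting pair occur with roughly comparable frequency.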
Pseudowords generation (4)
• importance sampling
  • 1) Select a category pair c1,c2 from C by sampling from a multinomial distribution with parameters proportional to the frequencies of the elements of C.
  • 2) Sample uniformly to draw two random distinct words w1 and w2 whose classes correspond to the classes selected in step 1).
  • 3) If the word pair w1,w2 has been sampled already, go to step 1) and try again.
• we sampled 1,000 pseudowords (88,758 instances) out of the possible 64,596
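The three sampling steps above can be sketched as a frequency-weighted draw with rejection of duplicates. This is an illustration under stated assumptions, not the authors' code: the function name, the `max_tries` guard, the fixed seed, and the toy categories and words are all hypothetical.

```python
import random


def sample_pseudowords(pair_freq, words_by_cat, n, seed=0, max_tries=100_000):
    """Draw n distinct word pairs, choosing category pairs by frequency."""
    rng = random.Random(seed)
    pairs = list(pair_freq)
    weights = [pair_freq[p] for p in pairs]
    seen, result = set(), []
    tries = 0
    while len(result) < n and tries < max_tries:
        tries += 1
        # Step 1: category pair, probability proportional to its frequency.
        c1, c2 = rng.choices(pairs, weights=weights, k=1)[0]
        # Step 2: one word drawn uniformly from each selected category.
        w1 = rng.choice(words_by_cat[c1])
        w2 = rng.choice(words_by_cat[c2])
        key = frozenset((w1, w2))
        # Step 3: reject identical words and pairs already sampled.
        if w1 == w2 or key in seen:
            continue
        seen.add(key)
        result.append((w1, w2))
    return result


# Hypothetical category-pair frequencies and unambiguous words.
pair_freq = {("A01", "A09"): 3, ("A01", "C04"): 1}
words_by_cat = {"A01": ["face", "arm"], "A09": ["retina"], "C04": ["tumor"]}
sampled = sample_pseudowords(pair_freq, words_by_cat, n=3)
print(sampled)
```

The `max_tries` guard is an addition so the loop terminates even if fewer than `n` distinct pairs exist.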
Sample pseudowords
• the more unusual pairs come from less frequent categories
Classifier
• Naïve Bayes classifier
  • simple, commonly used for WSD, and among the best performing
• we used a symmetric context window:
  • 10, 20, 40 and 300 words on each side
• category name as a proxy for the sense
  • ambiguous MeSH categories as target
  • UNambiguous MeSH categories as features (we use a class-based model, and not a word-based one)
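A minimal sketch of a class-based Naïve Bayes disambiguator over a symmetric context window. The slides do not state the smoothing or the exact feature extraction, so add-one smoothing and the `window` helper are assumptions, and the tiny training set of category codes is made up.

```python
import math
from collections import Counter, defaultdict


def window(tokens, i, k):
    """Return up to k tokens on each side of position i (target excluded)."""
    return tokens[max(0, i - k):i] + tokens[i + 1:i + 1 + k]


class NaiveBayes:
    """Multinomial Naive Bayes with add-one smoothing (an assumption)."""

    def __init__(self):
        self.class_counts = Counter()
        self.feat_counts = defaultdict(Counter)
        self.vocab = set()

    def train(self, examples):
        for feats, label in examples:
            self.class_counts[label] += 1
            self.feat_counts[label].update(feats)
            self.vocab.update(feats)

    def predict(self, feats):
        total = sum(self.class_counts.values())
        best, best_lp = None, -math.inf
        for label, n in self.class_counts.items():
            denom = sum(self.feat_counts[label].values()) + len(self.vocab)
            lp = math.log(n / total)  # log prior of the candidate sense
            for f in feats:
                lp += math.log((self.feat_counts[label][f] + 1) / denom)
            if lp > best_lp:
                best, best_lp = label, lp
        return best


# Features: unambiguous categories seen in the window; label: target category.
train = [
    (["A01", "A09"], "A09"),
    (["A09", "A09"], "A09"),
    (["C04", "C23"], "C04"),
    (["C04"], "C04"),
]
nb = NaiveBayes()
nb.train(train)
print(nb.predict(["A09"]), nb.predict(["C23"]))
```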
Abbreviations
• we have no real disambiguated corpus
• use abbreviations, as suggested in (Liu et al.’02)
  • represent real ambiguous words
  • but the ambiguity may be accidental
  • intermediate position between entirely random pseudowords and real ambiguous words
• we generated 98,841 abbreviations (332,020 instances in total) such that:
  • their expansions are fully and unambiguously mapped to MeSH
  • they represent exactly two distinct categories
• we used an abbreviation extraction tool described in (Schwartz & Hearst’03)
Sample abbreviations
Evaluation
• Category based
  • baseline – choose the more frequent class (shown for abbreviations)
  • pessimistic – evenly distributed constituents
  • realistic – random constituents (frequency at least 5)
  • abbreviations
• Non-category based
  • optimistic – completely random (the standard way to generate)
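The baseline in the list above (always choose the more frequent class) can be sketched as below. It assumes the majority class is estimated from training labels and scored on test labels; the slides do not give that detail, and the function name and labels are hypothetical.

```python
from collections import Counter


def majority_baseline_accuracy(train_labels, test_labels):
    """Predict the most frequent training class for every test instance."""
    majority = Counter(train_labels).most_common(1)[0][0]
    correct = sum(1 for y in test_labels if y == majority)
    return majority, correct / len(test_labels)


# Hypothetical sense labels for one ambiguous item.
train_labels = ["A01", "A01", "A01", "A09"]
test_labels = ["A01", "A09", "A01", "A01"]
majority, accuracy = majority_baseline_accuracy(train_labels, test_labels)
print(majority, accuracy)
```

With evenly distributed constituents this baseline sits near 50% for two classes, which is why the pessimistic setting is the harder one.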
Conclusions
• We introduced category-based pseudowords built from distributions of lexical category co-occurrence:
  • give a more accurate lower bound
  • allow detailed study (many samples) of a particular sense ambiguity
  • represent a better motivated word grouping in pseudowords
Thank you! Your questions?