An Experiment in Using Lexical Disambiguation to Enhance Information Access
Robert Wilensky, Isaac Cheng, Timotius Tjahjadi, and Heyning Cheng
Goal
• Enhance information access by
  • fully automated text categorization
  • adding searching by word sense
• Applied to the World Wide Web
Manual vs. Automatically Created Directories
• Manual classification of documents is
  • Expensive
  • Not scalable
  • Hard to keep up with the rapid growth and changes of information sources such as the Web
• Would like fully automatic classification
  • no training set
  • no rules
  • appeal instead to “intrinsic semantics”
Lexical Disambiguation
• Problem: Determine the intended sense of an ambiguous word
• Approach: Based on Yarowsky, et al.
  • Thesaurus categories as proxies for senses
  • We used Roget’s 5th
  • Training: Count nearby word-category co-occurrences
  • Deployment: Add up the word-category evidence
Counting Co-occurrences of Terms with Categories
• Example: “…while storks and cranes make their nests in the bank…”
• Result is a category co-occurrence vector for each term, e.g., [Tools, Animals].
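The following is a minimal Python sketch (not the authors' code) of the Yarowsky-style scheme described on the last two slides: count nearby word-category co-occurrences during training, then sum the word-category evidence at disambiguation time. The toy THESAURUS mapping, the function names, the window size, and the pre-stemmed tokens are illustrative assumptions; the real system uses Roget's 5th.

    from collections import Counter, defaultdict

    # Toy stand-in for a thesaurus: each word lists the Roget-style categories
    # it can evidence.  Real training would use Roget's 5th.
    THESAURUS = {
        "stork": ["Animals"],
        "nest": ["Animals"],
        "crane": ["Animals", "Tools"],   # ambiguous: bird vs. lifting machine
        "girder": ["Tools"],
    }

    def train_cooccurrence(corpus, half_window=10):
        """Count, for every term, how often thesaurus categories occur nearby."""
        vectors = defaultdict(Counter)
        for tokens in corpus:                       # corpus: list of token lists
            for i, term in enumerate(tokens):
                lo, hi = max(0, i - half_window), i + half_window + 1
                for neighbor in tokens[lo:i] + tokens[i + 1:hi]:
                    for category in THESAURUS.get(neighbor, []):
                        vectors[term][category] += 1
        return vectors                              # term -> category co-occurrence vector

    def disambiguate(tokens, position, vectors, half_window=10):
        """Pick the sense (category) whose summed nearby evidence is largest."""
        lo, hi = max(0, position - half_window), position + half_window + 1
        evidence = Counter()
        for neighbor in tokens[lo:position] + tokens[position + 1:hi]:
            evidence.update(vectors.get(neighbor, Counter()))
        candidates = THESAURUS.get(tokens[position], [])
        return max(candidates, key=lambda c: evidence[c]) if candidates else None

    # Tokens shown already stemmed/singularized for simplicity.
    corpus = [["while", "stork", "and", "crane", "make", "their", "nest", "in", "the", "bank"]]
    vectors = train_cooccurrence(corpus)
    print(disambiguate(corpus[0], 3, vectors))      # evidence favors "Animals" here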
Automatic Topic Assignment Based on Word Sense
• Hearst: topic word-category association vectors
• Fisher and Wilensky: contrasted different algorithms; concluded that exploiting word senses may improve topic assignment
• We use the prior probability distribution of word senses (and, more recently, disambiguation per se)
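Below is a minimal sketch of topic assignment driven by prior sense distributions, roughly in the spirit described above; the SENSE_PRIORS table, the additive scoring rule, and all names are illustrative assumptions rather than the IAGO implementation.

    from collections import Counter

    # Hypothetical per-word priors over thesaurus categories, e.g. estimated
    # from a large corpus; the values for each word sum to 1.
    SENSE_PRIORS = {
        "bank":  {"FinanceInvestment": 0.8, "TheEnvironment": 0.2},
        "stock": {"FinanceInvestment": 0.7, "Animals": 0.3},
        "rate":  {"FinanceInvestment": 1.0},
    }

    def assign_topics(tokens, top_k=1):
        """Score each category by summing the words' prior sense probabilities."""
        scores = Counter()
        for token in tokens:
            for category, prior in SENSE_PRIORS.get(token, {}).items():
                scores[category] += prior
        return scores.most_common(top_k)

    print(assign_topics(["bank", "raised", "its", "stock", "rate"]))
    # [('FinanceInvestment', 2.5)]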
IAGO 0.1 vs. 1.0
• IAGO 0.1:
  • Eliminated short (< 100 content words) pages
  • Trained on newswire text
• IAGO 1.0:
  • Trained on the Encarta encyclopedia
  • Estimated word sense priors on the Web, using 10 million words of random web documents (see the sketch following this slide)
  • Ignored proper nouns
  • Augmented the stop-list to deal with various problems
  • Tested categorization by mapping Yahoo categories to ours
  • Tested disambiguation on newswire, then on sampled Web pages
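One plausible way to estimate such priors, building on the train_cooccurrence()/disambiguate() sketch given after the "Counting Co-occurrences" slide, is to run the disambiguator over a large random sample and normalize the per-word counts; this procedure is an assumption for illustration, not a description of how IAGO actually estimated its priors.

    from collections import Counter, defaultdict

    # Builds on THESAURUS and disambiguate() from the earlier sketch.
    def estimate_sense_priors(documents, vectors, half_window=10):
        """Estimate P(category | word) by disambiguating every occurrence of
        each ambiguous word in a large random sample, then normalizing."""
        counts = defaultdict(Counter)
        for tokens in documents:
            for i, term in enumerate(tokens):
                if term not in THESAURUS:
                    continue
                category = disambiguate(tokens, i, vectors, half_window)
                if category is not None:
                    counts[term][category] += 1
        return {term: {cat: n / sum(cats.values()) for cat, n in cats.items()}
                for term, cats in counts.items()}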
Classification Results

Then (version 0.1):

Category Name       Precision   Recall
-------------       ---------   ------
ComputerScience       31.6%      17.1%
FinanceInvestment     94.4%      22.0%
FitnessExercise      100.0%       4.3%
MotionPictures       100.0%      57.1%
Music                 97.5%      58.3%
Nutrition             80.3%      35.6%
Occupation           100.0%      13.1%
TheEnvironment         n/a        0.0%
Travel                50.0%       5.7%

Overall precision = 88%, overall recall = 23%

Now (version 1.0):

Category Name       Precision   Recall
-------------       ---------   ------
ComputerScience       87.5%      19.4%
FinanceInvestment    100.0%      13.4%
FitnessExercise      100.0%       1.8%
MotionPictures       100.0%      54.8%
Music                 98.2%      42.4%
Nutrition             97.9%      29.9%
Occupation            97.8%      30.3%
TheEnvironment         n/a        0.0%
Travel                75.0%      15.4%

Overall precision = 97%, overall recall = 21% (92.3% and 20.4% with no adjustment by hand)
IAGO! 1.0 Internet Directory
• Used the engine to classify a few tens of thousands of web documents into Roget’s categories.
Application to Text Searching
• Present the user with the set of known word senses from which to select
  • e.g., keyword = “rock”: = stone, or = kind of music
• Retrieve by word, filter by word sense
• Rank by number of matching word senses
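A minimal sketch of this retrieve-then-filter-then-rank idea, assuming documents have already been annotated with per-word sense labels; the Document structure, index layout, and sense names are illustrative assumptions, not IAGO's.

    from dataclasses import dataclass, field

    @dataclass
    class Document:
        doc_id: str
        # word -> set of sense labels the disambiguator assigned in this document
        senses: dict = field(default_factory=dict)

    def search(keyword, wanted_senses, index):
        """Retrieve by keyword, keep documents whose senses intersect the
        request, and rank by the number of matching word senses."""
        hits = []
        for doc in index.get(keyword, []):                   # retrieve by word
            matches = doc.senses.get(keyword, set()) & set(wanted_senses)
            if matches:                                      # filter by word sense
                hits.append((len(matches), doc.doc_id))
        return [doc_id for _, doc_id in sorted(hits, reverse=True)]  # rank by match count

    index = {
        "rock": [
            Document("geology-page", {"rock": {"stone"}}),
            Document("band-page", {"rock": {"kind of music"}}),
        ],
    }
    print(search("rock", ["kind of music"], index))          # ['band-page']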
Is it Useful?
• Results in the literature generally suggest that disambiguation is not useful for long queries, and that its utility is highly sensitive to disambiguation accuracy.
• However, 40% of search queries on the web are reported to be single words.
• So, does disambiguation work well enough to aid with single-word queries?
Usefulness
• Let r be the frequency of the most common of the (non-overlapping) senses.
• Can show that, to be better than just using keyword retrieval, disambiguation accuracy needs to be at least about 50%, with the required accuracy increasing as r increases; but it need not be highly accurate. (Below that threshold it can perform below the baseline.)
• IAGO! 1.0 performs well above this level.
Usefulness (cont’d)
• Keyword retrieval will produce word-sense retrieval precision and recall of r and 1 for the common sense, and (1 − r) and 1 for the less common sense.
• A disambiguation method that is correct a fraction p of the time has precision pr / (pr + (1 − p)(1 − r)) and recall p for a word sense with frequency r.
• Using the E measure as the metric, one can show that p must reach the accuracy at which the two E values are equal for a disambiguation method to outperform keyword retrieval (see the sketch following this slide).
• For small r, p must be greater than 50%. For large r, disambiguation compares favorably with keyword retrieval even at fairly low disambiguation accuracy.
• E.g., with a 90/10 distribution of word senses, for the more common word sense, E (with a beta of 0.5) is better for a disambiguation algorithm with accuracy over 77% than for keyword retrieval. (For the less common word sense, a “disambiguation” algorithm that is completely random gives a superior result.)
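The comparison behind these numbers can be reconstructed as follows, assuming two non-overlapping senses with frequencies r and 1 − r, a disambiguator correct with probability p, and van Rijsbergen's E measure with beta = 0.5; this is a sketch of the reasoning, not necessarily the authors' exact formulation.

    \documentclass{article}
    \usepackage{amsmath}
    \begin{document}

    % Keyword retrieval for the sense with frequency r: everything containing
    % the word is returned, so precision is r and recall is 1.
    \[ P_{\mathrm{kw}} = r, \qquad R_{\mathrm{kw}} = 1 . \]

    % A disambiguator correct with probability p returns pr true positives and
    % (1-p)(1-r) false positives for that sense:
    \[ P_{\mathrm{dis}} = \frac{pr}{pr + (1-p)(1-r)}, \qquad R_{\mathrm{dis}} = p . \]

    % Van Rijsbergen's E measure (lower is better):
    \[ E(P, R) = 1 - \frac{(1+\beta^{2})\,P R}{\beta^{2} P + R} . \]

    % Disambiguation wins when E(P_dis, R_dis) < E(P_kw, R_kw); the threshold
    % on p grows with r.  For r = 0.9 and beta = 0.5:
    %   E_kw = 1 - 1.25(0.9)(1) / (0.25(0.9) + 1)  is approximately 0.082,
    %   and at p = 0.77, P_dis is approximately 0.968, R_dis = 0.77, so
    %   E_dis is approximately 0.080,
    % i.e. accuracy just above 77% already beats keyword retrieval, matching
    % the figure quoted on the slide.

    \end{document}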
More Results
• The latest implementation (by Heyning Cheng) reduces training to about 1 hour (from about 24); classifying 1000 documents takes about 10 minutes.
• Also improved the performance of disambiguation, which made it practical to use disambiguation in topic assignment:
  • It produces slightly better results, appears to be less sensitive to changes in the stop-list, and can be made to run quickly.
• Disambiguation with a substantially smaller window size (even as small as 5) did not reduce accuracy; in some cases, a half-window size of 10 out-performed one of 50.
More Results (cont’d)
• Weighted word sense priors by the IDF of the term
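A small sketch of what IDF weighting of the sense priors could look like when scoring topics; the log-based IDF, the multiplicative combination, and all names are assumptions, not the exact IAGO weighting.

    import math
    from collections import Counter

    def idf(term, doc_freq, n_docs):
        """Standard inverse document frequency; rarer terms get more weight."""
        return math.log(n_docs / (1 + doc_freq.get(term, 0)))

    def assign_topics_idf(tokens, sense_priors, doc_freq, n_docs, top_k=1):
        """Like the earlier prior-based scorer, but each word's contribution is
        scaled by its IDF so that very common words carry less evidence."""
        scores = Counter()
        for token in tokens:
            weight = idf(token, doc_freq, n_docs)
            for category, prior in sense_priors.get(token, {}).items():
                scores[category] += weight * prior
        return scores.most_common(top_k)

    # Illustrative numbers only.
    sense_priors = {"bank": {"FinanceInvestment": 0.8, "TheEnvironment": 0.2},
                    "river": {"TheEnvironment": 1.0}}
    doc_freq = {"bank": 5000, "river": 200}
    print(assign_topics_idf(["bank", "river"], sense_priors, doc_freq, n_docs=100000))
    # "TheEnvironment" wins here because "river" is rarer (higher IDF).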
More Results (cont’d)
• Excluding low-utility or confusing Roget’s categories (reducing the set to about 200) improved recall to about 40% on the 1000-document test set.
• The “purity” of a topic assignment (the percentage of all word senses in the document disambiguated to the assigned topic) seems to correlate with accuracy at least as well as IAGO’s ranking algorithm does.
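A small sketch of how such a purity score might be computed for one document, assuming purity is simply the fraction of disambiguated word senses that fall in the assigned topic; the function and example data are illustrative.

    def purity(assigned_topic, word_senses):
        """Fraction of all disambiguated word senses that match the assigned topic."""
        if not word_senses:
            return 0.0
        matching = sum(1 for sense in word_senses if sense == assigned_topic)
        return matching / len(word_senses)

    # Word senses chosen by the disambiguator for one document (illustrative).
    senses = ["Music", "Music", "Music", "Occupation", "Music"]
    print(purity("Music", senses))   # 0.8 -- a fairly "pure" assignment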
Future Work
• Get better word sense proxies!
• Word-sense searching
  • Create a word sense index
  • Support word-sense searching within more general searches
  • Improve disambiguation by exploiting priors
  • Test against synonym-expansion methods
• Automatic topic-categorization
  • Handle multi-word phrases; proper names
Future Plans: Longer Term
• Disambiguation
  • Handle non-nouns
  • Better word sense source
  • Automatic grouping of thesaural word senses
• Topic-categorization
  • Multiple topic assignment
  • Quality
• Summarization via the same techniques
• Other linguistic choices, e.g., thematic roles