Word Sense Disambiguation and Information Retrieval • By Guitao Gao and Qing Ma • Prof: Jian-Yun Nie
Outline • Introduction • WSD Approaches • Conclusion
Introduction • Task of Information Retrieval • Content Representation • Indexing • Bag-of-words indexing • Problems: • Synonymy: query expansion • Polysemy: Word Sense Disambiguation
WSD Approaches • Disambiguation based on manually created rules • Disambiguation using machine readable dictionaries • Disambiguation using thesauri • Disambiguation based on unsupervised machine learning with corpora
Disambiguation based on manually created rules • Weiss’ approach [Weiss 73]: • Set of rules to disambiguate five words • Context rule: applies within 5 words of the ambiguous word • Template rule: applies at a specific location relative to the ambiguous word • Accuracy: 90% • IR improvement: 1% • Small & Rieger’s approach [Small 1982]: • Expert system
Disambiguation using machine readable dictionaries • Lesk’s approach [Lesk 1988]: • Senses are represented by different dictionary definitions • Look up the definitions of the context words • Find words that co-occur in the definitions • Select the most similar sense • Accuracy: 50% - 70% • Problem: not enough overlapping words between definitions
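The definition-overlap idea on this slide can be sketched in a few lines of Python. This is a minimal illustration, not Lesk's implementation: the names `tokenize`, `lesk`, `STOPWORDS` and `TOY_DICTIONARY` are hypothetical, the glosses are illustrative toy entries, and the score is a plain count of shared words.

```python
import re

# Hypothetical stopword list; a real system would use a fuller one.
STOPWORDS = {"a", "an", "the", "of", "in", "on", "or", "and", "to", "with"}

def tokenize(text):
    """Lowercase the text and return its set of non-stopword tokens."""
    return {w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOPWORDS}

def lesk(target, context, dictionary):
    """Score each sense of `target` by its word overlap with the
    definitions of the other words in `context`.
    Returns (index of best sense, list of overlap scores)."""
    context_bag = set()
    for word in tokenize(context):
        if word == target:
            continue
        for definition in dictionary.get(word, []):
            context_bag |= tokenize(definition)
    scores = [len(tokenize(d) & context_bag) for d in dictionary[target]]
    best = max(range(len(scores)), key=scores.__getitem__)
    return best, scores

# Illustrative toy glosses (assumed for this example only).
TOY_DICTIONARY = {
    "pine": ["kind of evergreen tree with needle-shaped leaves",
             "waste away through sorrow or illness"],
    "cone": ["solid body which narrows to a point",
             "something of this shape whether solid or hollow",
             "fruit of certain evergreen trees"],
}

best, scores = lesk("cone", "pine cone", TOY_DICTIONARY)
print(best, scores)
```

In this toy case only the third sense of "cone" shares a word ("evergreen") with the definitions of "pine", which also shows the sparsity problem noted on the slide: "tree" and "trees" do not match without stemming.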
Disambiguation using machine readable dictionaries • Wilks’ approach [Wilks 1990]: • Attempts to solve Lesk’s problem • Expands the dictionary definitions • Uses the Longman Dictionary of Contemporary English (LDOCE) • Collects more word co-occurrence evidence • Accuracy: between 53% and 85%
Wilks’ approach [Wilks 1990] cont. • [Figure: commonly co-occurring words in LDOCE, from Wilks 1990]
Disambiguation using machine readable dictionaries • Luk’s approach [Luk 1995]: • Statistical sense disambiguation • Uses definitions from LDOCE • Co-occurrence data collected from the Brown corpus • Defining concepts: the 1792 words used to write the definitions of LDOCE • LDOCE is pre-processed: conceptual expansion
Luk’s approach [Luk 1995] cont. • Noun “sentence” and its conceptual expansion [Luk 1995]: • Entry in LDOCE, sense 1: (an order given by a judge which fixes) a punishment for a criminal found guilty in court • Conceptual expansion, sense 1: {order, judge, punish, crime, criminal, find, guilt, court} • Entry in LDOCE, sense 2: a group of words that forms a statement, command, exclamation, or question, usu. contains a subject and a verb, and (in writing) begins with a capital letter and ends with one of the marks . ! ? • Conceptual expansion, sense 2: {group, word, form, statement, command, question, contain, subject, verb, write, begin, capital, letter, end, mark}
Luk’s approach [Luk 1995] cont. • Collect co-occurrence data of defining concepts by constructing a two-dimensional Concept Co-occurrence Data Table (CCDT) • The Brown corpus is divided into sentences • Collect conceptual co-occurrence data for each defining concept that occurs in a sentence • Insert the collected data into the Concept Co-occurrence Data Table
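A rough sketch of how such a co-occurrence table could be filled, assuming each sentence has already been reduced to a list of concept tokens. The function name `build_ccdt`, the sparse `Counter` representation, and the toy sentences are assumptions made for illustration, not details taken from [Luk 1995].

```python
from collections import Counter
from itertools import combinations

def build_ccdt(sentences, defining_concepts):
    """Count how often each unordered pair of distinct defining concepts
    occurs in the same sentence, plus how many sentences contain each concept.
    Pair keys are alphabetically sorted tuples, so the table is symmetric."""
    pair_counts = Counter()
    concept_counts = Counter()
    for sentence in sentences:
        present = sorted(set(sentence) & defining_concepts)
        concept_counts.update(present)
        pair_counts.update(combinations(present, 2))
    return pair_counts, concept_counts

# Hypothetical toy corpus, already mapped to defining concepts.
SENTENCES = [
    ["judge", "court", "punish"],
    ["word", "verb", "subject"],
    ["judge", "court", "word"],
]
CONCEPTS = {"judge", "court", "punish", "word", "verb", "subject"}

pair_counts, concept_counts = build_ccdt(SENTENCES, CONCEPTS)
print(pair_counts[("court", "judge")], concept_counts["judge"])
```

Storing only the pairs that actually occur keeps the table small; a dense 1792 x 1792 array would work equally well for the full set of defining concepts.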
Luk’s approach [Luk 1995] cont. • Score each sense S with respect to context C [Luk 1995]
Luk’s approach [Luk 1995] cont. • Select sense with the highest score • Accuracy: 77% • Human accuracy: 71%
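The scoring formula itself is not given in the text above, so the following sketch substitutes an assumed score: the average pointwise mutual information (PMI) between the concepts in a sense's conceptual expansion and the concepts in the context. It only illustrates the "score each sense, select the highest" step. The function names, the toy counts, and the choice to treat unseen pairs as 0 are all assumptions, not Luk's definitions.

```python
import math

def pmi(a, b, pair_counts, concept_counts, n_sentences):
    """Pointwise mutual information of two concepts (assumed measure).
    Pairs never seen together contribute 0.0, a simplification."""
    joint = pair_counts.get(tuple(sorted((a, b))), 0)
    if joint == 0:
        return 0.0
    return math.log2(joint * n_sentences / (concept_counts[a] * concept_counts[b]))

def score_sense(sense_concepts, context_concepts, pair_counts, concept_counts, n):
    """Average PMI over all (sense concept, context concept) pairs."""
    pairs = [(s, c) for s in sense_concepts for c in context_concepts if s != c]
    if not pairs:
        return 0.0
    return sum(pmi(s, c, pair_counts, concept_counts, n) for s, c in pairs) / len(pairs)

def select_sense(expansions, context_concepts, pair_counts, concept_counts, n):
    """Return the sense label whose expansion scores highest."""
    return max(expansions, key=lambda sense: score_sense(
        expansions[sense], context_concepts, pair_counts, concept_counts, n))

# Hypothetical counts over 10 sentences (not real Brown corpus data).
N = 10
CONCEPT_COUNTS = {"judge": 4, "court": 4, "crime": 3, "word": 4, "verb": 3}
PAIR_COUNTS = {("court", "judge"): 3, ("court", "crime"): 2,
               ("crime", "judge"): 2, ("verb", "word"): 3}
EXPANSIONS = {
    "sentence/punishment": ["judge", "crime", "court"],
    "sentence/grammar": ["word", "verb"],
}

print(select_sense(EXPANSIONS, ["judge", "court"], PAIR_COUNTS, CONCEPT_COUNTS, N))
```

Because the score is computed from concept-level rather than word-level counts, many different words share the same statistics, which is one way to read the claim that the approach copes with small corpora.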
Approaches using Roget's Thesaurus [Yarowsky 1992] • Resources used: • Roget's Thesaurus • Grolier Multimedia Encyclopedia • Senses of a word: categories in Roget's Thesaurus • 1042 broad categories covering areas such as tools/machinery or animals/insects
Approaches using Roget's Thesaurus [Yarowsky 1992] cont. • Some words placed into the tools/machinery category [Yarowsky 1992]: tool, implement, appliance, contraption, apparatus, utensil, device, gadget, craft, machine, engine, motor, dynamo, generator, mill, lathe, equipment, gear, tackle, tackling, rigging, harness, trappings, fittings, accoutrements, paraphernalia, equipage, outfit, appointments, furniture, material, plant, appurtenances, a wheel, jack, clockwork, wheel-work, spring, screw
Approaches using Roget's Thesaurus [Yarowsky 1992] cont. • Collect context for each category: • From the Grolier Encyclopedia • For each occurrence of each member of the category • Extract the 100 surrounding words • [Figure: sample occurrences of words in the tools/machinery category, from Yarowsky 1992]
Approaches using Roget's Thesaurus [Yarowsky 1992] cont. • Identify and weight salient words • [Figure: sample salient words for Roget categories 348 and 414, from Yarowsky 1992] • To disambiguate a word: sum the weights of all salient words appearing in its context and select the category with the highest total • Accuracy: 92% when disambiguating 12 words
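The weighting and summing steps can be sketched as follows. The weight used here is the log ratio log(P(w | category) / P(w)), which is the salience measure in [Yarowsky 1992]; the additive smoothing constant, the rule that only positive weights count as "salient", the function names and the toy contexts are assumptions for illustration.

```python
import math
from collections import Counter

def salience_weights(category_contexts, smoothing=0.5):
    """For each category, weight each word by log(P(w|category) / P(w)).
    Only words with a positive weight are kept as salient (assumption)."""
    global_counts = Counter()
    for words in category_contexts.values():
        global_counts.update(words)
    total = sum(global_counts.values())
    vocab_size = len(global_counts)
    weights = {}
    for category, words in category_contexts.items():
        counts = Counter(words)
        denom = len(words) + smoothing * vocab_size
        weights[category] = {}
        for w, g in global_counts.items():
            ratio = ((counts[w] + smoothing) / denom) / (g / total)
            if ratio > 1.0:
                weights[category][w] = math.log(ratio)
    return weights

def classify(context, weights):
    """Sum the weights of salient words in the context; pick the best category."""
    return max(weights, key=lambda cat: sum(weights[cat].get(w, 0.0) for w in context))

# Hypothetical context words gathered around members of two categories.
CONTEXTS = {
    "TOOLS/MACHINERY": ["blade", "cut", "metal", "machine", "cut"],
    "ANIMALS/INSECTS": ["species", "nest", "feed", "wing", "species"],
}
WEIGHTS = salience_weights(CONTEXTS)
print(classify(["the", "machine", "cut", "metal"], WEIGHTS))
```

Words that are equally likely in every category get a ratio near 1 and drop out, so frequent function words such as "the" do not influence the decision.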
Introduction to WordNet (1) • Online thesaurus system • Synsets: sets of synonymous words • Hierarchical relationships
Introduction to WordNet (2) • [Figure from Sanderson 2000]
Voorhees’ Disambiguation Experiment [Voorhees 1993] • Calculation of semantic distance: between synsets and context words • Word’s sense: the synset closest to the context words • Retrieval result: worse than without disambiguation
Gonzalo’s IR experiment (1) [Gonzalo 1998] • Two questions: • Can WordNet really offer any potential for text retrieval? • How is text retrieval performance affected by disambiguation errors?
Gonzalo’s IR experiment (2) • Text collection: summary and document experiments • 1. Standard SMART run • 2. Indexed in terms of word senses • 3. Indexed in terms of synsets • 4. Introduction of disambiguation errors
Gonzalo’s IR experiment (3) • Experiment: % correct documents retrieved • Indexing by synsets: 62.0 • Indexing by word senses: 53.2 • Indexing by words: 48.0 • Indexing by synsets with 5% errors: 62.0 • Indexing by synsets with 10% errors: 60.8 • Indexing by synsets with 20% errors: 56.1 • Indexing by synsets with 30% errors: 54.4 • Indexing by synsets (all possible synsets): 52.6 • Indexing by synsets with 60% errors: 49.1
Gonzalo’s IR experiment (4) • Disambiguation with WordNet can improve text retrieval • The solution lies in a reliable automatic WSD technique
Disambiguation With Unsupervised Learning • Yarowsky’s Unsupervised Method [Yarowsky 1995] • One sense per collocation, e.g. plant (manufacturing/life) • One sense per discourse, e.g. defense (war/sports)
Yarowsky’s Unsupervised Method cont. • Algorithm details • Step 1: Store each occurrence of the word with its context as one line, e.g. "... zonal distribution of plant life ..." • Step 2: Identify a few seed words that represent each sense of the word, e.g. plant (manufacturing/life) • Step 3a: Derive rules from the training set: plant + X => A, weight; plant + Y => B, weight • Step 3b: Use the rules created in 3a to classify all occurrences of "plant" in the sample set
Yarowsky’s Unsupervised Method cont. • Step 3c: Use the one-sense-per-discourse rule to filter or augment the newly classified occurrences • Step 3d: Repeat Steps 3a-3c iteratively • Step 4: Training converges on a stable residual set • Step 5: The result is a set of rules, which is then used to disambiguate the word “plant”, e.g. plant + growth => life; plant + car => manufacturing
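Steps 2, 3a, 3b and 3d can be sketched as a small bootstrapping loop. This is a simplified illustration under stated assumptions, not Yarowsky's implementation: it handles exactly two senses, weights rules by a smoothed log-likelihood ratio, omits the one-sense-per-discourse filter of Step 3c, and uses hypothetical names (`learn_rules`, `classify`, `bootstrap`), a hypothetical smoothing constant `alpha`, and toy contexts.

```python
import math
from collections import Counter

def learn_rules(contexts, labels, senses, alpha=0.1):
    """Step 3a: build a decision list of (weight, word, sense), strongest first."""
    counts = {s: Counter() for s in senses}
    for words, label in zip(contexts, labels):
        if label is not None:
            counts[label].update(set(words))
    a, b = senses
    rules = []
    for word in set(counts[a]) | set(counts[b]):
        llr = math.log((counts[a][word] + alpha) / (counts[b][word] + alpha))
        if llr != 0:
            rules.append((abs(llr), word, a if llr > 0 else b))
    return sorted(rules, reverse=True)

def classify(words, rules, threshold):
    """Step 3b: the strongest matching rule above the threshold decides."""
    for weight, word, sense in rules:
        if weight >= threshold and word in words:
            return sense
    return None

def bootstrap(contexts, seeds, threshold=1.0, max_iter=10):
    """seeds maps each of two sense labels to one seed collocate (Step 2)."""
    senses = tuple(seeds)
    labels = [next((s for s in senses if seeds[s] in words), None)
              for words in contexts]
    rules = []
    for _ in range(max_iter):                      # Step 3d: iterate
        rules = learn_rules(contexts, labels, senses)
        predicted = [classify(words, rules, threshold) for words in contexts]
        new_labels = [p if p is not None else old
                      for p, old in zip(predicted, labels)]
        if new_labels == labels:                   # Step 4: converged
            break
        labels = new_labels
    return rules, labels

# Hypothetical contexts of the word "plant".
CONTEXTS = [
    ["plant", "life", "growth"],
    ["manufacturing", "plant", "car"],
    ["plant", "growth", "species"],
    ["car", "plant", "workers"],
    ["species", "plant", "animal"],
]
SEEDS = {"life": "life", "manufacturing": "manufacturing"}

rules, labels = bootstrap(CONTEXTS, SEEDS)
print(labels)
```

In the toy run the seed "life" first labels only one context, the rule for "growth" learned from it labels a second, and the rule for "species" learned from that labels a third, which is the gradual growth of the labelled set that the steps describe.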
Yarowsky’s Unsupervised Method cont. • Advantages of this method: • Better accuracy than other unsupervised methods • No need for the costly hand-tagged training sets that supervised methods require
Schütze and Pedersen’s approach [Schütze 1995] • Source of word sense definitions • Does not use a dictionary or thesaurus • Uses only the corpus to be disambiguated (Category B TREC-1 collection) • Thesaurus construction • Collect a (symmetric) term-term matrix C • Entry c_ij: number of times that words i and j co-occur in a symmetric window of total size k • Use SVD to reduce the dimensionality
Schütze and Pedersen’s approach [Schütze 1995] cont. • Thesaurus vector: a column of the (reduced) matrix • Semantic similarity: cosine between columns • Thesaurus: associate each word with its nearest neighbors • Context vector: sum of the thesaurus vectors of the context words
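The matrix, cosine and context-vector definitions can be illustrated with plain dictionaries. This sketch makes several assumptions: the SVD step is skipped, so the "thesaurus vectors" are raw co-occurrence rows; `window` is the number of words on each side rather than the total size k; and the function names and toy token sequence are hypothetical.

```python
import math
from collections import defaultdict

def cooccurrence(tokens, window=2):
    """Symmetric term-term counts: C[i][j] is the number of times word j
    occurs within `window` positions of word i."""
    C = defaultdict(lambda: defaultdict(int))
    for i, w in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                C[w][tokens[j]] += 1
    return {w: dict(row) for w, row in C.items()}

def cosine(u, v):
    """Cosine similarity of two sparse vectors stored as dicts."""
    dot = sum(x * v[k] for k, x in u.items() if k in v)
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def context_vector(context_words, C):
    """Sum the vectors of the context words (unknown words are skipped)."""
    total = defaultdict(int)
    for w in context_words:
        for k, x in C.get(w, {}).items():
            total[k] += x
    return dict(total)

# Hypothetical toy token sequence.
TOKENS = ("money bank loan money bank loan "
          "river bank water river bank water").split()
C = cooccurrence(TOKENS, window=1)
print(C["bank"], context_vector(["money", "loan"], C))
```

With a real corpus the rows are very high-dimensional and sparse, which is the motivation for the SVD reduction on the previous slide.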
Schütze and Pedersen’s approach [Schütze 1995] cont. • Disambiguation algorithm • Identify the context vectors corresponding to all occurrences of a particular word • Partition them into regions of high density • Assign a sense to each such region • Disambiguating a word: • Compute the context vector of its occurrence • Find the closest centroid of a region • Assign the occurrence the sense of that centroid
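The final assignment step can be sketched as a nearest-centroid lookup. The sketch assumes the clustering has already been done (the partition into dense regions is not shown), measures closeness by cosine similarity, and uses hypothetical function names, sense labels and two-dimensional toy vectors.

```python
import math

def centroid(vectors):
    """Component-wise mean of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]

def cosine(u, v):
    """Cosine similarity of two dense vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def assign_sense(context_vec, centroids):
    """Return the sense label whose centroid is most similar to the context vector."""
    return max(centroids, key=lambda sense: cosine(context_vec, centroids[sense]))

# Hypothetical clusters of context vectors for one ambiguous word.
CENTROIDS = {
    "sense_1": centroid([[1.0, 0.0], [0.9, 0.1]]),
    "sense_2": centroid([[0.0, 1.0], [0.2, 0.8]]),
}
print(assign_sense([0.8, 0.3], CENTROIDS))
```

The sense labels here are arbitrary cluster identifiers; nothing in the procedure ties them to dictionary senses.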
Schütze and Pedersen’s approach [Schütze 1995] cont. • Accuracy: 90% • Application to IR • Replace words by word senses • Sense-based retrieval: average precision over 11 points of recall increased by 4% relative to word-based retrieval • Combining the word-based and sense-based rankings for each document: • Average precision increased by 11% • Assigning each occurrence n (2, 3, 4, 5) senses: • Average precision increased by 14% for n = 3
Conclusion • How much can WSD help improve IR effectiveness? Open question • Weiss: 1%; Voorhees’ method: negative • Krovetz and Croft, Sanderson: only useful for short queries • Schütze and Pedersen’s approach and Gonzalo’s experiment: positive results • WSD must be accurate to be useful for IR • Schütze and Pedersen’s and Yarowsky’s algorithms: promising for IR • Luk’s approach: robust to sparse data, suitable for small corpora
References
[Gonzalo 1998] J. Gonzalo, F. Verdejo, I. Chugur and J. Cigarran. “Indexing with WordNet synsets can improve Text Retrieval”, in Proceedings of the COLING/ACL ’98 Workshop on Usage of WordNet for NLP, Montreal, 1998
[Krovetz 92] R. Krovetz & W.B. Croft. “Lexical Ambiguity and Information Retrieval”, in ACM Transactions on Information Systems, 10(1), 1992
[Lesk 1988] M. Lesk. “They said true things, but called them by wrong names - vocabulary problems in retrieval systems”, in Proc. 4th Annual Conference of the University of Waterloo Centre for the New OED, 1988
[Luk 1995] A.K. Luk. “Statistical sense disambiguation with relatively small corpora using dictionary definitions”, in Proceedings of the 33rd Annual Meeting of the ACL, Columbus, Ohio, June 1995. Association for Computational Linguistics
[Salton 83] G. Salton & M.J. McGill. Introduction to Modern Information Retrieval. The SMART and SIRE experimental retrieval systems. New York: McGraw-Hill, 1983
[Sanderson 1997] M. Sanderson. Word Sense Disambiguation and Information Retrieval, PhD Thesis, Technical Report (TR-1997-7) of the Department of Computing Science at the University of Glasgow, Glasgow G12 8QQ, UK
[Sanderson 2000] M. Sanderson. “Retrieving with Good Sense”, http://citeseer.nj.nec.com/sanderson00retrieving.html, 2000
[Schütze 1995] H. Schütze & J.O. Pedersen. “Information retrieval based on word senses”, in Proceedings of the Symposium on Document Analysis and Information Retrieval, 4: 161-175
[Small 1982] S. Small & C. Rieger. “Parsing and comprehending with word experts (a theory and its realisation)”, in Strategies for Natural Language Processing, W.G. Lehnert & M.H. Ringle, Eds., LEA: 89-148, 1982
[Voorhees 1993] E.M. Voorhees. “Using WordNet™ to disambiguate word sense for text retrieval”, in Proceedings of ACM SIGIR Conference, (16): 171-180, 1993
[Weiss 73] S.F. Weiss. “Learning to disambiguate”, in Information Storage and Retrieval, 9: 33-41, 1973
[Wilks 1990] Y. Wilks, D. Fass, C. Guo, J.E. McDonald, T. Plate, B.M. Slator. “Providing Machine Tractable Dictionary Tools”, in Machine Translation, 5: 99-154, 1990
[Yarowsky 1992] D. Yarowsky. “Word sense disambiguation using statistical models of Roget’s categories trained on large corpora”, in Proceedings of COLING Conference: 454-460, 1992
[Yarowsky 1994] D. Yarowsky. “Decision lists for lexical ambiguity resolution: Application to Accent Restoration in Spanish and French”, in Proceedings of the 32nd Annual Meeting of the Association for Computational Linguistics, Las Cruces, NM, 1994
[Yarowsky 1995] D. Yarowsky. “Unsupervised word sense disambiguation rivaling supervised methods”, in Proceedings of the 33rd Annual Meeting of the Association for Computational Linguistics, pages 189-196, Cambridge, MA, 1995