Adding Semantics to Information Retrieval By Kedar Bellare 20th April 2003
Motivation • Current IR techniques are term-based • Semantics of document and query not considered • Problems like polysemy and synonymy • Many advances in NLP and statistical modeling of semantics • Is semantic IR really required?
Organization • Traditional IR • Statistics for Semantics – Latent Semantic Indexing • Semantic Resources for Semantics – Use of Semantic Nets, Conceptual Graphs, WordNet etc. in IR • Conclusion
Information Retrieval "An information retrieval system does not inform the user on the subject of his inquiry. It merely informs on the existence (or non-existence) and whereabouts of documents relating to his request." – C. J. van Rijsbergen [14]
Current IR • Preprocessing of Documents • Inverted Index • Removing stopwords and Stemming • Representation of Documents • Vector Space Model – TF and IDF • Document Clustering • Improvements to the above • Better weighting of Document Vectors • Link analysis – PageRank and Anchor Text
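A minimal sketch of the vector space model described above (stopword removal, TF-IDF weighting, cosine ranking). The toy corpus, stopword list, and exact IDF formula are illustrative choices, and stemming is omitted:

```python
import math
from collections import Counter

# Minimal sketch: stopword removal, TF-IDF weighting, cosine ranking.
# Corpus, stopword list, and IDF formula are illustrative; no stemming.
docs = [
    "the car is driven on the road",
    "the truck is driven on the highway",
    "a jaguar is a large animal",
]
stopwords = {"the", "a", "is", "on"}

def tokens(text):
    return [w for w in text.lower().split() if w not in stopwords]

tokenized = [tokens(d) for d in docs]
N = len(tokenized)
df = Counter(w for doc in tokenized for w in set(doc))   # document frequency
idf = {w: math.log(N / df[w]) for w in df}               # inverse document frequency

def tfidf(doc):
    tf = Counter(doc)
    return {w: tf[w] * idf.get(w, 0.0) for w in tf}

def cosine(u, v):
    num = sum(u[w] * v.get(w, 0.0) for w in u)
    den = (math.sqrt(sum(x * x for x in u.values())) *
           math.sqrt(sum(x * x for x in v.values())))
    return num / den if den else 0.0

doc_vecs = [tfidf(d) for d in tokenized]
query = tfidf(tokens("car on the road"))
ranking = sorted(range(N), key=lambda i: cosine(query, doc_vecs[i]), reverse=True)
print(ranking)   # document 0 (the "car ... road" document) ranks first
```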
Latent Semantic Indexing • Problems with Traditional Approaches • Synonymy – Automobile and Car • Polysemy – Jaguar means both a car and an animal • LSI – Linear Algebra for capturing "Latent Semantics" of documents • A method of dimensionality reduction
LSI • Compares document vectors in Latent Semantic Space • Two documents can have high similarity value even if no terms shared • Attempts to remove minor differences in terminology during indexing • Truncated SVD – used for construction of Latent Semantic Space
Singular Value Decomposition • Given a t × d term-document matrix A, SVD factors it as A = T S Dᵀ, where T is t × r, S is r × r, and D is d × r • T and D have orthonormal columns, S is diagonal, and r is the rank of A • The reduced space corresponds to the axes of greatest variation
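A quick numerical check of the decomposition with NumPy, on a made-up 5 × 4 term-document matrix (the terms and counts are invented for illustration):

```python
import numpy as np

# Made-up 5 x 4 term-document matrix (rows: terms, columns: documents).
A = np.array([
    [1, 0, 0, 1],   # car
    [0, 1, 0, 1],   # automobile
    [1, 1, 0, 0],   # engine
    [0, 0, 1, 0],   # jaguar
    [0, 0, 1, 1],   # animal
], dtype=float)

# Thin SVD: A = T S D^T with T (t x r), S (r x r, diagonal), D (d x r);
# here NumPy returns r = min(t, d) columns.
T, s, Dt = np.linalg.svd(A, full_matrices=False)
S = np.diag(s)
D = Dt.T

assert np.allclose(T.T @ T, np.eye(T.shape[1]))   # orthonormal columns
assert np.allclose(D.T @ D, np.eye(D.shape[1]))   # orthonormal columns
assert np.allclose(T @ S @ D.T, A)                # exact reconstruction
```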
What does LSI do? • Uses the truncated SVD • Instead of the full r-dimensional space, keeps only k factors: Ā = T_k S_k D_kᵀ, where T_k is t × k, S_k is k × k, and D_k is d × k • The truncated SVD captures the underlying structure in the association of terms and documents
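A minimal sketch of the rank-k truncation, assuming NumPy; the random matrix and the choice k = 10 are illustrative:

```python
import numpy as np

def truncated_svd(A, k):
    """Rank-k LSI approximation: keep only the k largest singular values."""
    T, s, Dt = np.linalg.svd(A, full_matrices=False)
    Tk = T[:, :k]           # t x k
    Sk = np.diag(s[:k])     # k x k
    Dk = Dt[:k, :].T        # d x k
    return Tk, Sk, Dk

# Illustrative example on a random 100 x 30 "term-document" matrix.
rng = np.random.default_rng(0)
A = rng.random((100, 30))
Tk, Sk, Dk = truncated_svd(A, k=10)
A_k = Tk @ Sk @ Dk.T
print(np.linalg.matrix_rank(A_k))   # 10: the approximation has rank k
```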
Using the SVD model • Comparison of two terms – entries of the matrix T S² Tᵀ • Comparison of two documents – entries of the matrix D S² Dᵀ • Comparison of a term and a document – entries of the matrix T S Dᵀ • Folding a query into the SVD model – q' = qᵀ T S⁻¹
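A sketch of these comparisons and of query folding-in, using the toy matrix from above with k = 2; the query vector is invented for illustration:

```python
import numpy as np

# Toy 5 x 4 term-document matrix and k = 2 latent dimensions (invented).
A = np.array([
    [1, 0, 0, 1],
    [0, 1, 0, 1],
    [1, 1, 0, 0],
    [0, 0, 1, 0],
    [0, 0, 1, 1],
], dtype=float)
k = 2
T, s, Dt = np.linalg.svd(A, full_matrices=False)
Tk, Sk, Dk = T[:, :k], np.diag(s[:k]), Dt[:k, :].T

term_term = Tk @ Sk @ Sk @ Tk.T   # term-term comparisons:      T S^2 T^T
doc_doc   = Dk @ Sk @ Sk @ Dk.T   # document-document:          D S^2 D^T
term_doc  = Tk @ Sk @ Dk.T        # term-document:              T S D^T

# Fold a query q (raw term-count vector) into the latent space: q' = q^T T S^-1
q = np.array([1, 0, 0, 1, 0], dtype=float)   # query mentioning terms 0 and 3
q_lsi = q @ Tk @ np.linalg.inv(Sk)

# Rank documents by cosine similarity to q' (documents are rows of D_k).
norms = np.linalg.norm(Dk, axis=1) * np.linalg.norm(q_lsi) + 1e-12
sims = (Dk @ q_lsi) / norms
print(np.argsort(-sims))          # documents ordered by latent-space similarity
```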
Why does LSI work? • Plenty of empirical evidence, but no complete theory of why LSI works • No major degradation – theorem of Eckart and Young • The rank-k truncation is the closest rank-k matrix to A (minimum distance in the Frobenius norm) • This still does not explain the improvements in recall and precision
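A small numerical check of the error formula behind the Eckart and Young theorem: the Frobenius error of the rank-k truncation equals the root-sum-square of the discarded singular values, and no rank-k matrix does better. The random matrix and the choice of k are arbitrary:

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.random((50, 20))
U, s, Vt = np.linalg.svd(A, full_matrices=False)

k = 5
A_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

# ||A - A_k||_F = sqrt(sigma_{k+1}^2 + ... + sigma_r^2)
err = np.linalg.norm(A - A_k, "fro")
assert np.isclose(err, np.sqrt(np.sum(s[k:] ** 2)))
print(err)
```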
Why does LSI work? (contd.) • Papadimitriou et al. • Assume documents are generated from a set of topics with disjoint vocabularies • Prove that if the term-document matrix A is perturbed, LSI still recovers the topic information and removes the noise • Kontostathis et al. • Argue that LSI's ability to trace term co-occurrences is what improves recall
Advantages & Disadvantages • Advantages • Synonymy • Term Dependence • Disadvantages • Storage • Efficiency
Semantic Resources • Semantic Nets – e.g. "John gave Mary the book" • Applied in UNL – e.g. "Only a few farmers could use information technology in the early 1990s"
Semantic Resources (contd.) • Conceptual Graphs – e.g. "A bird is singing in a sycamore tree" • Conceptual Dependency – e.g. "I gave the man a book" • Lexical Resources – WordNet
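One common way to encode such structures is as typed (concept, relation, concept) triples. The sketch below uses invented relation names and a deliberately naive matching routine, just to make the idea concrete:

```python
# Minimal sketch: a semantic net / conceptual graph stored as typed
# triples (concept, relation, concept). Relation names are illustrative,
# not a fixed standard.
gave_book = {
    ("give", "agent",     "John"),
    ("give", "recipient", "Mary"),
    ("give", "object",    "book"),
}

bird_singing = {
    ("sing", "agent",    "bird"),
    ("sing", "location", "sycamore_tree"),
}

def overlap(query_graph, document_graph):
    """Naive graph matching: count the triples the two graphs share."""
    return len(query_graph & document_graph)

query = {("give", "object", "book")}
print(overlap(query, gave_book))      # 1 -> relevant
print(overlap(query, bird_singing))   # 0 -> not relevant
```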
Applications of Semantic Resources in IR • UNL • Used in improving document vectors • Conceptual Graphs • Graph matching of query and document • CDs • FERRET – Comparison of CD patterns • WordNet • Query Expansion using WordNet
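A sketch of WordNet-based query expansion using NLTK's WordNet interface (assumes the nltk package and its WordNet corpus are installed; the expansion policy below is an illustrative choice, not the exact method of Mandala et al. [5]):

```python
# Illustrative WordNet query expansion with NLTK; the policy of taking a
# few synonyms from every sense of each term is an assumption, not the
# method described in the talk.
from nltk.corpus import wordnet as wn

def expand_query(terms, max_per_term=3):
    expanded = list(terms)
    for term in terms:
        synonyms = set()
        for synset in wn.synsets(term):
            for lemma in synset.lemma_names():
                lemma = lemma.lower().replace("_", " ")
                if lemma != term:
                    synonyms.add(lemma)
        expanded.extend(sorted(synonyms)[:max_per_term])
    return expanded

print(expand_query(["car"]))   # e.g. adds "auto", "automobile", ...
```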
Conclusion • Several issues need to be considered before applying these methods to the Web • Storage • Efficiency • Knowledge content of the query • Clearly, a semantic method is needed to handle synonymy and polysemy • Currently, traditional models with minor hacks serve the purpose • In the long run, however, statistical or conceptual modeling of document semantics – or a combination of both – is definitely required
References [1] M. W. Berry, S. T. Dumais, and G. W. O'Brien. Using linear algebra for intelligent information retrieval. SIAM Review, 37(4), pages 573–595, 1995. [2] S. Chakrabarti. Mining the Web – Discovering Knowledge from Hypertext Data. Morgan Kaufmann Publishers, San Francisco, 2002. [3] S. Deerwester, S. T. Dumais, T. K. Landauer, G. W. Furnas, and R. A. Harshman. Indexing by Latent Semantic Analysis. Journal of the American Society for Information Science, 41(6), pages 391–407, 1990. [4] A. Kontostathis and W. M. Pottenger. A mathematical view of Latent Semantic Indexing: Tracing Term Co-occurrences. Technical report, Lehigh University, 2002. [5] R. Mandala, T. Takenobu, and T. Hozumi. The use of WordNet in Information Retrieval. In COLING/ACL Workshop on the Usage of WordNet in Natural Language Processing Systems, pages 31–37, 1998.
References (contd.) [6] M. L. Mauldin. Retrieval performance in FERRET: a conceptual information retrieval system. In Proceedings of the 14th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 347–355. ACM Press, 1991. [7] G. A. Miller, R. Beckwith, C. Fellbaum, D. Gross, and K. J. Miller. Introduction to WordNet: an on-line lexical database. International Journal of Lexicography, 3(4), pages 235–244, 1990. [8] M. Montes-y-Gomez, A. Lopez, and A. F. Gelbukh. Information retrieval with Conceptual Graph matching. In Database and Expert Systems Applications, pages 312–321, 2000. [9] C. H. Papadimitriou, H. Tamaki, P. Raghavan, and S. Vempala. Latent Semantic Indexing: A probabilistic analysis. In Proceedings of the ACM Symposium on Principles of Database Systems (PODS), pages 159–168, 1998. [10] E. Rich and K. Knight. Artificial Intelligence. Tata McGraw-Hill Publishers, New Delhi, 2002.
References (contd.) [11] G. Salton, A. Wong, and C. S. Yang. A vector space model for automatic indexing. Communications of the ACM, 18(11), pages 613–620, 1975. [12] C. Shah, B. Chowdhary, and P. Bhattacharyya. Constructing better Document Vectors using Universal Networking Language (UNL). In Proceedings of the International Conference on Knowledge-Based Computer Systems (KBCS). NCST, Navi Mumbai, India, 2002. [13] H. Uchida, M. Zhu, and T. Della Senta. UNL: A gift for a millennium. Technical report, The United Nations University, 2000. [14] C. J. van Rijsbergen. Information Retrieval. Butterworths, London, 1979.