
Latent Semantic Analysis for Document Indexing

Learn about Latent Semantic Analysis (LSA) for indexing documents efficiently and overcoming synonymy and polysemy issues. Discover the principles, history, and applications of LSA.

hanchett

Presentation Transcript


  1. Indexing by Latent Semantic Analysis • Scott Deerwester, Susan Dumais, George Furnas, Thomas Landauer, and Richard Harshman • Presented by: Ashraf Khalil

  2. Outline • The Problem • Some History • LSA • A Small Example • Efficiency • Other applications • Summary

  3. The Problem • Given a collection of documents: retrieve documents that are relevant to a given query • Match terms in documents to terms in query

  4. The Problem • The vector space method • term (rows) by document (columns) matrix, based on occurrence • translate into vectors in a vector space • one vector for each document • cosine to measure distance between vectors (documents) • small angle = large cosine = similar • large angle = small cosine = dissimilar

  5. The Problem • Two problems that arose using the vector space model: • synonymy: many ways to refer to the same object, e.g. car and automobile • leads to poor recall • polysemy: most words have more than one distinct meaning, e.g. Jaguar • leads to poor precision
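
The term matching and cosine measure from slides 4 and 5 can be sketched as follows. This is a minimal illustration, not the presenter's code: the three toy documents, the whitespace tokenizer, and the helper names `term_vector` and `cosine` are assumptions made for the example. It shows the synonymy problem directly: a query for "car" scores zero against a document that only says "automobile".

```python
import math
from collections import Counter

def term_vector(text, vocabulary):
    """Count how often each vocabulary term occurs in the text."""
    counts = Counter(text.lower().split())
    return [counts[term] for term in vocabulary]

def cosine(u, v):
    """Cosine of the angle between two vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

# Hypothetical toy collection, chosen only to illustrate the slides' point.
docs = {
    "d1": "car repair and car maintenance",
    "d2": "automobile repair manual",
    "d3": "jaguar habitat in the rain forest",
}

# Rows of the term-by-document matrix correspond to this sorted vocabulary.
vocabulary = sorted({w for text in docs.values() for w in text.lower().split()})
doc_vectors = {name: term_vector(text, vocabulary) for name, text in docs.items()}

query = term_vector("car", vocabulary)
scores = {name: cosine(query, vec) for name, vec in doc_vectors.items()}

for name, score in sorted(scores.items()):
    print(name, round(score, 3))
```

Document d2 is about the same subject as the query but shares no term with it, so pure term matching misses it (poor recall).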

  6. The Goal • Latent Semantic Indexing was proposed to address these two problems with the vector space model

  7. Some History • Latent Semantic Indexing was developed at Bellcore (now Telcordia) in the late 1980s (1988). It was patented in 1989. • http://lsi.argreenhouse.com/lsi/LSI.html • The first papers about LSI: • Dumais, S. T., Furnas, G. W., Landauer, T. K. and Deerwester, S. (1988), "Using latent semantic analysis to improve information retrieval." In Proceedings of CHI'88: Conference on Human Factors in Computing Systems, New York: ACM, 281-285. • Deerwester, S., Dumais, S. T., Landauer, T. K., Furnas, G. W. and Harshman, R. A. (1990), "Indexing by latent semantic analysis." Journal of the American Society for Information Science, 41(6), 391-407.

  8. LSA: The idea • Idea (Deerwester et al): • “We would like a representation in which a set of terms, which by itself is incomplete and unreliable evidence of the relevance of a given document, is replaced by some other set of entities which are more reliable indicants. We take advantage of the implicit higher-order (or latent) structure in the association of terms and documents to reveal such relationships.” • The assumption is that co-occurrence says something about semantics: words about the same things are likely to occur in the same contexts • If we have many words and contexts, small differences in co-occurrence probabilities can be compiled together to give information about semantics.

  9. LSA: Overview • Build a matrix with rows representing words and columns representing contexts (a document or word string) • Apply SVD • mathematical decomposition of a matrix into the product of three matrices (unique up to certain sign and permutation choices): • two with orthonormal columns • one with singular values on the diagonal • tool for dimension reduction • similarity measure based on co-occurrence • finds optimal projection into low-dimensional space

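  10. LSA Methods • Start with a term-by-document matrix X • Optionally weight cells • Apply Singular Value Decomposition: X = T S D^T, where T is t × n, S is n × n and diagonal, and D is d × n • t = # of terms • d = # of documents • n = min(t, d) • Approximate using k (semantic) dimensions: X ≈ T_k S_k D_k^T, keeping only the k largest singular values (k ≤ n)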

  11. LSA: • SVD • can be viewed as a method for rotating the axes in n-dimensional space, so that the first axis runs along the direction of the largest variation among the documents • the second dimension runs along the direction with the second largest variation • and so on • generalized least-squares method

  12. LSA • Rank-reduced Singular Value Decomposition (SVD) performed on matrix • all but the k highest singular values are set to 0 • produces k-dimensional approximation of the original matrix • this is the “semantic space” • Compute similarities between entities in semantic space (usually with cosine)
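
The rank-reduction step on slides 10 to 12 can be sketched with NumPy. This is an illustrative sketch under stated assumptions: the 5 × 4 toy matrix, the term labels, and the helper names `truncated_svd` and `cosine` are made up for the example, and scaling the term coordinates by the singular values is one common convention rather than something the slides specify.

```python
import numpy as np

def truncated_svd(X, k):
    """Return (T_k, s_k, D_k) for the k largest singular values of X."""
    T, s, Dt = np.linalg.svd(X, full_matrices=False)
    return T[:, :k], s[:k], Dt[:k, :].T

def cosine(u, v):
    """Cosine similarity between two 1-D arrays."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

# Hypothetical term-by-document counts (rows: terms, columns: documents).
terms = ["car", "automobile", "engine", "banana", "fruit"]
X = np.array([
    [1, 0, 0, 0],   # car
    [0, 1, 0, 0],   # automobile
    [1, 1, 0, 0],   # engine
    [0, 0, 1, 0],   # banana
    [0, 0, 1, 1],   # fruit
], dtype=float)

k = 2
T_k, s_k, D_k = truncated_svd(X, k)

# k-dimensional approximation of X: all but the k largest singular values dropped.
X_k = T_k @ np.diag(s_k) @ D_k.T

# Term coordinates in the k-dimensional "semantic space".
term_vecs = T_k * s_k

raw_sim = cosine(X[0], X[1])                   # car vs. automobile, raw counts
lsa_sim = cosine(term_vecs[0], term_vecs[1])   # same pair, semantic space

print("raw cosine(car, automobile):", round(raw_sim, 3))
print("LSA cosine(car, automobile):", round(lsa_sim, 3))
```

In this toy matrix "car" and "automobile" never share a document, so their raw cosine is zero, but both co-occur with "engine", and in the two-dimensional space they end up pointing in the same direction.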

  13. A Small Example • Technical Memo Titles • c1: Human machine interface for ABC computer applications • c2: A survey of user opinion of computer system response time • c3: The EPS user interface management system • c4: System and human system engineering testing of EPS • c5: Relation of user perceived response time to error measurement • m1: The generation of random, binary, ordered trees • m2: The intersection graph of paths in trees • m3: Graph minors IV: Widths of trees and well-quasi-ordering • m4: Graph minors: A survey

  14. A Small Example – 2 r (human.user) = -.38 r (human.minors) = -.29

  15. A Small Example – 3 • T =

  16. A Small Example – 4 • S =

  17. A Small Example – 5 • D =

  18. A Small Example – 7

  19. A Small Example – 7 r (human.user) = .94 r (human.minors) = -.83

  20. A Small Example – 2 again r (human.user) = -.38 r (human.minors) = -.29

  21. Correlation: Raw data 0.92 -0.72 1.00
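
The correlations quoted in the small example can be recomputed from the nine memo titles. The sketch below is a reconstruction under assumptions: the 12 index terms and their counts are rebuilt here from the titles on slide 13 (the matrix itself did not survive in this transcript), the correlation is taken to be the Pearson correlation between term rows, and k = 2 is assumed for the reduced space.

```python
import numpy as np

# Index terms assumed for the example (words occurring in more than one title).
terms = ["human", "interface", "computer", "user", "system", "response",
         "time", "EPS", "survey", "trees", "graph", "minors"]

# Columns: c1 c2 c3 c4 c5 m1 m2 m3 m4 (counts read off the titles on slide 13).
X = np.array([
    [1, 0, 0, 1, 0, 0, 0, 0, 0],  # human
    [1, 0, 1, 0, 0, 0, 0, 0, 0],  # interface
    [1, 1, 0, 0, 0, 0, 0, 0, 0],  # computer
    [0, 1, 1, 0, 1, 0, 0, 0, 0],  # user
    [0, 1, 1, 2, 0, 0, 0, 0, 0],  # system
    [0, 1, 0, 0, 1, 0, 0, 0, 0],  # response
    [0, 1, 0, 0, 1, 0, 0, 0, 0],  # time
    [0, 0, 1, 1, 0, 0, 0, 0, 0],  # EPS
    [0, 1, 0, 0, 0, 0, 0, 0, 1],  # survey
    [0, 0, 0, 0, 0, 1, 1, 1, 0],  # trees
    [0, 0, 0, 0, 0, 0, 1, 1, 1],  # graph
    [0, 0, 0, 0, 0, 0, 0, 1, 1],  # minors
], dtype=float)

def term_correlation(M, a, b):
    """Pearson correlation between the rows of M for terms a and b."""
    return float(np.corrcoef(M[terms.index(a)], M[terms.index(b)])[0, 1])

# Rank-2 reconstruction: keep only the two largest singular values.
T, s, Dt = np.linalg.svd(X, full_matrices=False)
k = 2
X_k = T[:, :k] @ np.diag(s[:k]) @ Dt[:k, :]

raw_user = term_correlation(X, "human", "user")
raw_minors = term_correlation(X, "human", "minors")
lsa_user = term_correlation(X_k, "human", "user")
lsa_minors = term_correlation(X_k, "human", "minors")

print("raw:    human/user = %.2f, human/minors = %.2f" % (raw_user, raw_minors))
print("rank 2: human/user = %.2f, human/minors = %.2f" % (lsa_user, lsa_minors))
```

"human" and "user" never occur in the same title, so their raw correlation is negative; after the reduction to two dimensions the pair becomes strongly positively correlated while "human" and "minors" move further apart, which is the effect the slides report (.94 and -.83).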

  22. Some Issues with LSI • SVD algorithm complexity is O(n^2k^3) • n = number of terms + documents • k = number of dimensions in semantic space (typically small, ~50 to 350) • Although there is a lot of empirical evidence, there is no concrete proof of why LSI works

  23. Semantic Dimension • [Plot: performance vs. k] • Finding the optimal dimension for the semantic space • precision and recall improve as the dimension is increased until they reach an optimum, then slowly decrease until they match the standard vector model • in many tasks 150-350 works well; still room for research • choosing k is difficult • overfitting (superfluous dimensions) vs. underfitting (not enough dimensions)

  24. Other Applications • Has proved to be a valuable tool in many areas besides IR • summarization • cross-language IR • topic segmentation • text classification • question answering • LSA can pass the TOEFL

  25. LSA can Pass the TOEFL • Task: • Multiple-choice synonym test • Given one word, find the best match out of 4 alternatives • Training: • Corpus of 30,473 articles from Grolier's Academic American Encyclopedia • Used first ~150 words from each article => 60,768 unique words that occur at least twice • 300 singular vectors • Result: • LSI gets 52.5% correct (corrected for guessing) • Non-LSI similarity gets 15.8% (other paper 29.5%) correct • Average (foreign) human test taker gets 52.7% • Landauer, T. K. and Dumais, S. T. (1997) A solution to Plato's problem: the Latent Semantic Analysis theory of acquisition, induction and representation of knowledge. Psychological Review, 104(2), 211-240.
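
The synonym task on this slide reduces to picking the alternative whose vector has the highest cosine with the stem word. The sketch below only illustrates that decision rule: the words and their 3-dimensional vectors are invented for the example (the study used 300 dimensions derived from the encyclopedia corpus), and `best_synonym` is a hypothetical helper name.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def best_synonym(stem, alternatives, vectors):
    """Return the alternative closest to the stem word in the semantic space."""
    return max(alternatives, key=lambda word: cosine(vectors[stem], vectors[word]))

# Hypothetical word vectors, standing in for rows of the reduced term matrix.
vectors = {
    "enormous": [0.9, 0.1, 0.0],
    "huge":     [0.8, 0.2, 0.1],
    "tiny":     [-0.7, 0.3, 0.1],
    "green":    [0.0, 0.1, 0.9],
    "quick":    [0.1, 0.9, 0.1],
}

answer = best_synonym("enormous", ["tiny", "huge", "green", "quick"], vectors)
print(answer)
```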

  26. LSA can mark essays • LSA judgments of the quality of sentences correlate at r = 0.81 with expert ratings • LSA can judge how good an essay (on a well-defined set topic) is by computing the average distance between the essay to be marked and a set of model essays • The correlations are equal to between-human correlations • "If you wrote a good essay and scrambled the words you would get a good grade," Landauer said. "But try to get the good words without writing a good essay!"

  27. Good References • The group at the University of Colorado at Boulder has a web site where you can try out LSA and download papers • http://lsa.colorado.edu/ • Papers are also available at: • http://lsi.research.telcordia.com/lsi/LSI.html
