
Latent Semantic Analysis




  1. Latent Semantic Analysis Keith Trnka

  2. Latent Semantic Indexing • Application of Latent Semantic Analysis (LSA) to Information Retrieval • Motivations • Unreliable evidence • synonymy: many words refer to the same object; affects recall • polysemy: many words have multiple meanings; affects precision

  3. LSI Motivation (example) • Source: Deerwester et al. 1990

  4. LSA Solution • Terms are overly noisy • analogous to overfitting in Term-by-Document matrix • Terms and documents should be represented by vectors in a “latent” semantic space • LSI essentially infers knowledge from co-occurrence of terms • Assume “errors” (sparse data, non-co-occurrences) are normal and account for them

  5. LSA Methods • Start with a Term-by-Document matrix (A, like fig. 15.5) • Optionally weight cells • Apply Singular Value Decomposition: A = T S Dᵀ • t = # of terms • d = # of documents • n = min(t, d) • T is t×n, S is the n×n diagonal matrix of singular values, D is d×n • Approximate using k (semantic) dimensions: A ≈ A_k = T_k S_k D_kᵀ, keeping only the k largest singular values
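
As a concrete illustration of slide 5, here is a minimal NumPy sketch of the decomposition and its rank-k truncation. The tiny term-by-document matrix and word list are invented for the example, not taken from the slides.

```python
# A minimal LSI sketch: term-by-document matrix, SVD, rank-k truncation.
import numpy as np

terms = ["human", "interface", "computer", "user", "system", "survey"]
# Columns are documents; A[i, j] counts term i in document j (hypothetical data).
A = np.array([
    [1, 0, 0, 1, 0],
    [1, 0, 1, 0, 0],
    [1, 1, 0, 0, 0],
    [0, 1, 1, 0, 1],
    [0, 1, 1, 2, 0],
    [0, 0, 0, 0, 1],
], dtype=float)

# Full (thin) SVD: A = T @ diag(S) @ D.T, with n = min(t, d) singular values.
T, S, Dt = np.linalg.svd(A, full_matrices=False)

# Approximate with k semantic dimensions: keep the k largest singular values.
k = 2
T_k, S_k, D_k = T[:, :k], S[:k], Dt[:k, :].T

A_k = T_k @ np.diag(S_k) @ D_k.T  # best rank-k approximation of A
print(np.round(A_k, 2))
```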

  6. LSA Methods (cont’d) • A_k is the rank-k matrix minimizing the Euclidean (Frobenius) distance to A (hence, a least squares method) • Each row of T measures a term’s affinity to each semantic dimension • Likewise, each row of D places a document in the same space
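
The least-squares claim on slide 6 can be checked numerically: by the Eckart–Young theorem, the truncated SVD is the best rank-k approximation in Frobenius norm, and its error equals the root-sum-square of the discarded singular values. A self-contained sketch using a random stand-in matrix:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.random((6, 5))          # any term-by-document matrix will do here
T, S, Dt = np.linalg.svd(A, full_matrices=False)

k = 2
A_k = T[:, :k] @ np.diag(S[:k]) @ Dt[:k, :]

# Eckart-Young: the rank-k truncation minimizes the Frobenius (Euclidean)
# error, which equals the root-sum-square of the dropped singular values.
assert np.isclose(np.linalg.norm(A - A_k, "fro"), np.sqrt(np.sum(S[k:] ** 2)))
```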

  7. LSA Application • Querying for Information Retrieval: the query is a pseudo-document: • a weighted sum, over all terms in the query, of the rows of T • compare its similarity to all documents in D using the cosine similarity measure • Document similarity: vector comparison of rows of D • Term similarity: vector comparison of rows of T
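
A sketch of the pseudo-document idea from slide 7: fold the query into the latent space and rank documents by cosine against the rows of D_k. The slide describes a weighted sum of rows of T; Deerwester et al. additionally scale by S_k⁻¹ (q̂ = qᵀ T_k S_k⁻¹), which is the variant sketched here. The data and query are invented.

```python
import numpy as np

# Hypothetical term-by-document counts (terms as rows, documents as columns).
terms = ["human", "interface", "computer", "user", "system"]
A = np.array([
    [1, 0, 0, 1],
    [1, 0, 1, 0],
    [1, 1, 0, 0],
    [0, 1, 1, 0],
    [0, 1, 1, 2],
], dtype=float)

T, S, Dt = np.linalg.svd(A, full_matrices=False)
k = 2
T_k, S_k, D_k = T[:, :k], S[:k], Dt[:k, :].T  # rows of D_k are document vectors

def fold_in(query_terms):
    """Deerwester-style fold-in: q_hat = q^T T_k S_k^{-1}."""
    q = np.array([1.0 if t in query_terms else 0.0 for t in terms])
    return (q @ T_k) / S_k

def cosine(u, v):
    return u @ v / (np.linalg.norm(u) * np.linalg.norm(v))

q_hat = fold_in({"user", "interface"})
scores = [cosine(q_hat, d) for d in D_k]
print(np.argsort(scores)[::-1])  # document indices, best match first
```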

  8. LSA Application (cont’d) • [Figure: retrieval performance vs. k] • Choosing k is difficult • commonly k = 100, 150, 300 or so • overfitting (superfluous dimensions) vs. underfitting (not enough dimensions) • What are the k semantic dimensions? • undefined
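
Choosing k is indeed ad hoc. One common heuristic, my addition rather than anything from the slides, is to inspect how much of the matrix’s “energy” (sum of squared singular values) the top k dimensions retain:

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.poisson(0.3, size=(2000, 500)).astype(float)  # stand-in term-doc matrix

S = np.linalg.svd(A, compute_uv=False)       # singular values only
energy = np.cumsum(S**2) / np.sum(S**2)      # fraction of energy retained

for k in (50, 100, 150, 300):                # the commonly tried values
    print(k, round(float(energy[k - 1]), 3))
```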

  9. LSI Performance • Source: Dumais 1997

  10. Considerations of LSA • Conceptually high recall: query and document terms may be disjoint • Polysemes not handled well • LSI: unsupervised/completely automatic • Language independent • CL-LSI: Cross-Language LSI • weakly trained • Computational complexity is high • optimization: random sampling methods (see the sketch below) • Formal linear algebra foundations • Models language acquisition in children
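
On the computational-complexity bullet: exact SVD of a large sparse term-by-document matrix is expensive, and randomized/truncated solvers are the usual workaround. A sketch using scikit-learn’s TruncatedSVD; this particular library choice is an assumption, not something the slides specify.

```python
import numpy as np
from scipy.sparse import random as sparse_random
from sklearn.decomposition import TruncatedSVD

# Stand-in sparse term-by-document matrix (10k terms x 2k documents).
A = sparse_random(10_000, 2_000, density=0.001, random_state=0, format="csr")

# Randomized truncated SVD avoids materializing the full decomposition.
svd = TruncatedSVD(n_components=100, algorithm="randomized", random_state=0)
doc_vectors = svd.fit_transform(A.T)  # one k-dimensional row per document
print(doc_vectors.shape)              # (2000, 100)
```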

  11. Fun Applications of LSA • Synonyms on TOEFL • Train on a corpus of newspapers, Grolier’s encyclopedia, and children’s reading materials • Use term similarity, with random guessing for unknown terms • Best results at k = 400 • The model got 51.5 items correct, or 64.4% (52.5% corrected for guessing). By comparison, a large sample of applicants to U.S. colleges from non-English speaking countries who took tests containing these items averaged 51.6 items correct, or 64.5% (52.7% corrected for guessing).
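
A sketch of how such a TOEFL-style item can be scored with term similarity: pick the answer option whose term vector is closest (by cosine) to the stem word. The miniature corpus is random stand-in data; the real experiment trained on a large corpus with k = 400, so the output here is only illustrative.

```python
import numpy as np

# TOEFL-style item: stem "levied", four answer options.
terms = ["levied", "imposed", "believed", "requested", "correlated"]
rng = np.random.default_rng(2)
A = rng.poisson(0.5, size=(len(terms), 50)).astype(float)  # stand-in counts

T, S, _ = np.linalg.svd(A, full_matrices=False)
k = 3
term_vecs = T[:, :k] * S[:k]  # scaled term vectors, one row per term

def cosine(u, v):
    return u @ v / (np.linalg.norm(u) * np.linalg.norm(v))

stem = term_vecs[terms.index("levied")]
options = ["imposed", "believed", "requested", "correlated"]
best = max(options, key=lambda w: cosine(stem, term_vecs[terms.index(w)]))
print(best)  # whichever option the stand-in data happens to favor
```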

  12. Subject-Matter Knowledge • Introduction to Psychology multiple-choice tests • from New Mexico State University and the University of Colorado, Boulder • LSA scored significantly below the student average, although it received a passing grade • LSA has trouble with knowledge not present in its training corpus

  13. Text Coherence/Comprehension • Kintsch and colleagues • Convert a text to a propositional (logic) representation and test for coherence • Slow, done by hand • LSA approach • Compute the cosine between each sentence or paragraph (window of discourse) and the next • Predicted comprehension with r = .93
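
A sketch of the coherence measure from slide 13: represent each window of discourse as the sum of its term vectors in the latent space, then take the cosine between adjacent windows; a higher average cosine predicts better comprehension. The vocabulary, training matrix, and sentences are all invented for illustration.

```python
import numpy as np

vocab = ["cell", "blood", "body", "oxygen", "heart"]
rng = np.random.default_rng(3)
A = rng.poisson(0.6, size=(len(vocab), 40)).astype(float)  # stand-in counts

T, S, _ = np.linalg.svd(A, full_matrices=False)
k = 3
term_vecs = T[:, :k] * S[:k]

def sentence_vec(words):
    """Sum the latent-space vectors of the in-vocabulary words."""
    idx = [vocab.index(w) for w in words if w in vocab]
    return term_vecs[idx].sum(axis=0)

def cosine(u, v):
    return u @ v / (np.linalg.norm(u) * np.linalg.norm(v))

sentences = [["blood", "carries", "oxygen"],
             ["oxygen", "reaches", "every", "cell"],
             ["the", "heart", "pumps", "blood"]]
vecs = [sentence_vec(s) for s in sentences]
coherence = [cosine(u, v) for u, v in zip(vecs, vecs[1:])]
print(np.mean(coherence))  # higher mean cosine ~ more coherent text
```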

  14. Essay Grading

  15. Summary • Latent Semantic Analysis has a wide variety of useful applications • Useful in cognitive psychology research • Useful as a generic IR technique • Possibly useful for lazy TAs?

  16. References (Papers, etc.) • http://www-nlp.stanford.edu/fsnlp/ir/fsnlp-slides-ir.pdf (Manning and Schütze) • http://www.cs.utk.edu/~lsi/ • Using LSI for Information Retrieval, Information Filtering and Other Things (Dumais 1997) • Indexing by Latent Semantic Analysis (Deerwester et al. 1990) • Automatic Cross-Language IR Using LSI (Dumais et al. 1996) • http://lsa.colorado.edu/ (Landauer et al.) • An Introduction to Latent Semantic Analysis (1998) • A Solution to Plato’s Problem… (1997) • How Well Can Passage Meaning be Derived without Using Word Order? … (1997)

  17. References (Books) • NLP related • Foundations of Statistical NLP (Manning and Schütze 2003) • SVD related • Linear Algebra (Strang 1998) • Matrix Computations, 3rd ed. (Golub and Van Loan 1996)

  18. LSA Example/The End
