Latent Semantic Analysis Keith Trnka
Latent Semantic Indexing • Application of Latent Semantic Analysis (LSA) to Information Retrieval • Motivations • Unreliable evidence • synonymy: many words refer to the same object; affects recall • polysemy: many words have multiple meanings; affects precision
LSI Motivation (example) • [figure: worked example; source: Deerwester et al. 1990]
LSA Solution • Raw terms are overly noisy evidence • analogous to overfitting in the term-by-document matrix • Terms and documents should instead be represented by vectors in a “latent” semantic space • LSI essentially infers knowledge from the co-occurrence of terms • Assume “errors” (sparse data, missing co-occurrences) are normal and account for them
LSA Methods • Start with a term-by-document matrix (A, like fig. 15.5 of Manning and Schütze) • Optionally weight cells (e.g., tf-idf or log-entropy weighting) • Apply Singular Value Decomposition: A = T S Dᵀ, where T is t × n, S is the n × n diagonal matrix of singular values, and D is d × n • t = # of terms • d = # of documents • n = min(t, d) • Approximate A using k (semantic) dimensions: A ≈ A_k = T_k S_k D_kᵀ, keeping only the k largest singular values
LSA Methods (cont’d) • A_k is the rank-k matrix that minimizes the Euclidean (Frobenius) distance to A (hence, a least-squares method) • Each row of T measures the similarity of a term to each semantic dimension • Likewise, each row of D does for documents
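A minimal sketch of these steps in Python with numpy (not from the original slides; the toy matrix values are invented for illustration): build the term-by-document matrix A, take the full SVD, and keep only the k largest singular values.

```python
import numpy as np

# Toy term-by-document matrix A (t = 5 terms, d = 4 documents).
# Cells are raw counts here; the slides note they may optionally be weighted.
A = np.array([[1, 0, 1, 0],
              [1, 1, 0, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 1],
              [1, 0, 0, 1]], dtype=float)

# Full SVD: A = T S D^T, with n = min(t, d) singular values.
T, s, Dt = np.linalg.svd(A, full_matrices=False)

# Keep only the k largest singular values ("semantic" dimensions).
k = 2
T_k, S_k, D_k = T[:, :k], np.diag(s[:k]), Dt[:k, :].T

# A_k is the best rank-k approximation of A in the least-squares
# (Euclidean/Frobenius) sense, exactly as the slide describes.
A_k = T_k @ S_k @ D_k.T
print(np.linalg.norm(A - A_k))  # residual error of the approximation
```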
LSA Application • Querying for Information Retrieval: the query is a pseudo-document: • a weighted sum over all terms in the query of the rows of T • compare its similarity to all documents in D using the cosine similarity measure • Document similarity: vector comparison of rows of D • Term similarity: vector comparison of rows of T
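A hedged sketch of the querying step, continuing the variables from the SVD sketch above (the query vector and its term indices are invented). The query is folded into the latent space as a pseudo-document q̂ = qᵀ T_k S_k⁻¹ and compared to each document's row of D_k by cosine similarity.

```python
import numpy as np

def cosine(u, v):
    # Cosine similarity between two vectors.
    return u @ v / (np.linalg.norm(u) * np.linalg.norm(v))

# Query as a raw term vector over the same 5-term vocabulary
# (here the query contains terms 0 and 2).
q = np.array([1, 0, 1, 0, 0], dtype=float)

# Fold into the latent space: the weighted sum over query terms of
# rows of T, rescaled by the singular values.
q_hat = q @ T_k @ np.linalg.inv(S_k)

# Rank all documents by cosine similarity to the pseudo-document.
scores = [cosine(q_hat, D_k[j]) for j in range(D_k.shape[0])]
print(np.argsort(scores)[::-1])  # documents, most to least similar
```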
LSA Application (cont’d) • [figure: retrieval performance as a function of k] • Choosing k is difficult • commonly k = 100, 150, 300 or so • overfitting (superfluous dimensions) vs. underfitting (not enough dimensions) • What are the k semantic dimensions? • undefined
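An illustrative sketch of why choosing k is hard, reusing A from the earlier sketch: the reconstruction error always falls as k grows, so k cannot be chosen by fit alone; in practice it is tuned against retrieval performance.

```python
import numpy as np

T, s, Dt = np.linalg.svd(A, full_matrices=False)
for k in range(1, len(s) + 1):
    A_k = T[:, :k] @ np.diag(s[:k]) @ Dt[:k, :]
    # Error shrinks monotonically; at k = min(t, d) it reaches zero,
    # which amounts to "overfitting" the noisy raw term counts.
    print(k, round(np.linalg.norm(A - A_k), 3))
```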
LSI Performance • [figure: retrieval performance results; source: Dumais 1997]
Considerations of LSA • Conceptually high recall: query and document terms may be completely disjoint • Polysemy is not handled well • LSI: unsupervised/completely automatic • Language independent • CL-LSI: Cross-Language LSI • weakly trained • Computational complexity is high • optimization: random sampling methods • Formal linear algebra foundations • Models language acquisition in children
Fun Applications of LSA • Synonyms on TOEFL • Train on a corpus of newspapers, Grolier’s encyclopedia, and children’s reading material • Use term similarity, with random guessing for unknown terms • Best results at k = 400 • The model got 51.5 items correct, or 64.4% (52.5% corrected for guessing). By comparison, a large sample of applicants to U.S. colleges from non-English-speaking countries who took tests containing these items averaged 51.6 items correct, or 64.5% (52.7% corrected for guessing).
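A hedged sketch of the synonym-test decision rule under the same toy model as above (the stem word and candidate term indices are invented; rows of T_k scaled by S_k serve as term vectors, and cosine is the helper defined earlier).

```python
# One row per term in the latent semantic space.
term_vecs = T_k @ S_k

# TOEFL-style item: pick the candidate most similar to the stem word.
stem, candidates = 0, [1, 2, 3, 4]
best = max(candidates, key=lambda j: cosine(term_vecs[stem], term_vecs[j]))
print(best)  # index of the chosen synonym
```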
Subject-Matter Knowledge • Introduction to Psychology multiple-choice tests • from New Mexico State University and the University of Colorado, Boulder • LSA scored significantly below the student average, although it received a passing grade • LSA has trouble with knowledge not present in its training corpus
Text Coherence/Comprehension • Kintsch and colleagues • Earlier approach: convert a text to a logical representation and test for coherence • slow, done by hand • LSA approach • compute the cosine between each sentence or paragraph (window of discourse) and the next one • predicted comprehension with r = .93
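A minimal sketch of the coherence measure, continuing the toy model above (the sentence term vectors are invented): represent each window of discourse as a pseudo-document and take the cosine between consecutive windows.

```python
import numpy as np

# Each row: one sentence's raw term vector over the same 5-term vocabulary.
sentences = np.array([[1, 1, 0, 0, 0],
                      [0, 1, 1, 0, 0],
                      [0, 0, 0, 1, 1]], dtype=float)

# Project sentences into the latent space (same fold-in as for queries).
sent_vecs = sentences @ T_k @ np.linalg.inv(S_k)

# Coherence: cosine between each sentence and the next; the mean over a
# text is one crude predictor of how comprehensible it is.
pairs = [cosine(sent_vecs[i], sent_vecs[i + 1])
         for i in range(len(sent_vecs) - 1)]
print(np.mean(pairs))
```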
Summary • Latent Semantic Analysis has a wide variety of useful applications • Useful in cognitive psychology research • Useful as a generic IR technique • Possibly useful for lazy TAs?
References (Papers, etc.) • http://www-nlp.stanford.edu/fsnlp/ir/fsnlp-slides-ir.pdf (Manning and Schütze) • http://www.cs.utk.edu/~lsi/ • Using LSI for Information Retrieval, Information Filtering and Other Things (Dumais 1997) • Indexing by Latent Semantic Analysis (Deerwester et al. 1990) • Automatic Cross-Language IR Using LSI (Dumais et al. 1996) • http://lsa.colorado.edu/ (Landauer et al.) • An Introduction to Latent Semantic Analysis (1998) • A Solution to Plato’s Problem… (1997) • How Well Can Passage Meaning be Derived without Using Word Order? … (1997)
References (Books) • NLP related • Foundations of Statistical NLP (Manning and Schütze, 2003) • SVD related • Linear Algebra (Strang, 1998) • Matrix Computations, 3rd ed. (Golub and Van Loan, 1996)