Latent Semantic Indexing

Jieping Ye, Department of Computer Science & Engineering, Arizona State University, http://www.public.asu.edu/~jye02

Singular Value Decomposition • $A = U \Sigma V^T$, where $U$ is orthogonal, $\Sigma$ is diagonal, and $V^T$ is orthogonal

Some Properties of SVD

Some Properties of SVD • That is, $A_k$ is the optimal approximation of $A$ among all matrices of rank at most k, with the approximation error measured by the Frobenius norm • This forms the basis of LSI (Latent Semantic Indexing) in information retrieval
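The optimality property can be checked numerically. The following is a minimal sketch using NumPy, not part of the original slides; the random test matrix, the helper names `truncated_svd` and `frobenius_error`, and the choice k = 2 are assumptions made only for illustration. It builds the rank-k approximation from the k largest singular triplets and compares its Frobenius error with that of another rank-k matrix.

```python
import numpy as np

def truncated_svd(A, k):
    """Rank-k approximation of A built from the k largest singular triplets."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    return U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

def frobenius_error(A, B):
    """Frobenius norm of the difference A - B."""
    return np.linalg.norm(A - B, ord="fro")

# Hypothetical test data: a random 6 x 5 matrix (assumption for illustration).
rng = np.random.default_rng(0)
A = rng.standard_normal((6, 5))
k = 2

A_k = truncated_svd(A, k)
err_svd = frobenius_error(A, A_k)

# The error of the truncated SVD equals the norm of the discarded singular values.
s = np.linalg.svd(A, compute_uv=False)
tail = np.sqrt(np.sum(s[k:] ** 2))

# Some other matrix of rank at most k, built as a product of random factors.
B = rng.standard_normal((6, k)) @ rng.standard_normal((k, 5))
err_other = frobenius_error(A, B)

print(err_svd, tail, err_other)
```

Any other matrix of rank at most k could be substituted for `B`; the property says its error can never be smaller than `err_svd`.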

Low rank approximation by SVD

Applications of SVD • Pseudoinverse • Range, null space and rank • Matrix approximation • Other examples http://en.wikipedia.org/wiki/Singular_value_decomposition
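As a rough illustration of the first two applications, the sketch below derives the rank, the pseudoinverse, and bases for the range and null space from a single SVD. It is an assumed example, not taken from the slides: the function name `svd_summary`, the tolerance `tol`, and the small rank-deficient matrix are all hypothetical.

```python
import numpy as np

def svd_summary(A, tol=1e-10):
    """Rank, pseudoinverse, range basis and null-space basis of A via SVD.

    tol is an assumed absolute threshold below which a singular value
    is treated as zero.
    """
    m, n = A.shape
    U, s, Vt = np.linalg.svd(A, full_matrices=True)
    rank = int(np.sum(s > tol))

    # Pseudoinverse: invert only the nonzero singular values.
    s_inv = np.zeros_like(s)
    s_inv[:rank] = 1.0 / s[:rank]
    Sigma_pinv = np.zeros((n, m))
    Sigma_pinv[:len(s), :len(s)] = np.diag(s_inv)
    A_pinv = Vt.T @ Sigma_pinv @ U.T

    range_basis = U[:, :rank]      # columns span the range of A
    null_basis = Vt[rank:, :].T    # columns span the null space of A
    return rank, A_pinv, range_basis, null_basis

# Hypothetical matrix whose second row is twice the first, so the rank is 2.
A = np.array([[1.0, 2.0, 3.0],
              [2.0, 4.0, 6.0],
              [1.0, 0.0, 1.0]])

rank, A_pinv, range_basis, null_basis = svd_summary(A)
print(rank)
print(np.round(A @ null_basis, 12))
```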

LSI (Latent Semantic Indexing) • Introduction • Latent Semantic Indexing • LSI • Query • Updating • An example

Problem Introduction • Traditional term-matching methods do not work well in information retrieval • We want to capture the concepts instead of the words. Concepts are reflected in the words. However: • One term may have multiple meanings (polysemy) • Different terms may have the same meaning (synonymy)

LSI (Latent Semantic Indexing) • The LSI approach tries to overcome the deficiencies of term-matching retrieval by treating the unreliability of observed term-document association data as a statistical problem. • The goal is to find effective models to represent the relationship between terms and documents. Hence a set of terms, which is by itself incomplete and unreliable, is replaced by a set of entities that are more reliable indicators.

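LSI, the Method • Form the document-term matrix M • Decompose M by SVD: $M = U \Sigma V^T$ • Approximate M using the truncated SVD: $M_k = U_k \Sigma_k V_k^T$, keeping only the k largest singular values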

LSI, the Method (cont.) • Each row and each column of M (that is, each document and each term) is mapped into the k-dimensional LSI space by the SVD.
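The method can be sketched in a few lines of NumPy. This is an illustrative sketch under stated assumptions, not the example from the slides: the vocabulary, the 5 x 5 count matrix, the function name `lsi_fit`, and the choice k = 2 are hypothetical, and rows are assumed to be documents and columns terms. Scaling the coordinates by the singular values is one common convention among several.

```python
import numpy as np

# Hypothetical toy collection: rows are documents, columns are terms.
terms = ["ship", "boat", "ocean", "wood", "tree"]
M = np.array([
    [1, 0, 1, 0, 0],
    [0, 1, 1, 0, 0],
    [1, 1, 0, 0, 0],
    [0, 0, 0, 1, 1],
    [0, 0, 0, 1, 0],
], dtype=float)

def lsi_fit(M, k):
    """Truncated SVD of a document-term matrix M, keeping k dimensions."""
    U, s, Vt = np.linalg.svd(M, full_matrices=False)
    U_k, s_k, Vt_k = U[:, :k], s[:k], Vt[:k, :]
    doc_coords = U_k * s_k       # row i: document i in the k-dim space
    term_coords = Vt_k.T * s_k   # row j: term j in the k-dim space
    M_k = U_k @ np.diag(s_k) @ Vt_k   # rank-k approximation of M
    return U_k, s_k, Vt_k, doc_coords, term_coords, M_k

k = 2
U_k, s_k, Vt_k, doc_coords, term_coords, M_k = lsi_fit(M, k)

print(np.round(doc_coords, 3))
print(np.round(term_coords, 3))
```

With this scaling, the document coordinates are the same as projecting each row of M onto the top k right singular vectors, which is why new vectors can later be mapped with the same factors.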

Query • A query q is also mapped into this space, using the same truncated SVD factors • Its similarity to the documents is then compared in the new space • Intuition: Dimension reduction through LSI brings together “related” axes in the vector space.
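A query can be folded into the reduced space as sketched below. The mapping formula did not survive on the slide, so the version used here, $\hat{q} = q V_k \Sigma_k^{-1}$ for a document-term matrix with documents as rows, is an assumption based on one common convention; the toy data, the helper names `fold_in_query` and `cosine`, and k = 2 are likewise hypothetical.

```python
import numpy as np

# Hypothetical toy collection: rows are documents, columns are terms.
terms = ["ship", "boat", "ocean", "wood", "tree"]
M = np.array([
    [1, 0, 1, 0, 0],
    [0, 1, 1, 0, 0],
    [1, 1, 0, 0, 0],
    [0, 0, 0, 1, 1],
    [0, 0, 0, 1, 0],
], dtype=float)

def fold_in_query(q, Vt_k, s_k):
    """Map a term vector q into the k-dim space: q V_k Sigma_k^{-1} (assumed convention)."""
    return (q @ Vt_k.T) / s_k

def cosine(a, B):
    """Cosine similarity between vector a and each row of B."""
    return (B @ a) / (np.linalg.norm(B, axis=1) * np.linalg.norm(a))

k = 2
U, s, Vt = np.linalg.svd(M, full_matrices=False)
U_k, s_k, Vt_k = U[:, :k], s[:k], Vt[:k, :]

# Query containing only the term "boat".
q = np.array([0, 1, 0, 0, 0], dtype=float)
q_hat = fold_in_query(q, Vt_k, s_k)

lsi_scores = cosine(q_hat, U_k)   # similarity in the LSI space
term_match_scores = M @ q         # plain term overlap, for comparison

print(np.round(lsi_scores, 3))
print(term_match_scores)
```

In this toy setup the first document shares no term with the query, so plain term matching scores it zero, while in the LSI space it lines up with the query because "ship", "boat" and "ocean" co-occur across documents.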

Example

Example (cont.)

Example (cont. Mapping)

Example (cont. Query) Query: Application and Theory

Example (cont. Query)

How to set the value of k? • LSI is useful only if k << n. • If k is too large, it doesn't capture the underlying latent semantic space; if k is too small, too much is lost. • No principled way of determining the best k.
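The trade-off can at least be inspected numerically. The sketch below is a diagnostic, not a principled selection rule: it reports, for each k, the relative Frobenius error of the rank-k approximation, computed from the discarded singular values. The toy matrix and the function name `relative_errors` are assumptions for illustration.

```python
import numpy as np

def relative_errors(M):
    """Relative Frobenius error of the rank-k approximation, for k = 0..r."""
    s = np.linalg.svd(M, compute_uv=False)
    total = np.sum(s ** 2)
    # tail[k] is the sum of squared singular values discarded at rank k.
    tail = total - np.concatenate(([0.0], np.cumsum(s ** 2)))
    tail = np.clip(tail, 0.0, None)   # guard against tiny negative round-off
    return np.sqrt(tail / total)

# Hypothetical document-term matrix (rows: documents, columns: terms).
M = np.array([
    [1, 0, 1, 0, 0],
    [0, 1, 1, 0, 0],
    [1, 1, 0, 0, 0],
    [0, 0, 0, 1, 1],
    [0, 0, 0, 1, 0],
], dtype=float)

errs = relative_errors(M)
for k, e in enumerate(errs):
    print(k, round(float(e), 4))
```

A low reconstruction error does not by itself imply good retrieval quality, so in practice k is usually tuned against retrieval performance on held-out queries.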

How well does LSI work? • The effectiveness of LSI compared to regular term-matching depends on the nature of the documents. • Typical improvement: 0 to 30% better precision. • The advantage is greater for texts in which synonymy and ambiguity are more prevalent. • Best when recall is high. • The costs of LSI might outweigh the improvement. • SVD is computationally expensive, which limits its use for very large document collections. • An inverted index cannot be used.

