Latent Semantic Indexing: A Probabilistic Analysis Christos Papadimitriou, Prabhakar Raghavan, Hisao Tamaki, Santosh Vempala
Motivation • Applications in several areas: • querying • clustering, identifying topics • Other: • synonym recognition (e.g. the TOEFL synonym test) • psychology tests • essay scoring
Motivation • Latent Semantic Indexing is • Latent: Captures associations which are not explicit • Semantic: Represents meaning as a function of similarity to other entities • Cool: Lots of spiffy applications, and the potential for some good theory too
Overview • IR and two classical problems • How LSI works • Why LSI is effective: A probabilistic analysis
Information Retrieval • Text corpus with many documents (docs) • Given a query, find relevant docs • Classical problems: • synonymy: missing docs that refer to “automobile” when querying on “car” • polysemy: retrieving docs about the internet when querying on “surfing” • Solution: represent docs (and queries) by their underlying latent concepts
Information Retrieval • Represent each document as a word vector • Represent the corpus as a term-document matrix (T-D matrix) • A classical method: • create a vector from the query terms • return the documents with the highest dot product
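A minimal sketch of this classical dot-product retrieval on a hypothetical toy corpus (the terms, documents, and query below are illustrative, not from the paper); it also shows the synonymy problem in action:

```python
import numpy as np

terms = ["car", "automobile", "surfing", "internet"]
docs = [
    "automobile automobile",      # relevant to "car", but never says "car"
    "car automobile",
    "surfing internet internet",
]

# Term-document matrix A: rows = terms, columns = docs, entries = raw counts.
A = np.array([[d.split().count(t) for d in docs] for t in terms], dtype=float)

# Represent the query the same way and rank docs by dot product.
q = np.array(["car".split().count(t) for t in terms], dtype=float)
print(q @ A)   # doc 0 scores 0 despite being relevant: the synonymy problem
```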
Latent Semantic Indexing (LSI) • Process the term-document (T-D) matrix to expose its statistical structure • Convert the high-dimensional space to a lower-dimensional space, throw out the noise, keep the good stuff • Related to principal component analysis (PCA) and multidimensional scaling (MDS)
Parameters • U = universe of terms • n = number of terms • m = number of docs • A = n x m matrix with rank r • columns represent docs • rows represent terms
Singular Value Decomposition (SVD) • LSI uses the SVD, a linear-algebraic factorization: A = U D V^T
SVD • r is the rank of A • D is the r x r diagonal matrix of the singular values • U and V have orthonormal columns • The SVD always exists • numerical methods for computing it are standard • run time: O(mnc), where c is the average number of words per document
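A minimal sketch of rank-k LSI via the SVD; the toy matrix and the choice of k are illustrative assumptions, not values from the paper:

```python
import numpy as np

A = np.random.rand(6, 10)          # toy 6-term x 10-doc term-document matrix
k = 2                              # number of latent dimensions to keep

U, s, Vt = np.linalg.svd(A, full_matrices=False)   # A = U @ diag(s) @ Vt
U_k, D_k, Vt_k = U[:, :k], np.diag(s[:k]), Vt[:k, :]

A_k = U_k @ D_k @ Vt_k             # best rank-k approximation of A
doc_vectors = D_k @ Vt_k           # k-dimensional representations of the docs
```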
Synonymy • LSI is used in several ways: e.g. detecting synonymy • A measure of similarity for two terms i and j: • In the original space: dot product of rows i and j of A (the (i, j) entry of A A^T) • Better: dot product of rows i and j of U_k D_k (the (i, j) entry of U_k D_k^2 U_k^T = A_k A_k^T)
Synonymy (intuition) • Consider the term-term autocorrelation matrix A A^T • If two terms co-occur (e.g. supply-demand), their rows in A are nearly identical • The difference of those rows therefore corresponds to a small eigenvalue of A A^T • That eigenvector is likely projected out in the rank-k approximation, since it carries little weight, so the two terms end up with nearly identical representations
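A sketch of the two similarity measures on a toy matrix in which “supply” and “demand” co-occur (the matrix is an illustrative assumption, not data from the paper):

```python
import numpy as np

# Rows: supply, demand, surf. Columns: 4 toy documents.
A = np.array([[2., 1., 0., 0.],
              [1., 2., 0., 0.],
              [0., 0., 3., 1.]])

U, s, Vt = np.linalg.svd(A, full_matrices=False)
k = 2
UD_k = U[:, :k] @ np.diag(s[:k])

original = A @ A.T          # (i, j) entry: term similarity in the original space
lsi = UD_k @ UD_k.T         # (i, j) entry: term similarity after rank-k LSI

# supply-demand similarity: 4.0 before, 4.5 after; the two rows become identical
# once the weak "supply minus demand" direction is projected out.
print(original[0, 1], lsi[0, 1])
```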
A Performance Evaluation • Landauer & Dumais • Performed LSI on 30,000 encyclopedia articles • Took the synonym portion of the TOEFL test: choose the most similar word • LSI: 64.4% (52.2% corrected for guessing) • People: 64.5% (52.7% corrected for guessing) • LSI's errors correlated .44 with the incorrect alternatives chosen by test takers
A Probabilistic Analysis: overview • The model: • Topics sufficiently disjoint • Each doc drawn from a single (random) topic • Result: • With high probability (whp): • Docs from the same topic will be similar • Docs from different topics will be dissimilar
The Probabilistic Model • k topics, each corresponding to a set of words • The sets are mutually disjoint • Below, all random choices are made uniformly at random • A corpus of m docs, each doc created as follows:
The Probabilistic Model (cont.) • Choosing a doc: • choose the length ℓ of the doc • choose a topic T • Repeat ℓ times: • with prob 1 − ε, choose a word from topic T • with prob ε, choose a word from the other topics
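A sketch of this generative model; the number of topics, vocabulary sizes, document lengths, and eps value below are illustrative assumptions, not parameters from the paper:

```python
import random

k = 3                                       # number of topics
# k mutually disjoint word sets, one per topic (hypothetical vocabulary).
topics = [[f"t{i}_w{j}" for j in range(50)] for i in range(k)]

def make_doc(eps=0.1, length=100):
    i = random.randrange(k)                 # choose a topic uniformly at random
    other = [w for j, ws in enumerate(topics) if j != i for w in ws]
    words = []
    for _ in range(length):                 # repeat `length` times
        if random.random() < 1 - eps:
            words.append(random.choice(topics[i]))   # word from the chosen topic
        else:
            words.append(random.choice(other))       # word from the other topics
    return i, words

corpus = [make_doc() for _ in range(200)]   # m = 200 docs, each labeled with its topic
```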
Set up • Let v_d be the vector assigned to doc d by the rank-k LSI performed on the corpus • The rank-k LSI is ε-skewed if, for every pair of docs d, d' from different topics, v_d · v_d' ≤ ε |v_d| |v_d'|, and for every pair from the same topic, v_d · v_d' ≥ (1 − ε) |v_d| |v_d'| • (intuition) Docs from the same topic should be similar (high dot product), docs from different topics dissimilar (low dot product)
The Result • Theorem: Assume the corpus is created from the model just described (k topics, etc.). Then the rank-k LSI is ε-skewed with high probability.
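One way to see the theorem at work is to measure the skew empirically on a synthetic corpus such as the one sketched above; the snippet below reuses `topics`, `corpus`, and `k` from that sketch and is purely illustrative, not the paper's analysis:

```python
import numpy as np

vocab = [w for ws in topics for w in ws]
index = {w: i for i, w in enumerate(vocab)}

# Term-document matrix for the synthetic corpus.
A = np.zeros((len(vocab), len(corpus)))
for d, (_, words) in enumerate(corpus):
    for w in words:
        A[index[w], d] += 1

U, s, Vt = np.linalg.svd(A, full_matrices=False)
V_k = np.diag(s[:k]) @ Vt[:k, :]            # columns are the LSI doc vectors v_d

norms = np.linalg.norm(V_k, axis=0)
cos = (V_k.T @ V_k) / (norms[:, None] * norms[None, :])

labels = np.array([t for t, _ in corpus])
same = labels[:, None] == labels[None, :]
off_diag = ~np.eye(len(corpus), dtype=bool)

print("same-topic mean cosine: ", cos[same & off_diag].mean())   # should be near 1
print("cross-topic mean cosine:", cos[~same].mean())             # should be near 0
```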
Proof Sketch • Show that with k topics we obtain k (nearly) orthogonal subspaces • First assume strictly disjoint topics (ε = 0): • show that whp the k largest eigenvalues of A A^T indeed correspond to the k topics (are not intra-topic) • Then (ε > 0): relax this assumption using a matrix perturbation analysis
Extensions • Ideally, theory should go beyond explaining • Potential for speed-up: • randomly project the doc vectors onto a suitably small space • perform LSI in this smaller space • Yields O(m(n + c log n)) compared to O(mnc)
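A sketch of the speed-up idea (random projection followed by SVD on the smaller matrix); the dimensions and the Gaussian projection are illustrative choices, not the paper's exact construction:

```python
import numpy as np

n, m, d, k = 2000, 300, 100, 10      # terms, docs, reduced dimension, LSI rank (assumed)
A = np.random.rand(n, m)             # stand-in for the term-document matrix

R = np.random.randn(d, n) / np.sqrt(d)   # random projection of the term space
B = R @ A                                # d x m projected doc vectors (d << n)

U, s, Vt = np.linalg.svd(B, full_matrices=False)   # SVD on the much smaller matrix
doc_vectors = np.diag(s[:k]) @ Vt[:k, :]           # approximate rank-k LSI doc vectors
```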
Future work • Learn more abstract algebra (math)! • Extensions: • docs spanning multiple topics? • polysemy? • other positive properties? • Another important role of theory: • Unify and generalize: spectral analysis has found applications elsewhere in IR