Latent Semantic Indexing: A Probabilistic Analysis Christos Papadimitriou, Prabhakar Raghavan, Hisao Tamaki, Santosh Vempala
Motivation • Applications in several areas: • querying • clustering, identifying topics • Other: • synonym recognition (TOEFL…) • psychology tests • essay scoring
Motivation • Latent Semantic Indexing is • Latent: Captures associations which are not explicit • Semantic: Represents meaning as a function of similarity to other entities • Cool: Lots of spiffy applications, and the potential for some good theory too
Overview • IR and two classical problems • How LSI works • Why LSI is effective: A probabilistic analysis
Information Retrieval • Text corpus with many documents (docs) • Given a query, find the relevant docs • Classical problems: • synonymy: missing docs that refer to “automobile” when querying on “car” • polysemy: retrieving docs about the Internet when querying on “surfing” • Solution: represent docs (and queries) by their underlying latent concepts
Information Retrieval • Represent each document as a word vector • Represent corpus as term-document matrix (T-D matrix) • A classical method: • Create new vector from query terms • Find documents with highest dot-product
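A minimal sketch of this classical vector-space scheme, assuming a tiny made-up corpus and plain term counts (the names and data below are illustrative only, not from the paper):

```python
import numpy as np

# Toy corpus (illustrative only): rows of A are terms, columns are docs.
terms = ["car", "automobile", "engine", "surfing", "internet"]
docs = [
    "car engine engine",          # doc 0: about cars
    "automobile engine",          # doc 1: about cars, but uses the synonym
    "surfing internet internet",  # doc 2: about the web
]
A = np.array([[d.split().count(t) for d in docs] for t in terms], dtype=float)

def retrieve(query):
    """Score each doc by the dot product of its column with the query vector."""
    q = np.array([query.split().count(t) for t in terms], dtype=float)
    return A.T @ q  # one score per document

print(retrieve("car"))  # misses doc 1, which only says "automobile" (synonymy)
```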
Latent Semantic Indexing (LSI) • Process the term-document (T-D) matrix to expose its statistical structure • Convert the high-dimensional space to a lower-dimensional one, throw out the noise, keep the good stuff • Related to principal component analysis (PCA) and multidimensional scaling (MDS)
Parameters • U = universe of terms • n = number of terms • m = number of docs • A = n x m matrix with rank r • columns represent docs • rows represent terms
Singular Value Decomposition (SVD) • LSI uses SVD, a linear-algebra decomposition: A = U D V^T
SVD • r is the rank of A • D is the r × r diagonal matrix of the singular values • U and V have orthonormal columns • SVD is always possible • numerical methods for SVD exist • run time: O(mnc), where c is the average number of words per document
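A minimal sketch of the decomposition and its rank-k truncation, using numpy's SVD (the helper name lsi and the library choice are ours, not the slides'):

```python
import numpy as np

def lsi(A, k):
    """Rank-k LSI: keep only the k largest singular values of A = U D V^T."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)  # s: singular values, descending
    A_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]       # best rank-k approximation of A
    doc_vectors = np.diag(s[:k]) @ Vt[:k, :]          # k-dim vector for each doc (columns)
    return A_k, doc_vectors
```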
Synonymy • LSI is used in several ways: e.g. detecting synonymy • A measure of similarity for two terms i and j: • In the original space: dot product of rows i and j of A (the (i, j) entry of A A^T) • Better: dot product of rows i and j of the rank-k approximation A_k (the (i, j) entry of A_k A_k^T)
Synonymy (intuition) • Consider the term-term autocorrelation matrix A A^T • If two terms co-occur (e.g. supply-demand), their rows in A are nearly identical • The difference of those rows is then (close to) an eigenvector of A A^T with a small eigenvalue • That eigenvector is likely projected out in A_k, since it carries a weak eigenvalue, so the two terms end up with nearly identical rows in A_k
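A small illustration of the two similarity measures above, reusing A, terms, and lsi from the earlier sketches; the expectation that the LSI similarity comes out higher is the point of the synonymy argument, not a guarantee for every corpus:

```python
# Term-term similarities: (i, j) entries of A A^T versus A_k A_k^T.
sim_original = A @ A.T
A_k, _ = lsi(A, k=2)
sim_lsi = A_k @ A_k.T

i, j = terms.index("car"), terms.index("automobile")
print(sim_original[i, j], sim_lsi[i, j])  # 0.0 in the original space; higher after LSI
```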
A Performance Evaluation • Landauer & Dumais • Perform LSI on 30,000 encyclopedia articles • Take the synonym portion of the TOEFL test: choose the most similar word • LSI: 64.4% (52.2% corrected for guessing) • People: 64.5% (52.7% corrected for guessing) • Correlated .44 with incorrect alternatives
A Probabilistic Analysisoverview • The model: • Topics sufficiently disjoint • Each doc drawn from a single (random) topic • Result: • With high probability (whp) : • Docs from the same topic will be similar • Docs from different topics will be dissimilar
The Probabilistic Model • K topics, each corresponding to a set of words • The sets are mutually disjoint • Below, all random choices are made uniformly at random • A corpus of m docs, each doc created as follows..
The Probabilistic Model (cont.) • Choosing a doc: • choose the length ℓ of the doc • choose a topic T • repeat ℓ times: • with probability 1 − ε, choose a word from topic T • with probability ε, choose a word from the other topics
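A rough sketch of this generative process, assuming uniform choices throughout; the length distribution and the names make_doc, topic_words, eps are placeholders, not the paper's exact parameters:

```python
import random

def make_doc(topics, topic_words, eps, min_len=50, max_len=200):
    """Generate one doc under the model: pick a length and a topic, then draw words."""
    length = random.randint(min_len, max_len)            # choose the length of the doc
    t = random.choice(topics)                            # choose a topic uniformly
    others = [w for s in topics if s != t for w in topic_words[s]]
    words = []
    for _ in range(length):
        if random.random() < 1 - eps:
            words.append(random.choice(topic_words[t]))  # word from the chosen topic
        else:
            words.append(random.choice(others))          # word from the other topics
    return t, words

# Example: two disjoint topics, a corpus of 100 docs.
topic_words = {"cars": ["car", "engine", "wheel"], "web": ["surfing", "internet", "site"]}
corpus = [make_doc(list(topic_words), topic_words, eps=0.1) for _ in range(100)]
```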
Set up • Let v_d be the vector assigned to doc d by the rank-k LSI performed on the corpus • The rank-k LSI is δ-skewed if, for all docs d, d′ from different topics, v_d · v_d′ ≤ δ |v_d| |v_d′|, and for all docs d, d′ from the same topic, v_d · v_d′ ≥ (1 − δ) |v_d| |v_d′| • (intuition) Docs from the same topic should be similar (high dot product), while docs from different topics should be nearly orthogonal (low dot product)
The Result • Theorem: Assume the corpus is created from the model just described (k topics, etc.). Then the rank-k LSI is δ-skewed with high probability
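One way to check δ-skewness empirically on a generated corpus is to compute the largest normalized dot product between LSI vectors of docs from different topics; a hedged sketch (reusing the doc vectors returned by lsi above), offered as our illustration of the definition, not the paper's experiment:

```python
import numpy as np

def skew(doc_vectors, topic_of_doc):
    """Largest |cosine| between LSI vectors of docs from different topics."""
    V = doc_vectors / np.linalg.norm(doc_vectors, axis=0, keepdims=True)  # unit columns
    worst = 0.0
    m = V.shape[1]
    for a in range(m):
        for b in range(a + 1, m):
            if topic_of_doc[a] != topic_of_doc[b]:
                worst = max(worst, abs(V[:, a] @ V[:, b]))
    return worst  # the rank-k LSI is delta-skewed for any delta >= this value
```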
Proof Sketch • Show that with k topics we obtain k (nearly) orthogonal subspaces • First assume strictly disjoint topics (ε = 0): • show that whp the k largest eigenvalues of A A^T indeed correspond to the k topics (are not intra-topic) • Then relax to ε > 0 using a matrix perturbation analysis
Extensions • Ideally, theory should go beyond explaining • Potential for speed-up (sketched below): • project the doc vectors onto a suitably small space (e.g. by a random projection) • perform LSI in that smaller space • Yields O(m(n + c log n)) compared to O(mnc)
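A minimal sketch of the speed-up idea under the assumption that the small space is obtained by a Gaussian random projection; the target dimension d chosen here is illustrative, not the paper's exact bound:

```python
import numpy as np

def fast_lsi(A, k, d=None):
    """Approximate rank-k LSI: random projection to d dimensions, then SVD there."""
    n, m = A.shape
    if d is None:
        d = 4 * k + int(np.ceil(np.log(n)))     # illustrative choice of target dimension
    R = np.random.randn(d, n) / np.sqrt(d)      # Gaussian random projection matrix
    B = R @ A                                   # d x m matrix, much smaller than n x m
    U, s, Vt = np.linalg.svd(B, full_matrices=False)
    return np.diag(s[:k]) @ Vt[:k, :]           # k-dim doc vectors (columns)
```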
Future work • Learn more abstract algebra (math)! • Extensions: • docs spanning multiple topics? • polysemy? • other positive properties? • Another important role of theory: • Unify and generalize: spectral analysis has found applications elsewhere in IR