
Latent Semantic Indexing: A Probabilistic Analysis


Presentation Transcript


  1. Latent Semantic Indexing: A Probabilistic Analysis Christos Papadimitriou, Prabhakar Raghavan, Hisao Tamaki, Santosh Vempala

  2. Motivation • Applications in several areas: • querying • clustering, identifying topics • Other: • synonym recognition (TOEFL...) • psychology tests • essay scoring

  3. Motivation • Latent Semantic Indexing is • Latent: Captures associations which are not explicit • Semantic: Represents meaning as a function of similarity to other entities • Cool: Lots of spiffy applications, and the potential for some good theory too

  4. Overview • IR and two classical problems • How LSI works • Why LSI is effective: A probabilistic analysis

  5. Information Retrieval • Text corpus with many documents (docs) • Given a query, find relevant docs • Classical problems: • synonymy: missing docs that refer to “automobile” when querying on “car” • polysemy: retrieving docs about the internet when querying on “surfing” • Solution: Represent docs (and queries) by their underlying latent concepts

  6. Information Retrieval • Represent each document as a word vector • Represent corpus as term-document matrix (T-D matrix) • A classical method: • Create new vector from query terms • Find documents with highest dot-product
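
A minimal sketch of this classical dot-product retrieval scheme in NumPy; the vocabulary, documents, and query below are made up purely for illustration:

```python
import numpy as np

# Hypothetical vocabulary (terms) and tiny corpus (documents) for illustration.
terms = ["car", "automobile", "engine", "internet", "surfing", "wave"]
docs = [
    "car engine car",            # doc 0: about cars
    "automobile engine",         # doc 1: about cars (synonym)
    "internet surfing internet", # doc 2: about the web
    "surfing wave wave",         # doc 3: about the beach
]

# Term-document matrix A: rows = terms, columns = docs, entries = counts.
A = np.zeros((len(terms), len(docs)))
for j, doc in enumerate(docs):
    for word in doc.split():
        A[terms.index(word), j] += 1

# Classical retrieval: build a query vector and rank docs by dot product.
query = "car engine"
q = np.zeros(len(terms))
for word in query.split():
    q[terms.index(word)] += 1

scores = q @ A                  # one dot product per document
print(np.argsort(-scores))      # docs ranked by score
```

With this plain dot product, a query consisting only of “car” scores zero against doc 1 (“automobile engine”): exactly the synonymy problem from the previous slide.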

  7. Document vector space

  8. Latent Semantic Indexing (LSI) • Process the term-document (T-D) matrix to expose statistical structure • Convert the high-dimensional space to a lower-dimensional space, throw out noise, keep the good stuff • Related to principal component analysis (PCA) and multidimensional scaling (MDS)

  9. Parameters • U = universe of terms • n = number of terms • m = number of docs • A = n x m matrix with rank r • columns represent docs • rows represent terms

  10. Singular Value Decomposition (SVD) • LSI uses SVD, a linear analysis method: A = U D Vᵀ

  11. SVD • r is the rank of A • D diagonal matrix of the r singular values • U and V matrices composed of orthonormal columns • SVD is always possible • numerical methods for SVD exist • run time: O(m n c), where c denotes the average number of words per document
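
A quick sketch of the factorization on the toy matrix A from the retrieval example above, using NumPy's general-purpose SVD routine (not the specialized sparse methods the runtime bullet alludes to):

```python
import numpy as np

# Thin SVD of the toy term-document matrix: A = U @ diag(d) @ Vt,
# where U and Vt.T have orthonormal columns and d holds the singular values.
U, d, Vt = np.linalg.svd(A, full_matrices=False)

r = np.linalg.matrix_rank(A)   # rank of A = number of nonzero singular values
print("rank:", r)
print("singular values:", d)
print("reconstruction error:", np.linalg.norm(A - U @ np.diag(d) @ Vt))  # ~0
```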

  12. T-D Matrix Approximation
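
A sketch of the rank-k truncation that LSI actually keeps, continuing from the SVD above (k = 2 is an arbitrary illustrative choice):

```python
k = 2                                    # number of latent dimensions kept (illustrative)
Uk, dk, Vtk = U[:, :k], d[:k], Vt[:k, :]
Ak = Uk @ np.diag(dk) @ Vtk              # best rank-k approximation of A (Eckart-Young)
print("rank-k approximation error:", np.linalg.norm(A - Ak))
```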

  13. Synonymy • LSI is used in several ways, e.g. detecting synonymy • A measure of similarity for two terms i and j: • In the original space: the dot product of rows i and j of A (the (i, j) entry of A Aᵀ) • Better: the dot product of rows i and j of U_k D_k (the (i, j) entry of A_k A_kᵀ)
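
A sketch of both similarity measures on the toy data, continuing from the snippets above (rows of U_k D_k serve as the reduced term vectors):

```python
# Term-term similarities in the original space: entries of A @ A.T
sim_orig = A @ A.T

# Term-term similarities in the LSI space: entries of Ak @ Ak.T,
# i.e. dot products of rows of Uk @ diag(dk)
T = Uk @ np.diag(dk)
sim_lsi = T @ T.T

i, j = terms.index("car"), terms.index("automobile")
print("original similarity:", sim_orig[i, j])  # 0: the two terms never co-occur
print("LSI similarity:", sim_lsi[i, j])        # typically positive: linked via "engine"
```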

  14. “Semantic” Space

  15. Synonymy (intuition) • Consider the term-term autocorrelation matrix A Aᵀ • If two terms co-occur (e.g. supply-demand), we get nearly identical rows of A • This yields a small eigenvalue of A Aᵀ • The corresponding eigenvector will likely be projected out in A_k, as it carries a weak eigenvalue
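
A small numerical check of this intuition with a hypothetical three-term example: when two rows of A are nearly identical, one eigenvalue of A Aᵀ is tiny, and that direction is what rank-k LSI discards.

```python
import numpy as np

# Two terms with almost identical co-occurrence patterns (think supply/demand),
# plus one unrelated term; columns are documents.
A2 = np.array([
    [3.0, 2.0, 0.0, 1.0],   # "supply"
    [3.0, 2.0, 0.1, 1.0],   # "demand": nearly the same row
    [0.0, 0.0, 4.0, 0.0],   # unrelated term
])

eigvals = np.linalg.eigvalsh(A2 @ A2.T)   # eigenvalues of the autocorrelation matrix
print(np.sort(eigvals))                   # one eigenvalue is close to zero
```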

  16. A Performance Evaluation • Landauer & Dumais • Perform LSI on 30,000 encyclopedia articles • Take synonym test from TOEFL • Choose most similar word • LSI - 64.4% (52.2% corrected for guessing) • People - 64.5% (52.7% corrected for guessing) • Correlated .44 with incorrect alternatives

  17. A Probabilistic Analysis: Overview • The model: • Topics sufficiently disjoint • Each doc drawn from a single (random) topic • Result: • With high probability (whp): • Docs from the same topic will be similar • Docs from different topics will be dissimilar

  18. The Probabilistic Model • k topics, each corresponding to a set of words • The sets are mutually disjoint • Below, all random choices are made uniformly at random • A corpus of m docs, each doc created as follows:

  19. The Probabilistic Model (cont.) • Choosing a doc: • choose the length ℓ of the doc • choose a topic T • Repeat ℓ times: • with prob 1 - ε, choose a word from topic T • with prob ε, choose a word from the other topics
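
A sketch of a generator for this model; the function name, parameter values, and word labels are illustrative assumptions, but the steps follow the slide (pick a topic, pick a length, then corrupt each word with probability ε):

```python
import random

def make_corpus(k=3, words_per_topic=50, m=200, doc_len=(20, 40), eps=0.1, seed=0):
    """Generate m docs: each doc picks one topic uniformly; every word comes
    from that topic with prob 1 - eps and from the other topics with prob eps."""
    rng = random.Random(seed)
    # k mutually disjoint topic vocabularies, e.g. "t0_w17"
    topics = [[f"t{t}_w{w}" for w in range(words_per_topic)] for t in range(k)]
    corpus, labels = [], []
    for _ in range(m):
        t = rng.randrange(k)              # choose a topic
        length = rng.randint(*doc_len)    # choose the length of the doc
        doc = []
        for _ in range(length):
            if rng.random() < 1 - eps:
                doc.append(rng.choice(topics[t]))        # word from the doc's topic
            else:
                other = rng.choice([s for s in range(k) if s != t])
                doc.append(rng.choice(topics[other]))    # word from some other topic
        corpus.append(doc)
        labels.append(t)
    return corpus, labels, topics
```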

  20. Setup • Let v_i be the vector assigned to doc i by the rank-k LSI performed on the corpus • The rank-k LSI is δ-skewed if, for every pair of docs i, j: v_i · v_j ≥ (1 - δ)|v_i||v_j| when the docs come from the same topic, and |v_i · v_j| ≤ δ|v_i||v_j| when they come from different topics • (intuition) Docs from the same topic should be similar (high dot product), docs from different topics dissimilar (low dot product)

  21. The Result • Theorem: Assume the corpus is created from the model just described (k topics, etc.). Then, with high probability, the rank-k LSI is δ-skewed.
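
An empirical sanity check of the skewness claim on the synthetic corpus (assuming the make_corpus sketch above and NumPy; the statistic below is just cosines of the rank-k LSI document vectors, a rough stand-in for the formal definition):

```python
import numpy as np

corpus, labels, topics = make_corpus()
vocab = [w for topic in topics for w in topic]
index = {w: i for i, w in enumerate(vocab)}

# Term-document matrix of the synthetic corpus.
A_syn = np.zeros((len(vocab), len(corpus)))
for j, doc in enumerate(corpus):
    for w in doc:
        A_syn[index[w], j] += 1

# Rank-k LSI document vectors: columns of diag(d_k) @ V_k^T, normalized.
k = len(topics)
U, d, Vt = np.linalg.svd(A_syn, full_matrices=False)
docs_lsi = (np.diag(d[:k]) @ Vt[:k, :]).T
docs_lsi /= np.linalg.norm(docs_lsi, axis=1, keepdims=True)

cos = docs_lsi @ docs_lsi.T
same = np.equal.outer(labels, labels)
print("mean same-topic cosine: ", cos[same].mean())   # expected to be near 1
print("mean cross-topic cosine:", cos[~same].mean())  # expected to be near 0
```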

  22. Proof Sketch • Show that with k topics, we obtain k orthogonal subspaces • First assume strictly disjoint topics (ε = 0): • show that whp the k highest eigenvalues of A Aᵀ indeed correspond to the k topics (are not intra-topic) • Then relax to ε > 0 using a matrix perturbation analysis

  23. Extensions • Ideally, theory should go beyond explaining • Potential for a speed-up: • project the doc vectors onto a suitably small space • perform LSI in this smaller space • Yields O(m(n + c log n)) compared to O(mnc)
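
One way to realize the speed-up bullet is a random projection of the n-dimensional document vectors before the SVD; the Gaussian projection and dimensions below are illustrative assumptions, not the paper's exact construction:

```python
import numpy as np

def lsi_with_random_projection(A, k, proj_dim=100, seed=0):
    """Approximate rank-k LSI: project the n-dimensional doc vectors down to
    proj_dim dimensions with a random matrix, then run SVD on the smaller matrix."""
    rng = np.random.default_rng(seed)
    n = A.shape[0]
    R = rng.standard_normal((proj_dim, n)) / np.sqrt(proj_dim)  # random projection
    B = R @ A                                # proj_dim x m instead of n x m
    U, d, Vt = np.linalg.svd(B, full_matrices=False)
    return np.diag(d[:k]) @ Vt[:k, :]        # rank-k document representation
```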

  24. Future work • Learn more abstract algebra (math)! • Extensions: • docs spanning multiple topics? • polysemy? • other positive properties? • Another important role of theory: • Unify and generalize: spectral analysis has found applications elsewhere in IR
