Latent Semantic Indexing (mapping onto a smaller space of latent concepts) • Paolo Ferragina, Dipartimento di Informatica, Università di Pisa • Reading 18
Speeding up cosine computation • What if we could take our vectors and “pack” them into fewer dimensions (say 50,000 → 100) while preserving distances? • Now: O(nm) to compute cos(d,q) for all d • Then: O(km+kn), where k << n,m • Two methods: • “Latent semantic indexing” • Random projection
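As a rough illustration of the cost argument (not from the slides; the sizes and the placeholder projection matrix below are made up), scoring a query against all documents is a matrix-vector product in m dimensions, and becomes one in k dimensions once everything is projected:

```python
import numpy as np

# Illustrative sizes only: m terms, n docs, k target dimensions.
m, n, k = 5000, 1000, 100
rng = np.random.default_rng(0)

A = rng.random((m, n))   # term-document matrix, one column per document
q = rng.random(m)        # query vector in term space

# Original space: cos(d, q) for all n docs costs O(n*m).
scores = (A.T @ q) / (np.linalg.norm(A, axis=0) * np.linalg.norm(q))

# Reduced space: with any k x m projection (LSI's P_k^t or a random one),
# documents are projected once offline; each query then costs O(k*m) to
# project plus O(k*n) to score.
proj = rng.standard_normal((k, m))      # placeholder projection matrix
A_red, q_red = proj @ A, proj @ q
scores_red = (A_red.T @ q_red) / (np.linalg.norm(A_red, axis=0) * np.linalg.norm(q_red))
```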
Briefly • LSI is data-dependent • Create a k-dim subspace by eliminating redundant axes • Pull together “related” axes, hopefully: car and automobile (what about polysemy?) • Random projection is data-independent • Choose a k-dim subspace that guarantees, with high probability, good stretching properties between any pair of points.
Notions from linear algebra • Matrix A, vector v • Matrix transpose (A^t) • Matrix product • Rank • Eigenvalue λ and eigenvector v: Av = λv
Overview of LSI • Pre-process docs using a technique from linear algebra called Singular Value Decomposition • Create a new (smaller) vector space • Queries handled (faster) in this new space
Singular-Value Decomposition • Recall the m × n matrix of terms × docs, A • A has rank r ≤ m, n • Define the term-term correlation matrix T = AA^t • T is a square, symmetric m × m matrix • Let P be the m × r matrix of the eigenvectors of T • Define the doc-doc correlation matrix D = A^tA • D is a square, symmetric n × n matrix • Let R be the n × r matrix of the eigenvectors of D
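A minimal sketch of the two correlation matrices on a toy random A (sizes made up); np.linalg.eigh applies because T and D are symmetric, and the two matrices share their r nonzero eigenvalues:

```python
import numpy as np

rng = np.random.default_rng(0)
m, n = 6, 4
A = rng.random((m, n))             # toy m x n term-document matrix

T = A @ A.T                        # term-term correlation, m x m, symmetric
D = A.T @ A                        # doc-doc correlation,  n x n, symmetric

eig_T, P = np.linalg.eigh(T)       # columns of P: orthonormal eigenvectors of T
eig_D, R = np.linalg.eigh(D)       # columns of R: orthonormal eigenvectors of D

# T and D share their r = rank(A) nonzero eigenvalues.
r = np.linalg.matrix_rank(A)
print(np.allclose(np.sort(eig_T)[-r:], np.sort(eig_D)[-r:]))   # True
```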
A's decomposition • Given P (for T, m × r) and R (for D, n × r), both formed by orthonormal columns (unit dot-product) • It turns out that A = P S R^t • where S is a diagonal r × r matrix whose entries are the singular values of A, i.e. the square roots of the eigenvalues of T = AA^t, in decreasing order • [Figure: A (m × n) = P (m × r) S (r × r) R^t (r × n)]
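A quick check of the decomposition with NumPy (toy matrix assumed): np.linalg.svd returns P, the diagonal of S as a vector, and R^t, and the diagonal entries are indeed the square roots of the eigenvalues of T = AA^t:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.random((6, 4))

P, s, Rt = np.linalg.svd(A, full_matrices=False)     # A = P @ diag(s) @ Rt
print(np.allclose(A, P @ np.diag(s) @ Rt))           # True

# Singular values = square roots of the eigenvalues of T = A A^t.
eig_T = np.sort(np.linalg.eigvalsh(A @ A.T))[::-1][:len(s)]
print(np.allclose(s, np.sqrt(eig_T)))                # True
```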
Dimensionality reduction • For some k << r, zero out all but the k biggest singular values in S [the choice of k is crucial] • Denote by S_k this new version of S, having rank k • Typically k is about 100, while r (A's rank) is > 10,000 • [Figure: A_k = P S_k R^t; since S_k has only k nonzero entries, the last r - k columns of P (m × r) and rows of R^t (r × n) are useless due to the 0-columns/0-rows of S_k, leaving effective factors of size m × k, k × k and k × n.]
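A sketch of the truncation step on the same kind of toy matrix (k chosen only for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.random((6, 4))
k = 2                                             # keep the k largest singular values

P, s, Rt = np.linalg.svd(A, full_matrices=False)
A_k = P[:, :k] @ np.diag(s[:k]) @ Rt[:k, :]       # rank-k version of A
print(np.linalg.matrix_rank(A_k))                 # 2
```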
Guarantee • A_k is a pretty good approximation to A: • Relative distances are (approximately) preserved • Of all m × n matrices of rank k, A_k is the best approximation to A w.r.t. the following measures: • min_{B, rank(B)=k} ||A - B||_2 = ||A - A_k||_2 = σ_{k+1} • min_{B, rank(B)=k} ||A - B||_F^2 = ||A - A_k||_F^2 = σ_{k+1}^2 + σ_{k+2}^2 + ... + σ_r^2 • Frobenius norm: ||A||_F^2 = σ_1^2 + σ_2^2 + ... + σ_r^2
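A numeric check of the two guarantees above (toy matrix and k assumed):

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.random((6, 4))
k = 2

P, s, Rt = np.linalg.svd(A, full_matrices=False)
A_k = P[:, :k] @ np.diag(s[:k]) @ Rt[:k, :]

# Spectral-norm error equals the first discarded singular value sigma_{k+1}.
print(np.isclose(np.linalg.norm(A - A_k, 2), s[k]))                      # True
# Squared Frobenius error equals the sum of the discarded squared singular values.
print(np.isclose(np.linalg.norm(A - A_k, 'fro')**2, np.sum(s[k:]**2)))   # True
```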
Reduction • X_k = S_k R^t is the doc-matrix, k × n, hence reduced to k dims • Since we are interested in the doc/query correlation, we consider: • D = A^tA = (P S R^t)^t (P S R^t) = (S R^t)^t (S R^t), since R and P are formed by orthonormal eigenvectors of the matrices D and T • Approximating S with S_k, we get A^tA ≈ X_k^t X_k (both are n × n matrices) • We use X_k to define how to project A and q: • X_k = S_k R^t; substitute R^t = S^{-1} P^t A, so we get X_k = P_k^t A • In fact, S_k S^{-1} P^t = P_k^t, which is a k × m matrix • This means that to reduce a doc/query vector it is enough to multiply it by P_k^t, thus paying O(km) per doc/query • Cost of sim(q,d), for all d, is O(kn + km) instead of O(mn)
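A sketch of the projection step (toy data assumed): P_k^t maps documents and the query into k dimensions, the projected doc-matrix coincides with S_k R^t, and cosine is then computed there:

```python
import numpy as np

rng = np.random.default_rng(0)
m, n, k = 6, 4, 2
A = rng.random((m, n))
q = rng.random(m)

P, s, Rt = np.linalg.svd(A, full_matrices=False)
Pk_t = P[:, :k].T                                      # k x m projection matrix

X_k = Pk_t @ A                                         # projected docs, k x n
print(np.allclose(X_k, np.diag(s[:k]) @ Rt[:k, :]))    # True: X_k = S_k R^t

q_k = Pk_t @ q                                         # O(k*m) per query
cos = (X_k.T @ q_k) / (np.linalg.norm(X_k, axis=0) * np.linalg.norm(q_k))
```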
Which are the concepts? • The c-th concept = the c-th row of P_k^t (which is k × m) • Denote it by P_k^t[c], whose size is m = #terms • P_k^t[c][i] = strength of association between the c-th concept and the i-th term • Projected document: d'_j = P_k^t d_j • d'_j[c] = strength of concept c in d_j • Projected query: q' = P_k^t q • q'[c] = strength of concept c in q
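A sketch of inspecting the concepts, i.e. the rows of P_k^t, as term-association strengths (the vocabulary and matrix below are made up):

```python
import numpy as np

rng = np.random.default_rng(0)
terms = ["car", "automobile", "engine", "bank", "money", "river"]
A = rng.random((len(terms), 4))          # toy 6-term x 4-doc matrix
k = 2

P, s, Rt = np.linalg.svd(A, full_matrices=False)
Pk_t = P[:, :k].T                        # k x m: one row per concept

for c in range(k):
    # strongest terms for concept c, ranked by |P_k^t[c][i]|
    top = np.argsort(-np.abs(Pk_t[c]))[:3]
    print(f"concept {c}:", [terms[i] for i in top])
```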
Random Projections • Paolo Ferragina, Dipartimento di Informatica, Università di Pisa • Slides only!
An interesting math result • Lemma (Johnson-Lindenstrauss, ’82). Let P be a set of n distinct points in m dimensions. Given ε > 0, there exists a function f : P → IR^k such that for every pair of points u, v in P it holds: (1 - ε) ||u - v||^2 ≤ ||f(u) - f(v)||^2 ≤ (1 + ε) ||u - v||^2, where k = O(ε^-2 log n) • f() is called a JL-embedding • Setting v = 0 we also get a bound on f(u)'s stretching!
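An empirical check of the bound, using a Gaussian construction of f (anticipating the next slides; the sizes, ε and k below are made up):

```python
import numpy as np
from itertools import combinations

rng = np.random.default_rng(0)
n, m, eps = 30, 10_000, 0.5
k = 400                                        # generous k on the order of eps^-2 * log n

points = rng.random((n, m))
R = rng.standard_normal((m, k)) / np.sqrt(k)   # assumed Gaussian JL map
f = points @ R                                 # row u is mapped to f(u)

# Largest relative distortion of a squared pairwise distance.
worst = max(
    abs(np.sum((fu - fv) ** 2) / np.sum((u - v) ** 2) - 1)
    for (u, fu), (v, fv) in combinations(zip(points, f), 2)
)
print(worst)   # typically well below eps for this k
```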
What about the cosine-distance? • [The derivation on this slide was lost in extraction: it bounds f(u)·f(v) by combining the bounds on f(u)'s and f(v)'s stretching with the formula above for ||u - v||^2.]
How to compute a JL-embedding? • If we set R = (r_i,j) to be a random m × k matrix, where the components are independent random variables with one of the following distributions, each with E[r_i,j] = 0 and Var[r_i,j] = 1: • [The slide's table of distributions did not survive extraction.]
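Two classic choices with E[r_i,j] = 0 and Var[r_i,j] = 1 are the standard Gaussian and the fair ±1 coin flip (this pairing is my assumption, not recovered from the slide). A minimal sketch of the resulting embedding, with the usual 1/√k scaling:

```python
import numpy as np

rng = np.random.default_rng(0)
m, k = 1000, 100

# Two zero-mean, unit-variance choices for the entries r_i,j (assumed here,
# since the slide's own list of distributions was lost):
R_gauss = rng.standard_normal((m, k))            # r_ij ~ N(0, 1)
R_sign  = rng.choice([-1.0, 1.0], size=(m, k))   # r_ij = +-1, each with prob 1/2

def jl_embed(u, R):
    # The 1/sqrt(k) factor makes E[||f(u) - f(v)||^2] = ||u - v||^2.
    return (u @ R) / np.sqrt(R.shape[1])

u = rng.random(m)
print(jl_embed(u, R_gauss).shape, jl_embed(u, R_sign).shape)   # (100,) (100,)
```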
Finally... • Random projections hide large constants • k ≈ (1/ε)^2 * log n, so it may be large… • but they are simple and fast to compute • LSI is intuitive and may scale to any k • optimal under various metrics • but costly to compute