120 likes | 130 Views
Learn about Latent Semantic Analysis (LSA) for discovering hidden concepts in document data and representing documents, terms, and queries. Explore matrix decomposition and singular value decomposition methods.
E N D
An Example d1 : Romeo and Juliet. d2 : Juliet: O happy dagger! d3 : Romeo died by dagger. d4 : “Live free or die”, that’s the New-Hampshire’s motto. d5 : Did you know, New-Hampshire is in New-England. q: dies, dagger Which document should be returned and how the ranking should be?
Eigenvectors and Eigenvalues • Let A be an n × n matrix. • If x is an n-dimensional vector, then the matrix-vector product Ax is well-defined, and the result is again an n-dimensional vector. • In general, multiplication by a matrix changes the direction of a non-zero vector x, unless the vector is special and we have that Ax = x for some scalar .
Matrix Decomposition • Let S be the matrix with eigenvectors of A as columns. • Let be the diagonal matrix with the eigenvalues of A on the diagonal. • Then A = SS-1 • If A is symmetric then we have S-1=ST A = SST
Singular Value Decomposition • Let A be an m × n matrix with entries being real numbers and m > n. • Consider the n × n square matrix B = ATA. • B is symmetric • it has been shown that the eigenvalues of such (ATA) matrices are non-negative. • Since they are non-negative we can write them in decreasing order as squares of non-negative real numbers: 12 > . . . > n2 • For some index r (possibly n) the first r numbers are positive whereas the rest are zero. • S1 = [x1, . . . , xr] • y1=(1/1)Ax1 ... yr=(1/r)Axr • S2 = [y1, ..., yr] • We can show that A = S2 S1T • is diagonal and the values along the diagonal are 1, . . . , n which are called singular values. • If we denote S2 by S and S1 by U we have A = S UT
Example d1 : Romeo and Juliet. d2 : Juliet: O happy dagger! d3 : Romeo died by dagger. d4 : “Live free or die”, that’s the New-Hampshire’s motto. d5 : Did you know, New-Hampshire is in New-England. q: dies, dagger
Latent Concepts • Latent Semantic Indexing (LSI) is a method for discovering hidden concepts in document data. • Each document and term (word) is then expressed as a vector with elements corresponding to these concepts. • Each element in a vector gives the degree of participation of the document or term in the corresponding concept. • Goal is not to describe the concepts verbally, but to be able to represent the documents and terms in a unified way for exposing • document-document, • document-term, and • term-term similarities which are otherwise hidden…
Matrix Matrix A can be written: A = SUT Let's "neglect" the last three singular values of as being too "small"... Also, just keep two columns from S obtaining S2 and two rows from UT obtaining U2T Matrix A is approximated as: A2 = S2U2T In general: Ak = SkUkT where a good value for k is determined empirically.
Representing Documents, Terms, and Queries • Represent documents by the column vectors of U2T • Represent terms by the row vectors S2 • Represent queries by the centroid vector of their terms