1 / 12

Latent Semantic Analysis

Learn about Latent Semantic Analysis (LSA) for discovering hidden concepts in document data and representing documents, terms, and queries. Explore matrix decomposition and singular value decomposition methods.

jmullen
Download Presentation

Latent Semantic Analysis

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. Latent Semantic Analysis

  2. An Example d1 : Romeo and Juliet. d2 : Juliet: O happy dagger! d3 : Romeo died by dagger. d4 : “Live free or die”, that’s the New-Hampshire’s motto. d5 : Did you know, New-Hampshire is in New-England. q: dies, dagger Which document should be returned and how the ranking should be?

  3. Eigenvectors and Eigenvalues • Let A be an n × n matrix. • If x is an n-dimensional vector, then the matrix-vector product Ax is well-defined, and the result is again an n-dimensional vector. • In general, multiplication by a matrix changes the direction of a non-zero vector x, unless the vector is special and we have that Ax =  x for some scalar .

  4. Matrix Decomposition • Let S be the matrix with eigenvectors of A as columns. • Let  be the diagonal matrix with the eigenvalues of A on the diagonal. • Then A = SS-1 • If A is symmetric then we have S-1=ST A = SST

  5. Singular Value Decomposition • Let A be an m × n matrix with entries being real numbers and m > n. • Consider the n × n square matrix B = ATA. • B is symmetric • it has been shown that the eigenvalues of such (ATA) matrices are non-negative. • Since they are non-negative we can write them in decreasing order as squares of non-negative real numbers: 12 > . . . > n2 • For some index r (possibly n) the first r numbers are positive whereas the rest are zero. • S1 = [x1, . . . , xr] • y1=(1/1)Ax1 ... yr=(1/r)Axr • S2 = [y1, ..., yr] • We can show that A = S2 S1T •  is diagonal and the values along the diagonal are 1, . . . , n which are called singular values. • If we denote S2 by S and S1 by U we have A = S  UT

  6. Example d1 : Romeo and Juliet. d2 : Juliet: O happy dagger! d3 : Romeo died by dagger. d4 : “Live free or die”, that’s the New-Hampshire’s motto. d5 : Did you know, New-Hampshire is in New-England. q: dies, dagger

  7. Document-term matrix

  8. Latent Concepts • Latent Semantic Indexing (LSI) is a method for discovering hidden concepts in document data. • Each document and term (word) is then expressed as a vector with elements corresponding to these concepts. • Each element in a vector gives the degree of participation of the document or term in the corresponding concept. • Goal is not to describe the concepts verbally, but to be able to represent the documents and terms in a unified way for exposing • document-document, • document-term, and • term-term similarities which are otherwise hidden…

  9. Matrix  Matrix A can be written: A = SUT Let's "neglect" the last three singular values of  as being too "small"... Also, just keep two columns from S obtaining S2 and two rows from UT obtaining U2T Matrix A is approximated as: A2 = S2U2T In general: Ak = SkUkT where a good value for k is determined empirically.

  10. Matrices 2,S2,U2

  11. Representing Documents, Terms, and Queries • Represent documents by the column vectors of U2T • Represent terms by the row vectors S2 • Represent queries by the centroid vector of their terms

  12. Geometry

More Related