
Iterative residual rescaling: An analysis and generalization of LSI






Presentation Transcript


  1. Iterative residual rescaling: An analysis and generalization of LSI. Rie Kubota Ando & Lillian Lee. In the 24th Annual International ACM SIGIR Conference (SIGIR'2001), 2001. Presenter: 游斯涵

  2. Introduction • The disadvantage of the VSM: • Documents that share no terms are mapped to orthogonal vectors even if they are clearly related. • LSI attempts to overcome this shortcoming by projecting the term-document matrix onto a lower-dimensional subspace (see the sketch below).
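A minimal sketch (not from the paper) of the two points above, using a made-up 6-term, 5-document count matrix: documents d0 and d1 share no terms, so their VSM cosine is 0, but a rank-2 SVD projection recovers their relatedness through a third document that contains terms from both:

```python
import numpy as np

def cos(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Rows are terms [car, auto, engine, flower, bloom, petal]; columns are documents.
# d0 = {car, engine} and d1 = {auto} share no terms; d2 = {car, auto} links them.
D = np.array([
    [1, 0, 1, 0, 0],   # car
    [0, 1, 1, 0, 0],   # auto
    [1, 0, 0, 0, 0],   # engine
    [0, 0, 0, 1, 0],   # flower
    [0, 0, 0, 0, 1],   # bloom
    [0, 0, 0, 1, 1],   # petal
], dtype=float)

print(cos(D[:, 0], D[:, 1]))      # 0.0 in the plain VSM

U, s, Vt = np.linalg.svd(D, full_matrices=False)
Dk = U[:, :2] @ (U[:, :2].T @ D)  # project onto the rank-2 LSI subspace
print(cos(Dk[:, 0], Dk[:, 1]))    # close to 1: LSI reveals the relation
```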

  3. Introduction of IRR • [Diagram comparing LSI and IRR: the term-document matrix A is factored by SVD into U and VT; LSI weights the eigenvectors by their eigenvalues, while IRR replaces this weighting with rescaling.]

  4. Frobenius norm and matrix 2-norm • Frobenius norm: $\|A\|_F = \sqrt{\sum_{i,j} a_{ij}^2}$ • 2-norm: $\|A\|_2 = \max_{\|x\|_2 = 1} \|Ax\|_2$, i.e. the largest singular value of A.
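As a quick check of these definitions (standard linear algebra, not specific to the paper), both norms can be computed directly with numpy:

```python
import numpy as np

A = np.array([[3.0, 0.0],
              [4.0, 5.0]])

fro = np.sqrt((A ** 2).sum())                 # Frobenius norm, from the definition
two = np.linalg.svd(A, compute_uv=False)[0]   # 2-norm = largest singular value

assert np.isclose(fro, np.linalg.norm(A, 'fro'))
assert np.isclose(two, np.linalg.norm(A, 2))
print(fro, two)
```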

  5. Analyzing LSI • Topic-based similarities • C: an n-document collection • D: m-by-n term-document matrix • k: underlying topics (k < n) • Relevance score rel(d, t) for each document d and each topic t • True topic-based similarity between documents $d_i$ and $d_j$: $\sum_t \mathrm{rel}(d_i, t)\,\mathrm{rel}(d_j, t)$; collecting these values over all pairs gives an n-by-n similarity matrix S (shown on the slide as a doc-by-doc matrix built from doc-by-topic relevance scores).
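A small sketch of how S is assembled under this reading of the slide; the relevance scores below are made-up illustrative numbers, and the per-document unit normalization is my assumption:

```python
import numpy as np

rel = np.array([        # rel[i, t]: relevance of document i to topic t (k = 2)
    [1.0, 0.0],
    [0.9, 0.1],
    [0.0, 1.0],
    [0.2, 0.8],
])
rel /= np.linalg.norm(rel, axis=1, keepdims=True)  # assumed unit-norm per document

S = rel @ rel.T         # n-by-n true topic-based similarity matrix
print(S.round(2))
```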

  6. The optimum subspace • Given a subspace $\chi$ of $R^m$, let the columns of B form an orthonormal basis of $\chi$.

  7. The optimum subspace • We have the m-by-n term-document matrix D • The projection of D onto $\chi$ is $P_\chi D = B B^T D$.

  8. The optimum subspace • Deviation matrix: $(P_\chi D)^T (P_\chi D) - S$; we want to find a subspace such that the entries of this matrix are small. • The optimum subspace minimizes the Frobenius norm of the deviation matrix. • Optimum error: the minimum value of that norm; if the optimum error is high, then we cannot expect even the optimum subspace to fully reveal the topic dominances.
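The following sketch scores a candidate subspace the way the slides suggest: project D, compare the projected inner-product similarities against S, and take the Frobenius norm of the difference. The helper name subspace_error and the toy choice S = I are mine:

```python
import numpy as np

def subspace_error(D, B, S):
    """Frobenius norm of (P D)^T (P D) - S for the projector P = B B^T."""
    DX = B @ (B.T @ D)                 # projection of D onto span(B)
    return np.linalg.norm(DX.T @ DX - S, 'fro')

# Toy usage: evaluate the h-dimensional LSI subspace against S.
m, n, h = 6, 4, 2
rng = np.random.default_rng(0)
D = rng.random((m, n))
S = np.eye(n)                          # pretend each document is its own topic
U, _, _ = np.linalg.svd(D, full_matrices=False)
print(subspace_error(D, U[:, :h], S))
```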

  9. The singular value decomposition and LSI • SVD: $D = U \Sigma V^T$; the left singular vectors span D's range. • Intuition for the left singular vectors, by the following observation: let $\hat{d}_j$ be the projection of document vector $d_j$ onto the span of the left singular vectors found so far, and let $r_j = d_j - \hat{d}_j$ be the residual vector; the next left singular vector is the direction that best accounts for the remaining residuals.

  10. Analysis of LSI

  11. Non-uniformity and LSI • A crucial quantity in our analysis is the dominance of a given topic t: $\mathrm{dom}(t) = \sqrt{\sum_{d} \mathrm{rel}(d, t)^2}$.

  12. Non-uniformity and LSI • Topic mingling: a measure of how strongly documents are related to several topics at once. • If the topic mingling is high, meaning each document has high similarity with different topics, then the topics will be fairly difficult to distinguish.

  13. Non-uniformity and LSI • Let $\sigma_i$ be the ith largest singular value of D. Then $\sigma_i^2$ approximates the squared dominance of the ith most dominant topic, up to terms controlled by the topic mingling.

  14. Non-uniformity and LSI • Define the non-uniformity of the topic-document distribution as the ratio of the largest topic dominance to the smallest. • The more the largest topic dominates the collection, the higher this ratio will tend to be.

  15. Non-uniformity and LSI • Original error: let $\chi_{VSM}$ denote the full VSM space; since projection onto it is the identity, the deviation there is $D^T D - S$, whose Frobenius norm is the original error. • Root original error (input error): its square root, measuring how far the raw term-document representation is from the true topic-based similarities.

  16. Non-uniformity and LSI • Let $L_h$ be the h-dimensional LSI subspace spanned by the first h left singular vectors of D. • If h = k, the LSI subspace must be close to the optimum subspace when the topic-document distribution is relatively uniform.

  17. Notation for related values • M denotes the topic mingling. • For two quantities a and b, we write a ≈ b; the approximation becomes closer as the optimum error (or original error) becomes smaller.

  18. Ando’s IRR algorithm • [Algorithm listing; see the runnable sketch after slide 21.]

  19. Introduction of IRR

  20. Ando’s IRR algorithm • At each step, find the unit vector x that best approximates the residual matrix R, i.e. maximizes $\|R^T x\|$.

  21. Ando’s IRR algorithm
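A compact, runnable reconstruction of the algorithm as described in these slides and in Ando’s earlier work: at each iteration, rescale every residual vector by a power q of its length (amplifying documents that are still poorly represented), take the leading left singular vector of the rescaled residual matrix as the next basis vector, and subtract its contribution from all residuals. This is a sketch of my understanding, not the authors' reference implementation:

```python
import numpy as np

def irr(D, h, q):
    """Return an m-by-h orthonormal basis via iterative residual rescaling."""
    R = D.copy()                            # residual matrix, initially D itself
    basis = []
    for _ in range(h):
        Rs = R * (np.linalg.norm(R, axis=0) ** q)   # rescale residual vectors
        b = np.linalg.svd(Rs, full_matrices=False)[0][:, 0]  # leading left s.v.
        basis.append(b)
        R = R - np.outer(b, b @ R)          # remove the b-component of residuals
    return np.column_stack(basis)

# With q = 0 no rescaling happens and IRR reduces to plain LSI; larger q
# gives longer (less well represented) residuals more influence.
D = np.random.default_rng(1).random((8, 5))
B = irr(D, h=3, q=2.0)
print(np.round(B.T @ B, 6))                 # approximately the identity
```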

  22. Auto-scale method • Automatic scaling factor determination: the scaling factor should grow with the non-uniformity of the topic-document distribution. • When documents are approximately single-topic, the non-uniformity can be estimated directly from the document vectors.

  23. Auto-scale method • To implement auto-scale, we set q to a linear function of a non-uniformity estimate f(D).

  24. Dimension selection • Stopping criterion: the residual ratio, i.e. the fraction of $\|D\|_F^2$ remaining in the residual matrix; stop adding dimensions when it becomes small (effective for both LSI and IRR, as sketched below).
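One plausible form of this criterion (the exact ratio definition and the threshold value are my assumptions): track how much of the collection's Frobenius mass remains in the residual matrix, and stop once it drops below a threshold:

```python
import numpy as np

def select_dimension(D, basis, threshold=0.1):
    """Smallest h at which ||R||_F^2 / ||D||_F^2 falls below threshold."""
    R = D.copy()
    total = np.linalg.norm(D, 'fro') ** 2
    for h in range(basis.shape[1]):
        b = basis[:, h]
        R = R - np.outer(b, b @ R)          # residual after h+1 basis vectors
        if np.linalg.norm(R, 'fro') ** 2 / total < threshold:
            return h + 1
    return basis.shape[1]
```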

  25. Evaluation metrics • Pair-wise average precision: the measured similarity for any two intra-topic documents (sharing at least one topic) should be higher than for any two cross-topic documents, which have no topics in common; rank the document pairs by measured cosine and average the precision at each intra-topic pair. • Kappa average precision: the same quantity corrected for the baseline probability that a randomly chosen pair is intra-topic.
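A hedged sketch of the pair-wise variant: rank all document pairs by measured cosine and average, over the intra-topic pairs, the fraction of intra-topic pairs at or above each rank. The kappa correction for the baseline intra-topic probability is omitted here:

```python
def pairwise_avg_precision(cosines, intra):
    """cosines: {(i, j): score}; intra: {(i, j): bool}."""
    ranked = sorted(cosines, key=cosines.get, reverse=True)
    hits, precisions = 0, []
    for rank, pair in enumerate(ranked, start=1):
        if intra[pair]:
            hits += 1
            precisions.append(hits / rank)  # precision at this intra-topic pair
    return sum(precisions) / len(precisions)

# Toy usage: two topics, four documents, cosines for all six pairs.
topics = [{"A"}, {"A"}, {"B"}, {"B"}]
sims = {(0, 1): 0.9, (2, 3): 0.8, (0, 2): 0.7,
        (0, 3): 0.2, (1, 2): 0.1, (1, 3): 0.05}
intra = {p: bool(topics[p[0]] & topics[p[1]]) for p in sims}
print(pairwise_avg_precision(sims, intra))   # 1.0 for this perfect ranking
```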

  26. Evaluation metrics • Clustering: let C be a cluster-topic contingency table, where $C_{ij}$ is the number of documents in cluster i that are relevant to topic j; define a clustering score S(C) over this table.
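A sketch of building the contingency table; since the slide's definition of S(C) is not recoverable from the transcript, ordinary cluster purity is used below as a stand-in score:

```python
import numpy as np

def contingency(clusters, topics, n_clusters, n_topics):
    C = np.zeros((n_clusters, n_topics), dtype=int)
    for c, t in zip(clusters, topics):
        C[c, t] += 1            # document in cluster c, relevant to topic t
    return C

clusters = [0, 0, 1, 1, 1]      # cluster assignment per document
topics   = [0, 0, 0, 1, 1]      # relevant topic per document
C = contingency(clusters, topics, 2, 2)
purity = C.max(axis=1).sum() / C.sum()   # stand-in for S(C)
print(C)
print(purity)
```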

  27. Experimental setting • (1) Chose two TREC topics (more than two can be chosen). • (2) Specified seven distribution types: (25,25), (30,20), (35,15), (40,10), (43,7), (45,5), (46,4); each document was relevant to exactly one of the pre-selected topics. • (3) Extracted single-word stemmed terms using TALENT and removed stop-words. • (4) Created the term-document matrix and length-normalized the document vectors. • (5) Implemented AUTO-SCALE, with q set as a linear function of f(D).

  28. Controlled-distribution results • The chosen scaling factor increases on average as the non-uniformity goes up.

  29. Controlled-distribution results • [Figure: clustering results ranging from the lowest S(C) to the highest S(C).]

  30. Controlled-distribution results

  31. Conclusion • Provided a new theoretical analysis of LSI, showing a precise relationship between LSI's performance and the uniformity of the underlying topic-document distribution. • Extended Ando's IRR algorithm. • IRR provides very good performance in comparison to LSI.

  32. IRR on summarization • Turn each document into a term-by-sentence matrix and apply IRR (again yielding U and VT); then use all of the documents together as a query and compute its similarity to each sentence.
