Iterative Residual Rescaling: An Analysis and Generalization of LSI
Rie Kubota Ando & Lillian Lee. Iterative residual rescaling: An analysis and generalization of LSI. In Proceedings of the 24th Annual International ACM SIGIR Conference (SIGIR 2001), 2001.
Presenter: 游斯涵
Introduction
• The disadvantage of the VSM: documents that do not share terms are mapped to orthogonal vectors even if they are clearly related.
• LSI attempts to overcome this shortcoming by projecting the term-document matrix onto a lower-dimensional subspace.
Introduction of IRR
• LSI: compute the SVD of the term-document matrix, A = U Σ V^T, where the left singular vectors live in term space, the right singular vectors in document space, and the singular values act as weights.
• IRR: generalizes this by rescaling the residual document vectors between iterations.
[Figure: diagram of the SVD decomposition A = U Σ V^T with term/document axes, and the rescaling step added by IRR.]
Frobenius norm and matrix 2-norm
• Frobenius norm: ||A||_F = sqrt( sum_{i,j} a_{ij}^2 )
• 2-norm: ||A||_2 = max_{x != 0} ||Ax|| / ||x||, i.e., the largest singular value of A.
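A quick numerical check of both norms (a minimal NumPy sketch; the matrix is illustrative):

```python
import numpy as np

A = np.array([[3.0, 0.0],
              [4.0, 5.0]])

# Frobenius norm: square root of the sum of squared entries
print(np.linalg.norm(A, 'fro'))                # sqrt(9 + 16 + 25) ~= 7.071

# 2-norm: the largest singular value of A
print(np.linalg.norm(A, 2))
print(np.linalg.svd(A, compute_uv=False)[0])   # same value
```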
Analyzing LSI
• Topic-based similarities:
• C: an n-document collection
• D: m-by-n term-document matrix
• k: number of underlying topics (k < n)
• Relevance score rel(t, d) for each document d and each topic t.
• True topic-based similarity between documents d_i and d_j: s_{ij} = sum_t rel(t, d_i) rel(t, d_j); collecting these entries gives an n-by-n similarity matrix S.
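Writing R for the k-by-n matrix of relevance scores, the definition above amounts to S = R^T R. A minimal sketch with made-up values (the matrix R here is purely illustrative):

```python
import numpy as np

# Illustrative relevance matrix R: R[t, j] = rel(t, d_j),
# with k = 2 topics (rows) and n = 4 documents (columns).
R = np.array([[1.0, 0.8, 0.0, 0.1],
              [0.0, 0.2, 1.0, 0.9]])

# True topic-based similarity: s_ij = sum_t rel(t, d_i) * rel(t, d_j)
S = R.T @ R
print(S.shape)   # (4, 4): one entry per document pair
```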
The optimum subspace
• Given a subspace X of R^m, let the columns of B form an orthonormal basis of X.
• For the m-by-n term-document matrix D, the projection of D onto X is D_X = B B^T D.
The optimum subspace
• Deviation matrix: Delta_X = D_X^T D_X - S; we want a subspace whose deviation entries are small.
• The optimum subspace is the one minimizing ||Delta_X||_F.
• Optimum error: the deviation of the optimum subspace. If the optimum error is high, then we cannot expect even the optimum subspace to fully reveal the topic dominances.
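A sketch of the deviation error for a candidate subspace, assuming the definitions above (B orthonormal, S the topic-based similarity matrix):

```python
import numpy as np

def deviation_error(B, D, S):
    """Frobenius norm of the deviation matrix for the subspace
    spanned by the orthonormal columns of B. The optimum subspace
    is the one minimizing this quantity."""
    D_proj = B @ (B.T @ D)            # projection of D onto span(B)
    delta = D_proj.T @ D_proj - S     # deviation matrix
    return np.linalg.norm(delta, 'fro')
```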
The singular value decomposition and LSI
• SVD: D = U Σ V^T; the left singular vectors span D's range.
• The left singular vectors can be obtained by the following observation: let d_j-hat be the projection of d_j onto the span of the left singular vectors found so far, and let r_j = d_j - d_j-hat be the residual vector; the next left singular vector is the unit vector x maximizing sum_j (x^T r_j)^2.
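This residual view suggests a greedy procedure that recovers the LSI basis, and it is the starting point for IRR below. A minimal sketch (not the paper's code):

```python
import numpy as np

def lsi_basis(D, h):
    """Greedy restatement of LSI based on the residual view: each
    new basis vector is the direction capturing the most remaining
    (residual) length, which recovers the left singular vectors."""
    R = D.astype(float).copy()
    basis = []
    for _ in range(h):
        # first left singular vector of the current residual matrix
        b = np.linalg.svd(R, full_matrices=False)[0][:, 0]
        basis.append(b)
        R = R - np.outer(b, b) @ R    # subtract the captured component
    return np.column_stack(basis)     # m-by-h orthonormal basis
```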
Non-uniformity and LSI
• A crucial quantity in the analysis is the dominance of a given topic t: dom(t) = sqrt( sum_{d in C} rel(t, d)^2 ).
Non-uniformity and LSI
• Topic mingling: the magnitude of the cross-topic entries of the similarity matrix S.
• If the topic mingling is high, meaning documents have high similarity with several different topics, then the topics will be fairly difficult to distinguish.
Non-uniformity and LSI
• Let sigma_i be the i-th largest singular value of D. Then sigma_i^2 approximates the squared dominance of the i-th most dominant topic, with the approximation error controlled by the topic mingling.
Non-uniformity and LSI
• Define the non-uniformity of the topic-document distribution as the ratio of the largest topic dominance to the smallest.
• The more the largest topic dominates the collection, the higher this ratio will tend to be.
Non-uniformity and LSI
• Original error: let X_VSM denote the full VSM space R^m; since projection onto it leaves D unchanged, the deviation is D^T D - S, and the original error is ||D^T D - S||_F.
• Root original error (input error): the square root of the original error.
Non-uniformity and LSI
• Let X_h^LSI be the h-dimensional LSI subspace spanned by the first h left singular vectors of D. If h = k, the error of the LSI subspace must be close to the optimum error when the topic-document distribution is relatively uniform.
Notation for related values
• m-tilde denotes the topic mingling.
• For two quantities a and b we write a ≈ b to mean they are close; the approximation becomes closer as the optimum error (or original error) becomes smaller.
Ando's IRR algorithm
• IRR generalizes the residual view of LSI: at each iteration, find the unit vector x that best approximates the current residual matrix R (the direction maximizing sum_j (x^T r_j)^2), take it as the next basis vector, and subtract the captured component from every residual.
• Before each iteration, IRR rescales each residual vector by a power q of its length, so documents that are still poorly represented carry more weight in the next iteration. A sketch of the loop follows.
Auto-scale method
• Automatic scaling factor determination: when the documents are approximately single-topic, a non-uniformity estimate f(D) computed directly from the document vectors approximates the non-uniformity of the topic distribution.
• Implementation of auto-scale: we set the scaling factor q to a linear function of f(D).
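A hedged sketch of one plausible instantiation. Both the particular estimate f(D) below (average squared cosine between normalized document vectors) and the linear coefficients a and b are assumptions made for illustration, not the paper's published formula or constants:

```python
import numpy as np

def f_nonuniformity(D):
    """One plausible non-uniformity estimate f(D): the average
    squared cosine similarity between length-normalized document
    vectors. This formula is an assumption for illustration,
    not necessarily the paper's exact f."""
    Dn = D / np.linalg.norm(D, axis=0)
    G = Dn.T @ Dn                      # pairwise cosine similarities
    n = G.shape[1]
    return (G**2).sum() / n**2

def auto_scale_q(D, a, b):
    # q is a linear function of f(D); the coefficients a and b
    # stand in for the constants chosen in the paper.
    return a * f_nonuniformity(D) + b
```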
Dimension selection
• Stopping criterion: the residual ratio, i.e., the fraction of the collection's total squared length not yet captured by the chosen basis vectors (effective for both LSI and IRR).
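A sketch of the criterion, assuming the residual ratio is the unexplained squared-length fraction as stated above (the stopping threshold itself is not fixed here):

```python
import numpy as np

def residual_ratio(D, B):
    """Residual ratio after projecting out the orthonormal basis B:
    the share of the collection's total squared length still
    unexplained. Dimension selection stops adding basis vectors
    once this drops below a chosen threshold."""
    R = D - B @ (B.T @ D)
    return np.linalg.norm(R, 'fro')**2 / np.linalg.norm(D, 'fro')**2
```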
Evaluation metrics
• Pair-wise average precision: the measured similarity for any two intra-topic documents (sharing at least one topic) should be higher than for any two cross-topic documents, which have no topics in common. Let (a_j, b_j) denote the document pair with the j-th largest measured cosine; precision at rank j is the fraction of intra-topic pairs among the top j.
• Kappa average precision: a chance-corrected variant that accounts for the probability that a randomly drawn pair is not intra-topic.
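A hedged sketch of the pair-wise measure (exact tie-breaking and the kappa chance correction from the paper are not reproduced). Here `topics[i]` is assumed to be the set of topic labels of document i:

```python
def pairwise_average_precision(cos_sim, topics):
    """Rank all document pairs by measured cosine similarity and
    score how well intra-topic pairs (sharing at least one topic)
    are ranked above cross-topic pairs, as in IR average precision."""
    n = len(topics)
    pairs = [(cos_sim[i][j], bool(topics[i] & topics[j]))
             for i in range(n) for j in range(i + 1, n)]
    pairs.sort(key=lambda p: -p[0])       # j-th largest cosine first
    hits, ap = 0, 0.0
    for rank, (_, intra) in enumerate(pairs, start=1):
        if intra:
            hits += 1
            ap += hits / rank
    return ap / hits if hits else 0.0
```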
Evaluation metrics
• Clustering: let C be a cluster-topic contingency table, where c_{ij} is the number of documents in cluster i that are relevant to topic j; a clustering score S(C) is then defined from this table.
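A small sketch of building the table from cluster assignments and topic labels (the score S(C) computed from it is defined in the paper and not reproduced here):

```python
import numpy as np

def contingency_table(cluster_of, topics_of, n_clusters, n_topics):
    """Build the cluster-topic contingency table C, where C[i, j]
    counts the documents in cluster i that are relevant to topic j.
    cluster_of[d] is document d's cluster; topics_of[d] its topics."""
    C = np.zeros((n_clusters, n_topics), dtype=int)
    for d, i in enumerate(cluster_of):
        for j in topics_of[d]:
            C[i, j] += 1
    return C
```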
Experimental setting
• (1) Chose two TREC topics (more than two can be chosen).
• (2) Specified seven distribution types: (25,25), (30,20), (35,15), (40,10), (43,7), (45,5), (46,4). Each document was relevant to exactly one of the pre-selected topics.
• (3) Extracted single-word stemmed terms using TALENT and removed stop-words.
• (4) Created the term-document matrix and length-normalized the document vectors.
• (5) Applied AUTO-SCALE to set the scaling factor q.
Controlled-distribution results • The chosen scaling factor increases on average as the non-uniformity goes up.
Controlled-distribution results
[Figure: results across distribution types, annotated from the lowest S(C) to the highest S(C).]
Conclusion
• Provided a new theoretical analysis of LSI, showing a precise relationship between LSI's performance and the uniformity of the underlying topic-document distribution.
• Extended Ando's IRR algorithm.
• IRR gives very good performance in comparison to LSI.
IRR on summarization
• Turn the documents into a term-by-sentence matrix (sentences take the place of documents) and apply IRR to obtain U and V^T.
• Use the whole document set as a query and compute its similarity with each sentence.