Latent semantic space: Iterative scaling improves precision of inter-document similarity measurement. Rie Kubota Ando. In Proceedings of the 23rd Annual International ACM SIGIR Conference (SIGIR 2000), 2000. Presenter: 游斯涵
Introduction
• Some studies apply modified or generalized SVD to improve the precision of similarity measurement:
• SDD (semidiscrete decomposition), T. G. Kolda and D. P. O'Leary: proposed to reduce the storage and computational costs of LSI.
• R-SVD (Riemannian SVD), E. P. Jiang and M. W. Berry: allows user feedback to be integrated into LSI models.
• Theoretical analyses of LSI: MDS (multidimensional scaling), Bayesian regression models, probabilistic models.
Introduction
• The problem with SVD:
• The topics underlying outlier documents tend to be lost as the number of dimensions is reduced.
• The noise removed by dimension reduction comes from two sources:
• outlier documents
• minor terms
• The idea of this paper:
• Do not treat outlier documents as noise; all documents are assumed to be equally important.
• Eliminate the noise contributed by minor terms without eliminating the influence of the outlier documents.
(Outlier documents: documents very different from all other documents.)
Comparison with SVD
• Similarities:
• Both try to find a small set of basis vectors spanning a reduced space.
• Differences:
• The proposed method scales the length of each residual vector.
• It treats documents and terms asymmetrically.
Algorithm: basis vector creation
• Input: m×n term-document matrix D (rows = terms, columns = documents), scaling factor q
• Output: basis vectors b_1, b_2, …

R_1 = D
For (i = 1; until reaching some criterion; i = i + 1):
    Rescale each residual vector: R_i := [ |r_1|^q r_1, …, |r_n|^q r_n ], where r_j is the j-th column of R_i
    b_i := the first unit eigenvector of R_i R_i^T (an m×m matrix)
    R_{i+1} := R_i − b_i b_i^T R_i (remove the b_i component from every residual vector)
End for
[Figure: matrix dimensions in the basis vector creation step. R_i R_i^T is m×m; R_i and b_i b_i^T R_i are m×n.]
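Below is a minimal Python sketch of the basis-vector loop above, assuming a fixed number of basis vectors k as the stopping criterion (the slides leave the criterion open); the function and variable names are mine, not the paper's.

```python
import numpy as np

def create_basis_vectors(D, q, k):
    """Sketch of iterative rescaling: derive k basis vectors from the
    m x n term-document matrix D using scaling factor q."""
    R = D.astype(float).copy()              # residual matrix, initially D
    basis = []
    for _ in range(k):
        # Rescale each residual document vector by the q-th power of its length.
        lengths = np.linalg.norm(R, axis=0)
        S = R * lengths ** q                # scales column j by |r_j|^q
        # b_i = first unit eigenvector of S S^T, i.e. the leading
        # left singular vector of the rescaled residual matrix.
        eigvals, eigvecs = np.linalg.eigh(S @ S.T)
        b = eigvecs[:, -1]                  # eigh sorts eigenvalues ascending
        basis.append(b)
        # Remove the b component from every residual vector.
        R -= np.outer(b, b @ R)
    return np.column_stack(basis)           # m x k basis matrix B_k
```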
Algorithm: document vector creation
• Dimension reduction: stack the first k basis vectors into an m×k matrix B_k = [b_1 … b_k] and project the documents onto them, giving a (k×m)(m×n) = k×n matrix of reduced document vectors (see the sketch below).
• There are two important variables in this algorithm: q (the scaling factor) and k (the number of dimensions).
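Continuing the sketch above, dimension reduction is just a projection onto the k basis vectors; the random matrix below is a toy stand-in for a real term-document matrix.

```python
import numpy as np

rng = np.random.default_rng(0)
D = rng.random((500, 120))                # toy m x n term-document matrix
B_k = create_basis_vectors(D, q=2, k=50)  # q and k: the two key parameters
D_hat = B_k.T @ D                         # (k x m)(m x n) = k x n document vectors
print(D_hat.shape)                        # (50, 120)
```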
Example
• [Worked example on the slides: form R R^T for a small matrix R and find its first unit eigenvector; a sketch follows below.]
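As a stand-in for the worked example, here is one standard way to find the first unit eigenvector of a symmetric matrix such as R R^T (power iteration); the 2×2 matrix is my own toy input, not the slides'.

```python
import numpy as np

def first_unit_eigenvector(M, iters=100):
    """Power iteration: approximates the unit eigenvector of the
    symmetric matrix M belonging to its largest eigenvalue."""
    v = np.ones(M.shape[0]) / np.sqrt(M.shape[0])  # arbitrary unit start vector
    for _ in range(iters):
        v = M @ v
        v /= np.linalg.norm(v)
    return v

M = np.array([[2.0, 1.0],
              [1.0, 2.0]])                # toy R R^T
print(first_unit_eigenvector(M))          # ~[0.707, 0.707] (eigenvalue 3)
```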
Probabilistic model
• Given the basis vectors, the document vectors are assumed to follow a Gaussian distribution:
• a multivariate normal (MVN) distribution.
Probabilistic model
• The log-likelihood of the document vectors reduced to dimension k is computed following Ding's model.
• The dimension k is chosen so as to maximize this log-likelihood.
• One term of the log-likelihood is negligible because it changes slowly with k.
Parameters
• q: set from 1 to 10, in increments of 1.
• k: the dimension selected by maximizing the log-likelihood.
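Since Ding's formula is not reproduced in these notes, the sketch below illustrates the same idea with a standard substitute criterion: scoring each candidate k by the data's log-likelihood under probabilistic PCA via scikit-learn's PCA.score. This is not Ando's or Ding's exact formula.

```python
import numpy as np
from sklearn.decomposition import PCA

def select_k_by_log_likelihood(X, k_max):
    """Pick the k maximizing the average log-likelihood of the documents
    under a probabilistic PCA model (substitute criterion, not Ding's)."""
    scores = [PCA(n_components=k).fit(X).score(X) for k in range(1, k_max + 1)]
    return 1 + int(np.argmax(scores))

# X holds one document per row, e.g. X = D.T for an m x n matrix D.
```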
Experiment
• Test data:
• TREC collections, 20 topics; total number of documents: 684
• Documents are split into two disjoint pools, one used as training data and one as test data (pool1 and pool2), with 15 document sets each
• Each set contains 31 to 126 documents and covers 6 to 20 topics
Baseline algorithms
• Three algorithms are compared:
• SVD, taking the left singular vectors as the basis vectors
• The raw term-document matrix without any basis conversion (term frequency)
• The proposed algorithm
Evaluation
• Assumption:
• Similarity should be higher for any document pair relevant to the same topic (an intra-topic pair) than for other pairs.
Evaluation
• Preservation rate (of document vector length): the fraction of a document vector's length preserved after dimension reduction.
• Reduction rate (larger is better): 1 − preservation rate.
• Dimensional reduction rate (larger is better): 1 − (# of dimensions / max # of dimensions).
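A small sketch of the three metrics as I read them; the preservation-rate definition (length of the projected vector over the original length) is my assumption, since the slide leaves the formula blank.

```python
import numpy as np

def evaluation_rates(d, B_k, max_dims):
    """Assumed metrics for one document vector d (length m):
    preservation rate          = |B_k^T d| / |d|  (fraction of length kept),
    reduction rate             = 1 - preservation rate,
    dimensional reduction rate = 1 - k / max_dims."""
    k = B_k.shape[1]
    preservation = np.linalg.norm(B_k.T @ d) / np.linalg.norm(d)
    return preservation, 1.0 - preservation, 1.0 - k / max_dims
```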
Dimension selection
• Log-likelihood method: choose the dimension that maximizes the log-likelihood described above.
• Training-based method: choose the dimension whose preservation rate is closest to the average preservation rate observed on the training sets (a sketch follows below).
• Random guess-based method: choose a dimension at random.
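A sketch of the training-based method, reusing evaluation_rates from the previous snippet; target_rate stands for the average preservation rate measured on the training sets (how that average is computed is my assumption, not spelled out here).

```python
import numpy as np

def training_based_k(D, B, target_rate):
    """Choose the k whose mean preservation rate over the documents in
    D (columns) is closest to target_rate, the training-set average."""
    m, k_max = B.shape
    best_k, best_gap = 1, float("inf")
    for k in range(1, k_max + 1):
        rates = [evaluation_rates(D[:, j], B[:, :k], k_max)[0]
                 for j in range(D.shape[1])]
        gap = abs(np.mean(rates) - target_rate)
        if gap < best_gap:
            best_k, best_gap = k, gap
    return best_k
```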
Results
• The proposed algorithm improved the precision of similarity measurement by 17.8% over the baseline.
Results
• Dimension reduction rate: 43% higher than SVD on average.
• Reduction rate: 35.8% higher than SVD.
Conclusion
• The proposed algorithm achieved higher precision of similarity measurement (up 17.8%) with a higher reduction rate (43% higher) than the baseline algorithms.
• The scaling factor q could be made dynamic to improve performance further.