Information Retrieval through Various Approximate Matrix Decompositions
Kathryn Linehan
Advisor: Dr. Dianne O’Leary
Information Retrieval
• Extracting information from databases
• We need an efficient way of searching large amounts of data
• Example: web search engine
Querying a Document Database
• We want to return documents that are relevant to entered search terms
• Given data:
  • Term-document matrix A; entry (i, j): importance of term i in document j
  • Query vector q; entry (i): importance of term i in the query
Term-Document Matrix
• Entry (i, j): weight of term i in document j
• [Example matrix from [5]: rows for the terms Mark, Twain, Samuel, Clemens, Purple, and Fairy; columns for documents 1-4]
Query Vector
• Entry (i): weight of term i in the query
• [Example from [5]: a search for “Mark Twain” puts nonzero weight on the entries for Mark and Twain and zero weight on Samuel, Clemens, Purple, and Fairy]
Document Scoring
• Score document j by the inner product of column j of A with the query vector q; the vector of document scores is A^T q
• [Example from [5]: scores of documents 1-4 for the “Mark Twain” query]
• Doc 1 and Doc 3 will be returned as relevant, but Doc 2 will not
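To make the scoring concrete, here is a minimal Python sketch: the score of document j is the inner product of column j of A with q, so the score vector is A^T q. The matrix entries below are illustrative stand-ins, not the actual weights from [5].

```python
import numpy as np

# Illustrative 6-term x 4-document matrix (rows: Mark, Twain,
# Samuel, Clemens, Purple, Fairy); values are hypothetical,
# not the weights used in [5].
A = np.array([[1, 0, 1, 0],   # Mark
              [1, 0, 1, 0],   # Twain
              [0, 1, 1, 0],   # Samuel
              [0, 1, 1, 0],   # Clemens
              [0, 0, 0, 1],   # Purple
              [0, 0, 0, 1]])  # Fairy

q = np.array([1, 1, 0, 0, 0, 0])  # query: "Mark Twain"

scores = A.T @ q                  # score of doc j = A[:, j] . q
print(scores)                     # [2 0 2 0]: docs 1 and 3 match, doc 2 does not
```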
Can we do better if we replace the matrix by an approximation?
• Singular Value Decomposition (SVD)
• Nonnegative Matrix Factorization (NMF)
• CUR Decomposition
Nonnegative Matrix Factorization (NMF)
• A ≈ WH, where A is m x n, W is m x k, and H is k x n
• W and H are nonnegative
• Storage: k(m + n) entries
NMF
• Multiplicative update algorithm of Lee and Seung, found in [1]
• Find W, H to minimize ‖A − WH‖_F^2
• Random initialization for W, H
• Gradient descent method
• Slow due to matrix multiplications in each iteration
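As a sketch of the algorithm described above (assuming the Frobenius-norm variant of the Lee and Seung updates given in [1]; this is not the project's code, and `eps` is an added guard against division by zero):

```python
import numpy as np

def nmf(A, k, iters=200, eps=1e-9, seed=0):
    """Multiplicative updates of Lee and Seung for
    min ||A - WH||_F^2 with W >= 0, H >= 0,
    starting from random nonnegative W and H."""
    rng = np.random.default_rng(seed)
    m, n = A.shape
    W = rng.random((m, k))
    H = rng.random((k, n))
    for _ in range(iters):
        H *= (W.T @ A) / (W.T @ W @ H + eps)  # update H, stays nonnegative
        W *= (A @ H.T) / (W @ H @ H.T + eps)  # update W, stays nonnegative
    return W, H
```

Each iteration costs a handful of O(mnk) matrix products, which is the slowness noted above.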
NMF Validation
• [Plot] A: 5 x 3 random dense matrix; average over 5 runs
• [Plot] B: 500 x 200 random sparse matrix; average over 5 runs
NMF Validation
• [Plot] B: 500 x 200 random sparse matrix; rank(NMF) = 80
CUR Decomposition
• A ≈ CUR, where A is m x n, C is m x c, U is c x r, and R is r x n
• C (R) holds c (r) sampled and rescaled columns (rows) of A
• U is computed using C and R
• k is a rank parameter
• Storage: (nz(C) + cr + nz(R)) entries
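A small sketch of the storage accounting above, assuming C and R are stored as sparse matrices and the small c x r matrix U is stored densely:

```python
from scipy import sparse

def cur_storage(C, U, R):
    """Entries needed to store a CUR approximation in the
    accounting used here: nz(C) + c*r + nz(R), with U dense."""
    return sparse.csr_matrix(C).nnz + U.size + sparse.csr_matrix(R).nnz
```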
CUR Implementations
• CUR algorithm in [3] by Drineas, Kannan, and Mahoney
  • Linear time algorithm
  • Improvement: Compact Matrix Decomposition (CMD) in [6] by Sun, Xie, Zhang, and Faloutsos
  • Modification: use ideas in [4] by Drineas, Mahoney, and Muthukrishnan (no longer linear time)
  • Other modifications: our ideas
• Deterministic CUR code by G. W. Stewart [2]
Sampling
• Column (row) norm sampling [3]
  • Prob(col j) = ‖A(:, j)‖^2 / ‖A‖_F^2 (similar for row i)
• Subspace sampling [4]
  • Uses rank-k SVD of A for column probabilities: Prob(col j) = ‖V_k(j, :)‖^2 / k
  • Uses “economy size” SVD of C for row probabilities: Prob(row i) = ‖U_C(i, :)‖^2 / c
• Sampling without replacement
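A minimal sketch of column norm sampling with replacement, as described in [3]; the 1/sqrt(c * p_j) rescaling is the "rescaled" step mentioned in the CUR slide (function and variable names are mine, not the project's):

```python
import numpy as np

def sample_columns(A, c, rng=None):
    """Sample c columns of A with Prob(col j) proportional to
    the squared column norm, rescaling each sampled column by
    1 / sqrt(c * p_j) as in [3]. Row sampling is analogous."""
    rng = np.random.default_rng() if rng is None else rng
    p = np.sum(A * A, axis=0) / np.sum(A * A)   # Prob(col j)
    cols = rng.choice(A.shape[1], size=c, p=p)  # with replacement
    C = A[:, cols] / np.sqrt(c * p[cols])       # rescale
    return C, cols
```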
Computation of U
• Linear U [3]: approximately solves min over U of ‖A − CUR‖_F
• Optimal U: solves min over U of ‖A − CUR‖_F exactly, giving U = C^+ A R^+
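The optimal U has a closed form via pseudoinverses; a minimal sketch (the linear-time U of [3] avoids this cost but only approximates the minimizer):

```python
import numpy as np

def optimal_u(A, C, R):
    """U minimizing ||A - C U R||_F: the least-squares solution
    U = pinv(C) @ A @ pinv(R)."""
    return np.linalg.pinv(C) @ A @ np.linalg.pinv(R)
```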
Deterministic CUR
• Code by G. W. Stewart [2]
• Uses an RRQR algorithm that does not store Q
  • We only need the permutation vector, which gives us the columns (rows) for C (R)
• Uses an optimal U
Compact Matrix Decomposition (CMD) Improvement
• Remove repeated columns (rows) in C (R)
• Decreases storage while still achieving the same relative error [6]
• [Plot] A: 50 x 30 random sparse matrix, k = 15; average over 10 runs
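A sketch of the CMD column step, assuming duplicates are exact copies produced by sampling with replacement; scaling each kept column by the square root of its multiplicity follows the construction in [6]:

```python
import numpy as np

def remove_duplicate_columns(C):
    """CMD-style deduplication sketch: keep each distinct column
    once, scaled by sqrt(its multiplicity), as in [6]."""
    uniq, counts = np.unique(C, axis=1, return_counts=True)
    return uniq * np.sqrt(counts)   # broadcasts over columns
```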
CUR: Sampling with Replacement Validation
• [Plot] A: 5 x 3 random dense matrix; average over 5 runs
• Legend: sampling method, U computation
Sampling without Replacement: Scaling vs. No Scaling
• Invert the scaling factor applied to the sampled columns (rows)
CUR: Sampling without Replacement Validation
• [Plot] A: 5 x 3 random dense matrix; average over 5 runs
• [Plot] B: 500 x 200 random sparse matrix; average over 5 runs
• Legend: sampling method, U computation, scaling
CUR Comparison
• [Plot] B: 500 x 200 random sparse matrix; average over 5 runs
• Legend: sampling method, U computation, scaling
Judging Success: Precision and Recall
• Measurement of performance for document retrieval
• Average precision and recall, where the average is taken over all queries in the data set
• Let Retrieved = number of documents retrieved, Relevant = total number of documents relevant to the query, RetRel = number of retrieved documents that are relevant
• Precision: RetRel / Retrieved
• Recall: RetRel / Relevant
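Per query (then averaged over all queries), the computation is, as a minimal sketch:

```python
def precision_recall(retrieved, relevant):
    """Precision = RetRel / Retrieved; Recall = RetRel / Relevant,
    for one query's retrieved list and relevant set."""
    retrel = len(set(retrieved) & set(relevant))
    precision = retrel / len(retrieved) if retrieved else 0.0
    recall = retrel / len(relevant) if relevant else 0.0
    return precision, recall
```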
LSI Results
• Term-document matrix size: 5831 x 1033
• All matrix approximations are rank 100 approximations (CUR: r = c = k)
• Average query time is less than 10^-3 seconds for all matrix approximations
LSI Results
• [Plot] Term-document matrix size: 5831 x 1033; all matrix approximations are rank 100 approximations (CUR: r = c = k)
Matrix Approximation Results

Method            Rel. Error (F-norm)   Storage (nz)   Runtime (sec)
SVD               0.8203                686500         22.5664
NMF               0.8409                686400         23.0210
CUR: cn, lin      1.4151                17242          0.1741
CUR: cn, opt      0.9724                16358          0.2808
CUR: sub, lin     1.2093                16175          48.7651
CUR: sub, opt     0.9615                16108          49.0830
CUR: w/oR, no     0.9931                17932          0.3466
CUR: w/oR, yes    0.9957                17220          0.2734
CUR: GWS          0.9437                25020          2.2857
LTM               --                    52003          --
Conclusions
• We may not be able to store an entire term-document matrix, and it may be too expensive to compute an SVD
• We can achieve LSI results that are almost as good with cheaper approximations
  • Less storage
  • Less computation time
Completed Project Goals
• Code/validate NMF and CUR
• Analyze relative error, runtime, and storage of NMF and CUR
• Improve the CUR algorithm of [3]
• Analyze the use of NMF and CUR in LSI
References
[1] Michael W. Berry, Murray Browne, Amy N. Langville, V. Paul Pauca, and Robert J. Plemmons. Algorithms and applications for approximate nonnegative matrix factorization. Computational Statistics and Data Analysis, 52(1):155-173, September 2007.
[2] M. W. Berry, S. A. Pulatova, and G. W. Stewart. Computing sparse reduced-rank approximations to sparse matrices. Technical Report UMIACS TR-2004-34 / CMSC TR-4591, University of Maryland, May 2004.
[3] Petros Drineas, Ravi Kannan, and Michael W. Mahoney. Fast Monte Carlo algorithms for matrices III: Computing a compressed approximate matrix decomposition. SIAM Journal on Computing, 36(1):184-206, 2006.
[4] Petros Drineas, Michael W. Mahoney, and S. Muthukrishnan. Relative-error CUR matrix decompositions. SIAM Journal on Matrix Analysis and Applications, 30(2):844-881, 2008.
[5] Tamara G. Kolda and Dianne P. O'Leary. A semidiscrete matrix decomposition for latent semantic indexing in information retrieval. ACM Transactions on Information Systems, 16(4):322-346, October 1998.
[6] Jimeng Sun, Yinglian Xie, Hui Zhang, and Christos Faloutsos. Less is more: Sparse graph mining with compact matrix decomposition. Statistical Analysis and Data Mining, 1(1):6-22, February 2008.