Using the Cell to Perform Latent Semantic Indexing Luke Georgalas Andrew Raim
What Is LSI? • Information retrieval technique performed on a corpus of documents • Perform Principal Component Analysis (PCA) to reduce dimensionality of a term-document matrix • Semantically related terms and documents get grouped together
What Can You Do With LSI? • Compare documents in a corpus to see how related they are • Compare terms in a corpus to see how related they are • Compare terms with documents • Incorporate new documents into the concept space and find closest matches (queries can be considered documents)
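The comparisons listed above all reduce to measuring how close two vectors are in the concept space. The slides do not name a similarity measure; the sketch below assumes cosine similarity, a common choice for LSI. The function names and the sample coordinates are hypothetical.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def closest_matches(query, doc_vectors):
    """Return document indices ordered from most to least similar to the query."""
    scores = [cosine_similarity(query, d) for d in doc_vectors]
    return sorted(range(len(doc_vectors)), key=lambda j: scores[j], reverse=True)

# Hypothetical 2-dimensional concept-space coordinates for three documents.
doc_vectors = [[0.9, 0.1], [0.1, 0.8], [0.7, 0.3]]
ranking = closest_matches([1.0, 0.0], doc_vectors)
```

The same two functions cover term-to-term and term-to-document comparison, provided both vectors live in the same r-dimensional space.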
How Does LSI Work? • Term-document matrix is constructed • Singular Value Decomposition (SVD) is performed • M = T S D’, where M is the t × d term-document matrix, T is t × r, S is r × r, and D’ is r × d (t terms, d documents, r dimensions) * Diagram is from Dr. Charles Nicholas’ lecture notes for CMSC676
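The first step, constructing the term-document matrix, could look roughly like this. It is a minimal sketch that assumes raw term counts and whitespace tokenization; the slides do not specify a weighting scheme, and the function name and sample corpus are hypothetical.

```python
from collections import Counter

def term_document_matrix(docs):
    """Return (terms, M), where M[i][j] is the count of term i in document j."""
    counts = [Counter(doc.lower().split()) for doc in docs]
    terms = sorted(set().union(*counts))          # one row per distinct term
    matrix = [[c[term] for c in counts] for term in terms]
    return terms, matrix

# Hypothetical three-document corpus.
docs = ["cell processor speed", "semantic indexing", "cell indexing speed speed"]
terms, M = term_document_matrix(docs)
```

Here M has one row per term (t = 5) and one column per document (d = 3), which is the matrix that the SVD step then factors.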
How Does LSI Work? • D, a representation of the documents in M in r dimensions • T, a matrix for transforming new documents • Diagonal matrix S gives relative importance of dimensions • Dimensions represent semantic concepts * This slide is from Dr. Charles Nicholas’ lecture notes for CMSC676
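To see why T can transform new documents: the columns of T are orthonormal, so M = T S D’ gives D’ = S^-1 T’ M, and a new term-count vector q can be mapped the same way, as S^-1 T’ q. The sketch below assumes T is stored as a list of rows and S as a list of its diagonal entries; the function name is hypothetical.

```python
def fold_in(T, s, q):
    """Project a term-count vector q (length t) into the r-dimensional concept space.

    T is t x r (a list of rows) and s holds the r nonzero singular values
    from M = T S D'. The result is S^-1 T' q.
    """
    t, r = len(q), len(s)
    return [sum(T[i][k] * q[i] for i in range(t)) / s[k] for k in range(r)]

# Hypothetical factors: 3 terms, 2 concept dimensions.
T = [[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]]
s = [2.0, 1.0]
coords = fold_in(T, s, [4, 3, 7])
```

A query is handled exactly like a new document: build its term-count vector over the same terms, fold it in, and compare the result with the rows of D.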
On The Cell? • Term-document matrices get very large with large corpora • One reason LSI is not commonly used in search engines is that SVD is slow
Existing Serial Libraries • LAPACK (U of Tennessee linear algebra library) • QR algorithm, O(n^3) • DQDS algorithm – only calculates S • Divide-and-conquer – only calculates S • Bisection method and inverse iteration – only calculates singular values and vectors of interest
Existing Serial Libraries • GSL (GNU Scientific Library) • Golub-Reinsch algorithm • Modified Golub-Reinsch algorithm for M >> N • Jacobi orthogonalization
Existing Serial Libraries • Stream Hestenes SVD • Modified Hestenes algorithm to support stream processing • Requires R^2 processing elements to work most efficiently * http://www.lcs.mit.edu/publications/pubs/pdf/MIT-LCS-TM-641.pdf
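For orientation, the Hestenes method is a one-sided Jacobi iteration: it rotates pairs of columns until all columns are mutually orthogonal, at which point the column norms are the singular values. The following is a minimal serial sketch in pure Python. It is not the stream variant from the memo above and not GSL's implementation; the function name, tolerance, and sweep limit are assumptions.

```python
import math

def hestenes_svd(A, tol=1e-12, max_sweeps=60):
    """One-sided Jacobi SVD of an m x n matrix given as a list of rows.

    Returns (U, sing, V) with A ~= U * diag(sing) * V'. Singular values are
    returned unsorted.
    """
    m, n = len(A), len(A[0])
    U = [list(map(float, row)) for row in A]      # working copy; ends as U * diag(sing)
    V = [[float(i == j) for j in range(n)] for i in range(n)]
    for _ in range(max_sweeps):
        rotated = False
        for p in range(n - 1):
            for q in range(p + 1, n):
                alpha = sum(U[k][p] ** 2 for k in range(m))
                beta = sum(U[k][q] ** 2 for k in range(m))
                gamma = sum(U[k][p] * U[k][q] for k in range(m))
                if abs(gamma) <= tol * math.sqrt(alpha * beta):
                    continue                      # columns p and q already orthogonal
                rotated = True
                # Rotation angle that zeroes the inner product of columns p and q.
                zeta = (beta - alpha) / (2.0 * gamma)
                t = math.copysign(1.0, zeta) / (abs(zeta) + math.sqrt(1.0 + zeta * zeta))
                c = 1.0 / math.sqrt(1.0 + t * t)
                s = c * t
                for X in (U, V):                  # apply the same rotation to both
                    for row in X:
                        xp, xq = row[p], row[q]
                        row[p] = c * xp - s * xq
                        row[q] = s * xp + c * xq
        if not rotated:
            break                                 # a full sweep made no rotation
    sing = [math.sqrt(sum(U[k][j] ** 2 for k in range(m))) for j in range(n)]
    for j, sv in enumerate(sing):
        if sv > 0.0:
            for row in U:
                row[j] /= sv                      # normalize column j of U
    return U, sing, V

A = [[3.0, 0.0], [4.0, 5.0]]
U, sing, V = hestenes_svd(A)
```

Each rotation touches only two columns, so rotations on disjoint column pairs do not interfere with one another, which is what makes this family of methods a candidate for parallel hardware.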
Possible Projects • Algorithm – SVD only • Start with a serial routine • See what we can parallelize • See what we can vectorize • Application – indexer • Create a Cell-optimized version of SVD • Index a corpus • Run queries