
Using the Cell to Perform Latent Semantic Indexing

Luke Georgalas, Andrew Raim


Presentation Transcript


  1. Using the Cell to Perform Latent Semantic Indexing • Luke Georgalas • Andrew Raim

  2. What Is LSI? • Information retrieval technique performed on a corpus of documents • Perform Principal Component Analysis (PCA) to reduce dimensionality of a term-document matrix • Semantically related terms and documents get grouped together

  3. What Can You Do With LSI? • Compare documents in a corpus to see how related they are • Compare terms in a corpus to see how related they are • Compare terms with documents • Incorporate new documents into the concept space and find closest matches (queries can be considered documents)
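A minimal sketch of the comparisons listed on this slide, assuming cosine similarity as the relatedness measure (a common choice, but not one the slides state). The function names and the concept-space coordinates are hypothetical, invented only for illustration.

```python
import math

def cosine(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # convention chosen here for all-zero vectors
    return dot / (norm_u * norm_v)

def rank_matches(query, docs):
    """Return document names sorted from most to least similar to the query."""
    return sorted(docs, key=lambda name: cosine(query, docs[name]), reverse=True)

# Hypothetical 2-D concept-space coordinates (not taken from any real corpus).
docs = {
    "doc_a": [0.9, 0.1],
    "doc_b": [0.8, 0.3],
    "doc_c": [0.1, 0.95],
}
query = [1.0, 0.2]  # a query is treated like any other document vector
ranking = rank_matches(query, docs)
```

The same `cosine` function would apply to term vectors, or to a term compared against a document, as long as both live in the same concept space.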

  4. How Does LSI Work? • Term-document matrix is constructed • Singular Value Decomposition (SVD) is performed • $M = T S D'$, i.e. $M_{t \times d} = T_{t \times r} \, \Sigma_{r \times r} \, D^{T}_{r \times d}$ * Diagram is from Dr. Charles Nicholas’ lecture notes for CMSC676
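The first step, constructing the term-document matrix, could be sketched as follows. This assumes raw term counts and whitespace tokenization, neither of which the slides specify; the function name and the toy corpus are hypothetical.

```python
from collections import Counter

def build_term_document_matrix(docs):
    """Return (terms, matrix) where matrix[i][j] counts term i in document j."""
    # Assumption: lowercase whitespace tokenization, no stemming or stop words.
    counts = [Counter(text.lower().split()) for text in docs]
    terms = sorted(set().union(*counts))
    # One row per term, one column per document (t x d).
    matrix = [[c[term] for c in counts] for term in terms]
    return terms, matrix

# Toy corpus invented for illustration.
corpus = [
    "cell processor vector unit",
    "vector matrix decomposition",
    "matrix decomposition of a term document matrix",
]
terms, M = build_term_document_matrix(corpus)
```

Real indexers usually apply a weighting scheme to these counts before the SVD; raw counts are used here only to keep the sketch short.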

  5. How Does LSI Work? • D, a representation of M in r dimensions • T, a matrix for transforming new documents • Diagonal matrix $\Sigma$ gives relative importance of dimensions • Dimensions represent semantic concepts * This slide is from Dr. Charles Nicholas’ lecture notes for CMSC676
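One way to see why T can transform new documents is the short derivation below. It assumes the columns of $T$ are orthonormal and that every retained singular value is nonzero, so $\Sigma$ is invertible; the symbols $q$ and $\hat{q}$ are introduced here and do not appear in the slides.

```latex
\begin{aligned}
M &= T \Sigma D^{T}, \qquad T^{T} T = I_r \\
T^{T} M &= \Sigma D^{T} \\
D^{T} &= \Sigma^{-1} T^{T} M \\
\hat{q} &= \Sigma^{-1} T^{T} q
  \qquad \text{for a new document or query } q \in \mathbb{R}^{t}
\end{aligned}
```

Each column of $D^{T}$ is the $r$-dimensional representation of one existing document, so applying the same map $\Sigma^{-1} T^{T}$ to a new term vector $q$ places it in the same concept space.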

  6. On The Cell? • Term-document matrices get very large with large corpora • One reason LSI is not commonly used in search engines is that SVD is slow

  7. Existing Serial Libraries • LAPACK (University of Tennessee linear algebra library) • QR algorithm, $O(n^3)$ • DQDS algorithm (only calculates S) • Divide-and-conquer (only calculates S) • Bisection method and inverse iteration (only calculates singular values and vectors of interest)

  8. Existing Serial Libraries • GSL (GNU Scientific Library) • Golub-Reinsch algorithm • Modified Golub-Reinsch algorithm for M >> N • Jacobi orthogonalization

  9. Existing Serial Libraries • Stream Hestenes SVD • Modified Hestenes algorithm to support stream processing • Requires $R^2$ processing elements to work most efficiently * http://www.lcs.mit.edu/publications/pubs/pdf/MIT-LCS-TM-641.pdf
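For orientation, here is a serial sketch of the classic one-sided Jacobi (Hestenes) method that the Jacobi-based routines above build on. It is not the stream variant from the cited memo and not the LAPACK or GSL code; the name `hestenes_svd` and its parameters are hypothetical.

```python
import math

def hestenes_svd(a, tol=1e-12, max_sweeps=60):
    """One-sided Jacobi SVD of an m x n list-of-lists matrix.

    Returns (u, s, v) with a ~= u * diag(s) * v^T, singular values descending.
    """
    m, n = len(a), len(a[0])
    w = [row[:] for row in a]                                  # becomes U * diag(s)
    v = [[float(i == j) for j in range(n)] for i in range(n)]  # accumulated rotations
    for _ in range(max_sweeps):
        rotated = False
        for p in range(n - 1):
            for q in range(p + 1, n):
                alpha = sum(w[i][p] ** 2 for i in range(m))
                beta = sum(w[i][q] ** 2 for i in range(m))
                gamma = sum(w[i][p] * w[i][q] for i in range(m))
                if abs(gamma) <= tol * math.sqrt(alpha * beta):
                    continue  # columns p and q are already orthogonal enough
                rotated = True
                # Plane rotation that makes columns p and q orthogonal.
                zeta = (beta - alpha) / (2.0 * gamma)
                t = math.copysign(1.0, zeta) / (abs(zeta) + math.sqrt(1.0 + zeta * zeta))
                c = 1.0 / math.sqrt(1.0 + t * t)
                s = c * t
                for mat in (w, v):
                    for row in mat:
                        xp, xq = row[p], row[q]
                        row[p] = c * xp - s * xq
                        row[q] = s * xp + c * xq
        if not rotated:
            break  # a full sweep with no rotations means convergence
    # Column norms of w are the singular values; normalized columns form U.
    norms = [math.sqrt(sum(w[i][j] ** 2 for i in range(m))) for j in range(n)]
    order = sorted(range(n), key=lambda j: norms[j], reverse=True)
    sing = [norms[j] for j in order]
    u = [[w[i][j] / norms[j] if norms[j] > 0.0 else 0.0 for j in order] for i in range(m)]
    v = [[v[i][j] for j in order] for i in range(n)]
    return u, sing, v

a = [[1.0, 1.0], [0.0, 1.0]]
u, s, v = hestenes_svd(a)
```

Each rotation touches only two columns, so rotations on disjoint column pairs are independent of one another, which is the property that makes this family of methods a natural candidate for parallel hardware.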

  10. Possible Projects • Algorithm (SVD only): start with a serial routine; see what we can parallelize; see what we can vectorize • Application (indexer): create a Cell-optimized version of SVD; index a corpus; run queries
