Latent Semantic Indexing for the Routing Problem

Latent Semantic Indexing for the Routing Problem. Doctorate course “Web Information Retrieval” PhD Student Irina Veredina University of Trento June 5, 2006. Index. The Problem The Concept Advantages and Drawbacks LSI and VSM: comparison LSI and Routing Problem Conclusions.

Presentation Transcript


  1. Latent Semantic Indexing for the Routing Problem. Doctorate course “Web Information Retrieval”. PhD Student Irina Veredina, University of Trento, June 5, 2006

  2. Index • The Problem • The Concept • Advantages and Drawbacks • LSI and VSM: comparison • LSI and Routing Problem • Conclusions

  3. 1. The Problem: The Vector Space Model (VSM), long a standard in IR, has a flaw: • It ignores both the order of terms and the associations between them.
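
A minimal sketch of the flaw named on this slide, assuming simple whitespace tokenization and raw term counts (the slide does not specify a weighting scheme). The helper `bag_of_words` and the toy sentences are hypothetical, chosen only for illustration.

```python
from collections import Counter

def bag_of_words(text, vocabulary):
    """Return the term-count vector of `text` over a fixed vocabulary."""
    counts = Counter(text.lower().split())
    return [counts[term] for term in vocabulary]

# Word order is lost: two sentences with opposite meanings get the same vector.
order_vocab = ["dog", "bites", "man"]
vec_a = bag_of_words("dog bites man", order_vocab)
vec_b = bag_of_words("man bites dog", order_vocab)
print(vec_a, vec_b, vec_a == vec_b)

# Term association is lost: synonyms occupy unrelated (orthogonal) axes.
synonym_vocab = ["car", "automobile"]
vec_c = bag_of_words("car", synonym_vocab)
vec_d = bag_of_words("automobile", synonym_vocab)
dot = sum(x * y for x, y in zip(vec_c, vec_d))
print(dot)  # a dot product of 0 means VSM sees no similarity at all
```

The two "dog"/"man" sentences are indistinguishable to VSM, while the two synonym documents look completely unrelated.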

  4. 1. The Problem (cont.) The document-by-term matrix is sufficient to represent the collection, but: • Some of the information it contains can actually hinder document retrieval.

  6. 2. The Concept: The solution is a smaller, more tractable representation of terms and documents that retains only the most important information from the original matrix. It may improve both the quality and the speed of the retrieval system.

  7. 2. The Concept (cont.) Latent Semantic Indexing (LSI) is a technique that projects queries and documents into a space with “latent” semantic dimensions.

  8. 2. The Concept (cont.) LSI is a method for dimensionality reduction: • A high-dimensional space is represented in a low-dimensional space (often two or three dimensions when the goal is visualization).

  9. 2. The Concept (cont.) LSI is the application of a particular mathematical technique, Singular Value Decomposition (SVD), to a word-by-document matrix. SVD (and hence LSI) is a least-squares method.
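
A small sketch of how SVD can be applied to a word-by-document matrix and how a query can then be projected into the latent space. The toy matrix, the choice `k = 2`, and the helper names `fold_in` and `cosine` are assumptions made for illustration; the projection uses the commonly described "folding-in" formula `q_hat = q U_k Σ_k^{-1}`, which the slides do not spell out.

```python
import numpy as np

# Toy word-by-document matrix (rows = terms, columns = documents).
terms = ["car", "automobile", "engine", "flower", "petal"]
A = np.array([
    [1, 0, 1, 0, 0],   # car
    [0, 1, 1, 0, 0],   # automobile
    [1, 1, 1, 0, 0],   # engine
    [0, 0, 0, 1, 1],   # flower
    [0, 0, 0, 1, 1],   # petal
], dtype=float)

# Thin SVD: A = U @ diag(s) @ Vt, singular values in descending order.
U, s, Vt = np.linalg.svd(A, full_matrices=False)

k = 2                          # number of latent dimensions kept (assumed)
U_k, s_k = U[:, :k], s[:k]
doc_vectors = Vt[:k, :].T      # one k-dimensional row per document

def fold_in(q):
    """Project a term-space query vector into the k-dimensional latent space."""
    return (q @ U_k) / s_k

def cosine(a, b):
    """Cosine similarity, returning 0.0 if either vector is all zeros."""
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b / denom) if denom > 0 else 0.0

query = np.array([1, 0, 0, 0, 0], dtype=float)   # the single term "car"
q_hat = fold_in(query)

# Document 1 contains "automobile" and "engine" but never "car".
raw_score = cosine(query, A[:, 1])
latent_scores = [cosine(q_hat, d) for d in doc_vectors]
print("VSM score for document 1:", raw_score)
print("LSI scores for all documents:", np.round(latent_scores, 3))
```

In the original term space the query shares no term with document 1, so its score is zero; in the two-dimensional latent space the three vehicle documents score highly and the two flower documents do not. Note that the query is scored against every document vector, which is the efficiency cost of this representation.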

  10. 2. The Concept (cont.) How does SVD work? SVD takes the matrix $A$ and represents it as $A'$ in a lower-dimensional space such that the “distance” between the two matrices is minimized: $\Delta = \lVert A - A' \rVert_2$
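
The minimization above can be checked numerically. The sketch below assumes a small random matrix and `k = 2`, both chosen only for illustration. It builds the rank-`k` truncated SVD and verifies the standard result (Eckart-Young) that its 2-norm error equals the first discarded singular value, and that an arbitrary rank-`k` matrix does no better.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.random((6, 4))             # stand-in for a word-by-document matrix

U, s, Vt = np.linalg.svd(A, full_matrices=False)

k = 2                              # target rank (assumed)
A_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]    # rank-k approximation A'

# Distance between A and A' in the 2-norm and in the Frobenius norm.
err_2 = np.linalg.norm(A - A_k, ord=2)
err_f = np.linalg.norm(A - A_k, ord="fro")

# The 2-norm error is the (k+1)-th singular value; the Frobenius error is
# the root of the sum of squares of all discarded singular values.
print(np.isclose(err_2, s[k]))
print(np.isclose(err_f, np.sqrt(np.sum(s[k:] ** 2))))

# Any other rank-k matrix is at least as far from A in the 2-norm.
B = rng.random((6, k)) @ rng.random((k, 4))
err_other = np.linalg.norm(A - B, ord=2)
print(err_other >= err_2)
```

All three checks should print `True`.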

  12. 3. Advantages and Drawbacks: Advantages of LSI. It helps with: • Synonymy (the same underlying concept can be described using different terms) • Polysemy (a word can have more than one meaning) • Term dependence (associations between terms, which are otherwise exploited by adding common phrases as search items to improve performance)

  13. 3. Advantages and Drawbacks (cont.) Drawbacks of LSI: • Storage (one might expect the SVD representation to be more compact, but its vectors are dense, whereas the original term-document matrix is sparse) • Efficiency (with LSI the query must be compared to every document in the collection)

  15. 4. LSI and VSM: comparison. Two collections of data: MED and CISI. • MED: LSI improves average precision from 0.45 to 0.51, with the largest benefits found at high recall • CISI: no significant difference between LSI and VSM was found

  17. 5. LSI and Routing Problem: The routing problem is a special case of the classification problem, since there are only two groups of documents: relevant and nonrelevant.

  18. 5. LSI and Routing Problem (cont.) Cross-validation is used to test the performance of LSI on the routing task.

  19. 5. LSI and Routing Problem (cont.) Cross-validation: The strategy is to remove one document at a time from the collection and then use the remaining documents to predict the relevance of the held-out document. Precision and recall are used to evaluate the performance.
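
The leave-one-out strategy described on this slide can be sketched as follows. The slides do not say which rule predicts relevance, so the nearest-centroid rule here is a hypothetical stand-in, and the six two-dimensional document vectors and their labels are invented toy data.

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity between two non-zero vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def leave_one_out_predictions(X, y):
    """Predict each document's relevance from all the other documents.

    Assumed rule (not from the slides): a held-out document is called
    relevant if it is closer, by cosine, to the centroid of the remaining
    relevant documents than to the centroid of the nonrelevant ones.
    """
    n = len(y)
    predictions = np.zeros(n, dtype=int)
    for i in range(n):
        keep = np.arange(n) != i               # drop document i
        X_train, y_train = X[keep], y[keep]
        relevant_centroid = X_train[y_train == 1].mean(axis=0)
        nonrelevant_centroid = X_train[y_train == 0].mean(axis=0)
        predictions[i] = int(
            cosine(X[i], relevant_centroid) > cosine(X[i], nonrelevant_centroid)
        )
    return predictions

def precision_recall(y_true, y_pred):
    """Precision and recall for the 'relevant' class (label 1)."""
    tp = int(np.sum((y_pred == 1) & (y_true == 1)))
    fp = int(np.sum((y_pred == 1) & (y_true == 0)))
    fn = int(np.sum((y_pred == 0) & (y_true == 1)))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# Toy document vectors; the last nonrelevant one resembles the relevant group.
X = np.array([[1.0, 0.1], [0.9, 0.2], [0.8, 0.1],
              [0.2, 0.9], [0.1, 1.0], [0.9, 0.3]])
y = np.array([1, 1, 1, 0, 0, 0])

y_pred = leave_one_out_predictions(X, y)
precision, recall = precision_recall(y, y_pred)
print(y_pred, precision, recall)
```

On this toy data all three relevant documents are recovered and one nonrelevant document is wrongly flagged, giving precision 0.75 and recall 1.0.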

  20. 5. LSI and Routing Problem (cont.) Results: LSI does not greatly improve performance over the vector space model for the routing problem, although the difference is measurable:

      Evaluation method   VSM     LSI
      Avg. precision      0.405   0.451
      Avg. recall         0.758   0.811

  21. 5. LSI and Routing Problem (cont.) To obtain a significant improvement in retrieval performance, LSI can be used in conjunction with statistical classification.

  22. 5. LSI and Routing Problem (cont.) The general statistical classification problem: A population consists of two or more groups. There is a training sample for which the class of each element is known, and a test sample for which the class is unknown. The goal is to produce a classification rule that predicts the class of the unknown elements.
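
As an illustration of such a classification rule, here is a minimal two-class linear discriminant with a pooled covariance estimate and equal priors. This is a generic textbook rule used as a stand-in; it is not claimed to be the method evaluated in the slides. The function names and the toy training and test vectors are hypothetical.

```python
import numpy as np

def fit_linear_discriminant(X, y):
    """Fit a two-class linear discriminant from a labeled training sample.

    Returns a weight vector w and a threshold; a point x is assigned to
    class 1 (relevant) when w @ x exceeds the threshold. Equal class
    priors are assumed.
    """
    X1, X0 = X[y == 1], X[y == 0]
    mu1, mu0 = X1.mean(axis=0), X0.mean(axis=0)
    # Pooled within-class covariance (two classes, so n - 2 degrees of freedom).
    scatter = (X1 - mu1).T @ (X1 - mu1) + (X0 - mu0).T @ (X0 - mu0)
    pooled_cov = scatter / (len(y) - 2)
    w = np.linalg.solve(pooled_cov, mu1 - mu0)
    threshold = w @ (mu1 + mu0) / 2            # midpoint between class means
    return w, threshold

def predict(X, w, threshold):
    """Apply the fitted rule to elements whose class is unknown."""
    return (X @ w > threshold).astype(int)

# Training sample: class known (1 = relevant, 0 = nonrelevant).
X_train = np.array([[2.0, 1.0], [2.5, 1.5], [3.0, 1.0], [2.5, 0.5],
                    [0.0, 1.0], [0.5, 1.5], [1.0, 1.0], [0.5, 0.5]])
y_train = np.array([1, 1, 1, 1, 0, 0, 0, 0])

# Test sample: class unknown.
X_test = np.array([[2.2, 0.8], [0.3, 1.2]])

w, threshold = fit_linear_discriminant(X_train, y_train)
test_predictions = predict(X_test, w, threshold)
print(w, threshold, test_predictions)
```

In a routing setting the rows of `X_train` would be document vectors, for example the reduced LSI coordinates, rather than the invented points used here.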

  23. 5. LSI and Routing Problem (cont.) Results: The performance is significantly improved:

      Evaluation method   VSM     LSI     TDA
      Avg. precision      0.405   0.451   0.604
      Avg. recall         0.758   0.811   0.830

      TDA: a method for text-based discriminant analysis.

  25. 6. Conclusions: LSI addresses the term-independence assumption of VSM by re-expressing the term-document matrix in a new coordinate system that captures the most significant components of the term association structure.

  26. Thank You!
