Latent Semantic Indexing for the Routing Problem. Doctorate course “Web Information Retrieval” PhD Student Irina Veredina University of Trento June 5, 2006. Index. The Problem The Concept Advantages and Drawbacks LSI and VSM: comparison LSI and Routing Problem Conclusions.
Latent Semantic Indexing for the Routing Problem
Doctorate course “Web Information Retrieval”
PhD Student Irina Veredina
University of Trento
June 5, 2006
Index
• The Problem
• The Concept
• Advantages and Drawbacks
• LSI and VSM: comparison
• LSI and Routing Problem
• Conclusions
1. The Problem
The Vector Space Model (VSM), long a standard in IR, has a flaw:
• It ignores both the order of terms and the associations between them.
1. The Problem (cont.)
The document-by-term matrix is sufficient to represent the collection. But:
• Some of the information it contains can actually hinder document retrieval.
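The flaw can be seen on a toy collection. The sketch below is illustrative only: the four terms, three documents, and the query are hypothetical and do not come from the slides. It builds a small term-by-document matrix and scores a one-word query with cosine similarity, as a plain VSM would.

```python
import numpy as np

# Hypothetical toy vocabulary; rows are terms, columns are documents d0..d2.
terms = ["car", "automobile", "engine", "flower"]
A = np.array([
    [1, 0, 0],  # car
    [0, 1, 0],  # automobile
    [1, 1, 0],  # engine
    [0, 0, 1],  # flower
], dtype=float)


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))


# Query containing only the term "car".
query = np.array([1, 0, 0, 0], dtype=float)

# Score every document against the query in the original term space.
sims = [cosine(query, A[:, j]) for j in range(A.shape[1])]
print(sims)
```

Document d1 ("automobile engine") shares no literal term with the query, so it scores exactly as low as the unrelated document d2, even though it is about the same concept as d0.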
2. The Concept
The solution: a smaller, more tractable representation of terms and documents that retains only the most important information from the original matrix may actually improve both the quality and the speed of the retrieval system.
2. The Concept (cont.)
Latent Semantic Indexing (LSI) is a technique that projects queries and documents into a space with “latent” semantic dimensions.
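A minimal sketch of such a projection, assuming a hypothetical four-term, three-document collection and k = 2 latent dimensions (neither is taken from the slides). Here the projection uses the first k left singular vectors of the term-by-document matrix; other scalings of the latent coordinates are also in common use.

```python
import numpy as np

# Hypothetical toy term-by-document matrix (rows: terms, columns: documents).
A = np.array([
    [1, 0, 0],  # car
    [0, 1, 0],  # automobile
    [1, 1, 0],  # engine
    [0, 0, 2],  # flower
], dtype=float)

k = 2  # number of latent dimensions kept (an arbitrary choice for this toy case)
U, s, Vt = np.linalg.svd(A, full_matrices=False)
U_k = U[:, :k]  # basis of the latent space, expressed in term coordinates


def to_latent(x):
    """Project a term-space vector (query or document) into the latent space."""
    return U_k.T @ x


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))


query = np.array([1, 0, 0, 0], dtype=float)  # query containing only "car"
q_k = to_latent(query)

# Compare the projected query with every projected document.
latent_sims = [cosine(q_k, to_latent(A[:, j])) for j in range(A.shape[1])]
print(latent_sims)
```

In this toy case "car" and "automobile" both co-occur with "engine", so they load on the same latent dimension, and the "automobile engine" document matches the query "car" even though the two share no term.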
2. The Concept (cont.)
LSI is a method for dimensionality reduction:
• A high-dimensional space is represented in a low-dimensional space (often two or three dimensions when the goal is visualization; LSI typically keeps a few hundred).
2. The Concept (cont.)
LSI is the application of a particular mathematical technique, Singular Value Decomposition (SVD), to a word-by-document matrix. SVD (and hence LSI) is a least-squares method.
2. The Concept (cont.)
How does SVD work?
SVD takes the matrix A and represents it as A′ in a lower-dimensional space such that the “distance” between the two matrices, measured by the 2-norm, is minimized:
$\Delta = \|A - A'\|_2$
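The minimization can be checked numerically. The sketch below uses an arbitrary random matrix as a stand-in for a word-by-document matrix and an arbitrary rank k = 2 (both are assumptions for illustration). It builds the rank-k approximation from the truncated SVD and measures the 2-norm distance to the original.

```python
import numpy as np

# Arbitrary stand-in for a word-by-document matrix (6 terms, 5 documents).
rng = np.random.default_rng(0)
A = rng.random((6, 5))

# Full (thin) SVD: A = U @ diag(s) @ Vt, singular values in descending order.
U, s, Vt = np.linalg.svd(A, full_matrices=False)

k = 2  # number of dimensions kept
# Rank-k approximation: keep only the k largest singular values.
A_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

# 2-norm (largest singular value) of the residual.
delta = np.linalg.norm(A - A_k, ord=2)

print(delta, s[k])  # the two values agree up to rounding
```

By the Eckart-Young theorem, no rank-k matrix is closer to A in the 2-norm, and the remaining distance equals the first discarded singular value, `s[k]`.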
3. Advantages and Drawbacks
Advantages of LSI: it addresses three problems of term matching.
• Synonymy (the same underlying concept can be described using different terms)
• Polysemy (a single word can have more than one meaning)
• Term dependence (performance can improve when common phrases are added as search items)
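3. Advantages and Drawbacks (cont.)
Drawbacks of LSI:
• Storage (the SVD representation has fewer dimensions, but its vectors are dense and real-valued, whereas the original term vectors are sparse)
• Efficiency (without an inverted index, the query must be compared to every document in the collection)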
4. LSI and VSM: comparison
Two test collections: MED and CISI.
• MED: LSI improves average precision from 0.45 to 0.51, with the largest benefits found at high recall.
• CISI: no significant difference between LSI and VSM is found.
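For reference, one common way to compute average precision for a single ranked result list is sketched below. The slides do not say which variant was used for MED and CISI, so this non-interpolated form is an assumption, and the ranked list is hypothetical.

```python
def average_precision(ranked_relevance):
    """Non-interpolated average precision for one ranked list.

    ranked_relevance: booleans, True where the document at that rank is
    relevant. Assumes the ranking covers every relevant document.
    """
    hits = 0
    total = 0.0
    for rank, is_relevant in enumerate(ranked_relevance, start=1):
        if is_relevant:
            hits += 1
            total += hits / rank  # precision at this relevant document
    return total / hits if hits else 0.0


# Hypothetical ranking: relevant documents at ranks 1 and 3.
ap = average_precision([True, False, True, False])
print(ap)  # (1/1 + 2/3) / 2
```

Averaging this value over all queries in a collection gives a single figure comparable in spirit to the 0.45 and 0.51 reported above.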
5. LSI and Routing Problem
The routing problem is a special case of the classification problem in which there are only two groups of documents: relevant and nonrelevant.
5. LSI and Routing Problem (cont.)
To test the performance of LSI on the routing task, cross-validation is used.
5. LSI and Routing Problem (cont.)
Cross-validation: remove one document at a time from the collection, then use the remaining documents to try to predict the relevance of the removed document. Precision and recall are used to evaluate the performance.
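The leave-one-out procedure can be sketched as follows. The document vectors, the relevance labels, and the nearest-centroid predictor are all hypothetical stand-ins chosen for illustration; the slides do not specify how relevance was predicted.

```python
import numpy as np

# Hypothetical 2-D document vectors (e.g. latent coordinates) and labels:
# 1 = relevant, 0 = nonrelevant.
X = np.array([
    [1.0, 0.2], [0.9, 0.1], [0.8, 0.3],
    [0.2, 0.9], [0.1, 1.0], [0.3, 0.8], [0.7, 0.6],
])
y = np.array([1, 1, 1, 0, 0, 0, 0])


def predict(train_X, train_y, x):
    """Assumed predictor: assign x to the class with the nearer centroid."""
    c_rel = train_X[train_y == 1].mean(axis=0)
    c_non = train_X[train_y == 0].mean(axis=0)
    return int(np.linalg.norm(x - c_rel) < np.linalg.norm(x - c_non))


# Leave-one-out: hold out each document in turn, train on the rest.
preds = []
for i in range(len(X)):
    keep = np.arange(len(X)) != i
    preds.append(predict(X[keep], y[keep], X[i]))
preds = np.array(preds)

tp = int(np.sum((preds == 1) & (y == 1)))  # relevant, predicted relevant
fp = int(np.sum((preds == 1) & (y == 0)))  # nonrelevant, predicted relevant
fn = int(np.sum((preds == 0) & (y == 1)))  # relevant, predicted nonrelevant

precision = tp / (tp + fp) if tp + fp else 0.0
recall = tp / (tp + fn) if tp + fn else 0.0
print(precision, recall)
```

On this toy data the borderline nonrelevant document is misclassified as relevant, which lowers precision while recall stays at its maximum.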
5. LSI and Routing Problem (cont.)
Results: LSI does not greatly improve performance over the vector space model for the routing problem, although the difference is measurable:

Evaluation method   VSM     LSI
Avg. precision      0.405   0.451
Avg. recall         0.758   0.811
5. LSI and Routing Problem (cont.)
To obtain a significant improvement in retrieval performance, LSI can be used in conjunction with statistical classification.
5. LSI and Routing Problem (cont.)
The general statistical classification problem: a population consists of two or more groups, and there exists a training sample for which the class of each element is known and a test sample for which the class is unknown. The goal is to produce a classification rule that predicts the class of the unknown elements.
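As one concrete instance of such a classification rule, the sketch below fits a two-class Fisher linear discriminant on a training sample and applies it to a test sample. This is a generic textbook discriminant, not necessarily the method used in the presentation; the data, the equal-prior threshold, and the small ridge term are assumptions.

```python
import numpy as np

# Hypothetical training sample (class known) and test sample (class unknown).
train_X = np.array([
    [1.0, 0.2], [0.9, 0.1], [0.8, 0.3],   # relevant
    [0.2, 0.9], [0.1, 1.0], [0.3, 0.8],   # nonrelevant
])
train_y = np.array([1, 1, 1, 0, 0, 0])
test_X = np.array([[0.85, 0.25], [0.15, 0.95]])


def fit_lda(X, y, ridge=1e-6):
    """Fit a two-class Fisher discriminant; returns (direction, threshold)."""
    X1, X0 = X[y == 1], X[y == 0]
    m1, m0 = X1.mean(axis=0), X0.mean(axis=0)
    # Pooled within-class scatter; the ridge keeps it invertible.
    S = (X1 - m1).T @ (X1 - m1) + (X0 - m0).T @ (X0 - m0)
    S += ridge * np.eye(X.shape[1])
    w = np.linalg.solve(S, m1 - m0)
    # Midpoint between projected class means (assumes equal priors).
    threshold = w @ (m1 + m0) / 2
    return w, threshold


def classify(X, w, threshold):
    """Classification rule: 1 (relevant) if the projection exceeds the threshold."""
    return (X @ w > threshold).astype(int)


w, threshold = fit_lda(train_X, train_y)
test_pred = classify(test_X, w, threshold)
print(test_pred)
```

In a routing setting the input vectors would typically be the reduced LSI representations of the documents rather than raw term counts, which keeps the scatter matrix small enough to invert.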
5. LSI and Routing Problem (cont.)
Results: the performance is significantly improved:

Evaluation method   VSM     LSI     TDA
Avg. precision      0.405   0.451   0.604
Avg. recall         0.758   0.811   0.830

TDA: method for text-based discriminant analysis.
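To put the reported figures in proportion, the short sketch below computes the relative gain of LSI and TDA over the VSM baseline. The numbers are the ones reported above; the choice of relative gain as the summary measure is ours.

```python
# (average precision, average recall) as reported in the results above.
results = {
    "VSM": (0.405, 0.758),
    "LSI": (0.451, 0.811),
    "TDA": (0.604, 0.830),
}

base_precision, base_recall = results["VSM"]

# Relative gain over the VSM baseline, in percent.
gains = {}
for method, (precision, recall) in results.items():
    if method == "VSM":
        continue
    gains[method] = (
        100 * (precision - base_precision) / base_precision,
        100 * (recall - base_recall) / base_recall,
    )

for method, (p_gain, r_gain) in gains.items():
    print(f"{method}: precision {p_gain:+.1f}%, recall {r_gain:+.1f}%")
```

The precision gain of TDA over VSM is roughly four times that of LSI alone, while the recall gains of the two methods are much closer to each other.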
6. Conclusions
LSI addresses the problem of term independence by re-expressing the term-document matrix in a new coordinate system that captures the most significant components of the term association structure.
Thank You!