Comparison of information retrieval techniques: Latent semantic indexing (LSI) and Concept indexing (CI) Jasminka Dobša Faculty of organization and informatics, Varaždin
Outline • Information retrieval in vector space model (VSM) or bag of words representation • Techniques for conceptual indexing • Latent semantic indexing • Concept indexing • Comparison: Academic example • Experiment • Further work
Information retrieval in VSM 1/3 • Task of information retrieval: to extract, from a document collection, the documents that are relevant to a user's query • In the VSM documents are represented as vectors in a high-dimensional space • The dimension of the space depends on the number of indexing terms chosen as relevant for the collection (4000-5000 in my experiments) • The VSM is implemented by forming the term-document matrix
Information retrieval in VSM 2/3 • The term-document matrix is an m×n matrix where m is the number of terms and n is the number of documents • row of the term-document matrix = term • column of the term-document matrix = document • Figure 1. Term-document matrix
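As an illustration, here is a minimal Python sketch (not from the original work) of how such a term-document matrix can be built; the three example titles and the use of scikit-learn's TfidfVectorizer are my own choices, not the author's setup.

from sklearn.feature_extraction.text import TfidfVectorizer

# Three made-up book titles standing in for a document collection.
docs = [
    "using linear algebra for data mining",
    "survey of clustering algorithms for data mining",
    "matrix computations and applied linear algebra",
]

vectorizer = TfidfVectorizer(stop_words="english")
# fit_transform returns a documents-by-terms matrix; transpose it so that
# rows correspond to terms and columns to documents, as in Figure 1.
A = vectorizer.fit_transform(docs).T.toarray()
print(A.shape)  # (number of terms m, number of documents n)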
Information retrieval in VSM 3/3 • A query has the same shape as a document (an m-dimensional vector) • The measure of similarity between a query q and a document a_j is the cosine of the angle between the two vectors: cos(q, a_j) = q^T a_j / (||q|| ||a_j||)
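A minimal sketch of cosine-similarity ranking under these conventions (columns of A are documents, q is an m-dimensional query vector); the function name rank_documents is mine, chosen for illustration.

import numpy as np

def rank_documents(A, q):
    """Rank the columns of the term-document matrix A by the cosine of
    the angle between each document vector and the query vector q."""
    sims = (q @ A) / (np.linalg.norm(q) * np.linalg.norm(A, axis=0))
    return np.argsort(-sims)  # document indices, most similar first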
Retrieval performance evaluation • Measures for evaluation: recall, precision, average precision • Recall at cutoff i = r_i / r_n • Precision at cutoff i = r_i / i • r_i is the number of relevant documents among the i highest ranked documents • r_n is the total number of relevant documents in the collection • Average precision – the average of the precisions at distinct levels of recall
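A sketch of these measures in Python, assuming ranking is a list of document indices ordered from most to least similar and relevant is the set of indices judged relevant; the helper names are illustrative, and average precision is computed here at the ranks where a relevant document appears, one common reading of "distinct levels of recall".

def precision_recall_at(ranking, relevant, i):
    """Precision and recall among the i highest-ranked documents."""
    r_i = len(set(ranking[:i]) & relevant)  # relevant docs in top i
    r_n = len(relevant)                     # total relevant docs
    return r_i / i, r_i / r_n               # precision, recall

def average_precision(ranking, relevant):
    """Average of the precisions at the ranks where recall increases."""
    hits = [precision_recall_at(ranking, relevant, i + 1)[0]
            for i, d in enumerate(ranking) if d in relevant]
    return sum(hits) / len(relevant)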
Techniques for conceptual indexing • In the term-matching method the similarity between query and document is tested lexically • Polysemy (one word having multiple meanings) and synonymy (multiple words having the same meaning) are two fundamental obstacles to efficient information retrieval • Here we compare two techniques for conceptual indexing based on projection of the document vectors (in the least-squares sense) onto a lower-dimensional vector space • Latent semantic indexing (LSI) • Concept indexing (CI)
Latent semantic indexing • Introduced in 1990; improved in 1995 • S. Deerwester, S. T. Dumais, G. Furnas, T. Landauer, R. Harshman: Indexing by latent semantic analysis, J. American Society for Information Science, 41, 1990, pp. 391-407 • M. W. Berry, S. T. Dumais, G. W. O'Brien: Using linear algebra for intelligent information retrieval, SIAM Review, 37, 1995, pp. 573-595 • Based on spectral analysis of the term-document matrix
Latent semantic indexing • For every m×n matrix A there is a singular value decomposition (SVD) A = UΣV^T where • U is an orthogonal m×m matrix whose columns are the left singular vectors of A • Σ is a diagonal matrix on whose diagonal are the singular values of A in descending order • V is an orthogonal n×n matrix whose columns are the right singular vectors of A
Latent semantic indexing • For LSI the truncated SVD is used: A_k = U_k Σ_k V_k^T where • U_k is an m×k matrix whose columns are the first k left singular vectors of A • Σ_k is a k×k diagonal matrix whose diagonal is formed by the k leading singular values of A • V_k is an n×k matrix whose columns are the first k right singular vectors of A • Rows of U_k = terms • Rows of V_k = documents
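A small numpy sketch of the truncated SVD in this notation (the function name is mine); the rows of the returned V_k are the k-dimensional representations of the documents.

import numpy as np

def truncated_svd(A, k):
    """Rank-k truncated SVD: returns U_k, Sigma_k, V_k,
    so that A_k = U_k @ Sigma_k @ V_k.T."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    return U[:, :k], np.diag(s[:k]), Vt[:k, :].T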
Latent semantic indexing • Using the truncated SVD we keep only the first k independent linear components of A (singular vectors and values) • Documents are projected in the least-squares sense onto the space spanned by the first k left singular vectors of A (the LSI space) • The first k components capture the major associational structure in the term-document matrix and throw out the noise • Minor differences in the terminology used in documents are ignored • Closeness of objects (queries and documents) is determined by the overall pattern of term usage, so it is context based • Documents which contain synonyms are closer in the LSI space than in the original space; documents which use a polysemous word in different contexts are farther apart in the LSI space than in the original space
Concept indexing (CI) • Indexing using the concept decomposition (CD) instead of the SVD used in LSI • Concept decomposition was introduced in 2001 • I. S. Dhillon, D. S. Modha: Concept decompositions for large sparse text data using clustering, Machine Learning, 42:1, 2001, pp. 143-175
Concept decomposition • First step: clustering of the documents (columns of the term-document matrix A) into k groups • Clustering algorithms: • spherical k-means algorithm • fuzzy k-means algorithm • The spherical k-means algorithm is a variant of the k-means algorithm which exploits the fact that the document vectors have unit norm • Centroids of the groups = concept vectors • The concept matrix C_k = [c_1, c_2, …, c_k] is the matrix whose columns are the centroids of the groups, c_j being the centroid of the j-th group (see the sketch below)
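A hedged sketch of this first step: spherical k-means is approximated here by running ordinary k-means on unit-normalized document vectors and renormalizing the group centroids, a common shortcut rather than the exact algorithm of Dhillon and Modha; the function name concept_matrix is mine.

import numpy as np
from sklearn.cluster import KMeans

def concept_matrix(A, k, random_state=0):
    """Cluster the unit-norm document vectors (columns of A) into k
    groups and return the m-by-k matrix of concept vectors."""
    docs = A.T / np.linalg.norm(A.T, axis=1, keepdims=True)  # unit rows
    labels = KMeans(n_clusters=k, random_state=random_state).fit_predict(docs)
    C = np.column_stack([docs[labels == j].mean(axis=0) for j in range(k)])
    return C / np.linalg.norm(C, axis=0, keepdims=True)  # unit-norm centroids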
Concept decomposition • Second step: calculating the concept decomposition • The concept decomposition D_k of the term-document matrix A is the least-squares approximation of A on the space of the concept vectors: D_k = C_k Z, where Z is the solution of the least squares problem min_Z ||A − C_k Z||_F • Rows of C_k = terms • Columns of Z = documents
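The second step is an ordinary linear least-squares problem, so a self-contained sketch is short (A is the term-document matrix, C the concept matrix from the previous step; the function name is mine).

import numpy as np

def concept_decomposition(A, C):
    """Least-squares approximation of A on the span of the concept
    vectors: D_k = C @ Z with Z minimizing ||A - C @ Z||_F."""
    Z, *_ = np.linalg.lstsq(C, A, rcond=None)  # columns of Z represent documents
    return C @ Z, Z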
Comparison: Academic example • Collection of 15 documents (titles of books) • 9 from the field of data mining • 5 from the field of linear algebra • 1 a combination of these fields (application of linear algebra for data mining) • A list of terms was formed • from words contained in at least two documents • words on the stop list were removed • stemming was performed • To the term-document matrix we applied • truncated SVD (k=2) • concept decomposition (k=2)
Queries • Q1: Data mining • Relevant documents: all data mining documents • Q2: Using linear algebra for data mining • Relevant document: D6
Collections • MEDLINE • 1033 documents • 30 queries • relevance judgements • CRANFIELD • 1400 documents • 225 queries • relevance judgements
Test A • Comparison of the errors of approximation of the term-document matrix by 1) the rank-k truncated SVD and 2) the rank-k CD
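A self-contained sketch of this comparison, assuming the approximation error is measured in the Frobenius norm (the usual choice for low-rank approximations; the slides do not state the norm explicitly) and using the same k-means shortcut for the concept vectors as above.

import numpy as np
from sklearn.cluster import KMeans

def approximation_errors(A, k, random_state=0):
    """Frobenius-norm errors of the rank-k truncated SVD and the
    rank-k concept decomposition of A."""
    # Rank-k truncated SVD (the optimal rank-k approximation).
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    err_svd = np.linalg.norm(A - U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :])
    # Rank-k concept decomposition via k-means on unit-norm documents.
    docs = A.T / np.linalg.norm(A.T, axis=1, keepdims=True)
    labels = KMeans(n_clusters=k, random_state=random_state).fit_predict(docs)
    C = np.column_stack([docs[labels == j].mean(axis=0) for j in range(k)])
    Z, *_ = np.linalg.lstsq(C, A, rcond=None)
    err_cd = np.linalg.norm(A - C @ Z)
    return err_svd, err_cd  # err_svd <= err_cd, since the SVD is optimal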
Test B • Average inner product between the concept vectors c_j, j = 1, 2, …, k (a measure of how close they are to orthogonality) • Comparison of the average inner product for • concept vectors obtained by the spherical k-means algorithm • concept vectors obtained by the fuzzy k-means algorithm
Test C • Comparison of the mean average precision of information retrieval and precision-recall plots • Mean average precision for the term-matching method: • MEDLINE: 43.54 • CRANFIELD: 20.89
Test D • Correlation between mean average precision (MAP) and clustering quality • Measure of cluster quality – the generalized within-groups sum of squared errors function J_fuzz = Σ_{i=1..k} Σ_{j=1..n} (μ_ij)^b ||a_j − c_i||^2 • a_j, j = 1, 2, …, n are the document vectors • c_i, i = 1, 2, …, k are the concept vectors • μ_ij is the fuzzy membership degree of document a_j in the group whose concept vector is c_i • b > 1 is the weight exponent
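A direct transcription of J_fuzz into numpy (the argument names are mine; M holds the fuzzy membership degrees μ_ij, not an SVD factor).

import numpy as np

def j_fuzz(A, C, M, b=2.0):
    """Generalized within-groups sum of squared errors.

    A -- m-by-n term-document matrix (columns a_j)
    C -- m-by-k concept matrix (columns c_i)
    M -- k-by-n matrix of fuzzy membership degrees mu_ij
    b -- weight exponent, b > 1
    """
    # squared distances ||a_j - c_i||^2 for every pair (i, j): a k-by-n array
    d2 = ((A[:, None, :] - C[:, :, None]) ** 2).sum(axis=0)
    return ((M ** b) * d2).sum()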
MEDLINE - Correlation (clustering quality and MAP) • 46 observations for rank of approximation k ∈ [1, 100] • Correlation between mean average precision and J_fuzz is r = -0.968198 with significance p << 0.01 • Correlation between rank of approximation and mean average precision is r = 0.70247 (p << 0.01) • Correlation between rank of approximation and J_fuzz is r = -0.831071 (p << 0.01)
CRANFIELD - Correlation (clustering quality and MAP) • 46 observations for rank of approximation k ∈ [1, 100] • Correlation between mean average precision and J_fuzz is r = -0.988293 with significance p << 0.01 • Correlation between rank of approximation and mean average precision is r = 0.914489 (p << 0.01) • Correlation between rank of approximation and J_fuzz is r = -0.904415 (p << 0.01)
Conclusion 1/3 • By the SVD approximation the term-document matrix is projected onto the first k left singular vectors, which form an orthonormal basis for the LSI space • By the CD approximation the term-document matrix is projected onto the k centroids of the groups (concept vectors) • The concept vectors form the basis for the CI space; they tend to orthogonality as k increases • Concept vectors obtained by the fuzzy k-means algorithm tend to orthogonality faster than those obtained by the spherical k-means algorithm • CI using the CD by the fuzzy k-means algorithm gives higher MAP of information retrieval than LSI on both collections we have used
Conclusion 2/3 • CI using the CD by the spherical k-means algorithm gives lower (but comparable) MAP of information retrieval than LSI on both collections we have used • According to the MAP results, k = 75 for the MEDLINE collection and k = 200 for the CRANFIELD collection are good choices for the rank of approximation • With LSI and CI documents are represented by smaller matrices: • for the MEDLINE collection the term-document matrix is a 5940×1033 matrix, while the approximations of the documents are stored in a 75×1033 matrix • for the CRANFIELD collection the term-document matrix is a 4758×1400 matrix, while the approximations of the documents are stored in a 200×1400 matrix
Conclusion 3/3 • LSI and CI work better on the MEDLINE collection • When evaluated for different ranks of approximation, MAP is more stable for LSI than for CI • There is a high correlation between MAP and clustering quality
Further work • To apply CI to the problem of classification in a supervised setting • To propose solutions to the problem of adding new documents to the collection for the CI method • adding new documents to the collection requires recomputation of the SVD or the CD • this is computationally inefficient • two approximation methods have been developed for adding new documents to the collection for the LSI method