Comparison of information retrieval techniques: Latent semantic indexing (LSI) and Concept indexing (CI) Jasminka Dobša Faculty of organization and informatics, Varaždin
Outline • Information retrieval in vector space model (VSM) or bag of words representation • Techniques for conceptual indexing • Latent semantic indexing • Concept indexing • Comparison: Academic example • Experiment • Further work
Information retrieval in VSM 1/3 • Task of information retrieval: to extract, from a document collection, the documents that are relevant to a user's query • In the VSM documents are represented as vectors in a high-dimensional space • The dimension of the space depends on the number of indexing terms chosen as relevant for the collection (4000-5000 in my experiments) • The VSM is implemented by forming the term-document matrix
Information retrieval in VSM 2/3 • The term-document matrix is an m×n matrix where m is the number of terms and n is the number of documents • row of the term-document matrix = term • column of the term-document matrix = document • Figure 1. Term-document matrix
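As an illustration, here is a minimal Python sketch (not from the original work) of how such a term-document matrix can be built; the three example titles and the use of scikit-learn's TfidfVectorizer are my own choices, not the author's setup.

from sklearn.feature_extraction.text import TfidfVectorizer

# Three made-up book titles standing in for a document collection.
docs = [
    "using linear algebra for data mining",
    "survey of clustering algorithms for data mining",
    "matrix computations and applied linear algebra",
]

vectorizer = TfidfVectorizer(stop_words="english")
# fit_transform returns a documents-by-terms matrix; transpose it so that
# rows correspond to terms and columns to documents, as in Figure 1.
A = vectorizer.fit_transform(docs).T.toarray()
print(A.shape)  # (number of terms m, number of documents n)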
Information retrieval in VSM 3/3 • A query has the same shape as a document (an m-dimensional vector) • The measure of similarity between a query q and a document a_j is the cosine of the angle between the two vectors: cos(q, a_j) = q^T a_j / (||q|| ||a_j||)
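A minimal sketch of cosine-similarity ranking under these conventions (columns of A are documents, q is an m-dimensional query vector); the function name rank_documents is mine, chosen for illustration.

import numpy as np

def rank_documents(A, q):
    """Rank the columns of the term-document matrix A by the cosine of
    the angle between each document vector and the query vector q."""
    sims = (q @ A) / (np.linalg.norm(q) * np.linalg.norm(A, axis=0))
    return np.argsort(-sims)  # document indices, most similar first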
Retrieval performance evaluation • Measures for evaluation: recall, precision, average precision • Recall at cutoff i = r_i / r_n • Precision at cutoff i = r_i / i • r_i is the number of relevant documents among the i highest ranked documents • r_n is the total number of relevant documents in the collection • Average precision – the average of the precisions at distinct levels of recall
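A sketch of these measures in Python, assuming ranking is a list of document indices ordered from most to least similar and relevant is the set of indices judged relevant; the helper names are illustrative, and average precision is computed here at the ranks where a relevant document appears, one common reading of "distinct levels of recall".

def precision_recall_at(ranking, relevant, i):
    """Precision and recall among the i highest-ranked documents."""
    r_i = len(set(ranking[:i]) & relevant)  # relevant docs in top i
    r_n = len(relevant)                     # total relevant docs
    return r_i / i, r_i / r_n               # precision, recall

def average_precision(ranking, relevant):
    """Average of the precisions at the ranks where recall increases."""
    hits = [precision_recall_at(ranking, relevant, i + 1)[0]
            for i, d in enumerate(ranking) if d in relevant]
    return sum(hits) / len(relevant)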
Techniques for conceptual indexing • In the term-matching method the similarity between query and document is tested lexically • Polysemy (one word having multiple meanings) and synonymy (multiple words having the same meaning) are two fundamental obstacles to efficient information retrieval • Here we compare two techniques for conceptual indexing based on projection of the document vectors (in the least-squares sense) onto a lower-dimensional vector space • Latent semantic indexing (LSI) • Concept indexing (CI)
Latent semantic indexing • Introduced in 1990; improved in 1995 • S. Deerwester, S. T. Dumais, G. Furnas, T. Landauer, R. Harshman: Indexing by latent semantic analysis, J. American Society for Information Science, 41, 1990, pp. 391-407 • M. W. Berry, S. T. Dumais, G. W. O'Brien: Using linear algebra for intelligent information retrieval, SIAM Review, 37, 1995, pp. 573-595 • Based on spectral analysis of the term-document matrix
Latent semantic indexing • For every m×n matrix A there is a singular value decomposition (SVD) A = UΣV^T where • U is an orthogonal m×m matrix whose columns are the left singular vectors of A • Σ is a diagonal matrix on whose diagonal are the singular values of A in descending order • V is an orthogonal n×n matrix whose columns are the right singular vectors of A
Latent semantic indexing • For LSI the truncated SVD is used: A_k = U_k Σ_k V_k^T where • U_k is an m×k matrix whose columns are the first k left singular vectors of A • Σ_k is a k×k diagonal matrix whose diagonal is formed by the k leading singular values of A • V_k is an n×k matrix whose columns are the first k right singular vectors of A • Rows of U_k = terms • Rows of V_k = documents
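A small numpy sketch of the truncated SVD in this notation (the function name is mine); the rows of the returned V_k are the k-dimensional representations of the documents.

import numpy as np

def truncated_svd(A, k):
    """Rank-k truncated SVD: returns U_k, Sigma_k, V_k,
    so that A_k = U_k @ Sigma_k @ V_k.T."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    return U[:, :k], np.diag(s[:k]), Vt[:k, :].T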
Latent semantic indexing • Using the truncated SVD we keep only the first k independent linear components of A (singular vectors and values) • Documents are projected in the least-squares sense onto the space spanned by the first k left singular vectors of A (the LSI space) • The first k components capture the major associational structure in the term-document matrix and throw out the noise • Minor differences in the terminology used in documents are ignored • Closeness of objects (queries and documents) is determined by the overall pattern of term usage, so it is context based • Documents which contain synonyms are closer in the LSI space than in the original space; documents which use a polysemous word in different contexts are farther apart in the LSI space than in the original space
Concept indexing (CI) • Indexing using the concept decomposition (CD) instead of the SVD used in LSI • Concept decomposition was introduced in 2001 • I. S. Dhillon, D. S. Modha: Concept decompositions for large sparse text data using clustering, Machine Learning, 42:1, 2001, pp. 143-175
Concept decomposition • First step: clustering of the documents (columns of the term-document matrix A) into k groups • Clustering algorithms: • spherical k-means algorithm • fuzzy k-means algorithm • The spherical k-means algorithm is a variant of the k-means algorithm which exploits the fact that the document vectors have unit norm • Centroids of the groups = concept vectors • The concept matrix C_k = [c_1, c_2, …, c_k] is the matrix whose columns are the centroids of the groups, c_j being the centroid of the j-th group (see the sketch below)
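A hedged sketch of this first step: spherical k-means is approximated here by running ordinary k-means on unit-normalized document vectors and renormalizing the group centroids, a common shortcut rather than the exact algorithm of Dhillon and Modha; the function name concept_matrix is mine.

import numpy as np
from sklearn.cluster import KMeans

def concept_matrix(A, k, random_state=0):
    """Cluster the unit-norm document vectors (columns of A) into k
    groups and return the m-by-k matrix of concept vectors."""
    docs = A.T / np.linalg.norm(A.T, axis=1, keepdims=True)  # unit rows
    labels = KMeans(n_clusters=k, random_state=random_state).fit_predict(docs)
    C = np.column_stack([docs[labels == j].mean(axis=0) for j in range(k)])
    return C / np.linalg.norm(C, axis=0, keepdims=True)  # unit-norm centroids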
Concept decomposition • Second step: calculating the concept decomposition • The concept decomposition D_k of the term-document matrix A is the least-squares approximation of A on the space of the concept vectors: D_k = C_k Z, where Z is the solution of the least squares problem min_Z ||A − C_k Z||_F • Rows of C_k = terms • Columns of Z = documents
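The second step is an ordinary linear least-squares problem, so a self-contained sketch is short (A is the term-document matrix, C the concept matrix from the previous step; the function name is mine).

import numpy as np

def concept_decomposition(A, C):
    """Least-squares approximation of A on the span of the concept
    vectors: D_k = C @ Z with Z minimizing ||A - C @ Z||_F."""
    Z, *_ = np.linalg.lstsq(C, A, rcond=None)  # columns of Z represent documents
    return C @ Z, Z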
Comparison: Academic example • Collection of 15 documents (titles of books) • 9 from the field of data mining • 5 from the field of linear algebra • 1 a combination of these fields (application of linear algebra for data mining) • A list of terms was formed • from words contained in at least two documents • words on the stop list were removed • stemming was performed • To the term-document matrix we applied • truncated SVD (k=2) • concept decomposition (k=2)
Queries • Q1: Data mining • Relevant documents: all data mining documents • Q2: Using linear algebra for data mining • Relevant document: D6
Collections • MEDLINE • 1033 documents • 30 queries • relevance judgements • CRANFIELD • 1400 documents • 225 queries • relevance judgements
Test A • Comparison of the errors of approximation of the term-document matrix by 1) the rank-k truncated SVD and 2) the rank-k CD
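A self-contained sketch of this comparison, assuming the approximation error is measured in the Frobenius norm (the usual choice for low-rank approximations; the slides do not state the norm explicitly) and using the same k-means shortcut for the concept vectors as above.

import numpy as np
from sklearn.cluster import KMeans

def approximation_errors(A, k, random_state=0):
    """Frobenius-norm errors of the rank-k truncated SVD and the
    rank-k concept decomposition of A."""
    # Rank-k truncated SVD (the optimal rank-k approximation).
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    err_svd = np.linalg.norm(A - U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :])
    # Rank-k concept decomposition via k-means on unit-norm documents.
    docs = A.T / np.linalg.norm(A.T, axis=1, keepdims=True)
    labels = KMeans(n_clusters=k, random_state=random_state).fit_predict(docs)
    C = np.column_stack([docs[labels == j].mean(axis=0) for j in range(k)])
    Z, *_ = np.linalg.lstsq(C, A, rcond=None)
    err_cd = np.linalg.norm(A - C @ Z)
    return err_svd, err_cd  # err_svd <= err_cd, since the SVD is optimal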
Test B • Average inner product between the concept vectors c_j, j = 1, 2, …, k (a measure of how close they are to orthogonality) • Comparison of the average inner product for • concept vectors obtained by the spherical k-means algorithm • concept vectors obtained by the fuzzy k-means algorithm
Test C • Comparison of the mean average precision of information retrieval and precision-recall plots • Mean average precision for the term-matching method: • MEDLINE: 43.54 • CRANFIELD: 20.89
Test D • Correlation between mean average precision (MAP) and clustering quality • Measure of cluster quality – the generalized within-groups sum of squared errors function J_fuzz = Σ_{i=1..k} Σ_{j=1..n} (μ_ij)^b ||a_j − c_i||^2 • a_j, j = 1, 2, …, n are the document vectors • c_i, i = 1, 2, …, k are the concept vectors • μ_ij is the fuzzy membership degree of document a_j in the group whose concept vector is c_i • b > 1 is the weight exponent
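A direct transcription of J_fuzz into numpy (the argument names are mine; M holds the fuzzy membership degrees μ_ij, not an SVD factor).

import numpy as np

def j_fuzz(A, C, M, b=2.0):
    """Generalized within-groups sum of squared errors.

    A -- m-by-n term-document matrix (columns a_j)
    C -- m-by-k concept matrix (columns c_i)
    M -- k-by-n matrix of fuzzy membership degrees mu_ij
    b -- weight exponent, b > 1
    """
    # squared distances ||a_j - c_i||^2 for every pair (i, j): a k-by-n array
    d2 = ((A[:, None, :] - C[:, :, None]) ** 2).sum(axis=0)
    return ((M ** b) * d2).sum()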
MEDLINE - Correlation (clustering quality and MAP) • 46 observations for rank of approximation k ∈ [1, 100] • Correlation between mean average precision and J_fuzz is r = -0.968198 with significance p << 0.01 • Correlation between rank of approximation and mean average precision is r = 0.70247 (p << 0.01) • Correlation between rank of approximation and J_fuzz is r = -0.831071 (p << 0.01)
CRANFIELD - Correlation (clustering quality and MAP) • 46 observations for rank of approximation k ∈ [1, 100] • Correlation between mean average precision and J_fuzz is r = -0.988293 with significance p << 0.01 • Correlation between rank of approximation and mean average precision is r = 0.914489 (p << 0.01) • Correlation between rank of approximation and J_fuzz is r = -0.904415 (p << 0.01)
Conclusion 1/3 • By the SVD approximation the term-document matrix is projected onto the first k left singular vectors, which form an orthonormal basis for the LSI space • By the CD approximation the term-document matrix is projected onto the k centroids of the groups (concept vectors) • The concept vectors form the basis for the CI space; they tend to orthogonality as k increases • Concept vectors obtained by the fuzzy k-means algorithm tend to orthogonality faster than those obtained by the spherical k-means algorithm • CI using the CD by the fuzzy k-means algorithm gives higher MAP of information retrieval than LSI on both collections we have used
Conclusion 2/3 • CI using the CD by the spherical k-means algorithm gives lower (but comparable) MAP of information retrieval than LSI on both collections we have used • According to the MAP results, k = 75 for the MEDLINE collection and k = 200 for the CRANFIELD collection are good choices for the rank of approximation • With LSI and CI documents are represented by smaller matrices: • for the MEDLINE collection the term-document matrix is a 5940×1033 matrix, while the approximations of the documents are stored in a 75×1033 matrix • for the CRANFIELD collection the term-document matrix is a 4758×1400 matrix, while the approximations of the documents are stored in a 200×1400 matrix
Conclusion 3/3 • LSI and CI work better on the MEDLINE collection • When evaluated for different ranks of approximation, MAP is more stable for LSI than for CI • There is a high correlation between MAP and clustering quality
Further work • To apply CI to the problem of classification in a supervised setting • To propose solutions to the problem of adding new documents to the collection for the CI method • adding new documents to the collection requires recomputation of the SVD or the CD • this is computationally inefficient • two approximation methods have been developed for adding new documents to the collection for the LSI method