This presentation explores the mathematics behind information retrieval, including the vector space model, query matching, and rank approximation using Singular Value Decomposition (SVD).
The Mathematics of Information Retrieval 11/21/2005 Presented by Jeremy Chapman, Grant Gelven and Ben Lakin
Acknowledgments • This presentation is based on the following paper: “Matrices, Vector Spaces, and Information Retrieval” by Michael W. Berry, Zlatko Drmač, and Elizabeth R. Jessup.
Indexing of Scientific Works • Indexing is done primarily using the title, author list, abstract, keyword list, and subject classification • These are created in large part so that the works can be found in a search of scientific documents • The use of automated information retrieval (IR) has improved consistency and speed
Vector Space Model for IR • The basic mechanism of this model is the encoding of a document as a vector • All of the document vectors are stored in a single matrix • Latent Semantic Indexing (LSI) uses rank reduction to replace the original matrix with a matrix of lower rank that retains similar information
Creating the Database Matrix • Each document is represented by a column of the matrix (d is the number of documents) • Each term is represented by a row (t is the number of terms) • This gives us a t × d matrix • The document vectors span the content
Simple Example
The following are the d = 5 documents:
D1: How to Bake Bread Without Recipes
D2: The Classical Art of Viennese Pastry
D3: Numerical Recipes: The Art of Scientific Computing
D4: Breads, Pastries, Pies, and Cakes: Quantity Baking Recipes
D5: Pastry: A Book of Best French Recipes
• Let the six terms be as follows:
T1: bak(e, ing)
T2: recipes
T3: bread
T4: cake
T5: pastr(y, ies)
T6: pie
Thus the document matrix (rows T1 to T6, columns D1 to D5) becomes:
$$A = \begin{pmatrix}
1 & 0 & 0 & 1 & 0 \\
1 & 0 & 1 & 1 & 1 \\
1 & 0 & 0 & 1 & 0 \\
0 & 0 & 0 & 1 & 0 \\
0 & 1 & 0 & 1 & 1 \\
0 & 0 & 0 & 1 & 0
\end{pmatrix}$$
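As a minimal sketch, the matrix for this example could be assembled in Python with NumPy. The prefix-matching rule and the names `titles`, `stems`, and `term_document_matrix` are illustrative assumptions, not part of the original presentation.

```python
import re
import numpy as np

titles = [
    "How to Bake Bread Without Recipes",
    "The Classical Art of Viennese Pastry",
    "Numerical Recipes: The Art of Scientific Computing",
    "Breads, Pastries, Pies, and Cakes: Quantity Baking Recipes",
    "Pastry: A Book of Best French Recipes",
]
# Stems for T1..T6; a word counts as a hit if it starts with the stem (assumption).
stems = ["bak", "recipes", "bread", "cake", "pastr", "pie"]

def term_document_matrix(titles, stems):
    """Return the t x d matrix with a 1 where term i occurs in document j."""
    A = np.zeros((len(stems), len(titles)))
    for j, title in enumerate(titles):
        words = re.findall(r"[a-z]+", title.lower())
        for i, stem in enumerate(stems):
            if any(word.startswith(stem) for word in words):
                A[i, j] = 1.0
    return A

A = term_document_matrix(titles, stems)
print(A)
```

Each column of the result describes one title and each row one term, so D4, which mentions all six terms, becomes a column of ones.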
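The matrix A after Normalization • After scaling each column of A to unit Euclidean length we get the following (entries rounded to four decimals):
$$A = \begin{pmatrix}
0.5774 & 0 & 0 & 0.4082 & 0 \\
0.5774 & 0 & 1 & 0.4082 & 0.7071 \\
0.5774 & 0 & 0 & 0.4082 & 0 \\
0 & 0 & 0 & 0.4082 & 0 \\
0 & 1 & 0 & 0.4082 & 0.7071 \\
0 & 0 & 0 & 0.4082 & 0
\end{pmatrix}$$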
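The normalization step can be sketched as follows, assuming "normalization" means scaling every column to unit 2-norm. The 0/1 matrix is the one implied by the titles and terms of the example; `normalize_columns` is an illustrative name.

```python
import numpy as np

# Term-by-document matrix of the example (rows T1..T6, columns D1..D5).
A = np.array([
    [1, 0, 0, 1, 0],
    [1, 0, 1, 1, 1],
    [1, 0, 0, 1, 0],
    [0, 0, 0, 1, 0],
    [0, 1, 0, 1, 1],
    [0, 0, 0, 1, 0],
], dtype=float)

def normalize_columns(M):
    """Scale every column of M to unit Euclidean (2-norm) length."""
    return M / np.linalg.norm(M, axis=0)

A_normalized = normalize_columns(A)
print(np.round(A_normalized, 4))
```

A document with n matching terms ends up with entries 1/sqrt(n), so long titles do not dominate short ones when cosines are computed later.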
Making a Query • Next we will use the document matrix to ease our search for related documents. • Referring to our example, we will make the following query: Baking Bread • We now format the query as a vector using the term definitions given before: q = (1 0 1 0 0 0)^T
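Matching the Document to the Query • Matching the documents to a given query is typically done by using the cosine of the angle between the query and document vectors • The cosine is given as follows, where a_j is the j-th column of A:
$$\cos\theta_j = \frac{a_j^T q}{\|a_j\|_2 \, \|q\|_2} = \frac{\sum_{i=1}^{t} a_{ij} q_i}{\sqrt{\sum_{i=1}^{t} a_{ij}^2} \, \sqrt{\sum_{i=1}^{t} q_i^2}}, \qquad j = 1, \dots, d$$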
A Query • Using the cosine formula with the normalized matrix we get cos θ1 = 0.8165, cos θ4 = 0.5774, and cos θ2 = cos θ3 = cos θ5 = 0 • We set the lower limit on the cosine at 0.5. • Thus the query “baking bread” returns the following two documents: D1: How to Bake Bread Without Recipes D4: Breads, Pastries, Pies, and Cakes: Quantity Baking Recipes
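The query step can be sketched in a few lines, assuming the usual definition cos θ_j = a_j^T q / (‖a_j‖ ‖q‖) for each document column a_j. The names `query_cosines`, `cutoff`, and `matches` are illustrative.

```python
import numpy as np

# Term-by-document matrix of the example (rows T1..T6, columns D1..D5).
A = np.array([
    [1, 0, 0, 1, 0],
    [1, 0, 1, 1, 1],
    [1, 0, 0, 1, 0],
    [0, 0, 0, 1, 0],
    [0, 1, 0, 1, 1],
    [0, 0, 0, 1, 0],
], dtype=float)
A = A / np.linalg.norm(A, axis=0)  # unit-length columns

# Query "baking bread": T1 (bake) and T3 (bread) are present.
q = np.array([1, 0, 1, 0, 0, 0], dtype=float)

def query_cosines(A, q):
    """Cosine of the angle between q and every column of A."""
    return (A.T @ q) / (np.linalg.norm(A, axis=0) * np.linalg.norm(q))

cosines = query_cosines(A, q)
cutoff = 0.5
matches = [j + 1 for j, c in enumerate(cosines) if c >= cutoff]  # 1-based document numbers
print(np.round(cosines, 4), matches)
```

With the cutoff of 0.5 this selects D1 and D4, in agreement with the slide.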
Singular Value Decomposition • The Singular Value Decomposition (SVD) is used to reduce the rank of the matrix while still giving a good approximation of the information stored in it • The decomposition is written in the following manner: A = U Σ V^T • Here U is a t × t orthogonal matrix whose first r_A columns span the column space of A (r_A is the rank of A), Σ is the t × d matrix with the singular values of A along its main diagonal, and V is a d × d orthogonal matrix whose first r_A columns span the row space of A.
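SVD continued • Unlike the QR factorization, which gives a reduced-rank basis only for the column space, the SVD provides a lower-rank representation of both the column and row spaces • We know A_k is the best rank-k approximation to A by the Eckart-Young theorem, which states:
$$\|A - A_k\|_F = \min_{\operatorname{rank}(X) \le k} \|A - X\|_F = \sqrt{\sigma_{k+1}^2 + \cdots + \sigma_{r_A}^2}$$
where r_A is the rank of A • The rank-k approximation of A is given as follows: A_k = U_k Σ_k V_k^T • where U_k = the first k columns of U, Σ_k = a k × k matrix whose diagonal is a set of decreasing values, call them σ_1 ≥ σ_2 ≥ … ≥ σ_k, and V_k^T = the k × d matrix whose rows are the first k rows of V^T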
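A minimal sketch of the truncation A_k = U_k Σ_k V_k^T using `numpy.linalg.svd`; the function name `rank_k_approximation` and the choice k = 3 are illustrative assumptions.

```python
import numpy as np

# Normalized term-by-document matrix of the example.
A = np.array([
    [1, 0, 0, 1, 0],
    [1, 0, 1, 1, 1],
    [1, 0, 0, 1, 0],
    [0, 0, 0, 1, 0],
    [0, 1, 0, 1, 1],
    [0, 0, 0, 1, 0],
], dtype=float)
A = A / np.linalg.norm(A, axis=0)

def rank_k_approximation(A, k):
    """Return (A_k, singular_values), keeping the k largest singular triplets."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)  # s is sorted in decreasing order
    U_k = U[:, :k]          # first k columns of U
    Sigma_k = np.diag(s[:k])  # k x k diagonal matrix of the largest singular values
    Vt_k = Vt[:k, :]        # first k rows of V^T
    return U_k @ Sigma_k @ Vt_k, s

A_3, singular_values = rank_k_approximation(A, 3)
print(np.round(singular_values, 4))
```

The Frobenius distance between `A` and `A_3` should equal the square root of the sum of the squared discarded singular values, which is the quantity in the Eckart-Young statement.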
Interpretation • From the SVD of A given on the slide before, we notice that A is a rank-4 matrix: it has only four non-zero singular values • Also, the two zero rows of Σ tell us that the first four columns of U give us a basis for the column space of A
Analysis of the Rank-k Approximations • Using the following formula, divided by ||A||_F, we can calculate the relative error from the original matrix to its rank-k approximation:
$$\|A - A_k\|_F = \sqrt{\sigma_{k+1}^2 + \cdots + \sigma_{r_A}^2}$$
where r_A is the rank of A • Thus only a 19% relative error is incurred in going from the rank-4 matrix to a rank-3 approximation, whereas a 42% relative error is incurred in going from rank 4 to a rank-2 approximation • As expected, these values are smaller than those of the corresponding rank-k approximations from the QR factorization
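The relative errors can be checked numerically. This sketch assumes the error is measured as ‖A − A_k‖_F / ‖A‖_F and uses the fact that ‖A‖_F is the square root of the sum of all squared singular values; `relative_error` is an illustrative name.

```python
import numpy as np

# Normalized term-by-document matrix of the example.
A = np.array([
    [1, 0, 0, 1, 0],
    [1, 0, 1, 1, 1],
    [1, 0, 0, 1, 0],
    [0, 0, 0, 1, 0],
    [0, 1, 0, 1, 1],
    [0, 0, 0, 1, 0],
], dtype=float)
A = A / np.linalg.norm(A, axis=0)

def relative_error(A, k):
    """||A - A_k||_F / ||A||_F computed from the singular values alone."""
    s = np.linalg.svd(A, compute_uv=False)
    discarded = np.sqrt(np.sum(s[k:] ** 2))  # Eckart-Young error of the rank-k truncation
    return discarded / np.sqrt(np.sum(s ** 2))

for k in (3, 2):
    print(k, round(relative_error(A, k), 2))
```

For k = 3 and k = 2 this should reproduce the roughly 19% and 42% figures quoted on the slide.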
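Using the SVD for Query Matching • Using the following formula we can calculate the cosines of the angles between the query and the columns of our rank-k approximation of A:
$$\cos\theta_j = \frac{(A_k e_j)^T q}{\|A_k e_j\|_2 \, \|q\|_2} = \frac{s_j^T (U_k^T q)}{\|s_j\|_2 \, \|q\|_2}, \qquad s_j = \Sigma_k V_k^T e_j, \quad j = 1, \dots, d$$
where e_j is the j-th column of the d × d identity matrix • Using the rank-3 approximation and the same cutoff of 0.5, we again return the first and fourth books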
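A sketch of query matching against the rank-k approximation, assuming the cosines are taken between q and the columns of A_k = U_k Σ_k V_k^T. Because U_k has orthonormal columns, column j of A_k has the same norm as s_j = Σ_k V_k^T e_j, so A_k never has to be formed. The name `rank_k_query_cosines` is illustrative.

```python
import numpy as np

# Normalized term-by-document matrix of the example and the query "baking bread".
A = np.array([
    [1, 0, 0, 1, 0],
    [1, 0, 1, 1, 1],
    [1, 0, 0, 1, 0],
    [0, 0, 0, 1, 0],
    [0, 1, 0, 1, 1],
    [0, 0, 0, 1, 0],
], dtype=float)
A = A / np.linalg.norm(A, axis=0)
q = np.array([1, 0, 1, 0, 0, 0], dtype=float)

def rank_k_query_cosines(A, q, k):
    """Cosines between q and the columns of the rank-k SVD approximation of A."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    S = s[:k, None] * Vt[:k, :]   # column j is s_j = Sigma_k V_k^T e_j
    projected_q = U[:, :k].T @ q  # U_k^T q
    return (S.T @ projected_q) / (np.linalg.norm(S, axis=0) * np.linalg.norm(q))

cosines_3 = rank_k_query_cosines(A, q, 3)
matches_3 = [j + 1 for j, c in enumerate(cosines_3) if c >= 0.5]
print(np.round(cosines_3, 4), matches_3)
```

With k = 3 and the cutoff of 0.5, the first and fourth documents should be returned again.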
Term-Term Comparison • The vector space model for comparing queries with documents can be modified to compare terms with terms. • When this is added to a search engine, it can act as a tool to refine the results • First we run our search as before and retrieve a certain number of documents; in the following example, five documents are retrieved. • We then create another term-by-document matrix from the retrieved documents, call it G.
Another Example
Terms:
• T1: Run(ning)
• T2: Bike
• T3: Endurance
• T4: Training
• T5: Band
• T6: Music
• T7: Fishes
Documents:
D1: Complete Triathlon Endurance Training Manual: Swim, Bike, Run
D2: Lake, River, and Sea-Run Fishes of Canada
D3: Middle Distance Running, Training and Competition
D4: Music Law: How to Run your Band’s Business
D5: Running: Learning, Training Competing
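Analysis of the Term-Term Comparison • For this we use the following formula, which gives the cosine of the angle ω_ij between terms i and j (rows i and j of G):
$$\cos\omega_{ij} = \frac{(e_i^T G)(G^T e_j)}{\|G^T e_i\|_2 \, \|G^T e_j\|_2}$$
where e_i is the i-th column of the t × t identity matrix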
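A sketch of the term-term comparison for the running example, assuming the cosine is taken between rows of G, that is cos ω_ij = (e_i^T G)(G^T e_j) / (‖G^T e_i‖ ‖G^T e_j‖). The entries of `G` are read off the seven terms and five titles, and normalizing its columns as in the first example is an assumption; `term_cosines` is an illustrative name.

```python
import numpy as np

terms = ["run(ning)", "bike", "endurance", "training", "band", "music", "fishes"]

# Rows T1..T7, columns D1..D5: a 1 marks a term that occurs in the title.
G = np.array([
    [1, 1, 1, 1, 1],  # run(ning)
    [1, 0, 0, 0, 0],  # bike
    [1, 0, 0, 0, 0],  # endurance
    [1, 0, 1, 0, 1],  # training
    [0, 0, 0, 1, 0],  # band
    [0, 0, 0, 1, 0],  # music
    [0, 1, 0, 0, 0],  # fishes
], dtype=float)
G = G / np.linalg.norm(G, axis=0)  # unit-length document columns (assumption)

def term_cosines(G):
    """Matrix of cosines between every pair of term (row) vectors of G."""
    row_norms = np.linalg.norm(G, axis=1)
    return (G @ G.T) / np.outer(row_norms, row_norms)

C = term_cosines(G)
print(np.round(C, 2))
```

Terms that occur in exactly the same documents, such as bike and endurance, have cosine 1, while terms that never share a document, such as bike and fishes, have cosine 0.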
Clustering • Clustering is the process by which related terms, such as bike, endurance, and training, are grouped together • First the terms are split into groups of related terms • The terms in each group are chosen such that their vectors are almost parallel
Clusters • In this example the first cluster is running • The second cluster is bike, endurance and training • The third is band and music • And the fourth is fishes
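Analyzing the Term-Term Comparison • We will again use the SVD rank-k approximation, G_k = U_k Σ_k V_k^T • Thus the cosine of the angles becomes:
$$\cos\omega_{ij} = \frac{(e_i^T U_k \Sigma_k)(\Sigma_k U_k^T e_j)}{\|\Sigma_k U_k^T e_i\|_2 \, \|\Sigma_k U_k^T e_j\|_2}$$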
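The rank-k version can be sketched the same way, assuming term i is represented by row i of U_k Σ_k, so that the inner products equal those between rows of G_k = U_k Σ_k V_k^T. The matrix `G` is the column-normalized matrix implied by the example; the choice k = 3 and the name `rank_k_term_cosines` are assumptions for illustration.

```python
import numpy as np

# Rows T1..T7 (run, bike, endurance, training, band, music, fishes), columns D1..D5.
G = np.array([
    [1, 1, 1, 1, 1],
    [1, 0, 0, 0, 0],
    [1, 0, 0, 0, 0],
    [1, 0, 1, 0, 1],
    [0, 0, 0, 1, 0],
    [0, 0, 0, 1, 0],
    [0, 1, 0, 0, 0],
], dtype=float)
G = G / np.linalg.norm(G, axis=0)  # unit-length document columns (assumption)

def rank_k_term_cosines(G, k):
    """Cosines between term vectors in the rank-k SVD approximation of G."""
    U, s, _ = np.linalg.svd(G, full_matrices=False)
    W = U[:, :k] * s[:k]  # row i is e_i^T U_k Sigma_k
    norms = np.linalg.norm(W, axis=1)
    return (W @ W.T) / np.outer(norms, norms)

C_3 = rank_k_term_cosines(G, 3)
print(np.round(C_3, 2))
```

Since D3 and D5 contain the same terms, G has rank 4, so k = 4 should reproduce the full-rank cosines, and smaller k gives the smoothed comparison.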
Conclusion • Using this model, many libraries and smaller collections can index their documents • However, as the next presentation will show, a different approach is used for large collections such as the internet