Information Retrieval in Text Part III • Reference: Michael W. Berry and Murray Browne. Understanding Search Engines: Mathematical Modeling and Text Retrieval. SIAM, 1999. • Reading Assignment: Chapter 4.
Outline • Matrix Decompositions • QR Factorization • Singular Value Decomposition • Updating Techniques
Matrix Decomposition • To produce a reduced-rank approximation of the m×n term-by-document matrix A, one must identify the dependencies among the columns or rows of A. • For a rank-k matrix, the k basis vectors of its column space can serve in place of its n column vectors to represent that column space.
QR Factorization • The QR factorization of a matrix A is defined as A = QR, where Q is an m×m orthogonal matrix and R is an m×n upper triangular matrix. • A square matrix is orthogonal if its columns are orthonormal, i.e., if qj denotes a column of the orthogonal matrix Q, then qj has unit Euclidean norm (||qj||2 = 1 for j = 1, 2, …, m) and is orthogonal to every other column of Q (qjTqi = 0 for all i ≠ j). • The rows of Q are also orthonormal, i.e., QTQ = QQT = I. • Such a factorization exists for any matrix A. • There are many ways to compute the factorization.
QR Factorization • Given A = QR, the columns of A are all linear combinations of the columns of Q. • Thus, a subset of k of the columns of Q forms a basis for the column space of A, where k = rank(A).
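As a concrete illustration, here is a minimal NumPy sketch (the 6×4 matrix is a random stand-in, not the collection from the text) that computes a full QR factorization and checks that Q is orthogonal and that rank(A) = rank(R):

```python
import numpy as np

# Random 6x4 stand-in for a term-by-document matrix (not the book's example).
rng = np.random.default_rng(0)
A = rng.random((6, 4))

# Full QR factorization: Q is 6x6 orthogonal, R is 6x4 upper triangular.
Q, R = np.linalg.qr(A, mode="complete")

# Columns (and rows) of Q are orthonormal: Q^T Q = Q Q^T = I.
assert np.allclose(Q.T @ Q, np.eye(6))

# A = QR, and rank(A) = rank(R) because Q is invertible.
assert np.allclose(A, Q @ R)
print(np.linalg.matrix_rank(A), np.linalg.matrix_rank(R))  # e.g. 4 4
```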
QR Factorization: Example • The QR factorization of the previous example can be written as A = QR with Q = [Q1 Q2] (the matrices are shown on the original slide). • Note that the first 7 columns of Q, denoted Q1, are orthonormal and hence constitute a basis for the column space of A. • The bottom zero submatrix of R is not always guaranteed to emerge automatically from the QR factorization, so one may need to apply column pivoting to produce it. • Q2 does not contribute to producing any nonzero value in A, since it multiplies only the zero rows of R.
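Column pivoting is available in SciPy. A hedged sketch on illustrative data (a matrix built to have rank 2, rather than the book's example):

```python
import numpy as np
from scipy.linalg import qr

# Illustrative matrix constructed to have rank 2 (not the book's example).
rng = np.random.default_rng(1)
A = rng.random((6, 2)) @ rng.random((2, 4))

# Pivoted QR: A P = Q R, with the larger entries of R moved up and left.
Q, R, piv = qr(A, pivoting=True)

# The diagonal of R decreases in magnitude; the rows beyond rank(A) are
# (numerically) zero, so the zero submatrix appears, exposing rank(A) = 2.
print(np.round(np.diag(R), 8))
assert np.allclose(R[2:, :], 0.0, atol=1e-10)
```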
QR Factorization • One motivation for using the QR factorization is that the basis vectors can be used to describe the semantic content of the corresponding text collection. • The cosines of the angles θj between a query vector q and the document vectors aj are given by cos θj = ajTq / (||aj||2 ||q||2) = rjT(QTq) / (||rj||2 ||q||2), where rj denotes the jth column of R. • Note that for the query “Child Proofing” this gives exactly the same cosines as computing them directly from A. Why?
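To make the “why” concrete, the sketch below (random stand-in data) verifies that the QR-based cosines coincide with the cosines computed directly from A: multiplication by the orthogonal Q preserves norms and inner products, so the QR formula is exact rather than an approximation.

```python
import numpy as np

rng = np.random.default_rng(2)
A = rng.random((6, 4))   # illustrative term-by-document matrix
q = rng.random(6)        # illustrative query vector

Q, R = np.linalg.qr(A, mode="complete")

# Direct cosines: a_j^T q / (||a_j|| ||q||), one per document column.
direct = (A.T @ q) / (np.linalg.norm(A, axis=0) * np.linalg.norm(q))

# QR-based cosines: r_j^T (Q^T q) / (||r_j|| ||q||); identical because
# ||Q r_j|| = ||r_j|| and a_j^T q = r_j^T (Q^T q).
qr_based = (R.T @ (Q.T @ q)) / (np.linalg.norm(R, axis=0) * np.linalg.norm(q))

assert np.allclose(direct, qr_based)
```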
Frobenius Matrix Norm • Definition: The Frobenius matrix norm ||·||F of an m×n matrix B = [bij] is defined by ||B||F = √( Σi Σj bij² ), the square root of the sum of the squares of all entries.
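A quick numerical check of the definition (illustrative 2×2 matrix):

```python
import numpy as np

B = np.array([[1.0, 2.0],
              [3.0, 4.0]])
# sqrt(1 + 4 + 9 + 16) = sqrt(30) matches NumPy's built-in Frobenius norm.
assert np.isclose(np.sqrt((B ** 2).sum()), np.linalg.norm(B, "fro"))
```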
Low Rank Approximation for QR Factorization • Initially, the rank of A is not known. However, after performing the QR factorization, its rank is simply the rank of the upper triangular factor R. • With column pivoting, there exists a permutation matrix P such that AP = QR, where the larger entries of R are moved toward the upper left corner. Such an arrangement, when possible, partitions R so that the smallest entries are isolated in the bottom submatrix.
Low Rank Approximation for QR Factorization • Partition R so that its smallest entries form the bottom submatrix R22. • Redefining R22 to be the 4×2 zero matrix, the modified upper triangular matrix R has rank 5 rather than 7. • Hence, the modified matrix A + E has rank 5. • Show that ||E||F = ||R22||F. • Show that ||E||F / ||A||F = ||R22||F / ||R||F = 0.3237. • Therefore, the relative change of 32.37% in R yields the same relative change in A. • With r = 4, the relative change is 76%.
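The following sketch mimics this computation on stand-in data (not the text's example): it zeros the trailing block of a pivoted R and reports the relative Frobenius-norm change, which is the same measured against R or against A because multiplication by the orthogonal Q (and by a permutation) preserves the Frobenius norm.

```python
import numpy as np
from scipy.linalg import qr

# Stand-in nearly rank-5 matrix (not the example from the text).
rng = np.random.default_rng(3)
A = rng.random((9, 5)) @ rng.random((5, 7)) + 1e-3 * rng.random((9, 7))

Q, R, piv = qr(A, pivoting=True)

# Redefine the trailing block of R (the "R22" part) to be zero; since R is
# upper triangular, R[k:, :] is exactly R22, so ||E||_F = ||R22||_F.
k = 5
E_norm = np.linalg.norm(R[k:, :], "fro")

# Relative change in R equals the relative change in A.
print(E_norm / np.linalg.norm(R, "fro"), E_norm / np.linalg.norm(A, "fro"))
```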
Comparing Cosine Similarities for the Query: “Child Proofing”
Comparing Cosine Similarities for the Query: “Child Home Safety”
Singular Value Decomposition • While the QR factorization provides a reduced-rank basis for the column space of A, it provides no information about the row space of A. • The SVD can provide • a reduced-rank approximation for both spaces • the rank-k approximation to A of minimal change for any value of k.
Singular Value Decomposition • A = UΣVT, where U: m×m orthogonal matrix whose columns define the left singular vectors of A V: n×n orthogonal matrix whose columns define the right singular vectors of A Σ: m×n diagonal matrix containing the singular values σ1 ≥ σ2 ≥ … ≥ σmin{m,n} ≥ 0 • Such a factorization exists for any matrix A.
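A minimal NumPy sketch of the decomposition and its properties (random stand-in matrix):

```python
import numpy as np

rng = np.random.default_rng(4)
A = rng.random((6, 4))  # illustrative term-by-document matrix

# full_matrices=True gives U (6x6), s (4,), Vt (4x4); Sigma is 6x4 diagonal.
U, s, Vt = np.linalg.svd(A, full_matrices=True)

Sigma = np.zeros_like(A)
np.fill_diagonal(Sigma, s)

assert np.allclose(A, U @ Sigma @ Vt)     # A = U Sigma V^T
assert np.allclose(U.T @ U, np.eye(6))    # U orthogonal
assert np.allclose(Vt @ Vt.T, np.eye(4))  # V orthogonal
assert np.all(np.diff(s) <= 0)            # singular values non-increasing
```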
SVD vs. QR • What is the relationship between the rank of A and the ranks of the matrices in the two factorizations? • In QR, the first rA columns of Q form a basis for the column space of A; so do the first rA columns of U. The first rA rows of VT form a basis for the row space of A. • The low rank-k approximation in SVD is obtained by setting all but the k largest singular values in Σ to zero.
SVD • Theorem: The rank-k approximation from the SVD is the closest rank-k approximation to A. • Proven by Eckart and Young, who showed that the error in approximating A by Ak is given by ||A − Ak||F = √( σk+1² + σk+2² + … + σrA² ), where Ak = UkΣkVkT. • Hence, the error in approximating the original matrix is determined by the discarded singular values σk+1, σk+2, …, σrA, with rA = rank(A).
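The sketch below forms a rank-k approximation by truncating the SVD and verifies the Eckart–Young error formula numerically (stand-in data; k = 3 is arbitrary):

```python
import numpy as np

rng = np.random.default_rng(5)
A = rng.random((9, 7))  # illustrative matrix

U, s, Vt = np.linalg.svd(A, full_matrices=False)

k = 3
# Rank-k approximation: keep the k largest singular values, drop the rest.
A_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

# Eckart-Young: ||A - A_k||_F = sqrt(s_{k+1}^2 + ... + s_r^2).
err = np.linalg.norm(A - A_k, "fro")
assert np.isclose(err, np.sqrt((s[k:] ** 2).sum()))
```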
SVD: Example • ||A − A6||F = …… • Hence, the relative change in the matrix A is … • Therefore, a rank-5 approximation may be appropriate in our case. • Determining the best rank approximation for any database depends on empirical testing. • For very large databases, the number could be between 100 and 300. • Computational feasibility, rather than accuracy, often determines the rank reduction.
Low Rank Approximations • Visual comparison of rank-reduced approximations to A can be misleading. • Compare the rank-4 QR approximation with the more accurate rank-4 SVD approximation. • The rank-4 SVD approximation reveals associations with terms that do not appear in the original document title • e.g., Term 4 (Health) and Term 8 (Safety) in Document 1 (Infant & Toddler First Aid).
Query Matching • Given a query vector q, we compare it with the columns of the reduced-rank matrix Ak. • Let ej denote the jth canonical vector (the jth column of the n×n identity matrix In). Then Akej represents the jth column of Ak, i.e., the jth document vector of the rank-k approximation. • It is easy to show that cos θj = (Akej)Tq / (||Akej||2 ||q||2) = sjT(UkTq) / (||sj||2 ||q||2), where sj = ΣkVkTej.
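A hedged sketch of this computation (random stand-in term-by-document matrix and query; k = 3 is arbitrary), confirming that the reduced-space formula reproduces the cosines computed directly from the columns of Ak:

```python
import numpy as np

rng = np.random.default_rng(6)
A = rng.random((9, 7))  # illustrative term-by-document matrix
q = rng.random(9)       # illustrative query vector

U, s, Vt = np.linalg.svd(A, full_matrices=False)
k = 3
Uk, sk, Vtk = U[:, :k], s[:k], Vt[:k, :]
A_k = Uk @ np.diag(sk) @ Vtk

# s_j = Sigma_k V_k^T e_j is the jth column of diag(s_k) @ Vtk.
S = np.diag(sk) @ Vtk

# cos(theta_j) = s_j^T (U_k^T q) / (||s_j|| ||q||): only k-dim work per doc.
cos_reduced = (S.T @ (Uk.T @ q)) / (np.linalg.norm(S, axis=0) * np.linalg.norm(q))

# Same values computed directly from the columns of A_k.
cos_direct = (A_k.T @ q) / (np.linalg.norm(A_k, axis=0) * np.linalg.norm(q))
assert np.allclose(cos_reduced, cos_direct)
```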
Query Matching • An alternate formula for the cosine computation is cos θ′j = sjT(UkTq) / (||sj||2 ||UkTq||2), which normalizes by the projected query UkTq rather than by q itself. • Note that ||UkTq||2 ≤ ||q||2, so cos θ′j ≥ cos θj, which means that the number of documents retrieved at a fixed cosine threshold using this query matching technique is larger.
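A short sketch comparing the two formulas on stand-in data; since cos θ′j differs from cos θj only by the factor ||q||2 / ||UkTq||2 ≥ 1, the alternate cosines are never smaller in magnitude:

```python
import numpy as np

rng = np.random.default_rng(7)
A = rng.random((9, 7))  # illustrative matrix
q = rng.random(9)       # illustrative query

U, s, Vt = np.linalg.svd(A, full_matrices=False)
k = 3
Uk, sk, Vtk = U[:, :k], s[:k], Vt[:k, :]
S = np.diag(sk) @ Vtk
proj = Uk.T @ q  # coordinates of the query projected into the k-dim space

norms = np.linalg.norm(S, axis=0)
cos_orig = (S.T @ proj) / (norms * np.linalg.norm(q))
cos_alt = (S.T @ proj) / (norms * np.linalg.norm(proj))

# ||U_k^T q|| <= ||q||, so each alternate cosine is at least as large in
# magnitude; more documents clear any fixed retrieval threshold.
assert np.all(np.abs(cos_alt) >= np.abs(cos_orig) - 1e-12)
```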