Web Search and Data Mining Lecture 4 Adapted from Manning, Raghavan and Schuetze
Recap of the last lecture • MapReduce and distributed indexing • Scoring documents: linear combinations/zone weighting • tf-idf term weighting and vector spaces • Derivation of idf
This lecture • Vector space models • Dimension reduction: random projection • Review of linear algebra • Latent semantic indexing (LSI)
Documents as vectors • At the end of Lecture 3 we said: • Each doc d can now be viewed as a vector of wf-idf values, one component for each term • So we have a vector space • terms are axes • docs live in this space • Dimension is usually very large
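To make the vector view concrete, here is a minimal sketch (not from the lecture) that builds a tiny term-document matrix and applies a simple tf-idf-style weighting; the toy corpus and the particular weighting variant are assumptions, since the slides leave the exact wf-idf formula open.

```python
import numpy as np

# Toy corpus: three short documents over a small vocabulary (assumed example).
docs = ["car insurance auto insurance",
        "car repair shop",
        "best auto repair"]

vocab = sorted({t for d in docs for t in d.split()})

# tf[i, j] = raw count of term i in doc j: terms are axes, docs are points.
tf = np.array([[d.split().count(t) for t in vocab] for d in docs], dtype=float).T

df = (tf > 0).sum(axis=1)            # document frequency of each term
idf = np.log(len(docs) / df)         # one common idf variant (an assumption)
W = tf * idf[:, None]                # weighted term-document matrix

print(vocab)
print(W)                             # each column is one document vector
```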
Why turn docs into vectors? • First application: Query-by-example • Given a doc d, find others “like” it. • Now that d is a vector, find vectors (docs) “near” it. • Natural setting for bag of words model • Dimension reduction
Intuition • [Figure: documents d1–d5 plotted against term axes t1, t2, t3; θ and φ are angles between document vectors.] • Postulate: Documents that are “close together” in the vector space talk about the same things.
Measuring Document Similarity • Idea: Distance between d1 and d2 is the length of the vector |d1 – d2|. • Euclidean distance • Why is this not a great idea? • We still haven’t dealt with the issue of length normalization • Short documents would be more similar to each other by virtue of length, not topic • However, we can implicitly normalize by looking at angles instead
Cosine similarity • [Figure: vectors d1 and d2 in term space (t1, t2, t3), separated by angle θ.] • Distance between vectors d1 and d2 captured by the cosine of the angle θ between them. • Note – this is similarity, not distance • No triangle inequality for similarity.
Cosine similarity • A vector can be normalized (given a length of 1) by dividing each of its components by its length – here we use the L2 norm, |d|2 = √(Σi di²) • This maps vectors onto the unit sphere: for a normalized vector, Σi wi² = 1 • Then, longer documents don’t get more weight
Cosine similarity • Cosine of the angle between two vectors: sim(d1, d2) = (d1 · d2) / (|d1| |d2|) = Σi wi,1 wi,2 / (√(Σi wi,1²) √(Σi wi,2²)) • The denominator involves the lengths of the vectors – this is the normalization
Queries in the vector space model • Central idea: the query as a vector • We regard the query as a short document • We return the documents ranked by the closeness of their vectors to the query, also represented as a vector • Note that the query vector dq is very sparse!
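A minimal sketch of ranking by cosine similarity to a query vector, as described on the last few slides; the toy weights and query below are illustrative assumptions.

```python
import numpy as np

def cosine(u, v):
    """Cosine of the angle between vectors u and v (similarity, not distance)."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

# Columns are (already weighted) document vectors; rows are term axes. Toy numbers.
W = np.array([[0.0, 1.2, 0.0, 2.3],    # term 1
              [1.1, 0.0, 0.9, 0.0],    # term 2
              [0.4, 0.4, 0.0, 1.0]])   # term 3
q = np.array([0.0, 1.0, 1.0])          # the query, treated as a short document

# Rank documents by closeness (cosine similarity) of their vectors to the query.
scores = [cosine(W[:, j], q) for j in range(W.shape[1])]
ranking = np.argsort(scores)[::-1]
print("ranking:", ranking, "scores:", np.round(scores, 3))
```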
Dimensionality reduction • What if we could take our vectors and “pack” them into fewer dimensions (say from 50,000 down to 100) while preserving distances? • (Well, almost.) • Speeds up cosine computations. • Many possibilities, including: • Random projection. • “Latent semantic indexing”.
Random projection onto k<<m axes • Choose a random direction x1 in the vector space. • For i = 2 to k, • Choose a random direction xi that is orthogonal to x1, x2, … xi–1. • Project each document vector into the subspace spanned by {x1, x2, …, xk}.
E.g., from 3 to 2 dimensions • [Figure: documents d1 and d2 shown both in the original (t1, t2, t3) space and in the projected (x1, x2) space.] • x1 is a random direction in (t1, t2, t3) space. • x2 is chosen randomly but orthogonal to x1: the dot product of x1 and x2 is zero.
Guarantee • With high probability, relative distances are (approximately) preserved by projection.
Computing the random projection • Projecting n vectors from m dimensions down to k dimensions: • Start with the m × n term-document matrix A. • Find a random k × m orthogonal projection matrix R. • Compute the matrix product W = RA. • The jth column of W is the vector corresponding to doc j, but now in k << m dimensions.
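A sketch of this recipe with assumed toy sizes; the orthogonal random directions are obtained here via a QR decomposition of a random Gaussian matrix, which plays the same role as the Gram-Schmidt-style loop described above.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n, k = 5_000, 200, 100            # terms, docs, target dimension (assumed sizes)
A = rng.random((m, n))               # stand-in m x n term-document matrix

# k random, mutually orthogonal unit directions in term space: the QR factorization
# of a random Gaussian matrix plays the role of the Gram-Schmidt loop on the slide.
G = rng.standard_normal((m, k))
Q, _ = np.linalg.qr(G)               # columns of Q are orthonormal
R = Q.T                              # k x m projection matrix

W = R @ A                            # column j of W: doc j, now in k << m dimensions
print(W.shape)                       # -> (100, 200)
```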
Cost of computation • Computing W = RA takes a total of kmn multiplications (why?). • Expensive – see Resources for ways to do essentially the same thing, quicker. • Other variations use a sparse random matrix, with the entries of R drawn from {-1, 0, 1} with probabilities {1/6, 2/3, 1/6}.
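And a sketch of the sparse variant: entries of R drawn from {-1, 0, 1} with probabilities {1/6, 2/3, 1/6}. The scaling factor √(3/k) and the matrix sizes are assumptions added for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n, k = 10_000, 500, 100           # terms, docs, target dimension (assumed sizes)
A = rng.random((m, n))               # stand-in term-document matrix

# Sparse random projection: entries -1, 0, +1 with probabilities 1/6, 2/3, 1/6.
R = rng.choice([-1.0, 0.0, 1.0], size=(k, m), p=[1/6, 2/3, 1/6])
R *= np.sqrt(3.0 / k)                # scaling (assumed) so distances are preserved in expectation

W = R @ A                            # docs in k << m dimensions, no orthogonalization needed

# Spot-check: the distance between two docs is roughly preserved by the projection.
print(np.linalg.norm(A[:, 0] - A[:, 1]), np.linalg.norm(W[:, 0] - W[:, 1]))
```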
Latent semantic indexing (LSI) • Another technique for dimension reduction • Random projection was data-independent • LSI on the other hand is data-dependent • Eliminate redundant axes • Pull together “related” axes – hopefully • car and automobile
Eigenvalues & Eigenvectors • Eigenvectors (for a square m × m matrix S): Sv = λv, where v ≠ 0 is a (right) eigenvector and λ the corresponding eigenvalue • How many eigenvalues are there at most? Sv = λv only has a non-zero solution if det(S – λI) = 0; this is an m-th order equation in λ which can have at most m distinct solutions (roots of the characteristic polynomial), and they can be complex even though S is real
Eigenvalues & Eigenvectors • For symmetric matrices, eigenvectors for distinct eigenvalues are orthogonal • All eigenvalues of a real symmetric matrix are real • All eigenvalues of a positive semidefinite matrix are non-negative
Example • Let S be a real, symmetric 2 × 2 matrix • Then solve det(S – λI) = 0 • The eigenvalues are 1 and 3 (non-negative, real) • Plug in these values and solve for the eigenvectors • The eigenvectors are orthogonal (and real)
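A quick numerical check of such an example; the concrete matrix S = [[2, 1], [1, 2]] is an assumption here (the slide's matrix is not recoverable from the text), chosen to be real, symmetric, and to have eigenvalues 1 and 3.

```python
import numpy as np

S = np.array([[2.0, 1.0],
              [1.0, 2.0]])           # assumed example matrix: real, symmetric

lam, V = np.linalg.eig(S)
print("eigenvalues:", lam)           # contains 1 and 3 (real, non-negative); order may vary
print("eigenvectors (columns):\n", V)
print("orthogonal?", np.isclose(V[:, 0] @ V[:, 1], 0.0))   # -> True
```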
Eigen/diagonal Decomposition • Let S be a square m × m matrix with m linearly independent eigenvectors (a “non-defective” matrix) • Theorem: there exists an eigen decomposition S = UΛU⁻¹ with Λ diagonal (cf. the matrix diagonalization theorem) • Columns of U are eigenvectors of S • Diagonal elements of Λ are the eigenvalues of S • The decomposition is unique for distinct eigenvalues
Diagonal decomposition: why/how • Let U have the eigenvectors as columns: U = [v1 … vm] • Then SU can be written SU = S[v1 … vm] = [λ1v1 … λmvm] = UΛ, with Λ = diag(λ1, …, λm) • Thus SU = UΛ, or U⁻¹SU = Λ, and S = UΛU⁻¹
Diagonal decomposition - example • Recall the 2 × 2 symmetric matrix S from the example above, with eigenvalues λ1 = 1 and λ2 = 3 • Its eigenvectors form the columns of U • Recall UU⁻¹ = I; inverting, we have U⁻¹ • Then S = UΛU⁻¹ with Λ = diag(1, 3)
Example continued • Let’s divide U (and multiply U⁻¹) by the length of its columns, so that the eigenvectors become unit vectors • Then S = QΛQᵀ, where Q is orthogonal (Q⁻¹ = Qᵀ) • Why? Stay tuned …
Symmetric Eigen Decomposition • If S is a symmetric matrix: • Theorem: there exists a (unique) eigen decomposition S = QΛQᵀ • where Q is orthogonal: Q⁻¹ = Qᵀ • Columns of Q are normalized eigenvectors • Columns are orthogonal • (everything is real)
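A numerical check of the symmetric eigen decomposition, using the same assumed example matrix as above:

```python
import numpy as np

S = np.array([[2.0, 1.0],
              [1.0, 2.0]])           # same assumed symmetric example matrix

# For a symmetric matrix, eigh returns real eigenvalues and orthonormal eigenvectors.
lam, Q = np.linalg.eigh(S)
Lambda = np.diag(lam)

print(np.allclose(Q.T @ Q, np.eye(2)))    # Q is orthogonal: Q^{-1} = Q^T
print(np.allclose(S, Q @ Lambda @ Q.T))   # S = Q Lambda Q^T  -> True
```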
Time out! • I came to this class to learn about text retrieval and mining, not have my linear algebra past dredged up again … • But if you want to dredge, Strang’s Applied Mathematics is a good place to start. • What do these matrices have to do with text? • Recall m × n term-document matrices … • But everything so far needs square matrices – so …
Singular Value Decomposition • For an m × n matrix A of rank r there exists a factorization (Singular Value Decomposition = SVD) as follows: A = UΣVᵀ, with U of size m × m, Σ of size m × n, and V of size n × n • The columns of U are orthogonal eigenvectors of AAᵀ • The columns of V are orthogonal eigenvectors of AᵀA • The eigenvalues λ1 … λr of AAᵀ are the eigenvalues of AᵀA, and the singular values σi = √λi form the diagonal of Σ
Singular Value Decomposition • Illustration of SVD dimensions and sparseness
SVD example • Let A be a 3 × 2 matrix; thus m = 3, n = 2 • Its SVD is A = UΣVᵀ • Typically, the singular values are arranged in decreasing order
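A sketch of computing an SVD numerically; the 3 × 2 matrix below is an assumed stand-in for the example matrix on the slide, which is not recoverable from the text.

```python
import numpy as np

A = np.array([[1.0, -1.0],
              [0.0,  1.0],
              [1.0,  0.0]])          # assumed 3 x 2 example (m = 3, n = 2)

# Full SVD: U is m x m, s holds the singular values (decreasing), Vt is n x n.
U, s, Vt = np.linalg.svd(A, full_matrices=True)
print("singular values:", s)         # arranged in decreasing order

# Rebuild A = U Sigma V^T to confirm the factorization.
Sigma = np.zeros_like(A)
Sigma[:len(s), :len(s)] = np.diag(s)
print(np.allclose(A, U @ Sigma @ Vt))    # -> True

# Eigenvalues of A^T A are the squared singular values.
print(np.allclose(np.sort(np.linalg.eigvalsh(A.T @ A)), np.sort(s ** 2)))
```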
Low-rank Approximation • SVD can be used to compute optimal low-rank approximations • Approximation problem: find Ak of rank k minimizing the Frobenius norm of the error, Ak = argmin over X with rank(X) = k of ‖A – X‖F, where Ak and X are both m × n matrices • Typically, we want k << r
Low-rank Approximation • Solution via SVD: set the smallest r – k singular values to zero, Ak = U diag(σ1, …, σk, 0, …, 0) Vᵀ • In column notation this is a sum of k rank-1 matrices: Ak = Σ i=1..k σi ui viᵀ
Approximation error • How good (bad) is this approximation? • It’s the best possible, measured by the Frobenius norm of the error: min over X with rank(X) = k of ‖A – X‖F = ‖A – Ak‖F = √(σk+1² + … + σr²), where the σi are ordered such that σi ≥ σi+1 • Suggests why the Frobenius error drops as k is increased
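A sketch of the truncation recipe and the error formula on a random matrix; the sizes and the value of k are assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.random((50, 30))             # toy m x n matrix (assumed sizes)
k = 5

U, s, Vt = np.linalg.svd(A, full_matrices=False)

# Rank-k approximation: keep the k largest singular values, zero out the rest.
A_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

# Frobenius error of the best rank-k approximation = sqrt of the sum of the
# squared discarded singular values.
err = np.linalg.norm(A - A_k, "fro")
print(err, np.sqrt(np.sum(s[k:] ** 2)))   # the two numbers agree
```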
SVD Low-rank approximation • Whereas the term-doc matrix A may have m=50000, n=10 million (and rank close to 50000) • We can construct an approximation A100 with rank 100. • Of all rank 100 matrices, it would have the lowest Frobenius error. • Great … but why would we?? • Answer: Latent Semantic Indexing C. Eckart, G. Young, The approximation of a matrix by another of lower rank. Psychometrika, 1, 211-218, 1936.
What it is • From term-doc matrix A, we compute the approximation Ak. • There is a row for each term and a column for each doc in Ak • Thus docs live in a space of k<<r dimensions • These dimensions are not the original axes • But why?
Vector Space Model: Pros • Automatic selection of index terms • Partial matching of queries and documents (dealing with the case where no document contains all search terms) • Ranking according to similarity score (dealing with large result sets) • Term weighting schemes (improve retrieval performance) • Various extensions • Document clustering • Relevance feedback (modifying the query vector) • Geometric foundation
Problems with Lexical Semantics • Ambiguity and association in natural language • Polysemy: Words often have a multitude of meanings and different types of usage (more severe in very heterogeneous collections). • The vector space model is unable to discriminate between different meanings of the same word.
Problems with Lexical Semantics • Synonymy: Different terms may have identical or a similar meaning (weaker: words indicating the same topic). • No associations between words are made in the vector space representation.
Latent Semantic Indexing (LSI) • Perform a low-rank approximation of the document-term matrix (typical rank 100-300) • General idea • Map documents (and terms) to a low-dimensional representation • Design a mapping such that the low-dimensional space reflects semantic associations (latent semantic space) • Compute document similarity based on the inner product in this latent semantic space
Goals of LSI • Similar terms map to similar location in low dimensional space • Noise reduction by dimension reduction
Latent Semantic Analysis • Latent semantic space: illustrating example courtesy of Susan Dumais
Performing the maps • Each row and column of A gets mapped into the k-dimensional LSI space by the SVD • Claim – this is not only the mapping with the best (Frobenius error) approximation to A, but it in fact improves retrieval • A query q is also mapped into this space, by qk = Σk⁻¹ Ukᵀ q • The mapped query is NOT a sparse vector
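A sketch of these maps with a truncated SVD; the matrix sizes, the choice of k, and the query are illustrative assumptions, and the query map follows the formula above, qk = Σk⁻¹ Ukᵀ q.

```python
import numpy as np

rng = np.random.default_rng(2)
m, n, k = 2_000, 300, 50                  # terms, docs, LSI dimension (assumed)
A = rng.random((m, n))                    # stand-in term-document matrix

U, s, Vt = np.linalg.svd(A, full_matrices=False)
U_k, S_k, Vt_k = U[:, :k], np.diag(s[:k]), Vt[:k, :]

# Documents in the k-dimensional LSI space: columns of S_k Vt_k (one per doc).
docs_k = S_k @ Vt_k

# Map a query q into the same space: q_k = S_k^{-1} U_k^T q (no longer sparse).
q = np.zeros(m); q[[3, 17, 42]] = 1.0     # a sparse query over assumed term ids
q_k = np.linalg.inv(S_k) @ (U_k.T @ q)

# Rank documents by cosine similarity in the latent semantic space.
sims = (docs_k.T @ q_k) / (np.linalg.norm(docs_k, axis=0) * np.linalg.norm(q_k))
print("top docs:", np.argsort(sims)[::-1][:5])
```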
Empirical evidence • Experiments on TREC 1/2/3 – Dumais • Lanczos SVD code (available on netlib) due to Berry was used in these experiments • Running times of ~ one day on tens of thousands of docs (old data) • Dimensions – various values 250-350 reported • (Under 200 reported unsatisfactory) • Generally expect recall to improve – what about precision?
Empirical evidence • Precision at or above median TREC precision • Top scorer on almost 20% of TREC topics • Slightly better on average than straight vector spaces • Effect of dimensionality:
Some wild extrapolation • The “dimensionality” of a corpus is the number of distinct topics represented in it. • More mathematical wild extrapolation: • if A has a rank k approximation of low Frobenius error, then there are no more than k distinct topics in the corpus. (“Latent semantic indexing: A probabilistic analysis”)
LSI has many other applications • In many settings in pattern recognition and retrieval, we have a feature-object matrix. • For text, the terms are features and the docs are objects. • Could be opinions and users … • This matrix may be redundant in dimensionality. • Can work with low-rank approximation. • If entries are missing (e.g., users’ opinions), can recover if dimensionality is low. • Powerful general analytical technique • Close, principled analog to clustering methods.