Homework
• Define a loss function that compares two matrices (say, mean squared error)
• b = svd(bellcore)
• b2 = b$u[,1:2] %*% diag(b$d[1:2]) %*% t(b$v[,1:2])
• b3 = b$u[,1:3] %*% diag(b$d[1:3]) %*% t(b$v[,1:3])
• More generally, for all possible r
• Let b.r = b$u[,1:r] %*% diag(b$d[1:r]) %*% t(b$v[,1:r])
• Compute the loss between bellcore and b.r as a function of r
• Plot the loss as a function of r
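One possible way to set up the homework is sketched below. The helper names mse and rank_r are hypothetical, and the matrix is retyped from the bellcore data used in these slides. Note one assumption that differs from the slide formula: diag(x) with a single number builds an identity matrix, so the sketch passes nrow = r to keep r = 1 working.

```r
# Term-by-document counts, one document per row of the literal (column-major fill).
terms <- c("human", "interface", "computer", "user", "system", "response",
           "time", "EPS", "survey", "trees", "graph", "minors")
docs <- c("c1", "c2", "c3", "c4", "c5", "m1", "m2", "m3", "m4")
bellcore <- matrix(c(
  1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0,   # c1
  0, 0, 1, 1, 1, 1, 1, 0, 1, 0, 0, 0,   # c2
  0, 1, 0, 1, 1, 0, 0, 1, 0, 0, 0, 0,   # c3
  1, 0, 0, 0, 2, 0, 0, 1, 0, 0, 0, 0,   # c4
  0, 0, 0, 1, 0, 1, 1, 0, 0, 0, 0, 0,   # c5
  0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0,   # m1
  0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0,   # m2
  0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1,   # m3
  0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 1    # m4
), nrow = 12, dimnames = list(terms, docs))

# Hypothetical loss: mean squared error between two matrices of equal shape.
mse <- function(a, b) mean((a - b)^2)

# Hypothetical helper: rank-r reconstruction from an svd() result.
rank_r <- function(s, r) {
  s$u[, 1:r, drop = FALSE] %*% diag(s$d[1:r], nrow = r) %*%
    t(s$v[, 1:r, drop = FALSE])
}

b <- svd(bellcore)
loss <- sapply(seq_along(b$d), function(r) mse(bellcore, rank_r(b, r)))

plot(seq_along(loss), loss, type = "b", xlab = "r",
     ylab = "mean squared error")
```

With this loss, the error at rank r equals the sum of the squared discarded singular values divided by the number of matrix entries, so the curve should fall to (numerically) zero at full rank.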
IR Models
• Keywords (and Boolean combinations thereof)
• Vector-Space "Model" (Salton, chap 10.1)
• Represent the query and the documents as V-dimensional vectors
• Sort vectors by
• Probabilistic Retrieval Model (Salton, chap 10.3)
• Sort documents by
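A common way to sort document vectors in the vector-space model is by cosine similarity to the query; the sketch below assumes that choice. The 3-term, 3-document matrix X, the query q, and the helper cosine are all made up for illustration.

```r
# Hypothetical helper: cosine between a query vector and each column of X.
# X is terms x documents; q has one entry per term.
cosine <- function(q, X) {
  as.vector(crossprod(X, q)) / (sqrt(colSums(X^2)) * sqrt(sum(q^2)))
}

# Toy term-by-document matrix (column-major fill, one document per literal row).
X <- matrix(c(1, 1, 0,    # d1
              0, 1, 1,    # d2
              0, 0, 1),   # d3
            nrow = 3,
            dimnames = list(c("t1", "t2", "t3"), c("d1", "d2", "d3")))

q <- c(t1 = 1, t2 = 1, t3 = 0)

scores <- cosine(q, X)
names(scores) <- colnames(X)

# Documents sorted from most to least similar to the query.
ranked <- sort(scores, decreasing = TRUE)
print(ranked)
```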
Information Retrieval and Web Search
Alternative IR models
Instructor: Rada Mihalcea
Some of the slides were adopted from a course taught at Cornell University by William Y. Arms
Latent Semantic Indexing
Objective: Replace indexes that use sets of index terms by indexes that use concepts.
Approach: Map the term vector space into a lower dimensional space, using singular value decomposition. Each dimension in the new space corresponds to a latent concept in the original data.
Deficiencies with Conventional Automatic Indexing
Synonymy: Various words and phrases refer to the same concept (lowers recall).
Polysemy: Individual words have more than one meaning (lowers precision).
Independence: No significance is given to two terms that frequently appear together.
Latent semantic indexing addresses the first of these (synonymy) and the third (dependence).
Bellcore's Example
http://en.wikipedia.org/wiki/Latent_semantic_analysis
c1 Human machine interface for Lab ABC computer applications
c2 A survey of user opinion of computer system response time
c3 The EPS user interface management system
c4 System and human system engineering testing of EPS
c5 Relation of user-perceived response time to error measurement
m1 The generation of random, binary, unordered trees
m2 The intersection graph of paths in trees
m3 Graph minors IV: Widths of trees and well-quasi-ordering
m4 Graph minors: A survey
"bellcore" <- structure(.Data = c(
  1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0,
  0, 0, 1, 1, 1, 1, 1, 0, 1, 0, 0, 0,
  0, 1, 0, 1, 1, 0, 0, 1, 0, 0, 0, 0,
  1, 0, 0, 0, 2, 0, 0, 1, 0, 0, 0, 0,
  0, 0, 0, 1, 0, 1, 1, 0, 0, 0, 0, 0,
  0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0,
  0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0,
  0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1,
  0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 1),
  .Dim = c(12, 9),
  .Dimnames = list(
    c("human", "interface", "computer", "user", "system", "response", "time", "EPS", "survey", "trees", "graph", "minors"),
    c("c1", "c2", "c3", "c4", "c5", "m1", "m2", "m3", "m4")))
help(dump)
help(source)
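The pointers to help(dump) and help(source) suggest saving an object as R text and reading it back. A minimal round-trip sketch follows; the small matrix m, the temporary file, and the two scratch environments are assumptions made so the example does not touch the workspace.

```r
# Scratch environment holding a small matrix (values chosen for illustration).
e <- new.env()
e$m <- matrix(c(1, 0, 2, 1), nrow = 2,
              dimnames = list(c("human", "system"), c("c1", "c4")))

# dump() writes R source text that recreates the named object.
f <- tempfile(fileext = ".R")
dump("m", file = f, envir = e)

# source() evaluates that text; here into a second scratch environment.
e2 <- new.env()
source(f, local = e2)

# The restored object should match the original.
same <- identical(e$m, e2$m)
print(same)

unlink(f)
```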
Query Expansion
Query: Find documents relevant to human computer interaction
Simple Term Matching:
Matches c1, c2, and c4
Misses c3 and c5
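The term-matching result can be reproduced with a short sketch. It assumes the query reduces to the two vocabulary terms "human" and "computer" ("interaction" is not in the term list), and it retypes the bellcore counts so the block runs on its own.

```r
terms <- c("human", "interface", "computer", "user", "system", "response",
           "time", "EPS", "survey", "trees", "graph", "minors")
docs <- c("c1", "c2", "c3", "c4", "c5", "m1", "m2", "m3", "m4")
bellcore <- matrix(c(
  1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0,   # c1
  0, 0, 1, 1, 1, 1, 1, 0, 1, 0, 0, 0,   # c2
  0, 1, 0, 1, 1, 0, 0, 1, 0, 0, 0, 0,   # c3
  1, 0, 0, 0, 2, 0, 0, 1, 0, 0, 0, 0,   # c4
  0, 0, 0, 1, 0, 1, 1, 0, 0, 0, 0, 0,   # c5
  0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0,   # m1
  0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0,   # m2
  0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1,   # m3
  0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 1    # m4
), nrow = 12, dimnames = list(terms, docs))

# Assumed query terms; only words present in the vocabulary can match.
query <- c("human", "computer")

# A document matches if it contains at least one query term.
hits <- colnames(bellcore)[colSums(bellcore[query, , drop = FALSE]) > 0]

# Human-computer-interaction documents that share no term with the query.
missed <- setdiff(c("c1", "c2", "c3", "c4", "c5"), hits)

print(hits)
print(missed)
```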
How to compute correlations

round(100 * cor(bellcore))
    c1  c2  c3  c4  c5  m1  m2  m3  m4
c1 100 -19   0   0 -33 -17 -26 -33 -33
c2 -19 100   0   0  58 -30 -45 -58 -19
c3   0   0 100  47   0 -21 -32 -41 -41
c4   0   0  47 100 -31 -16 -24 -31 -31
c5 -33  58   0 -31 100 -17 -26 -33 -33
m1 -17 -30 -21 -16 -17 100  67  52 -17
m2 -26 -45 -32 -24 -26  67 100  77  26
m3 -33 -58 -41 -31 -33  52  77 100  56
m4 -33 -19 -41 -31 -33 -17  26  56 100

round(100 * cor(t(bellcore)))
          human interface computer user system response time  EPS survey trees graph minors
human       100        36       36  -38     43      -29  -29   36    -29   -38   -38    -29
interface    36       100       36   19      4      -29  -29   36    -29   -38   -38    -29
computer     36        36      100   19      4       36   36  -29     36   -38   -38    -29
user        -38        19       19  100     23       76   76   19     19   -50   -50    -38
system       43         4        4   23    100        4    4   82      4   -46   -46    -35
response    -29       -29       36   76      4      100  100  -29     36   -38   -38    -29
time        -29       -29       36   76      4      100  100  -29     36   -38   -38    -29
EPS          36        36      -29   19     82      -29  -29  100    -29   -38   -38    -29
survey      -29       -29       36   19      4       36   36  -29    100   -38    19     36
trees       -38       -38      -38  -50    -46      -38  -38  -38    -38   100    50     19
graph       -38       -38      -38  -50    -46      -38  -38  -38     19    50   100     76
minors      -29       -29      -29  -38    -35      -29  -29  -29     36    19    76    100
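Singular Value Decomposition (SVD)
X = U D V^T
Dimensions: X is t x d, U is t x m, D is m x m, V^T is m x d
• m is the rank of X, with m ≤ min(t, d)
• D is diagonal, with the singular values sorted in descending order
• The diagonal entries of D^2 are the nonzero eigenvalues of X X^T and X^T X
• U^T U = I and V^T V = I (the columns of U and V are orthonormal)
• Columns of U are eigenvectors of X X^T
• Columns of V are eigenvectors of X^T X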
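The SVD properties can be checked numerically. The sketch below uses a random 6 x 4 matrix as a stand-in for X (an assumption, not data from these slides) and R's svd(), which returns the thin factors. It also illustrates that for a tall X the orthonormality holds as U^T U = I, while U U^T is in general not the identity.

```r
set.seed(1)
X <- matrix(rnorm(6 * 4), nrow = 6)   # stand-in matrix, t = 6, d = 4

s <- svd(X)
U <- s$u          # 6 x 4
D <- diag(s$d)    # 4 x 4, singular values in descending order
V <- s$v          # 4 x 4

# X = U D V^T, up to rounding error.
recon_err <- max(abs(X - U %*% D %*% t(V)))

# Columns of U and V are orthonormal.
utu_err <- max(abs(crossprod(U) - diag(4)))
vtv_err <- max(abs(crossprod(V) - diag(4)))

# Squared singular values are the eigenvalues of X^T X.
eig <- eigen(crossprod(X), symmetric = TRUE)$values
eig_err <- max(abs(eig - s$d^2))

# U U^T is a 6 x 6 projection of rank 4, so it is not the identity here.
uut_is_identity <- isTRUE(all.equal(U %*% t(U), diag(6)))

print(c(recon_err, utu_err, vtv_err, eig_err))
print(uut_is_identity)
```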
Dimensionality Reduction
X^ = U D V^T, keeping only the first k singular values
Dimensions: X^ is t x d, U is t x k, D is k x k, V^T is k x d
k is the number of latent concepts (typically 300 to 500)
Dimension Reduction in R
b = svd(bellcore)
b2 = b$u[,1:2] %*% diag(b$d[1:2]) %*% t(b$v[,1:2])
dimnames(b2) = dimnames(bellcore)
par(mfrow=c(2,2))
plot(hclust(as.dist(-cor(bellcore))))
plot(hclust(as.dist(-cor(t(bellcore)))))
plot(hclust(as.dist(-cor(b2))))
plot(hclust(as.dist(-cor(t(b2)))))
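To read the clusters off without inspecting the dendrograms, one option is cutree(). The sketch below is a variation, not the slide's code: it swaps in 1 - cor as the dissimilarity so the values are nonnegative. With hclust's default complete linkage this should only shift the merge heights by a constant. The bellcore counts are retyped so the block runs on its own.

```r
terms <- c("human", "interface", "computer", "user", "system", "response",
           "time", "EPS", "survey", "trees", "graph", "minors")
docs <- c("c1", "c2", "c3", "c4", "c5", "m1", "m2", "m3", "m4")
bellcore <- matrix(c(
  1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0,   # c1
  0, 0, 1, 1, 1, 1, 1, 0, 1, 0, 0, 0,   # c2
  0, 1, 0, 1, 1, 0, 0, 1, 0, 0, 0, 0,   # c3
  1, 0, 0, 0, 2, 0, 0, 1, 0, 0, 0, 0,   # c4
  0, 0, 0, 1, 0, 1, 1, 0, 0, 0, 0, 0,   # c5
  0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0,   # m1
  0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0,   # m2
  0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1,   # m3
  0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 1    # m4
), nrow = 12, dimnames = list(terms, docs))

# Rank-2 approximation, as on the slide.
b <- svd(bellcore)
b2 <- b$u[, 1:2] %*% diag(b$d[1:2]) %*% t(b$v[, 1:2])
dimnames(b2) <- dimnames(bellcore)

# Assumed dissimilarity: 1 - correlation between document columns.
doc_tree <- hclust(as.dist(1 - cor(b2)))

# Cut the document tree into two groups.
doc_groups <- cutree(doc_tree, k = 2)
print(doc_groups)
```

In the rank-2 space the two groups are expected to line up with the c documents and the m documents.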
SVD
B B^T = U D^2 U^T
B^T B = V D^2 V^T
[Figure labels: Doc, Term, Latent]
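Both identities follow from substituting B = U D V^T and using the orthonormal columns of U and V. A quick numerical check is sketched below on a random 5 x 3 stand-in for B (an assumption; any term-by-document matrix would do).

```r
set.seed(2)
B <- matrix(rnorm(5 * 3), nrow = 5)   # stand-in: 5 "terms" x 3 "documents"

s <- svd(B)
D2 <- diag(s$d^2)

# Term-by-term products: B B^T = U D^2 U^T.
term_err <- max(abs(B %*% t(B) - s$u %*% D2 %*% t(s$u)))

# Document-by-document products: B^T B = V D^2 V^T.
doc_err <- max(abs(t(B) %*% B - s$v %*% D2 %*% t(s$v)))

print(c(term_err, doc_err))
```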
Dimension Reduction Block Structure

round(100*cor(bellcore))
    c1  c2  c3  c4  c5  m1  m2  m3  m4
c1 100 -19   0   0 -33 -17 -26 -33 -33
c2 -19 100   0   0  58 -30 -45 -58 -19
c3   0   0 100  47   0 -21 -32 -41 -41
c4   0   0  47 100 -31 -16 -24 -31 -31
c5 -33  58   0 -31 100 -17 -26 -33 -33
m1 -17 -30 -21 -16 -17 100  67  52 -17
m2 -26 -45 -32 -24 -26  67 100  77  26
m3 -33 -58 -41 -31 -33  52  77 100  56
m4 -33 -19 -41 -31 -33 -17  26  56 100

round(100*cor(b2))
    c1  c2  c3  c4  c5  m1  m2  m3  m4
c1 100  91 100 100  84 -86 -85 -85 -81
c2  91 100  91  88  99 -57 -56 -56 -50
c3 100  91 100 100  84 -86 -85 -85 -81
c4 100  88 100 100  81 -89 -88 -88 -84
c5  84  99  84  81 100 -44 -44 -43 -37
m1 -86 -57 -86 -89 -44 100 100 100 100
m2 -85 -56 -85 -88 -44 100 100 100 100
m3 -85 -56 -85 -88 -43 100 100 100 100
m4 -81 -50 -81 -84 -37 100 100 100 100
Dimension Reduction Block Structure

round(100*cor(t(bellcore)))
          human interface computer user system response time  EPS survey trees graph minors
human       100        36       36  -38     43      -29  -29   36    -29   -38   -38    -29
interface    36       100       36   19      4      -29  -29   36    -29   -38   -38    -29
computer     36        36      100   19      4       36   36  -29     36   -38   -38    -29
user        -38        19       19  100     23       76   76   19     19   -50   -50    -38
system       43         4        4   23    100        4    4   82      4   -46   -46    -35
response    -29       -29       36   76      4      100  100  -29     36   -38   -38    -29
time        -29       -29       36   76      4      100  100  -29     36   -38   -38    -29
EPS          36        36      -29   19     82      -29  -29  100    -29   -38   -38    -29
survey      -29       -29       36   19      4       36   36  -29    100   -38    19     36
trees       -38       -38      -38  -50    -46      -38  -38  -38    -38   100    50     19
graph       -38       -38      -38  -50    -46      -38  -38  -38     19    50   100     76
minors      -29       -29      -29  -38    -35      -29  -29  -29     36    19    76    100

round(100*cor(t(b2)))
          human interface computer user system response time  EPS survey trees graph minors
human       100       100       93   94     99       82   82  100    -12   -85   -84    -83
interface   100       100       95   96    100       85   85  100     -7   -82   -80    -80
computer     93        95      100  100     96       98   98   93     26   -59   -57    -56
user         94        96      100  100     97       97   97   94     23   -62   -60    -59
system       99       100       96   97    100       88   88  100     -2   -79   -78    -77
response     82        85       98   97     88      100  100   83     46   -40   -38    -37
time         82        85       98   97     88      100  100   83     46   -40   -38    -37
EPS         100       100       93   94    100       83   83  100    -11   -84   -83    -82
survey      -12        -7       26   23     -2       46   46  -11    100    63    65     66
trees       -85       -82      -59  -62    -79      -40  -40  -84     63   100   100    100
graph       -84       -80      -57  -60    -78      -38  -38  -83     65   100   100    100
minors      -83       -80      -56  -59    -77      -37  -37  -82     66   100   100    100
The term vector space
The space has as many dimensions as there are terms in the word list.
[Figure: documents d1 and d2 drawn as vectors against term axes t1, t2, t3]
Latent concept vector space
[Figure: terms, documents, and the query plotted in the latent space; legend: term, document, query, dashed region marks cosine > 0.9]
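One way to place a query in the latent space is the usual fold-in, q^ = q^T U_k D_k^-1, followed by cosines against the document coordinates (rows of V_k). The sketch below assumes that convention, k = 2, and a query made of the vocabulary terms "human" and "computer"; the bellcore counts are retyped so the block runs on its own.

```r
terms <- c("human", "interface", "computer", "user", "system", "response",
           "time", "EPS", "survey", "trees", "graph", "minors")
docs <- c("c1", "c2", "c3", "c4", "c5", "m1", "m2", "m3", "m4")
bellcore <- matrix(c(
  1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0,   # c1
  0, 0, 1, 1, 1, 1, 1, 0, 1, 0, 0, 0,   # c2
  0, 1, 0, 1, 1, 0, 0, 1, 0, 0, 0, 0,   # c3
  1, 0, 0, 0, 2, 0, 0, 1, 0, 0, 0, 0,   # c4
  0, 0, 0, 1, 0, 1, 1, 0, 0, 0, 0, 0,   # c5
  0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0,   # m1
  0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0,   # m2
  0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1,   # m3
  0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 1    # m4
), nrow = 12, dimnames = list(terms, docs))

k <- 2                      # assumed number of latent concepts
b <- svd(bellcore)
Uk <- b$u[, 1:k]
dk <- b$d[1:k]
Vk <- b$v[, 1:k]            # one row of latent coordinates per document

# Query as a 0/1 term vector over the vocabulary.
q <- as.numeric(rownames(bellcore) %in% c("human", "computer"))

# Fold the query into the latent space: q^T U_k D_k^-1.
q_hat <- as.vector(crossprod(Uk, q)) / dk

# Cosine between the folded query and each document.
cosines <- as.vector(Vk %*% q_hat) /
  (sqrt(rowSums(Vk^2)) * sqrt(sum(q_hat^2)))
names(cosines) <- colnames(bellcore)

print(round(sort(cosines, decreasing = TRUE), 2))
```

Under these assumptions c3 and c5, which share no term with the query, should still score well above every m document.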