Why Spectral Retrieval Works
SIGIR 2005, Salvador, Brazil, August 15 – 19
Holger Bast, Max-Planck-Institut für Informatik (MPII), Saarbrücken, Germany
joint work with Debapriyo Majumdar
What we mean by spectral retrieval
• Ranked retrieval in the term space
• "true" similarities to the query: 1.00  1.00  0.00  0.50  0.00
• cosine similarities $\frac{q^T d_i}{|q|\,|d_i|}$: 0.82  0.00  0.00  0.38  0.00
• Spectral retrieval = linear projection to an eigensubspace, via a projection matrix $L$
• cosine similarities in the subspace, $\frac{(Lq)^T (L d_i)}{|Lq|\,|L d_i|}$: 0.98  0.98  -0.25  0.73  0.01
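To make the two scoring schemes concrete, here is a minimal numpy sketch; the toy term-document matrix, the term labels, and the query are hypothetical stand-ins for the slide's example, not data from the paper:

```python
import numpy as np

# Hypothetical toy term-document matrix: rows = terms, columns = documents
A = np.array([
    [1, 1, 0, 0],   # internet
    [1, 0, 1, 0],   # web
    [0, 1, 1, 0],   # surfing
    [0, 0, 1, 1],   # beach
    [0, 0, 0, 1],   # sand
], dtype=float)

q = np.array([1, 0, 0, 0, 0], dtype=float)   # query: "internet"

# Ranked retrieval in the term space: plain cosine similarities
cos_term = (A.T @ q) / (np.linalg.norm(A, axis=0) * np.linalg.norm(q))

# Spectral retrieval: project onto the top-k left singular vectors
U, S, Vt = np.linalg.svd(A, full_matrices=False)
k = 2
L = U[:, :k].T                  # k x m projection matrix
Lq, LA = L @ q, L @ A
cos_sub = (LA.T @ Lq) / (np.linalg.norm(LA, axis=0) * np.linalg.norm(Lq))
```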
Why and when does this work?
• Previous work: if the term-document matrix is a slight perturbation of a rank-k matrix, then projection to a k-dimensional subspace works
  • Papadimitriou, Tamaki, Raghavan, Vempala PODS'98
  • Ding SIGIR'99
  • Ando and Lee SIGIR'01
  • Azar, Fiat, Karlin, McSherry, Saia STOC'01
• Our explanation: spectral retrieval works through its ability to identify pairs of terms with similar co-occurrence patterns
  • no single subspace is appropriate for all term pairs
  • we fix that problem
Spectral retrieval — alternative view
• Ranked retrieval in the term space
• Spectral retrieval = linear projection to an eigensubspace, via a projection matrix $L$
• cosine similarities in the subspace: $\frac{(Lq)^T (L d_1)}{|Lq|\,|L d_1|} = \frac{q^T (L^T L\, d_1)}{|Lq|\,|L^T L\, d_1|}$ (the second form holds because $L$ has orthonormal rows, so $|L^T L\, d_1| = |L d_1|$)
• viewing $L^T L$ as an expansion matrix, $\frac{q^T (L^T L\, d_1)}{|q|\,|L^T L\, d_1|}$ are the similarities after document expansion; replacing $|Lq|$ by $|q|$ rescales all scores for a fixed query by the same constant, so the ranking is unchanged
• Spectral retrieval = document expansion (not query expansion)
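This identity is easy to check numerically. A small sketch, assuming $L$ is built from the top-$k$ left singular vectors (so its rows are orthonormal):

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.random((20, 30))            # random stand-in term-document matrix
U, S, Vt = np.linalg.svd(A, full_matrices=False)
k = 5
L = U[:, :k].T                      # projection matrix with orthonormal rows
E = L.T @ L                         # expansion matrix

q, d = rng.random(20), A[:, 0]

lhs = (L @ q) @ (L @ d) / (np.linalg.norm(L @ q) * np.linalg.norm(L @ d))
rhs = q @ (E @ d) / (np.linalg.norm(L @ q) * np.linalg.norm(E @ d))
assert np.isclose(lhs, rhs)         # |L^T L d| = |L d| since L L^T = I
```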
Why document "expansion" internet surfing beach web = · 0-1 expansion matrix
Why document "expansion" add "internet" if "web" is present internet surfing beach web = · 0-1 expansion matrix
Why document "expansion" • Ideal expansion matrix has • high scores for intuitively related terms • low scores for intuitively unrelated terms add "internet" if "web" is present internet surfing beach web = · matrix L projectingto 2 dimensions expansion matrix LTL expansion matrixdepends heavily on the subspace dimension!
Why document "expansion" • Ideal expansion matrix has • high scores for intuitively related terms • low scores for intuitively unrelated terms add "internet" if "web" is present internet surfing beach web = · matrix L projectingto 3 dimensions expansion matrix LTL expansion matrixdepends heavily on the subspace dimension!
Our Key Observation
• We studied how the entries in the expansion matrix depend on the dimension of the subspace to which documents are projected
• [Plots: expansion-matrix entry vs. subspace dimension (0–600) for the term pairs logic / logics, node / vertex, logic / vertex]
• no single dimension is appropriate for all term pairs
• but the shape of the curve is a good indicator of relatedness!
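Such curves are cheap to compute: for the standard LSI choice $L = U_k^T$, the expansion matrix is $U_k U_k^T$ with $(U_k U_k^T)_{ij} = \sum_{l \le k} U_{il} U_{jl}$, so the whole curve for a term pair is a cumulative sum, with no SVD recomputation per dimension. A minimal sketch with a random stand-in matrix and placeholder term indices:

```python
import numpy as np

rng = np.random.default_rng(2)
A = rng.random((50, 200))          # random stand-in term-document matrix
U, _, _ = np.linalg.svd(A, full_matrices=False)

i, j = 0, 1                        # placeholder indices of a term pair
# curve[k-1] = expansion-matrix entry (i, j) for the k-dimensional subspace
curve = np.cumsum(U[i, :] * U[j, :])
```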
Curves for related terms
• We call two terms perfectly related if they have an identical co-occurrence pattern
• [Plots: expansion-matrix entry vs. subspace dimension (0–600), three panels]
  • proven shape for perfectly related terms: up and then down
  • provably small change after a slight perturbation: the up-and-then-down shape remains
  • half way to a real matrix: the point of fall-off is different for every term pair!
Curves for unrelated terms
• Co-occurrence graph: vertices = terms, edges = two terms co-occur in some document
• We call two terms perfectly unrelated if no path connects them in the graph (a condition that can be checked directly; see the sketch below)
• [Plots: expansion-matrix entry vs. subspace dimension (0–600), three panels]
  • proven shape for perfectly unrelated terms
  • provably small change after a slight perturbation
  • half way to a real matrix: curves for unrelated terms are random oscillations around zero
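The graph condition itself is a plain reachability test. A minimal breadth-first sketch; the function name and the dense co-occurrence matrix are our illustration choices, not from the paper:

```python
import numpy as np
from collections import deque

def perfectly_unrelated(A, i, j):
    """True if no path connects terms i and j in the co-occurrence graph
    of the term-document matrix A (rows = terms, nonnegative entries)."""
    cooc = (A @ A.T) > 0                 # terms co-occur in some document
    seen, queue = {i}, deque([i])
    while queue:
        t = queue.popleft()
        if t == j:
            return False                 # found a path: not unrelated
        for u in np.nonzero(cooc[t])[0]:
            if u not in seen:
                seen.add(u)
                queue.append(u)
    return True
```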
Telling the shapes apart — TN
• Normalize the term-document matrix so that the theoretical point of fall-off is equal for all term pairs
• For each term pair: if the curve is never negative before this point, set the entry in the expansion matrix to 1, otherwise to 0
• [Plots: three example curves; the first two get entry 1, the third gets entry 0]
• a simple 0-1 classification, no fractional entries!
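A sketch of how the TN test could look in code; it assumes the normalization has already been applied to A, and k_star stands for the common theoretical fall-off point (both are assumptions for illustration, not values from the paper):

```python
import numpy as np

def tn_expansion_matrix(A, k_star):
    """Non-negativity test (TN), as a sketch: entry (i, j) becomes 1 iff
    the curve of expansion-matrix entries is never negative for the
    subspace dimensions 1..k_star. Assumes A is already normalized."""
    U, _, _ = np.linalg.svd(A, full_matrices=False)
    m = A.shape[0]
    E = np.zeros((m, m))
    for i in range(m):
        for j in range(m):
            curve = np.cumsum(U[i, :k_star] * U[j, :k_star])
            E[i, j] = 1.0 if (curve >= 0).all() else 0.0
    return E
```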
An alternative algorithm — TM
• Again, normalize the term-document matrix so that the theoretical point of fall-off is equal for all term pairs
• For each term pair, compute the monotonicity of its initial curve (1 if perfectly monotone, approaching 0 as the number of turns increases)
• If the monotonicity is above some threshold, set the entry in the expansion matrix to 1, otherwise to 0
• [Plots: three example curves with monotonicity values 0.07, 0.69, 0.82; two entries are set to 1, one to 0]
• again: a simple 0-1 classification!
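A sketch of the TM test; the monotonicity measure below, based on counting direction reversals, is one plausible formalization of "1 if perfectly monotone, approaching 0 with more turns", and the threshold is a free parameter; neither is taken from the paper:

```python
import numpy as np

def monotonicity(curve):
    """1.0 for a perfectly monotone curve, tending to 0 as the number of
    direction reversals ("turns") grows. One plausible formalization."""
    d = np.sign(np.diff(curve))
    d = d[d != 0]                              # ignore flat segments
    turns = np.count_nonzero(d[1:] != d[:-1])  # sign changes = turns
    return 1.0 / (1.0 + turns)

def tm_expansion_matrix(A, k_star, threshold=0.5):
    """Monotonicity test (TM), as a sketch; threshold is a free
    parameter, not a value from the paper."""
    U, _, _ = np.linalg.svd(A, full_matrices=False)
    m = A.shape[0]
    E = np.zeros((m, m))
    for i in range(m):
        for j in range(m):
            curve = np.cumsum(U[i, :k_star] * U[j, :k_star])
            E[i, j] = 1.0 if monotonicity(curve) >= threshold else 0.0
    return E
```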
Experimental results (average precision)
• Test collections: 425 docs / 3882 terms; 21578 docs / 5701 terms; 233445 docs / 99117 terms
• Methods compared:
  • Baseline: cosine similarity in the term space
  • Latent Semantic Indexing (LSI), Dumais et al. 1990
  • Term-normalized LSI (LSI-RN), Ding et al. 2001
  • Correlation-based LSI (CORR), Dupret et al. 2001
  • Iterative Residual Rescaling (IRR), Ando & Lee 2001
  • our non-negativity test (TN) and our monotonicity test (TM)
• Note: the numbers for LSI, LSI-RN, CORR, IRR are for the best subspace dimension!
Conclusions
• Main message: spectral retrieval works through its ability to identify pairs of terms with similar co-occurrence patterns
  • a simple 0-1 classification that considers a sequence of subspaces is at least as good as schemes that commit to a fixed subspace
• Some useful corollaries …
  • new insights into the effect of term weighting and other normalizations for spectral retrieval
  • straightforward integration of known word relationships
  • consequences for spectral link analysis?
Obrigado!
Why document "expansion" • Ideal expansion matrix has • high scores for related terms • low scores for unrelated terms • Expansion matrix LTL depends on the subspace dimension add "internet" if "web" is present internet surfing beach web = · matrix L projectingto 4 dimensions expansion matrix LTL