Efficient and Effective Link Analysis with Precomputed SALSA Maps Marc Najork (Microsoft Research, Mt View, CA, USA) Nick Craswell (Microsoft Live Search, Cambridge, UK)
Outline • The problem • Framework & previous results • Review of SALSA; introduction of CS-SALSA • Four pre-computed variants of SALSA: • Strawman: SS-SALSA-0 • Woodman: SS-SALSA-1 • Tinman: SS-SALSA-2 • Ironman: SS-SALSA-3 • Recap: Comparing old & new • Breakdown by query specificity • Related work • Critique
The problem we are addressing • Hyperlinks are a valuable feature for ranking web search results • Combined with many other features (text, traffic) • Known query-dependent link-based ranking algorithms (SALSA & variants) provide a better signal than known query-independent ones (PageRank, in-degree) • But: SALSA requires substantial query-time work; PageRank etc. are pre-computed • Can we pre-compute SALSA while preserving the signal?
Our experimental framework • Large web graph • 464 million crawled pages • 2.9 billion distinct URLs • 17.7 billion distinct edges • Large test set • 28,043 queries (sampled from Live Search logs) • 66.8 million result URLs (~2838/query) • 485,656 judgments (~17.3/query); six-point scale • Standard performance measures: MAP, MRR, NDCG • Same data & measures as used in other work (SIGIR 2007, CIKM 2007, WAW 2007, WSDM 2009)
Previous results on this data set • See CIKM 2007 (for SALSA) and SIGIR 2007 (all other results)
Some notation • Web graph G = (V, E), E ⊆ V × V (eliminating intra-domain edges from E) • URLs u, v, w ∈ V • Parent/in-linker set I(v) = { u ∈ V : (u,v) ∈ E } • Children/out-linker set O(u) = { v ∈ V : (u,v) ∈ E } • Result set R ⊆ V of a query q
Random vs. consistent sampling • U_n(X) denotes a uniform random sample of n elements from X • C_n(X) denotes a consistent sample of n elements from X • Properties of consistent sampling: • Deterministic • Unbiased • Preserves set similarity
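The slide does not say how a consistent sample is computed. One common way to get the listed properties is to hash every element and keep the n elements with the smallest hash values; the sketch below assumes that approach. The name `consistent_sample` and the use of SHA-256 are illustrative choices, not taken from the talk.

```python
import hashlib

def consistent_sample(items, n):
    """Keep the n items with the smallest hash values (all items if fewer than n).

    Assumption: a fixed hash function stands in for the random choice, so the
    same input set always yields the same sample.
    """
    def hash_key(item):
        return hashlib.sha256(str(item).encode("utf-8")).hexdigest()
    return set(sorted(set(items), key=hash_key)[:n])

# Two similar sets: y is x plus two extra elements.
x = {f"url{i}" for i in range(100)}
y = x | {"extra1", "extra2"}
sample_x = consistent_sample(x, 10)
sample_y = consistent_sample(y, 10)
# Every element of y's sample that also lies in x is in x's sample as well,
# which is why similar sets end up with heavily overlapping samples.
```

With uniform random sampling, two draws from nearly identical sets would usually share few elements; with the hash-based scheme the overlap is large by construction.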
SALSA algorithm (Lempel & Moran 2000) • Input: Web graph (V,E); result set R of query q • Form neighborhood graph (B,N): • Expand R to base set B by including all children and n parents (sampled uniformly at random) of each result in R: B = R ∪ ⋃_{v ∈ R} ( O(v) ∪ U_n(I(v)) ) • Neighborhood edge set N includes all edges in E with both endpoints in B: N = { (u,v) ∈ E : u ∈ B and v ∈ B }
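The neighborhood-graph construction described on this slide can be sketched as follows. This is a minimal illustration on an in-memory edge list; the function name `neighborhood_graph`, the `seed` parameter, and the toy graph are assumptions made for the example.

```python
import random

def neighborhood_graph(edges, results, n, seed=0):
    """Return (base_set, neighborhood_edges) for a result set.

    edges:   iterable of (u, v) pairs meaning "u links to v"
    results: the result set R
    n:       number of parents sampled uniformly at random per result
    """
    edges = list(edges)
    rng = random.Random(seed)  # fixed seed only to make the example repeatable
    parents, children = {}, {}
    for u, v in edges:
        children.setdefault(u, set()).add(v)
        parents.setdefault(v, set()).add(u)

    base = set(results)
    for r in results:
        base |= children.get(r, set())               # all children
        in_linkers = sorted(parents.get(r, set()))
        base |= set(rng.sample(in_linkers, min(n, len(in_linkers))))  # n parents

    # Keep only edges whose endpoints both lie in the base set.
    nbr_edges = {(u, v) for u, v in edges if u in base and v in base}
    return base, nbr_edges

toy_edges = [("a", "r"), ("b", "r"), ("c", "r"), ("r", "x"), ("x", "y"), ("a", "x")]
base_small, edges_small = neighborhood_graph(toy_edges, ["r"], n=1)
base_full, edges_full = neighborhood_graph(toy_edges, ["r"], n=5)
```

In the toy graph, `y` is never added because it is a child of `x`, not of the result `r`.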
SALSA algorithm: Authority scores • Form two matrices based on (B,N): inverse-indegree matrix I and inverse-outdegree matrix O • Authority score vector = principal eigenvector of IᵀO
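One way to approximate the authority scores without building the matrices is power iteration over the two-step random walk they encode: step backward along a random in-link, then forward along a random out-link. The sketch below assumes a uniform starting distribution over vertices with at least one in-link and a fixed iteration count; both choices, and the name `salsa_authority`, are illustrative. When the graph has several connected components, the result depends on the starting distribution.

```python
def salsa_authority(vertices, edges, iterations=50):
    """Approximate SALSA authority scores on a small graph by power iteration."""
    edges = set(edges)
    indeg = {v: 0 for v in vertices}
    outdeg = {v: 0 for v in vertices}
    for u, v in edges:
        outdeg[u] += 1
        indeg[v] += 1

    authorities = [v for v in vertices if indeg[v] > 0]
    if not authorities:
        return {v: 0.0 for v in vertices}
    score = {v: (1.0 / len(authorities) if indeg[v] > 0 else 0.0) for v in vertices}

    for _ in range(iterations):
        # Backward step: each authority splits its score evenly among its parents.
        hub = {v: 0.0 for v in vertices}
        for u, v in edges:
            hub[u] += score[v] / indeg[v]
        # Forward step: each parent splits what it received evenly among its children.
        nxt = {v: 0.0 for v in vertices}
        for u, v in edges:
            nxt[v] += hub[u] / outdeg[u]
        score = nxt
    return score

example_scores = salsa_authority({"a", "b", "c", "d"},
                                 {("a", "c"), ("b", "c"), ("a", "d")})
```

In this connected toy graph the scores converge to values proportional to in-degree: `c` has two in-links and `d` has one, so `c` ends up with twice the score of `d`.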
CS-SALSA • “Consistent-sampling SALSA” (CS-SALSA) • Identical to standard SALSA, except: • Sample in-linkers as well as out-linkers • using consistent sampling (as opposed to random) • Two free sampling parameters a and b • What are the best settings?
Effectiveness of CS-SALSA (NDCG@10) • CS-SALSA(2,1) is more effective than standard SALSA (whose NDCG@10 was 0.158)
Basic ideas of “Singleton-seed SALSA” • Offline (at indexing time): • Pretend that each v ∈ V is a singleton result set • Form neighborhood graph around {v} • Compute SALSA scores on that graph • Online (at query time): • Look up pre-computed scores of each v ∈ R and use them
Strawman: SS-SALSA-0 • Offline: • Input: Web graph (V,E), sampling parameters a, b • Output: Score map g: V → ℝ • For each v ∈ V: • Assume R = {v} and fix neighborhood graph (B,N) as in CS-SALSA • Compute SALSA scores s[u] for each u ∈ B • Set g[v] := s[v] • Online, given result set R and score map g: • For each u ∈ R: Assign score g[u]
Effectiveness of SS-SALSA-0 (NDCG@10) • Computed offline, looking up one score per result at query time (like in-degree, PageRank) • Substantially less effective than PageRank and in-degree
Woodman: SS-SALSA-1 • Offline: • Input: Web graph (V,E), sampling parameters a, b • Output: Score map g: V → (V → ℝ) • For each v ∈ V: • Assume R = {v} and fix neighborhood graph (B,N) as in CS-SALSA • Compute SALSA scores s[u] for each u ∈ B; s[u] = 0 for u ∉ B • Set g[v] := s (which is of type V → ℝ) • Online, given result set R and score map g: • For each u ∈ R: Assign score
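The formula for the online score did not survive in this transcript. The sketch below assumes, and this is only an assumption, that a result's score is the sum of the entries it receives in the score maps of all results in R, with missing entries counted as zero. The name `ss_salsa1_online` and the toy score map are hypothetical.

```python
def ss_salsa1_online(results, score_map):
    """Combine per-seed score maps at query time.

    score_map[v] is the map computed offline for seed v (vertex -> score).
    Assumed rule: score(u) = sum over v in results of score_map[v].get(u, 0).
    """
    return {
        u: sum(score_map.get(v, {}).get(u, 0.0) for v in results)
        for u in results
    }

toy_map = {
    "p": {"p": 0.5, "q": 0.5},
    "q": {"q": 1.0},
    "r": {"r": 0.25, "p": 0.75},
}
online_scores = ss_salsa1_online(["p", "q", "r"], toy_map)
```

Under this rule a result is rewarded when it appears in the neighborhoods of other results, which is the query-dependent signal a single pre-computed score per page cannot capture.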
Effectiveness of SS-SALSA-1 (NDCG@10) • Looking up |B| (≤ a+b+1) scores per result at query time • More effective than PageRank; less effective than CS-SALSA • Better to sample no parents, more children • Counter-intuitive when viewing hyperlinks as endorsements
Tinman: SS-SALSA-2 • Same as SS-SALSA-1, except that the offline step uses a modified definition of B • Sample a parents and b children of the “result” (the seed vertex) as before • Additionally, include c children (“siblings”) of each sampled parent, and d parents (“mates”) of each sampled child • So, SS-SALSA-2 has four free parameters a, b, c, d • Neighborhood graph and score maps are potentially much larger
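The enlarged base set can be sketched as below. The sampler is a hash-based stand-in for consistent sampling (an assumption, since the slides do not specify the mechanism), and `ss_salsa2_base_set` plus the toy adjacency maps are hypothetical names for the example. By construction the base set holds at most 1 + a + b + a·c + b·d vertices.

```python
import hashlib

def consistent_sample(items, n):
    """Hash-based stand-in for consistent sampling: keep the n smallest hashes."""
    def hash_key(item):
        return hashlib.sha256(str(item).encode("utf-8")).hexdigest()
    return set(sorted(set(items), key=hash_key)[:n])

def ss_salsa2_base_set(seed, parents, children, a, b, c, d):
    """Base set around a single seed vertex.

    parents[v] / children[v] are the in-linker / out-linker sets of v.
    """
    sampled_parents = consistent_sample(parents.get(seed, ()), a)
    sampled_children = consistent_sample(children.get(seed, ()), b)
    base = {seed} | sampled_parents | sampled_children
    for p in sampled_parents:
        base |= consistent_sample(children.get(p, ()), c)   # siblings
    for ch in sampled_children:
        base |= consistent_sample(parents.get(ch, ()), d)   # mates
    return base

toy_parents = {"v": {"p1", "p2"}, "c1": {"v", "m1", "m2"}}
toy_children = {"v": {"c1"}, "p1": {"v", "s1"}, "p2": {"v", "s2"}}
mates_only = ss_salsa2_base_set("v", toy_parents, toy_children, a=0, b=1, c=0, d=5)
siblings_only = ss_salsa2_base_set("v", toy_parents, toy_children, a=2, b=1, c=5, d=0)
```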
Effectiveness of SS-SALSA-2 • Effectiveness increases monotonically as b (number of sampled children per result) is increased • Increases further as d (number of sampled mates per sampled child) is increased • Setting a (number of sampled parents per result) to 0 is best; other values are fairly indistinguishable • SS-SALSA-2(0,,0,75) has NDCG@10 of 0.157 • Compared to 0.182 for CS-SALSA(2,1) • Huge space cost: ~7500 scores for every page in the corpus!
Ironman: SS-SALSA-3 • Idea: Bound size of score map • For every seed vertex v: • Fix neighborhood vertex set B and compute scores s in the same way as in SS-SALSA-2 • Set g[v] := top_k(s), the vertex-to-score mapping of the k highest-scoring vertices in B • Note that v itself might not be part of top_k(s) • SS-SALSA-3 has five free parameters a, b, c, d, k
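The truncation step is a plain top-k selection over the score map. A minimal sketch, with `top_k` as an illustrative helper name and tie-breaking left unspecified:

```python
import heapq

def top_k(scores, k):
    """Return the vertex-to-score mapping of the k highest-scoring vertices."""
    return dict(heapq.nlargest(k, scores.items(), key=lambda entry: entry[1]))

seed_scores = {"v": 0.1, "a": 0.5, "b": 0.3, "c": 0.05}
truncated = top_k(seed_scores, 2)
# The seed "v" is dropped here because two other vertices outscore it.
```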
Effectiveness of SS-SALSA-3 • Fixed a=0, b=, c=0, d=75 • SS-SALSA-3 outperforms PageRank starting at two-entry score maps
Recap: Comparing algorithms • [Comparison chart not recovered; its legend marks algorithms as "new, all-online" or "new, pre-computed"]
Breakdown by query specificity • How do SALSA variants, PageRank, and BM25F perform for different classes of queries? • Different ways to classify queries: • Informational, navigational, transactional (Broder’s taxonomy) • Commercial vs. non-commercial intent • General vs. specific • How to measure specificity? • Ideally, by size of result set • Approximation: Sum of IDFs of query terms
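The specificity approximation can be sketched as a sum of inverse document frequencies. The slide does not state which IDF variant was used; the code assumes the basic form log(N / df) and skips terms with no recorded document frequency. The name `query_specificity` and the toy counts are hypothetical.

```python
import math

def query_specificity(query_terms, doc_freq, num_docs):
    """Sum of IDFs of the query terms, with IDF(t) assumed to be log(N / df(t))."""
    return sum(
        math.log(num_docs / doc_freq[term])
        for term in query_terms
        if doc_freq.get(term, 0) > 0   # assumption: unseen terms contribute nothing
    )

toy_doc_freq = {"the": 1000, "salsa": 10}
general = query_specificity(["the"], toy_doc_freq, num_docs=1000)
specific = query_specificity(["the", "salsa"], toy_doc_freq, num_docs=1000)
```

A term that occurs in every document contributes zero, so queries made of rare terms score as more specific, mirroring the intuition that they match smaller result sets.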
Breakdown by query specificity • CS-SALSA >> SS-SALSA-* for general queries • SS-SALSA-3 as good as SS-SALSA-2 for general queries
Related work • The quest for correct information on the web: hyper search engines (Marchiori 1997) • The PageRank citation ranking: Bringing order to the web (Page, Brin, Motwani, Winograd 1998) • Authoritative sources in a hyperlinked environment (Kleinberg 1998) • The Stochastic Approach for Link-Structure Analysis (SALSA) and the TKC Effect (Lempel & Moran 2000) • Using Bloom filters to speed up HITS-like ranking algorithms (Gollapudi, Najork, Panigrahy 2007) • Less is More: Sampling the neighborhood graph makes SALSA better and faster (Najork, Gollapudi, Panigrahy 2009)
Critique • Data sets not publicly available • “I have a serious problem with the data set used by the authors. It is large, apparently well built, and not publicly available. There is by now stream of papers using these data and making strong claims about the effectiveness of all ranking methods for the web at major conferences; for these papers no claim can be confirmed or evaluated.” (anonymous WSDM 2009 reviewer) • Plan to repeat using standard collections
Critique • Issues with data sets: • Web graph is old • Small fraction of results are judged • Intersection between graph & results is modest • As noted for the data-availability critique: plan to repeat using a public collection • Examined only effectiveness of isolated features • Linear combination with BM25F still improves over PageRank & BM25F, but improvement is much smaller • Use better methods for combining evidence? • Good point on speed/quality curve? • You be the judge …