Supporting Efficient Top-k Queries in Type-Ahead Search. Guoliang Li (1), Jiannan Wang (1), Chen Li (2), Jianhua Feng (1). (1) Tsinghua University, (2) UC Irvine, Bimaple Technology Inc. SIGIR 2012, Portland, Oregon. Query suggestions. Type-ahead search (instant search).
Supporting Efficient Top-k Queries in Type-Ahead Search • Guoliang Li (1), Jiannan Wang (1), Chen Li (2), Jianhua Feng (1) • (1) Tsinghua University • (2) UC Irvine, Bimaple Technology Inc. • SIGIR 2012, Portland, Oregon
Query suggestions Tsinghua/UC Irvine/Bimaple
Type-ahead search (instant search) • Finding answers instantly!
Fuzzy search • Demo: ipubmed.ics.uci.edu
Advantages of instant fuzzy search • Save time • Correct errors • Mobile friendly (fat fingers!)
Challenges • Speed • “100ms rule” • Prefix matching • Fuzzy matching • Quality
Contributions • Techniques for computing top-k answers in instant fuzzy search without generating all candidates • Ranking framework • Index structures • Algorithms • Experimental evaluation
Outline • Problem Formulation • Instant exact search • Instant fuzzy search • Experiments
Problem Formulation • Data: records • Query: keywords w1, w2, …, wm, where the last keyword wm is a partial keyword (a prefix) • Answers: the k best records • Example query: graph icde li, with li treated as a prefix
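Ranking Framework • Query: graph icde li • Record keywords: graph, gray, gross, icde, lin, liu • Complete keywords contribute Score(graph) and Score(icde) • The prefix li contributes the Max of Score(lin) and Score(liu) • Aggregate the per-keyword scores into the record's score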
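The ranking framework above can be illustrated with a minimal sketch. It assumes sum as the aggregate function and uses made-up per-keyword scores; the function name `record_score` and the data layout are hypothetical, not taken from the paper.

```python
from typing import Dict, List


def record_score(query: List[str], record: Dict[str, float]) -> float:
    """Score one record for a type-ahead query.

    Assumptions (not from the slides): sum is the aggregate, and
    `record` maps each keyword of the record to its score.
    """
    *complete, prefix = query
    total = 0.0
    for kw in complete:
        if kw not in record:
            return 0.0  # a complete keyword is missing: no match
        total += record[kw]
    # The last keyword is a prefix: keep its best-scoring completion (Max).
    completions = [s for kw, s in record.items() if kw.startswith(prefix)]
    if not completions:
        return 0.0
    return total + max(completions)


# Hypothetical scores for the record shown on the slide.
record = {"graph": 3.0, "gray": 1.0, "gross": 2.0,
          "icde": 4.0, "lin": 2.0, "liu": 5.0}
print(record_score(["graph", "icde", "li"], record))  # 12.0
```

Here the prefix li matches both lin and liu, and only the larger of the two scores enters the sum.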
Index structures • Trie over the keywords • Inverted index (figure: trie over the keywords graph, gray, gross, group, icde, icdm, lin, liu, lui)
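As a rough illustration of these two structures, the sketch below builds a trie as nested dictionaries and an inverted index as a plain dictionary. The record contents, the `"$"` end marker, and the names `build_index` and `complete` are assumptions made for the example only.

```python
from collections import defaultdict


def build_index(records):
    """records: record id -> list of keywords (hypothetical layout)."""
    inverted = defaultdict(list)  # keyword -> ascending record ids
    for rid, keywords in sorted(records.items()):
        for kw in set(keywords):
            inverted[kw].append(rid)
    trie = {}
    for kw in inverted:
        node = trie
        for ch in kw:
            node = node.setdefault(ch, {})
        node["$"] = kw  # marks the end of a complete keyword
    return trie, dict(inverted)


def complete(trie, prefix):
    """Return all keywords in the trie that start with `prefix`."""
    node = trie
    for ch in prefix:
        if ch not in node:
            return []
        node = node[ch]
    found, stack = [], [node]
    while stack:
        cur = stack.pop()
        for key, child in cur.items():
            if key == "$":
                found.append(child)
            else:
                stack.append(child)
    return sorted(found)


records = {1: ["graph", "icde", "lin"],
           2: ["gray", "icdm", "liu"],
           3: ["gross", "lui", "icde"]}
trie, inverted = build_index(records)
print(complete(trie, "li"))  # ['lin', 'liu']
print(inverted["icde"])      # [1, 3]
```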
Basic Solution • Query {graph, icde, li}, k=1 • Matched keywords in the trie: graph, icde, lin, liu • Problem: too many candidates
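Optimization 1: Heap-based Method • Lists of the complete keywords icde and graph are read directly • Lists of lin and liu (completions of the prefix li) are merged through a max heap; GetMax() returns the next best entry • Aggregate the scores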
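A minimal sketch of the heap idea, assuming each inverted list is already sorted by descending score. Python's `heapq` is a min-heap, so scores are negated; the list contents and the name `merged_sorted_access` are hypothetical.

```python
import heapq


def merged_sorted_access(lists):
    """Yield (score, record_id, keyword) in descending score order.

    lists: keyword -> list of (score, record_id), each sorted by
    descending score (an assumed layout).
    """
    heap = []
    for kw, entries in lists.items():
        if entries:
            score, rid = entries[0]
            heapq.heappush(heap, (-score, rid, kw, 0))
    while heap:
        neg_score, rid, kw, pos = heapq.heappop(heap)  # plays the role of GetMax()
        yield -neg_score, rid, kw
        if pos + 1 < len(lists[kw]):
            score, next_rid = lists[kw][pos + 1]
            heapq.heappush(heap, (-score, next_rid, kw, pos + 1))


lists = {"lin": [(9, 1), (4, 7)], "liu": [(8, 3), (6, 2)]}
print(list(merged_sorted_access(lists)))
# [(9, 1, 'lin'), (8, 3, 'liu'), (6, 2, 'liu'), (4, 7, 'lin')]
```

Only one entry per list sits in the heap at any time, so the union of the lists is never materialized. A record that contains both lin and liu would appear twice in this stream, so a caller would need to deduplicate.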
Optimization 2: Top-k List-Merging Algorithm • Example: Threshold Algorithm • Sorted access on each list, random access to complete a record's score • Values shown in the figure: threshold T = 15; scores 17, 14, 12, 12 • Early termination
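The sketch below shows the general shape of the Threshold Algorithm, assuming sum as the aggregate and a score of 0 for a record missing from a list. The data is invented and does not reproduce the figure's values; `threshold_algorithm` is a hypothetical name.

```python
import heapq


def threshold_algorithm(lists, k):
    """lists: one list per query keyword of (record_id, score),
    each sorted by descending score. Returns the top-k (total, record_id)."""
    lookup = [dict(lst) for lst in lists]  # stands in for random access
    seen, top = set(), []                  # top: min-heap of (total, record_id)
    depth = max(len(lst) for lst in lists)
    for pos in range(depth):
        last_scores = []
        for lst in lists:
            if pos >= len(lst):
                last_scores.append(0)
                continue
            rid, score = lst[pos]          # sorted access
            last_scores.append(score)
            if rid not in seen:
                seen.add(rid)
                total = sum(d.get(rid, 0) for d in lookup)  # random access
                heapq.heappush(top, (total, rid))
                if len(top) > k:
                    heapq.heappop(top)
        threshold = sum(last_scores)       # best total any unseen record can reach
        if len(top) == k and top[0][0] >= threshold:
            break                          # early termination
    return sorted(top, reverse=True)


lists = [[("r1", 9), ("r2", 8), ("r3", 3)],
         [("r2", 8), ("r3", 6), ("r1", 5)]]
print(threshold_algorithm(lists, 1))  # [(16, 'r2')]
```

In this run the scan stops after the second position: the threshold has dropped to 14, below the best total of 16, so r3's entry in the first list is never read.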
Efficient Random Access: How? (figure: trie over the example keywords)
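Forward index [Ji et al. WWW’09] • Each trie node stores the range [min ID, max ID] of the keyword IDs below it: g [1,4], i [5,6], l [7,9]; gr [1,4], ic [5,6], li [7,8], lu [9,9]; gra [1,2], gro [3,4], icd [5,6]; grap [1,1], gray [2,2], gros [3,3], grou [4,4] • Keyword IDs at the leaves: graph 1, gray 2, gross 3, group 4, icde 5, icdm 6, lin 7, liu 8, lui 9 • Forward list entries: keyword ID, weight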
Random Access Using Forward Index • Probe shown in the figure: 7 ? • Figure: the trie annotated with the keyword-ID ranges of the forward index ([7,9] for l, [7,8] for li, [9,9] for lu, and so on)
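One way to read this slide: to test whether a record contains a keyword with a given prefix, look up the prefix's keyword-ID range in the trie and binary-search the record's sorted forward list for an ID inside it. The sketch below assumes forward lists are sorted lists of keyword IDs; the two example records are invented, while the range [7, 8] for li comes from the forward-index slide.

```python
from bisect import bisect_left


def has_keyword_in_range(forward_list, lo, hi):
    """forward_list: sorted keyword IDs of one record (assumed layout).
    Returns True if some ID falls within [lo, hi]."""
    i = bisect_left(forward_list, lo)  # first ID >= lo
    return i < len(forward_list) and forward_list[i] <= hi


li_range = (7, 8)          # keyword-ID range of trie node "li"
record_a = [1, 5, 8]       # graph, icde, liu
record_b = [2, 6, 9]       # gray, icdm, lui
print(has_keyword_in_range(record_a, *li_range))  # True
print(has_keyword_in_range(record_b, *li_range))  # False
```

The probe costs one binary search per query keyword, regardless of how many completions the prefix has.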
Outline • Problem Formulation • Instant exact search • Instant fuzzy search • Experiments
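Ranking Framework (Fuzzy matching) • Query: graph icde li • Record keywords: graph, gray, icdm, gross, lin, liu • Similar keywords contribute a similarity-weighted score, e.g. Sim(icde,icdm) * Score(icdm) and Sim(li,i) * Score(lin) • Max over the keywords matching each query keyword (Score(graph); Score(lin), Score(liu)) • Aggregate across the query keywords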
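A minimal sketch of the fuzzy variant, assuming sum as the aggregate and taking the similarities as given inputs rather than computing them. The scores, the similarity values, and the name `fuzzy_record_score` are hypothetical.

```python
def fuzzy_record_score(matches, record):
    """matches: one dict per query keyword, mapping a similar data
    keyword to its similarity. record: keyword -> score.
    Both layouts are assumptions for this example."""
    total = 0.0
    for similar in matches:
        best = max(
            (sim * record[kw] for kw, sim in similar.items() if kw in record),
            default=None,
        )
        if best is None:
            return 0.0  # no keyword of the record matches this query keyword
        total += best   # Max per query keyword, then sum
    return total


record = {"graph": 3.0, "gray": 1.0, "icdm": 4.0,
          "gross": 2.0, "lin": 2.0, "liu": 5.0}
matches = [
    {"graph": 1.0},
    {"icde": 1.0, "icdm": 0.5},
    {"lin": 1.0, "liu": 1.0, "lui": 0.5},
]
print(fuzzy_record_score(matches, record))  # 10.0
```

The record has no icde, so the query keyword icde is served by icdm at half weight.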
Computing Similar Prefixes [Ji et al. WWW’09] • Query {graph, icde, li}, similarity threshold τ=0.45 (figure: trie over the example keywords)
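To make "similar prefix" concrete, the sketch below computes a prefix edit distance by brute force: the smallest edit distance between the query keyword and any prefix of a data keyword. This is not the incremental trie-based method of the cited work, and it uses a plain edit-distance bound instead of the slide's similarity threshold τ, whose exact definition is not given here.

```python
def prefix_edit_distance(query, keyword):
    """Smallest edit distance between `query` and any prefix of `keyword`."""
    prev = list(range(len(keyword) + 1))  # distance from "" to keyword[:j]
    for i, qc in enumerate(query, 1):
        cur = [i]
        for j, kc in enumerate(keyword, 1):
            cur.append(min(prev[j] + 1,                # delete qc
                           cur[j - 1] + 1,             # insert kc
                           prev[j - 1] + (qc != kc)))  # match or substitute
        prev = cur
    return min(prev)  # best over all prefixes of keyword


def similar_keywords(query, keywords, max_dist):
    """Keywords having a prefix within `max_dist` edits of `query`."""
    return [w for w in keywords if prefix_edit_distance(query, w) <= max_dist]


keywords = ["graph", "gray", "gross", "group",
            "icde", "icdm", "lin", "liu", "lui"]
print(similar_keywords("li", keywords, 1))
# ['icde', 'icdm', 'lin', 'liu', 'lui']
```

With one edit allowed, li matches the prefix i (delete the l) as well as li and lu, which is why icde and icdm appear in the result.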
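Top-k Algorithm • One max heap per query keyword (li, graph, icde); GetMax() returns the next best entry of each heap • Each heap merges the lists of the keywords similar to its query keyword (lists shown in the figure: graph, icde, icdm, lin, liu, lui), with scores multiplied by the similarity (×1 or ×0.5 in the figure) • Aggregate function: sum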
Efficient Random Access (method 1) • Probing on Forward Lists • Binary search for: [5,6], [7,9], [7,8], [9,9], 7, 8, 9 (figure: the trie annotated with keyword-ID ranges)
Efficient Random Access (method 2) • Probing on Trie Leaf Nodes • Leaf nodes carry (query keyword, similarity) entries; the figure shows li, 1 on two leaves and li, 0.5 on three • Traverse the forward list of … (figure: the trie annotated with keyword-ID ranges)
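A possible reading of this method, sketched under assumptions: every leaf (keyword ID) stores the query keywords it is similar to, together with the similarity, and a random access walks the record's forward list and looks each keyword ID up among the leaves. Which leaves carry which entry is a guess from the figure residue, and `probe_leaves` is a hypothetical name.

```python
def probe_leaves(forward_list, leaf_matches):
    """forward_list: (keyword_id, weight) pairs of one record.
    leaf_matches: keyword_id -> list of (query_keyword, similarity).
    Returns the best similarity-weighted score per query keyword."""
    best = {}
    for keyword_id, weight in forward_list:
        for query_kw, sim in leaf_matches.get(keyword_id, ()):
            best[query_kw] = max(best.get(query_kw, 0.0), sim * weight)
    return best


# Assumed leaf annotations: lin (7) and liu (8) match li exactly,
# while icde (5), icdm (6) and lui (9) match with similarity 0.5.
leaf_matches = {5: [("li", 0.5)], 6: [("li", 0.5)],
                7: [("li", 1.0)], 8: [("li", 1.0)], 9: [("li", 0.5)]}
forward_list = [(1, 3.0), (5, 4.0), (8, 5.0)]  # graph, icde, liu
print(probe_leaves(forward_list, leaf_matches))  # {'li': 5.0}
```

Compared with binary-searching once per similar prefix, this makes a single pass over the forward list, which may pay off when a query keyword has many similar prefixes.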
Optimization by materializing union lists • Time/space tradeoff • Cost-based analysis for a space budget (figure: trie over the example keywords)
Outline • Problem Formulation • Instant exact search • Instant fuzzy search • Experiments
Data sets and index costs
Exact Search (DBLP) • k=10, similarity threshold τ=0.6
Fuzzy Search • DBLP, k=10, similarity threshold τ=0.6 • Methods labeled in the figure: TA, NRA
Other results (not included in the paper) • More general ranking (e.g., positional information) • Other languages • Location-based search
Conclusions • Efficient techniques for instant fuzzy search • Demo: ipubmed.ics.uci.edu
Acknowledgements • The authors have a financial interest in Bimaple Technology Inc., a company currently commercializing some of the techniques described in this publication. • Chen Li was partially supported by NIH grant 1R21LM010143-01A1. • Guoliang Li, Jiannan Wang, and Jianhua Feng were partly supported by the National Natural Science Foundation of China under Grant No. 61003004, the National Grand Fundamental Research 973 Program of China under Grant No. 2011CB302206, a project of Tsinghua University under Grant No. 20111081073, and the “NExT Research Center” funded by MDA, Singapore, under Grant No. WBS:R-252-300-001-490.