VLDB 2013 Riva del Garda

Efficient Error-tolerant Query Autocompletion
• Chuan Xiao¹, Jianbin Qin², Wei Wang², Yoshiharu Ishikawa¹, Koji Tsuda³, Kunihiko Sadakane⁴
• 1: Nagoya University, Japan
• 2: University of New South Wales, Australia
• 3: AIST and JST ERATO, Japan
• 4: NII, Japan
• Presenter: Jianbin Qin (jqin@cse.unsw.edu.au)

Motivation: Error-tolerant Query Suggestion

Motivation: Error-tolerant Code Completion

Preliminary: Edit Distance Prefix Search
• Target string set S = {s1, s2, …, sn}.
• Edit distance threshold τ.
• User query string q.
• Return the result set R containing all strings s ∈ S such that ∃ s′ ≼ s with ed(s′, q) ≤ τ, where s′ ≼ s means that s′ is a prefix of s.
• Example: τ = 1, q = “abc”, S = {“acdefg”, “cda”, …}. Then R = {“acdefg”}, as ed(“abc”, “ac”) = 1 ≤ τ.
• Core challenges: the string set S is usually very large, and query response time is critical.
[Figure: a mobile phone or browser sends the query q to the edit distance prefix searcher, which uses an index over the target string set S and returns R.]
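The definition above can be checked directly with a brute-force scan. The following is a minimal sketch for illustration only, not the indexed method of the talk; the function names `edit_distance` and `prefix_search` are hypothetical, and S is restricted to the two strings listed in the example.

```python
def edit_distance(a: str, b: str) -> int:
    """Classic dynamic-programming (Levenshtein) edit distance."""
    prev = list(range(len(b) + 1))  # distances from "" to each prefix of b
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # delete ca
                           cur[j - 1] + 1,             # insert cb
                           prev[j - 1] + (ca != cb)))  # match or substitute
        prev = cur
    return prev[-1]


def prefix_search(strings, q, tau):
    """Return every s that has some prefix s' with ed(s', q) <= tau."""
    return [s for s in strings
            if any(edit_distance(s[:k], q) <= tau for k in range(len(s) + 1))]


print(prefix_search(["acdefg", "cda"], "abc", 1))  # ['acdefg']
```

This scan touches every prefix of every string, which is why the two core challenges (large S, tight response time) call for an index.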

Existing Methods (CK09, JLLF09)
• Directly index the string set S into a trie.
• Simulate the edit distance calculation while traversing the trie.
• Example: τ = 1. As the user types q = “”, q = “a”, q = “ab”, q = “abc”, the trie nodes are marked as ED = 0, ED = 1, or ED > 1.
• Drawback: too many nodes are tracked during the process, O(|Σ|^τ).
[Figure: trie index built over the strings, with leaves S1, S2, S4.]
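One way to picture "simulating the edit distance calculation while traversing the trie" is to carry one dynamic-programming row per trie node. The sketch below is an assumption-laden illustration: it recomputes the rows from scratch for a complete query, whereas the cited methods maintain the active nodes incrementally per keystroke. The names `build_trie` and `fuzzy_prefix_search` and the dictionary node layout are hypothetical.

```python
def build_trie(strings):
    """Plain trie; every node records the ids of the strings passing through it."""
    root = {"children": {}, "ids": set()}
    for sid, s in enumerate(strings):
        node = root
        node["ids"].add(sid)
        for ch in s:
            node = node["children"].setdefault(ch, {"children": {}, "ids": set()})
            node["ids"].add(sid)
    return root


def fuzzy_prefix_search(root, q, tau):
    """Ids of strings having a prefix within edit distance tau of q."""
    results = set()
    # row[j] = ed(prefix spelled by the node, q[:j]); for the root this is j
    stack = [(root, list(range(len(q) + 1)))]
    while stack:
        node, row = stack.pop()
        if row[-1] <= tau:       # node is "active": its prefix matches q
            results |= node["ids"]
            continue
        if min(row) > tau:       # no descendant can get back within tau
            continue
        for ch, child in node["children"].items():
            new_row = [row[0] + 1]
            for j in range(1, len(q) + 1):
                cost = 0 if q[j - 1] == ch else 1
                new_row.append(min(new_row[j - 1] + 1,   # skip a query char
                                   row[j] + 1,           # skip the trie char
                                   row[j - 1] + cost))   # match or substitute
            stack.append((child, new_row))
    return results


trie = build_trie(["acdefg", "cda"])
print(fuzzy_prefix_search(trie, "abc", 1))  # {0}
```

Even with pruning, every node whose row still contains a value within τ must be visited, which is the tracking overhead the slide names as the drawback.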

Contribution: Trading Space for Performance
• We offer another option: trade space for runtime performance.
• Error-tolerant prefix searcher: up to 1000× faster. Index: up to 20× larger.
• Transform the edit prefix search problem into an exact prefix search problem.
• Build a deletion variants trie.
• One server can serve up to 1000 times more users simultaneously.

Deletion Variants [T. Bocek et al. 2007]
• Deletion neighborhood generation for s = abcd: the 2-variant family of s, V(s, 2).
  0-variants: abcd {}
  1-variants: bcd {1}, acd {2}, abd {3}, abc {4}
  2-variants: cd {1,1}, bd {1,2}, bc {1,3}, ad {2,2}, ac {2,3}, ab {3,3}
• ⟨x, Dx⟩ is called a variant-list pair; Dx is the deletion list.
• V(s, k) is the union of the 0- to k-variant-list pairs, called the k-variant family of s.
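The listing above can be reproduced with a few lines of code. This is a sketch under one assumption read off the example: each entry of a deletion list is a 1-based position in the string as it stands after the earlier deletions, and positions are non-decreasing (so cd is {1,1}, not {1,2}). The name `variant_family` is hypothetical.

```python
def variant_family(s, k):
    """Return V(s, k) as a list of (variant, deletion_list) pairs."""
    family = [(s, ())]
    frontier = [(s, ())]
    for _ in range(k):
        next_frontier = []
        for x, dels in frontier:
            start = dels[-1] if dels else 1   # keep positions non-decreasing
            for p in range(start, len(x) + 1):
                # delete the p-th character (1-based) of the current variant
                next_frontier.append((x[:p - 1] + x[p:], dels + (p,)))
        family.extend(next_frontier)
        frontier = next_frontier
    return family


for variant, deletions in variant_family("abcd", 2):
    print(variant, deletions)
print(len(variant_family("abcd", 2)))  # 11
```

For a string of length n the family has C(n,0) + C(n,1) + … + C(n,k) pairs, which is the source of the space overhead of the approach.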

Deletion Variants Matching Principle
• s = abcd, V(s, 2): abcd {}, bcd {1}, acd {2}, abd {3}, abc {4}, cd {1,1}, bc {1,3}, ad {2,2}, bd {1,2}, ac {2,3}, ab {3,3}
• q = abxd, V(q, 2): abxd {}, bxd {1}, axd {2}, abd {3}, abx {4}, xd {1,1}, bx {1,3}, ad {2,2}, bd {1,2}, ax {2,3}, ab {3,3}
• Variants matching principle: given two strings s and t, ED(s, t) ≤ τ iff there exist ⟨x, Dx⟩ ∈ V(s, τ) and ⟨y, Dy⟩ ∈ V(t, τ) such that x = y and |Dx ∪ Dy| ≤ τ.
• Two conditions must be satisfied:
  x = y: identical check (processed efficiently with an index).
  |Dx ∪ Dy| ≤ τ: deletion list union size check (no efficient method).

One More Enumeration
• s = abcd, V(s, 2): abcd {}, bcd {1}, acd {2}, abd {3}, abc {4}, cd {1,1}, bc {1,3}, ad {2,2}, bd {1,2}, ac {2,3}, ab {3,3}
• q = abxd, V(q, 2): abxd {}, bxd {1}, axd {2}, abd {3}, abx {4}, xd {1,1}, bx {1,3}, ad {2,2}, bd {1,2}, ax {2,3}, ab {3,3}
• q = abxd, enumerated 2-variant family of q, EnumV(q, 2):
  from abxd {}: abxd {}, abxd {1}, abxd {2}, ……, abxd {3,4}, abxd {4,4}, ……
  from abd {3}: abd {}, abd {1}, abd {2}, ……, abd {3}, abd {3,3}, ……
  from ab {3,3}: ab {}, ab {3}, ab {3,3}
• Enumerated variants matching principle: given two strings s and q, ED(s, q) ≤ τ iff there exist ⟨x, Dx⟩ ∈ V(s, τ) and ⟨y, Dy⟩ ∈ EnumV(q, τ) such that x = y and Dx = Dy.

Encode and Build Index
• We then encode ⟨x, Dx⟩ together by marking each deleted position with #: s = “abcd” ⇒ {abcd, #bcd, a#cd, ab#d, abc#, ##cd, #b#d, …}. For example, bd {1,2} is encoded as #b#d.
• Generate the variants of every string and index them in a trie.
[Figure: trie over the encoded variants, with leaves labeled by the string IDs S1, S2, S3.]
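The encoding and indexing step can be sketched as follows. This is an illustrative simplification, not the IncNGTrie construction itself: it assumes `#` never occurs in the data, stores the string ids only at the node where each encoded variant ends, and uses hypothetical names (`marked_variants`, `build_variant_trie`, `exact_prefix_lookup`) and a hypothetical second data string "xyz".

```python
from itertools import combinations


def marked_variants(s, tau):
    """Encoded variants of s: up to tau positions replaced by '#'."""
    out = set()
    for k in range(min(tau, len(s)) + 1):
        for positions in combinations(range(len(s)), k):
            chars = list(s)
            for p in positions:
                chars[p] = "#"     # mark the deleted position
            out.add("".join(chars))
    return out


def build_variant_trie(strings, tau):
    """Trie over all encoded variants; '$ids' holds the owning string ids."""
    root = {}
    for sid, s in enumerate(strings):
        for v in marked_variants(s, tau):
            node = root
            for ch in v:
                node = node.setdefault(ch, {})
            node.setdefault("$ids", set()).add(sid)
    return root


def exact_prefix_lookup(root, prefix):
    """Exact prefix search: ids of all variants that start with prefix."""
    node = root
    for ch in prefix:
        if ch not in node:
            return set()
        node = node[ch]
    found, stack = set(), [node]
    while stack:
        n = stack.pop()
        for key, val in n.items():
            if key == "$ids":
                found |= val
            else:
                stack.append(val)
    return found


trie = build_variant_trie(["abcd", "xyz"], 2)
print(sorted(marked_variants("abcd", 1)))  # ['#bcd', 'a#cd', 'ab#d', 'abc#', 'abcd']
print(exact_prefix_lookup(trie, "ab#d"))   # {0}
```

With the deletion marks folded into the indexed strings, a lookup needs only character-by-character descent, which is the sense in which the problem becomes an exact prefix search.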

Enumerated Variant Size Is Very Large
• q = abc, τ = 2
• Variants of q: abc, #bc, a#c, ab#, ##c, #b#, a##
• Enumerated variants generated from each variant:
  abc: abc, #abc, a#bc, ab#c, abc#, ##abc, #a#bc, #ab#c, #abc#, a##bc, a#b#c, a#bc#, ab##c, ab#c#, abc##
  #bc: bc, #bc, ##bc, #b#c, #bc#
  a#c: ac, a#c, #a#c, a##c, a#c#
  ab#: ab, ab#, #ab#, a#b#, ab##
  ##c: c, #c, ##c
  #b#: b, #b, b#, #b#
  a##: a, a#, a##

Adaptive enumeration

Adaptive Enumeration: Full Example
• τ = 1
• q = “”: EnumV = {ξ, #}
• q = “a”: EnumV = {a, a#, #}
• q = “ab”: EnumV = {ab, ab#, a#, #b}
• q = “abc”: EnumV = {abc, abc#, ab#c, ab#, a#c, #bc}
• Bound shown on the slide: O(τ · (|q| + τ)^τ)
[Figure: trie over the encoded variants, with leaves labeled by the string IDs S1, S2, S3.]

Experiments
• Dataset: DBLP, 351,207 terms, average length 8, |Σ| = 27.
• The prefix length is the query length.
• The time and size are all interval count, averaged over 1000 queries.
• Edit distance threshold τ = 3.
• IncNGTrie: our algorithm. ICAN and ICPAN: previous direct trie methods.

Experiments
• DirectTrie: the original trie.
• NoReduction: IncNGTrie before compression.
• StringMerge: merge branches reaching the same string.
• SubtreeMerge: merge subtrees with identical content.
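The idea behind SubtreeMerge, sharing subtrees with identical content, can be illustrated with a generic bottom-up canonicalization. This is a hedged sketch of that general technique and may differ from the compression actually used for IncNGTrie; the dictionary node layout, the `$ids` key, and the names `build_trie`, `count_nodes`, and `merge_subtrees` are assumptions made for the example.

```python
def build_trie(words):
    """words: iterable of (encoded_variant, string_id) pairs."""
    root = {}
    for w, sid in words:
        node = root
        for ch in w:
            node = node.setdefault(ch, {})
        node.setdefault("$ids", set()).add(sid)
    return root


def count_nodes(root):
    """Number of distinct node objects reachable from root."""
    seen, stack = set(), [root]
    while stack:
        n = stack.pop()
        if id(n) in seen:
            continue
        seen.add(id(n))
        stack.extend(v for k, v in n.items() if k != "$ids")
    return len(seen)


def merge_subtrees(node, registry):
    """Replace every subtree by one shared copy per distinct content."""
    for key in list(node):
        if key != "$ids":
            node[key] = merge_subtrees(node[key], registry)
    # Children are already canonical, so their object ids identify content.
    signature = (
        tuple(sorted(node.get("$ids", ()))),
        tuple(sorted((k, id(v)) for k, v in node.items() if k != "$ids")),
    )
    return registry.setdefault(signature, node)


trie = build_trie([("abcd", 0), ("a#cd", 0), ("#bcd", 0)])
print(count_nodes(trie))                  # 12
trie = merge_subtrees(trie, {})
print(count_nodes(trie))                  # 6
```

In this toy input the three branches ending in "cd" for the same string collapse into one shared chain, turning the trie into a directed acyclic graph while leaving lookups unchanged.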

Conclusion
• An alternative way to solve the edit prefix search problem.
• Our method is independent of the character set size.
• Up to 1000 times improvement in query performance.
• A data-adaptive enumeration method.

Q & A

Preliminary: Edit Distance Prefix Search
• The core component is the prefix edit similarity search.
• A string Q t-edit prefix matches another string S if there exists a prefix of S whose edit distance to Q is within t.
• R = {s | s ∈ S, ∃ s′ ∈ P(s) such that ed(s′, Q) ≤ t}, where P(s) denotes all the prefixes of s.
• Example: if t = 1, Q = “abc” t-edit prefix matches “acdefghtijk”, as “ac” is a prefix of “acdefghtijk” and ed(Q, “ac”) ≤ 1.
[Figure: the user client sends Q to the fuzzy prefix searcher, which uses an index over the target string set; a result ranker returns R.]

Previous Ideas
• Index the data strings into a trie (radix tree).
• Exact prefix match.
• t-edit prefix match.
[Figure: trie with nodes 0 to 12 over the data strings, leaves S1, S2, S3, queried with Q = “p”.]

Adaptive Enumeration: Full Example (with subscripts)
• t = 1
• q = “”: EnumV = {ξ₁, #₁, #₂}
• q = “a”: EnumV = {a₂, a#₂, a#₃, #₂}
• q = “ab”: EnumV = {ab₃, ab#₃, ab#₄, a#₃, #b₃}
• q = “abc”: EnumV = {abc₄, abc#₄, abc#₅, ab#c₄, ab#₄, a#c₄, #bc₄}
[Figure: trie over the encoded variants, with leaves labeled by the string IDs S1, S2, S3.]

Previous Ideas Cont’d
• Index the data strings into a trie (radix tree).
• Keep active nodes while traversing the tree.
• For each query character Q[i] entered, traverse the trie and incrementally maintain all the nodes n such that ed(n, Q[1..i]) ≤ t (also called active nodes/states).
[Figure: the same trie over s1, s2, s3, with nodes 0 to 9, shown for Q = Ø and for Q = “p”.]

Our Basic Ideas
• Embed the second condition into the first condition and process it efficiently with an index.
• s = “abcd”
  0-variant list = {<abcd>}
  1-variant list = {<#bcd>, <a#cd>, <ab#d>, <abc#>}
  2-variant list = {<##cd>, <#b#d>, <#bc#>, …}
• q = “abxd”
  0-variant list = {<abxd, {}>}
  1-variant list = {<bxd, {1}>, <axd, {2}>, <abd, {3}>, …}
  2-variant list = {<xd, {1,1}>, <bd, {1,2}>, <bx, {1,3}>, …}

Previous Ideas: Exact Prefix Match
• Index the data strings into a trie (radix tree).
• Extended from exact prefix search methods: directly index the strings S into a trie and find the node that exactly matches the query q.
• Example: the user types q = “”, q = “a”, q = “ab”, q = “abc”.
[Figure: trie with nodes 0 to 12 built by directly indexing the strings, with leaves S1, S2, S3, S4.]

Previous Ideas: Fuzzy Prefix Match
• Index the data strings into a trie (radix tree) by directly indexing the strings S.
• Simulate the edit distance calculation while traversing the trie.
• Example: t = 1. As the user types q = “”, q = “a”, q = “ab”, q = “abc”, the trie nodes are marked as ED = 0, ED = 1, or ED > 1.
• Drawback: too many nodes are tracked during the process.
[Figure: trie with nodes 0 to 12 built by directly indexing the strings, with leaves S1, S2, S3, S4.]


Data Dependent Enumeration Cont’d
[Figure: trie over the encoded variants, with leaves labeled by the string IDs S1, S2, S3.]

Conclusion

Data Dependent Enumeration Cont’d
• K-matching variant: given two i-deletion-marked variants (0 ≤ i ≤ k) x and y, if y has the same string content as x (not counting the mark symbol) and the size of the union of their deletion-position lists is ≤ k, then y is called a k-matching variant of x.
• Problem transformation: cb, c#b, #cb, cb#, c##b, #c#b, c#b# c#b

Result Fetching
• K-matching variant: given two i-deletion-marked variants (0 ≤ i ≤ k) x and y, if y has the same string content as x (not counting the mark symbol) and the size of the union of their deletion-position lists is ≤ k, then y is called a k-matching variant of x.
• Problem transformation: cb, c#b, #cb, cb#, c##b, #c#b, c#b# c#b

Motivation: Fuzzy Instance Search
