Efficient Approximate Search on String Collections Part I

Efficient Approximate Search on String CollectionsPart I Marios Hadjieleftheriou Chen Li

DBLP Author Search http://www.informatik.uni-trier.de/~ley/db/indices/a-tree/index.html

Try their names (good luck!) http://www.informatik.uni-trier.de/~ley/db/indices/a-tree/index.html UCSD Case Western AT&T--Research Yannis Papakonstantinou Meral Ozsoyoglu Marios Hadjieleftheriou

Better system? http://dblp.ics.uci.edu/authors/

People Search at UC Irvine http://psearch.ics.uci.edu/

Web Search • Errors in queries • Errors in data • Bring query and meaningful results closer together Actual queries gathered by Google http://www.google.com/jobs/britney.html

Data Cleaning R S

Problem Formulation Find strings similar to a given string: dist(Q,D) <= δ Example: find strings similar to “hadjeleftheriou” • Performance is important! • 10 ms: 100 queries per second (QPS) • 5 ms: 200 QPS

Outline • Motivation • Preliminaries • Trie-based approach • Gram-based algorithms • Sketch-based algorithms • Compression • Selectivity estimation • Transformations/Synonyms • Conclusion Part I Part II

Next… Preliminaries

Similarity Functions • Similar to: • a domain-specific function • returns a similarity value between two strings • Examples: • Edit distance • Hamming distance • Jaccard similarity • Soundex • TF/IDF, BM25, DICE • See [KSS06] for an excellent survey

A widely used metric to define string similarity Ed(s1,s2) = minimum # of operations (insertion, deletion, substitution) to change s1 to s2 Example: s1: Tom Hanks s2: Ton Hank ed(s1,s2) = 2 Edit Distance 13

Next… Gram-based algorithms • List-merging algorithms [LLL08] • Variable-length grams (VGRAM) [LWY07,YWL08]

“q-grams” of strings u n i v e r s a l 2-grams

Edit operation’s effect on grams k operations could affect k * q grams Fixed length: q u n i v e r s a l If ed(s1,s2) <= k, then their # of common grams >= (|s1|- q + 1) –k *q

id strings at ch ck ic ri st ta ti tu uc 0 1 2 3 4 rich stick stich stuck static 2-grams 4 2 3 0 1 4 3 0 3 1 0 1 2 4 4 1 2 4 2 3 q-gram inverted lists

# of common grams >= 3 id strings at ch ck ic ri st ta ti tu uc 4 0 1 2 3 4 rich stick stich stuck static 2 0 1 3 0 1 2 4 0 4 2 3 1 4 1 2 4 3 3 Searching using inverted lists • Query: “shtick”, ED(shtick, ?)≤1 sh ht ti ic ck ic ck ti 2-grams

T-occurrence Problem Merge Ascending order Find elements whose occurrences ≥ T

Example T = 4 1 3 5 10 13 10 13 15 5 7 13 13 15 Result: 13

List-Merging Algorithms HeapMerger MergeOpt [SK04] [LLL08, BK02] ScanCount MergeSkip DivideSkip

Heap-based Algorithm Push to heap …… Min-heap Count # of occurrences of each element using a heap

MergeOpt Algorithm [SK04] Binary search Long Lists: T-1 Short Lists

Example of MergeOpt 1 3 5 10 13 10 13 15 5 7 13 13 15 Long Lists: 3 Short Lists: 2 Count threshold T≥ 4

ScanCount String ids # of occurrences Increment by 1 1 2 3 … 1 0 1 3 5 10 13 10 13 15 5 7 13 13 15 0 1 0 13 4 0 Result! 14 0 15 2 0 Count threshold T≥ 4 25

List-Merging Algorithms HeapMerger MergeOpt [SK04] [LLL08, BK02] ScanCount MergeSkip DivideSkip

MergeSkip algorithm [BK02, LLL08] Pop T-1 …… Min-heap Jump Greater or equals T-1

Example of MergeSkip 1 minHeap 5 10 13 15 1 3 5 10 10 15 5 7 13 15 13 13 Jump 17 17 15 15 Count threshold T≥ 4

DivideSkip Algorithm [LLL08] Binary search MergeSkip Long Lists Short Lists

How many lists are treated as long lists?

Length Filtering Length: 10 s: By length only! Ed(s,t) ≤ 2 t: Length: 19

Positional Filtering Ed(s,t) ≤ 2 s (ab,1) t (ab,12)

A filter tree Combine filters with list-merging algorithms [LLL08]

Next… Variable-length grams (VGRAM) [LWY07,YWL08]

ati ich ick ric sta sti stu tat tic tuc uck 4 id strings id strings id strings 2 0 0 1 2 3 4 0 1 2 3 4 0 1 2 3 4 rich stick stich stuck static rich stick stich stuck static rich stick stich stuck static 1 0 4 2 1 3 4 2 1 4 3 3 2-grams -> 3-grams? • Query: “shtick”, ED(shtick, ?)≤1 sht hti tic ick ick tic # of common grams >= 1 3-grams

id strings at ch ck ic ri st ta ti tu uc 0 1 2 3 4 rich stick stich stuck static 2-grams 1 4 2 3 0 1 4 3 0 3 0 1 2 4 4 1 2 4 2 3 Observation 1: dilemma of choosing “q” • Increasing “q” causing: • Longer grams  Shorter lists • Smaller # of common grams of similar strings

Observation 2: skew distributions of gram frequencies • DBLP: 276,699 article titles • Popular 5-grams: ation (>114K times), tions, ystem, catio

VGRAM: Main idea • Grams with variable lengths (between qmin and qmax) • zebra • ze(123) • corrasion • co(5213), cor(859), corr(171) • Advantages • Reduce index size  • Reducing running time  • Adoptable by many algorithms 

Challenges • Generatingvariable-length grams? • Constructing a high-quality gram dictionary? • Relationship between string similarity and their gram-set similarity? • Adopting VGRAM in existing algorithms?

[2,4]-gram dictionary ni ivr sal uni vers Challenge 1: String  Variable-length grams? • Fixed-length 2-grams u n i v e r s a l • Variable-length grams u n i v e r s a l

Representing gram dictionary as a trie ni ivr sal uni vers

Step 2: Constructing a gram dictionary qmin=2 qmax=4 • Frequency-based [LYW07] • Cost-based [YLW08]

Challenge 3: Edit operation’s effect on grams k operations could affect k * q grams Fixed length: q u n i v e r s a l

Deletion affects variable-length grams Not affected Not affected Affected i i-qmax+1 i+qmax- 1 Deletion

Main idea • For a string, for each position, compute the number of grams that could be destroyed by an operation at this position • Compute number of grams possibly destroyed by k operations • Store these numbers (for all data strings) as part of the index Vector of s = <2,4,6,8,9> • Use this number to do count filtering With 2 edit operations, at most 4 grams can be affected

Summary of VGRAM index

Challenge 4: adopting VGRAM Easily adoptable by many algorithms Basic interfaces: • String s  grams • String s1, s2 such that ed(s1,s2) <= k  min # of their common grams

Lower bound on # of common grams Fixed length (q) If ed(s1,s2) <= k, then their # of common grams >=: (|s1|- q + 1) –k *q u n i v e r s a l Variable lengths:# of grams of s1 – NAG(s1,k)

… ck ic ich … tic tick … id strings id strings id strings … ck ic … ti … 1 3 1 3 0 1 2 3 4 0 1 2 3 4 0 1 2 3 4 rich stick stich stuck static rich stick stich stuck static rich stick stich stuck static 4 1 4 1 2 0 2 0 1 2 4 2 4 1 Example: algorithm using inverted lists • Query: “shtick”, ED(shtick, ?)≤1 tick sh ht tick 2-grams 2-4 grams Lower bound = 3 Lower bound = 1

End of part I • Motivation • Preliminaries • Trie-based approach • Gram-based algorithms • Sketch-based algorithms • Compression • Selectivity estimation • Transformations/Synonyms • Conclusion Part I Part II

Efficient Approximate Search on String Collections Part I

Efficient Approximate Search on String Collections Part I

Presentation Transcript

Efficient Approximate Search on String Collections Part II

Cost-Based Variable-Length-Gram Selection for String Collections to Support Approximate Queries Efficiently

Space-Constrained Gram-Based Indexing for Efficient Approximate String Search

Space-Constrained Gram-Based Indexing for Efficient Approximate String Search

Efficient Merging and Filtering Algorithms for Approximate String Searches

Approximate String Matching

The Flamingo Software Package on Approximate String Queries

Efficient Merging and Filtering Algorithms for Approximate String Searches

Efficient Search on Encrypted Data

Rules for Approximate String Matching

PART 8 Approximate Reasoning

Exact String Search

Efficient Merging and Filtering Algorithms for Approximate String Searches

Efficient String Matching : An Aid to Bibliographic Search

Word Hashing for Efficient Search in Document Image Collections

Approximate Boyer-Moore String Matching

Filter Algorithms for Approximate String Matching

A fast algorithm for approximate string matching on gene sequences

Approximate String Matching

Word Hashing for Efficient Search in Document Image Collections

Efficient Approximate Search on String Collections Part II

Exact String Search