Efficient Approximate Search on String Collections

Efficient Approximate Search on String Collections Marios Hadjieleftheriou Chen Li

Outline Part 1: • Motivation and preliminaries • Inverted list based algorithms Part 2: • Gram signature algorithms • Length normalized algorithms • Selectivity estimation • Conclusion and future directions

Web Search • Errors in queries • Errors in data • Bring query and meaningful results closer together Actual queries gathered by Google http://www.google.com/jobs/britney.html

Record Linkage R S • Edit distance • Jaccard • Cosine • … Record linkage

Document Cleaning Should be “Niels Bohr” Source: http://en.wikipedia.org/wiki/Heisenberg's_microscope

Demos • http://directory.uci.edu/ • http://psearch.ics.uci.edu/advanced/ • http://psearch.ics.uci.edu/

State-of-the-art: Oracle 10g and older • Supported by Oracle Text • CREATE TABLE engdict(word VARCHAR(20), len INT); • Create preferences for text indexing: begin ctx_ddl.create_preference('STEM_FUZZY_PREF', 'BASIC_WORDLIST'); ctx_ddl.set_attribute('STEM_FUZZY_PREF','FUZZY_MATCH','ENGLISH'); ctx_ddl.set_attribute('STEM_FUZZY_PREF','FUZZY_SCORE','0'); ctx_ddl.set_attribute('STEM_FUZZY_PREF','FUZZY_NUMRESULTS','5000'); ctx_ddl.set_attribute('STEM_FUZZY_PREF','SUBSTRING_INDEX','TRUE'); ctx_ddl.set_attribute('STEM_FUZZY_PREF','STEMMER','ENGLISH'); end; / • CREATE INDEX fuzzy_stem_subst_idx ON engdict ( word ) INDEXTYPE IS ctxsys.context PARAMETERS ('Wordlist STEM_FUZZY_PREF'); • Usage: SELECT * FROM engdict WHERE CONTAINS(word, 'fuzzy(universisty, 70, 6, weight)', 1) > 0; • Limitation: cannot handle errors in the first letters: Katherine versus Catherine

Microsoft SQL Server [CGG+05] • Data cleaning tools available in SQL Server 2005 • Part of Integration Services • Supports fuzzy lookups • Uses data flow pipeline of transformations • Similarity function: tokens with TF/IDF scores 8

Lucene • Using Levenshtein Distance (Edit Distance). • Example: roam~0.8 • Prefix filtering followed by a scan (Efficiency?)

Problem Formulation Find strings similar to a given string • Performance is important! • 10 ms: 100 queries per second (QPS) • 5 ms: 200 QPS

Similarity Functions • Similar to: • a domain-specific function • returns a similarity value between two strings • Examples: • Edit distance • Hamming distance • Jaccard similarity • Soundex • TF/IDF, BM25, DICE • See [KSS06] for an excellent survey

A widely used metric to define string similarity Ed(s1,s2) = minimum # of operations (insertion, deletion, substitution) to change s1 to s2 Example: s1: Tom Hanks s2: Ton Hank ed(s1,s2) = 2 Edit Distance 12

Outline • Motivation and preliminaries • Inverted list based algorithms • List-merging algorithms • VGRAM • List-compression techniques • Gram signature algorithms • Length normalized algorithms • Selectivity estimation • Conclusion and future directions

“q-grams” of strings u n i v e r s a l 2-grams

Edit operation’s effect on grams k operations could affect k * q grams Fixed length: q u n i v e r s a l

id strings at ch ck ic ri st ta ti tu uc 0 1 2 3 4 rich stick stich stuck static 2-grams 4 2 3 0 1 4 3 0 3 1 0 1 2 4 4 1 2 4 2 3 q-gram inverted lists

# of common grams >= 3 id strings at ch ck ic ri st ta ti tu uc 4 0 1 2 3 4 rich stick stich stuck static 2 0 1 3 0 1 2 4 0 4 2 3 1 4 1 2 4 3 3 Searching using inverted lists • Query: “shtick”, ED(shtick, ?)≤1 sh ht ti ic ck ic ck ti 2-grams

T-occurrence Problem Merge Ascending order Find elements whose occurrences ≥ T

Example T = 4 1 3 5 10 13 10 13 15 5 7 13 13 15 Result: 13

List-Merging Algorithms HeapMerger MergeOpt [SK04] [LLL08, BK02] ScanCount MergeSkip DivideSkip

Heap-based Algorithm Push to heap …… Min-heap Count # of occurrences of each element using a heap

MergeOpt Algorithm [SK04] Binary search Long Lists: T-1 Short Lists

Example of MergeOpt 1 3 5 10 13 10 13 15 5 7 13 13 15 Long Lists: 3 Short Lists: 2 Count threshold T≥ 4

ScanCount String ids # of occurrences Increment by 1 1 2 3 … 1 0 1 3 5 10 13 10 13 15 5 7 13 13 15 0 1 0 13 4 0 Result! 14 0 15 2 0 Count threshold T≥ 4 24

List-Merging Algorithms HeapMerger MergeOpt [SK04] [LLL08, BK02] ScanCount MergeSkip DivideSkip

MergeSkip algorithm [BK02, LLL08] Pop T-1 …… Min-heap Jump Greater or equals T-1

Example of MergeSkip 1 minHeap 5 10 13 15 1 3 5 10 10 15 5 7 13 15 13 13 Jump 17 17 15 15 Count threshold T≥ 4

DivideSkip Algorithm [LLL08] Binary search MergeSkip Long Lists Short Lists

How many lists are treated as long lists? Long Lists Short Lists Lookup Merge ? A good balance in the tradeoff: # of long lists = T / ( μ logM +1)

Length Filtering Length: 10 s: By length only! Ed(s,t) ≤ 2 t: Length: 19

Positional Filtering Ed(s,t) ≤ 2 s (ab,1) t (ab,12)

root … 1 2 3 n … aa ab zy zz 1 2 m 5 12 17 28 44 Filter tree [LLL08] Length level Gram level … Position level Inverted list

Surprising experimental results (DBLP) Adding position filter could increase running time

Filters fragment inverts lists Merge Merge Merge Merge Applying filters Saving: reduce list size. Cost: - Tree traversal, - More merging

Outline • Motivation and preliminaries • Inverted list based algorithms • List-merging algorithms • VGRAM [LWY07,YWL08] • List-compression techniques • Gram signature algorithms • Length normalized algorithms • Selectivity estimation • Conclusion and future directions

ati ich ick ric sta sti stu tat tic tuc uck 4 id strings id strings id strings 2 0 0 1 2 3 4 0 1 2 3 4 0 1 2 3 4 rich stick stich stuck static rich stick stich stuck static rich stick stich stuck static 1 0 4 2 1 3 4 2 1 4 3 3 2-grams -> 3-grams? • Query: “shtick”, ED(shtick, ?)≤1 sht hti tic ick ick tic # of common grams >= 1 3-grams

id strings at ch ck ic ri st ta ti tu uc 0 1 2 3 4 rich stick stich stuck static 2-grams 1 4 2 3 0 1 4 3 0 3 0 1 2 4 4 1 2 4 2 3 Observation 1: dilemma of choosing “q” • Increasing “q” causing: • Longer grams  Shorter lists • Smaller # of common grams of similar strings

Observation 2: skew distributions of gram frequencies • DBLP: 276,699 article titles • Popular 5-grams: ation (>114K times), tions, ystem, catio

VGRAM: Main idea • Grams with variable lengths (between qmin and qmax) • zebra • ze(123) • corrasion • co(5213), cor(859), corr(171) • Advantages • Reduce index size  • Reducing running time  • Adoptable by many algorithms 

Challenges • Generatingvariable-length grams? • Constructing a high-quality gram dictionary? • Relationship between string similarity and their gram-set similarity? • Adopting VGRAM in existing algorithms?

[2,4]-gram dictionary ni ivr sal uni vers Challenge 1: String  Variable-length grams? • Fixed-length 2-grams u n i v e r s a l • Variable-length grams u n i v e r s a l

Representing gram dictionary as a trie ni ivr sal uni vers

Challenge 2: Constructing gram dictionary Step 1: Collecting frequencies of grams with length in [qmin, qmax] st  0, 1, 3 sti 0, 1 stu3 stic 0, 1 stuc3 Gram trie with frequencies

Step 2: selecting grams • Pruning trie using a frequency threshold F (e.g., 2)

Step 2: selecting grams (cont) Threshold T = 2

Final gram dictionary A cost-based approach to choosing a gram dictionary [YWL08]

Challenge 3: Edit operation’s effect on grams k operations could affect k * q grams Fixed length: q u n i v e r s a l

Deletion affects variable-length grams Not affected Not affected Affected i i-qmax+1 i+qmax- 1 Deletion

[2,4]-grams ni ivr sal uni vers Grams affected by a deletion Affected? i i-qmax+1 i+qmax- 1 Deletion Deletion u n i v e r s a l Affected?

Grams affected by a deletion (cont) Affected? i i-qmax+1 i+qmax- 1 Deletion Trie of grams Trie of reversed grams

Efficient Approximate Search on String Collections

Efficient Approximate Search on String Collections

Presentation Transcript

Efficient Approximate Search on String Collections Part II

Space-Constrained Gram-Based Indexing for Efficient Approximate String Search

Space-Constrained Gram-Based Indexing for Efficient Approximate String Search

Efficient Merging and Filtering Algorithms for Approximate String Searches

Approximate String Matching

The Flamingo Software Package on Approximate String Queries

Efficient Merging and Filtering Algorithms for Approximate String Searches

Efficient Approximate Search on String Collections Part I

Efficient Search on Encrypted Data

Rules for Approximate String Matching

Exact String Search

Efficient Merging and Filtering Algorithms for Approximate String Searches

Efficient String Matching : An Aid to Bibliographic Search

Word Hashing for Efficient Search in Document Image Collections

Approximate Boyer-Moore String Matching

Developing Your Search String

Filter Algorithms for Approximate String Matching

A fast algorithm for approximate string matching on gene sequences

Approximate String Matching

Word Hashing for Efficient Search in Document Image Collections

Efficient Approximate Search on String Collections Part II

Exact String Search