An Overview of Similarity Query Processing. February 26, 2014. Jongik Kim, Division of Computer Engineering, Chonbuk National University
Table of Contents
• 01. Applications of Similarity Query Processing
• 02. Problem Formulation
• 03. String Decomposition
• 04. Similarity Function
• 05. A Naïve Approach
• 06. Overlap Similarity
• 07. Similarity Query Processing with Inverted Lists
• 08. Similarity Function Revisited
• 09. Filter and Verification Framework
• 10. Prefix Filtering based Approach
• 11. Exploiting Document Frequency Ordering
Some examples and figures in this presentation are taken from the following materials:
• Marios Hadjieleftheriou and Chen Li, Efficient Approximate Search on String Collections (tutorial), ICDE 2009 and VLDB 2009
• Chuan Xiao, Wei Wang, Xuemin Lin, Jeffrey Xu Yu, Efficient Similarity Joins for Near Duplicate Detection, WWW 2008 (slides)
• Jongik Kim and Hongrae Lee, Efficient Exact Similarity Searches using Multiple Token Orderings, ICDE 2012 (slides)
Applications of similarity query processing (1/8) Web Search: actual queries gathered by Google
Applications of similarity query processing (2/8) Data Integration and Data Cleaning (figure: two relations R and S, with an entry that should be “Niels Bohr”)
Applications of similarity query processing (3/8) Duplicate (Web) Document Detection
Applications of similarity query processing (4/8) Identify Spam
SPAM TEMPLATE: Sir/Madam, We happily announce to you the draw of the EURO MILLIONS SPANISH LOTTERY INTERNATIONAL WINNINGS PROGRAM PROMOTIONS held on the 27TH MARCH 2008 in SPAIN. Your company or your personal e-mail address attached to ticket number 653-908-321-675 with serial main number <NUMBER> drew lucky star winning numbers <NUMBER> which consequently won in the 2ND category, you have therefore been approved for a lump sum pay out of 960.000.00 Euros. (NINE HUNDRED AND SIXTY THOUSAND EUROS). CONGRATULATIONS!!! Sincerely yours, <NAME> <AFFILIATION>
Applications of similarity query processing (5/8) Detect Plagiarism
Answer 1: Q. What are the advantages of RAID5 over RAID4? A. 1. Several write requests could be processed in parallel, since the bottleneck of a unique check disk has been eliminated. 2. Read requests have a higher level of parallelism. Since the data is distributed over all disks, read requests involve all disks, whereas in systems with a dedicated check disk the check disk never participates in read.
Answer 2: Q. What are the advantages of RAID5 over RAID4? A. 1. Several write requests could be processed in parallel, since the bottleneck of a single check disk has been eliminated. 2. Read requests have a higher level of parallelism on RAID5. Since the data is distributed over all disks, read requests involve all disks, whereas in systems with a check disk the check disk never participates in read.
Applications of similarity query processing (6/8) Recommendation of friends in an SNS service
The friends of a person can be represented as a binary vector.
Friends vector: 1 0 0 1 1 0 0 1
Friends vector: 1 0 0 1 1 1 0 1
Applications of similarity query processing (7/8) Read (a fragment of a genome sequence) Alignment
Reference sequence: GCTGATGTGCCGCCTCACTCCGGTGG …
Short reads: CACTCCTGTGG, CTCACTCCTGTGG, GCTGATGTGCCACCTCA, GATGTGCCACCTCACTC, GTGCCGCCTCACTCCTG, CTCCTGTGG
Applications of similarity query processing (8/8) Query Relaxation
• Supported by Oracle Text
• CREATE TABLE engdict (word VARCHAR(20), len INT);
• Create preferences for text indexing:
begin
  ctx_ddl.create_preference('STEM_FUZZY_PREF', 'BASIC_WORDLIST');
  ctx_ddl.set_attribute('STEM_FUZZY_PREF','FUZZY_MATCH','ENGLISH');
  ctx_ddl.set_attribute('STEM_FUZZY_PREF','FUZZY_SCORE','0');
  ctx_ddl.set_attribute('STEM_FUZZY_PREF','FUZZY_NUMRESULTS','5000');
  ctx_ddl.set_attribute('STEM_FUZZY_PREF','SUBSTRING_INDEX','TRUE');
  ctx_ddl.set_attribute('STEM_FUZZY_PREF','STEMMER','ENGLISH');
end;
/
• CREATE INDEX fuzzy_stem_subst_idx ON engdict ( word ) INDEXTYPE IS ctxsys.context PARAMETERS ('Wordlist STEM_FUZZY_PREF');
• Usage: SELECT * FROM engdict WHERE CONTAINS(word, 'fuzzy(universisty, 70, 6, weight)', 1) > 0;
• Limitation: cannot handle errors in the first letters: Katherine versus Catherine
Problem Formulation (1/2) Find strings similar to a given string
Problem Formulation (2/2)
• Similar to:
  • a domain-specific function
  • returns a similarity value between two strings
• Common similarity functions:
  • Jaccard coefficient
  • Cosine similarity
  • Dice similarity
  • Edit distance
• The Jaccard, cosine, and Dice functions require set data, so strings must first be decomposed into sets of tokens.
String Decomposition
• Word tokens for a long string (e.g. a web page)
  • x = “yes as soon as possible”
  • y = “as soon as possible please”
  • x = {A, B, C, D, E}
  • y = {B, C, D, E, F}
• q-gram tokens for a short string (e.g. a keyword query)
  • x = “universal”
  • G(x, 2) = {un, ni, iv, ve, er, rs, sa, al}
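The two decompositions above can be sketched in a few lines of Python. The function names are hypothetical, and numbering repeated words (so that the two occurrences of "as" become distinct tokens) is an assumption inferred from the five labels A to E on the slide.

```python
from collections import Counter

def qgrams(s, q=2):
    """Return the overlapping q-grams of s, in order of position."""
    return [s[i:i + q] for i in range(len(s) - q + 1)]

def word_tokens(s):
    """Split a long string into whitespace-separated word tokens."""
    return s.split()

def number_duplicates(tokens):
    """Turn a token list into a set by tagging each token with its
    occurrence number, so repeated tokens stay distinct (assumption)."""
    seen = Counter()
    out = set()
    for tok in tokens:
        seen[tok] += 1
        out.add((tok, seen[tok]))
    return out

x = number_duplicates(word_tokens("yes as soon as possible"))
y = number_duplicates(word_tokens("as soon as possible please"))
print(qgrams("universal", 2))
print(len(x & y))  # number of shared word tokens
```

With this numbering, x and y each have five tokens and share four of them, which matches the sets {A, B, C, D, E} and {B, C, D, E, F}.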
Similarity Function
x = {A, B, C, D, E}, y = {B, C, D, E, F}
• Jaccard similarity
• Cosine similarity
• Dice similarity
• Edit distance: ED(x, y) = minimum number of edit operations (insertion, deletion, substitution) needed to change x into y
  • x: Tom Hanks
  • y: Ton Hank
  • ED(x, y) = 2
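As a rough illustration of the four functions, the sketch below uses the standard set-based definitions of Jaccard, cosine, and Dice (an assumption, since the formulas themselves are not spelled out in the text) and a textbook dynamic-programming edit distance. All function names are hypothetical, and the set functions assume non-empty inputs.

```python
import math

def jaccard(x, y):
    """|x ∩ y| / |x ∪ y| (standard definition, assumed)."""
    return len(x & y) / len(x | y)

def cosine(x, y):
    """|x ∩ y| / sqrt(|x| * |y|) (standard set-based form, assumed)."""
    return len(x & y) / math.sqrt(len(x) * len(y))

def dice(x, y):
    """2 * |x ∩ y| / (|x| + |y|) (standard definition, assumed)."""
    return 2 * len(x & y) / (len(x) + len(y))

def edit_distance(a, b):
    """Minimum number of insertions, deletions, and substitutions
    that turn string a into string b (row-by-row dynamic programming)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # delete ca
                           cur[j - 1] + 1,              # insert cb
                           prev[j - 1] + (ca != cb)))   # substitute or match
        prev = cur
    return prev[-1]

x, y = set("ABCDE"), set("BCDEF")
print(jaccard(x, y), cosine(x, y), dice(x, y))
print(edit_distance("Tom Hanks", "Ton Hank"))
```

For the example sets the overlap is 4, so under these definitions Jaccard is 4/6 while cosine and Dice are both 4/5.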
A naïve approach
Given a collection of strings C, a query string x, and a threshold t of a similarity function sim,
1. decompose each string in C and the query string into tokens.
2. output those strings y ∈ C such that sim(x, y) ≥ t.
Because the similarity function must be evaluated against every string in C, and C typically contains a very large number of strings, this approach is inefficient.
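The two steps above amount to a full scan of the collection. A minimal sketch, assuming Jaccard similarity over 2-grams and a small made-up collection (the strings and the threshold are illustrative only):

```python
def qgrams(s, q=2):
    """Set of overlapping q-grams of s."""
    return {s[i:i + q] for i in range(len(s) - q + 1)}

def jaccard(x, y):
    """|x ∩ y| / |x ∪ y|; assumes at least one set is non-empty."""
    return len(x & y) / len(x | y)

def naive_search(collection, query, t, q=2):
    """Compare the query with every string: O(|C|) similarity computations."""
    qx = qgrams(query, q)
    return [y for y in collection if jaccard(qx, qgrams(y, q)) >= t]

# Hypothetical collection, chosen only for illustration.
C = ["area", "artisan", "artist", "tisk"]
print(naive_search(C, "artist", 0.5))
```

Every query touches every string, which is the cost the later index-based techniques try to avoid.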
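Overlap Similarity (1/2) Overlap similarity: O(x, y) = |x ∩ y|. Given a similarity threshold t, a constraint on a set-based similarity function can be rewritten as a lower bound on the overlap.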
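The conversion from a similarity threshold to an overlap threshold can be written out as follows. This is a sketch that assumes the standard set-based definitions of the Jaccard, cosine, and Dice functions, with O(x, y) = |x ∩ y| denoting the overlap:

```latex
\begin{align*}
J(x,y) = \frac{|x \cap y|}{|x \cup y|} \ge t
  &\iff O(x,y) \ge \frac{t}{1+t}\,\bigl(|x| + |y|\bigr) \\
C(x,y) = \frac{|x \cap y|}{\sqrt{|x|\,|y|}} \ge t
  &\iff O(x,y) \ge t\,\sqrt{|x|\,|y|} \\
D(x,y) = \frac{2\,|x \cap y|}{|x| + |y|} \ge t
  &\iff O(x,y) \ge \frac{t}{2}\,\bigl(|x| + |y|\bigr)
\end{align*}
```

The Jaccard line follows from |x ∪ y| = |x| + |y| − O(x, y): the inequality O ≥ t(|x| + |y| − O) rearranges to (1 + t)·O ≥ t(|x| + |y|).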
Overlap Similarity (2/2)
Given an edit distance threshold d, d edit operations can affect at most d × q q-grams
• or, d edit operations on x can mutate at most d × q q-grams of x
x = “universal” and G(x, 2) = {un, ni, iv, ve, er, rs, sa, al}
2 edit operations on x mutate at most 2 × 2 = 4 q-grams
Hence, y should contain at least |G(x, 2)| − 2 × 2 = 8 − 4 = 4 q-grams in G(x, 2)
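The count bound above can be checked with a small sketch. The function names are hypothetical, and the string "unibersel" is a made-up example that is two substitutions away from "universal":

```python
from collections import Counter

def qgrams(s, q=2):
    """List of overlapping q-grams of s."""
    return [s[i:i + q] for i in range(len(s) - q + 1)]

def min_shared_qgrams(x, q, d):
    """Lower bound on shared q-grams for any y within edit distance d of x:
    |G(x, q)| - d * q (clamped at 0, where the filter prunes nothing)."""
    return max(0, len(qgrams(x, q)) - d * q)

def shared_qgrams(x, y, q=2):
    """Number of q-grams common to x and y, counted as multisets."""
    return sum((Counter(qgrams(x, q)) & Counter(qgrams(y, q))).values())

print(min_shared_qgrams("universal", 2, 2))
print(shared_qgrams("universal", "unibersel", 2))
```

Each substitution here destroys exactly two 2-grams (iv, ve and sa, al), so the example sits exactly on the bound.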
Similarity Query Processing with Inverted lists
Make inverted lists (q-gram: ids of the strings that contain it):
an: 2
ar: 1 2 3
ea: 1
is: 2 3 4
re: 1
rt: 2 3
sa: 2
sk: 4
st: 3
ti: 2 3 4
Query: “artist” = {ar, rt, ti, is, st}, overlap threshold: 4
Merge the lists of the query's q-grams to count occurrences: id 1 occurs 1 time, id 2 occurs 4 times, id 3 occurs 5 times, id 4 occurs 2 times
Answers of the query: 2: “artisan”, 3: “artist”
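A minimal sketch of this index-and-count scheme follows. Strings 2 and 3 are named on the slide; strings 1 ("area") and 4 ("tisk") are assumptions reconstructed from the q-grams in their inverted lists, and the function names are hypothetical. A plain counter stands in for the merge step here.

```python
from collections import Counter, defaultdict

def qgrams(s, q=2):
    """Set of overlapping q-grams of s."""
    return {s[i:i + q] for i in range(len(s) - q + 1)}

def build_index(strings, q=2):
    """Map each q-gram to the sorted list of ids of strings containing it."""
    index = defaultdict(list)
    for sid in sorted(strings):
        for g in qgrams(strings[sid], q):
            index[g].append(sid)
    return index

def overlap_search(index, query, alpha, q=2):
    """Count, per string id, how many of the query's lists contain it,
    and keep the ids that reach the overlap threshold alpha."""
    counts = Counter()
    for g in qgrams(query, q):
        counts.update(index.get(g, []))
    return sorted(sid for sid, c in counts.items() if c >= alpha)

# Ids 1 and 4 are assumed strings consistent with the inverted lists.
strings = {1: "area", 2: "artisan", 3: "artist", 4: "tisk"}
index = build_index(strings)
print(overlap_search(index, "artist", 4))
```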
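Merge Algorithm – HeapMerge
Count threshold: t = 3 (report the ids whose count is at least t)
A min-heap (minHeap) holds the current head of each inverted list. Popping the smallest id repeatedly brings all occurrences of the same id together, so they can be counted.
• id 1: count 2 < t (X, rejected)
• id 2: count 3 = t (O, accepted)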
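One possible way to write the heap-based merge is sketched below with Python's `heapq`. The input lists are illustrative (not the ones in the figure), the function name is hypothetical, and each list is assumed to be sorted in ascending id order.

```python
import heapq

def heap_merge(lists, t):
    """Return the ids that occur in at least t of the sorted lists."""
    # Heap entries: (id, index of the list, position inside that list).
    heap = [(lst[0], i, 0) for i, lst in enumerate(lists) if lst]
    heapq.heapify(heap)
    results = []
    current, count = None, 0
    while heap:
        sid, li, pos = heapq.heappop(heap)
        if sid != current:
            # All occurrences of the previous id have been seen.
            if current is not None and count >= t:
                results.append(current)
            current, count = sid, 0
        count += 1
        # Advance in the list the popped entry came from.
        if pos + 1 < len(lists[li]):
            heapq.heappush(heap, (lists[li][pos + 1], li, pos + 1))
    if current is not None and count >= t:
        results.append(current)
    return results

# Illustrative lists: id 1 occurs twice, ids 2 and 3 occur three times.
lists = [[1, 2, 3], [1, 2, 7], [2, 3, 17], [3, 17]]
print(heap_merge(lists, 3))
```

The heap never holds more than one entry per list, so memory stays proportional to the number of lists rather than to their total length.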
Similarity Function Revisited
To determine the overlap threshold, we need to know the size of y, which varies from string to string in the collection. Given a query x with a similarity threshold t, we therefore use an overlap lower bound that depends only on |x| and t and holds for all y.
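A sketch of such size-independent bounds, assuming the standard set-based definitions: a Jaccard threshold t implies |y| ≥ t·|x|, and substituting that into the overlap constraint gives O(x, y) ≥ t·|x|. The same argument gives t²·|x| for cosine and t/(2 − t)·|x| for Dice. The function name is hypothetical, and these formulas are a derivation under the stated assumption rather than a quotation of the slide.

```python
import math

def overlap_lower_bound(size_x, t, sim="jaccard"):
    """Smallest integer overlap that any y with sim(x, y) >= t must reach,
    expressed in terms of |x| and t only."""
    if sim == "jaccard":
        bound = t * size_x
    elif sim == "cosine":
        bound = t * t * size_x
    elif sim == "dice":
        bound = t / (2 - t) * size_x
    else:
        raise ValueError(f"unknown similarity function: {sim}")
    # Small epsilon guards against floating-point noise before rounding up.
    return math.ceil(bound - 1e-9)

# A query with 5 q-grams (such as "artist") and a Jaccard threshold of 0.8.
print(overlap_lower_bound(5, 0.8, "jaccard"))
```

Because the bound ignores |y|, it is looser than the per-string threshold, so it suits a filtering step that is followed by verification.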
Filter and Verification Framework
• FILTER: Find those strings that share at least α tokens with the query string, where α is an overlap lower bound.
• VERIFICATION: Verify each string found in the filtering stage by directly applying a similarity function.
The FILTER stage is itself split into two steps:
• FILTER: Quickly generate initial candidates using a minimum constraint.
• REFINEMENT: Refine the candidates using α.
Prefix Filtering based Approach
Query x = “artist” {ar, rt, ti, is, st} and overlap threshold α = 4
Inverted lists for the query:
ar: 1 2 3
is: 2 3 4
rt: 2 3
st: 3
ti: 2 3 4
Sort the lists by their sizes, i.e., sort the tokens by their document frequencies (document frequency ordering):
st: 3
rt: 2 3
ar: 1 2 3
is: 2 3 4
ti: 2 3 4
Prefix lists: the first |G(x, 2)| – α + 1 = 2 lists (st, rt)
Suffix lists: the remaining α – 1 = 3 lists (ar, is, ti)
• Filtering Phase (the prefix filtering)
  • Merge the prefix lists to generate candidates: ids 2 and 3
• Refinement Phase
  • Search the suffix lists for each candidate
  • A candidate searches each suffix list to identify if it is contained in the list
  • Binary search is used because suffix lists are usually very long
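The two phases can be sketched as follows. The filter is safe because a string that shares at least α of the query's tokens must appear in at least one of any |G(x, 2)| − α + 1 of the query's lists. The function names are hypothetical, the lists are assumed sorted by id, and α is assumed to lie between 1 and the number of lists.

```python
from bisect import bisect_left

def contains(sorted_list, sid):
    """Binary search for sid in a sorted inverted list."""
    i = bisect_left(sorted_list, sid)
    return i < len(sorted_list) and sorted_list[i] == sid

def prefix_filter_search(lists, alpha):
    """lists: token -> sorted id list for the query's tokens."""
    # Document frequency ordering: shortest lists first.
    ordered = sorted(lists.values(), key=len)
    n_prefix = len(ordered) - alpha + 1
    prefix, suffix = ordered[:n_prefix], ordered[n_prefix:]

    # Filtering phase: merge the prefix lists to generate candidates.
    counts = {}
    for lst in prefix:
        for sid in lst:
            counts[sid] = counts.get(sid, 0) + 1

    # Refinement phase: probe every suffix list for each candidate.
    results = []
    for sid in sorted(counts):
        total = counts[sid] + sum(contains(lst, sid) for lst in suffix)
        if total >= alpha:
            results.append(sid)
    return sorted(counts), results

lists = {"ar": [1, 2, 3], "is": [2, 3, 4], "rt": [2, 3],
         "st": [3], "ti": [2, 3, 4]}
candidates, answers = prefix_filter_search(lists, 4)
print(candidates, answers)
```

Only the two short prefix lists are scanned in full; the three longer suffix lists are touched by binary search alone.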
Exploiting Document Frequency Ordering (1/2)
• General Goal: minimize the number of candidates initially generated
  • by making use of the document frequency ordering
Query x = “artist” {ar, rt, ti, is, st} and overlap threshold α = 4
Prefix lists: the first |G(x, 2)| – α + 1 lists; suffix lists: the remaining α – 1 lists
Without sorting:
ar: 1 2 3 (prefix)
is: 2 3 4 (prefix)
rt: 2 3
st: 3
ti: 2 3 4
candidates: 1 2 3 4
After sorting the tokens by their document frequencies:
st: 3 (prefix)
rt: 2 3 (prefix)
ar: 1 2 3
is: 2 3 4
ti: 2 3 4
candidates: 2 3
• We can reduce
  • the time for merging, because the prefix lists are short
  • the number of candidates, and hence the time for verification
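The effect of the token ordering on the candidate set can be reproduced with a short sketch. The function name is hypothetical, and the unsorted ordering is assumed here to be alphabetical.

```python
def prefix_candidates(lists, order, alpha):
    """Union of the inverted lists of the first len(order) - alpha + 1
    tokens in the given ordering."""
    n_prefix = len(order) - alpha + 1
    cands = set()
    for tok in order[:n_prefix]:
        cands.update(lists[tok])
    return cands

lists = {"ar": [1, 2, 3], "is": [2, 3, 4], "rt": [2, 3],
         "st": [3], "ti": [2, 3, 4]}

alphabetical = sorted(lists)                          # ar, is, rt, st, ti
df_order = sorted(lists, key=lambda t: len(lists[t]))  # st, rt, ar, is, ti

print(prefix_candidates(lists, alphabetical, 4))
print(prefix_candidates(lists, df_order, 4))
```

Both orderings are correct filters; the document frequency ordering simply puts the rarest tokens in the prefix, which tends to give fewer candidates.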
Exploiting Document Frequency Ordering (2/2)
• Observation
  • By partitioning a data set, we can artificially modify the document frequencies of tokens in each partition.
  • We evaluate a query in each partition and take the union of the results.
  • We can reduce the number of candidates by utilizing different token orderings among partitions.
  • Because partitions have different token orderings, we need to sort the tokens of a query record separately in each partition.
Query x = {w1, w2} and overlap threshold α = 2
• Without partitioning: w2 is the prefix list; # of candidates is 5
• With partitioning:
  • in one partition, w2 is the prefix list; # of candidates is 0
  • in the other partition, w1 is the prefix list; # of candidates is 0
  • Total number of candidates is 0
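A toy data set that reproduces these numbers is sketched below. The records are invented for illustration: six contain only w1 and five contain only w2, and the partitioning simply separates the two groups. The function name is hypothetical.

```python
def prefix_candidates(partition, query, alpha):
    """partition: id -> set of tokens. Orders the query tokens by their
    document frequency inside this partition, then returns the ids that
    contain at least one prefix token."""
    df = {w: sum(w in toks for toks in partition.values()) for w in query}
    order = sorted(query, key=lambda w: df[w])
    prefix = order[:len(order) - alpha + 1]
    return {sid for sid, toks in partition.items()
            if any(w in toks for w in prefix)}

# Hypothetical data: ids 1-6 contain only w1, ids 7-11 contain only w2.
data = {sid: {"w1"} for sid in range(1, 7)}
data.update({sid: {"w2"} for sid in range(7, 12)})
query, alpha = ["w1", "w2"], 2

whole = prefix_candidates(data, query, alpha)
part_a = {sid: toks for sid, toks in data.items() if sid <= 6}
part_b = {sid: toks for sid, toks in data.items() if sid >= 7}
split = (prefix_candidates(part_a, query, alpha)
         | prefix_candidates(part_b, query, alpha))
print(len(whole), len(split))
```

In each partition the prefix token is the one that never occurs there, so no candidates are generated, whereas the unpartitioned data yields five.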
Q&A Thank you!