Fast Approximate Point Set Matching for Information Retrieval

Raphaël Clifford and Benjamin Sach (ben.sach.05@bristol.ac.uk)

Contents
• What’s the problem?
• What use is it?
• Is it (3-SUM) hard?
• How have we solved it?
• How good is our solution?

The Maximal Subset Matching problem
Given a pattern, P (of size m), and a text, T (of size n), we want to find the largest “match” of P in T.
This is also referred to as the “constellation” problem (originally by B. Chazelle).


The Maximal Subset Matching problem
What is a “match”?
A point $p_i$ in P matches a point $t_j$ in T with a shift, v, if: $p_i + v = t_j$

The Maximal Subset Matching problem
What is a “match”?
A subset M of P is a subset match if there exists a shift, v, with which all points in M match points in T.
The Maximal Subset Matching problem is to find the size of the largest subset match for a given P and T.
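The definition above can be made concrete with a brute-force baseline. This is a minimal sketch, not one of the algorithms presented in the talk: it assumes integer 2D points, and the function name `maximal_subset_match` is illustrative. Every (pattern point, text point) pair votes for the shift that would map one onto the other, which takes O(nm) time and space for the votes.

```python
from collections import Counter

def maximal_subset_match(pattern, text):
    """Return (size, shift) of the largest subset match of pattern in text.

    Each pair (p, t) votes for the shift v = t - p. The shift with the most
    votes maps the largest number of pattern points onto text points.
    Returns (0, None) if either point set is empty.
    """
    votes = Counter()
    for px, py in set(pattern):
        for tx, ty in set(text):
            votes[(tx - px, ty - py)] += 1
    if not votes:
        return 0, None
    shift, size = votes.most_common(1)[0]
    return size, shift

pattern = [(0, 0), (1, 2), (3, 1)]
text = [(5, 5), (6, 7), (9, 9), (2, 2)]
size, shift = maximal_subset_match(pattern, text)
# Two of the three pattern points match under the shift (5, 5).
```

The exact baseline is useful mainly as a reference against which an approximate answer can be compared on small inputs.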

Application to Music Information Retrieval
• Allows for matches shifted in time and pitch
• Intrinsically handles polyphonic music, which traditional string-based methods do not
Other Applications:
• Protein structure alignment
• Pharmacophore identification
• Image registration
• Model-based object recognition

Is Maximal Subset Matching hard?
3-SUM is… Given a set of n integers: are there elements a, b, c in the set such that a + b + c = 0?
• There is a simple algorithm to solve 3-SUM in O(n²) time
• No lower complexity solutions are known
• It is conjectured that this is a lower bound
“Many fundamental geometric problems fall in this class”
Maximal Subset Matching has been proven to be 3-SUM hard
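For reference, the simple quadratic algorithm for 3-SUM can be sketched as follows. This is the standard sort-and-two-pointer method; the function name `has_three_sum` is illustrative, and the sketch assumes the variant in which a, b and c come from three distinct positions of the input.

```python
def has_three_sum(values):
    """Return True if three entries at distinct positions sum to zero.

    Sorting costs O(n log n); the two-pointer scan for each fixed first
    element costs O(n), giving O(n^2) overall.
    """
    nums = sorted(values)
    n = len(nums)
    for i in range(n - 2):
        lo, hi = i + 1, n - 1
        while lo < hi:
            total = nums[i] + nums[lo] + nums[hi]
            if total == 0:
                return True
            if total < 0:
                lo += 1   # need a larger sum
            else:
                hi -= 1   # need a smaller sum
    return False
```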

The Algorithms
MSMBP: bit-parallel implementation
• O(nm) time, O(n) space, with very low constants
• Cross-correlation implemented via bit-sets
MSMFT: FFT-based implementation
• O(n*log(m)) time, O(n) space
• Cross-correlation implemented via Fast Fourier Transforms
The Structure
• Randomly project the pattern and the text into 1D
• “Length reduce” the data to decrease sparsity
• Perform a cross-correlation at each alignment of the length reduced pattern and text
• Find the shift in the length reduced pattern that gave the largest value in the cross-correlation
• Using the “improved estimate”, infer the shift in the original data
• Return the size of the match with this shift
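The slides do not spell out how the random projection into 1D is done. One simple possibility, assumed here purely for illustration, is a random integer linear map: because it is linear, a match p + v = t in the original data remains a match after projection, although distinct points may occasionally collide. The names `random_projection` and `max_coeff` are hypothetical.

```python
import random

def random_projection(dim, max_coeff, rng):
    """Return a function mapping dim-dimensional integer points to integers.

    Assumption: the projection is a random linear combination of the
    coordinates. Linearity gives project(p + v) == project(p) + project(v).
    """
    coeffs = [rng.randint(1, max_coeff) for _ in range(dim)]

    def project(point):
        return sum(c * x for c, x in zip(coeffs, point))

    return project

project = random_projection(dim=2, max_coeff=1000, rng=random.Random(0))
p, v = (3, 7), (10, -2)
t = (p[0] + v[0], p[1] + v[1])
# The shift is preserved in 1D: project(p) + project(v) == project(t)
```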

(a) Randomised Projection and (b) Length Reduction
• Projected pattern points are mapped to $h(x)$ in the pattern binary array
• Projected text points are mapped to $h(x)$ and $h_2(x)$ in the text binary array
• Both are binary arrays of length r*n, where n is the number of text points
Using hash functions:
\[ g(x) = a x \bmod q, \qquad h(x) = g(x) \bmod s, \qquad h_2(x) = (g(x) + q) \bmod s \]
Where:
q = a random prime in [2N,…,4N] (N is the maximum of the projected values of P’ and T’)
a = a random integer in [1,…,q-1]
s = r*n, where r > 1 is a constant
(See Cole and Hariharan [3])
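The length reduction step can be sketched in a few lines. This is an illustrative sketch under stated assumptions, not the authors' implementation: the projected values are non-negative integers with maximum at least 1, the prime q is found by trial division (adequate only for small demonstrations), and the names `make_hashes` and `length_reduce` are hypothetical.

```python
import random

def is_prime(k):
    """Trial-division primality test, sufficient for small demonstrations."""
    if k < 2:
        return False
    i = 2
    while i * i <= k:
        if k % i == 0:
            return False
        i += 1
    return True

def make_hashes(N, s, rng):
    """Draw q and a as described on the slide and return (g, h, h2, q)."""
    q = rng.choice([k for k in range(2 * N, 4 * N + 1) if is_prime(k)])
    a = rng.randint(1, q - 1)
    g = lambda x: (a * x) % q
    h = lambda x: g(x) % s
    h2 = lambda x: (g(x) + q) % s
    return g, h, h2, q

def length_reduce(pattern_1d, text_1d, r=2, seed=0):
    """Map projected points into two binary arrays of length s = r * n."""
    rng = random.Random(seed)
    N = max(max(pattern_1d), max(text_1d))
    s = r * len(text_1d)
    g, h, h2, q = make_hashes(N, s, rng)
    pattern_bits = [0] * s
    text_bits = [0] * s
    for x in pattern_1d:
        pattern_bits[h(x)] = 1       # pattern points use h only
    for x in text_1d:
        text_bits[h(x)] = 1          # text points use both h and h2
        text_bits[h2(x)] = 1
    return pattern_bits, text_bits
```

Setting two bits per text point is what lets a shifted pattern point land on a set bit whether or not the sum g(p) + g(v) wraps around q.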

Why does this work?
Lemma 1:
\[ (h(x) + h(y)) \bmod s = \begin{cases} h(x + y) & \text{if } g(x) + g(y) < q, \\ h_2(x + y) & \text{otherwise.} \end{cases} \]
Significance:
• If some point matches so that p + v = t, then (h(p) + h(v)) mod s matches either $h(t)$ or $h_2(t)$
• By counting the number of 1’s in common at each alignment we can estimate the true subset match in the original data
Proof:
\[ g(x + y) = a(x + y) \bmod q = (g(x) + g(y)) \bmod q = \begin{cases} g(x) + g(y) & \text{if } g(x) + g(y) < q, \\ g(x) + g(y) - q & \text{if } g(x) + g(y) \geq q. \end{cases} \]
\[ (h(x) + h(y)) \bmod s = (g(x) \bmod s + g(y) \bmod s) \bmod s = (g(x) + g(y)) \bmod s \]
If $g(x) + g(y) < q$, this equals $g(x + y) \bmod s = h(x + y)$; otherwise it equals $(g(x + y) + q) \bmod s = h_2(x + y)$.
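Lemma 1 is easy to check exhaustively for small parameters. The sketch below assumes non-negative integer inputs and uses arbitrary illustrative values for q, a and s; the function name `check_lemma` is hypothetical.

```python
def check_lemma(q, a, s, limit):
    """Verify Lemma 1 for all 0 <= x, y < limit with the given parameters."""
    g = lambda x: (a * x) % q
    h = lambda x: g(x) % s
    h2 = lambda x: (g(x) + q) % s
    for x in range(limit):
        for y in range(limit):
            lhs = (h(x) + h(y)) % s
            # No wrap-around modulo q: h applies; otherwise h2 applies.
            expected = h(x + y) if g(x) + g(y) < q else h2(x + y)
            if lhs != expected:
                return False
    return True

# Example parameters chosen for illustration only.
lemma_holds = check_lemma(q=101, a=37, s=16, limit=60)
```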

Estimating the Size of the Largest Subset Match
• Estimation based on projected and length reduced matches: high variance, which grows linearly as the number of true matches decreases (discussed in paper)
• An improved estimate:
  • Find the best match of the length reduced pattern in the text.
  • Determine in O(m) time which points in the reduced pattern match the text at that shift.
  • Look up, using a precalculated hash table, where each of the matching points was mapped from in the 1D projections, P’ and T’.
  • Now we have a shift for each pair of points in P’ and T’. These may contain rare inconsistencies due to collisions. We therefore perform a count and take the most frequent shift.
  • Finally, we return the size of the match at this shift.
When does this work?
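The counting step of the improved estimate can be sketched as below. It assumes the hash-table lookup has already produced a list of (pattern value, text value) pairs in the 1D projection; the function names `most_frequent_shift` and `match_size` are illustrative, and the match size is computed in 1D for brevity.

```python
from collections import Counter

def most_frequent_shift(matched_pairs):
    """Take the shift t - p that occurs most often among the matched pairs.

    Pairs produced by hash collisions vote for scattered, inconsistent
    shifts, so the true shift is expected to dominate the count.
    """
    counts = Counter(t - p for p, t in matched_pairs)
    shift, _ = counts.most_common(1)[0]
    return shift

def match_size(pattern_1d, text_1d, shift):
    """Count the pattern points that land on a text point under the shift."""
    text_set = set(text_1d)
    return sum(1 for p in pattern_1d if p + shift in text_set)

pairs = [(1, 11), (4, 14), (9, 19), (2, 30)]   # the last pair is a collision
shift = most_frequent_shift(pairs)
size = match_size([1, 4, 9, 2], [11, 14, 19, 30], shift)
```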

Bit-Parallel Cross-correlation (MSMBP)
We store the reduced pattern and text arrays as bitsets and perform a bit-parallel correlation using ANDs and counts:
• The correlation of two machine words can be found in constant time using an AND followed by a count of the number of 1’s in the result
• The count is implemented using a look-up table
• Each reduced array is of size r*n, so the bitset has O(n) words, and each correlation takes O(n) time
• We need to find the correlation at each shift
• To shift the text we must shift every word in the text, which takes O(n) time again
Therefore, naively, this method takes O(n²) time:
\[ \underbrace{O(n)}_{\text{Alignments}} \times \Big( \underbrace{O(n)}_{\text{Correlation}} + \underbrace{O(n)}_{\text{Shift}} \Big) = O(n^2) \]
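A minimal sketch of the naive bit-parallel correlation follows. It makes two simplifying assumptions: a single arbitrary-precision Python integer stands in for the array of machine words, and `bin(x).count("1")` stands in for the look-up-table count. The function names are illustrative.

```python
def to_bitset(bits):
    """Pack a list of 0/1 values into one integer (bit i holds bits[i])."""
    value = 0
    for i, bit in enumerate(bits):
        if bit:
            value |= 1 << i
    return value

def popcount(x):
    """Count the 1 bits in a non-negative integer."""
    return bin(x).count("1")

def correlate_all_shifts(pattern_bits, text_bits):
    """Return, for every shift k, the number of 1's the arrays have in common.

    Shifting the whole text and ANDing it with the whole pattern at every
    alignment is the naive quadratic scheme.
    """
    p = to_bitset(pattern_bits)
    t = to_bitset(text_bits)
    return [popcount(p & (t >> k)) for k in range(len(text_bits))]

counts = correlate_all_shifts([1, 0, 1], [1, 0, 1, 1, 0, 1])
# counts == [2, 1, 1, 2, 0, 1]
```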

Bit-Parallel Cross-correlation (MSMBP)
We reduce this complexity by taking advantage of the sparseness of the reduced pattern array when m << n:
• p has O(n) words but only O(m) non-zero values
• We store only these, at worst m words
• This reduces each correlation computation to O(m) time
However, we also need to reduce the number of shifts required.
By use of pointer arithmetic (p[0], (p+1)[0], …), we can align the data to any constant*b alignment (where b is the byte size) in constant time:
|01010010|01000100|01011011|10000100|10100100|10010010|…
A single full shift of t (shift t left, <<) gives us access to alignments c*b + 1 for any c:
|10100100|10001000|10110111|00001001|01001001|00100100|…
So by calculating the correlations out of order, we need to perform only b shifts.
This results in an O(nm) time complexity algorithm.
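The two ideas above (storing only the non-zero pattern words, and performing only b full shifts of the text) can be sketched together. This is an illustrative reconstruction under stated assumptions, not the authors' code: words are 8-bit bytes packed most significant bit first, list indexing stands in for pointer arithmetic, and the names `pack` and `correlate_sparse` are hypothetical.

```python
# Look-up table giving the number of 1 bits in every possible byte.
POPCOUNT = [bin(i).count("1") for i in range(256)]

def pack(bits):
    """Pack 0/1 values into bytes, most significant bit first, zero padded."""
    padded = bits + [0] * (-len(bits) % 8)
    return [int("".join(map(str, padded[i:i + 8])), 2)
            for i in range(0, len(padded), 8)]

def correlate_sparse(pattern_bits, text_bits):
    """Correlation at every alignment using only 8 full shifts of the text."""
    # Keep only the non-zero pattern bytes, with their byte positions.
    sparse = [(i, w) for i, w in enumerate(pack(pattern_bits)) if w]
    result = [0] * len(text_bits)
    for s in range(8):
        shifted = pack(text_bits[s:])        # one full shift of the text
        for c in range(len(shifted)):        # byte offsets need no shifting
            total = 0
            for idx, w in sparse:
                if idx + c < len(shifted):
                    total += POPCOUNT[w & shifted[idx + c]]
            result[c * 8 + s] = total        # alignment k = c*b + s
    return result
```

The alignments are produced out of order (all those congruent to 0 modulo 8 first, then 1, and so on), which is why each bit shift of the text is needed only once.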

FFT Cross-correlation (MSMFT)
Uses the same steps as MSMBP, except that the cross-correlation step is implemented using FFTs (Fast Fourier Transforms).
This uses the property of the FFT that for numerical strings t and p, the cross-correlation
\[ (t \cdot p)[i] = \sum_{j=1}^{m} p_j \, t_{i+j-1}, \qquad 1 \leq i \leq n \]
can be calculated accurately and efficiently in O(n*log(m)) time (thanks to the FFTW team for the implementation used, see [5]).
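The FFT-based correlation can be illustrated with a small self-contained sketch. Several assumptions apply: it uses a plain recursive radix-2 FFT rather than FFTW, it transforms the whole text at once (so it runs in O(n log n) rather than the O(n*log(m)) obtained by processing the text in pattern-sized blocks), indices start at 0, and the text is treated as zero beyond its end. The function names are illustrative.

```python
import cmath

def fft(a, invert=False):
    """Recursive radix-2 FFT; len(a) must be a power of two."""
    n = len(a)
    if n == 1:
        return list(a)
    even = fft(a[0::2], invert)
    odd = fft(a[1::2], invert)
    sign = 1 if invert else -1
    out = [0j] * n
    for k in range(n // 2):
        w = cmath.exp(sign * 2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + w
        out[k + n // 2] = even[k] - w
    return out

def cross_correlate(text, pattern):
    """Return c[i] = sum_j pattern[j] * text[i + j] for every i in the text."""
    size = 1
    while size < len(text) + len(pattern):
        size *= 2
    ft = fft([complex(x) for x in text] + [0j] * (size - len(text)))
    # Reversing the pattern turns the correlation into a convolution.
    fp = fft([complex(x) for x in reversed(pattern)]
             + [0j] * (size - len(pattern)))
    conv = fft([x * y for x, y in zip(ft, fp)], invert=True)
    m = len(pattern)
    # The unnormalised inverse transform needs dividing by size.
    return [round(conv[i + m - 1].real / size) for i in range(len(text))]

counts = cross_correlate([1, 0, 1, 1, 0, 1], [1, 0, 1])
# counts == [2, 1, 1, 2, 0, 1]
```

Rounding to the nearest integer is safe here because the inputs are 0/1 arrays, so the exact result is a small integer and the floating point error is far below 0.5.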

Speed Comparisons (1)
Increasing Text size with proportional Pattern size (25%, 75%)
(P3 is the queue-based method of Ukkonen et al. [7], with complexity O(n*m*log(m)))

Speed Comparisons (2)
• Increasing Text size with fixed Pattern size (40 points)
• Constant Text size (960000 points) with increasing Pattern size

Accuracy Tests
• Match %: the percentage of the pattern that existed in the text
• Actual: the sizes of the actual best matches
• Run 1, 2, 3: the sizes of the matches found by the algorithm in each test
• Avr. Diff: the average percentage of the largest present match that was returned
The text used was 4000 points in both cases.
Only MSMBP was used for accuracy testing, as the two algorithms differ only in performance.

Conclusions
• We have presented two algorithms, MSMBP with O(nm) and MSMFT with O(n*log(m)) time complexity, both with O(n) space
• We have shown that these are efficient on large random point sets
• We have also shown that the accuracy is very high, even in situations theorised in the paper to have a lower probability of success
• We have shown experimentally speed-ups of several orders of magnitude in some cases without a significant decrease in accuracy
The authors would like to thank Manolis Christodoulakis for the original implementation of the MSMFT algorithm and the EPSRC for the funding of the second author.

Questions? (from xkcd.com)
