Deterministic Length Reduction: Fast Convolution in Sparse Data and Applications

Written by: Amihood Amir, Oren Kapah and Ely Porat

Motivation – Point Set Matching • Integer 1-D Point Set Matching: • T: (t1,t2,…,tn) • P: (p1,p2,…,pm) • Where ti and pi are integers. • Let N = tn and M = pm (the maximal indices). • Time: O(nm), O(N·log(M))
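A minimal sketch of the O(nm) approach to 1-D point set matching, assuming both point sets are given as sorted lists of distinct integers. The function name `match_shifts` is illustrative and does not come from the slides.

```python
def match_shifts(text, pattern):
    """Return every shift s such that pattern + s is a subset of text.

    Only shifts that align pattern[0] with some text point can match,
    so there are n candidates, each checked in O(m) expected time.
    """
    text_set = set(text)
    first = pattern[0]
    shifts = []
    for t in text:
        s = t - first  # candidate shift aligning pattern[0] with t
        if all(p + s in text_set for p in pattern):
            shifts.append(s)
    return shifts


print(match_shifts([1, 8, 10, 14], [0, 7]))  # [1]
```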

Motivation – Point Set Matching • 2-D Point Set Matching – Searching in Music: • T: (i1,j1),(i2,j2),…,(in,jn) • P: (i1,j1),(i2,j2),…,(im,jm) • Dimension Reduction: (i,j) → i·N + j
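The dimension reduction can be sketched as follows. The helper name `flatten` and the sample points are assumptions made for illustration. The width N is assumed to be large enough that a shifted pattern never wraps from one row into the next (for example, at least the text width plus the pattern width); the slide does not spell this condition out.

```python
def flatten(points, width):
    """Map each 2-D point (i, j) to the 1-D index i*width + j."""
    return sorted(i * width + j for i, j in points)


N = 5  # assumed row width, larger than every column index used below
text = [(0, 1), (1, 3), (2, 0), (2, 4)]
pattern = [(0, 0), (1, 2)]

print(flatten(text, N))     # [1, 8, 10, 14]
print(flatten(pattern, N))  # [0, 7]
```

Shifting the pattern by (0, 1) in 2-D corresponds to shifting its flattened form by 1, which maps {0, 7} onto {1, 8}, both of which are text points.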

Motivation – Generalized Case • The generalized case of these problems is the d-Dimensional sparse wildcard matching problem. • Problem Definition: Given a d-Dimensional text T with zeros and non-zeros, and a d-Dimensional pattern P with wildcards and non-zeros, find all the locations where P matches T. • Applications: d-Dimensional point set matching, searching in music, protein activity research, etc.

Length Reduction • Goal: Given two vectors V1 and V2, obtain two vectors V’1 and V’2 of size O(n1), where n1 is the number of non-zeros in V1, such that all non-zeros in V1 and in V2 appear as singletons in V’1 and V’2 respectively, while maintaining the distance property. • The Distance Property: If V’2[f(0)] is aligned with V’1[f(i)], then V’2[f(j)] will be aligned with V’1[f(i + j)]. • Using the reduced size vectors, matching can be done in time O(n1·log(n1)) using convolutions.

Example: Length Reduction • The vectors are given as sets of pairs: (index, value). • V1: (0, 5), (6, 2), (13, 3), (19, 1) • V2: (0, 2), (7, 3) • Length Reduction Function: mod(5) • V’1: (0, 5), (1, 2), (3, 3), (4, 1) • V’2: (0, 2), (2, 3)
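The example can be reproduced with a short sketch. The function name `reduce_length` and the bucket representation are illustrative choices, not taken from the slides.

```python
def reduce_length(vec, q):
    """Fold a sparse vector, given as (index, value) pairs, modulo q.

    Returns {location: [(original_index, value), ...]}; a location that
    holds exactly one pair is a singleton, otherwise a multiple.
    """
    buckets = {}
    for idx, val in vec:
        buckets.setdefault(idx % q, []).append((idx, val))
    return buckets


V1 = [(0, 5), (6, 2), (13, 3), (19, 1)]
V2 = [(0, 2), (7, 3)]
r1 = reduce_length(V1, 5)
r2 = reduce_length(V2, 5)
print(r1)  # {0: [(0, 5)], 1: [(6, 2)], 3: [(13, 3)], 4: [(19, 1)]}
print(r2)  # {0: [(0, 2)], 2: [(7, 3)]}
```

Every non-zero lands alone in its location here. The distance property is visible as well: index 6 maps to 1 and index 7 maps to 2, while index 6 + 7 = 13 maps to 3 = (1 + 2) mod 5.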

The Randomized Algorithm (Cole & Hariharan – STOC02) • Idea: Find a set of log(n) short vectors in which, with high probability, each non-zero in V appears as a singleton in at least one of the vectors. • Hash functions: (a·x mod(q)) mod(s), where q is a large prime number and s is O(n). • If s is c·n, then the probability of a non-zero appearing as a multiple is constant. • Using log(n) different hash functions reduces the failure probability exponentially.
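A rough sketch of the hashing idea, not the Cole and Hariharan algorithm itself. It draws several random multipliers a, hashes every index with (a·x mod q) mod s, and records which indices were a singleton at least once. The prime Q, the function name and the parameters are assumptions chosen for illustration.

```python
import random

Q = 2**31 - 1  # a Mersenne prime; assumed to exceed every index


def covered_by_singletons(indices, s, rounds, seed=0):
    """Return the indices that were a singleton under at least one hash."""
    rng = random.Random(seed)
    covered = set()
    for _ in range(rounds):
        a = rng.randrange(1, Q)  # random multiplier for this round
        buckets = {}
        for x in indices:
            buckets.setdefault((a * x % Q) % s, []).append(x)
        covered.update(b[0] for b in buckets.values() if len(b) == 1)
    return covered


points = [0, 6, 13, 19, 1000, 123456]
hit = covered_by_singletons(points, s=2 * len(points), rounds=3)
print(sorted(set(points) - hit))  # indices never isolated (often empty)
```

Because the multipliers are random, some indices can remain uncovered after any fixed number of rounds, which is the failure mode the deterministic construction removes.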

The Randomized Algorithm: Sources of Errors • Some non-zeros may appear only as multiples in every vector of the set. • The non-zero from the text which was aligned with the non-zero from the pattern may have come from a different index (false matches). • This algorithm was created for matching, but in convolution each non-zero should be calculated only once.

Deterministic Length Reduction • Our Goal: Find a set of log(n) hash functions which ensures that each non-zero appears as a singleton at least once. • Finding the hash functions is done in a preprocessing step based on V1. • The algorithm distinguishes between 2 cases: • N1 is polynomial in n1. • N1 is exponential in n1.

The Polynomial Case: N < n^c • Let q be a prime number of size O(n), and let mod(q) be the suggested hash function. • Let i, j be the indices of two non-zeros, and let dij be the distance between them. • Observation: If i and j are mapped into the same location, then q divides dij. • Observation: There are at most c prime numbers of size O(n) which divide dij. • Corollary: A non-zero can appear as a multiple under at most c·n prime numbers.
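The second observation follows from a short counting argument. The sketch below assumes, which the slide does not state explicitly, that every candidate prime is at least n. If k distinct such primes q_1, …, q_k all divide the same distance d_ij > 0, then so does their product:

```latex
\[
q_1 q_2 \cdots q_k \mid d_{ij}
\quad\Longrightarrow\quad
n^{k} \le q_1 q_2 \cdots q_k \le d_{ij} \le N < n^{c}
\quad\Longrightarrow\quad
k < c .
\]
```

A fixed non-zero has fewer than n other non-zeros to collide with, and each such pair rules out fewer than c primes, which gives the bound of c·n primes in the corollary.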

Choosing Prime Numbers • Test 2c·n prime numbers (of size O(n·log n)), and build the following table: • Each column represents a non-zero (n columns). • Each row represents a prime number (2c·n rows). • An entry is 1 if the non-zero appears as a singleton under that prime. • Reminder: Each non-zero can appear as a multiple at most c·n times. • Corollary: The table is at least half full with ones.

Choosing Prime Numbers: Cont. • Select a prime number which generates a row that is at least half full (for example, P2). • Delete the row and all the columns which had a 1 in the deleted row. • Repeat steps 1 and 2 until the whole table is deleted. • Selected Primes: P2, P4, … • Time: O(n^2)
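The table-based selection can be sketched as a greedy loop. This is an illustrative reading of the slides, with two stated deviations: the candidate primes are passed in by the caller, and each step picks the fullest remaining row rather than any row that is at least half full (the fullest row satisfies that condition whenever some row does). The name `select_primes` is hypothetical.

```python
def select_primes(indices, primes):
    """Greedily choose primes so each index is a singleton under one of them."""
    # table[q] = the set of indices that are singletons modulo q (the 1-entries)
    table = {}
    for q in primes:
        buckets = {}
        for x in indices:
            buckets.setdefault(x % q, []).append(x)
        table[q] = {b[0] for b in buckets.values() if len(b) == 1}

    remaining = set(indices)  # columns not yet deleted
    chosen = []
    while remaining:
        if not table:
            raise ValueError("candidate primes cannot isolate every index")
        best = max(table, key=lambda q: len(table[q] & remaining))
        gained = table.pop(best) & remaining  # delete the row
        if not gained:
            raise ValueError("candidate primes cannot isolate every index")
        chosen.append(best)
        remaining -= gained  # delete the covered columns
    return chosen


print(select_primes([0, 6, 13, 19], [5, 7, 11]))  # [5]
```

Since each selected row covers at least half of the remaining columns under the slide's assumptions, the number of remaining columns halves per step, which is where the log(n) bound on the number of primes comes from.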

The Exponential Case: N < 2^n • Idea: Reduce the length of the vector to polynomial and continue with the previous algorithm. • Any distance dij can be divided by at most n prime numbers. • There are at most n^2 different distances. • Corollary: There are at most n^3 prime numbers which generate multiples.

The Reduction Algorithm • Choose a prime number q of size O(n^4). • Create the reduced size vector using the mod(q) hash function. • Repeat steps 1 and 2 if a multiple was created. • Duplicate the obtained vector (create a vector of size 2q), to allow further reduction of the vector. • Time: O(n^4)
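A minimal sketch of the reduction step, under several assumptions that the slides leave open: the search starts at n^4 and scans upward, primality is tested by trial division, and the duplication step is read as placing a second copy of the reduced vector at offset q. The names `is_prime` and `reduce_to_polynomial` are illustrative.

```python
def is_prime(n):
    """Trial-division primality test; adequate for small illustrative inputs."""
    if n < 2:
        return False
    d = 2
    while d * d <= n:
        if n % d == 0:
            return False
        d += 1
    return True


def reduce_to_polynomial(indices):
    """Find a prime q >= n^4 under which all (distinct) indices stay distinct."""
    n = len(indices)
    q = max(2, n ** 4)
    # Retry with the next candidate whenever q is composite or creates a multiple.
    while not (is_prime(q) and len({x % q for x in indices}) == n):
        q += 1
    reduced = sorted(x % q for x in indices)
    doubled = reduced + [r + q for r in reduced]  # assumed reading of "size 2q"
    return q, reduced, doubled


q, reduced, doubled = reduce_to_polynomial([3, 2**40, 10**12 + 7, 2**50 + 1])
print(q, reduced)
```

After this step the largest index is below 2q, which is polynomial in n, so the polynomial-case procedure can take over.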

The Randomized Algorithm: Sources of Errors • Some non-zeros may appear only as multiples in every vector of the set. • The non-zero from the text which was aligned with the non-zero from the pattern may have come from a different index (false matches). • This algorithm was created for matching, but in convolution each non-zero should be calculated only once.

The Convolution Algorithm • For each prime number Pi: • Create the reduced size vectors V’1,i and V’2,i using the indices of the non-zeros and perform shift matching. • Create the reduced size vectors V’1,i and V’2,i using 1’s instead of the non-zeros and perform convolution. • Create the reduced size vectors V’1,i and V’2,i using the values of the non-zeros and perform convolution. • Zero the value of the non-zeros that appeared as singletons. • For all indices where shift matching was found: • Sum the results of the 1’s convolutions. • If the result is n2 (the number of non-zeros in V2), then sum the results of the values convolutions and report the result. • Time: O(n·log^3(n))

Example • V1: (0, 5), (5, 2), (13, 3), (20, 1) • V2: (0, 2), (8, 3) • Prime Numbers: 5, 7 • V’1,1: V’2,1: (5, 1, 9), (13, 1, 6) • V’1,2: V’2,2: (0, 1, 10), (5, 1, 4)
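The bookkeeping in this example can be traced with a brute-force sketch. It does not perform shift matching or FFT-based convolutions on the reduced vectors. Instead it enumerates pairs directly, to show how per-prime singleton contributions are summed and how the count n2 decides which shifts are reported. The function name `singleton_rounds` is an assumption, and reading the triples as (shift, 1's result, values result) is an interpretation consistent with the numbers on the slide.

```python
from collections import defaultdict


def singleton_rounds(v1, v2, primes):
    """Sum per-shift contributions of V1 non-zeros that are singletons per round."""
    remaining = dict(v1)       # index -> value; entries are removed once used
    count = defaultdict(int)   # summed results of the 1's convolutions
    value = defaultdict(int)   # summed results of the values convolutions
    for q in primes:
        buckets = defaultdict(list)
        for idx in remaining:
            buckets[idx % q].append(idx)
        singles = [b[0] for b in buckets.values() if len(b) == 1]
        for i in singles:
            for j, w in v2:
                count[i - j] += 1
                value[i - j] += remaining[i] * w
            del remaining[i]   # "zero" the non-zero so it is used only once
    n2 = len(v2)
    matches = {s: value[s] for s in count if count[s] == n2}
    return matches, remaining


V1 = [(0, 5), (5, 2), (13, 3), (20, 1)]
V2 = [(0, 2), (8, 3)]
print(singleton_rounds(V1, V2, [5, 7]))  # ({5: 13}, {})
```

Under prime 5 only index 13 is a singleton, contributing 9 at shift 5 and 6 at shift 13. Under prime 7 the remaining indices 0, 5 and 20 are all singletons, contributing 10 at shift 0 and 4 at shift 5, among others. Shift 5 is the only one whose count reaches n2 = 2, with value 9 + 4 = 13.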

Conclusions and Open Problems • A deterministic algorithm for length reduction and fast convolution was presented. • Preprocessing time: O(n^2) – Polynomial case, O(n^4) – Exponential case. • Running time: O(n·log^2(n)) • Open problems: • Can the preprocessing time be reduced? • Can the size of the vectors be reduced? • Can the number of vectors be reduced?

THE END Thank You!

Questions?
